SpiNNaker Book
ISBN: 978-1-68083-652-3
E-ISBN: 978-1-68083-653-0
DOI: 10.1561/9781680836530
Suggested citation: Steve Furber and Petruţ Bogdan (eds.). (2020). SpiNNaker – A Spiking Neural
Network Architecture. Boston–Delft: Now Publishers
The work will be available online open access and governed by the Creative Commons “Attribution-NonCommercial” License (CC BY-NC), according to https://creativecommons.org/licenses/by-nc/4.0/
Table of Contents

Preface  ix
Acknowledgements  xi
Glossary  xii

Chapter 1  Origins  1
By Steve Furber
1.1  From Ada to Alan – Early Thoughts on Brains and Computers  1
1.1.1  Ada Lovelace  2
1.1.2  Alan Turing  3
1.2  Reinventing Neural Networks – Early Thoughts on the Machine  4
1.2.1  Mighty ARMs from Little Acorns Grow  4
1.2.2  Realising Our Potential  5
1.2.3  Reinventing Neural Networks  5
1.3  The Architecture Comes Together  6
1.3.1  The State of the Neuromorphic Art  6
1.3.2  What Could We Bring to Neuromorphics?  7
1.3.3  Multicast Packet-switched AER  8
1.3.4  Optimise, Optimise…  8
1.3.5  Flexibility to Cope with Uncertainty  9
1.3.6  Big Memories  10
1.3.7  Ready to Go  10
1.4  A Scalable Hardware Architecture for Neural Simulation  11
1.4.1  Introduction  11

References  280
Index  306
About the Editors  318
Contributing Authors  319
Preface
The machine was designed to support two key features of biological neural systems: the use of spikes – impulses that are pure asynchronous events – as the primary mode of real-time communication, and the very high degree of connectivity found in the biological brain, where each neuron typically connects to many thousands of other neurons.
The research was ultimately configured to address two high-level questions:
- How can massively parallel computing resources accelerate our understanding of brain function?
- How can our growing understanding of brain function point the way to more efficient, parallel, fault-tolerant computation?
This book is the story of the first 20 years of this research programme, an
outcome of which is the world’s largest neuromorphic computing platform ulti-
mately incorporating a million processor cores, capable of modelling spiking neu-
ral networks of the scale of a mouse brain in biological real time. The mouse
brain is around a thousand times smaller than, but in some senses otherwise very
similar to, the human brain. So there is still a long way to go to deliver a real-
time model of a full human brain, but SpiNNaker can support sophisticated and
biologically-realistic models of substantial brain subsystems, albeit with the empha-
sis on the details of the network topology rather than the internal complexities of
the individual neurons.
Steve Furber, 12 July 2019
Acknowledgements
Many staff, students and collaborators have contributed to this story over the last
two decades; quite a number of them have contributed to the writing of this book,
though there are many more whose names do not appear directly here, but whose
contributions are nevertheless an integral part of the SpiNNaker story. Their con-
tributions are all gratefully acknowledged.
Funding Acknowledgements
The design and construction of the SpiNNaker machine was supported by the
UK Engineering and Physical Sciences Research Council (EPSRC) under grants
EP/D07908X/1 and EP/G015740/1, in collaboration with the universities of
Southampton, Cambridge and Sheffield and with industry partners ARM Ltd,
Silistix Ltd and Thales. Ongoing development of the software is supported by the
EU ICT Flagship Human Brain Project (FP7-604102, H2020 720270 and H2020
785907), in collaboration with many university and industry partners across the
EU and beyond, and exploration of the capabilities of the machine was supported
by the European Research Council under the European Union’s Seventh Frame-
work Programme (FP7/2007-2013)/ERC grant agreement 320689. The support
of these funding agencies is gratefully acknowledged.
Glossary
A
ABB - Adaptive Body Biasing - a technique used on a CMOS chip to reduce the
impact of manufacturing variability on the performance of the chip. 277, 279
AD - Absolute Deviation - the deviation of the mean receptive field from its ideal
location in a topographic map. 240, 241, 243, 244
ADPLL - All Digital Phase-Locked Loop - a digital circuit that controls the fre-
quency and phase of one signal to match the frequency and phase of a refer-
ence signal. 273
AER - Address Event Representation - a mechanism for encoding spike events from
an SNN as a stream of numbers or ‘addresses’, where each address corresponds
to a spike from a particular neuron. 6–10, 19, 22, 55, 107, 164, 166
AGI - Artificial General Intelligence - AI approaching human-like general-purpose
capabilities. 128
AHB - AMBA High-Performance Bus - a multi-master bus protocol introduced in
AMBA version 2. A simple transaction on the AHB consists of an address
phase and a subsequent data phase. 27, 28, 34, 39, 275
AI - Artificial Intelligence - a term applied broadly to machine learning systems that
display some specific human-like capability, such as the ability to play chess
or to recognise cats in an image. 128, 155, 156, 160, 161, 163, 179, 203,
263
AN - Auditory Nerve - a bundle of axons representing the output of the cochlea.
139–142
ARM - Acorn RISC Machine - although this expansion is now deprecated, the term
is used to describe the company ARM Ltd or the range of microprocessor
architectures that they design and that are widely used in mobile phones and
many other computer systems, including SpiNNaker. 4, 5, 7, 9, 10, 14, 16,
18, 19, 25, 27, 28, 30, 34, 39, 41, 43, 44, 48, 49, 51, 54, 55, 57, 67, 68, 73,
76, 80, 120, 127, 172, 176, 207, 216, 217, 263, 266, 270, 272, 275–277
B
BG - Basal Ganglia - a brain region responsible for action selection, among other functions.
128, 132, 143–146
C
ConvNet - Convolutional Neural Network - a form of ANN used for image classification. 161–166, 169, 196, 197, 201, 264
CPU - Central Processing Unit - a hardware component responsible for basic arith-
metic, logic, control and I/O. 88, 127, 141, 142, 217
D
DDR - Double Data Rate - a form of computer memory where data is delivered
on both the rising and the falling edges of the clock. 43
DfT - Design for Test - a systematic approach to microchip design that takes the
testability of the logic into account throughout the design process. 267
DLL - Delay-Locked Loop - a mechanism that adjusts the relative phases of different
clock signals. 54
DMA - Direct Memory Access - a hardware mechanism for copying blocks of mem-
ory. 10, 25, 27–30, 39, 42, 46, 48, 51, 81, 107, 108, 112, 114, 116–118,
123, 213, 215, 267, 272
DNN - Deep Neural Network - an ANN with many layers of neurons. 161, 200,
263, 275, 276
E
EA - Evolutionary Algorithm - a computer algorithm that optimises parameters
using an approach more-or-less similar to biological evolution. 255–257,
260, 261
EoP - End-of-Packet - a special marker used to indicate the end of a sequentially-
transmitted packet. 35–38
EPSRC - The UK Engineering and Physical Sciences Research Council - the UK’s
research funding body for engineering and the physical sciences. xii, 5, 16
ES - Evolutionary Strategies - a class of optimisation methods. 260
F
FIFO - First In First Out - a form of queue where outputs emerge in the order that
they were input. 65, 267
FIQ - Fast Interrupt Request - a name used by ARM for an input signal to a micro-
processor that interrupts the processor, at a higher priority than IRQ. 30, 81,
108, 112, 118
Flash - Flash Memory - solid-state, non-volatile data storage that can be electrically
erased and reprogrammed. 55
flit - flow control unit - a link-level atomic unit forming part of a network packet or stream. 36–38
FPGA - Field-Programmable Gate Array - a microchip that has logic that can be
configured to perform an arbitrary function. 55, 57–60, 62, 63, 65, 66, 74,
178
FPU - Floating Point Unit - a hardware unit in a microprocessor that can handle
operations on floating-point numbers. 172, 270
H
HBP - Human Brain Project - the European Commission’s 10-year Flagship project
to advance the use of computer technology in brain research, in which
SpiNNaker plays a role. 76, 103, 207, 219
I
IHC - Inner Hair Cell - a motion-sensitive cell located in the cochlea. 138–141
ISI - Inter-spike interval - the time between two consecutive spikes in an SNN. 216
J
JTAG - Joint Test Action Group - a standard for testing microchip I/O and internal
functions. 43, 59
L
LCD - Liquid-Crystal Display - a form of flat-panel display that exploits the ability of liquid crystals to modulate light passing through them. 68
LED - Light-Emitting Diode - a semiconductor device that emits light when a cur-
rent flows through it. 55, 59, 72
LIF - Leaky Integrate and Fire - a point-neuron model that accumulates inputs
until a threshold is reached, whereupon it emits a spike and resets itself; the
accumulation is ‘leaky’, so it decays over time in the absence of further input.
106, 118, 120, 121, 149, 170, 171, 178, 180–185, 187–193, 195–199, 201,
203, 204, 214, 223, 225, 226, 235, 249
LUT - Look-up table - a table containing pre-computed values that would otherwise
be expensive to compute at the time of use. 274
M
MAC - Media Access Control - the low-level mechanism used to allow computers to connect to a network such as the internet (see also: MAC - Multiply-Accumulate). 43, 54, 55, 68
MNIST - Modified NIST - data set of images containing handwritten digits. 134,
170–172, 174, 176, 177, 197, 198, 238, 247, 258, 259, 261
N
NSP - Noisy Softplus - an activation function designed to closely simulate the firing
activity of simple spiking neurons. 178, 180, 181, 189–197, 199–201, 203
P
PC - Personal Computer - a computer designed for use by a single user, often iden-
tified with the IBM PC standard. 15, 129
PCB - Printed Circuit Board - a multilayer board, usually made of fibreglass, with
copper interconnect patterns formed by a printing and etching process, used
to mount and connect multiple microchips and other electronic components.
12, 36, 43, 49, 56, 58
PHY - Physical Layer Device - connects a link layer device (MAC) to a physical
medium such as a copper cable. 43, 267
PLL - Phase-Locked Loop - a control system that generates an output signal whose
phase is related to the phase of an input signal. A simple implementation
consists of a variable frequency oscillator and a phase detector in a feedback
loop. 39, 279
POST - Power-On Self-Test - a system whereby a computer or device can check its
own functionality every time it is switched on. 60
PSP - Post-Synaptic Potential - the effect over time of an incoming spike on the
membrane potential of the post-synaptic neuron. 150, 179
PVT - Process, Voltage and Temperature - the manufacturing and operational factors
that affect the performance of a CMOS circuit. 277–279
R
RAM - Random-Access Memory - computer memory where the contents can be read
and written in any order. 8, 10, 14, 19, 25–27, 31, 34, 44, 46, 48, 49, 51,
147
ReLU - Rectified Linear Unit - a very popular activation function in the context of
ANNs. 181, 189, 195–201, 203, 276
RMSE - Root Mean Squared Error - a widely used measure for the accuracy of the
fit of an equation to a set of data. 247, 252
RNN - Recurrent Neural Network - a neural network where information flow is not
simply unidirectional. 162
ROM - Read-Only Memory - a random access memory whose contents are fixed.
14, 26, 27, 29, 39, 40, 43, 54, 60, 67, 81
RTZ - Return-To-Zero - a signalling convention in which a wire is driven high to convey a data item and returns to the zero level between successive items. 36
S
SCAMP - SpiNNaker Control And Monitor Program - software written to allow one
of the cores to operate as a monitor processor through which the chip can be
controlled. 80, 82–85, 95, 96
SDP - SpiNNaker Datagram Protocol - communication protocol employed on the
SpiNNaker communication fabric. 80, 81, 95, 96
SDRAM - Synchronous Dynamic Random-Access Memory - a form of computer
memory with high density and performance. 10–16, 19, 24, 27–30, 38–40,
43–46, 48, 51, 54, 60, 61, 79, 80, 85, 88–96, 100, 101, 106–108, 112–114,
118, 122, 123, 139, 165, 214, 265, 266, 272
SerDes - Serialiser/Deserialiser - an interface that converts parallel data to serial data
and vice versa for high-speed inter-chip communication. 266
SNN - Spiking Neural Network - a neural network where communication between
neurons is in the form of asynchronous impulses, or ‘spikes’, where informa-
tion is conveyed only in the timing of the spikes. 77, 102–105, 107, 110,
116, 117, 119, 122, 127, 128, 136, 138, 149–158, 161, 170, 171, 176–181,
188, 189, 191, 192, 195–197, 199–204, 206, 207, 231, 250, 255, 257–262,
264, 268
SNr - substantia nigra pars reticulata - the output structure of the basal ganglia.
144, 145
SoC - System-on-Chip - a microchip that incorporates most of the required system
functions, usually including one or more microprocessor cores, memories,
on-chip buses or NoCs, specialized interfaces, etc. 12, 15, 16, 265
Softplus - an activation function in the context of ANNs. 183, 195, 197, 199–203
Spalloc - SpiNNaker machine partitioning and allocation server - the SpiNNaker
job submission system that allocates a subset of the machine to individual
user jobs. 74, 259
SPI - Serial Peripheral Interface - a synchronous serial communication interface
specification used mostly in embedded systems. 43, 66
SpiN1API - SpiNNaker1 API - the SpiNNaker1 set of low-level, on-chip libraries
implementing its event-based operating system. 81, 84, 105, 107, 108
SpiNNaker - Spiking Neural Network Architecture - a many-core neuromorphic
computing platform. 1, 4, 8–10, 16–20, 24, 27, 30, 31, 36, 38–45, 47,
48, 50–69, 71–80, 82, 84–89, 95–99, 101, 103–108, 110, 116, 118, 120,
122, 127–130, 133, 138–147, 149, 152, 163–172, 176–178, 203, 205, 207,
210, 213–217, 220, 222, 227, 229, 233, 234, 236–238, 241, 248, 250, 251,
256–262, 264–266, 268–270, 272–274, 279
T
TM - Tympanic Membrane - the membrane separating the outer and middle ear.
138, 139
U
UDP - User Datagram Protocol - internet communication protocol. 80–82, 97
V
VIC - Vectored Interrupt Controller - a device that is used to combine several sources
of interrupt onto one or more CPU lines, while allowing priority levels to be
assigned to its interrupt outputs. 49
VLSI - Very Large Scale Integration - microchip technology whereby many transis-
tors can be ‘printed’ on a single chip. 5, 6
W
WTA - Winner-Takes-All - a neural mechanism whereby the most stimulated neuron in a group completely suppresses the activity of the other neurons in the group. 151, 250, 254
Z
ZIF - Zero-Insertion-Force - a socket for a microchip that allows easy insertion and
removal of the microchip. 54
DOI: 10.1561/9781680836530.ch1
Chapter 1
Origins
By Steve Furber
I have my hopes, and very distinct ones too, of one day getting cerebral phenomena such that
I can put them into mathematical equations – in short, a law or laws for the mutual actions of
the molecules of brain … I hope to bequeath to the generations a calculus of the nervous system.
— Ada Lovelace
The Spiking Neural Network Architecture (SpiNNaker) project has as its aim the
design and construction of a massively parallel computer to support the modelling
of large-scale systems of spiking neural networks in biological real time. The objec-
tives of this research are two fold: firstly, to build a machine that can contribute
to progress towards the scientific Grand Challenge of understanding the principles
underpinning information processing in the brain; and secondly, to use what we
do understand about the brain to help build better computers.
The brain remains one of the great frontiers of science – how does this organ upon
which we all so critically depend do its job? We know a great deal about the low-level
details of neurons and synapses, glial cells and mitochondria, and we can use brain
imaging machines to see how activity moves around in the brain in response to
external stimuli. But all of the interesting information processing takes place at
intermediate scales reachable neither by bottom-up neuroscience nor by top-down
brain imaging. The only tools available at these intermediate levels are those based
on computer modelling, where we can test hypotheses about fundamental questions
such as how the brain learns and stores new information, and how what we see with our eyes is represented in spatio-temporal patterns of spikes within our brains.
Interest in the brain is not new, of course. It took some time to determine that
our central control system was based in our head, not in our heart, and even more
time to understand the neuronal basis of this control [204]. But even before we
achieved this basic level of understanding, there was speculation about what might
be happening, and here we look at just two characters in this long story of working
towards an understanding of how we operate as an allegedly intelligent species.
Sadly, Ada never got to deliver on this ambition. She lived before Ramón y
Cajal’s revelations of the details of the neuron, but even today, with all the detailed
knowledge gleaned in the interim years, her agenda would be considered highly
ambitious!
In his 1950 paper, Turing proposed an ‘imitation game’, which subsequent generations know simply as the Turing test for human-like artificial intelligence.
Turing reckoned that all a computer would need to pass his test, compared with the Manchester Baby machine, was more memory: about a gigabyte (a billion bytes) should be enough. (Baby had 128 bytes of memory and could execute some
700 instructions per second.) He thought that by the turn of the 21st century
computers might have that much memory.
Indeed, by the turn of the 21st century a typical desktop computer would
have about a gigabyte of memory, and it would be a million times more power-
ful than the Baby, but it would not pass Turing’s test. This would have surprised
Turing.
Why has human-like artificial intelligence proved so much harder than Turing,
and many others since him, predicted? Perhaps it is because we still do not under-
stand natural intelligence, so we do not know exactly what it is we are trying to
reproduce in our machines. Natural intelligence is the product of the brain.
This line of thinking drew us inexorably towards neural networks as the direction
we should explore to seek answers to our questions about how the brain does some
things so much better than our machines, however fast they might be.
We (the Advanced Processor Technologies group) were a bunch of computer
architects and engineers with a solid research background in unconventional (espe-
cially asynchronous) computing, taking novel designs all the way down to very
demanding silicon implementations. What could we bring to the neural network
party?
So, hard logic had drawn us into the neural network game. This is not, of course,
virgin research territory; many had looked into VLSI implementations of circuits
based upon our (limited) knowledge of how the brain works. The first steps were to
look into what had gone before, what was known about the functions of neurons
and synapses, and what were the main problems that had arisen in previous work.
Then, the goal was to synthesise something from our basic research strengths that
stood a reasonable chance of yielding a substantive contribution to the field, with
sufficient differentiation from others’ work.
topologies? AER suggested a starting point but, as noted above, bus-based AER has
limited scalability.
1. We subsequently added a mechanism to re-insert the dropped packets, re-establishing reliable delivery under
most circumstances.
1.3.7 Ready to Go
This, then, was the thinking that went into defining the architecture of the
SpiNNaker system – a processing chip with as many small ARM cores as would
fit, each with local code and data memory, and a shared SDRAM chip to hold the
large synaptic data structures. Each processor subsystem would have a spike packet
transmitter and receiver and a Direct Memory Access (DMA) engine to transfer
synaptic data between the SDRAM and the local data RAM, thereby hiding the
variable SDRAM latency. Each chip would have a multicast AER packet router
using TCAM associative lookup and links to convey packets to and from neigh-
bouring chips.
All that was left to do was find funding to get the chip designed and built, then
build a machine and, of course, write some software!
The following section reproduces the content of a note written in May 2005
that outlines the key architectural concepts that were the starting point for the
SpiNNaker development. Some details changed in the course of that development,
so this should be read as a historical note, not as an authoritative definition of the
final architecture! Although some of the details would change during the imple-
mentation phase that followed, the key concepts are already in place in this note.
1.4.1 Introduction
Over the last couple of years, I have been struggling with several aspects of
the proposed neural hardware system. Issues that have come to the fore are
the importance of modelling axonal delays, the importance of the sparse con-
nectivity of biological neurons, the cost issues relating to the use of very large
on-chip memories, and the need to keep as many decisions open for as long as
possible. I have now found a way to resolve all of these issues at once through
a radical change in the architecture proposal: push the memory off chip into
a standard SDRAM and implement the on-chip neural functions through
parallel programmable processors of a fairly conventional nature.
This approach yields a highly programmable system of much greater power
than that previously proposed and a safer (more familiar) development path.
It also points directly towards a development route that can be used to prove
the proposed plan using technology already to hand.
2. We did!
In the first instance, I see this as a system that is well suited to support-
ing research into complex neuro-dynamics, and I think this will be the pri-
mary market until/unless there is a breakthrough in our understanding. This
system is well positioned both to expedite that breakthrough and to exploit
the consequences of it.
It is also possible that there might be a market for this system as a general-
purpose low-cost high-performance computer system. It has very high integer
performance and could be well suited to code-cracking, database search and
similar applications that do not need floating-point hardware. However, this
will require further investigation.
Potential products include neural simulation software, chips, boards and
full-blown systems. We could also sell time on systems.
[Figure: the proposed chip – a monitor processor and multiple fascicle processors³, each with packet receive (Rx i/f) and transmit (Tx i/f) interfaces, connected through an arbiter to the router.]
3. At this time, I had seen ‘fascicle’ used to describe a bundle of neuron fibres and thought it was widely
used this way. I was wrong! We now use ‘population’ to describe a bunch of neurons with common
inputs and outputs.
It is feasible to use the Excalibur parts we obtained from Altera to prove the
ideas. These chips include a 200 MHz ARM9 with caches and an SDRAM
interface, and an area of programmable logic. We can prototype the neural
algorithms on the ARM9 and prototype the router and inter-chip commu-
nications in the programmable logic. We have 2 development systems to get
started and 10 chips that could be used to build a 100,000 neuron engine.
Such a system would be an asset if we wished to attract venture capital fund-
ing to support the SoC development and/or production.
Alternatively, I could put in a large EPSRC proposal to support the SoC
design.
Timescales (rough estimates):
This will yield prototype silicon. Moving this into production will incur a
large mask charge ($1.5 million) and, at this stage, this will require a part-
nership and/or investment.
1.5 Summary
The above May 2005 note, reproduced in Section 1.4, outlines all of the key con-
cepts at the start of the development of the detailed design of the SpiNNaker
chip. Funding was successfully sought from EPSRC, and the design work started
in earnest in October 2006. Many of the estimates in the note turned out to be
horribly optimistic – for example, the chip design took more like 5 years and 40
person-years rather than the 3 years and 4 person-years (4 years and 6 person-
years including the prototype, which was never built) in the note, but the choice
of a 130-nm Complementary Metal Oxide Semiconductor (CMOS) technology
kept the mask cost to $250 k rather than $1.5 M, so swings and roundabouts!
DOI: 10.1561/9781680836530.ch2
Chapter 2
The SpiNNaker Chip
Architecture should speak of its time and place, but yearn for timelessness.
— Frank Gehry
The central component in the SpiNNaker system is the SpiNNaker chip [186],
and the central focus of the SpiNNaker chip is scalability. The key concepts were
described in the previous chapter, but now these concepts must be realised in prac-
tice. This realisation, which took 40 person-years of design effort and 5 years of
elapsed time, is the subject of this chapter.
2.1 Introduction
Biological neurons are fairly slow at processing. The processes they perform are
quite complex and the appropriate abstraction – to separate the computing from
the process of simply living – is unclear, although the models are becoming more
sophisticated annually. There are also a lot of neurons in a mammalian brain and,
despite dense connectivity, most operate largely independently of each other.
Electronic computing devices are very much faster than biology at computing
simple functions. This means that one electronic device can, in principle, model
numerous biological neurons and still provide real-time performance. There are
many possible levels at which a model can be built, ranging from direct electronic
models of the neurons (which can process many times faster than biology) [114]
to massive computers that trawl through enormous data sets at great speed [199];
each approach has its merits and demerits.
SpiNNaker [65] was designed to function somewhere in the middle of this spec-
trum. To provide the flexibility to experiment with neuron models, it was deter-
mined that these should be implemented in software. Running software carries a
significant overhead in both performance and power consumption: the former can
be addressed by using a large array of processors, since the problem is amenable to a
massively parallel-processing solution; the latter concern was tackled by employing
power-efficient rather than fast microprocessors.
2.2 Architecture
2.2.1 An Overview
Imagine a large array of microprocessors where each processor simulates the bio-
logical computing of a number of neurons. In imagination, the array is almost
infinitely scalable, since the neurons themselves are largely independent. There is
then a choice as to how many neurons are mapped onto each processor, which is
governed by speed – both of the processors themselves and the desired speed of
simulation – and the memory capacity of each processor.
Outside the world of imagination, there are other pragmatic limits. Building
a customised microprocessor, specialised for neuron modelling, is impractically
expensive, not (just) from the hardware development view but from the software
support: an established architecture is much to be preferred. Then, there is the con-
sideration of powering and cooling a machine of any size. Finally, if a custom logic
is to be made, the design and verification effort must not be impractically high.
To provide significant (and convenient) computing power without excessive elec-
trical power dissipation suggests a 32-bit architecture. A 32-bit integer can provide
2³² or about four billion unique codes, which is (very) approximately a match for
the number of neurons in a mammalian brain. (A human has about 86 billion
neurons; a domestic cat has around three-quarters of a billion [19, 90].) As a back
of the envelope initial figure, ‘one billion neurons’ seemed a credible target. This
could be spread over a million processors – each simulating 1,000 neurons – with
the processors grouped into chips, each chip being a multicore Application-Specific
Integrated Circuit (ASIC).
The chosen processor was an ARM968 [6]. This ARM9 device was already
mature at the time of selection but still gave good power/performance efficiency
and, crucially, was kindly licensed, on a non-commercial basis, by ARM Ltd. For
Architecture 19
manufacture, a 130 nm process was selected: again not state-of-the-art even at the
time of design but cost-effective and without too many new process issues for the
(necessarily) limited design team. With this process and this processor, a target
operating clock frequency of 200 MHz seemed reasonable and static RAM macros
that supported this target were available. Calculation suggested that this could sup-
port the target number of neurons in real time, with some flexibility to cope with a
varying load. Energy efficiency is important not so much on an individual processor
basis but when multiplied by a million processors in the system or, indeed, twenty
or so in the same package; the ARM968 is a power-efficient microprocessor when
executing and is able to ‘sleep’ – consuming almost no dynamic power – when there
is nothing to do, which may be expected frequently in a real-time system.
The amount of RAM needed to balance this model was also reckoned. In prac-
tice, for the intended application, the RAM was infeasibly large; however, much
of this is relatively infrequently used, so the model was subdivided in a memory
hierarchy, with a fast SRAM and a much larger but slower SDRAM component. A
local data space of 64 KByte plus 32 KByte of code space (small, since the processors
are running dedicated, embedded application code) was allocated. This needs to be
backed up by tables up to a few megabytes in size. Available (low power) technology
meant a single cost-effective die supplied 128 MByte but the relatively low demands
expected meant that one die could reasonably be shared amongst several processors.
With area estimates for the processor subsystems – including their SRAM –
and a feasible ASIC die size – it appeared that about 20 processors on each ASIC,
together with a single, shared SDRAM, would provide an appropriately balanced
system. This implied that 50,000 ASICs would be needed for a 1,000,000 processor
machine – a number which would (attractively) fit in a 16-bit binary index.
Neurons alone do not compute; there needs to be interconnection and, indeed,
there is overwhelming evidence that it is the patterns and strengths of connec-
tions which programme biological computers [115]. The problem for the system
architect is that, in biology, the output from any one neuron may be routed to a
unique set of hundreds, thousands and even tens of thousands of destination neu-
rons (Figure 2.1). This far exceeds typical computer communications uses, other
than with a broadcast mechanism; here, with a million possible sources, broadcast
is not practical, either from the communications bandwidth needed or the power
requirement for inter-chip communications.
It is therefore the specialist communications network, designed to support the
specific spiking neural network applications, that differentiates SpiNNaker from
most other multiprocessor systems.
SpiNNaker communicates with short packets. In neural operation, each packet
represents a particular neuron firing. A packet is identified using AER [152]; it is
tagged only with its originator’s identifier. (With 1 billion neurons, this requires at
least 30 bits; a 32-bit field is allocated for convenience.) Packets are then multicast
to their destinations with most of the routeing and duplication being done in (and
by) the network itself.
The first important point in the design is that the aggregate bandwidth of the run-
ning system – where packets are duplicated in flight but only as needed to reach all
their destinations – is not infeasibly high. Just like the processor – neuron relation-
ship, a single network link can carry many, multiplexed spike links as the electronic
connections are much faster than the biological axons. Indeed, practically, the time
to deliver a spike is typically negligible compared to biological transmission. Thus,
the actual network topology is not particularly important although, since neural
systems themselves (and their traffic) are fairly homogeneous, some form of mesh
is suitable – and amenable to the construction of scalable systems.
The chosen topology for the SpiNNaker network is a two-dimensional mesh.
The mesh is triangular (Figure 2.2) rather than Cartesian, with each ASIC con-
nected to six neighbours; this provides more potential bandwidth over the given
links and was also intended as a provision for automatically routeing around faulty
connections. (In practice, it has been observed that this latter feature was over-
cautious and is little used.) The edges of the mesh can be closed to form a torus
that reduces the longest paths; the maximum expected system – 2¹⁶ chips or a
256 × 256 grid – would therefore have a longest path of 128 hops although most
would be much shorter.
Although there are other packet delivery mechanisms, the novelty and speciali-
sation in SpiNNaker is in handling multicast packets. These are optimised to model
biological neuron interconnection, where each neuron has a single output that
feeds its own set of targets. Biological destinations are not entirely random; there is
some structure and neurons tend to be clustered within populations with an output
feeding some subset of the neurons in several populations. This structure can be
abstracted as a tree (Figure 2.3).
[Figure 2.2: the triangular mesh – a router linked to its six neighbouring routers.]
For simulation, it is logical to map neurons within a population to the same pro-
cessor(s). This means that a single packet delivered to a processor can be multicast
to the neurons – the last branching of the tree – by software. The populations them-
selves need to be distributed across the mesh network. In this manner, it is likely
that multicast packets can share part of their journey, effectively extending the tree
structure to multiple (series) branches (Figure 2.4). This also reduces the network
traffic as a packet is often not cloned until some way towards its destination.
The routeing from chip to chip is managed by a custom router on each
ASIC. Logically speaking, each router checks the (neuron) source ID – the only
information in the packet – and looks up a set of outputs, potentially including
both chip-to-chip links and processor systems on that chip itself. The packet is
then duplicated to all the specified outputs.
[Figure 2.3: the neuron connection tree – network routeing along the axon, followed by software routeing through synapses to the destination neurons.]
With a 32-bit neuron AER, each router is potentially holding 4 billion words
of routeing look-up table: this is impractical. However, the logical table can be
compressed considerably in practice:
These properties are exploited to shrink the routeing tables to a manageable size.
This makes the table sparse, so rather than a simple array it is stored as an associative
structure using Content-Addressable Memory (CAM) to identify IDs of interest. If
an ID is not recognised, a topological assumption is made about the interconnection
mesh and the packet is simply forwarded to the opposite link from which it arrived:
this is referred to as default routeing (Figure 2.5). Default routeing reduces the
number of table entries to those corresponding to packets which are both expected
and need some action: changing direction in the mesh, being duplicated or arriving
at their destination – or any combination of these.
Lastly, providing the neurons in a given population are identified sensibly – i.e.,
with similar IDs – they can usually be routed with a single table entry. This is
Figure 2.4. A single neuron tree mapped onto a SpiNNaker chip network. The source
neuron is on the shaded chip. ‘R’ indicates a router table entry; other involved routers
use default routeing. Solid dots are processors and spikes are typically duplicated to
many neurons in each by software.
[Figure 2.6: a chip's memory organisation – the router, ARM processors with private instruction and data SRAM, and the shared SDRAM.]
because the CAM contains a binary mask that specifies which bits in each key
are significant to that router. For example, if a population contains around 2,000
neurons, it can have a 21-bit ID with the remaining 11 bits determining the par-
ticular neuron. One routeing table entry can provide for all 2,000 neurons. For
implementation, the number of table entries is arbitrary: 1,024 was chosen for
SpiNNaker.
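To make the arithmetic concrete, a population of up to 2,048 neurons fits in the 11 low-order bits of the key (2¹¹ = 2,048), leaving 21 bits of population ID. The sketch below shows how such a key and mask pair might be constructed; the helper names and exact field split are illustrative, not taken from the SpiNNaker tool chain.

    #include <stdint.h>

    #define NEURON_BITS 11   /* low bits select a neuron within the population */

    /* Illustrative 32-bit AER key: 21-bit population ID, 11-bit neuron ID. */
    uint32_t make_key(uint32_t population_id, uint32_t neuron_id)
    {
        return (population_id << NEURON_BITS) | neuron_id;
    }

    /* One routeing entry covers the whole population: the mask marks the
     * top 21 bits as significant and the low 11 bits as don't-care. */
    uint32_t make_mask(void)
    {
        return ~((1u << NEURON_BITS) - 1);   /* 0xFFFFF800 */
    }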
The final stage of neuron packet routeing takes place after delivery to a proces-
sor subsystem. Here a spike is multicast to a subset of the local neurons; however
there is now more information needed. Each connection has some associated
information: at minimum, a synaptic weight, a transmission delay and the identity of the target neuron.
The details of these variables are not important here. What does matter is that
there is one entry per synapse. Even with a very rough calculation – say 1,000 neu-
rons each with 1,000 synapses – it becomes clear that several megabytes of storage
are required for each processor subsystem. This is the data that reside in the (shared)
SDRAM and is fetched on demand.
Each processor has its fast, private memory and shared access to the SDRAM
(Figure 2.6). Although it can be used for communications, the main intended pur-
pose of the SDRAM is to act as a backing store for the large, relatively infrequently
accessed data tables. For this purpose, the SDRAM space is partitioned in software
with each processor allocated space according to its needs. For many applications,
data are simply copied in as needed although synaptic weights could be modified
and written back if the network is adaptive.
[Figure 2.7: a processor subsystem – the ARM968 with 32 KB ITCM and 64 KB DTCM, peripherals on an AHB, and an AXI port onto the System NoC leading to the SDRAM.]
The act of moving data around the memory map is simple but tedious and
inefficient for software. Each processor subsystem therefore contains a memory-
to-memory DMA Controller (DMAC) that can download these structures in the
background. The unit is also capable of uploading data if the synaptic weights
change, which will occur if the neural network is learning. The impact of trans-
fers on the processor is minimal since the local SRAM is bank-interleaved, always
assuming the processor has other work to do.
The impact of DMA transfers on the processing should also be small as the fetch-
ing of data is a background task. To decouple the process further, the DMAC
has a command buffer, allowing a request to be queued while its predecessor is
in progress; DMA transfers can therefore run continuously (if necessary) with con-
siderable leeway in servicing the completion interrupts.
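In the production software, this engine is driven through the SpiN1API library (see the Glossary). A minimal sketch of its use, assuming the documented spin1_dma_transfer and callback interface, might look like the following; treat the exact signatures as indicative rather than authoritative.

    #include "spin1_api.h"

    #define ROW_WORDS 256
    static uint32_t row_buffer[ROW_WORDS];   /* destination in local DTCM */

    /* Queue a background SDRAM -> DTCM copy; the processor carries on
     * with other work until the completion callback fires. */
    void fetch_synaptic_row(void *sdram_row)
    {
        spin1_dma_transfer(0, sdram_row, row_buffer,
                           DMA_READ, ROW_WORDS * sizeof(uint32_t));
    }

    void dma_done(uint transfer_id, uint tag)
    {
        /* The synaptic row is now resident in fast local memory. */
    }

    void setup(void)
    {
        spin1_callback_on(DMA_TRANSFER_DONE, dma_done, 0);
    }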
Other than the ARM968, its RAM and the DMAC, there is very little else
within a subsystem. The only peripherals are timers, a communications interface,
which allows the processor to send and receive packets and an interrupt controller
(Figure 2.7).
The ASIC was planned to contain about 20 such processor systems. All the
processor subsystems used identical layout for development convenience, meaning
the timing closure was only necessary once; on the chosen manufacturing pro-
cess, it is permissible to rotate, as well as reflect, the hardened layout macrocells.
When possible floor plans were examined and the feasible chip area was taken
into consideration, it became apparent that 18 processor – memory combinations,
together with the router, fitted better. As the specific number was not critical, this
was adopted (Figure 2.8). This can be post-rationalised into 16 neuron processors,
a monitor processor to manage the chip as a computer component plus a spare, but
the constraint was primarily physical. The processor count does have some impact
on the router since, when multicasting packets, it is necessary to specify whether
each of the 24 destinations – 6 chip-to-chip connections plus 18 local processors –
is used; 24 bits is a reasonably convenient size to pack into the RAM tables, so this
is a bonus.
There are also a few shared resources on each chip, to facilitate operation
as a computer component. These provide features such as the clock generators;
interrupt and watchdog reset control and communications; multiprocessor inter-
lock support; a small, shared SRAM for inter-processor messaging; an Ethernet
interface (as a host link) and, inevitably, some general purpose I/O bits for opera-
tions such as run-time configuration and status indication. A boot ROM containing
some preliminary self-test and configuration software completes the set of shared
resources. The details of some of these components are discussed in the following
sections.
[Figure: the DMAC and bus bridge – buffered paths between the ARM968's instruction and data memories and the secondary memory, reached via the network.]
It was also anticipated that in an expanded system, the soft error rate in the
aggregate SDRAM would be non-negligible. The DMAC therefore includes
a programmable Cyclic Redundancy Check (CRC) generator/checker that can
append a CRC word when a transfer is written to SDRAM or verify a CRC when
it is read.
Also contained within the DMAC, although not a DMA function, is a bus bridge
that allows the ARM direct access to the SDRAM, although this form of access is
not particularly efficient. A write buffering option is available to reduce the latency
if desired.
The only peripheral of particular note is the communications controller. This
provides bidirectional on-chip communication with the router. The input inter-
connection is blocking, so it is important to read arriving packets with low latency;
the ARM’s Fast Interrupt Request (FIQ) is typically used for this. Failure to read
packets will cause the appropriate network buffers to fill and, ultimately, stall the
on-chip router. Similarly, the outgoing link is blocking but the back-pressure may
partially rely on software checking availability.
2.2.3 Router
The router is the key specialised unit in SpiNNaker. Each router has 24 network
input and output pairs, one to each of the 18 processor subsystems and 6 to connect
to neighbouring chips. Largely the links are identical, the only difference being that
off-chip links (only) are notionally paired, so that there is a default output associated
with each input which is used in some cases if no other routeing information is
found.
All router packets are short. They comprise an 8-bit header field, a 32-bit data
field and an optional 32-bit payload. Much of the network is (partially) serialised,
so omitting the payload when not required reduces the demand on bandwidth and
saves some energy.
There are four types of packet: multicast (MC), point-to-point (P2P), nearest-neighbour (NN) and fixed-route (FR).
Each of the packet types is separated and routed according to its particular rules.
The simplest are P2P packets that provide chip interconnection. A fully expanded
SpiNNaker system is designed to have 2¹⁶ chips, so a 16-bit field in a P2P packet
determines the destination chip. This is used as an index into a RAM table that
specifies which output link to use for that packet. Each entry in the table is 3 bits
long, which permits the selection of any of the six chip-to-chip links plus an internal
option, used for when the packet has reached its destination chip; the routeing of
all possible packets is therefore fully specified in this table.
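As a software model, the P2P lookup is simply a dense table indexed by the destination chip. Packing ten 3-bit entries per 32-bit word, as below, is one plausible layout shown purely for illustration, not the actual register format.

    #include <stdint.h>

    #define P2P_CHIPS        (1u << 16)   /* a full 2^16-chip machine   */
    #define ENTRIES_PER_WORD 10           /* ten 3-bit entries per word */

    static uint32_t p2p_table[P2P_CHIPS / ENTRIES_PER_WORD + 1];

    /* Returns 0-5 for a chip-to-chip link, or the 'internal' code for
     * delivery to the monitor processor on this chip. */
    uint32_t p2p_route(uint16_t dest_chip)
    {
        uint32_t word  = dest_chip / ENTRIES_PER_WORD;
        uint32_t shift = (dest_chip % ENTRIES_PER_WORD) * 3;
        return (p2p_table[word] >> shift) & 0x7;
    }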
When the P2P packet reaches its destination chip, it has to be directed to a
particular processor. All internal P2P packets are sent to a preselected processor
subsystem, programmed into that router. The design intention is that this monitor
subsystem will, at least primarily, manage the computer itself rather than run
applications. It can forward messages to other systems if required in software, using
the shared RAM on the chip.
MC packet routeing is rather more complicated. As previously mentioned, it is
not feasible to store a complete routeing table for a billion neurons, so the neurons
are grouped and only a subset of the groups need be recognised by any particular
router. The first job is to recognise a packet (or not). This function is performed
by a TCAM in which the packet key is compared with all the entries. Each table
entry consists of a key and a mask. Within each entry, each bit is compared with the
corresponding stored state, which can be:
Mask bit   Key bit   Behaviour
0          0         Always match
0          1         Never match
1          0         Match if 0
1          1         Match if 1
Subsequently, all the bit matches are ANDed, and if the result is true, the entry
is a ‘hit’. These combinations allow each entry to match with particular patterns
of ‘0’s and ‘1’s in the key, disregarding some other bits. For example, an entry with
key = 0x5a5a5a00 and mask = 0xffffff00 will match the 256 packet keys in the
range [0x5a5a5a00, 0x5a5a5aff] as it ignores the 8 least-significant bits. Including
a never match bit anywhere in the entry indicates that the entry is unused, as it will
never produce a match.
The inclusion of don’t care fields means that it is possible to match multiple
different TCAM entries quite legitimately. This is an exploitable feature since the
matches are prioritised and the highest priority match is isolated for the subse-
quent stage. Placing more specific entries in higher priority positions can simulate
having more entries than are physically present. For example, an entry with key =
0x5a5a5a5a and mask = 0xffffffff will match the single packet key 0x5a5a5a5a,
which is part of the range matched by the entry listed in the previous paragraph.
If the new entry is included in the table at a higher priority than the previous
entry, it will make that entry only ever match the other 255 keys in the range.
Matching a set of 255 packet keys would require a larger set of non-prioritised
entries.
If a match has been made, the next step is to look up the output vector. This
comprises a 24-bit word where each ‘1’ bit indicates that the packet should be
copied onto that link. This facilitates the multicast operation.
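In software, the whole match-and-route step can be modelled in a few lines. In the sketch below, entries are held in priority order and a miss falls through to default routeing; the names and the ring numbering of the six links are illustrative assumptions.

    #include <stdint.h>

    #define MC_ENTRIES 1024

    typedef struct { uint32_t key, mask, route; } mc_entry_t;
    static mc_entry_t mc_table[MC_ENTRIES];   /* held in priority order */

    /* Returns the 24-bit output vector: bits 0-5 are the inter-chip
     * links, bits 6-23 the local processors. */
    uint32_t mc_route(uint32_t packet_key, uint32_t arrival_link)
    {
        for (int i = 0; i < MC_ENTRIES; i++)
            if ((packet_key & mc_table[i].mask) == mc_table[i].key)
                return mc_table[i].route;   /* highest-priority hit */

        /* No hit: default routeing forwards the packet out of the link
         * opposite the one on which it arrived (six links in a ring). */
        return 1u << ((arrival_link + 3) % 6);
    }

Note that 'never match' entries fall out of this model for free: a key bit set to 1 where the mask bit is 0 can never satisfy the masked comparison.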
Fixed route packets are very simple to direct. Each router has a single, pro-
grammable register that says which output link(s) to use. They are really a special
case of MC packets with a single, always matched key field and they require almost
no additional hardware. They can be used for specific purposes, such as building
network trees to funnel monitoring data back to host interfaces but can only pro-
vide one such structure in any single configuration.
Unlike the other communication packets, NN packets can be routed before the
network tables are initialised; their routeing is determined by the chip hardware
and the network topology. They are provided to support boot-time configuration and debugging of neighbouring chips; their sources and destinations are:
Source                         Destination
Any processor on this chip     One or all inter-chip links
An inter-chip link             The monitor processor on this chip
By convention, only the local monitor processor should originate such packets;
just like the other packets, they carry a 32-bit data field with an optional 32-bit
extra payload.
For debug purposes, a different type of NN packet is used. These are trapped by
the router on the destination device, which becomes a master of the shared address
space on that chip. This means that one chip can read and write some of the state of
any neighbouring device. The convention adopted here was that only 32-bit words
can be moved, and that the presence of a payload in a request indicates a write, while in a response it carries the returned read value.
All the routeing units deliver packets to an output stage together with a bit vector
indicating their output direction(s). All being well, copies are dispatched simultane-
ously on each of the indicated links. However there can be congestion which causes
back-pressure on an output; in this circumstance the router output stalls and waits
for the link(s) to clear. MC packets stall if any output is blocked rather than trans-
mitting on the unblocked links first; this facilitates some error recovery, if necessary,
later.
The network is not guaranteed deadlock free! In particular, the cloning of MC
packets can generate a lot more traffic than is initially injected. It is also infeasible to
implement an end-to-end flow control protocol on such packets. There is therefore
a risk – indeed a significant probability! – that the network could deadlock, at
least unless some other protection exists. This contingency is handled by using a
time-out on blocked packets. If a packet has been stuck for a pre-programmed time,
it is dropped and the next is output instead. Dropped packets are caught in software
and can be re-injected later. Ensuring that (multicast) transmission is all-or-nothing
means that only the packet needs to be saved, the packet routeing being re-derived
on re-injection.
[Figure: routeing around a break or blockage in the mesh – a packet diverted between routers.]
The TCAM was implemented using latch, rather than D-type flip-flop, cells for storage, which roughly halves the area.
To meet timing constraints, writing to these latches requires two clock cycles with
a resulting hiccup in the pipeline flow; however, writing is rare, so this is not a
serious issue. To further reduce cost, the multiplexer trees that would be needed to
read back the contents were omitted. Some means of production test is still required
though, and a scan chain through the latches is a difficult (and costly) alternative.
Instead, the TCAM is tested by association. A key pattern can be written to a
test register location and the presence or absence of a match can be determined,
together with the internal address of the first match. The test is conducted by one
of the on-chip processors during the boot process.
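A sketch of this 'test by association' is given below; the register addresses and bit layout are hypothetical stand-ins, since the real memory map is not reproduced here.

    #include <stdint.h>

    /* Hypothetical register addresses - illustrative only. */
    #define RTR_TEST_KEY    ((volatile uint32_t *)0xF1000080)
    #define RTR_TEST_RESULT ((volatile uint32_t *)0xF1000084)
    #define HIT_FLAG        (1u << 31)

    /* Present a key to the TCAM and report whether any entry matched,
     * returning the internal address of the first (highest-priority) hit. */
    int tcam_probe(uint32_t probe_key, uint32_t *match_index)
    {
        *RTR_TEST_KEY = probe_key;
        uint32_t result = *RTR_TEST_RESULT;
        *match_index = result & 0x3FF;        /* 1,024 entries -> 10 bits */
        return (result & HIT_FLAG) != 0;
    }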
In a fault-free environment, all packets arriving at a router will be intact, correct
and intentionally present. However, the router does some straightforward checks
to increase the robustness of the system. Firstly, an arriving packet has to have a
legal size, as counted by the number of symbols (‘flits’) arriving, delimited by End-
of-Packet (EoP) markers. It was conceived that noise on the asynchronous links
could easily introduce spurious symbols and corrupt packets. (In practice, such
problems have not been observed in existing machines given that the long, cabled
links which had been envisaged on the original design were avoided in the end.)
Packet corruption could still occur though, if a chip is reset (due to local problems)
while sending to its neighbours. There is also a parity bit in packets where space
allows, as a crude intactness check.
Finally, there is a timestamp on potentially long-lived packets, intended to guard
against misprogrammed routeing allowing packets to circulate in the system-wide
network indefinitely. This is a simple, slowly changing phase number known by all
routers and appended to packets as they are transmitted. To use this mechanism,
all the routers in the system need to be synchronised, to some resolution. Synchro-
nisation will not be perfect and, in any case, the time phase may change while a
packet is in flight. A 2-bit Gray code is therefore used for the time phase, where a
router will detect a mismatch on both bits and will remove the packet before try-
ing to route it; this is separate from the dropping due to congestion. A packet will
then time out if undelivered somewhere between one and two time phases after
transmission. The time phases are set in software but envisaged to be of the order
of a few milliseconds; legitimate deliveries should be completed in much less time
than this.
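The ageing test reduces to a two-bit comparison, sketched below: because the phase advances through the Gray sequence 00 → 01 → 11 → 10, both bits differ only when at least two phase changes have elapsed since the packet was stamped.

    #include <stdbool.h>
    #include <stdint.h>

    /* A packet is dropped as expired only when BOTH timestamp bits
     * mismatch the router's current phase, i.e. XOR == 0b11. */
    bool packet_expired(uint32_t packet_phase, uint32_t current_phase)
    {
        return ((packet_phase ^ current_phase) & 0x3) == 0x3;
    }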
Symbol encodings for the on-chip (3-of-6) and inter-chip (2-of-7) links:

Symbol   3-of-6 code   2-of-7 code
0        11_0001       001_0001
1        10_0011       001_0010
2        10_0101       001_0100
3        10_1001       001_1000
4        01_0011       010_0001
5        11_0010       010_0010
6        10_0110       010_0100
7        10_1010       010_1000
8        01_0101       100_0001
9        01_0110       100_0010
A        11_0100       100_0100
B        10_1100       100_1000
C        01_1001       000_0011
D        01_1010       000_0110
E        01_1100       000_1100
F        11_1000       000_1001
EoP      –             110_0000
–        00_0111       000_0101
–        00_1011       000_1010
–        00_1101       011_0000
–        00_1110       101_0000
The inter-chip links are more complicated however! In this case, to reduce the wiring (and pin) overhead, EoP is
coded as another flit. A 2-of-7 code has 21 separate symbols, so the required 17 fit
comfortably [225].
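On the receiving side, a flit is accepted only if exactly two of the seven wires have toggled; a minimal validity check, using the EoP symbol from the table above, could read as follows.

    #include <stdint.h>

    #define EOP_FLIT 0x60   /* 110_0000: the End-of-Packet symbol */

    static int popcount7(uint32_t flit)
    {
        int ones = 0;
        for (int i = 0; i < 7; i++)
            ones += (flit >> i) & 1;
        return ones;
    }

    /* A legal 2-of-7 flit has exactly two bits set; anything else is
     * noise or the remnant of a truncated packet. */
    int flit_is_valid(uint32_t flit)
    {
        return popcount7(flit & 0x7F) == 2;
    }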
The asynchronous handshake protocol relies on transmitter and receiver alter-
nating in action. This functions well in the absence of faults but there can be a
problem if one end of the communications loses state. This can happen, for exam-
ple, if a chip crashes badly and takes a complete watchdog reset. Was the chip in
an active or passive phase on each of its links? The solution employed is to assume
[Figure: an inter-chip link – seven data wires (data 0 to data 6) plus an acknowledge wire, carrying the 2-of-7 code.]
that the chip is active, so it can send data (as soon as it has some to send) but it also
acknowledges data which may or may not have been sent. The transition detectors
will ignore a second transition if they are already active, so if the acknowledgement
is spurious it is ignored and lost; however, if the corresponding device had just sent
a flit, it is now acknowledged even though its content has been lost. Flit-level com-
munication is resumed; the flits, including EoP markers, are forwarded to the next
router which will detect an incomplete packet, discard it, raise an interrupt and
resynchronise.
The network described above, the comms NoC, supports the SpiNNaker (short)
packet communications across the entire machine. There is a second, independent
network on each chip, the system NoC, which acts as the local shared bus. This
employs the same asynchronous interconnection technology to simplify timing clo-
sure but the interface and traffic patterns are different and the topology reflects this
to some extent.
The local shared resources comprise the SDRAM and all the rest, the latter cat-
egory being peripheral interfaces et alia. The heaviest data traffic was anticipated
to be to the SDRAM. The system NoC is therefore decoded near each source and
crossbar-switched into these two branches, where various requests are then arbi-
trated and serialised. There are 19 masters on this network: the 18 processor sub-
systems and the router, which can read and write to shared resources, prompted by
NN packets from a neighbouring device. This latter facility provides a debugging
aid and allows code – and even router network tables – to be promulgated during
boot.
The heaviest traffic on the system NoC is DMA from – and, to a lesser extent,
to – the SDRAMs. This comprises bursts of contiguous data that are well suited
to SDRAM efficiency. On the one hand, the interface to this part of the network
uses an AXI interface, which is optimised for such trains of data and, in this case,
is 64 bits wide. On the other hand, the remaining shared devices are slaved on an
AHB. The system NoC bridges these different protocols.
In a similar fashion to the inter-chip links, this asynchronous interconnec-
tion can be disrupted by unusual events. The only anticipated problem stems
from the loss of coherency due to a processor being reset during an outstand-
ing transaction. The ARM itself provides no alternative but a straightforward
restart; under any conditions but a full power-up, the bus bridge retains some
state and is able to complete (and discard) any outstanding transactions before
reconnecting the processor. This avoids a crash-reset jamming the whole network
and allows the affected processor to recover. The mechanism extends to freeing
up any bus locking in the unlikely event of resetting during a read-modify-write
operation.
Clocks are generated by on-chip Phase-Locked Loops (PLLs) and divided down to feed different subsystems. Processor clocks are limited to two groups (9 processors in each) but typically use the same source; 200 MHz is the design maximum frequency. The router is typically run slower because spiking applications do not require its full throughput, and 133 MHz is convenient. The SDRAM controller is optimised separately to get better performance from the SDRAM device, and a 130 MHz clock is usually used. At these speeds, under reasonable load, the power consumption of the chip is around 1 W.
One slightly unusual shared subsystem is the mechanism for picking one pro-
cessor to be chip monitor. One processor is normally dedicated to functions such
as setting up and maintaining routeing tables and host communication, and this
is set into the architecture as the receiver of P2P packets. Rather than dedicat-
ing a particular subsystem to this task, the selection is left to run-time. The reasoning for this is partly to increase useful chip manufacturing yield.
Due to defects, not all manufactured integrated circuits work. Defects tend to
be randomly situated, so in a chip like SpiNNaker any particular defect is likely to
be in one of the processor subsystems – and, in particular, probably in its SRAM.
On boot up, each processor system runs some simple tests and, if it completes these
successfully, assumes that it is okay and attempts to claim the title of monitor. This
is done by reading a particular peripheral device (in the System Controller)1 which
has been cleared by power-up reset. The first device to do this is granted permission
to go ahead and its identity is recorded; subsequent devices are rejected and their
software moves to a subservient role. If everything is still functional, the victorious
monitor then brings up the whole chip.
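From the software side, the claim amounts to a single read of a read-sensitive register; a minimal sketch, in which the register address, name and bit layout are invented for illustration:

    #include <stdint.h>

    /* Hypothetical arbiter register in the System Controller: cleared by
     * power-up reset, it grants the monitor role to the first core that
     * reads it and records that core's identity.  Address and bit layout
     * are assumptions, not the documented map. */
    #define SC_MONITOR_ARBITER (*(volatile uint32_t *)0xf2000020u)
    #define ARBITER_GRANTED    (1u << 31)

    /* Called from boot code once a core has passed its self-tests;
     * returns nonzero only on the single core that won the race. */
    static int try_claim_monitor(void)
    {
        return (SC_MONITOR_ARBITER & ARBITER_GRANTED) != 0;
    }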
However, the tests in the boot ROM are reasonably primitive as it was perceived
as risky to commit to having too much unfixable code on the chip mask and it was
envisaged that the subsequent discovery of a fault could then be fatal to the whole
device. To protect against this, a second reset – such as a watchdog – will repeat
the process, the difference being that the previous monitor will be refused even if,
as is likely, it is still the first to ask. To have two subtly broken claimants to the
‘monitorship’ would be particularly unlikely.
As a final line of defence against faults, each chip has a hardware watchdog unit.
This is intended to provide protection for the chip monitor which can then provide
more sophisticated monitoring of the applications processors in software. It acts to
reset the monitor processor after a preprogrammed interval unless itself periodically
reset by the software. The unit also has a second time-out interval and a further
output which will only trip if the monitor has not recovered after the first watchdog;
1. The System Controller also includes functions such as individual core resets and interrupts, and semaphore registers.
this is set up to reset (and thus reboot) the whole chip, although it is anticipated
that simply recovering the monitor will normally be sufficient to initiate recovery.
The monitor (or, indeed, any processor) can reset any processor(s) using the System
Controller, which can provide a reset pulse such that a processor can safely reset
itself, if desired.
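In software terms this reduces to the monitor periodically 'kicking' the watchdog before its first interval expires; the register names and magic key below are placeholders rather than the documented SpiNNaker map:

    #include <stdint.h>

    /* Placeholder registers for illustration; not the real addresses. */
    #define WDOG_LOAD (*(volatile uint32_t *)0xf3000000u) /* first interval  */
    #define WDOG_KICK (*(volatile uint32_t *)0xf3000004u) /* restart counter */
    #define WDOG_KEY  0x5afec0deu                         /* assumed key     */

    /* Called regularly from the monitor's main loop or a timer callback.
     * If the monitor hangs, the first time-out resets the monitor core;
     * the second, as described above, reboots the whole chip. */
    static void watchdog_kick(void)
    {
        WDOG_KICK = WDOG_KEY;
    }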
As part of the support for larger systems, there was an (inexpert) attempt to build
a thermometer on each chip. This is possible because the properties of the electronic
components – particularly speed – change with the temperature. Unfortunately, the
properties also change with variations in (local) operating supply voltage and indi-
vidual manufacturing conditions. To overcome this, three different temperature-
sensitive circuits were implemented. One is a simple inverter ring oscillator, which
can be timed against a known, crystal-regulated delay; the second is a mixed-signal
ring oscillator whose stage delay reduces rather than increases with rising temperature; the third timed the leakage discharge of a capacitor.
By taking three measurements with three unknowns, it is, in principle, possible to
extract values for all three, independently.
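In outline, if each sensor reading r_i is modelled as approximately affine in temperature T, supply voltage V and a process parameter P – a simplifying assumption for illustration – the three readings form a linear system:

\[
\begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}
= A \begin{pmatrix} T \\ V \\ P \end{pmatrix} + \mathbf{c}
\quad\Longrightarrow\quad
\begin{pmatrix} T \\ V \\ P \end{pmatrix}
= A^{-1}\!\left(\mathbf{r} - \mathbf{c}\right),
\]

where the 3 × 3 coefficient matrix A and offset vector c must be obtained by calibration – which, as noted later, is where the difficulty lies.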
If this proves inadequate, they can be used to lock larger structures, or the SWP approach can be used.
Global
Nearly all the pins on SpiNNaker fall into one of two categories: the inter-chip
links – where each of the six links comprises a complementary pair of asynchronous
links with seven data and one acknowledge wire each – and the SDRAM interface.
Provision was made for two 1 Gb low-power Double Data Rate (DDR) SDRAM
chips (the contemporary technology) in the architecture; one such die is physically
stacked onto the ASIC die and wire bonded before packaging – this makes the over-
all system PCB footprint significantly smaller – but the interface is also pinned out
for expansion; in practice, extending the SDRAM has not proved necessary. These
interfaces leave little room for conventional interfacing, but some of the remain-
ing pins provide for this. Chiefly, the ASIC includes an Ethernet interface pro-
viding a Media-Independent Interface (MII) which provides for a host link to a
standard network. This requires an external Physical Layer Device (PHY) – a phys-
ical medium adaptor – but it was never planned to provide Ethernet connectivity
to all the devices, just specific selected ones to provide gateways to the SpiNNaker
network.
The other general purpose I/O is a standard, parallel I/O port. In some cases, the
bits here may be used to support (for example) the Ethernet control. One bit is read
at boot time to select one of two boot options in the internal ROM: the conven-
tional start-up and a (tested, but not generally needed or used) option to use other
pins as a serial bus (Serial Peripheral Interface [SPI]) to download a different boot
sequence from an external source. This second option was to guard against a seri-
ous mistake in the main boot ROM code; it has not been needed. However, some
devices still use an external ROM, a good example being an Ethernet-expanded
chip which needs individual data such as a Media Access Control (MAC) address.
There are still several always-uncommitted bits that are useful primarily for debug-
ging purposes and the all-important blinkenlight.
Finally, an IEEE 1149.1-compliant Joint Test Action Group (JTAG) port is also
available for debugging purposes. Internally, the device chain comprises only the
18 ARM processors, as JTAG support was not deemed cost-effective for other sys-
tem components.
2.6 Monitoring
To facilitate tuning of the system and to give feedback on the design – this is,
after all, a research project – some hardware monitors were built in as counters.
Fundamentally, these are used to help observe the behaviour of the communication
networks.
The router has sixteen 32-bit counters which count packets in particular classes.
Each counter has an input filter which can be set to include or exclude packets
of a given type (such as multicast only), whether they have a payload, if they have
been actively (as opposed to default) routed, where they have been routed to and
so on. These are encoded as Boolean switches so a user can enable various com-
binations, including all. (Because there are so many possible internal destinations, the destination category is subdivided only into each chip-to-chip link, the monitor processor, any application processor and dumped.) Emergency routeing
states are included so any re-routeing, which would otherwise be invisible, can be
detected. These allow traffic patterns to be observed over time and any hot spots
detected.
As a bonus, the filters can be used to activate interrupts, so the passing of a par-
ticular sort of packet can attract immediate attention from the monitor (or other)
processor.
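As a sketch of how such a filter might be driven from software – the register addresses and bit positions here are assumptions, not the documented diagnostic-counter interface – each filter is just a word of Boolean include switches:

    #include <stdint.h>

    /* Assumed layout for the sixteen diagnostic filter registers. */
    #define RTR_FILTER(n) (*(volatile uint32_t *)(0xf1000200u + 4u * (n)))

    #define F_TYPE_MC        (1u << 0)   /* include multicast packets      */
    #define F_TYPE_P2P       (1u << 1)   /* include point-to-point packets */
    #define F_PAYLOAD        (1u << 4)   /* include packets with payloads  */
    #define F_DEFAULT_ROUTED (1u << 8)   /* include default-routed packets */
    #define F_DEST_DUMPED    (1u << 12)  /* include dropped packets        */
    #define F_EMERGENCY      (1u << 16)  /* include emergency re-routeing  */

    /* Example: make counter 0 count multicast packets that were either
     * dumped or emergency re-routed - a congestion 'hot spot' detector. */
    static void setup_congestion_counter(void)
    {
        RTR_FILTER(0) = F_TYPE_MC | F_DEST_DUMPED | F_EMERGENCY;
    }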
A second counter set monitors the behaviour of the system NoC; in this case,
rather than count transactions (which are already known), it times the latency of a
request to reveal how well the SDRAM is serving. Here counters are incremented
according to the number of clock cycles between the memory request and response,
which encompasses the travel time across the asynchronous network, any delays
due to arbitration and the latency of the SDRAM itself. The counters are in bins
of adjacent values and results are presented in the form of a hardware histogram,
accumulated over a set period.
The completed ASIC (Figure 2.12) measures about 10 mm on each side and contains about 100 million transistors, mostly as static RAM. Its feature size is 130 nm. The pro-
cessors and router can run (within specification) at 200 MHz; the processor sub-
systems are typically run at this frequency although it is normal to run the router at
133 MHz since it still meets its usual demands at this speed. The SDRAM interface
is usually set to 130 MHz.
The power consumption depends on the active loading; the processor can halt
when there is no work to do, which reduces the power consumption significantly.
However with all 18 processors running at 200 MHz, the power dissipation is still
around the 1 W mark.
The implementation of the SpiNNaker chip was a big challenge given the size
and complexity of the system. SpiNNaker integrates several external IP devices,
such as the ARM processors and SDRAM controller, with components developed
in-house by the SpiNNaker team.
Figure 2.12. The SpiNNaker ASIC, bonded to its ‘piggy-back’ SDRAM. Photo courtesy of
Unisem Europe, Ltd.
SystemC was used to validate the architecture and design of the SpiNNaker chip.
The synchronous models were cycle accurate, while asynchronous network models
were based on early delay estimates. External synchronous IP was delivered in RTL
Verilog, which was also used to develop most of the in-house designs, whereas asyn-
chronous IP was delivered in technology-mapped, gate-level Verilog. Equivalence
checking was used to verify RTL synthesis and optimisation. Gate-level models
with extracted parasitics and annotated delays were used for simulations.
The Synopsys Galaxy Design Platform was used for the design and implemen-
tation tasks. The implementation employs architecture and logic-level clock gat-
ing. The design methodology was fine-tuned with special emphasis on the power
efficiency of the clock networks. Power-aware synthesis was used throughout the
flow. A hierarchical methodology was employed [195] for the implementation of
the fully asynchronous networks, encapsulating small sections of the logic in cus-
tomised macros and using these as blocks for the larger sections.
The SpiNNaker chip is packaged in a 300LBGA package with 1 mm ball pitch.
All I/O operates at 1.8 V with CMOS logic levels. The package exports an SDRAM interface operating at 1.8 V LVCMOS; this is usually left unconnected, as the package incorporates an internal SDRAM die.
There are a few minor niggles but the ASIC is basically fully functional. At the time of writing, the SpiNNaker chips have been in use for some years. Largely they have proved to fulfil their intended function, although some shortfalls have become apparent.
2. A trend which may continue – currently, we focus on point neuron models, but interest is growing in two-
compartment and dendritic computation abstractions.
3. We are grateful to Prof. Tobias Noll for drawing our attention to this matter and to public data from Micron,
with which our measured error rates are broadly consistent, at: https://www.hotchips.org/wp-content/uploads/
hc_archives/hc16/1_Sun/10_HC16_pmTut_1_bw.pdf
The observed SDRAM error rates are small – negligible in a small system (say, a few thousand processors) or over short time-scales – but they are a concern for a million-processor machine running over several days.
The router copes well with the design loads and is typically run satisfactorily at
half its maximum speed. With spiking neural models, the spike packets are approx-
imately evenly distributed in time (as a result of careful software design to ensure
that this is the case), so network congestion is fairly unusual. However, other users
have implemented other applications – neural and otherwise – on the machine,
some of which are synchronous, resulting in network idle periods interspersed with
floods of packets. In these conditions, points in the network congest and, once
a router is blocked, the back-pressure causes the congestion to spread. The time-
out/packet drop will free this in time, but recovery is not helped by the relatively slow (software-mediated) packet-dropping rates. More elasticity or faster dropping would alleviate many of the problems posed by these applications. However, permanently dropping packets – even neural spikes, which might be expected to be fairly unimportant in a fault-tolerant system – turns out to be unacceptable to many users, so the hardware dropping rate is typically limited to the speed at which software can salvage all dropped packets.
The size of the multicast routeing tables was set by (somewhat inspired) guess-
work. The number of entries here is arbitrary but sets the size of the TCAM which,
in turn, dominates the router area; 1024 entries were implemented; this allows
functional placement and routeing of most neural networks so far tried, although,
even with some clever exploitation of the bit-fields and prioritisation of entries, it is
uncomfortably small in some circumstances. Furthermore, tightly optimising the
initial setting of the TCAM, such as the sharing of entries, makes subsequent, run-
time modification more difficult. Neural interconnection updates primarily involve
changing synaptic weights, but if new neural projections appear – and, in biology,
they do – then changes to the TCAM may be needed to model this. A larger table
would ease this process considerably. However, the existing table is not vastly too
small: something like a doubling in size should easily accommodate anything so far
envisioned.
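Concretely, each of the 1,024 entries pairs a key with a mask – the mask implementing the TCAM 'don't care' bits – and a route bitmap; the bit assignments in the sketch below follow common descriptions of the SpiNNaker router and should be read as illustrative:

    #include <stdint.h>

    /* One multicast routeing entry.  An incoming packet hits the entry
     * when (packet_key & mask) == key; the route bitmap then selects all
     * of the output ports at once, which is what makes routeing multicast. */
    typedef struct {
        uint32_t key;    /* value to match after masking                   */
        uint32_t mask;   /* 1 = compare this bit, 0 = TCAM 'don't care'    */
        uint32_t route;  /* assumed: bits 0-5 inter-chip links, 6-23 cores */
    } mc_entry_t;

    #define MC_TABLE_ENTRIES 1024   /* the table size discussed above */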
The most serious architectural drawback in SpiNNaker is probably connected not with simulating spiking neurons but with being a computer. The short communication packets serve well as models of neural spikes but poorly for copying bytes between computer memories. There are two readily identifiable areas
where this is needed:
Loading data: Code is necessarily short and is typically the same in most proces-
sors; the data tables that define the neural net are all different and are large.
These tables are not hand-generated at the neuron level; they are specified at a higher level of abstraction and generated in software.
Unloading results: Having run a simulation, the user needs to see what happened.
This is the reverse problem from the data loading. Again it could be alle-
viated by performing some of the statistical compression in parallel in the
SpiNNaker machine; currently, it involves dumping large quantities of data
through the network and processing this on the host computer.
While both these problems can largely be mitigated by more sophisticated soft-
ware, better (faster) up- and down-loading of the SpiNNaker’s SDRAM contents
would give a more generally usable machine.
The NN packets provide a means of remote access to the shared resources of a
chip without the need for software cooperation from the target chip. This proves to
be a valuable debug and, especially, diagnostic tool. It would be even more valuable
if it could also reach into the individual TCMs of crashed processors. This was not
done as it would have been a significant additional feature to make the system NoC
bidirectional and provide a second master to the DMAC bus; however, in hindsight,
it may have paid off to do this.
The purely on-chip (system) network, which is used primarily to DMA SDRAM
contents to (and from) the individual processors’ SRAMs, meets its requirements
well. It does not provide enough bandwidth for a single processor to use the full
SDRAM bandwidth, but this was never the intention as the SDRAM is a shared
resource. In practice, three or four processors can share the SDRAM relatively
unimpeded; with more active processors, the share is limited and each participant
is allocated a roughly equal share. Since the applications do not require continuous
SDRAM access and requests are not correlated, this network and the RAM are not
a bottleneck.
One omission which was not obvious in the original design is the lack of hard-
ware memory protection. Various hardware systems that need security are protected
from user-mode accesses but there is no protection of the RAM itself. The reason-
ing was that the applications are embedded, the users are trusted, and therefore,
the hardware overhead is unnecessary. While this reasoning still holds, the software
development process has shown that the addition of some protection of the RAM
would facilitate easier code debugging by helping to localise programming faults.
There is an ARM standard Memory Protection Unit (MPU), which offers access
control of programmed regions of the address space with little hardware overhead.
With hindsight, this would probably have been a worthwhile inclusion.
The triangular connectivity of the network was partially determined by the desire
to provide emergency routeing around broken or blocked inter-chip links; packets
can be routed around a breakage via the other two sides of the triangle. In practice,
this has never really been an issue; blockages are most likely due to congestion at
the destination router, so finding an alternative path is not useful. This feature is
therefore somewhat redundant.
Emergency routeing is unlikely to be used and, since this is the only constraint
requiring the 2D triangular mesh, there are possibilities to use the chip in other
network topologies. The most obvious such topology is a 3D cubic mesh, which can still
exploit the default routeing feature to save on TCAM entries. A machine configured
as a 3D torus has an advantage in shortening the average path length. This has not
been put into practice though, since the network capacity of the machine is more
than adequate for neural simulation using the original layout and the inter-chip
and inter-PCB connections are already well understood – and, probably, somewhat
more tractable.
Each ARM9 processor is supported by a Vectored Interrupt Controller (VIC) with 32 interrupt inputs; the particular VIC allows interrupt prioritisation, which supports nesting of interrupt service routines, and allows half of the inputs to be vectored directly to their specific service routines; the other 16 inputs need some form of software dispatcher. The choice of interrupt signals seemed fairly clear at design
time as there were about 32 hardware status signals that could sensibly be used; it
was largely a matter of filling up the available interrupt inputs with status signals.
Only one bit was allocated as a software-triggered input, allowing software on one
processor to request attention from any other.
This is typically restricted to communications to and from the monitor processor.
The inefficiency is that a single interrupt has to serve all the potential communica-
tion needs, which implies software checking of status previously implanted in the
(slow) shared memory. This is a particular burden for the monitor processor which
needs to determine which other processor(s) are requesting attention and the rea-
son(s) in each case by working through a collection of flags planted in shared RAM.
In retrospect, combining some of the less important hardware signals could have
made room for more software signalling which could relieve some of this burden.
Ideally, extending the interrupt structure to cascade more discrete software signals
would have been even more useful. This could facilitate simpler and faster message
communication, particularly as all host to application processor links use P2P pack-
ets and, necessarily, are mediated by the monitor processor. Peer-to-peer signalling
in the applications processors could also be useful in extending the flexibility of the
chip when running tasks other than spiking neurons.
The temperature sensors were tested and functioned basically as predicted. Some
curves have been plotted where properties could be controlled. Unfortunately (to
date), the calibration and extraction of the true temperature has defeated everyone
who has tried.
The asynchronous inter-chip links have proved reliable, delivering 250 Mb/s
consistently; a modest speed by current standards but, as prioritised, extremely
energy efficient. The links scale more than adequately to massive sizes: the full-size
SpiNNaker system, described in Chapter 3, contains over 57,000 SpiNNaker chips
with a bisection bandwidth of 480 Gb/s and a worst-case latency in the 34–46 µs
range.
2.9 Summary
The SpiNNaker chip was designed by a small team of academic researchers and
postgraduate students with the associated restrictions and constraints regarding
fabrication cost and access to process technologies, standard cell libraries and intel-
lectual property. Overall, a 40 person-year effort was devoted to its design, imple-
mentation and verification. A test chip with two processors was taped out in August
2009 followed by the production chip in December 2010. Key SpiNNaker figures
are listed in Table 2.4.
Although SpiNNaker is a high-performance architecture highly optimised for
running neuroscience applications, it can also be used for other distributed com-
puting, such as ray tracing and protein folding. The chip provides a cost-effective
means of achieving over 10¹⁴ operations per second, provided that floating-point
arithmetic is not required.
As a message-passing system, the greatest performance bottleneck is the com-
munications between processors and, therefore, SpiNNaker was optimised for
the short, multicast messages (spikes) associated with neural network simulation.
This optimisation has resulted in some additional overhead for other applications,
including loading and control of neural networks, which is done by sharing the run-
time network. In hindsight, provision for larger payloads with guaranteed delivery
would relieve the software burden in sharing these disparate tasks. However, the
decision to share a single network still appears sensible, and this relieves some of
the system-level problems.
Experimental results show that, for massively parallel neural network simula-
tions, the customised multi-core architecture is energy efficient while keeping the
flexibility of software-implemented neuronal and synaptic models, absent in cur-
rent neuromorphic hardware.
Table 2.4. Key SpiNNaker chip figures.

Process: 130 nm CMOS; ~100 million transistors
Processing: 18 ARM968 cores; 200 MHz maximum
Memory: per-core ITCM and DTCM; on-package 1 Gb LPDDR SDRAM
Communications: 6 bidirectional inter-chip links at 250 Mb/s each
Power consumption:
  Peak (chip): 1 W
  Idle (chip): 360 mW
  Idle (core): 20 mW
  Off-chip link (full speed): 6.3 mW (25 pJ/bit)
  SDRAM: 170 mW
Implementation: 300LBGA package, 1 mm ball pitch
Chapter 3
Building SpiNNaker Machines
Strive for perfection in everything you do. Take the best that exists and make it better.
When it does not exist, design it.
The 40 person-year effort required to develop the SpiNNaker chip constituted only
the first step in the path to build a platform to help understand how the human
brain works. The next step was to make SpiNNaker chips work together at a massive scale to simulate very large spiking neural networks. The initial target was to
simulate a billion neurons, around 1% of the human brain, in real time. Our esti-
mates suggested that it would require one million processing cores, that is, over
57,000 SpiNNaker chips.
Figure 3.1 illustrates the road to a billion neurons. The monumental task of
assembling the million-core, massively parallel SpiNNaker1M computer would
involve configuring, testing and deploying 1,200 SpiNNaker boards, 150 power
supplies, 60 network switches, 50 fan trays and 1.5 km of high-speed interconnect
cables, all housed in 10 standard 19" cabinets.
At the time that this book was written, there were around 300 SpiNNaker boards
in use in many places around the world, as shown in Figure 3.2. Additionally,
1,200 SpiNNaker boards were used to deploy SpiNNaker1M, located at the Uni-
versity of Manchester. On the way to the commissioning of this machine, five
different SpiNNaker board designs were produced, each with its own objectives
and characteristics.
Table 3.1 summarises the function and main features of each of the boards. The
first two boards were used mainly to verify and evaluate some of the novel aspects
of the SpiNNaker chip, such as the asynchronous NoC interconnect [196]; the
SDRAM interface, which contains an asynchronous, programmable digital Delay-
Locked Loop (DLL) [72]; and the asynchronous, delay-insensitive chip-to-chip
interconnect [224]. These boards were fitted with Zero-Insertion-Force (ZIF) sock-
ets to facilitate the testing of packaged chips. They also had external SDRAM chips
in case the on-package, wire-bonded SDRAM failed. Later boards do not have external SDRAM as the internal setup proved reliable.
It is worth noting that, as indicated in Table 3.1, the SpiNN-4 prototype board
had power supply issues. In hindsight, it should have been obvious that 864 ARM
cores waking up concurrently can be extremely taxing on the power supply and this
requires adequate capacitance and remote voltage sensing. The SpiNN-5 produc-
tion board has a completely redesigned power supply and distribution network to
avoid these issues.
The SpiNN-3 development platform and SpiNN-5 production board are exten-
sively used and are described in the following sections.
Table 3.2 summarises the SpiNN-3 board main features. Each SpiNNaker chip
has a red and a green Light-Emitting Diode (LED) for general purpose use. Its
main I/O interface is a 100 Mb/s Ethernet connection, that is usually connected to
a host machine, as described in Chapter 4. Additionally, two inter-chip SpiNNaker
channels have been exported to connectors and can be used to connect to other
SpiNNaker boards or to external neuromorphic devices, such as a Dynamic Vision
Sensor (DVS) [141, 143, 219], also known as a silicon retina or an event camera.
SpiNN-3 boards have been used during the design, verification and testing
of the different software components described in Chapter 4 as well as in the
training of SpiNNaker users. Additionally, the SpiNN-3 platform has been used
to build exemplar neuromorphic systems. Figure 3.4 shows a real-time, event-
driven neuromorphic system for goal-directed attentional selection developed
by Galluppi et al. [67]. The system uses a Field-Programmable Gate Array (FPGA)
interface board to connect an AER [136] DVS to the SpiNNaker board. The inter-
face, built using components from the SpiNNaker I/O library (spI/O) [197], is
described in SpiNNaker Application Note 8 [198].
Figure 3.6. SpiNN-5: Board structure and multi-board tiling. Figure (c) reproduced with
permission from Heathcote [94].
The board also includes a Board Management Processor (BMP) responsible for functions such as controlling fan speed, keeping track of board operating temperatures (using temperature sen-
sors located at the north and south edges of the board) and taking appropriate
action in case of overheating. The BMP has its own Ethernet connection, that can
be used by the host to send commands to, and receive information from, the BMP.
The design of the SpiNNaker chip targeted energy efficiency as a top priority and
the board design also reflects this goal. Table 3.4 shows the power consumption of
the SpiNN-5 board under different loads. Unfortunately, due to area and design
constraints, the board is not adequately instrumented to measure the power con-
sumption of individual components, such as the SpiNNaker chips, the Ethernet and
SATA interfaces or the BMP subsystem; only the power consumption of the FPGAs
can be determined independently. However, board-wide measurements suggest that
each SpiNNaker chip consumes around 1 W when fully loaded.
A blacklist of unreliable devices and channels is kept in the non-volatile memory of the board.
The blacklist is applied during the boot process, guaranteeing a consistent, reliable
system. As explained in Chapter 4, the host reads the SpiNNaker machine infor-
mation to map the application only to correctly operating devices and channels.
Table 3.6 shows the results of the blacklisting process. Entire chips were black-
listed for a number of reasons, usually involving a shared resource, such as the
SDRAM or the SpiNNaker router. The number of blacklisted cores does not
include the cores in the blacklisted chips but includes the cores that were already
identified as not fully functional in accepted 17-core chips. Although the percent-
ages of blacklisted chips and cores are small, they are not negligible.
Due to the hexagonal arrangement of the SpiNNaker chips on the SpiNN-5 board, the hexag-
onal torus topology also applies when a board is considered a network node.
Figure 3.7. The three board FPGAs connected in a ring: FPGA 2 serves the North and North East links (Peripheral 2), FPGA 1 the West and South West links (Peripheral 1), and FPGA 0 the East and South links (Peripheral 0).
The energy-efficient 2-of-7 code is ideal for inter-chip interconnect, given that it works
correctly in the presence of unbounded delays. Unfortunately, the number of wires
required for the 2-of-7 encoding would be extremely expensive for the direct board-
to-board connection. The use of that code would amount to a total of 768 wires on
the board periphery. To reduce the number of wires, the SpiNNaker channels are
multiplexed over SpiNNaker board-to-board links (spiNNlinks), that is, High-Speed
Serial Links (HSSLs) implemented using on-board FPGAs. Each FPGA manages
the 16 SpiNNaker channels on two adjacent sides of the hexagon and has spare
capacity to manage peripheral connections. Additionally, the FPGAs themselves
are connected in a high-speed ring, as shown in Figure 3.7.
spiNNlink incorporates several novel ideas including a bespoke, credit-based,
reliable frame transport protocol that allows the multiplexing of asynchronous
channels over a high-speed serial link and an efficient FPGA to asynchronous
channel interface that provides twice the throughput of traditional synchronisation
schemes.
Figure 3.8 shows two connected SpiNN-5 boards, each with its transmitter (Tx)
and a receiver (Rx). Two independent data + control streams are multiplexed onto
the same HSSL. The figure highlights one of the streams, with left-to-right flow of
data and the corresponding right-to-left control flow. In the symmetric stream (not
shown in the figure), data and control flow in the opposite directions.
Transmission over HSSLs is structured in frames. The different frame formats
are shown in Figure 3.9. There are five frame types associated with data and control
transmission: data (data), out-of-credit (ooc), acknowledge (ack), reject or negative
acknowledge (nack), and channel flow control (cfc). Each frame is identified by a
different start-of-frame special character, highlighted in red in Figure 3.9, carries
a frame colour (fc) and is protected by a CRC checksum (CRC). Data, out-of-
credit (ooc), ack and nack frames also carry a sequence number (sequence). Two
additional frame types, clock correction (clkc) and idle (idle), are used to keep the
HSSL synchronised.
Frames are a single 32-bit word long except for data frames, which have a vari-
able length. As indicated earlier, eight SpiNNaker channels are multiplexed into a
single HSSL and, as a result, a single data frame can carry up to eight SpiNNaker
packets, one from each channel. A SpiNNaker packet consists of an 8-bit header,
a 32-bit routeing key and an optional 32-bit payload. The 8-bit presence field is a
bitmap used to indicate if the frame carries a packet from the respective channel.
Similarly, the 8-bit length field is a bitmap that indicates if the packet is long (con-
tains a payload) or short (no payload). These two bitmaps, part of the first word of
every data frame, establish the actual structure and length of the frame. Depending
on the number of SpiNNaker packets carried, data frames can be 4 to 20 32-bit
words long.
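The stated 4-to-20-word range is consistent with four words of fixed overhead (assumed here) plus one key word per packet present and one payload word per long packet:

    #include <stdint.h>

    /* Counts the set bits in an 8-bit bitmap. */
    static unsigned popcount8(uint8_t b)
    {
        unsigned n = 0;
        for (; b != 0; b >>= 1)
            n += b & 1u;
        return n;
    }

    /* Length of a data frame in 32-bit words: 4 overhead words (assumed),
     * one routeing key per packet present and one payload word per 'long'
     * packet - from 4 words (empty) to 20 (8 packets, all with payloads). */
    static unsigned data_frame_words(uint8_t presence, uint8_t length)
    {
        return 4u + popcount8(presence) + popcount8(presence & length);
    }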
Transmission over the HSSL operates as follows: data from Tx are sent once,
identified by a frame sequence number and protected by a CRC for error detec-
tion. Multiple frames can be sent successively, subject to credit limits. Data frames
need not contain any actual data. If the credit becomes exhausted, Tx simply sends
unsequenced Out-of-Credit (ooc) frames instead.
Received data frames are either correct or not. A correct data frame will pass error
checks and have the expected sequence number. Erroneous frames are rejected and
retransmission is requested using the sequence number. To guarantee frames are
received in order, erroneous frames also change the receiver colour so that sub-
sequent, correct or incorrect, frames can be flushed until the erroneous frame is
retransmitted correctly in the new colour.
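In outline, the receiver behaves like a go-back-N protocol with the colour acting as an epoch bit; everything named in this sketch (types, helpers, field widths) is invented for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical view of a received frame. */
    typedef struct { uint8_t colour, seq; /* ... */ } frame_t;

    static uint8_t rx_colour, expected_seq;

    extern bool crc_ok(const frame_t *f);           /* CRC check        */
    extern void deliver(const frame_t *f);          /* pass data onward */
    extern void request_retransmit(uint8_t seq);    /* send a nack      */

    static void on_data_frame(const frame_t *f)
    {
        if (f->colour != rx_colour)
            return;                    /* old colour: flush silently until
                                          the retransmission arrives      */
        if (crc_ok(f) && f->seq == expected_seq) {
            deliver(f);                /* in-order delivery guaranteed */
            expected_seq++;
        } else {
            rx_colour ^= 1u;           /* new colour for the retransmission */
            request_retransmit(expected_seq);
        }
    }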
Rx provides updates on its status to Tx at expedient intervals. These are not
necessarily triggered by data arrival and continue in the absence of new data. Infor-
mation is conveyed on the credit available, the colour and Rx status. Flow con-
trol information (Xon/Xoff ) for individual SpiNNaker channels is also transmitted.
Error tolerance is provided by the repetition of these ‘frames’.
Tx re-credits its data frame allowance in response to the receiver status. Old
data, retained for possible retransmission, can be discarded up to the acknowledged
(ack) sequence number. When an error indication (nack) is received, the transmitter
changes colour, ignoring further prompts until the data stream is re-established,
resets its inputs to the error point and retransmits frames from the failed frame
sequence point. There is no requirement that the data contained is the same as the
original frames; frames may be reformed with additional data if desired.
The fully-asynchronous, handshake-based SpiNNaker channels described in
Chapter 2 pose a throughput challenge for spiNNlink. In a traditional interface,
the communications throughput is limited by the latency introduced by the syn-
chronisation flip-flops required for the handshake signals.
We investigated a fully asynchronous, that is, clockless, version of spiNNlink
[145] that used an asynchronous First In First Out (FIFO) buffer to avoid the
latency penalty imposed by the synchronisers. The new design significantly increased the communication throughput but, as commercial CAD tools target
synchronous design flows, also increased the design, synthesis, placement and ver-
ification effort and was not a good match for the target FPGA devices.
To avoid these issues, a novel strategy was developed for spiNNlink using
well-understood synchronous timing assumptions to predict the arrival of the
next SpiNNaker channel handshake, without actually waiting for it to com-
plete. Additionally, the novel interface is aware of asynchronous back-pressure,
that is, situations in which the asynchronous channel stalls for an unbounded
time due to traffic congestion. In order to operate correctly in these situations,
spiNNlink uses Synchronous Timing Asynchronous Control (STAC). It predicts hand-
shake timing except at the point where the channel may apply back-pressure,
where it completes a fully asynchronous handshake, responding correctly to
back-pressure and providing twice the communications bandwidth of the tradi-
tional implementation.
spiNNlink was implemented on the SpiNN-5 board FPGAs using Xilinx IP and
the components available in the SpiNNaker I/O library (spI/O) [197]. Table 3.7
summarises the high-speed serial interconnect main features.
Figure 3.10. Small-scale SpiNNaker machines. (a) A cased 48-node 864-core board.
(b) A 24-board 20,736-core machine.
Figure 3.11. A card frame holds 24 SpiNN-5 boards, power supplies and a backplane.
Figure 3.12. SpiNNaker1M: 10 cabinets and 3,600 SATA cables interconnecting them.
Figure reproduced with permission from Heathcote [94].
The backplane holds configuration data used to set the IP and MAC addresses of each SpiNN-5 board. It also contains power
supply data as well as information about fan speed control and temperature limits.
Additionally, the backplane provides access to temperature sensors and to a Liquid-
Crystal Display (LCD) located on the front of the card frame. The display is used to
provide information, such as operating temperature and power supply levels, to the
machine operator. Finally, the backplane carries a Controller Area Network (CAN)
bus, used by the BMP to communicate with each other. The SpiNN-5 boards are
connected to the backplane through an edge connector, located at the bottom right
corner in Figure 3.5(a).
To build larger machines, card frames are assembled together in 19" cabinets.
Each cabinet holds five card frames, for a total of 120 SpiNN-5 boards, containing
5,760 SpiNNaker chips/103,680 ARM cores. Figure 3.12 shows the 10 cabinets
and 3,600 SATA cables required to build SpiNNaker1M that contains 1,036,800
cores.
Long cables have several drawbacks:

Performance: Signal quality diminishes as cables get longer, requiring the use of
slower signalling speeds, increased error correction overhead or more complex
hardware.
Energy: Some energy is lost in cables; longer cables lose more signal energy requir-
ing higher drive strengths and/or buffering to maintain signal integrity.
Cost: Shorter cables are cheaper than long ones. Longer cables imply more cabling
in a given space making the task of cable installation and system maintenance
more difficult, increasing labour costs by as much as 5× [40].
Figure 3.13. Folding and interleaving a ring network to reduce maximum cable length.
Figures reproduced with permission from Heathcote [94].
Figure 3.14. Network folding to shorten interconnect. Figures reproduced with permis-
sion from Heathcote [94].
Such an approach does not scale up in the general case and requires potentially expensive bespoke
physical infrastructure. Alternatively, the need for long cables is often eliminated
by folding and interleaving units of the network [42]. This process is illustrated
for a 1D torus topology (a ring network) in Figure 3.13. A naïve arrangement of
units in this topology results in a long cable connecting the units at the ends of
the ring (Figure 3.13(a)). To eliminate these long connections, half of the units are
‘folded’ on top of the others (Figure 3.13(b)) and then this arrangement of units
is interleaved (Figure 3.13(c)). This ordering of units requires no long cables while
still observing the physical constraint that units must be laid out in a line.
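The reordering can be written as a one-line index mapping; a minimal sketch for the ring of Figure 3.13, using 0-based positions and assuming an even number of units:

    /* Maps logical ring position p (0..n-1) to its physical slot after
     * folding and interleaving: the first half of the ring lands on the
     * even slots, the folded second half on the odd slots in reverse
     * order.  For n = 6 this yields the physical order 1,6,2,5,3,4 of
     * Figure 3.13(c), and no connection - including the wrap-around -
     * spans more than two slots. */
    static unsigned fold_interleave(unsigned p, unsigned n)
    {
        return (p < n / 2) ? 2u * p : 2u * (n - 1u - p) + 1u;
    }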
The folding and interleaving process may be extended to N-dimensional torus
topologies by folding each axis in turn, as illustrated in Figure 3.14. Folding once
along each axis eliminates long connections crossing from left to right, top to bot-
tom and from the bottom-left corner to the top-right corner. Since all axes are
orthogonal in non-hexagonal topologies, the folding process only moves units along
the axis being folded. Unfortunately, this type of folding does not work for hexag-
onal torus topologies due to the non-orthogonality of the three axes. To exploit the
folding technique used by non-hexagonal topologies, the units in a hexagonal torus
topology must be mapped into a space with orthogonal coordinates. The choice of
transformation to an orthogonal coordinate system can have an impact on how
physically far apart logically neighbouring units are in the final arrangement.
Figure 3.15 illustrates the two transformations proposed by Heathcote [94]
to map hexagonal arrangements of units into a 2D orthogonal coordinate space.
The first transformation, shearing (Figure 3.15(b)), is general purpose but intro-
duces some distortion. The second transformation, slicing (Figure 3.15(c)), is less
general but can introduce less distortion than shearing and therefore may lead to
shorter cable lengths.
Once a regular 2D grid of units has been formed, this may be folded in the con-
ventional way as illustrated in Figure 3.14. Any shear-transformed network may
be folded this way since its wrap-around connections always follow this pattern.
Slice-transformed networks may only be folded like this when their aspect ratio is 1:2, in which case the pattern of wrap-around links is the same as in a shear-transformed network. When 'square' networks, that is, those with a 1:1 aspect ratio, are sliced, the
network must be folded twice along the Y axis to eliminate the criss-crossing wrap-
around links. It is not possible to eliminate wrap-around links from sliced networks
with other aspect ratios by folding. After folding, the units are interleaved, yielding
a 2D arrangement of units in which no connection spans the width or height of
the system. The maximum connection distance is constant for any network thus
allowing the topology to scale up.
As indicated earlier, the hexagonal torus topology also applies to SpiNNaker
when the boards are considered as nodes. The folded and interleaved arrangement
of units produced by these techniques may be translated into physical arrangements
of SpiNNaker boards in a machine room. Figure 3.16 illustrates how the SpiNN-5
boards that make up SpiNNaker1M can be folded and interleaved to keep cable
length short.1
Figure 3.16. SpiNNaker1M: Long interconnect wires are avoided by folding and interleav-
ing the board array in both dimensions.
Figure 3.18. SpiNNer guides cable installation. Figures reproduced with permission
from Heathcote [94].
Status reports from the spiNNlink interconnect are then used to verify the correct installation of each cable in real time, ensuring that mistakes are highlighted and fixed immediately.
The ‘rule of (three-)thumbs’ proposed by Mazaris [156] was used in SpiNNaker1M. This rule suggests that a minimum of 5 cm of cable slack should
be provided. As SpiNNaker uses off-the-shelf SATA cables, only standard lengths
were available. For any given span, the shortest length of cable providing at least
5 cm of slack was used. Table 3.8 lists the cable lengths used and the total number
of cables of each length. The table shows a total cable length of over 1.5 km.
SpiNNaker1M was designed to simulate very large spiking neural networks in biological real time. Most likely, though, not every simulation run
on SpiNNaker1M will consist of a billion spiking neurons. To improve system
throughput and energy efficiency, we developed a centralised software system
which partitions large SpiNNaker machines into smaller ones on demand. This
system is used to run many simulations in parallel on the same machine. The
SpiNNaker machine partitioning and allocation server (Spalloc) [95] enables users
to request virtual SpiNNaker machines of various shapes and sizes. These requests
are queued and allocated in turn, partitioning SpiNNaker1M into the requested
shape. Figure 3.19 shows a SpiNNaker1M diagram with various jobs allocated
through Spalloc. Jobs can be as small as 1 SpiNN-5 board and as large as the whole
machine, that is, 1,200 boards.
When faced with the numerous research problems of optimal packing and
scheduling of allocations, this implementation uses the ‘simplest mechanism that
could possibly work’. This means that a job may end up with a larger machine than
requested, to accommodate a selection of shape and size. Spalloc communicates
with the BMPs of the allocated boards to disable the FPGAs in order to isolate the
virtual machine from neighbouring boards that are not part of the machine. When
a machine is allocated to a job it is powered on but not booted, that is up to the
requester. This allows users complete control of the machine. The requester must
keep the job alive by contacting the Spalloc server periodically and must release the
allocated machine when finished.
Figure 3.20 shows technician Dave Clark checking SpiNNaker1M, the ‘million-core
SpiNNaker machine’. The machine, located at the University of Manchester, was completed in late 2018.
Chapter 4
Stacks of Software Stacks
Alongside the job of designing and producing the hardware, there is the equally
challenging task of constructing software that allows users to exploit the capabilities
of the machine. Using a large parallel computing system such as SpiNNaker often
requires expert knowledge to be able to create and debug code that is designed to be
executed in a distributed and parallel fashion. More recently, software stacks have
been created which try to abstract this process away from the end user by the use of
explicit interfaces or by defining the problem in a form which is easier to map into a
distributed system. In this chapter, we describe the SpiNNaker software stacks upon
which most of the applications described in subsequent chapters are supported. It
is built by merging slightly modified versions of the work presented by Rowley et al.
[213], covering the software tools that allow the running of generic applications –
the SpiNNaker Tools (SpiNNTools); and Rhodes et al. [207], covering the tools
that specifically support the simulation of Spiking Neural Networks (SNNs) – the
SpiNNaker backend for PyNN (sPyNNaker).
4.1 Introduction
A growing number of users are now using SpiNNaker for a wide range of tasks,
including Computational Neuroscience [3] and Neurorobotics [1, 48, 209] for
which the platform was originally designed, but also machine learning [240], and
general parallel computation tasks, such as Markov Chain Monte Carlo inference
computations [161]. The provision of a software stack for this platform aims to
provide a base for the various applications, making it easier for them to exploit the
full potential of the platform. Additionally, users will gain the advantage of any
improvement in the underlying tools without requiring changes to their software
(or at most only minor interface changes should they be required). A basic overview
of this approach is seen in Figure 4.1.
The software stack allows the user to describe their computational requirements
in the form of a graph, where the vertices represent the units of computation,
and the edges represent the communication of data between the computational
units. This graph is described in a high-level language and the software then maps
this directly onto an available SpiNNaker machine. The SpiNNaker platform as
a whole is intended to improve the overall execution time of the computational
problems mapped onto it, and so the time taken to execute this mapping is critical;
if it takes too long, it will dwarf the computational execution time of the problem
itself.
The problem of writing code to run on the cores of the SpiNNaker machine
is discussed in more detail by Brown et al. [25], along with the types of applica-
tions which might be suitable to execute on the platform. The software assumes
that the application has already been designed to run in parallel on the platform;
the SpiNNTools software then works to map that parallel application onto the
machine, execute it and extract any results, along with any relevant data about
the machine.
4.2 Making Use of the SpiNNaker Architecture
The nature of the SpiNNaker chip has important implications for the software
running on the system. This section is a short recap of Chapters 2 and 3. Firstly, it
must be possible to break up the computation of the application into units small
enough that the code for each part fits on a single core. The SDRAM is shared
between the cores on a single chip, and this property can be used by the application
to allow cores to operate on the same data within the same chip. A small amount of
data can be shared with cores operating on other chips as well through communica-
tion via the SpiNNaker router. The SpiNNaker boards can be connected together
to form an even larger grid of chips, so appropriately parallelisable software could
potentially be scaled to run on up to 1 million cores.
The SpiNNaker router is initially set up to handle the routeing of system-level
data. The data to be sent by applications make use of the multicast packet type,
meaning that a packet sent from a single source can be routed to multiple destina-
tions simultaneously. To make multicast routeing work, the routeing tables of the
router must be set up; this process is described in Section 4.7.
Each chip has an Ethernet controller, although in practice only one chip is
connected to the Ethernet connector on each board. The chip with the Ethernet
connected to it is then called the Ethernet chip, and this is used to communicate
with the outside world, allowing, for example, the loading of data and applica-
tions. Communications with other chips on a board from outside of the machine
must therefore go via the Ethernet chip; system-level packets are used to effect
this communication between chips. In practice, the Ethernet connector of every
board in a SpiNNaker machine is connected and configured, although this is not
a requirement.
SpiNNaker machines are designed to be fault tolerant, so it is possible to have a
functional machine with some missing parts. For example, it is normal that some
of the SpiNNaker chips have 17 instead of 18 working cores, and sometimes even fewer than this, as operational cores are tested more thoroughly in service than at manufacture. Additionally, machines can have whole chips that have been
found to have faults, as well as some links broken between the chips and boards.
The machine includes memory onto which faults can be stored statically in a black-
list, so that during the boot process these parts of the machine can be hidden to
avoid using them.
SpiNNaker machines can be connected to external devices through either
a SpiNNaker link connector, of which there is one on every 48-node board, or a
spiNNlink SATA connector, of which there are 9 on each board; of those, 6 are used
to connect to other boards. This, along with the low power requirements, makes
the machine particularly useful for robotics applications, since the board can be
connected directly to the robot without any need of other equipment. The only
requirement is that the external devices must be configured to talk to the machine
using SpiNNaker packets. The links can be configured to connect directly to a sub-
set of the SpiNNaker chips on the board, and entries in the routeing tables of those
chips can be used to send packets to any connected device and to route packets
received from the devices across the SpiNNaker network.
The ARM968 cores can execute instructions from the ITCM using the ARM or
Thumb instruction sets; generally, this code is generated from compiled C code
using either the GNU’s Not Unix (GNU) gcc compiler1 or the ARM armcc com-
piler.2 To this end, a library known as the SpiNNaker Application Runtime Kernel
(SARK) has been written which allows access to the features of the SpiNNaker core
and chip [25]. Additionally, software called the SpiNNaker Control And Monitor
Program (SCAMP) has also been written which allows one of the cores to operate
as a monitor processor through which the chip can be controlled [25], allowing,
for example, the loading of compiled applications onto the other cores of the chip,
the reading and writing of the SDRAM, the loading of the SpiNNaker routeing
tables and, of course, controlling the operation of the chip’s blinkenlight. SCAMP
software can also map out parts of the machine known to be faulty when it is first
loaded. Thus, when a description of the machine is obtained via SCAMP, only
working parts should be present. The list of faults is stored on the boards them-
selves and can be updated dynamically if other parts are subsequently found to be
faulty.
The SCAMP code can be loaded onto one core on every chip of the machine,
and these cores then coordinate with each other allowing communication to any
chip via any Ethernet connector on the machine (see below). This communication
makes use of the SpiNNaker Datagram Protocol (SDP) [64], which is encapsulated
into User Datagram Protocol (UDP) packets when going off machine to external
devices. Communication out of the machine from any core is achieved by using
Internet Protocol (IP) Tags. The SCAMP monitor processor on each Ethernet chip
maintains a list of up to 8 IP Tags, which maps between values in the tag field of the
1. https://developer.arm.com/open-source/gnu-toolchain/gnu-rm/downloads
2. https://developer.arm.com/products/software-development-tools/compilers/legacy-compiler-releases
SDP packets and an external IP address and port. When a packet is received that is
destined to go out via the Ethernet (identified in the SDP packet header), this table
is consulted and a UDP packet is formed containing the packet; this is sent
to the IP address and port given in the table. The table can also contain Reverse IP
Tags, where a UDP packet received from an external source is mapped from the
UDP port in the packet to a specific chip and core on the machine, where the data
of the packet are extracted and put into an SDP packet before being forwarded to
the given core.
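Schematically, the tag table on each Ethernet chip is a small array indexed by the tag field of the SDP packet; the struct below is an illustration, not SCAMP's actual data structure:

    #include <stdint.h>

    /* Illustrative IP Tag entry: an outbound SDP packet with tag t is
     * wrapped in UDP and sent to iptag_table[t]; Reverse IP Tags map the
     * other way, from a UDP port to a chip and core on the machine. */
    typedef struct {
        uint32_t ip;      /* destination IPv4 address       */
        uint16_t port;    /* destination UDP port           */
        uint16_t flags;   /* e.g. in-use, reverse (assumed) */
    } iptag_t;

    static iptag_t iptag_table[8];   /* 'up to 8 IP Tags' per Ethernet chip */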
SARK provides a hardware abstraction layer, simplifying interaction with
the DMA, network interface and communications controllers. SpiNNaker1 API
(SpiN1API) provides an event-based operating system, as shown in Figure 4.16,
with three processing threads per core: one for task queuing, one for task dis-
patch and one to service Fast Interrupt Request (FIQ). SpiN1API also pro-
vides the mechanism to link software callbacks to hardware events and enables
triggering of actions such as sending a packet to another core and initiating a
DMA. Callbacks are registered with different priority levels ranging from −1 to
2 depending on their desired function, with lower numbers scheduled prefer-
entially. Callback tasks of priority 1 and 2 can be queued (in queues of maxi-
mum length 15), with new events added to the back of the queue. Callbacks of
priority −1 and 0 are not queued, but instead pre-empt tasks assigned higher
priority level numbers. Operation of this system follows the flow detailed in
Figure 4.16(a).
The scheduler thread places callbacks in queues for priority levels 1 and above,
and the dispatcher picks these callbacks and executes them based on priority. When
the dispatcher is executing a callback of priority 1 or higher, and a callback of pri-
ority 0 is scheduled, this task pre-empts that currently being executed causing it to
be suspended until the higher priority callback has completed. Callbacks of priority
−1 use the FIQ thread to interact with the scheduler and dispatcher, enabling fast
response and pre-emption of priority 0 and above tasks. Pointers are stored allowing
fast access to the callback code, and the processor switches to FIQ banked registers
to avoid the need for stacking [230], optimising the response time of priority −1
callbacks. However, this optimised performance limits the application to registering
only a single −1 priority event and callback.
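The scheduling rules can be summarised with a behavioural sketch. The following
Python fragment is illustrative only – the real kernel implements these rules with
interrupts, threads and banked registers rather than plain function calls:

from collections import deque

queues = {1: deque(), 2: deque()}   # bounded at 15 entries on the machine

def schedule(priority, callback):
    if priority >= 1:
        queues[priority].append(callback)   # queued for the dispatcher
    else:
        callback()   # -1/0: never queued, pre-empts the running task

def dispatch():
    # The dispatcher always drains the lower-numbered queue first.
    while queues[1] or queues[2]:
        (queues[1] or queues[2]).popleft()()

schedule(2, lambda: print("timer work"))
schedule(1, lambda: print("packet follow-up"))
schedule(0, lambda: print("pre-empting work"))   # runs immediately
dispatch()   # then: packet follow-up, then timer work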
The process of booting the machine is shown in Figure 4.2. When the machine
is first powered up, the cores on every chip start executing the boot ROM image.
This is stored within the chip and cannot be altered. After testing the ITCM and
DTCM of the core, the image then proceeds to determine if the core executing it
is to be the monitor, through reading a mutex in the chip’s System Controller; the
first core to read this locks the mutex and so becomes the monitor. The processor
selected as monitor now performs further tests on the shared parts of the chip.

Figure 4.3. Booting SCAMP on the machine. (a) The SCAMP image is encoded in
SpiNNaker boot messages and sent to the machine, where it is loaded on to the selected
monitor processor of the Ethernet chip. (b) The SCAMP image is sent to neighbouring
chips, which might include chips on adjacent boards, using NN packets.
Once the tests are complete, the Ethernet chips are set up to listen for boot
messages being transmitted using UDP on port 54321. As shown in Figure 4.3(a),
the host now sends the SCAMP image to one of these Ethernet chips; it is not
critical which of these is selected, as the SCAMP software is set up to work out
the dimensions of the machine and the coordinates once it has been loaded. The
boot messages consist of a start command, followed by a series of 256-byte data
blocks (with an appropriate header to indicate the order), followed by a comple-
tion command. If all the blocks are successfully received and assembled, the code
stored in the data blocks is copied to the ITCM of the monitor processor and
executed.
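The framing of the boot image can be sketched as follows. Only the port number,
the 256-byte block size and the overall start/data/completion sequence are taken
from the description above; the command encodings and header layout are
assumptions for illustration:

import socket
import struct

def send_boot_image(host: str, image: bytes, port: int = 54321) -> None:
    """Send a start command, ordered 256-byte blocks, then a completion
    command, each in its own UDP datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    blocks = [image[i:i + 256] for i in range(0, len(image), 256)]
    sock.sendto(struct.pack(">II", 1, len(blocks)), (host, port))  # start
    for seq, block in enumerate(blocks):
        header = struct.pack(">II", 3, seq)   # header carries block order
        sock.sendto(header + block.ljust(256, b"\0"), (host, port))
    sock.sendto(struct.pack(">II", 5, 0), (host, port))            # done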
The current version of the SCAMP application starts with an initialisation phase
where various parts of the hardware on the chip are set up for operation. The
application then enters a network initialisation (‘netinit’) phase, which proceeds
through the following steps:
1. Address Phase. During this phase, each SCAMP computes and sends out its
computed coordinates based on the coordinates it receives from its neigh-
bours; for example, if it receives [0, 0] from the ‘west’ link, it will assume
that its coordinates are [1, 0], and if it receives [0, 0] from the ‘north’ link, it
will assume its coordinates are [0, −1] (coordinates are allowed to be nega-
tive at this stage; a sketch of this flood fill is given after this list). This phase
continues until no new coordinates are received within a given time period.
2. Dimensions Phase. Each SCAMP sends its perceived dimensions of the
machine based on the dimensions received from its neighbours. This again
continues until no change of dimensions has occurred within a given time
period.
3. Blacklisting Phase. The blacklist is sent from the Ethernet chip of each board
to the other chips on the same board. This may result in the current monitor
core discovering it is blacklisted. This is noted and delegation is then set up.
4. Point-to-Point Table Phase. Each SCAMP sends its coordinates once again,
and these are forwarded on along with a hop count, so that every chip receives
them eventually. These are used to update the point-to-point tables based on
the direction in which the coordinates are received, along with the hop count
to allow the use of the shortest route.
5. Monitor Delegation Phase. If the current SCAMP core has been blacklisted,
it now delegates to another core that has not. This is done at this late stage
to avoid interfering with the rest of the setup process.
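The address phase amounts to a flood fill of coordinates outward from the first
booted chip. The following Python sketch illustrates the idea under simplifying
assumptions: only the four cardinal links are modelled (SpiNNaker chips also have
diagonal links), and announcements are delivered instantly rather than by NN
packets:

from collections import deque

OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # east, west, north, south

def address_phase(width, height, root=(0, 0)):
    """Flood-fill coordinates outward from the chip booted first."""
    coords = {root: (0, 0)}            # physical chip -> announced coords
    frontier = deque([root])
    while frontier:                    # "until no new coordinates arrive"
        chip = frontier.popleft()
        cx, cy = coords[chip]
        for dx, dy in OFFSETS:
            nbr = ((chip[0] + dx) % width, (chip[1] + dy) % height)
            if nbr not in coords:      # the first announcement received wins
                coords[nbr] = (cx + dx, cy + dy)   # may go negative
                frontier.append(nbr)
    return coords

print(address_phase(2, 2, root=(1, 1)))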
Note that delegation of a blacklisted monitor core will not happen until after the
‘netinit’ phase has completed. The monitor core tends to be selected from a subset
of the cores on the chip due to manufacturing properties; this means that a board
will not work with this system if the core selected as monitor from this subset is so
broken that it cannot perform the steps up to this point. A possible future change
would therefore be to perform the blacklisting phase earlier in the process.
Using SpiNNaker machines in the past required end users to load compiled applica-
tions and routeing tables manually onto the SpiNNaker machine through the use of
the low level ybug software included with the aforementioned libraries.3 Other soft-
ware was then designed to ease the development of application code for end users.
These consisted of: the aforementioned low-level libraries SARK and SpiN1API,
and the monitor core software SCAMP, a collection of C code which represented
models known in the neuroscience community and defined by the PyNN 0.6
language [44] and a collection of Python code which translates PyNN models onto
a SpiNNaker machine. These pieces of software were amalgamated into a software
package known as PACMAN48 [68] and supported the main end-user commu-
nity of computational neuroscientists for a number of years. These tools had a
number of limitations, which motivated the development of the SpiNNTools tool
chain described in the following sections.
Figure 4.4. The Python class hierarchy for SpiNNaker Machine representation. The
machine contains a list of chips, and each chip contains a router, an SDRAM and a list
of processor objects, each with their respective properties. A VirtualMachine can also be
made, which contains the same objects but can be identified as being virtual by the rest
of the tools.
A virtual chip can be added to the machine representation where this is necessary;
the tools also need to know where the connected real chip is, to make use of that
chip if needed.
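The hierarchy of Figure 4.4 can be sketched with a few Python classes. The names
and defaults below are illustrative rather than the real API, though the 128 MByte
of SDRAM and 1,024 multicast router entries match the hardware:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Processor:
    processor_id: int
    is_monitor: bool = False

@dataclass
class Router:
    n_available_multicast_entries: int = 1024   # hardware limit per chip

@dataclass
class SDRAM:
    size: int = 128 * 1024 * 1024               # 128 MByte shared per chip

@dataclass
class Chip:
    x: int
    y: int
    router: Router = field(default_factory=Router)
    sdram: SDRAM = field(default_factory=SDRAM)
    processors: List[Processor] = field(default_factory=list)

@dataclass
class Machine:
    chips: List[Chip] = field(default_factory=list)

class VirtualMachine(Machine):
    """Same structure, identifiable as virtual by the rest of the tools."""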
4.6.2 Graphs
A graph in SpiNNTools consists of vertices and directed edges between the vertices.
The vertex is considered to be a place where computation takes place, and as such,
each vertex has a SpiNNaker executable binary associated with it. An edge repre-
sents some communication that will take place from a source, or pre-vertex to a
target, or post-vertex. An additional concept is that of the outgoing edge partition;
this is a group, or partition, of edges that all start at the same pre-vertex, as shown
in Figure 4.5(b). This is useful to represent a multicast communication. Note that
not all edges that have the same pre-vertex have to be in the same outgoing edge
partition; there can be more than one outgoing edge partition for each source vertex.
This represents different message types, which might be multicast to different sets
of target vertices. Thus, each outgoing edge partition has an identifier, which can
be used to identify the type of message to be multicast using that partition.
Figure 4.5. Graphs in SpiNNTools. (a) A Machine Graph made up of two Machine Vertices
connected by a Machine Edge, indicating a flow of data from the first to the second.
(b) A Machine Vertex sends two different types of data to two subsets of destination
vertices using two different Outgoing Edge Partitions, identified by solid and dashed
lines respectively. (c) An Application Graph made up of two Application Vertices, each
of which contain two and four atoms, respectively, connected by an Application Edge,
indicating a flow of data from the first to the second. (d) A Machine Graph created from
the Application Graph in (c) by splitting the first Application Vertex into two Machine
Vertices which contain two atoms each. The second Application Vertex has not been
split. Machine Edges have been added so that the flow of data between the vertices is
still correct.
Figure 4.6. The relationship between the graph objects. An ApplicationGraph contains
ApplicationVertex objects and OutgoingEdgePartition objects, which contain Applica-
tionEdge objects in turn. A MachineGraph similarly contains MachineVertex objects and
OutgoingEdgePartition objects, which contain MachineEdge objects in turn. Applica-
tionEdge objects have pre- and post-vertex properties which are ApplicationVertex
objects, and similarly MachineEdge objects have pre- and post-vertex properties
which are MachineVertex objects. An ApplicationVertex can create a number of
MachineVertex objects for a subset of the atoms contained therein, and an
ApplicationEdge can create a number of MachineEdge objects for a subset of atoms
in the pre- and post-vertices.
There are two types of graph represented as Python classes in the tools (a dia-
gram can be seen in Figure 4.6). A Machine Graph, an example of which is shown
in Figure 4.5(a), is one in which each vertex (known as a Machine Vertex) is guaran-
teed to be able to execute on a single SpiNNaker processor. A Machine Edge there-
fore represents communication between cores. In contrast, an Application Graph,
an example of which is shown in Figure 4.5(c), is one where each vertex (known
as an Application Vertex) contains atoms, where each atom represents an atomic
unit of computation into which the application can be split; it may be possible
to run multiple atoms of an Application Vertex on each core. Each edge (known
as an Application Edge) represents communication of data between the groups of
computational units; if one or more of the atoms in an Application Vertex com-
municates with one or more atoms in another Application Vertex, there must be
an Application Edge between those Application Vertices. It is not guaranteed that
all the atoms on an Application Vertex fit on a single core, so the instruction code
for Application Vertices should know how to process a subset of the atoms, and
how to handle a received message and direct it to the appropriate atom or atoms.
The graph classes support adding and discovering vertices, edges and outgoing edge
partitions.
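A minimal sketch of these graph structures is given below; the class and method
names are assumptions for illustration, not the SpiNNTools API itself:

class MachineVertex:
    def __init__(self, label: str):
        self.label = label

class MachineEdge:
    def __init__(self, pre_vertex, post_vertex):
        self.pre_vertex = pre_vertex
        self.post_vertex = post_vertex

class MachineGraph:
    def __init__(self):
        self.vertices = []
        # (pre_vertex, identifier) -> list of edges in that partition
        self.outgoing_edge_partitions = {}

    def add_vertex(self, vertex):
        self.vertices.append(vertex)

    def add_edge(self, edge, partition_id="default"):
        key = (edge.pre_vertex, partition_id)
        self.outgoing_edge_partitions.setdefault(key, []).append(edge)

# One source vertex multicasting two message types to different targets:
src, tgt_a, tgt_b = MachineVertex("src"), MachineVertex("a"), MachineVertex("b")
graph = MachineGraph()
for v in (src, tgt_a, tgt_b):
    graph.add_vertex(v)
graph.add_edge(MachineEdge(src, tgt_a), partition_id="spikes")
graph.add_edge(MachineEdge(src, tgt_b), partition_id="control")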
As the vertices represent the application code that will run on a core, they have
methods to communicate their resource requirements, in terms of the amount of
DTCM and SDRAM required by the application, the number of Central Process-
ing Unit (CPU) cycles used by the instructions of the application code to maintain
any time constraints, and any IP Tags or Reverse IP Tags required by the applica-
tion. The Application Vertex provides a method that returns the resources required
by a continuous range or slice of the atoms in the vertex; this is specific to the exact
range of atoms, allowing different atoms of the vertex to require different resources.
The Application Vertex additionally defines the maximum number of atoms that
the application code can execute on each core of the machine (which might be
unlimited) and also the total number of atoms that the vertex represents.
These allow the Application Vertex to be broken down into one or more Machine
Vertices as seen in Figure 4.5(d); to this end, the Application Vertex class has a
method for creating Machine Vertex objects for a continuous range of atoms. A
Machine Vertex can return the resources it requires in their entirety.
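The splitting of atoms into slices can be illustrated with a short sketch; the function
name and the 256-atom limit below are chosen for illustration only:

def split_into_slices(n_atoms: int, max_atoms_per_core: int):
    """Yield inclusive (lo, hi) atom ranges, one per Machine Vertex."""
    for lo in range(0, n_atoms, max_atoms_per_core):
        yield lo, min(lo + max_atoms_per_core, n_atoms) - 1

# A 500-atom Application Vertex at 256 atoms per core yields two slices:
print(list(split_into_slices(500, 256)))   # -> [(0, 255), (256, 499)]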
The graphs additionally support the concept of a Virtual Vertex. This is a vertex
that represents a device connected to a SpiNNaker machine. The Virtual Vertex
indicates which chip the device is physically connected to, allowing the tool chain to
work with this to include the device in the network. As with the other vertices, there
is a version of the Virtual Vertex for each of the machine and application graphs.
The aim of the SpiNNTools tool chain is to control the execution of a program
described as a graph on the SpiNNaker machine. The software is executed in several
steps as shown in Figure 4.7 and detailed below.
Figure 4.7. The execution work flow of SpiNNTools in use within an application. Once
control has returned to the application, the flow can be resumed at different stages
depending on what has changed since the last execution.
4.7.1 Setup
The first step in using SpiNNTools is to initialise them. At this point, the user can
specify appropriate configuration parameters, such as the time step of the simula-
tion, and the location where binary files can be located on the host machine. The
tool chain then sets up the initially empty graphs and reads in configuration files for
further options, such as the SpiNNaker machine to be used. Options are separated
out in this way to distinguish script-level parameters, which might apply no matter
where the script is run (like the timestep of the simulation), from user-level
parameters, which will differ per user but are likely to be common across multiple
scripts for that user (like the SpiNNaker machine to be used).
Machine Discovery
The first phase of execution is the discovery of the machine to be executed on.
If the user has configured the tool chain to run on a single physical machine, this
machine is contacted, and if necessary booted. Communications with the machine
then take place to discover the chips, cores and links available. This builds up a
Python machine representation to be used in the rest of the tool chain.
If a machine is to be allocated, SpiNNTools must first work out how big a
machine to request, by working out how many chips the user-specified graph
requires. If a machine graph has been provided, this can be used directly, since
the number of cores is exactly the number of vertices in the graph. The resources
must still be queried, as the SDRAM requirements of the vertices might mean that
not all of the cores on each chip can be used. For example, a graph consisting of
10 machine vertices, each requiring 20 MByte of SDRAM and thus 200 MByte of
SDRAM overall, will not fit on a single chip in spite of there being enough cores,
since each chip provides only 128 MByte of shared SDRAM.
If an application graph is provided, this must first be converted into a machine
graph to determine the size of the machine. This is done by executing some of the
algorithms in the mapping phase (see below).
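The chip-count estimate can be sketched as a greedy first-fit calculation, assuming
128 MByte of shared SDRAM per chip and, purely for illustration, 16 usable
application cores:

SDRAM_PER_CHIP = 128 * 1024 * 1024   # shared SDRAM per chip
CORES_PER_CHIP = 16                  # usable application cores (illustrative)

def chips_needed(vertex_sdram_requirements):
    """Greedy first-fit estimate of the number of chips required."""
    chips, cores_free, sdram_free = 1, CORES_PER_CHIP, SDRAM_PER_CHIP
    for need in vertex_sdram_requirements:   # one machine vertex per core
        if cores_free == 0 or need > sdram_free:
            chips += 1
            cores_free, sdram_free = CORES_PER_CHIP, SDRAM_PER_CHIP
        cores_free -= 1
        sdram_free -= need
    return chips

# Ten 20 MByte vertices exceed one chip's SDRAM despite the spare cores:
print(chips_needed([20 * 1024 * 1024] * 10))   # -> 2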
Mapping
The mapping phase takes the graph and maps it onto the discovered machine. This
means that the vertices of the graph are assigned to cores on the machine, and
edges of the graph are converted into communication paths through the machine.
Additionally, other resources required by the vertices are mapped onto machine
resources to be used within the simulation.
If the graph is an application graph, it must first be converted to a machine
graph. This may have been done during the machine discovery phase as described
previously. To allow this, the algorithm(s) used in this ‘graph partitioning’ process
are kept separate from the rest of the mapping algorithms.
Once a machine graph is available, this is mapped to the machine through a
series of phases. This must generate several data structures to be used later in the
process, including the placement of each vertex on a core of the machine, the
routeing keys and IP tags allocated to each vertex, and the routeing tables to be
loaded.
Note that once the machine has been discovered, mapping can be performed entirely
separately from the machine using the Python machine data structures created.
However, mapping could also make use of the machine itself by executing specially
designed parallel mapping executables on the machine to speed up the execution.
The design of these executables is left as future work.
Mapping information can be stored in a database by the system. This allows for
external applications which interact with the running simulation to decode any live
data received. As shown in Figure 4.7, the applications can register to be notified
when the database is ready for reading and can then notify SpiNNTools when they
have completed any setup and are ready for the simulation to start, and when the
simulation has finished.
Data Generation
The data generation phase creates a block of data to be loaded into the SDRAM
for each vertex. This can be used to pass parameters from the Python-described
vertices to the application code to be executed on the machine. This can make use
of the mapping information above as appropriate; for example, the routeing keys and
IP tags allocated to the vertex can be passed to ensure that the correct keys and tags
are used in transmission. The graph itself could also be used to determine which
routeing keys are to be received by the vertex, and so set up appropriate actions to
take upon receipt of these keys.
Some support for data generation and reading is provided by the tool chain both
at the Python level, where data can be generated in ‘regions’, and at the C code level,
where library functions are provided to access these regions. Other more basic data
generation is also supported which simply writes to the SDRAM directly.
Data generation can also create a statistical description of the data to be loaded
and then expand these data through the execution of a binary on the machine.
This allows less data to be created at this point potentially speeding up the data
generation and loading processes, and also allows the expansion itself to occur in
parallel on the machine.
Loading
The loading phase takes all the mapping information and data generated, along
with the application binaries associated with each machine vertex, and prepares the
physical machine for execution. This includes loading the routeing tables generated
on to each chip of the machine, loading the application data into the SDRAM of
the machine, loading the IP tags and reverse IP tags into the Ethernet chips, and
loading the application code to be executed.
Running
The running phase starts off the actual execution of the simulation and, if necessary,
monitors the execution until complete. Before execution, the tool chain waits for the
completion of the setup of any external applications that have registered to read the
mapping database. These tools are then notified that the application is about to
start, and when it is finished.
Once a run is complete, application recorded data and provenance data are
extracted from the machine. The provenance data include router statistics, such as
the number of dropped multicast packets, and core-level execution statistics, such
as whether each core kept up with its timing requirements.
The log files from each core can also optionally be extracted. During provenance
extraction, each vertex can analyse the data and report any anomalies. If the log files
have been extracted, these can also be analysed and any ‘error’ or ‘warning’ lines can
then be printed.
If a run is detected to have failed in some way, the tool chain will attempt to
extract information about this failure. A failure includes one of the cores going
into an error state or, if the tool chain has been run for a specific duration, the
cores not being in a completion state after this time has passed. Log files will be
automatically extracted here and analysed as previously discussed. Any cores that
are still alive will also be asked to stop and extract any provenance data so that this
can also be analysed in an attempt to diagnose the cause of the error.
The run may be split into several sub-runs to allow for the limited SDRAM
on the machine, as shown in Figure 4.8. After each run cycle, any recorded data
are extracted from the SDRAM and stored on the host machine, after which the
recording space is flushed, and the run cycle restarted. This requires additional
support within the binary of the vertex, to allow a message to be sent to the core to
increase the run duration, and to reset the recording state. This support is provided
in the form of C code library functions, with callbacks to allow the user to perform
additional tasks before resuming execution at each phase. Additionally, the tool
chain can be set up to extract and clear the core logs after each run cycle to ensure
that the logs do not overflow.

Figure 4.8. Running vertices with recorded data. The SDRAM remaining on each chip
after it has been allocated for other things is divided up between the vertices on that
chip. Each vertex is then checked for the number of time steps it can be run for before
filling up the SDRAM. The minimum number of time steps is taken over all chips and the
total run time is split into smaller chunks, between which the recorded data are extracted
and the buffer is cleared.
The length of each run cycle can be determined automatically by SpiNNTools.
This is done by working out the SDRAM available on each chip after data genera-
tion has taken place. This free space is then divided between the vertices on the chip
depending on how much space they require to record per time step of simulation.
To ensure that there is some space for recording, the user can specify the minimum
number of time steps to be recorded and space for this is allocated statically during
the mapping phase (noting that if this space cannot be allocated, this phase will fail
with an error).
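This calculation can be sketched as follows; the function names and the data layout
passed in are invented for illustration:

def timesteps_per_chunk(chips):
    """chips: list of (sdram_free, [bytes_recorded_per_timestep, ...]),
    one entry per chip; free space is divided between a chip's vertices."""
    limits = []
    for sdram_free, recording_rates in chips:
        share = sdram_free // len(recording_rates)
        limits.append(min(share // rate for rate in recording_rates))
    return min(limits)

def run_in_chunks(total_timesteps, chips):
    """Yield the length of each run cycle; recorded data are extracted
    and the buffers flushed between cycles."""
    chunk = timesteps_per_chunk(chips)
    done = 0
    while done < total_timesteps:
        step = min(chunk, total_timesteps - done)
        done += step
        yield step

print(list(run_in_chunks(1000, [(1_000_000, [400, 800]), (2_000_000, [1000])])))
# -> [625, 375]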
At the end of each run phase, external applications are notified that the simu-
lation has been paused and are then notified again when the simulation resumes.
This allows them to keep in synchronisation with the rest of the application.
Some changes to a vertex can simply overwrite the existing data, such as a change
in neuron state update parameters in
a neural network. Any increase in the size of the data, such as an increase in the
number of synapses in a neural network, would likely require a remapping of the
graph on to the machine as the SDRAM is likely to be packed in such a way as to
not allow the expansion of the data for a single core; it is left to the vertex to make
this decision however.
Any change to the graph, such as the addition of a vertex or edge, is likely to
require that the mapping phase take place again. This may even result in a new
machine being required should the size of the graph increase to this degree. This
will mean that all the other phases will also have to be executed again.
4.7.6 Closing
Once the user has finished simulating and extracted any data, they can tell the tool
chain that they are finished with the machine by closing it. At this point, the tool
chain resets and releases any machines that have been reserved, and so recorded
data will no longer be available. If the tool chain was told to run the network for
an indeterminate length, this would also result in the extraction and evaluation of
any provenance data at this stage.
Figure 4.9. Algorithms being run by the algorithm execution engine. The executor is
provided with a list of algorithms to run, a set of input items and a set of output items
to produce. It then produces a workflow for the algorithms, accounting for the
inputs they require and the outputs they produce.
The default placement and routeing algorithms are those described
by Heathcote [94]. The tool chain also includes algorithms for routeing table
compression, which are discussed by Mundy et al. [174]. Many of the other algo-
rithms are currently simplistic in nature; these can be replaced in the future should
other algorithms be found to perform more efficiently and/or effectively.
Figure 4.10. Data buffering and extraction. Top: The buffer manager is used to read back
recorded data during execution; when the buffer contains some data, the buffer manager
is notified and attempts to read the data, notifying the data source once this has been
done to allow the space to be reused. Middle: Data reading done using SCAMP; each
read of up to 256 bytes is further broken down into a number of request and read cycles
on the machine itself, where the packets used contain only 24 bits of data each. Bottom:
Data reading done using multicast messages; the initial request is all that is required,
after which the data are streamed using packets containing 64 bits of data. The machine
is set up so that these packets are guaranteed to arrive, so no confirmation is required.
This request-and-response process results in speeds of around 8 Mb/s when reading
from the Ethernet chip and around
2 Mb/s when reading from other chips.
To speed up the extraction of data, the tool chain includes the ability to cir-
cumvent this process, an overview of which is shown in Figure 4.10 (Bottom). To
facilitate this, firstly the machine is configured so that packets can be sent with a
guarantee that none of them are ever dropped; this can be done in this scenario
because exactly one path through the machine will be used by each read, so dead-
locks cannot occur. Next, one of the cores on each chip is loaded with an application
that can read from SDRAM and stream multicast messages to another application
loaded onto a core on the Ethernet chip, which then forms these into SDP mes-
sages to be streamed to the host along with a sequence number in each SDP packet.
The host then gathers the SDP packets and notes which sequences are missing.
The missing sequences are then requested again from the machine; this is repeated
until all sequences have been received. This has numerous advantages over the SDP
request-and-response mechanism: the SDP is only formed at the Ethernet chip, and
thus, the headers do not get transmitted across the SpiNNaker fabric; and the host
only sends a single request for data and then a single request for each group
of missing sequences and thus does not have to wait for each chunk of 256 bytes
between sending requests. This results in speeds of up to 40 Mb/s when reading
from any chip on the machine; there is no penalty for reading from a non-Ethernet
chip.
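The host side of this protocol can be sketched as a gather-and-re-request loop; the
function names and the stand-in transport below are illustrative only:

import random

def stream_read(request, n_sequences):
    """request(missing) returns the (sequence, payload) pairs that actually
    arrived; whatever is still missing is requested again until none remain."""
    received = {}
    while len(received) < n_sequences:
        missing = [s for s in range(n_sequences) if s not in received]
        for seq, payload in request(missing):
            received[seq] = payload
    return b"".join(received[s] for s in range(n_sequences))

def lossy_transport(missing):
    # Stand-in for the machine: drops roughly 10% of requested packets.
    return [(s, bytes([s % 256])) for s in missing if random.random() > 0.1]

data = stream_read(lossy_transport, 100)
assert len(data) == 100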
Once this protocol was implemented, we discovered that the Python code had
trouble keeping up with the speed at which the data were received from the
machine. We therefore implemented a version of the data reception in C++ and
Java that could interface with the Python code; the Java version is the version used
in production following comparative testing and assessment of the integration qual-
ity. This then allows the use of the Ethernet connection on multiple boards simul-
taneously, allowing the data extraction speed to scale with the number of boards
required for the simulation, up to the bandwidth of the network connected to the
host machine.
Figure 4.11. Live interaction with vertices. Top: To indicate that live output is required,
an edge is added from the vertex which is the source of the data to the Live Packet
Gatherer vertex in the graph. To indicate that the live input is required, an edge is added
from the Reverse IP Tag Multicast Source vertex to the target of the data in the graph.
Bottom: The effect of adding the edges to the graph is that multicast messages will be
sent from the core (or cores) of the source vertex to the core running the Live Packet
Gatherer, which will then wrap the messages in EIEIO packets and forward them to a
listening external application; and EIEIO packets received from an external application
will be decoded by the Reverse IP Tag Multicast Source core and dispatched as multicast
messages to the target core (or cores).
A re-injection core on each chip is set up to capture the packets that have been
dropped. These are stored until a time at which
the router is no longer blocked and so can safely send the packet onwards. This
helps in those applications where the reliable transmission of packets is critical to
their operation.
There is only one register within the SpiNNaker hardware to hold a dropped
packet. If a second packet is dropped, this packet will be completely unrecoverable;
an additional flag is set in this scenario so the re-injection core can detect this and
count such occurrences. This count is reported to the user at the end of the exe-
cution so that they know that something may not be correct in their simulation
results.
The visualiser shows the system traffic status by gathering and displaying data
from the monitoring and profiling counters on the SpiNNaker chips in the system.
The visualiser can also send commands to the monitor processor via the Ethernet
connection to control and interact with the system.
Conway’s Game of Life [71] consists of a collection of cells which are either alive
or dead based on the state of their neighbouring cells. A diagram of an exam-
ple Machine Graph of this problem is shown in Figure 4.12. Each vertex of the
graph is a cell in the game; given the state of the surrounding cells, a cell can
compute whether it is dead or alive in each step and then send that state to its
neighbours. It similarly receives the states of its neighbours as they are transmitted
and again uses these to update its own state. The edges of the
graph are thus between adjacent cells in a grid, where each vertex is connected
bidirectionally to its eight surrounding neighbours. The game proceeds in syn-
chronous phases, where the state of cells in a given phase are all considered at the
same time.
Graphs of this form are highly scalable on the SpiNNaker system, since the com-
putation to be performed at each node is fixed, and the communication forms a
regular pattern which does not increase as the size of the board grows. Thus once
Figure 4.12. Conway’s Game of Life on a 5 × 5 grid as a Machine Graph. Every Machine
Vertex is connected to each of its 8 neighbours bi-directionally; this requires two Machine
Edges for each bi-directional connection. The initial state of each Vertex is either alive
(black) or dead (white).
working, it is likely that any size of game can be built, up to the size of the available
machine. This type of graph would also likely be suited to finite element analysis
[17] problems, provided that the data to be transmitted can be broken down into
SpiNNaker packets. This problem thus works well as an archetype.
It will be assumed that we have built the application code which will update
the cell based on the state of the surrounding cells. This will update the state once
per time step of the simulation based on the received state from the surrounding
cells and then send its own new state out using the given key. It can also record
its state at each time step in the simulation. The set-up of this application is as
follows:
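In outline – using plain Python data structures in place of the real SpiNNTools
classes (compare the sketch in Section 4.6.2), so the wiring is visible at a glance –
the construction of the grid might look like this:

WIDTH, HEIGHT = 5, 5
cells = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]

# Two directed edges per neighbouring pair give the bidirectional
# connectivity of Figure 4.12; the grid wraps around at its edges.
edges = set()
for x, y in cells:
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if (dx, dy) != (0, 0):
                edges.add(((x, y), ((x + dx) % WIDTH, (y + dy) % HEIGHT)))

assert len(edges) == len(cells) * 8   # 8 outgoing Machine Edges per cell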
Once the graph is built, the script starts the execution of the graph. During this
execution, the tool chain will obtain a machine description and use this with the
machine graph to work out a placement of each of the vertices and a routeing of the
edges between these placements, along with an allocated key for each of the vertices.
The software tools will then ask each vertex how many time steps it can record for
based on the available SDRAM after placement is complete, and the resources used
on each chip can therefore be determined. Each vertex will then be asked to generate
its data based on the mapping and timing information. SpiNNTools will then load
the generated data onto the machine along with the routeing tables and application
code and start the execution of the cores. It will wait an appropriate amount of time
for the cores to stop and then check their status. Assuming this is successful, control
will return to the script. This can then request the recorded states from each of the
vertices and display these data in an appropriate way.
A future version could have a Conway vertex that can have multiple cells within
each machine vertex, which would then allow for an application vertex of cells. This
would have a single large Application Vertex which would represent the whole game
board and an Application Edge for each of the 8 directions of connectivity, each
in its own Outgoing Edge Partition to indicate that different keys are required for
each of the directions. This would require that the vertex would have to cope with
the reception of multiple neighbour states, which would make the application code
itself more complex; for example, it would have to cope with multiple incoming
keys from each direction, each of which would target a different cell within the
grid.
Another possible extension to this application is to extract the state during exe-
cution and display this as the application progresses. This would require the addi-
tion of the Live Packet Gatherer vertex (described above) to the graph and an edge
from each of the Conway vertices to this vertex. The script would then indicate,
before executing the graph, that there is an external application that would like to
receive the data. This application will receive a message when the mapping database
has been written, at which point, it can set up a mapping between multicast keys
received and positions in the game board, responding when it has completed its
own setup. The tool chain will then notify this application that the simulation is
starting, and the application will then receive the same state messages as the vertices
receive, which it can use to update the display of the game board.
Figure 4.13. A neural network topology of a 1 mm² area of cortical microcircuit found
within the mammalian brain. Each population of neurons is shown as a circle containing
a number, where the number indicates the number of neurons in that population.

The second example application models
a 1 mm² area of the surface of the generic early sensory cortex [201]. Figure 4.13
shows the groups of neurons (Populations) in this network and the connectivity
between them (Projections). In a spiking neural network, the vertices are groups
of point neurons (as a single core can simulate more than one neuron); the com-
putation required is the update of the neuron state in response to spikes received
from connected neurons. The edges are then groups of synapses between the neu-
rons, over which spikes are transmitted. These are potentially unidirectional and
are likely to be more heterogeneous in nature than the regular grid pattern seen in
Conway’s Game of Life.
The problem of SNNs is clearly well suited to the architecture, as this is what it
was designed for, but the heterogeneity of the network, and the fact that multiple
neurons are computed on each core means that some networks will be more suited
to the platform than others; in particular, neural networks often form ‘small world’
networking topologies, where most of the connections are relatively local, but there
are a few long-distance connections. The computation required to simulate each
neuron at each time step in the simulation is generally fixed. The remaining time
is then dedicated to processing the spikes received, the number of which depends
on how many neurons are sending spikes to the core and the activity of those
connected neurons. This is not known in advance in general, so some flexibility
in the system with respect to the amount of computation available at each node
is necessary to allow the application to work in different circumstances. Once this
is known for a given network, the system could potentially be reconfigured with
additional cores, allowing that network to be simulated in less time overall.
4.9.1 PyNN
PyNN is a Python interface to define SNN simulations for a range of simulator
back-ends [44]. It allows users to specify an SNN simulation via a Python script
once and have it executed on any or all of the supported back-ends including
NEST [76], NEURON [33] and Brian [82]. This encourages standardisation of
simulators and reproducibility of results, and increases productivity of neural net-
work modellers through code sharing and reuse, by providing a foundation for
simulator-agnostic post-processing, visualisation and data-management tools.
PyNN has continued development as part of the European Flagship Human
Brain Project (HBP) [4], and has hence been adopted as a modelling language
by a number of partners including SpiNNaker. It provides a structured interface
for the definition of neurons, synapses and input sources, giving users the flexi-
bility to build a range of network topologies. Models typically consist of single-
compartment point neurons, grouped together in populations. These populations
are then linked with projections, representing the synaptic connections between the
axons of neurons in a source population, and the dendrites of neurons in a tar-
get population. Once defined, a number of simulation controls are used to exe-
cute the model for a given time period, with the option to update parameters
and initialise state variables between runs. On simulation completion, data can be
extracted for post-processing and future reference. Neuron variables such as spike
trains, total synaptic conductances and neuron membrane potential are accessi-
ble from population objects, while synaptic weights and delays are extracted from
projections. These data can be subsequently saved or visualised using the built-in
plotting functionality.
Example PyNN commands for the generation of populations and projections are
detailed in Listing 4.1. Here the sPyNNaker version of the simulator is imported
as sim and subsequently used to construct and execute a simulation. A population
of 250 Poisson source neurons is created with label ‘poisson_source’ and provides
50 Hz input to the network for 5 s. A second population of 500 integrate and fire
neurons is then created and labelled as ‘excitatory_pop’. Excitatory connections
are made between ‘poisson_source’ and ‘excitatory_pop’ with a 20% probability of
connection, each with a weight of 0.06 nA and delays specified via a probability
distribution. Data recording is then enabled for ‘excitatory_pop’, and the simula-
tion is executed for 5 s. Finally, the ‘excitatory_pop’ spike history data are extracted
from the simulator.
import pyNN.spiNNaker as sim
# Spike input
poisson_spike_source = sim.Population(250, sim.SpikeSourcePoisson(
    rate=50, duration=5000), label='poisson_source')
# Neuronal populations
pop_exc = sim.Population(500, sim.IF_curr_exp(**cell_params_exc),
                         label='excitatory_pop')
# Poisson source projections
poisson_projection_exc = sim.Projection(poisson_spike_source, pop_exc,
    sim.FixedProbabilityConnector(p_connect=0.2),
    synapse_type=sim.StaticSynapse(weight=0.06, delay=delay_distribution),
    receptor_type='excitatory')
# Specify output recording
pop_exc.record('all')
# Run simulation
sim.run(simtime=5000)
# Extract results data
exc_data = pop_exc.get_data('spikes')
Listing 4.1. Example PyNN commands (a complete script is detailed in Listing 4.2).
4.9.3 Preprocessing
At the top of the left-hand side stack in Figure 4.14, users create a PyNN script
defining an SNN. The SpiNNaker back-end is specified, which translates the SNN
into a form suitable for execution on a SpiNNaker machine. This process includes
mapping of the SNN into an application graph, partitioning into a machine graph,
generation of the required routeing information and loading of data and applica-
tions to a SpiNNaker machine. Once loading is complete, all core applications are
instructed to begin execution and run for a predefined period. On simulation com-
pletion, requested output data are extracted from the machine and made accessible
through the PyNN API.
A sample SNN is developed as a vehicle by which to describe the stages of
preprocessing. A random balanced network is defined according to the PyNN
script detailed in Listing 4.2, with the resulting network topology shown in
Figure 4.15(a). The network consists of 500 excitatory and 125 inhibitory neu-
rons, which make excitatory and inhibitory projections to one another, respectively.
Each population additionally makes recurrent connections to itself with the same
effect. Excitatory Poisson-distributed input is included to represent background
Figure 4.14. SpiNNaker software stacks. From top left anti-clockwise to top right: users
create SNN models on host via the PyNN interface; the sPyNNaker Python software stack
then translates the SNN model into a form suitable for a SpiNNaker machine and loads
the appropriate data to SpiNNaker memory via Ethernet; sPyNNaker applications, built
on the SARK system management and SpiN1API event-driven processing libraries, use
the loaded data to perform real-time simulation of neurons and synapses.
Figure 4.15. Network partitioning to fit machine resources. (a) Application graph gener-
ated from interpretation of PyNN script: circles represent PyNN populations, and arrows
represent PyNN projections. (b) Machine graph partitioned into vertices and edges to
suit machine resources: squares represent populations (or partitioned sub-populations)
of neurons which fit on a single SpiNNaker core − hence, the model described by the
machine graph in (b) requires 5 SpiNNaker cores for execution.
activity, while predefined spike patterns are injected via a spike source array. The
neuronal populations consist of current-based Leaky Integrate and Fire (LIF) neu-
rons, with the membrane potential of each neuron in the excitatory population
initialised via a uniform distribution bounded by the threshold and resting poten-
tials. The sPyNNaker API first interprets the PyNN defined network to construct
an application graph: a vertices and edges view of the neural network, where each
edge corresponds to a projection carrying synapses, and each vertex corresponds to
a population of neurons. This application graph is then partitioned into a machine
graph, by subdividing application vertices and edges based on available hardware
resources and requirement constraints, ultimately ensuring each resulting machine
vertex can be executed on a single SpiNNaker core. From hereon, the term vertex
will refer to a machine vertex and is synonymous with the term sub-population,
representing a group of neurons which can be simulated on a single core. An exam-
ple of this partitioning is shown in Figure 4.15, where due to its size ‘excitatory
population’ is split into two sub-partitions (A and B). Figure 4.15 also shows how
additional machine edges are created to preserve network topology between par-
titions A, B, and the other populations, and how different PyNN connectors are
treated differently during this process. For example, a PyNN OneToOneConnec-
tor connects each neuron in a population to itself. This results in both partitions
A and B having a machine edge representing their own connections, but with no
edge required to map the connector from one sub-population to the other. Con-
versely, the PyNN FixedProbabilityConnector links neurons in the source and target
populations based on connection probability and hence requires machine edges to
carry all possible synaptic connections (e.g. both between vertices A and B, and to
themselves).
Once partitioned, the machine graph is placed onto a virtual representation of
a SpiNNaker machine to facilitate allocation of chip-based resources such as cores
and memory. Known failed cores, chips and board links which compromise the
performance of a SpiNNaker machine are removed from this virtual representation,
and the machine graph is placed accordingly. Chip-specific routeing tables are then
generated facilitating transmission of spikes according to the machine edges repre-
senting the PyNN-defined projections. These tables are subsequently compressed
and loaded into router memory (as described in the previous chapter). The Python
software stack from Figure 4.14 then generates the core-specific neuron and synapse
data structures and loads them onto the SpiNNaker machine using the SpiNNTools
software. Core-specific neuron data are loaded to the appropriate DTCM, while the
associated synapse data are loaded into core-specific regions of SDRAM on the same
chip, ready for use according to Section 4.9.4. Finally, programs for execution on
application cores are loaded to ITCM, with each core executing an initialisation
function to load appropriate data structures (from SDRAM) and prepare the core
before switching to a ready state. Once all simulation cores are ready, the signal to
begin simulation is given to all cores from host, and the SNN will execute according
to the processes defined in Section 4.9.4.
Figure 4.16. SpiNNaker realtime OS: (a) SpiN1API multi-threaded event-based operating
system: scheduler thread to queue callbacks; dispatcher thread to execute callbacks; and
FIQ thread to service interrupts from high-priority (−1) events. (b) Events and associated
callbacks for updating neuron state variables and processing incoming packets repre-
senting spikes into synaptic input. Figures reproduced with permission from [222, 223].
Table 4.1. Hardware (and single software) events, along with their registered callback and
associated priority level.

Event                  Callback                               Priority
Timer                  timer_callback                         2
Packet received        _multicast_packet_received_callback    −1
DMA complete           _dma_complete_callback                 0
User (software)        user_callback                          0

These events and their registered callbacks
facilitate the periodic updating of neuron state and the event-based processing of
synapses when packets representing spikes arrive at a core. These events (squares)
and their callbacks (circles) are shown schematically in Figure 4.16(b). The function
timer_callback evolves the state of neurons in time and is called periodically against
timer events throughout a simulation. A packet received event triggers a
_multicast_packet_received_callback, which reads the packet to extract and
transfer the source neuron ID to a spike queue. If no spike processing is currently
being performed, the software-triggered user event is issued and, in turn, executes
a user_callback that reads the next ID from the spike queue, locates the associ-
ated synaptic information stored in SDRAM and initiates a DMA to copy it into
DTCM for subsequent processing. Finally, the _dma_complete_callback is exe-
cuted on a DMA complete event and initiates processing of the synaptic contribu-
tion(s) to the post-synaptic neuron(s). If on completion of this processing there
are items remaining in the input spike queue, this callback initiates processing of
the next spike: meaning this collection of callbacks can be thought of as a spike
processing pipeline.
Cores are not explicitly synchronised, so drift between core clocks
is possible due to slight variations in clock speed (from clock crystal manufacturing
variability); however, this effect is small relative to simulation times [235]. Small
variations placing core updates slightly out of phase can also occur due to the way
the ‘start’ signal is communicated, particularly on larger machines; however, again
this effect is negligible. A consequence of this update scheme is that generated spikes
are constrained to the time grid (multiples of the simulation timestep Δt). It also
enforces a finite minimum simulation spike transit time between neurons of Δt, as
input cannot be guaranteed to arrive in the current timestep before a neuron has
been updated. From the hardware perspective, the maximum packet transit time
for the million core machine is ≤25 µs (assuming 200 ns per router [235], and a
maximum path length of 128).
A design goal of the SpiNNaker platform is to achieve real-time simulation of
SNNs, where ‘real time’ is defined as when the time taken to simulate a network
matches the amount of time the network has modelled. Therefore, an SNN with a
simulation timestep of Δt = 1 ms requires the period of timer events to be set at
200,000 clock cycles (where at 200 MHz each clock cycle has a period of 5 ns – see
Section 2.2). This causes 1 ms of simulation to be executed in 1 ms, meaning the
solution will keep up with wall-clock time, enabling advantageous performance,
and interaction with systems operating on the same clock (such as robots, humans
and animals). In practice, real-time execution is not always possible, and therefore,
users are free to reduce the value of Δt in special cases and also adjust the num-
ber of clock cycles between timer events. For example, if a neuron model requires
Δt = 0.1 ms for accuracy, it is a common practice to let the period between timer
events remain at 200,000 clock cycles, to ensure there is sufficient processing time
to update the neurons and process incoming spikes [217]. This enforces a slowdown
factor of 10 relative to real time.
From the perspective of an individual core, each neuron is initialised with user-
defined parameters at time t_0 (supplied via a PyNN script). All state variables are
then updated one timestep at a time up to the simulation end time t_end. The num-
ber of required updates and hence timer events is calculated based on t_end and the
user-defined simulation timestep Δt (which is fixed for the duration of simulation).
Each call to timer_callback advances all the neurons on a core by Δt according
to Algorithm S1 in [208], which is shown schematically on the left-hand side of
Figure 4.18. First the synapse state for all neurons on the core is updated accord-
ing to the model shaping rule, and any new input this timestep is added from the
synaptic input buffers (discussed below). Interrupts are disabled during this update
to prevent concurrent access to the buffers from spike processing operations. The
states of all neurons on the core are then updated sequentially. An individual neuron
state at the current time Ni,t is accessed in memory, and if the neuron is not refrac-
tory, its state is updated according to the model characterising its sub-threshold
Figure 4.18. Left: update flow advancing the state of neuron N_i by Δt. Centre: circular
synaptic input buffers accumulate scaled input at different locations based on synaptic
delay (buffers are rotated one slot at the end of every timestep). Right top: synaptic input
buffer values are converted to fixed-point format and scaled before adding to N_i. Right
bottom: decoding of a synaptic word into circular synaptic buffer input.
Using an integer representation reduces buffer size in DTCM and also the size of
synaptic weights in SDRAM, relative to using the standard 32-bit fixed-point accum
type. However, it requires conversion to accum type for use in the neuron model
calculations – as shown in Figure 4.18. This conversion is performed via a union
and left-shift, the size of which represents a trade-off between headroom and
precision. An example shift of 6 is shown, causing the smallest bit of the synaptic
input buffer to represent 2^−9 = 1.953125 × 10^−3, and the largest 2^7 = 128, in
the accum type of the synapse state. Under extreme conditions, a buffer slot will
saturate from concurrent spike activity, meaning the shift size should be increased.
However, the shift is also intrinsic to the weight representation and affects precision,
as all weights must be scaled by 2^(15−shift) before being written as integers to the
synaptic matrices discussed in Section 4.9.4. For example, in Figure 4.18, a weight
of 1.15 nA was converted to 589 on host during generation of synaptic data, but
is returned as 1.150390625 nA when used during simulation (with a shift of 6).
The shift value is currently calculated by the sPyNNaker toolchain to provide a
balance between handling large weights, high fan-in and/or pre-synaptic firing rates,
and maintaining precision – see the work by Albada et al. [3], where the theory
leading to a usable closed-form probabilistic headroom mechanism is described in
Equation 1.
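This round trip can be reproduced with a few lines of Python; the shift of 6 is the
one from the example above, while the function names are illustrative:

SHIFT = 6
SCALE = 1 << (15 - SHIFT)              # 2^(15-shift) = 512

def encode_weight(weight_nA: float) -> int:
    """Scale and round a weight for storage as a 16-bit integer on host."""
    return round(weight_nA * SCALE)

def decode_weight(stored: int) -> float:
    """Interpret the stored integer as an accum-typed input value."""
    return stored / SCALE

stored = encode_weight(1.15)
print(stored, decode_weight(stored))   # -> 589 1.150390625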
Receiving a Spike
A _multicast_packet_received_callback is triggered by a packet received event,
raised when a multicast packet arrives at the core. This callback is assigned highest
priority (−1) and hence makes use of the FIQ thread and pre-empts all other core
processing (see Figure 4.16(a)). This callback cannot be queued, and therefore,
to prevent traffic backing up on the network, this callback is designed to execute
quickly, and it simply extracts the source neuron ID (from the 32-bit key) and stores
it in an input spike buffer for subsequent processing. Note that by default this buffer
is 256 entries long, enabling queuing of 256 spikes simultaneously. The callback
then checks for activity in the spike processing pipeline and registers a user event if
inactive. Pseudo code for this callback is made available by Rhodes et al. [208].
Figure 4.19. Data structures for processing incoming spikes: Master population table,
address list, and synaptic matrix, are shown from the perspective of the core simulat-
ing the Excitatory A population in Figure 4.15(b). The path in bold represents that taken
when a packet is received by Excitatory A, originating from itself, and hence two projec-
tions must be processed.
The master population table is searched on the arrival of each packet, taking a
masked source neuron ID as the key by which a source vertex can be
identified. Each row pertains to a single source vertex and consists of: 32-bit key;
32-bit mask; 16-bit start location of the first row in the address list pertaining to this
source vertex; and a 16-bit value defining the number of rows, where each row in the
address list represents a PyNN projection. When searching this table, the key from
the incoming packet is masked using each entry-specific mask before comparing
to the entry key. This masks off the individual neuron ID bits and enables source
vertices to simulate different numbers of neurons. The entry keys are masked on
host before loading for efficiency and are structured to prevent overlap after mask-
ing and facilitate binary searching. The structure of an address list row consists of:
a single header bit detailing whether the synaptic matrix associated with this pro-
jection is located in DTCM or SDRAM; 32-bit memory address indicating the
first row of the synaptic matrix; and an 8-bit value detailing the synaptic matrix
row length (i.e. the maximum number of post-synaptic neurons connected to by
a single pre-synaptic neuron in this projection).
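The lookup can be sketched as a masked binary search. The field widths follow the
description above, but the table contents and key values below are invented for
illustration:

from bisect import bisect_right

# Each entry: (key, mask, start_row, n_rows), sorted by key.
MASTER_POPULATION_TABLE = [
    (0x00010000, 0xFFFFFF00, 0, 1),   # source vertex A: 256 neurons
    (0x00020000, 0xFFFFFF80, 1, 2),   # source vertex B: 128 neurons
]

def lookup(packet_key: int) -> range:
    """Return the address-list rows for the packet's source vertex."""
    keys = [entry[0] for entry in MASTER_POPULATION_TABLE]
    index = bisect_right(keys, packet_key) - 1
    key, mask, start, n_rows = MASTER_POPULATION_TABLE[index]
    if packet_key & mask != key:      # mask off the neuron ID bits
        raise KeyError(f"no source vertex for key {packet_key:#010x}")
    return range(start, start + n_rows)

print(list(lookup(0x00020005)))       # neuron 5 of vertex B -> rows [1, 2]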
Synapse processing
On completion of the DMA in Section 4.9.4, a DMA complete event triggers a
_dma_complete_callback, initiating processing of the synaptic row. As described
previously, each row pertains to synapses made, within a single PyNN projection,
between a single pre-synaptic neuron and multiple post-synaptic neurons. At the
highest level, a synaptic row is an array of synaptic words, where each word is
defined as a 32-bit unsigned integer. The row is split into three designated regions to
enable identification of static and plastic synapses (connections capable of changing
their weight at runtime). The row regions contain dynamic plastic data, constant
fixed plastic data and static data. Three header fields are also included, detailing
the size of each region and enabling easy navigation of the row. A schematic break-
down of the synaptic row structure is detailed in Figure 4.20. Note that because
a PyNN projection cannot be both static and plastic simultaneously, a single row
contains only either static or plastic data. Plastic data are intentionally segregated
into dynamic and fixed regions to facilitate processing. While all plastic data must
be copied locally to evaluate synaptic contributions to a neuron, only the dynamic
region – that is, that changing at runtime – requires updating for use when process-
ing subsequent spikes. Keeping this dynamic data in a separate block facilitates writ-
ing back to the synaptic matrix with a single DMA, and writing back less data helps
compensate for reduced DMA write bandwidth (relative to read – see Section 2.2).
The static region occupies the lower portion of the synaptic row and is itself
an array of synaptic words, where each word corresponds to a synaptic connection
Figure 4.20. Synaptic row structure with breakdown of substructures for both static and
plastic synapses. (Each static synapse is a 32-bit word; its lower half-word comprises,
from most to least significant: 3 bits of padding, a 4-bit delay, a 1-bit synapse type and
an 8-bit neuron ID.)
Figure 4.21. Interaction of callbacks shown over the time period between two timer
events. Four spike events are processed, representing the scenarios: receiving a packet
while processing a timer event; receiving a packet while the core is idling; and receiving
a packet while the spike processing pipeline is active. Note that a lighter colour shade
indicates a suspension of a callback, which is resumed on completion of higher priority
tasks.
between the row’s pre-synaptic neuron and a single post-synaptic neuron. As shown
in Figure 4.20, each 32-bit data structure is split such that the top 16 bits repre-
sent the weight, while the lower 16 bits typically split: bottom 8 bits to specify
the post-synaptic neuron ID; 1 bit to specify the synapse type (0 excitatory,
1 inhibitory); 4 bits to specify the synaptic delay; leaving 3 bits for padding (useful
for model customisation, e.g. adding additional receptor types). Data defining
plastic synapses are divided across the dynamic and fixed regions. Fixed plastic
data are defined by a 16-bit unsigned integer and match the structure of the
lower half of a static synapse (see lower half of Figure 4.20). These 16-bit synap-
tic half-words enable double-packing inside the 32-bit array of the synaptic row,
meaning an empty half-slot will be apparent if the row targets an odd number of
synapses. The dynamic plastic region contains a header defining the pre-synaptic
event history, followed by a series of synapse structures capturing the weight of each
synapse. Note that for typical plasticity models, this defaults to the same 16-
bit weight describing static synapses; however, synapse structure can be extended
to include additional parameters (in multiples of 16 bits) if required by a given
plasticity rule.
A task of the _dma_complete_callback is therefore to convert the synaptic
row into individual post-synaptic neuron input. The callback processes the row
headers to ascertain whether it contains static or plastic data, adjusts synapses
according to a given plasticity rule, and then loops over each synaptic word and
extracts its neuronal contribution – pseudo code for this callback is detailed in
Algorithm S4 of [208]. An example of this process for a single static synaptic
word is shown in the lower right of Figure 4.18, where a synaptic word of
[0000001001001101 0001010100001100] leads to a contribution of 589 to slot
10 of the inhibitory synaptic input buffer for neuron N12.
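To make this decoding concrete, the sketch below (in Python rather than the on-chip C, purely for illustration) unpacks that same word using the field widths of Figure 4.20:

word = 0b00000010010011010001010100001100   # example word from Figure 4.18

weight    = (word >> 16) & 0xFFFF   # top 16 bits: weight = 589
delay     = (word >> 9) & 0xF       # 4-bit delay field = 10 (slot 10)
syn_type  = (word >> 8) & 0x1       # 1-bit type: 1 = inhibitory
neuron_id = word & 0xFF             # bottom 8 bits: neuron ID = 12

print(weight, delay, syn_type, neuron_id)   # 589 10 1 12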
Callback Interaction
The callbacks described above define how a sPyNNaker application responds to
hardware events and updates an SNN simulation. The interaction of these events is
a complex process, with the potential to impact the ability of a SpiNNaker machine
to perform real-time execution. Figure 4.21 covers the time between two timer
events and shows interaction of spike processing and neuron update callbacks for
four scenarios detailed by the arrival of spikes 1–4. The first timer event initiates
processing of the neuron update; however, after completion of approximately one-
third of the update, the core receives Spike 1, interrupting the timer_callback
and triggering execution of a _multicast_packet_received_callback, which in
turn raises a user event, initiating DMA transfer of the appropriate synaptic infor-
mation. On completion of the callback, the core returns to the timer_callback,
with the DMA transfer occurring in parallel. On completion of the DMA, a
_dma_complete_callback is raised, initiating processing of the returned synaptic
row; Spikes 2–4 exercise the remaining scenarios through the same chain of
callbacks. Events are displayed in Figure 4.21 by solid black lines, the width of
which represents the time taken to switch context and begin execution of the call-
back. The timer_callback takes longest to respond due to queuing of events with
priority > 0, while the _multicast_packet_received_callback is quickest due to
its priority of −1 and use of the FIQ thread. Other chip-level factors can also influ-
ence execution, such as SDRAM contention with applications running on adjacent
cores. As DMAs are processed in serial bursts, if multiple simultaneous requests are
received by the SDRAM controller, there may be latency in beginning the DMA
for some cores and a reduced rate of transfer (see Section S1.2 of [208] for further
information).
Software Structure
PyNN defines a number of standard cell models, such as the LIF neuron and
the Izhikevich neuron. Implementations of these standard models are included in
sPyNNaker; however, the API is also designed to support users wishing to extend
this core functionality and implement neuron models of their own. To facilitate this
extension, the model framework is defined in an object-oriented fashion, through
the use of C code on the SpiNNaker machine. This modular approach provides
structure and aids code reuse between different models (e.g. sharing of a synaptic
plasticity rule between different neuron models). A neuron model is built from the
following components:

- synapse_type: defining the dynamics of the synaptic state for each receptor type;
- input_type: converting the synaptic state into neuron input current;
- additional_input: defining any intrinsic currents contributing to the neuron dynamics;
- neuron_model: defining the sub-threshold dynamics and state of the neuron;
- threshold_type: testing the membrane potential against the firing threshold.
The individual model components each produce a subset of the neuron and
synapse dynamics and are therefore the entry points for a user looking to deploy
a custom neuron model.4 In keeping with the aforementioned software stacks in
Figure 4.14, interfaces to each component are written in both Python and C. A sin-
gle instance of each component is collected via a C header file and compiled against
the underlying operating system described in Section 4.9.4 to generate a runtime
application. Python classes for each component facilitate user interaction with each
part of the model, enabling setting of parameter values and initial conditions from
a PyNN SNN script.
The runtime execution framework calls each component as part of the
timer_callback, as detailed in Algorithm S1 in [208] and shown schematically in
Figure 4.18. First the synaptic state is advanced forward in time by a single simu-
lation timestep, using the functions defined by the synapse_type component. Core
interrupts are disabled during this process to prevent concurrent access of the synap-
tic input buffers from a _dma_complete_callback. Interrupts are re-enabled when
all the state related to the synapses for all receptor types for all neurons on a core
has been updated. Each neuron then has its state advanced by Δt. The input_type
component is called first, converting the updated synaptic state into neuron input
current. This includes separate excitatory and inhibitory components, with core
implementations capable of handling both current- and conductance-based formu-
lations. The additional_input component is then evaluated to calculate the level of
any intrinsic currents. The synaptic and intrinsic currents, together with any back-
ground current, are then supplied to the neuron_model component which subse-
quently marches the neuron state forward by Δt. The neuron membrane potential
is now passed to the threshold_type component which tests whether the neuron
has fired. If the neuron is above threshold, a number of actions are performed:
a refractory counter begins to instigate any refractory period; the additional_input
is notified of the spike to allow updating of appropriate state variables; and finally,
the core is instructed to send a multicast packet to the router with the neuron ID
as key.
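As an illustration of this call order, the toy sketch below (plain Python with stand-in classes; the real components are C code compiled onto the core, and these class and method names are invented for the example) steps a single neuron through the input_type, neuron_model and threshold_type stages; the synapse_type and additional_input stages are omitted for brevity:

class InputType:                     # converts synaptic state to input current
    def convert(self, exc, inh):
        return exc - inh             # current-based: inhibition subtracts

class NeuronModel:                   # sub-threshold dynamics (toy LIF step)
    def advance(self, v, current, dt=1.0, tau=20.0, v_rest=-65.0):
        return v + dt * ((v_rest - v) + current) / tau

class ThresholdType:                 # tests whether the neuron has fired
    def is_above(self, v, v_thresh=-50.0):
        return v >= v_thresh

input_type, neuron_model, threshold_type = InputType(), NeuronModel(), ThresholdType()

v, spikes = -65.0, 0
for _ in range(100):                 # one pass per simulation timestep
    current = input_type.convert(exc=20.0, inh=2.0)
    v = neuron_model.advance(v, current)
    if threshold_type.is_above(v):
        v, spikes = -65.0, spikes + 1   # reset; a multicast packet would be sent
print(spikes)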
Izhikevich Neuron
The Izhikevich neuron model [116] allows reproduction of biologically observed
neuronal characteristics such as spiking and bursting. Its dynamics follow a type of
‘quadratic integrate and fire’ model, as detailed in Equation 4.5
$$\frac{dv}{dt} = 0.04v^2 + 5v + 140 - u + I(t), \qquad (4.5)$$

$$\frac{du}{dt} = a(bv - u)$$

$$\text{if } v \ge V_\theta, \text{ then } \begin{cases} v \leftarrow c \\ u \leftarrow u + d \end{cases} \qquad (4.6)$$
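A minimal sketch of these dynamics, integrating Equations 4.5 and 4.6 with forward Euler at dt = 1 ms (a, b, c and d are Izhikevich's published regular-spiking parameters; the constant input current is illustrative):

a, b, c, d = 0.02, 0.2, -65.0, 8.0    # regular-spiking parameters
v_theta = 30.0                        # firing threshold (mV)
dt, I = 1.0, 10.0                     # timestep (ms), constant input current

v, u, spike_times = c, b * c, []
for t in range(200):
    v += dt * (0.04 * v**2 + 5 * v + 140 - u + I)   # Equation 4.5
    u += dt * a * (b * v - u)
    if v >= v_theta:                                # Equation 4.6
        v, u = c, u + d
        spike_times.append(t)
print(spike_times)                    # regular spiking under constant input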
For each spike source, the number of spikes to send between timer events is calculated [130], and the corresponding packets sent
are interspersed with random delays. This random spacing reduces the chance of
synchronised spike arrival at post-synaptic cores, easing pressure on both the source
and target routers. For slow sources, after each spike, an inter-spike interval is evalu-
ated in multiples of Δt, which is then counted down between sending packets. For
fast spike sources, the post-synaptic core is likely to retrieve from SDRAM the same
pieces of synaptic matrix many times during a simulation. Therefore, to remove the
overhead of the DMA, a mechanism is included to store the synaptic matrices from
fast spike sources in DTCM.
When a delay extension is used to implement delays longer than a core can handle natively, two packets are required to transmit a spike. The post-synaptic core also per-
forms additional processing during look-up of the source vertex in the master popu-
lation table. An additional row must be included to identify spikes travelling direct
from the pre-synaptic core and also those sent from each individual delay stage of
the delay extension. This increased master population table size can be costly to search
and detrimental to real-time performance [207].
import pyNN.spiNNaker as sim

# Initialise simulator
sim.setup(timestep=1)

# Spike input
poisson_spike_source = sim.Population(250, sim.SpikeSourcePoisson(
    rate=50, duration=5000), label='poisson_source')

spike_source_array = sim.Population(250, sim.SpikeSourceArray,
                                    {'spike_times': [1000]},
                                    label='spike_source')

# Neuron parameters
cell_params_exc = {
    'tau_m': 20.0, 'cm': 1.0, 'v_rest': -65.0, 'v_reset': -65.0,
    'v_thresh': -50.0, 'tau_syn_E': 5.0, 'tau_syn_I': 15.0,
    'tau_refrac': 0.3, 'i_offset': 0}

cell_params_inh = {
    'tau_m': 20.0, 'cm': 1.0, 'v_rest': -65.0, 'v_reset': -65.0,
    'v_thresh': -50.0, 'tau_syn_E': 5.0, 'tau_syn_I': 5.0,
    'tau_refrac': 0.3, 'i_offset': 0}

# Neuronal populations
pop_exc = sim.Population(500, sim.IF_curr_exp(**cell_params_exc),
                         label='excitatory_pop')

pop_inh = sim.Population(125, sim.IF_curr_exp(**cell_params_inh),
                         label='inhibitory_pop')

# Generate random distributions from which to initialise parameters
rng = sim.NumpyRNG(seed=98766987, parallel_safe=True)

# Initialise membrane potentials uniformly between threshold and resting
pop_exc.set(v=sim.RandomDistribution('uniform',
                                     [cell_params_exc['v_reset'],
                                      cell_params_exc['v_thresh']],
                                     rng=rng))

# Distribution from which to allocate delays
delay_distribution = sim.RandomDistribution('uniform', [1, 10], rng=rng)

# Spike input projections
spike_source_projection = sim.Projection(spike_source_array, pop_exc,
    sim.FixedProbabilityConnector(p_connect=0.05),
    synapse_type=sim.StaticSynapse(weight=0.1, delay=delay_distribution),
    receptor_type='excitatory')

# Poisson source projections
poisson_projection_exc = sim.Projection(poisson_spike_source, pop_exc,
    sim.FixedProbabilityConnector(p_connect=0.2),
    synapse_type=sim.StaticSynapse(weight=0.06, delay=delay_distribution),
    receptor_type='excitatory')
poisson_projection_inh = sim.Projection(poisson_spike_source, pop_inh,
    sim.FixedProbabilityConnector(p_connect=0.2),
    synapse_type=sim.StaticSynapse(weight=0.03, delay=delay_distribution),
    receptor_type='excitatory')

# Recurrent projections
exc_exc_rec = sim.Projection(pop_exc, pop_exc,
    sim.FixedProbabilityConnector(p_connect=0.1),
    synapse_type=sim.StaticSynapse(weight=0.03, delay=delay_distribution),
    receptor_type='excitatory')
exc_exc_one_to_one_rec = sim.Projection(pop_exc, pop_exc,
    sim.OneToOneConnector(),
    synapse_type=sim.StaticSynapse(weight=0.03, delay=delay_distribution),
    receptor_type='excitatory')
inh_inh_rec = sim.Projection(pop_inh, pop_inh,
    sim.FixedProbabilityConnector(p_connect=0.1),
    synapse_type=sim.StaticSynapse(weight=0.03, delay=delay_distribution),
    receptor_type='inhibitory')

# Projections between neuronal populations
exc_to_inh = sim.Projection(pop_exc, pop_inh,
    sim.FixedProbabilityConnector(p_connect=0.2),
    synapse_type=sim.StaticSynapse(weight=0.06, delay=delay_distribution),
    receptor_type='excitatory')
inh_to_exc = sim.Projection(pop_inh, pop_exc,
    sim.FixedProbabilityConnector(p_connect=0.2),
    synapse_type=sim.StaticSynapse(weight=0.06, delay=delay_distribution),
    receptor_type='inhibitory')

# Specify output recording
pop_exc.record('all')
pop_inh.record('spikes')

# Run simulation
sim.run(simtime=5000)

# Extract results data
exc_data = pop_exc.get_data('spikes')
inh_data = pop_inh.get_data('spikes')

# Exit simulation
sim.end()
Chapter 5

Applications − Doing Stuff on the Machine

— The Terminator
5.1 Robot Art Project
We are a lab full of engineers. Art was as far away from our collective future pro-
jections for the platform as possible. So, once we were approached by Tove Kjell-
mark, a Swedish artist, with the idea for an exhibit involving humanoid robots and
SpiNNaker, we immediately considered the issues and hurdles of such an attempt,
not the least that of time and expectation management. The exhibition at the
Manchester Art Gallery, named ‘The Imitation Game’ in honour of Alan Turing
and his eponymous test, was to include several robotic pieces with the common
theme of seeming intelligent in particular ways. The robotic entities present in the
gallery would surely not pass Turing’s test in any meaningful way, but that was
not the plan anyway. To school children, laypeople and scientists alike, this was
an artist’s view at imitating life at the behavioural, albeit limited, level. At a basic
level, these pieces would hint at the existence of something more than just Artifi-
cial Intelligence (AI). Tove Kjellmark would call it ‘another nature’, that is to say an
elimination of the artificial boundaries between the technological, the mechanical
and the natural. We would rather call it a conceptual step in a more important area
of research, that of Artificial General Intelligence (AGI), as opposed to the narrow
AI, nowadays present everywhere: the ‘autistic savants’ that tell you what objects
you are looking at, what movies to watch next and what music to listen to based on
your listening habits.
Our involvement focused on the piece ‘Talk’ (pictured in Figure 5.1) that fea-
tured two robotic torsos sat cross-legged on comfortable chairs discussing a dream.
They look at each other, gesture while talking, and speak fluently with appropriate
cadence, sighs and pauses. If a human dares approach, they stop their conversation,
turn their heads to face the intruder, chastise them and wave them away. Thus,
SpiNNaker’s task was to control the arms of the robots to perform realistic-looking
arm movements in three regimes: idling, gesturing and silencing.
This undergraduate project succeeded in revealing that
SpiNNaker is capable of real-life, albeit impractical, applications. The individually
packaged SpiNNaker boards would not be turned off for weeks at a time and would
operate without flaw for over 7 hours a day for approximately 4 months in conjunc-
tion with the physical robots. As expected, maintenance visits to the Gallery would
generally revolve around the robots or indeed the host computers, rather than any
Figure 5.1. Display in ‘The Imitation Game’ exhibition at the Manchester Art Gallery,
2016, celebrating Manchester becoming European City of Science. Artist: Tove Kjellmark;
School of Computer Science, Manchester: Petruţ Bogdan, Prof. Steve Furber, Dr. Dave
Lester, Michael Hopkins; Manchester Art Gallery Exhibitions Intern: Mathew Bancroft;
Mechatronics Division, KTH, Stockholm: Joel Schröder, Jacob Johansson, Daniel Ohlsson,
Elif Toy, Erik Bergdahl, Freddi Haataja, Anders Åström, Victor Karlsson, Sandra Aidanpää;
Furhat Robotics: Gabriel Skantze, Jonas Beskow, Dr Per Johansson.
5.1.1 Building Brains with Nengo and Some Bits and Pieces
Two small PCs were used to control the two robots: the primary PC completely con-
trols one of the robots and the arms of the other, while the secondary PC operates
only the head of the other robot. The two distributed instances of the Furhat con-
troller communicate through the network at key moments advancing the scripted
dialogue. The primary PC is also responsible for communicating with the glo-
rified distance sensor embodied in a Microsoft Kinect sensor, as well as the two
stand-alone SpiNNaker boards. Both PCs control the actuators in the robotic arms
using classical control theory; some translation is required between SpiNNaker’s
Figure 5.2. The flow of information in the project: depth information from the Kinect
feeds the two robots, Luke and Leia.
communication and these closed-loop control systems. Figure 5.2 reveals the flow
of information involved in this project.
The previous chapter explained how SpiNNaker is usually controlled, using
PyNN as a high-level network description language, viewing individual neurons
as the main units of computation. Instead, here the Neural ENGineering Objects
(Nengo) simulator bunches neurons together in ensembles (populations) and relies
on their concerted activity to perform computation [53].
The way Nengo is built supports the implementation of a proportional-integral-
derivative (PID) controller using a spiking neural substrate. A PID controller is a
control loop feedback mechanism that continuously computes the error between
the desired trajectory and the current position. The controller attempts to min-
imise the error as described by a weighted sum of a proportional, an integral and a
derivative term. The proportional term accounts for moving towards the target at
a rate dictated by the distance from it (cross track error). The derivative term con-
siders the angle of the current trajectory compared to that of the desired trajectory
(also called the cross track error rate), while the integral term is used to correct for
accumulated errors that lead to a steady state error caused by, for example, external
factors.
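To make the three terms concrete, here is a minimal discrete PID controller in plain Python (a generic sketch, not the spiking Nengo implementation used in the exhibit; the gains and the toy plant are illustrative):

class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, target, current):
        error = target - current                          # proportional term
        self.integral += error * self.dt                  # integral term
        derivative = (error - self.prev_error) / self.dt  # derivative term
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid, position = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.01), 0.0
for _ in range(1000):                  # drive a toy joint towards position 1.0
    position += pid.step(1.0, position) * 0.01
print(round(position, 3))              # converges close to 1.0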
Consider the example of a driverless car positioned in a controlled environment
with a trajectory precomputed for it to follow down the track in order to avoid some
static obstacles. The goal is to try to follow the trajectory as closely as possible, so
effects such as oscillations are not desired. In addition, the researchers at the facility
have decided to see what would happen if at some point on the path they place
a rock or pothole. They hope that the system would realise that it is drifting off
course and apply a correcting turn. Figure 5.3 shows what this would look like in
Figure 5.3. An example of trajectory following. In a real example, the trajectory would
potentially not change so abruptly.
Figure 5.4. (a) A 15-second window of the operation of the control system running at the
Manchester Art Gallery. This time period sees the robots going through all of the defined
actions: gesture (the robot is talking), silence (the robot stopped talking to make a silenc-
ing gesture directed at an approaching visitor) and idle (the robot is not talking but lis-
tening to the other robot talk). (b) Robot poses corresponding to the Nengo simulation.
The poses correspond to times 2, 4, 8 and 14.
Nengo. This is very similar to what can be done when controlling robot arm motors
and servos.
Figure 5.4 shows the operation of one of the arms on a robot over a timespan
of 15 seconds. During this time, the robot is issued three different commands in
Figure 5.5. Gesturing movement of the robots computed as a function of time:
$f(t) = \frac{1}{2}\left(\sin(1.6t) - \cos(2t)\right)$.
succession: gesture, silence and idle. While gesturing, the target position of each
joint is given by a predetermined ‘zero’ or base position (hand-picked values that
look natural in the physical exhibit) subtracted from a sinusoidal signal, namely the
one in Figure 5.5. The incoming signal is transformed using a linear transformation
for each joint individually to create a human-like gesturing motion. Since the robots
each have two arms, there is a dot-product-based network inhibiting the arm that is
not intended for use. Such arm selection is possible by creating a couple of prede-
fined orthonormal vectors that represent the left and right directions. Based on the
input direction vector for the system, a dot product is computed between it and
the two previously mentioned bases so as to determine which direction is closest
based on the angle. In the particular case where the vector is not significantly closer
to any of the targets, the system accomplishes the desired action using both arms.
The result of adding this level of control and inhibition is that the robot can now
move one arm, or the other, or even both, thus allowing for more human mimetic
behaviour.
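A hedged sketch of this selection logic in plain NumPy (not the Nengo network itself; the basis vectors and the margin are illustrative):

import numpy as np

LEFT, RIGHT = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # orthonormal bases

def select_arms(direction, margin=0.2):
    d = direction / np.linalg.norm(direction)
    left_score, right_score = float(d @ LEFT), float(d @ RIGHT)
    if abs(left_score - right_score) < margin:
        return ['left', 'right']          # not clearly closer: use both arms
    return ['left'] if left_score > right_score else ['right']

print(select_arms(np.array([0.9, 0.1])))   # ['left']
print(select_arms(np.array([0.5, 0.55])))  # ['left', 'right']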
When issued the action ‘silence’, the performing robot raises both lower arms
into the air, in a defensive manner, signalled by external feedback from the head
assembly, which turns to face the visitor and asks them to be silent. The action is
achieved by inhibiting the neurons’ spiking activity in the ensemble representing
the ‘zero’ position and the ‘sound’ signal using the inhibiting output from an incor-
porated Basal Ganglia (BG) model. Analogously, idling is achieved by inhibiting the
sound and silencing signals.
Because the exhibition took place in Manchester, because no one else was around to
maintain these robots, and because we still had to experiment with realistic movement,
we interacted with them for most of their stay at the Manchester Art Gallery. Most of
these interactions took place during typical work hours, meaning that the gallery
was usually populated by school children. It was surreal seeing the children interact
with the robots. They weren’t allowed to touch them of course, although that did
not prevent them from trying. All of this assumes that they managed to enter the
room: the usual first reaction to seeing them was fear. Once I had talked to the chil-
dren’s teachers and assured them that it was safe in the room, they would flock inside
to witness the two humanoids in discussion. There was always someone watching
from the doorway, too apprehensive to approach these mechanical beings, which
were, essentially, only superficially intelligent. Nobody knew what they were talking
about, but they were all fascinated with their ‘silencing’ phases as these provided the
most audience interaction. These groups rarely stopped to read the plaque describ-
ing the exhibit, but surely this was a success in and of itself: SpiNNaker managed
to work flawlessly for the entire duration of the exhibit; the same could not be said
about the actuators and 3D-printed parts which had a much harder time.
5.2 Computer Vision with Spiking Neurons

Gabor-like Detection
To extract features, we can take inspiration from biological vision; Gabor-like filters
are an example of a common abstraction which has its origin in biology and
has been used in traditional computer vision [97]. These can be implemented
using spiking neurons whose (immediate) receptive field is distance dependent
and synapse weights are proportional to the ones computed by the Gabor func-
tion. Methods for transforming weight values have been proposed in the litera-
ture [185, 190] and in Chapter 7, we discuss a different approach.
Figure 5.6 (b–g) shows the result of filtering a Modified NIST (MNIST) digit
(Figure 5.6(a)). The Gabor filters were generated using the following equations:
$$O(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x'}{\lambda} + \psi\right), \qquad (5.1)$$
$$x' = x\cos\theta + y\sin\theta, \qquad (5.2)$$
$$y' = -x\sin\theta + y\cos\theta, \qquad (5.3)$$
where λ and ψ are the wavelength and phase of the sinusoidal component, respec-
tively; θ is the orientation of the resulting stripes, σ is the standard deviation of the
Gaussian component; and γ is the spatial aspect ratio. Parameters for the generation
of Gabor kernels are presented in Table 5.1.
Figure 5.6. Results of Gabor-like feature extraction. (a) shows the input image converted
to a spike train and later filtered using six Gabor kernels. (b–g) show the responses of
each filtering population projected to the input space.
Table 5.1. Parameters for the generation of Gabor kernels: width, sampling, σ, λ, γ, ψ and θ.
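The kernels of Equations 5.1–5.3 are straightforward to generate; the sketch below does so in NumPy (the parameter values are illustrative, not those of Table 5.1):

import numpy as np

def gabor_kernel(width, lam, theta, psi, sigma, gamma):
    half = width // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_p = x * np.cos(theta) + y * np.sin(theta)           # Equation 5.2
    y_p = -x * np.sin(theta) + y * np.cos(theta)          # Equation 5.3
    envelope = np.exp(-(x_p**2 + gamma**2 * y_p**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_p / lam + psi)         # Equation 5.1
    return envelope * carrier

kernel = gabor_kernel(width=7, lam=4.0, theta=np.pi / 4,
                      psi=0.0, sigma=2.0, gamma=0.5)
print(kernel.shape)   # (7, 7): one kernel per orientation theta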
Figure 5.7. Structure of the blob-detection network: signal flows from the input layer
through the middle layer to the output layer via excitatory and inhibitory connections.
Blob Detector
Retinal connectivity has also been used as inspiration for key-point extrac-
tion [151]. A retina-inspired network can be used to convert visual input into a
multi-scale representation from which blob-like features can be extracted [103]. In
this three-layered network (Figure 5.7), the middle layer samples the input layer
with receptive fields whose weights are computed using a Gaussian function. Dif-
ferent middle layer ‘classes’ sample the input with different parameters for their
input kernels (i.e. width, σ ). Each neuron in the middle layer drives a neuron in the
output and, additionally, an inhibitory ‘interneuron’. The purpose of the inhibitory
neurons is to induce competition between the output layer neurons, reducing activ-
ity and pushing the output representation towards orthogonality. All neurons in the
output layer compete to represent the input, and the extent to which the inhibitory
neurons influence their neighbours is proportional to the cross-correlation of their
input image kernels. This competition results in centre-surround receptive fields,
as observed in biology.
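The wiring idea can be sketched in a few lines of NumPy (illustrative only: the kernel widths, σ values and the use of raw cross-correlation as the inhibition strength are assumptions for the example):

import numpy as np

def gaussian_kernel(width, sigma):
    half = width // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return g / g.sum()

# Three middle-layer 'classes', each sampling the input at a different scale
kernels = {sigma: gaussian_kernel(9, sigma) for sigma in (1.0, 2.0, 4.0)}

# Inhibition between two output neurons scaled by the cross-correlation of
# their input kernels: similar kernels compete more strongly
k_fine, k_mid = kernels[1.0].ravel(), kernels[2.0].ravel()
inhibition_weight = float(k_fine @ k_mid)
print(round(inhibition_weight, 4))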
As an example, we took the same input image as in the Gabor filtering
(Figure 5.6(a)) and processed its spike representation with this blob-detection
network using three different Gaussian kernel sizes. Figure 5.8 shows the output of the
network; we can observe that the greatest activity is present in the mid-resolution
class (Figure 5.8(b)) as it is a better fit to the input activity. The high-resolution
class (Figure 5.8(a)) shows a behaviour similar to edge detection, typical of centre-
surround filtering. Finally, as the receptive field for the low-resolution class is not a
good fit for the input, there is little activity observed.
Figure 5.8. Results of blob-detection network. (a) High-, (b) middle- and (c) low-
resolution neuron classes.
Figure 5.9. Motion sensing circuit. (a) Connectivity of the motion detection circuit using
two different neurotransmitters (green-solid and blue-dashed). (b) Delayed lines allow
spikes to reach the neuron body at the same time.
Motion Detection
Objects in the world are often moving, and since time is embedded in SNN simu-
lations, we believe it is important to detect motion. A spiking version of a motion
detector [103] was developed based on the connectivity of Starburst Amacrine Cells
(SAC) [24, 58] and the Reichardt detector [24]. The motion detector network is
illustrated in Figure 5.9(a); the principle of operation relies on two factors:
(i) delayed connections and (ii) the combination of two neurotransmitters. Delays
are proportional to distance allowing incoming spikes triggered at different times
and distances to arrive at (about) the same time (Figure 5.9(b)).
The two neurotransmitters allow activity from different regions of the input to
be present at the correct time at the detector neuron (Figure 5.10(a) and 5.10(b));
one of the neurotransmitters decays at a slow rate, opening a window for the other
transmitter (whose decay rate is high) to reach the detector.
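The delay-and-coincide principle can be shown with a toy calculation (plain Python, not the SNN implementation; the coincidence window is an assumption):

def detector_response(spike_times, positions, velocity, window=1.0):
    # Delay each spike in proportion to its distance from the detector;
    # for motion at the preferred velocity the delayed spikes coincide
    arrivals = [t + p / velocity for t, p in zip(spike_times, positions)]
    return max(arrivals) - min(arrivals) <= window

# A stimulus moving at 1 pixel/ms crosses positions 3, 2, 1 at t = 0, 1, 2
print(detector_response([0, 1, 2], [3, 2, 1], velocity=1.0))   # True
print(detector_response([2, 1, 0], [3, 2, 1], velocity=1.0))   # False: wrong direction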
We tested the circuit using a bouncing ball simulation; the ball moves in a 64×64
pixel window and, when it bounces, it does so with a randomly selected speed in
the range of 1 to 2 pixels. Figure 5.11 shows the outputs of easterly and westerly
motion detection as red-dashed and green-solid lines, respectively. Ball motion
is indicated by blue dots in the plot: the ball moved towards the north-east for
Figure 5.10. Interaction of transmitters in the motion sensing circuit. (a) When neuro-
transmitters (blue and green lines) do not reach the neuron within a temporal window,
they will not induce sufficient current for the neuron to spike. (b) In contrast, when they
reach the neuron in the right sequence, they will produce an activation.
about 500 ms, then it bounced off a corner and moved in a south-westerly direc-
tion until ∼1250 ms; finally, it took off to the north-east again. In the first part (0 to
∼1250 ms) of the experiment, detection is near perfect although there are moments
when the detectors fail to sense motion. In the last section (after ∼1250 ms), there
are multiple false-positive detections which can be diminished by lateral competi-
tion of different directions. This circuit can detect apparent motion with an accu-
racy of 70%. A similar detector, though with learned connectivity, is described in
Section 7.5.5.
5.3 SpiNNak-Ear − On-line Sound Processing

The SpiNNak-Ear system is a full-scale biological model of the early mammalian
auditory pathway: converting a sound stimulus into a spiking representation spread
across a number of parallel auditory nerve fibres [119]. This system takes advantage
of the generic digital processing elements on a SpiNNaker machine, enabling a
Digital Signal Processing (DSP) application to be distributed across its massively
parallel architecture. With the degree of parallel processing available for a SpiNNak-
Ear implementation, one is able to generate a simulation of an ear to a biologically
realistic scale (30,000 human cochlea auditory nerve fibres) in real time.
Figure 5.12. An uncoiled cochlea (right) with parallel auditory nerve fibres innervating
single IHCs along the cochlea. The spiking activity due to two stimulus frequency com-
ponents − High Frequency (HF) and Low Frequency (LF) − can be seen in the corre-
sponding auditory nerve fibres.
The middle ear connects the Tympanic Membrane (TM) to the cochlea via three
ossicle bones, continuing (and amplifying) its displacement into the inner
ear. The cochlea is a coiled, liquid-filled organ
that converts the TM displacement into a series of travelling waves along its dis-
tance, from base to apex. The frequency components of the sound stimulus dictate
the location along the cochlea that will experience the most displacement along its
Basilar Membrane (BM). High frequencies are absorbed at the basal regions and
progressively lower frequencies reach the apical regions of the cochlea. The cochlea
is lined with many motion sensitive cells, known as Inner Hair Cells (IHCs), that
detect the localised displacements of the BM. The IHCs act as the ‘biological trans-
ducers’ in the ear, converting physical sound-produced displacements into a corre-
sponding spike code signal on the auditory nerve.
The modelling of every section of the cochlea’s BM and the nearby IHCs can be
described as being ‘embarrassingly parallel’, where the processing of each individual
node (a Dual Resonance Non-Linear [DRNL] + IHC models) does not depend on
any other neighbouring nodes. Therefore, we can model the processing of specific
regions of the cochlea in a concurrent fashion.
Figure 5.13. A schematic for the human full-scale early auditory path model distribution
on SpiNNaker. The total number of cores for this simulation is 18,001, spanning
1,500 SpiNNaker chips.
A pipelined processing structure (Figure 5.14(b)) is used to achieve real-time
performance. The shared-memory communication link that triggers a
‘read from shared buffer’ event in a child IHC/AN model is implemented using a
multicast packet transmission from the parent DRNL model once it has processed a
segment. Figure 5.14(a) illustrates these two data communication methods used in
the full model system.
In the full system, the OME model application is triggered by the real-time input
stimulus, after which the subsequent DRNL and IHC/AN models in the software
pipeline are free to run asynchronously (event-driven) until the AN output stage.
In a given simulation, to confirm that all model instances have initialised or have
finished processing, we use ‘core-ready’ or ‘simulation-complete’ acknowledgement
signals fed back through the network of all connected model instances to the parent
OME model instance to ensure all cores are ready to process and data have been
successfully recorded within the given time limits.
5.3.4 Results
The output from SpiNNak-Ear simulation is compared with conventional
computer-based simulation results from the MAP model to ensure no significant
Figure 5.14. (a) The data passing method from input sound wave to the output of a single
IHC/AN instance using MC and MC with payload message routeing schemes. (b) The
pipeline processing structure used to achieve real-time performance.
numerical errors have occurred from computing the model algorithm on different
simulation hardware. The outputs from both implementations are then compared
with physiological experimental results to confirm the model’s similarities to the
biological processes it emulates.
In experimental neuroscience, the response from a stochastic auditory nerve fibre
to an audio stimulus is measured over many repeated experiments and the subse-
quent recordings are often displayed in a Peri Stimulus Time Histogram (PSTH).
The results in Figure 5.15 show the time-varying AN spike rates across 1 ms
windows in response to a 6.9 kHz sinusoidal 68 dBSPL stimulus, first in Figure 5.15(a) from
physiological data gathered by Westerman and Smith [265] and then from both
model implementations in Figure 5.15(b). These results show both implementa-
tions produce a biologically similar response consisting of pre-stimulus firings of
approximately 50 spikes/s, followed by a peak response at stimulus onset at around
800 spikes/s, decaying to an adapted rate in the region of 170 spikes/s. Finally at
stimulus removal, rates significantly drop during an offset period before returning
to spontaneous firing of approximately 50 spikes/s.
Figure 5.16 illustrates the energy consumed by MAP and SpiNNak-Ear imple-
mentations across the full range of model channels tested. Energy consumption has
been calculated by multiplying the complete processing time by the total power rat-
ing of the hardware used (CPU at 84 W, single SpiNNaker chip at 1 W). Here we
show that both implementations incur an increase in total energy consumed – but
for different reasons. The MAP implementation running on a single, fixed power
CPU uses more energy when the number of channels is increased due to the increase
in serialised processing time. The neuromorphic hardware experiences an increase
in energy consumed due to the increasing size of the machine used (number of
Figure 5.15. PSTH responses to 352 repetitions of a 400 ms 6.9 kHz 68 dBSPL stim-
ulus from experimental data obtained by Westerman and Smith [265] of an HSR
AN fibre in a gerbil (a) and the same experiment repeated for MAP and SpiNNaker
implementations (b).
chips) with an increase in channels. The rate of increase in energy consumed due
to number of channels on neuromorphic hardware is lower than the conventional
serial CPU approach. This effect illustrates the basic philosophy that underlies the
functionality of SpiNNaker (and biological) processing systems: complex compu-
tation on a modest energy budget, performed by dividing overall task workload
across a parallel network of simple and power-efficient processing nodes.
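The calculation itself is simple; a worked sketch follows (the processing times are made up for illustration, since only the power ratings are quoted above):

def energy_ws(processing_time_s, power_w):
    return processing_time_s * power_w      # energy (Ws) = time x power

cpu_power = 84.0                 # desktop CPU power rating (W), from the text
chip_power = 1.0                 # single SpiNNaker chip (W), from the text

# Hypothetical: a 10 s serialised CPU run vs. a 0.5 s real-time run on 100 chips
print(energy_ws(10.0, cpu_power))            # 840.0 Ws
print(energy_ws(0.5, 100 * chip_power))      # 50.0 Ws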
Figure 5.16. Average energy consumption from processing a 0.5 s sound sample from
2 to 3,000 channels on both MAP and SpiNNaker implementations. The MAP model is
executed on a desktop computer (Intel Core™ i5-4590 CPU @ 3.3 GHz 22 nm technology)
and SpiNNaker on a range of different sized SpiNNaker machines ranging from 1 to 1,500
chips (130 nm technology) scaled by the number of channels in a simulation.
Here we present a biologically plausible and scalable model of the Basal Ganglia
(BG) circuit, designed to run on the SpiNNaker machine [217]. It is based on
the Gurney–Prescott–Redgrave model of the BG [84, 85]. The BG is a set of
subcortical nuclei that are evolutionarily very old and appear in all vertebrates,
enabling them to make decisions and take subsequent actions; obviously, there-
fore, computational modelling of the BG has been pursued by researchers with
an interest in robotics [202]. The information on which the decision needs to be
made, that is, the environmental circumstance, constitutes the input to the BG
and is available via the thalamus and cortex. Output from the BG is the specific
action that is decided upon, referred to as ‘action-selection’, and is relayed to the
motor pathway for execution via the thalamus, cortex and other subcortical struc-
tures. The objective of our work on SpiNNaker is to build a ‘basic building block’
towards development of automated decision-making tools in real time.
A single neuro-computational unit in our BG model is simulated with a
conductance-based Izhikevich neuron model. A columnar structure of the BG cir-
cuitry is shown in Figure 5.17; this forms the basic building block for our scalable
framework and is thought to be a single ‘channel’ of action selection. The striatum
forms the main input structure of the BG and receives excitatory glutamatergic
synapses from both the cortex and the thalamus. The substantia nigra pars reticu-
lata (SNr) forms the output structure of the BG and projects inhibitory efferents
to the ventral thalamus and brainstem reticular formation.
The single-channel BG model is first parameterised on SpiNNaker to set the base
firing rates for all model cell populations, informed by prior work by Humphries
et al. [110]. Next, to simulate action selection by competing inputs, the model is
scaled up to three channels and tested with two competing inputs in the presence of
a noisy background stimulus. Results are summarised in Figure 5.18(a). An input
stimulus that is larger than the others is always the ‘winner’, indicated by a relative
drop in the firing rate of the SNr population (representing the BG model output)
in the competing channel. The reduced firing rate of the inhibitory SNr population
implies a reduced inhibition of the thalamic/brainstem cells, which are the recip-
ients of the BG output as mentioned above. This in turn means that the ‘action’
that is solicited by a relatively larger (‘competing’) input is now ‘decided’ by the
BG circuit to be ‘selected and acted upon’, indicated by disinhibition of the target
outputs. The model is tested with a competing input of 15 Hz in the presence of a
noisy background input of 3 Hz. This is further confirmed by ‘selection’ of the
larger of two competing inputs.
Figure 5.19. (a) The power consumption of the single-channel model using an in-house
Raspberry-Pi-based measurement system connected to the SpiNNaker board [244]. The
duration of recording power can be broken down into four regions: (i) booting the
machine; (ii) pre-processing of data; (iii) model execution; (iv) post-processing (i.e. data
extraction); the delay of around 4 s after booting the machine is inserted for clarity. The
peak-to-peak power in region (iii) is 800 mW. (b) Performance analysis of single-channel
and three-channel models running on both SpiNNaker and SpineML. Execution time on
SpiNNaker, and pre- and post-processing times on SpineML, are unaffected by
scaling up the model.
5.5 Constraint Satisfaction

Some computational problems are hard in a fundamental sense: it does not matter
how much we improve the speed, power consumption
or size of our computers; there are families of problems which, despite being
solvable in principle with infinite resources, will remain intractable at least until
some exotic machine demonstrates an exponential speedup. Quantum and genetic
computers, at least in theory, promise advances in this direction, but the practi-
calities currently seem to be out of scope. Worse than that are the undecidable or
unsolvable problems. Hence, knowing the performance and complexity of a new
computer architecture in the hierarchy of computable and incomputable problems
will shed light on realistic directions for optimisation and improvement, avoiding
the use of valuable time on aspects that will not add significant scientific or tech-
nological value.
Constraint Satisfaction Problems (CSPs) are a special family of problems that
serve such a purpose. They are beautifully simple to formulate, yet they belong to
the class of intractable problems (the NP-complete family). These are problems
whose solutions are verifiable in Polynomial time (P), yet finding their solution
requires supra-polynomial time as a function of the size of the problem. Indeed,
evidence suggests that the time complexity may be exponential; that is, a linear
increase in the problem size results in an exponential increase in the required
resources: time or space, memory or energy.
Formally, a CSP is defined by a set of variables $X = \{x_1, \ldots, x_N\}$ that take
values over a set of discrete or continuous domains $D = \{D_1, \ldots, D_N\}$, such
that a set of constraints $C = \{C_1, \ldots, C_m\}$ is satisfied. Each such constraint is
defined as a tuple $C_i = \langle S_i, R_i \rangle$, where $R = \{R_1, \ldots, R_k\}$ are $k$ relations over $m$
subsets $S = \{S_1, \ldots, S_m : S_i \subseteq X\}$. In short, $CSP = \langle X, D, C \rangle$. Hence, the
problem is defined over a combinatorial space whose size is of the order of $|D|^N$,
growing exponentially with $N$. Every solution to a CSP will have zero violations
and include all variables in $X$. Hence, it will be represented by a global minimum
of the cost hypersurface. If the problem has several solutions, the global minimum
will be degenerate, one minimum existing for each solution. It is easy to see then
that the difficulty of finding a solution for a CSP depends not only on the high
dimensionality of its combinatorial space but also critically on the curvature of that
space. Here, the curvature refers to how folded the space of possible evaluations of
X is when measured against a scalar (energy or cost) function related to the number
of unsatisfied constraints. If the cost function is strictly convex, there will be a single
minimum and methods such as gradient descent will easily find it. Unfortunately,
this is rarely the case.
With a geometrical representation of CSPs, it is easy to imagine solving the prob-
lem by travelling across the cost hypersurface, defined on some high-dimensional
space, looking for a global minimum. Think of it as being like an adventurous
explorer in the middle of the Amazon rain forest, perhaps searching for some
A neuron fires at times $\{t_f \mid u(t_f) = \theta \text{ and } \frac{du}{dt}\big|_{t=t_f} > 0\}$. Immediately after a spike, the potential is
reset to a value $u_r$, such that $\lim_{t \to t_f^+} u(t) = u_r$. In our network, synapses are
uniquely characterised by $\omega_{ij}$ and the inter-neural separation is introduced by
means of a delay $\Delta_{ij}$. In biological neurons, each spike event generates an electrochemical
response on the post-synaptic neurons characterised by $R_{i,j}$. We use the
same function for every pair $(i, j)$, and this is defined by the post-synaptic current:

$$\epsilon(t) = \frac{q}{\tau} e^{-\frac{t - t_0}{\tau}} \Theta(t - t_0), \qquad (5.5)$$

where $q$ is the total electric charge transferred through the synapse, $\tau$ is the
characteristic decay time of the exponential function, $t_0 = t_f + \Delta_{ij}$ is the arrival
time of the spike and $\Theta$ represents the Heaviside step function. The choice of $R_{i,j}$
potentially affects the network dynamics, and although there are more biologically
realistic functions for the post-synaptic response, the use of the exponential
function in Equation 5.5 constitutes one of our improvements over previous studies
of CSPs through SNNs, which used a simple square function.
In an SNN, each neuron is part of a large population. Thus, besides the back-
ground current I (t), it receives input from the other neurons, as well as a stochastic
stimulation from noisy neurons implementing a Poisson process. In this case, the
temporal evolution of the membrane potential (Equation 5.4) generalises to:
$$\tau_m \frac{du}{dt} = -u(t) + R I(t) + \sum_j \omega_j \sum_f \epsilon(t - t_j^f) + \sum_k \eta_k \epsilon(t - T_k), \qquad (5.6)$$

where the index $f$ runs over the spike times of principal neuron $j$ in the
SNN, $\eta_k$ is the strength of the $k$th random spike, occurring at time $T_k$, and
$\epsilon(\cdot)$ is the response function of Equation 5.5. An SNN has the advantage that
its microstate $\psi_t = \{n_1, n_2, \ldots, n_N\}$ at any time $t$ can be defined by the binary
firing state $n_i \in \{0, 1\}$ of each neuron $N_i$, instead of the continuous membrane
potentials $u_i \in \mathbb{R}$. Then, the set of firing times $\{t_i^f\}$ for every neuron $N_i$, or
equivalently the set of states $\{\psi_t\}$, corresponds to the trajectory (dynamics) of the
network in the state space. The simulations in this work happen in discrete time
(time step = 1 ms) so, in practice, $\psi_t$ defines a discrete stochastic process (e.g. a
random walk). If the next network state $\psi_{t_{i+1}}$ depends on $\psi_{t_i}$ but is conditionally
independent of any $\psi_{t_j}$ with $j < i$, the set $\{\psi_t\}$ also corresponds to a Markov chain.
Habenschuss et al. [89] demonstrated that this is the case when using rectangular
Post-Synaptic Potentials (PSPs) and a generalised definition of the network state;
however, the validity of the Markov property for general SNNs could still depend on the
dynamical regime and be affected by the presence of a non-zero probability current
for the stationary distribution [39]. Each possible configuration of the system, a
microstate ψi , happens with certain probability pi and, in general, it is possible
to characterise the macroscopic state of the network with the Shannon entropy (in
units of bits) [221]:
$$S = -\sum_i p_i \log_2 p_i \qquad (5.7)$$
To compute $p_i$ and hence Equation 5.7, we binned the spikes from each
simulation with time windows of 200 ms. In this type of high-dimensional
dynamical system, the particular behaviour of a single unit is sometimes not as
relevant as the collective behaviour of the network, described, for example, by
Equations 5.7 and 5.8.
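A sketch of this estimate (NumPy, with toy spike trains; each 200 ms window's binary firing pattern is treated as one microstate ψ):

import numpy as np
from collections import Counter

def network_entropy(spike_trains, t_max, bin_ms=200):
    n_bins = int(t_max // bin_ms)
    states = []
    for b in range(n_bins):
        lo, hi = b * bin_ms, (b + 1) * bin_ms
        # binary firing state of each neuron within this window
        states.append(tuple(int(np.any((st >= lo) & (st < hi)))
                            for st in spike_trains))
    probs = np.array([c / n_bins for c in Counter(states).values()])
    return float(-(probs * np.log2(probs)).sum())         # Equation 5.7

trains = [np.array([50, 300, 900]), np.array([120, 450])]   # spike times (ms)
print(round(network_entropy(trains, t_max=1000), 3))        # 1.922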
A constraint satisfaction problem $\langle X, D, C \rangle$ can now be expressed as an SNN as
shown in the pseudo-code of Listing 5.1. We can do it in three basic steps: (a) create
SNNs for each domain $d_i$ of each variable, where every neuron is excited by its
associated noise source, providing the necessary energy to begin exploration of the
states $\{\psi\}$; (b) create lateral-inhibition circuits between all domains that belong to
the same variable; and (c) create lateral-inhibition circuits between equivalent domains
of all variables appearing in a negative constraint and lateral-excitation circuits for
domains in a positive constraint. With these steps, the resulting network will be
a dynamical system representation of the original CSP. Different strategies can
now be implemented to enforce the random process over states ψt to find the
configuration $\psi_0$ that satisfies all the constraints. The easiest way, and the one proposed here, of
implementing such strategies is through the functional dependence of the noise
intensity on time. The size of each domain population should be large enough to
average out the stochastic spike activity. Otherwise, the system will not be stable
and will not represent quasi-equilibrium states. As will be shown, it is the size of
the domain populations that allows the system to converge to a stable solution.
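A hedged PyNN-style sketch of these three steps (simplified and illustrative: the population sizes, weights, connection probabilities and noise rates are all assumptions, and only negative constraints are wired):

import pyNN.spiNNaker as sim

def build_csp_network(variables, domains, constraints, pop_size=25):
    sim.setup(timestep=1.0)
    net = {}
    # (a) one noise-driven population per value in each variable's domain
    for x in variables:
        for d in domains[x]:
            pop = net[(x, d)] = sim.Population(pop_size, sim.IF_curr_exp())
            noise = sim.Population(pop_size, sim.SpikeSourcePoisson(rate=20.0))
            sim.Projection(noise, pop, sim.OneToOneConnector(),
                           sim.StaticSynapse(weight=1.5),
                           receptor_type='excitatory')
    # (b) lateral inhibition between the domains of the same variable (WTA)
    for x in variables:
        for d1 in domains[x]:
            for d2 in domains[x]:
                if d1 != d2:
                    sim.Projection(net[(x, d1)], net[(x, d2)],
                                   sim.FixedProbabilityConnector(0.5),
                                   sim.StaticSynapse(weight=1.0),
                                   receptor_type='inhibitory')
    # (c) inhibition between equal domains of variables in a negative constraint
    for (x, y) in constraints:
        for d in domains[x]:
            if d in domains[y]:
                sim.Projection(net[(x, d)], net[(y, d)],
                               sim.FixedProbabilityConnector(0.5),
                               sim.StaticSynapse(weight=1.0),
                               receptor_type='inhibitory')
    return net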
The ensemble of populations assigned to every CSP variable xi works as a
Winner-Takes-All (WTA) circuit through inhibitory synapses between domain
populations, which tends to allow a single population to be active. However, the
restriction should not be imposed too strongly, because it could generate saturation
and trap the network in a local minimum. Instead, the network should
constantly explore configurations in an unstable fashion, converging to equilib-
rium only when satisfiability is found. The random connections between popula-
tions, together with the noisy excitatory populations and the network topology,
152 Applications − Doing Stuff on the Machine
provide the necessary stochasticity that allows the system to search for satisfiable
states. However, this same behaviour traps some of the energy inside the network.
For some problems, a dissipation population could be created to balance the input
and output of energy or to control the entropy level during the stochastic search.
In general, there may be situations in which the input noise acquired through stim-
ulation can stay permanently in the SNN. Thus, the inclusion of more excitatory
stimuli will saturate the dynamics at very high firing rates, which potentially could
reach the limits of the SpiNNaker communication fabric. In these cases, inhibitory
noise is essential too and allows us to include arbitrarily many stimulation pulses.
We demonstrate in the next section that the simple approach of controlling the
dynamics with the stimulation intensities and times of the Poisson sources provides
an efficient strategy for a stochastic search for solutions to the studied CSPs.
5.5.2 Results
In order to demonstrate the implementation of the SNN solver, we present solu-
tions to some instances of Non-deterministic Polynomial time (NP) problems.
Among the NP-complete problems, we have chosen to showcase instances of graph
colouring, Latin squares and Ising spin glasses. Our aim is to offer a tool for the
development of stochastic search algorithms in large SNNs. We are interested in
Figure 5.20. (a) Solution to the map colouring problem of the world with 4 colours and
of Australia and Canada with 3 colours (insets). Figure (b) shows the graph of bordering
countries from (a). The plots of the entropy H (top), mean firing spike rate ν (middle)
and states count (bottom) versus simulation time are shown in (c) and (d) for the
world and Australia maps, evidencing the convergence of the network to satisfying sta-
tionary distributions. In the entropy curve, red codes for changes of state between suc-
cessive time bins, green for no change and blue for the network satisfying the CSP. In
the states count line, black dots mean exploration of new states; the dots are yellow
if the network returns to states visited before. In (e), we have plotted the population
activity for four randomly chosen CSP variables from (a), each line represents a colour
domain.
search. Interestingly, although the network has converged to satisfaction during the
last 20 s (blue region in Figure 5.20(c)), the bottom right plot in Figure 5.20(e)
reveals that due to the last stimulation the network has swapped states preserving
satisfaction, evidencing the stability of the convergence. Furthermore, it is notice-
able in Figure 5.20(d) that new states are visited after convergence to satisfiability;
this is due to the fact that, when multiple solutions exist, all satisfying configura-
tions have the same probability of happening. Although we choose planar graphs
here, the SNN can implement any general graph; hence, more complicated P and
NP examples could be explored.
Constraint Satisfaction 155
Figure 5.21. SNN solution to Sudoku puzzles. (a–c) show the temporal dependence of the
network entropy H, firing rate ν and states count for the easy (g), hard (h) and AI
escargot (i) puzzles. The colour code is the same as that of Figure 5.20. In (g–i), red is used
for clues and blue is used for digits found by the solver. Figures (d) and (f) illustrate the
activity for a randomly selected cell from (a) and from (c), respectively, evidencing
competition between the digits; the lines correspond to a smoothing spline fit. (e) Schematic
representation of the network architecture for the puzzle in (a).
Among the puzzles solved is the AI Escargot puzzle, which has been claimed to be the hardest Sudoku. The temporal
dependence of the network entropy H , firing rate ν and states count is shown
in Figures 5.21(a)–(c), respectively, for the easy (5.21(g)), hard (5.21(h)) and AI
escargot (5.21(i)) puzzles. In Figure 5.21(e), we show a schematic representation of
the dimensionality of the network for the easy puzzle (g); each sphere represents a
single neuron and synaptic connections have been omitted for clarity; the layer for
digit 5 is represented also showing the inhibitory effect of a single cell in position
(1,3) over its row, column, subgrid and other digits in the cell. In this case, the total
number of neurons is ≈37 k and they form ≈86 M synapses.
One major improvement of our implementation with respect to the work of
Habenschuss et al. [89] is the convergence to a stable solution; this is arguably due
to the use of subpopulations instead of single neurons to represent the domains
of the CSP variables as these populations were required to provide stability to the
network. The use of the more realistic exponential post-synaptic potentials instead
of the rectangular ones used by Habenschuss et al. [89] helps deliver a good search
performance as shown in the bottom plots in Figure 5.21(a)–(c), where the solution
is found after visiting only 3, 12 and 26 different states and requiring 0.8 s, 2.8 s
and 6.6 s, respectively, relating well with the puzzle hardness. It is important
to highlight that the measurement of the difficulty level of a Sudoku puzzle is still
ambiguous, and our solver could need more complex strategies for different puzzles.
For example, in the transient chaos-based rating, the ‘platinum blonde’ Sudoku is
rated as one of the hardest to solve; although we have been able to find a solution
for it, that solution is not stable, which means one should control the noisy network
dynamics in order to survive the long escape rate of the model presented by Ercsey-Ravasz
and Toroczkai [57]. We show in Figure 5.21(d) and (f) the competing activity of
individual digit populations of a randomly chosen cell in both the easy and the
AI escargot puzzles. The dynamic behaviour resembles that of the dynamic solver
in Figure 2 of the work by Ercsey-Ravasz and Toroczkai [57] for this same easy
puzzle and platinum blonde. Further analysis would bring insights into the chaotic
dynamics of SNNs when facing constraints.
The interaction between spins $\{\vec{S}_i\}$ is considered only between nearest neighbours
and is represented by a constant $J_{i,j}$, which determines whether the two neighbouring
spins will tend to align parallel ($J_{i,j} > 0$) or anti-parallel ($J_{i,j} < 0$) with each
other. Given a particular configuration of spin orientations $\omega$, the energy of the
system is then given by the Hamiltonian operator:

$$\hat{H} = -\sum_{i,j} J_{i,j} \vec{S}_i \cdot \vec{S}_j - h \sum_i S_i, \qquad (5.9)$$
where $h$ is an external magnetic field that tends to align the spins in a preferential
orientation [9]. In this form, each $J_{i,j}$ defines a constraint $C_{i,j}$ between the values
$D = \{+1, -1\}$ taken by the variables $\vec{S}_i$ and $\vec{S}_j$. It is easy to see that the more
constraints are satisfied, the lower the value of $\hat{H}$ in Equation 5.9 becomes. This simple
model allows the study of phase transitions between disordered configurations at
high temperature and ordered ones at low temperature. For ferromagnetic ($J_{i,j} > 0$)
and antiferromagnetic ($J_{i,j} < 0$) interactions, the configurations are similar to those
in Figure 5.22(d) and (e) for 3D lattices. These correspond to the stable states of
our SNN solver when the Ising models for $J_{i,j} > 0$ and $J_{i,j} < 0$ are mapped to an
SNN using Algorithm 5.1 and a 3D grid of 1,000 spins. Figure 5.22(g) shows the
result for a 1D antiferromagnetic spin chain. It is interesting to note that the statis-
tical mechanics of spin systems has been extensively used to understand the firing
dynamics of SNNs, presenting a striking correspondence between their behaviour
even in complex regimes. Our framework also allows the inverse problem to be posed, mapping SNN dynamics to spin interactions. This equivalence between dynamical systems and algorithms is widely accepted, and we see an advantage in computing directly between equivalent dynamical systems. However, it is clear that the
network parameters should be adequately chosen in order to keep the computation
valid.
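As an illustration of Equation 5.9, the following minimal Python sketch (function and variable names are ours, not part of the solver) evaluates the Hamiltonian of a spin configuration; for a 1D antiferromagnetic chain, the alternating configuration of Figure 5.22(g) satisfies every constraint and attains the minimum energy.

    import numpy as np

    # Energy of a spin configuration under the Hamiltonian of Equation 5.9.
    def ising_energy(spins, J, h=0.0):
        interaction = -sum(J_ij * spins[i] * spins[j] for (i, j), J_ij in J.items())
        return interaction - h * spins.sum()

    # A 1D antiferromagnetic chain: J_ij < 0 prefers anti-parallel neighbours.
    n = 8
    J_af = {(i, i + 1): -1.0 for i in range(n - 1)}
    alternating = np.array([(-1) ** i for i in range(n)])  # every constraint satisfied
    aligned = np.ones(n, dtype=int)                        # every constraint violated
    print(ising_energy(alternating, J_af), ising_energy(aligned, J_af))  # -7.0 7.0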
If, instead of fixing $J_{i,j}$ to some value U for all spin pairs {(i, j)}, one allows it to take random values from {U, −U} with probabilities $p_{AF}$ and $p_{FM}$, certain interactions will be frustrated (unsatisfiable constraints). Figure 5.22(f) illustrates this frustration with three antiferromagnetically interacting spins arranged such that any choice of orientation for the third spin will conflict with one or the other of its neighbours. This extension of the Ising model, in which the grid of interactions is a random mixture of AF and FM interactions, was described by Surungan et al. [246]. This model represents the spin glass systems found in nature;
these are crystals with low concentrations of magnetic impurities that, due to the
frustrated interactions, are quenched into a frozen random configuration when the
temperature is lowered (at room or high temperature the magnetic moments of a
material are constantly and randomly precessing around their average orientation).
Figure 5.22. SNN simulation of Ising spin systems. (a) and (b) show two 2-dimensional spin glass quenched states obtained with interaction probabilities $p_{AF} = 0.5$ and $p_{AF} = 0.1$. The results for the three-dimensional lattices for CSPs of 1,000 spins with ferromagnetic and antiferromagnetic coupling constants are shown in (e) and (d), respectively. In (c), the temporal dependence of the network entropy H, firing rate ν and states count during the stochastic search for the system in (d) is plotted. (f) illustrates the origin of frustrated interactions in spin glasses. (g) depicts the result for the one-dimensional chain.
The statistical analysis of those systems was fundamental for the evolution of artifi-
cial neural networks and machine learning. Furthermore, the optimisation problem
of finding the minimum energy configuration of a spin glass has been shown to be
NP-complete [9]. The quenching of the grid happens when it gets trapped in a
local minimum of the state space of all possible configurations. In Figure 5.22(a)
and (b), we show a quenched state found by our SNN with pAF = 0.5 and
pAF = 0.1, respectively. A spin glass in nature will often be trapped in local min-
ima and will need specific temperature variations to approach a lower energy state;
our SNNs replicate this behaviour and allow for the study of thermal processes,
controlling the time variation and intensity of the excitatory and inhibitory stim-
ulations. If the underlying stochastic process of such stimulations is a good representative of heat in solids, they will correspond to an increase and a decrease of the temperature, respectively.
Chapter 6
From Activations to Spikes
Tackling real-world tasks requires being comfortable with chance, trading off time with
accuracy, and using approximations.
1. https://openai.com/five/
6.1 Classical Models
We call the well-known and widely used deep learning models ‘classical’ and
give a brief introduction to those models in this section. As mentioned above,
the first breakthrough in training deep (>2-layer) networks was the greedy layer-wise strategy [98] proposed to train stacked Restricted Boltzmann Machines (RBMs). Shortly after, this method also proved effective for training other kinds of deep networks, including stacked autoencoders (AEs) [13].
RBMs and AEs are suitable for dimensionality reduction and feature extraction
when trained with unsupervised learning on unlabelled data. In 2012, using
such an unsupervised deep learning architecture, the Google Brain team achieved
a milestone in the deep learning era: the neural network learned to recognise
cats by ‘watching’ 10 million images generated from random frames of YouTube
videos [137].
Convolutional Neural Networks (ConvNets) are loosely inspired by biology, in particular the significant discovery of Hubel and Wiesel that simple cells respond preferentially to oriented bars (convolution) while complex cells collate responses from the simple ones (pooling); these are believed to be the basic functions of the primary visual cortex in cats [109]. Simple cells fire at a high frequency in response to their preferred orientation of visual stimuli within their receptive fields, small subregions of the visual field. Meanwhile, a complex cell corresponds to the existence
of a pattern within a larger receptive field but loses the exact position of the pattern.
The NeoCognitron [63] was the first network to mimic the functions of V1 sim-
ple and complex neurons in an ANN, and later, this feature detection of single cells
was improved by sharing weights among receptive fields in LeNet-5 [138]; typically,
ConvNets follow the same principle to this day. The mechanism of shared weights
forms the essence of convolution in a ConvNet, which hugely reduces the number
of trainable parameters in a network. The usual procedure to train ConvNets is a
supervised one and is known as the back-propagation algorithm; it relies on the
calculus chain rule to send error signals through the layers of the network starting
from the output and ending at the input.
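As a minimal sketch of the chain rule at work, the following Python fragment propagates an error signal back through one hidden layer (sigmoid activations and squared error are assumed here for brevity; shapes and learning rate are illustrative, not tied to any particular network in this chapter).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x, t = np.random.rand(4), np.array([1.0])      # input and target
    W1, W2 = np.random.randn(3, 4), np.random.randn(1, 3)

    h = sigmoid(W1 @ x)                            # forward pass
    y = sigmoid(W2 @ h)

    delta2 = (y - t) * y * (1 - y)                 # error signal at the output...
    delta1 = (W2.T @ delta2) * h * (1 - h)         # ...sent back through the layers

    W2 -= 0.1 * np.outer(delta2, h)                # gradient-descent weight updates
    W1 -= 0.1 * np.outer(delta1, x)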
The most significant ConvNet examples have dominated the annual ImageNet Challenge [215]: AlexNet [132], VGG Net [228], GoogLeNet [249], ResNet [93] and MobileNet [108].
Despite the powerful capabilities of these feed-forward deep networks, sequence processing is a challenge for them since the sizes of the input and output vectors are constrained by the number of neurons. Thus, Recurrent Neural Networks
(RNNs), containing feed-back connections, are ideal solutions for dealing with
sequential information since their current output is always dependent on the pre-
vious ‘memory’. As training mechanisms have become more mature, for example,
using Long Short-Term Memory (LSTM) [99], RNNs have shown great success in
many natural language processing tasks: language modelling [166], machine trans-
lation [247], speech recognition [83] and image caption generation [125].
The current trend in deep learning is to combine Machine Learning (ML) algo-
rithms towards more complex objectives such as sequential decision-making and
data generation.
Reinforcement Learning (RL) is inspired by animal behaviour, in which agents learn to make sequential, optimised decisions to control an environment [248]. To address complex decision-making problems in real life, RL requires a sufficiently abstract representation of the high-dimensional environment. Fortunately,
deep learning nicely complements this requirement and performs effectively at
dimensionality reduction and feature extraction. Advances in RL techniques, such
as asynchronous advantage actor-critic (A3C) [167], are what allowed DeepMind
and OpenAI to perform the feats presented at the beginning of this chapter.
Generative Adversarial Networks (GANs) [80] are proposed for training gener-
ative models of complex data. Instead of training discrimination networks (e.g.
image classification using ConvNets) and generation networks (e.g. data sam-
pling on RBMs) separately with different objectives, GANs train two competing
networks – one the discriminator, the other the generator – simultaneously by mak-
ing them continuously play games with each other. Thus, the generator learns to
produce more realistic data to fool the discriminator, while the discriminator learns
to become better at distinguishing generated from real data. Exciting achievements
have been reported in generating complex data such as realistic image generation
based on descriptions in text [203].
6.2 Symbol Card Recognition System with Spiking ConvNets

The ConvNet is the most commonly used machine learning architecture for image recognition. It is a biologically inspired generic architecture for intelligent data processing [139]. The generic architecture of a ConvNet for visual object recogni-
tion is depicted in Figure 6.1. The visual scene coming out of the retina is fed to a
sequence of layers that emulate the processing layers of the brain visual cortex. Each
layer consists of the parallel application of 2D-filters to extract the main image char-
acteristics. Each image representation obtained is named a feature map. The first
layer extracts oriented edges of the image according to different orientations and
different spatial scales. The subsequent layers combine the feature maps obtained
in the previous layers to detect the presence of combinations of edges, detecting
progressively more complex image characteristics, until achieving the recognition
of complex objects in the higher levels. Along the ConvNet layers, the sizes of the
feature maps are progressively reduced through applying image subsampling. This
subsampling process is intended to introduce invariance to object size and position.
In conventional AI vision systems, the ConvNet architectures are used in a
frame-based manner. A frame representing the particular scene to be analysed is fed
to the architecture. The output of the different convolutional layers is computed
in a sequential way (layer by layer) until a valid output is obtained in the upper
layer indicating the category of the recognised object. However, this is not what
happens in biological brains. In a biological system, the retina ‘pixels’ send, in an
asynchronous way, sequences of spikes representing the visual scene. Those spikes
are sent through the optic nerve to the visual cortex where they are processed as they
arrive by the subsequent neuron layers with just the delay of the spike propagation
and neuron processing.
We have used the SpiNNaker platform to implement a spiking ConvNet. Each
time a spike is generated by a neuron in a layer, the spike is propagated to the
connected neural populations of the next layer, and the weights of the corresponding 2D-filter kernels are added to the neuron states of the subsequently connected layer. In this way, the convolution is performed on the fly in a spike-based manner [220].
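The event-driven convolution can be sketched as follows (a simplification of the scheme in [220]; array sizes, threshold and reset behaviour are illustrative, not the actual SpiNNaker kernel code):

    import numpy as np

    K = np.random.randn(5, 5)        # 2D filter kernel of one feature map
    state = np.zeros((32, 32))       # membrane states of the target layer
    THRESHOLD = 1.0
    r = K.shape[0] // 2

    def on_input_spike(x, y):
        """Add the kernel around (x, y); emit spikes where the threshold is crossed."""
        out = []
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                u, v = x + dx, y + dy
                if 0 <= u < state.shape[0] and 0 <= v < state.shape[1]:
                    state[u, v] += K[dx + r, dy + r]
                    if state[u, v] >= THRESHOLD:
                        out.append((u, v))      # spike propagated to the next layer
                        state[u, v] = 0.0       # reset after firing
        return out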
The input stimulus provided to the system is a flow of spikes representing the
symbols of a poker card deck passing in front of a DVS [141, 143, 219] at high
speed. We used an event-driven clustering algorithm [47] to track the passing sym-
bols and, at the same time, we adjusted the tracking area to a 32 × 32 resolution.
Each symbol passed in 10–20 ms producing 3 k–6 k spikes. The 40 symbols passed
in 0.95 s generating a total of 174,643 spikes.
To achieve real-time recognition with, at the same time, reproducibility of the
recordings, we loaded the spike sequence onto a data player board [218]. The data
player board stores the neuron addresses and timestamps of the recorded spikes in
a local memory and reproduces them as events through a parallel AER link in real
time. The parallel AER events are converted to the 2-of-7 SpiNNaker protocol and
fed in real time to the SpiNNaker machine.
The particular ConvNet architecture used for the card symbol recognition task
is detailed in Figure 6.2. It consists of three convolutional layers (C1, C3 and C5)
interleaved with two subsampling layers (S2 and S4) and a final fully-connected
category layer (C6). Table 6.1 details the numbers and sizes of the feature maps as
well as the sizes of the kernels in each layer.
The kernels in the first layer are a set of six Gabor filters in three different orientations and two different spatial scales; they are fixed, not trained. The rest of the network weights are trained using frames and a method to convert the weights to the spiking domain [190].

Table 6.1. Number and size of layers in card symbol ConvNet architecture.
ConvNet structure: C1, S2, C3, S4, C5, C6.
Neuron state and firing times are individual for each neuron. In the standard
SpiNNaker tool chain, all the neuron parameters are replicated and stored individ-
ually for each neuron in the DTCM. Thus, the DTCM capacity sometimes limits
the number of neurons that can be implemented per core. Here the tool chain was
also modified to distinguish between the parameters that are individual for each neuron and the parameters shared by the whole population. With this approximation, we are able to implement 2,048 convolution neurons per core, where this number
is determined by the maximum number of addressable neurons supported by the
routeing scheme.
6.2.2 Results
To test the recognition rate, we used a test sequence of 40 tracked symbols at 32 × 32 resolution, obtained from the events recorded with a DVS [219]. As already explained, the
recording consists of a total of 174,643 spikes encoded as AER events.
We first tested the correct functionality of the ConvNet for card-symbol classi-
fication programmed on the SpiNNaker board at low speed. For this experiment,
we multiplied the timestamps of all of the events of the sequence reproduced by the data player board by a factor of 100. To maintain the same classification capability as the ConvNet architecture optimised for card symbol recognition, we had to multiply the time parameters of the network (the refractory and leakage times) by the same factor of 100. In Figure 6.3, we reproduce four snapshots of the
composition grabbed with the jAER board of the input stimulus and the output
category obtained with the SpiNNaker ConvNet classifier. As can be seen, correct
classification of the four card symbols is obtained. These snapshots are generated
by collecting and histogramming events with the jAER [178] board over 1.2 ms.
The classification of the test sequence [190] of 40 card symbols, slowed down by a factor of 100, was repeated 30 times. During the appearance time of each input symbol,
the number of output events generated by the correct output category was counted
as well as the number of output events generated by each of the other three out-
put categories. The classification is considered successful if the number of output
events of the correct category is the maximum. The mean classification success rate was 97% over the 30 repetitions of the experiment, with a maximum of 100% and a minimum of 93%, thus achiev-
ing a recognition success rate slightly higher than the one obtained in the software
real-time experiment [190].
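The decision rule can be summarised in a few lines (hypothetical event format, (time, category); not the actual analysis script):

    from collections import Counter

    def classify(output_events, t_start, t_end):
        """Winning category = the one with the most output events in the window."""
        counts = Counter(c for (t, c) in output_events if t_start <= t < t_end)
        return counts.most_common(1)[0][0] if counts else None

    events = [(0.1, 'club'), (0.2, 'club'), (0.3, 'heart'), (0.4, 'club')]
    print(classify(events, 0.0, 0.5))   # -> 'club'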
Once we had tested that the SpiNNaker ConvNet classifier functionality was
correct, we tested its maximum operation speed. For that purpose, we repeated the
experiment for different slow-down factors of the event timings of the input stim-
ulus sequence while, at the same time, we applied the same factor to the ConvNet
Figure 6.3. Snapshots merging the input stimulus with the SpiNNaker classifier output. The input stimulus was generated with a slow-down factor of 100 relative to the real recording time.
Figure 6.4. (a) Recognition rate for the sequence of 40 card symbols versus the slow-
down factor of the input stimulus. (b) Total number of output events generated by the
output recognition layer for the whole sequence of 40 card symbols versus the slow-
down factor of the input stimulus.
In Figure 6.4(a), a slow-down factor of 1 corresponds to real-time operation, that is, classification of the sequence of high-speed browsed cards as they each pass in a 400 microsecond interval. We can observe that for slow-
down factors higher than 25, the mean successful classification rate is higher than
90%. However, for slow-down factors lower than 25, the success recognition rate
suffers from a severe degradation. In Figure 6.4(b), we have plotted the total number
of output events generated at the output of the SpiNNaker classifier as a function
of the input stimulus slow-down factor. We can observe that for slow-down fac-
tors below 50, the number of SpiNNaker output events decreases quickly. Another
observation is that there is a local peak in the recognition rate (and the correspond-
ing number of output events) at a slow-down factor of 10. For higher slow-down
factors (slow-down factors of 15 and 20), the recognition rate and the number of
output events in the category layer are lower.
Going into the details of the problem, we observed that the main bottleneck
that limits the operation of the system is the processing time of the events in the
convolution layers. We have also observed that when events are unevenly lost in
subsequent layers, the spatio-temporal congruence of the recognised patterns is lost
and the recognition rate decreases. This phenomenon has already been reported
by Camuñas [31] who observed that queuing events in a highly saturated event
processing system gives a worse performance than simply dropping them, because
queuing introduces time delays, while dropping keeps the temporal coherence of
the processed events. In the present case, when events are lost simultaneously in the
different processing layers, the performance is better than when there is a layer that
has a dominant delay. This explains the lower recognition accuracy for intermediate
slow-down factors.
In Figure 6.2, we show in red numbers the total number of events that enter into
the corresponding layer that have to be processed by each feature map. It can be
observed that each neural population in the second convolutional layer (C3) has to
process 4.7× more events per second than the first convolution layer (C1). As we have maximised the number of neurons implemented per SpiNNaker core, this has the downside that each core in layer C3 has to process the incoming 816,163 events in 0.95 seconds for real-time operation.
As the SpiNNaker architecture is flexible, it allows us to trade-off the maximum
number of neurons per core against the maximum event processing throughput.
In a first experiment, we noticed that more than one half of the weights in the
first convolution layer (C1) were zero. Zero weights in the kernel add computation
time per event but do not affect the result of the computation. So, we eliminated all
the zero values of the kernels in the first convolution layer. In Figure 6.5, we have
plotted in blue the recognition rate of the original experiment (before eliminating
the zero elements in the C1 kernels) and in the green trace, we plot the recognition
rate after eliminating the zero values in the C1 kernels. It can be observed that
Figure 6.5. Recognition accuracy for the whole sequence of 40 card symbols versus the
slow-down factor of the input stimulus when splitting each C3 neural population among
several cores.
both systems perform similarly for low and high slow-down factors. However, the
‘optimised’ system has worse performance for intermediate slow-down factors. The
reason is that by speeding up the operation of the first convolutional layer (C1), we introduce more decorrelation between the first and the second convolutional layer (C3), as the second convolutional layer (C3) is the one causing the performance bottleneck
in this particular case.
In a further experiment, to speed-up the operation of the second convolu-
tional layer (C3), we mapped each neural population of layer C3 onto different
SpiNNaker cores. Figure 6.5 plots the recognition rates obtained for different dis-
tributions of the feature map populations of the second convolutional layer (C3).
In these experiments, we kept the elimination of the zero kernel elements in the
C1 layer. In Figure 6.5, the red trace corresponds to splitting each C3 feature map
operation across 2 cores. The cyan, black and magenta traces correspond to split-
ting each C3 feature map across 4, 5 and 6 cores, respectively. As can be observed,
the 4-core division gives the optimum performance as it equalises the delays of the different layers. If the C3 layer is sped up further, the delay of the third convolutional layer (C5) becomes dominant.
6.3 Handwritten Digit Recognition with Spiking DBNs

Figure 6.6. An RBM with full connectivity between visible units (bottom) and hidden units (top), but no connections within the same layer [241].
The work presented in this section was carried out in collaboration with the Université Pierre et Marie Curie in Paris, France. A more thorough investigation of the work presented here can be found elsewhere [239–241, 243].
Spiking DBNs on SpiNNaker
The SpiNNaker computing platform was used to investigate the robustness of spik-
ing DBNs to various hardware limitations that are present in digital neuromorphic
architectures such as the limited memory available to store synaptic weights, the bit
precision used to represent weights and neuron states, and the input sensor noise
commonly found in DVS sensors (silicon retinas).
Porting DBNs onto SpiNNaker
For the experiments in this section, the same pre-trained DBN from O’Connor
et al. [183] was used. This DBN consists of an input layer of 784 neurons (the
28×28 MNIST image is flattened to a vector), followed by two hidden layers of
500 neurons each and 10 output neurons, one neuron per digit. This model has in
total 647,000 synapses.
After the training process is over, the DBN is mapped to an SNN by replac-
ing the Siegert activation function with an LIF neuron model using the following
equations:
$$\tau_m \frac{dV}{dt} = E_L - V + R_m I, \qquad (6.1)$$
where $\tau_m$ is the membrane time constant, $E_L$ is the resting potential and $R_m$ is the membrane resistance. The input current I is computed using a Dirac delta synapse model,
$$I = \sum_{i=0}^{n} w_i \sum_{j=0}^{m_i} \delta(t - t_{ij}), \qquad (6.2)$$
where $w_i$ is the weight of synapse i, $\delta(t)$ is the Dirac delta function, which is zero except at the firing times $t_{ij}$ of the i-th neuron, n is the number of incoming synapses and $m_i$ is the number of spikes received from the i-th presynaptic neuron. The LIF
parameters used in the experiments are summarised in Table 6.2.
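A forward-Euler sketch of Equations 6.1 and 6.2 (illustrative discretisation, not the SpiNNaker neuron kernel; units in ms and mV, parameter values follow Table 6.2, with $R_m$ assumed to be 1 and the refractory period omitted):

    tau_m = 5000.0        # membrane time constant (5.0 s, Table 6.2)
    E_L, R_m = 0.0, 1.0   # resting potential; membrane resistance (assumed)
    v_reset, v_thresh = 0.0, 1.0
    dt = 1.0              # simulation timestep in ms

    def lif_step(v, spike_weights):
        """Advance the membrane potential by one timestep."""
        v += (dt / tau_m) * (E_L - v)            # leak (Equation 6.1)
        v += (R_m / tau_m) * sum(spike_weights)  # Dirac current jumps (Equation 6.2)
        if v >= v_thresh:
            return v_reset, True                 # fire and reset
        return v, False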
O’Connor et al. [183] used MATLAB to train and experiment with spiking
DBNs. To port their trained DBN from MATLAB to SpiNNaker, a Python pack-
age was developed that generates a PyNN [44] description of the SNN ready to
be executed on SpiNNaker and other SNN simulators such as Brian [81]. For the
input population of a spiking DBN, the spike trains generated from each MNIST
digit are described as spike arrays in PyNN using the SpikeSourceArray population.
Additional functionality was developed for SpiNNaker that converts the spikes of
a SpikeSourceArray to a binary file which gets uploaded to a SpiNNaker machine.
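A minimal sketch of such a PyNN description is shown below (the import path is one of several PyNN back-end options; spike times and wiring are illustrative, and IF_curr_exp is used as a stand-in for the Dirac delta synapse model of Equation 6.2):

    import pyNN.spiNNaker as sim   # or pyNN.nest / pyNN.brian for other back-ends

    sim.setup(timestep=1.0)

    input_spikes = [[10.0, 25.0, 40.0] for _ in range(784)]  # one list per pixel
    vis = sim.Population(784, sim.SpikeSourceArray(spike_times=input_spikes))
    h1 = sim.Population(500, sim.IF_curr_exp())
    h2 = sim.Population(500, sim.IF_curr_exp())
    out = sim.Population(10, sim.IF_curr_exp())

    # The trained DBN weight matrices would be loaded into all-to-all projections,
    # e.g. sim.Projection(vis, h1, sim.AllToAllConnector(),
    #                     synapse_type=sim.StaticSynapse(weight=w1))
    sim.run(1000.0)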
Table 6.2. LIF parameters used in the experiments.

Parameter    Value
τ_m          5.0 s
T_refract    2.0 ms
V_reset      0.0 mV
V_thresh     1.0 mV
Figure 6.7. Conversion of static images to spike trains and introduction of noise. Each
row represents different input rates ranging from 100 Hz to 1,500 Hz, while the columns
show different percentages of input noise, from 0% up to 100%. Figure taken from [243].
$$W_L = \mathrm{round}(2^f \cdot W_H) \cdot 2^{-f} \qquad (6.3)$$

where $W_H$ represents the original double-precision floating-point weights of the trained DBN, and $2^{-f}$ is the resolution of the lower-precision representation.
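Equation 6.3 is a one-liner in code; the example below also shows how small weights are truncated to zero, the effect quantified in Figure 6.8(b) (weight values are illustrative):

    import numpy as np

    def quantise(w_h, f):
        # W_L = round(2^f * W_H) * 2^-f: keep f fractional bits (Equation 6.3)
        return np.round(w_h * 2.0**f) / 2.0**f

    w = np.array([0.3141, -0.0052, 0.0625])
    print(quantise(w, 4))   # [ 0.3125 -0.      0.0625]: small weights become zero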
6.3.1 Results
Robustness of spiking DBNs to reduced bit precision of fixed-point synapses
and input noise: This section summarises the findings of the investigation on the
effect of reduced weight precision of a trained spike-based DBN and its robustness
to input sensory noise.
Figure 6.8 demonstrates the effect of reduced bit precision on the trained weights
of the spiking DBN of O’Connor et al. [183]. More specifically, Figure 6.8(a) shows
the receptive fields of the first 6 neurons of the first hidden layer for different fixed-
point weight resolutions. As can be visually observed, a lot of the structural infor-
mation of the receptive fields is preserved, even for a bit precision of f = 4 bits.
Figure 6.8. Impact of weight bit precision on the representations within a DBN. (a) The
receptive fields of the first 6 neurons (rows) in the first hidden layer of the DBN with the
two hidden layers. (b) Percentage of synapses from all layers that are set to zero due
to the reduction in bit precision for the fractional part. Figure by Stromatias [241] with
minor modifications.
Figure 6.8(b) presents the percentage of weights that were truncated to zero due to
the fixed-point rounding.
Figure 6.9(a) illustrates the classification accuracy (CA) of the spiking DBN on
the MNIST test set as a function of input noise and bit weight resolution for two
different input firing rates (100 and 1,500 Hz), for an input stimulus of 1 second.
Both curves show that the performance drops as the percentage of input noise
increases, but for higher firing rates (1,500 Hz) the performance remains constant until the input noise reaches a 50% level. The peak performance stays at
almost identical levels to the double floating-point precision even for bit precisions
of f = 3. Figure 6.9(b) shows the area under the curve; a larger area translates to a
higher classification performance. As in (a), a similar trend can be observed; higher
input firing rates result in an increase in CA. Figure 6.9(c) demonstrates the CA
for different bit weight precisions as the input firing rates increase, from 100 Hz to
1,500 Hz, for two different input noise levels, 0% and 60%. Finally, the plots in
Figure 6.9(d) show that there is a wide range of input noise levels and bit weight
resolutions in which the performance remains remarkably high for the two input
rates, 100 Hz and 1,500 Hz. For all experiments, the performance dropped signif-
icantly when a bit weight precision of f = 1 was used. For a bit weight precision
of f = 2, the CA remained approximately at 80% for 100 Hz and above 90% for
firing rates higher than 600 Hz.
These findings illustrate that, indeed, the spike-based DBNs exhibit the desired
robustness to input noise and numerical precision. For a weight precision of
Q3.3 (6 bits per weight), the classification performance is on a par with double
floating-point precision (64 bits per weight). For this particular spiking DBN,
Figure 6.9. Effect of reduced weight bit precision and input noise on the classification
accuracy (CA) of the spiking DBN with two hidden layers. (a) CA as a function of input
noise and bit precision of synaptic weights for two specific input spike rates of 100 and
1,500 Hz. Results over four trials. (b) Normalised area under curve in (a) for different
percentages of input noise, input firing rates and weight bit precision. Higher values mean
higher accuracy and better robustness to noise. (c) CA as a function of the weight bit
resolution for different input firing rates and for two different noise levels, 0% and 60%.
(d) CA as a 2D function of the bit resolution of the weights and the percentage of input
noise for 100 Hz and 1,500 Hz input rate. The results confirm that spiking DBNs with low
precision weights down to f = 3 bits can still reach high-performance levels and tolerate
high levels of input noise. Figure by Stromatias et al. [243] with minor modifications.
which consists of 642,510 synapses, this means that for a weight precision of Q3.3,
only 0.46 MBytes are required for storing all the weights instead of 4.9 MBytes.
Moreover, one of the effects of the reduced precision is that many of the weights
become zero, as seen in Figure 6.8(b), due to rounding, and thus, they can be
pruned. The benefits of pruning the zeroed weights may include faster execution
times due to avoiding unnecessary memory look-ups, as well as being able to exe-
cute deeper neural networks on the same hardware.
Table 6.3. Classification accuracy (CA) of the same DBN with two hidden
layers running on different platforms [241].
Table 6.3 summarises a comparison between the SpiNNaker platform and var-
ious hardware and software simulators, including the Brian SNN simulator, for
the MNIST classification problem. The SpiNNaker results are very close to the
results of the software simulation with only a 0.06% difference despite the fact that
SpiNNaker uses less precise weights than standard software implementations.2
2. A video of a spiking DBN running on SpiNNaker and recognising a handwritten digit can be seen here:
https://youtu.be/f-Xi2Y4TB58
For an input firing rate of 1,500 Hz, the mean classification latency is 16.2 ms (Figure 6.10(b)), while the classification accuracy is 95.0%. For firing rates
above 1,500 Hz, there is no effect on the mean classification accuracy; however,
increasing the input firing rate to 2,000 Hz reduces the mean classification latency
to 13.2 ms. What can also be observed from Figure 6.10(a) is that increasing the
total number of input spikes reduces the standard deviation for both the mean
classification latency and the classification accuracy.
Figure 6.10. (a) Mean classification latency and classification accuracy as a function of
the input spikes per second for the spiking DBN. (b) Histogram of the classification laten-
cies for the MNIST digits of the testing set when the input rates are set to 1,500 Hz. The
mean classification latency of the DBN with two hidden layers is 16 ms [240]. Figure by
Stromatias [241] with minor modifications.
Figure 6.11. Real and estimated power dissipation of the O’Connor et al. [183] spike-based
DBN running on a single SpiNNaker chip as a function of the number of input spikes
generated for the same MNIST digit. The right axis shows the number of output spikes
as a function of the number of input spikes. The left bars (0 input spikes) show power
dissipation when the network is idle. The model used to estimate the power dissipation
of SNNs running on a SpiNNaker machine is based on the work of [240, 242]. Figure by
Stromatias [241].
Finally, Figure 6.11 shows the power requirements of the O’Connor et al. [183]
spiking DBN running on a single SpiNNaker chip. Results show that when an input
firing rate of 2,000 Hz is used per digit, a single SpiNNaker chip dissipates 0.39 W.
That accounts for simulating 1,794 LIF neurons with an activity of 1,569,000
synaptic events (SE) per second. For the identical spiking DBN implemented on
Minitaur, an FPGA event-driven SNN simulator clocked at 75 MHz, a power dis-
sipation of 1.5 W was reported for 1,000 spikes per image [176].
6.4 Spiking Deep Neural Networks

An intuitive idea for bringing these deep learning techniques to SNNs is either to transform well-tuned deep ANN models into SNNs or to translate the numerical calculations of weight modulation into biologically plausible synaptic learning rules. Following the former approach, this section proposes, based on the work of Liu [146], a generalised method to train SNNs off-line on equivalent ANNs and then transfer the tuned weights back to the SNNs. There are two significant problems to be
solved when training SNNs off-line. First, an accurate activation function is needed
to model the neural dynamics of spiking neurons. In this section, we propose a novel
activation function used in ANNs, Noisy Softplus (NSP), to closely simulate the
firing activity of LIF neurons driven by noisy current influx. The second problem
is to map the abstract numerical values of the ANNs to physical variables, current
(nA) and firing rate (Hz), in the SNNs. Consequently, we introduce the Parametric
Activation Function (PAF) y = p × f (x), which successfully associates physical
units with conventional activation functions and thus unifies the representations
of neurons in ANNs and the ones in SNNs. Therefore, an SNN can be modelled
and trained on an equivalent ANN using conventional training algorithms, such as
backpropagation.
The significance lies in the simplicity and generalisation of the proposed method.
SNN training, now, can be simplified to: firstly, estimate parameters for the PAF
using NSP; secondly, use the PAF version of conventional activation functions to
train an equivalent ANN; and finally, transfer the tuned weights directly into the
SNN without any conversion. Regarding generalisation, the method works exactly like training ANNs, with the same feed-forward network architecture, backpropagation algorithm and activation functions, and it uses the most common spiking neuron, the standard LIF, which runs on most neuromorphic hardware platforms.
Therefore, most importantly, this research provides the neuromorphic engi-
neering community with a simple, but effective and generalised off-line SNN
training method which notably simplifies the development of AI applications
on neuromorphic hardware. In turn, it enables ANN users to implement their
models on neuromorphic hardware without the knowledge of spiking neurons or
programming specific hardware, thereby enabling them to benefit from the advantages of neuromorphic computers, such as real-time processing, low latency, biological realism and energy efficiency. Furthermore, the success of the proposed off-line
training method paves the way to energy-efficient AI on neuromorphic machines
scaling from mobile devices to huge computer clusters.
The inputs of a spiking neuron (Figure 6.12) are spike trains, which generate current
influx through neural synapses (connections). A single spike creates a current pulse
with an amplitude of w, which is defined as the synaptic efficacy, and the current
then decays exponentially with a decay rate determined by the synaptic time con-
stant, τsyn . The current pulses consequently produce PSPs on the neuron driving its
membrane potential to change over time and trigger spikes as outcomes when the
neuron’s membrane potential reaches some threshold. The dynamics of the current
influxes, PSPs, membrane potentials and spike trains are all time dependent, while
the neurons of ANNs only cope with abstract numerical values representing spik-
ing rate, without timing information. Therefore, these fundamental differences in
input/output representation and neural computation form the main research prob-
lem of how to operate and train biologically plausible SNNs to make them as com-
petent as ANNs in cognitive tasks. In this section, we focus on the solutions of
off-line training where SNNs are trained on equivalent ANNs and then the tuned
weights are transferred to the SNNs.
Jug et al. [122] first proposed the Siegert formula [226] to model the response
function of a spiking neuron, which worked as a Sigmoid unit in training spiking
Figure 6.12. A spiking neuron. Spike trains flow into a spiking neuron as current influx,
trigger linearly summed PSPs through synapses with different synaptic efficacy w, and
the post-synaptic neuron generates output spikes when the membrane potential reaches
some threshold.
deep belief networks. The Siegert formula maps incoming currents driven by
Poisson spike trains to the response firing rate of a LIF neuron, similar to the activa-
tion functions used in ANNs which transform the summed input to corresponding
outcomes. The variables of the response function are in physical units, and thus,
the trained weights can be transferred directly into SNNs.
However, the Siegert formula is inaccurate as it models the current noise as white [147], that is, it assumes $\tau_{syn} \to 0$, which is not feasible in practice.
Moreover, the high complexity of the Siegert function and the computation of
its derivative to obtain the error gradient cause much longer training times, thus
consuming more energy, when compared to regular ANN activation functions, for
example, Sigmoid. We will illustrate these problems in detail in Section 6.4.2.
A softened version of the response function of LIF neurons has been pro-
posed [111] and is less computationally expensive than the Siegert function. How-
ever, the model ignores the dynamic noise change introduced by input spikes,
assuming a static noise level of the current influx. Therefore, the training requires
additional noise on the response firing rate and on the training data; however, the manually added noise is far from the actual activity of the network and introduces extra hyper-parameters into the model.
Although the trained weights can be directly used in SNNs, since both of the above LIF response functions accept and output variables in physical units, they suffer from poor modelling accuracy and high computational complexity. Moreover,
they lose the numerical abstraction of firing rates in ANNs, thus, being constrained
to SNN training. Meanwhile, other widely used activation functions in ANNs can-
not be transformed to model SNNs. Therefore, the first problem is the accurate
modelling of the neural response activity of LIF neurons using abstract activation
functions, in the hope of (1) increasing the modelling accuracy, (2) reducing the
computation complexity and (3) generalising off-line SNN training to commonly
used ANN activation functions. Activation functions used in ANNs without physical units are called 'abstract' to distinguish them from the response functions of spiking neurons. We select them for LIF modelling because of their simplicity and generalised use in training ANNs. Thus, we propose the activation function NSP [147] in
Section 6.4.2 to address this problem.
Then, the second problem is to map the abstract activation functions to phys-
ical units used in SNNs: current in nA and firing rates in Hz. In doing so, the
neuronal activities of an SNN can be modelled with such scaled activation func-
tions and the trained weights can be transferred into SNNs without conversion.
Instead of directly solving this problem, an alternative way is to train an ANN
with abstract activation functions and then modulate the trained weights to fit in
SNNs. Researchers [32, 51] successfully applied this method to less biologically realistic, simplified integrate-and-fire (IF) neurons. Nevertheless, these simple
IF neurons are usually difficult to implement in analogue circuits, and thus they
are feasible only on digital neuromorphic hardware, for example, TrueNorth [163].
Tuning these trained ANN models to adapt to simplified IF neurons is relatively
straightforward, so this method sets the state-of-the-art performance. However, this
section (in Section 6.4.3) aims to address the second problem of mapping abstract
activation functions to the response firing activity of biologically plausible LIF neurons. Thus, not only can the training be simplified by using conventional simple activation functions, such as Rectified Linear Units (ReLUs), but the method can also be generalised to target the standard LIF neurons which are supported by most neuromorphic hardware.
Biological Background
A LIF neuron model is as follows:
$$\tau_m \frac{dV}{dt} = V_{rest} - V + R_m I(t). \qquad (6.4)$$
The membrane potential V changes in response to the input current I, starting at the resting membrane potential $V_{rest}$, where the membrane time constant is $\tau_m = R_m C_m$, $R_m$ is the membrane resistance and $C_m$ is the membrane capacitance. The
central idea in converting spiking neurons to activation units lies in the response
function of a neuron model. Given a constant current injection I, the response function, that is, the firing rate of the LIF neuron, is:
$$\lambda_{out} = \left[ \tau_{refrac} - \tau_m \log\left(1 - \frac{V_{th} - V_{rest}}{I R_m}\right) \right]^{-1}, \quad \text{when } I R_m > V_{th} - V_{rest}, \qquad (6.5)$$
otherwise, the membrane potential cannot reach the threshold $V_{th}$ and the output firing rate is zero. The absolute refractory period $\tau_{refrac}$ is included, during which synaptic inputs are ignored.
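Equation 6.5 translates directly into code (the parameter values below are illustrative defaults, not the Table 6.4 settings; currents in nA, voltages in mV, times in s):

    import numpy as np

    def lif_response_rate(I, tau_m=0.02, tau_refrac=0.002,
                          R_m=20.0, v_th=-50.0, v_rest=-65.0):
        drive = I * R_m
        if drive <= v_th - v_rest:
            return 0.0  # the membrane potential never reaches threshold
        isi = tau_refrac - tau_m * np.log(1.0 - (v_th - v_rest) / drive)
        return 1.0 / isi

    # The rate grows with the injected current, like the zero-noise
    # (bottom) line of Figure 6.13:
    print([round(lif_response_rate(I), 1) for I in (0.5, 1.0, 2.0)])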
However, in practice, a noisy current generated by the random arrival of spikes,
rather than a constant current, flows into the neurons. The noisy current is typically
treated as the sum of a deterministic constant term, $I_{const}$, and a white noise term, $I_{noise}$. Thus, the value of the current is Gaussian distributed with mean $m_I$ and variance $s_I^2$. The white noise is a stochastic process ξ(t) with mean 0 and variance 1 which is delta-correlated, that is, uncorrelated in time, so that a value ξ(t) at time t is totally independent of the value at any other time t′. Therefore, the noisy current can be seen as $I(t) = I_{const} + I_{noise} = m_I + s_I\, \xi(t)$.
Figure 6.13 shows the response curves (Equation 6.10) of a LIF neuron driven by noisy currents where the Gaussian noise has mean $m_I$ and standard deviation $s_I$. The parameters of the LIF neuron are all biologically plausible (see the listed
values in Table 6.4), and the same parameters are used throughout this chapter.
Figure 6.13. Response function of the LIF neuron with noisy input currents with different
standard deviations.
Table 6.4. Parameter setting for the current-based LIF neurons using PyNN.
The bottom (zero noise) line in Figure 6.13 illustrates the response function of
such a LIF neuron injected with constant current, which inspired the proposal of
ReLUs. As noise increases, the level of firing rates also rises. Thus, the Softplus
function approximates the response activity to noisy current, but only represents a
single level of noise; for example, the top line in Figure 6.13 shows the curve when $s_I = 1$.
The noisy current was generated by a NoisyCurrentSource of given mean and variance. The noise was drawn from the Gaussian distribution at a time resolution of dt. We chose dt = 1 ms and dt = 10 ms for comparison. For a given pair of $m_I$ and $s_I^2$, a noisy current was injected into a current-based LIF neuron
for 10 s, and the output firing rate was the average over 10 trials. There were four
noise levels tested in the experiments: 0, 0.2, 0.5, 1; and the mean current increased
from −0.5 to 0.6 nA.
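A sketch of this current generation (sample-and-hold Gaussian noise; function and variable names are ours): one Gaussian value is drawn per resolution step dt and held constant within it, so a larger dt yields coloured, time-correlated noise rather than white noise.

    import numpy as np

    def sampled_noisy_current(m_i, s_i, dt_ms, duration_ms=10000.0, sim_step_ms=0.1):
        n_draws = int(duration_ms / dt_ms)
        draws = np.random.normal(m_i, s_i, n_draws)    # one Gaussian value per dt
        return np.repeat(draws, int(dt_ms / sim_step_ms))  # held within each dt

    I_fast = sampled_noisy_current(0.1, 0.2, dt_ms=1.0)    # close to white noise
    I_slow = sampled_noisy_current(0.1, 0.2, dt_ms=10.0)   # coloured: more low-frequency power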
The dashed curves in Figure 6.14 illustrate the output firing rates of the LIF simulations, while the bold lines are the analytical reference, the Siegert function (the same as in Figure 6.13). The differences between the practical simulations and the Siegert function grow when the time resolution, dt, of the NoisyCurrentSource increases from 1 ms (Figure 6.14(a)) to 10 ms (Figure 6.14(b)). The sampled cur-
rent signals (NoisyCurrentSource) are shown in Figure 6.15(a) and (b). The discrete
sampling of the noisy current introduces time step correlation to the white noise,
shown in Figure 6.15(e) and (f), where the value remains the same within a time
step dt. Although both current signals follow the same Gaussian distribution, see
Figure 6.15(g) and (h), the current is approximately white noise when dt = 1 ms, but coloured noise, with, for example, increased Power Spectral Density (PSD) at lower frequencies, when dt = 10 ms, see Figure 6.15(c) and (d). Therefore, the coloured noise of the current influx drives the LIF neuron to fire observably more intensely.
Hence, the Siegert formula, Equation 6.10, can only approximate the LIF response
of noisy current with white noise, but it is not adapted to coloured noise. In practice,
the current is generated by random arrivals of input spikes with various synaptic efficacies, which also introduces coloured noise.
A more realistic simulation of a noisy current can be generated by 100 Poisson
spike trains, where the mean and variance of the current are given by La Camera
et al. [134]:
$$m_I = \tau_{syn} \sum_i w_i \lambda_i, \qquad s_I^2 = \frac{1}{2}\, \tau_{syn} \sum_i w_i^2 \lambda_i, \qquad (6.11)$$
where $\tau_{syn}$ is the synaptic time constant, and each Poisson spike train connects to the neuron with a strength $w_i$ and a firing rate $\lambda_i$. Two populations of Poisson
spike sources, for excitatory and inhibitory synapses respectively, were connected
to a single LIF neuron to mimic the noisy currents. The firing rates of the Poisson
spike generators were determined by the given m I and s I . Figure 6.16 illustrates the
recorded firing rates responding to the Poisson spike trains compared to the mean
firing rate driven by NoisyCurrentSource in Figure 6.14. Note that the estimation of
LIF response activity using the Siegert function requires noisy current with white
noise; however, in practice the release of neurotransmitter takes time ($\tau_{syn} > 0$) and the synaptic current decays exponentially, $I_{syn} = I_0\, e^{-t/\tau_{syn}}$. Figure 6.17(a) and (b)
shows two examples of synaptic current of 0 nA mean and 0.2 standard deviation
driven by 100 neurons firing at the same rate and with the same synaptic strength
(half excitatory, half inhibitory), but with different synaptic time constants. Therefore, the current at any time t during the decay period depends on the value
at the previous time step, which makes the synaptic current a coloured noise, see
Figure 6.17(c) and (d).
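Equation 6.11 in code, together with an illustrative excitatory/inhibitory configuration like the one described above (weights and rates are our example values):

    import numpy as np

    def current_stats(weights, rates, tau_syn):
        # Equation 6.11: statistics of the current from Poisson inputs
        m_i = tau_syn * np.sum(weights * rates)
        s_i2 = 0.5 * tau_syn * np.sum(weights**2 * rates)
        return m_i, s_i2

    # 100 inputs, half excitatory and half inhibitory: a zero-mean noisy current.
    w = np.concatenate([np.full(50, 0.01), np.full(50, -0.01)])   # nA per spike
    lam = np.full(100, 400.0)                                     # Hz
    m_i, s_i2 = current_stats(w, lam, tau_syn=0.001)
    print(m_i, np.sqrt(s_i2))   # 0.0 nA mean, ~0.045 nA standard deviation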
We observe in Figure 6.16(a) that the response firing rate to the synaptic current is higher than that to the NoisyCurrentSource for most of the current range. This is caused by the coarse resolution (1 ms) of the spikes, so that the standard deviation of the current is larger than 0.2, as shown in Figure 6.17(g); moreover, $\tau_{syn}$, even when as short as 1 ms, adds coloured rather than white noise to the current. However,
Figure 6.16(b) shows a similar firing rate of both the synaptic driven current and
Figure 6.16. Recorded response firing rate of a LIF neuron driven by a noisy synaptic
current, which is generated by random arrivals of Poisson spike trains, compared to pre-
vious experiments using NoisyCurrentSource. Averaged firing rates over 10 simulation trials tested at three noise levels are shown as dashed lines in different colours, and the grey shading fills the range between the minimum and maximum of the firing rates. The other
LIF simulation using NoisyCurrentSource is drawn in bold lines (same as the dashed
lines in Figure 6.14) to compare with the noisy synaptic current. The same noise level
is plotted with the same colour for both experiments. Two synaptic time constants are
tested: (a) τsyn = 1 ms, to compare with NoisyCurrentSource sampled at every 1 ms, and
(b) τsyn = 10 ms, to compare with NoisyCurrentSource sampled at every 10 ms.
the NoisyCurrentSource, since both of the current signals have similar distribution
(Figure 6.17(h)) and time correlation (Figure 6.17(f)). Nevertheless, the analytical
response function, the Siegert formula, cannot approximate either of the practical
simulations (see Figure 6.14).
Figure 6.17. Noisy currents generated by 100 Poisson spike trains to a LIF neuron with
synaptic time constant τsyn = 1 ms (left) and τsyn = 10 ms (right). The currents are shown in
the time domain in (a) and (b), and in the spectrum domain in (c) and (d). The autocorre-
lation of both current signals is shown in (e) and (f). The distribution of the generated
samples is plotted in bar chart form to compare to the expected Gaussian distribution,
shown in (g) and (h).
Although the use of the Siegert function opened the door for modelling the LIF
response function to work similarly to the activation functions used in ANNs [122],
there are several drawbacks to this method:

• The Siegert formula assumes a white-noise input current, which does not hold for practical simulations of LIF neurons. Thus, the inaccurate model generates errors between the estimation and the practical response firing rate.
• The high complexity of the Siegert function causes much longer training times and higher energy consumption, not to mention the costly computation of its derivative.
• The Siegert function is used to replace Sigmoid functions for training spiking
RBMs [122]. Therefore, neurons have to fire at high frequency (higher than
half of the maximum firing rate) to represent the activation of a sigmoid unit;
thus, the network activity results in high power dissipation.
• Better learning performance has been reported using ReLU rather than
Sigmoid units, so modelling spiking neurons with a ReLU-like activation
function is needed.
Therefore, we propose the NSP function which provides solutions to the draw-
backs of the Siegert unit.
Noisy Softplus (NSP)
Due to the limited time resolution of common SNN simulators and the time taken
for neurotransmitter release, τsyn , mismatches exist between the analytical response
function (the Siegert formula) and practical neural activities. Consequently, to model the practical LIF response function (see Figure 6.18(a)), whose output firing rates are determined by both the mean and the variance of the noisy input currents, the NSP is proposed as follows:
$$y = f_{NSP}(x, \sigma) = k\sigma \log\left[1 + \exp\left(\frac{x}{k\sigma}\right)\right], \qquad (6.12)$$
where x and σ refer to the mean and standard deviation of the input current, y
indicates the intensity of the output firing rate, and k, determined by the biological configuration of the LIF neurons [147] (listed in Table 6.4), scales the impact of the noise, thereby controlling the shape of the curves. Note that the proposed activation function contains two parameters, the mean current and its noise, which take the values estimated by Equation 6.11: $m_I$ and $s_I^2$. Since the NSP
takes two variables as inputs, the activation function can be plotted in 3D, see
Figure 6.19.
Figure 6.18(b) shows the activation function in curve sets corresponding to dif-
ferent discrete noise levels which mimic the responses of practical simulations of
LIF neurons, shown in Figure 6.18(a). It is noteworthy that the non-smooth curve of the LIF response activities (blue line in Figure 6.18(a), generated by σ = 0) does not fit the NSP function; this is a limitation of using NSP to model spike rates when the noise level approaches 0. However, we ignore this minor mismatch to
unify and simplify the model, since the results show an acceptable performance
drop in Section 6.4.4. In addition, scaling, shifting and parameter calibrations are
Figure 6.18. NSP models the LIF response function. (a) Firing rates measured by simu-
lations of a LIF neuron driven by different input currents and discrete noise levels. Bold
lines show the average and the grey colour fills the range between the minimum and the
maximum. (b) NSP activates the input x according to different noise levels where the
noise scaling factor k = 0.16.
essential to fit the NSP accurately to LIF responses. We will illustrate the procedure
in Section 6.4.3.
The derivative of the NSP is the logistic function scaled by kσ:

$$\frac{\partial f_{NSP}(x, \sigma)}{\partial x} = \frac{1}{1 + \exp\left(-\frac{x}{k\sigma}\right)}, \qquad (6.13)$$
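Both Equation 6.12 and its derivative are cheap to compute; a direct transcription, using the noise scaling factor k = 0.16 quoted in Figure 6.18 (the σ → 0 limit is the non-smooth LIF curve that NSP deliberately does not model):

    import numpy as np

    def noisy_softplus(x, sigma, k=0.16):
        return k * sigma * np.log1p(np.exp(x / (k * sigma)))   # Equation 6.12

    def noisy_softplus_grad(x, sigma, k=0.16):
        return 1.0 / (1.0 + np.exp(-x / (k * sigma)))          # Equation 6.13

    # Larger noise lifts the whole response curve, as in Figure 6.18(b):
    x = np.linspace(-0.5, 0.5, 5)
    for sigma in (0.2, 0.5, 1.0):
        print(sigma, np.round(noisy_softplus(x, sigma), 3))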
$\lambda_{out}$ is measured from SNN simulations where a LIF neuron is driven by synaptic input currents from Poisson spike trains, and x and $\sigma^2$ take the mean and variance of the noisy current given by Equation 6.11. Figure 6.20 shows two calibration
results in which the parameters were fitted to (k, b, S) = (0.18, 0.07, 201.66)
Figure 6.21. A general artificial neuron where an activation function transforms the weighted sum $net_j$ into its outcome $y_j$.
when the synaptic time constant is set to $\tau_{syn}$ = 1 ms, and to (k, b, S) = (0.35, 0.03, 178.91) when $\tau_{syn}$ = 10 ms.
To keep the simple format of traditional activation functions, y = f (x), which
has no constant bias on the input, it is easy to pass the bias b to the LIF param-
eter, the constant current offset, Ioffset = b. Therefore, the specific parameter
Ioffset of the LIF neuron is not chosen arbitrarily, but configured by precise esti-
mation of b. More importantly, setting Ioffset properly on the LIF neuron instead
of having a constant bias on the input of an activation function keeps the hyper-
parameters unchanged in ANN training. For example, the initial weights of a
network have to be set carefully to adapt to a constant bias on the activation
function.
The activation function transforms the weighted sum, and the resulting signal then forms the output of an artificial neuron, which can be denoted as $y_j = f(net_j)$, see Figure 6.21.
Equation 6.11 gives the physical interpretation of the input of an NSP function, the noisy current influx, which has mean $m_I$ and variance $s_I^2$. To express the physical parameters in the same form as the weighted summation, net, in a conventional ANN, the mean and variance of the noisy current influx can be represented by $net\_x_j$ and $net\_\sigma_j^2$:
be represented with net_x and net_ σ 2 :
X X 1
net_x j = wi j (λi τsyn ), net_ σ j =
2
w 2
(λi τsyn ). (6.15)
2 ij
i i
Here the abstract input, converted from the input firing rate, is

$$x_i = \lambda_i \tau_{syn}. \qquad (6.16)$$
Figure 6.23. An artificial spiking neuron modelled by PAF-NSP, whose input and output
are numerical values, equivalent to those of ANNs. PAF includes the scaling factor S and the synaptic time constant $\tau_{syn}$ in the combined activation function, which links the firing activity of a spiking neuron to the numerical values of ANNs.
Figure 6.22 illustrates the process by which an NSP-modelled artificial spiking neuron takes the input vector x, converted from the input firing rates λ, transforms the weighted sums $net\_x_j$ and $net\_\sigma_j^2$ into the abstract output $y_j$, and scales $y_j$ up by the factor S to give the output firing rate $\lambda_j$.
If, instead of multiplying every input firing rate $\lambda_i$ by $\tau_{syn}$ (left of Figure 6.22), we apply the multiplication to every output firing rate $\lambda_j$ (right of Figure 6.23), we obtain the same neuron model and structure as a typical neuron in ANNs (see Figure 6.21): neurons that take x as input and output the abstract value y.
The only difference lies in the activation function: the artificial spiking neuron takes the PAF, which is a simple linearly scaled activation function with a parameter p. The parameter is determined by the product of the scaling factor S and the synaptic time constant, $p = S \times \tau_{syn}$.
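Putting Equations 6.15 and 6.16 together with the PAF scaling, the forward pass of one NSP-modelled neuron can be sketched as follows (calibration values taken from the $\tau_{syn}$ = 1 ms fit quoted above; input rates and weights are illustrative; b is absorbed into the LIF $I_{offset}$ and so does not appear here):

    import numpy as np

    k, S, tau_syn = 0.18, 201.66, 0.001   # calibration for tau_syn = 1 ms

    def nsp_neuron(rates_in, w):
        x = rates_in * tau_syn                    # Equation 6.16
        net_x = np.dot(w, x)                      # Equation 6.15, mean term
        sigma = np.sqrt(0.5 * np.dot(w**2, x))    # Equation 6.15, noise term
        y = k * sigma * np.log1p(np.exp(net_x / (k * sigma)))   # NSP (Equation 6.12)
        return S * y                              # output firing rate in Hz

    rates = np.full(100, 50.0)                    # 100 inputs firing at 50 Hz
    weights = np.random.normal(0.0, 0.05, 100)
    print(nsp_neuron(rates, weights))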
Training Method
The simple idea of PAF presented in the previous section allows the use of com-
mon ANN training methods to obtain SNN-compatible weights. Consequently,
training SNNs can be done in three simple steps:
1. Calibrate the parameters (k, b, S) for NSP, which models the response firing rates of LIF neurons, in order to estimate the parameter $p = S \times \tau_{syn}$ for PAFs and to set the LIF parameter $I_{offset} = b$. Since (k, b, S) are solely dependent on the biological configuration of a LIF neuron, the same p can be shared
with different activation functions and repeatedly used for various network
architectures and applications.
2. Train any feed-forward ANN with a PAF version of a ReLU-like activation function (a minimal sketch follows this list). Training compatibility allows us to choose computationally simple activation functions to increase training speed. The backpropagation algo-
rithm updates weights using the stochastic gradient descent optimisation
method to minimise the error between the labels and the predictions from
the network.
3. Transfer the trained weights directly to the SNN, which should use the same
LIF characteristics as those used in Step 1.
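As a concrete instance of Step 2, a PAF version of ReLU is simply a linear rescaling (the p value below uses the $\tau_{syn}$ = 1 ms calibration quoted earlier and is illustrative):

    import numpy as np

    p = 201.66 * 0.001        # p = S * tau_syn from the Step 1 calibration

    def paf_relu(x):
        return p * np.maximum(0.0, x)   # y = p * max(0, x)

    def paf_relu_grad(x):
        return p * (x > 0.0)            # gradient used by backpropagation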
Fine Tuning
As stated above, we can train the network with any PAF version of conventional
ReLU-like activation functions and then fine-tune it with PAF-NSP in the hope of
improving the performance of the equivalent SNN by closely modelling the spiking
neurons with NSP. Additionally, we add a small number, for example 0.01, to all the
binary values of the labels on the training data. Although binary labels enlarge the
disparities between the correct recognition label and the rest for better classification
capability, spiking neurons seldom stay silent even with negative current influx,
and thus, setting labels to 0 is not practical for training SNNs. Therefore, adding
an offset relaxes the strict objective function of predicting exact labels with binary
values.
There are two aspects to the fine tuning which make the ANN closer to SNNs:
firstly, using the NSP activation functions causes every single neuron to run at a
similar noise level as in SNNs, and thus, the weights trained by other activation
functions will be tuned to fit closer to SNNs. Secondly, the output firing rate
of any LIF neuron is greater than zero as long as noise exists in their synaptic
input. Thus, adding a small offset on the labels directs the model to approximate
practical SNNs. The result of fine tuning on a ConvNet will be demonstrated in
Section 6.4.4.
6.4.4 Results
Finally, the proposed generalised SNN training method is put into practice
[52, 147]. We train a 6-layer ConvNet with PAF-NSP and transfer the tuned
weights to an equivalent SNN. The detailed description of the experiment is illus-
trated in this section. We then observe the individual neuronal activities of the
trained SNN, compare the learning and recognition performance between acti-
vation functions, and estimate the power consumption of the SNN running on
neuromorphic hardware.
Experiment Description
A spiking ConvNet was trained on the MNIST [140] data set, using the gener-
alised SNN training method described above. The architecture (784-6c-6p-12c-
12p-10fc) contains 28×28 input units, followed by two convolution-pooling layers
with 6 and 12 convolutional kernels each, and 10 output neurons fully connected
to the last pooling layer to represent the classified digit.
To train the ConvNet, firstly we estimated the parameter p for PAFs given the LIF configuration listed in Table 6.4 and τ_syn = 0.005 s: p = S × τ_syn = 1.085, where (k = 0.31, b = 0.1, S = 217) were calibrated using NSP. Sec-
ondly, the training employed PAFs with three core activation functions: ReLU,
Softplus and NSP to compare their learning and recognition performance. The
weights were updated using a decaying learning rate, 50 images per batch and
20 epochs. Finally, the trained weights were then directly transferred to the
corresponding spiking ConvNets for recognition tests on the SNN simulator,
NEST [77]. To validate the effect of fine tuning, we took another training epoch
to train these models with PAF-NSP with data labels shifted by +0.01. Then, the
weights were also tested on SNN simulations to compare with the ones before
fine-tuning.
At the testing stage, the input images were converted to Poisson spike trains [148]
and presented for 1 s each. The output neuron which fired the most indicated the
classification of an input image.
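This rate-coded conversion can be sketched as follows; the [0, 1] intensity scaling and the 1,000 Hz rate ceiling (matching the 1 ms time resolution mentioned later in this section) are illustrative assumptions rather than values from the experiment.

import numpy as np

rng = np.random.default_rng(seed=1)

def poisson_spike_trains(image, max_rate=1000.0, duration=1.0, dt=0.001):
    # Each pixel intensity in [0, 1] sets a firing rate; a spike is drawn
    # in each time step with probability rate * dt.
    rates = image.flatten() * max_rate                   # Hz, one rate per pixel
    steps = int(duration / dt)
    return rng.random((steps, rates.size)) < rates * dt  # boolean spike raster

image = rng.random((28, 28))        # stand-in for a normalised MNIST digit
spikes = poisson_spike_trains(image)
print(spikes.shape)                 # (1000, 784): 1 s of spikes for 784 inputs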
The Euclidean distances between the recorded response activities and the firing rates predicted by NSP, ReLU and Softplus were 180.59, 349.64 and 1293.99, respectively. We manually selected a static noise level of 0.45 for Softplus, whose estimated firing rates were located roughly on the top slope of the real response activity. This resulted in a longer Euclidean distance than using ReLU, since most of the noisy input currents were of relatively low noise level in this experiment. Hence, the firing rate driven by the lower noise level is closer to the ReLU curve than to Softplus.
Figure 6.24. Images presented in spike trains convolved with a weight kernel. (a) The
28 × 28 Poisson spike trains as a raster plot, representing the 10 digits in MNIST. (b) The
firing rates of all of the 784 neurons of the fourth image, digit ‘0’, plotted as a 2D image.
(c) One out of six of the trained kernels (5 × 5 size) in the first convolutional layer. (d) The
spike trains plotted as the firing rates of the neurons in the convolved 2D map. (e) Output
firing rates for recognising these digits.
Figure 6.25. (a) Recorded data vs. ReLU. (b) Recorded data vs. Softplus.
Note that there is a visible mismatch between the actual firing rates and the model estimation in the lower right region of Figures 6.25(a), (c), where the blue dots (actual spike counts) fall below the bound of ReLU. This is consistent with the statement in Section 6.4.2 that the LIF response activity does not fit the NSP function when the noise level is low (approaching 0). However, the minor mismatch does not result in poor classification accuracy.
Figure 6.24(e) demonstrates the output firing rates of the 10 recognition neu-
rons when tested with the digit sequence. The SNN successfully classified the digits
where the correct label neuron fired the most. We trained the network with binary
labels on the output layer, and thus, the expected firing rate of correct classifica-
tion was 1 × S = 217 Hz according to Equation 6.16. The firing rates of the
recognition test fell into the valid range. This shows another advantage of NSP in
that we can estimate the firing rate of an SNN by S × f NSP (x) from running its
equivalent ANN, instead of simulating the SNN. Moreover, we can constrain the
expected firing rate of the top layer, thus preventing the SNN from exceeding its
maximum firing rate, for example, 1 kHz when the time resolution of the simulation is set to 1 ms.
Learning Performance
Before looking into the recognition results, it is instructive to examine the learning
capability of the novel activation function, NSP. We compared the training using
Figure 6.26. Comparisons of loss during training using NSP, ReLU and Softplus activa-
tion functions. Bold lines show the average of three training trials, and the grey colour
illustrates the range between the minimum and the maximum values of the trials.
ReLU, Softplus and NSP by their loss during training averaged over three trials, see
Figure 6.26. ReLU learned fastest with the lowest loss, thanks to its steepest deriva-
tive. In comparison, Softplus accumulated spontaneous ‘firing rates’ layer by layer and its derivative may suffer shallow or even vanishing gradients during backpropagation, which makes training more difficult. The learning performance of NSP lay between these two. The loss stabilised to the same level as Softplus, because of the same problem of shallow gradients.
However, the learning stabilised fastest using NSP, which may be a result of the
accurate modelling of the noise. Similar findings have shown that networks with
added noise, for example, dropout [236], also improve training time. The result
suggests that NSP may similarly shorten training time.
Recognition Performance
Classification accuracy: The classification errors for the tests were investigated
by comparing the average classification accuracy among three trials, shown in
Figure 6.27. At first, all trained models were tested on the same artificial neurons
as used for training the ANNs, and these experiments were called the ‘DNN’ test
since the network had a deep structure (6 layers). Subsequently, the trained weights
were directly applied to the SNN without any transformation, and these ‘SNN’
experiments tested their recognition performance on the NEST simulator. From
DNN to SNN, the classification accuracy declines by 0.80%, 0.79% and 3.12%
on average for NSP, ReLU and Softplus.
The accuracy loss is caused by the mismatch between the activations and the
practical response firing rates, see examples in Figure 6.25, and the strict binary
labels for NSP and Softplus activations. Fortunately, the problem is alleviated by fine-tuning.
Figure 6.27. Classification accuracy. The trained weights were tested using the same
activation function as training (DNN_Orig), then transferred to an SNN and tested using
NEST simulation (SNN_Orig) and finally fine-tuned to be tested on an SNN (SNN_FT)
again.
Figure 6.28. The classification accuracy of three trials (averaged in bold lines, grey shad-
ing shows the range between minimum to maximum) over short response times, with
trained weights (a) before fine-tuning and (b) after fine-tuning.
Power Consumption
Noisy Softplus can easily be used for energy cost estimation for SNNs. For a single
neuron, the energy consumption of the synaptic events it triggers is:
$$E_j = \lambda_j N_j T E_{syn} = \frac{y_j N_j T E_{syn}}{\tau_{syn}}, \qquad (6.19)$$
where λ_j is the output firing rate, N_j is the number of post-synaptic neurons it connects to, T is the testing time and E_syn is the energy cost for a synaptic event of some specific neuromorphic hardware, for example, about 8 nJ on SpiNNaker [242].
Thus, to estimate the whole network, we can sum up all the synaptic events of
all the neurons:
$$\sum_j E_j = \frac{T E_{syn}}{\tau_{syn}} \sum_j y_j N_j. \qquad (6.20)$$
Thus, it may cost SpiNNaker 0.064 W, or 192 J over a 3,000 s run at a synaptic event rate of 8 × 10⁶/s, to classify 10,000 images (300 ms each) with an accuracy of 98.02%. The best performance, reported using the larger network, may cost SpiNNaker 0.43 W at a synaptic event rate of 5.34 × 10⁷ Hz, consuming 4,271.6 J to classify all the images for 1 s each.
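These figures follow directly from multiplying the total synaptic event rate by the per-event energy; a minimal sketch, assuming the quoted rates and a 10,000 s run (10,000 images at 1 s each):

E_SYN = 8e-9    # J per synaptic event on SpiNNaker [242]

def network_power_and_energy(event_rate, duration, e_syn=E_SYN):
    # Equation 6.20 with the sum over y_j * N_j / tau_syn collapsed into
    # a single total synaptic event rate (events per second).
    power = event_rate * e_syn      # W
    return power, power * duration  # (W, J)

print(network_power_and_energy(8e6, 3000))      # (0.064 W, 192 J)
print(network_power_and_energy(5.34e7, 10000))  # (~0.43 W, ~4272 J)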
6.4.5 Summary
We presented a generalised off-line SNN training method to tackle the research
problem of equipping SNNs with equivalent cognitive capability to ANNs. This
training procedure consists of three simple stages: first, estimate the parameters for PAF using NSP; second, use a PAF version of conventional activation functions for ANN training; third, transfer the trained weights directly to the SNN without any further transformation.
Regarding the generalisation, the training not only uses popular activation func-
tions in ANNs, for example, ReLU, but also targets standard LIF neurons which are
widely used on neuromorphic hardware. Therefore, the proposed method greatly
simplifies the training of AI applications for neuromorphic hardware, thereby
paving the way to energy-efficient AI on brain-like computers: from neuromorphic
robots to clusters. Moreover, it lowers the barrier for AI engineers to access neuro-
morphic hardware without the need to understand SNNs or the hardware. Further-
more, among existing algorithms this method incurs the lowest computational complexity while performing the most effectively. In terms of classification/recognition
accuracy, the performance of ANN-trained SNNs is nearly equivalent to ANNs,
and the performance loss can be partially offset by fine-tuning. The best classifi-
cation accuracy of 99.07% using LIF neurons in a PyNN simulation outperforms
state-of-the-art SNN models of LIF neurons and is equivalent to the best result
achieved using IF neurons. Another important feature of accurately modelling LIF
neurons in ANNs is the acquisition of spiking neuron firing rates. These will aid
deployment of SNNs in neuromorphic hardware by providing power and commu-
nication estimates, enabling better use or customisation of the hardware platforms.
DOI: 10.1561/9781680836530.ch7
Chapter 7

Learning in Neural Networks
If you don’t sleep the very first night after learning, you lose the chance to consolidate those
memories, even if you get lots of ‘catch-up’ sleep thereafter. In terms of memory, then, sleep is not
like the bank. You cannot accumulate a debt and hope to pay it off at a later point in time.
Sleep for memory consolidation is an all-or-nothing event.
— Matthew Walker
A very important set of open questions in neuroscience relates to learning, from
how addiction rewires our brains to how you can remember where you parked your
car or left your bike this morning, but can’t remember why you entered the kitchen.
Neural memories, whether artificial or biological, seem to operate over multiple
time scales. Very short-term memories are fast but limited in capacity, so they are often overwritten.
Long-term memories can stick around a lifetime, but they take a good night’s sleep
to consolidate. Whether through sleep spindles or one-shot learning, brains utilise
synaptic plasticity to store these patterns of activity that we call memories, concepts
or motor actions.
This chapter is concerned with the motivation, design and implementation
behind mimicking biological learning rules with a focus on, you guessed it,
SpiNNaker. It starts by presenting Spike-timing-dependent plasticity (STDP)
operating in an unsupervised fashion based on relative spike times of the pre- and
post-synaptic neurons or based on the sub-threshold membrane potential. This is
In this section, we will consider only the changing of the strength of existing con-
nections and, in this context, Hebb’s postulate indicates that connections between
neurons which persistently fire at the same time will be strengthened. Neurons
which persistently fire at the same time are likely to do so because they respond to
similar or related stimuli.
Bliss and Lømo [20] provided the first evidence to support this hypothesis by
measuring how – if two connected neurons are stimulated simultaneously – the
synaptic connections between them are strengthened. In networks of rate-based
neurons, this behaviour has been modelled using rules such as the Bienenstock–Cooper–Munro (BCM) rule [18] and Oja’s rule [184]. However, the focus of this
section is on SNNs, and in such networks, the timings of individual spikes have
been shown to encode both temporal and spatial information. Therefore, in this
section, we focus on STDP – a form of synaptic plasticity capable of learning such
timings.
In Section 7.2.1, we outline some of the experimental evidence supporting
STDP and discuss how STDP can be modelled in networks of spiking neurons.
Then, in Section 7.2.2, we discuss how STDP has previously been implemented
on SpiNNaker and other distributed systems.
We have developed a new SpiNNaker STDP implementation which both has lower algorithmic complexity than prior approaches and employs new low-level optimisations to better exploit the ARM instruction set. Improving the perfor-
mance of previous SpiNNaker STDP implementations is an important aspect
of this work. It is analysed and presented in detail by Knight [129]. Finally, in
Section 7.2.3, we discuss this implementation in depth. This new implementa-
tion is now a key component of the SpiNNaker software developed as part of the
HBP which aims to provide a common platform for running PyNN simulations
on SpiNNaker, BrainScaleS and HPC platforms.
Figure 7.1. Excitatory STDP curve. Each dot represents the relative change in synaptic
efficacy after 60 pairs of spikes. After Bi and Poo [16].
suggest that the magnitude of changes in synaptic efficacy (Δw_ij) is related to the relative spike timings with the following exponential functions (Figure 7.1):

$$\Delta w_{ij} = \begin{cases} F_+(w_{ij}) \exp\left(-\dfrac{\Delta t}{\tau_+}\right) & \text{if } \Delta t > 0 \\ F_-(w_{ij}) \exp\left(\dfrac{\Delta t}{\tau_-}\right) & \text{if } \Delta t \le 0 \end{cases} \qquad (7.1)$$
Figure 7.2. Absolute change in synaptic efficacy after 60 spike pairs. Potentiation is
induced by spike pairs where the pre-synaptic spike precedes the post-synaptic spike by
2.3 ms to 8.3 ms. Depression is induced by spike pairs in which the post-synaptic spike
precedes the pre-synaptic spike by 3.4 ms to 23.6 ms. The upper blue line is a linear fit to
the potentiation data with slope: 0.4. The lower green line is a linear fit to the depression
data with slope: −1. After Morrison et al. [170].
Figure 7.3. Fitting a pair-based STDP model with τ+ = 16.8 ms and τ− = 33.7 ms to data
from Sjöström et al. [229] by minimising the mean squared error fails to reproduce fre-
quency effects. Blue lines and data points redrawn from Sjöström et al. and the green
lines show the best fit obtained by the pair-based STDP model. After Pfister [192].
Equation 7.1 can alternatively be modelled based on pre- (s_i) and post-synaptic (s_j) trace variables:

$$\frac{ds_i}{dt} = -\frac{s_i}{\tau_+} + \sum_{t_i^f} \delta(t - t_i^f) \qquad (7.2)$$

$$\frac{ds_j}{dt} = -\frac{s_j}{\tau_-} + \sum_{t_j^f} \delta(t - t_j^f) \qquad (7.3)$$
Figure 7.4. Calculation of weight updates using pair-based STDP traces. Pre- and post-
synaptic traces reflect the activity of pre- and post-synaptic spike trains. Potentiation is
calculated at each post-synaptic spike time by sampling the pre-synaptic trace (green
circle) to obtain a measure of recent pre-synaptic activity. Depression is calculated at
each pre-synaptic spike time by sampling the post-synaptic trace (blue circle) to obtain
a measure of recent post-synaptic activity. Weight dependence is additive. After Morrison
et al. [171].
Pre- and post-synaptic spikes occurring at t_i^f and t_j^f, respectively, are modelled
using Dirac delta functions (δ) and, as the top 4 panels of Figure 7.4 show, the trace
variables represent a low-pass filtered version of these spikes. These dynamics can
be thought of as representing chemical processes. For example si can be viewed as a
model of glutamate neurotransmitters which, having crossed the synaptic cleft from
the pre-synaptic neuron, bind to receptors on the post-synaptic neuron and are
reabsorbed with a time constant of τ+ . Building on this work, Section 7.4 presents
an implementation of neuromodulated STDP simulated on SpiNNaker.
As the dashed blue lines in Figure 7.4 illustrate, when a pre-synaptic spike occurs at time t_i^f, the s_j trace can be sampled to obtain the combined depression caused by the pairs made between this pre-synaptic spike and all preceding post-synaptic spikes. Similarly, as the dashed green lines in Figure 7.4 illustrate, when a post-synaptic spike occurs at time t_j^f, the s_i trace can be sampled, leading to the following equations for calculating depression (Δw⁻_ij) and potentiation (Δw⁺_ij):

$$\Delta w_{ij}^-(t_i^f) = F_-(w_{ij})\, s_j(t_i^f) \qquad (7.4)$$

$$\Delta w_{ij}^+(t_j^f) = F_+(w_{ij})\, s_i(t_j^f) \qquad (7.5)$$
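A minimal event-driven sketch of Equations 7.2–7.5 for one synapse is given below. The time constants are those fitted to Bi and Poo’s data; the constants a_plus and a_minus stand in for the weight-dependent F₊ and F₋ factors and are illustrative.

import math

def pair_stdp(pre_times, post_times, a_plus=0.01, a_minus=0.012,
              tau_plus=16.8e-3, tau_minus=33.7e-3):
    # Process all spikes in time order, keeping one trace per side and
    # decaying it lazily between events (Equations 7.2 and 7.3).
    events = sorted([(t, 'pre') for t in pre_times] +
                    [(t, 'post') for t in post_times])
    s_i = s_j = 0.0                  # pre- and post-synaptic traces
    t_i = t_j = -math.inf            # times of the last pre/post spikes
    dw = 0.0
    for t, kind in events:
        if kind == 'pre':
            # Depression: sample the decayed post trace (Equation 7.4).
            dw -= a_minus * s_j * math.exp(-(t - t_j) / tau_minus)
            s_i = s_i * math.exp(-(t - t_i) / tau_plus) + 1.0
            t_i = t
        else:
            # Potentiation: sample the decayed pre trace (Equation 7.5).
            dw += a_plus * s_i * math.exp(-(t - t_i) / tau_plus)
            s_j = s_j * math.exp(-(t - t_j) / tau_minus) + 1.0
            t_j = t
    return dw

# A post-synaptic spike 10 ms after a pre-synaptic one potentiates:
print(pair_stdp(pre_times=[0.0], post_times=[0.010]))   # > 0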
Figure 7.5. Inhibitory STDP curve. The relative change in synaptic efficacy after 60 pairs
of spikes. After Vogels et al. [261].
Bi and Poo [16] recorded the data plotted in Figures 7.1 and 7.2 from rat hip-
pocampal neurons, but subsequent studies have revealed similar relationships –
albeit with different time constants and polarities – in other brain areas [210].
Specifically, in the neocortex, excitatory synapses appear to exhibit STDP with simi-
lar asymmetrical kernels to hippocampal neurons, whereas inhibitory synapses have
a symmetrical kernel similar to that shown in Figure 7.5.
While rules that consider pairs of spikes provide a good fit for the data measured
by Bi and Poo, they cannot account for effects seen in more recent experimental
data. Sjöström et al. [229] stimulated cortical neurons with pairs of pre- and post-
synaptic spikes separated by a constant 10 ms but with between 20 ms and 10 s sep-
arating the pairs. When the time between the pairs approaches the time constants
defining the temporal range of the pair-based STDP rule, spikes from neighbour-
ing pairs begin to interact. As shown in Figure 7.3, this interaction then cancels out
the potentiation or depression that the original pair should have elicited.
Several extensions to the STDP rule have been proposed which take into account
the effect of multiple preceding spikes including the ‘triplet rule’ proposed by Pfister
[192]. In this rule, the effect of earlier spikes is modelled using a second set of
traces (s_i² and s_j²) with longer time constants τ_x and τ_y:

$$\frac{ds_i^2}{dt} = -\frac{s_i^2}{\tau_x} + \sum_{t_i^f} \delta(t - t_i^f) \qquad (7.6)$$

$$\frac{ds_j^2}{dt} = -\frac{s_j^2}{\tau_y} + \sum_{t_j^f} \delta(t - t_j^f) \qquad (7.7)$$
To incorporate the effect of these traces into the weight updates, Pfister also
extended Equations 7.4 and 7.5:
$$\Delta w_{ij}^-(t_i^f) = s_j(t_i^f)\left(A_2^- + A_3^-\, s_i^2(t_i^f - \epsilon)\right) \qquad (7.8)$$

$$\Delta w_{ij}^+(t_j^f) = s_i(t_j^f)\left(A_2^+ + A_3^+\, s_j^2(t_j^f - \epsilon)\right) \qquad (7.9)$$

where ε is a small positive constant used to ensure that the second set of s² traces is sampled just before the spike occurs at t_i^f or t_j^f. This rule has an explicitly additive weight dependence with the relative effect of the four traces controlled by the four free parameters A₂⁺, A₂⁻, A₃⁺ and A₃⁻. Pfister fitted these free parameters to the
data obtained by Sjöström et al. [229] and, as shown in Figure 7.6, demonstrated
that the rule can accurately reproduce the frequency effect measured by Sjöström
et al.
The trace-based models we have discussed so far assume that all preceding
spikes can affect the magnitude of STDP weight updates. However experimental
data [229] suggest that this might not be the case and that basing pair-based STDP
weight updates on only the most recent spike can improve the fit of these mod-
els to experimental data. This ‘nearest-neighbour’ spike interaction scheme can be
implemented in a trace-based model by resetting the appropriate trace to 1 when
a spike occurs rather than by incrementing it by 1. Pfister also investigated the
effect of different spike interaction schemes on their triplet rule but found it had
Figure 7.6. Fitting the triplet STDP model with τ_+ = 16.8 ms, τ_− = 33.7 ms, τ_x = 101 ms and τ_y = 125 ms to the data recorded by Sjöström et al. [229] by minimising mean squared
error effectively reproduces frequency effects. Blue lines and data points (with errors)
redrawn from Sjöström et al. and the green lines show the best fit obtained by the triplet
STDP model. After Pfister [192].
no significant effect on its fit to the data recorded by Sjöström et al. This suggests
that alternative spike-pairing schemes may simply be another means of overcoming
some of the limitations of pair-based STDP models.
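The difference between the two interaction schemes reduces to a single line in the trace update, as this small sketch (with arbitrary example values) shows:

import math

def update_trace(s, dt, tau, nearest_neighbour=False):
    # Decay the trace over dt, then register the new spike: all-to-all
    # interactions increment the decayed trace by 1, whereas the
    # nearest-neighbour scheme resets it to 1, so only the most recent
    # spike contributes to subsequent weight updates.
    decayed = s * math.exp(-dt / tau)
    return 1.0 if nearest_neighbour else decayed + 1.0

print(update_trace(0.8, 0.005, 16.8e-3))                          # ~1.59
print(update_trace(0.8, 0.005, 16.8e-3, nearest_neighbour=True))  # 1.0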
Figure 7.7. The dendritic and axonal components of synaptic delay. After Morrison et al.
[171].
incorporate synaptic delays into the STDP processing. Jin et al. [121] were the
first to implement STDP on SpiNNaker. They assumed that the whole synaptic
delay was axonal implying that, as pre-synaptic spikes reach the synapse before this
axonal delay has been applied, they too must be buffered. Jin et al. used a compact
data structure for buffering both pre-synaptic and post-synaptic spikes containing
the time at which the neuron last spiked and a bit field, the bits of which indicate
previous spikes in a fixed window of time. Consequently, only a small amount of
DTCM is required to store the deferred spikes associated with each post-synaptic
neuron. However, because this approach does not use the trace-based STDP model
(Section 7.2.1), the effect of all possible pairs of pre- and post-synaptic spikes must
be calculated separately using Equation 7.1. Additionally, the bit-field-based recording of history – while compact – represents only a fixed window of time, meaning that only a very small number of spikes from slow-firing neurons can ever be
processed.
Diehl and Cook [50] developed the first trace-based STDP implementation for
SpiNNaker. To store the pre- and post-synaptic traces, they extended each synapse
in the synaptic row to contain the values of the traces at the time of the last update.
They allowed synapses to have arbitrary axonal and dendritic delays meaning that,
like Jin et al., they stored a history of both pre- and post-synaptic spikes. However,
rather than using a bit field to store this, they used a fixed-size circular buffer to
store the spike times. This data structure is not only faster to iterate over than a bit
field but also holds a constant number of spikes, regardless of the firing rates of the
pre- and post-synaptic neurons. However, these buffers can still overflow, leading to
spikes not being processed if the pre- and post-synaptic firing rates are too different.
For example, consider a buffer with space for ten entries being used to defer the spikes from a post-synaptic neuron firing at 10 Hz. If one of the neuron’s input synapses only receives spikes (and is thus updated) at 0.1 Hz, there is insufficient buffer space for all 10 Hz/0.1 Hz = 100 of the post-synaptic spikes that occur between
the updates. Using these spike histories, Diehl and Cook developed an algorithm
to perform trace-based STDP updates whenever the synaptic matrix row associated
with an incoming spike packet is retrieved from the SDRAM. The algorithm loops
through these synapses and, for each one, iterates through the buffered pre- and
post-synaptic spikes in the order that they occurred since the last update (taking
into account the dendritic and axonal delays). The effect of each buffered spike
is then applied to the synaptic weight (using Equation 7.4 for pre-synaptic spikes and Equation 7.5 for post-synaptic spikes) and the appropriate trace updated (using Equation 7.2 for pre-synaptic spikes and Equation 7.3 for post-synaptic spikes).
Diehl and Cook measured the performance of their approach using a benchmark
network of 50 LIF neurons stimulated by a large number of 250 Hz Poisson spike
sources connected with 20% sparsity. Using this network, they showed that their
approach could process 500 × 10³ incoming synaptic events per second compared to the 50 × 10³ achievable using the approach developed by Jin et al.
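A minimal sketch of such a fixed-size spike history, using Python’s deque as a stand-in for the circular buffer, reproduces the overflow behaviour described above:

from collections import deque

class SpikeHistory:
    # Fixed-size buffer of spike times: when full, the oldest entry is
    # silently overwritten, which is how deferred spikes are lost when
    # pre- and post-synaptic firing rates differ widely.
    def __init__(self, capacity=10):
        self.buffer = deque(maxlen=capacity)

    def record(self, t):
        self.buffer.append(t)

    def spikes_since(self, t_last):
        return [t for t in self.buffer if t > t_last]

# A 10 Hz post-synaptic neuron whose synapse is only updated at 0.1 Hz:
history = SpikeHistory(capacity=10)
for i in range(100):                     # 100 spikes between two updates
    history.record(i * 0.1)
print(len(history.spikes_since(-1.0)))   # only the most recent 10 survive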
There are many similarities between simulating large spiking neural networks
on SpiNNaker and on other distributed computer systems – including the two
problems identified at the beginning of this section. In the distributed computing
space, Morrison et al. [170] addressed these in ways highly relevant to a SpiNNaker
implementation. Although the nodes of the distributed systems they targeted do
not have to access synaptic matrix rows using a DMA controller, accessing non-
contiguous memory is also costly on architectures with hardware caches. Therefore,
post-synaptic weight updates still need to be deferred until a pre-synaptic spike.
As each node has significantly more memory, Morrison et al. use a dynamic data
structure to guarantee that all deferred post-synaptic spikes get processed.
Morrison et al. simplify the model of synaptic delay by supporting only config-
urations where the axonal delay is shorter than the dendritic delay. This simplifi-
cation allows pre-synaptic spikes to be processed immediately as it guarantees that
post-synaptic spikes emitted before the axonal delay has elapsed will never ‘over-
take’, and thus need to be processed before, the pre-synaptic spike.
This simplification means that only the time of the last pre-synaptic spike and
the value of the pre-synaptic trace at that time need to be stored with each synap-
tic matrix row. Based on this simplification, the algorithm developed by Morrison
et al. loops through each synapse in the row and, for each one, loops through the
buffered post-synaptic spikes. The effect of each buffered spike is then applied to
the synaptic weight (using Equation 7.5). After all of the post-synaptic spikes have
been processed, the effect of the pre-synaptic spike that instigated the update is
applied to the synaptic weight (using Equation 7.4). Once all of the synapses in the
row have been processed, the pre-synaptic trace is updated (using Equation 7.2).
To assess the relative algorithmic complexity of the approaches presented in this
section, we can consider the situation where an STDP synapse is updated based on
N_pre pre-synaptic and N_post post-synaptic spikes. In the approach developed by Jin et al. [121], each pair of spikes is processed individually and the complexity is O(N_pre N_post). However, by using a trace-based approach, Diehl and Cook [50] reduced this complexity to O(N_pre + N_post) and Morrison et al. [170] further reduced this to O(N_post) by removing the need to buffer pre-synaptic spikes.
7.2.3 Implementation
The best performing SpiNNaker STDP implementation presented in the previ-
ous section was that developed by Diehl and Cook [50]. Their benchmark indi-
cated that, using their implementation, a SpiNNaker core could process up to
500 × 10³ incoming synaptic events per second, compared with 5 × 10⁶ events
Figure 7.8. Ratio distributions of cortical firing rates. Calculated from firing rate distribu-
tions presented by Buzsáki and Mizuseki [30].
Figure 7.9. DTCM memory usage of STDP event storage schemes. The memory usage
of other components is based on the current SpiNNaker tools. All trace-based schemes
assume times are stored in a 32-bit format and traces in a 16-bit format, with two look-up
tables with 256 16-bit entries providing exponential decay. The dashed horizontal line
shows the maximum available DTCM.
Figure 7.9 shows the local memory requirements of post-synaptic history struc-
tures with capacity for 10 entries of different sizes. To implement STDP rules such
as the triplet rule discussed in Section 7.2.1, each entry needs to be large enough
to hold not only a spike time but also two trace values. Figure 7.9 suggests that, to
avoid further reductions in the number of neurons that each SpiNNaker core can
simulate, each of these traces should be represented as a 16-bit value. Using 16-bit
trace entries has an additional advantage as the ARM968 CPU used by SpiNNaker includes single-cycle instructions for multiply and multiply-accumulate operations on signed 16-bit integers [62]. These instructions allow additive weight updates such as w_ij ← w_ij + s_j exp(−Δt/τ) to be performed using a single SMLAxy instruction and, when implementing rules such as the triplet rule that require two traces, they provide an efficient means of operating on pairs of 16-bit traces stored within a 32-bit field.
The range of fixed-point numeric representations is static. Thus, the optimal
representation for storing traces must be chosen ahead of time based on the max-
imum expected value. We can calculate this by considering the value of a trace x
with time constant τ after n spikes emitted at f Hz:

$$x(n) = \sum_{i=0}^{n} e^{-\frac{i}{\tau f}} \qquad (7.10)$$

This geometric series converges, as n → ∞, to the maximum trace value:

$$x_{max} = \frac{1}{1 - e^{-\frac{1}{\tau f}}} \qquad (7.13)$$
The sustained firing rate of most neurons is constrained by the time that ion pumps
take to return the neuron’s membrane potential to its resting potential. This gener-
ally limits a neuron’s maximum firing rate to around 100 Hz but, as Gittis et al. [79]
discuss, there are mechanisms that can overcome this limit. For example, vestibular
nucleus neurons can maintain sustained firing rates of around 300 Hz. Figure 7.10
shows that – based on this worst-case maximum firing rate – 4 integer bits are
required to store traces with time constants in the range fitted to the data recorded
by Bi and Poo [16]. Therefore, a 16-bit fixed-point numeric representation with 4
integer, 11 fractional bits and a sign bit is the optimal choice for representing the
traces required for pair-based STDP.
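The requirement for 4 integer bits can be checked directly from Equation 7.13, as in this short sketch:

import math

def integer_bits(tau, f=300.0):
    # Maximum steady-state trace value from Equation 7.13, and the
    # number of integer bits needed to hold its integer part.
    x_max = 1.0 / (1.0 - math.exp(-1.0 / (tau * f)))
    return math.floor(math.log2(x_max)) + 1

# Pair-based STDP time constants fitted to Bi and Poo's data [16]:
print(integer_bits(16.8e-3))   # 3 bits (x_max ~ 5.6)
print(integer_bits(33.7e-3))   # 4 bits (x_max ~ 10.6)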
Figure 7.10. Number of integer bits required to represent traces of a 300 Hz spike train
with different time constants.
In the PyNN programming interface, STDP learning rules are defined in terms
of three components:
The timing dependence: Defines how the relative timing of the pre- and post-
synaptic spikes affects the magnitude of the weight update.
The weight dependence: Defines how the current synaptic weight affects the
magnitude of the weight update (the F+ and F− functions discussed in
Section 7.2.1).
The voltage dependence: Defines how the membrane voltage of the post-synaptic
neuron affects the magnitude of the weight update.
Adding a voltage dependence to the type of event-based STDP implementation
discussed here presents several challenges beyond the scope of this section. However,
one such voltage-dependent implementation is described in Section 7.3.
In this section, we implement only the timing and weight dependencies sup-
ported by PyNN. So as to allow users of the HBP software not only to select from
the weight dependencies specified by PyNN but also to implement their own easily,
this implementation defines simple interfaces which timing and weight dependen-
cies must implement. Timing dependencies must define the correct types for the
pre- and post-synaptic states (s_i and s_j, respectively), functions to update pre- and post-synaptic trace entries based on the time of a new spike (updatePreTrace and
updatePostTrace, respectively) and functions to apply the effect of deferred pre-
and post-synaptic spikes to a synaptic weight (applyPreSpike and applyPostSpike,
respectively). Algorithm 1 shows an implementation of the functions required to
implement pair-based STDP using this interface. The updatePreTrace adds the
effect of a new pre-synaptic spike at time t to the pre-synaptic trace by decaying the value of s_i calculated at the time of the last spike (t^lastSpike) and adding 1 to represent the effect of the new spike (the closed-form solution to Equation 7.2 between two t_i^f s). Similarly, the applyPreSpike function samples the post-synaptic trace by decaying the value of s_j calculated at the time of the last post-synaptic spike (t_j) (the s_j(t_i^f) term of Equation 7.4).
To decouple the timing and weight dependencies, the applyPreSpike and apply-
PostSpike functions in the timing dependence call the applyDepression and apply-
Potentiation functions provided by the weight dependence rather than directly
manipulating w_ij themselves. Algorithm 2 shows an implementation of applyDe-
pression which performs an additive weight update.
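In Python, the interface can be rendered roughly as follows; the clipping bounds and the A₊ and A₋ constants are illustrative stand-ins for one particular additive weight dependence, not values from the text:

import math

TAU_PLUS, TAU_MINUS = 16.8e-3, 33.7e-3   # pair-based STDP time constants
W_MIN, W_MAX = 0.0, 1.0                  # illustrative weight bounds
A_PLUS, A_MINUS = 0.01, 0.012            # illustrative learning rates

# Weight dependence (additive, in the spirit of Algorithm 2).
def apply_depression(w, d):
    return max(W_MIN, w - A_MINUS * d)

def apply_potentiation(w, p):
    return min(W_MAX, w + A_PLUS * p)

# Timing dependence (pair-based, in the spirit of Algorithm 1): sample
# the opposite trace, decayed to the current time, and pass it on.
def apply_pre_spike(w, t, t_j, s_j):
    dt = t - t_j
    return apply_depression(w, s_j * math.exp(-dt / TAU_MINUS)) if dt != 0 else w

def apply_post_spike(w, t, t_i, s_i):
    dt = t - t_i
    return apply_potentiation(w, s_i * math.exp(-dt / TAU_PLUS))

print(apply_pre_spike(0.5, 0.020, 0.010, 1.0))   # depressed below 0.5
print(apply_post_spike(0.5, 0.020, 0.010, 1.0))  # potentiated above 0.5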
Algorithm 3 is the result of combining the simplified delay model proposed by
Morrison et al. [170] with the flushing mechanism and the interfaces for timing
and weight dependencies discussed in this section. The algorithm begins by loop-
ing through each post-synaptic neuron ( j) in the row and retrieving a list of the
Algorithm 1. Pair-based STDP timing dependence (excerpt):
function applyPreSpike(w_ij, t, t_j, s_j)
    Δt ← t − t_j
    if Δt ≠ 0 then return applyDepression(w_ij, s_j · exp(−Δt/τ_−))
    else return w_ij
times (t_j) at which that neuron spiked between t^lastUpdate and t and its state at that time (s_j) (taking into account the dendritic (D_D) and axonal (D_A) delays associated with each synapse). The algorithm continues by looping through each
post-synaptic spike and calling the applyPostSpike function to apply the effect of the
interaction between the post-synaptic spike and the pre-synaptic spike that occurred
at t^lastSpike to the synapse. If the update was instigated by a pre-synaptic spike rather
than a flush, the applyPreSpike function is called to apply the effect of the interac-
tion between the pre-synaptic spike and the most recent post-synaptic spike to the
Figure 7.11. A random balanced network consisting of recurrently and reciprocally con-
nected populations of excitatory neurons (red filled circles) and inhibitory neurons (blue
filled circles). Excitatory connections are illustrated with red arrows and inhibitory con-
nections with blue arrows.
synapse. Once all events are processed, the fully updated weight is added to the
input ring buffer. If the update was instigated by a pre-synaptic spike rather than a
flush, after all the synapses are processed, the pre-synaptic state stored in the header
of the row (s_i) is updated by calling the updatePreTrace function, and t^lastSpike and t^lastUpdate are set to the current time. If, however, the update was instigated by a flush event, only t^lastUpdate is updated to the current time, meaning that the interactions
between future post-synaptic events and the last pre-synaptic spike will continue to
be calculated correctly.
function applyPostSpike(w_ij, t, t_i, s_i)
    Δt ← t − t_i
    return applyPotentiation(w_ij, s_i · exp(−Δt/τ_+))
established using an STDP rule with the type of symmetrical kernel shown in
Figure 7.5. We implemented this learning rule using the timing dependence func-
tions defined in Algorithm 4 and used it to reproduce the results presented by
Vogels et al. using a network of 2,000 excitatory and 500 inhibitory neurons with
the parameters listed in Table 7.1.
Without inhibitory plasticity, the network remained in the synchronous regime
shown in Figure 7.12(a) in which neurons spiked simultaneously at high rates.
However, with inhibitory plasticity enabled on the connection between the
inhibitory and the excitatory populations, the neural activity quickly stabilised and,
as shown in Figure 7.12(b), the network entered an asynchronous irregular regime
in which neurons spiked at a much lower rate.
Table 7.1. Model summary for the random balanced network: populations, connectivity and plasticity parameters, including the inhibitory synaptic time constant τ_syn^inh = 10 ms.
Figure 7.12. The effect of inhibitory plasticity on a random balanced network with 2,000
excitatory and 500 inhibitory neurons. Without inhibitory plasticity, the network is
in a synchronous state with all neurons firing regularly at high rates. Inhibitory plas-
ticity establishes the asynchronous irregular state with all neurons firing at approxi-
mately 10 Hz.
Although highly successful, the STDP algorithm has some drawbacks. For example,
if the simulator has no memory of pre- and post-synaptic spike times, the algorithm
is difficult to implement; furthermore, if the post-synaptic neuron fails to spike, it
could be that important information is lost. Bengio et al. [14] propose a plasticity
rule which is compatible with STDP dynamics and could, in principle, be a way to
link machine learning and neuroscience.
Table 7.2. Model summary: a single LIF neuron stimulated by 1,000 independent 15 Hz Poisson spike trains; connectivity and plasticity parameters.
Figure 7.13. Histograms showing: (a) initial uniform distribution of synaptic weights and
distribution of synaptic weights following; (b) STDP with additive weight dependence
and (c) STDP with multiplicative weight dependence. Simulation consists of a single
integrate-and-fire neuron with 1,000 independent 15 Hz Poisson spike sources providing
synaptic input.
Table 7.3. Izhikevich neuron model parameters (a, b, c, d).
We implemented this rule using both LIF and Izhikevich neuron models [193];
we only show the results with the latter here. The parameters used in our exper-
iments for the Izhikevich model are shown in Table 7.3. The behaviour of the
neuron model when a continuous current is applied is illustrated in Figure 7.14(a).
The blue line depicts the neuron’s membrane voltage (v), and the green line
shows the behaviour of the auxiliary variable u. Since the membrane voltage is
usually noisy, we filter it using the exponential smoothing technique [175]
$$\gamma = e^{-1/\tau_s}, \qquad (7.14)$$

$$s(t) = (1 - \gamma)\, v(t) + \gamma\, s(t - 1); \qquad (7.15)$$
where τs is the temporal constant for the filtering mechanism. The dashed red line
is a low-passed version of the membrane voltage (s). The change in synaptic efficacy
Figure 7.14. Membrane voltage change as a proxy for weight updates. (a) Behaviour of
an Izhikevich neuron to a step input; the blue line illustrates the membrane voltage, while
the red dashed line shows a low-pass-filtered version of it. (b) shows how an average of
weight changes behaves close to STDP when simulated in Python. (c) summarises the
average of weight changes simulated on SpiNNaker.
is given by

$$\Delta w = \alpha \times \delta(t - t_{pre}) \times \frac{\Delta s(t)}{\Delta t} \qquad (7.16)$$
where δ(·) is the Kronecker delta function and α is a scaling factor which can be used as the learning rate. To test whether this adjustment to the original learning
rule still produces similar results (i.e. STDP-compatible behaviour), we establish an
experimental set-up similar to the one presented by Bengio et al. [14]. We simulate
5,000 neurons (using a home-brew implementation) which have a noisy current
offset; random input spikes are generated at every time step, with a 20% and 5%
probability, for excitatory and inhibitory types, respectively. All synapses are char-
acterised as a 1 ms pulse response; weights for inhibitory synapses are fixed, while
excitatory are plastic.
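A compact sketch of Equations 7.14–7.16 applied to a recorded voltage trace is shown below; the time constant, learning rate and synthetic voltage ramp are illustrative values only:

import numpy as np

def smooth(v, tau_s=10.0, dt=1.0):
    # Exponential smoothing of the membrane voltage (Equations 7.14-7.15).
    gamma = np.exp(-dt / tau_s)
    s = np.empty_like(v)
    s[0] = v[0]
    for t in range(1, len(v)):
        s[t] = (1.0 - gamma) * v[t] + gamma * s[t - 1]
    return s

def weight_change(s, pre_spike_steps, alpha=0.1, dt=1.0):
    # Equation 7.16: at each pre-synaptic spike, the weight moves with
    # the slope of the smoothed voltage; alpha acts as the learning rate.
    ds = np.diff(s, prepend=s[0])
    return sum(alpha * ds[t] / dt for t in pre_spike_steps)

v = np.concatenate([np.full(20, -65.0), np.linspace(-65.0, -50.0, 30)])
print(weight_change(smooth(v), pre_spike_steps=[25, 35, 45]))  # > 0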
We then look for post-synaptic neuron spikes and collect weight change statistics in a ±20 ms temporal window in 1 ms time steps. We compute the average change
for each time step; Figure 7.14(b) depicts the resulting averages for voltage changes.
We performed a similar experiment using the SpiNNaker implementation and
computed the average of generated data points which gives rise to an STDP-like
curve (Figure 7.14(c)). A major difference is that the curve gets shifted 1 ms to the
right; this is because weight changes are computed as soon as a spike arrives at the
post-synaptic core but applied a time step later.
7.3.1 Results
To test the viability of using this learning rule to capture the statistics coming from
visual patterns, we set up experiments with networks based on a SWTA circuit.
Unsupervised: We first explore an unsupervised learning procedure. The network
for this experiment is presented in Figure 7.15 and is composed of an input layer
Figure 7.15. SWTA network with visual pattern as input. A 5 × 5 pixel/neuron array is
given an input which corresponds to the two main diagonals alternated with a 50 ms
delay between them; it is also provided with a 1 Hz noise with a Poisson distribution. This
array is connected with plastic connections to 5 target neurons, which in turn are in a
SWTA circuit.
Figure 7.16. Weight changes after alternating pattern simulation. (a) shows the input
weights for each target neuron, and these were set at random with a uniform distribution
[0.05, 0.2). (b) By the end of the simulation, some neurons have specialised for a pattern
as shown by the weights.
and an output layer. The input layer consists of 25 input neurons which serve as a
relay for noise and input patterns. The network is given alternating visual patterns
(diagonals) as an input; a 1 Hz Poisson noise source is added to the input in order
to favour diversity of learning in the output layer. The output layer consists of five
neurons which feed a single inhibitory neuron, the latter will reduce the chances of
spiking for neurons in the output layer.
The synaptic weights from the input to the output layer were plastic and ran-
domly initialised as shown in Figure 7.16(a). After multiple exposures to the input
patterns, the synapses corresponding to the inputs get potentiated (Figure 7.16(b)).
In particular, neurons 1 and 4 become specialised on one pattern, while neurons 2 and 5 specialise on the other.
where I is the total input current given to the neuron which consists of I+ , the
excitatory inputs, and I− , the inhibitory ones. With this scheme, it is not possible
to condition the acceptance of current provided by NMDA activation influx, Iφ ,
and thus, Equation 7.17 will be modified to
$$I = \sum I_+ - \sum I_- + \sum I_\phi \qquad (7.18)$$
We control the activation of NMDA receptor channels in two ways. The first
is through a special φ spike which emulates a gating mechanism of the channel
(i.e. the presence of both glutamate and glycine [66, 179]). We also use this event
to model the current (Iφ ) created by additional positive ions passing through the
opened channel. Secondly, Iφ is allowed to pass into the neuron only when the
membrane voltage is above the threshold Vφ :
$$I_\phi = \begin{cases} I_\phi & \text{if } V_m > V_\phi \text{ or } t - t_\phi < T_\phi \\ 0 & \text{otherwise} \end{cases} \qquad (7.19)$$
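A direct transcription of Equation 7.19, with illustrative threshold and inertia-window values:

def gated_nmda_current(i_phi, v_m, t, t_phi, v_threshold=-55.0, inertia=5):
    # Pass the NMDA-like current only if the membrane voltage exceeds
    # V_phi, or a phi spike arrived within the inertia window T_phi.
    if v_m > v_threshold or (t - t_phi) < inertia:
        return i_phi
    return 0.0

print(gated_nmda_current(1.0, v_m=-70.0, t=100, t_phi=98))  # 1.0 (inertia)
print(gated_nmda_current(1.0, v_m=-50.0, t=100, t_phi=0))   # 1.0 (voltage)
print(gated_nmda_current(1.0, v_m=-70.0, t=100, t_phi=0))   # 0.0 (gated off)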
Furthermore, to mimic the time it takes to close the channels, we added an inertia-
like mechanism (t − tφ < Tφ in Equation 7.19) which keeps Iφ current flowing
for at least Tφ simulation steps. To achieve this, we subtract the time at which a φ
spike arrived (tφ ) from the current simulation step and, if the temporal difference is
smaller than the inertia window Tφ , the current is allowed to keep flowing. To test
the supervision mechanism we used a similar network setup to that in the unsu-
pervised case though we need two ‘instances’ of the network (Figure 7.17); one will
‘supervise’ the other. Each instance has five output neurons which are each assigned
to an input pattern. During the experiment, neurons 1 and 2 of the bottom output
population (Figure 7.17) were assigned to learn pattern 1 (forward diagonal), and
neurons 3 and 4 were set to learn pattern 2 (back diagonal). The way we induce
neurons to learn a particular pattern is to send a φ spike 5 ms before the pattern is
shown to the corresponding neuron. Neurons in the bottom population connect
to the neurons in the top population in a one-to-one manner through the lateral
φ channel. For this experiment, we used Izhikevich neurons for the output and
inhibitory populations which were configured according to the parameters shown
in Table 7.4.
Figure 7.18 shows synaptic efficacies at the beginning and end of the training.
Each square depicts the values of incoming synapses to a post-synaptic neuron,
the top row of squares corresponds to the student population and the bottom row
Figure 7.17. Supervision set-up: two instances of the SWTA network receive the alternating diagonal patterns (A then B for one instance, B then A for the other, 50 ms apart) plus 1 Hz Poisson noise; one instance supervises the other through a lateral φ interaction.
Table 7.4. Izhikevich neuron and φ-channel parameters (a, b, c, d, τ_u, τ_exc, τ_inh, τ_φ, θ_φ).
Figure 7.18. Weights at start and end of training using an NMDA-like signal φ. (a) Synaptic
efficacies are set to random values initially. (b) After ∼30 min of simulation, the weights
favour the assigned input patterns.
corresponds to the teacher population. At the start (Figure 7.18(a)), weights are
assigned randomly.
During training, a supervision φ spike is given to neurons ⟨2, 1⟩, ⟨2, 2⟩, ⟨2, 4⟩ and ⟨2, 5⟩ right before one of the patterns reaches them. Neurons ⟨2, 1⟩ and ⟨2, 2⟩ are assigned the forward diagonal pattern, while neurons ⟨2, 4⟩ and ⟨2, 5⟩ are assigned the backward diagonal pattern. Neuron ⟨2, 3⟩ is allowed to learn its input without any supervision. Since patterns for the student population are delayed (and inverted), the φ spikes coming from the teacher enforce neurons in the student population to learn a particular input (backward diagonal for ⟨1, 1⟩ and ⟨1, 2⟩, forward
diagonal for ⟨1, 4⟩ and ⟨1, 5⟩). Figure 7.18(b) shows synaptic efficacies at the end of training; note that the largest weights correspond to the assigned patterns. While
the task to learn here is a simple one, the behaviour of the network could be seen
as self-supervision and could be applied to networks whose neurons learn parts of
a larger problem.
Traditionally, simple models of SNNs have used two types of synapse: excitatory
(positive) and inhibitory (negative). These drive changes to the membrane voltage
and, indirectly, can produce weight changes [74].
In biological neural networks, there are, in addition, neurotransmitters and neu-
romodulators that may alter learning processes. Research on dopamine interaction
shows that it could be crucial for reinforcement learning as it has been identified as
a control signal for large regions of the brain.
Furthermore, there are additional cells involved in synapse function – astrocytes –
which are usually characterised as ‘maintainers’ in the central nervous system as
they keep ionic concentrations stable, form scar tissue on damaged regions and
aid energy transfer [231]. Scientists are putting more effort into understanding the role
of astrocytes as learning modulators [78]. Research shows that these cells are also
involved in the regulation of current, frequency, short- and long-term plasticity,
and synapse formation and removal [91, 189, 253, 254].
Having a third component modifies Hebbian-based weight updates, and the rule
will now depend on the state of three factors (Figure 7.19), Δw = f(s(pre), s(post), s(third)), where s(·) indicates the activity or state. If only the spike time is considered as the state (as is the case for STDP), this reduces to Δw = f(t_pre, t_post, t_third), where t_x is the time at which a spike from neuron x was perceived by the post-synaptic neuron.
Figure 7.19. The three factors influencing a synaptic weight update: pre-synaptic, post-synaptic and third-factor signals.
Ponulak and Kasinski [200] introduced an STDP-like rule with a third factor
(ReSuMe), in which the extra input is used to get the post-synaptic neuron to spike
at a particular time.
When t_third − t_pre > 0, a weight increment (Δ₊) is applied to synapses, whereas a weight depression (Δ₋) is applied when t_post − t_pre > 0. If the post-synaptic neuron activates at the desired time, t_third = t_post, depression will be equal to potentiation and the total weight change will be zero.
Gardner and Grüning [70] developed a three-factor learning rule whose purpose
is, also, to learn spike times; to do this, the third input to the synapse carries a ‘tem-
poral target’ signal and will alter the magnitude and direction of the weight change.
The main difference from the ReSuMe rule is that this requires, additionally, a low-
pass filtered version of the error. The temporal error modifies the effect a single
pre-synaptic spike has on post-synaptic neuron activity. The filtered version of the
error can be seen as an ‘accumulation’ activity for a time window (≈ 10 ms).
While the previously mentioned rules make use of a third factor, they remain
biologically implausible as a synapse is unlikely to be able to keep track of exact
times. In this context, we can see the neurotransmitter dopamine as a global error
signal or a modulator that enables learning after the previous activity in the network
led to a reward-worthy action [248].
Other modulators (e.g. serotonin and noradrenaline) could guide plasticity
through attention-like mechanisms. These are thought to be local signals –
as opposed to dopamine – and may represent feedback and/or lateral interac-
tion [211].
Models of modulated synaptic plasticity have been developed and in general follow a common form in which the weight change is the product of two terms, where f is the regular plasticity function (e.g. STDP) and g is the, usually, decaying response of the modulatory input.
Figure 7.20. (a) Pre- and post-synaptic spike pairs (zones 1–3) drive STDP interactions that accumulate in an eligibility trace. (b) The modulator signal gates when the accumulated trace is applied to the synaptic efficacy.
of the trace [60, 118]. The signal could be dopamine, or another neuromodulator
with slower dynamics, decaying on the scale of hundreds of milliseconds.
In Figure 7.20(a), the trace labelled STDP shows the weight change func-
tion given the inputs shown in rows pre and post. Eligibility traces are formed by
⟨pre, post⟩ spike pairs which are illustrated in zones 1 and 2 in Figure 7.20(a);
these cause accumulation of weight changes driven by STDP curves. Since STDP
interactions depend on the time at which the pre- and post-synaptic neurons spiked,
if these times are sufficiently far apart in time, no weight change is added to the eli-
gibility trace (compare zone 3 with zone 2 in Figure 7.20(a)).
Eligibility traces have much slower dynamics than STDP interactions as illus-
trated in Figure 7.20(a); the curve in row trace decays much slower than any of the
curves in row STDP. The low decay rate is useful to keep track of how temporally
distant weight changes contributed to a particular behaviour.
The modulating neurotransmitter (modulator curve in Figure 7.20(b)) also has
slower dynamics than STDP, but not as slow as eligibility traces. Weight changes
are only applied when the third signal is present; this is modelled as a multiplicative
effect
$$\frac{dc(t)}{dt} = -\frac{c(t)}{\tau_c} + STDP(\tau_{-/+})\, \delta(t - t_{pre/post}); \qquad (7.25)$$

where c(t) is the state of the eligibility trace; STDP(τ₋/₊) is the value from the STDP curves (Figure 7.20(a)) and δ(t − t_pre/post) is the Dirac delta function. Similarly,
where the lc, lm and lw subscripts indicate the time (event) at which the last eligibility trace, modulator and weight updates were performed, respectively. As two different spike ‘types’ can be received, weight updates will be performed either at t_lw = t_lc or t_lw = t_lm. The evaluation of the squared brackets in Equation 7.27 is done as in definite integrals.
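Putting the pieces together, the didactic sketch below integrates Equation 7.25 and a multiplicative weight update dw/dt = c(t)·d(t) with a simple Euler loop; the event-driven SpiNNaker implementation instead uses closed-form decays between events, and the event amplitudes here are illustrative (τ_c and τ_d follow the values used in the credit assignment experiment below):

def modulated_stdp(stdp_events, dopamine_events, t_end,
                   dt=1e-3, tau_c=1.0, tau_d=0.2):
    # stdp_events: (time, amplitude) pairs feeding the eligibility trace c;
    # dopamine_events: (time, amplitude) pulses of the modulator d.
    c = d = w = 0.0
    for step in range(int(t_end / dt)):
        t = step * dt
        c += sum(a for ts, a in stdp_events if abs(ts - t) < dt / 2)
        d += sum(a for ts, a in dopamine_events if abs(ts - t) < dt / 2)
        w += c * d * dt          # weight only changes while d is non-zero
        c -= c * dt / tau_c      # slow eligibility decay (tau_c = 1,000 ms)
        d -= d * dt / tau_d      # faster modulator decay (tau_d = 200 ms)
    return w

# A pairing at t = 0.1 s only affects the weight because a dopamine
# pulse at t = 0.5 s arrives within the eligibility window:
print(modulated_stdp([(0.1, 0.01)], [(0.5, 1.0)], t_end=2.0))  # > 0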
Figure 7.21. Credit assignment experiment network − yet another random balanced
network.
Table 7.5. LIF neuron model parameters for the credit assignment experiment.
Table 7.6. Learning rule parameters for the credit assignment experiment (A₊, A₋, τ₊, τ₋, τ_c, τ_d).
For this experiment, we use standard LIF neurons whose parameters promote
higher activity from the elements in the inhibitory group. High excitability is
achieved by adding a small base current, reducing the distance between the resting
voltage and the threshold value, and reducing the refractory period (see Table 7.5).
We used exponentially decaying, current-based synapses whose temporal constants
(τsyn_X ) are set to 1 ms to further approximate the original experiment.
The learning algorithm was parametrised so that the area for LTP is 20% greater
than the area for Long-Term Depression (LTD) (A₊, A₋, τ₊, τ₋ in Table 7.6).
Dopamine interaction was characterised in a biologically plausible manner: the
temporal constant for the eligibility trace, τc , is 1,000 ms, which implies the net-
work will learn events that occurred up to a second ago; the temporal constant for
dopamine, τd , is 200 ms according to biological evidence.
We ran the experiment for about 1.5 hours of biological real time. At approx-
imately 70 minutes, the weights from group S1 to other excitatory neurons are
sufficiently large to be visibly noticeable in a raster plot.
The evolution of the average weight in group S1 is shown in Figure 7.22(a) as a
green line, which presents an exponential growth. We can also observe the average
weight value for every connection in the network as the experiment progresses (blue line); it grows more slowly and is expected to stabilise. Spiking behaviour for
all groups is similar at the start of the experiment; this can be seen as correlated
vertical dots in Figure 7.22(b).
By the end of the experiment, connections which originate from group S1 are
at such a value that most post-synaptic neurons will spike when a neuron in the
group is active. This is shown in the middle of Figure 7.22(c) as a burst of activity.
Although the network still responds to other patterns, it is now tuned to emit a
higher response to the S1 pattern. This has been observed in cortical regions; for
example, a column in V1 will show a response to many oriented bars as inputs, but
it presents the maximum spike rate for a particular orientation.
search for the optimal synaptic configuration that fits within the constraints of the
system.
SpiNNaker’s real-time constraint also means that synaptic rewiring, which occurs on a slower time scale than other neural and synaptic processes, can be monitored over longer time scales.
The model chosen for translation onto SpiNNaker was developed by Bamford
et al. [8]. Their model had some desirable features and posed some interesting challenges. Feature-wise, the rewiring rules allowed for synaptic formation and removal driven by Euclidean distance and synaptic weight, respectively.
A certain number of rewiring attempts are performed every simulated second.
Each neuron in the network has a limited number of potential synaptic contact
points. Attempts either follow the formation or removal rules dependent on the
existence of a synapse at a considered contact point.
Formations favour neurons that are relatively close in space, which also represents
the first challenge for SpiNNaker. This was the first time that SpiNNaker neural
models required spatial information for the simulated neurons. A new, full-strength,
connection is formed with a partner neuron that has fired recently if
$$r < p_{form}\, e^{-\frac{\delta^2}{2\sigma_{form}^2}} \qquad (7.28)$$
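Assuming r is a uniform random draw and δ the distance to the candidate partner, as the rule’s form suggests, a formation attempt can be sketched as follows; the p_form and σ_form values are illustrative, not those of Bamford et al.:

import math
import random

rng = random.Random(1)

def attempt_formation(distance, p_form=0.16, sigma_form=2.5):
    # Equation 7.28: form with probability p_form * exp(-d^2 / 2 sigma^2),
    # so spatially close partners are strongly favoured.
    return rng.random() < p_form * math.exp(-distance**2 /
                                            (2.0 * sigma_form**2))

for d in (0.0, 2.0, 5.0):
    formed = sum(attempt_formation(d) for _ in range(10000))
    print(f"distance {d}: formed in {formed / 10000:.1%} of attempts")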
[29]) was to replicate and expand on previous results regarding topographic map
formation (Section 7.5.1).
Section 7.5.3 shows that choosing a partner for formation from the set of recently
active cells is a powerful mechanism. In the context of MNIST handwritten digit
classification, this feature allows for decent accuracy and recall scores in the absence
of weight changes as long as the input is encoded using spiking rates. Further,
Section 7.5.4 reveals the importance of visualisation in identifying the cause of
aberrant behaviour using this particular feature as its main example.
A final application of these synaptic rewiring rules is described in Section 7.5.5
where it is shown to perform elementary motion decomposition after the applica-
tion of additional minor enhancements.
An alternative formulation based upon information theory, sparse codes and
their congruence with recent results about the behaviour of clustered synapses in
real dendritic trees is described briefly by Hopkins et al. [103].
Figure 7.23. Topographic maps. Neuron (2) in the target layer has a receptive field formed
by connections from the source layer (feed-forward) as well as connections from within
the target layer (lateral). These connections are centred around the spatially closest neu-
ron, that is neuron (1) in the case of feed-forward connections. Connections from more
distant neurons are likely to be weaker (indicated by a darker colour).
where i loops over synapses, x⃗ is a candidate preferred location, |p⃗_x⃗i| is the minimum distance from the candidate location to the afferent for synapse i and w_i is the weight of the synapse. The candidate preferred location has been implemented with an iterative search over each whole-number location in each direction, followed by a further iteration, this time in increments of 0.1 units. Thus, the preferred location x⃗ of a receptive field is given by argmin_x⃗ σ_aff.
Once the preferred location of each neuron is computed, taking the mean distance from the ideal location of each preferred location results in a mean AD for the projection. We report both mean AD and mean σ_aff computed with and without taking into account connection weights. σ_aff−weight and AD_weight are computed using synaptic weights g_syn, while σ_aff−conn and AD_conn are designed to consider the effect of rewiring on the connectivity, and thus synaptic weights are considered unitary.
These metrics are considered in three types of experiments:
Figure 7.24. Stacked bar chart of formations and removals over time within one simu-
lation. Left: evolution of the network starting from no connections; right: evolution of
the network starting from a sensible initial connectivity. The number of formations or
removals is aggregated into 3-second chunks.
loaded onto SpiNNaker – the process of interacting with SpiNNaker for loading or
unloading data is currently the main bottleneck. The latter is meant as a validation
for the model, but also to gain additional insights into its operation, specifically
whether maps formed through a process of simulated development react differently
to maps that are considered in their ‘adult’, fully-formed state in need of refinement.
Results from this simulation are presented in tabular form (Table 7.8), with the
addition of longitudinal snapshots into the behaviour of the network and mean
receptive field spread and drift, as well as a comparison between different initial
connectivity types (topographic, random percentage-based and minimal). In the
initial stages of the simulation, σ_aff and AD are almost zero due to the lack of connections, but they steadily increase with the massive addition of new synapses.
Figure 7.24 shows a side-by-side comparison of the number of rewires between
early development and adult refinement. The developmental model initially sees a
large number of synapses being formed until an equilibrium is reached at around
10% connectivity. A 10% connectivity is also achieved when starting the network
from an adult configuration. This does not mean that every set of parameters will
yield the same result. In this case, and all the others in this work, we locked the
maximum fan-in for target layer neurons to 32, or 12% connectivity. As a result,
the network is bound to have at most that connectivity and at least half that, or
6%, if formation and removal occur with equal probability.
Table 7.8 shows the final, single-trial results for a network identical to previous experiments, but with a run time of 600 seconds. This ensures the networks have a chance to converge on a value of σ_aff and AD. A comparison between Tables 7.7 and 7.8 shows similar results for Case 1, but significantly better results for Case 3. These differences are summarised in Figure 7.25. σ_aff and AD were
computed at the end of three simulations differing only in the initial connectivity:
an initial rough topographic mapping as in the previous experiments, a random
Table 7.7. Simulation results presented in a similar fashion to Bamford et al. [8]
(Table 2) for three cases, all of which incorporate synaptic plasticity. Case 1 consists of
a network in which both synaptic rewiring and input correlations are present. Case 2
does not integrate synaptic rewiring, but still has input correlations, while Case 3 relies
solely on synaptic rewiring to generate sensible topographic maps.
10% initial connectivity balanced between feedforward and lateral, and almost no
initial connectivity (in practice, one-to-one connectivity was used due to software
limitations). We do not simulate the case without synaptic rewiring, as the results
would be severely impacted by the lack of rough initial topographic mapping. The
Table 7.8. Results for modelling topographic map formation from development (minimal
initial connectivity).
Figure 7.25. Comparison of final values for σ_aff and AD in the case where input correlations are absent (Case 3). Three types of networks have each been run 10 times (to generate the standard error of the mean), each starting with a different initial connectivity: an initial rough topographic mapping as in the previous experiments, a random 10% connectivity (5% feedforward, 5% lateral) and almost no connectivity (one-to-one connectivity used due to software limitations).
Figure 7.26. Evolution of results of interest. The top row shows the evolution of the mean spread of receptive fields over time, considering both unitary weights (σ_aff-conn) and actual weights (σ_aff-weight) at that point in time. The bottom row shows the evolution of the mean absolute deviation of the receptive fields considering connectivity (AD_conn) and weighted connectivity (AD_weight). Error bars represent the standard error of the mean.
Figure 7.27. Target layer firing rate evolution throughout the simulation. The instanta-
neous firing rate has been computed in 1.2-second chunks for simulations where lateral
connections are excitatory (lat-exh) and for simulations where lateral connections are
inhibitory (lat-inh).
partner selection mechanism to focus its attention mostly on the target layer. We
have achieved a reduction in the target layer firing rate by introducing inhibitory
lateral connections. This is sufficient to generate a stable topographic mapping that
matches quite closely the results of the original network when both input correla-
tions and synaptic rewiring are present: σ_aff-conn = 1.74, σ_aff-weight = 1.38,
AD_conn = 0.85, AD_weight = 0.98; all results are significant. The combined
choice of sampling mechanism and lateral inhibition has a homeostatic effect upon
the network.
Conversely, in the cases where input correlations are present, we see stable
topographic mapping, regardless of the presence of synaptic rewiring, as well as
significantly more feedforward synapses. Finally, no applicable network was neg-
atively impacted by initialising the connectivity either randomly or with minimal
connections.
To sum up, the model can generate transiently better topographic maps in
the absence of correlated input when starting with a negligible number of ini-
tial connections or with completely random connectivity. These results can also
be stabilised with the inclusion of lateral inhibitory connections preventing self-
sustained waves of activity within the target layer. Experiments using correlated
inputs do not require inhibitory lateral feedback either for reducing the spread
of the receptive field or for maintaining a stable mapping. Finally, the model has
proven it is sufficiently generic to accommodate changes in initial connectivity, as
well as type of lateral connectivity, that is, the change from excitatory to inhibitory
synapses.
Figure 7.28. Network architecture used for training. A source layer displays a series of
examples of handwritten digits; each example from a particular class is projected to the
target layer corresponding to that class.
Figure 7.29. Top: Input rate-based MNIST digit representation. Bottom: Reconstruction
of the learned digits when connectivity is adapted using only structural plasticity.
into the connectivity of the network. It is then possible to test the quality of classi-
fication. For this, we make use of a single source layer. The previously learnt con-
nectivity is used to connect all of the target layers to the source layer, and all plas-
ticity is disabled. The source layer now displays class-randomised examples, each
for 200 ms. The classification decision is made off-line, based on which target layer
has the highest average firing rate within the 200 ms period.
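As an illustration of this decision rule, the following minimal Python sketch picks the class whose target layer fired most during one presentation; the spike counts used here are hypothetical, not taken from the experiments.

```python
import numpy as np

# Off-line classification decision: `spike_counts` is a hypothetical
# (n_classes,) array holding, for one 200 ms presentation, the number
# of spikes emitted by each class's target layer.
def classify_presentation(spike_counts, duration_ms=200.0):
    # Convert counts to average firing rates (Hz) over the presentation.
    rates = np.asarray(spike_counts) / (duration_ms / 1000.0)
    # The predicted class is the target layer with the highest rate.
    return int(np.argmax(rates))

# Example: target layer 3 fires most, so the digit is classified as 3.
print(classify_presentation([12, 5, 9, 31, 7, 3, 18, 2, 11, 6]))
```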
This is not a state-of-the-art MNIST classification network (it achieves a modest
accuracy of 78% and a Root Mean Squared Error (RMSE) of 2.01; Figure 7.30
reveals the mistakes the network made), as each input digit class is represented only
as an average for that class, but it serves here to demonstrate that synaptic rewiring
can enable a network to learn, unsupervised, the statistics of its inputs. Moreover,
with the current network and input configuration, the quality of the classification
is critically dependent on the sampling mechanism employed in the formation of
new synapses. Random rewiring, as opposed to preferentially forming connections
to neurons that have spiked recently, could achieve accurate classification only if
operating in conjunction with STDP.
Finally, this approach is also critically dependent on the encoding scheme used
to represent the input. A rate-based encoding as shown here is required if no STDP
is present.
1. http://dx.doi.org/10.17632/xfp84r5hb7.1#folder-36833daa-91a8-499c-a898-65a96e22958b
Figure 7.31. Left: the input spiking activity represented by a Poisson spike generator with
rates described by the Gaussian curve in black; centre: the neuron identifiers considered
for formation throughout an entire simulation if this choice relies on selecting the last
neuron to have spiked; right: the neuron identifiers considered for formation through-
out an entire simulation if the choice relies on selecting an arbitrary partner among the
ones that have spiked since the last time step. In bright yellow, aligned across all plots:
the neuron with the highest firing rate. Additionally, the partitioning of the pre-synaptic
population is highlighted in green in the central and right-most figures.
generated input spikes with bimodal firing rates and showed how the connectivity
had adjusted. We needed only to look at which sources the formation attempts
were considering to reveal the issue.
Figure 7.32. (a) Network architecture. (b) Example input 45° movement represented as its constituent frames (before processing to generate spikes). A new frame is presented every 5 ms and, in total, the presentation of an entire pattern takes 200 ms.
using rank-order encoding (first classification neuron to spike wins), rather than
spike-rate encoding (classification neuron that fires most in a time period wins).
The SNN architecture (pictured in Figure 7.32(a)) is designed to allow unsu-
pervised learning through self-organisation using synaptic and structural plasticity
mechanisms [23]. Neurons in the two target populations are modelled as being
positioned at integer locations on a 32 × 32 grid with periodic boundary con-
ditions. The excitatory population contains neurons that receive sparse excitatory
connections from the input layers and from themselves, while projecting to the
inhibitory layer and to the readout neurons responsible for the final motion clas-
sification decision. The inhibitory population follows a similar structure, but only
projects using inhibitory synapses. Very strong inhibition is also present between
the readout neurons, implementing a WTA circuit. The networks are described
using the PyNN simulator-independent language for building neuronal network
models [44] and the SpiNNaker-specific software package for running PyNN simu-
lations (sPyNNaker [207]).2 The model is simulated in real time on the SpiNNaker
many-core neuromorphic platform using previously presented neuron and synapse
dynamics [23].
2. The data and code used to generate the results presented here are available from doi: 10.17632/wpzxh93vhx.1
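To make the architecture concrete, the following is a minimal PyNN sketch of a network of this shape. The cell parameters, connection probabilities and weights are placeholder assumptions rather than the values used in the experiments, and static synapses stand in for the synaptic and structural plasticity mechanisms of the actual model.

```python
import pyNN.spiNNaker as sim  # sPyNNaker [207]; other PyNN backends also work

sim.setup(timestep=1.0)  # 1 ms time step

n = 32 * 32  # neurons on a 32 x 32 grid with periodic boundary conditions
inputs = sim.Population(n, sim.SpikeSourcePoisson(rate=5.0), label="input")
exc = sim.Population(n, sim.IF_curr_exp(), label="excitatory")
inh = sim.Population(n, sim.IF_curr_exp(), label="inhibitory")
readout = sim.Population(2, sim.IF_curr_exp(), label="readout")

# Sparse excitation from the input and from the excitatory layer onto
# itself; probabilities and weights here are illustrative placeholders.
sim.Projection(inputs, exc, sim.FixedProbabilityConnector(0.05),
               sim.StaticSynapse(weight=0.5, delay=1.0))
sim.Projection(exc, exc, sim.FixedProbabilityConnector(0.05),
               sim.StaticSynapse(weight=0.2, delay=1.0))
sim.Projection(exc, inh, sim.FixedProbabilityConnector(0.05),
               sim.StaticSynapse(weight=0.5, delay=1.0))
sim.Projection(exc, readout, sim.AllToAllConnector(),
               sim.StaticSynapse(weight=0.1, delay=1.0))
# The inhibitory population mirrors the structure, but projects through
# inhibitory synapses only.
sim.Projection(inh, exc, sim.FixedProbabilityConnector(0.05),
               sim.StaticSynapse(weight=0.5, delay=1.0),
               receptor_type="inhibitory")
# Very strong mutual inhibition between readout neurons forms the WTA.
sim.Projection(readout, readout,
               sim.AllToAllConnector(allow_self_connections=False),
               sim.StaticSynapse(weight=5.0, delay=1.0),
               receptor_type="inhibitory")

sim.run(1000)  # ms
sim.end()
```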
The input stimulus consists of bars encoded using spikes representing ‘ON’ and
‘OFF’ pixels (see Figure 7.32(b) for an example before filtering using a previously
described technique [194]) as well as a background level of Poisson noise (5 Hz).
Each stimulus is presented over a 200 ms time period always moving at a constant
speed (200 frames per second). During training, the target layers are presented with
bars moving in two directions (eastward, at 0°, and northward, at 90°), but
during testing they are presented with moving bars in all directions (randomised
over time, in 5° increments); weights and connectivity are fixed during this latter
phase. The simulations are initialised with no connections and are trained for
around 5 hours, while testing occurs over 20 minutes. As a result of the chosen
testing regime, the networks see over 80 moving bar presentations at each of the 72
angles. This allows us to perform a pair-wise independent t-test between the responses
at each of the angles in the two cases and establish whether their responses are sta-
tistically different. The readout neurons are trained and tested separately from the
rest of the network – this process takes on the order of a minute.
Using the structural plasticity mechanism implemented for SpiNNaker, new
synapses are formed in two regimes: with heterogeneous, random delays ([1, 15]
ms, uniformly drawn) and homogeneous, constant (1 ms) delays; the latter is taken
to be the control experiment. Further, according to the structural plasticity mech-
anism, depressed synapses are more likely to be removed. This optimises the use of
the limited synaptic capacity available for each post-synaptic neuron [73]; neurons
have a fixed maximum fan-in of 128 synapses with fixed delays.
The Direction Selectivity Index (DSI) is computed for each neuron after training: DSI = (R_pref − R_null)/R_pref, where R_pref is the response of a neuron in the preferred direction and R_null is the response in the opposite direction [157]. We compute it for each of the possible directions and establish the
preferred direction as that which maximises the DSI. Individual neurons generally
have noisy responses. As such, to avoid the noise skewing the computation, we fil-
ter the response of the neuron by applying a weighted average on individual angle
responses.
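A minimal sketch of this computation follows; the 72-entry response array is hypothetical and the three-point smoothing kernel is an assumption, since the exact weighting used for the filter is not specified here.

```python
import numpy as np

def smooth_circular(responses, kernel=(0.25, 0.5, 0.25)):
    # Weighted average over neighbouring angles, so that single noisy
    # bins do not skew the preferred-direction estimate.
    r = np.asarray(responses, dtype=float)
    padded = np.concatenate([r[-1:], r, r[:1]])  # wrap around 360 degrees
    return np.convolve(padded, np.asarray(kernel), mode="valid")

def preferred_direction_and_dsi(responses):
    r = smooth_circular(responses)
    null = np.roll(r, r.size // 2)          # response 180 degrees away
    with np.errstate(divide="ignore", invalid="ignore"):
        dsi = np.where(r > 0, (r - null) / r, 0.0)
    best = int(np.argmax(dsi))              # direction maximising the DSI
    return best * 5, float(dsi[best])       # degrees, DSI value

# Example: a noisy profile with a peak at 90 degrees (bin 18).
angle, dsi = preferred_direction_and_dsi(
    np.random.rand(72) + 3.0 * (np.arange(72) == 18))
print(angle, dsi, dsi >= 0.5)               # selective if DSI >= 0.5
```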
More formal analysis, although less specific to the task of motion detection, is
also performed. Entropy is computed using each neuron’s normalised spiking pro-
file. After testing, each neuron’s firing profile R(X ) is computed in relation to each
one of the i = 1 → 72 input movement directions. Normalisation is performed
by dividing every response by the sum of all responses:

P_i(X) = R_i(X) / Σ_j R_j(X)    (7.31)

so that Σ_i P_i(X) = 1 for each neuron X. The firing profile of the neuron can thus be treated as a probability distribution over the input movement directions, and its entropy is computed as H(X) = −Σ_i P_i(X) log₂ P_i(X).
The maximum entropy in the presented system is thus −log₂(1/72) ≈ 6.17
bits, which corresponds to a neuron displaying equal spiking activity at all presented
angles, that is, no selectivity whatsoever. Neuron X is said to be very selective if it simul-
taneously maximises DSI (DSI(X) → 1) and minimises entropy (H(X) → 0).
It is sufficient for a neuron to have DSI(X) ≥ 0.5 to be considered selective.
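Equation 7.31 and the entropy bound can be checked in a few lines of Python; the response arrays here are illustrative.

```python
import numpy as np

# Normalised firing profile (Equation 7.31) and its entropy in bits.
def firing_profile_entropy(responses):
    r = np.asarray(responses, dtype=float)
    p = r / r.sum()                    # P_i(X) = R_i(X) / sum_j R_j(X)
    p = p[p > 0]                       # 0 log 0 is taken as 0
    return -np.sum(p * np.log2(p))     # H(X)

uniform = np.ones(72)
print(firing_profile_entropy(uniform))   # ~6.17 bits: no selectivity
one_hot = np.zeros(72); one_hot[0] = 100.0
print(firing_profile_entropy(one_hot))   # 0 bits: maximally selective
```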
Both DSI and entropy are used to select and investigate the behaviour of indi-
vidual neurons. In the following section, we will display quadruplets of individual
neuron responses that have maximal grid-aligned responses and minimal orthogonal and opposite responses: argmax_X [R_i − (R_{i+90°} + R_{i−90°} + R_{i+180°})]. Here i is, in turn, 0°, 90°, 180° and 270°, that is, the cardinal directions. The distributions of entropy and DSI are also included for all experiments. Comparison of these distributions is performed by applying both Welch's t-test and the Kruskal–Wallis H-test.
After training the readout units, it is possible to establish the class of an input. Based on the predicted label and the known true labels, we report the accuracy, recall and F-score of the network, as well as the RMSE. We define T_p and T_n to mean the number of true positive and true negative examples, while F_p and F_n refer to the number of false positives and negatives, respectively. Recall or sensitivity, intuitively the ability of the classifier to find positive samples, is reported as T_p/(T_p + F_n). Precision or the positive predictive value, intuitively the ability of the classifier not to label as positive an example that is negative, is computed as T_p/(T_p + F_p). Given these metrics, we can compute the harmonic mean of precision and recall to give the F-score, F1 = 2 (precision × recall)/(precision + recall).
Due to the readout architecture and experimental parameters, we also investigate
the number of instances in which no readout neuron produces a spike.
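These definitions translate directly into code; the counts in the usage example are made up.

```python
# Classification metrics from true/false positive/negative counts.
def classification_metrics(tp, tn, fp, fn):
    recall = tp / (tp + fn)            # sensitivity
    precision = tp / (tp + fp)         # positive predictive value
    f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```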
The response of the excitatory population in each regime (incorporating het-
erogeneous delays or not) is plotted for each testing direction (minimum, mean
and maximum responses presented in Figure 7.33(a)). The polar plot reveals the
firing rate (Hz) of neurons during testing when the input is moving in each of the
72 directions from 0° to 355° in 5° increments in a random order. The network
response shows that neurons are responding preferentially to movement, rather than
simply to the shape of the input, because the response is asymmetrical – it can
differentiate between, for example, a vertical bar moving eastward and the same
vertical bar moving westward. The pair-wise independent t-test is performed to compare the network response in the two regimes (Figure 7.33(c), where a red line signifies that p ≥ 0.001 for that particular angle); the response is higher in one training direction
Figure 7.33. Spiking activity comparison between networks where rewiring assigns random and constant delays, respectively, to new synapses. Both networks were trained using bars moving eastward (0°) and northward (90°). (a) the minimum, mean and maximum aggregate excitatory population firing response (Hz); (b) neuron angle preference based on maximum firing rate encoded by the colour and DSI represented by the arrow direction (present only if DSI ≥ 0.5); (c) pair-wise independent t-test comparing the network with heterogeneous delays (on the left in a and b) to the control setup (on the right in the same subplots); red lines show the angles at which the comparison yielded insignificant results; (d) selected individual neuron responses in the 4 grid-aligned directions (random delays); (e) selected individual neuron responses in the 4 grid-aligned directions (constant delays); (f) individual overlaid selective neuron responses after filtering (DSI ≥ 0.5); (g) histogram comparison of all DSI values in the two networks; (h) histogram comparison of all entropy values in the two networks.
(90°) and lower in the other (0°) for the network with heterogeneous delays compared to the control. As such, we proceed by examining individual neurons rather
than the average network behaviour. The spatial organisation of neurons and their
preferred angle is presented in Figure 7.33(b), showing that local neural neighbour-
hoods become sensitised to the same input statistics. There we also look at neurons’
maximum responses (encoded by the colour of the cell) in conjunction with the
direction that maximises DSI (arrow direction) and DSI ≥ 0.5 (arrow presence).
The DSI histogram presented in Figure 7.33(g) compares the two networks; the
Figure 7.34. Network evolution over a wide range of simulation run times when trained
on two angles. (a) average network firing response during inference when trained for
ever-increasing times; (b) average number of afferents (incoming connections) for each
neuron in the excitatory target layer; (c) DSI distribution displayed as a boxplot for each
simulation in (a); (d) entropy distribution displayed as a boxplot for each simulation in
(a). Note: Each data point is a different simulation.
control network has significantly fewer selective neurons (251 compared to 744)
and selectivity is lower on average. Individual responses of our simulated neurons
resemble the direction selectivity found in Superior Colliculus [112].
Further, we examine the network behaviour over a wide range of simulation
times, ranging from 40 minutes up to 20 hours. Figure 7.34(a) shows the evolution
of the population-level firing rate and the evolution of the DSI metric
(Figure 7.34(c)). The network is thus shown to be stable over long periods of time,
rather than showing destructive dynamics.
A readout or classification mechanism relying on two mutually inhibitory neu-
rons is sufficient to resolve the two directions presented in the input. Static excita-
tory connectivity originating from the excitatory layer results in a potential 100%
classification accuracy based on rank-order encoding. After 40 seconds, the two
neurons have self-organised to respond to one of two input patterns. Figure 7.35
shows the spiking behaviour of the two neurons in the first 1.8 seconds of training
and testing. STDP reduces the latency in neural response to the stimuli, making
the neurons respond to the stimulus onset, thus making them ideal for classification
using rank-order encoding, rather than a WTA classification based on spike count
across a time period [103].
Figure 7.35. Initial spiking activity of the two readout neurons during training (a) and
testing (b). The full-height vertical bars denote the edges of the pattern presentation time
bins (every tstim = 200 ms). Neuron class is established post hoc as the one maximising
classification accuracy.
7.6 Neuroevolution
Figure 7.36. The evolutionary algorithm loop: a population is generated and evaluated; if the termination criteria are not met, selection, reproduction and variation produce the population for the next generation; once the criteria are met, the optimised population is returned.
target task.3 The agents in the population that perform best are selected to repro-
duce, with the offspring having some variation applied. The offspring form the
population for the next generation, see Figure 7.36, often with a small portion of
the best-performing agents automatically passing into the next generation. This is
known as elitism and has been shown to aid convergence to solutions [46]. Like
evolution by natural selection in biology [43], survival of the fittest in EAs is an
unguided force which can lead towards better-performing agents.
How the performance of agents is evaluated is task-specific; however, it does not
require gradient information, making an EA approach particularly well suited to
tasks in which the error is either difficult or impossible to differentiate. The terms
EA and Genetic Algorithm (GA) are often used interchangeably, but technically
GAs are a subclass of EAs in which agents are encoded as discrete values in ‘genes’,
and random mutation and crossover add variation to offspring.
The scale of large SpiNNaker systems, such as the one million core machine at
the University of Manchester, lends itself to population-based search methods.
The parallelism available with such systems allows model execution to become
invariant with respect to population or network size, the main components leading
3. So far, population has been used in the context of PyNN to describe a group of neurons. In the context of
EAs population is used to refer to a group of individuals (whole networks) to be optimised, unless otherwise
specified.
7.6.4 Methods
Figure 7.37 shows the structure of the simple convolutional SNN model, the
weights of which were optimised for the MNIST digit recognition task using a GA.
The spikes of the 28 × 28 input layer were rate-coded representations generated
from the MNIST images. The hidden layer was a 24 × 24 layer formed by the
convolution of the input with a 5 × 5 filter. The hidden layer was fully connected
to the 10 output neurons. The 25 weights of the filter and the 5,760 weights of the
fully connected layer were encoded in a 5,785-base gene, with the bases taking
integer values in the range −1 to +1. The details of the GA are given in
Table 7.9. Two experiments were carried out to better understand the effect of
different initialisation on the evolution of the population over 304 generations: the
unseeded population was initialised entirely at random, while half of the seeded
population was initialised with the centre-surround filters shown in Figure 7.38.
Figure 7.37. The structure of the simple SNN model optimised using a GA.
Figure 7.38. The centre-surround filters used to seed the seeded population.
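A minimal sketch of the gene encoding and variation operators implied by this description follows; the mutation rate and single-point crossover scheme are assumptions, not the settings of Table 7.9.

```python
import numpy as np

rng = np.random.default_rng()
GENE_LENGTH = 25 + 5760   # 5 x 5 filter weights + fully connected weights

def random_individual():
    # Bases are integer values in {-1, 0, +1}.
    return rng.integers(-1, 2, size=GENE_LENGTH)

def crossover(parent_a, parent_b):
    # Single-point crossover: the child takes a prefix from one parent
    # and the suffix from the other.
    point = rng.integers(1, GENE_LENGTH)
    return np.concatenate([parent_a[:point], parent_b[point:]])

def mutate(genes, rate=0.01):
    # Each base is independently re-drawn with probability `rate`.
    mask = rng.random(GENE_LENGTH) < rate
    genes = genes.copy()
    genes[mask] = rng.integers(-1, 2, size=mask.sum())
    return genes

child = mutate(crossover(random_individual(), random_individual()))
```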
7.6.5 Results
Figure 7.39 shows the evolution of the training accuracy of the two populations
over 304 generations. The five top performing individuals from the final popula-
tions were evaluated against the MNIST testing set and the best individuals gave
66.7% and 63.9% testing accuracy, unseeded and seeded, respectively. These results
are far from state-of-the-art accuracies but demonstrate that a GA can be used for
optimisation in this way.
It was hypothesised that the same convolutional filter may evolve from an
unseeded population independent of the random initialisation; however, multiple
runs of smaller populations showed that, on the order of hundreds of generations,
this is not the case. During the course of these experiments, the time performance
of the system was evaluated. It was found that the overhead of submitting a Spalloc
job merited redesigning the framework to allow multiple models to be evaluated
in one job. As a reminder, Spalloc is the current SpiNNaker job submission sys-
tem which allocates a subset of the entire machine to individual users. This work
demonstrated that it is feasible to use a GA to tune the parameters of an SNN
model on SpiNNaker.
Figure 7.39. Training accuracy against generation for a GA with two populations of
24,000 individuals, each individual being an SNN model. The seeded population was
initialised with 12,000 centre-surround filters (6,000 positive and 6,000 negative) and
12,000 individuals with randomly initialised filters. The unseeded population was
initialised with 24,000 random individuals.
ANN models and to uncover novel learning mechanisms in SNNs. Looking more
broadly, the development of robust model optimisation frameworks could well lead
to a change in the way that research is carried out.
Different EAs
The modification of genes in EAs is random and mutations are undirected; an Evo-
lutionary Strategies (ES) algorithm [15] may well be able to achieve similar results
with far fewer evaluations. It does this by mutating the best agent in a population
multiple times to create a new generation. The mutation vector of each individual is
then scaled by that individual's performance, and the scaled vectors are summed to
give an approximation of the gradient, allowing the next generation's mutations to
move in the direction in the solution space which produced the best performance.
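A minimal sketch of such an ES step, under the usual Gaussian-perturbation formulation; the fitness function, population size, noise scale and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def es_step(theta, fitness, pop_size=50, sigma=0.1, lr=0.05):
    # Mutate the current best parameters pop_size times.
    noise = rng.standard_normal((pop_size, theta.size))
    scores = np.array([fitness(theta + sigma * n) for n in noise])
    # Scale each mutation vector by its (normalised) performance and sum:
    # this is the gradient approximation guiding the next generation.
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    grad = noise.T @ scores / (pop_size * sigma)
    return theta + lr * grad

# Example: climb towards the maximum of a simple quadratic fitness.
theta = np.zeros(10)
for _ in range(100):
    theta = es_step(theta, lambda t: -np.sum((t - 1.0) ** 2))
print(theta)  # approaches the optimum at 1.0 in every dimension
```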
One of the key features of SpiNNaker is its asynchronicity, a feature that lends
itself to lesser-studied asynchronous steady-state EAs [2]. In these types of models,
fitness values are not gathered at the end of a generation as they would be in a typical
generational GA; rather, models compete among themselves locally. This kind of
optimisation algorithm could be run directly on SpiNNaker and would minimise
the overheads associated with scattering and gathering data across multiple chips
and boards. We estimate that running the EA ‘on-machine’ could halve the time
taken to evaluate models similar in scale to that of the MNIST test network above.
Machine Learning
A key area of interest is understanding the relationship between SNN and ANN
models and translation between them. Deep ANN models trained by error back-
propagation give state-of-the-art performance in many benchmark machine learn-
ing tasks. An automated optimisation framework could be used to help convert
ANN models to SNNs to be run on SpiNNaker. This would dramatically increase
energy efficiency for practical applications as well as adding to our understanding
of how information is processed in neural networks more generally.
Learning-to-Learn
A current fruitful area of research, dubbed Learning-to-Learn (L2L), applies opti-
misation techniques to fine tune the hyperparameters of another optimisation algo-
rithm [12]. An example of this could be using a genetic algorithm to evolve the
parameters which control how backpropagation performs. This can help find cer-
tain parameters and starting conditions conducive to fast learning on novel tasks.
With the two optimisers working at different time scales, this mimics the slow
evolutionary process of genetic variation, with the fast time scale acting in a similar
way to behaviour learned during a lifetime.
Impact on Computational Neuroscience
By allowing work with biological models that do not have full parameter sets,
the focus of researchers working with experimentally derived models could move
towards higher levels of abstraction and larger-scale models. This step is an impor-
tant one to bridge the gap between understanding information processing in the
brain and the application of such knowledge in future neuromorphic systems.
DOI: 10.1561/9781680836530.ch8
Chapter 8
Creating the Future
— Abraham Lincoln
In this chapter, we take a look into the future of this technology. First we sur-
vey interesting developments in hardware accelerators for SNNs and ANNs, but
then we focus primarily on the second-generation SpiNNaker developments. Here
we will refer to the current SpiNNaker machine as SpiNNaker1 and the second-
generation machine as SpiNNaker2.
This is an exciting time as large corporations are exploring the usefulness of neu-
romorphic systems, and it is no longer an effort driven solely by academia. Most
systems are in a research prototype phase, rather than fully commercially viable
products. Current offerings include:
• In data centres: NVidia’s Tesla Graphics Processing Units (GPUs) (P100 for
training, P4 & P40 for inference)6 ; Intel’s Nervana L-1000 Neural Network
Processor (NNP)7 ; Graphcore’s Colossus Intelligence Processing Unit (IPU)8 ;
Google’s Tensor Processing Unit (TPU).9
• In mobile devices: Huawei’s Kirin970 AI Processor10 ; Qualcomm’s AI
Engine11 ; Imagination’s PowerVR Series2NX and Series3NX12 ; Apple’s A12
Bionic13 ; Cadence’s and Tensilica’s HiFi 5 Digital Signal Processor (DSP)14 ;
ARM’s Trillium Project15 ; LG’s AI Chip.16
1. https://newsroom.intel.com/tag/loihi/#gs.7e79qw
2. http://www.research.ibm.com/articles/brain-chip.shtml
3. https://etacompute.com/
4. https://aictx.ai/
5. https://www.brainchipinc.com/
6. https://www.nvidia.com/en-gb/deep-learning-ai/
7. https://www.intel.ai/nervana-nnp/
8. https://www.graphcore.ai/
9. https://cloud.google.com/tpu/
10. http://www.hisilicon.com/en/Media-Center/News/Key-Inf ormation-About-the-Huawei-Kirin970
11. https://www.qualcomm.com/snapdragon/artificial-intelligence
12. https://www.imgtec.com/vision-ai/powervr-series3nx/
13. https://www.apple.com/uk/iphone-xs/a12-bionic/
14. http://www.cadence.com/go/hif i5
15. https://www.arm.com/products/silicon-ip-cpu/machine-learning/project-trillium
16. http://www.lgnewsroom.com/2019/05/lg-to-accelerate-development-of-artificial-intelligence-with-own-ai-
chip-2/
8.2 SpiNNaker2
Strengths
• Software neuron and synapse modelling. Although the use of software inevitably
compromises energy-efficiency compared with hard-wired analogue or digi-
tal algorithms, for a research platform we believe that the resulting flexibility
more than warrants this sacrifice.
• Multicast packet routeing. The SpiNNaker packet routeing mechanism
has proved its ability to adapt to a wide range of use profiles and
Weaknesses
• Host I/O performance. The 100 Mbit Ethernet I/O on each SpiNNaker1
board has proved a major bottleneck in the machine’s use. Although we have
found ways to circumvent this bottleneck in a number of circumstances,
much higher I/O bandwidth would greatly improve the performance and
usability of the machine.
• Memory sharing. In SpiNNaker1, each processor core has its private local
memory and (slower) access to the shared SDRAM. The trend towards
increased communication between the cores on a chip, for example, when
neuron and synapse modelling runs on different cores, has made moving data
between cores increasingly important, but on SpiNNaker1 this can only be
done via SDRAM.
synaptic weights, the communication of spike events and the storage of synaptic
weights.
8.3 SpiNNaker2 Chip Architecture

Figure 8.1 shows the SpiNNaker2 chip top-level architecture. It follows the same
concept as the SpiNNaker1 neuromorphic computation system with:
The processing elements (PEs) are arranged in groups of quads (QPEs) which
form tiles for the homogeneous processor array. SpiNNaker2 employs a mesh-based
NoC where every QPE constitutes one node of the mesh grid. The NoC is respon-
sible for all types of communication between the on-chip components and from/to
the off-chip interfaces. This includes chip boot-up and configuration data transfers,
Figure 8.1. SpiNNaker2 chip architecture. The chip mainly comprises a 7 × 6 array of Quad
Processing Elements (QPEs), with two of these replaced by the SpiNNaker router and two
by the east inter-chip Serialiser/Deserialiser (SerDes) link, leaving 38 QPEs incorporating
in total 152 PEs.
spike traffic and off-chip memory data traffic. Therefore, all other chip top-level
components are connected to the NoC as well. These include the following:
8.4 SpiNNaker2 Packet Router

The SpiNNaker router is the key component in the SpiNNaker machine for SNN
simulations. The SpiNNaker1 router was described in detail in Section 2.2.3.
The router on the second-generation SpiNNaker chip incorporates the improvements
required to support larger routeing tables and increased communication
throughput, since the SpiNNaker2 chip contains more than 100 processing elements.
All the on-chip and inter-chip spikes are routed by the SpiNNaker router. The
whole SpiNNaker2 packet router is designed with fully pipelined packet flows
and provides higher throughput and performance. The top-level structure of the
SpiNNaker2 packet router is shown in Figure 8.2.
The SpiNNaker2 packet router has 6 (parameterised) on-chip and 7 off-chip
communication channels, occupying the same area as 2 QPEs. These channels
attach to and share the 6 bi-directional NoC ports running at 400 MHz.
Compared with the single input stream of the SpiNNaker1 packet router (running
at around 100 MHz), the 6 parallel ports can absorb 2.4 G input packets per second
(GPKT/s), that is, 6 ports × 400 MHz, which is 24 times the SpiNNaker1 rate. Further-
more, the parallel routeing engines give the SpiNNaker2 packet router
better routeing efficiency than the SpiNNaker1 packet router, where at most one
packet can be processed per cycle.
The SpiNNaker2 packet router is currently designed to run at 400 MHz (via
6 NoC ports), which maintains the same maximum theoretical throughput as the
SpiNNaker1 packet router. However, the realistic throughput will be increased
by easing the communication bottleneck. The PE, running at the same speed
(200 MHz), can now take a packet from the network every 1 or 2 processor cycles.
The bandwidth of the off-chip I/Os is also increased significantly.
The output star network of the SpiNNaker1 packet router is an efficient net-
work for multicasting. However, the centralised arbitration and buffering limit its
scalability. The SpiNNaker2 chip will incorporate 152 PEs. Therefore, a 2D mesh
network is chosen to provide better scalability. Compared with SpiNNaker1,
the different network topology also brings different design challenges for the
SpiNNaker2 packet router.
The basic function of a router is to route each packet to its destination(s). However,
the SpiNNaker2 packet router does more than that: it performs different routeing
algorithms and system monitoring functions, and includes power optimisation and
high-performance circuits within a limited power and area budget. Below are some
new features and differences compared with the SpiNNaker1 packet router.
The 7th SpiNNaker link: This is an additional SpiNNaker link which is func-
tionally similar to the other 6 SpiNNaker links. There are several advantages of
Figure 8.2. Top-level structure of the SpiNNaker2 packet router, showing the multicast (MC), core-to-core (C2C) and nearest-neighbour (NN) routeing engines, the general registers and the out-of-order issue buffer.
this additional link. First, the addition of the 7th link provides a dedicated connection
for interaction with other neuromorphic devices without breaking the
torus already formed using the other 6 SpiNNaker links. Second, the 7th
link can be used for a hyperconnection to another node in the system, which can
significantly reduce the routeing latency in a simulation, because routeing
through this short-cut path does not need to pass through multiple nodes to reach
the destination. The disadvantage is that it introduces a cost: the size of the
multicast look-up table increases by one destination bit per entry.
However, it does not incur extra cost to the SpiNNaker core-to-core (C2C) and
nearest-neighbour (NN) packets.
Out-of-order issue buffer: The output star network in the SpiNNaker1 packet
router can issue a single multicast (MC) packet efficiently. The output strategy for
MC packets is All or Nothing (AoN) where the MC packet will only be sent if
all of its destinations are available. Therefore, in the SpiNNaker1 packet router,
if one MC packet stalls due to an unavailable destination, all the subsequent MC
packets will stall. In the SpiNNaker2 packet router, the out-of-order issue buffer is
designed to further improve the output efficiency of MC packets. If the first MC
packet stalls at the output of the multicast routeing engine, it will move to the
out-of-order buffer unit. The out-of-order issue buffer can accommodate several
MC packets. At each router clock cycle, the out-of-order issue buffer can send any
MC packet which does not have a blocked destination. The output efficiency is
increased by issuing the MC packets out of order.
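A behavioural sketch of this issue policy follows (in Python, not RTL); the buffer capacity and packet representation are assumptions made for illustration only.

```python
from collections import deque

class OutOfOrderIssueBuffer:
    # Each cycle, issue any buffered multicast packet none of whose
    # destinations is blocked, instead of stalling on the head packet
    # as the SpiNNaker1 output star network does.
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buffer = deque()

    def accept(self, packet):
        if len(self.buffer) < self.capacity:
            self.buffer.append(packet)
            return True
        return False  # back-pressure the routeing engine

    def try_issue(self, blocked_destinations):
        # All-or-Nothing per packet: issue only if every destination of
        # the packet is free; a stalled packet no longer blocks others.
        for packet in list(self.buffer):
            if not (packet["dests"] & blocked_destinations):
                self.buffer.remove(packet)
                return packet
        return None

oob = OutOfOrderIssueBuffer()
oob.accept({"key": 0x1234, "dests": {0, 3}})
oob.accept({"key": 0x5678, "dests": {5}})
print(oob.try_issue(blocked_destinations={3}))  # issues the second packet
```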
Latch-based TCAM and built-in self-test: The ternary content addressable memory
(TCAM) is the key component in multicast routeing. It is designed without read-out
circuitry to save area. In SpiNNaker1, testing the TCAM involved a lot of human
effort to design the test program. Therefore, a built-in self-test unit is devised to
facilitate test automation and improve test efficiency.
8.5 The Processing Element (PE)

The key component of SpiNNaker2 is its processing element (PE). The PE architecture
is shown in Figure 8.3.
The PE is based on an ARM Cortex-M4 core with FPU. It contains 128 kBytes
of local SRAM which is accessible as data and instruction memory by the processor.
A crossbar handles local SRAM access from the processor core, the communication
controller and from neighbouring PEs inside one QPE. Various components are
8.5.1 PE Components
Communications Controller
The Communications Controller is the interface between the PE and the NoC
router. It is responsible for transmitting and receiving NoC packets to and from
the communication network. It incorporates a bridge unit, a communication unit,
Rounding Accelerator
To improve the accuracy of a reduced precision neuron model, rounding can
be done at scalar operation level [102]. To support this in the next-generation
SpiNNaker chip, we are including a small hardware accelerator for stochastic round-
ing and round-to-nearest. Stochastic rounding in SpiNNaker is performed on fixed-
point numbers by rounding them to a specified bit position (usually to fit a long
number into 32 bits) probabilistically. The probability of rounding such a number
up is proportional to the round-off residual, and to achieve this, a PRNG is used.
Stochastically rounding fixed-point multiplication results has been shown to reduce
numerical error in the Izhikevich neuron ODE solvers on SpiNNaker [102].
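A minimal software model of this rounding scheme, with an ordinary PRNG standing in for the hardware one; for clarity the sketch assumes a non-negative fixed-point value held in a Python integer.

```python
import numpy as np

rng = np.random.default_rng()

def stochastic_round(value, drop_bits):
    # `value` is an integer holding a fixed-point number; the lowest
    # `drop_bits` bits are the round-off residual being discarded.
    residual = value & ((1 << drop_bits) - 1)
    truncated = value >> drop_bits
    # Round up with probability proportional to the residual.
    if rng.integers(0, 1 << drop_bits) < residual:
        truncated += 1
    return truncated

# Example: 1.5 represented with 32 fractional bits rounds up or down
# with equal probability, so the expected value is preserved.
print(stochastic_round(0x1_8000_0000, 32))
```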
chip is currently planned to have both fixed- and floating-point exponential and
logarithm functions with accuracy control in hardware and here we present some
information about the design of this accelerator.
In the SpiNNaker1 system, there is no hardware support for transcenden-
tal functions, including exponentials, so the models that were developed used
pre-computed Look-up tables (LUTs); this solution was explored in detail by
Vogginger et al. [263]. The SpiNNaker compiler first takes a high-level description
of the network dynamics specified by the user and pre-calculates a range of values
of exponential decay for a specific time constant and a number of time instances
on a fixed simulation time grid (either all possible times in a grid or a subset of
times, depending on memory constraints). Then, the LUTs are copied into each
core’s local memory and used while the application is running.
However, this approach has two limitations:
• a limited number of time constants and a limited input range can be used
due to the constraints of the on-chip memory, and
• in the case where a model requires time constants that depend on some
dynamic quantity, such as the voltage-dependent time constants in the
intrinsic currents of the well-known Hodgkin-Huxley neuron model and its
variants, the number of look-up tables required for each possible value that
the time constant can take would be too large to store in the local SpiNNaker
memory.
The memory requirements are further increased if the simulation time step
is 0.1 ms, which is rarely used on SpiNNaker1 but will be used on SpiNNaker2
as it will give more accuracy in all parts of the simulation. In this scenario,
the size of the LUTs for the same amount of time-decay look-up will grow 10
times. For example, modelling a 16-bit exponential decay e^(−Δt/τ_x) for 1 second and
all the values that Δt can take at a 0.1 ms simulation time step will require 20 kB
of memory space (10,000 time steps at 2 bytes per 16-bit entry). A software
exponential function is also available in the SpiN-
Naker software library, but with the latency of approximately 95 clock cycles it is
a major limitation to real-time synaptic plasticity processing, where a single pair of
spikes takes approximately 30 cycles (using LUTs for the exponential) as reported
by Knight and Furber [127]. With most learning rules we usually require more
than one exponential per spike pair processed. Learning rules requiring three or
more decay time constants have already started appearing in the computational
neuroscience literature and some have already been tried on SpiNNaker: see, for
example, voltage-dependent STDP [37] implemented on SpiNNaker [69], the
BCPNN learning rule [128, 263] and the neuromodulated STDP [165] learn-
ing rule.
Most of the algorithms for performing elementary functions are categorised into
two types: polynomial approximations or convergence algorithms [56, 173]. For this
accelerator, a well-known convergence algorithm [173] was chosen, which provides
exponential and natural logarithm functions with overlapping hardware compo-
nents. (Note that having both of these functions also allows us to derive a general
power function for a limited range of arguments). The implementation is based on
the iterative shift-add algorithms that are usually considered to be slower than poly-
nomial approximation due to the serial dependencies in the algorithm, but they do
not require multiplication, which reduces the area of the circuit. A further useful
property of these iterative algorithms is that after just a few iterations they already
contain an approximate result. This property is used to provide programmable
accuracy control, following the principles of approximate computing [92] (in this
case approximation comes not from the errors in the circuit as is most common,
but by running fewer iterations than are required for a precise result) in order to add
options for modellers to trade off accuracy against speed and energy. This property
will provide a platform for experimenting with concepts arising from the ongoing
discussion about the maximum precision of arithmetic required in neuromorphic
systems, for example, for representing weights in STDP [191] – the smaller the
weight, the less precise the calculation of weight changes that is required.
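A minimal sketch of such a shift-add convergence algorithm for the exponential follows; argument range reduction is omitted (the sketch handles x in [0, ln 2)), and the table size and iteration counts are illustrative. Multiplication by (1 + 2^−k) is a shift and an add in hardware, and stopping after fewer iterations gives the approximate result used for programmable accuracy control.

```python
import math

# Pre-computed constants ln(1 + 2^-k), as a hardware ROM would hold.
LN_TABLE = [math.log(1.0 + 2.0 ** -k) for k in range(32)]

def shift_add_exp(x, iterations=16):
    result = 1.0
    for k in range(iterations):
        if x >= LN_TABLE[k]:
            x -= LN_TABLE[k]               # peel ln(1 + 2^-k) off the argument
            result += result * 2.0 ** -k   # result *= (1 + 2^-k): shift and add
    return result

for n in (4, 8, 16):  # accuracy improves with the iteration count
    print(n, shift_add_exp(0.5, n), math.exp(0.5))
```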
Figure 8.4. SpiNNaker2 machine learning accelerator.
operations for one output value are performed in one sequence and the output is
written back afterwards. Due to the 16×4 arrangement of the MAc array, 128-bit
data for the first operand and 32-bit data for the second operand are required in
each clock cycle. Thus, the first operand fully occupies the bandwidth of either
SRAM or NoC interface. If both operands are located in local SRAM or both are
accessed via the NoC, MAc operations are interrupted periodically to fetch data for
the second operand.
For a convolution operation, the dimensions of the convolution have to be con-
figured besides the memory locations of input feature maps, kernels and output
feature maps. Each row of the MAc array is used for a different output channel,
that is, four output channels are calculated in parallel. Data flow is again output-
stationary.
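A behavioural sketch of this output-stationary data flow in plain Python/NumPy; it models only the arithmetic and the single write-back per output, not the 16×4 array geometry or its timing.

```python
import numpy as np

def conv_output_stationary(ifmaps, kernels):
    # ifmaps: (C_in, H, W) int8; kernels: (C_out, C_in, K, K) int8.
    c_out, c_in, k, _ = kernels.shape
    h_out = ifmaps.shape[1] - k + 1
    w_out = ifmaps.shape[2] - k + 1
    out = np.zeros((c_out, h_out, w_out), dtype=np.int32)
    for co in range(c_out):          # rows of the MAc array: one per channel
        for y in range(h_out):
            for x in range(w_out):
                # All MAc operations for one output value complete
                # before the single write-back (29-bit ACC in hardware).
                patch = ifmaps[:, y:y + k, x:x + k].astype(np.int32)
                out[co, y, x] = np.sum(patch * kernels[co].astype(np.int32))
    return out

rng = np.random.default_rng()
ifm = rng.integers(-128, 128, size=(1, 28, 28), dtype=np.int8)
ker = rng.integers(-128, 128, size=(4, 1, 5, 5), dtype=np.int8)
print(conv_output_stationary(ifm, ker).shape)  # (4, 24, 24)
```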
The ML accelerator supports both signed and unsigned inputs/output, which
can be configured independently. Furthermore, input data width can be changed
to 16-bit for either operand. This is realised with low overhead by adding together
several MAc register results upon write-back. Also, configurable truncation and
ReLU calculation are included in the write-back path. Output bit width can be
chosen to be 8-, 16- or 32-bit, so that output data can be either directly used for
the next DNN layer or further processed by the ARM core.
As a result, the widespread convolutional and fully-connected DNN layers with
ReLU activation function can be processed completely by the machine learning
accelerator, with the ARM core only configuring and starting it. Full flexibility for
The Processing Element (PE) 277
other layer types or activation functions (e.g. pooling layers or sigmoid activation
functions) is still provided by the ARM core.
higher supply voltages, the dynamic power consumption dominates the total energy
consumed. There exists a minimum energy point (MEP) around 0.50 V, where the
PE implementation is capable of operating at 150 MHz. Note that at 0.50 V all
standard cells are operating in a super-threshold regime for all PVT conditions,
since the ABB approach adaptively compensates the device threshold voltages for
PVT variations.
Although it is desired to operate the PE at the MEP for maximum efficiency, this
obviously does not result in significant processor performance scaling compared to
SpiNNaker1. Performance enhancement is achieved by applying Dynamic Volt-
age and Frequency Scaling (DVFS) [106, 107] to the PE. As shown in Figure 8.3
the PE core logic can be connected to one of two supply voltage rails. This allows
for energy-efficient operation at a low-performance level at 0.50 V and peak per-
formance operation at a higher-performance level at 0.60 V. It has been shown
[106, 107] that under the dynamics of spiking neuromorphic applications, where
peak processing power is only required in a few simulation cycles, this technique
significantly reduces the PE power consumption while still maintaining the temporal
peak performance of the PE. The performance level transition is scheduled
by a local power management controller at QPE level, based on the concept from
[105]. Clocks are generated by PLLs [216]. Using this approach, each PE is capable
of managing its own DVFS level just by knowing its local workload in the current
simulation cycle (e.g. the number of synaptic events to be processed) independently
of the other PEs. Performance level switching is realised in less than 100 ns, which
is a negligible timing overhead compared to the neuromorphic real-time simulation
with 0.1 ms or 1 ms timing resolution.
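A behavioural sketch of such a local DVFS decision; the workload threshold is an assumption made for illustration, not the published controller policy.

```python
# Supply rails available to the PE core logic.
PL_LOW_V = 0.50    # minimum energy point, 150 MHz operation
PL_HIGH_V = 0.60   # peak-performance rail

def choose_performance_level(synaptic_events_this_cycle, threshold=1000):
    # Each PE decides from its own local workload in the current
    # simulation cycle; the rail switch completes in under 100 ns,
    # negligible against the 0.1 ms or 1 ms simulation tick.
    if synaptic_events_this_cycle > threshold:
        return PL_HIGH_V
    return PL_LOW_V

print(choose_performance_level(5000))  # heavy cycle -> 0.60 V rail
```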
8.6 Summary
SpiNNaker2 represents the next step in the SpiNNaker story and brings us up to
date. The SpiNNaker2 chip has yet (at the time of writing) to be fabricated, but
will appear in 2020 and will form a new basis for the project for the future.
We have learnt a great deal in the 20 years that have gone into the project
so far, about neuromorphic computing of course, but also about building large
machines and making them reliable, about building large and complex software
stacks, and about the areas of research of our users and collaborators in neuro-
science and robotics. We have tried to capture those lessons, warts and all, in the
accounts given in this book by the many contributing writers.
The book ends here, but the story goes on – there is still a great deal more to be
learnt!
References
[29] Butz, M. and A. van Ooyen. 2013. “A simple rule for dendritic spine
and axonal bouton formation can account for cortical reorganization after
focal retinal lesions”. PLoS Computational Biology. 9(10): 39–43. ISSN:
1553734X. DOI: 10.1371/journal.pcbi.1003259.
[30] Buzsáki, G. and K. Mizuseki. 2014. “The log-dynamic brain: How
skewed distributions affect network operations”. Nature Reviews Neuro-
science. 15(4): 264–278. ISSN: 14710048. DOI: 10.1038/nrn3687. arXiv:
NIHMS150003.
[31] Camuñas-Mesa, L. A., Y. L. Domínguez-Cordero, A. Linares-Barranco,
T. Serrano-Gotarredona, and B. Linares-Barranco. 2018. “A configurable
event-driven convolutional node with rate saturation mechanism for mod-
ular ConvNet systems implementation”. Frontiers in Neuroscience. 12: 63.
ISSN: 1662-453X. DOI: 10.3389/fnins.2018.00063.
[32] Cao, Y., Y. Chen, and D. Khosla. 2015. “Spiking deep convolutional neu-
ral networks for energy-efficient object recognition”. International Journal of
Computer Vision. 113(1): 54–66.
[33] Carnevale, N. T. and M. L. Hines. 2006. The NEURON Book. Cambridge
University Press, New York, NY, USA. 1–457. ISBN: 9780511541612.
DOI: 10.1017/CBO9780511541612. arXiv:1011.1669v3.
[34] Carter, R., J. Mazurier, L. Pirro, J. Sachse, P. Baars, J. Faul, C. Grass,
G. Grasshoff, P. Javorka, T. Kammler, A. Preusse, S. Nielsen, T. Heller,
J. Schmidt, H. Niebojewski, P. Chou, E. Smith, E. Erben, C. Metze, C. Bao,
Y. Andee, I. Aydin, S. Morvan, J. Bernard, E. Bourjot, T. Feudel, D. Harame,
R. Nelluri, H.-J. Thees, L. M-Meskamp, J. Kluth, R. Mulfinger, M. Rashed,
R. Taylor, C. Weintraub, J. Hoentschel, M. Vinet, J. Schaeffer, and B. Rice.
2016. “22nm FDSOI technology for emerging mobile, Internet-of-Things,
and RF applications”. In: 2016 IEEE International Electron Devices Meeting
(IEDM). 2.2.1–2.2.4. DOI: 10.1109/IEDM.2016.7838029.
[35] Chaitin, G. J. 1982. “Register allocation & spilling via graph coloring”.
SIGPLAN Not. 17(6): 98–101. ISSN: 0362-1340. DOI: 10.1145/872726.
806984.
[36] Christian, B. and T. Griffiths. 2016. Algorithms to Live By: The Computer
Science of Human Decisions. New York, NY, USA: Henry Holt and Co., Inc.
ISBN: 9781627790369.
[37] Clopath, C., L. Büsing, E. Vasilaki, and W. Gerstner. 2010. “Connectivity
reflects coding: a model of voltage-based STDP with homeostasis.” Nature
Neuroscience. 13(3): 344–52. ISSN: 1546-1726. DOI: 10.1038/nn.2479.
[38] Colbourn, C. 1984. “The complexity of completing partial Latin squares”.
English (US). Discrete Applied Mathematics. 8(1): 25–30. ISSN: 0166-218X.
DOI: 10.1016/0166-218X(84)90075-1.
[49] Destexhe, A., Z. F. Mainen, and T. J. Sejnowski. 2002. “Kinetic models for
synaptic interactions”. The Handbook of Brain Theory and Neural Networks
(2nd ed): 1126–1130.
[50] Diehl, P. U. and M. Cook. 2014. “Efficient implementation of STDP rules
on SpiNNaker neuromorphic hardware”. In: Proceedings of the International
Joint Conference on Neural Networks. Beijing, China. 4288–4295. ISBN:
9781479914845. DOI: 10.1109/IJCNN.2014.6889876.
[51] Diehl, P. U., D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer. 2015.
“Fast-classifying, high-accuracy spiking deep networks through weight and
threshold balancing”. In: Neural Networks (IJCNN), 2015 International Joint
Conference on. IEEE. 1–8.
[52] Dominguez-Morales, J. P., Q. Liu, R. James, D. Gutierrez-Galan, A.
Jimenez-Fernandez, S. Davidson, and S. Furber. 2018. “Deep Spiking Neu-
ral Network model for time-variant signals classification: a real-time speech
recognition approach”. In: 2018 International Joint Conference on Neural
Networks (IJCNN). IEEE. 1–8.
[53] Eliasmith, C. 2013. How to Build a Brain: A Neural Architecture for Biological
Cognition. Oxford University Press.
[54] Eliasmith, C., T. C. Stewart, X. Choo, T. Bekolay, T. DeWolf, Y. Tang, and
D. Rasmussen. 2012. “A large-scale model of the functioning brain”. Science.
338(6111): 1202–1205. DOI: 10.1126/science.1225266.
[55] Elliott, T. and N. R. Shadbolt. 1999. “A neurotrophic model of the devel-
opment of the retinogeniculocortical pathway induced by spontaneous reti-
nal waves.” Journal of Neuroscience. 19(18): 7951–7970. ISSN: 1529-2401.
URL: https://www.jneurosci.org/content/19/18/7951.
[56] Ercegovac, M. and T. Lang. 2004. Digital Arithmetic. Morgan Kaufmann
Series in Comp. Morgan Kaufmann. ISBN: 9781558607989. URL: https://books.google.co.uk/books?id=uUk_AQAAIAAJ.
[57] Ercsey-Ravasz, M. and Z. Toroczkai. 2012. “The chaos within Sudoku”.
Scientific Reports. 2: 725.
[58] Euler, T., S. Haverkamp, T. Schubert, and T. Baden. 2014. “Retinal bipo-
lar cells: elementary building blocks of vision”. Nature Reviews Neuroscience.
15(8): 507–519.
[59] Fisher, S. D., P. B. Robertson, M. J. Black, P. Redgrave, M. A. Sagar, W. C.
Abraham, and J. N. Reynolds. 2017. “Reinforcement determines the timing
dependence of corticostriatal synaptic plasticity in vivo”. Nature Communi-
cations. 8(1). ISSN: 20411723. DOI: 10.1038/s41467-017-00394-x.
[146] Liu, Q. 2018. “Deep spiking neural networks”. PhD Thesis. University of
Manchester. 212.
[147] Liu, Q. and S. Furber. 2016. “Noisy Softplus: a biology inspired activa-
tion function”. In: International Conference on Neural Information Processing.
Springer. 405–412.
[148] Liu, Q., G. Pineda-García, E. Stromatias, T. Serrano-Gotarredona, and
S. B. Furber. 2016. “Benchmarking spike-based visual recognition: a dataset
and evaluation”. Frontiers in Neuroscience. 10.
[149] Liu, Y. H. and X. J. Wang. 2001. “Spike-frequency adaptation of a general-
ized leaky integrate-and-fire model neuron”. Journal of Computational Neuro-
science. 10(1): 25–45. ISSN: 09295313. DOI: 10.1023/A:1008916026143.
[150] Lopez-Poveda, E. A. and R. Meddis. 2001. “A human nonlinear cochlear
filterbank”. The Journal of the Acoustical Society of America. 110(6):
3107–3118.
[151] Lowe, D. G. 1999. “Object recognition from local scale-invariant features”.
In: The Proceedings of the Seventh IEEE International Conference on Computer
Vision, 1999. Vol. 2. IEEE. 1150–1157.
[152] Mahowald, M. 1992. “VLSI analogs of neuronal visual processing: A syn-
thesis of form and function”. Technology. 1992(5): 236. URL: http://caltechcstr.library.caltech.edu/591/.
[153] Malmierca, M. S., L. A. Anderson, and F. M. Antunes. 2015. “The corti-
cal modulation of stimulus-specific adaptation in the auditory midbrain and
thalamus: a potential neuronal correlate for predictive coding”. Frontiers in
Systems Neuroscience. 9.
[154] Markram, H., J. Lubke, M. Frotscher, and B. Sakmann. 1997. “Regula-
tion of synaptic efficacy by coincidence of postsynaptic APs and EPSPs”.
Science. 275(5297): 213–215. ISSN: 00368075. DOI: 10.1126/science.
275.5297.213.
[155] Marsaglia, G. and A. Zaman. 1993. “The KISS generator”. Tech. rep.,
Department of Statistics, University of Florida.
[156] Mazaris, D. 1997. “The reality of patch-cord management”. Cabling Instal-
lation & Maintenance. Feb.
[157] Mazurek, M., M. Kager, and S. D. V. Hooser. 2014. “Robust quantifica-
tion of orientation selectivity and direction selectivity”. Frontiers in Neural
Circuits. DOI: 10.3389/fncir.2014.00092.
[158] Mead, C. 1989. Analog VLSI and Neural Systems. Boston, MA, USA:
Addison-Wesley Longman Publishing Co., Inc. ISBN: 0-201-05992-4.
[159] Meddis, R., W. Lecluyse, N. R. Clark, T. Jürgens, C. M. Tan, M. R. Panda,
and G. J. Brown. 2013. “A computer model of the auditory periphery and
its application to the study of hearing”. In: Basic Aspects of Hearing. Springer.
11–20.
[169] Moore, S., P. Fox, S. Marsh, A. Markettos, and A. Mujumdar. 2012. “Blue-
hive – A field-programmable custom computing machine for extreme-scale
real-time neural network simulation”. In: 2012 IEEE 20th Annual Inter-
national Symposium on Field-Programmable Custom Computing Machines
(FCCM), 133–140. DOI: 10.1109/FCCM.2012.32.
[170] Morrison, A., A. D. Aertsen, M. Diesmann, A. Morrison, and M. Diesmann.
2007. “Spike-timing-dependent plasticity in balanced random networks”.
Neural Computation Massachusetts Institute of Technology. 19: 1437–1467.
ISSN: 0899-7667. DOI: 10.1162/neco.2007.19.6.1437.
[171] Morrison, A., M. Diesmann, and W. Gerstner. 2008. “Phenomenological
models of synaptic plasticity based on spike timing”. Biological Cybernetics.
98(6): 459–478. ISSN: 03401200. DOI: 10.1007/s00422-008-0233-1.
[172] Morrison, A., C. Mehring, T. Geisel, A. D. Aertsen, and M. Diesmann.
2005. “Advancing the boundaries of high-connectivity network simula-
tion with distributed computing.” Neural Computation. 17(8): 1776–1801.
ISSN: 0899-7667. DOI: 10.1162/0899766054026648.
[173] Muller, J.-M. 2016. Elementary Functions – Algorithms and Implementation.
3rd ed. Birkhäuser Basel.
[174] Mundy, A., J. Heathcote, and J. D. Garside. 2016. “On-chip order-
exploiting routing table minimization for a multicast supercomputer
network”. IEEE International Conference on High Performance Switch-
ing and Routing, HPSR. 2016 July: 148–154. ISSN: 23255609. DOI:
10.1109/HPSR.2016.7525659.
[175] Natrella, M. 2010. “NIST/SEMATECH e-handbook of Statistical Meth-
ods”. Ed. by C. Croarkin and P. Tobias. https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc431.htm. (Accessed on 2018).
[176] Neil, D. and S.-C. Liu. 2014. “Minitaur, an event-driven FPGA-based
spiking network accelerator”. IEEE Transactions on Very Large Scale Inte-
gration (VLSI) Systems. 22(12): 2621–2628. ISSN: 1063-8210. DOI:
10.1109/TVLSI.2013.2294916.
[177] Neumärker, F., S. Höppner, A. Dixius, and C. Mayr. 2016. “True
random number generation from bang-bang ADPLL jitter”. In: 2016
IEEE Nordic Circuits and Systems Conference (NORCAS). 1–5. DOI:
10.1109/NORCHIP.2016.7792875.
[178] Institute of Neuroinformatics, University of Zürich. 2007. “jAER: Java tools for Address-Event Representation (AER) neuromorphic vision and audio sensor processing”. URL: https://github.com/SensorsINI/jaer (accessed 2018).
[189] Perea, G., M. Navarrete, and A. Araque. 2009. “Tripartite synapses: Astro-
cytes process and control synaptic information”. Trends in Neurosciences.
32(8): 421–431. ISSN: 0166-2236. DOI: 10.1016/j.tins.2009.05.001.
[190] Pérez-Carrasco, J. A., B. Zhao, C. Serrano, B. Acha, T. Serrano-Gotarredona,
S. Chen, and B. Linares-Barranco. 2013. “Mapping from frame-driven to
frame-free event-driven vision systems by low-rate rate coding and coin-
cidence processing–application to feedforward ConvNets”. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence. 35(11): 2706–2719. ISSN:
0162-8828. DOI: 10.1109/TPAMI.2013.71.
[191] Pfeil, T., T. Potjans, S. Schrader, W. Potjans, J. Schemmel, M. Diesmann,
and K. Meier. 2012. “Is a 4-bit synaptic weight resolution enough? –
constraints on enabling spike-timing dependent plasticity in neuromor-
phic hardware”. Frontiers in Neuroscience. 6: 90. ISSN: 1662-453X. DOI:
10.3389/fnins.2012.00090.
[192] Pfister, J.-P. and W. Gerstner. 2006. “Triplets of spikes in a model of spike timing-dependent plasticity”. Journal of Neuroscience. 26(38): 9673–9682. ISSN: 0270-6474. DOI: 10.1523/JNEUROSCI.1425-06.2006.
[193] Pineda-García, G. 2019. “A Visual Pipeline Using Networks of Spiking
Neurons”. PhD Thesis. The University of Manchester. 166.
[194] Pineda-García, G., P. Camilleri, Q. Liu, and S. Furber. 2016. “pyDVS: An
extensible, real-time Dynamic Vision Sensor emulator using off-the-shelf
hardware”. In: IEEE Symposium Series on Computational Intelligence, SSCI.
ISBN: 9781509042401. DOI: 10.1109/SSCI.2016.7850249.
[195] Plana, L. A., D. Clark, S. Davidson, S. Furber, J. Garside, E. Painkras, J. Pepper, S. Temple, and J. Bainbridge. 2011. “SpiNNaker: Design and Implementation of a GALS Multicore System-on-Chip”. ACM Journal on Emerging Technologies in Computing Systems. 7(4): 17:1–17:18.
[196] Plana, L. A., S. B. Furber, S. Temple, M. Khan, Y. Shi, J. Wu, and S. Yang.
2007. “A GALS infrastructure for a massively parallel multiprocessor”.
IEEE Design Test of Computers. 24(5): 454–463. ISSN: 0740-7475. DOI:
10.1109/MDT.2007.149.
[197] Plana, L. A., J. Heathcote, J. S. Pepper, S. Davidson, J. Garside, S. Temple,
and S. B. Furber. 2014. “spI/O: A library of FPGA designs and reusable
modules for I/O in SpiNNaker systems”. DOI: 10.5281/zenodo.51476.
[198] Plana, L. A. 2017. “Interfacing AER devices to SpiNNaker using an FPGA”. Tech. Rep. SpiNNaker Application Note 8. University of Manchester. URL: http://spinnakermanchester.github.io/docs/spinn-app-8.pdf.
Index
DTCM, xvi, 27, 29, 82, 88, 106–108, 112–114, 122, 123, 165, 166, 214, 217
DVFS, xvi, 279
DVS, xvi, 55, 57, 164, 166, 170, 171
Dynamic Vision Sensor, see DVS
EA, xvi, 255–257, 260, 261
edge detection, 135
EIEIO packet, 98
EIEIO protocol, 97
elementary motion decomposition, 238, 249
eligibility trace, 206, 232, 233, 235
emergency routeing, 34, 35, 44, 49
energy consumption, 141
energy consumption, auditory model, 141
energy efficiency, 74, 261, 277
Engineering, 147
entropy, 155, 156, 158, 251–254, 273
entropy extraction, 273
Entscheidungsproblem, 3
EoP, xvi, 35–38
EPSRC, xii, xvi, 5, 16
error backpropagation, 261
error correction, 155
error detection, 155
ES, xvi, 260
escargot, 155
Eta Compute, 263
Ethernet, 29, 30, 42, 43, 55–57, 59, 60, 67, 79, 80, 82, 83, 95–97, 99, 105, 122, 257, 265
EU, 76
Event Camera, 55
event-based, 219
event-based computation, 133
event-based operating system, see SpiNNaker, SpiN1API, 108, 118, 122
event-based processing, 133
event-driven, 42, 55, 56, 105, 107, 140, 165, 178, 213, 234, 249
event-driven clustering algorithm, 164
event-driven computation, 234
event-driven library, 105, 107
excitatory projection, 104
executable binary, 86
exhibition, 132
exponential, 150, 179, 208, 226, 274
exponential decay, 274
exponential function, 208
F-score, 252
fabrication cost, 50
Fast Interrupt Requests, see FIQ
fault, 37
fault tolerance, 79
FDSOI, 265, 277
feature map, 163, 169
Feature maps, 165
feedback, 132, 143
ferromagnetic, 158
ferromagnetic lattice, 159
field-effect transistors, xv
Field-Programmable Gate Array, see FPGA
FIFO, xvi, 65, 267
filter bank, 139
FIQ, xvi, 30, 81, 108, 112, 118
FIQ thread, 108, 112
firing rate, 180, 191, 194, 198, 199
fixed simulation timestep, 109
fixed-point datatypes, 121
Flash, xvi
flash, 55
Flash memory, 55
flit, xvi, 35–38
FPGA, xiv, xvi, xxiii, 55, 57–60, 62, 63, 65, 66, 74, 178
FPU, xvi, 172, 270
frequency, 138, 139
fully connected layer, 164
Contributing Authors