

FireFly: A High-Throughput Hardware Accelerator for Spiking Neural Networks With Efficient DSP and Memory Optimization

Jindong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang, and Yi Zeng

Abstract— Spiking neural networks (SNNs) have been widely used due to their strong biological interpretability and high energy efficiency. With the introduction of the backpropagation algorithm and surrogate gradient, the structure of SNNs has become more complex, and the performance gap with artificial neural networks (ANNs) has gradually decreased. However, most SNN hardware implementations for field-programmable gate arrays (FPGAs) cannot meet arithmetic or memory efficiency requirements, which significantly restricts the development of SNNs. They do not delve into the arithmetic operations between the binary spikes and synaptic weights or assume unlimited on-chip RAM resources using overly expensive devices on small tasks. To improve arithmetic efficiency, we analyze the neural dynamics of spiking neurons, generalize the SNN arithmetic operation to the multiplex-accumulate operation, and propose a high-performance implementation of such operation by utilizing the DSP48E2 hard block in Xilinx Ultrascale FPGAs. To improve memory efficiency, we design a memory system to enable efficient synaptic weights and membrane voltage memory access with reasonable on-chip RAM consumption. Combining the above two improvements, we propose an FPGA accelerator that can process spikes generated by the firing neurons on-the-fly (FireFly). FireFly is the first SNN accelerator that incorporates DSP optimization techniques into SNN synaptic operations. FireFly is implemented on several FPGA edge devices with limited resources but still guarantees a peak performance of 5.53 TOP/s at 300 MHz. As a lightweight accelerator, FireFly achieves the highest computational density efficiency compared with existing research using large FPGA devices.

Index Terms— Field-programmable gate array (FPGA), hardware accelerator, spiking neural networks (SNNs).

Manuscript received 8 January 2023; revised 16 April 2023; accepted 15 May 2023. Date of publication 5 June 2023; date of current version 26 July 2023. This work was supported in part by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDB32070100 and in part by the Chinese Academy of Sciences Foundation Frontier Scientific Research Program under Grant ZDBS-LY-JSC013. (Corresponding authors: Qian Zhang; Yi Zeng.) Jindong Li and Qian Zhang are with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China, and also with the Brain-Inspired Cognitive Intelligence Laboratory, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. Guobin Shen is with the School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China, and also with the Brain-Inspired Cognitive Intelligence Laboratory, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. Dongcheng Zhao is with the Brain-Inspired Cognitive Intelligence Laboratory, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. Yi Zeng is with the Brain-Inspired Cognitive Intelligence Laboratory, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China, and also with the Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai 200031, China. Digital Object Identifier 10.1109/TVLSI.2023.3279349

I. INTRODUCTION

SPIKING neural networks (SNNs) are considered as the third generation of artificial neural networks (ANNs) [1]. They were developed to mimic the operational mechanism in the human brain, where information is communicated via spikes among neurons. Recent advances in SNNs have demonstrated comparable performance to nonspiking ANNs [2], [3]. However, compared to the extensive work on ANN accelerators [4], [5], [6], the existing SNN hardware accelerator still lags, limiting the practical applications of SNNs.

Most research ignores the importance of efficiently implementing arithmetic operations in SNN accelerators. In field-programmable gate array (FPGA) design, using the built-in dedicated hard block to implement arithmetic operations can achieve considerably higher performance than its general logic fabric counterparts. Fabric-only implementations in an arithmetic-extensive application can lead to a compromised clock frequency and even routing failures when the fabric consumption is high. However, in the SNN accelerator design, the register transfer-level (RTL) description of the SNN arithmetic operation cannot be automatically synthesized into the dedicated arithmetic hard block. Therefore, most SNN accelerators adopt the fabric-only implementation without further optimizations. Although a single arithmetic operation unit in an SNN accelerator consumes considerably fewer resources than a multiply-accumulate (MAC) unit in an ANN accelerator design, hardware optimization of such operation can still significantly impact the system's performance when the unit is instantiated hundreds or even thousands of times. In the Xilinx Ultrascale FPGA, the dedicated arithmetic hard block, or the DSP48E2, enhances the speed and efficiency of many operations, including multiplication, addition, wide bus multiplexing, pattern detection, and single instruction multiple data (SIMD) operations. It is possible to generalize the SNN computation to the arithmetic operations that the DSP48E2 can provide.

Another important aspect of the SNN accelerator design is the memory system. When scaling the parallelism, the memory
bandwidth imbalance between the binary input–output spikes, the multibit synaptic weights and the multibit membrane voltage becomes problematic. While the computational complexity and the memory footprint of the binary spikes decrease, the memory access requirements of synaptic weights and membrane voltage do not. The off-chip memory access bandwidth needed by the weights and membrane voltage cannot fully support the increased parallelism brought by the hardware-friendly synaptic operations and storage-friendly binary spikes without further exploration of the reuse mechanism. Most hardware accelerators assume large on-chip memory, store all the synaptic weights, and accumulate membrane voltage on-chip to ease the harsh bandwidth requirement. This method is not scalable, especially when the model gets larger and targets edge FPGA devices. A scalable memory system for synaptic weights and membrane voltage balancing, as well as off-chip data access and on-chip data buffering, should be developed.

At present, most existing neuromorphic hardware or accelerators are inefficient in terms of resource utilization, computational density, and scalability. In real-world SNN applications, it is not feasible to use overly expensive and large FPGA devices. A lightweight and high-performance SNN accelerator targeting resource-constrained edge scenarios should be developed. Focusing on these aspects, we propose FireFly, a high-throughput and reconfigurable FPGA accelerator that can process spikes generated by the firing neurons on-the-fly, achieving both arithmetic and memory efficiency. Our contributions can be summarized as follows.

1) We generalize the SNN arithmetic operation to the multiplex-accumulate operation and propose a high-performance implementation of such an operation by utilizing the DSP48E2 hard block in Xilinx Ultrascale FPGAs.
2) We design a synaptic weight delivery hierarchy and a partial sum and membrane voltage (Psum-Vmem) unified buffer to balance the off-chip memory access bandwidth and on-chip RAM consumption.
3) We evaluate multiple deep SNN models on various datasets and achieve faster inference speed and higher classification accuracy than the existing research. We implement FireFly on several commercial off-the-shelf FPGA edge devices with limited resources, bringing hope for real-world SNN applications in edge scenarios.

II. RELATED WORK

The existing dedicated neuromorphic hardware designed for SNN can be categorized into four types.

The majority of neuromorphic hardware constructs its hardware substrates in a Network on Chip (NoC) fashion. SpiNNaker [7], Loihi [8], and TrueNorth [9] fall into this category. In these hardware designs, neurons are grouped into multiple neurocores, which communicate via spikes through the NoC, and spike messages are scheduled by dedicated routers. These hardware architectures are compatible with the event-driven nature of SNNs, as spike events are generated, transferred, and processed only if the neuron fires. However, these neuromorphic hardware designs place rigid restrictions on the network. The SNN networks are distributed among the neurocores, and the total number of neurons in the model cannot exceed the maximum capacity of the hardware, not to mention the harsh fan-in and fan-out hardware limitations of the network.

The second type of neuromorphic hardware explores emerging devices. BrainScaleS [10], developed by Heidelberg University, emulated SNNs on analog neuromorphic hardware and achieved several advantages over conventional computers. Some research explores new materials like memristors and optics [11], [12]. However, the low precision and uncertain nature of the hardware prevent them from being used in practice.

The third type of neuromorphic hardware follows the scheme of the ANN accelerator design except for constructing dedicated hardware for synaptic operations and explores optimal dataflow for SNNs specifically [13], [14], [15], [16]. These types of work require less area cost and achieve higher computing resource utilization. Fine-grained parallelism of the accelerator can enable high-performance computing of the SNN compared with the sequential spike processing mechanism of the NoC counterparts. This type of hardware has the fewest restrictions on the network models and can quickly adapt to emerging neuromorphic research. FPGA platforms are the ideal choice for this type of hardware due to their flexibility and reconfigurability.

The three types of neuromorphic hardware designs listed above have a general hardware architecture that can adapt to different types of networks, whereas the fourth type of neuromorphic hardware is tailored to particular networks [17], [18], [19], [20]. Park et al. [19] build an on-chip learning system tailored for a two-layer SNN using direct spike-only feedback. Chuang et al. [20] introduce a low-power 90-nm CMOS binary weight SNN ASIC for real-time image classification. Panchapakesan et al. [17] and Aung et al. [18] target FPGA devices and design inference engines for specific neural networks (NNs). Although these hardware designs can achieve high energy efficiency and inference speed, they are limited in their practicality for deep and large SNNs due to their linear expansion in power and area as network size increases. Moreover, because ASIC designs lack reconfigurability, hardware specifically designed for one network may not be adaptable to other network configurations.

While FireFly belongs to the third category, FireFly's contributions are largely complementary to the existing work. SyncNN [17] proposed a novel synchronous event-driven SNN reconfigurable inference engine and evaluated multiple SNN models on multiple FPGA devices. Fang et al. [21] proposed a holistic optimization framework for the encoder, model, and architecture design of FPGA-based neuromorphic hardware. However, these designs are based on high-level synthesis, thus inducing large resource redundancy. Lee et al. [22] and Chen et al. [23] explored spatial–temporal parallelism by unrolling the computations in both the spatial and time dimensions and achieved significant acceleration. However, parallelization across multiple time points violates the time-related
sequential nature of the membrane voltage update behavior. SpinalFlow [15] achieved significant sparsity acceleration by adopting a different input–output spike representation to skip the nonspike computations. SATO [16] achieved high-speed inference by incorporating a temporal-oriented dataflow and a bucket-sort-based dispatcher to balance the workload. However, these techniques only work for temporal coding SNNs, limiting the accuracy of the SNN models. DeepFire [18] was the first research migrating DSP48E2s into neuron core design. However, they did not delve into the function of DSP48E2 and still induce large fabric overhead.

We argue that with careful RTL design, focusing on optimizing spatial parallelism on FPGA, adopting regular and simple time-step convolutional NN (CNN)-like processing, and fully utilizing the multifunction DSP48E2, we can still achieve impressive inference throughput on small FPGA edge devices. FireFly is more applicable in real-world applications where design space exploration is constrained by limited resources.

III. SNN BASICS

A. Spiking Neuron Model

Spiking neurons are the basic units of SNNs, which are connected through weighted synapses and transmit information through binary spikes. Although more complex and detailed neuron models such as Izhikevich [24] and Hodgkin and Huxley [25] can accurately model a biological neuron's behavior, simpler models such as integrate and fire (IF) [26] and leaky integrate and fire (LIF) [27] are used more often in current SNN applications.

An IF neuron integrates its inputs over multiple timesteps and generates a spike whenever the integrated membrane voltage surpasses a firing threshold. An LIF neuron acts the same except for the leaky behavior of the membrane voltage. The neural dynamics of an LIF neuron membrane potential u can be described as

\tau_m \frac{du}{dt} = -u + R \cdot I(t), \quad u < V_{th} \quad (1)

where V_{th} denotes the threshold, I denotes the input current, R denotes the resistance, and \tau_m is the membrane time constant. A spike is generated when u reaches V_{th} and u is reset to resting potential u_{rest}, which is set to 0 in this work. The membrane potential's neural dynamics can be divided into three phases, and each phase can be described in a discrete computational form.

1) Input Current Integration Phase: All the presynaptic currents generated by the presynaptic spikes are integrated at each discrete timestep

I[t] = \sum_{j} w_{ij} s_j[t] + b_i \quad (2)

where the subscript i represents the ith neuron, w_{ij} is the synaptic weight from neuron j to neuron i, and b_i is a bias.

2) Membrane Potential Update Phase: The membrane potential of each neuron is updated by the integrated presynaptic currents at each timestep

v_i[t] = \left(1 - \frac{1}{\tau_m}\right) u_i[t] + I[t] \quad (3)

where [1 - (1/\tau_m)] < 1 denotes the leaky term, which is ignored when using the IF model.

3) Output Spike Generation Phase: Whenever the membrane potential reaches the firing threshold, the neuron generates an output spike and resets its membrane potential

(u_i[t+1], s_i[t+1]) = \begin{cases} (v_i[t], 0), & v_i[t] < V_{th} \\ (0, 1), & v_i[t] \geq V_{th} \end{cases} \quad (4)

In these three phases, we have two key observations. The input current integration phase completely dominates the total computational cost due to the high degree of synaptic connectivity and a large number of neurons. The membrane potential update phase has the harshest storage requirement because the membrane potential is read and written back and forth in every timestep. We will focus on these two aspects in Section IV.
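The three phases above map directly onto a few lines of code. The following is a minimal Python sketch of the discrete-time update in (2)-(4) for one layer of IF/LIF neurons; it is illustrative only, the function name `lif_step` is ours, and the reset-to-zero behavior follows (4).

```python
import numpy as np

def lif_step(u, s_in, w, b, tau_m=2.0, v_th=0.5, leaky=True):
    """One discrete timestep for a layer of IF/LIF neurons.

    u    : (N,)   membrane potentials carried over from the previous timestep
    s_in : (M,)   binary input spikes at this timestep
    w    : (N, M) synaptic weights, b : (N,) bias
    Returns the updated membrane potentials and the output spikes.
    """
    i_t = w @ s_in + b                                        # (2) input current integration
    v = (1.0 - 1.0 / tau_m) * u + i_t if leaky else u + i_t   # (3) update (IF skips the leak)
    s_out = (v >= v_th).astype(np.int8)                       # (4) fire when threshold reached
    u_next = np.where(s_out == 1, 0.0, v)                     # (4) reset fired neurons
    return u_next, s_out

# toy usage: 4 neurons, 6 presynaptic inputs, 4 timesteps
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 6)); b = np.zeros(4); u = np.zeros(4)
for t in range(4):
    u, s = lif_step(u, rng.integers(0, 2, size=6), w, b)
    print(t, s)
```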
B. Dataflow and Parallelism Scheme for SCNN

Similar to CNNs, convolutional layers dominate the total computational cost in spiking CNNs (SCNNs). We mainly focus on the dataflow optimizations of the convolutional layers and show that the dataflow can be migrated to fully connected layers.

Input–output spike representation varies in different neuromorphic hardware. Most SNN hardware implementations adopt the address-event-representation (AER) data format to transmit spikes between neurons. The standard AER package for one spike includes the spiking neuron's input location and the spike's timestamp. Although the AER data format is compatible with the event-driven nature of SNNs, multiple bits are needed to express the original single-bit spike event. The logic and storage overhead may not be worth it.

This article adopts the original single-bit format to represent the binary spikes. At any discrete timestep t in the digitalized SCNN, the output spikes of all the neurons in one channel of the convolutional layer can be considered a timestep snapshot in the form of a binary map [28]. In this case, the input-current integration phase computation process of the SNNs is almost the same as that of the traditional ANNs except for the additional time dimension and the changed operation. The set of computations for the complete SNN convolutional layer that receives a single batch of input can be formulated as a loop nest over seven variables (the timestep, the output and input channels, the output spatial position, and the kernel position). All permutations of these loop variables, except for the timestep variable, are legal. Permutations of the loop variables open up the possibility of different dataflow choices. The tiling of the loop variables opens up the possibility of different parallelism schemes.

Different permutations of the loop variables adopt different kinds of dataflow. Different dataflow schemes for convolution have been extensively studied by Eyeriss [4]. The key consideration is how to minimize data movement and maximize data reuse. In SCNN, synaptic connection weights need to be fetched and membrane voltage needs to be updated at every timestep, due to the unique time dimension in SNN computation. Therefore, output and weight stationary (OS and WS) dataflow can minimize the data movement of the multibit membrane voltage and synaptic weight data between on-chip logic and off-chip memory.
Algorithm 1 Pseudo Code of Scheduling a Single Convolutional Layer in FireFly Architecture

Different tiling strategies for the loop variables enable different parallelism schemes. The tiling of the loop variables can induce data reordering or data segmentation. We argue that it is important to keep the input and output spike arrangements the same to enable spikes to be processed in an on-the-fly fashion without complicated data rearrangement. We chose the spatial tiling of the input and output channel dimensions rather than tiling within the same spike feature map to avoid data rearranging or irregular off-chip data access.

Adopting the dataflow and parallelism scheme above, the pseudo-code of scheduling a single convolutional layer in FireFly architecture is described in Algorithm 1. Given an input spike tensor in shape T × Cin × H × W, a weight tensor in shape Cout × Cin × Kh × Kw, and an output spike tensor in shape T × Cout × H × W assuming same padding and stride of one, where T denotes the timestep, (Cout, Cin) denotes the output–input channels, and (H, W) denotes the size of spike maps. We adopt channel tiling in output and input channels with the parallelism factor P and flatten the spike map to a 1-D data stream of length L = H × W, yielding T × ci fragments of input spike stream with P spike channels and T × co fragments of output spike stream with P spike channels, where ci = ⌈Cin/P⌉ and co = ⌈Cout/P⌉. FireFly receives P channels of input spike stream, performs a P × P spike map convolution and generates P channels of output partial sum. The P × P convolutions are unrolled spatially, while T, co, and ci are folded in time, reusing the same hardware substrates. Any permutation of the loop variables T, co and ci is legal. We iterate co over T over ci, adopting a WS dataflow as discussed above. In this way, ci fragments of input spike stream pass the hardware for co times at each timestep. The calculation of IF/LIF neural dynamics is performed when the last fragment of the input spike stream flows through. The membrane voltage is cleared when the integration and spike generation process is done for all timesteps.
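The listing of Algorithm 1 is not reproduced in this text-only version, so the following Python sketch restates the schedule it describes under the stated assumptions (channel tiling by P, loop order co over T over ci, weight-stationary reuse). The callables `load_weights`, `read_spike_tile`, `systolic_conv_tile`, `accumulate_psum`, and `update_neurons` are placeholders for the hardware blocks detailed in Section IV, not names from the original work.

```python
import math

def schedule_conv_layer(T, C_in, C_out, P,
                        load_weights, read_spike_tile, systolic_conv_tile,
                        accumulate_psum, update_neurons):
    """Behavioral sketch of FireFly's per-layer schedule (in the spirit of Algorithm 1).

    The P x P spike-map convolutions are unrolled spatially in hardware;
    the co, T and ci loops below are folded in time on the same substrate.
    """
    ci = math.ceil(C_in / P)            # input-channel tiles
    co = math.ceil(C_out / P)           # output-channel tiles
    for oc in range(co):                # weight-stationary: weights reloaded only per oc
        weights = load_weights(oc)      # P x P x Kh x Kw weight block stays cached
        for t in range(T):              # the time dimension runs sequentially
            for ic in range(ci):
                spikes = read_spike_tile(t, ic)        # P channels, flattened H*W stream
                psum = systolic_conv_tile(weights, spikes, ic)
                accumulate_psum(oc, psum)              # Psum-Vmem unified buffer
                if ic == ci - 1:                       # last input tile of this timestep:
                    # IF/LIF dynamics and spike generation; Vmem cleared after the last t
                    update_neurons(oc, t, last_timestep=(t == T - 1))
```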

IV. HARDWARE ARCHITECTURE

A. Architecture Overview

In this section, the digital design of SNNs is discussed in detail. Fig. 1 shows the overall system design of FireFly. FireFly targets heterogeneous Zynq Ultrascale devices. The central processing unit (CPU) of the processing system (PS) acts as the controller for system state control and external memory access. The programmable logic (PL) accelerates the SNN inference. AXI DataMover IP, instead of AXI DMA IP, enables high-throughput and low-latency data transactions between the off-chip DRAM and on-chip memory storage. The unique store and forward feature of AXI DataMover is enabled to allow multiple outstanding requests.

The WS systolic array is responsible for the acceleration of SNN arithmetic operations. The systolic array consists of several DSP48E2 chains and multiple adder trees. A weight matrix delivery hierarchy is proposed to enable efficient weight loading to the systolic array. Two separate datapaths for convolutional and fully connected layers are designed to generate binary spike vectors for the systolic array. A Psum-Vmem unified buffer and update engine is constructed to support back-and-forth membrane potential update and IF/LIF neuron dynamics. An optional MaxPooling unit is placed on the output spike datapath to support on-the-fly pooling.

The designs of the systolic array, the spike vector generation unit, the synaptic weight delivery hierarchy, and the Psum-Vmem unified buffer are elaborated in detail below.

B. Synaptic Crossbar Computation Featured by DSP48E2s

Fig. 2(a) shows an all-to-all, fully connected connection topology between eight presynaptic neurons and eight postsynaptic neurons. The axons of presynaptic neurons and the dendrites of the postsynaptic neurons are crossed, forming an 8 × 8 synapse matrix.
Fig. 1. Architecture of FireFly.

Fig. 2. Synaptic crossbar computation. (a) 8 × 8 synaptic crossbar. (b) Equivalent digital circuit of the 8 × 8 synaptic crossbar. (c) Implementation of a 2 × 4 synaptic crossbar by a single DSP48E2 slice. (d) Systolic array with 4 × 4 PEs. A PE is a DSP48E2 cascaded chain of length 8.

The input spikes from the axon of a presynaptic neuron are broadcasted through a row of the crossbar. The dendrite integrates the input spikes along the column of the crossbar.

The synaptic operation happens at every crossing point of the synaptic crossbar, represented by a black dot. Mathematically, the synaptic operation consists of a dot product between the binary spike and the synaptic weight and an addition accumulating the synaptic current propagated along the dendrites. Such an operation can be implemented by a multiplexer and an adder. The spike acts as the control signal of the multiplexer, switching the synaptic weight ON or OFF when the neuron is firing or resting. The adder sums up the result from the multiplexer and the result coming from the cross point above. Fig. 2(b) shows the equivalent digital circuit to the connection topology shown in Fig. 2(a). In ASIC design, the RTL description of the synaptic operation is synthesized into standard cells of a certain technology library. In FPGA design, the RTL description of the synaptic operation is automatically synthesized into LUTs and FFs. However, we show that the multiplex-accumulate operation existing in the crossbar computation can be manually mapped to the dedicated DSP hard block in Xilinx FPGA, leading to significant improvements in resource efficiency and clock rate.

DSP48E2 is the dedicated digital signal processing logic block in the Xilinx Ultrascale series FPGA. Most FPGA neuromorphic hardware simply treats them as multipliers and leaves them underutilized. However, they enhance the speed and efficiency of many applications far beyond multiplication-based digital signal processing [29]. In this section, we show that a single DSP48E2 slice can support a 2 × 4 synaptic crossbar computation, and up to eight DSP48E2 slices can be cascaded in a chain to support a 16 × 4 crossbar computation without numeric overflow. By instantiating multiple DSP48E2 cascaded chains and arranging them in a 2-D matrix, we can construct a systolic array supporting larger synaptic crossbar computation. The detailed implementation of this method is demonstrated below.

The DSP48E2 slice consists of four pipeline stages for input ports A, B, C, and D, a 27-bit preadder, a 27 × 18 multiplier, four 48-bit wide-bus multiplexers named W, X, Y, and Z, and a flexible 48-bit ALU. A 5-bit INMODE port sets the configuration of the input pipeline stages and the 27-bit
preadder. A 9-bit OPMODE port controls the select signals of the W, X, Y, and Z multiplexers. A 4-bit ALUMODE port controls the functionality of the 48-bit ALU. In FireFly, we fully utilize the four 48-bit wide-bus multiplexers, dynamic control of the OPMODE, and the SIMD mode of the 48-bit ALU to implement the crossbar computation.

The static configuration of the DSP48E2 is set as below: the 27-bit preadder and the 27 × 18 multiplier are disabled. The 4-bit ALUMODE port is set to 4'b0000 so that the ALU unit will perform the add operation. The 5-bit INMODE port is set to 5'b10001 so that data ports A and B are registered once. Data port C is registered once. Data port D is left unused. All the carry inputs are ignored. The 48-bit ALU unit is configured into SIMD mode, supporting four independent 12-bit additions. Direct access to these specific configurations in DSP48 is achieved by directly instantiating the DSP48E2 primitive. In this way, the outputs of the four 48-bit multiplexers W, X, Y, and Z are split into four 12-bit fields, respectively. The 48-bit ALU unit acts as four independent 12-bit adders summing up each field of the four multiplexers.

The dynamic configuration of the DSP48E2 involves changing the OPMODE at runtime to switch the multiplexers to different inputs. There are dozens of combinations of inputs to these multiplexers; one of them can be: either C or all 0s on the W multiplexer; either A:B or all 0s on the X multiplexer; all 0s on the Y multiplexer; PCIN on the Z multiplexer, where PCIN is the output of a lower DSP slice, cascaded into the current DSP slice.

Adopting the static and dynamic configuration of the DSP48E2 described above, the synaptic crossbar computation can be efficiently implemented by three levels of DSP48E2 instantiation: a single DSP48E2 slice, a DSP48E2 chain, and a DSP48E2 systolic array.

1) Synaptic Crossbar Computation by a Single DSP48E2: The main idea of our approach is to bundle sets of synaptic weights, feed them to the DSP48E2 multiplexers, and switch the multiplexer with spikes, as illustrated in Fig. 2(c). In this work, the synaptic connection weights are quantized into INT8 by the well-established posttraining quantization or quantization-aware training methods developed in traditional NNs. Four INT8 weights are sign extended to INT12 and concatenated into 48 bits. The upper 30 bits are assigned to the input port A while the lower 18 bits are assigned to input port B. As shown in Fig. 2(c), w20, w21, w22, w23 are bundled and assigned to ports A and B. A and B are then concatenated and multiplexed by the X multiplexer. In SNNs, the input spikes are shared by different sets of weights through the axons, as shown in Fig. 2(a). In this case, spike s2 is fetched to dynamically switch the X multiplexer between the four weights (A:B) and all 0s. Similarly, another four INT8 weights, w30, w31, w32, w33, are sign extended, concatenated, and directly assigned to the C data input. Another spike, s3, is fetched to dynamically switch the W multiplexer between C and all 0s. The OPMODE is dynamically controlled by s2 and s3. The Y multiplexer outputs are set to all 0s. The Z multiplexer selects the PCIN inputs, the partial sum from the lower DSP slice. The results are staged into the P register and propagated to the upper DSP slice through the PCOUT data port. Therefore, the arithmetic function of a single DSP48E2 slice is equivalent to a 2 × 4 synaptic crossbar computation without general fabric logic overhead.
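As a sanity check on the mapping above, the following Python snippet models the arithmetic one DSP48E2 performs in this configuration: four INT8 weights on A:B gated by spike s2, four INT8 weights on C gated by s3, a SIMD addition of four independent 12-bit lanes, and accumulation of the partial sums cascaded in from the slice below (PCIN). This is a behavioral sketch under our own simplifications, not the RTL.

```python
def dsp48e2_2x4_crossbar(w_row2, w_row3, s2, s3, pcin=(0, 0, 0, 0)):
    """Behavioral model of one DSP48E2 configured as a 2 x 4 synaptic crossbar.

    w_row2, w_row3 : four INT8 weights each (the two crossbar rows)
    s2, s3         : the two input spikes (0 or 1) gating the X and W multiplexers
    pcin           : four partial sums cascaded from the lower DSP slice
    Returns four partial sums, one per postsynaptic neuron (SIMD lane).
    """
    x = w_row2 if s2 else (0, 0, 0, 0)   # X multiplexer: A:B (four weights) or all zeros
    w = w_row3 if s3 else (0, 0, 0, 0)   # W multiplexer: C (four weights) or all zeros
    # 48-bit ALU in SIMD mode: four independent 12-bit additions, cascaded via PCIN -> PCOUT
    return tuple(xi + wi + pi for xi, wi, pi in zip(x, w, pcin))

def dsp48e2_chain_16x4(weight_rows, spikes):
    """Eight cascaded slices: a 16 x 4 crossbar; 16 INT8 terms still fit a 12-bit lane."""
    acc = (0, 0, 0, 0)
    for k in range(8):                   # each slice handles two crossbar rows
        acc = dsp48e2_2x4_crossbar(weight_rows[2 * k], weight_rows[2 * k + 1],
                                   spikes[2 * k], spikes[2 * k + 1], acc)
    return acc

# example: 16 spikes driving 16 rows of 4 INT8 weights
rows = [(1, -2, 3, -4)] * 16
print(dsp48e2_chain_16x4(rows, [1, 0] * 8))   # (8, -16, 24, -32)
```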
2) Synaptic Crossbar Computation by a DSP48E2 Chain: There are dedicated internal paths between adjacent DSP48E2s for local cascading, which will not occupy global routing resources. Since the synaptic weights are quantized to 8 bits and the bit width of the SIMD adder is 12 bits, up to eight DSP48E2s can be cascaded in a chain without numeric overflow. While the multiplexers and the ALU unit in the DSP48 are used for the synaptic crossbar computation, the dedicated cascaded path of the DSP48E2 acts as the dendrite, collecting and accumulating the computation results along the DSP48E2 chain. In this way, the arithmetic function of a DSP48 cascaded chain of length 8 is equivalent to a 16 × 4 synaptic crossbar computation. A DSP48 cascaded chain of length 8 is a single processing element (PE) in FireFly, which is the basic element of the systolic array introduced in Section IV-B3.

TABLE I. Resource Utilization Comparison Between DSP and Fabric Implementation

The straightforward implementation of a 2 × 4 synaptic crossbar computation described above using general fabric will consume 86 look-up tables, 114 flip-flops, and eight carry chains, while a 16 × 4 crossbar will consume 688 look-up tables, 912 flip-flops, and 64 carry chains, as shown in Table I. Note that the general fabric implementation will also consume global routing resources. It is considerably less efficient than the proposed approach and will lead to a compromised clock frequency when the parallelism scales up.

3) Synaptic Crossbar Computation by a Systolic Array: By instantiating multiple DSP48E2 chains, or PEs, in a systolic array fashion [shown in Fig. 2(d)], we can support a larger synaptic crossbar. The systolic array is a specialized mesh of homogeneous PEs designed to process massive parallel computations. It has the potential to run at a high frequency due to its regular and adjacent interconnections. Previous FPGA neuromorphic hardware adopting a systolic array architecture failed to achieve satisfactory performance, either in resource efficiency or clock frequency, since they are implemented in low-speed general fabrics. FireFly makes full use of the DSP48E2 feature and greatly improves the systolic array's performance.

In this work, our definition of the systolic array size is the same as that of the synaptic crossbar. An M × N systolic array supports an M × N synaptic crossbar computation, consisting of (M/16) × (N/4) PEs, or (M/16) × (N/4) × 8 DSP48E2s. Note that the DSP48E2 chain acting as the dendrite in each PE cannot be cascaded across PEs without numeric overflow; therefore, four additional adder trees are
instantiated to sum the SIMD accumulating results from the (M/16) PEs at each column.

Each PE in the systolic array contains different sets of synaptic weights. Adopting a WS scheme, the synaptic weight matrix remains cached in a PE until it is no longer needed. The same 1 × M binary spike vector is shared across columns horizontally, behaving just like the axons. The 1 × N partial sums, or the synaptic currents, flow out of the systolic array vertically, behaving just like the dendrites.
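For reference, the relationship between the array size, the PE/DSP count, and the peak-throughput figure used later in Section V [eq. (5)] can be written out directly. The helper name is ours; the numbers simply reproduce the two FireFly configurations discussed in this article.

```python
def firefly_array_budget(M, N, f_hz=300e6):
    """Size an M x N synaptic-crossbar systolic array built from DSP48E2 chains.

    Each PE is a chain of 8 DSP48E2s covering a 16 x 4 crossbar slice,
    so an M x N array needs (M/16) x (N/4) PEs. Peak GOP/s = 2 * f * M * N.
    """
    pes = (M // 16) * (N // 4)
    dsps = pes * 8
    peak_gops = 2 * f_hz * M * N / 1e9
    return {"PEs": pes, "DSP48E2s": dsps, "peak_GOP/s": peak_gops}

print(firefly_array_budget(144, 16))   # {'PEs': 36, 'DSP48E2s': 288, 'peak_GOP/s': 1382.4}
print(firefly_array_budget(288, 32))   # {'PEs': 144, 'DSP48E2s': 1152, 'peak_GOP/s': 5529.6}
```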
C. Spike Vector Generation for Convolution by Line Buffer

Similar to ANN, 2-D convolution is the basic operation in a digitalized SCNN. We incorporate the traditional line buffer design to generate the spike window needed for the spike-map convolution. The line buffer is commonly seen in CNN accelerator design because it can efficiently achieve kernel-level parallelism and ensure good reuse of image data. When FireFly is configured to SCNN mode, Cin channels of binary spike map are bundled together and streamed into the line buffer. The Kh × Kw spikes-bundle window is then flattened to a Kh × Kw × Cin vector and sent to the systolic array. In most of the established CNN architectures, 3 × 3 convolution with stride 1 and the same padding is the most common configuration. The SCNN architecture follows this scheme. Ideally, general neuromorphic hardware for SNN should support all types of convolutional layers with different configurations. But the hardware would not work efficiently for all types of convolution configuration, and such a design would cause hardware overhead, thus might not be feasible. Therefore, we design specialized line buffer logic for 3 × 3 convolution. Nevertheless, the methods discussed here are compatible with other kernel sizes. Using the dynamic function exchange features in FPGA, hardware supporting different types of convolutional layers can be dynamically deployed in FPGA during runtime.

When FireFly is configured for multilayer perceptron (MLP) topology mode, the line buffer datapath for SCNN is left idle and the shift register datapath for MLP is switched on. The shift register forms a serial-to-parallel stream width adapter by combining the Cin input spikes of Kh × Kw input transactions into one. The length of the binary spike vector in SCNN and MLP datapaths is the same and compatible with the height of the systolic array.
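A software analogue of the line-buffer behavior, assuming the 3 × 3, stride-1, same-padding case described above: for each output position it flattens the Kh × Kw × Cin spike bundle that is handed to the systolic array. The NumPy sketch below is illustrative only, not the streaming RTL.

```python
import numpy as np

def spike_windows_3x3(spike_map):
    """Yield the flattened 3 x 3 x Cin binary spike vector for every output position.

    spike_map : (Cin, H, W) binary array for one timestep (one channel tile).
    Mimics a line buffer with stride 1 and same (zero) padding.
    """
    c, h, w = spike_map.shape
    padded = np.pad(spike_map, ((0, 0), (1, 1), (1, 1)))
    for y in range(h):
        for x in range(w):
            window = padded[:, y:y + 3, x:x + 3]          # Kh x Kw spike bundle, Cin deep
            yield window.transpose(1, 2, 0).reshape(-1)   # length Kh*Kw*Cin vector

spikes = (np.random.default_rng(1).random((4, 5, 5)) > 0.7).astype(np.int8)
first = next(spike_windows_3x3(spikes))
print(first.shape)   # (36,) = 3 * 3 * 4, matching the systolic-array height Kh*Kw*P
```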
D. Synaptic Weight Delivery in a Multilevel Hierarchy

Although a DSP48E2-featured systolic array can already support a large synaptic crossbar computation, it is not feasible to build a static synaptic crossbar circuit large enough for SNNs that have millions of neurons and synaptic connections. Instead, the presynaptic neurons and postsynaptic neurons should share the same synaptic crossbar computation circuit in a time-multiplexed manner.

Considering the inference process of a single convolution layer in SCNN, all presynaptic neurons within the same channel of the same feature map share the same weight kernel, so the weight matrix can remain static while the input spikes of feature maps flow through the systolic array in a streaming pipeline. However, the weight matrix needs to be changed after the current subset of feature maps finishes processing. Replacing the current set of weights with the next set of weights can be problematic. On the one hand, the instantaneous data reloading bandwidth is extremely high when the current set of weights expires. On the other hand, the expired set of weights at the current timestep will be reloaded again at the next step. It is inefficient if synaptic weights need to be fetched from off-chip memory over and over again at every timestep.

We propose a multilevel weight delivery hierarchy to tackle the aforementioned problems. The instantaneous bandwidth needed when reloading the next set of weights is amortized over an idle period when the weights are kept stationary by the multilevel weight delivery hierarchy. Synaptic weights are cached on-chip and reused over all timesteps using a novel memory structure we proposed to avoid repetitive off-chip data access. As we iterate the tiled output channel variable over timesteps, shown in Algorithm 1, only a small portion of weights contributing to the current subset of output feature maps need to be cached on-chip. Data width upsizing techniques are used to boost the on-chip data bandwidth to enable faster weight delivery. There are three basic components in the proposed multilevel weight delivery hierarchy: the proposed partial reuse FIFO, the stream width upsizer, and the skid buffer.

1) Partial Reuse FIFO: We propose the partial reuse FIFO, a new memory structure for streaming data buffering, supporting data reuse like the ping-pong buffer and having an FIFO-like feature. We first review two classic memory structures for streaming data buffering and latency hiding before we introduce the partial reuse FIFO. Fig. 3(a) shows a classic ping-pong buffer. The buffer size is doubled for independent read and write processes. The input stream flows into one bank of the buffer and the output stream flows out from the other. Read and write conflicts are eliminated but memory resource consumption is relatively high for double buffering. Data cached in the ping-pong buffer can be reused, but manual controlling and bank switching are needed, which may complicate the controller design. Fig. 3(b) shows a classic synchronous FIFO, which is represented using a ring. A push pointer is used to mark the write address of the incoming data stream. A pop pointer is used to mark the read address of the output data stream. When the push pointer and the pop pointer meet each other, the FIFO is either full or empty, depending on whether the occupancy of the FIFO is rising or falling. FIFO provides a certain capability of buffering the input data stream when the downstream module is not ready. The control logic of the FIFO is self-contained. Using a valid-ready handshaking protocol, FIFO can be inserted directly between modules without complicating the whole design. However, FIFO does not support data reusing.

The partial reuse FIFO we proposed is shown in Fig. 3(c). The mechanism of the partial reuse FIFO is the same as the traditional synchronous FIFO, except that a partial region in the FIFO ring cannot be flushed by incoming data until it is reused T times, where T is a control register of the partial reuse FIFO.
Fig. 3. Different approaches for hiding data transfer latency to improve throughput. (a) Ping-pong buffer. (b) Synchronous FIFO. (c) Proposed partial reuse FIFO. (d) ×4 stream width upsizer. (e) Four-level synaptic weight delivery hierarchy to enable synaptic weight reuse, reduce off-chip memory bandwidth, and hide the weight loading latency to the systolic array.

The reuse region of the FIFO is labeled by Start and End. The pop pointer jumps back to the Start position whenever it reaches the End. The reuse counter increases whenever the pop pointer jumps back to Start. The Start label stays the same when the region is still being reused. When the counter reaches T, the counter is reset, label End becomes the next label Start, and the next label End is set by Start + L − 1, where L is another control register of the partial reuse FIFO. When the push pointer meets the label Start, the partial reuse FIFO is considered full and the ready signal to the input stream is cleared. When label End is ahead of the push pointer, the partial reuse FIFO is considered empty until the reuse sector of the FIFO is filled by the input stream. Using the valid-ready handshaking protocol, the function of the partial reuse FIFO is self-contained, with only two control registers, the reusing times T and the reusing length L, exposed. The partial reuse FIFO contains only a monolithic RAM and does not need to be double-buffered. The push-pop pointer in the FIFO control logic ensures no read-write collision. The reuse sector protected by the Start and End labels enables data reuse. New data from multiple batches can be pushed to the partial reuse FIFO sequentially as long as the FIFO is not full. The partial reuse FIFO is the key component in this multilevel synaptic weight delivery hierarchy.
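To make the pointer rules above concrete, here is a small behavioral model of the partial reuse FIFO in Python. Only the two exposed control registers, the reuse count T and the reuse length L, appear as parameters; it is an illustrative model under our own simplifications (not cycle-accurate and not the RTL).

```python
from collections import deque

class PartialReuseFifo:
    """Behavioral model of the partial reuse FIFO (simplified, not cycle-accurate).

    A window of `reuse_len` (L) elements is read `reuse_times` (T) times before it
    is retired; incoming data may only overwrite elements that are no longer needed.
    """
    def __init__(self, depth, reuse_len, reuse_times):
        self.depth, self.L, self.T = depth, reuse_len, reuse_times
        self.buf = deque()          # elements waiting to enter / inside the reuse window
        self.reads_done = 0         # completed passes over the current window

    def can_push(self):             # "full" when pushing would clobber the protected window
        return len(self.buf) < self.depth

    def push(self, x):
        assert self.can_push()
        self.buf.append(x)

    def pop_window(self):
        """Return one pass over the reuse window, or None while the window is not filled."""
        if len(self.buf) < self.L:                 # "empty" until the reuse sector is filled
            return None
        window = list(self.buf)[: self.L]
        self.reads_done += 1
        if self.reads_done == self.T:              # window reused T times: retire it,
            for _ in range(self.L):                # i.e. Start/End advance to the next window
                self.buf.popleft()
            self.reads_done = 0
        return window

fifo = PartialReuseFifo(depth=8, reuse_len=4, reuse_times=2)
for x in range(6):
    if fifo.can_push():
        fifo.push(x)
for _ in range(4):
    print(fifo.pop_window())   # [0,1,2,3] twice, then None until the next window is filled
```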
2) Stream Width Upsizer: A ×N stream width upsizer converts the one-element input stream to an N-element output stream by allocating N elements of the input stream and firing them all at once. As shown in Fig. 3(d), the data stream from the upstream master delivers −1, 2, 5, 8, 7, . . . , 3 serially, delivering one element per clock cycle. The ×4 stream width upsizer performs a serial-to-parallel conversion, delivering four elements every four clock cycles. The average data throughput of the upstream and downstream measured over time is the same, but the instantaneous throughput is increased ×4 times. A partial reuse FIFO module can be directly placed after the stream width upsizer to boost the data throughput once the reuse sector of the partial reuse FIFO is filled.

3) Skid Buffer: A skid buffer is a two-entry pipeline FIFO buffer. It decouples two sides of a ready/valid handshake to allow back-to-back transfers. The skid buffer is placed at the last stage of the multilevel weight delivery hierarchy. At the downstream side of the skid buffer, the systolic array holds the current set of weights stationary by applying back pressure to the skid buffer and releasing the pressure when the current set of weights is no longer needed. At the upstream side of the skid buffer, the ready signal is always held high until new data shifts in, blocking the back pressure of the systolic array to enable faster data delivery.

Fig. 3(e) shows a simple example illustrating the mechanism of the multilevel weight delivery hierarchy. Arrows indicate the direction of data transfer. We assume the weight data stream coming from off-chip memory delivers one element per clock cycle (for the simplicity of drawing). The multilevel weight delivery hierarchy consists of four levels as listed below.

1) Level 1: The data stream is upsized by the ×8 stream width upsizer, delivering eight
elements every eight clock cycles. As shown in Fig. 3(e), −1, 2, 5, . . . , 8, 7, −5, −8, . . . , 3, 0, 3, 7, . . . , 4 flow into the upsizer one by one serially. (−1, 2, 5, . . . , 8), (7, −5, −8, . . . , 3), . . ., (0, 3, 7, . . . , 4) flow out of the upsizer eight elements in a group. Elements with slash symbols are invalid at the current clock cycle.

2) Level 2: The partial reuse FIFO is placed right after the upsizer. Once the data are cached in the reuse region, the output stream of the partial reuse FIFO can deliver eight elements per cycle, thus boosting the data throughput. Weight data cached in the reuse region is reused T times before being flushed with new data. Note that invalid elements no longer occupy clock cycles in the output data stream of the partial reuse FIFO, as shown in Fig. 3(e).

3) Level 3: Another ×8 stream width upsizer is placed following the partial reuse FIFO to further expand the instantaneous bandwidth. In this case, 8 × 8 elements are collected and delivered all at once.

4) Level 4: Finally, the skid buffer is instantiated to bridge the weight delivery logic and the systolic array.

The multilevel weight delivery hierarchy enables instant weight data supply to the systolic array when the current set of weights expires, minimizing the idle state of the systolic array, thereby greatly raising the ratio between the actual throughput and the theoretical throughput.
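The following Python generator pipeline mirrors the structure of Fig. 3(e): a ×8 upsizer, a partial-reuse stage that replays each cached block T times, and a second ×8 upsizer. The skid buffer is omitted because it only decouples the ready/valid handshake. Stage names are ours and the model is purely illustrative.

```python
def upsize(stream, n):
    """x N stream width upsizer: group n serial elements into one wide transfer."""
    batch = []
    for x in stream:
        batch.append(x)
        if len(batch) == n:
            yield tuple(batch)
            batch = []

def reuse(stream, times):
    """Partial-reuse stage: deliver each cached transfer `times` times before flushing it."""
    for block in stream:
        for _ in range(times):
            yield block

def weight_delivery(weight_stream, reuse_times):
    """Level 1: x8 upsize -> Level 2: partial reuse -> Level 3: x8 upsize.
    Level 4 (the skid buffer) only decouples handshaking and is not modeled here.
    Each yielded item is an 8 x 8 block of weights delivered to the systolic array at once."""
    return upsize(reuse(upsize(weight_stream, 8), reuse_times), 8)

# 128 weights streamed from "off-chip memory", each 8-weight group reused twice
blocks = list(weight_delivery(iter(range(128)), reuse_times=2))
print(len(blocks), len(blocks[0]), len(blocks[0][0]))   # 4 blocks of 8 x 8 weights
```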

E. Psum-Vmem Unified Buffer and Spike Generation Logic

A classic systolic array consumes data from the input data domain and the weight data domain and generates data for the output data domain. If one data domain stays stationary, the other two must flow through the computing logic. This holds for the three classic input, weight, and output stationary dataflows. Our architecture adopts the WS dataflow. In this case, synaptic weights remain stationary in the systolic array, and the input binary spikes and the output flow in and out of the systolic array. The flowing spike vector is generated by the line buffer mechanism, and the outputs are stored in the proposed Psum-Vmem unified buffer, shown in Fig. 4.

In our architecture, the synaptic operations in SNN are spatially parallelized. However, it is unlikely to flatten a whole layer spatially onto the area-power-restricted hardware substrates. Therefore, certain tiling strategies need to be implemented. We adopt the channel tiling strategy to accommodate layers with a large number of channels to the same systolic array. Input spike map channels are split into multiple tiles to fit into the height of the systolic array. Output spike map channels are calculated N at a time according to the width of the systolic array. In every single timestep, the partial sums of the N output spike map channels are stored on-chip and are not fully accumulated until all tiles of the input spike map channels are calculated. In each layer, the membrane voltages of the N output spike map channels also need to be stored on-chip until all timesteps are iterated. Instead of instantiating a separate buffer for Psum-Vmem, we propose the Psum-Vmem unified buffer to reduce RAM consumption.

Since tiles of input spike map channels in a single timestep are sent to the computing array one by one and the temporal dimension of SNN is kept in its natural way of executing sequentially, the partial sum accumulating process and the membrane voltage update process can be scheduled using a finite state machine (FSM). There are three states specified in the FSM: accumulating phase, thresholding phase, and clearing phase. During the accumulating phase, Psum extracted from the Psum-Vmem unified buffer is accumulated by the computing results from the systolic array. When the last tile of the input spike map channel in the current timestep arrives and the current timestep is not the last, the FSM switches to the thresholding phase. The extracted Psum is first accumulated, then processed by the optional leaky unit and the thresholding unit, and eventually written back to the unified buffer. The accumulated Vmem will be reduced by a fixed portion of its value by the optional leaky unit to support the LIF neuron dynamics. The thresholding unit will compare the Vmem with the threshold, generate a spike, and reset the Vmem if it exceeds the threshold. All of the computations are pipelined to improve timing. The FSM switches back to the accumulating phase when this phase finishes. When the last tile of the input spike map channel in the last timestep arrives, the FSM switches to the clearing phase. The computation process is the same as the thresholding phase, except that the Vmem value will be cleared to reset the unified buffer for the next SNN layer.
V. IMPLEMENTATION AND EXPERIMENTS

A. Experiments Setup

FireFly is mapped onto several off-the-shelf commercially available Xilinx Zynq Ultrascale FPGAs, including the Ultra96v2, KV260, and ZCU104 evaluation boards. The FPGA chips of the three evaluation boards are xczu3eg, xczu5ev, and xczu7ev, respectively. Most neuromorphic hardware uses expensive large FPGA devices, ignoring the feasibility of deploying such hardware in the real world. FireFly brings hope to SNN real-world applications in an edge scenario.

Our proposed FireFly is designed using SpinalHDL. The Verilog codes generated by the SpinalHDL compiler are synthesized and implemented in Xilinx Vivado 2021.1 with ML-based design optimization to achieve a higher clock rate and faster timing closure. Power consumption estimates and timing results are obtained after place-and-route using the power analysis and timing summary tools in the Vivado Design Suite, which provides detailed analysis and accurate estimation. We choose the Zynq devices as the system platforms. The built-in host CPU controller enables fast deployment of different SNN networks without the need to change the PL logic. The host program generates a command sequence in advance and sends the commands to PL through a high-performance AXI-Stream to the internal command queue of the AXI DataMover. FireFly is based on the Brain-Inspired Cognitive Engine (BrainCog) and is a first step toward the software–hardware codesign for the BrainCog project (http://www.brain-cog.network/) [30].
Fig. 4. Psum-Vmem update mechanism. (a) Finite-state machine performing the Psum-Vmem update. (b) Proposed Psum-Vmem unified buffer and Psum-Vmem update engine. (c) Hardware implementation details of the Psum-Vmem update engine.

B. Comparisons in Hardware Specifications

The theoretical peak GOP/s of an SNN accelerator is given as

\text{Peak GOP/s} = 2 \times f \times S \quad (5)

where f is the system clock frequency and S = M × N denotes the size of the systolic array. The peak GOP/s calculation is the same as [13] and [31]. In FireFly, M denotes the number of rows in the systolic array, while N denotes the columns. The peak performance should be proportional to the systolic array size. The size of the systolic array can be statically reconfigured in FireFly according to the on-chip resources of different evaluation boards. An M × N systolic array in FireFly receives M presynaptic inputs and produces partial sums for N neurons, where N = P and M = Kh × Kw × P. The resource consumption, memory bandwidth, and acceleration performance are linearly proportional to the parallelism factor P. P can be any value as long as the systolic array can fit in the target device. As P is also the tiling factor of the input and output channels in a convolutional layer, it is preferable to set P to a power of 2 because the number of channels in most convolutional layers is a power of two. Therefore, we evaluate two representative configurations, 144 × 16 and 288 × 32, to demonstrate the reconfigurability of FireFly. Implementing synaptic operations using DSP48 significantly reduces fabric overhead and leads to substantial improvements in GOP/s compared to most existing hardware. FireFly, with a 144 × 16 systolic array, can achieve a peak performance of 1382.4 GOP/s, while FireFly with a 288 × 32 systolic array can achieve a peak performance of 5529.6 GOP/s, as presented in Table II.

TABLE II. Comparison With Other Works in Hardware Specifications

We compare with four representative systolic-array-based hardware accelerators implemented on FPGA platforms in Table II. We focus on comparing the hardware specifications and theoretical computing capabilities of these accelerators, which can be easily quantified. As these accelerators are based on a systolic array, the regular 2-D arrangement of PEs in these works makes it simple to measure the maximum computing power that these accelerators can deliver. Despite variations in their PE configurations, all designs employ a basic multiplex-accumulate unit to implement the synaptic crossbar computation. The peak throughput is determined by the number of multiplex-accumulate units and the clock rate and can be estimated using (5).

We first present a comprehensive perspective by outlining several essential observations.

1) Despite the abundance of DSP resources in their devices, none of these works effectively utilize them, resulting in high LUT consumption and low clock frequency.
2) It is worth noting that the FPGA devices used in these works, such as xc7z100, xcvu440, xc7k325t, and xc7vx690t, are considerably larger than the edge device we employed (xczu3eg), yet we were able to build a larger computing array and achieve comparable, or even better, peak performance.
3) Although these works employ expensive, large FPGA devices, they do not effectively harness the full potential of these resources and fail to consider the practical feasibility of real-world deployment.

We then make case-by-case comparisons with these four representative works, since finding a normalized metric that considers all aspects, such as precision, neuron types, and resource consumption, can be extremely challenging.

1) Cerebron [13] leverages weight sparsity acceleration and supports pointwise and depthwise convolutions, resulting in a more complex PE design than that of FireFly. However, FireFly achieves a higher peak performance (1382.4 versus 650 GOP/s) using a smaller device (xczu3eg versus xc7z100), surpassing Cerebron in terms of computational density efficiency. A drawback of FireFly is its inability to benefit from weight sparsity acceleration or support pointwise and depthwise convolutions.
2) SIES [31] utilizes a 64 × 64 systolic array, which provides a peak performance of 1638.4 GOP/s on the xcvu440 platform. Guo et al. [32] implement a 32 × 32 systolic array, delivering a peak performance of 204.8 GOP/s. It is worth noting that both SIES and Guo use FIX32 precision, while FireFly uses INT8. To ensure a fair comparison, we use Tb/s (Tera bit-operations per second) to account for data precision. We compare FireFly, mapped on xczu3eg, to Guo's implementation and the results show that FireFly surpasses Guo's implementation (11.1 versus 6.5). We compare FireFly, mapped on xczu7ev, to SIES and the results show that FireFly slightly trails behind SIES (44.24 versus 52.4). However,
FireFly is more efficient in terms of resource utilization, In this work, we deploy several state-of-the-art SNN net-
with a balanced LUTs and DSPs consumption compared works trained by backpropagation algorithms [2] on FireFly to
with both implementations. test the inference performance. We evaluate not only the static
3) Ye et al. [14] implement a systolic array that supports datasets such as MNIST, CIFAR10, and CIFAR100 but also
accurate LIF dynamics using an extended prediction cor- the neuromorphic datasets such as DVS-CIFAR10 and DVS-
rection technique, while FireFly only approximates LIF Gesture. The models are trained using the Pytorch framework
behavior using a simple shift operation. Ye et al. support with NVIDIA A100 graphic processing unit (GPU). AdamW
MLP and CNN topologies using separate computing algorithm [38] is used as the optimizer. The learning rate is set
units (PE arrays + FC cores), while FireFly manages to to 1 × 10−3 . The membrane potential threshold is set to 0.5.
reuse the same systolic array. Although Ye et al. cannot The membrane time constant τm is set to 2.0 for LIF neuron
achieve comparable computing throughput with FireFly, models. The batch size is set to 128. The training epochs are
they can deliver a more precise LIF behavior. set to 600. The models are trained using surrogate functions
C. Comparisons in Benchmark Evaluations

Existing SNN training methods can be categorized into three types. Biologically plausible methods are mainly inspired by the synaptic learning rules in the human brain; spike-timing-dependent plasticity and Hebbian learning rules are extensively used in these methods. Although these methods are energy efficient and biologically plausible, they only work well on shallow networks and toy datasets such as MNIST. Conversion methods convert the analog activations of ANNs into the firing rates of SNNs. Although higher accuracy can be achieved this way, the required number of timesteps is large, leading to high energy consumption. Backpropagation algorithms have also been introduced into the SNN domain: surrogate gradients allow SNNs to perform backpropagation through time (BPTT), so that SNNs can be scaled to larger network structures and more complex datasets.

In this work, we deploy several state-of-the-art SNN networks trained by backpropagation algorithms [2] on FireFly to test the inference performance. We evaluate not only static datasets such as MNIST, CIFAR10, and CIFAR100 but also neuromorphic datasets such as DVS-CIFAR10 and DVS-Gesture. The models are trained using the PyTorch framework on an NVIDIA A100 graphics processing unit (GPU). The AdamW algorithm [38] is used as the optimizer, with the learning rate set to 1 × 10⁻³. The membrane potential threshold is set to 0.5, and the membrane time constant τm is set to 2.0 for the LIF neuron models. The batch size is set to 128, and the models are trained for 600 epochs. Surrogate functions such as the quadratic gate and the arctangent gradient are used to overcome the nondifferentiability of binary spikes. Direct coding is used to reduce the total number of timesteps of the SNNs; in our experiments, the timesteps are scaled down to four without a significant accuracy drop. These training algorithms are provided by BrainCog's infrastructure [30], [39].

After the training process, several steps are needed to deploy the PyTorch-trained SNN model to FireFly. We first apply batch-norm fusion to merge each batch normalization layer into the preceding convolutional layer to reduce computation complexity. Then we observe the distribution of the synaptic weights of each layer, calculate a scaling factor, and convert the synaptic weights from 32-bit floating-point numbers to 8-bit signed integers. The firing threshold is also quantized using the scaling factor derived from the weight observations. Note that the accuracy drop of post-training quantization without further retraining or fine-tuning is negligible in SNNs because no scaling errors from multiplications are introduced.
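A simplified version of this deployment flow is sketched below. It is our own illustration under stated assumptions rather than the actual FireFly toolchain: the helper names, the max-absolute-value calibration rule, and the symmetric signed 8-bit range are assumptions, but the structure follows the steps described above (batch-norm fusion, per-layer weight scaling, and threshold rescaling).

```python
import torch
import torch.nn as nn

def fuse_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution (inference only)."""
    w = conv.weight.data
    b = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    fused.weight.data = w * scale.reshape(-1, 1, 1, 1)
    fused.bias.data = bn.bias.data + (b - bn.running_mean) * scale
    return fused

def quantize_layer(weight_fp32: torch.Tensor, v_th_fp32: float):
    """Per-layer symmetric post-training quantization: derive one scaling
    factor from the observed weight range and apply it to both the weights
    and the firing threshold, so the integer membrane potential is compared
    against a threshold expressed in the same scale and no per-multiplication
    rescaling is needed."""
    scale = weight_fp32.abs().max().item() / 127.0
    w_int8 = torch.clamp(torch.round(weight_fp32 / scale), -128, 127).to(torch.int8)
    v_th_int = int(round(v_th_fp32 / scale))
    return w_int8, v_th_int, scale
```

Because the incoming activations are binary spikes, scaling the weights and the threshold by the same factor leaves the firing decision unchanged up to weight-rounding error, which is why no retraining is needed.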
TABLE III: COMPARISON WITH RELATED WORK FOR MULTIPLE IMAGE CLASSIFICATION TASKS USING SNNs FOR MULTIPLE DATASETS
FireFly shows reconfigurability on different SNN models for different image classification tasks. We evaluate four different SNN model structures, with 5, 7, 9, and 11 convolutional layers, on five different datasets, as shown in Table III. As different research studies use varying benchmarks that differ in scale and complexity, deriving meaningful comparison results can be challenging. To ensure a fair comparison among experiments, we have calculated the FLOPS of the equivalent ANN model of each SNN model to quantify the benchmark size. Please note that we have ignored the timestep count of the SNN in our FLOPS calculation. This is because these accelerators may employ spike aggregation techniques [17] to reduce the complexity of their SNN models, or only process timesteps with spikes [33], or adopt temporal coding to ensure that only one spike occurs across all timesteps [15], [16], or simply not report the total timesteps used. Therefore, it is not possible to have a fair comparison considering all of these aspects; the FLOPS of the equivalent ANN models can at least provide a rough estimate of the benchmark size. Specifically, we calculate the FLOPS of a single convolutional or fully connected layer as follows:

FLOPS_Conv = 2 × Kh × Kw × H × W × Cout × Cin    (6)
FLOPS_MLP = 2 × Cout × Cin.    (7)

To evaluate the inference performance of these accelerators, we use the metric kFPS·MFLOPS, where the kilo frames per second (kFPS) figure is reported in each experiment and the MFLOPS figure is calculated as above. To evaluate the efficiency of these accelerators, we divide kFPS·MFLOPS by power, where power is also reported in each experiment. These two metrics reflect the accelerators' efficiency while taking the benchmark size and power consumption into account. (A short worked example of both metrics follows the observation list below.)

After analyzing Table III, we have made several key observations.

1) FireFly can adapt to various datasets and models and achieves accuracy comparable to other works across all five datasets.
2) FireFly can support deep and large SNN networks, as indicated by the larger MFLOPS values shown in Table III. In terms of inference latency without considering the benchmark size, FireFly achieves a moderate level of performance; in terms of power consumption alone, FireFly is not particularly outstanding.
3) FireFly can achieve high kFPS·MFLOPS when taking the benchmark size into account. Only Cerebron's experiment on MNIST using a small ConvNet can surpass FireFly in this regard.
4) FireFly can achieve high computational efficiency compared to most research, especially when the benchmark size is increased. While Cerebron and SyncNN achieve high efficiency when the benchmark size is small, their performance degrades rapidly when switching to large networks.
5) FireFly exhibits a stable kFPS·MFLOPS across various benchmark sizes and datasets, demonstrating its scalability and reconfigurability.
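The sketch below makes the benchmark-size normalization behind these observations concrete. It is our own helper with illustrative numbers (the layer shape, frame rate, and power figure are not taken from Table III); it simply instantiates (6) and (7) and the kFPS·MFLOPS-per-watt figure defined above.

```python
def flops_conv(kh, kw, h, w, c_out, c_in):
    """Eq. (6): two operations (multiply + accumulate) per weight per output pixel."""
    return 2 * kh * kw * h * w * c_out * c_in

def flops_fc(c_out, c_in):
    """Eq. (7): fully connected layer."""
    return 2 * c_out * c_in

def efficiency(kfps, mflops, power_w):
    """kFPS x MFLOPS per watt: throughput weighted by benchmark size and power."""
    return kfps * mflops / power_w

# Illustrative 3x3 convolution over a 32x32 feature map with 128 input and
# 128 output channels: about 302 MFLOPS for this single layer.
mflops = flops_conv(3, 3, 32, 32, 128, 128) / 1e6
print(mflops, efficiency(kfps=1.0, mflops=mflops, power_w=5.0))
```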
Note that our chosen device, the xczu3eg, is an edge device having the fewest resources among all the listed hardware, but FireFly still shows significant improvement on all of these benchmarks. When using a larger xczu7ev device, all the inference performances listed above are improved by a factor of four because the xczu7ev supports higher parallelism and has a peak performance of 5.523 TOP/s. Our system also supports multiple heterogeneous cores running different SNN models concurrently; when targeting the xczu5ev, two FireFly cores can be deployed independently to support multiple real-world tasks.
D. Discussion

We argue that, for FPGA-based SNN accelerator design, the benefits of designing complicated hardware to support spike sparsity may not make up for the losses caused by irregular interconnect and underutilization of the dedicated hard blocks. The system clock frequency can have a significant impact on inference performance. Compared with ASICs, routing in FPGAs contributes more delay, since logic elements are connected through a series of switching matrices instead of direct physical wires. A complex digital design with irregular interconnect can easily violate timing requirements even on the most advanced FPGA devices; most existing FPGA-based SNN accelerators can only close timing at 200 MHz or below, even on expensive Virtex UltraScale+ devices. Another important aspect of low-power FPGA system design is to utilize the existing dedicated hard blocks rather than build equivalent functions from scratch: implementing the same function with a dedicated hard block usually consumes less energy than using its general-fabric counterpart. However, most existing FPGA-based SNN accelerators fail to exploit the features provided by these hard blocks and adopt a naive implementation of spike computation using low-speed fabric.
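To make explicit what "spike computation" amounts to here, the behavioral sketch below (ours, purely illustrative; it says nothing about how FireFly actually packs these operations into DSP48E2 slices) shows the synaptic operation for binary activations: a multiplex-accumulate in which the weight is either added or skipped, with no multiplier involved.

```python
import numpy as np

def multiplex_accumulate(spikes: np.ndarray, weights: np.ndarray) -> int:
    """Spike-gated accumulation: with binary activations, each synaptic
    operation reduces to a multiplexer (select the weight when the spike
    is 1, otherwise 0) followed by an accumulator."""
    acc = 0
    for s, w in zip(spikes, weights):
        acc += int(w) if s else 0  # mux + accumulate; no multiplication
    return acc

spikes = np.array([1, 0, 1, 1, 0], dtype=np.uint8)
weights = np.array([3, -2, 5, -1, 4], dtype=np.int8)
assert multiplex_accumulate(spikes, weights) == int(weights[spikes.astype(bool)].sum())
```

Whether such an operation is built from general fabric or absorbed into a dedicated arithmetic hard block is exactly the implementation choice discussed above.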
In this article, FireFly provides a different perspective on designing dedicated neuromorphic hardware for SNNs targeting FPGA devices. We are well aware that it is important to design hardware that supports sparsity acceleration. However, to the best of our knowledge, few studies [15], [16] targeting ASICs have shown significant speedups by exploiting this inherent property of SNNs, not to mention the large majority of FPGA-based designs. Instead of designing complicated circuits to support sparsity acceleration, FireFly consists of a monolithic systolic array. The acceleration comes from the clock frequency improvement brought by the regular and simple interconnect of the systolic array, the pipelined arithmetic computations, and, most importantly, the flexible use of the multifunction DSP48E2s.

In fact, the potential of the DSP48E2 is still far from being fully realized. Wu et al. [6] proposed a high-throughput processing array for matrix multiplication based on DSP supertiles and achieved peak DSP clock rates on Xilinx UltraScale (741 MHz) and UltraScale+ (891 MHz) devices. SNN accelerators can incorporate the DSP supertile design and achieve even higher performance. The potential of other dedicated hard blocks on FPGAs also remains to be exploited: Scaling the Cascades [5] fully utilized the dedicated cascade interconnects of the DSP48E2, BRAM36K, and URAM288K, achieved nearly 100% usage of these hard blocks, and delivered impressive inference speed on MLPerf benchmarks. It is necessary to migrate existing hardware optimization techniques from ANN accelerator design into SNN neuromorphic hardware research. Nevertheless, we agree that, ideally, the main advantage of new SNN accelerators over ANNs on digital hardware should come primarily from exploiting the sparsity of spikes, and not merely from the replacement of MAC operations with accumulate-only (AC) operations [40]. Future neuromorphic hardware designs should exploit spike sparsity and migrate existing FPGA optimization techniques simultaneously.

Also, we admit that, because our DSP optimization technique is tightly coupled to the FPGA device family we are using, it limits the portability of our Verilog code and makes it difficult to convert our design into an ASIC. However, we argue that FPGA-specific optimizations are still necessary for SNN accelerator design. As an emerging research field, SNN variants are ever-changing, and FPGA implementations of fast-evolving SNN algorithms are preferred over ASIC implementations because of their reconfigurability and flexibility. High-quality FPGA accelerators with FPGA-specific optimizations can offer feasible solutions for real-world SNN applications.

VI. CONCLUSION

In this work, we introduced a high-throughput and reconfigurable hardware accelerator for SNNs. To achieve high-performance SNN inference, we fully exploited the features of the dedicated DSP48E2 slices embedded in the FPGA and achieved the highest GOP/s among existing accelerator designs. To improve memory efficiency, we designed a synaptic weight delivery hierarchy and a Psum-Vmem unified buffer to support the high parallelism. To demonstrate FireFly's reconfigurability, we evaluated multiple deep SNN models on various datasets. To make SNN applications more convenient, we used off-the-shelf, commercially available FPGA edge devices, offering a more feasible solution than other existing hardware. In the future, we will migrate more FPGA-oriented optimization techniques while exploring sparsity acceleration to enable more energy-efficient SNN software and hardware codesign.

REFERENCES

[1] W. Maass, "Networks of spiking neurons: The third generation of neural network models," Neural Netw., vol. 10, no. 9, pp. 1659-1671, Dec. 1997.
[2] G. Shen, D. Zhao, and Y. Zeng, "Backpropagation with biologically plausible spatiotemporal adjustment for training deep spiking neural networks," Patterns, vol. 3, no. 6, Jun. 2022, Art. no. 100522.
[3] H. Zheng, Y. Wu, L. Deng, Y. Hu, and G. Li, "Going deeper with directly-trained larger spiking neural networks," in Proc. AAAI Conf. Artif. Intell., May 2021, vol. 35, no. 12, pp. 11062-11070.
[4] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127-138, Jan. 2017.
[5] A. Samajdar, T. Garg, T. Krishna, and N. Kapre, "Scaling the cascades: Interconnect-aware FPGA implementation of machine learning problems," in Proc. 29th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2019, pp. 342-349.
[6] E. Wu, X. Zhang, D. Berman, and I. Cho, "A high-throughput reconfigurable processing array for neural networks," in Proc. 27th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2017, pp. 1-4.
[7] E. Painkras et al., "SpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation," IEEE J. Solid-State Circuits, vol. 48, no. 8, pp. 1943-1953, Aug. 2013.
[8] M. Davies et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82-99, Jan. 2018.
[9] F. Akopyan et al., "TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 34, no. 10, pp. 1537-1557, Oct. 2015.
[10] J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, and S. Millner, "A wafer-scale neuromorphic hardware system for large-scale neural modeling," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Jun. 2010, pp. 1947-1950.
[11] J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice, "All-optical spiking neurosynaptic networks with self-learning capabilities," Nature, vol. 569, no. 7755, pp. 208-214, May 2019.
[12] J.-Q. Yang et al., "Leaky integrate-and-fire neurons based on perovskite memristor for spiking neural networks," Nano Energy, vol. 74, Aug. 2020, Art. no. 104828.
[13] Q. Chen, C. Gao, and Y. Fu, "Cerebron: A reconfigurable architecture for spatiotemporal sparse spiking neural networks," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 30, no. 10, pp. 1425-1437, Oct. 2022.
[14] W. Ye, Y. Chen, and Y. Liu, "The implementation and optimization of neuromorphic hardware for supporting spiking neural networks with MLP and CNN topologies," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 42, no. 2, pp. 448-461, Feb. 2023.
[15] S. Narayanan, K. Taht, R. Balasubramonian, E. Giacomin, and P. Gaillardon, "SpinalFlow: An architecture and dataflow tailored for spiking neural networks," in Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit. (ISCA), May 2020, pp. 349-362.
[16] F. Liu et al., "SATO: Spiking neural network acceleration via temporal-oriented dataflow and architecture," in Proc. 59th ACM/IEEE Design Autom. Conf., Jul. 2022, pp. 1105-1110.
[17] S. Panchapakesan, Z. Fang, and J. Li, "SyncNN: Evaluating and accelerating spiking neural networks on FPGAs," in Proc. 31st Int. Conf. Field-Program. Log. Appl. (FPL), Aug. 2021, pp. 286-293.
[18] M. T. L. Aung, C. Qu, L. Yang, T. Luo, R. S. M. Goh, and W. Wong, "DeepFire: Acceleration of convolutional spiking neural network on modern field programmable gate arrays," in Proc. 31st Int. Conf. Field-Program. Log. Appl. (FPL), Aug. 2021, pp. 28-32.
[19] J. Park, J. Lee, and D. Jeon, "A 65-nm neuromorphic image classification processor with energy-efficient training through direct spike-only feedback," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 108-119, Jan. 2020.
[20] P. Chuang, P. Tan, C. Wu, and J. Lu, "A 90 nm 103.14 TOPS/W binary-weight spiking neural network CMOS ASIC for real-time object classification," in Proc. 57th ACM/IEEE Design Autom. Conf. (DAC), Jul. 2020, pp. 1-6.
[21] H. Fang, Z. Mei, A. Shrestha, Z. Zhao, Y. Li, and Q. Qiu, "Encoding, model, and architecture: Systematic optimization for spiking neural network in FPGAs," in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD), Nov. 2020, pp. 1-9.
[22] J. Lee, W. Zhang, and P. Li, "Parallel time batching: Systolic-array acceleration of sparse spiking neural computation," in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), Apr. 2022, pp. 317-330.
[23] Q. Chen, C. Gao, X. Fang, and H. Luan, "Skydiver: A spiking neural network accelerator exploiting spatio-temporal workload balance," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 12, pp. 5732-5736, Dec. 2022.
[24] E. M. Izhikevich, "Which model to use for cortical spiking neurons?" IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1063-1070, Sep. 2004.
[25] A. L. Hodgkin and A. F. Huxley, "A quantitative description of membrane current and its application to conduction and excitation in nerve," J. Physiol., vol. 117, no. 4, pp. 500-544, Aug. 1952.
[26] L. F. Abbott, "Lapicque's introduction of the integrate-and-fire model neuron (1907)," Brain Res. Bull., vol. 50, nos. 5-6, pp. 303-304, Nov. 1999.
[27] P. Dayan and L. Abbott, "Theoretical neuroscience: Computational and mathematical modeling of neural systems," J. Cognit. Neurosci., vol. 15, no. 1, pp. 154-155, 2003.
[28] L. Zhang et al., "A cost-efficient high-speed VLSI architecture for spiking convolutional neural network inference using time-step binary spike maps," Sensors, vol. 21, no. 18, p. 6006, Sep. 2021.
[29] UltraScale Architecture DSP Slice User Guide, Xilinx Inc., San Jose, CA, USA, 2021.
[30] Y. Zeng et al., "BrainCog: A spiking neural network based brain-inspired cognitive intelligence engine for brain-inspired AI and brain simulation," 2022, arXiv:2207.08533.
[31] S.-Q. Wang et al., "SIES: A novel implementation of spiking convolutional neural network inference engine on field-programmable gate array," J. Comput. Sci. Technol., vol. 35, no. 2, pp. 475-489, Mar. 2020.
[32] S. Guo et al., "A systolic SNN inference accelerator and its co-optimized software framework," in Proc. Great Lakes Symp. VLSI, May 2019, pp. 63-68.
[33] D. Neil and S.-C. Liu, "Minitaur, an event-driven FPGA-based spiking network accelerator," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 12, pp. 2621-2628, Dec. 2014.
[34] J. Han, Z. Li, W. Zheng, and Y. Zhang, "Hardware implementation of spiking neural networks on FPGA," Tsinghua Sci. Technol., vol. 25, no. 4, pp. 479-486, Aug. 2020.
[35] J. Zhang, H. Wu, J. Wei, S. Wei, and H. Chen, "An asynchronous reconfigurable SNN accelerator with event-driven time step update," in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2019, pp. 213-216.
[36] X. Ju, B. Fang, R. Yan, X. Xu, and H. Tang, "An FPGA implementation of deep spiking neural networks for low-power and fast classification," Neural Comput., vol. 32, no. 1, pp. 182-204, Jan. 2020.
[37] D. Gerlinghoff, Z. Wang, X. Gu, R. S. M. Goh, and T. Luo, "E3NE: An end-to-end framework for accelerating spiking neural networks with emerging neural encoding on FPGAs," IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 11, pp. 3207-3219, Nov. 2022.
[38] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," 2017, arXiv:1711.05101.
[39] BrainCog: Brain-Inspired Cognitive Intelligence Engine. Accessed: Jan. 8, 2023. [Online]. Available: http://www.brain-cog.network
[40] M. Dampfhoffer, T. Mesquida, A. Valentian, and L. Anghel, "Are SNNs really more energy-efficient than ANNs? An in-depth hardware-aware study," IEEE Trans. Emerg. Topics Comput. Intell., vol. 7, no. 3, pp. 1-11, Jun. 2023.