
arXiv:2308.03914v1 [cs.AR] 7 Aug 2023

© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained
for all other uses, in any current or future media, including reprinting/republishing this material for
advertising or promotional purposes, creating new collective works, for resale or redistribution to servers
or lists, or reuse of any copyrighted component of this work in other works.

This work has been accepted at the 2023 33rd International Conference on Field-Programmable Logic
and Applications (FPL) and will appear in the proceedings and on the IEEE website soon.
FPGA Processor In Memory Architectures (PIMs): Overlay or Overhaul?

MD Arafat Kabir∗, Ehsan Kabir∗, Joshua Hollis∗, Eli Levy-Mackay∗, Atiyehsadat Panahi†,
Jason Bakos‡, Miaoqing Huang∗ and David Andrews∗
Department of Computer Science and Computer Engineering
∗University of Arkansas, ‡University of South Carolina, †Cadence Design Systems
{makabir, ekabir, jrhollis, elevymac, apanahi, mqhuang, dandrews}@uark.edu, [email protected]

Abstract—The dominance of machine learning and the ending of Moore's law have renewed interest in Processor in Memory (PIM) architectures. This interest has produced several recent proposals to modify an FPGA's BRAM architecture to form a next-generation PIM reconfigurable fabric [1], [2]. PIM architectures can also be realized within today's FPGAs as overlays without the need to modify the underlying FPGA architecture. To date, there has been no study to understand the comparative advantages of the two approaches. In this paper, we present a study that explores the comparative advantages between two proposed custom architectures and a PIM overlay running on a commodity FPGA. We created PiCaSO, a Processor in/near Memory Scalable and Fast Overlay architecture, as a representative PIM overlay. The results of this study show that the PiCaSO overlay achieves up to 80% of the peak throughput of the custom designs with 2.56× shorter latency and 25% – 43% better BRAM memory utilization efficiency. We then show how several key features of the PiCaSO overlay can be integrated into the custom PIM designs to further improve their throughput by 18%, latency by 19.5%, and memory efficiency by 6.2%.

Index Terms—Processing-in-Memory, Bit-serial, Overlay, FPGA, Machine Learning, SIMD

I. INTRODUCTION

Convolutional Neural Networks (CNNs), Multilayer Perceptrons (MLPs), and Recurrent Neural Networks (RNNs) have emerged as the dominant machine learning approaches for today's application domains. Each of the three networks has different computation-to-communication requirements, or operational intensities, that necessitate different types of architectural support [3].

CNNs exhibit high operational intensities where end-to-end inference latencies are dominated by arithmetic compute times. Conversely, MLPs and RNNs exhibit much lower operational intensities where the end-to-end inference latencies are dominated by bus bandwidth and memory swapping times.

Processor in/near memory (PIM) architectures [4]–[7] are making a resurgence to address these types of network requirements. PIM architectures break the sequential von Neumann bottleneck by integrating bit-serial processors within memory. PIM architectures can leverage the continued trend in machine learning arithmetic towards lower precision. Less-than-full-precision operands can result in better utilization of limited memory, and the bit-serial processing elements (PEs) can provide better energy efficiency compared to full-precision PEs. PIM systems offer a theoretical peak performance limited only by the memory bandwidth.

The trend towards PIM architectures is inspiring new reconfigurable fabrics that integrate bit-serial arithmetic units into BRAM IP to form PIM tiles [1], [2], [8]–[15]. These architectures may represent the future but are not currently available. To fill the void, PIM architectures can be created as overlays in existing FPGAs. The fundamental question we explore in our work is: how close in performance can an overlay come to the performance being reported for next-generation PIM reconfigurable compute fabrics?

To explore this question we created PiCaSO, a Processor in/near Memory Scalable and Fast Overlay, as an open-source PIM overlay architecture [16]. We present performance comparisons that show PiCaSO achieves 80% of the peak throughput of these emerging proposed custom designs while delivering 2.56× shorter latency and 25% – 43% better BRAM memory utilization efficiency. This validates PiCaSO's ability to bring enhanced designer productivity to the design of FPGAs without the traditional performance sacrifices of an overlay.

Finally, we apply several PiCaSO design optimizations to the custom PIM designs to further improve their throughput by 18%, latency by 19.5%, and memory efficiency by 6.2%.

The specific contributions of this work are:

• A PIM overlay architecture that scales linearly with the BRAM capacity of a device, without sacrificing the clock frequency.
• A comparative study with a state-of-the-art PIM overlay showing improvements of clock speed by 2×, resource utilization by 2×, and accumulation latency by 17×.
• An improved version of an existing custom PIM design incorporating the novel features of the proposed overlay architecture.
• A comparative study between the proposed overlay and custom PIM designs analyzing the trade-offs and use cases of the overlay and custom designs.

PiCaSO is open-source and freely available at [16] for use, modification, and distribution without restriction.

This work is partially supported by the National Science Foundation under Grant No. 1955820.
Fig. 1. Proposed overlay architecture for processing in/near memory, PiCaSO: (a) Processing-in-Memory architecture (BRAM register file, OpMux, ALU, NEWS network, and Network Node); (b) ALU architecture (Op-Encoder and FA/S).

TABLE I
FULL ADDER/SUBTRACTOR (FA/S) OP-CODES

Op-Code | Output (SUM) | Description
ADD     | X + Y        | Acts as a Full-Adder (FA)
SUB     | X − Y        | Acts as an FA with borrow logic
CPX     | X            | Copies operand X unmodified
CPY     | Y            | Copies operand Y unmodified

TABLE II
OP-ENCODER CONFIGURATIONS FOR BOOTH'S RADIX-2 MULTIPLIER

Conf | YX | ALU Op-Code | Description
000  | xx | ADD         | Request ADD
001  | xx | CPX         | Select X operand
010  | xx | CPY         | Select Y operand
011  | xx | SUB         | Request SUB
1xx  | 00 | CPX         | NOP
1xx  | 01 | ADD         | +Y
1xx  | 10 | SUB         | −Y
1xx  | 11 | CPX         | NOP

II. RELATED WORK

PIM architectures are a growing area of research [8]–[15]. Building on earlier work such as Logic-In-Memory [6], Terasys [5], Shamrock [4], and Computational RAM [7], PIM architectures seek to break the classic von Neumann bottleneck by moving the processing closer to the data residing in memory in a Single Instruction Multiple Data (SIMD) architectural organization [17]–[24].

Recently, the reconfigurable computing community has been exploring modifying the internal circuitry of the on-chip BRAMs within a modern FPGA with bit-serial arithmetic and logic operations to form a PIM tile [1], [2]. Examples include RIMA (Reconfigurable In-Memory Accelerator) [2], which is built upon Neural Cache [11]. Their compute-capable BRAMs (CCB) enhanced the peak MAC throughput by factors of 1.6× and 2.3× for 8-bit integer and block floating point precision at a cost of a 7.4% increase in BRAM tile area [2]. Reported clock frequencies range from 250 MHz to 455 MHz on a Stratix 10 device. CCB requires simultaneous activation of multiple wordlines on a port and modifications to the voltage source for robustness.

CoMeFa [1] builds upon CCB, taking advantage of the dual-port nature of BRAMs. Two versions of PIM blocks were proposed. Optimized for delay, CoMeFa-D showed a 25.4% tile area increase due to the inclusion of 160 PEs, 120 sense amplifiers (SA), and write drivers. Optimized for area, CoMeFa-A showed an 8.1% increase in the BRAM tile area, mainly attributed to the addition of 40 PEs. The maximum clock frequency (735 MHz) dropped 1.25× to 588 MHz for CoMeFa-D. The clock frequency for CoMeFa-A dropped 2.5× to 294 MHz, to perform 4 reads and 2 writes in a single cycle.

PiCaSO is very synergistic with these efforts. We show how design optimizations developed for PiCaSO can be applied to these BRAM tile designs and potentially reclaim the clock frequency difference with the BRAM's supported maximum.

III. PICASO ARCHITECTURE

Fig. 1 shows the processor-in-memory architecture of PiCaSO. PiCaSO builds on the SPAR-2 PIM processor array reported in [25]–[27] but with the key modifications discussed below. Custom bit-serial PIM designs, including those reported in [1], [2], [26], stream operands between memory and ALUs across dedicated bitlines. Such an organization does not provide support for fast reduction operations (summation of product terms) between the PEs and instead requires explicit buffered transfer or copying of the product terms (for multiply-accumulate) between BRAM columns. PiCaSO enables zero-copy reduction operations with the operand-multiplexer (OpMux) shown in Fig. 1. The operand-multiplexer allows pass-through of bitlines from BRAMs to ALUs for multiplication but then supports zero-copy reduction summation of the product terms. The Network Node in Fig. 1 provides a streaming interface between PIM blocks, enabling the streaming of partial products into the ALU of the destination PE for summation, without intermediate copying. Section III-C presents how the reduction operation can be optimized by inserting pipeline stages that overlap data transfers with ALU operations, hiding the transfer latency.

A. Parallel to Serial Corner Turning

PiCaSO is a bit-serial array processor designed to work with standard processors. Parallel data read/written from DRAM and external I/O devices is corner-turned into bit-serial data and stored as a striped column within the BRAMs. This is a standard storage scheme for bit-serial ALUs similar to [1], [2], [26]. PiCaSO configures a BRAM to be 16 bits wide to concurrently feed bit-serial data to 16 ALUs [26]. In SPAR-2 [26], the benchmark overlay, the 16 PEs form a logical 4×4 PE Block. PiCaSO structurally organizes the PE-Block as a 1×16 linear array to optimize layout in the columnar architecture of Virtex FPGAs. This reduces routing complexity and wire delay, allowing a greater number of PEs to be synthesized into the FPGA and improving system clock speed.
B. Bit-Serial ALUs

Fig. 1(b) shows the architecture of the bit-serial ALU, consisting of a Full-ADD/SUB module (FA/S) and an op-code encoder. The FA/S implements the four operations in Table I. CPX and CPY support min/max pooling and other filter operations that require the selection of one of the two input operands. Op-Encoder provides an abstract interface for the FA/S module. Table II shows the encoding for Booth's Radix-2 multiplication algorithm.
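To make Tables I and II concrete, here is a minimal behavioral model in Python of one FA/S bit-slice and the op-encoder. The function names and the exact borrow handling for SUB (Y inverted, carry seeded with 1 by the caller) are our modeling choices, not identifiers from the PiCaSO RTL.

```python
def fa_s(op, x, y, carry):
    """One bit-slice of the FA/S (Table I), LSB-first serial operation.

    SUB is modeled as ADD of the inverted Y bit; the caller seeds the
    carry with 1 on the first SUB cycle (two's-complement borrow).
    """
    if op == "CPX":
        return x, 0
    if op == "CPY":
        return y, 0
    if op == "SUB":
        y ^= 1
    s = x ^ y ^ carry
    carry_out = (x & y) | (carry & (x ^ y))
    return s, carry_out

def op_encode(conf, yx):
    """Op-Encoder of Table II: map (Conf, YX bit pair) to an FA/S op-code."""
    if conf[0] == "0":                                  # direct requests
        return {"000": "ADD", "001": "CPX", "010": "CPY", "011": "SUB"}[conf]
    # Conf = 1xx: Booth radix-2 recoding of the current (Y, X) bit pair;
    # CPX acts as the NOP (result = X, unmodified).
    return {"00": "CPX", "01": "ADD", "10": "SUB", "11": "CPX"}[yx]

# Serial ADD of 5 + 3 over 4 cycles, LSB first.
carry, bits = 0, []
for i in range(4):
    s, carry = fa_s("ADD", (5 >> i) & 1, (3 >> i) & 1, carry)
    bits.append(s)
assert bits == [0, 0, 0, 1]                             # 0b1000 == 8
```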

C. Supporting Reduction Operations

The operand-multiplexer (OpMux) provides a data path for reduction operations between the PEs in a PE-Block without having to copy the operands between bitlines. This is achieved using a folding technique. Fig. 2 shows two types of folding patterns for a PE row with 8 columns enabled by the OpMux module. In pattern (a), after adding an operand with its fold-1 pattern, PEs 0, 1, 2, and 3 contain the summations of 0 & 4, 1 & 5, 2 & 6, and 3 & 7, respectively. In pattern (b), after adding an operand with its fold-1 pattern, PEs 0, 2, 4, and 6 contain the summations of 0 & 1, 2 & 3, 4 & 5, and 6 & 7. In both cases, after applying fold-1, fold-2, and fold-3 in that order, the accumulation result will be stored in PE-0. Fold-1 of pattern (b) can be especially useful in CNN models, where each PE needs access to its adjacent PEs. A similar type of folding scheme can be realized using multiplexers at the output of SAs in custom PIM blocks. Results presented in Section V show the potential reduction in accumulation latency for the custom designs provided with this optimization.

Fig. 2. Folding patterns in Operand Multiplexer (OpMux). [figure: fold-1, fold-2, and fold-3 steps of patterns (a) and (b) over an 8-column PE row]

TABLE III
CONFIGURATIONS OF OPERAND MULTIPLEXER

Config Code | X | Y            | Description
A-OP-B      | A | B            | Used in standard operations
A-FOLD-1    | A | {0, A[H2]}   | A[H2]: second half of A
A-FOLD-2    | A | {0, A[Q2]}   | A[Q2]: second quarter of A
A-FOLD-3    | A | {0, A[HQ2]}  | A[HQ2]: second half-quarter of A
A-FOLD-4    | A | {0, A[HHQ2]} | A[HHQ2]: second half of A[HQ1]¹
A-OP-NET    | A | NET          | Operates on network stream
0-OP-B      | 0 | B            | Used in the first iteration of MULT

¹ A[HQ1]: first half-quarter of A

Table III shows the configurations currently supported by the OpMux module. Configuration A-OP-B connects ports A to X and B to Y and is used in element-wise operations. A-FOLD-x implements folding patterns similar to Fig. 2(a). A-OP-NET directly feeds the network stream into the ALU. 0-OP-B is used as the initialization step in Booth's multiplication.
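A behavioral sketch (ours) of the pattern-(a) folding described above; it reduces an 8-element PE row to PE-0 in log2 8 = 3 fold steps.

```python
def fold_reduce(row):
    """Pattern-(a) folding of Fig. 2, modeled as in-place adds.

    Each step adds the upper half of the active region onto the lower
    half; in hardware the OpMux supplies the upper-half operand on the
    Y port, so no copy between bitlines is needed.  After
    log2(len(row)) steps the row sum sits in PE 0.
    """
    active = len(row)                  # power of two, e.g. 8
    while active > 1:
        half = active // 2
        for pe in range(half):         # all PEs step in SIMD lock-step
            row[pe] += row[pe + half]
        active = half
    return row[0]

assert fold_reduce(list(range(8))) == 28   # 0 + 1 + ... + 7
```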
D. Network Architecture

Fig. 3(c) expands the Network Node shown in Fig. 1. Fig. 3(a) shows the PE-Blocks (PB) connected to the data network through the network module (N). Fig. 3(b) illustrates data reduction patterns between 8 nodes. Each node can be configured as a transmitter (T), receiver (R), or pass-through (P) based on a level (L) parameter and its position in the array. Fig. 3(b) shows that level 0 logically connects even nodes as receivers with their right neighbors as transmitters between columns. For level 1, the middle node of every 3 consecutive nodes acts as a pass-through, effectively connecting its neighbors. Similarly, level 2 connects node-4 to node-0. During accumulation, bits of the operand in the transmitter hop through P-nodes to reach the receiver ALU, where they are added (serially) to the operand in the receiver. After levels 0, 1, and 2, PE 0 contains the accumulation result of an entire row in the array.

Fig. 3. Data network for fast accumulation and reduction operations: (a) Network Architecture; (b) Jump over PE-Blocks (node roles R/T/P at levels L = 0, 1, 2); (c) Network node (N) architecture for hopping.
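The role assignment of Fig. 3(b) can be stated compactly: at level L, receivers sit at multiples of 2^(L+1), transmitters 2^L columns to their right, and the remaining nodes pass through. The following sketch (our naming) reproduces the three rows of the figure.

```python
def node_role(idx, level):
    """Node role at a given reduction level, per Fig. 3(b)."""
    stride = 1 << (level + 1)
    if idx % stride == 0:
        return "R"                     # receiver: serially adds incoming bits
    if idx % stride == stride // 2:
        return "T"                     # transmitter: streams its operand out
    return "P"                         # pass-through hop toward the receiver

# L=0: R T R T R T R T;  L=1: R P T P R P T P;  L=2: R P P P T P P P
for level in range(3):
    print(level, " ".join(node_role(i, level) for i in range(8)))
```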
E. Pipelining Options for PIM-Blocks

The dashed registers in Fig. 1(a) show three potential points for pipelining the PIM Block: register file output, OpMux output, and ALU output. The Single-Cycle configuration has no pipeline stages and is equivalent to the custom BRAM designs [1], [2] and the benchmark overlay [26]. PiCaSO can be configured in different pipeline configurations based on network requirements and choice of FPGA. RF-Pipe inserts a pipeline stage at the register file output to hide the read latencies of the BRAM. Op-Pipe inserts a pipeline stage at the OpMux output to hide long wire delays through the network. Full-Pipe, referred to as PiCaSO-F, enables all three pipeline stages as shown in Fig. 1(a).

IV. ANALYSIS

A. Performance and Utilization

Table IV compares the pipeline configurations outlined in Subsection III-E against SPAR-2, the benchmark overlay from [26]. All designs were implemented and run on Virtex-7 (xc7vx485) and Alveo U55 FPGAs. Utilization numbers follow the tile definition in SPAR-2, consisting of 256 PEs organized in a 4×4 array of PE blocks, with 16 PEs in each block. The total utilization per tile and the average utilization per block are shown. The Full-Pipe configuration achieved a 2.25× and a 1.67× increase in clock frequency compared to the benchmark design on Virtex-7 and U55 devices, respectively. In both devices, Full-Pipe provided a 2× improvement in resource utilization over SPAR-2.
TABLE IV
COMPARISON BETWEEN TILES OF 4×4 PE-BLOCKS OF DIFFERENT OVERLAY CONFIGURATIONS
(resource entries are Tile / Block)

Virtex-7  | Benchmark [26] | Full-Pipe  | Single-Cycle | RF-Pipe   | Op-Pipe
LUT       | 3023 / 189     | 835 / 52   | 895 / 56     | 1017 / 64 | 836 / 52
FF        | 1024 / 64      | 1799 / 112 | 1031 / 64    | 1543 / 96 | 1543 / 96
Slice     | 1056 / 66      | 522 / 33   | 395 / 25     | 451 / 28  | 472 / 30
Max-Freq  | 240 MHz        | 540 MHz    | 245 MHz      | 360 MHz   | 370 MHz

Alveo U55 | Benchmark [26] | Full-Pipe  | Single-Cycle | RF-Pipe   | Op-Pipe
LUT       | 2449 / 153     | 774 / 48   | 1068 / 67    | 1064 / 67 | 774 / 48
FF        | 768 / 48       | 1799 / 112 | 1031 / 64    | 1527 / 95 | 1543 / 96
Slice     | 556 / 35       | 243 / 15   | 223 / 14     | 243 / 15  | 295 / 18
Max-Freq  | 445 MHz        | 737 MHz    | 487 MHz      | 600 MHz   | 620 MHz
The Single-Cycle configuration achieved similar performance on the Virtex-7 and better performance on the U55 compared to the benchmark system, with 2.6× and 2.5× utilization improvements, respectively. It had a smaller flip-flop count and slice utilization compared to the Full-Pipe due to the absence of the pipeline registers. Both RF-Pipe and Op-Pipe achieved better clock speeds but with an increase in slice utilization compared to Single-Cycle, due to the addition of the pipeline stages. As argued in Subsection III-E, Op-Pipe had better performance compared with RF-Pipe by minimizing the clock latency contributed by the network. All configurations offered at least 2× better utilization and up to 2× better performance compared to the benchmark design.

Table IV shows Full-Pipe achieved clock frequencies of 540 MHz on the Virtex-7 (xc7vx485-2) and 737 MHz on the Alveo U55 (xcu55c, -2 speed grade). The data sheets for these devices list 543.77 MHz and 737 MHz, respectively, as the maximum BRAM clock frequencies. Surprisingly, this is an improvement over the custom designs reported in [1], [2]. The technology node of the U55 (16 nm) is comparable to that of the designs proposed in CCB (Stratix 10, 14 nm) and CoMeFa (Arria 10, 20 nm). Yet, PiCaSO runs 1.62× and 1.25× faster than the fastest configurations of CCB and CoMeFa, respectively. This is due to the pipelined architecture of PiCaSO, where the slowest stage is the BRAM. Thus, it can run as fast as the maximum frequency of the BRAM.

B. Reduction Network

Both PiCaSO and SPAR-2 [26] use Booth's Radix-2 algorithm for multiplication. Thus, the cycle latencies for the ADD/SUB and MULT operations in Table V are identical. SPAR-2 uses a standard NEWS network to copy operands between PEs when summing the partial products during multiply-accumulate (MAC) operations. The Accumulation row compares the number of clock cycles for SPAR-2's NEWS network and PiCaSO's reduction network. The last row in Table V shows the PiCaSO-F reduction network provides a 17× improvement in accumulation latency for the test configuration reported in [25]. This improvement is due to the careful design of the binary-hopping network discussed in Section III-D, which overlaps data transfer with computation during accumulation.

TABLE V
CYCLE LATENCY OF DIFFERENT OPERATIONS

Operation        | Benchmark [26]      | PiCaSO-F
ADD/SUB          | 2N                  | 2N
MULT¹            | 2N² + 2N            | 2N² + 2N
Accumulation²    | (q − 1 + 2 log2 q)N | 15 + q/16 + 4N + (N + 4)J
q = 128, N = 32  | 4512                | 259

¹ Booth's Radix-2 multiplication
² q: number of columns to be accumulated; N: operand width; J: number of network jumps needed = log2(q/16)
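As a sanity check on Table V, the two accumulation models can be evaluated directly for the test configuration; the function names below are ours.

```python
import math

def accum_cycles_benchmark(q, n):
    """SPAR-2 NEWS-network accumulation latency (Table V)."""
    return (q - 1 + 2 * math.log2(q)) * n

def accum_cycles_picaso_f(q, n):
    """PiCaSO-F reduction-network accumulation latency (Table V)."""
    j = math.log2(q / 16)              # J: jumps between PE-Blocks
    return 15 + q / 16 + 4 * n + (n + 4) * j

q, n = 128, 32
print(accum_cycles_benchmark(q, n))    # 4512.0
print(accum_cycles_picaso_f(q, n))     # 259.0 -> ~17x fewer cycles
```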
C. Scalability

A primary design goal for PiCaSO was to make it scale linearly with the BRAM capacity of any FPGA. To evaluate scalability, the largest-sized array of PIM blocks that could fit into the target devices was constructed. The results of this study are shown in Table VI.

TABLE VI
COMPARISON OF LARGEST OVERLAY ARRAYS IN VIRTEX DEVICES

                | V7 Benchmark [26] | V7 PiCaSO-F | U55 Benchmark [26] | U55 PiCaSO-F
Max-Size        | 24K               | 33K         | 63K                | 64K
LUT             | 74.6%             | 32.5%       | 41.6%              | 14.8%
FF              | 16.0%             | 38.0%       | 9.7%               | 17.3%
BRAM            | 73.8%             | 99.9%       | 98.4%              | 100.0%
Uniq. Ctrl. Set | 32.1%             | 2.1%        | 19.5%              | 0.8%
Slice           | 86.0%             | 76.4%       | 63.4%              | 32.0%

In the Virtex-7 FPGA, the largest array of SPAR-2 [26] PIM blocks contained 24K PEs. This did not achieve the full capacity of the Slice or BRAM resources available in that device. The implementation tool failed at the placement step for larger arrays due to a high utilization (32.1%) of unique control sets. Control sets are the collection of control signals for slice flip-flops. Flip-flops must belong to the same control set to be packed into the same slice. A large number of unique control sets makes it difficult to find a valid placement, even with enough available slices. In contrast, PiCaSO-F fully utilized the BRAM resources to fit 33K PEs, a 37.5% improvement over SPAR-2 in the same device. PiCaSO does not suffer from the placement issues observed in SPAR-2 due to a very low (2.1%) utilization of unique control sets.

In the U55 FPGA, SPAR-2 almost achieved the full BRAM capacity for an array size of 63K PEs. This is due to the U55 FPGA offering significantly more slices and routing resources compared to the Virtex-7 FPGA. PiCaSO achieved 100% utilization of BRAM with 2× better slice utilization over SPAR-2.

Our results showed that the scalability of the benchmark design, SPAR-2, is dependent on the Slice-to-BRAM ratio and cannot guarantee the creation of a PIM array that scales with the BRAM capacity. Conversely, our results showed PiCaSO scaling with the BRAM capacity independent of the Slice-to-BRAM ratio across multiple devices of the Virtex-7 and Ultrascale+ FPGA families. Table VII lists representative devices we evaluated based on the following two criteria: BRAM capacity and LUT-to-BRAM ratio. Each device is assigned an ID as a short name to be used in this paper.
TABLE VII
REPRESENTATIVE VIRTEX-7 AND ULTRASCALE+ DEVICES

Device          | Tech | BRAM# | Ratio¹ | Max PE#² | ID
xc7vx330tffg-2  | V7   | 750   | 272    | 24K      | V7-a
xc7vx485tffg-2  | V7   | 1030  | 295    | 32K      | V7-b
xc7v2000tfhg-2  | V7   | 1292  | 946    | 41K      | V7-c
xc7vx1140tflg-2 | V7   | 1880  | 379    | 60K      | V7-d
xcvu3p-ffvc-3   | US+  | 720   | 547    | 23K      | US-a
xcvu23p-vsva-3  | US+  | 2112  | 488    | 67K      | US-b
xcvu19p-fsvb-2  | US+  | 2160  | 1892   | 69K      | US-c
xcvu29p-figd-3  | US+  | 2688  | 643    | 86K      | US-d

¹ LUT-to-BRAM ratio
² Maximum number of PEs if all BRAMs are utilized

Fig. 4. Scalability study on Virtex-7 and Ultrascale+ FPGA families. [bar chart: BRAM, LUT, FF, Ctrl. Set, and Slice utilization (%) for devices V7-a through US-d]

Fig. 4 shows that PiCaSO utilized the full BRAM capacity in all devices and achieved the maximum number of PEs the device can fit based on BRAM density. Results showed that for the smallest device (V7-a), with the lowest LUT-to-BRAM ratio, the LUT and flip-flop utilization is around 40%. For one of the largest devices with a high LUT-to-BRAM ratio (US-c), these utilization numbers are negligible, around 5%. These results strongly support that PiCaSO scales linearly with the BRAM capacity of the device.

V. COMPARISON WITH CUSTOM DESIGNS

Fig. 5 shows the relative MAC latency of the custom designs with respect to PiCaSO. The latency is computed for 16 parallel MULTs followed by the accumulation of the products. The clock speeds of the custom designs are adjusted based on the performance degradations reported in [1], [2]. With the exception of CoMeFa-D at 16-bit precision, PiCaSO has the shortest latency due to faster clock speed and accumulation. CCB and CoMeFa extend the clock period to allow a complete read-modify-write per clock cycle. This allows a complete MULT to finish in half the number of cycles compared to PiCaSO and can reduce latencies at higher precisions. Still, PiCaSO runs 1.72× – 2.56× faster than CoMeFa-A, which is reported as the most practical design in [1].

Fig. 5. Relative MAC latency of custom designs w.r.t. PiCaSO. [bar chart: 4-bit, 8-bit, and 16-bit latency ratios for CCB, CoMeFa-D, CoMeFa-A, PiCaSO-F, A-Mod, and D-Mod]

Peak TeraMAC/sec throughputs on the U55 FPGA are shown in Fig. 6. CCB and CoMeFa design the BRAM IP to support one PE per bitline. With a column muxing factor of 4 [1], a Virtex 36Kb BRAM would be redesigned as a 256×144 array with 144 PEs per BRAM. The use of standard BRAM IP prevents PiCaSO (and all overlays) from making this modification. Yet PiCaSO still achieves 75% – 80% of the peak throughput of CoMeFa-A, the more practical of the two CoMeFa designs. This results from PiCaSO not suffering the clock speed degradation seen in all of the custom designs.

Fig. 6. Peak MAC throughput (TMAC/sec) of PiCaSO and custom designs on Alveo U55. [bar chart: 4-bit, 8-bit, and 16-bit throughput for CCB, CoMeFa-D, CoMeFa-A, PiCaSO-F, A-Mod, and D-Mod; legend: BRAM, DSP, LUT]

The memory utilization efficiency of BRAMs is not discussed in [1], [2], but we feel it is an important metric for PIM architectures. Memory utilization efficiency can be defined as the fraction of BRAM memory that can be used to store model weights. Both CCB and CoMeFa follow the computation techniques used in [11], which require scratchpad memory. For N-bit operands, CCB requires 8N reserved wordlines. CoMeFa only needs 5N wordlines using the "One Operand Outside RAM (OOOR)" technique. PiCaSO requires only 4N wordlines, as it does not require copying operands to the same bitline as in CoMeFa. In the widest mode of a Virtex 36Kb BRAM, each PE of CCB and CoMeFa would have 256 bits of storage in its register file (bitline). For PiCaSO, each register file has 1024 bits. Fig. 7 shows the memory utilization efficiency of these architectures. As observed, at higher precisions the memory efficiency drops significantly for CCB and CoMeFa. For 16-bit operands, CCB and CoMeFa have only 50% and 68.8% efficiencies, respectively, while PiCaSO has 93.8% efficiency.

Fig. 7. BRAM memory utilization efficiency on Virtex devices. [chart: efficiency (%) vs. operand precision from 2-bit to 16-bit for CCB, CoMeFa, PiCaSO, and CoMeFa-Mod]
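The quoted efficiencies follow directly from the reserved-wordline counts and register-file sizes above; the small sketch below (our code) reproduces the 16-bit numbers.

```python
def mem_efficiency(regfile_bits, wordlines_per_operand_bit, n_bits):
    """Fraction of a PE's register file left free for model weights."""
    reserved = wordlines_per_operand_bit * n_bits
    return 1 - reserved / regfile_bits

n = 16
print(f"CCB:    {mem_efficiency(256,  8, n):.1%}")   # 50.0%
print(f"CoMeFa: {mem_efficiency(256,  5, n):.1%}")   # 68.8%
print(f"PiCaSO: {mem_efficiency(1024, 4, n):.1%}")   # 93.8%
```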
TABLE VIII
COMPARISON WITH CUSTOMIZED BRAM PIM ARCHITECTURES

                | CCB  | CoMeFa-D | CoMeFa-A | PiCaSO-F  | A-Mod
Architecture    | Custom | Custom | Custom   | Overlay   | Custom
Clock Overhead  | 60%  | 25%      | 150%     | 0%        | 150%
Parallel MACs   | 144  | 144      | 144      | 36        | 144
Mult Latency¹   | (a)  | (a)      | (a)      | (b)       | (a)
  N = 8         | 86   | 86       | 86       | 144       | 86
Accum. Latency² | (c)  | (c)      | (c)      | (d)       | (e)
  q = 16, N = 8 | 80   | 80       | 80       | 48        | 40
Support Booth's | No   | Partial  | Partial  | Yes       | Yes
Mem. Efficiency | Low  | Medium   | Medium   | High      | Medium
Complexity      | High | Medium   | Medium   | None      | Medium
Practicality    | Low  | Medium   | High     | Very High | High

¹ (a) N² + 3N − 2; (b) 2N² + 2N
² (c) (2N + log2 q) log2 q; (d) (N + 4) log2 q; (e) (N + 2) log2 q

Table VIII summarizes the comparisons between PiCaSO and the custom designs. The custom designs significantly degrade the BRAM's maximum clock frequency, whereas PiCaSO runs at the maximum clock speed of the BRAM. However, PiCaSO has 1/4th the number of parallel MACs, as it cannot access all the bitlines. Multiplication in PiCaSO is 2× slower, as it requires 2 cycles to process a single bit. However, accumulation is 2× faster in PiCaSO. PiCaSO supports Booth's radix-2 multiplication. In Booth's algorithm, half of the intermediate steps are NOPs on average. Thus, PiCaSO can potentially further reduce the multiplication latency by 50% on average. CoMeFa can use Booth's algorithm only in OOOR mode, and CCB does not support it at all.
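The 50% NOP figure follows from Booth radix-2 recoding, which maps each (y_i, y_{i−1}) bit pair to {NOP, +X, −X}; for uniformly random multiplier bits, the 00 and 11 pairs (both NOPs) occur half the time. A quick simulation (ours) illustrates this:

```python
import random

def booth_radix2_ops(y, n_bits):
    """Booth radix-2 recoding: one {NOP, ADD, SUB} step per bit of y."""
    ops, prev = [], 0                       # implicit y[-1] = 0
    for i in range(n_bits):
        cur = (y >> i) & 1
        ops.append({(0, 0): "NOP", (1, 0): "SUB",
                    (0, 1): "ADD", (1, 1): "NOP"}[(cur, prev)])
        prev = cur
    return ops

random.seed(0)
steps = [op for _ in range(1000)
         for op in booth_radix2_ops(random.getrandbits(16), 16)]
print(steps.count("NOP") / len(steps))      # ~0.5: half the steps skip the adder
```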
In terms of memory utilization efficiency, CCB is significantly low, PiCaSO is high, and CoMeFa lies in between. CCB has the highest design complexity, mainly due to its need for a modified voltage supply. CoMeFa has medium complexity since it requires modifications to the SAs, additional flip-flops, and SA cycling. Being an overlay, PiCaSO does not have such design complexities. As reported in [1], the practicality of CCB is low, CoMeFa-D is medium, and CoMeFa-A is high. By the same measure, the practicality of PiCaSO is very high. It offers 80% of CoMeFa-A's peak throughput with 2.56× shorter latency and 25% better memory efficiency, can be implemented using off-the-shelf FPGAs, and is tested on real devices, while the CCB and CoMeFa numbers are mainly based on simulations.

A. Fusing PiCaSO Optimizations into Custom Designs

Fig. 8 shows how modifications highlighted in red can accelerate CoMeFa-A [1]. We refer to this implementation as A-Mod. PiCaSO's OpMux module per bitline consists of a 2-to-1 mux and a 4-to-1 mux. This can be implemented using a few CMOS pass transistors. OpMux then saves both the cycles and memory needed to copy operands during accumulation [2], [11]. PiCaSO's network module can overlap data movement with computation between different PIM blocks. The network module can be embedded within the PIM block or can be implemented using logic slices from the FPGA. A single-bit port connection to the network module is enough to support row-wise accumulation.

Fig. 8. Modified CoMeFa-A [1] with PiCaSO adoption (A-Mod). [circuit diagram: dual-port BRAM bitline with sense amplifiers (SA), write drivers (WD), and PE, plus the added Op-Mux and Network Module (Embed/LB) with NEWS ports; optional pipeline flip-flops shown dashed]

Although [1] mentions that the PE does not add additional delay to the extended clock, in a practical circuit there will always be some additional delay. This delay can be hidden using one of the pipelining schemes of PiCaSO. A single stage of registers could be enough to hide the PE delay. As BRAM blocks already contain output registers, this should not add any area overhead on top of what is reported in [1]. The PE circuit can be placed between two stages of registers if the delay is too long. This is illustrated by the dashed flip-flops in Fig. 8. Similar modifications can be performed on CoMeFa-D, referred to as implementation D-Mod.

These modifications can significantly improve the performance of the custom designs. The extrapolated performance numbers for A-Mod and D-Mod are presented in Fig. 5 and Fig. 6. As observed in Fig. 5, the adoption of PiCaSO's OpMux and network modules can improve their MAC latency by 13.4% – 19.5% due to faster accumulation. This consequently improves their throughput by 5% – 18% over different precisions. In Fig. 7, CoMeFa-Mod represents both the A-Mod and D-Mod implementations. Due to the OpMux, A-Mod and D-Mod no longer require scratchpad storage to copy operands for accumulation. This improves their memory utilization efficiency by 6.2%. This means that at 4-bit precision, 1.6 million more weights can be stored in a device with 100 Mb of BRAM. This would significantly reduce weight stall cycles [3] and allow bigger models to be stored on chip. In Table VIII, the A-Mod column shows the architectural enhancements due to these modifications. A-Mod retains the high parallelism and fast Mult latency of the original CoMeFa design and offers 2× faster accumulation and full support for Booth's algorithm.

VI. CONCLUSIONS

This paper presented PiCaSO, an open-source, scalable, and portable Processor in Memory (PIM) overlay architecture. As an overlay, PiCaSO brings software levels of productivity to the design of FPGA machine-learning accelerators across AMD devices. The PIM architecture addresses the needs of machine learning and big data analytic applications that are memory intensive.

A scalability study was presented that established that PiCaSO scales linearly with the BRAM capacity across a range of devices with varying LUT-to-BRAM ratios. Analysis on Virtex-7 and Ultrascale+ devices showed PiCaSO runs as fast as the BRAM maximum frequency. Comparisons against SPAR-2, a state-of-the-art SIMD array processor overlay, showed improvements in slice utilization and achievable clock frequency by 2× and accumulation latency reduction by 17×.

Comparative analysis against custom designs showed PiCaSO achieves up to 80% of the peak throughput with up to 2.56× shorter latency and 25% – 43% better memory utilization.

We showed that the proposed architecture can be adopted into custom PIM designs and can improve their throughput by 18%, latency by 19.5%, and memory utilization by 6.2%.

Our future efforts are focused on automating and applying application-specific and logic-family customizations to the generation of both PiCaSO-based accelerators and compiler-generated executables.
REFERENCES

[1] A. Arora, T. Anand, A. Borda, R. Sehgal, B. Hanindhito, J. Kulkarni, and L. K. John, "CoMeFa: Compute-in-Memory Blocks for FPGAs," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2022, pp. 1–9.
[2] X. Wang, V. Goyal, J. Yu, V. Bertacco, A. Boutros, E. Nurvitadhi, C. Augustine, R. Iyer, and R. Das, "Compute-Capable Block RAMs for Efficient Deep Learning Acceleration on FPGAs," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2021, pp. 88–96.
[3] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
[4] P. M. Kogge, J. B. Brockman, T. L. Sterling, and G. R. Gao, "Processing In Memory: Chips to Petaflops," in International Symposium on Computer Architecture, vol. 97. Citeseer, 1997.
[5] M. Gokhale, B. Holmes, and K. Iobst, "Processing in memory: the Terasys massively parallel PIM array," Computer, vol. 28, no. 4, pp. 23–31, 1995.
[6] H. S. Stone, "A Logic-in-Memory Computer," IEEE Transactions on Computers, vol. C-19, no. 1, pp. 73–78, 1970.
[7] D. G. Elliott, W. M. Snelgrove, and M. Stumm, "Computational RAM: A memory-SIMD hybrid and its application to DSP," in 1992 Proceedings of the IEEE Custom Integrated Circuits Conference. IEEE, 1992, pp. 30–6.
[8] T. Finkbeiner, G. Hush, T. Larsen, P. Lea, J. Leidel, and T. Manning, "In-Memory Intelligence," IEEE Micro, vol. 37, no. 4, pp. 30–38, 2017.
[9] S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 43–56.
[10] Y.-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, O. Seongil, H.-S. Yu, H. Lee, S. Y. Kim, Y. Cho, J. G. Kim, J. Choi, H.-S. Shin, J. Kim, B. Phuah, H. Kim, M. J. Song, A. Choi, D. Kim, S. Kim, E.-B. Kim, D. Wang, S. Kang, Y. Ro, S. Seo, J. Song, J. Youn, K. Sohn, and N. S. Kim, "25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 350–352.
[11] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, "Neural Cache: Bit-Serial in-Cache Acceleration of Deep Neural Networks," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 383–396.
[12] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, "PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture," ACM SIGARCH Computer Architecture News, vol. 43, no. 3S, pp. 336–348, 2015.
[13] N. S. Kim, D. Chen, J. Xiong, and W.-m. W. Hwu, "Heterogeneous computing meets near-memory acceleration and high-level synthesis in the post-Moore era," IEEE Micro, vol. 37, no. 4, pp. 10–18, 2017.
[14] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision," in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 802–815.
[15] F. Gao, G. Tziantzioulis, and D. Wentzlaff, "ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 100–113.
[16] M. A. Kabir, E. Kabir, J. Hollis, E. Levy-Mackay, A. Panahi, J. Bakos, M. Huang, and D. Andrews, "PiCaSO: A Scalable and Fast PIM Overlay." [Online]. Available: https://github.com/Arafat-Kabir/PiCaSO
[17] A. Landy and G. Stitt, "Serial Arithmetic Strategies for Improving FPGA Throughput," ACM Transactions on Embedded Computing Systems (TECS), vol. 16, no. 3, pp. 1–25, Jul. 2017.
[18] D. J. M. Moss, D. Boland, and P. H. W. Leong, "A Two-Speed, Radix-4, Serial-Parallel Multiplier," IEEE Trans. Very Large Scale Integr. Syst., vol. 27, no. 4, pp. 769–777, 2019.
[19] G. Csordás, B. Fehér, and T. Kovácsházy, "Application of bit-serial arithmetic units for FPGA implementation of convolutional neural networks," in 2018 19th International Carpathian Control Conference (ICCC), 2018, pp. 322–327.
[20] A. Landy and G. Stitt, "Revisiting Serial Arithmetic: A Performance and Tradeoff Analysis for Parallel Applications on Modern FPGAs," in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, 2015, pp. 9–16.
[21] D. Walsh and P. Dudek, "A compact FPGA implementation of a bit-serial SIMD cellular processor array," in 2012 13th International Workshop on Cellular Nanoscale Networks and their Applications, 2012, pp. 1–6.
[22] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017, pp. 65–74.
[23] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
[24] P. Colangelo, N. Nasiri, E. Nurvitadhi, A. Mishra, M. Margala, and K. Nealis, "Exploration of low numeric precision deep learning inference using Intel® FPGAs," in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018, pp. 73–80.
[25] S. Basalama, A. Panahi, A.-T. Ishimwe, and D. Andrews, "SPAR-2: A SIMD Processor Array for Machine Learning in IoT Devices," in 2020 3rd International Conference on Data Intelligence and Security (ICDIS). IEEE, 2020, pp. 141–147.
[26] A. Panahi, S. Balsalama, A.-T. Ishimwe, J. M. Mbongue, and D. Andrews, "A Customizable Domain-Specific Memory-Centric FPGA Overlay for Machine Learning Applications," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), Aug. 2021, pp. 24–27.
[27] A. Panahi, E. Kabir, A. Downey, D. Andrews, M. Huang, and J. D. Bakos, "High-rate machine learning for forecasting time-series signals," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2022, pp. 1–9.
