14.2 A Compute SRAM with Bit-Serial Integer/Floating-Point Operations for Programmable In-Memory Vector Acceleration

Jingcheng Wang, Xiaowei Wang, Charles Eckert, Arun Subramaniyan, Reetuparna Das, David Blaauw, Dennis Sylvester

University of Michigan, Ann Arbor, MI

Data movement and memory bandwidth are dominant factors in the energy and performance of both general-purpose CPUs and GPUs. This has led to extensive research focused on in-memory computing, which moves computation to where the data is located. With this approach, computation is often performed on the memory bit-lines in the analog domain using current summing [1-3], which requires expensive analog-to-digital and digital-to-analog conversions at the array boundary. In addition, such analog computation is very sensitive to PVT variations, limiting precision. More recently, full-rail (digital) binary in-memory computing was proposed to avoid this conversion overhead and improve robustness [4, 5]. However, both prior in-memory approaches suffer from the same major limitations: they accelerate only one type of algorithm and are inherently restricted to a very specific application domain due to their limited and fixed bit-width precision and non-programmable architecture. Software algorithms, on the other hand, continue to evolve rapidly, especially in novel application domains such as neural networks, vision, and graph processing, making rigid accelerators of limited use. Furthermore, most available SRAM in today's chips is located in the caches of CPUs or GPUs. These large CPU and GPU SRAM stores present an opportunity for extensive in-memory computing and have, to date, remained largely untapped.

In this paper, we present a general-purpose hybrid in-/near-memory compute SRAM (CRAM) that combines the efficiency of in-memory computation with the flexibility and programmability necessary for evolving software algorithms. CRAM augments conventional SRAM in a CPU with vector-based, bit-serial [6, 7] in-memory arithmetic. It can accommodate a wide range of bit-widths, from a single bit to 32b or 64b, and operation types, including integer and floating-point addition, multiplication, and division. To maintain compatibility with CPU/GPU operation, CRAM writes/reads operands conventionally with horizontal word-lines and vertical bit-lines. Then, using a transposable bitcell [8], CRAM operates directly on the stored operands in memory with additional horizontal compute bit-lines. This enables the same bit position of two vector elements to be accessed simultaneously on a single bit-line. Logic operations are performed on the bit-line (in-memory), while small additional in-column logic (near-memory, with 4.5% SRAM bank area overhead) enables carry propagation between successive bit-serial calculations, enabling multi-bit arithmetic operations in SIMD fashion across all vector elements. To maintain versatility, the memories can function either as traditional or compute memories. The approach was implemented in a small IoT processor in 28nm CMOS, consisting of a Cortex-M0 CPU and 8 CRAM banks of 16KB each (128KB total). The system achieves 475MHz operation and, with all CRAMs active, produces 30GOPS or 1.4GFLOPS on 32b operands for graph, neural, and DSP applications.
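
To make the transposed, bit-serial data layout concrete, here is a minimal Python sketch (ours, not from the paper; the dimensions and helper names are illustrative). Each row of the bit matrix stands for one compute bit-line holding one vector element, and each column for one bit position, so a single column access touches the same bit of every element in parallel.

```python
import numpy as np

ROWS = 256   # compute bit-lines (vector elements) in one bank; illustrative
COLS = 256   # bit positions addressable by compute word-lines; illustrative

# A compute-configured bank modeled as a bit matrix:
# row r holds vector element r, column c holds bit c of that element.
bank = np.zeros((ROWS, COLS), dtype=np.uint8)

def store_vector(bank, base_col, values, width):
    """Write a vector in transposed form: bit i of every element lands in
    column base_col + i (LSB first)."""
    for r, v in enumerate(values):
        for i in range(width):
            bank[r, base_col + i] = (v >> i) & 1

def read_column(bank, col):
    """One compute word-line access: bit `col` of ALL elements at once."""
    return bank[:, col]

# Example: 8b operand vectors A and B stored at column bases 0 and 8.
A = np.random.randint(0, 256, ROWS)
B = np.random.randint(0, 256, ROWS)
store_vector(bank, 0, A, 8)
store_vector(bank, 8, B, 8)

# Reading columns 0 and 8 yields the LSBs of every A and B element, which is
# exactly the pair of bit-vectors one bit-serial compute cycle operates on.
assert np.array_equal(read_column(bank, 0), A & 1)
assert np.array_equal(read_column(bank, 8), B & 1)
```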

Figure 14.2.1 shows the overall organization of the IoT processor. The ARM core can access all 8 memory banks and load/store data using the horizontal word-lines and vertical bit-lines. Then, in-memory instructions can be streamed from one bank to one or more compute-configured banks, while the M0 simultaneously performs other processing with the remaining memory banks. Banks performing in-memory computing use the horizontal compute bit-lines (CBLs) and vertical compute word-lines (CWLs).

Figure 14.2.2 shows the architecture of the 128×256 CRAM sub-array, which is one quarter of a 16KB CRAM macro. An 8T transposable bitcell is used to provide bidirectional access. Fig. 14.2.2 also shows an example of the data flow for a 1b addition performed in one cycle of the bit-serial computation. Here, we add the second bit positions of vector A (A1=0) and vector B (B1=1) with carry-in C (=1) from the previous cycle, and store the result back to vector D. First, the CRAM instruction decoder receives the ADD instruction and the 3 column addresses of bits A1, B1 and D1. It activates the CWLs of A1 and B1 simultaneously to compute ‘A AND B’ on CBL and ‘(NOT A) AND (NOT B)’ on CBLB. Since A=0 and B=1, both CBL and CBLB discharge. Then, after the dual sense amps, the results propagate to the near-memory logic located at the end of each CBL. The NOR gate generates ‘A XOR B’, which, combined with Cin from the carry latch, produces Sum=0 and Cout=1. Sum is then written back to D, and Cout is stored in the carry latch, which provides Cin for the next cycle, thus completing one full bit-serial addition in one clock cycle.
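
The per-cycle logic described above can be captured in a short behavioral model. This is a sketch of the sensing and near-memory logic as described in the text, not the actual circuit; in particular, the carry expression below is the standard full-adder form, which is consistent with (but not spelled out by) the description.

```python
def bit_serial_add_cycle(a, b, cin):
    """Behavioral model of one CRAM bit-serial addition cycle (inputs in {0, 1})."""
    cbl  = a & b                   # wired-AND of the two selected bits sensed on CBL
    cblb = (1 - a) & (1 - b)       # wired-AND of their complements sensed on CBLB
    xor_ab = 1 - (cbl | cblb)      # near-memory NOR of the sensed values = A XOR B
    s    = xor_ab ^ cin            # sum, written back to the destination column
    cout = cbl | (xor_ab & cin)    # carry into the next cycle (assumed full-adder form)
    return s, cout

# Worked example from the text: A1 = 0, B1 = 1, Cin = 1  ->  Sum = 0, Cout = 1.
assert bit_serial_add_cycle(0, 1, 1) == (0, 1)

# Exhaustive check against ordinary binary addition.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, co = bit_serial_add_cycle(a, b, c)
            assert 2 * co + s == a + b + c
```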

Figure 14.2.3, left, shows how two vectors of 2b numbers (A and B) are added bit-by-bit starting from the least significant bit (LSB). Note that while only one bit of a multi-bit operand is processed in each cycle, all compute bit-lines operate simultaneously, resulting in massive parallelism (2048 CBLs in our design). Subtraction is performed by first inverting B and then adding to A with Cin pre-set to 1. As shown in Fig. 14.2.3, multiplication is more complicated as it requires predication. For this, the tag latch (Fig. 14.2.2) is used to enable the write-back driver, resulting in a conditional copy/addition. First, 4 empty columns in the array are reserved for the product and initialized to zero. In the first cycle, the LSB of the multiplier is loaded to the tag latch. In cycles 2 and 3, the multiplicands are copied to the product columns only if their tag is 1. In cycle 4, the second bit of the multiplier is loaded to the tag latch. In the next 2 cycles, for rows with tag = 1, the multiplicands are added to the second and third bits of the product, shifting the multiplicands by 1 to account for the multiplier bit position. Finally, we store Cout in the most significant bit (MSB) of the product to complete the multiplication. Note that partial products are implicitly shifted as they are added using appropriate bit addressing in the bit-serial operation, and no explicit shift is performed. Division is conducted similarly, by implicit shifting and subtraction from a partial result. Floating-point arithmetic is implemented using repeated integer add/sub/mult/div with predication. Fig. 14.2.3 provides a list of supported computations and their performance, demonstrating both the versatility of CRAM and its high performance due to bit-line parallelism.
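
To see how the multi-bit operations fall out of this single-bit cycle, the following Python model (ours, purely illustrative; `add_columns`, `multiply_2b`, and the column layout are assumptions for demonstration) walks the columns LSB-first for a tag-predicated addition and then reproduces the conditional-copy / shifted-add sequence of the 2b multiplication, with the final carry landing in the product MSB.

```python
def full_add(a, b, cin):
    # Same truth table as the bit-serial cycle modeled above: (sum, carry-out).
    s = a ^ b ^ cin
    return s, (a & b) | ((a ^ b) & cin)

def add_columns(mem, base_a, base_b, base_d, width, tag):
    """Bit-serial add, one column (bit position) per cycle: for rows where
    tag == 1, write D = A + B. mem[r][c] is bit c of row r, LSB at the base."""
    carry = [0] * len(mem)
    for i in range(width):
        for r, row in enumerate(mem):
            if not tag[r]:
                continue                      # tag latch gates the write-back driver
            s, carry[r] = full_add(row[base_a + i], row[base_b + i], carry[r])
            row[base_d + i] = s
    for r, row in enumerate(mem):             # final carry lands in the next column up
        if tag[r]:
            row[base_d + width] = carry[r]

def multiply_2b(mem, base_a, base_b, base_p):
    """2b x 2b multiply into a 4-column product at base_p, mirroring the text:
    predicated copy for multiplier bit 0, then a predicated add at a shifted
    column address for multiplier bit 1 (the shift is implicit in the addressing)."""
    rows = range(len(mem))
    for row in mem:                           # product columns start out zeroed
        for i in range(4):
            row[base_p + i] = 0
    tag = [mem[r][base_b + 0] for r in rows]  # cycle 1: multiplier LSB -> tag latch
    for r in rows:                            # cycles 2-3: conditional copy of multiplicand
        if tag[r]:
            mem[r][base_p + 0] = mem[r][base_a + 0]
            mem[r][base_p + 1] = mem[r][base_a + 1]
    tag = [mem[r][base_b + 1] for r in rows]  # cycle 4: second multiplier bit -> tag latch
    # Remaining cycles: add the multiplicand into product bits 1..2; the final
    # carry is written to the product MSB (base_p + 3).
    add_columns(mem, base_a, base_p + 1, base_p + 1, 2, tag)

# Usage: two rows; multiplicand in columns 0-1, multiplier in 2-3, product in 4-7.
mem = [[1, 1, 1, 1, 0, 0, 0, 0],   # 3 x 3
       [0, 1, 1, 1, 0, 0, 0, 0]]   # 2 x 3
multiply_2b(mem, base_a=0, base_b=2, base_p=4)
assert [sum(row[4 + i] << i for i in range(4)) for row in mem] == [9, 6]
```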

Figure 14.2.4 shows measurement results from the prototype chip fabricated in 28nm CMOS that contains 8 CRAM banks (128KB memory with 2048 computing rows) and a Cortex-M0 processor. The figure shows measured frequency and energy efficiency of 8b addition and multiplication across supply voltage. At 1.1V, the maximum frequency of 475MHz results in 122GOPS for 8b addition and 9.4GOPS for 8b multiplication. The best energy efficiency is achieved at 0.6V and 114MHz, resulting in 0.56TOPS/W for 8b multiplication and 5.27TOPS/W for 8b addition. Fig. 14.2.4 shows measured frequency and leakage power distributions for 21 measured dies.
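
As a back-of-envelope consistency check (our own arithmetic, not a figure from the paper), these throughput numbers imply the per-element cycle counts computed below, assuming each of the 2048 compute bit-lines delivers one result per completed operation.

```python
# Implied cycles per 8b operation at 1.1V / 475MHz with 2048 parallel compute bit-lines.
lanes, f_clk = 2048, 475e6
peak = lanes * f_clk   # elementwise results per second if an op took a single cycle

for name, gops in (("8b add", 122e9), ("8b multiply", 9.4e9)):
    print(f"{name}: ~{peak / gops:.0f} cycles per element")
# -> ~8 cycles for the bit-serial 8b add (about one column per cycle) and
#    ~104 cycles for the 8b multiply, consistent with a shift-and-add sequence.
```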

Figure 14.2.5 shows the performance of the test chip for diverse computationally intensive tasks ranging from neural networks to graph and signal processing. The total latency in cycles is compared with a baseline operation, where CRAMs are only used as data memories and the computation is entirely performed on the ARM CPU. The first benchmark is the 1st convolutional layer from Cuda-convnet and the second is the last fully connected layer from AlexNet. Due to their size, these layers must be executed in multiple smaller sub-sections. The third application consists of 512 simultaneous 32-tap FIR filters, and the fourth application performs traversal of a directed graph represented by a 192×192 adjacency matrix. The workload breakdown shows the percentage of time spent on input loading and output loading vs. in-memory computation. Speedup, compared to executing the same workload with the ARM Cortex-M0, varies from 7.2× to 114×, with the greatest gains obtained when the operation is compute-heavy and low on input/output movement.

Figure 14.2.6 compares the proposed approach with other state-of-the-art in-memory accelerators. The proposed work is the only solution to provide a wide range of instructions and flexible bit-width. It repurposes the memory storage already available in processors, thereby accelerating computation while maintaining programmability.

Acknowledgements:
We gratefully acknowledge the TSMC University Shuttle Program for chip fabrication. This work was supported in part by ADA, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

References:
[1] J. Zhang, et al., "In-Memory Computation of a Machine Learning Classifier in a Standard 6T SRAM Array," IEEE JSSC, vol. 52, no. 4, pp. 915-924, 2017.
[2] A. Biswas, et al., "Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN-based Machine Learning Applications," ISSCC, pp. 488-489, 2018.
[3] S. Gonugondla, et al., "A 42pJ/Decision 3.12TOPS/W Robust In-Memory Machine Learning Classifier with On-Chip Training," ISSCC, pp. 490-491, 2018.
[4] W. Khwa, et al., "A 65nm 4Kb Algorithm-Dependent Computing-in-Memory SRAM Unit-Macro with 2.3ns and 55.8TOPS/W Fully Parallel Product-Sum Operation for Binary DNN Edge Processors," ISSCC, pp. 496-497, 2018.
[5] Y. Zhang, et al., "Recryptor: A Reconfigurable In-Memory Cryptographic Cortex-M0 Processor for IoT," IEEE Symp. VLSI Circuits, 2017.
[6] K. Batcher, "Bit-Serial Parallel Processing Systems," IEEE Trans. on Computers, vol. 31, no. 5, pp. 377-384, 1982.
[7] C. Eckert, et al., "Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks," ACM/IEEE ISCA, pp. 383-396, 2018.
[8] J. Seo, et al., "A 45nm CMOS Neuromorphic Chip with a Scalable Architecture for Learning in Networks of Spiking Neurons," IEEE CICC, 2011.


Figure 14.2.1: Chip architecture and storage and computation of data in transposable memory array.

Figure 14.2.2: CRAM array architecture (top-left), 8T transposable bitcell (top-right), in-memory computing part (bottom-left) and near-memory computing part (bottom-right) of 1-bit addition. Addition of near-memory logic increased array size by 4.5%.


Figure 14.2.3: 2-bit addition cycle-by-cycle demonstration (top-left), 2-bit multiplication cycle-by-cycle demonstration (top-mid & right), and list of CRAM instructions and their performance (bottom).

Figure 14.2.4: Frequency and energy efficiency of 8-bit multiplication and addition at different VDD (top), maximum frequency and leakage power distribution of 21 dies at 1.1V (bottom).

Figure 14.2.5: Performance comparison between CRAM and baseline scenario (top), workload breakdown (bottom).

Figure 14.2.6: Comparison table.




Figure 14.2.7: Die photo.

