
470 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 3, MARCH 2018

Computing in Memory With Spin-Transfer Torque Magnetic RAM

Shubham Jain, Ashish Ranjan, Kaushik Roy, Fellow, IEEE, and Anand Raghunathan, Fellow, IEEE

Abstract— In-memory computing is a promising approach to addressing the processor-memory data transfer bottleneck in computing systems. We propose spin-transfer torque compute-in-memory (STT-CiM), a design for in-memory computing with spin-transfer torque magnetic RAM (STT-MRAM). The unique properties of spintronic memory allow multiple wordlines within an array to be simultaneously enabled, opening up the possibility of directly sensing functions of the values stored in multiple rows using a single access. We propose modifications to STT-MRAM peripheral circuits that leverage this principle to perform logic, arithmetic, and complex vector operations. We address the challenge of reliable in-memory computing under process variations by extending error-correction code schemes to detect and correct errors that occur during CiM operations. We also address the question of how STT-CiM should be integrated within a general-purpose computing system. To this end, we propose architectural enhancements to processor instruction sets and on-chip buses that enable STT-CiM to be utilized as a scratchpad memory. Finally, we present data mapping techniques to increase the effectiveness of STT-CiM. We evaluate STT-CiM using a device-to-architecture modeling framework, and integrate cycle-accurate models of STT-CiM with a commercial processor and on-chip bus (Nios II and Avalon from Intel). Our system-level evaluation shows that STT-CiM provides system-level performance improvements of 3.93 times on average (up to 10.4 times), and concurrently reduces memory system energy by 3.83 times on average (up to 12.4 times).

Index Terms— In-memory computing, processing-in-memory, spin-transfer torque magnetic RAM (STT-MRAM), spintronic memories.

Manuscript received April 5, 2017; revised July 13, 2017; accepted August 15, 2017. Date of publication December 28, 2017; date of current version February 22, 2018. This work was supported in part by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and in part by the National Science Foundation under grant 1320808. (Corresponding author: Shubham Jain.)
The authors are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47906 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2017.2776954
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

THE growth in data processed and increase in the number of cores place high demands on the memory systems of modern computing platforms. Consequently, a growing fraction of transistors, area, and power is utilized toward memories. CMOS memories (SRAM and embedded DRAM) have been the mainstays of memory design for the past several decades. However, recent technology scaling challenges in CMOS memories, along with an increased demand for memory capacity and performance, have fueled an active interest in alternative memory technologies.

Spintronic memories have emerged as a promising candidate for future memories due to several desirable attributes, such as nonvolatility, high density, and near-zero leakage. In particular, spin-transfer torque magnetic RAM (STT-MRAM) has garnered significant interest, with various prototype demonstrations and early commercial offerings [1]–[3]. There have been several research efforts to boost the efficiency of STT-MRAM at the device, circuit, and architectural levels [4]–[30].

In this paper, we explore in-memory computing with STT-MRAM. By exploiting the ability to simultaneously enable multiple wordlines (WLs) within a memory array, we enhance STT-MRAM arrays to perform a range of arithmetic, logic, and vector operations. We propose circuit and architectural techniques for reliable computation under process variations, and to enable the proposed design to be used in a programmable processor-based system.

In-memory computing is motivated by the observation that the movement of data from bit-cells in the memory to the processor and back (across the bitlines, memory interface, and system interconnect) is a major performance and energy bottleneck in computing systems. Efforts that have explored the closer integration of logic and memory are variedly referred to in the literature as logic-in-memory, computing-in-memory, and processing-in-memory. These efforts may be classified into two categories: moving logic closer to memory (near-memory computing) [31]–[44], and performing computations within memory structures (in-memory computing) [45]–[57], which is the focus of this paper. In-memory computing reduces the number of memory accesses and the amount of data transferred between processor and memory, and exploits the wider internal bandwidth available within memory systems.

Our proposal is based on the observation that by enabling multiple WLs simultaneously (note that this is much easier in STT-MRAM than in CMOS memories, due to the resistive nature of the bit-cells) and sensing the effective resistance of each bitline (BL), it is possible to directly compute logic functions of the values stored in the bit-cells. Based on this insight, we propose spin-transfer torque compute-in-memory (STT-CiM), a design for in-memory computing with STT-MRAM that can perform a range of arithmetic, logic, and vector operations. In STT-CiM, the core data array is the same as in standard STT-MRAM; hence, memory density and the efficiency of read and write operations are maintained. Reliable sensing under the limited tunneling magnetoresistance (TMR) of STT-MRAM bit-cells is known to be a challenge [12]–[16], [29], and we show that this challenge is
further aggravated for in-memory computations. In order to enhance the robustness of STT-CiM under process variations, we extend error-correction codes (ECCs) to errors that occur during in-memory computations. To evaluate the benefits of STT-CiM, we utilize it as a scratchpad in the memory hierarchy of the Intel Nios II [58] processor. We propose enhancements to the on-chip bus and extend the instruction set of the processor to support CiM operations and expose them to software. We also present suitable data mapping techniques to maximize the benefits of STT-CiM.

We note that earlier efforts (see [48]) have proposed enabling multiple WLs to perform computations within nonvolatile memories (NVMs). Although this paper shares this principle, we differ from the previous work in several key aspects: 1) we address reliable in-memory computing under process variations; 2) we go beyond bitwise logic operations to also perform arithmetic and vector operations, which are commonly present in modern computing workloads; and 3) we propose architectural enhancements [bus and instruction set architecture (ISA) extensions] and data mapping techniques to enable in-memory computation in the context of on-chip scratchpad memories.

In summary, the key contributions of this paper are as follows.
1) We explore CiM with spintronic memories as an approach to improving system performance and energy.
2) We propose STT-CiM, an enhanced STT-MRAM array that can perform a range of arithmetic, logic, and vector CiM operations without modifying either the bit-cells or the core data array.
3) We address a key challenge in STT-CiM, i.e., reliably performing in-memory operations under process variation, by demonstrating suitable error correction mechanisms.
4) We propose extensions to the instruction set and on-chip bus to integrate STT-CiM into a programmable processor system, and demonstrate the viability of these extensions using Intel's Nios II processor and Avalon on-chip bus.
5) We evaluate the performance and energy benefits of STT-CiM, achieving average improvements of 3.83 times (up to 12.4 times) and 3.93 times (up to 10.4 times) in total memory energy and system performance, respectively.

The rest of this paper is organized as follows. Section II presents an overview of prior research efforts related to in-memory computation. Section III provides the necessary background on STT-MRAM. Section IV describes the STT-CiM design and how it supports in-memory computation. Section V outlines architectural enhancements for STT-CiM. Section VI describes the experimental methodology, and experimental results are presented in Section VII. Section VIII concludes this paper.

II. RELATED WORK

The closer integration of logic and memory is variedly referred to in the literature as logic-in-memory, computing-in-memory, and processing-in-memory. These efforts can be broadly classified into two categories, as shown in Fig. 1.

Fig. 1. Related work: overview.

We limit the scope of our discussion to approaches that improve the efficiency of active computation. For example, we do not discuss the embedding of NVM elements into a logic circuit [59]–[62] in order to enable the system to shut down and wake up efficiently for improved power management.

Near-memory computing refers to bringing logic or processing units closer to memory. Notwithstanding the closer integration, processing units still remain distinct from memory arrays. Near-memory computing has been explored at various levels of the memory hierarchy [31]–[35], [37]–[44]. Intelligent RAM [31] is an early example, which integrated a processor and DRAM in the same chip to improve the bandwidth between them. Embedding simple processing units within each page of main memory [32] and within secondary storage [33] enables computations to be performed near memory. An application-specific example of near-memory computation is memory that can generate interpolated values, enabling the evaluation of complex mathematical functions [42]. Near-memory computing has gained significant interest in recent years, with industry efforts such as hybrid memory cube [43] and high bandwidth memory [44].

In-memory computing [45]–[48], [51]–[54] integrates logic operations into memory arrays, fundamentally blurring the distinction between processing and memory. The key challenge of in-memory computing is to realize it without impacting the desirability of the resulting design as a standard memory (i.e., density or efficiency of standard read and write operations). Due to these constraints, in-memory computing is typically limited to performing a small number of simple operations.

We can classify previous proposals for in-memory computing based on whether they target application-specific or general-purpose computations, and based on the underlying memory technology that they consider. Application-specific examples of in-memory computing include vector-matrix multiplication [54]–[57] and sum-of-absolute difference [46] computation. Ternary content-addressable memory [45], ROM-embedded RAM [63], AC-DIMM [52], and Micron's automata processor [64] can also be viewed as examples of in-memory computing that target specific operations, such as pattern matching or evaluation of transcendental functions. Unlike these application-specific designs, we focus on embedding a broader set of operations (arithmetic, logic, and vector operations) within memory.

In-memory evaluation of bitwise logic operations has been explored for memristive memories [48]–[50] and DRAM [51]. This paper differs from these efforts in several important aspects. First, we focus on in-memory computing for
spintronic memory, which involves fundamentally different prospects and design challenges. For example, the proposed operations are not destructive to the contents stored in the accessed bit-cells (unlike [51]). On the other hand, the much lower ratio of ON-to-OFF resistance in spintronic memory leads to lower sensing margins. Second, we use different sensing and reference generation circuitry (RGC), which enables us to natively realize a wider variety of operations; for example, the proposed design requires only one array access (unlike two in the case of [48]) to perform bitwise XOR operations. Third, our design goes beyond bitwise logic operations and realizes arithmetic as well as complex vector operations. Fourth, we propose architectural extensions (bus and ISA extensions) and data mapping techniques to enable in-memory computing within a general-purpose processor system. Finally, we address a key challenge associated with in-memory computing, viz., reliable operation under process variations.

A different approach to in-memory computing with spintronic memories [47] uses an extra transistor in each bit-cell (2T-1R cells), which sacrifices the density benefits of standard (1T-1R) STT-MRAM while potentially enabling more complex functions to be evaluated within the array. In contrast, our proposal enables in-memory computation within a standard STT-MRAM array with no changes to the bit-cells. We note that a concurrent effort [53] has explored bitwise AND/OR operations in STT-MRAM. The bitwise XOR operation cannot be realized atomically using the design proposed in [53]. Furthermore, these efforts restrict themselves to device- and circuit-level considerations, and do not address the architectural challenges of in-memory computing.

III. BACKGROUND

An STT-MRAM bit-cell consists of an access transistor and a magnetic tunnel junction (MTJ), as shown in Fig. 2. An MTJ, in turn, consists of a pinned layer that has a fixed magnetic orientation and a free layer whose magnetic orientation can be switched. The magnetic layers are separated by a tunneling oxide. The relative magnetic orientation of the free and pinned layers determines the resistance offered by the MTJ (the resistance for the parallel configuration, R_P, is lower than the antiparallel resistance, R_AP). The two resistance states encode a bit (we assume that parallel represents logic "1," and antiparallel represents logic "0"). A read operation is performed by applying a bias (V_read) between the BL and the source line (SL), and enabling the WL. The resultant current flowing through the bit-cell (I_P or I_AP) is compared against a global reference to determine the logic state stored in the bit-cell.

Fig. 2. STT-MRAM bit-cell.

A write is performed by passing a current greater than the critical switching current of the MTJ through the bit-cell. The logic value written depends on the direction of the write current, as shown in Fig. 2. The write operation in STT-MRAM is stochastic in nature, and the duration and magnitude of the write current determine the write failure rate. Apart from write failures, STT-MRAMs are also subject to read decision failures, where the value stored in a bit-cell is incorrectly sensed due to process variations, and read disturb failures, where a read operation inadvertently ends up writing into the bit-cell. These failures are addressed through a range of techniques, including device and circuit optimization, manufacturing test and self-repair, and error-correcting codes [12], [13], [15], [16]. Apart from write/read failures, memories may also have failures due to thermal noise, which causes stochastic flipping in the bit-cells. However, such failures are negligible in STT-MRAM due to the high energy barrier between the two resistance states.

IV. STT-MRAM-BASED COMPUTE-IN-MEMORY

In this section, we describe STT-MRAM-based compute-in-memory (STT-CiM), a design for in-memory computing using standard STT-MRAM arrays.

A. STT-CiM Overview

The key idea behind STT-CiM is to enable multiple WLs simultaneously in an STT-MRAM array, leading to multiple bit-cells being connected to each BL. With the enhancements that we propose to the sensing circuitry and RGC, we can directly compute logic functions of the enabled words. Note that such an operation is feasible in STT-MRAMs, since the bit-cells are resistive, and since the write currents are typically much higher than the read currents. In contrast, enabling multiple WLs in SRAM can lead to short-circuit paths through the memory array, leading to loss of the data stored in the bit-cells.

Fig. 3 explains the principle of operation of STT-CiM. First, consider the resistive equivalent circuit of a single STT-MRAM bit-cell shown in Fig. 3(a). R_t represents the ON-resistance of the access transistor, and R_i represents the resistance of the MTJ. When a voltage V_read is applied between the BL and the SL, the net current I_i flowing through the bit-cell can take two possible values depending on the MTJ configuration, as shown in Fig. 3(b). A read operation involves using a sensing mechanism to distinguish between these two current values.

Fig. 3(c) demonstrates a CiM operation, where two WLs (WL_i and WL_j) are enabled, and a voltage bias (V_read) is applied to the BL. The resultant current flowing through the SL (denoted I_SL) is the summation of the currents flowing through each of the bit-cells (I_i and I_j), which in turn depend on the logic states stored in these bit-cells. The possible values of I_SL are shown in Fig. 3(d). We propose enhanced sensing mechanisms to distinguish between these values and thereby compute logic functions of the values stored in the enabled bit-cells. We discuss the details of these operations in turn as follows.


TABLE I. Possible outputs of various sensing schemes.

Fig. 3. STT-CiM: principle of operation. (a) Resistive equivalent. (b) Bit-cell currents for read operation. (c) CiM operation. (d) SL currents for CiM operation.

Fig. 4. STT-CiM sensing schemes. (a) Bitwise OR sensing scheme. (b) Bitwise AND sensing scheme. (c) Reference currents.

1) Bitwise OR (NOR): In order to realize logic OR and NOR operations, we use the sensing scheme shown in Fig. 4(a), where I_SL is connected to the positive input of the sense amplifier and a reference current I_ref-OR is fed to its negative input. We choose I_ref-OR to be between I_AP-AP and I_AP-P, as shown in Fig. 4(c). As a result, among the possible values of I_SL [see Fig. 3(d)], only I_SL = I_AP-AP is less than I_ref-OR. Consequently, only the case where both bit-cells are in the AP configuration, i.e., both store "0," leads to an output of logic "0" ("1") at the positive (negative) output of the sense amplifier, while all other cases lead to logic "1" ("0"). Thus, the positive and negative outputs of the sense amplifier evaluate the logic OR and NOR of the values stored in the enabled bit-cells.

2) Bitwise AND (NAND): A bitwise AND (NAND) operation is realized at the positive (negative) terminal of the sense amplifier by using the sensing scheme shown in Fig. 4(b). Note that in this scheme, a different reference current (I_ref-AND) is fed to the sense amplifier.

3) Bitwise XOR: A bitwise XOR operation is realized when the two sensing schemes shown in Fig. 4 are used in tandem, and O_AND and O_NOR are fed to a CMOS NOR gate. In other words, O_XOR = O_AND NOR O_NOR.

Table I summarizes the logic operations achieved using the two sensing schemes discussed earlier. Note that all the above-described logic operations are symmetric in nature, and hence, it is not necessary to distinguish between the cases where the two bit-cells connected to a BL store "10" and "01."

4) ADD Operation: An ADD operation is realized by leveraging the ability to concurrently perform multiple bitwise logical operations, as illustrated in Fig. 5. Suppose A_n and B_n (the nth bits of two words, A and B) are stored in two different bit-cells of the same column within an STT-CiM array, and that we wish to compute the full-adder logic function (the nth stage of an adder that adds words A and B). As shown in Fig. 5, S_n (the sum) and C_n (the carry out) can be computed using A_n XOR B_n and A_n AND B_n, in addition to C_(n-1) (the carry input from the previous stage). Fig. 5 also expresses the ADD operation in terms of the outputs of the bitwise operations, O_AND and O_XOR. Three additional logic gates are required to enable this computation. Note that the sensing schemes discussed above enable us to perform the bitwise XOR and AND operations simultaneously, thereby performing an ADD operation with a single array access.

Fig. 5. In-memory ADD operation.

B. STT-CiM Array

In this section, we present the array-level design of STT-CiM using the above-described circuit-level techniques. As shown in Fig. 6, the proposed STT-CiM memory array takes an additional input, CiMType, that indicates the type of CiM operation to be performed for each memory access. The CiM decoder interprets this input and generates appropriate control signals to perform the desired logic operation. In order to enable CiM operations, the read peripheral circuits present in each column (the sensing circuit and global reference generation circuit in Fig. 6) are enhanced, while the core data array remains the same as in standard STT-MRAM. The address (row) decoder needs to enable multiple WLs for CiM operations. Specifically, we utilize two address decoders, each decoding the corresponding input address. The corresponding outputs of the decoders are ORed and connected to each WL. This configuration allows either of the two decoders to activate arbitrary WL locations. While the row decoder overhead is roughly doubled, it represents a small fraction of the total area and power for configurations involving large arrays (1.8% in our evaluation). The write peripheral
circuits are unchanged, as write operations are identical to those of standard STT-MRAM. We next describe the enhancements to the sensing and reference generation circuits that enable CiM operations.

Fig. 6. STT-CiM array structure.

1) Sensing Circuitry: Fig. 6 shows the sensing circuit enhanced to support all the logic operations discussed in Section IV-A. It consists of two sense amplifiers, a CMOS NOR gate, three multiplexers, and three additional logic gates for the ADD operation. We note that the area and power overheads associated with these enhancements are minimal, since the sensing circuit constitutes a small fraction of the total memory area/power. As shown in Fig. 6, the reference currents (I_refl and I_refr) produced by the global reference generation circuit are fed to the two sense amplifiers in order to realize the sensing schemes discussed in Section IV-A. The three MUX control signals (sel_0, sel_1, and sel_2) are generated by the CiM decoder to select the desired CiM operation.

2) Reference Generation: Fig. 6 illustrates the modified reference generation circuit used to produce the additional reference currents necessary for the proposed sensing schemes. It includes two reference stacks, one for each of the two sense amplifiers in the sensing circuit. Each stack consists of three bit-cells programmed to offer resistances R_P, R_AP, and R_REF, respectively, where R_REF represents the fixed-resistance reference MTJ used in standard STT-MRAM to perform read operations (R_AP > R_REF > R_P). The CiM decoder generates control signals (rwl_0, rwl_1, ..., rwr_1, rwr_2) that enable a subset of these bit-cells in the reference stacks, which in turn produces the desired reference currents. Table II presents the values of these control signals required to achieve each reference current.

TABLE II. STT-CiM operations: control signals.

The STT-CiM array can perform both regular memory operations and a range of CiM operations. A normal read operation is performed by enabling a single WL and setting sel_0, sel_1, and rwl_0 to logic "1." On the other hand, a CiM operation is performed by enabling two WLs and setting CiMType to the appropriate value, which results in computing the desired function of the enabled words. The control signal values for a read operation as well as CiM operations are shown in Table II.

C. CiM Operation Under Process Variations

The STT-CiM array suffers from the same failure mechanisms (read disturb failures, read decision failures, and write failures) that are observed in standard STT-MRAM. In this section, we compare the failure rates of STT-CiM and standard STT-MRAM. Normal read/write operations in STT-CiM have the same failure rates as in standard STT-MRAM, since the read/write mechanisms are identical. However, CiM operations differ in their failure rates, since the currents that flow through each bit-cell differ when two WLs are enabled simultaneously. In order to analyze the read disturb and read decision failures under process variations for CiM operations, we performed a Monte Carlo circuit-level simulation on 1 million samples, considering variations in MTJ oxide
thickness (σ/μ = 2%), transistor V_T (σ/μ = 5%), and MTJ cross-sectional area (σ/μ = 5%) [12]. Fig. 7 shows the probability density distribution of the possible currents obtained during the read and CiM operations on these 1 million samples.

Fig. 7. Probability density distribution of I_SL under process variations during (a) read and (b) CiM operations.

1) CiM Disturb Failures: As shown in Fig. 7, the overall current flowing through the SL is slightly higher in the case of a CiM operation compared with a normal read. However, this increased current is divided between the two parallel paths, and consequently, the net read current flowing through each bit-cell (MTJ) is reduced. Hence, the read disturb failure rate is lower for CiM operations than for normal read operations.

2) CiM Decision Failures: The net current flowing through the SL (I_SL) in the case of a CiM operation can have three possible values, i.e., I_P-P, I_AP-P (I_P-AP), and I_AP-AP. A read decision failure occurs during a CiM operation when the current I_P-P is interpreted as I_AP-P (or vice versa), or when I_AP-AP is inferred as I_AP-P (or vice versa). In contrast to normal reads, CiM operations have two read margins: one between I_P-P and I_AP-P, and another between I_AP-P and I_AP-AP [see Fig. 7(b)]. Our simulation results show that the read margins for CiM operations are lower compared with normal reads; therefore, CiM operations are more prone to decision failures. Moreover, the read margins in CiM operations are unequal (although the resistances may be equally separated, the currents are not, since they depend inversely on resistance). Thus, more failures arise due to the read margin between I_P-P and I_AP-P.

3) ECC for STT-CiM: In order to mitigate these failures in STT-MRAM, various ECC schemes have been previously explored [12]–[14]. We show that ECC techniques that provide single error correction and double error detection (SECDED), or double error correction and triple error detection, can be used to address the decision failures in CiM operations as well. This is feasible because the codeword properties of most ECC codes are retained under a CiM XOR operation. Fig. 8 shows the codeword retention property of a CiM XOR operation using a simple Hamming code. As shown in Fig. 8, word1 and word2 are augmented with ECC bits (p_1, p_2, and p_3) and stored in memory as InMemW1 and InMemW2, respectively. A CiM XOR operation performed on these stored words (InMemW1 and InMemW2) results in the ECC codeword for word1 XOR word2; therefore, codewords are preserved under CiM XOR. We leverage this codeword retention property of CiM XORs to detect and correct errors in all CiM operations. This is enabled by the fact that STT-CiM always computes the bitwise XOR (CiM XOR) irrespective of the CiM operation that is being performed.

Fig. 8. Codeword retention property of CiM XOR.

We demonstrate the proposed error detection and correction (EDC) mechanism for CiM operations in Fig. 9. Let us assume that data bit d1 suffers from a decision failure during a CiM operation, as shown in Fig. 9. As a result, the combination of logic 1 and 1 in the two bit-cells (I_P-P) is inferred as logic 1 and 0 (I_AP-P), leading to erroneous CiM outputs. An error detection logic operating on the CiM XOR output (see Fig. 9) detects an error in the d1 data bit. This error can be corrected directly for a CiM XOR operation by simply flipping the erroneous bit. For other CiM operations, we perform two conventional reads on words InMemW1 and InMemW2, and correct the erroneous bits by recomputing them using an EDC unit (discussed in Section V). Note that such corrections lead to overheads, as we need to access the memory array three times (compared with two times in STT-MRAM). However, our variation analysis shows that error corrections on CiM operations are infrequent, leading to overall improvements.

Fig. 9. EDC for CiM operations.

4) ECC Design Methodology: We use the methodology employed in [12] to determine the ECC requirements for both the baseline STT-MRAM and the proposed STT-CiM design. The approach uses circuit-level simulations to determine the bit-level error probability, which is then used to estimate the array-level yield; the ECC scheme is then selected based on the target yield requirement. Our simulations show that the 1-bit failure probabilities of normal reads and CiM operations are 4.2 × 10^-8 and 6 × 10^-5, respectively. With these bit-level failure rates and assuming a target yield of 99%, the ECC requirement for a 1-Mb STT-MRAM is SECDED, whereas the ECC requirement for a 1-Mb STT-CiM is
TABLE III: Examples of Reduction Operations

Fig. 10. STT-CiM supporting in-memory vector operation.

Note that the overheads of the ECC schemes [12], [65] are fully considered and reflected in our experimental results. Moreover, our simulations show that the probability of a CiM operation having an error is at most 0.1, i.e., no more than 1 in 10 CiM operations will have an error. Errors on all CiM operations are detected by using the 3EC4ED code on the XOR output. Detected errors are corrected directly using the 3EC4ED code for CiM XORs, and by reverting to near-memory computation for the other CiM operations.

Apart from ECC schemes, STT-CiM can also leverage the various reliability improvement techniques proposed for STT-MRAMs [27]–[30]. Furthermore, recent efforts [27], [28] that increase the TMR of the MTJ and improve sensing margins will reduce read failures in CiM operations as well. These techniques can be used along with ECC to cost-effectively mitigate failures in STT-CiM operations.

V. STT-CiM ARCHITECTURE

In order to evaluate the application-level benefits of STT-CiM, we integrate it as a scratchpad memory within the memory hierarchy of a programmable processor [58]. This section describes architectural enhancements for STT-CiM and hardware/software optimizations to increase its efficiency.

A. Optimizations for STT-CiM

In order to further the efficiency improvements obtained by STT-CiM, we propose additional optimizations.

1) Vector CiM Operations: Modern computing workloads exhibit significant data parallelism. To further enhance the efficiency of STT-CiM for data-parallel computations, we introduce vector CiM (VCiM) operations. The key idea behind VCiM operations is to perform CiM operations on all the elements of a vector concurrently. Fig. 10 shows how the internal memory bandwidth (32 × N bits) can be significantly larger than the limited I/O bandwidth (32 bits) visible to the processor. We exploit the memory's internal bandwidth to perform vector operations (N words wide) within STT-CiM. Note that the data resulting from a vector operation may also be a vector, and hence, transferring it back to the processor is subject to the limited I/O bandwidth. To address this issue, we observe that vector operations are often followed by reduction operations. For example, a vector dot-product involves elementwise multiplication of two vectors followed by a summation (reduction) of the resulting vector of products to produce a scalar value. Based on this observation, we introduce a reduce unit (RU) before the column multiplexer, as shown in Fig. 10. The RU takes an array of data elements as inputs and reduces it to a single data element. The RU can support various reduction operations, such as summation, Euclidean distance, L1 and L2 norms, and zero-comparison, of which two are described in Table III. Consider the computation Σ_{i=1}^{N} (A[i] + B[i]), where arrays A and B are stored in rows i and j, respectively (shown in Fig. 10). To compute the desired function using a VCiM operation, we activate rows i and j simultaneously, configure the sensing circuitry to perform an ADD operation, and configure the RU to perform accumulation of the resulting output. Note that this summation would require 2N memory accesses in a conventional memory. With scalar CiM operations, it would require N memory accesses. With the proposed VCiM operations, only a single memory access is required.

The overheads of the RU depend on two factors: 1) the number of different reduction operations supported and 2) the maximum vector length allowed (between 2 and N words). To limit these overheads, we restrict our design to vector lengths of 4 and 8.
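As a concrete illustration of these access counts, the sketch below models the Σ(A[i] + B[i]) example functionally; the function names (cim_vector_add, reduce_sum) are our own stand-ins for the sensing circuitry and the RU, not a hardware interface.

```python
N = 8                                  # words per row (vector length)
row_A = [3, 1, 4, 1, 5, 9, 2, 6]       # row holding array A
row_B = [2, 7, 1, 8, 2, 8, 1, 8]       # row holding array B

def cim_vector_add(row1, row2):
    """Both wordlines enabled simultaneously: the sensing circuitry of
    every column resolves A[i] + B[i], yielding an N-word vector."""
    return [a + b for a, b in zip(row1, row2)]

def reduce_sum(vector):
    """Reduce unit (RU): collapses the N-word vector into one scalar
    before it reaches the narrow I/O interface."""
    return sum(vector)

result = reduce_sum(cim_vector_add(row_A, row_B))   # one array access

accesses_conventional = 2 * N   # read each A[i] and B[i] individually
accesses_scalar_cim = N         # one CiM ADD per element pair
accesses_vector_cim = 1         # one VCiM ADD + RU reduction
print(result)                   # 68
```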
2) Error Detection and Correction: To enable correction of erroneous bits for CiM operations, we introduce an EDC unit that implements the 3EC4ED ECC scheme. The EDC unit checks for errors using the CiM XOR output (recall that the XOR is evaluated along with all CiM operations) and signals the controller (shown in Fig. 10) upon the detection of erroneous computations. Upon receiving this error detection signal, the controller performs the required corrective actions.

B. Architectural Extensions for STT-CiM

To integrate STT-CiM in a programmable processor-based system, we propose the following architectural enhancements.

1) ISA Extension: We extend the ISA of a programmable processor to support CiM operations. To this end, we introduce a set of new instructions in the ISA (CiMXOR, CiMNOT, CiMAND, CiMADD, ...) that are used to invoke the different types of operations that can be performed in the STT-CiM array. In a load instruction, the requested address is sent to the memory, and the memory returns the data stored at the addressed location. In the case of a CiM instruction, however, the processor is required to provide the addresses of two memory locations instead of a single one, and the memory operates on the two data values to return the final output.


Fig. 11. Program transformation for CiMXOR.

Fig. 12. Data mapping for various computation patterns.

Format: Opcode Reg1 Reg2 Reg3
Example: CiMXOR RADDR1 RADDR2 RDEST (1)

Equation (1) shows the format of a CiM instruction along with an example. As shown, both addresses required to perform the CiMXOR operation are provided through registers. The format is similar to that of a regular arithmetic instruction, which accesses two register values, performs the computation, and stores the result back in a register.

2) Program Transformation: To exploit the proposed CiM instructions at the application level, an assembly-level program transformation is performed, wherein specific sequences of instructions in the compiled program are mapped to suitable CiM instructions in the ISA. Fig. 11 shows an example transformation, where two load instructions followed by an XOR instruction are mapped to a single CiMXOR instruction.

3) Bus and Interface Support: In a programmable processor-based system, the processor and the memory communicate via a system bus or an on-chip network.⁴ This makes it essential to analyze the impact of CiM operations on the bus and the corresponding bus interface. As discussed earlier, a CiM operation is similar to a load instruction, with the key difference that it sends two addresses to the memory. Conventional system buses only allow a single address to be sent via the address channel. In order to send the second address for CiM operations, we utilize the write data channel of the system bus, which is unutilized during a CiM operation. Besides the two addresses, the processor also sends the type of CiM operation (CIMType) to be performed. Note that it may be possible to overlay the CIMType signal onto the existing bus control signals; however, such optimizations strongly depend on the specifics of the bus protocol being used. In our design, we assume that three control bits are added to the bus to carry CIMType, and account for the resulting overheads in our experiments.
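The transformation of Fig. 11 amounts to a peephole rewrite over the compiled program: two loads whose results feed an XOR collapse into one CiMXOR carrying both addresses, mirroring the format in (1). A minimal sketch of such a pass is given below; the tuple-based instruction encoding and register names are simplifications for illustration, not Nios II assembly.

```python
def fuse_cim_xor(asm):
    """Peephole pass: rewrite  ld rX,(rA); ld rY,(rB); xor rD,rX,rY
    into a single  CiMXOR rA, rB, rD  instruction (simplified sketch)."""
    out, i = [], 0
    while i < len(asm):
        if (i + 2 < len(asm)
                and asm[i][0] == "ld" and asm[i + 1][0] == "ld"
                and asm[i + 2][0] == "xor"
                and set(asm[i + 2][2:]) == {asm[i][1], asm[i + 1][1]}):
            # CiMXOR carries the two source addresses; the memory array
            # computes the XOR and returns it to the destination register.
            out.append(("CiMXOR", asm[i][2], asm[i + 1][2], asm[i + 2][1]))
            i += 3
        else:
            out.append(asm[i])
            i += 1
    return out

# (opcode, dest, src, ...) tuples; r10 and r11 hold the operand addresses.
program = [("ld", "r2", "r10"), ("ld", "r3", "r11"),
           ("xor", "r4", "r2", "r3")]
print(fuse_cim_xor(program))   # [('CiMXOR', 'r10', 'r11', 'r4')]
```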
C. Data Mapping

In order to perform a CiM instruction, the locations of its operands in memory must satisfy certain constraints. Let us consider a memory organization consisting of several banks, where each bank is an array that contains rows and columns. In this case, a CiM operation can be performed on two data elements only if they satisfy three key criteria: 1) they are stored in the same bank; 2) they are mapped to different rows; and 3) they are stored in the same set of columns. Consequently, a suitable data placement technique is required that maximizes the use of CiM operations.
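These three criteria can be checked mechanically from a word address. In the sketch below, the bank/row/column split of the address is an illustrative assumption (the actual field widths depend on the array organization):

```python
WORDS_PER_ROW = 8      # words sharing one row (column positions)
ROWS_PER_BANK = 1024

def decompose(addr):
    """Split a flat word address into (bank, row, column) fields."""
    col = addr % WORDS_PER_ROW
    row = (addr // WORDS_PER_ROW) % ROWS_PER_BANK
    bank = addr // (WORDS_PER_ROW * ROWS_PER_BANK)
    return bank, row, col

def cim_compatible(addr1, addr2):
    """True iff the two operands satisfy criteria 1)-3): same bank,
    different rows, and the same set of columns."""
    b1, r1, c1 = decompose(addr1)
    b2, r2, c2 = decompose(addr2)
    return b1 == b2 and r1 != r2 and c1 == c2

print(cim_compatible(5, 5 + WORDS_PER_ROW))   # True: same bank/column, rows differ
print(cim_compatible(5, 6))                   # False: same row
```

A compiler or memory allocator can use such a check to decide whether a pair of loads is convertible into a CiM operation under a given placement.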
We observe that the target applications for STT-CiM have well-defined computation patterns, facilitating such a data placement. Fig. 12 shows three general computation patterns. We next discuss these compute patterns and the corresponding data placement techniques.

Type I: This pattern, shown in the top row of Fig. 12, involves element-to-element operations (OPs) between two arrays, e.g., A and B. In order to effectively utilize STT-CiM for this compute pattern, we utilize the array alignment technique [shown in Fig. 12(a)], which ensures the alignment of elements A[i] and B[i] of arrays A and B for any value of i. This enables the conversion of the operation A[i] OP B[i] into a CiM operation. An extension of this technique is the row-interleaved placement shown in Fig. 12(b). This technique is applicable to larger data structures that do not fully reside in the same memory bank. It ensures that the corresponding elements, i.e., A[i] and B[i], are mapped to the same bank for any value of i, and satisfy the alignment criteria for a CiM operation.

Type II: This pattern, shown in the middle row of Fig. 12, involves a nested loop in which each inner loop iteration operates on a single element of array A together with several elements of array B. For this one-to-many compute pattern, we introduce a spare row technique for data alignment. In this technique, a spare row is reserved in each memory bank to store copies of an element of A. As shown in Fig. 12(c), in the kth iteration of the outer loop, a special write operation is used to fill the spare rows in all banks with A[k]. This results in each element of array B becoming aligned with a copy of A[k], thereby allowing CiM operations to be performed on them. Note that the special write operation introduces energy and performance overheads, but this overhead is amortized over all inner loop iterations, and is observed to be quite insignificant in our evaluations.

⁴While we consider the case of a shared bus for illustration, the same enhancements can be applied to more complex interconnect networks.


Fig. 13. STT-CiM device-to-architecture evaluation framework.

Fig. 14. System-level integration of STT-CiM.

TABLE IV: Device Parameters
TABLE V: Benchmark Applications

Type III: In this pattern, shown in the bottom row of Fig. 12, operations are performed on an element drawn from a small array A and an element from a much larger array B. The elements are selected arbitrarily, i.e., without any predictable pattern. For example, consider when a small sequence of characters needs to be searched within a much larger input string. For this pattern, we propose a column replication technique to enable CiM operations, as shown in Fig. 12(d). In this technique, a single element of the small array A is replicated across columns to fill an entire row. This ensures that each element of A is aligned with every element of B, enabling a CiM operation to be utilized. Note that the initial overhead due to data replication is very small, as it pales in comparison to the number of memory accesses to the larger array.

VI. EXPERIMENTAL METHODOLOGY

In this section, we discuss the device-to-architecture simulation framework (see Fig. 13) and the application benchmarks used to evaluate the performance and energy benefits of STT-CiM at the array level and the system level.

1) Device/Circuit Modeling: We first characterize the bitcells using SPICE-compatible MTJ models that are based on the self-consistent solution of Landau–Lifshitz–Gilbert magnetization dynamics and nonequilibrium Green's function electron transport [66]. Table IV shows the MTJ device parameters [67] used in our experiments. Using the 45-nm bulk CMOS technology and the MTJ models, the memory array along with the associated peripherals and extracted parasitics was simulated in SPICE for read, write, and CiM operations to obtain array-level timing and energy characteristics. The obtained characteristics were then used as technology parameters in a modified version of CACTI [68] that is capable of estimating system-level properties for a spin-based memory. The variation analysis to compute failure rates was performed considering variations in the MTJ oxide thickness (σ/μ = 2%), transistor VT (σ/μ = 5%), and MTJ cross-sectional area (σ/μ = 5%).

2) System-Level Simulation: We evaluated STT-CiM as a 1-MB scratchpad for an Intel Nios II processor [58]. Fig. 14 shows the integration of STT-CiM in the memory hierarchy of the programmable processor. In order to expose the STT-CiM operations to software, we extended the Nios II processor's instruction set with custom instructions. The Avalon on-chip bus was also extended to support CiM operations. Cycle-accurate RTL simulation was used to obtain the execution time and the memory access traces for the various benchmarks. These traces, along with the energy results obtained through the modified CACTI tool, were used to estimate the total memory energy.

3) Benchmark Applications: We evaluate STT-CiM on a suite of 12 algorithms drawn from various applications (see Table V).
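The variation analysis above is, in essence, a Monte Carlo experiment: device parameters are drawn with the stated σ/μ ratios, and the fraction of samples whose sensing margin collapses gives the bit-level failure rate. The sketch below shows only the shape of that flow; the nominal parameter values and the linear margin model are uncalibrated stand-ins for the actual SPICE-level simulation.

```python
import random

random.seed(7)

# sigma/mu ratios from the text; the nominal (mu) values are placeholders.
PARAMS = {"t_ox": (1.2e-9, 0.02), "vt": (0.45, 0.05), "mtj_area": (40e-18, 0.05)}

def sample_device():
    """Draw one device instance with Gaussian parameter variations."""
    return {name: random.gauss(mu, ratio * mu)
            for name, (mu, ratio) in PARAMS.items()}

def read_margin(device):
    """Stand-in for the simulated sensing margin: the margin shrinks
    linearly with each parameter's relative deviation (illustrative)."""
    margin = 1.0
    for name, (mu, _) in PARAMS.items():
        margin -= 4.0 * abs(device[name] - mu) / mu   # arbitrary sensitivity
    return margin

def failure_rate(n_samples=20_000):
    """Fraction of sampled devices whose margin collapses below zero."""
    fails = sum(read_margin(sample_device()) < 0 for _ in range(n_samples))
    return fails / n_samples

rate = failure_rate()
print(rate)   # a small fraction; it grows rapidly as sigma/mu increases
```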


VII. RESULTS

In this section, we first present an array-level analysis of STT-CiM, and then quantify its benefits through system-level energy and performance evaluation.

A. Array-Level Analysis

Fig. 15. Array-level energy evaluation of STT-CiM.

1) Energy: The second and third bars in Fig. 15 show the energy consumed by a standard read operation and a representative CiM operation (CiMXOR), respectively, in a 1-Mb STT-CiM array. Each bar shows the energy breakdown into the major components, i.e., peripheral circuitry (PeriphCkt), WL (WordL), BL (BitL), REF, sense amplifier, and ECC. For comparison, we provide the read energy for an STT-MRAM array of the same capacity (first bar in Fig. 15), and all energy numbers are normalized to this value. A normal read operation in STT-CiM incurs an energy overhead of about 4.4%, which arises primarily due to the extra peripheral circuitry and the stronger ECC. STT-CiM uses a 3EC4ED ECC scheme (compared with SECDED in the baseline STT-MRAM), which accounts for about 3% of the 4.4% energy overhead. The CiMXOR operation consumes higher energy than a standard read operation, mainly due to the charging of multiple WLs and a slightly higher SL current. However, since a CiM operation replaces two normal read operations, we also present the energy required for two reads in a standard STT-MRAM (last bar in Fig. 15). Note that an array-level comparison greatly understates the benefits of STT-CiM, since it does not consider the system-level impact of reduced data transfers between the processor and memory (a system-level evaluation is presented in Sections VII-B and VII-C). Nevertheless, it is worth noting that even at the array level, STT-CiM consumes 34.2% less energy than STT-MRAM. The benefits mainly arise from a lower BL dynamic energy (BitL), since only a single access to the memory array is required for STT-CiM.

Fig. 16. Array-level area evaluation of STT-CiM.

2) Area and Access Time: Fig. 16 shows the area breakdown for two STT-CiM designs that support vector operations of lengths 4 (VEC4) and 8 (VEC8). Compared with the STT-MRAM baseline, the area overheads for VEC4 and VEC8 are 14.2% and 16.6%, respectively. As shown in Fig. 16, peripheral circuits, ECC storage, and ECC logic are the causes of the area overhead (5%, 3.6%, and 3.2%, respectively). The peripheral circuits include the enhanced address decoder (1.8%), sense amplifier (0.9%), and RU (2.3%). Note that the total area is still dominated by the core array, which remains unchanged. Finally, the access time overhead for STT-CiM was found to be only ∼0.8%, because the WL and BL delays dominate the total memory access latency.

B. Application-Level Memory Energy

We next present the system-level memory energy benefits of using STT-CiM in the programmable processor-based system described in Fig. 14. We evaluated the total memory energy consumed by STT-CiM across the application benchmarks, and compared it with a baseline design that uses the standard STT-MRAM. Fig. 17 shows the breakdown of the different energy components, viz., read, write, and CiM, that contribute to the overall memory energy in both the STT-MRAM and proposed STT-CiM designs. In addition, it also shows the energy overheads due to near-memory corrections on failing CiM operations. The total memory energy for an application is normalized to the memory energy consumed by the baseline design. For the proposed STT-CiM design, we evaluated a version without vector operations (STT-CiM) and two versions with vector lengths of 4 and 8 (STT-CiM+VEC4 and STT-CiM+VEC8, respectively). Across all benchmarks, we observe average improvements in energy of 1.26 times, 2.77 times, and 3.83 times for STT-CiM, STT-CiM+VEC4, and STT-CiM+VEC8, respectively.

To provide further insights into the energy benefits, Fig. 19 presents a breakdown of the memory accesses made by each application into three categories: writes, reads that cannot be converted into CiM operations [CiM nonconvertible reads (CNC-Reads)], and CiM convertible reads (CC-Reads). We see that applications where CC-Reads dominate the total memory accesses [Knuth–Morris–Pratt (KMP), bit-blit (BLIT), generalized learning vector quantization (GLVQ), K-means clustering (KMEANS), optical character recognition (OCR), image segmentation (IMGSEG), multilayer perceptron (MLP), and support vector machines (SVMs) in Fig. 19] experience higher energy benefits from STT-CiM (see Fig. 17). Among these applications, those that benefit from vectorization achieve the highest savings (GLVQ, KMEANS, OCR, MLP, and SVM). Applications with relatively fewer CC-Reads or more frequent writes (AHC, LCS, RC4, and EDIST) exhibit relatively lower energy savings. CNC-Reads and writes do not benefit from STT-CiM, and writes, in particular, consume significantly (∼3 times) higher energy than reads. The energy overheads due to the additional writes incurred for data alignment in the Type II and Type III compute patterns were observed to be 0.8% and 0.3%, respectively.

C. System-Level Performance

Fig. 18 shows the speedup of the Nios II processor system integrated with STT-CiM across the various applications. The speedup shown in Fig. 18 is with respect to the baseline design, i.e., the processor system integrated with a standard STT-MRAM-based memory. As discussed in Section V-B, CiM lowers the total number of memory accesses as well as the number of instructions executed, which leads to performance benefits at the system level. Overall, for STT-CiM without vector operations, we observe performance benefits ranging from 1.07 times to 1.36 times. With vector operations, the average speedup increases to 3.25 times and 3.93 times for vector lengths of 4 and 8, respectively.


Fig. 17. Application-level memory energy.

Fig. 18. Application-level system performance.

Fig. 19. Memory access breakdown.

Fig. 20. Performance sensitivity to memory latency.

Comparing Figs. 18 and 19, we see that the factors that indicate higher energy savings for an application (a large fraction of memory accesses being CC-Reads, and the existence of opportunities for vectorization) are also predictive of higher performance improvements.

In order to demonstrate the performance sensitivity to memory latency, we vary the memory latency and evaluate the execution time for each application. Fig. 20 shows the results of this sensitivity analysis. The Y-axis shows the speedup of STT-CiM over STT-MRAM, and the X-axis the memory latency. We observe that STT-CiM yields higher performance benefits at higher memory latencies. This is attributed to the fact that the reduced number of memory accesses for STT-CiM has a larger impact on system performance. On average, we achieve a 1.13 times speedup for a memory latency of 1 cycle, and a 1.26 times speedup for a memory latency of 16 cycles, thereby illustrating the effectiveness of the proposed approach.

VIII. CONCLUSION

STT-MRAM is a promising candidate for future on-chip memories. In this paper, we proposed STT-CiM, an enhanced STT-MRAM that can perform a range of arithmetic, logic, and vector CiM operations. We addressed a key challenge associated with these in-memory operations, i.e., reliable computation under process variations. We utilized the proposed design (STT-CiM) as a scratchpad in the memory hierarchy of a programmable processor, and introduced ISA extensions and on-chip bus enhancements to support in-memory computations.


We proposed architectural optimizations and data mapping techniques to enhance the efficiency of STT-CiM. A device-to-architecture simulation framework was used to evaluate the benefits of STT-CiM. Our experiments indicate that STT-CiM achieves substantial improvements in energy and performance, and shows considerable promise in alleviating the processor-memory gap.

REFERENCES

[1] Everspin | The MRAM Company. Accessed: 2015. [Online]. Available: http://www.everspin.com/
[2] D. Apalkov et al., "Spin-transfer torque magnetic random access memory (STT-MRAM)," J. Emerg. Technol. Comput. Syst., vol. 9, no. 2, pp. 13:1–13:35, May 2013.
[3] Avalanche Technology—Enterprise Solid State Storage Arrays. Accessed: 2017. [Online]. Available: http://www.avalanche-technology.com/
[4] A. Jog et al., "Cache revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs," in Proc. 49th ACM/EDAC/IEEE Design Autom. Conf., Jun. 2012, pp. 243–252.
[5] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "Energy reduction for STT-RAM using early write termination," in Proc. Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2009, pp. 264–268.
[6] S. Chatterjee, M. Rasquinha, S. Yalamanchili, and S. Mukhopadhyay, "A scalable design methodology for energy minimization of STTRAM: A circuit and architecture perspective," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 809–817, May 2011.
[7] Y. Kim, S. K. Gupta, S. P. Park, G. Panagopoulos, and K. Roy, "Write-optimized reliable design of STT MRAM," in Proc. ACM/IEEE Int. Symp. Low Power Electron. Design (ISLPED), New York, NY, USA, Jul. 2012, pp. 3–8.
[8] H. Noguchi et al., "A 3.3 ns-access-time 71.2 μW/MHz 1 Mb embedded STT-MRAM using physically eliminated read-disturb scheme and normally-off memory architecture," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 1–3.
[9] S. P. Park, S. Gupta, N. Mojumder, A. Raghunathan, and K. Roy, "Future cache design using STT MRAMs for improved energy efficiency: Devices, circuits and architecture," in Proc. Design Autom. Conf., Jun. 2012, pp. 492–497.
[10] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan, "Relaxing non-volatility for fast and energy-efficient STT-RAM caches," in Proc. Int. Symp. High Perform. Comput. Archit., Feb. 2011, pp. 50–61.
[11] W. Xu, H. Sun, X. Wang, Y. Chen, and T. Zhang, "Design of last-level on-chip cache using spin-torque transfer RAM (STT RAM)," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 3, pp. 483–493, Mar. 2011.
[12] K.-W. Kwon, X. Fong, P. Wijesinghe, P. Panda, and K. Roy, "High-density and robust STT-MRAM array through device/circuit/architecture interactions," IEEE Trans. Nanotechnol., vol. 14, no. 6, pp. 1024–1034, Nov. 2015.
[13] B. D. Bel, J. Kim, C. H. Kim, and S. S. Sapatnekar, "Improving STT-MRAM density through multibit error correction," in Proc. Design, Autom. Test Europe Conf. Exhib. (DATE), Mar. 2014, pp. 1–6.
[14] W. Kang et al., "A low-cost built-in error correction circuit design for STT-MRAM reliability improvement," Microelectron. Rel., vol. 53, nos. 9–11, pp. 1224–1229, 2013.
[15] W. Kang et al., "Yield and reliability improvement techniques for emerging nonvolatile STT-MRAM," IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 5, no. 1, pp. 28–39, Mar. 2015.
[16] X. Fong, Y. Kim, S. H. Choday, and K. Roy, "Failure mitigation techniques for 1T-1MTJ spin-transfer torque MRAM bit-cells," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 2, pp. 384–395, Feb. 2014.
[17] G. S. Kar et al., "Co/Ni based p-MTJ stack for sub-20 nm high density stand alone and high performance embedded memory application," in IEDM Tech. Dig., Dec. 2014, pp. 19.1.1–19.1.4.
[18] A. Ranjan, S. Venkataramani, X. Fong, K. Roy, and A. Raghunathan, "Approximate storage for energy efficient spintronic memories," in Proc. 52nd Annu. Design Autom. Conf. (DAC), New York, NY, USA, Jun. 2015, pp. 195:1–195:6.
[19] A. K. Mishra, X. Dong, G. Sun, Y. Xie, N. Vijaykrishnan, and C. R. Das, "Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs," in Proc. 38th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2011, pp. 69–80.
[20] K. Lee and S. H. Kang, "Development of embedded STT-MRAM for mobile system-on-chips," IEEE Trans. Magn., vol. 47, no. 1, pp. 131–136, Jan. 2011.
[21] A. Nigam, C. W. Smullen, IV, V. Mohan, E. Chen, S. Gurumurthi, and M. R. Stan, "Delivering on the promise of universal memory for spin-transfer torque RAM (STT-RAM)," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, Aug. 2011, pp. 121–126.
[22] A. Jadidi, M. Arjomand, and H. Sarbazi-Azad, "High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, Aug. 2011, pp. 79–84.
[23] Y. Zhang et al., "Multi-level cell spin transfer torque MRAM based on stochastic switching," in Proc. 13th IEEE Int. Conf. Nanotechnol. (IEEE-NANO), Aug. 2013, pp. 233–236.
[24] J. Zhao and Y. Xie, "Optimizing bandwidth and power of graphics memory with hybrid memory technologies and adaptive data migration," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2012, pp. 81–87.
[25] W. Xu, Y. Chen, X. Wang, and T. Zhang, "Improving STT MRAM storage density through smaller-than-worst-case transistor sizing," in Proc. 46th ACM/IEEE Annu. Design Autom. Conf. (DAC), New York, NY, USA, Jul. 2009, pp. 87–90.
[26] M. Rasquinha, D. Choudhary, S. Chatterjee, S. Mukhopadhyay, and S. Yalamanchili, "An energy efficient cache design using spin torque transfer (STT) RAM," in Proc. ACM/IEEE Int. Symp. Low-Power Electron. Design (ISLPED), Aug. 2010, pp. 389–394.
[27] A. Aziz, N. Shukla, S. Datta, and S. K. Gupta, "COAST: Correlated material assisted STT MRAMs for optimized read operation," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Jul. 2015, pp. 1–6.
[28] S. Ikeda et al., "Tunnel magnetoresistance of 604% at 300 K by suppression of Ta diffusion in CoFeB/MgO/CoFeB pseudo-spin-valves annealed at high temperature," Appl. Phys. Lett., vol. 93, no. 8, p. 082508, 2008.
[29] W. Kang, L. Zhang, J.-O. Klein, Y. Zhang, D. Ravelosona, and W. Zhao, "Reconfigurable codesign of STT-MRAM under process variations in deeply scaled technology," IEEE Trans. Electron Devices, vol. 62, no. 6, pp. 1769–1777, Jun. 2015.
[30] N. N. Mojumder, X. Fong, C. Augustine, S. K. Gupta, S. H. Choday, and K. Roy, "Dual pillar spin-transfer torque MRAMs for low power applications," J. Emerg. Technol. Comput. Syst., vol. 9, no. 2, pp. 14:1–14:17, May 2013.
[31] D. Patterson et al., "Intelligent RAM (IRAM): Chips that remember and compute," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 1997, pp. 224–225.
[32] M. Oskin, F. T. Chong, and T. Sherwood, "Active pages: A computation model for intelligent memory," in Proc. 25th Annu. Int. Symp. Comput. Archit., Jun. 1998, pp. 192–203.
[33] E. Riedel, C. Faloutsos, G. A. Gibson, and D. Nagle, "Active disks for large-scale data processing," Computer, vol. 34, no. 6, pp. 68–74, Jun. 2001.
[34] J. Draper et al., "The architecture of the DIVA processing-in-memory chip," in Proc. ACM ICS, 2002, pp. 14–25.
[35] R. Nair et al., "Active memory cube: A processing-in-memory architecture for exascale systems," IBM J. Res. Develop., vol. 59, nos. 2–3, pp. 17:1–17:14, Mar./May 2015.
[36] B. Falsafi et al., "Near-memory data services," IEEE Micro, vol. 36, no. 1, pp. 6–13, Jan. 2016.
[37] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 380–392.
[38] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, "NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules," in Proc. IEEE 21st Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2015, pp. 283–295.
[39] S. H. Pugsley et al., "NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Mar. 2014, pp. 190–200.


[40] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, "TOP-PIM: Throughput-oriented programmable processing in memory," in Proc. ACM 23rd Int. Symp. High-Perform. Parallel Distrib. Comput. (HPDC), New York, NY, USA, Jun. 2014, pp. 85–98.
[41] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, "PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture," in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2015, pp. 336–348.
[42] Q. Zhu, K. Vaidyanathan, O. Shacham, M. Horowitz, L. Pileggi, and F. Franchetti, "Design automation framework for application-specific logic-in-memory blocks," in Proc. IEEE 23rd Int. Conf. Appl.-Specific Syst., Archit. Process., Jul. 2012, pp. 125–132.
[43] J. T. Pawlowski, "Hybrid memory cube (HMC)," in Proc. IEEE Hot Chips Symp. (HCS), vol. 23, Aug. 2011, pp. 1–24.
[44] D. U. Lee et al., "A 1.2 V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29 nm process and TSV," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 432–433.
[45] K. Pagiamtzis and A. Sheikholeslami, "Content-addressable memory (CAM) circuits and architectures: A tutorial and survey," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[46] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2014, pp. 8326–8330.
[47] J.-P. Wang and J. D. Harms, "General structure for computational random access memory (CRAM)," U.S. Patent 9 224 447 B2, Dec. 29, 2015.
[48] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, "Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories," in Proc. ACM 53rd Annu. Design Autom. Conf. (DAC), New York, NY, USA, Jun. 2016, pp. 173:1–173:6.
[49] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, "Logic design within memristive memories using memristor-aided loGIC (MAGIC)," IEEE Trans. Nanotechnol., vol. 15, no. 4, pp. 635–650, Jul. 2016.
[50] J. Reuben et al., "Memristive logic: A framework for evaluation and comparison," in Proc. IEEE Int. Symp. Power Timing Modeling, Optim. Simulation, Sep. 2017, pp. 1–8.
[51] V. Seshadri et al., "Fast bulk bitwise AND and OR in DRAM," IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 127–131, Jul./Dec. 2015.
[52] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, "AC-DIMM: Associative computing with STT-MRAM," in Proc. ACM
[62] M.-F. Chang et al., "Read circuits for resistive memory (ReRAM) and memristor-based nonvolatile logics," in Proc. 20th Asia South Pacific Design Autom. Conf., Jan. 2015, pp. 569–574.
[63] D. Lee, X. Fong, and K. Roy, "R-MRAM: A ROM-embedded STT MRAM cache," IEEE Electron Device Lett., vol. 34, no. 10, pp. 1256–1258, Oct. 2013.
[64] P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, and H. Noyes, "An efficient and scalable semiconductor architecture for parallel automata processing," IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 12, pp. 3088–3098, Dec. 2014.
[65] D. Strukov, "The area and latency tradeoffs of binary bit-parallel BCH decoders for prospective nanoelectronic memories," in Proc. 14th Asilomar Conf. Signals, Syst. Comput., Oct./Nov. 2006, pp. 1183–1187.
[66] X. Fong, S. H. Choday, P. Georgios, C. Augustine, and K. Roy, "SPICE models for magnetic tunnel junctions based on monodomain approximation," Tech. Rep., Aug. 2016. [Online]. Available: https://nanohub.org/resources/19048
[67] S. Ikeda et al., "A perpendicular-anisotropy CoFeB–MgO magnetic tunnel junction," Nature Mater., vol. 9, pp. 721–724, Jul. 2010.
[68] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0," in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Washington, DC, USA, Dec. 2007, pp. 3–14.

Shubham Jain received the B.Tech. degree (honors) in electronics and electrical communication engineering from IIT Kharagpur, Kharagpur, India, in 2012. He is currently working toward the Ph.D. degree at the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA.
He was with Qualcomm, Bengaluru, India, for two years. His current research interests include exploring circuit and architectural techniques for emerging post-CMOS devices and computing paradigms, such
40th Annu. Int. Symp. Comput. Archit. (ISCA), New York, NY, USA, as spintronics, approximate computing, and neuromorphic computing.
2013, pp. 189–200. Mr. Jain was a recipient of the Andrews Fellowship from Purdue University
[53] W. Kang, H. Wang, Z. Wang, Y. Zhang, and W. Zhao, “In-memory in 2014.
processing paradigm for bitwise logic operations in STT–MRAM,” IEEE
Trans. Magn., vol. 53, no. 11, Nov. 2017, Art. no. 6202404.
[54] J. Zhang, Z. Wang, and N. Verma, “A machine-learning classifier
implemented in a standard 6T SRAM array,” in Proc. IEEE Symp. VLSI
Circuits (VLSI-Circuits), Jun. 2016, pp. 1–2.
[55] X. Liu et al., “RENO: A high-efficient reconfigurable neuromorphic
computing accelerator design,” in Proc. 52nd ACM/EDAC/IEEE Design
Autom. Conf. (DAC), Jun. 2015, pp. 1–6.
[56] S. G. Ramasubramanian, R. Venkatesan, M. Sharad, K. Roy, and
A. Raghunathan, “SPINDLE: SPINtronic deep learning engine for large-
scale neuromorphic computing,” in Proc. IEEE/ACM Int. Symp. Low
Power Electron. Design (ISLPED), Aug. 2014, pp. 15–20.
Ashish Ranjan received the B.Tech. degree in
[57] P. Chi et al., “PRIME: A novel processing-in-memory architecture electronics engineering from IIT (BHU) Varanasi,
for neural network computation in ReRAM-based main memory,” in Varanasi, India, in 2009. He is currently working
Proc. 43rd Int. Symp. Comput. Archit. (ISCA), Piscataway, NJ, USA, toward the Ph.D. degree at the School of Electrical
Jun. 2016, pp. 27–39. and Computer Engineering, Purdue University, West
[58] Nios II Processor, Intel Corp., Mountain View, CA, USA, 2017. Lafayette, IN, USA.
[59] T. Hanyu, “Challenge of MTJ/MOS-hybrid logic-in-memory architecture His industry experience includes three years as a
for nonvolatile VLSI processor,” in Proc. IEEE Int. Symp. Circuits Senior Member Technical Staff with the Design Cre-
Syst. (ISCAS), May 2013, pp. 117–120. ation Division, Mentor Graphics Corporation, Noida,
[60] M. Natsui et al., “Nonvolatile logic-in-memory LSI using cycle-based India. His current research interests include circuit-
power gating and its application to motion-vector prediction,” IEEE architecture codesign for emerging technologies and
J. Solid-State Circuits, vol. 50, no. 2, pp. 476–489, Feb. 2015. approximate computing.
[61] S. Matsunaga et al., “MTJ-based nonvolatile logic-in-memory circuit, Mr. Ranjan received the University Gold Medal for his academic perfor-
future prospects and issues,” in Proc. Conf. Design, Autom. Test Europe, mance by IIT (BHU) Varanasi in 2009 and the Andrews Fellowship from
2009, pp. 433–435. Purdue University in 2012.


Kaushik Roy (F’01) received the B.Tech. degree in electronics and electrical communications engineering from IIT Kharagpur, Kharagpur, India, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 1990.
He was with the Semiconductor Process and Design Center, Texas Instruments, Dallas, TX, USA, where he was involved in field-programmable gate array architecture development and low-power circuit design. He was a Faculty Scholar at Purdue University, West Lafayette, IN, USA, from 1998 to 2003. He was a Research Visionary Board Member of Motorola Labs in 2002. He held the M. K. Gandhi Distinguished Visiting Faculty position at IIT Bombay, Mumbai, India. He joined the Electrical and Computer Engineering Faculty, Purdue University, West Lafayette, IN, USA, in 1993, where he is currently an Edward G. Tiedemann Jr. Distinguished Professor. He has authored over 600 papers in refereed journals and conferences, holds 15 patents, has graduated 60 Ph.D. students, and is a coauthor of two books: Low-Power CMOS VLSI Circuit Design (New York, NY, USA: Wiley, 2009) and Low Voltage, Low Power VLSI Subsystems (New York, NY, USA: McGraw-Hill, 2005). His current research interests include spintronics, device-circuit codesign for nanoscale silicon and nonsilicon technologies, low-power electronics for portable computing and wireless communications, and new computing models enabled by emerging technologies.
Dr. Roy received the U.S. National Science Foundation Career Development Award in 1995, the IBM Faculty Partnership Award, the AT&T/Lucent Foundation Award, the 2005 SRC Technical Excellence Award, the SRC Inventors Award, the Purdue College of Engineering Research Excellence Award, the Humboldt Research Award in 2010, the 2010 IEEE Circuits and Systems Society Technical Achievement Award, the Distinguished Alumnus Award from IIT Kharagpur, the Fulbright-Nehru Distinguished Chair, best paper awards at the 1997 International Test Conference, the IEEE 2000 International Symposium on Quality of IC Design, the 2003 IEEE Latin American Test Workshop, the 2003 IEEE Nano, the 2004 IEEE International Conference on Computer Design, and the 2006 IEEE/ACM International Symposium on Low Power Electronics & Design, the 2005 IEEE Circuits and Systems Society Outstanding Young Author Award (Chris Kim), the 2006 IEEE Transactions on VLSI Systems Best Paper Award, the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design Best Paper Award, and the 2013 IEEE Transactions on VLSI Best Paper Award. He has been on the Editorial Board of IEEE Design and Test, the IEEE Transactions on Circuits and Systems, the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, and the IEEE Transactions on Electron Devices. He was the Guest Editor for the Special Issue on Low-Power VLSI in IEEE Design and Test in 1994, the IEEE Transactions on Very Large Scale Integration (VLSI) Systems in 2000, the IEE Proceedings-Computers and Digital Techniques in 2002, and the IEEE Journal on Emerging and Selected Topics in Circuits and Systems in 2011.

Anand Raghunathan (F’12) received the B.Tech. degree from IIT Madras, Chennai, India, and the M.A. and Ph.D. degrees from Princeton University, Princeton, NJ, USA.
He was a Senior Research Staff Member with NEC Laboratories America, Princeton, NJ, USA, where he led projects on system-on-chip architecture and design methodology. He has also held the Gopalakrishnan Visiting Chair with the Department of Computer Science and Engineering, IIT Madras, Chennai, India. He is currently a Professor of Electrical and Computer Engineering and the Chair of the VLSI area at Purdue University, West Lafayette, IN, USA, where he directs the research with the Integrated Systems Laboratory. He has coauthored a book, eight book chapters, and over 200 refereed journal and conference papers, and holds 21 U.S. patents. His current research interests include domain-specific architecture, system-on-chip design, computing with post-CMOS devices, and heterogeneous parallel computing.
Dr. Raghunathan is a Golden Core Member of the IEEE Computer Society. He has been a member of the technical program and organizing committees of several leading conferences and workshops. His publications received eight best paper awards and five best paper nominations. He received a Patent of the Year Award and two technology commercialization awards from NEC. He received the IEEE Meritorious Service Award and the Outstanding Service Award. He was chosen among the Massachusetts Institute of Technology (MIT) TR35 (top 35 innovators under 35 years across various disciplines of science and technology) in 2006. He chaired premier IEEE/ACM conferences [International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), International Symposium on Low Power Electronics and Design (ISLPED), VLSI Test Symposium (VTS), and VLSI Design]. He served on the editorial boards of various IEEE and ACM journals in his areas of interest.

