Tvlsi 18 Computing-In-Memory With STT-MRAM
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 3, MARCH 2018
Abstract— In-memory computing is a promising approach to addressing the processor-memory data transfer bottleneck in computing systems. We propose spin-transfer torque compute-in-memory (STT-CiM), a design for in-memory computing with spin-transfer torque magnetic RAM (STT-MRAM). The unique properties of spintronic memory allow multiple wordlines within an array to be simultaneously enabled, opening up the possibility of directly sensing functions of the values stored in multiple rows using a single access. We propose modifications to STT-MRAM peripheral circuits that leverage this principle to perform logic, arithmetic, and complex vector operations. We address the challenge of reliable in-memory computing under process variations by extending error-correction code schemes to detect and correct errors that occur during CiM operations. We also address the question of how STT-CiM should be integrated within a general-purpose computing system. To this end, we propose architectural enhancements to processor instruction sets and on-chip buses that enable STT-CiM to be utilized as a scratchpad memory. Finally, we present data mapping techniques to increase the effectiveness of STT-CiM. We evaluate STT-CiM using a device-to-architecture modeling framework, and integrate cycle-accurate models of STT-CiM with a commercial processor and on-chip bus (Nios II and Avalon from Intel). Our system-level evaluation shows that STT-CiM provides system-level performance improvements of 3.93 times on average (up to 10.4 times), and concurrently reduces memory system energy by 3.83 times on average (up to 12.4 times).

Index Terms— In-memory computing, processing-in-memory, spin-transfer torque magnetic RAM (STT-MRAM), spintronic memories.

I. INTRODUCTION

THE growth in data processed and increase in the number of cores place high demands on the memory systems of modern computing platforms. Consequently, a growing fraction of transistors, area, and power are utilized toward memories. CMOS memories (SRAM and embedded DRAM) have been the mainstays of memory design for the past several decades. However, recent technology scaling challenges in CMOS memories, along with an increased demand for memory capacity and performance, have fueled an active interest in alternative memory technologies.

Spintronic memories have emerged as a promising candidate for future memories due to several desirable attributes, such as nonvolatility, high density, and near-zero leakage. In particular, spin-transfer torque magnetic RAM (STT-MRAM) has garnered significant interest with various prototype demonstrations and early commercial offerings [1]–[3]. There have been several research efforts to boost the efficiency of STT-MRAM at the device, circuit, and architectural levels [4]–[30].

In this paper, we explore in-memory computing with STT-MRAM. By exploiting the ability to simultaneously enable multiple wordlines (WLs) within a memory array, we enhance STT-MRAM arrays to perform a range of arithmetic, logic, and vector operations. We propose circuit and architectural techniques for reliable computation under process variations and to enable the proposed design to be used in a programmable processor-based system.

In-memory computing is motivated by the observation that the movement of data from bit-cells in the memory to the processor and back (across the bitlines, memory interface, and system interconnect) is a major performance and energy bottleneck in computing systems. Efforts that have explored the closer integration of logic and memory are variedly referred to in the literature as logic-in-memory, computing-in-memory, and processing-in-memory. These efforts may be classified into two categories: moving logic closer to memory, or near-memory computing [31]–[44], and performing computations within memory structures, or in-memory computing [45]–[57], which is the focus of this paper. In-memory computing reduces the number of memory accesses and the amount of data transferred between processor and memory, and exploits the wider internal bandwidth available within memory systems.

Our proposal is based on the observation that by enabling multiple WLs simultaneously¹ and sensing the effective resistance of each bitline (BL), it is possible to directly compute logic functions of the values stored in the bit-cells. Based on this insight, we propose spin-transfer torque compute-in-memory (STT-CiM), a design for in-memory computing with STT-MRAM that can perform a range of arithmetic, logic, and vector operations. In STT-CiM, the core data array is the same as standard STT-MRAM; hence, memory density and the efficiency of read and write operations are maintained. Reliable sensing under the limited tunneling magnetoresistance (TMR) of STT-MRAM bit-cells is known to be a challenge [12]–[16], [29], and we show that this challenge is

Manuscript received April 5, 2017; revised July 13, 2017; accepted August 15, 2017. Date of publication December 28, 2017; date of current version February 22, 2018. This work was supported in part by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and in part by the National Science Foundation under grant 1320808. (Corresponding author: Shubham Jain.)
The authors are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47906 USA (e-mail: jain130@purdue.edu; [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2017.2776954

¹ Note that this is much easier in STT-MRAM than in CMOS memories, due to the resistive nature of the bit-cells.
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on February 27,2021 at 15:39:31 UTC from IEEE Xplore. Restrictions apply.
JAIN et al.: COMPUTING IN MEMORY WITH STT-MRAM 471
472 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 3, MARCH 2018
TABLE I
POSSIBLE OUTPUTS OF VARIOUS SENSING SCHEMES

Fig. 3. STT-CiM: principle of operation. (a) Resistive equivalent. (b) Bit-cell currents for read operation. (c) CiM operation. (d) SL currents for CiM operation.

Fig. 4. STT-CiM sensing schemes. (a) Bitwise OR sensing scheme. (b) Bitwise AND sensing scheme. (c) Reference currents.

1) Bitwise OR (NOR): In order to realize logic OR and NOR operations, we use the sensing scheme shown in Fig. 4(a), where $I_{SL}$ is connected to the positive input of the sense amplifier and a reference current $I_{\text{ref-OR}}$ is fed to its negative input. We choose $I_{\text{ref-OR}}$ to be between $I_{\text{AP-AP}}$ and $I_{\text{AP-P}}$, as shown in Fig. 4(c). As a result, among the possible values of $I_{SL}$ [see Fig. 3(d)], only $I_{SL} = I_{\text{AP-AP}}$ is less than $I_{\text{ref-OR}}$. Consequently, only the case where both bit-cells are in the AP configuration, i.e., both store "0," leads to an output of logic "0" ("1") at the positive (negative) output of the sense amplifier, while all other cases lead to logic "1" ("0"). Thus, the positive and negative outputs of the sense amplifier evaluate the logic OR and NOR of the values stored in the enabled bit-cells.

2) Bitwise AND (NAND): A bitwise AND (NAND) operation is realized at the positive (negative) terminal of the sense amplifier by using the sensing scheme shown in Fig. 4(b). Note that in this scheme, a different reference current ($I_{\text{ref-AND}}$) is fed to the sense amplifier.

3) Bitwise XOR: A bitwise XOR operation is realized when the two sensing schemes shown in Fig. 4 are used in tandem, and $O_{\text{AND}}$ and $O_{\text{NOR}}$ are fed to a CMOS NOR gate. In other words, $O_{\text{XOR}} = O_{\text{AND}} \text{ NOR } O_{\text{NOR}}$.

Table I summarizes the logic operations achieved using the two sensing schemes discussed earlier. Note that all the above-described logic operations are symmetric in nature, and hence, it is not necessary to distinguish between the cases where the two bit-cells connected to a BL store "10" and "01."

4) ADD Operation: An ADD operation is realized by leveraging the ability to concurrently perform multiple bitwise logical operations, as illustrated in Fig. 5. Suppose $A_n$ and $B_n$ (the nth bits of two words, A and B) are stored in two different bit-cells of the same column within an STT-CiM array. Suppose that we wish to compute the full-adder logic function (the nth stage of an adder that adds words A and B). As shown in Fig. 5, $S_n$ (the sum) and $C_n$ (the carry out) can be computed using $A_n$ XOR $B_n$ and $A_n$ AND $B_n$, in addition to $C_{n-1}$ (carry input from the previous stage). Fig. 5 also expresses the ADD operation in terms of the outputs of bitwise operations, $O_{\text{AND}}$ and $O_{\text{XOR}}$. Three additional logic gates are required to enable this computation. Note that the sensing schemes discussed enable us to perform the bitwise XOR and AND operations simultaneously, thereby performing an ADD operation with a single array access.

B. STT-CiM Array

In this section, we present the array-level design of STT-CiM using the above-described circuit-level techniques. As shown in Fig. 6, the proposed STT-CiM memory array takes an additional input CiMType that indicates the type of CiM operation that needs to be performed for every memory access. The CiM decoder interprets this input and generates appropriate control signals to perform the desired logic operation. In order to enable CiM operations, the read peripheral circuits present in each column (sensing circuit and global reference generation circuit in Fig. 6) are enhanced, while the core data array remains the same as in the standard STT-MRAM. The address (row) decoder needs to enable multiple WLs for CiM operations. Specifically, we utilize two address decoders, with each decoding the corresponding input address. The corresponding outputs of the decoders are ORed and connected to each WL. This configuration allows any of the two decoders to activate random WL locations. While the row decoder overhead is roughly doubled, it represents a small fraction of total area and power for configurations involving large arrays (1.8% in our evaluation). The write peripheral
TABLE III
EXAMPLES OF REDUCTION OPERATIONS
and on-chip bus enhancements to support in-memory computations. We proposed architectural optimizations and data mapping techniques to enhance the efficiency of STT-CiM. A device-to-architecture simulation framework was used to evaluate the benefits of STT-CiM. Our experiments indicate that STT-CiM achieves substantial improvements in energy and performance, and shows considerable promise in alleviating the processor-memory gap.

REFERENCES

[1] Everspin | The MRAM Company. Accessed: 2015. [Online]. Available: http://www.everspin.com/
[2] D. Apalkov et al., “Spin-transfer torque magnetic random access memory (STT-MRAM),” J. Emerg. Technol. Comput. Syst., vol. 9, no. 2, pp. 13:1–13:35, May 2013.
[3] Avalanche Technology—Enterprise Solid State Storage Arrays. Accessed: 2017. [Online]. Available: http://www.avalanche-technology.com/
[4] A. Jog et al., “Cache revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs,” in Proc. 49th ACM/EDAC/IEEE Design Autom. Conf., Jun. 2012, pp. 243–252.
[5] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, “Energy reduction for STT-RAM using early write termination,” in Proc. Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2009, pp. 264–268.
[6] S. Chatterjee, M. Rasquinha, S. Yalamanchili, and S. Mukhopadhyay, “A scalable design methodology for energy minimization of STTRAM: A circuit and architecture perspective,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 809–817, May 2011.
[7] Y. Kim, S. K. Gupta, S. P. Park, G. Panagopoulos, and K. Roy, “Write-optimized reliable design of STT MRAM,” in Proc. ACM/IEEE Int. Symp. Low Power Electron. Design (ISLPED), New York, NY, USA, Jul. 2012, pp. 3–8.
[8] H. Noguchi et al., “A 3.3 ns-access-time 71.2 μW/MHz 1 Mb embedded STT-MRAM using physically eliminated read-disturb scheme and normally-off memory architecture,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 1–3.
[9] S. P. Park, S. Gupta, N. Mojumder, A. Raghunathan, and K. Roy, “Future cache design using STT MRAMs for improved energy efficiency: Devices, circuits and architecture,” in Proc. Design Autom. Conf., Jun. 2012, pp. 492–497.
[10] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan, “Relaxing non-volatility for fast and energy-efficient STT-RAM caches,” in Proc. Int. Symp. High Perform. Comput. Archit., Feb. 2011, pp. 50–61.
[11] W. Xu, H. Sun, X. Wang, Y. Chen, and T. Zhang, “Design of last-level on-chip cache using spin-torque transfer RAM (STT RAM),” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 3, pp. 483–493, Mar. 2011.
[12] K.-W. Kwon, X. Fong, P. Wijesinghe, P. Panda, and K. Roy, “High-density and robust STT-MRAM array through device/circuit/architecture interactions,” IEEE Trans. Nanotechnol., vol. 14, no. 6, pp. 1024–1034, Nov. 2015.
[13] B. D. Bel, J. Kim, C. H. Kim, and S. S. Sapatnekar, “Improving STT-MRAM density through multibit error correction,” in Proc. Design, Autom. Test Europe Conf. Exhib. (DATE), Mar. 2014, pp. 1–6.
[14] W. Kang et al., “A low-cost built-in error correction circuit design for STT-MRAM reliability improvement,” Microelectron. Rel., vol. 53, nos. 9–11, pp. 1224–1229, 2013.
[15] W. Kang et al., “Yield and reliability improvement techniques for emerging nonvolatile STT-MRAM,” IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 5, no. 1, pp. 28–39, Mar. 2015.
[16] X. Fong, Y. Kim, S. H. Choday, and K. Roy, “Failure mitigation techniques for 1T-1MTJ spin-transfer torque MRAM bit-cells,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 2, pp. 384–395, Feb. 2014.
[17] G. S. Kar et al., “Co/Ni based p-MTJ stack for sub-20 nm high density stand alone and high performance embedded memory application,” in IEDM Tech. Dig., Dec. 2014, pp. 19.1.1–19.1.4.
[18] A. Ranjan, S. Venkataramani, X. Fong, K. Roy, and A. Raghunathan, “Approximate storage for energy efficient spintronic memories,” in Proc. 52nd Annu. Design Autom. Conf. (DAC), New York, NY, USA, Jun. 2015, pp. 195:1–195:6.
[19] A. K. Mishra, X. Dong, G. Sun, Y. Xie, N. Vijaykrishnan, and C. R. Das, “Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs,” in Proc. 38th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2011, pp. 69–80.
[20] K. Lee and S. H. Kang, “Development of embedded STT-MRAM for mobile system-on-chips,” IEEE Trans. Magn., vol. 47, no. 1, pp. 131–136, Jan. 2011.
[21] A. Nigam, C. W. Smullen, IV, V. Mohan, E. Chen, S. Gurumurthi, and M. R. Stan, “Delivering on the promise of universal memory for spin-transfer torque RAM (STT-RAM),” in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, Aug. 2011, pp. 121–126.
[22] A. Jadidi, M. Arjomand, and H. Sarbazi-Azad, “High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement,” in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, Aug. 2011, pp. 79–84.
[23] Y. Zhang et al., “Multi-level cell spin transfer torque MRAM based on stochastic switching,” in Proc. 13th IEEE Int. Conf. Nanotechnol. (IEEE-NANO), Aug. 2013, pp. 233–236.
[24] J. Zhao and Y. Xie, “Optimizing bandwidth and power of graphics memory with hybrid memory technologies and adaptive data migration,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2012, pp. 81–87.
[25] W. Xu, Y. Chen, X. Wang, and T. Zhang, “Improving STT MRAM storage density through smaller-than-worst-case transistor sizing,” in Proc. 46th ACM/IEEE Annu. Design Autom. Conf. (DAC), New York, NY, USA, Jul. 2009, pp. 87–90.
[26] M. Rasquinha, D. Choudhary, S. Chatterjee, S. Mukhopadhyay, and S. Yalamanchili, “An energy efficient cache design using spin torque transfer (STT) RAM,” in Proc. ACM/IEEE Int. Symp. Low-Power Electron. Design (ISLPED), Aug. 2010, pp. 389–394.
[27] A. Aziz, N. Shukla, S. Datta, and S. K. Gupta, “COAST: Correlated material assisted STT MRAMs for optimized read operation,” in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Jul. 2015, pp. 1–6.
[28] S. Ikeda et al., “Tunnel magnetoresistance of 604% at 300 K by suppression of Ta diffusion in CoFeB/MgO/CoFeB pseudo-spin-valves annealed at high temperature,” Appl. Phys. Lett., vol. 93, no. 8, p. 082508, 2008.
[29] W. Kang, L. Zhang, J.-O. Klein, Y. Zhang, D. Ravelosona, and W. Zhao, “Reconfigurable codesign of STT-MRAM under process variations in deeply scaled technology,” IEEE Trans. Electron Devices, vol. 62, no. 6, pp. 1769–1777, Jun. 2015.
[30] N. N. Mojumder, X. Fong, C. Augustine, S. K. Gupta, S. H. Choday, and K. Roy, “Dual pillar spin-transfer torque MRAMs for low power applications,” J. Emerg. Technol. Comput. Syst., vol. 9, no. 2, pp. 14:1–14:17, May 2013.
[31] D. Patterson et al., “Intelligent RAM (IRAM): Chips that remember and compute,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 1997, pp. 224–225.
[32] M. Oskin, F. T. Chong, and T. Sherwood, “Active pages: A computation model for intelligent memory,” in Proc. 25th Annu. Int. Symp. Comput. Archit., Jun. 1998, pp. 192–203.
[33] E. Riedel, C. Faloutsos, G. A. Gibson, and D. Nagle, “Active disks for large-scale data processing,” Computer, vol. 34, no. 6, pp. 68–74, Jun. 2001.
[34] J. Draper et al., “The architecture of the DIVA processing-in-memory chip,” in Proc. ACM ICS, 2002, pp. 14–25.
[35] R. Nair et al., “Active memory cube: A processing-in-memory architecture for exascale systems,” IBM J. Res. Develop., vol. 59, nos. 2–3, pp. 17:1–17:14, Mar./May 2015.
[36] B. Falsafi et al., “Near-memory data services,” IEEE Micro, vol. 36, no. 1, pp. 6–13, Jan. 2016.
[37] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 380–392.
[38] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules,” in Proc. IEEE 21st Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2015, pp. 283–295.
[39] S. H. Pugsley et al., “NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads,” in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Mar. 2014, pp. 190–200.
[40] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “TOP-PIM: Throughput-oriented programmable processing in memory,” in Proc. ACM 23rd Int. Symp. High-Perform. Parallel Distrib. Comput. (HPDC), New York, NY, USA, Jun. 2014, pp. 85–98.
[41] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,” in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2015, pp. 336–348.
[42] Q. Zhu, K. Vaidyanathan, O. Shacham, M. Horowitz, L. Pileggi, and F. Franchetti, “Design automation framework for application-specific logic-in-memory blocks,” in Proc. IEEE 23rd Int. Conf. Appl.-Specific Syst., Archit. Process., Jul. 2012, pp. 125–132.
[43] J. T. Pawlowski, “Hybrid memory cube (HMC),” in Proc. IEEE Hot Chips Symp. (HCS), vol. 23. Aug. 2011, pp. 1–24.
[44] D. U. Lee et al., “A 1.2 V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29 nm process and TSV,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 432–433.
[45] K. Pagiamtzis and A. Sheikholeslami, “Content-addressable memory (CAM) circuits and architectures: A tutorial and survey,” IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[46] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, “An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2014, pp. 8326–8330.
[47] J.-P. Wang and J. D. Harms, “General structure for computational random access memory (CRAM),” U.S. Patent 9 224 447 B2, Dec. 29, 2015.
[48] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories,” in Proc. ACM 53rd Annu. Design Autom. Conf. (DAC), New York, NY, USA, Jun. 2016, pp. 173:1–173:6.
[49] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, “Logic design within memristive memories using memristor-aided loGIC (MAGIC),” IEEE Trans. Nanotechnol., vol. 15, no. 4, pp. 635–650, Jul. 2016.
[50] J. Reuben et al., “Memristive logic: A framework for evaluation and comparison,” in Proc. IEEE Int. Symp. Power Timing Modeling, Optim. Simulation, Sep. 2017, pp. 1–8.
[51] V. Seshadri et al., “Fast bulk bitwise AND and OR in DRAM,” IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 127–131, Jul./Dec. 2015.
[52] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, “AC-DIMM: Associative Computing with STT-MRAM,” in Proc. ACM 40th Annu. Int. Symp. Comput. Archit. (ISCA), New York, NY, USA, 2013, pp. 189–200.
[53] W. Kang, H. Wang, Z. Wang, Y. Zhang, and W. Zhao, “In-memory processing paradigm for bitwise logic operations in STT–MRAM,” IEEE Trans. Magn., vol. 53, no. 11, Nov. 2017, Art. no. 6202404.
[54] J. Zhang, Z. Wang, and N. Verma, “A machine-learning classifier implemented in a standard 6T SRAM array,” in Proc. IEEE Symp. VLSI Circuits (VLSI-Circuits), Jun. 2016, pp. 1–2.
[55] X. Liu et al., “RENO: A high-efficient reconfigurable neuromorphic computing accelerator design,” in Proc. 52nd ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2015, pp. 1–6.
[56] S. G. Ramasubramanian, R. Venkatesan, M. Sharad, K. Roy, and A. Raghunathan, “SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing,” in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2014, pp. 15–20.
[57] P. Chi et al., “PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” in Proc. 43rd Int. Symp. Comput. Archit. (ISCA), Piscataway, NJ, USA, Jun. 2016, pp. 27–39.
[58] Nios II Processor, Intel Corp., Mountain View, CA, USA, 2017.
[59] T. Hanyu, “Challenge of MTJ/MOS-hybrid logic-in-memory architecture for nonvolatile VLSI processor,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, pp. 117–120.
[60] M. Natsui et al., “Nonvolatile logic-in-memory LSI using cycle-based power gating and its application to motion-vector prediction,” IEEE J. Solid-State Circuits, vol. 50, no. 2, pp. 476–489, Feb. 2015.
[61] S. Matsunaga et al., “MTJ-based nonvolatile logic-in-memory circuit, future prospects and issues,” in Proc. Conf. Design, Autom. Test Europe, 2009, pp. 433–435.
[62] M.-F. Chang et al., “Read circuits for resistive memory (ReRAM) and memristor-based nonvolatile logics,” in Proc. 20th Asia South Pacific Design Autom. Conf., Jan. 2015, pp. 569–574.
[63] D. Lee, X. Fong, and K. Roy, “R-MRAM: A ROM-embedded STT MRAM cache,” IEEE Electron Device Lett., vol. 34, no. 10, pp. 1256–1258, Oct. 2013.
[64] P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, and H. Noyes, “An efficient and scalable semiconductor architecture for parallel automata processing,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 12, pp. 3088–3098, Dec. 2014.
[65] D. Strukov, “The area and latency tradeoffs of binary bit-parallel BCH decoders for prospective nanoelectronic memories,” in Proc. 14th Asilomar Conf. Signals, Syst. Comput., Oct./Nov. 2006, pp. 1183–1187.
[66] X. Fong, S. H. Choday, P. Georgios, C. Augustine, and K. Roy, “Spice models for magnetic tunnel junctions based on monodomain approximation,” Tech. Rep., Aug. 2016. [Online]. Available: https://nanohub.org/resources/19048
[67] S. Ikeda et al., “A perpendicular-anisotropy CoFeB–MgO magnetic tunnel junction,” Nature Mater., vol. 9, pp. 721–724, Jul. 2010.
[68] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Washington, DC, USA, Dec. 2007, pp. 3–14.

Shubham Jain received the B.Tech. degree (honors) in electronics and electrical communication engineering from IIT Kharagpur, Kharagpur, India, in 2012. He is currently working toward the Ph.D. degree at the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA.
He was with Qualcomm, Bengaluru, India, for two years. His current research interests include exploring circuit and architectural techniques for emerging post-CMOS devices and computing paradigms, such as spintronics, approximate computing, and neuromorphic computing.
Mr. Jain was a recipient of the Andrews Fellowship from Purdue University in 2014.

Ashish Ranjan received the B.Tech. degree in electronics engineering from IIT (BHU) Varanasi, Varanasi, India, in 2009. He is currently working toward the Ph.D. degree at the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA.
His industry experience includes three years as a Senior Member Technical Staff with the Design Creation Division, Mentor Graphics Corporation, Noida, India. His current research interests include circuit-architecture codesign for emerging technologies and approximate computing.
Mr. Ranjan received the University Gold Medal for his academic performance by IIT (BHU) Varanasi in 2009 and the Andrews Fellowship from Purdue University in 2012.
Kaushik Roy (F’01) received the B.Tech. degree in electronics and electrical communications engineering from IIT Kharagpur, Kharagpur, India and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 1990.
He was with the Semiconductor Process and Design Center, Texas Instruments, Dallas, TX, USA, where he was involved in field-programmable gate array architecture development and low-power circuit design. He was a Faculty Scholar at Purdue University, West Lafayette, IN, USA, from 1998 to 2003. He was a Research Visionary Board Member of Motorola Labs in 2002. He held the M. K. Gandhi Distinguished Visiting Faculty, IIT Bombay, Mumbai, India. He joined the Electrical and Computer Engineering Faculty, Purdue University, West Lafayette, IN, USA, in 1993, where he is currently an Edward G. Tiedemann Jr. Distinguished Professor. He has authored over 600 papers in refereed journals and conferences, holds 15 patents, graduated 60 Ph.D. students, and is a coauthor of two books: Low-Power CMOS VLSI Circuit Design (New York, NY, USA: Wiley, 2009) and Low Voltage, Low Power VLSI Subsystems (New York, NY, USA: McGraw-Hill, 2005). His current research interests include spintronics, device-circuit codesign for nanoscale silicon and nonsilicon technologies, low-power electronics for portable computing and wireless communications, and new computing models enabled by emerging technologies.
Dr. Roy received the U.S. National Science Foundation Career Development Award in 1995, the IBM Faculty Partnership Award, the ATT/Lucent Foundation Award, the 2005 SRC Technical Excellence Award, the SRC Inventors Award, the Purdue College of Engineering Research Excellence Award, the Humboldt Research Award in 2010, the 2010 IEEE Circuits and Systems Society Technical Achievement Award, the Distinguished Alumnus Award from IIT Kharagpur, the Fulbright-Nehru Distinguished Chair, and best paper awards at the 1997 International Test Conference, the IEEE 2000 International Symposium on Quality of IC Design, the 2003 IEEE Latin American Test Workshop, the 2003 IEEE Nano, the 2004 IEEE International Conference on Computer Design, the 2006 IEEE/ACM International Symposium on Low Power Electronics & Design, and the 2005 IEEE Circuits and System Society Outstanding Young Author Award (Chris Kim), the 2006 IEEE Transactions on VLSI Systems Best Paper Award, the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design Best Paper Award, the 2013 IEEE Transactions on VLSI Best Paper Award. He has been on the Editorial Board of IEEE DESIGN AND TEST, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the IEEE TRANSACTIONS ON ELECTRON DEVICES. He was the Guest Editor for the Special Issue on Low-Power VLSI in IEEE DESIGN AND TEST in 1994 and the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS in 2000, the IEE Proceedings-Computers and Digital Techniques in 2002, and the IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS in 2011.

Anand Raghunathan (F’12) received the B.Tech. degree from IIT Madras, Chennai, India and the M.A. and Ph.D. degrees from Princeton University, Princeton, NJ, USA.
He was a Senior Research Staff Member with NEC Laboratories America, Princeton, NJ, USA, where he led projects on system-on-chip architecture and design methodology. He has also held the Gopalakrishnan Visiting Chair with the Department of Computer Science and Engineering, IIT Madras, Chennai, India. He is currently a Professor of Electrical and Computer Engineering and the Chair of the VLSI area at Purdue University, West Lafayette, IN, USA, where he directs the research with the Integrated Systems Laboratory. He has coauthored a book, eight book chapters, and over 200 refereed journal and conference papers, and holds 21 U.S. patents. His current research interests include domain-specific architecture, system-on-chip design, computing with post-CMOS devices, and heterogeneous parallel computing.
Dr. Raghunathan is a Golden Core Member of the IEEE Computer Society. He has been a member of the technical program and organizing committees of several leading conferences and workshops. His publications received eight best paper awards and five best paper nominations. He received a Patent of the Year Award and two technology commercialization awards from NEC. He received the IEEE Meritorious Service Award and the Outstanding Service Award. He was chosen among the Massachusetts Institute of Technology (MIT) TR35 (top 35 innovators under 35 years across various disciplines of science and technology) in 2006. He chaired premier IEEE/ACM conferences [International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), International Symposium on Low Power Electronics and Design (ISLPED), VLSI Test Symposium (VTS), and VLSI Design]. He served on the editorial boards of various IEEE and ACM journals in his areas of interest.