Received 5 April 2022; revised 20 August 2022; accepted 25 September 2022.

Date of publication 21 October 2022; date of current version 7 June 2023.


Digital Object Identifier 10.1109/TETC.2022.3214833

A Survey of MRAM-Centric Computing: From Near Memory to In Memory
YUETING LI, TIANSHUO BAI, XINYI XU, YUNDONG ZHANG, BI WU (Member, IEEE),
HAO CAI (Senior Member, IEEE), BIAO PAN (Member, IEEE), AND WEISHENG ZHAO (Fellow, IEEE)
Yueting Li, Tianshuo Bai, Xinyi Xu, Biao Pan, and Weisheng Zhao are with the School of Integrated Circuit Science and Engineering, Beihang University, Beijing
100191, China, and also with Fert Beijing Institute, Beihang University, Beijing 100191, China, and also with the MIIT Key Laboratory for Spintronics,
Beihang University, Beijing 100191, China
Yundong Zhang is with the National Key Laboratory on Digital Multimedia Chip Technology, Vimicro Corporation, Beijing 100191, China
Bi Wu is with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210095, China
Hao Cai is with the National ASIC System Engineering Center, Southeast University, Nanjing 211189, China
CORRESPONDING AUTHOR: BIAO PAN ([email protected]), WEISHENG ZHAO ([email protected]).
This work was supported in part by the National Key Research and Development Program of China under Grants 2021YFB3601304 and 2021YFB3601300, in part by
the National Natural Science Foundation of China under Grant 92164206, in part by International Collaboration Project under Grant B16001, in part by the National
Natural Science Foundation of China under Grant 62001019, and in part by the Laboratory Open Fund of Beijing Smart-chip Microelectronics Technology Co., Ltd.

ABSTRACT Conventional von Neumann architecture suffers from bottlenecks in computing performance and power consumption due to frequent data exchange between memory and processing units. To overcome this issue, research on novel computing architectures including near-memory computing (NMC) and in-memory computing (IMC) has been accelerated based on emerging nonvolatile memory devices. Among various potential candidates, spintronic-based magnetic random-access memory (MRAM) has become a research and development hotspot owing to its ultralow switching energy, nonvolatility, and superior endurance. This paper outlines the background, trends, and challenges involved in the development of MRAM-centric computing, and highlights the recent prototypes and advances in applications based on MRAM-NMC and MRAM-IMC.
INDEX TERMS Near-memory computing, in-memory computing, MRAM, neural network, hybrid memory
architecture

I. INTRODUCTION

For more than half a century, computers have been designed based on the von Neumann architecture in which the processing units and the memory units are physically separated [1]. However, this separation makes von Neumann architecture gradually fail to meet the requirements of burgeoning big-data-driven applications, such as artificial intelligence (AI), Internet of Things (IoT) and autonomous driving [3]. The bottleneck of the well-known power wall and memory wall arises from leakage currents and quantum effects of complementary metal–oxide–semiconductor (CMOS) circuit systems [2]. To solve these problems, modern technology is shifting its emphasis to memory-centric computing, aiming to shorten the distance between computation and storage, given that most of the energy and time are consumed in data movement.

Near-memory computing (NMC) and in-memory computing (IMC) are two promising candidates for breaking the power wall and memory wall, with the aim of performing logical operations close to/inside where the data resides to improve energy efficiency and minimize the expensive data movements. NMC could move the logic operation from processing units to memory units at different levels, which also benefits from advances of the 3D integration technology [4]. The novel IMC architecture has attracted broad attention, which could eliminate the boundary between memory and processing elements by implementing computing operations within the memory macro [5]. By exploiting the physical attributes of memory devices, computations could be performed directly in memory to tackle the energy efficiency challenge of von Neumann architecture.

Despite NMC and IMC presenting enormous potential, their applications are demanding on memory devices [6]. Stimulated by a wide range of implementation prospects, multiple explorations have emerged based on the utilization of CMOS memory devices, such as static random access memory (SRAM) and dynamic random access memory (DRAM). However, due to the insurmountable leakage

2168-6750 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
VOLUME 11, NO. 2, APRIL-JUNE 2023
LI et al.: Survey of MRAM-Centric Computing: From Near Memory to In Memory

FIGURE 1. The roadmap for MRAM-centric computing from the device level to the architecture level.

current with the scaling down of CMOS technologies, the bottleneck of power wall and memory wall may not be completely solved due to limitations from device to architecture [7]. Different from SRAM and DRAM, spintronic-based magnetic random-access memory (MRAM) is one of the most promising technologies for the memory-centric computing system with its superiorities in nonvolatility, data retention, reprogramming durability, etc. At the device level, as shown in Figure 1, MRAM has made rapid progress based on new mechanisms constantly proposed. At the architecture level, the distance between the processing unit and memory unit has decreased from von Neumann architecture to MRAM-NMC and MRAM-IMC. Additionally, MRAM could drive the revolution in computing architecture at different technology nodes to produce highly efficient computations with favorable CMOS compatibility.

This review provides readers with a comprehensive understanding of MRAM-NMC and IMC with a collection of state-of-the-art work mainly from 2018 to 2022. Simultaneously, the latest research work and existing challenges from device, circuit and architecture are also summarized. With the focus on the current status and future prospects of MRAM-NMC and MRAM-IMC, this review is aimed at encouraging the whole community to pay close attention to this area and ultimately translate basic research into industrial prototypes. The remaining parts of this paper are arranged as follows. The background on MRAM is briefly given in Section II. Section III presents MRAM-NMC from three dimensions, namely hardware implementation, software optimization for data flow, and promising applications. Section IV reviews MRAM-IMC from three perspectives, hardware implementation, computing paradigm, and promising applications. Section V concludes this paper with discussions about future challenges of MRAM-centric computing.

II. BACKGROUND OF MRAM

Compared to the resistive RAM (RRAM), phase change memory (PCM), and ferroelectric field effect transistor (FeFET), MRAM offers an advantageous performance among all the emerging nonvolatile memory technologies, especially in terms of energy and time of the read operations. Since the late 1980s, a series of scientific discoveries and technological innovations have contributed to the rapid progress in MRAM technology and its commercialization. The discovery of the tunneling magnetoresistive (TMR) effect for the CoFe/MgO-based magnetic tunnel junction (MTJ) device serves as one of the milestones during this progress [8]. MTJ, which mainly comprises two ferromagnetic layers with a tunneling barrier layer, is the key building block of MRAM. The magnetization direction of the fixed layer is generally permanent, while that of the free layer can be manipulated between two different directions for denoting one-bit datum of ’0’ or ’1’. Until now, several manipulation methods have been proposed, such as magnetic field, spin transfer torque (STT), spin-orbit torque (SOT) and voltage-controlled magnetic anisotropy (VCMA) [9]. The evolution of manipulation methods allows for more energy efficiency and higher write speed of the MRAM bit-cell, which could be attributed to the significant decrease in the joule heating and switching delay. Of all the methods above, the magnetic field and STT methods have been successfully commercialized, namely Toggle-MRAM (first in 2006) and STT-MRAM (first in 2012) [10].

The STT-MRAM can offer much higher density and lower energy, and commercial ones are now used as a replacement for embedded flash (eFlash) memory or SRAM in embedded applications. In addition, as the size of MTJ shrinks, issues such as read disturbance and tunnel barrier breakdown have become more serious. To mitigate the STT limitations, SOT


FIGURE 2. Applications of MRAM based IMC and NMC. MRAM-NMC forms (a) LLC/main memory or (b) hybrid memory architecture to
implement promising applications such as (c) neural network acceleration and (d) approximate computing. MRAM-IMC employs two
main memory arrays (e) 1T-1MTJ and (f) 2T-1MTJ to realize hardware implementation of (g) neural network acceleration and (h) graph
computing.

and VCMA effects were proposed as alternatives. SOT-MRAM can partially overcome these problems since it uses separated current paths for read and write at the cost of an extra transistor. On the other hand, the VCMA, which utilizes a voltage (or electric field) for MTJ writing by increasing the thickness of the tunneling barrier, can greatly save the write energy [11].

Besides, various novel manipulation methods are also under research for the next-generation MRAM, such as skyrmion or domain-wall based racetrack memory [12], SOT+STT [13], current-induced exchange bias switching via the SOT [14] and magnetoelectric spin-orbit (MESO) devices [15], as shown in Figure 1.

III. MRAM-NMC

The MRAM-NMC connects the computing circuit block to the MRAM macro with the primary working forms as shown in Figures 2(a), 2(b), 2(c), and 2(d), which can contribute to reducing the memory wall effect based on MRAM solving the power wall. The MRAM-NMC architecture stores data in the same manner as the typical MRAM array and optimizes data movement by tightly intercoupling the multiple chips/nodes to maximize its physical advantages. This section reviews the MRAM-NMC from the following three aspects: memory-to-system implementation, software optimization for dataflow, and promising applications.

A. MEMORY-TO-SYSTEM IMPLEMENTATION

Various designs of MRAM-NMC in the multi-core architecture have been proposed, demonstrating their potential to enhance system performance, such as the hybrid memory controller [21], the special advantage of applying MRAM [18], and approximate computing [28], [29], [30].

Memory Controller. Zhang et al. [21] proposed a memory controller with 38-bit lines to integrate STT-MRAM and SRAM as L1 cache. In their scheme, the incoming data were placed into SRAM at first and STT-MRAM worked as the buffer of SRAM. When the SRAM capacity was not sufficient, the data could be placed in STT-MRAM to reduce off-chip memory access. This design employs the associativity approximation logic to eliminate cache movements, which accommodates more write-once-read-more data stored in STT-MRAM. Thus, the arbitration module judges the tag search from a set of status registers of STT-MRAM and SRAM to serve the request from the hit cache. If the memory controller meets the case of write-multiple, the arbitrator will move the data from STT-MRAM to SRAM. Based on this workflow, it can reduce the overhead from off-chip memory access by 32% and energy costs by 53%.

MRAM Benefit. The increasing chip density has promoted dramatic growth in on-chip temperature. However, this could provide an opportunity to optimize the performance of the STT-MRAM-based system. Wu et al. [18] reported the STT-MRAM-based last level cache (LLC) with the novel thermally aware non-uniform cache access (NUCA) design, on account of the write latency and energy consumption from STT-MRAM to produce a downward trend at rising temperatures. To reduce leakage energy consumption in the system, STT-MRAM was divided into 64 banks connected with the SRAM L1 cache and each core. NUCA, depending on thermal awareness, adopted multiple migration strategies in different thermal regions of STT-MRAM. The simulation results show that the NUCA design with negligible hardware overhead can save 41.2% write energy at most and 13.01% on average.

Approximate Computing. To utilize the specific error-energy tradeoff of STT-MRAM, Ranjan et al. [28] proposed


FIGURE 3. The MRAM-based Approximate Computing. CAST Table: it stores approximation commands received from the API calls. WB Controller: it performs WB requests and victim blocks in the controller circuit. CAST controller: it sits next to the cache controller, receives the QL set by the CAST table, and selects the corresponding minimum settings.

FIGURE 4. The MRAM-NMC for hybrid memory architecture. MRAM has undesired effects on writing: write failure, longer time, and high power.

the STAxCache, an approximate STT-MRAM-based L2 cache that improves the energy efficiency from the read/write operation. Specifically, the quality configurable array supports the read/write approximation, which can be partially completed from STT-MRAM, and lowers the read/write current by appropriately modulating the duration. This step can lead to a higher read/write failure in the least significant bit group. Therefore, the quality-aware cache controller, cache insertion, and replacement policy are properly enhanced for read/write failure. The STAxCache utilizes a device-to-architecture simulation that achieves 1.44x improvement in the STT-MRAM-based L2 cache.

Based on the approximate computing combined with STT-MRAM [28], Monazzah et al. [30] proposed CAST, as shown in Figure 3, which is a hardware-software co-optimization approach to balancing the energy and quality of write operations in the STT-MRAM cache of multicore systems. The CAST table is closely connected with the translation look-aside buffer (TLB) to transfer the physical address (PA) and the Write-Back (WB) requests. This method is also conducive to energy saving through further manipulating energy-quality knobs and controlling the amounts of applied write currents based on the write transition directions. The cache controller next to the CAST controller works in parallel to control the STT-MRAM quality-energy tradeoff knob. The architecture saves 57% of energy consumption with an acceptable quality of the generated outputs compared to the benchmark STT-MRAM cache.

Salahvarzi et al. [29] proposed NOSTalgy architecture that considers the external environmental influence input conditions. NOSTalgy is a modificatory approach to adjusting the STT-MRAM energy-quality knob by providing information on the test program and operating system. According to the relevant QL information transferred from the CPU in the tag array, the NOSTalgy controller selects the appropriate voltage for the write operation when the write signal is generated. The quality energy knob also fluctuates with environmental conditions through the feedback mechanism, to adjust the quality of the buffer memory of STT-MRAM, saving 52% of the energy consumption.

Some MRAM-NMC challenges for the memory-to-system implementation remain unaddressed in CMOS-based technologies: 1) imbalance between the MRAM capacity and the communication bandwidth in a tight power budget; 2) the error elasticity rather than fully exact results used for the energy efficiency improvement within an acceptable range; 3) insufficient use of the MRAM features in specific scenarios.

B. SOFTWARE OPTIMIZATION FOR DATAFLOW

As shown in Figure 4, the adoption of MRAM devices as on-chip memory can offer ultra-low leakage power consumption and high memory density in the system. Nevertheless, MRAM with long write latency may lead to write failure and higher power. To solve the above problems, some typical solutions are proposed based on the hybrid memory architecture from algorithms [19], [22], [20], and [17] to avoid the undesired effect of MRAM.

Data Allocation. Qiu et al. [22] proposed the genetic algorithm in the hybrid scratchpad memory architecture, which is


TABLE 1. The MRAM-NMC accelerators for neural networks.

tightly coupled with each core. The search process of the genetic algorithm can find the optimal solution for the data allocation when the number of iterations or searches exceeds the predefined threshold. Then the genetic algorithm can minimize memory access costs and write operations to MRAM. Similarly, Salehi et al. [19] proposed the energy-aware cache block insertion/migration policy in the STT-MRAM-based hybrid memory architecture. The scheme of the hybrid 8-way set associative SRAM (way-0 and way-1) and STT-MRAM (way-2 through way-7) is adopted as the LLC design. The insertion/migration policy maintains the read-dominant cache blocks in STT-MRAM banks while write-intensive blocks are transferred to SRAM banks. Besides, the insertion policy also maximizes self-organized sub-bank throughput to improve the access bandwidth of the STT-MRAM.

Write Failure. The works in [22] and [19] play an important role in the implementation and reduction of write operations for MRAM. To handle the write error rate (WER) of STT-MRAM in the hybrid memory architecture, Talebi et al. [20] proposed ROCKY, a robust architecture based on the cache controller to redirect the traffic. The design of ROCKY mainly considers two constraints from STT-MRAM, namely write operation and incoming block. The replacement policy finds the target block to write in the STT-MRAM when the updated data area is less than the threshold. On the WB hit, the updated block is written in the STT-MRAM to check the hardware boundary while determining whether the write operation is performed. On the contrary, STT-MRAM needs to free the new area for the hit block. The simulation results show that ROCKY can reduce the dynamic power consumption and write failure rate of STT-MRAM by 3.55% and 28.7%, respectively.

When optimizing WB for STT-MRAM, Liang et al. [17] put forward E-cache, an energy-aware cache replacement policy, which needs to take into account WB energy, infrequently-used data and energy consumption. E-cache boasts three strengths: 1) the minimized WB energy of STT-MRAM by evaluation and calculation; 2) the promotion mechanism dynamically increases the updated probability of heavy write-energy data; 3) the energy-aware write-back strategy evicts data from the LLC depending on the cache hit rate and the write-back energy consumption. To be specific, the WB energy calculator first compares two data sources (modified data and original data) to evaluate the energy consumption of the WB operation from STT-MRAM. Then, the candidate promotion mechanism dynamically adjusts more positions for STT-MRAM due to its high write cost. Finally, the energy-aware WB strategy activates and moves the data to the promotion position when the WB energy consumption of STT-MRAM is higher than the threshold. Compared with the conventional algorithm, E-cache can reduce power consumption by 36%.

MRAM-NMC inspires many corresponding requirements for computing architecture to offer excellent potential for near-zero standby power and high design integration. There are two factors among the mentioned changes in software optimization for dataflow: 1) MRAM-NMC is designed to apply the physical advantages of MRAM while addressing its unfavorable effects, such as write-on MRAM with longer time and higher latency; 2) the rationalization of data allocation for MRAM is applied at different levels with other memory units, considering virtual memory support, cache coherence, and data mapping.

C. PROMISING APPLICATIONS

Conventional processing units such as CPU and GPU fail to match essentials of machine learning (ML) algorithms and the speed of neural network accelerators. To dramatically enhance the data efficiency between the memory unit and compute unit for emerging modern workloads, as shown in Table 1, recent research on MRAM-NMC has been conducted in terms of security [27], [16], Internet of Things [26], embedded machine learning [23], specialized accelerator [24], and intermittent processing [25].

Security. Data encryption/decryption needs to be considered in MRAM-NMC for security-related applications. Chiu et al. [27] proposed a 4Mb STT-MRAM for data-encrypted,


FIGURE 5. The MRAM-NMC for cache/main memory. Each memory bank includes 256 NSRFs and 32 MB-CSAs. A total of eight NSRFs are connected to each MB-CSA. This NSRF includes four control signals other than memory-mode: NMC En, NMC Sel, NMC clk and NMC addr. Thus, each NSRF comprises only one slave latch (LS) and two switches (SW1 and SW2).

achieving a 192GB/s read-and-decryption bandwidth, 25.1-55.1 TOPS/W and 8-bit multiply-and-accumulate (MAC) for AI operations. This design comprises three parts: 1) a bit-wise vertical weight mapping burst access scheme that favors MRAM-NMC and ReLU-prediction to reduce the frequency of memory accesses and full channel operations; 2) a bidirectional bit-line-access readout scheme to reduce macro-level read latency; 3) a charge-cycling voltage-type small-offset sense amplifier (SA) to lessen read energy consumption. Additionally, the STT-MRAM with multiple cycles for full-channel MAC operation achieves 8-bit high-precision inputs, 8-bit weights, 26-bit outputs and 576 accumulations. The performance of the proposed scheme shows that the macro-level read latency is 1.5ns and the read energy is 24.9fJ/bit when reading the encrypted data.

It’s worth noting that security-aware applications place more performance demands on memory devices, including fast read-access, high bandwidth, and shift/rotate functionality. Chiu et al. [16] explored a 1Mb STT-MRAM with near memory shift/rotate functionality and achieved 42.6 GB/s read bandwidth on security-aware mobile devices. As shown in Figure 5, the eight near-memory shift-and-rotation functionalities (NSRFs) are connected to the multi-bit current-mode sense amplifier (MB-CSA), which transfers 8-bit output within a single memory access cycle. Furthermore, each NSRF contains two switches (SW1 and SW2) to move data between output latches and slave latches. Four control signals can be found in every NSRF: NMC En for activating NMC control, NMC Sel for NMC controller to shift and rotation functions, NMC clk for the timing of the D-flip-flop and NMC addr for shifting/rotating the number of MRAM bits, respectively. The proposed MB-CSA can reduce the area overhead of the security-aware application by 33.3%, and increase latency for the high-bandwidth by only 170ps.

Internet of Things. Rossi et al. [26] proposed the Vega, a significant end-node with high performance and always-on capability for battery-powered applications. More specifically, STT-MRAM is located in the independent switchable power domain, linked with the cache through the DMA channel and managed as peripherals. The frequency of STT-MRAM can reach up to 40MHz with the 78-bit interface, taking advantage of the read performance to store read-only weights and programming codes. The STT-MRAM is awakened after the CPU switches from sleep mode to active mode. Then the weights from STT-MRAM are computed in the eight-core cluster supporting multi-precision single instruction multiple data integer and floating-point computation. STT-MRAM as an on-chip integration memory can achieve over 40x more energy efficiency than SRAM with similar bandwidth, and simultaneously balance output threshold precision and energy consumption.

Embedded Machine Learning. The always-on wearable system that adopts the deep neural network (DNN) is full of hurdles, which incur external memory access due to the limited on-chip memory size. For this reason, Lee et al. [23] introduced the on-chip 1MB STT-MRAM to store 8-bit fixed-point precision of entire weights and input features to minimize external memory access. If the abnormal electrocardiogram (ECG) signal is detected, the wake-up algorithm will turn on the leakage-based delay multiply-and-accumulation (LDMAC) array before waking up the STT-MRAM. Then the system extracts input data and weights of the neural network from STT-MRAM to the buffer of the LDMAC, and performs the ECG arrhythmia detection algorithm on it. The 1MB STT-MRAM integrates 8528 specimens for SoC monitoring the ECG arrhythmia. The NMC system based on the STT-MRAM achieves a power consumption of only 1.02mW during DNN inference for ECG arrhythmia detection.

Intermittent Processing. However, in comparison with battery-operated applications [26], [23], some devices are not equipped with batteries and can solely rely on energy harvesting techniques. This puts higher demands on the hardware, which needs to be energy efficient and capable of tolerating interruptions in the event of power outages. Resch et al. [25] presented an MRAM-based ML accelerator for energy harvesting applications. MRAM can perform the read/write operation via three types of instructions, consisting of the logic operation, the memory operation, and column activation. The MRAM controller acquires and issues all instructions by the 128-byte memory buffer and immediately stores data in the tiles. Meanwhile, MRAM serving as the register is used for the program counter and for buffering individual instructions, respectively. The logic operations combined with MRAM in energy harvesting applications can restart procedures during intermittent processing. The simulation


results show that MRAM-NMC with ML accelerator can provide remarkable latency reduction and energy efficiency advantages over other advanced approaches.
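
The restart behaviour described for intermittent processing can be pictured with a toy model. The sketch below is not the accelerator of [25]; it only assumes that a program counter and an accumulator live in nonvolatile storage (a plain dictionary stands in for MRAM-backed registers) and that a power outage can occur only between steps. The class name, field names, and workload are hypothetical.

```python
class IntermittentAccumulator:
    """Toy model: sum a list of values across simulated power outages."""

    def __init__(self, values):
        self.values = values
        # Stand-in for MRAM-backed registers; contents survive an outage.
        self.nonvolatile = {"pc": 0, "acc": 0}

    def run(self, steps_until_outage):
        """Run until done or until the simulated energy budget is spent.

        Returns True when the whole workload has completed.
        """
        nv = self.nonvolatile
        for _ in range(steps_until_outage):
            if nv["pc"] >= len(self.values):
                break
            nv["acc"] += self.values[nv["pc"]]
            nv["pc"] += 1  # commit progress before the next step
        return nv["pc"] >= len(self.values)


job = IntermittentAccumulator([1, 2, 3, 4, 5])
print(job.run(2), job.nonvolatile)   # outage after two steps
print(job.run(10), job.nonvolatile)  # resumes from the saved counter
```

Because progress is committed to the nonvolatile state after every step, the second call continues from the saved counter instead of starting over, which is the property the text attributes to MRAM-held registers.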
Specialized Accelerator. The above MRAM-NMC based on the neural network adopts various schemes to address unnecessary memory access and additional power consumption. Through this method, the weights from the convolutional neural networks (CNN) are normalized between ’-1’ and ’1’ after each convolution, yet this leaves 1 bit of the floating-point data unused. Jasemi et al. [24] combined the unused bit and MLC STT-MRAM to improve the read and write energy for CNN accelerators. Two steps are taken to program the 2-bit MLC STT-MRAM to reach ’01’ or ’10’: 1) the soft bit (the smaller MTJ) is programmed and the hard bit (the larger MTJ) is realized through the former step; 2) another pulse is applied into the latter one. To overcome metadata management issues, the grouping mechanism, a pure baseline, rotation and rounding solution are employed to further reduce the energy consumption of memory and achieve high accuracy and reliability.
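
One plausible reading of the unused bit can be checked numerically. The sketch below assumes the weights are stored as IEEE 754 binary32 values, which the text does not state, and the helper names are hypothetical. Under that assumption, any value with magnitude below 2 has a biased exponent of at most 127, so the most significant exponent bit (bit 30) is always zero for weights in [-1, 1]. Whether this is the exact bit exploited in [24] is not confirmed here.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def exponent_msb(x):
    """Bit 30: the most significant bit of the 8-bit exponent field."""
    return (float32_bits(x) >> 30) & 1


# Normalized weights in [-1, 1] never set bit 30 ...
weights = [-1.0, -0.75, -1e-3, 0.0, 1e-3, 0.5, 1.0]
print([exponent_msb(w) for w in weights])

# ... whereas a value of magnitude 2 or more does.
print(exponent_msb(2.0))
```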
Most existing MRAM-NMC architectures are focusing on ML techniques. However, more challenges in this area remain to be addressed: 1) given that the prototype of MRAM may not be available in large capacity or at the mass scale, the way to maximize limited resources and avoid vulnerabilities of MRAM is required; 2) system architects need to extend the benefits of MRAM to the entire spectrum of ML applications; 3) little progress has been made towards training acceleration of MRAM-NMC, although the training phase demands more computation than other phases.

FIGURE 6. Overall illustration of MRAM-IMC architecture from (a) hardware implementation of computational memory, (b) computing paradigm including analog and digital approach, and (c) promising application directions.

IV. MRAM-IMC

Compared with MRAM-NMC, MRAM-IMC could fundamentally blur the distinction between processing and memory units, further reducing memory access. This section reviews MRAM-IMC from the following three aspects: hardware implementation, computing paradigm and promising applications, as shown in Figure 6.

A. HARDWARE IMPLEMENTATION

According to the location of computing operation, the hardware implementation of MRAM-IMC could be divided into two basic categories: peripheral circuits (IMC-P) or memory array (IMC-A).

IMC-P. In this approach, the basic idea is to exploit the peripheral circuitry to perform a range of Boolean logic operations while the structure of the MRAM array remains unchanged.

As shown in Figure 6(a), the schematic of a typical MRAM bank consists of xT-1MTJ (x=1,2) bit-cell and peripheral circuits, e.g., pre-charge SA, write/read drivers, row/column decoders, and input/output (I/O) interfaces. Here, T refers to the CMOS transistor connected with the MTJ device, and the bit-cell of STT-MRAM is typically 1T-1MTJ while that of SOT-MRAM is 2T-1MTJ, as shown in Figure 2(e) and 2(f). To link different bit-cells together, the MRAM array is generally organized with multiple bit-lines (BLs), source-lines (SLs), and word-lines (WLs).

STT-CiM [31] is a typical example of IMC-P that employs modifications to the sensing circuitry and reference generation circuitry while keeping the memory array of STT-MRAM unchanged to perform logic operations. As shown in Table 2(a), by applying a bias voltage to the BL, the resistance states of each pair of bit-cells codetermine the summation of the current $I_{SL}$ flowing through the SL (e.g., $(R_i = 0, R_j = 1) \Rightarrow I_{01}$). To conduct logic operations, the sensing schemes are designed to generate different reference currents $I_{ref}$ and results of comparison. For bitwise OR (NOR) operations, $I_{\text{ref-or}}$ and $I_{SL}$ correspond to the negative and positive inputs of the SA, respectively, while the value of $I_{\text{ref-or}}$ is set between $I_{10}$ and $I_{00}$. Consequently, only when the bit-cells both store ’0’, the value at the positive (negative) output of the SA is set to logic ’0’ (’1’). In other cases, it leads to a logic ’1’ (’0’). Thus, the positive (negative) output of the SA realizes the bitwise OR (NOR) logic of the values stored in the enabled two bit-cells. Based on the above logic operations, bitwise AND (NAND) and some complex combinatorial logic modules such as full-adder could also be realized.

MRIMA [39] is another example of IMC-P that exploits hardware-friendly BL computing methods to implement complete Boolean logic within STT-MRAM in a single clock cycle. The key idea behind MRIMA is also to choose different references when sensing the selected memory cell(s) but


TABLE 2. Example diagrams of MRAM-IMC in (a)(b) hardware implementation, (c)(d) computing paradigm and (e) promising application.

* Each of the 16 Local MRAM bit-cell arrays in [33] maximally supports 1024 8-bit weight data.
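
The reference-current sensing described for STT-CiM [31] (Table 2(a)) can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the bias voltage, the two resistance values, the convention that the low-resistance state encodes '1', and the placement of each reference at the midpoint between neighbouring current levels are illustrative choices, not values from the survey or from [31]. The AND reference (between $I_{01}$ and $I_{11}$) is likewise an assumed extension of the OR case described in the text.

```python
# Behavioral sketch of two-cell current-sum sensing (all values hypothetical).
V_BL = 0.1        # bias voltage applied to the bit-line, in volts (assumed)
R_LOW = 3_000.0   # resistance assumed to encode logic '1', in ohms
R_HIGH = 6_000.0  # resistance assumed to encode logic '0', in ohms


def cell_current(bit: int) -> float:
    """Current contributed by one enabled bit-cell storing `bit`."""
    return V_BL / (R_LOW if bit else R_HIGH)


def sl_current(a: int, b: int) -> float:
    """Summed source-line current I_SL of two simultaneously enabled cells."""
    return cell_current(a) + cell_current(b)


I_00 = sl_current(0, 0)
I_01 = sl_current(0, 1)  # equals I_10 by symmetry
I_11 = sl_current(1, 1)

# References placed midway between adjacent current levels (assumption).
I_REF_OR = (I_00 + I_01) / 2
I_REF_AND = (I_01 + I_11) / 2


def sense(a: int, b: int, i_ref: float) -> tuple[int, int]:
    """Return the (positive, negative) sense-amplifier outputs.

    I_SL drives the positive input and i_ref the negative input, so the
    positive output is 1 exactly when I_SL exceeds the reference.
    """
    positive = int(sl_current(a, b) > i_ref)
    return positive, 1 - positive


if __name__ == "__main__":
    print("a b | OR NOR | AND NAND")
    for a in (0, 1):
        for b in (0, 1):
            or_, nor_ = sense(a, b, I_REF_OR)
            and_, nand_ = sense(a, b, I_REF_AND)
            print(f"{a} {b} |  {or_}  {nor_}  |  {and_}   {nand_}")
```

With the OR reference, the positive output is '0' only when both cells store '0', which matches the behaviour described in the text. The narrow gap between adjacent current levels in this model also hints at why the reference precision is listed among the challenges of IMC-P.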

with a more elaborate SA design, which is divided into two sub-SAs and six reconfigurable branches.

The primary advantage of IMC-P is that the computing memory core is no different from standard MRAM, and thus the storage density and the regular read/write operations can be maintained. However, some challenges still exist: 1) the pre-processing data should be grouped and transported into the same bank, which may cause extra power consumption; 2) the post-processing data should be cached before being used in the next procedure; 3) poor scalability to realize complex logic functions; 4) the reference current/voltage must be very precise to ensure the correctness of the result, which is not an easy task.

IMC-A. Contrary to IMC-P, the key idea behind IMC-A is to exploit the MRAM bit-cells for logic operation by dynamically configuring them with regular write/read operations.

CRAM [32] is a typical example of IMC-A that employs one additional transistor to provide a platform where logic operations are performed within the MRAM array. The general structure of the CRAM array is shown in Table 2(b), in which the MTJ of each bit-cell is addressed through the first transistor ($T_1$) and logic operations could be completed by selecting the second transistor ($T_2$). The CRAM thus can work in two modes: 1) Memory Mode: When $T_1$ is turned on by pulling up the WL and $T_2$ is turned off by holding down the logic bit line (LBL), data could be read from or written into the MTJ. During this mode, the configuration is effectively identical to a standard STT-MRAM bit-cell; 2) Logic Mode: When $T_1$ is turned off by holding down the WL and $T_2$ is turned on by pulling up the LBL, the MTJ is connected to a logic line (LL) in each row. During this mode, several MTJs in a row could form a logic gate, such as AND or NAND. The operands of a logic gate could be expressed by the resistance states of the input/output MTJs, while an appropriate voltage is applied to the bit select lines (BSLs). If three MTJs are connected to the LL as shown in Table 2(b), the resistance states of the MTJs are $R_1$, $R_2$ and $R_o$, respectively. The total resistance corresponds to $(R_1 \parallel R_2) + R_o$. With a voltage $V_{BSL}$ applied to the BSL, the current $I$ of the LL is calculated by:

$$I = \frac{V_{BSL}}{\dfrac{R_1 R_2}{R_1 + R_2} + R_o} \tag{1}$$

Here, $I_c$ denotes the critical threshold current of the MTJ. Depending on the relationship between $I$ and $I_c$, the output state will change (if $I > I_c$) or remain the same (if $I < I_c$), i.e., a logic function with multiple inputs and one output is completed.

SPU [42] is another example of IMC-A that exploits STT-MRAM to implement different reconfigurable logic functions within one or two read/write cycles. The key idea behind SPU is to realize multi-bit logic operations in a highly parallel structure with only a transmission gate added on each WL to control the access signal. Besides, by adding a control unit in the main MRAM array to translate the computing commands into read/write operations, a feasible MRAM-IMC platform is established.

B. COMPUTING PARADIGM
According to the signal type and computing paradigm for implementation, MRAM-IMC could be divided into two basic categories: analog and digital.

Analog MRAM-IMC. As shown in Figure 6(b), typical analog MRAM-IMC retains the structure of the MRAM array and the read/write function. It completes logic operations by applying voltage or current signals directly to the bit-cells. External signals are entered on each WL through a digital-to-analog converter (DAC) or pulse width modulation, and multiple rows are accessed by decoders and drivers at once. After


the data stored inside the bit-cell has completed logic operations with external data, the intermediate results of each column are accumulated and moved into the bottom analog-to-digital converter (ADC) to be converted into the digital output. Depending on the trade-off between precision and energy consumption, 4-8 bit ADCs are usually used in analog MRAM-IMC.

Cai et al. [33] proposed a novel TMR ratio magnifying method based on a universal 1T-1MTJ STT-MRAM bit-cell to realize analog MRAM-IMC. The general structure of this design is shown in Table 2(c), in which the MTJ is connected to a latch structure while the peripheral circuits are minimally modified to enable in-memory matrix-vector multiplication. A virtual TMR magnified by 7500× is achieved, leading to a 57.6% reduction in integral nonlinearity and an energy efficiency of 9.47-25.4 TOPS/W for a CNN with 2-bit input, 1-bit weight and 4-bit output.

The primary merits of analog MRAM-IMC are high on-chip bandwidth and computation-area efficiency. Currently, with the data explosion, analog MRAM-IMC has found its way to conduct MAC operations with a high degree of parallelism and enhanced throughput. However, some challenges persist: 1) the data precision determined by the partial sum is confined by the ADC resolution, and there is a stringent requirement on the area of high-resolution ADCs (>8 bit); 2) non-ideal device characteristics, including the cell-to-cell variation and the intrinsic ADC offset, will degrade the computing accuracy; 3) computing robustness is lacking due to a low signal-to-noise ratio (SNR) during the analog signal processing.

Digital MRAM-IMC. As shown in Figure 6(b), the digital MRAM-IMC paradigm consists of multiple processing element (PE) units, each constituted by an MRAM array and Boolean logic blocks. Compared to the analog counterpart, digital MRAM-IMC implementations are typically less energy/area efficient, but are more scalable and tolerant to noise and variations.

The CRISP [44] architecture is a representative instance favoring digital logic operations inside memory. Its spintronic-assisted logic-in-memory (SLIM) cells can execute a series of partial product generations and additions to perform MAC operations, as shown in Table 2(d). In the initial stage, the weight $W[m]$ of input cell $IN_1$ and the output cells ($OUT_1$ and $OUT_2$) are set to '1'. Two input currents $I[n]$ and $I[n+1]$ are applied as input voltages for the VCMA effect. Only when input and weight are both set to '1' can the weight-equivalent current generated from $IN_1$ surpass the magnitude of the switching current of $OUT_1$ and $OUT_2$. Otherwise, the two output cells remain in their original state, which achieves the NAND operation. The subsequent full addition is fulfilled with successive majority-voting-3 (MV3) and majority-voting-5 (MV5) operations. When the majority of input cells are in the low resistance state (data '1'), the accumulated current becomes larger than the switching threshold and flips the output cells to data '0'. Through these logic functions, SLIM cells can be efficiently used in the memory configuration of the CRISP architecture for multiple matrix multiplications as PE units.

Zhang et al. [35] presented a reconfigurable MRAM-IMC architecture employing a single voltage-gated SHE-driven MTJ, which was also designed in the digital computing paradigm. In this approach, two inputs are needed for the state shift. One input is represented by the VCMA bias voltage across the MTJ and the other is represented by the initial data stored in the MTJ. The logic output result is calculated and recorded as the state of the MTJ in the memory cell. By measuring its resistance state in a read operation cycle, the logic output can then be read. The result demonstrated the feasibility of achieving stateful reconfigurable Boolean logic functions with a single VG-SHE-driven MTJ device.

The primary advantage of digital MRAM-IMC is its ability to realize high accuracy in high-precision computing (>16 bit) and its flexibility for various bit widths. However, some problems remain unresolved: 1) combining logic units with bit-cells occupies extra area; 2) copying parameters generally requires a large extra memory size in digital IMC architectures; 3) complex operations such as matrix multiplication need to be decomposed into a collection of basic operations, which takes more execution cycles and increases latency.

C. PROMISING APPLICATIONS
MRAM-IMC has shown its potential for reducing most of the data transmission energy and latency while performing computing within memory. Previous MRAM-IMC proposals were classified according to specific applications when carrying out data-centric tasks [5], including scientific computing, signal optimization, machine learning, etc. This paper takes Neural Network and Graph Computing as examples to analyze the MRAM-IMC architecture in terms of its application progress and prospect.

Neural Network. With the increasing data set scale and computing complexity, the efficiency of neural network algorithms is limited owing to the von Neumann bottleneck. To address this issue, MRAM-IMC architecture has been introduced as a possible solution, displaying superior performance in terms of energy efficiency and latency [36], [43], [45].

Zhang et al. [36] presented a time-domain computing in memory (TD-CIM) scheme based on SOT-MRAM to optimize energy efficiency and delay for CNN applications. It achieves Boolean logic operations by recording the BL output at different moments. Compared with CRAM [32] in identifying the MNIST dataset, the delay of the TD-CIM architecture is reduced by 1.2-2.7 times, and the energy is decreased by $2.4\times10^{3}$ to $1.1\times10^{4}$ times.

In 2021, Deaville et al. [40] first proposed an MRAM-IMC macro designed with a 128-Kb array in 22 nm FD-SOI technology (Table 3(a)), achieving an area-normalized throughput of 758 GOPS/mm² and an energy efficiency of 5.1 TOPS/W. When the architecture is applied to the CIFAR-10 classification task, the inference accuracy reaches 90.1%, matching ideal software-based computation.

Very recently, Jung et al. [34] reported a crossbar array based on MRAM cells with a resistance-sum method for analogue MAC operation. This approach replaces


Kirchhoff's law and consumes less power than the previous standard crossbar array with the current-sum method. A 64×64 array is integrated with the readout electronics based on Time-to-Digital Converters (TDC) in the 28 nm CMOS process, reaching 405 TOPS/W power efficiency while processing dot products with a 0.8 V supply for the TDCs (Table 3(b)). Using a two-layer binary neural network perceptron, the accuracy of applying the crossbar array in 10,000-image MNIST classification tasks is up to 93.23 ± 0.05%. From the perspective of MRAM industrialization, these two prototypes have brought volume production of MRAM-IMC chips into routine, which may help to push IMC technology to the forefront.

TABLE 3. MRAM-IMC prototype in the application of neural network.

Graph Computing. Due to the growing need to dissect relationships from massive data, graph computing has received extensive attention. Triangle counting (TC) is a fundamental issue in graph computing, in which obtaining the number of triangles in a given graph is the key to extracting the relation net model. Traditional graph algorithms have problems with complex control capabilities when applied to MRAM-IMC. To cope with this issue, Wang et al. [37] innovatively reformulated the TC problem into basic Boolean logic functions and designed a triangle counting in-memory (TCIM) accelerator using simple AND and BitCount operations for computing. By slicing and compressing the input graph, valid data are loaded into the bit-cells of the STT-MRAM array and processed with efficient in-memory bitwise operations, as shown in Table 2(e). On the SNAP dataset from real-life graphs, the execution results outperform the energy-efficient FPGA by 31.8× and achieve a 34× energy efficiency improvement. MRAM-IMC excels in terms of precision, energy usage, speed, stability and endurance [34]. Based on the switching properties of STT-MRAM, the physical unclonable function is also a candidate for embedded secure devices [38]. Despite the ability to perform computationally expensive and memory-intensive tasks, MRAM-IMC still faces challenges in large-scale on-chip integration to fulfill the industrial demands of current big-data-driven applications.

V. CHALLENGES AND PROSPECT
In this review, we briefly introduce the MRAM-centric computing solutions, which can be categorized as MRAM-NMC and MRAM-IMC, to address the bottleneck of the von Neumann architecture. MRAM-NMC places computational units at the periphery of the memory array for fast data access, while MRAM-IMC uses the memory array to perform logic operations directly through simple configuration. We collected some representative works of MRAM-NMC and MRAM-IMC published in recent years. Figure 7 plots energy efficiency versus process node for these studies, showing the technology node scaling from 6x nm to 2x nm and the normalized energy consumption decreasing from pJ/bit to fJ/bit. Despite this progress, MRAM-NMC

FIGURE 7. Energy consumption of (a) MRAM-NMC and (b) MRAM-IMC in range of 6x-2x nm technology nodes. Note: In order to make the
comparison more intuitive, the energy consumption data from the references has been transformed into a uniform unit (pJ/bit).
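
As a rough aid for reading efficiency figures such as those quoted in this survey alongside the per-bit energies of Figure 7, the sketch below converts TOPS/W into energy per operation, using the identity that 1 TOPS/W equals $10^{12}$ operations per joule, i.e., 1 pJ per operation. The survey does not state how its pJ/bit normalization was carried out, so the per-bit step here (dividing by a `bits_per_op` value chosen by the reader) is a hypothetical convention, and both function names are illustrative.

```python
def energy_per_op_pj(tops_per_watt: float) -> float:
    """Energy per operation in picojoules for a given efficiency in TOPS/W.

    1 TOPS/W = 1e12 operations per joule, so one operation costs
    1 / (TOPS/W) picojoules.
    """
    if tops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return 1.0 / tops_per_watt


def energy_per_bit_pj(tops_per_watt: float, bits_per_op: int) -> float:
    """Hypothetical per-bit normalization: spread one operation's energy
    evenly over `bits_per_op` bits (an assumption, not the survey's method)."""
    if bits_per_op <= 0:
        raise ValueError("bits_per_op must be positive")
    return energy_per_op_pj(tops_per_watt) / bits_per_op


if __name__ == "__main__":
    # Efficiencies quoted in the text for the prototypes in [40] and [34].
    for label, eff in (("[40]", 5.1), ("[34]", 405.0)):
        pj = energy_per_op_pj(eff)
        print(f"{label}: {eff} TOPS/W -> {pj:.3f} pJ/op ({pj * 1000:.2f} fJ/op)")
```

For the two efficiencies quoted in the text, this gives roughly 0.196 pJ per operation at 5.1 TOPS/W and about 2.47 fJ per operation at 405 TOPS/W, which is in line with the pJ-to-fJ range described for Figure 7.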


and MRAM-IMC still face challenges from device, circuit and architecture.

For MRAM-NMC, by leveraging the mature commercial technology of MTJ, several prototypes and products have shown the potential for cost-effective and energy-efficient implementation of neural network acceleration, approximate computing, and security-aware applications. Nevertheless, before these new computational paradigms are ready for volume manufacturing, more efforts from three aspects remain to be accomplished. First of all, MRAM-NMC is rapidly reshaping computing systems by applying the well-known benefits of MRAM. However, the memory required for nonvolatile field-programmable gate arrays and nonvolatile CPUs needs to meet sufficient frequency, dependent switchable power domain and lower process node, which all drive iterations and improvements of MRAM. Additionally, MRAM-NMC reduces the computing energy by adding parasitic elements in different circuit design scenarios. However, the memory array and peripheral circuits of MRAM should be carefully implemented to meet the special requirement. Finally, the energy-efficient MRAM-NMC architecture poses unpredictable work intensities, such as software stack overhead, file system security and memory fusion techniques at different levels.

For MRAM-IMC, most of the work is still at the device level, lacking prototypes for stochastic computing, neural network acceleration and high-precision scientific computing. Fundamentally, the path to advance the MRAM-IMC technology for mass manufacturing is hampered by numerous barriers. First, to enhance the TMR and read/write efficiency, further device engineering is necessary to manipulate the magnetic state of the MTJ with high spin polarization, spin filtering, large spin–orbit coupling, etc. Second, considering that there is currently no complete spintronic computer concept, integrating MTJ devices with CMOS components is still the most realistic solution for MRAM-IMC. Therefore, more attention should be paid to the CMOS back-end-of-line fabrication process with MTJ, including the optimization of scalability and integration density. Third, since not all computing tasks are tailored for MRAM-IMC, it is worthwhile to reconsider the application scenarios and the software/hardware cooperation with other related approaches.

MRAM has proven its potential to replace existing memory technologies for computations where a combination of nonvolatility, energy efficiency, speed and endurance is vital. While challenges remain, MRAM-centric computing is now in a unique position to define new applications and enable architecture innovations over the next decade and beyond. Along with the maturity and wide availability of advanced CMOS manufacturing and MRAM fabrication processes, the MRAM-centric computing paradigms discussed in this paper will offer a firm step towards the future era of intelligence, and we thus hope our perspective will motivate more research in this fascinating area.

REFERENCES
[1] M. F. Gonzalez-Zalba, S. de Franceschi, E. Charbon, T. Meunier, M. Vinet, and A. S. Dzurak, “Scaling silicon-based quantum computing using CMOS technology,” Nature Electron., vol. 4, no. 12, pp. 872–884, 2021.
[2] C. Xue et al., “A CMOS-integrated compute-in-memory macro based on resistive random-access memory for AI edge devices,” Nature Electron., vol. 4, no. 1, pp. 81–90, 2021.
[3] S. Dutta, H. Jeong, Y. Yang, V. Cadambe, T. M. Low, and P. Grover, “Addressing unreliability in emerging devices and non-von neumann architectures using coded computing,” Proc. IEEE, vol. 108, no. 8, pp. 1219–1234, 2020.
[4] Y. Zhang, L. Xu, Q. Dong, J. Wang, D. Blaauw, and D. Sylvester, “Recryptor: A Reconfigurable Cryptographic Cortex-M0 Processor With In-Memory and Near-Memory Computing for IoT Security,” IEEE J. Solid-State Circuits, vol. 53, no. 4, pp. 995–1005, Apr. 2018.
[5] A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, and E. Eleftheriou, “Memory devices and applications for in-memory computing,” Nature Nanotechnol., vol. 15, no. 7, pp. 529–544, 2020.
[6] S. Khoram, Y. Zha, J. Zhang, and J. Li, “Challenges and opportunities: From near-memory computing to in-memory computing,” in Proc. ACM Int. Symp. Phys. Des., 2017, pp. 43–46.
[7] Z. T. Sandhie, J. A. Patel, F. U. Ahmed, and M. H. Chowdhury, “Investigation of multiple-valued logic technologies for beyond-binary era,” ACM Comput. Surveys, vol. 54, no. 1, pp. 1–30, 2021.
[8] Z. Wang et al., “Resistive switching materials for information processing,” Nature Rev. Mater., vol. 5, no. 3, pp. 173–195, 2020.
[9] S. Fukami, C. Zhang, S. DuttaGupta, A. Kurenkov, and H. Ohno, “Magnetization switching by spin–orbit torque in an antiferromagnet–ferromagnet bilayer system,” Nature Mater., vol. 15, no. 5, pp. 535–541, 2016.
[10] Z. Guo et al., “Spintronics for energy-efficient computing: An overview and outlook,” Proc. IEEE, vol. 109, no. 8, pp. 1398–1417, 2021.
[11] D. Ielmini and H. S. P. Wong, “In-memory computing with resistive switching devices,” Nature Electron., vol. 1, no. 6, pp. 333–343, 2018.
[12] A. Fert, V. Cros, and J. Sampaio, “Skyrmions on the track,” Nature Nanotechnol., vol. 8, no. 3, pp. 152–156, 2013.
[13] T. Endoh, H. Honjo, K. Nishioka, and S. Ikeda, “Recent progresses in STT-MRAM and SOT-MRAM for next generation MRAM,” in Proc. IEEE Symp. VLSI Technol., 2020, pp. 1–2.
[14] J. Kang et al., “Current-induced manipulation of exchange bias in IrMn/NiFe bilayer structures,” Nature Commun., vol. 12, no. 1, pp. 1–7, 2021.
[15] S. Manipatruni et al., “Scalable energy-efficient magnetoelectric spin–orbit logic,” Nature, vol. 565, no. 7737, pp. 35–42, 2019.
[16] Y. Chiu et al., “A 22-nm 1-Mb 1024-b read data-protected STT-MRAM macro with near-memory shift-and-rotate functionality and 42.6-GB/s read bandwidth for security-aware mobile device,” IEEE J. Solid-State Circuits, vol. 57, no. 6, pp. 1936–1949, Jun. 2022.
[17] Y. Liang, T. Chen, Y. Chang, S. Chen, P. Chen, and W. Shih, “Rethinking last-level-cache write-back strategy for MLC STT-RAM main memory with asymmetric write energy,” in Proc. IEEE/ACM Int. Symp. Low Power Electron. Des., 2019, pp. 1–6.
[18] B. Wu et al., “A novel high performance and energy efficient NUCA architecture for STT-MRAM LLCs with thermal consideration,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 4, pp. 803–815, Apr. 2020.
[19] S. Salehi, N. Khoshavi, and R. F. DeMara, “Mitigating process variability for non-volatile cache resilience and yield,” IEEE Trans. Emerg. Topics Comput., vol. 8, no. 3, pp. 724–737, Jul.-Sep. 2018.
[20] M. Talebi, A. Salahvarzi, A. M. H. Monazzah, K. Skadron, and M. Fazeli, “ROCKY: A robust hybrid on-chip memory kit for the processors with STT-MRAM cache technology,” IEEE Trans. Comput., vol. 70, no. 12, pp. 2198–2210, Dec. 2021.
[21] J. Zhang, M. Jung, and M. T. Kandemir, “FUSE: Fusing STT-MRAM into GPUs to alleviate off-chip memory access overheads,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2019, pp. 426–439.
[22] M. Qiu et al., “Data allocation for hybrid memory with genetic algorithm,” IEEE Trans. Emerg. Topics Comput., vol. 3, no. 4, pp. 544–555, Dec. 2015.
[23] K. Lee et al., “A 1.02-mW STT-MRAM-based DNN ECG arrhythmia monitoring SoC with leakage-based delay MAC unit,” IEEE Solid-State Circuits Lett., vol. 3, no. 9, pp. 390–393, Sep. 2020.
[24] M. Jasemi, S. Hessabi, and N. Bagherzadeh, “Reliable and energy efficient MLC STT-RAM buffer for CNN accelerators,” Comput. Elect. Eng., vol. 86, no. 9, 2020, Art. no. 106698.


[25] S. Resch et al., “MOUSE: Inference in non-volatile memory for energy harvesting applications,” in Proc. IEEE/ACM 53rd Annu. Int. Symp. Microarchit., 2020, pp. 400–414.
[26] D. Rossi et al., “Vega: A ten-core SoC for IoT endnodes with DNN acceleration and cognitive wake-up from MRAM-based state-retentive sleep mode,” IEEE J. Solid-State Circuits, vol. 64, no. 1, pp. 60–62, Jan. 2022.
[27] Y. Chiu et al., “A 22nm 4Mb STT-MRAM data-encrypted near-memory computation macro with a 192GB/s read-and-decryption bandwidth and 25.1–55.1TOPS/W 8b MAC for AI operations,” in Proc. IEEE Int. Solid-State Circuits Conf., 2022, pp. 178–180.
[28] A. Ranjan, S. Venkataramani, Z. Pajouhi, R. Venkatesan, K. Roy, and A. Raghunathan, “STAxCache: An approximate, energy efficient STT-MRAM cache,” in Proc. IEEE Des. Automat Test Eur. Conf. Exhib., 2017, pp. 356–361.
[29] A. Salahvarzi, A. M. H. Monazzah, M. Fazeli, and K. Skadron, “NOSTalgy: Near-optimum run-time STT-MRAM quality-energy knob management for approximate computing applications,” IEEE Trans. Comput., vol. 70, no. 3, pp. 414–427, Mar. 2021.
[30] A. M. H. Monazzah, A. M. Rahmani, A. Miele, and N. Dutt, “CAST: Content-aware STT-MRAM cache write management for different levels of approximation,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 12, pp. 4385–4398, Dec. 2020.
[31] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory with spin-transfer torque magnetic RAM,” IEEE Trans. Very Large Scale Integr. Syst., vol. 26, no. 3, pp. 470–483, Mar. 2018.
[32] M. Zabihi, Z. I. Chowdhury, Z. Zhao, U. R. Karpuzcu, J. Wang, and S. S. Sapatnekar, “In-memory processing on the spintronic CRAM: From hardware design to application mapping,” IEEE Trans. Comput., vol. 68, no. 8, pp. 1159–1173, Aug. 2019.
[33] H. Cai et al., “Proposal of analog in-memory computing with magnified tunnel magnetoresistance ratio and universal STT-MRAM cell,” IEEE Trans. Circuits Syst. I: Regular Papers, vol. 69, no. 4, pp. 1519–1531, Apr. 2022.
[34] S. Jung et al., “A crossbar array of magnetoresistive memory devices for in-memory computing,” Nature, vol. 601, no. 7892, pp. 211–216, 2022.
[35] H. Zhang, W. Kang, L. Wang, and W. Zhao, “Stateful reconfigurable logic via a single-voltage-gated spin hall-effect driven magnetic tunnel junction in a spintronic memory,” IEEE Trans. Electron Devices, vol. 64, no. 10, pp. 4295–4301, Oct. 2017.
[36] Y. Zhang et al., “Time-domain computing in memory using spintronics for energy-efficient convolutional neural network,” IEEE Trans. Circuits Syst. I: Regular Papers, vol. 68, no. 3, pp. 1193–1205, Mar. 2021.
[37] X. Wang et al., “Triangle counting accelerations: From algorithm to in-memory computing architecture,” IEEE Trans. Comput., no. 11, pp. 1–11, Nov. 2021.
[38] S. B. Dodo, R. Bishnoi, S. M. Nair, and M. B. Tahoori, “A spintronics memory PUF for resilience against cloning counterfeit,” IEEE Trans. Very Large Scale Integr. Syst., vol. 27, no. 11, pp. 2511–2522, Nov. 2019.
[39] S. Angizi, Z. He, A. Awad, and D. Fan, “MRIMA: An MRAM-based in-memory accelerator,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 5, pp. 1123–1136, May 2020.
[40] P. Deaville, B. Zhang, L. Chen, and N. Verma, “A maximally row-parallel MRAM in-memory-computing macro addressing readout circuit sensitivity and area,” in Proc. IEEE 47th Eur. Solid State Circuits Conf., 2021, pp. 75–78.
[41] A. Agrawal, A. Ankit, and K. Roy, “SPARE: Spiking neural network acceleration using ROM-embedded RAMs as in-memory-computation primitives,” IEEE Trans. Comput., vol. 68, no. 8, pp. 1190–1200, Aug. 2019.
[42] H. Zhang, W. Kang, K. Cao, B. Wu, Y. Zhang, and W. Zhao, “Spintronic processing unit in spin transfer torque magnetic random access memory,” IEEE Trans. Electron Devices, vol. 66, no. 4, pp. 2017–2022, Apr. 2019.
[43] H. Wang, Y. Zhao, C. Li, Y. Wang, and Y. Lin, “A new MRAM-based process in-memory accelerator for efficient neural network training with floating point precision,” in Proc. IEEE Int. Symp. Circuits Syst., 2020, pp. 1–5.
[44] T. Kim, Y. Jang, M. G. Kang, B. G. Park, K. J. Lee, and J. Park, “SOT-MRAM digital PIM architecture with extended parallelism in matrix multiplication,” IEEE Trans. Comput., vol. 71, no. 11, pp. 2816–2828, Nov. 2022.
[45] A. D. Patil, H. Hua, S. Gonugondla, M. Kang, and N. R. Shanbhag, “An MRAM-based deep in-memory architecture for deep neural networks,” in Proc. IEEE Int. Symp. Circuits Syst., 2019, pp. 1–5.

YUETING LI is currently working toward the PhD degree in Prof. W.S. Zhao’s Group with the School of Integrated Circuit Science and Engineering, Beihang University. Her research interests mainly include system integration, the application of MRAM, near-memory computing, and neural network accelerator design. She won the University Demo Best Demonstration in ACM/SIGDAUD’21, Best Presentation Award in ICCC’21, and the Finalist in ISLPED’21 Design Contest.

TIANSHUO BAI received the BS degree in the Beijing University of Technology, Beijing, China, in 2017. He is currently working toward the MS degree in the School of Integrated Circuit Science and Engineering, Beihang University. His research interests include compiler toolchain and digital computing-in-memory.

XINYI XU received the BS degree in computer science and technology from the China University of Geosciences, Beijing, in 2017. She is currently working toward the master’s degree in electronic information with Beihang University. Her research interests include near-memory computing and neural network accelerator design.

YUNDONG ZHANG established chip start-up T-Square Inc. in Silicon Valley, USA, which was merged by Ali Lab. Currently he is Co-Founder, Executive Director of Vimicro Corporation. He also serves as Executive Director of National Key Laboratory on Digital Multimedia Chip Technology in Beijing, China. He is known as a specialist in digital multimedia chip design and artificial intelligence chip design. He was awarded as First-class Prize of National Science and Technology Advancement.

BI WU (Member, IEEE) received the PhD degree from the School of Electronic Information Engineering, Beihang University, Beijing, China, in 2019, with the financial support of the China Scholarship Council, he spent one year as a visiting graduate student at the University of Notre Dame, USA, under the supervision of Professor Xiaobo Sharon Hu. After the PhD, he joined College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, as an Assistant Professor. His research interests include magnetic memory architecture, spintronic devices based in-memory computing architecture, neural network accelerator design, etc.


HAO CAI (Senior Member, IEEE) received the master’s degree and the PhD degree in electrical engineering from Lund University, Sweden, and TELECOM ParisTech, France, in 2009 and 2013, respectively. From 2013 to 2017, he was with Universite Paris-Saclay, France. In 2018, he joined the National ASIC System Engineering Center, Southeast University, Nanjing, China, where he is currently an Associate Professor. He is currently working on low-power MRAM design and device-circuit design interaction. He has authored or co-authored 2 book chapters and more than 120 scientific papers, including IEEE Journal of Solid-State Circuits, IEEE Trans. Circ. Syst. I: Reg. Papers, etc. He has been serving on the technical committee of the IEEE-CAS society and as a conference TPC member for DAC, GLSVLSI, Nanoarch, ESREF, and NEWCAS.

BIAO PAN (Member, IEEE) received the PhD degree in optical engineering from the Huazhong University of Science and Technology, Wuhan, in 2015. He is currently an Assistant Professor with the School of Integrated Circuit Science and Engineering, Beihang University, Beijing, China. His research interests include MRAM based processing-in-memory circuit and neuromorphic computing with the emerging non-volatile memory devices.

WEISHENG ZHAO (Fellow, IEEE) received the PhD degree in physics from the University of Paris Sud, in 2007. He worked as a research associate with the CEA’s embedded computing laboratory from 2007 to 2009, and with the French national research center (CNRS), as a tenured scientist from 2009 to 2014, where he led the spintronics integration group. Now he is a professor and director of Fert Beijing Institute, MIIT Key Laboratory of Spintronics, School of Integrated Circuit Science and Engineering, in Beihang University. His research focuses on spintronic memories and logics from devices and circuits to systems. He has authored or coauthored more than 200 scientific papers, in venues such as Nature Electronics, Nature Communications, Advanced Materials, and Proceedings of the IEEE. He is the editor-in-chief of IEEE Transactions on Circuits and Systems I: Regular Papers.
