A Survey of MRAM-Centric Computing: From Near Memory to In Memory
ABSTRACT Conventional von Neumann architecture suffers from bottlenecks in computing performance
and power consumption due to frequent data exchange between memory and processing units. To overcome
this issue, research on novel computing architectures, including near-memory computing (NMC) and
in-memory computing (IMC), has accelerated on the basis of emerging nonvolatile memory devices. Among
the various candidates, spintronic-based magnetic random-access memory (MRAM) has become a research
and development hotspot owing to its ultralow switching energy, nonvolatility, and superior endurance.
This paper outlines the background, trends, and challenges involved in the development of MRAM-centric
computing, and highlights recent prototypes and advances in applications based on MRAM-NMC and MRAM-IMC.
INDEX TERMS Near-memory computing, in-memory computing, MRAM, neural network, hybrid memory
architecture
2168-6750 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
VOLUME 11, NO. 2, APRIL-JUNE 2023
Authorized licensed use limited to: Indian Institute of Technology Indore. Downloaded on March 21,2024 at 06:50:19 UTC from IEEE Xplore. Restrictions apply.
LI et al.: Survey of MRAM-Centric Computing: From Near Memory to In Memory
FIGURE 1. The roadmap for MRAM-centric computing from the device level to the architecture level.
current with the scaling down of CMOS technologies, the bottleneck of power wall and memory wall may not be completely solved due to limitations from device to architecture [7]. Different from SRAM and DRAM, spintronic-based magnetic random-access memory (MRAM) is one of the most promising technologies for memory-centric computing systems owing to its advantages in nonvolatility, data retention, reprogramming durability, etc. At the device level, as shown in Figure 1, MRAM has made rapid progress as new mechanisms are constantly proposed. At the architecture level, the distance between the processing unit and the memory unit has decreased from the von Neumann architecture to MRAM-NMC and MRAM-IMC. Additionally, MRAM could drive the revolution in computing architecture at different technology nodes to produce highly efficient computations with favorable CMOS compatibility.
This review provides readers with a comprehensive understanding of MRAM-NMC and MRAM-IMC through a collection of state-of-the-art work, mainly from 2018 to 2022. The latest research and the existing challenges at the device, circuit, and architecture levels are also summarized. With the focus on the current status and future prospects of MRAM-NMC and MRAM-IMC, this review is aimed at encouraging the whole community to pay close attention to this area and ultimately translate basic research into industrial prototypes. The remaining parts of this paper are arranged as follows. The background on MRAM is briefly given in Section II. Section III presents MRAM-NMC from three dimensions, namely hardware implementation, software optimization for data flow, and promising applications. Section IV reviews MRAM-IMC from three perspectives: hardware implementation, computing paradigm, and promising applications. Section V concludes this paper with discussions about future challenges of MRAM-centric computing.
FIGURE 2. Applications of MRAM-based IMC and NMC. MRAM-NMC forms (a) LLC/main memory or (b) hybrid memory architecture to implement promising applications such as (c) neural network acceleration and (d) approximate computing. MRAM-IMC employs two main memory arrays (e) 1T-1MTJ and (f) 2T-1MTJ to realize hardware implementation of (g) neural network acceleration and (h) graph computing.
II. BACKGROUND OF MRAM
Compared to the resistive RAM (RRAM), phase change memory (PCM), and ferroelectric field effect transistor (FeFET), MRAM offers advantageous performance among the emerging nonvolatile memory technologies, especially in terms of the energy and time of read operations. Since the late 1980s, a series of scientific discoveries and technological innovations have contributed to the rapid progress in MRAM technology and its commercialization. The discovery of the tunneling magnetoresistive (TMR) effect for the CoFe/MgO-based magnetic tunnel junction (MTJ) device serves as one of the milestones during this progress [8]. The MTJ, which mainly comprises two ferromagnetic layers separated by a tunneling barrier layer, is the key building block of MRAM. The magnetization direction of the fixed layer is generally permanent, while that of the free layer can be switched between two directions to denote a one-bit datum of '0' or '1'. Until now, several manipulation methods have been proposed, such as magnetic field, spin transfer torque (STT), spin-orbit torque (SOT) and voltage-controlled magnetic anisotropy (VCMA) [9]. The evolution of manipulation methods allows for higher energy efficiency and higher write speed of the MRAM bit-cell, which could be attributed to the significant decrease in Joule heating and switching delay. Of all the methods above, the magnetic field and STT methods have been successfully commercialized, namely as Toggle-MRAM (first in 2006) and STT-MRAM (first in 2012) [10].
The STT-MRAM can offer much higher density and lower energy, and commercial devices are now used as a replacement for embedded flash (eFlash) memory or SRAM in embedded applications. In addition, as the size of the MTJ shrinks, issues such as read disturbance and tunnel barrier breakdown have become more serious. To mitigate the STT limitations, SOT
and VCMA effects were proposed as alternatives. SOT-MRAM can partially overcome these problems since it uses separated current paths for read and write at the cost of an extra transistor. On the other hand, the VCMA, which utilizes a voltage (or electric field) for MTJ writing by increasing the thickness of the tunneling barrier, can greatly save the write energy [11].
Besides, various novel manipulation methods are also under research for the next-generation MRAM, such as skyrmion or domain-wall based racetrack memory [12], SOT+STT [13], current-induced exchange bias switching via the SOT [14] and magnetoelectric spin-orbit (MESO) devices [15], as shown in Figure 1.
III. MRAM-NMC
The MRAM-NMC connects the computing circuit block to the MRAM macro with the primary working forms shown in Figures 2(a), 2(b), 2(c), and 2(d), which helps reduce the memory wall effect while MRAM addresses the power wall. The MRAM-NMC architecture stores data in the same manner as the typical MRAM array and optimizes data movement by tightly intercoupling the multiple chips/nodes to maximize its physical advantages. This section reviews the MRAM-NMC from the following three aspects: memory-to-system implementation, software optimization for dataflow, and promising applications.
A. MEMORY-TO-SYSTEM IMPLEMENTATION
Various designs of MRAM-NMC in the multi-core architecture have been proposed, demonstrating their potential to enhance system performance, such as the hybrid memory controller [21], the special advantage of applying MRAM [18], and approximate computing [28], [29], [30].
Memory Controller. Zhang et al. [21] proposed a memory controller with 38-bit lines to integrate STT-MRAM and SRAM as L1 cache. In their scheme, the incoming data were placed into SRAM at first and STT-MRAM worked as the buffer of SRAM. When the SRAM capacity was not sufficient, the data could be placed in STT-MRAM to reduce off-chip memory access. This design employs the associativity approximation logic to eliminate cache movements, which accommodates more write-once-read-more data stored in STT-MRAM. Thus, the arbitration module judges the tag search from a set of status registers of STT-MRAM and SRAM to serve the request from the hit cache. If the memory controller meets the case of write-multiple, the arbitrator will move the data from STT-MRAM to SRAM. Based on this workflow, it can reduce the overhead from off-chip memory access by 32% and energy costs by 53%.
MRAM Benefit. The increasing chip density has promoted dramatic growth in on-chip temperature. However, this could provide an opportunity to optimize the performance of the STT-MRAM-based system. Wu et al. [18] reported an STT-MRAM-based last level cache (LLC) with a novel thermally aware non-uniform cache access (NUCA) design, which exploits the downward trend of STT-MRAM write latency and energy consumption at rising temperatures. To reduce leakage energy consumption in the system, STT-MRAM was divided into 64 banks connected with the SRAM L1 cache and each core. NUCA, depending on thermal awareness, adopted multiple migration strategies in different thermal regions of STT-MRAM. The simulation results show that the NUCA design with negligible hardware overhead can save 41.2% write energy at most and 13.01% on average.
Approximate Computing. To utilize the specific error-energy tradeoff of STT-MRAM, Ranjan et al. [28] proposed
tightly coupled with each core. The search process of the genetic algorithm can find the optimal solution for the data allocation when the number of iterations or searches exceeds the predefined threshold. Then the genetic algorithm can minimize memory access costs and write operations to MRAM. Similar to [19], Salehi et al. proposed the energy-aware cache block insertion/migration policy in the STT-MRAM-based hybrid memory architecture. The scheme of the hybrid 8-way set associative SRAM (way-0 and way-1) and STT-MRAM (way-2 through way-7) is adopted as the LLC design. The insertion/migration policy maintains the read-dominant cache blocks in STT-MRAM banks while write-intensive blocks are transferred to SRAM banks. Besides, the insertion policy also maximizes self-organized sub-bank throughput to improve the access bandwidth of the STT-MRAM.
Write Failure. The works in [22] and [19] play an important role in the implementation and reduction of write operations for MRAM. To handle the write error rate (WER) of STT-MRAM in the hybrid memory architecture, Talebi et al. [20] proposed ROCKY, a robust architecture based on the cache controller to redirect traffic. The design of ROCKY mainly considers two constraints from STT-MRAM, namely write operation and incoming block. The replacement policy finds the target block to write in the STT-MRAM when the updated data area is less than the threshold. On a write-back (WB) hit, the updated block is written in the STT-MRAM to check the hardware boundary while determining whether the write operation is performed. On the contrary, STT-MRAM needs to free the new area for the hit block. The simulation results show that ROCKY can reduce the dynamic power consumption and write failure rate of STT-MRAM by 3.55% and 28.7%, respectively.
When optimizing WB for STT-MRAM, Liang et al. [17] put forward E-cache, an energy-aware cache replacement policy, which takes into account WB energy, infrequently-used data and energy consumption. E-cache boasts three strengths: 1) the WB energy of STT-MRAM is minimized by evaluation and calculation; 2) the promotion mechanism dynamically increases the updated probability of heavy write-energy data; 3) the energy-aware write-back strategy evicts data from the LLC depending on the cache hit rate and the write-back energy consumption. To be specific, the WB energy calculator first compares two data sources (modified data and original data) to evaluate the energy consumption of the WB operation from STT-MRAM. Then, the candidate promotion mechanism dynamically adjusts more positions for STT-MRAM due to its high write cost. Finally, the energy-aware WB strategy activates and moves the data to the promotion position when the WB energy consumption of STT-MRAM is higher than the threshold. Compared with the conventional algorithm, E-cache reduces power consumption by 36%.
MRAM-NMC inspires many corresponding requirements for computing architecture to offer excellent potential for near-zero standby power and high design integration. There are two factors among the mentioned changes in software optimization for dataflow: 1) MRAM-NMC is designed to apply the physical advantages of MRAM while addressing its unfavorable effects, such as the longer write time and higher write latency of MRAM; 2) the rationalization of data allocation for MRAM is applied at different levels with other memory units, considering virtual memory support, cache coherence, and data mapping.
C. PROMISING APPLICATIONS
Conventional processing units such as CPU and GPU fail to match the essentials of machine learning (ML) algorithms and the speed of neural network accelerators. To dramatically enhance the data efficiency between the memory unit and compute unit for emerging modern workloads, as shown in Table 1, recent research on MRAM-NMC has addressed security [27], [16], Internet of Things [26], embedded machine learning [23], specialized accelerators [24], and intermittent processing [25].
Security. Data encryption/decryption needs to be considered in MRAM-NMC for security-related applications. Chiu et al. [27] proposed a 4Mb STT-MRAM for data-encrypted,
TABLE 2. Example diagrams of MRAM-IMC in (a)(b) hardware implementation, (c)(d) computing paradigm, and (e) promising application.
* Each of the 16 Local MRAM bit-cell arrays in [33] maximally supports 1024 8-bit weight data.
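As a companion to the CRAM structure in Table 2(b), the following minimal sketch evaluates the logic-line current of Eq. (1) in the IMC-A discussion, for two input MTJs in parallel feeding one output MTJ, and compares it with a critical current. The resistance values, bias voltage, and critical current are illustrative assumptions, not values taken from [32], and the function names are hypothetical.

```python
# Illustrative sketch of the CRAM logic-line current, Eq. (1).
# Every numeric value here is an assumption chosen for demonstration only.

R_P = 3_000.0    # assumed low (parallel) MTJ resistance, in ohms
R_AP = 7_500.0   # assumed high (antiparallel) MTJ resistance, in ohms


def logic_line_current(r1, r2, r_out, v_bsl):
    """Return I = V_BSL / ((R1 || R2) + Ro), in amperes."""
    r_parallel = (r1 * r2) / (r1 + r2)
    return v_bsl / (r_parallel + r_out)


def output_switches(r1, r2, r_out, v_bsl, i_c):
    """Return True when the logic-line current exceeds the critical current."""
    return logic_line_current(r1, r2, r_out, v_bsl) > i_c


if __name__ == "__main__":
    v_bsl = 0.5    # assumed bias voltage on the bit select line, in volts
    i_c = 100e-6   # assumed critical current of the output MTJ, in amperes
    for r1, r2 in [(R_P, R_P), (R_P, R_AP), (R_AP, R_P), (R_AP, R_AP)]:
        current = logic_line_current(r1, r2, R_P, v_bsl)
        switched = output_switches(r1, r2, R_P, v_bsl, i_c)
        print(f"R1={r1:.0f} R2={r2:.0f} I={current * 1e6:.1f} uA switch={switched}")
```

With these assumed values, only the case where both input MTJs are in the low-resistance state drives the current above the critical current, which is how a two-input gate can be selected by the choice of bias voltage. Which logic function results depends on the preset state of the output MTJ and on the mapping of resistance states to logic values, neither of which is fixed by this sketch.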
with more elaborate SA design which is divided into two sub-SAs and six reconfigurable branches.
The primary advantage of IMC-P is that the computing memory core is not different from standard MRAM, and thus the storage density and the regular read/write operations can be maintained. However, some challenges still exist: 1) the pre-processing data should be grouped and transported into the same bank, which may cause extra power consumption; 2) the post-processing data should be cached before being used in the next procedure; 3) poor scalability to realize complex logic functions; 4) the reference current/voltage must be very precise to ensure the correctness of the result, which is not an easy task.
IMC-A. Contrary to IMC-P, the key idea behind IMC-A is to exploit the MRAM bit-cells for logic operation by dynamically configuring them with regular write/read operations.
CRAM [32] is a typical example of IMC-A that employs one additional transistor to provide a platform where logic operations are performed within the MRAM array. The general structure of the CRAM array is shown in Table 2 (b), in which the MTJ of each bit-cell is addressed through the first transistor (T1) and logic operations could be completed by selecting the second transistor (T2). The CRAM thus can work in two modes: 1) Memory Mode: When T1 is turned on by pulling up the WL and T2 is turned off by holding down the logic bit line (LBL), data could be read from or written into the MTJ. During this mode, the configuration is effectively identical to a standard STT-MRAM bit-cell; 2) Logic Mode: When T1 is turned off by holding down the WL and T2 is turned on by pulling up LBL, the MTJ is connected to a logic line (LL) in each row. During this mode, several MTJs in a row could form a logic gate, such as AND and NAND. The operand of a logic gate could be expressed by the resistance states of the input/output MTJs, while an appropriate voltage is applied to the bit select lines (BSLs). If three MTJs are connected to the LL as shown in Table 2 (b), the resistance states of the MTJs are $R_1$, $R_2$ and $R_o$, respectively. The sum of resistances corresponds to $(R_1 \parallel R_2) + R_o$. With a voltage $V_{BSL}$ applied to the BSL, the current $I$ of the LL is calculated by:
$$I = \frac{V_{BSL}}{\dfrac{R_1 R_2}{R_1 + R_2} + R_o} \qquad (1)$$
Let $I_c$ be the critical threshold current of the MTJ. Depending on the relationship of $I$ and $I_c$, the state output will vary (if $I > I_c$) or remain the same (if $I < I_c$), i.e., a logic function with multiple inputs and one output is completed.
SPU [42] is another example of IMC-A that exploits STT-MRAM to implement different reconfigurable logic functions within one or two read/write cycles. The key idea behind SPU is to realize multi-bit logic operations in a highly parallel structure with only a transmission gate added on each WL to control the access signal. Besides, by adding a control unit in the main MRAM array to translate the computing commands into read/write operations, a feasible MRAM-IMC platform is established.
B. COMPUTING PARADIGM
According to the signal type and computing paradigm for implementation, MRAM-IMC could be divided into two basic categories: analog and digital.
Analog MRAM-IMC. As shown in Figure 6(b), typical analog MRAM-IMC retains the structure of the MRAM array and read/write function. It completes logic operations by applying voltage or current signals directly to the bit-cells. External signals are entered on each WL through a digital-to-analog converter (DAC) or pulse width modulation, and multiple rows are accessed by decoders and drivers at once. After
the data stored inside the bit-cell has completed logic operations with external data, the intermediate results of each column are accumulated and moved into the bottom analog-to-digital converter (ADC) to be converted into the digital output. Depending on the trade-off between precision and energy consumption, 4-8 bit ADCs are usually used in analog MRAM-IMC.
Cai et al. [33] proposed a novel TMR ratio magnifying method based on a universal 1T-1MTJ STT-MRAM bit-cell to realize analog MRAM-IMC. The general structure of this design is shown in Table 2(c), in which the MTJ is connected to a latch structure while the peripheral circuits are minimally modified to enable in-memory matrix-vector multiplication. A virtual TMR magnified by 7500× is achieved, leading to a 57.6% reduced integral nonlinearity and a 9.47-25.4 TOPS/W energy efficiency for CNN with 2-bit input, 1-bit weight and 4-bit output.
The primary merits of analog MRAM-IMC are high on-chip bandwidth and computation-area efficiency. Currently, with the data explosion, analog MRAM-IMC has found its way to conduct MAC operations with a high degree of parallelism and enhanced throughput. However, some challenges persist: 1) the data precision determined by the partial sum is confined by the ADC resolution, and there is a stringent requirement for the area of a high-resolution ADC (>8 bit); 2) non-ideal device characteristics including the cell-to-cell variation and the intrinsic ADC offset will degrade the computing accuracy; 3) a lack of computing robustness due to a low signal-to-noise ratio (SNR) during the analog signal processing.
Digital MRAM-IMC. As shown in Figure 6(b), the digital MRAM-IMC paradigm consists of multiple processing element (PE) units which are constituted by an MRAM array and Boolean logic blocks. Compared to the analog counterpart, digital MRAM-IMC implementations are typically less energy/area efficient, but are more scalable and tolerant to noise and variations.
The CRISP [44] architecture is a representative instance favoring digital logic operations inside memory. Its spintronic-assisted logic-in-memory (SLIM) cells can execute a series of partial product generations and additions to perform MAC operations as shown in Table 2(d). In the initial stage, the weight W[m] of input cell IN1 and the output cells (OUT1 and OUT2) are set to '1'. Two input currents I[n] and I[n+1] are applied as input voltages for the VCMA effect. Only when input and weight are both set to '1' can the weight equivalent current generated from IN1 surpass the magnitude of the switching current of OUT1 and OUT2. Otherwise, the two output cells will remain in their original state, which achieves the NAND operation. The subsequent full addition is fulfilled with successive majority-voting-3 (MV3) and majority-voting-5 (MV5) operations. When the majority of input cells are in the low resistance (data '1'), the accumulated current gets larger than the switching threshold and flips the output cells to data '0'. Through these logic functions, SLIM cells can be efficiently used in the memory configuration of the CRISP architecture for multiple matrix multiplications as PE units.
Zhang et al. [35] presented a reconfigurable MRAM-IMC architecture employing a single voltage-gated SHE-driven MTJ, which was also designed in the digital computing paradigm. In this approach, two inputs are needed for the state shift. One input is represented by the VCMA bias voltage across the MTJ and the other is represented by the initial data stored in the MTJ. The logic output result is calculated and recorded as the state of the MTJ in the memory cell. By measuring its resistance state in the read operation cycle, the logic output can then be read. The result demonstrated the feasibility of achieving stateful reconfigurable Boolean logic functions by a single VG-SHE driven MTJ device.
The primary advantage of digital MRAM-IMC is its ability to realize high accuracy on high-precision computing (>16 bit) and flexibility for various bit widths. However, some problems remain unresolved: 1) a combination of logical units with bit-cells occupies extra area; 2) copying parameters generally requires large extra memory size in the digital IMC architecture; 3) complex operations such as matrix multiplication need to be decomposed into a collection of basic operations, which will take more execution cycles and latency.
C. PROMISING APPLICATIONS
MRAM-IMC has shown its potential for reducing most of the data transmission energy and latency while performing computing within memory. Previous MRAM-IMC proposals were classified according to specific applications when carrying out data-centric tasks [5], including scientific computing, signal optimization, machine learning, etc. This paper takes Neural Network and Graph Computing as examples to analyze the MRAM-IMC architecture in terms of its application progress and prospect.
Neural Network. With the increasing data set scale and computing complexity, the efficiency of neural network algorithms is limited owing to the von Neumann bottleneck. To address this issue, the MRAM-IMC architecture has been introduced as a possible solution, displaying superior performance in terms of energy efficiency and latency [36], [43], [45].
Zhang et al. [36] presented a time-domain computing in memory (TD-CIM) scheme based on SOT-MRAM to optimize the energy efficiency and delay for CNN applications. It achieves Boolean logic operations by recording the BL output at different moments. Compared with CRAM [32] in identifying the MNIST dataset, the delay of the TD-CIM architecture is reduced by 1.2-2.7 times, and the energy is decreased by 2.4×10^3 to 1.1×10^4 times.
In 2021, Deaville et al. [40] first proposed an MRAM-IMC macro designed with a 128-Kb array in 22nm FD-SOI technology (Table 3 (a)), achieving an area-normalized throughput of 758 GOPS/mm^2 and an energy efficiency of 5.1 TOPS/W. When the architecture is applied to the CIFAR-10 classification task, the inference accuracy reaches 90.1%, matching ideal software-based computation.
Very recently, Jung et al. [34] reported a crossbar array based on MRAM cells with a resistance-sum method for analogue MAC operation. This approach replaces
Kirchhoff's law and consumes less power than the previous standard crossbar array with the current-sum method. A 64×64 array is integrated with the readout electronics based on Time-to-Digital Converters (TDC) in the 28 nm CMOS process, reaching 405 TOPS/W power efficiency while processing dot products with a 0.8 V supply for the TDCs (Table 3 (b)). Using a two-layer binary neural network perceptron, the accuracy of applying the crossbar array in 10,000-image MNIST classification tasks is up to 93.23±0.05%. From the perspective of MRAM industrialization, these two prototypes have brought volume production of MRAM-IMC chips closer to routine, which may help to push IMC technology to the forefront.
TABLE 3. MRAM-IMC prototype in the application of neural network.
Graph Computing. Due to the growing need to dissect relationships from massive data, graph computing has received extensive attention. Triangle counting (TC) is a fundamental issue in graph computing in which obtaining the number of triangles in a given graph is the key to extracting the relation net model. Traditional graph algorithms have problems with complex control capabilities when applied to MRAM-IMC. To cope with this issue, Wang et al. [37] innovatively reformulated the TC problem into basic Boolean logic functions and designed a triangle counting in-memory (TCIM) accelerator using simple AND and BitCount operations for computing. By slicing and compressing the input graph, valid data will be loaded into the bit-cells of the STT-MRAM array and processed with efficient in-memory bitwise operations as shown in Table 2 (e). On the SNAP dataset from real-life graphs, the execution results outperform the energy-efficient FPGA by 31.8× and achieve a 34× energy efficiency improvement. MRAM-IMC excels in terms of precision, energy usage, speed, stability and endurance [34]. Based on the switching properties of STT-MRAM, the physical unclonable function is also a candidate for embedded secure devices [38]. Despite the ability to perform computationally expensive and memory-intensive tasks, MRAM-IMC still faces challenges in large-scale integration on chip to fulfill industrial demands of current big-data-driven applications.
V. CHALLENGES AND PROSPECT
In this review, we briefly introduce the MRAM-centric computing solutions, which can be categorized as MRAM-NMC and MRAM-IMC to address the bottleneck of the von Neumann architecture. MRAM-NMC places computational units at the periphery of the memory array for fast data access, while MRAM-IMC uses the memory array to perform logic operations directly through simple configuration. We collected some representative works of MRAM-NMC and MRAM-IMC published in recent years. Figure 7 plots energy efficiency versus processing node of these studies, showing the technology node scaling from 6x nm to 2x nm and the normalized energy efficiency downscaling from pJ/bit to fJ/bit. Despite this progress, MRAM-NMC
FIGURE 7. Energy consumption of (a) MRAM-NMC and (b) MRAM-IMC in range of 6x-2x nm technology nodes. Note: In order to make the
comparison more intuitive, the energy consumption data from the references has been transformed into a uniform unit (pJ/bit).
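To make the unit normalization mentioned in the caption of Figure 7 concrete, the sketch below converts an energy efficiency quoted in TOPS/W into energy per operation, using the identity that 1 TOPS/W equals 10^12 operations per joule, that is, 1 pJ per operation. The further step to pJ/bit depends on how many bits are attributed to each operation. The survey does not state the factor it used, so `bits_per_op` is a hypothetical parameter and the function names are illustrative only.

```python
def tops_per_watt_to_pj_per_op(tops_per_watt: float) -> float:
    """Convert TOPS/W to energy per operation in pJ (1 TOPS/W -> 1 pJ/op)."""
    if tops_per_watt <= 0:
        raise ValueError("energy efficiency must be positive")
    return 1.0 / tops_per_watt


def pj_per_bit(tops_per_watt: float, bits_per_op: float) -> float:
    """Energy per bit in pJ, assuming each operation handles bits_per_op bits.

    bits_per_op is an assumption supplied by the caller; it is not given
    in the survey.
    """
    if bits_per_op <= 0:
        raise ValueError("bits_per_op must be positive")
    return tops_per_watt_to_pj_per_op(tops_per_watt) / bits_per_op


if __name__ == "__main__":
    # Efficiencies quoted in the text: 5.1 TOPS/W [40] and 405 TOPS/W [34].
    for label, efficiency in [("[40]", 5.1), ("[34]", 405.0)]:
        energy_pj = tops_per_watt_to_pj_per_op(efficiency)
        print(f"{label}: {efficiency} TOPS/W -> {energy_pj * 1000:.2f} fJ/op")
```

Under this identity, the 405 TOPS/W figure corresponds to roughly 2.5 fJ per operation and the 5.1 TOPS/W figure to roughly 196 fJ per operation, which is consistent with the pJ-to-fJ range discussed around Figure 7.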
[25] S. Resch et al., “MOUSE: Inference in non-volatile memory for energy [45] A. D. Patil, H. Hua, S. Gonugondla, M. Kang, and N. R. Shanbhag, “An
harvesting applications,” in Proc. IEEE/ACM 53rd Annu. Int. Symp. MRAM-based deep in-memory architecture for deep neural networks,” in
Microarchit., 2020, pp. 400–414. Proc. IEEE Int. Symp. Circuits Syst., 2019, pp. 1–5.
[26] D. Rossi et al., “Vega: A ten-core SoC for IoT endnodes with DNN
acceleration and cognitive wake-up from MRAM-based state-retentive
sleep mode,” IEEE J. Solid-State Circuits, vol. 64, no. 1, pp. 60–62,
YUETING LI is currently working toward the PhD
Jan. 2022.
degree in Prof. W.S. Zhao’s Group with the School
[27] Y. Chiu et al., “A 22nm 4Mb STT-MRAM data-encrypted near-memory
of Integrated Circuit Science and Engineering, Bei-
computation macro with a 192GB/s read-and-decryption bandwidth and
hang University. Her research interests mainly
25.1–55.1TOPS/W 8b MAC for AI operations,” in Proc. IEEE Int. Solid-
include system integration, the application of
State Circuits Conf., 2022, pp. 178–180.
MRAM, near-memory computing, and neural net-
[28] A. Ranjan, S. Venkataramani, Z. Pajouhi, R. Venkatesan, K. Roy, and A.
work accelerator design. She won the University
Raghunathan, “STAxCache: An approximate, energy efficient STT-
Demo Best Demonstration in ACM/SIGDAUD’21,
MRAM cache,” in Proc. IEEE Des. Automat Test Eur. Conf. Exhib., 2017,
Best Presentation Award in ICCC’21, and the
pp. 356–361.
Finalist in ISLPED’21 Design Contest.
[29] A. Salahvarzi, A. M. H. Monazzah, M. Fazeli, and K. Skadron, “NOS-
Talgy: Near-optimum run-time STT-MRAM quality-energy knob manage-
ment for approximate computing applications,” IEEE Trans. Comput.,
vol. 70, no. 3, pp. 414–427, Mar. 2021.
[30] A. M. H. Monazzah, A. M. Rahmani, A. Miele, and N. Dutt, “CAST: Con- TIANSHUO BAI received the BS degree in the
tent-aware STT-MRAM cache write management for different levels of Beijing University of Technology, Beijing, China,
approximation,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., in 2017. He is currently working toward the MS
vol. 39, no. 12, pp. 4385–4398, Dec. 2020. degree in the School of Integrated Circuit Science
[31] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory and Engineering, Beihang University. His research
with spin-transfer torque magnetic RAM,” IEEE Trans. Very Large Scale interests include compiler toolchain and digital
Integr. Syst., vol. 26, no. 3, pp. 470–483, Mar. 2018. computing-in-memory.
[32] M. Zabihi, Z. I. Chowdhury, Z. Zhao, U. R. Karpuzcu, J. Wang, and S. S.
Sapatnekar, “In-memory processing on the spintronic CRAM: From hard-
ware design to application mapping,” IEEE Trans. Comput., vol. 68, no. 8,
pp. 1159–1173, Aug. 2019.
[33] H. Cai et al., “Proposal of analog in-memory computing with magnified tunnel magnetoresistance ratio and universal STT-MRAM cell,” IEEE Trans. Circuits Syst. I: Regular Papers, vol. 69, no. 4, pp. 1519–1531, Apr. 2022.
[34] S. Jung et al., “A crossbar array of magnetoresistive memory devices for in-memory computing,” Nature, vol. 601, no. 7892, pp. 211–216, 2022.
[35] H. Zhang, W. Kang, L. Wang, and W. Zhao, “Stateful reconfigurable logic via a single-voltage-gated spin hall-effect driven magnetic tunnel junction in a spintronic memory,” IEEE Trans. Electron Devices, vol. 64, no. 10, pp. 4295–4301, Oct. 2017.
[36] Y. Zhang et al., “Time-domain computing in memory using spintronics for energy-efficient convolutional neural network,” IEEE Trans. Circuits Syst. I: Regular Papers, vol. 68, no. 3, pp. 1193–1205, Mar. 2021.
[37] X. Wang et al., “Triangle counting accelerations: From algorithm to in-memory computing architecture,” IEEE Trans. Comput., no. 11, pp. 1–11, Nov. 2021.
[38] S. B. Dodo, R. Bishnoi, S. M. Nair, and M. B. Tahoori, “A spintronics memory PUF for resilience against cloning counterfeit,” IEEE Trans. Very Large Scale Integr. Syst., vol. 27, no. 11, pp. 2511–2522, Nov. 2019.
[39] S. Angizi, Z. He, A. Awad, and D. Fan, “MRIMA: An MRAM-based in-memory accelerator,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 5, pp. 1123–1136, May 2020.
[40] P. Deaville, B. Zhang, L. Chen, and N. Verma, “A maximally row-parallel MRAM in-memory-computing macro addressing readout circuit sensitivity and area,” in Proc. IEEE 47th Eur. Solid State Circuits Conf., 2021, pp. 75–78.
[41] A. Agrawal, A. Ankit, and K. Roy, “SPARE: Spiking neural network acceleration using ROM-embedded RAMs as in-memory-computation primitives,” IEEE Trans. Comput., vol. 68, no. 8, pp. 1190–1200, Aug. 2019.
[42] H. Zhang, W. Kang, K. Cao, B. Wu, Y. Zhang, and W. Zhao, “Spintronic processing unit in spin transfer torque magnetic random access memory,” IEEE Trans. Electron Devices, vol. 66, no. 4, pp. 2017–2022, Apr. 2019.
[43] H. Wang, Y. Zhao, C. Li, Y. Wang, and Y. Lin, “A new MRAM-based process in-memory accelerator for efficient neural network training with floating point precision,” in Proc. IEEE Int. Symp. Circuits Syst., 2020, pp. 1–5.
[44] T. Kim, Y. Jang, M. G. Kang, B. G. Park, K. J. Lee, and J. Park, “SOT-MRAM digital PIM architecture with extended parallelism in matrix multiplication,” IEEE Trans. Comput., vol. 71, no. 11, pp. 2816–2828, Nov. 2022.

XINYI XU received the BS degree in computer science and technology from the China University of Geosciences, Beijing, in 2017. She is currently working toward the master’s degree in electronic information with Beihang University. Her research interests include near-memory computing and neural network accelerator design.

YUNDONG ZHANG established chip start-up T-Square Inc. in Silicon Valley, USA, which was merged by Ali Lab. Currently, he is Co-Founder and Executive Director of Vimicro Corporation. He also serves as Executive Director of the National Key Laboratory on Digital Multimedia Chip Technology in Beijing, China. He is known as a specialist in digital multimedia chip design and artificial intelligence chip design. He was awarded the First-class Prize of National Science and Technology Advancement.

BI WU (Member, IEEE) received the PhD degree from the School of Electronic Information Engineering, Beihang University, Beijing, China, in 2019. With the financial support of the China Scholarship Council, he spent one year as a visiting graduate student at the University of Notre Dame, USA, under the supervision of Professor Xiaobo Sharon Hu. After the PhD, he joined the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, as an Assistant Professor. His research interests include magnetic memory architecture, spintronic-device-based in-memory computing architecture, and neural network accelerator design.
HAO CAI (Senior Member, IEEE) received the master’s degree and the PhD degree in electrical engineering from Lund University, Sweden, and TELECOM ParisTech, France, in 2009 and 2013, respectively. From 2013 to 2017, he was with Université Paris-Saclay, France. In 2018, he joined the National ASIC System Engineering Center, Southeast University, Nanjing, China, where he is currently an Associate Professor. He is currently working on low-power MRAM design and device-circuit design interaction. He has authored or co-authored 2 book chapters and more than 120 scientific papers, including in IEEE Journal of Solid-State Circuits, IEEE Trans. Circ. Syst. I: Reg. Papers, etc. He has been serving on the technical committee of the IEEE-CAS society and as a conference TPC member for DAC, GLSVLSI, Nanoarch, ESREF, and NEWCAS.

WEISHENG ZHAO (Fellow, IEEE) received the PhD degree in physics from the University of Paris Sud, in 2007. He worked as a research associate with the CEA’s embedded computing laboratory from 2007 to 2009, and with the French national research center (CNRS), as a tenured scientist from 2009 to 2014, where he led the spintronics integration group. He is now a professor and director of Fert Beijing Institute, MIIT Key Laboratory of Spintronics, School of Integrated Circuit Science and Engineering, at Beihang University. His research focuses on spintronic memories and logic, from devices and circuits to systems. He has authored or coauthored more than 200 scientific papers in venues such as Nature Electronics, Nature Communications, Advanced Materials, and Proceedings of the IEEE. He is the editor-in-chief of IEEE Transactions on Circuits and Systems I: Regular Papers.