
On-chip MRAM as a High-Bandwidth, Low-Latency Replacement for DRAM Physical Memories

Abstract:
Impediments to main memory performance have traditionally been due to the divergence in processor versus memory speed and the pin bandwidth limitations of modern packaging technologies. In this paper we evaluate a magneto-resistive memory (MRAM)-based hierarchy to address these future constraints. MRAM devices are nonvolatile, and have the potential to be faster than DRAM, denser than embedded DRAM, and can be integrated into the processor die in layers above those of conventional wiring. We describe basic MRAM device operation, develop detailed models for MRAM banks and layers, and evaluate an MRAM-based memory hierarchy in which all off-chip physical DRAM is replaced by on-chip MRAM. We show that this hierarchy offers extremely high bandwidth, resulting in a 15% improvement in end-program performance over conventional DRAM-based main memory systems. Finally, we compare the MRAM hierarchy to one using a chipstacked DRAM technology and show that the extra bandwidth of MRAM enables it to outperform this nearer-term technology. We expect that the advantage of MRAM-like technologies will increase with the proliferation of chip multiprocessors due to increased memory bandwidth demands.

Introduction:
Main memory latencies are already hundreds of cycles; often processors spend more than half of their time stalling for L2 misses. Memory latencies will continue to grow, but more slowly over the next decade than in the last, since processor pipelines are nearing their optimal depths. However, off-chip bandwidth will continue to grow as a performance-limiting factor, since the number of transistors on chip is increasing at a faster rate than chip signaling pins. Left unaddressed, this disparity will limit the scalability of future chip multiprocessors. Larger caches can reduce off-chip bandwidth constraints, but consume area that could instead be used for processing, limiting the number of useful processors that can be implemented on a single die. In this paper, we evaluate the potential of on-chip magneto-resistive random access memory (MRAM) to solve this set of problems. MRAM is an emerging memory technology that stores information using the magnetic polarity of a thin ferromagnetic layer. This information is read by measuring the current across an MRAM cell, determined by the rate of electron quantum tunneling, which is in turn affected by the magnetic polarity of the cell. MRAM cells have many potential advantages. They are non-volatile, and they can be both faster, and potentially as dense, as DRAM cells. They can be implemented in wiring layers above an active silicon substrate as part of a single chip. Multiple MRAM layers can thus be placed on top of a single die, permitting highly integrated capacities. Most important, the enormous interconnection density of 100,000 vertical wires per square millimeter, assuming vertical wires have pitch similar to global vias (currently 24× thickness and 10× width), will enable as many as 10,000 wires per addressable bank within the MRAM layer. In this technology, the number of interconnects and total bandwidth are limited by the pitch of the vertical vias rather than that of the pads required by conventional packaging technologies. Unsurprisingly, MRAM devices have several potential drawbacks. They require high power to write, and layers of MRAM devices may interfere with heat dissipation. Furthermore, while MRAM devices have been prototyped, the latency and density of production MRAM cells in contemporary conventional technologies remain unknown. To justify the investment needed to make MRAMs commercially competitive will require evidence of significant advantages over conventional technologies. One goal of our work is to determine whether MRAM hierarchies show enough potential performance advantages to be worth further exploration.

In this paper, we develop and describe access latency and area models for MRAM banks and layers. Using these models, we simulate a hierarchy that replaces off-chip DRAM physical memories with an on-chip MRAM memory hierarchy. Our MRAM hierarchy breaks a single MRAM layer into a collection of banks, in which the MRAM devices sit between two metal wiring layers, but in which the decoders, word line drivers, and sense amplifiers reside on the transistor layer, thus consuming chip area. The MRAM banks hold physical pages, and under each MRAM bank resides a small level-2 (L2) cache which caches lines mapped to that MRAM bank. The mapping of physical pages thus determines to which L2 cache bank a line will be mapped. Since some MRAM banks are more expensive to access than others, due to the physical distances across the chip, page placement into MRAM banks can affect performance. An ideal placement policy would:
(1) minimize routing latency by placing frequently accessed pages into MRAM banks close to the processor,
(2) minimize network congestion by placing pages into banks that have the fewest highly accessed pages, and
(3) minimize L2 bank miss rates by distributing hot pages evenly across the MRAM banks.
According to our results, the best page placement policy with MRAM outperforms a conventional DRAM-based hierarchy by 15% across 16 memory-intensive benchmarks. We evaluate several page placement policies, and find that in near-term technology, minimizing L2 miss rates with uniform page distribution outweighs minimization of bank contention or routing delay. That balance will shift as cross-chip routing delays grow in future technologies, and as both wider-issue and CMP processors place a heavier load on the memory subsystem.
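
To make the three placement criteria above concrete, here is a minimal sketch of how a placement policy might score candidate banks when a new page is first touched. It is purely illustrative: the cost weights, the Bank fields, and the place_page function are assumptions for exposition, not the policies evaluated in this paper.

```python
# Illustrative sketch of a page-placement heuristic combining the three
# criteria above: routing latency, network congestion, and bank load.
# All weights and fields are assumed values, not the paper's evaluated policies.
from dataclasses import dataclass

@dataclass
class Bank:
    hops_from_cpu: int      # network distance from the processor to this bank
    recent_requests: int    # proxy for congestion on the route to this bank
    hot_pages: int          # frequently accessed pages already placed here

def place_page(banks, w_dist=1.0, w_congest=0.5, w_load=0.5):
    """Return the index of the bank with the lowest combined placement cost."""
    def cost(b):
        return (w_dist * b.hops_from_cpu +
                w_congest * b.recent_requests +
                w_load * b.hot_pages)
    best = min(range(len(banks)), key=lambda i: cost(banks[i]))
    banks[best].hot_pages += 1   # the newly placed page now loads this bank
    return best
```

Weighting distance heavily corresponds to criterion (1), while weighting congestion and load corresponds to criteria (2) and (3); the results summarized above suggest that, in near-term technology, the load-balancing term matters most.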
Finally, we compare our MRAM hierarchy against another emerging memory technology called chipstacked SDRAM. With this technology, a conventional DRAM chip is die-attached to the surface of a logic chip. This procedure enables the two chips to communicate through many more wires than can be found on conventional chip packages or even multi-chip modules (MCMs). Although the higher bandwidth is exploited in a manner similar to that of MRAM, the total I/O count is still substantially lower. Our preliminary evaluation of these two hierarchies shows that the MRAM hierarchy performs best, with the caveat that the parameters used for both are somewhat speculative.

Section 2 describes MRAM device operation and presents a model for access delay of MRAM banks and layers. Section 3 proposes an MRAM-based memory hierarchy, in which a single MRAM layer is broken into banks, cached by per-bank L2 banks, and connected via a 2-D switched network. Section 4 compares a traditional memory hierarchy, with an off-chip DRAM physical memory, to the MRAM memory hierarchy. Section 5 presents a performance comparison of the MRAM hierarchy to a chipstacked SDRAM memory hierarchy. Section 6 describes related work, and Section 7 summarizes our conclusions and describes issues for future study.

MRAM Memory Cells and Banks:
Magnetoresistive random access memory (MRAM) is a memory technology that uses the magnetic tunnel junction (MTJ) to store information. The potential for MRAM has improved steadily due to advances in magnetic materials. MRAM uses the magnetization orientation of a thin ferromagnetic material to store information, and a bit can be detected by sampling the difference in electrical resistance between the two polarized states of the MTJ. Current MRAM designs using MTJ material to store data are non-volatile and have unlimited read and write endurance. Along with its advantages of small dimensions and non-volatility, MRAM has the potential to be fabricated on top of a conventional microprocessor, thus providing very high bandwidth. The access time and cell size of MRAM memory have been shown to be comparable to those of DRAM. Thus, MRAM memory has attributes which make it competitive with conventional semiconductor memories.

MRAM Cell:
Figure 1 shows the different components of an MRAM cell. The cell is composed of a diode and an MTJ stack, which actually stores the data. The diode acts as a current rectifier and is required for reliable readout operation. The MTJ stack consists of two ferromagnetic layers separated by a thin dielectric barrier. The polarization of one of the magnetic layers is pinned in a fixed direction, while the direction of the other layer can be changed using the direction of current in the bitline. The resistance of the MTJ depends on the relative direction of polarization of the fixed and the free layer, and is minimum or maximum depending on whether the directions are parallel or anti-parallel. When the polarization is anti-parallel, the electrons experience an increased resistance to tunneling through the MTJ stack. Thus, the information stored in a selected memory cell can be read by comparing its resistance with the resistance of a reference memory cell located along the same wordline. The resistance of the reference memory cell
always remains at the minimum level. As the data stored in an MRAM cell are non-volatile, MRAMs do not consume any static power. Also, MRAM cells do not have to be refreshed periodically like DRAM cells. However, the read and write power for MRAM cells are considerably different, as the current required for changing the polarity of the cell is almost 8 times that required for reading. A more complete comparison of the physical characteristics of competing memory technologies can be found in the literature.

The MRAM cell consists of a diode, which currently can be fabricated using excimer laser processing on a metal underlayer, and an MTJ stack, which can be fabricated using more conventional lithographic processes. The diode in this architecture must have a large on-to-off conductance ratio to provide isolation of the sense path from the sneak paths. This isolation is achievable using thin film diodes. Schottky barrier diodes have also been shown to be promising candidates for current rectification in MRAM cells. Thus, MRAM memory has the potential to be fabricated on top of a conventional microprocessor in wiring layers. However, the devices required to operate the MRAM cells, namely the decoders, the drivers, and the sense amplifiers, cannot be fabricated in this layer and hence must reside in the active transistor layers below. Thus, a chip with MRAM memory will have area overhead associated with these devices. The data cells and the diodes themselves do not result in any silicon overhead since they are fabricated in metal layers. One of the main challenges for MRAM scalability is cell stability at small feature sizes, as thermal agitation can cause a cell to lose data. However, researchers are already addressing this issue, and techniques have been proposed for improving cell stability down to 100 nm feature sizes. Also, IBM and Motorola are already exploring 0.18 µm MRAM designs, and researchers at MIT have demonstrated 100 nm × 150 nm prototypes. While there will be challenges for design and manufacture, existing projections indicate that MRAM technology can be scaled and, with enough investment and research, will be competitive with other conventional and emerging memory technologies.

MRAM Bank Design:
Figure 2 shows an MRAM bank composed of a number of MRAM cells located at the intersection of every bit and word line. During a read operation, current sources are connected to the bit lines and the selected wordline is pulled low by the wordline driver. Current flows through the cells in the selected wordline, and the magnitude of the current through each cell depends on its relative magnetic polarity. If the ferromagnetic layers have the same polarity, the cell will have lower resistance and hence more current will flow through the cell, thus reducing the current flowing through the sense amplifiers. The current through the sense amplifiers is shown graphically in Figure 2, when the middle wordline is selected for reading. The bitline associated with the topmost cell experiences a smaller drop in current, as that cell has higher resistance compared to the other two cells connected to the selected wordline. This change in current is detected using the sense amplifiers,
and the stored data is read out. As the wordline is responsible for sinking the current through a number of cells, the wordline driver should be strong enough to ensure reliable sensing. Alternative sensing schemes have been proposed for MRAM which increase sensing reliability but also increase the cell area.

MRAM Bank Modeling:
To estimate the access time of an MRAM bank and the area overhead in the transistor layer due to the MRAM banks, we developed an area and timing tool by extending CACTI-3.0 and adding MRAM-specific features to the model. In our model, MRAM memory is divided into a number of banks which are independently accessible. The banks in turn are broken up into sub-banks to reduce access time. The sub-banks comprising a bank, however, are not independently accessible. Some of the important features we added to model MRAMs include:
1. The area consumed in the transistor layer by the devices required to operate the bank, including decoders, wordline drivers, bitline drivers, and sense amplifiers.
2. The delay due to vertical wires carrying signals and data between the transistor layer and the MRAM layer.
3. MRAM capacity for a given die size and MRAM cell size.
4. Multiple layers of MRAM with independent and shared wordlines and bitlines.
We used the 2001 SIA roadmap for the technology parameters at the 90 nm technology node. Given an MRAM bank size and the number of sub-banks in each bank, our tool computes the time to access the MRAM bank by computing the access time of a sub-bank and accounting for the wire delay to reach the farthest sub-bank. To compute the optimal sub-bank size, we looked at the designs of modern DRAM chips and made MRAM sub-bank sizes similar to current DRAM sub-bank sizes. We computed the access latency for various sub-bank configurations using our area and timing model. This latency is shown in Table 1. From this table it is clear that the latency increases substantially once we increase the sub-bank size beyond 8 Mb. We fixed 4 Mb as the size for the sub-banks in our system. Our area and timing tool was then used to compute the delay for banks composed of different numbers of sub-banks. We added a fixed 5 ns latency to the bank latency to account for the MRAM cell access latency, which is half the access time demonstrated in current prototypes. The 4 Mb bank latency was then used in our architectural model, which is described in the next section.

We used a single vertical MRAM layer in our evaluation. Future implementations with multiple MRAM layers will result in a much larger memory capacity, but will also increase the number of vertical buses and the active area overhead if the layers have to be independently accessed. It might be possible to reduce the number of vertical buses and the active area overhead by sharing either the wordlines or the bitlines among the different layers. Sharing bitlines among layers might interfere with reliable sensing of the MRAM cells. Evaluation of multiple-layer MRAM architectures is a topic for future research.
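
As a rough illustration of how the bank latency described above is composed, the sketch below adds the sub-bank access time, the wire delay to the farthest sub-bank, and the fixed 5 ns cell-access term. It is a minimal sketch, not the CACTI-based tool itself; the square sub-bank grid, the Manhattan-distance wire model, and all numeric parameters are assumptions.

```python
import math

def mram_bank_latency(num_subbanks,
                      subbank_access_ns,     # sub-bank access time from an area/timing model
                      wire_delay_ns_per_mm,  # horizontal wire delay across the bank
                      subbank_pitch_mm,      # assumed physical size of one sub-bank
                      cell_access_ns=5.0):   # fixed cell-access term, as in the text above
    """Toy composition of the bank latency: sub-bank access time plus the
    wire delay to the farthest sub-bank plus the fixed MRAM cell access term."""
    side = math.ceil(math.sqrt(num_subbanks))          # assume a square grid of sub-banks
    farthest_mm = 2 * (side - 1) * subbank_pitch_mm    # Manhattan distance to the far corner
    return subbank_access_ns + wire_delay_ns_per_mm * farthest_mm + cell_access_ns

# Example call (placeholder numbers, not outputs of the CACTI-based tool):
# mram_bank_latency(16, subbank_access_ns=3.0, wire_delay_ns_per_mm=0.5, subbank_pitch_mm=1.5)
```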

A Chip-Level MRAM Memory Hierarchy:
MRAM memory technology promises large memory capacity close to the processor. With global wire delays becoming significant, we need a different approach to managing this memory efficiently, to ensure low latency access in the common case. In this section, we develop a distributed memory architecture for managing MRAM memory, and use dynamic page allocation policies to distribute the data efficiently across the chip.

Basic MRAM System Architecture:
Our basic MRAM architecture consists of a number of independently accessed MRAM banks distributed across the chip. As described in Section 2, the data stored in the MRAM banks are present in a separate vertical layer above the processor substrate, while the bank controller and other associated logic required to operate the MRAM bank reside on the transistor layer. The banks are connected through a network that carries request and response packets between the level-1 (L1) cache and each bank controller. Figure 3 shows our proposed architecture, with the processor assumed to be in the center of the network. To cache the data stored in each MRAM bank, we break the SRAM L2 cache into a number of smaller caches, and associate each one of these smaller caches with an MRAM bank. The SRAM cache associated with each MRAM bank has low latency due to its small size, and has a high-bandwidth vertical channel to its MRAM bank. Thus, even for large cache lines, the cache can be filled with a single access to the MRAM bank on a miss. Each SRAM cache is fabricated in the active layer below the MRAM bank with which it is associated. The SRAM cache is smaller than the MRAM bank and can thus easily fit under the bank. The decoders, sense amplifiers, and other active devices required for operating an MRAM bank are also present below each MRAM bank. We assume MRAM banks occupy 75% of the chip area in the metal layer, and the SRAM caches and associated MRAM circuitry occupy 60% of the chip area in the active layer. Each node in the network has an MRAM bank controller that receives requests from the L1 cache and checks its local L2 cache first to see if the data are present in it. On a cache hit, the data are retrieved from the cache and returned via the network. On a cache miss, the request is sent to the MRAM bank, which returns the data to the controller and also fills its associated L2 cache. We model channel and buffer contention in the network, and also model contention for the ports associated with each SRAM cache and MRAM bank.

Factors influencing the MRAM design space:
The cost to access data in an MRAM system depends on a number of factors. Since the MRAM banks are distributed and connected by a network, the latency to access a bank depends on the bank's physical location in the network. The access cost also depends on the congestion in the network to reach the bank, and the contention in the L2 cache associated with the bank. Understanding the trade-offs between these factors is important to achieve high performance in an MRAM system.

Number of banks: Having a large number of banks in the system increases the concurrency in the system and ensures fast hits to the closest banks. However, the network traversal cost to reach the farthest bank also increases due to the increased number of hops. The amount of L2 cache associated with
each bank depends on the number of banks in the system. For a fixed total L2 size, having a larger number of banks results in a smaller L2 cache associated with each bank. However, the latency of each L2 cache is then lower because of its smaller size. Thus, increasing the number of banks in the system reduces cache and MRAM bank latency (because of the smaller bank size for a fixed total MRAM capacity), while increasing the potential miss rate in each individual L2 cache and the latency to traverse the network due to the greater number of hops.

Cache Line Size: Because of the potential for MRAM to provide a high-bandwidth interface to its associated L2 cache, we can have large line sizes in the L2 cache which can potentially be filled with a single access to MRAM on an L2 miss. Large line sizes can result in increased spatial locality, but they also result in an increase in the access time of the cache. Thus, there is a trade-off between increased locality and increased hit latency which determines the optimal line size when bandwidth is not a constraining factor. In addition, the line size has an effect on the number of bytes written back into an MRAM array, which is important due to the substantial amount of power required to perform an MRAM write compared to a read.

Page Placement Policy: The MRAM banks are accessed using physical addresses, and the memory in the banks is allocated at a page granularity. Thus, when a page is loaded, the operating system can assign it a physical page frame in any MRAM bank, and thereby control where on the chip the page resides.

MRAM Latency Sensitivity:
To study the sensitivity of our MRAM architecture to MRAM bank latency, we examine the performance of our MRAM system with increasing bank latencies. The mean performance of the system across our set of benchmarks for different bank access latencies is shown in Figure 8. The horizontal line represents the mean IPC for the conventional SDRAM system. As can be seen from this graph, the performance of our architecture is relatively insensitive to MRAM latency and breaks even with the SDRAM system only at MRAM latencies larger than the off-chip SDRAM latency. This phenomenon occurs because the higher bandwidth of the on-chip MRAM hierarchy compensates for the increased bank access latency.
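
To make the access path underlying these trade-offs concrete, the following sketch summarizes the per-node behaviour described in the Basic MRAM System Architecture subsection above: the controller probes its local L2 bank and falls back to the MRAM bank stacked above it on a miss. It is an illustrative sketch only; the node object and method names (l2_lookup, mram_read, l2_fill, mram_write) are assumed placeholders, not the simulator's actual interfaces.

```python
def handle_request(node, addr):
    """Illustrative request handling at one MRAM-bank node of the distributed
    hierarchy: probe the local L2 bank, and on a miss read the MRAM bank
    stacked above it, filling the L2 before replying over the network."""
    line = node.l2_lookup(addr)        # small SRAM L2 bank under this MRAM bank
    if line is not None:
        return line                    # L2 hit: reply via the on-chip network
    line = node.mram_read(addr)        # L2 miss: wide vertical access to the MRAM bank
    victim = node.l2_fill(addr, line)  # install the line; may evict a victim
    if victim is not None and victim.dirty:
        node.mram_write(victim.addr, victim.data)  # writebacks cost more power than reads
    return line
```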
Cost of Writes: Writes to MRAM memory consume more power than reads because of the larger current needed to change the polarity of the MRAM cell. Hence, for a low-power design, it might be better to consider an architecture which minimizes the amount of data written back into the MRAM banks from the L2 cache. In Table 4 we show the total number of bytes written back into MRAM memory for a 100-bank configuration with different line sizes. We show only a subset of the benchmarks, as all the benchmarks show the same trend. From Table 4 we can see that the total volume of data written back increases with increasing line size. We found that even though the number of writebacks decreases with larger line sizes, the amount of data written back increases, because the decrease in the number of writebacks is more than offset by the increased line size. Thus, there is a power-performance trade-off in an MRAM system, as larger line sizes consume more power but yield better performance. We are currently exploring other mechanisms, such as sub-blocking, to reduce the volume of data written back when long cache lines are employed.

Conclusions:
In this paper, we have introduced and examined an emerging memory technology, MRAM, which promises to enable large, high-bandwidth memories. MRAM can be integrated into the microprocessor die and avoids the conventional pin bandwidth limitations found in off-chip memory systems. We have developed a model for simulating MRAM banks and used it to examine the trade-offs between line size and bank number to derive the MRAM organization with the best performance. We break down the components of latency in the memory system, and examine the potential of page placement to improve performance. Finally, we have compared MRAM with conventional SDRAM memory systems and another emerging technology, chipstacked SDRAM, to evaluate its potential as a replacement for main memory. Our results show that MRAM systems perform 15% better than conventional SDRAM systems and 30% better than stacked SDRAM systems.

An important feature of our memory architecture is that the L2 cache and MRAM banks are partitioned. This architecture reduces miss conflicts in the L2 cache and provides high bandwidth when multiple L2s are accessed simultaneously. We studied MRAM systems with perfect L2 caches and perfect networks to understand where performance was being lost. We found that the penalty of cache conflicts in the L2 cache and the network latency had widely varying effects among the benchmarks. However, these results did show that page allocation policies in the operating system have great potential to improve MRAM performance.

Our work suggests several opportunities for future MRAM research. First, our partitioned MRAM memory system allows page placement policies for a uniprocessor to consider a new variable – proximity to the processor. Allowing pages to dynamically migrate between MRAM partitions may provide additional performance benefit. Second, the energy use of MRAM must be characterized and compared to alternative memory technologies. Applications may have quite different energy use given that the energy required to write an MRAM cell is greater than that to read it. In addition, the L2 cache line size has a strong effect on the amount of data written to the MRAM and may be an important factor in tuning systems to use less energy. Third, since MRAM memory is non-volatile, its impact on system reliability over conventional memory should be measured. Finally, our uniprocessor simulation does not take full advantage of the large bandwidth inherent in the partitioned MRAM. We expect that chip multiprocessors will have additional performance gains beyond the uniprocessor model studied in this paper.

References:

[1] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 248–259, June 2000.

[2] V. Agarwal, S. W. Keckler, and D. Burger. The effect of technology scaling on microarchitectural structures. Technical Report TR2000-02, Department of Computer Sciences, University of Texas at Austin, Austin, TX, Aug. 2000.

[3] D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS parallel benchmarks. Technical Report RNR-91-002 Revision 2, NASA Ames Research Laboratory, Ames, CA, Aug. 1991.

[4] H. Boeve, C. Bruynseraede, J. Das, K. Dessein, G. Borghs, and J. D. Boeck. Technology assessment for the implementation of magnetoresistive elements with semiconductor components in magnetic random access memory (MRAM) architectures. IEEE Transactions on Magnetics, 35:2820–2825, Sep. 1999.
[5] P. N. Brown, R. D. Falgout, and J. E. Jones. Semicoarsening multigrid on distributed memory machines. Technical Report UCRL-JC-130720, Lawrence Livermore National Laboratory, 2000.

[6] R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum. Scheduling and page migration for multiprocessor compute servers. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 12–24, San Jose, California, 1994.

[7] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A performance comparison of contemporary DRAM architectures. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 222–233, May 1999.

[8] R. Desikan, D. Burger, S. W. Keckler, and T. M. Austin. Sim-alpha: a validated, execution-driven Alpha 21264 simulator. Technical Report TR-01-23, Department of Computer Sciences, University of Texas at Austin, 2001.
