
CHAPTER 6

EMBEDDED MEMORIES DESIGNS AND APPLICATIONS

6.1. EMBEDDED MEMORY DEVELOPMENTS

The memory technology for embedded memories has a wide variation, ranging
from small blocks of ROM and hundreds of kilobytes of cache RAM to high-density
(several megabits) DRAMs and small- to medium-density nonvolatile
memory blocks of EEPROM and flash memories. For memories embedded in
the logic process, the most important figure of merit is compatibility with the
logic process. In general, embedded ROM used for microcode storage has the
highest compatibility with the logic process; however, its application is rather
limited. Programmable logic array (PLA) or ROM-based logic is widely used, but
it is considered a special case of embedded ROM [1].
Embedded SRAM is one of the most frequently used memories embedded in
logic chips, and typical applications include on-chip buffers, caches, register
files, and so on. The standard six-transistor (6T) SRAM cell is also fairly
compatible with a logic process, unless there are special structures involved. The
bit density is not very high. Polysilicon resistor load (4T) cells provide higher
bit density, but at the cost of the process complexity associated with additional
polysilicon-layer resistors. Embedded DRAM (eDRAM), which provides high
density, is also becoming quite popular in combination with RISC
processors and other peripheral circuitry in system-on-chip (SOC) types of
applications for graphic accelerators and multimedia chips. Embedded EPROM,
EEPROM, and flash memory technologies require two to three or more
additional masking steps beyond the standard logic process. These are finding
applications in microcontrollers, field programmable gate arrays (FPGAs), and
complex programmable logic devices (CPLDs).


In a typical logic VLSI design, the on-chip memory integrated with other
circuitry may include anything from simple registers to caches of several
megabits. In special applications such as microprocessors, memory can
occupy more than 50% of the chip area. A processor in search of data or
instructions looks first in the first-level (L1) cache memory, which is closest to
the processor, and if the information is not found there, the request is passed
on to the second-level cache (L2). The integration of on-chip L1 improves both
the processor performance and bandwidth. The SRAM cells used in the L1
cache are usually larger than the ones for commodity SRAMs and, for highest
performance, fabricated with the same process as the processor logic. Thus,
an L1 cache tends to be faster than the L2 and L3 (the off-chip caches) and
often utilizes special SRAM processes that minimize cell area.
The major goal of cache memory design is to minimize the miss rate, that
is, the possibility that immediately needed bits of information are not available
in the nearest level of cache memory and have to be fetched from the higher
levels of caches or even the main memory. As processor performance has
increased, the wait time in idle memory cycles has also increased, which has
led to the so-called memory-processor performance gap, as illustrated in Figure
6.1. Therefore, the ability to integrate large memory close to the processor
(possibly, on-chip) helps remove some of the constraints of slow memory
access, allowing bus widths to increase beyond the conventional pin-limited
16-256 bits, even as high as 1024 bits [2].
Another trend that drives the integration of DRAMs into logic chips is the
memory granularity, which refers to the smallest increment by which a memory
size may be increased. For applications that require less than 2 Mb of memory,
an embedded SRAM would probably be more cost-effective and should be
considered first. In the 0.18-μm process generation, it is expected that the
embedded DRAM solution would become cost-effective above approximately
2-Mb density.

Figure 6.1 Illustration of performance levels of processor versus DRAM (relative performance, log scale, versus year, 1980-2000), over several generations [2].

The embedding of DRAM not only reduces power by eliminating the need for
off-chip drivers, but also allows for more active power-management systems.
In a DRAM manufacturing process, tight lithography and structural innovations
are combined to fabricate very dense cell arrays, and the focus is on the
minimization of cell area. Cell structure innovations include the buried trench
capacitor or the stacked capacitor structure. Furthermore, in DRAM technologies,
low leakage current levels are crucial, because the goal is to minimize the loss
of stored charge (increase its retention time), including the transistor off-current
as well as the junction leakage. Typically, the low off-current levels are
attained by using relatively high threshold voltages and longer channel lengths.
In fact, unlike in a logic process, channel lengths are significantly longer than the
minimum dimension (e.g., in a 0.25-μm process, the designed gate lengths may
be as long as 0.35-0.40 μm). Therefore, when the longer channels are combined
with the rather high threshold voltages, the device on-currents are low as
compared to those for devices fabricated on a same-generation standard
logic process.
In comparison to a DRAM process, logic technologies use gate-level minimum
dimensions to drive short channel lengths to optimize the circuit timing
and minimize the timing skew from channel length variations. The DRAM
processes prefer a single work-function gate material (usually n-type) for
submicron technologies, so that the p-channel FET operates as a buried-channel
device. The gate of a buried pFET device is more loosely coupled to
the channel it controls, and the performance is poorer as compared to a
surface-channel device. The logic technology process runs dual work-function
gates, n-poly gates for the nFETs and p-poly gates for the pFETs, to get the
highest performance at the cost of complexity.
In a given DRAM technology, the devices have a thicker gate dielectric than
would be consistent with constant-field scaling. For example, a 0.25-μm
DRAM process would typically have about an 8-nm gate oxide, whereas a 0.25-μm
logic process typically uses a 5-nm gate oxide to ensure maximum performance
at 2.5 V. In a DRAM process, to minimize cost, all devices (and not just the
transfer gates that are stressed with the boosted gate voltage) are fabricated
with the thicker gate oxide, even though it provides suboptimal performance.
The logic circuits often have self-aligned silicides (called salicides, in contrast
to the polycide process), in which the gate and diffusion areas are silicided
simultaneously. However, in DRAM technology, silicided diffusions are
avoided because they can increase the junction leakage, even though the gate
conductor usually has a silicide that is put on before the gate is defined.
The density of a DRAM chip is mostly dependent on the cell size or array
density, and its support circuits rarely push density limits and are less
strictly controlled. Thus, the performance compromises made by commodity
DRAM processes to reduce cost and increase yield result in a significant gap
in performance between the logic and DRAM technologies. A simple logic
circuit fabricated in a logic technology can be twice as fast as a comparable
circuit fabricated in a high-yield, commodity DRAM technology. As a result
of the tradeoffs, while the DRAM technologies are capable of dense features,
the devices have lower drive capability. Thus, to drive the next stages, the
devices have to be made larger (i.e., wider), and the indirect consequence is that
despite the high circuit density of a DRAM process, it lags about one-half
to one generation behind the logic technology process. The logic technology
may have up to seven or eight levels of high-performance wiring (interconnect)
as compared with two or three for the DRAMs, and it requires an
extremely planar surface at every level.
These relative tradeoffs between the two technologies have spawned arguments
regarding which technology should be preferred for embedded
DRAMs: fabricating memory in a logic-based technology versus fabricating
logic circuits in a DRAM-based technology.
An ASIC with embedded DRAM costs about 40% more than the same size
die with pure logic, and combining the two produces logic that is 10-30%
slower than pure logic in the same process geometry. Also, embedded
DRAM requires internal built-in self-test (BIST) structures and additional
testing, which can increase the time to market. Due to these disadvantages, the
embedded DRAM approach may not be appropriate for every design. The first
to use embedded DRAM technology were graphics, networking, and data
storage applications. Graphic controllers use embedded DRAM to boost
throughput, network controllers are replacing embedded SRAM with large
DRAM buffers, and disk drive controllers are embedding DRAM because
discrete devices take up too much board area. A number of traditional memory
manufacturers, such as Mitsubishi, Toshiba, and Samsung, offer
embedded DRAM technology for ASICs. Also, many logic process vendors
supply ASICs with embedded memory macros and intellectual property (IP)
cores.
A major advantage of embedding DRAM is the much lower power
consumption, because it eliminates the need for power-consuming driver
circuits and allows the circuits to operate at the chip's internal voltage, which is
typically below 2 V. Embedding DRAM improves system performance, because
the signals between the CPU and memory no longer suffer the delays from
the drivers and additional peripheral chip (or PCB) routing. However, the
major performance boost comes when the system is adapted to take full
advantage of the embedded DRAM capabilities. For example, by widening the
interface between the CPU and the DRAM array from the typical 64 bits to
256 bits, a 100-MHz DRAM array can provide a memory bandwidth of 3
Gbytes/s. Such wide interfaces are not practical with discrete designs, because
the package size on the ASIC would be prohibitively expensive. The embedded
array has no such limitation.
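As a rough check of this bus-width arithmetic, the sketch below (Python; the helper name is illustrative, not from the text) computes the peak transfer rate of a memory interface from its width and clock rate:

```python
def bandwidth_gbytes_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak transfer rate of a memory interface (illustrative helper)."""
    bits_per_second = bus_width_bits * clock_mhz * 1e6
    return bits_per_second / 8 / 1e9   # convert bits/s to GB/s

# A 256-bit interface at 100 MHz moves 256 * 100e6 / 8 bytes/s, about
# 3.2 GB/s -- consistent with the ~3 Gbytes/s figure quoted above.
print(bandwidth_gbytes_per_s(256, 100))   # 3.2
# The conventional 64-bit interface at the same clock manages only 0.8 GB/s.
print(bandwidth_gbytes_per_s(64, 100))    # 0.8
```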
Although DRAM is more complicated to work with than any other
available memory, it is popular as a discrete implementation because it is the
most compact memory technology available, and therefore the cheapest. The
size difference between a DRAM and an SRAM cell can be nearly tenfold. In a
0.25-μm process, a DRAM memory cell occupies 0.6-1.0 μm², whereas the
SRAM cell occupies 5-9 μm². Therefore, an embedded DRAM can allow an
ASIC to contain as much as 128 Mb in a 0.25-μm process.
Another advantage is noise reduction. The interconnect between the processor
and the memory carries some of the highest-frequency signals that generate
electrical noise, which is increasingly difficult to control as system clock
speeds increase. The fact that a memory bus contains many high-speed signals
routed together makes the problem worse. Therefore, bringing the signals
inside the ASIC by using the embedded DRAM approach reduces the difficulty of
controlling electrical noise on the board. Figure 6.2 compares (a) an on-chip
DRAM, which eliminates both the PCB traces and the two sets of I/O drivers
required, with (b) an ASIC plus discrete DRAM design approach [3]. The
embedding of DRAM can also improve the granularity of the ASIC itself,
because it allows the system designers to select an optimum size for their
system memory without any waste.
The embedded DRAM can reduce the system engineering effort and
therefore, possibly, the time to market. While the discrete DRAM has many
access modes, embedded DRAM has only one basic mode, which eliminates
the need for extensive architectural analysis to determine which type of DRAM
provides the best system performance. Despite these savings, there are some
drawbacks as well, such as the process costs, technical risks, and additional test
requirements of the embedded memory approach. Although embedded
DRAM lowers some of the system costs, the cost of the ASIC is much greater
and may override the system savings.

Figure 6.2 A comparison of (a) ASIC with embedded DRAM and (b) ASIC with PCB connection to a discrete DRAM [3].

An embedded DRAM process costs nearly 40% more to run than a logic
process, because it requires extra production steps (i.e., more masks). As a
result, an ASIC with on-chip DRAM is almost always more expensive than a
pure logic ASIC combined with a discrete DRAM.
Memory testing has several components that differ from logic testing
and need specialized tests, such as functional test patterns to detect
memory array pattern-sensitivity faults and data retention measurements
under worst-case refresh timings and operating temperature extremes.
While these tests are routinely performed by the commodity DRAM manufacturers,
the ASIC manufacturers may not be equipped to perform this complex
testing on the embedded DRAMs. Therefore, the embedded DRAMs in a logic
process may require a different approach, such as design for testability
(DFT) and built-in self-test (BIST) techniques. These DFT and BIST techniques
for embedded memories were discussed in Semiconductor Memories,
Chapter 5. The addition of testability requirements to embedded memories
may increase the time to market, as well as the system cost.
A key advantage of the embedded memory approach is the higher packaging
density and board space saving, which is a very desirable feature for
notebook computers, mobile computing, and portable communication devices.
In the conventional multichip memory approach, interconnections require
large I/O buffers to overcome package- and board-level trace impedances. The
resultant increased power means limited battery life and often reduced reliability.
It is estimated that a graphic controller with embedded DRAM consumes
roughly 500-750 mW, which is roughly 25% of the power of its multichip
alternative, which consumes approximately 2.5 W.
The transistor gate oxide thicknesses differ for the logic and memory cell
processes. The standard logic transistors have thin oxides with low turn-on
threshold voltages to minimize switching time and maximize performance. In
contrast, many memory processes require thicker gate oxide transistors that
have high turn-on threshold voltages to minimize off-transistor leakage
current, which is a prime determinant of the required DRAM refresh frequency.
Also, thicker oxides improve data retention characteristics for EPROM,
EEPROM, and flash memory, because they can help memory cells withstand
the effects of high voltages and corresponding electrical field stresses during
repeated programming and erase cycles.
For its design processes larger than 0.25 μm, Mitsubishi uses normal thin
gate oxide logic transistors for fabricating the DRAM memory cells as well,
depending on the lower operating voltages to reduce leakage current. However,
at 0.25-μm and smaller processes, Mitsubishi's HyperDRAM process uses
dual oxide thicknesses and adds three processing steps. Standard logic and
HyperDRAM share the same metal pitch, and therefore the same logic-layout
libraries, but have different logic timing. Mitsubishi's triple-well approach
isolates the DRAM substrate from the bias and injected noise originating in
the logic and standard SRAM circuits, and also contains any required DRAM
substrate bias (see Figure 6.3) [4].

Figure 6.3 Mitsubishi embedded DRAM triple-well process cross section [4].

Embedded memory density depends on the internal bus width (including
optional parity support), the array aspect ratio, the number of memory macros,
and the logic gate count.
In addition to providing embedded DRAM for ASIC applications, Mitsubishi
leverages its eDRAM processes to build the 3D-RAM graphics chip
that combines 1.25 Mbytes of DRAM with a high-performance ALU, and also
combines the M32R/D 32-bit RISC CPU with 2 Mbytes of DRAM. Mitsubishi
has ported its 32-bit M32R CPU not only to the embedded DRAM with the
M32R/D, but also to flash memory with the M32R/E family.
Another example is Samsung Semiconductor, which supplies merged
DRAM and logic on MDL90, a 3.3-V, 0.35-μm process with three-layer or
four-layer metal. This is the same process that the company uses for
its 500-MHz Alpha 21164 CPU. It supports as much as 24 Mbits of extended
data-out (EDO) DRAM or SDRAM. A four-well approach ensures isolation
between the logic and DRAM subsections. Other vendors, such as Atmel,
Hyundai, Lucent Technologies, Motorola, and Texas Instruments, offer
embedded flash memory fabricated on an EEPROM process. EEPROM
variants, although they have more complex cell structures than the NOR
alternatives, offer efficient programming and erasing that minimizes the
required size of on-chip charge pumps for scaling down to low-voltage operation,
and, unlike NAND flash memory, are appropriate for both code and data
storage. Some companies are also planning to introduce FRAM as part of their
embedded memory portfolios. However, FRAM has a more complex manufacturing
process because of its specialized capacitor-dielectric material structure.
The selection of the embedded memory approach, as compared to discrete
DRAM-based memory systems, has advantages as well as disadvantages in
three areas of consideration, as follows:

1. On-Chip Memory Interface


• The replacement of the off-chip drivers with smaller on-chip drivers can
reduce power consumption significantly, because large board wire capacitive
loads are avoided. For example, a system that needs a 4-Gbyte/s
bandwidth and a bus width of 256 bits, built with discrete SDRAMs
(16-bit interface at 100 MHz), would require about 10 times the power of
an eDRAM with an internal 256-bit interface. However, even though the
use of eDRAM may reduce the overall power consumption of the system, the
power consumption per chip may increase, and therefore the junction
temperature may increase, which can affect the DRAM retention time.
• Embedded DRAMs can achieve much higher clock frequencies than the
discrete SDRAMs, because the chip interface can be 512 bits wide (or even
higher), whereas the discrete SDRAMs are limited to 16-64
bits. It is possible to make a 4-Mb eDRAM with a 256-bit interface,
whereas it would require 16 discrete 4-Mb chips (organized as 256K × 16)
to achieve the same width, and the granularity of such a discrete system
is 64 Mb (overcapacity for an application that needs, say, 4 Mb of
memory); this arithmetic is sketched after the list.
• In eDRAMs, interconnect wire lengths can be optimized for a given
application, which can result in lower propagation delays and higher
speeds. In addition, noise immunity is enhanced.
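As a rough illustration of the granularity arithmetic in the list above (the chip organization comes from the example; the helper name is an illustrative choice):

```python
def discrete_granularity_mb(required_bus_bits: int, chip_interface_bits: int,
                            chip_density_mb: int) -> int:
    """Smallest memory increment when a wide bus must be built from discrete chips."""
    chips_needed = required_bus_bits // chip_interface_bits  # chips in parallel
    return chips_needed * chip_density_mb

# A 256-bit interface built from 4-Mb chips organized as 256K x 16 needs
# 256/16 = 16 chips, so the minimum discrete system size is 64 Mb ...
print(discrete_granularity_mb(256, 16, 4))   # 64
# ... even if the application needs only 4 Mb, which a single eDRAM macro
# with an internal 256-bit interface could supply directly.
```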

2. System Integration

• DRAM transistors are optimized for low leakage currents, yielding lower
transistor performance, whereas the logic transistors are optimized for
high saturation current, yielding high leakage currents. If this compromise
is not acceptable, then extra manufacturing steps must be added, at extra
cost.
• Higher system integration saves board-level space, reduces pin count, and
yields a better form factor. Pad-limited designs may be turned
into non-pad-limited ones with an eDRAM approach. However, more
expensive eDRAM packages may be required. Some sort of external
memory interface is still needed in order to test the embedded memory.
• The eDRAM process adds another technology for which libraries must be
developed and characterized, macros must be ported, and designs must be
debugged and then optimized.

3. Memory Size

• In the eDRAM approach, the memory sizes can be customized. However,
the downside is that the memory system designer must know the
exact system memory requirements at the time of design. Later extensions
are not possible because there is no external memory interface.

In general, the selection of the eDRAM approach should be considered if the
product volume and projected lifetime are fairly high, or if the eDRAM is
required for higher performance and higher bandwidth. From a system
designer's point of view, eDRAM provides capabilities of (a) customizing the
memory size to precise system requirements, (b) adapting the memory bus
interface to the system requirements, and (c) optimizing the memory structure
(page length, number of banks, word width) to the system requirements [5].

Embedded nonvolatile memory increases the cost of the ASIC fabrication
process. Most CMOS logic processes require 16 to 18 mask layers, and adding
a nonvolatile memory such as flash increases the number of required mask
steps. There are other penalties of embedded nonvolatile memory, depending
upon the type of memory usage. For example, an ASIC design with embedded
ROM used for code storage may make any subsequent design changes
more difficult, if needed [6].
The use of embedded flash memory to hold program information eliminates
the possibility that a program change will affect the ASIC design. However, it
may complicate the design's scaling to smaller process geometries of the next
generation's processes. Embedding flash memory in ASIC design has another
hidden penalty. To use it effectively, one may need to acquire two pieces of
intellectual property (IP): the flash memory design and the programming/erase
algorithms.
Section 6.2 will discuss cache memory designs, and Section 6.3 will cover the
embedded SRAM/DRAM designs and architectures.

6.2. CACHE MEMORY DESIGNS

In a PC system, the major goal of using a buffer or a cache memory is to
increase the DRAM subsystem performance by reducing the latency and
increasing bandwidth. A typical cached system consists of a standard memory
hierarchy comprising on-chip cache (L1), off-chip cache (L2), and fast page
mode or EDO DRAMs. The cacheless system uses either EDO DRAMs, fast
page mode DRAMs, burst mode EDO DRAMs, or synchronous DRAMs.
In general, the memory hierarchy consists of five levels (Li): registers (L0),
cache (L1), main memory (L2), a disk storage device (L3), and backup units
such as magnetic tapes or optical drives (L4). There are five parameters that
help characterize these levels [7]:

• Access time (t[i]). The total time that it takes the CPU to access level Li
of the memory hierarchy.
• Memory size (s[i]). The number of bytes in level i of the memory
hierarchy.
• Cost per byte (c[i]). The cost of level Li is usually estimated as the cost
per byte, or the product of c[i] and s[i].
• Transfer bandwidth (b[i]). The rate at which the data are transferred
over time between the various levels.
• Unit of transfer (x[i]). The grain size for data transfer.

As a general rule, the memory devices at a higher level have faster access
times, are smaller in size, have a higher cost per byte and higher bandwidth, and
use a smaller unit of transfer compared to those at a lower level.

There are three basic concepts associated with the memory hierarchy:
inclusion, coherence, and locality. The inclusion property implies that all
information items are originally stored in the outermost level. During the
execution of the processes, subsets of the total stored data and instructions
move up the hierarchy as they are used (e.g., data in L1 move to L0 for
execution). As the data age, they move down the hierarchy. A word miss occurs
when a word is searched for in level Li but is not found. If a word miss occurs
in level Li, it also means that the word miss occurred in all levels above Li.

Another concept associated with inclusion is the method of information
transfer between two levels of the hierarchy. The CPU and cache communicate
through words (typically, 8 bytes each). Due to the inclusion principle, the
cache block must be bigger than the size of the memory word and is
typically 32 bytes. The cache and the main memory communicate through
blocks. The main memory is divided into pages (typically 4096 bytes) that are
the units of information transfer between the disk and the main memory. At
the final level of the memory hierarchy, the pages of the main memory are stored
as segments, sectors, and tracks in the backup storage device.
The coherence property requires that the copies of the same information
item at lower levels of the memory hierarchy be consistent. This implies that if
a word is modified in the processor, copies of that word must be updated either
immediately (write-through method) or eventually (write-back method) at all
lower levels of the memory hierarchy.
Most computer programs are both highly sequential and highly loop-oriented,
which requires the cache to operate on the principle of spatial and
temporal locality of reference [8]. Spatial locality means that information the
CPU will reference in the near future is likely to be logically close in main
memory to the information being referenced currently. Temporal locality means
that the information the CPU is referencing currently is likely to be referenced
again in the near future. By using these concepts, it is possible to design a cache
that makes it highly probable that the CPU references are located in the cache.
The principle of locality allows a small high-speed memory to be effective by
storing only a subset (recently used instructions and data) of the main memory.
The spatial locality of reference is addressed in the following manner. If a
cache miss occurs (i.e., if the cache does not contain the information requested
by the CPU), then the cache accesses main memory and retrieves the
information being requested along with information in several additional
locations that logically follow the current reference. This set of information is
called a block, or cache line. The next CPU reference now has a statistically
higher probability of being serviced by the cache, thus avoiding main memory's
relatively long access time.
The temporal locality of reference is addressed by allowing information to
remain in the cache for an extended period of time, so that the stored line is
replaced only to make room for a new one. By allowing information to remain
in the cache and using a cache of sufficient size, it is possible to fit an entire
loop of code into the cache, thereby enabling very high speed execution of
instructions in the loop. Therefore, the goal of a cache design is to reduce the
effective memory access time, from a CPU point of reference. When the CPU
finds the item that is addressed in the cache, it is called a "hit," whereas if the
item is not found in the cache, it is called a "miss." The probability that the
required data are in the cache is called the "hit rate," and the probability that
the required data are not in the cache is called the "miss rate" (1 − hit rate).
On a "miss," the processor must go to the next level of cache or to the main
memory for the data.
In a cached system, the effective main-memory access time (t_eff) is given by

t_eff = t_cache + m × t_main

where

t_cache = effective hit time of the cache (the cache access time)
t_main = main-memory access time
m = miss rate of the cache (available from standard tables in
cache design references)
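A minimal worked example of this formula, using illustrative timing numbers (neither the function name nor the values come from the text):

```python
def effective_access_time(t_cache_ns: float, miss_rate: float,
                          t_main_ns: float) -> float:
    """t_eff = t_cache + m * t_main, as given above."""
    return t_cache_ns + miss_rate * t_main_ns

# Illustrative numbers only: a 5-ns cache with a 5% miss rate in front of
# 60-ns main memory gives an effective access time of 8 ns.
print(effective_access_time(5.0, 0.05, 60.0))   # 8.0
```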

The highest level of the hierarchy is the first-level cache, which supplies data to
the processor at a rate that requires no delays in the processor operation. To
accomplish this, the first-level cache must be very close to the processor so that
there are no transmission-line delays, the bus between the cache and the
processor must be wide enough to supply the data at the rate required, and
there must be no delay in the interfaces between the two chips. In addition, it
must also contain the data that the processor wants. For example, if the bus
width of memory is 64 bits, to supply a data rate of 1.6 GB/s it needs to run
at 200 MHz, whereas if it is 128 bits wide, it needs to run at only 100 MHz.
However, an SRAM with 128 bits running at 100 MHz will have significant
delays due to transmission effects crossing the interfaces, unless careful
measures are taken to provide proper termination. Also, a memory with 128 I/Os
will be a large chip in a large package.
These are some of the basic cache organizations and operating modes:

• Unified Cache. Both instructions and data are cached in the same
SRAM memory.
• Burst Cache. A type of synchronous cache, which is about 30-50% faster
than the asynchronous cache and about 50% more expensive. In burst
mode, several bits of data are selected using a single address, which is
incremented using an on-chip counter. Both flow-through and pipelined
SRAMs may have the burst feature.
• Synchronous Burst Cache. SyncBurst SRAMs use an input clock to
synchronize the device to the system clock. This allows for short setup and
hold times for the data, address, and control signals. Both the pipelined
and flow-through versions have input registers, which allow this mode.

• Pipelined Burst Cache. This is the same as the synchronous burst cache,
except that the pipelined device has output registers for data, which are not
incorporated into the flow-through device. This allows the pipelined
device to operate with a much faster cycle time than the flow-through
version.

In general, the four different aspects of cache organization are the cache size,
mapping method, line size, and whether the cache is combined or split (i.e.,
whether it integrates caches for data and instructions or separates them). The
information that the processor wants normally falls into instructions and data.
The two types of information can be stored in one cache or two. A system can
have a split cache, with one for instructions and one for data. The two caches
can have different structures to optimize their function. A cache that contains
both instructions and data is called a "unified cache." The data and instruction
caches for a first-level cache (L1) on the processor chip are usually designed to
optimize the performance of each. They may have different levels of associativity
and different widths. Similarly, separate data and instruction caches can be
used for an external L2 cache. These can be on two different chips or the same chip.
The type of mapping used affects the cache's hit time and miss rate, and
usually a mapping that reduces the miss rate exacts a penalty on the cache hit
time. Different cache memory architectures have different levels of a property
called associativity. The hit rate of the cache can be improved by increasing its
associativity. The most widely used mapping schemes are based on the
principle of associativity. A direct-mapped cache allows any specific location
in main memory to be mapped to only one location in the cache. This has the
lowest level of associativity and is the least complex caching scheme,
implementing a one-way, set-associative design. A fully associative cache is called
a content-addressable memory (CAM), discussed in Chapter 2, Section 2.8.3.
In a direct-mapped cache, every location in the main memory maps to a
unique location in the cache. For example, location 1 in the main memory
maps to location 1 in the cache; location 2 in the main memory maps to
location 2 in the cache; location m in the main memory maps to location m in
the cache; location m + 1 in the main memory again maps to location 1
in the cache; and so on. Figure 6.4a shows an example of such a cache that
consists of a data memory, a tag memory, and a comparator. The data memory
contains the cached data and instructions, and its size defines the cache size.
The cache tag memory is the directory of the cache, containing information
about where in the main memory the data stored in the cache originated, and
it uses the comparator to determine whether the cache contains the line being
addressed by the CPU.
A direct-mapped cache has two critical timing paths: (1) the read data path
through the data memory to Data Out and (2) the tag memory path to Match
Out.

Figure 6.4 (a) Direct-mapped cache organization. (b) A two-way set-associative cache [8].
The slower of these two paths determines the number of bus clock cycles
for the first access; the read data path determines the number of bus clock cycles
for the rest of the burst. In a direct-mapped cache, the address is split into three
fields called the tag (the higher-order address bits), index, and word
offset. The bits of the index field address the tag memory to see if the line being
accessed is the line wanted by the CPU. The tag in the location addressed by
the index field is compared against the upper-order bits (the tag field) of the
address. In parallel with the tag access and comparison, the index and
word-offset bits address the data memory, and the accessed word is placed in
the Data Output Buffer. If the tags match and the status bits are correct, the
cache asserts the Match Out signal, indicating that the information retrieved
from the data memory is correct (a cache hit). If the tags do not match or the
status bits are not correct, Match Out is deasserted, which shows the data
retrieved to be invalid (i.e., a cache miss), and the correct data must be retrieved
from the main memory.
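The following sketch models this tag/index/offset flow for a hypothetical direct-mapped cache; the line size, line count, and function names are illustrative choices, not the organization of any particular device:

```python
LINE_SIZE = 16        # bytes per cache line -> 4 word-offset bits (assumed)
NUM_LINES = 256       # lines in the data memory -> 8 index bits (assumed)

tag_memory = [None] * NUM_LINES    # the cache "directory"

def split_address(addr: int):
    """Split a byte address into (tag, index, word offset)."""
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_LINES
    tag = addr // (LINE_SIZE * NUM_LINES)
    return tag, index, offset

def lookup(addr: int) -> bool:
    """Return True on a cache hit (Match Out asserted), False on a miss."""
    tag, index, _ = split_address(addr)
    if tag_memory[index] == tag:
        return True                 # hit: the word from the data memory is valid
    tag_memory[index] = tag         # miss: line is fetched and the tag updated
    return False

print(lookup(0x1234))   # False (cold miss)
print(lookup(0x1238))   # True  (same line: same tag and index)
```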
The most complex cache is a fully associative cache, which allows any
location in the main memory to be mapped to any location in the cache. An
n-way set-associative cache (typically n = 1, 2, 4, 8, ...) allows any specific
location to be mapped to n locations in the cache. Figure 6.4b shows a
two-way, set-associative cache that has a performance equivalent to a direct-mapped
cache twice its size.
Cache size has the single largest influence on the miss ratio. In general, a
larger cache size has a lower miss ratio, but increasing the cache size
beyond a certain optimum limit can actually result in a performance decrease.
A cache size of 256 to 512 kilobytes is considered large enough to allow a cache
to reach a 98% hit rate. The cache line size, typically an even binary amount
such as 16, 32, 64, or 128 bytes, is the basic unit of information transfer between
the cache and main memory. It ranks second behind cache size as the
parameter that most affects cache performance.
In some cases, the small L1 cache may not have a high enough probability
of having the needed data to keep the processor fed at the required rate. In this
case, another level of cache (L2) is added off the processor chip, which is larger
and runs at a speed between that of the first-level cache and the main memory.
Because the L2 cache is now external, it can be larger than an on-chip cache,
but it still needs measures taken to increase the probability that the
data the processor wants next, which are not found in the L1, are available in the
L2. This implies that some amount of logic must be added to the L2 SRAM
to make it an effective cache memory.

PC designs widely use an L2 cache in a direct-mapped configuration
consisting of 256 KB (32K × 64) of data RAM and 8 KB (8K × 8) of tag
RAM. The cache data RAM may consist of asynchronous RAMs (lower cost)
or synchronous burst RAMs, for higher performance. Some manufacturers
offer PCs in which the L2 cache has been eliminated, and the systems rely on
EDO DRAM to recover some of the lost performance. The chipsets for
Pentium systems support either a write-through or write-back mode of
operation.
In a write-through cache, all writes to the cached locations are
immediately written by the CPU through the cache directly to the DRAM
main memory. The write-through is simpler to implement (than write-back)
and uses a less complex cache controller. However, the write-through approach
can reduce performance, because the CPU usually must be held, pending
completion of the write operation. The write-through can also cause problems
because of the increased CPU bus traffic.
A variation on the write-through is buffered write-through, where the data
are first written to a "write buffer," which frees the cache to service a read
request before finishing the write. This is useful for very slow DRAM. In posted
write-through, the write to DRAM is delayed until the bus is free. The
write-back scheme updates the cache only on CPU store cycles and updates
the main memory only when it becomes necessary to replace a modified
("dirty") line in the cache. This greatly reduces the CPU bus traffic and makes
the write-back scheme widely used in server applications, especially in
multiprocessor-based systems, in which the performance is limited by the available
memory bus bandwidth. However, the write-back scheme does increase the
complexity of the cache controller.
In a copy-back operation, the data written into the cache by the processor
are not written into the main memory until it is time for that line of data to be
replaced in the cache. This technique entails the use of a flag bit, which can be
set to indicate whether that location in the cache has been written into. When the
location has been written into and the flag bit is set, the location is called
"dirty." The word at that location must be written into main memory before
it is destroyed. If the flag bit is not set, then the data at that location of the
cache are the same as at the corresponding address in main memory, and the
cache data can be destroyed.
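A minimal sketch of the copy-back policy just described, with a dirty flag bit per line; the class and field names are illustrative, not any controller's actual implementation:

```python
class WriteBackLine:
    def __init__(self):
        self.tag = None
        self.data = None
        self.dirty = False   # the "flag bit" described above

    def write(self, tag, data):
        # A CPU store updates only the cache and marks the line dirty.
        self.tag, self.data, self.dirty = tag, data, True

    def evict(self, main_memory: dict):
        # The dirty word must reach main memory before it is destroyed.
        if self.dirty and self.tag is not None:
            main_memory[self.tag] = self.data
        self.tag, self.data, self.dirty = None, None, False

main_memory = {}
line = WriteBackLine()
line.write(tag=0x40, data="new value")   # no memory bus traffic yet
line.evict(main_memory)                  # the write-back happens only here
print(main_memory)                       # {64: 'new value'}
```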
In a multilevel cache system, it is important to reduce the overhead required
to maintain consistency among the L1 cache, L2 cache, and main memory. The
tradeoff is one of cache controller complexity and the amount of bus
bandwidth, versus cost; the L1-L2 cache coherency strategy can result in a
±15% cache system performance differential. The performance improvements
brought by an L2 cache depend on three main factors: the type of DRAM (fast
page mode or extended data out); the processor-clock-to-bus-clock ratio;
and the type of application.
A cache's hit rate can be increased by tailoring it specifically to the system.
For example, for an L1 cache, the data and instruction caches can be separate.
If they each maintain the same bus width, the bandwidth of the cache is
doubled. In addition, the two can have different configurations and bus widths
to provide exactly the speed and bus width required by that part of the
processing system. Any one of these can be direct-mapped or partially or fully
associative. They can be single- or multiple-ported, and they can be write-through
or write-back.
In expressing the access time of a memory, the notation (X-Y-Y-Y) is used,
which denotes the number of clocks required to access the first word from
memory (X), followed by three subsequent (Y) accesses to complete the
four-word on-chip (L1) cache line fill. For example, an ideal access for a 486
or Pentium processor chip would be (2-1-1-1), a typical asynchronous (L2)
cache access is (3-2-2-2), and a fast page mode DRAM access is (5-3-3-3) for a
page hit and (9-3-3-3) for a page miss. Because EDO DRAMs overlap the
CPU's data hold time with the CAS precharge time, they can return burst
accesses one clock cycle faster than the page mode DRAMs. A typical EDO
DRAM access is (5-2-2-2) for a page hit and (9-2-2-2) for a page miss [9].
The time required to retrieve the first word from memory (the latency)
is usually longer than the subsequent access times. The concept of latency is very
critical in noncache system designs. Thus, in characterizing memory subsystem
performance, the key attributes are (a) the number of total clocks (X + Y + Y + Y)
used to retrieve the data from the memory subsystem and (b) the latency in
accessing the first word (X).
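A small sketch that tallies these two attributes for the access patterns quoted above (the pattern labels are illustrative):

```python
def burst_clocks(pattern: tuple) -> int:
    """Total clocks X + Y + Y + Y for a four-word L1 cache line fill."""
    return sum(pattern)

# Access patterns quoted in the text; the first element is the latency (X).
for name, pattern in [("ideal 486/Pentium access", (2, 1, 1, 1)),
                      ("async L2 cache", (3, 2, 2, 2)),
                      ("FPM DRAM page hit", (5, 3, 3, 3)),
                      ("EDO DRAM page hit", (5, 2, 2, 2))]:
    print(f"{name}: latency {pattern[0]} clocks, "
          f"total {burst_clocks(pattern)} clocks")
```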
Another important factor in evaluating memory performance involves hits
and misses. For caches, a hit means that the information requested by the
processor is in the cache and is subject to retrieval in the SRAM access time. For
DRAMs, a hit means that the information is in the current page and can be
retrieved in the CAS access time. A miss means that the information is not
within the current page, forcing a RAS/CAS cycle to load the page, and latency
is significantly increased due to the length of this cycle.
Second-level (L2) caches enhance system performance by reducing the miss
penalty of the L1 cache in the processors, so that the CPU can operate with
minimum delay. The L2 access time equates to the L1 miss penalty. Therefore,
the L1 miss penalty (or access time) can be expressed by

L1 miss penalty = (L2 hit rate × L2 hit timing) + (L2 miss rate × L2 miss penalty)
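A worked example of this expression, assuming illustrative L2 hit rate and timing values (none of these numbers come from the text):

```python
def l1_miss_penalty(l2_hit_rate: float, l2_hit_clocks: float,
                    l2_miss_penalty_clocks: float) -> float:
    """L1 miss penalty = (L2 hit rate x L2 hit timing)
                       + (L2 miss rate x L2 miss penalty)."""
    l2_miss_rate = 1.0 - l2_hit_rate
    return l2_hit_rate * l2_hit_clocks + l2_miss_rate * l2_miss_penalty_clocks

# Illustrative numbers only: a 90%-hit L2 with a 4-clock hit time in front
# of a 30-clock main-memory penalty gives an L1 miss penalty of 6.6 clocks.
print(l1_miss_penalty(0.90, 4, 30))   # 6.6
```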
The maximum clock speed of a CPU is set by two simple factors: (1) gate
propagation delay and (2) the number of gates in the critical path. The data or
instruction set must travel around a circular path in time to be clocked into a
register during the next clock cycle. Pipelined systems are designed to accelerate
this process (except in certain cases) by adding registers at intermediate
points within the circle and clocking at a faster rate. The minimum clock cycle
time of any CPU is set by the maximum-delay path between register
clocks. In complex instruction-set computer (CISC) processors, the additional
complexity of a larger instruction set adds more gates (one at a time) to this
critical path and, therefore, more delay [10].
In the reduced instruction-set computer (RISC) processor, the approach is
to find the optimally small instruction set, based upon concurrently reducing
both the instruction set's size and the cycle time of the CPU. An objective of a
RISC CPU design is that the most frequently needed instructions are optimized
to run as fast as possible, at the loss of complexity in the instruction set.

The major goal of the RISC architecture is the ability to execute one
instruction every cycle. RISC architectures focus more attention on registers
and have register-rich CPUs. This means that the memory compilers used to
support these CPU designs will focus a lot more attention on turning frequent
memory accesses into register accesses instead, so that the external data
transfers are significantly reduced in comparison with the number of
instruction calls.
The processor architects have to trade off the speed of their designs versus
the feature set offered. If the one-cycle-per-instruction goal is to be approached,
the instruction, the operands, and the destination must all be accessible to the CPU at
the same time, rather than being accessed sequentially. One site at which some
large delays can arise is within the memory management unit (MMU). The two
major architectures for processors are referred to as the Von Neumann and
Harvard. The Von Neumann machine has a single address space, any portion
of which can be accessed as either instructions or as data. A Von Neumann
machine must fetch an instruction, then load or store an operand in two
separate cycles on a single data bus within a single memory space.
In comparison, a Harvard architecture uses two separate fixed-size spaces:
one for instructions and one for data. This implies that a Harvard machine can
load or operate upon data in a single cycle, with the instruction coming from
the instruction space via the instruction bus, and the data either being loaded
or stored via a data bus into the data memory.
Some examples of these architectures are the MIPS R3000 and Sun SPARC
processor sets. The discussion of their architectures is beyond the scope of this
book. Section 6.4 on Merged Processor DRAM Architectures does provide
examples of (a) Mitsubishi's 32-bit RISC processor (M32R/D) and (b) a 2000-MOPS
embedded RISC processor with a Rambus DRAM controller. The
following section reviews the cache architecture implementation for the TI DSP
TMS320C6211.

6.2.1. Cache Architecture Implementation for a DSP (Example)


An example of cache architecture implementation is the TI DSP
TMS320C6211, which utilizes a two-level real-time cache for internal program
and data storage. This DSP executes over 99.5% of all CPU cycles without
going off-chip, and it employs a two-level memory architecture for on-chip
program and data accesses. Figure 6.5a shows the block diagram [11]. The first
level has dedicated 4-Kbyte program and data caches, L1P and L1D,
respectively. The second-level memory (shown as L2) is a 64-Kbyte memory
block that is shared by both the program and the data. The dedicated L1
caches eliminate conflicts for the memory resources between the program and data
busses. A unified L2 memory provides flexible memory allocation between the
program and data accesses that do not reside in L1.

Figure 6.5 TI TMS320C6211 DSP: (a) block diagram; (b) illustration of two-level cache fetch flow; (c) L2 memory configurations. (From reference 11.)

Each of the two caches consists of a cache memory, a smaller block of
memory that saves the state of the cache (known as the tag RAM), and a cache
controller. When an access is initiated by the CPU, the cache controller checks
its tag RAM to determine whether the data reside in the cache. If the data do
reside in the cache, a cache hit occurs and the data are sent to the CPU. If the
data do not reside in the cache, then a cache miss occurs. On a cache miss, the
controller requests the data from the next-level memory, which, in the case of an
L1P or L1D miss, is L2. In the case of an L2 miss, the next-level memory is the
external memory. The amount of data that a cache requests on a miss is referred
to as the cache's line size. Figure 6.5b illustrates the decision process used by the
DSP memory system to fetch the correct data on a CPU request.
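A minimal sketch of this two-level fetch decision (the flow of Figure 6.5b), modeling each memory level as a Python dictionary; all names and addresses are illustrative:

```python
def fetch(addr, l1, l2, external_memory):
    """Sketch of the Figure 6.5b decision flow for a CPU data request."""
    if addr in l1:                    # L1 hit: send data to the CPU
        return l1[addr]
    if addr in l2:                    # L1 miss, L2 hit: fill L1, then serve
        l1[addr] = l2[addr]
        return l1[addr]
    data = external_memory[addr]      # L2 miss: go to external memory
    l2[addr] = data                   # put the data in L2 ...
    l1[addr] = data                   # ... and in L1 on the way to the CPU
    return data

l1, l2 = {}, {}
ext = {0x100: "coef[0]"}
print(fetch(0x100, l1, l2, ext))   # miss all the way to external memory
print(fetch(0x100, l1, l2, ext))   # now an L1 hit
```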
The DSP performance is significantly improved by using a cache, which
dynamically allocates memory to reduce the latency of accesses to a slower memory.
A cache's performance can be affected by a situation known as "thrashing."
Thrashing occurs when data are read into the cache, another location whose
data overwrite the first data is subsequently cached, and the first data, when
requested again, must be fetched once more from the slow memory.
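A small demonstration of thrashing, assuming a hypothetical direct-mapped cache in which two addresses exactly one cache size apart evict each other on every access (all sizes are illustrative):

```python
CACHE_BYTES, LINE_BYTES = 4096, 32       # assumed 4-Kbyte direct-mapped cache
NUM_LINES = CACHE_BYTES // LINE_BYTES

tags = [None] * NUM_LINES

def access(addr: int) -> str:
    index = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // CACHE_BYTES
    if tags[index] == tag:
        return "hit"
    tags[index] = tag                    # the resident line is overwritten
    return "miss"

# Two data items one cache size apart share an index, so alternating
# between them defeats the cache entirely:
for addr in [0x0000, 0x1000, 0x0000, 0x1000]:
    print(hex(addr), access(addr))       # every access is a miss
```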
The major elements of DSP cache memory architecture are briefly described
below.

Level-One Program Cache (L1P) The L1P is organized as a direct-mapped
cache with a 64-byte line size. In a direct-mapped cache, every cacheable
memory location maps to only one location in the cache. Thus, the cache
controller needs to check only one location in the tag RAM to determine if the
requested data are available in the cache. DSP algorithms primarily consist of
loops that execute the same program kernel many times on multiple data
locations. Such algorithms remain in the loop for a long time before proceeding
to the next kernel. The L1P is large enough to hold several typical DSP kernels
simultaneously. Because these kernels execute sequentially, they will not thrash
in the L1P.

When a cache miss occurs, the L1P requests an entire line of data
from the L2. In other words, both the requested fetch packet and the next fetch
packet in memory are loaded into the cache. Because most applications execute
sequential instructions, the likelihood is higher that the next fetch packet will
be immediately available when it is requested by the CPU. Thus, the startup
latency to fetch the next packet is eliminated by bursting an entire cache line.
Fetching ahead also reduces the number of cache misses.

Level-One Data Cache (L1D) The L1D is organized as a two-way set-associative
cache with a 32-byte line size. A set-associative cache provides additional
flexibility over a direct-mapped cache. A two-way set-associative cache consists
of two direct-mapped caches, each of which is referred to as a cache way. Each
cache way in the L1D caches 2 Kbytes of data. In a two-way set-associative
cache, every cacheable memory location can reside in one location in each
cache way. A two-way set-associative cache reduces the chance of a cache
thrash, because two thrashing addresses can be stored in the cache simultaneously.
This is beneficial for DSP data, which often involves accessing
multiple arrays simultaneously, such as arrays of coefficients and
samples. Because the CPU has two datapaths, which could simultaneously
access two different data arrays, the L1D architecture minimizes the chance of
these two datapaths thrashing.
The L1D replaces data with a least recently used (LRU) replacement
strategy that selects which set to update with new data by determining which
of the two cache ways was accessed least recently. The new data are then placed
in the appropriate set of that least recently used way. LRU is considered the
best replacement strategy for set-associative caches because of the temporal
locality of data: Once the data have been used, they will probably be needed
again within a short time. When a cache miss occurs, the L1D requests an
entire line of data from the L2. When the CPU is fetching a data array from
a contiguous noncached memory, this greatly reduces the latency for subsequent
data fetches. The L1D memories are dual-ported, which allows the L1D to
support two simultaneous CPU data accesses without stalling.
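A sketch of two-way set-associative lookup with LRU replacement as described above; the 2-Kbyte way and 32-byte line follow the text, but the code is an illustrative model, not the actual L1D controller logic:

```python
WAY_BYTES, LINE_BYTES, NUM_WAYS = 2048, 32, 2
NUM_SETS = WAY_BYTES // LINE_BYTES    # 64 sets per way

# Each set holds up to NUM_WAYS tags, ordered least to most recently used.
sets = [[] for _ in range(NUM_SETS)]

def access(addr: int) -> str:
    index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // WAY_BYTES
    ways = sets[index]
    if tag in ways:
        ways.remove(tag)
        ways.append(tag)              # mark as most recently used
        return "hit"
    if len(ways) == NUM_WAYS:
        ways.pop(0)                   # evict the least recently used way
    ways.append(tag)
    return "miss"

# Coefficients and samples one way-size apart no longer thrash, because
# the two conflicting addresses occupy the two ways of the same set:
for addr in [0x0000, 0x0800, 0x0000, 0x0800]:
    print(hex(addr), access(addr))    # miss, miss, hit, hit
```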

Level-Two Cache/Unified Memory (L2) The L2 is a unified memory used
for both the program and data, consisting of a 64-Kbyte SRAM divided into
four 16-Kbyte blocks. The amount of program or data in the L2 is configurable.
For example, if the application requires 8 Kbytes of program space and
56 Kbytes of data space, then both could be linked into the L2 at the same
time. Each of the four blocks can be independently configured as either
cache or memory-mapped RAM, which allows allocation of the amount of L2
to be used as cache and the amount to be used as RAM. If the application uses
some data that must be accessed quickly, those data can be linked into an
L2 block that is configured as RAM. The rest of L2 can be configured as
cache to provide high-performance operation of the remaining program and
data.

When an L2 block is configured as RAM, external data are not cached in
that block; instead, that memory is accessed by direct addressing. Each block
that is configured as a cache adds a cache way to the L2. For example, when
only one block is configured as cache, the L2 operates as a one-way set-associative
(direct-mapped) cache with 48 Kbytes of RAM, as shown in Figure 6.5c. When
all four blocks are configured as cache, the L2 operates as a four-way set-associative
cache.

6.3. EMBEDDED SRAM/DRAM DESIGNS

The demand for embedded memory is on the rise in the current generation of
ULSI and system-on-chip level designs that require large amounts of SRAM,
multiport RAM, DRAM, ROM, and EEPROM/flash memories. For example,
in the case of high-performance microprocessors, 30-50% of the premium
space and 80% of the transistors are allocated to memory alone. These
controllers include several levels of cache for data and instructions, multiport
SRAMs for TAGs, translation look-aside buffers (TLBs), CAMs, register files,
and general-purpose SRAMs [12]. As the need for embedded memories
continues to increase, so does the complexity, density, and speed of these
memories. Many companies outsource the design of their embedded memories.
An alternate method of obtaining an embedded memory design is to use a
memory compiler, which can provide a physical block in a relatively quick and
inexpensive manner. However, there are also some drawbacks to this approach.
Generally, compiled memory designs result in a larger memory block and less
efficient overall system performance. In addition, the memory design may be
inflexible when the system design requires additional features.
In comparison, customized memories can accommodate emerging
system requirements, such as the need to pitch-match the logic with the memory
core. Instead of placing a standard memory block on the chip and then
synthesizing the logic around it to create a desired function, the designers can
move the logic into the memory block, allowing the physical layout to fit
tightly with the memory pitch dimensions. This approach can reduce the
overall chip size, while allowing for higher density and improved performance
of the chip.
The embedded systems that basically consist of logic and embedded memory
can be developed either by using logic process technology and embedding
memory or by implementing logic in a DRAM process. By using a logic process
as the base technology, the chip benefits from the fast transistors, but because
of high saturation current it is difficult to implement memory with a 1T cell.
This means a relatively high area penalty. In comparison, a DRAM-based
technology allows the creation of embedded DRAMs with very low leakage
currents, but the speed of logic transistors lags. However, the DRAM technology
allows the design of embedded systems with up to 128 Mbits, and even
higher-capacity DRAMs, which is not possible with logic-based technology
because of unacceptably large chip areas. Therefore, the most serious problem
is how to improve the gate density and transistor performance while achieving
the same memory density as that of a standard DRAM [13].
A potential approach for increasing the performance of a memory macro is
to use some of the techniques used to make stand-alone memories faster, such
as synchronous design. Additional speed can be obtained by pipelining the
address and datapath and/or using a wide bus from the array to prefetch
multiple words on one clock cycle into a fast register, with individual words
sent out at the same multiple of speed. In the current generation of memory
macros, both the techniques of dividing the array and using a prefetch scheme
are used. In embedded memory applications, special memory configurations
are also possible that can tailor the memory to the requirements of the
processor and thereby enhance the performance. Another performance issue is
power consumption, which can often be traded against the bandwidth
requirements.
The system chips with optimized memory macros tend to be more expensive
than a chip in pure CMOS logic because of the additional processing steps. A
major issue to consider is whether to use a predefined memory macro, a block
compiler constructing an array out of smaller blocks, or a cell compiler.
Optimized DRAMs tend to use block compilers because even the small blocks
tend to maintain some of the high-density optimization, which is the only
justification for using DRAMs. Block compilers can either be very simple or
can attempt to construct space-saving features such as shared sense
amplifiers. Cell compilers are commonly used for SRAMs, and various SRAM
cell compilers are commercially available.
Embedded DRAM is playing a significant role in the growth of application-
specific processors such as the graphic accelerators, multimedia chips, and
system-on-chip (SOC) designs. The future projections as shown in Figure 6.6
indicate that embedded DRAM will dominate the embedded memory market

Figure 6.6 Future projections for growth of embedded DRAM market compared to
that for SRAM and nonvolatile memory [14].
for SOC designs [14]. These are some of the factors that are key drivers for
embedding memory in current generation of high-performance application-
specific IC (ASICs) and SOC designs:

• Higher Bandwidth  Embedded DRAMs can scale bus widths up to ×1024,
allowing extremely fast transfer rates between the memory and logic due
to the reduction of the off-chip interconnect delays (see the bandwidth
sketch following this list).
• Lower Power Consumption  Embedded DRAM reduces the number of
I/Os required by eliminating the need for off-chip drivers and operating
at the internal chip voltage.
• Board Space Savings  The use of embedded DRAM reduces parts count,
resulting in savings of PCB area. Also, the DRAM cell size is less than
one-tenth that of a 6T SRAM cell.
• Memory Granularity Control  The use of embedded DRAM allows custom
optimization of the memory size according to the design requirements,
from 1 Mb to 128 Mb of DRAM.
• Higher Reliability  The use of embedded DRAM can provide higher
reliability due to reduced parts count.
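To put the bandwidth item above in concrete terms, the short C sketch below
computes the peak transfer rate of a ×1024 on-chip bus. The bus width is taken
from the list; the 100-MHz internal clock is an illustrative assumption, not a
figure from the text.

    /* Peak bandwidth of a wide embedded DRAM bus. The x1024 width is from
       the text; the 100-MHz internal clock is an assumed example value. */
    #include <stdio.h>

    int main(void) {
        const double bus_bits = 1024.0;   /* x1024 on-chip bus width */
        const double clock_hz = 100e6;    /* assumed internal clock  */
        double gbit_s = bus_bits * clock_hz / 1e9;
        printf("Peak rate: %.1f Gbit/s (%.1f GB/s)\n", gbit_s, gbit_s / 8.0);
        return 0;
    }

Even at this modest assumed clock, the on-chip bus delivers 12.8 GB/s, which
is why eliminating the off-chip interface dominates the bandwidth argument.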

When merging logic and DRAM on one die, an ASIC vendor typically
needs to provide polysilicon layers to construct the DRAM capacitors and
other structures, and several metal layers for fast and robust logic interconnect.
A compromise has to be made between thin-gate-oxide, low-threshold logic
transistors and thick-oxide, higher-threshold memory cell array transistors.
Also, the noise-sensitive DRAM array has to be isolated from transients
generated by the fast switching logic. In the DRAM manufacturing world,
the stacked-capacitor cell approach is used by more than 60% of the suppliers and
has proven to be a high-volume and cost-effective solution. The trench
capacitor cells, offered by a small percentage of DRAM suppliers, have scaling
limitations below approximately 0.18-μm process designs.
A successful DRAM-based design, whether embedded or discrete, requires
accurate dynamic circuit optimization based on the layout-related parasitic
capacitances to meet the noise, performance, and power consumption require-
ments. As the design cycles continue to shrink and ASIC vendors migrate
their process lithographies to ensure cost-effectiveness and higher integra-
tion capability, the challenges become even greater. The re-layout of the
embedded DRAM array changes both the parasitics and signal loading
characteristics.
An example of merged DRAM logic (MDL) design is Samsung's latest
MDL110 in 0.25-μm process technology, offering several fabrication options
that trade off high performance (2.5 V) against low power (1.8 V) and enable
varying levels of memory, analog, and logic integration. This modular ASIC
process provides up to 8.2 million usable gates and allows core-based SOC
designers to start with the logic process and then add DRAM, SRAM, and
502 EMBEDDED MEMORIES DESIGNS AND APPLICATIONS

mixed-signal layers, or they can replace the DRAM with flash memory.
Samsung uses this modular ASIC process to fabricate the 750-MHz Alpha
21164 CPU, and the process is also capable of integrating up to 128 Mb of
embedded DRAM, 64 Mb of SuperRAM™ (an innovative DRAM macro that
embeds a multiport cache DRAM), 32 Mb of NOR flash memory, 4 Mb of
SRAM, and 4 Mb of compiled ROM. Available embedded DRAM operating
modes include EDO and SDRAM. Memory configurations offer bus-width
options of ×4, ×8, ×16, ×32, ×64, ×128, ×256, ×512, and ×1024, and
granularity steps of 1 Mb or 4 Mb. The embedded DRAM density
requirements can vary, depending upon the application.
In addition, the designer can also add customized DRAM functions
and a number of logic and analog cores, including the ARM RISC processor,
Alpha Microprocessor, Z80 and 80C51 CPUs, and Oak and Teak DSPs.
Merged processor DRAM architectures will be discussed in more detail in
Section 6.4.
The embedded SRAM has been used for many years to accelerate the
performance of high-end network routers and switches. It is popular because
it is based upon the standard logic process and does not require additional
masking steps. While the embedded SRAM employs a larger cell than DRAM,
new technologies are emerging to help boost SRAM embedded density. The
key to embedded SRAM performance is memory compiler design. A memory
compiler works on a basic principle that memory has a regular structure
consisting of four basic building blocks: the memory array, predecoder, decoder,
and the column select and I/O option. The memory array is constructed by
using the same memory core cell. The other three building blocks are also
constructed from a basic leaf cell [15].
A compiler creates a memory design by using instances of the different leaf
cell types to make up the desired memory width and depth. A maximum
performance is achieved when the leaf cell and memory core cell are optimized
for both process technology and memory-size range. Currently compiler
designs are optimized to meet the demands for a wide range of applications.
Segmented or block architectures are used to improve performance and power
consumption. SOC cores are designed with tightly coupled memories to
overcome the processor-to-memory interface bottleneck. The memory designs
often feature multiport, synchronous or asynchronous operation, and stringent
power control.
Nowadays, the biggest challenge facing embedded SRAM developers is to
satisfy growing demands to embed ever-larger memories on chip. The amount
of embedded memory available to ASIC designers has rapidly grown from 1
Mbit in 0.35-μm process technology to 2.5 Mbit in 0.25-μm process, and more
recently to 6-8 Mbit in 0.18-μm technology. The latest generation of embedded
memories typically feature built-in scan latches and a scan path, as well as
built-in self-test (BIST) logic. Some vendors employ built-in self-repair (BISR)
schemes in which the device identifies a bad row in a self-diagnostic routine
and uses address mapping logic to automatically map into a good address
space. However, a BISR scheme can increase the address setup time and can
be a liability for high-performance designs.
A more radical extension of the embedded memory concept is to integrate
the processor functions into memory. The researchers are laying the ground-
work for devices that will essentially eliminate the processor-to-memory
bottleneck in current memory systems by merging both functions on chip to
create very high performance designs. Some of these architectures will be
discussed in more detail in Section 6.5.
Sections 6.3.1 and 6.3.2 discuss examples of some advanced SRAM and
DRAM macro developments.

6.3.1. Embedded SRAM Macros

In general, an ASIC with embedded memory will provide better system
performance and a smaller parts count as compared to a design that uses
external memory. However, embedding memory requires additional test struc-
tures, which, in turn, increase test time, drive up the manufacturing costs, and
may necessitate additional design revisions. SRAM is the most common type
of embedded memory that does not require additional processing steps. The
main tradeoff for SRAM is size and cost against scalability and power
consumption.
Also, SRAM is easy to migrate to a new production process, because of the
availability of memory compilers. A memory compiler creates a memory design
by stacking together enough leaf cells of each type to provide the desired width
and depth. Although this process automatically produces a memory of any size
and shape, it doesn't produce an optimized design. A designer still needs to
customize the array to both the ASIC manufacturing process and memory
array size to get the highest performance. The ASIC vendors offer customers a
choice of preconfigured memory array sizes, which are optimized for both the
process and array sizes [6].
Thus, it is relatively easy to implement six-transistor cells on a standard
logic process, as is commonly done for integrating an on-chip, Level 1 cache
for a microprocessor. The alternative implementations include the compiled
and customized cells. The compiled cells offer faster turnaround times, whereas
the customized cells deliver the most efficient array architectures. The use of
a four-transistor and two-resistor SRAM cell design (also referred to as the 4T
cell, as compared to the 6T cell) or a multiport SRAM configuration limits ASIC vendor
and foundry options. The 4T SRAM cell designs require additional polysilicon
plug layers to implement the resistors and are more difficult to scale down. The
primary limitation of SRAMs having two or more ports is not process
compatibility, but library model availability. The limitation occurs to an even
greater extent with content addressable memory cells, which are popular in
network designs and are more logic intensive than the standard SRAMs.
The following sections discuss examples of two advanced SRAM macros: (1)
1T SRAM and (2) 4T SRAM.
6.3.1.1. A 1T SRAM Macro  The specialized SRAM processes are optimized
for making the SRAM cell small and include device structures such as
the poly load resistors, local interconnects, and buried contacts, which contrib-
ute little to the performance of logic circuits. For a given feature size, embedded
SRAM processes (four-transistor, two-resistor cells or pseudo-six-transistor
cells using multi-poly layers) typically lag behind the logic processes by about
a half-process generation. The use of a 6T SRAM cell is compatible with the
standard logic process at the cost of low bit density. A 1T SRAM™ has been
developed by Mosys Inc., for embedded memory applications. Figure 6.7a
shows the schematic of a 1T SRAM cell, which consists of a capacitor and an
access transistor [16]. Basically, it is a planar DRAM cell with the capacitor
substituted by a MOS structure. Figure 6.7b shows the cross section of the
planar cell implemented using p-channel devices. The cell contains one bit line,
one word line, and one contact. Therefore, it does not have a complex
interconnect structure of a SRAM cell and is easily scalable. The use of MOS
structure for the storage capacitor allows the cell to be fabricated in a standard
logic process. The biggest advantage is its smaller size, which is about one-third
to one-fourth that of an SRAM cell.
However, the simple 1T cell structure has some shortcomings compared to
the SRAM or even a standard DRAM cell. The MOS structure stores significantly
less charge than the capacitor in a DRAM cell and is also highly nonlinear.
Therefore, a special linearization-biasing scheme is required for compensation.
Another area of concern is the worsening of soft error rate (SER) because of
its single cell and lower charge storage in the MOS structure. These concerns
have been addressed by using special circuit techniques.
In a standard DRAM, data are stored in the DRAM cell in the form of a
charge in the capacitor that is converted to voltage on the bit line and then
amplified by the sense amplifier. The bit-line voltage is proportional to the
ratio of the cell capacitance to the bit-line capacitance. Therefore, to keep the
cell size small, the bit lines must be kept short, minimizing the bit-line
capacitance. In DRAMs, the bit lines are mostly constructed from polysilicon,
and they account for a large portion of the sensing delay. In the 1T SRAM, metal
bit lines are used to improve the speed; and use of short word lines further
reduces the memory cycle time, which is comparable to that for many
high-speed SRAMs. For example, in a standard 0.25-μm logic process using a
single layer of poly and four layers of metal, a memory cycle rate of greater than
220 MHz has been demonstrated.
The use of short memory word lines also reduces the number of active
memory bit cells, which improves the soft error rate. The 1T SRAM uses
multibank architecture in which the short bit-line and word-line array structure
can be organized to form multiple small independent banks, so that
parallel operations can be performed. Each bank can perform a read, write, or
refresh operation independent of the other banks. The availability of many
layers of metal in the standard logic process allows the easy connection of a
large number of memory banks to the interface circuitry, unlike the specialized
DRAM or SRAM processes, where the number of metal layers is often limited
to three or less.
Figure 6.7 1T SRAM. (a) Bit-cell schematic. (b) Bit-cell cross section. (c) Block
diagram. (From reference 16, with permission of IEEE.)
In the embedded environment, a large number of input and
output connections can be easily supported. For each access, up to several
thousand memory cells can be accessed simultaneously. This inherent large
word width makes the memory access more efficient.
Figure 6.7c shows the 1T SRAM block diagram. The multibank organi-
zation allows simultaneous operations in each bank, thereby facilitating the
hiding of refresh operations. A refresh timer is incorporated to generate
periodic refresh requests to the memory bank. During any clock cycle, a read
or a write access can be generated by an accessing agent outside the memory
block by the activation of the ADS signal and read-write indicator WR. The
access address EA is divided into row and column addresses, which are
broadcast to each memory bank, and the bank address, which is decoded by
the bank address decoder to generate external access request ERQ. The bank
accessed will have its ERQ driven low. When a bank is accessed continuously for
longer than the refresh period, the accessed bank loses refresh cycles, and data
can be lost. To avoid this problem, a shadow cache can be implemented using
an additional single-transistor memory bank.
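The refresh-hiding idea can be illustrated with a small behavioral model. The
sketch below is a minimal, hypothetical scheduler, not Mosys's actual circuit:
in each cycle the externally accessed bank is busy, so a round-robin refresh
pointer simply skips it and refreshes an idle bank instead.

    /* Minimal behavioral sketch of refresh hiding in a multibank 1T SRAM.
       The bank count and access pattern are illustrative, not from the
       macro described in the text. */
    #include <stdio.h>

    #define NUM_BANKS 8

    int main(void) {
        int refresh_bank = 0;                     /* round-robin pointer */
        int access[8] = {3, 3, 3, 7, 7, 0, 1, 3}; /* external accesses   */

        for (int cycle = 0; cycle < 8; cycle++) {
            /* The accessed bank is busy; refresh any other (idle) bank. */
            if (refresh_bank == access[cycle])
                refresh_bank = (refresh_bank + 1) % NUM_BANKS;
            printf("cycle %d: access bank %d, refresh bank %d\n",
                   cycle, access[cycle], refresh_bank);
            refresh_bank = (refresh_bank + 1) % NUM_BANKS;
        }
        return 0;
    }

A bank that is hit on every cycle would still starve under this simple policy,
which is exactly the case the shadow-cache bank described above is meant to
cover.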

6.3.1.2. A 4T SRAM Macro  A conventional SRAM macro based on a six-
transistor SRAM cell is easily embedded in LSIs because of its compatibility
with the logic CMOS process. However, the large cell size makes it difficult
to achieve a large memory capacity. A loadless CMOS four-transistor (4T) SRAM
cell has been developed, whose size (1.93 μm²) is only 56% that of the
conventional 6T SRAM cell. The implementation of such a 4T SRAM cell in
a large-capacity, high-speed memory intended for embedded (or off-chip) cache
requires capability to generate accurate timing signals, ensure static data
retention at all temperatures, and overcome the effects of low cell current and
increased bit-line coupling capacitance. The 4T SRAM macro does this by
means of (a) an end-point dual-pulse driver (EDD) for accurate timing control,
(b) a word-line voltage-level compensation (WLC) circuit for temperature-
insensitive data retention, and (c) an all adjoining twisted bit line (ATBL) to
reduce bit line coupling capacitance.
Figure 6.8a shows the loadless CMOS 4T SRAM cell consisting of two access
pMOS FETs and two drive nMOS FETs. It does not require load elements and
is fabricated by using the shallow trench isolation CMOS logic process [17].
Figure 6.8b illustrates the operation of a 4T SRAM cell. Two major differences
between the proposed 4T cell and conventional 6T cell are as follows: (1) Two
access transistors are pMOS FETs, which means when the word-line voltage is
switched from 1.8 V to 0 V, the cell will be activated, and (2) the precharged
bit-line voltage must be maintained at the power supply voltage level of 1.8 V for
stable data retention. During a READ operation, the cell discharges one of the
bit-line pairs (BL). A sense amplifier then amplifies the bit-line differential
voltage. During a WRITE operation, the write driver discharges one of the bit
line pairs to 0 V (BL), causing the cell data to be flipped.
Figure 6.8 A loadless CMOS 4T SRAM. (a) Cell. (b) Read and Write operations.
(From reference 17, with permission of IEEE.)

To maintain one of the two storage nodes at nearly 1.8 V, this SRAM cell
depends on the off-state current supplied to this node from one of the precharged
bit lines through the access pMOS FET. For stable data retention, the off-state
current of the access pMOS FET (IoffP) must be higher than that of the drive
nMOS FET (IoffN). To achieve stable data retention at low temperatures, a
word-line-voltage-level-compensation (WLC) circuit is used, which controls
standby word-line voltage. When the standby word-line voltage is lowered by
0.1 V, the off-state current of the pMOS FET increases 10 times. The WLC
circuit consists of a word-line voltage level determination (WLD) circuit
508 EMBEDDED MEMORIES DESIGNS AND APPLICATIONS

containing a dummy cell. The WLD circuit monitors the internal node voltage
(V1) in the dummy cell and determines the word-line voltage level (V2).
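The quoted sensitivity implies a subthreshold swing of roughly 100 mV per
decade (0.1 V for a 10× current change). The back-of-the-envelope check below
evaluates Ioff ∝ 10^(ΔV/S) for a few word-line offsets; it is a sanity check,
not a device model.

    /* Off-state current vs. standby word-line voltage, assuming the
       100 mV/decade subthreshold swing implied by the text. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double swing_mV = 100.0;  /* implied subthreshold swing */
        for (double dv = 0.0; dv <= 200.0; dv += 50.0)
            printf("word line lowered %3.0f mV -> Ioff x %6.1f\n",
                   dv, pow(10.0, dv / swing_mV));
        return 0;
    }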
This 4T SRAM macro cell uses an all-adjoining twisted bit-line (ATBL)
scheme. In a conventional twisted bit-line (TBL) scheme, two bit lines are
twisted within the same bit-line pair to utilize the effect of coupling capacitance
being canceled out when the voltages of adjoining bit lines swing in the same
direction. This results in a 25% reduction of the total coupling capacitance. In the
proposed ATBL scheme, all bit lines are twisted among each other and adjoin
equally. The macro uses eight pairs of bit lines, which means that the bit-line
coupling capacitance can be reduced by 7/16. The ATBL scheme does not
result in an additional area overhead.
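The two reductions quoted above compare as follows; the fractions come from
the text, and the remaining-capacitance figures are simple arithmetic.

    /* Coupling-capacitance reduction: conventional TBL (25%) vs. the ATBL
       scheme with eight bit-line pairs (7/16). Fractions are from the text. */
    #include <stdio.h>

    int main(void) {
        double tbl  = 0.25;        /* conventional twisted bit line  */
        double atbl = 7.0 / 16.0;  /* all-adjoining twisted bit line */
        printf("TBL : %4.1f%% cut, %4.1f%% remains\n",
               100 * tbl, 100 * (1 - tbl));
        printf("ATBL: %4.1f%% cut, %4.1f%% remains\n",
               100 * atbl, 100 * (1 - atbl));
        return 0;
    }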
A 16-Mb 4T SRAM prototype macro was fabricated using a five-metal,
0.18-μm CMOS logic process, and its size was 10.4 × 5.5 mm². The macro had
a 512K × 32-bit organization and consisted of X-decoders, Y-decoders, sixteen
1-Mb blocks, and two WLC circuits. Each 1-Mb block consists of sixteen
64-kb subarrays. A size comparison was made between a 6T SRAM macro, a 4T
SRAM macro using a conventional nontwisted bit-line scheme (64 cells per bit
line), and the proposed 4T SRAM macro using the ATBL scheme (128 cells
per BL). The 4T SRAM macro using a conventional nontwisted bit-line scheme
was 73% of the size of the 6T SRAM macro. The 4T SRAM macro using the
ATBL scheme, which reduces the number of sense amplifiers, was 66% of the size
of the 6T SRAM macro. The test results showed this macro as capable of
400-MHz access (2.5 ns) at a 1.8-V supply voltage. An advanced version
of this macro in a 0.13-μm process demonstrated 500-MHz high-speed access
(2.0 ns) at a 1.5-V supply voltage.

6.3.2. Embedded DRAM Macros


6.3.2.1. dRAMASICs An example of early embedded DRAM development
is Toshiba's dRAMASIC process. In 1988, the company designed and imple-
mented the first dRAMASIC with 72K raw gates in a channelless architecture,
along with 1 Mbit of DRAM. This device was based on Toshiba's 1.0-J-lm
HC 2MOS twin-well technology. The chip size was 14.95 mrrr' and the basic
one transistor memory cell size was 3.6 ILm x 8.2 j1m, utilizing a planar
capacitor structure. For dRAMASIC, three additional masking steps were
added to the standard HC 2MOS process. A critical design consideration was
providing noise immunity to the embedded DRAM from switching that
occurred in the logic portion of the chip, and it was achieved through
embedding the memory cell array in a separate p-well to minimize the effects
of substrate potential variations [18].
In addition, separate power and ground pads were assigned for the DRAM
macro and logic portions, respectively, to eliminate any interaction between the
power supply lines. Because the interface between the memory and the logic
was now through on-chip connections, row and column addresses were no
longer multiplexed, which resulted in an easier interface as well as higher
EMBEDDED SRAM/DRAM DESIGNS 509

memory performance. The refresh operations for the DRAM were handled
automatically through implementation of the refresh circuit in the logic section,
thus eliminating the need for user control.
The migration of dRAMASIC technology to Toshiba's 0.5-μm process
required three layers of polysilicon and two layers of aluminum interconnect.
Additionally, planar capacitor technology was replaced by the trench capacitor
technology that was being used for the 16-Mb standard DRAM product. An
innovative process feature was the introduction of a triple-well structure,
separating the logic portion from the DRAM cell array, which allows an
independent bias voltage to be supplied for each well and the substrate. In the
first implementation of dRAMASIC on a 0.5-μm process, although the die size
was only 20% larger than for the 1.0-μm process, it included 2.5 times more
raw gates, as well as 8 times the amount of embedded memory. This 8 Mb of
embedded memory configured in a 128-bit-wide bus resulted in a bandwidth
of 1.6 Gbytes/s for a memory operating at 100 MHz.
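The bandwidth figure follows directly from the bus width and operating
frequency, as the one-line check below confirms using only numbers stated
above.

    /* 128-bit bus at 100 MHz: (128/8) bytes x 100e6 = 1.6 Gbytes/s. */
    #include <stdio.h>

    int main(void) {
        double bytes_per_cycle = 128.0 / 8.0;
        double clock_hz        = 100e6;
        printf("Bandwidth = %.1f Gbytes/s\n",
               bytes_per_cycle * clock_hz / 1e9);
        return 0;
    }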
Toshiba currently has two approaches for the implementation of
dRAMASIC designs. The first is based on the DRAM process and provides the
capability for implementing high-density embedded DRAMs. This technology
(1T dRAMASIC) utilizes a one-transistor memory cell with trench capacitor
architecture. The second approach is based on the standard CMOS logic
process and provides the capability for implementation of low-density embed-
ded DRAMs. This technology (3T dRAMASIC) utilizes the three-transistor
memory cell approach.

6.3.2.2. A Compiled 100-MHz DRAM Macro  Another example of an
embedded macro is a compiled 100-MHz DRAM macro with a novel databus
architecture by Mosaid Technologies, Inc., which provides tradeoffs between
power consumption/data transfer rate and flexibility in terms of multibanking,
column decoding, multiple registering, and porting. This macro achieves
densities up to 16 Mb and databus widths up to 256 bits, while occupying a
space of 3.4 mm²/Mb in a 0.35-μm merged DRAM/logic process.
In a conventional DRAM architecture with a single-metal process, the word
lines run orthogonally to folded bit lines. During the read and write operations,
data buses, running parallel to the word lines in the sense amplifier area, are
connected to the selected columns with Y-access devices controlled by the
active column decoder output. Therefore, increasing the number of databuses
requires a corresponding increase in the area. A variation that appeared in a
double-metal 16-Mb DRAM part with multiple cell arrays employed a
single-column decoder serving all arrays, with its output in metal 2 parallel to
the bit lines. Nowadays, the use of polycide bit lines and double-metal layers
has provided an option of running the databuses parallel to the bit lines. A
practical approach is to run a databus pair for each group of 4 bit-line pairs
or columns, which offers many new possibilities. These are some of the general
advantages of this approach [19]:
• It allows implementation of a wide, high-bandwidth datapath, accessing
up to one-fourth of the page at a time in a less-than-10-ns column cycle,
with no databus area penalty.
• The wide datapath and associated datapath circuitry can be shared by
many independent banks, effectively hiding row activation and precharge
time.
• The memory can be dual- or multiported at the databus interface.
Additional registers can be added as a cache or serialized, as needed. The
wide datapath may be further decoded without difficulty to any desired
word width.
• Low operating power can be achieved despite the wide, highly capacitive
bus crossing multiple arrays. A use of restricted voltage swing signals on
the databuses can achieve an order of magnitude reduction in power (see
the power sketch following this list).
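The power claim in the last item can be made plausible with a first-order
CV·f estimate, sketched below. Only the 100-MHz column rate comes from the
text; the capacitance, supply, and swing values are illustrative assumptions.

    /* First-order databus power: full-swing C*Vdd^2*f vs. a restricted
       swing, where the charge per transition drops to C*dV. All values
       except the 100-MHz rate are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        const double c_bus = 256 * 2e-12; /* assumed: 256 lines x 2 pF */
        const double vdd   = 3.3;         /* assumed supply            */
        const double dv    = 0.3;         /* assumed restricted swing  */
        const double f     = 100e6;       /* column rate (from text)   */

        double p_full    = c_bus * vdd * vdd * f;
        double p_reduced = c_bus * dv * vdd * f;
        printf("full swing: %.0f mW, reduced swing: %.0f mW (%.1fx lower)\n",
               p_full * 1e3, p_reduced * 1e3, p_full / p_reduced);
        return 0;
    }

With these assumed values the ratio is about 11×, consistent with the
order-of-magnitude claim.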

The DRAM macro uses a wide databus circuit configuration, in which the data
are stored in a stacked capacitor cell occupying 2 μm². The cell plate is
conventionally biased to Vdd/2. A CMOS double cross-coupled bit sense
amplifier is simultaneously clocked at the p-channel and n-channel sources. With
the bit lines and databuses precharged to Vdd/2 as well, the operation is fully
balanced and symmetrical with respect to Vdd/2 as the reference level.
During the read operations, the databus signal is sensed in another
conventional CMOS latch, similar to the bit line sense amplifier except that it
has isolation devices turning off, as the amplifier senses, so as not to fully
restore the databus levels. The access for a write operation follows a similar
approach, in which the write data are pulsed onto the databuses and a safe
differential level is coupled from them into precharged bit sense amplifiers. The
bit sense amplifier is then clocked to write full 1 or 0 levels into the cell. Thus,
the bit sense amplifiers are used for writing rather than the conventional read-
modify-write approach, which is often unnecessary in an embedded memory.
A representative macrocell configuration produced by the compiler is shown
in Figure 6.9. It is organized as 4 banks × 336 rows × 16 columns × 132 bits. The
dual read/write dataports employing separate wide databuses enable indepen-
dent access to open pages. Arbitration logic incorporated within the read/write
control circuitry allows simultaneous read and write to the same location. The
macrocell uses a simple clocked interface in which all inputs are sampled on
the rising edge of the master clock. There are separate row enable signals (RE)
that control each of the four banks. Each DRAM array is organized as 168
word lines by 4224 bit lines (or 2112 columns), excluding redundancy. When
a bank is activated, 2 word lines are brought to the Vpp level, and 4 of the 5
bit-line sense amplifier arrays are enabled. A 6-bit Y-address (including 2 bits
for the bank address) and a column enable input CE are provided for each data
port.
This macro can transfer data at a continuous rate of 100 Mb/s/pin on both
ports by utilizing its four banks and using column and row cycle times of 10 ns
and 60 ns, respectively.
Figure 6.9 A 100-MHz DRAM macro representative macrocell with dual-port,
4-bank, 2.8-Mb configuration. (From reference 19, with permission of IEEE.)

This is an effective rate of 3.3 GB/s, with the databus
CV power dissipation in the core of roughly 26 mW (with a limited voltage swing
on the databus levels). The compiler generates a library of leaf cells for a 0.35-μm
DRAM process. A tiler is used to automatically generate a wide range of
configurations. The tiler can be adapted to other processes with different leaf
cell dimensions. A set of netlisting and simulation scripts can automatically
characterize the critical paths in the resulting macrocell.
The representative 4-bank, 2.8-Mb macrocell shown in Figure 6.9 measures
3.052 mm × 4.164 mm, for a cell efficiency of 45%. The cell efficiency improves
as the size of the macrocell is increased. The maximum 16-Mb configuration
with only a single bank measures 10.522 mm × 5.148 mm, for a cell efficiency
of 62%, and provides dual 256-bit wide databuses.

6.3.2.3. A Dual-Port Interleaved DRAM Architecture Macro  A fast random
cycle time and low-latency embedded DRAM macro has been developed that
uses a dual-port interleaved DRAM architecture (D²RAM). This D²RAM,
fabricated in a test chip using a 0.25-μm embedded DRAM process, demonstrated
a reduction of the random cycle time from the 50 ns (20 MHz) typical of a
conventional DRAM to 8 ns (125 MHz).
Figure 6.10 Memory cell and bit-line operation. (a) Conventional DRAM. (b)
Dual-port interleaved D²RAM. (c) Memory cell layout for D²RAM. (From reference
20, with permission of IEEE.)

The key circuit techniques used to achieve the desired performance included
the dual-port interleaved DRAM architecture, two-stage pipelined operation,
and a write-before-sensing (WBS) scheme. Figure 6.10 shows the memory cell
and bit-line operation of both (a) the conventional DRAM and (b) the D²RAM
[20]. In the conventional DRAM, one memory cell consists of one transistor
and one capacitor (1T1C), and
active and precharge operations are executed in turn on a bit line. In the case
of the D²RAM, one memory cell consists of one capacitor and two transistors
connected to dual bit lines (BLa/BLb) and dual word lines (WLa/WLb).

In the D²RAM, both dual-port bit lines (a/b) of the memory cell can
be accessed alternately, according to optional commands. When the a-port bit
line is activated, the b-port bit line is precharged; conversely, the a-port bit line
is precharged when the b-port line is activated. This interleaved bit-line
operation has half the random cycle time of a DRAM. In a conventional
DRAM, a folded bit-line sense amplifier architecture is generally used. However,
the D²RAM adopts an open bit-line sense amplifier architecture [21],
which halves the cell size compared to the folded bit-line sense amplifier
architecture. The cell size can be reduced even further, to about 1.8 times that
of a conventional DRAM cell, by reshaping the storage capacitor. Figure 6.10c
shows the dual-port memory cell layout, which is identical to that of a
conventional DRAM. The word lines run vertically and the bit lines run
horizontally.
The D²RAM uses a two-stage pipelined architecture for random accessing
at maximum frequency. This works with a clock synchronous interface. The
read operation has a two-cycle latency from the address input to data output.
For a write operation, the write command and data are presented simulta-
neously, while the word-line, bit-line, and internal databus are activated similar
to the read operation.
In a conventional DRAM, the data are rewritten to the bit line after sensing.
Adjacent unselected bit-line data must be latched to avoid the data destruction
due to the coupling noise from the bit line being rewritten. Therefore, more
time is required for bit-line sensing, rewriting, and restoring than for a read
operation. The D²RAM uses a write-before-sensing (WBS) scheme, in which the
adjacent port bit line shields the unselected bit line from noise caused by the
written bit line. Thus, in WBS the data of the unselected bit line are
maintained, and write time is shorter than the read cycle time. The random
cycle time is below 8 ns under worst-case conditions and below 6.5 ns under
typical conditions, using a two-stage pipelined circuit operation and WBS. This
random cycle time is about six times faster than that of a conventional DRAM.
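The speedup follows directly from the cycle times quoted above:

    /* Random-cycle comparison: 50-ns conventional DRAM vs. 8-ns D2RAM. */
    #include <stdio.h>

    int main(void) {
        const double conv_ns = 50.0, d2ram_ns = 8.0;
        printf("conventional: %.0f MHz, D2RAM: %.1f MHz, speedup %.1fx\n",
               1e3 / conv_ns, 1e3 / d2ram_ns, conv_ns / d2ram_ns);
        return 0;
    }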
The D²RAM prototype macro chip was fabricated in a 0.25-μm triple-well,
3-poly, 4-layer Al metallization embedded DRAM process. The test chip
contained three 2-Mb D²RAM macros, along with logic and I/O cells. The
2-Mb macro size was 9.41 mm², and the memory cell efficiency (die efficiency) of
this test chip was about 44%.

6.3.2.4. A 1-GHz Synchronous DRAM Macro  An embedded DRAM
macro has been developed to serve as a large-capacity, on-chip L2 cache
memory for a future gigahertz system-on-chip (SOC) design [22]. The goal was
to make this DRAM macro fully pipelined and synchronous with a 1-GHz
MPU clock [23] and to supply the data for an entire 1-kb cache line in one
clock and every clock cycle. This was done using the following circuit design
techniques: a two-stage sensing scheme with a synchronized second stage, a
two-port read/write bit switch, a half-VDD bit-line precharge scheme, and a
negative word-line bias scheme. Each subarray is independently operated with
12-cycle pipeline stages.
The macro was fabricated in IBM's CMOS8S embedded DRAM technology
[24]. It offers features such as a trench DRAM capacitor module, a triple-well
process, two different oxide thickness devices, and copper interconnect. The
main differences between this synchronous embedded macro and 1-Gb
DRAM technology are cell area and device speed. The eDRAM cell size of
0.617 μm² is 2.5 times larger than that of the latest 1-Gb DRAM cells, but is
roughly seven times smaller than the SRAM cells in an equivalent technology.
One inverter delay (at nominal conditions and 55°C) for a fan-out of one is 17
ps for the eDRAM, versus 70 ps for 1-Gbit DRAM technology.
The 1-Gb DRAM has a memory cell that uses borderless contact with a
funnel-like shape. This borderless contact reduces the overall area of each
memory cell, and it requires a tall gate stack with thick nitride protection over
the polysilicon layer. However, this tall structure makes it very difficult to etch
with good control over the effective channel length (Leff).
Figure 6.11a shows the cross section of an eDRAM cell that does not use
borderless contacts or stacked gates [22]. As a result, the area of each cell
becomes larger than the 1-Gbit DRAM cell. The passing gate located on the
top of the deep trench becomes an active gate in the next cell. Polysilicon word
lines are stitched by the second-level metal lines at every 20th cell. The first
metal level is used for the bit line and is connected to the drain region of the
device through the tungsten stud.
This eDRAM technology uses both thin-gate-oxide and thick-gate-oxide
devices. For the thin-oxide devices (Tox = 2.8 nm), Leff is 0.09 μm for NFETs
and 0.12 μm for PFETs. The devices operate at a scaled-down supply voltage
of 1.5 V, and there are no comparable devices in the 1-Gb DRAM technology.
The thick-oxide devices used are comparable to the devices in 1-Gb DRAM
technology and are mainly used for the array and array peripherals.
Figure 6.11b shows the block diagram of the 8k × 1024 eDRAM macro
organization. The macro receives the following external signals: a clock (NCLK),
three commands (READ, WRITE, REFRESH), and a 13-bit address signal. It
produces 1280 bits of data, including 256 bidirectional error correction code
(ECC) bits. The macro can deliver a 1280-bit cache line for every 1-ns clock
cycle.
The macro consists of sixteen 512-kb independent subarrays and a central
block for address/command decoding, second-stage sense amplifiers, input
pipeline latches, and the output latches/buffers. Within each subarray, there are
two 256-kb arrays and a control block. There are eight 32-kb arrays in a
256-kb array. In each 256-kb array, there are 264 word-line and 1296 bit-line
(first-stage) pairs. Eight of the 264 word lines are for redundancy purposes.
Two sets of eight bit-line pairs are for redundancy, and two sets of 128 bit-line
pairs are for the ECC. Figure 6.11c shows the expanded view of the subarray.
Figure 6.11 A 1-GHz embedded DRAM macro. (a) Cross section of eDRAM cell.
(b) Block diagram of macro organization. (c) Expanded view of the subarray. (From
reference 22, with permission of IEEE.)

The even-numbered bit-line pairs are sensed by the sense amplifiers at the
top, and odd numbered pairs are sensed at the bottom. The bit lines are
distributed vertically with first-level metal. Each bit-line pair is twisted three
times within the 256-kb array at every 64th cell to balance out the coupling
noise due to the neighboring lines. The word lines are distributed horizontally
with both polysilicon and second-level metal.
There are 1296 read data-line (second-stage bit-line) pairs and 1296 write
data-line pairs running vertically in third-level metal. The read data lines
connect between the first-stage sense amplifiers in subarrays and the second-
stage sense amplifiers in the central unit. The write data lines connect between
the input buffer and the first-stage sense amplifier. This results in matching two
read data lines and one write data line (a total of three) to every bit-line pair. The
macro requires four metal layers. The last metal layer is mainly used for clock
and power distribution.
A 4-bit state machine implemented in each subarray block controls the
subarray. The state machine is synchronized with the global clock of the
macro. It generates the necessary timing signals for the activation of the word
lines, setting of the bit-line sense amplifiers, resetting the word lines, and
precharging the bit lines to half-VDD.
In the worst-case data pattern simulation, the macro, including the clock
drivers, consumes approximately 10 W. The macro is designed to operate with
a 1-GHz clock, at 85°C, nominal process parameters, and a 10% degraded VDD.
The design is fully pipelined and synchronous, with 16 independent subarrays.
With 1-kb-wide I/O and a 1-GHz clock, the maximum data rate becomes 1 Tb/s.
The address access time is 3.7 ns, or four cycles with a 1-GHz clock.
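The peak-rate claim is easy to confirm from the figures above:

    /* A 1-kb-wide data word delivered every 1-ns cycle: 1024 bits x 1 GHz
       = 1.024 Tb/s, quoted as 1 Tb/s in the text. */
    #include <stdio.h>

    int main(void) {
        printf("Peak data rate = %.3f Tb/s\n", 1024.0 * 1e9 / 1e12);
        return 0;
    }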

6.4. MERGED PROCESSOR DRAM ARCHITECTURES

The first generation of logic-based eDRAMs has suffered from severe per-
formance constraints. A major issue in scaling to the next generation is the
migration ease of libraries and cores. DRAMs do not follow the migratable
design rules, unlike the logic technologies, for which a vast majority of libraries
and cores are adaptable to the new design rules by simple automated migration
software and retiming. This is a major advantage of synthesized logic. In
comparison, the design rules in DRAM technology are based on the shrink
path of the cell and may not correspond to simple scaling in the logic
technology. For these reasons, the DRAM-based technology is not compatible
with the library and core methodology used by the designers of synthesized
logic.
The embedding of DRAMs in a logic technology typically requires some
additions to the process flow. Figure 6.12 shows the additional steps in dark
boxes [2], which are as follows: (1) a deep trench for the storage node, (2) an
optimized shallow trench isolation used in place of the logic shallow trench,
(3) predoping of the polysilicon word lines, and (4) predoping of the block. The
boxes in gray are the steps required for the pass transistor and word-line drivers
of the embedded DRAM, which are also used in the formation of 2.5-/3.3-V I/Os. The
additional steps are expected to increase the process complexity by about 25%.
An example of merged processor memory architecture is Mitsubishi's
embedded DRAM technology (eRAM™), which uses a three-layer-metal
HyperDRAM process. The result is a standard eRAM product called M32R/D,
which integrates a 32-bit RISC processor, DSP functions, 2 MB of DRAM, and
4 KB of cache SRAM, all on the same die. Figure 6.13a shows a block
diagram of a typical MCU chip for consumer applications, which has a
configuration that integrates a CPU and certain peripheral circuits on the same
chip, while memory and additional application-specific peripheral circuits
located off-chip are connected by an interchip bus that is typically 16 or 32
bits wide.

Figure 6.12 Additional steps required for embedding DRAMs in a logic technology
process (Vt = threshold adjustment implant). (From reference 2, with permission of
IEEE.)
Figure 6.13 Block diagrams. (a) Conventional microcontroller configuration. (b)
eRAM-based M32R/D. (c) M32R/D chip configuration, including 128-bit, 66.6-MHz
internal bus. (From reference 25, with permission of IEEE.)

Figure 6.13b shows the block diagram of the M32R/D configuration that
integrates a CPU and a DRAM on one chip, while peripheral circuits are
located off the chip and connected by a 16-bit-wide databus. The bus between
the CPU and the DRAM is 128 bits wide.
Because most peripherals do not require high-speed operation, the design
uses a 16.67-MHz external clock for the bus connection to the peripheral I/O
chip. The same external clock feeds into the M32R/D, where a phase-locked-
loop (PLL) circuit generates a 66.6-MHz internal clock. Thus, the CPU and
memory are operated by a 66.6-MHz clock on a 128-bit-wide bus, while the
connection between the peripherals and the CPU/memory chip is a 16-bit-wide
bus running at 16.67 MHz. To realize embedded DRAM in a CPU
process, it is necessary to keep the switching speed of logic components fast
while implementing higher-density memory with small cell size and good
charge retention characteristics. To satisfy these requirements, M32R/D uses a
merged logic and DRAM technology called hyperDRAM process. Two key
techniques used in this process are (1) a suppression of interference between
the logic and DRAM sections by using a triple-well structure for electrical
isolation and (2) reduction of wiring pitch by planarizing the dielectric layers
between the interconnect layers with CMP.
The M32R/D consists of a small RISC CPU connected to the DRAM via
a 128-bit-wide bus, as shown in Figure 6.13c. The instruction queue, 128-/32-
bit selector, and the bus interface unit (BIU) buffer are connected with the
128-bit bus between the DRAM array and the 4-Kbyte cache. The instruction
queue consists of two 128-bit entries and holds instructions for the CPU. When
the CPU reads and writes operand data, the 128-/32-bit selector adjusts the
data transfer between the 128-bit bus and the CPU. The BIU is a 128-bit data
buffer that supports burst transfers on the 128-bit bounded data.
The cache is direct mapped with 128-bit block size and follows a write-back
policy. It has two caching modes: (1) on-chip DRAM and (2) instruction-
caching. In the on-chip DRAM caching mode, the cache functions as a unified
cache for caching instructions and data from the on-chip DRAM. This mode
is effective when most or all of an entire program and data fit in the internal
DRAM, and the data are frequently read and written by the CPU. In the
instruction-caching mode, the entire cache works as an instruction cache that
caches instructions both from the on-chip DRAM and from external memory
chips. This mode is for applications that use the M32R/D and external ROM
chips as a system's data processing components.
The 128-bit internal bus operates at 66.6 MHz and transfers 128 bits of data
between the CPU and memory in one cycle. A cache hit requires one access
cycle (i.e., there is no wait cycle). An operand access transfers 32 bits of data
between the CPU and the cache in one cycle through the 128-/32-bit selector.
Instruction fetch operations fetch 128 bits of instruction code into the instruc-
tion queue in one cycle, so that the CPU can get four to eight instructions in
one cycle. On a cache read miss, the data transfer from the DRAM to the CPU
and the update of the cache data take four cycles on a DRAM page hit, or
eight cycles on a DRAM page miss. On a cache write miss, the data transfer
between the cache and the CPU takes only two cycles (i.e., one wait cycle).
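These cycle counts translate into an average access latency once hit rates are
assumed. In the sketch below, the 1-, 4-, and 8-cycle figures are from the text,
while the hit-rate values are illustrative assumptions.

    /* Average read latency for the M32R/D on-chip cache. Cycle counts are
       from the text; the hit-rate values are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        const double hit      = 0.90;  /* assumed cache hit rate     */
        const double page_hit = 0.60;  /* assumed DRAM page hit rate */

        double miss_cycles = page_hit * 4.0 + (1.0 - page_hit) * 8.0;
        double avg = hit * 1.0 + (1.0 - hit) * miss_cycles;
        printf("average read latency = %.2f cycles (%.1f ns at 66.6 MHz)\n",
               avg, avg * 1e9 / 66.6e6);
        return 0;
    }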
The major advantage of integrating the CPU and DRAM is the feasibility
of using a wide and fast internal bus for higher performance. If on-chip DRAM
is sufficiently large for an application, it doesn't need additional memory
outside the chip. The integration of the CPU and memory eliminates power
consumption of the I/O buffers driven during a DRAM access. The wide
internal bus also reduces the number of DRAM reads and writes.
A multimedia system has to deal with an enormous quantity of data, so it
needs a large bandwidth memory bus and a large bandwidth system bus.
Increasing the bit width of the memory bus is one way to increase the memory
bandwidth, but this increases the pin count. An alternative is to increase the
data frequency by using the Rambus DRAM (RDRAM) that has only 11 signal
pins for the Rambus channel and boosts the signal rate to 600 Mbytes/s (see
Chapter 4, Section 4.3). The RDRAM approach makes it possible to get a large
bandwidth and small memory granularity, and a multimedia system can be
made with just one RDRAM chip. In comparison, to obtain a memory
bandwidth greater than 600 Mbytes/s by using a 100-MHz bus and 16-bit, 64-Mb
SDRAM devices, the bus width must be 64 bits and the minimum memory size
must be 256 Mb (or 32 Mbytes) [26]. A memory of this size is too large to be
used in embedded systems. Therefore, the RDRAM is an especially suitable
device for embedded application, and integrating the system bus on the chip
can provide higher data rates.
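The granularity argument can be reproduced with a few lines of arithmetic;
all figures come from the comparison above.

    /* Bandwidth vs. minimum memory size for x16, 100-MHz, 64-Mb SDRAMs. */
    #include <stdio.h>

    int main(void) {
        const double clock_MHz = 100.0, target_MBps = 600.0;
        for (int bus = 16; bus <= 64; bus *= 2) {
            int devices = bus / 16;  /* one x16 device per 16 bus bits */
            double MBps = bus / 8.0 * clock_MHz;
            printf("%2d-bit bus: %3.0f MB/s, %d devices, %3d Mb minimum%s\n",
                   bus, MBps, devices, devices * 64,
                   MBps > target_MBps ? "  <- exceeds 600 MB/s" : "");
        }
        return 0;
    }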
A multimedia-oriented embedded RISC processor has been developed that
has a high-data-rate system and uses a concurrent RDRAM (C-RDRAM)
controller in a superscalar architecture [27]. Figure 6.14a shows the block
diagram of the processor. A CPU core and several peripheral circuits are
integrated on one chip to handle the multimedia data efficiently. The core and
the peripherals are connected with one of the two buses incorporated into the
chip. One bus is a 200-MHz, 64-bit system bus, which connects the peripherals
requiring a high data rate. The other is a 50-MHz, 16-bit bus for serial interface
connection. The core and the system bus operate using a 200-MHz clock
generated by an internal phase-locked loop circuit from an external 50-MHz
clock signal. The processor is fabricated using a 0.25-μm, four-layer-metal
CMOS process technology and has 3.9 million transistors mounted on a
10.5 × 10.5-mm² die; the power consumption of the chip is less than 2 W.
The CPU core has a six-stage pipeline structure for high-clock-frequency
operation. The whole pipeline is divided into three parts: an instruction
dispatch pipeline (I-pipe), an integer execution pipeline (V-pipe), and a multi-
media single-instruction multiple-data (SIMD) coprocessor pipeline (M-pipe).
These pipelines work in parallel to provide execution speeds up to 2000
MOPS, a level of performance sufficient for MPEG-2 decoding and MPEG-1
encoding.
The V-pipe includes a 32-bit multiply-adder and a 64-bit shifter. The
multiply-adder executes simple 32-bit multiplication or 32-bit addition follow-
ing the multiplication. The 64-bit shifter performs 32-bit logical and arithmetic
left/right operations as well as 64-bit shift operations. The M-pipe performs
SIMD parallel operations on eight packed bytes, four packed half-words, or
two packed words. The unit has a 32-word, 64-bit register file with four read
ports and two write ports and also has six function units: a multiply unit, an
add unit, a shift unit, a logic function unit, and two data-type converter units.
The execution units are fully pipelined and have one-clock throughput and
fixed four-clock latency. Also, there are four 16-bit multiply-adders in the
M-pipe.
The CPU core contains two 16-Kbit internal caches used for the instructions
and data. The peripherals include an RDRAM control unit (RCU), a
video control unit (VCU), an audio control unit (ACU), and a bus control unit
(BCU), which are connected to the 200-MHz, 64-bit internal system bus.
Figure 6.14 Block diagrams. (a) Embedded RISC processor with a Rambus DRAM
controller. (b) Rambus DRAM control unit. (From reference 26, with permission of
IEEE.)
Also,
large data buffers with sizes ranging from 384 to 512 bytes are included in each
peripheral. The RCU controls a Rambus channel, which handles 600-Mbyte/s
data transmission. Eight C-RDRAM devices can be connected directly to the
RCU. The RCU handles memory accesses from the CPU core and various
interfaces. The VCU and ACU have double 256-byte buffers. Each of them
fetches 256 bytes of data from the C-RDRAM, in which a frame buffer and an
audio data buffer are assigned, and outputs the data synchronously with the
interface's individual clock. The VCU has 16-bit parallel data ports for the
video data. The ACU has a serial port for audio data. The BCU manages a
32-bit parallel address/data multiplexed external bus and also acts as a bridge
circuit to a 16-bit, 50-MHz bus, which is connected to various serial-port
controllers. The 200-MHz internal system bus also connects to the interrupt
control unit and the direct memory access control unit. Both these units direct
the CPU core and the BCU, according to external signals.
The RCU provides a bridge between two high-speed buses: the 200-MHz,
64-bit internal system bus and the 600-Mbyte/s Rambus channel. Figure 6.14b
shows the block diagram of the RCU. The Rambus channel exchanges data
using signals synchronized with each edge of the 300-MHz clock signal. The
Rambus ASIC cell (RAC) manages the physical layer of the Rambus channel
protocol, multiplexes write data by 8:1, and demultiplexes read data by 1:8. The
RAC also generates a 75-MHz clock from the 300-MHz clock and communi-
cates with circuits using this 75-MHz clock. The internal system bus side of the
RCU operates using a 200-MHz clock. The Rambus memory controller
(RMC) follows the RDRAM commands in the queue to compose and
decompose packets for the Rambus channel and controls the timing with which
the packets are sent to the Rambus channel. The RCU issues refresh com-
mands and current-level-control commands periodically.
To reduce memory access latency, the RCU interleaves transactions and
prefetches instructions. The performance estimate of this multimedia processor
indicates
that the interleaving results in data rates as high as 533 Mbyte/s and reduces
the 256-byte read latency by 9%. The prefetching reduces the instruction cache
refill latency by 70%.

6.5. DRAM PROCESSES WITH EMBEDDED LOGIC ARCHITECTURES

The logic processes and commodity DRAM fabrication processes have con-
flicting needs; therefore, embedding DRAM using either of those two ap-
proaches involves a compromise between the logic performance and DRAM
cell size. The logic technology requires transistors with low threshold voltages
that can provide fast switching times, but low threshold voltages produce high
leakage currents. DRAM technology requires transistors with low leakage
currents to minimize the size of capacitor in the memory cell; however, low
leakage means slower transistors. Taking into consideration the various
tradeoffs involved, the embedded DRAM suppliers have taken one of two
approaches. The companies with established logic processes and ASIC heritage
start with their logic process and add masking steps to produce DRAM
capacitors, using merged processor DRAM logic architectures that were
discussed in Section 6.4.
In comparison, the companies with a DRAM manufacturing heritage such
as Mitsubishi, Samsung, Toshiba, and Infineon Technologies, start with a
DRAM process and add masking steps to include logic. Some examples of this
approach are modular embedded DRAM core developed by Infineon Tech-
nologies and a multimedia accelerator (MSM7680) by Oki Semiconductor.
There are some experimental DRAM processes with embedded logic architec-
ture being developed, such as intelligent RAM (IRAM, by the University of
California at Berkeley) and computational RAM (CRAM). These are briefly
described in the following sections.

6.5.1. A Modular Embedded DRAM Core


Infineon Technologies has developed a modular embedded DRAM core
concept using its 0.24-μm DRAM trench technology process and based on the
idea of a toolbox. In this toolbox, all parts are available that are needed to
form a customer-specific embedded DRAM core. The major part is a 1-Mbit
memory block containing all cells, word lines, and bit lines that are needed for
a memory array, including the whole row path with row decoder, word-line
drivers, and word redundancy steering circuits for this memory block. These
building blocks can allow creation of memories with up to 128 Mbit and with
a memory granularity of 1 Mbit [28].
The other parts in the toolbox are all building blocks that are needed in the
interface to perform DRAM functionality: column decoder, secondary sense
amplifier, data multiplexer, column redundancy fuses, data I/O, address input,
test interface, and power supply module. The construction of a memory core
depends on the application and the customer's requirements such as the
general operating conditions of the embedded DRAM: memory size, page
length, data word width, latency mode, clock frequency, and memory organiz-
ation.
The core concept supports a multibank operation mode with up to four
independent banks in one memory module, where a memory module is a
concatenated stack of memory blocks with a shared interface. In this memory
module, up to sixteen 1-Mbit building blocks can be stacked above each other.
The maximum number of memory blocks that can be activated in parallel is 8,
taking into consideration the power consumption aspect. As a 1-Mbit memory
block has a page length of 2 kbits, the maximum page length that can be
achieved by parallel activation is 16 kbits.
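The limits quoted above compose as shown in this short check; all figures are
from the text.

    /* Modular-core arithmetic: 1-Mbit blocks with a 2-kbit page each,
       at most 8 blocks active in parallel, 16 blocks per module. */
    #include <stdio.h>

    int main(void) {
        const int page_kb = 2, max_parallel = 8, blocks_per_module = 16;
        printf("maximum page length = %d kbits\n", page_kb * max_parallel);
        printf("maximum module size = %d Mbit\n", blocks_per_module * 1);
        return 0;
    }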
Figure 6.15 shows three different memory configurations, each of them with
a memory size of 4 Mbits [28]. Configuration #A consists of one bank, where
only one memory block can be activated at a time (shown as a grayed-out
block); this activation results in a page length of 2 kbits.
Figure 6.15 A modular embedded DRAM core with examples of three different
memory configurations, each of them with a memory size of 4 Mb. (From reference 13,
with permission of IEEE.)

block). This activation results in a page length of 2 kbits. Configuration #B


consists of one bank, but in this one bank, two memory blocks are activated
in parallel, so that a page length of 4 kbits is available at the expense of higher
activation current. Configuration #C consists of two banks, but in each of
these memory banks, only one memory block can be activated at a time. This
results in a page length of 2 kbits . However, by activating bank #0 and then
bank # 1, a page length of 4 kbits is possible.
The customer has an option of operating the DRAM core in a high-power
mode or in a low-power mode. The word width delivered by the interface is 64
bits. By activating up to 8 memory modules in parallel, a maximum data word
width of 512 bits is possible. If a customer needs a small memory with a high
data word width, the option exists for expanding the word width of one
memory module to 128 bits by using two 64-bit interfaces in parallel, in only
one memory module.
The core concept offers programmable latency stages of 1, 2, and 3. In
latency mode 3, the embedded DRAM can be clocked with up to 166 MHz,
depending on the memory size. A memory module with an interface of 64-bit
word width achieves data rates of up to 10.6 Gbits/s in page mode. For
example, by combining four of the memory modules, with a total data word
width of 512 bits, a maximum data rate of 84.8 Gbits/s in a page mode can be
achieved.
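As a quick sanity check, these peak figures follow directly from multiplying
the interface width by the latency-mode-3 clock. The sketch below is
illustrative arithmetic only; the 166-MHz clock and the bus widths are taken
from the text, and small rounding differences from the quoted 10.6- and
84.8-Gbits/s values are expected.

```c
/* Peak page-mode data rates for the modular embedded DRAM core:
 * interface width (bits) x clock frequency. */
#include <stdio.h>

int main(void)
{
    const double clock_mhz = 166.0;    /* latency mode 3 maximum */
    const int widths[] = { 64, 512 };  /* one module vs. four combined modules */

    for (int i = 0; i < 2; i++) {
        double gbit_per_s = widths[i] * clock_mhz * 1e6 / 1e9;
        printf("%3d-bit interface at %.0f MHz -> %.1f Gbits/s page-mode peak\n",
               widths[i], clock_mhz, gbit_per_s);
    }
    return 0;
}
```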

6.5.2. Multimedia Accelerator with Embedded DRAM


Many features essential for graphics display require a high data transfer rate
between the frame buffer and the data processing units. For example, to display
a high-quality picture image and have a high-refresh rate to suppress flicker,
the resolution enhancement of color and space requires a data transfer rate
directly proportional to its resolution and refresh rate. The peak transfer rate
of a 1280 x 1024-pixel, true color (a 24-bit color depth) image at a 75-Hz
refresh rate is around 400 MB/s, which is an order of magnitude increase
compared to the conventional data transfer rate of 640 x 480-pixel, 8-bit color
with 60-Hz refresh rate display. The image overlay (or image scaling) doubles
the data transfer rate. Image blending overlays two images by mixing them
together using certain weighting coefficients. Color keying is another overlay
operation in which one of the two images is selected based on color. The data
from two images is read concurrently in both of these operations. In addition
to the read operation of such data, the write operation to update the image in
the frame buffer requires a data transfer rate of tens of Mbytes/s. The graphic
controllers use 64-bit and higher databus widths to support external DRAM
for the frame buffer.
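The quoted peak rate can be reproduced with straightforward arithmetic. The
sketch below assumes, as is common for frame buffers, that a 24-bit
true-color pixel is stored in a 32-bit word; that packing assumption is ours,
not the text's, and it is what yields the roughly 400-MB/s figure.

```c
/* Peak frame-buffer read-rate estimate for the display modes quoted
 * in the text. */
#include <stdio.h>

static double refresh_mb_per_s(int x, int y, int bytes_per_pixel, int hz)
{
    return (double)x * y * bytes_per_pixel * hz / 1e6;
}

int main(void)
{
    printf("1280 x 1024, 32-bpp storage, 75 Hz: %.0f MB/s\n",
           refresh_mb_per_s(1280, 1024, 4, 75));  /* ~393 MB/s */
    printf(" 640 x  480,  8-bpp storage, 60 Hz: %.0f MB/s\n",
           refresh_mb_per_s(640, 480, 1, 60));    /* ~18 MB/s  */
    return 0;
}
```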
An example of the embedded DRAM implementation in graphic display
system is Oki Electric Industry Company's multimedia accelerator chip
(MSM7680) that integrates the frame buffer with graphic controller functions
such as a 2-D drawing engine, MPEG-1 decoder, digital/analog converter for
RGB analog output, and a clock generator phase-locked loop. The MSM7680
uses a four-layer polysilicon, three-metal layer process to embed DRAM in a
complex logic circuit, and has a wide bus with high peak bandwidth, which
eliminates the data transfer bottleneck between the frame buffer and data
processing units.
The integration of frame buffer on a chip using wide databus between the
frame buffer and the FIFO buffers provides a high data transfer rate. The width
between these buffers is restricted by the I/O pin count when the frame buffer
is implemented with separate memory. The internal databus of MSM7680 is
256 bits wide and has a peak data transfer rate of over 2 Gbytes/s when the
column access cycle time is less than 16 ns. A lower power dissipation is
another advantage of DRAM integration, because embedded DRAMs have a
smaller internal databus capacitance than does a bus connected to an external
DRAM.
Figure 6.16 shows a top level view of the MSM7680 LSI multimedia
accelerator architecture that integrates the MPEG-1 video/audio decoder, 2-D
GUI (graphics user interface) engine, and RAMDAC (135-MHz, true color
digital/analog converter) [29]. The host bus interface supports the PCI
protocol at 33-MHz default clock frequency. This interface can support up
to 50 or 66 MHz with the insertion of wait states and includes the standard
PCI configuration register access. The bus interface performs as either a slave
or a bus master with a multichannel DMA, which individually manages 2-D-
rendering command and data streams, and MPEG video and audio data
streams. It also supports abort, parity generation, and checking functions.
Figure 6.16 Block diagram of top-level architecture for Oki MSM7680 multimedia
accelerator with embedded DRAM. (From reference 29, with permission of IEEE.)

Because all drawing coordinates are rectangular, the hardware must translate
rectangular coordinates to linear addresses, as well as perform the clipping
and ternary (256) raster operations (ROPs). The on-chip graphic engine
supports transparent/opaque text and bit-block transfers for color and
monochrome sources and patterns. The bitmaps can be defined with 4-, 8-, 16-, and
24-bit pixel formats. The 2-D acceleration is performed in 8-, 16-, and 24-bpp
formats. The MSM7680 includes an MPEG-1 video and audio decoder. The
host software manages the overall display buffers and files, splits data streams,
and recovers errors. The host manages synchronization between the video and
audio by using special timing and synchronization hardware facilities. The
MSM7680 also includes a display processor and graphics video mixer, and it is
compatible with the standard VGA format and supports the VGA and SVGA
(VESA-enhanced VGA) modes.
The MSM7680 has 1.25-MB embedded DRAM configured as syn-
chronous DRAM, with a 256-bit internal databus width and high-speed access. The
building of DRAM into the controller serves to optimize the composition of
row and column bit length according to the required functions. The embedded
DRAM consists of 21 small blocks, each consisting of 2,048 cells (column) x
256 cells (row). To reduce the number of transistors that are charged or
discharged in one read or write operation, the memory is divided into
21 small blocks of 512 Kbits each. This makes it possible to control the access
speed and reduce the power consumption during a read or a write.
The read and write control logic uses RAS and CAS signals just like a
standard DRAM. This control logic supports RAS access mode (read and write
operations of one word) and CAS access mode (read and write operations of
consecutive data). In the CAS access mode, up to eight words (256
bits/word) can be accessed consecutively. A 256-bit-wide databus is
selected between the internal DRAM and the memory interface (MIU/FIFO)
instead of the standard 128-bit DRAM width. This MSM7680 memory
configuration allows general data transfer rates of 800 MB/s in RAS access
mode. When reading and/or writing eight words consecutively at 80 MHz, the
MSM7680 transfers data at 2 Gbytes/s in CAS access mode.
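These rates follow from the 32-byte (256-bit) internal word. The sketch below
reproduces them; the 16-ns column cycle is from the text, while the 40-ns RAS
cycle is our assumption, inferred from the quoted 800-MB/s figure.

```c
/* Transfer-rate arithmetic for the MSM7680's 256-bit internal databus. */
#include <stdio.h>

int main(void)
{
    const double word_bytes = 256.0 / 8.0;  /* 32 bytes per internal word  */
    const double cas_cycle_ns = 16.0;       /* column access cycle (text)  */
    const double ras_cycle_ns = 40.0;       /* assumed full RAS cycle      */

    /* bytes per nanosecond equals gigabytes per second */
    printf("CAS mode: %.1f GB/s\n", word_bytes / cas_cycle_ns);
    printf("RAS mode: %.0f MB/s\n", word_bytes / ras_cycle_ns * 1000.0);
    return 0;
}
```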
In one section of the embedded DRAM, large buffers are implemented for
command and image data to speed up 2-D drawing. The host CPU communi-
cates all commands, parameters, and operand data for the rendering engine
through these two buffers (FIFOs). The MIU/FIFO supports both internal
and external memory operations. Internal memory is accessible with an
extremely high bandwidth due to a wide, low-latency internal interface. The
external memory can be accessed, along with an optional 1-Mbyte attachment
of memory, with high bandwidth via a burst EDO DRAM interface. A set of
special registers is used that connects the memory core to the peripheral logic
circuits in the interface circuitry to maximize use of the high bandwidth gained
by using the embedded DRAM. In the MSM7680, each function module has a
256-bit register that takes the 256-bit-wide data read from the embedded
DRAM. These registers then forward the data gathered to each function
module through a different-width bus.
A benchmark evaluation (Winbench97, 233-MHz Pentium II, SVGA
display, 16-bpp color) of this multimedia accelerator performance in 2-D
drawings recorded 104 Mpixel/s.

6.5.3. Intelligent RAM (IRAM)


Figure 6.1 showed that while microprocessor performance has been improving
at the rate of 60% per year, the access time for DRAM has been improving at
a rate of less than 10% per year, leading to a processor-memory performance
gap. Processor system architects have attempted to bridge the processor-
memory performance gap by introducing deeper and deeper cache memory
hierarchies, which has made memory latency even worse. While the processor-
memory gap has widened to the point where it dominates performance for
many applications, the cumulative effect of two decades of 60% per year
improvement in DRAM capacity has resulted in huge individual DRAM chips.
However, at a minimum, a system must have enough DRAMs so that their
collective width matches the width of the DRAM bus of the microprocessor,
which is 64 bits in the Pentium and 256 bits in several RISC machines.
The goal of intelligent RAM is to design a cost-effective computer by
designing a processor in a memory logic process that includes memory on-chip,
instead of embedding memory in a conventional logic fabrication process. The
reason for putting the processor in a DRAM rather than increasing the
on-processor SRAM is that DRAM is roughly 20 times more dense than the
SRAM [30]; thus, the IRAM approach can enable a larger amount of on-chip
memory than is possible in a conventional architecture. The second reason is
that, since the actual processor occupies only about one-third of the die, the
new generation of gigabit DRAM has enough capacity to accommodate whole
programs and data sets on a single chip. The third reason is that DRAM dies
have grown in size about 50% each generation, use multilevel metallizations,
and use fast transistors in SDRAMs to meet the high-speed interface require-
ments. This should facilitate easier integration of on-chip processor. These are
some of the potential advantages of IRAM approach [31]:

• Higher Bandwidth. A DRAM has a very high internal bandwidth, basically
corresponding to the square root of its capacity each DRAM clock
cycle (a rough figure illustrated by the sketch following this list). To keep
the cell storage area small, the length of the bit line is
cycle. To keep the cell storage area small, the length of the bit line is
limited, typically by using 256 to 512 bits per sense amplifier. Also, to save
die area, each block has a small number of I/O lines, which reduces the
internal bandwidth by a factor of 5 to 10. A goal of IRAM is to capture
a large fraction of this potential on-chip processor bandwidth.
• Lower Latency. To reduce latency, the wire lengths should be kept as
short as possible, which means fewer bits per block. The access latency of
an IRAM processor does not need to be limited by the same constraints
as a standard DRAM part. A goal of IRAM approach is to improve
memory latency by factors of 5 to 10 and improve memory bandwidth by
factors of 50 to 100 by intelligent floorplanning, utilizing faster circuit
topologies, and redesigning the memory interface.
• Energy Efficiency. Integrating the microprocessor and DRAM memory
on the same die offers the potential for improving energy consumption of
the memory system. There will be fewer external memory accesses, which
consume energy to drive the high-capacitance, off-chip buses. Even
on-chip accesses should be more efficient, because a DRAM consumes less
energy than an SRAM. A goal of IRAM design is to improve energy
efficiency by a factor of 2 to 4.
• Memory Size and Width. An advantage of IRAM over conventional
designs is the ability to adjust both the size and width of the on-chip
DRAM, instead of being limited by powers of 2 (in length or width), as is
the case with the conventional DRAMs. The IRAM design can specify
exactly the number of words and their width.
• Board Space. The goal of IRAM is to reduce the board area by factors of
4 or higher by integrating several chips into a single chip solution.
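To make the square-root rule of thumb from the first bullet concrete, the
sketch below applies it to an assumed 64-Mbit array with an assumed 100-MHz
internal clock; both values are illustrative choices, not figures from the text,
though the result lands near the 100-GB/s bandwidth targeted by IRAM.

```c
/* The "square root of capacity per clock" heuristic for internal
 * DRAM bandwidth (compile with -lm). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double capacity_bits = 64.0 * 1024 * 1024;  /* assumed 64-Mbit DRAM   */
    const double clock_hz = 100e6;                    /* assumed internal clock */

    double bits_per_cycle = sqrt(capacity_bits);      /* ~8192 sense amplifiers */
    double gbytes_per_s = bits_per_cycle * clock_hz / 8.0 / 1e9;

    printf("%.0f bits/cycle -> %.0f GB/s potential internal bandwidth\n",
           bits_per_cycle, gbytes_per_s);
    return 0;
}
```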

These are some of the potential disadvantages and challenges of the IRAM
approach:

• IRAM will be fabricated in a memory process that has been optimized for
smaller cell size and low charge leakage current rather than the fast
transistor speed. This same DRAM fabrication process offers fewer metal
layers than a logic process to lower costs, because routing speed is less of
an issue in a memory.
• DRAMs are designed to work in plastic packages and dissipate less than
2 W, while desktop microprocessors dissipate 20 to 50 W using ceramic
packages.
• Some applications may not fit within the on-chip memory of an IRAM,
and hence IRAMs must access either conventional DRAMs or other
IRAMs over a much slower path than the on-chip accesses.
• The biggest challenge to the IRAM approach is matching the cost of a
DRAM memory. The DRAMs include redundant memory to improve
yield and therefore lower the cost. Traditionally, microprocessors have no
redundant logic to improve the yield, and therefore the on-chip logic may
effectively determine the yield of the IRAM. In addition, testing time
affects the chip costs. Given both the logic and DRAM die on the same
chip, an IRAM die may need to be tested on both the logic and memory
testers.

High-speed microprocessors utilize instruction level parallelism (ILP) in
programs, which means the hardware can find short instruction
sequences to execute in parallel. These microprocessors rely on getting a high
percentage of hits in the cache memory to supply instructions and operands at
a sufficient rate to keep the processors busy. An alternative model to ILP
approach that does not rely on the use of cache memories is the vector
processing. It is a well-established architecture and compiler model that is
considerably older than the superscalar architecture. The vector processors
have high-level operations that work on linear arrays of numbers and do not
rely on data caches to have high performance. Instead, they rely on interleaved,
low-latency main memory, often made from SRAM, using up to 1024 memory
banks to get high memory bandwidth. In addition, they have large register sets,
typically 8 to 16 "vector registers," each with about 64 to 128 of 64-bit
elements. Thus, they have 32-128 Kbits of multiported, high-speed registers.
The vector processors also depend upon multiple, pipelined functional units
and, for matching them to high-bandwidth memory, utilize multiple ports
between the processor and memory. Because IRAM has low latency and is
highly interleaved, it matches the needs of a vector processor. The vector
processing appears to be a promising approach to exploit IRAM architecture,
because vectors easily allow a tradeoff of more hardware and slower clock rate
without sacrificing peak performance.
Figure 6.17 shows the block diagram of a vector IRAM in a 0.18-μm
DRAM process implementation that includes the following major elements [32]:

• Sixteen 1024-bit-wide memory ports on the IRAM, offering a collective
100 GB/s of memory bandwidth
• Sixteen 128-element vector registers
• Pipelined vector units for floating point add, multiply, divide, integer
operations, load store, and multimedia operations
Figure 6.17 Block diagram of a vector IRAM in a 0.18-μm DRAM process
implementation [32].
Instead of operating one element at a time, an eight-pipe vector processor
can operate on eight elements in a clock cycle at the cost of multiple
vector units.

For example, in a 0.18-μm DRAM process with a 600-mm² chip, a high-
performance vector accelerator might have eight Add-Multiply units running
at 500 MHz and sixteen 1-Kbit buses running at 50 MHz. This combination
can offer up to 8 GFLOPS and 100-Gbytes/s performance.
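The sketch below replays that arithmetic: each add-multiply pipe retires two
floating-point operations per cycle, and each of the sixteen 1-Kbit buses moves
128 bytes per 50-MHz cycle. All figures are from the text; the bandwidth
comes out slightly above the rounded 100-Gbytes/s number.

```c
/* Arithmetic behind the quoted vector-IRAM performance figures. */
#include <stdio.h>

int main(void)
{
    double gflops = 8 * 2 * 500e6 / 1e9;           /* add + multiply per pipe per cycle */
    double gbytes = 16.0 * 1024 / 8 * 50e6 / 1e9;  /* sixteen 1-Kbit buses at 50 MHz    */

    printf("%.0f GFLOPS peak, %.0f GB/s memory bandwidth\n", gflops, gbytes);
    return 0;
}
```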

6.5.4. Computational RAM


A major goal behind emerging developments in memory-based logic architec-
tures is to exploit the memory chip's wide internal datapaths and benefit from
the energy efficiencies that can result from embedding multiple functions on the chip.
But to effectively utilize this internal memory bandwidth, logic and memory
must be more tightly integrated than merely being located on the same chip.
A key goal is to remain compatible with commodity DRAM in cost, process-
ing, silicon area, speed, and packaging, while utilizing a significant fraction of
the internal memory bandwidth.
Another approach to implement processor-in-memory architecture is com-
putational RAM (also referred to as C·RAM), which can make effective use of
internal memory bandwidth by pitch-matching simple processing elements to
memory columns. Computational RAM can function either as a conventional
memory chip or as a single-instruction stream, multiple-data stream (SIMD)
computer. When used as a memory, computational RAM is competitive with
conventional DRAM in terms of access time, packaging, and cost. As a SIMD
computer, computational RAM has a potential to run certain parallel applica-
tions thousands of times faster than a CPU.
To extract as much memory bandwidth as possible, the computational
RAM processing elements are pitch-matched to a small number (e.g., 1, 2, 4,
or 8) of memory columns. The use of a common row address shared by a row
of processing elements requires that the processing elements have a SIMD
architecture. With this design approach, the area overhead can range from 3%
to 20%, while power overhead can be 10% to 25%, compared to the
memory-alone approach. Such chips can potentially add a massively parallel
processing capability to machines and systems that currently use DRAMs.
Figure 6.18a shows the conceptual approach in which the computational RAM
chips could serve both as the computer main memory and as the graphic
memory [33].
In the computational RAM architecture's programming model, a host CPU
can read and write to any memory location during an external memory cycle.
During an operate cycle, all processing elements execute the same common
instruction and optionally access the same memory offset within their private
memory partitions. In other words, computational RAM is a SIMD processor
with distributed, nonshared, uniformly addressed memory. A bus facilitates
interprocessor communication and is useful for combinational operations.
A more common alternative to pitch-matching processing elements to groups
of sense amplifiers is to put a single RISC or vector processor in a DRAM chip
[34]. This approach allows a wide variety of conventional programs to be
compiled and run without modification or attention to the data placement and
communication patterns. Such a processor has access to a wider bus (128 to
256 bits) for cache or register fills than it would have if implemented on a
separate chip.
Figure 6.18b shows an example of a candidate computational RAM process-
ing element that was implemented in silicon using SRAMs. Also, physical
designs were created in 4-Mb and 16-Mb DRAM processes to demonstrate the
processing elements compatibility to the DRAM processes. The processing
element shown supports bit-serial computation and has shift-left, shift-right,
and wired-AND bussed communication. The ALU consisting of an 8-to-1
multiplexer has a compact VLSI implementation. Thus, the entire processing
element (including the off-chip read/write path) can be implemented with as few
as 88 transistors using dynamic logic. The control signals (derived from a 13-bit
SIMD instruction) are routed straight through a row of processing elements.
In this architecture, the ALU can perform an arbitrary Boolean function of
three inputs: X registers, Y registers, and memory. The ALU opcode, connected
to the data inputs of the ALU multiplexer, provides the ALU's truth table. The
results can be written back to either the memory or the X, Y, or write-enable
register. The write-enable register is useful for implementing conditional
operations.
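As a behavioral illustration of this datapath, the sketch below models a row
of processing elements whose ALU is an 8-to-1 multiplexer indexed by the X
register, Y register, and memory bit, with the opcode supplying the truth
table and the write-enable register gating the writeback. The structure
follows the description above; the register names come from the text, while
the array sizes, function names, and test pattern are our own.

```c
/* Behavioral sketch of a bit-serial C-RAM processing element. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint8_t x, y, we;   /* X, Y, and write-enable registers (1 bit each) */
    uint8_t mem[64];    /* this element's private column of memory bits  */
} pe_t;

/* One SIMD "operate" cycle: every element applies the same opcode at the
 * same memory offset; writeback is gated by the write-enable register. */
static void simd_op(pe_t *pe, int n, uint8_t opcode, int offset)
{
    for (int i = 0; i < n; i++) {
        uint8_t sel = (uint8_t)((pe[i].x << 2) | (pe[i].y << 1) | pe[i].mem[offset]);
        uint8_t result = (opcode >> sel) & 1;   /* 8-to-1 mux lookup */
        if (pe[i].we)
            pe[i].mem[offset] = result;
    }
}

int main(void)
{
    pe_t row[4] = { 0 };
    for (int i = 0; i < 4; i++) {
        row[i].we = 1;
        row[i].x = i & 1;             /* arbitrary test pattern */
        row[i].mem[0] = (i >> 1) & 1;
    }
    /* Opcode 0x96 sets bit 'sel' where an odd number of X, Y, M are 1,
     * i.e., a three-input XOR: the bit-serial sum of a full adder. */
    simd_op(row, 4, 0x96, 0);
    for (int i = 0; i < 4; i++)
        printf("PE%d: mem[0] = %d\n", i, row[i].mem[0]);
    return 0;
}
```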
Figure 6.18 (a) Illustration showing DRAM replacement with computational RAM
and support logic. (b) C·RAM processing element. (From reference 33, with permission
of IEEE.)

This processing element can be implemented with fewer than 100 transistors,
using a dynamic logic multiplexer. The processing element design fits in the
pitch of eight bit lines (four folded bit-line pairs or columns) across several
generations of DRAMs. To make the effective use of silicon area, structures in
this processing element often serve multiple purposes. For example, X and Y
registers are used to store the results of local computations (such as sum and
carry) as well as to act as the destination for left and right shift operations
between the adjacent processing elements. During communication operations,
the ALU is used for routing signals. The processing elements and support
circuitry add 18% to the area of an existing DRAM design. A single processing
element occupies an area of approximately 360 bits of memory (including sense
amplifier and decoder overhead).
Because DRAM technology is different from the logic-based processor
technology, the implementation of computational RAM does present some
problems, which are generic to the implementation of a merged memory-logic
process and were discussed earlier. The IC manufacturers offering merged
logic-DRAM processes address these problems through the use of (1) a
separate implant mask for the memory cell array, (2) a separately biased well
for the memory cell array, and (3) two thicknesses of gate oxides. Faster logic
in DRAM is available at the expense of these extra process steps. However,
because it is the DRAM's cycle time that largely determines computational
RAM's performance, the computational RAM would see only a small benefit
from a merged logic-DRAM process. The computational RAM, designed as
it is on a commodity DRAM process, can be manufactured at a lower cost than
the merged logic-DRAM process.
The computational RAM processing elements share a common instruc-
tion bus and thus operate in a SIMD mode. Since the mid-1990s, essentially
all new massively parallel SIMD designs have used the embedded memory
approach.

6.6. EMBEDDED EEPROM AND FLASH MEMORIES

The most popular examples of embedded flash memory devices are pro-
grammable logic devices (PLDs), field programmable gate arrays (FPGAs),
DSPs, and microcontrollers. The embedded system designers prefer to use
flash-based processors, which can be quickly programmed, before transferring
their code to a more cost-effective ROM-based chip for high-volume produc-
tion. During the development and early production, the flexibility of on-chip
flash memory speeds software development and allows for changes up to the
last minute, even if the software will eventually be programmed into an
on-chip ROM. Using embedded flash throughout the life of the product offers
some additional advantages. For example, the designers of embedded systems
do not have to physically remove devices to provide software updates, and
embedded flash reduces parts inventory. The embedded flash can also reduce
the inventory of finished products, because the same system can be pro-
grammed with different software, allowing a single inventory of different
models of a product with different capabilities [35]. This section will briefly
discuss the use of embedded flash and EEPROM technologies in microcontrol-
lers (MCUs). Section 6.7 will review memory card designs in smartmedia and
multimedia applications that combine flash chips and/or microcontroller in
wide variety of consumer products such as the cellular phones, mobile laptop
and palmtop computers, digital cameras, and so on.
The need for embedded flash and EEPROMs on microcontrollers comes
from several directions. The software code updates are getting more frequent
because of the shorter design cycles and competitive pressure in time to market.
Also, the electrical reprogrammability can be used as a means to cost-effectively
extend the life of a product. Microcontroller designers are taking advantage of
both the flash and EEPROM architectures and incorporating both memory
types simultaneously onto the same microcontroller chip. Typically, the flash
memory is used for the program storage area because the cells offer the best
packing density and the program code has to be usually written in large blocks.
In comparison, the EEPROM arrays are often used in fairly small blocks, from
less than 100 bytes to about 1 kbyte-just enough to hold the desired
parameters.
The range of microcontrollers that incorporate either EEPROM or flash
memories run from the low-cost 4-bit devices that pack a few bytes worth of
electrically erasable storage to 8-, 16-, and 32-bit CISC and RISC processors
that can pack 128 kbytes or more of flash memory. The addition of a flash
memory to a processor is not limited to MCUs but also applies to DSPs. If the
MCU selected does not meet the desired memory configuration or feature-set
combination, the designers have the option of creating their own custom
microcontroller. Many of the MCU suppliers and ASIC vendors such as
Hitachi, Motorola, Philips Semiconductors, SGS-Thomson Microelectronics,
Texas Instruments, Toshiba, and others have both microcontroller cores and
blocks of flash or EEPROM memory available in their cell libraries from
which a custom controller can be assembled [36].
The well-established Intel 8051 (and all the alternate sources) and the
Motorola 68HC05 (including 68HC08 and 68HC11) are two of the most
popular 8-bit microcontroller families that include on-chip flash or EEPROM,
or both. Both Atmel and Philips have developed flash-based versions of the
MCU. Additional 8-bit processor families such as Hitachi's H8 series, SGS-
Thomson's ST9 (8/16 bit) series, and TI's TMS370 families offer devices with
on-chip EEPROM and/or flash memory, as well. In addition to EEPROM
storage on board, some of the devices also have on-chip ROM, RAM, and a
mix of features ranging from asynchronous or synchronous serial ports to
timers, analog-to-digital converters (ADCs), and general-purpose I/O lines.
Many of these device families are targeted at smart card applications.
The higher-performance microcontrollers with 16- and 32-bit wide datapath
controllers are also available with EEPROM and flash memory blocks from
suppliers including Hitachi, Motorola, NEC, Siemens (now Infineon Technolo-
gies), SGS-Thomson, and Sharp Corporation. For example, Motorola offers
16-bit MCUs such as the 68HC12 that combine on-chip both byte-erasable
EEPROM and a large block of flash memory. The MC68HC16 family of
microcontrollers from Motorola are high-end, full 16-bit architecture MCUs
that pack large blocks of embedded flash EEPROM, up to several kilobytes of
standby RAM, and ROM. Hitachi offers 32-bit, 33-MIPS single-chip, 5-V
power supply operation RISC microcontrollers that include 256 Kbytes of
on-chip embedded flash memory.
A nonvolatile embedded flash memory architecture called Micro-flash,
compatible with the standard CMOS process, has been developed for SOC
applications [37]. The Micro-flash process uses nonvolatile read-only
memory (NROM) technology, and the cell is an n-channel MOSFET in
which the gate dielectric is replaced with a trapping material (nitride) sand-
wiched between two silicon dioxide layers to form an oxide-nitride-oxide
(ONO) structure [38] (also see Semiconductor Memories, Chapter 3, Section
3.5). The top and bottom oxides are thicker than 50 Å to avoid any direct
tunneling. The charge is stored in the nitride next to the N+ junctions. The cell
stores two physically separated bits with a unique method to sense the trapped
charge.
The Micro-flash cell can be from four to six times smaller than the
equivalent flash technology cell. The key features of this Micro-flash cell
operation are the localized trapping concept and a unique reading scheme. The
trapping mechanism builds up sufficient potential in the cell to enable it to be
read externally. Also, each edge of the ONO layer can store charge indepen-
dently, which enables the writing of two bits in each cell.
In a Micro-flash device, a relatively small charge at the source prevents
current flow, while a charge of the same value at the drain allows current flow.
Programming is performed by channel hot electron (CHE) injection, while
erasing is accomplished using tunnel-enhanced hot-hole injection.
A Micro-flash memory array is formed by placing the cells in a virtual
ground architecture that ensures symmetry between the source and the drain.
The symmetry provides a mechanism to address each of the two bits per cell.
The array consists of bit lines and word lines that are orthogonal to each other.
The bit lines are formed with buried N+ implants, while the word lines are a
composite layer of polycide on polysilicon. The ONO layer covers the space
between the N+ bit lines. Thin oxidation of the bit lines reduces capacitance
between the bit lines and the word lines.
Figure 6.19 shows the SEM cross section of a Micro-flash cell [37]. The
Micro-flash process requires only five additional masking steps over the
standard CMOS process. Three masks are used to generate the array, and the
other two form the high-voltage transistors. Micro-flash architecture was
implemented in a family of stand-alone and embedded devices in 0.5-μm
technology that used dual voltages, and for applications that needed only a
limited number of programming cycles. Micro-flash devices subjected to
various baking and high-temperature storage life (HTSL) tests showed their
data retention capabilities equivalent to floating-gate device performance.
Cycling tests demonstrated the technology capability to support 100,000 cycles
and beyond.

Figure 6.19 SEM cross section of the Micro-flash cell [37].

6.7. MEMORY CARDS AND MULTIMEDIA APPLICATIONS

6.7.1. Memory Cards


There is a growing need worldwide for small, convenient, inexpensive, rugged,
and easily transportable forms of nonvolatile data storage for use in computer
systems, consumer products, and industrial/commercial equipment. Flash card
technology meets these requirements, and the applications for use of flash cards
are expected to grow exponentially over the next decade.
In general, the flash cards connect to a microprocessor-based system to
provide a source for data storage for the host CPU. A card loses power when
removed from the system, yet the information stored in its nonvolatile flash
memory chips is retained for indefinitely long periods of time, until those data
are rewritten. An early introduction has been the Personal Computer Memory
Card International Association (PCMCIA) or PC flash card, also referred to
as a "Linear" Card. These PC cards are primarily used in situations where they
connect directly to the system bus and have a direct access to the system CPU.
In the "execute-in-place" (XIP) mode , the linear card must be capable of
high-speed random access read operations. XIP-type applications include those
that provide internal storage for the boot code, BIOS, fonts, operating systems,
application programs, and data [39].
Linear cards can also be used to download data to main memory (DRAM),
the transfer being controlled by the CPU in the host system. The PCMCIA
cards available as Type I (3.3 mm thick), Type II (5 mm thick), and Type III
(10 mm thick) formats have densities greater than 1 Gbyte. Both flash memory
and magnetic media variants exist for the PC cards.
Advanced Technology Attachment (ATA) flash cards are I/O mapped
devices for file storage applications. To achieve plug-and-play interoperability,
they implement the PC standard I/O interface to the host computer. The cards
emulate ATA disk drive operation and do not allow the CPU in the host
system to directly access the flash memory. ATA flash cards can be used
interchangeably in a variety of systems and products that have slots for PC
cards. The two types of ATA flash cards - PC cards and CompactFlash™
(CF™) cards-are similar in concept. In both, an on-card controller imple-
ments a standardized interface that allows the ATA flash PC cards and CF
cards to operate as an external memory storage device without imposing
overhead on the system CPU or requiring special software. However, there are
differences in form factor (CF cards are smaller than ATA PC cards), density
(CF cards have a lower maximum density due to their smaller size), and
protocol specification [40].
The ATA flash cards are 53.9 mm x 85.5 mm (the same width and length
as modem cards and linear PC cards), use a 68-pin connector, and are 5 mm
thick (Type II PC card height limit). In comparison, the CF cards are 36.4
mm x 42.8 mm (62% smaller than the PC cards) and 3.3 mm thick. The file
structure in ATA flash cards is predefined. Data are accessed in sectors of 512
bytes or more, even when only one byte is needed. The applications for ATA
flash cards include mobile computing systems such as handheld PCs, PDAs,
digital still cameras, smart digital phones, GPS systems, and communication
systems such as the cellular base stations, PBX equipment, and digital routing
switches.
The architecture of ATA PC cards and CF cards is almost identical. In
addition to AND, DINOR, NAND, or Flash chips, both card types contain
an intelligent controller-a processor with typically 256 Kbytes of RAM (for
temporary data storage and address translation) and some additional logic, or
a microprocessor-core-based ASIC that combines these circuits. The controller
performs two functions. It controls the reading and writing of data from and
to the flash devices, and it implements the protocols of the ATA interface, thus
offloading the CPU in the host system.
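The sector-granular view such a controller presents is easy to illustrate: the
host addresses the card by logical block, so even a one-byte read costs a full
512-byte sector transfer, as noted earlier for ATA cards. The sketch below is
a generic illustration; the function names and media layout are ours, not any
card's actual firmware interface.

```c
/* Sector-granular access model presented by an ATA card controller. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define SECTOR_SIZE 512
#define NUM_SECTORS 16

static uint8_t flash_media[NUM_SECTORS][SECTOR_SIZE];  /* stand-in for flash */

/* Host-visible read: ATA addresses storage by logical block (sector). */
static void ata_read_sector(uint32_t lba, uint8_t *buf)
{
    memcpy(buf, flash_media[lba], SECTOR_SIZE);
}

/* Reading a single byte still costs one whole sector transfer. */
static uint8_t read_byte(uint32_t byte_addr)
{
    uint8_t sector[SECTOR_SIZE];
    ata_read_sector(byte_addr / SECTOR_SIZE, sector);
    return sector[byte_addr % SECTOR_SIZE];
}

int main(void)
{
    flash_media[2][5] = 0xAB;
    printf("byte at 0x%X = 0x%02X (fetched via a %d-byte sector read)\n",
           2 * SECTOR_SIZE + 5, read_byte(2 * SECTOR_SIZE + 5), SECTOR_SIZE);
    return 0;
}
```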
The ATA cards are optimal for many markets because they offer high
density and interoperability. By far the largest applications for ATA flash
cards are consumer products, particularly digital cameras and palmtop com-
puting devices. However, these PC cards are too big and thick for many
portable systems. A development in this area has been the Miniature Cards,
first championed by Intel and Sharp and later by AMD and Fujitsu, which
reduced the card's footprint while retaining the PCMCIA's parallel interface
for high-bandwidth transfers.
To keep the costs down, Miniature Cards contained no on-board memory
controller, although they did contain some basic logic circuitry for the system
to flash memory bus translation. The lack of memory controller forced the
system designers to rely on file management software running on the host
CPU. These Miniature Cards had some handling and reliability issues and
never really gained popularity and market acceptance. A few suppliers still offer
linear flash cards in PCMCIA and Miniature Card formats.
A major new development has been the introduction of SmartMedia Cards,
formerly known as the solid-state floppy disk cards (SSFDCs). In fact,
SmartMedia is just another packaging option for the NAND flash memory die
inside. Samsung and Toshiba are the two major suppliers for these Smart-
Media modules that contain no memory controller. However, unlike the
Miniature Card, the system interface has a low pin count, which indicates that
unless the application requires direct code execution out of the card, wide
parallel address and databuses are unnecessary, especially when a series of card
reads and writes address consecutive flash memory array locations.
Initially, SmartMedia cards contained only a single flash memory die, which
limited their density. The latest generation of devices contain two die, and some
suppliers are introducing paper-thin packages that may conceptually support
even more dies. As the densities increase, problems have been reported with
incompatibilities caused by system-design limitations. The SmartMedia spec-
ification standardizes only the electrical and mechanical interfaces and the
low-level flash memory command set, not the higher-level system file format.
SmartMedia has other shortcomings also.
The CompactFlash has achieved a widespread popularity amidst small form
factor card formats. The card and interface specifications are openly licensed,
and the standards organization enjoys broad industry participation. Compac-
tFlash memory cards employ a standard ATA/IDE command protocol that
simplifies software system development. The ATA compatibility feature (re-
gardless of the storage media technology used inside the card), along with the
file allocation table (FAT) file-format standardization, increase card suppliers'
source flexibility and enable card portability among systems.
A CompactFlash card is still too large for some systems, such as cel-
lular phones and compact digital audio players. Sandisk Corporation has
partnered with Infineon Technologies to develop a small ROM-based
MultiMedia Card (MMC), which has become quite popular industry-wide. Unlike
CompactFlash, the MMC card currently supports only data and file storage,
due to its small size and seven-pin serial interface, although the MMC
Association is considering specifications for I/O card options. However, neither
the CompactFlash nor the MMC in their current versions incorporates robust
encryption/decryption logic for security purposes, although both include an
optional identifier unique to each card. Recently, Matsushita/Panasonic and
Toshiba along with Sandisk Corporation have developed a secure digital (SD)
card that is slightly thicker than the MMC (an MMC-compatible, narrower
version is also being developed) to provide increased security logic and to
allow for greater storage capacity. The SD cards have increased the interface
count to nine contacts, but are otherwise identical to their MMC predecessors.
The MMCA is considering a variety of secure MMC proposals including a
SD-like expansion of the interface pin count to 9 or 13 contacts and a thicker
card assembly.
The storage media inside the cards is equally important from the cost,
performance, and reliability point of view. NOR-based flash technology has
received widespread adoption in EPROM replacement direct-code-execution
applications such as the PC BIOS and cellular phones. However, its applications
in the more write-intensive data- and file-storage applications, especially in
designs that do not mix code and data within a single chip, have not gained
much popularity. Multilevel-cell (MLC) storage technologies such as Intel's
StrataFlash have lower cost per bit, but have the drawbacks in areas of
program performance and increased erase-block sizes. Some NOR supporters
continue to advocate the technology in write-mostly applications.
Sandisk is the memory-card market leader, and its products have been
NOR-based ever since the company's inception. Recently, Sandisk has an-
nounced products based on its 256-Mbit double-density MLC chip. Micron is
supplying CompactFlash cards based on its MediaFlash component and
matched memory controller. STMicroelectronics is offering a 64-Mbit MLC
chip and controller. In adapting NOR-based technology to mass storage
applications, a memory vendor is prone to accepting larger die with more array
decoding periphery logic to reduce erase block size, making the media appear
more like a hard-disk drive sector-based approach.
The flash memory based on EEPROM technology has a larger cell size than
the NOR flash, but has a better erase and write performance. Atmel's
DataFlash and Nexcom Technology's Serial Flash both use EEPROM-based
flash arrays. Silicon Storage Technology (SST), which uses an EEPROM-
derived flash cell as the basis for its code storage and execution chips, is
supplying other NAND flash memory vendor's chips in its CompactFlash
cards. Currently, NAND-based flash technology is the preferred storage media
in many CompactFlash cards and Memory Stick modules, recently introduced
by Sony Corporation. Samsung and Toshiba have NAND-based products that
are supplied with some nonfunctional blocks, which are mapped-around by the
memory controller while they are on-card or in-system. AMD offers 64-Mbit
UltraNAND devices in limited production. Sandisk and Toshiba have jointly
announced a 0.16-μm technology-based, 512-Mbit NAND flash device, which,
by using optional MLC technology, can store 1 Gbit of information. In about
a year, the companies plan to double the size of their largest device to a 2-Gbit
MLC version based on 0.13-μm technology.
Hitachi and Mitsubishi's AND-based flash is the dominant technology in
data and file storage. These companies have introduced their 256-Mbit MLC
(128 million cells, with 2 bits per cell) AND chip. Hitachi and Mitsubishi have
also announced NAND-based product offerings. According to the data sheet
specifications, AND does not suffer from the same degree of write performance
degradation as the other flash technologies that incorporate MLC storage.
However, AND chips are currently not manufactured using the same advanced
processes as their NAND counterparts, which can negate some cost-per-bit
advantage of the MLC-based AND devices.
CompactFlash (CF), SmartMedia, and MultiMedia cards, along with their
relative tradeoffs, are briefly discussed below.

CompactFlash™ (CF) Cards CF cards conform to the same open industry
standard interface that the ATA flash PC cards use, an interface that eases
system design. The only real differences between the CF cards and ATA PC
cards are form factor and density, with the CF card having a lower maximum
density due to its smaller size. The widely supported CF standard ensures
compatibility across vertical and horizontal markets for current as well as
future developments, and in both 5-V and 3.3-V platforms, independent of the
type of flash chips used. The comprehensive control circuitry inside a CF card
adds some extra cost, but is much less a problem at higher densities when the
flash chips become the predominant cost. Moreover, the functionality provided
by the intelligent controller built into every CF card provides significant cost
advantages such as simpler system level design.
CompactFlash™ was introduced by Sandisk Corporation in 1994, as a
removable mass storage device to capture, retain, and transport data such as
video, audio, and image files [41]. It is compatible with industry standard
functional and electrical specifications established by the PCMCIA. The data
and audio images on CF memory cards can be transported to the PCMCIA-ATA
(AT Bus Attachment) products via a standard PCMCIA Type II adapter card.
CF is manufactured and marketed by Sandisk, as well as by several
companies licensed by Sandisk that support CF specifications. CF is based on
flash memory technology, so that the data, audio, video images are stored on
flash chips rather than the conventional, rotating, mechanical hard disk drives.
Sandisk's first CF cards were based on 32-Mb flash chips built with the
company's 0.5-μm technology and a single-chip integrated ATA controller, which
stores all intelligent drive electronics (IDE) and ATA commands.
SanDisk expanded CF technology with 64-Mb flash chips in 1996, 128-Mb
flash in 1998, and 256-Mb flash in 1999. As flash memory density increases,
considerably higher capacities can be achieved in the same size form factor. In
November 1998, Sandisk introduced the first CF Type II solid-state flash
memory card in a form factor approved by the CompactFlash File Association
(CFA). This new CF Type II card (5 mm thick) offers higher flash memory
capacity for mobile devices and can store up to 300 MB of data, audio, and
images.

SmartMedia Cards A SmartMedia card (previously called flash solid-state
disk, FSSD card) contains one or two NAND flash chips and nothing else, so
it is essentially a special type of flash device package. These cards perform best
in file storage applications that can tolerate large granularity differences and
do not require high-density or multidensity options. They use a proprietary
NAND interface and cannot be used in ATA card applications unless they are
inserted into an active adapter card that implements the ATA interface.
SmartMedia cards conform to a proprietary specification originated by
Toshiba and currently supported by Toshiba and Samsung. Because the
SmartMedia cards contain just one or two flash chips, they are clearly the
simplest possible flash-based nonvolatile storage implementation. However,
this simple configuration imposes a hardware and software burden on the host
system, which must have a controller very similar to the complexity of an ATA
controller.
SmartMedia flash cards are very small and thin (45.0 mm x 37.0 mm x 0.76
mm) and are targeted at serial access, file storage applications. The connections
to the host system are via a 22-pad elastomeric connector or a probe-type
mechanical connector. The power supply voltage is either 5 V or 3.3 V, but not
both. The software that runs on the SmartMedia card controller in the host
system must vary according to the card type used and manufacturer because
the operating specifications for flash chips, such as those for reading and
writing, vary between device types. The software must be written to handle the
full range of flash chips that might possibly be used in the system's SmartMedia
card slot for current and the future configurations. The SmartMedia card also
mandates software licensing fees for the host system and requires extra ROM
and SRAM in the host system. These system software- and hardware-related
issues can complicate the system design and increase time to market. The
SmartMedia cards held under 13% of the market in 1997.
The basic design issues for SmartMedia cards are system development cost
and the requirement for a hardware interface and a software driver in every
host system. Some other areas of concern are reliability due to the use of
unproven packaging technology, compatibility issues because of the propri-
etary interface, capacity growth and density granularity, and interconnection-
related issues. The SmartMedia card as compared to a CF card has per-card
cost advantage at its density points, but that quickly disappears considering
the greater development sources that are required to design a product to accept
a SmartMedia card than are the ones needed for a CF card.
Another factor that must be considered when evaluating SmartMedia cards
versus CF cards is the compatibility issue for varying supply voltages and
software requirements. To be used in a system that has an ATA PC card slot,
the SmartMedia must be inserted into a very costly interface card containing
control functions that implement the ATA interface. By comparison, the CF
card needs only an inexpensive "pass-through" connector. Additional disad-
vantage of the SmartMedia compared to the CF card is the packaging
approach used that has mechanical strength deficiencies. Premature Smar-
tMedia card failure is likely to occur if subjected to forces that cause twisting
or bending.

MultiMedia Cards The MultiMedia Card is a universal, low-cost data
storage and communication medium that is designed to cover a wide area of
applications such as electronic toys, PDAs, cameras, smart phones, digital
recorders, pagers, and so on. The targeted features for MultiMedia card
development are high mobility, low cost with high performance, low power
consumption, and high data throughput. The card interface is based on an
advanced 7-pin serial bus designed to operate in a low voltage range. The
communication protocol is defined as a part of this standard and referred to
as the MultiMedia Card mode. In addition, for compatibility to the existing
controllers, the MultiMedia Card may offer an alternate communication
protocol, which is based on the SPI standard.
The main design goal of the MultiMedia Card system is to provide a
low-cost, mass memory storage card with a simple controlling unit and a
compact, easy to implement interface. However, the complete MultiMedia
Card system must have the functionality to work with the low-cost card stack
and execute tasks (at least for the high-end applications) such as error
correction and standard bus connectivity. The system level concept is based on
modularity and the capability of reusing the hardware over a large variety of
cards.
The basic MultiMedia Card concept is based on transferring data via a
minimal number of signals. The communication signals are as follows [42]:

• CLK: One-bit data transfers on the command and data lines occur on
each cycle of this signal. The frequency may vary between zero and the
maximum clock frequency.
• CMD: It is a bidirectional command channel used for card initialization
and data transfer commands. The CMD signal has two operation modes:
open drain for initialization mode and push-pull for fast command
transfer. The commands are sent from the MultiMedia Card bus master
to the card, and the responses are sent from the card to the host.
• DAT: It is a bidirectional data channel. The DAT signal operates in the
push-pull mode. Only one card or the host is driving this signal at a time.
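The sketch below gives a schematic model of traffic on these three lines: one
bit moves on CMD or DAT per CLK cycle, MSB first. The 48-bit command-token
length (and the sample value, which follows the CMD0/GO_IDLE_STATE reset
token layout) matches the MultiMedia Card standard; the helper function and
framing here are purely illustrative, not a working protocol implementation.

```c
/* Schematic model of serial traffic on the MultiMedia Card bus lines. */
#include <stdio.h>
#include <stdint.h>

/* Shift one frame out serially, one bit per CLK cycle. */
static void clock_out(uint64_t frame, int bits, const char *line)
{
    printf("%s:", line);
    for (int i = bits - 1; i >= 0; i--)
        printf("%d", (int)((frame >> i) & 1));  /* one bit per clock edge */
    printf("\n");
}

int main(void)
{
    uint64_t cmd_token = 0x400000000095ULL;  /* start, host, index, arg, CRC7, end */
    uint8_t data_byte = 0x5A;                /* arbitrary payload byte */

    clock_out(cmd_token, 48, "CMD");  /* host -> card: command channel */
    clock_out(data_byte, 8, "DAT");   /* bidirectional data channel */
    return 0;
}
```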

MultiMedia cards can be grouped in several categories, which differ in the
functions they provide, such as:

• Read Only Memory (ROM) Cards. These cards are manufactured with a
fixed data content, and they are typically used as a distribution media for
software, audio, and so on.
• Read/Write (R/W) Cards. These are available in various versions such as
flash, one-time programmable (OTP), and multiple-time programmable
(MTP). These cards are typically sold as blank media and are used for
mass data storage, with the end-user recording of audio, video, or digital
images.
MEMORY CARDS AND MULTIMEDIA APPLICATIONS 543

• I/O Cards. These cards are intended for communication (e.g., modems)
and typically have an additional interface link.

Figure 6.20 shows the MultiMedia Card (a) architecture and (b) the
bus system [42]. The MultiMedia Card bus is designed to connect either
solid-state mass storage memory or I/O devices in a card format to multimedia
applications. It is a single master bus with a variable number of slaves. The
MultiMedia Card bus master is the bus controller, and each slave is either a
single mass storage card (with possibly different technologies such as ROM,
OTP, flash, etc.) or an I/O card with its own controlling unit (on card) to
perform the data transfer.

Figure 6.20 MultiMedia card. (a) Architecture. (b) Bus system. (From reference 42.)
The MultiMedia Card bus also includes power connections to supply the
cards. The bus communication uses a special protocol (MultiMedia Card bus
protocol), which is applicable to all devices. Therefore, the payload data
transfer between the host and the cards can be bidirectional. The MultiMedia
Card bus architecture requires all cards to be connected to the same set of lines.
No card has an individual connection to the host or other devices, which
reduces the cost of the MultiMedia Card system.
In a MultiMedia Card system, the shared functions are implemented by a
MultiMedia controller. These are the basic requirements for the controller:

• Protocol translation from the standard MultiMedia Card bus to the
application bus
• Data buffering to enable minimal data access latency
• MultiMedia Card stack management to relieve the application processor
• Macros for common complex command sequences

The MultiMedia Card controller is the link between the application and the
MultiMedia Card bus with its cards. It translates the protocol of the standard
MultiMedia Card bus to the application bus and is divided into two major
parts: (1) the application adapter, which is the application-oriented part, and
(2) the MultiMedia Card adapter, which is the MultiMedia-Card-oriented part.
The application adapter consists at least of a bus slave and a bridge into the
MultiMedia Card system. It can be extended to become a master on the
application bus and to support functions like DMA or serve application-
specific needs.

6.7.2. Single-Chip Flash Disk


There are two standards for flash memory interface: AT Attachment (ATA)
and the flash translation layer (FTL). Both of these allow flash memory to
emulate a disk so transparently that the user and system cannot functionally
distinguish it from a mechanical disk drive. The ATA approach requires a
microprocessor, RAM, and an ASIC to manage the flash media. This complex
controller allows the flash memory to fully emulate a hard disk. However, the
cost of ATA controller overhead including additional components associated
with the ATA interface can be high. Although flash is nonvolatile, the ATA
approach still requires a constant power supply to manage the flash. With
ATA, the amount of power a particular flash device will consume during a read
MEMORY CARDS AND MULTIMEDIA APPLICATIONS 545

and write operation can be directly attributed to the use of the internal clock
of the controller. For an ATA flash disk to have a higher performance, the
clock rate needs to be increased, which places a greater power drain on the
battery in mobile applications [43].
The FTL solution can achieve the same functionality of the flash media
through software. The FTL is a simple software driver that acts as a translator
between the flash media and the BIOS parameter block/file allocation table
(BPB/FAT). Because FTL is a software-based solution, it can bring the cost
down as compared to the ATA approach, roughly by an order of magnitude.
However, the FTL approach has a strong reliance on the speed of the host
processor.
The DiskOnChip (DOC) is M-Systems' first monolithic solid-state flash
disk, which combines a disk controller with flash memory on a single die and
is available in a standard 32-pin DIP or TSOP package. The DiskOnChip
products are optimized for use in information appliances such as the set-top
boxes and portable PC-compatible systems that require minimal weight, space,
and power consumption. In order to emulate a hard disk, a flash disk requires
a software management layer. The M-system has patented a flash file system
management technology called TrueFFS(R\ that allows flash components to
fully emulate hard disk. The TrueFFS has the following features:

• The use of wear leveling algorithms to ensure that all blocks are erased
an equal number of times, which can potentially increase the life of the
product by several orders of magnitude.
• Using virtual blocking of the flash device to make the large erase blocks
transparent to the operator.
• Automatic mapping of bad blocks.
• Implementation of a power loss recovery mechanism to guarantee abso-
lute protection of data.
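
The following fragment illustrates the wear-leveling idea from the first bullet
in its simplest form; it is not M-Systems' patented algorithm, and all names
are invented. When a block must be reclaimed, the candidate with the lowest
erase count is chosen, so erase cycles spread evenly across the device.

#include <stdint.h>

#define NUM_ERASE_BLOCKS 256u

static uint32_t erase_count[NUM_ERASE_BLOCKS];  /* per-block erase tally */

/* Return the reclaimable block with the fewest erases so far, or
 * NUM_ERASE_BLOCKS if no block is currently reclaimable. */
uint32_t pick_block_to_erase(const uint8_t erasable[NUM_ERASE_BLOCKS])
{
    uint32_t best = NUM_ERASE_BLOCKS;
    for (uint32_t b = 0; b < NUM_ERASE_BLOCKS; b++) {
        if (erasable[b] &&
            (best == NUM_ERASE_BLOCKS || erase_count[b] < erase_count[best]))
            best = b;
    }
    if (best != NUM_ERASE_BLOCKS)
        erase_count[best]++;    /* account for the erase about to happen */
    return best;
}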

TrueFFS drivers also support 16-bit and 32-bit bus architectures, which are
commonly used in RISC processors. The DiskOnChip is compatible with the
standard EEPROM pinout, and it supports local bus and ISA bus interface
options. It utilizes Reed-Solomon error detection code (EDC) and error
correction code (ECC), which provide the following error immunity for each
512-byte block of data: (1) correction of up to two 10-bit symbols including
two random bit errors, as well as correction of single bursts up to 11 bits, and
(2) detection of single bursts up to 31 bits and double bursts up to 11 bits, as
well as detection of up to 4 random bit errors.
Figure 6.21a illustrates a typical interface of the DiskOnChip-Millennium to
a system [44]. It is connected as a standard memory device using standard
memory interface signals. Typically, the DOC can be mapped to any free 8-KB
memory space.

Figure 6.21 DiskOnChip-Millennium. (a) Simplified I/O diagram: address lines
A[12:0], data lines D[7:0], and the OE, WE, CE, and RSTIN (TSOP only) control
signals connect the 8-MByte device to the host. (b) PC memory map (top to
bottom): extended memory above 1 MB (0FFFFFH), the BIOS area at 0F0000H,
the 8-KB DiskOnChip window at 0C8000H, display memory, and system RAM
below 640K. (From reference 44, with permission of M-Systems Inc.)

Up to four DOC devices can be connected in parallel to the
host bus with no external decoding circuitry. The DOC is compatible with the
industry standard TrueFFS flash management technology. Under the control
of TrueFFS, the DOC behaves like a standard storage device, fully emulating
a hard disk in the system. The DOC is accessed like any other block device,
using standard file system calls. Applications can write to and read from any
sector on the DOC, which is compatible with all diagnostic utilities, applica-
tions, and file systems. The flash memory within the DOC is accessed by the
TrueFFS through an 8-KB window in the CPU's memory space.
Figure 6.21b shows the DOC memory location in relation to the PC
memory map. The DOC can be used as the only disk in the system, allowing
the system to boot from it. In addition, the DOC can also work alongside other
hard disks or floppies, serving either as the boot device or as a secondary disk.
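
In driver terms, "accessed through an 8-KB window" means that software treats
a fixed range of physical addresses as the device itself. The fragment below is
a hedged C sketch of that access pattern; the base address matches the typical
mapping in Figure 6.21b, but the offsets of the DOC's internal registers come
from the data sheet and are not reproduced here.

#include <stdint.h>

#define DOC_WINDOW_BASE 0xC8000u   /* typical free 8-KB slot (Fig. 6.21b) */
#define DOC_WINDOW_SIZE 0x2000u    /* 8 KB */

static volatile uint8_t *const doc_window =
    (volatile uint8_t *)DOC_WINDOW_BASE;

/* Read one byte at an offset inside the window. 'volatile' forces a real
 * bus cycle on every access, because the window is a device, not RAM. */
uint8_t doc_read(uint32_t offset)
{
    return doc_window[offset % DOC_WINDOW_SIZE];
}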

REFERENCES
1. Chung-Yu Wu, Embedded Memory, in The VLSI Handbook, Wai-Kai Chen (Ed.),
CRC Press, New York, Chapter 50.
2. S. Iyer and Howard L. Kalter, Embedded DRAM technology: Opportunities and
challenges, IEEE Spectrum, 1999, pp. 56-64.
3. Richard A. Quinnell, Embedding DRAM boosts performance at a cost, Silicon
Strategies, February 1998, pp. 19-22.
4. Brian Dipert, Embedded memory: the all-purpose core, EDN, March 13, 1998, pp.
34-50.
5. Norbert Wehn and Sören Hein, Embedded DRAM architectural tradeoffs, Techni-
cal Paper Siemens web page.
6. Richard A. Quinnell et al., Focus Report: ASICs with embedded memory, Silicon
Strategies, February 1998, pp. 53-61.
7. Richard Stacpoole et al., Cache memories, IEEE Potentials, 2000, pp. 24-29.
8. Gary Green, Extended cacheability, EDN Products Design, Aug. 8, 1997, pp. 29-32.
9. David Barringer et al., Modernize your memory subsystem design, Electron. Des.,
February 5, 1996, pp. 83-92.
10. Jim Handy, Fast Cache Memory, in The Cache Memory Book, Academic Press, San
Diego, CA, 1998, Chapter 4.
11. TI TMS320C6211 Cache Analysis, Application report SPRA427, September 1998,
pp. 1-11, TI web page.
12. Eric Hall and George Costakis, Developing a design methodology for embedded
memories, ISD, January 2000, pp. 13-16.
13. Konrad Schonemann, A modular embedded DRAM core concept in 0.24 μm
technology, Proc. IEEE MTDT '98 Conference, August 24-25, 1998, San Jose, CA.
14. Embedded DRAM, Samsung Home Page.
15. Danny D. Yeung, Embedded memories are the key to unleashing the power of SOC
designs, Electron. Des., December 4, 2000, pp. 123-129.
16. W. Leung et al., The ideal SoC memory: 1T-SRAM, Proc. IEEE ASIC/SOC
Conference 2000, pp. 32-35.
17. K. Takeda et al., A 16-Mb 400 MHz loadless CMOS four-transistor SRAM macro,
IEEE JSSC, Vol. 35, no. 11, November 2000, pp. 1631-1639.
18. A. Lalchandani and F. Krupecki, dRAMASIC™: The marriage of memory and
logic, EDN Products Edition, May 14, 1997, pp. 23-25.
19. R. C. Foss et al., Re-inventing the DRAM for embedded use: A compiled, wide-
databus DRAM macrocell with high bandwidth and low power, Proc. IEEE
Custom IC Conference, May 13, 1998, pp. 1-5.
20. Y. Agata et al., An 8-ns random cycle embedded RAM macro with dual-port
interleaved DRAM architecture (D²RAM), IEEE JSSC, Vol. 35, no. 11, November
2000, pp. 1668-1671.
21. M. Inoue et al., A 16-Mb DRAM with a relaxed sense-amplifier-pitch open-bit-line
architecture, IEEE JSSC, Vol. 23, pp. 1104-1112, October 1988.
22. O. Takahashi et al., 1-GHz fully pipelined 3.7-ns address access time 8K × 1024
embedded synchronous DRAM macro, IEEE JSSC, Vol. 35, no. 11, November
2000, pp. 1673-1688.
23. P. Hofstee et al., A 1-GHz single-issue 64-b PowerPC processor, 2000 IEEE ISSCC
Dig. Tech. Papers, pp. 92-93.
24. S. Crowder et al., Integration of trench DRAM into a high-performance 0.18 μm
logic technology with copper BEOL, in 1998 IEEE IEDM Dig. Tech. Papers, pp.
1017-1020.
25. Y. Nunomura et al., M32R/D-Integrating DRAM and microprocessor, IEEE
Micro, November/December 1997, pp. 40-47.
26. K. Suzuki et al., A 2000-MOPS embedded RISC processor with a Rambus DRAM
controller, IEEE JSSC, Vol. 34, no. 7, July 1999, pp. 1010-1020.
27. K. Suzuki et al., V830R/AV: An embedded multimedia superscalar RISC processor,
IEEE Micro Magazine, Vol. 18, pp. 36-47, March 1998.
28. N. Wehn et al., Embedded DRAM architectural trade-offs, Design, Automation and
Test Conference, Europe, 1998.
29. Ichiro Sase et al., Multimedia LSI accelerator with embedded DRAM, IEEE Micro,
November/December 1997, pp. 49-54.
30. Steven Przybylski, New DRAM technologies: A comprehensive analysis of the new
architectures, MicroDesign Resources, Sebastopol, California, 1994.
31. David Patterson et al., Intelligent RAM (IRAM): The industrial setting, applica-
tions, and architectures, University of California at Berkeley web page.
32. David Patterson et al., A case for intelligent RAM: IRAM, from University of
California at Berkeley web page and IEEE Micro, April 1997.
33. D. G. Elliott et al., Computational RAM: Implementing processors in memory,
IEEE Des. Test Comput., January-March 1999, pp. 32-41.
34. T. Shimizu et al., A multimedia 32-b RISC microprocessor with 16-Mb DRAM,
IEEE ISSCC Proc. 1996, pp. 216-217.
35. John Bond, Embedded flash speeds time-to-market, Comput. Des., December 1998,
pp. 28-30.
36. Dave Bursky, Flash and EEPROM technologies combine on feature-rich MCUs,
Electron. Des., May 27, 1997, pp. 81-93.
37. Boaz Eitan et al., Embedding flash memory in SOC applications, ISD Magazine,
December 2000, pp. 46-50.
38. K.-T. Chang et al., A new SONOS memory using source-side injection for program-
ming, IEEE Electron Device Lett., EDL-19(7), pp. 253-255, 1998.
39. Flash Card White Paper: Technology and Market Backgrounder, April 1998,
Hitachi web site.
40. CF and CompactFlash Paper: CompactFlash Association web page.
41. CompactFlash™, SanDisk web page.
42. The MultiMedia Card System Summary Based on System Specification Version 2.2,
MMCA Technical Committee.
43. Stefanie Helm, The changing face of solid state memory, EDN Products, May 14,
1997.
44. DiskOnChip® Millennium Single Chip Flash Disk Data Sheets.
