Embedded Memories Designs and Applications
The memory technology for embedded memories has a wide variation, ranging from small blocks of ROM, through hundreds of kilobytes of cache RAM and high-density (several megabits) DRAM, to small- to medium-density nonvolatile memory blocks of EEPROM and flash memory. For memories embedded in a logic process, the most important figure of merit is compatibility with the logic process. In general, embedded ROM used for microcode storage has the highest compatibility with the logic process; however, its application is rather limited. Programmable logic array (PLA) or ROM-based logic is widely used, but it is considered a special case of embedded ROM [1].
Embedded SRAM is one of the most frequently used memories embedded in logic chips, and typical applications include on-chip buffers, caches, register files, and so on. The standard six-transistor (6T) SRAM cell is also fairly compatible with a logic process, unless special structures are involved, although its bit density is not very high. Polysilicon-resistor-load (4T) cells provide higher bit density, but at the cost of process complexity associated with the additional polysilicon-layer resistors. Embedded DRAM (eDRAM), which provides high density, is also becoming quite popular in combination with RISC processors and other peripheral circuitry in system-on-chip (SOC) types of applications for graphic accelerators and multimedia chips. Embedded EPROM, EEPROM, and flash memory technologies require two to three or more additional masking steps beyond the standard logic process. These are finding applications in microcontrollers, field programmable gate arrays (FPGAs), and complex programmable logic devices (CPLDs).
In a typical logic VLSI design, the on-chip memory integrated with other circuitry may include anything from simple registers to caches of several megabits. In special applications such as microprocessors, memory can occupy more than 50% of the chip area. A processor in search of data or instructions looks first in the first-level (L1) cache memory, which is closest to the processor; if the information is not found there, the request is passed on to the second-level cache (L2). The integration of an on-chip L1 improves both the processor performance and bandwidth. The SRAM cells used in the L1 cache are usually larger than the ones for commodity SRAMs and, for highest performance, are fabricated with the same process as the processor logic. Thus, an L1 cache tends to be faster than the L2 and L3 (the off-chip caches) and often utilizes special SRAM processes that minimize cell area.
The major goal of cache memory design is to minimize the miss rate - that is, the probability that immediately needed bits of information are not available in the nearest level of cache memory and have to be fetched from the higher levels of cache or even the main memory. As processor performance has increased, the wait time in idle memory cycles has also increased, which has led to the so-called memory-processor performance gap, as illustrated in Figure 6.1. Therefore, the ability to integrate a large memory close to the processor (possibly on-chip) helps remove some of the constraints of slow memory access, allowing bus widths to increase beyond the conventional pin-limited 16-256 bits, even as high as 1024 bits [2].
Another trend that drives the integration of DRAM into logic chips is memory granularity, which refers to the smallest increment by which a memory size may be increased. For applications that require less than 2 Mb of memory, an embedded SRAM would probably be more cost-effective and should be considered first. In the 0.18-µm process generation, it is expected that the embedded DRAM solution will become cost-effective above approximately 2 Mb.
Figure 6.1 Illustration of performance levels of processor versus DRAM, over several generations [2].
In a 0.25-µm process, a DRAM memory cell occupies 0.6-1.0 µm², whereas the SRAM cell occupies 5-9 µm². Therefore, an embedded DRAM can allow an ASIC to contain as much as 128 Mb in a 0.25-µm process.
Another advantage is noise reduction. The interconnect between the processor and the memory carries some of the highest-frequency signals that generate electrical noise, which is increasingly difficult to control as system clock speeds increase. The fact that a memory bus contains many high-speed signals routed together makes the problem worse. Therefore, bringing the signals inside the ASIC by using the embedded DRAM approach reduces the difficulty of controlling electrical noise on the board. Figure 6.2 compares (a) an on-chip DRAM that eliminates both the PCB traces and the two sets of I/O drivers required and (b) an ASIC plus discrete DRAM design approach [3]. The embedding of DRAM can also improve the granularity of the ASIC itself, because it allows the system designers to select an optimum size for their system memory without any waste.
The embedded DRAM can reduce the system engineering effort and
therefore, possibly, the time to market. While the discrete DRAM has many
access modes, embedded DRAM has only one basic mode, which eliminates
the need for extensive architectural analysis to determine which type of DRAM
provides the best system performance. Despite these savings, there are also some drawbacks, such as the process costs, technical risks, and additional test requirements of the embedded memory approach. Although embedded DRAM lowers some of the system costs, the cost of the ASIC is much greater and may override the system savings. An embedded DRAM process costs nearly 40% more to run than a logic process, because it requires extra production steps (i.e., more masks). As a result, an ASIC with on-chip DRAM is almost always more expensive than a pure logic ASIC combined with a discrete DRAM approach.

Figure 6.2 A comparison of (a) ASIC with embedded DRAM and (b) ASIC with PCB connection to a discrete DRAM [3].
Memory testing has several components that differ from logic testing and needs specialized tests, such as functional test patterns to detect memory array pattern-sensitivity faults, and data retention measurements under worst-case refresh timings and operating temperature extremes. While these tests are routinely performed by the commodity DRAM manufacturers, the ASIC manufacturers may not be equipped to perform this complex testing on embedded DRAMs. Therefore, the embedded DRAMs in a logic process may require a different approach, such as design for testability (DFT) and built-in self-test (BIST) techniques. These DFT and BIST techniques for embedded memories were discussed in Semiconductor Memories, Chapter 5. The addition of testability requirements to embedded memories may increase the time to market, as well as the system cost.
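Functional test patterns of the kind mentioned above are often March-style algorithms. The sketch below is a minimal illustration, not any vendor's actual BIST engine: it runs a March C- style sequence over a simulated bit array, and the list-based memory stand-in and injected stuck-at fault are hypothetical.

```python
# Minimal March C- sketch. Each element: (address order, read/write ops),
# where 'r0' means read expecting 0 and 'w1' means write 1. A real BIST
# engine operates on words in hardware; this is only a behavioral model.
MARCH_C_MINUS = [
    ("up",   ["w0"]),
    ("up",   ["r0", "w1"]),
    ("up",   ["r1", "w0"]),
    ("down", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
    ("up",   ["r0"]),
]

def march_test(mem, size):
    """Return the sorted list of failing addresses."""
    failures = set()
    for order, ops in MARCH_C_MINUS:
        addrs = range(size) if order == "up" else range(size - 1, -1, -1)
        for a in addrs:
            for op in ops:
                kind, bit = op[0], int(op[1])
                if kind == "w":
                    mem[a] = bit
                elif mem[a] != bit:      # read mismatch -> fault detected
                    failures.add(a)
    return sorted(failures)

# Usage: a hypothetical stuck-at-0 fault at address 5 is caught by the r1 elements.
class StuckAtZero(list):
    def __setitem__(self, i, v):
        super().__setitem__(i, 0 if i == 5 else v)

faulty = StuckAtZero([0] * 16)
print(march_test(faulty, 16))   # -> [5]
```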
A key advantage of the embedded memory approach is the higher packaging density and board space saving, which is a very desirable feature for notebook computers, mobile computing, and portable communication devices. In the conventional multichip memory approach, interconnections require large I/O buffers to overcome package- and board-level trace impedances. The resultant increased power means limited battery life and often reduced reliability. It is estimated that a graphic controller with embedded DRAM consumes roughly 500-750 mW, which is roughly 25% the power of its multichip alternative, which consumes approximately 2.5 W.
The transistor gate oxide thicknesses differ for the logic and memory cell
processes. The standard logic transistors have thin oxides with low turn-on
threshold voltages to minimize switching time and maximize performance. In
contrast, many memory processes require thicker gate oxide transistors that
have high turn-on threshold voltages to minimize off-transistor leakage cur-
rent, which is a prime determinant of the required DRAM refresh frequency.
Also, thicker oxides improve data retention characteristics for the EPROM,
EEPROM, and flash memory, because they can help memory cells withstand
the effects of high voltages and corresponding electrical field stresses during
repeated programming and erase cycles.
For its design processes larger than 0.25 µm, Mitsubishi uses normal thin-gate-oxide logic transistors for fabricating the DRAM memory cells as well, depending on the lower operating voltages to reduce leakage current. However, at 0.25-µm and below processes, Mitsubishi's HyperDRAM process uses dual oxide thicknesses and adds three processing steps. Standard logic and HyperDRAM share the same metal pitch and therefore the same logic-layout libraries, but have different logic timing. Mitsubishi's triple-well approach isolates the DRAM substrate from the bias and injected noise originating in the logic and standard SRAM circuits and also contains any required DRAM
substrate bias (see Figure 6.3) [4].

Figure 6.3 Mitsubishi embedded DRAM triple-well process cross section [4].

Embedded memory density depends on the internal bus width, including optional parity support, the array aspect ratio, the number of memory macros, and the logic gate count.
In addition to providing embedded DRAM for ASIC applications, Mitsubishi leverages its eDRAM processes to build the 3-D RAM graphic chip that combines 1.25 Mbytes of DRAM with a high-performance ALU, and also combines the M32R/D 32-bit RISC CPU with 2 Mbytes of DRAM. Mitsubishi has ported its 32-bit M32R CPU not only to embedded DRAM with the M32R/D, but also to flash memory with the M32R/E family.
Another example is Samsung Semiconductor, which supplies merged DRAM and logic on MDL90, a 3.3-V, 0.35-µm process with three-layer or four-layer metal. This process is the same one that the company uses for its 500-MHz Alpha 21164 CPU. It supports as much as 24 Mbits of extended data-out (EDO) DRAM or SDRAM. A four-well approach ensures isolation between the logic and DRAM subsections. There are other vendors offering embedded flash memory fabricated on an EEPROM process, such as Atmel, Hyundai, Lucent Technologies, Motorola, and Texas Instruments. EEPROM variants, although they have more complex cell structures than the NOR alternatives, offer efficient programming and erasing that minimizes the required size of on-chip charge pumps for scaling down to low-voltage operation, and, unlike NAND flash memory, are appropriate for both code and data storage. Some companies are also planning to introduce FRAM as part of their embedded memory portfolio. However, FRAM has a more complex manufacturing process because of its specialized capacitor-dielectric material structure.
The selection of the embedded memory approach as compared to discrete DRAM-based memory systems has advantages as well as disadvantages in three areas of consideration, as follows:

1. Performance

• A discrete DRAM design with a narrow external interface (16-bit interface at 100 MHz) would require about 10 times the power of an eDRAM with an internal 256-bit interface. However, even though the use of eDRAM may reduce overall power consumption of the system, the power consumption per chip may increase, and therefore the junction temperature may increase, which can affect the DRAM retention time.
• Embedded DRAMs can achieve much higher clock frequencies than the
discrete SDRAMs, because the chip interface can be 512 bits wide (or even
higher) as compared to the discrete SDRAMs that are limited to 16-64
bits. It is possible to make a 4-Mb eDRAM with a 256-bit interface,
whereas it would require 16 discrete 4-Mb chips (organized as 256K x 16)
to achieve the same width, and the granularity of such a discrete system
is 64 Mb (overcapacity for an application that needs, say, 4 Mb of
memory).
• In eDRAMs, interconnect wire lengths can be optimized for a given
application, which can result in lower propagation delays and higher
speeds. In addition, noise immunity is enhanced.
2. System Integration

• DRAM transistors are optimized for low leakage currents, yielding lower transistor performance, whereas logic transistors are optimized for high saturation current, yielding high leakage currents. If this compromise is not acceptable, then extra manufacturing steps must be added, at extra cost.

• Higher system integration saves board-level space, reduces pin count, and yields a better form factor. Pad-limited designs may be transformed into non-pad-limited ones with an eDRAM approach. However, more expensive eDRAM packages may be required. Some sort of external memory interface is still needed in order to test the embedded memory. The eDRAM process adds another technology for which libraries must be developed and characterized, macros must be ported, and designs must be debugged and then optimized.
3. Memory Size
• In the eDRAM approach, the memory sizes can be customized. However,
the down side to this is that the memory system designer must know the
exact system memory requirements at the time of design. Later extensions
are not possible because there is no external memory interface.
In general, the selection of eDRAM approach should be considered if the
product volume and projected lifetime are fairly high, or if the eDRAM is
required for higher performance and higher bandwidth. From a system
designer's point of view, eDRAM provides capabilities of (a) customizing the
memory size to precise system requirements, (b) adapting the memory bus
interface to the system requirements, and (c) optimizing the memory structure
(page length, number of banks, word width) to the system requirements [5].
• Access time (t[i]). It is the total time that it takes the CPU to access level i of the memory hierarchy.

• Memory size (s[i]). The memory size refers to the number of bytes in level i of the memory hierarchy.

• Cost per byte (c[i]). The cost of level i is usually estimated as the cost per byte, or as the product of c[i] and s[i].

• Transfer bandwidth (b[i]). The bandwidth is the rate at which the data are transferred over time between various levels.

• Unit of transfer (x[i]). The unit of transfer refers to the grain size for data transfer.
As a general rule, the memory devices at a higher level have faster access times, are smaller in size, have a higher cost per byte and higher bandwidth, and use a smaller unit of transfer compared to those at a lower level.
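These parameters can be made concrete with a small model. The sketch below uses invented illustrative numbers (not figures from the text) and checks the general rule just stated: as the level index grows, access time and size grow while cost per byte falls.

```python
# Hypothetical hierarchy levels: (name, access time in s, size in bytes, cost $/byte)
levels = [
    ("registers",   1e-9,  512,          1e-2),
    ("L1 cache",    5e-9,  64 * 2**10,   1e-3),
    ("L2 cache",    20e-9, 512 * 2**10,  1e-4),
    ("main memory", 60e-9, 256 * 2**20,  1e-6),
    ("disk",        5e-3,  40 * 2**30,   1e-9),
]

# Total cost of the hierarchy is the sum of c[i] * s[i] over all levels.
total_cost = sum(c * s for _, _, s, c in levels)
print(f"total memory cost: ${total_cost:,.2f}")

# Verify the rule: t[i] and s[i] increase with i, while c[i] decreases.
for (n1, t1, s1, c1), (n2, t2, s2, c2) in zip(levels, levels[1:]):
    assert t1 < t2 and s1 < s2 and c1 > c2, (n1, n2)
```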
There are three basic concepts associated with the memory hierarchy: inclusion, coherence, and locality. The inclusion property implies that all information items are originally stored in the outermost level. During the execution of processes, subsets of the total stored data and instructions move up the hierarchy as they are used (e.g., data in L1 move to L0 for execution). As the data age, they move down the hierarchy. A word miss occurs when a word is searched for in level Li but is not found. If a word miss occurs in level Li, it also means that the word miss occurred in all levels above Li.

Another concept associated with inclusion is the method of information transfer between two levels of the hierarchy. The CPU and cache communicate through words (typically, 8 bytes each). Due to the inclusion principle, the cache block must typically be bigger than the size of the memory word and is typically 32 bytes. The cache and the main memory communicate through blocks. The main memory is divided into pages (typically 4096 bytes) that are the units of information transfer between the disk and the main memory. At the final level of the memory hierarchy, the pages of the main memory are stored as segments, sectors, and tracks in the backup storage device.
The coherence property requires that the copies of the same information
item at lower levels of the memory hierarchy be consistent. This implies that if
a word is modified in the processor, copies of that word must be updated either
immediately (write-through method) or eventually (write-back method) at all
lower levels of the memory hierarchy.
Most computer programs are both highly sequential and highly loop-oriented, which allows the cache to operate on the principle of spatial and temporal locality of reference [8]. Spatial locality means that information the CPU will reference in the near future is likely to be logically close in main memory to the information being referenced currently. Temporal locality means that the information the CPU is referencing currently is likely to be referenced again in the near future. By using these concepts, it is possible to design a cache that makes it highly probable that the CPU references are located in the cache. The principle of locality allows a small high-speed memory to be effective by storing only a subset (recently used instructions and data) of the main memory.
The spatial locality of reference is addressed in the following manner. If a cache miss occurs (i.e., if the cache does not contain the information requested by the CPU), then the cache accesses main memory and retrieves the information being requested along with the information in several additional locations that logically follow the current reference. This set of information is called a block, or cache line. The next CPU reference now has a statistically higher probability of being serviced by the cache, thus avoiding main memory's relatively long access time.
The temporal locality of reference is addressed by allowing information to
remain in the cache for an extended period of time, so that the stored line is
replaced only to make room for a new one. By allowing information to remain
in the cache and using a cache of sufficient size, it is possible to fit an entire
loop of code into the cache, thereby enabling very high speed execution of
instructions in the loop. Therefore, the goal of a cache design is to reduce the
effective memory access time, from a CPU point of reference. When the CPU
finds the item that is addressed in the cache, it is called a "hit," whereas if the
item is not found in the cache, it is called a "miss." The probability that the
required data are in the cache is called the "hit rate," and the probability that
the required data are not in the cache is called the "miss rate" (1 - hit rate).
In a "miss" the processor must go to the main memory or the next level of
cache or the data.
In a cached system, the effective main memory access time (t_eff) is given by

t_eff = H × t_cache + (1 − H) × t_main

where H is the hit rate, t_cache is the cache access time, and t_main is the main memory access time.
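As a numeric illustration with assumed (not quoted) timings, a 98% hit rate keeps the effective access time close to the cache speed:

```python
def effective_access_time(hit_rate, t_cache, t_main):
    """Effective access time (ns) of a single-level cached system."""
    return hit_rate * t_cache + (1 - hit_rate) * t_main

# Assumed values: 5-ns cache, 60-ns main memory, 98% hit rate.
print(effective_access_time(0.98, 5.0, 60.0))   # 6.1 ns
```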
The highest level of hierarchy is the first-level cache, which supplies data to
the processor at a rate that requires no delays in the processor operation. To
accomplish this, the first-level cache must be very close to the processor so that
there are no transmission line delays, and the bus between the cache and the
processor must be wide enough to supply the data at the rate required, and
there must be no delay in the interfaces between the two chips. In addition, it
must also contain the data that the processor wants. For example, if the bus
width of memory is 64 bits, to supply a data rate of 1.6 GB/s, it needs to run
at 200 MHz, whereas if it is 128 bits wide, it needs to run at only 100 MHz.
However, an SRAM with 128 bits running at 100 MHz will have significant
delays due to transmission effects crossing the interfaces, unless careful
measures are taken to have proper termination. Also, a memory with 128 I/Os
will be a large chip in a large package.
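The bus-width versus clock trade-off quoted above is a simple product of width and frequency, as the short sketch below shows:

```python
def bus_bandwidth_gbs(width_bits, clock_mhz):
    """Peak transfer rate in GB/s for a bus of the given width and clock."""
    return width_bits / 8 * clock_mhz * 1e6 / 1e9

print(bus_bandwidth_gbs(64, 200))    # 1.6 GB/s
print(bus_bandwidth_gbs(128, 100))   # 1.6 GB/s -- same rate at half the clock
```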
These are some of the basic cache organizations and operating modes:
• Unified Cache. Both instructions and data are cached in the same SRAM memory.
• Burst Cache. A type of synchronous cache, which is about 30-50% faster
than the asynchronous cache and about 50% more expensive. In burst
mode, several bits of data are selected using a single address, which is
incremented using an on-chip counter. Both flow-through and pipelined
SRAMs may have the burst feature.
• Synchronous Burst Cache. SyncBurst SRAMs use an input clock to
synchronize the device to the system clock. This allows for short setup and
hold times for the data, address, and control signals. Both the pipelined
and flow-through versions have input registers, which allow this mode.
In general, the four different aspects of cache organization are the cache size,
mapping method, line size, and whether the cache is combined or split (i.e.,
whether it integrates caches for data and instructions or separates them). The
information that the processor wants normally falls into instructions and data.
The two types of information can be stored in one cache or two. A system can
have a split cache with one for instructions and one for data. The two caches
can have different structures to optimize their function. A cache that contains
both instructions and data is called a "unified cache." The data and instruction
cache for a first-level cache (L1) on the processor chip are usually designed to
optimize the performance of each. They may have different levels of associativ-
ity and different widths. Similarly, separate data and instruction caches can be
used for external L2 cache. These can be on two different chips or the same chip.
The type of mapping used affects the cache's hit time and miss rate, and
usually an increase in the miss rate exacts a penalty on the cache hit time.
Different cache memory architectures have different levels of property called
associativity. The hit rate of the cache can be improved by increasing its
associativity. The most widely used mapping schemes are based on the
principle of associativity. A direct mapped cache allows any specific location
in main memory to be mapped to only one location in the cache. This has the
lowest level of associativity and is the least complex caching scheme, implementing a one-way set-associative design. A fully associative cache is called
a content addressable memory (CAM), discussed in Chapter 2, Section 2.8.3.
In a direct mapped cache, every location in the main memory maps to a
unique location in the cache. For example, location 1 in the main memory maps to location 1 in the cache; location 2 in the main memory maps to location 2 in the cache; location m in the main memory maps to location m in the cache; and location m + 1 in the main memory again maps to location 1 in the cache; and so on. Figure 6.4a shows an example of such a cache that
consists of a data memory, a tag memory, and a comparator. The data memory
contains the cached data and instructions, and its size defines the cache size.
The cache tag memory is the directory of the cache, containing information
about where in the main memory the data stored in the cache originated, and
it uses the comparator to determine whether the cache contains the line being
addressed by the CPU.
Figure 6.4 (a) Direct-mapped cache organization. (b) A two-way set-associative cache [8].

A direct-mapped cache has two critical timing paths: (1) the read data path through the data memory to Data Out and (2) the tag memory path to Match
Out. The slower of these two paths determines the number of bus clock cycles for the first access; the read data path determines the number of bus clock cycles for the rest of the burst. In a direct-mapped cache, the address is split into three
fields called tag (the higher order address bits of the address), index, and word
offset. The bits of the index field address the tag memory to see if the line being
accessed is the line wanted by the CPU. The tag in the location addressed by
the index field is compared against the upper-order bits (the tag field) of the
address. In parallel with the tag access and comparison, the index and
word-offset bits address the data memory, and the accessed word is placed in
the Data Output Buffer. If the tags match and the status bits are correct, the
cache asserts the Match Out signal, indicating that the information retrieved
from the data memory is correct-a cache hit. If the tags do not match or the
status bits are not correct, Match Out is deasserted, which shows the data
retrieved to be invalid (i.e., a cache miss), and correct data must be retrieved
from the main memory.
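The address split and the parallel tag compare described above can be sketched as follows. The field widths (4-byte words, 256 lines) and the list-based tag and data memories are illustrative assumptions, not the organization of any specific cache.

```python
OFFSET_BITS = 2    # 4-byte words per line position (illustrative)
INDEX_BITS  = 8    # 256 cache lines (illustrative)

def split_address(addr):
    """Split a physical address into (tag, index, word offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(tag_mem, data_mem, valid, addr):
    """Return (match_out, data); the data read proceeds in parallel with the tag compare."""
    tag, index, offset = split_address(addr)
    data = data_mem[index][offset]        # speculative read, as in the hardware
    match_out = valid[index] and tag_mem[index] == tag
    return match_out, data if match_out else None

# Usage: an empty cache misses on any address.
tag_mem  = [0] * (1 << INDEX_BITS)
data_mem = [[0, 0, 0, 0] for _ in range(1 << INDEX_BITS)]
valid    = [False] * (1 << INDEX_BITS)
print(lookup(tag_mem, data_mem, valid, 0x1234))   # (False, None) -> cache miss
```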
The most complex cache is a fully associative cache, which allows any location in the main memory to be mapped to any location in the cache. More generally, an n-way set-associative cache (typically n = 1, 2, 4, 8, ...) allows any specific location to be mapped to n locations in the cache. Figure 6.4b shows a two-way set-associative cache, which has a performance roughly equivalent to that of a direct-mapped cache twice its size.
Cache size has the single largest influence on the miss ratio. In general, a larger cache size yields a lower miss ratio, but increasing the cache size beyond a certain optimum limit can actually result in a performance decrease. A cache size of 256 to 512 kilobytes is considered large enough to allow a cache to reach a 98% hit rate. The cache line size, typically an even binary amount such as 16, 32, 64, or 128 bytes, is the basic unit of information transfer between the cache and main memory. It ranks second behind cache size as the parameter that most affects cache performance.
In some cases, a small L1 cache may not have a high enough probability of having the needed data to keep the processor fed at the required rate. In this case, another level of cache (L2) is added off the processor chip; it is larger and runs at a speed between that of the first-level cache and the main memory. Because the L2 cache is external, it can be larger than an on-chip cache, but it still needs measures to increase the probability that the data the processor wants next, which is not found in the L1, is available in the L2. This implies that some amount of logic must be added to the L2 SRAM to make it an effective cache memory.
PC designs widely use an L2 cache in a direct-mapped configuration consisting of 256 KB (32K x 64) of data RAM and 8 KB (8K x 8) of tag RAM. The cache data RAM may consist of asynchronous RAMs (lower cost) or synchronous burst RAMs, for higher performance. Some manufacturers offer PCs in which the L2 cache has been eliminated, and the systems rely on EDO DRAM to recover some of the lost performance. The chipsets for Pentium systems support either a write-through or write-back mode of operation.
The major goal of the RISC architecture is the ability to execute one
instruction every cycle. RISC architectures focus more attention on registers
and have register-rich CPUs. This means that the memory compilers used to
support these CPU designs will focus a lot more attention on turning frequent
memory accesses into register accesses instead, so that the external data
transfers are significantly reduced in comparison with the number of instruc-
tion calls.
The processor architects have to trade off the speed of their designs versus the feature set offered. If the one-cycle-per-instruction goal is to be approached, the instruction, the operands, and the destination must all be accessible to the CPU at the same time, rather than being accessed sequentially. The site at which some
large delays can arise is within the memory management unit (MMU). The two
major architectures for processors are referred to as the Von Neumann and
Harvard. The Von Neumann machine has a single address space, any portion
of which can be accessed as either instructions or as data. A Von Neumann
machine must fetch an instruction, then load or store an operand in two
separate cycles on a single data bus within a single memory space.
In comparison, a Harvard architecture uses two separate fixed-size spaces:
one for instructions and one for data. This implies that a Harvard machine can
load or operate upon data in a single cycle, with the instruction coming from
the instruction space via the instruction bus, and the data either being loaded
or stored via a data bus into the data memory.
Some examples of these architectures are the MIPS R3000 and Sun SPARC processor sets. The discussion of their architectures is beyond the scope of this book. Section 6.4 on merged processor DRAM architectures does provide examples of (a) Mitsubishi's 32-bit RISC processor (M32R/D) and (b) a 2000-MOPS embedded RISC processor with a Rambus DRAM controller. The following section reviews the cache architecture implementation for the TI DSP TMS320C6211.
Figure 6.5 TI TMS320C6211 DSP. (a) Block diagram: C6200B CPU with external memory interface (EMIF), direct-mapped 4-Kbyte L1P cache, multichannel buffered serial ports (McBSPs), enhanced DMA controller, and a four-bank 64-Kbyte L2 memory. (b) Illustration of two-level cache fetch flow. (c) L2 memory configurations, from 64 Kbytes of mapped RAM to 64 Kbytes of 4-way cache in 16-Kbyte increments. (From reference 11.)
When an access is initiated by the CPU, the cache controller checks its tag RAM to determine whether the data reside in the cache. If the data do reside in the cache, a cache hit occurs and the data are sent to the CPU. If the data do not reside in the cache, then a cache miss occurs. On a cache miss, the controller requests the data from the next-level memory, which, in the case of an L1P or L1D miss, is L2. In the case of an L2 miss, the next-level memory is the external memory. The amount of data that a cache requests on a miss is referred to as the cache's line size. Figure 6.5b illustrates the decision process used by the DSP memory system to fetch the correct data on a CPU request.
DSP performance is significantly improved by using a cache, which dynamically allocates memory to reduce the latency to a slower memory. A cache's performance can be affected by a situation known as "thrashing." For thrashing to occur, data are first read into the cache; subsequently, another location that maps to the same cache line is cached, and its data overwrite the first data. When the first data are requested again, the cache must again fetch them from the slow memory.
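Thrashing is easy to demonstrate with a direct-mapped model: two addresses whose index bits coincide evict each other on every access. The dictionary-based cache below is purely illustrative, with invented line and cache sizes.

```python
LINES = 256
line_index = lambda addr: (addr // 32) % LINES    # 32-byte lines, illustrative

cache = {}      # index -> tag of the line currently resident
misses = 0
a, b = 0x0000, 0x2000        # 0x2000 = 256 lines * 32 bytes apart: same index
for addr in [a, b] * 4:      # alternate between the two conflicting addresses
    idx, tag = line_index(addr), addr // (32 * LINES)
    if cache.get(idx) != tag:    # miss: fetch from slow memory, evict old line
        misses += 1
        cache[idx] = tag
print(misses)   # 8 -- every access misses (thrashing)
```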
The major elements of DSP cache memory architecture are briefly described
below.
When a cache miss occurs, the L1P requests an entire line of data from the L2. In other words, both the requested fetch packet and the next fetch packet in memory are loaded into the cache. Because most applications execute sequential instructions, the likelihood is higher that the next fetch packet will be immediately available when it is requested by the CPU. Thus, the startup latency to fetch the next packet is eliminated by bursting an entire cache line. Fetching ahead also reduces the number of cache misses.
The demand for embedded memory is on the rise in the current generation of ULSI and system-on-chip level designs that require large amounts of SRAM, multiport RAM, DRAM, ROM, and EEPROM/flash memories. For example, in the case of high-performance microprocessors, 30-50% of the premium space and 80% of the transistors are allocated to memory alone. These controllers include several levels of cache for data and instructions, multiport SRAMs for TAGs, translation look-aside buffers (TLBs), CAMs, register files, and general-purpose SRAMs [12]. As the need for embedded memories continues to increase, so does the complexity, density, and speed of these memories. Many companies outsource the design of their embedded memories.
An alternative method of obtaining an embedded memory design is to use a memory compiler, which can provide a physical block in a relatively quick and inexpensive manner. However, there are also some drawbacks to this approach. Generally, compiled memory designs result in a larger memory block and less efficient overall system performance. In addition, the memory design may be inflexible when the system design requires additional features.
In comparison, the customized memories can accommodate emerging
system requirements such as the need to pitch-match the logic with memory
core. Instead of placing a standard memory block on the chip and then
synthesizing the logic around it to create a desired function, the designers can
move the logic into the memory block, allowing the physical layout to fit
tightly with the memory pitch dimensions. This approach can reduce the
overall chip size, while allowing for higher density and improved performance
of the chip.
Embedded systems that basically consist of logic and embedded memory can be developed either by using a logic process technology and embedding memory, or by implementing logic in a DRAM process. By using the logic process as the base technology, the chip benefits from fast transistors, but because of their high saturation current it is difficult to implement memory with a 1T cell. This means a relatively high area penalty. In comparison, a DRAM-based technology allows the creation of embedded DRAMs with very low leakage currents, but the speed of the logic transistors lags. However, the DRAM technology allows the design of embedded systems with up to 128 Mbits, and even higher-capacity DRAMs, which is not possible with logic-based technology because of unacceptably large chip areas. Therefore, the most serious problem is how to improve the gate density and transistor performance while achieving the same memory density as that of a standard DRAM [13].
A potential approach for increasing the performance of a memory macro is to use some of the techniques used to make stand-alone memories faster, such as synchronous design. Additional speed can be obtained by pipelining the address and datapath and/or using a wide bus from the array to prefetch multiple words in one clock cycle into a fast register, with individual words sent out at the same multiple of the speed. In the current generation of memory macros, both the techniques of dividing the array and using a prefetch scheme are used. In embedded memory applications, special memory configurations are also possible that can tailor the memory to the requirements of the processor and thereby enhance performance. Another performance issue is power consumption, which can often be traded against the bandwidth requirements.
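The prefetch technique reduces to a simple rate relation: if the array internally fetches N words per (slow) array cycle, the pins can stream words at N times the array rate. The numbers in the sketch below are invented for illustration.

```python
def pin_rate_mwords(array_cycle_ns, prefetch_n):
    """Output word rate (Mwords/s) when prefetch_n words are fetched per array cycle."""
    return prefetch_n * (1000.0 / array_cycle_ns)

# A 10-ns array cycle limited to 100 Mwords/s reaches 400 Mwords/s with a 4-word prefetch.
print(pin_rate_mwords(10.0, 1))   # 100.0
print(pin_rate_mwords(10.0, 4))   # 400.0
```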
System chips with optimized memory macros tend to be more expensive than a chip in pure CMOS logic because of the additional processing steps. A major issue to consider is whether to use a predefined memory macro, a block compiler constructing an array out of smaller blocks, or a cell compiler. Optimized DRAMs tend to use block compilers, because even the small blocks tend to maintain some of the high-density optimization, which is the only justification for using DRAMs. Block compilers can either be very simple or can attempt to construct space-saving features such as shared sense amplifiers. Cell compilers are commonly used for SRAMs, and various SRAM cell compilers are commercially available.
Embedded DRAM is playing a significant role in the growth of application-
specific processors such as the graphic accelerators, multimedia chips, and
system-on-chip (SOC) designs. The future projections as shown in Figure 6.6
indicate that embedded DRAM will dominate the embedded memory market
for SOC designs [14].

Figure 6.6 Future projections for growth of the embedded DRAM market compared to that for SRAM and nonvolatile memory [14].

Several factors are key drivers for embedding memory in the current generation of high-performance application-specific ICs (ASICs) and SOC designs.
When merging logic and DRAM on one die, an ASIC vendor typically needs to provide polysilicon layers to construct the DRAM capacitors and other structures, and several metal layers for fast and robust logic interconnect. A compromise has to be made between thin-gate-oxide, low-threshold logic transistors and thick-oxide, higher-threshold memory cell array transistors. Also, the noise-sensitive DRAM array has to be isolated from transients generated by the fast switching logic. In the DRAM manufacturing world, the stacked capacitor cell approach is used by more than 60% of the suppliers and has proven to be a high-volume and cost-effective solution. The trench capacitor cells, offered by a small percentage of DRAM suppliers, have scaling limitations below approximately 0.18-µm process designs.
A successful DRAM-based design, whether embedded or discrete, requires
accurate dynamic circuit optimization based on the layout-related parasitic
capacitances to meet the noise, performance, and power consumption require-
ments. As the design cycles continue to shrink and ASIC vendors migrate
their process lithographies to ensure cost-effectiveness and higher integra-
tion capability, the challenges become even greater. The re-layout of the
embedded DRAM array changes both the parasitics and signal loading
characteristics.
An example of merged DRAM logic (MDL) design is Samsung's latest MDL110 in 0.25-µm process technology, offering several fabrication options that trade off high performance (2.5 V) and low power (1.8 V) and enable varying levels of memory, analog, and logic integration. This modular ASIC process provides up to 8.2 million usable gates and allows core-based SOC designers to start with the logic process and then add DRAM, SRAM, and mixed-signal layers, or they can replace the DRAM with flash memory. Samsung uses this modular ASIC process to fabricate the 750-MHz Alpha 21164 CPU, and the process is also capable of integrating up to 128 Mb of embedded DRAM, 64 Mb of SuperRAM™ (an innovative DRAM macro that embeds a multiport cache DRAM), 32 Mb of NOR flash memory, 4 Mb of SRAM, and 4 Mb of compiled ROM. Available embedded DRAM operating modes include EDO and SDRAM. Memory configurations offer bus-width options of x4, x8, x16, x32, x64, x128, x256, x512, and x1024 and granularity steps of 1 Mb or 4 Mb. The embedded DRAM density requirements can vary, depending upon the application.

In addition, the designer can also add customized DRAM functions and a number of logic and analog cores, including the ARM RISC processor, the Alpha microprocessor, Z80 and 80C51 CPUs, and Oak and Teak DSPs. Merged processor DRAM architectures will be discussed in more detail in Section 6.4.
Embedded SRAM has been used for many years to accelerate the performance of high-end network routers and switches. It is popular because it is based upon the standard logic process and does not require additional masking steps. While embedded SRAM employs a larger cell than DRAM, new technologies are emerging to help boost embedded SRAM density. The key to embedded SRAM performance is the memory compiler design. A memory compiler works on the basic principle that a memory has a regular structure consisting of four basic building blocks: the memory array, the predecoder, the decoder, and the column select and I/O block. The memory array is constructed by repeating the same memory core cell. The other three building blocks are also constructed from a basic leaf cell [15].
A compiler creates a memory design by using instances of the different leaf
cell types to make up the desired memory width and depth. A maximum
performance is achieved when the leaf cell and memory core cell are optimized
for both process technology and memory-size range. Currently compiler
designs are optimized to meet the demands for a wide range of applications.
Segmented or block architectures are used to improve performance and power
consumption. SOC cores are designed with tightly coupled memories to
overcome the processor-to-memory interface bottleneck. The memory designs
often feature multiport, synchronous or asynchronous operation, and stringent
power control.
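A memory compiler's tiling step can be caricatured in a few lines: given a requested width and depth, it computes how many instances of each leaf-cell type to place. The leaf-cell names and capacities below are hypothetical, chosen only to illustrate the principle.

```python
import math

# Hypothetical leaf-cell capacities for one compiler "slice".
CORE_ROWS, CORE_COLS = 64, 32    # bits per memory-array leaf instance

def tile_memory(width_bits, depth_words):
    """Return instance counts for the four basic building blocks."""
    col_tiles = math.ceil(width_bits / CORE_COLS)
    row_tiles = math.ceil(depth_words / CORE_ROWS)
    return {
        "array_tiles":   col_tiles * row_tiles,
        "predecoder":    1,              # one per macro
        "row_decoders":  row_tiles,      # one leaf per row block
        "col_select_io": col_tiles,      # one leaf per column block
    }

print(tile_memory(width_bits=128, depth_words=1024))
# {'array_tiles': 64, 'predecoder': 1, 'row_decoders': 16, 'col_select_io': 4}
```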
Nowadays, the biggest challenge facing embedded SRAM developers is to satisfy growing demands to embed ever-larger memories on chip. The amount of embedded memory available to ASIC designers has rapidly grown from 1 Mbit in 0.35-µm process technology to 2.5 Mbit in a 0.25-µm process, and more recently to 6-8 Mbit in 0.18-µm technology. The latest generation of embedded memories typically features built-in scan latches and a scan path, as well as built-in self-test (BIST) logic. Some vendors employ built-in self-repair (BISR) schemes in which the device identifies a bad row in a self-diagnostic routine and uses address mapping logic to automatically map it into a good address space. However, a BISR scheme can increase the address setup time and can be a liability for high-performance designs.
A more radical extension of the embedded memory concept is to integrate
the processor functions into memory. The researchers are laying the ground-
work for devices that will essentially eliminate the processor-to-memory
bottleneck in current memory systems by merging both functions on chip to
create very high performance designs. Some of these architectures will be
discussed in more detail in Section 6.5.
Sections 6.3.1 and 6.3.2 discuss examples of some advanced SRAM and
DRAM macros developments.
Figure 6.7 1T SRAM. (a) Bit-cell schematic. (b) Bit-cell cross section. (c) Block diagram. (From reference 16, with permission of IEEE.)
DRAM or SRAM processes, where the number of metal layers is often limited
to three or less. In the embedded environment, a large number of input and
output connections can be easily supported. For each access, up to several
thousand memory cells can be accessed simultaneously. This inherent large
word width makes the memory access more efficient.
Figure 6.7c shows the 1T SRAM block diagram. The multibank organization allows simultaneous operations in each bank, thereby facilitating the hiding of refresh operations. A refresh timer is incorporated to generate periodic refresh requests to the memory banks. During any clock cycle, a read or a write access can be generated by an accessing agent outside the memory block by the activation of the ADS signal and the read-write indicator WR. The access address EA is divided into row and column addresses, which are broadcast to each memory bank, and the bank address, which is decoded by the bank address decoder to generate the external access request ERQ. The bank accessed will have its ERQ driven low. When a bank is accessed continuously for longer than the refresh period, the accessed bank loses refresh cycles, and data can be lost. To avoid this problem, a shadow cache can be implemented using an additional single-transistor memory bank.
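The refresh-hiding idea can be sketched behaviorally: a refresh request is serviced in any bank that is idle this cycle, so refresh is invisible unless a single bank is hit continuously, which is exactly the case the shadow cache covers. The model below is a deliberate simplification with invented timing, not the actual 1T SRAM control logic.

```python
class OneTBank:
    """Behavioral model of one bank: refresh proceeds only in idle cycles."""
    def __init__(self):
        self.cycles_since_refresh = 0

    def tick(self, accessed):
        if accessed:
            self.cycles_since_refresh += 1   # refresh deferred this cycle
        else:
            self.cycles_since_refresh = 0    # idle cycle used to refresh

REFRESH_LIMIT = 100                          # illustrative retention budget
banks = [OneTBank() for _ in range(64)]

# Hammer bank 0 continuously; all other banks refresh in their idle cycles.
for cycle in range(500):
    for i, bank in enumerate(banks):
        bank.tick(accessed=(i == 0))

print(banks[0].cycles_since_refresh > REFRESH_LIMIT)   # True -> needs shadow cache
print(banks[1].cycles_since_refresh)                   # 0 -> refresh fully hidden
```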
Figure 6.8 A loadless CMOS 4T SRAM. (a) Cell. (b) Read and write operations. (From reference 17, with permission of IEEE.)
To maintain one of the two storage nodes at nearly 1.8 V, this SRAM cell depends on the off-state current supplied to this node from one of the precharged bit lines through the access pMOS FET. For stable data retention, the off-state current of the access pMOS FET (IoffP) must be higher than that of the drive nMOS FET (IoffN). To achieve stable data retention at low temperatures, a word-line-voltage-level-compensation (WLC) circuit is used, which controls the standby word-line voltage. When the standby word-line voltage is lowered by 0.1 V, the off-state current of the pMOS FET increases 10 times. The WLC circuit consists of a word-line voltage level determination (WLD) circuit containing a dummy cell. The WLD circuit monitors the internal node voltage (V1) in the dummy cell and determines the word-line voltage level (V2).
This 4T SRAM macro cell uses an all-adjoining twisted bit-line (ATBL) scheme. In a conventional twisted bit-line (TBL) scheme, two bit lines are twisted within the same bit-line pair to utilize the effect of the coupling capacitance being canceled out when the voltages of adjoining bit lines swing in the same direction. This results in a 25% reduction of the total coupling capacitance. In the proposed ATBL scheme, all bit lines are twisted among each other and adjoin equally. The macro uses eight pairs of bit lines, which means that the bit-line coupling capacitance can be reduced by 7/16. The ATBL scheme does not result in an additional area overhead.
A 16-Mb 4T SRAM prototype macro was fabricated using a five-metal, 0.18-µm CMOS logic process, and its size was 10.4 x 5.5 mm². The macro had a 512K x 32-bit organization and consisted of X-decoders, Y-decoders, sixteen 1-Mb blocks, and two WLC circuits. Each 1-Mb block consists of 16 x 64k subarrays. A size comparison was made between a 6T SRAM macro, a 4T SRAM macro using a conventional nontwisted bit-line scheme (64 cells per bit line), and the proposed 4T SRAM macro using the ATBL scheme (128 cells per bit line). The 4T SRAM macro using a conventional nontwisted bit-line scheme was 73% of the size of the 6T SRAM macro. The 4T SRAM macro using the ATBL scheme, which reduces the number of sense amplifiers, was 66% of the size of the 6T SRAM macro. The test results showed this macro to be capable of 400-MHz access (2.5 ns) at a 1.8-V supply voltage. An advanced version of this macro in a 0.13-µm process demonstrated 500-MHz high-speed access (2.0 ns) at a 1.5-V supply voltage.
The refresh operations for the DRAM were handled automatically through implementation of the refresh circuit in the logic section, thus eliminating the need for user control.
The migration of dRAMASIC technology to Toshiba's 0.5-µm process required three layers of polysilicon and two layers of aluminum interconnect. Additionally, the planar capacitor technology was replaced by the trench capacitor technology that was being used for the 16-Mb standard DRAM product. An innovative process feature was the introduction of a triple-well structure, separating the logic portion from the DRAM cell array, which allows an independent bias voltage to be supplied to each well and the substrate. In the first implementation of dRAMASIC on a 0.5-µm process, although the die size was only 20% larger than for the 1.0-µm process, it included 2.5 times more raw gates, as well as 8 times the amount of embedded memory. This 8 Mb of embedded memory, configured with a 128-bit-wide bus, resulted in a bandwidth of 1.6 Gbytes/s for a memory operating at 100 MHz.
Toshiba has currently two approaches for the implementation of
dRAMASIC designs. The first is based on the DRAM process and provides the
capability for implementing high-density embedded DRAMs. This technology
(1T dRAMASIC) utilizes a one-transistor memory cell with trench capacitor
architecture. The second approach is based on the standard CMOS logic
process and provides the capability for implementation of low-density embed-
ded DRAMs. This technology (3T dRAMASIC) utilizes the three-transistor
memory cell approach.
The DRAM macro uses a wide-databus circuit configuration, in which the data are stored in a stacked capacitor cell occupying 2 µm². The cell plate is conventionally biased to Vdd/2. A CMOS double cross-coupled bit sense amplifier is simultaneously clocked at the p-channel and n-channel sources. With the bit lines and databuses precharged to Vdd/2 as well, the operation is fully balanced and symmetrical with respect to Vdd/2 as the reference level.

During read operations, the databus signal is sensed in another conventional CMOS latch, similar to the bit-line sense amplifier except that it has isolation devices turning off as the amplifier senses, so as not to fully restore the databus levels. The access for a write operation follows a similar approach, in which the write data are pulsed onto the databuses and a safe differential level is coupled from them into the precharged bit sense amplifiers. The bit sense amplifier is then clocked to write full 1 or 0 levels into the cell. Thus, the bit sense amplifiers are used for writing rather than the conventional read-modify-write approach, which is often unnecessary in an embedded memory.
A representative macrocell configuration produced by the compiler is shown
in Figure 6.9. It is organized as 4 banks x 336 rows x 16 columns x 132 bits. The
dual read/write dataports employing separate wide databuses enable indepen-
dent access to open pages. Arbitration logic incorporated within the read/write
control circuitry allows simultaneous read and write to the same location. The
macrocell uses a simple clocked interface in which all inputs are sampled on
the rising edge of the master clock. There are separate row enable signals (RE)
that control each of the four banks. Each DRAM array is organized as 168
word lines by 4224 bit lines (or 2112 columns), excluding redundancy. When
a bank is activated, 2 word lines are brought to the Vpp level, and 4 of the 5 bit-line sense amplifier arrays are enabled. A 6-bit Y-address (including 2 bits for the bank address) and a column enable input CE are provided for each data port.
Figure 6.9 [Block diagram of the representative macrocell: 4 banks x 336 rows x 16 columns x 132 bits, with address/control logic, word-line drivers, DRAM arrays, Vbb/Vblp/Vcp/Vcbp and Vpp supplies, read/write amplifiers, and dual 132-bit data ports DQ0 and DQ1.]

This macro can transfer data at a continuous rate of 100 Mb/s/pin on both ports by utilizing its four banks and using column and row cycle times of 10 ns and 60 ns, respectively. This is an effective rate of 3.3 GB/s, with the databus
CV power dissipation in the core of roughly 26 mW (with limited voltage swing
on the databus levels). The compiler generates a library of leaf cells for a 0.35-µm DRAM process. A tiler is used to automatically generate a wide range of configurations. The tiler can be adapted to other processes with different leaf cell dimensions. A set of netlisting and simulation scripts can automatically characterize the critical paths in the resulting macrocell.
The representative 4-bank, 2.8-Mb macrocell shown in Figure 6.9 measures 3.052 mm x 4.164 mm, for a cell efficiency of 45%. The cell efficiency improves as the size of the macrocell is increased. The maximum 16-Mb configuration with only a single bank measures 10.522 mm x 5.148 mm, for a cell efficiency of 62%, and provides dual 256-bit-wide databuses.
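The quoted 3.3-GB/s figure follows directly from the port width and the per-pin rate given above:

```python
pins_per_port, ports = 132, 2
rate_per_pin_mbps = 100                    # Mb/s/pin, from the text

total_gbps = pins_per_port * ports * rate_per_pin_mbps / 1000
print(total_gbps, total_gbps / 8)          # 26.4 Gb/s -> 3.3 GB/s
```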
Figure 6.10 Memory cell and bit-line operation. (a) Conventional DRAM. (b) Dual-gate interleaved D²RAM. (c) Memory cell layout for D²RAM. (From reference 20, with permission of IEEE.)
In a conventional DRAM, active and precharge operations are executed in turns on a bit line. In the case of the D²RAM, one memory cell consists of one capacitor and two transistors connected to dual bit lines (BLa/BLb) and dual word lines (WLa/WLb).
In the case of the D²RAM, both dual-port bit lines (a/b) of the memory cell can be accessed alternately, according to optional commands. When the a-port bit line is activated, the b-port bit line is precharged; conversely, the a-port bit line is precharged when the b-port line is activated. This interleaved bit-line operation yields half the random cycle time of a DRAM. In a conventional DRAM, a folded bit-line sense amplifier architecture is generally used. However, the D²RAM adopts an open bit-line sense amplifier architecture [21], which halves the cell size compared to the folded bit-line sense amplifier architecture. The cell size can be reduced even further, to about 1.8 times that of a conventional DRAM, by reshaping the storage capacitor. Figure 6.10c shows the dual-port memory cell layout, which is identical to that of a conventional DRAM. The word lines run vertically and the bit lines run horizontally.
The D²RAM uses a two-stage pipelined architecture for random accessing at the maximum frequency. This works with a clock-synchronous interface. The read operation has a two-cycle latency from the address input to data output. For a write operation, the write command and data are presented simultaneously, while the word line, bit line, and internal databus are activated similarly to the read operation.
In a conventional DRAM, the data are rewritten to the bit line after sensing. Adjacent unselected bit-line data must be latched to avoid data destruction due to coupling noise from the bit line being rewritten. Therefore, more time is required for bit-line sensing, rewriting, and restoring than for a read operation. The D²RAM uses a write-before-sensing (WBS) scheme, in which the adjacent port bit line shields the unselected bit line from noise caused by the written bit line. Thus, in WBS the data of the unselected bit line are maintained, and the write time is shorter than the read cycle time. The random cycle time is below 8 ns under worst-case conditions and below 6.5 ns under typical conditions, using the two-stage pipelined circuit operation and WBS. This random cycle time is about six times faster than that of a conventional DRAM.
The D²RAM prototype macro chip was fabricated in a 0.25-µm triple-well, 3-poly and 4-layer Al metallization embedded DRAM process. The test chip contained three 2-Mb D²RAM macros, along with logic and I/O cells. The 2-Mb macro size was 9.41 mm², and the memory cell efficiency (die efficiency) of this test chip was about 44%.
Figure 6.11 A 1-GHz embedded DRAM macro. (a) Cross section of eDRAM cell. (b) Block diagram of macro organization. (c) Expanded view of the subarray. (From reference 22, with permission of IEEE.)
The even-numbered bit -line pairs are sensed by the sense amplifiers at the
top, and odd numbered pairs are sensed at the bottom. The bit lines are
distributed vertically with first-level metal. Each bit-line pair is twisted three
times within the 256-kb array at every 64th cell to balance out the coupling
noise due to the neighboring lines. The word lines are distributed horizontally
with both polysilicon and second-level metal.
There are 1296 read data-line (second-stage bit-line) pairs and 1296 write
data-line pairs running vertically in third-level metal. The read data lines
connect between the first-stage sense amplifiers in subarrays and the second-
stage sense amplifiers in the central unit. The write data lines connect between
the input buffer and the first-stage sense amplifier. This results in matching two
read data lines and a write data line (total of three) to every bit-line pair. The
macro requires four metal layers. The last metal layer is mainly used for clock
and power distribution.
A 4-bit state machine implemented in each subarray block controls the
subarray. The state machine is synchronized with the global clock of the
macro. It generates the necessary timing signals for the activation of the word
lines, setting of the bit-line sense amplifiers, resetting the word lines, and
precharging the bit lines to half-VDD.
In the worst-case data pattern simulation, the macro including the clock drivers consumes approximately 10 W. The macro is designed to operate with a 1-GHz clock, at 85°C, nominal process parameters, and a 10% degraded VDD. The design is fully pipelined and synchronous, with 16 independent subarrays. With 1-kb-wide I/O and a 1-GHz clock, the maximum data rate becomes 1 Tb/s. The address access time is 3.7 ns, four cycles with a 1-GHz clock.
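The quoted 1-Tb/s figure follows directly from the interface width and the clock rate. A minimal sketch in C of that arithmetic, using only the parameters stated above:

    #include <stdio.h>

    int main(void)
    {
        /* Interface parameters as stated for the 1-GHz eDRAM macro. */
        double io_width_bits = 1024.0;  /* 1-kb-wide I/O            */
        double clock_hz      = 1.0e9;   /* 1-GHz synchronous clock  */

        /* One transfer per clock on a fully pipelined interface:
           1024 bits x 1 GHz = 1.024 Tb/s, i.e., roughly 1 Tb/s.    */
        double peak_bits_per_s = io_width_bits * clock_hz;
        printf("peak data rate = %.3f Tb/s\n", peak_bits_per_s / 1e12);
        return 0;
    }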
The first generation of logic-based eDRAMs has suffered from severe performance constraints. A major issue in scaling to the next generation is the ease of migrating libraries and cores. DRAMs do not follow migratable design rules, unlike the logic technologies, for which a vast majority of libraries and cores can be adapted to new design rules by simple automated migration software and retiming. This is a major advantage of synthesized logic. In
comparison, the design rules in DRAM technology are based on the shrink
path of the cell and may not correspond to simple scaling in the logic
technology. For these reasons, the DRAM-based technology is not compatible
with the library and core methodology used by the designers of synthesized
logic.
The embedding of DRAMs in a logic technology typically requires some additions to the process flow. Figure 6.12 shows the additional steps in dark boxes [2], which are as follows: (1) a deep trench for the storage node, (2) an optimized shallow trench isolation, used in place of the logic shallow trench, (3) predoping of the polysilicon word lines, and (4) predoping of the block. The boxes in gray are the steps required for the pass transistor and word-line drivers of the embedded DRAM, but also used in the formation of 2.5-/3.3-V I/Os. The additional steps are expected to increase the process complexity by about 25%.
An example of merged processor memory architecture is Mitsubishi's embedded DRAM technology (eRAM™), which uses a three-layer-metal HyperDRAM process. The result is a standard eRAM product called M32R/D, which integrates a 32-bit RISC processor, DSP functions, 2 MB of DRAM, and 4 KB of cache SRAM, all on the same die. Figure 6.13a shows a block diagram of the typical MCU chip for consumer applications, which has a configuration that integrates a CPU and certain peripheral circuits on the same chip, while memory and additional application-specific peripheral circuits are implemented as separate chips.
Figure 6.12 Additional steps required for embedding DRAMs in a logic technology
process. (From reference 2, with permission of IEEE.)
[Figure 6.13 panels: (a) typical consumer MCU configuration with CPU and peripherals; (b) M32R/D configuration; (c) M32R/D internal organization, showing the CPU instruction and data paths, 1-Mbyte/2-Mbyte DRAM, a 32-bit, 66.6-MHz internal bus, and a 16-bit, 16.67-MHz external bus.]
The design requirements were to preserve the logic performance while implementing higher-density memory with small cell size and good charge-retention characteristics. To satisfy these requirements, M32R/D uses a merged logic and DRAM technology called the HyperDRAM process. Two key
techniques used in this process are (1) a suppression of interference between
the logic and DRAM sections by using a triple-well structure for electrical
isolation and (2) reduction of wiring pitch by planarizing the dielectric layers
between the interconnect layers with CMP.
The M32R/D consists of a small RISC CPU connected to the DRAM via
a 128-bit-wide bus, as shown in Figure 6.13c. The instruction queue, 128-/32-
bit selector, and the bus interface unit (BIU) buffer are connected with the
128-bit bus between the DRAM array and the 4-Kbyte cache. The instruction
queue consists of two 128-bit entries and holds instructions for the CPU. When
the CPU reads and writes operand data, the 128-/32-bit selector adjusts the
data transfer between the 128-bit bus and the CPU. The BIU is a 128-bit data buffer that supports burst transfers of data on 128-bit boundaries.
The cache is direct-mapped with a 128-bit block size and follows a write-back policy. It has two caching modes: (1) on-chip DRAM caching and (2) instruction caching. In the on-chip DRAM caching mode, the cache functions as a unified cache for caching instructions and data from the on-chip DRAM. This mode is effective when most or all of an entire program and its data fit in the internal DRAM, and the data are frequently read and written by the CPU. In the instruction-caching mode, the entire cache works as an instruction cache that caches instructions both from the on-chip DRAM and from external memory chips. This mode is for applications that use M32R/D and external ROM chips as the system's data-processing components.
The 128-bit internal bus operates at 66.6 MHz and transfers 128 bits of data
between the CPU and memory in one cycle. A cache hit requires one access
cycle (i.e., there is no wait cycle). An operand access transfers 32 bits of data
between the CPU and the cache in one cycle through the 128-/32-bit selector.
Instruction fetch operations fetch 128 bits of instruction code into the instruc-
tion queue in one cycle, so that the CPU can get four to eight instructions in
one cycle. On a cache read miss, the data transfer from the DRAM to the CPU
and the update of the cache data take four cycles on a DRAM page hit, or
eight cycles on a DRAM page miss. On a cache write miss, the data transfer
between the cache and the CPU takes only two cycles (i.e., one wait cycle).
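These per-access cycle counts fix the average memory access time once hit rates are known. A minimal sketch in C; the cycle counts are the ones stated above, while the cache and DRAM page hit rates are purely illustrative assumptions:

    #include <stdio.h>

    int main(void)
    {
        double cache_hit = 0.90;  /* assumed cache hit rate     */
        double page_hit  = 0.60;  /* assumed DRAM page-hit rate */

        /* 1 cycle on a cache hit; 4 cycles on a read miss with a DRAM
           page hit; 8 cycles on a read miss with a DRAM page miss.   */
        double avg = cache_hit * 1.0
                   + (1.0 - cache_hit)
                     * (page_hit * 4.0 + (1.0 - page_hit) * 8.0);

        printf("average read access = %.2f cycles at 66.6 MHz\n", avg);
        return 0;
    }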
The major advantage of integrating the CPU and DRAM is the feasibility
of using a wide and fast internal bus for higher performance. If the on-chip DRAM is sufficiently large for an application, no additional memory is needed outside the chip. The integration of the CPU and memory eliminates power
outside the chip. The integration of the CPU and memory eliminates power
consumption of the I/O buffers driven during a DRAM access. The wide
internal bus also reduces the number of DRAM reads and writes.
A multimedia system has to deal with an enormous quantity of data, so it
needs a large bandwidth memory bus and a large bandwidth system bus.
Increasing the bit width of the memory bus is one way to increase the memory
bandwidth, but this increases the pin count. An alternative is to increase the
520 EMBEDDED MEMORIES DESIGNS AND APPLICATIONS
data frequency by using the Rambus DRAM (RDRAM) that has only 11 signal
pins for the Rambus channel and boosts the signal rate to 600 Mbytes/s (see
Chapter 4, Section 4.3). The RDRAM approach makes it possible to get a large
bandwidth and small memory granularity, and a multimedia system can be
made with just one RDRAM chip. In comparison, to obtain a memory
bandwidth greater than 600 Mbytes/s by using 16-bit, 100-MHz, 64-Mb
SDRAM devices, the bus width must be 64 bits and the minimum memory size
must be 256 Mb (or 32 Mbytes) [26]. A memory of this size is too large to be
used in embedded systems. Therefore, the RDRAM is an especially suitable
device for embedded application, and integrating the system bus on the chip
can provide higher data rates.
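The 64-bit bus width and 256-Mb minimum memory size in this comparison follow from simple arithmetic on the per-device figures. A short C sketch of the calculation, assuming each 16-bit device contributes its full interface width to the bus:

    #include <stdio.h>

    int main(void)
    {
        double bits_per_device = 16.0;  /* 16-bit SDRAM interface */
        double clock_hz = 100.0e6;      /* 100-MHz bus            */
        double device_mbits = 64.0;     /* 64-Mb SDRAM device     */
        double target_mb_s = 600.0;     /* required bandwidth     */

        double per_device_mb_s = bits_per_device * clock_hz / 8.0 / 1e6;

        int devices = 1;                /* devices in parallel    */
        while (devices * per_device_mb_s <= target_mb_s)
            devices++;                  /* -> 4 devices           */

        printf("%d devices: %.0f-bit bus, %.0f Mb (%.0f Mbytes) minimum\n",
               devices, devices * bits_per_device,
               devices * device_mbits, devices * device_mbits / 8.0);
        return 0;
    }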
A multimedia-oriented embedded RISC processor has been developed that combines a high-data-rate system bus with a concurrent RDRAM (C-RDRAM) controller in a superscalar architecture [27]. Figure 6.14a shows the block diagram of the processor. A CPU core and several peripheral circuits are integrated on one chip to handle the multimedia data efficiently. The core and the peripherals are connected with one of the two buses incorporated into the chip. One bus is a 200-MHz, 64-bit system bus, which connects the peripherals requiring a high data rate. The other is a 50-MHz, 16-bit bus for serial interface connections. The core and the system bus operate using a 200-MHz clock generated by an internal phase-locked loop circuit from an external 50-MHz clock signal. The processor is fabricated using a 0.25-μm, four-layer-metal CMOS process technology, has 3.9 million transistors on a 10.5 × 10.5-mm² die, and consumes less than 2 W.
The CPU core has a six-stage pipeline structure for high-clock-frequency
operation. The whole pipeline is divided into three parts: an instruction
dispatch pipeline (I-pipe), an integer execution pipeline (V-pipe), and a multi-
media single-instruction multiple-data (SIMD) coprocessor pipeline (M-pipe).
These pipelines work in parallel to provide execution speeds up to 2000
MOPS, a level of performance sufficient for MPEG-2 decoding and MPEG-1 encoding.
The V-pipe includes a 32-bit multiply-adder and a 64-bit shifter. The multiply-adder executes simple 32-bit multiplication or 32-bit addition following the multiplication. The 64-bit shifter performs 32-bit logical and arithmetic
left/right operations as well as 64-bit shift operations. The M-pipe performs
SIMD parallel operations on eight packed bytes, four packed half-words, or
two packed words. The unit has a 32-word, 64-bit register file with four read
ports and two write ports and also has six function units: a multiply unit, an
add unit, a shift unit, a logic function unit, and two data-type converter units.
The execution units are fully pipelined and have one-clock throughput and
fixed four-clock latency. Also, there are four 16-bit multiply-adders in the
M-pipe.
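The packed data types handled by the M-pipe can be illustrated in software. The C sketch below emulates one lane-wise SIMD operation, the addition of eight packed bytes held in a 64-bit word, using a masking trick that keeps carries from crossing lane boundaries. It illustrates the packed-data concept only and is not the processor's actual instruction set:

    #include <stdio.h>
    #include <stdint.h>

    /* Lane-wise addition of eight packed bytes in one 64-bit word.
       Bit 7 of each byte is handled separately so that carries cannot
       propagate from one byte lane into the next. */
    static uint64_t add_packed_bytes(uint64_t a, uint64_t b)
    {
        const uint64_t high = 0x8080808080808080ULL;
        return ((a & ~high) + (b & ~high)) ^ ((a ^ b) & high);
    }

    int main(void)
    {
        uint64_t a = 0x0102030405060708ULL;
        uint64_t b = 0x10FF10FF10FF10FFULL;
        printf("%016llx\n", (unsigned long long)add_packed_bytes(a, b));
        return 0;
    }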
The CPU core contains two 16-Kbit internal caches used for the instruc-
tions and data. The peripherals include an RDRAM control unit (RCU), a
video control unit (VCU), an audio control unit (ACU), and a bus control unit
Figure 6.14 Block diagrams. (a) Embedded RISC processor with a Rambus DRAM controller. (b) Rambus DRAM control unit. (From reference 26, with permission of IEEE.)
(BCU) that are connected to the 200-MHz, 64-bit internal system bus. Also,
large data buffers with sizes ranging from 384 to 512 bytes are included in each
peripheral. The RCU controls a Rambus channel, which handles 600-Mbyte/s
data transmission. Eight C-RDRAM devices can be connected directly to the
RCU. The RCU handles memory accesses from the CPU core and various
interfaces. The VCU and ACU have double 256-byte buffers. Each of them fetches 256 bytes of data from the C-RDRAM, in which a frame buffer and an audio data buffer are assigned, and outputs the data synchronously with the interface's individual clock. The VCU has 16-bit parallel data ports for the video data. The ACU has a serial port for audio data. The BCU manages a
32-bit parallel address/data multiplexed external bus and also acts as a bridge
circuit to a 16-bit, 50-MHz bus, which is connected to various serial-port
controllers. The 200-MHz internal system bus also connects to the interrupt
control unit and the direct memory access control unit. Both these units direct
the CPU core and the BCU, according to external signals.
The RCU provides a bridge between two high-speed buses: the 200-MHz, 64-bit internal system bus and the 600-Mbytes/s Rambus channel. Figure 6.14b shows the block diagram of the RCU. The Rambus channel exchanges data
using signals synchronized with each edge of the 300-MHz clock signal. The
Rambus ASIC cell (RAC) manages the physical layer of the Rambus channel
protocol, multiplexes write data by 8:1, and demultiplexes read data by 1:8. The
RAC also generates a 75-MHz clock from the 300-MHz clock and communi-
cates with circuits using this 75-MHz clock. The internal system bus side of the
RCU operates using a 200-MHz clock. The Rambus memory controller
(RMC) follows the RDRAM commands in the queue to compose and
decompose packets for the Rambus channel and controls the timing with which
the packets are sent to the Rambus channel. The RCU issues refresh com-
mands and current-level-control commands periodically.
To reduce memory access latency, the RCU interleaves transactions and prefetches instructions. The performance estimate of this multimedia processor indicates that the interleaving results in data rates as high as 533 Mbytes/s and reduces
the 256-byte read latency by 9%. The prefetching reduces the instruction cache
refill latency by 70%.
The logic processes and commodity DRAM fabrication processes have con-
flicting needs; therefore, embedding DRAM using either of those two ap-
proaches involves a compromise between the logic performance and DRAM
cell size. The logic technology requires transistors with low threshold voltages that can provide fast switching times, but low threshold voltages produce high leakage currents. DRAM technology requires transistors with low leakage currents to minimize the size of the capacitor in the memory cell; however, low leakage means slower transistors. Taking into consideration the various tradeoffs involved, the embedded DRAM suppliers have taken one of two
approaches. The companies with established logic processes and ASIC heritage
start with their logic process and add masking steps to produce DRAM
capacitors, using merged processor DRAM logic architectures that were
discussed in Section 6.4.
In comparison, the companies with a DRAM manufacturing heritage such
as Mitsubishi, Samsung, Toshiba, and Infineon Technologies, start with a
DRAM process and add masking steps to include logic. Some examples of this
approach are modular embedded DRAM core developed by Infineon Tech-
nologies and a multimedia accelerator (MSM7680) by Oki Semiconductor.
There are some experimental DRAM processes with embedded logic architec-
ture being developed, such as intelligent RAM (IRAM, by the University of
California at Berkeley) and computational RAM (CRAM). These are briefly
described in the following sections.
Figure 6.15 A modular embedded DRAM core with examples of three different
memory configurations, each of them with a memory size of 4 Mb. (From reference 13,
with permission of IEEE.)
The resolution enhancement of color and space requires a data transfer rate directly proportional to the resolution and refresh rate. The peak transfer rate of a 1280 × 1024-pixel, true color (24-bit color depth) image at a 75-Hz refresh rate is around 400 MB/s, which is an order of magnitude increase over the conventional data transfer rate of a 640 × 480-pixel, 8-bit color display with a 60-Hz refresh rate. Image overlay (or image scaling) doubles
the data transfer rate. Image blending overlays two images by mixing them
together using certain weighting coefficients. Color keying is another overlay
operation in which one of the two images is selected based on color. The data
from two images is read concurrently in both of these operations. In addition
to the read operation of such data, the write operation to update the image in the frame buffer requires a data transfer rate of tens of Mbytes/s. The graphic controllers use 64-bit and higher databus widths to support external DRAM for the frame buffer.
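The bandwidth figures above can be checked with a short calculation. A minimal C sketch, assuming true color is stored as 4 bytes per pixel (a 24-bit color in a 32-bit word, an assumption consistent with the roughly 400-MB/s figure):

    #include <stdio.h>

    int main(void)
    {
        /* Peak refresh bandwidth for the two display modes compared. */
        double hi = 1280.0 * 1024.0 * 4.0 * 75.0; /* true color, 75 Hz  */
        double lo = 640.0 * 480.0 * 1.0 * 60.0;   /* 8-bit color, 60 Hz */

        printf("1280 x 1024 true color @ 75 Hz: %.0f MB/s\n", hi / 1e6);
        printf("640 x 480 8-bit color @ 60 Hz:  %.0f MB/s\n", lo / 1e6);
        return 0;
    }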
An example of the embedded DRAM implementation in graphic display
system is Oki Electric Industry Company's multimedia accelerator chip
(MSM7680) that integrates the frame buffer with graphic controller functions
such as a 2-D drawing engine, MPEG-1 decoder, digital/analog converter for
RGB analog output, and a clock generator phase-locked loop. The MSM7680
uses a four-layer polysilicon, three-metal layer process to embed DRAM in a
complex logic circuit, and has a wide bus with high peak bandwidth, which
eliminates the data transfer bottleneck between the frame buffer and data
processing units.
The integration of frame buffer on a chip using wide databus between the
frame buffer and the FIFO buffers provides a high data transfer rate. The width
between these buffers is restricted by the I/O pin count when the frame buffer
is implemented with separate memory. The internal databus of MSM7680 is
256 bits wide and has a peak data transfer rate of over 2 Gbytes/s when the
column access cycle time is less than 16 ns. A lower power dissipation is
another advantage of DRAM integration, because embedded DRAMs have a
smaller internal databus capacitance than does a bus connected to an external
DRAM.
Figure 6.16 shows a top-level view of the MSM7680 LSI multimedia accelerator architecture that integrates the MPEG-1 video/audio decoder, 2-D GUI (graphics user interface) engine, and RAMDAC (135-MHz, true color digital/analog converter) [29]. The host bus interface supports the PCI protocol at a 33-MHz default clock frequency. This interface can operate at up to 50 or 66 MHz with the insertion of wait states and includes the standard
PCI configuration register access. The bus interface performs as either a slave
or a bus master with a multichannel DMA, which individually manages 2-D-
rendering command and data streams, and MPEG video and audio data
streams. It also supports abort, parity generation, and checking functions.
Because all drawing coordinates are rectangular, the hardware must trans-
late rectangular coordinates to linear addresses, as well as perform the clipping
and ternary (256) raster operations (ROPs).
Figure 6.16 Block diagram of top-level architecture for Oki MSM7680 multimedia accelerator with embedded DRAM. (From reference 29, with permission of IEEE.)
The on-chip graphic engine supports transparent/opaque text and bit-block transfers for color and monochrome sources and patterns. The bitmaps can be defined in 4-, 8-, 16-, and 24-bit pixel formats. The 2-D acceleration is performed in 8-, 16-, and 24-bpp
formats. The MSM7680 includes an MPEG-1 video and audio decoder. The
host software manages the overall display buffers and files, splits data streams,
and recovers errors. The host manages synchronization between the video and
audio by using special timing and synchronization hardware facilities. The
MSM7680 also includes a display processor and a graphics video mixer, and it is compatible with the standard VGA format and supports the VGA and SVGA (VESA-enhanced VGA) modes.
The MSM7680 has 1.25 MB of embedded DRAM configured as synchronous DRAM, with a 256-bit internal databus width and high-speed access. Building the DRAM into the controller makes it possible to optimize the row and column organization according to the required functions. The embedded DRAM consists of 21 small blocks, each organized as 2048 (column) × 256 (row) cells. To reduce the number of transistors that are charged or discharged in one read or write operation, the memory is divided into these 512-Kbit blocks. This makes it possible to control the access speed and reduce the power consumption during a read or a write.
The read and write control logic uses RAS and CAS signals just like a standard DRAM. This control logic supports a RAS access mode (read and write operations of one word) and a CAS access mode (read and write operations of consecutive data). In the CAS access mode, up to eight words (256 bits/word) can be accessed consecutively. A 256-bit-wide databus is used between the internal DRAM and the memory interface (MIU/FIFO), instead of the standard 128-bit DRAM width. This MSM7680 memory configuration allows general data transfer rates of 800 MB/s in RAS access mode. When reading and/or writing eight words consecutively at 80 MHz, the MSM7680 transfers data at 2 Gbytes/s in CAS access mode.
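The 2-Gbytes/s CAS-mode figure is consistent with one 256-bit word delivered per column access cycle at the 16-ns bound quoted earlier. A minimal C sketch of that arithmetic:

    #include <stdio.h>

    int main(void)
    {
        double word_bytes = 256.0 / 8.0;  /* one 32-byte internal word */
        double cycle_s = 16.0e-9;         /* column cycle under 16 ns  */

        /* 32 bytes every 16 ns = 2 Gbytes/s; shorter cycles exceed it. */
        printf("peak transfer rate = %.1f GB/s\n",
               word_bytes / cycle_s / 1e9);
        return 0;
    }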
In one section of the embedded DRAM, large buffers are implemented for
command and image data to speed up 2-D drawing. The host CPU communi-
cates all commands, parameters, and operand data for the rendering engine
through these two buffers (FIFOs). The MIU/FIFO supports both internal
and external memory operations. Internal memory is accessible with an
extremely high bandwidth due to a wide, low-latency internal interface. The
external memory can be accessed, along with an optional 1-Mbyte attachment
of memory, with high bandwidth via a burst EDO DRAM interface. A set of
special registers is used that connects the memory core to the peripheral logic
circuits in the interface circuitry to maximize use of the high bandwidth gained
by using the embedded DRAM. In the MSM7680, each function module has a
256-bit register that takes the 256-bit-wide data read from the embedded
DRAM. These registers then forward the data gathered to each function
module through a different-width bus.
A benchmark evaluation (Winbench97, 233-MHz Pentium II, SVGA display, 16-bpp color) of this multimedia accelerator's 2-D drawing performance recorded 104 Mpixels/s.
These are some of the potential disadvantages and challenges of the IRAM
approach:
• IRAM will be fabricated in a memory process that has been optimized for smaller cell size and low charge leakage current rather than for fast transistor speed. This same DRAM fabrication process offers fewer metal
layers than a logic process to lower costs, because routing speed is less of
an issue in a memory.
• DRAMs are designed to work in plastic packages and dissipate less than
2 W, while desktop microprocessors dissipate 20 to 50 W using ceramic
packages.
• Some applications may not fit within the on-chip memory of an IRAM,
and hence IRAMs must access either conventional DRAMs or other
IRAMs over a much slower path than the on-chip accesses.
• The biggest challenge to the IRAM approach is matching the cost of a DRAM memory. DRAMs include redundant memory to improve yield and therefore lower the cost. Traditionally, microprocessors have no redundant logic to improve the yield, and therefore the on-chip logic may effectively determine the yield of the IRAM. In addition, testing time affects the chip costs. With both logic and DRAM on the same die, an IRAM may need to be tested on both logic and memory testers.
Figure 6.17 Block diagram of a vector IRAM in a 0.18-μm DRAM process implementation [32].
Figure 6.18 (a) Illustration showing DRAM replacement with computational RAM
and support logic. (b) CRAM processing element. (From reference 33, with permission
of IEEE.)
This processing element can be implemented with fewer than 100 transistors,
using a dynamic logic multiplexer. The processing element design fits in the
pitch of eight bit lines (four folded bit-line pairs or columns) across several
generations of DRAMs. To make the effective use of silicon area, structures in
this processing element often serve multiple purposes. For example, X and Y
registers are used to store the results of local computations (such as sum and
carry) as well as to act as the destination for left and right shift operations
between the adjacent processing elements. During communication operations,
the ALU is used for routing signals. The processing elements and support circuitry add 18% to the area of an existing DRAM design. A single processing element occupies an area equivalent to approximately 360 bits of memory (including sense amplifier and decoder overhead).
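The multiplexer-based ALU of Figure 6.18b can be modeled as a truth-table lookup: the broadcast global instruction supplies an 8-bit truth table, and each processing element selects one bit of it using its X register, its Y register, and the memory bit sensed on its bit line. A minimal software model in C under that interpretation; the register names follow the figure, but the destination choice and array size are illustrative assumptions:

    #include <stdio.h>
    #include <stdint.h>

    #define PES 8  /* a tiny array; a real CRAM has one PE per 8 bit lines */

    /* Per-PE state: X and Y registers, a write-enable bit, and the
       memory bit currently sensed on the PE's bit lines. */
    struct pe { int x, y, we, mem; };

    /* One SIMD step: every PE applies the same broadcast 8-entry truth
       table to its (x, y, mem) bits through an 8:1 multiplexer. */
    static void step(struct pe p[PES], uint8_t truth_table)
    {
        for (int i = 0; i < PES; i++) {
            int sel = (p[i].x << 2) | (p[i].y << 1) | p[i].mem;
            int out = (truth_table >> sel) & 1;
            if (p[i].we)
                p[i].x = out;  /* illustrative destination choice */
        }
    }

    int main(void)
    {
        struct pe p[PES];
        for (int i = 0; i < PES; i++) {
            p[i].we = 1;
            p[i].x = i & 1; p[i].y = (i >> 1) & 1; p[i].mem = (i >> 2) & 1;
        }
        step(p, 0x96);  /* 0x96 encodes x XOR y XOR mem (a sum bit) */
        for (int i = 0; i < PES; i++)
            printf("%d", p[i].x);
        printf("\n");
        return 0;
    }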
Because DRAM technology is different from the logic-based processor
technology, the implementation of computational RAM does present some
problems, which are generic to the implementation of a merged memory-logic
process and were discussed earlier. The IC manufacturers offering merged
logic-DRAM processes address these problems through the use of (1) a
separate implant mask for the memory cell array, (2) a separately biased well
for the memory cell array, and (3) two thicknesses of gate oxides. Faster logic
in DRAM is available at the expense of these extra process steps. However,
because it is the DRAM's cycle time that largely determines computational
RAM's performance, the computational RAM would see only a small benefit
from a merged logic-DRAM process. The computational RAM, designed as it is on a commodity DRAM process, can be manufactured at a lower cost than with the merged logic-DRAM process.
The computational RAM processing elements share a common instruc-
tion bus and thus operate in a SIMD mode. Since the mid-1990s, essentially
all new massively parallel SIMD designs have used the embedded memory
approach.
The most popular examples of embedded flash memory devices are programmable logic devices (PLDs), field programmable gate arrays (FPGAs), DSPs, and microcontrollers. Embedded system designers prefer to use
flash-based processors, which can be quickly programmed, before transferring
their code to a more cost-effective ROM-based chip for high-volume produc-
tion. During the development and early production, the flexibility of on-chip
flash memory speeds software development and allows for changes up to the last minute, even if the software will eventually be programmed into an
on-chip ROM. Using embedded flash throughout the life of the product offers
some additional advantages. For example, the designers of embedded systems
do not have to physically remove devices to provide software updates, and
embedded flash reduces parts inventory. The embedded flash can also reduce
the inventory of finished products, because the same system can be programmed with different software, allowing a single hardware inventory to serve different models of a product with different capabilities [35]. This section will briefly discuss the use of embedded flash and EEPROM technologies in microcontrollers (MCUs). Section 6.7 will review memory card designs in smartmedia and multimedia applications that combine flash chips and/or microcontrollers in a wide variety of consumer products such as cellular phones, mobile laptop and palmtop computers, digital cameras, and so on.
The need for embedded flash and EEPROMs on microcontrollers comes from several directions. Software code updates are becoming more frequent because of shorter design cycles and competitive time-to-market pressure. Also, electrical reprogrammability can be used as a means to cost-effectively extend the life of a product. Microcontroller designers are taking advantage of both the flash and EEPROM architectures and incorporating both memory types simultaneously onto the same microcontroller chip. Typically, the flash memory is used for the program storage area because its cells offer the best packing density and the program code usually has to be written in large blocks. In comparison, the EEPROM arrays are often used in fairly small blocks, from less than 100 bytes to about 1 Kbyte, just enough to hold the desired parameters.
The range of microcontrollers that incorporate either EEPROM or flash memories runs from low-cost 4-bit devices that pack a few bytes worth of electrically erasable storage to 8-, 16-, and 32-bit CISC and RISC processors that can pack 128 Kbytes or more of flash memory. The addition of a flash
memory to a processor is not limited to MCUs but extends to DSPs as well. If the MCU selected does not meet the desired memory configuration or feature-set
combination, the designers have the option of creating their own custom
microcontroller. Many of the MCU suppliers and ASIC vendors such as
Hitachi, Motorola, Philips Semiconductors, SGS-Thomson Microelectronics,
Texas Instruments, Toshiba, and others have both microcontroller cores and
blocks of flash or EEPROM memory available in their cell libraries from
which a custom controller can be assembled [36].
The well-established Intel 8051 (and all the alternate sources) and the
Motorola 68HC05 (including 68HC08 and 68HC11) are two of the most
popular 8-bit microcontroller families that include on-chip flash or EEPROM,
or both. Both Atmel and Philips have developed flash-based versions of the
MCU. Additional 8-bit processor families such as Hitachi's H8 series, SGS-
Thomson's ST9 (8/16 bit) series, and TI's TMS370 families offer devices with
on-chip EEPROM and/or flash memory, as well. In addition to EEPROM
storage on board, some of the devices also have on-chip ROM, RAM, and a
mix of features ranging from asynchronous or synchronous serial ports to
timers, analog-to-digital converters (ADCs), and general-purpose I/O lines.
Many of these device families are targeted at smart card applications.
The higher-performance microcontrollers with 16- and 32-bit-wide datapaths are also available with EEPROM and flash memory blocks from various vendors.
Various baking and high-temperature storage life (HTSL) tests showed data retention capabilities equivalent to the floating-gate device performance. Cycling tests demonstrated the technology's capability to support 100,000 cycles and beyond.
Linear cards can also be used to download data to main memory (DRAM), the transfer being controlled by the CPU in the host system. The PCMCIA cards, available in Type I (3.3 mm thick), Type II (5 mm thick), and Type III (10 mm thick) formats, have densities greater than 1 Gbyte. Both flash memory and magnetic media variants exist for the PC cards.
Advanced Technology Attachment (ATA) flash cards are I/O mapped
devices for file storage applications. To achieve plug-and-play interoperability,
they implement the PC standard I/O interface to the host computer. The cards
emulate ATA disk drive operation and do not allow the CPU in the host
system to directly access the flash memory. ATA flash cards can be used
interchangeably in a variety of systems and products that have slots for PC
cards. The two types of ATA flash cards, PC cards and CompactFlash™ (CF™) cards, are similar in concept. In both, an on-card controller imple-
ments a standardized interface that allows the ATA flash PC cards and CF
cards to operate as an external memory storage device without imposing
overhead on the system CPU or requiring special software. However, there are
differences in form factor (CF cards are smaller than ATA PC cards), density
(CF cards have a lower maximum density due to their smaller size), and
protocol specification [40].
The ATA flash cards are 53.9 mm × 85.5 mm (the same width and length as modem cards and linear PC cards), use a 68-pin connector, and are 5 mm thick (the Type II PC card height limit). In comparison, the CF cards are 36.4 mm × 42.8 mm (62% smaller than the PC cards) and 3.3 mm thick. The file structure in ATA flash cards is predefined. Data are accessed in sectors of 512 bytes or more, even when only one byte is needed. The applications for ATA flash cards include mobile computing systems such as handheld PCs, PDAs,
digital still cameras, smart digital phones, GPS systems, and communication
systems such as the cellular base stations, PBX equipment, and digital routing
switches.
The architecture of ATA PC cards and CF cards is almost identical. In addition to AND, DINOR, NAND, or NOR flash chips, both card types contain an intelligent controller: either a processor with typically 256 Kbytes of RAM (for temporary data storage and address translation) and some additional logic, or a microprocessor-core-based ASIC that combines these circuits. The controller performs two functions. It controls the reading and writing of data from and to the flash devices, and it implements the protocols of the ATA interface, thus offloading the CPU in the host system.
The ATA cards are optimal for many markets because they offer high density and interoperability. By far the largest applications for ATA flash
cards are consumer products, particularly digital cameras and palmtop com-
puting devices. However, these PC cards are too big and thick for many
portable systems. A development in this area has been the Miniature Cards,
first championed by Intel and Sharp and later by AMD and Fujitsu, which
reduced the card's footprint while retaining the PCMCIA's parallel interface
for high-bandwidth transfers.
allow for greater storage capacity. The SD cards have increased the interface count to nine contacts but are otherwise identical to their MMC predecessors. The MMCA is considering a variety of secure MMC proposals, including an SD-like expansion of the interface pin count to 9 or 13 contacts and a thicker card assembly.
The storage media inside the cards are equally important from the cost, performance, and reliability points of view. NOR-based flash technology has received widespread adoption in EPROM-replacement, direct-code-execution applications such as the PC BIOS and cellular phones. However, it has not gained much popularity in the more write-intensive data- and file-storage applications, especially in designs that do not mix code and data within a single chip. Multilevel-cell (MLC) storage technologies such as Intel's StrataFlash have a lower cost per bit but have drawbacks in program performance and increased erase-block sizes. Some NOR supporters continue to advocate the technology in write-mostly applications.
Sandisk is the memory-card market leader, and its products have been NOR-based ever since the company's inception. Recently, Sandisk has announced products based on its 256-Mbit double-density MLC chip. Micron is supplying CompactFlash cards based on its MediaFlash component and matched memory controller. STMicroelectronics is offering a 64-Mbit MLC chip and controller. In adapting NOR-based technology to mass storage applications, a memory vendor is prone to accepting a larger die, with more array-decoding periphery logic to reduce the erase-block size, making the media appear more like a sector-based hard-disk drive.
The flash memory based on EEPROM technology has a larger cell size than
the NOR flash but has better erase and write performance. Atmel's
DataFlash and Nexcom Technology's Serial Flash both use EEPROM-based
flash arrays. Silicon Storage Technology (SST), which uses an EEPROM-
derived flash cell as the basis for its code storage and execution chips, is
supplying other NAND flash memory vendors' chips in its CompactFlash
cards. Currently, NAND-based flash technology is the preferred storage media
in many CompactFlash cards and Memory Stick modules, recently introduced
by Sony Corporation. Samsung and Toshiba have NAND-based products that
are supplied with some nonfunctional blocks, which are mapped around by the memory controller while they are on-card or in-system. AMD offers 64-Mbit UltraNAND devices in limited production. Sandisk and Toshiba have jointly announced a 0.16-μm technology-based, 512-Mbit NAND flash device, which, by using optional MLC technology, can store 1 Gbit of information. In about a year, the companies plan to double the size of their largest device to a 2-Gbit MLC version based on 0.13-μm technology.
Hitachi and Mitsubishi's AND-based flash is the dominant technology in
data and file storage. These companies have introduced their 256-Mbit MLC
(128 million cells, with 2 bits per cell) AND chip. Hitachi and Mitsubishi have
also announced NAND-based product offerings. According to the data sheet
specifications, AND does not suffer from the same degree of write performance degradation in file storage applications that can tolerate large granularity differences and
do not require high-density or multidensity options. They use a proprietary
NAND interface and cannot be used in ATA card applications unless they are
inserted into an active adapter card that implements the ATA interface.
SmartMedia cards conform to a proprietary specification originated by Toshiba and currently supported by Toshiba and Samsung. Because the SmartMedia cards contain just one or two flash chips, they are clearly the simplest possible flash-based nonvolatile storage implementation. However, this simple configuration imposes a hardware and software burden on the host system, which must have a controller of complexity very similar to that of an ATA controller.
SmartMedia flash cards are very small and thin (45.0 mm × 37.0 mm × 0.76 mm) and are targeted at serial-access, file storage applications. The connections to the host system are via a 22-pad elastomeric connector or a probe-type mechanical connector. The power supply voltage is either 5 V or 3.3 V, but not both. The software that runs on the SmartMedia card controller in the host system must vary according to the card type and manufacturer, because the operating specifications for flash chips, such as those for reading and writing, vary between device types. The software must be written to handle the full range of flash chips that might possibly be used in the system's SmartMedia card slot, for current and future configurations. The SmartMedia card also mandates software licensing fees for the host system and requires extra ROM and SRAM in the host system. These system software- and hardware-related issues can complicate the system design and increase time to market. The SmartMedia cards held under 13% of the market in 1997.
The basic design issues for SmartMedia cards are system development cost and the requirement for a hardware interface and a software driver in every host system. Some other areas of concern are reliability, due to the use of unproven packaging technology; compatibility issues, because of the proprietary interface; capacity growth and density granularity; and interconnection-related issues. The SmartMedia card has a per-card cost advantage over a CF card at its density points, but that advantage quickly disappears considering the greater development resources required to design a product to accept a SmartMedia card than to accept a CF card.
Another factor that must be considered when evaluating SmartMedia cards versus CF cards is the compatibility issue for varying supply voltages and software requirements. To be used in a system that has an ATA PC card slot, the SmartMedia card must be inserted into a very costly interface card containing control functions that implement the ATA interface. By comparison, the CF card needs only an inexpensive "pass-through" connector. An additional disadvantage of the SmartMedia card compared to the CF card is its packaging approach, which has mechanical strength deficiencies. Premature SmartMedia card failure is likely to occur if the card is subjected to forces that cause twisting or bending.
• CLK: One-bit data transfers on the command and data lines occur on each cycle of this signal. The frequency may vary between zero and the maximum clock frequency.
• CMD: A bidirectional command channel used for card initialization and data transfer commands. The CMD signal has two operation modes: open drain for the initialization mode and push-pull for fast command transfer. The commands are sent from the MultiMedia Card bus master to the card, and the responses are sent from the card to the host.
• DAT: A bidirectional data channel. The DAT signal operates in the push-pull mode. Only one card or the host drives this signal at a time.
• Read Only Memory (ROM) Cards. These cards are manufactured with a
fixed data content, and they are typically used as distribution media for software, audio, and so on.
• Read/Write (R/W) Cards. These are available in various versions such as
flash, one-time programmable (OTP), and multiple-time programmable
(MTP). These cards are typically sold as blank media and are used for
mass data storage, with the end-user recording of audio, video, or digital
images.
• I/O Cards. These cards are intended for communication (e.g., modems)
and typically have an additional interface link.
Figure 6.20 shows the MultiMedia Card (a) architecture and (b) bus system [42]. The MultiMedia Card bus is designed to connect either solid-state mass storage memory or I/O devices in a card format to multimedia applications.
Figure 6.20 MultiMedia Card. (a) Architecture. (b) Bus system. (From reference 42.)
The MultiMedia Card controller is the link between the application and the MultiMedia Card bus with its cards. It translates the protocol of the standard MultiMedia Card bus to the application bus and is divided into two major parts: (1) the application adapter, which is the application-oriented part, and (2) the MultiMedia Card adapter, which is the MultiMedia-Card-oriented part. The application adapter consists at least of a bus slave and a bridge into the MultiMedia Card system. It can be extended to become a master on the application bus and to support functions like DMA or to serve application-specific needs.
and write operation can be directly attributed to the use of the internal clock of the controller. For an ATA flash disk to have higher performance, the clock rate needs to be increased, which places a greater power drain on the battery in mobile applications [43].
The FTL solution can achieve the same flash-media functionality through software. The FTL is a simple software driver that acts as a translator between the flash media and the BIOS parameter block/file allocation table (BPB/FAT). Because FTL is a software-based solution, it can bring the cost down, compared to the ATA approach, roughly by an order of magnitude. However, the FTL approach relies strongly on the speed of the host processor.
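The translation the FTL performs can be pictured as a lookup from the logical sectors that the BPB/FAT layer expects to the physical flash locations currently holding the data. A minimal sketch in C, with invented names and a flat table standing in for the driver's real data structures:

    #include <stdint.h>

    #define SECTORS 4096          /* logical sectors exposed to the FAT */
    #define INVALID 0xFFFFFFFFu

    /* Logical-to-physical map maintained by this hypothetical driver. */
    static uint32_t l2p[SECTORS];

    /* Resolve a logical sector to its current physical flash location. */
    static uint32_t ftl_lookup(uint32_t logical)
    {
        return (logical < SECTORS) ? l2p[logical] : INVALID;
    }

    /* Flash cannot be updated in place: a rewrite goes to a fresh
       location and the map entry is redirected to it; the old location
       is reclaimed (erased) later. */
    static void ftl_remap(uint32_t logical, uint32_t new_physical)
    {
        if (logical < SECTORS)
            l2p[logical] = new_physical;
    }

    int main(void)
    {
        ftl_remap(7, 1042);  /* sector 7 now lives at flash unit 1042 */
        return ftl_lookup(7) == 1042 ? 0 : 1;
    }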
The DiskOnChip (DOC) is M-Systems' first monolithic solid-state flash disk, which combines a disk controller with flash memory on a single die and is available in a standard 32-pin DIP or TSOP package. The DiskOnChip products are optimized for use in information appliances such as set-top boxes and portable PC-compatible systems that require minimal weight, space, and power consumption. In order to emulate a hard disk, a flash disk requires a software management layer. M-Systems has patented a flash file system management technology called TrueFFS®, which allows flash components to fully emulate a hard disk. The TrueFFS has the following features:
• The use of wear-leveling algorithms to ensure that all blocks are erased an equal number of times, which can potentially increase the life of the product by several orders of magnitude (see the sketch following this list).
• Using virtual blocking of the flash device to make the large erase blocks
transparent to the operator.
• Automatic mapping of bad blocks.
• Implementation of a power loss recovery mechanism to guarantee abso-
lute protection of data.
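A minimal illustration in C of the wear-leveling idea from the first bullet above: when a block must be erased and reused, the driver picks the free candidate with the lowest erase count, so all blocks age evenly. This sketches the concept only, not M-Systems' patented algorithm:

    #include <stdint.h>

    #define BLOCKS 1024

    /* Per-block bookkeeping a wear-leveling driver would maintain. */
    static uint32_t erase_count[BLOCKS];
    static uint8_t is_free[BLOCKS];

    /* Choose the least-worn free block as the target of the next write,
       charging it one erase; returns -1 if no free block exists. */
    static int pick_block_to_write(void)
    {
        int best = -1;
        for (int i = 0; i < BLOCKS; i++)
            if (is_free[i] &&
                (best < 0 || erase_count[i] < erase_count[best]))
                best = i;
        if (best >= 0)
            erase_count[best]++;
        return best;
    }

    int main(void)
    {
        is_free[3] = is_free[9] = 1;
        erase_count[3] = 12;
        erase_count[9] = 5;
        return pick_block_to_write() == 9 ? 0 : 1;  /* 9 is less worn */
    }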
TrueFFS drivers also support 16-bit and 32-bit bus architectures, which are commonly used in RISC processors. The DiskOnChip is compatible with
the standard EEPROM pinout, and it supports local bus and ISA bus interface
options. It utilizes the Reed-Solomon error detection code (EDC) and error
correction code (ECC), which provide the following error immunity for each
512-byte block of data: (1) correction of up to two 10-bit symbols, including
two random bit errors, as well as correction of single bursts up to 11 bits, and
(2) detection of single bursts up to 31 bits and double bursts up to 11 bits, as
well as detection of up to 4 random bit errors.
Figure 6.21a illustrates a typical interface of the DiskOnChip-Millennium to
a system [44]. It is connected as a standard memory device using standard
memory interface signals. Typically, the DOC can be mapped to any free 8-KB memory window.
[Figure 6.21 (a) DiskOnChip-Millennium system interface: A[12:0] address lines, D[7:0] data lines, and OE, WE, CE, and RSTIN control signals. (b) System memory map, with the DiskOnChip occupying an 8-KB window at 0C8000H, between the display memory (0B0000H) and the BIOS (0F0000H).]
REFERENCES
1. Chung-Yu Wu, Embedded Memory, in The VLSI Handbook, Wai-Kai Chen (Ed.), CRC Press, New York, Chapter 50.
2. S. Iyer and Howard L. Kalter, Embedded DRAM technology: Opportunities and
challenges, IEEE Spectrum, 1999, pp. 56-64.
3. Richard A. Quinnell, Embedding DRAM boosts performance at a cost, Silicon
Strategies, February 1998, pp. 19-22.
4. Brian Dipert, Embedded memory: the all-purpose core, EDN, March 13, 1998, pp.
34-50.
5. Norbert Wehn and Sören Hein, Embedded DRAM architectural tradeoffs, Technical Paper, Siemens web page.
6. Richard A. Quinnell et al., Focus Report: ASICs with embedded memory, Silicon
Strategies, February 1998, pp. 53-61.
7. Richard Stacpoole et al., Cache memories, IEEE Potentials, 2000, pp. 24-29.
8. Gary Green, Extended cacheability, EDN Products Design, Aug. 8, 1997, pp. 29-32.
9. David Barringer et al., Modernize your memory subsystem design, Electron. Des.,
February 5, 1996, pp. 83-92.
10. Jim Handy, Fast Cache Memory, in The Cache Memory Book, Academic Press, San Diego, CA, 1998, Chapter 4.
11. TI TMS 320C6211 Cache Analysis, Application report SPRA427, September 1998,
pp. 1-11, TI web page.
12. Eric Hall and George Costakis, Developing a design methodology for embedded
memories, ISD, January 2000, pp. 13-16.
13. Konrad Schonemann, A modular embedded DRAM core concept in 0.24-μm technology, IEEE Proc. of MTDT '98 Conference, 24-25 August 1998, San Jose, CA.
14. Embedded DRAM, Samsung Home Page.
15. Danny D. Yeung, Embedded memories are the key to unleashing the power of SOC
designs, Electron. Des., December 4, 2000, pp. 123-129.
16. W. Leung et al., The ideal SoC memory: 1T-SRAM, Proc. IEEE ASIC/SOC Conference 2000, pp. 32-35.
17. K. Takeda et al., A 16-Mb 400 MHz loadless CMOS four-transistor SRAM macro,
IEEE JSSC, Vol. 35, no. 11, November 2000, pp. 1631-1639.
18. A. Lalchandani and F. Krupecki, dRAMASIC™: The marriage of memory and
logic, EDN Products Edition, May 14, 1997, pp. 23-25.
19. R. C. Foss et al., Re-inventing the DRAM for embedded use: A compiled, wide-databus DRAM macrocell with high bandwidth and low power, Proc. IEEE Custom IC Conference, May 13, 1998, pp. 1-5.
20. Y. Agata et al., An 8 ns random cycle embedded RAM macro with dual-port
interleaved DRAM architecture (D²RAM), IEEE JSSC, Vol. 35, no. 11, November
2000, pp. 1668-1671.
21. M. Inoue et al., A 16-Mb DRAM with a relaxed sense-amplifier-pitch open-bit-line
architecture, IEEE JSSC, Vol. 23, pp. 1104-1112, October 1988.
22. O. Takahashi et al., 1-GHz fully pipelined 3.7 ns address access time 8k × 1024 embedded synchronous DRAM macro, IEEE JSSC, Vol. 35, no. 11, November 2000, pp. 1673-1688.
23. P. Hofstee et al., A 1-GHz single issue 64-b PowerPC processor, 2000 IEEE ISSCC
Dig. Tech. Papers, pp. 92-93.
24. S. Crowder et al., Integration of trench DRAM into a high-performance 0.18-μm logic technology with copper BEOL, in 1998 IEEE IEDM Dig. Tech. Papers, pp. 1017-1020.
25. Y. Nunomura et al., M32R/D: Integrating DRAM and microprocessor, IEEE
Micro, November/December 1997, pp. 40-47.
26. K. Suzuki et al., A 2000-MOPS embedded RISC processor with a Rambus DRAM controller, IEEE JSSC, Vol. 34, no. 7, July 1999, pp. 1010-1020.
27. K. Suzuki et al., V830R/AV: An embedded multimedia superscalar RISC processor,
IEEE Micro Magazine, Vol. 18, pp. 36-47, March 1998.
28. N. Wehn et al., Embedded DRAM architectural trade-offs, Design, Automation and
Test Conference, Europe, 1998.
29. Ichiro Sase et al., Multimedia LSI accelerator with embedded DRAM, IEEE Micro,
November/December 1997, pp. 49-54.
30. Steven Przybylski, New DRAM technologies: A comprehensive analysis of the new architectures, MicroDesign Resources, Sebastopol, California, 1994.
31. David Patterson et al., Intelligent RAM (IRAM): The industrial setting, applica-
tions, and architectures, University of California at Berkeley web page.
32. David Patterson et al., A case for intelligent RAM: IRAM, from University of
California at Berkeley web page and IEEE Micro, April 1997.
33. D. G. Elliott et al., Computational RAM: Implementing processors in memory, IEEE Des. Test of Comput., January-March 1999, pp. 32-41.
34. T. Shimizu et al., A multimedia 32b RISC microprocessor with 16 Mb DRAM,
IEEE ISSCC Proc. 1996, pp. 216-217.
35. John Bond, Embedded flash speeds time-to-market, Comput. Des., December 1998, pp. 28-30.
36. Dave Bursky, Flash and EEPROM technologies combine on feature-rich MCUs,
Electron. Des., May 27, 1997, pp. 81-93.
37. Boaz Eitan et al., Embedding flash memory in SOC applications, ISD Magazine,
December 2000, pp. 46-50.
38. K-T. Chang et al., A new SONOS memory using source-side injection for programming, IEEE Electron Device Lett., EDL-19(7), pp. 253-255, 1998.
39. Flash Card White Paper: Technology and Market Backgrounder, April 1998,
Hitachi web site.
40. CF and CompactFlash Paper: CompactFlash Association web page.
41. CompactFlash ™ Sandisk web page.
42. The MultiMedia Card System Summary Based on System Specification Version 2.2,
MMCA Technical Committee.
43. Stefanie Helm, The changing face of solid state memory, EDN Products, May 14,
1997.
44. DiskOnf.hip" Millennium Single Chip Flash Disk Data Sheets.