
CHAPTER 4

APPLICATION-SPECIFIC DRAM
ARCHITECTURES AND DESIGNS

The technical advances in multimegabit DRAMs have resulted in greater demand for memory designs incorporating specialized performance requirements for applications such as high-end desktops/workstations, PC servers/mainframes, 3-D graphics, network routers, and switches. For example, several years ago, the rapidly expanding video graphics market spawned the need for high-speed serial interfaces, which encouraged the development of (a) dual-port video RAMs (VDRAMs) with one serial and one random port or (b) the frame buffer with two serial ports. The memory designers made efforts to simplify the complex external control requirements to make them more compatible with the SRAM systems. These design modifications of lower-cost DRAMs for use in the SRAM-specific applications were called the pseudostatic DRAMs (PSRAMs), or virtual static DRAMs (VSRAMs). These earlier developments were discussed briefly in Semiconductor Memories [1].
Some more recent high-performance memory architectures include syn-
chronous DRAMs (SDRAMs), enhanced SDRAM (ESDRAM), cache DRAM
(CDRAM), and virtual channel memory (VCM) DRAMs, described in Chap-
ter 3 of this book. Chapter 4 introduces some application-specific memory architectures and designs in more detail, such as the Video RAMs, syn-
chronous graphic RAMs (SGRAMs), double data rate (DDR) SGRAMs,
Rambus Technology, synchronous link DRAMs (SLDRAMs), and 3-D RAMs.
In high-performance memory systems, latency is an important parameter
and is the time required from an initial data request to actually obtain the first
piece of data, which in the older DRAMs was equivalent to access time.
However, the newer DRAMs with static column and fast page modes have a
burst transfer mode that allows them to transfer as little as a page or up to the

entire contents of a memory chip. Therefore, for these DRAMs, the memory
latency is as important as the time for each subsequent data word in the
transfer sequence (the burst length). One of the first attempts to speed up
DRAM access time was the cache DRAM (CDRAM), in which the internal
architecture consists of a standard DRAM storage array and an on-chip cache.
The cache and memory core array are linked by a wide bus, so that the entire
cache can be loaded up in just a single cycle. The CDRAMs were discussed in
Chapter 3 (Section 3.7).
For PC memories, the biggest debate in DRAM applications has been
whether random-access latency or burst bandwidth is the more significant
performance parameter [2]. The shorter the average burst-access length, the
lower the chances of amortizing an extended initial latency over much shorter
subsequent burst accesses. Also, the more effective the CPU's caching scheme
before the DRAM array, the more random the DRAM-code accesses to fill the
cache lines. When a CPU, particularly one without pipelining or prefetch
support, has to read information from main memory, the CPU stalls, wasting
clock cycles until the completion of first data access. Therefore, the fewer the
system masters accessing main memory, the lower the chances that they will
consume a significant amount of a memory's peak bandwidth.
The 16-Mbit SDRAMs with their dual-bank architecture were the first multiple-sourced, new-architecture memories to offer performance levels well above that obtainable from the extended-data-out RAMs (EDRAMs). The first-generation 16-Mb SDRAMs were specified as 100-MHz devices. Although SDRAMs are designed to a JEDEC standard, slight differences in interpretations of the specification and the test methodologies have made chip interchangeability a concern. Therefore, the chips that are capable of 100-MHz operation under ideal conditions are specified for limited operation to 66 MHz due to the timing differences in most PCs.
To boost SDRAM performance of first-generation 16-Mb devices, memory designers have tightened some of the ac timing margins, dc parameters, and layout rules to achieve a "true" 100-MHz operation for compliance with the PC100 100-MHz SDRAM specification requirements for a 100-MHz system operation. The second-generation SDRAMs have been pushing process technologies to achieve even higher speeds, such as clock rates of up to 133 MHz.
The second-generation SDRAMs include higher-performance devices that
employ four memory banks per chip. SDRAMs in the 16- and 64-Mb
generation are available with word widths of 4, 8, or 16 bits. The advanced
64-Mb and 256-Mb SDRAMs are available in 32-bit word width. The DDR
SDRAMs allow the chip to deliver data twice as fast as the single-data-rate
SDRAMs. These were discussed in Chapter 3 (Section 3.5).
Many high-end computer architectures, servers, and other systems that
require hundreds of megabytes of DRAMs have been using SDRAMs for the
main memory. However, the future home-office desktop computers will typi-
cally use some variation of specialized DRAM architectures such as the
Rambus DRAM (RDRAM) due to their smaller granularity. In general, larger
memory systems most often employ narrower word widths, because such
systems often require a lot of depth, whereas the smaller systems end up using
wider word chips, because the memory depth is smaller and wide memories
could greatly reduce the chip count [3].
For example, the current SDRAMs are available with word widths of x4, x8, or x16 bits. Assuming a 64-bit-wide memory module, if the unit is assembled
with 4-bit-wide SDRAMs, it would have a depth of 16 Mwords and a total
storage of 128 Mbytes. However, if these memory modules are built with
8-bit-wide SDRAMs, the module would pack 64 Mbytes and have a depth of
8 Mwords.
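The depth/width tradeoff can be worked out directly from the chip organization. The short sketch below assumes 64-Mb chips and a single-rank 64-bit module (the function name and parameters are illustrative, not from the text) and reproduces the two configurations quoted above.

```python
# A minimal sketch of the module-granularity arithmetic described above,
# assuming 64-Mb SDRAM chips and a 64-bit-wide, single-rank module.

def module_organization(chip_bits, chip_width, module_width=64):
    """Return (chips needed, depth in words, capacity in bytes) for one rank."""
    chips = module_width // chip_width          # chips wired side by side
    depth = chip_bits // chip_width             # words per chip = module depth
    capacity_bytes = depth * module_width // 8  # total storage of the module
    return chips, depth, capacity_bytes

for width in (4, 8, 16):
    chips, depth, cap = module_organization(64 * 2**20, width)
    print(f"x{width}: {chips} chips, {depth // 2**20} Mwords deep, {cap // 2**20} Mbytes")
# x4: 16 chips, 16 Mwords deep, 128 Mbytes
# x8: 8 chips, 8 Mwords deep, 64 Mbytes
# x16: 4 chips, 4 Mwords deep, 32 Mbytes
```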
The vendor proprietary DDR SDRAM variants began appearing in 1997.
Initial main memory devices operating at 133 MHz deliver a burst bandwidth
of 2.1 Gbytes/sec across a 64-bit data bus. The Direct Rambus DRAM
(DRDRAM) and synchronous link DRAM (SLDRAM) are two examples of
next-generation DRAM architectural developments to address the speed
requirements of latest-generation high-performance processors. Both these
architectures employ packet command protocols, which combine the separate
command and address pins of previous memory interfaces into command
bursts [4]. This approach reduces the number of pins required for addressing and control, and it facilitates the pipelining of requests to memory. Both Direct RDRAM and SLDRAM transfer commands and data on both edges of the clock.
Rambus Inc. has developed the Rambus architecture in conjunction with various partners, and although it does not manufacture or market the chips, it has licensed the controller interface cell and the memory design to companies such as Hitachi Inc., LG Semiconductor, NEC, Oki Semiconductor, Samsung Electronics Corp., and Toshiba Semiconductor Co. The companies that have
signed on to produce DRDRAMs include Fujitsu Corp., Hyundai Inc., IBM
Corp., Infineon Technologies, Texas Instruments, Micron Technology, and
Mitsubishi Electric Corp. The original RDRAMs had a latency of several
hundred nanoseconds, which affected their performance. The second-generation implementation, the concurrent RDRAM, has been optimized for use in main memory systems. For example, the concurrent RDRAMs, available in 8-bit- or 9-bit-wide versions with either 16/18- or 64/72-Mb capacities, can burst unlimited-length data strings at 600 MHz, such that the sustained bandwidth for 32-byte transfers (e.g., for a cache-line fill) can reach 426 Mbytes/s. The DRDRAM interface consists of a 16- or 18-bit datapath and an 8-bit control bus, with the interface able to operate at clock rates up to 800 MHz (rising and falling edges of a 400-MHz clock). These DRDRAMs are
available in densities such as 32 Mb for graphics design applications and
64/128 Mb for main memory applications.
The SLDRAM defines its first-generation SDRAM interface as a 16- or
18-bit-wide bus supporting up to 8 loads and operating at 400 Mbps/pin with
a 200-MHz clock; and using buffered modules, it can support up to 64 loads.
The Direct RDRAM has a 16- or 18-bit-wide data bus, but it can support up
to 32 loads and operate at 800 Mbps/pin, using a 400-MHz clock, twice the speed of the first-generation SLDRAMs. The RDRAM and SLDRAM architectures will be discussed in more detail in Sections 4.3 and 4.4, respectively. Table 4.1 compares the significant features and characteristics of these high-performance DRAM architectures for SDRAM, DDR SDRAM, Direct RDRAM, and SLDRAM at the 64-Mb level [3].

TABLE 4.1 A Comparison of the Significant Features and Characteristics of High-Performance DRAM Architectures for DDR SDRAM, Direct RDRAM, SLDRAM, and SDRAM at the 64-Mb Level [3]

Parameter                          DDR SDRAM      Direct RDRAM     SLDRAM         SDRAM
Bandwidth                          1.6 Gbytes/s   1.6 Gbytes/s     1.6 Gbytes/s   0.8 Gbytes/s
Clock frequency                    100 MHz        400 MHz          200 MHz        100 MHz
Data-transfer frequency            200 MHz        800 MHz          400 MHz        100 MHz
Bus width                          16 bits x 4    16 bits          16 bits x 2    16 bits x 4
Granularity                        32 Mbytes      8 Mbytes         16 Mbytes      16 Mbytes
System power (4 devices/system)    1500 mA        1200-1500 mA     1000 mA        500 mA
Active power (1 device)            375 mA         1000-1300 mA     440 mA         125 mA
For several years, the graphics memory architectures were designed around
the video DRAM, which is basically a dual-ported DRAM that allowed
independent writes and reads to the RAM from either port [5]. The host port
was a standard random access port, while the graphic port was optimized for
bursting data to the graphics subsystem through a pair of small parallel-to-byte
serial shift registers. However, the extra area on the chip required by the shift
registers and control circuits increased VRAM manufacturing costs. Additional
examples of graphic-optimized memories include a specialty triple-port DRAM developed by NEC, a multibank DRAM (MDRAM) developed by Mosys
Corp., the cache DRAMs (CDRAMs), and the Window RAM developed by
Samsung Electronics Corp.
The MDRAM has also been designed into some graphic subsystems. An
MDRAM is basically an array of many independent 256-kbit (32-kbyte)
DRAMs, each with a 32-bit interface, connected to a common internal bus. The
external 32-bit bus is a buffered extension to the internal bus. The independent
bank architecture facilitates overlapping, or "hiding" the row address strobe
access and precharge penalties, so that the average access times will approach
the column address strobe access time.
Some high-performance graphics workstation vendors have designed their
own graphic memory architectures to meet their specific requirements. In the
current generation of graphics memory designs, the graphics controllers have been widening their frame buffer interfaces from 32-bit- to 64-bit-wide buses (and a few even wider buses, 128 to 192 bits) to allow memory data
transfers at rates of 83 to 125 MHz, and higher. In the high-end of the
controller market, chips with 128-bit-wide buses have started to appear,
allowing twice as much data to move on every bus cycle.
The addition of 64-bit-wide interfaces to the off-chip frame buffer for the graphics controllers has caused impressive gains in performance, but has increased the power consumption due to the wide buses and high-speed signal switching.
Therefore, a different approach is being used for portable, low-power applications: eliminating the wide, off-chip buses and embedding the frame buffer memory into the graphics controller chip. Many graphics memory chip suppliers have adopted this embedded memory, merged-architecture approach. This eventually allows
the designer to use even wider buses on the chip to connect the memory to the
controller. The latest-generation graphic memory controllers from NeoMagic
and other suppliers use a 128-bit-wide bus for the data transfers. These
embedded memories and merged memory-logic architectures will be discussed
in Chapter 6. The following sections provide an overview of architecture and
designs of VRAMs, SGRAMs, fast cycle RAMs (FCRAMs), Rambus Technol-
ogy, SLDRAMs, 3-D RAMs, and performance versus cost selection tradeoffs
for some of these devices.

4.1. VIDEO RAMs (VRAMs)

A key to the performance of a graphics system is the memory design and configuration. Several considerations must be weighed in order to pick the optimal configuration for a particular application. These include [6]:

• Desired resolution, bit depth, and refresh rate
• Desired expandability, minimum and maximum buffer sizes
• Desired performance/cost goals
• Compatibility with existing standards (e.g., VGA), and so on.

One of the first steps in designing a graphics system is determining the frame
buffer size (or sizes) and internal (nondisplayed) resolution. The next step is to
use that information in configuring the memory for optimum system perform-
ance.
The VRAM was developed to increase the bandwidth of raster graphics
display frame buffers. If a DRAM is used as a frame buffer, it must be accessible
by both the host/graphics controller and the CRT refresh circuitry. The raster
graphics display requires that a constant, uninterrupted flow of pixel data be
available in the CRT drive circuitry. This requires that the host or graphics
processor must be interrupted when a request is made by the CRT drive circuitry for a new line of pixels. The Video RAM is a very specialized form of
a DRAM that can provide high-speed serial streams of data to a video monitor
and includes a series of registers called serial access memory (SAM) registers
tied to a serial port [7]. Because the primary use of this type of RAM is video
screen refresh, the VRAM is designed to allow serial data to be accessed from
this port continuously. The VRAMs also have a DRAM interface that is
completely separate from the serial registers and serial port. While data are
being read from the serial port, other data may be stored in or read from the
DRAM array via the DRAM port.
In a typical graphics application for a VRAM, the frame or the image buffer
stores a bit map of the image to be displayed on the screen. The pixel depth
in bits represents the color resolution, and the pixel width (pixel time)
represents the image resolution on the screen. The greater the number of bits
per pixel, the greater the color resolution. Also, the smaller the pixel time, the
greater the image resolution. Many applications require 8 bits per pixel to
provide a choice of 256 colors. Very high resolution pictures using 24 bits per
pixel can generate up to 16 million colors.
The data from SAM are fed to the digital-to-analog converter (DAC) by a
32-bit-wide data bus. Inside the DAC, this data are multiplexed and latched.
The 8 bits latched into the RAMDAC serve as an address for any one of the
256 entries in the lookup table. Each one of the entries in the lookup table
consists of a total of 24 bits (8 bits for red, 8 bits for green, 8 bits for blue
colors). This represents a 16-million-color palette. The 24 bits in the lookup
table can be modified by the user without affecting contents of the frame buffer.
There are three 8-bit video D/A converters inside the RAMDAC to convert the
digital bits to analog color signals to be sent to three guns of the CRT monitor.
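The lookup-table operation can be sketched in a few lines of code. The example below only illustrates the indexing scheme described above (an 8-bit pixel value selecting one of 256 entries of 24-bit color); the palette contents and names are hypothetical, and no particular RAMDAC register interface is implied.

```python
# Illustrative sketch of an 8-bit pseudo-color lookup, as performed by a RAMDAC:
# each 8-bit pixel value addresses one of 256 palette entries, and each entry
# holds 24 bits of color (8 bits each for red, green, and blue).

palette = [(i, 255 - i, (i * 3) % 256) for i in range(256)]  # 256 x 24-bit entries

def pixel_to_rgb(pixel_value):
    """Map an 8-bit frame-buffer pixel to the 24-bit color stored in the palette."""
    assert 0 <= pixel_value <= 255
    return palette[pixel_value]          # (red, green, blue), 8 bits per gun

print(pixel_to_rgb(0x40))                # (64, 191, 192)
```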
The serial port is 100% dedicated to providing the data to CRT refresh
circuitry. While the serial port (SAM) is asynchronously supplying the data for
display, the random port is always available for read/write for the graphics
controller. The random port is unavailable only during the read/write transfers.
However, these read/write transfers require only 2% of the total time to
transfer data from the DRAM array to SAM. The VRAM supports the
following three basic operations:

• Bidirectional random access to the DRAM
• Bidirectional serial access to the SAM
• Bidirectional transfer of data between any DRAM row and the SAM

The 4-Mb dual-port video RAM (VRAM) consists of a DRAM organized as a 256K x 16-bit device interfaced to a serial register/serial access memory (SAM). Some 4-Mb dual-port VRAMs have a SAM with a 256 x 16 organization, known as the half-depth SAM, whereas the other 4-Mb VRAMs have a SAM with a 512 x 16 organization, known as the full-depth SAM.
The chip size of the full-depth SAM is larger than the half-depth SAM. Figure 4.1 shows the architecture of a standard 4-Mb VRAM [8].

Figure 4.1 Block diagram for architecture of a standard 4-Mb VRAM. (From reference 8, with permission of IBM Corp.)
A full-depth SAM is a 512 x 16 serial buffer built into a 4-Mb VRAM. The buffer is used for serial read/write, and in the full transfer mode a full word line (512 x 16) is transferred to the SAM. In most applications, the serial port is always being read. Therefore, the transfer has to be synchronized to the last read operation from the SAM, which creates timing problems. To avoid these, a split register transfer is preferred, so that in the split register transfer mode, half of the word line (256 x 16), or half-row, is transferred to its respective half of the SAM, while the other half of the SAM is being read. This helps avoid the possibility of overlap of a read from the SAM while the data are being transferred from the DRAM array to the SAM register or buffer. For high-end graphics applications, it is desirable to read/write part of the SAM. Many designers prefer to stop reading at some boundary and jump to another address in the SAM.
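The split-register transfer amounts to a simple ping-pong (double-buffering) scheme between the two SAM halves. The sketch below is a behavioral illustration only, assuming a full-depth SAM and a stream of 256-entry half-rows; the names and the data source are hypothetical and no device timing is modeled.

```python
# Behavioral sketch of split-register (ping-pong) SAM operation: while one half
# of the SAM is read out serially, the other half is reloaded from the DRAM
# array, so the serial stream to the display is never interrupted.

SAM_HALF = 256   # a full-depth SAM holds 512 x 16; each half holds 256 entries

def stream_frame(dram_half_rows):
    """dram_half_rows: iterator of 256-entry half-rows, in display order."""
    sam = {0: None, 1: None}                 # the two SAM halves
    sam[0] = next(dram_half_rows)            # prime the lower half
    active = 0
    for half_row in dram_half_rows:
        sam[1 - active] = half_row           # split transfer into the idle half...
        yield from sam[active]               # ...while the active half is read serially
        active = 1 - active                  # swap halves at the boundary
    yield from sam[active]                   # drain the last loaded half

rows = iter([[p] * SAM_HALF for p in range(4)])      # four dummy half-rows
assert len(list(stream_frame(rows))) == 4 * SAM_HALF
```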
A half-depth SAM is a 256 x 16 serial register built into the 4-Mb VRAM.
This buffer is used to provide serial read /write operations. While being used in
the full transfer mode, a half-word line (256 x 16) either lower or upper is
transferred to the SAM (256 x 16). In the split register transfer mode, only
one-quarter of the word line (128 x 16) or row is transferred to either the lower
or upper half of the SAM that is not being written/read. A mode that allows
the designer to jump to the other half without serially clocking through the
midpoint (127/128) is called the serial register stop (SRS) mode. A CBRS (CE
before RE refresh with mode SET) cycle is initiated to put the VRAM in the
SRS mode. A half-depth SAM part is considered compatible with the full-depth SAM part if the replacement of the half-depth SAM with the full-depth SAM does not affect the system operation.
The VRAMs have a number of features that are specifically designed to
enhance performance and flexibility in the graphics applications, such as the
block write, write-per-bit, flash write, mask register, and color register. All of
these options work with the DRAM portion of the VRAM and are used to
efficiently update screen data stored in the DRAM. These features are briefly
described below.

• Block Write This feature can be used to write the contents of the color registers into eight consecutive column locations in the DRAM in one
operation. The masking feature allows precise selection of the memory
locations that get the color data. This option is useful for quickly filling
large areas such as the polygons with a single color during real-time
imaging applications.
• Write-Per-Bit The write-per-bit is a temporary masking option used to
mask specific inputs during the write operations. When used in conjunc-
tion with the data mask in the mask register, the write-per-bit feature allows selection of the memory locations that need to be written (a short sketch of this masking behavior follows this list).
• Flash Write Flash write clears large portions of the DRAM quickly.
Each time the flash write option is selected, an entire row of data in the
DRAM is cleared.
• Mask Register The mask register stores mask data that can be used to
prevent certain memory locations from being written. This feature is
generally used with the block write option and can be used during the
normal writes. The bits that are masked (mask data = 0) retain their old
data, while the unmasked bits are overwritten with the new data.
• Color Register The color register stores the data for one or more screen
colors. These data are then written to memory locations in the DRAM
corresponding to the portions of the screen that will use the stored color.
The major function of the color register is to rapidly store the color data
associated with large areas of a single color, such as a filled polygon.
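The masking behavior referred to above (write-per-bit used together with the mask register) can be illustrated at the level of a single word: masked bit positions (mask data = 0) retain their old contents, while unmasked positions take the new data. The sketch below is a functional illustration only; widths and names are assumptions, and no specific VRAM command sequence is implied.

```python
# Functional sketch of a masked (write-per-bit style) update of one 16-bit
# memory word: where the mask bit is 0 the old data are retained, where it is
# 1 the corresponding bit is overwritten with the new data.

def masked_write(old_word, new_word, mask, width=16):
    """Return the word actually stored after a masked write."""
    keep = ~mask & ((1 << width) - 1)        # bit positions protected by the mask
    return (old_word & keep) | (new_word & mask)

stored = masked_write(old_word=0xFFFF, new_word=0x1234, mask=0x00FF)
print(hex(stored))                           # 0xff34: upper byte kept, lower byte written
```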

4.2. SYNCHRONOUS GRAPHIC RAMs (SGRAMs)

SGRAM architecture and operations are based on those of the synchronous DRAM. As with all the DRAMs, the main functions of the SGRAM are storing
data in the memory array and reading the data out of the array; and because
the SGRAM array is a DRAM array, it must be refreshed periodically. To
make reading and writing data fast and efficient, block write and write-per-bit
functions have been added specifically to address the graphics application requirements. These features are selected via special mode registers and additional command pins that can be loaded with the appropriate information.
As in the case of SDRAM, all input signals are registered on the positive edge
of the clock signal.
As an example, IBM's SGRAM is designed for reading and writing data in
bursts of 1, 2, 4, or 8 bits or a full page. Once a row has been activated, only
the starting column address for each burst is required. An internal counter
increments the column address for each memory location after the first address
access in a burst mode. When the read or write parameters are set up, the
memory continues through the entire burst until completion or interruption
with new data being presented or accepted on each cycle of the burst. To move
to a new row, the current row must be precharged and then the new row may
be activated.
There are five major differences between the SGRAM and conventional
DRAMs, as follows [9]:

1. Synchronized Operation An SGRAM uses a clock input for synchronization, whereas the DRAM is basically an asynchronous memory, even if it uses two clocks, RAS and CAS. Each operation of the SGRAM is determined by the commands, and all operations are referenced to a positive clock edge.
2. Burst Mode The burst mode is a very high speed access, utilizing an
internal column address generator. Once the column address for the first
access is set, the following addresses are generated automatically by the
internal column address counter.
3. Mode Register The mode register configures SGRAM operation and function for the desired system conditions. The mode register has a mode register table that can be programmed and configured, for example, if a system requires the interleave burst type and a CAS latency of two clocks.
4. Write-per-Bit This function enables a selective write operation for each of the 32 I/O bits and is activated by the ACTVM command for each bank.
5. Block Write This function enables writing the same data (logic 0 or 1) into all of the memory cells for eight successive columns (8 x 32 bits) within a selected row.

The SGRAMs are also very similar to the SDRAMs, except that they have
several additional functions to improve their effectiveness in graphics systems
designs. Both the block-write and write-per-bit functions have been added to
make the reading and writing operations faster and more efficient. As in
SDRAMs, all input signals are registered on the positive edge of the clock, and
data can be written or read in the bursts of 1, 2, 4, or 8 bits or a full page [5].
The SGRAMs have many programmable features that require system
configuration during both the initialization and graphics operations. A small
command interpreter on the chip allows the burst length, the column-address-
strobe latency, the write-per-bit modes, 8-column block write, and the color
register to be set up to the desired initial values and altered when the system
conditions change.
Examples of the first-generation SGRAM offerings are 8-Mb devices,
organized as 256-Kword x 32-bit, so that two of these chips can form a
2-Mbyte frame buffer for a 64-bit graphics controller. The 8-Mb SGRAMs are
available with data clock speeds of 83 MHz, 100 MHz, 125 MHz, or even
higher. The second-generation devices are 16-Mb SGRAMs that can double
the word depth, allowing a 4-Mbyte buffer to be built with just two chips. The
improved versions of these 16-Mb SGRAMs include devices with higher clock
speeds or DDR transfers. Therefore, a 16-Mb SGRAM with DDR capability,
along with the 100-MHz clock, will transfer data at 200 MHz, allowing
graphics bandwidth performance up to 800 Mbytes/s (peak). An example is
a 150-MHz DDR developed by IBM that can deliver 300 Mb/s for each pin and
a peak data rate of 1.2 Gbytes/s over the 32-bit bus.
The DDR memory performs I/O transactions on both the rising and falling
edges of the clock cycle. The DDR SGRAM uses a bidirectional data strobe
(DQS) moving with DQs (multiplexed data I/O) in parallel and is used in the
system as a reference signal to fetch the corresponding DQs. A benefit of using
DQS is to eliminate the clock skew and timing variation effects between the
memory and controller during the high-speed data transfer at each pin. In
addition, the skew between the input clocks of the memory and controller can
be ignored because the DQS synchronizes both data input and output at both
of its edges [10].
A major advantage of DDR usage in 3-D graphics applications is that it
doubles the memory bandwidth. For example, two x 32 DDR SGRAMs
running at a 200-MHz clock frequency offer a peak data throughput of 3.2 Gbytes/s for a 64-bit bus and 6.4 Gbytes/s for a 128-bit memory interface. For a 64-bit (8-byte) bus, the peak rate is calculated as 8 x 200 x 10⁶ x 2 (both clock edges) = 3.2 Gbytes/s. Similarly, for a 128-bit (16-byte) bus, the peak rate is calculated as 16 x 200 x 10⁶ x 2 = 6.4 Gbytes/s.
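The same peak-bandwidth arithmetic can be captured in a one-line helper. The sketch below simply restates the calculation above (bus width in bytes x clock frequency x 2 edges) and is not tied to any particular device.

```python
# Peak bandwidth of a DDR interface: bytes per transfer x clock rate x 2 edges.

def ddr_peak_bandwidth(bus_bits, clock_hz):
    """Return the peak transfer rate in bytes per second for a DDR bus."""
    return (bus_bits // 8) * clock_hz * 2

print(ddr_peak_bandwidth(64, 200e6) / 1e9)    # 3.2 (Gbytes/s, 64-bit bus)
print(ddr_peak_bandwidth(128, 200e6) / 1e9)   # 6.4 (Gbytes/s, 128-bit bus)
```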
Section 4.2.1 provides an example of a 64-Mb DDR SGRAM supplied by
Samsung Electronics.

4.2.1. 64-Mb DDR SGRAM


An example of a 64-Mb DDR SGRAM is a 512K x 32-bit x 4 bank device
with a bidirectional data strobe supplied by Samsung Electronics, which is
specified for a maximum clock frequency up to 166 MHz and for a peak
performance up to 1.328 GB/s/chip with I/O transactions on both edges of the
clock cycle. Figure 4.2 shows the block diagram of this 64-Mb DDR SGRAM
[11]. The major functional blocks and command set of this device are
described below.
Figure 4.2 Block diagram of a 64-Mb DDR SGRAM organized as 512K x 32 I/O x 4 banks. (From reference 11, with permission of Samsung Electronics.)

Mode Register Set (MRS) The mode register stores the data for control of the various operating modes of the DDR SGRAM. It programs the CAS latency, addressing mode, burst length, test mode, and other vendor-specific options to make the device useful for a variety of different applications. To operate the DDR SGRAM, the mode register must be written after power-up, because its default value is not defined. The mode register is written by asserting low on CS, RAS, CAS, and WE. The state of the address pins A0-A10 and BA0, BA1 in the same cycle as CS, RAS, CAS, and WE going low is written into the mode register. One clock cycle is required to complete the write operation to the mode register. The mode register contents can be changed using the same command and clock cycle requirements during operation, as long as all banks are in the idle state. The mode register is divided into various fields depending on functionality. The burst length uses A0-A2, the addressing mode uses A3, and the CAS latency (read latency from column address) uses A4-A6. A7 is used for the test mode. Pins A7, A8, BA0, and BA1 must be set low for normal DDR SGRAM operation. Table 4.2 shows the specific codes for the various burst lengths, addressing modes, CAS latencies, and the MRS cycle [11].
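A compact way to see how these fields share the address bus is to pack them in software. The sketch below assembles the mode-register word from the field positions given above (burst length in A0-A2, burst type in A3, CAS latency in A4-A6); the encodings follow Table 4.2, and the function name itself is only illustrative.

```python
# Packing the DDR SGRAM mode-register fields onto the address pins:
# A0-A2 burst length, A3 burst type, A4-A6 CAS latency, A7-A8 held low.

BURST_LENGTH_CODE = {2: 0b001, 4: 0b010, 8: 0b011, "full": 0b111}
CAS_LATENCY_CODE = {2: 0b010, 3: 0b011}

def mrs_address(burst_length, interleave, cas_latency):
    """Return the value to drive on A0-A8 during the MRS command."""
    return (BURST_LENGTH_CODE[burst_length]
            | (1 << 3 if interleave else 0)
            | (CAS_LATENCY_CODE[cas_latency] << 4))

print(bin(mrs_address(burst_length=4, interleave=False, cas_latency=2)))
# 0b100010 -> A6-A4 = 010 (CL 2), A3 = 0 (sequential), A2-A0 = 010 (BL 4)
```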

Define Special Function (DSF) The DSF controls the graphic applications
of the SGRAM. If the DSF pin is tied low, the SGRAM functions like an SDRAM.
The SGRAM can be used as a unified memory by the appropriate DSF
command. All the graphic function modes can be entered only by setting the

DSF high when issuing commands, which otherwise would be normal SDRAM commands.

TABLE 4.2 Mode Register Set (MRS): Specific Codes for the Various Burst Lengths, Addressing Modes, CAS Latencies, and the MRS Cycle

Mode (A7):                0 = normal, 1 = test
Burst type (A3):          0 = sequential, 1 = interleave
CAS latency (A6 A5 A4):   0 1 0 = 2;  0 1 1 = 3;  all other codes reserved
Burst length (A2 A1 A0):  0 0 1 = 2;  0 1 0 = 4;  0 1 1 = 8;
                          1 1 1 = full page (sequential type only; reserved for
                          the interleave type);  all other codes reserved

RFU (reserved for future use) bits should stay 0 during the MRS cycle. [MRS cycle timing waveform omitted.] Notes: (1) MRS can be issued only when all banks are in the precharged state; (2) a minimum tRP is required before the MRS command can be issued.

Source: Reference 11, with permission of Samsung Electronics.

Special Mode Register Set (SMRS) There is a special mode register in the DDR SGRAM called the color register. When A6 and DSF go high in the same cycle as CS, RAS, CAS, and WE going low, the load color register (LCR) process is executed and the color register is filled with the color data for the associated DQs through the DQ pins. At the next clock after LCR, a new command can be issued. Unlike the MRS command, the SMRS command can be issued in the active state, under the condition that the DQs are idle.

Block Write Block write is a feature that allows simultaneous writing of 16 consecutive columns of data within a RAM device during a single access cycle. During a block write, the data to be written come from an internal "color" register. The block of columns to be written is aligned on 16-column boundaries and is defined by the column address with the 4 LSBs ignored. The write command with the DSF input high enables block write for the associated bank. The block width is 16 columns, where the column is 11 bits for the x11 part. The color register has the same data width as the data port of the chip. The color register provides the data without column masking.

Burst Mode Operation The burst mode operation is used to provide a constant flow of data to memory locations (write cycle) or from the memory locations (read cycle). Two parameters define the burst mode operation: the burst length and the burst sequence, both of which are programmable and determined by the address bits A0-A3 during the Mode Register Set (MRS) command. The burst type is used to define the sequence in which the burst data
will be delivered or stored in the SGRAM. Two types of burst sequences are
supported: sequential and interleaved. The burst length controls the number of
bits that will be output after a read command, or the number of bits to be input
after a write command. The burst length can be programmed to have values
of 2, 4, 8, or full page.
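The difference between the two burst types can be pictured with the conventional SDRAM-style address sequences: a sequential burst counts up from the starting column and wraps within the burst boundary, while an interleaved burst XORs the burst offset with the starting address. The sketch below is a generic illustration of those orderings as commonly defined for JEDEC-style SDRAM bursts, not a reproduction of this specific device's tables.

```python
# Generic illustration of SDRAM-style burst orderings for a given burst length:
# sequential bursts increment and wrap within the burst boundary, interleaved
# bursts XOR the offset with the starting column address.

def burst_order(start_col, burst_length, interleaved=False):
    """Return the column addresses accessed for one burst."""
    base = start_col & ~(burst_length - 1)        # aligned burst boundary
    offset = start_col & (burst_length - 1)
    if interleaved:
        return [base + (offset ^ i) for i in range(burst_length)]
    return [base + ((offset + i) % burst_length) for i in range(burst_length)]

print(burst_order(0b101, 4))                       # sequential : [5, 6, 7, 4]
print(burst_order(0b101, 4, interleaved=True))     # interleaved: [5, 4, 7, 6]
```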

Bank Activation Command The bank activation command is issued by holding CAS and WE high with CS and RAS low at the rising edge of the clock. The DDR SGRAM has four independent banks, so that the two bank select addresses (BA0, BA1) are supported. The bank activation command must
be applied before any read or write operation is executed. Once a bank has
been activated, it must be precharged before another bank activation command
can be applied to the same bank.

Burst Read Operation The burst read operation in a DDR SGRAM is similar to the one in the current SDRAM, so that the burst read command is issued by asserting the CS and CAS low while holding the RAS and WE high at the rising edge of the clock after time tRCD from the bank activation. The
address inputs (A0-A7) determine the starting address for the burst operation. The mode register sets the type of burst (sequential or interleaved) and the burst length (2, 4, 8, or full page). The first output data are available after the CAS latency from the READ command, and the consecutive data are presented on the falling and rising edges of the data strobe adopted by the DDR SGRAM until the burst length is completed. Figure 4.3a shows the timing diagram of the burst read operation [11].

Figure 4.3 Two 64-Mb DDR SGRAM timing diagrams. (a) Burst read operation (burst length = 4, CAS latency = 2, 3). (b) Burst write operation (burst length = 4). (From reference 11, with permission of Samsung Electronics.)

Burst Write Operation The burst write command is issued by having CS, CAS, and WE low while holding RAS high at the rising edge of the clock. The address inputs determine the starting column address. There is no real latency required for the burst write cycle. The first data for the burst write cycle must be applied at the first rising edge of the data strobe enabled after tDQSS from the rising edge of the clock at which the write command is issued. The remaining data inputs must be supplied on each subsequent falling and rising edge of the data strobe until the burst length is completed. When the burst operation is completed, any additional data supplied to the DQ pins will be ignored. Figure 4.3b shows the timing diagram of a burst write operation.
Burst Interrupt Operation These are the various burst interruption modes:

• Read Interrupted by a Read A burst read can be interrupted before
completion of the burst by a new read command of any bank. When the
previous burst is interrupted, the remaining addresses are overridden by
the new address with the full burst length. The data from the first read
command continues to appear on the outputs until the CAS latency from
the interrupting read command is satisfied. At this stage, the data from
interrupting read command appear.
• Read Interrupted by Burst Stop and a Write To interrupt a burst read with a write command, the burst stop command must be asserted to avoid data contention on the I/O bus by placing the DQs (output drivers) in a high-impedance state at least one clock cycle before the write command is initiated.
• Read Interrupted by a Precharge A burst read operation can be interrupted by a precharge of the same bank. A minimum of one clock cycle (tCK) is required for the read-to-precharge interval without interrupting a read burst. The precharge command to output disable latency is equivalent to the CAS latency.
• Write Interrupted by a Write A burst write can be interrupted before completion of the burst by a new write command, with the only restriction being that the interval separating the commands must be at least one tCK. When the previous burst is interrupted, the remaining addresses are overridden by the new address, and data will be written into the device until the programmed burst length is satisfied.
• Write Interrupted by a Read and Data Mask (DM) A burst write can be interrupted by a read command to any bank. To avoid data contention, the DQs must be in the high-impedance state at least one clock cycle before the interrupting read data appear on the outputs. When the read command is registered, any residual data from the burst write cycle will be masked by the DM.
• Write Interrupted by a Precharge and Data Mask (DM) A burst write operation can be interrupted before completion of the burst by a precharge of the same bank. A write recovery time (tRDL) is required before the precharge command to complete the write operation. When the precharge command is asserted, any residual data from the burst write cycle must be masked by the DM.

Burst Stop Command The burst stop command is initiated by having RAS
and CAS high with CS and WE low at the rising edge of the clock only. The
burst stop command has the fewest restrictions, which makes it the easiest
method to use when terminating a burst operation before it has been com-
pleted. When the burst stop command is issued during a burst read cycle, both the data and DQS (data strobe) go to a high-impedance state after a delay that is equal to the CAS latency set in the mode register. However, the burst stop command is not supported during a write burst operation.

Data Mask (DM) Function The DDR SGRAM has a data mask function
that can be used in conjunction with the data write cycle only (and not read
cycle). When the data mask is activated (DM high) during the write operation,
the write data are masked immediately (DM to data-mask latency is zero).

Auto-Precharge Function The auto-precharge command can be issued by having column address A8 high when a read or a write command is asserted to the DDR SGRAM. If A8 is low when a read or write command is issued, then a normal read or write burst operation is performed and the bank remains
active after the completion of burst sequence. When the auto-precharge
command is activated, the active bank automatically begins to precharge at the
earliest possible time during the read or write cycle after tRAS(min) is satisfied.
Therefore, this function can be executed as either the read with auto-precharge
command or the write with auto-precharge command.

Precharge Command The precharge command is used to precharge or close a bank that has been activated. The precharge command is issued when CS, RAS, and WE are low and CAS is high at the rising edge of the clock, CK. The precharge command can be used to precharge each bank individually or all banks simultaneously. The bank select addresses (BA0 and BA1) are used to define which bank is precharged when the command is initiated. For a write
cycle, tRDL(min) must be satisfied from the start of the last burst write cycle until
the precharge command can be issued.

Auto-Refresh An auto-refresh command is issued by having CS, RAS, and CAS held low with CKE and WE high at the rising edge of the clock (CK).
All banks must be precharged and be idle for a tRP(min) before the auto-refresh
command is applied. Once this cycle has started, no control of the external
address pins is required because of the internal address counter. When the
refresh cycle has completed, all banks will be in the idle state.

Self-Refresh A self-refresh command is executed by having CS, RAS, CAS, and CKE held low with WE high at the rising edge of the clock. Once the
self-refresh command is initiated, CKE must be held low to keep the device in
the self-refresh mode. After one clock cycle from the self-refresh command, all
of the external control signals including system clock (CK, CK) can be
disabled except CKE. The clock is internally disabled during self-refresh
operation to reduce power.

Power Down Mode The power down mode is entered when CKE is low and
is exited when CKE is high. Once the power down mode is initiated, all of the
receiver circuits except CK and CKE are gated-off to reduce power consump-
SYNCHRONOUS GRAPHIC RAMs (SGRAMs) 253

tion. During the power down mode, refresh operations cannot be performed;
therefore, the device cannot remain in the power down mode longer than the
refresh period (t REF ) of the device.
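For reference, the command encodings described in this section can be collected into a small decode table. The sketch below lists only the control-pin levels quoted above at the rising clock edge (H = high, L = low) and ignores CKE, DSF, and the address pins that further qualify some commands, so it is a simplified summary rather than the device's full truth table.

```python
# Simplified DDR SGRAM command decode at the rising clock edge, keyed by the
# levels of (CS, RAS, CAS, WE); H = high, L = low.  CKE, DSF, and address pins
# that further qualify some commands (e.g., auto-precharge via A8) are omitted.

COMMANDS = {
    ("L", "L", "L", "L"): "Mode register set (MRS)",
    ("L", "L", "H", "H"): "Bank activation",
    ("L", "H", "L", "H"): "Burst read",
    ("L", "H", "L", "L"): "Burst write",
    ("L", "H", "H", "L"): "Burst stop",
    ("L", "L", "H", "L"): "Precharge",
    ("L", "L", "L", "H"): "Auto-refresh / self-refresh (selected by CKE)",
}

def decode(cs, ras, cas, we):
    return COMMANDS.get((cs, ras, cas, we), "No operation / reserved")

print(decode("L", "H", "L", "H"))   # Burst read
```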

4.2.2. 256-Mb DDR Fast Cycle RAM (FCRAM™)


An example of this Fast Cycle RAM (FCRAM) is Fujitsu Semiconductor's 256-Mb DDR FCRAM containing 268,435,456 memory cells accessible in an 8-bit format (MB81N26847A) or 16-bit format (MB81N261647), with three different versions of random access times (22 ns, 24 ns, and 30 ns). This FCRAM features fully synchronous operation referenced to a clock edge, whereby all operations are synchronized to a clock input, and it provides double data rate by transferring data at every rising and falling clock edge. Figure 4.4 shows the block diagram of the 256-Mb DDR FCRAM for the 16-bit format device, MB81N261647 [12]. The major functional modes and commands for these 256-Mb FCRAMs (both 8-bit and 16-bit devices) are described in the following text.

Clock (CLK, CLK) This FCRAM adopts a differential clock scheme in which
CLK is a master clock and its rising edge is used to latch all command and
address inputs. CLK is a complementary clock input. An internal delay locked
loop (DLL) circuit tracks the cross-point of CLK and CLK, and generates some clock delay for the output buffer control in read mode. This DLL circuit requires some lock-on time for stable delay time generation.

Power Down (PD) The PD is a synchronous input signal and enables the low-power mode, power-down mode, and self-refresh mode. The power-down mode is entered when PD is brought low while all banks are in the idle state, and exited when it returns to the high state. When PD is brought low after tPDV, the FCRAM performs an auto-refresh and enters the power-down mode. During the power-down and self-refresh modes, both CLK and CLK are disabled after a specified time.

Chip Select (CS) and Function Select (FN) Unlike regular SDRAM's
command input signals, the FCRAM has only two control signals: (1) CS and
(2) FN. Each operation is determined by two consecutive command inputs.

Bank Address (BA0, BA1) The FCRAM has four internal banks, and the bank selection by BA occurs at the read (RDA) or write (WRA) command.

Address Inputs (A0 to A14) The address inputs select an arbitrary location of each memory cell matrix within each bank. The FCRAM adopts an address multiplexer to reduce the pin count of the address lines. At either an RDA or a WRA command, the fifteen upper addresses are initially latched along with the two bank addresses, and the remaining lower addresses are then latched by an LAL command.
Figure 4.4 Block diagram of a 256-Mb DDR FCRAM, 16-bit format. (From reference 12, with permission of Fujitsu Semiconductor.)

Data Strobe (DQS) DQS is a bidirectional signal used as the data strobe. During a read operation, DQS provides the read data strobe signal that is intended to be used as the input data strobe at the receiver circuit of the controller(s). It turns low before the first data come out, and it toggles high to low or low to high until the end of the burst read. The CAS latency is specified to the first low-to-high transition of this DQS output. During the write operation, DQS is used to latch the corresponding byte of write data: the first rising edge of the DQS input latches the first input data, and the following falling edge of DQS latches the second input data. This sequence is continued until the end of the burst count.

Data Inputs and Outputs (DQn ) Input data are latched by DQS input
signal and written into memory at the clock following the write command
input. Output data are obtained together with DQS output signals at pro-
grammed read CAS latency.

Read (RDA) and Lower Address Latch (LAL) The FCRAM adopts a two-consecutive-command input scheme. The read or write operation is determined at the first RDA or WRA command input from the standby state of the banks to be accessed (see the state diagram, Figure 4.5). The read mode is entered when the RDA command is asserted with the bank address and upper address inputs, and the LAL command with the lower address input must follow at the next clock input. The output data are then valid after the programmed CAS latency (CL) from the LAL command until the end of the burst. The read mode is automatically exited after the random cycle latency.
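The two-command address multiplexing can be pictured as splitting one full cell address into an upper part (latched with the bank address at RDA or WRA) and a lower part (latched at LAL). The sketch below only illustrates that split; the bit widths and names are assumptions for illustration, not the device's actual pin mapping.

```python
# Illustrative split of a full cell address into the two consecutive command
# inputs used by the FCRAM: upper address bits with the bank address at the
# RDA/WRA command, remaining lower bits at the following LAL command.

UPPER_BITS = 15     # A0-A14 carry the upper address at RDA/WRA (per the text)
LOWER_BITS = 7      # assumed width of the remaining lower address

def split_address(bank, full_address):
    """Return the two address fields issued on consecutive clock edges."""
    lower = full_address & ((1 << LOWER_BITS) - 1)
    upper = full_address >> LOWER_BITS
    first_cycle = {"command": "RDA", "bank": bank, "upper_address": upper}
    second_cycle = {"command": "LAL", "lower_address": lower}
    return first_cycle, second_cycle

print(split_address(bank=2, full_address=0x12345))
```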

Write (WRA) and Lower Address Latch (LAL) The write mode is entered and exited in the same manner as the read mode. The input data store is started at the rising edge of the DQS input from CL-1 until the end of the burst count. The write operation has an "on-the-fly" variable write length (VW) feature at every LAL command input following a WRA command. Unlike the data mask (DM) of a regular DDR SDRAM, VW does not provide random data mask capability; VW controls the burst counter for the write burst, and its burst length is set by a combination of two control addresses, VW0 and VW1, and the programmed burst length condition. The data in masked address locations remain unchanged.

Burst Mode Operation and Burst Type The burst mode provides faster
memory access, and the read and write operations are burst oriented. The burst
mode is implemented by keeping the same addresses and by automatic strobing
of least significant addresses in every single clock edge until programmed burst
length (BL). The access time from the clock in burst mode is specified as tAC. The internal lower address counter operation is determined by a mode register, which defines the burst type (BT) and a burst count length (BL) of 2 or 4 bits of boundary. The burst type can be selected as either sequential or interleave mode.

Mode Register Set (MRS) The mode register provides a variety of different operations and can be programmed by an MRS command following an RDA command input if all banks are in the standby state. The read operation initiated by the RDA command is canceled if the MRS command is asserted at the next clock input after the RDA command, instead of the LAL command required for a read operation. The FCRAM has two registers: (1) standard mode and (2) extended mode. The
standard mode register has four operation fields: (1) burst length, (2) burst type, (3) CAS latency, and (4) test mode (this test mode must not be used). The extended mode register has two fields: (1) DLL enable and (2) output driver strength. These two registers are selected by BA0 at MRS command entry, and each field is also set by the address lines at the MRS command. Once these fields are programmed, the contents are held until reprogrammed by another MRS command (or the part loses power). The MRS command should only be issued on the condition that all banks are in the idle state and all outputs are in the high-impedance state.
Auto-Refresh (REF) The auto-refresh mode is entered by the REF command following the WRA command. The REF command should only be issued
under the condition that all banks are in the idle state and all outputs are in
the high-impedance state.

Self-Refresh Entry (SELF) The self-refresh function provides automatic refresh using an internal timer, as with auto-refresh, and continues the refresh operation until canceled by SELFX. The self-refresh mode is entered by applying a REF command (i.e., following a WRA command) in conjunction with PD = LOW.
Figure 4.5 shows the 256-Mb FCRAM state diagram, simplified for single-bank operation [12].

Figure 4.5 A 256-Mb FCRAM state diagram for single bank operation; state transitions occur either by command input or as an automatic sequence. (From reference 12, with permission of Fujitsu Semiconductor.)

4.3. RAMBUS TECHNOLOGY OVERVIEW

The Direct RDRAM is a high-speed memory for graphic applications and offers double the word width of the original RDRAM, and it is supplied in
either a 16- or 18-bit-wide (two extra bits for parity or data) organization.
The storage capacities offered are 64/72 Mb, 128/144 Mb, and 256/288 Mb [5].
The internal multibank architecture of the RDRAMs allows the highest
sustained bandwidth for multiple, randomly accessed memory transactions.
The chips have been designed to minimize access latencies. An on-chip write
buffer allows data to be written, and then it lets the host move on to another
task. Several precharge mechanisms provide the memory controller a lot of
operating flexibility. The ability to interleave transactions also helps improve
the data-transfer operations.
Direct RDRAMs are also power smart and have advanced power manage-
ment modes, ranging from the basic power down with just self-refresh oper-
ation active, to modes in which various portions of the chip are either powered
down or put in standby modes for control signal inputs.
The concurrent RDRAM alternative performs two bank operations simul-
taneously to allow high transfer rates using interleaved transactions. These
memories can operate with speeds of 600 MHz, achieving data transfer speeds
of up to 1.2 Gbytes/s. The concurrent RDRAMs are available in 16/18-Mb and 64/72-Mb
densities with 8- or 9-bit word width. Latencies can be kept low by operating
the two or four 1-Kbyte or 2-Kbyte sense amplifiers as high-speed caches and
by using the random access mode (page mode) to facilitate large block
transfers. The concurrent RDRAMs can deliver about 15% performance
increase over the original RDRAM in graphics, video, and other applications.
Typically, memory subsystems have been designed that can transfer data for
one requester at a time, so that the length of time required to complete a
transaction adds to the latency of any pending requests. For a given bus
width and clock frequency, the amount of time the bus is occupied depends
upon the data transfer size and memory bus bandwidth. Therefore, the memory
bandwidth directly affects memory system latency. In graphic-intensive multimedia applications, the bandwidth-dependent latency is a dominant factor in
the memory subsystem's performance. Some of the traditional approaches to
increasing memory bandwidth include speeding up the memory clock, increas-
ing the bus width or both.
The SDRAM-based memory subsystem designs have performance limita-
tions beyond 100-MHz clock rates. A second approach for increasing memory
bandwidth involves transferring memory data on both clock edges, without
changing properties of any other nets. One of the critical problems in this
approach is meeting the setup and hold timing specifications for the data bus
at each device. Due to the difficulty in meeting bus-timing constraints, the
maximum system clock frequency must be reduced from that of a single-edge
clock system to avoid violating critical timing specifications. Therefore, any
clock rate reduction comes at the expense of memory control bandwidth.
A third scaling approach to increase the memory bandwidth has been to
increase the data bus width to 128 bits. Because a wide, high-speed bus can
generate large transient currents in the driver elements, a significant number of
ground and power pins are needed on the controller to support a large number
of bus I/O pins. Thus, the memory width expansion comes at the expense of
increased pin count, larger I/O power requirements, and a sacrifice in memory
granularity. In general, increasing the bus width can also cause noise problems
and result in higher power dissipation.
One of the solutions offered for high-performance memory subsystem design
has been the introduction of proprietary Direct Rambus DRAM (RDRAM)
architecture. Rambus is a high-throughput memory interface technology
designed for PC, workstation, and graphics/multimedia applications. The
major goal of Rambus has been to close the processor-memory performance
gap by providing (a) an order-of-magnitude improvement in bandwidth and
(b) a scalable architecture compatible with higher density evolution of the
DRAM processes. Also, to make Rambus affordable in mainstream markets,
the DRAMs cost in die size and packaging must be kept comparable to the
commodity DRAMs. To meet these goals, the following extensions were made
in the existing Rambus interface:

• A wider interface with a two-byte-wide datapath
• A higher clock frequency, giving an 800-MHz transfer rate
• A more efficient protocol capable of providing 95% efficiency

The Rambus architecture has three main elements: (1) Rambus interface, (2)
Rambus channel, and (3) RDRAM. Figure 4.6a shows the block diagram of
Rambus architecture and its three main elements [13].

Figure 4.6 Rambus architecture showing (a) the three major elements and (b) the memory controller and RDRAM connections to resistor-terminated transmission lines. (From reference 13, with permission of Rambus, Inc.)
The Rambus interface is implemented on both the memory controller and
the RDRAM devices on the channel. The Rambus channel incorporates (a) a
system level specification that can allow the system using Rambus channel(s)
to operate at a full rated speed and (b) a capability of transferring data at rates
of up to 800 MHz. The Rambus has a well-defined mechanical interface. The memory controller and RDRAMs connect to the printed circuit board with an
interface that has only 30 high-speed signals. Each Rambus channel supports
up to 32 RDRAMs. Modular memory expansion is available using RIMM™
modules, in the form factors similar to conventional DRAM implementations.
The RDRAM is a CMOS DRAM incorporating Rambus interface circuitry and is available in 64/72-Mb, 128/144-Mb, 256/288-Mb, and even higher densities in the future. RDRAMs respond to the requests from the
memory controller and therefore require little internal logic. The Rambus
interface can be implemented on ASICs, conventional microprocessors, or
graphics chips. Each memory controller has its own Rambus interface, and this
interface cell is available in several ASIC processes from many vendors. This
interface converts low-swing voltage levels used by the Rambus channel to the
CMOS logic levels internal to the ASIC.
The high-speed signaling used on the Rambus channel is called RSL
(Rambus signaling level). The high-speed operation is achieved using a
combination of techniques such as low-voltage signaling, high-quality trans-
mission lines, channel topology, pseudodifferential inputs, current mode
drivers, differential clocks, and dense packaging. The RDRAMs use surface
mount chip scale packaging for high signal quality through reduced stub
inductance and low input capacitance. A Rambus channel includes 30 high-
speed, controlled-impedance, matched signal transmission lines. The high-speed signals are terminated at their characteristic impedance at the RDRAM end of the channel. Figure 4.6b shows the typical Rambus channel bus topology with
a memory controller at one end, RDRAMs in the middle, and termination
resistors at the other end.
The power is dissipated on the Rambus channel only when the device drives
a logic "1" (low voltage) on the data line. All high-speed signals on the Rambus
channel use low-voltage swings of 800 mV by using differential sensing. Each
of the inputs consists of a pair of differential clock samplers, one operating on
the rising edge of the clock and the other on the falling edge. The negative input of the input samplers is connected to Vref. The Rambus channel is synchronous,
which means that all commands and data are referenced to the clock edges. At
the physical level, data are only transferred across the DQA and DQB lines,
whereas all control information is sent across the ROW and COL pins. The
clock source can be a separate clock generator as shown in Figure 4.6b, or it
can be integrated into the memory controller.
The clock and data travel in parallel to minimize skew. Therefore, an RDRAM sends data to the memory controller synchronously with the ClockToMaster, and the controller sends data to the RDRAM synchronously with the ClockFromMaster. The data transfers occur only between the memory controller
and the RDRAMs, and never directly between the RDRAMs. Data driven by
the memory controller propagates past all the RDRAMs with the desired
voltage swing, so that all the RDRAMs can correctly sense the data. The signals are terminated at one end of the channel with matched terminators to eliminate any reflections. Data are effectively transferred on both edges of the
400-MHz clock, which results in 800-Mbps-per-wire transfer rate. Therefore, the
Rambus clock cycle is 2.5 ns. For full interleaving across even the first and last
RDRAMs on the channel, the protocol allows each device to be programmed
with a read delay. Also, due to the topology of the channel, write data can be sent
down the channel in the next clock cycle following a read transaction. As soon as
the read data reach the controller, the controller can issue the write data without
any delay. However, a write followed by a read transaction must wait a short
delay, equal to the round trip length of the channel.
The Rambus channel has data and control bits moving in packets. Each
packet type is four clock cycles in length (10 ns) at 800 MHz. A completely
independent control and address bus is split into two groups of pins, one for the
row commands and the other for column commands. Only data are transferred
across the two-byte-wide data bus. To ensure proper synchronization of all
devices connected to the Rambus channel, all packets begin during even
intervals (falling clock edge). However, the packets may begin on any falling
clock edge, and they can be spaced any number of clock cycles apart. Each of
the buses operates independently of the others, allowing a row command, a column command, and data to be transferred at the same time to different
banks of an RDRAM or to different RDRAMs.
The row packets are sent across three ROW pins of the control and address
bus. Row packets include two types of commands: activate (ACT) and precharge (PRER). An activate command can be sent to any RDRAM bank whose sense amplifiers have been previously precharged. Other commands can
be sent across the row pins, including refresh and power state control. The
column (COL) packets are sent across the five COL pins of the control and
address bus. The COL packets are split into two fields, where the first field
specifies the primary operation, such as a read or a write, and the second field
can be masks for writes or can be an extended (XOP) command. Several XOP
commands can be used instead of the mask field.
Each data packet contains 16 bytes of data. The RDRAMs use a 128-bit-
wide internal datapath, allowing 16 bytes of data to be transferred for each
column access. On the two-byte-wide channel, two bytes are transferred on
each rising and falling clock edge, so that 16 bytes are transferred in four clock
cycles (10 ns). In this way, the data packets have the same length as row and
column packets.
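As a rough arithmetic check of the figures quoted above, the short Python sketch below derives the per-wire rate, the packet duration, and the data-packet size from the 400-MHz channel clock; all constants are taken from the text, and the variable names are illustrative only.

# Back-of-the-envelope check of the Rambus channel figures quoted above.
CLOCK_MHZ = 400                 # channel clock; data moves on both edges
CLOCK_PERIOD_NS = 1e3 / CLOCK_MHZ          # 2.5-ns Rambus clock cycle
DATA_BYTES_PER_EDGE = 2         # two-byte-wide (x16) data bus

per_pin_rate_mbps = CLOCK_MHZ * 2                                # 800 Mbps per wire
peak_bandwidth_gbs = DATA_BYTES_PER_EDGE * 2 * CLOCK_MHZ / 1e3   # 1.6 GB/s

PACKET_CYCLES = 4               # every packet type spans four clock cycles
packet_ns = PACKET_CYCLES * CLOCK_PERIOD_NS              # 10 ns
dualoct_bytes = DATA_BYTES_PER_EDGE * 2 * PACKET_CYCLES  # 16 bytes per data packet

print(f"{per_pin_rate_mbps} Mbps/pin, {peak_bandwidth_gbs} GB/s peak, "
      f"{dualoct_bytes}-byte packet every {packet_ns:.0f} ns")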
Figure 4.7a shows a typical read operation [13]. In general, a completely
random access is accomplished by asserting the ACT command across the row

Figure 4.7 Rambus memory. (a) Read operation. (b) Write operation. (From reference 13, with permission of Rambus, Inc.)
pins followed by a read command sent across the COL pins. After the data are
read, a precharge command is executed to prepare that bank for another
completely random read. The data are always returned in a fixed, but user
selectable, number of clock cycles from the end of the read command. A write
transaction timing is similar to the read operation since the control packets are
sent the same way. Figure 4.7b shows a typical write operation.
One significant difference between an RDRAM and a conventional DRAM
is that the write data are delayed to match the timing of a read transaction's
data transfer. A write command on the COL bus tells the RDRAM that the
data will be written to the device in a precisely specified number of clock cycles
later. Normally, this data would then be written into the core as soon as the
data are received. Each of the commands on the control bus can be pipelined
for higher throughput.
The RDRAM's internal core supports a 128-/144-bit-wide data path operat-
ing at 100 MHz, which is one-eighth the clock rate of the channel. Thus, every
10 ns, 16 bytes can be transferred to or from the core. The RDRAMs have
separate data and control buses. The data bus permits data transfer rates up
to 800 MHz, capable of a 1.6-GB/s data transfer rate on either a x16 or x18 bus configuration. The control bus adds another 800 MB/s of control information to the RDRAM. The control bus is further separated into ROW and COL
pins, allowing concurrent row and column operations while the data are being
transferred from a previous command.
The Rambus architecture allows up to 1-Gbit DRAM densities, up to 32
RDRAMs per channel, and enough flexibility in the row, column, and bits to
allow for various configurations in these densities. The 64-/72-Mb RDRAMs
can support either 8 independent or "16" doubled banks. In a doubled-bank core, the number of sense amplifiers required is reduced to nearly half while keeping the total number of banks relatively high compared to other DRAM
alternatives. The larger number of banks helps prevent interference between the
memory requests. The number of banks accessible to the controller is the
cumulative number of banks across all the RDRAMs on the channel. However,
the restriction imposed by doubled banks is that adjacent banks cannot be activated simultaneously. Once a bank is activated, that bank must be precharged in order for
the adjacent bank to be activated.
For low-power system operation, the RDRAMs have several operating
modes, as follows: Active, Standby, Nap, and PowerDown. These four modes
are distinguished by (a) their respective power consumption and (b) the time
taken by the RDRAM to execute a transaction from that mode. An RDRAM
automatically transitions to a standby mode at the end of a transaction. In a
subsystem, when all the RDRAMs are in a standby mode, the RDRAM's logic
for row addresses is always monitoring the arrival of row packets. If an
RDRAM decodes a row packet and recognizes its address, that RDRAM will
transition to the active state to execute the read or write operation and will
then return to standby mode once the transaction is completed. Power
consumption can be further reduced by placing one or more of the RDRAMs into the Nap or PowerDown modes.
The Rambus Memory Controller (RMC) is a block of digital logic residing
on the Rambus-based controller as protocol support for the management of
read and write transactions to the Rambus DRAMs. The interface to the RMC
is a simple two-wire handshake. The RMC directly connects to the Rambus
ASIC cell (RAC) as an Input/Output cell. The RAC provides the basic
multiplexing/demultiplexing functions for converting from a byte-serial bus operating at the Rambus channel frequency (up to 800 MHz) to the memory controller's eight-byte bus operating at up to a 200-MHz signaling rate. The
RMC is provided as a synthesizable Verilog or VHDL source code, which can
be incorporated in the memory controller's design process.
In a typical memory system configuration, the RDRAMs provide more
system memory banks than the SDRAMs or other conventional DRAMs on a
per-megabyte basis. The Direct RDRAMs incorporate the same physical layer as their predecessors (Rambus Base and Concurrent RDRAMs), with the major differences being (a) the channel width, which is 18 bits instead of 9, and (b) the address and control information, which is no longer multiplexed onto the data field.
The Direct RDRAM protocol introduces direct control of all the row and
column resources concurrently with the data transfer operations (hence the
name "direct"). It also supports explicit control of the precharge and row
sensing operations as well as the data scheduling during concurrent column
operations. A Direct RDRAM can therefore perform row precharging and
sensing operations concurrently with the column operations to provide on-chip
interleaving. This implies that the user can schedule the data resulting from the
row operation to appear immediately after the completion of column oper-
ation. The interleaving can only occur when the requests target different banks
in either the same Direct RDRAM or a different RDRAM on the channel. The
more banks in a system, the better the chances that any two requests will be
mapped to different banks. The higher interleaving can improve memory
system performance.
The designers can maintain precise control over clock-to-data delay and bus
sample points by compensating the internal clock skew and duty cycle with
delay locked loops (DLLs). The DLLs allow all bus transfers to operate so that
they are synchronized to both edges of a 400-MHz clock to provide an
800-MHz data rate per pin. Each Direct RDRAM contains two DLLs for
locking the internal clocks to the external clocks. All high-speed, Direct
Rambus signals are single-ended except the clocks, which are differential. The
high level for the buses is 1.8 V, and the low level is 1.0 V. A common bused
reference voltage of 1.4 V feeds into all input receivers, which are differential
amplifiers.
The memory system designs based on Direct RDRAMs can be upgraded by
using Direct Rambus DRAM modules (called the RIMMs) that fit into the
sockets similar to the industry-standard DIMMs. These RIMMs, although similar in appearance to DIMMs (which are connected in parallel), are basically different because the RIMMs are connected in series when installed in a system.

4.3.1. Direct RDRAM Technologies and Architectures


The Direct RDRAMs are currently available in the following three configurations,
which have quite similar architectural and performance features except for
density:

1. 64-/72-Mbit organized as 4M words x 16 or 18 bits


2. 128-/144-Mbit organized as 8M words x 16 or 18 bits
3. 256-/288-Mbit organized as 16M words x 16 or 18 bits

The 64-/72-Mbit RDRAMs have 16 banks, compared to the 32-bank architecture of the 128-/144-Mbit and 256-/288-Mbit devices.
In general, the Direct RDRAM architecture allows sustained bandwidth for
multiple, simultaneous randomly accessed memory transactions. The 32 banks
can allow four transactions simultaneously at full data rates. The x18 organization can allow implementation of an ECC scheme. The Direct RDRAM consists basically of two major blocks: (a) a core block built from the banks and sense amplifiers similar to those found in other types of DRAMs and (b) a Direct Rambus interface block, which permits an external controller to access this core at up to 1.6 GB/s. Figure 4.8 shows the block diagram of the 128-/144-
Mbit Direct RDRAM [14]. The major functional components, including pin
descriptions, commands, and packet format, are described below.

Pin Descriptions

• Control Registers The SCK, CMD, SIO0, and SIO1 pins (shown in the upper center of Figure 4.8) are used to write and read a block of control
registers, which supply the RDRAM configuration information to a
controller and select the operating modes of the device. The 9-bit REFR
value is used for tracking of the last refreshed row, and 5-bit DEVID
specifies the device address to the RDRAM on the channel.
• Clocking The CTM and CTMN pins (Clock-to-Master) generate TCLK
(Transmit Clock), the internal clock used to transmit the read data. The
CFM and CFMN pins (Clock-from-Master) generate RCLK (Receive
Clock), the internal clock signal used to receive the write data and to
receive the ROW and COL pins.
• DQA, DQB Pins These 18 pins carry read (Q) and write (D) data across
the channel. They are multiplexed or demultiplexed from (or to) two
72-bit data paths that are running at one-eighth the data frequency, inside
the RDRAM.
Figure 4.8 Block diagram of 128-/144-Mbit Direct RDRAM. (From reference 14, with
permission of Rambus, Inc.)

• Banks The 16-Mbyte core of the RDRAM is divided into 32 banks of 0.5 Mbyte each, with each bank organized as 512 rows and each row containing 64 dualocts. A dualoct is the smallest unit of data that can be addressed, and each dualoct contains 16 bytes. (A quick arithmetic check of these figures appears after this list.)

• Sense Amplifiers The RDRAM contains 34 sense amplifiers (as compared to 17 for the 64-/72-Mbit organization). Each sense amplifier consists of 512 bytes of fast storage (256 for DQA and 256 for DQB) and can hold
one-half of one row of one bank of the RDRAM. The sense amplifier may
hold any of the 512 half-rows of an associated bank. However, each sense
amplifier is shared between two adjacent banks of the RDRAM (except for numbers 0, 15, 30, and 31), which introduces the restriction that
adjacent banks may not be simultaneously accessed.
• RQ Pins These pins carry the control and address information, and they
are divided into two groups. One group of pins (RQ7, RQ6, RQ5, which are also called ROW2, ROW1, ROW0) is used primarily for controlling the row accesses. The second group of pins (RQ4, RQ3, RQ2, RQ1, RQ0, which are also called COL4, COL3, COL2, COL1, COL0) is used primarily for controlling the column accesses.
• ROW Pins The main function of these three pins is to manage the
transfer of data between the banks and the sense amplifiers of the
RDRAM. These pins are demultiplexed into a 24-bit ROWA (row
activate) or ROWR (row operation) packet.
• COL Pins The main function of these five pins is to manage the transfer
of data between the DQA/DQB pins and the sense amplifiers of the
RDRAM. These pins are demultiplexed into a 23-bit COLC (column
operation) packet and either a 17-bit COLM (mask) packet or a 17-bit
COLX (extended operation) packet.
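The following Python sketch is a quick arithmetic check of the core geometry given in the Banks entry above; the figures are those quoted in the text.

# Core geometry check for the 128-/144-Mbit Direct RDRAM.
BANKS = 32
ROWS_PER_BANK = 512
DUALOCTS_PER_ROW = 64
DUALOCT_BYTES = 16

bank_bytes = ROWS_PER_BANK * DUALOCTS_PER_ROW * DUALOCT_BYTES   # 0.5 Mbyte per bank
core_bytes = BANKS * bank_bytes                                 # 16-Mbyte core
core_mbit_x16 = core_bytes * 8 // 2**20                         # 128 Mbit (8-bit bytes)
core_mbit_x18 = core_bytes * 9 // 2**20                         # 144 Mbit (9-bit bytes)

print(bank_bytes // 2**10, "KB per bank;", core_bytes // 2**20, "MB core;",
      core_mbit_x16, "/", core_mbit_x18, "Mbit")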

Commands These are the major commands:


• ACT Command An ACT (activate) command from a ROWA packet
causes one of the 512 rows of the selected bank to be loaded to its
associated sense amplifiers (two 256-byte sense amplifiers for DQA and
two for DQB).
• PRER Command A PRER (precharge) command from a ROWR packet
causes the selected bank to release its two associated sense amplifiers,
which allows a different row in that bank to be activated or permits
adjacent banks to be activated.
• RD Command The RD (read) command causes one of the 64 dualocts of
one of the sense amplifiers to be transmitted on the DQA/DQB pins of
the channel.
• WR Command The WR (write) command causes a dualoct received from
the DQA/DQB data pins of the channel to be loaded into the write buffer,
which also has space for the BC bank address and C column address
information. The data in the write buffer are automatically retired (written
with optional bytemask) to one of the 64 dualocts of one of the sense
amplifiers during a subsequent COP command. A retire can take place
during a RD, WR, or NOCOP to another device, or during a WR or
NOCOP to the same device. The write buffer will not retire during a RD
to the same device.
• PREC Precharge The PREC, RDA, and WRA commands are similar to
the NOCOP, RD, and WR, except that a precharge operation is per-
formed at the end of the column operation. These commands provide a
second mechanism for performing the precharge operation.
• PREX Precharge After an RD command, or after a WR command with
no byte masking (M = 0), a COLX packet may be used to specify an
extended operation (XOP). The most important XOP command is PREX,
which provides a third mechanism for performing a precharge operation.

Packet Formats Figure 4.9 shows the format of the ROWA and ROWR
packets on the ROW pins [14]. Table 4.3a describes the fields that comprise
these row packets [14]. For example, DR4T and DR4F bits are encoded to
contain both the DR4 device address bit and a framing bit, which allows the
ROWA or ROWR packet to be recognized by the RDRAM. The AV
(ROWA/ROWR packet selection) bit distinguishes between the two packet
types. Both the ROWA and ROWR packet provide a 5-bit device address and
a 5-bit bank address. A ROWA packet uses the remaining bits to specify a 9-bit
row address, and the ROWR packet uses the remaining bits for an 11-bit
opcode field.
Figure 4.9 also shows the formats of COLC, COLM, and COLX packets on the COL pins. Table 4.3b describes the fields that comprise these column packets. The COLC packet uses the S (start) bit for framing. A COLM or COLX packet is aligned with this COLC packet, and it is also framed by the
S bit. The 23-bit COLC packet has a 5-bit device address, a 5-bit bank address,
a 6-bit column address, and a 4-bit opcode. The COLC packet specifies a read
or a write command, as well as some power management commands.
The remaining 17 bits are interpreted as a COLM (M = 1) or COLX
(M = 0) packet. A COLM packet is used for a COLC write command, which
needs bytemask control. A COLX packet may be used to specify an indepen-
dent precharge command. It contains a 5-bit device address, a 5-bit bank
address, and a 5-bit opcode. The COLX packet may also be used to specify
some housekeeping and power management commands. The COLX packet is
framed within a COLC packet but is not otherwise associated with any other
packet.
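To make the field widths concrete, the sketch below packs the ROWA and COLC address fields listed above into integers. The field sizes follow the text, but the bit ordering on the wire is not specified here, so the layout chosen below is purely illustrative.

# Illustrative packing of ROWA and COLC packet fields (sizes per the text;
# the bit positions are assumptions made only for this sketch).
def pack_rowa(device: int, bank: int, row: int) -> int:
    """24-bit ROWA payload: 5-bit device, 5-bit bank, AV=1, 9-bit row."""
    assert 0 <= device < 32 and 0 <= bank < 32 and 0 <= row < 512
    av = 1                                     # AV=1 selects a ROWA packet
    return (device << 19) | (bank << 14) | (av << 13) | (row << 4)

def pack_colc(device: int, bank: int, column: int, opcode: int) -> int:
    """23-bit COLC payload: S bit, 4-bit opcode, 5-bit device, 5-bit bank,
    6-bit column."""
    assert 0 <= device < 32 and 0 <= bank < 32 and 0 <= column < 64 and 0 <= opcode < 16
    s = 1                                      # start (framing) bit
    return (s << 22) | (opcode << 18) | (device << 13) | (bank << 8) | (column << 2)

print(hex(pack_rowa(device=3, bank=7, row=200)), hex(pack_colc(3, 7, 12, 0b0001)))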
A row cycle begins with the activate (ACT) operation. The activation
process is destructive, that is, the act of sensing the value of a bit in a bank's
storage cell transfers the bit to the sense amplifier, but leaves the original bit
in the storage cell with an incorrect value. Because the activation process is
destructive, a hidden operation called restore is automatically performed. The
restore operation rewrites the bits in the sense amplifier back into the storage
cells of the activated row of the bank. While the restore operation takes place,
the sense amplifier may be read (RD) and written (WR) using the column
operations. If new data are written into the sense amplifier, it is automatically
forwarded to the storage cells of the bank, so that the data in the activated row
Figure 4.9 Direct RDRAM 128-/144-Mbit row and column packet formats. (From reference 14, with permission of Rambus, Inc.)

and the data in the sense amplifier remain identical. When both the restore and
the column operations are completed, the sense amplifier and bank are
precharged (PRE). This leaves them in the proper state to begin another
activate operation.

Read Transaction-Example Figure 4.10a shows an example of a read transaction [14]. It begins by activating a bank with an ACT a0 command in
TABLE 4.3 Direct RDRAM Field Descriptions for (a) ROWA and ROWR Packets
and (b) COLC, COLM, and COLX Packets

Field          Description

(a) ROWA and ROWR Packets

DR4T, DR4F     Bits for framing (recognizing) a ROWA or ROWR packet.
               Also encodes the highest device address bit.
DR3..DR0       Device address for ROWA or ROWR packet.
BR4..BR0       Bank address for ROWA or ROWR packet. RsvB denotes
               bits ignored by the RDRAM.
AV             Selects between ROWA packet (AV = 1) and ROWR
               packet (AV = 0).
R8..R0         Row address for ROWA packet. RsvR denotes bits ignored
               by the RDRAM.
ROP10..ROP0    Opcode field for ROWR packet. Specifies precharge,
               refresh, and power management functions.

(b) COLC, COLM, and COLX Packets

S              Bit for framing (recognizing) a COLC packet, and
               indirectly for framing COLM and COLX packets.
DC4..DC0       Device address for COLC packet.
BC4..BC0       Bank address for COLC packet. RsvB denotes bits
               reserved for future extension (controller drives 0's).
C5..C0         Column address for COLC packet. RsvC denotes bits
               ignored by the RDRAM.
COP3..COP0     Opcode field for COLC packet. Specifies read, write,
               precharge, and power management functions.
M              Selects between COLM packet (M = 1) and COLX
               packet (M = 0).
MA7..MA0       Bytemask write control bits. 1 = write, 0 = no-write.
               MA0 controls the earliest byte on DQA8..0.
MB7..MB0       Bytemask write control bits. 1 = write, 0 = no-write.
               MB0 controls the earliest byte on DQB8..0.
DX4..DX0       Device address for COLX packet.
BX4..BX0       Bank address for COLX packet. RsvB denotes bits
               reserved for future extension (controller drives 0's).
XOP4..XOP0     Opcode field for COLX packet. Specifies precharge,
               IOL control, and power management functions.

Source: Reference 15, with permission of Rambus Inc.

a ROWA packet, followed by another command, RD a1, issued a time tRCD later in a COLC packet. The ACT command includes the device, bank, and row address (abbreviated as a0), while the RD command includes the device, bank, and column address (abbreviated as a1). A time tCAC after the RD command, the read data dualoct Q(a1) is returned by the device. It should be noted that
Figure 4.10 Direct RDRAM example. (a) Read transaction. (b) Write transaction. (From reference 14, with permission of Rambus, Inc.)

the packets on ROW and COL pins use the end of the packet as a timing
reference point, while the packets on the DQA/DQB pins use the beginning of
the packet as a timing reference point.
A time tCC after the first COLC packet on the COL pins, a second is issued, which contains an RD a2 command. The a2 address has the same device and bank address as the a1 address (and the a0 address), but a different column address. A time tCAC after the second RD command, a second read data dualoct Q(a2) is returned by the device. Next, a PRER a3 command is issued in a ROWR packet on the ROW pins, which causes the bank to precharge so that a different row may be activated in a subsequent transaction or so that an adjacent bank may be activated. The a3 address includes the same device and bank address as the a0, a1, and a2 addresses. The PRER command must occur a time tRAS or more after the ACT command, and also a time tRDP or more after the last RD command. This transaction example reads two dualocts, but there is actually time to read three dualocts before tRDP becomes the limiting parameter rather than tRAS.
Finally, an ACT b0 command is issued in a ROWA packet on the ROW pins. The second ACT command must occur a time tRC or more after the first ACT command and a time tRP or more after the PRER command, to ensure
that the bank and its associated sense amplifiers are precharged. This example
(for both the read and write transactions) assumes that the second transaction
has the same device and bank address as the first transaction, but a different
row address. The transaction b may not be started until transaction a has been
completed. However, the transactions to other banks or devices may be issued
during transaction a.
The interleaved read transactions are similar to the example shown in Figure 4.10a, except that they are directed to the nonadjacent banks of a single RDRAM and the DQ data pin efficiency is 100%.
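The sketch below strings together the delays named in the read example above. Only the 2.5-ns tick and the four-tick packet length come from the text; the tRCD, tCAC, and tCC tick counts are placeholder assumptions, not datasheet values.

# Rough timeline of the two-dualoct read transaction described above.
TICK_NS = 2.5
PACKET_TICKS = 4                      # every packet spans four ticks (10 ns)

tRCD = 7 * TICK_NS                    # ACT-to-RD spacing (assumed value)
tCAC = 8 * TICK_NS                    # RD-to-data delay (assumed value)
tCC = PACKET_TICKS * TICK_NS          # spacing of back-to-back COLC packets

act_end = PACKET_TICKS * TICK_NS      # ROW/COL packets are referenced to their end
rd1_end = act_end + tRCD
rd2_end = rd1_end + tCC
q1_start = rd1_end + tCAC             # DQ packets are referenced to their beginning
q2_start = rd2_end + tCAC

for name, t in [("RD a1 ends", rd1_end), ("Q(a1) starts", q1_start),
                ("RD a2 ends", rd2_end), ("Q(a2) starts", q2_start)]:
    print(f"{name:12s} at {t:5.1f} ns")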

Write Transaction-Example Figure 4.10b shows the example of a write transaction that begins by activating a bank with an ACT a0 command in a ROWA packet. Another command, WR a1, is issued a time tRCD - tRTR later in a COLC packet. The ACT command includes the device, bank, and row address (abbreviated as a0), while the write command includes the device, bank, and column address (abbreviated as a1). A time tCWD after the WR command, the write data dualoct D(a1) is issued.
A time tCC after the first COLC packet on the COL pins, a second COLC packet is issued, which contains a WR a2 command. The a2 address has the same device and bank address as the a1 address (and the a0 address), but a different column address. A time tCWD after the second WR command, a second write dualoct D(a2) is issued. A time tRTR after each WR command, an optional COLM packet MSK(a1) is issued, and at the same time a COLC packet is issued, causing the write buffer to automatically retire. If a COLM packet is not issued, all the data bytes are unconditionally written. If the COLC packet, which causes the write buffer to retire, is delayed, then the COLM packet (if used) must also be delayed.
Next, a PRER a3 command is issued in a ROWR packet on the ROW pins,
which causes the bank to precharge so that a different row may be activated
in a subsequent transaction or so that an adjacent bank may be activated. The
a3 address includes the same device and bank address as the a0, a1, and a2 addresses. The PRER command must occur a time tRAS or more after the original ACT command. It should be noted that the activation operation in any DRAM is destructive, and the contents of the selected row must be restored from the two associated sense amplifiers of the bank during the tRAS interval. In addition, the PRER command must occur a time tRTP or more after the last COLC that causes an automatic retire.
Finally, an ACT b0 command is issued in a ROWA packet on the ROW pins, and the second ACT command must occur at a time tRC or more after the first ACT command, as well as at a time tRP or more after the PRER command.
The process of writing a dualoct into the sense amplifier of an RDRAM bank occurs in two steps: (1) The write command, write address, and write data are
transported into the write buffer and (2) the RDRAM automatically retires the
write buffer, with an optional bytemask, into the sense amplifier. This two-step
write process reduces the natural turn-around delay due to the internal
bidirectional data pins. The interleaved write transactions are similar to the
one shown in Figure 4.10b, except that they are directed to the nonadjacent
banks of a single RDRAM. This allows a new transaction to be issued once
every tRR interval rather than once every tRC interval, and the DQ data pin
efficiency is 100% with this sequence.
The Direct Rambus clock generator (DRCG) provides the channel clock
signals for the Rambus memory subsystem and includes signals for synchron-
ization of the Rambus channel clock to an external system clock. On the logic
side, the Rambus interface consists of two components: the Rambus ASIC cell
(RAC) and the Rambus memory controller (RMC). The RAC physically
connects through the package pins to the Rambus channel and is a library
macrocell implemented in ASIC design to interface the core logic of the ASIC
device to a high-speed Rambus channel. The RAC typically resides in a portion
of the ASIC's I/O pad ring and converts the high-speed Rambus signal level
(RSL) on the Rambus channel into lower-speed CMOS levels usable for the
ASIC design. The RAC functions as a high-performance parallel-to-serial and
serial-to-parallel converter performing the packing and unpacking functions of
high-frequency data packets into the wider and synchronous 128-bit (Rambus)
data words.
The RAC consists of two delay-locked loops (DLLs), input and output (I/O)
driver cells, input and output shift registers, and multiplexers. The two DLLs
provided are a transmit DLL and a receive DLL. The transmit DLL ensures
that the written commands and data are transmitted in precise 180-degree
phase quadrature with an associated Clock from Master (CFM) clock. The
receive DLL ensures that a proper phase is retained between the incoming read
data and its associated Clock to Master (CTM) clock.
Figure 4.11 shows the block diagram of a RAC cell [14]. The external
interface, which consists of the RSL high-speed channel, is referred to as the
Rambus Channel Interface, while the internal lower speed CMOS level signals
are referred to as the Application Port Interface. A typical Rambus channel
can deliver two bytes of data every 1.25 ns, which is seen as 16 bytes of data
every 10 ns on the Application Port Interface. These data are referenced to the SynClk.
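The 8:1 packing ratio between the two interfaces can be checked directly from the rates quoted above; the sketch uses only those figures.

# Width/rate conversion performed by the RAC, per the figures above.
channel_bytes, channel_period_ns = 2, 1.25    # two bytes every 1.25 ns on the channel
port_bytes, port_period_ns = 16, 10.0         # sixteen bytes every 10 ns at the port

channel_gbs = channel_bytes / channel_period_ns   # 1.6 GB/s
port_gbs = port_bytes / port_period_ns            # 1.6 GB/s (same sustained rate)
mux_ratio = port_bytes // channel_bytes           # 8:1 serial-to-parallel ratio

print(channel_gbs, port_gbs, mux_ratio)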

4.3.2. Direct Rambus Memory System Based Designs


In traditional DRAM-based memory system designs, architectures and algo-
rithms have been developed that can detect and correct multiple bit errors such
Figure 4.11 Direct Rambus ASIC cell (RAC) block diagram. (From reference 15, with permission of Rambus, Inc.)

that a system can continue to operate even if an entire DRAM device fails. This
capability is known in the industry as "chipkill" [15]. The Hamming error
correction code (ECC) scheme has been widely used and involves attaching a
number of checksum or syndrome bits along with the corresponding data, as
it is being transmitted (or written to the memory). On the receiving side, the
controller again generates syndrome bits based on the received data pattern
and compares it against the syndrome bits stored from the write operation, and
the comparison can correct single-bit errors and detect double-bit errors.
Therefore, this scheme is called single-bit error correction, double-bit error
detection (see Semiconductor Memories, Chapter 5.6).
In most ECC-based systems, 64 data bits are used along with 8 additional
syndrome bits, resulting in a total word size of 72 bits. The Rambus DRAM
supports the ECC approach using the x 18 organization, which operates as a
144-bit datapath, 128 bits of which can be used as data, while the remaining
16 bits can be used for syndrome (9 are needed for ECC) or other functions.
This can only be effective for double-bit error detection and single-bit error
correction, and multiple errors contained in the data word are not correctable
using this technique. For chipkill protection, architectural partitioning of the
memory array is used along with an ECC coding technique, which spreads the
data word across many DRAMs such that any individual DRAM contributes
only one bit. The major drawback of this approach is that the system requires
a minimum of 72 DRAMs (using x 1 DRAMs) in the case of the 72-bit ECC
word. Using the x 4 DRAM configuration requires increasing the ECC word
size and number of ECC checkers by a corresponding multiple to a total of
288 data bits and four ECC checkers.
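The check-bit counts quoted above follow from the usual Hamming bound; the small helper below (a generic calculation, not tied to any particular controller) reproduces the 8-bit and 9-bit figures.

# Check bits for a Hamming SEC-DED code over d data bits: the smallest r
# with 2**r >= d + r + 1, plus one overall parity bit for double detection.
def secded_check_bits(data_bits: int) -> int:
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

print(secded_check_bits(64))    # 8 -> the classic 72-bit ECC word
print(secded_check_bits(128))   # 9 -> fits within the 16 spare bits of a x18 dualoct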
The implementation of a traditional chipkill technique for a Rambus-based


DRAM system is complicated by the fact that a single RDRAM returns 16
bytes of data (one dualoct) across all its data pins. To spread out the ECC
syndrome bits across the required number of RDRAMs involves a combina-
tion of many channels operating in parallel and/or multiple requests to several
RDRAMs on any given channel. However, the tradeoff for chipkill support is
reduced memory bandwidth and/or added memory latency. To address these
issues, Rambus has implemented a new feature in 256-Mbit memory generation
called interleaved data mode (IDM), which is completely transparent and
compatible with the existing RDRAMs and memory controller implementa-
tions.
The IDM enables a group of RDRAMs on a single channel to respond to
a single request, which is similar to a wide SDRAM datapath employing
multiple devices acting in parallel. Each RDRAM in the group receives and
transmits information on unique data pin(s) for the entire CAS cycle, so that
instead of a single RDRAM returning all 16 bytes, multiple RDRAMs in
parallel transmit/receive data on separate pins to make up the dualoct. This
approach enables the ECC word to be distributed efficiently across the multiple devices while still maintaining a peak memory bandwidth of 1.6 GB/s per
channel with no added latency.
The IDM is enabled during the initialization by a register bit setting. When
enabled, eight RDRAM devices on a channel will respond in parallel to row
and column requests. Each of eight RDRAMs will transmit and receive
information on unique data pins, determined by a mapping function of the
device identification field in the row and column packets. The address to pin
mapping is handled by the RDRAM; hence the row and column packet
formats themselves are unchanged in IDM. In the x 16 RDRAM organization,
each device will receive/transmit two data bytes across two unique pins (one
from DQA, the other from DQB), whereas in the x 18 organization, six devices
will receive/transmit two data bytes across two unique pins, while the other
two devices will receive/transmit three bytes across three pins.
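A small sketch of the pin-count arithmetic behind that description; which devices carry the extra pin in the x18 case is an assumption of this sketch, since the text only gives the counts.

# Pins per device when eight RDRAMs share one dualoct in IDM.
def idm_pins_per_device(total_pins: int, devices: int = 8):
    base, extra = divmod(total_pins, devices)
    return [base + 1 if i < extra else base for i in range(devices)]

print(idm_pins_per_device(16))   # x16: [2, 2, 2, 2, 2, 2, 2, 2]
print(idm_pins_per_device(18))   # x18: [3, 3, 2, 2, 2, 2, 2, 2]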
The memory bandwidth projections for workstation and server performance for the next several years call for multigigabyte-per-second (GB/s) transfer rates to be delivered from gigabytes of memory storage [16]. In the conven-
tional approach, these memory systems have tended to use commodity
DRAMs in wide datapaths, such as 288 or even 576 bits wide to achieve the
bandwidth requirements. For large memory system designs, the Direct Rambus
Channel can support up to 32 devices without expansion buffer chips. Similar-
ly, the SDRAM interface on the controller can drive up to 36 SDRAMs
without buffers. A comparison of the two approaches shows that for a given
number of controller pins, a Direct Rambus system provides four times the
capacity (or four times the bandwidth) over a SDRAM system, or some other
combination of the two.
In order to support gigabytes of RAM and high bandwidth, system
designers often evaluate a range of system configurations. Using conventional
SYNCHRONOUS LINK DRAMs 275

DRAMs in a high-bandwidth system requires wide data buses. For example, a


workstation could support 288 bits of datapath to an array of DRAMs, so that
a gigabyte of RAM could be implemented with four ranks, where each rank is
a row with 18 of the 256-Mb DRAMs. By using 66-MHz SDRAMs, the system supports a 2.4-GB/s peak data rate. Rambus memory systems have been
implemented with as many as 80 channels (even higher possible). As an
example, with ASIC technology, a single Rambus memory controller can
support up to eight channels for eight gigabytes of RAM using 256-Mb
RDRAMs for 9.6-GB/s peak data rate. The expansion buffer chips can be
added to either of these systems to further increase capacity. However, in
buffered systems, latency can become a concern.
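The peak-bandwidth arithmetic behind this comparison is sketched below. The per-channel figure of 1.6 GB/s and the 288-bit/66-MHz SDRAM example come from the text; the channel counts shown are arbitrary, and real system figures also depend on the particular controller configuration.

# Peak data-rate arithmetic for the two approaches discussed above.
sdram_peak_gbs = 288 / 8 * 66e6 / 1e9        # 288-bit datapath at 66 MHz ~ 2.4 GB/s

def rambus_peak_gbs(channels: int) -> float:
    return 1.6 * channels                    # 1.6 GB/s per Direct Rambus channel

print(f"288-bit SDRAM @ 66 MHz: {sdram_peak_gbs:.1f} GB/s")
for n in (1, 2, 4):
    print(f"{n} Rambus channel(s):   {rambus_peak_gbs(n):.1f} GB/s")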
The emerging 3D and multimedia applications require high memory band-
width under a wide variety of demanding conditions. For example, 3D
rendering requires a heavy mixture of random read and write operations,
random length accesses, and unaligned data transfers. In addition, the memory
accesses can occur as simultaneous threads to different portions of the frame
buffer including the z-buffer, the display buffer, the back buffer, and local
texture memory [17]. Therefore, under these conditions, the sustained band-
width becomes a critical issue. In addition, the frame buffer capacity is a key
consideration since 3D is not only demanding more bandwidth from local
graphics frame buffers, but also more memory. The growing demand for faster
response time from 3D games is also driving up the local frame buffer sizes.
Direct RDRAMs have three features that make them very efficient for
graphic operations: zero delay between the reads and writes, independent row
and column resources, and pipelined write capability. The lack of a delay
between the read and write transactions enables faster z-buffer and pixel
read-modify-write operations. The independent row and column resources
allow one page to be activated while the column of a different page is being
accessed (i.e., z-buffer page activate during the pixel write). The pipelined write
capability allows for a subsequent read or write command to be queued before
the initial write data are written.

4.4. SYNCHRONOUS LINK DRAMs (SLDRAMs)

Synchronous link DRAM (SLDRAM) is a new memory interface specification developed through the cooperative efforts of leading semiconductor
memory manufacturers with a goal to meet the high data bandwidth require-
ments of emerging processor architecture while retaining the low cost of earlier
DRAM interface standards [18]. During the development phases of SDRAM
in 1989-1990, the IEEE Std. P1596 was developed by the Scalable Coherent
Interface (SCI) group to address memory interfacing issues. As a result, IEEE
Std. 1596.4 was introduced for memory interfaces as a RamLink, which is a
point-to-point interface using the SCI protocol but with a reduced command
set that makes it memory-specific. However, it was realized that the RamLink
had latency limitations for larger memory configurations, and therefore a


parallel interface would be more effective. The SLDRAM consortium includes the following companies: Fujitsu Semiconductors; Hitachi, Inc.; Hyundai Elec-
tronics Co., Ltd.; IBM Microelectronics; LG Semiconductor Co., Ltd.; Mat-
sushita Electric Industrial Co., Ltd.; Micron Technology, Inc.; Mitsubishi
Electric Corp.; Mosaid Technologies, Inc.; Mosel Vitelic, Inc.; NEC; Oki
Electric Industry Co., Ltd.; Samsung Electronics; Infineon Technologies; Texas
Instruments, Inc.; and Toshiba Corp.
The name synchronous link was selected because the new standard would
basically address the most efficient way of linking SDRAMs to get the highest
performance. To enable that, extensive simulations were performed using the
address, control, read/write operations between the controller and memory
devices utilizing various bus configurations. The SLDRAM optimal configur-
ation was selected to fully utilize the data bus bandwidth and consists of (a) a
unidirectional bus for the command and address and (b) a bidirectional bus
for the read and write data.
SDRAMs (see Chapter 3) include several important architectural features
over the standard EDO, including multiple internal banks, a clocked syn-
chronous interface, terminated small-signal signaling scheme, and programm-
able data bursts. The DDR standard makes further improvements by data
clocking on both edges and a return clock, allowing higher-speed operation
and better system timing margin. The SLDRAM enhances features of the
SDRAMs and DDRs by adding an address/control packet protocol, in-system
timing and signaling scheme optimization, and scalability to future generation
of these devices. The first-generation SLDRAMs have been 64-Mb devices operating at 400 Mbps/pin, with follow-on offerings that have interfaces with bandwidths of 600 and 800 Mbps/pin and even higher. The SLDRAM protocol allows mixed interfaces of different speeds. For example, when plugged into a 400-Mbps/pin system, an 800-Mbps/pin device will operate correctly at 400 Mbps/pin.
The SDRAM, DDR, and SLDRAM all use a DRAM core that has a page
mode cycle time of approximately 10 ns. To maintain an efficient die layout
and obtain an interface rate higher than the core cycle time, the device must
fetch several words in parallel. However, the need to widen the internal
datapath also leads to a die cost penalty. For example, the DDRs running at
200 Mbps/pin become cost effective for 64-Mb devices, which have sufficient
number of active memory subarrays to support a 32-bit datapath without
substantial area penalty. The SLDRAMs with a 400-Mbps/pin, 16-bit I/O
interface employing a 64-bit internal datapath will be cost effective in the
256-Mb density. The 800-Mbps/pin data rates will employ a 128-bit internal datapath.
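The internal prefetch widths quoted above follow directly from the pin rate and the roughly 10-ns core cycle; the helper below reproduces them (the 10-ns figure is the approximate value given in the text).

# Internal datapath width needed to sustain a given pin rate over one
# ~10-ns DRAM core cycle.
def internal_datapath_bits(pin_rate_mbps: float, io_width: int,
                           core_cycle_ns: float = 10.0) -> int:
    bits_per_pin = pin_rate_mbps * 1e6 * core_cycle_ns * 1e-9
    return int(bits_per_pin * io_width)

print(internal_datapath_bits(200, 16))   # DDR class:     32-bit internal path
print(internal_datapath_bits(400, 16))   # 400 Mb/s/pin:  64-bit internal path
print(internal_datapath_bits(800, 16))   # 800 Mb/s/pin: 128-bit internal path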
The SLDRAM is specified as a general-purpose, high-performance DRAM
for a wide variety of applications. The SLDRAM is an open standard that will
be formalized by IEEE Std. P1596.7 and the JEDEC specifications. This
SyncLink standard specifies a high-bandwidth and packet-based interface
optimized for interchanging data between a memory controller and one or


more DRAMs.

4.4.1. SLDRAM Standard


As memory densities have increased, the need to access a video image every
refresh cycle has been met by increasing the memory chip interface widths to
4-bit, 8-bit, 16-bit, and even 32-bit widths. A single chip can provide sufficient
storage for holding a video image or an operating system, if the access
bandwidth requirements are met. The SyncLink architecture that increases the
memory component bandwidth is expected to reduce the costs of small
systems, where the minimum number of memories is driven by the system's
bandwidth (not storage size) requirements. The SyncLink protocols are int-
ended to be technology-independent, assuming that the controller could be
integrated with the processor for small systems or could efficiently interface to
a variety of system buses (for large systems).
The SyncLink provides higher performance by reducing the number of
feature options supported, since only a few burst transfer sizes are defined, and
no byte-selects for partial-word writes are provided. A strictly fixed timing is adopted, so that pipeline techniques can be used to optimize the SLDRAM
performance. Therefore, leaving a row open has little speed advantage, but it
can be used to save power during the read/modify/write sequences. The
SyncLink is intended to be an interconnect standard for the memory arrays,
and these are the major objectives influencing its development:

• Scalability to a majority of future DRAM applications


• Provide memory controllers the capability to schedule the responses for concurrently active requests
• Flexibility to support other RAM-like components that emulate the
RAM-chip component characteristics including (but not limited to) the
ROM and Flash memories, high-bandwidth I/O devices, a bridge to other
interconnect systems (such as SCI), and so on.

The primary goal of the SyncLink standard has been to support low-cost


commodity SLDRAM parts. To meet the objectives listed above, the design
strategy for interface standard has been to make it as simple as possible while
shifting the complexity over to the memory controller functions, rather than
memory chips. Also, the mixed DRAM components should be supported by a
simple controller, which assumes the slowest DRAM access time for all.
Therefore, the scope of the SyncLink standard established to meet this design
strategy is to leverage the physical layers and packet formats developed by
others, as follows:

1. Protocols. The SyncLink interface uses RamLink-like protocols to com-


municate between a controller and one or more memory devices.
2. Electrical Signals. The SyncLink interface specifies signals used to com-
municate between the controller and one or more SLDRAMs, and it
references other standards for detailed signal levels and timing character-
istics.
3. Physical Packaging. The SyncLink interface does not specify physical
packaging requirements, because other standard groups (e.g., JEDEC)
are expected to define the physical packaging standards based on market
requirements.

The major application of SyncLink is the interconnection of a memory


controller to a small number of commodity RAMs, as shown in Figure 4.12 [19]. For larger systems and higher-performance applications, multiple links are expected to be used to improve the total available bandwidth for large transfers and to reduce the average latency for small, interleaved transfers.
A SyncLink RAM device may have multiple sub-RAMs or blocks, as shown
in Figure 4.13a [19]. This implies that except for initialization, the blocks act

Figure 4.12 SyncLink standard interconnection of a memory controller to a small number of commodity RAMs. (From reference 19, with permission of IEEE.)

Figure 4.13 SyncLink memory. (a) Multiple sub-RAMs or blocks. (b) A typical small memory subsystem design. (From reference 19, with permission of IEEE.)

essentially like independent RAMs. The banks contain rows, and rows contain
columns. A row is the amount of data read or written to one of the chip's
internal storage arrays. Columns are subsets of rows that are read or written
in individual read or write operations, as seen by the chip interface. For
example, if the datapath to the chip is 16 bits wide at the package level, each
16-bit subset of the current row is connected to the I/O pins as a column access
within that row. A typical data transfer in SyncLink concatenates four 16-bit
columns to make a data packet.
Therefore, accessing the columns within the same row is faster than
accessing another row, saving the row access time required to bring the row of
data from the actual RAM storage cells. The multiple banks within each
sub-RAM can provide an additional level of parallelism. In summary, a bank
corresponds to a row that may be held ready for multiple accesses; a sub-RAM
corresponds to one or more banks sharing one timing controller that can
perform only one operation at a time; a RAM corresponds to multiple
sub-RAMs that can access data concurrently but shares initialization and
addressing facilities, as well as the package pins and some internal datapaths.
Multiple RAMs sharing one controller comprise a memory subsystem.
In the SyncLink configuration, two shared links (buses), a unidirectional commandLink and a bidirectional dataLink, are used to connect the controller to multiple slaves (typically SLDRAM chips). The SyncLink uses shared-link (bused) communication to achieve a simple high-bandwidth data transfer path between a memory controller and one or more memory slaves (up to 64 SLDRAMs). The use of just one controller on each SyncLink subsystem simplifies the initialization and arbitration protocols, whereas limiting the number of SLDRAMs to 64 simplifies the packet encoding, because the SLDRAM address (slaveID) can be contained in the first byte of each packet. The limit is 64 rather than 128, because half of the 7-bit slaveIDs are used for the broadcast and multicast addresses.
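A minimal sketch of the slaveID partitioning just described; the split point and ID width come from the text, while the helper names are illustrative.

# SyncLink slaveID space: a 7-bit ID, with the upper half reserved for
# multicast and broadcast groups.
ID_BITS = 7
UNICAST_IDS = range(0, 64)                 # one per SLDRAM, at most 64 devices
GROUP_IDS = range(64, 2 ** ID_BITS)        # 64-127: multicast/broadcast addresses

def is_group_address(slave_id: int) -> bool:
    return slave_id >= 64

print(len(UNICAST_IDS), len(GROUP_IDS), is_group_address(65))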
The link from the controller to the SyncLink nodes, the commandLink, is
unidirectional, and the signal values can change every clock tick. The nominal
clock period is physical-layer dependent, but SyncLink changes data values on
both edges of the clock. For example, a memory system with a 2.5-ns clock
period and a 10-bit-wide commandl.ink corresponds to a raw bandwidth of
200M command packets/so The basic lO-bit-wide commandl.ink contains 14
signals: linkiln (a low-speed asynchronous initialization signal), a strobe (clock)
signal, a listen signal that enables the flag and data receivers, a flag signal, and
10 data signals. The listen, .flag, and data are source-synchronous; that is, the
incoming strobe signal indicates when the other input signals are valid. The flag
signal marks the beginning of transmitted packets. The data signals are used
to transmit bytes within the packets, and depending upon their location within
a packet, the bytes provide address, command, status, or data value.
The dataLink is 16 or 18 bits wide, carrying the read data from SyncLink nodes back to the controller or write data from the controller to one or more SyncLink nodes. The bit rate is the same as for the commandLink, and the minimum block transferred corresponds to 4 bits on each dataLink pin, the same duration as the command. This implies that for a memory system with a 2.5-ns clock period, the data transfer rate can be as high as 1600 Mbyte/s. The SyncLink architecture supports multiples of both 16- and 18-bit-wide DRAMs.
The 18-bit chips can be used by 16-bit controllers, because the extra bits are
logically disconnected until enabled by a controller initiated command. Figure
4.13b shows an example of a typical small memory subsystem design.
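The raw-rate figures quoted for the two links can be reproduced from the 2.5-ns clock period alone, as the sketch below shows.

# commandLink and dataLink raw-rate arithmetic for a 2.5-ns clock period.
CLOCK_PERIOD_NS = 2.5
words_per_second = 2 / (CLOCK_PERIOD_NS * 1e-9)      # both edges -> 800M words/s

COMMAND_WORDS = 4                                    # one command packet = 4 words
command_packets_per_second = words_per_second / COMMAND_WORDS   # 200M packets/s

DATA_PINS = 16                                       # 16-bit dataLink (18-bit variant adds check bits)
data_mbytes_per_second = words_per_second * DATA_PINS / 8 / 1e6 # 1600 Mbyte/s

print(f"{command_packets_per_second / 1e6:.0f}M command packets/s, "
      f"{data_mbytes_per_second:.0f} Mbyte/s on the dataLink")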
To support variable-width DRAM connections and a wide variety of configurations, the SyncLink uses address compare logic that supports a variety of multicast (x2, x4, ...) addresses in addition to single-chip and broadcast addresses. This decoding of multicast slaveID addresses is simpler
and more flexible than providing separate chipSelect signals to individual
DRAMs, as is done with currently available SDRAMs. To encode the
power-of-two multicast addresses, the number of DRAM chips is limited to 64,


such that the lower 0-63 locations are used to address the individual chips,
whereas the higher 64-127 locations specify multicast addresses.
The SyncLink address space is partitioned into 64 nodes, each with an
arbitrarily large memory space. The packet format allows addressing the bytes
as needed, for indefinite expansion. The mapping between the SLDRAM
addresses and the system-level byte addresses is handled at the SyncLink
controller. The SLDRAM chips are expected to have less than 128 control
registers, which can be set using the store command. The values of the control
registers determine a variety of RAM operational parameters, to be defined.
Similarly, up to 128 status registers can be read by the load command, and 128
event actions could be defined. The SyncLink initialization process allows the
controller to determine the number of attached SLDRAMs. The read operations of the standard ROMs or status registers associated with the SLDRAMs allow the controller to determine the SLDRAMs' capabilities.
To access SLDRAM, the controller initiates read or write transactions
addressed to one or more of the SLDRAMs on the attached links. SyncLink
uses split-data transactions because the read data packet is not returned
immediately (or the write data are not needed immediately), so that other
(possibly unrelated) commands may be transmitted on the commandLink, or other data may be transmitted on the dataLink, while the request is being processed by the SLDRAM. The reads and writes transfer bursts of data, so that
the total data transferred depends on the width of the SLDRAM's datapath,
the number of SLDRAMs that are accessed concurrently, and the length of the
burst.
The timing of data flow on the dataLink relative to the corresponding request (command/address) packet on the commandLink is set at initialization
time. To set such parameters in the SLDRAMs at initialization time, the
controller uses a store command to place the appropriate values into the
control register. Once this has been done, a load command, which is essentially
a read with special addressing, can be used to read configuration information
from each SLDRAM via the dataLink. This information can then be used to
refine the SLDRAM parameter settings.
A compact event command is provided for synchronizing refresh and
controlling certain operating modes. The events may be broadcast or multicast.
For reads or loads, the delay between the command and the SLDRAM
response with data on the dataLink is set by a control register. For writes, the
delay between the command and the time when the SLDRAM accepts data
from the dataLink is set by a control register. Some of the major commands
and functions are briefly described below.

• Read Transactions The read transactions are split response transactions


with two components, referred to as the request and response packets. For
reading a burst of data, a read transaction is used. A read request packet
transfers command and address from the controller to the SLDRAM. The
data packet returns data after a fixed delay Trc, which is basically the sum
of row access and the column access delays of the SLDRAM, and is set
at initialization time. A read can be directed to one SLDRAM, or
multicast. The multicast is useful only when there are multiple dataLinks
because only one device at a time is permitted to drive any particular
dataLink.
• Load Transactions A load transaction is similar to a read, but uses special
addressing to access information about the characteristics of particular
SLDRAMs, which is usually information needed to initialize the system.
The delay for load data is also set by Trc, based on the assumption that
the registers can be accessed at least as fast as the SLDRAM. As in the
case of read transactions, a load can be directed to one SLDRAM, or
multicast.
• Write Transactions The write request packet transfers command and address from the controller to the SLDRAM on the commandLink, and after a precise delay the controller transfers data to the SLDRAM on the
dataLink. No response packet is returned. A write can be directed to one
SLDRAM, multicast to a subset, or broadcast to all. The SLDRAM
accepts the controller-provided data from the dataLink after a delay Twc,
which is set at initialization time, and is basically the sum of row access
and column access delays of the SLDRAM.
• Event Transactions An event transaction is similar to a write, but an
abbreviated address is used, without the use of the dataLink. An event can
be directed to one SLDRAM, multicast to a subset, or broadcast to all.
Events transfer only seven bits of encoded control information from the
controller to the SLDRAM(s). Due to their limited information contents,
events are only used for special purposes (e.g., synchronizing SLDRAM
refresh operations).
• Store Transactions A store transaction is similar to an event, but contains
data that are sent on the commandLink for storage into the control
registers within the SLDRAMs. A store can be directed to one SLDRAM,
multicast to a subset, or broadcast to all. Store is used during initialization
to set the operating parameters needed by the SLDRAMs to values
compatible with the requirements of the controller and other SLDRAMs
in the system.
• Refresh Operations The SyncLink approach simplifies the SLDRAM
architecture by placing the refresh responsibility in the controller (except
during the shutDown operation). To support efficient controller scheduling
algorithms, the SyncLink standard requires the SLDRAMs to support
auto-refresh. The prefix auto refers to the addressing, which is automatically
updated by the SLDRAM, and not to the timing, which is explicitly set by
the controller. The auto-refresh capability allows the controller to
accurately predict the SLDRAM response times and schedule refresh
activity during idle periods. Broadcast and directed refresh events
initiate the SLDRAM's refresh operations. The self-refresh mode, in which
all SLDRAMs refresh themselves, is used during low-power shutDown
operation of the system.
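The division of labor among these transaction types can be summarized in a
small model. The following Python sketch is purely illustrative: the class and
field names are ours, and the tick-based delays trc and twc stand for the
initialization-time control-register settings described above.

    from dataclasses import dataclass
    from enum import Enum, auto

    class Kind(Enum):
        READ = auto()    # request on the CommandLink; data returned on the DataLink
        LOAD = auto()    # like READ, but targets device configuration information
        WRITE = auto()   # request on the CommandLink; controller later drives the DataLink
        EVENT = auto()   # seven bits of control information; no DataLink use
        STORE = auto()   # control-register write carried entirely on the CommandLink

    @dataclass
    class Transaction:
        kind: Kind
        issue_tick: int          # tick at which the request packet is issued
        targets: str = "single"  # "single", "multicast", or "broadcast"

    def datalink_activity(tr, trc, twc):
        """Return (start_tick, driver) of the DataLink packet, or None if unused."""
        if tr.kind in (Kind.READ, Kind.LOAD):
            return tr.issue_tick + trc, "SLDRAM"      # device returns data after Trc
        if tr.kind is Kind.WRITE:
            return tr.issue_tick + twc, "controller"  # controller supplies data after Twc
        return None                                   # events and stores never use the DataLink

    # Example: a read issued at tick 0 with Trc programmed to 20 ticks at initialization.
    print(datalink_activity(Transaction(Kind.READ, 0), trc=20, twc=18))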

4.4.2. SLDRAM Architectural and Functional Overview


SLDRAM has been specified as a general-purpose, high-performance DRAM
for a variety of applications, such as main memory, low-power mobile designs,
and high-end servers and workstations, supporting sustained bandwidth with
low latency as well as support for large, hierarchical memory configurations.
The SLDRAM enhances the features of the SDRAM and DDR architectures
with the addition of a packetized address/control protocol, in-system timing
and signaling optimization, and full compatibility from one generation to the
next [20]. The high performance is achieved by improving the interface while
leaving the DRAM core relatively unchanged.
The SLDRAM command packets include spare bits to accommodate
addressing up to 4-Gb generation devices. The first-generation SLDRAMs are
devices that employ a 64-bit internal datapath and operate at 400 Mb/s/pin.
With a 16-bit-wide data interface, these SLDRAMs offer a data bandwidth of
800 Mbytes/s. The second-generation SLDRAMs and future development
plans include devices with 600-Mb/s/pin, 800-Mb/s/pin, and >1-Gb/s/
pin interfaces that will be introduced as they become cost-effective.
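As a quick check on these figures, the peak data bandwidth is simply the
per-pin rate multiplied by the interface width. A minimal sketch:

    def peak_bandwidth_mbytes_per_s(pins, mbit_per_s_per_pin):
        # Peak DataLink bandwidth in Mbytes/s for a given width and per-pin rate.
        return pins * mbit_per_s_per_pin / 8.0

    print(peak_bandwidth_mbytes_per_s(16, 400))   # first generation: 800 Mbytes/s
    print(peak_bandwidth_mbytes_per_s(16, 600))   # planned follow-on: 1200 Mbytes/s
    print(peak_bandwidth_mbytes_per_s(16, 800))   # planned follow-on: 1600 Mbytes/s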
The SLDRAM multidrop bus has one memory controller and up to eight
loads, each of the loads being either a single SLDRAM device or a buffered
module with many SLDRAM devices. The command, address, and control
information from the memory controller flows to the SLDRAM on the
unidirectional CommandLink. The read and write data flow between the
controller and SLDRAM on the bidirectional DataLink. Both CommandLink
and DataLink operate at the same rate (400 Mb/s/pin, 600 Mb/s/pin, and so
on).
Figure 4.14a shows the basic SLDRAM bus topology [18]. The Command-
Link comprises the following signals: CCLK, CCLK*, FLAG, CA[9:0],
LISTEN, LINKON, and RESET. The commands consist of four consecutive
10-bit words on CA[9:0]. A 1 on the FLAG bit indicates the first word of a
command. The SLDRAM uses both edges of the differential free-running clock
(CCLK/CCLK*) to latch command words. For a 400-Mb/s/pin SLDRAM, the
clock frequency is 200 MHz, and the bit period N (also referred to as a clock
"tick") is 2.5 ns, half the clock period. The SLDRAM monitors the CommandLink
for commands while the LISTEN pin is high. When the LISTEN pin is low,
there can be no commands on the CommandLink, and the SLDRAMs enter a
power-saving standby mode. When the LINKON pin is low, the SLDRAMs
enter a shutdown mode, in which CCLK can be turned off to achieve zero power
on the link. A RESET signal puts the SLDRAMs in a known state on power-up.
The DataLink is a bidirectional bus for the transmission of write data from
controller to the SLDRAMs and for the transmission of read data from the

RESET"
L1NKON
LISTEN
CCLK (free runninal 2, Command Link
FLAG
CA[9:0] I 10 I
+, +,
Memory
controller
SO SI SLDRAM or ~Oooo -l!. SLDRAM or
-SO
SL module 1 SL module 8

SI
000

DO[17:0] 18
DCLKO 1 2 l
DataLink
DCLK1
, 2
(bid irectional , intermittent)

(a)

ON 4N aN 12N 16N 20N 24N 28N 32N 36N 40N 44N 48N 52N
j j
CCLK

FlAG

OataLink

OCLKO

OCLKI
1----'----1
Preamb le

(b)

Figure 4.14 SLDRAM. (a) Bus topology. (b) Bus transaction timing diagram. (From
reference 18, with permission of IEEE.)

SLDRAMs back to the controller. It consists of DQ[17:0], DCLK0, DCLK0*,
DCLK1, and DCLK1*. The read and write packets of minimum burst length
of four are accompanied by either one of the differential clocks. The two sets
of clocks allow control of the DataLink to pass from one device to another
with the minimum gap.
Figure 4.14b illustrates the timing diagram in which a series of page read
and page write commands are issued by the memory controller to the
SLDRAMs; all burst lengths shown are 4N, although the controller can
dynamically mix 4N and 8N bursts. The first two commands are page reads to
SLDRAM 0, to either the same or different banks. SLDRAM 0 drives the
read data on the data bus along with DCLK0 to provide the memory
controller the clock edges to strobe in the read data. Because the first two page
read commands are for the same SLDRAM, it is not necessary to insert a gap
between the two 4N data bursts, because the SLDRAM itself ensures that
DCLK0 is driven continuously. However, the data burst for the next page read
(to SLDRAM 1) must be separated by a 2N gap. This allows for settling of the
DataLink bus and for the timing uncertainty between SLDRAM 0 and
SLDRAM 1. A 2N gap is necessary whenever control of the DataLink
passes from one device to another.
The next command is a write command, in which the controller drives
DCLK0 to strobe the write data into SLDRAM 2. The page write latency
of the SLDRAM is programmed to equal the page read latency minus 2N. The
subsequent read command to SLDRAM 3 therefore does not require any
additional delay to achieve the 2N gap on the DataLink. The final burst of
three consecutive write commands shows that the 2N gap between data bursts
is not necessary when the system is writing to different SLDRAM devices,
because all data originate from the memory controller.
When control of the DataLink passes from one device to another, the bus
remains at a midpoint level for nominally 2N, which results in indeterminate
data and possibly multiple transitions at the input buffers. To address this
problem, the data clocks have a 0010 preamble before the transition associated
with the first bit of data. The controller programs each SLDRAM
with four timing latency parameters: page read, page write, bank read, and
bank write. The latency is defined as the time between the command burst
and the start of the associated data burst. For consistent memory subsystem
operation, each SLDRAM should be programmed with the same values. On
power-up, the memory controller polls the status registers in each SLDRAM
to determine minimum latencies, which may vary across manufacturers.
The memory controller then programs each SLDRAM with the worst-case
values.
The read latency is adjustable in coarse increments of unit bit intervals and
fine increments of fractional bit intervals. The controller programs the coarse
and fine read latency of each SLDRAM, so that the read data bursts from
different devices at different electrical distances from the controller all arrive
back at the controller with equal delay from the command packet. Write
latency is only adjustable in coarse increments, and its value determines when
the SLDRAM begins looking for transitions on the DCLK to strobe in write
data.
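The leveling procedure described in the last two paragraphs amounts to taking
the worst-case minimum latency and padding every other device up to it, with
the padding split into whole ticks and a fractional-tick vernier. The sketch
below illustrates the arithmetic only; the device latencies are hypothetical, and
the actual values would be written through the register-write mechanism.

    TICK_NS = 2.5   # one bit period at 400 Mb/s/pin

    def level_read_latency(min_latency_ns):
        """Split each device's padding (up to the worst case) into coarse ticks and a fine remainder."""
        target = max(min_latency_ns.values())        # worst case across all SLDRAMs
        settings = {}
        for dev, lat in min_latency_ns.items():
            pad = target - lat                       # extra delay this device must add
            coarse = int(pad // TICK_NS)             # whole ticks
            fine = pad - coarse * TICK_NS            # sub-tick remainder, in ns
            settings[dev] = (coarse, round(fine, 3))
        return settings

    # Hypothetical minimum read latencies reported by three devices at initialization.
    print(level_read_latency({"sldram0": 45.0, "sldram1": 51.5, "sldram2": 48.75}))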

4.4.3. SLDRAM (Example)


An example of an SLDRAM is a 4M x 18 synchronous, very high speed,
packet-oriented, pipelined device containing 75,497,472 bits and specified for
400-Mb/s/pin performance, available from the SLDRAM consortium member
Micron Technology, Inc. This device is internally configured as eight banks of
128K x 72, with each of these banks organized as 1024 rows by 128 columns
by 72 bits. The 72 bits per column access are transferred over the I/O interface
as a burst of four 18-bit words. Figure 4.15 shows the functional block diagram [21].
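The stated capacity follows directly from the bank geometry; a quick
arithmetic check:

    banks, rows, cols, bits_per_column = 8, 1024, 128, 72
    print(banks * rows * cols * bits_per_column)   # 75,497,472 bits, as specified
    print(bits_per_column // 18)                   # each column access = burst of four 18-bit words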
All transactions begin with a request packet. The read and write request
packets contain the specific command and address information required. Read
and write data are transferred in packets. A single-column access involves the
transfer of a single data packet, which is a burst of four 18-bit words. The data
from either one or two columns in a page may be accessed with a single request
packet. Read or write requests may be issued to idle banks, or to the open row
in active banks. These read or write requests indicate whether to leave the row
open after the access, or to perform a self-timed precharge at the completion
of the access (auto-precharge).
The 4M x 18 SLDRAM uses a pipelined architecture and multiple internal
banks to achieve high speed while providing high effective bandwidth. These
devices include the ability to burst data synchronously at high data rates with
automatic column address generation, the ability to interleave between
several internal banks in order to hide the precharge time, and the capability
to provide a continuous burst of data across random row and/or column
locations, even with 8-byte granularity. The SLDRAMs must be powered up
and initialized in a predefined manner, because operational procedures
other than those specified may result in undefined operation. The following
sections briefly describe the various packets included in the protocol, the
commands, the register definitions, and the read/write functional operations.

Packet Definitions

• Read, Write, or Row Op Request Packet The Read, Write, or Row Op
Request Packet is used to initiate any read or write access operation, or
to open or close a specific row in a specific bank. A read or write request
results in the transfer of a data packet on the data bus after a specified
time. The data packet is driven by the SLDRAM for a read operation, or
by the memory controller for a write operation. An Open Row or Close
Row request generates no response.

• Register Read Request Packet The register read request packet is used to
initiate a read access to a register address. In response to a register read
request packet, the SLDRAM provides a data packet on the data bus after
a specified time.

• Register Write Request Packet The register write request packet is used
to initiate a write access to a register address. This packet consists of four
words, with the latter two being the data to be written to the selected
register.

Figure 4.15 Block diagram of a 4M x 18 SLDRAM. (From reference 21, with permission of IEEE.)

• Event Request Packet The event request packet is used to initiate a hard
or a soft reset, an auto-refresh, or a Close All Rows command, or to enter
or exit self-refresh, adjust output voltage levels, adjust the Fine Read
Vernier, or adjust the Data Offset Vernier. The output voltage levels, or
the fine read or data offset verniers, can be adjusted using a dedicated
Adjust Settings Event Request Packet, or as part of an auto-refresh event.
• Data Sync Request Packet A data sync request packet is used to control
the output logic values and patterns used for the level adjustment, latency
detection, and timing synchronization.
• Data Packet A data packet is provided by the controller for each write
request and by the SLDRAM for each read request. Each data packet
contains either 8 bytes or 16 bytes, depending on whether the burst length
was set to 4 or 8, respectively, in the corresponding request packet. There
are no output disable or write masking capabilities within the data packet.
When a burst length of 8 is selected, the first 8 bytes in the packet
correspond to the column address contained in the request packet, and
the second 8 bytes correspond to the same column address except with an
inverted LSB, as shown in the sketch that follows.
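A minimal sketch of the column addressing within a data packet (the function
name is ours):

    def burst_column_addresses(column, burst_length):
        """Column addresses covered by one data packet."""
        if burst_length == 4:
            return [column]                 # burst of 4: a single column access
        if burst_length == 8:
            return [column, column ^ 0x1]   # burst of 8: same column with the LSB inverted
        raise ValueError("SLDRAM bursts are 4 or 8 words")

    print(burst_column_addresses(0x2A, 8))   # [42, 43]
    print(burst_column_addresses(0x2B, 8))   # [43, 42]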

Commands Descriptions Table 4.4 provides a quick reference to the
available commands for the 4M x 18 SLDRAM, and brief descriptions are
provided in the following section [21]; an illustrative decoder sketch follows
the command list. All command packets must start on a positive edge of the
SLDRAM CCLK.

• Open Row The OPEN ROW command is used to open (or activate) a
row in a particular bank in preparation for a subsequent, but separate,
column access command. The row remains open (or active) for the
accesses until a CLOSE ROW command or an access-and-close-row type
command is issued to that bank. After an OPEN ROW command is
issued to a given bank, a CLOSE ROW command or an access-and-close-
row type command must be issued to that bank before a different row in
that same bank can be opened.
• Close Row The CLOSE ROW command is used to close a row in a
specified bank. This command is useful when it is desired to close a row
that was previously left open in anticipation of subsequent page accesses.
• Read The Page Read commands and Bank Read commands are used to
initiate a read access to an open row or to a closed row, respectively. The
commands indicate the burst length, the selected DCLK, and whether to
leave the row open after the access. The read data appears on the DQs
based on the corresponding Read Delay Register values, Fine Read
Vernier, and the Data Offset Vernier settings previously programmed into
the device.
• Write The Page Write and Bank Write commands are used to initiate a
write access to an open row, or to a closed row, respectively. The
commands indicate the burst length, the selected DCLK, and whether to
leave the row open after the access. Write data are expected on the DQs
at a time determined by the corresponding Write Delay Register value
previously programmed into the device.
• No Operation (NOP) The FLAG HIGH indicates the start of a valid
request packet; FLAG then goes LOW for the remainder of the packet.
FLAG LOW at any other time results in a No Operation (NOP). A NOP
prevents unwanted commands from being registered during the idle states,
and does not affect operations already in progress.
• Register Read A Register Read command is used to read contents of the
device status registers. The register data are available on the DQs after
the delay determined by the Page Read Delay Register value, Fine Read
Vernier, and Data Offset Vernier settings previously programmed into the
device.
• Register Write The Register Write command is used to write to the
control registers of the device. The register data are included within the
request packet containing the command.
• Event The events (e.g., Hard Reset, Soft Reset, Auto-Refresh, etc.) are
used to issue commands that do not require a specific address within a
device or devices.
• Read Sync (Stop Read Sync) This command instructs the SLDRAM to
start (stop) transmitting the specified synchronization pattern to be used
by the controller to adjust its input capture timing.
• Drive DCLKs LOW (HIGH) This command instructs the SLDRAM to
drive the DCLK outputs LOW (HIGH) until overridden by another
DRIVE DCLK or READ command. DCLK is specific in this context, and
the DCLK# outputs will be in the opposite state.
• Drive DCLKs Toggling This command instructs the SLDRAM to drive
the DCLK outputs toggling at the operating frequency of the device until
overridden by another DRIVE DCLK or READ command.
• Disable DCLKs This command instructs the SLDRAM to disable (High-Z)
the DCLK/DCLK# outputs until overridden by another DRIVE DCLK
or READ command.
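Because the command field decodes hierarchically (CMD[5:3] selects the
command group and CMD[2:0] the subcommand), the access-command portion
of Table 4.4 can be captured in a few lines. The decoder below is an
illustration derived from the table, not vendor code, and it spells out only the
four read/write access groups.

    GROUPS = {
        0b000: "Page Access, Burst of 4",
        0b001: "Page Access, Burst of 8",
        0b010: "Bank Access, Burst of 4",
        0b011: "Bank Access, Burst of 8",
        0b100: "Register Access, Row Op, or Event",
        0b101: "Data Sync",
        0b110: "Reserved (buffer-only; NOP to SLDRAMs)",
        0b111: "Reserved",
    }

    def decode_access_subcommand(cmd2_0):
        """CMD[2:0] meaning within the four read/write access groups (CMD[5:3] = 000..011)."""
        op = "Read Access" if cmd2_0 < 4 else "Write Access"
        row = "Close Row" if cmd2_0 & 0b010 else "Leave Row Open"
        dclk = "DCLK1" if cmd2_0 & 0b001 else "DCLK0"
        verb = "Drive" if op == "Read Access" else "Use"
        return f"{op}, {row}, {verb} {dclk}"

    def decode(cmd):
        group = GROUPS[(cmd >> 3) & 0b111]
        if (cmd >> 3) <= 0b011:
            return f"{group}: {decode_access_subcommand(cmd & 0b111)}"
        return group   # the remaining groups need the full per-row table for their subcommands

    print(decode(0b000_010))   # Page Access, Burst of 4: Read Access, Close Row, Drive DCLK0
    print(decode(0b011_111))   # Bank Access, Burst of 8: Write Access, Close Row, Use DCLK1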

Register Definition The SLDRAM includes two sets of registers: the control
registers and the status registers. The control registers are write-only registers,
which are logically 20 bits wide. Currently, all control registers are 8 bits or
less (physically), so that the remaining bits are "don't care" to the SLDRAM.
However, to allow for future revision, the controller should write a 0 to each
"don't care" bit. The data to be written to a control register are provided via
the command/address bus as a part of the Register Write Packet. The ID
register consists of eight bits, which are all set to 1 (ID value 255) upon
hardware reset and are subsequently programmed to a unique value during
initialization.
TABLE 4.4 4M x 18 SLDRAM Commands

CMD5 CMD4 CMD3   Command                      CMD2 CMD1 CMD0   Subcommand
0    0    0      Page Access, Burst of 4      0    0    0      Read Access, Leave Row Open, Drive DCLK0
0    0    0                                   0    0    1      Read Access, Leave Row Open, Drive DCLK1
0    0    0                                   0    1    0      Read Access, Close Row, Drive DCLK0
0    0    0                                   0    1    1      Read Access, Close Row, Drive DCLK1
0    0    0                                   1    0    0      Write Access, Leave Row Open, Use DCLK0
0    0    0                                   1    0    1      Write Access, Leave Row Open, Use DCLK1
0    0    0                                   1    1    0      Write Access, Close Row, Use DCLK0
0    0    0                                   1    1    1      Write Access, Close Row, Use DCLK1
0    0    1      Page Access, Burst of 8      0    0    0      Read Access, Leave Row Open, Drive DCLK0
0    0    1                                   0    0    1      Read Access, Leave Row Open, Drive DCLK1
0    0    1                                   0    1    0      Read Access, Close Row, Drive DCLK0
0    0    1                                   0    1    1      Read Access, Close Row, Drive DCLK1
0    0    1                                   1    0    0      Write Access, Leave Row Open, Use DCLK0
0    0    1                                   1    0    1      Write Access, Leave Row Open, Use DCLK1
0    0    1                                   1    1    0      Write Access, Close Row, Use DCLK0
0    0    1                                   1    1    1      Write Access, Close Row, Use DCLK1
0    1    0      Bank Access, Burst of 4      0    0    0      Read Access, Leave Row Open, Drive DCLK0
0    1    0                                   0    0    1      Read Access, Leave Row Open, Drive DCLK1
0    1    0                                   0    1    0      Read Access, Close Row, Drive DCLK0
0    1    0                                   0    1    1      Read Access, Close Row, Drive DCLK1
0    1    0                                   1    0    0      Write Access, Leave Row Open, Use DCLK0
0    1    0                                   1    0    1      Write Access, Leave Row Open, Use DCLK1
0    1    0                                   1    1    0      Write Access, Close Row, Use DCLK0
0    1    0                                   1    1    1      Write Access, Close Row, Use DCLK1
0    1    1      Bank Access, Burst of 8      0    0    0      Read Access, Leave Row Open, Drive DCLK0
0    1    1                                   0    0    1      Read Access, Leave Row Open, Drive DCLK1
0    1    1                                   0    1    0      Read Access, Close Row, Drive DCLK0
0    1    1                                   0    1    1      Read Access, Close Row, Drive DCLK1
0    1    1                                   1    0    0      Write Access, Leave Row Open, Use DCLK0
0    1    1                                   1    0    1      Write Access, Leave Row Open, Use DCLK1
0    1    1                                   1    1    0      Write Access, Close Row, Use DCLK0
0    1    1                                   1    1    1      Write Access, Close Row, Use DCLK1
1    0    0      Register Access,             0    0    0      Reserved
1    0    0      Row Op, or Event             0    0    1      Open Row
1    0    0                                   0    1    0      Close Row
1    0    0                                   0    1    1      Register Write
1    0    0                                   1    0    0      Register Read, Use DCLK0
1    0    0                                   1    0    1      Register Read, Use DCLK1
1    0    0                                   1    1    0      Reserved
1    0    0                                   1    1    1      Event
1    0    1      Data Sync                    0    0    0      Read Sync (Drive both DCLKs)
1    0    1                                   0    0    1      Stop Read Sync
1    0    1                                   0    1    0      Drive DCLKs LOW
1    0    1                                   0    1    1      Drive DCLKs HIGH
1    0    1                                   1    0    0      Reserved
1    0    1                                   1    0    1      Reserved
1    0    1                                   1    1    0      Disable DCLKs
1    0    1                                   1    1    1      Drive DCLKs Toggling
1    1    0      Reserved                     X    X    X      NOP (note a)
1    1    1      Reserved                     X    X    X      Reserved

Note a: Reserved for buffer-only commands; must be treated as NOP by SLDRAMs.

Source: Reference 21, with permission of IEEE.

Each SLDRAM monitors the command/address bus for the start of a request
packet and then performs a comparison between the ID contained in the
request packet and the one contained in its internal ID register. If there is a
match within a given SLDRAM, the device will process the request packet.
The status registers are read-only registers that are logically 72 bits wide.
Physically, all status registers are currently 32 bits, so that the remaining bits
are driven LOW during status register reads. The data being read from a status
register are provided in a burst of 4, after a delay equal to the Actual Page
Read Delay previously programmed into the device. The configuration register
contains a code, which uniquely identifies the memory device vendor, the valid
operating frequencies for the device, the number of banks in the device, the
number of rows per bank, the number of columns per page, and the number
of DQs on the device.

Read Accesses The read accesses are initiated with a read request packet.
When accessing an idle bank (a bank read access), the request packet includes
the bank, row, and column addresses, the burst length, and a bit indicating
whether or not to close the row after the access. The same is true for accessing
the open row in an active bank (a page read access), except that the row
address will be ignored. During a read access, the first of four (or eight) data
words in the data packet is available following the total read delay; the
remaining three (or seven) data words are available one per tick (2.5 ns)
thereafter. The total read delay is equal to the coarse delay (Bank Read Delay
or Page Read Delay) stored in the SLDRAM register plus the fine delay of the
Data Offset Vernier and the Fine Read Vernier for the DQs and DCLK0.
Figure 4.16 shows the minimum and maximum total read delays for (a) bank
read access and (b) page read access [21].
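The composition of the total read delay lends itself to a short sketch. The
register and vernier names follow the description above, the tick is fixed at
2.5 ns, and the numerical values in the example are hypothetical.

    TICK_NS = 2.5   # one data word per tick at 400 Mb/s/pin

    def total_read_delay_ns(coarse_delay_ticks, data_offset_vernier_ns, fine_read_vernier_ns):
        """Delay from the read request to the first data word: coarse register value
        (Bank or Page Read Delay, in whole ticks) plus the two sub-tick verniers."""
        return coarse_delay_ticks * TICK_NS + data_offset_vernier_ns + fine_read_vernier_ns

    # First word after the programmed delay; the rest of a burst of 4 follow one per tick.
    first = total_read_delay_ns(coarse_delay_ticks=12, data_offset_vernier_ns=0.6,
                                fine_read_vernier_ns=0.3)
    print([round(first + i * TICK_NS, 2) for i in range(4)])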
The SLDRAM clocking scheme is designed to provide for the temporal
alignment of all read data at the memory controller data pins, regardless of the
source SLDRAM. This temporal alignment scheme can be broken down into
different levels. At the lowest level (device level data capture), the DCLK
transitions and DQ transitions of an individual SLDRAM are adjusted (moved
in time) relative to each other to facilitate the capture by the controller of the
DQ signals using the DCLK signals. Thus, the SLDRAM clocking scheme
allows for individual device adjustment without requiring the memory control-
ler to implement memory device specific internal adjustments. At the next level
of timing alignment (device level optimization), the DCLK and DQ transitions
are moved as a group in time to align the DCLK edges with the preferred
phase of an internal controller clock.
The first two levels of timing alignment are sub-tick-level adjustments. At
the next level (system-level optimization), coarse (integer tick value) adjust-
ments are made in order to establish the same latency between a read
command being issued by the controller and the corresponding data arriving
back at the controller for all SLDRAM devices in the system.


Figure 4.16 SLDRAM minimum and maximum total read delays. (a) Bank read
access. (b) Page read access. (From reference 21, with permission of IEEE.)

The two DCLK signals provided by each SLDRAM (as well as the memory
controller) provide increased effective bandwidth when switching between
different sources of data on the bus (e.g., a read from one SLDRAM followed
by a read from another SLDRAM, read-to-write or write-to-read transitions).
The preamble and leading cycle in a given DCLK sequence can be hidden,
that is, overlapped with the data associated with the other DCLK signal.

Write Accesses The write accesses are initiated with a write request packet.
When accessing an idle bank (a bank write access), the request packet includes
the bank, row, and column addresses, the burst length, and a bit indicating
whether or not to close the row after the access. The same is true when
accessing the open row in an active bank (a page write access), except that the
row address will be ignored. During a write access, the first of four (or eight)
data words in the data packet is driven by the controller, aligned with the
selected DCLK, after a delay (Bank Write Delay or Page Write Delay)
programmed into the SLDRAM registers. The remaining three (or seven) data
words follow, one per clock tick (2.5 ns). Figure 4.17 shows the minimum and
maximum delay before arrival of data at the SLDRAM during (a) bank write
access and (b) page write access [21].

Standby Mode In the standby mode, all output drivers are disabled, and all
input receivers except those for CCLK, RESET#, LISTEN, and LINKON
are disabled. The standby mode is entered by deactivating the LISTEN signal
at any time except during the transfer of a request packet. The standby mode
can be nested within the self-refresh mode.


Figure 4.17 SLDRAM minimum and maximum write delay times during (a) bank
write access and (b) page write access. (From reference 21, with permission of IEEE.)

Shutdown Mode In shutdown mode, all internal clocks, all output drivers,
and all input receivers are disabled, except for the LINKON and RESET#.
The shutdown mode is entered by deactivating the LINKON signal while the
device is already in the standby mode. The shutdown may be nested within the
self-refresh mode.

Self-Refresh Mode In the self-refresh mode, an on-chip oscillator and refresh
logic are enabled, thereby suspending the requirement for periodic auto-refresh
events initiated by the memory controller. The standby and/or shutdown
modes may be nested within the self-refresh mode. No other commands or
accesses are permitted to the SLDRAM while in the self-refresh mode.

An improvement on the 4M x 18 SLDRAM with 400-Mb/s/pin performance
is the development of an 8M x 18 device containing 150,994,944 bits and
specified for 600-Mb/s/pin operation. This SLDRAM uses a pipelined
architecture and multiple internal banks to achieve high-speed operation and
high effective bandwidth. It is internally configured as eight banks of
256K x 72 bits, and each of the 256K x 72 banks is organized as 2048 rows
by 128 columns by 72 bits. The 72-bits-per-column accesses are transferred
over the I/O interface in a burst of four 18-bit words.

4.5. 3-D RAM

A major issue in the development of high-performance 3-D graphics hardware
has been the rate at which pixels can be rendered into a frame buffer using
conventional DRAM or VRAM. In 1994, Mitsubishi Corp. pioneered the
introduction of its family of 3-D RAM to provide an order-of-magnitude
increase in rendering performance. This 3-D RAM architecture is based
upon (1) an optimized memory array that minimizes the average memory cycle
time when rendering and (2) selective on-chip logic that converts the interface
with the rendering controller from a read-modify-write mode to a write-
mostly mode. The device is an enhancement of Mitsubishi's Cache DRAM
(CDRAM) architecture that is optimized for 3-D graphics rendering with the
addition of an on-chip ALU, a compare unit, serial access memory (SAM) video
buffers, and other functions. The ALU allows the 3-D RAM to modify data in
a single pixel buffer cycle, eliminating the typical read-modify-write bottlenecks.
The 256-bit internal global bus allows each 3-D RAM chip to process up to
400 million 32-bit pixels per second.
These are some of the basic features responsible for the overall performance
improvement:

• 10-Mbit DRAM array supporting a 1280 x 1024 x 8 frame buffer
• Four independent, interleaved DRAM banks
• 2048-bit SRAM pixel buffer as the cache between the DRAM and the ALU
• Built-in, tile-oriented memory addressing for rendering and scan-line-
oriented memory addressing for video refresh
• 256-bit global bus connecting the DRAM banks and the pixel buffer
• Flexible, dual video buffer supporting 76-Hz CRT refresh

Figure 4.18 shows the simplified 3-D RAM block diagram with external pins
[22]. The DRAM array is partitioned into four independent banks (A, B, C,
and D) of 2.5 Mb each, and together these four banks can support a screen
resolution of 1280 x 1024 x 8. The independent banks can be interleaved to
facilitate nearly uninterrupted frame buffer update and, at the same time,


Figure 4.18 Simplified 3-D RAM block diagram with external pins. (From reference
22, with permission of Mitsubishi Corp.)

transfer pixel data to the dual video buffer for screen refresh. Data from the
DRAM banks are transferred over the 256-bit global bus to the triple-ported
pixel buffer. The pixel buffer consists of eight blocks, each of which is 256 bits
and is updated in a single transfer on the global bus. The memory size of the
pixel buffer is 2 Kbits.
The ALU uses two of the pixel buffer ports to read and write data in the
same clock cycle. Each video buffer is 80 x 8 bits and is loaded in a single
DRAM operation. One video buffer can be loaded while the other is sending
out video data. The on-board pixel buffer can hold up to eight blocks of data,
each block containing 256 bits, and has a cycle time of 8 or 10 ns. With
the 256-bit global bus operating at a maximum speed of 20 ns and transferring
32-byte blocks, data can be moved from the DRAM banks to the pixel buffer
at a rate of up to 1.6 Gbytes/s. The ALU converts z-buffer and pixel blend
operations from "read-modify-writes" to "mostly writes," which allows data
modifications to be completed in a single pixel buffer cycle, reducing execution
time by up to 75%.
A word has 32 bits and is the unit of data operations within the pixel ALU
and between the pixel ALU and the pixel buffer. When the pixel ALU accesses
the pixel buffer, not only does a block address need to be specified, but a
word also has to be identified. Because there are eight blocks in the pixel buffer
and eight words in a block, the upper three bits of the input pins PALU_A
designate the block, and the lower three bits select the word. The data in a
word are directly mapped to PALU_DQ[31:0] in corresponding order. In other
words, bit 0 of the word is mapped to PALU_DQ0, bit 1 to PALU_DQ1, and
so on. Figure 4.19 shows the relations and addressing scheme of the blocks and
words in the pixel buffer and in the DRAM page [22].
Although an ALU write operation operates on one word at a time, each of
the four bytes in a word may be individually masked. The mapping is also
direct and linear: byte 0 is PALU_DQ[7:0], byte 1 is PALU_DQ[15:8], byte 2
is PALU_DQ[23:16], and byte 3 is PALU_DQ[31:24]. A block has 256 bits and
is the unit of memory operations between a DRAM bank and the pixel buffer
over the global bus. The input pins DRAM_A select a block from the pixel
buffer and a block from the page of a DRAM bank. The DRAM operations
on block data are Unmasked Write Block (UWB), Masked Write Block
(MWB), and Read Block (RDB).
A page in a DRAM bank is organized into 10 x 4 blocks; and because each
block has 256 bits, a page has 10,240 bits. There are four DRAM banks in a
3-D RAM chip, such that the pages of the same page address from all four
DRAM banks compose a page group. Therefore, a page group has 20 x 8
blocks.
Figure 4.19 shows the block and page drawn as rectangular shapes that can
be related to tiled frame buffer memory organization. For example, if display
resolution is 1280 x 1024 x 8, then a 32-bit word contains four pixels. Because
a block may be considered as having 2 x 4 words, it contains 8 x 4 pixels. A
page is organized into 10 x 4 blocks, so it contains 80 x 16 pixels; thus a page

Figure 4.19 3-D RAM relations and addressing scheme of blocks and words in the
pixel buffer and in the DRAM page. (From reference 22, with permission of Mitsubishi
Corp.)

group holds 160 x 32 pixels. Therefore, a screen is made up of 8 x 32 page
groups. The advantage of such a frame organization is the minimization of
the page miss penalty.
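For the 1280 x 1024 x 8 case, the tiling just described maps a screen coordinate
onto bank, page, block, and word fields. The sketch below follows the geometry
given in the text (word = 4 pixels, block = 8 x 4 pixels, page = 80 x 16 pixels,
page group = 160 x 32 pixels, screen = 8 x 32 page groups); the particular bank
assignment within a page group and the word numbering within a block are
assumptions made only for illustration.

    def map_pixel(x, y):
        """Map a screen pixel (x, y) of a 1280 x 1024 x 8 frame buffer onto the tiling
        described in the text.  The A/B/C/D quadrant layout and word ordering are
        hypothetical choices made for illustration."""
        assert 0 <= x < 1280 and 0 <= y < 1024
        group_x, group_y = x // 160, y // 32                 # 8 x 32 page groups per screen
        page_x, page_y = (x % 160) // 80, (y % 32) // 16     # 2 x 2 pages per page group
        bank = "ABCD"[page_y * 2 + page_x]                   # assumed quadrant-to-bank mapping
        page = group_y * 8 + group_x                         # one page address per group (0..255)
        block_x, block_y = (x % 80) // 8, (y % 16) // 4      # 10 x 4 blocks per page
        block = block_y * 10 + block_x
        word = (y % 4) * 2 + (x % 8) // 4                    # 2 x 4 words per block
        pixel_in_word = x % 4                                # 4 pixels per 32-bit word
        return bank, page, block, word, pixel_in_word

    print(map_pixel(0, 0))         # ('A', 0, 0, 0, 0)
    print(map_pixel(1279, 1023))   # last pixel of the screen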
The 3-D RAM has several major functional blocks, as follows:

• DRAM Banks
• Pixel Buffer
• Pixel ALU
• Video Buffers
• Global Bus

These are briefly described below.

DRAM Banks The 3-D RAM contains four independent DRAM banks,
which can be interleaved to facilitate hidden precharge or access in one bank
while screen refresh is being performed in another bank. Each DRAM bank
has 256 pages with 10,240 bits per page, for a total page storage capacity of
2,621,440 bits. An additional 257th page can be accessed for special functions.
A row decoder takes a 9-bit page address signal to generate 257 word lines,
one for each page. The word lines select which page is connected to the sense
amplifiers. The sense amplifiers read and write the page selected by the row
decoder. Figure 4.20a shows the block diagram of a DRAM bank consisting
of the row decoder, address latch, DRAM array, and sense amplifiers [22].
During an Access Page (ACP) operation, the row decoder selects a page
by activating its word line, which transfers the bit charge of that page to the
sense amplifiers. The sense amplifiers amplify the charges. After the sensing and
amplification are completed, the sense amplifiers are ready to interface with the
global bus or video buffer. In a way, the sense amplifiers function as a
"write-through" cache, and no write back to the DRAM array is necessary.
Alternatively, the data in the sense amplifiers can be written to any page in the
same bank at this time, simply by selecting a word line without first equalizing
the sense amplifiers. This function is called Duplicate Page (DUP), and a
typical application of this function is copying from the 257th page to one
of the normal 256 pages, all 10,240 bits at a time, for fast area fill.
When the sense amplifiers in a DRAM bank complete the read/write
operations with the global bus or video buffer, a precharge (PRE) bank
operation usually follows. This precharge bank cycle deactivates the selected
word line corresponding to the current page and equalizes the sense amplifiers.
The DRAM banks must be precharged prior to accessing a new page.
The major DRAM operations are Unmasked Write Block (UWB), Masked
Write Block (MWB), Read Block (RDB), Precharge Bank (PRE), Video
Transfer (VDX), Duplicate Page (DUP), Access Page (ACP), and No Operation
(NOP). These operations are briefly described in the following sections; a
short sequencing sketch follows the list.
Figure 4.20b illustrates the Unmasked Write Block (UWB), Masked Write
Block (MWB), and the Read Block (RDB) operations on the global bus.

• Unmasked Write Block (UWB) The UWB operation copies 32 bytes from
the specified pixel buffer block over the global bus to the specified block
in the sense amplifiers and the DRAM page of a selected DRAM bank.
The 32-bit Plane Mask register has no effect on the UWB operation. The
32-bit Dirty Tag still controls which bytes of the block are updated.
• Masked Write Block (MWB) The MWB operation copies 32 bytes from
the specified pixel buffer block over the global bus to the specified block
in the sense amplifiers and the DRAM page of a selected DRAM bank.
Both the 32-bit Dirty Tag and the 32-bit Plane Mask register control
which bytes of the block are updated.
• Read Block (RDB) The RDB operation copies 32 bytes from the sense
amplifiers of a selected DRAM bank over the global bus to the specified
block in the pixel buffer. The corresponding 32-bit Dirty Tag is cleared.


Figure 4.20 3-D RAM block diagrams. (a) DRAM bank. (b) UWB or MWB, and
RDB on the global bus. (c) Video transfer from a page in Bank A to Video Buffer I.
(From reference 22, with permission of Mitsubishi Corp.)

• Precharge Bank Operation (PRE) The PRE operation first deactivates
the word line corresponding to the most recently accessed DRAM page
of a selected DRAM bank and then equalizes the bit lines of the sense
amplifiers for a subsequent access page operation. After a precharge bank
operation has been performed on a certain DRAM bank, only the
following operations can be performed on that DRAM bank: Access Page,
Precharge Bank, and NOP.
• Video Transfer (VDX) There are two parts to the VDX operation: video
buffer load and video output. Video buffer load refers to the transfer from
the sense amplifiers of a selected DRAM bank to a corresponding video
buffer. Video output refers to the transfer from a video buffer to the
VID_Q pins. There are two video buffers available for interleaved transfer:
Video Buffer I is for Bank A and Bank C, and Video Buffer II is for Bank B
and Bank D. There are two byte-order formats for the VID_Q video
output pins: normal mode and reversed mode. To avoid data corruption
in the video buffer, the user should not start a video transfer operation to
the video buffer that is outputting data to the VID_Q bus. Figure 4.20c
shows the video transfer example from a page in Bank A to Video Buffer I.
• Duplicate Page (DUP) In one duplicate page operation, all 10,240 bits of
the data in the sense amplifiers of a selected DRAM bank can be
transferred to any specified page in the same bank. The data in the sense
amplifiers are not affected by this operation. If the DRAM_A[8] pin is 0,
then the DRAM_A[7:0] pins select one of the 256 normal pages. If
DRAM_A[8] is 1, then the DRAM_A[7:0] pins are ignored and the extra
page is written.
• Access Page (ACP) The ACP operation activates the word line corre-
sponding to the specified DRAM page of a selected DRAM bank and
transfers the data in the DRAM array to the sense amplifiers. If the
DRAM_A[8] pin is 0, then the DRAM_A[7:0] pins select one of the 256
normal pages. If the DRAM_A[8] pin is 1, then the DRAM_A[7:0] pins
are ignored and the extra page is transferred. A precharge operation must
have been performed on a DRAM bank before an ACP operation can be
performed.
• No Operation (NOP) The NOP operation may be freely inserted be-
tween the ACP operation and the PRE operation on the same bank.
NOPs are issued when the DRAM arrays are idle, no read or write is
required by the pixel buffer, and no video buffer load is necessary. NOPs
are also required to satisfy the timing interlocks of the various DRAM
operations.
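The ordering rules stated in these descriptions (ACP before any block or video
transfer, PRE before a different page can be opened) can be captured in a tiny
legality model. The sketch ignores timing parameters and bank interleaving
and is only an illustration of the sequencing constraints.

    class DramBank:
        """Minimal legality model of per-bank operation ordering (no timing)."""
        def __init__(self):
            self.open_page = None          # None means the bank is precharged

        def acp(self, page):               # Access Page: open a page into the sense amps
            assert self.open_page is None, "must PRE before opening a different page"
            self.open_page = page

        def block_op(self, op):            # UWB, MWB, RDB, VDX, or DUP need sensed data
            assert self.open_page is not None, f"{op} requires a prior ACP"

        def pre(self):                     # Precharge Bank: close the page, equalize sense amps
            self.open_page = None

    bank = DramBank()
    bank.acp(12); bank.block_op("RDB"); bank.block_op("UWB"); bank.pre()
    bank.acp(200)                          # a new page may be opened only after PRE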

Pixel Buffer The pixel buffer is a 2048-bit SRAM organized into 256-bit
blocks, as shown in Figure 4.20. During a DRAM operation, these blocks can
be addressed from the DRAM_A pins for block transfers on the global bus.
During a pixel ALU operation, the 32-bit pixel ALU accesses the pixel buffer,
requiring not only the block address to be specified but also the 32-bit word
to be identified. This is done by using the 6-bit PALU_A pins, such that the
upper three bits select one of the eight blocks in the pixel buffer and the lower
three bits specify one of the eight words in the selected block. The availability
of both the DRAM_A and PALU_A pins allows concurrent DRAM and pixel
ALU operations. Figure 4.21a shows the pixel buffer elements [22].
The pixel buffer functions as a level-one write back pixel cache and includes
the following: a 256-bit read/write port, a 32-bit read port, and a 32-bit write
port. The 256-bit read/write port is connected to the global bus via a write
buffer, and the two 32-bit ports are connected to the pixel ALU and the pixel
data pins. All three ports can be used simultaneously as long as the same
memory cell is not accessed. An operation that involves only the pixel ALU
and the pixel buffer is called a pixel ALU operation. Figure 4.21b shows the
block diagram of a triple-port pixel buffer, a global bus, and a dual-port Dirty
Tag RAM.

Pixel ALU Some of the major elements and operations of the pixel ALU are
described in the following text.

Dirty Tag Each data byte of a 256-bit block is associated with a Dirty Tag
bit, which means that each word is associated with four Dirty Tag bits
and that a 32-bit Dirty Tag controls the corresponding 32-byte block of
data. The Dirty Tag RAM in the pixel buffer contains eight such 32-bit Dirty
Tags. When a block is transferred from the sense amplifiers to the pixel buffer
through the 256-bit port, the corresponding 32-bit Dirty Tag is cleared. When
a block is transferred from the pixel buffer to a DRAM bank, the Dirty Tag
determines which bytes are actually written. When a Dirty Tag bit is "1," the
corresponding data byte is written under the control of the Plane Mask
register, whereas if a Dirty Tag bit is "0," the corresponding byte of data in the
DRAM bank is not written and retains its former value.
There are three major aspects of Dirty Tag operations: tag clear, tag set, and
tag initialization. In normal operation modes, the clearing and setting of the
Dirty Tags by these read and write operations are done by the on-chip logic
in the 3-D RAM and are basically transparent to the rendering controller. The
Dirty Tag bits are used by the 3-D RAM internally and are not output to the
external pins. The Dirty Tag bits play an important role for all four write
operations of the Pixel ALU to the Pixel Buffer: Stateful/Stateless Initial Data
Write and Stateful/Stateless Normal Data Write.
The Stateless Data Writes refer to the condition whereby the states of the
Pixel ALU units are entirely ignored and the write data are passed to the Pixel
Buffer unaffected, whereas in the Stateful Data Writes the settings of the
various registers in the Pixel ALU, the results of the compare tests, and the
states of the PASS_IN all affect whether the bits of pixel data will be written
into the Pixel Buffer. Initial and Normal Data Writes refer to the manner in
which the Dirty Tag is updated.


Figure 4.21 3-D RAM. (a) Pixel buffer elements. (b) Triple-port pixel buffer, global
bus, and dual-port Dirty Tag RAM. (From reference 22, with permission of Mitsubishi
Corp.)

Many 2-D rendering operations, such as text drawing, involve writing
the same color to many pixels. In the 3-D RAM, the Color Expansion is done
with the Dirty Tags associated with the Pixel Buffer blocks. The pixel color is
written eight times to a Pixel Buffer block, so that all of the pixels in the block
are the same color. Next, a 32-bit word is written to the Dirty Tag of the
associated block. Finally, the block is written to a DRAM bank. The pixels
whose corresponding Dirty Tag bits are set are changed to the new color, while
the other pixels are unaffected.

Plane Mask The 32-bit Plane Mask register (PM[31:0]) is used to qualify two
write functions: (1) as per-bit write enables on the 32-bit data for Stateful
(Initial/Normal) data write operations from the Pixel ALU to the Pixel Buffer
and (2) as write enables on the 256-bit data for a Masked Write Block
(MWB) operation from the Pixel Buffer to the sense amplifiers of a DRAM
bank over the Global Bus. For a Stateful Data Write, the Plane Mask serves
as per-bit write enables over the entering data from the Pixel ALU write port;
bit 0 of the Plane Mask enables or disables bit 0 of the incoming 32-bit pixel
data, bit 1 of the Plane Mask enables or disables bit 1 of the incoming
32-bit pixel data, and so on.
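One way to picture how the Dirty Tag and Plane Mask combine during a
Masked Write Block is as a per-bit write-enable computation: the Dirty Tag
gates whole bytes of the 32-byte block, while the 32-bit Plane Mask gates bit
positions within each 32-bit word. This formulation is our reading of the text
and is shown only as an illustration.

    def mwb_bit_write_enables(dirty_tag, plane_mask):
        """Per-bit write enables for the 256-bit block of a Masked Write Block.
        dirty_tag:  32 bits, one per byte of the block (1 = byte was written in the Pixel Buffer)
        plane_mask: 32 bits, one per bit position of each 32-bit word (1 = plane enabled)"""
        enables = []
        for bit in range(256):
            byte_index = bit // 8                  # which of the 32 bytes this bit belongs to
            bit_in_word = bit % 32                 # bit position within its 32-bit word
            dirty = (dirty_tag >> byte_index) & 1
            plane = (plane_mask >> bit_in_word) & 1
            enables.append(dirty & plane)
        return enables

    # Example: only byte 0 dirty and only the low 8 plane-mask bits enabled, so
    # exactly bits 0..7 of the block are written back to the sense amplifiers.
    en = mwb_bit_write_enables(dirty_tag=0x0000_0001, plane_mask=0x0000_00FF)
    print(sum(en), [i for i, e in enumerate(en) if e])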

4.5.1. Pixel ALU Operations


As indicated earlier, the major objective of including the Pixel ALU on chip is
to convert the interface from a read-modify-write interface to a write-mostly
interface. This logic integration with the memory arrays greatly improves
rendering throughput by avoiding the time-consuming reads and direction
changes on the data bus. The Pixel ALU consists of four 8-bit ROP/Blend
units, which may be independently programmed to perform either a raster
operation or a blending function, one 32-bit Match Compare unit, and one
32-bit Magnitude Compare unit. The two compare units are also commonly
referred to as the Dual Compare units. The ROP/Blend units and the Dual
Compare units are highly pipelined. The output of a ROP/Blend unit is
conditionally written to the Pixel Buffer, depending on the comparison results
from the on-chip Dual Compare units and from the Dual Compare units of
the preceding 3-D RAM chips. Figure 4.22a shows the Pixel ALU block
diagram [22].
The ROP/Blend units can be configured as either a ROP unit or a blend unit
by setting a register bit. Each ROP unit can perform all 16 standard ROP
functions. One of the operands of the ROP functions is the old data from the
Pixel Buffer, and the other operand may be either the data from the primary I/O
pins or the data from an internal register (called the Constant Source register).
In the Dual Compare units, both the Match Compare and Magnitude
Compare are done in parallel. One of the sources is always the old data from
the Pixel Buffer. The other source is independently selectable between the data
from the PALU_DQ pins and the data from the Constant Source register.


Figure 4.22 3-D RAM block diagrams. (a) Pixel ALU. (b) Dual Compare unit. (From
reference 22, with permission of Mitsubishi Corp.)

There are also two mask registers, namely the Match Mask and the Magnitude
Mask, that define which bits of the 32-bit words will be compared and which
will be "don't care." The results of both the Match Compare and Magnitude
Compare operations are logically ANDed together to generate the PASS_OUT
pin. The external PASS_IN signal (fed from another 3-D RAM chip) and the
internally generated PASS_OUT signal are then logically ANDed together to
produce a Write Enable signal to the Pixel Buffer. Figure 4.22b shows the
block diagram of the Dual Compare unit.
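The compare-and-chain logic reduces to a pair of AND operations, as the
following sketch shows. The mask semantics (masked-off bits treated as
"don't care") and the specific Z-test in the example are assumptions made for
illustration; the actual compare functions are programmable.

    def dual_compare(old, new, match_mask, mag_mask, match_pass, mag_pass):
        """PASS_OUT of one 3-D RAM: Match Compare AND Magnitude Compare, each applied
        only to the bits selected by its mask."""
        match_ok = match_pass(old & match_mask, new & match_mask)
        mag_ok = mag_pass(old & mag_mask, new & mag_mask)
        return match_ok and mag_ok

    def write_enable(pass_in, pass_out):
        """Write enable to the Pixel Buffer: external PASS_IN ANDed with internal PASS_OUT."""
        return pass_in and pass_out

    # Example: a Z-buffer style test -- magnitude compare "new <= old" on all 32 bits,
    # match compare effectively unused (mask = 0 always passes here).
    passes = dual_compare(old=0x0000_1000, new=0x0000_0FFF,
                          match_mask=0x0, mag_mask=0xFFFF_FFFF,
                          match_pass=lambda a, b: True,
                          mag_pass=lambda a, b: b <= a)
    print(write_enable(pass_in=True, pass_out=passes))   # True -> the pixel is written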

Video Buffers The 3-D RAM functional block diagram in Figure 4.18
shows the Video Buffers I and II, each of which receives 640 bits of data at a
time from one of the two DRAM banks connected to it. Sixteen bits of data
are shifted out onto the video data pins every video clock cycle at a 14-ns rate.
It takes 40 video clocks to shift all data out of a video buffer. These two video
buffers can be alternated to provide a seamless stream of video data.

Global Bus The 3-D RAM functional block diagram in Figure 4.18 shows
the Global Bus connecting the Pixel Buffer to the sense amplifiers of all four
DRAM banks. The Global Bus consists of 256 data lines, and during a transfer
from the Pixel Buffer to a DRAM bank, the 256 bits are conditionally written
depending on the 32-bit Dirty Tag and the 32-bit Plane Mask. When a data
block is transferred from the Pixel Buffer to the sense amplifiers, the Dirty Tag
and Plane Mask control which bits of the sense amplifiers are changed, using
the Write Buffer. A read operation across the global bus always means a read
by the Pixel ALU; that is, the data are transferred from a DRAM bank into
the Pixel Buffer. Similarly, a write operation across the Global Bus means that
the data are updated from the Pixel Buffer to a DRAM bank. These operations
are accomplished by using the Global Bus Read Block Enable and Global Bus
Write Block Enable signals.
The 3-D RAMs can be used to implement frame buffers of various
resolutions and depths. These are some of the examples of frame buffer
organizations:

1280 x 1024 x 8 Buffer Organization In this organization, the screen display
is made up of an 8W x 32H array of page groups (i.e., 8 page groups wide
by 32 page groups high). A page group is 160 pixels wide by 32 pixels high and
consists of the same page from all four DRAM banks (A, B, C, D). The four
independent DRAM banks can be interleaved to allow pages to be prefetched
as images are drawn. Each page within a page group is 80 pixels wide by 16
pixels high. The pages are sliced either into sixteen 80-pixel-wide scan lines
when sending data to its Video Buffer or into a 10W x 4H array of 256-bit
blocks when dealing with the Global Bus. Two pixels are shifted out of the
Video Buffer every video clock.
The blocks are 8 pixels wide by 4 pixels high and can be transferred to and
from one of the Pixel Buffer blocks via the Global Bus. The Pixel ALU and
data pins access four pixels of a Pixel Buffer block at a time. The Dirty Tag
for an entire Pixel Buffer block can be written in a single cycle from the data
pins.

1280 x 1024 x 32 Single-Buffered Organization A frame buffer of this size
requires four 3-D RAMs. There are two recommended methods of organizing
the 3-D RAMs, which trade off 2-D color-expansion rendering performance
against pixel-oriented rendering performance, as follows:

• Each of the four components of a pixel (R, G, B, alpha) is in a separate
3-D RAM. Therefore, each 3-D RAM supports 1280 x 1024 x 8.
• All four components of a pixel reside in the same 3-D RAM. The four
3-D RAMs are interleaved on a pixel-by-pixel basis in a scan line. Thus,
each 3-D RAM supports 320 x 1024 x 32. This is very similar to the
1280 x 1024 x 8 organization except that the pixels are four times as deep
and the widths of the screen, page groups, pages, and blocks are
one-fourth as wide.

One pixel is shifted out of the Video Buffer every two video clocks. The Pixel
ALU and PALU_DQ pins access one pixel of a Pixel Buffer block. The Dirty
Tag for an entire Pixel Buffer block can be written in a single cycle from the
PALU_DQ pins. The Dirty Tag controls the four bytes of the 32-bit pixel
independently. Figure 4.23 shows the block diagram of a 1280 x 1024 x 32
frame buffer consisting of four 3-D RAMs, a rendering controller, and a
RAMDAC [22].
The rendering controller writes pixel data across the 128-bit bus to the four
3-D RAMs. The controller commands most of the 3-D RAM operations,
including ALU functions, Pixel Buffer addressing, and DRAM operations. The
controller can also command video display by setting up the RAMDAC and
requesting video transfers from the 3-D RAMs. With the 128-bit pixel data
bus, four pixels can be moved across the bus in one cycle.

1280 x 1024 x 32 Double-Buffered Organization with Z The basic con-
figuration for a 1280 x 1024 x 32 double-buffered organization with a Z buffer
uses twelve 3-D RAMs. In this configuration, each 3-D RAM (for buffers A, B,
and Z) covers a 320 x 1024 portion of the 1280 x 1024 displayed image. This
means that vertical scrolling can take place at a very high speed, because
all data movement occurs within the 3-D RAM chips rather than across the
chips. Horizontal scrolling would require 3-D RAM to 3-D RAM data
transfers. Each of the buffers A, B, and Z is 32 bits in pixel width, allowing 8
bits each for R, G, and B and 8 bits for alpha or overlays. The eight 3-D
RAMs containing this data are referred to as the Color Buffer 3-D RAMs. In
the case of the Z buffer, 24 bits can be used for the depth and 8 bits for a
combination of stencil pattern ID and window ID; these four 3-D RAMs
are referred to as the Z Buffer 3-D RAMs.


Figure 4.23 3-D RAM block diagram for a 1280 x 1024 x 32 frame buffer organization.
(From reference 22, with permission of Mitsubishi Corp.)

640 x 512 x 8 Double-Buffered Organization with Z A single 3-D RAM
chip can be configured to support a 640 x 512 x 8 double-buffered organization
with a 16-bit Z. This configuration may be suitable for a very high performance,
low-cost consumer home or arcade game application.

4.6. MEMORY SYSTEM DESIGN CONSIDERATIONS

The technology trend for PC main memory DRAMs over the past several
years has been to improve the data transfer rates by using improved commod-
ity DRAM architectures such as the EDO devices and then the SDRAMs,
which have evolved from the 66-MHz version to PC100 and PC133 SDRAMs.
In addition, further performance improvements have been proposed for DRAM
architectures such as the DDR devices, Rambus DRAMs, and SLDRAMs.
The Rambus architecture promises to deal with escalating microprocessor clock
rates that require addressing of two key issues, as follows [23]: (1) latency,
which is basically the time period that a microprocessor has to wait for the first
piece of data after it is requested, and (2) data transfer rate. The Rambus

reduces some of these problems, in part through strict layout rules that specify
maximum path lengths, so that the signal is not distorted by a particular path.
Also, the Rambus architecture is packet-based, so that the stored data and
address information are sent to the microprocessor as a single packet that is
several bits long. While the Rambus provides the performance, there are
limitations to the maximum size of memory that can be used with Rambus
architecture. Therefore, a memory system design has to be based on cost/
performance considerations.
Some workstations and high-end servers are using DDR DRAMs, because
these applications require larger memories, and other techniques to improve
the memory performance, such as interleaving, are available. Because DDR
DRAMs have adequate transfer rates and somewhat better latency than the
Rambus architecture, many workstation and server designs can take advantage
of the DDR's low cost. SLDRAMs also have the potential to find a niche
market in this area, because the SLDRAM is a high-speed part that can be
used as a building block for large memory systems.
For the past few years, SGRAMs have been the most commonly used
memory for graphics design applications, starting with the 8-Mbyte part and
evolving to 16-Mbyte and higher densities. The graphics system designers have
always preferred as wide a memory as possible, to minimize the size of overall
memory. In the earlier designs, a 1-Mbyte frame buffer size was considered
more than enough. Nowadays, with 3-D applications growth, the graphics
designers have started using 2-, 4-, 8-, and even 16-Mbyte frame buffers. The
economics of SDRAMs have been pushing graphics system designers away
from the SGRAMs. Also, the advent of PC100 SDRAM has created a set of
specifications that are ideal for high-speed data transfer in graphics applica-
tions. The combination of the PC100 specification with 1-Mbyte x 16 SDRAM
devices is finding wide acceptance in the graphics design industry.
In computing applications, SDRAM has been the mainstream memory and
takes advantage of the fact that most PC memory accesses are sequential; it is
designed to fetch all of the bits in a burst as fast as possible. In SDRAM
architecture, an on-chip burst counter allows the column part of the address
to increment rapidly. The memory controller provides the location and size of
the memory block required, while the SDRAM chip supplies the bits as fast as
the CPU can take them, using a clock for timing synchronization of the
memory chip to the CPU's system clock [24]. This key feature of SDRAM
provides an important advantage over other asynchronous memory types,
enabling data to be delivered off-chip at a burst rate of up to 100 MHz. Once
a burst has started, all remaining bits of the burst length are delivered at a
10-ns rate.
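As a rough illustration of why this matters, the sketch below estimates the total transfer time and effective rate for a 100-MHz SDRAM burst: after an initial latency (shown here as an assumed, adjustable parameter rather than a data-sheet value), each remaining word of the burst arrives at the 10-ns clock rate described above.

```python
def sdram_burst_time_ns(burst_length, initial_latency_ns=60.0, clock_ns=10.0):
    """Approximate time to read a burst from a 100-MHz SDRAM.

    initial_latency_ns is an assumed figure for the first access (row
    activation plus CAS latency); subsequent words arrive every clock_ns.
    """
    return initial_latency_ns + (burst_length - 1) * clock_ns

for burst in (1, 4, 8, 256):
    t = sdram_burst_time_ns(burst)
    words_per_s = burst / (t * 1e-9)
    print(f"burst {burst:>3}: {t:7.1f} ns total, "
          f"{words_per_s / 1e6:6.1f} Mwords/s effective")
# Longer bursts amortize the initial latency and approach the 100-MHz peak rate.
```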
The other three competing technologies have been Rambus DRAM, DDR
SDRAM, and SyncLink DRAMs (SLDRAMs), of which Rambus architecture
has become the choice for PCs because of Intel's support. The future of
SLDRAMs is uncertain. Currently, with mainstream CPUs operating at 800
MHz and higher, it is clear that their external memory bandwidth cannot
meet the increasing application demands. Direct RDRAM has been introduced to address those issues and is a result of collaboration between Intel and Rambus to develop a new memory system. It is actually a third iteration of the original Rambus designs running at 600 MHz, which then increased to 700 MHz with the introduction of Concurrent RDRAM.
In the Direct Rambus designs, at current speeds, a single channel is capable of data transfers at 1.6 Gbytes/s and higher. Also, multiple channels can be used in parallel to achieve a throughput of up to 6.4 Gbytes/s. The new architecture will have operational capability for bus speeds of up to 133 MHz. The Rambus DRAM also has an edge in latency because, at the 800-MHz data rate, the interface to the device operates at an extremely fine timing granularity of 1.25 ns. The PC100 SDRAM interface runs with a coarse timing granularity of 10 ns. The 133-MHz SDRAM interface, with its coarse timing granularity of 7.5 ns, incurs a mismatch with the timing of the memory core.
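The timing granularities quoted above are simply the reciprocal of the data rate on the interface, as the short sketch below illustrates (the data rates are taken from the text).

```python
# Timing granularity = period of one data transfer on the interface.
interfaces = {
    "Direct RDRAM (800-MHz data rate)": 800e6,
    "PC100 SDRAM (100 MHz)": 100e6,
    "PC133 SDRAM (133 MHz)": 133e6,
}

for name, rate_hz in interfaces.items():
    granularity_ns = 1e9 / rate_hz
    print(f"{name}: {granularity_ns:.2f} ns")
# -> 1.25 ns, 10.00 ns, and ~7.52 ns (the text rounds the last to 7.5 ns)
```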
Rambus design appears to be the popular choice in PC DRAM architecture
evolution. Intel has released its 820 chip set (code-named Camino), which has
a 133-MHz system bus with direct interfacing to the Rambus DRAMs. Several
other major PC manufacturers such as IBM, Hewlett-Packard, Micron, and
Dell Computers are expected to release their business and/or consumer
desktops with the Rambus DRAMs.
The DDR DRAM is the other memory technology competing to provide
system builders with high-performance alternatives to Direct RDRAM. The
DDR SDRAM, by providing the chip's output operations on both the rising
and falling edges of the clock, effectively doubles the clock frequency. It has the
most appeal to workstation and high-end server designers.
The chip sets and memory controllers already exist which support 133-MHz (PC133) and faster memory buses. However, a PC133 SDRAM may or may not outperform a PC100 SDRAM, depending on three critical parameters, as follows: CAS latency (CL), RAS-to-CAS delay time (tRCD), and RAS precharge time (tRP). These parameters are measured in terms of the number of clock cycles. For example, a device with CL = 2 cycles, tRCD = 2 cycles, and tRP = 2 cycles is commonly referred to as a 2-2-2 device. Table 4.5 shows a comparison of a PC100 CL2 device to a PC133 CL2 device [25]. The values shown in this table are taken from Toshiba's 128-Mb SDRAM data sheet.

Table 4.5 shows that in comparison to the PC100 CL2 device, which is considered the current baseline for memory performance, the PC133 CL3 device is about 4% slower, while the PC133 CL2 device is 17% faster. The calculations shown are based solely on the three critical parameters listed above, and actual system performance will depend on the application and other factors as well. It should be noted that two out of the three critical parameters, tRP and tRCD, are specified as fixed values in nanoseconds and are not necessarily an integer number of cycles. If the memory controller only interprets these parameters as an integer number of clock cycles, then they must be rounded up to the next highest value. Therefore, in Table 4.5, the PC100 CL2 device is referred to as 2-2-2, the PC133 CL3 device as 3-3-3, and the PC133 CL2 device as 2-2-2.

TABLE 4.5 A Comparison of a PC100 CL2 Device to a PC133 CL2 Device [25]

Memory        CAS Latency   RAS Precharge    RAS-to-CAS Delay   CL + tRP + tRCD   Performance
Bus Speed     (CL)          Time (tRP)       Time (tRCD)        (Total Time)      (Normalized)

100 MHz       20 ns         20 ns            20 ns              60 ns             1.00
(PC100)       (2 cycles)    (2 cycles)       (2 cycles)
133 MHz       22.5 ns       20 ns            20 ns              62.5 ns           0.96
(PC133)       (3 cycles)    (2.67 cycles)    (2.67 cycles)
133 MHz       15 ns         15 ns            15 ns              45 ns             1.25
(PC133)       (2 cycles)    (2 cycles)       (2 cycles)
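The total-time column and the device labels can be reproduced with the short calculation sketched below; it also shows the rounding behavior described in the text, where a controller that counts only whole cycles must round the 2.67-cycle delays of the PC133 CL3 part up to 3 (hence 3-3-3).

```python
import math

def first_access(clock_mhz, cl_cycles, trp_ns, trcd_ns):
    """Total first-access time (CL + tRP + tRCD) and the equivalent count of
    whole clock cycles when each delay is rounded up, as a controller that
    counts only integer cycles must do."""
    period_ns = 1000.0 / clock_mhz
    total_ns = cl_cycles * period_ns + trp_ns + trcd_ns
    cycles = (cl_cycles
              + math.ceil(trp_ns / period_ns)
              + math.ceil(trcd_ns / period_ns))
    return total_ns, cycles

for label, args in [("PC100 CL2", (100.0, 2, 20, 20)),
                    ("PC133 CL3", (133.33, 3, 20, 20)),
                    ("PC133 CL2", (133.33, 2, 15, 15))]:
    total_ns, cycles = first_access(*args)
    print(f"{label}: {total_ns:5.1f} ns total, {cycles} cycles when rounded up")
# PC100 CL2: 60.0 ns, 6 cycles (2-2-2); PC133 CL3: 62.5 ns, 9 cycles (3-3-3);
# PC133 CL2: 45.0 ns, 6 cycles (2-2-2)
```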

The performance benefits of the DDR versus RDRAM are commonly debated in the industry, and a wide range of performance numbers are shown, especially in the peak bandwidth comparisons. While peak bandwidth is important, there are other factors that also need to be taken into consideration, such as sustained (or effective) bandwidth, latency, the number of internal banks, and the read-to-write/write-to-read bus turnaround time. Effective bandwidth is also a function of certain system- or application-dependent parameters, such as the burst length. Table 4.6 shows the comparison of peak bandwidth for PC100, DDR, and RDRAM for various memory bus widths. The DDR device shown is based on current industry specifications, which include 100-MHz and 133-MHz clock rates. DDR-II is currently being defined by JEDEC and is expected to offer much higher clock rates and features to improve effective bandwidth.
Analysis of Table 4.6 shows that the DDR devices can match RDRAMs in
terms of peak bandwidth. However, the system designer must evaluate the
tradeoffs of widening the bus from 64 to 128 bits for the DDR versus adding
multiple channels for the RDRAM. Additionally, the peak bandwidth is only
one factor in determining the effective bandwidth.
All of the DRAM types commonly used in the industry such as the EDO,
SDRAM, DDR, and RDRAM have one thing in common, namely, their
memory cores; the major differences lie in the peripheral logic circuitry designs.
Another offering in application-specific, high-performance memory designs is the fast-cycle RAM (FCRAM), which addresses the issue of the slow memory core by segmenting it into smaller arrays such that the data can be accessed much faster and latency is improved. The key measure of FCRAM latency reduction and system performance improvement is the read/write cycle time (tRC) parameter, which specifies the amount of time a DRAM takes for a read or a write cycle before it can start another one. In the case of conventional memory types, including the SDRAM, DDR, and RDRAM, tRC is typically on the order of 70 ns, whereas for the FCRAM a tRC of 20 or 30 ns is possible. In addition to a faster tRC, FCRAM has several other features to help improve performance (see Section 4.2.2).
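Since tRC bounds how often a fully random access can be issued to a bank, a quick way to see the FCRAM advantage is to convert tRC into a maximum random-access rate, as in the sketch below (the tRC values are the approximate figures from the text).

```python
# Maximum fully random accesses per second to a single bank is about 1 / tRC.
trc_values_ns = {
    "Conventional SDRAM/DDR/RDRAM core": 70,
    "FCRAM (30-ns grade)": 30,
    "FCRAM (20-ns grade)": 20,
}

for name, trc_ns in trc_values_ns.items():
    accesses_per_us = 1000.0 / trc_ns
    print(f"{name}: tRC = {trc_ns} ns -> ~{accesses_per_us:.1f} random accesses/us")
# ~14.3 accesses/us for a 70-ns core versus ~33.3 and 50.0 for the FCRAM grades
```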

TABLE 4.6 A Comparison of Peak Bandwidth for PC100, DDR, and RDRAM
for Various Memory Bus Widths [25]

DRAM Type    Clock/Data Rate     Memory Bus Width       Peak Bandwidth

PC100        100 MHz/100 MHz     64 bit                 800 MB/s
DDR          100 MHz/200 MHz     64 bit                 1.6 GB/s
DDR-II       200 MHz/400 MHz     64 bit                 3.2 GB/s
                                 128 bit                6.4 GB/s
RDRAM        400 MHz/800 MHz     16 bit (1 channel)     1.6 GB/s
                                 32 bit (2 channels)    3.2 GB/s
                                 64 bit (4 channels)    6.4 GB/s
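Each peak bandwidth entry in Table 4.6 is simply the data rate multiplied by the bus width in bytes; a minimal check of the table follows.

```python
def peak_bandwidth_mb_s(data_rate_mhz, bus_width_bits):
    """Peak bandwidth in MB/s: one transfer of bus_width_bits per data clock."""
    return data_rate_mhz * bus_width_bits / 8

print(peak_bandwidth_mb_s(100, 64))    # PC100, 64-bit bus    ->  800 MB/s
print(peak_bandwidth_mb_s(200, 64))    # DDR (100-MHz clock)  -> 1600 MB/s
print(peak_bandwidth_mb_s(400, 128))   # DDR-II, 128-bit bus  -> 6400 MB/s
print(peak_bandwidth_mb_s(800, 16))    # RDRAM, 1 channel     -> 1600 MB/s
print(peak_bandwidth_mb_s(800, 64))    # RDRAM, 4 channels    -> 6400 MB/s
```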

In summary, the major DRAM features and parameters that affect a high-performance memory system design are as follows:

• Latency   This is basically the amount of time it takes for a DRAM to begin outputting data in response to a command from the memory controller. Typically, there are several measures of DRAM latency, such as the row address access time (tRAC) and RAS precharge time (tRP). Most of the measures of DRAM latency are a function of the memory core design and the wafer process technology used.
• Number of Banks   The number of banks in a DRAM is a major factor in determining the actual system latency. This is due to the fact that a DRAM can access data much faster if the data are located in a bank that has been activated (or precharged). The precharged bank can contain the same page (row) that is currently being accessed, or it can be a bank that is not currently being accessed. If the data are located in a precharged bank, this is often called a page hit, which means that the data can be accessed very quickly without the delay penalty of having to close the current page and precharge another bank. However, if the data are in a bank that has not been precharged, or in a different row within the bank currently being accessed, a page miss occurs and performance is degraded due to the additional latency of having to precharge the bank.
  The memory controller design can minimize latency by keeping all unused banks precharged. Thus, more internal DRAM banks increase the probability that the next data accessed will be in an active bank, and this minimizes latency. Although adding more banks increases the hit rate and reduces latency, it can increase the die size and cost of the DRAM. (A simple latency model illustrating this hit-rate effect is sketched after this list.)
• Bus Turnaround Time The bus turnaround time is the time it takes a
DRAM to switch between a read and a write cycle or between a write and
a read cycle. This is becoming a critical factor, and delays in turning the
bus around can result in costly dead bus cycles and reduced performance.
In order to minimize dead bus cycles, a fast (preferably zero) read-to-write or write-to-read bus turnaround time is required. The bus turnaround time is even more critical for the DDR, because data is transferred on both the rising and falling edges of the clock; in other words, for every dead clock cycle there are two dead data cycles. The emerging DDR-II standard is attempting to address the issue of bus turnaround time. In the case of the RDRAM, the bus turnaround time is less of an issue because the device has separate address and control buses, such that simultaneous decoding is not required.
• Burst Length and Randomness These are application-dependent par-
ameters. In general, the burst length is defined as the number of successive
accesses (column addresses) within a row or precharged bank. It is the
number of successive read/write cycles without having to provide a new
address. DRAMs can access data quickly if the next data are located in
the same row as the current data or in a precharged bank. Therefore, as the burst length becomes longer, the initial latency is amortized and the effective bandwidth approaches the peak bandwidth. Graphics design is a good example of an application with relatively long bursts, whereas network switches and routers tend to have very short burst lengths. Therefore, the applications with very short burst lengths are often referred to as "random access" applications, because it is not easy for the memory controller to predict where the next data bits are located.
  In general, a burst length of 1 or 2 (short) is fairly typical of network switches/routers, 4 to 8 (medium) of PC main memory, and 8 to 256 (long) of graphics applications. A comparison of currently available DRAM timing specifications shows a burst length of four to be an optimal number; burst lengths of less than 4 do not take much advantage of the DRAM's peak bandwidth capability.
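The sketch below illustrates the bank/page-hit effect referred to in the Number of Banks item: average access latency falls as the hit rate rises, and more banks (with a controller that keeps unused banks precharged) push the hit rate up. The hit-rate and latency figures are illustrative assumptions only, not values from any data sheet.

```python
def average_latency_ns(hit_rate, t_hit_ns=20.0, t_miss_ns=60.0):
    """Expected access latency for a given page-hit rate.

    t_hit_ns  : assumed latency when the access falls in an open/precharged bank
    t_miss_ns : assumed latency when the bank must first be precharged/activated
    """
    return hit_rate * t_hit_ns + (1.0 - hit_rate) * t_miss_ns

# Illustrative only: assume the hit rate improves as the bank count grows.
for banks, hit_rate in [(2, 0.40), (4, 0.60), (8, 0.75), (16, 0.85)]:
    print(f"{banks:>2} banks, hit rate {hit_rate:.0%}: "
          f"~{average_latency_ns(hit_rate):.0f} ns average latency")
```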

A combination of all the factors listed above determines the system bus utilization, that is, the percentage of time the memory bus is actually reading/writing data. Once this factor is known, it is easy to determine the effective bandwidth by multiplying the bus utilization factor by the peak bandwidth. For example, if in a system the bus utilization is 50% and the peak bandwidth is 2 Gbytes/s, the effective bandwidth is at most 1 Gbyte/s.
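A one-line calculation captures this relationship; the 50% and 2-Gbyte/s figures below are simply the example from the text.

```python
def effective_bandwidth_gb_s(peak_gb_s, bus_utilization):
    """Effective bandwidth = peak bandwidth x fraction of cycles doing useful transfers."""
    return peak_gb_s * bus_utilization

print(effective_bandwidth_gb_s(2.0, 0.50))  # -> 1.0 GB/s, as in the text's example
```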
A comparison of bus utilization for various DRAM types such as PC100, PC133, DDR, RDRAM, and FCRAM shows that with the SDRAM and DDR devices, the bus utilization decreases as the clock frequency increases. This is because the dead bus cycles have a greater impact on performance as the data rates increase. RDRAM and FCRAM can perform nearly gapless read-write and write-read bursts, because there are almost never any dead bus cycles. RDRAM has the highest effective bandwidth, due to its architecture specifically designed for PC main memory. FCRAM also matches the RDRAM in terms of effective bandwidth, and it is especially suitable for applications with more randomness and shorter burst lengths.

TABLE 4.7 Examples of Memory Granularity for a Peak Bus Width for a Variety of
DRAM Types and System Implementations

                                 DRAM Data
DRAM Type         DRAM Density   Bus Width   System Bus Width      Granularity   Peak Bandwidth

SDRAM             64 Mbit        16 bit      64 bit                32 MB         800 MB/s
(100-MHz clock)   128 Mbit       16 bit      64 bit                64 MB         800 MB/s
                  256 Mbit       16 bit      64 bit                128 MB        800 MB/s
                  512 Mbit       16 bit      64 bit                256 MB        800 MB/s

DDR               64 Mbit        16 bit      64 bit                32 MB         2.13 GB/s
(133-MHz clock)   128 Mbit       16 bit      64 bit                64 MB         2.13 GB/s
                  256 Mbit       16 bit      64 bit                128 MB        2.13 GB/s
                  512 Mbit       16 bit      64 bit                256 MB        2.13 GB/s

RDRAM             128 Mbit       16 bit      16 bit (1 channel)    16 MB         1.6 GB/s
(400-MHz clock)                              32 bit (2 channels)   32 MB         3.2 GB/s
                                             64 bit (4 channels)   64 MB         6.4 GB/s
                  256 Mbit       16 bit      16 bit (1 channel)    32 MB         1.6 GB/s
                                             32 bit (2 channels)   64 MB         3.2 GB/s
                                             64 bit (4 channels)   128 MB        6.4 GB/s
                  512 Mbit       16 bit      16 bit (1 channel)    64 MB         1.6 GB/s
                                             32 bit (2 channels)   128 MB        3.2 GB/s
                                             64 bit (4 channels)   256 MB        6.4 GB/s

A major factor in evaluating a memory system's cost/performance tradeoffs is the concept of granularity, which ultimately determines the system cost. Granularity is the minimum system density (in megabytes) that is possible for a given DRAM configuration and system bus width. Table 4.7 shows the granularity for a peak bus width for a variety of DRAM types and system implementations [25].
A key observation is the difference in system architectures for the SDRAMs
(including DDR) versus RDRAMs. SDRAMs must be used in parallel, which
increases granularity. In the example shown in this table, four 16-bit devices
must be connected in parallel to match the 64-bit system bus width. Therefore,
the system granularity is four times the granularity of the device. For RDRAM,
because the system bus (Rambus channel) width is the same as the device bus
width, the granularity of the system is equal to that of the RDRAM multiplied
by the number of channels. In terms of cost, because a single RDRAM can be used, smaller-memory (lower-cost) systems are possible with RDRAM than with the SDRAMs. For example, using 256M (x16) SDRAMs, a system with 64 MB is not possible. However, with 256M RDRAMs, 32-MB or 64-MB systems are possible.
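The granularity figures in Table 4.7 and the example above can be reproduced with the small sketch below: an SDRAM/DDR system needs enough x16 devices in parallel to fill the system bus, whereas an RDRAM channel is only one device wide.

```python
def sdram_granularity_mb(density_mbit, device_width_bits=16, system_bus_bits=64):
    """Minimum SDRAM/DDR system size: enough devices in parallel to fill the bus."""
    devices = system_bus_bits // device_width_bits
    return devices * density_mbit // 8          # Mbit -> MB per device

def rdram_granularity_mb(density_mbit, channels=1):
    """Minimum RDRAM system size: one device per 16-bit Rambus channel."""
    return channels * density_mbit // 8

print(sdram_granularity_mb(256))       # 256M x16 SDRAM, 64-bit bus -> 128 MB
print(rdram_granularity_mb(256, 1))    # 256M RDRAM, 1 channel      ->  32 MB
print(rdram_granularity_mb(256, 2))    # 256M RDRAM, 2 channels     ->  64 MB
```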
An RDRAM system can be built with less memory than an SDRAM/DDR
system and at the same time have significantly better performance. For
example, using 256M RDRAMs, a 64-MB, 2-channel RDRAM system can be
built providing 3.2 GB/s of peak bandwidth, along with an effective bandwidth
that is almost five times that of the corresponding SDRAM system and over
2.5 times that for the DDR system. The cost for this RDRAM system will be comparatively lower, since 64 MB is not possible with the SDRAM/DDR at the 256M density. However, this may change as the SDRAM/DDR chips evolve to gigabit densities in the future.

REFERENCES
1. Ashok K. Sharma, Semiconductor Memories: Technology, Testing and Reliability,
IEEE Press, New York, 1997.
2. Brian Dipert, The slammin', jammin' DRAM scramble, EDN, January 20, 2000, pp.
68-82.
3. Dave Bursky, Advanced DRAM architectures overcome data bandwidth limits,
Electron. Des., November 17, 1997, pp. 73-88.
4. Bruce Miller et al., Two high-bandwidth memory bus structures, IEEE Des. Test Comput., January-March 1999, pp. 42-52.

5. Dave Bursky, Graphics-optimized DRAMs deliver top-notch performance, Electron. Des., March 23, 1998, pp. 89-100.
6. Billy Garrett, Applying Rambus technology to graphics, February 1992 (version
1.0), Rambus Inc., web page.
7. IBM Application Note: Designing with 4 Mb VRAM.
8. IBM Application Note: Half SAM and Full SAM Compatibility.
9. IBM Application Note: Understanding VRAM and SGRAM Operation.
10. M. Chao et al., Double Data Rate SGRAM delivers needed bandwidth for 3-D
graphics, Samsung Electronics web page.
11. Samsung 64M DDR SGRAM Data Sheets (K4D623237M).
12. Fujitsu Semiconductor Data Sheets: Memory CMOS 256 Mbit Double Data Rate
FCRAM™.
13. Rambus Technology Overview paper on Rambus web page.
14. Rambus 128/144 Mbit Direct RDRAM specification sheets, Document DL0059,
Version 1.0.
15. Rambus Application Note: Rambus in High Availability Systems.
16. Rambus Application Note: Direct Rambus Memory for Large Memory Systems.
17. Rich Warnke, Designing a multimedia subsystem with Rambus DRAMs, Multi-
media Systems Design, March 1998.
18. Peter Gillingham et al., SLDRAM: High-performance, open standard memory,
IEEE Micro, November/December 1997, pp. 29-39.
19. IEEE Standard P1596.7 (Draft 0.99): SyncLink Memory Interface Standard.
20. Peter Gillingham, SLDRAM architectural and functional overview, Technical
Paper on SLDRAM web site and Mosaid Technologies, Inc., web site.
21. SLDRAM 400 Mb/s/pin Data Sheet CORP400.P65, Rev. 7/9/98.
22. Mitsubishi 3D-RAM (M5M410092B) Data Sheets Preliminary Rev. 0.95.
23. Mark Ellsberry, Memory design consideration for accelerating data transfer rate,
Computer Design, November 1998, pp. 58-62.
24. Jeff Child, DRAM's ride to next generation looks rocky, Embedded Syst. Dev.,
December 1999, pp. 44-47.
25. Application Note on Toshiba's web page: Choosing High-Performance DRAM for Tomorrow's Applications.
