APPLICATION-SPECIFIC DRAM ARCHITECTURES AND DESIGNS
entire contents of a memory chip. Therefore, for these DRAMs, the memory
latency is as important as the time for each subsequent data word in the
transfer sequence (the burst length). One of the first attempts to speed up
DRAM access time was the cache DRAM (CDRAM), in which the internal
architecture consists of a standard DRAM storage array and an on-chip cache.
The cache and memory core array are linked by a wide bus, so that the entire
cache can be loaded up in just a single cycle. The CDRAMs were discussed in
Chapter 3 (Section 3.7).
For PC memories, the biggest debate in DRAM applications has been
whether random-access latency or burst bandwidth is the more significant
performance parameter [2]. The shorter the average burst-access length, the
lower the chances of amortizing an extended initial latency over the much
shorter subsequent accesses in the burst. Also, the more effective the CPU's
caching scheme in front of the DRAM array, the more random the DRAM accesses
made to fill the cache lines. When a CPU, particularly one without pipelining
or prefetch support, has to read information from main memory, the CPU stalls,
wasting clock cycles until the completion of the first data access. Therefore,
the fewer the system masters accessing main memory, the lower the chances that
they will consume a significant amount of the memory's peak bandwidth.
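As a rough illustration of why burst length matters, the following sketch (with made-up numbers, not taken from any particular device) computes the effective bandwidth of an isolated burst once the initial access latency is included.

```python
# Illustrative only: how an initial access latency dilutes peak burst bandwidth.
def effective_bandwidth(latency_ns, beats, bytes_per_beat, transfer_mhz):
    """Average bandwidth of one isolated burst, including the initial latency."""
    burst_ns = beats * 1000.0 / transfer_mhz        # time to stream the burst
    total_bytes = beats * bytes_per_beat
    return total_bytes / (latency_ns + burst_ns)    # bytes/ns, i.e., GB/s

# A long burst amortizes a 60-ns first access far better than a short one.
for beats in (2, 4, 8, 32):
    bw = effective_bandwidth(latency_ns=60, beats=beats,
                             bytes_per_beat=8, transfer_mhz=100)
    print(f"{beats:2d}-beat burst: {bw:.2f} GB/s effective")
```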
The 16-Mbit SDRAMs with their dual-bank architecture were the first
multiple-sourced, new-architecture memories to offer performance levels well
above that obtainable from the extended-data-out RAMs (EDRAMs). The
first-generation 16-Mb SDRAMs were specified as 100-MHz devices. Although
SDRAMs are designed to a JEDEC standard, slight differences in interpretations
of the specification and in test methodologies have made chip interchangeability
a concern. Therefore, chips that are capable of 100-MHz operation under ideal
conditions are specified for operation limited to 66 MHz because of the timing
differences in most PCs.
To boost the SDRAM performance of first-generation 16-Mb devices, memory
designers have tightened some of the ac timing margins, dc parameters, and
layout rules to achieve "true" 100-MHz operation in compliance with the PC100
specification requirements for 100-MHz system operation. The second-generation
SDRAMs have been pushing process technologies to achieve even higher speeds,
such as clock rates of up to 133 MHz.
The second-generation SDRAMs include higher-performance devices that
employ four memory banks per chip. SDRAMs in the 16- and 64-Mb
generation are available with word widths of 4, 8, or 16 bits. The advanced
64-Mb and 256-Mb SDRAMs are available in 32-bit word width. The DDR
SDRAMs allow the chip to deliver data twice as fast as the single-data-rate
SDRAMs. These were discussed in Chapter 3 (Section 3.5).
Many high-end computer architectures, servers, and other systems that
require hundreds of megabytes of DRAMs have been using SDRAMs for the
main memory. However, the future home-office desktop computers will typi-
cally use some variation of specialized DRAM architectures such as the
Rambus DRAM (RDRAM) due to their smaller granularity. In general, larger
memory systems most often employ narrower word widths, because such
systems often require a lot of depth, whereas the smaller systems end up using
wider word chips, because the memory depth is smaller and wide memories
could greatly reduce the chip count [3].
For example, the current SDRAMs are available with word widths of x 4, x 8, or
x 16 bits. Assuming a 64-bit-wide memory module, if the unit is assembled
with 4-bit-wide SDRAMs, it would have a depth of 16 Mwords and a total
storage of 128 Mbytes. However, if these memory modules are built with
8-bit-wide SDRAMs, the module would pack 64 Mbytes and have a depth of
8 Mwords.
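The module arithmetic in the preceding paragraph can be written out as a small sketch; the only figures assumed are the 64-Mbit chip density and 64-bit module width used in the text's example.

```python
# Module arithmetic for a 64-bit-wide module built from 64-Mbit SDRAM chips.
def module_config(bus_width_bits, chip_width_bits, chip_density_mbits):
    chips = bus_width_bits // chip_width_bits              # chips needed across the bus
    depth_mwords = chip_density_mbits // chip_width_bits   # depth of each chip = module depth
    total_mbytes = chips * chip_density_mbits // 8
    return chips, depth_mwords, total_mbytes

for width in (4, 8, 16):
    chips, depth, cap = module_config(64, width, 64)
    print(f"x{width:<2} organization: {chips:2d} chips, {depth:2d} Mwords deep, {cap:3d} Mbytes")
```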
The vendor proprietary DDR SDRAM variants began appearing in 1997.
Initial main memory devices operating at 133 MHz deliver a burst bandwidth
of 2.1 Gbytes/sec across a 64-bit data bus. The Direct Rambus DRAM
(DRDRAM) and synchronous link DRAM (SLDRAM) are two examples of
next-generation DRAM architectural developments to address the speed
requirements of latest-generation high-performance processors. Both these
architectures employ packet command protocols, which combine the separate
command and address pins of previous memory interfaces into command
bursts [4]. This approach reduces the number of pins required for addressing
and control, as well as facilitates the pipelining of requests to memory. Direct
RDRAM and SLDRAM transfer commands and data on both edges of the
clock.
Rambus Inc. has developed the Rambus architecture in conjunction with
various partners, and although it does not manufacture or market the chips,
it has licensed the controller interface cell and the memory design to companies
such as Hitachi Inc., LG Semiconductor, NEC, Oki Semiconductor, Samsung
Electronics Corp., and Toshiba Semiconductor Co. The companies that have
signed on to produce DRDRAMs include Fujitsu Corp., Hyundai Inc., IBM
Corp., Infineon Technologies, Texas Instruments, Micron Technology, and
Mitsubishi Electric Corp. The original RDRAMs had a latency of several
hundred nanoseconds, which affected their performance. The second-generation
implementation, the concurrent RDRAM, has been optimized for use in main
memory systems. For example, the concurrent RDRAMs, available in 8-bit- or
9-bit-wide versions, have either 16/18- or 64/72-Mb capacities and can burst
unlimited-length data strings at 600 MHz, such that the sustained bandwidth
for 32-byte transfers (e.g., for a cache-line fill) can be 426 Mbytes/s. The
DRDRAM interface consists of a 16- or 18-bit datapath and an
8-bit control bus, with the interface able to operate at clock rates up to 800
MHz (rising and falling edges of a 400-MHz clock). These DRDRAMs are
available in densities such as 32 Mb for graphics design applications and
64/128 Mb for main memory applications.
The SLDRAM specification defines its first-generation interface as a 16- or
18-bit-wide bus supporting up to 8 loads and operating at 400 Mbps/pin with
a 200-MHz clock; using buffered modules, it can support up to 64 loads.
The Direct RDRAM has a 16- or 18-bit-wide data bus, but it can support up
to 32 loads and operate at 800 Mbps/pin, using a 400-MHz clock, twice the
speed of the first-generation SLDRAMs. The RDRAMs and SLDRAM archi-
tectures will be discussed in more detail in Sections 4.3 and 4.4, respectively.
Table 4.1 compares the significant features and characteristics of these high-
performance DRAM architectures (SDRAM, DDR SDRAM, Direct RDRAM, and SLDRAM)
at the 64-Mb level [3].
For several years, the graphics memory architectures were designed around
the video DRAM, which is basically a dual-ported DRAM that allowed
independent writes and reads to the RAM from either port [5]. The host port
was a standard random-access port, while the graphics port was optimized for
bursting data to the graphics subsystem through a pair of small parallel-to-
byte-serial shift registers. However, the extra area on the chip required by the
shift registers and control circuits increased VRAM manufacturing costs.
Additional examples of graphics-optimized memories include a specialty
triple-port DRAM developed by NEC, the multibank DRAM (MDRAM) developed by
Mosys Corp., the cache DRAMs (CDRAMs), and the Window RAM developed by
Samsung Electronics Corp.
The MDRAM has also been designed into some graphic subsystems. An
MDRAM is basically an array of many independent 256-kbit (32-kbyte)
DRAMs, each with a 32-bit interface, connected to a common internal bus. The
external 32-bit bus is a buffered extension to the internal bus. The independent
bank architecture facilitates overlapping, or "hiding," the row-address-strobe
access and precharge penalties, so that the average access time approaches the
column-address-strobe access time.
Some high-performance graphics workstation vendors have designed their
own graphic memory architectures to meet their specific requirements. In the
One of the first steps in designing a graphics system is determining the frame
buffer size (or sizes) and internal (nondisplayed) resolution. The next step is to
use that information in configuring the memory for optimum system perform-
ance.
The VRAM was developed to increase the bandwidth of raster graphics
display frame buffers. If a DRAM is used as a frame buffer, it must be accessible
by both the host/graphics controller and the CRT refresh circuitry. The raster
graphics display requires that a constant, uninterrupted flow of pixel data be
available in the CRT drive circuitry. This requires that the host or graphics
Figure 4.1 Block diagram for architecture of a standard 4-Mb VRAM. (From
reference 8, with permission of IBM Corp.)
chip size of the full-depth SAM is larger than that of the half-depth SAM.
Figure 4.1 shows the architecture of a standard 4-Mb VRAM [8].
A full-depth SAM is a 512 x 16 serial buffer built into a 4-Mb VRAM. The
buffer is used for serial read/write, and in the full transfer mode a full word
line (512 x 16) is transferred to the SAM. In most applications, the serial port
is always being read. Therefore, the transfer has to be synchronized to the last
read operation from the SAM, which creates timing problems. To avoid these, a
split register transfer is preferred, so that in the split register transfer
mode, half of the word line (256 x 16), or half-row, is transferred to its
respective half of the SAM while the other half of the SAM is being read. This
helps avoid the possibility of overlapping a read from the SAM while data are
being transferred from the DRAM array to the SAM register or buffer. For
high-end graphics applications, it is desirable to read/write part of the SAM.
Many designers prefer to stop reading at some boundary and jump to another
address in the SAM.
A half-depth SAM is a 256 x 16 serial register built into the 4-Mb VRAM.
This buffer is used to provide serial read/write operations. When used in the
full transfer mode, a half word line (256 x 16), either lower or upper, is
transferred to the SAM (256 x 16). In the split register transfer mode, only
one-quarter of the word line (128 x 16) or row is transferred to either the
lower or upper half of the SAM that is not being written/read. A mode that
allows the designer to jump to the other half without serially clocking through
the
midpoint (127/128) is called the serial register stop (SRS) mode. A CBRS (CE
before RE refresh with mode SET) cycle is initiated to put the VRAM in the
SRS mode. A half-depth SAM part is considered compatible with the full-depth
SAM part if the replacement of the half-depth SAM with the full-depth SAM
does not affect the system operation.
The VRAMs have a number of features that are specifically designed to
enhance performance and flexibility in the graphics applications, such as the
block write, write-per-bit, flash write, mask register, and color register. All of
these options work with the DRAM portion of the VRAM and are used to
efficiently update screen data stored in the DRAM. These features are briefly
described below.
• Block Write This feature can be used to write the contents of the color
register into eight consecutive column locations in the DRAM in one
operation. The masking feature allows precise selection of the memory
locations that get the color data. This option is useful for quickly filling
large areas, such as polygons, with a single color during real-time
imaging applications (a brief sketch of this behavior follows this list).
• Write-Per-Bit The write-per-bit is a temporary masking option used to
mask specific inputs during the write operations. When used in conjunc-
tion with the data mask in the mask register, the write-per-bit feature
allows selection of the memory locations that need to be written.
• Flash Write Flash write clears large portions of the DRAM quickly.
Each time the flash write option is selected, an entire row of data in the
DRAM is cleared.
• Mask Register The mask register stores mask data that can be used to
prevent certain memory locations from being written. This feature is
generally used with the block write option and can be used during the
normal writes. The bits that are masked (mask data = 0) retain their old
data, while the unmasked bits are overwritten with the new data.
• Color Register The color register stores the data for one or more screen
colors. These data are then written to memory locations in the DRAM
corresponding to the portions of the screen that will use the stored color.
The major function of the color register is to rapidly store the color data
associated with large areas of a single color, such as a filled polygon.
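As promised above, here is a toy model (not vendor code) of the block-write behavior: the color register is copied into eight consecutive column locations, and mask bits select which of those locations actually take the new color.

```python
# Toy model of block write with masking (illustrative only).
def block_write(row, start_col, color, column_mask):
    """Write 'color' into 8 consecutive columns; a mask bit of 0 preserves old data."""
    for i in range(8):
        if (column_mask >> i) & 1:       # unmasked column: overwrite with color data
            row[start_col + i] = color
        # masked column (bit = 0): old contents are retained

frame_row = [0x00] * 32                  # one row of a toy frame buffer
block_write(frame_row, start_col=8, color=0x3F, column_mask=0b11011101)
print(frame_row[8:16])                   # columns 9 and 13 keep their old value
```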
The SGRAMs are also very similar to the SDRAMs, except that they have
several additional functions to improve their effectiveness in graphics systems
designs. Both the block-write and write-per-bit functions have been added to
make the reading and writing operations faster and more efficient. As in
SDRAMs, all input signals are registered on the positive edge of the clock, and
data can be written or read in bursts of 1, 2, 4, or 8 words or a full page [5].
The SGRAMs have many programmable features that require system
configuration during both the initialization and graphics operations. A small
command interpreter on the chip allows the burst length, the column-address-
strobe latency, the write-per-bit modes, 8-column block write, and the color
register to be set up to the desired initial values and altered when the system
conditions change.
Examples of the first-generation SGRAM offerings are 8-Mb devices,
organized as 256-Kword x 32-bit, so that two of these chips can form a
2-Mbyte frame buffer for a 64-bit graphics controller. The 8-Mb SGRAMs are
available with data clock speeds of 83 MHz, 100 MHz, 125 MHz, or even
higher. The second-generation devices are 16-Mb SGRAMs that can double
the word depth, allowing a 4-Mbyte buffer to be built with just two chips. The
improved versions of these 16-Mb SGRAMs include devices with higher clock
speeds or DDR transfers. Therefore, a 16-Mb SGRAM with DDR capability,
along with the 100-MHz clock, will transfer data at 200 MHz, allowing
graphics bandwidth performance of up to 800 Mbytes/s (peak). An example is a
150-MHz DDR SGRAM developed by IBM that can deliver 300 Mb/s on each pin and
a peak data rate of 1.2 Gbytes/s over its 32-bit bus.
The DDR memory performs I/O transactions on both the rising and falling
edges of the clock cycle. The DDR SGRAM uses a bidirectional data strobe
(DQS) that travels in parallel with the DQs (multiplexed data I/O) and serves
in the system as a reference signal for capturing the corresponding DQs. A
benefit of using
DQS is to eliminate the clock skew and timing variation effects between the
memory and controller during the high-speed data transfer at each pin. In
addition, the skew between the input clocks of the memory and controller can
be ignored because the DQS synchronizes both data input and output at both
of its edges [10].
A major advantage of DDR usage in 3-D graphics applications is that it
doubles the memory bandwidth. For example, two x 32 DDR SGRAMs
running at a 200-MHz clock frequency offer a peak data throughput of 3.2
Gbytes/s for a 64-bit bus and 6.4 Gbytes/s for a 128-bit memory interface. For
a 64-bit (8-byte) bus, the peak rate is calculated as 8 x 200 x 10^6 x 2 (both
clock edges) = 3.2 Gbytes/s. Similarly, for a 128-bit bus (16 bytes), the peak
rate is 16 x 200 x 10^6 x 2 = 6.4 Gbytes/s.
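The same peak-rate arithmetic, spelled out as a sketch. It uses only the values quoted in the text: a 200-MHz clock and transfers on both clock edges.

```python
# Peak DDR bandwidth from bus width and clock rate.
def ddr_peak_gbytes_per_s(bus_bits, clock_mhz):
    bus_bytes = bus_bits // 8
    return bus_bytes * clock_mhz * 1e6 * 2 / 1e9   # x2: data on both clock edges

print(ddr_peak_gbytes_per_s(64, 200))    # 3.2 GB/s for a 64-bit bus
print(ddr_peak_gbytes_per_s(128, 200))   # 6.4 GB/s for a 128-bit interface
```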
Section 4.2.1 provides an example of a 64-Mb DDR SGRAM supplied by
Samsung Electronics.
Figure 4.2 Block diagram of 64-Mb DDR SGRAM organized as 512 Kbit x 32
I/O x 4 Bank. (From reference 11, with permission of Samsung Electronics.)
Mode Register Set (MRS) The mode register stores the data for control of
the various operating modes of the DDR SGRAM. It programs the CAS latency,
addressing mode, burst length, test mode, and other vendor-specific options to
make the device useful for a variety of different applications. To operate the
DDR SGRAM, the mode register must be written after power-up because its
default value is not defined. The mode register is written by asserting low on
CS, RAS, CAS, and WE. The state of the address pins A0-A10 and BA0, BA1
in the same cycle as CS, RAS, CAS, and WE going low is written into the mode
register. One clock cycle is required to complete the write operation to the
mode register. The mode register contents can be changed during operation,
using the same command and clock-cycle requirements, as long as all banks are
in the idle state. The mode register is divided into various fields depending
on functionality. The burst length uses A0-A2, the addressing mode uses A3,
and the CAS latency (read latency from the column address) uses A4-A6. A7 is
used for the test mode. Pins A7, A8, BA0, and BA1 must be set low for normal
DDR SGRAM operation. Table 4.2 shows the specific codes for the various burst
lengths, addressing modes, CAS latencies, and the MRS cycle [11].
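A minimal sketch of how the mode-register fields just described map onto the address pins. The field positions (A0-A2, A3, A4-A6, A7) come from the text; the numeric codes themselves are placeholders, since the actual encodings are those listed in Table 4.2 and the device data sheet.

```python
# Packing the mode-register fields onto the address pins (positions per the text).
def mrs_word(burst_len_code, burst_type_code, cas_latency_code, test_mode=0):
    assert 0 <= burst_len_code <= 0b111          # A0-A2
    assert burst_type_code in (0, 1)             # A3: 0 = sequential, 1 = interleave
    assert 0 <= cas_latency_code <= 0b111        # A4-A6
    return (burst_len_code
            | burst_type_code << 3
            | cas_latency_code << 4
            | test_mode << 7)                    # A7 = 0 for normal operation

# Placeholder codes; the real encodings are given in Table 4.2 / the data sheet.
print(bin(mrs_word(burst_len_code=0b010, burst_type_code=0, cas_latency_code=0b011)))
```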
TABLE 4.2 Mode Register Set (MRS)-Specific Codes for Various Burst Lengths,
Addressing Modes, CAS Latencies, and MRS Cycle
Define Special Function (DSF) The DSF pin controls the graphics functions of
the SGRAM. If the DSF pin is tied low, the SGRAM functions like an SDRAM. The
SGRAM can be used as a unified memory through the appropriate DSF command.
All of the graphics function modes can be entered only by setting DSF high
when issuing commands that would otherwise be normal SDRAM commands.
Special Mode Register Set (SMRS) There is a special mode register in the
DDR SGRAM called the color register. When A6 and DSF go high in the same
cycle as CS, RAS, CAS, and WE going low, the load color register (LCR)
operation is executed and the color register is filled with the color data for
the associated DQs through the DQ pins. At the next clock after LCR, a new
command can be issued. Unlike the MRS command, the SMRS command can be
issued in the active state, provided that the DQs are idle.
Figure 4.3 Two 64-Mb DDR SGRAM timing diagrams. (a) Burst read operation.
(b) Burst write operation. (From reference 11, with permission of Samsung
Electronics.)
Burst Read Operation The address inputs (A0-A7) determine the starting
column address for the burst operation. The mode register sets the type of
burst (sequential or interleaved) and the burst length (2, 4, 8, or full page).
The first output data are available after the CAS latency from the READ
command, and the consecutive data are presented on the falling and rising
edges of the data strobe adopted by the DDR SGRAM until the burst length is
completed. Figure 4.3a shows the timing diagram of a burst read operation [11].
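A toy timing model of the burst read just described: the first beat appears a CAS latency after the READ command and each later beat follows on the next data-strobe edge, half a clock apart in DDR. The numbers used are illustrative only.

```python
# When each beat of a DDR burst read appears, relative to the READ command.
def read_beat_times_ns(cas_latency, burst_length, clock_ns):
    return [(cas_latency + i / 2.0) * clock_ns for i in range(burst_length)]

print(read_beat_times_ns(cas_latency=3, burst_length=4, clock_ns=10.0))
# -> [30.0, 35.0, 40.0, 45.0]  ns after READ, with a 100-MHz clock
```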
Burst Write Operation The burst write command is issued by having CS, CAS,
and WE low while holding RAS high at the rising edge of the clock. The
address inputs determine the starting column address. There is no real latency
required for the burst write cycle. The first data for the burst write cycle
must be applied at the first rising edge of the data strobe enabled after tDQSS
from the rising edge of the clock on which the write command is issued. The
remaining data inputs must be supplied on each subsequent falling and rising
edge of the data strobe until the burst length is completed. When the burst
operation is completed, any additional data supplied to the DQ pins will be
ignored. Figure 4.3b shows the timing diagram of a burst write operation.
Burst Interrupt Operation These are the various burst interruption modes:
Burst Stop Command The burst stop command is initiated by having RAS
and CAS high with CS and WE low at the rising edge of the clock only. The
burst stop command has the fewest restrictions, which makes it the easiest
method to use when terminating a burst operation before it has been com-
pleted. When the burst stop command is issued during a burst read cycle, both
the data and DQS (data strobe) go to a high-impedance state after a delay that
is equal to the CAS latency set in the mode register. However, the burst stop
command is not supported during a write burst operation.
Data Mask (DM) Function The DDR SGRAM has a data mask function
that can be used in conjunction with the data write cycle only (and not read
cycle). When the data mask is activated (DM high) during the write operation,
the write data are masked immediately (DM to data-mask latency is zero).
Power Down Mode The power down mode is entered when CKE is low and
is exited when CKE is high. Once the power down mode is initiated, all of the
receiver circuits except CK and CKE are gated-off to reduce power consump-
tion. During the power down mode, refresh operations cannot be performed;
therefore, the device cannot remain in the power down mode longer than the
refresh period (tREF) of the device.
Clock (CLK, /CLK) The FCRAM adopts a differential clock scheme in which
CLK is the master clock and its rising edge is used to latch all command and
address inputs; /CLK is the complementary clock input. An internal delay-locked
loop (DLL) circuit tracks the crossing point of CLK and /CLK and generates
the clock delay used for output buffer control in read mode. This DLL circuit
requires some lock-on time to generate a stable delay.
Chip Select (CS) and Function Select (FN) Unlike a regular SDRAM's
command input signals, the FCRAM has only two control signals: (1) CS and
(2) FN. Each operation is determined by two consecutive command inputs.
Bank Address (BA0, BA1) The FCRAM has four internal banks, and bank
selection by BA occurs at the read (RDA) or write (WRA) command.
Figure 4.4 Block diagram of a 256-Mb DDR FCRAM, 16-bit format. (From
reference 12, with permission of Fujitsu Semiconductor.)
Data Strobe (DQS) DQS is a bidirectional signal used as the data strobe.
During a read operation, DQS provides the read data strobe signal that is
intended to be used as the input data strobe at the receiver circuit of the
controller(s). It turns low before the first data come out, and it toggles high
to low or low to high until the end of the burst read. The CAS latency is
specified to the first low-to-high transition of this DQS output. During a
write operation, DQS is used to latch the corresponding byte of write signals.
In the write data strobe operation, the first rising edge of the DQS input
latches the first input data and the following falling edge of the DQS signal
latches the second input data. This sequence continues until the end of the
burst count.
Data Inputs and Outputs (DQn) Input data are latched by the DQS input
signal and written into memory at the clock following the write command
input. Output data are obtained, together with the DQS output signals, at the
programmed read CAS latency.
Read (RDA) and Lower Address Latch (LAL) The FCRAM adopts a
two-consecutive-command-input scheme. The read or write operation is
determined at the first RDA or WRA command input from the standby state of
the bank to be accessed (see the state diagram, Figure 4.5). The read mode is
entered when the RDA command is asserted with the bank address and upper
address inputs, and the LAL command with the lower address input must follow
at the next clock. The output data are then valid after the programmed CAS
latency (CL) from the LAL command until the end of the burst. The read mode
is automatically exited after the random cycle latency.
Write (WRA) and Lower Address Latch (LAL) The write mode is entered and
exited in the same manner as the read mode. The input data store starts at
the rising edge of the DQS input from CL-1 until the end of the burst count.
The write operation has an "on-the-fly" variable write length (VW) feature at
every LAL command input following a WRA command. Unlike the data mask (DM)
of a regular DDR SDRAM, VW does not provide random data-mask capability;
instead, VW controls the burst counter for the write burst, and the burst
length is set by a combination of two control addresses, VW0 and VW1, and the
programmed burst-length condition. The data in masked address locations
remain unchanged.
Burst Mode Operation and Burst Type The burst mode provides faster memory
access, and the read and write operations are burst oriented. The burst mode
is implemented by keeping the same addresses and automatically strobing the
least-significant address bits on every clock edge until the programmed burst
length (BL) is reached. The access time from clock in burst mode is specified
as tAC. The internal lower-address counter operation is determined by a mode
register, which defines the burst type (BT) and a burst count length (BL) of
2 or 4, with the corresponding address boundary. The burst type can be
selected as either sequential or interleave mode.
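The two burst orderings named above can be sketched as follows, using the conventional SDRAM-style sequential and interleave definitions; the FCRAM data sheet remains the authority for the exact sequences.

```python
# Sequential vs. interleave burst ordering (conventional SDRAM-style definition).
def burst_order(start_col, burst_len, interleave=False):
    base = start_col & ~(burst_len - 1)           # burst-length-aligned boundary
    offset = start_col & (burst_len - 1)
    if interleave:
        return [base | (offset ^ i) for i in range(burst_len)]
    return [base | ((offset + i) & (burst_len - 1)) for i in range(burst_len)]

print(burst_order(start_col=5, burst_len=4))                   # [5, 6, 7, 4]
print(burst_order(start_col=5, burst_len=4, interleave=True))  # [5, 4, 7, 6]
```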
Figure 4.5 A 256-Mb FCRAM state diagram for single-bank operation. (From
reference 12, with permission of Fujitsu Semiconductor.)
Mode Register Set (MRS) The mode register provides a variety of different
operations and can be programmed by an MRS command following an RDA command
input if all banks are in a standby state. The read operation initiated by the
RDA command is canceled if the MRS command is asserted at the next clock after
the RDA command instead of the LAL command required for a read operation. The
FCRAM has two registers: (1) standard mode and (2) extended mode. The
standard mode register has four operation fields: (1) burst length, (2) burst
type, (3) CAS latency, and (4) test mode (this test mode must not be used). The
extended mode register has two fields: (1) DLL enable and (2) output driver
strength. These two registers are selected by BA0 at MRS command entry, and
each field is likewise set by the address lines at the MRS command. Once these
fields are programmed, the contents are held until reprogrammed by another
MRS command (or the part loses power). The MRS command should be issued only
when all banks are in the idle state and all outputs are in the
high-impedance state.
The Rambus architecture has three main elements: (1) Rambus interface, (2)
Rambus channel, and (3) RDRAM. Figure 4.6a shows the block diagram of
the Rambus architecture and its three main elements [13].
The Rambus interface is implemented on both the memory controller and
the RDRAM devices on the channel. The Rambus channel incorporates (a) a
system level specification that can allow the system using Rambus channel(s)
to operate at a full rated speed and (b) a capability of transferring data at rates
Figure 4.6 Rambus architecture showing (a) three major elements and (b) memory
controller and RDRAM connections to resistor-terminated transmission lines. (From
reference 13, with permission of Rambus, Inc.)
Figure 4.7 Rambus memory. (a) Read operation. (b) Write operation. (From reference
13, with permission of Rambus, Inc.)
pins followed by a read command sent across the COL pins. After the data are
read, a precharge command is executed to prepare that bank for another
completely random read. The data are always returned in a fixed, but user
selectable, number of clock cycles from the end of the read command. A write
transaction timing is similar to the read operation since the control packets are
sent the same way. Figure 4.7b shows a typical write operation.
One significant difference between an RDRAM and a conventional DRAM
is that the write data are delayed to match the timing of a read transaction's
data transfer. A write command on the COL bus tells the RDRAM that the
data will be written to the device in a precisely specified number of clock cycles
later. Normally, this data would then be written into the core as soon as the
data are received. Each of the commands on the control bus can be pipelined
for higher throughput.
The RDRAM's internal core supports a 128-/144-bit-wide data path operat-
ing at 100 MHz, which is one-eighth the clock rate of the channel. Thus, every
10 ns, 16 bytes can be transferred to or from the core. The RDRAMs have
separate data and control buses. The data bus permits data transfer rates up
to 800 MHz, providing a 1.6-Gbyte/s data transfer rate on either a x 16 or x 18
bus configuration. The control bus adds another 800 Mb/s of control informa-
tion to the RDRAM. The control bus is further separated into ROWand COL
pins, allowing concurrent row and column operations while the data are being
transferred from a previous command.
The Rambus architecture allows up to 1-Gbit DRAM densities, up to 32
RDRAMs per channel, and enough flexibility in the row, column, and bits to
allow for various configurations in these densities. The 64-/72-Mb RDRAMs
can support either 8 independent or "16" doubled banks. In a doubled-bank
core, the number of sense amplifiers required is reduced to nearly half while
keeping the total number of banks relatively high compared to other DRAM
alternatives. The larger number of banks helps prevent interference between
memory requests. The number of banks accessible to the controller is the
cumulative number of banks across all the RDRAMs on the channel. However,
the restriction imposed by doubled banks is that adjacent banks cannot be
activated simultaneously. Once a bank is activated, that bank must be
precharged in order for an adjacent bank to be activated.
For low-power system operation, the RDRAMs have several operating
modes, as follows: Active, Standby, Nap, and PowerDown. These four modes
are distinguished by (a) their respective power consumption and (b) the time
taken by the RDRAM to execute a transaction from that mode. An RDRAM
automatically transitions to a standby mode at the end of a transaction. In a
subsystem, when all the RDRAMs are in a standby mode, the RDRAM's logic
for row addresses is always monitoring the arrival of row packets. If an
RDRAM decodes a row packet and recognizes its address, that RDRAM will
transition to the active state to execute the read or write operation and will
then return to standby mode once the transaction is completed. Power
Pin Descriptions
• Control Registers The SCK, CMD, SIO0, and SIO1 pins (shown in the
upper center of Figure 4.8) are used to write and read a block of control
registers, which supply the RDRAM configuration information to a
controller and select the operating modes of the device. The 9-bit REFR
value is used for tracking the last refreshed row, and the 5-bit DEVID
register specifies the device address of the RDRAM on the channel.
• Clocking The CTM and CTMN pins (Clock-to-Master) generate TCLK
(Transmit Clock), the internal clock used to transmit the read data. The
CFM and CFMN pins (Clock-from-Master) generate RCLK (Receive
Clock), the internal clock signal used to receive the write data and to
receive the ROW and COL pins.
• DQA, DQB Pins These 18 pins carry read (Q) and write (D) data across
the channel. They are multiplexed or demultiplexed from (or to) two
72-bit data paths that are running at one-eighth the data frequency, inside
the RDRAM.
Figure 4.8 Block diagram of 128-/144-Mbit Direct RDRAM. (From reference 14, with
permission of Rambus, Inc.)
• Sense Amplifiers Each sense amplifier consists of 512 bytes of fast storage
(256 for DQA and 256 for DQB) and can hold
one-half of one row of one bank of the RDRAM. The sense amplifier may
hold any of the 512 half-rows of an associated bank. However, each sense
amplifier is shared between two adjacent banks of the RDRAM (except
for number 0, 15, 30, and 31), which introduces the restriction that
adjacent banks may not be simultaneously accessed.
• RQ Pins These pins carry the control and address information, and they
are divided into two groups. One group of pins (RQ7, RQ6, RQ5, which
are also called ROW2, ROW1, ROW0) is used primarily for controlling
the row accesses. The second group of pins (RQ4, RQ3, RQ2, RQ1, RQ0,
which are also called COL4, COL3, COL2, COL1, COL0) is used
primarily for controlling the column accesses.
• ROW Pins The main function of these three pins is to manage the
transfer of data between the banks and the sense amplifiers of the
RDRAM. These pins are demultiplexed into a 24-bit ROWA (row
activate) or ROWR (row operation) packet.
• COL Pins The main function of these five pins is to manage the transfer
of data between the DQA/DQB pins and the sense amplifiers of the
RDRAM. These pins are demultiplexed into a 23-bit COLC (column
operation) packet and either a 17-bit COLM (mask) packet or a 17-bit
COLX (extended operation) packet.
• PREC Precharge The PREC, RDA, and WRA commands are similar to
the NOCOP, RD, and WR, except that a precharge operation is per-
formed at the end of the column operation. These commands provide a
second mechanism for performing the precharge operation.
• PREX Precharge After an RD command, or after a WR command with
no byte masking (M = 0), a COLX packet may be used to specify an
extended operation (XOP). The most important XOP command is PREX,
which provides a third mechanism for performing a precharge operation.
Packet Formats Figure 4.9 shows the format of the ROWA and ROWR
packets on the ROW pins [14]. Table 4.3a describes the fields that comprise
these row packets [14]. For example, DR4T and DR4F bits are encoded to
contain both the DR4 device address bit and a framing bit, which allows the
ROWA or ROWR packet to be recognized by the RDRAM. The AV
(ROWA/ROWR packet selection) bit distinguishes between the two packet
types. Both the ROWA and ROWR packet provide a 5-bit device address and
a 5-bit bank address. A ROWA packet uses the remaining bits to specify a 9-bit
row address, and the ROWR packet uses the remaining bits for an 11-bit
opcode field.
Figure 4.9 also shows the formats of the COLC, COLM, and COLX packets on
the COL pins. Table 4.3b describes the fields that comprise these column
packets. The COLC packet uses the S (start) bit for framing. A COLM or
COLX packet is aligned with this COLC packet, and it is also framed by the
S bit. The 23-bit COLC packet has a 5-bit device address, a 5-bit bank address,
a 6-bit column address, and a 4-bit opcode. The COLC packet specifies a read
or a write command, as well as some power management commands.
The remaining 17 bits are interpreted as a COLM (M = 1) or COLX
(M = 0) packet. A COLM packet is used for a COLC write command, which
needs bytemask control. A COLX packet may be used to specify an indepen-
dent precharge command. It contains a 5-bit device address, a 5-bit bank
address, and a 5-bit opcode. The COLX packet may also be used to specify
some housekeeping and power management commands. The COLX packet is
framed within a COLC packet but is not otherwise associated with any other
packet.
A row cycle begins with the activate (ACT) operation. The activation
process is destructive, that is, the act of sensing the value of a bit in a bank's
storage cell transfers the bit to the sense amplifier, but leaves the original bit
in the storage cell with an incorrect value. Because the activation process is
destructive, a hidden operation called restore is automatically performed. The
restore operation rewrites the bits in the sense amplifier back into the storage
cells of the activated row of the bank. While the restore operation takes place,
the sense amplifier may be read (RD) and written (WR) using the column
operations. If new data are written into the sense amplifier, they are
automatically forwarded to the storage cells of the bank, so that the data in
the activated row
and the data in the sense amplifier remain identical. When both the restore and
the column operations are completed, the sense amplifier and bank are
precharged (PRE). This leaves them in the proper state to begin another
activate operation.
Figure 4.9 Direct RDRAM 128-/144-Mbit row and column packet formats. (From
reference 14, with permission of Rambus, Inc.)
TABLE 4.3 Direct RDRAM Field Descriptions for (a) ROWA and ROWR Packets
and (b) COLC, COLM, and COLX Packets
Figure 4.10 Direct RDRAM example. (a) Read transaction. (b) Write transaction.
(From reference 14, with permission of Rambus, Inc.)
The packets on the ROW and COL pins use the end of the packet as a timing
reference point, while the packets on the DQA/DQB pins use the beginning of
the packet as a timing reference point.
A time tCC after the first COLC packet on the COL pins, a second COLC packet
is issued, which contains an RD a2 command. The a2 address has the same device
and bank address as the a1 address (and a0 address), but a different column
address. A time tCAC after the second RD command, a second read-data dualoct
Q(a2) is returned by the device. Next, a PRER is issued in a ROWR packet
on the ROW pins, which causes the bank to precharge so that a different row
may be activated in a subsequent transaction or so that an adjacent bank may
be activated. The a3 address includes the same device and bank address as the
a0, a1, and a2 addresses. The PRER command must occur a time tRAS or
more after the ACT command and a time tRDP or more after the last RD
command. This transaction example reads two dualocts, but there is actually
time to read three dualocts before tRDP becomes the limiting parameter rather
than tRAS.
Finally, an ACT b0 command is issued in a ROWR packet on the ROW
pins. The second ACT command must occur a time tRC or more after the first
ACT command and a time tRP or more after the PRER command, to ensure
that the bank and its associated sense amplifiers are precharged. This example
(for both the read and write transactions) assumes that the second transaction
has the same device and bank address as the first transaction, but a different
row address. Transaction b may not be started until transaction a has been
completed. However, transactions to other banks or devices may be issued
during transaction a.
The interleaved read transactions are similar to the example shown in
Figure 4.10a, except that they are directed to the nonadjacent banks of a
single RDRAM and the DQ data pin efficiency is 100%.
the first ACT command, as well as at a time tRP or more after the PRER
command.
The process of writing a dualoct into the sense amp of an RDRAM bank
occurs in two steps: (1) The write command, write address, and write data are
transported into the write buffer, and (2) the RDRAM automatically retires the
write buffer, with an optional bytemask, into the sense amplifier. This two-step
write process reduces the natural turn-around delay due to the internal
bidirectional data pins. The interleaved write transactions are similar to the
one shown in Figure 4.10b, except that they are directed to the nonadjacent
banks of a single RDRAM. This allows a new transaction to be issued once
every tRR interval rather than once every tRC interval, and the DQ data pin
efficiency is 100% with this sequence.
The Direct Rambus clock generator (DRCG) provides the channel clock
signals for the Rambus memory subsystem and includes signals for synchron-
ization of the Rambus channel clock to an external system clock. On the logic
side, the Rambus interface consists of two components: the Rambus ASIC cell
(RAC) and the Rambus memory controller (RMC). The RAC physically
connects through the package pins to the Rambus channel and is a library
macrocell implemented in ASIC design to interface the core logic of the ASIC
device to a high-speed Rambus channel. The RAC typically resides in a portion
of the ASIC's I/O pad ring and converts the high-speed Rambus signal level
(RSL) on the Rambus channel into lower-speed CMOS levels usable for the
ASIC design. The RAC functions as a high-performance parallel-to-serial and
serial-to-parallel converter, performing the packing and unpacking of
high-frequency data packets into the wider and synchronous 128-bit (Rambus)
data words.
The RAC consists of two delay-locked loops (DLLs), input and output (I/O)
driver cells, input and output shift registers, and multiplexers. The two DLLs
provided are a transmit DLL and a receive DLL. The transmit DLL ensures
that the written commands and data are transmitted in precise 180-degree
phase quadrature with an associated Clock from Master (CFM) clock. The
receive DLL ensures that a proper phase is retained between the incoming read
data and its associated Clock to Master (CTM) clock.
Figure 4.11 shows the block diagram of a RAC cell [14]. The external
interface, which consists of the RSL high-speed channel, is referred to as the
Rambus Channel Interface, while the internal lower speed CMOS level signals
are referred to as the Application Port Interface. A typical Rambus channel
can deliver two bytes of data every 1.25 ns, which is seen as 16 bytes of data
every 10 ns on the Application Port Interface. These data are referenced to
the SynClk.
Figure 4.11 Direct Rambus ASIC (RAC) block diagram. (From reference 15, with
permission of Rambus, Inc.)
that a system can continue to operate even if an entire DRAM device fails. This
capability is known in the industry as "chipkill" [15]. The Hamming error
correction code (ECC) scheme has been widely used and involves attaching a
number of checksum or syndrome bits along with the corresponding data, as
it is being transmitted (or written to the memory). On the receiving side, the
controller again generates syndrome bits based on the received data pattern
and compares it against the syndrome bits stored from the write operation, and
the comparison can correct single-bit errors and detect double-bit errors.
Therefore, this scheme is called single-bit error correction, double-bit error
detection (see Semiconductor Memories, Chapter 5.6).
In most ECC-based systems, 64 data bits are used along with 8 additional
syndrome bits, resulting in a total word size of 72 bits. The Rambus DRAM
supports the ECC approach using the x 18 organization, which operates as a
144-bit datapath, 128 bits of which can be used as data, while the remaining
16 bits can be used for syndrome (9 are needed for ECC) or other functions.
This can only be effective for double-bit error detection and single-bit error
correction, and multiple errors contained in the data word are not correctable
using this technique. For chipkill protection, architectural partitioning of the
memory array is used along with an ECC coding technique, which spreads the
data word across many DRAMs such that any individual DRAM contributes
only one bit. The major drawback of this approach is that the system requires
a minimum of 72 DRAMs (using x 1 DRAMs) in the case of the 72-bit ECC
word. Using the x 4 DRAM configuration requires increasing the ECC word
size and number of ECC checkers by a corresponding multiple to a total of
288 data bits and four ECC checkers.
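A back-of-the-envelope check of the ECC sizing discussed above: a Hamming SEC-DED code needs r check bits such that 2^(r-1) >= data bits + r, which reproduces the familiar 64 + 8 = 72-bit word and the 9 bits cited for a 128-bit datapath.

```python
# Minimum check bits for a Hamming SEC-DED code over a given data width.
def secded_check_bits(data_bits):
    r = 1
    while 2 ** (r - 1) < data_bits + r:
        r += 1
    return r

print(secded_check_bits(64))    # 8  -> the 72-bit ECC word
print(secded_check_bits(128))   # 9  -> as cited for the x18 (144-bit) organization
```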
2. Electrical Signals. The SyncLink interface specifies signals used to
communicate between the controller and one or more SLDRAMs, and it
references other standards for detailed signal levels and timing
characteristics.
3. Physical Packaging. The SyncLink interface does not specify physical
packaging requirements, because other standards groups (e.g., JEDEC)
are expected to define the physical packaging standards based on market
requirements.
Figure 4.13 SyncLink memory. (a) Multiple sub-RAMs or blocks. (b) A typical small
memory subsystem design. (From reference 19, with permission of IEEE.)
essentially like independent RAMs. The banks contain rows, and rows contain
columns. A row is the amount of data read or written to one of the chip's
internal storage arrays. Columns are subsets of rows that are read or written
in individual read or write operations, as seen by the chip interface. For
example, if the datapath to the chip is 16 bits wide at the package level, each
16-bit subset of the current row is connected to the I/O pins as a column access
within that row. A typical data transfer in SyncLink concatenates four 16-bit
columns to make a data packet.
Therefore, accessing the columns within the same row is faster than
accessing another row, saving the row access time required to bring the row of
data from the actual RAM storage cells. The multiple banks within each
sub-RAM can provide an additional level of parallelism. In summary, a bank
corresponds to a row that may be held ready for multiple accesses; a sub-RAM
corresponds to one or more banks sharing one timing controller that can
perform only one operation at a time; a RAM corresponds to multiple
sub-RAMs that can access data concurrently but share initialization and
addressing facilities, as well as the package pins and some internal datapaths.
Multiple RAMs sharing one controller comprise a memory subsystem.
In the SyncLink configuration, two shared links (buses), a unidirectional
commandLink and a bidirectional dataLink, are used to connect the controller
to multiple slaves (typically SLDRAM chips). SyncLink uses shared-link
(bused) communication to achieve a simple high-bandwidth data transfer path
between a memory controller and one or more memory slaves (up to 64
SLDRAMs). The use of just one controller on each SyncLink subsystem
simplifies the initialization and arbitration protocols, whereas limiting the
number of SLDRAMs to 64 simplifies the packet encoding, because the
SLDRAM address (slaveId) can be contained in the first byte of each packet.
The limit is 64 rather than 128, because half of the 7-bit slaveIds are used
for the broadcast and multicast addresses.
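The slaveId budget described above can be illustrated with a small sketch. Which half of the 7-bit code space is reserved for broadcast/multicast is an assumption here, not something the text specifies.

```python
# 7-bit slaveId space: 128 codes, half assumed here to be unicast device IDs.
def classify_slave_id(slave_id):
    assert 0 <= slave_id < 128              # 7-bit field
    if slave_id < 64:                       # assumption: lower half = unicast
        return f"unicast to SLDRAM {slave_id}"
    return "broadcast/multicast group"

print(classify_slave_id(17))
print(classify_slave_id(100))
```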
The link from the controller to the SyncLink nodes, the commandLink, is
unidirectional, and the signal values can change every clock tick. The nominal
clock period is physical-layer dependent, but SyncLink changes data values on
both edges of the clock. For example, a memory system with a 2.5-ns clock
period and a 10-bit-wide commandLink corresponds to a raw bandwidth of
200M command packets/s. The basic 10-bit-wide commandLink contains 14
signals: linkOn (a low-speed asynchronous initialization signal), a strobe
(clock) signal, a listen signal that enables the flag and data receivers, a flag
signal, and 10 data signals. The listen, flag, and data signals are
source-synchronous; that is, the incoming strobe signal indicates when the
other input signals are valid. The flag signal marks the beginning of
transmitted packets. The data signals are used to transmit bytes within the
packets, and depending upon their location within a packet, the bytes provide
address, command, status, or data values.
The dataLink is 16 or 18 bits wide, carrying the read data from SyncLink
nodes back to the controller or write data from the controller to one or more
SyncLink nodes. The bit rate is the same as for the commandLink, and the
minimum block transferred corresponds to 4 bits on each dataLink pin, the
same duration as a command. This implies that for a memory system with a
2.5-ns clock period, the data transfer rate can be as high as 1600 Mbytes/s.
The SyncLink architecture supports multiples of both 16- and 18-bit-wide
DRAMs. The 18-bit chips can be used by 16-bit controllers, because the extra
bits are logically disconnected until enabled by a controller-initiated command.
Figure 4.13b shows an example of a typical small memory subsystem design.
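The SyncLink rate arithmetic above, spelled out. Only values given in the text are assumed: a 2.5-ns clock, transfers on both edges, four-tick packets, and a 16-bit dataLink.

```python
clock_ns = 2.5
transfers_per_s = 2 / (clock_ns * 1e-9)          # both clock edges: 800e6 per pin

command_packets_per_s = transfers_per_s / 4      # a command packet spans 4 ticks
data_bytes_per_s = transfers_per_s * 16 / 8      # 16-bit-wide dataLink

print(command_packets_per_s / 1e6, "M command packets/s")   # 200.0
print(data_bytes_per_s / 1e6, "Mbytes/s")                    # 1600.0
```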
To support variable-width DRAM connections and a wide variety of
configurations, SyncLink uses address compare logic that supports a
variety of multicast (x 2, x 4, ...) addresses in addition to single-chip and
broadcast addresses. This decoding of multicast slaveId addresses is simpler
and more flexible than providing separate chipSelect signals to individual
DRAMs, as is done with currently available SDRAMs. To encode the
data packet returns data after a fixed delay Trc, which is basically the sum
of the row access and the column access delays of the SLDRAM and is set
at initialization time. A read can be directed to one SLDRAM, or
multicast. The multicast is useful only when there are multiple dataLinks,
because only one device at a time is permitted to drive any particular
dataLink.
• Load Transactions A load transaction is similar to a read, but uses special
addressing to access information about the characteristics of particular
SLDRAMs, which is usually information needed to initialize the system.
The delay for load data is also set by Trc, based on the assumption that
the registers can be accessed at least as fast as the SLDRAM. As in the
case of read transactions, a load can be directed to one SLDRAM, or
multicast.
• Write Transactions The write request packet transfers command and
RESET"
L1NKON
LISTEN
CCLK (free runninal 2, Command Link
FLAG
CA[9:0] I 10 I
+, +,
Memory
controller
SO SI SLDRAM or ~Oooo -l!. SLDRAM or
-SO
SL module 1 SL module 8
SI
000
DO[17:0] 18
DCLKO 1 2 l
DataLink
DCLK1
, 2
(bid irectional , intermittent)
(a)
ON 4N aN 12N 16N 20N 24N 28N 32N 36N 40N 44N 48N 52N
j j
CCLK
FlAG
OataLink
OCLKO
OCLKI
1----'----1
Preamb le
(b)
Figure 4.14 SLDRAM. (a) Bus topology. (b) Bus transactions timing diagram.
(From reference 18, with permission of IEEE.)
dynamically mix 4N and 8N bursts. The first two commands are page reads to
the SLDRAM 0 to either the same or different banks. SLDRAM 0 drives the
read data on the data bus along with DCLKO to provide the memory
controller the clock edges to strobe in the read data. Because the first two page
read commands are for the same SLDRAM, it is not necessary to insert a gap
between the two 4N data bursts because the SLDRAM itself ensures that
DCLKO is driven continuously. However, the data burst for the next page read
(to SLDRAM 1) must be separated by a 2N gap. This allows for settling of the
DataLink bus and for the timing uncertainty between SLDRAM 0 and
SLDRAM 1. A 2N gap is necessary whenever control of the DataLink passes
from one device to another.
The next command is a write command in which the controller drives
DCLKO to strobe the write data into the SLDRAM 2. The page write latency
of the SLDRAM is programmed to equal page read latency minus 2N. The
subsequent read command to SLDRAM 3 does not require any additional
delay to achieve the 2N gap on the DataLink. The final burst of three
consecutive write commands shows that the 2N gap between data bursts is not
necessary when the system is writing to different SLDRAM devices, because all
data originates from the memory controller.
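The bus-turnaround rule in the preceding paragraphs can be captured in a small sketch: a 2N gap is needed whenever the device driving the DataLink changes, where reads are driven by the addressed SLDRAM and writes by the controller. The function below is illustrative, not part of the SLDRAM specification.

```python
# A 2N DataLink gap whenever the driving device changes (reads: the SLDRAM; writes: the controller).
def datalink_gaps(commands):
    """commands: list of ('read' | 'write', sldram_id); returns the gap (in N) before each burst."""
    gaps, prev_driver = [], None
    for op, dev in commands:
        driver = ('sldram', dev) if op == 'read' else ('controller',)
        gaps.append(2 if (prev_driver is not None and driver != prev_driver) else 0)
        prev_driver = driver
    return gaps

seq = [('read', 0), ('read', 0), ('read', 1), ('write', 2), ('read', 3),
       ('write', 0), ('write', 1), ('write', 2)]
print(datalink_gaps(seq))   # [0, 0, 2, 2, 2, 2, 0, 0]
```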
When control of the DataLink passes from one device to another, the bus
remains at a midpoint level for nominally 2N, which results in indeterminate
data and possibly multiple transitions at the input buffers. To address this
problem, the data clocks have a 0010 preamble before the transition associated
with the first bit of data. The controller programs each SLDRAM
with four timing latency parameters: page read, page write, bank read, and
bank write. The latency can be defined as the time between the command burst
and start of the associated data burst. For consistent memory subsystem
operation, each SLDRAM should be programmed with the same values. On
power-up, the memory controller polls the status registers in each SLDRAM
to determine minimum latencies, which may vary across the manufacturers.
The memory controller then programs each SLDRAM with the worst-case
values.
The read latency is adjustable in coarse increments of unit bit intervals and
fine increments of fractional bit intervals. The controller programs the coarse
and fine read latency of each SLDRAM, so that the read data bursts from
different devices at different electrical distances from the controller all arrive
back at the controller with equal delay from the command packet. Write
latency is only adjustable in coarse increments, and its value determines when
the SLDRAM begins looking for transitions on the DCLK to strobe in write
data.
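An illustrative leveling calculation in the spirit of this paragraph: each device is padded so that read data from every SLDRAM arrives back at the controller with the same total delay. Splitting the padding into whole and fractional bit intervals is a simplifying assumption, not the specification's exact procedure.

```python
# Pad each device's read return so all data arrives with the same total delay.
def level_read_latency(round_trip_ns, bit_interval_ns):
    target = max(round_trip_ns)                   # slowest device sets the target
    settings = []
    for t in round_trip_ns:
        pad = target - t                          # extra delay this device needs
        coarse = int(pad // bit_interval_ns)      # whole bit intervals
        fine = pad - coarse * bit_interval_ns     # fractional remainder
        settings.append((coarse, round(fine, 3)))
    return settings

# Hypothetical round-trip times for four devices on one channel.
print(level_read_latency([11.0, 12.5, 13.75, 14.0], bit_interval_ns=1.25))
# -> [(2, 0.5), (1, 0.25), (0, 0.25), (0, 0.0)]
```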
Packet Definitions
• Register Read Request Packet The register read request packet is used to
initiate a read access to a register address. In response to a register read
request packet, the SLDRAM provides a data packet on the data bus after
a specified time.
• Register Write Request Packet The register write request packet is used
to initiate a write access to a register address. This packet consists of four
words, with the latter two being the data to be written to the selected
register.
Figure 4.15 Block diagram of a 4M x 18 SLDRAM. (From refernce 21, with permission of IEEE.)
'"
~
288 APPLICATION-SPECIFIC DRAM ARCHITECTURES AND DESIGNS
• Event Request Packet The event request packet is used to initiate a hard or a soft reset, an autorefresh, or a Close All Rows command, or to enter or exit self-refresh, adjust output voltage levels, adjust the Fine Read Vernier, or adjust the Data Offset Vernier. The output voltage levels, or the fine read or data offset verniers, can be adjusted using a dedicated Adjust Settings Event Request Packet or as part of an autorefresh event.
• Data Sync Request Packet A data sync request packet is used to control
the output logic values and patterns used for the level adjustment, latency
detection, and timing synchronization.
• Data Packet A data packet is provided by the controller for each write
request and by the SLDRAM for each read request. Each data packet
contains either 8 bytes or 16 bytes, depending on whether the burst length
was set to 4 or 8, respectively, in the corresponding request packet. There
are no output disable or write masking capabilities within the data packet.
When the burst length of 8 is selected, the first 8 bytes in the packet
correspond to the column address contained in the request packet, and
the second 8 bytes correspond to the same column address except with an
inverted LSB.
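The following small example (hypothetical address values, C for illustration) shows the data-packet sizing and the inverted-LSB column addressing for a burst of 8, as described in the Data Packet definition above:

```c
/* Hedged sketch: a burst of 4 returns 8 bytes for the requested column, while
 * a burst of 8 returns 16 bytes whose second half corresponds to the same
 * column address with its least significant bit inverted. */
#include <stdio.h>

int main(void)
{
    unsigned column = 0x1A4;      /* example column address from the request packet */
    unsigned burst_length = 8;    /* 4 or 8 words */

    unsigned bytes = (burst_length == 8) ? 16 : 8;
    printf("data packet size: %u bytes\n", bytes);
    printf("first  8 bytes -> column 0x%03X\n", column);
    if (burst_length == 8)
        printf("second 8 bytes -> column 0x%03X (LSB inverted)\n", column ^ 0x1);
    return 0;
}
```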
• Open Row The OPEN ROW command is used to open (or activate) a
row in a particular bank in preparation for a subsequent, but separate,
column access command. The row remains open (or active) for the
accesses until a CLOSE ROW command or an access-and-close-row type
command is issued to that bank. After an OPEN ROW command is
issued to a given bank, a CLOSE ROW command or an access-and-close-
row type command must be issued to that bank before a different row in
that same bank can be opened.
• Close Row The CLOSE ROW command is used to close a row in a
specified bank. This command is useful when it is desired to close a row
that was previously left open in anticipation of subsequent page accesses.
• Read The Page Read and Bank Read commands are used to initiate a read access to an open row or to a closed row, respectively. The
commands indicate the burst length, the selected DCLK, and whether to
leave the row open after the access. The read data appears on the DQs
based on the corresponding Read Delay Register values, Fine Read
Vernier, and the Data Offset Vernier settings previously programmed into
the device.
• Write The Page Write and Bank Write commands are used to initiate a write access to an open row or to a closed row, respectively. The commands indicate the burst length, the selected DCLK, and whether to leave the row open after the access. Write data are expected on the DQs at a time determined by the corresponding Write Delay Register value previously programmed into the device.
• No Operation (NOP) A HIGH on FLAG indicates the start of a valid request packet; FLAG then goes LOW for the remainder of the packet.
FLAG LOW at any other time results in a No Operation (NOP). A NOP
prevents unwanted commands from being registered during the idle states,
and does not affect operations already in progress.
• Register Read A Register Read command is used to read contents of the
device status registers. The register data are available on the DQs after
the delay determined by the Page Read Delay Register value, Fine Read
Vernier, and Data Offset Vernier settings previously programmed into the
device.
• Register Write The Register Write command is used to write to the
control registers of the device. The register data are included within the
request packet containing the command.
• Event The events (e.g., Hard Reset, Soft Reset, Auto-Refresh, etc.) are
used to issue commands that do not require a specific address within a
device or devices.
• Read Sync (Stop Read Sync) This command instructs the SLDRAM to start (stop) transmitting the specified synchronization pattern to be used by the controller to adjust input capture timing.
• Drive DCLKs LOW (HIGH) This command instructs the SLDRAM to drive the DCLK outputs LOW (HIGH) until overridden by another DRIVE DCLK or READ command. The specified DCLK outputs are driven to this level, and the DCLK# outputs are driven to the opposite state.
• Drive DCLKs Toggling This command instructs the SLDRAM to drive
the DCLK outputs toggling at the operating frequency of the device until
overridden by another DRIVE DCLK or READ command.
• Disable DCLKs This command instructs the SLDRAM to disable (High-Z) the DCLK/DCLK# outputs until overridden by another DRIVE DCLK or READ command.
The command field encodings for these operations are as follows:

1 0 0 1 1 1   Event
1 0 1 0 0 0   Read Sync (Drive both DCLKs)
1 0 1 0 0 1   Stop Read Sync
1 0 1 0 1 0   Drive DCLKs LOW
1 0 1 0 1 1   Drive DCLKs HIGH
1 0 1 1 0 0   Reserved
1 0 1 1 0 1   Reserved
1 0 1 1 1 0   Disable DCLKs
1 0 1 1 1 1   Drive DCLKs Toggling
1 1 0 X X X   Reserved
1 1 1 X X X   Reserved
Read Accesses The read accesses are initiated with a read request packet. When accessing an idle bank (a bank read access), the request packet includes the bank, row, and column addresses, the burst length, and a bit indicating whether or not to close the row after the access. The same is true for accessing the open row in an active bank (a page read access), except that the row address will be ignored. During a read access, the first of four (or eight) data words in the data packet is available following the total read delay; the remaining three (or seven) data words, one each, are available every tick (2.5 ns) later. The total read delay is equal to the coarse delay (Bank Read Delay or Page Read Delay) stored in the SLDRAM register plus the fine delay of the Data Offset Vernier and the Fine Read Vernier for the DQs and DCLK0. Figure 4.16 shows the minimum and maximum total read delays for (a) bank read access and (b) page read access [21].
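A minimal sketch of the total read delay calculation follows; the fractional resolution assumed for the verniers is illustrative only, not a data-sheet value:

```c
/* Hedged sketch: the coarse register value (Bank or Page Read Delay) counts
 * whole 2.5-ns ticks, while the Data Offset and Fine Read Verniers contribute
 * fractional-tick adjustments. */
#define TICK_NS        2.5
#define VERNIER_STEPS  16      /* assumed fractional resolution per tick */

double total_read_delay_ns(unsigned coarse_ticks,
                           unsigned data_offset_vernier,
                           unsigned fine_read_vernier)
{
    double fine_ticks = (double)(data_offset_vernier + fine_read_vernier)
                        / VERNIER_STEPS;
    return (coarse_ticks + fine_ticks) * TICK_NS;
}
```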
The SLDRAM clocking scheme is designed to provide for the temporal
alignment of all read data at the memory controller data pins, regardless of the
source SLDRAM. This temporal alignment scheme can be broken down into
different levels. At the lowest level (device level data capture), the DCLK
transitions and DQ transitions of an individual SLDRAM are adjusted (moved
in time) relative to each other to facilitate the capture by the controller of the
DQ signals using the DCLK signals. Thus, the SLDRAM clocking scheme
allows for individual device adjustment without requiring the memory control-
ler to implement memory device specific internal adjustments. At the next level
of timing alignment (device level optimization), the DCLK and DQ transitions
are moved as a group in time to align the DCLK edges with the preferred
phase of an internal controller clock.
The first two levels of timing alignment are sub-tick-level adjustments. At the next level (system-level optimization), coarse (integer tick value) adjustments are made in order to establish the same latency between a read command being issued by the controller and the corresponding data arriving back at the controller for all SLDRAM devices in the system.
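The sketch below illustrates this coarse, system-level equalization under the simplifying assumption that each device's round-trip flight time to the controller is known in whole ticks; names and array sizes are illustrative:

```c
/* Hedged sketch: pad each device's coarse read delay so that the command-to-
 * data latency observed at the controller is the same for every SLDRAM,
 * regardless of its electrical distance on the channel. */
#define NUM_DEVICES 8

void equalize_read_latency(const unsigned flight_ticks[NUM_DEVICES],
                           unsigned min_read_latency,
                           unsigned coarse_delay_out[NUM_DEVICES])
{
    unsigned worst = 0;
    for (int d = 0; d < NUM_DEVICES; d++)
        if (min_read_latency + flight_ticks[d] > worst)
            worst = min_read_latency + flight_ticks[d];

    /* Each device is delayed so its data arrive after 'worst' ticks. */
    for (int d = 0; d < NUM_DEVICES; d++)
        coarse_delay_out[d] = worst - flight_ticks[d];
}
```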
Figure 4.16 SLDRAM minimum and maximum total read delays. (a) Bank read access. (b) Page read access. (From reference 21, with permission of IEEE.)
The two DCLK signals provided by each SLDRAM (as well as the memory
controller) provide increased effective bandwidth when switching between
different sources of data on the bus (e.g., a read from one SLDRAM followed
by a read from another SLDRAM, read-to-write or write-to-read transitions).
The preamble and leading cycle in a given DCLK sequence can be hidden -
that is, overlapped with data associated with the other DCLK signal.
Write Accesses The write accesses are initiated with a write request packet. When accessing an idle bank (a bank write access), the request packet includes the bank, row, and column addresses, the burst length, and a bit indicating whether or not to close the row after the access. The same is true when accessing the open row in an active bank (a page write access), except that the row address will be ignored. During a write access, the first of four (or eight) data words in the data packet is driven by the controller, aligned with the selected DCLK, after the delay (Bank Write Delay or Page Write Delay) programmed into the SLDRAM registers. The remaining three (or seven) data words follow, one each, every clock tick (2.5 ns) later. Figure 4.17 shows the minimum and maximum delay before the arrival of data at the SLDRAM during (a) bank write access and (b) page write access [21].
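A minimal sketch of how the write delays could be derived from the programmed read delays is shown below. The page-write relation (page read latency minus 2N) is stated earlier in this section; extending the same offset to the bank delays is an assumption made here only for illustration:

```c
/* Hedged sketch relating the programmed write delays to the read delays. */
typedef struct {
    unsigned page_read, bank_read;    /* programmed read delays, in ticks */
    unsigned page_write, bank_write;  /* derived write delays, in ticks */
} sldram_delays_t;

void derive_write_delays(sldram_delays_t *d)
{
    d->page_write = d->page_read - 2;   /* page write latency = page read latency - 2N */
    d->bank_write = d->bank_read - 2;   /* assumed analogous offset for bank accesses */
}
```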
Standby Mode In the standby mode, all output drivers are disabled and all
input receivers except those for the CCLK, RESET#, LISTEN, and LINKON
are disabled. The standby mode is entered by deactivating the LISTEN signal
at any time except during the transfer of a request packet. The standby mode
can be nested within the self-refresh mode.
Figure 4.17 SLDRAM minimum and maximum write delay times during (a) bank write access and (b) page write access. (From reference 21, with permission of IEEE.)
Shutdown Mode In shutdown mode, all internal clocks, all output drivers,
and all input receivers are disabled, except for the LINKON and RESET#.
The shutdown mode is entered by deactivating the LINKON signal while the
device is already in the standby mode. The shutdown mode may be nested within the
self-refresh mode.
• 2048-bit SRAM pixel buffer as the cache between DRAM and ALU
• Built-in tile-oriented memory addressing for rendering and scan-line-oriented memory addressing for video refresh
• 256-bit global bus connecting DRAM banks and pixel buffer
• Flexible, dual video buffer supporting 76-Hz CRT refresh
Figure 4.18 shows the simplified 3-D RAM block diagram with external pins
[22]. The DRAM array is partitioned into four independent banks (A, B, C,
and D) of 2.5 Mb each, and together these four banks can support a screen
resolution of 1280 x 1024 x 8. The independent banks can be interleaved to
facilitate nearly uninterrupted frame buffer update and, at the same time,
transfer pixel data to the dual video buffer for screen refresh. Data from the DRAM banks are transferred over the 256-bit global bus to the triple-ported pixel buffer. The pixel buffer consists of eight blocks, each of which is 256 bits and is updated in a single transfer on the global bus. The memory size of the pixel buffer is 2 Kbits.
Figure 4.18 Simplified 3-D RAM block diagram with external pins. (From reference 22, with permission of Mitsubishi Corp.)
The ALU uses two of the pixel buffer ports to read and write data in the same clock cycle. Each video buffer is 80 x 8 bits and is loaded in a single DRAM operation. One video buffer can be loaded while the other is sending out video data. The on-board pixel buffer can hold up to eight blocks of data, each block containing 256 bits, and has cycle times of 8 ns and 10 ns. With the 256-bit global bus operating at a maximum speed of 20 ns and transferring 32-byte blocks, data can be moved from the DRAM banks to the pixel buffer at a rate of up to 1.6 Gbytes/s. The ALU converts z-buffer and pixel blend operations from "read-modify-writes" to "mostly writes," which allows data modifications to be completed in a single pixel buffer cycle, reducing execution time by up to 75%.
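The quoted peak rate follows directly from the bus width and cycle time, as the short check below shows:

```c
/* Worked check of the global-bus figure quoted above: a 256-bit (32-byte)
 * block moved every 20 ns corresponds to 1.6 Gbytes/s. */
#include <stdio.h>

int main(void)
{
    double block_bytes = 256.0 / 8.0;   /* 32 bytes per global-bus transfer */
    double cycle_s     = 20e-9;         /* 20-ns global bus cycle */
    printf("%.1f Gbytes/s\n", block_bytes / cycle_s / 1e9);   /* prints 1.6 */
    return 0;
}
```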
A word has 32 bits and is the unit of data operations within the pixel ALU and between the pixel ALU and the pixel buffer. When the pixel ALU accesses the pixel buffer, not only does a block address need to be specified, but also a word has to be identified. Because there are eight blocks in the pixel buffer and eight words in a block, the upper three bits of the input pins PALU_A designate the block, and the lower three bits select the word. The data in a word are directly mapped to PALU_DQ[31:0] in corresponding order. In other words, bit 0 of the word is mapped to PALU_DQ0, bit 1 to PALU_DQ1, and so on. Figure 4.19 shows the relations and addressing scheme of the blocks and words in the pixel buffer and in the DRAM page [22].
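A minimal sketch of this PALU_A decoding (hypothetical helper, C for illustration):

```c
/* Hedged sketch: the upper three bits of the 6-bit PALU_A address select one
 * of eight pixel-buffer blocks, and the lower three bits select one of eight
 * 32-bit words within that block. */
#include <stdint.h>

typedef struct { unsigned block; unsigned word; } palu_addr_t;

palu_addr_t decode_palu_a(uint8_t palu_a /* 6-bit address, PALU_A[5:0] */)
{
    palu_addr_t a;
    a.block = (palu_a >> 3) & 0x7;   /* PALU_A[5:3]: block 0..7 */
    a.word  =  palu_a       & 0x7;   /* PALU_A[2:0]: word 0..7  */
    return a;
}
```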
Although an ALU write operation operates on one word at a time, each of the four bytes in a word may be individually masked. The mapping is also direct and linear: Byte 0 is PALU_DQ[7:0], byte 1 is PALU_DQ[15:8], byte 2 is PALU_DQ[23:16], and byte 3 is PALU_DQ[31:24]. A block has 256 bits and is the unit of memory operations between a DRAM bank and the pixel buffer over the global bus. The input pins DRAM_A select a block from the pixel buffer and a block from the page of a DRAM bank. The DRAM operations on block data are Unmasked Write Block (UWB), Masked Write Block (MWB), and Read Block (RDB).
A page in a DRAM bank is organized into 10 x 4 blocks; and because each
block has 256 bits, a page has 10,240 bits. There are four DRAM banks in a
3-D RAM chip, such that the pages of the same page address from all four
DRAM banks compose a page group. Therefore, a page group has 20 x 8
blocks.
Figure 4.19 shows the block and page drawn as rectangular shapes that can be related to a tiled frame buffer memory organization. For example, if the display resolution is 1280 x 1024 x 8, then a 32-bit word contains four pixels. Because a block may be considered as having 2 x 4 words, it contains 8 x 4 pixels. A page is organized into 10 x 4 blocks, so it contains 80 x 16 pixels; thus a page group (20 x 8 blocks) contains 160 x 32 pixels.
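The sketch below illustrates one possible pixel-to-storage mapping that is consistent with these tile dimensions; the raster ordering of words within a block, blocks within a page, and page tiles across the screen assumed here is illustrative and may not match the actual device mapping or bank interleave:

```c
/* Hedged sketch of tiled addressing for a 1280 x 1024 x 8 frame buffer:
 * a 32-bit word holds 4 horizontally adjacent pixels, a block is an 8 x 4
 * pixel tile (2 x 4 words), and a page is an 80 x 16 pixel tile (10 x 4
 * blocks). */
typedef struct {
    unsigned page;    /* which 80 x 16 pixel page tile */
    unsigned block;   /* block index within the page (0..39) */
    unsigned word;    /* word index within the block (0..7) */
    unsigned byte;    /* byte lane within the word (0..3), i.e. the pixel */
} pixel_loc_t;

pixel_loc_t map_pixel(unsigned x, unsigned y)     /* 8-bit pixel at (x, y) */
{
    pixel_loc_t loc;
    unsigned pages_per_row = 1280 / 80;           /* 16 page tiles across the screen */

    loc.page  = (y / 16) * pages_per_row + (x / 80);
    loc.block = ((y % 16) / 4) * 10 + (x % 80) / 8;   /* 10 x 4 blocks per page */
    loc.word  = ((y % 4) * 2) + ((x % 8) / 4);        /* 2 x 4 words per block  */
    loc.byte  = x % 4;                                /* 4 pixels per 32-bit word */
    return loc;
}
```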
Figure 4.19 3-D RAM relations and addressing scheme of blocks and words in the pixel buffer and in the DRAM page. (From reference 22, with permission of Mitsubishi Corp.)
The major functional elements of the 3-D RAM, described in the following sections, are:
• DRAM Banks
• Pixel Buffer
• Pixel ALU
• Video Buffers
• Global Bus
DRAM Banks The 3-D RAM contains four independent DRAM banks, which can be interleaved to facilitate hidden precharge or access in one bank while screen refresh is being performed in another bank. Each DRAM bank has 256 pages with 10,240 bits per page, for a total page storage capacity of 2,621,440 bits. An additional 257th page can be accessed for special functions. A row decoder takes a 9-bit page address signal to generate 257 word lines, one for each page. The word lines select which page is connected to the sense amplifiers. The sense amplifiers read and write the page selected by the row decoder. Figure 4.20a shows the block diagram of a DRAM bank consisting of the row decoder, address latch, DRAM array, and sense amplifiers [22].
During an Access Page (ACP) operation, the row decoder selects a page by activating its word line, which transfers the bit charge of that page to the sense amplifiers. The sense amplifiers amplify the charges. After the sensing and
amplification are completed, the sense amplifiers are ready to interface with the
global bus or video buffer. In a way, the sense amplifiers function as a
"write-through" cache, and no write back to the DRAM array is necessary.
Alternatively, the data in the sense amplifiers can be written to any page in the
same bank at this time, simply by selecting a word line without first equalizing
the sense amplifiers. This function is called Duplicate Page (DUP), and a
typical application of this function can be copying from the 257th page to one
of the normal 256 pages - all 10,240 bits at a time for fast area fill.
When the sense amplifiers in a DRAM bank complete the read/write
operations with the global bus or video buffer, a precharge (PRE) bank
operation usually follows. This precharge bank cycle deactivates the selected
word line corresponding to the current page and equalizes the sense amplifiers.
The DRAM banks must be precharged prior to accessing a new page.
The major DRAM operations are: Unmasked Write Block (UWB), Masked Write Block (MWB), Read Block (RDB), Precharge Bank (PRE), Video Transfer (VDX), Duplicate Page (DUP), Access Page (ACP), and No Operation (NOP). These operations are briefly described in the following sections.
Figure 4.20b illustrates the Unmasked Write Block (UWB), Masked Write
Block (MWB), and the Read Block (RDB) operations on the global bus.
• Unmasked Write Block (UWB) The UWB operation copies 32 bytes from the specified pixel buffer block over the global bus to the specified block in the sense amplifiers and the DRAM page of a selected DRAM bank. The 32-bit Plane Mask register has no effect on the UWB operation. The 32-bit Dirty Tag still controls which bytes of the block are updated.
• Masked Write Block (MWB) The MWB operation copies 32 bytes from the specified pixel buffer block over the global bus to the specified block in the sense amplifiers and the DRAM page of a selected DRAM bank. Both the 32-bit Dirty Tag and the 32-bit Plane Mask register control which bytes of the block are updated.
• Read Block (RDB) The RDB operation copies 32 bytes from the sense
amplifiers of a selected DRAM bank over the global bus to the specified
block in the pixel buffer. The corresponding 32-bit Dirty Tag is cleared.
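The sketch below summarizes the byte write-enable rules for UWB and MWB transfers as described above; applying the Plane Mask per bit within each byte lane during an MWB is an assumption of this illustration, not the device's documented internal logic:

```c
/* Hedged sketch of block transfers from the Pixel Buffer to a DRAM bank.
 * Each of the 32 bytes in a block has a Dirty Tag bit; clean bytes are never
 * written. UWB ignores the Plane Mask; MWB is assumed to gate bit positions
 * within each 32-bit word as well. */
#include <stdint.h>
#include <stdbool.h>

void write_block_to_dram(uint8_t dram_block[32],        /* block in the sense amplifiers */
                         const uint8_t pixel_block[32], /* block in the Pixel Buffer */
                         uint32_t dirty_tag,            /* one bit per byte */
                         uint32_t plane_mask,           /* per-bit enables within a word */
                         bool masked)                   /* true = MWB, false = UWB */
{
    for (int byte = 0; byte < 32; byte++) {
        if (!(dirty_tag & (1u << byte)))
            continue;                                   /* clean byte: DRAM keeps old value */

        if (!masked) {
            dram_block[byte] = pixel_block[byte];       /* UWB: Plane Mask has no effect */
        } else {
            /* MWB: only the Plane Mask bits for this byte lane are written. */
            uint8_t lane_mask = (uint8_t)(plane_mask >> ((byte % 4) * 8));
            dram_block[byte] = (uint8_t)((dram_block[byte] & ~lane_mask) |
                                         (pixel_block[byte] & lane_mask));
        }
    }
}
```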
Figure 4.20 3-D RAM block diagrams. (a) DRAM bank. (b) UWB or MWB, and RDB on the global bus. (c) Video transfer from a page in Bank A to video buffer I. (From reference 22, with permission of Mitsubishi Corp.)
Pixel Buffer The pixel buffer is a 2048-bit SRAM organized into 256-bit blocks, as shown in Figure 4.20. During a DRAM operation, these blocks can be addressed from the DRAM_A pins for block transfers on the global bus. During a pixel ALU operation, the 32-bit pixel ALU accesses the pixel buffer, requiring not only the block address to be specified but also the 32-bit word to be identified. This is done by using the 6-bit PALU_A pins such that the upper three bits select one of the eight blocks in the pixel buffer, and the lower three bits specify one of the eight words in the selected block. The availability of both the DRAM_A and PALU_A pins allows concurrent DRAM and pixel ALU operations. Figure 4.21a shows the pixel buffer elements [22].
The pixel buffer functions as a level-one write-back pixel cache and includes
the following: a 256-bit read/write port, a 32-bit read port, and a 32-bit write
port. The 256-bit read/write port is connected to the global bus via a write
buffer, and the two 32-bit ports are connected to the pixel ALU and the pixel
data pins. All three ports can be used simultaneously as long as the same
memory cell is not accessed. An operation that involves only the pixel ALU
and the pixel buffer is called a pixel ALU operation. Figure 4.21b shows the
block diagram of a triple-port pixel buffer, a global bus, and a dual-port Dirty
Tag RAM.
Pixel ALU Some of the major elements and operations of pixel ALU are
described in the following text.
Dirty Tag Each data byte of a 256-bit block is associated with a Dirty Tag bit, which means that each 32-bit word is associated with four Dirty Tag bits and that a 32-bit Dirty Tag controls the corresponding 32 bytes of block data. The Dirty Tag RAM in the pixel buffer contains eight such 32-bit Dirty
Tags. When a block is transferred from the sense amplifiers to the pixel buffer
through the 256-bit port, the corresponding 32-bit Dirty Tag is cleared. When
a block is transferred from the pixel buffer to a DRAM bank, the Dirty Tag
determines which bytes are actually written. When a Dirty Tag bit is "1," the
corresponding data byte is written under the control of the Plane Mask
register, whereas if a Dirty Tag bit is "0," the corresponding byte of data in the
DRAM bank is not written and retains its former value.
There are three major aspects of Dirty Tag operations: tag clear, tag set, and
tag initialization. In normal operation modes, the clearing and setting of the
Dirty Tags by these read and write operations are done by the on-chip logic
in the 3-D RAM and are basically transparent to the rendering controller. The
Dirty Tag bits are used by the 3-D RAM internally and are not output to the
external pins. The Dirty Tag bits play an important role for all four write
operations of the Pixel ALU to the Pixel Buffer: Stateful/Stateless Initial Data
Write and Stateful/Stateless Normal Data Write.
The Stateless Data Writes refer to the condition whereby the states of the
Pixel ALU units are entirely ignored and the write data are passed to the Pixel
Buffer unaffected, whereas in the Stateful Data Writes the settings of the
various registers in the Pixel ALU, the results of the compare tests, and the
states of the PASS_IN all affect whether the bits of pixel data will be written
into the Pixel Buffer. Initial and Normal Data Writes refer to the manner in
which the Dirty Tag is updated.
Figure 4.21 3-D RAM. (a) Pixel buffer elements. (b) Triple-port pixel buffer, global bus, and dual-port Dirty Tag RAM. (From reference 22, with permission of Mitsubishi Corp.)
Many 2-D rendering operations, such as text drawing, involve writing the same color to many pixels. In the 3-D RAM, Color Expansion is done with the Dirty Tags associated with the Pixel Buffer blocks. The pixel color is written eight times to a Pixel Buffer block, so that all of the pixels in the block are the same color. Next, a 32-bit word is written to the Dirty Tag of the associated block. Finally, the block is written to a DRAM bank. Only the pixels whose corresponding Dirty Tag bits are set are changed to the new color; the other pixels are unaffected.
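A minimal sketch of this color expansion sequence follows, using hypothetical controller-side helpers rather than actual 3-D RAM commands:

```c
/* Hedged sketch of Color Expansion: replicate the fill color into all eight
 * words of a Pixel Buffer block, write the Dirty Tag word (one bit per pixel
 * byte to be changed), and write the block back to the DRAM bank. */
#include <stdint.h>

extern void pixel_buffer_write_word(unsigned block, unsigned word, uint32_t data); /* hypothetical */
extern void pixel_buffer_write_dirty_tag(unsigned block, uint32_t tag);            /* hypothetical */
extern void dram_write_block(unsigned bank, unsigned page, unsigned block);        /* hypothetical */

void color_expand(unsigned bank, unsigned page, unsigned block,
                  uint8_t color, uint32_t pixel_select_mask /* 1 bit per pixel byte */)
{
    uint32_t word = color * 0x01010101u;        /* replicate the 8-bit color into 4 byte lanes */

    for (unsigned w = 0; w < 8; w++)            /* write the color eight times */
        pixel_buffer_write_word(block, w, word);

    pixel_buffer_write_dirty_tag(block, pixel_select_mask);
    dram_write_block(bank, page, block);        /* only dirty (selected) pixels change */
}
```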
Plane Mask The 32-bit Plane Mask register (PM[31:0]) is used to qualify two write functions: (1) as per-bit write enables on 32-bit data for Stateful (Initial/Normal) data write operations from the Pixel ALU to the Pixel Buffer, and (2) as per-bit write enables on the 256-bit data for a Masked Write Block (MWB) operation from the Pixel Buffer to the sense amplifiers of a DRAM bank over the Global Bus. For a Stateful Data Write, the Plane Mask serves as per-bit write enables over the entering data from the Pixel ALU write port; bit 0 of the Plane Mask enables or disables bit 0 of the incoming 32-bit pixel data, bit 1 of the Plane Mask enables or disables bit 1 of the incoming 32-bit pixel data, and so on.
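In other words, the Plane Mask acts as a simple per-bit merge, as the short sketch below illustrates:

```c
/* Minimal sketch of the per-bit write enable: where a mask bit is 1 the
 * incoming pixel bit is written; where it is 0 the Pixel Buffer keeps its
 * previous value. */
#include <stdint.h>

uint32_t plane_masked_write(uint32_t old_word, uint32_t new_word, uint32_t plane_mask)
{
    return (old_word & ~plane_mask) | (new_word & plane_mask);
}
```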
Figure 4.22 3-D RAM block diagram. (a) Pixel ALU. (b) Dual Compare unit. (From reference 22, with permission of Mitsubishi Corp.)
The Dual Compare unit uses mask registers that define which bits of the 32-bit words will be compared and which will be "don't care." The results of both Match Compare and Magnitude Compare
operations are logically ANDed together to generate the PASS_OUT pin. The
external PASS_IN signal (fed from another 3-D RAM chip) and the internally
generated PASS_OUT signal are then logically ANDed together to produce a
Write Enable signal to the Pixel Buffer. Figure 4.22b shows the block diagram
of the Dual Compare unit.
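A minimal sketch of this compare-and-enable path is shown below, with the Match and Magnitude tests reduced to simple equality and greater-than comparisons for illustration:

```c
/* Hedged sketch: Match and Magnitude compare results are ANDed to drive
 * PASS_OUT, and PASS_OUT is ANDed with the external PASS_IN (from another
 * 3-D RAM) to form the Pixel Buffer write enable. */
#include <stdint.h>
#include <stdbool.h>

bool pixel_write_enable(uint32_t new_pixel, uint32_t old_pixel,
                        uint32_t match_ref,  uint32_t match_mask,  /* "don't care" bits are 0 */
                        bool pass_in)
{
    bool match_ok     = ((new_pixel ^ match_ref) & match_mask) == 0;
    bool magnitude_ok = new_pixel > old_pixel;      /* e.g., one possible z-buffer depth test */

    bool pass_out = match_ok && magnitude_ok;       /* drives the PASS_OUT pin */
    return pass_in && pass_out;                     /* Write Enable to the Pixel Buffer */
}
```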
Video Buffers The 3-D RAM functional block diagram in Figure 4.18
shows the Video Buffers I and II, each of which receives 640 bits of data at a
time from one of the two DRAM banks connected to it. Sixteen bits of data
are shifted out onto the video data pins every video clock cycle at a 14-ns rate.
It takes 40 video clocks to shift all data out of a video buffer. These two video
buffers can be alternated to provide a seamless stream of video data.
Global Bus The 3-D RAM functional block diagram in Figure 4.18 shows
the Global Bus connecting the Pixel Buffer to the sense amplifiers of all four
DRAM banks. The Global Bus consists of 256 data lines and during a transfer
from the Pixel Buffer to DRAM, the 256 bits are conditionally written
depending on the 32-bit Dirty Tag and the 32-bit Plane Mask. When a data
block is transferred from the Pixel Buffer to the sense amplifiers, the Dirty Tag
and Plane Mask control which bits of the sense amplifiers are changed using
the Write Buffer. A read operation across the global bus always means a read
by the Pixel ALU; that is, the data are transferred from a DRAM bank into
the Pixel Buffer. Similarly, a write operation across the Global Bus means that
the data are updated from the Pixel Buffer to a DRAM bank. These operations
are accomplished by using Global Bus Read Block Enable and Global Bus
Write Block Enable signals.
The 3-D RAMs can be used to implement frame buffers of various resolutions and depths. In one example organization, the Dirty Tag for an entire Pixel Buffer block can be written in a single cycle from the data pins.
One pixel is shifted out of the Video Buffer every two video clocks. The Pixel
ALU and PALU_DQ pins access one pixel of a Pixel Buffer block. The Dirty
Tag for an entire Pixel Buffer block can be written in a single cycle from the
PALU_DQ pins. The Dirty Tag controls the four bytes of the 32-bit pixel independently. Figure 4.23 shows the block diagram of a 1280 x 1024 x 32 frame buffer consisting of four 3-D RAMs, a rendering controller, and a RAMDAC [22].
The rendering controller writes pixel data across the 128-bit bus to the four 3-D RAMs. The controller commands most of the 3-D RAM operations, including ALU functions, Pixel Buffer addressing, and DRAM operations. The controller can also command video display by setting up the RAMDAC and requesting video transfers from the 3-D RAMs. With the 128-bit pixel data bus, four pixels can be moved across the bus in one cycle.
Figure 4.23 3-D RAM block diagram for a 1280 x 1024 x 32 frame buffer organization. (From reference 22, with permission of Mitsubishi Corp.)
MEMORY SYSTEM DESIGN CONSIDERATIONS
The technology trend for PC main memory DRAMs over the past several years has been to improve the data transfer rates by using improved commodity DRAM architectures such as the EDO devices and then the SDRAMs, which have evolved from the 66-MHz version to PC100 and PC133 SDRAMs. In addition, further performance improvements have been proposed for DRAM architectures such as the DDR devices, Rambus DRAMs, and SLDRAMs. The Rambus architecture promises to deal with escalating microprocessor clock rates that require addressing of two key issues, as follows [23]: (1) latency, which is basically the time period that a microprocessor has to wait for the first piece of data after it is requested, and (2) data transfer rate. The Rambus
reduces some of these problems, in part through strict layout rules that specify
maximum path lengths, so that the signal is not distorted by a particular path.
Also, the Rambus architecture is packet-based, so that the stored data and
address information are sent to the microprocessor as a single packet that is
several bits long. While the Rambus provides the performance, there are
limitations to the maximum size of memory that can be used with Rambus
architecture. Therefore, a memory system design has to be based on cost/
performance considerations.
Some workstations and high-end servers are using DDR DRAMs, because these applications require larger memories, and other techniques to improve the memory performance, such as interleaving, are available. Because DDR DRAMs have adequate transfer rates and somewhat better latency than the Rambus architecture, many workstation and server designs can take advantage of the DDR's low cost. SLDRAMs also have the potential to find a niche market in this area, because they are high-speed parts that can be used as building blocks for large memory systems.
For the past few years, SGRAMs have been the most commonly used memory for graphics design applications, starting with the 8-Mbyte part and evolving to 16-Mbyte and higher densities. The graphics system designers have always preferred as wide a memory as possible, to minimize the size of the overall memory. In the earlier designs, a 1-Mbyte frame buffer size was considered more than enough. Nowadays, with the growth of 3-D applications, graphics designers have started using 2-, 4-, 8-, and even 16-Mbyte frame buffers. The economics of SDRAMs have been pushing graphics system designers away from the SGRAMs. Also, the advent of the PC100 SDRAM has created a set of specifications that are ideal for high-speed data transfer in graphics applications. The combination of the PC100 specification with 1-Mbyte x 16 SDRAM devices is finding wide acceptance in the graphics design industry.
In computing applications, SDRAM has been the mainstream memory and takes advantage of the fact that most PC memory accesses are sequential; it is designed to fetch all of the bits in a burst as fast as possible. In the SDRAM architecture, an on-chip burst counter allows the column part of the address to increment rapidly. The memory controller provides the location and size of the memory block required, while the SDRAM chip supplies the bits as fast as the CPU can take them, using a clock to synchronize the timing of the memory chip to the CPU's system clock [24]. This key feature of SDRAM provides an important advantage over asynchronous memory types, enabling data to be delivered off-chip at a burst rate of up to 100 MHz. Once a burst has started, all remaining bits of the burst length are delivered at a 10-ns rate.
The other three competing technologies have been the Rambus DRAM, DDR SDRAM, and SyncLink DRAM (SLDRAM), of which the Rambus architecture has become the choice for PCs because of Intel's support. The future of SLDRAMs is uncertain. Currently, with mainstream CPUs operating at 800 MHz and higher, it is clear that their external memory bandwidth cannot meet the increasing application demands. Direct RDRAM has been introduced to address those issues and is the result of a collaboration between Intel and Rambus to develop a new memory system. It is actually the third iteration of the original Rambus designs running at 600 MHz, which then increased to 700 MHz with the introduction of Concurrent RDRAM.
In the Direct Rambus designs, at current speeds, a single channel is capable of data transfers at 1.6 Gbytes/s and higher. Also, multiple channels can be used in parallel to achieve a throughput of up to 6.4 Gbytes/s. The new architecture will have operational capability for bus speeds of up to 133 MHz. The Rambus DRAM also has an edge in latency because, at the 800-MHz data rate, an interface to the device operates at an extremely fine timing granularity of 1.25 ns. The PC100 SDRAM interface runs with a coarse timing granularity of 10 ns. The 133-MHz SDRAM interface, with its coarse timing granularity of 7.5 ns, incurs a mismatch with the timing of the memory core.
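These figures are consistent with a 2-byte-wide channel running at an 800-MHz data rate, as the short check below shows; the channel width is an assumption of this illustration rather than a value quoted in the text:

```c
/* Worked check of the bandwidth figures quoted above. */
#include <stdio.h>

int main(void)
{
    double bytes_per_transfer = 2.0;       /* assumed 16-bit data channel */
    double transfers_per_s    = 800e6;     /* 800-MHz data rate (1.25-ns bit time) */
    double channel_gbs        = bytes_per_transfer * transfers_per_s / 1e9;

    printf("one channel  : %.1f Gbytes/s\n", channel_gbs);        /* 1.6 */
    printf("four channels: %.1f Gbytes/s\n", 4.0 * channel_gbs);  /* 6.4 */
    return 0;
}
```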
Rambus design appears to be the popular choice in PC DRAM architecture evolution. Intel has released its 820 chip set (code-named Camino), which has a 133-MHz system bus with direct interfacing to the Rambus DRAMs. Several other major PC manufacturers, such as IBM, Hewlett-Packard, Micron, and Dell Computers, are expected to release their business and/or consumer desktops with the Rambus DRAMs.
The DDR DRAM is the other memory technology competing to provide system builders with high-performance alternatives to Direct RDRAM. The DDR SDRAM, by providing the chip's output operations on both the rising and falling edges of the clock, effectively doubles the clock frequency. It has the most appeal to workstation and high-end server designers.
Chip sets and memory controllers that support 133-MHz (PC133) and faster memory buses already exist. However, a PC133 SDRAM may or may not outperform a PC100 SDRAM, depending on three critical parameters, as follows: CAS latency (CL), RAS-to-CAS delay time (tRCD), and RAS precharge time (tRP). These parameters are measured in terms of the number of clock cycles. For example, a device with CL = 2 cycles, tRCD = 2 cycles, and tRP = 2 cycles is commonly referred to as a 2-2-2 device. Table 4.5 shows a comparison of a PC100 CL2 device to a PC133 CL2 device [25]. The values shown in this table are taken from Toshiba's 128-Mb SDRAM data sheet.
Table 4.5 shows that, in comparison to the PC100 CL2 device, which is considered the current baseline for memory performance, the PC133 CL3 device is about 4% slower, while the PC133 CL2 device is 17% faster. The calculations shown are based solely on the three critical parameters listed above; actual system performance will depend on the application and other factors as well. It should be noted that two of the three critical parameters, tRP and tRCD, are specified as fixed values in nanoseconds and are not necessarily an integer number of cycles. If the memory controller only interprets these parameters as an integer number of clock cycles, then they must be rounded up to the next highest value. Therefore, in Table 4.5, the PC100 CL2 device is referred to as 2-2-2, the PC133 CL3 device as 3-3-3, and the PC133 CL2 device as 2-2-2.
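A minimal sketch of this comparison method is shown below; the tRCD and tRP values are placeholders for illustration, not the Toshiba data-sheet numbers behind Table 4.5:

```c
/* Hedged sketch: total access overhead taken as CL (in clock periods) plus
 * the fixed tRCD and tRP values, normalized to the PC100 CL2 baseline. */
#include <stdio.h>

static double total_ns(double clock_ns, unsigned cl, double trcd_ns, double trp_ns)
{
    return cl * clock_ns + trcd_ns + trp_ns;
}

int main(void)
{
    double trcd = 20.0, trp = 20.0;                    /* placeholder values, in ns */
    double pc100_cl2 = total_ns(10.0, 2, trcd, trp);   /* 60.0 ns baseline */
    double pc133_cl3 = total_ns(7.5,  3, trcd, trp);   /* 62.5 ns           */
    double pc133_cl2 = total_ns(7.5,  2, trcd, trp);   /* 55.0 ns           */

    printf("PC133 CL3 normalized: %.2f\n", pc100_cl2 / pc133_cl3);  /* ~0.96 */
    printf("PC133 CL2 normalized: %.2f\n", pc100_cl2 / pc133_cl2);  /* ~1.09 */
    return 0;
}
```

The normalized results clearly depend on the actual data-sheet tRCD/tRP values for each device, which is why the table and the text should be consulted for the quoted 4% and 17% figures.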
TABLE 4.5 A Comparison of a PC100 CL2 Device to a PC133 CL2 Device [25]
Memory Bus Speed | CAS Latency (CL) | RAS Precharge Time (tRP) | RAS-to-CAS Delay Time (tRCD) | CL + tRP + tRCD (Total Time) | Performance (Normalized)
TABLE 4.6 A Comparison of Peak Bandwidth for PC100, DDR, and RDRAM for Various Memory Bus Widths [25]
TABLE 4.7 Examples of Memory Granularity for a Peak Bus Width for a Variety of DRAM Types and System Implementations
DRAM Type | DRAM Density | DRAM Data Bus Width | System Bus Width | Granularity | Peak Bandwidth
REFERENCES
1. Ashok K. Sharma, Semiconductor Memories: Technology, Testing and Reliability,
IEEE Press, New York, 1997.
2. Brian Dipert, The slammin', jammin' DRAM scramble, EDN, January 20, 2000, pp.
68-82.
3. Dave Bursky, Advanced DRAM architectures overcome data bandwidth limits,
Electron. Des., November 17, 1997, pp. 73-88.
4. Bruce Miller et al., Two high-bandwidth memory bus structures, IEEE Des. Test Comput., January-March 1999, pp. 42-52.
17. Rich Warnke, Designing a multimedia subsystem with Rambus DRAMs, Multimedia Systems Design, March 1998.
18. Peter Gillingham et al., SLDRAM: High-performance, open standard memory, IEEE Micro, November/December 1997, pp. 29-39.
19. IEEE Standard P1596.7 (Draft 0.99): SyncLink Memory Interface Standard.
20. Peter Gillingham, SLDRAM architectural and functional overview, Technical
Paper on SLDRAM web site and Mosaid Technologies, Inc., web site.
21. SLDRAM 400 Mb/s/pin Data Sheet CORP400.P65, Rev. 7/9/98.
22. Mitsubishi 3D-RAM (M5M410092B) Data Sheets Preliminary Rev. 0.95.
23. Mark Ellsberry, Memory design consideration for accelerating data transfer rate,
Computer Design, November 1998, pp. 58-62.
24. Jeff Child, DRAM's ride to next generation looks rocky, Embedded Syst. Dev.,
December 1999, pp. 44-47.
25. Application Note, Toshiba web page: Choosing High-Performance DRAM for Tomorrow's Applications.