
15-740/18-740 Computer Architecture
Lecture 25: Main Memory

Prof. Onur Mutlu
Yoongu Kim
Carnegie Mellon University
Today
n  SRAM vs. DRAM
n  Interleaving/Banking
n  DRAM Microarchitecture
q  Memory controller
q  Memory buses
q  Banks, ranks, channels, DIMMs
q  Address mapping: software vs. hardware
q  DRAM refresh
n  Memory scheduling policies
n  Memory power/energy management
n  Multi-core issues
q  Fairness, interference
q  Large DRAM capacity
2
Readings
n  Recommended:
q  Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling:
Enabling High-Performance and Fair Memory Controllers,”
IEEE Micro Top Picks 2009.
q  Mutlu and Moscibroda, “Stall-Time Fair Memory Access
Scheduling for Chip Multiprocessors,” MICRO 2007.
q  Zhang et al., “A Permutation-based Page Interleaving Scheme
to Reduce Row-buffer Conflicts and Exploit Data Locality,”
MICRO 2000.
q  Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008.
q  Rixner et al., “Memory Access Scheduling,” ISCA 2000.

3
Main Memory in the System
[Figure: chip floorplan showing CORE 0-3, each with a private L2 cache (L2 CACHE 0-3), a SHARED L3 CACHE, the DRAM MEMORY CONTROLLER, the DRAM INTERFACE, and the DRAM BANKS.]
4
Memory Bank Organization
n  Read access sequence:

1. Decode row address


& drive word-lines

2. Selected bits drive


bit-lines
• Entire row read

3. Amplify row data

4. Decode column
address & select subset
of row
• Send to output

5. Precharge bit-lines
• For next access

5
SRAM (Static Random Access Memory)
•  Read sequence:
   1. Address decode
   2. Drive row select
   3. Selected bit-cells drive bitlines (entire row is read together)
   4. Differential sensing and column select (data is ready)
   5. Precharge all bitlines (for next read or write)

[Figure: a 2^n-row x 2^m-column bit-cell array; the n+m-bit address splits into n row bits driving the row select lines and m column bits driving the column mux; 2^m differential bitline pairs feed the sense amps and mux, which output 1 bit (n ≈ m to minimize overall latency).]

•  Access latency dominated by steps 2 and 3
•  Cycling time dominated by steps 2, 3 and 5
   –  Step 2 proportional to 2^m
   –  Steps 3 and 5 proportional to 2^n
6
DRAM (Dynamic Random Access Memory)
•  Bits stored as charge on node capacitance (non-restorative)
   –  Bit cell loses charge when read
   –  Bit cell loses charge over time
•  Read sequence:
   1-3. Same as SRAM
   4. A “flip-flopping” sense amp amplifies and regenerates the bitline; the data bit is mux’ed out
   5. Precharge all bitlines

[Figure: a 2^n-row x 2^m-column bit-cell array; RAS strobes the n-bit row address and CAS the m-bit column address; 2^m sense amps and a mux drive 1 data bit out (n ≈ m to minimize overall latency). A DRAM die comprises multiple such arrays.]

•  Refresh: a DRAM controller must periodically read all rows within the allowed refresh time (10s of ms) so that charge is restored in the cells
7
SRAM vs. DRAM
n  SRAM is preferable for register files and L1/L2 caches
q  Fast access
q  No refreshes
q  Simpler manufacturing (compatible with logic process)
q  Lower density (6 transistors per cell)
q  Higher cost

n  DRAM is preferable for stand-alone memory chips


q  Much higher capacity
q  Higher density
q  Lower cost

8
Memory subsystem organization
•  Memory subsystem organization
   –  Channel
   –  DIMM
   –  Rank
   –  Chip
   –  Bank
   –  Row/Column
Memory subsystem
[Figure: a processor connected over two memory channels, each “channel” linking it to a DIMM (dual in-line memory module).]
Breaking down a DIMM
DIMM (Dual in-line memory module)
[Figure: side view of a DIMM, showing the front and back of the module.]

Breaking down a DIMM
DIMM (Dual in-line memory module)
[Figure: the same DIMM; the front holds Rank 0 (a collection of 8 chips) and the back holds Rank 1.]
Rank
[Figure: Rank 0 (front of the DIMM) and Rank 1 (back of the DIMM) each present a 64-bit data interface <0:63>; the memory channel carries Addr/Cmd, chip selects CS <0:1>, and Data <0:63>.]

DIMM & Rank (from JEDEC)
[Figure: DIMM and rank terminology as defined by JEDEC.]
Breaking down a Rank
[Figure: Rank 0 consists of Chip 0, Chip 1, ..., Chip 7; Chip 0 supplies data bits <0:7>, Chip 1 bits <8:15>, ..., Chip 7 bits <56:63>, together forming Data <0:63>.]
Breaking down a Chip
[Figure: Chip 0 contains multiple banks (Bank 0, ...), all sharing the chip's 8-bit data interface <0:7>.]
Breaking down a Bank
[Figure: Bank 0 is an array of rows (row 0 ... row 16k-1), each 2 kB wide; a column is 1 B. The accessed row is latched into the row-buffer, from which 1 B columns are read out over the chip's <0:7> data pins.]
Memory subsystem organization
•  Memory subsystem organization
   –  Channel
   –  DIMM
   –  Rank
   –  Chip
   –  Bank
   –  Row/Column
Example: Transferring a cache block
[Figure sequence: a 64 B cache block in the physical memory space (address 0x40) is mapped step by step onto the memory subsystem.]
•  The block is mapped to Channel 0, DIMM 0, Rank 0.
•  Within Rank 0, the block occupies Row 0 of each of the 8 chips (Chip 0 ... Chip 7); Chip 0 drives data bits <0:7>, Chip 1 bits <8:15>, ..., Chip 7 bits <56:63>.
•  Each column access (Col 0, then Col 1, ...) reads 1 B from every chip, placing 8 B of the block on Data <0:63>.

A 64 B cache block takes 8 I/O cycles to transfer.
During the process, 8 columns are read sequentially.
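The striping in the example above can be made concrete with a small sketch. Assuming, as in the slides, a rank of eight x8 chips and 1 B per chip per column access, the following Python (illustrative names only) shows which chip supplies which byte on each I/O cycle:

```python
# Sketch: striping a 64 B cache block across a rank of 8 x8 DRAM chips.
# Assumptions (from the slides): 8 chips per rank, each chip contributes
# 1 byte per column access, so one column access yields 8 bytes (one beat).

CHIPS_PER_RANK = 8                   # chip i drives data bits <8i : 8i+7>
BYTES_PER_BEAT = CHIPS_PER_RANK      # 8 bytes transferred per I/O cycle
CACHE_BLOCK = 64                     # bytes

def transfer_block(block: bytes):
    """Yield (beat, chip, byte) tuples showing which chip supplies which byte."""
    assert len(block) == CACHE_BLOCK
    beats = CACHE_BLOCK // BYTES_PER_BEAT        # 8 I/O cycles
    for beat in range(beats):                    # one column read per beat
        for chip in range(CHIPS_PER_RANK):
            yield beat, chip, block[beat * BYTES_PER_BEAT + chip]

if __name__ == "__main__":
    block = bytes(range(64))
    for beat, chip, b in transfer_block(block):
        if chip == 0:
            print(f"I/O cycle {beat}: column {beat} read from all 8 chips")
```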
Page Mode DRAM
n  A DRAM bank is a 2D array of cells: rows x columns
n  A “DRAM row” is also called a “DRAM page”
n  “Sense amplifiers” also called “row buffer”

n  Each address is a <row,column> pair


n  Access to a “closed row”
q  Activate command opens row (placed into row buffer)
q  Read/write command reads/writes column in the row buffer
q  Precharge command closes the row and prepares the bank for
next access
n  Access to an “open row”
q  No need for activate command

26
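A minimal sketch of the open/closed-row cases above, showing which commands a controller would issue for each access; the Bank class and command tuples are illustrative, not a specific controller's interface:

```python
# Sketch of per-bank row-buffer state and the commands needed for an access.
# Command names mirror the slide (ACTIVATE, READ/WRITE, PRECHARGE); timing is ignored.

class Bank:
    def __init__(self):
        self.open_row = None          # None means the bank is precharged (closed)

    def access(self, row, col, is_write=False):
        cmds = []
        if self.open_row is None:                 # closed row: activate only
            cmds.append(("ACTIVATE", row))
        elif self.open_row != row:                # row-buffer conflict
            cmds.append(("PRECHARGE",))
            cmds.append(("ACTIVATE", row))
        # else: row-buffer hit, no row command needed
        self.open_row = row
        cmds.append(("WRITE" if is_write else "READ", col))
        return cmds

bank = Bank()
print(bank.access(0, 0))    # [('ACTIVATE', 0), ('READ', 0)]                  closed row
print(bank.access(0, 85))   # [('READ', 85)]                                  row hit
print(bank.access(1, 0))    # [('PRECHARGE',), ('ACTIVATE', 1), ('READ', 0)]  conflict
```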
DRAM Bank Operation
Access Address:
(Row 0, Column 0)
(Row 0, Column 1)
(Row 0, Column 85)
(Row 1, Column 0)

[Figure: a bank as a 2D array of rows and columns, with a row decoder selecting the row, a row buffer holding the open row, and a column mux selecting the data. (Row 0, Column 0) finds an empty row buffer, so Row 0 must first be activated into it; (Row 0, Column 1) and (Row 0, Column 85) are row-buffer HITs; (Row 1, Column 0) is a row-buffer CONFLICT that requires closing Row 0 and activating Row 1.]
27
Latency Components: Basic DRAM Operation
n  CPU → controller transfer time
n  Controller latency
q  Queuing & scheduling delay at the controller
q  Access converted to basic commands
n  Controller → DRAM transfer time
n  DRAM bank latency
q  Simple CAS is row is “open” OR
q  RAS + CAS if array precharged OR
q  PRE + RAS + CAS (worst case)
n  DRAM → CPU transfer time (through controller)

28
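These components can be added up in a back-of-the-envelope model. The cycle counts below are placeholders, not datasheet values; only the structure (which terms appear in which case) follows the slide:

```python
# Sketch: total load-to-use latency as a sum of the components on this slide.
# All cycle counts are illustrative placeholders.

T_CPU_TO_CTRL  = 5     # CPU -> controller transfer
T_CTRL         = 10    # queuing/scheduling + command generation
T_CTRL_TO_DRAM = 5     # controller -> DRAM transfer
T_CAS = 15             # column access (read)
T_RAS = 15             # row activation
T_PRE = 15             # precharge
T_DRAM_TO_CPU = 10     # data return through the controller

def dram_bank_latency(state):
    if state == "row_open":        # row-buffer hit: CAS only
        return T_CAS
    if state == "precharged":      # closed row: RAS + CAS
        return T_RAS + T_CAS
    return T_PRE + T_RAS + T_CAS   # conflict (worst case): PRE + RAS + CAS

def total_latency(state):
    return (T_CPU_TO_CTRL + T_CTRL + T_CTRL_TO_DRAM
            + dram_bank_latency(state) + T_DRAM_TO_CPU)

for s in ("row_open", "precharged", "conflict"):
    print(s, total_latency(s))
```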
A DRAM Chip and DIMM
n  Chip: Consists of multiple banks (2-16 in Synchronous DRAM)
n  Banks share command/address/data buses
n  The chip itself has a narrow interface (4-16 bits per read)

n  Multiple chips are put together to form a wide interface


q  Called a module
q  DIMM: Dual Inline Memory Module
q  All chips in one side of a DIMM are operated the same way (rank)
n  Respond to a single command
n  Share address and command buses, but provide different data

n  If we have chips with 8-bit interface, to read 8 bytes in a


single access, use 8 chips in a DIMM
29
128M x 8-bit DRAM Chip

30
A 64-bit Wide DIMM

[Figure: eight DRAM chips side by side on a DIMM, sharing the Command bus and together driving the 64-bit Data bus.]

31
A 64-bit Wide DIMM
n  Advantages:
q  Acts like a high-
capacity DRAM chip
with a wide
interface
q  Flexibility: memory
controller does not
need to deal with
individual chips

n  Disadvantages:
q  Granularity:
Accesses cannot be
smaller than the
interface width

32
Multiple DIMMs
n  Advantages:
q  Enables even
higher capacity

n  Disadvantages:
q  Interconnect
complexity and
energy
consumption
can be high

33
DRAM Channels

n  2 Independent Channels: 2 Memory Controllers (Above)


n  2 Dependent/Lockstep Channels: 1 Memory Controller with
wide interface (Not Shown above)
34
Generalized Memory Structure

35
Multiple Banks (Interleaving) and Channels
n  Multiple banks
q  Enable concurrent DRAM accesses
q  Bits in address determine which bank an address resides in
n  Multiple independent channels serve the same purpose
q  But they are even better because they have separate data buses
q  Increased bus bandwidth

n  Enabling more concurrency requires reducing


q  Bank conflicts
q  Channel conflicts
n  How to select/randomize bank/channel indices in address?
q  Lower order bits have more entropy
q  Randomizing hash functions (XOR of different address bits)
36
How Multiple Banks/Channels Help

37
Multiple Channels
n  Advantages
q  Increased bandwidth
q  Multiple concurrent accesses (if independent channels)

n  Disadvantages
q  Higher cost than a single channel
n  More board wires
n  More pins (if on-chip memory controller)

38
Address Mapping (Single Channel)
n  Single-channel system with 8-byte memory bus
q  2GB memory, 8 banks, 16K rows & 2K columns per bank
n  Row interleaving
q  Consecutive rows of memory in consecutive banks
Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits)

n  Cache block interleaving


n  Consecutive cache block addresses in consecutive banks
n  64 byte cache blocks

Row (14 bits) High Column Bank (3 bits) Low Col. Byte in bus (3 bits)
8 bits 3 bits

n  Accesses to consecutive cache blocks can be serviced in parallel


n  How about random accesses? Strided accesses?
39
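Here is a sketch of the two decompositions above in Python, using the slide's parameters (2 GB, 8 banks, 16K rows, 2K columns, 8-byte bus, 64 B blocks); the helper names are illustrative:

```python
# Sketch: splitting a 31-bit physical address (2 GB) into DRAM coordinates
# using the two interleavings on this slide.

def bits(value, hi, lo):
    """Extract bits hi..lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def row_interleaved(pa):
    # | Row (14) | Bank (3) | Column (11) | Byte in bus (3) |
    return {"row": bits(pa, 30, 17), "bank": bits(pa, 16, 14),
            "col": bits(pa, 13, 3), "byte": bits(pa, 2, 0)}

def block_interleaved(pa):
    # | Row (14) | High Col (8) | Bank (3) | Low Col (3) | Byte in bus (3) |
    return {"row": bits(pa, 30, 17), "bank": bits(pa, 8, 6),
            "col": (bits(pa, 16, 9) << 3) | bits(pa, 5, 3),
            "byte": bits(pa, 2, 0)}

# Consecutive 64 B blocks land in consecutive banks under block interleaving:
for blk in range(4):
    print(block_interleaved(blk * 64)["bank"])   # 0, 1, 2, 3
```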
Bank Mapping Randomization
n  DRAM controller can randomize the address mapping to
banks so that bank conflicts are less likely

3 bits Column (11 bits) Byte in bus (3 bits)

XOR

Bank index
(3 bits)

40
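A sketch of the XOR scheme in the figure, in the spirit of Zhang et al. (MICRO 2000): the 3-bit bank field is XORed with 3 other address bits, here the low-order row bits, which is an assumption for illustration:

```python
# Sketch: XOR-based bank index randomization. The bank index is XORed with
# 3 low-order row bits so that addresses that would pile into the same bank
# get spread out. Which row bits feed the XOR is an assumption.

def bank_index_plain(pa):
    return (pa >> 14) & 0b111          # Bank (3 bits) in the row-interleaved layout

def bank_index_randomized(pa):
    row_low3 = (pa >> 17) & 0b111      # 3 lowest row bits (assumed)
    return bank_index_plain(pa) ^ row_low3

# Two addresses in different rows of the same plain bank now map apart:
a = (0 << 17) | (2 << 14)              # row 0, bank 2
b = (1 << 17) | (2 << 14)              # row 1, bank 2
print(bank_index_plain(a), bank_index_plain(b))            # 2 2  -> conflict
print(bank_index_randomized(a), bank_index_randomized(b))  # 2 3  -> no conflict
```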
Address Mapping (Multiple Channels)
   C | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
   Row (14 bits) | C | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
   Row (14 bits) | Bank (3 bits) | C | Column (11 bits) | Byte in bus (3 bits)
   Row (14 bits) | Bank (3 bits) | Column (11 bits) | C | Byte in bus (3 bits)

•  Where are consecutive cache blocks?

   C | Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
   Row (14 bits) | C | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
   Row (14 bits) | High Column (8 bits) | C | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
   Row (14 bits) | High Column (8 bits) | Bank (3 bits) | C | Low Col. (3 bits) | Byte in bus (3 bits)
   Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | C | Byte in bus (3 bits)
41
Interaction with Virtual→Physical Mapping
•  The operating system influences where an address maps to in DRAM

   Virtual Page number (52 bits) | Page offset (12 bits)                        VA
   Physical Frame number (19 bits) | Page offset (12 bits)                      PA
   Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)      PA

•  The operating system can control which bank a virtual page is mapped to. It can randomize Page→<Bank, Channel> mappings

•  An application cannot know/determine which bank it is accessing
42
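As an illustration of the second point, the sketch below shows one way an OS allocator could randomize Page→Bank mappings by picking free frames from a randomly chosen bank (a page-coloring-style policy; the allocator and its parameters are hypothetical):

```python
# Sketch: OS-level page->bank randomization via frame selection (page coloring).
# Illustrative policy only. With 4 KB pages and bank bits at PA[16:14],
# a frame's bank is determined by frame-number bits [4:2].
import random

PAGE_SHIFT = 12
BANK_SHIFT = 14 - PAGE_SHIFT       # bank bits sit at frame-number bits [4:2]
NUM_BANKS = 8

def frame_bank(frame_number):
    return (frame_number >> BANK_SHIFT) & (NUM_BANKS - 1)

def allocate_frame(free_frames, rng=random):
    """Pick a free frame from a randomly chosen bank, if one is available."""
    target_bank = rng.randrange(NUM_BANKS)
    for f in free_frames:
        if frame_bank(f) == target_bank:
            free_frames.remove(f)
            return f
    return free_frames.pop()       # fall back to any free frame

free = list(range(1024))
print([frame_bank(allocate_frame(free)) for _ in range(8)])
```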
DRAM Refresh (I)
n  DRAM capacitor charge leaks over time
n  The memory controller needs to read each row periodically
to restore the charge
q  Activate + precharge each row every N ms
q  Typical N = 64 ms
n  Implications on performance?
-- DRAM bank unavailable while refreshed
-- Long pause times: If we refresh all rows in burst, every 64ms
the DRAM will be unavailable until refresh ends
n  Burst refresh: All rows refreshed immediately after one
another
n  Distributed refresh: Each row refreshed at a different time,
at regular intervals
43
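A rough estimate of the overhead: with 16K rows per bank (as in the earlier organization slides) and an assumed ~100 ns to activate + precharge one row, refresh keeps a bank busy only a few percent of the time, but burst refresh concentrates that into one long pause. Both numbers below are illustrative assumptions:

```python
# Sketch: refresh overhead, assuming 16K rows and ~100 ns to
# activate + precharge one row (both illustrative numbers).

REFRESH_INTERVAL_MS = 64
ROWS = 16 * 1024
T_ROW_REFRESH_NS = 100

total_refresh_ns = ROWS * T_ROW_REFRESH_NS                 # ~1.6 ms of refresh work
interval_ns = REFRESH_INTERVAL_MS * 1_000_000
print(f"Bank unavailable {total_refresh_ns / interval_ns:.1%} of the time")  # ~2.6%

# Burst refresh: one pause of total_refresh_ns every 64 ms.
print(f"Burst refresh pause: {total_refresh_ns / 1e6:.2f} ms")
# Distributed refresh: one short pause per row, spread over the interval.
print(f"Distributed: one row every {interval_ns / ROWS / 1000:.1f} us, "
      f"{T_ROW_REFRESH_NS} ns pause each")
```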
DRAM Refresh (II)

n  Distributed refresh eliminates long pause times


n  How else we can reduce the effect of refresh on
performance?
q  Can we reduce the number of refreshes?

44
DRAM Controller
n  Purpose and functions
q  Ensure correct operation of DRAM (refresh)

q  Service DRAM requests while obeying timing constraints of


DRAM chips
n  Constraints: resource conflicts (bank, bus, channel), minimum
write-to-read delays
n  Translate requests to DRAM command sequences

q  Buffer and schedule requests to improve performance


n  Reordering and row-buffer management

q  Manage power consumption and thermals in DRAM


n  Turn on/off DRAM chips, manage power modes
45
DRAM Controller Issues
n  Where to place?

q  In chipset
+ More flexibility to plug different DRAM types into the system
+ Less power density in the CPU chip

q  On CPU chip


+ Reduced latency for main memory access
+ Higher bandwidth between cores and controller
q  More information can be communicated (e.g. request’s importance in
the processing core)

46
DRAM Controller (II)

47
A Modern DRAM Controller

48
DRAM Scheduling Policies (I)
n  FCFS (first come first served)
q  Oldest request first

n  FR-FCFS (first ready, first come first served)


1. Row-hit first
2. Oldest first
Goal: Maximize row buffer hit rate à maximize DRAM throughput

q  Actually, scheduling is done at the command level


n  Column commands (read/write) prioritized over row commands
(activate/precharge)
n  Within each group, older commands prioritized over younger ones

49
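A sketch of the FR-FCFS choice as a priority function: row-buffer hits sort ahead of misses, and age breaks ties. The Request fields and open_rows map are illustrative; as noted above, real controllers apply this at the command level:

```python
# Sketch: FR-FCFS request selection -- row-buffer hits first, oldest first.
from dataclasses import dataclass

@dataclass
class Request:
    arrival: int       # smaller = older
    bank: int
    row: int
    col: int

def fr_fcfs_pick(queue, open_rows):
    """Pick the next request: prefer row-buffer hits, break ties by age."""
    def priority(req):
        row_hit = open_rows.get(req.bank) == req.row
        return (not row_hit, req.arrival)     # False (hit) sorts before True (miss)
    return min(queue, key=priority)

queue = [Request(0, bank=0, row=1, col=0),    # oldest, but row 1 is not open
         Request(1, bank=0, row=0, col=8),    # row hit
         Request(2, bank=0, row=0, col=16)]   # row hit, but younger
print(fr_fcfs_pick(queue, open_rows={0: 0}))  # picks arrival=1: the oldest row hit

# Plain FCFS would instead pick min(queue, key=lambda r: r.arrival).
```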
DRAM Scheduling Policies (II)
n  A scheduling policy is essentially a prioritization order

n  Prioritization can be based on


q  Request age
q  Row buffer hit/miss status
q  Request type (prefetch, read, write)
q  Requestor type (load miss or store miss)
q  Request criticality
n  Oldest miss in the core?
n  How many instructions in core are dependent on it?

50
