
Lecture 10: Memory System -- Memory Technology

CSE 564 Computer Architecture Summer 2017

Department of Computer Science and Engineering


Yonghong Yan
yan@oakland.edu
www.secs.oakland.edu/~yan

1
Topics for Memory Systems
•  Memory Technology and Metrics
–  SRAM, DRAM, Flash/SSD, 3-D Stack Memory, Phase-change memory
–  Latency and Bandwidth, Error Correction
–  Memory wall
•  Cache
–  Cache basics
–  Cache performance and optimization
–  Advanced optimization
–  Multiple-level cache, shared and private cache, prefetching
•  Virtual Memory
–  Protection, Virtualization, and Relocation
–  Page/segment, protection
–  Address Translation and TLB
2
Topics for Memory Systems
•  Parallelism (to be discussed in TLP)
–  Memory Consistency model
•  Instructions for fence, etc.
–  Cache coherence
–  NUMA and first touch
–  Transactional memory (Not covered)

•  Implementation (Not covered)
–  Software/Hardware interface
–  Cache/memory controller
–  Bus systems and interconnect

3
Acknowledgement
•  Based on slides prepared by: Professor David A. Patterson,
Computer Science 252, Fall 1996, and edited and presented
by Prof. Kurt Keutzer for 2000 from UCB
•  Some slides are adapted from the textbook slides for
Computer Organization and Design, Fifth Edition: The
Hardware/Software Interface

4
The Big Picture: Where are We Now?
[Block diagram: Processor (Control + Datapath), Memory, Input, Output]

•  Memory system
–  Supplying data on time for computation
–  The term memory includes circuits for storing data
•  Cache (SRAM)
•  Scratchpad (SRAM)
•  RAM (DRAM)
•  etc.
5
Technology Trends
         Capacity        Speed (latency)
Logic:   2x in 3 years   2x in 3 years
DRAM:    4x in 3 years   2x in 10 years
Disk:    4x in 3 years   2x in 10 years

DRAM generations:
Year    Size     Cycle Time
1980    64 Kb    250 ns
1983    256 Kb   220 ns
1986    1 Mb     190 ns
1989    4 Mb     165 ns
1992    16 Mb    145 ns
1995    64 Mb    120 ns
(1980 to 1995: 1000:1 in capacity, 2:1 in cycle time)
6
Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency) → Memory Wall

[Figure: performance (log scale) vs. time, 1980-2000. CPU ("Moore's
Law"): 60%/yr, 2X every 1.5 years. DRAM: 9%/yr, 2X every 10 years.
The processor-memory performance gap grows 50% per year.]

7
The Situation: Microprocessor
•  Rely on caches to bridge gap
•  Microprocessor-DRAM performance gap
–  time of a full cache miss in instructions executed:
1st Alpha (7000):   340 ns / 5.0 ns =  68 clks x 2, or 136 instructions
2nd Alpha (8400):   266 ns / 3.3 ns =  80 clks x 4, or 320 instructions
3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions
–  1/2X latency x 3X clock rate x 3X Instr/clock ⇒ ~5X
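A small sketch (not from the slides; the constants are the slide's
numbers) that reproduces this arithmetic: a full miss costs
miss_ns / cycle_ns clocks, and the machine could have issued
issue_width instructions in each of those clocks.

#include <stdio.h>

/* Reproduce the slide's miss-cost arithmetic; rounding differs
   slightly from the slide's hand-rounded figures. */
static void miss_cost(const char *name, double miss_ns,
                      double cycle_ns, int issue_width)
{
    double clks = miss_ns / cycle_ns;
    printf("%s: %.0f clks x %d = %.0f instructions\n",
           name, clks, issue_width, clks * issue_width);
}

int main(void)
{
    miss_cost("Alpha 7000", 340.0, 5.0, 2);  /*  68 clks -> 136 */
    miss_cost("Alpha 8400", 266.0, 3.3, 4);  /* ~81 clks -> ~322 (slide: 80/320) */
    miss_cost("next Alpha", 180.0, 1.7, 6);  /* ~106 clks -> ~635 (slide: 108/648) */
    return 0;
}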

8
The Goal: illusion of large, fast, cheap memory

•  Fact: Large memories are slow, fast memories are small


•  How do we create a memory that is large, cheap and fast
(most of the time)?
–  Hierarchy
–  Parallelism

9
An Expanded View of the Memory System

[Figure: Processor (Control + Datapath) connected to a chain of
memories, from fastest/smallest/most expensive (closest to the
processor) to slowest/biggest/cheapest.]

Speed: Fastest → Slowest
Size:  Smallest → Biggest
Cost:  Highest → Lowest

10
Why hierarchy works
•  The Principle of Locality:
–  Programs access a relatively small portion of the address space
at any instant of time.

[Figure: probability of reference vs. address space (0 to 2^n - 1),
showing reference probability concentrated in small regions.]

11
Typical Memory Reference Patterns

[Figure: address vs. time over n loop iterations. Instruction fetches
sweep sequentially and repeat each iteration; stack accesses cluster
around subroutine call/return; data accesses mix argument accesses
and scalar accesses.]

12
Locality
•  Principle of Locality:
–  Programs tend to reuse data and instructions near those
they have used recently, or that were recently referenced
themselves
–  Spatial locality: Items with nearby addresses tend to be
referenced close together in time
–  Temporal locality: Recently referenced items are likely to be
referenced in the near future

Locality Example:
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

•  Data
–  Reference array elements in succession (stride-1 reference
pattern): Spatial locality
–  Reference sum each iteration: Temporal locality
•  Instructions
–  Reference instructions in sequence: Spatial locality
–  Cycle through loop repeatedly: Temporal locality
13
Locality Example
•  Claim: Being able to look at code and get a qualitative sense
of its locality is a key skill for a professional programmer

•  Question: Does this function have good locality?

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
14
Locality Example
•  Question: Does this function have good locality?

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

15
Locality Example
•  Question: Can you permute the loops so that the function
scans the 3-d array a[] with a stride-1 reference pattern
(and thus has good spatial locality)?

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}
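One possible answer (a sketch, not from the slides; it assumes M and
N are compile-time constants, as in the earlier examples): reorder
the loops so that the innermost loop walks the last subscript of
a[k][i][j].

int sumarray3d_stride1(int a[M][N][N])
{
    int i, j, k, sum = 0;

    for (k = 0; k < M; k++)          /* outermost: first subscript  */
        for (i = 0; i < N; i++)      /* middle: second subscript    */
            for (j = 0; j < N; j++)  /* innermost: last subscript   */
                sum += a[k][i][j];   /* stride-1 through memory     */
    return sum;
}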

16
Memory Hierarchy of a Computer System
•  By taking advantage of the principle of locality:
–  Present the user with as much memory as is available in the
cheapest technology.
–  Provide access at the speed offered by the fastest technology.

[Figure: Processor (Control + Datapath) with on-chip registers and
cache (SRAM), then 2nd/3rd-level cache (SRAM), main memory (DRAM),
secondary storage (disk), and tertiary storage (tape).]

Level:         Registers  Cache  Main Memory  Disk            Tape
Speed (ns):    1s         10s    100s         10,000,000s     10,000,000,000s
                                              (10s ms)        (10s sec)
Size (bytes):  100s       Ks     Ms           Gs              Ts
17
Memory Hierarchy in Real Systems

18
Intel i7 (Nehalem)
§  Private L1 and L2
–  L2 is 256KB each, 10-cycle latency
§  8MB shared L3, ~40-cycle latency

19
Area

20
How is the hierarchy managed?
•  Registers <-> Memory
–  by the compiler (programmer?)
•  Cache <-> memory
–  by the hardware
•  Memory <-> disks
–  by the hardware and operating system (virtual memory)
–  by the programmer (files)

•  Virtual memory
–  A virtual layer between the application address space and
physical memory
–  Not part of the physical memory hierarchy
21
Technology vs. Architecture
•  Technology determines the raw speed
–  Latency
–  Bandwidth
–  It is materials science

•  Architecture puts the technologies together with processors
–  so that the raw physical speed can be achieved in real systems

22
Memory Technology
•  Static RAM (SRAM)
–  0.5ns – 2.5ns, $2000 – $5000 per GB
•  Dynamic RAM (DRAM)
–  50ns – 70ns, $20 – $75 per GB
•  3-D stack memory
•  Solid state disk
•  Magnetic disk
–  5ms – 20ms, $0.20 – $2 per GB

Ideal memory:
•  Access time of SRAM
•  Capacity and cost/GB of disk

23
Memory Technology
•  Random Access:
–  Random is good: access time is the same for all locations
–  DRAM: Dynamic Random Access Memory
•  High density, low power, cheap, slow
•  Dynamic: needs to be refreshed regularly
–  SRAM: Static Random Access Memory
•  Low density, high power, expensive, fast
•  Static: content will last forever (until power is lost)
•  Not-so-random Access Technology:
–  Access time varies from location to location and from time to time
–  Examples: Disk, CDROM
•  Sequential Access Technology: access time linear in location
(e.g., Tape)
24
Main Memory Background
•  Performance of Main Memory:
–  Latency: Cache Miss Penalty
•  Access Time: time between request and word arrival
•  Cycle Time: time between requests
–  Bandwidth: I/O & Large Block Miss Penalty (L2)
•  Main Memory is DRAM: Dynamic Random Access Memory
–  Needs to be refreshed periodically (8 ms)
–  Addresses divided into 2 halves (memory as a 2D matrix):
•  RAS or Row Access Strobe
•  CAS or Column Access Strobe
•  Cache uses SRAM: Static Random Access Memory
–  No refresh (6 transistors/bit vs. 1 transistor)

Size: DRAM/SRAM ~ 4-8x; Cost/Cycle time: SRAM/DRAM ~ 8-16x

25
Random Access Memory (RAM) Technology

•  Why do we need to know about RAM technology?
–  Processor performance is usually limited by memory bandwidth
–  As IC densities increase, lots of memory will fit on the processor
chip
•  Tailor on-chip memory to specific needs
-  Instruction cache
-  Data cache
-  Write buffer
•  What makes RAM different from a bunch of flip-flops?
–  Density: RAM is much denser

26
Static RAM Cell
[Figure: 6-Transistor SRAM cell; a cross-coupled inverter pair
storing 0/1, connected through two access transistors to the bit and
bit-bar lines, gated by the word (row select) line. The PMOS pullups
can be replaced with resistive pullups to save area.]

•  Write:
1. Drive bit lines (bit = 1, bit-bar = 0)
2. Select row
•  Read:
1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on column detects difference between bit and bit-bar
27
Problems with SRAM
•  Six transistors use up a lot of area
•  Consider a Zero stored in the cell:
–  Transistor N1 will try to pull bit to 0
–  Transistor P2 will try to pull bit-bar to 1
•  But bit lines are precharged to high: are P1 and P2
necessary?

[Figure: the cell with Select = 1; the cross-coupled pair holds the
stored value, with bit = 1 on one side and bit-bar = 0 on the other,
and the transistors P1/N1 and P2/N2 on or off accordingly.]
28
1-Transistor Memory Cell (DRAM)
[Figure: one access transistor gated by the row select line,
connecting a storage capacitor to the bit line.]

•  Write:
–  1. Drive bit line
–  2. Select row
•  Read:
–  1. Precharge bit line to Vdd
–  2. Select row
–  3. Cell and bit line share charge
•  Very small voltage changes on the bit line
–  4. Sense (fancy sense amp)
•  Can detect changes of ~1 million electrons
–  5. Write: restore the value
•  Refresh
–  1. Just do a dummy read to every cell
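A sketch of what refresh amounts to (assumptions mine: the row count
and the two hardware stubs, activate_row and precharge, are
hypothetical). A dummy read of every row senses and rewrites each
cell; real DRAMs do this in hardware within the refresh interval
(e.g., every 8 ms).

#define ROWS 2048                     /* hypothetical row count */

static void activate_row(int row) { (void)row; /* would assert RAS */ }
static void precharge(void)       { /* would close the open row */ }

/* Refresh = dummy-read every row; sensing restores the charge. */
void refresh_all_rows(void)
{
    for (int row = 0; row < ROWS; row++) {
        activate_row(row);
        precharge();
    }
}

int main(void) { refresh_all_rows(); return 0; }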
29
Classical DRAM Organization (square)
[Figure: a square RAM cell array; each intersection of a bit (data)
line and a word (row) select line is a 1-T DRAM cell. The row address
feeds a row decoder on one side; the column address feeds the column
selector & I/O circuits, which connect to the data pins.]

•  Row and column address together select 1 bit at a time
30
DRAM Logical Organization (4 Mbit)
[Figure: an 11-bit address A0…A10 feeds the row decoder and, via the
column decoder, the sense amps & I/O, which drive the D and Q pins.
The memory array is 2,048 x 2,048 storage cells; a word line selects
one row.]

•  Square root of bits per RAS/CAS
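A sketch of the address split (the 11 + 11 bit division follows the
2,048 x 2,048 array above; the address value itself is arbitrary):
the row half is latched on RAS, the column half on CAS.

#include <stdio.h>

int main(void)
{
    unsigned addr = 0x2ABCDE;             /* any 22-bit address          */
    unsigned row  = (addr >> 11) & 0x7FF; /* upper 11 bits, latched on RAS */
    unsigned col  = addr & 0x7FF;         /* lower 11 bits, latched on CAS */

    printf("row = %u, col = %u\n", row, col);
    return 0;
}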
31
Main Memory Performance
[Figure: timeline showing that the access time is the first part of a
longer cycle time between successive accesses.]

•  DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time
–  roughly 2:1
•  DRAM (Read/Write) Cycle Time:
–  How frequently can you initiate an access?
–  Analogy: a little kid can only ask his father for money on Saturday
•  DRAM (Read/Write) Access Time:
–  How quickly will you get what you want once you initiate an access?
–  Analogy: as soon as he asks, his father will give him the money
•  DRAM Bandwidth Limitation analogy:
–  What happens if he runs out of money on Wednesday?
32
Increasing Bandwidth - Interleaving
[Figure: Without interleaving, the CPU must wait for each access to
finish before starting the next (start access for D1, D1 available,
start access for D2, ...). With 4-way interleaving, the CPU starts
accesses to memory banks 0, 1, 2, 3 in successive cycles, and can
access bank 0 again by the time its first access completes.]
33
Main Memory Performance
•  Timing model
–  1 cycle to send address,
–  4 cycles for access time, 10-cycle cycle time, 1 cycle to send data
–  Cache block is 4 words
•  Simple M.P.      = 4 x (1 + 10 + 1) = 48
•  Wide M.P.        = 1 + 10 + 1      = 12
•  Interleaved M.P. = 1 + 10 + 1 + 3  = 15

Word addresses across 4 interleaved banks:
        Bank 0  Bank 1  Bank 2  Bank 3
         0       1       2       3
         4       5       6       7
         8       9      10      11
        12      13      14      15
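A small sketch (not from the slides) that reproduces the three miss
penalties for this timing model:

#include <stdio.h>

int main(void)
{
    int addr = 1, cycle = 10, xfer = 1, words = 4;  /* slide's timing model */

    int simple      = words * (addr + cycle + xfer);     /* 4 x 12 = 48 */
    int wide        = addr + cycle + xfer;               /* one block-wide access = 12 */
    int interleaved = addr + cycle + xfer + (words - 1); /* banks overlap: 15 */

    printf("simple = %d, wide = %d, interleaved = %d\n",
           simple, wide, interleaved);
    return 0;
}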

34
Independent Memory Banks
•  How many banks?
–  number of banks ≥ number of clocks to access a word in a bank
(see the sketch at the end of this slide)
–  For sequential accesses; otherwise the stream returns to the
original bank before it has the next word ready
•  Increasing DRAM capacity => fewer chips => harder to have banks
–  Growth of bits/chip DRAM: 50%-60%/yr
–  Nathan Myhrvold, Microsoft: mature software growth
(33%/yr for NT) ~ growth in MB/$ of DRAM (25%-30%/yr)
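A trivial sketch of the bank-count rule above, reusing the 10-cycle
access time from the previous slide (an assumption carried over, not
stated here):

#include <stdio.h>

int main(void)
{
    int access_cycles = 10;            /* clocks to access a word in one bank */
    int min_banks     = access_cycles; /* banks >= clocks per bank access */

    printf("need >= %d banks to stream one word per clock\n", min_banks);
    return 0;
}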

35
DRAM History
•  DRAMs: capacity +60%/yr, cost -30%/yr
–  2.5X cells/area, 1.5X die size in 3 years
•  A '97 DRAM fab line costs $1B to $2B
–  DRAM only: density, leakage vs. speed
•  Rely on increasing no. of computers & memory per computer
(60% market)
–  SIMM or DIMM is the replaceable unit
=> computers use any generation of DRAM
•  Commodity, second-source industry
=> high volume, low profit, conservative
–  Little organization innovation in 20 years:
page mode, EDO, Synch DRAM
•  Order of importance: 1) Cost/bit 1a) Capacity
–  RAMBUS: 10X BW, +30% cost => little impact
36
Advanced DRAM Organization
•  Bits in a DRAM are organized as a rectangular array
–  DRAM accesses an entire row
–  Burst mode: supply successive words from a row with reduced
latency
•  Double data rate (DDR) DRAM
–  Transfer on rising and falling clock edges
•  Quad data rate (QDR) DRAM
–  Separate DDR inputs and outputs

37
3-D Stack Memory
•  High Bandwidth Memory
•  Hybrid Memory Cube

38
Flash Storage
•  Nonvolatile semiconductor storage
–  100× – 1000× faster than disk
–  Smaller, lower power, more robust
–  But more $/GB (between disk and DRAM)

39
Flash Types
•  NOR flash: bit cell like a NOR gate
–  Random read/write access
–  Used for instruction memory in embedded systems
•  NAND flash: bit cell like a NAND gate
–  Denser (bits/area), but block-at-a-time access
–  Cheaper per GB
–  Used for USB keys, media storage, …
•  Flash bits wear out after 1000's of accesses
–  Not suitable for direct RAM or disk replacement
–  Wear leveling: remap data to less-used blocks (see the sketch below)
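A toy illustration (mine, not the slides'): a wear-leveling layer
keeps a per-block erase counter and steers each write to the
least-worn physical block, so a hot logical block's writes spread
across the whole device.

#include <stdio.h>

#define BLOCKS 8

static unsigned erase_count[BLOCKS]; /* erases seen by each physical block */
static int map[BLOCKS];              /* logical block -> physical block    */

static int least_worn(void)
{
    int best = 0;
    for (int p = 1; p < BLOCKS; p++)
        if (erase_count[p] < erase_count[best])
            best = p;
    return best;
}

static void write_block(int logical)
{
    int p = least_worn();  /* steer the write to the least-used block */
    erase_count[p]++;      /* each write costs one erase cycle        */
    map[logical] = p;
}

int main(void)
{
    for (int i = 0; i < 100; i++)
        write_block(0);              /* hot logical block 0 ...       */
    for (int p = 0; p < BLOCKS; p++) /* ... wear spreads evenly       */
        printf("block %d erased %u times\n", p, erase_count[p]);
    return 0;
}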

40
Disk Storage
•  Nonvolatile, rotating magnetic storage

41
Disk Sectors and Access
•  Each sector records
–  Sector ID
–  Data (512 bytes, 4096 bytes proposed)
–  Error correcting code (ECC)
•  Used to hide defects and recording errors
–  Synchronization fields and gaps
•  Access to a sector involves
–  Queuing delay if other accesses are pending
–  Seek: move the heads
–  Rotational latency
–  Data transfer
–  Controller overhead

42
Disk Access Example
•  Given
–  512B sector, 15,000 rpm, 4ms average seek time, 100MB/s
transfer rate, 0.2ms controller overhead, idle disk
•  Average read time
–  4ms seek time
+ ½ / (15,000/60) = 2ms rotational latency
+ 512 / 100MB/s = 0.005ms transfer time
+ 0.2ms controller delay
= 6.2ms
•  If the actual average seek time is 1ms
–  Average read time = 3.2ms
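A sketch (not from the slides) that reproduces this arithmetic; it
prints 6.205 ms, which the slide rounds to 6.2 ms.

#include <stdio.h>

int main(void)
{
    double seek_ms       = 4.0;                      /* average seek            */
    double rpm           = 15000.0;
    double rot_ms        = 0.5 * 60000.0 / rpm;      /* half a rotation = 2 ms  */
    double transfer_ms   = 512.0 / 100e6 * 1000.0;   /* 512 B at 100 MB/s       */
    double controller_ms = 0.2;

    printf("average read = %.3f ms\n",
           seek_ms + rot_ms + transfer_ms + controller_ms);
    return 0;
}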

43
Summary:
•  Two Different Types of Locality:
–  Temporal Locality (Locality in Time): If an item is referenced, it will
tend to be referenced again soon.
–  Spatial Locality (Locality in Space): If an item is referenced, items
whose addresses are close by tend to be referenced soon.
•  By taking advantage of the principle of locality:
–  Present the user with as much memory as is available in the
cheapest technology.
–  Provide access at the speed offered by the fastest technology.
•  DRAM is slow but cheap and dense:
–  Good choice for presenting the user with a BIG memory system
•  SRAM is fast but expensive and not very dense:
–  Good choice for providing the user FAST access time.
44
