Lecture 5: Memory Hierarchy and Cache

Traditional Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the cache? (Block placement)
Q2: How is a block found if it is in the cache? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
    Random, LRU
Q4: What happens on a write? (Write strategy)
Cache-Related Terms
ICACHE: instruction cache
DCACHE (L1): data cache closest to registers
SCACHE (L2): secondary data cache
TCACHE (L3): third-level data cache
Data from SCACHE has to go through DCACHE to reach the registers. TCACHE is larger than SCACHE, and SCACHE is larger than DCACHE. Not all processors have a TCACHE.
Example:
Split 16 KB instruction + 16 KB data caches: instruction miss rate = 0.64%, data miss rate = 6.47%
32 KB unified cache: aggregate miss rate = 1.99%
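As a rough worked comparison (the 75% instruction-reference mix is an assumption for illustration, not a figure from the lecture): the split caches miss on about 0.75*0.64% + 0.25*6.47%, roughly 2.1% of references, so the unified cache's 1.99% is slightly lower, although the unified design adds contention between instruction and data accesses.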
Which one should we place in the cache? How can we tell which one is in the cache?
Cache Basics
Cache hit: a memory access that is found in the cache (cheap).
Cache miss: a memory access that is not in the cache (expensive, because we need to get the data from elsewhere).
Consider a tiny cache (for illustration only):
[Figure: a tiny example cache holding addresses of the form X000, X001, X011, X101, X111]
Set associative: there are N cache banks (ways); a memory block maps to one set but may be placed in any of the N banks, so on a miss we must choose which line in the set to replace. There are three common algorithmic choices for which line to replace:
Random: choose any line using a (pseudo-)random number generator. This is cheap and simple to build.
LRU (least recently used): preserves temporal locality, but is expensive to implement. It is not much better than random according to (biased) studies.
FIFO (first in, first out): random is far superior.
Cache line length: the number of bytes loaded together in one entry.
Direct mapped: only one line for a given address range can be in the cache at a time.
Associative: two or more lines with different addresses in the same range can be present.
Direct-Mapped Cache
Direct-mapped cache: a block from main memory can go in exactly one place in the cache. It is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.
[Figure: an 8-line direct-mapped cache (lines 0-7); each of main-memory blocks 0-31 maps to exactly one cache line]
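A minimal sketch of the mapping in code; the 32 KB size and 64-byte lines are assumptions chosen for illustration, not figures from the lecture:

   program map_address
      integer, parameter :: line_bytes = 64, nlines = 512   ! hypothetical 32 KB direct-mapped cache
      integer :: addr, offset, index, tag
      addr   = 123456                           ! example byte address
      offset = mod(addr, line_bytes)            ! byte within the cache line
      index  = mod(addr/line_bytes, nlines)     ! which cache line the block must go in (placement)
      tag    = addr/(line_bytes*nlines)         ! tag compared on lookup (identification)
      print *, 'offset =', offset, ' index =', index, ' tag =', tag
   end program map_address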
Diagrams

[Figure, serial: CPU (registers and logic) -> Cache -> Main Memory]
[Figure, parallel: CPU 1 through CPU p, each with its own cache (Cache 1 through Cache p)]
Registers
Registers are the source and destination of most CPU data operations. They hold one element each. They are made of static RAM (SRAM), which is very expensive. The access time is usually 1-1.5 CPU clock cycles. Registers are at the top of the memory subsystem.
Principles of Locality
Temporal: an item referenced now will be referenced again soon. Spatial: an item referenced now causes its neighbors to be referenced soon. Lines, not words, are moved between memory levels; this serves both principles. There is an optimal line size based on the properties of the data bus and the memory subsystem design. Cache lines are typically 32-128 bytes, with 1024 bytes being the longest currently.
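A small sketch of both principles in a loop; the array length and the 16-reals-per-line figure (a 64-byte line) are assumptions for illustration:

   program locality
      integer, parameter :: n = 1000000
      real :: a(n), s
      integer :: i
      a = 1.0
      s = 0.0
      do i = 1, n            ! unit stride: consecutive a(i) share a cache line (spatial locality)
         s = s + a(i)        ! s stays in a register throughout (temporal locality)
      end do
      do i = 1, n, 16        ! stride of one assumed line: nearly every access touches a new line
         s = s + a(i)
      end do
      print *, s
   end program locality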
Cache Thrashing
Thrashing occurs when frequently used cache lines replace each other. There are three primary causes for thrashing:
Instructions and data can conflict, particularly in unified caches.
Too many variables, or arrays too large to fit into cache, are accessed.
Indirect addressing, e.g., sparse matrices.
Machine architects can add sets to the associativity. Users can buy another vendor's machine. However, neither solution is realistic.
Indirect Addressing
d = 0
do i = 1, n
   j = ind(i)
   d = d + sqrt( x(j)*x(j) + y(j)*y(j) + z(j)*z(j) )
end do

Change the loop statement to

   d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )
[Figure: the separate coordinate arrays x, y, z vs. a single interleaved array r]
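A fuller sketch of the rewritten loop; the subroutine wrapper, declarations, and packing copy are added here only to make the fragment self-contained (in practice r(3,n) would simply be the storage format used throughout):

   subroutine dist_sum(n, ind, x, y, z, d)
      integer n, ind(n), i, j
      real x(n), y(n), z(n), r(3,n), d
      ! pack coordinates so all three values for point j share one cache line
      do j = 1, n
         r(1,j) = x(j)
         r(2,j) = y(j)
         r(3,j) = z(j)
      end do
      d = 0
      do i = 1, n
         j = ind(i)
         d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )
      end do
   end subroutine dist_sum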
real a(m), b(m): for a 4 MB direct-mapped cache (with m a multiple of the cache size), a(i) and b(i) always map to the same cache line. This is trivially avoided using padding.
The padding array extra is at least 128 bytes long, which is longer than a cache line on all but one memory subsystem available today.
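A minimal padding sketch; the name extra comes from the slide, while the value of m and the common block used to force a contiguous layout are assumptions for illustration:

   integer, parameter :: m = 1024*1024      ! assumed: m reals fill the 4 MB cache exactly
   real a(m), extra(32), b(m)               ! 32 reals = 128 bytes of padding
   common /arrays/ a, extra, b              ! contiguous layout: a(i) and b(i) now land in different lines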
Cache Blocking
We want blocks to fit into cache. On parallel computers we have p times the cache, so the data may fit into cache on p processors even when it does not fit on one. This can lead to superlinear speedup! Consider matrix-matrix multiply.
do kk = 1, n, nblk
   do jj = 1, n, nblk
      do ii = 1, n, nblk
         do k = kk, kk+nblk-1
            do j = jj, jj+nblk-1
               do i = ii, ii+nblk-1
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
               end do
            end do
         end do
      end do
   end do
end do
[Figure: the matrices partitioned into NB x NB blocks (dimensions labelled M, N, K)]
Lessons
[Figure: qualitative trade-off curves (Factor A vs. Factor B, from less to more) for cache size, associativity, and block size]
The actual performance of a simple program can be a complicated function of the architecture. Slight changes in the architecture or program change the performance significantly. Since we want to write fast programs, we must take the architecture into account, even on uniprocessors. Since the actual performance is so complicated, we need simple models to help us design efficient algorithms. We will illustrate with a common technique for improving cache performance, called blocking.
Algorithm 2:
for j = 1:n
   for i = 1:n
      A(i,j) = B(i,j) + C(i,j)
What is the memory access pattern for Algorithms 1 and 2? Which is faster? What if A, B, and C are stored by row (as in C)?
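Only Algorithm 2 survives in these notes; assuming Algorithm 1 is the same update with the two loops interchanged, the access patterns being compared are (written here as Fortran loops over column-major arrays):

   ! Algorithm 1 (assumed): row-wise sweep, stride-n through A, B, C
   do i = 1, n
      do j = 1, n
         A(i,j) = B(i,j) + C(i,j)
      end do
   end do

   ! Algorithm 2: column-wise sweep, stride-1, about one miss per cache line
   do j = 1, n
      do i = 1, n
         A(i,j) = B(i,j) + C(i,j)
      end do
   end do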
2 misses per access to a & c vs. one miss per access; improve spatial locality
[Figure: matrix-vector multiply, y(i) = y(i) + A(i,:) * x(:)]
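For reference, a sketch of the matrix-vector multiply the later counts compare against (reconstructed from the figure labels y(i) = y(i) + A(i,:)*x(:)):

   do i = 1, n
      do j = 1, n
         y(i) = y(i) + A(i,j) * x(j)
      end do
   end do
   ! about 2*n**2 flops against roughly n**2 + 3*n slow-memory references, so q is approximately 2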
Multiply C=C+A*B
for i = 1 to n
   for j = 1 to n
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
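A sketch of the same triple loop annotated with the slow-memory traffic counted below (the comments describe the usual fast-memory staging assumed in that count):

   do i = 1, n
      ! read row i of A into fast memory          (n**2 reads of A in total)
      do j = 1, n
         ! read C(i,j) into fast memory           (n**2 reads of C in total)
         ! read column j of B into fast memory    (n**3 reads of B in total)
         do k = 1, n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
         end do
         ! write C(i,j) back to slow memory       (n**2 writes of C in total)
      end do
   end do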
Number of slow memory references for unblocked matrix multiply:
m = n^3        read each column of B n times
  + n^2        read each row of A once (one row per i)
  + 2*n^2      read and write each element of C once
  = n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n: no improvement over matrix-vector multiply.
Number of slow memory references for blocked matrix multiply (with N = n/b blocks per dimension):
m = N*n^2      read each block of B N^3 times (N^3 * n/N * n/N)
  + N*n^2      read each block of A N^3 times
  + 2*n^2      read and write each block of C once
  = (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n.
So we can improve performance by increasing the block size b.
This can be much faster than matrix-vector multiply (q = 2).
Limit: all three b x b blocks from A, B, and C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3). Theorem (Hong & Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M)).
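For example, with a (purely illustrative) fast memory of M = 4096 words, b <= sqrt(4096/3), which is about 37, so blocks of roughly 36 x 36 are the largest that keep all three blocks resident at once.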
[Figure: Mflop/s (up to 700) vs. order of the vectors/matrices (10 to 400), comparing Level 3 BLAS (n-by-n matrix-matrix multiply), Level 2 BLAS (n-by-n matrix-vector multiply), and Level 1 BLAS (saxpy of n vectors)]
Optimizing in practice
Tiling for registers: loop unrolling, use of named "register" variables
Tiling for multiple levels of cache
Exploiting fine-grained parallelism within the processor: superscalar execution, pipelining
Complicated compiler interactions
Hard to do by hand (but you'll try)
Automatic optimization is an active research area:
PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac, www.cs.berkeley.edu/~iyer/asci_slides.ps
ATLAS: www.netlib.org/atlas/index.html
Let
p1 = (a11 + a22) * (b11 + b22)
p2 = (a21 + a22) * b11
p3 = a11 * (b12 - b22)
p4 = a22 * (b21 - b11)
p5 = (a11 + a12) * b22
p6 = (a21 - a11) * (b11 + b12)
p7 = (a12 - a22) * (b21 + b22)
Then
m11 = p1 + p4 - p5 + p7
m12 = p3 + p5
m21 = p2 + p4
m22 = p1 + p3 - p2 + p6
Strassen (continued)
T(n) = cost of multiplying n x n matrices = 7*T(n/2) + 18*(n/2)^2. Unrolling the recursion gives about 7^(log_2 n) = n^(log_2 7) multiplications, so T(n) = O(n^(log_2 7)) = O(n^2.81).
Summary
Performance programming on uniprocessors requires
understanding of the memory system: levels, costs, sizes
understanding of fine-grained parallelism in the processor, to produce a good instruction mix
Strassen in practice: available in several libraries; up to several times faster if n is large enough (hundreds); needs more memory than the standard algorithm; can be less accurate because of roundoff error. The current world-record exponent is O(n^2.376...).
Blocking (tiling) is a basic approach that can be applied to many matrix algorithms. It applies to uniprocessors and parallel processors.
The technique works for any architecture, but choosing the blocksize b and other details depends on the architecture
Similar techniques are possible on other data structures. You will get to try this in Assignment 2 (see the class homepage).
Today, VM (virtual memory) allows many processes to share a single memory without having to swap all processes to disk; today, VM protection is more important than its memory-hierarchy role. Today, CPU time is a function of (operations, cache misses) rather than just f(operations). What does this mean for compilers, data structures, and algorithms?
We can only do arithmetic on data at the top of the hierarchy. The higher-level BLAS let us keep it there, doing more flops per memory reference.
BLAS                       Memory refs   Flops    Flops / memory ref
Level 1:  y = y + a*x      3n            2n       2/3
Level 2:  y = y + A*x      n^2           2n^2     2
Level 3:  C = C + A*B      4n^2          2n^3     n/2
[Figure: a kernel achieving 32 flops per 20 loads]
Amdahl's Law

Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

S = 1 / (fs + fp/P) = P / (fs*P + fp)

where fp is the fraction of the work that can be done in parallel, fs = 1 - fp is the serial fraction, and P is the number of processors.
[Figure: Amdahl's Law speedup vs. number of processors (up to about 250) for fp = 0.99]
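For example, with fp = 0.99 as in the figure, the speedup can never exceed 1/(1 - 0.99) = 100; with P = 250 processors it is 1/(0.01 + 0.99/250), which is about 72.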
The growing processor-memory performance gap will undermine our efforts at achieving maximum possible speedup!
Gustafson's Law

Thus, Amdahl's Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large. There is a way around this: increase the problem size as the number of processors grows.
Bigger problems mean bigger grids or more particles: bigger arrays.
The number of serial operations generally remains constant, while the number of parallel operations increases: the parallel fraction increases.
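The corresponding scaled-speedup expression (stated here for reference, since the formula itself did not survive in these notes) is S = fs + fp*P. For example, with fp = 0.99 on P = 48 processors, S = 0.01 + 0.99*48, about 47.5, i.e. nearly linear in the number of processors.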
Absolute performance:
[Figures: speedup vs. number of processors (with the ideal line shown) and absolute performance in MFLOPS vs. number of processors and problem size, for two machines, one labelled O2K]
Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups but one of the machines is faster.
Time per Atom   Efficiency
1.62E-02        1.000
1.62E-02        1.000
1.63E-02        0.997
1.63E-02        0.995
1.63E-02        0.992
1.64E-02        0.987
1.66E-02        0.975
1.69E-02        0.958
1.75E-02        0.926
1.89E-02        0.857