
Lecture 5: Memory Hierarchy and Cache

Cache: A safe place for hiding and storing things.


Webster's New World Dictionary (1976)

Traditional Four Questions for Memory Hierarchy Designers


Q1: Where can a block be placed in the upper level? (Block placement)
Fully Associative, Set Associative, Direct Mapped

Q2: How is a block found if it is in the upper level? (Block identification)
Tag / Block

Q3: Which block should be replaced on a miss? (Block replacement)
Random, LRU

Q4: What happens on a write? (Write strategy)
Write Back or Write Through (with Write Buffer)

Cache-Related Terms
ICACHE: instruction cache
DCACHE (L1): data cache closest to registers
SCACHE (L2): secondary data cache
TCACHE (L3): third-level data cache

Data from SCACHE has to go through DCACHE to reach the registers.
TCACHE is larger than SCACHE, and SCACHE is larger than DCACHE.
Not all processors have a TCACHE.

Unified versus Split Caches


This refers to having a single cache or separate caches for data and machine instructions. Split is obviously superior. It reduces thrashing, which we will come to shortly.

Unified vs Split Caches

Unified vs separate I & D:
[Diagram: processor with a unified Cache-1 backed by a unified Cache-2, versus processor with split I-Cache-1 and D-Cache-1 backed by a unified Cache-2.]

Example:
16KB split I & D: instruction miss rate = 0.64%, data miss rate = 6.47%
32KB unified: aggregate miss rate = 1.99%

Which is better (ignoring the L2 cache)?
Assume 33% data ops, so 75% of accesses come from instructions (1.0/1.33); hit time = 1, miss time = 50.
Note that a data hit has 1 extra stall for the unified cache (it has only one port).
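A worked version of this comparison (a sketch in C, using the miss rates and assumptions listed above and the standard average-memory-access-time formula AMAT = hit time + miss rate * miss penalty):

#include <stdio.h>

int main(void) {
    const double f_inst = 0.75, f_data = 0.25;   /* 75% instruction, 25% data accesses */
    const double hit = 1.0, miss_penalty = 50.0; /* cycles, from the assumptions above */

    /* Split caches: 16KB I + 16KB D, miss rates from the example */
    double amat_split = f_inst * (hit + 0.0064 * miss_penalty)
                      + f_data * (hit + 0.0647 * miss_penalty);

    /* Unified 32KB cache: one port, so a data access stalls one extra cycle */
    double amat_unified = f_inst * (hit + 0.0199 * miss_penalty)
                        + f_data * (hit + 1.0 + 0.0199 * miss_penalty);

    printf("AMAT split   = %.2f cycles\n", amat_split);   /* roughly 2.05 */
    printf("AMAT unified = %.2f cycles\n", amat_unified); /* roughly 2.25 */
    return 0;
}

Split comes out ahead here even though its aggregate miss rate is slightly higher, because the unified cache's single port penalizes every data hit.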

Simplest Cache: Direct Mapped

[Diagram: memory addresses 0 through F mapping into a 4-byte direct mapped cache with cache indexes 0 through 3.]

Cache location 0 can be occupied by data from memory location 0, 4, 8, ... etc.
In general: any memory location whose 2 LSBs of the address are 0s.
Address<1:0> => cache index

Which one should we place in the cache? How can we tell which one is in the cache?

Cache Basics

Cache hit: a memory access that is found in the cache -- cheap.
Cache miss: a memory access that is not in the cache -- expensive, because we need to get the data from elsewhere.
Cache line length: the number of bytes loaded together in one entry.
Direct mapped: only one line from a given address range can be in the cache.
Associative: 2 or more lines from the same range, with different addresses, can be in the cache.
Consider a tiny cache (for illustration only) holding lines X000, X001, X010, X011, X100, X101, X110, X111, where an address splits into tag | line | offset fields.

Cache Mapping Strategies

There are two common sets of methods in use for determining which cache lines are used to hold copies of memory lines.

Direct: cache address = memory address MODULO cache size.

Set associative: there are N cache banks, and a memory line is assigned to just one of the banks. There are three algorithmic choices for which line to replace:
Random: choose any line using an analog random number generator. This is cheap and simple to build.
LRU (least recently used): preserves temporal locality, but is expensive. According to (biased) studies it is not much better than random.
FIFO (first in, first out): random is far superior.
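A minimal sketch of the tag / line / offset split just described, for a hypothetical direct-mapped cache; the 64-byte line and 256-line cache size are illustrative assumptions, not values from the slides:

#include <stdio.h>

#define LINE_BYTES 64u     /* assumed cache line length (bytes) */
#define NUM_LINES  256u    /* assumed number of lines in the direct-mapped cache */

int main(void) {
    unsigned addr = 0x12345678u;

    unsigned offset = addr % LINE_BYTES;                /* byte within the cache line */
    unsigned index  = (addr / LINE_BYTES) % NUM_LINES;  /* which line: "memory address MODULO cache size" */
    unsigned tag    = addr / (LINE_BYTES * NUM_LINES);  /* stored with the line to identify it on lookup */

    printf("addr = 0x%08x : tag = 0x%x, line index = %u, offset = %u\n",
           addr, tag, index, offset);
    return 0;
}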

Direct-Mapped Cache
Direct mapped cache: a block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.

Fully Associative Cache


Fully associative cache: a block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.

[Diagrams: direct-mapped and fully associative caches mapping blocks from main memory.]

Set Associative Cache


Set associative cache: the middle range of designs between direct mapped and fully associative caches is called set-associative. In an n-way set-associative cache, a block from main memory can go into any of n (n > 1) locations in the cache.

Example: assume the cache has 8 blocks, while memory has 32 blocks. Where can memory block 12 go?

Fully associative: block 12 can go anywhere.
Direct mapped: block 12 can go only into cache block 4 (12 mod 8).
2-way set associative: block 12 can go anywhere in set 0 (12 mod 4).

[Diagram: cache blocks 0-7 shown for the fully associative, direct mapped, and 2-way set-associative organizations, above main memory blocks 0-31.]
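The same placement arithmetic as a small C sketch; the 8 cache blocks, 2-way associativity, and memory block 12 are the values from the example above:

#include <stdio.h>

int main(void) {
    const unsigned cache_blocks = 8, ways = 2;
    const unsigned sets = cache_blocks / ways;   /* 4 sets in the 2-way cache */
    unsigned block = 12;                         /* memory block number from the example */

    printf("direct mapped   : block %u -> cache block %u\n", block, block % cache_blocks); /* 12 mod 8 = 4 */
    printf("2-way set assoc : block %u -> set %u (either way within the set)\n", block, block % sets); /* 12 mod 4 = 0 */
    printf("fully assoc     : block %u -> any of the %u cache blocks\n", block, cache_blocks);
    return 0;
}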

Diagrams

Serial: CPU (registers + logic) -> Cache -> Main Memory

Parallel: CPU 1 ... CPU p, each with its own cache (Cache 1 ... Cache p), connected by a network to shared memory.

Tuning for Caches

1. Preserve locality.
2. Reduce cache thrashing.
3. Loop blocking when out of cache.
4. Software pipelining.

Registers

Registers are the source and destination of most CPU data operations. They hold one element each. They are made of static RAM (SRAM), which is very expensive. The access time is usually 1-1.5 CPU clock cycles. Registers are at the top of the memory subsystem.

The Principle of Locality

Programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

For the last 15 years, hardware has relied on locality for speed.

Principles of Locality

Temporal: an item referenced now will be referenced again soon.
Spatial: an item referenced now causes its neighbors to be referenced soon.
Lines, not words, are moved between memory levels. Both principles are satisfied. There is an optimal line size based on the properties of the data bus and the memory subsystem design. Cache lines are typically 32-128 bytes, with 1024 bytes being the longest currently.
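A small illustration of why spatial locality matters (a sketch; the array size and strides are arbitrary choices): summing an array with stride 1 touches each cache line only once, while a stride of a whole line touches a new line on nearly every access, even though the arithmetic work is identical.

#include <stdio.h>

#define N (1 << 22)          /* 4M doubles, larger than a typical cache */
static double a[N];

/* Sum every element exactly once, visiting them with the given stride. */
double strided_sum(int stride) {
    double s = 0.0;
    for (int start = 0; start < stride; start++)
        for (int i = start; i < N; i += stride)
            s += a[i];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;
    /* Same arithmetic work, very different memory behavior: */
    printf("%f\n", strided_sum(1));   /* consecutive accesses: one miss per cache line */
    printf("%f\n", strided_sum(8));   /* 8 doubles = 64 bytes: a new line on almost every access */
    return 0;
}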

Cache Thrashing

Thrashing occurs when frequently used cache lines replace each other. There are three primary causes of thrashing:
Instructions and data can conflict, particularly in unified caches.
Too many variables, or arrays too large to fit into cache, are accessed.
Indirect addressing, e.g., sparse matrices.

Machine architects can add sets to the associativity. Users can buy another vendor's machine. However, neither solution is realistic.

Cache Coherence for Multiprocessors

All data must be coherent between memory levels. Multiple processors with separate caches must inform the other processors quickly about data modifications (by the cache line). Only hardware is fast enough to do this. Standard protocols on multiprocessors:
Snoopy: all processors monitor the memory bus.
Directory based: cache lines maintain an extra 2 bits per processor to maintain clean/dirty status bits.

Indirect Addressing

      d = 0
      do i = 1, n
         j = ind(i)
         d = d + sqrt( x(j)*x(j) + y(j)*y(j) + z(j)*z(j) )
      end do

[Diagram: the arrays x, y, z stored separately in memory versus a single interleaved array r.]

Change the loop statement to

         d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )

Note that r(1,j)-r(3,j) are in contiguous memory and probably are in the same cache line (d is probably in a register and is irrelevant). The original form uses 3 cache lines at every iteration of the loop and can cause cache thrashing.

Cache Thrashing by Memory Allocation

      parameter ( m = 1024*1024 )
      real a(m), b(m)

For a 4 MB direct mapped cache, a(i) and b(i) are always mapped to the same cache line. This is trivially avoided using padding:

      real a(m), extra(32), b(m)

extra is at least 128 bytes in length, which is longer than a cache line on all but one memory subsystem available today.
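A C analogue of the padding trick above (a sketch: the 1M-element arrays and 32-element pad mirror the Fortran, but note that C gives no guarantee that separately declared statics are laid out back-to-back, so treat this purely as an illustration):

#define M (1024 * 1024)

/* Without padding, a[i] and b[i] can land on the same line of a direct-mapped
   cache whose size divides the distance between them, so the loop below
   keeps evicting the line it just loaded. Layout is compiler-dependent. */
static float a[M];
static float pad[32];   /* at least one cache line of separation between a and b */
static float b[M];

float dot(void) {
    float s = 0.0f;
    for (int i = 0; i < M; i++)
        s += a[i] * b[i];
    return s;
}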

Cache Blocking

We want blocks to fit into cache. On a parallel computer we have p x cache, so that data may fit into cache on p processors but not on one. This leads to superlinear speedup! Consider matrix-matrix multiply:

      do k = 1,n
         do j = 1,n
            do i = 1,n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do

An alternate, blocked form is:

      do kk = 1,n,nblk
         do jj = 1,n,nblk
            do ii = 1,n,nblk
               do k = kk,kk+nblk-1
                  do j = jj,jj+nblk-1
                     do i = ii,ii+nblk-1
                        c(i,j) = c(i,j) + a(i,k) * b(k,j)
                     end do
                  end do
               end do
            end do
         end do
      end do

[Diagram: the n x n matrices partitioned into NB x NB blocks.]

Summary: The Cache Design Space

Several interacting dimensions:
cache size
block size
associativity
replacement policy
write-through vs write-back
write allocation

The optimal choice is a compromise: it depends on access characteristics (workload, use: I-cache, D-cache, TLB) and on technology / cost. Simplicity often wins.

[Diagram: the design space sketched as good/bad regions against cache size, associativity, and block size, and a generic trade-off between Factor A and Factor B, from less to more.]

Lessons

The actual performance of a simple program can be a complicated function of the architecture. Slight changes in the architecture or the program can change the performance significantly. Since we want to write fast programs, we must take the architecture into account, even on uniprocessors. Since the actual performance is so complicated, we need simple models to help us design efficient algorithms. We will illustrate with a common technique for improving cache performance, called blocking.

Optimizing Matrix Addition for Caches

Dimension A(n,n), B(n,n), C(n,n); A, B, C stored by column (as in Fortran).

Algorithm 1:
for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)

Algorithm 2:
for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)

What is the memory access pattern for Algorithms 1 and 2? Which is faster? What if A, B, C are stored by row (as in C)?
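For row-major storage (as in C) the roles reverse: the i-outer / j-inner order walks each row contiguously. A minimal sketch of the two orders (the array dimension is an arbitrary choice):

#define N 1024
static double A[N][N], B[N][N], C[N][N];

/* Row-major C: this order accesses B[i][0], B[i][1], ... consecutively,
   so each cache line is used fully before moving on (good spatial locality). */
void add_rowwise(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = B[i][j] + C[i][j];
}

/* This order strides through memory by N doubles per access,
   touching a new cache line almost every time (poor spatial locality). */
void add_colwise(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            A[i][j] = B[i][j] + C[i][j];
}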
27

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j]; }

2 misses per access to a & c vs. one miss per access; improve spatial locality.

Optimizing Matrix Multiply for Caches

There are several techniques for making this faster on modern processors. Some optimizations are done automatically by the compiler, but you can do much better. In general, you should use optimized libraries (often supplied by the vendor) for this and other very common linear algebra operations. Other algorithms you may want will not be supplied by the vendor, so you need to know these techniques.
BLAS = Basic Linear Algebra Subroutines; heavily studied.

Warm up: Matrix-vector multiplication y = y + A*x

for i = 1:n
    for j = 1:n
        y(i) = y(i) + A(i,j)*x(j)

[Diagram: y(i) is updated from row A(i,:) and vector x(:).]

Warm up: Matrix-vector multiplication y = y + A*x

{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
    {read row i of A into fast memory}
    for j = 1:n
        y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

m = number of slow memory refs = 3*n + n^2
f = number of arithmetic operations = 2*n^2
q = f/m ~= 2
Matrix-vector multiplication is limited by slow memory speed.

Multiply C=C+A*B

for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Diagram: C(i,j) is updated from row A(i,:) and column B(:,j).]

Matrix Multiply C=C+A*B (unblocked, or untiled)

for i = 1 to n
    {read row i of A into fast memory}
    for j = 1 to n
        {read C(i,j) into fast memory}
        {read column j of B into fast memory}
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        {write C(i,j) back to slow memory}

q = ops / slow memory refs. Number of slow memory references on unblocked matrix multiply:
m = n^3       (read each column of B n times)
  + n^2       (read each row of A once for each i)
  + 2*n^2     (read and write each element of C once)
  = n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n: no improvement over matrix-vector multiply.

Matrix Multiply (blocked, or tiled)

Consider A, B, C to be N by N matrices of b by b subblocks, where b = n/N is called the blocksize.

for i = 1 to N
    for j = 1 to N
        {read block C(i,j) into fast memory}
        for k = 1 to N
            {read block A(i,k) into fast memory}
            {read block B(k,j) into fast memory}
            C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
        {write block C(i,j) back to slow memory}

[Diagram: block C(i,j) is updated from block row A(i,:) and block column B(:,j).]

Why is this algorithm correct?

q = ops / slow memory refs (n = size of matrix, b = blocksize, N = number of blocks).
Number of slow memory references on blocked matrix multiply:
m = N*n^2     (read each block of B N^3 times: N^3 * n/N * n/N)
  + N*n^2     (read each block of A N^3 times)
  + 2*n^2     (read and write each block of C once)
  = (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n.
So we can improve performance by increasing the blocksize b; this can be much faster than matrix-vector multiply (q = 2).

Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3).
Theorem (Hong & Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M)).
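A compact C sketch of the tiled loop nest above (row-major storage; the assumption that the block size b divides n is for brevity, and in practice b is chosen so that three b x b blocks fit in cache, i.e. 3*b^2 <= M):

/* C = C + A*B on n x n row-major matrices, tiled with block size b.
   A sketch, not a tuned kernel. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                /* multiply the b x b blocks: C(ii.., jj..) += A(ii.., kk..) * B(kk.., jj..) */
                for (int i = ii; i < ii + b; i++)
                    for (int k = kk; k < kk + b; k++) {
                        double aik = A[i*n + k];
                        for (int j = jj; j < jj + b; j++)
                            C[i*n + j] += aik * B[k*n + j];
                    }
}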

More on BLAS (Basic Linear Algebra Subroutines)

Industry standard interface (evolving). Vendors and others supply optimized implementations.
History:
BLAS1 (1970s): vector operations: dot product, saxpy (y = a*x + y), etc. m = 2*n, f = 2*n, q ~ 1 or less.
BLAS2 (mid 1980s): matrix-vector operations: matrix-vector multiply, etc. m = n^2, f = 2*n^2, q ~ 2; less overhead, somewhat faster than BLAS1.
BLAS3 (late 1980s): matrix-matrix operations: matrix-matrix multiply, etc. m >= 4*n^2, f = O(n^3), so q can possibly be as large as n; BLAS3 is potentially much faster than BLAS2.

BLAS for Performance

[Plot: Mflop/s vs. order of vectors/matrices (10 to 500) on an Alpha EV 5/6 at 500 MHz (1 Gflop/s peak); Level 3 BLAS is fastest, then Level 2, then Level 1.]

Good algorithms use BLAS3 when possible (LAPACK): www.netlib.org/blas, www.netlib.org/lapack

Development of blocked algorithms is important for performance:
BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors).

Optimizing in practice

Tiling for registers: loop unrolling, use of named register variables.
Tiling for multiple levels of cache.
Exploiting fine-grained parallelism within the processor: superscalar, pipelining.
Complicated compiler interactions; hard to do by hand (but you'll try).
Automatic optimization is an active research area:
PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac, www.cs.berkeley.edu/~iyer/asci_slides.ps
ATLAS: www.netlib.org/atlas/index.html

Strassen's Matrix Multiply

The traditional algorithm (with or without tiling) has O(n^3) flops. Strassen discovered an algorithm with asymptotically lower flop count: O(n^2.81).

Consider a 2x2 matrix multiply, normally 8 multiplies:

Let M = [m11 m12] = [a11 a12] * [b11 b12]
        [m21 m22]   [a21 a22]   [b21 b22]

Let p1 = (a11 + a22) * (b11 + b22)
    p2 = (a21 + a22) * b11
    p3 = a11 * (b12 - b22)
    p4 = a22 * (b21 - b11)
    p5 = (a11 + a12) * b22
    p6 = (a21 - a11) * (b11 + b12)
    p7 = (a12 - a22) * (b21 + b22)

Then m11 = p1 + p4 - p5 + p7
     m12 = p3 + p5
     m21 = p2 + p4
     m22 = p1 + p3 - p2 + p6

This uses only 7 multiplies, and extends to n x n matrices by divide and conquer.
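As a quick check of one of these identities, expanding the definitions above:

m12 = p3 + p5
    = a11*(b12 - b22) + (a11 + a12)*b22
    = a11*b12 - a11*b22 + a11*b22 + a12*b22
    = a11*b12 + a12*b22

which is exactly the (1,2) entry of the product; the other three entries can be verified the same way.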

Strassen (continued)

T(n) = cost of multiplying n x n matrices = 7*T(n/2) + 18*(n/2)^2 = O(n^(log_2 7)) = O(n^2.81)

Available in several libraries. Up to several times faster if n is large enough (hundreds). Needs more memory than the standard algorithm. Can be less accurate because of roundoff error. The current world record is O(n^2.376...).

Summary

Performance programming on uniprocessors requires:
understanding of memory system levels, costs, and sizes;
understanding of fine-grained parallelism in the processor, to produce a good instruction mix.

Blocking (tiling) is a basic approach that can be applied to many matrix algorithms. It applies to uniprocessors and parallel processors. The technique works for any architecture, but choosing the blocksize b and other details depends on the architecture. Similar techniques are possible on other data structures. You will get to try this in Assignment 2 (see the class homepage).

Summary: Memory Hierarchy

Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs? 1000X DRAM growth removed the controversy.
Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy.
Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean to compilers, data structures, algorithms?

Performance = Effective Use of Memory Hierarchy

We can only do arithmetic on data at the top of the hierarchy; higher-level BLAS lets us do this.

BLAS                Memory Refs   Flops    Flops / Memory Refs
Level 1 (y=y+x)     3n            2n       2/3
Level 2 (y=y+Ax)    n^2           2n^2     2
Level 3 (C=C+AB)    4n^2          2n^3     n/2

Level 1, 2 & 3 BLAS Intel PII 450MHz


350 300 250 200 150 100 50 0 10 100 200 300 400 500
Order of vector/Matrices

Level 1
y=y+x

Level 2
y=y+Ax

Level 3
C=C+AB

Development of blocked algorithms important for performance


44

Improving Ratio of Floating Point Operations to Memory Accesses

      subroutine mult(n1,nd1,n2,nd2,y,a,x)
      implicit real*8 (a-h,o-z)
      dimension a(nd1,nd2), y(nd2), x(nd1)
      do 10, i = 1, n1
         t = 0.d0
         do 20, j = 1, n2
   20       t = t + a(j,i)*x(j)
   10    y(i) = t
      return
      end

(inner loop: 2 FLOPS, 2 LOADS per iteration)

c     works correctly when n1, n2 are multiples of 4
      dimension a(nd1,nd2), y(nd2), x(nd1)
      do i = 1, n1-3, 4
         t1 = 0.d0
         t2 = 0.d0
         t3 = 0.d0
         t4 = 0.d0
         do j = 1, n2-3, 4
            t1 = t1 + a(j+0,i+0)*x(j+0) + a(j+1,i+0)*x(j+1)
     1              + a(j+2,i+0)*x(j+2) + a(j+3,i+0)*x(j+3)
            t2 = t2 + a(j+0,i+1)*x(j+0) + a(j+1,i+1)*x(j+1)
     1              + a(j+2,i+1)*x(j+2) + a(j+3,i+1)*x(j+3)
            t3 = t3 + a(j+0,i+2)*x(j+0) + a(j+1,i+2)*x(j+1)
     1              + a(j+2,i+2)*x(j+2) + a(j+3,i+2)*x(j+3)
            t4 = t4 + a(j+0,i+3)*x(j+0) + a(j+1,i+3)*x(j+1)
     1              + a(j+2,i+3)*x(j+2) + a(j+3,i+3)*x(j+3)
         enddo
         y(i+0) = t1
         y(i+1) = t2
         y(i+2) = t3
         y(i+3) = t4
      enddo

(inner loop body: 32 FLOPS, 20 LOADS -- 16 elements of a and 4 elements of x, reused across t1-t4)

Amdahl's Law

Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

tN = (fp/N + fs) * t1
S  = 1 / (fs + fp/N)

where:
fs = serial fraction of code
fp = parallel fraction of code = 1 - fs
N  = number of processors

Illustration of Amdahl's Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Plot: effect of multiple processors on run time and on speedup, for fp = 1.000, 0.999, 0.990, and 0.900, with speedup plotted against up to 250 processors.]
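A direct translation of the speedup formula into code (a sketch; the parallel fractions are the ones from the plot legend, the processor counts are arbitrary):

#include <stdio.h>

/* Amdahl's Law: S = 1 / (fs + fp/N), with fp = 1 - fs, exactly as above */
static double amdahl_speedup(double fp, int nprocs) {
    double fs = 1.0 - fp;
    return 1.0 / (fs + fp / nprocs);
}

int main(void) {
    const double fracs[] = {1.000, 0.999, 0.990, 0.900};  /* parallel fractions from the plot */
    const int procs[] = {10, 100, 250};                   /* arbitrary processor counts */

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            printf("fp = %.3f, N = %3d  ->  speedup = %6.1f\n",
                   fracs[i], procs[j], amdahl_speedup(fracs[i], procs[j]));
    return 0;
}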

Amdahl's Law vs. Reality

Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.

[Plot: speedup vs. number of processors (up to 250) for fp = 0.99, comparing the Amdahl's Law curve with measured reality.]

More on Amdahl's Law

Amdahl's Law can be generalized to any two processes with different speeds. Example: apply it to f_processor and f_memory. The growing processor-memory performance gap will undermine our efforts at achieving the maximum possible speedup!

Gustafson's Law

Thus, Amdahl's Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large. There is a way around this: increase the problem size as the number of processors grows.
Bigger problems mean bigger grids or more particles: bigger arrays.
The number of serial operations generally remains constant, while the number of parallel operations increases: the parallel fraction increases.
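The resulting scaled speedup, in the standard statement of Gustafson's Law (not spelled out above), with fs and fp measured on the scaled problem, is:

S_scaled = fs + fp*N = N - fs*(N - 1)

For example, with fs = 0.01 and N = 100 the scaled speedup is about 99, whereas the fixed-size Amdahl limit 1/(fs + fp/N) is only about 50.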

Parallel Performance Metrics: Speedup

[Plots: relative performance (speedup vs. processors, with ideal lines) and absolute performance (MFLOPS vs. processors) for a T3E and an O2K.]

Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups, but one of the machines is faster.

Fixed-Problem Size Scaling

a.k.a. fixed-load, fixed-problem-size, strong scaling, problem-constrained, constant-problem-size (CPS), variable subgrid.

Amdahl limit: SA(n) = T(1) / T(n) = 1 / (f/n + (1 - f))

This bounds the speedup based only on the fraction of the code that cannot use parallelism (1 - f); it ignores all other factors. SA --> 1 / (1 - f) as n --> infinity.

Fixed-Problem Size Scaling (Cont'd)

Efficiency(n) = T(1) / [T(n) * n]
Memory requirements decrease with n. The surface-to-volume ratio increases with n. Superlinear speedup is possible from cache effects.
Motivation: what is the largest number of processors I can use effectively, and what is the fastest time in which I can solve a given problem?
Problems:
- Sequential runs are often not possible (large problems).
- Speedup (and efficiency) is misleading if the processors are slow.

Fixed-Problem Size Scaling: Examples

S. Goedecker and Adolfy Hoisie, "Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems," International Conference on Computational Physics PC'97, Santa Cruz, August 25-28, 1997.

[Example plots from the above reference.]

Scaled Speedup Experiments

a.k.a. fixed subgrid-size, weak scaling, Gustafson scaling.
Motivation: we want to use a larger machine to solve a larger global problem in the same amount of time. Memory and surface-to-volume effects remain constant.

Be wary of benchmarks that scale problems to unreasonably large sizes:
- they scale the problem to fill the machine when a smaller size will do;
- they simplify the science in order to add computation -> "world's largest MD simulation - 10 gazillion particles!";
- they run grid sizes for only a few cycles, because the full run won't finish during this lifetime or because the resolution makes no sense compared with the resolution of the input data.
Suggested alternate approach (Gustafson): constant-time benchmarks - run the code for a fixed time and measure the work done.

Example of a Scaled Speedup Experiment

Processors  NChains  Time   Natoms   Time per Atom per PE  Time per Atom  Efficiency
1           32       38.4   2368     1.62E-02              1.62E-02       1.000
2           64       38.4   4736     8.11E-03              1.62E-02       1.000
4           128      38.5   9472     4.06E-03              1.63E-02       0.997
8           256      38.6   18944    2.04E-03              1.63E-02       0.995
16          512      38.7   37888    1.02E-03              1.63E-02       0.992
32          940      35.7   69560    5.13E-04              1.64E-02       0.987
64          1700     32.7   125800   2.60E-04              1.66E-02       0.975
128         2800     27.4   207200   1.32E-04              1.69E-02       0.958
256         4100     20.75  303400   6.84E-05              1.75E-02       0.926
512         5300     14.49  392200   3.69E-05              1.89E-02       0.857

[Plot: TBON on ASCI Red - efficiency vs. number of processors (0 to 600).]
