Evaluation in Uniprocessors
Difficult Enough for Uniprocessors
Workloads need to be renewed and reconsidered
Input data sets affect key interactions
• Changes from SPEC92 to SPEC95
Accurate simulators costly to develop and verify
Simulation is time-consuming
But the effort pays off: Good evaluation leads to good design
[Figure: two panels of speedup (0–15) vs. number of processors (1–32)]
Measuring Performance
Absolute performance
• Most important to end user
Performance improvement due to parallelism
• Speedup(p) = Performance(p) / Performance(1), always
Both should be measured
Performance = Work / Time, always
Work is determined by input configuration of the problem
If work is fixed, can measure performance as 1/Time
• Or retain an explicit work measure (e.g. transactions/sec, bonds/sec)
• Still w.r.t. a particular configuration, and still what’s measured is time
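As a small concrete illustration (a sketch with made-up numbers, not from the original slides), the following C fragment computes both metrics from a measured work count and times:

```c
/* Minimal sketch: computing performance and speedup from measured times.
   The work count and times below are hypothetical placeholders. */
#include <stdio.h>

int main(void) {
    double work   = 1.0e9;   /* e.g. transactions or grid-point updates for this input */
    double time_1 = 250.0;   /* seconds on 1 processor, same problem configuration     */
    double time_p = 11.5;    /* seconds on p processors, same problem configuration    */

    double perf_1  = work / time_1;      /* Performance = Work / Time            */
    double perf_p  = work / time_p;
    double speedup = perf_p / perf_1;    /* = time_1 / time_p when work is fixed */

    printf("Performance(1) = %.3g units/s\n", perf_1);
    printf("Performance(p) = %.3g units/s\n", perf_p);
    printf("Speedup(p)     = %.2f\n", speedup);
    return 0;
}
```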
Too Large a Problem
Demonstrating Scaling Problems
Small Ocean and big equation solver problems on SGI Origin2000
[Figure: speedup vs. number of processors (1–32) on the SGI Origin2000 — left: Ocean with a 258 x 258 grid; right: grid solver with a 12K x 12K grid; ideal speedup shown in both panels]
Questions in Scaling
Under What Constraints to Scale?
Two types of constraints:
• User-oriented, e.g. particles, rows, transactions, I/Os per processor
• Resource-oriented, e.g. memory, time
Which is more appropriate depends on application domain
• User-oriented easier for user to think about and change
• Resource-oriented more general, and often more real
Resource-oriented scaling models:
• Problem constrained (PC)
• Memory constrained (MC)
• Time constrained (TC)
(TPC: transactions, users, terminals scale with “computing power”)
Growth under MC and TC may be hard to predict
Problem Constrained Scaling
SpeedupPC(p) = Time(1) / Time(p)
Time Constrained Scaling
Memory Constrained Scaling
Impact of Scaling Models: Grid Solver
MC scaling:
• Grid size = n√p-by-n√p
• Iterations to converge = n√p
• Work = O((n√p)^3)
• Ideal parallel execution time = O((n√p)^3 / p) = O(n^3 √p)
• Execution time grows as √p
• 1 hr on uniprocessor means 32 hr on 1024 processors
TC scaling:
• If scaled grid size is k-by-k, then k^3/p = n^3, so k = n p^(1/3)
• Memory needed per processor = k^2/p = n^2 / p^(1/3)
• Diminishes as cube root of number of processors
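A minimal C sketch of these relationships (the baseline grid size and processor count are hypothetical; work is modeled as n^3 and memory as n^2 grid points, following the solver model above):

```c
/* Minimal sketch: grid-solver scaling under PC, MC, and TC models.
   Work for an n x n grid is modeled as n^3 (n^2 points, n iterations);
   memory is modeled as n^2 points. Baseline values are hypothetical. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 1024.0;                 /* baseline grid dimension on 1 processor */
    double p = 64.0;                   /* number of processors                   */
    double base_time = pow(n, 3);      /* uniprocessor time for the base problem */

    /* Problem constrained (PC): same n, ideal time = n^3 / p */
    double pc_time = pow(n, 3) / p;

    /* Memory constrained (MC): total memory grows with p, so n_MC = n * sqrt(p);
       ideal time = (n*sqrt(p))^3 / p = n^3 * sqrt(p), i.e. grows as sqrt(p).     */
    double mc_n    = n * sqrt(p);
    double mc_time = pow(mc_n, 3) / p;

    /* Time constrained (TC): keep time fixed, k^3/p = n^3, so k = n * p^(1/3);
       memory per processor = k^2/p = n^2 / p^(1/3), shrinking as cbrt(p).       */
    double tc_n         = n * cbrt(p);
    double tc_mem_per_p = tc_n * tc_n / p;

    printf("PC: n = %.0f, ideal time ~ %.3g\n", n, pc_time);
    printf("MC: n = %.0f, ideal time ~ %.3g (%.1fx the base run)\n",
           mc_n, mc_time, mc_time / base_time);
    printf("TC: n = %.0f, memory/proc ~ %.3g points (vs %.3g under PC)\n",
           tc_n, tc_mem_per_p, n * n / p);
    return 0;
}
```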
Impact on Solver Execution Characteristics
Scaling Workload Parameters: Barnes-Hut
Scaling rule:
All components of simulation error should scale at same rate
Effects of Scaling Rule
If number of processors (p) is scaled by a factor of k
• Under Time Constrained scaling:
– increase in number of particles is less than √k
• Under Memory Constrained scaling:
– elapsed time becomes unacceptably large
Time Constrained is most realistic for this application
Performance and Scaling Summary
Evaluating a Real Machine
Choosing Workloads
Metrics
Performance Isolation: Microbenchmarks
Microbenchmarks: Small, specially written programs to isolate
performance characteristics
• Processing
• Local memory
• Input/output
• Communication and remote access (read/write, send/receive)
• Synchronization (locks, barriers)
• Contention
[Figure: average read time (ns, up to 300) vs. stride (8 B to 2 MB) for array sizes from 8 KB to 8 MB]
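A minimal sketch of such a read-latency microbenchmark, assuming a POSIX system with clock_gettime; the array sizes, strides, and iteration count are illustrative choices rather than the exact ones behind the figure:

```c
/* Minimal read-latency microbenchmark sketch: strided reads over arrays of
   different sizes expose cache capacities, line sizes, and TLB behavior.
   Sizes, strides, and iteration counts are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    const size_t max_bytes = 8u << 20;                 /* up to 8 MB array */
    volatile char *a = malloc(max_bytes);
    if (!a) return 1;
    for (size_t i = 0; i < max_bytes; i++) a[i] = (char)i;   /* touch all pages */

    for (size_t size = 8u << 10; size <= max_bytes; size <<= 1) {     /* 8 KB .. 8 MB  */
        for (size_t stride = 8; stride <= size / 2; stride <<= 1) {   /* 8 B .. size/2 */
            const long accesses = 1 << 22;
            volatile char sink = 0;
            size_t idx = 0;
            double t0 = now_ns();
            for (long k = 0; k < accesses; k++) {
                sink += a[idx];                  /* one strided read per iteration */
                idx += stride;
                if (idx >= size) idx -= size;    /* wrap around within the array   */
            }
            double t1 = now_ns();
            printf("size %8zu B  stride %7zu B  %.2f ns/access\n",
                   size, stride, (t1 - t0) / accesses);
            (void)sink;
        }
    }
    free((void *)a);
    return 0;
}
```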
Evaluation using Realistic Workloads
Types of Workloads
• Kernels: matrix factorization, FFT, depth-first tree search
• Complete Applications: ocean simulation, crew scheduling, database
• Multiprogrammed Workloads
Adequate concurrency
Representativeness
Should adequately represent domains of interest, e.g.:
• Scientific: Physics, Chemistry, Biology, Weather ...
• Engineering: CAD, Circuit Analysis ...
• Graphics: Rendering, radiosity ...
• Information management: Databases, transaction
processing, decision support ...
• Optimization
• Artificial Intelligence: Robotics, expert systems ...
• Multiprogrammed general-purpose workloads
• System software: e.g. the operating system
Coverage: Stressing Features
Easy to mislead with workloads
• Choose those with features for which machine is good, avoid others
Some features of interest:
• Compute v. memory v. communication v. I/O bound
• Working set size and spatial locality
• Local memory and communication bandwidth needs
• Importance of communication latency
• Fine-grained or coarse-grained
– Data access, communication, task size
• Synchronization patterns and granularity
• Contention
• Communication patterns
• Data structuring, e.g. 2-d or 4-d arrays for SAS grid problem
• Data layout, distribution and alignment, even if properly structured
• Orchestration
– contention
– long versus short messages
– synchronization frequency and cost, ...
• Also, random problems with “unimportant” data structures
Optimizing applications takes work
• Many practical applications may not be very well optimized
May examine several different levels of optimization to test robustness of system
Concurrency
Should have enough to utilize the processors
• If load imbalance dominates, may not be much machine can do
• (Still, useful to know what kinds of workloads/configurations don’t
have enough concurrency)
Algorithmic speedup: useful measure of concurrency/imbalance
• Speedup (under scaling model) assuming all
memory/communication operations take zero time
• Ignores memory system, measures imbalance and extra work
• Uses PRAM machine model (Parallel Random Access Machine)
– Unrealistic, but widely used for theoretical algorithm development
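As an illustration (the per-process operation counts are hypothetical and would come from instrumentation or a PRAM-style analysis), algorithmic speedup, extra work, and imbalance can be computed as follows:

```c
/* Minimal sketch: algorithmic speedup from per-process operation counts,
   assuming all memory and communication operations take zero time.
   Speedup = sequential ops / max ops on any process; extra work and
   load imbalance are what keep it below p. Counts are hypothetical. */
#include <stdio.h>

int main(void) {
    const int p = 4;
    double seq_ops    = 1.0e9;                            /* best sequential algorithm    */
    double par_ops[4] = {2.8e8, 2.5e8, 3.1e8, 2.6e8};     /* per-process operation counts */

    double max_ops = 0.0, total_ops = 0.0;
    for (int i = 0; i < p; i++) {
        if (par_ops[i] > max_ops) max_ops = par_ops[i];
        total_ops += par_ops[i];
    }

    printf("algorithmic speedup = %.2f (ideal %d)\n", seq_ops / max_ops, p);
    printf("extra work factor   = %.2f\n", total_ops / seq_ops);
    printf("imbalance factor    = %.2f\n", max_ops * p / total_ops);
    return 0;
}
```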
ScaLAPACK
• Message-passing kernels
TPC
• Transaction processing
SPEC-HPC
...
Evaluating a Fixed-size Machine
Now assume workload is fixed too
Many critical characteristics depend on problem size
• Inherent application characteristics
– concurrency and load balance (generally improve with problem size)
– communication to computation ratio (generally improves)
– working sets and spatial locality (generally worsen and improve, resp.)
• Interactions with machine organizational parameters
• Nature of the major bottleneck: comm., imbalance, local access...
Insufficient to use a single problem size
Need to choose problem sizes appropriately
• Understanding of workloads will help
Examine step by step using grid solver
• Assume 64 processors with 1MB cache and 64MB memory each
Steps in Choosing Problem Sizes
Steps in Choosing Problem Sizes (contd.)
Steps in Choosing Problem Sizes (contd.)
4. Use temporal locality and working sets
Fitting or not dramatically changes local traffic and artifactual comm.
E.g. Raytrace working sets are nonlocal, Ocean's are local
[Figure: (a) miss ratio vs. cache size, with working-set knees WS1, WS2, WS3; (b) % of working set that fits in a cache of size C vs. problem size, for Problems 1–4]
• Choose problem sizes on both sides of a knee if realistic
– Critical to understand growth rate of working sets
• Also try to pick one very large size (exercises TLB misses etc.)
• Solver: first (2 subrows) usually fits, second (full partition) may or may not
– Doesn't fit for the largest (2K), so add a 4K-by-4K grid
– Add 16K as a large size, so grid sizes are now 256, 1K, 2K, 4K, 16K (in each dimension)
Steps in Choosing Problem Sizes (contd)
5. Use spatial locality and granularity interactions
• E.g., in grid solver, can we distribute data at page granularity in SAS?
– Affects whether cache misses are satisfied locally or cause comm.
– With the 2-D array representation: for grid sizes 512, 1K, and 2K, no; for 4K and 16K, yes
– With the 4-D array representation: yes, except for very small problems
• So no need to expand choices for this reason
More stark example: false sharing in Radix sort
• Becomes a problem when cache line size exceeds n/(r*p) for radix r
[Figure: left — Radix miss rate (%) broken into cold/capacity, true sharing, and false sharing vs. cache line size (8–256 bytes) for n = 256K and 1M keys; right — traffic (bytes/FLOP) vs. number of processors (1–64)]
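A small sketch of this rule of thumb for Radix (the key size and parameter values are illustrative, not taken from the figure):

```c
/* Minimal sketch: when does false sharing appear in the Radix permute step?
   On average each (process, bucket) region of the output array holds about
   n/(r*p) keys; once a cache line holds more bytes than such a region,
   writes from different processes start to share lines. Values illustrative. */
#include <stdio.h>

int main(void) {
    long n        = 256 * 1024;   /* keys            */
    long r        = 1024;         /* radix (buckets) */
    long p        = 64;           /* processes       */
    long key_size = 4;            /* bytes per key   */

    double keys_per_region  = (double)n / ((double)r * p);
    double bytes_per_region = keys_per_region * key_size;

    printf("average region = %.1f keys (%.0f bytes)\n",
           keys_per_region, bytes_per_region);
    for (long line = 8; line <= 256; line <<= 1)
        printf("line %3ld B: %s\n", line,
               line > bytes_per_region ? "false sharing likely"
                                       : "little false sharing");
    return 0;
}
```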
n/p is large
• Low communication to computation ratio
• Good spatial locality with large cache lines
• Data distribution and false sharing not problems even with 2-d array
• Working set doesn’t fit in cache; high local capacity miss rate.
n/p is small
• High communication to computation ratio
• Spatial locality may be poor; false-sharing may be a problem
• Working set fits in cache; low capacity miss rate.
E.g., shouldn't draw conclusions about spatial locality based only on small problems, particularly if these are not very representative.
Summary Example 2: Barnes-Hut
Large problem size (large n or small θ) relative to p: most of the time is spent in the force-computation phase
• Good load balance
• Low communication to computation ratio
Varying p on a Given Machine
Already know how to scale problem size with p
Metrics for Comparing Machines
Both cost and performance are important (as is effort)
• For a fixed machine as well as how they scale
• E.g. if speedup increases less than linearly, may still be very cost effective if cost
to run the program scales sublinearly too
• Some measure of “cost-performance” is most useful
But cost is difficult to get a handle on
• Depends on market and volume, not just on hardware/effort
Also, cost and performance can be measured independently
Focus here on measuring performance
Many metrics used for performance
• Based on absolute performance, speedup, rate, utilization, size ...
• Some important and should always be presented
• Others should be used only very carefully, if at all
• Let’s examine some ...
Absolute Performance
What is Performance(1)?
1. Parallel program on one processor of parallel machine?
2. Same sequential algorithm on one processor of parallel machine?
3. “Best” sequential program on one processor of parallel machine?
4. “Best” sequential program on agreed-upon standard machine?
3. is more honest than 1. or 2. for users
• 2. may be okay for architects to understand parallel performance
4. evaluates uniprocessor performance of machine as well
• Similar to absolute performance
Processing Rates
Popular to measure computer operations per unit time
• MFLOPS, MIPS
• As opposed to operations that have meaning at application level
Neither good for comparing machines
• Can be artificially inflated
– Worse algorithm with greater FLOP rate, or even add useless cheap ops
• Who cares about FLOPS anyway?
• Different floating point ops (add, mul, …) take different time
• Burdened with legacy of misuse
Can use independently known no. of operations as work measure
• Then rate no different from measuring execution time
Resource Utilization
Architects often measure how well resources are utilized
• E.g. processor utilization, memory...
Metrics based on Problem Size
E.g. Smallest problem size needed to achieve given parallel efficiency
(parallel efficiency = speedup/p)
• Motivation: everything depends on problem size, and smaller
problems have more parallel overheads
• Distinguish comm. architectures by ability to run smaller problems
• Introduces another scaling model: efficiency-constrained scaling
Caveats
• Sometimes larger problem has worse parallel efficiency
– Working sets have nonlocal data, and may not fit for large problems
• Small problems may not stress local memory system
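A small sketch of applying this metric to hypothetical measurements for a fixed machine size:

```c
/* Minimal sketch: smallest problem size that reaches a target parallel
   efficiency (speedup / p). The (size, speedup) pairs are hypothetical
   measurements for a fixed machine size p. */
#include <stdio.h>

int main(void) {
    const int    p      = 64;
    const double target = 0.6;                               /* desired efficiency */
    const long   size[]    = {256, 512, 1024, 2048, 4096};   /* grid dimension     */
    const double speedup[] = {9.1, 18.4, 30.2, 41.5, 52.7};  /* measured speedups  */
    const int    m = 5;

    for (int i = 0; i < m; i++) {
        double eff = speedup[i] / p;
        printf("n = %4ld  speedup = %5.1f  efficiency = %.2f\n", size[i], speedup[i], eff);
        if (eff >= target) {
            printf("smallest size reaching %.0f%% efficiency: n = %ld\n",
                   100 * target, size[i]);
            break;
        }
    }
    return 0;
}
```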
Percentage Improvement in Performance
Summary of metrics
• For user: absolute performance
• For architect: absolute performance as well as speedup
– any study should present both
– size-based metrics useful for concisely including problem size effect
• Other metrics useful for specialized reasons, usually to architect
– but must be careful when using, and only in conjunction with above
Presenting Results
Beware peak values
• Never obtained in practice
– Peak = “guaranteed not to exceed”
• Not meaningful for a user who wants to run an application
Even single measure of “sustained” performance is not really useful
• Sustained on what benchmark?
Averages over benchmarks are not very useful
• Behavior is too dependent on benchmark, so average obscures
Report performance per benchmark
• Interpreting results needs understanding of benchmark
Must specify problem and machine configuration
• Can make dramatic difference to results; e.g. working set fits or not
“Twelve Ways to Fool the Masses”
Compare 32-bit results to others’ 64-bit results
Present inner kernel results instead of whole application
Use assembly coding and compare with others’ Fortran or C codes
Scale problem size with number of processors, but don’t tell
Quote performance results linearly projected to a full system
Compare with scalar, unoptimized, uniprocessor results on CRAYs
Compare with old code on obsolete system
Use parallel code as base instead of best sequential one
Quote performance in terms of utilization, speedups, or peak MFLOPS/$
Use inefficient algorithms to get high MFLOPS rates
Measure parallel times on dedicated sys, but uniprocessor on busy sys
Show pretty pictures and videos, but don’t talk about performance
Some Important Observations
Operating Points Based on Working Sets
• Some working sets scale with application parameters and p, some don’t
• Some operating points are realistic, some aren’t
• operating point = f(cache/replication size, application parameters, p)
Evaluating an Idea or Tradeoff
Multiprocessor Simulation
Execution-driven Simulation
Memory hierarchy simulator returns simulated time information to
reference generator, which is used to schedule simulated processes
[Diagram: simulated processors P1..Pp, each with a cache ($1..$p) and memory (Mem1..Memp), connected through a network]
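A toy sketch of this feedback loop (purely illustrative: the reference generator and memory model below are stand-ins, and a real execution-driven simulator interleaves actual application code with the memory-system model):

```c
/* Toy sketch of execution-driven simulation: a reference generator stands in
   for each simulated process; the memory-hierarchy model returns a simulated
   latency, which advances that process's clock and decides who runs next. */
#include <stdio.h>

#define NPROC 4

static long clock_ns[NPROC];          /* simulated time of each process */

/* Stand-in memory-hierarchy model: latency of one reference. */
static long memory_latency(int proc, long addr) {
    (void)proc;
    return (addr % 128 == 0) ? 100 : 1;   /* pretend one in 16 references misses */
}

/* Stand-in reference generator: next address issued by a process. */
static long next_reference(int proc, long step) {
    return proc * 4096 + step * 8;
}

int main(void) {
    long issued[NPROC] = {0};
    for (long event = 0; event < 1000; event++) {
        /* Schedule the process with the smallest simulated time (most behind). */
        int next = 0;
        for (int i = 1; i < NPROC; i++)
            if (clock_ns[i] < clock_ns[next]) next = i;

        long addr = next_reference(next, issued[next]++);
        clock_ns[next] += 1;                             /* one cycle of compute */
        clock_ns[next] += memory_latency(next, addr);    /* plus memory latency  */
    }
    for (int i = 0; i < NPROC; i++)
        printf("P%d: %ld references, simulated time %ld\n", i, issued[i], clock_ns[i]);
    return 0;
}
```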
Difficulties in Simulation-based Evaluation
Two major problems, beyond accuracy and reliability:
• Cost of simulation (in time and memory)
– cannot simulate the problem/machine sizes we care about
– have to use scaled down problem and machine sizes
• how to scale down and stay representative?
Scaling Down Parameters for Simulation
Want scaled-down machine running scaled-down problem to be
representative of full-sized scenario
• No good formulas exist
• But very important, since this is the reality of most evaluation
• Should understand limitations and guidelines to avoid pitfalls
Scaling Down Problem Parameters
Some parameters don’t affect parallel performance much, but do
affect runtime, and can be scaled down
• Common example is no. of time-steps in many scientific applications
– need a few to allow settling down, but don’t need more
– may need to omit cold-start when recording time and statistics
• First look for such parameters
• Others can be scaled according to earlier scaling arguments
Difficulties in Scaling N, p Representatively
Scaling Down Other Machine Parameters
Often necessary when scaling down problem size
• E.g. may not capture a working set not fitting if the cache is not scaled
More difficult to do with confidence
• Cache/replication size: guide by scaling of working sets, not data set
• Associativity and Granularities: more difficult
– should try to keep unchanged since hard to predict effects, but ...
– greater impact with scaled-down application and system parameters
– difficult to find good solutions for both communication and local access
Dealing with the Parameter Space
An Example Evaluation
Goal of study: To determine the value of adding a block transfer
facility to a cache-coherent SAS machine with distributed
memory
Workloads: Choose at least some that have communication that is
amenable to block transfer (e.g. grid solver)
Choosing parameters is more difficult. 3 goals:
• Avoid unrealistic execution characteristics
• Obtain good coverage of realistic characteristics
• Prune the parameter space based on
– goals of study
– restrictions imposed by technology or assumptions
– understanding of parameter interactions
Let’s use equation solver as example
Choosing Parameters
Problem size and number of processors
• Use inherent characteristics considerations as discussed earlier
• For example, low c-to-c ratio will not allow block transfer to help much
• Suppose one size chosen is 514-by-514 grid with 16 processors
Cache/Replication Size
• Choose based on knowledge of working set curve
• Choosing cache sizes for a given problem and machine size is analogous to choosing problem sizes for a given cache and machine size, discussed earlier
• Whether or not working set fits affects block transfer benefits greatly
– if local data, not fitting makes communication relatively less important
– If nonlocal, can increase artifactual comm. So BT has more opportunity
• Sharp knees in working set curve can help prune space (next slide)
– Knees can be determined by analysis or by very simple simulation
Example of Pruning using Knees
[Figure: miss rate or comm. traffic vs. cache size, with working-set knees separating unrealistic operating points from realistic ones]
Associativity
• Effects difficult to predict, but range of associativity usually small
• Be careful about using direct-mapped lowest-level caches
Choosing Parameters (contd.)
Revisiting choices
• Values of earlier parameters may have to be revised based on interactions with those chosen later
• E.g. choosing direct-mapped cache may require choosing larger caches
Summary of Evaluating a Tradeoff
Illustrating Workload Characterization
LU: Dense Matrix Factorization
Factorize matrix A into lower- and upper-triangular matrices: A = LU
[Figure: n-by-n matrix of elements divided into N-by-N blocks; pseudocode for a process]
• Blocked for reuse in the cache: B3 ops on B2 data for each block “task”
– Tradeoffs in block size
• Scatter assignment of blocks improves load balance in later stages
• Good temporal and spatial locality (4-D arrays), low communication
• Only barrier synchronization (conservative but simple)
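The per-process pseudocode does not survive in this text version; below is a self-contained sequential sketch of the blocked factorization it is based on (no pivoting; the parallel kernel additionally assigns the B x B blocks to processes in a scattered fashion and separates the steps with barriers). Sizes are illustrative.

```c
/* Minimal sequential sketch of blocked right-looking LU (no pivoting).
   Each B x B block update does O(B^3) operations on O(B^2) data, which is
   the cache-reuse property referred to above. */
#include <stdio.h>

#define N 128            /* matrix dimension */
#define B 16             /* block size       */

static double A[N][N];

int main(void) {
    /* Diagonally dominant test matrix so factoring without pivoting is safe. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 2.0 * N : 1.0;

    for (int K = 0; K < N; K += B) {
        int E = K + B;                        /* end of the diagonal block */

        /* 1. Factor the diagonal block in place (L unit lower, U upper). */
        for (int j = K; j < E; j++)
            for (int i = j + 1; i < E; i++) {
                A[i][j] /= A[j][j];
                for (int l = j + 1; l < E; l++)
                    A[i][l] -= A[i][j] * A[j][l];
            }

        /* 2. Perimeter row blocks: U_Kj = L_KK^{-1} A_Kj (forward substitution). */
        for (int c = E; c < N; c++)
            for (int r = K; r < E; r++)
                for (int t = K; t < r; t++)
                    A[r][c] -= A[r][t] * A[t][c];

        /* 3. Perimeter column blocks: L_iK = A_iK U_KK^{-1}. */
        for (int r = E; r < N; r++)
            for (int c = K; c < E; c++) {
                for (int t = K; t < c; t++)
                    A[r][c] -= A[r][t] * A[t][c];
                A[r][c] /= A[c][c];
            }

        /* 4. Interior blocks: A_ij -= L_iK * U_Kj (B^3 ops on B^2 data per block). */
        for (int r = E; r < N; r++)
            for (int c = E; c < N; c++)
                for (int t = K; t < E; t++)
                    A[r][c] -= A[r][t] * A[t][c];
    }

    printf("A[0][0]=%g  A[N-1][N-1]=%g\n", A[0][0], A[N - 1][N - 1]);
    return 0;
}
```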
Radix
Sort b-bit integer keys, using radix r (i.e. b/r phases using r bits each)
• Each phase reads from input array, writes to output array, alternating
Three steps within each phase
• Build local histogram of key frequencies
• Combine local histograms into global histogram (prefix computation)
• Permute from input to output array based on histogram values
[Diagram: permutation step — each process (Process0 ... Processp) reads its portion of the input array and writes keys into scattered regions of the output array]
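A sequential sketch of one such phase (the 8-bit digit width and key values are illustrative; the parallel version builds per-process histograms and combines them with a prefix computation before the permute step):

```c
/* Minimal sequential sketch of one radix-sort phase: histogram, prefix sum,
   then permutation into the output array. Key width and radix are illustrative. */
#include <stdio.h>

#define NKEYS 16
#define RBITS 8                     /* bits per phase */
#define R     (1 << RBITS)          /* radix = 256    */

static void radix_phase(const unsigned *in, unsigned *out, int n, int shift) {
    int hist[R] = {0}, start[R];

    for (int i = 0; i < n; i++)                 /* 1. histogram of key frequencies */
        hist[(in[i] >> shift) & (R - 1)]++;

    for (int d = 0, sum = 0; d < R; d++) {      /* 2. exclusive prefix sum: first  */
        start[d] = sum;                         /*    output slot for each digit   */
        sum += hist[d];
    }

    for (int i = 0; i < n; i++) {               /* 3. permute input into output    */
        int d = (in[i] >> shift) & (R - 1);
        out[start[d]++] = in[i];
    }
}

int main(void) {
    unsigned a[NKEYS] = {512, 3, 77, 90000, 12, 7, 65536, 255,
                         1024, 42, 9, 300, 2, 100000, 64, 8};
    unsigned b[NKEYS];
    unsigned *in = a, *out = b;

    for (int shift = 0; shift < 32; shift += RBITS) {   /* b/r phases, LSD first */
        radix_phase(in, out, NKEYS, shift);
        unsigned *tmp = in; in = out; out = tmp;        /* alternate arrays      */
    }
    for (int i = 0; i < NKEYS; i++) printf("%u ", in[i]);
    printf("\n");
    return 0;
}
```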
Radiosity
• Computing visibility between patch pairs
• Parallelism across elements (patches here), or interactions
• Highly irregular and unpredictable access and sharing patterns
• Fine-grained locking
[Figure: patches A and B; (2) after the first refinement; (3) after three more refinements: A2 subdivides B, then A2 is subdivided due to B2, and then A22 subdivides B1, giving elements A1, A2, B1, B2]
Multiprog: A Simple Mix
Characteristics: Data Access and Synch.
Using default problem sizes (at small end of realistic for 64-p system)
Table 4.1 General Statistics about Application Programs
For the parallel programs, shared reads and writes simply refer to all nonstack references issued by the application processes. All such references do not necessarily point to data that is truly shared by multiple processes. The Multiprog workload is not a parallel application, so it does not access shared data. A dash in a table entry means that this measurement is not applicable to or is not measured for that application (e.g., Radix has no floating-point operations). (M) denotes that the measurement in that column is in millions.
[Figure: communication bandwidth needs vs. number of processors (two panels, up to 64 processors)]
• Average bw need generally small, except for Radix (so watch out for it)
– well-optimized, but burstiness may make others demand more bw too
• Scaling trend with p different across applications
Growth Rates of Comm-to-comp Ratio
Table 4.1 Growth Rates of Inherent Communication-to-Computation Ratio
Working Set Curves (Default Config.)
Measure using fully associative LRU caches with 8-byte cache block
[Figure: miss rate vs. cache size (K) for the applications in their default configurations, with working-set knees (L1 WS, L2 WS, ...) marked]
• Small associativity can affect whether a working set fits
• Key working sets are well defined in most cases
Working Set Growth Rates
How do they scale, are they important, and is it realistic for them (not) to fit in the cache?
Table 4.1 Important Working Sets and Their Growth Rates for the SPLASH-2 Suite
Program    | Working Set 1                          | Growth Rate   | Important? | Realistic Not to Fit in Cache? | Working Set 2   | Growth Rate  | Important? | Realistic Not to Fit in Cache?
LU         | One block                              | Fixed (B)     | Yes        | No                             | Partition of DS | DS/P         | No         | Yes
Ocean      | A few subrows                          | sqrt(DS/P)    | Yes        | No                             | Partition of DS | DS/P         | Yes        | Yes
Barnes-Hut | Tree data for 1 body                   | log2 DS       | Yes        | No                             | Partition of DS | DS/P         | No         | Yes
Radiosity  | BSP tree                               | log(polygons) | Yes        | No                             | Unstructured    | Unstructured | No         | Yes
Radix      | Histogram                              | Radix r       | Yes        | No                             | Partition of DS | DS/P         | Yes        | Yes
Raytrace   | Scene and grid data reused across rays | Unstructured  | Yes        | Yes                            | —               | —            | —          | —

DS represents the data set size, and P is the number of processes.
Concluding Remarks