Pthreads Model
[Figure: processes with private memories mem1 … memN connected by a network]
Shared Memory vs. Message Passing
[Figure: a pthread_create(func) call spawns a new thread that runs func()]
Thread Joining Example
void *func(void *arg) { ….. }

pthread_t id; int X;
pthread_create(&id, NULL, func, &X);
…..
pthread_join(id, NULL);
…..
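Put together, a compilable version of this pattern looks as follows (a minimal sketch; the printed message and the value 42 are illustrative, not from the slides). Compile with cc -pthread.

#include <pthread.h>
#include <stdio.h>

/* Thread body: receives the pointer passed as the last
 * argument of pthread_create(). */
void *func(void *arg)
{
    int *x = (int *) arg;
    printf("thread sees X = %d\n", *x);
    return NULL;
}

int main(void)
{
    pthread_t id;
    int X = 42;

    pthread_create(&id, NULL, func, &X);  /* spawn a thread running func(&X) */
    pthread_join(id, NULL);               /* wait for it to terminate        */
    return 0;
}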
Example of Thread Creation (contd.)
[Figure: main() calls pthread_create(func); the new thread runs func() and ends with pthread_exit(); main() waits in pthread_join(id)]
Sequential SOR
for some number of timesteps/iterations {
  for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
      temp[i][j] = 0.25 *
        ( grid[i-1][j] + grid[i+1][j]
        + grid[i][j-1] + grid[i][j+1] );
  for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
      grid[i][j] = temp[i][j];
}
Parallel SOR
• First (i,j) loop nest can be parallelized.
• Second (i,j) loop nest can be parallelized.
• Must wait to start second loop nest until all
processors have finished first.
• Must wait to start first loop nest of next
iteration until all processors have finished
second loop nest of previous iteration.
• Give n/p rows to each processor.
Pthreads SOR: Parallel parts (1)
void* sor_1(void *s)
{
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;
  if( from == 0 ) from = 1;  /* skip boundary row */
  for( i=from; i<to; i++ )
    for( j=1; j<n; j++ )
      temp[i][j] = 0.25*( grid[i-1][j] + grid[i+1][j]
                        + grid[i][j-1] + grid[i][j+1] );
}
Pthreads SOR: Parallel parts (2)
void* sor_2(void *s)
{
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;
  if( from == 0 ) from = 1;  /* skip boundary row */
  for( i=from; i<to; i++ )
    for( j=1; j<n; j++ )
      grid[i][j] = temp[i][j];
}
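The driver is not shown on these slides; the following is a minimal sketch of one way to run both parts with the barrier synchronization described under Parallel SOR (pthread_barrier_t is standard POSIX; NSTEPS and sor_slice are assumed names):

pthread_barrier_t barrier;  /* main() initializes: pthread_barrier_init(&barrier, NULL, p) */

void *sor_slice(void *s)
{
    int step;
    for (step = 0; step < NSTEPS; step++) {
        sor_1(s);                        /* first loop nest on this slice   */
        pthread_barrier_wait(&barrier);  /* all slices finish loop nest 1   */
        sor_2(s);                        /* second loop nest on this slice  */
        pthread_barrier_wait(&barrier);  /* all finish before next timestep */
    }
    return NULL;
}

main() would then create the p threads with pthread_create(), passing each its slice number, and pthread_join() them at the end, as in the thread-creation example earlier.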
Machine-independent Performance Optimization Techniques
Returning to Sequential vs. Parallel
• Sequential execution time: t seconds.
• Startup overhead of parallel execution: t_st
seconds (depends on architecture)
• (Ideal) parallel execution time: t/p + t_st.
• If t/p + t_st > t, no gain.
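• Example (illustrative numbers): t = 10 s, p = 4, t_st = 0.5 s gives 10/4 + 0.5 = 3 s, a clear gain; t = 0.1 s on the same machine gives 0.1/4 + 0.5 = 0.525 s > 0.1 s, a slowdown.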
General Idea
• Parallelism limited by dependences.
• Restructure code to eliminate or reduce
dependences.
• Sometimes possible by compiler, but good
to know how to do it by hand.
Optimizations: Example 16
for (i = 0; i < 100000; i++)
a[i + 1000] = a[i] + 1;
[Figure: iterations distributed across processors P1–P4]
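The dependence has distance 1000: iteration i writes a[i + 1000], which is read only 1000 iterations later. One possible restructuring (a sketch; it assumes a holds at least 101000 elements, as the original loop already requires) runs blocks of 1000 iterations in order, since the iterations within one block touch disjoint elements and can execute in parallel across the processors:

for (b = 0; b < 100000; b += 1000)  /* blocks must execute in order      */
    for (i = b; i < b + 1000; i++)  /* iterations within a block are     */
        a[i + 1000] = a[i] + 1;     /* independent: run them in parallel */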
Perfect Pipeline?
• Granularity.
• Load balance.
• Locality.
• Synchronization and communication.
Load Balance
• Load imbalance = difference in execution
times across processors between two barriers.
• Execution time may not be predictable.
– Regular data parallel: yes.
– Irregular data parallel or pipeline: perhaps.
– Task queue: no.
Static vs. Dynamic
• Static: done once, by the programmer
– block, cyclic, etc.
– fine for regular data parallel
• Dynamic: done at runtime
– task queue
– fine for unpredictable execution times
– usually high overhead
• Semi-static: done once, at run-time
Choice is not inherent
• MM (matrix multiplication) or SOR could be
done using task queues: put all iterations in a queue.
– In heterogeneous environment.
– In multitasked environment.
Static Load Balancing
• Block
– best locality
– possibly poor load balance
• Cyclic
– better load balance
– worse locality
• Block-cyclic
– load balancing advantages of cyclic (mostly)
– better locality (see the index-mapping sketch below)
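As a sketch of the three index mappings (illustrative; n iterations, p processors, block size B, processor number k, and work(i) are assumed names):

/* Block: processor k gets one contiguous chunk. */
for (i = (k*n)/p; i < ((k+1)*n)/p; i++) work(i);

/* Cyclic: processor k gets every p-th iteration. */
for (i = k; i < n; i += p) work(i);

/* Block-cyclic: processor k gets every p-th block of B iterations. */
for (b = k*B; b < n; b += p*B)
    for (i = b; i < b + B && i < n; i++) work(i);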
Dynamic Load Balancing (1 of 2)
• Centralized: single task queue (sketched below).
– Easy to program
– Excellent load balance
• Distributed: task queue per processor.
– Less contention during synchronization
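A minimal sketch of a centralized queue (illustrative, not from the slides): tasks are loop iterations handed out through a shared counter protected by a mutex.

#include <pthread.h>

static int next_task = 0;   /* next iteration to hand out */
static int num_tasks;       /* total number of iterations */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the next task index, or -1 when the queue is empty. */
int get_task(void)
{
    int t;
    pthread_mutex_lock(&qlock);
    t = (next_task < num_tasks) ? next_task++ : -1;
    pthread_mutex_unlock(&qlock);
    return t;
}

Every thread loops on get_task() until it returns -1; the single lock is what makes this easy to program but also a potential point of contention.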
Dynamic Load Balancing (2 of 2)
• Task stealing with distributed queues (sketched below):
– Processes normally remove and insert tasks
from their own queue.
– When queue is empty, remove task(s) from
other queues.
• Extra overhead and programming difficulty.
• Better load balancing.
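A structural sketch of distributed queues with stealing (illustrative; P, queue_t, and get_task are assumed names; each q[i].lock must be set up with pthread_mutex_init):

#define P 4   /* number of processors (assumed) */

typedef struct {
    pthread_mutex_t lock;
    int next, last;   /* tasks [next, last) remain */
} queue_t;

queue_t q[P];   /* one queue per thread */

/* Try thread me's own queue first, then steal from the others;
 * returns -1 only when every queue is empty. */
int get_task(int me)
{
    int v, t;
    for (v = me; v < me + P; v++) {
        queue_t *s = &q[v % P];
        pthread_mutex_lock(&s->lock);
        t = (s->next < s->last) ? s->next++ : -1;
        pthread_mutex_unlock(&s->lock);
        if (t >= 0) return t;
    }
    return -1;
}

Real work stealers usually take tasks from the opposite end of the victim's queue, and in bulk, to reduce contention; this sketch keeps only the structure.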
Semi-static Load Balancing
• Measure the cost of program parts.
• Use measurement to partition computation.
• Done once, done every iteration, done every
n iterations.
Molecular Dynamics (MD)
• Simulation of a set of bodies under the
influence of physical laws.
• Atoms, molecules, celestial bodies, ...
• Have same basic structure.
[Figure: bodies exerting forces F on one another]
Molecular Dynamics (Skeleton)
for some number of timesteps {
for all molecules i
for all other molecules j
force[i] += f( loc[i], loc[j] );
for all molecules i
loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics
• To reduce amount of computation, account
for interaction only with nearby molecules.
Molecular Dynamics (continued)
for some number of timesteps {
for all molecules i
for all nearby molecules j
force[i] += f( loc[i], loc[j] );
for all molecules i
loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics (continued)
for each molecule i
number of nearby molecules count[i]
array of indices of nearby molecules index[j] ( 0 <= j < count[i] )
Molecular Dynamics (continued)
for some number of timesteps {
for( i=0; i<num_mol; i++ )
for( j=0; j<count[i]; j++ )
force[i] += f( loc[i], loc[index[j]] );
for( i=0; i<num_mol; i++ )
loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics (simple)
for some number of timesteps {
parallel for
for( i=0; i<num_mol; i++ )
for( j=0; j<count[i]; j++ )
force[i] += f(loc[i],loc[index[j]]);
parallel for
for( i=0; i<num_mol; i++ )
loc[i] = g( loc[i], force[i] );
}
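The parallel for over molecules can be implemented with pthreads exactly like the SOR slices (a sketch using the slides' arrays; block distribution of the i iterations, and no locking is needed because each iteration writes only its own force[i]):

void *md_force(void *s)
{
    int slice = (int) s;
    int from = (slice * num_mol) / p;       /* this thread's molecules */
    int to   = ((slice + 1) * num_mol) / p;
    for (int i = from; i < to; i++)
        for (int j = 0; j < count[i]; j++)
            force[i] += f(loc[i], loc[index[j]]);
    return NULL;
}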
Molecular Dynamics (simple)
• Simple to program.
• Possibly poor load balance
– block distribution of i iterations (molecules)
– could lead to uneven neighbor distribution
– cyclic does not help
Better Load Balance
• Assign iterations such that each processor
has ~ the same number of neighbors.
• Array of “assign records”
– size: number of processors
– two elements:
• beginning i value (molecule)
• ending i value (molecule)
• Recompute partition periodically (see the sketch below)
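A minimal sketch of computing the assign records (illustrative; P, count[], and all names are assumptions): walk the molecules once and close a slice whenever it has accumulated its share of the total neighbor count.

typedef struct { int begin, end; } assign_t;  /* molecules [begin, end)   */
assign_t assign[P];                           /* one record per processor */

void balance(int num_mol, int count[])
{
    int i, total = 0, proc = 0, acc = 0;
    for (i = 0; i < num_mol; i++)
        total += count[i];                    /* total neighbor count       */
    int per_proc = total / P;                 /* target neighbors per slice */
    assign[0].begin = 0;
    for (i = 0; i < num_mol; i++) {
        acc += count[i];
        if (acc >= per_proc && proc < P - 1) {
            assign[proc].end = i + 1;         /* close this slice  */
            assign[++proc].begin = i + 1;     /* open the next one */
            acc = 0;
        }
    }
    while (proc < P - 1) {                    /* close any leftover */
        assign[proc].end = num_mol;           /* slices as empty    */
        assign[++proc].begin = num_mol;
    }
    assign[P - 1].end = num_mol;
}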
Frequency of Balancing
• Every time neighbor list is recomputed.
– once during initialization.
– every iteration.
– every n iterations.
• Extra overhead vs. better approximation
and better load balance.
Summary
• Parallel code optimization
– Granularity
– Load balance
– Locality
– Synchronization