ECE1747 Parallel Programming

Shared Memory Multithreading


Pthreads
Shared Memory
• All threads access the same shared memory
data space.

[Figure: processors proc1 … procN all accessing a single shared memory address space]

Shared Memory (continued)
• Concretely, it means that a variable x, a
pointer p, or an array a[] refer to the same
object, no matter what processor the
reference originates from.
• We have more or less implicitly assumed
this to be the case in earlier examples.
[Figure: shared memory accessed by proc1 … procN]

Distributed Memory - Message Passing

The alternative model to shared memory.

[Figure: each processor proc1 … procN has its own memory mem1 … memN, each holding its own copy of a; processors communicate over a network]
Shared Memory vs. Message Passing

• The same terminology is used to distinguish hardware.
• For us: we distinguish programming models, not hardware.
Programming vs. Hardware
• One can implement
– a shared memory programming model
– on shared or distributed memory hardware
– (also in software or in hardware)
• One can implement
– a message passing programming model
– on shared or distributed memory hardware
Portability of programming models

[Figure: both the shared memory programming model and the message passing programming model can be implemented on either a shared memory machine or a distributed memory machine]
Shared Memory Programming:
Important Point to Remember
• No matter what the implementation, it
conceptually looks like shared memory.
• There may be some (important)
performance differences.
Multithreading
• User has explicit control over threads.
• Good: control can be used for performance benefit.
• Bad: user has to deal with it.
Pthreads
• POSIX standard shared-memory
multithreading interface.
• Provides primitives for process
management and synchronization.
What does the user have to do?
• Decide how to decompose the computation
into parallel parts.
• Create (and destroy) processes to support
that decomposition.
• Add synchronization to make sure
dependences are covered.
General Thread Structure
• Typically, a thread is a concurrent
execution of a function or a procedure.
• So, your program needs to be restructured
such that parallel parts form separate
procedures or functions.
Example of Thread Creation (contd.)
[Figure: main() calls pthread_create(func); func() starts running concurrently with main()]
Thread Joining Example
void *func(void *arg) { ... }

pthread_t id; int X;
pthread_create(&id, NULL, func, &X);
...
pthread_join(id, NULL);
...
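A minimal compilable version of the example above, as a sketch (the function body, the printed message, and the value of X are assumptions added for illustration):

#include <pthread.h>
#include <stdio.h>

void *func(void *arg)
{
    int *x = (int *)arg;                  /* main passed &X */
    printf("worker sees X = %d\n", *x);
    return NULL;
}

int main(void)
{
    pthread_t id;
    int X = 42;                           /* value chosen only for illustration */
    pthread_create(&id, NULL, func, &X);  /* start func(&X) in a new thread */
    pthread_join(id, NULL);               /* wait for it to terminate */
    return 0;
}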
Example of Thread Creation (contd.)
[Figure: main() calls pthread_create(func); func() runs and finishes with pthread_exit(); main() calls pthread_join(id) to wait for it]
Sequential SOR
for some number of timesteps/iterations {
  for( i=0; i<n; i++ )
    for( j=1; j<n; j++ )
      temp[i][j] = 0.25 *
        ( grid[i-1][j] + grid[i+1][j] +
          grid[i][j-1] + grid[i][j+1] );
  for( i=0; i<n; i++ )
    for( j=1; j<n; j++ )
      grid[i][j] = temp[i][j];
}
Parallel SOR
• First (i,j) loop nest can be parallelized.
• Second (i,j) loop nest can be parallelized.
• Must wait to start the second loop nest until all processors have finished the first.
• Must wait to start the first loop nest of the next iteration until all processors have finished the second loop nest of the previous iteration.
• Give n/p rows to each processor.
Pthreads SOR: Parallel parts (1)
void* sor_1(void *s)
{
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;

  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ )
      temp[i][j] = 0.25*( grid[i-1][j] + grid[i+1][j]
                        + grid[i][j-1] + grid[i][j+1] );
}
Pthreads SOR: Parallel parts (2)
void* sor_2(void *s)
{
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;

  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ )
      grid[i][j] = temp[i][j];
}
Pthreads SOR: main
for some number of timesteps {
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, sor_1, (void *)i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, sor_2, (void *)i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
}
Summary: Thread Management
• pthread_create(): creates a parallel thread
executing a given function (and arguments),
returns thread identifier.
• pthread_exit(): terminates thread.
• pthread_join(): waits for thread with
particular thread identifier to terminate.
Summary: Program Structure
• Encapsulate parallel parts in functions.
• Use function arguments to parameterize
what a particular thread does.
• Call pthread_create() with the function and
arguments, save thread identifier returned.
• Call pthread_join() with that thread
identifier.
Pthreads Synchronization
• Create/exit/join
– provide some form of synchronization,
– at a very coarse level,
– requires thread creation/destruction.
• Need for finer-grain synchronization
– mutex locks,
– condition variables.
Use of Mutex Locks
• To implement critical sections.
• Pthreads provides only exclusive locks.
• Some other systems allow shared-read,
exclusive-write locks.
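A minimal sketch of a critical section built with a Pthreads mutex (the shared counter and the names used here are assumptions for illustration):

#include <pthread.h>

pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;  /* assumed shared lock */
int shared_count = 0;                                    /* assumed shared data */

void increment(void)
{
    pthread_mutex_lock(&count_lock);     /* enter the critical section */
    shared_count++;                      /* only one thread at a time runs this */
    pthread_mutex_unlock(&count_lock);   /* leave the critical section */
}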
Condition variables (1 of 5)
pthread_cond_init(
pthread_cond_t *cond,
pthread_condattr_t *attr)
• Creates a new condition variable cond.
• Attribute: ignore for now.
Condition Variables (2 of 5)
pthread_cond_destroy(
pthread_cond_t *cond)
• Destroys the condition variable cond.
Condition Variables (3 of 5)
pthread_cond_wait(
pthread_cond_t *cond,
pthread_mutex_t *mutex)
• Blocks the calling thread, waiting on cond.
• Atomically releases the mutex while waiting; the mutex is re-acquired before the call returns.
Condition Variables (4 of 5)
pthread_cond_signal(
pthread_cond_t *cond)
• Unblocks one thread waiting on cond.
• Which one is determined by scheduler.
• If no thread waiting, then signal is a no-op.
Condition Variables (5 of 5)
pthread_cond_broadcast(
pthread_cond_t *cond)
• Unblocks all threads waiting on cond.
• If no thread waiting, then broadcast is a no-op.
Use of Condition Variables
• To implement signal-wait synchronization
discussed in earlier examples.
• Important note: a signal is “forgotten” if no thread is already waiting when it is issued.
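A sketch of the usual wait pattern that avoids losing a signal: the waiter tests a shared flag under the mutex, so a signal that arrived earlier is still observed (the flag and the names are assumptions; the deck builds the same idea into semaphore_signal()/semaphore_wait() later):

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;   /* names are assumptions */
pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
int ready = 0;                                   /* shared flag remembers the event */

void signaller(void)
{
    pthread_mutex_lock(&m);
    ready = 1;                      /* record the event in shared state */
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
}

void waiter(void)
{
    pthread_mutex_lock(&m);
    while (!ready)                  /* re-check: handles early signals and spurious wakeups */
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
}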
Barrier Synchronization
• A wait at a barrier causes a thread to wait
until all threads have performed a wait at
the barrier.
• At that point, they all proceed.
Implementing Barriers in Pthreads

• Count the number of arrivals at the barrier.
• Wait if this is not the last arrival.
• Make everyone unblock if this is the last arrival.
• Since the arrival count is a shared variable, enclose the whole operation in a mutex lock-unlock.
Implementing Barriers in Pthreads
void barrier()
{
  pthread_mutex_lock(&mutex_arr);
  arrived++;
  if (arrived < N) {
    /* not the last arrival: wait (releases mutex_arr while blocked) */
    pthread_cond_wait(&cond, &mutex_arr);
  }
  else {
    /* last arrival: wake everyone up */
    pthread_cond_broadcast(&cond);
    arrived = 0; /* be prepared for next barrier */
  }
  pthread_mutex_unlock(&mutex_arr);
}
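The shared state assumed by the barrier above could be declared as follows (the names match the code; the value of N is an assumption):

#define N 4                                       /* number of participating threads, an assumption */

pthread_mutex_t mutex_arr = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond      = PTHREAD_COND_INITIALIZER;
int arrived = 0;                                  /* how many threads have reached the barrier */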
Parallel SOR with Barriers (1 of 2)
void* sor (void* arg)
{
  int slice = (int)arg;
  int from = (slice * (n-1))/p + 1;
  int to = ((slice+1) * (n-1))/p + 1;

  for some number of iterations { … }
}
Parallel SOR with Barriers (2 of 2)
for (i=from; i<to; i++)
  for (j=1; j<n; j++)
    temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                         grid[i][j-1] + grid[i][j+1]);
barrier();
for (i=from; i<to; i++)
  for (j=1; j<n; j++)
    grid[i][j] = temp[i][j];
barrier();
Parallel SOR with Barriers: main
int main(int argc, char *argv[])
{
  pthread_t thrd[p];
  /* Initialize mutex and condition variables */
  for (i=0; i<p; i++)
    pthread_create(&thrd[i], &attr, sor, (void*)i);
  for (i=0; i<p; i++)
    pthread_join(thrd[i], NULL);
  /* Destroy mutex and condition variables */
}
Note again
• Many shared memory programming
systems (other than Pthreads) have barriers
as basic primitive.
• If they do, you should use it, not construct it
yourself.
• Implementation may be more efficient than
what you can do yourself.
Busy Waiting
• Not an explicit part of the API.
• Available in a general shared memory
programming environment.
Busy Waiting
initially: flag = 0;

P1: produce data;
    flag = 1;

P2: while( !flag ) ;
    consume data;
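In C, a plain int flag is not reliable for this pattern (the compiler may hoist the load and the hardware may reorder the stores); a minimal sketch using C11 atomics, with the produce/consume steps left as comments:

#include <stdatomic.h>

atomic_int flag = 0;                 /* shared flag, initially 0 */

void p1_produce(void)                /* P1 */
{
    /* ... produce data ... */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void p2_consume(void)                /* P2 */
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                            /* spin until P1 publishes the data */
    /* ... consume data ... */
}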
Use of Busy Waiting
• On the surface, simple and efficient.
• In general, not a recommended practice.
• Often leads to messy and unreadable code
(blurs data/synchronization distinction).
• May be inefficient.
Private Data in Pthreads
• To make a variable private in Pthreads, you need to make an array out of it.
• Index the array by thread identifier, which you should keep track of.
• Not very elegant or efficient.
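A sketch of the array-per-thread idiom described above, with one slot per thread indexed by the slice number passed at creation (the array name and the bound on thread count are assumptions):

#include <pthread.h>

#define MAX_THREADS 16                /* assumed upper bound on p */

int my_result[MAX_THREADS];           /* one "private" slot per thread */

void *worker(void *arg)
{
    int me = (int)(long)arg;          /* thread index passed to pthread_create() */
    my_result[me] = 0;                /* each thread touches only its own slot */
    /* ... accumulate into my_result[me] ... */
    return NULL;
}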
Other Primitives in Pthreads
• Set the attributes of a thread.
• Set the attributes of a mutex lock.
• Set scheduling parameters.
ECE 1747 Parallel Programming

Machine-independent
Performance Optimization Techniques
Returning to Sequential vs. Parallel
• Sequential execution time: t seconds.
• Startup overhead of parallel execution: t_st
seconds (depends on architecture)
• (Ideal) parallel execution time: t/p + t_st.
• If t/p + t_st > t, no gain.
General Idea
• Parallelism limited by dependences.
• Restructure code to eliminate or reduce
dependences.
• Sometimes possible by compiler, but good
to know how to do it by hand.
Optimizations: Example 16
for (i = 0; i < 100000; i++)
a[i + 1000] = a[i] + 1;

• Cannot be parallelized as is.
• May be parallelized by applying certain code transformations.
Example Transformation
for( i=1; i <= 100; i++ ) {
  int stride = i * 1000;
  for( j=0; j < 1000; j++ )
    a[stride+j] = a[j] + i;
}
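After the transformation, the iterations of the outer i loop write disjoint regions a[i*1000 .. i*1000+999] and only read a[0..999], so they can run in parallel. A sketch of one way to split them across P Pthreads (the function name chunk, the fixed P, and the array size are assumptions):

#include <pthread.h>

#define P 4                              /* number of threads, an assumption */

int a[101000];                           /* a[0..999] are inputs; a[1000..100999] are written */

void *chunk(void *arg)
{
    int me   = (int)(long)arg;
    int from = 1 + (me * 100) / P;       /* split i = 1..100 into P contiguous pieces */
    int to   = 1 + ((me + 1) * 100) / P;
    for (int i = from; i < to; i++) {
        int stride = i * 1000;
        for (int j = 0; j < 1000; j++)
            a[stride + j] = a[j] + i;    /* reads only a[0..999]; writes only its own region */
    }
    return NULL;
}

/* usage sketch: for (long t = 0; t < P; t++) pthread_create(&thr[t], NULL, chunk, (void *)t); then join all */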
Code Transformations
• Reorganize code such that
– dependences are removed or reduced
– large pieces of parallel work emerge

• Code can become messy … there is a point of diminishing returns.
Flavors of Parallelism
• Data parallelism: all processors do the same
thing on different data.
– Regular
– Irregular
• Task parallelism: processors do different
tasks.
– Task queue
– Pipelines
Task Parallelism
• Each process performs a different task.
• Two principal flavors:
– pipelines
– task queues
• Program Examples: PIPE (pipeline), TSP
(task queue).
Pipeline
• Often occurs with image processing applications, where a number of images undergo a sequence of transformations.
• E.g., rendering, clipping, compression, etc.
Sequential Program
for( i=0; i<num_pic, read(in_pic[i]); i++ ) {
int_pic_1[i] = trans1( in_pic[i] );
int_pic_2[i] = trans2( int_pic_1[i]);
int_pic_3[i] = trans3( int_pic_2[i]);
out_pic[i] = trans4( int_pic_3[i]);
}
Parallelizing a Pipeline
• For simplicity, assume we have 4
processors (i.e., equal to the number of
transformations).
• Furthermore, assume we have a very large
number of pictures (>> 4).
Parallelizing a Pipeline (part 1)
Processor 1:

for( i=0; i<num_pics, read(in_pic[i]); i++ ) {
  int_pic_1[i] = trans1( in_pic[i] );
  signal( event_1_2[i] );
}
Parallelizing a Pipeline (part 2)
Processor 2:

for( i=0; i<num_pics; i++ ) {
  wait( event_1_2[i] );
  int_pic_2[i] = trans2( int_pic_1[i] );
  signal( event_2_3[i] );
}

Same for processor 3.
Parallelizing a Pipeline (part 3)
Processor 4:

for( i=0; i<num_pics; i++ ) {
  wait( event_3_4[i] );
  out_pic[i] = trans4( int_pic_3[i] );
}
Use of Wait/Signal (Pipelining)
[Figure: sequential vs. parallel execution timelines; each pattern is one picture, each horizontal line is one processor]
PIPE
P1:for( i=0; i<num_pics, read(in_pic); i++ ) {
int_pic_1[i] = trans1( in_pic );
signal( event_1_2[i] );
}
P2: for( i=0; i<num_pics; i++ ) {
wait( event_1_2[i] );
int_pic_2[i] = trans2( int_pic_1[i] );
signal( event_2_3[i] );
}
PIPE Using Pthreads
• Replacing the original wait/signal by a
Pthreads condition variable wait/signal will
not work.
– signals before a wait are forgotten.
– we need to remember a signal.
How to remember a signal (1 of 2)
semaphore_signal(i) {
  pthread_mutex_lock(&mutex_rem[i]);
  arrived[i] = 1;
  pthread_cond_signal(&cond[i]);
  pthread_mutex_unlock(&mutex_rem[i]);
}
How to Remember a Signal (2 of 2)
semaphore_wait(i) {
  pthread_mutex_lock(&mutex_rem[i]);
  while( arrived[i] == 0 ) {   /* while guards against spurious wakeups */
    pthread_cond_wait(&cond[i], &mutex_rem[i]);
  }
  arrived[i] = 0;
  pthread_mutex_unlock(&mutex_rem[i]);
}
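The shared state assumed by semaphore_signal()/semaphore_wait() could be declared like this, one slot per pipeline event (MAX_EVENTS and the init helper are assumptions for illustration):

#define MAX_EVENTS 1000    /* e.g., one event per picture; the size is an assumption */

pthread_mutex_t mutex_rem[MAX_EVENTS];
pthread_cond_t  cond[MAX_EVENTS];
int arrived[MAX_EVENTS];   /* 1 if the signal arrived before the wait */

void semaphore_init(int i)
{
    pthread_mutex_init(&mutex_rem[i], NULL);
    pthread_cond_init(&cond[i], NULL);
    arrived[i] = 0;
}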
PIPE with Pthreads
P1:for( i=0; i<num_pics, read(in_pic); i++ ) {
int_pic_1[i] = trans1( in_pic );
semaphore_signal( event_1_2[i] );
}
P2: for( i=0; i<num_pics; i++ ) {
semaphore_wait( event_1_2[i] );
int_pic_2[i] = trans2( int_pic_1[i] );
semaphore_signal( event_2_3[i] );
}
Another Sequential Program
for( i=0; i<num_pic, read(in_pic); i++ ) {
int_pic_1 = trans1( in_pic );
int_pic_2 = trans2( int_pic_1);
int_pic_3 = trans3( int_pic_2);
out_pic = trans4( int_pic_3);
}
Can we use same parallelization?
Processor 2:

for( i=0; i<num_pics; i++ ) {
  wait( event_1_2[i] );
  int_pic_2 = trans2( int_pic_1 );
  signal( event_2_3[i] );
}

Same for processor 3.
Can we use same parallelization?
• No: because of the anti-dependence between stages, there is no parallelism.
• We used privatization to enable pipeline
parallelism.
• Used often to avoid dependences (not only
with pipelines).
• Costly in terms of memory.
In-between Solution
• Use n>1 buffers between stages.
• Block when buffers are full or empty.

[Figure: pipeline stages P1 → P2 → P3 → P4 with buffers between consecutive stages]
Perfect Pipeline ?

[Figure: timeline of a perfectly balanced pipeline; each pattern is one picture, each horizontal line is one processor]
Things are often not that perfect
• One stage takes more time than others.
• Stages take a variable amount of time.
• Extra buffers provide some cushion against
variability.
Task Parallelism
• Each process performs a different task.
• Two principal flavors:
– pipelines
– task queues
• Program Examples: PIPE (pipeline), TSP
(task queue).
TSP (Traveling Salesman)
• Goal:
– given a list of cities, a matrix of distances
between them, and a starting city,
– find the shortest tour in which all cities are
visited exactly once.
• Example of an NP-hard search problem.
• Algorithm: branch-and-bound.
Branching
Initialization:
– go from starting city to each possible city,
– put resulting partial path into priority queue, ordered by its current length.
Further (repeatedly):
– take head element out of priority queue,
– expand by each one of remaining cities,
– put resulting partial path into priority queue.
Finding the Solution
• Eventually, a complete path will be found.
• Remember its length as the current shortest
path.
• Every time a complete path is found, check
if we need to update current best path.
• When priority queue becomes empty, best
path is found.
Using a Simple Bound
• Once a complete path is found, we have a
bound on the length of shortest path.
• No use in exploring a partial path that is already longer than the current bound (the best complete path found so far).
Sequential TSP: Data Structures
• Priority queue of partial paths.
• Current best solution and its length.
• For simplicity, we will ignore bounding.
Sequential TSP: Code Outline
init_q(); init_best();
while( (p = de_queue()) != NULL ) {
  for each expansion by one city {
    q = add_city(p);
    if( complete(q) ) update_best(q);
    else en_queue(q);
  }
}
Parallel TSP: Possibilities
• Have each process do one expansion.
• Have each process do expansion of one
partial path.
• Have each process do expansion of multiple
partial paths.
• Issue of granularity/performance, not an
issue of correctness.
• Assume: process expands one partial path.
Parallel TSP: Synchronization
• True dependence between process that puts
partial path in queue and the one that takes
it out.
• Dependences arise dynamically.
• Required synchronization: need to make
process wait if q is empty.
Parallel TSP: First cut (part 1)
process i:
while( (p = de_queue()) != NULL ) {
  for each expansion by one city {
    q = add_city(p);
    if( complete(q) ) update_best(q);
    else en_queue(q);
  }
}
Parallel TSP: First cut (part 2)
• In de_queue: wait if q is empty
• In en_queue: signal that q is no longer
empty
Parallel TSP: More synchronization

• All processes operate, potentially at the same time, on q and best.
• This race must not be allowed to happen.
• Critical section: only one process can execute in a critical section at once.
Parallel TSP: Critical Sections
• All shared data must be protected by critical
section.
• Update_best must be protected by a critical
section.
• En_queue and de_queue must be protected
by the same critical section.
Termination condition
• How do we know when we are done?
• All processes are waiting inside de_queue.
• Count the number of waiting processes
before waiting.
• If equal to total number of processes, we are
done.
Parallel TSP
process i:
while( (p = de_queue()) != NULL ) {
  for each expansion by one city {
    q = add_city(p);
    if( complete(q) ) update_best(q);
    else en_queue(q);
  }
}
Parallel TSP
• Need critical section
– in update_best,
– in en_queue/de_queue.
• In de_queue
– wait if q is empty,
– terminate if all processes are waiting.
• In en_queue:
– signal q is no longer empty.
Parallel TSP: Mutual Exclusion
en_queue() / de_queue() {
  pthread_mutex_lock(&queue);
  …;
  pthread_mutex_unlock(&queue);
}

update_best() {
  pthread_mutex_lock(&best);
  …;
  pthread_mutex_unlock(&best);
}
Parallel TSP: Condition Synchronization
de_queue() {
  while( (q is empty) and (not done) ) {
    waiting++;
    if( waiting == p ) {
      done = true;
      pthread_cond_broadcast(&empty);
    }
    else {
      pthread_cond_wait(&empty, &queue);
      waiting--;
    }
  }
  if( done )
    return NULL;
  else
    remove and return head of the queue;
}
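A matching sketch of en_queue() in the same pseudocode style, signaling that the queue is no longer empty (the insertion step itself is elided; the signal call is an assumption consistent with the earlier slide "In en_queue: signal q is no longer empty"):

en_queue(q) {
  pthread_mutex_lock(&queue);
  /* insert q into the priority queue ... */
  pthread_cond_signal(&empty);     /* wake one thread blocked in de_queue() */
  pthread_mutex_unlock(&queue);
}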
Parallel TSP
• Complete parallel program will be provided
on the Web.
• Includes wait/signal on empty q.
• Includes critical sections.
• Includes termination condition.
Factors that Determine Speedup
• Characteristics of parallel code
– granularity
– load balance
– locality
– communication and synchronization
Granularity
• Granularity = size of the program unit that
is executed by a single processor.
• May be a single loop iteration, a set of loop
iterations, etc.
• Fine granularity leads to:
– (positive) ability to use lots of processors
– (positive) finer-grain load balancing
– (negative) increased overhead
Granularity and Critical Sections
• Small granularity => more processors =>
more critical section accesses => more
contention.
Issues in Performance of Parallel Parts

• Granularity.
• Load balance.
• Locality.
• Synchronization and communication.
Load Balance
• Load imbalance = difference in execution time across processors between barriers.
• Execution time may not be predictable.
– Regular data parallel: yes.
– Irregular data parallel or pipeline: perhaps.
– Task queue: no.
Static vs. Dynamic
• Static: done once, by the programmer
– block, cyclic, etc.
– fine for regular data parallel
• Dynamic: done at runtime
– task queue
– fine for unpredictable execution times
– usually high overhead
• Semi-static: done once, at run-time
Choice is not inherent
• MM or SOR could be done using task
queues: put all iterations in a queue.
– In heterogeneous environment.
– In multitasked environment.
Static Load Balancing
• Block
– best locality
– possibly poor load balance
• Cyclic
– better load balance
– worse locality
• Block-cyclic
– load balancing advantages of cyclic (mostly)
– better locality
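For n iterations and p processors, the three static distributions can be expressed as follows (a sketch; the helper names and the block size B are assumptions):

/* block: processor k owns the contiguous range [k*n/p, (k+1)*n/p) */
int block_from(int k, int n, int p) { return (k * n) / p; }
int block_to(int k, int n, int p)   { return ((k + 1) * n) / p; }

/* cyclic: iteration i goes to processor i % p */
int owner_cyclic(int i, int p) { return i % p; }

/* block-cyclic with block size B: blocks of B consecutive iterations, dealt out round robin */
int owner_block_cyclic(int i, int p, int B) { return (i / B) % p; }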
Dynamic Load Balancing (1 of 2)
• Centralized: single task queue.
– Easy to program
– Excellent load balance
• Distributed: task queue per processor.
– Less contention during synchronization
Dynamic Load Balancing (2 of 2)
• Task stealing with distributed queues:
– Processes normally remove and insert tasks
from their own queue.
– When queue is empty, remove task(s) from
other queues.
• Extra overhead and programming difficulty.
• Better load balancing.
Semi-static Load Balancing
• Measure the cost of program parts.
• Use measurement to partition computation.
• Done once, done every iteration, done every
n iterations.
Molecular Dynamics (MD)
• Simulation of a set of bodies under the
influence of physical laws.
• Atoms, molecules, celestial bodies, ...
• Have same basic structure.

[Figure: forces F acting between pairs of bodies]
Molecular Dynamics (Skeleton)
for some number of timesteps {
for all molecules i
for all other molecules j
force[i] += f( loc[i], loc[j] );
for all molecules i
loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics
• To reduce amount of computation, account
for interaction only with nearby molecules.
Molecular Dynamics (continued)
for some number of timesteps {
for all molecules i
for all nearby molecules j
force[i] += f( loc[i], loc[j] );
for all molecules i
loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics (continued)
for each molecule i:
  count[i]: number of nearby molecules
  index[j]: indices of nearby molecules ( 0 <= j < count[i] )
Molecular Dynamics (continued)
for some number of timesteps {
  for( i=0; i<num_mol; i++ )
    for( j=0; j<count[i]; j++ )
      force[i] += f( loc[i], loc[index[j]] );
  for( i=0; i<num_mol; i++ )
    loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics (simple)
for some number of timesteps {
parallel for
for( i=0; i<num_mol; i++ )
for( j=0; j<count[i]; j++ )
force[i] += f(loc[i],loc[index[j]]);
parallel for
for( i=0; i<num_mol; i++ )
loc[i] = g( loc[i], force[i] );
}
Molecular Dynamics (simple)
• Simple to program.
• Possibly poor load balance
– block distribution of i iterations (molecules)
– could lead to uneven neighbor distribution
– cyclic does not help
Better Load Balance
• Assign iterations such that each processor
has ~ the same number of neighbors.
• Array of “assign records”
– size: number of processors
– two elements:
• beginning i value (molecule)
• ending i value (molecule)
• Recompute partition periodically
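A sketch of how the "assign records" might be recomputed from the neighbor counts: walk the molecules, cutting a new range whenever a processor has accumulated roughly its share of the total neighbor count (the struct, function, and variable names are assumptions):

struct assign { int begin, end; };           /* first and one-past-last molecule index */

void repartition(struct assign a[], int p, int count[], int num_mol)
{
    long total = 0;
    for (int i = 0; i < num_mol; i++)
        total += count[i];                   /* total interaction work */

    long target = (total + p - 1) / p;       /* roughly equal work per processor */
    int mol = 0;
    for (int k = 0; k < p; k++) {
        long work = 0;
        a[k].begin = mol;
        while (mol < num_mol && (work < target || k == p - 1))
            work += count[mol++];            /* last processor takes whatever is left */
        a[k].end = mol;
    }
}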
Frequency of Balancing
• Every time neighbor list is recomputed.
– once during initialization.
– every iteration.
– every n iterations.
• Extra overhead vs. better approximation
and better load balance.
Summary
• Parallel code optimization
– Granularity
– Load balance
– Locality
– Synchronization
