
4 Concurrent Programming
4.1 Introduction to Parallel Computing
In the early days, most computers had only one processing element,
known as the processor or Central Processing Unit (CPU).
Computer programs were traditionally written for serial computation.
Parallel computing is a computing scheme which tries to use multiple
processors executing parallel algorithms to solve problems faster.
With the advent of multicore processors in recent years, most operating
systems, such as Linux, support Symmetrical Multiprocessing (SMP).
The future of computing is clearly in the direction of parallel
computing.
4.1.1 Sequential Algorithms vs. Parallel Algorithms

A sequential algorithm consists of many steps, all of which are performed by a
single task serially, one step at a time.
A parallel algorithm consists of separate tasks, which are performed in parallel.
4.1.2 Parallelism vs. Concurrency
In general, a parallel algorithm only identifies tasks that can be executed
in parallel, but it does not specify how to map the tasks to processing
elements.
Ideally, all the tasks in a parallel algorithm should be executed
simultaneously in real time.
However, true parallel executions can only be achieved in systems with
multiple processing elements, such as multiprocessor or multicore
systems.
On single CPU systems, only one task can execute at a time.
In this case, different tasks can only execute concurrently, i.e. logically
in parallel.
4.2 Threads
4.2.1 Principle of Threads
In the process model, processes are independent execution units.
Each process executes in either kernel mode or user mode.
While in user mode, each process executes in a unique address space,
which is separated from other processes.
Although each process is an independent unit, it has only one execution
path.
Whenever a process must wait for something, e.g. an I/O completion
event, it becomes suspended and the entire process execution stops.
Threads are independent execution units in the same address space of a process.
A process is created in a unique address space with a main
thread.
When a process begins, it executes the main thread of the process.
The main thread may create other threads. Each thread may create yet more
threads.
All threads in a process execute in the same address space of the process but each
thread is an independent execution unit.
In the threads model, if a thread becomes suspended, other threads may continue
to execute.
In addition to sharing a common address space, threads also share many other
resources of a process, such as user id, opened file descriptors and signals.
Currently, almost all operating systems support Pthreads, the threads standard
of IEEE POSIX 1003.1c.
4.2.2 Advantages of Threads
Threads have many advantages over processes.
(1). Thread creation and switching are faster.
(2). Threads are more responsive.
(3). Threads are better suited to parallel computing.
Parallel algorithms often require the execution entities to share common
data.
In the process model, processes cannot share data efficiently because their
address spaces are all distinct; they must use Interprocess Communication
(IPC) to exchange data.
Threads in the same process share all the (global) data in the same address
space.
4.2.3 Disadvantages of Threads
(1). Because of the shared address space, threads need explicit
synchronization from the user.
(2). Many library functions may not be thread safe, e.g. the traditional
strtok() function, which divides a string into tokens in place.
In general, any function which uses global variables or relies on
contents of static memory is not thread safe.
(3). On single CPU systems, using threads to solve problems is actually
slower than using a sequential program due to the overhead in threads
creation and context switching at run-time.
4.4 Threads Management Functions
The Pthreads library offers the following APIs for threads management.
pthread_create(thread, attr, function, arg) : create thread
pthread_exit(status) : terminate thread
pthread_cancel(thread) : cancel thread
pthread_attr_init(attr) : initialize thread attributes
pthread_attr_destroy(attr) : destroy thread attribute
4.4.1 Create Thread
Threads are created by the pthread_create() function.
int pthread_create (pthread_t *pthread_id, pthread_attr_t *attr,
void *(*func)(void *), void *arg);
which returns 0 on success or an error number on failure. Parameters to
the pthread_create() function are
• pthread_id is a pointer to a variable of the pthread_t type. It will be
filled with the unique thread ID assigned by the OS kernel.
• A thread may get its own ID by the pthread_self() function.
• In Linux, pthread_t type is defined as unsigned long, so thread ID can
be printed as %lu.
• attr is a pointer to another opaque data type, which specifies the thread
attributes.
• func is the entry address of a function for the new thread to execute.
• arg is a pointer to a parameter for the thread function, which can be
written as void *func(void *arg).
• The steps of using an attr parameter are as follows.
(1). Define a pthread attribute variable pthread_attr_t attr
(2). Initialize the attribute variable with pthread_attr_init(&attr)
(3). Set the attribute variable and use it in pthread_create() call
(4). If desired, free the attr resource by pthread_attr_destroy(&attr)
• By default, every thread is created to be joinable with other threads.
• If desired, a thread can be created with the detached attribute, which
makes it non-joinable with other threads.
• The following code segment shows how to create a detached thread.
pthread_attr_t attr;        // define an attr variable
pthread_attr_init(&attr);   // initialize attr
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); // set attr
pthread_create(&thread_id, &attr, func, NULL); // create thread with attr
pthread_attr_destroy(&attr); // optional: destroy attr
• Every thread is created with a default stack size. During execution, a
thread may find out its stack size by the function
int pthread_attr_getstacksize(pthread_attr_t *attr, size_t *size);
which returns the stack size in *size.
The following code segment shows how to create a thread with a
specific stack size.
pthread_attr_t attr;        // attr variable
size_t stacksize;           // stack size
pthread_attr_init(&attr);   // initialize attr
stacksize = 0x10000;        // stacksize = 64KB
pthread_attr_setstacksize(&attr, stacksize);    // set stack size in attr
pthread_create(&threads[t], &attr, func, NULL); // create thread with stack size
• If the attr parameter is NULL, threads will be created with default
attributes.
• In fact, this is the recommended way of creating threads, which should
be followed unless there is a compelling reason to alter the thread
attributes.
4.4.2 Thread ID
Thread ID is an opaque data type, which depends on implementation.
Thread IDs should not be compared directly.
If needed, they can be compared by the pthread_equal() function.
int pthread_equal (pthread_t t1, pthread_t t2);
which returns zero if the threads are different threads, non-zero
otherwise.
4.4.3 Thread Termination
A thread terminates when the thread function finishes.
Alternatively, a thread may call the function
int pthread_exit (void *status);
to terminate explicitly, where status is the exit status of the thread.
As usual, a 0 exit value means normal termination, and non-zero values
mean abnormal termination.
4.4.4 Thread Join
A thread can wait for the termination of another thread by
int pthread_join (pthread_t thread, void **status_ptr);
The exit status of the terminated thread is returned in status_ptr.
4.5.1 Sum of Matrix by Threads
Example 4.1: Assume that we want to compute the sum of all the
elements in an N×N matrix of integers.
The program must be compiled as
gcc C4.1.c -pthread
4.5.2 Quicksort by Threads
Example 4.2: Quicksort by Concurrent Threads
4.6 Threads Synchronization
Since threads execute in the same address space of a process, they share
all the global variables and data structures in the same address space.
When several threads try to modify the same shared variable or data
structure, if the outcome depends on the execution order of the threads,
it is called a race condition.
In order to prevent race conditions, as well as to support threads
cooperation, threads need synchronization.
In general, synchronization refers to the mechanisms and rules used to
ensure the integrity of shared data objects and coordination of
concurrently executing entities.
4.6.1 Mutex Locks
In Pthreads, locks are called mutex, which stands for Mutual Exclusion.
Mutex variables are declared with the type pthread_mutex_t, and they
must be initialized before using.
(1) Statically, as in
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
which defines a mutex variable m and initializes it with default
attributes.
(2) Dynamically with the pthread_mutex_init() function, which allows
setting the mutex attributes by an attr parameter, as in
pthread_mutex_init(pthread_mutex_t *m, pthread_mutexattr_t *attr);
As usual, the attr parameter can be set to NULL for default attributes.
After initialization, mutex variables can be used by threads via the
following functions.
int pthread_mutex_lock(pthread_mutex_t *m);    // lock mutex
int pthread_mutex_unlock(pthread_mutex_t *m);  // unlock mutex
int pthread_mutex_trylock(pthread_mutex_t *m); // try to lock mutex
int pthread_mutex_destroy(pthread_mutex_t *m); // destroy mutex
• Threads use mutexes as locks to protect shared data objects.
• A thread first creates a mutex and initializes it once.
• A newly created mutex is in the unlocked state and without an owner.
• Each thread tries to access a shared data object by
pthread_mutex_lock(&m);    // lock mutex
access shared data object; // access shared data in a critical region
pthread_mutex_unlock(&m);  // unlock mutex
A sequence of executions which can only be performed by one
execution entity at a time is commonly known as a Critical Region
(CR).
Example 4.3: This example is a modified version of Example 4.1. As
before, we shall use N working threads to compute the sum of all the
elements of an NxN matrix of integers. Each working thread computes
the partial sum of a row. Instead of depositing the partial sums in a
global sum[ ] array, each working thread tries to update a global
variable, total, by adding its partial sum to it.
4.6.2 Deadlock Prevention
Deadlock is a condition in which several execution entities mutually wait
for one another so that none of them can proceed.
Consider two threads with crossed locking requests: T1 holds mutex m1 and
requests m2, while T2 holds m2 and requests m1.
In this case, T1 and T2 would mutually wait for each other forever, so
they are in a deadlock due to crossed locking requests.
There are many ways to deal with possible deadlocks, which include
deadlock prevention, deadlock avoidance, deadlock detection and
recovery, etc.
In real systems, the only practical way is deadlock prevention, which tries
to prevent deadlocks from occurring when designing parallel algorithms.
A simple way to prevent deadlock is to order the mutexes and ensure that
every thread requests mutex locks only in a single direction, so that there
are no loops in the request sequences.
However, it may not be possible to design every parallel algorithm with
only uni-directional locking requests.
In such cases, the conditional locking function,
pthread_mutex_trylock(),
may be used to prevent deadlocks.
The trylock() function returns immediately with an error if the mutex is
already locked.
In that case, the calling thread may back off by releasing some of the
locks it already holds, allowing other threads to continue.
In the above crossed locking example, we may redesign one of the
threads, e.g. T1, to lock m2 with pthread_mutex_trylock() and back off
if the attempt fails.
4.6.3 Condition Variables
Condition variables provide a means for threads cooperation.
Condition variables are always used in conjunction with mutex locks.
In Pthreads, condition variables are declared with the type
pthread_cond_t, and must be initialized before using.
Like mutexes, condition variables can also be initialized in two ways.
(1) Statically, when it is declared, as in
pthread_cond_t con = PTHREAD_COND_INITIALIZER;
which defines a condition variable, con, and initializes it with default
attributes.
(2) Dynamically with the pthread_cond_init() function, which allows
setting a condition variable with an attr parameter.
For simplicity, we shall always use a NULL attr parameter for default
attributes.

When using a condition variable, a thread must first acquire the
associated mutex lock.
Then it performs operations within the Critical Region of the mutex
lock and releases the mutex lock.
Within the CR of the mutex lock, threads may use condition variables to
cooperate with one another via the following functions.
pthread_cond_wait(condition, mutex):
• This function blocks the calling thread until the specified condition is
signaled. It should be called while mutex is locked.
• It will automatically release the mutex lock while the thread waits.
• After signal is received and a blocked thread is awakened, mutex will
be automatically locked.
pthread_cond_signal(condition):
• This function is used to signal, i.e. to wake up or unblock, a thread
which is waiting on the condition variable.
• It should be called after mutex is locked, and must unlock mutex in
order for pthread_cond_wait() to complete.
pthread_cond_broadcast(condition):
• This function unblocks all threads that are blocked on the condition
variable.
• All unblocked threads will compete for the same mutex to access the
condition variable.
• Their order of execution depends on threads scheduling.
4.6.4 Producer-Consumer Problem
Example 4.4: In this example, we shall implement a simplified version
of the producer-consumer problem, which is also known as the bounded
buffer problem, using threads and condition variables.
In the example program, we shall assume that each buffer holds an
integer value. The shared global variables are defined as follows.
• The index variable head is for putting an item into an empty buffer.
• The index variable tail is for taking an item out of a full buffer.
• The variable data is the number of full buffers.
To support cooperation between producer and consumer, we define a
mutex and two condition variables.
• empty represents the condition of any empty buffers.
• full represents the condition of any full buffers.
• When a producer finds there are no empty buffers, it waits on the
empty condition variable.
• When a consumer finds there are no full buffers, it waits on the full
condition variable.
• Figure 4.4 shows the output of running the producer-consumer
example program.
4.6.5 Semaphores
Semaphores are general mechanisms for process synchronization. A (counting)
semaphore is a data structure containing an integer value and a queue of
blocked processes.
Before using, a semaphore must be initialized with an initial value and an
empty waiting queue.
The low-level implementation of semaphores guarantees that each semaphore
can only be operated on by one executing entity at a time, and operations on
semaphores are atomic (indivisible) or primitive from the viewpoint of
executing entities.
The most well-known operations on semaphores are P and V, which may be
defined as
P(s): s.value--; if (s.value < 0)  BLOCK(s);
V(s): s.value++; if (s.value <= 0) SIGNAL(s);
where BLOCK(s) blocks the calling process in the semaphore's waiting queue,
and SIGNAL(s) unblocks a process from the semaphore's waiting queue.
Semaphores are not part of the original Pthreads standard.
However, most Pthreads now support semaphores of POSIX 1003.1b.
POSIX semaphores include the following functions:
int sem_init(sem_t *sem, int pshared, unsigned value) : initialize sem with an initial value
int sem_wait(sem_t *sem) : similar to P(sem)
int sem_post(sem_t *sem) : similar to V(sem)
• In Pthreads, mutexes are strictly for locking and condition variables
are for threads cooperation.
• In contrast, counting semaphores with initial value 1 can be used as
locks.
• Semaphores with other initial values can be used for cooperation.
• Therefore, semaphores are more general and flexible than condition
variables.
The following example illustrates the advantages of semaphores over
condition variables.
Empty=N and full=0 are semaphores for producers and consumers to
cooperate with one another, and mutex=1 is a lock semaphore for
processes to access shared buffers one at a time in a Critical Region.
4.6.6 Barriers
The threads join operation allows a thread (usually the main thread) to
wait for the termination of other threads.
After all awaited threads have terminated, the main thread may create
new threads to continue executing the next parts of a parallel program.
There are situations in which it would be better to keep the threads alive
but require them not to go on until all of them have reached a prescribed
point of synchronization.
In Pthreads, the mechanism is the barrier, along with a set of barrier
functions.
First, the main thread creates a barrier object
pthread_barrier_t barrier;
and calls
pthread_barrier_init(&barrier, NULL, nthreads);
to initialize it with the number of threads that are to be synchronized at
the barrier.
Then the main thread creates working threads to perform tasks. The
working threads use
pthread_barrier_wait(&barrier)
to wait at the barrier until the specified number of threads have reached
the barrier.
When the last thread arrives at the barrier, all the threads resume
execution again.
4.6.7 Solve System of Linear Equations by Concurrent Threads
We demonstrate applications of concurrent threads, thread join, and barrier
operations with an example.
Example 4.6: The example is to solve a system of linear equations by concurrent
threads. Assume AX = B is a system of linear equations, where A is an NxN matrix of
real numbers, X is a column vector of N unknowns and B is a column vector of
constants. The problem is to compute the solution vector X. The best-known
algorithm for solving systems of linear equations is Gauss elimination. The algorithm
consists of 2 major steps: row reduction, which reduces the combined matrix [A|B] to
an upper-triangular form, followed by back substitution, which computes the solution
vector X. In the row-reduction steps, partial pivoting is a scheme which ensures that the
leading element of the row used to reduce other rows has the maximum absolute value.
Partial pivoting helps improve the accuracy of the numerical computations. The
following shows a Gauss elimination algorithm with partial pivoting.
/******** Gauss Elimination Algorithm with Partial Pivoting *******/
Step 1: Row reduction: reduce [A|B] to upper triangular form
for (i=0; i<N; i++){            // for rows i = 0 to N-1
    do partial pivoting;        // exchange rows if needed
    (1).                        // barrier
    for (j=i+1; j<N; j++){      // for rows j = i+1 to N-1
        f = A[j][i]/A[i][i];    // reduction factor
        for (k=i+1; k<=N; k++){ // for columns k = i+1 to N (includes B)
            A[j][k] -= A[i][k]*f;  // reduce row j by row i
        }
        A[j][i] = 0;            // leading element of row j becomes 0
    }
    (2).                        // barrier
}
(3).                            // join
Step 2: Back Substitution: compute x[N-1], x[N-2], ..., x[0] in that order
The following shows the complete code of the Example Program C4.5.
Figure 4.5 shows the sample outputs.
4.6.8 Threads in Linux
Linux does not distinguish processes from threads.
To the Linux kernel, a thread is merely a process that shares certain
resources with other processes.
In Linux both processes and threads are created by the clone() system
call, which has the prototype
int clone(int (*fn)(void *), void *child_stack, int flags, void *arg)
clone() is more like a thread creation function. It creates a child process
to execute a function fn(arg) with a child_stack.
The flags field specifies the resources to be shared by the parent and child,
which includes
• CLONE_VM: parent and child share address space
• CLONE_FS: parent and child share file system information, e.g. root,
CWD
• CLONE_FILES: parent and child share opened files
• CLONE_SIGHAND: parent and child share signal handlers and blocked
signals
If any of the flags is specified, both processes share exactly the SAME
resource, not a separate copy of the resource.
If a flag is not specified, the child process usually gets a separate copy of
the resource.
The Linux kernel retains fork() as a system call but it may be implemented
as a library wrapper that calls clone() with appropriate flags.
