Concurrency
A thread is an independent sequential execution path through a program. Each thread is scheduled for execution separately and independently from other threads. A process is a program component (like a routine) that has its own thread and the same state information as a coroutine. A task is similar to a process except that it is reduced along some particular dimension (like the difference between a boat and a ship: one is physically smaller than the other). It is often the case that a process has its own memory, while tasks share a common memory. A task is sometimes called a light-weight process (LWP). Parallel execution is when 2 or more operations occur simultaneously, which can only occur when multiple processors (CPUs) are present. Concurrent execution is any situation in which execution of multiple threads appears to be performed in parallel. It is the threads of control associated with processes and tasks that result in concurrent execution.
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
How and when do these parts interact, or are they independent? If interaction is necessary, what information must be communicated during the interaction?
Concurrent programs are hard to debug: concurrent operations proceed at varying speeds and in non-deterministic order, hence execution is not repeatable (Heisenbug). Reasoning about multiple streams or threads of execution and their interactions is much more complex than for a single thread.
E.g., moving furniture out of a room: you can't do it alone, but how many helpers are needed, and how should the work be organized to minimize cost?
How many helpers? 1, 2, 3, ..., N, where N is the number of items of furniture; more than N?
Where are the bottlenecks? The door out of the room, items in front of other items, large items.
What communication is necessary between the helpers? Which item to take next; some are fragile and need special care; big items need several helpers working together.
Implicit and explicit mechanisms are complementary, and hence can appear together in a single programming language. However, the limitations of implicit mechanisms require that explicit mechanisms always be available to achieve maximum concurrency. µC++ only supports explicit mechanisms, but nothing in its design precludes implicit mechanisms. Some concurrent systems provide a single technique or paradigm that must be used to solve all concurrent problems. While a particular paradigm may be very good for solving certain kinds of problems, it may be awkward for or preclude other kinds of solutions. Therefore, a good concurrent system must support a variety of different concurrent approaches, while at the same time not requiring the programmer to work at too low a level. Fundamentally, as the amount of concurrency increases, so does the complexity to express and manage it.
Parallelism is simulated by rapidly context switching the CPU back and forth between threads. Unlike coroutines, task switching may occur at non-deterministic program locations, i.e., between any two machine instructions. Switching is usually based on a timer interrupt that is independent of program execution. Alternatively, tasks may run on the same computer using separate CPUs but sharing the same memory (multiprocessor):
[diagram: multiprocessor — one computer with multiple CPUs sharing memory, each CPU executing a task's program and state]
These tasks run in parallel with each other. Processes may be on different computers using separate CPUs and separate memories (distributed system):
[diagram: distributed system — computer1 and computer2, each with its own CPU, memory, and process (state and program)]
These processes run in parallel with each other. By examining the first case, which is the simplest, all of the problems that occur with parallelism can be illustrated.
State transitions are initiated in response to events:
- timer alarm (running → ready)
- completion of I/O operation (blocked → ready)
- exceeding some limit (CPU time, etc.) (running → halted)
- exceptions (running → halted)
1. thread creation — the ability to cause another thread of control to come into existence.
2. thread synchronization — the ability to establish timing relationships among threads, e.g., same time, same rate, happens before/after.
3. thread communication — the ability to correctly transmit data among threads.
Thread creation must be a primitive operation; it cannot be built from other operations in a language. A new construct is needed to create a thread and define where the thread starts execution, e.g., COBEGIN/COEND:
BEGIN                          -- initial thread creates internal threads,
    COBEGIN                    -- one for each statement in this block
        BEGIN i := 1; ... END;
        p1(5);                 -- order and speed of execution
        p2(7);                 -- of internal threads is unknown
        p3(9);
    COEND                      -- initial thread waits for all internal threads
END;                           -- to finish (synchronize) before control continues
[diagram: thread p forks into threads for i := 1, p1(...), p2(...), p3(...) at COBEGIN, with a nested COBEGIN inside p1, and all threads join at COEND before p continues]
Restricted to creating trees (lattice) of threads. In µC++, a task must be created for each statement of a COBEGIN using a _Task object:
_Task T1 { void main() { i = 1; } };
_Task T2 { void main() { p1(5); } };
_Task T3 { void main() { p2(7); } };
_Task T4 { void main() { p3(9); } };

void uMain::main() {
    // int i, j, k; ???
    {   // COBEGIN
        T1 t1; T2 t2; T3 t3; T4 t4;
    }   // COEND
}
void p1(. . .) {
    {   // COBEGIN
        T5 t5; T6 t6; T7 t7; T8 t8;
    }   // COEND
}
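For readers without µC++, the same block-structured fork/join can be sketched in standard C++ with std::thread. This is a minimal sketch: the globals i, r1, r2, r3 and the routines p1, p2, p3 are hypothetical stand-ins for the statements in the COBEGIN.

```cpp
#include <thread>

int i = 0, r1 = 0, r2 = 0, r3 = 0;       // shared results (assumed for illustration)
void p1( int x ) { r1 = x; }             // stand-ins for the COBEGIN statements
void p2( int x ) { r2 = x; }
void p3( int x ) { r3 = x; }

void cobegin() {
    // COBEGIN: one thread per statement; order and speed of execution are unknown
    std::thread t1( [](){ i = 1; } );
    std::thread t2( p1, 5 );
    std::thread t3( p2, 7 );
    std::thread t4( p3, 9 );
    // COEND: the initial thread waits (joins) for all internal threads to finish
    t1.join(); t2.join(); t3.join(); t4.join();
}
```

As in the µC++ version, the joins at the end of the block provide the COEND synchronization; the threads themselves may run in any order.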
Unusual to create objects in a block and not use them. For task objects, the block waits for each task's thread to finish. An alternative approach for thread creation is START/WAIT, which can create an arbitrary thread graph:
PROGRAM p
    PROC p1(. . .) . . .
    FUNCTION f1(. . .) . . .
    INT i;
BEGIN
    START p1(5);    -- (fork) thread starts in p1; continue execution, do not wait for p1
    s1
    START f1(8);    -- (fork) thread starts in f1
    s2
    WAIT p1;        -- (join) wait for p1 to finish
    s3
    WAIT i := f1;   -- (join) wait for f1 to finish
    s4
END
[diagram: thread graph — p forks p1 and f1, runs s1/s2 concurrently with them, then joins p1 before s3 and f1 before s4]
COBEGIN
    p1(. . .)
    p2(. . .)
COEND

is equivalent to:

START p1(. . .)
START p2(. . .)
WAIT p2
WAIT p1
In µC++:
_Task T1 {
    void main() { p1(5); }
};
_Task T2 {
    int temp;
    void main() { temp = f1(8); }
  public:
    ~T2() { i = temp; }
};

void uMain::main() {
    T1 *p1p = new T1;    // start a T1
    . . . s1 . . .
    T2 *f1p = new T2;    // start a T2
    . . . s2 . . .
    delete p1p;          // wait for p1
    . . . s3 . . .
    delete f1p;          // wait for f1
    . . . s4 . . .
}
Variable i cannot be assigned until the delete of f1p; otherwise the value could change during s2/s3. This approach allows the same routine to be started multiple times with different arguments.
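START/WAIT maps naturally onto std::async and std::future in standard C++. This is a sketch, not the µC++ mechanism: p1 and f1 here are hypothetical stand-ins (f1 arbitrarily doubles its argument), and the s1–s4 comments mark where the intervening statements would go.

```cpp
#include <future>

int p1_calls = 0;                       // observable side effect (assumed for illustration)
void p1( int x ) { p1_calls = x; }      // stand-in for PROC p1
int  f1( int x ) { return x * 2; }      // stand-in for FUNCTION f1

int start_wait() {
    auto tp1 = std::async( std::launch::async, p1, 5 );  // START p1(5): fork
    // ... s1 ...
    auto tf1 = std::async( std::launch::async, f1, 8 );  // START f1(8): fork
    // ... s2 ...
    tp1.wait();                         // WAIT p1: join
    // ... s3 ...
    int i = tf1.get();                  // WAIT i := f1: join and fetch the result
    // ... s4 ...
    return i;
}
```

Note how get() combines the join with communication of the function result, just as WAIT i := f1 does.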
[diagram: matrix rows summed by tasks — T0: 23 10 5 7, T1: -1 6 11 20, T2: 56 -13 6 0, T3: -2 8 -5 1; each task produces a subtotal, and the subtotals are added into total]

_Task Adder {
    int *row, size, &subtotal;
    void main() {
        subtotal = 0;
        for ( int r = 0; r < size; r += 1 ) {
            subtotal += row[r];
        }
    }
  public:
    Adder( int row[ ], int size, int &subtotal ) :
        row( row ), size( size ), subtotal( subtotal ) {}
};
void uMain::main() {
    int rows = 10, cols = 10;
    int matrix[rows][cols], subtotals[rows], total = 0, r;
    Adder *adders[rows];
    // read in matrix
    for ( r = 0; r < rows; r += 1 ) {      // start threads to sum rows
        adders[r] = new Adder( matrix[r], cols, subtotals[r] );
    }
    for ( r = 0; r < rows; r += 1 ) {      // wait for threads to finish
        delete adders[r];
        total += subtotals[r];
    }
    cout << total << endl;
}
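The same row-summing pattern can be sketched with std::thread: start one thread per row, join each thread, and accumulate its subtotal. This is a sketch of the idea, not the µC++ code; parallel_total is a hypothetical helper name.

```cpp
#include <thread>
#include <vector>
#include <numeric>

// Sum each row of a matrix in its own thread, then total the subtotals.
int parallel_total( const std::vector<std::vector<int>> &matrix ) {
    std::vector<int> subtotals( matrix.size() );
    std::vector<std::thread> adders;
    for ( std::size_t r = 0; r < matrix.size(); r += 1 ) {   // start threads to sum rows
        adders.emplace_back( [&matrix, &subtotals, r]() {
            subtotals[r] = std::accumulate( matrix[r].begin(), matrix[r].end(), 0 );
        } );
    }
    int total = 0;
    for ( std::size_t r = 0; r < adders.size(); r += 1 ) {   // wait for threads to finish
        adders[r].join();
        total += subtotals[r];
    }
    return total;
}
```

As in the µC++ version, each thread writes only its own subtotal, so no mutual exclusion is needed; the joins provide the synchronization before the subtotals are read.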
bool Insert = false, Remove = false;
int Data;

_Task Prod {
    int N;
    void main() {
        for ( int i = 1; i <= N; i += 1 ) {
            Data = i;                   // transfer data
            Insert = true;
            while ( ! Remove ) {}       // busy wait
            Remove = false;
        }
    }
  public:
    Prod( int N ) : N( N ) {}
};
_Task Cons {
    int N;
    void main() {
        int data;
        for ( int i = 1; i <= N; i += 1 ) {
            while ( ! Insert ) {}       // busy wait
            Insert = false;
            data = Data;                // remove data
            Remove = true;
        }
    }
  public:
    Cons( int N ) : N( N ) {}
};
void uMain::main() {
    Prod prod( 5 );
    Cons cons( 5 );
}
Two infinite loops? No, because of implicit switching of threads: cons synchronizes (waits) until prod transfers some data, then prod waits for cons to remove the data. Are 2 synchronization flags necessary?
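The two-flag handoff can be sketched in standard C++ with std::atomic flags (plain bools would be a data race under the C++ memory model). This is a sketch: the consumer collects the transferred values into a vector so the result can be inspected, and run_handoff is a hypothetical driver.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <functional>

std::atomic<bool> Insert{ false }, Remove{ false };
int Data;

void prod( int N ) {
    for ( int i = 1; i <= N; i += 1 ) {
        Data = i;                       // transfer data
        Insert = true;
        while ( ! Remove ) {}           // busy wait until consumer removes it
        Remove = false;
    }
}
void cons( int N, std::vector<int> &out ) {
    for ( int i = 1; i <= N; i += 1 ) {
        while ( ! Insert ) {}           // busy wait until producer inserts
        Insert = false;
        out.push_back( Data );          // remove data
        Remove = true;
    }
}
std::vector<int> run_handoff( int N ) {
    std::vector<int> out;
    std::thread p( prod, N ), c( cons, N, std::ref( out ) );
    p.join(); c.join();
    return out;                         // receives 1, 2, ..., N in order
}
```

The sequentially consistent atomic operations ensure the write to Data happens before the consumer's read, so each value is handed over intact.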
3.9 Communication
Once threads are synchronized, there are many ways that information can be transferred from one thread to the other. If the threads are in the same memory, then information can be transferred by value or address (VAR parameters). If the threads are not in the same memory (distributed), then transferring information by value is straightforward, but by address is difficult.
3.10 Exceptions
Exceptions can be handled locally within a task, or nonlocally among coroutines, or concurrently among tasks. All concurrent exceptions are nonlocal, but nonlocal exceptions can also be sequential. Local task exceptions are the same as for a class. An unhandled exception raised by a task terminates the program. Nonlocal exceptions are possible because each task has its own stack (execution state).
Nonlocal exceptions between a task and a coroutine are the same as between coroutines (single thread). Concurrent exceptions among tasks are more complex due to the multiple threads. A concurrent exception provides an additional kind of communication among tasks. For example, two tasks may begin searching for a key in different sets:
_Task searcher {
    searcher &partner;
    void main() {
        try {
            . . .
            if ( key == . . . ) _Throw stopEvent() _At partner;
        } catch( stopEvent ) { . . . }
    }
};
When one task finds the key, it informs the other task to stop searching. For a concurrent raise, the source execution may only block while queueing the event for delivery at the faulting execution. After the event is delivered, the faulting execution propagates it at the soonest possible opportunity (next context switch); i.e., the faulting task is not interrupted. Nonlocal delivery is initially disabled for a task, so handlers can be set up before any exception can be delivered.
void main() {
    // initialization, no nonlocal delivery
    try {
        // setup handlers
        _Enable {                       // enable delivery of exceptions
            // rest of the code
        }
    } catch( nonlocal-exception ) {
        // handle nonlocal exception
    }
    // finalization, no nonlocal delivery
}
Problems can occur when multiple threads operate on the same object simultaneously. This is not a problem if the operation on the object is atomic (not divisible), meaning no other thread can modify any partial results during the operation on the object (though the thread can be interrupted). Where an operation is composed of many instructions, it is often necessary to make the operation atomic. A group of instructions on an associated object (data) that must be performed atomically is called a critical section. Preventing simultaneous execution of a critical section by multiple threads is called mutual exclusion. Must determine when concurrent access is allowed and when it must be prevented. One way to handle this is to detect any sharing and serialize all access; this is wasteful if threads are only reading. Improve by differentiating between reading and writing: allow multiple readers or a single writer; still wasteful, as a writer may only write at the end of its usage.
Need to minimize the amount of mutual exclusion (i.e., make critical sections as small as possible) to maximize concurrency.
_Task T {
    static int tid;
    string name;                        // must supply storage
    ...
  public:
    T() {
        name = "T" + itostring(tid);    // shared read
        setName( name.c_str() );
        tid += 1;                       // shared write
    }
    ...
};
int T::tid = 0;                         // initialize static variable in .C file
T t[10];                                // 10 tasks with individual names
These approaches only work if one task creates all the objects, so creation is performed serially. In general, it is best to avoid using shared static variables in a concurrent program.
cannot be postponed indefinitely. Not satisfying this rule is called indefinite postponement.
5. There must exist a bound on the number of other threads that are allowed to enter the critical section after a thread has made a request to enter it. Not satisfying this rule is called starvation.
void CriticalSection() {
    ::CurrTid = &uThisTask();
    for ( int i = 1; i <= 100; i += 1 ) {    // work
        if ( ::CurrTid != &uThisTask() ) {   // check for mutual-exclusion violation
            uAbort( "interference" );
        }
    }
}
{
    // entry protocol
    // critical section
    // exit protocol
}

Breaks rule 1.
3.15.2 Alternation
int Last = 0;                                  // shared
_Task Alternation {
    int me;
    void main() {
        for ( int i = 1; i <= 1000; i += 1 ) {
            while ( ::Last == me ) {}          // entry protocol
            CriticalSection();                 // critical section
            ::Last = me;                       // exit protocol
        }
    }
  public:
    Alternation( int me ) : me( me ) {}
};
void uMain::main() {
    Alternation t0( 0 ), t1( 1 );
}
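Strict alternation can be demonstrated in standard C++ with an atomic Last; each thread records its id inside the critical section, so the strictly alternating order is observable. A sketch: the trace vector and run_alternation driver are additions for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> Last{ 0 };                  // shared: id of last entrant
std::vector<int> trace;                      // written only inside the critical section

void alternation( int me, int rounds ) {
    for ( int i = 1; i <= rounds; i += 1 ) {
        while ( Last == me ) {}              // entry protocol: wait for my turn
        trace.push_back( me );               // critical section
        Last = me;                           // exit protocol: give up the turn
    }
}
std::vector<int> run_alternation( int rounds ) {
    trace.clear();
    Last = 0;
    std::thread t0( alternation, 0, rounds ), t1( alternation, 1, rounds );
    t0.join(); t1.join();
    return trace;                            // strictly alternates: 1, 0, 1, 0, ...
}
```

Because Last starts at 0, thread 1 always enters first, and thereafter the threads take turns regardless of scheduling, which illustrates why a thread outside the critical section still controls its partner's entry.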
Breaks rule 3.

[diagrams and code for attempts 3.15.3–3.15.5 do not survive; they break rule 4, rule 4, and rule 5, respectively]
3.15.6 Dekker
enum Intent { WantIn, DontWantIn };
Intent *Last;
_Task Dekker {
    Intent &me, &you;
    void main() {
        for ( int i = 1; i <= 1000; i += 1 ) {
            for ( ;; ) {                        // entry protocol
                me = WantIn;
                if ( you == DontWantIn ) break;
                if ( ::Last == &me ) {
                    me = DontWantIn;
                    while ( ::Last == &me ) {}  // you == WantIn
                }
            }
            CriticalSection();                  // critical section
            ::Last = &me;                       // exit protocol
            me = DontWantIn;
        }
    }
  public:
    Dekker( Intent &me, Intent &you ) : me( me ), you( you ) {}
};
void uMain::main() {
    Intent me = DontWantIn, you = DontWantIn;
    ::Last = &me;
    Dekker t0( me, you ), t1( you, me );
}
3.15.7 Peterson
enum Intent { WantIn, DontWantIn };
Intent *Last;
_Task Peterson {
    Intent &me, &you;
    void main() {
        for ( int i = 1; i <= 1000; i += 1 ) {
            me = WantIn;                        // entry protocol
            ::Last = &me;
            while ( you == WantIn && ::Last == &me ) {}
            CriticalSection();                  // critical section
            me = DontWantIn;                    // exit protocol
        }
    }
  public:
    Peterson( Intent &me, Intent &you ) : me( me ), you( you ) {}
};
void uMain::main() {
    Intent me = DontWantIn, you = DontWantIn;
    Peterson t0( me, you ), t1( you, me );
}
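Since Peterson's algorithm assumes atomic assignment, it can be sketched in standard C++ with sequentially consistent std::atomic operations. The protected critical section here is a shared counter increment, which unprotected threads would corrupt; Last is represented as a thread id rather than a pointer, and run_peterson is a hypothetical driver.

```cpp
#include <atomic>
#include <thread>

enum Intent { DontWantIn, WantIn };
std::atomic<Intent> intents[2] = { {DontWantIn}, {DontWantIn} };
std::atomic<int> Last{ 0 };                // id of the thread that must defer
long counter = 0;                          // protected by the protocol

void peterson( int id, int rounds ) {
    std::atomic<Intent> &me = intents[id], &you = intents[1 - id];
    for ( int i = 1; i <= rounds; i += 1 ) {
        me = WantIn;                       // entry protocol: declare intent
        Last = id;                         //   and take the deferring position
        while ( you == WantIn && Last == id ) {}   // busy wait
        counter += 1;                      // critical section
        me = DontWantIn;                   // exit protocol
    }
}
long run_peterson( int rounds ) {
    counter = 0;
    std::thread t0( peterson, 0, rounds ), t1( peterson, 1, rounds );
    t0.join(); t1.join();
    return counter;                        // exactly 2 * rounds if exclusion holds
}
```

If mutual exclusion failed, some increments would be lost and the count would fall short; the default sequentially consistent ordering is essential, since the algorithm is incorrect under relaxed memory orderings.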
Differences between Dekker and Peterson: Dekker's algorithm makes no assumptions about atomicity, while Peterson's algorithm assumes assignment is an atomic operation. Dekker's algorithm works on a machine where bits are scrambled during simultaneous assignment; Peterson's algorithm does not. Prove Dekker's algorithm has no simultaneous assignments.
            // step 2, wait for tasks with lower priority
            for ( j = priority + 1; j < N; j += 1 ) {
                while ( intents[j] == WantIn ) {}
            }
            CriticalSection();                  // critical section
            intents[priority] = DontWantIn;     // exit protocol
        }
    }
  public:
    NTask( Intent i[ ], int N, int p ) : intents( i ), N( N ), priority( p ) {}
};
Breaks rule 5
[diagram: N-thread prioritized entry — threads 0–9 ordered from HIGH to low priority]
[diagram: N-thread bakery — threads 0–9 ordered from HIGH to low priority, each holding a ticket value, e.g., 17, 18, 18, 0, 20, 19]
- a ticket value of ∞ (INT_MAX) ⇒ don't want in
- low ticket and position value ⇒ high priority
- ticket selection is unusual: tickets are not unique ⇒ use position as secondary priority
- ticket values cannot increase indefinitely ⇒ could fail

3.15.10 Tournament

N-Thread Prioritized Entry uses N bits. However, there is no known solution for all 5 rules using only N bits. N-Thread Bakery uses NM bits, where M is the ticket size (e.g., 32 bits), but is only probabilistically correct (limited ticket size). Other N-thread solutions are possible using more memory.
The tournament approach uses a minimal binary tree with N/2 start nodes (i.e., a full tree with lg N levels). Each node is a Dekker or Peterson 2-thread algorithm. Each thread is assigned to a particular start node, where it begins the mutual exclusion process.
[diagram: tournament tree — threads T0–T6 enter at start nodes D1, D2, D3, D5; winners advance through internal nodes D4 and the root D6]
At each node, one pair of threads is guaranteed to make progress; therefore, each thread eventually reaches the root of the tree. With a minimal binary tree, the tournament approach uses (N-1)M bits, where (N-1) is the number of tree nodes and M is the node size (e.g., Last, me, you, next node).
3.15.11 Arbiter

Create a full-time arbitrator task to control entry to the critical section.
bool intent[5];                     // initialize to false
bool serving[5];                    // initialize to false

_Task Client {
    int me;
    void main() {
        for ( int i = 0; i < 100; i += 1 ) {
            intent[me] = true;      // entry protocol
            while ( ! serving[me] ) {}
            CriticalSection();
            intent[me] = false;     // exit protocol
            while ( serving[me] ) {}
        }
    }
  public:
    Client( int me ) : me( me ) {}
};
_Task Arbiter {
    void main() {
        int i = 0;
        for ( ;; ) {                // cycle for requests => no starvation
            for ( ; ! intent[i]; i = (i + 1) % 5 ) {}
            serving[i] = true;
            while ( intent[i] ) {}
            serving[i] = false;
        }
    }
};
Mutual exclusion becomes a synchronization between the arbiter and each waiting client. The arbiter cycles through waiting clients ⇒ no starvation. Does not require atomic assignment ⇒ no simultaneous assignments. The cost is the creation, management, and execution (continuous spinning) of the arbiter task.
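The arbiter can be sketched with std::thread and atomic flags. A sketch under assumptions: the done flag and run_arbiter driver are additions so the spinning arbiter can terminate, and the guarded resource is a shared counter whose final value reveals whether exclusion held.

```cpp
#include <atomic>
#include <thread>

const int NCLIENTS = 3;
std::atomic<bool> intent[NCLIENTS], serving[NCLIENTS], done{ false };
long shared_counter = 0;                    // the resource the arbiter guards

void client( int me, int rounds ) {
    for ( int i = 0; i < rounds; i += 1 ) {
        intent[me] = true;                  // entry protocol
        while ( ! serving[me] ) {}
        shared_counter += 1;                // critical section
        intent[me] = false;                 // exit protocol
        while ( serving[me] ) {}
    }
}
void arbiter() {
    int i = 0;
    while ( ! done ) {                      // cycle for requests => no starvation
        if ( intent[i] ) {
            serving[i] = true;
            while ( intent[i] ) {}
            serving[i] = false;
        }
        i = (i + 1) % NCLIENTS;
    }
}
long run_arbiter( int rounds ) {
    shared_counter = 0;
    for ( int i = 0; i < NCLIENTS; i += 1 ) { intent[i] = false; serving[i] = false; }
    std::thread a( arbiter );
    std::thread c0( client, 0, rounds ), c1( client, 1, rounds ), c2( client, 2, rounds );
    c0.join(); c1.join(); c2.join();
    done = true;
    a.join();
    return shared_counter;                  // NCLIENTS * rounds if exclusion holds
}
```

Only the arbiter decides who may enter, so the clients never race for the lock directly; the trade-off is the continuously spinning arbiter thread.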
int Lock = OPEN;                    // shared

// each task does
while ( Lock == CLOSED );           // fails to achieve mutual exclusion
Lock = CLOSED;
// critical section
Lock = OPEN;
This works for N threads attempting entry to the critical section and depends on only one shared datum (the lock). However, rule 5 is broken, as there is no bound on service. Unfortunately, there is no such atomic construct in C. Atomic hardware instructions can be used to achieve this effect.
3.16.1 Test/Set Instruction

The test-and-set instruction performs an atomic read and fixed assignment.
int Lock = OPEN;                    // shared

int TestSet( int &b ) {
    // begin atomic
    int temp = b;
    b = CLOSED;
    // end atomic
    return temp;
}

void Task::main() {                 // each task does
    while ( TestSet( Lock ) == CLOSED );
    // critical section
    Lock = OPEN;
}
If test/set returns open ⇒ the loop stops and the lock is set to closed. If test/set returns closed ⇒ the loop executes until the other thread sets the lock to open. In the multiple-CPU case, memory must also guarantee that multiple CPUs cannot interleave these special R/W instructions on the same memory location.
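In standard C++, std::atomic::exchange provides exactly this atomic read-and-fixed-assignment, so a test-and-set spin lock can be sketched portably. The worker/driver names and the guarded counter are additions for the sketch.

```cpp
#include <atomic>
#include <thread>
#include <vector>

enum { OPEN, CLOSED };
std::atomic<int> Lock{ OPEN };
long ts_counter = 0;                         // protected by the lock

void ts_worker( int rounds ) {
    for ( int i = 0; i < rounds; i += 1 ) {
        while ( Lock.exchange( CLOSED ) == CLOSED ) {}  // TestSet busy wait
        ts_counter += 1;                     // critical section
        Lock = OPEN;
    }
}
long run_testset( int nthreads, int rounds ) {
    ts_counter = 0;
    Lock = OPEN;
    std::vector<std::thread> workers;
    for ( int t = 0; t < nthreads; t += 1 ) workers.emplace_back( ts_worker, rounds );
    for ( auto &w : workers ) w.join();
    return ts_counter;                       // nthreads * rounds if exclusion holds
}
```

Each exchange atomically reads the old lock value and stores CLOSED; only the thread that reads OPEN proceeds, matching the TestSet pseudocode above.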
3.16.2 Swap Instruction

The swap instruction performs an atomic interchange of two separate values.
int Lock = OPEN;                    // shared

void Swap( int &a, int &b ) {
    int temp;
    // begin atomic
    temp = a;
    a = b;
    b = temp;
    // end atomic
}

void Task::main() {                 // each task does
    int dummy = CLOSED;
    do {
        Swap( Lock, dummy );
    } while ( dummy == CLOSED );
    // critical section
    Lock = OPEN;
}
If the swap yields open ⇒ the loop stops and the lock is set to closed. If the swap yields closed ⇒ the loop executes until the other thread sets the lock to open.

3.16.3 Compare/Assign Instruction

The compare-and-assign instruction performs an atomic compare and conditional assignment (erroneously called compare-and-swap).
int Lock = OPEN;                    // shared

bool CAssn( int &val, int comp, int nval ) {
    // begin atomic
    if ( val == comp ) {
        val = nval;
        return true;
    }
    return false;
    // end atomic
}

void Task::main() {                 // each task does
    while ( ! CAssn( Lock, OPEN, CLOSED ) );
    // critical section
    Lock = OPEN;
}
If compare/assign returns true (the lock was open) ⇒ the loop stops and the lock is set to closed. If it returns false (the lock was closed) ⇒ the loop executes until the other thread sets the lock to open. Compare/assign can build other solutions (e.g., a stack data structure) with a bound on service but with short busy waits. However, these solutions are complex.
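Standard C++ exposes this instruction as std::atomic::compare_exchange_strong, so the CAssn lock can be sketched directly. A sketch: the CAssn wrapper mirrors the pseudocode's interface, and the worker/driver names and guarded counter are additions.

```cpp
#include <atomic>
#include <thread>
#include <vector>

enum { OPEN, CLOSED };
std::atomic<int> Lock{ OPEN };
long ca_counter = 0;                         // protected by the lock

bool CAssn( std::atomic<int> &val, int comp, int nval ) {
    // atomically: if val == comp, assign nval and return true, else return false
    return val.compare_exchange_strong( comp, nval );
}
void ca_worker( int rounds ) {
    for ( int i = 0; i < rounds; i += 1 ) {
        while ( ! CAssn( Lock, OPEN, CLOSED ) ) {}      // busy wait
        ca_counter += 1;                     // critical section
        Lock = OPEN;
    }
}
long run_cassn( int nthreads, int rounds ) {
    ca_counter = 0;
    Lock = OPEN;
    std::vector<std::thread> workers;
    for ( int t = 0; t < nthreads; t += 1 ) workers.emplace_back( ca_worker, rounds );
    for ( auto &w : workers ) w.join();
    return ca_counter;                       // nthreads * rounds if exclusion holds
}
```

Note that compare_exchange_strong takes its expected value by reference and overwrites it on failure; here the wrapper takes comp by value, so each retry starts with a fresh OPEN, matching the pseudocode.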