CS6210 4b - Synchronization
1. Lesson Summary
2. Synchronization Primitives
What is a lock? If you have multiple threads executing, and they share one data
structure, it is important that the threads do not overwrite one another's work. A
lock is something that allows a thread to make sure that when it is accessing some
piece of shared data, it is not being interfered with by another thread.
So a thread can acquire a lock. Once the lock is acquired, the thread knows that it
can access some data that is shared with other threads. Once a thread T1 knows it has
exclusive access to this data structure, it can make updates to the data structure, and
after it is finished, it can release the lock.
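For example, using an illustrative lock/unlock interface (not any particular library's API), a thread's use of an exclusive lock might look like:

lock(L);              // acquire: returns only when no other thread holds L
update(shared_data);  // the thread may now safely modify the shared structure
unlock(L);            // release: another waiting thread may now acquire L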
Shared lock
There are also shared locks. A shared lock allows multiple threads to access data at
the same time. Under what conditions, however, would this be meaningful? Say that
there are records in a database that multiple threads want to inspect at the same
time, under the guarantee that no data will change while the records are being
inspected. A shared lock grants multiple readers access to the data with the
assurance that no other thread will modify the data while the readers are accessing it.
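As a sketch, with illustrative shared/exclusive primitives:

// Readers: any number of threads may hold the lock in shared mode at once
shared_lock(L);
inspect(records);       // the records are guaranteed not to change while held
shared_unlock(L);

// Writer: must wait until no thread holds the lock in either mode
exclusive_lock(L);
modify(records);
exclusive_unlock(L);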
Barrier Synchronization
The idea behind barrier synchronization is that there are multiple threads
performing a computation, and each thread needs to know the status of the other
threads involved in the computation at some given point. They require the guarantee
that all other threads have reached a particular point in their respective computations
so that they can all move to the next phase of the computation.
In this example, it is possible that threads T1 and T2 have arrived at the barrier while
the other threads have not. Until all threads T1 through Tn have arrived at the
barrier, the next phase of the computation cannot proceed.
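As a sketch, each of the n threads participating in the computation might use a barrier like this (barrier_wait is an illustrative name):

do_phase_1_work();   // each thread computes its portion of the current phase
barrier_wait(B);     // block until all n threads have reached this point
do_phase_2_work();   // no thread starts the next phase until every thread finished the previous one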
The question here is: suppose we have a multi-threaded program in which two
processes perform the following functions:

p1: modify struct(A)
p2: wait for modification

Is it possible to achieve the programmer's intent that has been embodied in this
code snippet? If so, illustrate how.
The answer is yes, it is possible. The solution is very simple--we introduce a new
flag variable and initialize it to 0 ( flag = 0 ).
Then, the processes will signal the modification of data between one another
using this flag variable. Here, process 1 will look like:

mod(A);
flag = 1; // signal p2
The producer will modify the data structure, and once it is done with the
modification, it will set the flag equal to 1 . Process 2 is waiting for the flag to be
set to 1 , and once it is set to 1 , process 2 will exit the spin loop.
Now, let us analyze this solution and see why it works with simple atomic reads
and writes.
Process 1:
mod(A);
flag = 1; // signal p2
Process 2:
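Process 2's spin loop, sketched from the description above, looks something like:

while (flag == 0); // spin until p1 signals that the modification is complete
use(A);            // now safe to read the data that p1 modified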
Note that all of these commands are simple read/write accesses. Process 1 is
modifying data using loads and stores, while process 2 is simply reading and using
a value with the same loads and stores.
However, note that there is a difference between the ways in which the processes
are using the flag variable, as opposed to the memory variable A . The flag
variable is being used as a synchronization variable, but it is being accessed by
the same read/write accesses available in the processor.
Atomic read/write operations are good for doing simple coordination among
processes. But how can we implement a synchronization primitive, such as a mutual-
exclusion lock? Are atomic read/write operations sufficient for implementing a
mutual-exclusion lock?
5. Atomic Operations
Below is a very simple implementation of a mutual exclusion lock.
lock(L) {
    if (L == 0) {
        L = 1;
    }
    else {
        while (L == 1); // wait
        // go back to check if (L == 0)
    }
}

unlock(L) {
    L = 0;
}
Can we implement this simple lock using atomic reads and writes alone?
If we look at the set of instructions that the processor has to execute in order to
acquire the lock, it has to: read L from memory, check whether L is 0, and, if so, set
L to 1 and write it back to memory. With only atomic reads and writes, another
thread could read L in between our read and our write, also see 0, and both threads
would end up acquiring the lock.
In other words, we want to read from memory, modify the value, and write it
back to memory in a single atomic instruction. This is needed in order to ensure
that no other thread can interleave its own accesses between our read and our
write. Architectures therefore provide read-modify-write (RMW) atomic
instructions, such as:
test-and-set(<mem_loc>)
This will atomically return the current value in the memory location before
setting the memory location value to 1.
fetch-and-inc(<mem_loc>)
Atomically fetches the old value from memory and increments the current value
in the memory by 1 , or any other value.
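As a sketch, the semantics of these two instructions can be written as if they were ordinary functions; on real hardware, each body executes as a single indivisible operation:

test_and_set(mem_loc) {
    old = *mem_loc;      // return whatever was in the location...
    *mem_loc = 1;        // ...and unconditionally set it to 1
    return old;
}

fetch_and_inc(mem_loc) {
    old = *mem_loc;      // return the current value...
    *mem_loc = old + 1;  // ...and increment the location
    return old;
}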
latency
This is the time spent by a thread in acquiring an available lock.
waiting time
This is the time spent by a thread waiting to acquire a lock that is currently
unavailable. Waiting time is a function of the application and is out of the
control of the operating system designer.
contention
If several threads are waiting to access a lock, they are contending for it. How
long it takes, in the presence of this contention, for one thread to obtain the
lock while the others give up on this attempt and go back to waiting is the
inefficiency incurred by contention.
Latency and contention are the primary concerns of operating systems designers
when implementing synchronization primitives.
7. Naive Spinlock
When a processor is spinning, this means the processor is doing no useful work,
but is simply waiting for the lock to be released.
The first naive spinlock implementation is referred to as the naive spin on test-
and-set.
Spin on test-and-set
lock(L) {
    while (test_and_set(L) == locked); // spin on test-and-set
}

unlock(L) {
    L = unlocked;
}
Say that we have three threads, T1, T2, and T3, contending for some shared memory
location L, which has been initialized to unlocked . When we call the lock
primitive, it executes the test_and_set atomic instruction, which returns the old
value of L and sets L to the new value, locked .
If a thread finds that test_and_set returns locked , it will keep executing
test_and_set on the memory location L until it returns unlocked . This is where
the thread spins.
Assume in our example that T1 has obtained lock L before threads T2 and T3.
When T2 and T3 try to obtain the lock, they will execute the same lock algorithm
and discover that L is locked. Therefore, T2 and T3 are effectively stuck spinning
until T1 calls the unlock function.
What are the potential problems with this naive implementation of the spin on
test-and-set spinlock?
The problem is that every spinning processor repeatedly executes the atomic
test_and_set instruction, which goes to memory and creates contention on the
network. An improvement, spin on read (test-and-test-and-set), has each waiting
processor spin on its locally cached copy of L and attempt test_and_set only when
L appears to be unlocked:

lock(L) {
    while (L == locked || test_and_set(L) == locked); // spin on the cached value of L; attempt test_and_set only once L looks unlocked
}
If the one processor with the lock releases it, all contending processors will
notice the updated value of the lock, as cache coherence maintains that
changes made to shared memory are propagated across the private caches of all
processors.
A further refinement is to add a delay: after a failed attempt to acquire the lock, a
processor waits for some time before trying again. In one variant, the delay is a
function of processor ID, meaning that each processor delays lock acquisition for a
different amount of time. However, because this is a static delay time, some time can
be wasted on the delay process. We can, instead, perform a dynamic delay
(exponential backoff).
When the lock is not highly contended, the processor will not delay for very
long. But with each repeated check of (and failure to acquire) the lock, the processor
delays its next attempt a little bit longer.
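A sketch of spinning with exponential backoff, assuming the test_and_set primitive above (the delay constant and pause routine are illustrative):

lock(L) {
    delay = BASE_DELAY;
    while (test_and_set(L) == locked) {
        pause(delay);        // do no useful work for 'delay' time units
        delay = delay * 2;   // each failed attempt doubles the delay
    }
}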
Summary
Because the delay-based spinlock algorithms do not rely on spinning on a cached
copy of the lock, they can be implemented even on a non-cache-coherent
multiprocessor. Generally speaking, if there is a lot of contention, then static
assignment of delay might be better than exponential delay with backoff--but in
general, any delay implementation improves upon the naive spinlock algorithm.
1 First, acquire a ticket. Here, this is done by retrieving and incrementing the
next ticket in the lock structure.
2 Now, spin on the lock. We spin until the my_ticket value is equal to the
now_serving value. now_serving is updated every time the lock is released
in the release_lock function.
3 When L->now_serving == my_ticket , we can successfully acquire the lock (the sketch below puts these steps together).
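A sketch of the ticket lock built from these steps, assuming a fetch_and_inc primitive ( next_ticket is an assumed name for the lock's ticket counter):

lock(L) {
    my_ticket = fetch_and_inc(L->next_ticket);  // step 1: take a ticket
    while (L->now_serving != my_ticket);        // step 2: spin until my number comes up
    // step 3: now_serving == my_ticket, the lock is acquired
}

release_lock(L) {
    L->now_serving = L->now_serving + 1;        // admit the holder of the next ticket
}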
While the algorithm preserves fairness, every time the lock is released, the
now_serving value is updated, and the cache coherence mechanism propagates the
new value across the private caches of all the processors in the system. So we have
not reduced the contention on the network that occurs when the lock is released.
Ideally, when T1 releases the lock, it would signal ONLY one other thread among T2
through Tn to acquire the lock next, rather than effectively signalling all n - 1
waiting threads at the same time. It is this idea that lays the foundation for queueing
locks.
First, associated with each lock L is an array of flags, flags . The size of this
array is equal to the number of processors in the SMP. If you have an N-way
multiprocessor, then you have N elements in the circular flags array. This flags
array serves as a circular queue for enqueueing the requesters that want to acquire
lock L.
Exactly one processor can be in the has_lock state at any time, while all
other processors are in the must_wait state, because the lock L is mutually
exclusive. In order to initialize the lock, we initialize the queue data
structure flags by marking one slot as has_lock and all the others as
must_wait .
One important note: the slots are not statically associated with any particular
processor. While there is a unique slot available for every waiting processor, which
slot a processor occupies is determined dynamically at run-time.
idx   0  1  2  3  4  5  6  7
state HL MW MW MW MW MW MW MW

(P1, the current lock holder, occupies slot 0; queuelast points at slot 1, where future
requestors must queue.)
Say that some processor Px arrives and requests the same lock. Every time a
processor requests lock acquisition, the queuelast variable is advanced to point to
the next space in the array, which the next contending processor will occupy.
idx   0  1  2  3  4  5  6  7
state HL MW MW MW MW MW MW MW

(Px now spins on slot 1, still must_wait , and queuelast has advanced to slot 2.)
The Lock algorithm for the array-based queueing lock will follow as such.
When we make a lock request, we mark our place in the flags array using an
atomic instruction, fetch_and_inc , on the queuelast variable (this prevents race
conditions between contending processors). By calling fetch_and_inc on
queuelast , we not only retrieve our own place in the queue (through the fetch
operation), but we also increment queuelast so that it points to the next available
slot in the array.
Once we have marked our position in the flags array, we will now wait for our
turn. In other words, we are spinning on our thread's assigned slot until the state
changes from must_wait to has_lock .
Lock(L) {
    myplace = fetch_and_inc(L->queuelast); // mark my place in the array
    while (flags[myplace % N] == MW);      // spin until my slot changes from MW to HL
}
What happens here is that when P1 is finished with its critical section, it will call the
unlock operation on the lock, which does two things (a sketch follows the table
below):
1 Changes P1's own slot in the flags array from HL to MW
2 Changes the next slot in the circular array to HL
idx 0 1 2 3 4 5 6 7
state MW HL MW MW MW MW MW MW
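A sketch of this unlock for the array-based queueing lock, assuming the releasing processor remembers the myplace value it obtained in Lock :

unlock(L, myplace) {
    L->flags[myplace % N] = MW;        // give up my slot
    L->flags[(myplace + 1) % N] = HL;  // pass the lock to the next slot in the circular queue
}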
Recall that the flags array is a circular queue. This means that the HL state is
'circled' around all of the available slots of the array from start to finish as the lock is
passed along. When the predecessor of Pmine acquires the lock, the state of the array
looks like:
idx 0 1 2 3 4 5 6 7
state MW MW MW HL MW MW MW MW
Once Px+2 performs the unlock , it will set the state of Pmine's slot to HL ,
indicating that processor Pmine has now acquired the lock and can enter the
critical section to perform its necessary modifications.
2. Fairness is preserved.
Lock acquisition is granted sequentially, based upon the order in which threads
entered the flags array.
One downside of this algorithm is its space complexity: it uses a static, and not a
dynamic, data structure to maintain the sequence of threads, so every lock needs an
array with one slot per processor regardless of how many processors actually
contend for it. The MCS (linked-list based queueing) lock addresses this by
maintaining the queue of waiters as a dynamically allocated linked list of qnode
structures.
class qnode {
    qnode next;   // points to the successor in the queue
    bool locked;  // whether or not this processor has acquired the lock
}
Every processor that requests access to a lock creates a new instance of qnode .
The locked field indicates whether or not the processor has acquired the lock,
while the next field points either to null or to the next processor that has
requested access to the lock. The lock itself is also of type qnode .
If no requests for the lock have been issued yet, our linked list consists of just the
lock's qnode , whose next pointer is null .
If only a single request has been made to acquire the lock, the linked list will take
on the following updated appearance:
In this scenario, processor 1 has acquired the lock, and so it is running (as indicated
by the R symbol). More specifically, processor 1 has created an instance of qnode ,
and set its next field to null , to indicate that there is no one after it. At the same
time, it has set the lock 's next pointer to itself. Thus, processor 1 can now access
the critical section.
Say that while processor 1 holds the lock, another processor 2 requests access next.
Processor 2 will have to update the next pointer of the last processor in the
queue, as well as the next pointer in the lock .
We have to update the next pointer in processor 1 to point at processor 2, so that
processor 1 will signal processor 2 once it has released the lock, and we
update the next pointer in lock to point at P2 . Here are the contents of each
qnode object at this point:
LOCK:
    next = P2
    locked = True

P1: (running)
    next = P2
    locked = True

P2: (spinning)
    next = NULL
    locked = False
The next pointer in lock is always pointing to the last member of the linked list
queue. Once P2 has made these necessary changes, then it can continue to spin
and wait for the previous processor P1 to signal that it has released the lock.
For example, here is the transition that takes place when P2 tries to request
access from P1 from the figure above.
The arrows above highlighted in red represent the fields that must be updated
when enqueueing a new request to the lock.
Note that for P2 to be successfully enqueued, two pointers must change: the lock's
next pointer, which was pointing at the previous node, must be made to point to
P2 , and the next pointer of the previous node must also be made to point to P2 .
Joining the queue--reading the old tail and swinging the lock's next pointer to the
new node--must happen atomically. In order to facilitate this, we will use
fetch-and-store with two arguments. For example, fetch_and_store(L, I) will
return what used to be contained in L->next . What used to be contained in
L->next was P1 , which will be the predecessor of I . At the same time, it stores
into L a pointer to the new node I .
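A sketch of the MCS lock acquisition built on fetch_and_store , following the qnode conventions above (an illustrative reconstruction, not the lecture's exact code):

lock(L, I) {                        // I is the qnode created by the requesting processor
    I->next = NULL;
    I->locked = False;              // not yet the lock holder
    pred = fetch_and_store(L, I);   // atomically make I the last member of the queue and return the previous last member (our predecessor)
    if (pred != NULL) {             // someone was ahead of us in the queue
        pred->next = I;             // let the predecessor know whom to signal
        while (I->locked == False); // spin on our own qnode until the predecessor signals us
    }
    // pred == NULL means the queue was empty: we hold the lock immediately
}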
There are two steps that the unlock procedure must perform-
unlock takes two arguments--one being the Lock , and the other the node that is
making the unlock call (in our example, it's P1 ). Because P1->next points at P2 ,
we will use that link to signal to P2 that the Lock has been released. Because P2
has been spinning on its own locked field (which P1 reaches through P1->next ),
the moment that this guarded data variable changes, P2 will be signaled to acquire
the lock.
Note, however, that when P2 calls the unlock procedure now, there are no
successors to P2 that will be signaled. How does this change the unlock
procedure?
We must set the Lock 's next pointer to null , to indicate that there is
no requester in the queue waiting for the lock. But what happens if a new request
is presently forming?
Race Conditions
Say that we have a new requester, P3 , which performs lock at the same time that
P2 is performing release . When P3 performs fetch_and_store(L, P3) , it sets
Lock->next to P3 and learns that its predecessor (the old Lock->next ) is P2 ,
whose next pointer it will then set to P3 . P2 , however, has not finished
executing the release function and is still part of the linked list at this moment: it
is about to set Lock->next to null to indicate an empty queue, which would wipe
out P3 's enqueue, since Lock->next already points at P3 , and P2 cannot simply
remove itself fully from the linked list. This is the race condition that can take
place during the release-lock operation.
The primitive for our atomic conditional store operation will be the
compare_and_swap atomic function.
Here, compare_and_swap takes three arguments-- L, I, arg1 . If the value
contained in L equals I , then it sets L to arg1 and returns true . Otherwise, it
does nothing, and returns false .
We can instead replace the unconditional dequeue operation in the release
procedure with a compare_and_swap : the Lock 's next pointer is set to null only
if it still points to the node performing the release, which avoids the race
condition entirely.
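A sketch of the corresponding unlock using compare_and_swap , again an illustrative reconstruction:

unlock(L, I) {
    if (I->next == NULL) {                 // no known successor
        if (compare_and_swap(L, I, NULL))  // if I am still the last member of the queue, empty it
            return;                        // no one is waiting; the lock is now free
        while (I->next == NULL);           // a new requester is mid-enqueue: wait until it links itself in
    }
    I->next->locked = True;                // signal the successor: it now holds the lock
}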
If these atomic instructions are not available in some given architecture, then
they will have to be simulated using the default test-and-set instruction
included on these architectures.
In the MCS lock, each waiting processor spins on its own cached copy of the
guarded variable in its qnode , which reduces excess contention when the lock
holder signals its successor.
Additionally, exactly one processor is signaled when a lock is released. There is only
one atomic operation per critical section (save for the corner case, which requires a
second atomic operation). The space complexity of this data structure is dynamic: it
is a function of how many processors are currently requesting access to the lock,
rather than of how many processors in total exist in the given architecture.
Therefore, the space complexity of this algorithm improves significantly upon the
space complexity of the Andersen lock algorithm.
While both of these queueing algorithms scale better, they depend on richer
read-modify-write instructions. If the processor only has test-and-set , then the
OS designer may have to rely on the exponential-backoff algorithm to handle
synchronization and maximize performance.
Algorithm     | Latency | Contention | Fair | Spin | RMW ops per CS | Space ovhd | Signal only one on release
Spin on T&S   |         |            |      |      |                |            |
Spin on read  |         |            |      |      |                |            |
Spin w/ delay |         |            |      |      |                |            |
Ticket lock   |         |            |      |      |                |            |
Andersen      |         |            |      |      |                |            |
MCS           |         |            |      |      |                |            |

Answer

Algorithm     | Latency | Contention | Fair | Spin | RMW ops per CS | Space ovhd | Signal only one on release
Spin on T&S   | low     | high       | N    | S    | high           | low        | N
Spin on read  | low     | med        | N    | S    | medium         | low        | N
Spin w/ delay | low++   | low+       | N    | S    | low+           | low        | N
Ticket lock   | low     | low++      | Y    | S    | low++          | low+       | N
Andersen      | low+    | low        | Y    | P    | 1              | high       | Y
MCS           | low+    | low        | Y    | P    | 1 (max 2)      | med        | Y

(S = spin on a shared variable; P = spin on a private, per-processor variable.)