CS6210 4b - Synchronization

1. Lesson Summary
These notes discuss synchronization in multi-threaded programming, focusing on locks, barrier synchronization, and atomic operations. They explain mutual-exclusion locks and shared locks, as well as the challenges of contention and latency in synchronization, and introduce various lock implementations, including naive spinlocks, caching spinlocks, ticket locks, and queueing locks, emphasizing the importance of fairness and efficiency in lock acquisition.
2. Synchronization Primitives
What is a lock? If you have multiple threads executing, and they share one data
structure, it is important that the threads do not overwrite one another's work. A
lock is something that allows a thread to make sure that when it is accessing some
piece of shared data, it is not being interfered with by another thread.

So a thread can acquire a lock. Once the lock is acquired, the thread knows that it
can safely access the data that is shared with other threads. Once a thread T1 knows
it has access to this data structure, it can make updates to it, and after it is
finished, it can release the lock.

Locks come in two flavors--a mutual-exclusion lock, and a shared lock.

Mutual exclusion lock

A mutual-exclusion lock (also known as an exclusive lock) can be held by only one
thread at a time. Mutual-exclusion locks guarantee that no thread other than the
one currently holding the exclusive lock can interfere with the data protected by
the lock.

Shared lock
There are also shared locks. This lock allows multiple threads to access data at
the same time. Under what conditions, however, would this be meaningful? Say that
there are records in a database that multiple threads want to inspect at the same
time, under the guarantee that no data will change while the records are being
inspected. A shared lock grants multiple readers access to data with the assurance
that no other thread will modify the data while the readers are accessing it.
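As a concrete illustration (not from the lecture), POSIX threads expose both flavors; the rwlock below is a minimal sketch of shared (reader) versus exclusive (writer) acquisition:

```c
#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

void reader(void) {
    pthread_rwlock_rdlock(&rw);   // shared lock: many readers may hold it at once
    /* inspect the shared records... */
    pthread_rwlock_unlock(&rw);
}

void writer(void) {
    pthread_rwlock_wrlock(&rw);   // exclusive lock: only one writer, no readers
    /* modify the shared records... */
    pthread_rwlock_unlock(&rw);
}
```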


Barrier Synchronization
The idea behind barrier synchronization is that there are multiple threads
performing a computation, and they need to know the status of the other threads
involved in the computation at a given time. They require the guarantee that
all other threads have reached a particular point in their respective computations
so that they can all move to the next phase of the computation.

In this example, it is possible that threads T1 and T2 have arrived at the barrier,
but the other threads have not. Until all threads T1 through Tn have arrived at the
barrier, the next phase of the computation cannot proceed.
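As an illustration (my example, not from the lecture), a POSIX barrier makes this phase structure explicit:

```c
#include <pthread.h>

pthread_barrier_t bar;  // initialized once with pthread_barrier_init(&bar, NULL, N)

void *worker(void *arg) {
    /* ... phase 1 of the computation ... */
    pthread_barrier_wait(&bar);  // blocks until all N threads have arrived
    /* ... phase 2 begins only after every thread finished phase 1 ... */
    return NULL;
}
```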

Now that we understand the basic synchronization primitives required on a
shared memory machine, we can start looking at how to implement them.

3. Quiz - Programmer's Intent


In the instruction set architecture of a processor, instructions are atomic by
definition. In other words, if you think about reads and writes to memory, they are
implemented as loads LW and stores SW . During the execution of either a load or
a store instruction, the processor cannot be interrupted--this is the definition of
an atomic instruction.

The question here is, if we have a multi-threaded program, where there are two
processes performing the following functions:

|        | p1               | p2                      |
|--------|------------------|-------------------------|
| line 1 | modify struct(A) | wait for modifications; |
| line 2 |                  | use struct(A);          |

This is an example of a producer-consumer relationship, where one process is


producing data, and another process is consuming it after verifying that the data
has been produced successfully.

Is it possible to achieve the programmer's intent embodied in this code snippet?
If so, illustrate how.

The answer is yes, it is possible. The solution is very simple--we can introduce a
new flag variable, initialized to 0-- flag = 0 .

Then, the processes will signal the modification of data between one another
using this flag variable. Here, process 1 will look like:

```c
mod(A);
flag = 1; // signal p2
```

And process 2 will look like:

```c
while (flag == 0);
use(A);
flag = 0; // re-initialize the flag
```

The producer will modify the data structure, and once it is done with the
modification, it will set the flag equal to 1 . Process 2 is waiting for the flag to be
set to 1 , and once it is set, process 2 will exit the spin loop.

Now, let us analyze this solution and see why it works with simple atomic reads
and writes.

4. Programmer's Intent Explanation


Process 1:

```c
mod(A);
flag = 1; // signal p2
```

Process 2:

```c
while (flag == 0);
use(A);
flag = 0; // re-initialize the flag
```

Note that all of these commands are simple read/write accesses. Process 1 is
modifying data using loads and stores, while process 2 is simply reading and using
a value with the same loads and stores.

However, note that there is a difference between the ways in which the processes
are using the flag variable, as opposed to the memory variable A . The flag
variable is being used as a synchronization variable, but it is being accessed by
the same read/write accesses available in the processor.

Atomic read/write operations are good for doing simple coordination among
processes. But how can we implement a synchronization primitive, such as a mutual-
exclusion lock? Are atomic read/write operations sufficient for implementing a
mutual-exclusion lock?

5. Atomic Operations
Below is a very simple implementation of a mutual exclusion lock.

```c
lock(L) {
    if (L == 0) {
        L = 1;
    }
    else {
        while (L == 1); // wait
        // go back to check if (L == 0)
    }
}

unlock(L) {
    L = 0;
}
```

In terms of the instructions the processor will execute in order to get this lock, it
will need to be able to:

- check if the lock is currently available,
- and indicate that the lock is or is not available.

To release the lock, we will simply set the lock L to 0, to indicate that the lock
has been released.

Can we implement this simple lock using atomic reads and writes alone?
If we look at the set of instructions that the processor has to execute in order to
acquire the lock, it has to:

- read L from memory,
- check if L is 0 ,
- store the new value 1 into the memory location L .

All of these steps must be executed atomically, so that nothing can interfere
with their execution. While read and write instructions are atomic individually, a
group of read and write instructions is not atomic. What that means is that reads
and writes cannot be the only atomic operations if we want to implement a lock
algorithm. As such, we will need a new semantic for an atomic instruction
that can do all of these steps in one atomic operation.
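To see why, consider one problematic interleaving of two threads running the naive lock (an illustrative trace, not from the notes):

```
T1: load L        -> reads 0
T2: load L        -> also reads 0 (before T1 stores)
T1: store L = 1   -> T1 enters the critical section
T2: store L = 1   -> T2 also enters: mutual exclusion is violated
```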

In other words, we want to read from memory, modify the value, and write it
back to memory in a single atomic instruction. This is needed in order to ensure
we can implement a lock algorithm. This will be referred to as an atomic read-
modify-write instruction-- RMW .

Atomic RMW instructions


There are a number of existing atomic RMW instructions. Generically, the
following instructions are referred to as fetch-and-φ instructions, meaning that
the operation atomically fetches the old value from a memory location and then
performs some modification φ on that location.

test-and-set(<mem_loc>)

This will atomically return the current value in the memory location before
setting the memory location value to 1.

fetch-and-inc(<mem_loc>)

Atomically fetches the old value from memory and increments the value in the
memory location by 1 (the more general fetch-and-add adds an arbitrary value).
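As a hedged illustration, C11's <stdatomic.h> provides equivalents of these fetch-and-φ primitives (the wrapper names below are my own, not the lecture's):

```c
#include <stdatomic.h>
#include <stdbool.h>

// test-and-set: atomically store 1 and report whether it was already set.
bool test_and_set(atomic_int *loc) {
    return atomic_exchange(loc, 1) == 1;
}

// fetch-and-inc: atomically add 1 and return the old value.
int fetch_and_inc(atomic_int *loc) {
    return atomic_fetch_add(loc, 1);
}
```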

6. Scalability Issues With Synchronization


There are a number of scalability issues with synchronization primitives,
particularly in lock and barrier algorithms. The sources of inefficiencies include:

latency
This is the time spent by a thread in acquiring an available lock.
waiting time
This is the time spent by a thread waiting to acquire an unavailable
lock. Waiting time is a function of the application and is out of the
control of an operating system designer.
contention
If several threads are waiting to access a lock, they are contending for
lock access. The inefficiency incurred by contention is how long it takes,
when the lock is released, for one thread to obtain the lock and for the
others to back off.


Latency and contention are the primary concerns of operating systems designers
when implementing synchronization primitives.

7. Naive Spinlock
When a processor is spinning, this means the processor is doing no useful work,
but is simply waiting for the lock to be released.

The first naive spinlock implementation is referred to as the naive spin on test-
and-set.

Spin on test-and-set
```c
lock(L) {
    while (test_and_set(L) == locked); // spin on test-and-set
}

unlock(L) {
    L = unlocked;
}
```
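A minimal runnable translation using C11 atomics (my mapping, not the lecture's code):

```c
#include <stdatomic.h>

atomic_flag L = ATOMIC_FLAG_INIT;

void lock(void) {
    // atomic_flag_test_and_set returns the previous value: spin while it was already set.
    while (atomic_flag_test_and_set(&L));
}

void unlock(void) {
    atomic_flag_clear(&L);
}
```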

Say that we have three threads, T1, T2, and T3, contending for some shared memory
location L, which has been initialized to unlocked . When we call the lock
primitive, it executes the test_and_set atomic instruction, which returns the old
value of L and sets it to the new value, locked .

If a thread finds that test_and_set returns locked , it will keep retrying
test_and_set on the memory location L until it returns unlocked . This is where
the thread spins.

Assume in our example that T1 has obtained lock L before threads T2 and T3.
When T2 and T3 try to obtain the lock, they will execute the same lock algorithm
and discover that L is locked. Therefore, T2 and T3 are effectively stuck spinning
until T1 calls the unlock function.

8. Quiz - Problems With Naive Spinlock



What are the potential problems with this naive implementation of the spin on
test-and-set spinlock?

- Too much contention
- Does not exploit caches
- Disrupts useful work

All three are correct.

Too much contention
With this naive implementation, there is going to be too much
contention for the lock when the lock is released. All three threads in our
illustration are performing the test-and-set instruction in an attempt to
acquire the lock. If there are thousands of processors, every one of them
is going to execute this test-and-set instruction, which leads to excessive
contention on the network when accessing this shared variable.
Does not exploit caches
A shared memory multiprocessor has a private cache associated with
every single processor in the architecture. While a value from memory
can be cached in each private cache, the test_and_set instruction
cannot operate on the cached copy, because its read-modify-write must
be performed atomically on the shared memory location. By definition, a
test-and-set instruction is not going to exploit caches; it bypasses the
cache and goes directly to memory.
Disrupts useful work
This is also a good answer because when a processor releases the lock,
it wants to go on and do some useful work. Similarly, if several
processors are trying to acquire an unavailable lock, they will continue
to contend for the lock instead of continuing to perform useful work.


9. Caching Spinlock (spin on read)

While a test-and-set instruction is forced to go to memory in order to preserve
atomicity for read-modify-write operations, we can make an improvement to our
lock algorithm such that contending threads can spin on their caches instead of
spinning on the shared memory location.

Assume that we have a shared-memory machine in which the hardware maintains
cache coherence. The waiting threads can spin locally on the cached value of the
lock L instead of making direct calls to main memory. When spinning on the locally
cached value of L, we can assume that, through cache coherence, any change to L
made in main memory will be propagated to the private caches of each processor,
which notifies the waiting threads.

```c
lock(L) {
    while (L == locked);                     // retrieve L from main memory once, then
                                             // spin on the cached value of L
    if (test_and_set(L) == locked) lock(L);  // lock appeared free: attempt the actual
                                             // test-and-set; go back to spinning on failure
}
```

By spinning on the locally cached value of L, there is no longer any contention on
the network.

When the one processor with the lock releases it, all contending processors will
notice the updated value of the lock, as cache coherence ensures that changes
made to shared memory are propagated across the private caches of all
processors.

10. Spinlocks with Delay


There is another approach to reducing contention for a given lock: implementing
a delay. Each processor will wait before it contends for a lock, even if the processor
detects that the lock is available.

Delay after Lock Release

```c
while ((L == locked) || (test_and_set(L) == locked)) {
    while (L == locked); // spin locally until the lock has been released
    delay(d[P_id]);      // once the lock is released, delay for a time assigned to this processor's ID
}
```

Here, the delay is a function of processor ID, meaning that each processor delays
lock acquisition for a different amount of time. However, because this is a static
delay time, some time can be wasted on the delay process. We can, instead,
perform a dynamic delay.

Delay with exponential backoff

```c
while (test_and_set(L) == locked) {
    delay(d);
    d = d * 2; // back off for twice as long after each failed attempt
}
```

When the lock is not highly contended for, the processor will not delay for very
long. But with repeated failed attempts to acquire the lock, the processor will
delay its acquisition attempts a little longer each time.

Summary
Because these spinlock algorithms do not require cache access, they can be
implemented on a non-cache-coherent processor. Generally speaking, if there is a
lot of contention, then static assignment of delay might be better than
exponential backoff--but in general, any delay implementation improves on the
naive spinlock algorithm.

11. Ticket Lock


If multiple processors are waiting on a lock, then the lock should be acquired by
the processor that has waited the longest. However, the spinlock does not keep
track of how long each thread has been waiting for the lock--the threads are
indistinguishable from one another. Therefore, the spinlock does not preserve
fairness.

Ensuring fairness in lock acquisition


The ticket lock algorithm is simply the implementation of a ticketing system.

```c
struct lock { // add new data fields to the lock
    int next_ticket;
    int now_serving;
};

release_lock(L) {
    L->now_serving++; // every thread performs a release_lock after it is done
}

acquire_lock(L) {
    int my_ticket = fetch_and_inc(L->next_ticket); // the "get ticket" step: mark my position
loop: // delay, as in the spinlock with delay
    pause(my_ticket - L->now_serving); // pause for a time proportional to my distance from the front
    if (L->now_serving == my_ticket) return; // acquire the lock when my ticket comes up
    goto loop;
}
```

The acquire_lock algorithm is implemented in three steps:

1. First, acquire a ticket. Here, this is done by retrieving and incrementing the
   next ticket in the lock structure.
2. Now, spin on the lock. We spin until the my_ticket value is equal to the
   now_serving value. now_serving is updated every time the lock is released
   in the release_lock function.
3. When L->now_serving == my_ticket , we can successfully acquire the lock.

While the algorithm preserves fairness, every time the lock is released, the new
now_serving value is propagated by the cache coherence mechanism to the
private caches of all waiting processors, which causes contention on the network.
We have not reduced the contention that occurs when a lock is released.
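A hedged C11 sketch of the ticket lock (my mapping onto stdatomic, not the lecture's code):

```c
#include <stdatomic.h>

typedef struct {
    atomic_int next_ticket; // next ticket to hand out
    atomic_int now_serving; // ticket currently allowed to hold the lock
} ticket_lock_t;

void acquire_lock(ticket_lock_t *L) {
    int my_ticket = atomic_fetch_add(&L->next_ticket, 1); // get a ticket
    while (atomic_load(&L->now_serving) != my_ticket)
        ; // spin (a pause proportional to my_ticket - now_serving could go here)
}

void release_lock(ticket_lock_t *L) {
    atomic_fetch_add(&L->now_serving, 1); // hand the lock to the next ticket holder
}
```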

12. Spinlock Summary


Our first two spinlock algorithms--spin on read with test-and-set, and test-and-set
with delay--reduce contention for the resource but do not guarantee fairness;
our final spinlock algorithm, the ticket lock, guarantees fairness but does not
reduce contention on the network.

To further illustrate the limitations of spinlock algorithms, consider the following
example. Say that we have a set of N threads (T1...Tn), and that T1 has acquired a
lock. In the spinlock algorithms, the remaining threads T2 to Tn are all, to some
extent, waiting on the lock to be released. However, only one thread will be able to
acquire the lock after T1 has released it. Why should more than one thread
contend for the lock?

Ideally, when T1 releases the lock, it will signal ONLY one other thread among T2
to Tn to acquire the lock next, rather than signalling all n - 1 threads at the
same time. It is this idea that lays the foundation for queueing locks.

13. Array Based Queueing Lock (Andersen's lock)


This is also known as Andersen's lock.

First, associated with each lock L is a queue of flags, flags . The size of this
queue is equal to the number of processors in the SMP. If you have an N-way
multiprocessor, then you have N elements in the circular flags array. This flags
array serves as a circular queue for enqueueing the requesters that are requesting
access to lock L.

Each element in this flags queue can be in one of two states:

1. has_lock - the processor that is waiting on this slot has the lock L.
2. must_wait - the processor that is waiting on this slot must wait for the lock L.

There can be exactly one slot in the has_lock state, while all other slots are in
the must_wait state. This is because the lock L is mutually exclusive. In order to
initialize the lock, we have to initialize the queue data structure flags , which
marks one slot as has_lock while marking the others as must_wait .


One important note: the slots are not statically associated with any processor.
While there is a unique slot available for every waiting processor, the slots are
assigned to processors dynamically at run-time.

14. Array Based Queueing Lock (cont)


The queuelast variable
The queuelast variable points at the next open slot in the flags queue. Each
time a processor requests a lock, the queuelast variable is incremented once.

```
idx    0    1    2    3    4    5    6    7
state  HL   MW   MW   MW   MW   MW   MW   MW
       P1   ^queuelast (future requestors queue here)
       (current lock holder)
```

Say that some processor Px arrives and requests the same lock. Every time a
processor requests lock acquisition, we update the queuelast variable to point to
the next slot in the array, which the next contending processor will occupy.

```
idx    0    1    2    3    4     5    6    7
state  HL   MW   MW   MW   MW    MW   MW   MW
       P1   Px   Px+1 Px+2 Pmine ^queuelast (future requestors queue here)
       (current lock holder)
```

```c
Lock(L) {
    myplace = fetch_and_inc(L->queuelast); // mark my place in the array
}
```

The Lock algorithm for the array-based queueing lock proceeds as follows.
When we make a lock request, we mark our place in the flags array using an
atomic instruction (to prevent race conditions between contending processors):
fetch_and_inc on the queuelast variable. By calling fetch_and_inc on
queuelast , we not only retrieve our own place in the queue (through the
fetch operation), but we also increment the queuelast variable to point to the
next available slot in the array.

If the architecture does not support fetch_and_inc , we will have to
simulate the operation using test_and_set instructions instead.

Once we have marked our position in the flags array, we will now wait for our
turn. In other words, we are spinning on our thread's assigned slot until the state
changes from must_wait to has_lock .

```c
Lock(L) {
    myplace = fetch_and_inc(L->queuelast); // mark my place in the array
    while (flags[myplace % N] == MW);      // stop spinning once MW becomes HL
}
```

15. Array Based Queueing Lock (cont)


What happens when the lock is released? Let's take a look at the unlock
operation.

```c
Lock(L) {
    myplace = fetch_and_inc(L->queuelast); // mark my place in the array
    while (flags[myplace % N] == MW);      // stop spinning once MW becomes HL
}

unlock(L) {
    flags[current % N] = MW;       // reset my own slot
    flags[(current + 1) % N] = HL; // hand the lock to the next slot
}
```

What happens here is that when P1 is finished with the lock, it will call the
unlock operation, which does two things:

1. Changes flags[P_1]->state to MW
2. Changes flags[P_1 + 1]->state to HL

This updates our array to the following state:

```
idx    0    1    2    3    4     5    6    7
state  MW   HL   MW   MW   MW    MW   MW   MW
       P1   Px   Px+1 Px+2 Pmine ^queuelast
```

Recall that the flags array is a circular queue. This means that the HL state
circulates around all of the slots of the array from start to finish. When the
predecessor of Pmine acquires the lock, the state of the array looks like:

```
idx    0    1    2    3    4     5    6    7
state  MW   MW   MW   HL   MW    MW   MW   MW
       P1   Px   Px+1 Px+2 Pmine ^queuelast
```


Once Px+2 performs the unlock , it will set the state of the slot belonging to
Pmine to HL , indicating that the processor Pmine has now acquired the lock and
can enter the critical section to perform its necessary modifications.
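Putting the pieces together, a minimal self-contained C11 sketch of Andersen's lock (my mapping, assuming N processors and ignoring cache-line padding concerns):

```c
#include <stdatomic.h>

#define N 8 // number of processors (assumption for illustration)

enum state { MUST_WAIT, HAS_LOCK };

typedef struct {
    _Atomic enum state flags[N]; // circular queue of per-slot states
    atomic_int queuelast;        // next free slot
} alock_t;

void alock_init(alock_t *L) {
    for (int i = 0; i < N; i++) L->flags[i] = MUST_WAIT;
    L->flags[0] = HAS_LOCK; // the first requester acquires immediately
    L->queuelast = 0;
}

int alock_acquire(alock_t *L) {
    int myplace = atomic_fetch_add(&L->queuelast, 1) % N; // mark my place
    while (atomic_load(&L->flags[myplace]) == MUST_WAIT); // spin on my own slot
    return myplace; // the caller passes this back to alock_release
}

void alock_release(alock_t *L, int myplace) {
    atomic_store(&L->flags[myplace], MUST_WAIT);          // reset my slot for reuse
    atomic_store(&L->flags[(myplace + 1) % N], HAS_LOCK); // signal exactly one successor
}
```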

Advantages of Array-Based Queueing Lock


1. There is only one atomic operation performed per critical section.
Only one atomic operation, fetch_and_inc , needs to be executed in order
to acquire the lock.

2. Fairness is preserved.
Lock acquisition is granted sequentially, in the order in which threads enter the
flags array.

3. Reduced network contention.
Each thread spins on its own variable in its own cache. Compared with the ticket
lock algorithm, there is less network contention, as only one thread is notified
when the lock is released (as opposed to a large pool of threads).

Disadvantages of Array-Based Queueing Lock


1. The size of the data structure is the same as the number of processors in the
multiprocessor. TLDR: excess space complexity.
Space complexity for this algorithm is O(N) for every lock you have in the program.
On a large-scale multiprocessor with dozens of processors, this can lead to
excessive memory overhead.

In any well-structured multi-threaded program, even though we may have lots of
threads executing on lots of processors, only a small subset of these processors
may contend for a given lock. However, this algorithm nevertheless anticipates the
worst case, and instantiates a data structure that may far exceed the size of the
subset of processors that actually contend for the lock. Do note that this is a
result of using a static, and not a dynamic, data structure to maintain the
sequence of threads.

16. Link Based Queueing Lock (MCS lock)


A link-based queueing lock, or the MCS lock, utilizes a linked-list representation
for our queue. We begin with the following implementation of the qnode data
structure:

```c
class qnode {
    qnode next;  // points to the successor
    bool locked; // whether or not the processor has acquired the lock
}
```

Every processor that requests access to a lock creates a new instance of qnode .
locked indicates whether or not the processor has acquired the lock, while the
next field points to either null or the next processor that has requested access
to the lock. The lock itself is also of type qnode .

Scenario 1 - Empty Linked List

If no requests for the lock have been issued yet, our linked list will look like:

```
[LOCK] -> null
```

Here, the -> symbol represents the next pointer.

Scenario 2 - Single Request

If only a single request has been made to acquire the lock, the linked list will take
on the following updated appearance:

```
[LOCK] -> [P1 (R)] -> null
```

In this scenario, processor 1 has acquired the lock, and so it is running (as indicated
by the R symbol). More specifically, processor 1 has created an instance of qnode ,
and set its next field to null , to indicate that there is no one after it. At the same
time, it has set the lock 's next pointer to itself. Thus, processor 1 can now access
the critical section.

17. Link Based Queueing Lock (cont)


Scenario 3 - Waiting on a request

Say that while processor 1 holds the lock, another processor 2 requests
access next.
Processor 2 will have to update the next pointer of its predecessor in the
queue, as well as the next pointer in the lock .
We have to update the next pointer in processor 1 to point at processor 2,
so that processor 1 will signal processor 2 once it has released the lock, and we
update the next pointer in lock to point at P2 . Here are the contents of each
qnode object at this point:


```
LOCK:
    next = P2
    locked = True

P1: (running)
    next = P2
    locked = True

P2: (spinning)
    next = NULL
    locked = False
```

The next pointer in lock is always pointing to the last member of the linked list
queue. Once P2 has made these necessary changes, then it can continue to spin
and wait for the previous processor P1 to signal that it has released the lock.

18. Link Based Queueing Lock (cont)


The Lock Algorithm
The Lock algorithm takes two arguments--the dummy node associated with the
lock LOCK , and the qnode to be enqueued into the linked list.

```c
Lock(qnode Lock, qnode I) {
    I->next = null;
    qnode predecessor = fetch_and_store(Lock, I); // atomically retrieve the last node that Lock points to
    if (predecessor != null) { // queue was non-empty
        I->locked = true;
        predecessor->next = I;
        while (I->locked); // spin on I->locked, and await the signal
    }
}
```

For example, here is the transition that takes place when P2 requests the lock
while P1 holds it, as in the figure above. The arrows highlighted in red represent
the fields that must be updated when enqueueing a new request to the lock.

Note that for P2 to be successfully enqueued, we take the pointer in the lock
that was pointing at the previous node and point it at P2 , and then we take the
next pointer of the previous node and point it at P2 . The update to the lock
must happen atomically. In order to facilitate this, we implement fetch-and-store
with two arguments. For example, fetch_and_store(L, I) will return what used
to be contained in L->next . What used to be contained in L->next was P1 , which
will be the predecessor to I . At the same time, it stores into L a pointer to the
new node I .
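In C11 terms, fetch_and_store on the lock's tail pointer is just an atomic exchange (an illustrative mapping, not the lecture's code):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct qnode qnode;
struct qnode {
    qnode *_Atomic next;
    _Atomic bool locked;
};

// Atomically make the lock point at me and return the node it pointed at before.
qnode *fetch_and_store(qnode *_Atomic *lock_tail, qnode *me) {
    return atomic_exchange(lock_tail, me);
}
```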

19. Link Based Queueing Lock (cont)


Unlock Algorithm
Now that we have set P1 's pointer at P2 , and P2 is spinning on the locked field.
How will we update the P2->locked field to indicate that the Lock is now
available for P2 ? Naturally, P2 's predecessor, P1 , will need to update that field
with the unlock procedure.

```c
Unlock(qnode Lock, qnode I) {
    qnode successor = I->next;
    I->next = null;            // remove qnode I from the linked list
    successor->locked = false; // indicate that the lock is now released
}
```

There are two steps that the unlock procedure must perform:

1. It must remove the qnode I from the linked list, and
2. signal I 's successor that the lock is now released.

unlock takes two arguments--one being the Lock , and the other the node that is
making the unlock call (in our example, P1 ). Because P1->next points at P2 ,
we will use that link to signal P2 that the Lock has been released. Because P2
has been spinning on its own locked field, the moment that this guarded variable
changes, P2 is signaled to take the lock.

Note, however, that when P2 calls the unlock procedure now, there are no
successors to P2 that will be signaled. How does this change the unlock
procedure?

We must set the qnode corresponding to Lock to null , to indicate that there is
no requester in the queue waiting for the lock. But what happens if a new request
is presently forming?

Race Conditions


Say that we have a new requester, P3 , which performs lock at the same time that
P2 is performing release . At the exact moment that P3 performs
fetch_and_store() , the lock's pointer is swapped to P3 , and P3 receives P2
as its predecessor--since P2 has not finished executing the release function, it
is still part of the linked list at this moment. P2 will now be unable to remove
itself fully from the linked list, because P3 is about to link in behind it. This is
the race condition that can take place during the release-lock operation.

20. Link Based Queueing Lock (cont)


Removing the race condition potential
In order to handle the race condition discussed in slide 19, we should have
P2 check whether a new request is being formed before performing
release . In other words, there must be an atomic way of setting Lock to null
if Lock is pointing to P2 .


In order to define the atomic instruction, we need to determine the invariant
condition that we will branch on. The instruction will be a conditional store
operation, which will only store a value to a given variable if some condition
is satisfied. Let's see whether we can define this condition:

```
if L->next == P2: set L->next to null
else: continue
```

The primitive for our atomic conditional store operation will be the
compare_and_swap atomic function.

```python
def compare_and_swap(L, I, arg1):
    if L == I:
        L = arg1      # swap: store arg1 into L
        return True
    else:
        return False
```

Here, compare_and_swap takes 3 arguments-- L, I, arg1 . If L == I , then it
sets L to arg1 and returns true . Otherwise, it does nothing and returns
false .
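In C11, the analogous primitive is atomic_compare_exchange_strong (a hedged mapping onto the pseudocode above, not the lecture's code):

```c
#include <stdatomic.h>
#include <stdbool.h>

// Returns true and stores `desired` into *loc if *loc == expected; otherwise returns false.
bool compare_and_swap(_Atomic(void *) *loc, void *expected, void *desired) {
    return atomic_compare_exchange_strong(loc, &expected, desired);
}
```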

21. Link-based Queueing Lock (cont)


Atomic release with compare_and_swap
When we attempt to release a lock while a new request is being formed, a race
condition will be encountered if the update to the pointer belonging to the qnode
that corresponds to Lock conflicts with the update made to the same pointer
during the acquire operation of the new request.

We can instead replace the dequeue operation executed in the release
procedure with a compare_and_swap atomic operation in order to avoid the race
condition entirely.

```c
Unlock(qnode Lock, qnode I) {
    if (I->next == null) { // no known successor
        if (compare_and_swap(Lock, I, null)) // if Lock still points at I, set it to null
            return; // compare_and_swap returned true: no one is waiting
        while (I->next == null); // CAS failed: spin until the new request finishes forming
    }
    I->next->locked = false; // indicate that the lock is now released
}
```

Here, if compare_and_swap fails, this indicates that a new request is being
formed, and so I will spin on I->next becoming non-null, i.e., pointing to the
new request. Recall that the new requester performing the Lock operation
retrieves the address of I and sets I->next to point at itself. In other words,
I spins until the new request has completed the acquire operation. Afterwards,
the node emerges from its spin and indicates to the new requester that it has
received ownership of the lock.
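Putting lock and unlock together, a hedged, self-contained C11 sketch of the MCS lock (the names and the mapping onto stdatomic are my assumptions, not the lecture's code):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *_Atomic next;
    _Atomic bool locked;
} qnode;

typedef struct { qnode *_Atomic tail; } mcs_lock; // points at the last waiter, or NULL

void mcs_acquire(mcs_lock *L, qnode *I) {
    atomic_store(&I->next, NULL);
    qnode *pred = atomic_exchange(&L->tail, I); // fetch_and_store on the tail
    if (pred != NULL) {                         // queue was non-empty
        atomic_store(&I->locked, true);
        atomic_store(&pred->next, I);           // link in behind the predecessor
        while (atomic_load(&I->locked));        // spin until the predecessor signals
    }
}

void mcs_release(mcs_lock *L, qnode *I) {
    qnode *succ = atomic_load(&I->next);
    if (succ == NULL) { // no known successor
        qnode *expected = I;
        if (atomic_compare_exchange_strong(&L->tail, &expected, NULL))
            return; // tail was still I: no one is waiting
        while ((succ = atomic_load(&I->next)) == NULL); // wait for the new requester to link in
    }
    atomic_store(&succ->locked, false); // signal exactly one successor
}
```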

22. Link Based Queueing Lock (cont)


There are many subtleties behind implementing the algorithms for the link-based
queueing lock, as well as Andersen's lock, in the kernel. The linked-list based
queueing lock and the Andersen queueing lock both require, for instance, new
read-modify-write instructions, in the form of compare_and_swap ,
fetch_and_store , and fetch_and_inc .

If these atomic instructions are not available on a given architecture, then
they will have to be simulated using the default test-and-set instruction
included on these architectures.

23. Link Based Queueing Lock (cont)


Advantages
The advantages are similar to Andersen's queueing lock--link-based queueing
locks are fair, for example, and every spinning process spins on its own cached
copy of the guarded variable, which reduces excess contention on the network.

Additionally, exactly one process is signaled when a lock is released. There is only
one atomic operation per critical section (save for the corner case, which requires
a second atomic operation). The space complexity of this data structure is
dynamic: it is a function of how many processors are requesting access to the
lock, as opposed to how many processors in total exist in the given architecture.
Therefore, the space complexity of this algorithm improves significantly upon the
space complexity of Andersen's lock algorithm.

However, there is more linked-list maintenance overhead associated with
making a lock or unlock request. Andersen's array-based queueing lock can be
faster than the linked-list based algorithm for this reason. Performance can be
further affected if the necessary atomic instructions are not available on the
architecture--a drawback that also applies to Andersen's array-based queueing
lock.

While both of these algorithms scale better, if the processor only has
test-and-set , then the OS designer must rely on the exponential backoff
algorithms to handle synchronization in order to maximize performance.

24. Quiz - Algorithm Grading


Grade the algorithms by filling in the following table (Spin: S = on a shared variable, P = on a private/cached variable):

| Algorithm | Latency | Contention | Fair | Spin | RMW ops per CS | Space ovhd | Signal only one on release |
|---|---|---|---|---|---|---|---|
| Spin on T&S | | | | | | | |
| Spin on read | | | | | | | |
| Spin w/ delay | | | | | | | |
| Ticket lock | | | | | | | |
| Andersen | | | | | | | |
| MCS | | | | | | | |

Answer

| Algorithm | Latency | Contention | Fair | Spin | RMW ops per CS | Space ovhd | Signal only one on release |
|---|---|---|---|---|---|---|---|
| Spin on T&S | low | high | N | S | high | low | N |
| Spin on read | low | med | N | S | medium | low | N |
| Spin w/ delay | low++ | low+ | N | S | low+ | low | N |
| Ticket lock | low | low++ | Y | S | low++ | low+ | N |
| Andersen | low+ | low | Y | P | 1 | high | Y |
| MCS | low+ | low | Y | P | 1 (max 2) | med | Y |
