16 Synchronization
Implementing Synchronization
Parallel Computer Architecture and Programming
CMU 15-418/15-618, Fall 2023
Today’s topic: efficiently implementing
synchronization primitives
▪ Primitives for ensuring mutual exclusion
- Locks
- Atomic primitives (e.g., atomic_add)
- Transactions
▪ Primitives for event signaling
- Barriers
- Flags
Three phases of a synchronization event
1. Acquire method
- How a thread attempts to gain access to protected resource
2. Waiting algorithm
- How a thread waits for access to be granted to shared resource
3. Release method
- How a thread enables other threads to gain access to the resource when its
work in the synchronized region is complete
Busy waiting
▪ Busy waiting (a.k.a. “spinning”)
while (condition X not true) {}
logic that assumes X is true
“Blocking” synchronization
▪ Idea: if progress cannot be made because a resource cannot
be acquired, it is desirable to free up execution resources for
another thread (preempt the running thread)
if (condition X not true)
block until true; // OS scheduler de-schedules thread
// (lets another thread use the processor)
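As a concrete illustration (not from the slide), a pthread mutex is a blocking primitive: in its default configuration, a thread that cannot acquire the lock is de-scheduled by the OS rather than spinning.

```c
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
int shared_counter = 0;

void increment(void) {
    pthread_mutex_lock(&m);    // may block: OS can de-schedule this thread
    shared_counter++;          // critical section
    pthread_mutex_unlock(&m);  // wakes a blocked waiter, if any
}
```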
Busy waiting vs. blocking
▪ Busy-waiting can be preferable to blocking if:
- Scheduling overhead is larger than expected wait time
- Tail latency effects
- Processor’s resources not needed for other tasks
- This is often the case in a parallel program since we usually don’t oversubscribe
a system when running a performance-critical parallel app (e.g., there aren’t
multiple CPU-intensive programs running at the same time)
- Clarification: be careful not to confuse the above statement with the value of
multi-threading (interleaving execution of multiple threads/tasks to hide
long-latency memory operations with other work within the same app).
▪ Examples:
pthread_spinlock_t spin;
pthread_spin_lock(&spin);

int lock;
OSSpinLockLock(&lock); // OS X spin lock
Implementing Locks
Warm up: a simple, but incorrect, lock
Test-and-set based lock
Atomic test-and-set instruction:
ts R0, mem[addr] // atomically: load mem[addr] into R0 and,
// if mem[addr] was 0, set mem[addr] to 1
Test-and-set lock: consider coherence traffic
[Timeline figure: processors 0, 1, and 2 each repeatedly execute T&S. Every
attempt issues a BusRdX and updates the line in the attempting processor's
cache (setting it to 1), invalidating all other cached copies — even when the
attempt fails because the lock is held. An accompanying performance graph had
x-axis "Number of processors".]
Test-and-test-and-set lock: coherence traffic
[Timeline figure: Processor 1 acquires the lock via T&S (BusRdX, invalidating
other copies) and updates the line to 1. While P1 holds the lock, P2 and P3
each issue one BusRd and then spin with many reads from their local caches,
generating no further traffic. When P1 releases (write of 0: BusRdX,
invalidating the waiters' lines), each waiter re-reads the line (BusRd), sees
the lock free, and one wins the subsequent T&S (BusRdX); the cycle repeats.]
- Recall: test-and-set lock generated one invalidation per waiting processor per test
▪ More scalable (due to less traffic)
▪ Storage cost unchanged (one int)
▪ Still no provisions for fairness
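A sketch of the test-and-test-and-set loop, again using GCC's `__sync_lock_test_and_set` builtin as a stand-in for the atomic instruction:

```c
// Test-and-test-and-set: spin on ordinary reads (which hit in the local
// cache) and attempt the expensive atomic exchange only when the lock
// appears free.
void tts_lock(volatile int* l) {
    while (1) {
        while (*l != 0);                          // spin locally: no bus traffic
        if (__sync_lock_test_and_set(l, 1) == 0)  // lock looked free: try to grab it
            return;
    }
}

void tts_unlock(volatile int* l) {
    __sync_lock_release(l);
}
```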
Test-and-set lock with back off
Upon failure to acquire the lock, delay for a while before retrying
void Lock(volatile int* l) {
  int amount = 1;
  while (1) {
    if (test_and_set(l) == 0)  // note: pass the lock's address, not its value
      return;
    delay(amount);
    amount *= 2;               // exponential back-off
  }
}
Ticket lock
struct lock {
volatile int next_ticket;
volatile int now_serving;
};
void Lock(lock* l) {
int my_ticket = atomic_increment(&l->next_ticket); // take a “ticket”
while (my_ticket != l->now_serving); // wait for number
} // to be called
void unlock(lock* l) {
l->now_serving++;
}
Array-based lock
// assumed context: struct lock { volatile int head; volatile int status[P]; },
// with status[0] initialized to 0 (lock available) and all other entries to 1
int my_element; // private per processor
void Lock(lock* l) {
my_element = atomic_circ_increment(&l->head); // assume circular increment
while (l->status[my_element] == 1);
}
void unlock(lock* l) {
l->status[my_element] = 1;
l->status[circ_next(my_element)] = 0; // circ_next() gives next index
}
O(1) interconnect traffic per release, but lock requires space linear in P
Also, the atomic circular increment is a more complex operation (higher overhead)
x86 cmpxchg
▪ Compare and exchange (atomic when used with lock prefix)
lock cmpxchg dst, src // dst is often a memory address
// Semantics: compare dst with the accumulator (EAX/RAX);
// if equal, store src into dst; otherwise load dst into the accumulator.
// ZF indicates whether the exchange happened.
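Compare-and-exchange can implement a lock directly; a sketch using GCC's `__sync_val_compare_and_swap` builtin, which compiles to `lock cmpxchg` on x86:

```c
// CAS-based lock: atomically change *l from 0 to 1, succeeding only if
// the observed old value was 0 (lock free).
void cas_lock(volatile int* l) {
    while (__sync_val_compare_and_swap(l, 0, 1) != 0);
}

void cas_unlock(volatile int* l) {
    *l = 0;  // plain store suffices on x86's strong memory model
}
```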
Queue-based Lock (MCS lock)
More details: Figure 5 of “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors” (Mellor-Crummey and Scott, 1991)
Implementing Barriers
Implementing a centralized barrier
(Based on shared counter)
struct Barrier_t {
LOCK lock;
int counter; // initialize to 0
int flag; // the flag field should probably be padded to
// sit on its own cache line. Why?
};
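A self-contained sketch of this first version, substituting a simple test-and-set spinlock for the slide's LOCK type (helper names are illustrative). Note the flaw: if the barrier is reused, a fast thread can race ahead to the next barrier instance and reset flag while slower threads are still spinning on it.

```c
typedef struct {
    volatile int lock;      // simple test-and-set spinlock (stands in for LOCK)
    volatile int counter;   // initialize to 0
    volatile int flag;      // should sit on its own cache line
} Barrier_t;

static void acquire(volatile int* l) { while (__sync_lock_test_and_set(l, 1)); }
static void release(volatile int* l) { __sync_lock_release(l); }

// FLAWED if reused back-to-back: the first arriver at the NEXT barrier
// resets flag while threads may still be waiting on the previous one.
void barrier(Barrier_t* b, int p) {     // p = number of threads
    acquire(&b->lock);
    if (b->counter == 0)
        b->flag = 0;                    // first arriver resets the flag
    int num_arrived = ++b->counter;
    release(&b->lock);

    if (num_arrived == p) {             // last arriver releases everyone
        b->counter = 0;
        b->flag = 1;
    } else {
        while (b->flag == 0);           // wait for the release signal
    }
}
```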
Correct centralized barrier
struct Barrier_t {
LOCK lock;
int arrive_counter; // initialize to 0 (number of threads that have arrived)
int leave_counter; // initialize to P (number of threads that have left barrier)
int flag;
};
Centralized barrier with sense reversal
int local_sense = 0; // private per processor. Main idea: processors wait for flag
// to be equal to local_sense
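The sense-reversal idea can be sketched as follows (type and helper names are illustrative; a test-and-set spinlock again stands in for LOCK). Each thread flips its private local_sense on every barrier and waits for the shared flag to match, so consecutive barrier instances cannot interfere:

```c
typedef struct {
    volatile int lock;      // test-and-set spinlock
    volatile int counter;   // init 0
    volatile int flag;      // init 0
} SenseBarrier_t;

static void sb_acquire(volatile int* l) { while (__sync_lock_test_and_set(l, 1)); }
static void sb_release(volatile int* l) { __sync_lock_release(l); }

// local_sense is private per thread (e.g., a stack or thread-local variable)
void sense_barrier(SenseBarrier_t* b, int* local_sense, int p) {
    *local_sense = !*local_sense;      // sense for this barrier instance
    sb_acquire(&b->lock);
    int num_arrived = ++b->counter;
    sb_release(&b->lock);

    if (num_arrived == p) {            // last arriver flips the shared flag
        b->counter = 0;
        b->flag = *local_sense;
    } else {
        while (b->flag != *local_sense);
    }
}
```

No flag reset is needed on entry, which removes the reuse race of the previous version.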
Combining tree implementation of barrier
▪ Problem with the centralized barrier: high contention (e.g., all processors
hammer a single barrier lock and counter)
▪ Idea: build a tree of small barriers; processors synchronize in groups at the
leaves, and one representative per group proceeds up the tree
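A combining-tree barrier can be sketched as a tree of small counters with sense reversal, so each node sees contention only from its own children (all names here are illustrative):

```c
typedef struct TreeNode {
    volatile int count;        // arrivals so far at this node
    volatile int sense;        // release flag, sense-reversed per use
    struct TreeNode* parent;   // NULL at the root
    int expected;              // number of arrivals this node waits for
} TreeNode;

// Each thread calls this at its leaf node with a per-thread sense value
// that it flips before every barrier instance.
void tree_barrier(TreeNode* n, int sense) {
    if (__sync_add_and_fetch(&n->count, 1) == n->expected) {
        if (n->parent)
            tree_barrier(n->parent, sense);   // last arriver combines upward
        n->count = 0;                         // reset this node for reuse
        n->sense = sense;                     // release waiters at this node
    } else {
        while (n->sense != sense);            // spin on this node's flag only
    }
}
```

Releases propagate back down the recursion: when the root releases, each level's last arriver returns and flips its node's sense, waking that node's waiters.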