CS330 Operating System Part VI
Lecture 29
Threads
Threads are almost independent execution entities of a single process. Threads of a single process
can be scheduled on different CPUs in a concurrent manner.
Therefore, each thread has its own register state and stack, and at a given point in time the PC of different threads can differ.
Threads are different from processes. Threads of a single process share the address space, and
therefore context switch between them does not require switching the address space.
Is multi-threading useful?
Multithreading allows a process to leverage multi-core systems.
Threads share the address space, so global variables can be accessed from the thread functions. Also, dynamically allocated memory can be passed as thread arguments.
Thread information is stored in the thread control block (TCB) which is pointed to by the PCB.
TCB contains the register state, which is used to save/restore the CPU state during context switch.
In Linux, however, there is less distinction between a thread and a process: a thread is treated as a separate process (task). The difference is that constructs within the domain of the current process, like the address space, file state etc., do not have to be copied as in a vanilla fork implementation; only pointers to them are maintained and copied. A thread also differs from a process in that its PID differs from the TGID (thread group ID), which equals the PID of the main thread.
The stack for a thread is dynamically allocated from the address space using the mmap() system call and passed to the OS during thread creation. Since all threads operate in the same address space, one thread can potentially access (or corrupt) the stack of another; this is an accepted trade-off.
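As a sketch of such an allocation (STACK_SIZE and the flag set are illustrative; pthread_create performs an equivalent allocation internally):

#include <sys/mman.h>
#define STACK_SIZE (8 * 1024 * 1024)   /* 8 MB, a typical default (assumed) */
/* Allocate a thread stack; MAP_STACK marks the mapping as suitable for a stack on Linux */
void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);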
pthread_create creates a thread with tid as its handle, and the thread starts executing the function pointed to by thfunc. A single void* argument can be passed to the thread. The thread attribute can be used to control thread behavior, e.g., stack size, stack address etc.; passing NULL sets the defaults. Returns 0 on success.
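For reference, the prototype:

int pthread_create(pthread_t *tid, const pthread_attr_t *attr,
                   void *(*thfunc)(void *), void *arg);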
A thread can also be terminated using calls like pthread_exit() or pthread_cancel().
In Linux, both pthread_create and fork are implemented using the clone system call.
pthread_join
This call is generally made from the ‘main’ thread. It waits for the thread tid to finish, and the thread’s return value is captured in retval (the thread must allocate the return value, which is freed after the join).
Invoking pthread_join on an already finished thread returns immediately.
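For reference, the prototype:

int pthread_join(pthread_t tid, void **retval);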
// pthreads.c
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<pthread.h>
/* Worker function (body reconstructed; the notes only show its use) */
void *do_inc(void *arg)
{
    int id = *(int *)arg;
    printf("Thread %d running\n", id);
    return NULL;
}
int main()
{
    int num_threads = 4, ctr;
    pthread_t threads[4];
    int tids[4];
    void *retval;
    /*Create threads*/
    for(ctr=0; ctr < num_threads; ++ctr){
        tids[ctr] = ctr;
        if(pthread_create(&threads[ctr], NULL, do_inc, &tids[ctr]) != 0){
            perror("pthread_create");
            exit(-1);
        }
    }
    /*Wait for threads to finish their execution*/
    for(ctr=0; ctr < num_threads; ++ctr)
        pthread_join(threads[ctr], &retval);
    return 0;
}
The next example compares the stack addresses of the main thread and a created thread:
#include<stdio.h>
#include<stdlib.h>
#include<pthread.h>
/* thfunc reconstructed: print an address on the thread's stack */
void *thfunc(void *arg) {
    int local;
    printf("Thread stack pointer = %p\n", (void *)&local);
    return NULL;
}
int main()
{
    pthread_t tid;
    if(pthread_create(&tid, NULL, thfunc, NULL) != 0){
        perror("pthread_create");
        exit(-1);
    }
    printf("Main stack pointer = %p\n", &tid);
    pthread_join(tid, NULL);
    return 0;
}
The printed addresses show that the thread’s stack lies far from the main stack, in the mmap region. Running strace on this program shows an mmap call (returning an address close to the thread stack address) followed by a clone call. Several flags are passed to clone(), some of which are: CLONE_VM (share the address space), CLONE_FS (share file-system state), CLONE_THREAD (place the new task in the same thread group).
Linux does not distinguish between the way fork and pthread_create create new processes/threads; both go through clone. The difference lies in the flags that are passed.
// find_max_parallel.c (missing pieces reconstructed; the per-element function is assumed)
#include<stdio.h>
#include<stdlib.h>
#include<sys/time.h>
#include<string.h>
#include<math.h>
#include<assert.h>
#include<pthread.h>

#define MAX_THREADS 64
#define SEED 20              /* assumed seed value */
#define TDIFF(s, e) (((e).tv_sec - (s).tv_sec) * 1000000L + ((e).tv_usec - (s).tv_usec))
#define USAGE_EXIT(s) do { \
    printf("Usage: %s <# of elements> <# of threads>\n %s\n", argv[0], s); \
    exit(-1); \
} while(0)
#define function(x) sqrt((double)(x))  /* per-element work; exact function assumed */

struct thread_param{
    pthread_t tid;
    int *array;
    int size;
    double max;
    int max_index;
};

/* Thread function (reconstructed): compute the max over this thread's chunk */
static void *find_max(void *arg)
{
    struct thread_param *param = arg;
    int ctr;
    param->max = function(param->array[0]);
    param->max_index = 0;
    for(ctr=1; ctr < param->size; ++ctr){
        double val = function(param->array[ctr]);
        if(val > param->max){
            param->max = val;
            param->max_index = ctr;
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    struct thread_param *params, *param;
    struct timeval start, end;
    int *a, *ptr;
    int num_elements, ctr, num_threads, per_thread, residue, max_index = 0;
    double max = 0.0;
    if(argc !=3)
        USAGE_EXIT("not enough parameters");
    num_elements = atoi(argv[1]);
    if(num_elements <=0)
        USAGE_EXIT("invalid num elements");
    num_threads = atoi(argv[2]);
    if(num_threads <=0 || num_threads > MAX_THREADS){
        USAGE_EXIT("invalid num of threads");
    }
    per_thread = num_elements / num_threads;
    residue = num_elements % num_threads;
    if(per_thread <= 0)
        USAGE_EXIT("invalid num of elements to threads");
    a = malloc(num_elements * sizeof(int));
    if(!a){
        USAGE_EXIT("invalid num elements, not enough memory");
    }
    params = calloc(num_threads, sizeof(struct thread_param));
    if(!params)
        USAGE_EXIT("not enough memory");
    srand(SEED);
    for(ctr=0; ctr<num_elements; ++ctr)
        a[ctr] = rand();
    ptr = a;
    gettimeofday(&start, NULL);
    /* Partition the array across threads; spread the residue over the first few */
    for(ctr=0; ctr<num_threads; ++ctr){
        param = &params[ctr];
        param->size = per_thread;
        if(residue){
            param->size++;
            --residue;
        }
        param->array = ptr;
        ptr += param->size;
        pthread_create(&param->tid, NULL, find_max, param);
    }
    assert((ptr - a) == num_elements);
    /* Join the threads and reduce their per-chunk results */
    for(ctr=0; ctr<num_threads; ++ctr){
        pthread_join(params[ctr].tid, NULL);
        if(params[ctr].max > max){
            max = params[ctr].max;
            max_index = (int)(params[ctr].array - a) + params[ctr].max_index;
        }
    }
    gettimeofday(&end, NULL);
    printf("Time taken = %ld microsecs\n", TDIFF(start, end));
    printf("Max = %.2f @index %d\n", max, max_index);
    free(a);
    free(params);
    return 0;
}
Running this program, we observe that increasing the number of threads reduces the time taken.
Lecture 30
Consider the simultaneous increment to a global counter variable by multiple threads.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#define MAX_THREADS 64
#define OP_COUNT 10000000UL   /* 10^7 increments per thread */
#define USAGE_EXIT(s) \
    do \
    { \
        printf("Usage: %s <# of threads> \n %s\n", argv[0], s); \
        exit(-1); \
    } while (0)

unsigned long g_ctr = 0;      /* shared global counter */

/* Thread function and main reconstructed; the notes show only fragments */
void *do_inc(void *arg)
{
    unsigned long ctr;
    for (ctr = 0; ctr < OP_COUNT; ++ctr)
    {
        g_ctr++;
        // asm volatile("incq %0;"
        //              : "=m"(g_ctr)
        //              :
        //              : "memory");
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t threads[MAX_THREADS];
    int num_threads, ctr;
    if (argc != 2)
        USAGE_EXIT("not enough parameters");
    num_threads = atoi(argv[1]);
    if (num_threads <= 0 || num_threads > MAX_THREADS)
    {
        USAGE_EXIT("invalid num of threads");
    }
    for (ctr = 0; ctr < num_threads; ++ctr)
        pthread_create(&threads[ctr], NULL, do_inc, NULL);
    for (ctr = 0; ctr < num_threads; ++ctr)
        pthread_join(threads[ctr], NULL);
    printf("Counter = %lu\n", g_ctr);
    return 0;
}
Output: with two threads, the final counter is not 2 × 10000000. Rather, the result is non-deterministic and varies across runs.
Reason:
counter++ in assembly:
    Mov (counter) R1
    Add 1 R1
    Mov R1 (counter)
A single C line can compile to multiple instructions, and scheduling between them creates the problem. For example: T1 loads counter (say 0) into a register and is then scheduled out; T2 runs and completes its full increment, making counter 1. When T1 is scheduled back, it adds 1 to its stale register value and stores 1. The counter should have been 2, but it is not, because of the in-between scheduling.
Some definitions
• Atomic operation: An operation is atomic if it is uninterruptible and indivisible.
• Critical section: A section of code accessing one or more shared resource(s), mostly shared
memory location(s).
• Mutual exclusion: Technique to allow exactly one execution entity to execute the critical
section.
• Race condition: Occurs when multiple threads are allowed to enter the critical section concurrently, making the result depend on the order of execution.
Critical sections of an OS
• The OS maintains shared information that can be accessed from different OS-mode execution contexts (e.g., system call handlers, interrupt handlers etc.), which often run in parallel. For example:
1. A page table entry being updated simultaneously due to swapping and due to a change in protection flags.
2. The queue of network packets being updated concurrently to deliver packets to a process and to receive incoming packets from the network device.
In the first case (system calls only), disabling preemption prevents the thread from being scheduled out during the course of the critical section. As seen previously, this does not rectify the issue on a multi-processor system, so locking is required to keep other entities from accessing the shared data.
In the second case, disabling interrupts is a stricter condition than disabling preemption, since it is timer interrupts that cause preemption. It not only prevents the thread from being descheduled, but also prevents any interrupt handler from ‘hijacking’ and taking over execution during the critical section. In this case, locking alone does not work (Why?).
Lecture 31
Concurrency issues in an OS are challenging because even finding the race condition is non-trivial.
Locking in pthread
pthread_mutex
pthread_mutex_t lock; // Initialized using pthread_mutex_init
static int counter = 0;
void *thfunc(void *arg) {
    int ctr;
    for(ctr=0; ctr<10000; ++ctr){
        pthread_mutex_lock(&lock); // One thread acquires the lock, others wait
        counter++; // Critical section
        pthread_mutex_unlock(&lock); // Release the lock
    }
    return NULL;
}
A generic lock interface:
lock_t* L;
lock(L) {
    return; // returns only after the lock is acquired
}
unlock(L) {
    return; // returns after the lock is released
}
Fairness
Given N threads contending for the lock, the number of unsuccessful lock-acquisition attempts should be the same for all contending threads.
Bounded wait property: Given N threads contending for the lock, there should be an upper
bound on the number of attempts made by a given context to acquire the lock.
Lecture 32
Locks
We now look at some implementation techniques for designing spinlocks.
Buggy attempt
Consider the following implementation:
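The code itself is not reproduced in the notes; a standard sketch consistent with the discussion below (assuming *L == 0 means free):

void lock(lock_t *L) {
    while (*L == 1);   // spin while the lock is held
    *L = 1;            // then mark it held -- not atomic with the check!
}
void unlock(lock_t *L) {
    *L = 0;
}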
This implementation does not work because the compare (the while check) and the set must occur atomically. Consider a single-core system: after thread T1 calls lock(L) and passes the while check, it is descheduled before setting *L. Thread T2, next in line of execution, also passes the check because *L has not yet been set, and gains the lock; when T1 comes back, it too completes the acquisition, so both T1 and T2 hold the lock. In a multi-core system, two cores executing the while check concurrently can both gain the lock.
One hardware remedy is an atomic exchange instruction (x86 xchg), which exchanges the value of register R and the value at memory address M as a single atomic operation. In the lock implementations here, the RDI register contains the lock argument. (One can work through various context-switch scenarios to verify correctness.)
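A sketch of a spinlock built on atomic exchange (the exact code is not in the notes; *L == 0 means free, and the lock pointer sits in RDI per the calling convention):

lock(lock_t *L) {
    asm volatile(
        "mov $1, %%rax;"
        "1: xchg %%rax, (%%rdi);"   // atomically swap RAX with *L; xchg with memory is locked
        "test %%rax, %%rax;"
        "jnz 1b;"                   // old value was 1: lock was held, retry
        : : : "rax", "memory");
}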
In x86, the CAS equivalent cmpxchg is not atomic by default, but it can be made atomic by using the lock prefix.
// usage in x86
lock(lock_t *L) {
    asm volatile(
        "mov $1, %%rcx;"
        "loop: xor %%rax, %%rax;"
        "lock cmpxchg %%rcx, (%%rdi);"
        "jnz loop;"
        : : : "rcx", "rax", "memory");
}
unlock(lock_t *L) { *L = 0; }  // release: 0 means free (the original's *L = 1 appears to be a typo)
The value of RAX (= 0) is compared against the value at the address in register RDI; if they are equal, the memory value is replaced with RCX (= 1).
LoadLinked (R, M): Loads the value at memory address M into register R, and the hardware tracks M for intervening stores.
StoreConditional (R, M): Stores the value of R to M only if no stores happened to M after the execution of the LL instruction (after execution, R = 1). Otherwise, the store is not performed (after execution, R = 0).
Supported in RISC architectures like MIPS, RISC-V etc.
// assembly equivalent
lock: LL R1, (R2)      // R2 = lock address; R1 = current lock value
      BNEQZ R1, lock   // lock held (non-zero): retry
      ADDUI R1, R0, #1 // R1 = 1
      SC R1, (R2)      // attempt the store; R1 = 1 on success, 0 on failure
      BEQZ R1, lock    // SC failed: retry
This is efficient because the hardware avoids memory traffic for unsuccessful lock-acquire attempts. A context switch between LL and SC causes the SC to fail (the architecture guarantees that).
A variant that backs off between attempts reduces contention further:
lock (lock_t* L) {
    u64 backoff = 0;
    while(LoadLinked(L) || !StoreConditional(L, 1)){
        if (backoff < 63) backoff++;
        pause(1UL << backoff); // hint to processor; wait longer after each failure
    }
}
Fairness in spinlocks
To ensure fairness, some notion of ordering is required. What if the threads are granted the lock in the order of their arrival at the lock contention loop? A single lock variable may not be sufficient.
Example solution: Ticket spinlocks
Atomic fetch-and-add
xadd R, M:
    TmpReg T = R + [M]
    R = [M]
    [M] = T
xadd is a fetch-and-add instruction: it atomically adds R to the value at a particular address while returning the old value in R. It requires the lock prefix to be atomic.
// ticket locking
struct lock_t {
long ticket;
long turn;
};
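A sketch of the acquire/release paths, assuming a fetch_and_add() primitive built on lock xadd:

void lock(struct lock_t *L) {
    long myturn = fetch_and_add(&L->ticket, 1); // take a ticket: order of arrival
    while (L->turn != myturn);                  // spin until it is our turn
}
void unlock(struct lock_t *L) {
    L->turn++;                                  // hand over to the next ticket holder
}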
The local variable myturn records the order of arrival. A thread is in the CS only when its myturn equals turn.
# threads waiting = ticket - turn - 1
The value of turn is incremented on lock release, so the thread that arrived just after the current thread enters the CS next. When a new thread arrives, it gets the lock only after the threads ahead of it have acquired and released the lock.
Ticket spinlock guarantees bounded waiting. If N threads are contending for the lock and execution of the CS consumes T cycles, then the bound is N × T cycles (assuming negligible context-switch overhead).
Reader-Writer Locks
Allows multiple readers but a single writer to enter the CS.
Consider the following working example: a BST protected by a reader-writer lock.
struct BST {
    struct Node* root;
    rwlock_t* lock;
};
struct Node {
    item_t item;
    struct Node* left;
    struct Node* right;
};
struct rwlock_t {
    Lock read_lock;
    Lock write_lock;
    int num_readers;
};
init_lock(rwlock_t* rL) {
    init_lock(&rL->read_lock);
    init_lock(&rL->write_lock);
    rL->num_readers = 0;
}
This serves as the baseline; the writers and the readers get different implementations.
For the writers: write-lock behavior is the same as the typical lock; only one thread is allowed to acquire it.
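A minimal sketch of the writer side (using the generic lock/unlock from before):

void write_lock(rwlock_t *rL)   { lock(&rL->write_lock); }
void write_unlock(rwlock_t *rL) { unlock(&rL->write_lock); }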
For the readers: the first reader acquires the write lock, to prevent writers from entering; the last reader releases the write lock to allow writers again.
void read_lock(rwlock_t *rL) {
lock(&rL->read_lock);
rL->num_readers++;
if(rL->num_readers == 1)
lock(&rL->write_lock);
unlock(&rL->read_lock);
}
void read_unlock(rwlock_t *rL) {
    lock(&rL->read_lock);
    rL->num_readers--;
    if(rL->num_readers == 0)       // last reader out: let writers in
        unlock(&rL->write_lock);
    unlock(&rL->read_lock);
}
Buggy attempt #1
Consider a flag-based solution for two threads only, say T0 and T1: each thread waits until the other’s flag is clear and then sets its own. This is buggy because both threads can acquire the lock: the “while condition check” and “setting the flag” are not atomic.
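A sketch of this attempt (reconstructed from the description; id is the thread’s index, 0 or 1):

int flag[2] = {0, 0};
void lock(int id) {
    while (flag[id ^ 1]);  // check: wait while the other thread is in
    flag[id] = 1;          // set: mark ourselves in -- not atomic with the check
}
void unlock(int id) {
    flag[id] = 0;
}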
Buggy attempt #2
int flag[2] = {0, 0};
void lock(int id) {
    flag[id] = 1;          // announce intent first
    while(flag[id^1]);     // then wait for the other thread
}
This solution does not work either: it can lead to deadlock (flag[0] = flag[1] = 1, with both threads spinning). In other words, the “progress” requirement is not met.
Progress: If no one has acquired the lock and there are contending threads, one of the threads
must acquire the lock within a finite time.
Buggy attempt #3
This attempt uses a single shared variable and strict alternation:
int turn = 0;
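The remaining code is not in the notes; a sketch consistent with the discussion below:

void lock(int id) {
    while (turn != id);    // spin until it is this thread's turn
}
void unlock(int id) {
    turn = id ^ 1;         // pass the turn to the other thread
}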
Assuming T0 applies for the lock first, this attempt does guarantee mutual exclusion. However, the two threads must contend for the lock strictly alternately: if one thread stops contending, the other spins forever in the lock loop (in non-CS code) even though no one holds the lock. Thus the progress requirement is not met.
Peterson’s solution
int turn = 0;
int flag[2] = {0, 0};
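The lock/unlock routines (standard form of Peterson’s algorithm; not reproduced in the notes):

void lock(int id) {
    flag[id] = 1;                             // announce intent
    turn = id ^ 1;                            // yield priority to the other thread
    while (flag[id ^ 1] && turn == (id ^ 1)); // wait only if the other intends and has priority
}
void unlock(int id) {
    flag[id] = 0;
}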
Mutual exclusion and fairness are guaranteed. (The lock is fair because, if two threads are contending, they acquire the lock in an alternating manner.)
Lecture 33
Semaphores
Consider a scenario where a finite array of size N is accessed from a set of producer and consumer threads. In this case:
- At most N concurrent producers are allowed if the array is empty
- At most N concurrent consumers are allowed if the array is full
- With mutual-exclusion techniques, only one producer or consumer is allowed at any point in time.
Operations on semaphores
typedef struct semaphore {
    int value;
    spinlock_t* lock;
    queue* waitQ;
} sem_t;
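A sketch of the wait and post operations on this structure (blocking semantics; current, enqueue/dequeue, block and wakeup are assumed helpers):

void sem_wait(sem_t *s) {
    lock(s->lock);
    s->value--;
    if (s->value < 0) {              // no resource: block the caller
        enqueue(s->waitQ, current);
        unlock(s->lock);
        block(current);              // sleep until a sem_post wakes us
    } else
        unlock(s->lock);
}
void sem_post(sem_t *s) {
    lock(s->lock);
    s->value++;
    if (s->value <= 0)               // someone is waiting
        wakeup(dequeue(s->waitQ));
    unlock(s->lock);
}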
Semaphores can be used within a multi-threaded process or across multiple processes. The second argument of sem_init controls sharing: if it is 0, the semaphore is shared between threads of the same process.
Semaphores initialized with a value of 1 are called binary semaphores, and can be used to implement blocking (waiting) locks.
A semaphore initialized to 0 lets a parent wait for its child. If the parent is scheduled first after child creation, it waits until the child finishes (sem_wait decrements the value, and since the value becomes negative, the parent blocks). If the child is scheduled before the parent, the parent does not wait on the semaphore.
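A sketch of this scenario (semaphore initialized to 0; names assumed):

sem_t s;                      // sem_init(&s, 0, 0): initial value 0
void *child(void *arg) {
    /* ... child's work ... */
    sem_post(&s);             // signal completion; wakes the parent if it is waiting
    return NULL;
}
/* parent, after pthread_create(): */
sem_wait(&s);                 // blocks only if the child has not posted yet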
Semaphores can also be used to enforce ordering across threads. Consider two threads that each set one variable and print the other:
A=0; B=0;
Thread-0 {
A = 1;
printf("B: %d\n", B);
}
Thread-1 {
B = 1;
printf("A: %d\n", A);
}
Attempt 1: ensure Thread-0 prints B only after Thread-1 has set it.
sem_init(&s1, 0, 0);
A=0; B=0;
Thread-0 {
A = 1;
sem_wait(&s1);
printf("B: %d\n", B);
}
Thread-1 {
B = 1;
sem_post(&s1);
printf("A: %d\n", A);
}
Attempt 2: a rendezvous; each thread signals its own update and waits for the other’s, so both threads print 1.
sem_init(&s1, 0, 0);
sem_init(&s2, 0, 0);
A=0; B=0;
Thread-0 {
    A=1;
    sem_post(&s1);
    sem_wait(&s2);
    printf("B: %d\n", B);
}
Thread-1 {
    B=1;
    sem_wait(&s1);
    sem_post(&s2);
    printf("A: %d\n", A);
}
Producer-Consumer Problem
A buffer of size N, with one or more producers and consumers. A producer inserts an element into the buffer (after processing); a consumer extracts an element from the buffer and processes it. Examples: a multi-threaded web server, network protocol layers etc.
How to solve this problem using semaphores?
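The shared state assumed by the attempts below (the declarations are not shown in the notes):

item_t A[N];             // shared buffer, n == N slots
int pctr = 0, cctr = 0;  // producer/consumer positions
sem_t empty;             // count of empty slots; sem_init(&empty, 0, N)
sem_t used;              // count of filled slots; sem_init(&used, 0, 0)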
Buggy Attempt 1
produce(item_t item) {
    sem_wait(&empty);
    A[pctr]=item;
    pctr=(pctr+1)%n;
    sem_post(&used);
}
item_t consume() {
    sem_wait(&used);
    item_t item = A[cctr];
    cctr=(cctr+1)%n;
    sem_post(&empty);
    return item;
}
This solution is buggy because pctr and cctr are not protected: concurrent producers (or consumers) can race on them.
Buggy Attempt 2
produce(item_t item) {
    Lock(L); sem_wait(&empty);
    A[pctr]=item;
    pctr=(pctr+1)%n;
    sem_post(&used); Unlock(L);
}
item_t consume() {
    Lock(L); sem_wait(&used);
    item_t item = A[cctr];
    cctr=(cctr+1)%n;
    sem_post(&empty); Unlock(L);
    return item;
}
Consider empty = 0 when the producer takes the lock before the consumer. This results in a deadlock: the consumer waits for L while the producer, holding L, waits on empty.
A working solution
produce(item_t item) {
sem_wait(&empty); Lock(L);
A[pctr]=item;
pctr=(pctr+1)%n;
Unlock(L); sem_post(&used);
}
item_t consume() {
    sem_wait(&used); Lock(L);
    item_t item = A[cctr];
    cctr=(cctr+1)%n;
    Unlock(L); sem_post(&empty);
    return item;
}
The solution is deadlock-free and ensures correct synchronization, but it is heavily serialized: every produce and consume goes through the single lock L. Using separate locks for producers (P) and consumers (C) lets a producer and a consumer proceed concurrently:
produce(item_t item) {
sem_wait(&empty); Lock(P);
A[pctr]=item;
pctr=(pctr+1)%n;
Unlock(P); sem_post(&used);
}
item_t consume() {
    sem_wait(&used); Lock(C);
    item_t item = A[cctr];
    cctr=(cctr+1)%n;
    Unlock(C); sem_post(&empty);
    return item;
}
Lecture 34
Some common issues in concurrent programs: atomicity violations, failures of ordering assumptions, and deadlocks.
Atomicity issues
Consider the following program
char* ptr; // allocated before use
void T1() {
...
strcpy(ptr, "hello world");
...
}
void T2() {
...
if (some_condition) {
free(ptr); ptr=NULL;
}
}
This code is buggy because ptr can be freed (by T2) before the strcpy in T1 completes, which results in a segmentation fault. A first attempt at a fix:
char* ptr; // allocated before use
void T1() {
...
if(ptr) strcpy(ptr, "hello world");
...
}
void T2() {
...
if (some_condition) {
free(ptr); ptr=NULL;
}
}
This, however, does not fix the issue. Consider the following order of execution:
T1: if(ptr)  →  T2: free(ptr); ptr = NULL  →  T1: strcpy  →  Result: SEGFAULT.
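The check and the use must be made atomic, for example by holding a lock across both (a sketch; not the notes’ own fix):

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
void T1() {
    pthread_mutex_lock(&m);
    if (ptr) strcpy(ptr, "hello world");  // check and use under one lock
    pthread_mutex_unlock(&m);
}
void T2() {
    pthread_mutex_lock(&m);
    if (some_condition) { free(ptr); ptr = NULL; }
    pthread_mutex_unlock(&m);
}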
Ordering issues
This code works under the assumption that line #4 of T2 is executed after line #4 of T1. If this ordering is violated, T1 is stuck in the while loop.
1. bool pending;   // requires <stdbool.h>
2. void T1()
3. {
4.     pending = true;
5.     do_large_processing();
6.     while(pending);
7. }

1. void T2()
2. {
3.     do_some_processing();
4.     pending = false;
5.     some_other_processing();
6. }
Deadlocks
struct acc_t {
lock_t* L;
id_t acc_no;
long balance;
};
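A sketch of how a deadlock can arise with such per-account locks (the transfer code itself is not in the notes): if T1 runs transfer(A, B) while T2 runs transfer(B, A), each acquires its first lock and then waits forever for the other’s.

void transfer(struct acc_t *from, struct acc_t *to, long amount) {
    lock(from->L);            // T1 locks A while T2 locks B ...
    lock(to->L);              // ... then each waits for the other's lock: deadlock
    from->balance -= amount;
    to->balance   += amount;
    unlock(to->L);
    unlock(from->L);
}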