CS 6210 Spring 2016 Midterm Soln
Name:____TAs plus Kishore_________________GT Number:
Note:
Good luck!
Page 1 of 13
OS Structures
2. (8 min, 17 points)
(a) (9 points) (Exokernel)
A process is currently executing on the processor. The process makes a
system call. List the steps involved in getting this system call serviced
by the operating system that this process belongs to. You should be precise
in mentioning the data structures involved in Exokernel that make this
service possible.
System Call traps to Exokernel. (+2)
Exokernel identifies the library OS responsible for handling this system
call using Time Slice Vector (+2)
Exokernel uses the PE (Processor Environment) data structure associated with
this library OS to get the system call context, which is the entry point
registered with Exokernel by the library OS for system calls. (+2)
Exokernel upcalls the library OS using this entry point. (+2)
The library OS services the system call (+1)
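[Note: the dispatch path above can be sketched as follows; the class and method names are illustrative, not the real Exokernel API.]

```python
class PE:
    """Processor Environment: per-library-OS state registered with Exokernel,
    including the entry point to upcall for system calls."""
    def __init__(self, syscall_entry):
        self.syscall_entry = syscall_entry

class Exokernel:
    def __init__(self):
        self.time_slice_vector = {}  # time slice -> library OS owning the CPU then
        self.pe_table = {}           # library OS -> its PE

    def register_libos(self, libos, time_slice, syscall_entry):
        self.time_slice_vector[time_slice] = libos
        self.pe_table[libos] = PE(syscall_entry)

    def trap(self, current_time_slice, syscall_args):
        # 1. The system call traps into Exokernel.
        # 2. The time slice vector identifies the responsible library OS.
        libos = self.time_slice_vector[current_time_slice]
        # 3. The PE for that library OS gives the registered syscall entry point.
        entry = self.pe_table[libos].syscall_entry
        # 4. Exokernel upcalls the library OS, which services the call.
        return entry(syscall_args)
```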
3. (7 mins, 12 points)
(c) (L3)
Consider a 64-bit paged-segmented architecture. The virtual address space
is 2^64 bytes. The TLB does not support address space IDs, so on an address
space switch, the TLB has to be flushed. The architecture has two segment
registers:
LB: lower bound for a segment
UB: upper bound for a segment
i. (4 points) There are three protection domains each requiring 32 MiB of
address space (Note: Mi is 2^20). How would you recommend implementing
the 3 protection domains in a way that reduces border crossing overheads
among these domains?
Pack the three 32 MiB domains into disjoint regions of a single virtual
address space, using the LB/UB segment registers to confine each domain to
its own region. Since all three domains share one address space, a border
crossing needs no address space switch and hence no TLB flush.
ii. (4 points) Explain with succinct bullets, what happens upon a call
from one protection domain to another.
UB and LB hardware registers are changed to correspond to the called
protection domain.
Context switch is made to transfer control to the entry point in the called
protection domain.
The architecture ensures that virtual addresses generated by the called
domain are within the bounds of legal addresses for that domain.
There is no need to flush the TLB on context switch from one protection
domain to another.
(+1 for each bullet above)
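[Note: a minimal model of the call sequence above, using illustrative names; observe that no TLB flush appears anywhere in the path.]

```python
class CPU:
    """Toy model of the paged-segmented architecture's bounds check."""
    def __init__(self):
        self.lb = self.ub = 0
        self.tlb_flushes = 0  # never incremented: domain switches keep the TLB

    def cross_domain_call(self, domain_bounds):
        # Hardware LB/UB registers are changed to the called domain's bounds.
        self.lb, self.ub = domain_bounds
        # No TLB flush: entries belonging to other domains fall outside
        # [lb, ub) and so cannot be used by the running domain.

    def translate(self, vaddr):
        # The architecture checks every virtual address against the bounds.
        if not (self.lb <= vaddr < self.ub):
            raise MemoryError("address outside current protection domain")
        return vaddr  # page-level translation elided in this sketch
```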
Virtualization
4. (15 mins, 30 points)
(a) (2 points) (interrupts)
The hypervisor gets an interrupt from the network interface card (NIC). How
does it determine to which virtual machine it has to deliver the interrupt?
Every packet arriving from the network carries the MAC address of the NIC for
which it is destined. The MAC addresses are uniquely associated with the VMs.
Based on the destination MAC address in the packet's frame header, the
packet-arrival interrupt is delivered to the corresponding VM.
[Note: As an aside, NAT protocol on your home router connecting several home
devices to the ISP works quite similarly.]
(+2 if MAC address of NICs associated with the VMs to direct intrpt mentioned)
(c) (6 points) (ballooning)
A virtualized setting uses ballooning to cater to the dynamic memory needs
of VMs. Imagine 4 VMs currently executing on top of the hypervisor. VM1
experiences memory pressure and requests the hypervisor for 100 MB of
additional memory. The hypervisor has no machine memory available currently
to satisfy the request. List the steps taken by the hypervisor in trying to
give the requested memory to VM1 using ballooning. (Concise bullets
please).
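[Note: one possible sequence, sketched with illustrative class and method names; it follows the standard VMware ESX-style ballooning protocol.]

```python
class GuestVM:
    """Illustrative guest VM; idle_mb is memory its balloon driver can reclaim."""
    def __init__(self, idle_mb):
        self.idle_mb = idle_mb
        self.granted_mb = 0

    def balloon_driver_inflate(self, ask_mb):
        # The balloon driver pins pages inside the guest, forcing the guest OS
        # to page out cold pages if needed; the machine pages behind the pinned
        # guest pages are returned to the hypervisor.
        got = min(ask_mb, self.idle_mb)
        self.idle_mb -= got
        return got

    def add_memory(self, mb):
        self.granted_mb += mb

class Hypervisor:
    def __init__(self, vms):
        self.vms = vms
        self.free_machine_mb = 0

    def reclaim_via_ballooning(self, requester, need_mb):
        # 1. Ask the balloon drivers in the OTHER VMs to inflate.
        for vm in self.vms:
            if vm is requester or self.free_machine_mb >= need_mb:
                continue
            # 2-3. Guest pages out; freed machine pages come back to hypervisor.
            self.free_machine_mb += vm.balloon_driver_inflate(
                need_mb - self.free_machine_mb)
        # 4. Grant the reclaimed machine memory to the requesting VM.
        granted = min(self.free_machine_mb, need_mb)
        self.free_machine_mb -= granted
        requester.add_memory(granted)
        return granted
```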
(d) (2 points) (memory reclamation)
One of the techniques for efficient use of memory which is a critical
resource for performance is to dynamically adjust the memory allocated to a
VM by taking away some or all of the idle memory (unused memory) from a
VM. This is referred to in the literature as taxing the VM proportional to
the amount of idle memory it has. Why is a 100% tax rate (i.e. taking away
ALL the idle memory from a VM) not a good idea?
Because any sudden increase in the working set size of the VM will result in
poor performance for that VM, potentially violating the SLA for that VM.
(e) (6 points) (para virtualization)
Using a concrete example (e.g., a disk driver), show how copying memory
buffers between guest VMs and the hypervisor is avoided in a
para-virtualized setting.
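[Note: a Xen-style sketch of the idea, with illustrative names: the guest's disk driver posts references to its buffer pages in a shared descriptor ring; the backend maps the same machine pages and transfers data in place, so the buffer itself is never copied.]

```python
machine_memory = {}          # machine page number (MPN) -> page contents

class IORing:
    """Shared descriptor ring between guest frontend and backend driver."""
    def __init__(self):
        self.requests = []   # descriptors only: page references, not data

    def post_read(self, mpn, sector):
        # Guest enqueues a reference to its buffer page plus request info.
        self.requests.append({"mpn": mpn, "sector": sector})

    def service(self, disk):
        # Backend writes disk data straight into the shared machine page;
        # the guest sees the data without any buffer copy.
        for req in self.requests:
            machine_memory[req["mpn"]] = disk[req["sector"]]
        self.requests.clear()
```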
(f) (full virtualization)
In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure
maintained by the hypervisor. Answer the following questions pertaining to
S-PT.
(i) (2 points) How many S-PT data structures are there?
One S-PT per VM (i.e., one per guest OS running on the hypervisor).
(iv) (6 points) The currently running OS switches from one process (say
P1) to another process (P2). List the sequence of steps before P2 starts
running on the processor.
Guest OS executes the privileged instruction for changing the PTBR to point
to the PT for P2. Results in a trap to the hypervisor. (+2)
From the PPN of the PT for P2, the hypervisor will know the offset into the
S-PT for that VM where the PT resides in machine memory. (+2)
Hypervisor installs the MPN thus located as the PT for P2 into PTBR. (+2)
Once the remaining formalities associated with context switching (which
also need hypervisor intervention) are complete, such as saving the volatile
state of P1 into its PCB and loading the volatile state of P2 from its PCB
into the processor, the process P2 can start executing.
Synchronization, Communication, and Scheduling in Parallel Systems
5. (15 mins, 30 points)
(a) (5 points) (Answer True/False with justification)
Sequential consistency memory model makes sense only in a non-cache coherent
(NCC) shared memory multiprocessor.
False. (+1)
Sequential consistency memory model is a contract between software and hardware.
(+2)
It is required for the programmer to reason about the correctness of the software.
(+1)
Cache coherence is only a mechanism for implementing the memory consistency model.
It can be implemented in hardware or software. (+1)
ii. (2 points) What does each of T1 and T2 know about the state of the
data structures at this point of time?
Possibility 1:
T2 knows its predecessor is curr (+1)
T1 knows its predecessor is T2 (+1)
Possibility 2:
T2 does not know anything about the queue associated with L (+1)
T1 knows its predecessor is curr (+1)
iii. (2 points) What sequence of subsequent actions will ensure the correct
formation of the waiting queue of lock requestors behind the current
lock holder?
Possibility 1:
T2 will set the next pointer in curr to point to T2. (+1)
T1 will set the next pointer in T2 to point to T1. (+1)
Possibility 2:
T1 will set the next pointer in curr to point to T1. (+1)
T2 will do a fetch-and-store on L->next; this will result in two
things: (+1)
o T2 will get its predecessor T1, and will set the next pointer in T1 to
point to T2
o L->next will now point to T2
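[Note: the linking protocol discussed above can be sketched as a simplified MCS-style queue lock; a Python lock stands in for the atomic fetch-and-store below.]

```python
import threading

class QNode:
    """Per-requestor queue node; each waiter spins on its own node."""
    def __init__(self):
        self.next = None
        self.locked = True

class MCSLock:
    def __init__(self):
        self.tail = None
        self._swap = threading.Lock()  # stands in for hardware fetch-and-store

    def acquire(self, node):
        with self._swap:               # atomic fetch-and-store on the tail
            pred, self.tail = self.tail, node
        if pred is None:
            return                     # lock was free: we hold it
        pred.next = node               # link ourselves behind our predecessor
        while node.locked:             # spin locally on our own node
            pass

    def release(self, node):
        if node.next is None:
            with self._swap:
                if self.tail is node:  # no waiter: empty the queue
                    self.tail = None
                    return
            while node.next is None:   # a waiter swapped in but hasn't linked
                pass                   # yet: wait for the link to appear
        node.next.locked = False       # signal the successor
```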
(c) (MCS barrier)
i. (6 points) The MCS barrier uses a 4-ary arrival tree where the nodes
are the processors and the links point to children. What will the
arrival tree look like for 16 processors labeled P0-P15? Use P0 as the
root of the tree. You can either draw the tree or give your answer by
showing the children for each of the 16 processors below:
P0: [P1, P2, P3, P4 ]
P1: [P5, P6, P7, P8 ]
P2: [P9, P10, P11, P12 ]
P3: [P13, P14, P15, X ]
P4: [ ]
P5: [ ]
P6: [ ]
P7: [ ]
P8: [ ]
P9: [ ]
P10: [ ]
P11: [ ]
P12: [ ]
P13: [ ]
P14: [ ]
P15: [ ]
(+1 for each of the rows P0 thru P3)
(+2 for showing no children for P4-P15)
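[Note: the children listed above follow the breadth-first rule for a 4-ary tree, node i's children being nodes 4i+1 through 4i+4.]

```python
def arrival_children(i, n=16, arity=4):
    """Children of processor Pi in an arity-ary MCS arrival tree of n nodes."""
    return [c for c in range(arity * i + 1, arity * i + arity + 1) if c < n]
```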
ii. (3 points) What is the reason for such a construction of the arrival
tree? (Concise bullets please)
Unique and static location for each processor to signal barrier
completion
Spinning on a statically allocated local word-length variable, into which
the data for four processors is packed, reduces bus contention
The 4-ary tree construction showed the best performance on the Sequent
Symmetry used in the experiments in the MCS paper
(+1 for each bullet above)
(d) (4 points) (LRPC)
Recall that light-weight RPC (LRPC) is for cross-domain calls within a
single host without going across the network. The kernel allocates A-stack
in physical memory and maps this into the virtual address space of the
client and the server. It also allocates an E-stack that is visible only to
the server. What is the purpose of the E-stack? (Concise bullets please)
By procedure calling convention, the server procedure expects the actual
parameters to be in a stack in its address space. E-stack is provided for
this purpose.
The arguments placed in the A-stack by the client stub are copied into the
E-stack by the server stub. Once this is done, the server procedure can
execute as it would in a normal procedure call, using the E-stack.
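[Note: the A-stack to E-stack flow can be sketched as follows; the function names are illustrative.]

```python
a_stack = []   # mapped into BOTH client and server address spaces

def client_stub(args):
    a_stack.extend(args)          # marshal actuals into the shared A-stack
    return kernel_transfer()      # trap; kernel switches to the server domain

def kernel_transfer():
    # Server stub copies the actuals from the A-stack to the private E-stack,
    # so the server procedure sees an ordinary call stack in its own space.
    e_stack = list(a_stack)
    a_stack.clear()
    return server_procedure(e_stack)

def server_procedure(stack):
    # Runs with the normal procedure-calling convention on the E-stack.
    return sum(stack)
```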
(e) (4 points) (multiprocessor scheduling)
A processor has 2 cores and each core is 4-way multithreaded. The last
level cache of the processor is 32 MB. The OS has the following pool of
ready to run threads:
Pool1: 8 threads each having a working set of 1 MB (medium priority)
Pool2: 3 threads each having a working set of 4 MB (highest priority)
Pool3: 4 threads each having a working set of 8 MB (medium priority)
The OS can choose any subset of the threads from the above pools to schedule
on the cores. Which threads should be scheduled that will make full use of
the available parallelism and the cache capacity while respecting the thread
priority levels?
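[Note: one way to reason about this is a small exhaustive search, sketched below under the assumption that all 3 highest-priority (Pool2) threads must run and that the goal is 8 busy hardware threads with maximal cache usage within 32 MB.]

```python
from itertools import product

def best_schedule():
    hw_threads, cache_mb = 8, 32
    base_threads, base_ws = 3, 3 * 4   # Pool2: 3 threads x 4 MB, must run
    best = None
    # n1 from Pool1 (1 MB each, up to 8), n3 from Pool3 (8 MB each, up to 4)
    for n1, n3 in product(range(9), range(5)):
        threads = base_threads + n1 + n3
        ws = base_ws + n1 * 1 + n3 * 8
        if threads == hw_threads and ws <= cache_mb:
            if best is None or ws > best[2]:
                best = (n1, n3, ws)    # prefer the fullest cache
    return best  # (Pool1 count, Pool3 count, total working set in MB)
```

With these pools the search selects all 3 Pool2 threads, 3 Pool1 threads, and 2 Pool3 threads: 8 threads in total, with a combined working set of 31 MB out of the 32 MB cache.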
Parallel System Case Studies
6. (5 mins, 10 points) (Tornado)
(a) (5 points)
Each core in the figure below is 4-way hardware multithreaded. The workload
managed by the OS is multithreaded multiprocesses. You are designing a
thread scheduler as a clustered object. Give the design considerations for
this object. Discuss with concise bullets the representation you will
suggest for such an object from each core (singleton, partitioned, fully
replicated).
Use one representation of the thread scheduler object for each processor
core.
Each representation has its own local queue of threads.
No need for locking the queue for access since each representation is unique
to each core
Each local queue is populated with at most 8 threads since each processor is
4-way hardware multithreaded (this is just a heuristic to balance interactive
threads with compute-bound threads). The threads in a local queue could be a
mix of threads from single-threaded processes and multi-threaded processes.
Each local queue is populated with the threads of the same process (up to a
max of 4) when possible.
If a process has fewer than 16 threads, then the threads are placed in the
local queues of the 4 cores in a single NUMA node.
If a process has more than 16 threads, then the threads are split into
multiple local queues so that the threads of the same process can be
co-scheduled on the different nodes (often referred to as gang scheduling).
Implement entry point in thread scheduler object for peer representations
of the same object to call each other for work stealing.
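[Note: the per-core representation with a work-stealing entry point can be sketched as follows; the names are illustrative.]

```python
from collections import deque

class SchedulerRep:
    """One representation of the clustered thread-scheduler object per core.
    Local accesses need no lock in the real design since each representation
    is private to its core (locking elided in this sketch)."""
    def __init__(self, core_id):
        self.core_id = core_id
        self.run_queue = deque()   # local queue, ~2x the 4 hardware threads

    def enqueue(self, thread):
        self.run_queue.append(thread)

    def steal(self):
        # Entry point peer representations call: hand over half our backlog.
        n = len(self.run_queue) // 2
        return [self.run_queue.pop() for _ in range(n)]

def balance(idle_rep, busy_rep):
    # An idle core's representation pulls work from a busy peer.
    for t in busy_rep.steal():
        idle_rep.enqueue(t)
```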
(b) (5 points)
For the above multiprocessor, we are implementing a DRAM object to physical
memory. Give the design considerations for this object. Discuss with
concise bullets the representation you will suggest for such an object from
each core (singleton, partitioned, fully replicated).
One representation of the DRAM object for each NUMA node (i.e., shared by
all 4 cores of that node). Each representation manages the DRAM at that node
for memory allocation and reclamation. (+3)
Core- and thread-sensitive allocation of physical frames to ward off false
sharing among threads running on different cores of the NUMA node.
(+2)