0% found this document useful (0 votes)

23 views81 pages

L07 Distributed Shared Memory and File Systems

The document discusses the Global Memory System (GMS), which uses idle memory across a local area network as virtual memory. GMS treats the memory of peer nodes as another level in the memory hierarchy, allowing processes to page to remote memories instead of local disks. Each node splits its physical memory into a local part for active processes and a global part available to other nodes. GMS trades network communication for disk I/O by locating needed pages in peer memories during page faults or memory evictions. This integration of cluster memory aims to reduce disk access and improve performance over using just local memory and storage.

Uploaded by

Jiaxu Chen

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views81 pages

L07 Distributed Shared Memory and File Systems

Uploaded by

Jiaxu Chen

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 81

384 - Introduction - Global Memory Systems

Welcome back. We started the discussion of the distributive systems with definitions and principles. We
then saw how object technology, with its innate concepts of inheritance and reuse, helps in structuring
distributive services at different levels.
In this next lesson module, we will discuss the design and implementation of some distributed system
services.
Technological innovation come when one looks beyond the obvious and immediate horizon. Often
this happens in academia, because academicians are not bound by market pressures or compliance to
existing product lines. You'll see this out of the box thinking in the three subsystems that we're going to
discuss as part of this lesson module.
Another important point, often the byproducts of a thought experiment may have more lasting impact
than the original vision behind the thought experiment. History is ripe with many examples. We will not
have the ubiquitous yellow sticky note, but for space exploration. Now close to home, for this course,
Java would not have happened but for the failed video-on-demand trials of the 90s.
In this sense the services we are going to discuss as part of this lesson module, while interesting in
their own right, may themselves be not as important as the technological insights they offer on how to
construct complex distributed services. Such technological insights are the reusable products of most
thought experiments.
There's a common theme in the three distributed subsystems we're going to discuss. Memory is a
critical, precious resource. With advances in network technology leveraging the idle memory of a peer
in a LANmay help overcome shortage of memory locally available in a particular note. The question is
how best to use the cluster memory?
• The physical memory of peers tend to be idle at any particular point of time in a local area network.
Global Memory System, GMS, asks the question, how can we use the peer memory for paging
across the Local Area Network?
• Later on we will see DSM, which stands for Distributed Shared Memory asks a question, if
shared memory makes the life of application programmers simple in a multiprocessor, can we try
to provide the same abstraction in a cluster and make that cluster appear like a shared memory
machine (SMP)?
• A third work that we will see is this Distributed File System DFS, in which we ask the question,
can we use the cluster memory for cooperative caching of files?
Anyhow, back to our first part of this three-part lesson module, namely the Global Memory System.
385 - Context for Global Memory System

Typically, the virtual address space of a process running on a processor is much larger than the physical
memory that is allocated for a particular process. So the role of the virtual memory manager in the
operating system is to give the illusion to the process that all of its virtual address space is contained in
physical memory.
But, in fact, only a portion of the virtual address space is really in the physical memory. This portion
is called the working set of the process. The virtual memory manager supports this illusion for the
process by paging in and out from the disk the pages that are being accessed by a process at any particular
point of time so that the process is happy in terms of having its working set contained in the physical
memory.
Now when a node is connected on a LAN to other peer nodes that are also on the same LAN, it is
conceivable that at any point of time, the memory pressure may be different from the memory pressure
on another nodes. For example this particular node may have a much higher memory pressure because
the work load on it consumes a lot of the physical memory whereas this work station may be idle and its
memory is not being utilized.

This opens up a new line of thought, given that all these nodes are connected on a LAN and some
nodes may be busy while other nodes may be idle. Is it possible to use the cluster memory that's available
for paging in and out the working set of the processes on among different nodes, rather than going to the
local disk? Can we page in and out to the cluster memories that are idle at this point of time?
It turns out that with advances in LAN gear, it's already made possible for gigabit ethernet connectivity
to be available for desktops. And pretty soon, 10 gigabit links are going to be common in connecting
desktops to the LAN. This makes sending a page to a remote memory or fetching a page from a remote
memory faster than sending it to a local disk.
Typically, the local disk access speeds are in the order of MB/s, but on top of that, you have to add
things like seek latency and rotation latency for accessing the data that you want from the disk. So all of
this augers well for saying that perhaps paging in and out through the LAN to peer memories that are
idle may be a good solution to local memory pressure.
So the Global Memory System, or GMS for short, uses cluster memory for paging across the
network.
• In normal memory management, if virtual address to physical address translation fails, the
memory manager knows that it can find this virtual address on the disk and it pages in that
particular page from the disk.
• Now in GMS, if there is a page fault, that is, this translation, virtual address to physical address
fails, which means that the page is not in the physical memory of this node, GMS will ask: is the
content of this page in the cluster memory in one of the nodes of my peers? or could it be on the
disk of a peer node?
So the idea that GMS is sort of integrating the cluster memory into this memory hierarchy. Normally,
when we think about memory hierarchy in a computer system, we say there's a processor, there's the
caches, and there's the main memory, and then there is the virtual memory sitting on the disk.
But now in GMS, we're sort of extending that by saying in addition to the normal memory hierarchy
that exists in any computer system, that is processor, caches and memory, there is also the cluster
memory. And only if it is not in the cluster memory, we will go to the disk in order to get the page that
we're looking for.
GMS trades network communication for disk I/O, and we are doing this only for reading across a
network. GMS does not get in the way of writing to the disk, that is the disk always has copy of all the
pages. The only pages that can be in the cluster memories are pages that have been paged out that are
not dirty. If there are dirty copies of pages, they are to be written onto the disk just like it will happen in
a regular computer system.
GMS does not add any new causes for worrying about failures because even if a node crashes all that
you lose is clean copies of pages belonging to a particular process that may have been in the memory of
this node. Those pages are on the disk. It is just that the remote memories of the cluster is serving as yet
another level in the memory hierarchy.
Normally, in a computer system if a process page faults, it'll go to the disk to get it. But in GMS, if a
process page faults, we know that it is not in physical memory but it could be in one of the peer memories
as well. So GMS is a way of locating the faulting page from one of the peer memories.
Also when the virtual memory manager of GMS decides to evict a page from physical memory to
make room for the current working set of processes running on this node, the VMM will, instead of
swapping it out to the disk, go out on the network and find a peer memory that's idle, and put that
page in the peer memory so that later on if that page is needed, it can fetch it back from the peer and
memory. That's sort of the big picture of how the global memory system works.
386 - GMS Basics

In GMS, when we talk about cache, what we mean is physical memory, the Dynamic Random Access
Memory (DRAM).
There is a sense of community to handle page faults of a particular node.
I mentioned that we are going to use peer memories as a supplement for the disc. In other words, we
can imagine that the physical memory at every node to be split into two parts.
• One part is what we'll call local and local contains the working set of the currently executing
processes at this node. So this is the stuff that this node needs in order to keep all the processes
running on this node happy.
• The global is similar to a disk. This global part is where community service comes in, that is, I'm
saying that out of my total physical memory, this is the part that I'm willing to use a space for
holding pages that are swapped out from my fellow citizens on the local data network.
This split of local and global is dynamic in response to memory pressure. As I mentioned earlier, the
memory pressure is not something that stays constant, right? So over time, depending on what's going
on in a particular node, you may have more need for holding the working set of all your processes, in
which case the local part may keep increasing. On the other hand, if I go off for lunch, my workstation
is not in use and in that case, my local part is going to shrink, and I can house more of my peers swapped
out pages in my global part of the physical memory.
The global part is a spare memory that I'm making available for my peers and local part is the part that
I need for holding the working set of the currently active process at my node, and this boundary keeps
shifting depending on what's going on at my node.
• If all the processes executing in the entire LAN are independent of one another, all the pages are
private.
• On the other hand, you could also be using the cluster for running an application that spans
multiple nodes of the cluster, in which case it is possible that a particular page is shared, and in
that case, that page will be in the local part of multiple peers, because multiple peers are actively
using a page.
So, we have two states for a particular page. It could be private, or it could be shared.
If a page is in the global part of my physical memory, then it is guaranteed to be private. Because the
global part is nothing different from a disk. So when I swap out something, I throw it onto the disk.
Similarly, when I swap out something in GMS, I throw it into my peer memory, i.e. the global cache.
Therefore what is in the global cache is always going to be private (clean) copies of pages, whereas what
is in the local part can be private or can be shared, depending on whether that particular page is being
actively shared by more than one node at a time.
Now one important point, the whole idea of GSM is to serve as a paging facility. In other words, if
you think about a uni-processor running multiple processes and sharing a page, the virtual memory
manager has no concern about the coherence about the pages that are being shared by multiple processes.
That's the same semantic used in GSM. The coherence for shared pages is outside GSM. Instead, it's
an application problem. If there are multiple copies of the pages residing in the local parts of multiple
peers, maintaining the coherence is the concern of the application.Only thing that the GSM is providing
is a service for remote paging. That's important distinction that one has to keep in mind.
In any virtual memory management system, what you do when you have to throw out a page from
physical memory, you run a page replacement algorithm (usually some variant of an LRU, least recently
used algorithm). GSM also does exactly the same thing, except it integrates cluster memory management
of the lowest level across the entire cluster.
So the goal in picking a replacement candidate in GSM is to pick the globally oldest page for
replacement. If, let's say that the memory pressure in the system is such that I have to throw out some
page from the cluster memory onto the disk. The candidate page that I'll choose is the one that is oldest
in the entire cluster. So managing age information is one of the key technical contributions of GMS, how
to manage the age information so that we pick a globally oldest page for replacement in the community
service for handling page faults.
387 - Handling Page Faults Case 1

Before we dive into the nuts and bolts of how GMS is implemented, how it is architected, let's first
understand at the high level what's going on in terms of page fault handling now that we have this
community service underlying GMS.

In this picture, I am showing you two hosts, host P and host Q. The physical memory on both hosts are
divided into the local and the global part.
We already mentioned that these are not fixed in size but the size actually fluctuates depending on the
activity.
A most common case is that I am running a process on P and that page faults on some page X. When that
happens, you have to find out if this page X is in the global cache of some other node in the cluster.
• Let's say that this page happens to be in the global cache of node Q. So the GMS system will
locate this particular page on host Q and the host Q will send the page X to node P over the LAN.
• Clearly, a page fault that means the content was swapped out earlier due to memory pressure, i.e.
the memory on P is already fully utilized. Therefore, if P is going to add X to its current working
set, it needs to find some space for it first.
• In this case P finds new space in the global part. Host P says, well, my working set is increasing
so I have to shrink my community service and reduce the global part by one. P picks ANY global
page, e.g. Y, and sends it to host Q.
• Now host Q will host page Y in Q’s global cache instead.
The key take away for this particular common case is that
• the local allocation of the physical memory on P goes up by one.
• the global allocation, the community service part, goes down by one on host P.
• on host Q, it remains unchanged because all that we have done is we have traded Y for X.
388 - Handling Page Faults Case 2

The second case is similar to the common case.

• The faulted page is still on the global part of Q.

• But the memory pressure on P is exacerbated in this case. There's so much memory pressure that
all of the physical memory that's available at P is being consumed by the working set of
applications running on host P. There's NO community service available on P anymore.
If there is another page fault, for some new page on P, then there is no other option on host P except
to throw out some page from its current working set in order to make room for this missing page.
The action will be similar to the previous case, with the only difference that the candidate that is being
chosen as a victim, i.e. the replacement candidate, is coming from the local part of host P itself.
In this case, P will exchange its LRU local page with the faulted page on Q.
The size of global memory on Q and the local memory on P are unchanged.
389 - Handling Page Faults Case 3

The third case is where the faulting page’s copy only exists on the disk.
In this case, when we have the page fault on node P for page X, we have to go to the disk to fetch it.
The local part of P will go up by one in order to make room for page X.
• If P has global memory (case 1), it will pick any global page for swap out.
• If P has no global memory (case 2), it will pick its LRU page for swap out.
Then where should we send this swap-out page, i.e. Y, to GMS will pick the globally oldest page in
the entire cluster.
• Let's say host R that has the globally oldest page and that page (Z) on host R could be its local
part or the global part.
• So P will send the page (page Y) it chooses for swap-out to host R.
• R will accept page Y, and host it on its global memory.
• As for page Z on host R:
o If Z is in the global cache of R, R can discard it without worrying about it, because anything
in the global cache is clean copy, which already has an identical version on the disk.
o If Z is in R’s local part, it could be dirty. So this dirty/modified copy has to be written out
to the disk.
As a result:
• P-local will +1 and P-global will -1 if P has case 1 situation and will not change if P is in case 2.
• For R, if the oldest page on host R is in the global cache, then there is no change because Z can
be simple discarded to make room for Y.
• If page Z is in the local part of R, it has to be written out to disk to host a global page Y, which
make R-local -1 and R-global +1.
390 - Handling Page Faults Case 4

Next let's consider the case where a page is actively shared.

Two hosts P and Q, and both of them are actively sharing a page X.
Currently some processes on host Q is using that page X so it is in Q’s local memory.
On host P, a process is page faulting on the same page X.
• Because page X is in the current working set of host Q, we want to leave it there and make a copy
of this page into the local part of host P.

• To make room for this copy of X, P has to allocate one more page in for it local part and decrease
its global part has to shrink by one (if it has any).

• Similar to Case 3, P then picks an arbitrary page Y from the global part (or LRU if it does not
have global) and send it to host R, which is hosting the globally oldest page Z.

• At last, R needs to make room for page Y by

o Dropping page Z if Z is in R’s global memory.
o Writing page Z to disk if Z is in R’s local memory.

• The page Y will be hosted in R’s global memory.

391 - Local and Global Boundary Question

392 - Local and Global Boundary Solution

• In all cases except Case 2, the global part of the faulting node P’s cache is empty, the local part
of course is going to go up by one, the global part is going to come down by one.

• Only in Case 2, when the global part is empty already, there's no change to P.

• In Case 1 and 2, there's change to L and G on Q because Q only exchange a page with P and the
new page still stays in Q’s global memory. No change on R because R is not involved.

• In Case 3, the faulted page is on the disk, so Q is not involved. Instead, node R with the globally
oldest page is involved. If the globally oldest page is in R’s local then it will be written to disk,
resulting in L-1 and G+1. If Z is in R’s global part, then it is a clean page that can be discarded
safely and would not cause change to R’s L and G.

• In Case 4, where the faulted page is the local memory of host Q, there's no change in the balance
of L and G because P makes a copy of this page on Q without changing Q. Then P will need to
send an arbitrary page Y to node R with the globally oldest page. Then R will handle the page Y
in the same manner as in Case 3.
393 - Behavior of Algorithm

From the high level description of what happens on page faults and how it is handled, you can see that
the behavior of this GMS global memory management is as follows.
So overtime, you can see that if there is an idle node in the LAN, then that idle node, its working set
is continuously going to decrease as it accommodates more and more of its peers pages that are swapped
out to fit in its global cache.
Eventually a completely idle node, becomes a memory server, for peers on the network.
That's the expected behavior of this logarithm.
The key attribute of the logarithm is the fact that the split between Local and Global is not static,
but is dynamic, depending on what is happening at a particular node.
For instance, even the same node, if it was serving as a memory server for my peers, because I had
gone out to lunch, I come back, and start actively using my workstation. In that case, I'm going to go
back, and the community service part of my machine is going to start decreasing. So the Local Global
split is not static, but it shifts depending on what local memory pressure exist at that node.
394 - Geriatrics

Now that we understand the high level goal of GMS, the most important parameter in the working of
GMS system is age management because we need to identify what is the globally oldest page whenever
want to do replacements.

In any distributed system, one of the goals is to make sure that any management work is not bringing
down the productivity of that node. In other words, you want to make sure that management work is
distributed and not concentrated on any one node.
In the GMS system, since we are working with an active set of nodes, overtime, the node that is active,
may become inactive and a node that was inactive may become and so on.
So you don't want the management of age information to be assigned to any given node.
It is something that should shift over time. That's sort of the fundamental tenant of building any
distributed system, is to make sure that we distribute the management work so that no single node in the
distributed system is overburdened of that. That's a general principle, and we'll see how that principle is
followed in this particular system.
We break the management both along the space axis and time axis, into what is called epochs.
An epoch is in some sense a granularity of management work done by a node. The management work
done by a node is
• either time bound, T a maximum duration,
• or space bound, M maximum number of replacements.
So in the working of the system, if M space replacements have happened, then that epoch is completed
and we start a new epoch and there will be a new manager for age management. Or if the duration T has
passed, the epoch is completed, and so we pick a new manager to manage the next epoch.
T may be on the order of a few seconds and M may be on the order of thousands of replacements. And
the values of T and M vary from epoch to epoch.
At the start of each epoch, every node is going to send the age information summary to the initiator
(the manager node). That is, every node is going to say, here is the age of the pages that is resident at
my node (all the local pages and all the global pages). And remember that a smaller age indicates a more
relevant page while a higher age means an older page. So, N1 sends its set of pages, N2 sends its set of
pages, and so on. Everybody is sending to the manager node (the initiator) the age information.
Using the information from all the nodes,
• The initiator will sort out the M oldest pages in the network and computes a weight w(i) for each
node i, which represents the how the M oldest pages are distributed among all the network nodes.
For instance, if it turns out that N1 has 10% of the candidate pages that are going to be replaced
in the next epoch, then its w(N1) = 0.1.

• The initiator determines the minimum age MinAge, that will be replaced from the cluster in the
new epoch.

• Then the initiator send all the weights and MinAge back to all the nodes in the clusters.

• The initiator also selects the node with the most idle pages (the largest w(i)) to be the initiator for
the next epoch.
Each load is not only receiving its own weight, but it is also getting the fraction of the pages that are
going to be replaced from each of its peer nodes in the entire cluster. And we'll see how this information
is going to be used by each one of these nodes.
And of course, we don't know the future. All that the initiator is doing is that is saying that it is expected
replacement: w1 is the expected share of replacement on N1, w2 is the expected share of replacement on
N2 and so on. When the next epoch starts, that can be different, depending on what happens in these
nodes. But that's the best that we can do, i.e., use the past to predict the future.
That's what the initiates does, using the past, the age information summary from all the notes, to predict
the future in terms of where these M replacements are going to happen and what is the minimum age of
the pages that are going to be replaced in the next epoch.
395 - Geriatrics (cont)

I mentioned that this management function that this initiator is doing, you don't want it to be statically
assigned to one node, because that would be a huge burden. Besides, this node will suddenly become
very active, and if that happens, you must time in the computations to perform the management function.
Now intuitively if you think about it, if a particular node, let's say node N2 has a weight of w(2) = 0.8,
what does that mean? It means that in the upcoming epoch, 80% of the M pages are going to be replaced
from the node N2. In other words, N2 is hosting a whole bunch of senior citizens, so it's a node that's not
very active, right?
So intuitively if we think about it, the guy that has the highest weight is probably the least active.
Why not make him the manager? So in the next epoch, the node with the highest weight will become the
initiator. Now how do you determine that? Well, every one of these nodes is receiving not only the
MinAge of the candidate pages that are going to be replaced in the next epoch, but also the weight
distribution of all other nodes in the cluster. So locally each of the node can look at the weight vector
that it got back from the initiator and say, I know, given that node N3 has the maximum weight, N3 will
be the initiator of the next epoch. So locally you are aware who will be the initiator in the next epoch to
send the age information summary to.
Now let's look at how MinAge is used by each node. Remember that MinAge represents the minimum
age of the M oldest pages. The smaller the age, the more recent that page is. So if you look at this as the
age distribution of all the pages and draw a line here, then the pages to the left of this line, the minimum
age line, are the active pages and the pages to the right of this line is the M oldest pages that will be the
candidate pages that are going to be replaced in the next epoch. This is the information that we're going
to use when we manage page replacements at every node.
So let's see what the action is at a node when there is a page fault.
Upon a page fault I have to evict a page Y, then I will look at the age of this page Y. If page Y is older
than MinAge, then I know it will be replaced. Even if I sent it to a peer node, it will be replaced because
that is part of the candidate M pages that will be replaced in the next epoch. Therefore locally I will make
the decision that it is safe to discard page Y because it is older than MinAge.
Remember that, in the description of how a page fault is handled, I said that, when you have a page
fault in the node, I pick one of the replacement candidates, and I send it to a peer node to store it in the
global cache of the peer node. Well, in this case, if this candidate page is very old (age > MinAge), we
might as well just discard it right now.
On the other hand, if the page happens to be less than the MinAge, then you know that it is part of the
active set of pages for the entire cluster and you cannot throw it away. So we send it to peer node N(i).
How do you pick N(i)? This is where the weight distribution comes in. I received the the weight of all
other nodes from the initiator, so I can pick a node to send this page to, based on the weight distribution
information.
Of course I could say, well, send it to the guy that has the highest weight because that's the guy that is
likely to replace, but remember that this only an estimate of what is going to happen in the future. So,
you don't want to hard code that. Instead, we're going to use some information drawn from these weights.
We're going to factor that weight information into making a choice as to which peer we want to send this
eviction page to. Chances are that the node with a higher weight is likely a candidate that'll pick. But it
will not always be the node with the highest weight, because if that decision is made by everybody then
if the prediction was not accurate enough, we would be making wrong choices.
Basically you can see that this age management, geriatric management, is approximating a global
LRU, not exactly a global LRU. Because true global LRU computation on every page fault is too
expensive, so we don't want to do that. Instead, we are doing an approximation to global LRU by
computing this age information at the beginning of an epoch, and using that information locally in order
to make decisions.
• So we think globally in order to get all the age information, and compute the minimum age and
compute the weight for all the nodes as to how much of the fraction of the replacements are going
to come from each one of these nodes.

• But once that computation has been done, for the duration of the epoch, all the decisions that are
taken at a particular node Is local decisions in terms of, what we want to do, with respect to a page
that we are choosing as an eviction candidate. Do we discard it? Do we store it in a peer’s global
cache?
This might sound like a political slogan but the important thing to use local information in decision
making as much as possible. So think globally but act locally. So that's the key and that's what is ingrained
in this description of the algorithm.
396 - Implementation in Unix

So from algorithm description, we will talk about implementation now. This is where rubber meets the
road in systems research. In any good systems research, the prescription is as follows.
• You identify a pain point.
• Once you identify the pain point, you think of what may be a clever solution to that.
• Then a lot of heavy lifting actually happens in implementing that solution, which may be a very
simple solution. But implementing that is the hard part.
If you take this particular example that we're looking at, the solution idea is fairly simple. The idea is
that instead of using the disk as a paging device, use the cluster memory. But implementing that idea
requires quite a bit of heavy lifting.
One might say that in systems research these technical details of taking an idea and working out the
technical details of implementing that idea is probably the most important nugget. Even if the idea itself
is not enduring, the implementation tricks and techniques that are invented in order to do their
implementation of their idea maybe reusable knowledge for other system research.
So that's a key takeaway in any system research and it is true for this one as well.

Now in this particular case, the authors of GMS used DEC (Digital Equipment Corporations) operating
system as the base system. The operating system is called OSF1 operating system.
There are two key components in the OSF1 memory system.
The first one is the virtual memory system, and this is the one that is responsible for mapping process
virtual address space to physical memory, and worrying about page faults that happen when a process is
trying to access the stack and heap and so on.
It can bring those missing pages perhaps from the disk. and these pages are sometimes referred to as
anonymous pages. A page is housed in a physical page frame and when a page is replaced, that same
physical page frame may host some other virtual page and so on. So the virtual memory system is devoted
to managing the page faults that occur for process virtual address space, in particular the stack and the
heap.
The unified buffer cache is the cache that is used by the file system. Remember that the file system
is also getting stuff from the disk but the other system wants to cache it in physical memories so that it
is faster.
• The unified buffer cache is serving as the extraction for the disk resident files when it gets into
physical memory and user processes perform explicit file access. When user processes do that,
they're actually accessing unified buffer cache. So reads and writes of files go to this unified
buffer cache.
• In addition to that, Unix systems offer the ability to map a file into memory, which is called
memory mapped files. If you have a memory mapped file, you can also have page faults to a file
that has been mapped into memory.
So the unified buffer cache is responsible for handling page faults to map files, as well as explicit read
and write calls that an application process may make to the file system.
Normally this is the picture of how the memory system hangs together in any typical implementation
of an operating system.
• You have the virtual memory manager, you have the unified buffer cache, you have the disk.
• Reads and writes from the virtual memory manager go to the disk, and similarly reads and writes
from the unified buffer cache, go to the disk.
• When a physical page frame is freed, you throw it into the free list and it becomes available for
future use by either the virtual memory manager or the unified buffer cache.
• The page-out daemon’s role is to every once in a while look at what pages can be swapped out to
the disk so that you make room for page faults to be handled without necessarily running an
algorithm to free up pages.
So that's the structure of OSF/1 memory management system.
What the authors of GMS did was to integrate GMS into the operating system and this is where the
heavy lifting needs to be done, because you are modifying the operating system to integrate an idea that
you had for global memory system.
397 - Implementation in Unix (cont)

This calls for modifying both the virtual memory part as well as the UBC part of the original system, to
handle the needs of the global memory system.
Particularly the VMM and UBC, when they need a missing page, it used to go to the disk. But here
what we make them go to the GMS because GMS knows whether a particular page that is missing in the
address space of a process, is present in some remote GMS. Similarly, if a page is missing from the file
cache, GMS knows whether that page is in the remote GMS.
So that's why we modify the VMM and UBC manager to get their calls for getting and putting pages.
getpage() is called when there is a page fault and I need to get that page. I would go to the disc normally,
but I now, I go to GMS. Similarly if a page is missing in the UBC, I ask the GMS to get that page for
me. GMS is integrated into the system so that it figures out where to get the missing page from.
Writes to disk is unchanged. Originally, whenever the VM manager decides to write to the disk, it's
because a page is dirty. That part we don't want to mess with, because we are not affecting the reliability
of the system in any way.

Remember that the most important thing that we do to approximate global LRU is collecting age
information. That's pretty tricky.
For the file cache is pretty straightforward, because it is explicit from the process where it is making
read/write calls, we can insert code as part of integrating GMS into this UBC, to see what pages are being
accessed based on the read/write that are happening to the UBC. We can intercept those calls as part of
integrating DMS to collect age information for the pages that are housed in the unified buffer cache.
VMM on the other hand, is very, very complicated. Because memory access from a process is
happening in hardware on the CPU. The operating system does not see the individual memory access
that a user program is making, so how does GMS collect aid information for the anonymous pages?
The file access is not anonymous because we know exactly what pages are being reached in by a
particular read or write by a file system. But the reads and writes that a user process is making to its
virtual address space are the pages that are anonymous as far as the operating system is concerned.
So, how does the operating system collect the age information? In order to do that, the authors of GMS
introduced a daemon part of the memory manager in the OSF/1 implementation.
The daemon's job is to dump information from the TLB. If you recall, TLB, the translation lookaside
buffer, contains the recent translations that have been done on a particular node. So the TLB contains
information about the recently accessed pages on a particular processor.
Periodically, say every minute, it dumps the contents of the TLB into a data structure that the GMS
maintains so that it can use that information in deriving the age information for all the anonymous pages
that are being handled by the VMM.
That is how GMS collects the age information at every node and this is where I meant, the technical
details are very important. Because this idea of how to gather age information is something that might
be useful if you're implementing a completely different facility.
• Similarly, when it comes to the page-out daemon, normally when it is throwing out pages: if they
are dirty pages, it will write it on to the disk; if they are clean pages, it can simply discard it.

• No when it has a clean page, the page-out daemon will hand it to GMS because GMS will perhaps
put it into cluster memory so that it is available later on for retrieval rather than going to the disk.
So this is how GMS is integrated into this whole picture,
• interacting with the Unified Buffer Cache

• interacting with the VM Manager

• interacting with the Page-out Daemon and the Free List Maintenance as well in terms of allocating
and freeing frames on every node.
398 - Data Structures - UID, PFD

Some more heavy lifting. Let's talk about the distributed data structures that GMS has in order to do this
virtual memory management across the entire cluster.
First GMS has to convert a virtual address from a single process to a global identifier, or what we'll
call as a universal ID. The way we derive the UID from the virtual address is fairly straightforward. If
you think about it, we know which node this virtual address emanated from, we know the IP address,
we know which disk partition contains a copy of the page that corresponds to the virtual address, we
know what are the i-node data structure that corresponds to this page and the offset. So if you put all of
these entities together, you get a universal ID that uniquely identifies a virtual address. Then adding the
offset to that we have a complete virtual address. We can derive UID from the virtual memory system as
well as the UBC.
Then there are three key data structures: PFD, GCD and POD. These three data structures are the
workhorses that make GMS possible.
PFD (Page Frame Directory) is a local per-node data structure. PFD has a mapping between a UID
(your virtual address has been converted to a UID) and a page frame that backs that particular UID.
Because we're doing cluster-wide memory management, the page itself can be in one of four different
states.
• It could be a private page in the local memory or a shared page in the local memory.

• If it is in the node’s global memory, we know by definition, global part is only hosting clean pages.
So the page in the global cache is guaranteed to be a private.

• The last possibility is that it's not in the physical memory of a node, but it is on the disk.
399 - Data Structures - GCD, POD

Now given a UID I know that some PFD in the entire cluster has the physical frame number that
corresponds to it if it happens to be on that node or it's on the disk. If I have a page fault, I convert the
fault VA → UID, then which PFD should I look up (PFD is a local per-node structure)?
I could broadcast to everybody “do you have it”? But that will be very efficient. You don't want to do
that. Or we can create some mapping between a UID and the designated PFD for page number look up.
But we don't want to do a static binding of UID. Because if we make a static mapping, then it pushes the
burden on one node if some page has become really hot and everybody wants to use that page.
So what we want to do is distribute this management function as well, just like the age management.
• We want to distribute the management of UID → PFD mapping by using a data structure called
Global Cache Directory, GCD.
• GCD is a hash table, it's a cluster-wide data structure that manages the UID → PFD mapping, i.e.,
it knows the IP address of the node that has the PFD needed for a given UID.
• Each node only stores a partition of the GCD for performance reasons.
Then we need another data structure that tells us, given a UID, which node has the GCD that corresponds
to this UID and that is the Page Ownership Directory, POD.
• Given an UID, POD knows which node has the GCD that corresponds to this UID.
• POD is replicated on all the nodes. It's an identical replica that is on all the nodes.
So, when I have a page fault,
• For a given VA → UID, the POD will tell me which node holds the corresponding GCD.
• Then the corresponding GCD will tell me which nodes holds the PFD for this UID.
• The corresponding PFD will finally translate the UID to the Physical Page Number I need.
Remember that we could have simply gone from POD to PFD, but that would be a static mapping. This
one level of indirection through a partitioned/distributed GCD allows us to move the responsibility of
hosting a particular PFD to different nodes.
I said that this POD is a replicated data structure. Can it change? Well, it can change over time.
The UID space is something that spans the entire cluster. If the LAN never revolves (the set of nodes
on the LAN is fixed), then POD also remains the same. But if new nodes are added or deleted, POD will
have to change. If that happens, we have to replicate & distribute all POD again. We have ways to handle
it but this is not something that happens too often. Therefore this POD rarely changes.
So the path for page fall handling is if you have a page fault,
• convert VA to UID and the look up this UID in your local POD, find out which node has the GCD.
• then ask the GCD node which node has the PFD,
• then go to the PFD node to retrieve the missing page
• finally we will find the missing page in this PFD and the states of the missing page (four possible
state)
400 - Putting the Data Structures to Work - Page Fault Hit

So now that we understand the data structures, let's put it to work. Let's first look at what happens on a
page fault.
• Say Node A has a page fault, first coverts the virtual address VA → UID.
• Then Node A checks its local POD, to find out the ownership of this UID, i.e. which node has the
GCD for this UID, e.g. Node B.
• Then Node A sends this UID over to Node B, and Node B looks up its GCD data structure and
says, oh, the PFD that can translate this UID is on Node C. which actually represents this virtual
address is actually contained in this particular node, node C.
• Then Node B sends the UID over to Node C. Node C has the PFD that has the mapping between
the UID and the Page Frame Number in Node C’s memory.
• If it is a hit, Node C retrieves the page and sends it over to node A.
• Node A is happy. It can then map this virtual address into its internal structure and the page fault
service is complete and it can resume the process that was blocked for this page fault.
So potentially, when a page fault happens I can have three network communication: A → B, B → C
and C → A. The last one, C → A is equivalent to (or even faster than) paging from the disk so I can
tolerate that. And it only happens on page fault so this network communication C → A is okay. But this
A → B is an extra network communication. Fortunately, the common page fault case happens when
servicing the request of a local process on node A. So the page is likely to be a non-shared page. A non-
shared page is likely to have the corresponding GCD on this node itself, i.e. A and B are the same node
so no network communication is needed consult the GCD. So the page fault service for the common case
could be pretty quick if we only need A → C and C → A.
The important message is that even though these distributed data structures may make it look like
you're going to have a lot of network communication, it happens only when there is a page fault. Since
the common page fault case happens on non-shared pages, the POD and GCD are probably co-resident
on the same node. So we only need LAN communication with the node that contains the PFD and make
the page fault service pretty quick.
401 - Putting the Data Structures to Work - Page Fault Miss

Now the interesting question is, when I go to Node C, is it possible that Node C does not have page in
its memory or disk? Yes, it is possible in two cases.
One case is,
• While Node B was sending this request over, Node C has made a decision to evict that page that
corresponds to this UID because Node C might need to make space for itself.
• So this UID might have been thrown away from Node C’s PFD.
• Node C will inform the owner of this UID (Node B) that he has evicted this page and some other
node (e.g. Node D) is hosting this page now.
• But in a distributed system, things happen asynchronously so Node B might have not received
this update yet. So Node B still think Node C has the page and we have miss.
Second possibility is the uncommon case: Node A’s POD information might be stale.
• When can that happen? When the POD is being recomputed for the LAN as a whole, due to
additions or deletions of nodes. The new POD has not been replicated on each node yet.
• In that case, it is possible that the information that I started with was incorrect. For example, Node
B might not even has the GCD anymore.
So if the PFD owner replaced that page, or my POD information misled me, I'll have a miss.
• I can look up my POD again. and hopefully the POD may have been updated.
• Or I consult the same GCD and the GCD owner now has the updated PFD owner information.
But the important point is that:
• The common case is when both the POD and GCD are coresident on the same node so no extra
POD → GCD communication and no miss on the PFD node.
• Meanwhile, PFD miss is an uncommon case.
402 - Putting the Data Structures to work - Page Eviction

The next thing I want to talk about is Page Evictions. When a page fault happens,
• Of course the key thing is to make sure that we service the page fault so that we get the page and
restart the execution. This is the important thing.

• Meanwhile, we need to make room for the missing page, i.e. some page needs to be evicted.

• However, we don't want to do page eviction it on every page fault. Instead, we want to do it in an
aggregated manner.
For that reason, every node has a paging daemon. The paging daemon in the VMM is integrated with
the GMS system.
• Usually a typical VMM should have a stash of free page frames ready, so that when a page fault
happens it does not need to run around trying to find free page free.

• Normally when the FreeList falls below a threshold, the paging daemon would take a bunch of
pages, LRU candidate pages, and put it all onto the disc.

• But with the integration with GMS, the modified paging demon will do PutPage() on the oldest
pages on this node. When we do the PutPage(), we will choose the candidate node based on the
weight information from the geriatric management.

• Once Node A PutPage(UID) to Node C, Node C will have this page with an updated PFD.

• Node A also has to Update Node B (GCD) that the Node C is hosting this particular UID now.
These actions are not done on every page eviction, but are performed by the paging demon in an
aggregated & coordinated manner when the free list falls below a threshold.
403 - Conclusion

So we've covered a lot of ground so far:

• The principal behind the thought experiment is using network memory as a paging device rather
than disc because the networks have gotten faster.

• We came up with an algorithm for age management globally in the entire cluster, and how to have
that age management done in a manner that doesn't burden any one node but it picks the node that
is lightest in terms of load at any point of time.

• We also saw given the solution for cluster-wide memory management for paging, how to go about
implementing it and all of the heavy lifting that needs to be done in order to take an idea and put
it into practice.

You'll see that in every systems research paper, there is heavy lifting to be done to take a concept to
implementation. Working out the details and ensuring that all the corner cases are covered is a non-trivial
intellectual exercise in building distributed subsystems.
What is enduring in a research exercise like this one, the Global Memory System?
The concept of paging across the network is an interesting thought experiment, but it may not be
feasible exactly for the environment in which the authors carried out this research, namely workstation
clusters connected by a LAN.
• Each workstation in that setting is a private resource owned by an individual who may or may not
want to share his resources, memory in particular.
• On the other hand, data centers today are powered by large scale clusters on the order of thousands
of processes connected by a LAN. No node is individually owned in that setting. Could this idea
of paging across the LAN be feasible? Perhaps.
Even beyond the thought experiment itself what is perhaps more enduring are the techniques,
distributed data structures and the algorithms, for taking the concept of implementation.
In the next part of this lesson module, we will see another thought experiment to put cluster memory
to use.
404 - Introduction - Distributed Shared Memory

In an earlier part of this lesson module, we saw how to build a subsystem that takes advantage of idle
memory in peer nodes on a LAN, namely, using remote memory as a paging device instead of the local
disk.
The intuition behind that idea was the fact that networks have gotten faster. Therefore access to remote
memory may be faster than access to an electromechanical local disk.
Continuing with this lesson, we will look at another way to exploit remote memories, namely, software
implementation of distributed shared memory, i.e. create an operating system abstraction that provides
an illusion of shared memory to the applications, even though the nodes in the LAN and do not physically
share memory.
So distributed shared memory asks the question, if shared memory makes life simple for application
development in a multiprocessor, can we try to provide that same abstraction in a distributed system, and
make the cluster look like a shared memory machine?
405 - Cluster as a Parallel Machine (Implicit Parallel Program)

Now suppose the starting point is a sequential program. How can we exploit the cluster with multiple
processors?
One possibility is to do what is called automatic parallelization. That is, instead of writing an explicit
parallel program, we write a sequential program and let somebody else do the heavy lifting in terms of
identifying opportunities for parallelism in the program and map it to the underlying cluster. This is what
is called an implicitly parallel program. There are opportunities for parallelism, but the program itself is
not written as a parallel program.
Now it is the onus of the tool, in this case an automatic parallelizing compiler, to look at the
sequential program and identify opportunities for parallelism and exploit that by using the resources that
are available in the cluster.
High-performance FORTRAN is an example of a programming language that does automatic
parallelization, but it is user-assisted parallelization in the sense that the user who is writing the sequential
program is using directives for distribution of data and computation. Those directives are then used by
this parallelizing compiler to map these computations onto the resources/nodes in a cluster.
This kind of automatic parallelization, or implicitly parallel programming, works really well for
certain classes of program called data parallel programs. In such programs, for the most part, the data
accesses are fairly static, and it is determinable at compile time. So in other words, there is limited
potential for exploiting the available parallelism in the cluster if we resort to implicit parallel
programming.
406 - Cluster as a Parallel Machine (Message Passing Parallel Program)

To exploit parallelism, the application programmer can also think about his application and write the
program as an explicit parallel program. There are two styles of writing explicit parallel programs and
two corresponding system.
One is message passing style parallel program. The run time system will provide a message passing
library which has primitives for an application thread to do send() and recv() between processes running
on peer nodes. This message passing style parallel program is true to the physical nature of the cluster.
The physical nature of the cluster is the fact that every processor has its private memory that is not
shared between all the processors. So the only way a processor can communicate with another processor
is by sending a message through the network. One processor cannot directly reach into the memory of
another processor because that is not the way a cluster is architected.
Message passing library examples include MPI (Message Passing Interface), PVM, CLF from digital
equipment corporations. To this day, many scientific applications running on large scale clusters in
national labs like Lawrence Livermore, and Argonne National Labs, using MPI as the message passing
fabric.
The only downside to the message-passing style programming is that it is difficult to program using this
style.
• The transitions path from writing sequential program to an explicit parallel program is easier if
there is this notion of shared memory, because it is natural to think of shared data structures among
different threads of an application. (e.g. using pthread on SMP is pretty straight forward)

• On the other hand, if the programmer has to think in terms of coordinating the activities on
different processors by explicitly sending and receiving messages from, that is calling for a fairly
radical change of thinking in terms of how to structure a program.
407 - Cluster as a Parallel Machine (DSM)

This was the motivation for coming up with an abstraction of Distributed Shared Memory (DSM) in
a cluster: to make writing parallel programs on cluster easier.
The idea is that to give an illusion to the application programmer that all of the memory in the entire
cluster is shared.
This would make it easier to transit from parallel programming on an SMP to a parallel programming
on a cluster, without worrying about message passing. They can use the same idea of shared memory
and sharing pointers across the entire cluster, and so on.
Also, since we are providing a shared memory semantic in the DSM library for the application
program, there is no need for marshalling and unmarshalling arguments that are being passed from one
processor to another and so on. All of that is being handled by the fact that there is shared memory. So
when you make a procedure call, and that procedure call is touching some portion of memory that
happens to be on a remote memory. That memory is going to magically become available to the thread
that is making the procedure call.
In other words, the DSM abstraction gives the same level of comfort to a programmer who is used to
programming on a true shared memory machine when they move to cluster. They can use the same set
of primitives (like locks and barriers) for synchronization, and pthread-style of creating threads, that will
run on different nodes of the cluster.
That's the advantage of DSM style of writing an explicit parallel program.
408 - History of Shared Memory Systems

Now, having introduced distributed shared memory, I want to give sort of a birds eye view of the history
of shared memory systems over the last 20 plus years.
On the software DSM side, software DSM was very first thought of in the mid-80s. The Ivy system
built at Yale University by Kai Li, the Clouds Operating System built at Georgia Tech and there were
similar systems built at UPenn. This, I would say, is the beginning of Software DSM. Later on, in early
90s, systems like Munin and TreadMarks were built. I would call them perhaps a second generation DSM.
In the late half of 90s, there were systems like Blizzard, Shasta, Cashmere and Beehive that took some
of the ideas from the early 90s even further.
In parallel with the software DSM, there was a completely different track that was being pursued
which is providing structured objects in a cluster for programming. Systems such as Linda and Orca
were developed in early 90s. Stampede at Georgia Tech was done in concert with the Digital Equipment
Corporation in mid 90s and later on continued into Stampede RT and PTS. (In a later lesson, we'll talk
about Persistent Temporal Streams) This particular axis of structured DSM development is attractive
because it gives a higher level abstraction than just memory to computations that needs to be built on a
cluster.
Early hardware SMP such as BBN Butterfly and Sequent Symmetry appeared in mid 80s and, the
MCS synchronization paper used BBN Butterfly and Sequent Symmetry as the experimental platform.
KSR-1 was another shared memory machine that was built in the early 90s. Alewife was a research
prototype built at MIT. DASH was built at Stanford. Both of them looked at how to scale up beyond an
SMP, and build a truly DSM machine. Commercial versions of that started appearing. SGI silicon
graphics built SGI Origin 0 as a scalable version DSM machine. SGI Altix later on took it even further,
thousands of processors exist in SGI Altix as a large-scale DSM machine. IBM Bluegene is another
example. Today, if you look at what is going on in the space of high performance computing, the clusters
of SMPs have become the work horses in data centers.
409 - Shared Memory Programming

I've already introduced you to shared memory synchronization.

• Lock, particularly the mutual exclusion lock is a primitive that is used ubiquitously in writing
shared memory parallel programs to protect data structure so that one thread can exclusively
modify the data and release the lock so that another thread can inspect the data later on and so on.
• Similarly, barrier is another synchronization primitive that is very popular in scientific programs.
We have covered both of these in fairly great detail in terms of efficient implementation of locks as well
as barriers.
Now the upshot is, if you are writing a shared memory program, there are two types of memory accesses
that are going to happen.
• One is the normal reads and writes to shared data that is being manipulated by a particular thread.
• The second kind is memory access to the synchronization variables that are used in implementing
locks and barriers by the operating system itself. It may be the operating system, or it could be a
user level threads library that is providing these mutual exclusion locks or barrier primitives, but
in implementing those synchronization primitives, those algorithms are going to use reads and
writes to shared memory.
So there are two types of shared memory accesses going on in the execution of a parallel program. One
is access to normal shared data and the other is access to synchronization variables.
410 - Memory Consistency and Cache Coherence

Recall that in one of our earlier lectures, we discussed memory consistency model and the relationship
between memory consistency model and cache coherence, in the context of shared memory systems.
• Memory consistency model is a contract between the application programmer and the system. It
answers the “when” question: when a shared memory location is modified by one processor, how
soon that change is going to be made visible to other processes that have the same memory
location in their respective private caches?

• Cache coherence, on the other hand, is answering the “how” question: how is the system (by
system we mean the system software plus the hardware working together) implementing the
contract of the memory consistency model?
In other words, the guarantee that has been made by the memory consistency model, to the application
programmer has to be fulfilled by the cache coherence mechanism.
So coming back to writing a parallel program, when accesses are made to the shared memory, the
underlying coherence mechanism has to ensure that all the processes see the changes that are being made
to shared memory, commensurate with the memory consistency model.
411 - Sequential Consistency

I want you to recall one particular memory consistency model that I've discussed with you before, that is
sequential consistency.
• In sequential consistency, the idea is very simple. Every process is making some memory accesses.
From the perspective of the programmer P1, the expectation is that these memory accesses are
happening in the textual order that you see here.
• Similarly, if you see the set of memory accesses that are happening on P2. Once again, the
expectation is that the order in which these memory accesses are happening are the textual order.
Now, the question is, what happens to the accesses that are happening on one processor with respect
to access on another processor if they are accessing the same memory location?
For instance, P1 is reading memory location A and P2 is writing to memory location A. What is the
order relationship between this P1-read and this P2-write? This is where sequential consistency model
says about interleaving of memory accesses between multiple processors comes into play.
• From each processor’s perspective, we still observe the memory access in textual program order.
• But the interleaving of the memory accesses from the different processors is arbitrary.
In other words, the sequential memory consistency model builds on the atomicity for individual read-
write operations and the textual program order has to be preserved. But the interleaving of the memory
accesses on different processors can be arbitrary and the programmer should be aware of this.
I also gave you the analogy of a card shark to illustrate what is going on with a sequential consistency
model. The card shark is taking two splits of a card deck and doing a perfect merge shuffle of the two
splits. That's exactly what's going on with sequential consistency.
If you can think of these memory accesses on an individual processor as the card split but instead of
a two-way split you have an n-way split, and we are doing a merge shuffle of all the n-ways splits of the
memory accesses to get the sequentially consistent memory model.
412 - Sequential Consistency Memory Model

With the sequentially consistent memory model, let's come back to a parallel program.
A parallel program is making read write accesses to shared memory, some of them offer data, and
some of them offer synchronization.
As far as the sequentially consistent memory model is concerned, it does not distinguish between data
accesses and synchronization accesses. It has no idea about that.
It only looks at the read/write accesses coming from an individual processor and honoring them in the
order in which it appears and making sure that they can merged across all these processors to preserve
the Sequential Consistency guarantee.
So the upshot is that there's going to be coherence action on every read/write access that the model
sees.
• If this guy writes to a memory location, then the SC memory model has to ensure that this write
is inserted into this global order somewhere.
• In order to insert that in the global order somewhere, it has to perform the coherence action with
respect to all other processors.
That's the upshot of not distinguishing between normal data accesses and synchronization accesses
that is inherent in the SC memory model.
413 - Typical Parallel Program

Now, let's see what happens in a typical parallel program. In a parallel program, if you want to do a
write() operation, you probably get a lock and you have mentally an association between that lock and
the data governed by that lock.
• For example, if I wanted to read or write shared variables A and B, I'll get a lock first. Then I will
access/modify these variables governed by this lock (critical section). Once I'm done with these
shared variables, I'll unlock.
• If another processor, say P2 gets the same lock. It can only get the lock after P1 releases it because
the lock is mutually exclusive. Therefore, only one person can have the lock at a time.
Consequently, if you look at the structure of this critical section for P2, it gets a lock, enter the
critical section.
• By design, we know that either P1 or P2 can be messing with the shared data at any point of time.
That's a guarantee that comes from the fact that we have a lock associated with these data. In other
words, P2 will not access any of the data inside the critical section until P1 releases the lock. We
know this because we designed this program.
But the SC memory model does not know about the association between the shared data and the lock.
It doesn't even know that memory accesses to the lock primitive is different from memory accesses
normal data structures. So the cache coherence mechanism that is provided by the system for
implementing the memory consistency model is going to be doing more work than it needs to do, because
it has to take actions on every one of these memory accesses (lock primitive and normal data), even
though the coherence actions are not warranted for other processes until I release the lock. So there's
going to be more overhead for maintaining the coherence commensurate with the SC memory model,
which means it's going to lead to a poorer scalability of the shared memory system.
In this particular example, since P2 is not going to access any of these data structures until P1 has
released the lock, there's no need for coherence action for A and B until the lock is actually released.
414 - Release Consistency

This is the motivation for the release consistency memory model. In a program that consists of several
different parallel threads. If P1 wants to mess with some shared data structures, it is going to acquire a
lock L by a1(). Then P1 can enter the critical section. Once the critical section is finished, it releases the
lock by r1().
If the same lock is used by some other process at P2, and the critical section of P1 preceded the critical
section of P2. In other words, if P1-R1 happens before P2-A2, all we have to ensure is that all the
coherence actions prior to P1 releasing the lock are completed, before we allow P2 to acquire the lock.
That's the idea of release consistency.
Other synchronization operations can also be mapped to acquire()& release(). If you think about
barrier, arriving at a barrier is equivalent to an acquire(), and leaving the barrier is equivalent to a release().
So before leaving the barrier, we have to make sure that any changes made to shared data structures is
reflected to all the other processes via the cache coherence mechanism. Then we can leave the barrier,
i.e. release().
• So if I do a shared memory access within this group of sections, and that shared memory access
would normally result in some coherence actions on the interconnect reaching to the other
processes and so on. If we use the sequential consistency memory model, you will block P1 and
wait until a cache coherence is completed at every step within each step of P1’s critical section.
• But if we use the release consistent memory model, we do not have to block P1 in order for
coherence actions to be complete. P1 can continue on with its computation within the critical
section and we only have to block a processor at a release point to make sure that any coherent
actions that may have been initiated up until this point, are all complete before we perform this
release operation.
The release consistent memory model allows exploitation of computation on P1, with communication
that may be happening through the coherence mechanism for completing the coherence actions
corresponding to the memory accesses that you're making inside the critical section.
415 - RC Memory Model

So now we come back to our original parallel program.

• There are different threads running on all these processors. They're all making these normal data
accesses and synchronization accesses.

• If the underlying memory model is an RC memory model, it distinguishes between normal data
accesses and synchronization accesses.

• It knows that normal read/write data accesses doesn't have to block the processes. It may start
initiating coherence actions corresponding to these data accesses, but it won't block the processor
for coherence actions to be complete until it encounters a synchronization operation, which is of
the release() operations.

• If a release() synchronization operation hits this RC memory model, it's going to say, all the data
accesses that I've seen from this guy need to be reflected to other processors globally. It's going
to ensure that before allowing the synchronization operation to complete.

• So the coherence action is performed only when the lock is released.

416 - Distributed Shared Memory Example

So let's understand how the RC memory model works with a concrete example.
Let's say the programmer's intent is that one thread P1 will modify a structure A and another thread P2
will wait for the modification and then use the structure A.
• These threads are running on different processors, and we don't know who may execute their code
first. Let's say that P2 executes the code that corresponds to this semantic. That is, it wants to wait
for the modification. In order to do that, it has a flag and this flag has a semantic of 0 =
modification is not done and 1 = modification is actually done. P2 also uses a mutex L and
condition variable C so that it does not constantly check the predicate.
• Meanwhile P1 first modifies the data structure A. Once it is done, it informs P2 by acquiring the
mutex → change flag = 1 → signal the condition variable → unlock the mutex.
• Once the lock has been released, that lock will be acquired implicitly by the operating system on
behalf of P2, because that is a semantic of this condition wait here.
• When P2 wakes up, it will check the flag again to ensure that the modification is completed. Then
P2 can proceed & exit the critical section and then use this modified data structure A.
So that's the semantic that I wanted. The important thing is, if you have an RC memory model, then
all the modifications that P1 is doing can happen in parallel with the waiting that P2 is going. No need
to block P1 on every one of these modifications. The only point at which I have to make sure that these
modifications have been made globally visible is when P1 hits the unlock point in my code.
So, in other words, this code fragment is giving you pictorially the opportunity for exploiting
computation in parallel with communication. If the model was an SC memory model, then for every
read-write accesses on structure A, there would have been coherence actions on every access before you
can do the next one. But with the RC memory model, coherence actions inherent in those modifications
may be going on in the background and you can continue with your computation until you hit this unlock
point. At this point, the memory model will ensure that all the coherence actions are complete before
releasing the lock.
417 - Advantage of RC over SC

So to summarize the advantage of RC over SC, is that there is no waiting for coherence actions on
every memory access, so you can overlap computation with communication.
The expectation is that you will get better performance in a shared memory machine if you use the
RC memory model, compared to an SC memory model.
418 - Lazy RC

Now I'm going to introduce you to a lazy version of the RC memory model, LRC.
In this parallel program a thread acquires a lock, does the data accessing, releases the lock. Another
thread may acquire the same lock. If the critical section of P1 precedes the critical section of P2, then the
RC memory model requires that, at the point of lock release, all the modifications made by P1should be
communicated to all peer processes including P2. This is what is called Eager release consistency,
meaning that at the point of release, the whole system is cache coherent.
Now let's say that the timeline looks like this. Say P1's release happened at this point and P2's acquire
of the same lock happened at a much later point in time. That's the luck of the draw in terms of how
the computation went. There is this time window between P1's release of the lock and P2's acquisition of
the same lock. Now there's an opportunity for procrastination.
Procrastination often helps in system design. We've seen this in mutual exclusion locks where adding
delay between successive attempts of acquiring the lock results in better performance. We saw that in
processes scheduling too, instead of eagerly scheduling the next available task maybe you want to wait
for a task that has more affinity to your processor which results in performance advantage.
Here again there is an opportunity for procrastination and Lazy RC is another instance. The idea is
that, rather than performing all the coherence actions at the point of release, and wait till the acquire
actually happens. Only at the point of acquire, do we take all the coherence actions.
The key point is that you're deferring the point at which you ensure that all the coherence actions are
complete to the point of acquisition.
So in other words, you get an opportunity to overlap computation with communication once again in
this window of time between release of a lock, and the acquisition of the same lock.
419 - Eager vs Lazy RC

Let's see the pros and cons of Lazy RC v.s. Eager RC. Here are timelines of actions on three different
processes, P1, P2, and P3.
In the Eager RC, when P1 has completed its critical section and does the release operation, at the
release point all the changes that we made to variable X in the critical section should be updated to all
the processors, P2 and P3. It could be an invalidation-CC or update-CC. What we are saying is we are
communicating the coherence action to all the other processors. That's what these arrows are indicating.
Now P2 acquires the lock and enters its own critical section modifying the same variable X. When P2
releases the lock we broadcast the changes to all processes.
Notice what is going on. P1 makes modifications, broadcasts it to everybody. But who really needs it?
Only P2 needs it. But the RC memory model is Eager, and it will update everybody who has a copy of
X that. But P3 doesn't care, because it's not using that variable yet.
Now in the Lazy RC model, when P1 releases the lock, we don’t do any cache coherence
communication. We simply release a lock. Later on the next process that happens to acquire that same
lock, the RC memory model will first make sure that all the coherence actions that I've associated with
that particular lock are completed. In this case the previous lock holder had made changes to the variable
x, so we need to pull X from P1 and then P2 can execute its critical section. Similarly when P3 executes
its critical section, it's going to pull it from P2 and complete what it needs to do.
The important thing that you see is that there is no broadcast anymore. It's only point-to-point
communication between the processes that are passing the lock between one to another. In other words,
the number of arrows that you see are communication events. You can see that there's a lot more black
arrows (which indicates communications commensurate with the coherence actions) in the eager RC than
the Lazy RC.
The Lazy RC is also called a pull model, because what we're doing is at the point of acquisition, we're
pulling the coherence actions that need to be completed over here. Whereas, the Eager RC is the push
model in the sense that we're pushing all the coherence actions to everybody at the point of release.
420 - Pros and Cons of Lazy and Eager

So the question is, what are the pros of the Lazy RC model over the Eager RC model.
I want you to write in free form what you think are the pros and what are the cons of the Lazy over the
Eager in memory consistency model.

421 - Pros and Cons of Lazy and Eager

In the lazy model, as we saw in the picture, there were less communication events, which means you
have less messages that are flowing through the network, in order to carry out the coherence actions
compared to the eager model.
But there is a downside to the lazy model as well. That is, at the point of acquisition, you don't have
all the coherence actions complete. Therefore you may have to incur more latency at the point of
acquisition because we have to wait for all the coherence actions be completed.
So Procrastination helps in reducing the number of messages. But it could result in more latency at the
point of acquiring the lock.
422 - Software DSM

So far, we've seen three different memory consistency models. The sequential consistent memory model,
the Eager RC and Lazy RC memory model. Now we're going to transition and talk about Software
Distributed Shared Memory, and how these memory models come into play in building software DSM.
So we're dealing with a computational cluster. Each node in the cluster has its own private physical
memory, which is not physically shared. Therefore, the system software has to implement the consistency
model for the programmer. In a tightly coupled multiprocessor, coherence is maintained at individual
memory access level by the hardware. Unfortunately, such fine grain level of coherence maintenance
will lead to too much overhead in a cluster. Why? Because on every load() or store() instruction on any
one of these processors, the system software has to butt in and perform coherence action in software
through the entire cluster, which is simply infeasible.
So how should we implement software DSM?
First we implement the sharing and coherence maintenance at the granularity level of pages. Now,
even in a simple processor or in a true multiprocessor, the unit of coherence maintenance is larger than
a single word, because in order to exploit spatial locality, the block size used in caches of processors tend
to be bigger than the granularity of memory access in the processor. So we're taking this up a level and
saying, if you're going to do it all in software, let's keep the granularity of coherence maintenance to be
an entire page.
We are going to maintain the coherence of the distributed shared memory in software by cooperating
with the operating system that is running on every node of the processor. We need to provide global
virtual memory abstraction to the application program running on the cluster so that the application
programmer views the entire cluster as a globally shared virtual memory.
Under the cover, what the DSM software does is partitioning this global address space into chunks
that are managed individually on each node of the cluster.
From the application’s point of view, this global virtual memory abstraction offers address
equivalence. That is, if I access a memory location X in my program, it means exactly the same thing
no matter on which processor my program is running.
The way the DSM software is going to handle coherence maintenance is by having distributed
ownership for the different virtual pages that constitute this global virtual address space. So you can
think of this global virtual address space as constituted by several pages, and we're going to say some
pages are owned by processor 1, some pages are owned by processor 2 and some are owned by processor
3, and so on. We split the ownership responsibility among individual processors.
Then the owner of a particular page is responsible for keeping complete coherence information for
that particular page and taking the coherence actions commensurate with that page.
The local physical memories in each one of this processors is being used for hosting portions of the
global virtual memory space, commensurate with the access pattern that is being displayed by the
application on the different processors. For instance, if processor 1 accesses this portion of the global
virtual memory space, then this portion of the address space is mapped into the local physical memory
of this processor. So that a thread that is running on this processor can access this portion of the global
address space. And it might be that same page is being shared with some other processor n over here. In
that case, a copy of this page is existing in both this processor, as well as this processor.
It is up to the processor that is responsible for the ownership of this particular page to worry about the
consistency of this page, that is now resident in multiple locations. The owner of a page will have
metadata that indicates that this particular page is currently shared by both P-1 and P-n.
Statically, we are making an association between a portion of the address space and the owner for
that portion of the address space in terms of coherence maintenance for that portion of the global virtual
memory space.
423 - Software DSM (cont)

So the abstraction layer gives the application an illusion of a global virtual memory.
This is implemented by the DSM software layer. In particular, this DSM software layer, which exists on
every processor, knows the point of access to a page by a processor, i.e. he knows who exactly to contact
as the owner of the page to get the current copy of the page.
• For instance, let's say that there was a page fault on this Processor-1 for a particular portion of the
global address space. That portion of the global address space is currently not resident in the
physical memory of Processor-1.
• When this page fault happens, that page fault is going to be communicated by the operating system
to the DSM software. The DSM software, who knows the owner of the page, will contact the
owner of the page and request a copy of the current copy of the page, i.e. Proessor-3.
• So the owner, either it itself has the current copy of the page or it knows which node currently has
the current copy of the page, will send the page over to the node that is requesting it.(P3 → P1)
• Recall what I said about ownership of the pages. The residency of this page doesn't necessarily
mean that the ownership of the page. The owner of this page could have been this node, so the
DSM software would have contacted the owner. The owner would have said, oh you know what,
this particular place to cut and copy is on this node. So the DSM software would go to this node,
and fetch this page, and put it into this processor.
• Once the page has been brought in to the P1’s physical memory, the DSM software contacts the
virtual memory manager and says, I've completed processing the page fault, would you please
update the page table for this guy, so that he can resume execution?
• Well, then the VM manager gets into action and updates to page table for this process on
Processor-1 to indicate that the faulty virtual page is now mapped to a physical page, and then the
process can resume its execution.
So this is a way coherence is maintained by the DSM software. The cooperation between DSM software
and the VM manager. The coherence maintenance is happening at the level of individual pages. An early
examples of systems that built software DSM include Ivy from Yale, Clouds from Georgia Tech, Mirage
from UPenn, and Munin from Rice.
All of these are DSM systems and they all used coherence maintenance at the granularity of an
individual page. They used a protocol which is often referred to as a single writer protocol. That is, I
mentioned that the directory associated with a portion of the virtual memory space managed by each
node, the directory has information as to who are sharing a page at any point of time. Multiple readers
can share a page at any point of time, but only one writer is allowed to have the page at any point of time.
• So, if there is the writer for a particular page, which is now currently in the memory of two
different processors and one process wants to write to this page, this process will inform the owner
of this page through the DSM software abstraction that: I want to write to this page.
• At that point, the owner is going to invalidate all copies of that page in the entire system, so that
this guy has exclusive access to that page so that they can make modifications to it. This is what
is called single writer multiple reader protocol.
Now the problem with the single writer protocol is that there is potential for what is called false sharing
similar to SMP. Basically the concept of fault sharing is that data appears to be shared even though
programmatically they are not.
In the page-based coherence maintenance, the coherence maintenance by the DSM software, is on the
granularity of a single page. A page may be 4K bytes or 8K bytes, depending on the architecture we're
talking about.
• Within a page, lots of different data structures can actually fit. So, if the coherence maintenance
is being done at the level of an individual page, then we're invalidating copies of the page in
several nodes in order to allow one guy to make modifications to some portion of that page. That
can be very costly in a page-based system due to the coarse granularity of the coherence
information.
• So for example, this one page may contain ten different data structures, each of which is governed
by a distinct lock. When I get a lock, which is only for one particular data structure in this page
and to make modifications for it, I am going go and invalidate all the other copies of the page on
other nodes.
• So the page can be ping-pong between multiple processes while each process only modifies
different portions of the same page. The coherence granularity being a page, will result in this
page shuttling back and forth between these two processes, even though the application program
is perfectly well behaved in terms of using locks to govern access to different data structures.
Unfortunately, all of those data structures happened to fit within the same page, resulting in this
false sharing.
So, page level granularity and single writer multiple reader protocol don't live happily together.
They will lead to fault sharing ping-ponging of the pages among processes across the entire network.
424 - LRC with Multi Writer Coherence Protocol

That brings us to a new coherence protocol, which is multiple writer protocol.

The idea is we still want to maintain coherence information at the granularity of pages.
• Because the operating system operates at the granularity of pages, DSM can be integrated with
the operating system if the granularity of the coherence maintenance is also at the level of a page.
• But we also want to allow multiple writers to be able to write to the same page, recognizing that
an application programmer may have packed lots of different data structures within the same page.
So we are going to see how the multiple writer coherence protocol works, and in particular we will use
that in concert with Lazy RC. Here, I'm going to give you a high level view of how LRC is integrated
with multiple writer protocol in the TreadMarks system.
So the processor P1 acquires a lock and makes modifications. This notation that I'm using is to indicate
that these X, Y, and Z are pages that are being modified. Because we are maintaining coherence on the
level of pages so the data structures that we're modifying within this critical section are contained in
pages X, Y and Z.
TreakMakrs uses the virtual memory hardware to detect accesses and modifications to shared memory
pages. The shared page is initially write-protected. When a write occurs by P1, TreadMarks creates a
copy of the page, i.e. a twin and saves it as part of the TreadMarks’ data structures on P1. It then
unprotects the page so P1 can modify it.
Once the write operation by P1 is done, DSM can calculate a diff by comparing the modified page
against the twin. So X-d, Y-d and Z-d are the diff to pages X, Y, and Z respectively as a result of executing
this critical section.
The coherence protocol we use is Lazy RC. So the next time the same lock L is requested by some
other process P2, we will first invalidate the local copies of page X Y and Z on P2, since DSM know
which page(s) are associated with this lock (It doesn't know what part of the pages are modified. That's
contained in the diff).
Once we've invalidated these pages, then you can allow P2 to get the lock and start getting into its
critical section. Since P2 wants to access page X in the critical section but that page has been invalidated.
At this point, the DSM software also knows that the previous lock holder has the modifications that need
to be made to the original page to get the current version of the page.
I mentioned this ownership based protocol in the DSM software. So the DSM software knows who the
owner of the page is.
• From the owner of the page I can get the original content of X. I'll do that.
• But I'll also need the diff from the previous lock holder.
So the DSM software brings X-d and the original version of the page from the owner of the page, to P2
so that P2 can get the most recent version of X by applying X-d to the original page.
425 - LRC with Multi Writer Coherence Protocol (cont)

Some of you may have thought of this already. What if prior to P2 getting its lock, another process P3
also modified the page X and created its own diff, X-d’?
• Because all of these locks are the same, the DSM software knows that there are two diff associated
with this lock. One diff is associated with P1, and another diff with P3. When processor P2 tries
to access page X, the DSM software needs to get the diff from P1 and the diff from P3 and apply
it to the original pristine version of the page, which is from the owner of the page.

• This process can be extended to any number of processors that may have made modifications to
this page under the provision of the lock L. All of those diffs are going to be applied for the process
of P2 to access the page as the current page.
If P2 needs to access another page, e.g. page Z, at that point once again the DSM knows that Z was
modified by the previous lock holder P1 and Z-d is the diff. DSM also knows where to find the original
version of page Z (i.e. the owner of Z) so that P2 can have the most recent version of Z before accessing
it.
Also we can see that, even though the invalidation was done right at the beginning, we're
procrastinating getting the diff until the point of access. This is what LRC allows you to do: just bring in
what this guy needs. So for instance, inside P2’s critical section maybe only X is accessed. Y and Z are
not accessed at all. In this case, we never bring the diff of Y and Z from P1 to P2.
On the other hand, it is possible that P2, modifies another page Q in its critical section. So now, the
DSM software knows that this particular lock is also associated with Q. So future lock request for L will
result in invalidating X, Y, Z, and Q because all of those may have been modified and the next critical
section that wants to access this lock L has to get the current versions of all the pages that were ever
associated with L
426 - LRC with Multi Writer Coherence Protocol (cont)

Let's talk about the Multiple-Writer part of the protocol.

Note that there could be multiple user data structures in a given page X.
• If that is the case, the programmer probably has different locks for accessing different data
structures within this page X.

• So it is conceivable that when all of this was going on, there was another processor, P4, who is
using a completely different lock, L2. P4 is accessing some data structure that also happens to be
in page X. This is perfectly fine. The DSM software will not do anything in terms of the diff that
it has created with respect to the page X because lock L and lock L2 are very likely to be associated
to different non-shared data structures.

• So P4 can execute in parallel with P1 by acquiring L2, and modified X.When P2 gets its lock L,
the liaison (DSM) software will only bring the diff from the previous users of the same lock L.

• P4 was not using L, so it is probably modifying a different data structure on page X. Therefore
the DSM software is going to assume that that change made by P4 to X is irrelevant to P2's critical
section.
So that is where the multiple writer coherence protocol semantic comes in. Simultaneously, the same
page could be modified by several different threads on several different processors. It is perfectly fine,
as long as they're using different locks. So the association between the set of changes to a page is only to
specific lock which is being used to govern that critical section. This is the reason why this is called a
Multiple-Writer Coherence Protocol. This is how this Multiple-Writer Coherence Protocol lives in
concert with LRC to reduce the amount of communication that goes on in executing critical sections of
an application.
427 - Implementation

I've always said that the fun part of any systems research is the technical details and implementing an
idea. So let's look at some of the implementation details here.
• TreakMakrs uses the virtual memory hardware to detect accesses and modifications to shared
memory pages.
• The shared page X is initially write-protected.
• When a write occurs by P1, TreadMarks creates a copy of the page, i.e. a twin and saves it as part
of the TreadMarks’ data structures on P1.
• It then unprotects the page X so P1 can modify it.
When P1 reaches the release point, the DSM software will compare the modified page with the twin
and compute the diff, which is a runlength encoding of the modifications made to the original page. So
the diff will be computed as oh, the page is changed starting from here up until here and this is the amount
of change.
When the same lock is acquired by a different processor P2, at the point of acquisition (unless it is the
first time P2 is accessing the page), DSM will bring the diff created on P1 to P2, so that P2 can apply
this diff to its local cache to get the most updated version of page X.
As far as this node is concerned, when the release operation is done, the DSM software is going to
compute the diff between changes made to this page and its original copy of the page and keep that as a
diff data structure. If there are multiple pages, all of the diffs will be created at the point of release.
The page will be write protected again, after the critical section is finished, until the next iteration.
Note, if multiple writes are happening to the same portion of a page under different locks, there will
be data race. So caution should be taken when writing programming with the DSM (TreadMarks).
428 - Implementation - Garbage Collection

So this implementation by TreadMarks is an example of the cooperation between the DSM software and
the operating system to make distributed shared memory happen. TreadMarks implemented this LRC
multiple writer coherence protocol on a Unix system. The SIGSEGV generated in the OS layer when a
shared page is accessed by a thread can be caught by TreadMarks’ runtime handler. At that point the
DSM software get into gear, contacts the owner of the page, checks the status of the page.
• If the page is invalid then it gets the page and the diff for that page. Then we can create a current
version of the page.
• If the process that is trying to access the page is making a read access, then there is no problem.
• If the process that wants to use that page wants to write to it, at that point it creates a twin and
then un-protect the page.
One thing that you will notice is that there is space overhead for the creation of the twin. TreadMarks
have TreadMarks have to create diff and then discard the twin.
As time goes by, there could be a lot of these diff lying around in different nodes. If one processor
wants to access a particular page, the DSM software has to go and bring the diff from ALL the prior
users of the page to the current user. A lot of latency is involved before we can start using the page. There
is also a lot of space occupied by those diff lying around.
So one of the things that happens in the DSM software is garbage collection. That is, you keep a
watermark of what is the amount of diff that have been created in the entire system. If it exceeds a
threshold, we start pushing these diff to the original page at the owner, so that we can then get rid of the
diff completely. We don't want to do it too eagerly, nor do we want to wait too long. There will be a
demon process in every node that every once in a while wakes up and sees how much diff have been
created on a node. If it exceeds the threshold then it says okay time to get to work. Let me go and apply
this diff to the original copy of the page so that I can get rid of the diff.
429 - Non Page Based DSM

In this lesson I've covered distributed shared memory and particularly I've given you a specific example
of a distributed shared memory system called TreadMarks that uses Lazy RC and multiple writer
coherence. I just want to leave you with some thoughts about non-page based DSM systems before
concluding this lesson.
I mentioned earlier that if you want to maintain granularity NOT at the page level, you have to track
individual reads and writes on a thread. One approach is called a library based approach.
The idea is that the programming framework/library gives you a way in which you can annotate
shared variables that you use in your program. Whenever you touch a shared variable, the executable
(we created using the library) will cause a trap so that the DSM software will be contacted. Then the
DSM software can take the coherence action needed. Once the DSM software takes the coherence action,
which might include fetching the data associated with the variable you are trying to access, then the DSM
software can resume this thread that caused this trap in the first place.
No operating system support needed, only library is needed to contact the DSM. Examples of systems
that use this mechanism include Shasta at Digital Equipment Corporation, and Beehive at Georgia Tech.
Because we are doing this sharing at the level of variables, you don't have any false sharing.
Another approach is to provide shared abstraction at the level of structures that are meaningful for
an application and this is what is called structured DSM. The idea is that there is a programming library
which actually provides abstractions that can be manipulated using API calls that are part of the language
runtime. So when the application makes the API call, the language runtime gets into gear and says what
coherence actions do I need to make in order to satisfy this API call. The coherence actions might include
fetching data from a remote node in the cluster.Once the semantics of that API call have been executed
by the language runtime, then it's going to resume this thread that made the call in the first place.
Again, no OS support is needed for this structured DSM and it is a very popular approach that has
been used in systems such as Linda, Orca, and Stampede that was done at Georgia Tech. Successors to
Stampede are Stampede RT and PTS. In this course later on, we're going to see PTS as an example of a
structured DSM system.
430 - Scalability

DSM is providing the application developer with a programming model on a cluster that is akin to
pthreads on an SMP.
It looks and feels like a shared memory threads package. That's good.
But what about performance? Will the performance of a multi-threaded app scale up as we increase
the number of processors in the cluster.
Now, from an application programmer's point of view, your expectation is that as you add more
processors, you're going to get better performance. That's your expectation.
What we are doing is, we're exploiting the parallelism that is available both in the application (because
it's structured in that way) and in the hardware, in order to get increase performances as you increase the
number of processors.
But the problem is as you increase the number of processors, because the DSM is implemented in
software, there is increased overhead as well. This overhead increases as the number of processors
increase, so the actual performance is going to be actually much less than your expectation, mitigated by
the increasing overhead with the number of processors.
This buildup of overhead with a number of processes, happens in a true shared memory multi-
processors. And this is even more true in the case of liaison, which is implementing the shared memory
abstraction in software on a cluster.
431 - DSM and Speedup

So we have the illusion of globally shared memory, which is implemented by physical memories that is
strewn all over the entire cluster. The hope is that the application running on different nodes of the cluster
will actually get speed up with increased number of processors. But such speed up is not automatic.
Recall what our good friend Chuck Thacker told us, shared memory scales really well when you don't
share memory. If the sharing is too fine-grained, then no hope of speed up, especially with DSM systems.
Because it is only an illusion of shared memory via software, not even physical shared memory. Even
physical shared memory can lead to overheads. So, this illusion through software can result in even more
overhead so you need to be very careful on how you share and what you share.
The basic principle is that the computation to communication ratio has to be very high if you want
any practical speed up. In other words, the critical sections that modify the data structures better be really,
really hefty (large and long) before somebody else needs to access the same portion of the data.
So what does this mean for shared memory codes? Well basically, if the code has a lot of dynamic
data structures that are manipulated with pointers, then it can lead to a lot of implicit communication
across the local area network. If you are using a pointer that happens to be pointing to memory in a
remote processor, it can result in a network communication across the network. This is the bane of
distributed shared memory. That pointer codes may result in increasing overhead for coherence
maintenance for DSM in LAN. So you have to be very careful on how you structure codes that can
execute efficiently in a cluster using DSM as the vehicle for programming.

432 - Conclusion

In reality, DSM as originally envisioned, i.e. a threads package for a cluster DEAD. Structured DSM,
namely, providing higher level data abstractions for sharing among threads executing on different nodes
of the cluster, is still attractive to reduce the programming pain for the developers of distributed
applications on a cluster. We will discuss one such system, called Persistent Temporal Streams, as part
of a later lesson module.
433 - The First NFS

All of us routinely use file systems, quite often, not the one that is on our local machine (desktop or
laptop), but one that is on a LAN connecting us to a workplace or university.
NFS, which stands for Network File System, just like the word Xerox is used often as a verb to denote
copying, has become a generic name to signify any file system that we access remotely.
So here is a trivia quiz for you. NFS is a name given to the first network file system that was ever
built. In this question I want you to name the company and the year that NFS was first built.

434 - The First NFS (Answer)

Some of you may have thought IBM because IBM has so many firsts to its credit.
But the first network file system labelled NFS, was built by Sun micro systems, which was in the
workstation business.
It wanted a solution for users’ files to be accessible over a local area network. So they built the first
ever network file system and called it, very simply, NFS, in 1985.
435 - NFS

Network file systems has evolved over time, but the idea is still the same.
• You have clients that are distributed all over the LAN.
• You have file servers sitting on the LAN.
• These file servers are central, as far as each client is concerned. Of course, the system
administrator may partition these servers and say that there is one server designated for a certain
class of users. For instance, if you take a university setting, you might have one server serving all
the faculty's needs, and maybe another server serving all the student needs, but so far as a single
client is concerned, it is still a centralized view.
• Since the electromagnetic disk is slow, the server will cache the files that it retrieves from the disk
in memory, so that it can serve the clients better by serving it out memory.
So this is a typical structure of a network file system.
• A centralized server, which is the model used in NFS, is a serious source of bottleneck for
scalability.
• A single server has to field the client requests coming from a group of users that it is serving and
manage all the data and metadata for all the files that are housed on this particular server.
• The data and the metadata of files are persistent data structures, and therefore the file server has
to access these data structures over the I/O bus, which is available for talking to the Disk sub-
system.
So with a centralized server like this, there is limited bandwidth for the server to get the data and the
metadata in and out of the disk. The file system cache is also limited, because it is confined to the
memory space that's available in a given server.
So, instead of this centralized view of the file system, can we implement the file system in a distributed
manner? What does that mean?
436 - DFS

The vision with a distributed file server is that there is no central server any more. Each file is distributed
across several servers.
What does it mean to solve the un-scalability of a central server?
• We want to take a file and distribute it across several different server nodes in the LAN.
• Since the DFS is implemented across all the disks in the network, if a client wants to read or write
a file then it actually contacts all the servers, potentially, to get the data that it is looking for.
• This means that since each file is distributed across all of these different servers, the idle
bandwidth that's available cumulatively across all of these servers can be used to serve the needs
of every individual client.
• This also allows distributing the management of the metadata associated each file among the
distributed server nodes.
• Furthermore, we have more memories available in all of these distributed servers cumulatively,
which means that we have a bigger memory pool available for implementing a file cache,
including all of the server memories.
• The memories of clients may be used as well and that's where we can actually go towards
cooperative caching among the clients.
So, in the extreme, we can treat all the nodes in the cluster (whether we call them S1 or C1) the same
with interchangeable roles as clients or servers.
That is, we can actually make this DFS a serverless file system, if we allow the responsibility of
managing the files, saving the files, caching the files, equally distributed among all the nodes of the
cluster.
437 - Introduction - Distributed File System

That brings us to this specific lesson that we're going to cover in this lecture.
DFS wants to intelligently use the cluster memory for the efficient management of the metadata
associated with the files and for caching the file content cooperatively among the nodes of the cluster for
satisfying future requests for files.
In other words, since we know that a disk is slow, we would like to avoid going to the disk as much
as possible and retrieve the data from the memory of a peer in the network if in fact that peer has
previously accessed the same file.
That's the idea behind cooperative caching of files.
But in order to fully appreciate the question that we're asking and the answer that we're going to discuss
in this lesson, it is important that you have a very good understanding of file systems. If you don't have
it, don't worry. We have supporting lectures that will help you get up to speed, and you can get this
knowledge by diving into a good undergraduate textbook that covers this material well. The textbook
that is used in the under graduate systems course CS2220 at Georgia Tech is a good resource for that.
438 - Preliminaries (Striping a File to Multiple Disks)

So to describe the ideas that are discussed in this distributed file system lesson, I have to introduce you
to a lot of background technologies. The first background technology that I'm going to introduce to you
is the RAID storage.
• RAID stands for Redundant Array of Inexpensive Disks. The idea is a given disk may have a
limited I/O bandwidth capacity. If I can string together a number of disks in parallel, then
cumulatively I can get much more I/O bandwidth.

• Since we have an array of disks, we have increased the probability of failures and that is why the
RAID technology uses an error correcting technology. Let's say a file has four parts to it. When
I write this file, I'm going to write part-1 to this disk-1, part-2 to disk-2, part-3 to disk-3 and so on.

• Because my data is on multiple disks, I have increased the chance that there may be failures on
any one of the disk. Therefore, we compute a checksum for the file that I've stored on these disks
and store the checksum in disk-5. This is what is called Error Correcting Code (ECC). The ECC
allows errors to be detected when I read things from the disk and if there is something is wrong I
can correct it using this extra information that I'm writing on the fifth disk here.
That's the big picture of how striping a file to multiple disks with ECC failure protection works.
• The first drawback of the RAID technology is the cost, i.e. we need multiple hard drives to store
a single file.

• The second problem is the so-called “small write problem”. The overhead of parity (ECC)
management (for too many small write) can hurt performance for small writes. If the system does
not simultaneously overwrite all N-1 blocks of a stripe, it must first read the old parity (ECC) and
some of the old data from the disks to compute the new parity.
439 - Preliminaries (Log Structured File System)

But the idea of having an array of disks to serve the file system is a good one, because it increases the
overall I/O bandwidth that's available in the server. So, how can we solve the small write problem? That
brings me to another background technology that I have to explain to you and which is called log
structured file system.
This approach provides high-performance writes, simple recovery, and a flexible method to locate
file data stored on disk.

• LFS addresses the RAID small write problem by buffering writes in memory and then
committing them to disk in large, contiguous, fixed-sized groups called log segments.

• It threads these segments on disk to create a logical append-only log of file system modifications.

• When used with a RAID, each segment of the log spans a RAID stripe.

• Each log segment is committed as a unit to avoid the need to re-compute parity (ECC).

• LFS also simplifies failure recovery because all recent modifications are located near the end of
the log.
So in other words, we use a space metric or a time metric to figure out when to flush the changes
from the log segment into the disk.
This solves the small write problem because if Y happens to be a small file, we are not (constantly)
writing Y as-is on to the disk, but what we are writing is this log segment in memory and only flush/write
the log segment to disk periodically or when the log segment is large enough. So, we results in less
number of writes.
Therefore, this log segment is going to be a big file and we can use the RAID technology to stripe the
log segments across multiple disks, given the benefit of the parallel I/O that's possible with the RAID
technology.
So this log structured file system solves the small write problem. In log structured file system, there
are only logs, no data files. All the things that you are only writing these logs to the disk.
When you have a read of a file and it has to go to the disk and fetch that file, then the file system has
to reconstruct the file from the logs that it has stored on the disk. As a result, in a log structured file
system, there could be latency associated with reading a file for the first time from the disk.
Once it is read from the disk and reconstructed, it is in memory, i.e. in the file cache of the server, and
everything is fine.
But the first time, you have to read it from the disk, it's going to take some time because you have to
read all these log segments and reconstruct it. That's where parallel RAID technology can be very helpful.
Because you're aggregating all the bandwidth that's available for reading the log segments from multiple
disks at the same time.
Another thing that you have to worry about in a log structured file system, is that these logs represent
changes that have been made to the files. For instance, I may have written a particular block of Y and
that may be the change sitting here. Next time, when I overwrite the same block, the first strike that I
did becomes invalid. I have got a new write of that same block. The new write will be appended to the
end of the new log segment, and meanwhile we need to invalidate the corresponding block in the old log
segment. So, over time, the logs will have lots of “holes” created by overwriting the same block of a
particular file.
The log structured file system employs a cleaner to do such garbage collection periodically to ensure
that the disk is not cluttered with wasted logs that have empty holes in them. So logs is similar to the diff
that you've seen in the DSM system with the multiple writer protocol that we talked about in a previous
lecture.
You may have also heard the term, journaling file system, there is a difference between log structured
file system, and journaling file system. Journaling file systems has both log files as well as data files,
and what a journaling file system does, is it applies the log files to the data files and discards the log files.
The goal of a journaling file system is also to solve the small write problem. But in a journaling file
system, the logs are there only for a short duration of time before the logs are committed to the data
files themselves. Whereas in a log structured file system, you don't have data files at all, all that you
have are log files and reads have to construct the data from the log files.
Although log-based storage simplifies writes, it potentially complicates reads because any block could
be located anywhere in the log, depending on when it was written. LFS’s solution to this problem
provides a general mechanism to handle location-independent data storage. LFS uses per-file inodes to
store pointers to the system’s data blocks. LFS’s inodes move to the end of the log each time they are
modified.
When LFS writes a file’s data block, moving it to the end of the log, it updates the file’s inode to point
to the new location of the data block. It then writes the modified inode to the end of the log as well.
LFS locates the mobile inodes by adding a level of indirection, called an imap. The imap contains the
current log pointers to the system’s inodes. LFS stores the imap in memory and periodically checkpoints
it to disk. These checkpoints form a basis for LFS’s efficient recovery procedure. After a crash, LFS
reads the last checkpoint in the log and then rolls forward, reading the later segments in the log to find
the new location of inodes that were written since the last checkpoint. When recovery completes, the
imap contains pointers to all of the system’s inodes, and the inodes contain pointers to all of the data
blocks.
Another important aspect of LFS is its log cleaner that creates free disk space for new log segments
using a form of generational garbage collection. When the system overwrites a block, it adds the new
version of the block to the newest log segment, creating a “hole” in the segment where the data used to
reside. The cleaner coalesces old, partially empty segments into a smaller number of full segments to
create contiguous space in which to store new segments.
The overhead associated with log cleaning is the primary drawback of LFS. Although Rosenblum’s
original measurements found relatively low cleaner overheads, even a small overhead can make the
cleaner a bottleneck in a distributed environment. Further, some workloads, such as transaction
processing, incur larger cleaning overheads
440 - Preliminaries Software (RAID)

The next background technology is software RAID. I mentioned that hardware RAID has two problems.
The first is the small writes problem and we can get rid of the small write problem by using log structure
file systems.
Another problem is that RAID disk is usually more expensive than commodity disk. In a local area
network, we have lots of compute power distributed over the LAN and every node on the LAN has its
own disks. So could we use the disks across LAN for doing the same thing that we did with hardware
RAID? That is, stripe a file across the disks of all the nodes in LAN.
That's the idea behind the Zebra file system that was built at UC Berkeley, which was the first one to
experiment with this software RAID technology. It combines both log structure file system (avoid small
write problem) and the RAID technology (get the parallelism in a file server to reduce the service latency).
So in this case, what we have are not data files but we have log segments that have to be written out
to the disk in an LFS.
And we stripe the log segment onto multiple nodes of the disks in software and that is the RAID
technology. So if this is a log segment that represents the changes made to several different files on a
particular client node, then the software RAID will take this log segment and stripe it: Part-1 of the log
segment on Node-1, Part-2 on Node-2, Part-3 on Node-3, Part-4 on Node-4 and the ECC that corresponds
to these four parts of the log segments into Node-5.
That's the idea in software RAID. Exactly similar to the hardware RAID except software is doing this
striping of the log segment on multiple nodes available in a LAN.
441 - Putting Them All Together Plus More

Now it's time to put together the background technology that I introduced to you, plus some more, and
describe a particular distributor file system called xFS, which is also built at UC Berkeley.
xFS builds on the shoulders of periodic technologies.
• The first one, the log based striping that I just mentioned to you from the Zebra file system.

• Another technology called co-operative caching, which is also a prior UC Berkley project.

• Lastly, XFS also has introduced several new nuances in order to make the distributed file system
truly scalable towards server-less-ness. Or in other words, no reliance on a central server. Those
techniques include dynamic management of data and meta-data, subsetting of the storage servers
and distributed log cleaning.
We'll talk about all of these techniques in much more detail in the rest of this lecture.
442 - Dynamic Management

In a traditional centralized NFS server, the data blocks are residing on the disks.
• In the memory of the server, the contents include the metadata for the files like the inode structures.
• The files are also brought in from the disk and stored in the memory, so that future requests for
the same files can be served from the memory of the server rather than going to the disk.
• The server also keeps the client caching directory. That is, who on the LAN are currently
accessing files that is the property of this particular server.
• In a Unix file system the server is un-concerned about the semantics of file sharing. In other
words, the assumption is that the server is caching for each client completely independently.
Therefore if clients happened to share a file, that is completely the problem of the clients, and the
server is not concerned about that.
So all of these contents that I just described to you, metadata, file cache, client caching directory. All of
these are in the memory of a particular server.
• If the server happens to be housing hot files used by a lot of users, then that's bad news for the
server in terms of scalability. The server has to worry about the requests simultaneously coming
from lots of clients for these hot files, but it is constrained by the bandwidth that's available to
access the files from the disk, it is constrained by the amount of memory space for caching files
and the metadata of the files.
• At the same time, there could be another server which has adequate bandwidth to the disks and
enough memory space. But unfortunately it may be housing cold files and there are not many
clients for this server.
So you can immediately see that such centralization of the traditional file system results in hot spots.
And that's the thing that we're trying to avoid in a distributed file system, and that's where dynamic
management comes into play.
In xFS, it provides the same functionality as a centralized NFS Server, but it is distributed and the
metadata management is dynamic.
In a centralized file server, the manager for a file and the location of the file is the same node.
In other words, if the file happens to reside on the disk of this server, then this server is the guy that
handles the metadata management for this file as well.
In xFS, metadata management is dynamically distributed. Let's say that you have F1, F2, and F3 are
the hot files. In that case, metadata management for files F2 and F3 can be done by some other node, say
S3. So, in other words, all the data structures that we've talked about that has to reside in the memory of
a particular server like metadata, file cache, and the caching information about who's having the files and
so on, all of that can be distributed with dynamic management of data and metadata, which is the
idea in xFS.
We will see how the dynamic management is implemented in xFS. What we would like to be able to
do is, if a file is accessed by several different nodes, then they will be living in the client caches of the
different nodes. If a file is residing in the cache of a peer node, then it makes sense that a new request for
the same file being handled by that peer node’s cache. That way we can also conserve the total amount
of memory on the servers and use it more frugally by exploiting the memories available in the clients.
Caching files cooperatively among the clients. That's the other nugget in the technical contribution
of xFS.
443 - Log Based Striping and Stripe Groups

Let's understand xFS’s log based striping in software RAID a little bit more.
You have clients on the LAN that are writing to files. When a client makes a change to a file, the
changes are written to an append-only log. This log segment data structure resides in the memory of the
client in the distributed file system.
When this log segment reaches beyond a certain capacity and the threshold number of fragments have
been reached in this log segment, the log segment is written/committed to the disk.
We take these log fragments, and compute the parity, which is the checksum or ECC. This becomes
the log segment that I want to write to the disk.
So you take this particular log fragment along with its ECC and stripe it on storage servers RAID
disks.
It is now available on the storage system and as I mentioned, you want to do this periodically in order
to avoid the chance of data loss due to failures of a particular node.
Same thing is happening on each client. So if you look at the storage server, you see that the storage
server has the log segments from this client, and the log segment from another client.
Another thing that xFS did is subsetting the servers, i.e. we don’t want every log segment to be written
on all the disks available on the LAN. This again is concerned with solving the small rate problem. for
example, if i have storage servers available on the LAN. I might decide that every log segment is going
to write its log over a small fraction of all servers. Another client may similarly choose a subset of storage
servers to write it on. The subset of storage servers that is used for striping a given log segment is called
a stripe group, for that particular log segment.
444 - Stripe Group

So, using the stripe group for a long segment avoids first of all the small-write pitfall.
So for instance, we might decide for certain log segments Z-Y-Z the stripe group is this, for another
log segment P-Q-R the stripe group is this, and for another log segment L-M-N the stripe group is this.
Subsetting the servers into these stripe groups allows parallel client activities so log-segments X-Y-
Z, P-Q-R and L-M-N can be written in parallel if they belong to different clients and stripe groups.
Subsetting increases the availability of the server in saying that not all the servers has to be working
on the same client request. Different subset of servers are working on different client requests and that
results in higher throughput for overall client processing.
When it comes to cleaning the logs (remember that I mentioned earlier that every once in a while, you
have to clean the logs because the logs may have been overwritten by new writes to the same file block,
and old log segments will have invalidated “holes” that need to be cleaned and recycled), efficient log
cleaning can also be facilitated by the fact that you have different stripe groups. You can assign different
cleaning service for different stripe groups that increases the parallelism in the management of all the
things that need to be done in a distributed file system.
An increased availability also means that you can survive multiple server failures. Let's say that these
two disks fail for some reason. You can still serve the clients who are being served by this particular strip
group.
That's the idea of subsetting the servers into stripe groups so that you increase the availability and allow
incremental satisfaction of the user community in spite of failures that may be happening in the system
as a whole.
445 - Cooperative Caching Cache Coherence

Next let's talk about how xFS uses the memory that is available in the clients for cooperatively caching
the files and reducing the stress of data files management.
(From the paper: cooperative caching - using client memory as a global cache via simulation)
• I mentioned earlier that in Unix file system, the file server assumes that it is serving each client
independently, and therefore it doesn't worry about sharing a file. If a particular file happens to be
access by multiple users at the same time, the server doesn't worry about the coherence of that file.
• But in xFS, the file system worries about cache coherence. I have already introduced the idea of
cache coherence in the context of multi processors and distributed shared memory so you are
familiar with the terminology single-writer-multiple-reader.

The unit of cache coherence that XFS maintains is at the level of file blocks. Not an entire file but at the
level of individual file blocks.
• So if you look at a manager (mgr) for a file, mgr is responsible for the metadata management for
a file, e.g. F1 and the metadata in the memory of mgr will have information about the current state
of that file.
• For instance, this particular entry (f1-r-c1-c2) says that a file F1 (managed by this mgr) is being
read concurrently by two different clients, C1 and C2.
• C1 and C2 have copies/caches of this file F1 that were retrieved from this mgr at some point of
time, so C1 and C2 have F1 in their physical memory.
• (For simplicity, I'm showing this as a file, but in fact the granularity at which the coherence and
information about files is kept is at a file block level. So at a block level the manager says a
particular block of a file is in the cache of client C1 and in the memory cache of client C2)
The semantics that is observed for cache coherence is single-writer-multiple-readers.
• So if another client C3 makes a write() request to the manager for this file F1 (file block to be
precise), the mgr receives this request and looks up the metadata for that particular file.
• The mgr sees that this file is now currently read-shared by C1 and C2.
• So we have a read-write conflict.
What the mgr is going to do is basically say, well, if somebody wants to write to that file I have to tell
the guys that currently have the file. They cannot have it anymore.
• So just as in the case of cache coherence in a multi-processor, this manager will send an inv(f1)
message to C1 and C2.
• Upon receiving the inv(f1) message, C1 and C2 will acknowledge the invalidation.
• Once the mgr gets that indication back from the clients, the manager can tell the client C3 that,
OK now you can proceed writing to this file F1.
That's the protocol that is being observed in xFS to keep the copies of the files consistent. So at the end
of this message exchange, C3 will have the right to write to this particular file F1.
How long does it have that privilege?
• When the write request is granted, the client gets the token and the manager can revoke the token
given to C3.
• In particular, this token will be revoked when a future read for the same file comes to the mgr.
But C3 still keeps this copy/cache in his local memory.
• If a new client wants to write to a file while a current client is writing, , the mgr will also invoke
the token, invalidate the file at the current writer client, pull back the contents of the file, and then
distribute it to the new writer client.
That's how cache coherence works.
Using the fact that copies of the file is existing in multiple clients, xFS exploits that fact to do
cooperative caching.
• For example, after the exchange in the example above, C3 has a copy of the file F1.
• If a new read() request comes in, this read() request can be satisfied by getting the contents of F1
from the cache of C3.
• This is what cooperative caching is all about: instead of going to the disk to retrieve the file, we
can get the file content from the cache of one of the clients who happens to have a copy of the file.
446 - Log Cleaning

As client activities go on, log segments evolve on the disk.

• For instance, if on a particular client node, some blocks are being modified.

• The modification to each block are stored in log segment seg-1 and have been committed to the
LAN-RAID (nodes in the cluster), which included changed made to block 1, 2 and 5 (1’-2’-5’).

• Later this client made some new changes to block 1, 3 and 4. Now we have seg-2 = 1’’-3’-4’.

• Now the modification to block 1 in seg-2 is newer than that in seg-1. The 1’ in seg-1 is now a
stale copy so it will be invalidated, leaving a “hole” in seg-1.

• So for those old log segments that have been committed to the disks, the portion that corresponds
to a stale modification will be invalidated over time. For example, it another change to block-2 is
made, i.e. 2’’, 2’ in seg-1 will also be invalidated, leaving another hold in seg-1.
Over time we will end up with many holes in the log segment files. This is why log cleaning is necessary.
• For example, what we want to do is recognize that we have three segments that have holes in them.
• Then we will aggregate all the live blocks from these three segments in to a new segment. So
we've got five live “portions”: 5’-1’’-3’-4’-2’’ and we coalesce those live blocks into one new
segment.
• Once we have aggregated all the live blocks into one new segment, all of the old log segments
can be garbage collected.
In a distributed file system, there is a lot of garbage that is being created across all the storage service.
We don't want this to be done by a single manager. Instead we would like it to be done in a distributed
manner. This is another step towards a truly distributed file system. The log cleaner's responsibilities
include the following.
• It has to find the utilization status of the old log segments.
• It has to pick some set of log segments to clean.
• Once it picks a certain number of log segments, it has to read all the live blocks from these log
segments and aggregate the live blocks into a new log segment.
• Lastly, it can garbage collect all of those log segments.
So this is the cleaning activity that an LFS cleaner has to do. In xFS they distribute this log cleaning
activity as well.
Now remember that in the distributed file system, this log cleaning activity is happening concurrently
with writing to files on the nodes. So there are lots of subtle issues involved in managing this log cleaning
in parallel with normal activities such as creating new log segments, or writing to existing log segments.
In xFS they make the clients also responsible for log cleaning.
There is no separation between client and server. Any node can be a client or a server depending on
what it is doing.
Each client, meaning a node that is generating a log segments, is responsible for the segment utilization
information for the files that they are writing, in terms of creating new files, creating blocks and log
segments in xFS. So the clients are responsible for knowing the utilization of the segments that are
resident at that client node.
Since we have divvied up the entire space of servers into stripe groups, each stripe group is responsible
for cleaning activity that is in that set of servers. Every stripe group has a leader, and the leader in that
stripe group is responsible for assigning cleaning services to the members of that stripe group.
Recall that the manager is responsible for the integrity of the files because it is doing metadata
management, and requests for reading and writing files are going to come to the manager.
On the other hand, the log cleaning responsibility is going to the leader of a stripe group.
The manager is the one that is responsible for resolving conflicts that may arise between client updates
that want to change some log segments and cleaner functions that want to garbage collect some log
segments. Those conflicts are resolved by the manager.
These again are subtle details which I want you to read carefully in the paper.
447 - Unix File System

Next let's talk a little bit about the implementation details of xFS.
First of all, in any Unix file system there are these i-node data structures, which give you a mapping
between the file name and the data blocks on the disk.
So given a file name and the offset that you want to get into that file, the file system has a way of
looking up a data structure called i-node and deciding where exactly on the disc of the data blocks that
you're looking for.
This is happening in any Unix file system.
448 - XFS Data Structures

To open a file, the client first reads the file’s parent directory (labeled 1 in the diagram) to determine
its index number.
• If the block is present in the cache, the request is done.
• Otherwise, xFS first uses the manager map to locate the correct manager for the index number
and then sends the request to the manager.
o The manager then tries to satisfy the request by fetching the data from some other client’s
cache.
o If no other client can supply the data from cache memory, the manager routes the read
request to disk.
▪ The manager examines the imap to locate the block’s index node. If the manager
has to read the index node from disk, it uses the index node’s disk log address and
the stripe group map to determine which storage server to contact.
▪ After having inode, the manager then uses the inode to identify the log address of
the data block. (We have not shown a detail: if the file is large, the manager may
have to read several levels of indirect blocks to find the data block’s address; the
manager follows the same procedure in reading indirect blocks as in reading the
index node.) The manager uses the data block’s log address and the stripe group
map to locate and request the actual data block.
xFS data structures for implementing a truly distributed file system are much more involved. Let's talk
little bit about that.
First of all, I mentioned that the metadata management is not static. Even though the file that you are
looking for may be resident in particular node, the manager for that file may not be at that same node.
When a client looks for a file, it starts with a filename and consults this mmap (manager map) data
structure to know who's the metadata manager for this particular file name. This mmap is replicated on
every node in the cluster.
The manger node action is fairly involved.
• When the client comes to the manager with a filename, the manager looks up the first data
structure called the FileDir (file directory).
• This FileDir has the index number and that index number is the starting point for looking up the
contents of that file.
• Using i-number, it uses another data structure called i-map to get the i-node address for this
particular filename. The i-node address is the address for the log segment associated with this
filename’s inode.
• Using the i-node address, xFS can contact the proper strip group to retrieve the inode from the
server storage.
• With the inode, the xFS can again contact the proper server storage stripe groups to locate the
actual data block that corresponds to filename.
That's the entire road map of what the manager will have to do.
This sounds like a lot of work being done, fortunately caching helps in making sure that this long path
is not taken for every file access.
449 - Client Reading a File

As I said, fortunately, every file read is not going to result in going through so many hops to get the data
blocks. This is where caches come into play.

So when you start with a filename and an offset on a client node, you look up the directory → you get
an index number → (if this file has been accessed before, it's most likely in its own Unix cache)→ you
get the data block from the local cache.
This is the fastest path for file access and hopefully there's a common case.
Now it is possible that a file is shared or the same file is being read by different clients at different points
in time. In either case, there is a possibility that a particular file has been accessed by a client and is in
the cache of that client.
So the next possibility is that you start with a directory → you don't find it in your local cache → you
have to go to the manager map → manager ID → (possible network hop) manager → the manager node
tells you that this particular file, has been accessed by a different client and manager’s metadata says that
some of the client has this file in their cache → manager will tell the client that currently is holding a
copy of the file in its cache to send the data to the first guy that requested it.
Now the data that I requested is coming from the cache of a peer. Still, that's much better than going
to the local disk and pulling it out of the disk, because network speeds are much faster than accessing the
disk.
This is the second best path for file access. Sure, there is network communication involved here
because, (1) hop over the network to get to the manager node if manage is on a different node (2) another
network hop if the manager says, oh, this particular file is cached in a different node. In that case, there
is another network hop to go to the client that currently containing that particular file. (3) another network
hop in order to send the data over to the requester.
So potentially there could be three network hops in order to get the file that I'm looking for, but it
could be less than that depending on the co-location of the manager and the node that is requesting it or
the manager and the node that contains a copy of that file itself. So this is the second best path for file
access.
There is also the pathologically 'real long way' of accessing a particular file.
You start with a filename → directory → index number → it’s not in your cache → manager map →
manager ID → manager → manager looks up its metadata for this file → finds that nobody has it in the
cache → we have to pull it from the desk.
To pull a file from the disk, then it has to look up its imap → inode address → contact the stripe group
map data structure → go to the storage server → get the inode → get the log segment ID for the requested
data block → contact stripe group map → go to storage servers → that storage server is going to give the
requested data block to the client.
So you can see that in this long path, there is network hop as well as accessing the storage servers to
pull the data blocks. It is possible that the inode for the log segment ID associated with this file has been
previously accessed by this manager. In which case, it doesn't have to go to the storage server to get the
inode, because it'll be present in the memory of the manager as part of its caching strategy. And therefore
it can bypass these two network hops because directly from this strip group map it can figure out what
the long segment ID is locally cached so that it can then go to this stripe group map data structure and
figure out where on the disk the data blocks for that log segment is actually secured, so we might be able
to get rid of at least two of these network hops. Therefore we can avoid these two network hops.
But the worst case scenario, if this particular log segment has never been accessed before, then the
long way to get to the data block that you're requesting is going through the data structures in the manager,
network hop, storage look up, and get the data and give it to the client.
450 - Client Writing a File

Writing to a file by a client is fairly straightforward.

What the client is doing is actually aggregating all the writes that is going on into this Log Segment
Data Structure, which is in its memory.
Then it can decide when to flush this Log Segment onto the disk. And when it decides to do that, it
knows the Stripe Group, this particular Log Segment it belongs to, and so it is going to take this Log
Segment and stripe it on the Storage Servers that are part of the Stripe Group. So once it does this write
to its Stripe Group. The client will notify the manager on the lock segments that are being flushed to the
disk so that the manager has up-to-date information about the status of the files it manages.
xFS is a research prototype of a Distributed File System. There have been other instances of
Distributed File Systems such as the Android File System and the Coder File System that were built at
CMU. In fact the Andrew File System served a community of Users in the CMU campus. In a later lesson
we will return to discuss Distributed File Systems again. Specifically we'll look at Andrew File System
and discuss the issue of security and privacy for User Files in a Distributed System.
But I want to leave you with some closing thoughts on the xFS File System.
• First of all, log based striping, and particularly sub-setting the storage servers over which you'll
stripe the log.
• The second technical innovation is combining cooperative caching with Dynamic Management
of Data and Metadata.
• The last technical nugget is the distributed log cleaning, making sure that the responsibility for
cleaning up the logs on the disk is not left to one node, but distributed. Especially taking advantage
of the fact that the clients, who are the mutators for the file system, meaning they are the guys that
are writing to the file system, can keep a count of the changes that they are making to the log
segments and use that information in the log cleaning more effectively.
451 - Conclusion - Distributed File Systems

Today, network file systems are an important component of any computing environment, be it a corporate
setting or university setting.
There are companies that have sprung up such as NetApp solely to pedal scalable NFS products.
In this lesson, going beyond NFS we learned a lot of concepts pertaining to the design and
implementation of distributed file systems, in particular how to make the implementation scalable by
removing centralization and utilizing memory that's available in the nodes of a LAN intelligently.
Such techniques for identifying and removing bottlenecks are the reusable nuggets that we can take
and apply to the design and implementation of other distributed subsystems in additions to file systems
themselves.
Overall, in the set of papers that we studied in this lesson module spanning, GSM, DSM and DFS, we
discussed the design and implementation of subsystems that found creative ways to fully utilize memory
that's available in the nodes of a local area network.

Comparison of Memory Management of Windows With LINUX
83% (6)
Comparison of Memory Management of Windows With LINUX
10 pages
Turbine Governer Control
100% (1)
Turbine Governer Control
44 pages
GRC Config Steps, ARA, EAM
100% (1)
GRC Config Steps, ARA, EAM
61 pages
Virtual Memory (Computer Operating System)
100% (1)
Virtual Memory (Computer Operating System)
44 pages
Ict123 W9
No ratings yet
Ict123 W9
66 pages
Memory L
No ratings yet
Memory L
44 pages
Unit 4 - Os - Theory
No ratings yet
Unit 4 - Os - Theory
110 pages
Virtual Memory in Operating System - GeeksforGeeks
No ratings yet
Virtual Memory in Operating System - GeeksforGeeks
11 pages
Final Unit 5
No ratings yet
Final Unit 5
37 pages
CH 3 OS Memory Managment
No ratings yet
CH 3 OS Memory Managment
77 pages
Lecture 7 Memory 2021
No ratings yet
Lecture 7 Memory 2021
64 pages
Unit 1
No ratings yet
Unit 1
25 pages
Quadrics Interconnection Network
100% (1)
Quadrics Interconnection Network
24 pages
5G Technology in Private Networks PPT (2) HARI CSE
No ratings yet
5G Technology in Private Networks PPT (2) HARI CSE
25 pages
ACA Unit 2
No ratings yet
ACA Unit 2
45 pages
Chapters 7 & 8 Memory Management: Operating Systems: Internals and Design Principles, 6/E
No ratings yet
Chapters 7 & 8 Memory Management: Operating Systems: Internals and Design Principles, 6/E
96 pages
Lect5 - Distributed Shared Memory
No ratings yet
Lect5 - Distributed Shared Memory
120 pages
Distributed Shared Memory
No ratings yet
Distributed Shared Memory
109 pages
DSM1
No ratings yet
DSM1
77 pages
Chapter 5 - Shared Memory Multiprocessor
No ratings yet
Chapter 5 - Shared Memory Multiprocessor
96 pages
FileDirectory
No ratings yet
FileDirectory
29 pages
Resource Management
No ratings yet
Resource Management
11 pages
Mvs
No ratings yet
Mvs
12 pages
The Principle of Referential Locality: Next
No ratings yet
The Principle of Referential Locality: Next
10 pages
Assignment4-Rennie Ramlochan
No ratings yet
Assignment4-Rennie Ramlochan
7 pages
UNIT-6 Memory Mgt.
No ratings yet
UNIT-6 Memory Mgt.
19 pages
Lec 6 SharedArch PDF
No ratings yet
Lec 6 SharedArch PDF
33 pages
Virtuual Memeory Ig
No ratings yet
Virtuual Memeory Ig
16 pages
Distributed Shared Memory: Writes To A Logical Shared Address by One Thread Are Visible To Reads of The Other Threads
No ratings yet
Distributed Shared Memory: Writes To A Logical Shared Address by One Thread Are Visible To Reads of The Other Threads
41 pages
Global Memory Management For A Multi Computer System
No ratings yet
Global Memory Management For A Multi Computer System
12 pages
Distributed Shared Memory (DSM)
No ratings yet
Distributed Shared Memory (DSM)
27 pages
Lecture 3 PDC
No ratings yet
Lecture 3 PDC
21 pages
Memory Subsystem Notes-1
No ratings yet
Memory Subsystem Notes-1
8 pages
Unit 2.1
No ratings yet
Unit 2.1
18 pages
Hadoop and Big Data Unit 2
No ratings yet
Hadoop and Big Data Unit 2
11 pages
Lectures On DS
No ratings yet
Lectures On DS
8 pages
Memory Management
No ratings yet
Memory Management
26 pages
OS Supplement
No ratings yet
OS Supplement
11 pages
1 Virtual Memory Details
No ratings yet
1 Virtual Memory Details
2 pages
Bashiru Serkina
No ratings yet
Bashiru Serkina
4 pages
OS Notes Done
No ratings yet
OS Notes Done
4 pages
Unit 2
No ratings yet
Unit 2
15 pages
Virtua Memory
No ratings yet
Virtua Memory
5 pages
A Reflective Memory System For Personal Computers - Paper 02
No ratings yet
A Reflective Memory System For Personal Computers - Paper 02
6 pages
Coa Notes
No ratings yet
Coa Notes
9 pages
Important Definitions in OS
No ratings yet
Important Definitions in OS
9 pages
Parallel Computing Lecture # 6: Parallel Computer Memory Architectures
No ratings yet
Parallel Computing Lecture # 6: Parallel Computer Memory Architectures
16 pages
6.033 Computer System Engineering: Mit Opencourseware
No ratings yet
6.033 Computer System Engineering: Mit Opencourseware
7 pages
Distributed Shared Memory-Report
No ratings yet
Distributed Shared Memory-Report
35 pages
Basic Operating Systems Concepts
No ratings yet
Basic Operating Systems Concepts
7 pages
V3i9201434 PDF
No ratings yet
V3i9201434 PDF
6 pages
Ca Research
No ratings yet
Ca Research
5 pages
Fire Alarm System - Notifier PDF
No ratings yet
Fire Alarm System - Notifier PDF
19 pages
This Study Resource Was Shared Via
20% (5)
This Study Resource Was Shared Via
2 pages
Chapter-24 Memory Management
No ratings yet
Chapter-24 Memory Management
8 pages
Virtual Memory
No ratings yet
Virtual Memory
7 pages
Embedded System Architecture
No ratings yet
Embedded System Architecture
10 pages
Xtremax Company Profile 2009
100% (2)
Xtremax Company Profile 2009
13 pages
P D Group2-2
No ratings yet
P D Group2-2
6 pages
Virtual Memory
No ratings yet
Virtual Memory
6 pages
Taxonomy of Parallel Computing Paradigms
No ratings yet
Taxonomy of Parallel Computing Paradigms
9 pages
What Is Parallel Computing
No ratings yet
What Is Parallel Computing
9 pages
CGR Microproject
No ratings yet
CGR Microproject
11 pages
Digital Forensics Seminar Report
No ratings yet
Digital Forensics Seminar Report
22 pages
Just Exam Online Tutorial
No ratings yet
Just Exam Online Tutorial
61 pages
Network Security and Cryptography: Unit I
No ratings yet
Network Security and Cryptography: Unit I
26 pages
Java Strings - Introduction & String Methods
No ratings yet
Java Strings - Introduction & String Methods
29 pages
Grade 1 Computer Worksheet# 1,2
No ratings yet
Grade 1 Computer Worksheet# 1,2
4 pages
BRKCRS 3810
No ratings yet
BRKCRS 3810
143 pages
Book Recommendation System
No ratings yet
Book Recommendation System
8 pages
Product Enablement
No ratings yet
Product Enablement
72 pages
24c16 - Memoria
No ratings yet
24c16 - Memoria
8 pages
Cs 083 HP 2 Mat Cs Final
No ratings yet
Cs 083 HP 2 Mat Cs Final
10 pages
MSA 4th Edition
No ratings yet
MSA 4th Edition
54 pages
Circuits and Systems For Efficient Portable-to-Portable Wireless Charging
No ratings yet
Circuits and Systems For Efficient Portable-to-Portable Wireless Charging
125 pages
RDBMS - Muj
No ratings yet
RDBMS - Muj
34 pages
4GSASEXPGUIintro PDF
No ratings yet
4GSASEXPGUIintro PDF
11 pages
HS1031 Individual Assignment Answer Sheet T1 2023
No ratings yet
HS1031 Individual Assignment Answer Sheet T1 2023
7 pages
MKE02P64M40SF0
No ratings yet
MKE02P64M40SF0
37 pages
Unit Three DBMS Notes-1
No ratings yet
Unit Three DBMS Notes-1
31 pages
05 - IaaS, PaaS, SaaS Case Study
No ratings yet
05 - IaaS, PaaS, SaaS Case Study
20 pages
Internetshutdown
No ratings yet
Internetshutdown
1 page
Quad9 - A Public and Free DNS Service For A Better Security and
No ratings yet
Quad9 - A Public and Free DNS Service For A Better Security and
3 pages
7.3.2.4 Lab - Research Laptop Screens
No ratings yet
7.3.2.4 Lab - Research Laptop Screens
2 pages
Determining The Initial States in Forward-Backward Ltering
No ratings yet
Determining The Initial States in Forward-Backward Ltering
8 pages
Quamet MSC530M G02 3T 2020-2021 - Part IV
No ratings yet
Quamet MSC530M G02 3T 2020-2021 - Part IV
2 pages

L07 Distributed Shared Memory and File Systems

Uploaded by

L07 Distributed Shared Memory and File Systems

Uploaded by

384 - Introduction - Global Memory Systems

The second case is similar to the common case.

Next let's consider the case where a page is actively shared.

• At last, R needs to make room for page Y by

• The page Y will be hosted in R’s global memory.

392 - Local and Global Boundary Solution

• interacting with the VM Manager

So we've covered a lot of ground so far:

I've already introduced you to shared memory synchronization.

So now we come back to our original parallel program.

• So the coherence action is performed only when the lock is released.

421 - Pros and Cons of Lazy and Eager

That brings us to a new coherence protocol, which is multiple writer protocol.

Let's talk about the Multiple-Writer part of the protocol.

434 - The First NFS (Answer)

As client activities go on, log segments evolve on the disk.

Writing to a file by a client is fairly straightforward.

You might also like