FEROLIN, Mary Bernadette J. November 29, 2020 BSCS-2 CS 3104 (4:30 - 6:00, MW) Chapter 10: Virtual Memory
In the previous chapter, we talked about main memory, more specifically, the methods by which we can manage memory so that we are able to maximize it as a resource within the system. This means having as many processes in memory as possible to allow multiprogramming. A problem, however, is that processes need to be in memory before they can be executed (and, as we know, we only have limited memory). In this chapter, we will be talking about virtual memory, which allows us to execute processes without ever having them completely in memory. An advantage of this (and I'm pretty sure this has been mentioned in passing in the previous chapter) is that it allows the size of the virtual memory to exceed that of the physical memory, basically abstracting main memory into one huge but uniform array of storage. Hence, logical memory and physical memory are considered two separate entities. In addition, virtual memory allows us to implement shared memory, which is useful, especially in interprocess communication. Virtual memory can be helpful, but it is not easy to implement, and it can be detrimental to system performance if implemented carelessly.
10.1 Background
We’ve learned that one of the first requirements to run a process is to ensure that this process is
first loaded into memory. Once it is loaded, only then can it be scheduled to run on the CPU. Hence
memory management is necessary. It is reasonable of course for us to do this, but it can become a
hindrance to actually designing and developing programs that can do tasks efficiently or solve a
problem. This is because, then, we, as programmers of these programs that others use, are limited by
the size of the physical memory – we can’t have a process as big as the physical memory, otherwise,
our CPU utilization would also decline fast. In other words, with only one process in the memory at a
time, we can’t actually do anything significant. When examined, the programs that are loaded into
memory have portions which are not entirely needed, and thus just use up the space inactively,
where other processes could have been placed, For example:
Programs have code that allows them to handle unusual errors. However, since these errors are unusual, we don't usually encounter them, so most of the time this code is almost never used.
Arrays, lists, and tables are often allocated more space than they actually need. We know that arrays are statically defined, so we could be allocating space for an array of size 100 but really only use 10 elements' worth of space.
Other options and features of the program need not necessarily be used often. These features could be loaded only if they were needed.
Even if the possibility of needing all parts of the program does come up, it is rarely the case that we'd need all of it at the same time. Partially loading a program into memory, then, can provide us with benefits:
We are no longer limited by the amount of physical memory there actually is. Instead, as a programmer, I could focus on writing a program (no matter the size, for as long as it solves a problem or does a task efficiently) without worrying about whether my program could run due to memory size constraints.
Because each process takes up only a little space at a time, more programs can now be loaded into memory. With more processes in memory, more processes are run at the same time (especially in a multiprocessing environment), hence CPU utilization is maximized.
Less I/O would be needed to load or swap portions of programs into memory, basically allowing each program to run faster.
Virtual memory involves the separation of the logical memory from the physical memory. The
virtual memory is basically a management technique that provides abstraction of the storage
resources. This abstraction, therefore, allows us to have an extremely large logical memory even
when there is only a smaller physical memory available in actuality. This is shown in Figure 10.1.
The virtual address space is the range of virtual addresses made available to processes by the
operating system. The virtual address space of a process is a logical view of how a process is stored
in memory. The process typically begins at a certain logical address, let's say 0, and exists in contiguous memory, as shown in Figure 10.2. Note the emphasis on the word "logical" because, if we remember from Chapter 9, a process can be assigned multiple page frames, and these frames are not necessarily stored contiguously in physical memory.
Figure 10.2 shows us how the structure of memory looks as it was introduced to us in Chapter 3. Note how the heap grows upward in memory as more memory is dynamically allocated. Similarly, the stack grows downward as more functions are called in the system. There is no hard and fast rule for the direction of growth of the heap and stack; it depends upon the architecture of the system. However, we know that there is still space for as long as these two don't overlap or meet. The gap between the stack and the heap is part of the virtual address space, but it will require actual physical pages only if the heap or stack grows. Virtual address spaces can have holes, and when they do, they are known as sparse address spaces. The holes in these sparse address spaces can later be used if the heap or stack grows, or for objects like libraries that are linked during program runtime.
Besides potentially being much larger than the physical memory, virtual memory also allows for the sharing of data between two or more processes through page sharing (yes, we did talk about this in Chapter 9). Because of this, we are able to reap the following:
Data can be shared amongst multiple processes. Recall that, as an example in Chapter 9, we saw that instead of each process having a copy of the standard C library, they can actually share it with one another and the system maintains only one copy. This is one of the benefits of using virtual memory. Every process which uses this C library considers it to be a part of its individual virtual address space. Typically, a library is mapped read-only into the space of each process that is linked with it – recall from Chapter 9 that for code to be sharable, it has to be reentrant, and reentrant code is code which does not change over the course of its use, hence no process should modify it and it is mapped read-only.
As I’ve stated earlier, virtual memory allows processes to have a shared memory and this
shared memory is vital in allowing different processes to communicate with each other.
Processes sharing a region of memory consider said region as part of their virtual address
space, yet the actual physical pages of memory are shared as is illustrated in Figure 10.3.
Pages can also be shared during process creation with the fork() system call.
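To make the page-sharing idea concrete, here's a small C sketch I put together (my own illustration, not from the book) for a POSIX-like system: the parent asks for an anonymous shared mapping and forks, so both processes refer to the same physical page through their own virtual address spaces.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* One page of memory that will be shared between parent and child. */
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); exit(1); }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0) {                        /* child writes into the shared page */
        strcpy(shared, "hello from the child");
        _exit(0);
    }

    wait(NULL);                            /* parent waits, then reads the very same page */
    printf("parent sees: %s\n", shared);
    munmap(shared, 4096);
    return 0;
}
```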
The aforementioned are only some of the benefits to using the virtual memory. In the coming
sections of this chapter, these benefits shall be expounded and explored.
10.2 Demand Paging
Prior to this chapter, we were taught that for a program to run, it must first be loaded into memory, and only then can it be scheduled to run on the CPU. Well and good, this is true. However, the problem with this is that it requires us to move the entire code of the program into memory. And as we've stated earlier, this can hold us back from maximizing memory use, because we'd have to load the entire program even if we would use only parts of it. And even when we do need to use the entire code, it's not really all the time that we use it all at once. What I'm saying is that most of the time, only a small portion of the program's code is actually used at a time. For example, we have a program which initially lets us choose between a few options. These are only a few options, and yet we've loaded the entire thing into memory – the data which is not used as of yet, and the options, whether or not they're chosen by the user. And because this takes up much space, we decrease the chance of many more processes being in memory.
Alternatively, we can use demand paging. Demand paging is simply a strategy of loading only what is needed at the time into memory – we load instructions, data, and code into memory only as they are needed. Recall my allusion of bringing coloring materials only when the instructor requires it. And so, only when pages are demanded are they loaded. This strategy is similar to swapping, which we discussed in Section 9.5.2, where a process is temporarily swapped out of memory and stored in the backing store if it is, as of now, inactive. Using demand paging explains how virtual memory allows memory to be used efficiently. Since only a few portions of the program are loaded at a time, less space is taken up by each process and more processes can be loaded into memory – and we want this, so that even with many processes in memory, the memory is maximized instead of wasted.
With demand paging, the pages of a process are now separated – some are in memory and some are in secondary storage. We have to figure out which is which. The valid-invalid bit, which we introduced in Section 9.3.3, can be used for this purpose.
As we’ve done so before, we have a bit used in the page table which we could set to either valid or
invalid as we show in Figure 10.4. If the bit is set to “valid”, then the associated page is both legal and
in memory and can, therefore, be accessed. However, this time, note that having the bit set to
“invalid” can mean one of two things – either the page is not valid (meaning that it is not in the
logical address space of the process), or it is valid, but is just not yet brought into memory.
Nothing happens as long as you don't actually try to access a page marked invalid. However, if you do, then this will cause an error, which we call a page fault, and thus a trap to the operating system occurs. How, then, will we be able to demand, access, and use pages that have not been brought into memory yet? The procedure for handling a page fault is pretty straightforward, as we'll see in Figure 10.5.
(1) We first check the internal table of the process to determine whether the reference is a valid
memory access or not.
(2) If the reference is an invalid access, a trap to the OS is performed and the process is terminated. However, if the reference is valid and the page is just not loaded into memory yet, then we page it in.
(3) We find a free frame (by taking one from the free-frame list, for example).
(4) We schedule a secondary storage operation to read the desired page into the newly allocated frame.
(5) When the storage read is complete, we modify the internal table kept with the process and the page table so that, now, the page is in memory.
(6) We restart the instruction that was interrupted by the trap. The process can now access the
page as though it had already been in memory the whole time.
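To tie these steps together, here's a tiny, self-contained C simulation I wrote of the handler's logic (a toy model, not kernel code; names like page_in() are my own): a page table with valid bits, a free-frame list, and a pretend read from the backing store.

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_PAGES  8
#define NUM_FRAMES 4

typedef struct { bool valid; int frame; } pte_t;   /* one page-table entry */

static pte_t page_table[NUM_PAGES];                /* all invalid at start */
static int   free_frames[NUM_FRAMES] = {0, 1, 2, 3};
static int   free_count = NUM_FRAMES;

/* Steps (3)-(5): take a free frame, "read" the page, update the page table. */
static void page_in(int page) {
    int frame = free_frames[--free_count];
    printf("  page fault: reading page %d from backing store into frame %d\n",
           page, frame);
    page_table[page].frame = frame;
    page_table[page].valid = true;
}

/* Steps (1), (2) and (6): validate the reference, fault if needed, then retry. */
static void access_page(int page) {
    if (page < 0 || page >= NUM_PAGES) {           /* invalid reference: terminate */
        printf("reference to page %d: illegal, process terminated\n", page);
        return;
    }
    if (!page_table[page].valid)                   /* valid but not yet in memory */
        page_in(page);
    printf("reference to page %d: found in frame %d\n",
           page, page_table[page].frame);
}

int main(void) {
    int refs[] = {2, 2, 5, 1, 2};                  /* a short reference sequence */
    for (size_t i = 0; i < sizeof refs / sizeof refs[0]; i++)
        access_page(refs[i]);
    return 0;
}
```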
There are also extreme cases where we can start executing a process with nothing loaded into memory. When an instruction that is not in memory is needed, a page fault occurs and the page containing it is loaded into memory. Afterwards, a few more pages, which are also not in memory yet (since we did say that we'd only call for them if they're needed), are faulted in as necessary until every page that is needed is loaded into memory. As you can observe, it is rigidly a mechanism of demanding only when needed, so much so that it starts out with no pages in memory and demands as necessary to execute the program correctly. This is called pure demand paging.
Theoretically, some programs could access several new pages with each instruction execution – that is, each instruction could need some other set of instructions or data, thus causing multiple page faults per instruction. This would result in unacceptable performance from demand paging. However, analysis of running processes shows that this is very unlikely. Hence, the chances of experiencing a fault with every instruction execution are slim. Programs tend to have locality of reference, which results in reasonable performance from demand paging. Just a heads-up, locality of reference is simply the phenomenon in which the memory locations accessed by a program tend to cluster over a particular period of time. In short, it is the tendency and inclination of a program to access the same set of memory locations, or memory locations near one another.
The hardware support for demand paging is the same as that for paging and swapping:
Page Table
- The page table allows us to mark an entry as valid or invalid through the use of the
valid-invalid bits or some other form of protection bits
Secondary Memory
- This memory serves as storage for the pages that are not yet in memory.
- It is usually a high-speed disk or NVM device.
- It is also known as the swap device, for when pages are swapped into and out of memory. The section of secondary memory used for the purpose of swapping is called the swap space.
A crucial requirement for demand paging is the ability to restart an instruction after a page fault. Recall that when an instruction needs something that is not yet in memory (marked with an invalid bit in the page table), a page fault occurs and a trap to the OS is performed. We then find the demanded page and load it into memory. After the page has been loaded, we restart the instruction that was interrupted by the trap. If the page fault occurs during the instruction fetch, then we can simply restart by fetching the instruction again. Similarly, if the page fault occurs when fetching an operand, then the restart is done by fetching and decoding the instruction again, and then fetching the operands. Note that this restart must be done in exactly the same place and state, except this time the demanded page is now present in memory.
Consider the scenario where a three-address instruction is used. The ADD instruction adds the contents of A and B and stores the sum in C. Normally, the following are the steps to execute the instruction:
1. Fetch and decode the instruction ADD.
2. Fetch A.
3. Fetch B.
4. Add A and B.
5. Store the sum in C.
Suppose that when we try to store the sum in C, the page containing C is not currently in memory. This results in a page fault, and when it does, we simply restart the interrupted instruction: we fetch and decode ADD again, fetch the operands, and perform the addition. Then we can finally store the sum in C, since its page has now been brought into memory.
A major difficulty arises when one instruction may modify several different locations. Consider the IBM 360/370 MVC (move character) instruction, which can move up to 256 bytes of data from one location to another. If either the source or the destination block straddles a page boundary, a page fault might occur after the move is partially done; and if the source and destination blocks overlap, the source block may already have been modified, in which case simply restarting the instruction is not a valid option, since the value of the source block might have changed by the time the fault happens. However, there are two ways to approach this problem:
(a) One solution is for the microcode to compute and attempt to access both ends of both blocks before the move begins. If a fault is going to occur, it will happen during this attempt, and we can safely restart the instruction since none of the blocks have been modified yet. Once the attempt succeeds, we know that no page fault can occur during the move itself, since all relevant pages are now in memory. The initial attempt is sort of like a "sacrificial lamb" which goes first, and should anything happen, at least the important elements are not affected in any way.
(b) Another solution is to use temporary registers to hold the values of overwritten locations – kind of like a backup. That way, if a page fault occurs after locations have been modified, we can simply write the old values back into memory and essentially restore the memory to the state it was in before the instruction started. That way, we have the option of restarting.
The example above is not the only architectural problem we could possibly encounter when adding paging to an existing architecture to allow demand paging, but it illustrates the kind of problem that can happen. People assume that paging can be added to any system; however, this is not always the case, and consideration must be put into what a page fault entails in the current environment of the system.
When a page fault occurs, the operating system brings the demanded page from secondary memory into main memory. To resolve page faults, most operating systems maintain a free-frame list, which is basically a pool of free frames, implemented using a linked list as we see in Figure 10.6. When a demanded page is loaded from disk, a free frame is allocated from this pool, and the frame is then recorded in the page table.
When a frame is allocated, the operating system usually uses a strategy called zero-fill-on-demand. This means that when a frame is to be allocated, the frame's previous contents are zeroed out. When a system starts, all of the free memory is placed on the free-frame list (after all, no process has been allocated any space yet). However, as the system continues to run, free frames are requested and the list grows smaller and smaller with every request. If the list size, at some point, becomes zero or falls below a certain threshold, then it must be repopulated.
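As a rough sketch of the idea (my own illustration; the names, the threshold, and the refill policy are all made up rather than taken from any real kernel), the free-frame pool can be kept as a linked list, each frame zeroed as it is handed out, and the list repopulated once it drops below a low watermark.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FRAME_SIZE    4096
#define LOW_WATERMARK 2

struct frame { char data[FRAME_SIZE]; struct frame *next; };

static struct frame *free_list = NULL;
static int free_count = 0;

static void add_free_frame(void) {               /* grow the pool by one frame */
    struct frame *f = malloc(sizeof *f);
    if (!f) { perror("malloc"); exit(1); }
    f->next = free_list;
    free_list = f;
    free_count++;
}

static struct frame *allocate_frame(void) {
    if (free_count <= LOW_WATERMARK)             /* list too small: repopulate it */
        for (int i = 0; i < 4; i++)
            add_free_frame();
    struct frame *f = free_list;
    free_list = f->next;
    free_count--;
    memset(f->data, 0, FRAME_SIZE);              /* zero-fill-on-demand: wipe old contents */
    return f;
}

int main(void) {
    struct frame *f = allocate_frame();
    printf("allocated a zeroed frame; %d frames left in the pool\n", free_count);
    free(f);
    return 0;
}
```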
As we know, demand paging does not load the pages of a process unless they are needed in execution. Because of this, there is less loading latency at program startup and there is less overhead because fewer pages are read, which additionally reduces the context-switching time and the resources needed for it. Nevertheless, demand paging can significantly affect system performance. To see this, we calculate the effective access time for a demand-paged memory. The effective access time is simply the weighted average time it takes to access and retrieve a value from memory.
Assume that the memory-access time, denoted ma, is 10 nanoseconds. For as long as no page fault occurs, the effective access time is simply equal to the memory-access time. It changes, however, once page faults are encountered. If a page fault occurs, the relevant page is searched for in secondary storage and the desired word is accessed.
We assign p to be the probability of a page fault occurring, where 0 ≤ p ≤ 1. We would expect this value of p to be close to 0, which means the probability of a fault is minimal, and only a few page faults at most are to be expected. The formula for calculating the effective access time, then, is:
effective access time = (1 − p) × ma + p × page fault time
To compute the effective access time, we must be able to determine the amount of time it takes to service a page fault. This amount of time depends upon the sequence of actions that occur in the system. When a page fault occurs:
1. A trap to the operating system is made.
2. The state of the registers and the process is saved.
3. We determine that the interrupt issued was a page fault.
4. Next, we determine whether the page reference made was legal, and the location of the page in secondary storage is determined as well.
5. A read from the storage to a free frame is done:
(a) Wait in a queue until the read request is serviced.
(b) Wait for the device seek and/or latency time.
(c) Begin the transfer of the page to a free frame (this way, we can now load the desired
page into main memory)
6. While waiting, allocate the CPU core to some other process (this is assuming that the CPU
will be allocated to another process while the I/O execution occurs, so that CPU utilization
is not diminished).
7. Receive an interrupt from the storage I/O subsystem (this indicates that the I/O is complete).
8. Save the registers and process state of the other process (that is, if step 6 was executed, so that we can easily restore the state of that process after the interrupt).
9. Determine that the interrupt was from the secondary storage.
10. Correct the page table and other tables to show that the desired page is now in memory.
11. Wait for the CPU core to be allocated to this process again (this process being the one that generated the page fault because a needed page was not yet loaded into memory).
12. Restore the registers, process state, and new page table, and then resume the interrupted
instruction (this time, the needed page for a process to execute something is now loaded
into memory and we can complete the process execution until the process finishes or until
another page fault is generated).
Note that not all the aforementioned steps are necessarily performed. For instance, notice that step 8 is executed only if step 6 is executed. In addition, step 6 is only done to allocate the CPU to another process while the I/O occurs.
In any case, there are generally three major task components of the page-fault service time:
Service the page-fault interrupt
Read in the page
Restart the process.
The first and third tasks can be reduced by carefully breaking them down into a few simple instructions, with each task ranging from 1 to 100 microseconds. Consider what happens when we use HDDs as our paging device. The page-switch time will probably be about 8 milliseconds (that is, 3 milliseconds of average latency, 5 milliseconds of seek, and about 0.05 milliseconds of transfer, thus summing to around 8 milliseconds of page-switch time). Note that in this case, we are only considering the device-service time. If a queue of processes is waiting for the device, then we would also have to include the queueing time, since we are essentially waiting for the paging device to be free so that it can service our request.
With an average page-fault service time of 8 milliseconds and a memory access time of 200 nanoseconds, the effective access time in nanoseconds is calculated using the formula we showed earlier:
effective access time = (1 − p) × 200 + p × 8,000,000
                      = 200 + 7,999,800 × p
Note that the values with units of time are expressed in nanoseconds. Also, observe that the effective access time is directly proportional to the page-fault rate, which means that the more frequent the page faults, the longer the effective access time. Thus, generally, we would want to reduce the page-fault rate, and to do so we'd have to keep the probability of page faults below a certain level. If we want performance degradation of less than 10%, we need 220 > 200 + 7,999,800 × p, that is, p < 0.0000025.
In short, to keep the slowdown due to paging at a reasonable level, we can allow fewer than one memory access out of 399,990 to page-fault. Hence, in a demand-paging system, it is important to keep the page-fault rate low.
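Here's a small C sketch (my own, using the same 200 ns and 8 ms figures) that plugs a few page-fault rates into the formula and also solves for the largest p that keeps the slowdown under 10%.

```c
#include <stdio.h>

int main(void) {
    double ma = 200.0;                   /* memory-access time, in ns           */
    double fault_time = 8000000.0;       /* page-fault service time: 8 ms in ns */

    /* effective access time = (1 - p) * ma + p * fault_time */
    double rates[] = {0.0, 0.0005, 0.001};
    for (int i = 0; i < 3; i++) {
        double p = rates[i];
        double eat = (1.0 - p) * ma + p * fault_time;
        printf("p = %.4f  ->  effective access time = %.1f ns\n", p, eat);
    }

    /* For less than a 10% slowdown we need 220 > 200 + 7,999,800 * p. */
    double p_max = (1.1 * ma - ma) / (fault_time - ma);
    printf("p must stay below %.7f (about one fault per 400,000 accesses)\n", p_max);
    return 0;
}
```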
An additional aspect of demand paging is the handling and overall use of swap space. I/O to swap
space is generally faster than that to the file system. It is faster because swap space is allocated in
much larger blocks, and file lookups and indirect allocation methods are not used.
In order to gain better paging throughput, the system has two options:
(a) The first option is to copy the entire file image into swap space at process startup. This means that when the process is in the middle of execution and performs demand paging, it does so only from the swap space. Obviously, copying the entire file image at the start can be a disadvantage; however, at least the system wouldn't have to look far when demand paging.
(b) The second option, which is employed by most operating systems, is to demand-page from the file system initially, and then write the pages to swap space as they are replaced. In this strategy, faulting a demanded page in from the file system happens only once per page; once a page has been brought in and is later chosen for replacement, it is written out and subsequently resides in swap space.
Some systems attempt to limit the amount of swap space used through demand paging of binary executable files. Demanded pages for such files are brought directly from the file system. However, when page replacement is called for, the frames holding these pages can simply be overwritten, since the pages are never modified in the first place and can be read in from the file system again if needed. In this case, the file system serves as the backing store. However, swap space is still used for pages not associated with a file. These pages make up what is known as anonymous memory.
Essentially, anonymous memory refers to pages not backed by a file. Hence it is called “anonymous”
because it is not associated with any file and has no “label” to call it by. I’d like to view it as a person
walking around the fourth floor of the Bunzel building. When we see people walking there, they’re
usually people who belong to the Department of Computer, Information Sciences and Mathematics.
So if I see Sir Christian walking around the fourth floor, I am easily able to associate him with the
department, since I know that he is its chairman. But if I see an unfamiliar face walking around the
fourth floor halls (and nobody knows who he is as well), then I can’t exactly associate him with the
department, nor can anyone else. Hence, to me, he is someone anonymous or not known. These
anonymous pages are used for things like process stacks and heaps.
Previously, Section 9.5.3 explained that mobile systems typically do not support swapping. Instead, they rely on the next closest thing to swapping, which is demand paging from the file system. Later, in Section 10.7, we get to talk about another commonly used alternative to swapping in mobile systems called compressed memory.
10.3 Copy-on-Write
Section 10.2 described how demand paging is used to start a process. However, process creation using the fork() system call may initially bypass the need for demand paging through a technique similar to the page sharing described in Section 9.3.4. By doing this, the number of pages needed to start a process is minimized.
The fork() system call creates a child process as a duplicate of the parent process, which traditionally meant copying the parent's pages to the child. However, consider that many child processes invoke the exec() system call right after their creation. In that case, copying the parent's address space may be unnecessary, which is why we use a method called copy-on-write, which allows the child process to initially share pages with its parent. This way, no copying is done. The shared pages are marked as copy-on-write pages. If the child or the parent process modifies one of these shared pages, then a copy of that page is created. Remember that the changes made by either process would not have been needed by the other, and so a copy of the original copy-on-write page is made so that the process that does not modify the shared page may still refer to the original. From here, we can deduce that this is probably the reason why the technique is called copy-on-write – a copy of the page is made only when it is written to. Figures 10.7 and 10.8 show this.
Let’s suppose that a child process want to modify a page containing portions of the stack. In
addition, these pages are set to be copy-on-write pages. Initially, when having both child and parent
processes refer to shared pages, the figure above shows us how it would look like. As we now know,
making modification on copy-on-write pages results in the creation of a copy of the original page.
Then, the child process can modify its copied page. Furthermore, when using this technique, we note
that a copy is made only of the modified pages. Unmodified pages stay the way they are, and remain
share by both child and parent process.
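A quick way to see the consequence of copy-on-write from a program's point of view (a sketch of the observable semantics, not of the kernel mechanism itself): after fork(), the child's write triggers a private copy of the page, so the parent still sees the original value.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int value = 42;              /* lives in a page shared copy-on-write after fork() */

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0) {
        value = 99;              /* this write faults; the kernel copies the page for the child */
        printf("child  sees value = %d\n", value);    /* 99 */
        _exit(0);
    }

    wait(NULL);
    printf("parent sees value = %d\n", value);        /* still 42: the child's copy was private */
    return 0;
}
```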
Some versions of UNIX (like Linux, macOS, and BSD UNIX) provide a variation of the fork() system call, called vfork() (for virtual memory fork), that operates differently from fork() with copy-on-write. When the vfork() system call is invoked, a child process is created; in this case, however, instead of having pages shared by both parent and child, the parent process is suspended and the child process uses the parent's address space. Copy-on-write is not employed here, so if the child process makes any changes to any of the parent's pages, the parent will also see these changes once it resumes execution. Therefore, vfork() must be used with caution (since the child's changes can affect the parent process as well). Note that vfork() is intended for creating child processes that invoke the exec() system call immediately after creation.
10.4 Page Replacement
Earlier on, when talking about the page-fault rate, we assumed that each page (of a process) faults at most once, when it is first referenced. However, this can be inaccurate, seeing that a process may not actually use all of its pages. For example, a process might use only 5 of its 10 pages – that's half of the number of pages it has. So then, we could run twice as many processes (each of 10 pages, but only using 5 in actuality), even if we only had forty frames (that is, simply half of the expected 80 frames). We can see that, now, many processes are able to run even with only a few frames – increasing our degree of multiprogramming. However, if we do this, then we are over-allocating our memory – that is, we allow our processes to use more memory than what is physically available. Again, let's say we only had forty frames and we had 6 processes, each 10 pages in size, of which each only ever loads five pages (for a total of 30 pages, and therefore a requirement of 30 frames). Can we still allocate frames, since 30 is less than forty? Technically, yes. However, at some point, these processes might need all of their 10 pages. Recall that we only have 40 frames, and now we have 60 pages. Physically, our memory is not enough to accommodate this number of pages, hence we have over-allocated.
Note that it’s not just the processes that need the memory – even the I/O buffers take a
considerable amount of space in memory. How much space do we allocate then – to both processes
and buffers? Having to decide this puts a strain on the memory-placement algorithms, because then
we’d have to ensure that we allocated the right amount of space for these things, as well make sure
that no space has been wasted in allocating. Some systems allocate a fixed percentage of memory for
the I/O buffers (which means that there is assurance of available space), while some systems allow
the I/O buffers to complete with each other and all other processes for memory.
If in case it wasn’t clear, over-allocation is manifested when a process demands a page from
secondary memory only to have no frames to be allocated to it from the free-frame list. This is shown
in Figure 10.9. Note that we illustrate the concept of “no free frame” to be a question mark. At this
point, the operating system has options to choose from:
(a) The first option is just to terminate the process. However, paging is the system’s attempt to
improve CPU utilization. Terminating the process exposes the fact that the system uses
demand paging – and such should been logically transparent to the user,
(b) Another option could be just to do swapping, so that the process can be allocated at least
once in memory. However, as we’ve mentioned in Section 9.5, swapping is not really used
by most operating systems because of the overhead it creates when copying processes to
be swapped between memory and swap space.
Most operating systems now combine swapping with page replacement, which is, to give an overview, an algorithm that decides which memory pages to swap out of memory when some other page needs to be brought in. The following subsections talk about page replacement.
Basic Page Replacement
- Basic page replacement is as basic as it sounds. If there is no free frame for a particular page, then we find a frame that is not currently being used (that is, the page occupying that frame is not currently being used or executed) and free it. To free it, we write the contents of the frame to the backing store and update the page table (and all other tables) to indicate that the page is no longer in memory. The freed frame can now be used to hold the page for which the process faulted. Basically, we would have a page-fault service routine, but this time we include page replacement so that we are able to accommodate the demanded page in memory:
1. Find the location of the desired page on secondary storage.
2. Find a free frame:
a. If there is a free frame, then we can use it.
b. If there is no free frame, then we employ page replacement by first finding a victim frame (called that because it is the frame targeted to be swapped out of memory).
c. The victim frame is written to secondary storage (if necessary) and the page and frame tables are updated accordingly (to signify that the victim page is no longer in memory and cannot be used).
3. Read the desired page into the newly free frame. Again, we update the page
and frame tables.
4. Continue the process from where the page fault occurred.
- Notice that two page transfers are done here – one transfer to page out the victim and another to page in the desired page. Doubling the page transfers doubles the page-fault service time, effectively increasing the effective access time as well (which, for the most part, we don't want).
- We can reduce this overhead through the use of a modify bit or dirty bit (and no, not the Black Eyed Peas song, either). Here, a modify bit is associated with each page and is set by the hardware.
- The modify bit is set by the hardware whenever the page is modified. When we select a victim page, we examine its modify bit.
(a) If the modify bit is set, this means that the page has been modified (as we established earlier). In this case, the page must be written out to storage.
(b) Otherwise, the page has not been modified in any way. In this case, there is no need to write the page out to storage, since the copy already there is still up to date.
- Page replacement is basic to demand paging and is essentially what completes it, allowing us to treat the logical memory and the physical memory as two separate entities so that we, as programmers, can have a large virtual space even while having a smaller, limited physical space.
- Note that even with virtual memory, the pages of a process still have to be in physical memory when they are executed. However, with demand paging, the size of the logical address space is no longer constrained by physical memory.
- Implementing demand paging involves two major problems: one is to figure out how many frames to allocate per process, and the other is how to proceed when no free frame exists and which situations require that we perform a page replacement. We, therefore, must develop algorithms to solve these problems: a frame-allocation algorithm and a page-replacement algorithm.
- There are many different page-replacement algorithms which can help us decide which page is to be replaced when a new page comes in and we have limited or no available frames left. Each operating system has its own replacement scheme. Generally, when deciding which replacement algorithm to use, we want the one with the lowest page-fault rate.
- We evaluate an algorithm by running it on a particular string of memory references, called a reference string, and computing the number of page faults. The reference string can be generated in one of two ways: one is to use a random number generator, or we could trace a given system and record the address of each memory reference, although this results in a large amount of data. To reduce the amount of data, we use two facts.
o First, for a given page size (recall that the hardware is responsible for this one), we need only consider the page number, not the entire address.
o Second, if we have a reference to a page p, then any references to p that immediately follow will not cause a page fault (since we've already loaded the page into memory and it exists there – a page fault only occurs when the demanded page is not there).
o For example, consider the sequence of addresses below:
0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103, 0104, 0101, 0610, 0102, 0103, 0104, 0101, 0609, 0102, 0105
At 100 bytes per page, the page number is simply the address divided by 100, and the sequence can be reduced to the reference string below (a short sketch of this reduction appears after this list):
1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1
Notice that the way the sequence is reduced is that if successive entries belong to the same page, then we record that page number only once. Hence, the run 0102, 0103, 0104, 0101 is shown as just a single 1.
o To determine the possible number of page faults, we'd first have to determine the number of available frames in the system. So let's say we had 3 or more available frames. Recall that we only consider the page number p, and once p has been loaded, consecutive references to it won't cause any more page faults. So with three or more frames, we'd generally only have three faults, one for each time a page is first referenced – in the sequence above, we'd fault only for pages 1, 4, and 6. However, if we only had one available frame, then a page fault occurs for every element in the reference string – that is, 11 faults.
o Figure 10.11 shows how the number of page faults decreases to a minimal level as the number of frames increases. Note that to add frames to the system is to add physical memory.
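Here is the small sketch of the reduction I promised above (my own snippet; the leading zeros of the addresses are dropped because a leading 0 would mean octal in C): divide each address by the page size, then keep only the first of any run of references to the same page.

```c
#include <stdio.h>

int main(void) {
    /* The address trace from the example (leading zeros dropped). */
    int addrs[] = {100, 432, 101, 612, 102, 103, 104, 101, 611, 102, 103,
                   104, 101, 610, 102, 103, 104, 101, 609, 102, 105};
    int n = sizeof addrs / sizeof addrs[0];
    int page_size = 100;
    int last = -1;

    for (int i = 0; i < n; i++) {
        int page = addrs[i] / page_size;   /* page number = address / page size         */
        if (page != last) {                /* skip consecutive repeats of the same page */
            printf("%d ", page);
            last = page;
        }
    }
    printf("\n");                          /* prints: 1 4 1 6 1 6 1 6 1 6 1 */
    return 0;
}
```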
FIFO Page Replacement
- The simplest page-replacement algorithm is first-in, first-out (FIFO): when a page must be replaced, we choose the page that has been in memory the longest. However, though simple, FIFO is not easily declared the best solution. This is because when a page has remained in memory for a long time, it could mean one of two things: either the page has not been used for a while, or it contains important variables which are frequently used by the system. Notice that even when the page selected for replacement is active, everything still seems to work correctly. Since the page is active, it is being used and would be demanded again were it not there. The moment it gets replaced, a fault occurs almost immediately to retrieve it back, except that some other page gets replaced this time.
- In other words, a bad choice of page to replace increases the page-fault rate and slows down execution.
- The problem with the FIFO page-replacement algorithm can be demonstrated here. Consider the given reference string:
1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Figure 10.13 shows the curve of page faults for the given reference string versus the number of available frames. Notice that when the number of available frames is 3, the number of page faults is 9, yet when the number of frames is 4, the number of page faults becomes 10. This seems a bit odd, don't you think? I mean, just previously we were saying that the greater the number of available frames, the smaller the number of page faults.
- The weirdness described in the previous point is called Belady's anomaly – that is, for some page-replacement algorithms, increasing the number of frames can increase the page-fault rate as well, and that is just what we saw in the previous example. The short simulation below reproduces this.
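The anomaly is easy to reproduce with a short simulation (my own sketch): running FIFO over the reference string above really does give 9 faults with 3 frames and 10 faults with 4.

```c
#include <stdio.h>
#include <stdbool.h>

/* Count page faults for FIFO replacement on a reference string. */
static int fifo_faults(const int *refs, int n, int nframes) {
    int frames[16];
    int next = 0, used = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        bool hit = false;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = true; break; }
        if (!hit) {
            faults++;
            if (used < nframes) {
                frames[used++] = refs[i];                 /* fill a still-free frame */
            } else {
                frames[next] = refs[i];                   /* evict the oldest page   */
                next = (next + 1) % nframes;
            }
        }
    }
    return faults;
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = sizeof refs / sizeof refs[0];
    for (int f = 1; f <= 5; f++)
        printf("%d frames -> %d faults\n", f, fifo_faults(refs, n, f));
    /* 3 frames -> 9 faults, but 4 frames -> 10 faults: Belady's anomaly. */
    return 0;
}
```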
Optimal Page Replacement
- For this page-replacement algorithm, we are going to use the reference string and a
memory with three frames:
7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
- Because the Belady’s anomaly exists and was discovered, we’re also finding ways to
have an optimal page-replacement algorithm – “optimal” meaning the most
desirable or the best. In short, this algorithm that we are looking for has the lowest
page-fault rate among all other algorithms as well as is consistent in terms of our
initial assumption on the relationship between the number of frames and the page-
fault rate.
- Such algorithm exists and it is called the Optimal Page Replacement Algorithm
(also called OPT or MIN) which allows the system to replace pages that will not be
used to the longest period of time. In other words, if a page is loaded but not used
until later, then it can be replaced by some other page that needs to be in the
memory at that certain point in time.
- In the given example reference string, the first three references cause faults that fill the three empty frames. The reference to page 2 then replaces page 7, because 7 is not used again until reference 18. The reference to page 3 replaces page 1, since page 1 will be the last of the three pages in memory to be referenced again. All in all, using OPT results in only 9 page faults, as seen in Figure 10.14. That's the fewest you could get among all the different algorithms.
- This sounds like the best algorithm to use for page replacement. However, the disadvantage that comes with it is the difficulty of implementing it, as it requires knowing the future of the reference string in order to make the replacement decision. As a result, it is not implemented in practice but is mainly used for comparison studies.
LRU Page Replacement
- For this page-replacement algorithm, we are going to use the reference string and a
memory with three frames:
7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
- Recall that the optimal algorithm is not feasible to implement, so the next best thing would probably be an approximation of the optimal algorithm. The key distinction between FIFO and OPT (other than looking backward versus looking forward in time) is that FIFO uses the time when a page was brought into memory, whereas OPT uses the time when a page is next to be used.
- If we use the recent past as an approximation of the near future, then we can replace the page that has not been used for the longest period of time. In short, the system checks when each page was last used and replaces the one that has gone the longest without being used. This is called the least recently used (LRU) algorithm. Essentially, the algorithm uses recency of use as an approximation: if a page has not been used recently, then it probably won't be used for a while longer (something along those lines).
- Figure 10.15 shows what happens when LRU replacement is used on our example reference string. Note that the algorithm produces 12 faults.
- Again, we have three available frames, so the first three references cause faults as those first three pages are brought into memory. When the reference to page 2 arrives, we see that the least recently used page is 7, hence 2 replaces 7. There is then a reference to page 0, but page 0 is already in memory. And then we encounter a reference to page 3. This time, the page that has gone the longest without being used is page 1, hence page 3 replaces page 1, and so on and so forth (I think you catch the drift).
- The LRU replacement algorithm is considered to be good – not as good as the optimal algorithm, but not as bad as the FIFO page-replacement algorithm either. However, implementing this algorithm requires substantial assistance from the hardware.
- The problem lies in determining an order for the frames defined by the time of last use. This problem can be addressed by two implementations:
Counters
- It is what it sounds like. Each page-table entry has a time-of-use field, which helps us determine whether and when the page was last used.
- Whenever a reference to the page is made, the contents of the clock register are copied to the time-of-use field in the page-table entry for that page.
- The LRU page, therefore, is the page with the smallest time value. Finding it requires a search of the page table, and a write to memory (to the time-of-use field) is required for every memory access.
- In addition, the times for the pages must be maintained even when page tables are changed due to CPU scheduling (recall that when a new process is executed, the contents of the page table, or even the page table itself, are now completely different).
- Clock overflow must also be considered.
Stack
- Another implementation is to keep a stack of page numbers. Recall that the principle usually employed by a stack is LIFO (last in, first out).
- Here, whenever a page is referenced, it is removed from the stack and put on top. This way, the most recently referenced page is always at the top, and the page that hasn't been referenced in the longest time stays at the bottom, as shown in Figure 10.16. A small sketch of this idea follows below.
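Here's the small sketch of the stack idea I mentioned (my own; a real implementation would use a doubly linked list so the move-to-top costs no shifting): the front of the array is the top of the stack, a referenced page is pulled up to the top, and the bottom entry is always the LRU victim. On our example reference string it produces the expected 12 faults.

```c
#include <stdio.h>

#define NFRAMES 3

/* stack[0] is the most recently used page; stack[used-1] is the LRU page. */
static int stack[NFRAMES];
static int used = 0;
static int faults = 0;

static void reference(int page) {
    int pos = -1;
    for (int i = 0; i < used; i++)
        if (stack[i] == page) { pos = i; break; }

    if (pos < 0) {                        /* page fault                                 */
        faults++;
        if (used < NFRAMES) used++;       /* a free frame is still available            */
        pos = used - 1;                   /* otherwise the bottom (LRU) page is evicted */
    }
    for (int i = pos; i > 0; i--)         /* pull the referenced page up to the top     */
        stack[i] = stack[i - 1];
    stack[0] = page;
}

int main(void) {
    int refs[] = {7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1};
    for (size_t i = 0; i < sizeof refs / sizeof refs[0]; i++)
        reference(refs[i]);
    printf("LRU with %d frames: %d page faults\n", NFRAMES, faults);   /* 12 */
    return 0;
}
```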
- In the second-chance (clock) algorithm that approximates LRU using a reference bit, the worst-case scenario is when all reference bits are set and the pointer cycles through the whole queue, giving each page a second chance. In this case, the algorithm degenerates into the FIFO algorithm.
Enhanced Second-Chance Algorithm
- This algorithm is basically an enhancement of the second-chance algorithm.
- The previous algorithm is enhanced by considering the reference bit and the modify bit as an ordered pair. Because we have a pair of bits, there are four possible classes. [each (reference bit, modify bit) pair indicates that the page is:]
o (0, 0): neither recently used nor modified – the best page to replace
o (0, 1): not recently used but modified – not quite as good, because the page will have to be written out before replacement
o (1, 0): recently used but not modified – it will probably be used again soon
o (1, 1): recently used and modified – it will probably be used again soon, and the page will need to be written out to secondary storage (because of its changes) before it can be replaced
- Each page belongs to one of these four classes. Using the same clock scheme described previously, instead of just checking whether the reference bit is set to 1, we check the class of the page the pointer is at. The first page encountered in the lowest nonempty class is replaced. We may have to traverse the circular queue several times before a victim page is found.
- In this algorithm, the reason why we choose to replace the page that has
not been recently used nor modified is to reduce the number of I/Os
required (so we won’t have to write out the contents of that page before
we can actually replace it).
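As a compact sketch of the idea (my own, with invented names, and simplified in that it does not clear reference bits while sweeping the way a full implementation would): turn the (reference, modify) pair into a class number 0-3 and scan the circular queue for the first frame in the lowest nonempty class.

```c
#include <stdio.h>
#include <stdbool.h>

#define NFRAMES 4

struct frame { int page; bool referenced; bool modified; };

/* Map the (reference, modify) pair to a class: (0,0)=0, (0,1)=1, (1,0)=2, (1,1)=3. */
static int class_of(const struct frame *f) {
    return (f->referenced ? 2 : 0) + (f->modified ? 1 : 0);
}

/* Scan the circular queue, starting at the clock hand, for the first frame
 * belonging to the lowest nonempty class. */
static int pick_victim(const struct frame frames[], int hand) {
    for (int target = 0; target <= 3; target++)
        for (int i = 0; i < NFRAMES; i++) {
            int idx = (hand + i) % NFRAMES;
            if (class_of(&frames[idx]) == target)
                return idx;
        }
    return hand;   /* never reached: some class is always nonempty */
}

int main(void) {
    struct frame frames[NFRAMES] = {
        {10, true,  true },    /* class 3 */
        {11, true,  false},    /* class 2 */
        {12, false, true },    /* class 1: not recently used, but dirty */
        {13, true,  false},    /* class 2 */
    };
    int v = pick_victim(frames, 0);
    printf("victim is frame %d holding page %d (class %d)\n",
           v, frames[v].page, class_of(&frames[v]));
    return 0;
}
```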
Counting-Based Page Replacement
- Some algorithms keep a counter of basically how many times a page has been
referenced. From that, we can create two schemes.
o Least Frequently Used (LFU) Page-Replacement Algorithm
- This algorithm requires that the page to be replaced is the page that is
least frequently used.
- A counter records the number of times a particular page has been
referenced. If the count is high, then it is used frequently, otherwise it is
not.
- A loophole, a flaw, if you will, arises if the page is heavily used at the
initial stages of a process’s execution. At this time, the count rises.
However, the page is no longer used afterwards. But since the count is
probably higher than the rest of the pages, it can’t be replaced even
when it is no longer active.
o Most Frequently Used (MFU) Page-Replacement Algorithm
- This provides a stark contrast to the LFU algorithm.
- While it makes sense that the least frequently used page is the one to be
replaced, it also makes sense for the most frequently used page to be
replaced because the higher the count, it is most likely that this page has
had its fair share of memory, and so it must give way to other pages,
since they’ve probably used the memory long enough.
- This is based on the argument that if a page has the smallest count, it was probably just brought into memory and has yet to be referenced a few more times.
- This is kind of like an Ethics problem. An old woman and a young adult
are crossing the street and you’re driving at an insanely fast speed. Do
you swerve and kill the old pedestrian or keep straight and kill the
young adult. Most likely, the old lady is the one to be killed for three
reasons: (1) the lady has already lived a long and prosperous life and is
most likely closer to death, (2) Since the lady is old, more resources are
required to take care of her (all the medical bills, the adult diapers,
medicines, special needs, etc), and (3) the young adult still has so much
to do in life and can probably be the one to contribute most to society at
this point in time.
Page Buffering Algorithms
- As an addition to any of the previous replacement algorithms we discussed, a system can keep a pool of free frames. Normally, when a page fault occurs, the system finds a victim page, writes it out, and only then reads the demanded page into the freed frame and restarts the process. With a pool of free frames, however, the demanded page can be read into a free frame right away, before the victim is written out, so the process can restart as soon as possible. The victim page is written out later, and its frame is then returned to the pool of free frames.
- An expansion of the concept is to maintain a list of modified pages. Whenever a
paging device is idle, a modified page is selected and is written to secondary storage.
Its modify bit is then reset. This scheme increases the probability that a page will be
clean when it is selected for replacement and will not need to be written out.
- A modification of this algorithm is to keep a pool of free frames but to remember which page was in each frame. Recall that when a frame's old page is written out to secondary storage, nothing is actually done to erase its contents in the frame. So if that old page is needed again before its frame has been reused by some other page, it can be taken straight from the free-frame pool, and no I/O is needed. In this case, when a page fault occurs, the free-frame pool is searched first, and only if the page isn't found there do we fetch it from secondary storage and read it into one of the free frames.
However, virtual memory is not an absolute good, and there are instances where accessing data through the virtual memory's buffering performs worse than accessing the data with no buffering at all. An example of this is a database. Operating systems implement general-purpose algorithms so that they can cater to the needs of many different processes and the variety of resources in the system. A database, however, is specialized in data access and storage, so it understands its own memory and storage use better than the operating system does.
At this point, we’ve discussed about the concept of paging and how this, in some circumstances,
can be efficient especially since we only have a limited number of available frames in the system.
Now the question is how do we allocate frames to processes? If we only had a certain number of free
frames, how many frames must each process get?
Let’s say we have 128 frames and the operating system takes 35 of them. We then only have 93
free frames. When a process enters the system, a series of page faults can occur, assuming that we
employ a pure on-demand paging. The first 93 pages faults would get all the 93 free frames, and
when this happens, page replacement is then employed for the 94th page which comes in. Once the
process is done, all 93 frames are once again free.
There are also a number of ways we can do this as well. Remember that we had 35 frames taken
by the operating system? When the operating system is not using this space, then it can be used to
support user paging. Still, we can try and keep at least three frames free so that when a page fault
occurs, these frames are free to be used. Essentially, when a process enters, it is allocated frames for
its pages, and once it terminates, the frames are deallocated for other pages.
(1) Minimum Number of Frames
Okay, so we can implement the allocation of frames in a number of different ways. However, there are also requirements and constraints we need to consider in doing so. For example, we cannot allocate more frames than the total number of available frames (you can't give what you don't have). Also, we must consider that page faults occur when a page to be executed is not yet in memory; hence we have to ensure that enough frames are allocated to hold the different pages that a single instruction may reference. Recall that it was mentioned in passing that normally each page occupies a single frame; hence having n pages would mean needing n frames. While decreasing the number of frames allocated to a process does save space, it also increases page faults, since it becomes likely that not all the pages containing the instructions and operands are in memory. When this happens, performance decreases because the execution of the process slows down. Recall that every time a page fault occurs before an instruction is completed, the instruction must be restarted. Therefore, a minimum number of frames must be allocated.
The minimum number of frames is defined by, and depends upon, the architecture of the computer. For example, if an architecture's move instruction can reference more than one word (which, if you recall, is a fixed-size unit of data), then its operands may span more than one frame. Many 32- and 64-bit architectures allow data to move only from register to register and between registers and memory, with no direct memory-to-memory operations, which limits the number of frames a single instruction can require.
The minimum number of frames to be allocated is defined by the computer's architecture, while the maximum is defined by the number of available frames in the system; the decision of how many frames to allocate between the minimum and the maximum is still left up to us.
(2) Allocation Algorithms
Okay, so we know that we have to allocate frames to processes; otherwise, how are our processes even going to get executed if they are not in memory? But how many frames do we allocate, and how do we allocate them? The operating system resolves this with the use of allocation algorithms.
In the simplest of terms, if we really just wanted to allocate frames, regardless of whether we're properly managing the memory or not, the simplest thing to do would just be to divide the m frames among the n processes. This way, all processes would get the same share of frames. This is called equal allocation, because frames are allocated equally among processes.
However, not all processes need the same number of frames, and in reality their needs vary. Consider a situation where each frame is 1 KB in size, and we have two processes of different sizes running on the system – one of 10 KB and the other of 127 KB. In addition, we have 62 available frames. Using equal allocation, each process would be allocated 31 frames. However, the first process needs 10 frames at most, so the other 21 frames it owns are of no use to it. The 127-KB process would have benefited more had those 21 frames been given to it instead. Hence, a varied allocation can also be done, and this is called proportional allocation, because we allocate frames in proportion to the size of each process.
Let the size of the virtual memory for process Pi be si, and define
S = ∑ si
Then, if the total number of available frames is m, we allocate ai frames to process Pi, where ai is approximately
ai = (si / S) × m
Note that each ai must be adjusted to be at least the minimum number of frames needed by the process, and the sum of all ai must not exceed m (the total number of available frames). So, with the previously given scenario, we can use the formula above to see just how many frames are allocated to each process:
S = ∑ si = 10 + 127 = 137
a1 = (10 / 137) × 62 ≈ 4
a2 = (127 / 137) × 62 ≈ 57
Hence, the 10-KB process is allocated only 4 frames and the 127-KB process 57 frames. That is a stark difference from the 31 frames allocated to each process with equal allocation! (A small sketch reproducing this computation appears at the end of this item.)
Rather than just focusing on equality, we focus on equity: which process needs more? And quite frankly, I believe this should be applied in real life as well. Yeah, equality is nice and it is the most needed thing in this world right now, but have you ever been in the lowest position in your life and yet you can't really rise above it because people are just treating you the way they would people with comfortable lives? But I digress.
Whether equal or proportional, the allocation can still vary with the degree of multiprogramming. If your system allows a high level of multiprogramming, then most likely the processes we mentioned earlier would be allocated far fewer frames in order to accommodate other processes.
Notice as well that higher-priority processes are treated the same way as lower-priority ones (this is equality), regardless of whether you use proportional or equal allocation, even though it would be ideal to give higher-priority processes more memory to speed up their execution. One variation of proportional allocation, which addresses this concern, is to allocate frames not according to the size of the process but according to its priority (or a combination of both).
(3) Global versus Local Allocation
If you've programmed before, then you've probably heard of global and local scope, and depending on whether data is global or local, other functions may or may not be able to access it. A similar idea shows up in frame allocation: another important factor here is page replacement. Page-replacement algorithms can be classified into two categories, and these categories define the scope from which frames are taken when replacement occurs in the system:
Global Replacement
- The name of the category speaks for itself. When a page-replacement algorithm is classified as a global-replacement algorithm, a process can choose a replacement frame from the set of all frames, even those not allocated to it (in other words, it can take frames allocated to other processes).
- If, for example, this algorithm is employed, then a higher-priority process is
allowed to select frames from lower-priority processes. Recall how we said
that “ideally, higher-priority processes should be allocated more frames than
low-priority ones”. Because higher-priority processes can take frames which
are not theirs, then we can certainly improve the execution performance of
this process at the expense of the process it is taking away frames from.
- A disadvantage to this is that the set of pages in memory for a process now
depends not only on its paging behavior but also on the paging behavior of
other processes as well. This means that performance and execution time of a
single process may vary because other processes’ behaviors also vary.
Local Replacement
- On the other hand, local replacement is the stark opposite of global replacement: here, the algorithm requires that each process choose a victim only from its own set of allocated frames when replacement occurs. In other words, since the process already has frames allocated to it, why not just use those instead of taking from others?
- Unlike global replacement, with local replacement the set of pages in memory for a process is affected only by the paging behavior of that process itself, and not by the behavior of other processes.
- However, as simple as it might be to just let processes use what has been given to them, this becomes disadvantageous because a process's unused memory is not made available to other processes. Processes that need more memory beyond their own allocation can't get any, while processes whose memory consumption is far below the number of frames allocated to them effectively waste those frames.
- Thus, although local replacement sounds sensible, global replacement is more commonly used since it generally yields greater system throughput.
Did you know that page faults also have categories? Yeah, they do: they can be classified as major or minor faults (Windows refers to them as hard and soft faults, respectively). Recall that a fault occurs because the CPU needs instructions or data on a page that has no valid mapping in the process's page table. A major page fault occurs when the faulting page is neither mapped in the process's page table nor present in memory, which means we have to find it in secondary storage, read it into memory, and restart the instruction. A minor page fault, on the other hand, occurs when the page the CPU is looking for is not mapped in the process's page table but is actually already in memory. This can happen in two ways.
(1) The page is simply not in the process's page table but is already in memory (for example, a shared library that another process has loaded) – in which case, all the operating system has to do is update the page table so that it now references the desired page.
(2) Or it could be this: remember when I said that when a page is reclaimed from a process and its frame is placed on the free-frame list, but that frame has not yet been zeroed out or handed to another process, it still holds the contents of the old page? No? Okay then. Because if this is the case, then the page is still technically in memory. All we have to do when a fault occurs is reclaim that frame from the free-frame list and reassign it to the process.
If you haven't noticed, minor page faults typically consume much less time than major faults.
In Linux, you can actually see the number of major and minor page faults incurred by processes with the command ps -eo min_flt,maj_flt,cmd.
Observe that normally there are more minor page faults than major ones. This is because major page faults typically occur when a process first starts executing (its instructions and data are not yet loaded into memory – this is especially true with demand paging), whereas minor faults usually occur along the way while the process is executing, for example when Linux can take advantage of a shared library that has already been loaded into memory by another process.
Now that I’ve probably bored you with all of this stuff, let’s continue, shall we?
- So now we know that global replacement is generally preferred over local replacement. The question is, how do we do it? With global replacement, memory requests can be satisfied from the free-frame list maintained by the kernel (the system-wide list of free frames). However, instead of allowing the number of free frames to reach zero before we start replacing, we trigger replacement when the number of free frames falls below a certain threshold. That way, we can ensure that there is always an available frame should a process need it. Figure 10.18 illustrates this strategy.
- So when the number of free frames in the system falls below the minimum threshold, a kernel routine kicks in and starts scavenging free memory from all the processes in the system (excluding the kernel itself), just so there are always sufficient free frames available. These routines are called reapers because, well, they reap frames from processes – I mean, I'm sure you know what a grim reaper does, right?
- When the number of free frames reaches the maximum threshold, the reaping stops – you wouldn't want to rob the other processes of all their frames, would you?
- Note that the kernel reaper routine may adopt any page-replacement
algorithm, but it typically uses some form of LRU approximation.
- However, in some circumstances – namely, when the reaper can't keep the number of free frames above the minimum threshold – the reaper routine gets more aggressive. For example, it may set aside the second-chance algorithm altogether and just use pure FIFO. In extreme cases, Linux employs a routine known as the Out-of-Memory (OOM) killer, which terminates a process outright just so its frames can be reclaimed (kind of brutal, if I'm being honest).
- Each process has a score, an OOM score to be exact, and this score is based on the percentage of memory the process is currently using. The higher the score, the more likely the process is to get killed by the killer routine, since terminating it harvests the most frames (this is the one test I would not want a high score on, if I were a process).
- In general, the maximum and minimum thresholds vary from system to system, and the reaper routines act more aggressively in dire circumstances. (A small toy simulation of this threshold-driven reaping appears right after this list.)
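Here is a toy, user-space caricature of that threshold idea (the thresholds, the starting free-frame count, and reclaim_one_frame() are all made up; this is not the actual kernel reaper):

    #include <stdio.h>

    #define MIN_THRESHOLD  8
    #define MAX_THRESHOLD 32

    static int free_frames = 6;            /* pretend the system is running low */

    /* stand-in for choosing a victim page (e.g., via LRU approximation)
       and moving its frame onto the free-frame list */
    static void reclaim_one_frame(void) { free_frames++; }

    int main(void) {
        if (free_frames < MIN_THRESHOLD) {
            printf("reaper triggered at %d free frames\n", free_frames);
            while (free_frames < MAX_THRESHOLD)
                reclaim_one_frame();       /* reap until the maximum threshold */
            printf("reaper stopped at %d free frames\n", free_frames);
        }
        return 0;
    }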
(4) Non-Uniform Memory Access
So far, we've talked about page-replacement algorithms and how memory is accessed for the data and instructions needed to run a process on the CPU, and for the most part we've assumed that all of memory is equally fast to access. However, there are systems where memory access is not equal, because of the way the CPUs and memory are interconnected. Such a system is called a non-uniform memory access (NUMA) system – what it does is literally stated word-for-word in the name.
Generally, in a NUMA system each CPU has a local memory of its own (as shown in Figure 10.19), and for obvious reasons, accessing one's own local memory is much faster than accessing another CPU's memory. Think of it this way: I could write my notes down much more quickly if I just used the pen in my pencil case rather than reaching out to someone else in hopes of borrowing an extra ball-point pen (and in most cases, you wouldn't even get the pen anyway).
In terms of memory latency, NUMA systems are, without a doubt, slower than systems with uniform memory access. But because NUMA systems have multiple CPUs and memories, they can achieve higher levels of throughput and parallelism.
Managing which page frames are stored at which locations can significantly affect performance in NUMA systems. The goal is to have the memory "as close as possible" to the CPU, which for the most part just means being physically close to it (basically on the same board), so that memory access is as fast as possible. So, when a process incurs a page fault, a NUMA-aware virtual memory system allocates to that process a frame as close as possible to the CPU on which the process is scheduled (and on which it will most likely continue to run). In addition, the scheduler on NUMA systems tracks the CPU on which each process last ran.
Now, mind you, these are all processes. We haven’t even talked about threads yet.
Imagine having multiple processes with each process containing multiple threads, with
different threads running on different CPUs. What then? How does this work?
As stated in Section 5.7.1, Linux does have a way of managing this situation, and that is to have the kernel identify a hierarchy of scheduling domains. In addition, threads are not migrated from one domain to another, and a separate free-frame list exists for each NUMA node (yay! We can now be sure that a thread will be allocated memory from its own node, #NoThreadLeftBehind). Solaris, on the other hand, handles the complication with lgroups (which is really just short for locality groups). A bunch of CPUs and memory are grouped together, and the CPUs within that group can access any memory in the group with low latency. Like Linux, Solaris also has a hierarchy among the groups. Essentially, as much as possible, what is in the group stays in the group, so threads are scheduled on CPUs and allocated memory within that group. If that is not possible, a nearby lgroup is picked for the rest of the resources needed. Hence memory latency is reduced and CPU cache hit rates are maximized (since what we're looking for is most likely within the group). And that is where I stop talking about NUMA systems. Good night, world.
10.6 Thrashing
We begin this section by asking a question: what happens if we don't have enough frames (the minimum number of frames) to support the pages in the working set? What happens when a particular set of pages is not available in memory to be immediately used by the CPU? The answer is simple – a page fault. Now recall that when we page-fault with no free frames left, the system has to find a victim page to be replaced by the demanded page. However, if all of the process's pages are actively being used, then replacing one of them just results in another page fault, and then another. So instead of executing, all we get is a series of page faults. When a process spends more time paging than executing, it is said to be thrashing. Thrashing literally means "to move in a violent and convulsive way". Have you seen someone drowning? The thing they do with their hands to stay above the water is called thrashing, and the person does it because they desperately need something – primarily to stay alive. So yeah, that's a bit of background on why the behavior is named that way. In this section, we're going to talk more about thrashing, specifically the points listed as follows:
(1) Cause of Thrashing
The following scenario depicts the behavior of early paging systems. We know that the operating system monitors CPU utilization. If CPU utilization is low, then not a lot is going on, and to maximize this resource the operating system introduces a new process so that the CPU is kept busy. Also, since a global page-replacement algorithm is used, a process can accumulate frames taken from other processes with no regard for them whatsoever.
So we have a process running, and this process just keeps taking frames, alright? The operating system sees that CPU utilization is low, so it adds another process. However, this newly introduced process needs more frames than are available, so it starts faulting and taking frames away from the other process. Conversely, that other process also starts faulting and taking frames back from the new one. The paging device is used more and more to move pages in and out of memory, and as these processes wait on the paging device, the operating system notices the steady decline in CPU activity. So what does it do? It adds yet another process to the mix. Instead of alleviating the situation, we now have a big squabble: more processes keep getting introduced, and all that ever gets done is faulting. We are basically experiencing thrashing.
At this point, the more processes there are, the more evident the thrashing, and since we're now basically revolving around page faults, the page-fault rate increases, thereby increasing the effective memory-access time. Overall, system performance is at an all-time low. Figure 10.20 shows how CPU utilization increases with the degree of multiprogramming – up to the point where thrashing sets in, at which point CPU utilization drops sharply. At this point, the proper thing for the operating system to do is to decrease the degree of multiprogramming, not increase it.
We can limit the effects of this by using a local page-replacement algorithm instead, so that rather than taking frames from other processes, a process can only replace pages within its own allocation. This concept was introduced in the previous section of this chapter. No stealing among processes occurs, and thrashing is contained within the individual process. Sounds simple, right? Well, yes, but also no, because the problem isn't entirely solved. If a process is thrashing (it still does that, minus the stealing-frames-from-other-processes part), it spends its time waiting in the queue for the paging device. The average service time for a page fault therefore increases, once again increasing the effective memory-access time for everyone.
So, to prevent thrashing, what we really want is to give each process as many frames as it "needs". But how do we know how many frames a process needs? One strategy is to look at how many frames the process is actually using, and this approach is based on the locality model of process execution. The locality model states that, as a process executes, it moves from locality to locality, where a locality is the set of pages that are actively used together. A running program typically consists of several different localities. For example, when it calls a function, a new locality is defined: memory references are made to the instructions of the function, its local variables, and a subset of the global variables. When the function returns, the process leaves this locality. It's like opening up a new game on Among Us, and when you're done with it, you just leave the lobby, assuming that you're the process in this scenario.
A process's localities change over time, and Figure 10.21 illustrates this. Notice that at time (a), marked on the x-axis, the locality consists of the pages {18, 20, 21, 22, 24, 29, 30, 33} (look at the page numbers on the left and see that these are the pages being accessed by the process at that point in time, and the density shows how often they are accessed). At time (b), the locality becomes {18, 19, 20, 24, 25, 26, 27, 28, 29, 31, 32, 33}. Note that localities can overlap, as we see with pages 18, 19, and 20, which are part of both localities.
Localities are defined by the program's structure and its data structures. Furthermore, the locality model states that all programs exhibit this basic memory-reference structure – it is common to all programs, which is probably why the locality model gets to be called a model in the first place. It is also the unstated principle behind the concept of caching as we've described it so far.
If we allocate enough frames to hold a process's current locality, then the process will fault for the pages in that locality until they are all in memory, after which it will not fault again until it changes localities. However, if we allocate fewer frames than the locality requires, the process will thrash, since it cannot keep all the pages it is actively using in memory while also faulting for the ones it still needs.
(2) Working-Set Model
The working-set model is based on the assumption of locality. This model uses a parameter, Δ, to define the working-set window. Do you see the similarity? In the locality model, a certain scope is called a locality, and within that scope is a set of pages that are currently in use. The working-set window is analogous: the set of pages in the most recent Δ page references is the working set (we will see this along with an example in Figure 10.22). The working set, therefore, is an approximation of the program's current locality.
Another side-track time! An interesting thing is that there is a relationship between the working set of a process and its page-fault rate (you know, how often it encounters page faults? Yeah – that's the one). As we see in Figure 10.22, the working set changes over time, depending on the page references made within different localities. Recall that with each locality, different data references are made and therefore different pages are used, which means a different working set. Notice that the page-fault rate is highest at the beginning of each locality. Assuming we have sufficient frames to accommodate the working set, the page faults at that point occur simply because we are demand-paging a new locality into memory. Once all the necessary pages are in memory, the page-fault rate drops significantly. I am done rambling and now you can continue with your life.
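To make the definition concrete, here is a small offline C sketch that computes the working set at the end of a made-up reference string for a made-up Δ (a real system obviously cannot replay the reference string like this, which is exactly the problem discussed next):

    #include <stdio.h>

    #define MAX_PAGE 64                      /* page numbers stay small here */

    int main(void) {
        int refs[] = {1, 2, 5, 1, 2, 3, 4, 5, 6, 6, 6, 1};   /* reference string */
        int n      = sizeof(refs) / sizeof(refs[0]);
        int delta  = 5;                      /* working-set window Δ */
        int in_set[MAX_PAGE] = {0};
        int size   = 0;

        /* the working set = distinct pages among the last Δ references */
        for (int i = (n > delta ? n - delta : 0); i < n; i++)
            if (!in_set[refs[i]]) { in_set[refs[i]] = 1; size++; }

        printf("working set of the last %d references:", delta);
        for (int p = 0; p < MAX_PAGE; p++)
            if (in_set[p]) printf(" %d", p);
        printf("  (%d frames needed)\n", size);
        return 0;
    }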
The thing with the working-set model, however, is that it is difficult to keep track of the working set. Recall that the working-set window is a sliding window: adding a new reference at one end means dropping the oldest reference at the other, and a page is in the working set if it is referenced anywhere within the window. We can, however, approximate the working set with a fixed-interval timer interrupt and a reference bit. Say our Δ is 10,000 references and we have a timer interrupt every 5,000 references. When the interrupt occurs, we copy each page's reference bit into a history bit and then clear the reference bit. Then, when a page fault occurs, we can examine the current reference bit together with the two in-memory history bits to determine whether the page was used at all within the last 10,000 to 15,000 references. Note that this strategy is not entirely accurate (it is an approximation, after all), since we only know that a page was referenced somewhere within an interval, not exactly when. We can increase the number of history bits and the frequency of interrupts to sharpen the estimate, but doing so is costly (everything comes with a cost – you want a more refined way of tracking references, you've got to pay for it).
(3) Page-Fault Frequency
We know that the working-set model works – I mean, it does solve the problem of thrashing while still allowing multiprogramming. However, it seems like a pretty clumsy way of controlling thrashing. Here we introduce another strategy, one based on the page-fault frequency (PFF).
We know that when thrashing occurs, the page-fault rate is sky-high. So now, instead of reasoning about working sets, we focus directly on the frequency of page faults. If a process's page-fault rate is too high, it has too few frames to hold its active pages; if the rate is too low, the process may have more frames than it needs. We can therefore establish upper and lower bounds on the desired page-fault rate. If a process's fault rate exceeds the upper bound, we give it more frames; conversely, if the rate falls below the lower bound, we can take frames away from it (those frames can then be used by other processes that wish to run). This strategy is depicted in Figure 10.23.
As with the working-set strategy, if the page-fault rate is high and no free frames are available, we may have to swap a process out to the backing store. Its freed frames are then distributed to the processes with high page-fault rates, so as to bring those rates down.
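A rough sketch of the PFF idea in C, with invented bounds and a one-frame-at-a-time adjustment (a real system would tune these numbers and measure the fault rate over an interval):

    #include <stdio.h>

    #define UPPER_BOUND 0.10   /* acceptable fault-rate band (invented numbers) */
    #define LOWER_BOUND 0.02

    /* decide how a process's frame allocation should change, given its
       measured page-fault rate over some recent interval */
    int adjust_frames(int frames, double fault_rate) {
        if (fault_rate > UPPER_BOUND) return frames + 1;   /* too many faults */
        if (fault_rate < LOWER_BOUND) return frames - 1;   /* frames to spare */
        return frames;                                     /* within the band */
    }

    int main(void) {
        printf("%d\n", adjust_frames(20, 0.15));   /* prints 21 */
        printf("%d\n", adjust_frames(20, 0.01));   /* prints 19 */
        return 0;
    }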
(4) Current Practice
So, thrashing is a problem, and the swapping strategies that counter it have a real impact on system performance – there is a tradeoff between controlling thrashing and overall performance. Which is why the current practice is simply to have enough physical memory to accommodate the processes running on the system without ever reaching the point of thrashing. Most computing devices take this approach, and it has worked well so far.
10.7 Memory Compression
So, we've talked a whole lot about paging. And paging is cool, you know? It has helped us manage memory, especially when many processes have to run. However, an alternative to paging pages out exists, and it is called memory compression, where the contents of several frames are compressed into a single frame so that we reduce memory usage without having to do any swapping.
Notice that in Figure 10.24 we have six free frames in the free-frame list. Assume that this number falls below the threshold that triggers page replacement. The page-replacement algorithm (say, an LRU approximation) selects four frames – 15, 3, 35, and 26 – to add to the free-frame list, but it first places them on the modified frame list, from which they would normally be written out to swap space before becoming available on the free-frame list.
However, as we said, we can alternatively use memory compression and compress, say, three of those frames, storing their compressed contents in a single page frame. Notice that in Figure 10.25, frame 7 is removed from the free-frame list, and the contents of frames 15, 3, and 35 on the modified frame list are compressed and stored in frame 7. Frame 7 now holds three compressed pages and is placed on the compressed-frame list, while frames 15, 3, and 35 can be moved to the free-frame list (basically, the former contents of these frames – the pages that lived in 15, 3, and 35 – are now compressed into frame 7, which means the frames that used to hold them are essentially free, hence placed in the free-frame list). Should one of the compressed pages be referenced later, a page fault is issued. During this time, the compressed frame is decompressed and pages 15, 3, and 35 are restored in memory.
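As a rough user-space illustration of the idea (not how any kernel actually implements it), the following C snippet packs three 4-KB "pages" into one buffer, compresses them with zlib, and checks whether the result fits in a single 4-KB frame; the highly repetitive page contents are artificial so that the data compresses well. Build with something like cc demo.c -lz.

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    #define PAGE_SIZE 4096

    int main(void) {
        unsigned char pages[3][PAGE_SIZE];    /* stand-ins for pages 15, 3, 35 */
        unsigned char out[4 * PAGE_SIZE];     /* generously sized output buffer */
        uLongf out_len = sizeof(out);

        memset(pages[0], 'A', PAGE_SIZE);
        memset(pages[1], 'B', PAGE_SIZE);
        memset(pages[2], 'C', PAGE_SIZE);

        /* compress the three pages (12,288 bytes) into one buffer */
        if (compress(out, &out_len, (const Bytef *)pages, sizeof(pages)) == Z_OK)
            printf("12288 bytes -> %lu bytes; fits in one 4-KB frame: %s\n",
                   (unsigned long)out_len,
                   out_len <= PAGE_SIZE ? "yes" : "no");
        return 0;
    }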
So, remember that we established that mobile operating systems do not support swapping? What do they do, then, to manage their memory? It turns out that compression is a mobile operating system's main strategy for managing memory (and yes, this includes both Android and iOS). Windows 10 also supports memory compression as part of the Universal Windows Platform (UWP) architecture, which provides a common app platform for every device running Windows 10. On the desktop side, Version 10.9 of macOS was the first version to support memory compression.
Memory compression does use up one of the free frames to hold the compressed pages, but in the end there is still a significant saving in memory. In our scenario a while ago, three pages were compressed into one frame – that's a saving of roughly 66.66% of the memory that would otherwise have been used. However, there is a tension between the speed of the compression algorithm and the amount of reduction it can achieve (known as the compression ratio). Basically, do we want a faster algorithm with a lower compression ratio, or a slower algorithm with a higher one? Those are just the extremes, and most modern operating systems try to strike a balance (a reasonably high compression ratio while still compressing at a relatively high speed). Compression algorithms have also started taking advantage of the multiple computing cores in modern systems, improving throughput by performing compression in parallel.
10.8 Allocating Kernel Memory
In the system, we have a list of available frames from which to choose when we allocate a frame to a requesting process in user mode. When a user process requests memory, there is always a possibility of internal fragmentation, because the system allocates an entire page frame (there is no way to further break up the frame to match the request exactly). Note how I said that user processes are given frames from this free-frame list (the same one we've been talking about for this entire chapter up until now). Kernel memory, however, is often allocated from a free-memory pool different from the list that serves user processes, and this is done for two reasons:
1. The kernel requests memory for data structures of varying sizes, and these are often smaller than an entire page (and we do not want internal fragmentation). Because of this, the kernel must use memory conservatively and minimize the waste caused by fragmentation.
2. The second reason is that pages allocated to user-mode processes need not be in some
contiguous location in physical memory. However, some hardware devices interact directly
with physical memory, which means that sometimes, we may require that memory be
residing in physically contiguous pages.
In this section, we are going to be talking about two strategies used in memory management for
kernel processes, which are the following:
(1) Buddy System
- The buddy system is a memory-allocation technique that allocates memory from a fixed-size segment made up of physically contiguous pages.
- With this technique, memory is allocated using a power-of-2 allocator, which means requests are satisfied in units that are powers of two (that is, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, and so on).
- So if, for example, a request for 21 KB comes in, the system satisfies it with a segment whose size is the next power of 2, which in this case is 32 KB.
- Essentially, this technique keeps dividing memory into halves until it reaches the smallest power-of-2 segment that can still satisfy the request (we want to allocate a segment whose size is as close as possible to the size requested). The two halves produced by each split are called buddies (ding, ding, ding, we have a winner, chicken dinner).
- So let's say we initially have a memory segment of 256 KB, and the kernel requests 21 KB. The 256-KB segment is divided into two 128-KB buddies. Can a 128-KB buddy hold the 21-KB request? Yes, it can. Is it the smallest power-of-2 segment that fits 21 KB, though? No, it's not, so we subdivide further – and we keep doing this until we reach the segment that fits best. At that point we'd have a 32-KB segment, the smallest one that can still hold the 21-KB request. We can see this in Figure 10.26.
- One advantage of the buddy system is that the buddies we split apart earlier can be combined back into one larger segment through a technique known as coalescing. For example, in Figure 10.26, when the kernel releases the CL unit it was allocated, CL can coalesce with CR to re-form BL; BL, in turn, can coalesce with BR to re-form AL – I think you get my drift here.
- But I think it is pretty obvious that even though we try to make the allocated segment fit the requested size, fragmentation still occurs. In our example, when the kernel requested 21 KB and 32 KB was allocated, yes, we fulfilled the request, but we are left with 11 KB worth of unused space inside that allocation. (A tiny sketch of this splitting walk appears right after this list.) Next, we'll discuss a technique that does not waste space to fragmentation.
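Here is the splitting walk from the 256-KB example as a small C sketch (it only shows how the request ends up on a 32-KB buddy; a real buddy allocator would also record the unused halves and coalesce them later):

    #include <stdio.h>

    int main(void) {
        int segment = 256;   /* KB: the initial segment from the example */
        int request = 21;    /* KB: the kernel's request */

        /* keep halving while the smaller buddy can still hold the request */
        while (segment / 2 >= request) {
            printf("split %d KB into two %d-KB buddies\n", segment, segment / 2);
            segment /= 2;
        }
        printf("allocate the %d-KB buddy (%d KB left over inside it)\n",
               segment, segment - request);
        return 0;
    }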
(2) Slab Allocation
- Another strategy for kernel memory allocation is called slab allocation. You know what a slab is? A slab is literally a thick, flat piece of stone, concrete, or wood – basically a block of cement or something of the sort. In our case, a slab is just one or more physically contiguous pages.
- A cache here (not the hardware CPU cache, but a kernel cache of objects) is a collection of one or more slabs. Each kernel data structure has its own separate cache – one for data structures representing process descriptors, one for file objects, one for semaphores, and so on. The items that occupy a cache are called objects, so if we have a cache for semaphores, that cache will contain semaphore objects (100%). The relationship among slabs, caches, and objects is shown in Figure 10.27.
- The slab-allocation algorithm uses these caches to store kernel objects. When a cache is created, its objects are all initially marked as free. The number of objects in the cache depends on the size of the associated slab; for example, a 12-KB slab (made up of three contiguous 4-KB pages) could hold six 2-KB objects. When a new object for a kernel data structure is needed, the allocator can assign any free object from the cache to satisfy the request, and the assigned object is then marked as used.
- As an example, consider a scenario where the kernel requests memory from the slab allocator for an object representing a process descriptor. In Linux, a process descriptor is of type struct task_struct, which requires approximately 1.7 KB of memory. So when the Linux kernel creates a new task (that's the term for a process in Linux, in case you forgot), it requests the necessary memory for the struct task_struct object from its cache, and the request is satisfied with a free object from one of the cache's slabs.
- In Linux, a slab may have one of three states:
Full – When a slab is full, it means that all objects in that slab are in use
Empty – Otherwise, when a slab is empty, all objects are still free
Partial – And when the slab is partial, then some objects are in use while some
are free.
- So when a request is made, it is first satisfied with a free object from a partial slab. I like to think of it this way: I have two bags of chocolate – one is opened (and I'm currently eating from it), and the other is not (I'd probably just eat it when I get home). When someone asks for a piece of chocolate, the most obvious thing to do is to take one from the opened bag and give it to her – nobody opens a brand-new bag just to hand over a single piece while they already have one open (that would be unusual). When no free object exists in any partial slab, we take one from an empty slab (we open the new bag of chocolate). And when there is no empty slab either, we create a new slab (we go and buy a new bag).
- There are two advantages to using the slab allocator:
As we stated right at the end of the buddy-system discussion, one reason to use slab allocation is that no memory is wasted due to fragmentation. This is because each slab is divided into chunks exactly the size of the objects it holds; every request is handed a chunk of exactly the right size, so there is simply no leftover space to fragment.
Another advantage is that memory requests can be satisfied quickly. Allocating and de-allocating memory from scratch can be time-consuming, but here the objects are created in advance and merely marked as free or used. When a request comes in, an object is handed out from the cache, and when it is de-allocated, we just toggle its mark back to free. See how fast transactions become?
- The slab allocator first appeared in the Solaris 2.4 kernel, where it is even used to serve some user-mode memory requests. In Linux, the slab allocator first appeared in Version 2.2; prior to that, the Linux kernel used the buddy system. (A toy sketch of the free/used object bookkeeping follows below.)
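And here is the promised toy sketch of the free/used bookkeeping, in C (the 2-KB object size, the six-object slab, and the function names are all made up for illustration; the real slab allocator is far more involved):

    #include <stdio.h>

    #define OBJ_SIZE  2048                       /* pretend: 2-KB objects */
    #define OBJ_COUNT 6                          /* a 12-KB slab holds six of them */

    static char slab[OBJ_COUNT * OBJ_SIZE];      /* one contiguous "slab" */
    static int  used[OBJ_COUNT];                 /* 0 = free, 1 = used */

    void *slab_alloc(void) {
        for (int i = 0; i < OBJ_COUNT; i++)
            if (!used[i]) {
                used[i] = 1;                     /* just flip the mark ... */
                return slab + i * OBJ_SIZE;      /* ... and hand out the object */
            }
        return NULL;                             /* slab is full */
    }

    void slab_free(void *obj) {
        used[((char *)obj - slab) / OBJ_SIZE] = 0;   /* toggle the mark back */
    }

    int main(void) {
        void *a = slab_alloc();
        void *b = slab_alloc();
        slab_free(a);                            /* a's slot is free again */
        printf("b is object #%ld of the slab\n",
               (long)(((char *)b - slab) / OBJ_SIZE));
        return 0;
    }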
10.9 Other Considerations
Okay, so when we use a paging system, one of the most important things we have to decide on is the page-replacement algorithm we're going to use along with it – we want efficient algorithms; after all, what's the point of memory management if we're not trying to make better use of memory? Besides this, however, there are still other things to consider, several of which we're going to talk about in this section:
(1) Prepaging
If you recall, one of my side-note rambles included a graph plotting the relationship between the working set and the page-fault rate. In that graph, we saw that for every new locality (meaning a new working set), the page-fault rate is at its highest at the beginning, because with pure demand paging we have to fault in each and every page needed to get the process going in that locality. After that's done, faults become minimal. All this faulting at the beginning of every locality takes time, right? What if there were a way to bring all the pages we'll need into memory ahead of time, to avoid that burst of faults at the start of every locality? There is, and it is called prepaging: the technique of bringing the necessary pages into memory before they are demanded, at the beginning of a process's execution or of a new locality.
Prepaging can be advantageous in some cases (as with most of the other techniques discussed in this book so far). The question is whether prepaging actually costs less than servicing the corresponding page faults, or whether we should just stick with demand paging. Say we prepage s pages and only a fraction α of them is actually used. The question is whether the cost of the s × α page faults we saved outweighs the cost of prepaging the s × (1 – α) unnecessary pages. If α is close to 0, prepaging loses – we expended the effort of prepaging while only a tiny portion of those pages was ever used. If α is close to 1, prepaging wins and is worth employing.
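To put made-up numbers on that: suppose s = 100 pages are prepaged. If α = 0.8, we save about 100 × 0.8 = 80 page faults at the cost of bringing in 100 × (1 – 0.8) = 20 pages we never use – prepaging probably wins. If α = 0.1, we save only 10 faults while wasting the I/O and memory for 90 pages, so plain demand paging is the better bet.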
However, note that prepaging an executable program can be difficult, since there is no clear sequence in which its pages will be needed. Data files, on the other hand, are well suited to prepaging, since file accesses are usually sequential. In Linux, this is done through the readahead() system call, which prefetches the contents of a file into memory.
(2) Page Size
Another thing we have to consider is page size. For the most part, operating-system designers do not get to choose the page size of an existing machine; however, when newer machines are designed, a decision about page size has to be made. So, do we make the page size smaller or bigger?
One concern related to page size is the size of the page table. If we decrease the page size, we increase the number of pages and therefore the size of the page table (and each process needs its own page table), which argues for a larger page size. On the other hand, memory is better utilized with smaller pages. Say memory is allocated to a process starting at location 00000 and continuing until it has as much as it needs; chances are it will not end exactly on a page boundary, so the last page is only partially used and internal fragmentation occurs – on average, half of the final page is wasted. For a page of 512 bytes, that's only 256 bytes, but for a page of 8,192 bytes, the waste is amplified to 4,096 bytes. This argues for a smaller page size.
Another factor is the time required to read or write a page. When the storage device is an HDD (as we'll see in Chapter 11, Section 1), the I/O time is composed of seek time, latency, and transfer time. The transfer time is proportional to the amount transferred: at a transfer rate of 50 MB per second, transferring 512 bytes takes only about 0.01 milliseconds, whereas latency is roughly 3 milliseconds and seek time about 5 milliseconds, for a total I/O time of 8.01 milliseconds. If we double the page size, the total I/O time only increases to about 8.02 milliseconds – reading a single 1,024-byte page needs only about that much time. Compare that to reading two separate 512-byte pages, which would cost a total of 16.02 milliseconds. By this argument, we would want a larger page size, so that we read fewer, bigger pages.
On the other hand, with a smaller page size, total I/O should be reduced, since locality is improved: smaller pages allow each page to match the program's locality more accurately. Say that of a 200-KB process, only 100 KB worth of pages is actually used. With one large page, we would have to bring in the entire 200 KB; with smaller pages, we could bring in only the 100 KB that we actually need and use. Smaller pages thus give us better resolution, allowing us to isolate only the memory that is necessary. Resolution sounds kind of like the thing you adjust to make videos appear clearer on screen – and that's because the higher the resolution, the more pixels there are in a given area, and smaller pixels packed into the same area make the image more defined. The same concept applies here: with smaller pages, we can pin down exactly the pages we really need, so not much memory is wasted. By this argument, we would want a smaller page size.
However, taken to the extreme, a page size of 1 byte would generate a page fault for each byte. A 200-KB process that uses only half of its memory would generate a single page fault with one large page, but with 1-byte pages, 100 KB is 102,400 pages, and we could incur on the order of 102,400 page faults. Recall that each fault carries a significant overhead, stressing the system, so minimizing the number of page faults argues for a larger page size instead.
So which is it – smaller or larger pages? There are still other factors to consider as well (such as the relationship between page size and sector size on the paging device). In the end, there is no single best answer: too many factors affect the choice, and as we saw, reasonable arguments can be made for both large and small pages. Historically, though, and in modern systems, the trend has been toward larger page sizes.
(3) TLB Reach
In Chapter 9 we talked about something other than the page table that helps us translate addresses faster than walking the page table and secondary storage: the TLB, or translation look-aside buffer. In Chapter 9 we also introduced the term hit ratio, which, put simply, is the fraction of address translations that are resolved in the TLB rather than in the page table. From this, we can see that there is a relationship between the hit ratio and the number of entries in the TLB – the more entries there are, the more likely you are to find what you're looking for, right (just like a bigger dictionary)? Well, yes, but building a TLB with more entries is expensive.
Here we introduce the TLB reach, a metric related to the hit ratio: it is the amount of memory accessible from the TLB, or simply the number of TLB entries multiplied by the page size (basically, the total amount of memory covered by all the entries in the TLB). The TLB reach is proportional to the number of entries, so doubling the number of entries doubles the reach. Even so, this may still be insufficient for some memory-intensive applications.
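As a quick, made-up illustration: a TLB with 64 entries and 4-KB pages has a reach of 64 × 4 KB = 256 KB, so only 256 KB of the address space can be translated without touching the page table. Keep the same 64 entries but map 2-MB pages, and the reach jumps to 64 × 2 MB = 128 MB.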
Alternatively, we could increase the page size. However, as we established in the previous subsection, increasing the page size increases the likelihood of internal fragmentation, since not all applications need that much space. A compromise is an architecture that supports pages of multiple sizes, which an operating system such as Linux can take advantage of; Linux also provides a huge pages feature that allows larger pages to be used for certain regions of memory.
The ARMv8 architecture also supports pages of different sizes. In addition, ARMv8 has something called the contiguous bit, which is essentially a bit that hints that a TLB entry is one of a set of entries mapping adjacent blocks of memory. If the bit is set for an entry, that entry maps a set of contiguous blocks. There are three possible arrangements in which contiguous blocks can be mapped by a single TLB entry:
1. a 64-KB TLB entry comprising 16 × 4-KB adjacent blocks
2. a 1-GB TLB entry comprising 32 × 32-MB adjacent blocks
3. a 2-MB TLB entry comprising either 32 × 64-KB adjacent blocks or 128 × 16-KB adjacent blocks
In cases where multiple page sizes are supported, the operating system, rather than the hardware, may have to manage the TLB. This can cost us some performance, but in the end the increased hit ratio and larger reach outweigh the cost of having software manage the TLB.
(5) Program Structure
Another thing that affects paging performance is how the program itself is written. Consider zeroing out a 128 × 128 array of integers. The array is stored in row-major order, meaning row 0 (data[0][0] through data[0][127]) is laid out in memory first, then row 1, and so on; with 128 words per page, each row occupies exactly one page. If the zeroing loop walks the array column by column, it touches a different page on every assignment, and if the operating system cannot allocate 128 frames to the program, we could experience 128 × 128 = 16,384 page faults. That's a lot! However, suppose we had instead coded the program to walk the array row by row, as in the sketch below.
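The code figures themselves are not reproduced in this text, so here is a sketch of the two versions being compared, under the assumptions above (a 128 × 128 int array, one row per page):

    #include <stdio.h>

    int data[128][128];

    int main(void) {
        /* column at a time: touches a different page on every assignment,
           up to 128 x 128 = 16,384 page faults if fewer than 128 frames
           are allocated to the program */
        for (int j = 0; j < 128; j++)
            for (int i = 0; i < 128; i++)
                data[i][j] = 0;

        /* row at a time (the version the text refers to): finishes one whole
           page before moving to the next, so at most 128 page faults */
        for (int i = 0; i < 128; i++)
            for (int j = 0; j < 128; j++)
                data[i][j] = 0;

        printf("done\n");
        return 0;
    }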
We would be able to zero all the words on one page before starting the next, reducing the number of page faults to only 128 (compared to roughly 16,000). Careful selection of data structures and programming structures can thus increase locality and hence lower the page-fault rate and the size of the working set. Do note, however, that locality is just one measure of how efficiently data is being used; other measures include search speed, the total number of memory references, and the total number of pages touched.
(6) I/O Interlock and Page Locking
There are certain situations, when we employ demand paging, where we need specific pages to be locked in memory. One example is when I/O is done to or from user (virtual) memory.
I/O is often implemented by a separate I/O processor. For example, a controller for a USB
storage device is generally given the number of bytes to transfer and a memory address for
the buffer (Figure 10.28). When the transfer is complete, the CPU is interrupted. We must,
then be sure that the following sequence does not occur: A process issues an I/O request
and is put in a queue for that I/O device. Meanwhile, the CPU is given to other processes.
These processes cause page faults, and one of them, using a global replacement algorithm,
replaces the page containing the memory buffer for the waiting process. The pages are
paged out. Sometime later, when the I/O request advances to the head of the device queue,
the I/O occurs to the specified address. However, this frame is now being used for a
different page belonging to another process.
10.10 Operating-System Examples
Now that we know what virtual memory is and how virtual memory management is done, we can talk about how these ideas are implemented in actual operating systems, more specifically the following:
Linux
- Recall that Linux employs slab allocation when kernel memory is allocated.
- For user memory, Linux uses demand paging, allocating pages to processes from a list of free frames.
- To manage allocated pages, Linux maintains two types of page lists:
active_list
- This list contains the pages that are considered in active use.
inactive_list
- This list contains pages that have not been referenced recently and are therefore candidates to be reclaimed.
- An access bit is associated with each page; it is set whenever the page is referenced, which lets us determine whether a page has been used. When a page is first allocated, its access bit is set and it is placed at the rear of the active_list. Similarly, whenever a page already on that list is referenced, its access bit is set again. Periodically the access bits are reset, and if a page is not referenced again soon, it may be migrated to the inactive_list, since we're clearly not keeping it active. If a page on the inactive_list is referenced again, it obviously moves back to the active_list, where it sits until its access bit is reset and it goes unreferenced long enough to drift back to the inactive_list. We show this in Figure 10.29.
- The two lists are kept in relative balance: when the active list grows much larger than the inactive list, pages at the front of the active list move to the inactive list, where they become eligible for reclamation. (A rough sketch of this two-list bookkeeping appears right after this bullet.)
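A very rough, user-space caricature of that two-list idea (this is not Linux code; the page count and the scan() helper are invented):

    #include <stdio.h>

    #define NPAGES 4

    struct page { int accessed; int active; };   /* active: 1 = active_list, 0 = inactive_list */

    /* one periodic pass: referenced pages (re)join the active list and have
       their access bit cleared; untouched pages drift to the inactive list */
    void scan(struct page *p, int n) {
        for (int i = 0; i < n; i++) {
            if (p[i].accessed) { p[i].active = 1; p[i].accessed = 0; }
            else               { p[i].active = 0; }
        }
    }

    int main(void) {
        struct page pages[NPAGES] = {{1,1},{0,1},{1,0},{0,0}};
        scan(pages, NPAGES);
        for (int i = 0; i < NPAGES; i++)
            printf("page %d: %s\n", i, pages[i].active ? "active" : "inactive");
        return 0;
    }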
Windows
- Windows 10 implements virtual memory using demand paging with clustering, a
strategy that recognizes locality of memory references and therefore handles page
faults by bringing in not only the faulting page but also several pages immediately
preceding and following the faulting page.
- A key component of virtual memory management in Windows 10 is working-set
management. When a process is created, it is assigned a working set minimum of 50
pages and a working-set maximum of 345 pages. The working-set minimum is the
minimum number of pages the process is guaranteed to have in memory; if sufficient
memory is available, a process may be assigned as many pages as its working-set
maximum. If the process were not configured to have hard working-set limits,
then these values may be disregarded.
- Windows uses an LRU-approximation page-replacement algorithm. As usual, there is a list of free frames, along with a threshold that governs when pages must be reclaimed. When the amount of free memory falls below the threshold, the virtual memory manager uses a global replacement tactic known as automatic working-set trimming to bring the amount of free memory back above the threshold: basically, if a process has been allocated more pages than its working-set minimum, the excess pages are removed from its working set.
Solaris
- In Solaris, when a thread incurs a page fault, the kernel assigns a page to the faulting
thread from the list of free pages it maintains.
- Associated with this list of free pages is a parameter—lotsfree— that represents
a threshold to begin paging. Essentially, this parameter is set to 1/64 the size of the
actual memory. With a frequency of four times per second, the kernel checks
whether the amount of free memory is less than the lotsfree. If it is, then a
pageout starts up, which is basically a routine similar to the second-chance
algorithm we described previously in this chapter.
- The pageout algorithm uses several parameters to control the rate at which pages
are scanned (known as the scanrate). The scanrate is expressed in pages per
second and ranges from slowscan to fastscan. When free memory falls below
lotsfree, scanning occurs at slowscan pages per second and progresses to
fastscan, depending on the amount of free memory available. We see this in
Figure 10.30.
o slowscan: 100 pages per second
o fastscan: (total physical pages)/2 pages per second, up to a maximum of 8,192 pages per second
- The distance (in pages) between the hands of the clock is determined by a system
parameter, handspread.
- The amount of time between the front hand’s clearing a bit and the back hand’s
investigating its value depends on the scanrate and the handspread.
- The page-scanning algorithm skips pages belonging to libraries that are being
shared by several processes, even if they are eligible to be claimed by the scanner.
The algorithm also distinguishes between pages allocated to processes and pages allocated to regular data files. This is known as priority paging, which will be discussed in more detail in Section 14.6.2.