Decay-Usage Scheduling in Multiprocessors
Report 97-55
D.H.J. Epema
Copyright © 1997 by the Faculty of Technical Mathematics and Informatics, Delft, The Netherlands.
No part of this Journal may be reproduced in any form, by print, photoprint, microfilm, or any other means without permission from the Faculty of Technical Mathematics and Informatics, Delft University of Technology, The Netherlands.
Copies of these reports may be obtained from the bureau of the Faculty of Technical Mathematics and Informatics, Julianalaan 132, 2628 BL Delft, phone +31152784568.
A selection of these reports is available in PostScript form at the Faculty's anonymous ftp-site. They are located in the directory /pub/publications/tech-reports at ftp.twi.tudelft.nl
Abstract
Decay-usage scheduling is a priority-ageing time-sharing scheduling policy capable of
dealing with a workload of both interactive and batch jobs by decreasing the priority of
a job when it acquires CPU time, and by increasing its priority when it does not use
the CPU. In this paper we deal with a decay-usage scheduling policy in multiprocessors
modeled after widely used systems. The priority of a job consists of a base priority
and a time-dependent component based on processor usage. Because the priorities in
our model are time dependent, a queueing-theoretic analysis, for instance for the mean
job response time, seems impossible. Still, it turns out that as a consequence of the
scheduling policy, the shares of the available CPU time obtained by jobs converge, and
a deterministic analysis for these shares is feasible: We show how for a fixed set of jobs
with large processing demands, the steady-state shares can be obtained given the base
priorities, and conversely, how to set the base priorities given the required shares. In
addition, we analyze the relation between the values of the scheduler parameters and
the level of control it can exercise over the steady-state share ratios, and we deal with
the rate of convergence. We validate the model by simulations and by measurements of
actual systems.
Categories and Subject Descriptors: D.4.1 [Operating Systems]: Process Management – multiprocessing/multiprogramming, scheduling; D.4.8 [Operating Systems]: Performance – measurements, modeling and prediction, simulation
General Terms: Measurement, Multiprocessors, Performance, Scheduling, Simulation
Additional Key Words and Phrases: Control, convergence, decay usage, priorities, shares
A short version of this paper, which omits the proofs, Sections 4.3 and 5, and the extended treatment
of the performance results in Section 6, appeared in the Proceedings of Sigmetrics '95/Performance
'95, May 1995, Ottawa, Canada, pp. 74-85.
1 Introduction
Time-sharing systems that support both interactive and batch jobs are steadily becoming
less important in this age of workstations. Instead, for compute-intensive jobs, the use of
multiprocessors and clusters of uniprocessors is increasing. Currently, most of the latter
systems run some variation of the UNIX operating system, which employs a classical time-
sharing policy. In this paper we analyze a scheduling policy in multiprocessors modeled
after scheduling in UNIX systems under a workload of compute-intensive jobs.
An important issue in scheduling policies for time-sharing systems is how to recognize
long, compute-bound jobs in order to give these a low priority relative to short, interactive
jobs. The main mechanism to achieve this without a-priori knowledge of the processing
demands of jobs is by means of multilevel feedback queues. There are two basic variations
of such queues, in each of which the priority of a job is depressed whenever it acquires
CPU time. The first is the Head-of-the-Line policy, in which, on arrival, a job enters
the highest-priority queue, and each time it exceeds some amount of processing time, its
priority is depressed by appending it to the next-lower priority queue. This continues until
a job reaches the lowest priority level, where it then stays; in particular, its priority remains
depressed permanently. In the second, processor-usage ageing, also known as priority ageing
and decay usage, is employed: When a job acquires CPU time, its priority is continually
being depressed, and periodically, at the end of every decay cycle, the priorities of all jobs are
elevated by diminishing the depressions due to the CPU time obtained (i.e., the processor
usage of each job is decayed), after which each job is appended to the queue corresponding
to its new priority. As a consequence, the further in the past CPU time has been obtained,
the smaller the priority depression it entails. In addition, the priorities of jobs may have
fixed base components, yielding the opportunity to give different jobs different levels of
service. It is this type of scheduling that is used in UNIX, and also in Mach [2, 3], the
latter of which is the basis of the operating system of the Open Software Foundation (OSF)¹.
In the continuous-time multiprocessor scheduling policy we study, the priority of a job
is equal to the sum of its fixed base priority and a time-dependent component. The latter
part, the accumulated and decayed CPU usage, increases in proportion to the CPU time
obtained, and is periodically divided by the decay factor, which is a constant
larger than one. (A low priority corresponds to a high level of precedence.) Jobs with equal
base priorities belong to the same class. Jobs are selected for service in a processor-sharing
fashion according to their priorities. At any point in time, there is a subdivision of jobs
into three groups, any of which may be empty. The jobs in the first group have the lowest
priorities (i.e., the highest levels of precedence) and have dedicated processors; those in the
second group have equal priorities, higher than any of the priorities of the jobs in the first
group, and evenly share the remaining processors; and those in the third group, with yet
higher priorities, do not receive service. This subdivision of jobs may vary during a decay
cycle and may change at the start of a new decay cycle, because the priority of a job that
¹ Open Software Foundation, Inc., 11 Cambridge Center, Cambridge, MA, USA.
receives service increases, and because of the recomputation of the priorities at the end of
a decay cycle, respectively.
Because the priorities are time dependent, a queueing-theoretic analysis of the policy, for
instance solving for the mean response time, possibly assuming a Poisson arrival process and
an exponential service-time distribution for each class, seems infeasible. However, provided
that the set of jobs in the system is fixed and that all jobs are present from the same time
onwards, a deterministic analysis is feasible. In that case, because jobs with equal base
priorities are treated in the same way, the priorities of the jobs in the same class are always
equal, and the subdivision of jobs into three groups is in fact a subdivision of classes. The
main ingredient of our analysis is keeping track of this subdivision, in particular at the
end of every decay cycle. Our main results are proving the convergence of this scheduling
policy in the sense that the fractions of CPU time obtained by jobs (the shares) have limits
(the steady-state shares), and deriving the relation between the base priorities and the
steady-state shares: For a fixed set of jobs with large enough processing demands, we show
how to compute the shares given the base priorities, and conversely, how to set the base
priorities given the required shares. The decay-usage scheduling policy is characterized by
four parameters (amongst which the decay factor and the increment in priority due to one
unit of CPU time obtained). It turns out that these parameters are not independent, to the
extent that in general, one parameter would suffice as far as the ratios of the steady-state
shares are concerned. We also deal with the transient behavior of the scheduler, i.e., with
the rate of convergence to the steady state and with arrivals and departures of jobs.
In actual systems, jobs receive discrete quanta of CPU time, and some numbers in the
scheduler are integers. In our model, jobs receive CPU time in a continuous processor-
sharing fashion, and we assume all numbers to be reals. In order to validate our model
and to assess the impact of the continuous-time hypothesis without any influence of im-
plementation details of actual systems, we simulate the UNIX scheduler. In general, the
results of the simulations agree reasonably well with the model when the quantum size
is small. In addition, we performed measurements on uniprocessors and multiprocessors
running versions of UNIX based on the 4.3 Berkeley Software Distribution (4.3BSD) and on
Mach. The results for the 4.3BSD uniprocessor agree extremely well with the model, but
those of the 4.3BSD multiprocessor fail to do so. Detailed traces of the latter system show
that its scheduling policy does not comply with the model. For the Mach systems, the
measurements do not match the results of the model very well; however, after adapting the
model to include the somewhat different way in which Mach effectuates decay, the measurements
match the model very well indeed.
Our analysis extends the analysis by J.L. Hellerstein [10] of scheduling in UNIX System-
V uniprocessors and the short treatment by Black [3] of Mach multiprocessors to a unified
treatment of scheduling in uniprocessors and multiprocessors running UNIX System V (up
to Release 4), 4.3BSD UNIX, 4.4BSD UNIX, or Mach. Neither of these authors proves
that the decay-usage scheduling policy reaches a steady state, nor do they deal with the
transient behavior of the scheduler. The extension to 4.3BSD (and 4.4BSD) and Mach
involves more complicated formulas for the priorities. The main difficulty in the extension
to multiprocessors is to prove the convergence of the scheduling policy and to determine how
many processors serve each class in the steady state. We use some of the techniques of [10],
in particular the subdivision of decay cycles into epochs, at the end of which the priorities
of different classes become equal. Hellerstein [11] gives a more detailed exposition of the
results in [10], and also deals with control issues relating to the range of possible ratios of
the steady-state shares and the granularity within this range. In [10, 11], measurements are
presented for UNIX System V on uniprocessors; they match the predictions of the model
remarkably well.
For a compute-intensive workload on a multiprocessor or a cluster of uniprocessors, it is
a natural objective to deliver pre-specified shares of the total computing power to groups of
jobs (e.g., all jobs belonging to a single user or to a specified group of users). Schedulers for
uniprocessors with this objective have been treated in the literature under the name of fair-
share schedulers [5, 12, 13], although the term share schedulers would be more appropriate.
Our results show how to achieve share scheduling for long, compute-intensive jobs in UNIX
systems, at least in theory, without scheduler modifications, while in each of [5, 12, 13], an
additional term in the priorities of jobs and a feedback loop are needed. Unfortunately, the
relation between the base priorities and the share ratios is rather intricate. However, as
in other classical time-sharing systems, the two main performance objectives of the UNIX
scheduler are short response times of interactive jobs and a high throughput of large jobs,
rather than share scheduling.
Recently, there has been a renewal of interest in share scheduling of different types of
resources, especially of processor time and of network bandwidth. In [18], a probabilistic
resource-management mechanism called lottery scheduling is presented. Jobs own a number
of tickets proportional to their shares, and for each scheduling decision, a lottery is held
in which the chance of a job to win is proportional to its number of tickets. Resource
abstractions such as ticket transfers, currencies and inflation are supported. Ticket transfers
allow a job to yield its rights to another job, for instance when the former sends a request
to the latter and blocks. Currencies and inflation allow a group of jobs to be isolated from
other jobs when it expands, shrinks or internally re-allocates the relative shares of its jobs.
Experimental results show that lottery scheduling works well for compute-intensive jobs on a
scale of seconds, and that it can also be used for certain non-compute-intensive workloads.
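As an illustration of the mechanism (a sketch of ours, not code from [18]):

    import random

    def lottery_pick(jobs):
        # jobs: list of (name, tickets) pairs.
        # One lottery-scheduling decision: a job wins with probability
        # proportional to its number of tickets.
        total = sum(t for _, t in jobs)
        winner = random.randrange(total)     # draw a winning ticket number
        for name, t in jobs:
            if winner < t:
                return name
            winner -= t

Over many quanta, lottery_pick([("A", 3), ("B", 1)]) serves A about three times as often as B, giving throughput proportional to tickets.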
In [19], a deterministic mechanism called stride scheduling is employed to achieve share
scheduling that uses the same abstractions as lottery scheduling. Each job has a stride
value that is inversely proportional to its number of tickets, and a pass value. When a
scheduling decision is taken, the job with the lowest pass value is chosen, and its pass value is
incremented by its stride value. It is shown by means of simulations and an implementation,
that compared to lottery scheduling, stride scheduling achieves a somewhat better accuracy
in throughput ratios and a much lower variability in response times (which are defined as the
times between two successive quantum completions). As a matter of fact, stride scheduling
is but a slight modification of decay-usage scheduling (cf. Section 4.1).
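Stated in code (a sketch of ours, not the implementation of [19]; STRIDE1 is an arbitrary precision constant):

    import heapq

    def stride_schedule(jobs, quanta):
        # jobs: list of (name, tickets). Runs `quanta` scheduling decisions;
        # returns how many quanta each job received.
        STRIDE1 = 1 << 20                          # stride = STRIDE1 / tickets
        heap = [(STRIDE1 / t, STRIDE1 / t, name) for name, t in jobs]
        heapq.heapify(heap)                        # entries: (pass, stride, name)
        counts = {name: 0 for name, _ in jobs}
        for _ in range(quanta):
            pas, stride, name = heapq.heappop(heap)   # lowest pass value runs
            counts[name] += 1
            heapq.heappush(heap, (pas + stride, stride, name))
        return counts

For instance, stride_schedule([("A", 3), ("B", 1)], 400) yields approximately {"A": 300, "B": 100}: deterministic throughput in proportion to tickets.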
In [7], a different time-slicing mechanism for achieving proportional resource usage called
time-function scheduling (TFS) is analyzed in uniprocessors. For each class k of jobs, a time
function F_k(·) is defined. Within a class, jobs are served in a FCFS fashion. When a job has
spent an amount of time t waiting in the queue corresponding to its class, its time-function
value is equal to Fk (t). After a quantum has expired, the processor selects, among the jobs
at the heads of the non-empty queues, the one with the highest time-function value. It is
shown in [7] that with linear time functions, TFS is able to achieve share-ratio objectives on
a per-class basis very well (in fact, the share ratios are the inverses of the ratios of the slopes
of the time functions), with a waiting-time variance that is substantially lower than that of
lottery scheduling. An important difference between TFS on the one hand and decay-usage
scheduling and stride scheduling on the other, is that in the former, priorities are depressed
while a job is waiting, while in the latter this happens while a job receives service.
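A corresponding sketch of one TFS decision with linear time functions F_k(t) = a_k·t (our notation; [7] is not restricted to this form):

    def tfs_pick(queues, slopes, now):
        # queues: per-class FCFS lists of (job, arrival_time);
        # slopes: per-class coefficients a_k. Returns the class whose
        # head-of-queue job has the highest time-function value.
        best, best_val = None, float("-inf")
        for k, q in queues.items():
            if q:
                job, arrived = q[0]
                val = slopes[k] * (now - arrived)   # F_k(waiting time)
                if val > best_val:
                    best, best_val = k, val
        return best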
Fair scheduling has also been investigated in the context of networks, in which di erent
packet streams with quality-of-service demands compete for bandwidth. The usual aim
of fair scheduling policies in networks is to approximate some form of processor sharing
[9]. Two di erences with scheduling compute-intensive jobs in multiprocessors are that
in networks, the time scale on which to achieve fair scheduling is much smaller, and that
scheduling is nonpreemptive because the transmission of a packet cannot be interrupted
and resumed.
In [4], three scheduling algorithms for (small) shared-memory multiprocessors are studied
by means of simulations driven by traces of UNIX systems. The emphasis is
on the impact of load-balancing aspects such as migration overhead on response times. No
attempt is made at simulating the decay-usage scheduling policy of UNIX.
2 The Model
In this section we describe scheduling in UNIX systems and in Mach, our model of this type
of scheduling, and the differences between them. We then show how our model behaves
over time.
2.1 Scheduling in UNIX Systems
Scheduling in UNIX follows a general pattern, but details vary among variations such as
UNIX System V Releases 2 [1] and 3, UNIX System V Release 4 [8], 4.3BSD UNIX [15] and
4.4BSD UNIX [16], and the operating system Mach [2, 3], which has a close resemblance to
UNIX as far as scheduling is concerned. The general description below, which is valid for
both uniprocessors and multiprocessors, applies to System V up to and including Release 3,
4.3BSD and 4.4BSD, and Mach. It does not apply to System V Release 4, which by default
supports the two scheduling classes time-sharing and real-time, for the former of which it
uses a table-driven approach [8]. Therefore, in the remainder of this article, `System V'
refers to the releases of that system up to and including Release 3. For a comprehensive
treatment of scheduling in UNIX, see Chapter 5 of [17].
UNIX employs a round-robin scheduling policy with multilevel feedback and priority
ageing. Time is divided into clock ticks, and CPU time is allocated in time quanta of some
xed number of clock ticks. In most current UNIX and Mach systems, a clock tick is 10 ms.
The set of priorities (typically 0, 1, ..., 127) is partitioned into the kernel-mode priorities
(usually 0 through 49) and the user-mode priorities (50 through 127). A job (in UNIX called
a process) is represented by a job descriptor, in which the following three fields, implemented
as integers, relate to processor scheduling (also for Mach, we use UNIX terminology):
1. p nice: The nice-value, which acts as a base priority and has range 0 through 19.
2. p cpu: This field is incremented by 1 for every clock tick received, and every second,
it is recomputed by dividing it by the decay factor D, and in some systems, by adding
p nice to it. Although p nice enters in its periodic recomputation, we will refer to
p cpu as the accumulated and decayed CPU usage, or sometimes simply as the CPU
usage. It is the division by D that we call the decay of (CPU) usage, and also, because
p cpu enters in the priority, priority ageing.
3. p usrpri: The priority of a job, which is the sum of a linear combination of p cpu and
p nice, and of the constant PUSER serving to separate kernel-mode priorities from
user-mode priorities (with the ranges of priorities mentioned above, PUSER equals
50). A high value of p usrpri corresponds to a low precedence. In the sequel, in the
context of UNIX, `priority' refers to p usrpri.
In principle, there is a run queue for every priority value. A waiting process is in the run
queue corresponding to its priority. When a time quantum expires, the scheduler selects
the process at the head of the highest non-empty run queue (lowest value of p usrpri)
to run. In many implementations, there are 32 instead of 128 run queues; then, the two
least-significant bits of p usrpri are ignored when deciding to which run queue to append
a process.
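For illustration, the selection rule can be sketched as follows (our code, not taken from any of the cited systems):

    def runq_index(p_usrpri, nqueues=32):
        # Map a priority to a run queue; with 32 queues the two
        # least-significant bits of p_usrpri are ignored.
        return (p_usrpri >> 2) if nqueues == 32 else p_usrpri

    def select_next(runqueues):
        # Pick the process at the head of the highest non-empty run queue
        # (lowest index = lowest p_usrpri = highest precedence).
        for q in runqueues:            # runqueues[0] holds the lowest priorities
            if q:
                return q.pop(0)
        return None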
2.1.1 System V
In UNIX System V (excluding Release 4) [1], when a process receives a clock tick, the fields
in its job descriptor are recomputed according to:

    p cpu := p cpu + 1
    p usrpri := PUSER + 0.5·p cpu + p nice

Every second, the following recomputation is performed for every process:

    p cpu := p cpu/2
    p usrpri := PUSER + 0.5·p cpu + p nice
So here, D = 2. Usually, PUSER = 50.
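In code form, the two recomputations read as follows (a sketch of ours, assuming integer fields as in actual kernels, with PUSER = 50):

    PUSER, D = 50, 2

    def on_clock_tick(p):                 # p: a job descriptor (dict of fields)
        p["p_cpu"] += 1
        p["p_usrpri"] = PUSER + p["p_cpu"] // 2 + p["p_nice"]

    def every_second(procs):
        for p in procs:
            p["p_cpu"] //= D              # decay: p_cpu := p_cpu / 2
            p["p_usrpri"] = PUSER + p["p_cpu"] // 2 + p["p_nice"]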
2.1.2 4.3BSD
Scheduling in 4.3BSD UNIX ([15], p. 87) is identical to that in 4.4BSD UNIX ([16], p. 92).
Because we performed measurements on 4.3BSD UNIX systems, we will in the sequel only
refer to that system. In 4.3BSD UNIX, the priority of a running process is only recomputed
(according to (2) below) when p cpu becomes a multiple of 4 (presumably because of the
division by 4 of p cpu in (2)). Every second, the following recomputation is performed for
every process:

    p cpu := (2·load/(2·load + 1))·p cpu + p nice        (1)
    p usrpri := PUSER + p cpu/4 + 2·p nice               (2)
Here, load is the total number of jobs in the run queues as sampled by the system, divided
by the number of processors. From (1), we see that D = (2·load + 1)/(2·load). (This value
for D has been chosen so that only about 10% of the value of p cpu persists after
5·load seconds.) Usually, PUSER = 50. In 4.3BSD UNIX, a clock tick is 10 ms and a time
quantum is 10 clock ticks.
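To verify the 10% claim: ignoring the p nice term, after 5·load seconds of decay,

    (1/D)^(5·load) = (2·load/(2·load + 1))^(5·load) ≈ e^(−5·load/(2·load + 1)) ≈ e^(−2.5) ≈ 0.08,

so indeed only about 10% of p cpu persists.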
2.1.3 Mach
In Mach [3], a quantity l, called the load factor, is defined by

    l = max(1, M/P),
where M is the number of processes and P the number of processors. A basic element in
Mach scheduling is the increment in priority R_0 due to one second of received CPU time
when l = 1 (R_0 has been set to approximately 3.8). In general, the increment in priority
due to one second of CPU time is lR_0. The priority of a process is

    p usrpri = PUSER + (lR_0/T)·p cpu + 0.5·p nice,        (3)
where T is the number of clock ticks per second. The effect of the load factor in the
coefficient of p cpu is to keep the priorities in the same range regardless of the number of
processes and processors (see Section 3.2, Remark 5). Every second, the CPU usage of
every process is decayed according to:
    p cpu := 5·p cpu/8

(so D = 1.6), and p usrpri is recomputed according to (3). In Mach, PUSER = 12, a
clock tick is 10 ms, a quantum is 10 clock ticks, and there is a direct match between the
priorities, which range from 0 through 31, and the 32 run queues.
In fact, the way decay is effectuated in Mach is somewhat different from what is described
above. Mach maintains a global 1-second counter C and a local decay counter Cp for every
process p. Whenever a clock interrupt occurs at a CPU, the system effectuates decay for the
process p running on that CPU by the following computation (p cpu_p is the field p cpu of
process p):
    p cpu_p := p cpu_p / D^(C − C_p)        (4)
    C_p := C                                (5)
Subsequently, the priority of p is recomputed according to (3). In order to avoid starvation
of processes that do not get a chance to run and perform this priority recomputation, but
that would run if they could adjust their priorities, every 2 seconds the system performs
the computation of (4), (5) and (3) for every process. Because we are not able to model
this form of decay, throughout this paper, we assume in our modeling of Mach that decay is
effectuated every second (with decay factor D = 1.6) for every process, as in UNIX. Only
when discussing Mach measurements in Section 6 will we analyze the implications of the
actual way Mach effectuates decay.
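A sketch of this lazy decay (our rendering of (4), (5) and (3), with continuous values; p is a process record):

    D, PUSER = 1.6, 12

    def lazy_decay(p, C, l, R0, T=100):
        # Effectuate decay for process p at a clock interrupt: apply all
        # decays missed since p last ran ((4), (5)), then recompute (3).
        p["p_cpu"] /= D ** (C - p["C_p"])     # (4): catch up on C - C_p decays
        p["C_p"] = C                          # (5)
        p["p_usrpri"] = PUSER + (l * R0 / T) * p["p_cpu"] + 0.5 * p["p_nice"]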
2.2 The Decay-Usage Scheduling Model
Our model of decay-usage scheduling is de ned as follows.
1. There are P processors of equal speed.
2. There are K classes of jobs. We consider the model from time t = 0 onwards. There
is a fixed set of jobs, all present at t = 0; no arrivals or departures occur. (In Section
5 we will relax this assumption.) There are M_k class-k jobs, k = 1, ..., K. We write
M̂_k = Σ_{l=1}^{k} M_l, and assume M̂_K > P. Let k₀ be the index such that M̂_{k₀} ≤ P and
M̂_{k₀+1} > P. If M_1 > P, then let k₀ = 0.
3. A class-k job has base priority b_k, with b_k real and non-negative, and with b_k < b_l
when k < l, k, l = 1, ..., K. The priority of a class-k job at time t is

    q_k(t) = α b_k + R v_k(t),        (6)

with α and R real, positive constants. The function v_k(t) will be explained in 4. below
(where it will also be shown that the priority of a job only depends on its class).
4. Time is divided into intervals of length T called decay cycles, from t = 0 onwards. Let
t_n = nT, n = 1, 2, .... The n-th decay cycle [t_{n−1}, t_n] is denoted by T_n. The scheduling
policy is a variation of a policy known as priority processor sharing [14] and also as
discriminatory processor sharing [6], that is, jobs simultaneously progress at possibly
different rates (called their processor shares), which may change over time.
The functions v_k are piece-wise linear and are defined as follows: v_k(0) = 0, and for
every subinterval [t1, t2] of a decay cycle during which all jobs of class k receive a
constant processor share, say f,

    v_k(t) = v_k(t1) + f·(t − t1),  for t1 ≤ t ≤ t2.        (7)
Furthermore, at the end of every decay cycle, the following recomputation is per-
formed:

    v_k(t_n^+) := v_k(t_n^−)/D + β b_k,  k = 1, ..., K,        (8)

where D, which is called the decay factor, and β are real constants, D > 1, β ≥ 0, and
where t_n^− and t_n^+ denote the time t_n immediately before and after the decay of (8) has
been performed.
5. At any point in time t, the set of jobs that receive service and the processor shares
of those jobs are determined as follows. Order the classes such that q_{k_i}(t) ≤ q_{k_{i+1}}(t),
i = 1, ..., K − 1, and let r and s be the indices such that q_{k_r}(t) < q_{k_{r+1}}(t) = ··· = q_{k_s}(t)
and q_{k_s}(t) < q_{k_{s+1}}(t). If such an r does not exist, let r = 0; if such an s does not
exist, let s = K. Now at time t, each of the jobs of classes k_1, ..., k_r has a dedicated
processor, the jobs of classes k_{r+1}, ..., k_s evenly share the remaining P − Σ_{i=1}^{r} M_{k_i}
processors, and classes k_{s+1}, ..., k_K do not receive service.
There are a few things to note about this scheduling policy:
The behavior of the model does not change when the base priorities b_k are replaced
by b_k + C, k = 1, ..., K, for some constant C such that b_1 + C ≥ 0; only the differences
between the base priorities matter.
The dimension of the parameter T is time. We are at liberty to express time as elapsed
time, or as the number of clock ticks delivered by a processor during a decay cycle.
This gives us the opportunity to compare the behavior of our model for processors of
different speeds by choosing different values for T (different numbers of clock ticks in
a decay cycle, each clock tick representing the same amount of useful work).
The parameters R and D may depend on the other parameters (such as P and the
M_k), but should be constant in any instance of the model.
Because of (6), (7) and (8), for n ≥ 2 we have

    β b_k Σ_{j=0}^{n−2} D^{−j} ≤ v_k(t_n^−) ≤ T Σ_{j=0}^{n−1} D^{−j} + β b_k Σ_{j=0}^{n−2} D^{−j}.        (9)
The lower bound is only reached when class-k jobs do not receive service during
decay cycles T1 ; : : :; Tn, while the upper bound is only reached when class-k jobs have
dedicated processors during decay cycles T1; : : :; Tn.
Because all processors are assumed to have equal speeds, any number M of jobs can
be given equal shares on any number P′ of processors by means of processor sharing,
possibly also across processors, when M ≥ P′, which shows the feasibility of 5. above.
When during a decay cycle a job receives service, its priority increases proportionally
to its processor share; when it does not receive service, its priority remains constant.
Shares only change during a decay cycle when two priorities q_k(t) and q_l(t) become
equal. If at time t, jobs of different classes have equal priorities, they will receive
equal shares at any time during the remainder of that decay cycle, and their priorities
remain equal. Therefore, there are at most K − 1 points in a decay cycle where shares
change. The intervals between two consecutive such points are called epochs. The
l-th epoch of decay cycle n is denoted by T_n(l).
The parameters of System V are R = 0.5, D = 2, α = 1, β = 0. Insofar as we confine
ourselves to 4.3BSD systems in their steady states, we can assume that the load as sampled
by the system is equal to M̂_K/P, so because we assume that there are more processes than
processors, the definitions of load for 4.3BSD and of the load factor l for Mach coincide,
and we will in the sequel denote both by l. The parameters of 4.3BSD are R = 0.25,
D = (2l + 1)/(2l), α = 2, and β = 1, and those of Mach are R = lR_0/T, D = 1.6, α = 0.5,
and β = 0, with l = M̂_K/P. Currently, T = 100 clock ticks in almost all implementations
of these operating systems the author knows of, and throughout this paper we will assume
this value for T . Also, in the sequel we will use the terms base priority and nice-value
interchangeably.
2.3 Discrepancies between UNIX Scheduling and the Model
There are three points where our model differs from real UNIX scheduling (except for the
shift in priority by PUSER, which has no effect), viz.:
1. In our model we use continuous time and a continuous range for all variables in the
scheduler, while actual systems use discrete time and force most variables to have
integer values. In our model, a job can get any fraction of a second of CPU time
per second (or any real number of clock ticks less than or equal to T during a decay
cycle), while in actual systems, a process can only get an integral number of quanta
per second.
2. The UNIX scheduler uses priority clamping: Because of their representations in a
fixed number of bits, p cpu and p usrpri are set to their maximum values whenever
the computations for these fields result in larger values. Thus, in 4.3BSD, p cpu cannot
exceed 255 and p usrpri cannot exceed 127 ([15], p. 87). We will see in Section 4.3
that clamping is particularly prominent in 4.3BSD systems. The issue of an upper
bound of p usrpri is also addressed in [11].
3. In many UNIX systems, the two least-significant bits of p usrpri are ignored when
determining to which run queue to append a process.
2.4 The Operation of the Model
In this section we describe the operation of our decay-usage scheduling model. According
to the description in Section 2.2, at the beginning of the first decay cycle T_1, all jobs in
classes 1, 2, ..., k₀ get dedicated processors, and their priorities all increase at the same rate
R. If P = M̂_{k₀}, this operation continues until either q_{k₀}(t) is equal to q_{k₀+1}(t) or until T_1
finishes, whichever occurs first. If P > M̂_{k₀}, the jobs of class k₀ + 1 share the remaining
processors, and their priority increases at rate R(P − M̂_{k₀})/M_{k₀+1}, which is smaller than
R. This operation continues until one of four things happens:
1. The priority of class k₀ becomes equal to the priority of class k₀ + 1. Then, the jobs
in classes 1, 2, ..., k₀ − 1 continue having dedicated processors, and the jobs in classes
k₀ and k₀ + 1 start sharing P − M̂_{k₀−1} processors.
2. The priority of class k₀ + 1 becomes equal to the priority of class k₀ + 2. In this case,
the jobs in classes 1, 2, ..., k₀ continue having dedicated processors, and the jobs in
classes k₀ + 1 and k₀ + 2 start sharing P − M̂_{k₀} processors.
3. The priorities of classes k₀ and k₀ + 1 become equal to the priority of class k₀ + 2 at
the same time. Then, the jobs in classes 1, 2, ..., k₀ − 1 continue having dedicated
processors, and the jobs in classes k₀, k₀ + 1 and k₀ + 2 start sharing P − M̂_{k₀−1}
processors.
4. Before any of 1.-3. happens, T_1 finishes.
Continuing in this way, it is clear that T_1 consists of at most K epochs T_1(1), T_1(2), ..., with
service delivered as follows. During T_1(l), the jobs in classes 1, 2, ..., i_1(l) have dedicated
processors, jobs in classes i_1(l) + 1, ..., j_1(l) share P − M̂_{i_1(l)} processors, and the remaining
classes do not receive service, for some i_1(l) and j_1(l). Recalling the four possibilities at
the end of an epoch detailed above, we have i_1(l + 1) = i_1(l) − 1 and j_1(l + 1) = j_1(l), or
i_1(l + 1) = i_1(l) and j_1(l + 1) = j_1(l) + 1, or i_1(l + 1) = i_1(l) − 1 and j_1(l + 1) = j_1(l) + 1,
or T_1(l) finishes because T_1 does. That is, either the jobs with the highest base priority
among those having dedicated processors catch up with the jobs that receive service but do
not have dedicated processors, the latter catch up with the jobs in the class with the lowest
base priority among those that are waiting, or both. Among the classes with jobs having
dedicated processors, none can catch up with the class with the next-higher base priority,
because the jobs of all these classes receive service at the same rate, and so, by (6), their
priorities increase at the same rate, too.
We conclude that there exist values i_1 and j_1 such that the jobs of classes 1, ..., i_1
each have a dedicated processor during the entire first decay cycle T_1, the jobs of classes
i_1 + 1, ..., j_1 receive processor time during T_1 but do not have dedicated processors during
at least part of T_1, and the jobs of classes j_1 + 1, ..., K do not receive service during T_1.
Also, at the end of T_1, we have

    q_k(t_1^−) < q_{k+1}(t_1^−),  k = 1, ..., i_1 − 1, j_1 + 1, ..., K − 1,        (10)
    q_k(t_1^−) ≤ q_{k+1}(t_1^−),  k = i_1, j_1,        (11)
    q_k(t_1^−) = q_{i_1+1}(t_1^−),  k = i_1 + 2, ..., j_1.        (12)

Putting

    γ = α(D − 1)/(RD) + β,        (13)

in general we have

    q_k(t_n^+) = q_k(t_n^−)/D + γRb_k,  k = 1, ..., K,  n = 1, 2, ...;        (14)

indeed, by (6) and (8), q_k(t_n^+) = α b_k + R(v_k(t_n^−)/D + β b_k) = q_k(t_n^−)/D + (α(D − 1)/D + Rβ)b_k.
It easily follows that

    q_k(t_1^+) < q_{k+1}(t_1^+),  k = 1, 2, ..., K − 1.
As a consequence, by induction on n, the operation of the model during Tn is analogous to
that during T1 , although the starting values of the priorities, the lengths of corresponding
epochs, and even the number of epochs may be different. We now define:
i_n as the index such that the jobs of classes 1, ..., i_n have dedicated processors during
T_n, and those of class i_n + 1 do not; if there are no such classes, we set i_n = 0;
j_n as the highest index such that the jobs of class j_n receive a non-zero amount of
processor time during T_n;
Q_n = q_{j_n}(t_n^−) as the highest priority attained at the end of T_n by a class that receives
service during T_n.
Obviously, i_n ≤ j_n and i_n ≤ k₀. If M̂_{k₀} = P, then j_n ≥ k₀, otherwise j_n > k₀. Because of
the way of operation of the scheduling policy explained above, we have for n ≥ 1

    q_k(t_n^−) < q_{k+1}(t_n^−),  k = 1, ..., i_n − 1, j_n + 1, ..., K − 1,        (15)
    q_k(t_n^−) ≤ q_{k+1}(t_n^−),  k = i_n, j_n,        (16)
    q_k(t_n^−) = q_{i_n+1}(t_n^−),  k = i_n + 2, ..., j_n,        (17)
    q_k(t_n^+) < q_{k+1}(t_n^+),  k = 1, 2, ..., K − 1.        (18)
The operation of the model during a decay cycle is illustrated in Figure 1. The dashed
lines indicate the priorities at the start of T_n (we take α = 1). On the uniprocessor, class 1
catches up with class 2 at the end of T_n(1), classes 1 and 2 catch up with class 3 at the end
of T_n(2), but T_n ends before the jobs of class 4 get any service, so i_n = 0 and j_n = 3. On
the multiprocessor, classes 1 and 2 have dedicated processors during T_n, and classes 3, 4 and
5 receive some service, so M̂_2 < P < M̂_5, i_n = 2 and j_n = 5. Note that the area between
the graphs of the priorities at the beginning and at the end of a decay cycle is RPT.
[Figure 1 plots, for each class, the priority against the number of jobs: the dashed lines show the priorities at the start of T_n, the solid lines those at its end, with levels b_1, ..., b_6, q_1(t_{n−1}^+), q_1(t_n^−), q_2(t_n^−), Q_n and an increment RT marked.]
Figure 1: Examples of decay-usage scheduling on (a) a uniprocessor and (b) a multiprocessor.
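The operation just described can be checked numerically. The following time-stepped simulation of the model of Section 2.2 is a sketch of ours (not the simulator of Section 6): it discretizes the processor-sharing dynamics with a step dt, applies (7) within a cycle and (8) at its end, and returns the class shares in the last cycle.

    def simulate(P, T, R, D, alpha, beta, M, b, cycles=50, dt=0.01):
        # M[k], b[k]: number of jobs and base priority of class k+1.
        K = len(M)
        v = [0.0] * K                                   # decayed CPU usage v_k
        shares = [0.0] * K
        for _ in range(cycles):
            c = [0.0] * K                               # CPU time per job this cycle
            for _ in range(int(round(T / dt))):
                q = [alpha * b[k] + R * v[k] for k in range(K)]   # (6)
                left = float(P)
                for k in sorted(range(K), key=lambda i: q[i]):    # lowest priority first
                    f = min(1.0, left / M[k]) if left > 0 else 0.0
                    v[k] += f * dt                      # (7): v_k grows with the share f
                    c[k] += f * dt
                    left -= f * M[k]
                # classes that should share a priority level leapfrog each other
                # here; for small dt this approximates processor sharing
            v = [v[k] / D + beta * b[k] for k in range(K)]        # (8): end-of-cycle decay
            shares = [M[k] * c[k] / (P * T) for k in range(K)]
        return shares

With the System V parameters R = 0.5, D = 2, α = 1, β = 0 and, for instance, P = 2, T = 100, M = [2, 2], b = [0, 10], the returned shares approach the steady-state values derived in Section 3.3.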
In Proposition 1 we show that the ck (n) (or equivalently, the shares sk (n)) are the
solution of a system of linear equations of a size that is not known beforehand, subject to
four inequalities.
Proposition 1. The amounts c_k(n), k = 1, ..., K, of CPU time and the class indices i_n
and j_n are uniquely determined by the set of equations and inequalities

    c_k(n) = T,  k = 1, ..., i_n,        (23)
    q_k(t_{n−1}^+) + Rc_k(n) = q_{i_n+1}(t_{n−1}^+) + Rc_{i_n+1}(n),  k = i_n + 2, ..., j_n,        (24)
    c_k(n) = 0,  k = j_n + 1, ..., K,        (25)
    Σ_{k=1}^{K} M_k c_k(n) = PT,        (26)
    c_{i_n+1}(n) < T,        (27)
    q_{i_n}(t_{n−1}^+) + RT ≤ q_{i_n+1}(t_{n−1}^+) + Rc_{i_n+1}(n),  if i_n > 0,        (28)
    c_{j_n}(n) > 0,        (29)
    q_{j_n}(t_{n−1}^+) + Rc_{j_n}(n) ≤ q_{j_n+1}(t_{n−1}^+),  if j_n + 1 ≤ K.        (30)
PROOF. Because of the definition of i_n and j_n, a solution has to satisfy (in)equalities (23),
(25), (27), and (29). (In)equalities (24), (28) and (30) are rewritten from (16) and (17),
and (26) states that the CPU time consumed is equal to the amount available. So we only
have to prove that there is a unique solution. For fixed values of i_n and j_n, the system of
linear equations (23)-(26) indeed has a unique solution, because, putting κ_k = q_k(t_{n−1}^+)/R
and substituting (23)-(25) into (26), one only has to solve

    Σ_{k=i_n+1}^{j_n} M_k (κ_{i_n+1} + c_{i_n+1}(n) − κ_k) = (P − M̂_{i_n})T        (31)

for c_{i_n+1}(n). The inequalities (27)-(30) delimit the possible values of c_{i_n+1}(n),
which means that for fixed i_n, the ranges of possible values for the left-hand side of (31)
are mutually disjoint for different values of j_n, proving the assertion.
Now we consider solutions for different values of i_n, so assume that there are two so-
lutions (c_1, ..., c_K; s, t) and (c′_1, ..., c′_K; s′, t′) for (c_1(n), ..., c_K(n); i_n, j_n), with s < s′. We
first show that t ≤ t′. Using (24), (28), the fact that c_{s+1} < T = c′_{s+1} because s < s′, and
(15)-(17), we have

    κ_{s+1} + c_{s+1} < κ_{s′+1} + c′_{s′+1}.        (32)

If t > t′, then

    κ_t + c_t = κ_{t′+1} + c_{t′+1} > κ_{t′+1} ≥ κ_{t′} + c′_{t′},        (33)

which contradicts (32), and so we conclude that t ≤ t′. Now substituting (23)-(25) into
(26) for the two solutions and using (32), we have

    (P − M̂_{s′})T = Σ_{k=s′+1}^{t′} M_k (κ_{t′} + c′_{t′} − κ_k) > Σ_{k=s+1}^{t} M_k (κ_t + c_t − κ_k) = (P − M̂_s)T,        (34)

which is a contradiction.
3.2 Convergence
In this section we prove that the decay-usage scheduling policy converges in the sense
described in the introduction of Section 3. The main step is to prove that from decay cycle
T2 onwards, the set of classes with dedicated processors is non-increasing and the set of
classes that receive CPU time is non-decreasing in successive decay cycles (see Proposition
2 below).
Proposition 2. For n = 2, 3, ..., i_{n+1} ≤ i_n and j_{n+1} ≥ j_n.
PROOF. We first prove that i_{n+1} ≤ i_n for n ≥ 2. Assume that i_{n+1} > i_n. For
k = i_n + 1, ..., j_n we have

    q_k(t_n^+) + Rc_k(n + 1) ≤ q_{i_{n+1}+1}(t_n^+) + RT,        (38)

so that c_k(n + 1) > c_k(n) for these classes, and hence

    (P − M̂_{i_n})T = Σ_{k=i_n+1}^{K} M_k c_k(n) < Σ_{k=i_n+1}^{K} M_k c_k(n + 1) = (P − M̂_{i_n})T,

which is a contradiction.
We now prove that j_{n+1} ≥ j_n for n ≥ 2. Assume j_{n+1} < j_n, so c_{j_n}(n + 1) = 0. We first
show that

    c_k(n + 1) < c_k(n),  k = i_{n+1} + 1, ..., j_n.        (39)

We already know that i_{n+1} ≤ i_n, and if i_{n+1} < i_n, then (39) is clear for k = i_{n+1} + 1, ..., i_n.
For the remaining classes, (39) follows by (19), (15)-(17), and because c_{j_n}(n + 1) = 0,
respectively. But then

    (P − M̂_{i_{n+1}})T = Σ_{k=i_{n+1}+1}^{K} M_k c_k(n + 1) < Σ_{k=i_{n+1}+1}^{K} M_k c_k(n) = (P − M̂_{i_{n+1}})T,

which is a contradiction. (For the last equality, we use the fact that i_{n+1} ≤ i_n, which has
already been established.)
Remark 1. The condition n ≥ 2 in Proposition 2 is necessary. A counterexample for
n = 1 is the instance of the model with parameters P = 1, K = 2, b_1 = 0, b_2 = T/2,
M_1 = M_2 = 1, R = 1, D = 2, α = 1, β = 2. There, q_1(t_1^−) = q_2(t_1^−) = 3T/4, q_1(t_1^+) = 3T/8,
q_2(t_1^+) = 13T/8, and so q_2(t_1^+) − q_1(t_1^+) > RT, and class 2 does not receive service in T_2
(in fact, class 2 starves from T_2 onwards). Proposition 2 holds also for n = 1 if βRD ≤ α
(which is the case for System V, 4.3BSD, and Mach): An easy check shows that then (38)
is also true for n = 1.
Corollary. There exist N > 0, i₀ ≥ 0, j₀ ≤ K, such that i_n = i₀ and j_n = j₀ for all n ≥ N.
In fact, from T_N onwards, the model operates as if the P-way multiprocessor were parti-
tioned into M̂_{i₀} uniprocessors (one for each job of classes 1, ..., i₀) and a (P − M̂_{i₀})-way
multiprocessor serving classes i₀ + 1, ..., j₀.
Proposition 3 is only a preparation for the Theorem.
Proposition 3. (a) If during decay cycle T_n, P′ processors only serve job classes k, k +
1, ..., l and only these processors serve these classes, then

    Σ_{i=k}^{l} M_i (q_i(t_n^−) − q_i(t_{n−1}^+)) = RP′T.        (40)
(b) If during decay cycles T_m, ..., T_n, m ≤ n, P′ processors only serve job classes k, k +
1, ..., l and only these processors serve these classes, then

    Σ_{i=k}^{l} M_i (q_i(t_n^−) − q_i(0)) = RD^{m−n} Σ_{i=k}^{l} M_i v_i(t_{m−1}^+) + RP′T Σ_{j=0}^{n−m} D^{−j} + βR Σ_{i=k}^{l} M_i b_i Σ_{j=0}^{n−m−1} D^{−j}.        (41)

If moreover, q_i(t_n^−) = q_k(t_n^−) for i = k + 1, ..., l, then

    q_i(t_n^−) = [Σ_{i=k}^{l} M_i q_i(0) + RD^{m−n} Σ_{i=k}^{l} M_i v_i(t_{m−1}^+) + RP′T Σ_{j=0}^{n−m} D^{−j} + βR Σ_{i=k}^{l} M_i b_i Σ_{j=0}^{n−m−1} D^{−j}] / (M̂_l − M̂_{k−1}),  i = k, ..., l.        (42)
PROOF. By the Corollary, from T_N onwards, the jobs of classes 1, ..., i₀ always have dedi-
cated processors, the jobs of classes i₀ + 1, ..., j₀ are jointly served by P − M̂_{i₀} processors
and have equal priorities at the end of T_n for all n ≥ N, and classes j₀ + 1, ..., K starve.
For each of these three groups of classes, q_k can be computed from (42). For k ≤ i₀ and
k > j₀, the value of c_k is obvious and v_k can be found from

    v_k = (c_k + βb_k D)/(D − 1),        (44)

which is obtained by taking the limit for n → ∞ in (22). For k = i₀ + 1, ..., j₀, c_k and v_k
can be found from q_k = αb_k + Rv_k, which is obtained from (6), and from (44).
Remark 2. In Proposition 5, a more explicit formula for the c_k, k = i₀ + 1, ..., j₀, will be
given.
Remark 3. For n ≥ N, q_k(t_n^−) = Q_n for k = i₀ + 1, ..., j₀, so by (14), q_l(t_n^+) − q_k(t_n^+) =
γR(b_l − b_k) for all n ≥ N and all k, l = i₀ + 1, ..., j₀. We say that the model is in the steady
state, when the allocation of processors to classes does not change anymore, i.e., after the
first decay cycle T_n with i_n = i₀ and j_n = j₀. While the limiting priorities of the Theorem
are never attained, the shares in the steady state are equal to their limits as given in the
Theorem.
Remark 4. By the Theorem, we see that as far as the shares in the steady state and the
limiting priorities are concerned, a model with β > 0 is equivalent to a model with β = 0
and with the base priorities b′_k given by

    b′_k = (1 + βRD/(α(D − 1))) b_k,  k = 1, 2, ..., K.
Assuming that i₀ = 0 and j₀ = K, and writing b̄ = Σ_{k=1}^{K} M_k b_k / M̂_K for the average base
priority (with the parameters as in Section 2), by (43) we have for the actual systems we consider:

    System V: Q = b̄ + 100/l,        (45)
    4.3BSD:   Q = (9/4 + l/2) b̄ + 25/l + 50,        (46)
    Mach:     Q = b̄/2 + 8R_0/3 ≈ b̄/2 + 10.        (47)

(These equations do not include PUSER.) Note that for Mach, the priority Q is invariant
with respect to P and T.
3.3 Steady-State Shares
Assuming the numbers of jobs M_k, k = 1, ..., K, to be fixed, we now show how to compute
the steady-state shares s_k = c_k/(PT) given the base priorities b_k.
Proposition 4. The amounts c_k, k = 1, ..., K, of CPU time and the class indices i₀ and
j₀ are uniquely determined by the set of equations and inequalities

    c_k = T,  k = 1, ..., i₀,        (48)
    c_k = c_{i₀+1} − γ(b_k − b_{i₀+1}),  k = i₀ + 2, ..., j₀,        (49)
    c_k = 0,  k = j₀ + 1, ..., K,        (50)
    Σ_{k=1}^{K} M_k c_k = PT,        (51)
    c_{i₀+1} < T,        (52)
    c_{i₀+1} ≥ T − γ(b_{i₀+1} − b_{i₀}),  if i₀ > 0,        (53)
    c_{j₀} > 0,        (54)
    c_{j₀} ≤ γ(b_{j₀+1} − b_{j₀}),  if j₀ + 1 ≤ K.        (55)
PROOF. In order to find the c_k, we have to solve for c_1, ..., c_K, i₀, j₀ the set of equations
and inequalities obtained by taking the limit for n → ∞ in (23)-(30). Recalling that

    q_k(t_{n−1}^+) + Rc_k(n) = αb_k + Rv_k(t_n^−),  k = 1, ..., K,

and using (44) in (24), (28), and (30), we find the set of equations and inequalities in the
proposition. In a similar way as in Proposition 1, one can prove that this set has a unique
solution.
Because i₀ and j₀ are not known beforehand, it seems that there is no closed-form
expression for the c_k, but they can be computed by the algorithm in Figure 2. We start by
assuming that the jobs in classes 1, ..., k₀ have dedicated processors throughout a decay
cycle in the steady state (step s2), and that if M̂_{k₀} = P (M̂_{k₀} < P), classes k₀ + 1, ..., K
(k₀ + 2, ..., K) starve (step s3). Whenever step s8 is executed, i and j indicate the highest-
numbered class that is assumed to have dedicated processors, and the highest-numbered
class that is assumed to receive any service at all, respectively. In step s8, we solve the
linear system consisting of Equations (48)-(51), which can be rewritten as in (58) and (59).
The condition of step s6 is the same as (55), the condition of step s4 is the same as (53).
Proposition 5. (a) The amounts c_k, k = 1, ..., K, of CPU time and the class indices i₀
and j₀ are correctly computed by the algorithm in Figure 2.
[Figure 2: the algorithm for computing the steady-state amounts of CPU time. Input: P, T, R, D, α, β, K, M_k, b_k, k = 1, 2, ..., K; output: i₀, j₀, c_k, k = 1, 2, ..., K.]
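The original figure cannot be reproduced here; the following Python sketch (our own code, with classes 0-indexed and γ = α(D − 1)/(RD) + β supplied directly) performs the same computation, repeatedly solving (48)-(51) for the current candidate indices and re-checking conditions (55) and (53):

    def steady_state_shares(P, T, gamma, M, b):
        # Steady-state CPU time c_k per decay cycle (Proposition 4).
        # M[k], b[k]: job count and base priority of class k+1 (b increasing).
        # Returns (i0, j0, c): classes 1..i0 have dedicated processors (c = T),
        # classes i0+1..j0 share the rest, classes j0+1..K starve (c = 0).
        K = len(M)
        Mhat = [0]
        for m in M:
            Mhat.append(Mhat[-1] + m)              # Mhat[k] = M_1 + ... + M_k
        assert Mhat[K] > P                         # more jobs than processors
        i = max(k for k in range(K + 1) if Mhat[k] <= P)   # i = k0 (steps s2, s3)
        j = i if Mhat[i] == P else i + 1
        while True:
            if i < j:
                # step s8: solve (48)-(51), i.e.
                # sum_{k=i+1}^{j} M_k (c_{i+1} - gamma (b_k - b_{i+1})) = (P - Mhat_i) T
                top = ((P - Mhat[i]) * T + gamma * sum(
                       M[k] * (b[k] - b[i]) for k in range(i, j))) / (Mhat[j] - Mhat[i])
                bot = top - gamma * (b[j - 1] - b[i])      # c_j via (49)
            else:
                top, bot = 0.0, T                  # only dedicated classes served
            if j < K and bot > gamma * (b[j] - b[j - 1]):
                j += 1                             # step s6 / (55) fails: class j+1 gets service
            elif i > 0 and top < T - gamma * (b[i] - b[i - 1]):
                i -= 1                             # step s4 / (53) fails: class i loses its processors
            else:
                c = [T] * i + [top - gamma * (b[k] - b[i]) for k in range(i, j)]
                return i, j, c + [0.0] * (K - j)

For instance, with γ = 1 (System V) and P = 2, T = 100, M = [2, 2], b = [0, 10], it returns i₀ = 0, j₀ = 2, and c = [55, 45], i.e., shares of 0.55 and 0.45.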
PROOFS. (a) Clearly, the algorithm in Figure 2 computes a solution that satisfies Equations
(48)-(51) and Inequalities (53) and (55). To prove the algorithm correct, we have to prove
that (52) and (54) are also satisfied. We show this by proving that

    (c_{i+1} < T) and (c_j > 0)        (60)

is an invariant of the algorithm. Immediately after step s8 has been executed for the first
time, if M̂_{k₀} = P, we have i = j = k₀, c_{k₀+1} = 0 < T, and c_{k₀} = T > 0, and if M̂_{k₀} < P, we
have i + 1 = j = k₀ + 1, and 0 < c_{k₀+1} = (P − M̂_{k₀})T/M_{k₀+1} < T. In general, the solution
of (58) and (59) for the current values of i and j satisfies

    Σ_{k=i+1}^{j} M_k (c_{i+1} − γ(b_k − b_{i+1})) = Σ_{k=i+1}^{j} M_k (c_j − γ(b_k − b_j)) = (P − M̂_i)T.        (61)
The body of step s6 is re-executed if either the condition of step s6 or the condition of step
s4 is not satisfied. In the first case, we have

    c_j > γ(b_{j+1} − b_j).        (62)

By (58), we then also have

    c_{i+1} > γ(b_{j+1} − b_{i+1}).        (63)

Denoting the solution of (58) and (59) of the re-execution by c′_{i+1}, ..., c′_{j+1}, we have

    Σ_{k=i+1}^{j+1} M_k (c′_{i+1} − γ(b_k − b_{i+1})) = Σ_{k=i+1}^{j+1} M_k (c′_{j+1} − γ(b_k − b_{j+1})) = (P − M̂_i)T.        (64)

Comparing the left-hand sides of (61) and (64) and using (63), we find

    (M̂_{j+1} − M̂_i)c′_{i+1} = (M̂_j − M̂_i)c_{i+1} + M_{j+1} γ(b_{j+1} − b_{i+1}) < (M̂_{j+1} − M̂_i)c_{i+1},

and so c′_{i+1} < c_{i+1} < T. Comparing the middle terms of (61) and (64) and using (62), we
find

    (M̂_{j+1} − M̂_i)c′_{j+1} = (M̂_j − M̂_i)(c_j − γ(b_{j+1} − b_j)) > 0,

and so c′_{j+1} > 0.
Now assume that the body of step s6 is re-executed because the condition of step s4 is
not true, so

    c_{i+1} < T − γ(b_{i+1} − b_i).        (65)

By (58), we then also have

    c_j < T − γ(b_j − b_i).        (66)

Denoting the solution of (58) and (59) of the re-execution by c′_i, ..., c′_j, we have

    Σ_{k=i}^{j} M_k (c′_i − γ(b_k − b_i)) = Σ_{k=i}^{j} M_k (c′_j − γ(b_k − b_j)) = (P − M̂_{i−1})T.        (67)

Comparing the left-hand sides of (61) and (67) and using (65), we find

    (M̂_j − M̂_{i−1})c′_i = M_i T + Σ_{k=i+1}^{j} M_k (c_{i+1} − γ(b_i − b_{i+1})) < (M̂_j − M̂_{i−1})T,

so c′_i < T. Comparing the middle terms of (61) and (67) and using (66), we find

    (M̂_j − M̂_{i−1})c′_j = (M̂_j − M̂_i)c_j + M_i (T − γ(b_j − b_i)) > (M̂_j − M̂_{i−1})c_j,

and so c′_j > c_j > 0.
(b) The formula for the c_k can be obtained by substituting (48)-(50) into (51), computing
c_{i₀+1}, and using (49).
(c) The share ratios follow directly from (b).
As a special case, when there are only two classes, class-1 jobs do not have dedicated
processors, and class-2 jobs do not starve, then (57) can be written as

    s_1/s_2 = [PTRD + (α(D − 1) + βRD) M_2(b_2 − b_1)] / [PTRD − (α(D − 1) + βRD) M_1(b_2 − b_1)],        (68)

which for the three actual systems we consider, yields

    System V: s_1/s_2 = [100P + M_2(b_2 − b_1)] / [100P − M_1(b_2 − b_1)],        (69)
    4.3BSD:   s_1/s_2 = [25P + (1/((M_1 + M_2)/P + 0.5) + 0.25) M_2(b_2 − b_1)] / [25P − (1/((M_1 + M_2)/P + 0.5) + 0.25) M_1(b_2 − b_1)],        (70)
    Mach:     s_1/s_2 = [60.8(M_1 + M_2) + 3M_2(b_2 − b_1)] / [60.8(M_1 + M_2) − 3M_1(b_2 − b_1)].        (71)
3.4 Heterogeneous Workloads
It is a natural question how our decay-usage policy behaves under a heterogeneous work-
load consisting of long and short compute-intensive jobs (for instance, real-time system
functions), or when some jobs are not compute-bound (for instance, interactive work). In
the former case, it seems impossible to analyze the impact of the short-running jobs on the
long-running ones with only a stochastic description of the former part of the workload in
terms of the distributions of the inter-arrival times and service times. Only when the base
priorities of the short jobs are so low that their priorities never get as high as the priorities of
the long jobs, and when the short jobs jointly take a constant amount of time during each
decay cycle, the solution is simple: In order to find the shares of the long-running jobs,
replace T by the amount of time T′ remaining for these jobs.
When there are also jobs that perform I/O operations, our analysis is still valid provided
that the amount of time spent waiting for I/O per decay cycle and per job is not very large.
During an I/O operation, a job is suspended and so its priority remains constant, assuming
that the operation is completed in the same decay cycle in which it started. For disk I/O
operations, this is probably very often the case, because such an operation takes on the
order of tens of milliseconds and the length of a decay cycle is one second. When, after the
I/O operation has finished, the job becomes runnable again, its priority will fall short of
the priority of the other jobs in its class, so it will be preferred for using a CPU until its
priority becomes equal to that of the other jobs in its class.
4 Exercising Control
In this section we deal with the control that can be exercised by the decay-usage scheduling
policy over the share ratios. We show how to set the base priorities given the required
shares, we trace the influence of the scheduler parameters on the share ratios that can be
attained, and we investigate the effect on these ratios of the bounds on the CPU usage and
priority fields in actual systems.
4.1 Achieving Share-Ratio Objectives
In Section 3.3 we showed how to compute the steady-state shares from the base priorities.
Conversely, one may want to set share-ratio objectives in terms of the required amounts ck
of CPU time in a decay cycle or in terms of the required shares sk , and compute a set of base
priorities b_k (or rather the differences b_k − b_1) yielding these shares. We can assume that
T > c_1 > ··· > c_K > 0 (or 1/P > s_1 > ··· > s_K > 0), so i₀ = 0 and j₀ = K. In addition,
we assume first that all the available capacity is requested, that is, that Σ_{k=1}^{K} M_k c_k = PT
(or Σ_{k=1}^{K} M_k s_k = 1). Then there is always a solution, which can be found by inverting (58),
after putting b_1 equal to an arbitrary non-negative value:

    b_k = b_1 + (c_1 − c_k)/γ,  k = 2, ..., K.        (72)
For P = 1, (72) coincides with Equation 20 of [11]. In actual systems, the values of the
base priorities are confined to be integers. Then the integer solution which is closest to the
solution of (72) has to be chosen, which may yield share ratios that deviate considerably
from the objectives.
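As an illustration of (72) and of the integer constraint (an example of ours): with the System V parameters (γ = 1) on a uniprocessor with P = 1 and T = 100, two classes with M_1 = M_2 = 1, and the share objective s_1 : s_2 = 2 : 1, we get c_1 = 200/3 and c_2 = 100/3, so (72) gives

    b_2 = b_1 + (c_1 − c_2)/γ = b_1 + 100/3 ≈ b_1 + 33,

which already lies outside the nice-value range of 0 through 19 (a limitation taken up in Section 4.2).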
In [11], the behavior of the decay-usage policy for uniprocessors is also analyzed in the
underloaded case characterized by Σ M_k c_k < PT, and in the overloaded case defined by
Σ M_k c_k > PT. It is shown that in either case s′_k − s_k = s′_1 − s_1, k = 2, ..., K, where
the s′_k denote the obtained steady-state shares, provided that no starvation occurs in the
overloaded case, i.e., that s′_K > 0. Measurements in [11] show that the policy indeed
behaves in this way in practice. This property of equal differences between the required
and the obtained shares clearly carries over to multiprocessors in those cases when no
class has dedicated processors and no class starves: The underloaded and overloaded cases
correspond to lengthening and shortening the decay cycle, which in general amounts to
lengthening and shortening the last epoch, in which all jobs of all classes evenly share all
processors. So the decay-usage scheduling policy is fair in the sense that an excess or
deficit of capacity is spread equally over all jobs. It would perhaps be a more desirable,
and fairer, policy if it enjoyed the property that s′_k/s_k = s′_1/s_1, k = 2, ..., K. Clearly,
lottery scheduling [18], stride scheduling [19] and time-function scheduling [7] do enjoy this
property.
One cannot easily employ the decay-usage scheduling policy to achieve Priority Pro-
cessor Sharing (PPS, alternatively called discriminatory processor sharing, see [6, 14]) or
Group Priority Processor Sharing (GPPS). In either of these two policies, for each class k, a
priority r_k is defined. In PPS, every job of class k has a processor share of r_k / Σ_l M_l r_l, and
so the share ratios r_k/r_l are constant, independent of the numbers of jobs in the classes.
In GPPS, the jobs of class (or group) k jointly have a processor share of r_k / Σ_l r_l, and jobs
within a class have equal shares. By (57), achieving these policies with decay-usage schedul-
ing entails complicated recomputations of the base priorities on arrivals and departures of
ing entails complicated recomputations of the base priorities on arrivals and departures of
jobs. One can however easily modify the decay-usage scheduling policy to implement PPS
in a simple way. This modification consists in setting b_k = 0, k = 1, ..., K, D = 2, α = 1,
and β = 0, and in replacing R by class-dependent parameters R_k, k = 1, ..., K. Then
s_k/s_l = R_l/R_k, so one should put R_k = 1/r_k. In fact, this modification of decay-usage
scheduling is nothing else than stride scheduling [19]. PPS and GPPS can easily be achieved
by lottery scheduling [18] and by stride scheduling [19]. For PPS, on arrival, a job simply
gets the same number of tickets as the other jobs in its group, and after a departure, nothing
has to be done. For GPPS, the currency of the group of an arriving or a departing job has
to be inflated or deflated, respectively. GPPS is also achieved by time-function scheduling
[7].
4.2 Scheduler Parameters and the Range of Share Ratios
In this section we will trace the impact of the values of the parameters in the decay-usage
scheduling model on the share ratios given by (57), where we assume that i₀ = 0 and j₀ = K.
As far as the steady-state shares are concerned, the scheduler parameters R, D, α, β and the
system parameters P, T are not independent: Only the value of γ/(PT) is relevant. The
larger this value is, the higher the level of control is that can be exercised by the decay-
usage scheduling policy, that is, the larger the range of possible share ratios.
If we assume that R and D are constants, γ = α(D − 1)/(RD) + β can assume any positive
value, and so, as far as the steady-state share ratios are concerned, one parameter in the
scheduler would be sufficient, instead of four. For instance, one can take D = 2, α = 1, and
β = 0 (these are the values of System V), with R the only remaining scheduler parameter. A
further consequence of (57) for constant R and D is that it is immaterial whether there are
P processors of equal speed (each delivering T clock ticks per decay cycle), or one processor
which is P times as fast (delivering PT clock ticks per decay cycle). In either case, an
amount PT of processor time is delivered in one decay cycle, and the steady-state shares
are equal. In addition, in order to achieve the same level of control for fixed numbers of jobs
for different values of P and T, the range of base priorities has to be proportional to either
of these two parameters. In large multiprocessors, one may have the option to partition
the system logically into a set of multiprocessors with a smaller number of processors each,
for instance, in order to reduce the contention for the central run queue or so as to assign
parts of the machine to different sets of applications. Assuming that the ranges of the base
priorities will be the same in the components of the partitioned multiprocessor and in the
original system, and that the numbers of jobs will be roughly proportional to the sizes of
the components, partitioning yields about the same level of control.
Using the values of the parameters given at the end of Section 2.2, we have for the three
systems considered:

    System V: γ = 1,        (73)
    4.3BSD:   γ = (2l + 9)/(2l + 1),        (74)
    Mach:     γ = 3PT/(16R_0 M̂_K) = 3T/(16R_0 l).        (75)
For fixed values of P and T and for a fixed range of base priorities, this means that 4.3BSD
has a higher level of control than System V, that Mach has a higher (lower) level of control
than System V for loads lower (higher) than 5, and finally, that 4.3BSD has a higher level
of control than Mach, except for loads smaller than about 2. As an example, consider the
case of two classes with P = M_1 = M_2 = 1. Then, because b_2 − b_1 ≤ 19, we find from (69),
(70), and (71), that in these three systems, s_1/s_2 ≤ 1.47, s_1/s_2 ≤ 2.95, and s_1/s_2 ≤ 2.60,
respectively. (In Mach, b_2 = 19 is treated as if b_2 = 18 because of the multiplication by
α = 0.5.) For two cases with two classes, the levels of control for the three systems are
depicted in Figure 3. In Figure 3a, the load equals 6, and 4.3BSD has the highest level of
control and Mach the lowest over the whole range of base priorities of class 2. In Figure
3b for varying load, the share ratio is constant for Mach (cf. (71)), again 4.3BSD has the
highest level of control, and the graphs for System V and Mach intersect between the loads
of 4 and 6.
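These bounds are easy to check numerically against (69)-(71); the following snippet (ours) prints approximately 1.47, 2.95, and 2.60:

    def ratio(base, coef, M1, M2, delta):
        return (base + coef * M2 * delta) / (base - coef * M1 * delta)

    P, M1, M2 = 1, 1, 1
    l = (M1 + M2) / P
    print(ratio(100 * P, 1.0, M1, M2, 19))                  # System V, (69)
    print(ratio(25 * P, 1 / (l + 0.5) + 0.25, M1, M2, 19))  # 4.3BSD, (70)
    print(ratio(60.8 * (M1 + M2), 3.0, M1, M2, 18))         # Mach, (71); b2 treated as 18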
In Mach, in which R is not a constant, because of (75), the share ratios given in (57)
reduce to

    s_k/s_l = [16R_0 M̂_K + 3 Σ_{i=1}^{K} M_i(b_i − b_k)] / [16R_0 M̂_K + 3 Σ_{i=1}^{K} M_i(b_i − b_l)],  k, l = 1, 2, ..., K,        (76)

which is invariant with respect to P and T, so the range of base priorities does not have to
be adjusted for multiprocessors or for different processor speeds.
If R and D are constants (as in System V), or if they depend on the other parameters of
the model in such a way that γ only depends on the load l (as in 4.3BSD), then by (57), the
share ratios do not change when P and the M_k are replaced by λP and λM_k, k = 1, 2, ..., K,
with λ a positive integer. By (76), the share ratios in Mach do not change even when only
the M_k are replaced by the same multiples.
As to the differences between the steady-state shares of classes, from (49) we find

    s_k − s_l = γ(b_l − b_k)/(PT),  k, l = 1, ..., K,

so again, if R and D are constants, increasing P and/or T reduces the contrast among
classes while increasing γ increases it, as was already concluded in [11], Section IV-A, for
the uniprocessor case of System V.
4.3 Bounds in the Scheduler
In actual systems, the values of p cpu and p usrpri are stored as non-negative integers in
fields of finite size, and so these values each have a maximum. The corresponding values
in the model are v_k and PUSER + q_k. We will now qualitatively indicate how our decay-
usage scheduling policy behaves when there is either only a bound for the CPU usages or only a
bound for the priorities; considering what happens when there are bounds for both quantities
is rather complicated, and in the actual systems we consider, only one of the bounds plays
a role. We refer to the model with (without) bounds as the constrained (unconstrained)
[Figure 3 shows two panels of curves of the ratio of shares for 4.3BSD, System V, and Mach; the vertical axes run to 40 in (a) and to 20 in (b).]
Figure 3: Ratio of shares (s_1/s_2) versus (a) the base priority of class 2 (b_2) and (b) the
load (l) for different parameterizations of the decay-usage scheduler (K = 2, b_1 = 0; in (a),
M_1 = 5P, M_2 = P; in (b), M_1 = M_2 = MP, l = 2M, b_2 = 10).
model. Throughout this section, we assume that i₀ = 0 and j₀ = K in the unconstrained
model.
Let us first assume there is a bound v̄ on the CPU usage. In the unconstrained model,
v_1 > v_k, k = 2, ..., K, so if some v_k attains the value v̄, v_1 will certainly do so. Assume that
when v_1 attains v̄ during a decay cycle in the steady state, classes 1, ..., K receive service.
From then on, the priority of class 1 cannot increase anymore, but the slightest additional
amount of service to any of the classes k = 2, ..., K will increase its priority q_k(·) beyond
q_1(·), and so during the remainder of the decay cycle, either class 1 monopolizes all CPUs
and classes 2, ..., K do not receive service (M_1 ≥ P), or the class-1
jobs will continue using dedicated CPUs, leaving the remainder (at that instant) to classes
2, ..., K (M_1 < P). Subsequently, v_2 may attain the value v̄, etc. We conclude that the
bound v̄ on the CPU usage favors the lower classes.
We now turn to the case of a bound, say q̄, on the priorities. We claim that such a
bound in general favors the higher classes. In the unconstrained model, we denote by q_i^u(t)
the priority of class i at time t, and by Q the limiting priority of all classes at the end of
a decay cycle in the steady state. Let q_i'(t) = γb_i + R v_i'(t) be the virtual priority of class i
in the corresponding constrained model, where v_i' is defined as v_i in item 4 of Section 2.2, and
let q_i' = lim_{n→∞} q_i'(t_n^-). The (real) priority q_i^c(t) of class i in the constrained model is equal
to q_i^c(t) = min(q̄, q_i'(t)). The amount c_i^c(n) of CPU time of class i in decay cycle T_n in the
constrained model is given by (cf. (19))

    c_i^c(n) = (q_i'(t_n^-) - q_i'(t_{n-1}^+))/R.  (77)

If Q ≤ q̄, the constrained model behaves exactly like the unconstrained model, and the
steady-state shares are the same. Let us now assume that Q > q̄. Then, in the constrained
model, q_i' > q̄, i = 1, …, K, because for at least some i this has to hold, and if it would
not hold for some class, that class would have dedicated processors, and as a consequence,
q_i' would even exceed Q. Now if lim_{n→∞} q_i'(t_n^+) ≤ q̄ for i = 1, …, K, and if all classes
reach priority q̄ in the steady state simultaneously, obviously, the unconstrained and the
constrained models again behave in the same way, and q_i'(t) = q_i^u(t) for t ≥ 0 and i =
1, …, K. Now assume that in every decay cycle T_n with n ≥ n_0 for some n_0, some class l
attains priority q̄ at least an amount of time t_0 > 0 before some class k does, with k < l.
(It is easy to see that this is possible; class l may even have q_l^c(t_n^+) = q̄ in the steady state.)
Because all classes whose (real) priorities are equal to q̄ share processors evenly, we then
have q_l'(t_n^-) > q_k'(t_n^-) for n ≥ n_0, and q_l' > q_k' (the difference between the latter can be
bounded from below by an expression in t_0 and the parameters of the model). Similarly as
in (14), we have

    q_i'(t_n^+) = q_i'(t_n^-)/D + Rβ b_i,  i = 1, …, K,

and so, using (77) and putting c_i^c = lim_{n→∞} c_i^c(n), we have

    c_k^c - c_l^c = (D - 1)(q_k' - q_l')/(RD) + β(b_l - b_k) < β(b_l - b_k),

which by (49) proves our claim. In particular, in the constrained model, the classes i with
q_i^c(t_n^+) = q̄ in the steady state cannot even be distinguished anymore. Also in [11], Section
IV-C, it was concluded that an upper bound to the priority reduces the level of control.
Usually in actual systems, p_cpu ≤ 255, and in System V and in 4.3BSD, p_usrpri ≤ 127,
where the latter value contains the constant PUSER = 50. In general, by (9) we have

    v_k(t_n^-) ≤ (T + δ b_k) D/(D - 1).

Because in System V, D = 2 and δ = 0, only the bound on the priority may be attained.
Now by (45), the condition that the system operates in an unconstrained fashion is

    b̄ + 100/l ≤ 77,

where b̄ = Σ_{i=1}^K M_i b_i / M̂_K is the average base priority, so if l is not very low, this
bound is not reached. This conclusion is in accordance with [11].
For 4.3BSD, assume b1 = 0. Then by (1) and (2), only the bound on the CPU usage
p_cpu matters. So the condition that the system operates in an unconstrained fashion is
v1 ≤ 255. Because q1 = Q and R = 0.25, and by (6) and (46), this is equivalent to

    (9 + 2l) b̄ + 100/l ≤ 55.  (78)
For instance, when K = 2 and M1 = M2 = P, then v1 > 255 for all values of b2 > 0. We
have done measurements on 4.3BSD uniprocessor systems in cases with two classes in which,
according to the model, the bound v̄ is attained by v1, and these confirm the conclusion
that in such a case, the bound v̄ favors the lowest class (cf. Section 6.4).
In Mach, by (47) the bound of 31 on the priority cannot be reached because PUSER =
12 and b̄ ≤ 18. Because q1 = Q and R = l R0/T, and by (6) and (47), the condition v1 ≤ 255
is equivalent to

    800/(3l) + 50(b̄ - b1)/(3.8l) ≤ 255,  (79)

which is always satisfied when l ≥ 2, because b̄ ≤ 18.
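The two conditions can be evaluated numerically; the following lines (assuming the
reconstructed forms of (78) and (79), with b̄ the average base priority as above) check the
example cases just mentioned.

    # 4.3BSD, K = 2, M1 = M2 = P (so l = 2), b1 = 0: already for b2 = 1 the
    # left-hand side of (78) exceeds 55, so v1 > 255 for all b2 > 0.
    l, b2 = 2, 1
    b_bar = b2 / 2                                     # average base priority
    print((9 + 2 * l) * b_bar + 100 / l)               # 56.5 > 55

    # Mach, l = 2, b1 = 0, b_bar at its maximum of 18: (79) is satisfied.
    print(800 / (3 * l) + 50 * (18 - 0) / (3.8 * l))   # ~251.7 <= 255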
5 Analysis of the Transient Behavior
In our model we have made the rather artificial assumption that all jobs are present at time
t = 0, and that no jobs depart. In this section we will do away with this assumption by
showing that the Corollary and the Theorem still hold if after some point in time, no arrivals
or departures occur. In order to do so, we shift the time origin to the start of the first decay
cycle after the last arrival or departure. Then we cannot assume anymore that the v_k(0^+)
are equal to zero. However, our treatment remains valid when this is not the case, as long
as the orderings of the classes according to their base priorities and to their priorities at
t = 0^+ coincide (i.e., as long as priority inversion (see below) is absent). This means that
it is sufficient to show that when priority inversion does occur, it disappears within a finite
number of decay cycles. We give upper bounds on this number, and subsequently, we show
how to deduce the rate of convergence to the steady state starting from a situation without
priority inversion. Putting these two results together, we have dealt with the transient
behavior of our model in that we have determined a bound on the rate of convergence to
the steady state starting from any situation that can come about in the model (including
arrivals and departures). In general, it is not necessary to add the amounts of time to
let priority inversion disappear and to converge from there to the steady state. Priority
inversion is a step towards convergence, and in fact, when there are only two classes, it is
the same. Our conclusion will be that the rate of convergence can theoretically be arbitrarily
low. However, in System V and Mach, convergence takes at most about 20 seconds, while
in 4.3BSD, it may take some minutes.
5.1 Priority Inversion
In dealing with arrivals, we face two problems. First, even if the base priority of an arriving
job is equal to that of some class already present, we cannot in general include the job in
that class, because in our treatment, we assumed that the jobs in the same class have equal
CPU usages. However, there is no difficulty in having simultaneously arriving jobs with
equal base priorities constitute a class of their own, because while we assumed that b_k < b_l
for k < l, our treatment remains valid if we only had assumed that b_k ≤ b_l when k < l.
Second, priority inversion may occur. We say that priority inversion occurs for classes k, l
with b_k < b_l at time t, if q_k(t) > q_l(t). Clearly, departures do not cause additional problems.
We now show that if priority inversion does occur at time t = 0 for classes k and l with
k < l, then within a finite number of decay cycles, it has disappeared. By (6) and (8), we
have

    q_i(t_n^+) = γ b_i + R ( v_i(0^+)/D^n + Σ_{j=1}^n c_i(j)/D^{n+1-j} + δ b_i Σ_{j=0}^{n-1} D^{-j} ),  i = k, l.  (80)
The priority inversion has disappeared when q_k(t_n^+) ≤ q_l(t_n^+). Because when priority
inversion occurs at the beginning of T_j, c_k(j) ≤ c_l(j), this certainly holds when

    R ( v_k(0^+) - v_l(0^+) )/D^n ≤ γ(b_l - b_k) + R δ (b_l - b_k) Σ_{j=0}^{n-1} D^{-j}.  (81)
By (8) and (9), when during a run of the model D is bounded from below by D* > 1 (in
particular, when D is a constant, we can take D* = D), we have

    v_k(0^+) < (T + δ b_k) D*/(D* - 1).  (82)

Now replacing v_k(0^+) by this bound, putting v_l(0^+) = 0, and setting b_l - b_k to its smallest
possible value, we can find the lowest value for n for which (81) holds. Clearly, in theory it
can take arbitrarily many decay cycles before priority inversion has disappeared.
One has to keep in mind that in actual systems, the (base) priorities and the CPU
usages are stored in integers, and that the result of any operation on them is rounded
down. One easily checks that in order to satisfy (81), in System V and Mach, n = 6 and
n = 9 are sufficient, respectively. For 4.3BSD, we saw in Section 4.3 that v_k(0^-) can attain
its bound of 255. Then by (8), v_k(0^+) can also attain this bound, so for 4.3BSD, we have
to put v_k(0^+) = 255 in (81). For l ≥ 127, each division of v_k(0^+) by D = (2l + 1)/(2l), where
l is the load after t = 0, in (81) lowers its value only by 1, so lifting the priority inversion
may take as many as 255 decay cycles. However, for l ≤ 10, n = 64 suffices.
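The smallest n satisfying the sufficient condition (81) can be computed with a few lines of
Python. The sketch below works in continuous arithmetic (that is, without the rounding
down of actual systems, which is why it returns 7 for System V where the text obtains 6);
gamma and delta denote the model parameters of Section 2.2 as reconstructed here, and the
System V values used in the example are assumptions of the sketch.

    import itertools

    def cycles_to_lift_inversion(R, D, gamma, delta, v_k0, b_diff=1):
        # Smallest n with R*v_k0/D**n <= gamma*b_diff + R*delta*b_diff*sum(D**-j),
        # i.e., condition (81) with v_l(0+) = 0 and b_l - b_k = b_diff.
        for n in itertools.count(0):
            rhs = gamma * b_diff + R * delta * b_diff * sum(D ** -j for j in range(n))
            if R * v_k0 / D ** n <= rhs:
                return n

    # System V: R = 1/2, D = 2, gamma = 1, delta = 0, and v_k(0+) below its
    # bound T*D/(D-1) = 200 for T = 100:
    print(cycles_to_lift_inversion(0.5, 2, 1, 0, 200))   # 7 (6 with rounding)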
5.2 The Rate of Convergence
In theory, it can take arbitrarily many decay cycles before the steady state as defined in
Remark 3 is attained in the model. In order to demonstrate this, take the instance of the
model with P = 1, K = 2, δ = 0, and let Q be the limiting priority of class 1 if class 2 were
absent. Now choose b2 such that q2(0) = Q - ε for some small positive ε.
We now show how one can derive the rate of convergence of the decay-usage scheduling
policy, i.e., the number of decay cycles before the classes receive their steady-state shares.
Because of arrivals and departures as treated in Section 5.1, we allow v_k(0^+) > 0, but we
assume that there is no priority inversion. In addition, we assume that i_2 ≤ i_1 and j_2 ≥ j_1 by
taking t_1 = T as the origin of time, if necessary (cf. Proposition 2 and Remark 1), and that
i_0 = 0 and j_0 = K. By Proposition 2, either class 1 or class K is the last one to join the set
of classes sharing all P processors. In the former case, which is only possible when M1 < P,
we can use (42) with m = 1 for k = l = 1, P' = M1 and for k = 2, l = K, P' = P - M1,
respectively, to find the smallest value of n for which the priority of class 1 with dedicated
processors would at least be equal to the priority of classes 2, …, K. If class K is the
last to join the other classes in sharing all P processors, we can use (42) with m = 1 for
k = 1, l = K - 1, P' = min(P, M̂_{K-1}) and for k = l = K, P' = max(0, P - M̂_{K-1}) to find
the smallest value of n for which the priority of classes 1, …, K - 1 would at least be equal
to the priority of class K.
Because of the rounding down to integer values, we see from (42) that the priorities
have certainly reached their limiting values for n → ∞ in System V, Mach, and 4.3BSD, for
n = 7, n = 10, and n = 256, respectively. It can be shown that in all 4.3BSD measurements
reported in Section 6.4, in which we start with all jobs at time 0 and so v_k(0^+) = 0 for all
k, the steady state must have been reached within 17 decay cycles.
usages or the priorities. It is characterized by the parameters P, T, R, D, γ, δ, K, M_k, b_k of
our model, and by the number of clock ticks in a time quantum. It can run with and
without division by 4 of the priority when entering a job in a run queue. We have observed
in the simulation output that the effect of this division is to lower the share ratios s_k/s_l,
k < l, as may be expected, because the scheduler can distinguish less well between different
priorities. However, we found this effect to be marginal, and all simulations reported below
are without this division by 4.
In the simulations, the simulated time in each experiment was also 20 minutes, and the
ratios of shares were computed in the same way as for the measurements. In coarse-grained
(cg) simulations, we used the values of γ and δ given in Section 2, we put T = 100, and we
let the number of clock ticks per quantum be 1 (System V) or 10 (4.3BSD and Mach). It
turned out that the results of such simulations can deviate greatly from the model output.
Because we suspected that the coarse discretization of time was to blame for this, in some
cases we also ran fine-grained (fg) simulations, in which T = 1000, in which γ and δ were
replaced by 10γ and 10δ, respectively, and in which the number of clock ticks per quantum
was always 10. In addition, for Mach, R0 was replaced by 10R0. By (13) and (57), this
does not change the ratios of shares, but continuous time is approximated more closely.
Simulations were coarse-grained, unless otherwise stated.
UNIX reorders the run queues at the end of each second by a linear scan of the process
table. As a consequence, the order in which the jobs in an experiment appear in this table
potentially has a strong effect on the share ratios, favoring jobs in lower positions in the
table. It turns out that this order is usually the order of the creation times of the jobs. Our
simulator mimics the UNIX process table, so a similar problem occurs there. In order to
exclude any bias due to the order in the process table, both in the simulations and in the
measurements, jobs are created in random order with respect to their base priorities. (This
is the only random element in our simulations.) However, in both the simulations and the
measurements, the variability of the shares of CPU time obtained by the jobs within a single
class was very small (at most 3%, usually much smaller). Also, the share ratios obtained
from different simulation runs and different measurements (the latter were all performed
twice) for the same experiments showed little variability, and so for both, below we always
simply report a single result.
In our model, the ratios of the steady-state shares s_k/s_l are realized during every single
decay cycle. Clearly, if a steady state is reached in a simulation, in the sense that at the
start of successive decay cycles all priorities have the same values and the jobs have the
same order in the run queues, there is only a very restricted set of possible share ratios, and
we can expect large deviations from the model. For instance, if P = 1, M1 = 3, and M2 = 1,
then on a 4.3BSD system, which has 10 quanta a second, only s1/s2 = 1/7, 1/2, 3, or ∞ are
possible (see the sketch after Figure 4). Below we will find that there is sometimes a very
good match of the model and the simulations, which almost always means that there is no
steady state in the simulations and that the share ratios are only achieved across a number
of decay cycles, a situation for
30
10 5
model System V
8 simulation System V ? r
4
model 4.3BSD r r
6 simulation 4.3BSD
r
3
ratio of r
shares
r
r
4 2
?
r
r r r
? ? ? ?
r r r
2 ?
r
1
r
model Mach
? ?
r
? ?
r
simulation Mach
? ? ? ? ? ? ? ? ?
r
r r r
fg simulation Mach
r r r
r
0 0
1 3 5 7 9 11 13 15 17 19 2 4 6 8 10 12 14 16 18
base priority of class 2 base priority of class 2
(a) (b)
Figure 4: Ratio of shares (s1 =s2 ) versus the base priority of class 2 (b2) (K = 2; M2 =
P; b1 = 0; in (a), P = 1; 4; M1 = 3P ; in (b), P = 1; 4; 8; 16; M1 = 5P ).
which our model says nothing. The same holds for the measurements.
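The restricted set of ratios mentioned above (P = 1, M1 = 3, M2 = 1, 10 quanta per decay
cycle) can be enumerated directly. This sketch assumes that in a steady state each class-1
job receives the same whole number q1 of quanta per cycle, the class-2 job receiving the
remainder; a ratio of ∞ corresponds to the class-2 job never reaching the head of the queue
(starvation).

    # Possible steady-state share ratios with 10 whole quanta per decay cycle:
    for q1 in range(1, 4):           # quanta per class-1 job
        q2 = 10 - 3 * q1             # quanta for the class-2 job
        print(f"q1 = {q1}, q2 = {q2}, s1/s2 = {q1 / q2}")
    # q1 = 1 -> 1/7, q1 = 2 -> 1/2, q1 = 3 -> 3; if class 2 gets no quantum at
    # all, s1/s2 is infinite (class 2 starves).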
6.2 The Model versus the Simulations
We compare the model output with simulation results for four representative experiment
sets.

I) Two classes, fixed numbers of jobs, increasing base priority of class 2.
We first consider System V and 4.3BSD, see Figure 4a. It turns out that the simulation
output is identical for P = 1 and P = 4 (as is of course the model output). For 4.3BSD
and b2 = 16, 17, the model and the simulations give ratios of 13.95 and 36.03, and of 13.00
and 26.33, respectively; for b2 = 18, 19, in both the model and the simulations, class-2
jobs starve. The model output and the simulation results match quite well. Furthermore,
4.3BSD discriminates much more between jobs with the same difference in base priorities
than System V does, as was to be expected (cf. Section 4.2). For Mach, the cg simulations
deviated somewhat from the model, so we also ran fg simulations, which match the model
better (see Figure 4b). For each of the values 1, 4, 8, 16 for the number P of processors, both
the cg and the fg simulations gave identical results.
II) Two classes, fixed base priorities, increasing number of jobs with the lowest base priority.
Again we first consider System V and 4.3BSD, see Figure 5a. The model output is correct
for any value of P; the simulations have been run for P = 1, 4 with identical results. For
4.3BSD and M1 = 12, 13, the model gives ratios of 16.17 and 223.00, respectively, and
for M1 ≥ 14 it indicates starvation for class 2, while the simulations give starvation for
M1 ≥ 12. The deviation between the model and the cg simulations is considerable for some
values of M1/P, and is caused by the discrete values in the scheduler, as we now show.
(For M1/P ≥ 10, for the cg simulations, starvation of class 2 occurs.) Traces of the cg
simulations show that for each of the values 6, 7, 8, 9 of M1/P, the class-2 jobs get almost
exactly the same amount of CPU time: for r = M1/M2 = 7, 8, 9, the share ratios are
virtually equal to 6/r times the ratio for M1/M2 = 6. The explanation is that for
M1/P = 6, 7, 8, 9, the following sequence of events during a decay cycle occurs on the
average the same number of times. The class-1 jobs make their way to the queue with the
class-2 jobs by receiving a quantum, then each class-2 job gets a quantum, and then the
class-1 jobs consume the remaining quanta. For M1/P ≥ 10, in every decay cycle, each of
the class-1 jobs needs a quantum before it can reach the queue with the class-2 jobs, and
so the latter starve.

Figure 5: Ratio of shares (s1/s2) versus the number of class-1 jobs per processor (M1/P)
(K = 2, M2 = P, b1 = 0; in (a), P = 1, 4, b2 = 6; in (b), P = 1, 4, 8, 16, b2 = 18).
Finally, note the qualitative difference in the behavior of System V and 4.3BSD on the
one hand and Mach on the other. In the former two systems, class-2 jobs starve when the
number of class-1 jobs is high enough, but in the latter, the share ratio has a limit for
increasing M1/P (of about 8.94 by (76)).
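This limit follows directly from (76): as M1 grows, the terms with M1 dominate, so for
b1 = 0 the ratio s1/s2 tends to 16R0/(16R0 - 3b2). With the assumed value R0 = 3.8 from
Section 2.2 and b2 = 18:

    # Limiting Mach share ratio for M1/P -> infinity, per (76) with b1 = 0:
    R0, b2 = 3.8, 18
    print(16 * R0 / (16 * R0 - 3 * b2))     # ~8.94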
III) Three classes.
Table 1 shows the ratios of shares and the limiting priority Q at the end of a decay cycle
for an experiment set with three classes on a uniprocessor with 4.3BSD parameters. In the
simulations, Q is computed as the sum of the average priority of all jobs at the end of the
last decay cycle and of PUSER = 50. For the model, Q is computed as the sum of the value
given by (43) and PUSER. In an actual system, priority clamping would have occurred
because Q > 127.

Table 1: Ratios of shares and the limiting priority Q for an experiment set with three
classes on a uniprocessor with 4.3BSD parameters.

                   s1/s2               s1/s3                 Q
  M1 M2 M3    model  simulation   model  simulation   model  simulation
  1  1  1     1.58   1.58         3.75   3.71         142.1  140.7
  1  1  2     1.58   1.53         3.78   3.83         154.1  152.5
  1  2  1     1.68   1.67         5.25   5.00         144.5  142.8
  1  2  2     1.67   1.68         5.11   5.00         156.3  154.6
  2  1  1     1.82   1.85         10.07  9.25         134.9  134.0
  2  1  2     1.78   1.78         7.98   8.01         147.7  145.8
  2  2  1     1.92   1.92         24.11  22.59        139.2  137.2
  2  2  2     1.87   1.87         14.66  14.04        151.4  149.7
Table 2: Ratios of shares for an experiment set with four classes.

                      s1/s2               s1/s3               s1/s4
  M1 M2 M3 M4    model  simulation   model  simulation   model       simulation
  3  3  3  3     1.32   1.31         1.96   1.97         3.75        3.71
  3  3  3  7     1.32   1.32         1.96   1.98         3.78        3.75
  3  3  7  3     1.35   1.34         2.09   2.12         4.61        4.33
  3  3  7  7     1.35   1.37         2.09   2.12         4.62        4.67
  3  7  3  3     1.39   1.35         2.27   2.25         6.16        6.00
  3  7  3  7     1.38   1.38         2.23   2.22         5.76        5.62
  3  7  7  3     1.41   1.42         2.40   2.36         7.98        7.70
  3  7  7  7     1.40   1.39         2.36   2.38         7.39        6.90
  7  3  3  3     1.43   1.41         2.50   2.51         10.07       9.67
  7  3  3  7     1.41   1.42         2.40   2.37         7.98        8.01
  7  3  7  3     1.45   1.44         2.63   2.60         14.05       13.33
  7  3  7  7     1.43   1.44         2.54   2.54         10.90       10.55
  7  7  3  3     1.49   1.49         2.94   2.89         101.59      starvation
  7  7  3  7     1.47   1.45         2.76   2.70         23.02       18.00
  7  7  7  3     1.51   1.51         3.07   2.83         starvation  starvation
  7  7  7  7     1.49   1.47         2.90   2.80         57.82       196.00
We now argue that this phenomenon is probably due to a combination of the way in which
Mach effectuates decay described in Section 2.1.3, and of the relatively large quantum size
of 100 ms. Let us for the sake of this discussion introduce the following terminology. An
extended decay cycle is a 2-second interval at the end of which Mach effectuates decay for
every job; this decay is called global decay. Local decay is the decay effectuated for a process
when it runs and its local decay counter is found to be smaller than the global 1-second
counter. Obviously, during the first half of an extended decay cycle, local decay does not
occur; it is only performed at the first selection of a job for execution during the second half
of an extended decay cycle, which happens on the basis of its priority to which no (local
or global) decay has yet been applied during the current extended decay cycle.
We first show that the only continuous-time model of this form of decay we can think of
yields the same share ratios as the original model. In this model, at the end of the first half
of an extended decay cycle in the steady state, all jobs will have attained the same priority,
and they can be thought of as all being in a single queue, ordered according to their class,
with all class-K jobs at the head and all class-1 jobs at the tail. Then, the first job that
is selected for service in the second half of the extended decay cycle, which would get an
infinitesimally small time quantum according to the PS-type scheduling policy, is subjected
to local decay (with decay factor D). As a consequence, its priority drops, it will receive
service until its priority is again equal to that of the other jobs, and it will be appended to
the queue. This happens for all jobs in the order just described, after which they will all
receive the same amount of service until the end of the extended decay cycle. Then, global
decay (with decay factor D) is performed, as a result of which the jobs will be at priority
levels which increase according to their class numbers. Clearly, by (14), the difference in
priority increase between classes k and l is Rβ(b_k - b_l) in either half of an extended decay
cycle, and so the share ratios are the same as in the original model.
Now if time is discrete, at the start of the second half of an extended decay cycle,
following the course of things in the continuous-time model described above, the jobs of the
higher classes will tend to be at the head(s) of the queue(s), and those of the lower classes
at the tail(s), and so, because there are not very many time quanta, the former will be
relatively favored. We have included the Mach way of decay as an option in our simulator,
and refer to simulations with this form of decay as adapted simulations, which again can be
coarse-grained or fine-grained. The number of clock ticks in an extended decay cycle is 200
and 2000, respectively, and in the latter, the parameters γ, δ, and R0 are adapted in the same
way as indicated in Section 6.1. In Figure 6, it is shown that indeed class 2 gets better
service with the way Mach performs decay, but only when the time quantum is large.
We have found from the traces of the adapted simulations that, especially when their
base priorities are much higher than that of class 1, or when there are many jobs per
processor, jobs of class k > 1 often starve during the first and sometimes during the second
half of an extended decay cycle. Now if the jobs of class k each get an amount c_k of CPU
time during either the first or the second half of an extended decay cycle, and nothing in
the other half, and if v_k is the CPU usage at the end of an extended decay cycle, we have
v_k = v_k/D^2 + c_k, so the jobs experience a decay factor of D^2 and a decay-cycle length of 2T.
Class-1 jobs still get service during either half of an extended decay cycle; if they get the
same amount c1 during either half, then v1 = v1/D^2 + c1/D^2 + c1/D. The share ratio s1/s_k
for k ≥ 2 is then given by 2c1/c_k. Because in addition jobs are first selected during the
second half of a decay cycle on the basis of their priority to which no (local) decay has yet
been applied, and because Mach really operates in 2-second cycles, we conclude that Mach
can be modeled more closely with a decay factor of D = 2.56 (the square of the
original decay factor) and with a decay cycle of T = 200 clock ticks. As the increment in
priority due to one second of CPU time is built into the system, we still have R = l R0/100.
Using (57), the steady-state share ratios are now given by

    s_k/s_l = (10.24 R0 M̂_K + 1.56 Σ_{i=1}^K M_i (b_i - b_k)) / (10.24 R0 M̂_K + 1.56 Σ_{i=1}^K M_i (b_i - b_l)),  k, l = 1, …, K.
We will refer to the model with the new values of the parameters T and D as the adapted
model. In Figure 6, we find that the output of the adapted model and the results of
the adapted simulations agree reasonably well, especially when the number of class-1 jobs
varies. The decreasing behavior of the share ratios in Figure 6b for M1/P = 5, 6, 7 and
for M1/P = 10, 11, 12, 13 can be explained in the same way as in Section 6.2, with the
class-2 jobs getting one quantum per extended decay cycle, and one quantum in every two
extended decay cycles, respectively.
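For comparison, a small sketch of the adapted-model ratios; the function simply mirrors
the displayed formula above, and the value R0 = 3.8 is again an assumption taken from
Section 2.2.

    # Adapted-model share ratio s_k/s_l for Mach (D = 2.56, T = 200):
    def share_ratio_mach_adapted(M, b, k, l, R0=3.8):
        M_hat = sum(M)
        S = lambda j: sum(Mi * (bi - b[j]) for Mi, bi in zip(M, b))
        return (10.24 * R0 * M_hat + 1.56 * S(k)) / (10.24 * R0 * M_hat + 1.56 * S(l))

    # Example in the style of Figure 10 (K = 2, b1 = 0, b2 = 18, P = 1, M2 = 1):
    for M1 in (2, 6, 13):
        print(M1, share_ratio_mach_adapted([M1, 1], [0, 18], 0, 1))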
By (13), we have β ≈ 8/l in the adapted model of Mach. In order to evaluate the levels
of control in the original and the adapted models, we have to compare the values of β/(PT)
with T = 100 and T = 200, respectively. So, because of (75), the effect of the adaptation is
a slight reduction of the level of control.
Figure 6: Ratio of shares (s1/s2) for Mach: original model, adapted model, and adapted
simulations; in (b), versus the number of class-1 jobs per processor (M1/P).
6.4 The Model versus the Measurements

Below, we indicate for which values of the parameters the model is valid. When the model
is invalid, it is because the value of p_cpu of CPU usage of class 1 (v1 in the model)
exceeds 255. In some cases below, theoretically, 255 < v1 < 256, so because of the rounding
down in actual systems, the model can then still be used. As mentioned above, the
4-processor Sun uses a decay factor of D = (2M̂_K + 1)/(2M̂_K), and in the same way as in
Section 4.3, it can easily be shown that then the bound 255 of p_cpu for class 1 is reached
for any set of jobs. As a consequence, all our measurements on the 4-processor Sun are
worthless for our purposes (but they support the conclusion of Section 4.3 that the bound
for p_cpu favors the lower classes by showing (much) larger share ratios, and often indicating
starvation of the higher class(es)).
I) Two classes, fixed numbers of jobs, increasing base priority of class 2.
First we consider the Sun, see Figure 7a. The model is only valid for b2 ≤ 11. The match
of the measurements and the model is perfect. For the Sequent under DYNIX (see Figure
7b; the model is valid for b2 ≤ 13), the measurements show a larger share ratio than the
model. Traces of the system revealed that jobs may get longer time slices than 100 ms,
even up to 600 ms, while their priority does not justify this. Because high-priority processes
are eligible for running earlier during a decay cycle, this favors the lower classes. We have
observed the same phenomenon (higher share ratios) in some measurements of the same
Sequent/DYNIX system with only one CPU enabled. Therefore, we think that the deviation
is not due to a multiprocessor effect, but to some implementation detail of the operating
system.
In Figure 8, we show Mach measurements for P = 1, 4, 8, 16, all with the same ratio of
M1/M2. Note that indeed the share ratios do not depend on the number of processors P
(cf. Section 4.2). On Mach, we could not let the numbers of jobs increase linearly with the
number of processors because of a limit on the number of jobs per user. A comparison
of Figures 4b and 8 shows the difference between the original and adapted models for Mach.
II) Two classes, fixed base priorities, increasing number of jobs with the lowest base priority.
See Figure 9 for the 4.3BSD-based systems; the model is only valid for M1 ≤ 3P. For the
Sun (P = 1), for M1 = 12, the measurements twice give a ratio of 12.31 and the model
gives 16.17; for M1 ≥ 13, the measurements indicate starvation, while the model gives a
ratio of 223.00 for M1 = 13 and starvation for M1 ≥ 14. The cases M1 = 1, 2 confirm our
conclusion of Section 4.3 that a bound for CPU usage favors the lower classes (here class
1). For the Sequent (P = 4), again we find that the measurements indicate higher share
ratios than the model does, which is again probably caused by giving time quanta larger
than 100 ms to class 1.
The match of the adapted model and the Mach measurements in Figure 10 is only
reasonable (with one strange outlier for P = M1 = 16). Taking into account the measured
load and the obtained CPU time (cf. Section 6.3) does not explain the gap. Especially for
this experiment set, the influence of adapting the model is considerable (compare Figures
5b and 10).
III) Three classes, fixed numbers of jobs, increasing base priority of class 3.
The match of the model and the measurements on both the Sun (see Figure 11a; the model
is only valid for b3 ≤ 9) and Mach (see Figure 12) is truly remarkable.
IV) Three classes, fixed base priorities, increasing number of jobs with the lowest base
priority.
The results for the Sun are in Figure 11b. The model is only valid for M1 ≤ 3. Again, the
measurements agree quite well with the model.
7 Conclusions
We have analyzed a decay-usage scheduling policy for multiprocessors modeled after
different variations of UNIX and after Mach. Our main results are the convergence of the
policy and the relation between the base priorities and the steady-state shares. Our simulations
validate our analysis, but also show that the discretization of time may be a source
of considerable deviation from the model. The measurements of the 4.3BSD uniprocessor
and of Mach match the model remarkably well; those of the 4.3BSD multiprocessor show
a discrepancy that we ascribe to implementation details which we have not been able to
identify.
Our results show that share scheduling can be achieved in UNIX, but that unfortunately,
in the decay-usage scheduling policy we have analyzed, the shares depend on the numbers of
jobs in the classes in an intricate way. Therefore, the policy does not easily achieve Priority
Processor Sharing (i.e., fixed ratios of the shares of jobs of different classes, regardless of their
numbers) or Group Priority Processor Sharing (fixed ratios of the total amounts of CPU
time obtained jointly by all jobs of the different classes), both of which may be desirable.
On the other hand, the objectives of the UNIX scheduler are fast interactive response to
short, I/O-bound jobs and the prohibition of starvation of compute-bound jobs, not share
scheduling.
We have seen that in order to have the same range of possible share ratios for the same
set of jobs in systems that do not employ the Mach load-factor technique, the range of base
priorities should be proportional to both the number and the speed of the processors, or in
other words, the leverage of decay-usage scheduling with the same range of base priorities is
much larger in small and slow multiprocessors than in large or fast ones. Also, of the actual
systems considered, 4.3BSD has the highest level of control over the share ratios. Finally, a
scheduler with a constant decay factor and a constant increment of priority due to a clock
tick of CPU time obtained is completely described by one parameter instead of four, at
least as far as the steady-state behavior is concerned.
8 Acknowledgments
The support of the High Performance Computing Department, managed by W.G. Pope,
of the IBM T.J. Watson Research Center in Yorktown Heights, NY, USA, where part of
the research reported on in this paper was performed, and of IBM The Netherlands, is
gratefully acknowledged. In addition, the author owes much to stimulating discussions with
J.L. Hellerstein of the IBM Research Division. Furthermore, the author thanks the OSF
Research Institute in Grenoble, France, for the opportunity to perform measurements on
their multiprocessor Mach system, and Andrei Danes and Philippe Bernadat of OSF for
their help. Finally, the author thanks I.S. Herschberg for his careful reading of a draft
version of this paper and his suggesting numerous improvements in its exposition.
UNIX is a registered trademark of X/OPEN Company, Ltd.
Figure 7: Ratio of shares (s1/s2) versus the base priority of class 2 (b2) (4.3BSD, K = 2,
b1 = 0; in (a), P = 1, M1 = 5, M2 = 1; in (b), P = 4, M1 = 19, M2 = 3).
Figure 8: Ratio of shares (s1/s2) versus the base priority of class 2 (b2) (Mach, K = 2,
b1 = 0; measurements and adapted model; for P = 1, M1 = 5, M2 = 1; for P = 4, M1 = 20,
M2 = 4; for P = 8 and P = 16, M1 = 40, M2 = 8).
Figure 9: Ratio of shares (s1/s2) versus the number of class-1 jobs per processor (M1/P)
(4.3BSD, K = 2, b1 = 0, b2 = 6; in (a), P = 1, M2 = 1; in (b), P = 4, M2 = 4).
Figure 10: Ratio of shares (s1/s2) versus the ratio of the numbers of jobs of classes 1 and
2 (M1/M2) (Mach, K = 2, b1 = 0, b2 = 18; for P = 1, M2 = 1; for P = 4, 8, 16, M2 = 4;
measurements, adapted model, and adapted model with measured load and percentage
CPU).
Figure 11: Ratios of shares (s1/s_k, k = 2, 3) versus (a) the base priority of class 3 (b3) and
(b) the number of class-1 jobs (M1) (4.3BSD, P = 1, K = 3, b1 = 0; in (a), M1 = 5, M2 = 2,
M3 = 1, b2 = 2; in (b), M2 = 2, M3 = 1, b2 = 3, b3 = 5).
Figure 12: Ratios of shares (s1/s_k, k = 2, 3) versus the base priority of class 3 (b3) (Mach,
K = 3, b1 = 0, b2 = 4; for P = 1, M1 = 5, M2 = 2, M3 = 1; for P = 4, M1 = 10, M2 = 4,
M3 = 2; for P = 8, M1 = 20, M2 = 8, M3 = 4; for P = 16, M1 = 30, M2 = 12, M3 = 6).
References

[1] M.J. Bach, The Design of the UNIX Operating System, Prentice-Hall, 1986.

[2] D.L. Black, "Scheduling Support for Concurrency and Parallelism in the Mach Operating System," IEEE Computer, May, 35-43, 1990.

[3] D.L. Black, Scheduling and Resource Management Techniques for Multiprocessors, Report CMU-CS-90-152, Carnegie Mellon University, 1990.

[4] S. Curran and M. Stumm, "A Comparison of Basic CPU Scheduling Algorithms for Multiprocessor UNIX," Computing Systems, Vol. 3, 551-579, 1990.

[5] R.B. Essick, "An Event-Based Fair Share Scheduler," USENIX, Winter, 147-161, 1990.

[6] G. Fayolle, I. Mitrani, and R. Iasnogorodski, "Sharing a Processor among Many Job Classes," J. of the ACM, Vol. 27, 519-532, 1980.

[7] L.L. Fong and M.S. Squillante, Time-Function Scheduling: A General Approach to Controllable Resource Management, IBM Research Report RC 20155, IBM Research Division, New York, NY, 1995.

[8] B. Goodheart and J. Cox, The Magic Garden Explained, The Internals of UNIX System V Release 4, An Open Systems Design, Prentice-Hall, 1994.

[9] A.G. Greenberg and N. Madras, "How Fair is Fair Queuing?," J. of the ACM, Vol. 39, 568-598, 1992.

[10] J.L. Hellerstein, "Control Considerations for CPU Scheduling in UNIX Systems," USENIX, Winter, 359-374, 1992.

[11] J.L. Hellerstein, "Achieving Service Rate Objectives With Decay-Usage Scheduling," IEEE Trans. on Softw. Eng., Vol. 19, 813-825, 1993.

[12] G.J. Henry, "The Fair Share Scheduler," AT&T Bell Laboratories Technical Journal, Vol. 63, 1845-1857, 1984.

[13] J. Kay and P. Lauder, "A Fair Share Scheduler," Comm. of the ACM, Vol. 31, 44-55, 1988.

[14] L. Kleinrock, "Time-Shared Systems: A Theoretical Treatment," J. of the ACM, Vol. 14, 242-261, 1967.

[15] S.J. Leffler, M.K. McKusick, M.J. Karels, and J.S. Quarterman, The Design and Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, 1989.

[16] M.K. McKusick, K. Bostic, M.J. Karels, and J.S. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison-Wesley, 1996.

[17] U. Vahalia, UNIX Internals, The New Frontiers, Prentice-Hall, 1996.

[18] C.A. Waldspurger and W.E. Weihl, "Lottery Scheduling: Flexible Proportional-Share Resource Management," Proc. of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, 1-11, 1994.

[19] C.A. Waldspurger and W.E. Weihl, Stride Scheduling: Deterministic Proportional-Share Resource Management, Technical Memorandum MIT/LCS/TM-528, MIT Laboratory for Computer Science, Cambridge, MA, 1995.