
A Randomized Sampling Clock for

CPU Utilization Estimation and Code Profiling


Steven McCanne* and Chris Torek*†
Lawrence Berkeley Laboratory
One Cyclotron Road
Berkeley, CA 94720
[email protected], [email protected]

Abstract

The unix rusage statistics are well known to be highly inaccurate measurements of CPU utilization. We have observed errors in real applications as large as 80%, and we show how to construct an adversary process that can use an arbitrary amount of the CPU without being charged. We demonstrate that these inaccuracies result from aliasing effects between the periodic system clock and periodic process behavior. Process behavior cannot be changed, but periodic sampling can. To eliminate aliasing, we have introduced a randomized, aperiodic sampling clock into the 4.4bsd kernel. Our measurements show that this randomization has completely removed the systematic errors.

* This work was supported by the Director, Office of Energy Research, Scientific Computing Staff, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.
† This is a preprint of a paper to be presented at the 1993 Winter USENIX conference, January 25–29, 1993, San Diego, CA.

1 Introduction

Traditional implementations of the Unix operating system provide coarse grained, statistical measurements of CPU utilization. On each tick of the system clock, the CPU state is examined. If the processor is in user mode, the current process is charged with one sampling interval of user time. Similarly, if the processor is in system mode, the current process is charged system time.

This approach is problematic. A process can become synchronized with the sampling clock, resulting in large scale errors in the utilization statistic. For instance, a process that runs in phase with the system clock might always surrender the CPU before the clock interrupt arrives, thereby accumulating no usage time.

CPU time estimation is of particular importance, as it drives the scheduling algorithm. If the utilization estimate is in error, scheduling, and hence system performance, can be adversely affected. Furthermore, the accuracies of the getrusage system call and the /bin/time command will be compromised.

In this paper, we outline the theory behind the statistical CPU estimator. We then introduce a new approach based on randomization. Next, we explain how the new model fits into the current 4.4bsd system, and how it can drive code profiling as well. Finally, we give some case studies that demonstrate problems with the existing system and show that our approach has overcome them.

2 A Statistical Model

An exact measurement of CPU utilization would require the precise timing of every interrupt and system call. Since this is prohibitive, systems rely on a cheaper methodology based on sampling. Here, a sequence of samples of the CPU state is used to estimate the true utilization percentage, which in turn can be viewed as a probability. For example, the probability that the CPU is in a given state is simply the ratio of the time spent in that state to the elapsed time.

For the CPU estimator, there are three relevant CPU states: user mode, system mode, and interrupt mode. Call the probabilities of being in each of these states pu, ps, and pi respectively. Then, if a process runs for Te time units, the amount of time spent in each CPU state is simply

    Tu = pu Te
    Ts = ps Te
    Ti = pi Te

We need to devise a sampling experiment that produces unbiased estimates for pu, ps, and pi. Moreover, the estimates should get better as we make more observations.

The observations of CPU state can be related to the probability estimates using elementary probability theory. The sequence of observations comprises what probability theory calls a random sample, and the Law of Large Numbers tells us that the sample mean converges to the mean of each observation, provided the observations are independent.
We can view each sample as a Bernoulli random variable, which is 1 with probability p and 0 otherwise; its mean is p. Thus, assuming independence, the sample mean converges to p, which is what we want.

For example, consider the sequence of observations {U1, U2, ..., Un}, where Uk is 1 if the CPU is in user mode, and 0 otherwise. Each Uk is Bernoulli with mean pu. Thus, the Law of Large Numbers says that

    pu = lim_{n→∞} (U1 + U2 + ··· + Un) / n        (1)

A good estimate of pu then is

    p̂u = (U1 + U2 + ··· + Un) / n

Taking the sample sequence to be Bernoulli assumes that the underlying process is stationary, which means that the probabilities of being in each state remain constant over time. Although this is not generally true of programs, there is no way to proceed without making this assumption. Programmers often put code to be profiled inside a loop, or otherwise run the code many times. This repetitive behavior then exhibits overall stationary behavior. Furthermore, long lived processes generally exhibit repetitive behavior, so the assumption is reasonable for CPU estimation as well.

2.1 The Conventional CPU Estimator

In the conventional method, rather than compute the probability estimates mentioned above, estimates of Tu and Ts are directly maintained. Call these times T̂u and T̂s. Assuming that the set of samples comprises a true random sample, this method would be equivalent to computing the probability estimates. Let Δ be the sampling interval. Then the algorithm computes T̂u as

    T̂u = Δ Σ_{k=1}^{n} Uk = (nΔ) (1/n) Σ_{k=1}^{n} Uk = T̂e p̂u

since T̂e = nΔ is an estimate of the elapsed time. Thus, T̂u and p̂u Te are approximately equal. A similar analysis holds for T̂s.

This approach fails, however, because the samples used to compute T̂u and T̂s do not comprise a truly random sample. Since the sampling mechanism uses fixed intervals, random samples would result only if system and process behavior were themselves random. The original implementation was probably based on this assumption. Unfortunately, programs are, for the most part, deterministic, so random behavior should not be expected. The BSD book [5] points out that the run time utilization estimates are in fact "statistical", but does not attempt to clarify the estimation technique.

In any case, the statistical model allows us to clearly see where the conventional algorithm breaks down. The problem is that the sequence {Un} appearing in Equation 1 is not a sequence of independent observations. For example, knowing the previous observation gives you information about the next one; i.e., adjacent observations are dependent. Since the Law of Large Numbers applies only to sequences of independent random variables, the probability estimate will not necessarily converge to the true probability.

[Figure 1: Process Isochrony. U(t), the user-mode CPU utilization, plotted against time t; arrows mark the fixed-rate sample times Un, Un+1, Un+2.]

The problem that arises from this lack of independence is clearly illustrated in the case of an isochronous process. A process is characterized as isochronous if its behavior is periodic and consistent, for instance, as shown by the graph in Figure 1. The function U(t) represents the user mode utilization of the CPU as a function of time, while the arrows indicate the fixed rate sampling process. Because the sampling process is synchronized with the process behavior, the estimator will compute p̂u = 1, even though pu ≠ 1. This is a systematic error, as it does not diminish with more samples.
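To make the failure concrete, the following user-level simulation (ours, not from the paper; all names and constants are illustrative) samples a synthetic isochronous process that is busy for the first 60% of every 10 ms period. A fixed-rate sampler locks onto a single phase, while a sampler whose intervals are drawn uniformly from [5 ms, 15 ms] lands on every phase equally often:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PERIOD 10000.0          /* process period, microseconds */
    #define BUSY_FRAC 0.6           /* fraction of each period spent in user mode */

    /* 1 if the simulated isochronous process is in user mode at time t. */
    static int
    busy(double t)
    {
            return fmod(t, PERIOD) < BUSY_FRAC * PERIOD;
    }

    /* uniform double in [lo, hi) */
    static double
    uniform(double lo, double hi)
    {
            return lo + (hi - lo) * (rand() / (RAND_MAX + 1.0));
    }

    int
    main(void)
    {
            long i, n = 1000000, hits;
            double t;

            /* Fixed-rate sampling: every sample lands at the same phase. */
            for (hits = 0, t = 1000.0, i = 0; i < n; i++, t += PERIOD)
                    hits += busy(t);
            printf("periodic:   pu-hat = %.3f (true pu = %.1f)\n",
                (double)hits / n, BUSY_FRAC);

            /* Randomized sampling: W uniform on [PERIOD/2, 3*PERIOD/2]. */
            for (hits = 0, t = 0.0, i = 0; i < n; i++) {
                    t += uniform(PERIOD / 2, 3 * PERIOD / 2);
                    hits += busy(t);
            }
            printf("randomized: pu-hat = %.3f (true pu = %.1f)\n",
                (double)hits / n, BUSY_FRAC);
            return 0;
    }

With these parameters the fixed-rate loop reports p̂u = 1.000 no matter how large n grows, exactly the systematic error described above, while the randomized loop converges to the true value of 0.6.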
2.2 Adding Randomization

Now that we know how the conventional estimator is failing, one would think it would be easy to correct the problem. All we need to do is change the sampling technique so that we get independent observations. This turns out to be a non-trivial problem.

In theory, the most straightforward approach would be to simply choose some number of random samples uniformly distributed over the lifetime of the system. But this approach is obviously infeasible since the system is continually running and its history is not retained.

An attractive alternative is to continue to use an interval based sampling approach, but to use random rather than fixed intervals. If we sample at time Ti, then the next sample time is given by

    T_{i+1} = T_i + W_{i+1}

where each Wi is a random variable, and T0 = 0.

Intuitively, randomizing the sampling clock phase should break any synchronization with process behavior. But how can we be sure that the observations are statistically independent, and hence that the sample mean will converge to the probability estimate?
The answer depends on the distribution of W. For example, if W is constant, then we have the existing approach, which we know won't work. More generally, if W is arithmetic, for instance taking values only in {nk : n = 0, 1, ...} for some fixed k, then we have a similar problem.

From a theoretical perspective, a particularly nice choice for W is the exponential distribution. The sequence {Tk} would then correspond to a Poisson process, for which a well known result is that the conditional distribution of arrivals on a subset of time, given their number, is uniform. In other words, if we know how many samples occurred over some interval, then those samples are uniformly distributed on that interval. These uniformly distributed random samples would result in a truly random aggregate sample.

However, implementation difficulties arise for exponentially distributed intervals. Occasionally, the time difference between adjacent samples will be smaller than the interrupt service time. Depending on how the clock hardware operates, race conditions could result when reprogramming the timer. Also, it is not clear what effect an occasional very large sample interval would have on other aspects of the system (for example, the scheduler).

Our solution is to let W be uniformly distributed on [Tmin, Tmax]. In this case, Tmin can be chosen to be much larger than the interrupt service time, simplifying implementation.
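Concretely, the resulting sampling schedule can be generated as in the following sketch (ours; T_MIN and T_MAX are assumed values, and a kernel would use its own generator and clock units rather than rand()):

    #include <stdlib.h>

    #define T_MIN 5000      /* assumed lower bound, microseconds; well above interrupt service time */
    #define T_MAX 15000     /* assumed upper bound, microseconds */

    /* W_{i+1}: an interval drawn uniformly from [T_MIN, T_MAX]. */
    static long
    next_interval(void)
    {
            /* modulo bias is negligible at this range */
            return T_MIN + rand() % (T_MAX - T_MIN + 1);
    }

    /* T_{i+1} = T_i + W_{i+1}, with T_0 = 0. */
    static long
    next_sample_time(long ti)
    {
            return ti + next_interval();
    }

The mean sampling period is (T_MIN + T_MAX)/2, so the average sampling rate can be tuned simply by adjusting the bounds.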
We must be sure, however, that this approach, like the others, is unbiased. We can argue this using another result from probability theory, the Ergodic Theorem, which is a generalization of the Law of Large Numbers. This result says that the sample mean will converge to the true mean if the sequence of samples is ergodic (and not necessarily independent). Assuming the underlying process is stationary, and that our sampling process begins at −∞, we can argue that our sample sequence is ergodic. We omit the details and refer the reader to [3, Ch. 6].

Note that the probability estimates converge independently of the frequency of the sampling clock. Only the rate of convergence is controlled by the mean sampling period. Thus, the average sampling rate, and hence the overhead of the CPU estimator, is dynamically adjustable. This contrasts with the existing system, which required the rate to be configured into the kernel at compile time.

3 Implementation

Incorporating the randomized sampling model into the existing system was relatively straightforward. In the 4.4bsd kernel, all real-time and time-of-day events, including process scheduling, are driven off a fixed-rate hardclock interrupt. In the old system, the hardclock interrupt also gathered statistics; now they are driven off a separate statclock timer. Each time statclock returns from its interrupt context, the timer is reprogrammed for a random interval chosen from a uniform distribution as described in the previous section.

Process "wall clock time" is computed directly from the actual time of day at process switch. The microtime function is used to obtain a high resolution timestamp when the process is continued, and again when the process is suspended; the difference between these times is then added to a running sum. This figure becomes the Te factor in the approximation formulas.

Meanwhile, on each statclock tick, the current process, if any, is charged a user, system, or interrupt tick according to the CPU mode at the time of the statclock interrupt; call these counts u, s, and i respectively. Since the sum of these counts is the total number of samples taken, the probability estimates are easily computed: pk = k/(u + s + i) for k ∈ {u, s, i}. From this and Te, we can then compute Tu and Ts. Ti can be similarly computed, but since there is no way to identify the true source of such time, it currently disappears into general system overhead.
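In code, this accounting amounts to very little. The following sketch (our own names and types, not the kernel's) converts the statclock tick counts and the microtime-accumulated elapsed time into the time estimates:

    struct cpu_estimate {
            double tu;      /* estimated user time, seconds */
            double ts;      /* estimated system time, seconds */
    };

    /*
     * u, s, i: user, system, and interrupt ticks charged by statclock.
     * te: elapsed "wall clock" time in seconds, summed from microtime
     * timestamps at each process switch.  Applies Tu = pu*Te and
     * Ts = ps*Te with pk = k/(u + s + i).
     */
    static struct cpu_estimate
    estimate_times(unsigned long u, unsigned long s, unsigned long i, double te)
    {
            struct cpu_estimate est = { 0.0, 0.0 };
            unsigned long n = u + s + i;    /* total number of samples */

            if (n > 0) {
                    est.tu = te * (double)u / (double)n;    /* Tu = pu * Te */
                    est.ts = te * (double)s / (double)n;    /* Ts = ps * Te */
            }
            return est;
    }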
The statclock abstraction is available only on machines with high-precision, programmable clocks. The randomized sampling intervals are generated by programming the clock's limit register with a pseudo-random number. To reduce overhead, a cheap-but-good random number generator is used [1]. On systems without programmable clocks, statclock is called directly from hardclock, and the functionality is unchanged from the existing system.
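The generator described in [1] is the "minimal standard" multiplicative congruential generator, x' = 16807 x mod (2^31 − 1); Carta shows how to evaluate it without a division by exploiting the identity 2^31 ≡ 1 (mod 2^31 − 1). A sketch in that spirit follows; the statclock_reload() interface and the mapping onto timer counts are our assumptions, not the 4.4bsd code:

    #define RNG_MOD 0x7fffffffUL            /* 2^31 - 1, prime */

    static unsigned long rng_state = 1;     /* any seed in [1, 2^31 - 2] */

    /*
     * Minimal standard generator, x' = 16807*x mod (2^31 - 1),
     * computed without a divide in the spirit of [1]: split the
     * 46-bit product at bit 31 and fold, since 2^31 = 1 (mod 2^31 - 1).
     */
    static unsigned long
    minstd(void)
    {
            unsigned long long p = 16807ULL * rng_state;

            rng_state = (unsigned long)((p & RNG_MOD) + (p >> 31));
            if (rng_state >= RNG_MOD)
                    rng_state -= RNG_MOD;
            return rng_state;
    }

    /*
     * Pick the next statclock interval, uniform on [tmin, tmax] timer
     * counts; the result would be loaded into the clock's limit
     * register.  (Hypothetical interface; the register programming is
     * machine dependent.)
     */
    static unsigned long
    statclock_reload(unsigned long tmin, unsigned long tmax)
    {
            return tmin + minstd() % (tmax - tmin + 1);
    }

Any generator of reasonable quality would do here; the attraction of this one is that it costs only a few instructions, which matters in an interrupt handler.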

4 Code Profiling

Kernel support for user level code profiling [4] is carried out in a manner identical to CPU usage estimation. We therefore wanted to apply the lessons learned from the randomized sampling clock to the profiling system. Along the way, we fixed some problems with the traditional profiling support.

When a statclock tick occurs while a profiled process is running, a profiling buffer in user address space must be updated. In the previous system, this buffer update could not be carried out by the clock interrupt handler because page faults are not permitted in an interrupt context. Instead, the clock handler schedules a profiling "asynchronous system trap", or AST, which causes a trap to occur just before returning to user mode. In this trap context, page faults may occur, and the user's profiling buffer is updated.

The 4.4bsd kernel avoids most such ASTs through two new routines to manipulate user memory from interrupt context. These routines attempt to update the user profiling counts directly. If a fault occurs, the update is aborted and the profiling code schedules an AST as before. Typically only a few such ASTs are required to page in the user profiling buffer. From then on, the updates can be carried out cheaply from interrupt context.
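A sketch of such an interrupt-context update follows. Everything here (names, the gprof-style bucket scaling, the stubs) is our assumption, not the 4.4bsd source: fetchw()/storew() stand in for the two fault-tolerant user-memory routines, and are stubbed to always succeed so the fragment is self-contained:

    /* A minimal stand-in for the per-process profiling descriptor. */
    struct uprof {
            unsigned short *pr_buf;     /* user-space counter buffer */
            unsigned long pr_size;      /* buffer size, in bytes */
            unsigned long pr_off;       /* lowest pc covered by the buffer */
            unsigned long pr_scale;     /* fixed-point pc-to-bucket scale */
    };

    static int
    fetchw(unsigned short *p)
    {
            return *p;      /* kernel version: return -1 if page not resident */
    }

    static int
    storew(unsigned short *p, unsigned short v)
    {
            *p = v;         /* kernel version: return -1 if page not resident */
            return 0;
    }

    /*
     * Charge one statclock tick to user pc.  Returns 0 on success, -1
     * if the caller should fall back to scheduling a profiling AST.
     */
    static int
    profile_tick(struct uprof *up, unsigned long pc)
    {
            unsigned long bucket;
            unsigned short *cell;
            int v;

            if (pc < up->pr_off)
                    return (0);     /* pc outside profiled range */
            bucket = ((pc - up->pr_off) * up->pr_scale) >> 16;
            if (bucket >= up->pr_size / sizeof(*cell))
                    return (0);
            cell = up->pr_buf + bucket;
            if ((v = fetchw(cell)) == -1 || storew(cell, v + 1) == -1)
                    return (-1);    /* faulted: post an AST instead */
            return (0);
    }

In this scheme only the first few ticks should take the AST path, until the profiling buffer has been paged in.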
System call profiling in the existing system is inaccurate. In this case, rather than update the user profile buffer during the system call, the kernel reads the process's accumulated system time at entry to each call and again at exit from each call. The difference between these two times is converted to a tick count, which is added as if from an AST. This includes the same interrupt-time excess found in the per-process statistics, and computing this value is complicated by the need to turn time into ticks.

In 4.4bsd, system call profiling is still done at the end of each call for efficiency, but now it is merely a matter of subtracting the previous system tick count from the current count.
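In sketch form (hypothetical structure and field names, not the kernel's), the 4.4bsd bookkeeping reduces to a snapshot and a subtraction:

    /* Hypothetical per-process profiling state. */
    struct prof_state {
            unsigned long systicks;     /* system ticks charged by statclock */
            unsigned long entry_ticks;  /* snapshot taken at system call entry */
    };

    /* At system call entry: remember the current tick count. */
    static void
    prof_syscall_entry(struct prof_state *ps)
    {
            ps->entry_ticks = ps->systicks;
    }

    /* At system call exit: the charge is now merely a subtraction. */
    static unsigned long
    prof_syscall_exit(struct prof_state *ps)
    {
            return ps->systicks - ps->entry_ticks;
    }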
5 Results

We devised three experiments, two contrived and one from production software, that uncover the anomalies of the old CPU utilization estimator. The tests were run under both SunOS 4.1.1 and 4.4bsd. In each experiment, the anomalies were clearly evident under the old CPU estimator, while under 4.4bsd, they disappeared.

5.1 Interrupt Activity Interference

The first experiment clearly demonstrates that interrupt activity is charged to the current process as system time. A program was written that executes an infinite loop, and runs a 4 Hz alarm that logs system time usage. Since the program uses only a small amount of system CPU, just enough to process an alarm signal every 0.25 seconds, system time should accumulate very slowly.

This program was simultaneously run on two sparcstation 1+ machines, one running SunOS 4.1.1 and the other 4.4bsd. Partway through execution, each host was exposed to an onslaught of interrupt activity.¹ The interrupt activity was then terminated, and finally, the program was stopped.

¹ The interrupt activity was generated by putting each host's network interface into promiscuous mode, causing all network packets to be processed. A 780KB/s Ethernet transfer was then initiated on the local net.

Figure 2 shows the plot of system CPU time versus real time for both processes. The lower line represents the 4.4bsd behavior, and is as expected: very little system time is accumulated. Under SunOS, however, during the interrupt activity, the process is charged with a significant amount of system CPU time. Note that this process had nothing to do with the interrupt activity; clearly, the statistics are skewed.

5.2 A CPU Adversary

An adversarial program was written in an attempt to defeat the CPU utilization statistics altogether. This program, which we call hog, first estimates the phase of the system clock. It then enters a hard loop, performing gettimeofday system calls, until just before a hardclock tick is going to happen. At this point, it goes to sleep until the next system clock. Thus, the process is never charged with a sampling tick, and never will accumulate CPU time.² Furthermore, its scheduling priority remains favorable, so it always runs, even if there are other processes waiting.

² Actually, there is a low probability that the process is run just before a clock interrupt, in which case there is insufficient time to discover that the interrupt is coming.

Since hog sleeps every other system clock tick, it will use at most half of the CPU. Thus, two hogs are required to use up the whole CPU. We augmented hog to fork once from main, and the results were dramatic. Table 1 shows timings for a CPU bound process when run in the presence and absence of hog. The CPU bound test program simply counted to 10 million. The first column gives the real time of execution, while the second column gives the CPU time as measured by the system. Without hog, the two systems are similar, as expected. But with hog, the SunOS process takes 80 times longer to finish even though the time command reports a utilization of 57%. If this figure were correct, it should only have taken 1.75 times longer. Under 4.4bsd, the test process gets a fair share of the CPU. The 15.5 seconds of real time is consistent with 45% utilization.

                      real     cpu    %cpu
    SunOS   w/o hog    7.1     7.1    100%
            w/ hog   584.5   334.8     57%
    4.4bsd  w/o hog    7.1     7.0     99%
            w/ hog    15.5     7.0     45%

    Table 1: Effect of hog on CPU bound process

In taking these measurements, we noticed anomalous scheduler behavior in 4.4bsd. Even though the CPU utilization estimates were accurate, the scheduler often exhibits unfairness between the CPU bound counting program and the hog. In the presence of the hog, the CPU bound process got as little as 9% and as much as 65% of the CPU. The bsd scheduler is known to be flawed [6], but it should do better here. This remains to be investigated.

The code for hog is given in the appendix.
[Figure 2: Interrupt Interference with System CPU Time. System CPU time in seconds (0-4) versus real time in seconds (0-40); the lower line is the 4.4bsd process, the upper the SunOS process.]

5.3 The Isochronous Anomaly

The previous two experiments were run under controlled conditions in an attempt to expose the worst case behavior of the utilization error. However, we have experienced problematic behavior in a normal operating environment. For example, Figure 3 shows a window dump of xcpu, displaying a load oscillating between 0 and 10%, with a period of about one minute. The oscillations in the load average were not expected, and the cause was an audio conferencing program called vat [2]. Vat processes a frame of audio samples every 22.5ms and should therefore exert a constant CPU load. But xcpu indicated otherwise.

[Figure 3: Unexpected CPU Load Oscillations. A window dump of xcpu showing the reported load oscillating between 0 and 10% with a period of about one minute.]

To verify our theory that vat was causing these load fluctuations, we modified it to log its CPU time, every 22.5ms, to a debugging file, and ran the new version under both SunOS and 4.4bsd. Figure 4 shows these results. Since vat operates continuously, you would expect its CPU time to increase linearly. This is the case for 4.4bsd. But under SunOS, there are flat and steep regions of about 30 seconds in length. This anomaly is clearly due to the inaccuracy of the old sampling methodology.

[Figure 4: System Time Oscillations. vat's accumulated CPU time in seconds (0-6) versus real time in seconds (0-200); the 4.4bsd line grows linearly, while the SunOS line alternates between flat and steep regions.]

The problem is that vat is running synchronously with the system clock. Since vat runs exactly every 22.5ms, it is aliased onto the 10ms system clock in only four possible slots.³ As a result, when the minimum phase difference allows vat to carry out all of its processing before the clock ticks arrive, CPU time never accumulates. On the other hand, when the phase is such that ticks always occur, too much CPU time is charged. These two modes of operation are reflected in the SunOS data as the flat and steep regions. While this argument predicts that we should remain in a given mode indefinitely, the data actually oscillates between the two modes with a period of about one minute. Without going into detail, various effects, some internal to vat and some due to unrelated interrupt activity, can cause the phase between vat's behavior and the system clock to drift.

³ The least common multiple of the two periods is 90ms. Therefore, there are 90/22.5 = 4 phase positions for vat to cycle through.

6 Conclusion

We have presented a new approach for measuring CPU utilization that uses randomized sampling to overcome the deficiencies of the old approach. Randomization prevents an adversary from foiling the utilization estimator and precludes synchronization between the sampling system and process behavior. We have corrected problems with erroneous accounting of interrupt activity, and we have streamlined the kernel support for code profiling. Finally, we have conducted experiments to demonstrate that the new system performs as expected.
are 90=22:5 4 phase positions for vat to cycle through.
6 A Randomized CPU Sampling Clock

SunOS
BSD

6
CPU Time (sec)

4
2
0

0 50 100 150 200

Real Time (sec)

Figure 4: System Time Oscillations

7 Acknowledgements

Van Jacobson originally suggested that randomization be used to circumvent the statistical biases in the CPU estimator. Additionally, he helped interpret the results of our experiments and provided suggestions for implementation strategies.

The idea that the CPU scheduler can be defeated is not new. Dheeraj Sanghi and Olafur Gudmundsson wrote a program similar to the hog presented in this paper. According to them, the idea was originally proposed by Ashok Agrawala.

We are grateful to Spyros Papadakis for helping us formulate our probabilistic arguments. His advice was impeccable; any errors were introduced by us.

Finally, we would like to thank Vern Paxson, Van Jacobson, Craig Leres, Deana Goldsmith, and the referees for their helpful comments on drafts of this paper.

8 Availability

The statclock code appears in the 4.4BSD alpha release. Currently, sparcstation and HP-9000/300 series machines are supported.

References

[1] Carta, D. G. Two fast implementations of the "minimal standard" random number generator. Communications of the ACM 33, 1 (Jan. 1990).
[2] Casner, S., and Deering, S. First IETF Internet audiocast. ConneXions 6, 6 (1992), 10–17.
[3] Durrett, R. Probability: Theory and Examples. Brooks/Cole Publishing Company, 1991.
[4] Graham, S. L., Kessler, P. B., and McKusick, M. K. gprof: A call graph execution profiler. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction (June 1982).
[5] Leffler, S. J., McKusick, M. K., Karels, M. J., and Quarterman, J. S. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley, 1989.
[6] Straathof, J. H., Thareja, A., and Agrawala, A. UNIX scheduling for large systems. In Proceedings of the 1986 Winter USENIX Technical Conference (Denver, CO, Jan. 1986), USENIX, pp. 111–138.

Author Information

Steven McCanne has been with the Lawrence Berkeley Laboratory since 1988, working on network analysis tools and remote conferencing applications. He holds a B.S. degree in Electrical Engineering and Computer Science from U.C. Berkeley, and is currently a Ph.D. student in Computer Science at U.C.B.

Chris Torek has been rewriting bits of the Berkeley kernel for about six years. He joined LBL in 1991, and has been working on porting bsd to the sparcstation. In his off hours he spends far too much time on USENET.
Appendix: Adversary Source Code

#include <signal.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/resource.h>

#define tvdiff(x, y) \
        (1000000 * ((y).tv_sec - (x).tv_sec) + (y).tv_usec - (x).tv_usec)

struct timeval hc;      /* our best guess for when a hardclock happened */
struct timeval now;     /* hold time-of-day for signal handler */
volatile int ntick;

alarm_handler()
{
        u_long u;
        struct timeval tv;
        static int mindel = 5000000;

        ++ntick;
        gettimeofday(&tv, 0);
        u = tvdiff(now, tv);
        if (u < mindel) {
                mindel = u;
                hc = tv;
        }
        usleep(1);
}

/*
 * Try to figure out when hardclock happens.
 */
struct timeval
train()
{
        struct itimerval it;

        signal(SIGVTALRM, alarm_handler);
        it.it_interval.tv_usec = it.it_value.tv_usec = 1;
        it.it_interval.tv_sec = it.it_value.tv_sec = 0;
        /*
         * Sleep right before we set the timer.  This way, we're sure to
         * get a whole time slice, and we won't be switched out before
         * we estimate the hardclock time.
         */
        usleep(1);
        setitimer(ITIMER_VIRTUAL, &it, 0);
        for (ntick = 0; ntick < 20; )
                gettimeofday(&now, 0);
        /*
         * Turn the timer off.
         */
        it.it_interval.tv_usec = it.it_value.tv_usec = 0;
        setitimer(ITIMER_VIRTUAL, &it, 0);

        return (hc);
}
int
main(argc, argv)
        int argc;
        char **argv;
{
        long us, s, bias, delta, off;
        struct timeval tv;

        /*
         * Determine when hardclocks are happening then compute a bias
         * with respect to an even multiple of hardclock ticks.
         * Assume 10ms tick.  Since a second is an even multiple of
         * a tick, we only need to look at usecs.
         */
        tv = train();
        bias = tv.tv_usec % 10000;

        /*
         * Make one copy of ourself.
         * We need two processes to do real damage.
         */
        fork();

        for (;;) {
                gettimeofday(&tv, 0);
                /*
                 * Round down to even tick multiple, then add in bias.
                 * Compute estimate of next hardclock into s and us.
                 */
                s = tv.tv_sec;
                us = tv.tv_usec;
                delta = us % 10000;
                off = bias - delta;
                if (off < 0)
                        off += 10000;
                us += off;
                if (us >= 1000000) {
                        us -= 1000000;
                        ++s;
                }
                /*
                 * Spin until 1ms before next hardclock.
                 */
                us -= 1000;
                while (tv.tv_sec < s || (tv.tv_sec == s && tv.tv_usec < us))
                        gettimeofday(&tv, 0);
                usleep(1);
        }
}
