A Randomized Sampling Clock for CPU Utilization Estimation and Code Profiling
The Law of Large Numbers tells us that the sample mean converges to the mean of each observation, provided the observations are independent. We can view each sample as a Bernoulli random variable, which is 1 with probability p and 0 otherwise; its mean is p. Thus, if U_k is 1 when the k-th sample finds the CPU in user mode, the sample mean of U_1, U_2, ···, U_n should converge to p_u. A good estimate of p_u then is

    p̂_u = (U_1 + U_2 + ··· + U_n) / n.

Taking the sample sequence to be Bernoulli assumes that the underlying process is stationary, which means that the probabilities of being in each state remain constant over time. Although this is not generally true of programs, there is no way to proceed without making this assumption. Programmers often put code to be profiled inside a loop, or otherwise run the code many times, and this repetition makes the overall behavior approximately stationary. Furthermore, long-lived processes generally exhibit repetitive behavior, so the assumption is reasonable for CPU estimation as well.
2.1 The Conventional CPU Estimator

In these terms, the conventional algorithm charges one clock tick to user time for each sample that finds the CPU in user mode. Measuring time in units of the tick interval, the user time estimate is

    T̂_u = Σ_{k=1}^{n} U_k · 1 = ( (1/n) Σ_{k=1}^{n} U_k ) (n · 1) = p̂_u T̂_e,        (1)

since T̂_e = n · 1 is an estimate of the elapsed time. Thus, T̂_u and p̂_u T̂_e are approximately equal. A similar analysis holds for T̂_s.
This approach fails, however, because the samples used to compute T̂_u and T̂_s do not comprise a truly random sample. Since the sampling mechanism uses fixed intervals, random samples would result only if system and process behavior were itself random. The original implementation was probably based on this assumption. Unfortunately, programs are, for the most part, deterministic, so random behavior should not be expected. The BSD book [5] points out that the run …

In any case, the statistical model allows us to see clearly where the conventional algorithm breaks down. The problem is that the sequence {U_n} appearing in Equation 1 is not a sequence of independent observations. For example, knowing the previous observation gives you information about the next one; that is, adjacent observations are dependent. Since the Law of Large Numbers applies only to sequences of independent random variables, the probability estimate will not necessarily converge to the true probability.
The problem that arises from this lack of independence is clearly illustrated in the case of an isochronous process. A process is characterized as isochronous if its behavior is periodic and consistent, for instance, as shown by the graph in Figure 1. The function U(t) represents the user-mode utilization of the CPU as a function of time, while the arrows indicate the fixed-rate sampling process. Because the sampling process is synchronized with the process behavior, the estimator will compute p̂_u = 1, even though p_u ≠ 1. This is a systematic error, as it does not diminish with more samples.

[Figure 1: A periodic utilization function U(t), sampled at a fixed rate indicated by arrows; every sample falls in a user-mode interval.]
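The bias is easy to reproduce at user level. The following short C program (an illustration, not code from the paper) samples a square wave U(t) with a 10 ms period, for which the true p_u is 0.5, first at a fixed 10 ms interval and then at intervals drawn uniformly from [5 ms, 15 ms]. The fixed-rate estimate locks onto 1.0, while the randomized estimate converges to 0.5.

    #include <stdio.h>
    #include <stdlib.h>

    /* U(t): 1 during the first half of each 10000-us period, else 0. */
    static int
    U(long t)
    {
        return (t % 10000 < 5000);
    }

    int
    main()
    {
        long n = 100000, k;
        long t_fixed = 0, t_rand = 0, hits_fixed = 0, hits_rand = 0;

        srandom(1);
        for (k = 0; k < n; k++) {
            t_fixed += 10000;               /* fixed interval: exactly one period */
            hits_fixed += U(t_fixed);
            t_rand += 5000 + random() % 10001;  /* uniform on [5000, 15000] us */
            hits_rand += U(t_rand);
        }
        printf("fixed: %.3f  randomized: %.3f  (true p_u = 0.500)\n",
            (double)hits_fixed / n, (double)hits_rand / n);
        return (0);
    }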
In theory, the most straightforward approach would be to simply choose some number of random samples uniformly distributed over the lifetime of the system. But this approach is obviously infeasible, since the system is continually running and its history is not retained.

An attractive alternative is to continue to use an interval-based sampling approach, but to use random rather than fixed intervals. If we sample at time T_i, then the next sample time is given by

    T_{i+1} = T_i + W_{i+1},

where each W_i is a random variable, and T_0 = 0.
Intuitively, randomizing the sampling clock phase should break any synchronization with process behavior. But how can we be sure that the observations are statistically independent, and hence that the sample mean will converge to the probability being estimated? The answer depends on the distribution of W. For example, if W is constant, then we have the existing approach, which we know won't work. More generally, if W is arithmetic, that is, if it takes on values only in {nk : n = 0, 1, 2, ...} for some constant k, then we have a similar problem.

From a theoretical perspective, a particularly nice choice for W is the exponential distribution. The sequence {T_k} would then correspond to a Poisson process, for which a well-known result is that the conditional distribution of the arrivals in an interval, given their number, is uniform. In other words, if we know how many samples occurred over some interval, then those samples are uniformly distributed on that interval. These uniformly distributed random samples would result in a truly random aggregate sample.

However, implementation difficulties arise for exponentially distributed intervals. Occasionally, the time difference between adjacent samples will be smaller than the interrupt service time. Depending on how the clock hardware operates, race conditions could result when reprogramming the timer. Also, it is not clear what effect an occasional very large sample interval would have on other aspects of the system (for example, the scheduler).

Our solution is to let W be uniformly distributed on [T_min, T_max]. In this case, T_min can be chosen to be much larger than the interrupt service time, simplifying implementation.

We must be sure, however, that this approach, like the others, is unbiased. We can argue this using another result from probability theory, the Ergodic Theorem, which is a generalization of the Law of Large Numbers. This result says that the sample mean will converge to the true mean if the sequence of samples is ergodic (and not necessarily independent). Assuming the underlying process is stationary, and that our sampling process begins at −∞, we can argue that our sample sequence is ergodic. We omit the details and refer the reader to [3, Ch. 6].
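In the notation above, the result being invoked can be stated concisely (this is a standard form of the Ergodic Theorem, as in [3, Ch. 6], not a formula from the paper):

    (1/n) Σ_{k=1}^{n} U_k  →  E[U_1] = p_u   almost surely, as n → ∞,

whenever the sequence {U_k} is stationary and ergodic; independence is not required.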
Note that the probability estimates converge independently of the frequency of the sampling clock; only the rate of convergence is controlled by the mean sampling period. Thus, the average sampling rate, and hence the overhead of the CPU estimator, is dynamically adjustable. This contrasts with the existing system, which required the rate to be configured into the kernel at compile time.

3 Implementation

Incorporating the randomized sampling model into the existing system was relatively straightforward. In the 4.4bsd kernel, all real-time and time-of-day events, including process scheduling, are driven off a fixed-rate hardclock interrupt. In the old system, the hardclock interrupt also gathered statistics; now they are driven off a separate statclock timer. Each time statclock returns from its interrupt context, the timer is reprogrammed for a random interval chosen from a uniform distribution, as described in the previous section.

Process "wall clock time" is computed directly from the actual time of day at process switch. The microtime function is used to obtain a high-resolution timestamp when the process is continued, and again when the process is suspended; the difference between these times is then added to a running sum. This figure becomes the T_e factor in the approximation formulas.

Meanwhile, on each statclock tick, the current process, if any, is charged a user, system, or interrupt tick according to the CPU mode at the time of the statclock interrupt; call these counts u, s, and i respectively. Since the sum of these counts is the total number of samples taken, the probability estimates are easily computed: p_k = k/(u + s + i) for k ∈ {u, s, i}. From this and T_e, we can then compute T_u and T_s. T_i could be computed similarly, but since there is no way to identify the true source of interrupt time, it currently disappears into general system overhead.
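In code, the computation is just a handful of divisions. The sketch below uses invented names (struct cpu_est, estimate); it mirrors the formulas above rather than the actual 4.4bsd source:

    /* Sketch with invented names; not from the 4.4bsd source. */
    struct cpu_est {
        double p_u, p_s, p_i;       /* mode probabilities */
        double t_u, t_s;            /* estimated user/system time */
    };

    static struct cpu_est
    estimate(long u, long s, long i, double te)
    {
        struct cpu_est e;
        long n = u + s + i;         /* total samples taken */

        if (n == 0)                 /* no samples yet */
            n = 1;
        e.p_u = (double)u / n;
        e.p_s = (double)s / n;
        e.p_i = (double)i / n;
        e.t_u = e.p_u * te;         /* T_u = p_u * T_e */
        e.t_s = e.p_s * te;         /* T_s = p_s * T_e */
        return (e);
    }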
The statclock abstraction is available only on machines with high-precision, programmable clocks. The randomized sampling intervals are generated by programming the clock's limit register with a pseudo-random number. To reduce overhead, a cheap-but-good random number generator is used [1]. On systems without programmable clocks, statclock is called directly from hardclock, and the functionality is unchanged from the existing system.
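For concreteness, the "minimal standard" generator of Park and Miller, which Carta's paper [1] implements efficiently, is x' = 16807x mod (2^31 − 1). The sketch below uses plain 64-bit arithmetic rather than Carta's optimization, and draws an interval uniform on [tmin, tmax]; the names are ours, not the kernel's:

    static unsigned long prng_state = 1;    /* seed; must never be zero */

    /* One step of the minimal standard generator, done with 64-bit
     * arithmetic; Carta [1] shows how to avoid the division. */
    static unsigned long
    prng(void)
    {
        prng_state = (unsigned long)
            (((unsigned long long)prng_state * 16807) % 2147483647UL);
        return (prng_state);
    }

    /* Next statclock interval, uniform on [tmin, tmax] timer units.
     * (The slight modulo bias is irrelevant at these sizes.) */
    static unsigned long
    next_interval(unsigned long tmin, unsigned long tmax)
    {
        return (tmin + prng() % (tmax - tmin + 1));
    }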
4 Code Profiling

Kernel support for user-level code profiling [4] is carried out in a manner identical to CPU usage estimation. We therefore wanted to apply the lessons learned from the randomized sampling clock to the profiling system. Along the way, we fixed some problems with the traditional profiling support.

When a statclock tick occurs while a profiled process is running, a profiling buffer in user address space must be updated. In the previous system, this buffer update could not be carried out by the clock interrupt handler, because page faults are not permitted in an interrupt context. Instead, the clock handler scheduled a profiling "asynchronous system trap," or AST, which causes a trap to occur just before returning to user mode. In this trap context, page faults may occur, and the user's profiling buffer is updated.

The 4.4bsd kernel avoids most such ASTs through two new routines that manipulate user memory from interrupt context. These routines attempt to update the user profiling counts directly. If a fault occurs, the update is aborted and the profiling code schedules an AST as before. Typically only a few such ASTs are required to page in the user profiling buffer; from then on, the updates can be carried out cheaply from interrupt context.
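The control flow is roughly as follows. This is a sketch of the pattern only; the fault-safe accessors and the AST hand-off below are placeholders, not the two 4.4bsd routines:

    /* Fault-safe stand-ins: kernel versions return -1, instead of
     * faulting, when the user page is not resident. */
    static int fetch_word(unsigned short *p, unsigned short *v) { *v = *p; return (0); }
    static int store_word(unsigned short *p, unsigned short v) { *p = v; return (0); }

    static int pending_ticks;       /* stand-in for per-process AST state */

    /* Charge `ticks' to a profile cell from interrupt context if the
     * page is resident; otherwise defer the update to an AST. */
    static void
    charge_profile(unsigned short *cell, int ticks)
    {
        unsigned short v;

        if (fetch_word(cell, &v) == 0 && store_word(cell, v + ticks) == 0)
            return;                 /* fast path: updated in place */
        pending_ticks += ticks;     /* slow path: the AST retries the
                                     * update at return to user mode */
    }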
System call profiling in the existing system is inaccurate. In this case, rather than update the user profile buffer during the system call, the kernel reads the process's accumulated system time at entry to each call and again at exit from each call. The difference between these two times is converted to a tick count, which is added as if from an AST. This tick count includes the same interrupt-time excess found in the per-process statistics, and computing it is complicated by the need to turn time into ticks.

In 4.4bsd, system call profiling is still done at the end of each call for efficiency, but now it is merely a matter of subtracting the previous system tick count from the current count.

5 Results

We devised three experiments, two contrived and one from production software, that uncover the anomalies of the old CPU utilization estimator. The tests were run under both SunOS 4.1.1 and 4.4bsd. In each experiment, the anomalies were clearly evident under the old CPU estimator, while under 4.4bsd they disappeared.
5.1 Interrupt Activity Interference

The first experiment clearly demonstrates that interrupt activity is charged to the current process as system time. A program was written that executes an infinite loop and runs a 4 Hz alarm that logs system time usage. Since the program uses only a small amount of system CPU, just enough to process an alarm signal every 0.25 seconds, system time should accumulate very slowly.

This program was simultaneously run on two sparcstation 1+ machines, one running SunOS 4.1.1 and the other 4.4bsd. Partway through execution, each host was exposed to an onslaught of interrupt activity.[1] The interrupt activity was then terminated, and finally, the program was stopped.

Figure 2 shows the plot of system CPU time versus real time for both processes. The lower line represents the 4.4bsd behavior and is as expected: very little system time is accumulated. Under SunOS, however, during the interrupt activity the process is charged with a significant amount of system CPU time. Note that this process had nothing to do with the interrupt activity; clearly, the statistics are skewed.

[1] The interrupt activity was generated by putting each host's network interface into promiscuous mode, causing all network packets to be processed. A 780 KB/s Ethernet transfer was then initiated on the local net.

                     real (sec)   cpu (sec)   %cpu
    SunOS   w/o hog        7.1         7.1    100%
            w/ hog       584.5       334.8     57%
    4.4bsd  w/o hog        7.1         7.0     99%
            w/ hog        15.5         7.0     45%

        Table 1: Effect of hog on a CPU-bound process

5.2 A CPU Adversary

An adversarial program was written in an attempt to defeat the CPU utilization statistics altogether. This program, which we call hog, first estimates the phase of the system clock. It then enters a hard loop, performing gettimeofday system calls, until just before a hardclock tick is going to happen. At this point, it goes to sleep until the next system clock tick. Thus, the process is never charged with a sampling tick and never accumulates CPU time.[2] Furthermore, its scheduling priority remains favorable, so it always runs, even if there are other processes waiting.

Since hog sleeps every other system clock tick, it will use at most half of the CPU. Thus, two hogs are required to use up the whole CPU. We augmented hog to fork once from main, and the results were dramatic. Table 1 shows timings for a CPU-bound process run in the presence and absence of hog. The CPU-bound test program simply counted to 10 million. The first column gives the real time of execution, while the second column gives the CPU time as measured by the system. Without hog, the two systems are similar, as expected. But with hog, the SunOS process takes 80 times longer to finish even though the time command reports a utilization of 57%. If this figure were correct, the run should have taken only 1.75 times longer (7.1 seconds of CPU at 57% utilization implies about 12.5 seconds of real time). Under 4.4bsd, the test process gets a fair share of the CPU; the 15.5 seconds of real time is consistent with 45% utilization.

In taking these measurements, we noticed anomalous scheduler behavior in 4.4bsd. Even though the CPU utilization estimates were accurate, the scheduler often exhibited unfairness between the CPU-bound counting program and the hog: in the presence of the hog, the CPU-bound process got as little as 9% and as much as 65% of the CPU. The bsd scheduler is known to be flawed [6], but it should do better here. This remains to be investigated.

The code for hog is given in the appendix.

[2] Actually, there is a low probability that the process is run just before a clock interrupt, in which case there is insufficient time to discover that the interrupt is coming.

5.3 The Isochronous Anomaly

The previous two experiments were run under controlled conditions in an attempt to expose the worst case behavior of the utilization error. However, we have experienced problematic …
[Figure 2: System CPU time (sec) versus real time (sec, 0 to 40) for the test program; one curve for SunOS, one for BSD.]

[Second figure: CPU time (sec) versus real time, again with SunOS and BSD curves.]
7 Acknowledgements

Van Jacobson originally suggested that randomization be used to circumvent the statistical biases in the CPU estimator. Additionally, he helped interpret the results of our experiments and provided suggestions for implementation strategies.

The idea that the CPU scheduler can be defeated is not new. Dheeraj Sanghi and Olafur Gudmundsson wrote a program similar to the hog presented in this paper. According to them, the idea was originally proposed by Ashok Agrawala.

We are grateful to Spyros Papadakis for helping us formulate our probabilistic arguments. His advice was impeccable; any errors were introduced by us.

Finally, we would like to thank Vern Paxson, Van Jacobson, Craig Leres, Deana Goldsmith, and the referees for their helpful comments on drafts of this paper.

8 Availability

The statclock code appears in the 4.4BSD alpha release. Currently, sparcstation and HP-9000/300 series machines are supported.

References

[1] Carta, D. G. Two fast implementations of the "minimal standard" random number generator. Communications of the ACM 33, 1 (Jan. 1990).

[2] Casner, S., and Deering, S. First IETF Internet audiocast. ConneXions 6, 6 (1992), 10–17.

[3] Durrett, R. Probability: Theory and Examples. Brooks/Cole Publishing Company, 1991.

[4] Graham, S. L., Kessler, P. B., and McKusick, M. K. gprof: A call graph execution profiler. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction (June 1982).

[5] Leffler, S. J., McKusick, M. K., Karels, M. J., and Quarterman, J. S. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley, 1989.

[6] Straathof, J. H., Thareja, A., and Agrawala, A. UNIX scheduling for large systems. In Proceedings of the 1986 Winter USENIX Technical Conference (Denver, CO, Jan. 1986), USENIX, pp. 111–138.

Author Information

Steven McCanne has been with the Lawrence Berkeley Laboratory since 1988, working on network analysis tools and remote conferencing applications. He holds a B.S. degree in Electrical Engineering and Computer Science from U.C. Berkeley, and is currently a Ph.D. student in Computer Science at U.C.B.

Chris Torek has been rewriting bits of the Berkeley kernel for about six years. He joined LBL in 1991, and has been working on porting bsd to the sparcstation. In his off hours he spends far too much time on USENET.
Appendix: The hog Program

#include <sys/types.h>
#include <sys/time.h>
#include <signal.h>
#include <unistd.h>

struct timeval hc;	/* our best guess for when a hardclock happened */
struct timeval now;	/* holds time-of-day for the signal handler */
volatile int ntick;

/* Microseconds from a to b (helper omitted from the published listing). */
long
tvdiff(a, b)
	struct timeval a, b;
{
	return ((b.tv_sec - a.tv_sec) * 1000000 + (b.tv_usec - a.tv_usec));
}

void
alarm_handler()
{
	u_long u;
	struct timeval tv;
	static u_long mindel = 5000000;

	++ntick;
	gettimeofday(&tv, 0);
	u = tvdiff(now, tv);
	if (u < mindel) {
		mindel = u;
		hc = tv;
	}
	usleep(1);
}

/*
 * Try to figure out when hardclock happens.  The body of this routine
 * was garbled in the published listing; the reconstruction below arms
 * a periodic timer and spins updating "now" while alarm_handler
 * refines the estimate in "hc".
 */
struct timeval
train()
{
	struct itimerval it;

	signal(SIGALRM, alarm_handler);
	it.it_value.tv_sec = it.it_interval.tv_sec = 0;
	it.it_value.tv_usec = it.it_interval.tv_usec = 10000;
	setitimer(ITIMER_REAL, &it, 0);
	while (ntick < 200)
		gettimeofday(&now, 0);
	it.it_value.tv_sec = it.it_value.tv_usec = 0;
	setitimer(ITIMER_REAL, &it, 0);		/* disarm */
	return (hc);
}
int
main(argc, argv)
	int argc;
	char **argv;
{
	long us, s, bias, delta, off;
	struct timeval tv;

	/*
	 * Determine when hardclocks are happening, then compute a bias
	 * with respect to an even multiple of hardclock ticks.
	 * Assume a 10ms tick.  Since a second is an even multiple of
	 * a tick, we only need to look at usecs.
	 */
	tv = train();
	bias = tv.tv_usec % 10000;
	/*
	 * Make one copy of ourself.
	 * We need two processes to do real damage.
	 */
	fork();
	for (;;) {
		gettimeofday(&tv, 0);
		/*
		 * Round down to an even tick multiple, then add in the bias.
		 * Compute the estimate of the next hardclock into s and us.
		 */
		s = tv.tv_sec;
		us = tv.tv_usec;
		delta = us % 10000;
		off = bias - delta;
		if (off < 0)
			off += 10000;
		us += off;
		if (us >= 1000000) {
			us -= 1000000;
			++s;
		}
		/*
		 * Spin until 1ms before the next hardclock.
		 */
		us -= 1000;
		while (tv.tv_sec < s || (tv.tv_sec == s && tv.tv_usec < us))
			gettimeofday(&tv, 0);
		usleep(1);
	}
}
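Two practical notes on the listing (ours, not the paper's): hog hard-codes a 10 ms hardclock tick, so the constants 10000 and 1000 (both in microseconds) must match the host's actual tick length, and it relies on usleep(1) rounding the sleep up to the next clock tick. The single fork() reflects the observation in Section 5.2 that one copy can consume at most half of the CPU.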