Computing Environments For Data Analysis
STAN-LCS-09
February 1985
Part 1: Introduction
Department of Statistics
and
Stanford Linear Accelerator Center
Stanford University, Stanford, California 94305
ABSTRACT
This is the first in a series of papers on aspects of modern computing environments that are
relevant to statistical data analysis. We argue that a network of graphics workstations is far
superior to conventional batch or time-sharing computers as an environment for interactive
data analysis. The first paper in the series provides a general introduction and motivation for
more detailed considerations of hardware and software in subsequent parts.
Introduction
Statistics has long been thought of as applied mathematics. Certain parts of it, es-
pecially data analysis, could easily be viewed as applied computation. In this context,
different research issues assume importance, in particular, the design and implemen-
tation of computing environments for data analysis.
Analyzing data requires computing, whether with paper-and-pencil or with a Cray
supercomputer. The computing environment determines what sorts of statistical
methods are practical. More importantly, the statistician's unconscious assumptions,
or mental model of the computing environment, determines the kinds of new statistical
methods that are likely to be invented.
Most current research in statistical computing is based on a batch processing model
of computing environments that was appropriate twenty years ago. With a few ex-
ceptions, statisticians have not addressed the implications of current and future
developments in scientific computing environments.
Although no single number can be a complete summary, it is probably fair to
say-for purposes of discussion-that the performance per dollar of computing equip-
ment has increased by a factor of a thousand over the past twenty years. The large
quantitative change implies a qualitative change in how computers can be used. For
comparison, the magnitude of the difference is larger than that between walking and
flying-if one could fly for the same price as one could walk.
This is the first part of a series of papers that discuss local networks of graphics
workstations as environments for statistical computing. In this, the first part, we
provide general background and motivation. Future parts will describe relevant as-
pects of computing hardware and software, in particular present and future scientific
programming environments.
Chapter 2
of jobs, typically processing one job at a time, from start to finish. The results
appeared, hours or days later, on reams of awkwardly sized line-printer paper. We
will refer to this type of computing environment as batch processing.
Although batch computing environments are, fortunately, nearly extinct (at least
for scientific computing), it is important to keep the batch processing model (punch
card to batch processor to line printer) in mind. Not only are the standard statistical
packages cast in this form, for example SAS, BMDP , and SPSS; much of state-of-
the-art statistical computing has this model as an implicit underlying assumption.
The batch processing model limits the sort of software that can be developed. The
two most confining aspects of batch processing are the long turn-around time for the
simplest computations and the even longer time to make minor changes to a program.
The result is a style of data analysis based on the concept of a statistical package,
a program, or collection of programs, implementing a small set (10 to 50) of fixed
operations that can be applied to sets of data.
Because the time to modify a program is so long, each operation is essentially
fixed, with perhaps a few options that can be set before execution. The set of provided
operations in a package is usually small and new operations are difficult or impossible
to add. There is, therefore, a tendency to force data sets of great individuality into
models that are not really appropriate, just because they happen to be in the available
packages. Prior knowledge, both about complex internal structure and about the
external context of the data, must often be ignored. The lack of flexibility is pervasive
and imposes severe limits on the creativity in solving the new problems posed by each
new data set.
Because the execution turn-around time is so long, each operation is designed
as a more-or-less complete analysis. Intermediate choices in an analysis must be
automated; they cannot be based on interpretations of intermediate results. These
automated decisions are often a poor substitute for straightforward interactive meth-
ods. As a result, it is not uncommon for packaged routines to attempt to anticipate
every possible contingency, generating volumes of output of which only some small
part is actually relevant.
An argument can be made that data analysis has regressed by entering the com-
puter age. Although the computing constraint has been largely eliminated, statisti-
cians have lost the flexibility inherent in the pencil-and-paper mode of analysis. John
Tukey drives this point home in the postscript to EDA, a book written almost entirely
from the pencil-and-paper point of view.
This book focusses on paper and pen(cil) - graph paper and tracing
paper when you can get it, backs of envelopes if necessary - multicolor
pen if you can get it, routine ballpoint or pencil if you cannot. Some
would say that this is a step backward, but there is a simple reason why
they cannot be right: much of what we have learned to do to data can be
done by hand - with or without the aid of a hand-held calculator - LONG
BEFORE one can find a computing system - to say nothing of getting
the data entered. Even when every household has access to a computing
system, it is unlikely that 'just what we would like to work with' will be
easily enough available. Now - and in the directly foreseeable future -
there will be a place for hand calculation. [Tukey, 1977, p. 663]
The challenge before us is to recover the flexibility that has been lost, but also to
augment it with the considerable computing power that should be available at our
fingertips.
2.3 Time-sharing
As computer prices dropped there was less concern for efficient use of machine time
and more concern for the efficient use of people's time. This led to the development of
time-sharing operating systems and a trend towards smaller computers (which were
often only smaller in the sense that they were owned and used by a smaller number
of people).
A time-shared mainframe communicates with its users through many remote ter-
minals and provides a few centralized peripheral devices (e.g. a line printer, disk
drive, tape drive, etc.). The mainframe runs a multi-user operating system and users
rent time, sharing the central processor.
Some statistical packages, e.g. Minitab, take advantage of time-shared computing
to provide some degree of interaction, but most statistical computing has changed
only to the extent that it is easier to submit batch jobs.
Statistical programming languages, like S [Becker and Chambers, 1984a, 1984b]
and ISP [Donoho, Huber, Ramos, and Thoma, 1982], make better use of the possi-
bilities of interactive computing. A statistical language provides tools for combining
primitive functions and data structures to create new, higher-level functions and data
structures. A statistical language is therefore extensible, in a way that a statistical
package is not. That is, new operations can be defined and easily integrated into
the language. In a package the procedures stand alone and communication between
procedures is awkward or impossible, while in a language the operations are designed
to work together.
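To make the distinction concrete, here is a minimal sketch, written in modern Python rather than in S or ISP and using hypothetical function names, of what extensibility buys: a new operation is built from existing primitives and then used, and combined with other operations, exactly as if it were built in.

    # A sketch (not S itself): a new operation is composed from primitives
    # and then behaves like any built-in operation.  Names are hypothetical.
    import numpy as np

    def fit_line(x, y):
        """Primitive: least squares slope and intercept."""
        b, a = np.polyfit(x, y, 1)
        return a, b

    def residuals(x, y):
        """New operation, defined by the user from the primitive above."""
        a, b = fit_line(x, y)
        return y - (a + b * x)

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.1, 1.9, 3.2, 3.8, 5.3])
    print(residuals(x, y))          # used like a built-in operation
    print(np.std(residuals(x, y)))  # and freely combined with others

In a package, by contrast, such an operation exists only if its authors anticipated the need; in a language it can be written, tested, and reused in minutes.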
A workstation is a complete computer that is used by a single person. The workstations
discussed in this paper may be thought of as roughly equal-in speed of
computation, amount of memory, and so forth-to a VAX (somewhere in the range
covered by the 11/730 to the 11/780). In other words, they do not have the limitations
that one might associate with present-day personal or home computers.
A graphics workstation includes, in addition, a high resolution bitmap display and
graphical input devices (e.g. mouse, trackerball, joystick) as integral parts [Beatty,
1983; Foley and VanDam, 1982; Newman and Sproull, 1979]. This combination per-
mits a natural, graphical language for communication between user and computer,
which, in turn, leads to the design of computing environments that are more power-
ful, more sophisticated, and easier to use.
Examples are workstations built by Apollo, Sun, Chromatics, Symbolics, Ridge,
and Masscomp, which are discussed in more detail in part 2 of this paper. Briefly,
some of the properties of a graphics workstation appropriate for use in statistics are:
• Hardware
- A central processor equivalent to a VAX (500,000 to several million instructions per second).
- Fast floating point equivalent to a VAX (500,000 to several million floating point operations per second).
- Several megabytes of physical memory (RAM).
- Virtual memory (a large address space).
- Tens to hundreds of megabytes of disk storage.
- Bitmap graphics display(s), at a resolution of approximately 1000 x 1000, in black-and-white and/or color.
- Auxiliary processors, including array processors for rapid numerical com-
putation and graphics processors for fast picture drawing.
- Graphical input devices such as a mouse.
- Both hardware and software to support communication over a high speed
(several megabits per second) local network (usually some flavor of Ether-
net).
• Programming Environment
A user controls the action of a computer through a set of (software) tools-the
programming environment. We are intentionally blurring the distinction between
direct, interactive (interpreted) commands and the more complex deferred in-
structions produced by writing, compiling, linking, and loading a program. The
basic alternatives in graphics workstations are a conventional operating system,
such as Unix [Deitel, 1983; Kernighan and Mashey, 1981; UNIX, 1978], or an in-
tegrated programming environment, such as Interlisp-D [Sheil, 1983; Teitelman
and Masinter, 1981] or Smalltalk [Goldberg, 1983, 1984; Goldberg and Robson,
1983; Krasner, 1983]. Examples of facilities provided by a programming
environment are [Deutsch and Taft, 1980]:
- Programming languages.
- A file system.
- An (intelligent) display editor.
- Interactive debugging tools.
- A windowing system for effective use of the bitmap display.
- Support for graphical interaction.
• Gateways to wide-area networks, such as Arpanet.
The distinction between personal computers, graphics workstations, and minicomputers
may disappear as graphics workstations become more powerful than minicomputers
and no more expensive than personal computers.
It is important to emphasize that this series of papers describes computing en-
vironments that will not be fully realized for five to ten years. Although graphics
workstations are commercially available, they range in price from $15,000 to $150,000,
which is too expensive for there to be one on every desk. Also, even if the hardware
is available, software for statistical applications has yet to be written.
Another serious problem is the lack of standardization of network protocols and
internal software that is necessary to allow communication and portability of software
between machines of different manufacturers.
There are good reasons to expect the situation to change rapidly. For example, the
recently announced Apple Macintosh has many of the desirable features of graphics
workstations. However, it has a list price of only $2,500 (and is available in many
universities for about $1,000).
Chapter 3
Why Workstations?
language for control of the computer. Bitmap displays and graphical input devices
provide a 'high bandwidth channel' (rich and fast information transmission both
ways) for communication between person and machine.
Graphical interaction permits the design of sophisticated programming environ-
ments that are both easier for the novice and more powerful for the expert. We will
discuss programming environments in more detail in a future part of the paper.
Acceptable graphical interaction is difficult to achieve on anything but a single
user graphics workstation because it requires graphical input and output
to have guaranteed fast response. On a multi-user machine, the response time will vary
as the operating system varies the size of the time-slice given to each user. Multi-user
machines typically communicate with graphics terminals at rates of about 10,000 bits
per second; required communication speeds may be one or two orders of magnitude
higher.
• Prim-H [Donoho, Huber, and Thoma, 1981; Donoho, Huber, Ramos, and Thoma, 1982].
frames per second. However, shape is perceived with more comfort at rates closer to
24 or 30 frames per second.
The rate at which motion graphics can be displayed is determined by:
• the time it takes to compute a new frame.
• the time it takes to erase the old frame and draw the new one.
To display three-dimensional scatterplots-or other types of motion graphics-at 30
frames per second, all the computations associated with each frame must be completed
in roughly 1/60 of a second, to allow time for drawing and erasing the frames.
Rotation about an arbitrary axis with perspective [Foley and VanDam, 1982; Newman
and Sproull, 1979] requires at least nine multiplications, nine additions, and two
divisions-20 arithmetic operations per point. Therefore, for smooth rotation of a
scatterplot with 1000 points, the computer may need to do 1,200,000 arithmetic
operations per second. Orthogonal rotation about either the vertical or horizontal axis
takes 3 multiplies and 2 adds per point; at a minimal rate of 5 frames per second,
1000 points demand about 50,000 arithmetic operations per second.
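These budgets follow from a short calculation. The sketch below, in Python, simply restates the operation counts and frame rates quoted above, using the convention that the per-frame computation must finish in half the frame time to leave room for erasing and redrawing.

    # A sketch of the arithmetic budgets quoted in the text.  At a display
    # rate of f frames per second, the per-frame computation is assumed to
    # finish in 1/(2f) seconds, leaving the rest for erasing and redrawing.
    def ops_per_second(ops_per_point, n_points, frames_per_second):
        compute_time_per_frame = 1.0 / (2 * frames_per_second)
        return ops_per_point * n_points / compute_time_per_frame

    # Rotation about an arbitrary axis with perspective:
    # 9 multiplies + 9 adds + 2 divides = 20 operations per point.
    print(ops_per_second(20, 1000, 30))   # 1,200,000 operations per second

    # Orthogonal rotation about one screen axis: 3 multiplies + 2 adds.
    print(ops_per_second(5, 1000, 5))     # 50,000 operations per second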
Moving scatterplots are usually displayed by drawing and erasing a small symbol
for each point. Such symbols may consist of from 1 to 100 pixels (picture elements).
For five pixel symbols, erasing and re-drawing a frame with 1000 points in 1/60 second
requires a graphics device that can do 600,000 pixel writes per second. Changing a
pixel typically means sending 32 bits of address and 8 bits of data-a total of 5 bytes-
from the computer to the graphics device. Therefore smooth rotation may require the
computer to be able to communicate with the graphics device at 3 megabytes per
second.
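The display budget can be restated the same way; the following Python sketch uses only the figures quoted above (five-pixel symbols, 1000 points, a 1/60 second redraw, and five bytes per pixel write).

    # A sketch of the display budget quoted in the text: erase and redraw a
    # 1000-point scatterplot of five-pixel symbols in 1/60 second, sending
    # 5 bytes (32 bits of address + 8 bits of data) per pixel write.
    def pixel_writes_per_second(n_points, pixels_per_symbol, redraw_time):
        # each frame both erases the old symbols and draws the new ones
        return 2 * n_points * pixels_per_symbol / redraw_time

    writes = pixel_writes_per_second(1000, 5, 1.0 / 60)  # 600,000 writes/s
    print(writes, writes * 5)                            # ~3,000,000 bytes/s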
This type of real-time motion is impractical on a time-shared computer, since
one cannot rely on a sufficiently large time-slice to do the necessary computing. In
contrast, a workstation is dedicated to a single user, so the timing constraints are
more easily satisfied. Also, real-time motion is inhibited in time-shared systems by
slow data transfer between a central processor and a remote graphics terminal.
y = a + b·x^θ
For fixed values of θ, â(θ) and b̂(θ) are fit by least squares. On a graphics workstation,
we can use a dial, or some equivalent graphical input device, to control the value of
θ. We would display a scatterplot of yi versus zi = xi^θ with the least squares line
y = â(θ) + b̂(θ)·z. The picture must change smoothly as we move the dial that controls
θ. To do this the graphics system must be able to compute the transformation zi = xi^θ,
the least squares fit, and erase and redraw the picture about 30 times a second.
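The numerical work that must be repeated on every movement of the dial is small and easy to state. The following Python sketch (using NumPy for the least squares fit, with all drawing omitted and the data invented for illustration) shows that per-frame computation.

    # A sketch of the per-frame work for the dial-controlled transformation
    # y = a + b * x**theta.  Drawing is omitted; the data are invented.
    import numpy as np

    def fit_for_theta(x, y, theta):
        """Transform z = x**theta, then fit y = a + b*z by least squares."""
        z = x ** theta
        b, a = np.polyfit(z, y, 1)   # returns slope, then intercept
        return a, b, z

    x = np.linspace(1.0, 10.0, 1000)
    y = 2.0 + 3.0 * np.sqrt(x) + np.random.normal(scale=0.1, size=x.size)

    # As the dial sweeps theta, refit and (on a workstation) redraw the
    # scatterplot of y versus z together with the line y = a + b*z.
    for theta in (0.25, 0.5, 0.75, 1.0):
        a, b, z = fit_for_theta(x, y, theta)
        print(f"theta={theta:4.2f}  a={a:7.3f}  b={b:7.3f}")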
A still more demanding example is Interactive Projection Pursuit Regression,
which is described by Friedman and Stuetzle [1981b, 1982a, 1982b] and McDonald
[1982a] and demonstrated in the film by McDonald [1982c].
Chapter 4
Chapter 5
References
of Pacific 75, ACM Regional Conference.
Foley, J.D. and Van Dam, A. (1982)
Fundamentals of Interactive Computer Graphics.
Reading, Mass., Addison Wesley, 1982
Friedman, J.H., McDonald, J.A., and Stuetzle, W. (1982)
An Introduction to Real Time Graphical Techniques for Analyzing Multivariate Data
Proc. of the Third Annual Conference and Exposition of the National Computer
Graphics Association, Inc., Volume I.
Friedman, J.H. and Stuetzle, W. (1981b)
Projection Pursuit Regression
JASA, vol. 76, pp 817-823
Friedman, J.H. and Stuetzle, W. (1982a)
Smoothing of Scatterplots
Dept. of Statistics Tech. Rept. Orion 3, Stanford University.
Friedman, J.H. and Stuetzle, W. (1982b)
Projection Pursuit Methods for Data Analysis
in Modern Data Analysis
Launer, R.L. and Siegel, A.F., eds. New York, Academic Press, 1982
Friedman, J.H. and Stuetzle, W. (1982c)
Hardware for Kinematic Statistical Graphics
Computer Science and Statistics: Proc. of the 15th Symposium on the Interface.
Goldberg, A. (1983)
The Influence of an Object-Oriented Language on the Programming Environment
reprinted in Barstow, D.R., Shrobe, H.E., and Sandewall, E., eds. (1984)
Interactive Programming Environments.
Goldberg, A. (1984)
Smalltalk-80: The Interactive Programming Environment
Reading, Mass., Addison Wesley, 1984
Goldberg, A. and Robson, D. (1983)
Smalltalk-80, The Language and Its Implementation
Reading, Mass., Addison Wesley, 1983
Green, P.E., ed. (1982)
Computer Network Architectures and Protocols
New York, Plenum Press, 1982
Kernighan, B.W. and Mashey, J.R. (1981)
The UNIX Programming Environment
Computer Vol 14, Num 4, April 1981, pp. 25-34
Kernighan, B.W. and Ritchie, D.M. (1978)
The C Programming Language
Englewood Cliffs, NJ., Prentice Hall, 1978
Krasner, G., ed. (1983)
Smalltalk-80, Bits of History, Words of Advice
Reading, Mass., Addison Wesley, 1983
Littlefield, R.J. and Nicholson, W.L. (1982)
Use of Color and Motion for the Display of Higher Dimensional Data
Proc. of the 1982 DOE Statistical Symposium.
McDonald, J.A. (1982a)
Interactive Graphics for Data Analysis
Ph.D. thesis, Dept. of Statistics, Stanford; available as Dept. of Statistics Tech.
Rept. Orion 11, Stanford University
McDonald, J.A. (1982b)
Exploring Data with the Orion I Workstation
a 25 minute, 16mm sound film, which demonstrates programs described in McDonald
(1982a). It is available for loan from: Jerome H. Friedman, Computation Research
Group, Bin # 88, SLAC, P.O. Box 4349, Stanford, California 94305
McDonald, J.A. (1982c)
Projection Pursuit Regression with the Orion I Workstation
a 20 minute, 16mm color sound film, which demonstrates programs described in Mc-
Donald (1982a). It is available for loan from: Jerome H. Friedman, Computation
Research Group, Bin # 88, SLAC, P.O. Box 4349, Stanford, California 94305
Newman, W.M. and Sproull, R.F. (1979)
Principles of Interactive Computer Graphics, 2nd ed.
New York, McGraw Hill, 1979
Nicholson, W.L. and Littlefield, R.J. (1982)
The Use of Color and Motion to Display Higher Dimensional Data
Proc. of the Third Annual Conference and Exposition of the National Computer
Graphics Association, Inc., Volume I.
Prim-9 (1974)
Prim-9, a 30 minute, 16mm sound film. It is available for loan from: Jerome H.
Friedman, Computation Research Group, Bin # 88, SLAC, P.O. Box 4349, Stanford,
California 94305
Sheil, B.A. (1983)
Power Tools for Programmers
reprinted in Barstow, D.R., Shrobe, H.E., and Sandewall, E., eds. (1984)
Interactive Programming Environments
Tanenbaum, A.S. (1981a)
Computer Networks
Englewood Cliffs, N.J., Prentice Hall, 1981
Tanenbaum, A.S. (1981b)
Network Protocols
ACM Computing Surveys, Vol. 13, Num. 4, Dec. 1981, pp 453-489
Teitelman, W. and Masinter, L. (1981)
The Interlisp Programming Environment
reprinted in Barstow, D.R., Shrobe, H.E., and Sandewall, E., eds. (1984)
Interactive Programming Environments
Tukey, J.W. (1977)
Exploratory Data Analysis
Reading, Mass., Addison Wesley, 1977
UNIX (1978)
Special issue on UNIX, The Bell System Technical Journal, Vol. 57