The Grammar of Graphics
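[Figure: data-flow diagram of the GoG system. A Source supplies Data, which passes through the stages Variables, Algebra, Scales, Statistics, Geometry, Coordinates, and Aesthetics; the intermediate objects are labeled Varset and Graph, and the resulting Graphic is passed to a Renderer.]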
The resulting set of tuples is a subset of the product of the domains of the two varsets. The domain of a varset produced by a cross is the product of the separate domains. One may think of a cross as a horizontal concatenation of the table representation of two varsets, assuming the rows of each varset are equivalent and in the same order.

Nest (/)
Nest partitions the left argument using the values in the right:

\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix} / \begin{bmatrix} a \\ a \\ b \end{bmatrix}
= \begin{bmatrix} x & a \\ y & a \\ z & b \end{bmatrix}
\]

Although it is not required in the definition, we assume the nesting varset on the right is categorical. The name nest comes from design-of-experiments terminology. We often use the word within to describe its effect. For example, if we assess schools and teachers in a district, then teachers within schools specifies that teachers are nested within schools. Assuming each teacher in the district teaches at only one school, we would conclude that if our data contain two teachers with the same name at different schools, they are different people.

Blend (+)
Blend produces a union of varsets:

\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix} + \begin{bmatrix} a \\ a \\ b \end{bmatrix}
= \begin{bmatrix} x \\ y \\ z \\ a \\ a \\ b \end{bmatrix}
\]

Blend is defined only if the order of the tuples (number of columns) in the left and right varsets is the same. Furthermore, we need to restrict blend to varsets with composable domains. It would make little sense to blend Age and Weight, much less Name and Height. The Scales class, in the next section, throws an exception if we attempt to blend varsets across different types of scales.

Scales
Before we compute summaries (totals, means, smoothers, ...) and represent these summaries using geometric objects (points, lines, ...), we must scale our varsets. In producing most common charts, we do not notice this step. When we implement log scales, however, we notice it immediately. We must log our data before averaging logs. Even if we do not compute nonlinear transformations, however, we need to specify a measurement model.
The measurement model determines how distance in a frame region relates to the ranges of the variables defining that region. Measurement models are reflected in the axes, scales, legends, and other annotations that demarcate a chart’s frame. Measurement models determine how values are represented (e.g., as categories or magnitudes) and what the units of measurement are.
In constructing scales for statistical charts, we need to know the function used to assign values to objects. S.S. Stevens developed a taxonomy of such functions based on axioms of measurement [2]. Stevens identified four basic scale types: nominal, ordinal, interval, and ratio. These scales are widely cited in introductory statistics books and in some visualization schemes [3].
For graphics grammar, we employ a more restrictive classification based on units of measurement. Unit scales permit standardization and conversion of metrics and raise exceptions when improper blends are attempted. The International System of Units (SI) unifies measurement under transformation rules encapsulated in a set of base classes [4]. Most of the measurements in the SI system fit within the interval and ratio levels of Stevens’ system. There are other scales fitting Stevens’ system that are not classified within the SI system, however. These involve units such as category (state, province, country, color, species, ...), order (rank, index), and measure (probability, proportion, percent, ...). Also, there are some additional scales that are in neither the Stevens nor the SI system, such as partial order.

Statistics
The statistics component receives a varset, computes statistical summaries, and outputs another varset. In the simplest case, the statistical method is an identity. We do this for scatterplots. Data points are input and the same data points are output. In other cases, such as histogram binning, a varset with n rows is input and a varset with k rows is output, where k is the number of bins (k < n). With smoothers (regression or interpolation), a varset with n rows is input and a varset with k rows is output, where k is the number of knots in a mesh over which smoothed values are computed. With point summaries (means, medians, ...), a varset with n rows is input and a varset with one row is output. With regions (confidence intervals, ranges, ...), a varset with n rows is input and a varset with two rows is output.
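The row counts described for the statistics component can be illustrated with a minimal Python sketch. It assumes a varset can be modeled as a plain list of (x, y) rows; that representation and the function names (stat_identity, stat_bin, stat_mean, stat_range) are hypothetical and are not taken from GoG or from any GoG implementation.

```python
from statistics import mean

# Assumption: a varset is modeled as a list of (x, y) rows.
# All names below are illustrative, not part of any GoG system.

def stat_identity(rows):
    """n rows in, the same n rows out (the scatterplot case)."""
    return list(rows)

def stat_bin(rows, k):
    """n rows in, k rows out: (bin midpoint, count) for k equal-width bins on x."""
    xs = [x for x, _ in rows]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 if all x values are equal
    counts = [0] * k
    for x in xs:
        i = min(int((x - lo) / width), k - 1)  # clamp the maximum into the last bin
        counts[i] += 1
    return [(lo + (i + 0.5) * width, counts[i]) for i in range(k)]

def stat_mean(rows):
    """n rows in, one row out: the column means."""
    return [(mean(x for x, _ in rows), mean(y for _, y in rows))]

def stat_range(rows):
    """n rows in, two rows out: column minima, then column maxima."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return [(min(xs), min(ys)), (max(xs), max(ys))]

rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (10.0, 20.0)]
for name, result in [
    ("identity", stat_identity(rows)),
    ("bin", stat_bin(rows, 3)),
    ("mean", stat_mean(rows)),
    ("range", stat_range(rows)),
]:
    print(name, len(rows), "rows in,", len(result), "rows out")
```

Each function takes a varset and returns a varset, so any of them could feed any geometric object downstream.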
Statistical methods are members of their own object, so they are independent of the other elements in the system. There is no necessary connection between regression methods and curves or between confidence intervals and error bars or between histogram binning and histograms. We can represent the same statistic with a variety of different geometric objects.

Geometry
Geometric graphs are produced by graphing functions $F : B^m \to \mathbb{R}^n$ that injectively map an m-dimensional bounded region to an n-dimensional real space and that have geometric names like line() or bar(). A geometric graph is the image of F. Geometric graphs are not visible; they are geometric sets.

• The point() graphing function produces a geometric point, which is an n-tuple. This function can also produce a finite set of points, called a multipoint or a point cloud. The set of points produced by point() is called a point graph.
• The line() graphing function is a bit more complicated. Let $B^m$ be a bounded region in $\mathbb{R}^m$. Consider the function $F : B^m \to \mathbb{R}^n$, where $n = m + 1$, with the following additional properties:
    1. the image of F is bounded, and
    2. $F(x) = (v, f(v))$, where $f : B^m \to \mathbb{R}$ and $v = (x_1, \ldots, x_m) \in B^m$.
  If m = 1, this function maps an interval to a functional curve on a bounded plane; and if m = 2, it maps a bounded region to a functional surface in a bounded 3D space. The line() graphing function produces these graphs. Like point(), line() can produce a finite set of lines. A set of lines is called a multiline. We need this capability for representing multimodal smoothers, confidence intervals on regression lines, and other multifunctional lines.
• The area() graphing function produces a graph containing all points within the region under the line graph.
• The path() graphing function is similar to a line, but it is not ordered on x. A path() produces a path that connects points such that each point touches no more than two line segments. Thus, a path visits every point in a collection of points only once. If a path is closed (every point touches two line segments), we call it a circuit.
• The bar() or, equivalently, interval() graphing function produces a set of closed intervals. An interval has two ends. Ordinarily, however, bars are used to denote a single value through the location of one end. The other end is anchored at a common reference point (usually zero).
• A schema is an element that includes both general and particular features in order to represent a distribution. We have taken this usage from Tukey [5], who invented the schematic plot, which has come to be known as the box plot because of its physical appearance. The schema() graphing function produces a collection of one or more points and intervals.
• The polygon() graphing function partitions a surface or space with polygons. Examples are tessellations and boundary polygons in maps.
• A contour() graphing function produces contours, or level curves. A contour graph is used frequently in weather and topographic maps. Contours can be used to delineate any continuous surface.
• The edge() graphing function joins points with line segments (edges). Although edges join points, a point graph is not needed in a frame in order to make an edge.

Coordinates
Most popular charts employ Cartesian coordinates. The same real tuples in the graphs underlying these graphics can be embedded in many other coordinate systems, however. Examples are polar, fisheye, and geographic (spherical) projections. The GoG coordinates object is a rich utility for radically changing the appearance of a statistical graphic. A software system implementing GoG can change a statistical graphic’s form in one line of code.
Many peculiar names are popularly given to charts that are simple coordinate transformations of other popular charts. Spider (radar) charts are polar parallel coordinate plots. Rose charts are polar bar charts. Pie charts are polar divided bar charts. Sparklines are similarity transformations of ordinary line charts.

Aesthetics
An aesthetic is a function that maps a graph to a perceivable graphic. Seven of the aesthetic functions in GoG are derived from Bertin’s visual variables [6]: position (position), size (taille), shape (forme), orientation (orientation), brightness (valeur), color (couleur), and granularity (grain). In GoG, color
is separated into three components. Additional aesthetics involve dimensions such as blur, sound, and motion.

SOFTWARE
Wilkinson [1] states that GoG is not a taxonomy or a semiotic system. It is claimed to be the mathematical foundation of statistical graphics. It is also claimed that statistical graphics differ fundamentally from other visualizations such as maps, charts, diagrams, and volumetric 3D realizations. Furthermore, GoG is claimed to be the appropriate class hierarchy for an object-oriented graphics system capable of producing statistical graphics.
Consequently, software systems that implement GoG have classes named Algebra, Scales, Statistics, Geometry, Coordinates, and Aesthetics. These classes have multiple methods to produce the rich variety of statistical graphics. Implementing these classes so that they preserve orthogonality of their methods is difficult. Testing orthogonality is easy, however. In a correctly designed GoG system, we ought to be able to combine any method in any class with any method in any other class to produce a meaningful graphical result. Those implementing GoG have discovered that some of these combinations produce novel and intriguing visualizations. And Graham Wills has demonstrated that GoG can produce in a simple specification some of the most elaborate statistical visualizations presented at the InfoVis conference in recent years [7].
There have been several software systems based on the GoG foundation. The earliest was the Java Graphics Production Language, GPL (not to be confused with the GNU General Public License, GPL), developed for GoG by Rope and Wilkinson [8]. GPL was intended to complement the Bureau of Labor Statistics’ Table Producing Language, TPL. GPL was eventually marketed by Illumitek Inc. under the name nViZn, and the GPL language was incorporated into the first edition of GoG.
Rope, Wills, Dubbs, Wilkinson, and other members of the visualization team at SPSS rewrote nViZn to incorporate an XML interface and to refine GPL. This system serves as the underlying graphic engine for various IBM analytic products. The visualization team implemented a GPL interpreter that can execute examples from the second edition of the book. The interpreter is rather well-hidden among the many programming options inside SPSS.
The Polaris system at Stanford [9] implemented a GoG architecture for producing visualizations from a multidimensional relational database. This software evolved into the Tableau visualization package distributed by Tableau Software and used in the Hyperion OLAP.
Hadley Wickham has developed a GoG package called ggplot2 for the R platform [10]. It implements many of the graphics found in Ref. 1 using a simple functional language.
Finally, the Protovis project at Stanford [11] is implementing most of the grammar in a Web-based platform that uses JavaScript for specifying geometric elements. This system makes it especially easy to insert statistical graphics into Web pages.

CONCLUSION
The GoG is both the title of a book and a term for an object-oriented graphics system for creating and automating statistical graphics.
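The orthogonality test described in the Software section (combine any method in any class with any method in any other class) might be sketched as follows. This is a toy illustration under stated assumptions: the three dictionaries stand in for the Statistics, Geometry, and Coordinates classes, and every name in it (STATISTICS, GEOMETRY, COORDINATES, build_graph, check_orthogonality) is hypothetical rather than drawn from GPL, nViZn, ggplot2, or any other system mentioned above.

```python
import math
from itertools import product

# Hypothetical stand-ins for three GoG classes; a varset is a list of (x, y) rows.
STATISTICS = {
    "identity": lambda rows: list(rows),
    "mean": lambda rows: [(sum(x for x, _ in rows) / len(rows),
                           sum(y for _, y in rows) / len(rows))],
}

GEOMETRY = {
    # point(): each row becomes its own one-point shape
    "point": lambda rows: [("point", [r]) for r in rows],
    # line(): the rows, ordered on x, form a single polyline
    "line": lambda rows: [("line", sorted(rows))],
}

COORDINATES = {
    "cartesian": lambda x, y: (x, y),
    # polar: interpret x as the angle in radians and y as the radius
    "polar": lambda x, y: (y * math.cos(x), y * math.sin(x)),
}

def build_graph(rows, stat, geom, coord):
    """Run varset -> statistic -> geometric graph -> coordinate transformation."""
    summarized = STATISTICS[stat](rows)
    shapes = GEOMETRY[geom](summarized)
    transform = COORDINATES[coord]
    return [(kind, [transform(x, y) for x, y in pts]) for kind, pts in shapes]

def check_orthogonality(rows):
    """Try every statistic x geometry x coordinates combination on the same varset."""
    return {combo: build_graph(rows, *combo)
            for combo in product(STATISTICS, GEOMETRY, COORDINATES)}

results = check_orthogonality([(0.0, 1.0), (1.0, 2.0)])
for combo, graph in results.items():
    print(combo, "->", len(graph), "shape(s)")
```

Because each stage only consumes the output type of the previous one, adding a method to one dictionary requires no change to the others. That independence is the property the orthogonality test checks.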
REFERENCES
1. Wilkinson L. The Grammar of Graphics. 2nd ed. New York: Springer-Verlag; 2005.
2. Stevens SS. On the theory of scales of measurement. Science 1946, 103:677–680.
3. Velleman P, Wilkinson L. Nominal, ordinal, interval, and ratio typologies are misleading for classifying statistical methodology. Am Stat 1993, 47:65–72.
4. Taylor BN, Thompson A (eds). The International System of Units (SI). Gaithersburg, MD: National Institute of Standards and Technology; 2008.
5. Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley Publishing Company; 1977.
6. Bertin J. Sémiologie Graphique. Paris: Editions Gauthier-Villars; 1967.
7. Wills G, Zhang C. Grammatical visualization of statistical models. Joint Statistical Meetings, Salt Lake City, UT, 2007.
8. Wilkinson L, Rope DJ, Carr DB, Rubin MA. The language of graphics. J Comput Graph Stat 2000, 9:530–543.
9. Stolte C, Tang D, Hanrahan P. Polaris: a system for query, analysis, and visualization of multidimensional relational databases. IEEE Trans Vis Comput Graph 2002, 8:52–65.
10. Wickham H. ggplot2. New York: Springer-Verlag; 2009.
11. Bostock M, Heer J. Protovis: a graphical toolkit for visualization. Proceedings of the IEEE Information Visualization Conference, Atlantic City, NJ, 2009.