Advanced Review

The grammar of graphics


Leland Wilkinson∗
The grammar of graphics (GoG) denotes a system with seven classes embedded
in a data flow. This data flow specifies a strict order in which data are transformed
from a raw dataset to a statistical graphic. Each class contains multiple methods,
each of which is a function executed at the step in the data flow corresponding to
that class. The classes are orthogonal, in the sense that the product set of all classes
(every possible sequence of class methods) defines a space of graphics which is
meaningful at every point. The meaning of a statistical graphic is thus determined
by the mapping produced by the function chain linking data and graphic.
© 2010 John Wiley & Sons, Inc. WIREs Comp Stat 2010, 2:673–677. DOI: 10.1002/wics.118

∗ Correspondence to: [email protected]
Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA

INTRODUCTION

The grammar of graphics (GoG) denotes a system with seven orthogonal classes.1
The term orthogonal means that each class contains one or more methods
(functions) as elements, and all tuples in the seven-fold product of these sets
of functions produce meaningful graphs. A consequence of this orthogonality is a
high degree of expressiveness: we can produce a huge variety of graphical forms
or chart types in such a system. In fact, it is claimed that virtually the
entire corpus of known charts can be generated by this relatively parsimonious
system, and perhaps a great number of meaningful but undiscovered chart types
as well.

A second claim of GoG is that this system describes the meaning of what we do
when we construct statistical graphics. It is more than a taxonomy. It is a
computational system based on the underlying mathematics of representing
statistical functions of data.

THE GoG DATA FLOW

Figure 1 shows a data flow diagram that contains the seven GoG classes. This
data flow is a chain that describes the sequence of mappings needed to produce
a statistical graphic from a set of data. The first class (Variables) maps data
to an object called a varset (a set of variables). The next two classes
(Algebra, Scales) are transformations on varsets. The next class (Statistics)
takes a varset and creates a statistical graph (a statistical summary). The
next class (Geometry) maps a statistical graph to a geometric graph. The next
(Coordinates) embeds a graph in a coordinate space. And the last class
(Aesthetics) maps a graph to a visible or perceivable display called a graphic.

The data flow architecture implies that the subtasks needed to produce a
graphic from data must be done in this specified order. Changes to this
ordering can produce meaningless graphics. For example, if we compute certain
statistics on variables (e.g., sums) before scaling them (e.g., log scales), we
can produce statistically meaningless results because the log of a sum is not
the sum of the logs.

The data flow in Figure 1 has many paths through it because of the multiple
methods (functions) in each stage. We can choose different algebraic designs
(factorial, nested, ...), scales (log, probability, ...), statistical methods
(means, medians, modes, smoothers, ...), geometric objects (points, lines,
bars, ...), coordinate systems (rectangular, polar, ...), and aesthetics (size,
shape, color, ...). These paths reveal the richness of the system.

Variables
We begin with data. We assume the data that we wish to graph are organized in
one or more tables. The columns of each table are called fields, with each
field containing a set of measurements or attributes. The rows of each table
are called records, with each record containing the measurements of an object
on each field. Usually, a relational database management system (RDBMS)
produces such a table from organized queries specified in structured query
language (SQL) or some other relational language. Other data sources (object,
streaming, ...) are mapped to tables through similar methods.

Volume 2, November/December 2010 © 2010 John Wiley & Sons, Inc. 673


Advanced Review www.wiley.com/wires/compstats

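[Figure 1: data flow diagram. Source → Data → Variables → Algebra → Scales →
Statistics → Geometry → Coordinates → Aesthetics → Graphic → Renderer. The four
links from Variables through Geometry are labeled Varset; the two links from
Geometry through Aesthetics are labeled Graph.]

FIGURE 1 | The grammar of graphics data flow.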
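
The ordering constraint in this data flow can be illustrated with a small numerical sketch. It is not taken from any GoG implementation: the helper names `compose`, `log_scale`, and `total` are hypothetical, and each stage is assumed to be a plain function from a list of values to a list of values. Chaining the Scales stage before the Statistics stage gives the sum of the logs, while swapping the two stages gives the log of the sum, which is a different quantity.

```python
import math
from functools import reduce


def compose(*stages):
    """Chain stages left to right, so compose(f, g)(x) == g(f(x))."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


def log_scale(values):
    """Scales stage (hypothetical): apply a base-10 log to every value."""
    return [math.log10(v) for v in values]


def total(values):
    """Statistics stage (hypothetical): reduce the values to a single sum."""
    return [sum(values)]


values = [1.0, 10.0, 100.0]

scale_then_sum = compose(log_scale, total)  # GoG order: Scales before Statistics
sum_then_scale = compose(total, log_scale)  # the two stages swapped

sum_of_logs = scale_then_sum(values)[0]  # 0 + 1 + 2
log_of_sum = sum_then_scale(values)[0]   # log10(111)

print(sum_of_logs, log_of_sum)
```

The two results differ (3.0 versus roughly 2.05), which is why the position of a stage in the chain is part of the meaning of the resulting graphic.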

Our first step is to convert a table of data to a varset. A varset is a set of
one or more variables. While a column of a table of data might superficially be
considered to be a variable, there are differences. A variable is both more
general (in regard to generalizability across samples) and more specific (in
regard to data typing and other constraints) than a column of data. First, we
define a variable, then a varset.

Variable
A variable X is a mapping f : O → V, which we consider as a triple:

$$X = [O, V, f].$$

The domain O is a set of objects.
The codomain V is a set of values.
The function f assigns to each element of O an element in V.

The image of O under f contains the values of X. We denote a possible value as
x, where x ∈ V. We denote a value of an object as X(o), where o ∈ O. A variable
is continuous if V is an interval. A variable is categorical if V is a finite
subset of the integers (or there exists an injective map from V to a finite
subset of the integers).

Variables may be multidimensional. X is a p-dimensional variable made up of p
one-dimensional variables:

$$
\begin{aligned}
X &= (X_1, \dots, X_p) \\
  &= [O, V_i, f], \quad i = 1, \dots, p \\
  &= [O, V, f].
\end{aligned}
$$

The element $x = (x_1, \dots, x_p)$, $x \in V$, is a p-dimensional value of X.
We use multidimensional variables in multivariate analysis.

Varset
We call the triple

$$X = [V, \tilde{O}, f]$$

a varset. The word varset stands for variable set. If X is multidimensional, we
use boldface X. A varset inverts the mapping used for variables. That is,

The domain V is a set of values.
The codomain $\tilde{O}$ is a set of all possible ordered lists of objects.
The function f assigns to each element of V an element in $\tilde{O}$.

We invert the mapping customarily used for variables in order to simplify the
definitions of graphics algebra operations on varsets. In doing so, we also
replace the variable's set of objects with the varset's set of ordered lists.
We use lists in the codomain because it is possible for a value to be mapped to
an object more than once (as in repeated measurements).

Algebra
Given one or more varsets, we now need to operate on them to produce
combinations of variables. A typical scatterplot of a variable X against a
variable Y, for example, is built from tuples $(x_i, y_i)$ that are elements in
a set product. We use graphics algebra on values stored in varsets to make
these tuples. There are three binary operators in this algebra: cross, nest,
and blend.

Cross (∗)
Cross joins the left argument with the right to produce a set of tuples stored
in the multiple columns of the new varset:

$$
\begin{bmatrix} x \\ y \\ z \end{bmatrix} *
\begin{bmatrix} a \\ a \\ b \end{bmatrix} =
\begin{bmatrix} x & a \\ y & a \\ z & b \end{bmatrix}
$$
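
The inversion from variable to varset and the cross illustrated above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the representation used by any GoG system: a variable is assumed to be a dictionary from object identifiers to values (so each object is measured once), and the names `to_varset`, `cross`, and the identifiers `o1` to `o3` are hypothetical.

```python
from collections import defaultdict


def to_varset(variable):
    """Invert a variable (object -> value) into a varset
    (value -> ordered list of objects)."""
    varset = defaultdict(list)
    for obj, value in variable.items():
        varset[value].append(obj)
    return dict(varset)


def cross(left, right):
    """Pair the values of two variables object by object,
    giving one tuple of values per shared object."""
    return {obj: (left[obj], right[obj]) for obj in left if obj in right}


# Hypothetical data matching the cross illustration: three objects, two variables.
X = {"o1": "x", "o2": "y", "o3": "z"}
A = {"o1": "a", "o2": "a", "o3": "b"}

print(to_varset(A))            # the value 'a' maps to a list of two objects
print(to_varset(cross(X, A)))  # one tuple-valued entry per row of the crossed table
```

Because the value `a` occurs for two objects, its entry in the varset is a list of length two, which is the reason the codomain holds ordered lists rather than single objects.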


The resulting set of tuples is a subset of the product of the domains of the
two varsets. The domain of a varset produced by a cross is the product of the
separate domains. One may think of a cross as a horizontal concatenation of the
table representation of two varsets, assuming the rows of each varset are
equivalent and in the same order.

Nest (/)
Nest partitions the left argument using the values in the right:

$$
\begin{bmatrix} x \\ y \\ z \end{bmatrix} /
\begin{bmatrix} a \\ a \\ b \end{bmatrix} =
\begin{bmatrix} x & a \\ y & a \\ z & b \end{bmatrix}
$$

Although it is not required in the definition, we assume the nesting varset on
the right is categorical. The name nest comes from design-of-experiments
terminology. We often use the word within to describe its effect. For example,
if we assess schools and teachers in a district, then teachers within schools
specifies that teachers are nested within schools. Assuming each teacher in the
district teaches at only one school, we would conclude that if our data contain
two teachers with the same name at different schools, they are different
people.

Blend (+)
Blend produces a union of varsets:

$$
\begin{bmatrix} x \\ y \\ z \end{bmatrix} +
\begin{bmatrix} a \\ a \\ b \end{bmatrix} =
\begin{bmatrix} x \\ y \\ z \\ a \\ a \\ b \end{bmatrix}
$$

Blend is defined only if the order of the tuples (number of columns) in the
left and right varsets is the same. Furthermore, we need to restrict blend to
varsets with composable domains. It would make little sense to blend Age and
Weight, much less Name and Height. The Scales class, in the next section,
throws an exception if we attempt to blend varsets across different types of
scales.

Scales
Before we compute summaries (totals, means, smoothers, ...) and represent these
summaries using geometric objects (points, lines, ...), we must scale our
varsets. In producing most common charts, we do not notice this step. When we
implement log scales, however, we notice it immediately. We must log our data
before averaging logs. Even if we do not compute nonlinear transformations,
however, we need to specify a measurement model.

The measurement model determines how distance in a frame region relates to the
ranges of the variables defining that region. Measurement models are reflected
in the axes, scales, legends, and other annotations that demarcate a chart's
frame. Measurement models determine how values are represented (e.g., as
categories or magnitudes) and what the units of measurement are.

In constructing scales for statistical charts, we need to know the function
used to assign values to objects. S.S. Stevens developed a taxonomy of such
functions based on axioms of measurement.2 Stevens identified four basic scale
types: nominal, ordinal, interval, and ratio. These scales are widely cited in
introductory statistics books and in some visualization schemes.3

For graphics grammar, we employ a more restrictive classification based on
units of measurement. Unit scales permit standardization and conversion of
metrics and raise exceptions when improper blends are attempted. The
International System of Units (SI) unifies measurement under transformation
rules encapsulated in a set of base classes.4 Most of the measurements in the
SI system fit within the interval and ratio levels of Stevens' system. There
are other scales fitting Stevens' system that are not classified within the SI
system, however. These involve units such as category (state, province,
country, color, species, ...), order (rank, index), and measure (probability,
proportion, percent, ...). Also, there are some additional scales that are in
neither the Stevens nor the SI system, such as partial order.

Statistics
The statistics component receives a varset, computes statistical summaries, and
outputs another varset. In the simplest case, the statistical method is an
identity. We do this for scatterplots. Data points are input and the same data
points are output. In other cases, such as histogram binning, a varset with n
rows is input and a varset with k rows is output, where k is the number of bins
(k < n). With smoothers (regression or interpolation), a varset with n rows is
input and a varset with k rows is output, where k is the number of knots in a
mesh over which smoothed values are computed. With point summaries (means,
medians, ...), a varset with n rows is input and a varset with one row is
output. With regions (confidence intervals, ranges, ...), a varset with n rows
is input and a varset with two rows is output.


Statistical methods are members of their own object, so they are independent of
the other elements in the system. There is no necessary connection between
regression methods and curves or between confidence intervals and error bars or
between histogram binning and histograms. We can represent the same statistic
with a variety of different geometric objects.

Geometry
Geometric graphs are produced by graphing functions $F : B^m \to R^n$ that
injectively map an m-dimensional bounded region to an n-dimensional real space
and that have geometric names like line() or bar(). A geometric graph is the
image of F. Geometric graphs are not visible; they are geometric sets.

• The point() graphing function produces a geometric point, which is an
  n-tuple. This function can also produce a finite set of points, called a
  multipoint or a point cloud. The set of points produced by point() is called
  a point graph.
• The line() graphing function is a bit more complicated. Let $B^m$ be a
  bounded region in $R^m$. Consider the function $F : B^m \to R^n$, where
  n = m + 1, with the following additional properties:

  1. the image of F is bounded, and
  2. $F(x) = (v, f(v))$, where $f : B^m \to R$ and
     $v = (x_1, \dots, x_m) \in B^m$.

  If m = 1, this function maps an interval to a functional curve on a bounded
  plane; and if m = 2, it maps a bounded region to a functional surface in a
  bounded 3D space. The line() graphing function produces these graphs. Like
  point(), line() can produce a finite set of lines. A set of lines is called a
  multiline. We need this capability for representing multimodal smoothers,
  confidence intervals on regression lines, and other multifunctional lines.
• The area() graphing function produces a graph containing all points within
  the region under the line graph.
• The path() graphing function is similar to a line, but it is not ordered on
  x. A path() produces a path that connects points such that each point touches
  no more than two line segments. Thus, a path visits every point in a
  collection of points only once. If a path is closed (every point touches two
  line segments), we call it a circuit.
• The bar() or, equivalently, interval() graphing function produces a set of
  closed intervals. An interval has two ends. Ordinarily, however, bars are
  used to denote a single value through the location of one end. The other end
  is anchored at a common reference point (usually zero).
• A schema is an element that includes both general and particular features in
  order to represent a distribution. We have taken this usage from Tukey,5 who
  invented the schematic plot, which has come to be known as the box plot
  because of its physical appearance. The schema() graphing function produces a
  collection of one or more points and intervals.
• The polygon() graphing function partitions a surface or space with polygons.
  Examples are tessellations and boundary polygons in maps.
• A contour() graphing function produces contours, or level curves. A contour
  graph is used frequently in weather and topographic maps. Contours can be
  used to delineate any continuous surface.
• The edge() graphing function joins points with line segments (edges).
  Although edges join points, a point graph is not needed in a frame in order
  to make an edge.

Coordinates
Most popular charts employ Cartesian coordinates. The same real tuples in the
graphs underlying these graphics can be embedded in many other coordinate
systems, however. Examples are polar, fisheye, and geographic (spherical)
projections. The GoG coordinates object is a rich utility for radically
changing the appearance of a statistical graphic. A software system
implementing GoG can change a statistical graphic's form in one line of code.

Many peculiar names are popularly given to charts that are simple coordinate
transformations of other popular charts. Spider (radar) charts are polar
parallel coordinate plots. Rose charts are polar bar charts. Pie charts are
polar divided bar charts. Sparklines are similarity transformations of ordinary
line charts.

Aesthetics
An aesthetic is a function that maps a graph to a perceivable graphic. Seven of
the aesthetic functions in GoG are derived from Bertin's visual variables6:
position (position), size (taille), shape (forme), orientation (orientation),
brightness (valeur), color (couleur), and granularity (grain). In GoG, color
is separated into three components. Additional GoG aesthetics involve
dimensions such as blur, sound, and motion.

SOFTWARE

Wilkinson1 states that GoG is not a taxonomy or a semiotic system. It is
claimed to be the mathematical foundation of statistical graphics. It is also
claimed that statistical graphics differ fundamentally from other
visualizations such as maps, charts, diagrams, and volumetric 3D realizations.
Furthermore, GoG is claimed to be the appropriate class hierarchy for an
object-oriented graphics system capable of producing statistical graphics.

Consequently, software systems that implement GoG have classes named Algebra,
Scales, Statistics, Geometry, Coordinates and Aesthetics. These classes have
multiple methods to produce the rich variety of statistical graphics.
Implementing these classes so that they preserve orthogonality of their methods
is difficult. Testing orthogonality is easy, however. In a correctly designed
GoG system, we ought to be able to combine any method in any class with any
method in any other class to produce a meaningful graphical result. Those
implementing GoG have discovered that some of these combinations produce novel
and intriguing visualizations. And Graham Wills has demonstrated that GoG can
produce in a simple specification some of the most elaborate statistical
visualizations presented at the InfoVis conference in recent years.7

There have been several software systems based on the GoG foundation. The
earliest was the Java Graphics Production Language GPL (not to be confused with
the GNU General Public License, also abbreviated GPL) developed by Rope and
Wilkinson.8 GPL was intended to complement the Bureau of Labor Statistics'
Table Producing Language TPL. GPL was eventually marketed by Illumitek Inc.
under the name nViZn and the GPL language was incorporated into the first
edition of GoG.

Rope, Wills, Dubbs, Wilkinson, and other members of the visualization team at
SPSS rewrote nViZn to incorporate an XML interface and to refine GPL. This
system serves as the underlying graphic engine for various IBM analytic
products. The visualization team implemented a GPL interpreter that can execute
examples from the second edition of the book. The interpreter is rather
well-hidden among the many programming options inside SPSS.

The Polaris system at Stanford9 implemented a GoG architecture for producing
visualizations from a multidimensional relational database. This software
evolved into the Tableau visualization package distributed by Tableau Software
and used in the Hyperion OLAP.

Hadley Wickham has developed a GoG package called ggplot2 for the R
platform.10 It implements many of the graphics found in Ref 1 using a simple
functional language.

Finally, the Protovis project at Stanford11 is implementing most of the grammar
in a Web-based platform that uses JavaScript for specifying geometric elements.
This system makes it especially easy to insert statistical graphics into Web
pages.

CONCLUSION

The GoG is both the title of a book and a term for an object-oriented graphics
system for creating and automating statistical graphics.

REFERENCES
1. Wilkinson L. The Grammar of Graphics. 2nd ed. New York: Springer-Verlag; 2005.
2. Stevens SS. On the theory of scales of measurement. Science 1946, 103:677–680.
3. Velleman P, Wilkinson L. Nominal, ordinal, interval, and ratio typologies are misleading for classifying statistical methodology. Am Stat 1993, 47:65–72.
4. Taylor BN, Thompson A (eds). The International System of Units (SI). Gaithersburg, MD: National Institute of Standards and Technology; 2008.
5. Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley Publishing Company; 1977.
6. Bertin J. Sémiologie Graphique. Paris: Editions Gauthier-Villars; 1967.
7. Wills G, Zhang C. Grammatical visualization of statistical models. Joint Statistical Meetings, Salt Lake City, UT, 2007.
8. Wilkinson L, Rope DJ, Carr DB, Rubin MA. The language of graphics. J Comput Graph Stat 2000, 9:530–543.
9. Stolte C, Tang D, Hanrahan P. Polaris: a system for query, analysis, and visualization of multidimensional relational databases. IEEE Trans Vis Comput Graph 2002, 8:52–65.
10. Wickham H. ggplot2. New York: Springer-Verlag; 2009.
11. Bostock M, Heer J. Protovis: a graphical toolkit for visualization. Proceedings of the IEEE Information Visualization Conference, Atlantic City, NJ, 2009.
