Graphs in Stata
1 Introduction
The new graphics introduced in Stata 8 has been, by far, the most important step
forward in Stata’s graphical functionality since early releases in the mid-1980s. It is,
therefore, high time that this column turned to discuss graphics directly. I intend to
make 2004 a graphic year for Speaking Stata, starting with the basic and fundamen-
tal issue of graphing univariate distributions. Future columns are intended to discuss
graphing categorical and compositional data, comparisons, and model diagnostics. In
each case, the aim will be to provide an overview of Stata’s provision and to show ways
to go beyond what is obviously and readily available. The emphasis will be on graphics
commands of potential interest to the largest possible cross-section of Stata users. Thus
histograms clearly qualify, but justice cannot be done to details specific to analysis of
survival-time distributions.
The core commands for graphing distributions range from twoway kdensity and
its relative kdensity through twoway histogram and its relative histogram to graph
box and graph hbox. Related but perhaps less-often used commands include dotplot,
spikeplot, and those grouped as diagnostic plots.
© 2004 StataCorp LP
gr0003
N. J. Cox 67
not easily be combined with other graph types. Now we have both greater flexibility
and easier working with other types. Notable additions include the options to tune both
bin width and the start of binning, whereas previously only the number of bins could
be controlled directly. The start of binning could be controlled indirectly by tuning
xlabel() or xscale().
As every good introductory text explains, histogram construction is largely a trade-
off problem in which you seek a compromise between detail and generalization or be-
tween variance and bias. In doing this, you can tune either the number of bins or the
bin width. Theoretical discussions concentrate on the number of bins and its relation
to sample size and the kind of distribution being analyzed. However, my guess is that
people with their feet in application areas often find it natural to think in terms of a
sensible bin width for the variables they have, bearing in mind measurement issues and
the magnitude of important or interpretable differences. Whatever your preference, you
can now do it either way.
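Either route can be tried directly. The lines below are a minimal sketch, assuming the womenwage dataset used elsewhere in this column; the particular values (16 bins, or bins of width 5 starting at 0) are illustrative choices, not recommendations.

```stata
* Illustrative only: the bin count, width, and start are arbitrary choices
use http://www.stata-press.com/data/r8/womenwage.dta, clear

* tune the number of bins ...
histogram wage, bin(16)

* ... or tune the bin width and the start of binning
histogram wage, width(5) start(0)
```

Comparing the two graphs side by side is often the quickest way to see how much the impression of a distribution depends on these choices.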
age       frequency
0–4           28
5–9           46
10–15         58
16            20
17            31
18–19         64
20–24        149
25–59        316
60+          103
In this example and in other similar examples, density can only be calculated for
the open-ended class if we specify an upper limit; Altman suggests that 60+ be treated
as 60–80.
As usual in statistics, sampling variation is also an issue. If we regard the histogram
as a crude estimator of a density function, there is often a case for varying bin width to
match the structure of variation, in effect varying how we average probability density
locally.
But there is at least one other way to build a histogram in a simple, systematic way:
using as limits a set of quantiles equally spaced on a probability scale (e.g., Breiman
1973, 208–209; Scott 1992, 69–70). That way, each bar represents the same area. Un-
less our data come from something like a uniform distribution, the bin widths will
be markedly unequal, but they will reflect the character of the distribution. Breiman
points out that the associated error will be approximately a constant multiple of the
bar heights, so long as the bin frequencies are not too small.
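The area principle behind such a histogram can be checked by hand with official commands. What follows is only a sketch under stated assumptions: the variable wage is in memory, four bins are used purely for brevity, and the outer limits are taken to be the minimum and maximum. It is not the code of any particular program.

```stata
* Sketch: bin limits and bar heights for a 4-bin equal-probability histogram
* (assumes wage is in memory; 4 bins chosen only to keep the output short)
summarize wage, meanonly
local lower = r(min)
local top   = r(max)

* _pctile leaves the three quartiles in r(r1), r(r2), r(r3)
_pctile wage, nquantiles(4)
forvalues i = 1/3 {
    local upper = r(r`i')
    * each bar has area 1/4, so height = 0.25 / width
    display "bin `i': lower = " %9.0g `lower' "  upper = " %9.0g `upper' ///
        "  height = " %9.0g 0.25 / (`upper' - `lower')
    local lower = `upper'
}
display "bin 4: lower = " %9.0g `lower' "  upper = " %9.0g `top' ///
    "  height = " %9.0g 0.25 / (`top' - `lower')
```

If two adjacent limits are tied, the width is zero and the height is displayed as missing, which is a reminder that the method needs distinct quantiles.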
A related problem is choice of class intervals for a chi-squared test of goodness of fit.
Mann and Wald (1942) and Gumbel (1943) urged the merits of choosing classes with
equal expected frequencies. That is a simple and definite procedure, which can reduce
difficulties arising from low expected frequencies, although data must arrive ungrouped
and there may be some loss of sensitivity in the tails of a distribution. Without getting
into a wider discussion of the merits of different tests of fit or of tests compared with
graphical analysis, it is clear that the equal probability idea is a natural one.
What can be done in Stata? Start with the messier problem in which the data arrive
grouped. Much can be done once you know about an undocumented feature of twoway
bar. We need to enter the lower bin limits and the bin frequencies and one final upper
limit as data. For Altman’s example, we need to enter data to get
       age   freq
  1.     0     28
  2.     5     46
  3.    10     58
  4.    16     20
  5.    17     31
  6.    18     64
  7.    20    149
  8.    25    316
  9.    60    103
 10.    80      .
If you want frequency density rather than probability density, you should omit scaling
by the sample size (here 815).
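The density variable used in the graph command can be computed from the listed data along the following lines. This is a minimal sketch, assuming the data are entered exactly as listed above with variables age and freq; the name freqdensity is my own and purely illustrative.

```stata
* Enter lower bin limits, frequencies, and one final upper limit
clear
input age freq
 0  28
 5  46
10  58
16  20
17  31
18  64
20 149
25 316
60 103
80   .
end

* bin width is the gap to the next lower limit; r(sum) is the sample size (815)
summarize freq, meanonly
gen density = freq / (r(sum) * (age[_n+1] - age))

* frequency density: omit the scaling by sample size
gen freqdensity = freq / (age[_n+1] - age)

list
```

The last observation has missing density, as it serves only to supply the upper limit of the final bin.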
Finally, we can draw the graph, shown in figure 1:
. twoway bar density age, bartype(spanning) bstyle(histogram)
[Figure 1: spanning-bar histogram. y axis: density; x axis: age.]
The “spanning” extends bars to the right until they are curtailed; that is why it is
necessary to specify all lower limits and one upper limit for the graph. The data should
also be in the correct sort order, as in this example. The option bstyle(histogram) is
not compulsory, and you might like to check other possibilities. You might need to add
the option yscale(range(0)) if twoway bar does not automatically start bars at 0.
Turning to the more elegant problem, a user-written program for equal-probability
histograms can be described and, if desired, downloaded from the Statistical Software
Components (SSC) archive by using the ssc command; see [R] ssc:
. use http://www.stata-press.com/data/r8/womenwage.dta
. eqprhistogram wage, bin(10) plot(kdensity wage, biweight width(5))
> legend(ring(0) position(1) column(1))
[Figure: equal-probability histogram of wage with superimposed kernel density estimate. y axis: Density/kdensity wage; x axis: wages in 1000s of dollars; legend: Density, kdensity wage.]
The bin limits are the deciles, so each bar represents 1/10 of the total probability in
the distribution. Note that you can superimpose a density estimate.
Although it may seem a curiosity, the equal-probability histogram has some ped-
agogic merit. First, it underlines the area principle on which histograms are based.
Wider bars are necessarily shorter and narrow bars necessarily taller. Second, it allows
a link to be made between histograms and quantile-based methods such as box plots.
Arguably, in some datasets it gives a better view of the tails than do the corresponding
box plots, especially if within those box plots no values are flagged beyond the quartiles,
and so no details are given on structure within the tails. (Box plots are especially poor
for U-shaped distributions. In some such cases, no values are identified beyond the
quartiles, and the box plot reduces to a long box and two short tails. Even experienced
people can misread this as indicating a unimodal distribution, forgetting that if half the
values lie inside the box, then, necessarily, half lie outside it.)
An equal-probability histogram is not suitable for all distributions. Given categor-
ical, discrete, or highly rounded data, quantiles may be tied, especially if the num-
ber of bins is large relative to the sample size. If the specified quantiles are tied,
eqprhistogram refuses to draw the graph. A technical aside: whenever it does this,
the exit code is 0. This is in part a diplomatic acknowledgment that inability to draw
the graph is either a feature of the data or a limitation of the method, rather than a
user error. In addition, it implies that a loop through equal-probability histograms of
different variables or groups will not fail merely because a particular graph is impossible.
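Whether quantiles are tied can also be checked in advance with official commands. The following is a sketch only, assuming wage is in memory and that 10 bins are wanted; it is not necessarily the same check that eqprhistogram itself carries out.

```stata
* Sketch: flag tied deciles of wage before attempting the graph
* _pctile leaves the nine deciles in r(r1), ..., r(r9)
_pctile wage, nquantiles(10)

local tied 0
forvalues i = 2/9 {
    local j = `i' - 1
    * adjacent deciles equal means at least one bin would have zero width
    if r(r`i') == r(r`j') {
        local tied 1
    }
}

if `tied' {
    display "some deciles are tied"
}
else {
    display "deciles are distinct"
}
```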
we see that with these choices density varies up to about 0.07 per 1,000 dollars. That
leads to a decision to put the rug at about −0.003 on that scale. We need a variable to
hold this value:
. gen where = -0.003
In practice, we can just choose a trial value and then use replace to improve upon
it. Next, there is no pipe symbol in the symbolstyle portfolio, so we must enlist the
pipe character as a marker label. Then, the rug is just a scatterplot of where against
wage, suppressing the default marker symbol and placing the marker label exactly on
target:
. gen pipe = "|"
. histogram wage, start(0) width(5)
> plot(scatter where wage, ms(none) mlabel(pipe) mlabpos(0)) legend(off)
[Graph: histogram with rug of pipe symbols below the bars. y axis: Density; x axis: wages in 1000s of dollars.]
Figure 3: Example of histogram with rug showing distinct values occurring in the data.
In this case, as shown in figure 3, the rug shows rounding of the data, and a tabulation
makes it explicit that all values are just multiples of $1,000. In general, a rug is a useful
but restrained way of showing some of the fine structure of a distribution.
A rug will take up a lot of bytes in a graph file if any point symbol stands for
many repeated values. Clearly, it is unnecessary to overwrite each symbol repeatedly.
A solution is to select each distinct value just once. There are two systematic ways
to do this, to select the first in each group after sorting or to select the last, and it is
immaterial here which you use, so you might as well go
. bysort wage: gen tag = _n == 1
A canned near-equivalent is
. egen tag = tag(wage)
The difference is that the egen call sorts your data while doing the calculation but then
returns it to its original sort order, which may differ. The first method may change your
sort order. Having done this, we select points for the rug as if tag.
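Putting the pieces together, the rug restricted to distinct values might be drawn as follows. This is a sketch that assumes the variables where and pipe have already been created as shown earlier.

```stata
* Assumes wage, where, and pipe exist as created earlier in this section
egen tag = tag(wage)

* plot() is the option name as used in this column (Stata 8);
* later releases document the same option as addplot()
histogram wage, start(0) width(5)                                      ///
    plot(scatter where wage if tag, ms(none) mlabel(pipe) mlabpos(0))  ///
    legend(off)
```

The graph should look the same as before, but each distinct value is now drawn only once.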
[Graph: histogram with horizontal bars. Axis: mean elevation, m (2000 to 7000).]
Figure 4: Example of histogram with horizontal bars. In this case, the response variable
is altitude.
To get fractions and percents, we must be careful to count each value just once:
. by lower: gen double sum = frequency * (_n == 1)
. replace sum = sum(sum)
. gen fraction = frequency / sum[_N]
. gen percent = 100 * fraction
Real calculations get messier once you build in selections if or in, subdivision into
groups defined by other variables, or missing values. The messy details are coded up in
an egen function density() in the egenmore package on SSC.
have been introduced by StataCorp in its earlier guise as Computing Resource Center
(1985). Wild and Seber (2000) show many interesting examples of oneway plots.
dotplot, by default, offers, as far as possible, a point symbol for every value and
some binning. Binning can be controlled rather indirectly, although in practice, the
default is usually adequate, and when desired, the binning can be switched off with the
nogroup option. The main virtues of dotplots lie in their ability to show some features
that might otherwise be obscured by a series of touching bars, especially granularity and
details of outliers or other extreme values in the tails. You can also show, for example,
median and quartiles by horizontal marks and thus hybridize box plots and histograms.
The considerable flexibility of histogram, spikeplot, and dotplot might seem to
leave few important gaps in their territory. Nevertheless, onewayplot was written to
provide some extra possibilities in this area; it also may be downloaded using ssc. As
mentioned earlier, graph, oneway did not survive as such into Stata 8, although the
minor trickery needed to add rugs is just one illustration of how they can be emulated
fairly easily. onewayplot is essentially a convenience command that bundles together
various easy but tedious handles for making your own oneway plots. You can choose
between horizontal and vertical layouts, while stack and center options produce a variant
on dotplot [footnote 1]. There is, by default, no binning of data; binning may be accomplished
with the width() option.
In figure 5, we show the results of a onewayplot using the handle of a regional
classification to split the glacier elevations. Both histogram and dotplot struggle
given 18 regions, some with fairly long names.
. onewayplot mean_elev, by(region) ytitle("") stack ms(oh) msize(tiny) width(20)
[Footnote 1] You can also type centre. An undocumented feature of dotplot is that centre is allowed as well
as center. This is a convenience for speakers of languages, such as English, which use that spelling,
and is emulated by onewayplot.
[Figure 5: oneway plot of mean_elev by region. Regions shown: Qilian | Kodar | E Sayan | NE Altai | NW Altai | S Altai | Saur | Bogdo & Karlik | Iren Horu | Dzhungar | E Tien; Narat Tarliskay | C Tien; Pobedy | Zaili Kungey | Naryn Tien; Terskei Kokshaal | NW Tien; Chatkal Talas Kyrgyz | Alay Tien | SW Tien; Turkestan Hissar | Pamir.]
4 Kernel-density estimation
4.1 Available commands
The histogram of a continuous variable is, from one point of view, an estimator of
the density function of that variable. Clearly the set of bins used to compute that
estimate imposes discontinuities on the estimate, which leads us directly to consider
smoother estimates, especially those based on convolution of the data and a sym-
metric kernel. twoway kdensity and kdensity are provided in official Stata as basic
commands. Recently, users have added variable kernel density estimation commands
(Salgado-Ugarte and Pérez-Hernández 2003; Van Kerm 2003).
Some simple devices extend the range of applications of Stata’s official commands for
kernel-density estimation. First is the idea of estimating the density function on a trans-
formed scale and then back-transforming the estimate to one for the raw scale. Two of
the most natural transformations here, as elsewhere, are logarithms for positive variables
and logit-like transformations for proportions and other data measured on some interval
(a, b). The underlying general principle is that, for a continuous monotone transformation
t(x), the densities f(x) and f{t(x)} are related by f(x) = f{t(x)} |dt/dx|. This
procedure is mentioned briefly by Silverman (1986, 27–30), although his worked exam-
ple (page 28) is not very encouraging. Good expositions are given by Wand and Jones
(1995, 43–45), Simonoff (1996, 61–64), and Bowman and Azzalini (1997, 14–16).
With a logarithmic transformation of x, we have
    estimate of f(x) = {estimate of f(log x)} (1/x)
given that d/dx(log x) = 1/x. Note in particular, if data are right skewed, that the
result of this transformation is more smoothing in the tail and less near the main part
of the distribution than in the default method. I have found this to be one of the
most valuable ways of going beyond the default. It fits very well both the common
finding that positive variables are right-skewed, suggesting a transformation, such as
the logarithm, and the common attitude that results on the original scale are of direct
scientific or practical interest. To put it another way, the transformation behaves more
like a link function than a classical transformation, given that end results are on the
scale of the original response. You can get the best of both worlds.
Returning to the wage data, here is an illustrative (and certainly not definitive)
example, in which we just use default kernel and width choice.
. gen logwage = log(wage)
. kdensity logwage, at(logwage) generate(densitylog)
. gen density = densitylog/wage
. levels wage, local(levels)
. line density wage, sort xtick(`levels', tposition(inside))
[Figure 6: line plot of the back-transformed density estimate. y axis: density; x axis: wages in 1000s of dollars.]
The density function, shown in figure 6, is much smoother in the tails than the
equivalent default, which is not shown here. However, the step in the left-hand tail needs
investigation: is this some odd artifact or a genuine feature of the data? Incidentally,
another technique is used to show a rug by picking up a list of distinct values from
levels (added to Stata on 16 April 2003). However, this technique is not as general as
that previously illustrated, as it hinges on the variable concerned having only integer
values. levels is not designed to work with noninteger numeric values.
Similarly with a logit-like transformation,
    estimate of f(x) = {estimate of f(logit x)} (b − a)/{(x − a)(b − x)}
where logit x = log{(x − a)/(b − x)}, a slight generalization of the usual definition, for
which a = 0, b = 1. Note that d/dx(logit x) = (b − a)/{(x − a)(b − x)}.
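The logit-like case can be coded in the same way as the logarithmic example. What follows is a sketch only, assuming a hypothetical variable p that lies strictly inside (a, b), here with a = 0 and b = 1, and again using the default kernel and width.

```stata
* Hypothetical variable p lying strictly inside (a, b); here a = 0, b = 1
local a = 0
local b = 1

* estimate the density on the logit-like scale at the observed values
gen logitp = log((p - `a') / (`b' - p))
kdensity logitp, at(logitp) generate(densitylogit)

* back-transform: multiply by d/dx(logit x) = (b - a)/{(x - a)(b - x)}
gen densityp = densitylogit * (`b' - `a') / ((p - `a') * (`b' - p))

line densityp p, sort
```

Values of p exactly equal to a or b yield missing values on the transformed scale and so drop out of the estimate.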
[Graph: x axis: log wage (1 to 5).]
Figure 7: Example of plot using log density. The parabola shows a normal density
function with the same mean and standard deviation as log wage.
The results in figure 7 suggest a good but not spectacularly good fit to a lognor-
mal. The slightly fat tails seem suggestive. At the same time, the density estimates,
especially in the tails, are, as always, subject to sampling variation and sensitive to
kernel bandwidth; note also that neighboring density estimates are necessarily highly
dependent.
What is implemented above is just a first stab. Hazelton (2003) suggests various
refinements, including robust estimation of the mean and standard deviation and, given
a density estimation bandwidth h, fitting a normal with variance sd² + h² to correct
for side-effects of using a kernel.
There would also be some advantages to a square-root scale, given that densities behave
a bit like counts, for which a root transformation is often the first to be tried. Also, the
square root of a Gaussian shape is another Gaussian shape. So we can have our cake
and eat it too: hunt for a Gaussian yet benefit from stabilized sampling fluctuation.
Check the assertion with
. twoway function sqrt(normden(x)), range(-4 4)
There is a root option in spikeplot for a similar reason. Tukey (1977, chapter 17)
worked through a bundle of related ideas, which seem to have been little explored since.
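The assertion can also be checked algebraically: the square root of exp(−x²/2)/√(2π) is (2π)^(−1/4) exp(−x²/4), which is a Gaussian shape with standard deviation √2. A numerical check and an overlay might go as follows; the evaluation point 1.5 is arbitrary.

```stata
* difference should be zero, apart from rounding error
display sqrt(normden(1.5)) - (2 * _pi)^(-1/4) * exp(-(1.5^2) / 4)

* overlay the two curves; they should coincide
twoway function sqrt(normden(x)), range(-4 4)                      ///
    || function (2 * _pi)^(-1/4) * exp(-(x^2) / 4), range(-4 4)
```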
Intensities, too
Those interested in data on events, considered as the result of a point process in one
dimension (most obviously, time or space), should note that Stata’s kernel-density com-
mands can readily be used to estimate the intensity function (say, frequency per unit
of time or space). Suppose that a variable contains dates of earthquakes, eruptions,
strikes, honors for a sports team, or whatever else is of interest. To get results on an
intensity scale, just multiply ‘density’ by the number of observed data points. A key
detail is that intensities will be smoothed beyond the beginning and the end of the
interval in question; whether this is tolerable or further surgery is desired is a question
for the user.
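As a sketch of the calculation, assume a hypothetical numeric variable eventdate holding one event time per observation; the names when, dens, and intensity are likewise my own, and default kernel and width are used.

```stata
* Hypothetical variable eventdate: one event time per observation
* store the grid of evaluation points and the density estimate
kdensity eventdate, generate(when dens) nograph

* multiply density by the number of observed events to get intensity
count if eventdate < .
gen intensity = dens * r(N)

line intensity when
```

The resulting curve is in units of events per unit of eventdate, and it will extend a little beyond the first and last events, as noted above.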
In addition, qplot has support for choice of a in a general rule for plotting position
(i − a)/(n − 2a + 1) for i = 1, . . . , n. The default is a = 0.5, giving (i − 0.5)/n. Other
choices include a = 0, giving i/(n + 1), and a = 1/3, giving (i − 1/3)/(n + 1/3). The
choice is often immaterial, but some authorities have strong opinions on the best choice
on various grounds, some even statistical. For more discussion and references, see Cox
(1999b).
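The general rule is easy to apply by hand. The following is a sketch, not the code of qplot, and it assumes a hypothetical variable y with no missing values.

```stata
* Hypothetical variable y, assumed to have no missing values
local a = 0.5

* after sorting, _n is the rank i and _N is the sample size n
sort y
gen pp = (_n - `a') / (_N - 2 * `a' + 1)

* quantile plot: ordered values against plotting positions
scatter y pp
```

Changing the local macro a to 0 or 1/3 gives the other rules mentioned above.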
6 Skewness plots
The skewness of a variable is often of interest, perhaps especially as an indicator of
potential problems in subsequent analysis. Commonly a single measure is used, whether
the moment-based measure produced by summarize, detail or other measures (which,
in most cases, are readily calculated from the output of summarize). Graphically,
skewness may be assessed with varying degrees of ease and efficiency from the plots
mentioned so far, but there is also a case for a customized design.
Various possibilities are based on the quantiles (Gnanadesikan 1977, 1997). The
quantiles may be paired as lower and upper quantiles x(1) and x(n), x(2) and x(n−1),
etc., and a median may be calculated in the usual way.
Stata supports symplot, a plot of (upper quantile − median) versus (median − lower
quantile), for which the reference situation of symmetry or lack of skew plots as a line of
equality. See [R] diagnostic plots. However, symplot will show only a single group of
data and thus cannot be used for comparisons, while a plot with a sloping reference line
is more difficult to deal with than the plot now to be described, which has horizontal
reference lines.
skewplot produces, by default, a plot of the midsummary versus the spread for the
variables supplied, also known as the mid-versus-spread plot. With the skew option,
it produces a plot of the skewness function versus the spread function. Such plots
convey both the general character and the fine structure of the symmetry or skewness of
datasets and can be used to compare distributions or to assess whether transformations
are necessary or effective.
There are some little-used terms here, so we need a few definitions. In a perfectly
symmetric set of data, the midsummaries (x(1) + x(n))/2, (x(2) + x(n−1))/2, etc., would
all be identical and equal to the median. A plot of each midsummary (or mean of lower
and upper quantiles) (x(i) + x(n−i+1))/2 versus each difference or spread of lower and
upper quantiles x(n−i+1) − x(i) would, thus, yield a horizontal straight line. Conversely,
skewness in sets of data will be reflected by departures from horizontality. In particular,
right skewness would be shown by rising lines and left skewness by falling lines.
Apart from the divisor of 2, this plot was suggested by J. W. Tukey (Wilk and
Gnanadesikan 1968). See also Gnanadesikan (1977, 1997, chapter 6.2) or Fisher (1983).
The form used here and the name mid-versus-spread plot are found in Hoaglin (1985).
It is usual to plot only that half of the sample results for which spread is ≥ 0.
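These definitions translate directly into a few lines of Stata. This is a sketch of the calculation, not the code of skewplot, and it assumes a hypothetical variable y with no missing values.

```stata
* Hypothetical variable y, assumed to have no missing values
sort y

* after sorting, y[_n] is x(i) and y[_N - _n + 1] is x(n-i+1)
gen mid    = (y + y[_N - _n + 1]) / 2
gen spread = y[_N - _n + 1] - y

* plot only the half of the results for which spread is >= 0
scatter mid spread if spread >= 0
```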
The skew option produces an alternative form promoted by Benjamini and Krieger
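(1996, 1999). Consider the identity, which introduces their terminology,
    x(i) = median + {(x(i) + x(n−i+1))/2 − median} − {x(n−i+1) − x(i)}/2
         = median + skewness function − spread function
for x(i) in the lower half of a sample. This leads to a plot of the skewness function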
versus the spread function, known as the skewness versus spread plot. Note that the
skewness function is (midsummary − median) and so will be constant and zero for a
perfectly symmetric distribution and that the spread function is half the spread of the
mid versus spread plot. In short, the skew option does not change the configuration of
the plot but merely the labeling of the axes.
In addition, the ratio of the skewness and spread functions or
    {x(i) + x(n−i+1) − 2 × median} / {x(n−i+1) − x(i)}
is a measure of skewness (in the traditional sense) originally suggested for quartiles by
Bowley (1902) and generalized to this form by David and Johnson (1956). Another
incarnation is as the p-skewness index (Gilchrist 2000, 54, 72) [footnote 3]. It varies between −1
and 1. A similar general measure was used by Parzen (1979). Graphically this measure
is the slope of the line connecting (0, 0) and each data point if the skew option is used.
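For the special case of quartiles, the measure can be computed from the saved results of summarize, detail. A minimal sketch, using wage purely as an example variable:

```stata
* quartile-based skewness: (upper quartile + lower quartile - 2 * median)
* divided by (upper quartile - lower quartile)
summarize wage, detail
display "quartile skewness = " ///
    (r(p75) + r(p25) - 2 * r(p50)) / (r(p75) - r(p25))
```

A result near 0 indicates quartiles placed symmetrically about the median; positive values indicate right skewness in the middle half of the data.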
See Benjamini and Krieger (1996, 1999) and Groeneveld (1998) for concise reviews
tracing such ideas from late 19th-century antecedents to recent work and further details
on the interpretation of the skewness-versus-spread plot.
Let us close with an example for data on 158 glacial cirques from the English Lake
District (Evans and Cox 1995). Glacial cirques are hollows excavated by glaciers that
are open downstream, bounded upstream by the crest of a steep slope (wall), and
arcuate in plan around a more gently sloping floor. More informally, they are sometimes
described as “armchair-shaped”. Glacial cirques are common in mountain areas that
have or have had glaciers present. Three among many possible measurements of their
size are length, width, and wall height, and the distribution of all in the area studied is
shown by
. skewplot length width wall_height, legend(ring(0) position(5) column(1))
[Footnote 3] Gilchrist calls the special case for quartiles Galton’s skewness (pages 8, 25, 53, and 72), but there
[Graph: y axis: Midsummary (200 to 1000); legend: length, m; width, m; wall height, m.]
Figure 8: Skewness plot for three variables. The systematic upward drift indicates
marked right skewness.
[Graph: skewness plot for log 10 length in m, log 10 width in m, and log 10 wall height in m. x axis: Spread (0 to 1).]
7 Conclusions
With one command or another, users can now plot univariate distributions in many
different ways. You can choose between several depictions of the density function or
several depictions of the distribution function or its inverse, the quantile function. You
can choose discrete or continuous representations and vertical or horizontal alignments.
Less obviously, it is straightforward to add details (for example, rugs of distinct data
values) or exploit the inbuilt flexibility of graph (for example, by looking at density
estimates on a log scale or by constructing your own histogram with varying bin width).
The theme of distributions will continue into the next column but with a focus on
categorical data. Distributions of categorical variables may be shown in a variety of
displays: the survey will range from old staples to less well-known plots, with emphasis
on the important special cases of graded data and of three variables with constant sum.
8 Acknowledgments
Ian Evans provided the Lake District cirques data, pointed me to the World Glacier
Inventory data, and participated in many discussions on statistical graphics over more
than 30 years. Marcello Pagano persistently urged the merits of equal-probability his-
tograms and parenthetically underlined the connection with chi-squared tests. Vince
Wiggins specifically alerted me to spanning bars and generally advised on strategies and
tactics for using the new graphics. Elizabeth Allred, Ronán Conroy, Philip Ender, and
Roger Harbord made helpful comments during development of various predecessors or
versions of some programs discussed here. Richard Groeneveld kindly tracked down the
Bowley reference.
9 References
Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman &
Hall.
Bagnold, R. A. 1941. The Physics of Blown Sand and Desert Dunes. London: Methuen.
Bagnold, R. A. 1990. Sand, Wind, and War: Memoirs of a Desert Explorer. Tucson: University of
Arizona Press.
Bardou, F., J.-P. Bouchaud, A. Aspect, and C. Cohen-Tannoudji. 2002. Lévy Statistics
and Laser Cooling: How Rare Events Bring Atoms to Rest. Cambridge: Cambridge
University Press.
Benjamini, Y. and A. M. Krieger. 1996. Concepts and measures for skewness with
data-analytic implications. Canadian Journal of Statistics 24: 131–140.
Bowman, A. W. and A. Azzalini. 1997. Applied Smoothing Techniques for Data Anal-
ysis: The Kernel Approach with S-Plus Applications. Oxford: Oxford University
Press.
Computing Resource Center. 1985. STATA/Graphics User’s Guide. Los Angeles, CA:
Computing Resource Center.
Cox, N. J. 1999a. gr41: Distribution function plots. Stata Technical Bulletin 51: 12–
16. In Stata Technical Bulletin Reprints, vol. 9, 108–112. College Station, TX: Stata
Press.
Cox, N. J. 1999b. gr42: Quantile plots, generalized. Stata Technical Bulletin 51: 16–18. In
Stata Technical Bulletin Reprints, vol. 9, 113–116. College Station, TX: Stata Press.
Cox, N. J. 2001. gr42.1: Quantile plots, generalized: update to Stata 7.0. Stata Technical
Bulletin 61: 10–11. In Stata Technical Bulletin Reprints, vol. 10, 55–56. College
Station, TX: Stata Press.
Cox, N. J. 2003a. Software update: gr41_1: Distribution function plots. Stata Journal 3(2):
211.
Cox, N. J. 2003b. Software update: gr41_2: Distribution function plots. Stata Journal 3(4):
449.
Cox, N. J. 2003c. Stata tip 2: Building with floors and ceilings. Stata Journal 3(4): 446–447.
Cox, N. J. 2004. Software update: gr42_2: Quantile plots, generalized. Stata Journal 4(1): 97.
David, F. N. and N. L. Johnson. 1956. Some tests of significance with ordered variables.
Journal of the Royal Statistical Society, Series B 18: 1–20.
Evans, I. S. and N. J. Cox. 1995. The form of glacial cirques in the English Lake District,
Cumbria. Zeitschrift für Geomorphologie 39: 175–202.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Mono-
graphs on Statistics and Applied Probability, London: Chapman & Hall.
Thorne, C. R., R. C. MacArthur, and J. B. Bradley, ed. 1988. The Physics of Sediment
Transport by Wind and Water: A Collection of Hallmark Papers by R. A. Bagnold.
New York: American Society of Civil Engineers.
Tufte, E. R. 1997. Visual Explanations: Images and Quantities, Evidence and Narrative.
Cheshire, CT: Graphics Press.
Van Kerm, P. 2003. Adaptive kernel density estimation. Stata Journal 3(2): 148–156.
Wand, M. P. and M. C. Jones. 1995. Kernel Smoothing. London: Chapman & Hall.
Wild, C. J. and G. Seber. 2000. Chance Encounters: A First Course in Data Analysis
and Inference. New York: John Wiley & Sons.
Wilk, M. B. and R. Gnanadesikan. 1968. Probability plotting methods for the analysis
of data. Biometrika 55: 1–17.