Data inspection using biplots
[Editor's note: This article was received before Stata 9 was announced. Stata 9 has
a biplot command, so the command documented here is named biplot8. biplot8
has some features not found in Stata 9 biplot (and vice versa). Additionally, the
exposition here acts as a helpful supplement to the Stata 9 biplot manual entry.]
1 Introduction
Biplots are projections of multivariate datasets that show three quantities of a
data matrix: the variance–covariance structure of the variables, the Euclidean
distances between the observations, and the values of the observations on the
variables.
They are helpful for revealing clustering, multicollinearity, and multivariate outliers
of a dataset, and they can also be used to guide the interpretation of principal component
analyses (PCA).
Biplots were first described thoroughly by Gabriel (1971) and were extended more
recently in a monograph by Gower and Hand (1996). They are heavily used in the
context of principal component analysis (Jolliffe 2002, 90–107) but are also useful as a tool
for data inspection in the context of statistical modeling. As a projection technique,
they share similarities with many other projection techniques, such as multidimensional
scaling (Kruskal and Wish 1978), principal coordinate analysis (Fenty 2004), and cor-
respondence analysis (Blasius and Greenacre 1998).1
1 A discussion of the relative merits of several projection techniques can be found in
© 2005 StataCorp LP gr0011
U. Kohler and M. Luniak 209
2 Interpretation
Biplots consist of lines and dots. Lines are used to reflect the variables of the dataset,
and dots are used to show the observations. An example biplot is shown in figure 1,
which uses a dataset from Hamilton (1992, 268). The observations of this dataset are
planets, and the variables are their physical characteristics, for example, the mass, the
number of moons, and the distance from the sun. With the exception of a dummy
variable for the presence of rings, all variables are measured on a logarithmic scale.
[Figure 1. Biplot of the planets dataset. Horizontal axis: DIM 1 (82% of Var); vertical axis: DIM 2 (16% of Var). Variable lines: logdist, logmoons, rings, lograd, logdens, logmass. Observations: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto.]
In a biplot, the length of the lines approximates the variances of the variables. The
longer the line, the higher the variance. Judging from figure 1, the logarithmic mass of
the planets (logmass) has by far the highest variance among the variables in the biplot,
while the dummy variable for the presence of rings (rings) has the lowest.
The angle between the lines, or, to be more precise, the cosine of the angle between
the lines, approximates the correlation between the variables they represent. The closer
the angle is to 90 or 270 degrees, the smaller the correlation. An angle of 0 or 180
degrees reflects a correlation of 1 or −1, respectively. The biplot in figure 1 shows a
strong relationship between the ring dummy and the number of moons (logmoons),
and a weak relationship between the mass and distance from the sun (logdist). The
correlation between the density and each of the other variables is negative.
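As a quick numerical illustration of the cosine rule, the following sketch converts a few angles between two variable lines into the correlations they approximate. The angles are arbitrary values chosen for illustration; they are not measured from figure 1.

```stata
* Correlation implied by the angle (in degrees) between two variable lines.
* The angles are illustrative assumptions, not measurements from figure 1.
foreach angle of numlist 0 60 90 120 180 {
    display "angle = `angle' degrees -> approximate correlation = " %5.2f cos(`angle'*_pi/180)
}
```

Angles below 90 degrees give positive correlations, and angles between 90 and 180 degrees give negative ones.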
[Figure: biplot with variable lines price, weight, and displacement. Horizontal axis: DIM 1 (75% of Var). Legend: Domestic, Foreign.]
This interpretation of the biplot is similar to the interpretation of the plot of the PCA
coefficients, which is a common way to plot the results of a PCA (Tabachnik and Fidell
1989, 637–638). Like the principal component score plot, the plot of PCA coefficients
can be regarded as a special case of a biplot.
3 Mathematical background
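Let Y be an n × k matrix holding the data. You can decompose Y with a singular value
decomposition (SVD) into

$$\mathbf{Y} = \mathbf{U}\mathbf{L}\mathbf{V}'$$

The coordinates of the observations, G, and of the variables, H, are then defined as

$$\mathbf{G} = \mathbf{U}\mathbf{L}^{c} \tag{1}$$

$$\mathbf{H}' = \mathbf{L}^{1-c}\mathbf{V}' \tag{2}$$

In (1) and (2), the scalar c can take any value between zero and one. Regardless of
the value of c, the equation

$$\mathbf{G}\mathbf{H}' = \mathbf{U}\mathbf{L}^{c}\mathbf{L}^{1-c}\mathbf{V}' = \mathbf{U}\mathbf{L}\mathbf{V}' = \mathbf{Y}$$

holds.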
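The factorization can be checked numerically. The following is a minimal sketch, not the code used by biplot8: it assumes the auto dataset shipped with Stata, an arbitrary choice of three variables, the first 20 observations (to stay well within any matrix size limit), and c = 0.5. Note that matrix svd returns the singular values as a row vector (called w here), so the diagonal matrices are built element by element.

```stata
* Sketch: check that G*H' reproduces Y for one arbitrary value of c.
* Dataset, variables, and c = 0.5 are illustrative assumptions.
sysuse auto, clear
mkmat price weight displacement in 1/20, matrix(Y)

* U is 20 x 3, w is a 1 x 3 row vector of singular values, V is 3 x 3
matrix svd U w V = Y

local c = 0.5
local k = colsof(w)
matrix Lc  = J(`k', `k', 0)
matrix L1c = J(`k', `k', 0)
forvalues j = 1/`k' {
    * raise each singular value to the power c and 1 - c
    scalar sv_c  = w[1, `j']^`c'
    scalar sv_1c = w[1, `j']^(1 - `c')
    matrix Lc[`j', `j']  = sv_c
    matrix L1c[`j', `j'] = sv_1c
}

* coordinates of the observations (G) and of the variables (H)
matrix G = U*Lc
matrix H = V*L1c
matrix R = G*H'

display "relative difference between GH' and Y: " mreldif(R, Y)
```

The displayed relative difference should be zero up to rounding error, whatever value between zero and one is chosen for c.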
4 Computational issues
The Stata command to calculate a singular value decomposition is
. matrix svd U L V = Y
where Y is the name of the matrix to be decomposed and U, L, and V are
arbitrary names for the resulting matrices of the SVD. To calculate the coordinates of
the biplot, this command requires that the complete data matrix be stored in Y. The
maximum dimension of a single matrix in Intercooled Stata is 800 × 800. In Intercooled
Stata, the SVD of a data matrix can therefore be done only for datasets with up to
800 observations. In Stata/SE, this limit is raised to 11,000 observations. Given that
there is no general maximum number of observations in Stata, this limit on the number
of observations to be used in a biplot is restrictive.2
2 In Stata 9, these limitations can be circumvented using Mata; see the Mata Reference Manual for
details.
In the case of the JK biplot (c = 1), the restriction can be circumvented. As Jolliffe
(2002, 94–95) shows, the elements in G are equal to the respective values of the ob-
servations on the principal components. Accordingly, the elements in H are equal to
the coefficients (loadings) of a PCA. Therefore, the coordinates of the JK biplot can be
easily calculated from a PCA, bypassing the calculation of the SVD. To this extent, the
biplot with c = 1 is nothing new since the component score plot and the plot of PCA
coefficients are widely used on their own. The superimposing of both plots, however,
gives additional information.
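The PCA route to the JK coordinates can be sketched as follows. This is an illustration under stated assumptions, not the implementation of biplot8: it assumes Stata 9 or later (where pca is followed by predict), uses the auto dataset and three arbitrarily chosen variables, and relies on the default correlation-based PCA, which corresponds to a standardized data matrix.

```stata
* Sketch (assumes Stata 9 or later): JK coordinates from a PCA, without an SVD.
* Dataset and variables are illustrative assumptions.
sysuse auto, clear
pca price weight displacement, components(2)

* coordinates of the observations: scores on the first two components
predict double g1 g2, score

* coordinates of the variables: the PCA coefficients (eigenvectors)
matrix H = e(L)
matrix list H

* the variance of each score equals the corresponding eigenvalue
matrix Ev = e(Ev)
summarize g1
display "variance of g1 = " r(Var) ", first eigenvalue = " Ev[1,1]
```

Because no data matrix is ever stored with mkmat, this route is not affected by the matrix size limits described above.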
The possibility that you can calculate the plot coordinates by means of a PCA for
the JK biplot raises the question whether this is also possible for the other biplot types.
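In fact, the coordinates of the JK biplot (c = 1) and the GH biplot (c = 0) are closely
related. It follows from the definition of both biplots and from (1) and (2) that

$$\mathbf{G}_{JK} = \mathbf{G}_{GH}\mathbf{L}, \qquad \mathbf{H}_{JK} = \mathbf{H}_{GH}\mathbf{L}^{-1} \tag{3}$$

Therefore, the coordinates of the JK biplot can be transformed into the coordinates
of the GH biplot with

$$\mathbf{G}_{GH} = \mathbf{G}_{JK}\mathbf{L}^{-1}, \qquad \mathbf{H}_{GH} = \mathbf{H}_{JK}\mathbf{L} \tag{4}$$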
The SVD, however, is still needed to calculate L. At the same time, it is possible to
calculate the eigenvalues in L by transforming the eigenvalues of a PCA ($\mathbf{L}_{JK}$) as shown
below:3

$$\mathbf{L} = \mathbf{U}'\mathbf{Y}\mathbf{S}^{-1}\mathbf{U}_S\mathbf{L}_{JK} \tag{5}$$

where S is the covariance matrix of the centered data matrix and $\mathbf{U}_S$ are the coefficients
of a PCA. Unfortunately, to get U, it is again necessary to calculate the SVD of Y, which
once more restricts the maximum number of observations to be used.
Right now, you cannot circumvent the restriction on the maximum number of obser-
vations for the GH or SQ biplot. In the future, it might be worthwhile for StataCorp to
program the calculation of the eigenvalues from the dataset without storing the dataset
in a matrix beforehand. In this case, at least the GH biplot could be easily derived from
a PCA with (3) and (4).
From a practical point of view, the described restriction is not as restrictive as it
sounds. It has already been stated that the interpretation of the biplot will be suspect
if the variance explained by the dimensions of the biplot is small. Small explained
variances, however, are quite common in working with datasets with many observations.
To this extent, the biplot has its strength mainly for datasets with a small to moderate
number of observations. For huge datasets, the JK biplot can be calculated in any case.
3 The derivation of this formula can be found in the appendix.
aweights and fweights are allowed; see [U] 11.1.6 weight. However, no weights are
allowed with option rv, and aweights are not allowed with options sq and gh.
5.2 Options
jk | sq | gh specifies the biplot type. jk specifies the default, a JK biplot. gh and sq
specify GH and SQ biplots, respectively (see section 5.4).
mixed(jk | sq | gh jk | sq | gh) can be used instead of the biplot types to combine the
relative advantages of the different biplot types. Inside the parentheses, you first
state the type for the observations and then a type for the variables (see section 5.4).
covariance is used to plot the unstandardized data matrix. The default is standard-
ization (see section 5.4).
mahalanobis can be used for GH biplots to rescale the graph in a way that the distances
between the observations approximate the Mahalanobis distances (see section 5.4).
rv is used to produce relative variation diagrams (see section 5.4).
obsonly | varonly are used to suppress the plotting of observations or variables, respec-
tively (see section 5.3).
dimensions(##) is used to specify the space in which the variables and observations
are drawn. The default is to use the dimension with the highest eigenvalues (i.e.,
the first two principal components for JK biplots) (see section 5.3).
generate(name1 name2 ) is used to store the coordinates for the observations and
the variables as variables in the dataset. The y-axis coordinates for the observations
are stored in name1 y, and the x-axis coordinates for the observations are stored in
name1 x. Accordingly, the coordinates for the variables are stored in name2 y and
name2 x.
subpop(varname) is used to highlight observations from different subpopulations with
different marker symbols (see section 5.5).
stretch(#) draws longer (or if needed shorter) lines for the variables. By default,
stretch() is set to a value that improves readability (see section 5.3).
flip(x | y | xy) exchanges the signs of the axes. flip(x) and flip(y) exchange signs
of the indicated axis, flip(xy) flips both axes. flip() is seldom used but might
be useful if you want to compare your results with the results of other software
packages.
scatter options are the following options allowed with twoway scatter.
Up to two elements are allowed for each option. The first element refers to the
display of the observations, and the second element refers to the variables. Note
that the default plot symbol for the position of the variables is invisible; that is,
the default value for msymbol is msymbol(oh i). The lines for the variables are,
however, changed with the line options.
line options are the following set of options allowed with line. Note that the
line options refer only to the display of the variable lines.
    clpattern(linepatternstylelist)   whether line is solid, dashed, etc.
    clwidth(linewidthstylelist)       thickness of line
    clcolor(colorstylelist)           color of line
twoway options are those options allowed with graph twoway; see [G] twoway options.
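The options above can be combined as in the following sketch. It assumes that biplot8 is installed and that the iris data are in memory with the variable names used in this article; the particular combinations are illustrative choices, and obs and var are hypothetical stub names for generate().

```stata
* Assumes biplot8 is installed and the iris data are in memory.

* GH biplot, rescaled so that distances approximate Mahalanobis distances
biplot8 sepallen-petalwid, gh mahalanobis

* observations drawn as in a JK biplot, variables as in a GH biplot
biplot8 sepallen-petalwid, mixed(jk gh)

* unstandardized data matrix, with the sign of the x axis exchanged
biplot8 sepallen-petalwid, covariance flip(x)

* store the coordinates of observations and variables under the
* hypothetical stub names obs and var
biplot8 sepallen-petalwid, generate(obs var)
```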
4 The examples in this section use the iris dataset. The data contains the sepal length, sepal width,
petal length, and petal width of 150 flowers from the iris species setosa, versicolor, and virginica. It was
collected by Anderson (1935) and was used by Fisher (1936) in his initiation of the linear-discriminant-
function technique.
. biplot8 sepallen-petalwid
[Figure: JK biplot of the iris data produced by the command above. Horizontal axis: DIM 1 (73% of Var); vertical axis: DIM 2 (23% of Var). Variable lines: sepalwid, sepallen, petalwid, petallen.]
As stated above, the JK biplot superimposes two of the most often described plots
for principal component analysis: the component score plot and the plot of the PCA co-
efficients. However, in the default setting of the command biplot8, there is a difference
between the variable lines of the JK biplot and the plot of the PCA coefficients. The
biplot8 command stretches the variable lines to optimally fill the plot region given by
the observations (Digby and Kempton 1987, section 3.2). The positions of the variable
lines along the graph axis therefore represent the relative sizes of the PCA coefficients,
as opposed to the absolute ones, used in the plot of PCA coefficients. High values
still represent high “loadings”, but the square of the loadings cannot be interpreted as
communalities, as is the case for the plot of PCA coefficients.
It is, however, still possible to use biplot8 as a means to produce the plot of PCA
coefficients and the component score plot. The plot of PCA coefficients can be produced
with the options stretch(#) and varonly. In the former option, # stands for a
number by which the length of the variable lines is multiplied. By default, biplot8
automatically chooses this stretch factor to ensure optimal readability. Setting the
stretch factor to 1 forces Stata to use the original values, which are the PCA coefficients
in the case of the JK biplot. Using the option varonly, in addition, suppresses the
display of the observations entirely and thereby sets the graph scales according to the
coordinates of the variables. This brings up the plot of the PCA coefficients (figure 4).
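A command of the following form combines the two options just described. This is a sketch that assumes biplot8 is installed and the iris data are in memory with the variable names used in this article.

```stata
* Plot of the PCA coefficients: variable lines only, no stretching
biplot8 sepallen-petalwid, stretch(1) varonly
```

Replacing stretch(1) varonly with obsonly would, by the same logic, suppress the variable lines and leave the component score plot.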
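[Figure 4. Plot of the PCA coefficients of the iris data. Horizontal axis: DIM 1 (73% of Var). Variable lines recovered: sepalwid, petalwid, petallen.]
[Figure: plot of the iris data with horizontal axis DIM 1 (73% of Var), ranging from −4 to 4.]
[Figure: biplot of the iris data with horizontal axis DIM 1 (92% of Var). Variable lines: petalwid, petallen, sepalwid, sepallen.]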
As mentioned in section 3, the biplot types differ in the quality of the approximations
of the key quantities shown in a biplot. While the approximation of the Euclidean
distance is best represented in the JK biplot, the variance–covariance structure is better
[Figure: biplot of the iris data. Horizontal axis: DIM 1 (73% of Var); vertical axis: DIM 2 (23% of Var). Variable lines: sepalwid, sepallen, petalwid, petallen.]
Note, however, that while it is possible to give optimal approximations to two of the
quantities shown in a biplot, this is not possible for all three of them (Gower and Hand
1996; Gabriel 2002). Mixing the GH and JK biplot as in the example above does not
optimally represent the observational values.
A further variant is biplots for compositional data. Compositional data are datasets
with constant row sums and only positive values, e.g., row percentages of contingency
tables. Standard data analysis techniques usually tend to
be misleading for compositional data, and therefore a set of specialized techniques is available for such data
(Aitchison 1986). The equivalent to biplots for compositional data is the “relative
variation diagram” (RV plot) (Aitchison 1990). A relative variation diagram refers to a
biplot of a transformed data matrix. The transformation is
$$y^{*}_{ik} = \ln y_{ik} - \bar{y}_{i} - \bar{y}_{k}$$

with $y_{ik}$ being the untransformed value of Y in the ith row and kth column and $\bar{y}_{i}$ and
$\bar{y}_{k}$ being the row and column means of the data matrix. The option rv forces Stata to
make this transformation before producing the biplot.
Finally, the option mahalanobis can be used to rescale the coordinates in G and H
by

$$\mathbf{G}^{*} = \mathbf{G} \times \sqrt{n}$$

$$\mathbf{H}^{*} = \mathbf{H} \times \frac{1}{\sqrt{n}}$$

before producing the biplot. According to Gabriel (1971), the resulting biplot reflects
the Mahalanobis distances between the observations instead of the Euclidean distances.
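A short check, using only the two rescaling formulas above, shows that the rescaling leaves the product of the two coordinate sets unchanged, so the rescaled plot still factorizes the same data matrix:

```latex
\mathbf{G}^{*}\mathbf{H}^{*\prime}
  = \left(\sqrt{n}\,\mathbf{G}\right)\left(\tfrac{1}{\sqrt{n}}\,\mathbf{H}\right)'
  = \mathbf{G}\mathbf{H}'
```

Only the relative scale of the observation points and the variable lines changes.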
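[Figure: biplot of the iris data with observations marked by species. Horizontal axis: DIM 1 (73% of Var); vertical axis: DIM 2 (23% of Var). Variable lines: sepalwid, sepallen, petalwid, petallen. Legend: setosa, versicolor, virginica.]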
Note that the default positioning of legends changes the aspect ratio of the biplot.
If you don’t like this, you can move the legend position to the inner ring, as shown in
the example. Alternatively, you can turn the legend off or refine the aspect ratio with
the options xsize() or ysize().
6 Appendix
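Consider a PCA of the data matrix Y, which is an SVD of the variance–covariance matrix
S of Y:

$$\mathbf{S} = \mathbf{U}_S\mathbf{L}_{JK}\mathbf{V}_S' \tag{6}$$

Also consider the coordinates of the observations for the JK biplot from (1):

$$\mathbf{G}_{JK} = \mathbf{U}\mathbf{L} \tag{7}$$

From Jolliffe (2002, 94), it is known that $\mathbf{G}_{JK}$ are equal to the scores of the
observations on the principal components, which are given by

$$\mathbf{G}_{JK} = \mathbf{Y}\mathbf{U}_S \tag{8}$$

Equating (7) and (8) and premultiplying by $\mathbf{U}'$, where $\mathbf{U}'\mathbf{U} = \mathbf{I}$, gives

$$\mathbf{L} = \mathbf{U}'\mathbf{Y}\mathbf{U}_S \tag{9}$$

In order to find the relation between L and $\mathbf{L}_{JK}$, we look at $\mathbf{U}_S$ from (6). The matrix
S is symmetric, so

$$\mathbf{U}_S = \mathbf{V}_S$$

$$\mathbf{S} = \mathbf{U}_S\mathbf{L}_{JK}\mathbf{U}_S'$$

$$\mathbf{S}\mathbf{U}_S = \mathbf{U}_S\mathbf{L}_{JK}\mathbf{U}_S'\mathbf{U}_S$$

$$\mathbf{S}\mathbf{U}_S = \mathbf{U}_S\mathbf{L}_{JK}$$

$$\mathbf{U}_S = \mathbf{S}^{-1}\mathbf{U}_S\mathbf{L}_{JK} \tag{10}$$

Substituting (10) into (9) yields

$$\mathbf{L} = \mathbf{U}'\mathbf{Y}\mathbf{S}^{-1}\mathbf{U}_S\mathbf{L}_{JK}$$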
7 Acknowledgments
We would like to point German readers to the book Graphisch gestützte Datenanalyse, written
by Rainer Schnell (1994, 176–186), upon which parts of this article are heavily based.
We would also like to thank Frauke Kreuter for carefully reading an earlier draft of this article
and Vince Wiggins and Nick Cox for helping us write biplot8.
8 References
Aitchison, J. 1986. The Statistical Analysis of Compositional Data. London: Chapman
& Hall.
—. 1990. Relative variation diagrams for describing patterns of compositional variability.
Mathematical Geology 22: 487–512.
Anderson, E. 1935. The irises of the Gaspé peninsula. Bulletin of the American Iris
Society 59: 2–5.
Blasius, J. and M. Greenacre, ed. 1998. Visualization of Categorical Data. San Diego:
Academic Press.
Digby, P. G. N. and R. A. Kempton. 1987. Multivariate Analysis of Ecological Com-
munities. London: Chapman and Hall.
Gabriel, K. 1971. The biplot graphic display of matrices with application to principal
component analysis. Biometrika 58(3): 453–467.
Jolliffe, I. 2002. Principal Component Analysis. 2nd ed. New York: Springer.
Kruskal, J. B. and M. Wish. 1978. Multidimensional Scaling. Beverly Hills, CA: Sage.
Schnell, R. and H. Matschinger. 1994. Multivariate graphics: Current use and imple-
mentations in the social sciences. In Computational Statistics. Papers Collected on
the Occasion of the 25th Conference on Statistical Computing at Schloß Reisensburg,
ed. P. Dirschedl and R. Ostermann, 275–294. Heidelberg: Physica.
Tabachnik, B. and L. S. Fidell. 1989. Using Multivariate Statistics. 2nd ed. New York:
Harper and Row.