Basic Principles of Graphing Data
Point of View
Introduction
Agricultural scientists produce abundant results. Some are important, others less so, and still others negligible. How the results are presented therefore matters, for several reasons. First, the important results need to be emphasized. Second, most data are easiest to interpret from graphs, and far easier from a good graph than from a bad one. Just as statistical analysis can lead to false conclusions (Huff, 1954; Kozak, 2009a), poor graphs can misinform, sometimes leading to serious misinterpretations.
Huff (1954) was probably the first to discuss how graphs of statistical data can be a tool of misinformation. Some of his notions later met with counter-arguments, yet the main ideas still hold: a graph must be correct and must convey true information and support a true interpretation. Scientific graphs are not meant to be beautiful; they are meant to be informative. Of course, an ugly graph is unlikely to convey an interesting message, so elegance should not be disregarded (Tufte, 1991; 1997; 2001; 2006).
Tufte (1991; 1997; 2001; 2006), Cleveland (1993; 1994), Jacoby (1997; 1998) and Wilkinson (2005), among others, discuss graphing data in great detail. Harris (1999) is a useful reference for information graphics. Yet the scientific literature of the 21st century is full of poor graphs, some of which are incomprehensible while others even unintentionally falsify information. Data visualization is a developing research area, and what was considered a good graph 30 years ago is not necessarily good today. This paper offers basic information about graphing data, with the help of which authors should be able to construct sufficiently good scientific graphs. The focus is mainly on graphical rather than statistical aspects, so remember that when constructing a graph, you must take all
Kozak
Table 1 Elements to check while constructing a graph. Refer to the text for details.
Data points
Lines
Color: check whether needed at all; check whether all elements are easily distinguished
type of box, aspect, minimum and maximum value, data rectangle and scale-line rectangle, tick marks (number, location, direction, length and width), tick mark labels (font type, length, rotation, numbering style, abbreviations of text labels)
check whether needed, position, box around, size, elements within (see above for "data points" and "lines", and "labels" within "box and axes")
of the graph; of bars, cross-hatching, shading, color
Dimensions
Captions
Reference line
Grid lines
Error bars: check if the information is given on what they represent; line type, width and color, type of ending
Trellis display: all the above elements; layout (rows, columns, pages), choice of panel variables and conditioning variable(s), panel order
Figure 5 Scatterplots representing replicated data on the number of insects caught after applying various insecticides (data set InsectSprays). A drawback of the left panel is the overlap of points. This is overcome in the right panel, where jittering is employed (Observe the GOOD GRAPH IN RIGHT PANEL).
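The jittering mentioned in the caption can be sketched as follows. This is only an illustration under stated assumptions, not the code used for Figure 5: the function name `jitter`, the noise width of 0.15, the fixed seed and the counts are all made up, and the noise is added to the categorical axis positions only, so the counts themselves stay exact.

```python
import random


def jitter(values, amount=0.15, seed=1):
    """Return a copy of `values` with uniform noise in [-amount, amount] added.

    `amount` and `seed` are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical data: three sprays coded 1..3 on the categorical axis,
# two replicates each, with identical counts so the raw points overlap.
spray_position = [1, 1, 2, 2, 3, 3]
counts = [10, 10, 7, 7, 12, 12]

# Jitter only the categorical positions; the counts are left untouched.
x_jittered = jitter(spray_position)

for x, y in zip(x_jittered, counts):
    print(f"x = {x:.3f}, count = {y}")
```

Keeping the noise well below half the distance between neighboring categories ensures that the points of one insecticide cannot be mistaken for those of another.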
tion (e.g., Mohapatra and Kariali 2008); for example, regression equations are far too often put inside graphs (Fiorio and Dematte, 2009; Pahlavani et al., 2009; Simoes et al., 2009). However, Tufte's (2006, p. 120) graph is an excellent exception to this rule. Sometimes labels are needed inside the data rectangle, biplots being an example (e.g., Lammel et al., 2007), but one must remember Cleveland's (1994) rule of not allowing such data labels to interfere with the quantitative data or to clutter the graph (Table 2).
Do not overuse legends. If all the necessary information can be included in the graph itself, there is no need for a legend at all. For example, Silva et al. (2009b) decided not to use a legend in a situation in which most others would. Jonsson and Aoyama (2009) would have needed a legend had they used a horizontal version of the barplot instead of the vertical one, while Ambrosano et al. (2009) chose the vertical barplot and so had to use a legend. Compare Figures 3 and 4, which represent the same data: mean area under the disease progress curve of 16 potato clones in one environment. In the former, the clones are presented on the horizontal axis (note that this is usually done for barplots). The clones' names are too long to be presented in full, so they are abbreviated and a legend is needed to explain the abbreviations; even so, the font size had to be reduced. In Figure 4, a vertical version of the dotplot is drawn with full names, so no legend is needed. Of course, this does not mean that legends should never be used; sometimes they are indeed required, for example when grouped data are plotted within one graph (e.g., Figure 10). Quite often the legend is included in the figure caption (which saves space), but placing it near the graph (as in Figure 10) facilitates reading. Nonetheless, remember Cleveland's advice to avoid putting a legend within the box (Cleveland, 1994, Table 2; cf. Macedo et al., 2009). When a legend is presented, it is better not to surround it with a box, as is sometimes done.
Figure 11 Boxplots representing the number of insects caught after application of six insecticides (data set InsectSprays). In the left panel, alphabetical ordering is employed, while in the right panel the insecticides are ordered by count. Observe the GOOD GRAPH IN RIGHT PANEL.
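Ordering categories by a summary of the data rather than alphabetically, as in the right panel of Figure 11, can be sketched as below. The choice of the median as the ordering statistic is an assumption made here for illustration (the caption says only "by count"), and the function name `order_by_median` and the counts are hypothetical, not the InsectSprays values.

```python
from statistics import median


def order_by_median(groups):
    """Return the group labels sorted by increasing median of their values."""
    return sorted(groups, key=lambda label: median(groups[label]))


# Made-up counts for three hypothetical sprays.
counts = {"A": [10, 12, 14], "B": [2, 3, 1], "C": [5, 7, 6]}

# Medians are A: 12, B: 2, C: 6, so the plotting order is B, C, A.
ordered = order_by_median(counts)
print(ordered)
```

Boxes drawn in this order form a monotone sequence, which makes neighboring groups easier to compare than an alphabetical arrangement does.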
Sci. Agric. (Piracicaba, Braz.), v.67, n.4, p.483-494, July/August 2010
those in Figure 12 are preferred to those in Figure 13 because they are easier to compare. The latter, however, have the advantage that they look like a confidence interval enclosed in parentheses, which is the way numeric confidence intervals are usually presented.
For multivariate data sets, a so-called trellis display (Cleveland, 1993, 1994) can be of help. It consists of panels arranged into rows, columns and pages, each panel containing a plot of the same type for a subset of the data set defined in a particular way. (Note that in publications trellis displays are seldom divided into pages.) This display can be very helpful in data analysis, interpretation and presentation. The trellis display works well with grouped data, in which case panels represent the levels of one or more categorical conditioning variables, but also with quantitative predictors, in which case a conditioning variable is cut into intervals and each panel represents those data points for which the value of the conditioning variable falls within the interval. Panels can contain various types of plots, including scatterplots, dot plots, boxplots, line plots and others. The trellis display can also be very useful in analyzing data from designed experiments (Cleveland and Fuentes, 1997). Most of the earlier comments on graphing are equally important for this type of display, but there are many other rules that can or should be followed; the reader can refer, for example, to Cleveland (1993) and Cleveland and Fuentes (1997). Figure 14, which will be discussed later, is an example of a trellis display for a scatterplot of two quantitative variables and a categorical conditioning variable.
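The way a quantitative conditioning variable defines the panels can be sketched as follows. This is a simplified illustration, assuming non-overlapping intervals of equal width; trellis software may instead use overlapping intervals or intervals with equal numbers of points. The function names and the data are hypothetical.

```python
def cut_into_intervals(values, n_intervals):
    """Split the range of `values` into equal-width (low, high) intervals."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("conditioning variable has no spread to cut")
    width = (high - low) / n_intervals
    # Use the observed maximum as the final edge to avoid rounding drift.
    edges = [low + i * width for i in range(n_intervals)] + [high]
    return list(zip(edges[:-1], edges[1:]))


def assign_panels(points, conditioning, n_intervals):
    """Group (x, y) points into panels by the interval of the conditioning value."""
    intervals = cut_into_intervals(conditioning, n_intervals)
    panels = [[] for _ in intervals]
    for point, c in zip(points, conditioning):
        for i, (lo, hi) in enumerate(intervals):
            is_last = i == len(intervals) - 1
            # Intervals are [lo, hi); the last one also includes its upper edge.
            if lo <= c < hi or (is_last and c == hi):
                panels[i].append(point)
                break
    return intervals, panels


# Made-up data: six (x, y) points and one conditioning value per point.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 8), (6, 7)]
conditioning = [0.0, 1.0, 4.0, 5.0, 9.0, 10.0]

intervals, panels = assign_panels(points, conditioning, 2)
for interval, panel in zip(intervals, panels):
    print(interval, panel)
```

Each inner list would then be drawn as one panel with the same plot type and the same axis scales, so that the panels can be compared directly.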
Some additional elements can be used to enhance a graph. These could be, for example, a reference line (see Table 2 for Cleveland's [1994] comment) or, for multipanel graphs (especially scatterplots), a visual grid. Re-
Additional examples
Several examples of good graphs have been provided
(Figures 2, 4, 7, 10, 12, 13, 14, 15; right panels of Figures
Figure 14 A trellis display portraying relationships between fruitfly longevity and thorax length in different experimental groups (representing various sexual activities), with regression lines superimposed (data set fruitfly). The regression lines were obtained through generalized least squares with a power variance function to deal with heterogeneity of residuals, and the final model included a common slope for thorax length and intercepts varying across activities. This is a GOOD GRAPH.
Conclusion
Graphing data is not a simple matter. In fact, this point of view covers just a small portion of what is known about visualizing scientific data, but it is believed to be sufficient for authors constructing simple graphs for their publications. Whole books have been devoted to the topic, the best known being the classics by Tufte (1991; 1997; 2001; 2006), Cleveland (1993; 1994) and Wilkinson (2005). Jacoby (1997; 1998) also presents some interesting ideas, while Harris (1999) is a useful reference on information graphics.
Most principles presented in this paper are so basic that they apply to most graph types. However, be aware that there are other types of graphs that offer various possibilities for visualizing data. One should always choose the graph that is best for the data at hand: best in terms of data presentation and interpretation, but also ease of reading. Once a type of graph is chosen, all its details should be carefully selected so that the graph conveys the message its author intended.
Appendix
Data sets used in the examples. All these data sets come
from R (R Development Core Team, 2009), from the data
sets of the same names.
ChickWeight
The data from a repeated-measure experiment on the effect of diet on early growth of chickens (Crowder and
Hand, 1990).
fruitfly
The data on longevity and sexual activity of male
fruitflies, originating from Partridge and Farquhar
(1981), downloaded from the faraway package (Faraway,
2009) of R. See also the paper by Hanley and Shapiro
(1994), which discusses interesting aspects of this data
set and its usefulness in teaching statistics.
haynes
The data on phenotypic stability of resistance to late
blight of potato clones; the variable considered is area
under the disease progress curve (AUDPC) (Haynes et
al., 1998). Data were taken from the agricolae package
(Mendiburu, 2010) of R.
InsectSprays
The data originating from Beall (1942). They include the numbers of insects counted after application of six insecticides, in 12 replications.
iris
The famous iris data set (Anderson, 1935; Fisher, 1936), providing the measurements (in cm) of sepal length and width and petal length and width for 50 flowers from each of three species of iris: Iris setosa, I. versicolor and I. virginica.
trees
The data on measurements of the girth, height and volume of timber in 31 felled black cherry trees, provided
by Ryan et al. (1976).
Software used for graphing and data analysis. The statistical analysis for Example 3 was performed with the linear mixed-effects modeling framework offered by the nlme package (Pinheiro and Bates, 2000) of R (R Development Core Team, 2009). The graphs were constructed with the graphics (Murrell, 2005) and lattice (Sarkar, 2008) packages of R.
References
Abou Khalifa, A.A.B. 2009. Physiological evaluation of some hybrid
rice varieties under different sowing dates. Australian Journal
of Crop Science 3: 178-183.
Ambang, Z.; Ndongo, B.; Amayana, D.; Djil, B.; Ngoh, J.B.;
Chewachong, G.M. 2009. Combined effect of host plant resistance
and insecticide application on the development of cowpea viral
diseases. Australian Journal of Crop Science 3: 167-172.
Ambrosano, E.J.; Trivelin, P.C.O.; Cantarella, H.; Ambrosano,
G.M.B.; Schammass, E.A.; Muraoka, T.; Guirado, N.; Rossi, F.
2009. Nitrogen supply to corn from sunn hemp and velvet bean
green manures. Scientia Agricola 66: 386-394.
Anderson, E. 1935. The irises of the Gaspe Peninsula. Bulletin of
the American Iris Society 59: 2-5.
Beall, G. 1942. The transformation of data from entomological
field experiments. Biometrika 29: 243-262.
Brunnings, A.M.; Datnoff, L.E.; Ma, J.F.; Mitani, N.; Nagamura,
Y.; Rathinasabapathi, B.; Kirst, M. 2009. Differential gene
expression of rice in response to silicon and rice blast fungus
Magnaporthe oryzae. Annals of Applied Biology 155: 161-170.
Carr, D.; Littlefield, R.J.; Nicholson, W.L.; Littlefield, J.S. 1987.
Scatterplot matrix techniques for large N. Journal of the
American Statistical Association 82: 424-436.
Cleveland, W.S. 1993. Visualizing Data. Hobart Press, Summit,
NJ, USA. 360 p.
Cleveland, W.S. 1994. The Elements of Graphing Data. 2ed. Hobart
Press, Summit, NJ, USA. 323 p.
Cleveland, W.S.; Fuentes, M. 1997. Trellis Display: Modeling Data
from Designed Experiments; Technical Report. Bell Labs, Paris,
France.
Cleveland, W.S.; McGill, R. 1984. The Many Faces of a Scatterplot.
Journal of the American Statistical Association 79: 807-822.
Chambers, J.M.; Hastie, T.J. 1992. Statistical Models in S. Brooks/
Cole Wadsworth, Florence, KY. 608 p.
Crowder, M.; Hand, D. 1990. Analysis of Repeated Measures. Chapman
and Hall, NY, USA. 257 p.
Cumming, G.; Finch, S. 2005. Inference by eye: confidence intervals
and how to read pictures of data. American Psychologist 60:
170-180.
Cumming, G.; Williams, J.; Fidler, F. 2004. Replication, and
researchers' understanding of confidence intervals and standard
error bars. Understanding Statistics 3: 299-311.
Czubaszek, A. 2009. The effects of genotype and environment on
selected traits of oat grain and flour. Plant Breeding and Seed
Science 60: 45-60.
Dupont, W.D.; Plummer, W.D.J. 2003. Density distribution
sunflower plots. Journal of Statistical Software 8: 1-5.
Ehrenberg, A.S.C. 1977. Rudiments of numeracy. Journal of the
Royal Statistical Society. Series A 140: 277-297.
EL-Shafey, N.M.; Hassaneen, R.A.; Gabr, M.M.A.; EL-Sheihy, O.
2009. Pre-exposure to gamma rays alleviates the harmful effect
of drought on the embryo-derived rice calli. Australian Journal
of Crop Science 3: 268-277.
Moussa, H.R.; Abdel-Aziz, S.M. 2008. Comparative response of
drought tolerant and drought sensitive maize genotypes to water
stress. Australian Journal of Crop Science 1: 31-36.
Murrell, P. 2005. R Graphics. Chapman and Hall, New York, NY, USA. 328 p.
Ofosu-Anim, J.; Leitch, M. 2009. Relative efficacy of organic
manures in spring barley (Hordeum vulgare L.) production.
Australian Journal of Crop Science 3: 13-19.
Pahlavani, M.H.; Miri, A.A.; Kaziem, G. 2009. Response of oil and
protein content to seed size in cotton (Gossypium hirsutum L.,
cv. Sahel). Plant Breeding and Seed Science 59: 53-64.
Pinheiro, J.C.; Bates, D.M. 2000. Mixed-effects Models in S and S-PLUS. Springer, New York, NY, USA. 528 p.
Partridge, L.; Farquhar, M. 1981. Sexual activity and the lifespan of
male fruitflies. Nature 294: 580-581.
R Development Core Team. 2009. R: A language and environment
for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria.
Reynolds, M.; Manes, Y.; Izanloo, A.; Langridge, P. 2009.
Phenotyping approaches for physiological breeding and gene
discovery in wheat. Annals of Applied Biology 155: 309-320.
Ryan, T.A.; Joiner, B.L.; Ryan, B.F. 1976. The Minitab Student
Handbook. Duxbury Press, Pacific Grove, CA, USA.
Sarkar, D. 2008. Lattice: Multivariate Data Visualization with R.
Springer, New York, NY, USA. 265 p.
Scarpari, M.S.; Beauclair, E.G.F. 2009. Physiological model to
estimate the maturity of sugarcane. Scientia Agricola 66: 622-628.
Silva, S.C.; Oliveira Bueno, A.A.; Carnevalli, R.A.; Uebele, M.C.;
Bueno, F.O.; Hodgson, J.; Matthew, C.; Arnold, G.C.; M,
J.P.G. 2009a. Sward structural characteristics and herbage
accumulation of Panicum maximum cv. Mombaça subjected to
rotational stocking managements. Scientia Agricola 66: 8-19.
Silva, M.M.; Libardi, P.L.; Fernandes, F.C.S. 2009c. Nitrogen doses
and water balance components at phenological stages of corn.
Scientia Agricola 66: 512-521.
Silva, R.B.T.R.; Nääs, I.A.; Moura, D.J. 2009b. Broiler and swine
production: animal welfare legislation scenario. Scientia Agricola
66: 713-720.
Simoes, M.S.; Rocha, J.V.; Lamparelli, R.A.C. 2009. Orbital spectral
variables, growth analysis and sugarcane yield. Scientia Agricola
66: 451-461.
Spence, I.; Lewandowsky, S. 1990. Graphical perception. p. 13-57.
In: Fox, J.; Long, J.S., eds. Modern methods of data analysis.
Sage, Thousand Oaks, CA, USA.
Takahashi, D.; Ditt, R.F.; Lambais, M.R. 2009. Cloning of putative
ureG genes from Glomus intraradices and urease activities in tobacco
arbuscular mycorrhizal roots. Scientia Agricola 66: 258-266.
Tufte, E.R. 2001. The Visual Display of Quantitative Information.
2ed. Graphics Press, Cheshire, CT, USA. 199 p.
Tufte, E.R. 1991. Envisioning Information. Graphics Press, Cheshire,
CT, USA. 126 p.
Tufte, E.R. 1997. Visual Explanations. Graphics Press, Cheshire,
CT, USA. 157 p.
Tufte, E.R. 2006. Beautiful Evidence. Graphics Press, Cheshire,
CT, USA. 213 p.
Vázquez, E.V.; Vieira, S.R.; De Maria, I.C.; González, A.P. 2009.
Geostatistical analysis of microrelief of an Oxisol as a function
of tillage and cumulative rainfall. Scientia Agricola 66: 225-232.
Virdis, A.; Motzo, R.; Giunta, F. 2009. Key phenological events in
globe artichoke (Cynara cardunculus var. scolymus) development.
Annals of Applied Biology 155: 419-429.
Wiewióra, B. 2009. Long-time storage effect on the seed health of
spring barley grains. Plant Breeding and Seed Science 59: 3-12.
Wilkinson, L. 2005. The Grammar of Graphics. 2ed. Springer-Verlag, New York, NY, USA. 690 p.