Quantitative Data Analysis: An Introduction
May 1992
GAO/PEMD-10.1.11
Preface
Werner Grosshans
Assistant Comptroller General
Office of Policy
Eleanor Chelimsky
Assistant Comptroller General
for Program Evaluation and Methodology
Preface 1

Chapter 1: Introduction 8
  Guiding Principles 8
  Quantitative Questions Addressed in the Chapters of This Paper 11
  Attributes, Variables, and Cases 13
  Level of Measurement 16
  Unit of Analysis 18
  Distribution of a Variable 19
  Populations, Probability Samples, and Batches 26
  Completeness of the Data 28
  Statistics 29

Chapter 2: Determining the Central Tendency of a Distribution 31
  Measures of the Central Tendency of a Distribution 33
  Analyzing and Reporting Central Tendency 35

Chapter 3: Determining the Spread of a Distribution 39
  Measures of the Spread of a Distribution 41
  Analyzing and Reporting Spread 49

Chapter 4: Determining Association Among Variables 51
  What Is an Association Among Variables? 51
  Measures of Association Between Two Variables 55
  The Comparison of Groups 67
  Analyzing and Reporting the Association Between Variables 70

Chapter 5: Estimating Population Parameters 74
  Histograms and Probability Distributions 76
  Sampling Distributions 80
  Population Parameters 83
  Point Estimates of Population Parameters 84
  Interval Estimates of Population Parameters 87

Chapter 6: Determining Causation 91
  What Do We Mean by Causal Association? 92
  Evidence for Causation 93
  Limitations of Causal Analysis 103

Chapter 7: Avoiding Pitfalls 105
  In the Early Planning Stages 105
  When Plans Are Being Made for Data Collection 108
  As the Data Analysis Begins 109
  As the Results Are Produced and Interpreted 112

Abbreviations
Introduction
1
Relative to GAO job phases, the first two checkpoints occur during
the job design phase, the third occurs during data collection and
analysis, and the fourth during product preparation. For detail on
job phases see the General Policy Manual, chapter 6, and the
Project Manual, chapters 6.2, 6.3, and 6.4.
Data Analysis

After the data are collected, evaluators need to see whether their expectations regarding data characteristics and quality have been met. Choice among possible analyses should be based partly on the nature of the data—for example, whether many observed values are small and a few are large and whether the data are complete. If the data do not fit the assumptions of the methods they had planned to use, the evaluators have to regroup and decide what to do with the data they have.2 A different form of data analysis may be advisable, but if some
2
An example would be a study in which the data analysis method
evaluators planned to use required the assumption that
observations be from a probability sample, as discussed in chapter
5. If the evaluators did not obtain observations for a portion of the
intended sample, the assumption might not be warranted and their
application of the method could be questioned.
3
Inconsistencies in the use of statistical terms can cause problems.
We have tried to deal with the difficulty in three ways: (1) by using
the language of current writers in the field, (2) by noting instances
where there are common alternatives to key terms, and (3) by
including a glossary of the terms used in this paper.
4
Instead of referring to the attributes of a variable, some prefer to
say that the variable takes on a number of “values.” For example,
the variable gender can have two values, male and female. Also,
some statisticians use the expression “attribute sampling” in
reference to probability sampling procedures for estimating
proportions. Although attribute sampling is related to attribute as
used in data analysis, the terminology is not perfectly parallel. See
the discussion of attribute sampling in the transfer paper entitled
Using Statistical Sampling, listed in “Papers in This Series.”
5
A variable for which the attributes are assigned arbitrary
numerical values is usually called a “dummy variable.” Dummy
variables occur frequently in evaluation studies.
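As a minimal sketch of this idea, the attributes of a nominal variable can be recoded as arbitrary 0/1 numerical values. The variable, its attributes, and the helper function below are hypothetical illustrations, not part of the paper.

```python
# Illustrative sketch: coding a nominal variable's attributes as
# arbitrary 0/1 numerical values (a "dummy variable").

def dummy_code(values, reference):
    """Return 1 where the attribute matches `reference`, else 0.

    The 0/1 assignment is arbitrary; for a nominal variable only
    the distinctness of the attributes matters.
    """
    return [1 if v == reference else 0 for v in values]

gender = ["male", "female", "female", "male"]
female = dummy_code(gender, "female")
print(female)  # [0, 1, 1, 0]
```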
6
Error in using probability samples to answer questions about
populations stems from the net effects of both measurement error
and sampling error. Conclusions based upon data from the entire
population are subject only to measurement error. The total error
associated with data from a probability sample may be less than the
total error (measurement only) of data from a population.
7
Another purpose, though one that has received less attention in
the statistical literature, is to devise useful ways to graphically
depict the data. See, for example, Du Toit, Steyn, and Stumpf, 1986;
and Tufte, 1983.
1
Measures of central tendency also go by other, equivalent names
such as “center indicators” and “location indicators.”
2
With an odd number of cases, the midpoint is the median. With an
even number of cases, the median is the mean of the middle pair of
cases.
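The median rule described above can be sketched in a few lines of code; the batch values are hypothetical.

```python
# Sketch of the median rule: with an odd number of cases the middle
# value of the ordered batch is the median; with an even number it
# is the mean of the middle pair of cases.

def median(batch):
    ordered = sorted(batch)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd: single midpoint
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: mean of middle pair

print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0, the mean of the middle pair 3 and 5
```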
3
This definition is suitable when the mode is used with nominal and
ordinal variables—the most common situation. A slightly different
definition is required for interval-ratio variables.
4
To keep the discussion general, we make no assumptions about
how the group of recipients was chosen. However, in GAO, a
probability sample would usually form the basis for data collection
by a mailout questionnaire.
5
Although computer programs automatically compute a variety of
indicators and although we display three of them here, we are not
suggesting that this is a good practice. In general, the choice of an
indicator should be based upon the measurement level of a variable
and the shape of the distribution.
1
Expressing the spread as a band of four standard deviations is a
common but not unique practice. Any multiple of standard
deviations would be acceptable but two, four, and six are
commonplace.
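A four-standard-deviation band of the kind mentioned above can be computed as follows; the batch of values is hypothetical, and the sample standard deviation is used.

```python
# Sketch: reporting spread as a band of four standard deviations,
# i.e., the mean plus or minus two standard deviations.
import statistics

batch = [12, 15, 9, 14, 11, 13, 10, 16]
mean = statistics.mean(batch)
sd = statistics.stdev(batch)           # sample standard deviation
band = (mean - 2 * sd, mean + 2 * sd)  # a four-standard-deviation band
print(round(band[0], 1), round(band[1], 1))  # 7.6 17.4
```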
2
The term “standard deviation” is sometimes misunderstood to be
implying some substantive meaning to the amount of
variation—that the variation is a large amount or a small amount.
The measure by itself does not convey such information, and after
we have computed a standard deviation, we still have to decide, on
the basis of nonstatistical information, whether the variation is
“large” or not.
3
The name for the set of theoretical distributions called “normal” is
unfortunate in that it seems to imply that distributions that have
this form are “to be expected.” While many real-world distributions
are indeed close to a normal (or Gaussian) distribution in shape,
many others are not.
4
A possible approach with a variable that does not have a normal
distribution is to change the scale of the variable so that the shape
does approximate the normal. See Velleman and Hoaglin (1981) for
some examples; they refer to the process of changing the scale as
“re-expression,” but “transformation” of the variables is a more
common term.
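The effect of such a re-expression can be sketched with a log transform; the batch of values below is hypothetical.

```python
# Sketch of re-expression (transformation): a log transform pulls in
# the long right tail of a skewed variable, making its shape more
# symmetric and closer to normal.
import math
import statistics

balances = [100, 120, 150, 200, 300, 500, 900, 2000, 8000]
logged = [math.log10(x) for x in balances]

# On the raw scale the mean (about 1,363) sits far above the median
# (300), a sign of strong right skew; on the log scale the mean and
# median nearly coincide.
raw_gap = (statistics.mean(balances) - statistics.median(balances)) / statistics.median(balances)
log_gap = (statistics.mean(logged) - statistics.median(logged)) / statistics.median(logged)
print(raw_gap > log_gap)  # True
```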
1
The term “relationship” is equivalent to “association.”
Table 4.2: Cross-Tabulation of Two Ordinal Variables

Attitude toward        Family income level
energy conservation    Low    Medium    High    Total
Indifferent             27        37      56      120
Somewhat positive       35        39      41      115
Positive                43        33      30      106
Total                  105       109     127      341
Table 4.4: Cross-Tabulation of Two Nominal Variables

             Prices affected by crop
Region         Yes       No    Total
Northeast      322      672      994
Southeast      473      287      760
Midwest        366      382      748
Southwest      306      297      603
Northwest      342      312      654
Total        1,809    1,950    3,759
2
A definition of perfect association is beyond the scope of this
paper. Different measures of association sometimes imply different
notions of perfect association.
3
There are actually three ways to compute lambda. The numerical
value here is the symmetric lambda. There is some discussion of
symmetric and asymmetric measures of association later in this
paper.
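For readers who want to see the mechanics, the symmetric lambda can be computed from the Table 4.2 counts. This sketch uses the standard Goodman-Kruskal symmetric lambda formula; the resulting value is computed here, not quoted from the paper.

```python
# Symmetric lambda for the Table 4.2 cross-tabulation (attitude
# toward energy conservation by family income level).

table = [
    [27, 37, 56],   # Indifferent
    [35, 39, 41],   # Somewhat positive
    [43, 33, 30],   # Positive
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

sum_row_maxes = sum(max(row) for row in table)        # best guess of column given row
sum_col_maxes = sum(max(col) for col in zip(*table))  # best guess of row given column

lam = (sum_row_maxes + sum_col_maxes - max(row_totals) - max(col_totals)) / (
    2 * n - max(row_totals) - max(col_totals)
)
print(round(lam, 3))  # 0.071
```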
4
The word “correlation” is sometimes used in a nonspecific way as
a synonym for “association.” Here, however, the Pearson
product-moment correlation coefficient is a measure of linear
association produced by a specific set of calculations on a batch of
data. It is necessary to specify linear because if the association is
nonlinear, the two variables might have a strong association but the
correlation coefficient could be small or even zero. This potential
problem is another good reason for displaying the data graphically,
which can then be inspected for nonlinearity. For a relationship
that is not linear, another measure of association, called “eta,” can
be used instead of the Pearson coefficient (Loether and McTavish,
1988).
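The nonlinearity caveat can be demonstrated directly: below, y is perfectly determined by x, yet the Pearson coefficient is zero. The data are constructed for illustration.

```python
# Sketch: a strong but nonlinear association (y equals x squared)
# yields a Pearson product-moment correlation of zero.
import statistics

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]  # perfectly determined by x, but not linearly

def pearson(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / (sum((p - ma) ** 2 for p in a) ** 0.5
                  * sum((q - mb) ** 2 for q in b) ** 0.5)

print(pearson(x, y))  # 0.0
```

A scatterplot of these values would show the U-shaped relationship that the coefficient misses, which is the reason the text recommends inspecting the data graphically.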
5
Regression analysis is not covered in this paper. For extensive
treatments, see Draper and Smith, 1981, and Pedhazur, 1982.
6
The regression coefficient is closely related to the Pearson product
moment correlation. In fact, when the observed variables are
transformed to so-called z-scores, by subtracting the mean from
each observed value of a variable and dividing the difference by the
standard deviation of the variable, the regression coefficient of the
transformed variables is equal to the correlation coefficient.
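This equality is easy to verify numerically. In the sketch below (with hypothetical data), both variables are transformed to z-scores and the least-squares slope is compared with the Pearson coefficient computed from the raw values.

```python
# Sketch: after transforming both variables to z-scores, the
# regression coefficient equals the Pearson correlation coefficient.
import statistics

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]

def zscores(v):
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(u - m) / s for u in v]

def pearson(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / (sum((p - ma) ** 2 for p in a) ** 0.5
                  * sum((q - mb) ** 2 for q in b) ** 0.5)

zx, zy = zscores(x), zscores(y)
slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)  # regression of zy on zx

print(abs(slope - pearson(x, y)) < 1e-12)  # True
```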
Page 62 GAO/PEMD-10.1.11 Quantitative Analysis
Chapter 4
Determining Association Among
Variables
7
The point biserial correlation is analogous to the Pearson
product-moment correlation, which applies when both variables are
measured at the interval-ratio level.
8
If we are trying to draw conclusions about a population from a
probability sample, then we must additionally be concerned about
whether what seems to be an association really stems from
sampling fluctuation. The data analysis then involves inferential
statistics.
9
The assumptions are not very stringent for descriptive statistics
but may be problematic for inferential statistics.
1
Probability sampling is sometimes called statistical sampling or
scientific sampling.
2
Nominal and ordinal variables take on a finite set of values.
Interval-ratio variables have a potentially infinite set of values, so
the corresponding probability distribution is defined a little
differently. (These variables are introduced under “Level of
Measurement” in chapter 1.)
3
There are many kinds of probability samples. The most elementary
is the simple random sample in which each member of the
population has an equal chance of being drawn to the sample.
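A simple random sample of this kind can be drawn as follows; the population of 1,000 case identifiers is hypothetical, and the fixed seed is only for repeatability.

```python
# Sketch: a simple random sample, in which each member of the
# population has an equal chance of being drawn.
import random

random.seed(1)                          # fixed seed so the draw is repeatable
population = list(range(1, 1001))       # e.g., 1,000 case identifiers
sample = random.sample(population, 50)  # drawn without replacement
print(len(sample), len(set(sample)))    # 50 50
```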
4
Notice the difference between a sample distribution (the
distribution of a sample) and a sampling distribution (the
distribution of a sample statistic).
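The distinction can be made concrete by simulation. In this sketch, the skewed population and the sample sizes are hypothetical; repeated samples approximate the sampling distribution of the mean.

```python
# Sketch: a sample distribution is the distribution of one sample's
# values; a sampling distribution is the distribution of a statistic
# (here the sample mean) across repeated samples.
import random
import statistics

random.seed(2)
population = [random.expovariate(1 / 500) for _ in range(10_000)]  # skewed values

one_sample = random.sample(population, 100)        # a sample distribution
sample_means = [statistics.mean(random.sample(population, 100))
                for _ in range(1_000)]             # approximate sampling distribution

# The sampling distribution of the mean is far less spread out than
# the sample distribution itself.
print(statistics.stdev(one_sample) > statistics.stdev(sample_means))  # True
```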
5
The mean either lies in a given interval or it does not. No
probability is involved in that respect. However, the probability
statement is appropriate since the population mean is usually
unknown and we use the confidence interval as a measure of the
uncertainty in our estimate of the mean that stems from sampling.
6
This is where families of distributions like the chi-square and the t
come into play to help us estimate population parameters. They are
the theoretical distributions that we need.
7
Note that the use of the sample mean to estimate the population
mean does not deal with the question, raised in chapter 2, as to the
circumstances under which the mean is the best measure of central
tendency. When the population distribution is highly asymmetric,
the population median may be a better measure of central tendency
for some purposes. We would then want point and interval
estimates of the median.
8
Notice that although the distribution of loan balances in figure 5.2
is somewhat asymmetric, the sampling distribution is more
symmetric.
9
Obtaining an interval estimate for the standard deviation is highly
problematic because, unlike the case of the mean, the usual
procedures are invalid when the distribution of the variable is not
normal.
Determining Causation
1
The exact nature of causation, both physical and social, is much
debated. We do not delve into the intricacies in this paper. There
are many detailed discussions of the issues; Bunge (1979) and Hage
and Meeker (1988) are two.
2
The asymmetry feature does not rule out reciprocal effects in the
sense that first attitude affects income, then income affects
attitude, and so on.
3
The three conditions are almost uniformly presented as those
required to “establish” causality. However, the language varies from
authority to authority. This paper follows Bollen (1989) in using the
concept of isolation rather than the more commonly employed concept
of nonspuriousness.
4
In this chapter, we discuss evidence for a causal relationship
between quantitative variables and the methods, used in program
evaluation and the sciences generally, for identifying causes. The
word “cause” is used here in a more specific way than it is used in
auditing. There, “cause” is one of the four elements of a finding, and
the argument for a causal interpretation rests essentially on
plausibility rather than on establishing time-ordered association
and isolating a single cause from other potential ones. The methods
described in this paper may help auditors go beyond plausibility
arguments in the search for causal explanations. See U.S. General
Accounting Office, Government Auditing Standards (Washington,
D.C.: 1988), standard 11 on page 6-3 and standards 21-24 on page
7-5.
5
Judgment is applied in deciding the magnitude of a “sufficient
difference.”
Consumer-exposure-to-campaign and
consumer-purchase-choice may indeed have an
underlying causal association, but the presence of the
other variables will distort the computed association
unless we isolate the variables. That is, the computed
amount of the association between X and Y may be
either greater or less than the true level of association
unless we take steps to control the influence of the
other variables. Control is exerted in two ways: by the
design of the study and by the statistical analysis.
6
Sample surveys and case studies can be used in conjunction with
experimental designs. For example, a sample survey could be used
to collect data from the population of people participating in an
experiment.
7
Unless the causal network is an unusually simple one, just adding
additional variables to the regression equation is not an appropriate
form of analysis.
A good fit between the model and the data implies not
that causal associations estimated by structural
equation modeling are correct but just that the model
is consistent with the data. Other models, yet
untested, may do as well or better.
8
Other common names for the methods are “analysis of covariance
structures” and “causal modeling.”
9
An experiment ordinarily provides strong evidence about causal
associations because the process of random assignment ensures
that the members of treatment and control groups are
approximately equivalent with respect to supplementary variables
that might have an effect on the response variable. Being
essentially equivalent, almost all variables except the treatment are
neutralized in that treatment and control group members are
equally affected by those other variables. For example, even though
a variable like a person’s age might affect a response variable such
as health status, random assignment would ensure that the
treatment and control groups are roughly equivalent, on the
average, with respect to age. In estimating the effect of a health
program, then, the evaluator would not mistake the effect of age on
health condition for a program effect.
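The balancing property of random assignment can be illustrated by simulation; the participants' ages and group sizes below are hypothetical.

```python
# Sketch: random assignment makes treatment and control groups
# roughly equivalent, on average, with respect to a supplementary
# variable such as age.
import random
import statistics

random.seed(3)
ages = [random.randint(20, 80) for _ in range(2_000)]  # hypothetical participants

random.shuffle(ages)                                   # random assignment
treatment, control = ages[:1_000], ages[1_000:]

# The group means differ only slightly, so age is unlikely to be
# mistaken for a treatment effect.
gap = abs(statistics.mean(treatment) - statistics.mean(control))
print(round(gap, 2))
```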
10
The line between ordinal and interval data is not hard and fast.
For example, many analysts with a dependent variable measured at
the ordinal level use regression analysis if they believe the
underlying variable is at the interval level (and limited only to
ordinal because of the measuring instrument).
Avoiding Pitfalls
1
Flexibility usually exists on the fuzzy border between ordinal and
interval variables. Analysts often treat an ordinal variable as if it
were measured at the interval level. In fact, some authorities (see
Kerlinger, 1986, pp. 401-3, for example) believe that most
psychological and educational variables approximate interval
equality fairly well. In any case, instrument construction should
take account of the measurement level desired.
2
The Evaluation Research Society (now merged with Evaluation
Network to become the American Evaluation Association)
published standards that include coverage of reporting issues
(Rossi, 1982). Other standards that give somewhat more attention
to statistical issues are those of the American Association of Public
Opinion Research (1991) and the Council of American Survey
Research Organizations (1986). In 1988, the federal government
solicited comments on a draft Office of Management and Budget
circular establishing guidelines for federal statistical activities. A
final version of the governmentwide guidelines, which included
directions for the documentation and presentation of the results of
statistical surveys and other studies, has not been published.
Confidence Limits: Two statistics that form the upper and lower
bounds of a confidence interval.
Carl Wisler
Lois-ellin Datta
George Silberman
Penny Pickett