Analysis of Data: Chapter 5
Analysis of Data
DATA CLEANING
Data cleaning is an important procedure during which the data
are inspected and erroneous data are corrected where this is necessary,
preferable, and possible. Data cleaning can be done during the stage of
data entry. If this is done, it is important that no subjective decisions
are made. The guiding principle provided by Adèr is that, during
subsequent manipulations of the data, information should always be
cumulatively retrievable. In other words, it should always be possible
to undo any data set alterations. Therefore, it is important not to
throw information away at any stage of the data cleaning phase. All
information should be saved (i.e., when altering variables, both the
original values and the new values should be kept, either in a duplicate
data set or under a different variable name), and all alterations to the
data set should be carefully and clearly documented, for instance in a
syntax file or a log.
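As an illustration, the principle of cumulative retrievability might be implemented along the following lines in Python with pandas; the variable name age, the correction rule, and the log text are hypothetical:

    import pandas as pd

    # Hypothetical raw data containing an impossible age value.
    df = pd.DataFrame({"age": [34, 51, -7, 29]})

    # Keep the original values under a different variable name ...
    df["age_original"] = df["age"]

    # ... and store the corrected values as a new variable instead of overwriting.
    df["age_clean"] = df["age"].where(df["age"] >= 0)  # impossible values set to missing

    # Document the alteration in a log so that every step can be undone.
    cleaning_log = ["age_clean: negative ages set to missing; raw values kept in age_original"]
    print(df)
    print(cleaning_log)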
QUALITY OF DATA
The quality of the data should be checked as early as possible.
Data quality can be assessed in several ways, using different types
of analyses: frequency counts, descriptive statistics (mean, standard
deviation, median), normality (skewness, kurtosis, frequency
histograms, normal probability plots), associations (correlations,
scatter plots).
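For instance, most of these checks can be run in a few lines of Python with pandas; the file name and column names below are placeholders:

    import pandas as pd

    df = pd.read_csv("study_data.csv")  # hypothetical data set

    # Frequency counts and descriptive statistics (mean, standard deviation, median).
    print(df["group"].value_counts())
    print(df[["age", "score"]].describe())

    # Indications of normality: skewness and kurtosis per variable.
    print(df[["age", "score"]].skew())
    print(df[["age", "score"]].kurtosis())

    # Associations: correlation matrix (scatter plots can be drawn with matplotlib).
    print(df[["age", "score"]].corr())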
Other initial data quality checks are:
• Checks on data cleaning: have decisions influenced the distribution
of the variables? The distribution of the variables before data
cleaning is compared to the distribution of the variables after
data cleaning to see whether data cleaning has had unwanted
effects on the data.
• Analysis of missing observations: are there many missing values,
and are they missing at random?
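A quick inspection of missing observations could, for example, look as follows in pandas; the file and column names are again placeholders:

    import pandas as pd

    df = pd.read_csv("study_data.csv")  # hypothetical data set

    # Number and percentage of missing values per variable.
    print(pd.DataFrame({"n_missing": df.isna().sum(),
                        "pct_missing": df.isna().mean() * 100}))

    # A rough check on whether missingness is related to group membership,
    # i.e. whether the values appear to be missing at random.
    print(df.assign(score_missing=df["score"].isna())
            .groupby("group")["score_missing"].mean())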
Quality of measurements
The quality of the measurement instruments should only be
checked during the initial data analysis phase when this is not the
focus or research question of the study. One should check whether
the structure of the measurement instruments corresponds to the
structure reported in the literature.
There are two ways to assess measurement quality:
• Confirmatory factor analysis
• Analysis of homogeneity (internal consistency), which gives an
indication of the reliability of a measurement instrument. During
this analysis, one inspects the variances of the items and the
scales, the Cronbach's α of the scales, and the change in
Cronbach's α when an item is deleted from a scale.
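As a sketch, Cronbach's α and the α-if-item-deleted values could be computed as follows, assuming the items of one scale are the columns of a data frame (the item names and scores are made up):

    import numpy as np
    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        # alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Hypothetical scale with four items (columns) and 100 respondents (rows).
    rng = np.random.default_rng(0)
    scale = pd.DataFrame(rng.integers(1, 6, size=(100, 4)), columns=["q1", "q2", "q3", "q4"])

    print("alpha:", cronbach_alpha(scale))
    for item in scale.columns:
        print("alpha without", item, ":", cronbach_alpha(scale.drop(columns=item)))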
Initial transformations
After assessing the quality of the data and of the measurements,
one might decide to impute missing data, or to perform initial
transformations of one or more variables, although this can also be
done during the main analysis phase.
Possible transformations of variables are:
• Square root transformation (if the distribution differs moderately
from normal)
• Log-transformation (if the distribution differs substantially from
normal)
• Inverse transformation (if the distribution differs severely from
normal)
• Make categorical (ordinal / dichotomous) (if the distribution
differs severely from normal, and no transformations help)
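A minimal sketch of these transformations in Python (the values are made up; the log and inverse transformations assume strictly positive values):

    import numpy as np
    import pandas as pd

    x = pd.Series([0.5, 1.2, 3.4, 10.0, 55.0])  # hypothetical, positively skewed variable

    sqrt_x = np.sqrt(x)   # moderate departure from normality
    log_x = np.log(x)     # substantial departure (requires positive values)
    inv_x = 1.0 / x       # severe departure (requires non-zero values)

    # If no transformation helps, the variable can be made categorical,
    # for example dichotomized at the median.
    x_dichotomous = (x > x.median()).astype(int)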
Did the implementation of the study fulfill the intentions of the
research design?
One should check the success of the randomization procedure,
for instance by checking whether background and substantive
variables are equally distributed within and across groups.
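For example, the balance of a background variable across randomized groups might be inspected as follows; the file name, the group labels, and the variable age are assumptions made for this sketch:

    import pandas as pd
    from scipy import stats

    df = pd.read_csv("study_data.csv")  # hypothetical data set

    # Distribution of a background variable within each group.
    print(df.groupby("group")["age"].agg(["mean", "std", "count"]))

    # A simple balance check between two groups (Welch's t-test).
    treated = df.loc[df["group"] == "treatment", "age"].dropna()
    control = df.loc[df["group"] == "control", "age"].dropna()
    print(stats.ttest_ind(treated, control, equal_var=False))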
If the study did not need and/or use a randomization procedure,
one should check the success of the non-random sampling, for
instance by checking whether all subgroups of the population of
interest are represented in the sample.
Other possible data distortions that should be checked are:
• Dropout (this should be identified during the initial data analysis
phase)
• Item nonresponse (whether this is random or not should be
assessed during the initial data analysis phase)
• Treatment quality (using manipulation checks).
Also, the original plan for the main data analyses can and should
be specified in more detail and/or rewritten.
In order to do this, several decisions about the main data analyses
can and should be made:
• In the case of non-normal distributions: should one transform
variables; make variables categorical (ordinal/dichotomous); or
adapt the analysis method?
• In the case of missing data: should one neglect or impute the
missing data; which imputation technique should be used?
• In the case of outliers: should one use robust analysis techniques?
• In case items do not fit the scale: should one adapt the
measurement instrument by omitting items, or rather ensure
comparability with other (uses of the) measurement instrument(s)?
• In the case of (too) small subgroups: should one drop the
hypothesis about inter-group differences, or use small-sample
techniques, like exact tests or bootstrapping (a bootstrap sketch
follows this list)?
• In case the randomization procedure seems to be defective: can
and should one calculate propensity scores and include them as
covariates in the main analyses?
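As mentioned in the list above, bootstrapping is one small-sample option; a minimal sketch for the difference between two small (made-up) subgroups:

    import numpy as np

    rng = np.random.default_rng(0)

    group_a = np.array([4.1, 5.3, 3.8, 6.0, 4.7])  # hypothetical small subgroups
    group_b = np.array([5.9, 6.4, 5.1, 7.2])

    observed_diff = group_a.mean() - group_b.mean()

    # Resample each group with replacement and recompute the difference in means.
    boot_diffs = [rng.choice(group_a, size=group_a.size).mean()
                  - rng.choice(group_b, size=group_b.size).mean()
                  for _ in range(10_000)]

    # Percentile bootstrap confidence interval for the difference in means.
    ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
    print(observed_diff, (ci_low, ci_high))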
Analyses
Several analyses can be used during the initial data analysis phase:
• Univariate statistics
• Bivariate associations (correlations)
• Graphical techniques (scatter plots)
It is important to take the measurement levels of the variables
into account for the analyses, as special statistical techniques are
available for each level:
• Nominal and ordinal variables
o Frequency counts (numbers and percentages)
o Associations
• crosstabulations
• hierarchical loglinear analysis (restricted to a maximum of 8
variables)
• loglinear analysis (to identify relevant/important variables and
possible confounders)
o Exact tests or bootstrapping (in case subgroups are small)
o Computation of new variables
• Continuous variables
o Distribution
• Statistics (M, SD, variance, skewness, kurtosis)
• Stem-and-leaf displays
• Box plots
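To illustrate, a few of these measurement-level checks in pandas and matplotlib; the file and column names (sex, group, score) are placeholders:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("study_data.csv")  # hypothetical data set

    # Nominal/ordinal variables: frequency counts and a crosstabulation.
    print(df["sex"].value_counts())
    print(pd.crosstab(df["sex"], df["group"]))

    # Continuous variables: distribution statistics and a box plot per group.
    print(df["score"].agg(["mean", "std", "var", "skew", "kurt"]))
    df.boxplot(column="score", by="group")
    plt.show()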
Stability of results
It is important to obtain some indication about how generalizable
the results are. While this is hard to check, one can look at the stability
of the results. Are the results reliable and reproducible? Two common
ways of checking this are cross-validation, in which the data are split
into parts to see whether an analysis based on one part also holds for
another part, and sensitivity analysis, in which one studies how the
results behave when parameters of the analysis are systematically varied.
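A very rough split-half check of stability might look like this (hypothetical data; the correlation between age and score stands in for whatever the key result is):

    import pandas as pd

    df = pd.read_csv("study_data.csv")  # hypothetical data set

    # Split the data into two random halves and compare a key result across them.
    half_a = df.sample(frac=0.5, random_state=1)
    half_b = df.drop(half_a.index)

    print(half_a["age"].corr(half_a["score"]))
    print(half_b["age"].corr(half_b["score"]))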
EDA DEVELOPMENT
Tukey held that too much emphasis in statistics was placed on
statistical hypothesis testing (confirmatory data analysis); more
emphasis needed to be placed on using data to suggest hypotheses to
test. In particular, he held that confusing the two types of analyses
and employing them on the same set of data can lead to systematic
bias owing to the issues inherent in testing hypotheses suggested by
the data.
The objectives of EDA are to:
• Suggest hypotheses about the causes of observed phenomena
• Assess assumptions on which statistical inference will be based
• Support the selection of appropriate statistical tools and techniques
• Provide a basis for further data collection through surveys or
experiments
Many EDA techniques have been adopted into data mining and
are being taught to young students as a way to introduce them to
statistical thinking.
TYPES OF DATA
Data can be of several types:
Numerical data
Numerical data (or quantitative data) is data measured or identified
on a numerical scale. Numerical data can be analyzed using statistical
methods, and results can be displayed using tables, charts, histograms
and graphs. For example, a researcher may ask a participant questions
that include the words how often, how many, or what percentage.
The answers to such questions will be numerical. Quantitative data
involves amounts, measurements, or anything of quantity.
Examples of quantitative data would be:
• Counts
o 'there are 643 dots on the ceiling'
o 'there are 25 pieces of bubble gum'
o 'there are 8 planets in the solar system'
• Measurements
Categorical data
Categorical data is a statistical data type consisting of categorical
variables, used for observed data whose value is one of a fixed number
of nominal categories, or for data that has been converted into that
form, for example as grouped data. More specifically, categorical
data may derive from either or both of observations of qualitative
data, where the observations are summarised as counts or cross
tabulations, and observations of quantitative data, where the
observations might be directly observed counts of events or counts
of values falling within given intervals. Often, purely categorical
data are summarised in the form of a contingency table. However,
particularly when considering data analysis, it is common to use the
term "categorical data" to apply to data sets that, while containing
some categorical variables, may also contain non-categorical variables.
Qualitative data
The term qualitative is used to describe certain types of
information. The term is distinguished from the term quantitative
data, in which items are described in terms of quantity and in which
a range of numerical values are used without implying that a particular
numerical value refers to a particular distinct category. However,
data originally obtained as qualitative information about individual
items may give rise to quantitative data if they are summarised by
means of counts; and conversely, data that are originally quantitative
are sometimes grouped into categories to become qualitative data
(for example, income below $20,000, income between $20,000 and
$80,000, and income above $80,000).
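The income example above can be reproduced with a simple grouping step; the income values below are invented:

    import pandas as pd

    incomes = pd.Series([12_000, 35_000, 95_000, 61_000, 18_500])

    # Group a quantitative variable into the three categories from the example.
    income_band = pd.cut(incomes,
                         bins=[-float("inf"), 20_000, 80_000, float("inf")],
                         labels=["below $20,000", "$20,000-$80,000", "above $80,000"])
    print(income_band.value_counts())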
GRAPHS
Graphs are often an excellent way to display your results. In
fact, most good science fair projects have at least one graph.
For any type of graph:
• Generally, you should place your independent variable on the x-
axis of your graph and the dependent variable on the y-axis.
• Be sure to label the axes of your graph, and don't forget to include
the units of measurement (grams, centimeters, liters, etc.).
• If you have more than one set of data, show each series in a
different color or symbol and include a legend with clear labels.
Different types of graphs are appropriate for different
experiments. These are just a few of the possible types of graphs:
A bar graph might be appropriate for comparing different trials
or different experimental groups. It also may be a good choice if
your independent variable is not numerical. (In Microsoft Excel,
generate bar graphs by choosing chart types "Column" or "Bar.")
A time-series plot can be used if your dependent variable is
numerical and your independent variable is time. (In Microsoft Excel,
the "line graph" chart type generates a time series. By default, Excel
simply puts a count on the x-axis. To generate a time series plot with
your choice of x-axis units, make a separate data column that contains
those units next to your dependent variable. Then choose the "XY
(scatter)" chart type, with a sub-type that draws a line.)
An xy-line graph shows the relationship between your dependent
and independent variables when both are numerical and the dependent
variable is a function of the independent variable. (In Microsoft Excel,
choose the "XY (scatter)" chart type, and then choose a sub-type
that does draw a line.)
A scatter plot might be the proper graph if you're trying to show
how two variables may be related to one another. (In Microsoft Excel,
choose the "XY (scatter)" chart type, and then choose a sub-type
that does not draw a line.)
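The same chart types can also be produced outside of Excel; for instance, a matplotlib sketch with made-up measurements (time on the x-axis, temperature on the y-axis):

    import matplotlib.pyplot as plt

    days = [1, 2, 3, 4, 5]                        # independent variable (time)
    temperature = [20.1, 21.4, 19.8, 22.3, 23.0]  # dependent variable
    groups = ["trial A", "trial B", "trial C"]
    group_means = [3.2, 4.1, 2.8]

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))

    axes[0].bar(groups, group_means)             # bar graph: comparing groups or trials
    axes[1].plot(days, temperature, marker="o")  # time-series / xy-line graph
    axes[2].scatter(days, temperature)           # scatter plot: no connecting line

    for ax, title in zip(axes, ["Bar graph", "XY-line graph", "Scatter plot"]):
        ax.set_title(title)
        ax.set_xlabel("independent variable (units)")
        ax.set_ylabel("dependent variable (units)")

    fig.tight_layout()
    plt.show()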
Statistical methods
Many statistical methods are used for statistical analyses. One of
the more popular is item response theory.
Item response theory
Item response theory (IRT) models the response of each examinee
of a given ability to each item in a test. The term item is used
because many test questions
are not actually questions; they might be multiple choice questions
that have incorrect and correct responses, but are also commonly
statements on questionnaires that allow respondents to indicate level
of agreement (a rating or Likert scale), or patient symptoms scored
as present/absent. IRT is based on the idea that the probability of a
correct/keyed response to an item is a mathematical function of person
and item parameters. The person parameter is called latent trait or
ability; it may, for example, represent a person's intelligence or the
strength of an attitude. Item parameters include difficulty (location),
discrimination (slope or correlation), and pseudoguessing (lower
asymptote).
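One common way to write such an item response function is the three-parameter logistic model, sketched below; the specific parameter values are made up:

    import numpy as np

    def three_pl(theta, a, b, c):
        # Probability of a correct/keyed response given person ability theta,
        # discrimination a (slope), difficulty b (location), and
        # pseudoguessing c (lower asymptote).
        return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

    abilities = np.linspace(-3, 3, 7)
    print(three_pl(abilities, a=1.2, b=0.5, c=0.2))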
The concept of the item response function was around before
1950. The pioneering work of IRT as a theory occurred during the
1950s and 1960s. Three of the pioneers were the Educational Testing
Service psychometrician Frederic M. Lord, the Danish mathematician
Georg Rasch, and Austrian sociologist Paul Lazarsfeld, who pursued
parallel research independently. Key figures who furthered the
progress of IRT include Benjamin Wright and David Andrich. IRT
did not become widely used until the late 1970s and 1980s, when
personal computers gave many researchers access to the computing
power necessary for IRT.
Among other things, the purpose of IRT is to provide a framework
for evaluating how well assessments work, and how well individual
items on assessments work. The most common application of IRT
is in education, where psychometricians use it for developing and
refining exams, maintaining banks of items for exams, and equating
for the difficulties of successive versions of exams (for example, to
allow comparisons between results over time).
IRT models are often referred to as latent trait models. The term
latent is used to emphasize that discrete item responses are taken to
be observable manifestations of hypothesized traits, constructs, or
attributes, not directly observed, but which must be inferred from
the manifest responses. Latent trait models were developed in the
field of sociology, but are virtually identical to IRT models.
IRT is generally regarded as an improvement over classical test
theory (CTT). For tasks that can be accomplished using CTT, IRT
generally brings greater flexibility and provides more sophisticated
information. Some applications, such as computerized adaptive
testing, are enabled by IRT and cannot reasonably be performed
using only classical test theory.