Analysis
Data Analysis
By the time you get to the analysis of your
data, most of the really difficult work has
been done. It's much more difficult to:
define the research problem; develop and
implement a sampling plan;
conceptualize, operationalize and test
your measures; and develop a design
structure. If you have done this work well,
the analysis of the data is usually a fairly
straightforward affair.
Data Preparation
involves checking or logging the data in;
checking the data for accuracy; entering
the data into the computer; transforming
the data; and developing and
documenting a database structure that
integrates the various measures.
Logging the Data
In any research project you may have
data coming from a number of different
sources at different times:
In every research project, you should generate a printed codebook that describes the data and indicates
where and how it can be accessed. Minimally the codebook should include the following items for each
variable:
variable name
variable description
variable format (number, date, text)
instrument/method of collection
date collected
respondent or group
variable location (in database)
notes
The codebook is an indispensable tool for the analysis team. Together with the database, it should
provide comprehensive documentation that enables other researchers who might subsequently want to
analyze the data to do so without any additional information.
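As a rough illustration, the minimal codebook items listed above could be kept in a small machine-readable structure alongside the printed copy. This is only a sketch: the class name `CodebookEntry`, its field names, and every value in the example entry are hypothetical and do not come from any particular study.

```python
from dataclasses import dataclass, asdict


@dataclass
class CodebookEntry:
    """One codebook row; the fields mirror the minimal items listed above."""
    name: str
    description: str
    fmt: str                  # "number", "date", or "text"
    instrument: str           # instrument/method of collection
    date_collected: str
    respondent_or_group: str
    location: str             # where the variable lives in the database
    notes: str = ""


# Hypothetical entry: all values are invented for illustration.
gpa = CodebookEntry(
    name="gpa",
    description="Cumulative grade point average",
    fmt="number",
    instrument="registrar records",
    date_collected="2020-05-15",
    respondent_or_group="student",
    location="table students, column gpa",
)

# The codebook itself is just a lookup from variable name to its entry.
codebook = {gpa.name: gpa}
print(asdict(codebook["gpa"]))
```

Keeping the entries in one structure like this makes it easy to print the codebook or check that every variable in the database has been documented.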
Descriptive vs. Inferential Statistics
Descriptive statistics are typically distinguished from
inferential statistics.
With descriptive statistics you are simply describing what
is or what the data shows.
With inferential statistics, you are trying to reach
conclusions that extend beyond the immediate data alone.
For instance, we use inferential statistics to try to infer from the
sample data what the population might think. Or, we use
inferential statistics to make judgments of the probability that an
observed difference between groups is a dependable one or one
that might have happened by chance in this study. Thus, we use
inferential statistics to make inferences from our data to more
general conditions; we use descriptive statistics simply to
describe what's going on in our data.
Descriptive vs. Inferential Statistics
[Diagram: Statistics has two branches. Descriptive statistics organize, summarize and present data. Inferential statistics make inferences, test hypotheses and make predictions.]
Descriptive Statistics
For instance: The batting average doesn't tell you whether the
batter is hitting home runs or singles. It doesn't tell whether
she's been in a slump or on a streak.
Similarly, the GPA doesn't tell you whether the student was in
difficult courses or easy ones, or whether they were courses in
their major field or in other disciplines. Even given these
limitations, descriptive statistics provide a powerful summary
that may enable comparisons across people or other units.
Univariate Analysis
Univariate analysis involves the examination
across cases of one variable at a time. There
are three major characteristics of a single
variable that we tend to look at:
the distribution
the central tendency
the dispersion
In most situations, you would describe all three
of these characteristics for each of the
variables in your study.
The Distribution - a summary of the frequency of individual values
or ranges of values for a variable.
How would you do this for a variable like income or GPA? With these
variables there can be a large number of possible values, with relatively
few people having each one. In this case, we group the raw scores into
categories according to ranges of values. For instance, we might look at
GPA according to the letter grade ranges. Or, we might group income into
four or five ranges of income values.
Distributions may also be displayed using
percentages.
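The grouping described above can be sketched in a few lines. The GPA values and the letter-grade cut-offs below are assumptions made up for illustration; a real study would take both from its own data and grading scheme.

```python
from collections import Counter

# Hypothetical GPA values, invented for illustration.
gpas = [3.9, 3.4, 2.8, 3.1, 1.9, 2.5, 3.7, 3.0, 2.2, 3.6]


def letter_range(gpa):
    """Map a GPA to a letter-grade range (the cut-offs are assumed)."""
    if gpa >= 3.5:
        return "A"
    if gpa >= 2.5:
        return "B"
    if gpa >= 1.5:
        return "C"
    return "D or below"


# Frequency distribution: how many cases fall in each range.
counts = Counter(letter_range(g) for g in gpas)

# The same distribution expressed as percentages of all cases.
percentages = {grade: 100 * n / len(gpas) for grade, n in counts.items()}

for grade in ["A", "B", "C", "D or below"]:
    print(f"{grade:12s} {counts[grade]:3d} {percentages.get(grade, 0.0):6.1f}%")
```

The same pattern would work for income: replace `letter_range` with a function that assigns each income to one of four or five ranges.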
Notice that for the same set of 8 scores we got three different
values -- 20.875, 20, and 15 -- for the mean, median and
mode respectively. If the distribution is truly normal (i.e., bell-
shaped), the mean, median and mode are all equal to each
other.
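To see how the three measures can disagree, here is a small check using Python's standard `statistics` module. The eight scores are an assumption: they are one set of values that reproduces the mean, median and mode quoted above, since the original scores are not shown here.

```python
import statistics

# Assumed data: eight scores chosen to reproduce the quoted values.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)      # sum of the scores divided by their count
median = statistics.median(scores)  # middle of the sorted scores
mode = statistics.mode(scores)      # most frequently occurring score

print(mean, median, mode)
```

The single high score of 36 pulls the mean above the median, which is one sign that this set of scores is not symmetric.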
Central Tendency
Dispersion
If the variables of interest are dichotomous in nature (e.g., "pass" vs. "no
pass"), then use McNemar's chi-square test.
If there are more than two variables that were measured in the same
sample, then we would customarily use repeated measures ANOVA.
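For the dichotomous case, McNemar's chi-square can be computed by hand from the two discordant counts. The sketch below assumes each person is measured twice on the same "pass"/"no pass" outcome, uses the version of the statistic without a continuity correction, and works with hypothetical counts. The function name `mcnemar_chi_square` is illustrative.

```python
import math


def mcnemar_chi_square(b, c):
    """McNemar's chi-square (no continuity correction).

    b and c are the two discordant counts: cases that changed from
    "no pass" to "pass", and cases that changed the other way.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs, the statistic is undefined")
    chi2 = (b - c) ** 2 / (b + c)
    # With 1 degree of freedom, the chi-square upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 15 people changed to "pass", 5 changed to "no pass".
chi2, p = mcnemar_chi_square(15, 5)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

Only the people whose outcome changed enter the statistic; those who passed both times or failed both times carry no information about a shift.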
Relationships between variables
To express a relationship between two
variables one usually computes the
correlation coefficient.
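A minimal sketch of that computation follows, assuming the Pearson product-moment coefficient is the one meant. The function name `pearson_r` and the paired values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 cases")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements, invented for illustration.
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 55, 61, 64, 70]

r = pearson_r(hours_studied, test_scores)
print(f"r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates little or none.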
A type I error occurs when one rejects the null hypothesis when it is true.
Examples:
If the cholesterol level of healthy men is normally distributed with a mean of
180 and a standard deviation of 20, and men with cholesterol levels over
225 are diagnosed as not healthy, what is the probability of a type I
error?
z = (225 - 180) / 20 = 2.25; the corresponding upper-tail area is .0122, which is the
probability of a type I error.
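The tail area can be checked without a z-table using `NormalDist` from Python's standard library (Python 3.8 or later). The variable names are illustrative.

```python
from statistics import NormalDist

# Cholesterol in healthy men: mean 180, standard deviation 20.
healthy = NormalDist(mu=180, sigma=20)
cutoff = 225

# z-score of the diagnostic cutoff.
z = (cutoff - healthy.mean) / healthy.stdev

# Probability that a healthy man scores above the cutoff (a type I error).
type_i = 1 - healthy.cdf(cutoff)

print(f"z = {z:.2f}, P(type I) = {type_i:.4f}")
```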
If the cholesterol level of healthy men is normally distributed with a mean of
180 and a standard deviation of 20, at what level (in excess of 180) should
men be diagnosed as not healthy if you want the probability of a type I
error to be 2%?
An upper-tail area of 2% corresponds to a z-score of about 2.05; 2.05 × 20 = 41;
180 + 41 = 221.
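The same cutoff can be found by inverting the normal distribution with `NormalDist.inv_cdf` from the standard library (Python 3.8 or later). The variable names are illustrative.

```python
from statistics import NormalDist

# Cholesterol in healthy men: mean 180, standard deviation 20.
healthy = NormalDist(mu=180, sigma=20)
alpha = 0.02  # desired probability of a type I error

# The cutoff that leaves 2% of healthy men in the upper tail.
threshold = healthy.inv_cdf(1 - alpha)
z = (threshold - healthy.mean) / healthy.stdev

print(f"z = {z:.2f}, threshold = {threshold:.1f}")
```

The computed threshold is slightly above 221 because the z-score is not rounded to two decimals first; rounded to a whole number it agrees with the table-based answer.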
Threats to Conclusion Validity
A type II error occurs when one rejects the alternative hypothesis (fails to reject the null
hypothesis) when the alternative hypothesis is true.