ANALYSIS

Data Analysis
 By the time you get to the analysis of your
data, most of the really difficult work has
been done. It's much more difficult to:
define the research problem; develop and
implement a sampling plan;
conceptualize, operationalize and test
your measures; and develop a design
structure. If you have done this work well,
the analysis of the data is usually a fairly
straightforward affair.
Data Preparation
 involves checking or logging the data in;
checking the data for accuracy; entering
the data into the computer; transforming
the data; and developing and
documenting a database structure that
integrates the various measures.
Logging the Data
 In any research project you may have
data coming from a number of different
sources at different times:

 mail survey returns
 coded interview data
 pretest or posttest data
 observational data
Logging Data
 You need to set up a procedure for logging the information
and keeping track of it until you are ready to do a
comprehensive data analysis.
 Researchers differ in how they prefer to keep track of incoming data. In most cases, you will want to set up a database that enables you to assess at any time what data is already in and what is still outstanding. You could do this with any standard computerized database program (e.g., Microsoft Access, Claris FileMaker), although this requires familiarity with such programs. Or, you can accomplish this using standard statistical programs (e.g., SPSS, SAS, Minitab, DataDesk) and running simple descriptive analyses to get reports on data status.
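 A minimal sketch of such a status report in Python (the file name data_log.csv, the column names, and the instrument list are all hypothetical):

```python
import csv
from collections import defaultdict

# Hypothetical log: one row per instrument received, with columns
# respondent_id, instrument, date_received.
EXPECTED = {"mail_survey", "interview", "pretest", "posttest"}

received = defaultdict(set)
with open("data_log.csv", newline="") as f:
    for row in csv.DictReader(f):
        received[row["respondent_id"]].add(row["instrument"])

# Report data status: what is in, what is still outstanding.
for rid, instruments in sorted(received.items()):
    missing = EXPECTED - instruments
    status = "complete" if not missing else f"missing: {', '.join(sorted(missing))}"
    print(f"{rid}: {len(instruments)}/{len(EXPECTED)} received ({status})")
```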
Logging Data
 It is also critical that the data analyst retain the original data records -- returned surveys, field notes, test protocols, and so on -- for a reasonable period of time. Generally, professional researchers retain such records for at least 5-7 years.
 The data analyst should always be able to trace
a result from a data analysis back to the original
forms on which the data was collected. A
database for logging incoming data is a critical
component in good research record-keeping.
Logging Data
 The database structure is the manner in which you intend to store the data for the study so that it can be
accessed in subsequent data analyses. You might use the same structure you used for logging in the
data or, in large complex studies, you might have one structure for logging data and another for storing it.

 In every research project, you should generate a printed codebook that describes the data and indicates where and how it can be accessed. Minimally, the codebook should include the following items for each variable:

variable name
variable description
variable format (number, date, text)
instrument/method of collection
date collected
respondent or group
variable location (in database)
notes

 The codebook is an indispensable tool for the analysis team. Together with the database, it should
provide comprehensive documentation that enables other researchers who might subsequently want to
analyze the data to do so without any additional information.
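 As a sketch, codebook entries can also be kept as structured data next to the database. The field values below are invented; only the field names come from the list above:

```python
# One codebook entry per variable; the example values are hypothetical.
codebook = [
    {
        "variable_name": "gpa",
        "description": "Cumulative grade point average",
        "format": "number",
        "instrument": "registrar records",
        "date_collected": "2024-01-15",
        "respondent_or_group": "all enrolled students",
        "location": "students.gpa (column 7)",
        "notes": "4.0 scale",
    },
]

for entry in codebook:
    for field, value in entry.items():
        print(f"{field}: {value}")
```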
Descriptive vs. Inferential Statistics
 Descriptive statistics are typically distinguished from
inferential statistics.
 With descriptive statistics you are simply describing what
is or what the data shows.
 With inferential statistics, you are trying to reach
conclusions that extend beyond the immediate data alone.
For instance, we use inferential statistics to try to infer from the
sample data what the population might think. Or, we use
inferential statistics to make judgments of the probability that an
observed difference between groups is a dependable one or one
that might have happened by chance in this study. Thus, we use
inferential statistics to make inferences from our data to more
general conditions; we use descriptive statistics simply to
describe what's going on in our data.
Descriptive vs. Inferential Statistics
[Overview diagram: Statistics divides into two branches. Inferential statistics makes inferences, performs hypothesis testing, and makes predictions; descriptive statistics summarizes, presents, and organizes data.]
Descriptive Statistics

 Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.
Descriptive Statistics
 Descriptive Statistics are used to present quantitative descriptions in a
manageable form.

 In a research study you may have many measures, or you may measure a large number of people on any one measure. Descriptive statistics reduce lots of data into a simpler summary.
 For instance, consider a simple number used to summarize how well a batter is performing in baseball: the batting average. This single number is simply the number of hits divided by the number of times at bat (reported to three decimal places). A batter who is hitting .333 is getting a hit one time in every three at bats. One batting .250 is hitting one time in four. The single number describes a large number of discrete events.
 Another example is Grade Point Average (GPA). This single number
describes the general performance of a student across a potentially
wide range of course experiences.
Descriptive Statistics
 Limitations:

 Every time you try to describe a large set of observations with a single indicator you run the risk of distorting the original data or losing important detail.

 For instance: The batting average doesn't tell you whether the
batter is hitting home runs or singles. It doesn't tell whether
she's been in a slump or on a streak.
 Similarly, the GPA doesn't tell you whether the student was in
difficult courses or easy ones, or whether they were courses in
their major field or in other disciplines. Even given these
limitations, descriptive statistics provide a powerful summary
that may enable comparisons across people or other units.
Univariate Analysis
Univariate analysis involves the examination
across cases of one variable at a time. There
are three major characteristics of a single
variable that we tend to look at:
 the distribution
 the central tendency
 the dispersion
In most situations, you would describe all three of these characteristics for each of the variables in your study.
Univariate Analysis
 The Distribution - a summary of the frequency of individual values
or ranges of values for a variable.

 For instance, a typical way to describe the distribution of college students is by year in college, listing the number or percent of students at each of the four years. Or, we describe gender by listing the number or percent of males and females. In these cases, the variable has few enough values that we can list each one and summarize how many sample cases had each value.

 How would you do this for a variable like income or GPA? With these
variables there can be a large number of possible values, with relatively
few people having each one. In this case, we group the raw scores into
categories according to ranges of values. For instance, we might look at
GPA according to the letter grade ranges. Or, we might group income into
four or five ranges of income values.
Univariate Analysis
 Distributions may also be displayed using percentages.
 For example, you could use percentages to describe the:
percentage of people in different income levels
percentage of people in different age ranges
percentage of people in different ranges of standardized test scores
Univariate Analysis
 Central Tendency. The central
tendency of a distribution is an estimate
of the "center" of a distribution of values.
There are three major types of estimates
of central tendency:
 Mean
 Median
 Mode
Central Tendency - Mean
 The Mean or average is probably the most
commonly used method of describing central
tendency. To compute the mean all you do is add up
all the values and divide by the number of values.
 For example, the mean or average quiz score is determined by summing all the scores and dividing by the number of students taking the quiz. For instance, consider the test score values:
 15, 20, 21, 20, 36, 15, 25, 15
 The sum of these 8 values is 167, so the mean is
167/8 = 20.875.
Central Tendency - Median
 The Median is the score found at the exact middle of the
set of values. One way to compute the median is to list
all scores in numerical order, and then locate the score
in the center of the sample. For example, if there are
500 scores in the list, score #250 would be the median.
If we order the 8 scores shown above, we would get:
 15,15,15,20,20,21,25,36
 There are 8 scores and score #4 and #5 represent the
halfway point. Since both of these scores are 20, the
median is 20. If the two middle scores had different
values, you would have to interpolate to determine the
median.
Central Tendency - Mode
 The mode is the most frequently occurring value in the set of scores. To determine the mode, you might again order the scores as shown above, and then count each one. The most frequently occurring value is the mode. In our example, the value 15 occurs three times and is the mode. In some distributions there is more than one modal value. For instance, in a bimodal distribution there are two values that occur most frequently.

 Notice that for the same set of 8 scores we got three different
values -- 20.875, 20, and 15 -- for the mean, median and
mode respectively. If the distribution is truly normal (i.e., bell-
shaped), the mean, median and mode are all equal to each
other.
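 The three estimates can be checked with Python's standard library, using the same eight scores:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(statistics.mean(scores))    # 20.875
print(statistics.median(scores))  # 20 (average of the two middle scores)
print(statistics.mode(scores))    # 15 (occurs three times)
```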
Dispersion
 Dispersion refers to the spread of the values around the central tendency. There are two common measures of dispersion, the range and the standard deviation. The range is simply the highest value minus the lowest value.
 In our example distribution, the high value is 36 and the low is 15, so the range is 36 - 15 = 21.
Dispersion – Standard Deviation
 The Standard Deviation is a more accurate and detailed estimate of dispersion, because an outlier can greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands apart from the rest of the values).
 The Standard Deviation shows the relation that a set of scores has to the mean of the sample.
Dispersion – S.D.
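 In its usual sample form, the standard deviation is

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

where $\bar{x}$ is the sample mean and $n$ is the number of scores. (The population form divides by $n$ instead of $n - 1$.)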
Dispersion
 What is normal?
 The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following conclusions can be reached:
approximately 68% of the scores in the sample fall within one standard deviation of the mean
approximately 95% of the scores in the sample fall within two standard deviations of the mean
approximately 99.7% of the scores in the sample fall within three standard deviations of the mean
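 For the eight example scores, the range and standard deviation work out as follows (a quick standard-library check):

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(max(scores) - min(scores))   # range: 36 - 15 = 21
print(statistics.stdev(scores))    # sample SD (n-1 formula), about 7.08
print(statistics.pstdev(scores))   # population SD (n formula), about 6.62
```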
Inferential Statistics
 With inferential statistics, you are trying
to reach conclusions that extend beyond
the immediate data alone. For instance,
we use inferential statistics to try to infer
from the sample data what the population
might think. Or, we use inferential statistics
to make judgments of the probability that
an observed difference between groups is
a dependable one or one that might have
happened by chance in this study.
What is the difference between nominal,
ordinal and interval variables?
 A nominal variable (sometimes called a categorical variable)
is one that has two or more categories, but there is no intrinsic
ordering to the categories. 

 For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories. Hair color is also a categorical variable having a number of categories (blonde, brown, brunette, red, etc.) and again, there is no agreed way to order these from highest to lowest. A purely categorical variable is one that simply allows you to assign categories, but you cannot clearly order the categories. If the variable has a clear ordering, then that variable would be an ordinal variable, as described below.
Ordinal data
 An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the categories.
 For example, suppose you have a variable,
economic status, with three categories
(low, medium and high).  In addition to
being able to classify people into these
three categories, you can order the
categories as low, medium and high.
Ordinal data
 Another example is educational experience (with values such as elementary school graduate, high school graduate, some college and college graduate). These can be ordered as elementary school, high school, some college, and college graduate. Even though we can order these from lowest to highest, the spacing between the values may not be the same across the levels of the variable.
Interval data
 An interval variable is similar to an ordinal variable,
except that the intervals between the values of the
interval variable are equally spaced. 

 For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000. The second person makes $5,000 more than the first person and $5,000 less than the third person, and the size of these intervals is the same. If there were two other people who make $90,000 and $95,000, the size of the interval between these two people is also the same ($5,000).
Ratio data
 There is hardly a difference between ratio and interval data.
 Ratio is sometimes distinguished from interval in that there is a "natural zero" in ratio data.
 You probably won't have to worry about whether your data is ratio data, because ratio data already meets the criteria for interval data.
Four Types of Data - Summary
 Nominal Data
classification data, e.g. m/f
no ordering, e.g. it makes no sense to state that M > F
arbitrary labels, e.g., m/f, 0/1, etc.
 Ordinal Data
ordered, but differences between values are not important
e.g., political parties on a left-to-right spectrum given labels 0, 1, 2
e.g., Likert scales: rank your degree of satisfaction on a scale of 1..5
e.g., restaurant ratings
 Interval Data
ordered, constant scale, but no natural zero
differences make sense, but ratios do not (e.g., 30°-20° = 20°-10°, but 20° is not twice as hot as 10°!)
e.g., temperature (C, F), dates
 Ratio Data
ordered, constant scale, natural zero
e.g., height, weight, age, length
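 One way to make the summary operational is a small lookup of which statistics are meaningful at each level. This mapping is a common rule of thumb rather than part of the slides:

```python
# Rule-of-thumb mapping from measurement level to meaningful statistics.
# Each level also permits everything allowed at the levels below it.
PERMISSIBLE_STATS = {
    "nominal":  ["mode", "frequency counts"],
    "ordinal":  ["median", "percentiles"],
    "interval": ["mean", "standard deviation"],
    "ratio":    ["geometric mean", "coefficient of variation", "ratios"],
}

def stats_for(level: str) -> list[str]:
    """Collect all statistics permissible at a given level."""
    order = ["nominal", "ordinal", "interval", "ratio"]
    allowed = []
    for lvl in order[: order.index(level) + 1]:
        allowed.extend(PERMISSIBLE_STATS[lvl])
    return allowed

print(stats_for("interval"))
# ['mode', 'frequency counts', 'median', 'percentiles', 'mean', 'standard deviation']
```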
Why does it matter whether a variable
is categorical, ordinal or interval?
 Statistical computations and analyses assume that the variables have a specific level of measurement.
 An average requires a variable to be interval. Sometimes you have variables that are "in between" ordinal and interval, for example, a five-point Likert scale with values "strongly agree", "agree", "neutral", "disagree" and "strongly disagree". If we cannot be sure that the intervals between each of these five values are the same, then we would not be able to say that this is an interval variable; we would say that it is an ordinal variable.
 However, in order to be able to use statistics that assume the variable is interval, we will assume that the intervals are equally spaced.
Does it matter if my dependent
variable is normally distributed?
 When you are doing a t-test or ANOVA, the assumption is that the distribution of the sample means is normally distributed.
 However, even if the distribution of the individual observations is not normal, the distribution of the sample means will generally be normally distributed if your sample size is about 30 or larger (more conservatively, closer to 100).
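 This behavior (the central limit theorem) is easy to illustrate with a simulation. The exponential population and the sample size of 30 below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw 10,000 samples of size 30 from a clearly non-normal
# (right-skewed exponential) population and average each one.
sample_means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)

# The means cluster near the population mean (1.0), with spread close
# to the theoretical 1/sqrt(30) ~= 0.18, and their distribution is
# approximately bell-shaped even though the population is skewed.
print(sample_means.mean())  # about 1.0
print(sample_means.std())   # about 0.18
```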
Data Analysis
 Sample size.

 Another factor that often limits the applicability of tests based on the assumption that the sampling distribution is normal is the size of the sample of data available for the analysis (sample size; n). We can assume that the sampling distribution is normal even if we are not sure that the distribution of the variable in the population is normal, as long as our sample is large enough (e.g., 100 or more observations).
 However, if our sample is very small, then those tests can be
used only if we are sure that the variable is normally
distributed, and there is no way to test this assumption if the
sample is small.
Parametric and nonparametric methods

 Specifically, nonparametric methods were developed to be used in cases when the researcher knows nothing about the parameters of the variable of interest in the population (hence the name nonparametric).
 In more technical terms, nonparametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation) describing the distribution of the variable of interest in the population.
 Therefore, these methods are also sometimes (and more appropriately) called parameter-free methods or distribution-free methods.
Brief Overview of Nonparametric Methods

 Basically, there is at least one nonparametric equivalent for each parametric general type of test.
 In general, these tests fall into the following categories:
Tests of differences between groups (independent samples);
Tests of differences between variables (dependent samples);
Tests of relationships between variables.
 Think back to Experimental vs. Quasi-Experimental designs.
Differences between independent groups
 Usually, when we have two samples that we want to compare concerning their mean value for some variable of interest, we would use the t-test for independent samples.
 Nonparametric alternatives for this test are:
the Wald-Wolfowitz runs test
the Mann-Whitney U test
the Kolmogorov-Smirnov two-sample test.
 If we have multiple groups, use analysis of variance (see ANOVA/MANOVA); the nonparametric equivalents to this method are the Kruskal-Wallis analysis of ranks and the Median test.
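 A sketch of the parametric test and its most common nonparametric alternative in Python, assuming scipy is available (the group scores are invented):

```python
from scipy import stats

# Hypothetical scores for two independent groups.
group_a = [12, 15, 14, 10, 13, 18, 11, 16]
group_b = [22, 19, 24, 18, 21, 25, 20, 23]

# Parametric: t-test for independent samples.
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Nonparametric alternative: Mann-Whitney U test.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)

print(f"t-test:        t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney:  U = {u_stat:.1f}, p = {u_p:.4f}")

# With more than two groups: stats.f_oneway (ANOVA) or
# stats.kruskal (Kruskal-Wallis), passing the groups as separate arguments.
```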
Differences between dependent groups
 If we want to compare two variables measured in the same sample, we would customarily use the t-test for dependent samples (in Basic Statistics, for example, if we wanted to compare students' math skills at the beginning of the semester with their skills at the end of the semester).
 Nonparametric alternatives to this test are:
the Sign test
Wilcoxon's matched pairs test.
 If the variables of interest are dichotomous in nature (i.e., "pass" vs. "no pass"), then: McNemar's Chi-square test.
 If there are more than two variables measured in the same sample, then we would customarily use repeated measures ANOVA.
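 A corresponding sketch for dependent (paired) samples, again with invented before/after scores:

```python
from scipy import stats

# Hypothetical math-skill scores for the same students,
# measured at the start and end of the semester.
before = [55, 60, 48, 72, 66, 58, 63, 70]
after  = [61, 64, 55, 75, 70, 63, 65, 74]

# Parametric: t-test for dependent (paired) samples.
t_stat, t_p = stats.ttest_rel(before, after)

# Nonparametric alternative: Wilcoxon's matched pairs test.
w_stat, w_p = stats.wilcoxon(before, after)

print(f"paired t-test: t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Wilcoxon:      W = {w_stat:.1f}, p = {w_p:.4f}")
```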
Relationships between variables
 To express a relationship between two variables one usually computes the correlation coefficient.
 Nonparametric equivalents to the standard correlation coefficient are:
 Spearman R
 Kendall Tau
 Coefficient Gamma
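 Two of these have direct scipy equivalents; the paired measurements below are hypothetical:

```python
from scipy import stats

# Hypothetical paired measurements on the same cases.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

# Parametric: Pearson correlation coefficient.
r, r_p = stats.pearsonr(x, y)

# Nonparametric equivalents.
rho, rho_p = stats.spearmanr(x, y)   # Spearman R
tau, tau_p = stats.kendalltau(x, y)  # Kendall Tau

print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
print(f"Kendall tau  = {tau:.3f}")
```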
Relationships between variables
 If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs. "female"), appropriate nonparametric statistics for testing the relationship between the two variables are:
the Chi-square test
the Phi coefficient
the Fisher exact test.
 In addition, a simultaneous test for relationships between multiple cases is available: the Kendall coefficient of concordance. FYI - this test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli.
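 For a 2x2 table like "passed/failed by male/female", the Chi-square and Fisher exact tests are one call each in scipy (the counts are invented):

```python
from scipy import stats

# Hypothetical 2x2 contingency table:
#              passed  failed
table = [[30, 10],   # male
         [25, 15]]   # female

# Chi-square test of independence.
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, dof = {dof}")

# Fisher exact test (for 2x2 tables; useful with small counts).
odds_ratio, fisher_p = stats.fisher_exact(table)
print(f"Fisher exact: odds ratio = {odds_ratio:.2f}, p = {fisher_p:.4f}")
```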
Threats to Conclusion Validity

WHAT THE HELL IS THIS?

Type I & Type II Errors


Threats to Conclusion Validity
 Type I Error:
 A Type I error occurs when one rejects the null hypothesis when it is true.
 Examples:
If the cholesterol level of healthy men is normally distributed with a mean of 180 and a standard deviation of 20, and men with cholesterol levels over 225 are diagnosed as not healthy, what is the probability of a Type I error?
z = (225 - 180)/20 = 2.25; the corresponding tail area is .0122, which is the probability of a Type I error.
If the cholesterol level of healthy men is normally distributed with a mean of 180 and a standard deviation of 20, at what level (in excess of 180) should men be diagnosed as not healthy if you want the probability of a Type I error to be 2%?
2% in the tail corresponds to a z-score of 2.05; 2.05 × 20 = 41; 180 + 41 = 221.
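 Both answers can be reproduced with the normal distribution in scipy:

```python
from scipy.stats import norm

mean, sd, cutoff = 180, 20, 225

# P(Type I error) = P(X > 225 | healthy): upper-tail area beyond z = 2.25.
z = (cutoff - mean) / sd
print(norm.sf(z))        # about 0.0122

# Cutoff for a 2% Type I error rate: z with 2% in the upper tail.
z_02 = norm.isf(0.02)    # about 2.05
print(mean + z_02 * sd)  # about 221
```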
Threats to Conclusion Validity
 A Type II error occurs when one fails to reject the null hypothesis when the alternative hypothesis is true.
 The probability of a Type II error is denoted by β (beta).
 Examples:
If men predisposed to heart disease have a mean cholesterol level of 300 with a standard deviation of 30, but only men with a cholesterol level over 225 are diagnosed as predisposed to heart disease, what is the probability of a Type II error? (The null hypothesis is that a person is not predisposed to heart disease.)
z = (225 - 300)/30 = -2.5, which corresponds to a tail area of .0062, which is the probability of a Type II error (beta).
If men predisposed to heart disease have a mean cholesterol level of 300 with a standard deviation of 30, above what cholesterol level should you diagnose men as predisposed to heart disease if you want the probability of a Type II error to be 1%? (The null hypothesis is that a person is not predisposed to heart disease.)
1% in the tail corresponds to a z-score of -2.33; -2.33 × 30 = -70; 300 - 70 = 230.
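 The Type II calculations follow the same pattern:

```python
from scipy.stats import norm

mean, sd, cutoff = 300, 30, 225

# P(Type II error) = P(X < 225 | predisposed): lower-tail area at z = -2.5.
z = (cutoff - mean) / sd
print(norm.cdf(z))       # about 0.0062

# Cutoff for a 1% Type II error rate: z with 1% in the lower tail.
z_01 = norm.ppf(0.01)    # about -2.33
print(mean + z_01 * sd)  # about 230
```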
So what did he say?
 You can essentially make two kinds of errors about relationships:
conclude that there is no relationship when in fact there is (you missed the relationship or didn't see it - a TYPE II error)
conclude that there is a relationship when in fact there is not (you're seeing things that aren't there - a TYPE I error)
