Chapter Four
Data Preparation and Analysis
Data analysis and interpretation
• Think about analysis EARLY
• Start with a plan
• Code, enter, clean
• Analyze
• Interpret
• Reflect
– What did we learn?
– What conclusions can we draw?
– What are our recommendations?
– What are the limitations of our analysis?
Coding and quantifying
Age (in years):
1 = 1-5 years
2 = 6-10 years
3 = 11-18 years
4 = 19-25 years
5 = >25 years

Educational level:
1 = Below grade 12
2 = Diploma holder
3 = Degree holder
4 = Masters & above

Sex:
1 = Male
2 = Female

Region of country:
1 = West Ethiopia
2 = East Ethiopia
3 = South Ethiopia
4 = North Ethiopia
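As an illustration, such a coding scheme can be applied in software before analysis. A minimal Python sketch, assuming hypothetical raw survey responses (the column names and values below are illustrative):

```python
import pandas as pd

# Hypothetical raw survey responses (illustrative only)
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "region": ["West Ethiopia", "North Ethiopia",
               "South Ethiopia", "East Ethiopia"],
})

# Map each category to its numeric code, following the scheme above
sex_codes = {"Male": 1, "Female": 2}
region_codes = {"West Ethiopia": 1, "East Ethiopia": 2,
                "South Ethiopia": 3, "North Ethiopia": 4}

df["sex_code"] = df["sex"].map(sex_codes)
df["region_code"] = df["region"].map(region_codes)
print(df)
```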
Three types of analysis
• Univariate analysis
– the examination of the distribution of cases on only one
variable at a time (e.g., college graduation)
– Purpose: description
• Bivariate analysis
– the examination of two variables simultaneously (e.g., the
relation between gender and college graduation)
– Purpose: determining the empirical relationship between
the two variables
• Multivariate analysis
– the examination of more than two variables simultaneously
(e.g., the relationship between gender, race, and college
graduation)
– Purpose: determining the empirical relationship among the variables
Univariate Analysis
• Univariate Analysis – The analysis of a single variable, for
purposes of description (examples: frequency distribution,
averages, and measures of dispersion).
It explores each variable in a data set separately.
Frequencies can tell you whether many study participants share a
characteristic of interest (age, gender, etc.)
Graphs and tables can be helpful
Example: Gender >> the number of men/women in a
sample/population
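A minimal sketch of a frequency distribution in Python, assuming a hypothetical sample (the values are illustrative):

```python
import pandas as pd

# Hypothetical sample of a categorical variable
sex = pd.Series(["Male", "Female", "Female", "Male", "Female"])

# Absolute and relative frequencies
print(sex.value_counts())                # counts per category
print(sex.value_counts(normalize=True))  # proportions per category
```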
Univariate Data Analysis (Measures of
Central Tendency)
• Measures of central tendency summarize a distribution with a
single typical or central value
• Commonly used statistics with univariate analysis of
continuous variables:
Mean – an average computed by summing the values of several
observations and dividing by the number of observations.
Mode – an average representing the most frequently observed
value or attribute.
Median – an average representing the value of the “middle”
case in a rank-ordered set of observations.
Range of values – from the minimum value to the maximum value
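These statistics can be computed directly; a minimal Python sketch with hypothetical observations:

```python
import statistics

ages = [23, 25, 25, 31, 40, 25, 37]  # hypothetical observations

print(statistics.mean(ages))    # sum of values / number of observations
print(statistics.median(ages))  # middle value of the rank-ordered data
print(statistics.mode(ages))    # most frequently observed value
print(min(ages), "-", max(ages))  # range of values: minimum to maximum
```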
Measures of Dispersion
• Measures of dispersion reflect the spread of the
distribution
– Range is the difference between largest & smallest scores;
high – low
– Variance is the average of the squared differences
between each observation and the mean
– Standard deviation is the square root of variance
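A minimal Python sketch of these three measures, with hypothetical scores; note that the definition of variance above (the average squared deviation) corresponds to the population variance, so pvariance/pstdev are used:

```python
import statistics

scores = [4, 8, 6, 5, 3, 10]  # hypothetical observations

high_low = max(scores) - min(scores)  # range: largest minus smallest
var = statistics.pvariance(scores)    # average squared deviation from the mean
sd = statistics.pstdev(scores)        # square root of the variance

print(high_low, var, sd)
```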
Distributions
Frequency Distributions : A description of the number of
times the various attributes of a variable are observed in
a sample.
Dispersion – The distribution of values around
some central value, such as an average.
Standard Deviation – A measure of dispersion
around the mean, calculated so that approximately
68 percent of the cases will lie within plus or minus
one standard deviation from the mean, 95 percent
within two, and 99.7 percent within three standard
deviations.
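The 68/95/99.7 pattern can be checked empirically; a minimal sketch using simulated normal data (the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=100_000)  # hypothetical normal sample

m, s = x.mean(), x.std()
for k in (1, 2, 3):
    share = np.mean(np.abs(x - m) <= k * s)
    print(f"within {k} SD: {share:.3f}")  # approx. 0.683, 0.954, 0.997
```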
Bivariate Analysis
• Bivariate Analysis – The analysis of two
variables simultaneously, for the purpose of
determining the empirical relationship
between them.
Bivariate analysis allows us to:
• Look at associations/relationships between two
variables.
• Look at measures of the strength of the relationship
between two variables.
• Test hypotheses about relationships between two
nominal or ordinal level variables.
Cross-tabulation
We use cross-tabulation when:
• We want to look at relationships between two
or three variables.
• We want a descriptive statistical measure to
tell us whether differences among groups are
large enough to indicate some sort of
relationship among variables.
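A minimal sketch of a cross-tabulation with a chi-square test of independence, assuming hypothetical data on gender and college graduation:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: two nominal variables
df = pd.DataFrame({
    "gender":    ["M", "F", "F", "M", "F", "M", "F", "M"],
    "graduated": ["yes", "yes", "no", "no", "yes", "yes", "no", "yes"],
})

# Cross-tabulation of the two variables
table = pd.crosstab(df["gender"], df["graduated"])
print(table)

# Chi-square test: are the group differences large enough to
# indicate a relationship between the variables?
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```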
Multivariate Analysis
• Multivariate Analysis :The analysis of the
simultaneous relationships among several
variables.
Regression Analysis
Multiple Linear Regression
• Multiple Regression is a statistical method for estimating the
relationship between a dependent variable and two or more
independent (or predictor) variables.
• MLR is a method for studying the relationship between a
dependent variable and two or more independent variables.
• Purposes:
– Prediction
– Explanation
– Theory building
Linear Regression and Correlation
• The relationship between the mean of the response
variable and the level of the explanatory variable is
assumed to be approximately linear (a straight line)
• Model: Y = b0 + b1x + e, where e is random error
– b0: mean response when x = 0 (the y-intercept)
– b1: change in mean response when x increases by 1 unit (the slope)
– b0 + b1x: mean response when the explanatory variable takes on the
value x
– b0 and b1 are unknown population parameters (like μ)
• b1 > 0: positive association
• b1 < 0: negative association
• b1 = 0: no association
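A minimal sketch of fitting this model to hypothetical paired data:

```python
from scipy.stats import linregress

# Hypothetical paired observations
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

fit = linregress(x, y)
print(fit.intercept)  # b0: estimated mean response when x = 0
print(fit.slope)      # b1: estimated change in mean response per unit of x
print(fit.rvalue)     # correlation; its sign matches the sign of b1
```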
Design Requirements
One dependent variable (criterion)
Two or more independent variables
(predictor or explanatory variables).
Sample size: >= 50 (at least 10 times as
many cases as independent variables)
MLR Model: Basic Assumptions
• Independence: The data of any particular subject are
independent of the data of all other subjects
• Normality: in the population, the data on the dependent
variable are normally distributed for each of the possible
combinations of the level of the X variables; each of the
variables is normally distributed
• Homoscedasticity: In the population, the variances of the
dependent variable for each of the possible combinations of the
levels of the X variables are equal.
• Linearity: In the population, the relation between the
dependent variable and the independent variable is linear
when all the other independent variables are held constant.
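Two of these assumptions can be probed from the residuals of a fitted model. A minimal sketch with simulated data, using a Shapiro-Wilk test for normality and a Breusch-Pagan test for homoscedasticity (one common choice among several diagnostics):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import shapiro

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))  # hypothetical predictors
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

# Normality of residuals (Shapiro-Wilk): small p suggests non-normality
print(shapiro(model.resid))

# Homoscedasticity (Breusch-Pagan): small p suggests unequal variances
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, X)
print(lm_p)
```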
Simple vs. Multiple Regression
Simple regression:
• One dependent variable Y predicted from one
independent variable X
• One regression coefficient
• r2: proportion of variation in dependent variable Y
predictable from X

Multiple regression:
• One dependent variable Y predicted from a set of
independent variables (X1, X2 ... Xk)
• One regression coefficient for each independent variable
• R2: proportion of variation in dependent variable Y
predictable from the set of independent variables (X’s)
MLR Equation
Y = a + b1X1 + b2X2 ... + bnXn
• Y = the dependent variable, or the variable to be predicted.
• a = the constant (the Y-intercept of raw-score equations),
representing the value of Y when all Xs = 0.
• b = the b weights, or partial regression coefficients; each b
shows the relative contribution of its independent variable
to the dependent variable when controlling for the effects
of the other predictors.
• X = the independent or predictor variables.
MLR Output
• The following notions are essential for the
understanding of MLR output: R2, adjusted R2,
constant, b coefficient, beta, F-test, t-test
• For MLR “R2” (the coefficient of multiple
determination) is used rather than “r” (Pearson’s
correlation coefficient) to assess the strength of this
more complex relationship (as compared to a
bivariate correlation)
Adjusted R square and b coefficient
• The adjusted R2 adjusts for the inflation in R2 caused by the number of
variables in the equation. As the sample size increases above 20 cases per
variable, adjustment is less needed (and vice versa).
• When comparing the R2 of an original set of variables to the R2 after
additional variables have been included, the researcher is able to identify
the unique variation explained by the additional set of variables.
• b coefficient measures the amount of increase or decrease in the
dependent variable for a one-unit difference in the independent variable,
controlling for the other independent variable(s) in the equation.
Various Significance Tests
• Testing R2
– Test R2 through an F test
– Test of competing models (difference between R2)
through an F test of difference of R2s
• Testing b
– Test of each partial regression coefficient (b) by t-tests
– Comparison of partial regression coefficients with each
other: t-test of the difference between standardized
partial regression coefficients (β)
F and t tests
• The F-test is used as a general indicator of the
probability that any of the predictor variables
contribute to the variance in the dependent variable
within the population.
• The null hypothesis is that the predictors’ weights are
all effectively equal to zero, i.e., that none of the
predictors contributes to the variance in the dependent
variable in the population
• t-tests are used to test the significance of each
predictor in the equation.
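All of these quantities can be read off a fitted multiple regression. A minimal sketch with simulated data using statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))  # two hypothetical predictors
y = 3.0 + 1.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=80)

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

print(model.rsquared)      # R2: variation in Y explained by the predictors
print(model.rsquared_adj)  # adjusted R2: corrects for number of predictors
print(model.params)        # constant (a) and b coefficients
print(model.fvalue, model.f_pvalue)  # F-test: do any predictors contribute?
print(model.tvalues, model.pvalues)  # t-tests: significance of each predictor
```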
Running MLR in SPSS: 1) Analyze, 2) Regression, 3) Linear.
[SPSS screenshots: the regression dialog and the output tables.
In the output, interpret the coefficients, the R square, and the
ANOVA (F-test) result.]
Repeated Measures ANOVA
• Between Subjects Design
– ANOVA in which each participant takes part in
only one of the treatment groups (for example,
one of three groups).
• Within Subjects or Repeated Measures
Design
– Participants receive a single treatment and the
outcome of the treatment is measured at
different time points, for example 3 (before
treatment, immediately after, and 6 months
after treatment)
RM ANOVA Vs. Paired T test
• Repeated measures ANOVA is an extension of the paired t-test.
• Like t-tests, repeated measures ANOVA gives us the statistical
tools to determine whether or not change has occurred over
time.
• Repeated measures ANOVA compares the average score at
multiple time periods for a single group of subjects.
• t-tests compare average scores at two different time periods
for a single group of subjects.
• Solving a repeated measures ANOVA requires combining the
data from the multiple time periods into a single time factor for
analysis.
RM ANOVA: Understanding the terms & analysis
interpretation
• The first step in solving repeated measures ANOVA is to
combine the data from the multiple time periods into a single
time factor for analysis.
• The different time periods are analogous to the categories of
the independent variable in a one-way analysis of variance.
• The time factor is then tested to see if the mean for the
dependent variable is different for some categories of the time
factor.
• If the time factor is statistically significant in the ANOVA test,
then Bonferroni pairwise comparisons are computed to
identify specific differences between time periods.
RM ANOVA: Understanding the terms & analysis
interpretation
• When the dependent variable is measured at three time
periods, there are three paired comparisons:
Example:
• time 1 versus time 2 (profitability before vs. immediately after
the promotion)
• time 2 versus time 3 (immediately after the promotion vs. the
follow-up measure)
• time 1 versus time 3 (long-term effect: before the promotion vs.
the follow-up post-promotion measure)
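A minimal sketch of this analysis, assuming hypothetical long-format data in which each subject's outcome is recorded at each of the three time points (all names and values are illustrative):

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one row per subject per time point
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":    ["t1", "t2", "t3"] * 4,
    "profit":  [10, 14, 12, 9, 13, 11, 11, 16, 13, 8, 12, 10],
})

# Repeated measures ANOVA: does mean profit differ across time points?
res = AnovaRM(data=df, depvar="profit", subject="subject",
              within=["time"]).fit()
print(res)
# If the time factor is significant, follow up with Bonferroni-adjusted
# pairwise comparisons (e.g., paired t-tests between time points).
```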
Statistical Assumptions of RM ANOVA
• Independence
• Normality
• Homogeneity of within-treatment variances: In one-way ANOVA,
we expect the variances to be equal & the samples are not related
to one another (so no covariance or correlation)
• Sphericity: all variances and covariances are equal to each other
RM ANOVA is ideal for testing hypotheses on treatment
effectiveness when ethical constraints restrict the use of
control subjects.
Correlation Coefficient
• Measures the strength of the linear association
between two variables
• Takes on the same sign as the slope estimate from
the linear regression
• Not affected by linear transformations of y or x
• Does not distinguish between dependent and
independent variable (e.g. height and weight)
• Population parameter: ρ (rho)
• Pearson’s correlation coefficient:
r = Sxy / √(Sxx · Syy), with −1 ≤ r ≤ 1
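A minimal sketch computing Pearson's r for hypothetical paired observations:

```python
from scipy.stats import pearsonr

# Hypothetical paired observations (e.g., height and weight)
x = [160, 165, 170, 175, 180, 185]
y = [55, 58, 63, 70, 72, 80]

r, p = pearsonr(x, y)
print(r, p)  # r lies between -1 and 1; its sign matches the slope
```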
Reading assignment
How to interpret results:
• R2
• β (beta coefficient)
• Significance level (p-value)
• t-test
• F-test
• Mean
• Median
• Standard deviation