Assumptions and Data Transformation
Data Analyses
• General cases in data analysis
• Distortion of assumptions
• Missing data
General Assumptions of ANOVA
1. Normality: populations are normally distributed
2. Homogeneity of variance: populations have equal variances
3. Independence of errors: independent random samples are drawn
4. The main effects are additive
5. No interaction between blocks and treatments
Random, Independent, and Normally Distributed Errors
Violations of the normality assumption do not affect the validity of the analysis of variance too seriously.
There are tests for normality, but it is rather pointless to apply them unless the number of samples we are dealing with is fairly large.
Independence implies that there is no relation between the size of the error terms and the experimental grouping to which they belong.
It is important to avoid having all plots receiving a given treatment occupy adjacent positions in the field.
The best insurance against seriously violating the first assumption of the analysis of variance is to carry out the randomization appropriate to the particular design.
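As a sketch of how such a randomization might be generated in SAS with PROC PLAN (the factor names and sizes here are hypothetical, chosen for illustration):

proc plan seed=12345;
  factors block=4 ordered plot=3;   /* 4 blocks of 3 plots, in field order */
  treatments tmt=3 random;          /* 3 treatments randomized within each block */
run;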
Normality
Reason:
• ANOVA is an Analysis of Variance
• More specifically, an analysis of two variances: the ratio of two variances
• Statistical inference is based on the F distribution, which is given by the ratio of two chi-squared distributions
• No surprise, then, that each variance in the ANOVA ratio comes from a parent normal distribution
The calculations can always be carried out no matter what the distribution is; they are algebraic operations separating sums of squares. Normality is only needed for statistical inference.
Diagnosis: Normality
• The points on the normality plot must more or less follow a line to support the claim that the data are normally distributed.
• There are statistical tests to verify this formally.
• The ANOVA method we learn here is not sensitive to the normality assumption; a mild departure from the normal distribution will not change our conclusions much.
• Goodness-of-fit tests:
  Kolmogorov-Smirnov D
  Cramér-von Mises W²
  Anderson-Darling A²
Checking for Normality
Reminder:
Normality of the RESIDUALS is assumed. The original data are assumed normal as well, but each group may have a different mean if Ha is true. The practice is to first fit the model, THEN output the residuals, and then test the residuals for normality. This approach is always correct.
TOOLS
1. Histogram and/or box-plot of all residuals (e_ij).
2. Normal probability (Q-Q) plot.
3. Formal test for normality.
Histogram of Residuals
proc glm data=stress;
  class sand;                              /* treatment factor */
  model resistance = sand / solution;      /* one-way ANOVA; print parameter estimates */
  output out=resid r=r_resis p=p_resis;    /* save residuals and predicted values */
  title1 'Compression resistance in concrete beams as';
  title2 ' a function of percent sand in the mix';
run;

proc capability data=resid;
  histogram r_resis / normal;              /* histogram of residuals with normal overlay */
  ppplot r_resis / normal square;          /* probability plot against the normal */
run;
Formal Tests of Normality
• Kolmogorov-Smirnov test; Anderson-Darling test (both based on the empirical CDF).
• Shapiro-Wilk test; Ryan-Joiner test (both correlation-based tests, applicable for n < 50).
• D'Agostino's test (n >= 50).
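A minimal SAS sketch for obtaining several of these tests on the saved residuals (reusing the resid data set created above; the NORMAL option prints Shapiro-Wilk, Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling results):

proc univariate data=resid normal;
  var r_resis;
run;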
Shapiro-Wilk statistic (on the ordered values $Y_{(1)} \le \dots \le Y_{(n)}$):

$$W = \frac{\left[\sum_{j=1}^{k} a_j \left(Y_{(n-j+1)} - Y_{(j)}\right)\right]^2}{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}, \qquad k = \lfloor n/2 \rfloor$$

D'Agostino's statistic:

$$D = \frac{\sum_{i=1}^{n}\left(i - \tfrac{n+1}{2}\right) Y_{(i)}}{n^2 s}, \qquad Y = \frac{\sqrt{n}\,\left(D - 0.28209479\right)}{0.02998598}$$

Tests for Homogeneity of Variances
Bartlett's test compares the pooled variance with the individual group variances:

$$C = (n_T - t)\,\ln s_p^2 - \sum_{i=1}^{t} (n_i - 1)\,\ln s_i^2, \qquad s_p^2 = \frac{\sum_{i=1}^{t} (n_i - 1)\, s_i^2}{n_T - t}$$

If $C > \chi^2_{(t-1),\alpha}$, then apply the correction factor

$$CF = 1 + \frac{1}{3(t-1)}\left[\sum_{i=1}^{t}\frac{1}{n_i - 1} - \frac{1}{n_T - t}\right]$$

and compare $C/CF$ with $\chi^2_{(t-1),\alpha}$.

Levene's test is an ANOVA on the absolute residuals $z_{ij} = |y_{ij} - \bar{y}_{i\cdot}|$:

$$L = \frac{\sum_{i=1}^{t} n_i\,(\bar{z}_{i\cdot} - \bar{z}_{\cdot\cdot})^2 \,/\, (t-1)}{\sum_{i=1}^{t}\sum_{j=1}^{n_i} (z_{ij} - \bar{z}_{i\cdot})^2 \,/\, (n_T - t)}, \qquad n_T = \sum_{i=1}^{t} n_i$$

Reject H0 if $L > F_{\alpha,\,df_1,\,df_2}$, where $df_1 = t - 1$ and $df_2 = n_T - t$.
Tabachnick and Fidell (2001) use the Fmax ratio (the largest group variance divided by the smallest) as a rule of thumb rather than using a table of critical values. Variances may be treated as homogeneous provided that:
• the Fmax ratio is no greater than 10, and
• the sample sizes of the groups are approximately equal (ratio of smallest to largest no greater than 4).
Tests for Homogeneity of Variances
More importantly:
WARNING:
Homogeneity of variance testing is only available for unweighted one-way models.
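For a one-way model such as the sand example above, these tests can be requested on the MEANS statement of PROC GLM (a sketch reusing the earlier stress data set):

proc glm data=stress;
  class sand;
  model resistance = sand;
  means sand / hovtest=levene;     /* Levene's test on absolute residuals */
  means sand / hovtest=bartlett;   /* Bartlett's chi-squared test */
run;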
Tests for Homogeneity of Variances (Randomized Complete Block Design and/or Factorial Design)
In a CRD, the variance of each treatment group is checked for homogeneity.
In a factorial or RCBD design, each cell's variance should be checked.
Independent Observations
• No correlation between error terms
• No correlation between independent variables and errors
Positively correlated data inflate the true standard error of the treatment means:
• The standard error computed under independence understates it, so the treatment means are estimated less accurately than the reported standard error suggests.
Independence Tests
If we have some notion of how the data were collected, we can check whether any autocorrelation exists.
The Durbin-Watson statistic looks at the correlation between each value and the value before it.
• Data must be sorted in the correct order for meaningful results.
• For example, samples collected over time would be ordered by time if the results are suspected to depend on time.
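For reference, with residuals $e_t$ sorted in collection order, the Durbin-Watson statistic is
$$d = \frac{\sum_{t=2}^{n}\left(e_t - e_{t-1}\right)^2}{\sum_{t=1}^{n} e_t^2},$$
with values near 2 indicating no autocorrelation. A minimal SAS sketch (treating percent sand as a numeric regressor, and assuming the observations are already sorted in collection order):

proc reg data=stress;
  model resistance = sand / dw;   /* DW option prints the Durbin-Watson statistic */
run;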
Independence
A positive correlation between means and variances is often encountered when there is a wide range of sample means.
Data that often show a relation between variances and means are data based on counts and data consisting of proportions or percentages.
Transforming the data can frequently solve these problems.
Remedial Measures for Dependent Data
The first defense against dependent data is proper study design and randomization.
• Designs can be implemented that take correlation into account, e.g., a crossover design.
Look for environmental factors unaccounted for.
• Add covariates to the model if they are causing correlation, e.g., quantified learning curves.
If no underlying factors can be found to account for the autocorrelation:
• Use a different model, e.g., a random effects model.
• Transform the independent variables using the correlation coefficient.
The Main Effects Are Additive
For each design, there is a mathematical model called the linear additive model.
It means that the value of an experimental unit is made up of a general mean plus main effects plus an error term.
When the effects are not additive, the treatment effects may instead be multiplicative, as illustrated below.
In the case of multiplicative treatment effects, there are again transformations that will change the data to fit the additive model.
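As a concrete illustration (for an RCBD with treatment effects $\tau_i$ and block effects $\beta_j$; the notation is ours, not the source's):
$$Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij} \quad \text{(additive model)}$$
$$Y_{ij} = \mu \cdot \tau_i \cdot \beta_j \cdot \varepsilon_{ij} \quad \text{(multiplicative effects)}$$
Taking logarithms of the multiplicative model restores additivity:
$$\ln Y_{ij} = \ln\mu + \ln\tau_i + \ln\beta_j + \ln\varepsilon_{ij}$$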
Data Transformation
There are two ways in which the ANOVA assumptions can be violated:
1. The data may consist of measurements on an ordinal or a nominal scale.
2. The data may not satisfy at least one of the four requirements (normality, homogeneity of variance, independence, additivity).
Two options are available for analyzing such data:
1. Use a non-parametric data analysis.
2. Transform the data before analysis.
Square Root Transformation
It is used when we are dealing with counts of rare events.
Such data tend to follow a Poisson distribution.
If any count is less than 10, it is better to add 0.5 to the value before taking the square root.
Square Root Transformation
Response is positive and continuous.
$$z_i = \sqrt{y_i}$$
This transformation works when we notice the variance changing as a linear function of the mean:
$$\sigma_i^2 = k\,\mu_i, \qquad k > 0$$
[Figure: sample variance plotted against sample mean, rising roughly linearly as the mean increases]
Typical use: counts of rare events.
Logarithmic Transformation
It is used when the standard deviations of the samples are roughly proportional to the means.
There is evidence of multiplicative rather than additive effects.
Data with negative or zero values cannot be transformed; it is suggested to add 1 before transformation.
Logarithmic Transformation
Response is positive and continuous.
$$Z = \ln(Y)$$
This transformation tends to work when the variance is a linear function of the square of the mean:
$$\sigma_i^2 = k\,\mu_i^2, \qquad k > 0$$
• Useful if there is considerable heterogeneity.
Typical use:
1. Growth over time.
2. Concentrations.
3. Counts, when the counts are greater than 10.
[Figure: sample variance plotted against sample mean, increasing sharply with the mean]
Arcsine (Angular) Transformation
It is used when we are dealing with counts expressed as percentages or proportions of the total sample.
Such data generally have a binomial distribution.
Such data normally show the typical characteristic that the variances are related to the means.
ARCSINE SQUARE ROOT
Response is a proportion.
$$Z = \sin^{-1}\sqrt{Y} = \arcsin\sqrt{Y}$$
With proportions, the variance is a linear function of the mean times (1 − mean), where the sample mean is the expected proportion:
$$\sigma_i^2 = k\,\mu_i\,(1 - \mu_i)$$
RECIPROCAL
$$Z = \frac{1}{Y}$$
This transformation works when the variance is a linear function of the fourth power of the mean:
$$\sigma_i^2 = k\,\mu_i^4$$
• Use Y + 1 if zero occurs.
• Useful if the reciprocal of the original scale has meaning.
For the logarithmic transformation ($\ln y$, $y > 0$), the back-transformed mean
$$\exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\ln y_i\right)$$
is the geometric mean of the original data.
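A minimal SAS sketch applying these transformations (the data set trial and the variables y and p are hypothetical; the offsets follow the rules above):

data transformed;
  set trial;
  z_sqrt   = sqrt(y + 0.5);    /* square root; +0.5 for counts below 10  */
  z_log    = log(y + 1);       /* natural log; +1 guards against zeros   */
  z_arcsin = arsin(sqrt(p));   /* arcsine square root for a proportion p */
  z_recip  = 1 / (y + 1);      /* reciprocal; +1 if zeros occur          */
run;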
Missing Data in Randomized Complete Block Design
If only one observation is missing, you can use the following formula:
$$\hat{Y}_{ij} = \frac{tT + bB - S}{(t-1)(b-1)}$$
Where:
• t = number of treatments
• b = number of blocks
• T = sum of observations with the same treatment as the missing observation
• B = sum of observations in the same block as the missing observation
• S = grand total of all available observations
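A quick arithmetic check with hypothetical values (t = 4, b = 3, T = 30, B = 25, S = 120; illustrative only):
$$\hat{Y}_{ij} = \frac{4(30) + 3(25) - 120}{(4-1)(3-1)} = \frac{75}{6} = 12.5$$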
Imputation
The error df should be reduced by one, since M was estimated.
SAS can compute the F statistic, but the p-value will have to be computed separately.
The method is efficient only when a couple of cells are missing.
The usual Type III analysis is available, but be careful with its interpretation.
Little and Rubin use MLE and simulation-based approaches.
PROC MI in SAS v9 implements the Little and Rubin approaches.
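A minimal PROC MI sketch (the data set beams and its variables are hypothetical):

proc mi data=beams out=imputed nimpute=5 seed=54321;
  var resistance sand;   /* variables used in the imputation model */
run;

Each of the five completed data sets in imputed can then be analyzed in the usual way and the results combined with PROC MIANALYZE.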
Missing Data in Latin Square
If only one plot is missing, you can use the following formula:
$$\hat{Y}_{ij(k)} = \frac{t\left(R_i + C_j + T_k\right) - 2G}{(t-1)(t-2)}$$
Where:
• Ri = sum of remaining observations in the ith row
• Cj = sum of remaining observations in the jth column
• Tk = sum of remaining observations in the kth treatment
• G = grand total of the available observations
• t = number of treatments
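Again a quick check with hypothetical values (t = 4, Ri = 30, Cj = 28, Tk = 32, G = 150; illustrative only):
$$\hat{Y}_{ij(k)} = \frac{4(30 + 28 + 32) - 2(150)}{(4-1)(4-2)} = \frac{360 - 300}{6} = 10$$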