Anova & Factor Analysis
ANOVA (analysis of variance) is a statistical analysis tool that separates the total
variability found within a data set into two components: random and systematic
factors. The random factors do not have any statistical influence on the given data
set, while the systematic factors do. The ANOVA test is used to determine the impact
that independent variables have on the dependent variable in a regression analysis.
ANOVA is the synthesis of several ideas and it is used for multiple purposes. As
a consequence, it is difficult to define concisely or precisely.
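As an illustration, a minimal one-way ANOVA can be run in Python with SciPy. The three groups below are hypothetical measurements; the F test asks whether between-group (systematic) variability is large relative to within-group (random) variability.

from scipy import stats

# Hypothetical measurements for three treatment groups
group_a = [23.1, 25.4, 24.8, 26.0, 24.3]
group_b = [27.9, 28.5, 26.7, 29.1, 27.4]
group_c = [22.0, 23.6, 21.8, 24.1, 22.9]

# One-way ANOVA: compares between-group with within-group variability
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")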
Additionally:
Balanced design
An experimental design where all cells (i.e. treatment combinations) have
the same number of observations.
Blocking
A schedule for conducting treatment combinations in an experimental study
such that any effects on the experimental results due to a known change in
raw materials, operators, machines, etc., become concentrated in the levels
of the blocking variable. The reason for blocking is to isolate a systematic
effect and prevent it from obscuring the main effects. Blocking is achieved
by restricting randomization.
Design
A set of experimental runs which allows the fit of a particular model and the
estimate of effects.
DOE
Design of experiments. An approach to problem solving involving collection
of data that will support valid, defensible, and supportable conclusions.
Effect
How changing the settings of a factor changes the response. The effect of a
single factor is also called a main effect.
Error
Unexplained variation in a collection of observations. DOEs typically
require understanding of both random error and lack-of-fit error.
Experimental unit
The entity to which a specific treatment combination is applied.
Factors
Process inputs an investigator manipulates to cause a change in the output.
Lack-of-fit error
Error that occurs when the analysis omits one or more important terms or
factors from the process model. Including replication in a DOE allows
separation of experimental error into its components: lack of fit and random
(pure) error.
Model
Mathematical relationship which relates changes in a given response to
changes in one or more factors.
Random error
Error that occurs due to natural variation in the process. Random error is
typically assumed to be normally distributed with zero mean and a constant
variance. Random error is also called experimental error.
Randomization
A schedule for allocating treatment material and for conducting treatment
combinations in a DOE such that the conditions in one run neither depend on
the conditions of the previous run nor predict the conditions in the
subsequent runs.
Replication
Performing the same treatment combination more than once. Including
replication allows an estimate of the random error independent of any lack of
fit error.
Responses
The output(s) of a process, sometimes called the dependent variable(s).
Treatment
A treatment is a specific combination of factor levels whose effect is to be
compared with other treatments.
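To make several of these terms concrete, the sketch below (with made-up factor names and levels) generates a balanced, replicated full-factorial design and then randomizes the run order.

import itertools
import random

# Hypothetical factors and levels
factors = {
    "temperature": ["low", "high"],
    "pressure": ["low", "high"],
    "catalyst": ["A", "B"],
}
replicates = 2  # each treatment combination is run twice (balanced, replicated design)

# Every treatment combination (cell), repeated for each replicate
runs = list(itertools.product(*factors.values())) * replicates

# Randomization: shuffle so one run neither depends on nor predicts its neighbours
random.shuffle(runs)

for run_number, levels in enumerate(runs, start=1):
    print(run_number, dict(zip(factors.keys(), levels)))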
There are three classes of models used in the analysis of variance, and these are
outlined here.
Fixed-effects models
Fixed-effects models apply when the treatments (factor levels) are deliberately
chosen by the experimenter, so conclusions apply only to those levels.
Random-effects models
Random effects models are used when the treatments are not fixed. This occurs
when the various factor levels are sampled from a larger population. Because
the levels themselves are random variables, some assumptions and the method
of contrasting the treatments (a multi-variable generalization of simple
differences) differ from the fixed-effects model.
Mixed-effects models
Mixed-effects models contain factors of both types: some fixed and some random.
Assumptions of ANOVA
The analysis of variance has been studied from several approaches, the most
common of which uses a linear model that relates the response to the treatments
and blocks. Note that the model is linear in parameters but may be nonlinear
across factor levels. Interpretation is easy when data is balanced across factors
but much deeper understanding is needed for unbalanced data.
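A minimal sketch of this linear-model view, using statsmodels on hypothetical data with one treatment factor and one blocking factor; the ANOVA table partitions the total variability into treatment, block, and residual components.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical balanced data: 3 treatments observed in each of 4 blocks
df = pd.DataFrame({
    "response": [54, 57, 52, 55, 60, 63, 58, 59, 49, 51, 47, 53],
    "treatment": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "block": ["1", "2", "3", "4"] * 3,
})

# Linear in the parameters; categorical factors enter via C()
model = smf.ols("response ~ C(treatment) + C(block)", data=df).fit()
print(anova_lm(model))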
In addition to determining that differences exist among the means, you may
want to know which means differ. There are two types of tests for comparing
means: a priori contrasts and post-hoc tests.
Contrasts are tests set up before running the experiment and post hoc tests are
run after the experiment has been conducted. You can also test for trends
across categories.
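As a sketch of a post-hoc comparison, Tukey's HSD (available in SciPy) compares every pair of group means after the overall ANOVA has indicated that some difference exists; the groups are the same hypothetical ones used earlier.

from scipy import stats

group_a = [23.1, 25.4, 24.8, 26.0, 24.3]
group_b = [27.9, 28.5, 26.7, 29.1, 27.4]
group_c = [22.0, 23.6, 21.8, 24.1, 22.9]

# Pairwise comparisons of group means with family-wise error control
result = stats.tukey_hsd(group_a, group_b, group_c)
print(result)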
Assumptions of MANOVA
Linearity. MANOVA assumes that there are linear relationships among all pairs
of dependent variables, all pairs of covariates, and all dependent variable-
covariate pairs in each cell. When the relationship deviates from linearity,
the power of the analysis is compromised.
Homogeneity of variances. MANOVA assumes that the dependent variables exhibit
equal levels of variance across the range of predictor variables.
MANCOVA
Logistic regression
The goal of logistic regression is to find the best fitting (yet biologically
reasonable) model to describe the relationship between the dichotomous
characteristic of interest (dependent variable = response or outcome variable)
and a set of independent (predictor or explanatory) variables. Logistic
regression generates the coefficients (and their standard errors and significance
levels) of a formula to predict a logit transformation of the probability of
presence of the characteristic of interest:
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + ... + bkXk
Rather than choosing parameters that minimize the sum of squared errors (like
in ordinary regression), estimation in logistic regression chooses parameters
that maximize the likelihood of observing the sample values.
The null model -2 Log Likelihood is given by -2 * ln(L0) where L0 is the likelihood
of obtaining the observations if the independent variables had no effect on the
outcome.
If the P-value for the overall model fit statistic is less than the conventional
0.05 then there is evidence that at least one of the independent variables
contributes to the prediction of the outcome.
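The sketch below, on simulated data, shows these quantities as reported by statsmodels: the model is fitted by maximum likelihood, and the null and fitted log-likelihoods give the overall likelihood-ratio test.

import numpy as np
import statsmodels.api as sm

# Simulated data: 200 cases, two predictors, only the first one truly matters
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0])))
y = rng.binomial(1, true_p)

# Maximum-likelihood fit of the logistic regression
result = sm.Logit(y, sm.add_constant(X)).fit(disp=False)

print("-2 log L (null model):", -2 * result.llnull)
print("-2 log L (fitted model):", -2 * result.llf)
print("LR chi-squared:", result.llr, "p-value:", result.llr_pvalue)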
Regression coefficients
The regression coefficients are the coefficients b0, b1, b2, ..., bk of the regression
equation:
logit(p) = b0 + b1X1 + b2X2 + ... + bkXk
It is clear that when a variable Xi increases by 1 unit, with all other factors
remaining unchanged, then the odds will increase by a factor e^bi.
This factor e^bi is the odds ratio (O.R.) for the independent variable Xi and it
gives the relative amount by which the odds of the outcome increase (O.R.
greater than 1) or decrease (O.R. less than 1) when the value of the
independent variable is increased by 1 unit.
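For a quick numerical illustration with a hypothetical coefficient bi = 0.7: a one-unit increase in Xi multiplies the odds by e^0.7, i.e. roughly doubles them.

import math

b_i = 0.7                      # hypothetical regression coefficient
odds_ratio = math.exp(b_i)     # e^bi
print(round(odds_ratio, 2))    # ~2.01: the odds roughly double per unit of Xi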
Alternatively, you can use the Logit table. For logit(p)=1.08 the probability p of
having a positive outcome equals 0.75.
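Instead of a table, the logit can be converted back to a probability directly, since p = 1 / (1 + e^-logit(p)); the snippet below reproduces the 0.75 figure.

import math

logit_p = 1.08
p = 1 / (1 + math.exp(-logit_p))
print(round(p, 2))   # 0.75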
A large chi-squared value (with a small p-value, < 0.05) indicates a poor fit, while
a small chi-squared value (with a larger p-value, closer to 1) indicates a good
logistic regression model fit.
Classification table
The classification table is another method to evaluate the predictive accuracy
of the logistic regression model. In this table the observed values for the
dependent outcome and the predicted values (at a user defined cut-off value,
for example p=0.50) are cross-classified. In our example, the model correctly
predicts 70% of the cases.
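A sketch of how such a classification table can be built by hand, with hypothetical observed outcomes and predicted probabilities and a 0.50 cut-off.

import numpy as np
import pandas as pd

# Hypothetical observed outcomes and model-predicted probabilities
y_observed = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
p_predicted = np.array([0.2, 0.6, 0.8, 0.4, 0.9, 0.1, 0.7, 0.3, 0.55, 0.85])

y_predicted = (p_predicted >= 0.50).astype(int)   # user-defined cut-off

# Cross-classification of observed versus predicted outcomes
table = pd.crosstab(y_observed, y_predicted, rownames=["observed"], colnames=["predicted"])
print(table)
print("proportion correctly classified:", (y_observed == y_predicted).mean())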
To perform a full ROC curve analysis on the predicted probabilities you can
save the predicted probabilities and next use this new variable in ROC curve
analysis. The Dependent variable used in Logistic Regression then acts as the
Classification variable in the ROC curve analysis dialog box.
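With scikit-learn, the saved predicted probabilities can be scored against the observed dependent variable to obtain the ROC curve and its area; the arrays below reuse the hypothetical values from the classification-table sketch.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_observed = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
p_predicted = np.array([0.2, 0.6, 0.8, 0.4, 0.9, 0.1, 0.7, 0.3, 0.55, 0.85])

fpr, tpr, thresholds = roc_curve(y_observed, p_predicted)   # points of the ROC curve
print("AUC:", roc_auc_score(y_observed, p_predicted))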
A rule of thumb for the minimum number of cases required is
N = 10 k / p
where k is the number of covariates and p is the proportion of positive cases in
the population.
For example, if you have 3 covariates to include in the model and the proportion
of positive cases in the population is 0.20 (20%), then the minimum number of
cases required is
N = 10 x 3 / 0.20 = 150
If the resulting number is less than 100 you should increase it to 100, as
suggested by Long (1997).
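The rule of thumb translates directly into a small helper function; the function name is illustrative.

def minimum_cases(k: int, p: float) -> int:
    """Minimum number of cases: N = 10 * k / p, raised to 100 if smaller (Long, 1997)."""
    n = 10 * k / p
    return max(int(round(n)), 100)

print(minimum_cases(3, 0.20))   # 150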
FACTOR ANALYSIS
The unique factors are uncorrelated with each other. The common factors
themselves can be expressed as linear combinations of the observed variables.
Fi = Wi1 X1 + Wi2 X2 + ... + Wik Xk
where
Fi = estimate of the ith factor
Wi = weight or factor score coefficient
k = number of variables
Factor scores. Factor scores are composite scores estimated for each
respondent on the derived factors.
Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. The
Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is an index
used to examine the appropriateness of factor analysis. High values
(between 0.5 and 1.0) indicate factor analysis is appropriate. Values below
0.5 imply that factor analysis may not be appropriate.
Percentage of variance. The percentage of the total variance attributed to
each factor.
Scree plot. A scree plot is a plot of the Eigenvalues against the number of
factors in order of extraction.
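The eigenvalues underlying a scree plot can be obtained from the correlation matrix of the observed variables; a minimal sketch on random (hypothetical) data:

import numpy as np

# Hypothetical data: 100 respondents, 6 observed variables
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 6))

corr = np.corrcoef(data, rowvar=False)                 # correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first (order of extraction)
print(eigenvalues)   # plot against factor number 1..6 to obtain the scree plot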
DISCRIMINANT ANALYSIS
                                    ANOVA         Regression    Discriminant analysis
Similarities
  Number of dependent variables     One           One           One
  Number of independent variables   Multiple      Multiple      Multiple
Differences
  Nature of the dependent
  variables                         Metric        Metric        Categorical
  Nature of the independent
  variables                         Categorical   Metric        Metric
Discriminant Analysis Model
D = b0 + b1 X1 + b2 X2 + ... + bk Xk
where
D = discriminant score
b = discriminant coefficients or weights
X = predictor or independent variables
The coefficients, or weights (b), are estimated so that the groups differ as much
as possible on the values of the discriminant function. This occurs when the
ratio of between-group sum of squares to within-group sum of squares for the
discriminant scores is at a maximum.
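A minimal sketch of estimating such a discriminant function with scikit-learn's linear discriminant analysis on hypothetical two-group data; the transform gives the discriminant scores D and coef_ holds the weights b.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: two groups of 50 cases with two metric predictors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.transform(X)    # discriminant scores D
print(lda.coef_)             # weights b chosen to maximize between- vs within-group separation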
Ward’s Method