Duleba1996 - Regression Analysis and Multivariate Analysis
Multivariate Analysis
Antoni J. Duleba, M.D., and David L. Olive, M.D.
ABSTRACT: Proper evaluation of data does not necessarily require the use of advanced statistical methods; however, such advanced tools offer the researcher the freedom to evaluate more
Reprint requests: Dr. Duleba, Dept. of Obstetrics & Gynecology, Yale University School of Medicine,
333 Cedar Street, P.O. Box 208063, New Haven, CT, 06520-8063
Copyright ©1996 by Thieme Medical Publishers, Inc., 381 Park Avenue South, New York, NY 10016. All rights reserved.
SEMINARS IN REPRODUCTIVE ENDOCRINOLOGY Volume 14, Number 2 May 1996
sis. We will attempt to describe some general concepts and to stress intuitive rather than mathematical aspects of several statistical methods. At the outset, we will briefly review some basic definitions and concepts pertaining to identification of various types of variables, data collection, and statistical inference. We will also discuss the role of type and number of variables in the selection of an appropriate statistical method.

Table 1. Examples of Statistical Methods Applicable to Analysis of Different Types of Variables

                        Dependent Variable
Independent Variable    Nominal or Ordinal           Interval or Ratio
Nominal or ordinal      χ² test                      Student's t test
                        Other nonparametric tests    Analysis of variance
Interval or ratio       Logistic regression          Regression analysis
STATISTICAL ANALYSIS—Duleba, Olive
of the data facilitates computer analysis, especially when the goal of the analysis is to explore the relationship between the variables. When repeat measures are carried out on the same subject (eg, before and after treatment), a separate variable may be assigned to each measurement. In this format, paired or repeat measures tests may be easily performed.

Statistical Inference

One of the major goals of statistics is to infer the characteristics of the population from observations collected in a sample. This inference may be achieved by the process of testing hypotheses. Typically, the null hypothesis (H0) is presented and tested by a given test statistic. The null hypothesis is rejected when the test shows that the observed data would be improbable if the hypothesis were true.

variables, as well as to predict the values of dependent variables for a given set of values of independent variables.

Most commonly, responses are modeled as linear functions of predictors, and hence linear regression models are evaluated. The advantages of using linear models are ease of interpretation and ease of mathematical manipulations. Several types of regression analysis may be distinguished:

1. Simple regression of one dependent variable and one independent variable.
2. Multiple regression of one dependent variable and several independent variables.
3. Multivariate multiple regression of several dependent variables and several independent variables.
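
As an illustrative sketch of simple regression (one dependent and one independent variable), the snippet below computes the least-squares intercept and slope for a handful of made-up points. The function name `fit_simple_regression`, the helper `predict`, and the data are assumptions for illustration only and do not come from the article.

```python
def fit_simple_regression(x, y):
    """Return least-squares (intercept, slope) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical, perfectly linear data (y = 1 + 2x), used only to check the arithmetic.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
intercept, slope = fit_simple_regression(x, y)


def predict(x_new):
    """Predict the dependent variable for a new value of the independent variable."""
    return intercept + slope * x_new
```

Multiple regression follows the same least-squares idea with several independent variables, but it requires solving a system of equations rather than a single ratio.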
to model the relationship between the age at menarche and the calendar year (Fig. 1a):
of unexplained variation, and therefore they may be represented by a probabilistic relationship:
The strength of the linear relationship may be measured by the Pearson product moment coefficient of correlation, r, which may also be referred to as the correlation coefficient. The correlation coefficient may assume values from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).

A useful independent variable in the regression model requires its correlation coefficient to be significantly different from zero. The null hypothesis that the correlation is zero may be tested with the aid of the t statistic.

The square of the correlation coefficient is called the coefficient of determination, r². The r² represents the proportion of the total sample variability explained by the linear model. Thus, for example, an r² = .78 means that 78% of the sample variability is explained by the model. Statistical software packages

especially when independent variables were under direct control of the researcher.

A small study, with only a few observations, may provide an unreliable model that cannot be reproduced upon repeat sampling. In particular, models derived from small studies should be cross-validated, ie, tested on a new set of observations from the same population.

Finally, it is not sufficient to demonstrate a statistically significant association between an independent and a dependent variable (ie, low probability that the regression coefficient is equal to zero) and a high value of the coefficient of determination r². It is also essential to demonstrate the real, intuitive importance of the observed association. Large and powerful studies may demonstrate a statistically significant but clinically irrelevant association. The clinical relevance of the association between the independent
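
The correlation coefficient r, the coefficient of determination r², and the t statistic discussed above can be computed by hand, as in this minimal sketch. The function name `pearson_summary` and the five data points are hypothetical. The formula t = r ∙ sqrt((n - 2)/(1 - r²)), compared against a t distribution with n - 2 degrees of freedom, is the standard textbook test and is assumed here rather than quoted from the article.

```python
import math


def pearson_summary(x, y):
    """Return (r, r_squared, t) for paired observations x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    r = sxy / math.sqrt(sxx * syy)
    r_squared = r * r
    # t statistic for H0: correlation = 0 (undefined when |r| == 1).
    t = r * math.sqrt((n - 2) / (1 - r_squared))
    return r, r_squared, t


# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r, r_squared, t = pearson_summary(x, y)
```

With only five observations, even a moderately large r yields a small t, which echoes the caution about small studies.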
pendent variables, graphic presentation in multidimensional space cannot be readily accomplished.

Unstandardized regression coefficients are helpful in evaluating the practical relevance of the model. As discussed in the previous section on simple regression, it is important to determine whether unstandardized regression coefficients have practical relevance to a given model.

In a model containing i independent variables, the overall regression test evaluates the global null hypothesis that:

β1 = β2 = . . . = βi = 0.

The alternative hypothesis states that at least one of the regression coefficients is nonzero; this alternative hypothesis states that the model is "useful" (but not necessarily good). The null hypothesis may be tested

Assume that the R² of the first model is .45 and that the R² of the second model is .57. Thus, the second model (ie, after adding a second predictor, % body fat) explains an additional 12% of variation (change of R² by .12).

The addition of more independent variables into the equation never decreases, and in practice almost always increases, the value of R²; this occurs, however, at the cost of decreasing the stability of the model. The real challenge in multiple linear regression analysis is to find a meaningful and stable model with a high R².

Models

In this section we will review some basic concepts of building linear regression models. In particular, we will discuss selection and transformation of variables, interactions between variables, and various
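
The effect of an added predictor on R² can be checked numerically. The sketch below fits a one-predictor and a two-predictor least-squares model by solving the normal equations, then compares their R² values. All names (`solve`, `r_squared`) and all data values are hypothetical; the variable names merely echo the calendar year and % body fat example, and the numbers are not from the article.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(predictors, y):
    """R² of a least-squares fit of y on the given predictor columns plus an intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical observations for illustration only.
year = [0, 1, 2, 3, 4, 5]
body_fat = [22, 25, 21, 27, 24, 26]
age_at_menarche = [13.4, 13.1, 13.3, 12.8, 12.9, 12.6]

r2_one = r_squared([year], age_at_menarche)
r2_two = r_squared([year, body_fat], age_at_menarche)
```

Because the one-predictor model is a special case of the two-predictor model, `r2_two` cannot be smaller than `r2_one`; whether the gain is worth the loss of stability is the judgment the text describes.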
one (Fig. 2). Can we still perform linear regression? The answer is yes, but only when the independent variable is reexpressed. In other words, when the outcome is better predicted by the square or logarithm of the original independent variable, we may model the outcome as a linear function of the new (squared or logarithmically transformed) variable. In some instances, when the outcome has a complex curved relationship with a predictor, we may use polynomials to approximate this relationship. Polynomials are functions:

y = α + β1 ∙ x + β2 ∙ x^2 + . . . + βk ∙ x^k.

Reexpression of a given predictor variable as several new variables created in this fashion may improve the fit of the model.

Sometimes the relationship of a quantitative independent variable and the outcome is best approximated by a binary function. The most important step, the selection of threshold values for dichotomization, may be performed using modified receiver operating characteristic curve analysis.2 In this method, the sum of the true positive rate and true negative rate is plotted as a function of different threshold values of a continuous variable. The cutoff point for dichotomization corresponds to the peak of such a plot.

Transformations may also be performed on dependent variables. This may be appropriate when several independent variables demonstrate the same type of nonlinear relationship with the dependent variable.

Residuals

Residuals are differences between observed and predicted (estimated by the model) values of the dependent variable. Thus, a separate residual may be calculated for each observation. Inspection of residuals is an essential component of regression analysis; in particular, it is helpful in identifying outliers and improving the fit of the model.

Outliers are observations with large residuals; in other words, they are observations that do not fit the model well. Outliers may have two origins: they may arise from errors (ie, mistakes in measuring or documenting), or they may signify unusual observations not explainable by the model. Erroneous observations may be corrected or excluded. Unusual but correct observations may require closer scrutiny and, possibly, a revision of the model.

Inspection of residuals can occasionally detect previously unsuspected relationships between independent and dependent variables. For example, a previously unsuspected quadratic component of the

Table 3. Specification of the Design Variables for Country, Using Country A as the Reference

                  Design Variables
Country (Code)    D1    D2    D3
Country A         0     0     0
Country B         1     0     0
Country C         0     1     0
Country D         0     0     1

may be used in linear regression analysis with the aid of coding. Coding allows expression of categorical variables as one or more so-called dummy or design independent variables. For example, when the categorical variable is dichotomous, coding consists of creating a single design variable D that may assume values of 0 or 1.
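
The coding scheme of Table 3 can be expressed as a small function. This is only a sketch: the name `design_variables` and its parameters are hypothetical, and it assumes the reference category is coded as all zeros, as in the table.

```python
COUNTRIES = ["Country A", "Country B", "Country C", "Country D"]


def design_variables(country, levels=COUNTRIES, reference="Country A"):
    """Return the 0/1 design variables (D1, D2, ...) for one category.

    A categorical variable with k levels needs k - 1 design variables;
    the reference level is represented by zeros in every position.
    """
    if country not in levels:
        raise ValueError(f"unknown category: {country!r}")
    non_reference = [level for level in levels if level != reference]
    return tuple(1 if country == level else 0 for level in non_reference)


coded = {country: design_variables(country) for country in COUNTRIES}
```

With a dichotomous variable the same function returns a single design variable D taking the value 0 or 1.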
that the age at menarche remained constant in country A, but declined in country B. Such interaction between independent variables may be included in the model (with one design variable D1 coding for two countries):

Age at menarche = β0 + β1 ∙ (D1) + β2 ∙ (calendar year) + β3 ∙ (D1) ∙ (calendar year)

where "(D1) ∙ (calendar year)" is a two-variable interaction term. Such an interaction is illustrated in Figure 1c.

While interactions may improve the fit of the model, they may also unnecessarily complicate it. For example, in a model with only five independent variables, there are 10 possible two-variable interaction terms. There may also be three-variable interaction terms (with multiplication of three independent

eigenvalue. Small latent roots (close to zero) and large condition indices (in the range of at least 15 to 30) suggest a significant multicollinearity problem.

Strategies in Model Building

The process of building the best possible model remains largely in the domain of art rather than science. It requires judgment in the selection of variables and their appropriate transformation, the search for relevant interactions, and accounting for multicollinearity. The list of all variables, polynomials, and interaction components may be long. Excessively complex models lose stability and become nonreproducible on new sets of observations. Thus, the selection of the most relevant components is essential. Several options may be considered.

When dealing with a relatively small set of vari
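
The count of possible interaction terms quoted above (10 two-variable terms among five independent variables) can be verified by enumeration. In this sketch the variable names x1 to x5 are placeholders, not variables from the article.

```python
from itertools import combinations

# Five hypothetical independent variables.
variables = ["x1", "x2", "x3", "x4", "x5"]

# Every unordered pair or triple of distinct variables is one candidate interaction term.
two_way = list(combinations(variables, 2))
three_way = list(combinations(variables, 3))

n_two_way = len(two_way)
n_three_way = len(three_way)
```

The list of candidate terms grows quickly as variables are added, which is why the selection of components matters.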
Thus, more lenient threshold levels of P value in the range of .15 to .20 should be considered.

An alternative stepwise procedure is the backward elimination method. The initial model consists of all variables, and at each step the least predictive variable (with the corresponding highest P value) is eliminated from the model. The process continues until even the least predictive remaining variable is significantly predictive, ie, its corresponding P value is below the preset threshold level. Finally, both forward and backward steps may be incorporated in the stepwise procedure, whereby each forward step (an addition of a variable to the model) is followed by a backward elimination (a search for a nonsignificant variable to be removed). This process may help in identifying variables that may initially appear important but become insignificant when one or more subsequently added variables are accounted for. The process re

maximum value is 1. A conditional mean may be modeled with the aid of logit transformation:

E(Y|x) = P = exp(β0 + β1 ∙ x)/[1 + exp(β0 + β1 ∙ x)].

Logit transformation may be presented as a logit function:

log[P/(1 - P)] = β0 + β1 ∙ x.

In this form, logit is a linear function of independent variable(s) x. In general, logistic regression uses logit transformations to explore models where the dependent variable is categorical and at least one of the independent variables is quantitative (ie, interval or ratio). In most instances the dependent variable is binary. When a categorical dependent variable assumes more than two values, one may use a more complex extension of logistic regression, polytomous
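
The two forms of the logit transformation given above are inverses of each other, which a short numerical sketch can confirm. The function names and the coefficient values (β0 = -3.0, β1 = 0.5) are hypothetical and chosen only for illustration.

```python
import math


def logistic_probability(b0, b1, x):
    """P = exp(b0 + b1*x) / [1 + exp(b0 + b1*x)], the modeled conditional mean."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))


def logit(p):
    """log[P / (1 - P)], which is linear in x under the logistic model."""
    return math.log(p / (1 - p))


# Hypothetical coefficients and predictor value.
b0, b1, x = -3.0, 0.5, 4.0
p = logistic_probability(b0, b1, x)
linear_predictor = logit(p)  # recovers b0 + b1 * x
```

The probability stays strictly between 0 and 1 for any x, while its logit can take any real value, which is what allows a linear model on the logit scale.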
We may also easily determine the Ψ and 95% CI for a change of c units of an independent variable xi:

Ψ(c) = exp(c ∙ βi)

and the end points of the 95% CI are:

exp[c ∙ βi ± 1.96 ∙ c ∙ (standard error of βi)].

In logistic regression, an equivalent of the multiple coefficient of determination R² is McFadden's ρ². McFadden's ρ² may assume values between 0 and 1; the closer to one, the better the fit of the model. Notably, even values in the range of .2 to .4 are considered satisfactory.8

The evaluation of the goodness of fit may be accomplished using the Pearson χ² test to compare the difference between the observed and predicted (fitted by the model) probabilities (of the dependent variable). The model fits well when the difference in

P (of at least one rejection by chance alone) = 1 - P (all 20 tests accepted) = 1 - (.95)^20 = .64 (!).

Thus, multiple comparisons may result in a totally unacceptable risk of type I error. Importantly, the above calculation assumes independence of outcomes; such an assumption is rarely fulfilled. In the above example, the levels of hormones are usually highly interdependent; consequently, the true type I error lies somewhere between .05 and .64. When one uses multivariate analysis, the type I error is clearly set and is not obscured by the number of variables; thus, the researcher is protected from reading too much into the data.

The interdependence of outcomes is also a source of the second important problem with univariate analysis: the lack of power to detect the differences in outcomes when these differences are obscured by
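
The multiple-comparison calculation above (20 tests, each at a .05 level) can be reproduced directly. The function name `familywise_error` is an assumption for illustration, and the formula holds only under the independence assumption the text warns about.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false rejection among n independent tests.

    Each test keeps a true null hypothesis with probability 1 - alpha,
    so all n tests keep it with probability (1 - alpha) ** n.
    """
    return 1 - (1 - alpha) ** n_tests


risk_20_tests = familywise_error(0.05, 20)  # about .64, as in the text
risk_1_test = familywise_error(0.05, 1)     # a single test keeps the nominal level
```

Evaluating the function for other numbers of tests shows how quickly the risk climbs as comparisons are added.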
Manova
tion analysis are less likely to be reproducible. The rule of thumb is to have at least 10 observations (subjects) per dependent variable. To limit the number of variables to the most important ones, various stepwise procedures may be used. In this context, important variables are those that are the most helpful in separating the individual groups. The results of stepwise selection should be interpreted with caution; these procedures may exclude not only unimportant variables, but also important ones, when they are highly correlated with other variable(s) already incorporated into the discriminant function. This situation is analogous to that encountered with multicollinearity of independent variables in regression analysis.

powerful test depends on the relationship between the dependent variables.

These tests inform us of the presence or absence of a linear relationship between independent and dependent variables somewhere in the model. Canonical correlation may be used to examine this relationship more closely. Canonical correlation requires the construction of the first canonical variates, linear functions for independent and dependent variables. For example:

Canonical variate A (for independent variables) = β1A ∙ (calendar year) + β2A ∙ (percent of body fat).

Canonical variate B (for dependent variables) = β1B ∙ (age at menarche) + β2B ∙ (age at adrenarche) + β3B ∙ (age at thelarche).
prolide acetate and human menopausal gonadotropins. J Clin Endocrinol Metabol 73:797-803, 1991
3. Velleman PF, Welsch RE: Efficient computing of regression diagnostics. Am Statistician 35:234-242, 1981
4. Gunst RF, Mason RL: Regression Analysis and Its Application. A Data-Oriented Approach. New York, Marcel Dekker, Inc, 1980
5. La Motte LR: The SELECT routines: A program for identifying best subset regression. Applied Stat 21:1972
6. Bendel RB, Afifi AA: Comparison of stopping rules in forward regression. J Am Stat Assoc 72:46-53, 1977
7. Hosmer DW, Lemeshow S: Applied Logistic Regression. New York, John Wiley & Sons, Inc, 1989
8. Hensher D, Johnson LW: Applied Discrete Choice Modelling. London, Croom Helm, 1981
9. Rencher AC: Methods of Multivariate Analysis. New York, John Wiley & Sons, Inc, 1995
10. Wall FJ: The Generalized Variance Ratio or U-Statistic. Albuquerque, The Dikewood Corp, 1967