
Regression Analysis and Multivariate Analysis
Antoni J. Duleba, M.D., and David L. Olive, M.D.

ABSTRACT—Proper evaluation of data does not necessarily require the use of advanced statistical methods; however, such advanced tools offer the researcher the freedom to evaluate more complex hypotheses. This overview of regression analysis and multivariate statistics describes general concepts. Basic definitions and conventions are reviewed. The types of regression analysis are then discussed, including simple regression, multiple regression, multivariate multiple regression, and logistic regression. The various steps required to perform these analyses are described, and the advantages and disadvantages of each are detailed.

KEY WORDS: clinical trials, statistical analysis, regression, multivariate analysis

INTRODUCTION

Statistical analysis is an essential component of a vast majority of research endeavors. There are at least two major incentives to improve the understanding of statistical methods: the increasing "statistical sophistication" of the scientific community (including the reviewers of grants and journals) and the need to evaluate results of complex studies. These increasing demands and expectations behoove researchers to venture beyond the familiar t test or χ2 test into the territory of more advanced statistical methods. Proper evaluation of data does not necessarily require the use of advanced methods; such methods are not inherently better than the basic ones. However, more advanced statistical tools offer the researcher the freedom to evaluate more complex hypotheses, and may protect from drawing false conclusions imposed by the limitations of basic statistics. Familiarity with a wide range of statistical methods is invaluable both before and after conducting the study. Ideally, studies should be designed with a clear understanding of the advantages and limitations of statistical methods intended for data analysis. Consequently, the actual analysis of the results may be the easiest part of the work.

In recent years, access to a broad range of statistical methods has been relatively easy, thanks to tremendous advances in microcomputer technology and the availability of powerful statistical programs such as SAS, SPSS, or SYSTAT. To properly use these programs, the users must overcome what appear to be two diametrically opposed adversaries: fear and overconfidence. Fear may be instilled by forbidding jargon and the complexities of formal mathematical aspects of the analysis. Overconfidence, on the other hand, may arise from the deceptive simplicity of user-friendly computer packages allowing one to easily use and misuse powerful tests. These pitfalls can be avoided by understanding the general concepts of statistical methods. Furthermore, better understanding of statistical terminology may demystify the manuals accompanying statistical packages and improve communication with statisticians.

This overview is intended as a guide to regression analysis and selected aspects of multivariate analysis. We will attempt to describe some general concepts and to stress intuitive rather than mathematical aspects of several statistical methods. At the outset, we will briefly review some basic definitions and concepts pertaining to identification of various types of variables, data collection, and statistical inference. We will also discuss the role of type and number of variables in the selection of an appropriate statistical method.

Department of Obstetrics & Gynecology, Yale University School of Medicine, New Haven, Connecticut
Reprint requests: Dr. Duleba, Dept. of Obstetrics & Gynecology, Yale University School of Medicine, 333 Cedar Street, P.O. Box 208063, New Haven, CT, 06520-8063
Copyright ©1996 by Thieme Medical Publishers, Inc., 381 Park Avenue South, New York, NY 10016. All rights reserved.
SEMINARS IN REPRODUCTIVE ENDOCRINOLOGY Volume 14, Number 2 May 1996

Table 1. Examples of Statistical Methods Applicable to Analysis of Different Types of Variables

                          Dependent Variable
Independent Variable      Nominal or Ordinal           Interval or Ratio
Nominal or ordinal        χ2 test                      Student's t test
                          Other nonparametric tests    Analysis of variance
Interval or ratio         Logistic regression          Regression analysis
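The mapping in Table 1 can be written down as a small lookup. The sketch below is purely illustrative — the type labels and the `candidate_methods` helper are our own naming for this example, not part of any statistical package.

```python
# Table 1 as a lookup: (independent type, dependent type) -> candidate methods.
# The string labels below are illustrative names, not a library API.

TABLE_1 = {
    ("categorical", "categorical"): ["chi-square test", "other nonparametric tests"],
    ("categorical", "quantitative"): ["Student's t test", "analysis of variance"],
    ("quantitative", "categorical"): ["logistic regression"],
    ("quantitative", "quantitative"): ["regression analysis"],
}

# Nominal and ordinal variables are both categorical;
# interval and ratio variables are both quantitative.
COLLAPSE = {"nominal": "categorical", "ordinal": "categorical",
            "interval": "quantitative", "ratio": "quantitative"}

def candidate_methods(independent, dependent):
    """Return the methods Table 1 suggests for the given variable types."""
    return TABLE_1[(COLLAPSE[independent], COLLAPSE[dependent])]

for pair in [("nominal", "ratio"), ("interval", "ordinal"), ("ratio", "interval")]:
    print(pair, "->", candidate_methods(*pair))
```

Note that only the roles and measurement scales of the variables enter the lookup — which is exactly the point of Table 1.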

DEFINITIONS AND CONVENTIONS


Variables

Variables may be classified as qualitative or quantitative. Qualitative (categorical) variables may be nominal or ordinal. Nominal variables merely classify the elements of the sample into at least two categories without assigning a rank to the categories (eg, sex or race). Ordinal variables assign order to their categories (eg, American Fertility Society classification of endometriosis). Quantitative variables are always numerical and may be interval or ratio. A variable is interval when the interval between the measures is meaningful but the ratio is not (eg, temperature expressed in degrees Celsius). The zero point on the scale of an interval variable is not really meaningful, eg, 0°C does not mean that there is "no temperature." Ratio variables are characterized by a meaningful zero point and a meaningful ratio between the measures (eg, weight).

Another important classification pertains to the relationship between variables. In this context, variables may be classified as dependent or independent. Dependent variables are measured responses or outcomes. Independent variables are predictors of outcomes. Independent variables may be varied by the experimenter directly (by modulating the experimental conditions) or indirectly (by selecting subjects with various characteristics). Thus, in a study assessing the effect of two drugs on hirsutism, the independent variable would be nominal (the selected drug), while the dependent variable may be the Ferriman-Gallwey score.

Many statistical methods require clear distinction between the dependent and independent variables; in some settings, however, this distinction may be blurred or entirely irrelevant. Table 1 presents examples of statistical methods appropriate for analysis of various types of independent and dependent variables. Table 2 compares univariate statistical methods (evaluating a single dependent variable) with their multivariate counterparts (evaluating more than one dependent variable).

In many settings, especially observational studies, the assignment of a variable as dependent or independent is arbitrary and may reflect the researcher's bias. For example, in a study evaluating patients with polycystic ovary syndrome, fasting insulin level may be studied as a function of free testosterone level and age; implicit in this model is dependence of insulin (dependent variable) on testosterone (independent variable). Alternatively, the same data may be analyzed by modeling of free testosterone level as a function of fasting insulin level; in this model, the level of testosterone is a dependent variable, while insulin is an independent variable.

Importantly, the cause-effect (directional) relationship between predictors and outcomes may be truly assessed in those studies in which independent variables are under direct control of the researcher. When the researcher only observes and records a collection of variables, the relationship between independent and dependent variables may be described more appropriately in terms of association rather than cause and effect.

Table 2. Comparison of Univariate and Multivariate Analytical Methods

                                     Dependent Variable(s)
Independent Variable                 Single (univariate analysis)   At least two (multivariate analysis)
Categorical, consisting of           Student's t test               Hotelling's T2
  two groups                                                        Discriminant function analysis
Categorical, consisting of           Analysis of variance           Multivariate analysis of variance
  at least three groups                                             Discriminant function analysis
Quantitative, or a combination       Univariate regression          Multivariate regression analysis
  of quantitative and qualitative      analysis                     Canonical correlation

Data Collection

Prior to analysis, the data are usually organized in a table (spreadsheet). Typically, each row of the spreadsheet lists data pertaining to one observation or one subject, while each column lists values assigned to an individual variable. Such organization


of the data facilitates computer analysis, especially when the goal of the analysis is to explore the relationship between the variables. When repeat measures are carried out on the same subject (eg, before and after treatment), a separate variable may be assigned to each measurement. In this format, paired or repeat measures tests may be easily performed.

Statistical Inference

One of the major goals of statistics is to infer the characteristics of the population from observations collected in a sample. This inference may be achieved by the process of testing hypotheses. Typically, the null hypothesis (H0) is presented and tested by a given test statistic. The null hypothesis is rejected when its testing demonstrates that it is improbable. Thus, the researcher decides how stringent the testing should be, ie, what is the acceptable risk of falsely (by chance alone) rejecting the null hypothesis. This level of probability (level of significance) is called the alpha (α) level. The error of incorrectly rejecting the null hypothesis is referred to as type I error.

α = P (type I error)
  = P (erroneously rejecting the null hypothesis).

When the null hypothesis is rejected, the alternative hypothesis (Ha, a hypothesis encompassing all alternatives to H0) is accepted. The null hypothesis may not be rejected in two instances: when it is true, or when it is false but the study lacks sufficient power to reject it. The probability of type II error (β error) is the probability of not rejecting the null hypothesis when the alternative hypothesis is true.

β = P (type II error)
  = P (not rejecting the null hypothesis when the alternative hypothesis is true).

The power of the study is equal to 1 − β and represents the probability of the test correctly rejecting the null hypothesis.

REGRESSION ANALYSIS

Regression analysis explores relationships between one or more dependent (response) variables and one or more independent (predictor) variables. Dependent and independent variables are usually, but not always, quantitative (ie, interval or ratio). The goal of regression analysis is to express the dependent variable(s) as a function of the independent variable(s). Such a function may be used to describe the type and the strength of association between the variables, as well as to predict the values of dependent variables for a given set of values of independent variables.

Most commonly, responses are modeled as linear functions of predictors, and hence linear regression models are evaluated. The advantages of using linear models are ease of interpretation and ease of mathematical manipulation. Several types of regression analysis may be distinguished:

1. Simple regression of one dependent variable and one independent variable.
2. Multiple regression of one dependent variable and several independent variables.
3. Multivariate multiple regression of several dependent variables and several independent variables.

Dependent variables in the above types of regression analysis are quantitative. When the dependent variable is categorical, the data may be analyzed by a special type of regression called logistic regression. In the following sections we will focus on univariate regression analysis encompassing simple and multiple linear regression analysis and logistic regression. Multivariate multiple regression will be discussed separately in the section devoted to multivariate analysis.

Simple Linear Regression Analysis

The relationship between a single dependent variable and a single independent variable may be graphically approximated as a line. This line may be described by an equation:

y = β0 + β1 ∙ x

where y and x are, respectively, dependent and independent variables. The y intercept of the line is represented by the constant β0; it represents the value of the dependent variable when the independent variable is equal to zero. The slope of the line is represented by the unstandardized regression coefficient β1; it reflects the ratio of the change in the dependent variable to a given change in the independent variable. Clearly, when β1 is equal to zero, changes in the independent variable are not associated with any change in the dependent variable; under such circumstances, the independent variable is a useless predictor of outcome. A positive value of β1 reflects positive correlation of predictor and outcome; a negative value of β1 reflects negative correlation.

For example, simple linear regression may be used

Figure 1. Regression analysis models. (a) Simple linear regression. (b) Multivariate regression with a categorical variable (country). (c) Multivariate regression with an interaction element (country and calendar year).

to model the relationship between the age at menarche and the calendar year (Fig. 1a):

Age at menarche = β0 + β1 ∙ (calendar year).

In this case, the unstandardized regression coefficient assumes a negative value: the dependent variable (age at menarche) decreases when the independent variable (calendar year) increases.

The above model represents a deterministic relationship and therefore only approximates reality; real experimental data cannot be expected to fit this model perfectly. Real data would contain an element of unexplained variation, and therefore they may be represented by a probabilistic relationship:

Age at menarche = β0 + β1 ∙ (calendar year) + random error.

The selection of the best fitting line is performed by the least squares approach, which identifies the constant β0 and the unstandardized regression coefficient β1 for which the sum of squares of the errors is minimized. In this context, individual errors are the differences between observed and expected values of the dependent variable.
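The least squares computation just described can be sketched in a few lines of Python. The data below are hypothetical, chosen only to mimic a declining trend in the age at menarche, and the `least_squares` helper is our own, not a library function.

```python
# Least-squares fit of y = b0 + b1*x: choose b0 and b1 so that the sum of
# squared errors (observed minus expected y) is minimized. Closed form:
#   b1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b0 = mean_y - b1 * mean_x

def least_squares(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x  # the fitted line passes through the point of means
    return b0, b1

# Hypothetical observations: calendar year vs. age at menarche (years)
years = [1900, 1920, 1940, 1960, 1980]
ages = [14.5, 14.1, 13.6, 13.2, 12.8]

b0, b1 = least_squares(years, ages)
errors = [yi - (b0 + b1 * xi) for xi, yi in zip(years, ages)]  # individual errors

print(round(b1, 4))  # -0.0215: a negative slope, as in the menarche example
```

One property worth checking in any implementation: the individual errors of a least-squares fit sum to (numerically) zero, which makes a quick sanity test.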


The strength of the linear relationship may be measured by the Pearson product moment coefficient of correlation, r, which may also be referred to as a correlation coefficient. The correlation coefficient may assume values from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).

A useful independent variable in the regression model requires its correlation coefficient to be significantly different from zero. Testing of this null hypothesis may be carried out with the aid of the t statistic.

The square of the correlation coefficient is called the coefficient of determination, r2. The r2 represents the proportion of the total sample variability explained by the linear model. Thus, for example, an r2 = .78 means that 78% of the sample variability is explained by the model. Statistical software packages provide a summary of the most important parameters of linear regression: the constant, the regression coefficient, the probability of type I error, and the coefficient of determination.

Once a best fitting model has been found, it may be used to estimate the mean value of a dependent variable (and the confidence interval [CI] of the mean) for a given value of an independent variable. Thus, for example, one may estimate the average age at menarche for a given calendar year and describe the CI for this average. The CI of this average will describe the boundaries within which the mean age at menarche is expected to be found upon repeated sampling of the population of women.

Furthermore, the model may be used to predict the new individual value of a dependent variable (and the prediction interval of this value) for a given value of an independent variable. Thus, for a given calendar year we may also predict the expected age at menarche for an individual and the CI of this prediction. While the individual value of the age at menarche for a given calendar year will be the same as the estimate of the mean, the CI will be wider. This is intuitively obvious, since the prediction of the new individual value carries two errors: the error of estimating the mean and the random error of the predicted individual value.

Pitfalls

Predictions of the value of a dependent variable are usually safer when performed by interpolation rather than by extrapolation, eg, the models presented in Figure 1 should not be used to predict the age at menarche in the year 1000.

Regression analysis evaluates association; it does not provide direct evidence for or against the cause-and-effect relationship between variables. Such causation may be inferred based on the study design, especially when independent variables were under direct control of the researcher.

A small study, with only a few observations, may provide an unreliable model that cannot be reproduced upon repeat sampling. In particular, models derived from small studies should be cross-validated, ie, tested on a new set of observations from the same population.

Finally, it is not sufficient to demonstrate a statistically significant association between an independent and a dependent variable (ie, a low probability that the regression coefficient is equal to zero) and a high value of the coefficient of determination r2. It is also essential to demonstrate the real, intuitive importance of the observed association. Large and powerful studies may demonstrate a statistically significant but clinically irrelevant association. The clinical relevance of the association between the independent and the dependent variable may be assessed by examining the value of the unstandardized regression coefficient.

Multiple Linear Regression Analysis

Multiple linear regression is a natural extension of simple linear regression, when the dependent variable is described as a linear function of more than one independent variable.

For example, expanding on the model presented for simple linear regression, we may describe the age at menarche as a function of two independent variables, the calendar year and the percent of body fat:

Age at menarche = β0 + β1 ∙ (calendar year) + β2 ∙ (percent of body fat).

We know that the age at menarche has decreased over the recent decades. We also know the age at menarche correlates negatively with the percent of body fat. Incorporation into the model of both independent variables allows us to determine the relative predictive power of each variable, accounting for the other one. Thus, we may be able to answer questions: which independent variable correlates best with the outcome, and whether (accounting for the more powerful predictor) the other independent variable has any additional predictive power.

In the above equation, β0 is a constant, while β1 and β2 are unstandardized regression coefficients. Each of these coefficients corresponds to the slope of the line described in the discussion of simple linear regression. The equation with two independent variables may be graphically represented by a plane in three-dimensional space. When the multiple linear regression equation contains more than two independent variables, graphic presentation in multidimensional space cannot be readily accomplished.

Unstandardized regression coefficients are helpful in evaluating the practical relevance of the model. As discussed in the previous section on simple regression, it is important to determine whether unstandardized regression coefficients have practical relevance to a given model.

In a model containing i independent variables, the overall regression hypothesis tests the global null hypothesis that:

β1 = β2 = . . . = βi = 0.

The alternative hypothesis states that at least one of the regression coefficients is nonzero; this alternative hypothesis states that the model is "useful" (but not necessarily good). The null hypothesis may be tested using F statistics. When the global null hypothesis is rejected, each of the individual regression coefficients may be tested for significance (ie, whether any given βi = 0).

When evaluating a given model, we may want to compare the predictive power of independent variables. Direct comparison of unstandardized regression coefficients cannot be done, since the values of β1 and β2 are affected by the choice of units in which individual variables are measured. To determine how predictive a given variable is, we may assess the effect of a standardized change of an independent variable, eg, a change by one standard deviation. Such standardization can be accomplished by converting each independent variable to have a mean of 0 and standard deviation of 1; this procedure allows determination of standardized regression coefficients (beta-weights) that can be meaningfully compared with each other.

In the discussion of simple linear regression, we discussed the coefficient of determination, r2, as a measure of the proportion of the total sample variability explained by the linear model. In the multiple regression model, we can determine the multiple coefficient of determination, R2, which also describes the fraction of total variability explained by the model. Furthermore, the change of R2 is also meaningful. The addition of another independent variable to the model should increase the value of R2. This change of R2 reflects the additional contribution of the newly added variable to the overall proportion of the variance explained by the new, enriched model. For example, we may compare the models:

Age at menarche = β0 + β1 ∙ (calendar year)

and

Age at menarche = β0 + β1 ∙ (calendar year) + β2 ∙ (% body fat).

Assume that the R2 of the first model is .45 and that the R2 of the second model is .57. Thus, the second model (ie, adding a second predictor, % body fat) explains an additional 12% of variation (change of R2 by .12).

The addition of more independent variables into the equation always increases the value of R2; this occurs, however, at the cost of decreasing the stability of the model. The real challenge in multiple linear regression analysis is to find a meaningful and stable model with a high R2.

Models

In this section we will review some basic concepts of building linear regression models. In particular, we will discuss selection and transformation of variables, interactions between variables, and various approaches to building models.

Selection of Independent Variables

The most important step in regression analysis is the appropriate selection of independent variables. The variables may be conceptually divided into two categories: hypothetical predictors and historically important predictors. Hypothetical predictors are variables not previously known (or tested) in evaluating a given outcome. For example, we may choose to test the effect of a daily dose of spinach on the age at menarche. Testing hypothetical variables alone is often unrewarding, not only because the hypothesis may be wrong, but also because the models would be missing other important predictors that should be accounted for when evaluating a new, hypothetical predictor. Thus, even an important variable may not be identified as a good predictor of outcome, unless it is stratified against (an)other appropriate variable(s).1 The reverse of this scenario is also possible, whereby any predictive power of a new hypothetical predictor may disappear after accounting for other variables.

Thus, a more meaningful approach to evaluating the effect of spinach on age at menarche would be to study a model such as:

Age at menarche = β0 + β1 ∙ (calendar year) + β2 ∙ (% body fat) + β3 ∙ (daily dose of spinach).

Testing of a large number of variables is tempting; however, it carries a risk of producing models that cannot be reproduced on new samples of observations. A practical recommendation is to assure that the sample size (the number of observations) is at least 5 to 10 times the number of independent variables.

Figure 2. Examples of nonlinear relations between variables. (a) Quadratic. (b) Cuboidal. (c) Logarithmic. (d) Exponential.

Transformations

In many instances the relationship between the dependent and independent variable is not a linear

one (Fig. 2). Can we still perform linear regression? The answer is yes, but only when the independent variable is reexpressed. In other words, when the outcome is better predicted by the square or logarithm of the original independent variable, we may model the outcome as a linear function of the new, squared, or logarithmically transformed variable. In some instances, when the outcome has a complex curved relationship with a predictor, we may use polynomials to approximate this relationship. Polynomials are functions:

y = α + β1 ∙ x + β2 ∙ x2 + . . . + βk ∙ xk.

Reexpression of a given predictor variable as several new variables created in this fashion may improve the fit of the model.

Sometimes the relationship of a quantitative independent variable and the outcome is best approximated by a binary function. The most important step, the selection of threshold values for dichotomization, may be performed using modified receiver operating characteristic curve analysis.2 In this method, the sum of the true positive rate and true negative rate is plotted as a function of different threshold values of a continuous variable. The cutoff point for dichotomization corresponds to the peak of such a plot.

Transformations may also be performed on dependent variables. This may be appropriate when several independent variables demonstrate the same type of nonlinear relationship with the dependent variable.

Residuals

Residuals are differences between observed and predicted (estimated by the model) values of the dependent variable. Thus, a separate residual may be calculated for each observation. Inspection of residuals is an essential component of regression analysis; in particular, it is helpful in identifying outliers and improving the fit of the model.

Outliers are observations with large residuals; in other words, they are observations that do not fit well to the model. Outliers may have two origins: they may arise from errors (ie, mistakes in measuring or documenting), or they may signify unusual observations not explainable by the model. Erroneous observations may be corrected or excluded. Unusual but correct observations may require closer scrutiny and, possibly, a revision of the model.

Inspection of residuals can occasionally detect previously unsuspected relationships between independent and dependent variables. For example, a previously unsuspected quadratic component of the relationship may become obvious by noticing a U-shaped (or inverted U-shaped) relationship in the graph where a residual is plotted as a function of a given predictor.

When the inspection of residuals suggests that a given observation be removed, a variable be transformed, or a new variable (such as a quadratic component) added, the new, revised model should be identified and its residuals examined.

Various types of residuals may be calculated. Raw residuals are expressed in the same units as the original dependent variables. Plots of raw residuals as functions of their corresponding independent variables may uncover nonrandom (but not yet accounted for) relationships between a given predictor and response. Studentized residuals are scaled by dividing each raw residual by its estimated standard deviation.3 Studentized residuals are particularly helpful in identifying the outliers. Moderately large Studentized residuals, greater than two standard deviations, deserve closer scrutiny; Studentized residuals greater than three standard deviations point at significant outliers.

Both raw and Studentized residuals may be calculated as deleted residuals.4 Each deleted residual represents a difference between the observed value of a dependent variable and its predicted value, when the prediction equation is derived with that observation deleted. Plots of deleted residuals vs raw residuals are particularly helpful in identifying outliers that may significantly distort the regression coefficients; such outliers would markedly deviate from the 45° line through the origin of the plot.

An excellent in-depth discussion of this complex topic may be found in the monograph by Gunst and Mason.4

Categorical Variables

Some potential predictors of outcome are inherently non-numerical, eg, sex or race. These variables may be used in linear regression analysis with the aid of coding. Coding allows expression of categorical variables as one or more of so-called dummy or design independent variables. For example, when the categorical variable is dichotomous, coding consists of creating a single design variable D that may assume values of 0 or 1.

The situation becomes slightly more complex when there are three or more categories; for a given number n of categories, we need to assign n − 1 design variables. For example, we may choose to study the age at menarche as a function of calendar year and the country of origin. When studying four countries, we need to create three design variables (Table 3). The corresponding regression model would be as follows:

Age at menarche = β0 + β1 ∙ (D1) + β2 ∙ (D2) + β3 ∙ (D3) + β4 ∙ (calendar year).

Table 3. Specification of the Design Variables for Country, Using Country A as the Reference

Country (Code)    D1    D2    D3
Country A          0     0     0
Country B          1     0     0
Country C          0     1     0
Country D          0     0     1

Thus, for country A, the model would be reduced to:

Age at menarche = β0 + β1 ∙ (D1 = 0) + β2 ∙ (D2 = 0) + β3 ∙ (D3 = 0) + β4 ∙ (calendar year)
                = β0 + β4 ∙ (calendar year)

while for country B, the model would be reduced to:

Age at menarche = β0 + β1 ∙ (D1 = 1) + β2 ∙ (D2 = 0) + β3 ∙ (D3 = 0) + β4 ∙ (calendar year)
                = β0 + β1 + β4 ∙ (calendar year).

Thus, for each country, we may model a relationship between the age at menarche and the calendar year as a separate line with the same slope but a different constant (y intercept). This example is illustrated in Fig. 1b.

Interactions

The effect of a given independent variable may be influenced by the value of another independent variable. For example, we may easily imagine that, over the last century, changes in the age at menarche were different for different countries. Let's assume


that the age at menarche remained constant in country A, but declined in country B. Such an interaction between independent variables may be included in the model (with one design variable D1 coding for two countries):

Age at menarche = β0 + β1 ∙ (D1) + β2 ∙ (calendar year) + β3 ∙ (D1) ∙ (calendar year)

where "(D1) ∙ (calendar year)" is a two-variable interaction term. Such an interaction is illustrated in Figure 1c.

While interactions may improve the fit of the model, they may also unnecessarily complicate it. For example, in a model with only five independent variables, there are 10 possible two-variable interaction terms. There may also be three-variable interaction terms (with multiplication of three independent variables), and so on. Furthermore, an interaction may be expressed in the form of other functions of two or more of the independent variables, eg, their ratios. Interactions should be included in the model when justified by known or suspected relevance to the interpretation of the model, and/or when they appear to significantly improve the fit of the model. A problem with including interactions, however, is that they may introduce redundancy into the model (multicollinearity).

Multicollinearity

Multicollinearity occurs when independent variables in the model contain redundant information. This happens when independent variables are highly correlated with each other. Multicollinearity may be responsible for unreliable models, whereby an important predictor may not be detected, and thus not included in the model, because of its linear relationship with another predictor. In other words, multicollinearity occurs when two or more independent variables contribute to the prediction of the dependent variable, but the contribution of one overlaps with that of the other(s). Models that include redundant variables tend to have unstable estimates of coefficients (ie, the coefficients have large variances).

Multicollinearity may be detected by examination of the correlation matrix of regression coefficients.4 When multicollinearity is limited to two independent variables, it may be detected when the correlation coefficient is in the range of .70 to .80. Multicollinearity among three or more independent variables may not be identified by pairwise screening of correlation coefficients and may require evaluation of other parameters of the correlation matrix of regression coefficients: latent roots (eigenvalues) and condition indices. Condition indices are square roots of the ratios of the largest eigenvalue to each successive eigenvalue. Small latent roots (close to zero) and large condition indices (in the range of at least 15 to 30) suggest a significant multicollinearity problem.

Strategies in Model Building

The process of building the best possible model remains largely in the domain of art rather than science. It requires judgment in the selection of variables and their appropriate transformation, the search for relevant interactions, and accounting for multicollinearity. The list of all variables, polynomials, and interaction components may be long. Excessively complex models lose stability and become nonreproducible on new sets of observations. Thus, the selection of the most relevant components is essential. Several options may be considered.

When dealing with a relatively small set of variables, all possible regressions may be tested, their summary statistics compared, and the best-fitting model selected. The major drawback of this method is the large number of regressions that have to be evaluated: for n variables, there are 2ⁿ different models.

A more efficient approach to the selection of the most relevant predictors is best subset regression.5 The concept of this approach is best presented with an example. Let us assume that we have 10 variables and want to create a model with only 4 variables. The best subset regression method will select 4 variables such that the resultant model has the largest coefficient of determination, R². We may then select the best possible subset of 5 variables and determine whether this larger model results in an appreciably increased R². The procedure may be continued for successively larger subsets until the increases of R² become insignificant or, less likely, until all 10 variables are incorporated.

Another method of selecting the most relevant variables is stepwise selection. This approach uses a fixed decision rule to either add a variable to the model or to delete it.

In forward selection, at each step the single variable with the best predictive power is added to the model. Thus, an F statistic is calculated for each possible step, and the variable associated with the highest significance (lowest P value) is incorporated into the model. After the first variable has been selected, the next variable is selected by evaluating its predictive power while accounting for the effect of the first variable. The process continues until either all variables are incorporated or, more likely, the addition of another variable fails to significantly improve the model (ie, the F statistic demonstrates a P value that exceeds the preset threshold level of type I error). The traditional threshold of P = .05 is probably too stringent and may exclude important variables.6
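The all-possible-regressions and best-subset searches described above can be sketched in a few lines of pure Python. This is a minimal illustration rather than a definitive implementation: the helper names `ols_r2` and `best_subset` and the toy data are invented, ordinary least squares is fit by solving the normal equations directly, and real statistical software would additionally report F statistics and adjusted selection criteria.

```python
from itertools import combinations

def ols_r2(X, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X)b = X'y
    with Gaussian elimination; return the coefficient of determination R^2."""
    rows = [[1.0] + list(x) for x in X]          # prepend an intercept column
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    for col in range(p):                         # elimination with partial pivoting
        piv = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for i in reversed(range(p)):                 # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    y_hat = [sum(bc * xc for bc, xc in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def best_subset(X, y, size):
    """Among all predictor subsets of a given size, return (R^2, indices)
    for the subset with the largest R^2 (the 2^n search, restricted to one size)."""
    subsets = combinations(range(len(X[0])), size)
    scored = [(ols_r2([[row[i] for i in s] for row in X], y) , s) for s in subsets]
    return max(scored)

# Hypothetical toy data: y depends only on predictors 0 and 2.
X = [[1, 5, 2], [2, 3, 1], [3, 8, 4], [4, 1, 0], [5, 7, 3], [6, 2, 5], [7, 9, 1], [8, 4, 6]]
y = [2 * row[0] + 3 * row[2] for row in X]
r2, chosen = best_subset(X, y, 2)
```

Calling `best_subset` for successively larger sizes and comparing the resulting R² values mirrors the stopping rule described above: stop when the increase in R² becomes negligible.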

SEMINARS IN REPRODUCTIVE ENDOCRINOLOGY Volume 14, Number 2 May 1996

Thus, more lenient threshold levels of the P value, in the range of .15 to .20, should be considered.

An alternative stepwise procedure is the backward elimination method. The initial model consists of all variables, and at each step the least predictive variable (with the corresponding highest P value) is eliminated from the model. The process continues until the least predictive variable is still significantly predictive, ie, its corresponding P value is below the preset threshold level. Finally, both forward and backward steps may be incorporated in the stepwise procedure, whereby each forward step (the addition of a variable to the model) is followed by a backward elimination (a search for a nonsignificant variable to be removed). This process may help in identifying variables that initially appear important but become insignificant when (an)other, subsequently added variable(s) are accounted for. The process requires that two threshold levels of P are selected: Pe for entry of a variable into the model and Pr for removal of a variable from the model. The threshold for removal must exceed the threshold for entry. For example, for an entry threshold set at P = .15, the removal threshold may be set at P = .20. The stepwise procedure is incorporated into most statistical packages.

Since all the above approaches may be performed automatically by the computer program, there is a risk of excluding a variable that appears to be statistically nonsignificant but is known to be important (historically or biologically). Such variables should be forced into the model manually, ie, included in the model even when their contribution to the model is not statistically significant.

Logistic Regression Analysis

In many studies, the outcome of interest cannot be described as a quantitative dependent variable. For example, conditions such as pregnancy, spontaneous abortion, or development of breast cancer are best described by dichotomous (binary) variables that code for the presence or absence of a given outcome. Under such circumstances, the usual linear regression model is not helpful. However, modeling of the relationship between various independent predictors and dichotomous dependent variables may be accomplished using a related statistical method, logistic regression.

A dichotomous outcome may be presented in the form of a conditional mean, E(Y|x), which denotes the expected value of Y for a given value of x. For example, a conditional mean may represent the probability of pregnancy for a female of a given age. For dichotomous outcomes such as pregnancy, the minimum value of the conditional mean is 0 and the maximum value is 1. A conditional mean may be modeled with the aid of the logit transformation:

E(Y|x) = P = exp(β0 + β1 ∙ x)/[1 + exp(β0 + β1 ∙ x)].

The logit transformation may be presented as a logit function:

log[P/(1 − P)] = β0 + β1 ∙ x.

In this form, the logit is a linear function of the independent variable(s) x. In general, logistic regression uses logit transformations to explore models where the dependent variable is categorical and at least one of the independent variables is quantitative (ie, interval or ratio). In most instances the dependent variable is binary. When a categorical dependent variable assumes more than two values, one may use a more complex extension of logistic regression, polytomous logistic regression, a topic beyond the scope of this review. An excellent monograph by Hosmer and Lemeshow discusses the details of logistic regression analysis, including polytomous logistic regression.7

Logistic regression modeling is in many ways comparable to linear regression. It evaluates the linearity of the relationship between the independent variables and the logit of the dependent variable. It accepts categorical independent variables and interaction elements. It allows transformations of variables, and it is vulnerable to multicollinearity. Finally, it uses familiar methods of variable selection: all possible regressions, best subset selection, and various stepwise selection procedures.

Logistic regression is distinct from linear regression in several ways, including the interpretation of the coefficients and the approach to evaluation of the goodness of fit.

One of the most useful features offered by logistic regression is the ability to interpret coefficients (beta-coefficients) as measures of odds ratios of dichotomous outcomes. The odds of a dichotomous outcome (eg, response vs no response) may be represented as the ratio P/(1 − P), where P represents the probability of the response and 1 − P represents the probability of no response. The odds ratio is the multiplicative factor by which the odds change when the independent variable is increased by one unit. Most programs provide an output in the form of estimated coefficients β (± standard error), odds ratios (Ψ), and 95% CIs for the odds ratios.

For a change of one unit of a given independent variable xi, the calculation of Ψ and the 95% CI may be easily accomplished by observing that:

Ψ = exp(βi)

and that the end points of the 95% CI are:

exp[βi ± 1.96 ∙ (standard error of βi)].
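As a small worked illustration of these formulas (the function name, coefficient, and standard error below are invented; any logistic-regression output supplies the real values):

```python
import math

def odds_ratio_ci(beta, se, units=1.0, z=1.96):
    """Odds ratio and 95% CI for a `units`-unit change in the predictor,
    computed as exp(units*beta) and exp(units*beta ± z*units*se)."""
    psi = math.exp(units * beta)
    lo = math.exp(units * beta - z * units * se)
    hi = math.exp(units * beta + z * units * se)
    return psi, lo, hi

# Hypothetical output from a logistic fit: beta = 0.405, standard error = 0.10.
psi, lo, hi = odds_ratio_ci(0.405, 0.10)   # odds ratio ≈ 1.50 per one-unit change
```

Setting `units=c` covers a change of c units as well, since both the coefficient and its standard error scale by c.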


We may also easily determine the Ψ and 95% CI for a change of c units of an independent variable xi:

Ψ(c) = exp(c ∙ βi)

and the end points of the 95% CI are:

exp[c ∙ βi ± 1.96 ∙ c ∙ (standard error of βi)].

In logistic regression, an equivalent of the multiple coefficient of determination R² is McFadden's ρ². McFadden's ρ² may assume values between 0 and 1; the closer to one, the better the fit of the model. Notably, even values in the range of .2 to .4 are considered satisfactory.8

The evaluation of the goodness of fit may be accomplished using the Pearson χ² test to compare the difference between the observed and predicted (fitted by the model) probabilities of the dependent variable. The model fits well when the difference in these probabilities is not statistically significant. A more sophisticated approach, often provided by statistical software, is to perform the Hosmer-Lemeshow test.7 Ultimately, the best evaluation of the model may be achieved by applying the model to a new set of observations.

MULTIVARIATE ANALYSIS

Multivariate analysis simultaneously evaluates multiple dependent variables and accounts for the correlation(s) between the dependent variables. In this section we will compare univariate and multivariate methods and briefly present selected aspects of multivariate analysis. An in-depth presentation of this subject may be found in an excellent monograph by Rencher.9

Comparison of Univariate and Multivariate Analysis

Studies evaluating multiple outcomes are often analyzed using univariate methods whereby each outcome is evaluated separately. This approach, however, has major shortcomings when compared to multivariate analysis, especially due to the errors of multiple comparisons and the inability to account for intercorrelation between the outcomes.

The error of multiple comparisons may be best illustrated by an example. Let's assume that we intend to study the effects of two treatments on 20 hormones. Use of univariate analysis would require us to perform 20 separate comparisons (eg, t tests). If for each test we accept a 5% risk of type I error (P = .05), the probability of finding a significant difference in the levels of at least one hormone by chance alone would be:

P (of at least one rejection by chance alone) = 1 − P (all 20 tests accepted) = 1 − (.95)²⁰ ≈ .64 (!).

Thus, multiple comparisons may result in a totally unacceptable risk of type I error. Importantly, the above calculation assumes independence of the outcomes; such an assumption is rarely fulfilled. In the above example, the levels of hormones are usually highly interdependent; consequently, the true type I error lies somewhere between .05 and .64. When one uses multivariate analysis, the type I error is clearly set and is not obscured by the number of variables; thus, the researcher is protected from reading too much into the data.

The interdependence of outcomes is also the source of the second important problem with univariate analysis: the lack of power to detect differences in outcomes when these differences are obscured by the interdependence of the outcomes or, conversely, the finding of differences that would disappear after accounting for the interdependence of the outcomes.

To illustrate these concerns, let us assume that we are testing a null hypothesis stating that in patients with polycystic ovary syndrome the mean level of luteinizing hormone (LH) is 12 mIU/mL and the mean level of follicle-stimulating hormone (FSH) is 4 mIU/mL. Using univariate statistics, we may construct a separate 95% CI for the means of LH and FSH. On a graph depicting LH on the x-axis and FSH on the y-axis, these 95% CIs may be jointly presented as a rectangle (Fig. 3). However, it is reasonable to expect that, in fact, the levels of LH and FSH are correlated. Thus, the distribution of observed values would not be rectangular, but more likely elliptical (Fig. 3). A multivariate test creates elliptical acceptance regions. Two special situations should be more closely investigated: (1) acceptance by univariate but rejection by multivariate testing, and (2) rejection by univariate but acceptance by multivariate testing. In the former situation, the observed means of the sample would be inside the rectangle but outside the ellipse; in the latter situation, the observed means of the sample would be outside the rectangle but inside the ellipse. In both instances, the results of the multivariate analysis should be more trusted.

In summary, multivariate analysis protects from errors of multiple comparisons, has the power to identify important predictors of outcome that may not be observed when analyzing each outcome separately, and protects from false identification of effects that would lose significance after accounting for other variables.

Hotelling's T²

The simplest situation where multivariate analysis may be used applies to studies comparing two


groups with two or more dependent variables. Such studies may be analyzed with the aid of Hotelling's T² test, a multivariate extension of the t test.

For example, we may analyze an imaginary study comparing the effect of two in vitro fertilization ovulation induction protocols on two outcomes: the number of embryos and the quality of embryos (Fig. 4). The distribution of outcomes in each group is approximated by an ellipse. The size and shape of each ellipse illustrate the relationship between the outcomes. In the example illustrated by Figure 4, the outcomes correlate positively in protocol 1 and negatively in protocol 2. Centroids C1 and C2 represent vectors of means, in this case pairs consisting of the mean number of embryos and the mean quality of embryos. Hotelling's T² test may be used to evaluate the null hypothesis that the centroids C1 and C2 are not different. Finding a significant value of T² allows us to reject the null hypothesis and to state that the centroids are significantly different.

In the above example, the two treatments may differ in three ways: in the number of embryos only, in the quality of embryos only, or in both the number and the quality of embryos. Hotelling's T² test is unable to determine which of these three possibilities takes place. This problem may be solved by discriminant function analysis (see below).

Figure 3. Comparison of acceptance and rejection regions for univariate and multivariate analysis. The rectangle represents a confidence region constructed for each dependent variable (luteinizing hormone [LH] and follicle-stimulating hormone [FSH]) separately by univariate tests. The ellipse represents a confidence region constructed by a multivariate test.

Figure 4. Comparison of two imaginary in vitro fertilization protocols where two outcomes (dependent variables) were recorded: number of embryos and quality of embryos.

Manova

Multivariate analysis of variance (MANOVA) is a natural extension of analysis of variance (ANOVA) to situations where the analysis involves more than one dependent variable. It is useful in situations where evaluation of more than two groups is required.

In ANOVA we may test a null hypothesis that the means (of a dependent variable) for all groups are equal. Rejection of the null hypothesis means that the means of the dependent variable are different in at least two groups.

In MANOVA, the null hypothesis may be extended to all dependent variables, ie, we may postulate that for each dependent variable, the means for all groups are equal. Thus, for k groups and p dependent variables, the null hypothesis may be expressed as follows:

µ11 = µ21 = µ31 = . . . = µk1
µ12 = µ22 = µ32 = . . . = µk2
. . .
µ1p = µ2p = µ3p = . . . = µkp.

There are several different statistics available to test the null hypothesis in MANOVA. Probably the most popular test is known as Wilks' Λ. In contrast to the familiar univariate statistics (such as the t test or F statistic), the null hypothesis is rejected when the value of Λ is low and approaches zero. Critical values of Λ are available in table form10 or may be obtained using statistical software such as SYSTAT


or SAS. Other tests frequently used in MANOVA (and found in the outputs of statistical packages) include Roy's largest root test, the Pillai statistic, and the Lawley-Hotelling statistic.9

Why so many statistics? All these tests have the same type I error rate, ie, when the null hypothesis is true, they all have the same risk of rejecting it. However, when the null hypothesis is false, these tests differ in their ability to reject it (ie, they differ in power). Thus, under some circumstances, some tests will reject the null hypothesis while the others will not. The power of each of these tests depends on the configuration of the mean vectors in multidimensional space (since we are testing means of several dependent variables, we may represent them as mean vectors). Each mean vector represents one group. We may imagine that, for example, the mean vectors of three groups may be collinear (in the same line) or diffuse. When the mean vectors are collinear, the most powerful test is Roy's largest root test. On the other hand, when the mean vectors are diffuse, the Pillai statistic is most powerful. Further discussion of this subject may be found in the specialized literature.9 In practice, under most circumstances, use of Wilks' Λ is recommended.

The finding of a significant value of Wilks' Λ (or another comparable test) allows the rejection of the null hypothesis. However, these tests, like Hotelling's T² test, are not helpful in the actual location of the source of the difference between the means; this may be accomplished by using discriminant function analysis.

Discriminant Function Analysis

Discriminant functions are linear combinations of dependent variables that best separate groups. Thus, discriminant function analysis identifies the relative contribution of each individual dependent variable to the separation of the groups.

For example, to evaluate a study testing the effect of two in vitro fertilization ovulation induction protocols on the number of embryos and the quality of embryos, one may obtain a discriminant function:

z = a1 ∙ number of embryos + a2 ∙ quality of embryos.

The values of the coefficients a1 and a2 are such that they optimize the separation between the individual ovulation induction protocols. Variables may be standardized to have means of zero and standard deviations of one. This allows direct comparison between the individual weights. Coefficients close to zero indicate that their corresponding variables provide little contribution to the separation between the groups.

Graphically, two groups may be best separated by a line, 1, through the points where the ellipses cross each other (Fig. 5). A line, 2, perpendicular to line 1, is called the discriminant function; it allows the representation of pairs of dependent variables as simple numbers (values of the discriminant function) in such a way that the two groups may be best separated.

Figure 5. Discriminant function optimizes the separation between the groups.

In situations where there are more than two groups, a single discriminant function is clearly not sufficient; evaluation of k groups requires k − 1 discriminant functions. The most powerful discriminant function separates one of the groups from all other groups. The next discriminant function separates the second group from the remaining ones, and so on. These individual discriminant functions may be ranked according to their power to separate a given group from the remaining groups. Each discriminant function may be evaluated by the Wilks' Λ test to determine its significance, ie, whether it significantly contributes to the separation between the groups.

In the presence of more than two dependent variables, graphic representation of discriminant functions cannot be readily accomplished; however, the concepts and the calculations of the function parameters remain essentially unchanged.

However, the greater the number of variables and groups, the less stable are the solutions. Decreased stability means that on cross-validation on a new sample of observations, the results of discriminant function analysis are less likely to be reproducible. The rule of thumb is to have at least 10 observations (subjects) per dependent variable. To limit the number of variables to the most important ones, various stepwise procedures may be used. In this context, important variables are those that are the most helpful in separating the individual groups. The results of stepwise selection should be interpreted with caution; these procedures may exclude not only unimportant variables, but also important ones, when they are highly correlated with other variable(s) already incorporated into the discriminant function. This situation is analogous to that encountered with multicollinearity of independent variables in regression analysis.

Multivariate Regression Analysis

While MANOVA is a multivariate technique evaluating categorical independent variables, multivariate regression analysis is a multivariate technique evaluating quantitative independent variables (Table 2). Specifically, multivariate regression analysis determines the strength of the linear relationship between multiple independent and multiple dependent variables. It is therefore an extension of multiple regression analysis that produces models with the same number of equations as the number of dependent variables.

For example, for three dependent variables (age at menarche, age at adrenarche, and age at telarche) we may produce a model:

Age at menarche = β0 + β1 ∙ (calendar year) + β2 ∙ (percent of body fat)
Age at adrenarche = β0 + β1 ∙ (calendar year) + β2 ∙ (percent of body fat)
Age at telarche = β0 + β1 ∙ (calendar year) + β2 ∙ (percent of body fat).

As in univariate regression models, the β's are regression coefficients, and the least squares method estimates the model with the best-fitting regression coefficients. Multivariate regression analysis allows testing whether at least one of the outcomes is correlated with at least one of the predictors. This testing, the test of overall regression, evaluates the null hypothesis that all of the regression coefficients (in the above example, β1 and β2) are equal to zero; rejection implies that at least one of them differs from zero. Various tests of overall regression may be used: Wilks' Λ, Roy's largest root, the Pillai statistic, and the Lawley-Hotelling statistic. As in MANOVA, the selection of the most powerful test depends on the relationship between the dependent variables.

These tests inform us of the presence or absence of a linear relationship between independent and dependent variables somewhere in the model. Canonical correlation may be used to examine this relationship more closely. Canonical correlation requires the construction of the first canonical variates, linear functions of the independent and of the dependent variables. For example:

Canonical variate A (for independent variables) = B1A ∙ (calendar year) + B2A ∙ (percent of body fat).
Canonical variate B (for dependent variables) = B1B ∙ (age at menarche) + B2B ∙ (age at adrenarche) + B3B ∙ (age at telarche).

The canonical correlation method selects the best coefficients B for both the independent and the dependent variables to maximize the correlation between these two sets of variables. When the variables are standardized, their corresponding standardized coefficients show the contribution of individual variables to the overall relationship between the independent and dependent variables. Further analysis of canonical correlation allows delineation of the details of the relationship between the variables. An excellent discussion of this subject may be found in the monograph by Rencher.9

WHAT IS NEXT?

In addition to the techniques discussed in this article, there are other methods of multivariate analysis. These techniques allow grouping of objects/observations according to their similarity (cluster analysis) or identification of the most important underlying factors responsible for the relationship between the variables (factor analysis).

These methods and other statistical approaches may be used not only to rigidly test our hypotheses, but also, and possibly even more importantly, to help us in generating new hypotheses. Thus applied, statistical tools may also be tools of creativity.

REFERENCES

1. Simpson EM: The interpretation of interaction in contingency tables. J R Stat Soc Series B 13:238-241, 1951
2. Silverberg KM, Burns WN, Olive DL, Riehl RM, Schenken RS: Serum progesterone levels predict success of in vitro fertilization/embryo transfer in patients stimulated with leuprolide acetate and human menopausal gonadotropins. J Clin Endocrinol Metab 73:797-803, 1991
3. Velleman PF, Welsch RE: Efficient computing of regression diagnostics. Am Statistician 35:234-242, 1981
4. Gunst RF, Mason RL: Regression Analysis and Its Application: A Data-Oriented Approach. New York, Marcel Dekker, Inc, 1980
5. La Motte LR: The SELECT routines: A program for identifying best subset regression. Applied Stat 21:1972
6. Bendel RB, Afifi AA: Comparison of stopping rules in forward regression. J Am Stat Assoc 72:46-53, 1977
7. Hosmer DW, Lemeshow S: Applied Logistic Regression. New York, John Wiley & Sons, Inc, 1989
8. Hensher D, Johnson LW: Applied Discrete Choice Modelling. London, Croom Helm, 1981
9. Rencher AC: Methods of Multivariate Analysis. New York, John Wiley & Sons, Inc, 1995
10. Wall FJ: The Generalized Variance Ratio or U-Statistic. Albuquerque, The Dikewood Corp, 1967
