Partial Least Squares Regression
Partial Least Squares Regression
Regression
Partial least squares (PLS) regression is, at its historical core, a black-box algorith-
mic method for dimension reduction and prediction based on an underlying linear
relationship between a possibly vector-valued response and a number of predictors.
Through envelopes, much more has been learned about PLS regression, resulting in
a mass of information that allows an envelope bridge that takes PLS regression from
a black-box algorithm to a core statistical paradigm based on objective function op-
timization and, more generally, connects the applied sciences and statistics in the
context of PLS. This book focuses on developing this bridge. It also covers uses of PLS
outside of linear regression, including discriminant analysis, non-linear regression,
generalized linear models and dimension reduction generally.
Key Features:
• Showcases the first serviceable method for studying high-dimensional regressions.
• Provides necessary background on PLS and its origin.
• R and Python programs are available for nearly all methods discussed in the book.
R. Dennis Cook
Liliana Forzani
Designed cover image: © R. Dennis Cook and Liliana Forzani
MATLAB and Simulink are trademarks of The MathWorks, Inc. and are used with permission. The MathWorks
does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB or
Simulink software or related products does not constitute endorsement or sponsorship by The MathWorks of a
particular pedagogical approach or particular use of the MATLAB and Simulink software.
First edition published 2024
by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright
holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowl-
edged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, includ-
ing photocopying, microfilming, and recording, or in any information storage or retrieval system, without written
permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact
the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works
that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003482475
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
For Violette and Rose
R. D. C.
L. F.
Contents
Preface xvii
Authors xxvii
1 Introduction 1
1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Corn moisture . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Meat protein . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Serum tetracycline . . . . . . . . . . . . . . . . . . . . . 6
1.2 The multivariate linear model . . . . . . . . . . . . . . . . . . . 7
1.2.1 Notation and some algebraic background . . . . . . . . 9
1.2.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.3 Partitioned models and added variable plots . . . . . . . 15
1.3 Invariant and reducing subspaces . . . . . . . . . . . . . . . . . 15
1.4 Envelope definition . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.5 Algebraic methods of envelope construction . . . . . . . . . . . 22
1.5.1 Algorithm E . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5.2 Algorithm K . . . . . . . . . . . . . . . . . . . . . . . . 23
1.5.3 Algorithm N . . . . . . . . . . . . . . . . . . . . . . . . 25
1.5.4 Algorithm S . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.5.5 Algorithm L . . . . . . . . . . . . . . . . . . . . . . . . 32
1.5.6 Other envelope algorithms . . . . . . . . . . . . . . . . 33
vii
viii Contents
3.5.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.5.2 Summary of common features of NIPALS and SIMPLS 98
3.6 Helland’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.7 Illustrative example . . . . . . . . . . . . . . . . . . . . . . . . 100
3.8 Likelihood estimation of predictor envelopes . . . . . . . . . . . 103
3.9 Comparisons of likelihood and PLS estimators . . . . . . . . . 104
3.10 PLS1 v. PLS2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.11 PLS for response reduction . . . . . . . . . . . . . . . . . . . . 108
Bibliography 387
Index 407
Preface
Partial least squares (PLS) regression is, at its historical core, a black-box
algorithmic method for dimension reduction and prediction based on an un-
derlying linear relationship between a possibly vector-valued response and a
number of predictors. Its origins trace back to the work of Herman Wold in
the 1960s, and it is generally recognized that Wold’s non-linear iterative par-
tial least squares (NIPALS) algorithm is a critical point in its evolution. PLS
regression made its first appearances in the chemometrics literature around
1980, subsequently spreading across the applied sciences. It has attracted con-
siderable attention because, with highly collinear predictors, its performance is
typically better than that of standard methods like ordinary least squares, and
it has particularly good properties in the class of abundant high-dimensional
regressions where many predictors contribute information about the response
and the sample size n is not sufficient for standard methods to yield un-
ambiguous answers. In some regressions, its estimators can converge at the
root-n rate regardless of the asymptotic relationship between the n and the
number of predictors. It is perhaps the first serviceable method for studying
high-dimensional linear regressions.
The statistics community has been grappling for over three decades with
the regressions in which n is not sufficient to allow crisp application of standard
methods. In the early days, they fixed on the idea of sparsity, wherein only a
few predictors contribute information about the response, to drive their fitting
and prediction methods, resulting in a focus that likely hindered the pursuit of
other plausible methods. Not being based on optimizing an objective function,
PLS regression fell on the fringes of statistical methodology and consequently
did not receive much attention from the statistics community. This picture
began to change in 2013 with the findings that PLS regression is a special
case of the then nascent envelope theory of dimension reduction (Cook et al.,
2013). Through envelopes, we have learned much about PLS regression in the
past decade, resulting in a critical mass of information that allows us to provide
an envelope bridge that takes PLS regression from a black-box algorithm to a
xvii
xviii Preface
Outline
Chapter 1 contains background on the multivariate linear model and on
envelopes. This is not intended as primary instruction, but may be sufficient to
establish basic ideas and notation. In Section 1.5, we introduce five algebraic
methods for envelope construction, some of which are generalizations of com-
mon PLS algorithms from the literature. In Chapter 2, we review envelopes for
response and predictor reduction in regression, and discuss first connections
with PLS regression. In Chapter 3, we describe the common PLS regression
algorithms for predictor reduction, NIPALS and SIMPLS, and prove their con-
nections to envelopes. We also discuss PLS for response reduction, PLS1 v.
PLS2 and various other topics. These first three chapters provide a foundation
for various extensions and adaptations of PLS that come next. Chapters 4 –
11 do not need to be read in order. For instance, readers who are not particu-
larly interested in asymptotic considerations may wish to skip Chapter 4 and
proceed with Chapter 5.
Various asymptotic topics are covered in Chapter 4, including convergence
rate and abundance v. sparsity. Simultaneous PLS reduction of responses and
predictors in multivariate linear regression is discussed in Chapter 5, and
methods for reducing only a subset of the predictors are described in Chap-
ter 6. We turn to adaptations for linear and quadratic discriminant analysis
in Chapters 7 and 8. In Chapter 9 we argue that there are settings in which
the dimension reduction arm of a PLS algorithm is serviceable in non-linear
as well as linear regression.
The versions of PLS used for path analysis in the social sciences are notably
different from the PLS regressions used in other areas like chemometrics. PLS
for path analysis is discussed in Chapter 10. Ancillary topics are discussed in
Chapter 11, including bilinear models, the relationship between PLS regression
and conjugate gradient methods, sparse PLS, and PLS for generalized linear
models. Most proofs are given in an appendix, but some that we feel may be
particularly informative are given in the main text.
Preface xix
Computing
R or Python computer programs are available for nearly all of the
methods discussed in this book and many have been implemented in both
languages. Most of the methods are available in integrated packages, but
some are standalone programs. These programs and packages are not dis-
cussed in this book, but descriptions of and links to them can be found at
https://lforzani.github.io/PLSR-book/. This format will allow for updates
and links to developments following this book. The web page also gives errata,
links to recent developments, color versions of some grayscale plots as they ap-
pear in the book, and commentary on parts of the book as necessary for clarity.
Acknowledgments
Earlier versions of this book were used as lecture notes for a one-semester
course at the University of Minnesota. Students in this course and our collabo-
rators contributed to the ideas and flavor of the book. In particular, we would
like to thank Shanshan Ding, Inga Helland, Zhihua Su, and Xin (Henry) Zhang
for their helpful discussions. Much of this book of course reflects the many
stimulating conversations we had with Bing Li and Francesca Chiaromonte
during the genesis of envelopes. The tecator dataset of Section 9.7 is available
at http://lib.stat.cmu.edu/datasets/tecator.
We extend our gratitude to
Rodrigo Garcı́a Arancibia for providing the economic growth data and anal-
ysis of Section 6.3.
Fabricio Chiappini for providing the etanercept data of Section 9.8 and for
sharing with us a lot of discussion and background on chemometric data.
He together with Alejandro Olivieri were very generous in providing us
with necessary data.
Pedro Morin for helping us comprehend the subtle link between PLS algo-
rithms and conjugate gradient methods.
Marilina Carena for help drawing Figure 10.1 and Jerónimo Basa for drawing
the mussels pictures for Chapter 4.
Special thanks to Marco Tabacman for making our codes more readable and
to Eduardo Tabacman for his guidance in creating numerous graphics in R.
Reminders of the following notation may be included in the book from time
to time.
• r number of responses.
• p number of predictors.
xxi
xxii Notation and Definitions
• (M )ij denotes the i, j-th element of the matrix M . (V )i denotes the i-th
element of the vector V .
• Rm×n is the set of all real m × n matrices and Sk×k is the set of all real
symmetric k × k matrices.
E (A − E(A))(C − E(C))T
ΣA,C =
E (A − E(A))(A − E(A))T .
ΣA =
Σ(A,C) = ΣB ,
we set
A1|2 = A1,1 − A1,2 A−1
2,2 A2,1 .
• For stochastic vectors A and B, βA|B = ΣA,B Σ−1 B . The subscripts may be
dropped if clear from context; for instance, β = Σ−1
X ΣX,Y .
A ⊗ B = (A)ij B, i = 1, . . . , r, j = 1, . . . s,
• Common acronyms
R. Dennis Cook
School of Statistics
University of Minnesota
Minneapolis, MN 55455, U.S.A.
Email: [email protected]
xxvii
xxviii Authors
Liliana Forzani
1.1 Corn moisture: Plots of the PLS fitted values as ◦ and lasso
fitted values as + versus the observed response for the test
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Meat protein: Plot of the PLS fitted values as ◦ and lasso fitted
values as + versus the observed response for the test data. . . 7
1.3 Serum tetracycline: Plots of the PLS fitted values as ◦ and lasso
fitted values as + versus the observed response for the test
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.1 Mussels’ data: Plot of the observed responses Yi versus the fit-
ted values Ybi from the PLS fit with one component, q = 1 . . . 118
4.2 Mussels’ data: Plot of the fitted values Ybi from the PLS fit
with one component, q = 1, versus the OLS fitted values. The
diagonal line y = x was included for clarity. . . . . . . . . . . . 118
xxix
xxx List of Figures
6.1 Plot of lean body mass versus the first partial SIR predictor
based on data from the Australian Institute of Sport. circles:
males; exes: females. . . . . . . . . . . . . . . . . . . . . . . . . 184
6.2 Economic growth prediction for 12 South American countries,
2003–2018, n = 161: Plot of the response Y versus the leave-
one-out fitted values from (a) the partial PLS fit with q1 = 7
and (b) the lasso. . . . . . . . . . . . . . . . . . . . . . . . . . 199
7.3 Coffee data: Plots of PLS, PFC, and Isotropic projections. . . . 226
7.4 Olive oil data: Plots of PLS, PLS+PFC, and Isotropic projec-
tions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
9.1 Diagnostic plots for fat in the Tecator data. Plot (a) was
smoothed with a quadratic polynomial. The other three plots
were smoothed with splines of order 5. (From Fig. 1 of Cook
and Forzani (2021) with permission.) . . . . . . . . . . . . . . . 268
9.2 Diagnostic plots for protein in the Tecator data. Plot (a) was
smoothed with a quadratic polynomial. The other three plots
were smoothed with splines of order 5. . . . . . . . . . . . . . . 269
9.3 Diagnostic plots for water in the Tecator data. Plot (a) was
smoothed with a quadratic polynomial. The other three plots
were smoothed with splines of order 5. . . . . . . . . . . . . . . 270
9.4 Tecator data: root mean squared prediction error versus num-
ber of components for fat. Upper curve is for linear PLS,
method 1 in Table 9.1; lower curve is for inverse PLS predic-
T
tion with βbnpls X, method 4. (From Fig. 2 of Cook and Forzani
(2021) with permission.) . . . . . . . . . . . . . . . . . . . . . . 274
9.5 Solvent data: Predictive root mean squared error PRMSE ver-
sus number of components for three methods of fitting. A: linear
PLS. B: non-parametric inverse PLS with W TX, as discussed in
Section 9.7. C: The non-linear PLS method proposed by Lavoie
et al. (2019). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
1.1 Root mean squared prediction error (RMSE), from three ex-
amples that illustrate the prediction potential of PLS
regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
xxxiii
xxxiv List of Tables
7.1 Olive oil and Coffee data: Estimates of the correct classification
rates (%) from leave one out cross validation. . . . . . . . . . . 224
9.2 Tecator Data:(a) Root mean squared training and testing pre-
diction errors for five methods of prediction. . . . . . . . . . . . 275
9.3 Etanercept Data: Numbering of the methods corresponds to
that in Table 9.2. Part (a): Number of components was based
on leave-one-out cross validation using the training data with
p = 9,623 predictors. RMSEP is the root mean squared error
of prediction for the testing data. Part (b): The number of
components for MLP is the number of principal components
selected for the input layer of the network. Part (c): Number of
components was based on leave-one-out cross validation using
all 35 data points. CVRPE is the cross validation root mean
square error of prediction based on the selected 6 components. 277
Introduction
DOI: 10.1201/9781003482475-1 1
2 Introduction
statistical methods, the overarching goal being to separate with clarity the
information in the data that is material to the study goals from that which
is immaterial, which is in the spirit of Fisher’s notion of sufficient statistics
(Fisher, 1922). This is the same as the general objective of PLS regression,
and it is this connection of purpose that promises to open a new chapter in
PLS regression by enhancing its current capabilities, extending its scope and
bringing applied scientists and statisticians closer to a common understanding
of PLS regression (Cook and Forzani, 2020).
Following brief expository examples in the next section, this chapter sets
forth the multivariate linear model that forms the basis for much of this book
and describes the algebra that underpins envelopes, which are reviewed in
Chapter 2, and our treatment of PLS regression, which begins in earnest in
Chapter 3. This and the next chapters are intended to set the stage for our
treatment of PLS regression in Chapter 3 and beyond. Although we indicate
how results in this chapter link to PLS algorithms, it nevertheless can be read
independently or used as a reference for the developments in later chapters.
Throughout this book we use 5-fold or 10-fold cross validation to pick
subspace dimensions and the number of compressed predictors in PLS. These
algorithms require a seed to initiate the partitioning of the sample, and the
results typically depend on the seed. However, in our experience, the seed does
not affect results materially.
1.1 Examples
In this section we describe three experiments where PLS regression has been
used effectively. This may give indications about the value of PLS and about
the topics addressed later in this book. All three examples are from chemo-
metrics, where PLS has been used extensively since the early 1980s. Lasso
(Tibshirani, 1996) fits were included to represent sparse methodology. PLS
and lasso fits were determined by using library{pls} and library{glmnet} in R
(R Core Team, 2022).
of 700 predictors (e.g. Allegrini and Olivieri, 2013). The model was trained on
a total of ntrain = 50 corn samples and then tested on a separate sample of
size ntest = 30. Let Y denote the moisture content of a corn sample, and let
X denote the corresponding 700 × 1 vector of spectral predictors. We assume
that the mean function is linear, E(Y | X) = α + β T X. The overarching goal
is then to use the training data (Yi , Xi ), i = 1, . . . , 50, to produce an estimate
b of the scalar α and an estimate βb of the 700 × 1 vector β to give a linear
α
rule, E(Y
b | X) = αb + βbT X, for predicting the moisture content of corn.
Since ntrain p, traditional methods like OLS are not serviceable. It
has become common in such settings to assume sparsity, which reflects the
view that only a relatively few predictors, p∗ < ntrain , furnish information
about the response. The lasso represents methods that pursue sparse fits,
producing estimates of β with most coefficients equal to zero. Another option
is to use PLS regression, which includes a dimension reduction method that
compresses the predictors onto q < ntrain linear combinations traditionally
called components, X 7→ W TX where W ∈ Rp×q , and then bases prediction
on the OLS linear regression of Y on W TX. The number of components q is
typically determined by using cross validation or a holdout sample. It is known
that PLS can produce effective predictions when ntrain p and the predictors
are highly collinear, as they often are in chemometrics applications. A third
option is to use OLS, replacing inverses with generalized inverses. This method
is known to produce relatively poor results; we use it here as an understood
reference point.
The second row of Table 1.1, labeled “Moisture in corn,” summarizes the
results. Columns 2–6 give characteristics of the fits. With a 10-fold cross vali-
dation, PLS selected 23 compressed predictors or components, while the lasso
picked 21 of the 700 predictors as relevant. The final three columns give the
root mean squared error (RMSE) of prediction for the ntest = 30 testing
observations,
" n
#1/2
Xtest
RMSE = n−1
test (Yi − E(Y
b 2
| Xi )) .
i=1
Clearly, PLS did best over the test dataset. This conclusion is supported by
Figure 1.1, which shows plots of the PLS and lasso fitted values versus the
observed response for the test data.
Examples 5
TABLE 1.1
Root mean squared prediction error (RMSE), from three examples that illus-
trate the prediction potential of PLS regression.
RMSE
∗
Dataset p ntrain ntest PLS q Lasso p Lasso PLS OLS
Moisture in corn 700 50 30 23 21 0.114 0.013 6.52
Protein in meat 100 170 70 13 19 1.232 0.787 3.09
Tetracycline in serum 101 50 57 4 10 0.101 0.070 1.17
Columns 2–6 give the number of predictors p, the size of the training set
ntrain , the size of the test set ntest , the number of PLS components q cho-
sen by 10-fold cross validation, and the number of predictors p∗ estimated
to have nonzero coefficients by the Lasso. Columns 7–9 give the root mean
squared prediction error from the test data for the Lasso, PLS, and OLS using
generalized inverses.
FIGURE 1.1
Corn moisture: Plots of the PLS fitted values as ◦ and lasso fitted values as
+ versus the observed response for the test data.
6 Introduction
FIGURE 1.2
Meat protein: Plot of the PLS fitted values as ◦ and lasso fitted values as +
versus the observed response for the test data.
The fourth row of Table 1.1 shows the results of lasso, PLS, and OLS
fitting. PLS compressed the data into q = 4 components, while the lasso
judged that 10 of the original 101 predictors are relevant. The RMSE in the
last three columns of Table 1.1 again show that PLS has the smallest overall
error, followed by the lasso and then OLS. Figure 1.3 shows plots of the
PLS and lasso fitted values versus the observed response for the test data.
Our visual impression conforms qualitatively with the root mean squares in
Table 1.1. In this example, the lasso and PLS fitted values seem in better
agreement than those of the previous two examples, an observation that is
supported by the root mean squared values in Table 1.1.
FIGURE 1.3
Serum tetracycline: Plots of the PLS fitted values as ◦ and lasso fitted values
as + versus the observed response for the test data.
X, the predictors are stochastic and not ancillary, as the marginal distribution
of X is hypothesized to carry relevant information about the regression. With
stochastic predictors, we replace Xi in model 1.1 with Xi − µX , centering the
predictors about the population mean µX , and we add the assumption that
the predictor vector X is independent of the error vector ε. Let ΣX = var(X)
and ΣX,Y = cov(X, Y ). With this structure, we can represent β in terms of
parameters of the joint distribution
β = Σ−1
X ΣX,Y . (1.2)
ΣA = E (A − E(A))(A − E(A))T .
Σ(A,C) = ΣB ,
provided AT ∆A > 0. The usual inner product arises when ∆ = I and then
PA = A(AT A)−1 AT . Regardless of the inner product, Q·(·) = Ir −P·(·) denotes
the orthogonal projection.
We will occasionally encounter a conditional variate of the form N | C T N ,
where N ∈ Rr is a normal vector with mean µ and variance ∆, and C ∈ Rr×q
is a non-stochastic matrix with q < r. Assuming that C T ∆C > 0, the mean
and variance of this conditional form are as follows (Cook, 1998, Section 7.2.3).
E(N | C T N ) T
= µ + PC(∆) (N − µ) (1.4)
var(N | C T N ) = ∆ − ∆C(C T ∆C)−1 C T ∆
= ∆QC(∆)
= QTC(∆) ∆QC(∆) . (1.5)
Let Y denote the n × r centered matrix with rows (Yi − Ȳ )T , let Y0 denote
the n × r uncentered matrix with rows YiT , let X denote the n × p matrix with
rows (Xi − X̄)T and let X0 denote the n×p matrix with rows XiT , i = 1, . . . , n.
With this notation, model (1.1) can be represented in full matrix form as
A ⊗ B = (A)ij B, i = 1, . . . , r, j = 1, . . . , s,
where (A)ij denotes the ij-th element of A. Kronecker products are not in
general commutative, A ⊗ B 6= B ⊗ A. The vec operator transforms a matrix
A ∈ Rr×u to a vector vec(A) ∈ Rru by stacking its columns. Representing A
in terms of its columns aj , A = (a1 , . . . , au ), then
a1
a2
vec(A) = .. .
.
au
1.2.2 Estimation
With normality of the errors in model (1.1) and n > p, the maximum likelihood
estimator of α is Ȳ and the maximum likelihood estimator of β, which is also
the OLS estimator, is
−1
βbols = (XT X)−1 XT Y = (XT X)−1 XT Y0 = SX SX,Y , (1.7)
where the second equality follows because the predictors are centered and, as
defined previously, Y0 denotes the n × r uncentered matrix with rows YiT .
A justification for this result is sketched in Appendix Section A.1.1. We use
T
Ybi = Ȳ + βbols (Xi − X̄) and R
bi = Yi − Ybi to denote the i-th vectors of fitted val-
ues and residuals, i = 1, . . . , n. Notice from (1.7) that βbols can be constructed
by doing r separate univariate linear regressions, one for each element of Y on
X. The coefficients from the j-th regression then form the j-th column of βbols ,
j = 1, . . . , r. This observation will be useful when discussing the differences be-
tween PLS1 algorithms, which apply to regressions with a univariate response,
and PLS2 algorithms, which apply to regressions with multiple responses.
The sample covariance matrices of Yb , R, b and Y – which are denoted SY ◦X ,
SY |X , and SY – can be expressed as
−1
SY ◦X = n−1 YT PX Y = SY,X SX SX,Y , (1.8)
Xn
SY |X = n−1 R biT = n−1 YT QX Y,
bi R (1.9)
i=1
−1
= SY − SY,X SX SXY ,
= SY − SY ◦X ,
SY = n−1 YT Y = SY ◦X + SY |X , (1.10)
which corresponds to the estimated coefficient matrix from the OLS fit of the
−1/2 −1/2
standardized responses SY |X Y on the standardized predictors SX X.
The joint distribution of the elements of βbols can be found by using the vec
operator to stack the columns of βbols : vec(βbols ) = {Ir ⊗ (XT X)−1 XT }vec(Y0 ).
Although X is stochastic in PLS applications, properties of βbols are typically
described conditional on the observed values of X. We emphasize that in the
following calculations by conditioning on the observed matrix of uncentered
predictors X0 . Since vec(Y0 ) is normally distributed with mean α ⊗ 1n + (Ir ⊗
X)vec(β) and variance ΣY |X ⊗ In , it follows that vec(βbols ) | X0 is normally
distributed with mean and variance
T
The covariance matrix can be represented also in terms of βbols by us-
T
ing the rp × rp commutation matrix Krp to convert vec(βols ) to vec(βbols
b ):
T
vec(βols ) = Krp vec(βols ) and
b b
−1 −1
T
var{vec(βbols ) | X0 } = n−1 Krp (ΣY |X ⊗ SX T
)Krp = n−1 SX ⊗ ΣY |X .
Let ei ∈ Rr denote the indicator vector with a 1 in the i-th position and
0’s elsewhere. Then, the covariance matrix for the i-th column of βbols is
−1
var{vec(βbols ei ) | X0 } = (eTi ⊗Ip )var{vec(βbols ) | X0 }(ei ⊗Ip ) = n−1 SX (ΣY |X )ii .
14 Introduction
We commented following (1.7) that the j-th column of βbols is the same as
doing the linear regression of the j-th response on X, j = 1, . . . , r. Consistent
with that observation, we see from this that the covariance matrix for the
j-th column of βbols is the same as that from the marginal linear regression
of (Y )j on X. We refer to the estimate (βbols )ij divided by its standard error
−1
{n−1 (SX )jj (SY |X )ii }1/2 as a Z-score:
(βols )ij
b
Z= −1 . (1.15)
{n−1 (SX )jj (SY |X )ii }1/2
This statistic will be used from time to time for assessing the magnitude of
(βbols )ij , sometimes converting to a p-value using the standard normal distri-
bution.
The usual log-likelihood ratio statistic for testing that β = 0 is
|SY |
Λ = n log , (1.16)
|SY |X |
Asymptotic normality also holds without normal errors but with some tech-
nical conditions: if the errors have finite fourth moments and if the maximum
√
leverage converges to 0, max1≤i≤n (PX )ii → 0, then n(vec(βbols ) − vec(β))
converges in distribution to a normal vector with mean 0 (e.g. Su and Cook,
2012, Theorem 2).
Invariant and reducing subspaces 15
Y = µ + β1T R T
b1|2 + β2∗ (X2 − X̄2 ) + ε. (1.20)
In this version of the partitioned model, the parameter vector β1 is the same as
6 β2∗ unless SX1 ,X2 = 0. The predictors – R
that in (1.19), while β2 = b1|2 and X2
– in (1.20) are uncorrelated in the sample SRb1|2 ,X2 = 0, and consequently the
maximum likelihood estimator of β1 is obtained by regressing Y on R b1|2 . The
maximum likelihood estimator of β1 can also be obtained by regressing R bY |2 ,
the residuals from the regression of Y on X2 , on R b1|2 . A plot of R
bY |2 versus
Rb1|2 is called an added variable plot (Cook and Weisberg, 1982). These plots
are often used in univariate linear regression (r = 1) as general graphical
diagnostics for visualizing how hard the data are working to fit individual
coefficients.
Model (1.20) will be used in Chapter 6 when developing partial PLS and
partial envelope methods.
section is based on the results of Cook et al. (2010). The statistical relevance
of the constructions in this section will be addressed in Chapter 2.
If R is a subspace of Rr and A ∈ Rr×c , then we define
AT R = {ATR | R ∈ R}
RTA = {RTA | R ∈ R}
R⊥ = {S ∈ Rr | S T R = 0}.
R1 + R2 = {R1 + R2 | Rj ∈ Rj , j = 1, 2}.
Recall from Section 1.2.1 that Sr×r denotes the space of all real, symmetric
r ×r matrices. The next lemma tells us that if M ∈ Sr×r and R is an invariant
subspace of M then R reduces M . This fact is handy in proofs because to
show that R reduces a symmetric M , we need to show only that R is an
invariant subspace of M . In this book we will be concerned almost exclusively
with real symmetric linear transformations M and reducing subspaces.
Subspaces of this form are often called cyclic invariant subspaces in linear
algebra.
Subspaces (1.21) arise in connection with PLS regressions having a uni-
variate response, particularly when allowing for the possibility that the full
set {x, M x, . . . , M r−1 x} may not be necessary to span R. For t ≤ r, let
Kt (M, x) = {x, M x, M 2 x, . . . , M t−1 x}
, (1.22)
Kt (M, x) = span{Kt (M, x)}
which are called a Krylov basis and a Krylov subspace of dimension t in numer-
ical analysis, terminology that we adopt for this book. For example, let r = 3,
√ √ √
v1 = (1, 1, 1)/ 3, v2 = (−1, 1, 0)/ 2 and v3 = (1, 1, −2)/ 6, and construct
M = 3Pv1 + Pv2 + Pv3 . Then K1 (M, v1 ) and, for any real scalars a and b,
K1 (M, av2 + bv3 ) are one-dimensional reducing subspace of M . For an arbi-
trary vector x ∈ R3 , what is the maximum possible dimension of Kq (M, x)?
If M R ⊆ R, so R is an invariant subspace of M , and if x ∈ R then
clearly M x ∈ R. Any invariant subspace of M that contains x must then
also contain all of the vectors {x, M x, . . . , M r−1 x}. Consider a subspace R
that is unknown, but known to be an invariant subspace of M . If we know or
can estimate one vector x ∈ R, then we can iteratively transform x by M to
obtain additional vectors in R: for any t, Kt (M, x) ⊆ R. Here we can think
18 Introduction
5. If R ⊆ span(M ) then
Definition 1.2. Let M ∈ Sr×r and let S ⊆ span(M ). Then, the M -envelope
of S, denoted by EM (S), is the intersection of all reducing subspaces of M that
contain S.
EM (KS) = EM (S).
ΣY = ΣY |X + β T ΣX β = ΣY |X + GV GT , (1.25)
= EΣY (Σ−1 0 −1 0
Y B ) = EΣY (ΣY |X B ) (1.26)
= EΣY |X (Σ−1 0
Y B ).
1.5.1 Algorithm E
The next proposition (Cook et al., 2010) describes a method of constructing
an envelope in terms of the eigenspaces of M . We use h generally to denote
the number of eigenspaces of a real symmetric matrix. Recalling the subspace
computations introduced at the outset of Section 1.3,
Algebraic methods of envelope construction 23
Ph
Statement 2 holds because A = {P1 v + · · · + Ph v : v ∈ A} ⊆ i=1 Pi A.
Ph
Turning to statement 3, if T reduces M , it can be written as T = i=1 Pi T .
If, in addition, A ⊆ T then we have Pi A ⊆ Pi T for i = 1, . . . , h. Statement 3
Ph Ph
follows since i=1 Pi A ⊆ i=1 Pi T = T .
1.5.2 Algorithm K
When dim(A) = 1, the dimension of the envelope is bounded above by the
number h of eigenspaces of M : dim(EM (A)) ≤ h. We use Proposition 1.9
in this case to gain further insights into the Krylov subspaces discussed in
Section 1.3 and to motivate a new algebraic method of constructing envelopes.
As we will see in Chapter 3, Algorithm K is a foundation for the NIPALS
regression algorithm with a univariate response, Y ∈ R1 .
Let a ∈ Rr be a basis vector for the one-dimensional subspace A, let λi 6= 0
be the eigenvalue for the i-th eigenspace of M , let q = dim(EM (A)) so that a
24 Introduction
1 λ1 λ21 · · · λt−1
1
t−1
1 λ2 λ22 · · · λ2
V = .. .. .. ..
. . . .
1 λq λ2q · · · λt−1
q q×t
we have
Kt (M, a) = Cq V.
are monotonic
as they are when h = 1 (1.23). However, the stopping point t is no longer equal
to the dimension q of EM (S). We can bound q ≤ th, but there is no way to get
a better handle on q without imposing additional structure. The Krylov space
Kt (M, A) at termination is related to the individual Krylov spaces Kr (M, Aj ),
j = 1, . . . , h, corresponding to the columns of A as follows,
h
X h
X h
X
Kt (M, A) = Kt (M, Aj ) = Kqj (M, Aj ) = EM (Aj ) = EM (A)
j=1 j=1 j=1
where the final two equalities follow from Propositions 1.10 and 1.7. In con-
sequence, the columns of Kt (M, A) span the desired envelope EM (A), but
they do not necessarily form a basis for it, leading to inefficiencies in statis-
tical applications. For that reason a different method is needed to deal with
multi-dimensional subspaces.
1.5.3 Algorithm N
Let v1 (·) denote the largest eigenvalue of the argument matrix with corre-
sponding eigenvector `1 (·) of length 1, and let q denote the dimension of the
M -envelope of A, EM (A). Then, the following algorithm generates EM (A). We
refer to it as Algorithm N and give its proof here since one instance of it cor-
responds to the NIPALS partial least squares regression algorithm discussed
in Section 3.1. Recall from Section 1.2.1 that QA(∆) = I − PA(∆) , where PA(∆)
denotes the projection onto span(A) in the ∆ inner product.
26 Introduction
Proof. For clarity we divide the proof into a number of distinct claims. Let
λi+1 = v1 (QTUi (M ) AQUi (M ) ). Prior to stopping, λi+1 > 0.
or equivalently, since UiT QTUi (M ) = 0, λi+1 UiT ui+1 = 0, which implies that
UiT ui+1 = 0 since λi+1 6= 0. Since ui+1 is an eigenvector chosen with
length 1, UiT Ui = I.
In the claims that follow, let k be the point at which the algorithm termi-
nates.
Claim 2: QTUk (M ) A = 0.
Since A is positive semi-definite there is a full rank matrix V so that
A = V V T . When the algorithm terminates, QTUk (M ) V V T QUk (M ) = 0, and
so QTUk (M ) V = 0 and QTUk (M ) V V T = QTUk (M ) A = 0.
Algebraic methods of envelope construction 27
Next,
λi+1 ui+1 = QTUi (M ) AQUi (M ) ui+1
= A − PUTi (M ) A QUi (M ) ui+1 . (1.31)
Substituting the right-hand side of (1.30) for the first A in (1.31) and using
Ui = Uk (Ii , 0)T gives
λi+1 ui+1 = M Uk (UkT M Uk )−1 UkT A − M Uk (Ii , 0)T (UiT M Ui )−1 UiT A
× QUi (M ) ui+1
= M Uk (UkT M Uk )−1 UkT A − (Ii , 0)T (UiT M Ui )−1 UiT A
× QUi (M ) ui+1
= M Uk Vi+1 for i = 0, . . . , k − 1
AT QUd−1 (M ) Γ0 = 0. (1.32)
and
ũTd QTUd−1 (M ) AAT QUd−1 (M ) ũd = uTd QTUd−1 (M ) AAT QUd−1 (M ) ud /(BdT Bd )
> uTd QTUd−1 (M ) AAT QUd−1 (M ) ud ,
where the inequality follows because BdT Bd < 1 and (1.33). Therefore, the
maximum is not at ud , which is the contradiction we seek.
Algebraic methods of envelope construction 29
(ii) for s = 1 and t ≤ q = dim(EM (A)) we have that KtT(M, A)M Kt (M, A) >
0.
For conclusion (ii), first it follows from Proposition 1.10 that Kt (M, A) has
full column rank if and only if t ≤ q. Since A ⊆ span(M ) there is a non-zero
vector b ∈ Rr so that A = M b and
u2 = QTu1 (M ) a = QTa(M ) a
= a − M a/(aTM a),
where KtT M Kt > 0 by Lemma 1.5 (ii). Let ht = (KtT M Kt )−1 KtT a ∈ Rt .
Then
ut+1 = a − (M a, M 2 a, . . . , M t a)ht ∈ Rr
!
1
= Kt+1 .
−ht
Substituting this into the above representation for Ut+1 ∈ Rr×(t+1) gives
! ! !
Ht Ht 1
Ut+1 = Kt+1 , ut+1 = Kt+1 .
01×t 0 −ht
Since t < q, Ut+1 and Kt+1 both have full column rank t + 1, so
!
Ht 1
Lt :=
0 −ht
Kt . That is, since Ut+1 = Kt+1 Lt and the columns of Ut+1 are orthogonal with
length 1, the multiplication of Kt+1 on the right by Lt serves to orthogonalize
the columns of Kt+1 .
1.5.4 Algorithm S
The algorithm indicated in the following proposition is called Algorithm S be-
cause a special case of it yields the SIMPLS algorithm discussed in Section 3.3.
Like Algorithm N, its population version requires positive semi-definite ma-
trices A, M ∈ Sr×r as inputs with A = span(A) ⊆ span(M ). We may describe
specific instances of the algorithm as S(A, M ) using A and M as arguments.
Sample versions of the algorithm are similarly described as S(A,
b Mc), where A b
and Mc are consistent estimators of A and M .
u M Uk = 0 and uT u = 0.
T
The form of the algorithm given in (1.34) is not directly constructive, but
it connects with a version of the SIMPLS algorithm in the literature; see
Section 3.3.
The following corollary is provided in summary; its proof straightforward
and omitted.
1.5.5 Algorithm L
The Algorithms K, N, and S all allow for M to be positive semi-definite. In
contrast, the algorithm needed to implement Proposition 1.14 requires M to
be positive definite. It is referred to as Algorithm L because it is inspired by
likelihood-based estimation discussed in Sections 2.3.1 and 2.5.3.
where minG is taken over all semi-orthogonal matrices G ∈ Rr×q and (G, G0 )
is an orthogonal matrix.
Algebraic methods of envelope construction 33
log |GT M G| + log |GT0 (M + A)G0 | = log |GT M G| + log |GT0 M G0 + GT0 AG0 |
≥ log |GT M G| + log |GT0 M G0 |
= log |M | + log |GT M G| + log |GT M −1 G|
≥ log |M |,
where the final step follows from Lemma 1.3 (III). To achieve the lower bound,
equality in the first inequality requires that A ⊆ span(G). The second equal-
ity follows from Lemma 1.3. Consequently, achieving equality in the second
inequality requires that span(G) reduce M . Overall then, span(G) must be a
reducing subspace of M that contains A. The conclusion follows since q is the
dimension of the smallest subspace that satisfies these two properties.
DOI: 10.1201/9781003482475-2 35
36 Envelopes for Regression
least squares regression. Envelopes were used as a basis for studying the high
dimensional behavior of partial least squares regression by Cook and Forzani
(2018, 2019). Although partial least squares regression is usually associated
with the multivariate linear model, Cook and Forzani (2021) showed that PLS
dimension reduction may be serviceable in the presence of non-linearity. The
role of envelopes in furthering the application of PLS in chemometrics was
discussed by Cook and Forzani (2020).
Bayesian versions of envelopes were developed by Khare, Pal, and Su
(2016) and Chakraborty and Su (2023). Su and Cook (2011) developed the no-
tion of inner envelopes for capturing part of the regression. Each of these and
others demonstrate a potential for envelope methodology to achieve reduc-
tion in estimative and predictive variation beyond that attained by standard
methods, sometimes by amounts equivalent to increasing the sample size many
times over. An introduction to envelopes is available from the monograph by
Cook (2018) and a brief overview from Cook (2020).
Our goal in this chapter is to give envelope foundations that allow us to
connect with PLS regression methodology.
Early work that marks the beginning of the area is available from Cook
(1998). Relatively recent advances are described in the monograph by Li
(2018). To help fix ideas, consider the following four standard models, each
with stochastic predictor X ∈ Rp and β, β1 , and β2 all p × 1 vectors.
In models 1–3, the response depends on the predictor only via β TX, and so
Y X | β TX and SY |X = span(β). From a model-free SDR perspective, these
models look the same because each depends on only one linear combination of
the predictors. Model 4 depends on two linear combinations of the predictor,
SY |X = span(β1 , β2 ). In the multivariate linear model (1.1) with β ∈ Rp×r ,
we have SY |X = span(β) ⊆ Rp .
As indicated in this equation, conditions (2.2a) and (2.2b) are together equiv-
alent to condition (2.2c), which says that jointly (Y, PS X) is independent of
QS X. The material components are still represented as PS X and the imma-
terial components as QS X. See Cook (2018) for further discussion of the role
of conditional independence in envelope foundations.
In full generality, (2.2) represents a strong ideal goal that is not cur-
rently serviceable as a basis for methodological development because of the
need to assess condition (2.2b). However, if the predictors are normally dis-
tributed, condition (2.2b) is manageable since it then holds if and only if
cov(PS X, QS X) = PS ΣX QS = 0, where ΣX denotes the covariance matrix
of the predictor vector X. This connects with the reducing subspaces of ΣX
as described in the next lemma. Normality is not required, and its proof is in
Appendix A.2.1.
Lemma 2.1. Let S ⊆ Rp . Then S reduces ΣX if and only if
cov(PS X, QS X) = 0.
Model-free predictor envelopes defined 39
Requirement
Paradigm (a) Y X | PS X (b) PS X QS X
SDR Construction No constraints
Variable Selection Construction Assumption
Envelopes Construction Construction
correlated errors. They assume in effect that n1/2 times the sample correlation
between the selected and eliminated predictors is bounded, which exerts firm
control over the dependence between PS X and QS X as p and n grow.
Envelopes handle the dependence between PS X and QS X by construction,
the selected subspaces S being required to satisfy both conditions (a) and (b)
of Table 2.1. This has the advantage of ensuring a crisp distinction between
material and immaterial parts of X, but it has the potential disadvantage of
leading to a larger subspace since SY |X ⊆ EΣX (SY |X ). Nevertheless, experi-
ence has indicated that this tradeoff is very often worthwhile, particularly in
model-based analyses and in high-dimensional settings where n p.
Asymptotic variances can provide insights into the potential gain offered by
the envelope estimator: Repeating (1.18) for ease of comparison,
√
avar( nvec(βbols )) = ΣY |X ⊗ Σ−1
X
√
avar( nvec(βbΦ )) = ΣY |X ⊗ Φ∆−1 ΦT , (2.7)
Likelihood estimation of predictor envelopes 43
√
where avar( nvec(βbΦ )) is based on Cook (2018, Section 4.1.3). This gives a
difference of
ΣY |X ⊗ (Σ−1
X − Φ∆ Φ ) = ΣY |X ⊗ Φ0 ∆−1
−1 T T
0 Φ0 ,
which indicates that the difference in asymptotic variances will be large when
the variability of the immaterial predictors ∆0 = var(ΦT0 X) is small relative
to ∆ = var(ΦT X). This can be seen also from the ratio of variance traces:
tr{ΣY |X ⊗ Φ∆−1 ΦT } tr{∆−1 }
= . (2.8)
tr{ΣY |X ⊗ Σ−1
X } tr{∆−1 } + tr{∆−1
0 }
where !
Φ0 Φ 0
O= ∈ R(p+r)×(p+r)
0 0 Ir
is an orthogonal matrix and
∆0 0 0
ΣO T C = 0 ∆ γ ∈ R(p+r)×(p+r)
0 γT ΣY
The objective function Lq (G) is an instance of the objective function for the
sample version of Algorithm L introduced in Section 1.5.5. The connection is
obtained by setting M = SX|Y and A = SY ◦X . Then from (1.8) to (1.10) we
have SX = SX|Y + SY ◦X = M + A, so
−1
log |GT M G| + log |GT (M + A)−1 G| = log |GT SX|Y G| + log |GT SX G|.
46 Envelopes for Regression
Σ
bY = SY ,
∆
b = b T SX Φ,
Φ b
∆
b0 b T0 SX Φ
= Φ b 0,
γ
b b T SX,Y
= Φ
b −1 γ
ηb = ∆ b.
Σ
b X,Y = PΦ
b SX,Y ,
Σ
bX = Φ
b∆ bT + Φ
bΦ b 0∆ b T0 = P b SX P b + Q b SX Q b ,
b 0Φ
Φ Φ Φ Φ
−1
Σ
b Y |X = SY − SY,X PΦ
b SX PΦ
b SX,Y
βb = Φ
b∆b −1 γ
b = Φ(
b Φ b −1 Φ
b T SX Φ) b T SX,Y = P b (2.12)
Φ(SX ) βols ,
b
−1
where βbols = SX SX,Y is the ordinary least squares estimator of β. The es-
timators ∆, ∆0 and γ
b b b depend on the selected basis Φ. b The parameters of
interest – Σ b X and βb – depend on EbΣ (B) but do not depend on the
b X,Y , Σ
X
= ΣY |X ⊗ Φ∆−1 ΦT + (η T ⊗ Φ0 )M † (η ⊗ ΦT0 ),
where M = ηΣ−1 T −1
Y |X η ⊗ ∆0 + ∆ ⊗ ∆0 + ∆
−1
⊗ ∆0 − 2Iq ⊗ Ip−q . Additionally,
Tq = n(F (SC , ΣC ) − F (SC , SC )) converges to a chi-squared random variable
b
with (p − q)r degrees of freedom, and
n√ o n√ o
avar b ≤ avar
nvec(β) nvec(βbols ) .
Likelihood estimation of predictor envelopes 47
√
The first addend avar{ nvec(βbΦ )} = ΣY |X ⊗ Φ∆−1 ΦT on the right hand
side of the asymptotic variance is the asymptotic variance of maximum like-
lihood estimators of β when a basis Φ for the envelope is known, as dis-
√
cussed previously around (2.7. The second addend avar{ nvec(QΦ βbη )} =
(η T ⊗ Φ0 )M † (η ⊗ ΦT0 ) can then be interpreted as the cost of estimating the
envelope. The final statement of the proposition is that the difference between
the asymptotic variance of the ordinary least squares estimator and that for
the envelope estimator is always positive semi-definite.
The following corollary, also from Cook et al. (2013), gives sufficient con-
ditions for the envelope and OLS estimators to be asymptotically equivalent.
Let SX|Y denote the central subspace for the regression of X on Y . Then,
in parallel to Definition 2.2, we have
Definition 2.3. Assume that SX|Y ⊆ span(ΣY ). The response envelope for
the regression of Y on X is then the intersection of all reducing subspaces of
ΣY that contain SX|Y , and is denoted as EΣY (SX|Y ), which is represented as
EY for use in subscripts.
Comparing Definitions 2.2 and 2.3 we see that response and predictor
envelopes are dual constructions when X and Y are jointly distributed, as
summarized in Table 2.2. The predictor envelope EΣX (SY |X ) for the regression
of Y on X is also the response envelope for the regression of X on Y , and
the response envelope EΣY (SX|Y ) for the regression of Y on X is also the
predictor envelope for the regression of X on Y . These equivalences in model-
free envelope construction are shown in Table 2.2(A).
TABLE 2.2
Relationship between predictor and response envelopes when X and Y are
jointly distributed: CX,Y = span(ΣX,Y ) and CY,X = span(ΣY,X ). B =
span(βY |X ) and β 0 = span(βYT |X ) are as used previously for model (1.1). B∗
and B∗T are defined similarly but in terms of the regression of X | Y . P-env
and R-env stand for predictor and response envelope.
The envelope forms EΣX (CX,Y ) and EΣY (CY,X ) reflect the duality of re-
sponse and predictor envelopes when X and Y are jointly distributed, while
forms EΣX(B) and EΣY (B 0 ) are in terms of model (1.1) to facilitate re-
parametrization.
So far, we have required that X and Y be jointly distributed. This implies,
for example, that PLS predictor reduction would not normally be useful in
designed experiments . For instance, if predictor values are selected to follow
a 2k factorial or if center points are included to allow estimation of quadratic
effects, there seems no reason to suspect that the predictor distribution carries
useful information about the responses. However, response reduction may still
be of value.
The conditions (a) and (b) in (2.14) can be stated alternatively as
(a) For all x1 , x2 , FQE Y |X=x1 (z) = FQE Y |X=x2 (z)
, (2.16)
(b) For all x, PE Y QE Y | X = x,
EΣY (B 0 ) = EΣY |X (B 0 ).
Variances can provide insights into the potential gain offered by the envelope
estimator:
var(vec(βbΓ )) = var(vec(βbols PΓ ))
= (PΓ ⊗ Ip )var(vec(βbols ))(PΓ ⊗ Ip )
−1
= n−1 (PΓ ⊗ Ip ) ΣY |X ⊗ SX
(PΓ ⊗ Ip )
−1
= n−1 ΓΩΓT ⊗ SX ,
−1
var(vec(βbols )) − var(vec(βbΓ )) ≥ n−1 Γ0 Ω0 ΓT0 ⊗ SX ≥ 0.
t21
1 t1
1 t2 t22
Γ= .. .. .. ,
. . .
1 tr t2r
where tj is the time at which the j-th measurement is taken. The model in
this case is simply Yi = Γα + εi , where α = (α0 , α1 , α2 )T ∈ R3 contains the
coefficients of the quadratic response. For instance, the expected responses at
times t1 and tr are α0 + α1 t1 + α2 t21 and α0 + α1 tr + α2 t2r , respectively.
Expanding the context of the illustration, suppose now that we are com-
paring two treatments indicated by X = 0, 1 and that the effect of a treatment
is to alter the coefficients of the quadratic. Let the columns of the matrix
α00 α01
A = α10 α11 = (α•0 , α•1 )
α20 α21
This model is now in the form of the response envelope model (2.17), except
that Γ is known. Depending on the context, variations of this model may be
appropriate. For example, allowing for an unconstrained intercept, the model
becomes
Yi = δ0 + ΓδX + ε.
effect could be detected and, if so, estimate the time point when the difference
was first manifested. Judging from the plot of average weight by treatment
and time (Cook, 2018, Fig. 1.5), it is not clear what functional form could be
used to describe the weight profiles. In such cases, treating Γ as unknown and
adopting a response envelope model (2.17) may be a viable analysis strategy.
Details of such an analysis are available from Cook (2018). PLS and maximum
likelihood envelope estimators of the parameters in models of the form given
in (2.19) are discussed in Section 6.4.4.
We next turn to maximum likelihood estimation of the parameters in re-
sponse envelope (2.17).
Yi − Ȳ = PΓ (Yi − Ȳ ) + QΓ (Yi − Ȳ )
L(1) 0
u (η1 , EΣY |X (B ), Ω, Ω0 ) = −(nr/2) log(2π) + L(11) 0
u (η1 , EΣY |X (B ), Ω)
+L(12) 0
u (EΣY |X (B ), Ω0 ),
where
L(11)
u = −(n/2) log |Ω|
Xn
−(1/2) {ΓT (Yi − Ȳ ) − η1 Xi }T Ω−1 {ΓT (Yi − Ȳ ) − η1 Xi }
i=1
n
X
L(12)
u = −(n/2) log |Ω0 | − (1/2) (Yi − Ȳ )T Γ0 Ω−1 T
0 Γ0 (Yi − Ȳ ).
i=1
(11)
Holding Γ fixed, Lu can be seen as the log likelihood for the multivariate
(11)
regression of ΓT (Yi − Ȳ ) on Xi , and thus Lu is maximized over η1 at the
Response envelopes for the multivariate linear model 57
(11)
value η1 = (βbols Γ)T . Substituting this into Lu and simplifying we obtain a
(11)
partially maximized version of Lu
n
X
L(21) 0
u (EΣY |X (B ), Ω) = −(n/2) log |Ω| − (1/2)
bi )T Ω−1 ΓT R
(ΓT R bi ,
i=1
where, as defined in Section 1.2, R bi is the i-th residual vector from the
fit of the standard model (1.1). From this it follows immediately that, still
(21)
with Γ fixed, Lu is maximized over Ω at Ω = ΓT SY |X Γ. Consequently,
(31)
we arrive at the third partially maximized log likelihood Lu (EΣY |X (B 0 )) =
−(n/2) log |ΓT SY |X Γ| − nu/2. By similar reasoning, the value of Ω0 that max-
(2)
imizes Lu (EΣY |X (B 0 ), Ω0 ) is Ω0 = ΓT0 SY Γ0 . This leads to the maximization
(2)
of Lu (EΣY |X (B 0 )) = −(n/2) log |ΓT0 SY Γ0 | − n(r − u)/2.
Combining the above steps, we arrive at the partially maximized form
L(2) 0 T
u (EΣY |X (B )) = −(nr/2) log(2π) − nr/2 − (n/2) log |Γ SY |X Γ|
ηb1 = b T,
(βbols Γ)
βb = b η )T = βbols Pb ,
(Γb (2.22)
Γ
Ω
b bT
= Γ SY |X Γ,
b
Ω
b0 b T0 SY Γ
= Γ b0 ,
Σ
b Y |X = Γ
bΩ bT + Γ
bΓ b0 Ω b T0 ,
b 0Γ
Proposition 2.2. Under the envelope model (2.17) with non-stochastic pre-
√
dictors, normal errors and known u = dim{EΣ (B)}, n(βb − β) is asymptoti-
cally normal with mean 0 and variance
√
b = ΓΩΓT ⊗ Σ−1 + (Γ0 ⊗ η T )U † (ΓT ⊗ η),
avar{ nvec(β)} (2.24)
X 0
where
U = Ω−1 T −1
0 ⊗ ηΣX η + Ω0 ⊗ Ω + Ω0 ⊗ Ω
−1
− 2Ir−u ⊗ Iu .
The second addend on the right side of (2.24) can be interpreted as the cost
of estimating Γ. See Cook (2018) for further discussion.
We present an illustrative analysis in the next section to help fix the ideas
behind response reduction. Many more illustrations on the use of response
envelopes are available from Cook (2018).
FIGURE 2.1
Illustration of response envelopes using two wavelengths from the gaso-
line data. High octane data marked by x and low octane numbers by o.
Eb = EbΣY |X (B 0 ): estimated envelope. Eb⊥ = EbΣ⊥Y |X (B 0 ): estimated orthogonal
complement of the envelope. Marginal distributions of high octane numbers
are represented by dished curves along the horizontal axis. Marginal envelope
distributions of low octane numbers are represented by solid curves along the
horizontal axis. (From the Graphical Abstract for Cook and Forzani (2020)
with permission.)
An envelope analysis based on model (2.17) results in the inference that the
distribution of high and low octane numbers are identical along the orthogonal
complement EΣ⊥Y |X (B 0 ) of the true one-dimensional envelope EΣY (B 0 ). The
substantial variation along Eb⊥ (B 0 ) is thus inferred to be immaterial to
ΣY |X
the analysis and, in consequence, all differences between high and low octane
numbers are inferred to lie in the envelope subspace. To estimate β1 , the
envelope analysis first projects the data onto the estimated envelope EbΣY |X (B 0 )
to remove the immaterial variation, as represented by path B in Figure 2.1.
The resulting point is then projected onto the horizontal axis for inference
on β1 . This produces two narrowly dispersed distributions represented by
dashed and solid curves in Figure 2.1. The difference in variation between the
narrowly and widely dispersed distributions represents the gain that envelopes
can achieve over standard methods of analysis, which in this case is substantial.
In this illustration we have use two wavelengths as the continuous bivari-
ate response Y with a binary predictor X. The response envelope illustrated
in Figure 2.1 is EΣY (B 0 ) = EΣY |X (B 0 ), which corresponds to the lower left
entry in Table 2.2B. This is the predictor envelope in the lower right entry in
Table 2.2B. Thus, Figure 2.1 serves to illustrate both response and predictor
envelopes. The estimation process illustrated is for response envelopes, how-
ever. The estimated envelope aligns closely with the second eigenvector of the
sample variance-covariance matrix of the wavelengths. That arises because we
are working in only r = 2 dimensions. In higher dimensions the envelope does
not have to align closely with any single eigenvector but may align with a
subset of them.
2.5.5 Prediction
We think of ΓT Y as subject to refined prediction since its distribution depends
on X, while ΓT0 Y is only crudely predictable since its distribution does not de-
pends on X. Assuming that the predictors are centered, the actual predictions
at a new value XN of X are
! !
Γb T Yb b T Ȳ + ηbXN
Γ
= .
Γb T Yb Γb T Ȳ
0 0
Multiplying by (Γ,
b Γb 0 ) we get simply
YbN = Ȳ + Γb
b η XN .
62 Envelopes for Regression
with p and q fixed. And they do so subject to (2.14). The estimated basis Γ
b pls
is then used to estimate β by substituting into (2.18) to give
The context for this estimator is regressions with many responses and few
predictors so βbols is well defined. Model (2.17) allows maximum likelihood
estimation of these same quantities along with estimation of ΣY |X . In other
words, model (2.17) provides for maximum likelihood estimation of the quan-
tities estimated by the PLS algorithms.
Another envelope-based estimator of β is the maximum likelihood esti-
mator based on model (2.17). That estimator (2.22) is of the same form as
(2.26), except the estimated basis derives from maximum likelihood estima-
tion. Adding a subscript ‘env’ to envelope estimators, we see a parallel between
the likelihood and PLS estimators (2.22) and (2.26):
ΣX,Y = AΣW,Y
β T
:= ΣX,Y Σ−1
Y
= AΣW,Y Σ−1 −1
Y = PΓ AΣW,Y ΣY
= Γη1 ,
ΣX = AΣW AT + σ 2 Ip
= PΓ (AΣW AT + σ 2 Ip )PΓ + QΓ σ 2
ΣX|Y = AΣW |Y AT + σ 2 Ip
= ΓΩΓT + Γ0 Ω0 ΓT0
subspace of ΣX that contains CX,Y . Using Table 2.2, we see that the Nadler-
Coifman model (2.28), which was developed specifically for spectroscopic ap-
plication, and its forward regression counterpart both have a natural envelope
structure via EΣX (CX,Y ). Further, both NIPALS and SIMPLS discussed in
Chapter 3 are serviceable methods for fitting these models. Methods for the
simultaneous compression of the predictor and response vectors, which are
applicable in the particular envelope structure for (2.28), are discussed in
Chapter 5.
units:
⊥
2 EΣ (B) 2
1 B = EΣ (B) 1
o
0 0
o
o
o BΛ
-1 -1
-2 -2
-2 -1 0 1 2 -2 -1 0 1 2
Λ = diag(1, . . . , 1, λ2 , . . . , λ2 , . . . , λs . . . , λs ) ∈ Rp×p ,
Y = α + η T ΦT Λ−1 (X − µX ) + ε (2.30)
T
ΣX = ΛΦ∆Φ Λ + ΛΦ0 ∆0 ΦT0 Λ,
may well destroy the envelope structure and result in estimators that are
more variable than necessary. In short, the usual scaling to get unit standard
deviations is generally questionable.
Aside from the proposal by Cook and Su (2016), there does not seem
to be PLS methodology for estimating proper rescaling parameters. Further
study along these lines is needed to expand the scope of PLS as envelope
methodology has been expanded by Cook and Su (2013, 2016).
3
PLS algorithms were not historically associated with specific models and, in
consequence, were not typically viewed as methods for estimating identifiable
population parameters, but were seen instead as methods for prediction. We
study in this chapter several algorithms for PLS regression, describing cor-
responding statistical models and their connection with envelopes. For the
major algorithms we begin each section with a synopsis that highlights their
main points. Each synopsis contains a description of the sample algorithm
and the corresponding population version. Subsequent to each synopsis, we
study characteristics of the sample algorithms, prove the relationship between
the population and sample algorithms and establish the connection with en-
velopes.
PLS regression algorithms are classified as PLS1 algorithms, which apply
to regressions with a univariate response, or PLS2 algorithms, which apply
to regressions with multiple responses. The algorithms presented here are all
of the PLS2 variety and become PLS1 algorithms when the response is uni-
variate. Their similarities and differences are highlighted later in this chap-
ter. To help maintain a distinction between PLS1 and PLS2 algorithms, we
use ΣX,Y = cov(X, Y ) when the response Y can be multivariate but we use
σX,Y = cov(X, Y ) when there can be only a single real response.
While the PLS algorithms in this chapter are presented and studied for
the purpose of reducing the dimension of the predictor vector, with minor
modifications they apply also to reducing the dimension of the response vector
in the multivariate (multiple response) linear model (1.1). The justification for
this arises from the discussion of Section 2.5.1, particularly Table 2.2. PLS for
3.1.1 Synopsis
We begin by centering and organizing the data (Yi , Xi ), i = 1, . . . , n, into
matrices X ∈ Rn×p and Y ∈ Rn×r , as defined in Section 1.2.1. Table 3.1(a)
gives steps in the NIPALS algorithm for reducing the predictors. The table
serves also to define notation used in this synopsis, like Xd , Yd and md . It may
not be easy to see what the algorithm is doing and the tradition of referring
to loadings ld and md , weights wd and scores sd does not seem to help with
intuition. The weight vectors require calculation of a first eigenvector $\ell_1(\cdot)$ normalized to have length 1. Early versions of the NIPALS algorithm (e.g. Martens and Næs, 1989, Frame 3.6) contained subroutines for calculating $\ell_1$ (Stocchero, 2019), which tended to obscure the underlying ideas. We have
dropped the eigenvector subroutine in Table 3.1(a). The algorithm is not gen-
erally associated with a particular model, but it does involve PLS estimation
of β = Σ−1X ΣX,Y .
For simplicity, we use the same notation for weight vectors wd and weight
matrices Wd in a sample and in the population. The context should be clear
from the frame of reference under discussion.
As a first step in developing the population version shown in Ta-
ble 3.1(b), we rewrite selected steps in terms of the sample covariance matrix
SXd ,Yd = n−1 XTd Yd ∈ Rp×r between the deflated predictors and the deflated
TABLE 3.1
NIPALS algorithm: (a) sample version adapted from Martens and Næs (1989)
and Stocchero (2019). The n × p matrix X contains the centered predictors
and the n×r matrix Y contains the centered responses; (b) population version
derived herein.
response and the sample covariance matrix SXd = n−1 XTd Xd ∈ Rp×p of the
deflated predictors:
$$\begin{aligned}
w_d &= \ell_1(S_{X_d,Y_d}S_{X_d,Y_d}^T) \in \mathbb{R}^p\\
s_d &= X_d w_d \in \mathbb{R}^{n\times 1}\\
m_d &= S_{X_d,Y_d}^T w_d (w_d^T S_{X_d} w_d)^{-1} \in \mathbb{R}^{r\times 1}\\
l_d &= S_{X_d} w_d (w_d^T S_{X_d} w_d)^{-1} \in \mathbb{R}^{p\times 1}\\
X_{d+1} &= X_d - X_d w_d (w_d^T S_{X_d} w_d)^{-1} w_d^T S_{X_d}\\
        &= X_d Q_{w_d(S_{X_d})} \qquad (3.1)\\
        &= Q_{X_d w_d} X_d \in \mathbb{R}^{n\times p} \qquad (3.2)\\
Y_{d+1} &= Y_d - X_d w_d (w_d^T S_{X_d} w_d)^{-1} w_d^T S_{X_d,Y_d}\\
        &= Q_{X_d w_d} Y_d \in \mathbb{R}^{n\times r} \qquad (3.3)\\
S_{X_{d+1}} &= Q_{w_d(S_{X_d})}^T S_{X_d} Q_{w_d(S_{X_d})} \qquad (3.4)
\end{aligned}$$
where $Q_{w_d(S_{X_d})}$ is the operator that projects onto the orthogonal complement of $\mathrm{span}(w_d)$ in the $S_{X_d}$ inner product,
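As an illustration, the deflation steps above can be coded directly. The following Python/numpy sketch of the sample NIPALS algorithm of Table 3.1(a) uses our own function and variable names; it is a bare illustration of equations (3.1)–(3.4), not the book's reference implementation.

```python
import numpy as np

def nipals_predictor_reduction(X, Y, q):
    """Bare-bones sketch of the sample NIPALS steps (3.1)-(3.4).

    X : (n, p) centered predictors;  Y : (n, r) centered responses;
    q : number of components.  Returns the weight matrix Wq, the score
    matrix Sq, and the coefficient estimate formed by OLS of Y on the
    reduced predictors X Wq, mapped back to the original scale.
    """
    n = X.shape[0]
    Xd, Yd = X.copy(), Y.copy()
    weights, scores = [], []
    for _ in range(q):
        SXdYd = Xd.T @ Yd / n                  # deflated cross-covariance
        SXd = Xd.T @ Xd / n                    # deflated predictor covariance
        # w_d = ell_1(S_{Xd,Yd} S_{Xd,Yd}^T): leading left singular vector
        wd = np.linalg.svd(SXdYd, full_matrices=False)[0][:, 0]
        sd = Xd @ wd                           # score vector s_d = X_d w_d
        denom = wd @ SXd @ wd
        Xd = Xd - np.outer(sd, wd @ SXd) / denom     # deflation (3.1)
        Yd = Yd - np.outer(sd, wd @ SXdYd) / denom   # deflation (3.3)
        weights.append(wd)
        scores.append(sd)
    Wq = np.column_stack(weights)
    Z = X @ Wq                                 # reduced predictors
    beta_reduced = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    return Wq, np.column_stack(scores), Wq @ beta_reduced
```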
that reduces ΣX and that spans a subspace containing span(ΣX,Y ). The stop-
ping criterion is then met because, by Proposition 1.3, QWd (ΣX ) = QWd and,
since span(ΣX,Y ) ⊆ span(Wd ), QWd ΣX,Y = 0. This represents the essence of
the links connecting envelopes with PLS, as discussed in Section 3.2.
The sample NIPALS algorithm of Table 3.1(a) does not contain an explicit
mechanism for choosing the stopping point q and so it must be determined
by some external criterion. The algorithm is typically run for several values
of q and then the stopping point is chosen by predictive cross validation or a
holdout sample. Additionally, the sample version does not require SX to be
nonsingular. However, when SX is nonsingular, the NIPALS estimator can be
represented as the projection of the OLS estimator βbols onto span(Wq ) in the
SX inner product:
$$\hat\beta_{npls} = P_{W_q(S_X)}\,\hat\beta_{ols}.$$
This requires that SX be positive definite and so does not have a direct sample
counterpart when n < p. If in addition, q = p then PWq (SX ) = Ip and the
NIPALS estimator reduces to the OLS estimator, βbnpls = βbols .
The NIPALS estimator βbnpls shown in Table 3.1(a) depends only on the
final weight matrix Wq = (w1 , . . . , wq ) and the two corresponding loading ma-
trices Lq = (l1 , . . . , lq ) and Mq = (m1 , . . . , mq ). The columns of these matrices
depend on the data only via SXd and SXd ,Yd , d = 1, . . . , q. In consequence,
the population version of the algorithm shown in Table 3.1(b) can be deduced
by replacing SXd ,Yd and SXd with their population counterparts, which we
represent as ΣXd ,Yd and ΣXd . There is no crisp population counterpart of
q associated with the sample algorithm, although later we will see that, as
indicated in Table 3.1(c), this is in fact the dimension of EΣX (B), the ΣX -
envelope of B. We will show later in Section 3.1.4 that the sample version of
the population algorithm in Table 3.1(b) is the NIPALS algorithm.
Starting with the population algorithm in Table 3.1(b), βbnpls is obtained
by replacing ΣX and ΣX,Y by their sample counterparts, SX and SX,Y , and
stopping at the selected value of q. Additionally, the population algorithm
shows that the intermediate quantities in Table 3.1(a) – ld , md , sd , Xd , Yd ,
and SXd ,Yd – are not necessary to get the PLS estimator βbpls , although they
might be useful for computational or diagnostic purposes.
The response covariance matrix ΣY is not used here but will be used in sub-
sequent illustrations. The last row of ΣX,Y is zero, but this alone does not
provide clear information about the contribution of x3 to the regression. If x3
is correlated with the other two predictors, it may well be material. However,
we see from ΣX that in this regression x3 is uncorrelated with x1 and x2 and
this enables a clear conclusion about the role of x3 . In short, since the last row
of ΣX,Y is zero and since the third predictor is uncorrelated with the other
two, we can conclude immediately that at most two predictors, x1 and x2 , are
needed. One goal of this example is to illustrate how the computations play
out to reach that conclusion.
According to the initialization step in Table 3.1(b), the first eigenvector of
ΣX,Y ΣTX,Y = diag(25, 16, 0) is w1 = (1, 0, 0)T . To compute the second weight
vector we need
$$Q_{w_1(\Sigma_X)} = \begin{pmatrix} 0 & -4/3 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\quad\text{and}\quad
Q_{w_1(\Sigma_X)}^T\,\Sigma_{X,Y} = \begin{pmatrix} 0 & 0 \\ -20/3 & 4 \\ 0 & 0 \end{pmatrix}.$$
Since QTw1 (ΣX ) ΣX,Y has rank 1, the second weight vector is w2 = (0, 1, 0)T .
For the third weight vector we need QTW2 (ΣX ) ΣX,Y , where W2 = (w1 , w2 ). It
can be seen by direct calculation that QTW2 (ΣX ) ΣX,Y = 0 and so the algorithm
terminates, giving q = 2. And, as described in Table 3.1, W2T W2 = I2 .
We can reach the conclusion that q = 2 also by reasoning that span(W2 )
is a reducing subspace of ΣX :
ΣX = W2 ∆W2T + w3 ∆0 w3T ,
It follows from Proposition 1.3 that QW2 (ΣX ) = QW2 , which can be seen also
by direct calculation:
The conclusion that we have reached the stopping point now follows since
span(ΣX,Y ) = span(W2 ).
In short, only two reduced predictors W2T X = (x1 , x2 )T are needed to
describe fully the regression of Y on X.
Lemma 3.1. Following the notation from Table 3.1(a), for the sample version
of the NIPALS algorithm
$$W_d^T W_d = I_d, \quad d = 1, \dots, q,$$
and
$$X_{d+1}^T s_j = X_{d+1}^T X_j w_j = 0.$$
Consequently,
It follows from this lemma that the score vectors sd are mutually orthog-
onal, which allows a more informative version of the deflations (3.6). Recall
from Table 3.1 that Sd = (s1 , . . . , sd ) and from (1.3) that
where $\hat{Y}_{npls}$ denotes the $n \times r$ matrix of fitted responses from the NIPALS fit and $S_d$ is the score matrix as defined in Table 3.1.
With the NIPALS fitted values $\hat{Y}_{npls}$ given in Lemma 3.3, we define the NIPALS residuals as $\hat{R}_{npls} = Y_1 - \hat{Y}_{npls}$. These quantities allow the construction of many standard diagnostic and summary quantities, like residual plots
and the multiple correlation coefficient. Standard formal inference procedures
are problematic, however, because the scores are stochastic.
The form of wd+1 given in Lemma 3.3 shows that the deflation of Y1 is
unnecessary for computing the weight matrix:
TABLE 3.2
Bare bones version of the NIPALS algorithm given in Table 3.1(a). The no-
tation Y1 = Y of Table 3.1(a) is not used since here there is no iteration over
Y.
shown in Table 3.2. The estimator βbnpls given in Table 3.2 is the sample
counterpart of the population version given in Table 3.1(b). It is shown in
(3.12), following the justification given in Section 3.1.4 for the population
version.
Lemma 3.4 indicates also how the NIPALS algorithm deals operationally with rank-deficient regressions in which $\mathrm{rank}(S_X) < p$. Let the columns of the $p \times p_1$ matrix $V$ be the eigenvectors of $S_X$ with non-zero eigenvalues, $p_1 \le p$. Then we can express $X_1^T = V Z_1^T$, where $Z_1^T$ is a $p_1 \times n$ matrix that contains the coordinates of $X_1^T$ in terms of the eigenvectors of $S_X$. Let $w_d^*$ and $s_d^*$ denote
the weights and scores that result from applying NIPALS to data (Z1 , Y1 ).
Then
We see from Table 3.1 and Lemma 3.3 that βbnpls depends only on X1 , Y1 ,
weights Wq and the scores Sq . As a consequence of Lemma 3.4 we can apply the
NIPALS algorithm by first reducing the data to the principal component scores
Z1 , running the algorithm on the reduced data (Z1 , Y1 ) and then transforming
back to the original scale. The eigenvectors V of SX account for 100% of the
TABLE 3.3
Bare bones version in the principal component scale of the NIPALS algorithm
given in Table 3.1(a).
Proof. The main part of the proof is by induction. Clearly, the algorithms
produce the same w1 . At d + 1 = 2, we have
This step matches the corresponding step of the population NIPALS algo-
rithm in Table 3.1(b).
At d + 1 = 3, we again see that the direct NIPALS algorithm matches that
from Table 3.1(b).
The second equality for ΣX3 follows by Lemma 3.5(d) with V = (w1 ), v = w2
and Σ = ΣX . The second equality for ΣX3 ,Y3 follows by replacing ΣX2 ,Y2 =
QTw1 (ΣX ) ΣX,Y to get
and then using Lemma 3.5(c). The second equality for w3 follows by direct
substitution.
Assume that these relationships hold for d = k − 1 ≤ q − 1 and let
Wk−1 = (w1 , . . . , wk−1 ):
To show the general result by induction we need to show that the relationships
hold for d = k. Let Wk = (Wk−1 , wk ). Then we need to show
The first equality for ΣXk+1 follows by definition. The second equality for
ΣXk+1 follows by first substituting for
The desired conclusion for ΣXk+1 then follows from Lemma 3.5(c) with
V = Wk−1 , v = wk and Σ = ΣX . The first equality for ΣXk+1 ,Yk+1 fol-
lows by definition. The second equality for ΣXk+1 ,Yk+1 follows by replacing
ΣXk ,Yk = QTWk−1 (ΣX ) ΣX,Y to get
where the second equality for lk+1 follows from Lemma 3.5(a) and the form
of mk+1 is found following the same general steps as we did for lk+1 .
The role of the numerators QWk (ΣX ) wk+1 is to provide a successive orthog-
onalization of columns of Wq = (w1 , . . . , wq ), while the denominators provide
a normalization in the ΣX inner product. Thus,
$$\mathrm{span}\left\{ \frac{Q_{W_k(\Sigma_X)}\, w_{k+1}}{w_{k+1}^T Q_{W_k(\Sigma_X)}^T \Sigma_X Q_{W_k(\Sigma_X)}\, w_{k+1}} \;\middle|\; k = 0, 1, \dots, q-1 \right\} = \mathrm{span}(W_q), \qquad (3.10)
$$
and in consequence there is a nonsingular q×q matrix A so that Lq = ΣX Wq A.
In the same way we have
$$M_q = \Sigma_{X,Y}^T W_q A.$$
Substituting these forms into the population version of the PLS estimator
given in Table 3.1(a), we get
Recall that the notation βY |WqT X means the coefficients from the popula-
tion OLS fit on Y on the reduced predictors WqT X. From this we see that the
normalization of ld and md in Table 3.1 plays no essential role as it has no
effect on the subspace equality in (3.10). The sample version $\hat\beta_{pls}$ of $\beta_{pls}$, as shown in Table 3.2, follows immediately from (3.11):
B ⊆ span(Wq ) (3.15)
dim(B) ≤ q.
Next, we substitute (3.14) for the first ΣX,Y on the right hand side and write
the first Wd = Wq (Id , 0)T to get
:= ΣX Wq Ad ,
ΣX Wq = Wq C −1 . (3.16)
Consequently,
$$\begin{aligned}
w_2 &= \ell_1\left\{ Q_{w_1(\Sigma_X)}^T \Sigma_{X,Y}\Sigma_{X,Y}^T Q_{w_1(\Sigma_X)} \right\}\\
    &= \ell_1\left\{ \Phi Q_{b_1(\Delta)}^T \Phi^T \Sigma_{X,Y}\Sigma_{X,Y}^T \Phi Q_{b_1(\Delta)} \Phi^T \right\}\\
    &\in \mathrm{span}(\Phi Q_{b_1(\Delta)}^T) \subset \mathrm{span}(\Phi)
\end{aligned}$$
3.3.1 Synopsis
Table 3.4(a) gives the SIMPLS data-based algorithm as it appears in de Jong
(1993, Table 1). The weight vectors and weight matrices are denoted as v
and V to distinguish them from the NIPALS weights. The scores and score
matrices are denoted as s and S, and the loadings and loading matrices as l
and L. de Jong (1993, Appendix) also discussed a more elaborate version of
the algorithm that contains steps to facilitate computation. Like the NIPALS
algorithm, the SIMPLS sample algorithm does not contain a mechanism for
stopping and so q is again typically determined by predictive cross validation
or a holdout sample. Also like the NIPALS algorithm, the sample SIMPLS
algorithm does not require SX to be nonsingular. However, when SX is non-
singular, the SIMPLS estimator can be represented as the projection of the
TABLE 3.4
SIMPLS algorithm: (a) sample version adapted from de Jong (1993, Table 1).
The n × p matrix X contains the centered predictors and the n × r matrix Y
contains the centered responses; (b) population version derived herein.
which requires that SX be positive definite and so does not have a direct
sample counterpart when n < p. Recall that projections are defined at (1.3),
so
$$P_{V_q(S_X)} = V_q(V_q^T S_X V_q)^{-1}V_q^T S_X \quad\text{and}\quad Q_{V_q(S_X)} = I - P_{V_q(S_X)}.$$
Maintaining the convention established for NIPALS, we use vd and Vd to
denote weight vectors and weight matrices in both the sample and population.
The population version of the SIMPLS algorithm, which will be justified
herein, is shown in Table 3.4(b). Substituting SX and SX,Y for their popu-
lation counterparts ΣX and ΣX,Y produces the same weights and the same
estimated coefficient vector as the sample version in Table 3.4(a), provided
the same value of q is used. The score and loading vectors computed in Ta-
ble 3.4(a) are not really necessary for the algorithm.
The population version in Table 3.4(b) is a special case of Algorithm S
described previously in Section 1.5.4 with A = ΣX,Y ΣTX,Y and M = ΣX .
Consequently, reasoning from (3.17), the SIMPLS algorithm can be described
also as follows, still using the notation of Table 3.4(b). Let Vi = (v1 , . . . , vi ).
Then given Vk , k < q, Vk+1 is constructed by concatenating Vk with
$v^T\Sigma_X V_k = 0$ and $v^T v = 1$.
This is the description of SIMPLS that (Cook, Helland, and Su, 2013, Section
4.3) used to establish the connection between PLS algorithms and envelopes.
The population construction algorithm (3.17) is shown in Section 3.3.3, equa-
tions (3.18) and (3.19).
and that the first weight vector is the same as that for NIPALS, v1 = (1, 0, 0)T .
As in Section 3.1.2, v1 ∈ span(ΣX,Y ) and span(ΣX,Y ) reduces ΣX . Direct
computation of the second weight vector
is not as straightforward as it was for NIPALS since we are not now working in
the ΣX inner product. However, computation is facilitated by using a change
of basis for span(ΣX,Y ) that explicitly incorporates ΣX v1 = (1, 4/3, 0)T . Let
$$A = \begin{pmatrix} 5^{-1} & -5^{-1} \\ 3^{-1} & 3/16 \end{pmatrix}.$$
Then,
$$\Sigma_{X,Y}A = \begin{pmatrix} 1 & -1 \\ 4/3 & 3/4 \\ 0 & 0 \end{pmatrix},
\qquad
P_{\Sigma_X v_1}\Sigma_{X,Y} = (P_{\Sigma_X v_1}\Sigma_{X,Y}A)A^{-1} = \begin{pmatrix} 1 & 0 \\ 4/3 & 0 \\ 0 & 0 \end{pmatrix}A^{-1},$$
and
$$Q_{\Sigma_X v_1}\Sigma_{X,Y} = (\Sigma_{X,Y}A - P_{\Sigma_X v_1}\Sigma_{X,Y}A)A^{-1} = \begin{pmatrix} 0 & -1 \\ 0 & 3/4 \\ 0 & 0 \end{pmatrix}A^{-1}.$$
For the next iteration we need to find QΣX V2 ΣX,Y . But V2 is a reduc-
ing subspace of ΣX and so QΣX V2 = QV2 . Then the algorithm terminates:
QΣX V2 ΣX,Y = QV2 ΣX,Y = 0 since span(ΣX,Y ) = span(V2 ). Although the
NIPALS weights W2 are not equal to the SIMPLS weights V2 , they span the
same subspace and in consequence βnpls = βspls .
$$v_1 = \ell_1(S_{X,Y}S_{X,Y}^T).$$
The first score vector is then computed as $s_1 = X v_1$, which gives the first linear combination of the predictors. The first loading vector is
$$l_1 = \frac{X^T s_1}{s_1^T s_1} = \frac{X^T X v_1}{v_1^T X^T X v_1} = \frac{S_X v_1}{v_1^T S_X v_1}.$$
To summarize, the first pass through the algorithm, $d = 1$, computes
$$v_1 = \ell_1(S_{X,Y}S_{X,Y}^T), \quad s_1 = X v_1, \quad l_1 = \frac{S_X v_1}{v_1^T S_X v_1}, \quad V_1 = (v_1),\; S_1 = (s_1),\; L_1 = (l_1).$$
In the second pass through the algorithm, d = 2, we first compute the first
left singular vector of QL1 SX,Y . From the computations from the first step
d = 1, span(L1 ) = span(SX v1 ). In other words, the normalization by v1T SX v1
in the computation of l1 is unnecessary. Thus we can compute the first left
singular vector of QSX v1 SX,Y . Following the logic expressed in step d = 1, we
then have
$$v_2 = \ell_1(Q_{S_X v_1} S_{X,Y} S_{X,Y}^T Q_{S_X v_1}).$$
The rest of the steps in the second pass through the algorithm are similar to
those in the first pass, so we summarize the second pass as
$$\begin{aligned}
v_2 &= \ell_1(Q_{S_X v_1} S_{X,Y} S_{X,Y}^T Q_{S_X v_1})\\
s_2 &= X v_2\\
l_2 &= \frac{S_X v_2}{v_2^T S_X v_2}\\
V_2 &= (v_1, v_2); \quad S_2 = (s_1, s_2) = X V_2;\\
L_2 &= (l_1, l_2) = S_X V_2\, \mathrm{diag}^{-1}(v_1^T S_X v_1,\, v_2^T S_X v_2).
\end{aligned}$$
For completeness, we now state the results for the general d-th pass through the algorithm:
$$\begin{aligned}
v_d &= \ell_1(Q_{S_X V_{d-1}} S_{X,Y} S_{X,Y}^T Q_{S_X V_{d-1}})\\
    &= \ell_1(Q_{L_{d-1}} S_{X,Y} S_{X,Y}^T Q_{L_{d-1}})\\
s_d &= X v_d\\
l_d &= \frac{S_X v_d}{v_d^T S_X v_d}\\
V_d &= (v_1, v_2, \dots, v_d); \quad S_d = (s_1, s_2, \dots, s_d) = X V_d;\\
L_d &= (l_1, l_2, \dots, l_d) = S_X V_d\, \mathrm{diag}^{-1}(v_1^T S_X v_1,\, v_2^T S_X v_2, \dots, v_d^T S_X v_d),
\end{aligned}$$
where QSX Vd−1 = QLd−1 . From this general step, it can be seen that the final
weight matrix Vq , SX , and SX,Y are all that is needed to compute the SIMPLS
estimator βbspls of the coefficient matrix after q steps:
The score matrix Sq = XVq gives the n × q matrix of reduced predictor values.
It may be of interest for interpretation and graphical studies. The loading
matrix Lq is essentially a matrix of normalized scores.
In view of the previous observations, we can reduce the SIMPLS algorithm to the following compact version. Construct $v_1 = \ell_1(S_{X,Y}S_{X,Y}^T)$ and $V_1 = (v_1)$. Then for $d = 1, \dots, q-1$,
$$\begin{aligned}
v_{d+1} &= \ell_1(Q_{L_d} S_{X,Y} S_{X,Y}^T Q_{L_d}) \qquad (3.18)\\
        &= \arg\max_{h^T h = 1,\; h^T S_X V_d = 0} h^T S_{X,Y} S_{X,Y}^T h \qquad (3.19)\\
V_{d+1} &= (V_d, v_{d+1}).
\end{aligned}$$
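A direct coding of this compact version is straightforward. The Python/numpy sketch below (function name ours) computes the weights via (3.18) and then forms the SIMPLS coefficient estimate from $V_q$, $S_X$ and $S_{X,Y}$ as described above.

```python
import numpy as np

def simpls_predictor_reduction(X, Y, q):
    """Sketch of the compact sample SIMPLS steps (3.18)-(3.19).

    X : (n, p) centered predictors;  Y : (n, r) centered responses.
    Returns the weight matrix Vq and the coefficient estimate formed
    from Vq, S_X and S_{X,Y} via OLS of Y on the reduced predictors X Vq.
    """
    n, p = X.shape
    SX = X.T @ X / n
    SXY = X.T @ Y / n
    V = []
    for _ in range(q):
        C = SXY
        if V:
            L = SX @ np.column_stack(V)        # columns proportional to the loadings
            C = (np.eye(p) - L @ np.linalg.pinv(L)) @ SXY   # Q_{L_d} S_{X,Y}
        # v_{d+1} = leading left singular vector of Q_{L_d} S_{X,Y}
        V.append(np.linalg.svd(C, full_matrices=False)[0][:, 0])
    Vq = np.column_stack(V)
    Z = X @ Vq
    return Vq, Vq @ np.linalg.solve(Z.T @ Z, Z.T @ Y)
```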
Although key details differ, the argument here follows closely that for NIPALS
given in Section 3.2. Also, some results in this section may be deduced from
properties of Algorithm S in Section 1.5.4. We provide separate demonstra-
tions in this section to aid intuition and establish connections with SIMPLS.
(a) β = v1 cT ,
(b) B = span(v1 ),
(c) dim{B} = 1,
(d) v1 is an eigenvector of both ΣX,Y ΣTX,Y and ΣX .
and therefore
B ⊆ span(Vq ). (3.23)
Next, we substitute (3.22) for the first ΣX,Y on the right hand side and write
the Vd = Vq (Id , 0)T to get
:= ΣX Vq Ad ,
Since vd+1 ∈ span(ΣX Vq ), this implies that vd+1 can be represented as vd+1 =
ΣX Vq md for some q × 1 vector md and thus that there is a q × q matrix
M = (m1 , . . . , mq ) so that Vq = ΣX Vq M . Since Vq has full column rank and
ΣX is non-singular, M must be nonsingular and so
ΣX Vq = Vq M −1 . (3.24)
In summary,
Proposition 3.4. The span span(Vq ) of the weight matrix Vq from the popula-
tion SIMPLS algorithm is a reducing subspace of ΣX that contains span(ΣX,Y )
and B.
Let $\Phi \in \mathbb{R}^{p\times q}$ denote a semi-orthogonal basis matrix for $\mathcal{E}_{\Sigma_X}(\mathcal{B})$, the intersection of all reducing subspaces of $\Sigma_X$ that contain $\mathrm{span}(\Sigma_{X,Y})$, and let $(\Phi, \Phi_0) \in \mathbb{R}^{p\times p}$ be an orthogonal matrix. This is the same notation we used when dealing with NIPALS and envelopes in Section 3.2.2. Then we know
from Proposition 1.2 that ΣX can be expressed as
ΣX = Φ∆ΦT + Φ0 ∆0 ΦT0 .
Consequently,
∈ span(Φ),
3.5.1 Estimation
While the first weight vectors from the population algorithms are the same, $w_1 = v_1$, the second and subsequent weight vectors differ:
Since βnpls and βspls depend only on the subspaces spanned by the correspond-
ing weight matrices, it follows that βnpls = βspls , and so NIPALS and SIMPLS
produce the same result in the population.
The sample estimators are generally different, $\hat\beta_{npls} \neq \hat\beta_{spls}$, with two important exceptions as given in the following proposition (de Jong, 1993).
Proposition 3.6. Recall that βbnpls and βbspls denote the sample NIPALS and
SIMPLS estimators. Assume that these estimators are each constructed with
q components.
w1 ∝ SX,Y and
where c1 = (w1T SX w1 )−1 w1T SX,Y is a scalar. From this we have the represen-
tation
$$\mathrm{span}(W_2) = \mathrm{span}(S_X^0 S_{X,Y},\, S_X^1 S_{X,Y}).$$
and
which led Manne (1987) and Helland (1990) to claim that the NIPALS algo-
rithm is a version of the Gram-Schmidt procedure (see Section 3.6).
Turning to the population to complete the discussion, the SIMPLS weight vectors also result in an orthogonalization of the Krylov sequence, except now the weight vectors $v_1, \dots, v_q$ are orthogonal in the $\Sigma_X$ inner product (a short numerical sketch follows the list below). In particular,
• v1 = σX,Y ,
• v2 is the vector of residuals from the regression of σX,Y on ΣX v1 ,
• v3 is the vector of residuals from the regression of σX,Y on ΣX (v1 , v2 ),
..
.
• vk is the vector of residuals from the regression of σX,Y on
ΣX (v1 , v2 , . . . , vk−1 ).
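The residual-regression description above can be sketched in a few lines of Python/numpy; the function name is ours and the inputs are population quantities supplied by the user.

```python
import numpy as np

def simpls_population_weights(Sigma_X, sigma_XY, q):
    """Population SIMPLS weights for a univariate response, built by the
    residual-regression recipe in the list above (function name is ours).

    Sigma_X  : (p, p) predictor covariance
    sigma_XY : (p,)   covariance between X and the scalar response
    """
    V = [sigma_XY]                                  # v_1 = sigma_XY
    for _ in range(1, q):
        A = Sigma_X @ np.column_stack(V)            # regressors Sigma_X (v_1, ..., v_{k-1})
        coef = np.linalg.lstsq(A, sigma_XY, rcond=None)[0]
        V.append(sigma_XY - A @ coef)               # v_k = residual vector
    Vq = np.column_stack(V)
    # the columns are orthogonal in the Sigma_X inner product:
    # Vq.T @ Sigma_X @ Vq is (numerically) diagonal
    return Vq
```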
In consequence, we see that for a univariate response, the NIPALS and SIMPLS sample and population vectors can be viewed as different orthogonalizations of the Krylov vectors. This implies that in the population and sample $\mathrm{span}(W_k) = \mathrm{span}(V_k)$, $k = 1, \dots, q$; that is, we have demonstrated that
• Both algorithms give the envelope EΣX (B) in the population and in that
sense are aiming at the same target. However, the algorithms give different
sample weight vectors, as discussed previously.
• In the population, the NIPALS algorithm makes use of the ΣX inner prod-
uct to generate its orthogonal weights, WqT Wq = Iq , while the SIMPLS
algorithm uses the identity inner product to generate its weight vectors
that are orthogonal in the ΣX inner product, so VqT ΣX Vq is a diagonal
matrix. Nevertheless, span(Wq ) = span(Vq ) = EΣX(B).
• With $q = \dim\{\mathcal{E}_{\Sigma_X}(\mathcal{B})\}$ known and fixed, $p$ fixed and $n \to \infty$, both algorithms after $q$ steps produce $\sqrt{n}$-consistent estimators of $\beta$, because the algorithms are smooth functions of $S_{X,Y}$ and $S_X$, which are $\sqrt{n}$-consistent estimators of $\Sigma_{X,Y}$ and $\Sigma_X$.
• Neither algorithm requires that $S_X$ be positive definite, and both are generally serviceable in high-dimensional regressions. Asymptotic properties as $n, p \to \infty$ are discussed in Chapter 4.
• Both population algorithms are invariant under full rank linear transfor-
mations YA = AY of the response vector. The coefficient matrix with the
transformed responses is βA := βAT . Since span(βA ) = B, the envelope is
invariant under this transformation,
$$X = h_1 1_1 + h_2 1_2 + h_3 1_3 + e,$$
where the $h_j$'s are independent standard normal random variables and $e$ is a $p \times 1$ vector of independent standard normal variates.
TABLE 3.5
Helland's algorithm for univariate PLS regression: (a) sample version adapted from Table 2 of Frank and Friedman (1993). The n × p matrix X contains the
centered predictors and the n × r vector Y contains the centered responses;
(b) population version derived herein.
$$\Sigma_X = I_p + \begin{pmatrix} 1_1 1_1^T & 0 & 0 \\ 0 & 1_2 1_2^T & 0 \\ 0 & 0 & 1_3 1_3^T \end{pmatrix}.$$
From this structure, we see that the predictors consist of three indepen-
dent blocks of sizes p1 , p2 and p3 . Let uT1 = (1T1 , 0, 0), uT2 = (0, 1T2 , 0) and
uT3 = (0, 0, 1T3 ) denote three eigenvectors of ΣX with non-zero eigenvalues.
$$\Sigma_X = (1 + p_1)P_{u_1} + (1 + p_2)P_{u_2} + (1 + p_3)P_{u_3} + Q,$$
where P(·) is the projection onto the subspace spanned by the indicated eigen-
vector and Q is the projection onto the p − 3 dimensional subspace that is
orthogonal to span(u1 , u2 , u3 ).
Next, with a single response $r = 1$, the $n \times 1$ vector of responses was generated as the linear combination $Y = H_1 - H_2 + \varepsilon$, where $\varepsilon$ is an $n \times 1$ vector of independent normal variates with mean 0 and variance $\sigma^2$, which gives
$$\Sigma_{X,Y} = u_1 - u_2 \quad\text{and}\quad \beta = \frac{1}{1+p_1}\,u_1 - \frac{1}{1+p_2}\,u_2.$$
Consequently, if $p_1 \neq p_2$ then both β and ΣX,Y are linear combinations of
two eigenvectors of ΣX and only q = 2 components are needed to characterize
the regression, specifically uT1 X and uT2 X. Equivalently, the two-dimensional
envelope EΣX(B) = span(u1 , u2 ). If p1 = p2 then EΣX(B) = span(u1 − u2 ). This
follows because span(ΣX,Y ) = B = span(u1 − u2 ) has dimension 1.
To calculate the first two NIPALS weight vectors we need
With this we can now calculate the first two NIPALS weight vectors as
$$\begin{aligned}
w_1 &\propto \Sigma_{X,Y} = u_1 - u_2\\
w_2 &\propto Q_{\Sigma_{X,Y}(\Sigma_X)}^T\, \Sigma_{X,Y}\\
    &= \Sigma_{X,Y} - \Sigma_X\Sigma_{X,Y}(\Sigma_{X,Y}^T\Sigma_X\Sigma_{X,Y})^{-1}\Sigma_{X,Y}^T\Sigma_{X,Y}\\
    &= \frac{p_2 - p_1}{(1 + p_1)p_1 + (1 + p_2)p_2}\,(p_2 u_1 + p_1 u_2).
\end{aligned}$$
Clearly, $\mathrm{span}(W_2) = \mathrm{span}(u_1, u_2)$ provided $p_1 \neq p_2$. If $p_1 = p_2$ then ΣX
reduces to
ΣX = (1 + p1 )P(u1 ,u2 ) + (1 + p3 )Pu3 + Q,
u1 and u2 belong to the same eigen-space and, in consequence, only one com-
ponent is needed and EΣX(B) = span(u1 − u2 ). Also, when p1 = p2 , the
stopping criterion is met at w2 = 0, giving q = 1.
$$\begin{aligned}
v_1 &\propto \Sigma_{X,Y} = u_1 - u_2\\
v_2 &\propto Q_{\Sigma_X\Sigma_{X,Y}}\,\Sigma_{X,Y}\\
    &= \Sigma_{X,Y} - \Sigma_X\Sigma_{X,Y}(\Sigma_{X,Y}^T\Sigma_X^2\Sigma_{X,Y})^{-1}\Sigma_{X,Y}^T\Sigma_X\Sigma_{X,Y}\\
    &= \frac{p_2 - p_1}{(1 + p_1)^2 p_1 + (1 + p_2)^2 p_2}\,\big((1 + p_2)p_2 u_1 + (1 + p_1)p_1 u_2\big).
\end{aligned}$$
$$\Sigma_{X,Y} = u_1 - u_2,\qquad
\Sigma_X\Sigma_{X,Y} = (1 + p_1)u_1 - (1 + p_2)u_2,\qquad
\Sigma_X^2\Sigma_{X,Y} = (1 + p_1)^2 u_1 - (1 + p_2)^2 u_2.$$
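These closed forms are easy to check numerically. The Python/numpy sketch below assumes the block covariance structure reconstructed above, $\Sigma_X = I_p + \sum_j 1_j 1_j^T$, with illustrative block sizes of our choosing, and compares the direct computations of $w_2$ and $v_2$ with the displayed expressions.

```python
import numpy as np

# Check of the closed forms for w2 (NIPALS) and v2 (SIMPLS) above, assuming
# Sigma_X = I_p + sum_j 1_j 1_j^T; block sizes are illustrative choices.
p1, p2, p3 = 3, 5, 4
p = p1 + p2 + p3
u1 = np.r_[np.ones(p1), np.zeros(p2 + p3)]
u2 = np.r_[np.zeros(p1), np.ones(p2), np.zeros(p3)]
u3 = np.r_[np.zeros(p1 + p2), np.ones(p3)]
Sigma_X = np.eye(p) + np.outer(u1, u1) + np.outer(u2, u2) + np.outer(u3, u3)

s = u1 - u2                       # Sigma_{X,Y}
Ss = Sigma_X @ s                  # Sigma_X Sigma_{X,Y}

w2 = s - Ss * (s @ s) / (s @ Sigma_X @ s)                 # NIPALS second weight
w2_closed = (p2 - p1) / ((1 + p1) * p1 + (1 + p2) * p2) * (p2 * u1 + p1 * u2)

v2 = s - Ss * (Ss @ s) / (Ss @ Ss)                        # SIMPLS second weight
v2_closed = (p2 - p1) / ((1 + p1) ** 2 * p1 + (1 + p2) ** 2 * p2) \
            * ((1 + p2) * p2 * u1 + (1 + p1) * p1 * u2)

print(np.allclose(w2, w2_closed), np.allclose(v2, v2_closed))   # True True
```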
“hard modeling” has the advantage of laying bare all of the structural and
stochastic foundations of a method, allowing the investigator to study perfor-
mance under deviations from those foundations as necessary for a particular
application. There seems to be a feeling within the chemometrics community
that the success of PLS algorithms over the past four decades does not fully
compensate for the lack of adequate foundations (Stocchero, 2019).
Since we have shown in this chapter that NIPALS and SIMPLS give the
envelope in the population, we could use likelihood theory to get maximum
likelihood estimators when the sample size is sufficiently large. This approach
was described in Section 2.3 and a preview of the connection to PLS was given
in Section 2.3.2.
$$L_q(G) = \log|G^T S_X G| + \log|G^T S_X^{-1} G| + \log|S_{Z|G^T X}|.$$
The sum of the first two addends on the right hand side is non-negative and
zero when the columns of G correspond to any subset of q eigenvectors of SX .
and the conclusion follows. Consequently, the role of these addends is to pull
the solution toward subsets of q eigenvectors of SX . This in effect imposes
a sample counterpart of the characterization in Proposition 1.9, which states
that in the population EΣX (B) is spanned by a subset of the eigenvectors of
ΣX . There is no corresponding operation in the PLS methods. The first SIM-
PLS vector v1 does not incorporate direct information about SX . The second
PLS vector incorporates SX by essentially removing the subspace span(SX v1 )
from consideration, but the choice of span(SX v1 ) is not guided by the rela-
tionship between v1 and the eigenvectors of SX . Subsequent SIMPLS vectors
operate similarly in successively smaller spaces. In application, PLS methods
often require more directions to match the performance of the likelihood-based
method (Cook et al., 2013).
Our discussion of likelihood-based estimation has so far been restricted to regressions in which $n \gg \max(p, r)$ or $n \to \infty$ with $p$ and $r$ fixed. Rimal,
Trygve, and Sæbø (2019) adapted likelihood-based estimation to accommo-
date regressions with n < p by selecting the principal components of X that
account for 97.5 percent of its sample variation and then using likelihood-
based estimation on the principal component predictors. Using large-scale
simulations and data analyses, they compared the predictive performance of
envelope methods, principal component regression and the kernel PLS method
(Lindgren, Geladi, and Wold, 1993) from the PLS package in R (R Core Team,
2022) in contexts that reflect chemometric applications. They summarized
their overall findings as follows . . .
Analysis using both simulated data and real data has shown that
the envelope methods are more stable, less influenced by [predictor
collinearity] and in general, performed better than PCR and PLS meth-
ods. These methods are also found to be less dependent on the number
of components.
The envelope methods used by Rimal et al. (2019) are likelihood-based and they differ from the PLS methods only in the method of estimating EΣX(B).
In contrast, principal component regression is not designed specifically to esti-
mate EΣX(B) and this may in part account for its relatively poor performance.
Subsequently, Rimal, Trygve, and Sæbø (2020) reported the results of a second
study, this time focusing on the estimative performance of the methods.
Comparisons based on asymptotic approximations are presented in Chap-
ter 4.
where βbols,j ∈ Rp denotes the coefficient vector from the OLS fit of Yj on
X. From this we see that the j-th column of the OLS coefficient matrix βbols
consists of the coefficient vector from the univariate-response regression of Yj
on X. Following this lead, we can use NIPALS or SIMPLS to construct a
different PLS estimator by performing r univariate PLS regressions:
$$\hat\beta_{npls1} := (\hat\beta_{npls,1}, \hat\beta_{npls,2}, \dots, \hat\beta_{npls,r}), \qquad
\hat\beta_{spls1} := (\hat\beta_{spls,1}, \hat\beta_{spls,2}, \dots, \hat\beta_{spls,r}), \qquad (3.28)$$
where βbnpls,j is the estimator of the coefficient vector from the NIPALS fit of
the j-th response on X, with a similar definition for βbspls,j .
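The following Python/numpy sketch (function names ours) illustrates the PLS1 construction in (3.28): a simple univariate NIPALS-type fit is applied to each response column separately and the resulting coefficient vectors are collected into a matrix.

```python
import numpy as np

def pls1_fit(X, y, q):
    """Univariate (PLS1) NIPALS-type fit; X (n, p) and y (n,) are centered."""
    n = X.shape[0]
    Xd = X.copy()
    weights = []
    for _ in range(q):
        w = Xd.T @ y / n                 # deflating y is unnecessary for the weights
        w = w / np.linalg.norm(w)
        s = Xd @ w
        Xd = Xd - np.outer(s, s @ Xd) / (s @ s)   # deflate the predictors
        weights.append(w)
    Wq = np.column_stack(weights)
    Z = X @ Wq
    return Wq @ np.linalg.solve(Z.T @ Z, Z.T @ y)

def pls1_matrix(X, Y, q):
    """beta_hat_npls1 of (3.28): one univariate PLS fit per column of Y."""
    return np.column_stack([pls1_fit(X, Y[:, j], q) for j in range(Y.shape[1])])
```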
This lemma implies a key fact about the relationship between PLS1 and PLS2 envelopes:
NIPALS estimator βbnpls,j for the univariate regressions will be fitted with
q = 1. But the overall NIPALS estimator will be fitted with q = r, since
dim(EΣX(B)) = r. Again, a PLS1 fit may be better than PLS2.
If dim{EΣX (B)} = 1 and dim{EΣX (Bj )} = 1, j = 1, . . . , r then, knowing
the dimensions, each NIPALS estimator βbnpls,j for the univariate regressions
will be fitted with q = 1. But the overall NIPALS estimator will also be fitted
with q = 1, since dim(EΣX(B)) = 1. In this case a PLS2 fit may be better than
PLS1.
This discussion indicates that, with known envelope dimensions (number of components), either PLS1 or PLS2 may be preferable, depending on the regression. On the other hand, PLS1 methods require in application that each dimension dim{EΣX (Bj)}, j = 1, . . . , r, be estimated separately, while PLS2 methods need only the one dimension dim{EΣX (B)} to be estimated. This means that
in PLS1 regressions there are more chances for mis-estimation of the number
of components. In practice it may be advantageous to try both methods and
use cross validation or a hold-out sample to assess their relative strengths.
While the previous discussion was cast in terms of PLS estimators, the
same reasoning and conclusions apply to the more general algorithms intro-
duced in Section 1.5. For instance, in multivariate regressions we can think
of applying algorithms N and S overall one response at a time, leading to
estimators of the form
$$\hat\beta_{N1} := (\hat\beta_{N,1}, \hat\beta_{N,2}, \dots, \hat\beta_{N,r}), \qquad
\hat\beta_{S1} := (\hat\beta_{S,1}, \hat\beta_{S,2}, \dots, \hat\beta_{S,r}), \qquad (3.29)$$
where βbN,j is the estimator of the coefficient vector from using Algorithm N
to fit the j-th response on X, with a similar definition for βbS,j .
This estimator can be used in conjunction with the multivariate linear model
and cross validation to select an appropriate value for the number u of response
components, in the same way that cross validation is used to select the number
of predictor components.
Response and predictor reductions can be combined to achieve simultane-
ous response-predictor reduction in the multivariate linear model. Methodol-
ogy for this is discussed in Chapter 5.
4
Asymptotic Properties of PLS
4.1 Synopsis
Our study of asymptotic properties of PLS led us to believe that classical
PLS methods can be effective for studying the class of high dimensional abun-
dant regressions (Cook, Forzani, and Rothman, 2012). Abundance is defined
for one-component PLS regressions in Definition 4.1. It is defined for multi-
component regressions in Definition 4.3 and a characterization is given in
Proposition 4.2. Informally, an abundant regression is one in which many
predictors contribute information about the response. In some abundant re-
gressions, estimators of regression coefficients and fitted values can converge at the $\sqrt{n}$ rate as $n, p \to \infty$ without regard to the relationship between n
and p. This phenomenon is described in Corollaries 4.3–4.5 for one compo-
nent regressions and in Theorems 4.7 and 4.8 for multi-component regressions,
relying mostly on the results of Cook and Forzani (2018, 2019).
Abundant regressions apparently occur frequently in the applied sciences.
For instance, in chemometrics calibration, spectra are often digitized to give
hundreds or even thousands of predictor variables (Martens and Næs, 1989)
and it is generally expected that many points along the spectrum contribute in-
formation about the analyte of interest. Wold, Kettaneh, and Tjessem (1996)
argued against the tendency to “. . . drastically reduce the number of variables
. . . ,” in effect arguing for abundance in chemometrics. In contrast to abun-
dance, a sparse regression is one in which few predictors contain response
information so the signal coming from the predictors is finite and bounded.
Classical PLS methods are generally inconsistent in sparse regressions, but
modifications have been developed to handle these cases, as discussed in
Section 11.4.
In Section 4.3 we discuss traditional asymptotic approximations based on
the one-component model and letting n → ∞ with p fixed. This material
mostly comes from Cook, Helland, and Su (2013). Although PLS is often
associated with high dimensional regressions in which n < p, it is still service-
able in traditional regression contexts and knowledge of this setting will help
paint an overall picture of PLS. In particular, there are settings in which PLS
asymptotically outperforms standard methods like ordinary least squares and
there are also settings in which PLS underperforms.
Using results from Basa, Cook, Forzani, and Marcos (2022), in Section 4.5
we describe the asymptotic normal distribution as n, p → ∞ of a user-selected
univariate linear combination of the PLS coefficients estimated from a one-
component PLS fit and show how to construct asymptotic confidence intervals
for mean and predicted values. We concluded from Theorem 4.2 and Corol-
lary 4.6 that the conditions for asymptotic normality are more stringent than
those for consistency and, as illustrated in Figure 4.4, that there is a potential
for asymptotic bias.
$$Y_i = \mu + \delta^{-1}\|\sigma_{X,Y}\|\,\Phi^T X_i + \varepsilon_i, \quad i = 1, \dots, n, \qquad (4.1)$$
$$\Sigma_X = \delta\Phi\Phi^T + \Phi_0\Delta_0\Phi_0^T \qquad (4.2)$$
$$\sigma^2_{Y|X} = \mathrm{var}(\varepsilon) > 0.$$
In reference to the general form of the linear model for predictor envelopes
given at (2.5), η = δ −1 kσX,Y k. While δ is an eigenvalue of ΣX with corre-
sponding normalized eigenvector Φ,
$$\delta = \Phi^T\Sigma_X\Phi = \frac{\sigma_{X,Y}^T\Sigma_X\sigma_{X,Y}}{\|\sigma_{X,Y}\|^2}, \qquad (4.3)$$
it need not be the largest eigenvalue of ΣX . We see also that the vector of
regression coefficients
$$\beta = \Sigma_X^{-1}\sigma_{X,Y} = \Sigma_X^{-1}\Phi\|\sigma_{X,Y}\| = \delta^{-1}\sigma_{X,Y} = \frac{\|\sigma_{X,Y}\|^2}{\sigma_{X,Y}^T\Sigma_X\sigma_{X,Y}}\,\sigma_{X,Y}$$
and that the squared population correlation coefficient between Y and ΦTX
is
$$\rho^2(Y, \Phi^T X) = \frac{\delta^{-1}\|\sigma_{X,Y}\|^2}{\delta^{-1}\|\sigma_{X,Y}\|^2 + \sigma^2_{Y|X}}. \qquad (4.6)$$
Since Y and X are required to have a joint distribution, the covariance matrix
of the concatenated variable C = (X T , Y )T is similar to (2.9):
$$\Sigma_C = \begin{pmatrix} \Sigma_X & \sigma_{X,Y} \\ \sigma_{X,Y}^T & \sigma_Y^2 \end{pmatrix}
= \begin{pmatrix} \delta\Phi\Phi^T + \Phi_0\Delta_0\Phi_0^T & \delta\eta\Phi \\ \delta\eta\Phi^T & \sigma_Y^2 \end{pmatrix}. \qquad (4.7)$$
4.2.2 Estimation
The usual estimators of ΣX and σX,Y are SX and SX,Y , as given in Section 1.2.
With a single component, the estimated PLS weight vector is
$$\hat{W} \propto S_{X,Y}, \qquad
\hat{\Phi} = \frac{S_{X,Y}}{\|S_{X,Y}\|}, \qquad
\hat\delta_{pls} = \frac{S_{X,Y}^T S_X S_{X,Y}}{\|S_{X,Y}\|^2}.$$
Let $(\hat\Phi, \hat\Phi_0) \in \mathbb{R}^{p\times p}$ be an orthonormal matrix. Then, the PLS estimators of $\Delta_0$ and $\Sigma_X$ are
$$\hat\Delta_{0,pls} = \hat\Phi_0^T S_X \hat\Phi_0, \qquad
\hat\Sigma_{X,pls} = \hat\delta_{pls}\hat\Phi\hat\Phi^T + \hat\Phi_0\hat\Delta_{0,pls}\hat\Phi_0^T.$$
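For concreteness, the one-component estimators just displayed can be assembled as in the following Python/numpy sketch (function and variable names are ours, not the book's companion code):

```python
import numpy as np

def one_component_pls(X, Y):
    """One-component PLS estimators of beta, delta, Delta_0 and Sigma_X.

    X : (n, p) predictors and Y : (n,) response; centered internally.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    SX = Xc.T @ Xc / n
    SXY = Xc.T @ Yc / n
    Phi_hat = SXY / np.linalg.norm(SXY)                 # estimated basis of the envelope
    delta_hat = SXY @ SX @ SXY / (SXY @ SXY)            # delta_hat_pls
    beta_hat = SXY * (SXY @ SXY) / (SXY @ SX @ SXY)     # q = 1 PLS coefficient vector
    # (Phi_hat, Phi0_hat) orthonormal: Phi0_hat spans the orthogonal complement
    eigval, eigvec = np.linalg.eigh(np.eye(p) - np.outer(Phi_hat, Phi_hat))
    Phi0_hat = eigvec[:, np.isclose(eigval, 1.0)]
    Delta0_hat = Phi0_hat.T @ SX @ Phi0_hat
    Sigma_X_hat = delta_hat * np.outer(Phi_hat, Phi_hat) + Phi0_hat @ Delta0_hat @ Phi0_hat.T
    return beta_hat, delta_hat, Delta0_hat, Sigma_X_hat
```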
+ Op (n−1/2 ),
The proof of Proposition 4.1 is rather long and so Cook, Helland, and Su
(2013) provided only a few key steps. We provide more details in the proof
given in Appendix A.4.1.
According to the results in parts (i) and (ii) of Proposition 4.1, $\hat\beta_{pls}$ is asymptotically normal and its asymptotic covariance depends on fourth moments of the marginal distribution of X. However, if $P_\Phi X$ is independent
of QΦ X, as required in part (iii), then only second moments are needed. The
condition of part (iii) is implied when X is normally distributed:
cov(PΦ X, QΦ X) = PΦ ΣX QΦ = 0
Using this result, the asymptotic covariance in part (iii) of Proposition 4.1
can be expressed equivalently as
$$\mathrm{avar}(\sqrt{n}\,\hat\beta_{pls}) = \mathrm{avar}(\sqrt{n}\,\hat\beta_{ols}) + \Phi_0\Delta_0^{-1/2}\left\{(\sigma_Y^2/\sigma_{Y|X}^2)(\Delta_0^2/\delta^2) - I_{p-1}\right\}\Delta_0^{-1/2}\Phi_0^T\,\sigma_{Y|X}^2.$$
From this we see that the performance of PLS relative to OLS depends
on the strength of the regression as measured by the ratio σY2 /σY2 |X ≥ 1
The first addends on the right-hand side of the asymptotic variances are
the same and correspond to the asymptotic variances when Φ is known. This
is as it must be since the envelope and PLS estimators are identical when Φ
is known. Thus, the second addends on the right-hand side of the asymptotic
variances correspond to the cost of estimating the envelope. To compare the
asymptotic variances we compare the costs of estimating the envelope. Let
$$\text{Cost for } \hat\beta = \eta^2\left\{\eta^2\delta_0/\sigma^2_{Y|X} + (\delta_0/\delta)(1 - \delta/\delta_0)^2\right\}^{-1}.$$
Then the ratio of costs is
$$\frac{\text{Cost for } \hat\beta}{\text{Cost for } \hat\beta_{pls}} = \frac{\tau(1-\tau)}{(\delta_0/\delta - \tau)^2 + \tau(1-\tau)} \le 1,$$
where τ = σY2 |X /σY2 , which is essentially one minus the population multi-
ple correlation coefficient for the regression of Y on X. The relative cost of
estimating the envelope is always less than or equal to one, indicating that
PLS is the more variable method asymptotically. This is expected since the
envelope estimator inherits optimal properties from general likelihood theory.
Otherwise, the cost ratio depends on the signal strength as measured by τ and
the level of collinearity, as measured by δ0 /δ. The envelope estimator tends
to do much better than PLS in low signal regressions where τ is close to 1
and in high signal regressions where τ is close to 0. If there is a high degree of
collinearity so δ0 /δ is small and (δ0 /δ − τ )2 ≈ τ 2 then the cost ratio reduces to
1 − τ and again the envelope estimator will do better than PLS in low signal
regressions. On the other hand, if the level of collinearity is about the same
as the signal strength δ0 /δ − τ ≈ 0 then the PLS estimator will do about the
same as the envelope estimator asymptotically.
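The cost ratio is simple to explore numerically. The Python sketch below (the function name and the illustrative values of $\tau$ and $\delta_0/\delta$ are ours) evaluates $\tau(1-\tau)/\{(\delta_0/\delta - \tau)^2 + \tau(1-\tau)\}$ in the three regimes just described.

```python
def relative_cost(tau, delta0_over_delta):
    """Ratio of envelope to PLS costs of estimating the envelope:
    tau(1 - tau) / ((delta0/delta - tau)^2 + tau(1 - tau)), always <= 1."""
    num = tau * (1.0 - tau)
    return num / ((delta0_over_delta - tau) ** 2 + num)

# low signal + strong collinearity, a balanced case, collinearity matching signal:
for tau, ratio in [(0.95, 0.10), (0.50, 0.50), (0.20, 0.20)]:
    print(tau, ratio, round(relative_cost(tau, ratio), 3))
```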
The comparison of this section is based on approximations achieved by letting $n \to \infty$ with $p$ fixed. They may have little relevance in regressions where $n$ is not large relative to $p$. When $n < p$, PLS regression is still serviceable, while maximum likelihood estimation based on model (4.1) is not.
FIGURE 4.1
Mussels' data: Plot of the observed responses $Y_i$ versus the fitted values $\hat{Y}_i$ from the PLS fit with one component, $q = 1$.
FIGURE 4.2
Mussels' data: Plot of the fitted values $\hat{Y}_i$ from the PLS fit with one component, $q = 1$, versus the OLS fitted values. The diagonal line $y = x$ was included for clarity.
TABLE 4.1
Coefficient estimates and corresponding asymptotic standard deviations
(S.D.) from three fits of the mussels’ data: Ordinary least squares, (OLS),
partial least squares with one component (PLS), and envelope with one di-
mension (ENV). Results for OLS and ENV are from Cook (2018).
          OLS               PLS, q = 1          ENV, q = 1
      Estimate   S.D.    Estimate    S.D.    Estimate    S.D.
β̂1     0.741    0.410      0.142   0.0063      0.141   0.0052
β̂2    −0.113    0.399      0.153   0.0067      0.154   0.0056
β̂3     0.567    0.118      0.625   0.0199      0.625   0.0194
β̂4     0.170    0.304      0.206   0.0086      0.206   0.0073
The coefficient estimates from the OLS, PLS, and envelope fits along with
their asymptotic standard errors are shown in Table 4.1. The OLS standard
errors are the usual ones. The PLS standard errors were obtained by plugging estimates into the asymptotic variances given in Proposition 4.1. The enve-
lope standard errors were obtained similarly from the asymptotic covariance
matrix given in Proposition 2.1. The PLS and envelope standard errors were
computed under multivariate normality of X. We see that the OLS standard
errors are between about 6 and 60 times those of the PLS estimator. These
types of differences are not unusual in the envelope literature, as illustrated
in examples from Cook (2018). The PLS standard errors are between about
2 and 20 percent larger than the corresponding envelope standard errors, but
are still much smaller than the OLS standard errors.
Shown in Table 4.2 are the estimates of the predictor covariance matrix ΣX
from the envelope, OLS, and PLS fits of the mussels regression. The estimate
of ΣX from the OLS fit is SX as defined in Section 1.2.1. The estimates from
the envelope and PLS fits were obtained by substituting the corresponding
estimates into (4.2). Estimation from the PLS fit is discussed in Section 4.2.2.
We see that the envelope and PLS estimates are both close to the OLS esti-
mate, which supports the finding that the dimension of the envelope is 1.
TABLE 4.2
Mussels’ muscles: Estimates of the covariance matrix ΣX from the envelope,
OLS and PLS (q = 1) fits.
Envelope
  X       log H   log L   log S   log W
  log H   0.030   0.031   0.123   0.041
  log L           0.035   0.134   0.045
  log S                   0.550   0.180
  log W                           0.063

OLS
  X       log H   log L   log S   log W
  log H   0.030   0.031   0.123   0.041
  log L           0.035   0.134   0.045
  log S                   0.550   0.180
  log W                           0.063

PLS
  X       log H   log L   log S   log W
  log H   0.031   0.032   0.126   0.042
  log L           0.036   0.136   0.046
  log S                   0.557   0.182
  log W                           0.064
Recall from Section 1.2 that Y0 = (Y1 , . . . , Yn )T denotes the vector of un-
centered responses and that X denotes the n × p matrix with rows (Xi − X̄)T ,
i = 1, . . . , n. Recall also that SX,Y = n−1 XT Y0 and SX = n−1 XT X ≥ 0
represent the usual moment estimators of σX,Y and ΣX using n for the
divisor. We use Wq (M ) to denote the Wishart distribution with q degrees of
freedom and scale matrix M . With W = XT X ∼ Wn−1 (ΣX ), we can represent
SX = W/n, SX,Y = n−1 (W β + XT ε), where ε is the n × 1 vector with the
model errors i as elements. (This use of the W notation is different from
the weight matrix in PLS.) We use the notation “ak bk ” to mean that, as
k → ∞, ak = O(bk ) and bk = O(ak ), and describe ak and bk as then being
asymptotically equivalent.
4.4.1 Goal
Our general goal is to gain insights into the predictive performance of βbpls as
n and p grow in various alignments. In this section we describe how that goal
will be pursued.
Let $Y_N = \mu + \beta^T(X_N - E(X)) + \varepsilon_N$ denote a new observation on $Y$ at a new independent observation $X_N$ of $X$. The PLS predicted value of $Y_N$ at $X_N$ is $\hat{Y}_N = \bar{Y} + \hat\beta_{pls}^T(X_N - \bar{X})$, giving a difference of
The term $(\hat\beta_{pls} - \beta)^T(\bar{X} - E(X))$ must have order smaller than or equal to the order of $(\hat\beta_{pls} - \beta)^T(X_N - E(X))$, which will be at least $O_p(n^{-1/2})$.
Consequently, we have the essential asymptotic representation
$$\hat{Y}_N - Y_N = O_p\left\{(\hat\beta_{pls} - \beta)^T(X_N - E(X))\right\} + \varepsilon_N \quad\text{as } n, p \to \infty.$$
Since $\varepsilon_N$ is the intrinsic error in the new observation, the $n, p$-asymptotic behavior of the prediction $\hat{Y}_N$ is governed by the estimative performance of $\hat\beta_{pls}$ as measured by
$$D_N := (\hat\beta_{pls} - \beta)^T\omega_N
= \left\{S_{X,Y}^T\hat\Phi(\hat\Phi^T S_X\hat\Phi)^{-1}\hat\Phi^T - \sigma_{X,Y}^T\Phi(\Phi^T\Sigma_X\Phi)^{-1}\Phi^T\right\}\omega_N, \qquad (4.8)$$
We assume for illustration and to gain intuition about abundance and sparsity in the context of one-component regressions that $X$ can be represented as $X = V\nu + \zeta$, where $\nu$ is a real stochastic latent variable that is related to $X$ via the non-stochastic vector $V$ and the X-errors $\zeta \perp\!\!\!\perp (\nu, Y)$. Without loss of
σX,Y is a basis for the one-dimensional envelope EΣX(B). Accordingly, in terms
δ = var(ΦTX) = kV k2 + ΦT Σζ Φ,
Then, as p → ∞,
$$\beta^T\Sigma_X\beta = \frac{p\,\mathrm{cov}^2(\nu, Y)}{p + (p-1)\rho + 1} \;\to\; \frac{\mathrm{cov}^2(\nu, Y)}{\rho + 1}.$$
$$\delta^{-2}\|\sigma_{X,Y}\|^2\,\Phi^T\Sigma_X\Phi = \delta^{-2}\sigma_{X,Y}^T\Sigma_X\sigma_{X,Y} \asymp 1.$$
In consequence,
$$K_j(n, p) = \frac{\mathrm{tr}(\Delta_0^j)}{n\|\sigma_{X,Y}\|^{2j}} = \frac{\mathrm{tr}(\Delta_\sigma^j)}{n}. \qquad (4.11)$$
In this section we will use only K1 (n, p) and K2 (n, p). We will make use
of K3 (n, p) and K4 (n, p) when considering asymptotic distributions in Sec-
tion 4.5.
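A direct computation of $K_j(n, p)$ from user-supplied population quantities may help fix ideas; the Python/numpy sketch below (names ours) implements (4.11).

```python
import numpy as np

def K(j, n, sigma_XY, Delta0):
    """K_j(n, p) = tr(Delta_0^j) / (n ||sigma_XY||^(2j)), as in (4.11)."""
    return np.trace(np.linalg.matrix_power(Delta0, j)) / (n * np.sum(sigma_XY**2) ** j)

# Example: bounded noise (Delta0 = I_{p-1}) with an abundant signal, ||sigma_XY||^2 = p
p, n = 200, 100
print(K(1, n, np.ones(p), np.eye(p - 1)))   # approximately (p - 1)/(n p), i.e. about 1/n
```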
This theorem can be used to gain insights into specific types of regressions.
In particular, in many regressions the eigenvalues of ∆0 may be bounded as
p → ∞, reflecting that the noise in X is bounded asymptotically. For instance,
this structure can occur when the predictor variation is compound symmetric:
ΣX = π 2 (1 − ρ + ρp)P1p + (1 − ρ)Q1p ,
Corollary 4.2. Assume the conditions of Theorem 4.1 and that the eigenvalues of $\Delta_0$ are bounded as $p \to \infty$. Then

I. $$D_N = O_p\left\{\frac{1}{\sqrt{n}} + \frac{\sqrt{p}}{\sqrt{n}\,\|\sigma_{X,Y}\|^2} + \frac{p}{n\|\sigma_{X,Y}\|^2}\right\}.$$

II. (Abundant) If $\|\sigma_{X,Y}\|^2 \asymp p$ then $D_N = O_p\{1/\sqrt{n}\}$.

III. (Sparse) If $\|\sigma_{X,Y}\|^2 \asymp 1$ then $D_N = O_p\{\sqrt{p}/\sqrt{n}\}$.
The first conclusion tells us that with bounded noise the predictive perfor-
mance of PLS methods depends on an interplay between the sample size n, the
number of predictors p and the signal as measured by kσX,Y k2 . The second and
third conclusions relate to specific instances of that interplay. The second con-
clusion says informally that if most predictors are correlated with the response
then PLS predictions will converge at the usual root-n rate, even if n < p. The
third conclusion says that if few predictors are correlated with the response or
kσX,Y k increases very slowly, then for predictive consistency the sample size
126 Asymptotic Properties of PLS
needs to be large relative to the number of predictors. The third case clearly
suggests a sparse solution, while the second case does not. Sparse versions of
PLS regression have been proposed by Chun and Keleş (2010) and Liland et al.
(2013). In view of the apparent success of PLS regression over the past four
decades, it is reasonable to conclude that many regressions are closer to abun-
dant than sparse. The compound symmetry example with EΣX(B) = span(1p )
is covered by Corollary 4.2 (II) and so the convergence rate is Op (n−1/2 ).
The next two corollaries give more nuanced results by tying the signal
kσX,Y k2 and the number of predictors p to the sample size n. Corollary 4.3
is intended to provide insights when the sample size exceeds the number of
predictors, while Corollary 4.4 is for regressions where the sample size is less
than the number of predictors.
Corollary 4.3. Assume the conditions of Theorem 4.1 and that the eigenvalues of $\Delta_0$ are bounded as $p \to \infty$. Assume also that $p \asymp n^a$ for $0 < a \le 1$ and that $\|\sigma_{X,Y}\|^2 \asymp p^s \asymp n^{as}$ for $0 \le s \le 1$. Then
The requirement from Theorem 4.1 that Kj (n, p) converge to 0 forces the
terms in the order given in conclusion I to converge to 0 to ensure consistency,
which limits the values of $a$ and $s$ jointly. The corollary predicts that $s = 1/2$ is a breakpoint for $\sqrt{n}$-convergence of PLS predictions in high-dimensional regressions. If the signal accumulates at a rate of at least $\|\sigma_{X,Y}\|^2 \asymp p^{1/2}$, so $s \ge 1/2$, then predictions converge at the usual root-n rate. This indicates that there is considerably more leeway in the signal rate to obtain $\sqrt{n}$-convergence than that described by Corollary 4.2(I). Otherwise a price is paid in terms of a slower rate of convergence. For example, if $\|\sigma_{X,Y}\|^2 \asymp p^{1/4}$ and $p \asymp n$, so $s = 1/4$ and $a = 1$, then $D_N = O_p(n^{-1/4})$, which is considerably slower than the root-n rate of case II. This corollary also gives additional characterizations of how PLS predictions will do in sparse regressions. From conclusion IV, we see that if $s = 0$, so $\|\sigma_{X,Y}\|^2 \asymp 1$, and $a = 0.8$, then $D_N = O_p(n^{-0.1})$, which would not normally yield useful results but could likely be improved by using a sparse fit. On the other hand, if $a$ is small, say
Corollary 4.4. Assume the conditions of Theorem 4.1 and that the eigenvalues of $\Delta_0$ are bounded as $p \to \infty$. Assume also that $p \asymp n^a$ for $a \ge 1$ and that $\|\sigma_{X,Y}\|^2 \asymp p^s \asymp n^{sa}$ for $0 \le s \le 1$. Then
Theorem 4.1 requires that K1 (n, p) → 0. Thus, in the context of this corol-
lary, for consistency we need as n, p → ∞
$$K_1(n, p) \asymp \frac{p}{n\|\sigma_{X,Y}\|^2} \asymp n^{a(1-s)-1} \to 0,$$
which requires a(1 − s) < 1. This is indicated also in conclusion I of this corol-
lary. The corollary does not indicate an outcome when a(1 − s) ≥ 1, although
here PLS might be inconsistent depending on the specific values of a and s. The
usual root-n convergence rate is achieved when a(1 − s) ≤ 1/2. For instance,
if a = 2 so p = n2 then we need s ≥ 3/4 for root-n convergence. However, in
contrast to Corollary 4.3, here there is no convergence with sparsity. If s = 0,
then we need a < 1 for convergence, which violates the corollary’s hypothesis.
Figure 4.3 gives a visual representation of the division of the (a, s) plane
according to the convergence properties of PLS from parts II and III of Corol-
laries 4.3 and 4.4. The figure is constructed to convey the main parts of the
conclusions; not all aspects of these corollaries are represented. The abundant
and in-between categories occupy most of the plane, while sparse fits are indi-
cated only in the upper left corner that represents high-dimensional problems with weak signals, $p > n$ and $\|\sigma_{X,Y}\|^2 \le \sqrt{p}$.
The previous three corollaries require that the eigenvalues of $\Delta_0$ be bounded. The next corollary relaxes this condition by allowing a finite number of eigenvalues $\delta_{0,j}$ of $\Delta_0$ to be asymptotically similar to $p$ ($\delta_{0,j} \asymp p$ for a finite
FIGURE 4.3
Division of the (a, s) plane according to the convergence properties given in
conclusions II and III of Corollaries 4.3 and 4.4.
Corollary 4.5. Assume the conditions of Theorem 4.1 and that $\delta_{0,j} \asymp p$ for a finite collection of indices $j$ while the other eigenvalues of $\Delta_0$ are bounded as $p \to \infty$. Assume also that $p \asymp n^a$ for $a \ge 1$ and that $\|\sigma_{X,Y}\|^2 \asymp p^s$ for $0 \le s \le 1$. Then
$$D_N = O_p(n^{-1/2 + a(1-s)}).$$
results of Basa et al. (2022); full proofs and additional details are available
from the supplement to their paper.
where we have suppressed the arguments $(n, p)$ to $K_j$, $V^{1/2}(G)$ will be used to scale $(\hat\beta_{pls} - \beta)^T G$ and $b$ is related to a potential for asymptotic bias. We use $\xrightarrow{D}$ to denote convergence in distribution.
The following theorem describes the asymptotic distribution reported by
Basa et al. (2022).
Theorem 4.2. Assume that the one-component model (4.1) holds with $\beta^T\Sigma_X\beta \asymp 1$. Assume also that (a) $X \sim N_p(\mu_X, \Sigma_X)$, (b) $E(\varepsilon^4)$ from model (4.1) is bounded, (c) $K_1(n, p)$ and $K_2(n, p)$ converge to 0 as $n, p \to \infty$, and (d) $K_3(n, p)$ is bounded and $K_4(n, p)$ converges to 0 as $n, p \to \infty$.
Condition (a) requires normality for $X$, and condition (b) is the usual requirement of finite fourth moments. Condition (c) plus the constraint $\beta^T\Sigma_X\beta \asymp 1$ are the same as the conditions for consistency in Theorem 4.1.
Condition (d) is new and is needed to insure stable asymptotic distributions.
Like conditions (b) and (c), condition (d) is judged to be mild.
with the upper bound being attained when G ∈ span(β). Since the choice
of G does not affect b, this indicates that the bias effects will be the most
prominent when G ∈ span(β).
It follows from (4.15) that
$$V^{-1/2}|G^T\beta\, b| \le \sqrt{n}\,|b|\,\big(\|\sigma_{X,Y}\|^2/\delta\sigma^2_{Y|X}\big)^{1/2} \asymp \sqrt{n}\,|b|. \qquad (4.16)$$
In consequence, $\sqrt{n}\,|b| \to 0$ is a sufficient condition for avoiding the bias effects when $G \notin \mathrm{span}^\perp(\beta)$. Inspecting $\sqrt{n}\,|b|$ we have
$$\sqrt{n}\,|b| \;\le\; \sqrt{n}\left\{\left(\frac{\|\sigma_{X,Y}\|^2}{\delta} - \sigma^2_{Y|X}\right)K_1 + \frac{\sigma_Y^2\,\|\sigma_{X,Y}\|^2}{\delta}\,K_1^2 + K_2\right\}.$$
This inequality gives rise to conditions for $\sqrt{n}\,|b| \to 0$. Since $\sigma_Y^2$ and $\|\sigma_{X,Y}\|^2/\delta - \sigma^2_{Y|X}$ are bounded, it is sufficient to require $\sqrt{n}\,K_j(n, p) \to 0$, $j = 1, 2$. We summarize these findings in the following corollary.
where $z_{\alpha/2}$ denotes a selected percentile of the standard normal distribution. Under Corollary 4.6 this same interval becomes a confidence interval for $\beta^T G$, in which case we refer to the interval as $CI_\alpha(\beta^T G)$.
It is not possible to use (4.17) in applications because V is unknown. How-
ever, a sample version of interval (4.17) can be constructed by using plug-in
estimators for V . To construct an estimator of V we simply plug in the esti-
mators of the constituents of representation (4.12),
$$\hat{V}(G) = \frac{G^T(S_X S_Y - S_{X,Y}S_{X,Y}^T)G}{n\,\hat\delta_{pls}^2}. \qquad (4.18)$$
where
$$\hat\delta_{pls} = \hat\Phi^T S_X\hat\Phi = S_{X,Y}^T S_X S_{X,Y}/S_{X,Y}^T S_{X,Y},$$
where
$$\hat{K}_j = \frac{\mathrm{tr}(\hat\Delta_{0,pls}^j)}{n\|S_{X,Y}\|^{2j}}$$
and $\hat\Delta_{0,pls}$ is as defined in Section 4.2.2.
Theorem 4.5. Under all conditions of Theorem 4.2(II), $\sqrt{n}(\hat{b} - b) \to 0$ in probability as $n, p \to \infty$. In consequence, an asymptotic confidence interval for $\beta^T G$ is
$$CI_\alpha(\beta^T G) = \frac{1}{1 + \hat{b}}\left[\hat\beta_{pls}^T G \pm z_{\alpha/2}\,\hat{V}^{1/2}(G)\right].$$
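The adjusted interval can be sketched as follows in Python/numpy (names ours). The bias estimate $\hat{b}$ is taken as an input because its defining formula is not reproduced in this excerpt; $\hat{V}(G)$ follows the plug-in form (4.18) and the one-component $\hat\beta_{pls}$ is as in Section 4.2.2.

```python
import numpy as np
from scipy.stats import norm

def ci_beta_G(X, Y, G, b_hat, alpha=0.05):
    """Adjusted interval of Theorem 4.5 for beta^T G (one-component model).

    X : (n, p) centered predictors, Y : (n,) centered response, G : (p,).
    b_hat is the plug-in bias estimate, supplied by the caller because its
    defining formula is not reproduced in this excerpt.
    """
    n = X.shape[0]
    SX = X.T @ X / n
    SXY = X.T @ Y / n
    SY = Y @ Y / n
    delta_hat = SXY @ SX @ SXY / (SXY @ SXY)
    beta_hat = SXY * (SXY @ SXY) / (SXY @ SX @ SXY)          # one-component PLS estimate
    V_hat = G @ (SX * SY - np.outer(SXY, SXY)) @ G / (n * delta_hat**2)   # (4.18)
    half = norm.ppf(1 - alpha / 2) * np.sqrt(V_hat)
    center = beta_hat @ G
    return ((center - half) / (1 + b_hat), (center + half) / (1 + b_hat))
```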
1. The one-component model (4.1) holds with $\beta^T\Sigma_X\beta \asymp 1$, and (a) and (b) of Theorem 4.2.
2. $\|\sigma_{X,Y}\|^2 \asymp p^t$ for $0 \le t \le 1$.
3. Eigenvalue constraints,
Conditions 1 and 2, and either 3(i) or 3(ii) are assumed for the rest of this
section.
4.5.3.1 Kj (n, p)
It should be understood in the following that we deal with $K_j(n, p)$ for only $j = 1, \dots, 4$. Under condition 3(i), we have for $j \ge 1$
$$K_j(n, p) = \frac{p}{np^{jt}} = \frac{1}{np^{jt-1}}, \qquad K_1(n, p) \ge K_j(n, p) \text{ for all } n, p. \qquad (4.19)$$
Under condition 3(ii),
$$K_j(n, p) = \frac{p^j}{np^{jt}} = \frac{1}{np^{j(t-1)}}, \qquad K_i(n, p) \le K_j(n, p), \quad i \le j. \qquad (4.20)$$
Two equal forms for V (G) were presented in (4.12) and (4.13). The form
given in (4.12) was used in previous developments. The form in (4.13) derives
immediately from the asymptotic variance for the fixed p case presented in Proposition 4.1(iii):
$$V(G) = \frac{G^T\,\mathrm{avar}(\sqrt{n}\,\hat\beta_{pls})\,G}{n} = \frac{\mathrm{avar}(\sqrt{n}\,G^T\hat\beta_{pls})}{n}.$$
That is, the $V(G)$ derived from n-asymptotics is the same as that in n, p-asymptotics. When letting $n \to \infty$ with p fixed, nV is static; it does not change since p is static. In consequence, with p static, the result from Proposition 4.1 can be stated alternatively as $\sqrt{n}\,G^T(\hat\beta_{pls} - \beta) \xrightarrow{D} N(0, nV)$, leading to the conclusion that convergence is at the usual $\sqrt{n}$ rate.
The same scaling quantity V is appropriate for our n, p-asymptotic results:
$$V^{-1/2}(\hat\beta_{pls} - \beta(1 + b))^T G \xrightarrow{D} N(0, 1) \quad\text{and}\quad V^{-1/2}(\hat\beta_{pls} - \beta)^T G \xrightarrow{D} N(0, 1).$$
However, we do not necessarily have $\sqrt{n}$ convergence because, in our context, nV is dynamic, changing as $p \to \infty$. Let $\lambda_{\max}(\cdot)$ denote the maximum eigenvalue of the argument matrix. Then
$$\begin{aligned}
V(G) &= \frac{1}{n\delta}\left\{(G^T\Phi)^2\sigma^2_{Y|X} + \delta^{-1}\sigma_Y^2\,G^T\Phi_0\Delta_0\Phi_0^T G\right\}\\
&\le \frac{1}{n\delta}\left\{\|G\|^2\sigma^2_{Y|X} + \delta^{-1}\sigma_Y^2\|G\|^2\lambda_{\max}(\Delta_0)\right\}\\
&= \|G\|^2\left\{\frac{\|\sigma_{X,Y}\|^2}{\delta}\,\frac{\sigma^2_{Y|X}}{n\|\sigma_{X,Y}\|^2} + \sigma_Y^2\left(\frac{\|\sigma_{X,Y}\|^2}{\delta}\right)^2\frac{\lambda_{\max}(\Delta_0)}{n\|\sigma_{X,Y}\|^4}\right\}.
\end{aligned}$$
4.5.3.3 Bias b
It follows from (4.14) and (4.19) that under condition 3(i) $b = O(1/np^{t-1})$, which is reported in the second column, second row of Table 4.3. That $1/np^{t-1} \to 0$ is implied by (4.21). For $t = 0$, $b = O(p/n)$, for $t = 1/2$, $b = O(\sqrt{p}/n)$ and for $t = 1$, $b = O(1/n)$. When hypotheses (a)–(d) of Theorem 4.2 are met in this example, $n^{1/4}K_j \to 0$, $j = 1, 2$ is sufficient to have the limiting distribution stated in the theorem. Under condition 3(i), $K_1 \ge K_2$ and so
$$n^{1/4}K_1 = \frac{1}{n^{3/4}p^{t-1}} \to 0$$
is sufficient for Theorem 4.2. This condition is listed in the third column, second row of Table 4.3.
It follows from (4.14) and (4.20) that under condition 3(ii), $b = O\{1/np^{2(t-1)}\}$ as given in the second column, third row of Table 4.3. That
TABLE 4.3
The first and second columns give the orders $O(\cdot)$ of $V^{1/2}$ and $b$ under conditions 1–3 as $n, p \to \infty$. The fourth column headed $b$ gives from Corollary 4.6 the order of the quantity $\sqrt{n}\,|b|$ that must converge to 0 for the bias term to be eliminated.
It seems clear from Table 4.3 that the bias plays a notable role in the convergence. Nevertheless, in abundant regressions with $\|\sigma_{X,Y}\|^2 \asymp p$, the quantities in Table 4.3 all converge to 0 at the $\sqrt{n}$ rate or better when $\|G\|/\sqrt{p} \asymp 1$.
The results in Table 4.3 hint that, as n, p → ∞, the proper scaling of
the asymptotic normal distribution may be reached relatively quickly, while
achieving close to the proper location may take a larger sample size. Basa et al.
(2022) conducted a series of simulations to illustrate the relative impact that
bias and scaling can have on the asymptotic distribution of GT βbpls . In refer-
ence to the one-component model (4.1), they set $n = p/2$, $\mu = 0$, $\sigma_{Y|X} = 1/2$, $\delta = \|\sigma_{X,Y}\|^2$ and $\Delta_0 = I_{p-1}$. The covariance vector $\sigma_{X,Y}$ was generated with $\lfloor p^{1/2}\rfloor$ standard normal elements and the remaining elements equal to 0, so $\|\sigma_{X,Y}\|^2 \asymp \sqrt{p}$. Following the discussion of (4.15), they selected $G = \sigma_{X,Y}$ to emphasize bias. Then, $X$ was generated as $N(0, \Sigma_X)$ and $Y \mid X$ was then generated according to (4.1) with $N(0, \sigma^2_{Y|X})$ error. In reference to Table 4.3, this simulation is an instance of condition 3(i) with $t = 1/2$.
For each selection of n = p/2 this setup was replicated 500 times
and side-by-side histograms drawn of D1 = V −1/2 GT βbpls − β (1 + b) and
D2 = V −1/2 GT βbpls − β . Their results are shown graphically in Figure 4.4.
Since Kj (n, p) p−j/2 conditions (a) – (d) of Theorem 4.2 hold. Further, since
n1/4 Kj (n, p) p(−2j+1)/4 → 0 for j = 1, 2, 3, 4 it follows from Theorem 4.2(II)
that D1 converges in distribution to a standard normal. The convergence rate
for the largest of these n1/4 K1 (n, p) p−1/4 is quite slow so it may take a
138 Asymptotic Properties of PLS
p=8 p = 16 p = 32
−6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6
p = 64 p = 128 p = 256
−6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6
−6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6
FIGURE 4.4
Simulation results on bias:
The right histogram in each plot is of
−1/2 T b −1/2 T b
V G βpls − β (1 + b) and the left histogram is of V G βpls − β .
The standard normal reference density is also shown and in all cases n = p/2.
(Plot was constructed with permission using the data that Basa et al. (2022)
used for their Fig. 1.)
FIGURE 4.5
Plots of the means and standard deviations versus log p corresponding to the
simulations of Figure 4.4.
predictors before the approximation seems quite close to the standard normal.
This is in qualitative agreement with the slow rate of convergence mentioned
previously for this example. The histograms for D2 do not converge to a stan-
dard normal density. Visually, it seems that the histogram of D1 gets to the
right scaling faster than it achieves the right location, which is in agreement
with the discussion of Table 4.3. This is supported by the plots in Figure 4.5.
of eliminating the first addend in the bias b (4.14). The part of G that falls
in span⊥ (β) will also mitigate the bias. The coverage rates, estimated by re-
peating the procedure 1, 000 times and counting the number of intervals that
covered β T G and (1 + b)β T G, are shown in the third and fourth columns
of Table 4.4. Theorem 4.2 holds in this simulation scenario. For instance,
√
Kj (n, p) = 1/p3j/4−1/2 . Corollary 4.6(I) also holds since n|b| n−1/2 . How-
ever, the sufficient condition given in Corollary 4.6(II) does not hold since
√
nK1 (n, p) = 1 for all p.
To emphasize the bias, Basa et al. (2022) conducted another simulation
with the same settings, except we set σY2 |X = 1/2, G = σX,Y and, to enhance
the contrast, the true value of V was used in the interval construction. Ac-
cording to the discussion following (4.15), this choice of G will maximize the
bias effect and the first addend of (4.14) will now contribute to the bias. The
results are shown in the fifth and sixth columns of Table 4.4. We see now that
√
CI0.05 (β T G) suffers in comparison to CI0.05 ((1 + b)β T G) since n|b| does not
√
converge to 0, n|b| 1. However, by (4.16),
√ √
JV −1/2 |GT βb| ≤ nJ|b|(kσX,Y k2 /δσY2 |X )1/2 = 2nJ|b| → 0,
and thus Theorem 4.2(I) holds. Additionally, the conditions for Theo-
rem 4.2(II) hold and so the adjusted confidence interval of Theorem 4.5 is
applicable. The rates for that adjusted interval are shown in the last column
of Table 4.4.
Distributions of one-component PLS estimators 141
and
κ(p) = tr(var(ΦT0 X)) = tr(∆0 ) = tr(QE ΣX ).
In Definition 4.1 and at the outset of Section 4.4.3 we used kσX,Y k2 δ =
tr(∆) as a measure of the signal in one-component regressions. Our definition
of signal in multi-component regressions then reduces to the previous one when
q = 1. In our treatment of one-component regressions, we used the noise-to-
signal ratio ∆σ := ∆0 /kσX,Y k2 to characterize in Theorem 4.1 the asymptotic
behavior of DN in terms of K1 (n, p) = tr(∆σ )/n and K2 (n, p) = tr(∆2σ )/n. A
corresponding fine level of analysis was not maintained in the multi-component
case. Instead, Cook and Forzani (2019) consistently used additional bounding
of the form tr(Ak ) ≤ trk (A) when characterizing the asymptotic behavior of
PLS predictions. For instance, this means that K2 (n, p) ≤ nK12 (n, p). In this
way, we obtain a coarser version of Theorem 4.1 depending only on the sample
size and noise-to-signal ratio via an extended definition of K1 :
κ(p)
K1 (n, p) = .
nη(p)
This is similar to the signal rate found by Cook, Forzani, and Rothman (2013)
in their study of abundant high-dimensional linear regression.
It follows from our discussion of Section 3.5.1 that the population
NIPALS weight vectors arise from a sequential orthogonalization of the
q vectors ΣjX σX,Y , j = 0, . . . , q − 1, in the Krylov sequence Kq =
(σX,Y , ΣX σX,Y , . . . , Σq−1
X σX,Y ). While these basis vectors are linearly inde-
pendent by construction, for this orthogonalization to be stable as p → ∞,
the Krylov basis vectors cannot be too collinear. Let Rj2 denote the squared
multiple correlation coefficient from the linear regression of the j-th coordinate
T
σX,Y Σj−1 T
X X of Kq X onto the rest of its coordinates,
T
(σX,Y T
X, . . . , σX,Y Σj−2 T j T q−1
X X, σX,Y ΣX X, . . . , σX,Y ΣX X).
Then the collinearity among the Krylov basis vectors arises asymptotically as
the rate of increase in the sum of variance inflation factors
Xq
ρ(p) = (1 − Rj2 )−1 .
j=1
144 Asymptotic Properties of PLS
Practically, if there are many predictors that are correlated with the re-
sponse and if basis-vector collinearity as described above is not an issue then
we have an abundant regression.
To perhaps provide a little intuition we next describe how kσX,Y k2 con-
nects with η(p) and the eigenvalues of ΣX . Let λ1 (A) ≥ · · · ≥ λm (A)
denote the ordered eigenvalues of the symmetric m × m matrix A. Since
Pq
η(p) = j=1 λj (ΦT ΣX Φ), it follows that (Cook and Forzani, 2019, eq. (4.4))
kσX,Y k2
σY2 > . (4.26)
η(p)
The response variance σY2 is a finite constant and does not change with p.
Consequently, regardless of the value of p, the ratio in this relationship is
bounded above by σY2 . If kσX,Y k2 increases as we add more predictors then
η(p) must correspondingly increase to maintain the bound (4.26) for all p. If
kσX,Y k2 → ∞ as p → ∞ then η(p) must also diverge in a way that main-
tains the bound. Equivalently, if η(p) is bounded, so the regression is sparse,
then kσX,Y k2 must also be bounded. In sum, kσX,Y k2 may serve as a useful
surrogate for the signal η(p).
There is also a relationship between η(p) and the eigenvalues of ΣX (Rao,
1979, Thm. 2.1):
Xq
η(p) ≤ λj (ΣX ). (4.27)
j=1
C1. Model (2.5) holds. The response Y is real (r = 1), (Y, X) follows
a non-singular multivariate normal distribution and the data (Yi , Xi ),
i = 1, . . . , n, arise as independent copies of (Y, X). To avoid the trivial
case, we assume that the coefficient vector β = 6 0, which implies that the
dimension of the envelope q ≥ 1. This is the same as the model adopted for
the one-component case, except here the number of components is allowed
to exceed one.
C3. The number of components q, which is the same as the dimension of the
envelope EΣX(B), is known and fixed for all p. This is the same structure
adopted for the one-component case, except here the number of compo-
nents is allowed to exceed one.
√
C4. K1 and ρ/ n → 0 as n, p → ∞, where K1 and ρ are defined in Sec-
tion 4.6.1.
Theorem 4.7. As n, p → ∞,
√ n o
DN = Op (ρ/ n) + Op ρ1/2 n−1/2 (κ/η)q .
In particular,
√
II. If κ η then DN = Op (ρ/ n).
√
III. If q = 1 then DN = Op ( nK1 ).
We see from this that the asymptotic behavior of PLS depends crucially
on the relative sizes of signal η and noise κ in X. It follows from the general
result that if κ p, as likely occurs in many applications, particularly spectral
applications in chemometrics, and if η p, so the regression is abundant, then
√
DN = Op (ρ/ n). If, in addition, ρ 1 then PLS fitted values converge at the
√
usual n-rate, regardless of the relationship between n and p.
On the other hand, if the signal in X is small relative to the noise in
X, so η = o(κ), then it may take a very large sample size for PLS predic-
tion to be consistent. For instance, suppose that the regression is sparse and
only q predictors matter and thus η 1. Then it follows reasonably that
ρ 1 and, from part I, DN = Op {n−1/2 κq }. If, in addition, κ p then
DN = Op {pq n−1/2 }. Clearly, if q is not small, then it could take a huge sam-
ple size for PLS prediction to be usefully accurate.
Theorem 4.7 places no constraints on the rate of increase in κ(p) = tr(∆0 ).
In many regressions it may be reasonable to assume that the eigenvalues of
∆0 are bounded so that κ(p) p as p → ∞. In the next theorem we describe
the asymptotic behavior of PLS predictions when the eigenvalues of ∆0 are
bounded. It is a special case of Cook and Forzani (2019, Theorem 2).
In particular,
I. If ρ 1 or if q = 1 then
( 1/2 )
p
DN = Op .
nη
√
II. If η p then DN = Op (ρ/ n) .
The order of DN now depends on a balance between the sample size n, the
variance inflation factors as measured through ρ and the noise to signal ratio
in K1 , but it no longer depends on the dimension q. We see in the conclusion
to case I that there is synergy between the sample size and the signal, the sig-
nal serving to multiply the sample size. For instance, if we assume a modest
√ √
signal of η(p) p then in case I we must have n large relative to p for the
best results. If η p and ρ 1 then from case II we again get convergence at
√
the n rate.
Chun and Keleş (2010) concluded that PLS can be consistent in high-
dimensional regressions only if p/n → 0. However, they required the eigen-
values of ΣX to be bounded as p → ∞ and that ρ(p) 1. If the eigenvalues
of ΣX are bounded then the eigenvalues of ∆0 are bounded so κ p and
from (4.27) the signal must be bounded as well η 1. It then follows from
Theorem 4.8 that DN = Op {(p/n)1/2 }, which is the rate obtained by Chun
and Keleş (2010). By required the eigenvalues of ΣX to be bounded, Chun
and Keleş (2010) in effect restricted their conclusion to sparse regressions.
So far our focus has been on the rate of convergence of predictions as mea-
sured by DN . There is a close connection between the rate for DN and the
rate of convergence of βbpls in the ΣX inner product. Let
Then, as shown by Cook and Forzani (2019, Supplement Section S8), Vn,p and
DN have the same order as n, p → ∞, so Theorems 4.7 and 4.8 hold with DN
replaced by Vn,p . It follows from that the special cases of Theorems 4.7 and 4.8
and the subsequent discussions apply to Vn,p as well. In particular, estimative
convergence for βbpls as measured in the ΣX inner product will be at or near
the root-n rate under the same conditions as predictive convergence.
Bounded or unbounded signal? 149
4.8 Illustration
We use in this section variations on the example presented in Section 3.7
to illustrate selected asymptotic results, including the distinction between
150 Asymptotic Properties of PLS
0.12
0.10 0.11
Root MSE
0.07 0.08 0.09
FIGURE 4.6
Tetracycline data: The open circles give the validation root MSE from
10, 20, 33, 50, and 101 equally spaced spectra. (From Fig. 4 of Cook and Forzani
(2019) with permission.)
abundant and sparse regressions (Cook and Forzani, 2020). A detailed analy-
sis of a one-component regression is described in Section 4.8.1, and a cursory
analysis of a two-component regression is given in Section 4.8.2.
h3 13
Illustration 151
but now the hj ’s are independent normal random variables with mean 0 and
variance 25. The variance-covariance matrix of the predictors X is now
11 1T1 0
ΣX = 25 0 12 1T2 0 + Ip .
0 0 13 1T3
Again, the predictors consist of three independent blocks of sizes (p − d)/2,
(p − d)/2 and d. The correlation between predictors in the same block is about
0.96. Recall that uT1 = (1T1 , 0, 0), uT2 = (0, 1T2 , 0), and uT3 = (0, 0, 1T3 ). Then
ΣX can be expressed equivalently as
where P(·) is the projection onto the subspace spanned by the indicated vectors
and Q is the projection onto the p − 3 dimensional subspace that is orthog-
onal to span(u1 , u2 , u3 ). We see that ΣX has three eigenspaces span(u1 , u2 ),
span(u3 ) and span⊥ (u1 , u2 , u3 ). With r = 1, the n × 1 vector of responses Y
was generated as the linear combination Y = 3H1 − 4H2 + , where H1 and
H2 where is an n × 1 vector of independent normal variates with mean 0
and finite variance. This gives
η(p) = Φ T ΣX Φ
= 1 + 25(p − d)/2
κ(p) = tr(ΦT0 ΣX Φ0 )
= −1 + 25(p + d)/2 + p.
FIGURE 4.7
Illustration of the behavior of PLS predictions in abundant and sparse regres-
sions. Lines correspond to different numbers of material predictors. Reading
from top to bottom as the lines approach the vertical axis, the first bold line
is for 2 material predictors. The second dashed line is for p2/3 material pre-
dictors. The third solid lines is for p − 40 material predictors and the last line
with circles at the predictor numbers used in the simulations is for p mate-
rial predictors. The vertical axis is the squared norm kβ − βk b 2 and always
SX
n = p/3. (This figure was constructed with permission using the same data
as Cook and Forzani (2020) used for their Fig. 1.)
Figure 4.7 shows the results of a small simulation to reinforce these com-
ments. The vertical axis is the squared difference between centered mean value
β TX and its PLS estimator βbTX averaged over the sample: kβ − βk b 2 =
SX
−1
P n T T 2
n i=1 (β Xi − β Xi ) . The horizontal axis is the number of predictors p
b
and the sample size is n = p/3. The lines on the plot are for different num-
bers of material predictors. The curve for 2 material predictors corresponds
to a sparse regression and, as expected, the results support Case 3, as there
is no visual evidence that it is converging to 0. The other three curves are for
abundant regressions with varying rates of information accumulation.
For the curve for “p” material predictors is the best that can be achieved
in the context of the simulated regression. For the “p−40” curve all predictors
are material except for 40. As the total number of predictors increases, the
40 immaterial predictors cease to play a role and the “p − 40” curve coincides
with the “p” curve. These results support Case 1.
154 Asymptotic Properties of PLS
The remaining dashed curve for “p2/3 ” material predictors represents Case
2, an abundant regression with a very slow rate of information accumulation
that is less than that for the “p” and “p − 40” curves. The theory predicts
that this curve will eventually coincide with the “p” curve, but demonstrating
that result by simulation will surely take a very, very large value of p.
where κ(p) comes from adding the eigenvalues of ∆0 , (1 + 25p3 ) and 1 with
multiplicity p − 3. From here we can appeal to Theorems 4.7 and 4.8 to char-
acterize the asymptotic behavior of PLS predictions.
Illustration 155
Simultaneous Reduction
Response and predictor envelope methods have the potential to increase effi-
ciency in estimation and prediction. It might be anticipated then that combin-
ing response and predictor reductions may have advantages over either method
applied individually. We discuss simultaneous predictor-response reduction in
this chapter. Our discussion is based mostly on two relatively recent papers:
Cook and Zhang (2015b) developed maximum likelihood estimation under a
multivariate model for simultaneous reduction of the response and predictor
vectors, studied asymptotic properties under different scenarios and proposed
two algorithms for getting estimators. Cook, Forzani, and Liu (2023b) devel-
oped simultaneous PLS estimation based on the same type of multivariate
linear regression model used by Cook and Zhang (2015b).
As in previous developments, we first discuss in Section 5.1 conditional
independence foundations for simultaneous reduction. We then incorporate
the multivariate linear model, turning to likelihood-based estimation in Sec-
tion 5.2. Simultaneous PLS estimation is discussed in Section 5.3 and other
related methods are discussed briefly in Section 5.4. Since PLS is a focal point
of this book, the main thrust of our discussion follows Cook, Forzani, and Liu
(2023b), with results from Cook and Zhang (2015b) integrated as relevant.
ΣX = var(X) = var(PS X + QS X)
= var(PS X) + var(QS X)
= PS ΣX PS + QS ΣX QS ,
and !
∆0 0
Σ(Φ0 ⊕Γ0 )T C = ,
0 Θ0
where the direct sum operator ⊕ is defined in Section 1.2.1.
Foundations for simultaneous reduction 159
EΣY |X (B 0 ) ⊕ EΣX (B) = EΣY (B 0 ) ⊕ EΣX (B) = EΣY ⊕ΣX (B 0 ⊕ B). (5.6)
= Φ∆−1 ΦT ΦKΓT
= ΦηΓT , where η = ∆−1 K;
ΣY |X = ΣY − var(β T X)
= ΓΘΓT + Γ0 Θ0 ΓT0 − ΓK T ∆−1 KΓT
= Γ(Θ − K T ∆−1 K)ΓT + Γ0 Θ0 ΓT0
= ΓΩΓT + Γ0 Ω0 ΓT0 ,
This structure leads to the following linear model for the simultaneous
reduction of predictors and responses,
Y = α + Γη T ΦT X + ε
ΣY |X = ΓΩΓT + Γ0 Ω0 ΓT0 , (5.7)
T T
ΣX = Φ∆Φ + Φ0 ∆0 Φ0 ,
where η contains the coordinates of β relative to bases Γ and Φ. In the follow-
ing sections we discuss estimation under model (5.7) by maximum likelihood,
PLS and a related two-block method.
Model (5.7) is the same as that used by Cook and Zhang (2015b) in their
development of a likelihood-based simultaneous reduction method. However,
instead of starting with general reductive conditions given in Proposition 5.1,
Cook and Zhang (2015b) took a different route, which we now describe since
it may furnish additional insights.
β = U DV T = (ΦA)D(B T ΓT ) = ΦηΓT
Lemma 5.1. Under the simultaneous envelope model (5.7), canonical cor-
relation analysis can find at most d directions in the population, where
d = rank(ΣX,Y ) as defined in (5.5). Moreover, the directions are contained in
the simultaneous envelope as
Canonical correlation analysis may thus miss some information about the
regression by ignoring some material parts of X and/or Y . For example, when
r is small, it can find at most r linear combinations of X, which can be insuf-
ficient for regression. Cook and Zhang (2015b) found in the simulation studies
that the performance of predictions based on estimated canonical correlation
reductions aT1 X, . . . , aTd X and bT1 Y, . . . , bTd varied widely for different covari-
ance structures and was generally poor.
5.2.1 Estimators
As in other envelope model encountered in this book, the most difficult part
of parameter estimation is determining estimators for the basis matrices Φ
and Γ. Once this is accomplished, the estimators of the remaining parameters
in model (5.7) are relatively straightforward to construct.
Let G and W denote semi-orthogonal matrices that estimate versions of Γ
and Φ. These may come from maximum likelihood estimation, PLS estimation
or some other methods. Estimators of the remaining parameters in (5.7) are
given in the following lemma. The derivation is omitted since it is quite similar
to the derivations of the estimators for predictor and response envelopes.
−1 T
βb = W SW TX SW TX,GT Y G
∆
b = SW TX
∆
b0 = SW0T X
Ω
b = SGT Y |W TX
Ω
b0 = SGT0 Y
Σ
b Y |X b T + G0 Ω
= GΩG b 0 GT0
Σ
bX b T + W0 ∆
= W ∆W b 0 W0T ,
The next lemma gives an instructive form of the residuals from the fit of
model (5.7. It proof is in Appendix A.5.3.
Lemma 5.3. Using the sample estimators from Lemma 5.2, the sample co-
variance matrix of the residuals from model (5.7),
n
X
Sres = n−1 η T W TXi }{(Yi − Ȳ ) − Gb
{(Yi − Ȳ ) − Gb η T W TXi }T ,
i=1
can be represented as
Sres = Σ
b Y |X + PG SY |W TX QG + QG SY |W TX PG ,
where Σ
b Y |X is as given in Lemma 5.2.
! ! !T
Φ 0 ∆ ∆η Φ 0
ΣC =
0 Γ η T ∆ Ω + η T ∆η 0 Γ
! ! !T
Φ0 0 ∆0 0 Φ0 0
+
0 Γ0 0 Ω0 0 Γ0
!
∆ ∆η
= (Φ ⊕ Γ) (Φ ⊕ Γ)T
η T ∆ Ω + η T ∆η
+(Φ0 ⊕ Γ0 )(∆0 ⊕ Ω0 )(Φ0 ⊕ Γ0 )T . (5.9)
164 Simultaneous Reduction
Let
(Φ, b = argminW,G F (SC , W ⊕ G),
b Γ)
5.2.2 Computing
Cook and Zhang (2015b) discussed three algorithms for minimizing the si-
multaneous objective function, one of which uses an algorithm that alternates
between predictor and response reduction. If we fix Φ as an arbitrary orthog-
onal basis, then the objective function F (SC , Φ ⊕ Γ) can be re-expressed as
an objective function in Γ for response reduction as given in (2.21):
The following alternating algorithm based on (5.10) and (5.11) can be used
to obtain a minimizer of the objective function F (SC , Φ ⊕ Γ).
1. Initialization: Set the starting value Φ(0) and get Γ(0) = arg minΓ F (Γ|Φ(0) ).
2. Alternating: For the k-th stage, obtain Φ(k) = arg minΦ F (Φ|Γ = Γ(k−1) )
and Γ(k) = arg minΓ F (Γ|Φ = Φ(k) ).
3. Convergence criterion: Evaluate F (Φ(k−1) ⊕ Γ(k−1) ) − F (Φ(k) ⊕ Γ(k) )
which is w1 = (1, 0)T . The second weight vector(0, 1)T is constructed using
! !
0 −0.6 T 0 0 0
Qw1 (ΣY ) = , Qw1 (ΣY ) ΣY,X = .
0 1 −3 4 0
and !
0 0
QTw1 (ΣY ) ΣY,X ΣX,Y Qw1 (ΣY ) = .
0 1
In consequence, no response reduction is possible since span(w1 , w2 ) = R2 .
assume that the vi ’s and ti ’s are independent copies random vectors v and
t, both with mean 0. The models are the same form after nonsingular linear
transformations of t and v and, in consequence, we assume without loss of
generality that var(t) = Idt and var(v) = Idv . However, t and u are required
to be correlated, cov(t, u) = Σt,u , for the predictors and responses to be cor-
related. C and D are non-stochastic loading matrices with full column ranks,
and E and F are matrices of random errors.
Model (5.12) is an instance of a structural equation model. Investigators
using structural equation models have historically taken a rather casual atti-
tude toward the random errors, focusing instead on the remaining structural
part of the model (Kruskal, 1983). Often, properties of the error matrices E
and F are not mentioned, presumably with the implicit understanding that
the errors are small relative to the structural components. That view may be
adequate for some purposes, but not here. We assume that the rows eTi and
fiT , i = 1, . . . , n, of E and F are independently distributed with means 0 and
constant variance-covariance matrices Σe and Σf , and E F .
Written in terms of the rows XiT of X0 and the rows YiT of Y0 , both
uncentered, these models become for i = 1, . . . , n
)
Xi = µX + Cvi + ei
. (5.13)
Yi = µY + Dti + fi
The vectors vi and ti are interpreted as latent vectors that control the ex-
trinsic variation in X and Y . These latent vectors may be seen as imaginable
constructs as in some applications like path analyses (e.g. Haenlein and Ka-
plan, 2004; Tenenhaus and Vinzi, 2005) or as convenient devices to achieve
dimension reduction. The X and Y models are connected because t and v are
correlated with covariance matrix Σt,v . If t and v are uncorrelated then so are
X and Y .
To develop a connection with envelopes, let C = span(C) and D = span(D)
and let Φ ∈ Rp×q and Γ ∈ Rr×u be semi-orthogonal basis matrices for
EΣe (C) and EΣf (D), respectively, q ≥ dv , u ≥ dt , and let (Φ, Φ0 ) ∈ Rp×p
and (Γ, Γ0 ) ∈ Rr×r be orthogonal matrices. With this we have
where the W(·) ’s are positive definite matrices what can be expressed in
terms of quantities in model (5.13). Such expressions of the W(·) ’s will play
no role in what follows and so are not given. We see from (5.14) that
ΓT0 ΣY,X = ΣΓT0 Y,X = 0 and that ΣY,X Φ0 = ΣY,ΦT0 X = 0, and thus that
the covariance between X and Y is captured entirely by the covariance be-
tween ΦTX and ΓT Y . In effect, ΓT Y represents the predictable part of Y and
ΦT X represents the material part of X.
In short, the bilinear model (5.12) leads to the envelope structure of (5.3)
that stems from Proposition 5.1. See Section 11.2 for additional discussion of
bilinear models as a basis for PLS.
This represents linear models in the rows and columns of Y and is the same
model form as that studied by vonRosen (2018). Models of this form arise in
analyses of longitudinal data, as discussed in Section 6.4.4.
Rosipal and Krämer, 2006; Rosipal, 2011; Weglin, 2000; Wold, 1975a). The
derivation of the population two-block algorithm from the data-based algo-
rithm is available in Appendix A.5.4, since it follows the same general steps
as the derivations of the population algorithms for NIPALS and SIMPLS dis-
cussed previously in Sections 3.1 and 3.3.
The population version of the two-block algorithm developed in Ap-
pendix A.5.4 is shown in Table 5.2. The sample version, which is obtained by
replacing ΣX , ΣY , and ΣX,Y with their sample versions SX , SY , and SX,Y ,
does not require SX and SY to be nonsingular. Perhaps the main drawback of
this algorithm is that the column dimensions of the weight matrices Ck̄ and
Dk̄ must both equal k̄, the number of components, so the number of mate-
rial response linear combinations will be the same as the number of material
predictor linear combinations. In consequence, we are only guaranteed in the
population to generate subsets of EΣX (B) and EΣY (B 0 ), and a material part
of the response or predictors must necessarily be missed, except perhaps in
special cases.
We use the example of Sections 3.1.2 and 3.3.2 to illustrate an important
limitation of the algorithm in Table 5.2. That example had p = 3 predictors
Empirical results 171
to get c1 = (1, 0, 0)T and d1 = (1, 0)T . The next step is to check the stop-
ping criterion, and for that we need to calculate QTc1 (ΣX ) ΣX,Y Qd1 (ΣY ) . The
projections in this quantity are
0 −4/3 0 !
0 0.6
Qc1 (ΣX ) = 0 1 0 and Qd1 (ΣY ) = .
0 1.0
0 0 1
so we stop with (c1 , d1 ), which implies that we need only cT1 X = x1 and
dT1 Y = y1 to characterize the regression of Y on X. However, we know from
Sections 3.1.2 and 3.3.2 that two linear combinations of the predictors are
required for the regression, and from Section 5.3 that two linear combinations
of the response are required. In consequence, the two-block algorithm missed
key response and predictor information, and for this reason it cannot be rec-
ommended for simultaneous reduction of the predictors and responses. The
simultaneous envelope methods discussed in Sections 5.2 and 5.3 are definitely
preferred.
squares (OLS), (2) the two block method (Section 5.4.3), (3) sample size per-
mitting, the envelope-based likelihood reduction method for simultaneously
reducing X and Y (XY -ENV, Section 5.2), (4) PLS for predictor reduction
only (X-PLS, Table 3.1), and (5) the newly proposed PLS method for reducing
both X and Y (XY -PLS, Section 5.3).
Our overall conclusions are that any of these five methods can give com-
petitive results depending on characteristics of the regression. However the
XY -ENV and XY -PLS methods were judged best overall, with XY -ENV
being preferred when n is sufficiently large. We judged XY -PLS to be the
best overall. Lacking detailed knowledge of the regression or if n is sufficiently
‘large’, we would likely choose XY -PLS for use in predictive applications.
5.5.1 Simulations
The performance of a method can depend on the response and predictor di-
mensions r and p, the dimensions of the response and predictor envelopes u
and q, the components ∆, ∆0 , Ω, and Ω0 of the covariance matrices ΣY |X and
ΣX given at (5.7), β = ΦηΓT and the distribution of C. Following Cook et al.
(2023b) we focused on the dimensions and the covariance matrices. Given all
dimensions, the elements of the parameters Φ, η, and Γ were generated using
independent uniform (0, 1) random variables. The distribution of C was taken
to be multivariate normal. This gives an edge to the XY -ENV method so we
felt comfortable thinking of it as the gold standard in large samples.
Following Cook and Zhang (2015b) we set
∆ = aIq , ∆0 = Ip−q
Ω = bIu , Ω0 = 10Ir−u ,
where a and b were selected constants. We know from our discussion at the
end of Section 2.5.1 and from previous studies (Cook and Zhang, 2015b; Cook,
2018) that the effectiveness of predictor envelopes tends to increase with a,
while the effectiveness of response envelopes increases as b decreases. Here
we are interested mainly in the effectiveness of the joint reduction methods
as they compare to each other and to the marginal reduction methods. We
compared methods based on the average root mean squared prediction error
Empirical results 173
where Ybij denotes a fitted value and Yij is an observation from an indepen-
dent testing sample of size m = 1000 generated with the same parameters as
the data for the fitted model. To focus on estimation, we also computed the
average root mean squared estimation error per coefficient
v
p uX r
1 Xu t (βij − βbij )2 .
betaRMSE =
p i=1 j=1
The true dimensions were used for all the methods. Results are given in Ta-
bles 5.3–5.4. Some cells in these tables are missing data because a method ran
into rank issues and could not be implement cleanly.
two-block method does relatively better in this case because there is substan-
tial response reduction possible, while X-PLS does relatively worse because
it neglects response reduction. We again conclude that with sufficiently large
sample size, XY -ENV performs the best, while XY -PLS is serviceable overall.
Empirical results 175
TABLE 5.4
Error measures with a = 50, b = 0.01, p = 50, r = 4, q = 40, and u = 3.
The settings for Table 5.7 are the same as those for Table 5.6 except
b = 0.1. Now, XY -PLS does the best for a ≥ 50 and XY -ENV is the best for
a = 5, 0.5. As we have seen in other tables, the relative sizes of the material
and immaterial variation matters.
Empirical results 177
TABLE 5.6
Error measures with n = 57, b = 5, p = 50, r = 4, q = 40, and u = 3.
highly skewed to the right, the responses were transformed to the log scale
and then standardized to have unit variance. Dimensions were determined as
described in the concrete example. The prediction errors predRMSE were de-
termined using the leave-one-out method to compare with their results. The
results shown in Table 5.9 suggest that for these data there is little advantage
in compressing the responses although q = 3.
Our third illustration comes from a near-infrared spectroscopy study on
the composition of biscuit dough (Osborne et al., 1984). The original data
has n = 39 samples in a training dataset and 31 samples in a testing set.
The two sets were created on separate occasions and are not the result of a
random split of a larger dataset. Each sample consists of r = 4 responses,
the percentages of fat, sucrose, flour and water, and spectral readings at 700
wavelengths. Cook and Zhang (2015b) used a subset of these data to illustrate
application of XY -ENV. They constructed the data subset by reducing the
spectral range to restrict the number of predictors to 20 from a potential of
700. This allowed them to avoid ‘n < p’ issues and again dimension was chosen
by cross-validation. In a separate study, Li, Cook, and Tsai (2007) reasoned
that the leading and trailing wavelengths contain little information, which
motivated them to use middle wavelengths, ending with p = 64 predictors.
Using the subset of wavelengths constructed by Li et al. (2007), Cook et al.
(2023b) applied the four methods that do not require n > p, which gave the
results in the last row of Table 5.9. We see that XY -PLS again performed
the best. Since its performance was better than that of X-PLS, we see an
advantage to compressing the responses.
6
Y = α + βT X + ε
= α + β1T X1 + β2T X2 + ε, (6.1)
where Y ∈ Rr , ! !
X1 β1
X= , β= ,
X2 β2
and the errors have mean 0 variance-covariance matrix ΣY |X and are
independent of the predictors. We assume throughout this chapter that
p2 min(n, p1 ). There are several contexts in which partial reduction may
be useful, in addition to the kinds of application emphasized in the reac-
tion example. For instance, there may be a linear combination GTβ of the
components of β that is of special interest. In which case reducing the dimen-
sion while shielding GTβ from the reduction may improve its estimation (see
Section 6.2.4).
FIGURE 6.1
Plot of lean body mass versus the first partial SIR predictor based on data
from the Australian Institute of Sport. circles: males; exes: females.
Partial predictor envelopes 185
Shao, Cook, and Weisberg (2009) adapted sliced average variance esti-
mation for estimating the partial central subspace. Wen and Cook (2007) ex-
tended the minimum discrepancy approach to SDR developed by Cook and Ni
(2005) to allow the partial central subspace to be estimated. In this chapter we
describe PLS and maximum likelihood methods for compressing X1 while leav-
ing X2 unchanged in the context of linear models. Park, Su, and Chung (2022)
recently applied partial PLS methods to the study of cytokine-based biomarker
analysis for COVID-19. We will link to their work as this chapter progresses.
Run PLS: Run the dimension reduction arm of a PLS algorithm with di-
mension q1 , response R
bY |2 i and predictor vector R
b1|2 i , and get the output
c ∈ Rp1 ×q1 of weights.
matrix W
This algorithm can be used in conjunction with cross validation and a holdout
sample to select the number of components and estimate the prediction error
of the final estimated model. It is also possible to use maximum likelihood to
estimate a partial reduction, as described in Section 6.2.3.
186 Partial PLS and Partial Envelopes
6.2.2 Derivation
Our pursuit of methods for partial predictor reduction proceeds via partial
predictor envelopes which are introduced by adding a second conditional inde-
pendence condition to (6.2) and then following the reasoning that led to (2.2):
The rationale here is the same as that leading to (2.2): we use condition (b)
to induce a measure of clarity in the separation of X1 into its material and
immaterial parts. Conditions (a) and (b) hold if and only if (Cook, 1998,
Proposition 4.6)
(c) (Y, PS X1 ) QS X1 | X2 . (6.4)
A1 (RY |2 , R1|2 ) X2 .
This condition plays a central role in dimension reduction generally. For in-
stance, (Cook, 1998, Section 13.3) used it in the development of Global Net
Effects Plots. With assumption A1, conditions (6.6) reduce immediately to
Similar constructions involving just the central subspace were given by (Cook,
1998, Chapter 7).
188 Partial PLS and Partial Envelopes
These conditions are identical to those given in (2.3) with Y and X replaced
by RY |2 and R1|2 . Thus we can learn how to compress X1 by applying PLS to
the regression of RY |2 on R1|2 , except that these variables are not observed and
must be estimated. Recall we are requiring p2 to be small relative to n and p1 :
A2 p2 min(n, p1 ).
R
bY |2 i = Yi − Ȳ − BYT |2 (X2i − X̄2 )
T
R
b1|2 i = X1i − X̄1 − B1|2 (X2i − X̄2 ).
−1
S1|2 = S11 − S12 S22 S21
SR1|2 ,RY |2 = SR1|2 ,Y .
Writing
n
X
SR1|2 ,Y = n−1 T
[X1i − X̄1 − B1|2 (X2i − X̄2 )]YiT
i=1
n
X
−1
= n−1 [X1i − X̄1 − S1|2 S22 (X2i − X̄2 )]YiT
i=1
−1
= S1,Y − S1,2 S22 S2,Y ,
we have
−1
SR1|2 ,RY |2 = S1,Y − S1,2 S22 S2,Y .
This then leads directly to the algorithm given by the synopsis of Section 6.2.1.
Since the sample covariance matrix between R b1|2 and RbY |2 is the same as that
between R1|2 and Y , the algorithm could be run with Y in place of R
b bY |2 . In
this case the dimension reduction arms of the SIMPLS and NIPALS algo-
T
rithms are instances of Algorithms S and N with A = SR1|2 ,RY |2 SR 1|2 ,RY |2
and
M = S1|2 . This version of Algorithm S was used by Park et al. (2022) as a
partial PLS algorithm.
Partial predictor envelopes 189
Y = α + β1T (X1 − X
b1|2 ) + β T X
1
b1|2 + β T X2 + ε
2
= α∗ + β1T R
b1|2 + β2∗T X2 + ε,
where β2∗ = B1|2 β1 and constants have been absorbed by α∗ . Since R b1|2 and
X2 are uncorrelated in the sample, β1 can be obtained from the regression of
b
Y on Rb1|2 and βb∗ from the regression of Y on X2 , which leads back to the al-
2
gorithm of Section 6.2.1. The OLS estimator B1 of β1 can then be represented
as the coefficients from the OLS fit of R1|2 i on Yi :
−1 −1
B1 = SR 1|2
SR1|2 ,Y = S1|2 SR1|2 ,Y . (6.8)
Since the predictors are not ancillary in this treatment, the likelihood
should be based on the joint distribution of (Y, X1 , X2 ). Without loss of gen-
erality we assume that all variables are centered and so have marginal means
of zero. We assume also that in model 6.1
X2 ∼ N (0, Σ2 )
T
X1 | X2 ∼ N (β1|2 X2 , Σ1|2 )
Y | (X1 , X2 ) ∼ N (β1T X1 + β2T X2 , σY2 |X ).
β1 = Φη
Σ1|2 = Φ∆ΦT + Φ0 ∆0 ΦT0 .
With this structure we can write our normal model in terms of the envelope
basis,
X2 ∼ N (0, Σ2 )
T
X1 | X2 ∼ N (β1|2 X2 , Φ∆ΦT + Φ0 ∆0 ΦT0 )
Y | (X1 , X2 ) ∼ N (η T ΦT X1 + β2T X2 , σY2 |X ).
Details of how to maximize this log likelihood function are available in Ap-
pendix A.6.2. Here we report only the final estimators.
After maximizing the log likelihood over all parameters except Φ we find
that the MLE of the partial envelope is
n o
(X ) −1
EbΣ1|22 (B1 ) = span arg min log GT S1|Y,2 G + log GT S1|2 G , (6.9)
G
parameters are
ηb = (Φ b −1 Φ
b T S1|2 Φ) b T SR ,Y
1|2
βb1 = Φb
bη
= PΦ(S
b 1|2 ) B1
where B1 was defined at (6.8). In particular, we see that the MLE of β1 is the
projection of B1 onto span(Φ)
b in the S1|2 inner product. The MLE’s of the
scale parameters are
Σ
b2 = S2 ,
∆
b b T S1|2 Φ,
= Φ b
∆
b0 b T0 S1|2 Φ
= Φ b 0,
Σ
b 1|2 = PΦ b + QΦ
b S1|2 PΦ b S1|2 QΦ
b
bY2 |X
σ T
= SY |2 − SR 1|2 ,Y
Φ[
bΦ b −1 Φ
b T S1|2 Φ] b T SR ,Y .
1|2
Our envelope estimator (6.9) is the same as that found by Park, Su, and
Chung (2022, eq. (11)), who also gave asymptotic distributions of βb1 and
βb1 in their Proposition 2. Park et al. (2022) named the subsequent estimators
envelope-based partial least squares estimators (EPPLS). We see this as a mis-
nomer. Historically and throughout this book, partial least squares has been
associated with specific algorithms, which we have generalized to Algorithms N
and S. Only relatively recently was it found that the dimension reduction arm
of PLS estimates an envelope (Cook, Forzani, and Rothman, 2013). We think
communication will be clearer by continuing the historical practice of linking
PLS with a specific class of algorithms as described in Chapter 3, particularly
since the algorithms are serviceable in n < p regressions, while EPPLS is not.
When n < p1 the covariance matrices S1|Y,2 and S1|2 in (6.9) are singular.
Park et al. (2022) suggested to then replace these matrices with their sparse
permutation invariant covariance estimators (SPICE, Rothman, Bickel, Lev-
ina, and Zhu, 2008), but this option does not seem to have been studied in
detail.
192 Partial PLS and Partial Envelopes
β TX = β T QG X + β T PG X
= β T G0 GT0 X + β T G GT X/kGk2
= φT1 Z1 + φ2 Z2 ,
Z1 = GT0 X are the observable transformed predictors that go with the nui-
sance parameters. The model in terms of the new parameters and predictors is
Y = α + φT1 Z1 + φ2 Z2 + ε. (6.10)
where φb1 can come from either the partial PLS fit or the MLE of the partial
(Z2 )
envelope, Evar(Z 1 |Z2 )
(span(φ1 )), where
Z1 = GT0 X
Z2 = GT X/kGk2
SZ2 = GT SX G/kGk4
SZ2 ,Y = GT SX,Y /kGk2
SZ2 ,Z1 = GT SX G0 /kGk2
φb2 = kGk4 (GT SX G)−1 {GT SX,Y /kGk2 − GT SX G0 /kGk2 φb1 }
= kGk2 (GT SX G)−1 GT {SX,Y − SX G0 φb1 }.
Partial predictor envelopes 193
6.3.2 Objectives
Given the empirical evidence of governance as a driver of per capita GDP,
here we analyze the power of governance to predict the economic growth.
Therefore, the objective is the prediction of the per capita GDP using
Partial predictor envelopes in economic growth prediction 195
The World Bank considers six aggregate WGI’s that combine the views of a
large number of enterprise, citizen and expert survey respondents: control of
196
TABLE 6.1
Leave-one-out cross validation MSE for GPD growth prediction.
• LASSO: Regression with variable selection using LASSO with the penaliza-
tion parameter selected by cross validation. This is intended to represent
penalization methods.
where Ŷ−i is the prediction of the i-th response vector based on the data with-
out that vector. The second form of predMSE expresses the same quantity in
terms of the individual responses Yij .
Several observations are apparent.
• The full and PPENV methods cannot handle relatively small sample sizes,
as expected, while the PLS and PCA methods are apparently serviceable
in such cases.
• As expected, there is appreciable gain over the FULL method when the
sample size is relatively small.
• The PLS and envelope methods are close competitors, although the PLS
methods are serviceable when n < p, while the envelope methods are not.
• There is little reason to recommend the PCA and LASSO methods, al-
though these methods may be in the running on occasion.
To round out the discussion, Figure 6.2 shows plots of the response Y ver-
sus the leave-one-out predicted values from the partial PLS fit with q1 = 7 and
the lasso for the data in the first row of Table 6.1 comprising 12 South Amer-
ican countries, 2003–2018, n = 161. The visual impression seems to confirm
the MSE’s shown in Table 6.1.
Partial response reduction 199
10 10
Y
Y
0 0
-10 -10
0 5 0 5 10
Leave-one-out fitted values from PPLS-7 Leave-one-out fitted values from LASSO
FIGURE 6.2
Economic growth prediction for 12 South American countries, 2003–2018, n =
161: Plot of the response Y versus the leave-one-out fitted values from (a) the
partial PLS fit with q1 = 7 and (b) the lasso.
Run PLS: Run the dimension reduction arm of a PLS algorithm with re-
sponse vector Rb1|2 i and predictor vector R
bY |2 i , i = 1, . . . , n, and get the
r×u1
output matrix W ∈ R
c of weights. The projection onto span(W c ) is a
√
n-consistent estimator of EΣY |X (span(β1T )), the smallest reducing sub-
space of the conditional covariance matrix ΣY |X that contains span(β T ).
c T Y = a + ηX1 + γX2 + e.
W
If the sample size permits, use OLS; otherwise use PLS to reduce the
predictors. Let ηb denote the estimated coefficients of X1 . Then the PLS
estimator of β1T is W
c ηb.
c ηbX1 + βbTX2 .
Predict: The rule for predicting Y is then Yb = Ȳ + W 2
This algorithm can be used in conjunction with cross validation and a holdout
sample to select the number of components and estimate the prediction error
of the final estimated model. It is also possible to use maximum likelihood to
estimate a partial reduction, as described in Section 6.4.3.
6.4.2 Derivation
To compress the response vector using PLS for the purpose of estimating β1
in model (6.1), we strive to project Y onto the smallest subspace E or Rr that
satisfies
(a) QE Y X1 | X2 and (b) PE Y QE Y | (X1 , X2 ). (6.13)
Comparing this to condition (6.4) we see that the two conditions are the same,
except the roles of X1 and Y are interchanged. In consequence, we can achieve
partial dimension reduction of the response vector by following the PLS al-
gorithm of Section 6.2.1, but interchanging X1 and Y and using a different
fitting scheme. This then is how we arrived at the algorithm of Section 6.4.1.
predictor reduction (6.9) we see that one can be obtained from the other by
interchanging the roles of Y and X1 , although X1 need not be stochastic for
(6.16). This is in line with our justification around (6.14) for the PLS com-
pression algorithm described in Section 6.4.
Following determination of EbΣY |X (B10 ), the maximum likelihood estimators
of the remaining parameters are as follows. The maximum likelihood estima-
tor βb1 of β1 is obtained from the projection onto EbΣY |X (B10 ) of the maximum
likelihood estimator B1T of β1T from model (6.1),
where PEb1 denotes the projection operator for EbΣY |X (B10 ). The maximum like-
lihood estimator βb2 of β2 is the coefficient matrix from the ordinary least
squares fit of the residuals Y − Ȳ − βb1T X1 on X2 . If X1 and X2 are orthogonal
then βb2 reduces to the maximum likelihood estimator of β2 from the standard
model. Let Γ b be a semi-orthogonal basis matrix for EbΣ (B 0 ) and let (Γ, b Γ
b0 )
Y |X 1
be an orthogonal matrix. The maximum likelihood estimator Σ b Y |X of ΣY |X
is then
Σ
b Y |X = PEb1 SRY |2 |R1|2 PEb1 + QEb1 SRY |2 QEb1 = Γ
bΩ bT + Γ
bΓ b0 Ω b T0
b 0Γ
Ω
b b T SR |R Γ
= Γ b
Y |2 1|2
Ω
b0 b T0 SR Γ
= Γ b .
Y |2 0
The asymptotic distributions of βb1 and βb2 were derived and reported by Su
and Cook (2011). Here we give only their result for βb1 since that will typically
be of primary interest in applications. See also Cook (2018, Chapter 3).
As we discussed for response reduction in Section 2.5.3, it is appropriate to
treat the predictors as non-stochastic and ancillary. Accordingly, Su and Cook
(2011) treated the predictors as non-stochastic when deriving the asymptotic
distribution of βb1 . Recall from Section 2.5.3 that when X is non-stochastic we
define ΣX as the limit of SX as n → ∞. Partition ΣX = (Σi,j ) according to
the partitioning of X (i, j = 1, 2) and let
The matrix Σ1|2 is constructed in the same way as the covariance matrix
for the conditional distribution of X1 | X2 when X is normally distributed,
although here X is fixed. Define
U1|2 = Ω−1 T −1
0 ⊗ ηΣ1|2 η + Ω0 ⊗ Ω + Ω0 ⊗ Ω
−1
− 2Iu1 (r−u1 ) .
Partial response reduction 203
Proposition 6.1. Under the partial envelope model (6.1) with non-stochastic
√
predictors, n{vec(βb1 ) − vec(β1 )} converges in distribution to a normal ran-
dom vector with mean 0 and covariance matrix
n√ o
†
avar nvec(βb1 ) = ΓΩΓT ⊗ Σ−1 T T
1|2 + (Γ0 ⊗ η )U1|2 (Γ0 ⊗ η),
Yi = U α0 + U αXi + εi , i = 1, . . . , n, (6.17)
where
√
avar( nvec(βbcm )) = Σ−1 T
X ⊗ U ΣD|S U ,
where αbcm and βbcm are the MLEs of α and β in model (6.18).
Model (6.20) is in the form of a partitioned multivariate linear model (6.1)
and so it is amenable to compression of the response vector for the purpose
of estimating α. Let A = span(α) and parameterize (6.20) in terms of a semi-
orthogonal basis matrix Γ ∈ Rk×u for EΣD|S (A), the ΣD|S -envelope of A. Let
Partial response reduction 205
and the envelope estimator described in Section 6.4.3. For comparison, we also
studied partial predictor reduction using the model of Section 6.2.3 as the ba-
sis for the PLS algorithm of Section 6.2.1 and the likelihood-based partial
envelope estimation of Section 6.2.3. The comparison criterion is prediction
of the response vector using leave-one-out cross validation to form the mean
prediction error predRMSE as defined in Section 5.5.1.
The results shown in Table 6.2 indicate that partial response reduction
is somewhat better than partial predictor reduction and that both reduction
types do noticeably better than OLS. The two components indicated in the
partial response applications indicate that the basis matrix Γ in model (6.1) is
6 × 2. Consequently, ΓT Y is the part of the response vector that is influenced
by the changes in the predictors, while ΓT0 Y is unaffected by changes in the
predictors.
7
Definition 7.1. If
φ(X) = φS (X) (7.1)
In this way the central discriminant subspace captures all of the classifi-
cation information that X has about class C and thus has the potential to
reduce the dimension of X without loss of information on φ(X). For exam-
ple, suppose that there are three classes, χ = {0, 1, 2}, two normal features
X = (X1 , X2 )T and that Pr(C = k | X1 , X2 ) = Pr(C = k | X1 + X2 ). Then
only the sum of the two predictors is relevant for classification and (7.1) holds
with S = span((1, 1)T ). This reflects the kind of setting we have in mind when
envelopes are employed a bit later.
To illustrate the potential importance of discriminant subspaces, con-
sider a stylized problem adapted from Cook and Yin (2001), still using three
classes and two normal features. Define the conditional distribution of C | X
Envelope discriminant subspace 209
FIGURE 7.1
Illustration of the importance of the central discriminant subspace.
where 0 < ω < 1. These conditional distributions are depicted in Figure 7.1.
The central subspace SC|X = R2 and no linear dimension reduction is possi-
ble. However, dimension reduction is possible for φ(X). Although we would
not normally expect applications to be so intricate, the example does illustrate
the potential relevance of discriminant subspaces.
If ω > 1/2 then φ(X) = 0 for all X, DC|X = span((0, 0)T ) is a discriminant
subspace and consequently the two features provide no information to aid in
classification. In this case we are able to discard X completely. If ω < 1/2
then φ(X) depends non-trivially on both predictors, DC|X = R2 and so no
dimension reduction is possible.
210 Linear Discriminant Analysis
To bring envelopes into the discussion, recall that conditions (2.3), restated
here with Y replaced by C for ease of reference
are operational versions of the more general envelope conditions (2.2) for re-
ducing the dimension of X in the regression of Y on X. Condition (7.2a)
requires that PS X capture all of ways in which X can inform on the distribu-
tion of C. Dimension reduction for discriminant analysis is different since we
are not interested in capturing all aspects of X that provide information about
C. Rather we pursue only the part of X that affects φ(X). Adapting to this
distinction, Zhang and Mai (2019) replaced condition (7.2a) with requirement
(7.1), leading to consideration of subspaces S ⊆ Rp with the properties
The rationale for including (7.3b) is similar to that discussed in Section 2.1:
methodology based on condition (7.3a) alone may not be effective when p > n
or the features are highly collinear because then it is hard in application to
distinguish the material part PS X of X that is required for φ(X) from the
complementary part QS X that is immaterial. The role of condition (7.3b) is
then to induce a measure of clarity in the separation of X into parts that
are material and immaterial to φ(X). Zhang and Mai (2019) formalized the
notion of the smallest subspace that satisfies (7.3) as follows:
Definition 7.2. If the intersection of all subspaces that satisfy (7.3) is it-
self a subspace that satisfies (7.3), then it is called the envelope discriminant
subspace.
there are two classes, this Bayes interpretation may not play a useful role.
PLS discriminant analysis (e.g. Brereton and Lloyd, 2014) arises by modeling
X | (C = k) as a conditional normal, leading to methodology closely associ-
ated with Fisherian LDA. While this may seem simple relative to the range
of methods available, Hand (2006) argued that simple methods can outper-
form more intricate methods that do not adapt well to changing circumstances
between classifier development and application.
We next turn to LDA where we bring in PLS methods. Quadratic discrim-
inant analysis is considered in Chapter 8.
Xi = µ0 + β T Yi + εi , i = 1, . . . , n, (7.8)
where the errors are independent copies of ε ∼ N (0, ΣX|C ). Maximum likeli-
hood estimators of the parameters in this model were reviewed in Section 1.2.2.
Since the roles of X and Y are different in (7.8) we restate the estimators for
ease of reference:
These estimators are then substituted into φlda (X) to estimate the class with
the maximum posterior probability, a process that characterizes classical LDA.
However, this estimation procedure can be inefficient if β has less than full
row rank or if only a part of X is informative about C. In the extreme, if it
were known that rank(β) = 1 then it may be possible to improve the classical
method considerably. Incorporating the central discriminant subspace allows
for the possibility that β is rank deficient, and envelopes were designed to deal
the possibility that only a portion of X informs on C.
DC|X = span(Σ−1 T −1 0
X|C β ) = ΣX|C B , (7.9)
Xi = µ0 + BbYi + εi , i = 1, . . . , n. (7.10)
If d = r then β T has full column rank and this model reduces to model (7.8).
If d < r this becomes an instance of the general model for PFC (Cook, 2007;
Cook and Forzani, 2008b), which is an extension to regression of the Tipping-
Bishop model that yields probabilistic principal components (Tipping and
Bishop, 1999). The general PFC model allows the Y -vector to be any user-
specified function of a response, but in discriminant analysis Y is properly
an indicator vector as defined previously. A key characteristic of model (7.10)
is summarized in the following proposition (Cook (2007, Prop 6); Cook and
Forzani (2008b, Thm 2.1)). In preparation, define a p × d basis matrix for the
central discriminant subspace as
Φ = Σ−1
X|C B ∈ R
p×d
.
X | (Y, ΦT X) ∼ X | ΦT X.
= 0.
Since ε is normally distributed, it follows that ΦTX B0T X | Y . This plus the
previous conclusion B0T X Y implies that B0T X (ΦTX, Y ) and thus that
B0T X Y | ΦTX (Cook, 1998, Proposition 4.6). Since (B0 , Φ)T X is a full rank
linear transformation of X, this last conclusion implies that X Y | ΦTX.
The desired conclusion – X | (Y, ΦT X) ∼ X | ΦT X – follows.
214 Linear Discriminant Analysis
+ (PB(Σ−1 ) βk )T Σ−1
X|C PB(Σ
−1
) (X − (βk + 2µ0 )/2)},
X|C X|C
B
b = (φb1 , . . . , φbd )
b T XT Y(YT Y)−1
bb = B
µ
b0 = X̄ − B
bbbȲ
These estimators, which require n > r and n > d but not n > p, can now
we substituted into φlda (B TX) for a sample classifier. If classes have equal
prior probabilities, πk = πj , then the classification rule simplifies a bit and an
estimator of σ 2 is no longer necessary:
Xi = µ0 + ΓηYi + εi , i = 1, . . . , n, (7.13)
T
ΣX|C = ΓΩΓ + Γ0 Ω0 ΓT0 .
This response envelope model (See Sections 2.4 and 2.5.3) was anticipated by
Cook (2007), who proposed it as an extension (EPFC) of PFC model (7.10),
Principal fitted components 217
but without the key understanding that can derive from the envelope struc-
ture. It was subsequently proposed specifically for discrimination by Zhang
and Mai (2019), who called it the envelope discriminate subspace (ENDS)
model. Based on model (7.13) we determine class membership by using (7.7)
in combination with the multivariate model for the reduced features
φlda (ΓTX) = arg max log(πk /π0 ) + ηkT Ω−1 (ΓTX − (ηk + 2ΓTµ0 )/2) .
k∈χ
(7.15)
−1
log |GT SX|Y G| + log |GT SX
Γ
b = arg min G| (7.16)
G
b T SX,Y S −1 = (b
ηb = Γ η1 , . . . , ηbr )
Y
b η = Pb SX,Y S −1 ,
Γb Γ Y
µ
b0 = X̄ − Γb
b η Ȳ
Ω
b b T SX|Y Γ,
= Γ b
Ω
b0 b T SX Γ
= Γ b0 ,
0
Σ
b X|C = Γ
bΩ bT + Γ
bΓ b0 Ω b T0 ,
b 0Γ
L
bu = −(nr/2) log(2π) − nr/2 − (n/2) log |SX |
b T SX|Y Γ|
−(n/2) log |Γ b T S −1 Γ|.
b − (n/2) log |Γ b
X
218 Linear Discriminant Analysis
The estimators Γ,
b ηbk , µ
b0 and Ω
b are now substituted into (7.15) to determine
the class with the maximum estimated probability.
Recall from the discussion of Chapter 3 that NIPALS and SIMPLS estimate
the predictor envelope EΣX(B) in the regression of Y on X. However, we also
know from the discussions of Sections 2.4 and 3.11 that this envelope is the
same as the response envelope in the regression of X on Y , which is exactly
what appears in model (7.13). In other words, beginning with classification
indicator vectors Yi and corresponding features Xi we can use either NIPALS
or SIMPLS weight matrices as estimates of a basis Γ b pls for EΣ (B), which is
X
then used in place of Γ b from (7.16) to construct the remaining estimators for
substitution into (7.15). The adaptation of PLS algorithms to discrimination
problems is then seen to be straightforward, the methodology for constructing
Γ
b pls being covered by the general discussions in previous chapters. In partic-
ular, the PLS algorithms are not hindered by the requirement that n > p
and their asymptotic behavior in high dimensions is governed by the results
summarized in Chapter 4.
To be clear, a procedure for computing the PLS discrimination function is
outlined as follows.
µ
b0,pls = X̄ − Γ
b pls ηbpls Ȳ
Ω
b pls b Tpls SX|Y Γ
= Γ b pls .
3. Substitute these estimators into (7.15) to get the estimated PLS discrim-
inant function:
µ
b0 = X̄ − βbT Ȳ (7.17)
1/2 (d) −1/2
βbT = (βb1 , . . . , βbr ) = SX CX,Y SY (7.18)
220 Linear Discriminant Analysis
Σ
b X|C = SX − βbT SY,X
1/2 (d) (d) 1/2
= SX (Ir − CX,Y CY,X )SX . (7.19)
General estimation methods like that based on PFC-RR model (7.10) and
the corresponding PFC discriminant function require that n p to insure
that Σ−1
X|C can be well estimated. The same is true of the envelope discrimina-
tion discussed in Section 7.3.2. But PLS discrimination discussed at the end of
Section 7.3.2 and PFC discrimination with isotropic errors may be serviceable
without requiring n p. Of these two, PLS fitting is surely the more versatile
since it does not require isotropic errors.
There are also notable differences between the ways in which gains are
produced by envelope and by PFC-RR regressions. The gain from PFC-RR
regression results primarily from the reduction in the number of real param-
eters need to specify β (Cook, 2018, Sec. 9.2.1). On the other hand, the gain
from a response envelope is due to the reduction in the number of parameters
and to the structure of ΣX|C = ΓΩΓT + Γ0 Ω0 ΓT0 , with massive gains possible
depending on the relationship between kΩk and kΩ0 k.
These contrasts lead to the conclusion that envelope and PFC-RR regres-
sions (7.10) are distinctly different methods of dimension reduction with differ-
ent operating characteristics. Reasoning in the equivalent context of reduced-
rank regression, Cook et al. (2015) combined PFC-RR and response envelopes,
leading to a new dimension reduction paradigm that can automatically choose
the better of the two methods and, if appropriate, can also give an estimator
that does better than both of them.
When formulating envelope model (7.13) no explicit accommodation was
included for the dimension d of B_0. The PFC-RR envelope model includes such an accommodation by starting with model (7.10) and then incorporating a semi-orthogonal basis matrix Γ ∈ R^{p×u} for E_{Σ_{X|C}}(span(B)):
\[
X_i = \mu_0 + \Gamma\eta b\, Y_i + \varepsilon_i, \quad i = 1, \ldots, n, \qquad
\Sigma_{X|C} = \Gamma\Omega\Gamma^T + \Gamma_0\Omega_0\Gamma_0^T, \tag{7.20}
\]
where β^T = Γηb, η ∈ R^{u×d}, b ∈ R^{d×r} as defined for (7.10) and the remaining parameters are as defined in (7.13). This model contains two tuning dimensions, u and d, that need to be determined subject to the constraints
0 ≤ d ≤ u ≤ min(p, r). Maximum likelihood estimators of the parameters in
this model as well as suggestions for determining the tuning parameters are
available from Cook et al. (2015). These estimators can then be substituted
into classification function (7.7).
First, neglect d and estimate the envelope basis Γ ∈ Rp×u using either
Second, substitute Γ
b into model (7.14) and then use that as the basis for a
PFC-RR regression. This corresponds to fitting the working RR model
\[
\hat{\Gamma}^T X_i = \alpha_0 + G b\, Y_i + e_i, \quad i = 1, \ldots, n, \qquad
\Sigma_{\hat{\Gamma}^T X \mid C} = \Omega,
\]
7.7 Illustrations
The likelihood-based classification methods listed as items 2, 4, and 6 in Sec-
tion 7.6 will eventually dominate for a sufficiently large sample, since the
methods inherit optimality properties from general likelihood theory. Our pri-
mary goal for this section is to provide some intuition into the methods that
do not require a large sample size, principally isotropic PFC and PLS, meth-
ods 3, 5 and 7. We choose two data sets – Coffee data and Olive Oil data –
from the literature because these data sets have been used in studies to com-
pare classification methods for small samples. That enabled us to compare our
methods with other methods without the need to implement them.
7.7.1 Coffee
Zheng, Fu, and Ying (2014), using data from Downey, Briandet, Wilson, and
Kemsley (1997), compared five discrimination methods on their ability to
predict one of two coffee varieties, Arabica and Robusta. The data consist of
29 Arabica and 27 Robusta samples with corresponding Fourier transform in-
frared spectral features obtained by sampling at p = 286 wavelengths. The five
methods they compared, including PLS discriminant analysis, are named in
Table 7.1. Detailed descriptions of the methods are available from their article.
The last five entries in the second row of Table 7.1 give the rates of correct
classification, estimated by using leave-one-out cross validation, from Zheng
et al. (2014, Table 1). The second and third entries in the second row are the
rates based on leave-one-out cross validation that we observed by applying
methods 3 (ISO) and 5 (PLS) listed in Section 7.6. Using 10-fold cross val-
idation, we chose 3 components for PLS and 4 components for classification
via isotropic PFC. For the fourth entry, PLS+PFC, we first applied PLS and
then used PFC to further reduce the compressed PLS features. This resulted
in 3 PLS components and one PFC component. The linear classification rule
was then based on the PLS+PFC compressed feature. Our implementation of
PLS did better than three of the five methods used by Zheng et al. (2014) and
did the same as two of their methods at 100% accuracy.
These data are sensitive to the particular partition used to conduct the 10-fold
cross validation. Depending on the seed used to start the pseudo-random sampling,
TABLE 7.1
Olive oil and Coffee data: Estimates of the correct classification rates (%) from
leave one out cross validation.
ISO and PLS refer to methods 3 and 5 as listed in Section 7.6. The remaining
designations are those used by Zheng et al. (2014). KNN: k-nearest neighbor.
LS-SVM: least-squares support vector machine. PLS-DA: partial least-squares
discriminant analysis. BP-ANN: back propagation artificial neural network.
ELM: extreme learning machine.
FIGURE 7.2
Coffee data: Plot of estimated classification rate (accuracy) versus the number of components.
FIGURE 7.3
Coffee data: Plots of PLS, PFC, and Isotropic projections.
their country of origin. There were 60 authenticated extra virgin olive oils
from four countries: 10 from Greece, 17 from Italy, 8 from Portugal, and 25
from Spain. The p = 570 features were obtained from Fourier transform in-
frared spectroscopy of each of the 60 samples. The analyses of these data
parallel those for the Coffee data in Section 7.7.1. The percentages of correct
classification are shown in the third row of Table 7.1 and the corresponding
graphics are shown in Figure 7.4. Our implementation of PLS gave notably
better results than the PLS-DA method of Zheng et al. (2014). We have no
explanation for the difference.
Our implementation of 10-fold cross validation gave 28 components for
PLS and 28 components for ISO. Marked plots of the first two PLS and ISO
components are shown in panels a and c of Figure 7.4. The separation is not
very clear, perhaps signaling the need for more components. Application of
PLS+PFC resulted in 17 PLS components and 2 PFC components based on
the PLS components. A marked plot of the two PFC components is shown
in Figure 7.4b where we observe perfect separation of the four classes. One
advantage of the PLS+PFC method is its ability to allow informative low
dimensional plots, as illustrated here.
8
Quadratic Discriminant Analysis
Quadratic discriminant analysis (QDA) proceeds under the same model (7.5) as linear discriminant analysis, except that now the conditional covariance matrices are no longer assumed to be the same, so we may have Σ_{X|C=k} ≠ Σ_{X|C=j} for k ≠ j. Let Σ_k = Σ_{X|(C=k)}. The nominal stochastic structure underlying QDA for predicting the class C ∈ χ based on a vector of continuous features X ∈ R^p is then

where

\[
a_k = \log(\pi_k) + (1/2)\,\mu_k^T \Sigma^{-1}\mu_k, \qquad \Sigma = \Sigma_{X|C}.
\]
With sufficient observations per class, φqda (X) can be estimated consis-
tently by substituting sample versions of Σk and µk , k = 0, . . . , r. However,
as with linear discriminant analysis, we strive to reduce the dimension of
the feature vectors without loss of information and thereby reduce the rate
of mis-classifications. It may be clear from the above forms that the feature vectors furnish classification information through the scaled mean deviations Σ^{-1}(μ_k − μ) and the precision deviations Σ_k^{-1} − Σ^{-1}. These deviations will play a key role when pursuing dimension reduction for quadratic discriminant analysis.
Beginning with a simple random sample (C_i, X_i), i = 1, . . . , n, let n_k denote the number of observations in class C = k, so the total sample size can be represented as $n = \sum_{k=0}^{r} n_k$. In the remainder of this section, we frequently take π_k = n_k/n for use in application, as will often be appropriate. This also facilitates presentation, particularly connections with other methods.
Cook and Forzani (2009, Thm 1 and Prop 1) proved the key result shown in Proposition 8.1. In preparation, define
\[
\beta_k = \mu_k - \mu, \ k \in \chi, \qquad
\beta = (\beta_0, \beta_1, \ldots, \beta_r)^T, \qquad
\mathcal{B}_0 = \mathrm{span}(\beta^T).
\]
This definition of β_k differs from that used for linear discriminant analysis (see just below (7.7)). Here μ is used for centering rather than μ_0. Recall also that, from Definition 2.1, a subspace S ⊆ R^p with the property C ⫫ X | P_S X is called a dimension reduction subspace for the regression of C on X and that the central subspace S_{C|X} is the intersection of all dimension reduction subspaces.
where
\[
\begin{aligned}
E(\Psi^T X \mid C = k) &= \Psi^T\mu_k \\
\mathrm{var}(\Psi^T X \mid C = k) &= \Psi^T\Sigma_k\Psi \\
E\{\mathrm{var}(\Psi^T X \mid C)\} &= \Psi^T\Sigma\Psi.
\end{aligned}
\]
Using these moments in conjunction with (8.4) gives the reduced classification function, where
\[
L_k(\Psi^T X) = X^T\Psi\left[\left\{(\Psi^T\Sigma_k\Psi)^{-1} - (\Psi^T\Sigma\Psi)^{-1}\right\}\Psi^T\mu_k + (\Psi^T\Sigma\Psi)^{-1}\Psi^T(\mu_k - \mu)\right],
\]
(i) L ⊆ S
(ii) Q ⊆ S.
In addition,
Conditions (i) and (ii) are restatements of conditions (i) and (ii) in Propo-
sition 8.1. The conclusion that SC|X = L + Q follows because L + Q is a
dimension reduction subspace that is contained in all dimension reduction
subspaces. The conclusion that DC|X = SC|X was demonstrated by Zhang
and Mai (2019).
Cook and Forzani (2009, Thm 2) showed that the maximum likelihood estimator Φ̂ of a basis Φ for S_{C|X} can be constructed as follows. Let S_X and S_k denote the sample versions of Σ_X, the marginal covariance matrix of X, and Σ_k, the covariance matrix of X restricted to class C = k. Also let $S = \sum_{k=0}^{r}(n_k/n)S_k$ denote the sample version of the average covariance matrix Σ. Then under model (8.1) with fixed dimension d = dim(S_{C|X}) and S_k > 0 for all k ∈ χ, the maximum likelihood estimator of a basis matrix Φ for S_{C|X} is
\[
\hat{\Phi} = \arg\max_H \ell_d(H),
\]
where
\[
a_k = \log(\pi_k) + (1/2)\log|\hat{\Phi}^T S_k\hat{\Phi}| + (1/2)\bar{X}_k^T\hat{\Phi}(\hat{\Phi}^T S_k\hat{\Phi})^{-1}\hat{\Phi}^T\bar{X}_k,
\]
where X̄_k is the average feature vector in class C = k and X̄ is the overall average. This reduced classification function may be reasonable when the intraclass sample sizes are relatively large, n_k ≫ p for all k ∈ χ, and intraclass collinearity is negligible to moderate.
We see from this that one role of the envelope is to separate Σ into a part ΓΩΓT
that is material to classification and a complementary part Γ0 Ω0 ΓT0 that is
immaterial. Since SC|X ⊆ EΣ (SC|X ), the envelope space EΣ (SC|X ) is a dimen-
sion reduction subspace for the regression of C on X and, from Corollary 8.1,
L ⊆ EΣ (SC|X ) and Q ⊆ EΣ (SC|X ). The following corollary to Proposition 8.1
describes key relationships that will guide the methodology in this section.
\[
\mathcal{B}_0 \subseteq \Sigma\,\mathcal{S}_{C|X} \subseteq \Sigma\,\mathcal{E}_{\Sigma}(\mathcal{S}_{C|X}).
\]
Conclusion (II) now follows by using the argument in the justification of con-
clusion (I).
Conclusion (III) follows immediately from Proposition 8.1(ii). 2
We next discuss two ways in which this structure can be used to esti-
mate the corresponding reduced classification function φqda (ΓTX). The first
is likelihood-based, requiring relatively large sample settings where n_k ≫ p
and so Sk > 0 for all k ∈ χ. The second is when nk is not large relative to p
for some k. This is where PLS comes into play since then the likelihood-based
classification function may be unserviceable or not sufficiently reliable.
\[
\begin{aligned}
\hat{\mu} &= \bar{X} \\
\hat{\mu}_k &= \bar{X} + P_{\hat{\Gamma}}(\bar{X}_k - \bar{X}) \\
\hat{\Sigma}_k &= P_{\hat{\Gamma}} S_k P_{\hat{\Gamma}} + Q_{\hat{\Gamma}} S_X Q_{\hat{\Gamma}} \\
\hat{\Sigma}_X &= P_{\hat{\Gamma}} S_X P_{\hat{\Gamma}} + Q_{\hat{\Gamma}} S_X Q_{\hat{\Gamma}},
\end{aligned}
\]
where $S = \sum_{k\in\chi}(n_k/n)S_k$. Substituting these estimators into (8.5) gives the classification function under the ENDS-QDA model of Zhang and Mai (2019):
\[
\phi_{\mathrm{qda}}(\hat{\Gamma}^T X) = \arg\max_{k\in\chi}\left\{ a_k - L_k(\hat{\Gamma}^T X) + Q_k(\hat{\Gamma}^T X) \right\} \tag{8.12}
\]
where
\[
a_k = \log(\pi_k) + (1/2)\log|\hat{\Gamma}^T S_k\hat{\Gamma}| + (1/2)\bar{X}_k^T\hat{\Gamma}(\hat{\Gamma}^T S_k\hat{\Gamma})^{-1}\hat{\Gamma}^T\bar{X}_k,
\]
where X̄_k is the average feature vector in class C = k and X̄ is the overall average. This reduced classification function may be advisable when the intraclass sample sizes are relatively large, n_k ≫ p for all k ∈ χ, and intraclass collinearity is high for some classes.
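As an illustration, a small Python sketch of the plug-in estimators displayed above is given next; Γ̂ is assumed to be a semi-orthogonal basis supplied by the user, and the function name is ours.

```python
import numpy as np

def ends_qda_plugin(X, y, Gamma):
    """Hedged sketch of the plug-in estimators displayed above: given a
    semi-orthogonal basis estimate Gamma (p x u), project the class means and
    covariances onto span(Gamma) and its orthogonal complement."""
    n, p = X.shape
    P = Gamma @ Gamma.T                        # projection onto span(Gamma)
    Q = np.eye(p) - P
    S_X = np.cov(X, rowvar=False, bias=True)   # marginal sample covariance
    Xbar = X.mean(0)
    mu_hat = Xbar
    Sigma_X_hat = P @ S_X @ P + Q @ S_X @ Q
    mu_k, Sigma_k = {}, {}
    for k in np.unique(y):
        Xk = X[y == k]
        S_k = np.cov(Xk, rowvar=False, bias=True)
        mu_k[k] = Xbar + P @ (Xk.mean(0) - Xbar)
        Sigma_k[k] = P @ S_k @ P + Q @ S_X @ Q
    return mu_hat, mu_k, Sigma_X_hat, Sigma_k
```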
Rounding out the discussion, Su and Cook (2013) developed envelope es-
timation of multivariate means from populations with different covariance
matrices. The basic model that they used to develop envelope estimation is
the same as (8.1), although they were interested only in estimation of popula-
tion means and not subsequent classification. They based their methodology
on the following definition of a generalized envelope.
Definition 8.1. Let M be a collection of real t×t symmetric matrices and let V ⊆ span(M) for all M ∈ M. The M-envelope of V, denoted E_M(V), is the intersection of all subspaces of R^t that contain V and reduce each member of M.
\[
\mathrm{span}(\Sigma - \Sigma_k) \subseteq \Sigma\,\mathcal{S}_{C|X}. \tag{8.13}
\]
The left-hand side of (8.13) is the span of the kernel function for sliced av-
erage variance estimation (SAVE) as proposed by Cook and Weisberg (1991)
and developed further by Cook (2000) and Shao, Cook, and Weisberg (2007,
2009). A comprehensive treatment is available from Li (2018, Ch. 5). Equality
follows from Cook and Forzani (2009, Discussion of Prop. 3).
While normality guarantees equality in (8.13), containment holds under
much weaker conditions. In particular, if var(X | PSC|X X) is a non-random
matrix and if the linearity condition as given in Definition 9.3 holds then the
containment represented in (8.13) is assured. See Li (2018, Ch. 5) for further
discussion.
Algorithms N and S can now be used in applications by setting
\[
M = S \quad \text{and} \quad A = \sum_{k\in\chi}(n_k/n)(S_X - S_k)^2, \tag{8.14}
\]
and using predictive cross validation to determine the dimension of the envelope.
The methodology implied by Corollary 8.3 implicitly treats the mean and
variance components – β T Πβ and Σ − Σk – of Corollary 8.2 equally. In some
applications, it may be useful to differentially weight these components, par-
ticularly if they contribute unequally to classification. Let 0 ≤ a ≤ 1 and combine the weighted components (1 − a)(Σ − Σ_k) and aβ^T πβ as
\[
\sum_{k\in\chi}\pi_k\{(1-a)(\Sigma - \Sigma_k) + a\beta^T\pi\beta\}^2
= (1-a)^2\sum_{k\in\chi}\pi_k(\Sigma - \Sigma_k)^2 + a^2(\beta^T\pi\beta)^2,
\]
where the equality holds because the cross product terms sum to zero. Without loss of generality, we can rescale the right-hand side to give
\[
\mathrm{span}\Big\{(1-\lambda)\sum_{k\in\chi}\pi_k(\Sigma - \Sigma_k)^2 + \lambda(\beta^T\pi\beta)^2\Big\} \subseteq \Sigma\,\mathcal{S}_{C|X},
\]
where λ = a²/{(1−a)² + a²}. This then implies that we use the sample version of Algorithm N or S with M = S and
\[
A_\lambda = (1-\lambda)\sum_{k\in\chi}(n_k/n)(S - S_k)^2 + \lambda(S_X - S)^2, \tag{8.15}
\]
which is closely related to the methods for studying covariance matrices proposed by Cook and Forzani (2008a).
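The following sketch assembles M = S together with the kernel matrices of (8.14) and (8.15) from a sample; the envelope Algorithms N and S themselves were described in earlier chapters and are not reproduced here, and the function name and interface are ours.

```python
import numpy as np

def algorithm_inputs(X, y, lam=0.5):
    """Hedged sketch: build M = S and the kernel matrices of (8.14) and (8.15).
    lam = 0 weights only the covariance deviations, lam = 1 only the mean
    (between-class) component."""
    S_X = np.cov(X, rowvar=False, bias=True)
    classes = np.unique(y)
    w = [np.mean(y == k) for k in classes]                           # n_k / n
    S_k = [np.cov(X[y == k], rowvar=False, bias=True) for k in classes]
    S = sum(wk * Sk for wk, Sk in zip(w, S_k))                       # average covariance
    A = sum(wk * (S_X - Sk) @ (S_X - Sk) for wk, Sk in zip(w, S_k))  # kernel of (8.14)
    A_lam = ((1 - lam) * sum(wk * (S - Sk) @ (S - Sk) for wk, Sk in zip(w, S_k))
             + lam * (S_X - S) @ (S_X - S))                          # kernel of (8.15)
    return S, A, A_lam
```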
Methods 3–5 all depend on the same classification function but differ in
the method of estimating the compressed predictors.
8.5 Illustrations
In this section we compare the PLS-type methods AN-Q1 and AN-Q2 to other
classification methods studied in the literature. We confine attention mostly
TABLE 8.1
Birds-planes-cars data: Results from the application of three classification
methods. Accuracy is the percent correct classification based on leave-one-out
cross validation.
to methods that require and thus benefit from having nonsingular class covari-
ance matrices Sk . We see this as a rather stringent test for AN-Q1 and AN-Q2,
which do not require the class covariance matrices Sk to be nonsingular.
FIGURE 8.1
Birds-planes-cars data: Plots of the first two projected features for AN-Q1 and AN-Q2. (a) AN-Q2; (b) AN-Q1.
accuracy rate of 92.3%. This agrees well with our estimate of 92.1% in Ta-
ble 8.1 and in the fourth column of Table 8.2. Due to this agreement, we feel
comfortable comparing our results based on Algorithm N with their results for
12 other methods. Of those 12 methods, their new method ENDS had the best
accuracy at 96.2%, which is less than our estimated accuracy rates for the two
methods based on Algorithm N. Zhang and Mai (2019, Figure 3) gave a plot of
the first two feature vectors compressed by using ENDS. Their plot is similar
to those in Figure 8.1 but the point separation does not appear as crisp.
All of the methods studied by Zhang and Mai (2019) require that the
intraclass covariance matrices Sk be nonsingular, except perhaps for their im-
plementation of SVM. However, being based on Algorithm N, methods AN-Q2
and AN-Q1 do not require the Sk ’s to be nonsingular. We take this as a gen-
eral indication of the effectiveness of the methods based on Algorithm N since
they are able to perform on par with or better than methods that rely on rel-
atively large sample sizes. A similar conclusion arises from the fruit example
of the next section.
TABLE 8.2
Birds-planes-cars data: Estimates of the correct classification rates (%) from
leave one out cross validation.
Classification rates for columns 5–11 were taken from Zhang and Mai (2019).
NB: Naive Bayes classifier (Hand and Yu, 2001). SVM: support vector ma-
chine. QDA: quadratic discriminant analysis with no reduction. ENDS: en-
velope discrimination. SIR: sliced inverse regression. SAVE: sliced average
variance estimation. DR: directional regression (Li and Wang, 2007).
8.5.2 Fruit
This dataset contains a collection of 983 infrared spectra collected from straw-
berry, 351 samples, and non-strawberry, 632 samples, fruit purees. The spec-
tral range was restricted to 554–11,123 nm, and each spectrum contained 235
variables. This then is a classification problem with p = 235 features for clas-
sifying fruit purees as strawberry or non-strawberry. It was used by Zheng,
Fu, and Ying (2014) to compare several classification methods.
Percentages of correct classifications are shown in Table 8.3. The numbers of compressed features, determined by 10-fold cross validation, for AN-Q1 and AN-Q2 were 14 and 18. The value of λ for AN-Q2 was estimated similarly to
be 1. The rates of correct classification shown in Table 8.3 were determined by
leave-one-out cross validation. Except for the relatively poor performances by
KNN and PLS-DA, there is little to distinguish between the methods. Plots of
the first two compressed feature vectors are shown in Figure 8.2, although the
results of Zheng et al. (2014) show that more than two compressed features
are needed for each method.
TABLE 8.3
Fruit data: Estimates of the correct classification rates (%) from leave one out
cross validation.
Classification rates for columns 4–8 were taken from Zheng et al. (2014). KNN:
k-nearest neighbor. LS-SVM: least-squares support vector machine. PLS-DA:
partial least-squares discriminant analysis. BP-ANN: back propagation artifi-
cial neural network. ELM: extreme learning machine.
FIGURE 8.2
Fruit data: Plots of the first two projected features for AN-Q1 and AN-Q2. (a) AN-Q2; (b) AN-Q1.
that the PLS methods discussed in Chapter 7 did well for these data when
compared to the methods of Zheng et al. (2014).
The coffee and olive oil datasets may not be large enough to give compelling
evidence that the intraclass covariance matrices are unequal and consequently
the PLS classification methods of Chapter 7 would be a natural first choice.
It could also be argued reasonably that, to err on the safe side, quadratic
methods should be tried as well. Shown in columns 2 and 3 of Table 8.4 are the
rates of correct classification from applying AN-Q1 and AN-Q2 to the coffee
and olive oil data. For comparison, we again listed the results from Zheng et al.
(2014). For the coffee data, these quadratic methods did as well as the best
method studied by Zheng et al. (2014) and, from Table 7.1, as well as the best
PLS method. Viewing the results for the olive oil data, the quadratic methods
did reasonably, but not as well as the PLS or PLS+PFC results in Table 7.1.
TABLE 8.4
Olive oil and Coffee data: Estimates of the correct classification rates (%)
from leave one out cross validation. Results in columns 4–8 are as described
in Table 7.1.
The results for the coffee and olive oil data support the notion that AN-Q1 and AN-Q2 may be serviceable methods in classification problems where the class covariance matrices are singular. Combining this with our conclusions from the birds-planes-cars and fruit datasets leads to the conclusion that AN-Q1 and AN-Q2 may be serviceable methods without much regard for the class sample sizes.
9
Non-linear PLS
Yi = E(Y | Xi ) + εi , i = 1, . . . , n, (9.1)
where the errors ε are independent and identically distributed random vectors
with mean 0 and variance ΣY |X . Model (9.1) is intended to be the same as
model (1.1) except for the possibility that the mean is a non-linear function of
X. If E(Y | X) = β0 +β TX then model (9.1) reduces to model (1.1). It is recog-
nized that predictions from PLS algorithms based on (1.1) are not serviceable
when the mean function E(Y | X) has significant non-linear characteristics
(e.g. Shan et al., 2014).
Following Cook and Forzani (2021), in this chapter we study the behav-
ior of the PLS regression algorithms under model (9.1), without necessarily
specifying a functional form for the mean. We restrict attention to univariate
responses (r = 1) starting in Section 9.5, but until that point the response
may be multivariate. Our discussion is based mainly on the NIPALS algorithm
(Table 3.1) although our conclusions apply equally to SIMPLS (Table 3.4) and
to Helland’s algorithm (Table 3.5).
9.1 Synopsis
We bring in two new ideas to facilitate our discussion of PLS algorithms
under non-linearity (9.1). The first is a construction – the central mean sub-
space (CMS) – that is used to characterize the mean function E(Y | X) in a
way that is compatible with envelopes (Section 9.2). The second is a linearity
condition that is used to constrain the marginal distribution of the predictors
(Section 9.3). This condition is common in sufficient dimension reduction (e.g.
Cook, 1998; Li, 2018) and is used to rule out anomalous predictor behavior
by requiring that certain regressions among the predictors themselves are all
linear. Using these ideas we conclude in this chapter that
1. Plots of the responses against NIPALS fitted values derived from the algo-
rithm in Table 3.1 can be used to diagnose non-linearity in the mean func-
tion E(Y | X). Linearity in such plots supports linear model (1.1), while a
clearly non-linear trend contradicts model (1.1) in support of model (9.1)
(see Proposition 9.4).
The intuition behind this definition is that the projection PS X carries all
of the information that X has about the conditional mean E(Y | X). Let
α ∈ Rp×dim(S) be a basis for a mean dimension reduction subspace S. Then if
S were known, we might expect that E(Y | X) = E(Y | αT X), thus reducing
the dimension of X for the purpose of estimating the conditional mean. This
expectation is confirmed by the following proposition (Cook and Li, 2002)
whose proof is sketched in Appendix A.7.1.
(i) Y ⫫ E(Y | X) | α^T X,
The CMS does not always exist, but it does exist under mild conditions
that should not be worrisome in practice (Cook and Li, 2002). We assume
existence of the CMS throughout this chapter.
Suppose r = 1 and that the regression of Y on X follows the single index model
Y = f (β1T X) + ε, (9.2)
1. M = Σ_W α(α^T Σ_W α)^{-1}.
2. M^T is a generalized inverse of α.
4. E(W | α^T W) − μ_W = P^T_{S(Σ_W)}(W − μ_W), equivalently E(W | α^T W) = μ_W + P^T_{S(Σ_W)}(W − μ_W).
The next corollary and lemma describe special settings for the linearity
condition.
Corollary 9.1. (I) If model (9.1) holds and if X | Y satisfies the linearity
condition relative to SE(Y |X) for each value of Y , then X satisfies the linearity
condition relative to SE(Y |X) . (II) If X | Y is elliptically contoured, then X
satisfies the linearity condition relative to SY |X .
Proof. (I) The result follows because under model (9.1), SE(Y |X) = SY |X . (II)
If X | Y is elliptically contoured, then it satisfies the linearity condition for
Proposition 9.4. Assume the linearity condition for X relative to SE(Y |X)
and that non-linear model (9.1) holds. Then span(ΣX,Y ) ⊆ ΣX SE(Y |X) and
consequently B ⊆ SE(Y |X) .
Proof. We step through the proof so the role of various structures can be
seen. Recall that η is a basis matrix for SE(Y |X) . Expanding the expectation
operator, we first have
\[
\Sigma_{X,Y} = E_{\eta^T X}\big\{P^T_{\eta(\Sigma_X)}(X - \mu_X)\,(E(Y \mid \eta^T X) - \mu_Y)^T\big\}
= P^T_{\eta(\Sigma_X)}\,\Sigma_{X,Y}, \tag{9.4}
\]
span(β) ⊆ SE(Y |X) ,
Corollary 9.2. Assume the linearity condition for X relative to SE(Y |X) . As-
sume also that the regression of Y on X follows the single index model (9.2)
and that Σ_{X,Y} ≠ 0. Then B = S_{E(Y|X)}.
Proposition 9.5. Assume the linearity condition for X relative to SE(Y |X) .
Then under the non-linear model (9.1), we have
Moreover, if the single index model (9.2) holds and Σ_{X,Y} ≠ 0 then EΣX(B) =
EΣX (SE(Y |X) ).
We know from the discussion in Section 3.2 that the weight matrix from
the NIPALS algorithm provides an estimator of a basis for the predictor en-
velope EΣX(B), which by Proposition 9.5 is contained in the corresponding
envelope of the CMS. It follows that the compressed predictors W TX are as
relevant for the non-linear regression model (9.1) as they are for the linear
model (1.1), provided that the linearity condition holds. In some regressions
it may happen that EΣX(B) is a proper subset of EΣX (SE(Y |X) ), in which case
the NIPALS compression may miss relevant directions in which the mean
function is non-linear. However, if the single index model (9.2) holds then
EΣX(B) = EΣX (SE(Y |X) ) and NIPALS will not miss any relevant directions in
the population.
These results have implications for graphical diagnostics to detect non-
linearities in a regression. The following steps may be helpful in regressions
with a real response.
1. Fit the data with NIPALS, SIMPLS or a PLS regression variation thereof
and get the estimated coefficient matrix β̂_pls.
2. Plot the response Y versus the fitted values β̂_pls^T X. If the plot shows clear curvature, then the serviceability of the envelope model (2.5) is questionable. If no worrisome non-linearities are observed, then the model is sustained.
The conclusion EΣX(B) ⊆ EΣX (SE(Y |X) ) from Proposition 9.5 implies that
EΣX(B) could be a proper subset of EΣX (SE(Y |X) ) and thus that some
directions in the CMS envelope are missed. However, if the underlying re-
gression is a single index model, then the plot of the response versus the
fitted values is sufficient for detecting non-linearity.
which is the fundamental population justification for using the NIPALS al-
gorithm for dimension reduction in conjunction with using linear model (1.1)
for prediction. In particular, it implies that there is a vector η ∈ Rq so that
β = W η, leading to the reduced linear model Y = α + η T W TX + ε.
We propose in Section 9.5.1 a procedure for generalizing PLS algorithms for
dimension reduction in regressions covered by non-linear model (9.1). These
generalizations effect dimension reduction but not prediction. Methods for
prediction are discussed in Section 9.6. We show in Section 9.5.2 that the gen-
eralized procedure can produce the Krylov sequence (9.5) and in Section 9.5.3
we show how to use it to remove linear trends to facilitate detecting non-
linearities. We expect that there will be many other applications of the gen-
eralized method in the future.
We confine discussion to regressions with a real response, r = 1, in the
remainder of this chapter.
Proposition 9.6. Assume model (9.1) with a real response. Let Γ be a basis
matrix for EΣX (SE(Y |X) ) and let U and V be real-valued functions of ΓTX.
Assume also that X satisfies the linearity condition relative to EΣX (SE(Y |X) ).
Then
E{(U Y + V )(X − µX )} ∈ EΣX (SE(Y |X) ).
Let η be a basis matrix for S_{E(Y|X)}. By Proposition 9.1, E(Y | X) = E(Y | η^T X).
Since SE(Y |X) ⊆ EΣX (SE(Y |X) ), it follows that E(Y | X) = E(Y | η T X) =
E(Y | ΓTX). Consequently,
Since X satisfies the linearity condition relative to EΣX (SE(Y |X) ) = span(Γ),
we know from conclusions 4 and 5 of Proposition 9.2 that E(X | ΓTX) = PΓ X.
Using the condition that U and V are real-valued functions of ΓT X, we have
\[
\nu_j = \mathrm{cov}\{u(\nu_{j-1}^T X)Y + v(\nu_{j-1}^T X),\, X\} \in \mathcal{E}_{\Sigma_X}(\mathcal{S}_{E(Y|X)}), \quad j = 1, 2, \ldots.
\]
is a second vector in the envelope. To get a third vector, we use the sec-
ond vector since ΣX σX,Y ∈ EΣX (SE(Y |X) ). Then with U = 0 and V =
v{(ΣX σX,Y )T X} = (ΣX σX,Y )T X, we have
Under the non-linear model we still have strict monotone containment of the
Krylov subspaces following (1.23), with equality once q components are reached.
However, here K_q(Σ_X, σ_{X,Y}) is contained in E_{Σ_X}(S_{E(Y|X)}) without
necessarily being equal:
K1 (ΣX , σX,Y ) ⊂ · · · ⊂ Kq (ΣX , σX,Y ) = Kq+1 (ΣX , σX,Y ) · · · ⊆ EΣX (SE(Y |X) ).
with equality EΣX(B) = EΣX (SE(Y |X) ) under the single index model.
This result reinforces the conclusions of Proposition 9.5 and the discussion
that follows it: the NIPALS algorithm can be used to diagnose non-linearity
in the mean function. In fact, the mild linearity condition for the envelope is
the only novel condition needed.
The graphical diagnostic procedure discussed at the end of Section 9.4
will display linear as well as non-linear trends. For the purpose of diagnosing
non-linearity in the mean function, it may be desirable to remove linear trends
first and use the NIPALS residuals
\[
R = Y - E(Y) - \beta_{\mathrm{npls}}^T(X - \mu_X),
\]
where the expectation is with respect to the joint distribution of (R, X), and
then apply Proposition 9.6 with first vector βnpls ∈ EΣX (SE(Y |X) ). To get
the second vector, assume without loss of generality that E(X) = 0 and use
\[
U = u(\beta_{\mathrm{npls}}^T X) = \beta_{\mathrm{npls}}^T X
\quad \text{and} \quad
V = v(\beta_{\mathrm{npls}}^T X) = -\beta_{\mathrm{npls}}^T X\, E(Y) - (\beta_{\mathrm{npls}}^T X)^2.
\]
This gives
where the final containment follows from Proposition 9.6. For the third vector
we operationally replace βnpls in u and v with ΣRXX βnpls to get
which implies that we run NIPALS with Σ̂_RXX and β̂_npls. Using residuals in Σ̂_RXX has the advantage that the linear trend is removed, making non-linearities easier to identify.
The graphical diagnostic for non-linearity in the mean function can be
summarized as follows:
1. Run NIPALS(Σ̂_X, σ̂_{X,Y}), obtain β̂_npls and then form the sample residuals
\[
\hat{R}_i = Y_i - \bar{Y} - \hat{\beta}_{\mathrm{npls}}^T(X_i - \bar{X}), \quad i = 1, \ldots, n.
\]
4. Plot Y or residuals versus h_j^T X for j = 1, 2, . . . and look for clear non-linear trends. Since β_npls is the first vector listed on the left-hand side of (9.7), h_1 = β̂_npls/‖β̂_npls‖, where the normalization arises from the eigenvector computation in Table 3.1. In consequence, a plot of Y or residuals against h_1^T X is effectively the same as a plot of Y or residuals against the NIPALS fitted values β̂_npls^T X_i, i = 1, . . . , n.
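A compact version of this diagnostic is sketched below. The PLS fit is a minimal NIPALS-style computation under our own conventions rather than a transcription of Table 3.1, and the plotting choices are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def pls_fitted_values(X, y, q):
    """Minimal NIPALS-style fit with q components for a univariate response;
    a sketch under our own conventions."""
    Xc, yc = X - X.mean(0), y - y.mean()
    Xd = Xc.copy()
    W = []
    for _ in range(q):
        w = Xd.T @ yc
        w = w / np.linalg.norm(w)
        t = Xd @ w
        Xd = Xd - np.outer(t, Xd.T @ t / (t @ t))    # deflate X
        W.append(w)
    Z = Xc @ np.column_stack(W)                      # compressed predictors
    coef, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    return y.mean() + Z @ coef

def diagnostic_plots(X, y, q):
    """Plot Y and the residuals against the PLS fitted values (step 4 above);
    clear curvature supports the non-linear model (9.1)."""
    fitted = pls_fitted_values(X, y, q)
    resid = y - fitted
    fig, ax = plt.subplots(1, 2, figsize=(8, 3))
    ax[0].scatter(fitted, y, s=10)
    ax[0].set_xlabel("PLS fitted values"); ax[0].set_ylabel("Y")
    ax[1].scatter(fitted, resid, s=10)
    ax[1].set_xlabel("PLS fitted values"); ax[1].set_ylabel("Residuals")
    plt.tight_layout(); plt.show()
```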
9.6 Prediction
The graphical methods described in Sections 9.5.2 and 9.5.3 are used to as-
sess the presence of non-linearity in the mean function. If no notable non-
linearity is detected then the linear methods described in Chapter 3 may be
serviceable for prediction. If non-linearity is detected, the compressed pre-
dictors w1T X, w2T X, . . . can still serve for dimension reduction under Corol-
lary 9.3, but there is no model or rule for prediction and consequently there
is no way to determine the number of components. To complete the anal-
ysis we need a method or model for predicting Y from w1T X, . . . , wdT X,
d = 1, 2, . . . , min{p, n − 1}, along with the associated estimated predictive
mean squared error used in selecting the number of components.
Let S_{Z_d|Y} denote the covariance matrix of the residuals and let
\[
\hat{Z}_{d,i} = \hat{\mu}_d + \hat{\Theta}_d f(Y_i), \quad i = 1, \ldots, n,
\]
denote the fitted values from a fit of (9.10). Then based on model (9.10), we have
\[
\omega_i(z_d) \propto \exp\left\{ -\tfrac{1}{2}\,(z_d - \hat{Z}_{d,i})^T S_{Z_d|Y}^{-1}(z_d - \hat{Z}_{d,i}) \right\}.
\]
These weights are then substituted into (9.9) to obtain the predictions Ŷ(z_d) = Ê(Y | Z_d = z_d), which can be used with a holdout sample or cross validation to estimate the predictive mean squared error.
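A minimal sketch of this prediction step follows, assuming, as our reading of (9.9), that the prediction Ê(Y | Z_d = z_d) is formed as the ω-weighted average of the observed training responses; the function name, the least-squares fit of (9.10) and the example fitting function are ours.

```python
import numpy as np

def inverse_pls_predict(Z, y, z_new, f):
    """Hedged sketch of the prediction step of Section 9.6.2.  Z is the n x d
    matrix of compressed predictors (e.g. W^T X), y holds the training
    responses and f is a user-chosen fitting function returning a vector f(y).
    We assume the prediction is the omega-weighted average of the observed
    responses; that reading of (9.9) is ours."""
    n = len(y)
    F = np.column_stack([np.ones(n), np.array([f(yi) for yi in y])])   # [1, f(Y_i)]
    coef, *_ = np.linalg.lstsq(F, Z, rcond=None)    # least-squares fit of mu_d, Theta_d
    Z_hat = F @ coef                                # fitted values of (9.10)
    R = Z - Z_hat
    S = R.T @ R / n                                 # S_{Z_d | Y}
    S_inv = np.linalg.inv(S)
    D = Z_hat - z_new                               # differences z_d - Z_hat_{d,i}
    q = np.einsum('ij,jk,ik->i', D, S_inv, D)       # quadratic forms
    w = np.exp(-0.5 * (q - q.min()))                # normal weights (stabilized)
    return float(np.sum(w * y) / np.sum(w))

# Example fitting function for a single continuous response:
# f = lambda yi: np.array([yi, yi**2, np.sqrt(yi)])
```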
The performance of this predictive methodology depends on the approxi-
mate normality of Zd and the choice of f . Since low-dimensional projections
of high-dimensional data are approximately normal (Diaconis and Freedman,
1984; Hall and Li, 1993), it is reasonable to rely on approximate normality
of Zd . Consistency and the asymptotic distribution for prediction were es-
tablished by Forzani et al. (2019) for the case of p fixed and n → ∞. There
are several generic possibilities for the choice of f, perhaps guided by graphics as implied earlier. For single-response regressions, fractional polynomials (Royston and Sauerbrei, 2008) or polynomials deriving from a Taylor approximation, f(Y) = (Y, Y², Y³, . . . , Y^s)^T, are one possibility. Periodic behavior could be modeled using a Fourier series
form
f (Y ) = {cos(2πY ), sin(2πY ), . . . , cos(2πkY ), sin(2πkY )}T
as perhaps in signal processing applications. Here, k is a user-selected integer
and s = 2k. Splines and other types of non-parametric constructions could
also be used to form a suitable f . A variety of basis functions are available
from Adragni (2009).
Another option with a single continuous response consists of “slicing”
the observed values of Y into H bins Ch , h = 1, . . . , H. We can then set
s = H − 1 and specify the h-th element of f to be Jh (Y ), where J is the
indicator function, Jh (Y ) = 1 if Y is in bin h and 0 otherwise. This has the
effect of approximating each component E(wjT X | Y ) of E(Zd | Y ) as a step
function of Y with H steps,
E(Zd | Y ) ≈ µ + ξJ,
FIGURE 9.1
(a) Fat vs. β̂_npls^T X. (b) Residuals, r_i vs. β̂_npls^T X.
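Returning to the slicing construction described above, the sketch below builds the indicator fitting function f from H bins; the use of equal-count bins is our choice and the names are illustrative.

```python
import numpy as np

def sliced_f(y_train, H):
    """Hedged sketch of the slicing construction: bin the observed responses
    into H slices and return a function f mapping y to the H-1 indicator
    components J_h(y), h = 1, ..., H-1 (the last slice is the baseline)."""
    edges = np.quantile(y_train, np.linspace(0, 1, H + 1))   # equal-count bins (our choice)
    def f(y):
        h = np.clip(np.searchsorted(edges[1:-1], y, side='right'), 0, H - 1)
        out = np.zeros(H - 1)
        if h < H - 1:
            out[h] = 1.0
        return out
    return f

# Usage with the inverse_pls_predict sketch above:
# f = sliced_f(y_train, H=5)
```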
beyond that displayed in plots (a) and (b). In this case there does not seem
to be any notable non-linearity remaining and consequently we could proceed
with modeling based on plot (a). In particular, a reasonable model would have
E(fat | X) = f (β T X) for some scalar-valued function f that is likely close to
a quadratic.
Figures 9.2 and 9.3 show for protein and water content the same construc-
tions as Figure 9.1 does for fat content. Our interpretation of these plots is
qualitatively similar to that for Figure 9.1: In this case there does not seem
FIGURE 9.2
(a) Protein vs. β̂_npls^T X. (b) Residuals, r_i vs. β̂_npls^T X.
FIGURE 9.3
(a) Water vs. β̂_npls^T X. (b) Residuals, r_i vs. β̂_npls^T X.
Here we used NIPALS as summarized in Table 3.1 to fit each response indi-
vidually. The number of components was determined by using 5-fold predictive
cross validation on the training data. This gave q = 13, 13, and 17 for fat, pro-
tein and water. In view of the curvature present in panel (a) of Figures 9.1–9.3
we would expect non-linear methods to give smaller mean squared prediction
errors.
The fitted model for prediction is as shown in (9.12) with the coefficients
determined by using linear PLS. The number of components was again deter-
mined by using 5-fold predictive cross validation on the training data, giving
q = 12, 13, and 16 for fat, protein and water. This method will perform best
when the CMS has dimension 1, as suggested by the plots in Figures 9.1–
9.3, and the linearity condition holds. If the dimension of the CMS is greater
than 1, this method could still give prediction mean squared errors that are
smaller than those from linear PLS, provided that the response surface is well
approximated by a quadratic.
4. Non-parametric inverse PLS (NP-I-PLS) using β̂_npls^T X.
This is a version of the method described in Section 9.6.2 where the weights (9.9) were determined using Z_d = β̂_npls^T X and f(Y) = (Y, Y², Y^{1/2}) was selected by graphical inspection. The number of components for β̂_npls was again selected by using 5-fold predictive cross validation on the training data, giving q = 13, 18, and 18 for fat, protein and water. Like quadratic PLS and non-parametric forward PLS, this method will perform best when the CMS has dimension 1 and the linearity condition holds, but may still be useful otherwise.
This is the method described in Section 9.6.2 using f(Y) = (Y, Y², Y^{1/2}). The number of components for W was again determined by using 5-fold predictive
TABLE 9.1
Tecator Data: Number of components based on 5-fold cross validation using
the training data for each response. NP-PLS denotes non-parametric PLS and
NP-I-PLS denotes non-parametric inverse PLS.
cross validation on the training data, giving q = 15, 17, and 22 for fat, pro-
tein and water. The previous four methods will all be at their best when the
dimension of the CMS is 1. This method does not have that restriction and
should work well even when the dimension of the CMS is greater than one.
The number of components for each method-response combination is sum-
marized in Table 9.1 and the root mean squared prediction errors
\[
\left\{ (1/43)\sum_{i=1}^{43}\big(\hat{Y}_{\mathrm{test},i} - Y_{\mathrm{test},i}\big)^2 \right\}^{1/2}
\]
determined from the 43 observations in the testing dataset are given in Ta-
ble 9.2(a). The root mean squared prediction errors typically changed rela-
tively little when the number of components was varied by, say, ±2. This is
illustrated in Figure 9.4, which shows the cross validation root mean squared
prediction error for linear PLS, method 1, and non-parametric inverse PLS
with β̂_npls^T X, method 4. The prediction errors are much larger for q outside
the range of the horizontal axis shown in Figure 9.4.
The number of training cases in this dataset, n = 172, is greater than the number of predictors, p = 100, but none of the prediction methods used here requires that n > p and all are serviceable when n < p.
The first method, linear PLS, is the same as the standard NIPALS algo-
rithm outlined in Table 3.1. From Table 9.2(a) it seems clear that in this exam-
ple any of methods 2–4 is preferable to the NIPALS algorithm. This is perhaps
not surprising in view of the curvature shown in panel (a) of Figures 9.1–9.3.
Methods 2–4 are all based on the NIPALS fitted values β̂_npls^T X. As mentioned
FIGURE 9.4
Tecator data: root mean squared prediction error versus number of compo-
nents for fat. Upper curve is for linear PLS, method 1 in Table 9.1; lower curve
is for inverse PLS prediction with β̂_npls^T X, method 4. (From Fig. 2 of Cook
and Forzani (2021) with permission.)
previously, these methods will do their best when the CMS has dimension 1.
We judge their relative behavior in this example to be similar, except it might
be argued that non-parametric PLS demonstrates a slight advantage. Method
5, non-parametric inverse PLS with W TX, has the smallest root mean squared
prediction errors in this example and is clearly the winner. This method does
not rely on the CMS having one dimension, and we expect that it will dom-
inate in any analysis where linear PLS regression is judged to be serviceable
apart from the presence of non-linearity in the conditional mean E(Y | X).
As discussed in Section 9.6.1, Shan et al. (2014) proposed a new method, called PLS-SLT, of non-linear PLS based on slicing the response into non-overlapping bins within which the relationship between Y and X is assumed to follow linear model (1.1). In their analysis of the Tecator data, PLS-SLT showed superior predictive performance against linear PLS and five compet-
ing non-linear methods. Table 9.2(b) gives the root mean squared prediction
error reported by Shan et al. (2014, Table 6) for linear PLS and their new
method PLS-SLT. They used 12 components for each of the six scenarios
shown in Table 9.2(b). Although we used 13 components for fat and protein,
the root mean squared prediction errors for linear PLS shown in Table 9.2
are nearly identical. We used 17 components for the linear PLS analysis of
water, while Shan et al. (2014) used 12. This may account for the discrepancy
TABLE 9.2
Tecator Data: (a) Root mean squared training and testing prediction errors for five methods of prediction.
in linear PLS prediction errors for water, 1.72 and 2.03 in parts (a) and (b)
of Table 9.2. Most importantly, the root mean squared prediction error for
PLS-SLT is comparable to or smaller than the prediction errors for methods
1–4 in Table 9.2(a), but its prediction errors are larger than those for method
5, non-parametric inverse PLS with W TX.
The methods discussed here all require a number of user-selected tuning
parameters or specifications. Our method 5, non-parametric inverse regression
with Zd = W TX, requires f (y) and the number of components q represented in
W. Recall that we selected f(y) = (y, y², √y)^T based on graphical inspection
of inverse regression functions, and we selected q by using 5-fold cross valida-
tion on the training data. PLS-SLT requires selecting the number of bins and
the number of components, which we still represented as q. Shan et al. (2014)
restricted q to be at most 15 and the number of bins to be at most 10. For
each number of bins, they determined the optimal value of q by using a test-
ing procedure proposed by Haaland and Thomas (1988). The number of bins
giving the best predictive performance was then selected. We conclude from
these descriptions that the procedures for selecting the tuning parameters are
network, the root mean squared prediction error for the testing data was 8.6,
which is considerably smaller than the corresponding error of 15.28 from their
implementation of PLS.
We applied four of the methods used for Table 9.2 to the same training
and testing data with p = 9, 623. Quadratic PLS (method 2 in Table 9.2) was
not considered because the response surface was noticeably non-quadratic.
The fitting function f (Y ) = (Y, Y 2 , Y 3 , Y 1/2 )T for inverse PLS with W TX
(see Section 9.6.2) was selected by smoothing a few plots of predictors versus
the response from the training data. Using the testing data, the root mean
squared prediction errors for these four methods are shown in Table 9.3(a),
along with the prediction error for the final MLP network in part (b). The
prediction error of 9.20, which is the best of those considered, is a bit larger
than the MLP error found by Chiappini et al. (2020) but still considerably
smaller than the PLS error.
Wold et al. (2001) recommended that “A good nonlinear regression tech-
nique has to be simple to use.” We agree subject to the further condition that
a simple technique must produce competitive or useful results. The general
approach behind the MLP network requires considerable analysis and many
subjective decisions, including the initial reduction to 9,623 spectral measure-
ments followed by reduction to 12 principal components, outlier detection
methodology, the type of experimental design used to train the network, the
desirability function, specific characteristics of the network construction, divi-
sion of the data into testing and training sets, and so on. Because of all these
required data-analytic decisions, it is not clear to us that the relative perfor-
mance of the neural network approach displayed in the etanercept data would
hold in future analyses. In contrast, the inverse PLS approach with W TX
proved best in the Tecator data and had a strong showing in the analysis of
the etanercept data with many fewer data analytic decisions required.
To gain intuition into what might be achieved by minimizing the number
of data-analytic decisions required, we conducted another analysis using NP-
I-PLS with W TX based on the original 38,500 predictors and n = 35 data
points. For each fixed number of components, the root mean square error of
prediction was estimated by using leave-one-out cross validation with the fit-
ting function used previously. For each subset of 34 observations, the number
of components minimizing the prediction error was always 6. The cross val-
idation root mean square error of prediction was determined to be 8.46 as
shown in part (c) of Table 9.3. This relatively simple method requires only
two essential data-analytic choices: the number of components and the fitting
function f (Y ), and yet it produced a root mean square error of prediction
that is smaller than that found by the carefully tuned MLP network.
FIGURE 9.5
Solvent data: Predictive root mean squared error PRMSE versus number of
components for three methods of fitting. A: linear PLS. B: non-parametric
inverse PLS with W TX, as discussed in Section 9.7. C: The non-linear PLS
method proposed by Lavoie et al. (2019).
used the same solvent data to compare the method proposed by Lavoie et al.
(2019) to linear PLS and to non-parametric inverse PLS with W TX.
We studied the solvent data following the steps described by Lavoie et al.
(2019). For instance, we removed the 5 observations that Lavoie et al. (2019)
flagged as outliers, leaving n = 98 observations on p = 8 chemical properties of
the solvents as predictors and one response, the dielectric constant. Ten-fold
cross validation was used to measure the predictive root mean squared error,
PRMSE.
The results of our comparative study are shown in Figure 9.5. Curve A
gives the PRMSE for linear PLS and it closely matches the results of Lavoie
et al. (2019, Fig. 6(b), curve A). Curve C gives the PRMSE for the Lavoie et
al. method read directly from their Figure 6. Curve B gives the PRMSE of our
proposed NP-I-PLS with W TX using the fractional polynomial (e.g. Royston
and Sauerbrei, 2008) fitting function f (Y ) = (Y 1/3 , Y 1/2 , Y, Y 2 , Y 3 , log Y )T .
Figure 9.5 reinforces our previous conclusions that the dimension reduction
10
The Role of PLS in Social Science Path Analyses
In this chapter, we describe the current and potential future roles for partial
least squares (PLS) algorithms in path analyses. After reviewing the present
debate on the value of PLS for studying path models in the social sciences
and establishing a context, we conclude that, depending on specific objectives,
PLS methods have considerable promise, but that the present social science
method identified as PLS is only weakly related to PLS and is perhaps more
akin to maximum likelihood estimation. Developments necessary for integrat-
ing proper PLS into the social sciences are described. A critique of covariance-
based structural equation modeling (cb|sem), as it relates to PLS, is given as
well. The discussion in this chapter follows Cook and Forzani (2023).
10.1 Introduction
Path modeling is a standard way of representing social science theories. It
often involves concepts like “customer satisfaction” or “competitiveness” for
which there are no objective measurement scales. Since such concepts can-
not be measured directly, multiple surrogates, which may be called indicators,
observed or manifest variables, are used to gain information about them indi-
rectly. One role of a path diagram is to provide a representation of the rela-
tionships between the concepts, which are represented by latent variables, and
the indicators. A fully executed path diagram is in effect a model that can be
used to guide subsequent analysis, rather like an algebraic model in statistics.
use it.” Although dressed up a bit to highlight new developments, their criticisms are essentially the same as those they leveled in previous writings. There
was much give-and-take in the subsequent discussion, both for and against
pls|sem. Goodhue et al. (2023) recommended against using PLS in path
analysis, arguing that it “. . . violates accepted norms for statistical inference.”
They further recommended that key journals convene a task force to assess the
advisability of accepting PLS-based work. Russo and Stol (2023) took excep-
tion to some of Evermann and Rönkkö's (2021) conclusions, while commending
their efforts. Sharma et al. (2022) expressed matter-of-factly that the pls|sem
claims leveled by Evermann and Rönkkö (2021) are misleading, extraordinary
and questionable. They then set about “. . . to bring a positive perspective to
this debate and highlight the recent developments in PLS that make it an
increasingly valuable technique in IS and management research in general”.
There is a substantial literature that bears on this debate (Rönkkö et al.,
2016b, cites about 150 references). The preponderance of articles rely mostly
on intuition and simulations to support sweeping statements about intrinsi-
cally mathematical/statistical issues without sufficient supporting theory. But
adequately addressing the methodological issues in path analysis requires in
part avoiding ambiguity by employing a degree of context-specific theoretical
specificity.
In this chapter, we use the acronym PLS to designate the partial least
squares methods that stem from the theoretical foundations by Cook, Hel-
land, and Su (2013). These link with early work by Wold (Geladi, 1988) and
cover chemometrics applications (Cook and Forzani, 2020, 2021; Martens and
Næs, 1989), as well as a host of subsequent methods, particularly the PLS
methods for simultaneous reduction of predictors and responses discussed in
Chapter 5, which are relevant to the analysis of the path models. We rely on
these foundations in this chapter.
path models, we are able to state clearly results that carry over qualitatively
to more intricate settings. And we are able to avoid terms and phrases that
do not seem to be understood in the same way across the community of path
modelers (e.g. McIntosh et al., 2014; Sarstedt et al., 2016). We took the articles
by Wold (1982), Dijkstra (1983, 2010), and Dijkstra and Henseler (2015a,b)
as the gold standard for technical details on pls|sem. Output from our im-
plementation of their description of a pls|sem estimator agreed with output
from an algorithm by Rönkkö et al. (2016a), which supports our assessment of
the method. We confine our discussion largely to issues encountered in Wold’s
first-stage algorithm (Wold, 1982, Section 1.4.1).
10.1.3 Outline
To establish a degree of context-specific theoretical specificity, we cast our de-
velopment in the framework of common reflective path models that are stated
in Section 10.2 along with the context and goals. These models cover simula-
tion results by Rönkkö et al. (2016b, Figure 1), which is one reason for their
adoption. We show in Section 10.3 the apparently novel observation that this setting implies a reduced rank regression (RRR) model for the observed
variables (Anderson, 1951; Cook, Forzani, and Zhang, 2015; Izenman, 1975;
Reinsel and Velu, 1998). From here we address identifiability and estimation
using RRR. Estimators are discussed in Sections 10.4 and 10.5. The approach
we take and the broad conclusions that we reach should be applicable to other
perhaps more complicated contexts.
Some of the concerns expressed by Rönkkö et al. (2016b) involve contrasts
with cb|sem methodology. Building on RRR, we bring cb|sem methodology
into the discussion in Section 10.5 and show that under certain key assump-
tions additional parameters are identifiable under the SEM model. In Sec-
tion 10.6 we quantify the bias and revisit the notion of consistency at large.
The chapter concludes with simulation results and a general discussion.
denote observable random variables, which are called indicators, that are as-
sumed to reflect information about underlying real latent constructs ξ, η ∈ R.
This is indicated by the arrows in Figure 10.1 leading from the constructs to
the indicators, which imply that the indicators reflect the construct. Boxed
variables are observable while circled variables are not. This restriction to
univariate constructs is common practice in the social sciences, but is not re-
quired. An important path modeling goal in this setting is to infer about the
association between the latents η and ξ, as indicated by the double-headed
curved paths between the constructs. We take this to be a general objective
of both pls|sem and cb|sem, and it is one focal point of this chapter. Each
indicator is affected by an unobservable error. The absence of paths between
these errors indicates that, in the upper path diagram, they are independent
conditional on the latent variables. They are allowed to be conditionally de-
pendent in the lower path diagram, as indicated by the double-arrowed paths
that join them. We refer to the lower model as the correlated error (core)
model and to the upper model as the uncorrelated error (uncore) model.
The models of Figure 10.1 occur frequently in the literature on path models
(e.g. Lohmöller, 1989). The uncore model is essentially the same as that de-
scribed by Wold (1982, Fig. 1) and it is the basis for Dijkstra and Henseler’s
studies (Dijkstra, 1983; Dijkstra and Henseler, 2015a,b) of the asymptotic
properties of pls|sem. We have not seen a theoretical analysis of pls|sem
under the core model.
For instance, Bollen (1989, pp. 228–235, Fig. 7.3) described a case study
on the evolution of political democracy in developing countries. The variables
X1 , . . . , Xp were indicators of specific aspects of a political democracy in de-
veloping countries in 1960, like measures of the freedom of the press and the
fairness of elections. The latent construct ξ was seen as a real latent construct
representing the level of democracy in 1960. The variables Y1 , . . . , Yr were the
same indicators from the same source in 1965, and correspondingly η was inter-
preted as a real latent construct that measures the level of democracy in 1965.
One goal was to estimate the correlation between the levels of democracy in
1960 and 1965.
The path diagrams in Figure 10.1 are uncomplicated relative to those that
may be used in sociological studies. For example, Vinzi, Trinchera, and Amato
(2010) described a study of fashion in which an uncore model was imbedded
in a larger path diagram with the latent constructs “Image” and “Character”
FIGURE 10.1
Reflective path diagrams relating two latent constructs ξ and η with their
respective univariate indicators, X1 , . . . , Xp and Y1 , . . . , Yr . ϕ denotes either
cor(ξ, η) or Ψ = cor{E(ξ | X), E(η | Y )}. Upper and lower diagrams are the
uncore and core models. (From Fig. 2.1 of Cook and Forzani (2023) with
permission.)
and
\[
\begin{aligned}
Y &= \mu_Y + \beta_{Y|\eta}(\eta - \mu_\eta) + \varepsilon_{Y|\eta}, &&\text{where } \varepsilon_{Y|\eta} \sim N_r(0, \Sigma_{Y|\eta}),\\
X &= \mu_X + \beta_{X|\xi}(\xi - \mu_\xi) + \varepsilon_{X|\xi}, &&\text{where } \varepsilon_{X|\xi} \sim N_p(0, \Sigma_{X|\xi}).
\end{aligned} \tag{10.2}
\]
We use σ_ξ², σ_{ξ,η} and σ_η² to denote the elements of Σ_{(ξ,η)} ∈ R^{2×2}, and we further assume that ε_{ξ,η}, ε_{X|ξ} and ε_{Y|η} are mutually independent, so jointly X, Y, ξ, and η follow a multivariate normal distribution. Normality per se
is not necessary. However, certain implications of normality like linear regres-
sions with constant variances are needed in this chapter. Assuming normality
overall avoids the need to provide a list of assumptions, which might obscure
the overarching points about the role of PLS. As defined in Chapter 1, we use
the notation ΣU,V to denote the matrix of covariances between the elements
of the random vectors U and V, and we use β_{U|V} = Σ_{U,V} Σ_V^{-1} to indicate the
matrix of population coefficients from multi-response linear regression of U on
V , where ΣV = var(V ). Lemma A.5 in Appendix A.8.1 gives the mean and
variance of the joint multivariate normal distribution of X, Y , ξ, and η. Since
only X, Y are observable, all estimation and inference must be based on the
joint multivariate distribution of (X, Y ).
marginal correlation cor(η, ξ). These measures can reflect quite different views
of a reflective setting. The marginal correlation cor(η, ξ) implies that η and
ξ represent concepts that are uniquely identified regardless of any subjective
choices made regarding the variables Y and X that are reasoned to reflect their
properties up to linear transformations. Two investigators studying the same
constructs would naturally be estimating the same correlation even if they se-
lected a different Y or a different X. In contrast, Ψ = cor{E(η | Y ), E(ξ | X)},
the correlation between population regressions, suggests a conditional view:
η and ξ exist only by virtue of the variables that are selected to reflect their
properties, and so attributes of η and ξ cannot be claimed without citing the
corresponding (Y, X). Two investigators studying the same concepts could un-
derstandably be estimating different correlations if they selected a different Y
or a different X. For instance, the latent construct “happiness” might reflect
in part a combination of happiness at home and happiness at work. Two sets
of indicators X and X ∗ that reflect these happiness subtypes differently might
yield different correlations with a second latent construct, say “generosity.”
The use of Ψ as a measure of association can be motivated also by an ap-
peal to dimension reduction. Under the models (10.2), X ⫫ ξ | E(ξ | X) and Y ⫫ η | E(η | Y). In consequence, the constructs affect their respective indicators only through the linear combinations given by the conditional means,
\[
E(\xi \mid X) = \Sigma_{X,\xi}^T\Sigma_X^{-1}(X - \mu_X)
\quad \text{and} \quad
E(\eta \mid Y) = \Sigma_{Y,\eta}^T\Sigma_Y^{-1}(Y - \mu_Y).
\]
the same substantive conclusions is a related issue that can also be relevant,
but is beyond the scope of this book. The correlation Ψ brings these issues to
the fore; its explicit dependence on the indicators manifests important trans-
parency and it facilitates the reification of a concept, although it does not
directly address discrepancies between the target concept and construct.
The diagrams in Figure 10.1 can be well described as instances of oper-
ational measurement theory in which concepts are defined in terms of the
operations used to measure them (e.g. Bridgman, 1927; Hand, 2006). In op-
erationalism there is no assumption of an underlying objective reality; the
concept being pursued is precisely that which is being measured, so concept
and construct are the same. In the alternate measurement theory of represen-
tationalism, the measurements are distinct from the underlying reality. There
is a universal reality underlying the concept of length regardless of how it is
measured. However, a universal understanding of “happiness” must rely on
unanimity over the measurement process.
The regression constraints fix the variances of the conditional means E(ξ | X)
and E(η | Y ) at 1, while the marginal constraints fix the marginal vari-
ances of ξ and η at 1. Under the regression constraints, Ψ = cor(E(η | Y ),
E(ξ | X)) = cov(E(η | Y ), E(ξ | X)). Under the marginal constraints,
cor(ξ, η) = cov(ξ, η). The regression and marginal constraints are related via
the variance decompositions σ_ξ^2 = var{E(ξ | X)} + E{var(ξ | X)} and σ_η^2 = var{E(η | Y)} + E{var(η | Y)}.
Lemma 10.1. For the core model presented in (10.1) and (10.2), and
without imposing either the regression or marginal constraints, we have
E(Y | X) = µ_Y + β_{Y|X}(X − µ_X), where β_{Y|X} = AB^T Σ_X^{-1}, with
A ∈ R^{r×1}, B ∈ R^{p×1}, and Σ_{Y,X} = AB^T.
It is known from the literature on RRR that the vectors A and B are not
identifiable, while AB^T is identifiable (see, for example, Cook et al., 2015). As
a consequence of its being an RRR model, we are able to state in Proposition 10.1
which parameters are identifiable in the reflective model of (10.1) and (10.2).
10.4 Estimators of Ψ
10.4.1 Maximum likelihood estimator of Ψ
Although A and B are not identifiable, |Ψ| is identifiable since from (10.4) it
depends on the identifiable quantities ΣX , ΣY , and on the rank 1 covariance
matrix ΣY,X = AB T in the RRR of Y on X. Let Σ b Y,X denote the maximum
likelihood estimator of ΣY,X from fitting the RRR model of Lemma 10.1, and
recall that SX and SY denote the sample versions of ΣX and ΣY . Then the
maximum likelihood estimator of |Ψ| can be obtained by substituting these
estimators into (10.4):
|Ψ̂_mle| = tr^{1/2}(Σ̂_{X,Y} S_Y^{-1} Σ̂_{Y,X} S_X^{-1}).
Let Σ_{Ỹ,X̃} = Σ_Y^{-1/2} Σ_{Y,X} Σ_X^{-1/2} denote the standardized version of Σ_{Y,X}
that corresponds to the rank one regression of Ỹ = Σ_Y^{-1/2} Y on X̃ = Σ_X^{-1/2} X.
1. Standardize the indicators jointly, X̃ = S_X^{-1/2}(X − X̄) and Ỹ = S_Y^{-1/2}(Y − Ȳ).
2. Construct Σ̂_{Ỹ,X̃}, the matrix of sample correlations between the elements of the standardized vectors X̃ and Ỹ.
3. Form the singular value decomposition Σ̂_{Ỹ,X̃} = U D V^T and extract U_1 and V_1, the first columns of U and V, and D_1, the corresponding (largest) singular value of Σ̂_{Ỹ,X̃}.
4. Then |Ψ̂_mle| = D_1.
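To make these steps concrete, here is a minimal R sketch, assuming only that X (n × p) and Y (n × r) are data matrices with n large enough that S_X and S_Y are nonsingular; the function name psi_mle is ours and is not part of the book's software.

# Sketch of the MLE of |Psi| via the SVD of the jointly standardized
# sample cross-covariance matrix.
psi_mle <- function(X, Y) {
  n  <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)      # center the indicators
  Yc <- scale(Y, center = TRUE, scale = FALSE)
  SX  <- crossprod(Xc) / n                          # sample covariance of X
  SY  <- crossprod(Yc) / n                          # sample covariance of Y
  SYX <- crossprod(Yc, Xc) / n                      # sample cov(Y, X)

  # inverse square root via an eigendecomposition (joint standardization)
  isqrt <- function(S) {
    e <- eigen(S, symmetric = TRUE)
    e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
  }
  Styx <- isqrt(SY) %*% SYX %*% isqrt(SX)           # standardized cross-covariance

  # |Psi_mle| is the largest singular value, i.e. the first sample
  # canonical correlation between X and Y.
  svd(Styx)$d[1]
}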
where tr denotes the trace operator. This moment estimator of |Ψ|, which is
constructed by simply substituting moment estimators for the quantities in
(10.4), may not be as efficient as the maximum likelihood estimator under
the model of Section 10.2.2, but it might possess certain robustness properties. This
estimator does not use the reduced dimensions that arise from the rank of Σ_{X,Y}
or from envelope constructions and, for this reason, it is likely to be inferior.
Y = µ_Y + Γγ(η − µ_η) + ε_{Y|η},   ε_{Y|η} ∼ N_r(0, ΓΩΓ^T + Γ_0Ω_0Γ_0^T)
X = µ_X + Φφ(ξ − µ_ξ) + ε_{X|ξ},   ε_{X|ξ} ∼ N_p(0, Φ∆Φ^T + Φ_0∆_0Φ_0^T).   (10.7)
Envelope estimators.
PLS estimators.
1. Scale and center the indicators marginally, X^{(s)} = diag^{-1/2}(S_X)(X − X̄) and Y^{(s)} = diag^{-1/2}(S_Y)(Y − Ȳ), where diag(·) denotes the diagonal matrix with diagonal elements the same as those of the argument.
2. Construct the first eigenvector ℓ_1 of S_{Y^{(s)},X^{(s)}} S_{X^{(s)},Y^{(s)}} and the first eigenvector r_1 of S_{X^{(s)},Y^{(s)}} S_{Y^{(s)},X^{(s)}}.
3. Construct the proxy latent variables ξ̄ = r_1^T X^{(s)} and η̄ = ℓ_1^T Y^{(s)}.
4. Then the estimated covariance and correlation between the proxies η̄ and ξ̄ are
   ĉov(η̄, ξ̄) = ℓ_1^T S_{Y^{(s)},X^{(s)}} r_1
   ĉor(η̄, ξ̄) = ℓ_1^T S_{Y^{(s)},X^{(s)}} r_1 / {ℓ_1^T S_{Y^{(s)}} ℓ_1 · r_1^T S_{X^{(s)}} r_1}^{1/2}.   (10.8)
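The marginal-scaling computation in steps 1–4 and equation (10.8) can be sketched in R as follows; the function name plssem_cor is ours, and the sign of the returned value is arbitrary because eigenvectors are determined only up to sign.

# Sketch of the pls|sem-style proxy correlation (10.8) with marginal scaling.
plssem_cor <- function(X, Y) {
  n  <- nrow(X)
  Xs <- scale(X)                               # marginal centering and scaling
  Ys <- scale(Y)
  SYX <- crossprod(Ys, Xs) / n                 # S_{Y(s), X(s)}
  SX  <- crossprod(Xs) / n
  SY  <- crossprod(Ys) / n

  l1 <- eigen(SYX %*% t(SYX), symmetric = TRUE)$vectors[, 1]   # first eigenvector
  r1 <- eigen(t(SYX) %*% SYX, symmetric = TRUE)$vectors[, 1]

  cov_bar <- drop(t(l1) %*% SYX %*% r1)
  # compare the absolute value of this quantity with |Psi| estimates
  cov_bar / sqrt(drop(t(l1) %*% SY %*% l1) * drop(t(r1) %*% SX %*% r1))
}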
Aside from the scaling, the reduction to the proxy variables ξ̄ and η̄ is the
same as that from the simultaneous PLS method developed by Cook, Forzani,
and Liu (2023b, Section 4.3) and discussed in Section 5.3 when u = q = 1.
However, if we standardize jointly, then we recover the MLE. Rewriting the
MLE algorithm given in Section 10.4.1 to reflect its connection with pls|sem,
we have
1. Standardize the indicators jointly, X̃ = S_X^{-1/2}(X − X̄) and Ỹ = S_Y^{-1/2}(Y − Ȳ).
2. Construct the first eigenvector U_1 of S_{Ỹ,X̃} S_{X̃,Ỹ} and the first eigenvector V_1 of S_{X̃,Ỹ} S_{Ỹ,X̃}.
Then ĉor(η̄, ξ̄) = U_1^T S_{Ỹ,X̃} V_1 = D_1.
I. If joint standardization is used for the indicator vectors, then the pls|sem
estimator ĉor(η̄, ξ̄) is the same as the MLE of |Ψ| under the model (10.1)–
(10.2). This involves no dimension reduction beyond that arising from the
reduced rank model of Section 10.3. It is unrelated to PLS.
We recommend that pls|sem be based on joint standardization and not
marginal scaling, when permitted by the sample size.
II. If no standardization is used for the indicator vectors, then the pls|sem
estimator ĉor(η̄, ξ̄) is the same as the simultaneous PLS estimator (see Sec-
tion 5.3) when u = q = 1. But we may lose information if u ≠ 1 or q ≠ 1.
10.5 cb|sem
As hinted in Proposition 10.1, the parameter cor(η, ξ) is not identifiable with-
out further assumptions. We now give sufficient conditions for identification
of cor(ξ, η). The conditions needed are related to the identification
of ΣX|ξ and ΣY |η . As seen in Lemma A.6 of Appendix A.8.4, under model
(10.1)–(10.2),
if we can identify σ_ξ^2 and σ_η^2. From (10.9) and (10.10) that is equivalent to
identifying Σ_{Y|η} and Σ_{X|ξ}. We show in Proposition A.4 of Appendix A.8.4
that, under the regression constraints, (a) if Σ_{Y|η} and Σ_{X|ξ} are identifiable,
then σ_ξ^2, σ_η^2, |σ_{ξ,η}|, β_{X|ξ}, β_{Y|η} are identifiable and that (b) Σ_{Y|η} and Σ_{X|ξ} are
identifiable if and only if σ_ξ^2, σ_η^2 are so.
The next proposition gives conditions that are sufficient to ensure iden-
tifiability; its proof is given in Appendix A.8.5. Let (M )ij denote the ij-th
element of the matrix M and (V )i the i-th element of the vector V .
Corollary 10.1. Under the regression constraints, if ΣY |η and ΣX|ξ are di-
agonal matrices and if A and B each contain at least two non-zero elements
then ΣY |η , ΣX|ξ , σξ2 , and ση2 are identifiable.
The usual assumption in SEM is that ΣX|ξ and ΣY |η are diagonal matrices
(e.g. Henseler et al., 2014; Jöreskog, 1970). We see from (10.11) and Corol-
lary 10.1 that this assumption along with the regression constraints is sufficient
to guarantee that |cor(ξ, η)| is identifiable provided B and A contain at least
two non-zero elements. However, from Proposition 10.2, we also see that it is
not necessary for ΣY |η and ΣX|ξ to be diagonal. The assumption that ΣX|ξ and
ΣY |η are diagonal matrices means that, given ξ and η, the elements of Y and
X must be independent. In consequence, elements of X and Y are correlated
only by virtue of their association with η and ξ. The presence of any residual
correlations after accounting for ξ and η would negate the model and possibly
lead to spurious conclusions. See Henseler et al. (2014) for a related discussion.
In full, the usual SEM requires that ΣY |η and ΣX|ξ are diagonal matrices,
and it adopts the marginal constraints instead of the regression constraints.
By Proposition 10.2, our ability to identify parameters is unaffected by the
constraints adopted. However, we need also to be sure that the meaning of
Σ_{Γ^T Y} = γγ^T + Ω
Σ_{Γ^T Y, Φ^T X} = γφ^T cor(ξ, η)
Σ_{Φ^T X} = φφ^T + ∆,
where the notation is as used for model (10.7). From this we see that the joint
distribution of the envelope composites (ΓT Y, ΦT X) has the same structure
as the SEM shown in equation (A.31) of Appendix A.8.6, except that assum-
ing Ω and ∆ to be diagonal matrices is untenable from this structure alone.
Additionally, Ω and ∆ are not identifiable because they are confounded with
γγ T and φφT .
In short, it does not appear that there is a single method that can provide
estimators of both of the parameters |Ψ| and cov(ξ, η). However, an estimator
of |Ψ| can provide an estimated lower bound on |cov(ξ, η)|, as discussed in
Section 10.6.
10.6 Bias
Bias is a malleable concept, depending on the context, the estimator and the
quantity being estimated.
If the goal is to estimate |cor(ξ, η)| via maximum likelihood while assum-
ing that ΣY |η and ΣX|ξ are diagonal matrices, bias might not be a worrisome
issue. Although maximum likelihood estimators are generally biased, the bias
typically vanishes at a fast rate as the sample size increases. Bias may be an
issue when the sample size is not large relative to the number of parameters
to be estimated while SX and SY are still nonsingular. This issue is outside
the scope of this chapter.
If the goal is to estimate |Ψ| without assuming diagonal covariance ma-
trices and n > min(p, r) + 1 then its maximum likelihood estimator, the first
canonical correlation between X and Y , can be used and again bias might not
be a worrisome issue.
In some settings we may wish to use an estimator of |Ψ| also as an esti-
mator of |cor(ξ, η)| without assuming diagonal covariance matrices. The im-
plications of doing so are a consequence of the next proposition. Its proof is
in Appendix A.8.7.
Proposition 10.3. Under the model that stems from (10.1) and (10.2),
Ψ = [var{E(ξ | X)} var{E(η | Y )}]^{1/2} · cor(ξ, η)/(σ_ξ σ_η),
which agrees with Dijkstra's (1983, Section 4.3) conclusion that PLS
will underestimate |cor(ξ, η)|. Under the marginal constraints,
this bias will be small when E{var(ξ | X)} and E{var(η | Y )} are small, so
ξ and η are well predicted by X and Y . This may happen with a few highly
informative indicators. It may also happen as the number of informative in-
dicators increases, a scenario that is referred to as an abundant regression in
statistics (Cook, Forzani, and Rothman, 2012, 2013). On the other extreme, if
ξ and η are not well predicted by X and Y , then it is possible to have |Ψ| close
to 0 while |cor(ξ, η)| is close to 1, in which case the bias is close to 1. As with the
assumption that Σ_{Y|η} and Σ_{X|ξ} are diagonal matrices, it may be effectively
impossible to find support in the data for the claim that E{var(ξ | X)} and
E{var(η | Y )} are small.
Under the regression constraints,
this bias will again be small when E{var(ξ | X)} and E{var(η | Y )} are small.
In short, an estimator of |Ψ| is also an estimator of a lower bound on |cor(ξ, η)|.
Under either the marginal or regression constraints, the bias |cor(ξ, η)| − |Ψ|
will be small when the indicators are good predictors of the constructs.
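The lower-bound behavior is easy to see numerically. The following R sketch, under assumptions of our own choosing (three indicators per construct, identity conditional covariances, unit construct variances), compares the first sample canonical correlation, which estimates |Ψ|, with the generating value of cor(ξ, η).

set.seed(1)
n   <- 1e5
rho <- 0.8                                  # generating cor(xi, eta)
a   <- c(1, 1, 1)                           # loadings, our choice

latent <- MASS::mvrnorm(n, c(0, 0), matrix(c(1, rho, rho, 1), 2, 2))
X <- latent[, 1] %o% a + matrix(rnorm(3 * n), n, 3)   # Sigma_{X|xi} = I_3
Y <- latent[, 2] %o% a + matrix(rnorm(3 * n), n, 3)   # Sigma_{Y|eta} = I_3

# The first sample canonical correlation estimates |Psi|; under these choices
# it is roughly 0.75 * rho = 0.6, visibly below |cor(xi, eta)| = 0.8.
cancor(X, Y)$cor[1]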
which agrees with our general conclusion (10.12). Following Rönkkö et al.
(2016b), we simulated data with sample sizes N = 100 and N = 1000 obser-
vations on (X, Y ) according to these settings with various values for cor(η, ξ).
The estimators that we used are as follows:
MLE: This is the estimator described in Section 10.4.1. Recall that it is the
same as the pls|sem estimator with joint standardization, as developed in
Section 10.4.4.
ENV: This envelope estimator was computed using the methods described
in Cook and Zhang (2015b). It is as discussed in Section 10.4.3.
cb|sem: This estimator was discussed in Section 10.5. It was computed using
the lavaan package based on Rosseel (2012).
PLS: This designates the PLS estimator discussed in Section 10.4.3. It was
computed using the methods described by Cook, Forzani, and Liu (2023b,
Table 2).
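For readers who wish to reproduce a fit of this kind, the sketch below shows one way a cb|sem estimate of cor(ξ, η) could be obtained with lavaan; the indicator names x1–x3, y1–y3, the data frame dat, and the exact model syntax are illustrative assumptions, not the specification used for the figures.

library(lavaan)

# Hypothetical indicator names; 'dat' is a data frame holding them.
model <- '
  xi  =~ x1 + x2 + x3      # reflective measurement model for xi
  eta =~ y1 + y2 + y3      # reflective measurement model for eta
  eta ~~ xi                # covariance between the latent constructs
'
fit <- cfa(model, data = dat, std.lv = TRUE)   # std.lv = TRUE sets var(xi) = var(eta) = 1
# With unit latent variances, the eta ~~ xi estimate is the estimated cor(xi, eta):
subset(parameterEstimates(fit), lhs == "eta" & op == "~~" & rhs == "xi")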
We can see from the results shown in Figure 10.2, which are the aver-
ages over 10 replications, that at N = 100, Matrixpls tends to underes-
timate Ψ, while the other five estimators of Ψ clearly overestimate for all
FIGURE 10.2
Simulations with N = 100, ΣX|ξ = ΣY |η = I3 . Horizontal axes give the
true correlations; vertical axes give the estimated correlations. The lines x = y
represent equality.
cor(η, ξ) ∈ (0, 2/3), with ENV doing a bit better than the others. Because the
indicators are independent with constant variances conditional on the con-
structs, we did not expect much difference between MLE and pls|sem. Other
than cb|sem, none of these estimators made use of the fact that ΣX|ξ and ΣY |η
are diagonal matrices. At N = 1000 shown in Figure 10.3 the performance of
ENV is essentially flawless, while the other PLS-type estimators show a slight
propensity to overestimate Ψ.
FIGURE 10.3
Simulations with N = 1000, ΣX|ξ = ΣY |η = I3 . Horizontal axes give the
true correlations; vertical axes give the estimated correlations. The lines x = y
represent equality.
FIGURE 10.4
Simulations with N = 100, ΣX|ξ = ΣY |η = L(LT L)−1 LT + 3L0 LT0 . Horizontal
axes give the true correlations; vertical axes give the estimated correlations.
The lines represent x = y.
FIGURE 10.5
Results of simulations with N = 1000, ΣX|ξ = ΣY |η = L(LT L)−1 LT + 3L0 LT0 .
Horizontal axes give the true correlations; vertical axes give the estimated
correlations. The lines represent x = y.
is essentially flawless, while the MLE does well. The other four estimators all
have a marked tendency toward underestimation, particularly cb|sem.
FIGURE 10.6
Results of simulations with two reflective composites for X and Y , q = u = 2.
Horizontal axes give the true correlations; vertical axes give the estimated
correlations. (Constructed following Fig. 6.3 of Cook and Forzani (2023) with
permission.)
Specifically, let 1_k be a k × 1 vector of ones, let L_1 = (8, −0.7, 60, 1_9^T, −1_9^T)^T,
L_2 = (1, 0.3, −0.59/60, −1_9^T, 1_9^T)^T and L = (L_1, L_2). The conditional means
and variances were generated as E(X | ξ) = 0.1(L_1 + 0.9L_2)ξ, Σ_{X|ξ} =
5L(L^TL)^{-1}L^T + 0.1L_0L_0^T and E(Y | η) = 0.1(L_1 + 0.9L_2)η, Σ_{Y|η} =
5L(L^TL)^{-1}L^T + 0.1L_0L_0^T. From this structure we have Σ_{Y,η} = Σ_{X,ξ} =
0.1(L_1 + L_2), and
We see from Figure 10.6 that the ENV estimator does the best at all
sample sizes, while PLS does well at the larger sample sizes. The Matrixpls
estimator from Rönkkö et al. (2016b) underestimates the true correlation Ψ at
all displayed sample sizes because it implicitly assumes that q = u = 1. The
cb|sem estimator also underestimates its target cor(ξ, η) because it cannot
deal with non-diagonal covariance matrices.
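One way to generate data consistent with the two-composite setting described above is sketched below in R; the orthonormal complement basis L_0, the latent correlation rho, and the function name gen_two_composite are our choices for illustration.

# Sketch: generate one sample from the two-composite setting described above.
gen_two_composite <- function(n, rho = 0.5) {
  L1 <- c(8, -0.7, 60, rep(1, 9), rep(-1, 9))
  L2 <- c(1, 0.3, -0.59 / 60, rep(-1, 9), rep(1, 9))
  L  <- cbind(L1, L2)
  p  <- nrow(L)                                     # 21 indicators per block
  P  <- L %*% solve(crossprod(L)) %*% t(L)          # projection onto span(L)
  L0 <- eigen(diag(p) - P, symmetric = TRUE)$vectors[, 1:(p - 2)]
  Sig <- 5 * L %*% solve(crossprod(L)) %*% t(L) + 0.1 * L0 %*% t(L0)

  latent <- MASS::mvrnorm(n, c(0, 0), matrix(c(1, rho, rho, 1), 2, 2))
  EX <- latent[, 1] %o% (0.1 * (L1 + 0.9 * L2))     # E(X | xi)
  EY <- latent[, 2] %o% (0.1 * (L1 + 0.9 * L2))     # E(Y | eta)
  list(X = EX + MASS::mvrnorm(n, rep(0, p), Sig),
       Y = EY + MASS::mvrnorm(n, rep(0, p), Sig))
}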
10.8 Discussion
This chapter is based on the relatively simple path diagram of Figure 10.1.
Phrased in terms of a construct ξ and a corresponding vector of indicators
X, the following overarching conclusions about the role of PLS in path anal-
yses apply regardless of the complexity of the path model. The relationship
between the indicators and the construct can be reflective or formative. The
same conclusions hold with ξ and X replaced with η and Y .
Path modeling.
At its core, path modeling hinges on the ability of the investigators to identify
sets of indicators that are related to the constructs in the manner specified.
Our view is that an understanding of a construct should not be divorced from
the specific indicators selected for its study. This view leads naturally to us-
ing E(ξ | X) as a means of construct reification. Here, PLS and ENV can
have a useful role in reducing the dimension of X without loss of informa-
tion on ξ, which allows X to be replaced by reduced predictors XR so that
E(ξ | X) = E(ξ | XR ).
On the other hand, if marginal characteristics of the constructs like
cov(ξ, η) are of sole interest, then we see PLS as having little or no relevance
to the analysis, unless the lower bound |Ψ| ≤ |cov(ξ, η)| is useful.
PLS|SEM
The success of this method for estimating Ψ depends critically on the stan-
dardization/scaling used.
No scaling or standardization works best when one real composite of the
indicators, say ΦT X, extracts all of the available information from X about ξ.
That is, once the real composite Φ^TX is known and fixed, ξ and X are inde-
pendent or at least uncorrelated. However, we found no rationale for adopting
this one-composite framework, which we see as tying the hands of pls|sem.
Envelopes and their descendent PLS methods include methodology for esti-
mating the number of composites needed to extract all of the available infor-
mation. Expanding the one-composite framework now used by pls|sem has
the potential to increase its efficacy considerably.
Standardization using sample covariance matrices produces the maximum
likelihood estimator under the core model. We expect that it will also be
called for in more complicated models, but further investigation is needed to
affirm this. We see no compelling reason to use marginal scaling.
CB|SEM
Bias
We do not see bias as playing a dominant role in the debate over methodology.
If the goal is to estimate cor(η, ξ) using cb|sem, then estimation bias may,
depending on the sample size, play a notable role, as maximum likelihood
estimators are generally biased but asymptotically unbiased. If the goal is to
estimate cor(η, ξ) using PLS, then structural bias is relevant. But if the goal
is to estimate Ψ, then structural bias has no special relevance if PLS is used.
11
Ancillary Topics
In this chapter we present various sidelights and some extensions to enrich our
discussions from previous chapters. In Section 11.1 we discuss the NIPALS and
SIMPLS algorithms as instances of general algorithms N and S introduced in
Section 1.5. In Section 11.2 we discuss bilinear models that have been used
to motivate PLS algorithms, particularly the simultaneous reduction of re-
sponses and predictors, and show that they rely on an underlying envelope
structure. This connects with the discussion in Chapter 5 on simultaneous
reduction. The relationship between NIPALS, SIMPLS, and conjugate gra-
dient algorithms is discussed in Section 11.3. Sparse PLS is discussed briefly
in Section 11.4, and Section 11.5 has an introductory discussion of PLS for
multi-way (tensor-valued) predictors. A PLS algorithm for generalized linear
regression is proposed in Section 11.6.
We know from Chapter 3 that the NIPALS and SIMPLS algorithms for
predictor reduction provide estimators of the envelope EΣX(B) and that they
depend on the data only via SX,Y and SX . To emphasize this aspect of the
algorithms, we denote them as NIPALS(SX,Y , SX ) and SIMPLS(SX,Y , SX ).
With Û = S_{X,Y} and M̂ = S_X we have the following connection between these
algorithms for predictor reduction in linear models:
NIPALS(S_{X,Y}, S_X) = N(S_{X,Y} S_{X,Y}^T, S_X)
SIMPLS(S_{X,Y}, S_X) = S(S_{X,Y} S_{X,Y}^T, S_X).
However, the underlying theory allows for many other options for us-
ing N and S to estimate E_{Σ_X}(B). Recall from Proposition 1.6 that, for all
k, E_M(M^k A) = E_M(A) and, for k ≠ 0, E_{M^k}(A) = E_M(A). In particular,
E_{Σ_X}(B) = E_{Σ_X}(C_{X,Y}). This suggests that when n ≫ p, we could also use the
algorithms N(β̂_ols β̂_ols^T, S_X) and S(β̂_ols β̂_ols^T, S_X) to estimate E_{Σ_X}(B). As a second
instance, it follows also from Proposition 1.6 that the envelope is unchanged under
such transformations of its arguments, which indicates that we could also use
N(S_{X,Y} S_{X,Y}^T, S_X^2), N(S_X^2 S_{X,Y} S_{X,Y}^T S_X^2, S_X), or the corresponding versions
from algorithm S to estimate E_{Σ_X}(B). The
essential point here is that there are many choices for M̂ and Â, and conse-
quently many different versions of NIPALS and SIMPLS, that give the same
envelope in the population but can produce different estimates in applica-
tion. Likelihood-based approaches like that for predictor envelopes discussed
in Section 2.3 can help alleviate this ambiguity.
To illustrate the associated reasoning, recall that the likelihood-based
development in Section 2.3.1 was put forth as a basis for estimating
EΣX(B), but further reasoning is needed to see its implications for PLS al-
gorithms. We start by rewriting the partially maximized likelihood function
from (2.10),
L_q(G) = log|G^T S_{X|Y} G| + log|G^T S_X^{-1} G|
       = log|G^T S_{X|Y} G| + log|G^T {S_{X|Y} + S_{X◦Y}}^{-1} G|
       = log|G^T S_{X|Y} G| + log|G^T {S_{X|Y} + S_{X,Y} S_Y^{-1} S_{Y,X}}^{-1} G|.
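For reference, the partially maximized objective can be evaluated directly for any semi-orthogonal G; the small R helper below is ours and simply takes the sample matrices S_{X|Y} and S_X as inputs.

# Sketch: evaluate L_q(G) = log|G' S_{X|Y} G| + log|G' S_X^{-1} G|
# for a semi-orthogonal p x q matrix G.
Lq <- function(G, SXgY, SX) {
  as.numeric(determinant(t(G) %*% SXgY %*% G, logarithm = TRUE)$modulus +
             determinant(t(G) %*% solve(SX) %*% G, logarithm = TRUE)$modulus)
}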
M = Σ_{X|Y} and A = Σ_{X,Y} Σ_Y^{-1} Σ_{Y,X}. From Proposition 1.8 and the discussion
of (1.26) we have the following equivalences
E_M(A) = E_{Σ_{X|Y}}(Σ_{X,Y} Σ_Y^{-1/2}) = E_{Σ_X}(Σ_{X,Y} Σ_Y^{-1/2}) = E_{Σ_X}(B Σ_Y^{-1/2}) = E_{Σ_X}(B).
As pointed out in Section 3.9, we again see that the likelihood is based
on the standardized response vector Z = S_Y^{-1/2} Y, which suggests that NI-
PALS and SIMPLS algorithms also be based on the standardized responses:
NIPALS(S_{X,Y} S_Y^{-1/2}, S_X) and SIMPLS(S_{X,Y} S_Y^{-1/2}, S_X), with corresponding
adaptations to algorithms N and S. This idea was introduced in Section 3.9,
but here it is used as an illustration of the general recommendation that a
likelihood can be used to guide the implementation of a PLS algorithm. The
algorithms NIPALS(SX,Y , SX ) and SIMPLS(SX,Y , SX ) with the original un-
standardized responses might be considered when SY is singular.
X_{n×p} = T R_{p×u}^T + E_{n×p}
Y_{n×r} = T U_{r×u}^T + F_{n×r}
T_{n×u} = X W_{p×u},
where min(p, r) ≥ u and the matrix W of weights has full column rank. The
rows of R and U represent the loadings and the rows of T represent the scores.
Descriptions of this bilinear model in the PLS literature rarely mention any
stochastic properties of E and F, regarding them generally as unstructured
residual or error matrices. Martens and Næs (1989), as well as others, seem
to treat the bilinear model as a data description rather than as a statistical
model per se. To develop the connection with envelopes, it is helpful to re-
formulate the model in terms of uncentered random vectors XiT and YiT that
correspond to the rows of X and Y, and in terms of independent zero mean
error vectors eTi and fiT , representing the rows of E and F . Then written in
vector form the bilinear model becomes for i = 1, . . . , n
Xi = αX + Rti + ei
Yi = αY + U ti + fi (11.1)
ti = W T Xi ,
where α_X and α_Y are intercept vectors that are needed because the model
is in terms of uncentered data, and e ⫫ f. In the bilinear model (5.13) for
simultaneous reduction of responses and predictors, the latent vectors v_i and
t_i are assumed to be independent, while in (11.1) the corresponding latent
vectors are the same and this common latent vector is a linear function of X.
As shown in the following discussion, the condition ti = W T Xi has negligible
impact on the X model but it does lead to envelopes in terms of the Y model.
It follows from (11.1) that we can take W to be semi-orthogonal without
any loss of generality, so we assume that in the following. Substituting for t
in the Y equation,
Yi = αY + U W TXi + fi . (11.2)
Thinking of this Y -model in the form of the multivariate linear model (1.1),
we must have B ⊆ span(W ). Without structure beyond assuming that the
error vectors fi are independent copies of a random vector f with mean 0
and positive definite variance, this is a reduced rank multivariate regression
model (Cook, Forzani, and Zhang, 2015; Izenman, 1975). With W regarded
as known, the estimator for β is
\widehat{WU^T} = W(W^T S_X W)^{-1} W^T S_{X,Y},
which is the same form as given in Tables 3.1 and 3.4. The issue then is how
we estimate W , the basis for which must come from the X-model.
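With W regarded as known, the fit is a one-line computation; the R helper below is ours and uses centered data matrices X and Y.

# Sketch: with W (p x u) known, estimate W U^T by W (W' S_X W)^{-1} W' S_{X,Y}.
fit_given_W <- function(X, Y, W) {
  n  <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  Yc <- scale(Y, center = TRUE, scale = FALSE)
  SX  <- crossprod(Xc) / n
  SXY <- crossprod(Xc, Yc) / n
  W %*% solve(t(W) %*% SX %*% W, t(W) %*% SXY)    # estimate of W U^T
}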
Since this equality holds for all values of W T X and W has full column rank,
we must have W T αX = 0, W T R = Iu , and W T e ≡ 0. In consequence, we can
take R = W without loss of generality. Recalling that W is semi-orthogonal,
the X-model (11.3) then reduces to Xi = αX + PW Xi + QW ei . This implies
that QW X = QW αX + QW e and so the X-model reduces further to simply
X = αX + PW X + e
= PW αX + QW αX + PW X + QW e
= PW αX + PW X + QW X
= PW X + QW X, (11.4)
since W T αX = 0. This holds for any W , so the X-model doesn’t really add
any restrictions to the problem. The only restriction on W arises from the
Y -model, which implies that span(W ) must contain span(β). This line of rea-
soning does not give rise to envelopes directly because (11.4) holds for any
span(W ) that contains span(β). The final step to reach envelopes is to require
that cov(PW X, QW X) = cov(PW X, QW e) = 0. With this and the previous
conclusion that B ⊆ span(W ), we see that span(W ) is a reducing subspace of
ΣX that contains B, and then u = q becomes the number of components. As
we have seen previously in this chapter, PLS algorithms NIPALS and SIMPLS
require this condition in the population, although we have not seen it stated
in the literature as part of a bilinear model.
In consequence,
β = Σ_X^{-1} Σ_{X,Y} = (W Σ_t W^T + σ_e^2 I_p)^{-1} W Σ_t B^T V^T.
β = Σ_X^{-1} Σ_{X,Y}
Since (A^{-1} Σ_t A^{-1} + σ_e^2 I_q)^{-1} A^{-1} Σ_t B^T has rank q, r > q, and V has full column
rank, we have that
and Viallon (2022) raised additional issues and proposed a generalization that
addresses some of them.
Multiple versions of the bilinear model have been used to motivate PLS.
Our assessment is that they can be confusing and more of a hindrance than
a help, particularly since PLS can be motivated fully using envelopes.
Aω = b, (11.7)
ω̃ = GH −1 GT b
= G(GT AG)−1 GT b,
which serves also to highlight the fact that only span(G) = EA (span(b)) mat-
ters. This form for ω̃ could be particularly advantageous if G were known and
its column dimension s = dim{EA (span(b))} were small relative to r, or if the
eigenvalues of H0 were small enough to cause numerical difficulties. Of course,
implementations of this idea require an accurate numerical approximation of
a suitable G. We might consider developing approximations of G by using the
general PLS-type algorithms N(b, A) or S(b, A), but in this discussion there
is not necessarily a statistical context associated with (11.7), so it may be
unclear how to use cross validation or a holdout sample to aid in selecting a
suitable dimension s. Nevertheless, we demonstrate in this section that the
highly regarded conjugate gradient method for solving (11.7) is in fact an en-
velope method that relies on NIPALS and SIMPLS for an approximation of
G (e.g. Phatak and de Hoog, 2002; Stocchero, de Nardi, and Scarpa, 2020).
In keeping with the theme of this book, we now consider (11.7) in the
context of model (1.1) with a univariate response and our standard notation
A = var(X) = ΣX , ω = β and b = cov(X, Y ) = σX,Y . In this context, (11.7)
becomes the normal equations for the population, ΣX β = σX,Y . Sample ver-
sions are discussed later in this section.
Table 11.1 gives the conjugate gradient algorithm in the context of solving
the normal equations ΣX β = σX,Y for a regression with a real response. It
was adapted from Elman (1994) and it applies to solving any linear system
Aω = b for A symmetric and positive definite.
β_1 = β_npls = β_spls when the number of components is q = 1 (cf. Tables 3.1 and
3.4). The estimators are thus identical when the respective stopping criteria
are met. For CGA to stop at β_1 we need ‖Q^T_{σ_{X,Y}(Σ_X)} σ_{X,Y}‖ < ε. For NIPALS
to stop with q = 1 we need, from Table 3.1, Q^T_{σ_{X,Y}(Σ_X)} σ_{X,Y} = 0. Thus, aside
from a relatively minor difference in the population stopping criterion, the
CGA and NIPALS are so far identical.
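The equivalence at q = 1 is easy to verify numerically. The R sketch below, using sample quantities of our own construction in place of Σ_X and σ_{X,Y}, runs one conjugate gradient step from β_0 = 0 and compares it with the one-component PLS form W(W^T S_X W)^{-1} W^T s_{X,Y} with W = s_{X,Y}.

set.seed(2)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rep(c(1, 0), c(3, p - 3)) + rnorm(n)

SX  <- cov(X)                        # sample version of Sigma_X
sxy <- cov(X, y)                     # sample version of sigma_{X,Y}

# One CG step for SX %*% beta = sxy, starting from beta0 = 0:
r0 <- sxy                            # initial residual b - A beta0
p0 <- r0                             # initial search direction
alpha0 <- drop(crossprod(r0) / (t(p0) %*% SX %*% p0))
beta_cg1 <- alpha0 * p0

# One-component PLS estimator with weight vector W = sxy:
beta_pls1 <- sxy %*% solve(t(sxy) %*% SX %*% sxy) %*% crossprod(sxy, sxy)

max(abs(beta_cg1 - beta_pls1))       # agrees up to rounding error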
Assuming that the stopping criterion is not met, the next part of CGA is
to compute
r_1^T σ_{X,Y} = σ_{X,Y}^T Q_{σ_{X,Y}(Σ_X)} σ_{X,Y} = 0
p_1^T σ_{X,Y} = r_1^T σ_{X,Y} + σ_{X,Y}^T Q_{σ_{X,Y}(Σ_X)} Q^T_{σ_{X,Y}(Σ_X)} σ_{X,Y} = r_1^T r_1.
β_2 = β_1 + (r_1^T r_1 / p_1^T Σ_X p_1) p_1
    = p_0(p_0^T Σ_X p_0)^{-1} p_0^T σ_{X,Y} + p_1(p_1^T Σ_X p_1)^{-1} p_1^T σ_{X,Y}
    = V_2(V_2^T Σ_X V_2)^{-1} V_2^T σ_{X,Y}
    = β_spls when q = 2,
1. span(ri ) = span(wi+1 ), i = 0, 1, . . . , q − 1
2. span(pi ) = span(vi+1 ), i = 0, 1, . . . , q − 1
This proposition tells us that in effect CGA uses NIPALS and SIMPLS to
pursue an envelope solution to the linear system of equations Σ_X β = σ_{X,Y}. All
three algorithms – CGA, NIPALS, and SIMPLS – can be regarded as meth-
ods for estimating β based on the predictor envelope E_{Σ_X}(B). The conjugate
gradient method (Hestenes and Stiefel, 1952) preceded NIPALS and SIMPLS
This alternative version of steepest descent is in principle better than the basic
version because at each step it optimizes simultaneously over the coefficients
α_j of all directions d_j, but it is relatively complicated and rather unwieldy
in application. However, there is an equivalent algorithm that gives the same
solution at the end (but not in the intermediate steps) and updates only from
the last step. The key in this variation is to find a single direction that gives the
same iterate at each step. This leads then to CGA (Björck, 1966; Elman, 1994):
2. Set β_1 = β_0 + δ_0 p_0, where
δ_0 = arg min_δ φ(β_0 + δ p_0) = d_0^T d_0 / (p_0^T Σ_X p_0),
Recall from (2.11) that the predictor envelope estimator of φ in model (2.5)
is based on minimizing over semi-orthogonal matrices G ∈ Rp×q the objective
function
L_q(G) = log|G^T S_{X|Y} G| + log|G^T S_X^{-1} G|.
β = Γη
Σvec(X) = ΓΩΓT + Γ0 Ω0 ΓT0
expected. As explained in Olivieri and Escandar (2014), this is due to the fact
that the bilinear structure is only partially used in the estimation process.
Following the ideas for response envelopes given by Ding and Cook (2018),
we propose here a new algorithm that uses the structure of bilinearity from
the beginning. Specifically, consider the bilinear model
β_1 = W_{q_1} A_1
β_2 = V_{q_2} A_2
Σ_1 = W_{q_1} Ω_1 W_{q_1}^T + W_{q_1,0} Ω_{1,0} W_{q_1,0}^T
Σ_2 = V_{q_2} Ω_2 V_{q_2}^T + V_{q_2,0} Ω_{2,0} V_{q_2,0}^T.
11.6.1 Foundations
Instead of proposing a specific modeling environment for envelopes, Cook
and Zhang (2015a) started with an asymptotically normal estimator. Let
θ ∈ Θ ⊆ R^m denote a parameter vector, which we decompose into a vec-
tor φ ∈ R^p, p ≤ m, of targeted parameters and a vector ψ ∈ R^{m−p} of nuisance
parameters. We require that √n(φ̂ − φ) converge in distribution to a normal
random vector with mean 0 and covariance matrix V_{φφ}(θ) > 0 as n → ∞.
Allowing V_{φφ}(θ) to depend on the full parameter vector θ means that the
variation in φ̂ can depend on the parameters of interest φ in addition to the
nuisance parameters ψ. In many problems we may construct φ and ψ to be
orthogonal parameters in the sense of Cox and Reid (1987). In the remainder
of this section, we suppress notation indicating that Vφφ (θ) may depend on θ
and write instead Vφφ in place of Vφφ (θ).
See Cook and Zhang (2015a) for methods of estimation in the general setting.
where ϑi = α+β TXi and C(ϑi ) = yi ϑi −b(ϑi ) is the kernel of the log likelihood.
The full log-likelihood can be written as
C_n(α, β) = Σ_{i=1}^n C(ϑ_i | y_i) = Σ_{i=1}^n C(ϑ_i) + Σ_{i=1}^n c(y_i).
Different log likelihood functions are summarized in Table 11.3 via the kernel.
We next briefly review Fisher scoring, which is the standard iterative
method for maximizing Cn (α, β). At each iteration of the Fisher scor-
ing method, the update step for βb can be summarized in the form of a
TABLE 11.3
A summary of one-parameter exponential families. For the normal, σ = 1.
A(ϑ) = 1 + exp(ϑ). C 0 (ϑ) and C 00 (ϑ) are the first and second derivatives of
C(ϑ) evaluated at the true value.
weighted least squares (WLS) estimator where the weights are defined as
ω(ϑ) = −C 00 (ϑ). With the canonical link, as we are assuming, −C 00 (ϑ) =
b00 (ϑ) = var(Y | ϑ). For a sample of size n, we define the population weights as
ω_i = ω(ϑ_i) = var(Y | ϑ_i) / Σ_{j=1}^n var(Y | ϑ_j),   i = 1, . . . , n,
which are normalized so that Σ_{i=1}^n ω_i = 1. Estimated weights are obtained by
simply substituting estimates for the ϑi ’s. In keeping with our convention, we
use the same notation for population and estimated weights, which should be
clear from context. Let Ω = diag(ω1 , . . . , ωn ) and define the weighted sample
estimators, which use sample weights,
X̄_{(Ω)} = Σ_{i=1}^n ω_i X_i
S_{X(Ω)} = Σ_{i=1}^n ω_i [X_i − X̄_{(Ω)}][X_i − X̄_{(Ω)}]^T
S_{X,Ẑ(Ω)} = Σ_{i=1}^n ω_i [X_i − X̄_{(Ω)}][Ẑ_i − Z̄_{(Ω)}]^T,
where Zbi = ϑbi + {Yi − µ(ϑbi )}/ωi is a pseudo-response variable at the cur-
rent iteration. The weighted covariance SX(Ω) is the sample version of the
population-weighted covariance matrix
β = Σ_{X(ω)}^{-1} Σ_{X,Z(ω)}.
The asymptotic covariance matrix of β̂_wls is (e.g. Cook and Zhang, 2015a)
avar(√n β̂) = V_{ββ}(θ) = {E(−C'') Σ_{X(ω)}}^{-1},
From this we see that the envelope for improving β in a GLM has the same
form as the envelope EΣX(B) = EΣX (CX,Y ) for linear predictor reduction. This
implies that we can construct PLS-type estimators of β in GLMs by first per-
forming dimension reduction using a NIPALS or SIMPLS algorithm. Specifi-
cally, implement a sample version of the NIPALS algorithm in Table 3.1b or
the SIMPLS algorithm in Table 3.4b, substituting S_{X(Ω)} for Σ_X and S_{X,Ẑ(Ω)}
sion of Y on the reduced predictors W TX. Let νb denote the estimator of the
coefficient vector from this regression. The corresponding PLS estimator of β
is W νb.
Following these ideas, Table 11.4 gives an algorithm for using a PLS to
fit a one-parameter family GLM. The algorithm has two levels of iteration.
The outer level, which is shown in the table, are the score-based iterations
for fitting a GLM. The inner iterations, which are not shown explicitly, occur
when calling a PLS algorithm during each outer GLM iteration. The “For
k = 1, 2, . . . ,” instruction indexes the GLM iterations. For each value of k
there is an instruction to “call PLS algorithm” with the current parameter
values. The calls to a PLS algorithm all have the same number of components
q and so these PLS iterations terminate after q stages. The overall algorithm
stops when the coefficient estimates no longer change materially. The algo-
rithm does not require n > p.
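A compact R sketch of these two levels of iteration is given below for a Bernoulli response with the canonical logit link. For the inner reduction we use the Krylov-sequence characterization of single-(pseudo)response PLS weight spaces as a stand-in for the NIPALS/SIMPLS calls of Table 11.4, and all function names are ours; the sketch is illustrative rather than a reproduction of the book's algorithm.

# Sketch of a PLS-GLM fit (Bernoulli response, canonical logit link), q components.
pls_glm_bernoulli <- function(X, y, q = 2, maxit = 25, tol = 1e-6) {
  n <- nrow(X); p <- ncol(X)
  alpha <- 0; beta <- rep(0, p)
  for (k in seq_len(maxit)) {
    theta <- alpha + drop(X %*% beta)              # linear predictor
    mu    <- plogis(theta)                         # E(Y | theta)
    w     <- mu * (1 - mu)                         # var(Y | theta)
    w     <- w / sum(w)                            # normalized weights, sum to 1
    z     <- theta + (y - mu) / w                  # pseudo-responses

    xbar <- colSums(w * X); zbar <- sum(w * z)     # weighted means
    Xc   <- sweep(X, 2, xbar); zc <- z - zbar
    SXw  <- crossprod(Xc, w * Xc)                  # weighted S_{X(Omega)}
    sXz  <- crossprod(Xc, w * zc)                  # weighted S_{X,Z(Omega)}

    # Krylov weights span{s, Ms, ..., M^(q-1)s}; for a single pseudo-response
    # this spans the same space as the NIPALS/SIMPLS weight vectors.
    W <- sXz
    for (j in seq_len(q - 1)) W <- cbind(W, SXw %*% W[, j])
    W <- qr.Q(qr(W))                               # orthonormal basis

    fit <- glm(y ~ I(X %*% W), family = binomial)  # GLM on reduced predictors
    beta_new <- drop(W %*% coef(fit)[-1])
    alpha    <- unname(coef(fit)[1])
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  list(alpha = alpha, beta = beta)
}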
11.6.3 Illustration
In this section we use the relatively straightforward simulation scenario
of Cook and Zhang (2015a, Sec. 5.1) to support the algorithm of Ta-
ble 11.4. We generated n = 150 observations according to a logistic regression
Y | X ∼ Bernoulli(logit(β TX)), where β = (0.25, 0.25)T and X follows a
TABLE 11.4
A PLS algorithm for predictor reduction in GLMs.
FIGURE 11.1
PLS-GLM data: Estimates of the densities of the estimators of the first com-
ponent β1 in the simulation to illustrate the PLS algorithm for GLMs in
Table 11.4. Linear envelope and PLS starts refer to starting the iterations at the
envelope and PLS estimators from a fit of the linear model, ignoring the GLM
structure.
Let Ybi = Ȳ + βbols Xi and ri = Yi − Ybi denote the i-th vectors of fitted values and
residuals, i = 1, . . . , n, and let D = β − βbols . Then after substituting Ȳ for α,
the remaining log likelihood L(β, ΣY |X ) to be maximized can be expressed as
(2/n) L(β, Σ_{Y|X}) = c − log|Σ_{Y|X}| − n^{-1} Σ_{i=1}^n (Y_i − Ȳ − βX_i)^T Σ_{Y|X}^{-1} (Y_i − Ȳ − βX_i)
= c − log|Σ_{Y|X}| − n^{-1} Σ_{i=1}^n (r_i − DX_i)^T Σ_{Y|X}^{-1} (r_i − DX_i)
= c − log|Σ_{Y|X}| − n^{-1} tr( Σ_{i=1}^n r_i r_i^T Σ_{Y|X}^{-1} ) − n^{-1} tr( D Σ_{i=1}^n X_i X_i^T D^T Σ_{Y|X}^{-1} )
= c − log|Σ_{Y|X}| − n^{-1} tr( Σ_{i=1}^n r_i r_i^T Σ_{Y|X}^{-1} ) − tr( D S_X D^T Σ_{Y|X}^{-1} ),
where c = −r log(2π) and the penultimate step follows because Σ_{i=1}^n r_i X_i^T =
0. Consequently, L(β, Σ_{Y|X}) is maximized over β by setting β = β̂_ols so that D = 0,
M = PR M PR + QR M QR . (A.1)
M = (PR + QR )M (PR + QR ) = PR M PR + QR M QR .
2
3. |M | = |AT M A| × |AT0 M A0 |.
5. If R ⊆ span(M ) then
The conclusion follows since (A, A0 ) is an orthogonal matrix and thus has
determinant 1.
PR M −1 PR + QR M −1 QR = M −1 PR + M −1 QR = M −1 .
I. |AT0 M A0 | = |M | × |AT M −1 A|
AT M A A T M A0
|M | = |OT M O| =
AT0 M A AT0 M A0
= |AT M A| × |AT0 M A0 − AT0 M A(AT M A)−1 AT M A0 |
≤ |AT M A| × |AT0 M A0 |.
M = PR M PR + QR M QR
= R(RT M R)RT + QR M QR
:= RΩRT + QR M QR .
If v ∈ R_1^⊥ + R_2^⊥ then it can be written as v = v_1 + v_2, where v_1 ∈ R_1^⊥ and
v_2 ∈ R_2^⊥. Then Mv = Mv_1 + Mv_2 ∈ R_1^⊥ + R_2^⊥.
Proof. Using a variant of the Woodbury identity for matrix inverses we have
Proof. Since PS + QS = Ip ,
The conclusion now follows from Proposition 1.2 which implies that S reduces
ΣX if and only if
ΣX = PS ΣX PS + QS ΣX QS .
2
WdT Wd = Id , d = 1, . . . , q.
Proof. Since the columns of Wd are all eigenvectors with length one, the diag-
onal elements of WdT Wd must all be 1. We show orthogonality by induction.
For d = 2,
w2 = `1 (XT2 Y2 YT2 X2 )
= `1 (QTw1 (SX ) XT1 Y2 YT2 X2 ),
1
where the second step follows by substituting (3.1) for X2 . It follows that
w1T w2 = 0.
For d = 3, substituting (3.1) twice,
w3 = `1 (XT3 Y3 YT3 X3 )
= `1 (QTw2 (SX ) XT2 Y3 YT3 X3 )
2
But it follows from (3.4) that SX2 w1 = 0 and consequently Qw2 (SX2 ) w1 = w1
and it follows that w1T w3 = 0. The rest of the justification follows straightfor-
wardly by induction and is omitted.
for d = 1, . . . , q − 1, j = 1, . . . , d.
X2 = X1 − s1 l1T
= X1 − X1 w1 (w1T XT1 X1 w1 )−1 w1T XT1 X1
= QX1 w1 X1 .
Clearly,
XT2 s1 = XT2 X1 w1 = XT1 QX1 w1 X1 w1 = 0
Proof. The representations for Xd+1 and Yd+1 follow straightforwardly from
Lemma 3.2, (3.6) and the definition of sd = Xd wd : Since, by Lemma 3.2,
Qsd+1 Qsj = I − Psd+1 − Psj , the conclusion follows by using (3.6).
For the fitted values we take βbnpls from Table 3.1 to get,
SXd ,Yd = n−1 XTd Yd = n−1 XT1 QSd−1 QSd−1 Y1 = n−1 XT1 QSd−1 Y1 = SXd ,Y1 .
It follows from (A.7) that X1 wk+1 must fall into the orthogonal com-
plement of the subspace spanned by the columns of Sk . Consequently,
sk+1 = QSk X1 wk+1 = X1 wk+1 . Thus, Sd = X1 Wd .
Let Dd = diag(ks1 k2 , . . . , ksd k2 ). As defined in Table 3.1, ld = XTd sd /ksd k2 .
Substituting the form of Xd from (A.5), we get
where the second equality follows from the first consequence of Lemma 3.2.
This implies that Ld = XT1 Sd Dd−1 . Similarly, Md = YT1 Sd Dd−1 . Substituting
these into Y
b npls and using the fact that Sq = X1 Wq we get
Y
b npls = X1 Wq (LTq Wq )−1 MqT
= X1 Wq (Dq−1 SqT X1 Wq )−1 Dq−1 SqT Y1
= Sq (SqT Sq )−1 SqT Y1 = PSq Y1 .
the coordinates of XT1 in terms of the eigenvectors of SX . Recall also that wd∗
and s∗d denote the weights and scores that result from applying NIPALS to
data (Z1 , Y1 ).
Lemma 3.4 states that, for d = 1, . . . , q, (a) sd = s∗d and (b) wd = V wd∗ .
We see from the form of wd that, for d = 1, . . . , q, wd ∈ span(SX ),
s1 = X1 `1 (XT1 Y1 YT1 X1 )
= Z1 V T `1 (V ZT1 Y1 YT1 Z1 V T )
= Z1 `1 (ZT1 Y1 YT1 Z1 )
= s∗1 .
For d = 2
Under the induction hypothesis, assume that the conclusion holds for d < q.
Then we have for the next term in the sequence
Proof. Part (a) follows by straightforward algebra and its proof is omitted.
For part (b) we have
Σ−1/2 PQ Σ1/2 v Σ
1/2
= QV (Σ) v{v T QTV (Σ) ΣQV (Σ) v}−1 v T QTV (Σ) Σ
Σ1/2 V
where the second equality follows by substituting for ∆ and the third follows
from part (a). Next, multiplying on the left by QV (Σ) and using (A.8) we have
QV (Σ) Qv(∆) = QV (Σ) − QV (Σ) v(v T QTV (Σ) ΣQV (Σ) v)−1 v T QTV (Σ) Σ
= QV (Σ) − PQV (Σ) v(Σ)
= I − PV (Σ) − PQV (Σ) v(Σ) .
To show part (d) we begin by substituting for the middle ∆ and then using
part (c):
Proof.
B1 + B2 ⊆ EΣ (B1 ) + EΣ (B2 ) ⊆ EΣ (B1 + B2 ). (A.9)
The first containment seems clear, as Bj ⊆ EΣ (Bj ), j = 1, 2. For the second
containment, EΣ (B1 ) ⊆ EΣ (B1 + B2 ) and EΣ (B2 ) ⊆ EΣ (B1 + B2 ). The second
containment follows since EΣ (B1 + B2 ) is a subspace.
We next wish to show that EΣ (B1 ) + EΣ (B2 ) reduces Σ. The envelopes
EΣ (B1 ) and EΣ (B2 ) both reduce Σ and so by definition,
In consequence
Restatement. Assume the regression structure given in (4.1) and (4.7) with
p fixed. Then
β̂_pls = σ̂ (σ̂^T σ̂)(σ̂^T S_X σ̂)^{-1}.
We first expand σ̂(σ̂^T σ̂) and σ̂^T S_X σ̂. For this, we need the following expan-
sions (see Cook and Setodji, 2003)
√n(σ̂ − σ) = n^{-1/2} Σ_{i=1}^n (x_i y_i − σ) + O_p(n^{-1/2}),
√n(S_X − Σ) = n^{-1/2} Σ_{i=1}^n (x_i x_i^T − Σ) + O_p(n^{-1/2}).
Step I: Expand σ̂(σ̂^T σ̂) = σ̂‖σ̂‖^2.
σ̂(σ̂^T σ̂) = (σ̂ − σ + σ){(σ̂ − σ + σ)^T (σ̂ − σ + σ)}
= (σ̂ − σ)‖σ‖^2 + σ(σ̂ − σ)^T σ + σσ^T(σ̂ − σ) + σ‖σ‖^2 + O_p(n^{-1}),
so
√n(σ̂‖σ̂‖^2 − σ‖σ‖^2) = √n{(σ̂ − σ)‖σ‖^2 + σσ^T(σ̂ − σ) + σσ^T(σ̂ − σ)} + O_p(n^{-1/2})   (A.10)
= √n(σ̂ − σ)‖σ‖^2 + 2√n σσ^T(σ̂ − σ) + O_p(n^{-1/2})
= ‖σ‖^2 √n{(σ̂ − σ) + 2P_σ(σ̂ − σ)} + O_p(n^{-1/2})
= ‖σ‖^2 (I_p + 2P_σ) √n(σ̂ − σ) + O_p(n^{-1/2})   (A.11)
= ‖σ‖^2 (I_p + 2P_σ) n^{-1/2} Σ_{i=1}^n (x_i y_i − σ) + O_p(n^{-1/2}).
Step II. Expand (σ̂^T S_X σ̂)^{-1}.
√n(σ̂^T S_X σ̂ − σ^T Σσ) = √n{(σ̂ − σ + σ)^T (S_X − Σ + Σ)(σ̂ − σ + σ) − σ^T Σσ}
= √n(σ̂ − σ)^T Σσ + √n σ^T(S_X − Σ)σ + √n σ^T Σ(σ̂ − σ) + O_p(n^{-1/2})
= √n σ^T(S_X − Σ)σ + 2√n σ^T Σ(σ̂ − σ) + O_p(n^{-1/2}).
Next, we derive a general result for inverse expansions. Let Â, A ∈ R^{q×q}
with A nonsingular. Assume that √n(Â − A) converges in distribution at rate
√n and that Â^{-1} = A^{-1} + O_p(n^{-1/2}). Then
√n(Â Â^{-1} − I) = 0 ⇒
√n{(Â − A + A)(Â^{-1} − A^{-1} + A^{-1}) − I} = 0 ⇒
√n{(Â − A)(Â^{-1} − A^{-1}) + (Â − A)A^{-1} + A(Â^{-1} − A^{-1})} = 0.
Since √n(Â − A)(Â^{-1} − A^{-1}) = O_p(n^{-1/2}), we have
√n{(Â − A)A^{-1} + A(Â^{-1} − A^{-1})} = O_p(n^{-1/2}) ⇒
√n(Â^{-1} − A^{-1}) = −√n A^{-1}(Â − A)A^{-1} + O_p(n^{-1/2}).
Step IV. Substitute the expansions for σ̂ − σ and S_X − Σ.
√n(β̂_pls − β) = n^{-1/2}(Φ^T ΣΦ)^{-1} Σ_{i=1}^n [{I_p + 2P_Φ − 2(Φ^T ΣΦ)^{-1} P_Φ Σ}(x_i y_i − σ) − (Φ^T ΣΦ)^{-1} P_Φ (x_i x_i^T − Σ)σ] + O_p(n^{-1/2}).
Since β = Σ^{-1}σ = Φδ^{-1}Φ^T σ = Φδ^{-1}‖σ‖ = σδ^{-1} and Σσ/δ = σ,
we see that this expression is the same as conclusion (i) in the proposition.
where x.
Step V. Study R.
But δ −1 Σσ = PΦ σ = σ. So we get
Thus,
where
M = ηΣ−1 T −1
Y |X η ⊗ ∆0 + ∆ ⊗ ∆0 + ∆
−1
⊗ ∆0 − 2Iq ⊗ Ip−q
= (η 2 δ0 /σY2 |X )Ip−1 + (δ/δ0 )Ip−1 + (δ0 /δ)Ip−1 − 2Ip−1
n o
= (η 2 δ0 /σY2 |X ) + (δ0 /δ)(1 − δ/δ0 )2 Ip−1 .
Dividing η 2 {(η 2 δ0 /σY2 |X ) + (δ0 /δ)(1 − δ/δ0 )2 }−1 , the cost for the envelope es-
b by the corresponding PLS cost (σ 2 δ0 /δ 2 ) from avar{√nvec(βbpls )},
timator β, Y
which comes directly from Cook et al. (2013), and using the relationship
σY2 = η 2 δ + σY2 |X leads to cost ratio given below Corollary 4.1.
K_1(n, p) = tr(∆_0)/(n‖σ‖^2)
K_2(n, p) = tr(∆_0^2)/(n‖σ‖^4)
K_3(n, p) = tr^{1/2}(∆_0^3)/(n‖σ‖^3) = tr^{1/2}(∆_σ^3)/n,
D_N = (β̂_pls − β_pls)^T ω_N = {σ̂^T σ̂ (σ̂^T S_X σ̂)^{-1} σ̂^T − σ^T σ (σ^T Σ_X σ)^{-1} σ^T} ω_N   (A.13)
= O_p{n^{-1/2} + K_1(n, p) + K_2^{1/2}(n, p) + K_3(n, p)},   (A.14)
Theorem 4.1. It turns out that the addend K3 (n, p) is superfluous because
the hypothesis of Theorem 4.1, Kj (n, p) → 0 for j = 1, 2, implies that
K3 (n, p) → 0:
K_3(n, p) ≤ {K_1(n, p) K_2(n, p)}^{1/2} ≤ (1/√2){K_1(n, p) + K_2(n, p)},
which establishes that K3 is at most the order of K1 + K2 . Nevertheless, to
maintain a connection with the literature, we prove Proposition A.1, which
then implies Theorem 4.1.
σ̂^T S_X σ̂ / (σ^T Σ_X σ) = 1 + O_p{n^{-1/2} + K_1(n, p) + K_2(n, p) + K_3(n, p)}.   (A.15)
σ̂^T σ̂ / (σ^T σ) = 1 + O_p{n^{-1/2} + K_1(n, p)}.   (A.16)
(σ̂^T σ̂/σ̂^T S_X σ̂) · (σ^T Σ_X σ/σ^T σ) = O_p(1).   (A.17)
E(U ) = nΘ
E(U 2 ) = nΘtr(Θ) + n(n + 1)Θ2
E(U 3 ) = nΘtr2 (Θ) + n(n + 1)(Θtr(Θ2 ) + 2Θ2 tr(Θ)) + n(n2 + 3n + 4)Θ3
E(U 5 ) = n5 + 10 n4 + 65 n3 + 160 n2 + 148 n Θ5
+ 4 n4 + 6 n3 + 21 n2 + 20 n (tr Θ) Θ4
2
+ 6 n3 + 3 n2 + 4 n (tr Θ)
+3 n4 + 5 n3 + 14 n2 + 12 n (tr Θ2 ) Θ3
n
3
+ 4 n2 + n (tr Θ) + 4 2 n3 + 5 n2 + 5 n (tr Θ2 )(tr Θ)
o
+2 n4 + 5 n3 + 14 n2 + 12 n (tr Θ3 ) Θ2
n 2
4 2
+ n(tr Θ) + 6 n2 + n (tr Θ2 )(tr Θ) + 2 n3 + 5 n2 + 5 n (tr Θ2 )
+4 n3 + 3 n2 + 4 n (tr Θ3 )(tr Θ)
o
+ n4 + 6 n3 + 21 n2 + 20 n (tr Θ4 ) Θ.
The next three lemmas, each with its own Proof section, give ingredients
to establish (A.15)–(A.17) from (A.18) and (A.19).
ε^T X X^T ε / (n^2 σ^T σ) = O_p(1){1/n + tr(∆_0)/(n σ^T σ)}   (A.20)
ε^T X W X^T ε / (n^3 σ^T Σσ) = O_p(1){1/n + tr(∆_0^2)/(n σ^T Σσ) + tr^2(∆_0)/(n^2 σ^T Σσ)}.   (A.21)
Proof of Lemma A.2. Since both quantities are positive we need only to
compute their expectations using Lemma A.1 and then employ Markov’s in-
equality. Recall from the preamble to Section 4.4 that ε is the n × 1 vector
with model errors i as elements.
σY2 |X
E(n−2 εT XXT ε) = E(tr(W )) = O(n−1 )tr(ΣX )
n2
σY2 |X tr(Σ2X ) tr2 (ΣX )
E(n−3 εT XW XT ε) = E(tr(W 2
)) = O(1) + .
n3 n n2
have
εT XXT ε
tr(ΣX )
E = O(1)
n2 σ T σ nσ T σ
1 tr(∆0 )
= O(1) +
n nσ T σ
1
= O(1) + K1 (n, p)
n
var{β_pls^T W^2 X^T ε / (n^3 σ^T Σ_X σ)} = O(1){1/n + K_2^2(n, p) + K_3^2(n, p)}.   (A.23)
Proof of Lemma A.3. Each term has expectation zero, so we compute their
variances with the help of Lemma A.1.
!
T
βpls W XT ε τ2 τ2
var 2 T
T
= 4 T 2 E(βpls W XT XW βpls ) = 4 T 2 E(βpls
T
W 3 βpls )
n σ σ n (σ σ) n (σ σ)
1
= O(1) n3 βpls
T
Σ3X βpls + n2 βpls
T
Σ2X βpls tr(ΣX )
n4 (σ T σ)2
+n2 βpls
T
ΣX βpls tr(Σ2X ) + nβpls
T
ΣX βpls tr2 (ΣX ) .
T
Now, βpls Σ3X βpls = σ T ΣX σ (σ T σ)2 and therefore we have
!
T
βpls W XT ε
1 −1 2
var = O(1) + n {K1 (n, p) + K2 (n, p) + K1 (n, p)} .
n2 σ T σ n
σY2 |X
!
T
βpls W 2 XT ε
var = T
E(βpls W 2 XT XW 2 βpls )
n 2 σ T ΣX σ n6 (σ T ΣX σ)2
σY2 |X
= T
E(βpls W 5 βpls ).
n6 (σ T ΣX σ)2
E(W 5 ) can now be evaluated using Lemma A.1 and the results simplified to
yield (A.23). 2
Lemmas A.2–A.4 give the orders of scaled versions of all six addends on
the right hand sides of (A.18) and (A.19). These are next used to determine
the orders (A.15)–(A.17). By combining (A.20), (A.22) and (A.24) we see that
bT σ
σ b/kσk is of the order of
1 1 1
+ K1 (n, p) + √ + √ + K1 (n, p) ,
n n n
Following the previous logic, we arrive at the stated order (A.15). (A.17)
follows immediately from (A.16) and (A.15).
Continuing now with the proof of Proposition A.1, the next step is to
rewrite (A.13) in a form that makes use of Proposition A.2. Recall that
δ = σ^T Σ_X σ/‖σ‖^2 is the eigenvalue of Σ_X that is associated with the ba-
sis vector Φ = σ/‖σ‖ of the envelope E_{Σ_X}(B). Let δ̂ = σ̂^T S_X σ̂/σ̂^T σ̂. From
(A.13) then, we need to find the order of
D_N = (δ̂^{-1} σ̂ − δ^{-1} σ)^T ω_N
= δ̂^{-1}(σ̂ − σ)^T ω_N − δ̂^{-1}(σ̂^T S_X σ̂ − σ^T Σ_X σ)(σ^T Σ_X σ)^{-1} σ^T ω_N + (σ̂^T σ̂ − σ^T σ)(σ^T Σ_X σ)^{-1} σ^T ω_N.
D_N = (δ̂^{-1}δ) δ^{-1}(σ̂ − σ)^T ω_N − (δ̂^{-1}δ) δ^{-1}(σ̂^T S_X σ̂ − σ^T Σ_X σ)(σ^T Σ_X σ)^{-1} σ^T ω_N + (σ̂^T σ̂ − σ^T σ)(σ^T Σ_X σ)^{-1} σ^T ω_N.
Therefore an order for D_N can be found by adding the orders of the following
three terms.
I = δ^{-1}(σ̂ − σ)^T ω_N.
II = δ^{-1}(σ̂^T S_X σ̂ − σ^T Σ_X σ)(σ^T Σ_X σ)^{-1} σ^T ω_N.
III = (σ̂^T σ̂ − σ^T σ)(σ^T Σ_X σ)^{-1} σ^T ω_N.
Term I.
var(I) = δ −2 E((b
σ − σ)T ΣX (bσ − σ)) δ −2 tr{var(b σ )ΣX }
2 T
var(Y )tr(ΣX ) + σ ΣX σ
δ −2
n
n−1 δ −2 var(Y ) δ 2 + tr(∆20 ) + n−1 δ −2 σ T ΣX σ
n−1 + K2 (n, p).
Term II.
and
2
var(δ −1 σ T eN ) = δ −1 σ T ΣX σ = (σ T σ)2 (σ T ΣX σ)−1 1.
Therefore
II = Op n−1/2 + K1 (n, p) + K2 (n, p) + K3 (n, p) .
Term III.
III = O n−1/2 + K1 (n, p) .
Thus,
D_N = I + II + III
= O_p{n^{-1/2} + K_2^{1/2}(n, p) + K_1(n, p) + K_2(n, p) + K_3(n, p)}
= O_p{n^{-1/2} + K_2^{1/2}(n, p) + K_1(n, p) + K_3(n, p)},
β^T Σ_X β = σ_{X,Y}^T Σ_X^{-1} σ_{X,Y}
= σ_{X,Y}^T (Φ∆^{-1}Φ^T + Φ_0∆_0^{-1}Φ_0^T) σ_{X,Y}
= σ_{X,Y}^T Φ∆^{-1}Φ^T σ_{X,Y},
where the last step follows because σ_{X,Y} is contained in E_{Σ_X}(B), which has
semi-orthogonal basis Φ. Let φ_i denote the i-th column of Φ and define
w_i = σ_{X,Y}^T P_{φ_i} σ_{X,Y} / σ_{X,Y}^T P_Φ σ_{X,Y},   i = 1, . . . , q,
β^T Σ_X β = σ_{X,Y}^T Φ∆^{-1}Φ^T σ_{X,Y} = Σ_{i=1}^q w_i (‖σ_{X,Y}‖^2/δ_i).
Proof. For notational convenience, let M = (PR Y, PS X). We first show that
conditions (a) and (b) are equivalent to the conditions
Assume that conditions (a) and (b) hold. Let R ∈ R^{r×u} and R_0 ∈ R^{r×(r−u)}
be semi-orthogonal basis matrices for R and its orthogonal complement R^⊥.
Then (R, R_0) is an orthogonal matrix and condition (a) holds if and only
if Q_S X ⫫ (R^T Y, R_0^T Y, P_S X). Consequently, condition (a) implies that (see
Cook, 1998, Proposition 4.6, for background on conditional independence)
Q_S X ⫫ (P_R Y, Q_R Y, P_S X) ⇒ Q_S X ⫫ (M, Q_R Y)
⇒ (a1) Q_S X ⫫ Q_R Y | M and (a2) Q_S X ⫫ M
⇒ (a3) Q_S X ⫫ M | Q_R Y and (a4) Q_S X ⫫ Q_R Y.
Q_R Y ⫫ (P_S X, Q_S X, P_R Y) ⇒ Q_R Y ⫫ (M, Q_S X)
⇒ (b1) Q_S X ⫫ Q_R Y | M and (b2) Q_R Y ⫫ M
⇒ (b3) Q_R Y ⫫ M | Q_S X and (b4) Q_S X ⫫ Q_R Y.
(I0) and (II1) imply that Q_R Y ⫫ (M, Q_S X), while conditions (I0) and (II2)
imply that Q_S X ⫫ (M, Q_R Y). Replacing M with its definition, these impli-
cations give
This established that conditions (I0 ) and (II) are equivalent to conditions (a)
and (b).
To establish (a) and (b) with (I) and (II), first assume conditions (a) and
(b). Then (I0 ) and (II) hold. But (II) implies that (I) holds if and only if (I0 )
holds. Next, assume that (I0 ) and (II) hold. Then (a) and (b) hold and, again,
(II) implies that (I) holds if and only if (I0 ) holds.
2
Proof. Recall that the canonical correlation directions are the pairs of vec-
tors {a_i, b_i} = {Σ_X^{-1/2} e_i, Σ_Y^{-1/2} f_i}, where {e_i, f_i} is the i-th left-right eigen-
vector pair of the correlation matrix ρ = Σ_X^{-1/2} Σ_{X,Y} Σ_Y^{-1/2}, i = 1, . . . , d,
d = rank(Σ_{X,Y}). Now, the conclusion follows for the a_i's because
span(a_1, . . . , a_d) = span(Σ_X^{-1} Σ_{X,Y} Σ_Y^{-1/2}) ⊆ span(Φ) = E_{Σ_X}(B).
can be represented as
S_res = Σ̂_{Y|X} + P_G S_{Y|W^TX} Q_G + Q_G S_{Y|W^TX} P_G,
where Σ̂_{Y|X} is as given in Lemma 5.2.
Proof. Substituting η̂ = S_{W^TX}^{-1} S_{W^TX,G^TY} and expanding, we have
S_res = n^{-1} Σ_{i=1}^n {(Y_i − Ȳ) − Gη̂^T W^T X_i}{(Y_i − Ȳ) − Gη̂^T W^T X_i}^T
= S_Y − G S_{G^TY,W^TX} S_{W^TX}^{-1} S_{W^TX,Y} − S_{Y,W^TX} S_{W^TX}^{-1} S_{W^TX,G^TY} G^T + G S_{G^TY,W^TX} S_{W^TX}^{-1} S_{W^TX,G^TY} G^T.
Let M = S_{Y,W^TX} S_{W^TX}^{-1} S_{W^TX,Y} so that S_Y − M = S_{Y|W^TX}. Then
S_res = S_Y − P_G M − M P_G + P_G M P_G
= S_Y − P_G M Q_G − Q_G M P_G − P_G M P_G.
We next expand
SY = PG SY PG + QG SY PG + PG SY QG + QG SY QG ,
1. k ← 1.
3. Compute the first singular vectors of (X(k) )T Y(k) , uk left and vk right.
6. Let SX (k) = X(k)T X(k) /n and SY (k) = Y(k)T Y(k) /n. Construct the residu-
als:
X(k+1) ← X(k) − X
b (k) = X(k) − ξk (ξkT ξk )−1 ξkT X(k)
= X(k) − X(k) uk (uTk X(k)T X(k) uk )−1 uTk X(k)T X(k)
= X(k) Quk (SX (k) )
Y(k+1) ← Y(k) − Y
b (k) = Y(k) − ωk (ωkT ωk )−1 ωkT Y(k)
= Y(k) − Y(k) vk (vkT Y(k)T Y(k) vk )−1 vkT Y(k)T Y(k)
= Y(k) Qvk (SY (k) ) .
7. If (X(k+1) )T Y(k+1) = 0 stop and k is the rank of the PLS model. Other-
wise, k ← k + 1 and go to step 3.
Now, if k = 1, u1 and v1 are the first left and right singular values of SX,Y
from the initialization step in Table 5.2. Then ξ1 = X(1) u1 , ω1 = Y(1) v1 and
SX (2) ,Y (2) = (X(2) )T Y(2) /n = QTu1 (S )) SX (1) ,Y (1) Qv1 (SY (1) ) . (A.27)
X (1)
Now, consider u2 and v2 , the first left and right singular vectors of SX (2) ,Y (2) ,
and
where the last step follows from the form of SX (2) and Lemma 3.5. Similarly,
Y(3) = Y(1) Q(v1 ,v2 )(SY (1) ) .
We next compute first left and right singular vectors of
We seen then that when SX = SX (1) , SX,Y SX (1) ,Y (1) and SY (1) are re-
placed with their population versions, the initialization and steps k = 1, 2 of
is equivalent to (6.6),
The equivalence of (6.5) and (6.6) relies on Proposition 4.4 from Cook
(1998), which states that for U, V and W random vectors, U ⫫ V | W if and
only if U ⫫ (V, W) | W. From this we have that (6.5a) holds if and only if
(Y, X_2) ⫫ (X_1, X_2) | P_S X_1, X_2.
(R_{Y|2}, X_2) ⫫ (R_{1|2}, X_2) | P_S X_1, X_2.
Applying Proposition 4.4 from Cook (1998) again, the last statement holds if
and only if
R_{Y|2} ⫫ R_{1|2} | P_S X_1, X_2.
We obtain (6.6a) by applying the same logic to the conditioning argument,
(P_S X_1, X_2), and to (6.5b).
X_2 ∼ N(0, Σ_2)
X_1 | X_2 ∼ N(β_{1|2}^T X_2, ΦΩΦ^T + Φ_0Ω_0Φ_0^T)
Y | (X_1, X_2) ∼ N(η^T Φ^T X_1 + β_2^T X_2, σ^2_{Y|X}),
The MLE of β_{1|2} is β̂_{1|2} = S_2^{-1} S_{2,1} and then R_{1|2,i} is the i-th residual from
the fit of X_1 on X_2,
R_{1|2,i} = X_{1i} − β̂_{1|2}^T X_{2i}
S_{1|2} = n^{-1} Σ_{i=1}^n R_{1|2,i} R_{1|2,i}^T = S_1 − S_{1,2} S_2^{-1} S_{2,1}.
From this we have the MLEs Ω̂ = Φ^T S_{1|2} Φ, Ω̂_0 = Φ_0^T S_{1|2} Φ_0 and thus the
The next step is to determine and substitute the values of η and β2 that
maximize the log likelihood with Φ and σ 2 held fixed. To facilitate this we
orthogonalize the terms in the sum. To do this we use the coefficients βb1|2 Φ
Pn
from the OLS fit of ΦT X1 i on X2 i . Let SS = i=1 (yi − η T ΦT X1 i − β T X2 i )2 .
Then write
n
X
SS = (yi − η T (ΦT X1i − ΦT βb1|2
T
X2 i + ΦT βb1|2
T
X2 i ) − β2T X2 i )2
i=1
n
X
= (yi − η T ΦT (X1i − βb1|2
T
X2 i ) − η T ΦT βb1|2
T
X2 i − β2T X2 i )2
i=1
n
X
= (yi − η T ΦT R1|2 i − β2∗T X2 i )2 ,
i=1
Pn
T
where R1|2 i = X1 i − βb1|2 X2 i and β2∗ = β2 + βb1|2 Φη. Since i=1 R1|2 i X2Ti = 0,
we can fit the two terms in the last sum separately, getting straightforwardly
n
X
n−1 ΦT R1|2 i Yi = ΦT {S1Y − S1,2 S2−1 S2,Y }
i=1
S1|2 = S1 − S1,2 S2−1 S2,1
η = (ΦT S1|2 Φ)−1 ΦT SR1|2 ,Y
SR1|2 ,Y = S1Y − S1,2 S2−1 S2,Y
β2∗ = S2−1 S2,Y
as the values of η and β2∗ that minimize the sum. Substituting these values
and simplifying we get the next partially maximized log likelihood
n np
log L3 = − log |S2 | −
2 2
n n
− log |Φ S1|2 Φ| − log |ΦT0 S1|2 Φ0 |
T
2 2
n 2 n
− log σY |X − 2 {SY |2 − SR T
1|2 ,Y
Φ[ΦT S1|2 Φ]−1 ΦT SR1|2 ,Y }.
2 2σY |X
σ^2_{Y|X} = S_{Y|2} − S_{R_{1|2},Y}^T Φ[Φ^T S_{1|2} Φ]^{-1} Φ^T S_{R_{1|2},Y},
For clarity we use det rather than | · | to denote the determinant operator in
the remainder of this proof. Then
!
ΦT S1|2 Φ ΦT SR1|2 ,Y
T4 = det T
SR 1|2 ,Y
Φ SY |2
n o
= det ΦT S1|2 Φ − ΦT SR1|2 ,Y SY−1|2 SR T
1|2 ,Y
Φ
n o
= det ΦT [S1|2 − SR1|2 ,Y SY−1|2 SRT
1|2 ,Y ]Φ .
Since R1|2 is uncorrelated in the sample with X2 , we have SR1|2 ,Y = SR1|2 ,RY |2
and so
n o
F (Φ) = det ΦT [S1|2 − SR1|2 ,RY |2 SY−1|2 SR
T
R
1|2 Y |2
]Φ + det Φ T −1
S1|2 Φ .
gives the objective function at (6.9). The estimators are obtained by piecing
together the various parameter functions that maximize the log likelihood.
Consequently,
1. M = ΣW α(αT ΣW α)−1 .
2. M T is a generalized inverse of α.
4.
E(W | αT W ) − E(W ) = PS(Σ
T
W)
(W − E(W )),
E{E(W | αT W )W T α} = E{E(W W T α | αT W )}
= E(W W T α)
= ΣW α,
E(M αT W W T α) = M αT ΣW α.
Consequently,
Σ W α = M α T ΣW α
Since α is a basis for SY |X , the left hand side does not depend on the value
of Y . This implies that E(X | αT X, Y = y) = E(X | αT X) and thus that
The left and right hand sides are a function of only αT X. Since the expecta-
tion of the left hand side in αT X is 0, the expectation of the right hand side
must also be 0,
E(MY αT E(X | Y )) = E(MY )αT µX .
Replacing the second term on the right hand side of (A.29) with E(MY )αT µX ,
it follows that
2
Moreover, if the single index model holds and Σ_{X,Y} ≠ 0 then E_{Σ_X}(B) =
E_{Σ_X}(S_{E(Y|X)}).
Proof. We know from Proposition 9.4 and Corollary 9.2 that B ⊆ S_{E(Y|X)},
with equality under the single index model (9.2) with Σ_{X,Y} ≠ 0. Let A_1
be a semi-orthogonal basis matrix for B and extend it so that (A_1, A_2) is a
semi-orthogonal basis matrix for S_{E(Y|X)}. Then S_{E(Y|X)} = span((A_1, A_2)) =
span(A_1) + span(A_2). Applying Proposition 1.7 with M = Σ_X, S_1 =
span(A_1) = B and S_2 = span(A_2), we have E_{Σ_X}(S_{E(Y|X)}) = E_{Σ_X}(B) + E_{Σ_X}(span(A_2)),
which implies the desired conclusion, E_{Σ_X}(B) ⊆ E_{Σ_X}(S_{E(Y|X)}).
When the single index model holds, B = S_{E(Y|X)}, which implies equality,
E_{Σ_X}(S_{E(Y|X)}) = E_{Σ_X}(B).
2
Proof. The first two elements of the diagonal are a direct consequence of the
fact that cov(Z) = E{cov(Z | H)} + var{E(Z | H)}. For Σ_{X,ξ} and Σ_{Y,η} we use the
fact that β_{X|ξ} = σ_ξ^{-2} Σ_{X,ξ} and β_{Y|η} = σ_η^{-2} Σ_{Y,η}. To compute Σ_{Y,η} we use that
• Marginal constraints,
! !
X µX
∼ + X,Y
Y µY
where
!
DX|ξ + c2 BB T BAT
var(X,Y ) := Σ(X,Y ) = . (A.31)
AB T DY |η + d2 AAT
• Regression constraints
! !
X µX
∼ + X,Y
Y µY
From (A.43)
−1
DX|ξ + (B T DX|ξ B)−1 BB T /(σξ2 − 1)
AB T
Σ(X,Y ) =
BAT
DY |η + (AT DY−1|η A)−1 AAT /(ση2 − 1)
where cor2 (η, ξ)(ση2 − 1)−1 (AT DY−1|η A)−1 (σξ2 − 1)−1 (B T DX|ξ
−1
B)−1 = 1.
cor{E(ξ|X), E(η|Y )}, E(ξ|X), E(η|Y ), ΣX,ξ and ΣY,η are identifiable in the
reflexive model except for sign, while cor(η, ξ), σξ2 , ση2 , σξ,η , βX|ξ , βY |η , ΣY |η
and ΣX|ξ are not identifiable. Moreover,
Proof. The first part follows from reduced rank model literature (see for ex-
ample Cook, Forzani, and Zhang (2015)). Now, using (A.30) and the fact that
µη = 0,
and therefore
where we use the hypothesis that var(E(η|Y )) = 1 in the last equal. In the
same way
var(E(ξ|X)) = Σξ,X Σ−1 T
X Σξ,X = 1. (A.35)
Now, since
(A.36), (A.37), and (A.38) together makes AB T = Amη ση−2 σξ,η σξ−2 mξ B T and
therefore
Now, using (A.34) and (A.35) together with (A.37) and (A.38) we have that
= (AT Σ−1 T −1
Y AB ΣX B)
1/2
Let us note that from (A.40) and (A.41), m2ξ and m2η are not unique, never-
theless m2η m2ξ is unique since any change of A and B should be such AB T is
the same. As a consequence mη mξ is unique except for a sign. And therefore
ση−2 σξ,η σξ−2 is unique except for a sign.
Now, coming back to equations (A.37) and (A.38) we have that
ση2 = E(var(η|Y )) + 1
E(ξ|X) = Σξ,X Σ−1
X (X − µX )
σξ2 = E(var(ξ|X)) + 1.
= Ση,Y Σ−1 −2 −2 −1
Y ΣY,η ση σξ,η σξ Σξ,X ΣX ΣX,ξ .
Now, we will prove that σξ and ση are not identifiable. For that, since
ση and σξ have to be greater than 1 and ση σξ should be constant we can
Lemma A.6. Under the reflexive model of Section 10.2 and the regression
constraints,
Proof. Expressions (A.42) and (A.44) are equivalent to (10.9) and (10.10) in
Chapter 10.
By the covariance formula and the fact that by Proposition 10.1 we have
Σ_{X,ξ} = B(B^T Σ_X^{-1} B)^{-1/2} and Σ_{Y,η} = A(A^T Σ_Y^{-1} A)^{-1/2}, from where we get
(A.42) and (A.44). Now, taking the inverse and using the Woodbury identity
we have
B T Σ−1 T −1
X|ξ BB ΣX B
B T Σ−1
X B = σξ2 .
σξ2 B T Σ−1 T −1
X B + B ΣX|ξ B
As a consequence
σξ2 − 1
B T Σ−1
X B = B T Σ−1
X|ξ B σξ2
and (A.43) follows replacing this into (A.42). The proof of (A.45) follows anal-
ogously. 2
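The scalar identity at the end of this proof can be verified numerically if Σ_X is given the (A.43)-type structure with Σ_{X|ξ} in place of D_{X|ξ}; that structural assumption, and the particular B, Σ_{X|ξ}, and σ_ξ² below, are choices made only for this Python illustration, not the book's code.

import numpy as np

rng = np.random.default_rng(2)
p = 6
B = rng.standard_normal((p, 1))
G = rng.standard_normal((p, p))
Sigma_Xxi = G @ G.T + np.eye(p)               # Sigma_{X|xi}, positive definite
sigma2 = 2.5                                  # sigma_xi^2 > 1

H = (B.T @ np.linalg.solve(Sigma_Xxi, B)).item()        # B' Sigma_{X|xi}^{-1} B
# Sigma_X with the (A.43)-type structure, Sigma_{X|xi} replacing D_{X|xi}.
Sigma_X = Sigma_Xxi + (B @ B.T) / (H * (sigma2 - 1.0))

G_val = (B.T @ np.linalg.solve(Sigma_X, B)).item()      # B' Sigma_X^{-1} B
print(np.isclose(G_val, H * (sigma2 - 1.0) / sigma2))   # True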
$$\Sigma_X^{-1} = \Sigma_{X|\xi}^{-1} - \Sigma_{X|\xi}^{-1}\Sigma_{X,\xi}\,(\sigma_\xi^2 + \Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}\Sigma_{X,\xi})^{-1}\,\Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}. \qquad (A.48)$$
Using the fact that Σ_{ξ,X}Σ_X^{-1}Σ_{X,ξ} = 1, proven in Proposition 10.1, and multi-
plying (A.48) on the left by Σ_{ξ,X} and on the right by Σ_{X,ξ}, we have, writing
H_ξ := Σ_{ξ,X}Σ_{X|ξ}^{-1}Σ_{X,ξ},
$$1 = \Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}\Sigma_{X,\xi} - \Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}\Sigma_{X,\xi}\,(\sigma_\xi^2 + \Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}\Sigma_{X,\xi})^{-1}\,\Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}\Sigma_{X,\xi}
= H_\xi - H_\xi(\sigma_\xi^2 + H_\xi)^{-1}H_\xi
= \frac{H_\xi\,\sigma_\xi^2}{\sigma_\xi^2 + H_\xi}.$$
must imply that Σ_{X|ξ} = Σ̃_{X|ξ} and σ_ξ^2 = σ̃_ξ^2. If σ_ξ^2 is identifiable, so
σ_ξ^2 = σ̃_ξ^2, then (A.49) implies Σ_{X|ξ} = Σ̃_{X|ξ} since (B^T Σ_X^{-1} B)^{-1} BB^T is identi-
fiable. Similarly, if Σ_{X|ξ} is identifiable, so Σ_{X|ξ} = Σ̃_{X|ξ}, then (A.49) implies
that σ_ξ^2 = σ̃_ξ^2 and thus that σ_ξ^2 is identifiable.
Now, assume that elements (i, j) and (j, i) of Σ_{X|ξ} and Σ̃_{X|ξ} are known to
be 0 and let e_k denote the p × 1 vector with a 1 in position k and 0's elsewhere.
Then multiplying (A.49) on the left by e_i^T and on the right by e_j gives
$$\sigma_\xi^{-2}(B^T\Sigma_X^{-1}B)^{-1}(B)_i(B)_j = \tilde\sigma_\xi^{-2}(B^T\Sigma_X^{-1}B)^{-1}(B)_i(B)_j.$$
Since (B^T Σ_X^{-1} B)^{-1} BB^T is identifiable and (B)_i(B)_j ≠ 0, this implies that
σ_ξ^{-2} = σ̃_ξ^{-2} and thus σ_ξ^2 is identifiable, which in turn implies that Σ_{X|ξ} is identifiable.
□
= Σ_{ξ,X}Σ_X^{-1}Σ_{X,Y}Σ_Y^{-1}Σ_{Y,η}.
From (A.30),
$$\mathrm{cov}\{E(\xi \mid X), E(\eta \mid Y)\} = \mathrm{var}\{E(\xi \mid X)\}\,\mathrm{var}\{E(\eta \mid Y)\}\,\frac{\mathrm{cor}(\xi, \eta)}{\sigma_\xi \sigma_\eta}.$$
In consequence, we get the first conclusion.
To perhaps aid intuition, we can see the result in another way. Using the
Woodbury identity to invert Σ_X = Σ_{X,ξ}Σ_{ξ,X} + Σ_{X|ξ}, we have
$$\mathrm{var}\{E(\xi \mid X)\} = \Sigma_{\xi,X}\Sigma_X^{-1}\Sigma_{X,\xi} = \frac{\Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}\Sigma_{X,\xi}}{1 + \Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}\Sigma_{X,\xi}}.$$
Similarly,
$$\mathrm{var}\{E(\eta \mid Y)\} = \Sigma_{\eta,Y}\Sigma_Y^{-1}\Sigma_{Y,\eta} = \frac{\Sigma_{\eta,Y}\Sigma_{Y|\eta}^{-1}\Sigma_{Y,\eta}}{1 + \Sigma_{\eta,Y}\Sigma_{Y|\eta}^{-1}\Sigma_{Y,\eta}},$$
and thus
$$\Psi = \left[\frac{\Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}\Sigma_{X,\xi}}{1 + \Sigma_{\xi,X}\Sigma_{X|\xi}^{-1}\Sigma_{X,\xi}}\;\frac{\Sigma_{\eta,Y}\Sigma_{Y|\eta}^{-1}\Sigma_{Y,\eta}}{1 + \Sigma_{\eta,Y}\Sigma_{Y|\eta}^{-1}\Sigma_{Y,\eta}}\right]^{1/2}\mathrm{cor}(\xi, \eta).$$
This form shows that |Ψ| ≤ |cor(ξ, η)| and provides a more detailed expression
of their ratio.
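A small numerical illustration of the last two displays: with Σ_X = Σ_{X,ξ}Σ_{ξ,X} + Σ_{X|ξ} as above, and the analogous decomposition assumed for Σ_Y, the two variances reduce to the h/(1 + h) form and the resulting Ψ never exceeds |cor(ξ, η)| in magnitude. The Python sketch below is illustrative rather than the book's code, and cor(ξ, η) is set to an arbitrary value.

import numpy as np

rng = np.random.default_rng(3)
p, r = 6, 4
S_Xxi = rng.standard_normal((p, 1))                  # Sigma_{X,xi}
S_Yeta = rng.standard_normal((r, 1))                 # Sigma_{Y,eta}
GX = rng.standard_normal((p, p))
GY = rng.standard_normal((r, r))
Sigma_X_xi = GX @ GX.T + np.eye(p)                   # Sigma_{X|xi}
Sigma_Y_eta = GY @ GY.T + np.eye(r)                  # Sigma_{Y|eta}

Sigma_X = S_Xxi @ S_Xxi.T + Sigma_X_xi               # Sigma_{X,xi}Sigma_{xi,X} + Sigma_{X|xi}
Sigma_Y = S_Yeta @ S_Yeta.T + Sigma_Y_eta            # analogous decomposition assumed for Y

h_xi = (S_Xxi.T @ np.linalg.solve(Sigma_X_xi, S_Xxi)).item()
h_eta = (S_Yeta.T @ np.linalg.solve(Sigma_Y_eta, S_Yeta)).item()

# var{E(xi|X)} and var{E(eta|Y)} two ways: directly and via the h/(1+h) form.
v_xi = (S_Xxi.T @ np.linalg.solve(Sigma_X, S_Xxi)).item()
v_eta = (S_Yeta.T @ np.linalg.solve(Sigma_Y, S_Yeta)).item()
print(np.isclose(v_xi, h_xi / (1 + h_xi)), np.isclose(v_eta, h_eta / (1 + h_eta)))

# Psi from the displayed expression never exceeds |cor(xi, eta)| in magnitude.
cor_xi_eta = 0.7                                     # arbitrary illustrative value
Psi = np.sqrt((h_xi / (1 + h_xi)) * (h_eta / (1 + h_eta))) * cor_xi_eta
print(abs(Psi) <= abs(cor_xi_eta))                   # True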
1. span(r_i) = span(w_{i+1}), i = 0, . . . , q − 1
2. span(p_i) = span(v_{i+1}), i = 0, . . . , q − 1
Lemma A.7. In the context of the single-response linear model, the vectors
generated by CGA satisfy
Turning to p_1:
$$\begin{aligned}
p_1 &= r_1 + \frac{r_1^T r_1}{r_0^T r_0}\, p_0\\
&= \sigma_{X,Y} - \frac{\sigma_{X,Y}^T\sigma_{X,Y}}{\sigma_{X,Y}^T\Sigma_X\sigma_{X,Y}}\,\Sigma_X\sigma_{X,Y}
+ \frac{\sigma_{X,Y}^T Q_{\sigma_{X,Y}}(\Sigma_X)\,Q_{\sigma_{X,Y}}^T(\Sigma_X)\,\sigma_{X,Y}}{\sigma_{X,Y}^T\sigma_{X,Y}}\,\sigma_{X,Y}\\
&= \frac{\sigma_{X,Y}^T\sigma_{X,Y}\;\sigma_{X,Y}^T\Sigma_X^2\sigma_{X,Y}}{(\sigma_{X,Y}^T\Sigma_X\sigma_{X,Y})^2}
\left\{\sigma_{X,Y} - \Sigma_X\sigma_{X,Y}\,(\sigma_{X,Y}^T\Sigma_X^2\sigma_{X,Y})^{-1}\,\sigma_{X,Y}^T\Sigma_X\sigma_{X,Y}\right\}\\
&= \frac{\sigma_{X,Y}^T\sigma_{X,Y}\;\sigma_{X,Y}^T\Sigma_X^2\sigma_{X,Y}}{(\sigma_{X,Y}^T\Sigma_X\sigma_{X,Y})^2}\,Q_{\Sigma_X\sigma_{X,Y}}\,\sigma_{X,Y}
\;\propto\; Q_{\Sigma_X\sigma_{X,Y}}\,\sigma_{X,Y} = v_2,
\end{aligned}$$
so that span(p_1) = span(v_2) (from Table 3.4).
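The proportionality p_1 ∝ v_2 can be checked directly by running two steps of the conjugate gradient algorithm on Σ_Xβ = σ_{X,Y}. In the Python sketch below, which is illustrative rather than the book's code, Q_{Σ_Xσ_{X,Y}} is taken to be the projection onto the orthogonal complement of span(Σ_Xσ_{X,Y}), and Σ_X and σ_{X,Y} are simulated stand-ins for the population quantities.

import numpy as np

rng = np.random.default_rng(4)
p = 8
G = rng.standard_normal((p, p))
Sigma_X = G @ G.T + np.eye(p)              # simulated positive definite Sigma_X
sigma_XY = rng.standard_normal(p)          # simulated sigma_{X,Y}, single response

# Two steps of the conjugate gradient algorithm on Sigma_X beta = sigma_XY, beta_0 = 0.
r0 = sigma_XY.copy()                       # r_0 = sigma_XY
p0 = r0.copy()
alpha0 = (r0 @ r0) / (p0 @ Sigma_X @ p0)
r1 = r0 - alpha0 * (Sigma_X @ p0)
p1 = r1 + ((r1 @ r1) / (r0 @ r0)) * p0

# v_2 = Q_{Sigma_X sigma_XY} sigma_XY: project sigma_XY off span(Sigma_X sigma_XY).
u = Sigma_X @ sigma_XY
v2 = sigma_XY - u * (u @ sigma_XY) / (u @ u)

# p_1 and v_2 are collinear, so span(p_1) = span(v_2).
print(np.isclose(abs(p1 @ v2), np.linalg.norm(p1) * np.linalg.norm(v2)))  # True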
Additionally,
where the second equality follows from (A.51) and the last equality uses the
fact that p_0^T Σ_X p_1 = 0.
Since the proposition holds for q = 1, 2, we now suppose that it holds for
q = 1, . . . , k − 1. We next prove that it holds for q = k; that is, we show that
span(pk ) = span(vk+1 ), span(rk ) = span(wk+1 ) and that βk+1 = βnpls = βspls
when the number of PLS components is k + 1.
For the induction hypotheses we have for q = 1, . . . , k
5. p_{q-1}^T Σ_X p_i = 0 for i = 0, . . . , q − 2,
6. r_{q-1}^T p_i = 0 for i = 0, . . . , q − 2,
Items 4–6 are implied by Lemma A.7. To prove the proposition we first show
that items 1 and 2 hold for q = k+1. To do this, we first show that {r0 , . . . , rk }
is an orthogonal set. Although this is implied by Lemma A.7(i), we provide
an alternate demonstration for completeness.
Claim: {r_0, . . . , r_k} is an orthogonal set.
It remains to show that r_{k-1}^T r_k = 0. By construction,
$$\begin{aligned}
r_{k-1}^T r_k &= r_{k-1}^T r_{k-1} - \frac{r_{k-1}^T r_{k-1}}{p_{k-1}^T\Sigma_X p_{k-1}}\, r_{k-1}^T\Sigma_X p_{k-1}\\
&= r_{k-1}^T r_{k-1}\left(1 - \frac{r_{k-1}^T\Sigma_X p_{k-1}}{p_{k-1}^T\Sigma_X p_{k-1}}\right)\\
&= 0,
\end{aligned}$$
where the last equality follows because r_{k-1}^T Σ_X p_{k-1} = p_{k-1}^T Σ_X p_{k-1}. To see
this, we have by construction
Proof. The proof follows from Proposition 3.7 and Lemma A.7(iii).
In a bit more detail, we know from induction hypothesis 2 that for
q = 1, . . . , k and i = 0, . . . , q − 1, span(r_i) = span(w_{i+1}). From Lemma A.7(iii)
Claim: β_{k+1} = \sum_{i=0}^{k} p_i (p_i^T Σ_X p_i)^{-1} p_i^T σ_{X,Y} = β_npls = β_spls
Proof.
$$\begin{aligned}
\beta_{k+1} &= \beta_k + \alpha_k p_k\\
&= \sum_{i=0}^{k-1} p_i(p_i^T\Sigma_X p_i)^{-1}p_i^T\sigma_{X,Y} + \frac{r_k^T r_k}{p_k^T\Sigma_X p_k}\, p_k\\
&= \sum_{i=0}^{k-1} p_i(p_i^T\Sigma_X p_i)^{-1}p_i^T\sigma_{X,Y} + p_k(p_k^T\Sigma_X p_k)^{-1} r_k^T r_k\\
&= \sum_{i=0}^{k-1} p_i(p_i^T\Sigma_X p_i)^{-1}p_i^T\sigma_{X,Y} + p_k(p_k^T\Sigma_X p_k)^{-1} p_k^T\sigma_{X,Y},
\end{aligned}$$
where the last equality uses the fact that p_k^T σ_{X,Y} = r_k^T r_k.
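The conclusion can be illustrated numerically by running q = k + 1 conjugate gradient steps on Σ_Xβ = σ_{X,Y} and comparing the iterate with the single-response population PLS coefficient written in its Krylov form, β_q = W_q(W_q^TΣ_XW_q)^{-1}W_q^Tσ_{X,Y} with W_q = (σ_{X,Y}, Σ_Xσ_{X,Y}, . . . , Σ_X^{q−1}σ_{X,Y}); that Krylov expression is an assumption of this sketch, standing in for β_npls, and the Python code below is illustrative rather than the book's implementation. The residuals are stored along the way so the orthogonality claim above can be checked as well.

import numpy as np

rng = np.random.default_rng(5)
p, q = 8, 3                                # q = k + 1 PLS components
G = rng.standard_normal((p, p))
Sigma_X = G @ G.T + np.eye(p)
sigma_XY = rng.standard_normal(p)

# q steps of the conjugate gradient algorithm on Sigma_X beta = sigma_XY, beta_0 = 0.
beta = np.zeros(p)
r = sigma_XY.copy()
d = r.copy()
residuals = [r.copy()]
for _ in range(q):
    alpha = (r @ r) / (d @ Sigma_X @ d)
    beta = beta + alpha * d
    r_new = r - alpha * (Sigma_X @ d)
    d = r_new + ((r_new @ r_new) / (r @ r)) * d
    residuals.append(r_new.copy())
    r = r_new

# The residuals r_0, ..., r_q are mutually orthogonal (the claim proved above).
R = np.column_stack(residuals)
Gram = R.T @ R
print(np.allclose(Gram - np.diag(np.diag(Gram)), 0.0, atol=1e-8))    # True

# Population PLS coefficient with q components, in its Krylov form
# (Krylov characterization assumed for this illustration).
W = np.column_stack([np.linalg.matrix_power(Sigma_X, j) @ sigma_XY for j in range(q)])
beta_pls = W @ np.linalg.solve(W.T @ Sigma_X @ W, W.T @ sigma_XY)
print(np.allclose(beta, beta_pls))                                   # True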
Akter, S., S. F. Wamba, and S. Dewan (2017). Why PLS-SEM is suitable for
complex modelling? An empirical illustration in big data analytics quality.
Production Planning & Control 28 (11–12), 1011–1021.
Arvin, M., R. P. Pradhan, and M. S. Nair (2021). Are there links between
institutional quality, government expenditure, tax revenue and economic
growth? Evidence from low-income and lower middle-income countries.
Economic Analysis and Policy 70(C), 468–489.
Björck, Å. (1996). Numerical Methods for Least Squares Problems. Philadelphia: SIAM.
Chun, H. and S. Keleş (2010). Sparse partial least squares regression for
simultaneous dimension reduction and predictor selection. Journal of the
Royal Statistical Society B 72 (1), 3–25.
Cook, R. D. and L. Forzani (2018). Big data and partial least squares
prediction. The Canadian Journal of Statistics/La Revue Canadienne de
Statistique 47 (1), 62–78.
Cook, R. D., L. Forzani, and L. Liu (2023b). Partial least squares for simul-
taneous reduction of response and predictor vectors in regression. Journal
of Multivariate Analysis 196, https://doi.org/10.1016/j.jmva.2023.105163.
Cook, R. D., B. Li, and F. Chiaromonte (2010). Envelope models for parsi-
monious and efficient multivariate linear regression. Statistica Sinica 20 (3),
927–960.
Ding, S., Z. Su, G. Zhu, and L. Wang (2021). Envelope quantile regression.
Statistica Sinica 31 (1), 79–105.
Geladi, P. (1988). Notes on the history and nature of partial least squares
(PLS) modeling. Journal of Chemometrics 2, 231–246.
Guide, J. B. and M. Ketokivi (2015, July). Notes from the editors: Redefining
some methodological criteria for the journal. Journal of Operations
Management 37, v–viii.
Helland, I. S., S. Sæbø, T. Almøy, and R. Rimal (2018). Model and estimators
for partial least squares regression. Journal of Chemometrics 32 (9), e3044.
Henderson, H. and S. Searle (1979). Vec and vech operators for matrices,
with some uses in Jacobians and multivariate statistics. The Canadian
Journal of Statistics/La Revue Canadienne de Statistique 7 (1), 65–81.
Lavoie, F. B., K. Muteki, and R. Gosselin (2019). A novel robust NL-PLS regres-
sion methodology. Chemometrics and Intelligent Laboratory Systems 184,
71–81.
Li, K. C. (1991). Sliced inverse regression for dimension reduction (with dis-
cussion). Journal of the American Statistical Association 86 (414), 316–342.
Li, L., R. D. Cook, and C.-L. Tsai (2007). Partial inverse regression.
Biometrika 94 (3), 615–625.
Lindgren, F., P. Geladi, and S. Wold (1993). The kernel algorithm for PLS.
Journal of Chemometrics 7 (1), 44–59.
Liu, Y. and W. Rayens (2007). PLS and dimension reduction for classification.
Computational Statistics 22, 189–208.
Lohmöller, J.-B. (1989). Latent Variable Path Modeling with Partial Least
Squares. New York: Springer.
Pardoe, I., X. Yin, and R. D. Cook (2007). Graphical tools for quadratic
discriminant analysis. Technometrics 49 (2), 172–183.
Russo, D. and K.-J. Stol (2023). Don’t throw the baby out with the
bathwater: Comments on “recent developments in PLS”. Communications
of the Association for Information Systems 52, 700–704.
Shan, P., S. Peng, Y. Bi, L. Tang, C. Yang, Q. Xie, and C. Li (2014). Partial
least squares–slice transform hybrid model for nonlinear calibration.
Chemometrics and Intelligent Laboratory Systems 138, 72–83.
Shao, Y., R. D. Cook, and S. Weisberg (2007). Marginal tests with sliced
average variance estimation. Biometrika 94 (2), 285–296.
Shao, Y., R. D. Cook, and S. Weisberg (2009). Partial central subspace and
sliced average variance estimation. Journal of Statistical Planning and
Inference 139 (3), 952–961.
Small, C. G., J. Wang, and Z. Yang (2000). Eliminating multiple root prob-
lems in estimation (with discussion). Statistical Science 15 (4), 313–341.
Vinzi, E. V., L. Trinchera, and S. Amato (2010). PLS path modeling: From
foundations to recent developments and open issues for model assessment
and improvement. In E. V. Vinzi, W. W. Chin, J. Henseler, and H. Wang
(Eds.), Handbook of Partial Least Squares, Chapter 2, pp. 47–82. Berlin:
Springer-Verlag.
Wold, H. (1975a). Path models with latent variables: The NIPALS ap-
proach. In H. M. Blalock, A. Aganbegian, F. M. Borodkin, R. Boudon,
and V. Capecchi (Eds.), Quantitative Sociology, Chapter 11, pp. 307–357.
London: Academic Press.
Wold, H. (1982). Soft modeling: The basic design and some extensions. In K. G.
Jöreskog and H. Wold (Eds.), Systems Under Indirect Observation: Causal-
ity, Structure, Prediction, Vol. 2, pp. 1–54. Amsterdam: North-Holland.
Wold, S. (1992). Nonlinear partial least squares modeling II. Spline inner
relation. Chemometrics and Intelligent Laboratory Systems 14 (1–3), 71–84.
Wold, S., H. Martens, and H. Wold (1983). The multivariate calibration prob-
lem in chemistry solved by the PLS method. In A. Ruhe and B. Kågström
(Eds.), Proceedings of the Conference on Matrix Pencils, Lecture Notes in
Mathematics, Vol. 973, pp. 286–293. Heidelberg: Springer-Verlag.
Wold, S., J. Trygg, A. Berglund, and H. Antti (2001). Some recent de-
velopments in PLS modeling. Chemometrics and Intelligent Laboratory
Systems 58 (2), 131–150.