Partial Least Squares Regression

Partial least squares (PLS) regression is, at its historical core, a black-box algorith-
mic method for dimension reduction and prediction based on an underlying linear
relationship between a possibly vector-valued response and a number of predictors.
Through envelopes, much more has been learned about PLS regression, resulting in
a mass of information that allows an envelope bridge that takes PLS regression from
a black-box algorithm to a core statistical paradigm based on objective function op-
timization and, more generally, connects the applied sciences and statistics in the
context of PLS. This book focuses on developing this bridge. It also covers uses of PLS
outside of linear regression, including discriminant analysis, non-linear regression,
generalized linear models and dimension reduction generally.

Key Features:
•  Showcases the first serviceable method for studying high-dimensional regressions.
•  Provides necessary background on PLS and its origin.
•  R and Python programs are available for nearly all methods discussed in the book.

R. Dennis Cook is Professor Emeritus, School of Statistics, University of Minnesota.
His research areas include dimension reduction, linear and nonlinear regression,
experimental design, statistical diagnostics, statistical graphics, and population
genetics. Perhaps best known for “Cook’s Distance,” a now ubiquitous statistical
method, he has authored over 250 research articles, two textbooks and three research
monographs. He is a five-time recipient of the Jack Youden Prize for Best Expository
Paper in Technometrics as well as the Frank Wilcoxon Award for Best Technical Paper.
He received the 2005 COPSS Fisher Lecture and Award, and is a Fellow of ASA and IMS.

Liliana Forzani is Full Professor, School of Chemical Engineering, National University
of Litoral, and principal researcher of CONICET (National Scientific and Technical
Research Council), Argentina. Her contributions are in mathematical statistics,
especially sufficient dimension reduction, abundance in regression and statistics for
chemometrics. She established the first research group in statistics at her university
after receiving her Ph.D. in Statistics at the University of Minnesota. She has authored
over 75 research articles in mathematics and statistics, and was a recipient of the
L’Oreal-Unesco-Conicet prize for Women in Science.
Partial Least Squares Regression
And Related Dimension Reduction Methods

R. Dennis Cook
Liliana Forzani
Designed cover image: © R. Dennis Cook and Liliana Forzani

MATLAB and Simulink are trademarks of The MathWorks, Inc. and are used with permission. The MathWorks
does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB or
Simulink software or related products does not constitute endorsement or sponsorship by The MathWorks of a
particular pedagogical approach or particular use of the MATLAB and Simulink software.
First edition published 2024
by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431

and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2024 R. Dennis Cook and Liliana Forzani

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright
holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowl-
edged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, includ-
ing photocopying, microfilming, and recording, or in any information storage or retrieval system, without written
permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact
the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works
that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Names: Cook, R. Dennis, author. | Forzani, Liliana, author.


Title: Partial least squares regression : and related dimension reduction
methods / R. Dennis Cook and Liliana Forzani.
Description: First edition. | Boca Raton, FL : CRC Press, 2024. | Includes
bibliographical references and index.
Identifiers: LCCN 2024000157 (print) | LCCN 2024000158 (ebook) | ISBN
9781032773186 (hbk) | ISBN 9781032773230 (pbk) | ISBN 9781003482475
(ebk)
Subjects: LCSH: Least squares. | Regression analysis. | Multivariate
analysis.
Classification: LCC QA275 .C76 2024 (print) | LCC QA275 (ebook) | DDC
511/.422--dc23/eng/20240326
LC record available at https://lccn.loc.gov/2024000157
LC ebook record available at https://lccn.loc.gov/2024000158

ISBN: 978-1-032-77318-6 (hbk)
ISBN: 978-1-032-77323-0 (pbk)
ISBN: 978-1-003-48247-5 (ebk)

DOI: 10.1201/9781003482475

Typeset in CMR10 font by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

For Violette and Rose

R. D. C.

For Bera, Edu and Marco

L. F.
Contents

Preface xvii

Notation and Definitions xxi

Authors xxvii

List of Figures xxix

List of Tables xxxiii

1 Introduction 1
1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Corn moisture . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Meat protein . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Serum tetracycline . . . . . . . . . . . . . . . . . . . . . 6
1.2 The multivariate linear model . . . . . . . . . . . . . . . . . . . 7
1.2.1 Notation and some algebraic background . . . . . . . . 9
1.2.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.3 Partitioned models and added variable plots . . . . . . . 15
1.3 Invariant and reducing subspaces . . . . . . . . . . . . . . . . . 15
1.4 Envelope definition . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.5 Algebraic methods of envelope construction . . . . . . . . . . . 22
1.5.1 Algorithm E . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5.2 Algorithm K . . . . . . . . . . . . . . . . . . . . . . . . 23
1.5.3 Algorithm N . . . . . . . . . . . . . . . . . . . . . . . . 25
1.5.4 Algorithm S . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.5.5 Algorithm L . . . . . . . . . . . . . . . . . . . . . . . . 32
1.5.6 Other envelope algorithms . . . . . . . . . . . . . . . . 33


2 Envelopes for Regression 35


2.1 Model-free predictor envelopes defined . . . . . . . . . . . . . . 36
2.1.1 Central subspace . . . . . . . . . . . . . . . . . . . . . . 37
2.1.2 Model-free predictor envelopes . . . . . . . . . . . . . . 38
2.1.3 Conceptual contrasts . . . . . . . . . . . . . . . . . . . . 40
2.2 Predictor envelopes for the multivariate linear model . . . . . . 41
2.3 Likelihood estimation of predictor envelopes . . . . . . . . . . . 43
2.3.1 Likelihood-inspired estimators . . . . . . . . . . . . . . . 43
2.3.2 PLS connection . . . . . . . . . . . . . . . . . . . . . . . 48
2.4 Response envelopes, model-free and model-based . . . . . . . . 48
2.5 Response envelopes for the multivariate linear model . . . . . . 52
2.5.1 Response envelope model . . . . . . . . . . . . . . . . . 52
2.5.2 Relationship to models for longitudinal data . . . . . . . 53
2.5.3 Likelihood-based estimation of response envelopes . . . 55
2.5.4 Gasoline data . . . . . . . . . . . . . . . . . . . . . . . . 59
2.5.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.5.6 PLS connection . . . . . . . . . . . . . . . . . . . . . . . 62
2.6 Nadler-Coifman model for spectroscopic data . . . . . . . . . . 62
2.7 Variable scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3 PLS Algorithms for Predictor Reduction 69


3.1 NIPALS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.1.1 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.1.2 NIPALS example . . . . . . . . . . . . . . . . . . . . . . 74
3.1.3 Sample structure of the NIPALS algorithm . . . . . . . 75
3.1.4 Population details for the NIPALS algorithm . . . . . . 78
3.2 Envelopes and NIPALS . . . . . . . . . . . . . . . . . . . . . . 83
3.2.1 The NIPALS subspace reduces ΣX and contains B . . . 83
3.2.2 The NIPALS weights span the ΣX -envelope of B . . . . 85
3.3 SIMPLS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.3.1 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.3.2 SIMPLS example . . . . . . . . . . . . . . . . . . . . . . 88
3.3.3 Details underlying the SIMPLS algorithm . . . . . . . . 90
3.4 Envelopes and SIMPLS . . . . . . . . . . . . . . . . . . . . . . 92
3.4.1 The SIMPLS subspace reduces ΣX and contains B . . . 93
3.4.2 The SIMPLS weights span the ΣX -envelope of B . . . . 94
3.5 SIMPLS v. NIPALS . . . . . . . . . . . . . . . . . . . . . . . . 96
3.5.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.5.2 Summary of common features of NIPALS and SIMPLS 98
3.6 Helland’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.7 Illustrative example . . . . . . . . . . . . . . . . . . . . . . . . 100
3.8 Likelihood estimation of predictor envelopes . . . . . . . . . . . 103
3.9 Comparisons of likelihood and PLS estimators . . . . . . . . . 104
3.10 PLS1 v. PLS2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.11 PLS for response reduction . . . . . . . . . . . . . . . . . . . . 108

4 Asymptotic Properties of PLS 110


4.1 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.2 One-component regressions . . . . . . . . . . . . . . . . . . . . 112
4.2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.2.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.3 Asymptotic distribution of βbpls as n → ∞ with p fixed . . . . . 114
4.3.1 Envelope vs. PLS estimators . . . . . . . . . . . . . . . 116
4.3.2 Mussels muscles . . . . . . . . . . . . . . . . . . . . . . 117
4.4 Consistency of PLS in high-dimensional regressions . . . . . . . 120
4.4.1 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.4.2 Quantities governing asymptotic behavior of DN . . . . 122
4.4.3 Consistency results . . . . . . . . . . . . . . . . . . . . . 124
4.5 Distributions of one-component PLS estimators . . . . . . . . . 129
4.5.1 Asymptotic distributions and bias . . . . . . . . . . . . 130
4.5.2 Confidence intervals . . . . . . . . . . . . . . . . . . . . 132
4.5.3 Theorem 4.2, Corollary 4.6 and their constituents . . . . 133
4.5.4 Simulation results . . . . . . . . . . . . . . . . . . . . . 139
4.5.5 Confidence intervals for conditional means . . . . . . . . 141
4.6 Consistency in multi-component regressions . . . . . . . . . . . 142
4.6.1 Definitions of the signal η(p), noise κ(p) and collinearity
ρ(p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.6.2 Abundance v. sparsity . . . . . . . . . . . . . . . . . . . 144
4.6.3 Universal conditions . . . . . . . . . . . . . . . . . . . . 145
4.6.4 Multi-component consistency results . . . . . . . . . . . 146
4.7 Bounded or unbounded signal? . . . . . . . . . . . . . . . . . . 149
4.8 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.8.1 One-component regression . . . . . . . . . . . . . . . . . 150
4.8.2 Two-component regression . . . . . . . . . . . . . . . . . 154

5 Simultaneous Reduction 156


5.1 Foundations for simultaneous reduction . . . . . . . . . . . . . 156
5.1.1 Definition of the simultaneous envelope . . . . . . . . . 159
5.1.2 Bringing in the linear model . . . . . . . . . . . . . . . . 159
5.1.3 Cook-Zhang development of (5.7) . . . . . . . . . . . . . 160
5.1.4 Links to canonical correlations . . . . . . . . . . . . . . 161
5.2 Likelihood-based estimation . . . . . . . . . . . . . . . . . . . . 161
5.2.1 Estimators . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.2.2 Computing . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.3 PLS estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.4 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.4.1 Bilinear models for simultaneous reduction . . . . . . . 167
5.4.2 Another class of bilinear models . . . . . . . . . . . . . 169
5.4.3 Two-block algorithm for simultaneous reduction . . . . 169
5.5 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.5.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 172
5.5.2 Varying sample size . . . . . . . . . . . . . . . . . . . . 173
5.5.3 Varying the variance of the material components via a . 175
5.5.4 Varying sample size with non-diagonal covariance
matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.5.5 Conclusions from the simulations . . . . . . . . . . . . . 178
5.5.6 Three illustrative data analyses . . . . . . . . . . . . . . 180

6 Partial PLS and Partial Envelopes 182


6.1 Partial predictor reduction . . . . . . . . . . . . . . . . . . . . . 183
6.2 Partial predictor envelopes . . . . . . . . . . . . . . . . . . . . . 185
6.2.1 Synopsis of PLS for compressing X1 . . . . . . . . . . . 185
6.2.2 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.2.3 Likelihood-based partial predictor envelopes . . . . . . . 189
6.2.4 Adaptations of partial predictor envelopes . . . . . . . . 192
6.2.5 Partial predictor envelopes in the Australian Institute
of Sport . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
6.3 Partial predictor envelopes in economic growth prediction . . . 194
6.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.3.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.3.3 Model and data . . . . . . . . . . . . . . . . . . . . . . . 195
6.3.4 Methods and results . . . . . . . . . . . . . . . . . . . . 197
6.4 Partial response reduction . . . . . . . . . . . . . . . . . . . . . 199


6.4.1 Synopsis of PLS for Y compression to estimate β1 . . . 199
6.4.2 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.4.3 Likelihood-based partial response envelopes . . . . . . . 201
6.4.4 Application to longitudinal data . . . . . . . . . . . . . 203
6.4.5 Reactor data . . . . . . . . . . . . . . . . . . . . . . . . 205

7 Linear Discriminant Analysis 207


7.1 Envelope discriminant subspace . . . . . . . . . . . . . . . . . . 208
7.2 Linear discriminant analysis . . . . . . . . . . . . . . . . . . . . 211
7.3 Principal fitted components . . . . . . . . . . . . . . . . . . . . 212
7.3.1 Discrimination via PFC with Isotropic errors . . . . . . 215
7.3.2 Envelope and PLS discrimination under model (7.8) . . 216
7.4 Discrimination via PFC with ΣX|C > 0 . . . . . . . . . . . . . 219
7.5 Discrimination via PLS and PFC-RR . . . . . . . . . . . . . . 220
7.5.1 Envelopes and PFC-RR model (7.10) combined . . . . . 220
7.5.2 Bringing in PLS . . . . . . . . . . . . . . . . . . . . . . 221
7.6 Overview of LDA methods . . . . . . . . . . . . . . . . . . . . . 222
7.7 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.7.1 Coffee . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7.7.2 Olive oil . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

8 Quadratic Discriminant Analysis 229


8.1 Dimension reduction for QDA . . . . . . . . . . . . . . . . . . . 231
8.2 Dimension reduction with S = SC|X . . . . . . . . . . . . . . . 234
8.3 Dimension reduction with S = EΣ (SC|X ) . . . . . . . . . . . . . 236
8.3.1 Envelope structure of model (8.1) . . . . . . . . . . . . . 236
8.3.2 Likelihood-based envelope estimation . . . . . . . . . . . 238
8.3.3 Quadratic discriminant analysis via algorithms N and S 239
8.4 Overview of QDA methods . . . . . . . . . . . . . . . . . . . . 242
8.5 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
8.5.1 Birds, planes, and cars . . . . . . . . . . . . . . . . . . . 244
8.5.2 Fruit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
8.6 Coffee and olive oil data . . . . . . . . . . . . . . . . . . . . . . 246

9 Non-linear PLS 249


9.1 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
9.2 Central mean subspace . . . . . . . . . . . . . . . . . . . . . . . 250
9.3 Linearity conditions . . . . . . . . . . . . . . . . . . . . . . . . 253
9.4 NIPALS under the non-linear model . . . . . . . . . . . . . . . 255
9.5 Extended PLS algorithms for dimension reduction . . . . . . . 258
9.5.1 A generalization of the logic leading to Krylov spaces . 258
9.5.2 Generating NIPALS vectors . . . . . . . . . . . . . . . . 260
9.5.3 Removing linear trends in NIPALS . . . . . . . . . . . . 261
9.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
9.6.1 Available approaches to prediction . . . . . . . . . . . . 263
9.6.2 Prediction via inverse regression . . . . . . . . . . . . . 264
9.7 Tecator data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
9.8 Etanercept data . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
9.9 Solvent data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278

10 The Role of PLS in Social Science Path Analyses 281


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
10.1.1 Background on the pls|sem vs cb|sem debate . . . . . 282
10.1.2 Chapter goals . . . . . . . . . . . . . . . . . . . . . . . . 283
10.1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
10.2 A reflective path model . . . . . . . . . . . . . . . . . . . . . . 284
10.2.1 Path diagrams used in this chapter . . . . . . . . . . . . 284
10.2.2 Implied model . . . . . . . . . . . . . . . . . . . . . . . 287
10.2.3 Measuring association between η and ξ . . . . . . . . . 287
10.2.4 Construct interpretation . . . . . . . . . . . . . . . . . . 288
10.2.5 Constraints on ξ and η . . . . . . . . . . . . . . . . . . . 289
10.2.6 Synopsis of estimation results . . . . . . . . . . . . . . . 290
10.3 Reduced rank regression . . . . . . . . . . . . . . . . . . . . . . 290
10.4 Estimators of Ψ . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
10.4.1 Maximum likelihood estimator of Ψ . . . . . . . . . . . 291
10.4.2 Moment estimator . . . . . . . . . . . . . . . . . . . . . 292
10.4.3 Envelope and PLS estimators of Ψ . . . . . . . . . . . . 293
10.4.4 The pls|sem estimator used in structural equation
modeling of path diagrams . . . . . . . . . . . . . . . . 294
10.5 cb|sem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
10.6 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
10.7 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . 300


10.7.1 Simulation results with ΣX|ξ = ΣY |η = I3 . . . . . . . . 301
10.7.2 Non-diagonal ΣX|ξ and ΣY |η . . . . . . . . . . . . . . . 303
10.7.3 Multiple reflective composites . . . . . . . . . . . . . . . 305
10.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307

11 Ancillary Topics 309


11.1 General application of the algorithms . . . . . . . . . . . . . . . 309
11.2 Bilinear models . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
11.2.1 Bilinear model of Martin and Næs . . . . . . . . . . . . 311
11.2.2 Bilinear probabilistic PLS model . . . . . . . . . . . . . 313
11.3 Conjugate gradient, NIPALS, and SIMPLS . . . . . . . . . . . 316
11.3.1 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
11.3.2 Details for the conjugate gradient algorithm of
Table 11.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 317
11.3.3 Origins of CGA . . . . . . . . . . . . . . . . . . . . . . . 320
11.4 Sparse PLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
11.5 PLS for multi-way predictors . . . . . . . . . . . . . . . . . . . 323
11.6 PLS for generalized linear models . . . . . . . . . . . . . . . . . 325
11.6.1 Foundations . . . . . . . . . . . . . . . . . . . . . . . . . 327
11.6.2 Generalized linear models . . . . . . . . . . . . . . . . . 328
11.6.3 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . 331

A Proofs of Selected Results 335


A.1 Proofs for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . 335
A.1.1 Justification for (1.7) . . . . . . . . . . . . . . . . . . . . 335
A.1.2 Lemma 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . 336
A.1.3 Proposition 1.2 . . . . . . . . . . . . . . . . . . . . . . . 336
A.1.4 Corollary 1.1 . . . . . . . . . . . . . . . . . . . . . . . . 337
A.1.5 Lemma 1.3 . . . . . . . . . . . . . . . . . . . . . . . . . 338
A.1.6 Proposition 1.3 . . . . . . . . . . . . . . . . . . . . . . . 339
A.1.7 Proposition 1.4 . . . . . . . . . . . . . . . . . . . . . . . 339
A.1.8 Lemma 1.4 . . . . . . . . . . . . . . . . . . . . . . . . . 340
A.1.9 Proposition 1.8 . . . . . . . . . . . . . . . . . . . . . . . 340
A.2 Proofs for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . 341
A.2.1 Lemma 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . 341


A.3 Proofs for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . 341
A.3.1 Lemma 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . 341
A.3.2 Lemma 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . 342
A.3.3 Lemma 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . 343
A.3.4 Lemma 3.4 . . . . . . . . . . . . . . . . . . . . . . . . . 344
A.3.5 Lemma 3.5 . . . . . . . . . . . . . . . . . . . . . . . . . 346
A.3.6 Lemma 3.6 . . . . . . . . . . . . . . . . . . . . . . . . . 347
A.4 Proofs for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . 348
A.4.1 Proposition 4.1 . . . . . . . . . . . . . . . . . . . . . . . 348
A.4.2 Notes on Corollary 4.1 . . . . . . . . . . . . . . . . . . . 352
A.4.3 Form of ΣX under compound symmetry . . . . . . . . . 353
A.4.4 Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . 353
A.4.5 Proposition 4.2 . . . . . . . . . . . . . . . . . . . . . . . 359
A.5 Proofs for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . 360
A.5.1 Proposition 5.1 . . . . . . . . . . . . . . . . . . . . . . . 360
A.5.2 Proof of Lemma 5.1 . . . . . . . . . . . . . . . . . . . . 362
A.5.3 Proof of Lemma 5.3 . . . . . . . . . . . . . . . . . . . . 362
A.5.4 Justification of the two-block algorithm, Section 5.4.3 . 363
A.6 Proofs for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . 365
A.6.1 Justification for the equivalence of (6.5) and (6.6) . . . . 365
A.6.2 Derivation of the MLEs for the partial envelope,
Section 6.2.3 . . . . . . . . . . . . . . . . . . . . . . . . 365
A.7 Proofs for Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . 369
A.7.1 Proposition 9.1 . . . . . . . . . . . . . . . . . . . . . . . 369
A.7.2 Proposition 9.2 . . . . . . . . . . . . . . . . . . . . . . . 370
A.7.3 Proposition 9.3 . . . . . . . . . . . . . . . . . . . . . . . 371
A.7.4 Proposition 9.5 . . . . . . . . . . . . . . . . . . . . . . . 372
A.8 Proofs for Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . 372
A.8.1 Model from Section 10.2 . . . . . . . . . . . . . . . . . 372
A.8.2 Constraints in cb|sem . . . . . . . . . . . . . . . . . . . 373
A.8.3 Lemma 10.1 . . . . . . . . . . . . . . . . . . . . . . . . . 374
A.8.4 Proposition 10.1 . . . . . . . . . . . . . . . . . . . . . . 374
A.8.5 Proposition 10.2 . . . . . . . . . . . . . . . . . . . . . . 378
A.8.6 Lemma 10.2 . . . . . . . . . . . . . . . . . . . . . . . . . 379
A.8.7 Proof of Proposition 10.3 . . . . . . . . . . . . . . . . . 380
A.9 Proofs for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . 381


A.9.1 Proposition 11.1 . . . . . . . . . . . . . . . . . . . . . . 381

Bibliography 387

Index 407
Preface

Partial least squares (PLS) regression is, at its historical core, a black-box
algorithmic method for dimension reduction and prediction based on an un-
derlying linear relationship between a possibly vector-valued response and a
number of predictors. Its origins trace back to the work of Herman Wold in
the 1960s, and it is generally recognized that Wold’s non-linear iterative par-
tial least squares (NIPALS) algorithm is a critical point in its evolution. PLS
regression made its first appearances in the chemometrics literature around
1980, subsequently spreading across the applied sciences. It has attracted con-
siderable attention because, with highly collinear predictors, its performance is
typically better than that of standard methods like ordinary least squares, and
it has particularly good properties in the class of abundant high-dimensional
regressions where many predictors contribute information about the response
and the sample size n is not sufficient for standard methods to yield un-
ambiguous answers. In some regressions, its estimators can converge at the
root-n rate regardless of the asymptotic relationship between n and the
number of predictors. It is perhaps the first serviceable method for studying
high-dimensional linear regressions.
The statistics community has been grappling for over three decades with
the regressions in which n is not sufficient to allow crisp application of standard
methods. In the early days, they fixed on the idea of sparsity, wherein only a
few predictors contribute information about the response, to drive their fitting
and prediction methods, resulting in a focus that likely hindered the pursuit of
other plausible methods. Not being based on optimizing an objective function,
PLS regression fell on the fringes of statistical methodology and consequently
did not receive much attention from the statistics community. This picture
began to change in 2013 with the findings that PLS regression is a special
case of the then nascent envelope theory of dimension reduction (Cook et al.,
2013). Through envelopes, we have learned much about PLS regression in the
past decade, resulting in a critical mass of information that allows us to provide
an envelope bridge that takes PLS regression from a black-box algorithm to a
core statistical paradigm based on objective function optimization and, more


generally, connects the applied sciences and statistics in the context of PLS.
Developing that bridge is a goal of this book. We hope that our arguments
will stimulate interest in PLS within the statistics community. A second goal
of this book is to check out uses of PLS outside of linear regression, including
discriminant analysis, non-linear regression, generalized linear models, and
dimension reduction generally.

Outline
Chapter 1 contains background on the multivariate linear model and on
envelopes. This is not intended as primary instruction, but may be sufficient to
establish basic ideas and notation. In Section 1.5, we introduce five algebraic
methods for envelope construction, some of which are generalizations of com-
mon PLS algorithms from the literature. In Chapter 2, we review envelopes for
response and predictor reduction in regression, and discuss first connections
with PLS regression. In Chapter 3, we describe the common PLS regression
algorithms for predictor reduction, NIPALS and SIMPLS, and prove their con-
nections to envelopes. We also discuss PLS for response reduction, PLS1 v.
PLS2 and various other topics. These first three chapters provide a foundation
for various extensions and adaptations of PLS that come next. Chapters 4 –
11 do not need to be read in order. For instance, readers who are not particu-
larly interested in asymptotic considerations may wish to skip Chapter 4 and
proceed with Chapter 5.
Various asymptotic topics are covered in Chapter 4, including convergence
rate and abundance v. sparsity. Simultaneous PLS reduction of responses and
predictors in multivariate linear regression is discussed in Chapter 5, and
methods for reducing only a subset of the predictors are described in Chap-
ter 6. We turn to adaptations for linear and quadratic discriminant analysis
in Chapters 7 and 8. In Chapter 9 we argue that there are settings in which
the dimension reduction arm of a PLS algorithm is serviceable in non-linear
as well as linear regression.
The versions of PLS used for path analysis in the social sciences are notably
different from the PLS regressions used in other areas like chemometrics. PLS
for path analysis is discussed in Chapter 10. Ancillary topics are discussed in
Chapter 11, including bilinear models, the relationship between PLS regression
and conjugate gradient methods, sparse PLS, and PLS for generalized linear
models. Most proofs are given in an appendix, but some that we feel may be
particularly informative are given in the main text.

Computing
R or Python computer programs are available for nearly all of the
methods discussed in this book and many have been implemented in both
languages. Most of the methods are available in integrated packages, but
some are standalone programs. These programs and packages are not dis-
cussed in this book, but descriptions of and links to them can be found at
https://lforzani.github.io/PLSR-book/. This format will allow for updates
and links to developments following this book. The web page also gives errata,
links to recent developments, color versions of some grayscale plots as they ap-
pear in the book, and commentary on parts of the book as necessary for clarity.

Acknowledgments
Earlier versions of this book were used as lecture notes for a one-semester
course at the University of Minnesota. Students in this course and our collabo-
rators contributed to the ideas and flavor of the book. In particular, we would
like to thank Shanshan Ding, Inga Helland, Zhihua Su, and Xin (Henry) Zhang
for their helpful discussions. Much of this book of course reflects the many
stimulating conversations we had with Bing Li and Francesca Chiaromonte
during the genesis of envelopes. The tecator dataset of Section 9.7 is available
at http://lib.stat.cmu.edu/datasets/tecator.
We extend our gratitude to

Rodrigo García Arancibia for providing the economic growth data and anal-
ysis of Section 6.3.
Fabricio Chiappini for providing the etanercept data of Section 9.8 and for
sharing with us a lot of discussion and background on chemometric data.
He and Alejandro Olivieri were very generous in providing us
with necessary data.
Pedro Morin for helping us comprehend the subtle link between PLS algo-
rithms and conjugate gradient methods.

Marilina Carena for help drawing Figure 10.1 and Jerónimo Basa for drawing
the mussels pictures for Chapter 4.

Special thanks to Marco Tabacman for making our codes more readable and
to Eduardo Tabacman for his guidance in creating numerous graphics in R.

R. Dennis Cook Liliana Forzani


St. Paul, MN, USA Santa Fe, Argentina
January, 2024
Notation and Definitions

Reminders of the following notation may be included in the book from time
to time.

• r number of responses.

• p number of predictors.

• q number of predictor components and dimension of predictor envelope.

• u number of response components and dimension of response envelope.

• X a p × 1 vector of predictors. X is the n × p matrix with rows (Xi − X̄)T
and X0 is the n × p matrix with rows XiT , i = 1, . . . , n.

• Y an r × 1 vector of responses. Y is the n × r matrix with rows (Yi − Ȳ )T
and Y0 is the n × r matrix with rows YiT , i = 1, . . . , n.

• Observable data matrices will generally be represented in mathbb font like
X, Y, and Z.

• C = (X T , Y T )T , the concatenation of X and Y .

• v1 (·) largest eigenvalue of the argument matrix with corresponding eigenvector
`1 (·) normalized to have length 1.

• diag(d1 , . . . , dm ) denotes an m×m diagonal matrix with diagonal elements
dj , j = 1, . . . , m.

• tr(A) denotes the trace of a square matrix A.

• Wq (M ) denotes the Wishart distribution with q degrees of freedom and
scale matrix M .


• Xn = Op (bn ) if Xn /bn = Op (1). That is, if Xn /bn is bounded stochastically:
for every ε > 0 there is a finite constant K(ε) > 0 and a finite
integer N (ε) > 0 so that, for all n ≥ N ,

Pr(|Xn /bn | > K) ≤ ε.

• ak ≍ bk means that, as k → ∞, ak = O(bk ) and bk = O(ak ), and we then
describe ak and bk as being asymptotically equivalent.

• a ≫ b informal comparison indicating that the real number a is large
relative to the real b.

• (M )ij denotes the i, j-th element of the matrix M . (V )i denotes the i-th
element of the vector V .

• A† denotes the Moore-Penrose inverse of the matrix A.

• Rm×n is the set of all real m × n matrices and Sk×k is the set of all real
symmetric k × k matrices.

• For a positive definite V ∈ Rt×t , PA(V ) denotes the projection onto
span(A) ⊆ Rt if A is a matrix, and then we have the matrix representation

PA(V ) = A(AT V A)−1 AT V

when AT V A is non-singular. The same notation PA(V ) denotes the projection
onto A if A is a subspace. In both cases projection is with respect to the
V inner product, and PA denotes projection with the identity inner product.
Let QA(V ) = I − PA(V ) . (A brief numerical sketch of this operator appears at
the end of this section.)

• For a vector V , kV k = (V T V )1/2 . For a real number a, |a| indicates the
absolute value of a. For a square matrix M , |M | denotes the determinant
of M .

• We indicate that a matrix A is positive definite by A > 0 and positive
semi-definite by A ≥ 0.

• vec(·): Ra×b → Rab vectorizes an arbitrary matrix by stacking its columns,
and vech(·): Ra×a → Ra(a+1)/2 vectorizes a symmetric matrix by extracting
its columns of elements on and below the diagonal.

• The population expectation and variance of a random vector A are denoted
as E(A) and var(A). The expectation operator may occasionally be
subscripted with the random vector for clarity. For instance, EA|B (A)
denotes the expectation of A with respect to the conditional distribution of
A given B.

• For random vectors A ∈ Ra and C ∈ Rc ,

ΣA,C = E{(A − E(A))(C − E(C))T },
ΣA = E{(A − E(A))(A − E(A))T }.

When the vectors are enclosed in parentheses, Σ(A,C) = ΣB , where
B = (AT , C T )T is the vector constructed by concatenating A and C. This
extends to multiple vectors: Σ(A1 ,A2 ,...,Ak ) = ΣB , where B is now
constructed by concatenating A1 , . . . , Ak .

• For a square nonsingular matrix A partitioned in diagonal blocks as

A = ( A1,1  A1,2
      A2,1  A2,2 ),

we set A1|2 = A1,1 − A1,2 (A2,2)−1 A2,1 .

• For stochastic vectors A and B, βA|B = ΣA,B (ΣB)−1 . The subscripts may be
dropped if clear from context; for instance, β = (ΣX)−1 ΣX,Y .

• Subspaces are indicated in a math calligraphy font; e.g. S, R, and A.
Subscripts may be added for specificity.

– If M ∈ Rm×n then span(M ) ⊆ Rm is the subspace spanned by the
columns of M .
– The sum of two subspaces S and R of Rm is defined as S + R =
{s + r | s ∈ S, r ∈ R}.
– CX,Y = span(ΣX,Y ), CY,X = span(ΣY,X ), B = span(β) and B 0 =
span(β T ).
– Called an envelope, EM (S) is the intersection of all reducing subspaces
of M that contain S, Definition 1.2.
– SY |X is the central subspace for the regression of Y on X, Definition 2.1.

• For an m × n matrix A and a p × q matrix B, their direct sum is defined as
the (m + p) × (n + q) block diagonal matrix A ⊕ B = diag(A, B). We will
also use the ⊕ operator for two subspaces. If S ⊆ Rp and R ⊆ Rq then
S ⊕ R = span(S ⊕ R) where S and R are basis matrices for S and R.

• The Kronecker product ⊗ between two matrices A ∈ Rr×s and B ∈ Rt×u
is the rt × su matrix defined in blocks as

A ⊗ B = (A)ij B, i = 1, . . . , r, j = 1, . . . , s,

where (A)ij denotes the ij-th element of A.

• For stochastic quantities A, B, C with a joint distribution, A ∼ B indicates
that A has the same distribution as B, A ⊥⊥ B indicates that A is independent
of B and A ⊥⊥ B | C indicates that A is conditionally independent of B given
any value for C. See Cook (1998, Section 4.6) for background on independence
relationships.

• Partial least squares notation: w, s, l weights, scores, and loadings. wd and
Wd weight vector and weight matrix for NIPALS. vd and Vd weight vector
and weight matrix for SIMPLS. The same notation is used for sample and
population weights, as seems common in the PLS literature. The context
will be clear from the setting.

• Common acronyms

– CB|SEM, covariance-based structural equation modeling
– CGA, conjugate gradient algorithm
– CMS, central mean subspace
– MLE, maximum likelihood estimation/estimator
– NIPALS, non-linear iterative partial least squares
– OLS, ordinary least squares
– PLS, partial least squares
– PLS|SEM, PLS-based structural equation modeling or analysis
– RRR, reduced rank regression
– SEM, structural equation model
– SIMPLS, straightforward implementation of a statistically inspired
modification of PLS (de Jong, 1993)
– WLS, weighted least squares.
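
As a quick numerical check of the projection notation defined earlier in this
section, the following Python fragment forms PA(V ) = A(AT V A)−1 AT V and
QA(V ) = I − PA(V ) and verifies two of their basic properties. The sketch is ours,
not part of the book’s notation, and the variable names and dimensions are
arbitrary.

# Minimal sketch (illustrative only): the projection onto span(A) with
# respect to the V inner product, and its complement.
import numpy as np

rng = np.random.default_rng(0)
t, k = 5, 2
A = rng.normal(size=(t, k))                    # basis matrix; we project onto span(A)
M = rng.normal(size=(t, t))
V = M @ M.T + t * np.eye(t)                    # a positive definite inner-product matrix

P = A @ np.linalg.inv(A.T @ V @ A) @ A.T @ V   # P_{A(V)} = A (A^T V A)^{-1} A^T V
Q = np.eye(t) - P                              # Q_{A(V)} = I - P_{A(V)}

assert np.allclose(P @ P, P)                   # projections are idempotent
assert np.allclose(P @ A, A)                   # span(A) is left fixed
assert np.allclose(Q @ A, np.zeros_like(A))    # the complement annihilates span(A)

Note that P is idempotent but, unless V is the identity, not symmetric; taking
V = np.eye(t) recovers the ordinary orthogonal projection PA .
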
Authors

R. Dennis Cook

Dennis Cook is Professor Emeritus, School of Statistics, University of Minnesota.
His research areas include dimension reduction, linear and non-linear regression,
experimental design, statistical diagnostics, statistical graphics and population
genetics. Perhaps best known for “Cook’s Distance,” a now ubiquitous statistical
method, he has authored over 250 research articles, two textbooks and three
research monographs. He is a five-time recipient of the Jack Youden Prize for Best
Expository Paper in Technometrics as well as the Frank Wilcoxon Award for Best
Technical Paper. He received the 2005 COPSS Fisher Lecture and Award, and is a
Fellow of ASA and IMS.

School of Statistics
University of Minnesota
Minneapolis, MN 55455, U.S.A.
Email: [email protected]

xxvii
Liliana Forzani

Liliana Forzani is Full Professor, School of Chemical Engineering, National
University of Litoral and principal researcher of CONICET (National Scientific
and Technical Research Council), Argentina. Her contributions are in mathematical
statistics, especially sufficient dimension reduction, abundance in regression and
statistics for chemometrics. She established the first research group in statistics
at her university after receiving her Ph.D. in Statistics at the University of
Minnesota. She has authored over 60 research articles in mathematics and statistics,
and was a recipient of the L’Oreal-Unesco-Conicet prize for Women in Science.
https://sites.google.com/site/lilianaforzani/english

Facultad de Ingeniería Química, UNL
Santiago del Estero 2819
Santa Fe, Argentina
Researcher of CONICET.
Email: [email protected]
List of Figures

1.1 Corn moisture: Plots of the PLS fitted values as ◦ and lasso
fitted values as + versus the observed response for the test
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Meat protein: Plot of the PLS fitted values as ◦ and lasso fitted
values as + versus the observed response for the test data. . . 7
1.3 Serum tetracycline: Plots of the PLS fitted values as ◦ and lasso
fitted values as + versus the observed response for the test
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1 Illustration of response envelopes using two wavelengths from
the gasoline data. High octane data marked by x and low
octane numbers by o. Eb = EbΣY |X (B 0 ): estimated envelope.
Eb⊥ = EbΣ⊥Y |X (B 0 ): estimated orthogonal complement of the en-
velope. Marginal distributions of high octane numbers are rep-
resented by dashed curves along the horizontal axis. Marginal
envelope distributions of low octane numbers are represented
by solid curves along the horizontal axis. (From the Graphical
Abstract for Cook and Forzani (2020) with permission.) . . . . 60
2.2 Illustration of how response rescaling can affect a response en-
velope from the regression of a bivariate response on a binary
predictor, X = 0, 1. The two distributions represented in each
plot are the distributions of Y | (X = 0) and Y | (X = 1). The
axes represent the coordinates of Y = (y1 , y2 )T . . . . . . . . . . 66

4.1 Mussels’ data: Plot of the observed responses Yi versus the fit-
ted values Ybi from the PLS fit with one component, q = 1 . . . 118
4.2 Mussels’ data: Plot of the fitted values Ybi from the PLS fit
with one component, q = 1, versus the OLS fitted values. The
diagonal line y = x was included for clarity. . . . . . . . . . . . 118


4.3 Division of the (a, s) plane according to the convergence prop-
erties given in conclusions II and III of Corollaries 4.3 and 4.4. 128
4.4 Simulation results on bias: The right histogram in each plot
is of V −1/2 GT (βbpls − β)(1 + b) and the left histogram is of
V −1/2 GT (βbpls − β). The standard normal reference density is
also shown and in all cases n = p/2. (Plot was constructed with
permission using the data that Basa et al. (2022) used for their
Fig. 1.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.5 Plots of the means and standard deviations versus log p corre-
sponding to the simulations of Figure 4.4. . . . . . . . . . . . . 139
4.6 Tetracycline data: The open circles give the validation root
MSE from 10, 20, 33, 50, and 101 equally spaced spectra. (From
Fig. 4 of Cook and Forzani (2019) with permission.) . . . . . . 150
4.7 Illustration of the behavior of PLS predictions in abundant and
sparse regressions. Lines correspond to different numbers of ma-
terial predictors. Reading from top to bottom as the lines ap-
proach the vertical axis, the first bold line is for 2 material
predictors. The second dashed line is for p2/3 material predic-
tors. The third solid line is for p − 40 material predictors and
the last line with circles at the predictor numbers used in the
simulations is for p material predictors. The vertical axis is the
squared norm kβb − βk2SX and always n = p/3. (This figure was
constructed with permission using the same data as Cook and
Forzani (2020) used for their Fig. 1.) . . . . . . . . . . . . . . . 153

6.1 Plot of lean body mass versus the first partial SIR predictor
based on data from the Australian Institute of Sport. circles:
males; exes: females. . . . . . . . . . . . . . . . . . . . . . . . . 184
6.2 Economic growth prediction for 12 South American countries,
2003–2018, n = 161: Plot of the response Y versus the leave-
one-out fitted values from (a) the partial PLS fit with q1 = 7
and (b) the lasso. . . . . . . . . . . . . . . . . . . . . . . . . . 199

7.1 Illustration of the importance of the central discriminant sub-
space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
7.2 Coffee data: Plots of estimated classification rate (accuracy)
versus the number of components . . . . . . . . . . . . . . . . . 225
7.3 Coffee data: Plots of PLS, PFC, and Isotropic projections. . . . 226
7.4 Olive oil data: Plots of PLS, PLS+PFC, and Isotropic projec-
tions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

8.1 Bird-plane-cars data: Plots of the first two projected features
for AN-Q1 and AN-Q2. . . . . . . . . . . . . . . . . . . . . . . 245
8.2 Fruit data: Plots of the first two projected features for AN-Q1
and AN-Q2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

9.1 Diagnostic plots for fat in the Tecator data. Plot (a) was
smoothed with a quadratic polynomial. The other three plots
were smoothed with splines of order 5. (From Fig. 1 of Cook
and Forzani (2021) with permission.) . . . . . . . . . . . . . . . 268
9.2 Diagnostic plots for protein in the Tecator data. Plot (a) was
smoothed with a quadratic polynomial. The other three plots
were smoothed with splines of order 5. . . . . . . . . . . . . . . 269
9.3 Diagnostic plots for water in the Tecator data. Plot (a) was
smoothed with a quadratic polynomial. The other three plots
were smoothed with splines of order 5. . . . . . . . . . . . . . . 270
9.4 Tecator data: root mean squared prediction error versus num-
ber of components for fat. Upper curve is for linear PLS,
method 1 in Table 9.1; lower curve is for inverse PLS predic-
tion with βbTnpls X, method 4. (From Fig. 2 of Cook and Forzani
(2021) with permission.) . . . . . . . . . . . . . . . . . . . . . . 274
9.5 Solvent data: Predictive root mean squared error PRMSE ver-
sus number of components for three methods of fitting. A: linear
PLS. B: non-parametric inverse PLS with W T X, as discussed in
Section 9.7. C: The non-linear PLS method proposed by Lavoie
et al. (2019). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

10.1 Reflective path diagrams relating two latent constructs ξ and
η with their respective univariate indicators, X1 , . . . , Xp and
Y1 , . . . , Yr . ϕ denotes either cor(ξ, η) or Ψ = cor{E(ξ | X), E(η |
Y )}. Upper and lower diagrams are the uncore and core
models. (From Fig. 2.1 of Cook and Forzani (2023) with per-
mission.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
10.2 Simulations with N = 100, ΣX|ξ = ΣY |η = I3 . Horizontal
axes give the true correlations; vertical axes give the estimated
correlations. Lines x = y represented equality. . . . . . . . . . 302
10.3 Simulations with N = 1000, ΣX|ξ = ΣY |η = I3 . Horizontal
axes give the true correlations; vertical axes give the estimated
correlations. Lines x = y represented equality. . . . . . . . . . 303
10.4 Simulations with N = 100, ΣX|ξ = ΣY |η = L(LT L)−1 LT +
3L0 LT0 . Horizontal axes give the true correlations; vertical axes
give the estimated correlations. The lines represent x = y. . . . 304
10.5 Results of simulations with N = 1000, ΣX|ξ = ΣY |η =
L(LT L)−1 LT + 3L0 LT0 . Horizontal axes give the true corre-
lations; vertical axes give the estimated correlations. The lines
represent x = y. . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
10.6 Results of simulations with two reflective composites for X and
Y , q = u = 2. Horizontal axes give the true correlations; verti-
cal axes give the estimated correlations. (Constructed following
Fig. 6.3 of Cook and Forzani (2023) with permission.) . . . . . 306

11.1 PLS-GLM data: Estimates of the densities of the estimators
of the first component β1 in the simulation to illustrate the
PLS algorithm for GLMs in Table 11.4. Linear envelope and
PLS starts refer to starting iteration at the envelope and PLS
estimators from a fit of the linear model, ignoring the GLM
structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
List of Tables

1.1 Root mean squared prediction error (RMSE), from three ex-
amples that illustrate the prediction potential of PLS
regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 Conceptual contrasts between three dimension reduction
paradigms. ‘Construction’ means that methodology attempts
to select S to satisfy the condition. ‘Assumption’ means that
methodology requires the condition without any attempt to
ensure its suitability. ‘No constraint’ means that methodology
does not address the condition per se. . . . . . . . . . . . . . . 40
2.2 Relationship between predictor and response envelopes when X
and Y are jointly distributed: CX,Y = span(ΣX,Y ) and CY,X =
span(ΣY,X ). B = span(βY |X ) and B 0 = span(βYT |X ) are as used
previously for model (1.1). B∗ and B∗T are defined similarly but
in terms of the regression of X | Y . P-env and R-env stand for
predictor and response envelope. . . . . . . . . . . . . . . . . . 50

3.1 NIPALS algorithm: (a) sample version adapted from Martens
and Næs (1989) and Stocchero (2019). The n × p matrix X
contains the centered predictors and the n×r matrix Y contains
the centered responses; (b) population version derived herein. . 71
3.2 Bare bones version of the NIPALS algorithm given in Ta-
ble 3.1(a). The notation Y1 = Y of Table 3.1(a) is not used
since here there is no iteration over Y . . . . . . . . . . . . . . . 77
3.3 Bare bones version in the principal component scale of the NI-
PALS algorithm given in
Table 3.1(a). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78


3.4 SIMPLS algorithm: (a) sample version adapted from de Jong
(1993, Table 1). The n × p matrix X contains the centered pre-
dictors and the n × r vector Y contains the centered responses;
(b) population version derived herein. . . . . . . . . . . . . . . 87
3.5 Helland’s algorithm for univariate PLS regression: (a) sample
version adapted from Table 2 of Frank and Friedman (1993).
The n × p matrix X contains the centered predictors and the
n × r vector Y contains the centered responses; (b) population
version derived herein. . . . . . . . . . . . . . . . . . . . . . . . 101

4.1 Coefficient estimates and corresponding asymptotic standard
deviations (S.D.) from three fits of the mussels’ data: Ordinary
least squares (OLS), partial least squares with one component
(PLS), and envelope with one dimension (ENV). Results for
OLS and ENV are from Cook (2018). . . . . . . . . . . . . . . 119
4.2 Mussels’ muscles: Estimates of the covariance matrix ΣX from
the envelope, OLS and PLS (q = 1) fits. . . . . . . . . . . . . . 120
4.3 The first and second columns give the orders O(·) of V 1/2 and b
under conditions 1–3 as n, p → ∞. The fourth column headed b
gives from Corollary 4.6 the order of quantity n|b| that must
converge to 0 for the bias term to be eliminated. . . . . . . . . 137
4.4 Estimated coverage rates of nominal 95% confidence intervals
(4.17) for the parameters indicated in the first row. The third
and fourth columns are for the setting in which n|b| → 0,
the fifth and sixth columns are for the setting in which n|b| ↛ 0,
and the last column is for the adjusted interval given in
Theorem 4.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

5.1 NIPALS algorithm for computing the weight matrices W and
G for compressing X and Y . The n × p matrix X contains
the centered predictors and the n × r matrix Y contains the
centered responses. `1 (·) denotes the eigenvector corresponding
to the largest eigenvalue of the matrix argument. . . . . . . . . 166
5.2 Population version of the common two-block algorithm for si-
multaneous X-Y reduction based on the bilinear model (5.12).
(Not recommended.) . . . . . . . . . . . . . . . . . . . . . . . . 170
5.3 Error measures with a = 50, b = 0.01, p = 50, r = 4, q = 10,
and u = 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
5.4 Error measures with a = 50, b = 0.01, p = 50, r = 4, q = 40,
and u = 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.5 Error measures with a = 50, b = 0.01, p = 30, r = 50, q = 20,
and u = 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.6 Error measures with n = 57, b = 5, p = 50, r = 4, q = 40, and
u = 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.7 Error measures with n = 57, b = 0.1, p = 50, r = 4, q = 40, u =
3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.8 Settings p = 50, r = 4, q = 10, u = 3 with Ω, Ω0 , ∆ and ∆0 gen-
erated as M M T where M had the same size as the correspond-
ing covariances and was filled with uniform (0, 1) observations. 179
5.9 Prediction errors predMSE for three datasets . . . . . . . . . . 181

6.1 Leave-one-out cross validation MSE for GDP growth predic-
tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.2 Leave-one-out cross validation of the mean prediction error pre-
dRMSE for the reactor data. . . . . . . . . . . . . . . . . . . . 206

7.1 Olive oil and Coffee data: Estimates of the correct classification
rates (%) from leave one out cross validation. . . . . . . . . . . 224

8.1 Birds-planes-cars data: Results from the application of three
classification methods. Accuracy is the percent correct classifi-
cation based on leave-one-out cross validation. . . . . . . . . . 244
8.2 Birds-planes-cars data: Estimates of the correct classification
rates (%) from leave one out cross validation. . . . . . . . . . . 246
8.3 Fruit data: Estimates of the correct classification rates (%) from
leave one out cross validation. . . . . . . . . . . . . . . . . . . . 247
8.4 Olive oil and Coffee data: Estimates of the correct classifica-
tion rates (%) from leave one out cross validation. Results in
columns 4–8 are as described in Table 7.1. . . . . . . . . . . . 248

9.1 Tecator Data: Number of components based on 5-fold cross
validation using the training data for each response.
NP-PLS denotes non-parametric PLS and NP-I-PLS denotes
non-parametric inverse PLS. . . . . . . . . . . . . . . . . . . . . 273
9.2 Tecator Data: (a) Root mean squared training and testing pre-
diction errors for five methods of prediction. . . . . . . . . . . . 275
9.3 Etanercept Data: Numbering of the methods corresponds to
that in Table 9.2. Part (a): Number of components was based
on leave-one-out cross validation using the training data with
p = 9,623 predictors. RMSEP is the root mean squared error
of prediction for the testing data. Part (b): The number of
components for MLP is the number of principal components
selected for the input layer of the network. Part (c): Number of
components was based on leave-one-out cross validation using
all 35 data points. CVRPE is the cross validation root mean
square error of prediction based on the selected 6 components. 277

11.1 Conjugate gradient algorithm: Population version of the conju-
gate gradient method for solving the linear system of equations
ΣX β = ΣX,Y for β. . . . . . . . . . . . . . . . . . . . . . . . . 318
11.2 NIPALS K-PLS algorithm: . . . . . . . . . . . . . . . . . . . . 326
11.3 A summary of one-parameter exponential families. For the nor-
mal, σ = 1. A(ϑ) = 1 + exp(ϑ). C ′ (ϑ) and C ′′ (ϑ) are the first
and second derivatives of C(ϑ) evaluated at the true value. . . 329
11.4 A PLS algorithm for predictor reduction in GLMs. . . . . . . . 332
1

Introduction

Partial least squares (PLS) regression is at its historical core an algorithmic
method for prediction based on an underlying linear relationship between
a possibly vector-valued response Y and a number p of predictors xj , j =
1, . . . , p. With highly collinear predictors, its performance is typically better
than that of standard methods like ordinary least squares (OLS), and it is
serviceable in the class of abundant high-dimensional regressions where many
predictors contribute information about the response and the sample size n is
not sufficient for standard methods to yield unambiguous answers.
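
As a concrete, if artificial, illustration of this kind of regression, the following
Python sketch simulates an abundant regression with highly collinear predictors
and more predictors than observations, and fits it with scikit-learn’s
PLSRegression, a NIPALS-type implementation. The sketch is ours, not an
example from the book; the simulated data, the choice of two components, and
the variable names are illustrative only.

# Minimal sketch (illustrative only): PLS regression with collinear
# predictors and more predictors than observations.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
n, p = 60, 200                          # sample size smaller than the number of predictors
Z = rng.normal(size=(n, 2))             # two latent components driving X and Y
X = Z @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))    # collinear predictors
Y = Z @ np.array([[1.0], [-2.0]]) + 0.1 * rng.normal(size=(n, 1))  # univariate response

pls = PLSRegression(n_components=2)     # q = 2 predictor components
pls.fit(X, Y)
Yhat = pls.predict(X)                   # fitted values
W = pls.x_weights_                      # p x q matrix of weight vectors (cf. the NIPALS weights of Chapter 3)
print(W.shape, np.corrcoef(Y.ravel(), Yhat.ravel())[0, 1])

Ordinary least squares is not serviceable here because p > n makes the sample
covariance matrix of the predictors singular, while the PLS fit operates on the
q-dimensional compression W T X of the predictors.
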
The origins of PLS regression trace back to the work of Herman Wold in
the 1960s, and it is generally recognized that Wold’s non-linear iterative par-
tial least squares (NIPALS) algorithm (Wold, 1966, 1975b) is a critical point
in the evolution of PLS regression. It has since garnered considerable inter-
est, making its first appearance in the chemometrics literature around 1980
(Geladi, 1988; Wold, Martens, and Wold, 1983) and subsequently spreading
across the applied sciences. PLS regression algorithms stand today as central
methods for prediction in the applied sciences, boasting hundreds of articles
extending their capabilities and extolling their virtues. We use “PLS regres-
sion” as a collective referring to the class of PLS regression algorithms. Specific
algorithms will be referenced when necessary for clarity.
There have certainly been useful papers on PLS regression in the mainline
statistics journals (e.g. Frank and Friedman, 1993; Garthwaite, 1994; Helland,
1990, 1992; Næs and Helland, 1993; Nguyen and Rocke, 2004), but the devel-
opment of PLS regression algorithms nearly all originates outside of statistics.
This may be explained partly by differences in culture. Estimators in statistics
are historically based on given stochastic models, since these approaches al-
low for relatively straightforward evaluation on standard criteria and typically
provide useful comprehensible paradigms for data analyses. In contrast, PLS

regression is defined by data-based algorithms – NIPALS, de Jong’s SIMPLS


method (de Jong, 1993), or one of the many variations (e.g. Lindgren, Geladi,
and Wold, 1993; Martin, 2009; Stocchero, 2019) – that have been historically
viewed as enigmatic forms of estimation by statisticians. Helland et al. (2018)
framed the cultural differences in terms of creativity and rigor. An article by
Breiman (2001) is widely cited for his portrayal of the model vs. algorithmic
cultures.
The relative paucity of PLS regression studies in the mainline statistics
journals might also be partly explained by a contemporary principle govern-
ing high-dimensional regression. The statistics community has for over two
decades embraced sparsity as a natural characterization of high-dimensional
regressions, the inertia of which seems to have inhibited consideration of other
plausible paradigms. Indeed, some seem to view sparsity as akin to a natu-
ral law: If you are faced with a high-dimensional regression then naturally it
must be sparse. Others have seen sparsity as the only recourse. In the logic of
Friedman et al. (2004), the bet-on-sparsity principle arose because, to con-
tinue the metaphor, there is otherwise little chance of a reasonable payoff.
In contrast, standard PLS regression methodology works best in abundant
(anti-sparse) regression where many predictors contribute information about
the response, as in spectroscopy applications in chemometrics. This sets up
a dichotomy that distinguishes PLS regression methods from sparse meth-
ods. If many predictors contribute information about the response in a high-
dimensional regression, then sparse methods will be unserviceable generally
because the regression is not sparse, but abundant methods like PLS regres-
sions can do well, as described in later chapters. If few predictors contribute
information about the response, then sparse methods may do well because the
regression is sparse, but we would not expect abundant methods to be service-
able, generally. Collinearity is less of a problem for PLS regression methods
than it is for sparse regression methods.
A close connection exists between PLS regression and relatively recent
envelope methods for dimension reduction (Cook, 2018). There are several
types of envelope methods, depending on the goals of the analysis. The two
most common are response envelopes (Cook, Li, and Chiaromonte, 2010) for
reducing the response vector Y and predictor envelopes for reducing the vector
of predictors X (Cook, Helland, and Su, 2013). These envelope methods are
based on dimension reduction perspectives that set them apart from other
statistical methods, the overarching goal being to separate with clarity the
information in the data that is material to the study goals from that which
is immaterial, which is in the spirit of Fisher’s notion of sufficient statistics
(Fisher, 1922). This is the same as the general objective of PLS regression,
and it is this connection of purpose that promises to open a new chapter in
PLS regression by enhancing its current capabilities, extending its scope and
bringing applied scientists and statisticians closer to a common understanding
of PLS regression (Cook and Forzani, 2020).
Following brief expository examples in the next section, this chapter sets
forth the multivariate linear model that forms the basis for much of this book
and describes the algebra that underpins envelopes, which are reviewed in
Chapter 2, and our treatment of PLS regression, which begins in earnest in
Chapter 3. This and the next chapters are intended to set the stage for our
treatment of PLS regression in Chapter 3 and beyond. Although we indicate
how results in this chapter link to PLS algorithms, it nevertheless can be read
independently or used as a reference for the developments in later chapters.
Throughout this book we use 5-fold or 10-fold cross validation to pick
subspace dimensions and the number of compressed predictors in PLS. These
algorithms require a seed to initiate the partitioning of the sample, and the
results typically depend on the seed. However, in our experience, the seed does
not affect results materially.

1.1 Examples
In this section we describe three experiments where PLS regression has been
used effectively. This may give indications about the value of PLS and about
the topics addressed later in this book. All three examples are from chemo-
metrics, where PLS has been used extensively since the early 1980s. Lasso
(Tibshirani, 1996) fits were included to represent sparse methodology. PLS
and lasso fits were determined by using library(pls) and library(glmnet) in R
(R Core Team, 2022).

1.1.1 Corn moisture


In our first example, the goal is to predict the moisture content of corn using
spectral measurements at 2 nm intervals in the range 1100–2498 nm, for a total
of 700 predictors (e.g. Allegrini and Olivieri, 2013). The model was trained on
a total of ntrain = 50 corn samples and then tested on a separate sample of
size ntest = 30. Let Y denote the moisture content of a corn sample, and let
X denote the corresponding 700 × 1 vector of spectral predictors. We assume
that the mean function is linear, E(Y | X) = α + β^T X. The overarching goal is then to use the training data (Y_i, X_i), i = 1, . . . , 50, to produce an estimate $\hat{\alpha}$ of the scalar α and an estimate $\hat{\beta}$ of the 700 × 1 vector β to give a linear rule, $\hat{E}(Y \mid X) = \hat{\alpha} + \hat{\beta}^T X$, for predicting the moisture content of corn.
Since ntrain ≪ p, traditional methods like OLS are not serviceable. It has become common in such settings to assume sparsity, which reflects the view that only a relatively few predictors, p∗ < ntrain, furnish information about the response. The lasso represents methods that pursue sparse fits, producing estimates of β with most coefficients equal to zero. Another option is to use PLS regression, which includes a dimension reduction method that compresses the predictors onto q < ntrain linear combinations traditionally called components, X ↦ W^T X where W ∈ R^{p×q}, and then bases prediction on the OLS linear regression of Y on W^T X. The number of components q is typically determined by using cross validation or a holdout sample. It is known that PLS can produce effective predictions when ntrain ≪ p and the predictors
are highly collinear, as they often are in chemometrics applications. A third
option is to use OLS, replacing inverses with generalized inverses. This method
is known to produce relatively poor results; we use it here as an understood
reference point.
The second row of Table 1.1, labeled “Moisture in corn,” summarizes the
results. Columns 2–6 give characteristics of the fits. With a 10-fold cross vali-
dation, PLS selected 23 compressed predictors or components, while the lasso
picked 21 of the 700 predictors as relevant. The final three columns give the
root mean squared error (RMSE) of prediction for the ntest = 30 testing
observations,
\[
\mathrm{RMSE} = \left[ n_{\mathrm{test}}^{-1} \sum_{i=1}^{n_{\mathrm{test}}} \{Y_i - \hat{E}(Y \mid X_i)\}^2 \right]^{1/2}.
\]

Clearly, PLS did best over the test dataset. This conclusion is supported by
Figure 1.1, which shows plots of the PLS and lasso fitted values versus the
observed response for the test data.
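The workflow behind comparisons of this kind can be mimicked in R with the two packages named above. The sketch below is not the authors' code and uses simulated data with illustrative dimensions and tuning choices, but it follows the same recipe: fit PLS with the number of components chosen by 10-fold cross validation, fit the lasso with cv.glmnet, and compare test-set RMSE.

```r
## Not the authors' code: a simulated stand-in for an abundant, collinear
## regression, comparing PLS (pls::plsr) and the lasso (glmnet::cv.glmnet)
## by test-set RMSE.
library(pls)
library(glmnet)

set.seed(1)
p <- 200; n_train <- 50; n_test <- 30; n <- n_train + n_test
Z <- matrix(rnorm(n * 5), n, 5)                  # a few latent factors
B <- matrix(rnorm(5 * p), 5, p)                  # factor loadings
X <- Z %*% B + 0.1 * matrix(rnorm(n * p), n, p)  # highly collinear predictors
beta <- drop(t(B) %*% rnorm(5)) / 5              # abundant: all predictors carry signal
y <- drop(X %*% beta) + rnorm(n, sd = 0.5)

train <- 1:n_train; test <- (n_train + 1):n
dtrain <- data.frame(y = y[train], X = I(X[train, ]))

## PLS with q chosen by 10-fold cross validation on the training data
fit_pls <- plsr(y ~ X, data = dtrain, ncomp = 30, validation = "CV", segments = 10)
q <- which.min(RMSEP(fit_pls)$val["adjCV", 1, -1])
pred_pls <- drop(predict(fit_pls, newdata = data.frame(X = I(X[test, ])), ncomp = q))

## Lasso with lambda chosen by 10-fold cross validation
fit_lasso <- cv.glmnet(X[train, ], y[train], nfolds = 10)
pred_lasso <- drop(predict(fit_lasso, newx = X[test, ], s = "lambda.min"))

rmse <- function(obs, fit) sqrt(mean((obs - fit)^2))
c(q = q, PLS = rmse(y[test], pred_pls), lasso = rmse(y[test], pred_lasso))
```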
TABLE 1.1
Root mean squared prediction error (RMSE), from three examples that illus-
trate the prediction potential of PLS regression.

                                                                RMSE
Dataset                  p   ntrain   ntest   PLS q   Lasso p*   Lasso     PLS      OLS
Moisture in corn       700       50      30      23        21    0.114    0.013     6.52
Protein in meat        100      170      70      13        19    1.232    0.787     3.09
Tetracycline in serum  101       50      57       4        10    0.101    0.070     1.17

Columns 2–6 give the number of predictors p, the size of the training set
ntrain , the size of the test set ntest , the number of PLS components q cho-
sen by 10-fold cross validation, and the number of predictors p∗ estimated
to have nonzero coefficients by the Lasso. Columns 7–9 give the root mean
squared prediction error from the test data for the Lasso, PLS, and OLS using
generalized inverses.

FIGURE 1.1
Corn moisture: Plots of the PLS fitted values as ◦ and lasso fitted values as
+ versus the observed response for the test data.

1.1.2 Meat protein


This dataset consists of 240 NIR absorbance spectra of meat samples recorded
on a Tecator Infratec Food and Feed Analyzer (e.g. Borggaard and Thodberg,
1992; Allegrini and Olivieri, 2016). The response variable for this example is
the protein content in each meat sample as determined by analytic chemistry.
The spectra ranged from 850 to 1050 nm, discretized into p = 100 wavelength
values. Training and testing datasets contain ntrain = 170 and ntest = 70
samples. As in Section 1.1.1, the goal is to use the training data, (Y_i, X_i), i = 1, . . . , 170, to estimate a rule, $\hat{E}(Y \mid X) = \hat{\alpha} + \hat{\beta}^T X$, for predicting the protein content of meat. One difference in this example is that ntrain > p, so a
relatively small sample size is not an issue, and the usual OLS estimator was
used.
The third row of Table 1.1 shows the results of lasso, PLS, and OLS fitting.
PLS compressed the data into q = 13 components, while the lasso judged that
19 wavelengths are relevant for predicting protein. The RMSE given in the
last three columns of Table 1.1 again show that PLS has the smallest overall
error, followed by the lasso and then OLS. Figure 1.2 shows plots of the
PLS and lasso fitted values against the observed responses for the test data.
Our visual impression conforms qualitatively with the root mean squares in
Table 1.1. However, here there is an impression of non-linearity, so that it
may be possible to improve on linear prediction rules. We show in Chapter 9
that the PLS compression step X ↦ W^T X is serviceable when E(Y | X) is a
non-linear function of X, just as it is when E(Y | X) is linear. The same may
not hold for the lasso or for sparse methods generally.

1.1.3 Serum tetracycline


Goicoechea and Olivieri (1999) used PLS to develop a predictor of tetracy-
cline concentration in human blood. The 50 training samples were constructed
by spiking blank sera with various amounts of tetracycline in the range 0–4
µg mL−1 . A validation set of 57 samples was constructed in the same way.
For each sample, the values of the predictors were determined by measuring
fluorescence intensity at p = 101 equally spaced points in the range 450–550
nm. Goicoechea and Olivieri (1999) determined using leave-one-out cross vali-
dation that the best predictions of the training data were obtained with q = 8
components, but 10-fold cross validation gave q = 4 components.

FIGURE 1.2
Meat protein: Plot of the PLS fitted values as ◦ and lasso fitted values as +
versus the observed response for the test data.

The fourth row of Table 1.1 shows the results of lasso, PLS, and OLS
fitting. PLS compressed the data into q = 4 components, while the lasso
judged that 10 of the original 101 predictors are relevant. The RMSE in the
last three columns of Table 1.1 again show that PLS has the smallest overall
error, followed by the lasso and then OLS. Figure 1.3 shows plots of the
PLS and lasso fitted values versus the observed response for the test data.
Our visual impression conforms qualitatively with the root mean squares in
Table 1.1. In this example, the lasso and PLS fitted values seem in better
agreement than those of the previous two examples, an observation that is
supported by the root mean squared values in Table 1.1.

1.2 The multivariate linear model


Consider the multivariate regression of a response vector Y ∈ Rr on a vector
of non-stochastic predictors X ∈ Rp . The standard linear model for describing

FIGURE 1.3
Serum tetracycline: Plots of the PLS fitted values as ◦ and lasso fitted values
as + versus the observed response for the test data.

a sample (Yi , Xi ) can be represented in vector form as

\[
Y_i = \alpha + \beta^T(X_i - \bar{X}) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1.1}
\]
where we have centered the predictors, $\sum_{i=1}^{n}(X_i - \bar{X}) = 0$, about the sam-
ple mean X̄, the error vectors εi ∈ Rr are independently and identically
distributed normal vectors with mean 0 and covariance matrix ΣY |X > 0,
α ∈ Rr is an unknown vector of intercepts and β ∈ Rp×r is an unknown
matrix of regression coefficients. We can think of the responses Yi as an ob-
servation from the conditional distributions of Y | (X = Xi ), i = 1, . . . , n.
Centering the predictors in the model facilitates discussion and presentation
of some results, but is technically unnecessary. The normality requirement for
ε is used for certain foundations and likelihood-based estimation but is not
essential, as discussed in later chapters.
If X is stochastic, so X and Y have a joint distribution, it is standard
practice to condition on the observed values of X since the predictors are
ancillary under model (1.1). However, when PLS regression is used to compress
X, the predictors are stochastic and not ancillary, as the marginal distribution
of X is hypothesized to carry relevant information about the regression. With
stochastic predictors, we replace Xi in model 1.1 with Xi − µX , centering the
predictors about the population mean µX , and we add the assumption that
the predictor vector X is independent of the error vector ε. Let ΣX = var(X)
and ΣX,Y = cov(X, Y ). With this structure, we can represent β in terms of
parameters of the joint distribution

\[
\beta = \Sigma_X^{-1} \Sigma_{X,Y}. \tag{1.2}
\]

Stochastic predictors will come into play starting in Chapter 2. We con-


tinue to follow standard practice in this review and regard the predictors as
non-stochastic.

1.2.1 Notation and some algebraic background


Let Rr×c denote the space of all real r × c matrices and let Sr×r denote the
space of all real symmetric r × r matrices. For random vectors A ∈ Ra and
C ∈ Rc , we use ΣA,C ∈ Ra×c to denote the matrix of covariances between the
elements of A and the elements of C:

\[
\Sigma_{A,C} = E\{(A - E(A))(C - E(C))^T\}.
\]
Similarly, $\Sigma_A$ denotes the variance-covariance matrix for A:
\[
\Sigma_A = E\{(A - E(A))(A - E(A))^T\}.
\]
When the vectors are enclosed in parentheses,
\[
\Sigma_{(A,C)} = \Sigma_B,
\]
where B is the vector constructed by concatenating A and C:
\[
B = \begin{pmatrix} A \\ C \end{pmatrix}.
\]

This extends to multiple vectors: Σ(A1 ,..., Ak ) = ΣB , where B is now con-


structed by concatenating A1 , . . . , Ak .
Let A ∈ R^{r×c} and let $\mathcal{A} = \mathrm{span}(A)$ denote the subspace of R^r spanned by the columns of A. Then the projection onto $\mathcal{A}$ in the usual inner product will be indicated by either $P_A$ or $P_{\mathcal{A}}$. The projection onto $\mathcal{A}$ in the ∆ ∈ S^{r×r} inner product, ∆ > 0, will be indicated by either $P_{A(\Delta)}$ or $P_{\mathcal{A}(\Delta)}$. This projection can be represented in matrix form as
\[
P_{A(\Delta)} = A(A^T \Delta A)^{-1} A^T \Delta, \tag{1.3}
\]
provided $A^T \Delta A > 0$. The usual inner product arises when ∆ = I and then $P_A = A(A^T A)^{-1} A^T$. Regardless of the inner product, $Q_{\cdot(\cdot)} = I_r - P_{\cdot(\cdot)}$ denotes the orthogonal projection.
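The projection (1.3) is easy to compute directly. The following helper is a minimal sketch, not part of the book's software; the function name proj is ours.

```r
## A small helper (ours): the projection P_{A(Delta)} of (1.3).
proj <- function(A, Delta = diag(nrow(A))) {
  A <- as.matrix(A)
  M <- crossprod(A, Delta %*% A)           # A' Delta A, assumed positive definite
  A %*% solve(M, crossprod(A, Delta))      # A (A' Delta A)^{-1} A' Delta
}

## Quick checks: idempotency, and the usual inner product case Delta = I
A <- matrix(rnorm(5 * 2), 5, 2)
Delta <- crossprod(matrix(rnorm(25), 5, 5)) + diag(5)   # a positive definite Delta
P <- proj(A, Delta)
all.equal(P %*% P, P)                                   # projections are idempotent
all.equal(proj(A), A %*% solve(crossprod(A), t(A)))     # P_A = A(A'A)^{-1}A'
```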
We will occasionally encounter a conditional variate of the form N | C T N ,
where N ∈ Rr is a normal vector with mean µ and variance ∆, and C ∈ Rr×q
is a non-stochastic matrix with q < r. Assuming that C T ∆C > 0, the mean
and variance of this conditional form are as follows (Cook, 1998, Section 7.2.3).

\begin{align}
E(N \mid C^T N) &= \mu + P^T_{C(\Delta)}(N - \mu) \tag{1.4} \\
\mathrm{var}(N \mid C^T N) &= \Delta - \Delta C(C^T \Delta C)^{-1} C^T \Delta \notag \\
&= \Delta Q_{C(\Delta)} \notag \\
&= Q^T_{C(\Delta)} \Delta Q_{C(\Delta)}. \tag{1.5}
\end{align}

Let Y denote the n × r centered matrix with rows (Yi − Ȳ )T , let Y0 denote
the n × r uncentered matrix with rows YiT , let X denote the n × p matrix with
rows (Xi − X̄)T and let X0 denote the n×p matrix with rows XiT , i = 1, . . . , n.
With this notation, model (1.1) can be represented in full matrix form as

\[
Y_{0;\, n \times r} = 1_n \alpha^T + X\beta + E, \tag{1.6}
\]
where E is an n × r matrix with rows $\varepsilon_i^T$.


Let
\begin{align*}
S_{Y,X} &= Y^T X/n = n^{-1}\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})^T \\
S_X &= X^T X/n = n^{-1}\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T \\
S_Y &= Y^T Y/n = n^{-1}\sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})^T,
\end{align*}

where $S_Y$ is the usual estimator of $\Sigma_Y$. When X is stochastic, $S_X$ and $S_{Y,X}$ are the usual estimators of $\Sigma_X$ and $\Sigma_{X,Y}$. When X is non-stochastic, $S_X$ is non-stochastic as well, and in this case, we define $\Sigma_X = \lim_{n\to\infty} S_X$ and
\begin{align*}
\Sigma_{Y,X} &= \lim_{n\to\infty} E(S_{Y,X} \mid X_0) \\
&= \lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} E(Y_i - \bar{Y})(X_i - \bar{X})^T \\
&= \lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} \beta^T (X_i - \bar{X})(X_i - \bar{X})^T \\
&= \beta^T \lim_{n\to\infty} S_X = \beta^T \Sigma_X.
\end{align*}

The Kronecker product ⊗ between two matrices A ∈ R^{r×s} and B ∈ R^{t×u} is the rt × su matrix defined in blocks as
\[
A \otimes B = \bigl((A)_{ij} B\bigr), \quad i = 1, \ldots, r, \; j = 1, \ldots, s,
\]
where $(A)_{ij}$ denotes the ij-th element of A. Kronecker products are not in general commutative, $A \otimes B \neq B \otimes A$. The vec operator transforms a matrix A ∈ R^{r×u} to a vector vec(A) ∈ R^{ru} by stacking its columns. Representing A in terms of its columns $a_j$, A = (a_1, . . . , a_u), then
\[
\mathrm{vec}(A) = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_u \end{pmatrix}.
\]

The vech operator transforms a symmetric matrix A ∈ S^{r×r} to a vector vech(A) ∈ R^{r(r+1)/2} by stacking its unique elements on and below the diagonal. If
\[
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{pmatrix},
\]
then
\[
\mathrm{vech}(A) = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{22} & a_{23} & a_{33} \end{pmatrix}^T.
\]
The commutation matrix Kpm ∈ Rpm×pm is the unique matrix that trans-
forms the vec of a matrix into the vec of its transpose: For A ∈ Rp×m ,
vec(AT ) = Kpm vec(A). For background on Kronecker products, vec opera-
tors, and commutation matrices, see Henderson and Searle (1979), Magnus
and Neudecker (1979), Neudecker and Wansbeek (1983). Properties of these
operators were summarized by Cook (2018, Appendix A).
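For readers who want to experiment, the snippet below (ours, not from the book's programs) illustrates the operators just defined in base R; the helper names vec, vech, and Kpm are ours.

```r
## Illustration (ours) of the Kronecker product, vec, vech, and the
## commutation matrix K_{pm} satisfying K_{pm} vec(A) = vec(A') for A in R^{p x m}.
vec  <- function(A) as.vector(A)                   # stack the columns of A
vech <- function(A) A[lower.tri(A, diag = TRUE)]   # elements on and below the diagonal
Kpm  <- function(p, m) {
  K <- matrix(0, p * m, p * m)
  for (i in 1:p) for (j in 1:m) K[(i - 1) * m + j, (j - 1) * p + i] <- 1
  K
}

A <- matrix(1:6, 2, 3); B <- matrix(rnorm(4), 2, 2)
kronecker(A, B)                                    # base R Kronecker product
all.equal(vec(t(A)), drop(Kpm(2, 3) %*% vec(A)))   # commutation identity holds
S <- crossprod(matrix(rnorm(9), 3, 3)); vech(S)    # vech of a symmetric matrix
```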

For an m × n matrix A and a p × q matrix B, their direct sum is defined as the (m + p) × (n + q) block diagonal matrix
\[
A \oplus B = \mathrm{diag}(A, B) = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}.
\]
We will also use the ⊕ operator for two subspaces. If $\mathcal{S} \subseteq \mathbb{R}^p$ and $\mathcal{R} \subseteq \mathbb{R}^q$, then $\mathcal{S} \oplus \mathcal{R} = \mathrm{span}(S \oplus R)$, where S and R are basis matrices for $\mathcal{S}$ and $\mathcal{R}$.

1.2.2 Estimation
With normality of the errors in model (1.1) and n > p, the maximum likelihood
estimator of α is Ȳ and the maximum likelihood estimator of β, which is also
the OLS estimator, is
\[
\hat{\beta}_{\mathrm{ols}} = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T Y_0 = S_X^{-1} S_{X,Y}, \tag{1.7}
\]
where the second equality follows because the predictors are centered and, as defined previously, $Y_0$ denotes the n × r uncentered matrix with rows $Y_i^T$. A justification for this result is sketched in Appendix Section A.1.1. We use $\hat{Y}_i = \bar{Y} + \hat{\beta}_{\mathrm{ols}}^T (X_i - \bar{X})$ and $\hat{R}_i = Y_i - \hat{Y}_i$ to denote the i-th vectors of fitted val-
ues and residuals, i = 1, . . . , n. Notice from (1.7) that βbols can be constructed
by doing r separate univariate linear regressions, one for each element of Y on
X. The coefficients from the j-th regression then form the j-th column of βbols ,
j = 1, . . . , r. This observation will be useful when discussing the differences be-
tween PLS1 algorithms, which apply to regressions with a univariate response,
and PLS2 algorithms, which apply to regressions with multiple responses.
The sample covariance matrices of $\hat{Y}$, $\hat{R}$, and Y – which are denoted $S_{Y \circ X}$, $S_{Y|X}$, and $S_Y$ – can be expressed as
\begin{align}
S_{Y \circ X} &= n^{-1} Y^T P_X Y = S_{Y,X} S_X^{-1} S_{X,Y}, \tag{1.8} \\
S_{Y|X} &= n^{-1}\sum_{i=1}^{n} \hat{R}_i \hat{R}_i^T = n^{-1} Y^T Q_X Y, \tag{1.9} \\
&= S_Y - S_{Y,X} S_X^{-1} S_{X,Y}, \notag \\
&= S_Y - S_{Y \circ X}, \notag \\
S_Y &= n^{-1} Y^T Y = S_{Y \circ X} + S_{Y|X}, \tag{1.10}
\end{align}
where, as defined at (1.3), $P_X = X(X^T X)^{-1} X^T$ denotes the projection onto the column space of X, $Q_X = I_n - P_X$, and $S_{Y|X}$ is the maximum likelihood
estimator of $\Sigma_{Y|X}$. We will occasionally encounter a standardized version of $\hat{\beta}_{\mathrm{ols}}$,
\[
\hat{\beta}_{\mathrm{ols}}^{\mathrm{std}} = S_X^{1/2}\, \hat{\beta}_{\mathrm{ols}}\, S_{Y|X}^{-1/2}, \tag{1.11}
\]
which corresponds to the estimated coefficient matrix from the OLS fit of the standardized responses $S_{Y|X}^{-1/2} Y$ on the standardized predictors $S_X^{-1/2} X$.
The joint distribution of the elements of $\hat{\beta}_{\mathrm{ols}}$ can be found by using the vec operator to stack the columns of $\hat{\beta}_{\mathrm{ols}}$: $\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}) = \{I_r \otimes (X^T X)^{-1} X^T\}\,\mathrm{vec}(Y_0)$. Although X is stochastic in PLS applications, properties of $\hat{\beta}_{\mathrm{ols}}$ are typically described conditional on the observed values of X. We emphasize this in the following calculations by conditioning on the observed matrix of uncentered predictors $X_0$. Since $\mathrm{vec}(Y_0)$ is normally distributed with mean $\alpha \otimes 1_n + (I_r \otimes X)\,\mathrm{vec}(\beta)$ and variance $\Sigma_{Y|X} \otimes I_n$, it follows that $\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}) \mid X_0$ is normally distributed with mean and variance
\begin{align}
E\{\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}) \mid X_0\} &= \mathrm{vec}(\beta) \tag{1.12} \\
\mathrm{var}\{\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}) \mid X_0\} &= \Sigma_{Y|X} \otimes (X^T X)^{-1} = n^{-1}\, \Sigma_{Y|X} \otimes S_X^{-1}. \tag{1.13}
\end{align}

The covariance matrix can be represented also in terms of $\hat{\beta}_{\mathrm{ols}}^T$ by using the rp × rp commutation matrix $K_{rp}$ to convert $\mathrm{vec}(\hat{\beta}_{\mathrm{ols}})$ to $\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}^T)$: $\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}^T) = K_{rp}\,\mathrm{vec}(\hat{\beta}_{\mathrm{ols}})$ and
\[
\mathrm{var}\{\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}^T) \mid X_0\} = n^{-1} K_{rp} (\Sigma_{Y|X} \otimes S_X^{-1}) K_{rp}^T = n^{-1}\, S_X^{-1} \otimes \Sigma_{Y|X}.
\]

The unconditional mean and variance are then


n o
E{vec(βbols } = E E{vec(βbols ) | X0 } = vec(β)
n o
var{vec(βbols )} = E{var[vec(βbols ) | X0 ]} + var E[vec(βbols ) | X0 ]
 −1
= n−1 ΣY |X ⊗ E SX .

The conditional and unconditional variances $\mathrm{var}\{\mathrm{vec}(\hat{\beta}_{\mathrm{ols}})\}$ are typically estimated by substituting the residual covariance matrix for $\Sigma_{Y|X}$ and $S_X^{-1}$ for $E(S_X^{-1})$:
\[
\widehat{\mathrm{var}}\{\mathrm{vec}(\hat{\beta}_{\mathrm{ols}})\} = n^{-1}\, S_{Y|X} \otimes S_X^{-1}. \tag{1.14}
\]

Let $e_i \in \mathbb{R}^r$ denote the indicator vector with a 1 in the i-th position and 0's elsewhere. Then, the covariance matrix for the i-th column of $\hat{\beta}_{\mathrm{ols}}$ is
\[
\mathrm{var}\{\mathrm{vec}(\hat{\beta}_{\mathrm{ols}} e_i) \mid X_0\} = (e_i^T \otimes I_p)\,\mathrm{var}\{\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}) \mid X_0\}\,(e_i \otimes I_p) = n^{-1} S_X^{-1} (\Sigma_{Y|X})_{ii}.
\]

We commented following (1.7) that the j-th column of βbols is the same as
doing the linear regression of the j-th response on X, j = 1, . . . , r. Consistent
with that observation, we see from this that the covariance matrix for the
j-th column of βbols is the same as that from the marginal linear regression
of $(Y)_j$ on X. We refer to the estimate $(\hat{\beta}_{\mathrm{ols}})_{ij}$ divided by its standard error $\{n^{-1}(S_X^{-1})_{jj}(S_{Y|X})_{ii}\}^{1/2}$ as a Z-score:
\[
Z = \frac{(\hat{\beta}_{\mathrm{ols}})_{ij}}{\{n^{-1}(S_X^{-1})_{jj}(S_{Y|X})_{ii}\}^{1/2}}. \tag{1.15}
\]

This statistic will be used from time to time for assessing the magnitude of
(βbols )ij , sometimes converting to a p-value using the standard normal distri-
bution.
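The quantities in (1.7)–(1.15) are simple to compute directly. The sketch below is ours, with simulated data and arbitrary dimensions, and is meant only to show how $\hat{\beta}_{\mathrm{ols}}$, $S_{Y|X}$, the estimated covariance (1.14), and the Z-scores (1.15) fit together.

```r
## A brief sketch (ours) of the OLS quantities in (1.7)-(1.15).
set.seed(2)
n <- 100; p <- 4; r <- 2
X <- matrix(rnorm(n * p), n, p)
beta <- matrix(c(1, 0, -1, 0.5, 0, 2, 0, 0), p, r)
Y <- X %*% beta + matrix(rnorm(n * r), n, r)

Xc <- scale(X, scale = FALSE); Yc <- scale(Y, scale = FALSE)   # center
SX   <- crossprod(Xc) / n
SXY  <- crossprod(Xc, Yc) / n
bhat <- solve(SX, SXY)                       # (1.7): S_X^{-1} S_{X,Y}
SYgX <- crossprod(Yc - Xc %*% bhat) / n      # S_{Y|X}, the MLE of Sigma_{Y|X}

## Estimated var{vec(beta_hat)} = n^{-1} S_{Y|X} (x) S_X^{-1}, as in (1.14)
Vhat <- kronecker(SYgX, solve(SX)) / n
se   <- matrix(sqrt(diag(Vhat)), p, r)       # standard errors arranged like beta_hat
Z    <- bhat / se                            # elementwise Z-scores as in (1.15)
round(Z, 2)
```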
The usual log-likelihood ratio statistic for testing that β = 0 is

\[
\Lambda = n \log \frac{|S_Y|}{|S_{Y|X}|}, \tag{1.16}
\]

which is asymptotically distributed under the null hypothesis as a chi-square


random variable with pr degrees of freedom. This statistic is sometimes re-
ported with an adjustment that is useful when n is not large relative to r and
p (Muirhead, 2005, Section 10.5.2).
The Fisher information J for $(\mathrm{vec}^T(\beta), \mathrm{vech}^T(\Sigma_{Y|X}))^T$ in the conditional model (1.1) is
\[
J = \begin{pmatrix} \Sigma_{Y|X}^{-1} \otimes \Sigma_X & 0 \\ 0 & \tfrac{1}{2} E_r^T (\Sigma_{Y|X}^{-1} \otimes \Sigma_{Y|X}^{-1}) E_r \end{pmatrix}, \tag{1.17}
\]

where $E_r$ is the expansion matrix that satisfies $\mathrm{vec}(A) = E_r\,\mathrm{vech}(A)$ for $A \in \mathbb{S}^{r\times r}$, and $\Sigma_X = \lim_{n\to\infty} S_X > 0$. It follows from standard likelihood theory that $\sqrt{n}\,(\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}) - \mathrm{vec}(\beta))$ is asymptotically normal with mean 0 and variance given by the upper left block of $J^{-1}$,
\[
\mathrm{avar}(\sqrt{n}\,\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}) \mid X_0) = \Sigma_{Y|X} \otimes \Sigma_X^{-1}. \tag{1.18}
\]
Asymptotic normality also holds without normal errors but with some technical conditions: if the errors have finite fourth moments and if the maximum leverage converges to 0, $\max_{1\le i\le n}(P_X)_{ii} \to 0$, then $\sqrt{n}\,(\mathrm{vec}(\hat{\beta}_{\mathrm{ols}}) - \mathrm{vec}(\beta))$ converges in distribution to a normal vector with mean 0 (e.g. Su and Cook, 2012, Theorem 2).

1.2.3 Partitioned models and added variable plots


A subset of the predictors may occasionally be of special interest in multivari-
ate regression. Partition X into two sets of predictors X1 ∈ R^{p_1} and X2 ∈ R^{p_2},
p1 + p2 = p, and conformably partition the rows of β into β1 and β2 . Then,
dropping the subscript i, model (1.1) can be rewritten as

Y = µ + β1T (X1 − X̄1 ) + β2T (X2 − X̄2 ) + ε, (1.19)

where the elements of β1 are the coefficients of interest. We next re-


parameterize this model to force the new predictors to be uncorrelated in
the sample and to focus attention on β1 .
Let $\hat{R}_{1|2} = X_1 - \bar{X}_1 - S_{X_1,X_2} S_{X_2}^{-1}(X_2 - \bar{X}_2)$ denote a typical residual vector from the ordinary least squares fit of $X_1$ on $X_2$, and let $\beta_{2*}^T = \beta_1^T S_{X_1,X_2} S_{X_2}^{-1} + \beta_2^T$. Then the partitioned model can be re-expressed as
\[
Y = \mu + \beta_1^T \hat{R}_{1|2} + \beta_{2*}^T (X_2 - \bar{X}_2) + \varepsilon. \tag{1.20}
\]
In this version of the partitioned model, the parameter vector $\beta_1$ is the same as that in (1.19), while $\beta_{2*} \neq \beta_2$ unless $S_{X_1,X_2} = 0$. The predictors – $\hat{R}_{1|2}$ and $X_2$ – in (1.20) are uncorrelated in the sample, $S_{\hat{R}_{1|2},X_2} = 0$, and consequently the maximum likelihood estimator of $\beta_1$ is obtained by regressing Y on $\hat{R}_{1|2}$. The maximum likelihood estimator of $\beta_1$ can also be obtained by regressing $\hat{R}_{Y|2}$, the residuals from the regression of Y on $X_2$, on $\hat{R}_{1|2}$. A plot of $\hat{R}_{Y|2}$ versus $\hat{R}_{1|2}$ is called an added variable plot (Cook and Weisberg, 1982). These plots
are often used in univariate linear regression (r = 1) as general graphical
diagnostics for visualizing how hard the data are working to fit individual
coefficients.
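A minimal added variable plot can be produced in a few lines of R. The sketch below is ours, uses simulated data with r = 1, and illustrates the equivalence noted above between regressing $\hat{R}_{Y|2}$ on $\hat{R}_{1|2}$ and the coefficient of $X_1$ in the full fit.

```r
## An added variable plot in base R (ours, simulated data, r = 1).
set.seed(3)
n <- 200
X2 <- matrix(rnorm(n * 3), n, 3)
X1 <- drop(X2 %*% c(0.8, -0.4, 0.2)) + rnorm(n)          # X1 correlated with X2
Y  <- 1 + 2 * X1 + drop(X2 %*% c(1, 0, -1)) + rnorm(n)

R1.2 <- resid(lm(X1 ~ X2))    # residuals of X1 on X2
RY.2 <- resid(lm(Y ~ X2))     # residuals of Y on X2
plot(R1.2, RY.2, xlab = "residuals of X1 | X2", ylab = "residuals of Y | X2")
abline(lm(RY.2 ~ R1.2))       # its slope equals the coefficient of X1 in lm(Y ~ X1 + X2)

c(coef(lm(RY.2 ~ R1.2))[2], coef(lm(Y ~ X1 + X2))["X1"])  # the two estimates agree
```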
Model (1.20) will be used in Chapter 6 when developing partial PLS and
partial envelope methods.

1.3 Invariant and reducing subspaces


Invariant and reducing subspaces (e.g. Conway, 1994) are essential con-
stituents of envelopes and PLS methodology. In this section we cover aspects
of these subspaces that will be useful later. Much of the material in this
section is based on the results of Cook et al. (2010). The statistical relevance
of the constructions in this section will be addressed in Chapter 2.
If R is a subspace of Rr and A ∈ Rr×c , then we define

AT R = {ATR | R ∈ R}
RTA = {RTA | R ∈ R}
R⊥ = {S ∈ Rr | S T R = 0}.

If R1 and R2 are subspaces of Rr , then

R1 + R2 = {R1 + R2 | Rj ∈ Rj , j = 1, 2}.

We begin with the definitions of invariant and reducing subspaces.

Definition 1.1. A subspace R of Rr is an invariant subspace of M ∈ Rr×r


if the linear transformation M R ⊆ R; so M maps R to a subset of itself. R
is a reducing subspace of M if R and R⊥ are both invariant subspaces of M .
If R is a reducing subspace of M , we say that R reduces M .

The next lemma describes a matrix equation that characterizes invariant


subspaces. A justification is given in Appendix A.1.2.

Lemma 1.1. Let R be an s-dimensional subspace of Rr and let M ∈ Rr×r .


Then R is an invariant subspace of M if and only if, for any A ∈ Rr×s with
span(A) = R, there exists a B ∈ Rs×s such that M A = AB.

Recall from Section 1.2.1 that Sr×r denotes the space of all real, symmetric
r ×r matrices. The next lemma tells us that if M ∈ Sr×r and R is an invariant
subspace of M then R reduces M . This fact is handy in proofs because to
show that R reduces a symmetric M , we need to show only that R is an
invariant subspace of M . In this book we will be concerned almost exclusively
with real symmetric linear transformations M and reducing subspaces.

Lemma 1.2. Let R be an s-dimensional subspace of Rr and let M ∈ Sr×r .


Then R reduces M if and only if R is an invariant subspace of M .

Proof. By definition, if R reduces M then R is an invariant subspace of M .


Next, suppose that R is an invariant subspace of M . Then, by definition
M R ⊆ R, which implies that (R⊥ )T M R = {0}. Since M is symmetric, this
implies RT M R⊥ = {0}. Consequently, we must have M R⊥ ⊆ R⊥ .

Examples of invariant subspaces of M ∈ Sr×r are {0}, Rr , and span(M ).


For a less obvious construction that will be relevant later, let x ∈ Rr and
consider the subspace

R = span{x, M x, M 2 x, . . . , M r−1 x, . . .}.

To avoid trivial cases, we assume r ≥ 2. The Cayley-Hamilton theorem states


that $\sum_{i=0}^{r} a_i M^i = 0$, where the $a_i$'s are the coefficients of the characteristic
polynomial |λI − M | of M . In consequence, for j ≥ r, M j x is a linear combi-
nation of the r vectors {x, M x, . . . , M r−1 x} and so R can be represented as
a subspace spanned by a finite set of r vectors:

R = span{x, M x, M 2 x, . . . , M r−1 x}. (1.21)

Then R reduces M because M is symmetric and R is an invariant subspace


of M ,
M R = span{M x, M 2 x, M 3 x, . . . , M r x} ⊆ R.

Subspaces of this form are often called cyclic invariant subspaces in linear
algebra.
Subspaces (1.21) arise in connection with PLS regressions having a uni-
variate response, particularly when allowing for the possibility that the full
set {x, M x, . . . , M r−1 x} may not be necessary to span R. For t ≤ r, let

\[
\left.\begin{aligned}
K_t(M, x) &= \{x, Mx, M^2x, \ldots, M^{t-1}x\} \\
\mathcal{K}_t(M, x) &= \mathrm{span}\{K_t(M, x)\}
\end{aligned}\right\}, \tag{1.22}
\]
which are called a Krylov basis and a Krylov subspace of dimension t in numerical analysis, terminology that we adopt for this book. For example, let r = 3, $v_1 = (1, 1, 1)/\sqrt{3}$, $v_2 = (-1, 1, 0)/\sqrt{2}$ and $v_3 = (1, 1, -2)/\sqrt{6}$, and construct $M = 3P_{v_1} + P_{v_2} + P_{v_3}$. Then $\mathcal{K}_1(M, v_1)$ and, for any real scalars a and b, $\mathcal{K}_1(M, av_2 + bv_3)$ are one-dimensional reducing subspaces of M. For an arbitrary vector $x \in \mathbb{R}^3$, what is the maximum possible dimension of $\mathcal{K}_q(M, x)$?
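A quick numerical check of these ideas (ours, not from the book's programs): build the Krylov basis for the example M above and note that its rank caps at the number of eigenspaces onto which x projects, here two.

```r
## Krylov basis (ours) for the example M = 3 P_{v1} + P_{v2} + P_{v3}.
krylov <- function(M, x, t) {               # columns x, Mx, ..., M^{t-1} x
  K <- matrix(0, length(x), t); K[, 1] <- x
  for (j in seq_len(t - 1)) K[, j + 1] <- M %*% K[, j]
  K
}

v1 <- c(1, 1, 1) / sqrt(3); v2 <- c(-1, 1, 0) / sqrt(2); v3 <- c(1, 1, -2) / sqrt(6)
M  <- 3 * tcrossprod(v1) + tcrossprod(v2) + tcrossprod(v3)

qr(krylov(M, v1, 3))$rank                   # 1: v1 spans a reducing subspace of M
set.seed(4); x <- rnorm(3)
qr(krylov(M, x, 3))$rank                    # 2: M has only two eigenspaces, so the
                                            # Krylov subspace stops growing at dim 2
```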
If M R ⊆ R, so R is an invariant subspace of M , and if x ∈ R then
clearly M x ∈ R. Any invariant subspace of M that contains x must then
also contain all of the vectors {x, M x, . . . , M r−1 x}. Consider a subspace R
that is unknown, but known to be an invariant subspace of M . If we know or
can estimate one vector x ∈ R, then we can iteratively transform x by M to
obtain additional vectors in R: for any t, Kt (M, x) ⊆ R. Here we can think
of the vector x as serving as a seed for gaining information on R. This is one


of the key ideas behind the NIPALS regression algorithm.
The next proposition from Cook and Li (2002, Prop. 3) provides motivation
for considering Krylov subspaces of dimension less than r:

Proposition 1.1. If M j x ∈ Kj−1 (M, x) then M s x ∈ Kj−1 (M, x) for any


s > j.

Proof. If M j x ∈ Kj−1 (M, x) then

M j+1 x ∈ M Kj−1 (M, x) = span{M x, M 2 x, . . . , M j x} ⊆ Kj−1 (M, x)

and the desired conclusion follows.

An important implication of Proposition 1.1 is that the Krylov subspaces


are strictly monotonically increasing until a certain dimension t is reached and
they are constant thereafter,

K1 (M, x) ⊂ K2 (M, x) ⊂ · · · ⊂ Kt (M, x) = · · · = Kr (M, x), (1.23)

where “⊂” indicates strict containment. Estimation of the transition point t


at which the sequence stops growing is a key consideration in NIPALS regres-
sions. In Section 1.5 we identify the transition point as an envelope dimension.
It is also equal to the number of components in PLS regressions when r = 1.
The next proposition, whose proof is given in Appendix A.1.3, gives a
different characterization of a reducing subspace. Its necessary and sufficient
condition could be used as a definition of a reducing subspace in this context,
and it is used in envelope literature because it avoids the need to keep track
of multiplicities of eigenvalues (Cook, Li, and Chiaromonte, 2010).

Proposition 1.2. R reduces M ∈ Rr×r if and only if M can be written in


the form
M = PR M PR + QR M QR . (1.24)

Corollary 1.1 describes consequences of Proposition 1.2 that will be useful


in envelope calculations, including derivations of maximum likelihood estima-
tors. Its proof is given in Appendix A.1.4. Let A† denote the Moore-Penrose
inverse of the matrix A.

Corollary 1.1. Let R reduce M ∈ Rr×r , let A ∈ Rr×u be a semi-orthogonal


basis matrix for R, and let A0 be a semi-orthogonal basis matrix for R⊥ . Then

1. M and PR commute, as do M and QR .

2. R ⊆ span(M ) if and only if AT M A is full rank.

3. |M | = |AT M A| × |AT0 M A0 | when u < r.

4. If M is full rank then

M −1 = A(AT M A)−1 AT + A0 (AT0 M A0 )−1 AT0


= PR M −1 PR + QR M −1 QR .

5. If R ⊆ span(M ) then

M † = A(AT M A)−1 AT + A0 (AT0 M A0 )† AT0 .

The next lemma will be useful when studying likelihood-based algorithms


later in this chapter and in Chapter 3. It shows in part how to write |AT0 M A0 |
as a function of a basis A for the orthogonal complement of span(A0 ). A proof
is given in Appendix Section A.1.5.
Lemma 1.3. Suppose that M ∈ Sr×r is positive definite and that the column-
partitioned matrix O = (A, A0 ) ∈ Rr×r is orthogonal. Then
I. |AT0 M A0 | = |M | × |AT M −1 A|

II. log |AT M A| + log |AT0 M A0 | ≥ log |M |

III. log |AT M A| + log |AT M −1 A| ≥ 0.


Parts II and III become equalities if and only if span(A) reduces M .
The next proposition gives a sufficient condition for a projection in an M
inner product to be the same as the projection in the usual inner product. It
will be helpful when studying how PLS algorithms work. Its proof, as given in
Appendix Section A.1.6, follows by application of (1.24) in conjunction with
(1.3).
Proposition 1.3. If R reduces the positive definite matrix M ∈ Sr×r and
R ⊆ span(M ) then PR(M ) = PR .
Certain minimal reducing subspaces are in effect parameters in PLS regres-
sions. The following proposition ensures that the minimal reducing subspace
is well defined. Its proof is in Appendix A.1.7.
Proposition 1.4. The intersection of any two reducing subspaces of M ∈
Rr×r is also a reducing subspace of M .

1.4 Envelope definition


In this section, we give definitions and basic properties of envelopes (from
Cook et al., 2010; Cook, 2018) that will facilitate an understanding of their
role in PLS regression. The formal statements are devoid of statistical content,
but that will become apparent as the book progresses.
We are now ready to define an envelope:

Definition 1.2. Let M ∈ Sr×r and let S ⊆ span(M ). Then, the M -envelope
of S, denoted by EM (S), is the intersection of all reducing subspaces of M that
contain S.

This definition is concerned only with reducing subspaces that contain a


specified subspace S, which will typically be spanned by a parameter vector
or matrix. The requirement that S ⊆ span(M ) is necessary to ensure that
envelopes are well defined and that they capture the quantity of interest S.
This definition does not require that M be non-singular, a property that will
be useful in understanding how PLS operates in high-dimensional regressions.
In that case, the condition S ⊆ span(M ) constrains the regression structures
that can be identified in high dimensions that exceed the sample size.
The next two propositions give certain relationships between envelopes
that will be useful later. They were proven originally by Cook et al. (2010).
Proofs are available also from (Cook, 2018, Appendix A.3.1).

Proposition 1.5. Let K ∈ Sr×r commute with M ∈ Sr×r and let S ⊆


span(M ). Then, KS ⊆ span(M ) and

EM (KS) = KEM (S).

If, in addition, S ⊆ span(K) and EM (S) reduces K, then

EM (KS) = EM (S).

Proposition 1.6. Let M ∈ Sr×r and let S ⊆ span(M ). Then

EM (M k S) = EM (S) for all k ∈ R


EM k (S) = EM (S) for all k ∈ R with k 6= 0.

In reference to model (1.1) with random predictors, we will in later chapters


frequently encounter the smallest reducing subspace of ΣX := var(X) that
contains B := span(β), denoted EΣX(B). When ΣX > 0, it follows immediately


from Proposition 1.6 that EΣX(B) = EΣX (span(ΣX,Y )), since β = Σ−1X ΣX,Y .
For future use, we let CX,Y = span(ΣX,Y ) and CY,X = span(ΣY,X ). With this
notation, we can write EΣX(B) = EΣX (CX,Y ).

Proposition 1.7. Let M ∈ Sr×r , let Sj ⊆ span(M ), j = 1, 2. Then EM (S1 +


S2 ) = EM (S1 ) + EM (S2 ).

Proof. EM (S1 + S2 ) is a reducing subspace of M that contains both S1 and


S2 . Since EM (Sj ) is the smallest reducing subspace of M that contains Sj , we
have that EM (Sj ) ⊆ EM (S1 + S2 ), j = 1, 2, and consequently

EM (S1 ) + EM (S2 ) ⊆ EM (S1 + S2 ).

Next, let Bj be a basis matrix for EM (Sj ), j = 1, 2 and let B = (B1 , B2 ),


so span(B) = EM (S1 ) + EM (S2 ). Then, from Lemma 1.1 there are matrices
A1 and A2 so that M Bj = Bj Aj and
\[
M B = (B_1, B_2)\begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix} = B\begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix}.
\]

In consequence, EM (S1 ) + EM (S2 ) reduces M and contains S1 + S2 , and so


must contains the smallest reducing subspace of M that contains S1 + S2 ,

EM (S1 + S2 ) ⊆ EM (S1 ) + EM (S2 ).

The conclusion follows.

The direct sum of A ∈ Rm×n and B ∈ Rp×q is defined as the (m+p)×(n+q)


block diagonal matrix A ⊕ B = diag(A, B). We will also use the ⊕ operator
for two subspaces. If S ⊆ Rp and R ⊆ Rq then S ⊕ R = span(S ⊕ R) where
S and R are basis matrices for S and R. The next lemma (Cook and Zhang,
2015b) shows how to interpret the direct sum of envelopes. In preparation, let
M1 ∈ Sp1 ×p1 , M2 ∈ Sp2 ×p2 , and let S1 and S2 be subspaces of span(M1 ) and
span(M2 ). Then

Lemma 1.4. EM1 (S1 ) ⊕ EM2 (S2 ) = EM1 ⊕M2 (S1 ⊕ S2 ).

The proof of this lemma is in Appendix A.1.8.


We next consider settings in which envelopes are invariant under certain
simultaneous changes in M and S. The proof is in Appendix A.1.9.

Proposition 1.8. Let ∆ ∈ Sr×r be a positive definite matrix and let S be a u-


dimensional subspace of Rr . Let G ∈ Rr×u be a semi-orthogonal basis matrix
for S and let V ∈ Su×u be positive semi-definite. Define Ψ = ∆ + GV GT .
Then ∆−1 S = Ψ−1 S and

E∆ (S) = EΨ (S) = E∆ (∆−1 S) = EΨ (Ψ−1 S) = EΨ (∆−1 S) = E∆ (Ψ−1 S).

For a first application of Proposition 1.8, consider the multivariate linear


regression model (1.1) but now assume that the predictors X are random with
covariance matrix ΣX , so that Y and X have a joint distribution. Because the
predictors are random we can express the marginal variance of Y as

\[
\Sigma_Y = \Sigma_{Y|X} + \beta^T \Sigma_X \beta = \Sigma_{Y|X} + GVG^T, \tag{1.25}
\]
where $\Sigma_{Y|X} > 0$, G is a semi-orthogonal basis matrix for $\mathcal{B}' = \mathrm{span}(\beta^T)$, $\beta^T = GA$, $\dim(\mathcal{B}') \le \min(r, p)$, and $V = A\Sigma_X A^T$ is positive definite since A must have full row rank. These forms match the decomposition required for Proposition 1.8 with $\Psi = \Sigma_Y$ and $\Delta = \Sigma_{Y|X}$. Consequently, we have
\begin{align}
E_{\Sigma_{Y|X}}(\mathcal{B}') &= E_{\Sigma_Y}(\mathcal{B}') = E_{\Sigma_{Y|X}}(\Sigma_{Y|X}^{-1}\mathcal{B}') \notag \\
&= E_{\Sigma_Y}(\Sigma_Y^{-1}\mathcal{B}') = E_{\Sigma_Y}(\Sigma_{Y|X}^{-1}\mathcal{B}') \tag{1.26} \\
&= E_{\Sigma_{Y|X}}(\Sigma_Y^{-1}\mathcal{B}').
\end{align}

1.5 Algebraic methods of envelope construction


In this section we describe various population-level methods for construct-
ing the envelope EM (A) with M ∈ Sr×r and A ⊆ span(M ). Some of these
algorithms will be seen to be closely related to algorithms for PLS regression.

1.5.1 Algorithm E
The next proposition (Cook et al., 2010) describes a method of constructing
an envelope in terms of the eigenspaces of M . We use h generally to denote
the number of eigenspaces of a real symmetric matrix. Recalling the subspace
computations introduced at the outset of Section 1.3,

Proposition 1.9. (Algorithm E.) Let $M \in \mathbb{S}^{r\times r}$, let $P_i$, $i = 1, \ldots, h \le r$, be the projections onto the eigenspaces of M with distinct non-zero eigenvalues $\lambda_i$, $i = 1, \ldots, h$, and let $\mathcal{A} \subseteq \mathrm{span}(M)$. Then $E_M(\mathcal{A}) = \sum_{i=1}^{h} P_i \mathcal{A}$.
Proof. To prove that $\sum_{i=1}^{h} P_i \mathcal{A}$ is the smallest reducing subspace of M that contains $\mathcal{A}$, it suffices to prove the following statements:

1. $\sum_{i=1}^{h} P_i \mathcal{A}$ reduces M.
2. $\mathcal{A} \subseteq \sum_{i=1}^{h} P_i \mathcal{A}$.
3. If $\mathcal{T}$ reduces M and $\mathcal{A} \subseteq \mathcal{T}$, then $\sum_{i=1}^{h} P_i \mathcal{A} \subseteq \mathcal{T}$.

Let A be a basis matrix for $\mathcal{A}$. Statement 1 follows because $M = \sum_{i=1}^{h} \lambda_i P_i$ and
\[
M \sum_{i=1}^{h} P_i \mathcal{A} = \mathrm{span}\{M P_1 A, \ldots, M P_h A\} = \mathrm{span}\{\lambda_1 P_1 A, \ldots, \lambda_h P_h A\} = \sum_{i=1}^{h} P_i \mathcal{A}.
\]
Statement 2 holds because $\mathcal{A} = \{P_1 v + \cdots + P_h v : v \in \mathcal{A}\} \subseteq \sum_{i=1}^{h} P_i \mathcal{A}$. Turning to statement 3, if $\mathcal{T}$ reduces M, it can be written as $\mathcal{T} = \sum_{i=1}^{h} P_i \mathcal{T}$. If, in addition, $\mathcal{A} \subseteq \mathcal{T}$ then we have $P_i \mathcal{A} \subseteq P_i \mathcal{T}$ for $i = 1, \ldots, h$. Statement 3 follows since $\sum_{i=1}^{h} P_i \mathcal{A} \subseteq \sum_{i=1}^{h} P_i \mathcal{T} = \mathcal{T}$.

We refer to the algorithm implied by this proposition as Algorithm E, since


it is based directly on the eigenstructure of M .
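A direct transcription of Algorithm E into R is given below. It is a sketch under the assumption that eigenvalues can be grouped numerically into eigenspaces; the function name envelopeE is ours and the routine is not meant to be numerically robust.

```r
## Algorithm E (ours, population version): E_M(A) from eigenspace projections.
envelopeE <- function(M, A, tol = 1e-8) {
  A <- as.matrix(A)
  ev <- eigen(M, symmetric = TRUE)
  keep <- abs(ev$values) > tol                        # nonzero eigenvalues only
  vals <- round(ev$values[keep], 8)
  vecs <- ev$vectors[, keep, drop = FALSE]
  basis <- NULL
  for (lam in unique(vals)) {                         # one eigenspace per distinct eigenvalue
    Vi  <- vecs[, vals == lam, drop = FALSE]
    PiA <- Vi %*% crossprod(Vi, A)                    # P_i A
    if (any(abs(PiA) > tol)) basis <- cbind(basis, PiA)
  }
  QR <- qr(basis)
  qr.Q(QR)[, seq_len(QR$rank), drop = FALSE]          # orthonormal basis of E_M(A)
}

## Example: with M as above, A = span(v1 + v2) loads on both eigenspaces,
## so the envelope is span{v1, v2} and has dimension 2.
v1 <- c(1, 1, 1) / sqrt(3); v2 <- c(-1, 1, 0) / sqrt(2); v3 <- c(1, 1, -2) / sqrt(6)
M  <- 3 * tcrossprod(v1) + tcrossprod(v2) + tcrossprod(v3)
ncol(envelopeE(M, v1 + v2))   # 2
```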

1.5.2 Algorithm K
When dim(A) = 1, the dimension of the envelope is bounded above by the
number h of eigenspaces of M : dim(EM (A)) ≤ h. We use Proposition 1.9
in this case to gain further insights into the Krylov subspaces discussed in
Section 1.3 and to motivate a new algebraic method of constructing envelopes.
As we will see in Chapter 3, Algorithm K is a foundation for the NIPALS
regression algorithm with a univariate response, Y ∈ R1 .
Let $a \in \mathbb{R}^r$ be a basis vector for the one-dimensional subspace $\mathcal{A}$, let $\lambda_i \neq 0$ be the eigenvalue for the i-th eigenspace of M, and let $q = \dim(E_M(\mathcal{A}))$, so that a projects non-trivially onto q eigenspaces of M. We take these to be the first q eigenspaces without loss of generality. Let $C_q = (P_1 a, \ldots, P_q a)$. Then
\begin{align*}
E_M(\mathcal{A}) &= \sum_{i=1}^{q} P_i \mathcal{A} = \mathrm{span}(C_q) \\
M^t a &= \sum_{j=1}^{h} \lambda_j^t P_j a = \sum_{j=1}^{q} \lambda_j^t P_j a \\
a &= \sum_{j=1}^{q} P_j a.
\end{align*}
Using these relationships, we have
\[
K_t(M, a) = (a, Ma, M^2 a, \ldots, M^{t-1} a)
= \left( \sum_{j=1}^{q} P_j a,\; \sum_{j=1}^{q} \lambda_j P_j a,\; \sum_{j=1}^{q} \lambda_j^2 P_j a,\; \ldots,\; \sum_{j=1}^{q} \lambda_j^{t-1} P_j a \right).
\]
Writing this as a product of $C_q$ and the Vandermonde matrix
\[
V = \begin{pmatrix}
1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{t-1} \\
1 & \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{t-1} \\
\vdots & \vdots & \vdots &  & \vdots \\
1 & \lambda_q & \lambda_q^2 & \cdots & \lambda_q^{t-1}
\end{pmatrix}_{q \times t},
\]
we have
\[
K_t(M, a) = C_q V.
\]

The λj ’s are distinct eigenvalues by construction and so, if t < q, rank(V ) = t.


This implies that when t < q, Kt (M, a) ⊂ span(Cq ) = EM (A). Similarly, if
t ≥ q then Kt (M, a) = span(Cq ) = EM (A). We summarize this result in the
following proposition.

Proposition 1.10. (Algorithm K.) Let $M \in \mathbb{S}^{r\times r}$ be positive semi-definite. Let $a \in \mathbb{R}^r$, $a \neq 0$, $\mathcal{A} = \mathrm{span}(a) \subseteq \mathrm{span}(M)$ and $q = \dim(E_M(\mathcal{A}))$. Then the Krylov subspaces are strictly monotonically increasing until dimension q is reached and they are constant thereafter:
\[
\mathcal{K}_1(M, a) \subset \mathcal{K}_2(M, a) \subset \cdots \subset \mathcal{K}_q(M, a) = E_M(\mathcal{A}) = \mathcal{K}_{q+1}(M, a) = \cdots = \mathcal{K}_r(M, a), \tag{1.27}
\]
where "⊂" indicates strict containment.

We refer to the algorithm implied by this proposition as Algorithm K,


since it is based on Krylov subspaces.
Proposition 1.10 describes a method of enveloping a one-dimensional sub-
space with the reducing subspaces of a positive semi-definite matrix M ∈ Sr×r .
The guidance it offers when the dimension of A is greater than 1 is not as
helpful. Let h = dim(A), let A = (a1 , . . . , ah ) be a basis matrix for A and let
Aj = span(aj ). The Krylov spaces

\[
\left.\begin{aligned}
K_j(M, A) &= \{A, MA, M^2A, \ldots, M^{j-1}A\} \\
\mathcal{K}_j(M, A) &= \mathrm{span}\{K_j(M, A)\}
\end{aligned}\right\}, \tag{1.28}
\]
are monotonic
\[
\mathcal{K}_1(M, A) \subset \mathcal{K}_2(M, A) \subset \cdots \subset \mathcal{K}_t(M, A) = \cdots = \mathcal{K}_r(M, A), \tag{1.29}
\]

as they are when h = 1 (1.23). However, the stopping point t is no longer equal
to the dimension q of EM (S). We can bound q ≤ th, but there is no way to get
a better handle on q without imposing additional structure. The Krylov space
Kt (M, A) at termination is related to the individual Krylov spaces Kr (M, Aj ),
j = 1, . . . , h, corresponding to the columns of A as follows,
\[
\mathcal{K}_t(M, \mathcal{A}) = \sum_{j=1}^{h} \mathcal{K}_t(M, \mathcal{A}_j) = \sum_{j=1}^{h} \mathcal{K}_{q_j}(M, \mathcal{A}_j) = \sum_{j=1}^{h} E_M(\mathcal{A}_j) = E_M(\mathcal{A}),
\]

where the final two equalities follow from Propositions 1.10 and 1.7. In con-
sequence, the columns of Kt (M, A) span the desired envelope EM (A), but
they do not necessarily form a basis for it, leading to inefficiencies in statis-
tical applications. For that reason a different method is needed to deal with
multi-dimensional subspaces.

1.5.3 Algorithm N
Let v1 (·) denote the largest eigenvalue of the argument matrix with corre-
sponding eigenvector `1 (·) of length 1, and let q denote the dimension of the
M -envelope of A, EM (A). Then, the following algorithm generates EM (A). We
refer to it as Algorithm N and give its proof here since one instance of it cor-
responds to the NIPALS partial least squares regression algorithm discussed
in Section 3.1. Recall from Section 1.2.1 that QA(∆) = I − PA(∆) , where PA(∆)
denotes the projection onto span(A) in the ∆ inner product.

As described in the next proposition, the population version of Algo-


rithm N requires positive semi-definite matrices A, M ∈ Sr×r as inputs with
A = span(A) ⊆ span(M ). We describe specific instances of the algorithm as
N(A, M ) using A and M as arguments. Sample versions of the algorithm are
similarly described as N(A,
b Mc), where A b and M
c are consistent estimators of
A and M .

Proposition 1.11. (Algorithm N.) Let $A, M \in \mathbb{S}^{r\times r}$ be positive semi-definite matrices with $\mathcal{A} = \mathrm{span}(A) \subseteq \mathrm{span}(M)$, let $u_1 = \ell_1(A)$ and let $U_1 = (u_1)$. For $i = 1, 2, \ldots$, construct
\begin{align*}
u_{i+1} &= \ell_1\bigl(Q^T_{U_i(M)}\, A\, Q_{U_i(M)}\bigr) \\
U_{i+1} &= (U_i, u_{i+1}).
\end{align*}
If $Q^T_{U_{i+1}(M)}\, A\, Q_{U_{i+1}(M)} = 0$, stop and set $k = i + 1$. At termination, $k = q$, $\mathrm{span}(U_q) = E_M(\mathcal{A})$,
\begin{align*}
A &= P_{U_q} A P_{U_q} = A P_{U_q} = P_{U_q} A \\
M &= P_{U_q} M P_{U_q} + Q_{U_q} M Q_{U_q}.
\end{align*}

Proof. For clarity we divide the proof into a number of distinct claims. Let
λi+1 = v1 (QTUi (M ) AQUi (M ) ). Prior to stopping, λi+1 > 0.

Claim 1: UiT Ui = I and UiT ui+1 = 0, i = 1, . . . , k − 1.


Since QTUi (M ) AQUi (M ) ui+1 = λi+1 ui+1 , we have

UiT QTUi (M ) AQUi (M ) ui+1 = λi+1 UiT ui+1

or equivalently, since UiT QTUi (M ) = 0, λi+1 UiT ui+1 = 0, which implies that
UiT ui+1 = 0 since λi+1 6= 0. Since ui+1 is an eigenvector chosen with
length 1, UiT Ui = I.
In the claims that follow, let k be the point at which the algorithm termi-
nates.

Claim 2: QTUk (M ) A = 0.
Since A is positive semi-definite there is a full rank matrix V so that
A = V V T . When the algorithm terminates, QTUk (M ) V V T QUk (M ) = 0, and
so QTUk (M ) V = 0 and QTUk (M ) V V T = QTUk (M ) A = 0.

Claim 3: span(Uk ) reduces M .


By Lemmas 1.1 and 1.2, it is sufficient to show that there is a k × k matrix
Ωk so that M Uk = Uk Ωk . From claim 2, A = PUTk (M ) A, meaning

A = M Uk (UkT M Uk )−1 UkT A. (1.30)

Next,
\begin{align}
\lambda_{i+1} u_{i+1} &= Q^T_{U_i(M)} A\, Q_{U_i(M)} u_{i+1} \notag \\
&= \bigl(A - P^T_{U_i(M)} A\bigr) Q_{U_i(M)} u_{i+1}. \tag{1.31}
\end{align}
Substituting the right-hand side of (1.30) for the first A in (1.31) and using $U_i = U_k (I_i, 0)^T$ gives
\begin{align*}
\lambda_{i+1} u_{i+1} &= \bigl\{M U_k (U_k^T M U_k)^{-1} U_k^T A - M U_k (I_i, 0)^T (U_i^T M U_i)^{-1} U_i^T A\bigr\}\, Q_{U_i(M)} u_{i+1} \\
&= M U_k \bigl\{(U_k^T M U_k)^{-1} U_k^T A - (I_i, 0)^T (U_i^T M U_i)^{-1} U_i^T A\bigr\}\, Q_{U_i(M)} u_{i+1} \\
&= M U_k V_{i+1} \quad \text{for } i = 0, \ldots, k - 1,
\end{align*}

where Vi+1 ∈ Rr is defined implicitly. Let S = (V1 , . . . , Vk ) and Λk =


diag(λ1 , . . . , λk ). Therefore M Uk S = Uk Λk . Now, since rank Uk is k and
Λk > 0, we have that S is invertible and therefore M Uk = Uk Λk S −1 =
Uk Ωk , with nonsingular Ωk = Λk S −1 .

Claim 4: span(A) ⊆ span(Uk ).


Substituting Claim 3 into (1.30), we have

A = M Uk (UkT M Uk )−1 UkT A


= Uk Ωk (UkT Uk Ωk )−1 UkT A
= Uk UkT A = PUk A.

And the conclusion follows.

Claim 5: span(Uk ) is the smallest reducing subspace of M that contains A.


From Claims 3 to 4, we know that span(Uk ) is a reducing subspace of M
that contains span(A). It remains to show that span(Uk ) is the smallest
such subspace.

Let Γ : r × q be a semi-orthogonal basis matrix for EM (A) and let (Γ, Γ0 )


be an orthogonal r × r matrix. Then M = ΓΩΓT + Γ0 Ω0 ΓT0 and A =
ΓΓT A. We have shown that span(Γ) = EM (A) ⊆ span(Uk ). We now use a
contradiction that shows that span(Uk ) ⊆ span(Γ).
Suppose that span(Uk ) 6⊆ span(Γ), so that span(Uk ) ∩ span(Γ0 ) contains a
nontrivial subspace. From the sequence of vectors {ui }, let d be the index
of the first vector such that ud = ΓBd + Γ0 Cd with Cd 6= 0, reflecting that
Uk intersects Γ0 . First, we show that Bd 6= 0. To do so, we first prove that

AT QUd−1 (M ) Γ0 = 0. (1.32)

Since Ud−1 ⊂ Γ and ΓT M Γ0 = 0 , we have


\[
U_{d-1}^T M \Gamma_0 = U_{d-1}^T \Gamma\Gamma^T M \Gamma_0 = 0,
\]
and
\begin{align*}
A^T Q_{U_{d-1}(M)} \Gamma_0 &= A^T \Gamma_0 - A^T U_{d-1}(U_{d-1}^T M U_{d-1})^{-1} U_{d-1}^T M \Gamma_0 \\
&= A^T \Gamma_0 = A^T \Gamma\Gamma^T \Gamma_0 \\
&= 0,
\end{align*}

from which (1.32) follows. Now, since d ≤ q, we have that

\[
u_d^T Q^T_{U_{d-1}(M)} A A^T Q_{U_{d-1}(M)} u_d \neq 0,
\]
and since $u_d = \Gamma B_d + \Gamma_0 C_d$, (1.32) implies $B_d \neq 0$, and since $C_d^T C_d + B_d^T B_d = 1$ with $C_d \neq 0$, we have $B_d^T B_d < 1$.
Now, let us call ũd = ΓBd /(BdT Bd )1/2 . Then ũTd ũd = 1 and by (1.32)

\begin{align}
u_d^T Q^T_{U_{d-1}(M)} A &= B_d^T \Gamma^T Q^T_{U_{d-1}(M)} A + C_d^T \Gamma_0^T Q^T_{U_{d-1}(M)} A \notag \\
&= B_d^T \Gamma^T Q^T_{U_{d-1}(M)} A \notag \\
&\neq 0. \tag{1.33}
\end{align}
With this result, we have
\begin{align*}
\tilde{u}_d^T Q^T_{U_{d-1}(M)} A A^T Q_{U_{d-1}(M)} \tilde{u}_d &= u_d^T Q^T_{U_{d-1}(M)} A A^T Q_{U_{d-1}(M)} u_d / (B_d^T B_d) \\
&> u_d^T Q^T_{U_{d-1}(M)} A A^T Q_{U_{d-1}(M)} u_d,
\end{align*}

where the inequality follows because BdT Bd < 1 and (1.33). Therefore, the
maximum is not at ud , which is the contradiction we seek.

It follows that span(Uk ) is the smallest reducing subspace of M that contains


A and consequently EM (A) = span(Uk ) and q = k.
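Algorithm N translates almost line for line into code. The sketch below is ours (population version, with a numerical tolerance in place of exact zero tests) and is checked on the small example used earlier for Algorithm E.

```r
## Algorithm N (ours, population version; not optimized).
algN <- function(A, M, tol = 1e-8) {
  P  <- function(U) U %*% solve(crossprod(U, M %*% U), crossprod(U, M))  # P_{U(M)}
  l1 <- function(S) eigen(S, symmetric = TRUE)$vectors[, 1]              # leading eigenvector
  U <- matrix(l1(A), ncol = 1)
  repeat {
    Q <- diag(nrow(A)) - P(U)
    W <- t(Q) %*% A %*% Q                  # Q'_{U(M)} A Q_{U(M)}
    if (max(abs(W)) < tol) break           # stop once the residual of A is exhausted
    U <- cbind(U, l1(W))
  }
  U                                        # span(U) = E_M(span(A)), q = ncol(U)
}

## Check on the earlier example: A = a a' with a = v1 + v2 gives a 2-dimensional envelope
v1 <- c(1, 1, 1) / sqrt(3); v2 <- c(-1, 1, 0) / sqrt(2); v3 <- c(1, 1, -2) / sqrt(6)
M  <- 3 * tcrossprod(v1) + tcrossprod(v2) + tcrossprod(v3)
a  <- v1 + v2
U  <- algN(tcrossprod(a), M)
ncol(U)                                    # 2, and span(U) = span{v1, v2}
```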

The next proposition describes a connection between Algorithms K and N


when A has dimension 1. We give the following lemma in preparation for the
proposition.

Lemma 1.5. Let M ∈ Sr×r be a positive semi-definite matrix and let A ∈


Rr×s with rank(A) = s and A = span(A) ⊆ span(M ). Then

(i) AT M A is positive definite.

(ii) for s = 1 and t ≤ q = dim(EM (A)) we have that KtT(M, A)M Kt (M, A) >
0.

Proof. Let m = rank(M ) and construct the spectral decomposition of M as


M = GDGT , where G ∈ Rr×m is semi-orthogonal and D > 0 is diagonal.
Since span(A) ⊆ span(M ), there is a B ∈ Rm×s with rank s and A = GB.
Conclusion (i) now follows since

ATM A = B T GT GDGT GB = B TDB > 0.

For conclusion (ii), first it follows from Proposition 1.10 that Kt (M, A) has
full column rank if and only if t ≤ q. Since A ⊆ span(M ) there is a non-zero
vector b ∈ Rr so that A = M b and

Kt (M, A) = (A, M A, . . . , M t−1 A)


= (M b, M 2 b, . . . , M t b)
= M Kt (M, b).

Consequently, Kt (M, A) ⊆ span(M ). Conclusion (ii) now follows from conclu-


sion (i).

Proposition 1.12. If dim(A) = 1, M ≥ 0, A ⊆ span(M ) and A is spanned


by the vector a ∈ Rr , then Algorithms N and K are equivalent; that is,
span(Uq ) = Kq (M, a) = EM (A). Further, the columns of Uq = (u1 , . . . , uq )
correspond to a sequential orthogonalization of the columns of the Krylov ma-
trix Kq (M, a) = (a, M a, M 2 a, . . . , M q−1 a).

Proof. Let A = aaT and assume without loss of generality that aT a = 1.


Recall that Proposition 1.11 requires M ∈ Sr×r to be positive semi-definite
and a ∈ span(M ). The first u-vector from Algorithm N in Proposition 1.11 is


u1 = a, which is also the first vector in Kq (M, a). The second u-vector is

\begin{align*}
u_2 &= Q^T_{u_1(M)} a = Q^T_{a(M)} a \\
&= a - Ma/(a^T M a),
\end{align*}
where $a^T M a > 0$ by Lemma 1.5. Clearly, $u_1$ and $u_2$ are orthogonal vec-


tors, uT1 u2 = 0, that span the same space as the first two columns in
the Krylov matrix, span(u1 , u2 ) = span(a, M a) = K2 (M, a). Assume that
span(Uj ) = Kj (M, a), j = 1, 2, . . . , t < q. Then, there is a nonsingular matrix
Ht ∈ Rt×t so that Ut = Kt Ht , where for notational simplicity we are sup-
pressing the arguments to Kt (M, a). Arguing by induction, we need to show
that span(Ut+1 ) = span(Ut , ut+1 ) = Kt+1 (M, a) ∈ Rr×(t+1) . Write
\[
U_{t+1} = (U_t, u_{t+1}) = (K_t H_t, u_{t+1}) = \left( K_{t+1}\begin{pmatrix} H_t \\ 0_{1\times t} \end{pmatrix},\; u_{t+1} \right).
\]

Turning to ut+1 , we have

\begin{align*}
u_{t+1} &= Q^T_{U_t(M)} a = Q^T_{K_t(M)} a \\
&= a - M K_t (K_t^T M K_t)^{-1} K_t^T a,
\end{align*}
where $K_t^T M K_t > 0$ by Lemma 1.5 (ii). Let $h_t = (K_t^T M K_t)^{-1} K_t^T a \in \mathbb{R}^t$. Then
\begin{align*}
u_{t+1} &= a - (Ma, M^2 a, \ldots, M^t a)\, h_t \in \mathbb{R}^r \\
&= K_{t+1}\begin{pmatrix} 1 \\ -h_t \end{pmatrix}.
\end{align*}

Substituting this into the above representation for Ut+1 ∈ Rr×(t+1) gives
\[
U_{t+1} = \left( K_{t+1}\begin{pmatrix} H_t \\ 0_{1\times t} \end{pmatrix},\; u_{t+1} \right) = K_{t+1}\begin{pmatrix} H_t & 1 \\ 0_{1\times t} & -h_t \end{pmatrix}.
\]

Since t < q, Ut+1 and Kt+1 both have full column rank t + 1, so
\[
L_t := \begin{pmatrix} H_t & 1 \\ 0_{1\times t} & -h_t \end{pmatrix}
\]

must be nonsingular and thus span(Ut+1 ) = Kt+1 . The orthogonality of the


column of Ut follows from Claim 1 in the proof of Proposition 1.11. Conse-
quently, the columns of Ut can be obtained by orthogonalizing the columns of
Kt . That is, since Ut+1 = Kt+1 Lt and the columns of Ut+1 are orthogonal with
length 1, the multiplication of Kt+1 on the right by Lt serves to orthogonalize
the columns of Kt+1 .

1.5.4 Algorithm S
The algorithm indicated in the following proposition is called Algorithm S be-
cause a special case of it yields the SIMPLS algorithm discussed in Section 3.3.
Like Algorithm N, its population version requires positive semi-definite ma-
trices A, M ∈ Sr×r as inputs with A = span(A) ⊆ span(M ). We may describe
specific instances of the algorithm as S(A, M ) using A and M as arguments.
Sample versions of the algorithm are similarly described as S(A,
b Mc), where A b
and Mc are consistent estimators of A and M .

Proposition 1.13. (Algorithm S.) Let $A, M \in \mathbb{S}^{r\times r}$ be positive semi-definite matrices with $\mathcal{A} = \mathrm{span}(A) \subseteq \mathrm{span}(M)$. Let $u_1 = \ell_1(A)$ and $U_1 = (u_1)$. For $i = 1, 2, \ldots$, construct
\begin{align*}
u_{i+1} &= \ell_1(Q_{MU_i}\, A\, Q_{MU_i}) \\
U_{i+1} &= (U_i, u_{i+1}).
\end{align*}
If $Q_{MU_i} A = 0$, stop and set $k = i$. At termination, $k = q$, $\mathrm{span}(U_q) = E_M(\mathcal{A})$ and
\begin{align*}
A &= P_{U_q} A P_{U_q} = A P_{U_q} = P_{U_q} A \\
M &= P_{U_q} M P_{U_q} + Q_{U_q} M Q_{U_q}.
\end{align*}

The proof of this proposition parallels that for Algorithm N in Proposi-


tion 1.11 and is therefore omitted. Details on the SIMPLS version of Algorithm
N are available in Sections 3.3 and 3.4. Those details can be used as a template
for the proof of this proposition. Algorithm S uses the identity inner prod-
uct while Algorithm N uses the M inner product. Otherwise, the structure of
these algorithms is similar.
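For comparison, here is the analogous sketch of Algorithm S (again ours, population version); the only change from the Algorithm N sketch above is the projection used to deflate A.

```r
## Algorithm S (ours, population version): deflation uses Q_{MU} = I - P_{MU}.
algS <- function(A, M, tol = 1e-8) {
  l1 <- function(S) eigen(S, symmetric = TRUE)$vectors[, 1]
  U <- matrix(l1(A), ncol = 1)
  repeat {
    MU <- M %*% U
    Q  <- diag(nrow(A)) - MU %*% solve(crossprod(MU), t(MU))   # Q_{MU}
    if (max(abs(Q %*% A)) < tol) break                         # stop when Q_{MU} A = 0
    U <- cbind(U, l1(Q %*% A %*% Q))
  }
  U
}

## With a and M from the Algorithm N sketch, algS(tcrossprod(a), M) spans the same
## two-dimensional envelope, as Corollary 1.2 below anticipates when dim(A) = 1.
```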
None of the algorithms discussed in this section requires M to be nonsingu-
lar, but all require that A be contained in span(M ). In statistical applications
described in Chapter 2 and in subsequent chapters, M will often be a sample
covariance matrix that, under these algorithms, may be singular. In conse-


quence, the algorithms in this section are adaptable to dimension reduction
in regressions with p > n.
Algorithm S can be described also as follows. Let u0 = 0 and, as in Propo-
sition 1.13, let Ui+1 = (u0 , u1 , . . . , ui ). Then, given Uk , k < q, Uk+1 is con-
structed by concatenating Uk with

\[
u_{k+1} = \arg\max_{u \in \mathbb{R}^p} u^T A u \quad \text{subject to} \quad u^T M U_k = 0 \;\text{ and }\; u^T u = 1. \tag{1.34}
\]
This can be seen as equivalent to Algorithm S described in Proposition 1.13 by introducing $Q_{MU_k}$ into (1.34) to incorporate the constraint $u^T M U_k = 0$:
\begin{align*}
u_{k+1} &= \arg\max_{u \in \mathbb{R}^p,\; u^T u = 1} u^T Q_{MU_k} A\, Q_{MU_k} u \\
&= \ell_1(Q_{MU_k} A\, Q_{MU_k}).
\end{align*}

The form of the algorithm given in (1.34) is not directly constructive, but
it connects with a version of the SIMPLS algorithm in the literature; see
Section 3.3.
The following corollary is provided in summary; its proof is straightforward and omitted.

Corollary 1.2. Under the conditions of Proposition 1.12, algorithms S, N,


and K are equivalent. That is, they result in the same subspace.

1.5.5 Algorithm L
The Algorithms K, N, and S all allow for M to be positive semi-definite. In
contrast, the algorithm needed to implement Proposition 1.14 requires M to
be positive definite. It is referred to as Algorithm L because it is inspired by
likelihood-based estimation discussed in Sections 2.3.1 and 2.5.3.

Proposition 1.14. (Algorithm L.) Let $M > 0$ and $A \ge 0$ be symmetric r × r matrices. Then the M-envelope of $\mathcal{A} = \mathrm{span}(A)$ can be constructed as
\begin{align*}
E_M(\mathcal{A}) &= \mathrm{span}\{\arg\min_G [\, \log|G^T M G| + \log|G_0^T (M + A) G_0| \,]\} \\
&= \mathrm{span}\{\arg\min_G [\, \log|G^T M G| + \log|G^T (M + A)^{-1} G| \,]\},
\end{align*}
where $\min_G$ is taken over all semi-orthogonal matrices $G \in \mathbb{R}^{r\times q}$ and $(G, G_0)$ is an orthogonal matrix.

Proof. This proof relies on repeated use of Lemma 1.3. First,

\[
\log|G_0^T (M + A) G_0| = \log|G^T (M + A)^{-1} G| + \log|M + A|,
\]
which establishes the equivalence of the two minimization problems. Now,
\begin{align*}
\log|G^T M G| + \log|G_0^T (M + A) G_0| &= \log|G^T M G| + \log|G_0^T M G_0 + G_0^T A G_0| \\
&\ge \log|G^T M G| + \log|G_0^T M G_0| \\
&= \log|M| + \log|G^T M G| + \log|G^T M^{-1} G| \\
&\ge \log|M|,
\end{align*}

where the final step follows from Lemma 1.3 (III). To achieve the lower bound,
equality in the first inequality requires that A ⊆ span(G). The second equal-
ity follows from Lemma 1.3. Consequently, achieving equality in the second
inequality requires that span(G) reduce M . Overall then, span(G) must be a
reducing subspace of M that contains A. The conclusion follows since q is the
dimension of the smallest subspace that satisfies these two properties.
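A small numerical check of Proposition 1.14 is given below (ours; it evaluates the objective rather than optimizing over the Grassmannian): at a basis of $E_M(\mathcal{A})$ the objective equals log|M|, and it is larger at other semi-orthogonal matrices of the same dimension.

```r
## Objective of Proposition 1.14 (ours), evaluated on a toy example.
objL <- function(G, M, A) {
  G0 <- qr.Q(qr(G), complete = TRUE)[, -(1:ncol(G)), drop = FALSE]   # orthogonal complement
  determinant(crossprod(G, M %*% G))$modulus +
    determinant(crossprod(G0, (M + A) %*% G0))$modulus               # log-determinant form
}

v1 <- c(1, 1, 1) / sqrt(3); v2 <- c(-1, 1, 0) / sqrt(2); v3 <- c(1, 1, -2) / sqrt(6)
M  <- 3 * tcrossprod(v1) + tcrossprod(v2) + tcrossprod(v3)
A  <- tcrossprod(v1 + v2)
Genv  <- cbind(v1, v2)                          # a true envelope basis (q = 2)
Grand <- qr.Q(qr(matrix(rnorm(6), 3, 2)))       # a random semi-orthogonal 3 x 2 matrix

c(envelope = objL(Genv, M, A),                  # equals log|M| = log 3
  random   = objL(Grand, M, A),                 # larger, in general
  logdetM  = determinant(M)$modulus)
```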

1.5.6 Other envelope algorithms


We briefly introduce three other algorithms in this section. They are not
included in detail because their relationship with PLS and algorithms N, K,
and S are still under study. Nevertheless, adaptations of these algorithms have
the potential to add materially to the body of PLS-type methods.
The optimization required in Algorithm L is essentially over a Grassman-
nian, the set of all q dimensional subspaces of Rr . Although good algorithms
are available, they are time-consuming and prone to getting stuck in a local
optimum. Cook, Forzani, and Su (2016) developed an algorithm that does not
require optimization over a Grassmannian and is less likely to get caught in
a local optimum. Their algorithm is based on a new parameterization that
arises from identifying q active rows of G so that optimization can be carried
out over the remaining rows.
Cook and Zhang (2016) proposed a sequential likelihood-based algorithm
for estimating a general envelope EM (A). Called the 1D algorithm, it requires
optimization in r dimensions in the first step and reduces the optimization
dimension by 1 in each subsequent step until an estimate of a basis for EM (A)
is obtained. Compared to direct Grassmannian algorithms for maximizing
the log likelihood, it is straightforward to implement and has the potential

to reduce the computational burden with little loss of effectiveness. It could


be used to produce a stand-alone estimator, or as an alternative method for
getting starting values for a Grassmann optimization of the objective function
in Proposition 1.14.
Cook and Zhang (2018) proposed a screening algorithm called Envelope
Component Screening (ECS) that reduces the original dimension r to a more
manageable dimension d ≤ n, without losing notable structural information
on the envelope. The ECS algorithm can be a useful tool for reducing the
dimension prior to application of a full optimization algorithm. Adaptations
might compete directly with PLS algorithms, although that possibility has
apparently not been studied.
2 Envelopes for Regression

Envelopes are dimension reduction constructs that come in several varieties


depending on the goals of the analysis. In reference to model (1.1), the two
most common are response envelopes for reducing the response vector Y and
predictor envelopes for reducing the vector of predictors X. Envelope theory
and methodology were developed first by Cook, Li, and Chiaromonte (2010)
for response reduction and later extended to predictor reduction by Cook,
Helland, and Su (2013), who also established first the connections with PLS
regression. Cook and Zhang (2015b) developed envelopes for simultaneous
response and predictor reduction. As mentioned at the outset of Chapter 1,
the goal of envelope methods, stated informally, is to separate with clarity the
information in the data that is material to the study goals from that which
is immaterial, which is in the spirit of Fisher’s notion of sufficient statistics
(Fisher, 1922).
There are now many articles extending and refining various aspects of en-
velope methodology, including partial envelopes (Su and Cook, 2011) for infer-
ence on a subset of the regression coefficients, envelopes for quantile regression
(Ding, Su, Zhu, and Wang, 2021), spatial modeling (Rekabdarkolaee, Wang,
Naji, and Fuentes, 2020), and matrix-variate and tensor-variate regressions
(Ding and Cook, 2018; Li and Zhang, 2017). Forzani and Su (2020) extended
the normal likelihood theory underlying envelopes to elliptically contoured
distributions. An envelope version of reduced rank regression was developed
by Cook, Forzani, and Zhang (2015). Wang and Ding (2018) developed an
envelope version of vector autoregression and Samadi and Herath (2021) de-
veloped a reduced rank version of envelope vector autoregression.
Several articles have been devoted to using envelopes to aid further un-
derstanding and development of partial least squares regression. Cook and Su
(2016) showed how to use envelopes to develop scaling methods for partial


least squares regression. Envelopes were used as a basis for studying the high
dimensional behavior of partial least squares regression by Cook and Forzani
(2018, 2019). Although partial least squares regression is usually associated
with the multivariate linear model, Cook and Forzani (2021) showed that PLS
dimension reduction may be serviceable in the presence of non-linearity. The
role of envelopes in furthering the application of PLS in chemometrics was
discussed by Cook and Forzani (2020).
Bayesian versions of envelopes were developed by Khare, Pal, and Su
(2016) and Chakraborty and Su (2023). Su and Cook (2011) developed the no-
tion of inner envelopes for capturing part of the regression. Each of these and
others demonstrate a potential for envelope methodology to achieve reduc-
tion in estimative and predictive variation beyond that attained by standard
methods, sometimes by amounts equivalent to increasing the sample size many
times over. An introduction to envelopes is available from the monograph by
Cook (2018) and a brief overview from Cook (2020).
Our goal in this chapter is to give envelope foundations that allow us to
connect with PLS regression methodology.

2.1 Model-free predictor envelopes defined


Predictor envelopes are designed to reduce the dimension of the predictor vec-
tor so that the reduced predictors carry unambiguously all of the information
that is material to the prediction of Y , leaving behind a portion of the predic-
tor vector that is immaterial to prediction. To develop this idea, consider the
regression of a vector-valued response Y ∈ Rr on a stochastic vector-valued
predictor X ∈ Rp based on a simple random sample (Yi , Xi ), i = 1, . . . , n, of
size n, without necessarily invoking model (1.1). Regression methods gener-
ally become less effective when n is not large relative to p, perhaps p is large
relative to n, which is indicated as p ≫ n, or the predictors are highly collinear.
The difficulties caused by a relatively small sample size or high collinearity
can often be mitigated effectively by linearly reducing the dimension of X
without losing information about Y , either prior to or in conjunction with
developing a method of prediction. Denote the reduced predictors as PS X,
where S represents a subspace of Rp . We expressed the reduced predictors as

projected vectors in Rp to avoid the ambiguity involved in selecting a basis at


this level of discussion.

2.1.1 Central subspace


The reduced predictors PS X must hold all of the information that X has
about Y . This is interpreted to mean that, holding PS X fixed, any remaining
variation in X should be independent of Y , a condition that is expressed
symbolically as
Y ⫫ X | P_S X.                                  (2.1)

This condition is the driving force behind sufficient dimension reduction


(SDR), the goal of which is to reduce the predictor dimension without re-
quiring a pre-specified model for the regression of Y on X. The intersection
of all subspaces for which (2.1) holds is called the central subspace, first for-
mulated by Cook (1994) and symbolized SY |X . It has been the target of much
theoretical and methodological inquiry, and there is now a substantial body
of effective model-free dimension-reduction methodology based on the cen-
tral subspace and related constructions. This notion is summarized by the
following definition.

Definition 2.1. A subspace S ⊆ Rp that satisfies (2.1) is called a dimension


reduction subspace for the regression of Y on X. If the intersection of all
dimension-reduction subspaces is itself a dimension reduction subspace it is
called the central subspace and denoted as SY |X .

Early work that marks the beginning of the area is available from Cook
(1998). Relatively recent advances are described in the monograph by Li
(2018). To help fix ideas, consider the following four standard models, each
with stochastic predictor X ∈ Rp and β, β1 , and β2 all p × 1 vectors.

1. Single-response linear regression: Y = α + β^T X + σε with ε ⫫ X, E(ε) = 0,


and var(ε) = 1,

2. Logistic regression: logit(p(X)) = β TX.

3. Cox regression with hazard function Λ0 (t) exp(β TX).

4. Single-response heteroscedastic linear regression: Y = α+β1TX +σ(β2T X)ε.



In models 1–3, the response depends on the predictor only via β^T X, and so
Y ⫫ X | β^T X and S_{Y|X} = span(β). From a model-free SDR perspective, these
models look the same because each depends on only one linear combination of
the predictors. Model 4 depends on two linear combinations of the predictor,
SY |X = span(β1 , β2 ). In the multivariate linear model (1.1) with β ∈ Rp×r ,
we have SY |X = span(β) ⊆ Rp .

2.1.2 Model-free predictor envelopes


Methodology based on condition (2.1) alone may be ineffective when n is not
large relative to p or the predictors are highly collinear because then it is
hard in application to distinguish the part PS X of X that is material to the
prediction of Y from the complementary part QS X. Because it may take a
large sample size to sort out correlated predictors, requiring that n ≫ p is one
way to compensate for these potential shortcomings. We may compensate also
by requiring, in addition to (2.1), that P_S X ⫫ Q_S X, so the part of X that
is selected as material PS X is independent of the complementary immaterial
part QS X. This condition is then used to induce a measure of clarity in
the separation of X into material and immaterial parts. We now have two
conditions that characterize a predictor envelope S:

(a) Y ⫫ X | P_S X and (b) P_S X ⫫ Q_S X ⇐⇒ (c) (Y, P_S X) ⫫ Q_S X.        (2.2)

As indicated in this equation, conditions (2.2a) and (2.2b) are together equiv-
alent to condition (2.2c), which says that jointly (Y, PS X) is independent of
QS X. The material components are still represented as PS X and the imma-
terial components as QS X. See Cook (2018) for further discussion of the role
of conditional independence in envelope foundations.
In full generality, (2.2) represents a strong ideal goal that is not cur-
rently serviceable as a basis for methodological development because of the
need to assess condition (2.2b). However, if the predictors are normally dis-
tributed, condition (2.2b) is manageable since it then holds if and only if
cov(PS X, QS X) = PS ΣX QS = 0, where ΣX denotes the covariance matrix
of the predictor vector X. This connects with the reducing subspaces of ΣX
as described in the next lemma. Normality is not required, and its proof is in
Appendix A.2.1.
Lemma 2.1. Let S ⊆ Rp . Then S reduces ΣX if and only if
cov(PS X, QS X) = 0.
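The content of Lemma 2.1 is easy to check numerically. The following sketch uses arbitrary illustrative values of our own choosing: it constructs Σ_X so that S reduces it and verifies that P_S Σ_X Q_S = 0, and that the sample covariance between P_S X and Q_S X is correspondingly near zero.

import numpy as np

# Numerical illustration of Lemma 2.1: if S reduces Sigma_X, then
# cov(P_S X, Q_S X) = P_S Sigma_X Q_S = 0.
rng = np.random.default_rng(1)
p, q = 6, 2
Phi = np.linalg.qr(rng.standard_normal((p, q)))[0]                 # basis of S
Phi0 = np.linalg.svd(np.eye(p) - Phi @ Phi.T)[0][:, :p - q]        # basis of the complement
Delta = np.diag([5.0, 4.0])
Delta0 = np.diag([1.0, 0.8, 0.6, 0.4])
SigmaX = Phi @ Delta @ Phi.T + Phi0 @ Delta0 @ Phi0.T              # S reduces Sigma_X
P = Phi @ Phi.T
Q = np.eye(p) - P
print(np.allclose(P @ SigmaX @ Q, 0))                              # True: no linear association

# The same holds, up to sampling error, for sample covariances of projected data.
X = rng.multivariate_normal(np.zeros(p), SigmaX, size=2000)
C = np.cov(X @ P, X @ Q, rowvar=False)[:p, p:]
print(np.abs(C).max())                                             # near 0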

Replacing (2.2b) with cov{PS X, QS X} = 0 relaxes independence, requir-


ing only that there be no linear association between PS X and QS X. This mod-
ification somewhat weakens the strength of (2.2), but experience has shown
that it nevertheless leads to methodology that has the potential to outperform
standard methods. These arguments lead to the following tempered version of
(2.2) that is more amenable to methodological development and is equivalent
to (2.2) with normal predictors:

(a) Y ⫫ X | P_S X and (b) P_S Σ_X Q_S = 0.        (2.3)

By Definition 2.1, any subspace that satisfies (2.2) or (2.3) is a dimension


reduction subspace. We are now in a position to give a formal definition of
an envelope in the context of (2.3). This is the form that connects with PLS
regression.

Definition 2.2. Assume that SY |X ⊆ span(ΣX ). The predictor envelope for


the regression of Y on X is then the intersection of all dimension reduction
subspaces that reduce ΣX and contain SY |X . The predictor envelope is denoted
as EΣX (SY |X ) and represented as just EX for use in subscripts.

If the eigenvalues of ΣX are unique then PEX ΣX QEX = 0 holds when


EΣX (SY |X ) is spanned by any subset of q eigenvectors of ΣX . Further, if
dim(EΣX (SY |X )) = q then EΣX (SY |X ) is spanned by q eigenvectors of ΣX .
This implies that the eigenstructure of ΣX plays a population role in envelope
regressions. However, reduction by principal components has no special role:
ensuring that PEX ΣX QEX = 0 by selecting EΣX (SY |X ) to be the span of
the leading q eigenvectors of ΣX , which are often preferred, requires a leap
of faith that is not supported by these foundations. A useful takeaway from
this discussion is that the spectral structure of ΣX arises as a constituent of
envelope foundations because of requirement (2.3b), which was included to
enhance the distinction between material and immaterial predictors.
Ensuring that there is clarity between the material and immaterial pre-
dictors by imposing condition (2.3b) comes with a potential cost. Since
SY |X ⊆ EΣX (SY |X ), the envelope is an upper bound on the central sub-
space. In consequence, the number of reduced predictors PEX X dictated by
the envelope is at least as large as the reduced predictors given by central
subspace PSY |X X. In the extreme, it is possible to have regressions where
dim(SY |X ) = 1 while dim(EΣX (SY |X )) = p, although in our experience such
an occurrence is the exception and not the rule.
TABLE 2.1
Conceptual contrasts between three dimension reduction paradigms. ‘Con-
struction’ means that methodology attempts to select S to satisfy the con-
dition. ‘Assumption’ means that methodology requires the condition without
any attempt to ensure its suitability. ‘No constraint’ means that methodology
does not address the condition per se.

                                  Requirement
Paradigm                (a) Y ⫫ X | P_S X        (b) P_S X ⫫ Q_S X
SDR                     Construction             No constraints
Variable Selection      Construction             Assumption
Envelopes               Construction             Construction

2.1.3 Conceptual contrasts


Table 2.1 contrasts the conceptual basis for three dimension reduction
paradigms. The paradigms all use requirement (a) as stated in the table.
Requirement (b) is intended as a representative stand-in for various methods
that might be used to control the dependence between the material PS X and
immaterial QS X parts of X. For instance, condition (b) might be replaced by
cov{PS X, QS X} = 0, as discussed in the previous section.
Sufficient dimension reduction methods like sliced inverse regression (SIR;
Li, 1991), sliced average variance estimation (SAVE; Cook and Weisberg,
1991; Cook, 2000), likelihood acquired directions (LAD; Cook and Forzani,
2009) and many other SDR methods (Li, 2018) operate under requirement
(a) only. Under various technical conditions, they are designed to estimate
the central subspace SY |X . These methods pay no attention to condition (b),
PS X QS X. As a consequence, when there is high collinearity in X, a rela-
tively large sample may be needed to estimate SY |X accurately.
Variable selection is conceptually similar to SDR, except bases for S are
constrained to be subsets of the columns of Ip so PS projects only onto co-
ordinate axes, which has the effect of selecting variables from X. To ensure
consistency, variable selection methods control by assumption the depen-
dence between the selected variables P_S X and the eliminated variables Q_S X.
For instance, Goh and Dey (2019) studied asymptotic properties of marginal
least-square estimators in ultrahigh-dimensional linear regression models with

correlated errors. They assume in effect that n1/2 times the sample correlation
between the selected and eliminated predictors is bounded, which exerts firm
control over the dependence between PS X and QS X as p and n grow.
Envelopes handle the dependence between PS X and QS X by construction,
the selected subspaces S being required to satisfy both conditions (a) and (b)
of Table 2.1. This has the advantage of ensuring a crisp distinction between
material and immaterial parts of X, but it has the potential disadvantage of
leading to a larger subspace since SY |X ⊆ EΣX (SY |X ). Nevertheless, experi-
ence has indicated that this tradeoff is very often worthwhile, particularly in
model-based analyses and in high-dimensional settings where n ≪ p.

2.2 Predictor envelopes for the multivariate linear model


In the context of model (1.1) with stochastic predictors, E(X) = µX and
var(X) = ΣX > 0. Let B = span(β), the subspace of Rp spanned by the
columns of β. As stated along with model (1.1), normality is not required un-
less needed for specific methods like likelihood-based estimation. Since (1.1)
depends on X only via β TX, it follows immediately that SY |X = B and that
the envelope under Definition 2.2 becomes EΣX (SY |X ) = EΣX(B), the smallest
reducing subspace of ΣX that contains B. From Proposition 1.6, this envelope
can be expressed equivalently as EΣX(B) = EΣX (CX,Y ), the smallest reducing
subspace of ΣX that contains CX,Y = span(ΣX,Y ). The distinction between
these representations of the predictor envelope in the context of model (1.1),

EΣX (SY |X ) = EΣX(B) = EΣX (CX,Y ), (2.4)

will be useful when introducing response envelopes in Section 2.4.


We are now in a position to write the envelope version of (1.1). Let q =
dim{EΣX (B)}, let Φ ∈ Rp×q denote a semi-orthogonal basis matrix for EΣX (B)
and let (Φ, Φ0 ) ∈ Rp×p be an orthogonal matrix. Then we can represent
β = Φη, where η ∈ Rq×r gives the coordinates of β in terms of basis Φ. The
Y_i = α + η^T Φ^T X_i + ε_i,   i = 1, . . . , n,
Σ_X = Φ∆Φ^T + Φ_0 ∆_0 Φ_0^T,                                                   (2.5)

where ∆ = var(ΦT X) = ΦT ΣX Φ ∈ Rq×q is the covariance matrix of the


material components and ∆0 = var(ΦT0 X) = ΦT0 ΣX Φ0 ∈ R(p−q)×(p−q) is the

covariance matrix of the immaterial components. No relationship is assumed


between the eigenvalues of ∆ and those of ∆0 . Conditions (2.3) are satisfied
straightforwardly under this model: Y ⫫ X | Φ^T X and cov(Φ^T X, Φ_0^T X) = 0.
The number of free real parameters in this model is the sum of r for α, rq
for η, q(p − q) for EΣX (B), q(q + 1)/2 for ∆, (p − q)(p − q + 1)/2 for ∆0 and
r(r + 1)/2 for ΣY |X , giving a total of

Nq = r + rq + p(p + 1)/2 + r(r + 1)/2.

This amounts to a reduction of Np − Nq = r(p − q) parameters over the


standard model.
The constituent parameters η, Φ, ∆ and ∆0 in model (2.5) are
not identifiable. For instance, for any conforming orthogonal matrix O,
Φη = (ΦO)(OT η) = Φ∗ η∗ and, in consequence, Φ and η are not identifiable.
However, the key parameters – the envelope EΣX(B) = span(Φ), β = Φη and
ΣX – are all identifiable and the model itself is identifiable. These parameters
are the ones that are typically important and the same as those in the base
multivariate linear model (1.1). The lack of identifiability of the constituent
parameters will be of little consequence in application. The eigenvalues of ∆
and ∆0 are also identifiable, which can aid an assessment of a particular fit:
Model (2.5) will give the greatest efficiency gains when the eigenvalues of ∆
are mostly larger than those of ∆0 as described in the following discussion of
(2.8). This structure is a partial consequence of incorporating a subspace as
an unknown parameter. A similar structure holds for all of the envelope and
PLS methods for reducing the predictors described in this book.
If the envelope EΣX(B) were known then, according to (2.5), we could
pick any basis matrix Φ and fit the linear regression of Y on the compressed
predictors ΦTX, giving the estimated coefficient matrix βbY |ΦTX from the mul-
tivariate regression of Y on ΦTX. The maximum likelihood envelope estimator
βbΦ of β with known basis Φ is then

βbΦ = ΦβbY |ΦTX . (2.6)

Asymptotic variances can provide insights into the potential gain offered by
the envelope estimator: Repeating (1.18) for ease of comparison,

avar(√n vec(β̂_ols)) = Σ_{Y|X} ⊗ Σ_X^{-1},
avar(√n vec(β̂_Φ)) = Σ_{Y|X} ⊗ Φ∆^{-1}Φ^T,                                      (2.7)

where avar(√n vec(β̂_Φ)) is based on Cook (2018, Section 4.1.3). This gives a
difference of

Σ_{Y|X} ⊗ (Σ_X^{-1} − Φ∆^{-1}Φ^T) = Σ_{Y|X} ⊗ Φ_0 ∆_0^{-1} Φ_0^T,

which indicates that the difference in asymptotic variances will be large when
the variability of the immaterial predictors ∆0 = var(ΦT0 X) is small relative
to ∆ = var(ΦT X). This can be seen also from the ratio of variance traces:
tr{Σ_{Y|X} ⊗ Φ∆^{-1}Φ^T} / tr{Σ_{Y|X} ⊗ Σ_X^{-1}} = tr{∆^{-1}} / [tr{∆^{-1}} + tr{∆_0^{-1}}].      (2.8)
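The trace ratio in (2.8) is easy to verify numerically using tr(A ⊗ B) = tr(A) tr(B). The sketch below uses illustrative values of ∆ and ∆_0 of our own choosing; it also shows that the ratio is small, and hence the potential gain is large, when ∆_0 is small relative to ∆.

import numpy as np

# Numerical check of the trace ratio (2.8), using tr(A kron B) = tr(A) tr(B).
rng = np.random.default_rng(2)
p, q, r = 6, 2, 3
Phi = np.linalg.qr(rng.standard_normal((p, q)))[0]
Phi0 = np.linalg.svd(np.eye(p) - Phi @ Phi.T)[0][:, :p - q]
Delta = np.diag([6.0, 4.0])                     # variance of the material part
Delta0 = np.diag([0.5, 0.4, 0.3, 0.2])          # variance of the immaterial part
SigmaX = Phi @ Delta @ Phi.T + Phi0 @ Delta0 @ Phi0.T
SigmaYgX = np.eye(r)                            # any positive definite Sigma_{Y|X}

num = np.trace(np.kron(SigmaYgX, Phi @ np.linalg.inv(Delta) @ Phi.T))
den = np.trace(np.kron(SigmaYgX, np.linalg.inv(SigmaX)))
ratio = np.trace(np.linalg.inv(Delta)) / (
    np.trace(np.linalg.inv(Delta)) + np.trace(np.linalg.inv(Delta0)))
print(np.isclose(num / den, ratio), num / den)  # True, and a small ratio here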

2.3 Likelihood estimation of predictor envelopes


Following Cook, Helland, and Su (2013), we develop in this section maximum
likelihood estimation of the parameters in model (2.5). We assume throughout
that the dimension q of the envelope EΣX(B) is known.
As mentioned at the outset of Section 1.2, it is traditional in regression to
base estimation on the conditional likelihood from the distribution of Y given
X, treating the predictors as fixed even if they may have been randomly sam-
pled. This practice arose because in many regressions the predictors provide
only ancillary information and consequently estimation and inference should
be conditioned on their observed values. (See Aldrich (2005) for a review and
a historical perspective.) In contrast, PLS and the likelihood-based envelope
method developed in this section both postulate a link – represented here by
the envelope EΣX(B) – between β, the parameter of interest, and ΣX . As a
consequence, X is not ancillary and we pursue estimation through the joint
distribution of Y and X.

2.3.1 Likelihood-inspired estimators


Let C = (X T , Y T )T denote the random vector constructed by concatenating
X and Y , and let SC denote the sample version of ΣC = var(C). Given
q, we base estimation on the objective function F_q(S_C, Σ_C) = log|Σ_C| +
tr(S_C Σ_C^{-1}) that stems from the log likelihood of the multivariate normal family
after replacing the population mean vector with the vector of sample means,
although here we do not require C to have a multivariate normal distribution.

Rather, we are using Fq as a multi-purpose objective function in the same


spirit as least squares objective functions are often used.
The structure of the envelope EΣX (B) can be introduced into Fq by us-
ing the parameterizations ΣX = Φ∆ΦT + Φ0 ∆0 ΦT0 and ΣX,Y = Φγ, where
Φ ∈ Rp×q is a semi-orthogonal basis matrix for EΣX(B) = EΣX (CX,Y ),
(Φ, Φ0 ) ∈ Rp×p is an orthogonal matrix, and ∆ ∈ Rq×q and ∆0 ∈ R(p−q)×(p−q)
are symmetric positive definite matrices, as defined previously for model (2.5).
From Proposition 1.6 or (2.4), CX,Y = span(ΣX,Y ) ⊆ EΣX (B), so we wrote
ΣX,Y as linear combinations of the columns of Φ. The matrix γ ∈ Rq×r
then gives the coordinates of ΣX,Y in terms of the basis Φ. With this we
have
        ( Σ_X          Σ_{X,Y} )     ( Φ∆Φ^T + Φ_0∆_0Φ_0^T    Φγ  )
Σ_C  =  (                      )  =  (                             )
        ( Σ_{X,Y}^T    Σ_Y     )     ( γ^T Φ^T                Σ_Y )

     =  O Σ_{O^T C} O^T,                                                       (2.9)

where

        ( Φ_0   Φ   0   )
   O =  (               )  ∈  R^{(p+r)×(p+r)}
        ( 0     0   I_r )

is an orthogonal matrix and

               ( ∆_0   0     0   )
 Σ_{O^T C}  =  ( 0     ∆     γ   )  ∈  R^{(p+r)×(p+r)}
               ( 0     γ^T   Σ_Y )

is the covariance matrix of the transformed vector OT C. The objective func-


tion Fq (SC , ΣC ) can now be regarded as a function of the five constituent
parameters – Φ, ∆, ∆0 , γ and ΣY – that comprise ΣC . The parameters β and
η of model (2.5) can be written as functions of those parameters: η = ∆−1 γ
and β = Φη = Φ∆−1 γ.
To minimize Fq (SC , ΣC ) we first hold Φ fixed and substitute (2.9) giving

F_q(S_C, Σ_C) = log|O Σ_{O^T C} O^T| + tr(O^T S_C O Σ_{O^T C}^{-1})
             = log|Σ_{O^T C}| + tr(S_{O^T C} Σ_{O^T C}^{-1}).

We next write Fq in terms of addends that can be minimized separately. The


sample variance of (ΦTX, Y ) is
                 ( Φ^T S_X Φ     Φ^T S_{X,Y} )
S_{(Φ^T X, Y)} = (                           ).
                 ( S_{Y,X} Φ     S_Y         )

Let

        ( Φ    0   )                    ( ∆      γ   )
 O_2 =  (          ),    Σ_{O_2^T C} =  (            ).
        ( 0    I_r )                    ( γ^T    Σ_Y )

Then we have that

             ( ∆_0    0            )
 Σ_{O^T C} = (                     ),
             ( 0      Σ_{O_2^T C}  )

and

F_q(S_C, Σ_C) = log|∆_0| + tr(Φ_0^T S_X Φ_0 ∆_0^{-1}) + log|Σ_{O_2^T C}| + tr(S_{(Φ^T X, Y)} Σ_{O_2^T C}^{-1}).

The values of the parameters that minimize Fq for fixed Φ are ΣY = SY ,


∆ = ΦT SX Φ, ∆0 = ΦT0 SX Φ0 and γ = ΦT SX,Y . Substituting these forms into
Fq then leads to the following estimator of the envelope when q is assumed
known: using G as an optimization argument to avoid confusion with the
unknown parameter Φ,

Ê_{Σ_X}(B) = span{arg min_G L_q(G)},  where                                    (2.10)

L_q(G) = log|G^T S_{X|Y} G| + log|G^T S_X^{-1} G|
       = log|G^T (S_X − S_{X,Z} S_{X,Z}^T) G| + log|G^T S_X^{-1} G|,            (2.11)

Z = S_Y^{-1/2} Y is the standardized response vector, the minimization in (2.10)
is taken over all semi-orthogonal matrices G ∈ R^{p×q} and we have used
Lemma 1.3 (I),

|G_0^T S_X G_0| = |S_X| × |G^T S_X^{-1} G|,

in the derivation of (2.11). Let Φ̂ be any semi-orthogonal basis of Ê_{Σ_X}(B).
The objective function L_q(G) is an instance of the objective function for the
sample version of Algorithm L introduced in Section 1.5.5. The connection is
obtained by setting M = S_{X|Y} and A = S_{Y∘X}. Then from (1.8) to (1.10) we
have S_X = S_{X|Y} + S_{Y∘X} = M + A, so

log|G^T M G| + log|G^T (M + A)^{-1} G| = log|G^T S_{X|Y} G| + log|G^T S_X^{-1} G|.
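For concreteness, a minimal numpy sketch of the sample objective L_q(G) in (2.11) follows. The function name is ours, the data are centered internally, and no attempt is made to carry out the optimization over semi-orthogonal matrices, which is the hard part in practice.

import numpy as np

def envelope_objective(G, X, Y):
    """Sample objective L_q(G) of (2.11):  log|G' S_{X|Y} G| + log|G' S_X^{-1} G|.

    X is n x p, Y is n x r, G is p x q semi-orthogonal.  Sketch only.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    SX = Xc.T @ Xc / n
    SXY = Xc.T @ Yc / n
    SY = Yc.T @ Yc / n
    SXgY = SX - SXY @ np.linalg.solve(SY, SXY.T)          # S_{X|Y}
    t1 = np.linalg.slogdet(G.T @ SXgY @ G)[1]
    t2 = np.linalg.slogdet(G.T @ np.linalg.inv(SX) @ G)[1]
    return t1 + t2

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
Y = X[:, :2] @ rng.standard_normal((2, 3)) + 0.1 * rng.standard_normal((200, 3))
G = np.linalg.qr(rng.standard_normal((5, 2)))[0]          # an arbitrary candidate basis
print(envelope_objective(G, X, Y))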

The estimators of the constituent parameters, which are maximum likeli-


Σ̂_Y = S_Y,
∆̂ = Φ̂^T S_X Φ̂,
∆̂_0 = Φ̂_0^T S_X Φ̂_0,
γ̂ = Φ̂^T S_{X,Y},
η̂ = ∆̂^{-1} γ̂.

From these we construct the estimators of the parameters of interest:

Σ̂_{X,Y} = P_Φ̂ S_{X,Y},
Σ̂_X = Φ̂ ∆̂ Φ̂^T + Φ̂_0 ∆̂_0 Φ̂_0^T = P_Φ̂ S_X P_Φ̂ + Q_Φ̂ S_X Q_Φ̂,
Σ̂_{Y|X} = S_Y − S_{Y,X} Φ̂ (Φ̂^T S_X Φ̂)^{-1} Φ̂^T S_{X,Y},
β̂ = Φ̂ ∆̂^{-1} γ̂ = Φ̂ (Φ̂^T S_X Φ̂)^{-1} Φ̂^T S_{X,Y} = P_Φ̂(S_X) β̂_ols,           (2.12)

where β̂_ols = S_X^{-1} S_{X,Y} is the ordinary least squares estimator of β and
P_Φ̂(S_X) denotes the projection onto span(Φ̂) in the S_X inner product. The es-
timators ∆̂, ∆̂_0 and γ̂ depend on the selected basis Φ̂. The parameters of
interest – Σ̂_{X,Y}, Σ̂_X and β̂ – depend on Ê_{Σ_X}(B) but do not depend on the
particular basis selected.
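The estimators in (2.12) are simple to assemble once a basis estimate Φ̂ is available, however it was obtained (likelihood optimization, PLS, or otherwise). The sketch below (function name ours) forms β̂ = Φ̂∆̂^{-1}γ̂ and confirms numerically that it coincides with P_Φ̂(S_X) β̂_ols.

import numpy as np

def envelope_beta(Phi_hat, X, Y):
    """Plug-in estimators of (2.12) for a given semi-orthogonal basis estimate Phi_hat."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    SX, SXY = Xc.T @ Xc / n, Xc.T @ Yc / n
    Delta_hat = Phi_hat.T @ SX @ Phi_hat                     # hat Delta
    gamma_hat = Phi_hat.T @ SXY                              # hat gamma
    beta_hat = Phi_hat @ np.linalg.solve(Delta_hat, gamma_hat)
    # Equivalent form: projection of beta_ols onto span(Phi_hat) in the S_X inner product.
    beta_ols = np.linalg.solve(SX, SXY)
    P_SX = Phi_hat @ np.linalg.solve(Delta_hat, Phi_hat.T @ SX)
    assert np.allclose(beta_hat, P_SX @ beta_ols)
    return beta_hat

rng = np.random.default_rng(5)
X = rng.standard_normal((150, 6))
Y = X[:, :2] @ rng.standard_normal((2, 2)) + 0.2 * rng.standard_normal((150, 2))
Phi_hat = np.linalg.qr(rng.standard_normal((6, 2)))[0]       # stand-in for an estimated basis
print(envelope_beta(Phi_hat, X, Y).shape)                    # (6, 2)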


The following proposition (Cook et al., 2013, Prop. 9) gives the asymptotic
distribution of βb when C is normally distributed. In preparation, let βbΦ and
βbη denote respectively the maximum likelihood estimators of β when Φ and
η are known, and let † denote the Moore-Penrose matrix inverse.

Proposition 2.1. Assume that q is known and that C is normally distributed
with mean µ_C and covariance matrix Σ_C > 0. Then, under model (2.5),
√n{vec(β̂) − vec(β)} converges in distribution to a normal random vector
with mean 0 and covariance matrix

avar{√n vec(β̂)} = avar{√n vec(β̂_Φ)} + avar{√n vec(Q_Φ β̂_η)}
                 = Σ_{Y|X} ⊗ Φ∆^{-1}Φ^T + (η^T ⊗ Φ_0) M^† (η ⊗ Φ_0^T),

where M = ηΣ_{Y|X}^{-1}η^T ⊗ ∆_0 + ∆ ⊗ ∆_0^{-1} + ∆^{-1} ⊗ ∆_0 − 2I_q ⊗ I_{p−q}. Additionally,
T_q = n(F(S_C, Σ̂_C) − F(S_C, S_C)) converges to a chi-squared random variable
with (p − q)r degrees of freedom, and

avar{√n vec(β̂)} ≤ avar{√n vec(β̂_ols)}.

The first addend avar{√n vec(β̂_Φ)} = Σ_{Y|X} ⊗ Φ∆^{-1}Φ^T on the right hand
side of the asymptotic variance is the asymptotic variance of the maximum like-
lihood estimator of β when a basis Φ for the envelope is known, as dis-
cussed previously around (2.7). The second addend avar{√n vec(Q_Φ β̂_η)} =
(η^T ⊗ Φ_0) M^† (η ⊗ Φ_0^T) can then be interpreted as the cost of estimating the
envelope. The final statement of the proposition is that the difference between
the asymptotic variance of the ordinary least squares estimator and that for
the envelope estimator is always positive semi-definite.
The following corollary, also from Cook et al. (2013), gives sufficient con-
ditions for the envelope and OLS estimators to be asymptotically equivalent.

Corollary 2.1. Under the conditions of Proposition 2.1, if Σ_X = σ_x^2 I_p and
β ∈ R^{p×r} has rank r, then

avar{√n vec(β̂)} = avar{√n vec(β̂_ols)}.

Accordingly, if there is no collinearity among predictors of equal variability


then the OLS and envelope estimators are equivalent asymptotically. This
finding does not rule out the possibility that envelope estimators may still be
superior to OLS in non-asymptotic settings.
Proposition 2.1 can be used to construct an asymptotic standard error
for (β̂)_ij, i = 1, . . . , p, j = 1, . . . , r, by first substituting estimates for the un-
known quantities on the right side of avar{√n vec(β̂)} to obtain an estimated
asymptotic variance âvar{√n vec(β̂)}. The estimated asymptotic variance
âvar{√n (β̂)_ij} is then the corresponding diagonal element of âvar{√n vec(β̂)},
and its asymptotic standard error is

se{(β̂)_ij} = [âvar{√n (β̂)_ij}]^{1/2} / √n.                                    (2.13)

If C_i, i = 1, . . . , n, are independent copies of C, which is not necessarily
normal but has finite fourth moments, then √n{vec(β̂) − vec(β)} converges
in distribution to a normal random vector with mean 0 and positive definite
covariance matrix avar{√n vec(β̂)} that depends on fourth moments of C and
is quite complicated (Cook et al., 2013, Prop. 8). The residual bootstrap is a
useful option for dealing with variances in this case.
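A hedged sketch of the residual bootstrap mentioned above follows; fit_beta is a placeholder for whatever coefficient estimator is of interest (OLS, envelope, or PLS), and the OLS fit shown is only to make the example self-contained.

import numpy as np

def residual_bootstrap(X, Y, fit_beta, B=500, seed=0):
    """Residual bootstrap covariance for a coefficient estimator beta = fit_beta(X, Y).

    fit_beta is a placeholder for any p x r estimator (OLS, envelope, PLS, ...).
    Sketch only; assumes the linear model (1.1) with fixed predictors.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    beta = fit_beta(X, Y)
    fitted = Y.mean(axis=0) + Xc @ beta
    resid = Y - fitted
    boots = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # resample residuals with replacement
        Yb = fitted + resid[idx]
        boots.append(fit_beta(X, Yb).ravel(order="F"))    # vec() by column stacking
    return np.cov(np.asarray(boots), rowvar=False)        # bootstrap cov of vec(beta)

def ols(X, Y):
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    return np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)

rng0 = np.random.default_rng(6)
X = rng0.standard_normal((100, 4))
Y = X @ rng0.standard_normal((4, 2)) + rng0.standard_normal((100, 2))
print(residual_bootstrap(X, Y, ols, B=200).shape)          # (8, 8) = (p*r, p*r)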

2.3.2 PLS connection


As developed in Chapter 3, the PLS regression algorithms NIPALS and SIM-

PLS in Tables 3.1 and 3.4 both provide √n-consistent estimators of a basis for
the predictor envelope EΣX(B) when n → ∞ with p and q fixed. And they do
so subject to (2.3) (Cook, 2018; Cook et al., 2013). These algorithms proceed
sequentially using one-at-a-time estimation of the basis vectors, often called
weights in the PLS literature. The dimension q of the envelope corresponds to
the number of components in common PLS terminology. The estimated basis
for EΣX(B) is then used to estimate β. A key point here is that, like other
envelope methods, PLS regression stands apart from much of the regression
and dimension reduction literature by using both of the conditions (2.3a) and
(2.3b) to guide the reduction.
Model (2.5) provides for maximum likelihood estimation of the parameters
estimated by PLS algorithms, along with estimation of ΣX and ΣY |X . Since
the maximum likelihood estimators inherit optimality properties from general
likelihood theory, they must be superior to the PLS estimators as n → ∞ with
r and p fixed as long as the model holds. However, the PLS estimators are
serviceable when n < p while the maximum likelihood estimators are not, and
PLS dimension reduction is serviceable also under certain model deviations
(Chapter 9).
The objective function Lq (G) in (2.11) is non-convex and consequently
there is always a chance that an optimization algorithm will get caught in a
local minimum. On the other hand, computation of the PLS estimator will
be seen in Chapter 3 to be relatively straightforward. These contrasts suggest
that the maximum likelihood and PLS estimators can play somewhat different
roles in application.
Envelope and PLS estimators are compared in Chapter 3.

2.4 Response envelopes, model-free and model-based


So far, we have focused on predictor envelopes guided first by conditions (2.2)
and then by conditions (2.3) and finally by model (2.5). We turn to response
envelopes in this section, guided by corresponding foundations. Although PLS
methods seem concerned mostly with predictor reduction, response reduction

is useful in some applications as well, including PLS for discrimination dis-


cussed in Chapter 7 and the Nadler-Coifman chemometrics regression dis-
cussed in Section 2.6. The inverse regression of X on Y is used in these and
related applications. There are strong parallels between response and predic-
tor envelopes, and we make use of them to give a relatively brief exposition
on response envelopes in this section. Simultaneous reduction of the predic-
tors and the responses is discussed in Chapter 5. Response envelopes were
proposed and developed by Cook et al. (2010).
The overarching idea for response envelopes is to separate with clarity the
material part of Y , whose distribution is affected by changing X, from the
immaterial part of Y whose distribution is unaffected by changing X. Sim-
ilar to predictor envelopes, this separation is accomplished by defining (and
subsequently estimating) the unique smallest subspace E of the r-dimensional
response space so that the projection of the vector of responses Y onto E holds
the part of Y that is affected by changing X, while the projection onto the or-
thogonal complement of E holds the part of Y that is unaffected by X. These
aspirations are described formally by requiring E to be the smallest subspace
of the response space that satisfies the following two conditions, which are the
counterparts to condition (2.2) for predictor envelopes.

(a) Q_E Y ⫫ X and (b) P_E Y ⫫ Q_E Y | X ⇐⇒ (c) (P_E Y, X) ⫫ Q_E Y.        (2.14)

Condition (a) says that QE Y is marginally independent of X, so that in the


regression context the distribution of QE Y is unaffected by changing X. The
complementary part of Y , PE Y , then represents the part of Y that is affected
by X. Condition (b) formalizes the relationship between QE Y and PE Y : PE Y
and QE Y are conditionally independent given X, so QE Y cannot impact PE Y
through an association. Conditions (a) and (b) together are equivalent to
condition (c): PE Y and X are jointly independent of QE Y . The distribution
of QE Y then represents a static component of the regression.
Further, condition (2.14c) is the same as condition (2.2c) with the roles
of Y and X interchanged. This means that the discussion of Sections 2.1 and
2.2 hold also for response envelopes by interchanging X and Y . The similarity
with (2.2c) can be seen also by expressing (2.14) as

(a) X ⫫ Y | P_E Y and (b) P_E Y ⫫ Q_E Y,

which is a dual of (2.2) obtained by interchanging X and Y .



Let SX|Y denote the central subspace for the regression of X on Y . Then,
in parallel to Definition 2.2, we have

Definition 2.3. Assume that SX|Y ⊆ span(ΣY ). The response envelope for
the regression of Y on X is then the intersection of all reducing subspaces of
ΣY that contain SX|Y , and is denoted as EΣY (SX|Y ), which is represented as
EY for use in subscripts.

Comparing Definitions 2.2 and 2.3 we see that response and predictor
envelopes are dual constructions when X and Y are jointly distributed, as
summarized in Table 2.2. The predictor envelope EΣX (SY |X ) for the regression
of Y on X is also the response envelope for the regression of X on Y , and
the response envelope EΣY (SX|Y ) for the regression of Y on X is also the
predictor envelope for the regression of X on Y . These equivalences in model-
free envelope construction are shown in Table 2.2(A).

TABLE 2.2
Relationship between predictor and response envelopes when X and Y are
jointly distributed: C_{X,Y} = span(Σ_{X,Y}) and C_{Y,X} = span(Σ_{Y,X}). B =
span(β_{Y|X}) and B′ = span(β_{Y|X}^T) are as used previously for model (1.1). B_∗
and B_∗′ are defined similarly but in terms of the regression of X | Y. P-env
and R-env stand for predictor and response envelope.

(A) Central subspaces
E_{Σ_X}(S_{Y|X})                                 E_{Σ_Y}(S_{X|Y})
Predictor envelope for Y | X                     Predictor envelope for X | Y
Response envelope for X | Y                      Response envelope for Y | X

(B) Multivariate linear models
Y = α + β_{Y|X}^T X + ε                          X = µ + β_{X|Y}^T Y + ε_∗
B = span(β_{Y|X}), B′ = span(β_{Y|X}^T)          B_∗ = span(β_{X|Y}), B_∗′ = span(β_{X|Y}^T)
P-env for Y|X: E_{Σ_X}(B) = E_{Σ_X}(C_{X,Y})     R-env for X|Y: E_{Σ_X}(B_∗′) = E_{Σ_X}(C_{X,Y})
R-env for Y|X: E_{Σ_Y}(B′) = E_{Σ_Y}(C_{Y,X})    P-env for X|Y: E_{Σ_Y}(B_∗) = E_{Σ_Y}(C_{Y,X})
R-env for Y|X: E_{Σ_Y}(B′) = E_{Σ_{Y|X}}(B′)     R-env for X|Y: E_{Σ_X}(B_∗′) = E_{Σ_{X|Y}}(B_∗′)

Table 2.2(B) shows envelope relationships in the context of multivariate


linear models. From there we see that the predictor envelope in the regres-
sion of Y on X – EΣX (CX,Y )– is the same as the response envelope for the
linear regression of X on Y – again, EΣX (CX,Y ): using Proposition 1.8 and the

discussion that follows,

E_{Σ_X}(B) = E_{Σ_X}(span(Σ_X^{-1} Σ_{X,Y})) = E_{Σ_X}(C_{X,Y}).

Similarly, the response envelope for the linear regression of Y on X – EΣY (B 0 ),


where B 0 = span(β T ) as defined near (1.26) – is the same as the predictor
envelope for the linear regression of X on Y :

EΣY (B 0 ) = EΣY (CY,X ). (2.15)

The envelope forms EΣX (CX,Y ) and EΣY (CY,X ) reflect the duality of re-
sponse and predictor envelopes when X and Y are jointly distributed, while
forms EΣX(B) and EΣY (B 0 ) are in terms of model (1.1) to facilitate re-
parametrization.
So far, we have required that X and Y be jointly distributed. This implies,
for example, that PLS predictor reduction would not normally be useful in
designed experiments. For instance, if predictor values are selected to follow
a 2^k factorial or if center points are included to allow estimation of quadratic
effects, there seems no reason to suspect that the predictor distribution carries
useful information about the responses. However, response reduction may still
be of value.
The conditions (a) and (b) in (2.14) can be stated alternatively as

(a) For all x_1, x_2:  F_{Q_E Y | X=x_1}(z) = F_{Q_E Y | X=x_2}(z),
(b) For all x:  P_E Y ⫫ Q_E Y | X = x,                                         (2.16)

where FQE Y |X=x (z) denotes the conditional CDF of QE Y | X = x. These


conditions do not require X to be stochastic. In consequence, X is allowed to
be ancillary in PLS and envelope response reduction and, using (1.26), the
only envelope that is relevant from Table 2.2 is

EΣY (B 0 ) = EΣY |X (B 0 ).

This relation allows us to parameterize the multivariate linear model (1.1) by


thinking in terms of either a basis for EΣY (B 0 ) or a basis for EΣY |X (B 0 ). In the
next section we use EΣY |X (B 0 ) as our guide since that seems most convenient
in terms of model (1.1).

2.5 Response envelopes for the multivariate linear model


We are now in a position to parameterize model (1.1) in terms of the response
envelope EΣY |X (B 0 ). We first introduce the model for response envelopes and
then discuss background motivation that arises from longitudinal data.

2.5.1 Response envelope model


Let u denote the dimension of the response envelope, let Γ be an r × u semi-
orthogonal matrix whose columns form a basis for it, and let (Γ, Γ0 ) be an
orthogonal matrix. Then, we have the response envelope model
Y_i = α + Γη_1 X_i + ε_i,   i = 1, . . . , n,
Σ_{Y|X} = ΓΩΓ^T + Γ_0 Ω_0 Γ_0^T,                                               (2.17)

where Ω and Ω0 are u × u and (r − u) × (r − u) positive definite matrices and


the errors are independent N (0, ΣY |X ) random vectors. The coordinates η1 are
subscripted here to distinguish them from the coordinates used in predictor
envelopes (2.5). The predictors are ancillary under this model and thus are
treated as non-stochastic. The parameterization β T = Γη1 in (2.17) arises
because we must have span(β T ) ⊆ EΣY |X (B 0 ), so the columns of η1 give the
coordinates of the columns of β T in terms of basis Γ. The likelihood stemming
from (2.17) can be maximized to obtain envelope estimators of β and ΣY |X
(Cook et al., 2010; Cook, 2018).
That model (2.17) satisfies conditions (2.16) can be seen as follows. Condi-
tion (2.14a) stipulates that the conditional distribution of ΓT0 Y | X must not
depend on the value of X. This follows because the distribution of ΓT0 Y | X
is characterized as ΓT0 Y = ΓT0 α + ΓT0 ε, where ΓT0 ε ∼ Nr−u (0, Ω0 ). Condition
(2.16b) requires that ΓT Y be independent of ΓT0 Y given X. This follows from
the normality of the errors and ΓT ΣY |X Γ0 = 0.
If the envelope EΣY |X (B 0 ) were known then, according to (2.17), we could
pick any basis matrix Γ and fit the linear regression of the reduced response
ΓT Y on predictors X, giving the estimated coefficient matrix βbΓT Y |X from
the multivariate regression of ΓT Y on X. The maximum likelihood envelope
estimator βbΓ of β with known basis Γ is then

β̂_Γ = β̂_{Γ^T Y|X} Γ^T = β̂_ols P_Γ.                                            (2.18)
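In code, the known-basis estimator (2.18) is a one-line post-processing of the OLS fit. The sketch below (function name ours) assumes Γ is semi-orthogonal.

import numpy as np

def beta_given_Gamma(X, Y, Gamma):
    """Response-envelope estimator (2.18) with a known basis Gamma: beta_Gamma = beta_ols P_Gamma."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)     # p x r OLS coefficient matrix
    P_Gamma = Gamma @ Gamma.T                            # r x r projection onto span(Gamma)
    return beta_ols @ P_Gamma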



Variances can provide insights into the potential gain offered by the envelope
estimator:

var(vec(β̂_Γ)) = var(vec(β̂_ols P_Γ))
             = (P_Γ ⊗ I_p) var(vec(β̂_ols)) (P_Γ ⊗ I_p)
             = n^{-1} (P_Γ ⊗ I_p)(Σ_{Y|X} ⊗ S_X^{-1})(P_Γ ⊗ I_p)
             = n^{-1} ΓΩΓ^T ⊗ S_X^{-1},

giving a difference with the ordinary least squares estimator of

var(vec(β̂_ols)) − var(vec(β̂_Γ)) ≥ n^{-1} Γ_0 Ω_0 Γ_0^T ⊗ S_X^{-1} ≥ 0.

From this we conclude that if Γ is known and if var(ΓT0 Y ) = Ω0 is large relative


to var(ΓT Y ) = Ω then the gains from the envelope model can be substantial.
While predicated on Γ being known, this results provides useful intuition in to
the more common setting where Γ is unknown. Using the spectral norm k · k
as a measure of overall size, the envelope model will be advantageous when
kΩk  kΩ0 k. The variance of the maximum likelihood estimator of β under
model (2.17) is given in Proposition 2.2.
Recall from the discussion at the end of Section 2.2 that predictor envelopes
can be expected to reduce the estimative and predictive variation substan-
tially when the variability of the immaterial predictors ∆0 = var(ΦT0 X) is
small relative to ∆ = var(ΦT X). But here we have found that the gains from
using response envelopes can be substantial when variability of the immate-
rial responses Ω0 = var(ΓT0 Y ) is large relative to the variation in the material
responses Ω = var(ΓT Y ).

2.5.2 Relationship to models for longitudinal data


Models of the form (2.17) with known Γ, not necessarily semi-orthogonal,
occur in regressions where the elements of Yi ∈ Rr are r longitudinal mea-
surements on the i-th subject. For example, suppose that we are studying the
effects of a treatment and it is known that a subject's expected response to
the treatment is quadratic over time. Perhaps the quadratic is concave and
we are ultimately interested in estimating the point at which the response is

maximized. In this case, Γ ∈ R^{r×3} and it can be set as

    ( 1   t_1   t_1^2 )
Γ = ( 1   t_2   t_2^2 )
    ( ⋮    ⋮     ⋮    )
    ( 1   t_r   t_r^2 ),

where tj is the time at which the j-th measurement is taken. The model in
this case is simply Yi = Γα + εi , where α = (α0 , α1 , α2 )T ∈ R3 contains the
coefficients of the quadratic response. For instance, the expected responses at
times t1 and tr are α0 + α1 t1 + α2 t21 and α0 + α1 tr + α2 t2r , respectively.
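Constructing this known quadratic-trend basis is straightforward. The sketch below uses the measurement weeks of the cattle example discussed later in this section; the coefficient values are purely illustrative choices of ours.

import numpy as np

# Known quadratic-trend basis for r longitudinal measurements at times t_1, ..., t_r.
t = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 19], dtype=float)   # weeks, as in the cattle example
Gamma = np.column_stack([np.ones_like(t), t, t ** 2])              # r x 3, columns (1, t, t^2)

# Expected response profile for quadratic coefficients alpha = (a0, a1, a2):
alpha = np.array([1.0, 0.5, -0.02])                                # illustrative values only
profile = Gamma @ alpha                                            # E(Y) = Gamma alpha
print(profile.shape)                                               # (10,)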
Expanding the context of the illustration, suppose now that we are com-
paring two treatments indicated by X = 0, 1 and that the effect of a treatment
is to alter the coefficients of the quadratic. Let the columns of the matrix
    ( α_00    α_01 )
A = ( α_10    α_11 ) = (α_•0, α_•1)
    ( α_20    α_21 )

represent the coefficients of the quadratic response for X = 0 and X = 1, and


let δ = α•1 − α•0 . Then the model allowing for the two treatments can be
written
Y_i = ΓA (1 − X, X)^T + ε
    = Γα_•0 + ΓδX + ε.                                                         (2.19)

This model is now in the form of the response envelope model (2.17), except
that Γ is known. Depending on the context, variations of this model may be
appropriate. For example, allowing for an unconstrained intercept, the model
becomes

Yi = δ0 + ΓδX + ε.

If t1 = 0 corresponds to a basal measurement prior to treatment application


then we know that the first element of δ is 0.
Kenward (1987) reported on an experiment to compare two treatments for
the control of gut worm in cattle. The treatments were each randomly assigned
to 30 cows whose weights were measured at 2, 4, 6, . . . , 18 and 19 weeks after
treatment. The goal of the experiment was to see if a differential treatment

effect could be detected and, if so, estimate the time point when the difference
was first manifested. Judging from the plot of average weight by treatment
and time (Cook, 2018, Fig. 1.5), it is not clear what functional form could be
used to describe the weight profiles. In such cases, treating Γ as unknown and
adopting a response envelope model (2.17) may be a viable analysis strategy.
Details of such an analysis are available from Cook (2018). PLS and maximum
likelihood envelope estimators of the parameters in models of the form given
in (2.19) are discussed in Section 6.4.4.
We next turn to maximum likelihood estimation of the parameters in re-
sponse envelope (2.17).

2.5.3 Likelihood-based estimation of response envelopes


Following Cook, Li, and Chiaromonte (2010), we develop in this section max-
imum likelihood estimation of the parameters in model (2.17). We assume
throughout this section that the dimension u of the envelope EΣY |X (B 0 ) is
known. The connection with PLS algorithms is discussed in Section 2.5.6.
In our derivation of maximum likelihood estimators for predictor reduction
via model (2.5), we assumed that X and Y have a joint distribution, taken to
be normal for the purpose of deriving the asymptotic distribution of β̂. How-
ever, the predictors are ancillary under model (2.17) for response
reduction and, in consequence, we condition on the predictors, treating them
as known constants in our development of the maximum likelihood estimators.
In asymptotic considerations, recall from Section 1.2.1 that in this setting we
define ΣX = limn→∞ SX > 0.
The log likelihood L(α, β, ΣY |X ) for model (1.1) is

L = −(nr/2) log(2π) − (n/2) log|Σ_{Y|X}|
    − (1/2) ∑_{i=1}^n (Y_i − α − β^T X_i)^T Σ_{Y|X}^{-1} (Y_i − α − β^T X_i).

Substituting the parametric forms from envelope model (2.17), β T = Γη1


and ΣY |X = ΓΩΓT +Γ0 Ω0 ΓT0 , we obtain the log likelihood Lu (α, η1 , EΣY |X (B 0 ),
Ω, Ω0 ) for the envelope model with dimension u,

L_u = −(nr/2) log(2π) − (n/2) log|ΓΩΓ^T + Γ_0Ω_0Γ_0^T|
      − (1/2) ∑_{i=1}^n (Y_i − α − Γη_1 X_i)^T (ΓΩΓ^T + Γ_0Ω_0Γ_0^T)^{-1} (Y_i − α − Γη_1 X_i).

In model (2.17), Γ is not identifiable but its span EΣY |X (B 0 ) is identifiable.


For this reason we write the log likelihood Lu in terms of the envelope, but
computationally we must work in terms of a basis, and for this reason opti-
mization is carried out using Γ. The original derivation by Cook et al. (2010)
uses projections instead of bases. We next sketch how to find the parameter
values that maximize Lu .
From conclusions 3 and 4 of Corollary 1.1, we have

|ΓΩΓ^T + Γ_0Ω_0Γ_0^T| = |Ω| × |Ω_0|,
(ΓΩΓ^T + Γ_0Ω_0Γ_0^T)^{-1} = ΓΩ^{-1}Γ^T + Γ_0Ω_0^{-1}Γ_0^T.

Substituting these into the log likelihood we get

L_u = −(nr/2) log(2π) − (n/2) log|Ω| − (n/2) log|Ω_0|
      − (1/2) ∑_{i=1}^n (Y_i − α − Γη_1 X_i)^T (ΓΩ^{-1}Γ^T + Γ_0Ω_0^{-1}Γ_0^T)(Y_i − α − Γη_1 X_i).

The maximum likelihood estimator of α is α̂ = Ȳ because the predictors

in model (1.1) are centered. Substituting this into the likelihood function and
then decomposing

Yi − Ȳ = PΓ (Yi − Ȳ ) + QΓ (Yi − Ȳ )

and simplifying, we arrive at the first partially maximized log likelihood,

L_u^{(1)}(η_1, E_{Σ_{Y|X}}(B′), Ω, Ω_0) = −(nr/2) log(2π) + L_u^{(11)}(η_1, E_{Σ_{Y|X}}(B′), Ω)
                                         + L_u^{(12)}(E_{Σ_{Y|X}}(B′), Ω_0),

where

L_u^{(11)} = −(n/2) log|Ω| − (1/2) ∑_{i=1}^n {Γ^T(Y_i − Ȳ) − η_1 X_i}^T Ω^{-1} {Γ^T(Y_i − Ȳ) − η_1 X_i},

L_u^{(12)} = −(n/2) log|Ω_0| − (1/2) ∑_{i=1}^n (Y_i − Ȳ)^T Γ_0 Ω_0^{-1} Γ_0^T (Y_i − Ȳ).
i=1

Holding Γ fixed, L_u^{(11)} can be seen as the log likelihood for the multivariate
regression of Γ^T(Y_i − Ȳ) on X_i, and thus L_u^{(11)} is maximized over η_1 at the
value η_1 = (β̂_ols Γ)^T. Substituting this into L_u^{(11)} and simplifying we obtain a
partially maximized version of L_u^{(11)},

L_u^{(21)}(E_{Σ_{Y|X}}(B′), Ω) = −(n/2) log|Ω| − (1/2) ∑_{i=1}^n (Γ^T R̂_i)^T Ω^{-1} Γ^T R̂_i,

where, as defined in Section 1.2, R̂_i is the i-th residual vector from the
fit of the standard model (1.1). From this it follows immediately that, still
with Γ fixed, L_u^{(21)} is maximized over Ω at Ω = Γ^T S_{Y|X} Γ. Consequently,
we arrive at the third partially maximized log likelihood L_u^{(31)}(E_{Σ_{Y|X}}(B′)) =
−(n/2) log|Γ^T S_{Y|X} Γ| − nu/2. By similar reasoning, the value of Ω_0 that max-
imizes L_u^{(12)}(E_{Σ_{Y|X}}(B′), Ω_0) is Ω_0 = Γ_0^T S_Y Γ_0, which leads to the partially
maximized form −(n/2) log|Γ_0^T S_Y Γ_0| − n(r − u)/2.
Combining the above steps, we arrive at the partially maximized form

L_u^{(2)}(E_{Σ_{Y|X}}(B′)) = −(nr/2) log(2π) − nr/2 − (n/2) log|Γ^T S_{Y|X} Γ|
                            − (n/2) log|Γ_0^T S_Y Γ_0|.

Next, by Lemma 1.3, log|Γ_0^T S_Y Γ_0| = log|S_Y| + log|Γ^T S_Y^{-1} Γ|. Consequently,
we can express L_u^{(2)}(E_{Σ_{Y|X}}(B′)) as a function of Γ alone:

L_u^{(2)}(E_{Σ_{Y|X}}(B′)) = −(nr/2) log(2π) − nr/2 − (n/2) log|S_Y|
                            − (n/2) log|Γ^T S_{Y|X} Γ| − (n/2) log|Γ^T S_Y^{-1} Γ|.        (2.20)


Summarizing, the maximum likelihood estimators Ê_{Σ_{Y|X}}(B′) of E_{Σ_{Y|X}}(B′)
and of the remaining parameters are determined as

Ê_{Σ_{Y|X}}(B′) = span{arg min_G (log|G^T S_{Y|X} G| + log|G^T S_Y^{-1} G|)},          (2.21)
η̂_1 = (β̂_ols Γ̂)^T,
β̂ = (Γ̂ η̂_1)^T = β̂_ols P_Γ̂,                                                           (2.22)
Ω̂ = Γ̂^T S_{Y|X} Γ̂,
Ω̂_0 = Γ̂_0^T S_Y Γ̂_0,
Σ̂_{Y|X} = Γ̂ Ω̂ Γ̂^T + Γ̂_0 Ω̂_0 Γ̂_0^T,

where min_G is over all semi-orthogonal matrices G ∈ R^{r×u}, Γ̂ is any semi-
orthogonal basis matrix for Ê_{Σ_{Y|X}}(B′) and Γ̂_0 is any semi-orthogonal basis
matrix for the orthogonal complement Ê^⊥_{Σ_{Y|X}}(B′) of Ê_{Σ_{Y|X}}(B′). The fully max-
imized log-likelihood for fixed u is then

L̂_u = −(nr/2) log(2π) − nr/2 − (n/2) log|S_Y|
      − (n/2) log|Γ̂^T S_{Y|X} Γ̂| − (n/2) log|Γ̂^T S_Y^{-1} Γ̂|.                        (2.23)

Comparing the objective function in (2.21) to the corresponding objective


function (2.10) for predictor envelopes, we see that one can be obtained from
the other by interchanging the roles of Y and X. This is consistent with
our previous comparison of Definitions 2.2 and 2.3 from which we concluded
that response and predictor envelopes are dual constructions when X and
Y are jointly distributed, as summarized in Table 2.2. The dual nature of
likelihood-based envelope constructions is a theme that will occur elsewhere
in this book, without necessarily requiring that X and Y be jointly distributed.
See, for example, the discussion of partial predictor and response envelopes
following (6.16). The partially maximized log likelihood in (2.23) is covered
by Algorithm L discussed in Section 1.5.5.
The optimization required in (2.21) can be sensitive to starting values.
Discussions of starting values and optimization algorithms are available from
Cook (2018). The estimators β̂ and Σ̂_{Y|X} are invariant to the selection of a
basis Γ̂ and thus are unique, but the remaining estimators η̂_1, Ω̂, and Ω̂_0 are
basis dependent and thus not unique. However, the eigenvalues of Ω̂ and Ω̂_0
are not basis dependent, which enables us to judge their sizes using a standard
norm.
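For reference, a minimal numpy sketch of the sample objective in (2.21) and of the estimator (2.22) for a given basis estimate follows (function names ours). As noted above, the optimization over semi-orthogonal matrices is sensitive to starting values and is not attempted here.

import numpy as np

def response_env_objective(G, X, Y):
    """Sample objective in (2.21): log|G' S_{Y|X} G| + log|G' S_Y^{-1} G|."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    SY = Yc.T @ Yc / n
    beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)
    R = Yc - Xc @ beta_ols                                 # residuals from model (1.1)
    SYgX = R.T @ R / n                                     # S_{Y|X}
    return (np.linalg.slogdet(G.T @ SYgX @ G)[1]
            + np.linalg.slogdet(G.T @ np.linalg.inv(SY) @ G)[1])

def response_env_beta(Gamma_hat, X, Y):
    """Estimator (2.22) for a given basis estimate: beta_hat = beta_ols P_{Gamma_hat}."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)
    return beta_ols @ (Gamma_hat @ Gamma_hat.T)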
Asymptotic distributions of various quantities under model (2.17) were
derived by Cook et al. (2010) and reviewed by Cook (2018, Section 1.6). Here
we give only the asymptotic distribution of βb from Cook et al. (2010) since
that will typically be of interest in applications.

Proposition 2.2. Under the envelope model (2.17) with non-stochastic pre-
dictors, normal errors and known u = dim{E_{Σ_{Y|X}}(B′)}, √n(β̂ − β) is asymptoti-
cally normal with mean 0 and variance

avar{√n vec(β̂)} = ΓΩΓ^T ⊗ Σ_X^{-1} + (Γ_0 ⊗ η^T) U^† (Γ_0^T ⊗ η),             (2.24)

where

U = Ω_0^{-1} ⊗ ηΣ_X η^T + Ω_0^{-1} ⊗ Ω + Ω_0 ⊗ Ω^{-1} − 2I_{r−u} ⊗ I_u.

These results can be used in practice to construct an asymptotic stan-
dard error for (β̂)_ij, i = 1, . . . , p, j = 1, . . . , r, by first substituting es-
timates for the unknown quantities on the right side of (2.24) to obtain
an estimated asymptotic variance âvar{√n vec(β̂)}. The estimated asymp-
totic variance âvar{√n (β̂)_ij} is then the corresponding diagonal element of
âvar{√n vec(β̂)}, and its asymptotic standard error is

se{(β̂)_ij} = [âvar{√n (β̂)_ij}]^{1/2} / √n.                                    (2.25)

Under mild regularity conditions, √n(β̂ − β) is asymptotically normal with
non-normal errors, but avar{√n vec(β̂)} is considerably more complicated than
that given in Proposition 2.2. However, the bootstrap can still be used to
estimate its variance. See Cook (2018, Section 1.9) for further discussion.
Parallel to our discussion of the asymptotic variance for predictor envelopes
following Proposition 2.1, the first addend on the right side of (2.24) is the
asymptotic variance of β̂_Γ, the maximum likelihood estimator of β when Γ is
known:

avar{√n vec(β̂_Γ)} = ΓΩΓ^T ⊗ Σ_X^{-1}.

The second addend on the right side of (2.24) can be interpreted as the cost
of estimating Γ. See Cook (2018) for further discussion.
We present an illustrative analysis in the next section to help fix the ideas
behind response reduction. Many more illustrations on the use of response
envelopes are available from Cook (2018).

2.5.4 Gasoline data


We use a subset of the well-known gasoline data (Kalivas, 1997) to illustrate
the operating characteristics of response envelopes, particularly how they deal
with collinearity in the responses. Due to the relationships presented in Ta-
ble 2.2(B), this will also provide intuition on how predictor envelopes deal
with collinearity in the predictors. The full dataset consists of measurements
of the octane number and diffuse reflectance measured as log(1/R) from 900
to 1700 nm in 2 nm intervals from n = 60 gasoline samples. This data set is
too large to allow for straightforward informative graphics, so we selected two
reflectance measures to comprise the continuous bivariate r = 2 response vec-
tor Y , referring to them generically as wavelengths 1 and 2. We constructed a
binary predictor X = 0 or 1 by dividing the data into 31 cases with relatively
low octane numbers and 29 cases with high octane numbers. This simplified
setting allows for an informative low-dimensional graphical representation.
The models we consider are as given in (1.1) and (2.17), and the 60 ob-
servations are plotted in Figure 2.1 along with various enhancements to be
explained. The nominal goal of the analysis is to understand how low and

FIGURE 2.1
Illustration of response envelopes using two wavelengths from the gaso-
line data. High octane data marked by x and low octane numbers by o.
Ê = Ê_{Σ_{Y|X}}(B′): estimated envelope. Ê^⊥ = Ê^⊥_{Σ_{Y|X}}(B′): estimated orthogonal
complement of the envelope. Marginal distributions of high octane numbers
are represented by dashed curves along the horizontal axis. Marginal envelope
distributions of low octane numbers are represented by solid curves along the
horizontal axis. (From the Graphical Abstract for Cook and Forzani (2020)
with permission.)

high octane numbers affect the bivariate distribution of the wavelengths by


estimating the elements of β = (β1 , β2 )T . Consider estimating β1 , the mean
of wavelength 1 for high octane numbers minus the mean of wavelength 1 for
low octane numbers. For a likelihood analysis under model (1.1), the data
are projected onto the horizontal axis, as represented by path A. This gives
rise to the two widely dispersed marginal distributions represented by dashed
and solid curves along the horizontal axis in Figure 2.1. These distributions
overlap substantially and in consequence it would take a large sample size to
infer that β_1 ≠ 0.

An envelope analysis based on model (2.17) results in the inference that the
distributions of high and low octane numbers are identical along the orthogonal
complement E^⊥_{Σ_{Y|X}}(B′) of the true one-dimensional envelope E_{Σ_Y}(B′). The
substantial variation along Ê^⊥_{Σ_{Y|X}}(B′) is thus inferred to be immaterial to
the analysis and, in consequence, all differences between high and low octane
numbers are inferred to lie in the envelope subspace. To estimate β1 , the
envelope analysis first projects the data onto the estimated envelope EbΣY |X (B 0 )
to remove the immaterial variation, as represented by path B in Figure 2.1.
The resulting point is then projected onto the horizontal axis for inference
on β1 . This produces two narrowly dispersed distributions represented by
dashed and solid curves in Figure 2.1. The difference in variation between the
narrowly and widely dispersed distributions represents the gain that envelopes
can achieve over standard methods of analysis, which in this case is substantial.
In this illustration we have used two wavelengths as the continuous bivari-
ate response Y with a binary predictor X. The response envelope illustrated
in Figure 2.1 is EΣY (B 0 ) = EΣY |X (B 0 ), which corresponds to the lower left
entry in Table 2.2B. This is the predictor envelope in the lower right entry in
Table 2.2B. Thus, Figure 2.1 serves to illustrate both response and predictor
envelopes. The estimation process illustrated is for response envelopes, how-
ever. The estimated envelope aligns closely with the second eigenvector of the
sample variance-covariance matrix of the wavelengths. That arises because we
are working in only r = 2 dimensions. In higher dimensions the envelope does
not have to align closely with any single eigenvector but may align with a
subset of them.

2.5.5 Prediction
We think of ΓT Y as subject to refined prediction since its distribution depends
on X, while Γ_0^T Y is only crudely predictable since its distribution does not
depend on X. Assuming that the predictors are centered, the actual predictions
at a new value XN of X are
( Γ̂^T Ŷ   )   ( Γ̂^T Ȳ + η̂_1 X_N )
(          ) = (                   ).
( Γ̂_0^T Ŷ )   ( Γ̂_0^T Ȳ          )

Multiplying by (Γ̂, Γ̂_0) we get simply

Ŷ_N = Ȳ + Γ̂ η̂_1 X_N.

2.5.6 PLS connection


The NIPALS and SIMPLS algorithms summarized in Tables 3.1 and 3.4 can
be used for response reduction by interchanging the roles of X and Y . They
provide √n-consistent estimators Γ̂_pls of a basis for E_{Σ_Y}(B′) when n → ∞
with p and q fixed. And they do so subject to (2.14). The estimated basis Γ̂_pls
is then used to estimate β by substituting into (2.18) to give

β̂_pls = β̂_ols P_{Γ̂_pls}.                                                      (2.26)

The context for this estimator is regressions with many responses and few
predictors so βbols is well defined. Model (2.17) allows maximum likelihood
estimation of these same quantities along with estimation of ΣY |X . In other
words, model (2.17) provides for maximum likelihood estimation of the quan-
tities estimated by the PLS algorithms.
Another envelope-based estimator of β is the maximum likelihood esti-
mator based on model (2.17). That estimator (2.22) is of the same form as
(2.26), except the estimated basis derives from maximum likelihood estima-
tion. Adding a subscript ‘env’ to envelope estimators, we see a parallel between
the likelihood and PLS estimators (2.22) and (2.26):

$$\widehat{\beta}_{\mathrm{env}} = \widehat{\beta}_{\mathrm{ols}} P_{\widehat{\Gamma}_{\mathrm{env}}}. \qquad (2.27)$$

2.6 Nadler-Coifman model for spectroscopic data


As mentioned in Chapter 1, PLS regression flourishes within the chemometrics
community, largely for predicting the concentration of analytes Y from spec-
tral data X (Martens and Næs, 1989). Training samples (Xi , Yi ), i = 1, . . . , n,
consist typically of vectors Yi of analyte concentrations and digitized log spec-
tra Xi of hundreds of perhaps highly collinear predictors. Nadler and Coifman
(2005) used Beer’s law to guide the formulation of a model for the regression
of a spectrum X on concentrations Y . Beer’s law states that, in regular set-
tings with a single analyte (r = 1) under study, the logarithm of the spectrum
is proportional to the analyte concentration Y multiplied by its unique re-
sponse spectrum. Additivity is typically assumed when several substances are

present, enabling a typical measured spectrum to be represented as

$$\begin{aligned}
X_{p\times 1} &= \mu + V_{p\times r}Y_{r\times 1} + U_{p\times s}Z_{s\times 1} + \xi \qquad (2.28)\\
&= \mu + A_{p\times(r+s)}W_{(r+s)\times 1} + \xi,
\end{aligned}$$

where the second line is an abbreviated representation of the model with


A = (V, U ) and W T = (Y T , Z T ). In version (2.28), the constant vector
µ may represent a non-stochastic baseline shift or machine drift. The non-
stochastic columns Vj , j = 1, . . . , r, of V are the unknown characteristic spec-
tra of the r analytes Y of interest. The elements of the stochastic vector Z
are centered concentrations of any unknown substances that may be present;
their corresponding characteristic spectra are represented by the columns Uj ,
j = 1, . . . , s, of U . The elements of ξ are uncorrelated random errors with
means 0 and variances σ 2 . They represent noise and other errors introduced
by the measurement process. To connect this construction with the response
envelope model (2.17), we assume that $p \gg r + s$, as is typical in chemometrics
applications. The number of analytes Y of interest, r, is typically small. The
number s of unknown substances is also imagined to be small. We assume
that rank(A) = r + s.
Let Γ ∈ Rp×(r+s) be a semi-orthogonal basis matrix for span(A) and let
(Γ, Γ0 ) be an orthogonal matrix. Then model (2.28) is an instance of model
(2.17) with the roles of the predictors and responses reversed. To see this
we assume that all rows of ΣW,Y are non-zero to avoid distracting technical
diversions and then first consider the regression coefficients from (2.28).

$$\begin{aligned}
\Sigma_{X,Y} &= A\Sigma_{W,Y}\\
\beta^T &:= \Sigma_{X,Y}\Sigma_Y^{-1}\\
&= A\Sigma_{W,Y}\Sigma_Y^{-1} = P_\Gamma A\Sigma_{W,Y}\Sigma_Y^{-1}\\
&= \Gamma\eta_1,
\end{aligned}$$
where $\eta_1 = \Gamma^T A\Sigma_{W,Y}\Sigma_Y^{-1}$. Turning to the variances,

$$\begin{aligned}
\Sigma_X &= A\Sigma_W A^T + \sigma^2 I_p\\
&= P_\Gamma(A\Sigma_W A^T + \sigma^2 I_p)P_\Gamma + Q_\Gamma\sigma^2\\
\Sigma_{X|Y} &= A\Sigma_{W|Y}A^T + \sigma^2 I_p\\
&= \Gamma\Omega\Gamma^T + \Gamma_0\Omega_0\Gamma_0^T,
\end{aligned}$$
where $\Omega = \Gamma^T A\Sigma_{W|Y}A^T\Gamma + \sigma^2 I_{r+s}$ and $\Omega_0 = \sigma^2 I_{p-(r+s)}$. It follows immediately from these results that $\mathrm{span}(A)$ is the smallest reducing

subspace of ΣX that contains CX,Y . Using Table 2.2, we see that the Nadler-
Coifman model (2.28), which was developed specifically for spectroscopic ap-
plication, and its forward regression counterpart both have a natural envelope
structure via EΣX (CX,Y ). Further, both NIPALS and SIMPLS discussed in
Chapter 3 are serviceable methods for fitting these models. Methods for the
simultaneous compression of the predictor and response vectors, which are
applicable in the particular envelope structure for (2.28), are discussed in
Chapter 5.
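As a concrete illustration, the following sketch simulates training spectra from a model of the form (2.28) with one analyte of interest and one unobserved interferent. It is a minimal numpy sketch of our own; the characteristic spectra, baseline, noise level and sample sizes are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r, s = 200, 50, 1, 1          # samples, wavelengths, analytes, interferents
sigma = 0.05                        # noise standard deviation

# Hypothetical characteristic spectra (columns of V and U) and baseline mu.
grid = np.linspace(0.0, 1.0, p)
V = np.exp(-0.5 * ((grid - 0.3) / 0.05) ** 2).reshape(p, r)   # analyte spectrum
U = np.exp(-0.5 * ((grid - 0.7) / 0.10) ** 2).reshape(p, s)   # interferent spectrum
mu = 0.1 * np.ones(p)                                          # baseline shift

# Concentrations: Y observed, Z unobserved (centered).
Y = rng.uniform(0.0, 1.0, size=(n, r))
Z = rng.normal(0.0, 0.3, size=(n, s))

# Measured spectra X = mu + V Y + U Z + noise, one row per training sample.
X = mu + Y @ V.T + Z @ U.T + sigma * rng.normal(size=(n, p))

# span(A), A = (V, U), has dimension r + s = 2, in line with the envelope
# structure discussed above.
A = np.hstack([V, U])
print(np.linalg.matrix_rank(A))     # 2
```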

2.7 Variable scaling


The selection of estimators for use in multivariate regression can depend in
part on the measurement scales or units involved. The responses and predic-
tors can be represented in different base quantities like length, time, mass,
temperature and current, and these quantities could be measured in different
rather arbitrary units. Units of length could be inches, feet, meters, miles,
kilometers, and perhaps light years. If the regression involves rather arbitrary
measurement systems, we may want to insure that our results do not depend
materially on the particular systems selected. This can be done by selecting
estimators that are invariant or equi-variant under classes of measurement
transformations. For example, the OLS estimator βbols for the regression of Y
on X is equi-variant under the linear transformation X 7→ AX, where A is full
rank; that is, the OLS estimator from the regression of Y on AX is A−1 βbols , so
βbols 7→ A−1 βbols . And the fitted values from an OLS fit are invariant under the
same class of transformations. The class of rescaling transformations, defined
as X 7→ Λ−1 X where Λ is a diagonal matrix with positive diagonal elements,
could also be relevant in some regressions.
If all responses are measured in the same units or all predictors are mea-
sured in the same units and if maintaining the same units is fundamental
to the regression, then requiring invariance or equi-variance under a rather
arbitrary class of transformations seems like an unnecessary restriction on
the regression. Indeed, in such cases we may not wish to restrict attention to
invariant or equi-variant estimators and some estimators may not work well in
the presence of arbitrarily selected units. For instance, principal components
are not equi-variant and Hotelling (1933) cautioned against using arbitrary

units:

The method of principal components can therefore be applied only


if . . . a metric – a definition of distance – [is] assumed in the [p]-
dimensional space, and not simply a set of axes . . . [Emphasis added]

Similarly, the estimators based on the envelope methods introduced in this


chapter and the PLS estimators studied in the next chapter are not invari-
ant or equi-variant under rescaling of the response vector or the predictor
vector. They too require a purposefully-selected metric. Their applications
are less encumbered and they tend to work best when the predictors or re-
sponses are measured in the same units, because then the issue of a metric
is not pertinent. More specifically, response envelopes and PLS methods for
response compression tend to work best when the responses are in the same
or similar units. Similarly, predictor envelopes and PLS methods for predictor
compression tend to work best when the predictors are measured in the same
or similar units. In each of the three examples of Section 1.1, the predictors are
all spectral measurements. The responses for the cattle data discussed in Sec-
tion 2.5.2 are all weights in kilograms taken at various times after treatment.
Is it necessary or useful to restrict attention to estimators that are equi-variant
under rescaling, say measuring weights at some times in Pounds, Stones, Mil-
ligrams and Metric Tons? The responses for the gasoline data discussed in
Section 2.5.4 are all spectral measures.
To broaden the applicability of envelopes, Cook and Su (2013, 2016) devel-
oped methods to estimate an appropriate scaling of the responses or predic-
tors. Suppose that we rescale Y by multiplication by a non-singular diagonal
matrix $\Lambda$. Let $Y_\Lambda = \Lambda Y$ denote the new response, let $(\beta_\Lambda, \Sigma_\Lambda)$ and $(\widehat{\beta}_\Lambda, \widehat{\Sigma}_\Lambda)$ denote the corresponding parameters and their envelope estimators based on the envelope model (2.17) for the regression of $Y_\Lambda$ on $X$. Then we do not generally have invariance, $\widehat{\beta}_\Lambda = \widehat{\beta}$, $\widehat{\Sigma}_\Lambda = \widehat{\Sigma}$, or equivariance, $\widehat{\beta}_\Lambda = \Lambda\widehat{\beta}$, $\widehat{\Sigma}_\Lambda = \Lambda\widehat{\Sigma}\Lambda$,
and the dimension of the envelope subspace may change because of the trans-
formation. This is illustrated schematically in Figure 2.2 for r = 2 responses
and a binary predictor X. The left panel, which shows a structure similar to
that in Figure 2.1, has an envelope that aligns with the second eigenvector
of ΣY |X . Beginning in the left panel, suppose we rescale Y to YΛ . This may
change the relationship between the eigenvectors and B, as illustrated by the
distributions shown in Figure 2.2b, where BΛ = span(Λβ). Since BΛ aligns
with neither eigenvector of ΣY |X , the envelope for the regression of YΛ on X


[Figure 2.2 appears here: panel (a) Original distributions; panel (b) Rescaled distributions.]

FIGURE 2.2
Illustration of how response rescaling can affect a response envelope from the
regression of a bivariate response on a binary predictor, X = 0, 1. The two
distributions represented in each plot are the distributions of Y | (X = 0) and
Y | (X = 1). The axes represent the coordinates of Y = (y1 , y2 )T .

is two dimensional: $\mathcal{E}_{\Sigma_\Lambda}(\mathcal{B}_\Lambda) = \mathbb{R}^2$. In this case, all linear combinations of $Y$


are material to the regression, the envelope model is the same as the standard
model and no efficiency gains are achieved. Cook and Su (2013) developed
likelihood-based methodology for estimating the diagonal scaling matrix Λ
to take us from Figure 2.2b, where there is no useful envelope structure, to
Figure 2.2a where envelopes offer substantial gain. They assumed that the
scaled response vector Λ−1 Y follows the response envelope model (2.17), so
the scaled envelope model for the original responses Y becomes

Y = α + ΛΓηX + ε, with ΣY |X = ΛΓΩΓT Λ + ΛΓ0 Ω0 ΓT0 Λ. (2.29)

The scaled response $\Lambda^{-1}Y$ conforms to an envelope model with $u$-dimensional envelope $\mathcal{E}_{\Lambda^{-1}\Sigma_{Y|X}\Lambda^{-1}}(\Lambda^{-1}\mathcal{B}')$ and semi-orthogonal basis matrix $\Gamma$. From this
starting point, they developed likelihood-based methods for estimating Λ and
the corresponding envelopes, and showed good results in simulations and data
analyses.
Cook and Su (2016) also developed likelihood-based methodology for es-
timating the diagonal scaling matrix Λ in predictor envelopes. They assumed
that the regression of Y on Λ−1 X follows the q dimensional envelope model

(2.5) and allowed Λ to be structured so groups of predictors can be scaled in


the same way:

Λ = diag(1, . . . , 1, λ2 , . . . , λ2 , . . . , λs . . . , λs ) ∈ Rp×p ,

where there are s distinct positive scaling parameters, (λ1 , λ2 , . . . , λs ), and


the group structure of Λ is known. The resulting extension of envelope model
(2.5) is then

$$\begin{aligned}
Y &= \alpha + \eta^T\Phi^T\Lambda^{-1}(X - \mu_X) + \varepsilon \qquad (2.30)\\
\Sigma_X &= \Lambda\Phi\Delta\Phi^T\Lambda + \Lambda\Phi_0\Delta_0\Phi_0^T\Lambda,
\end{aligned}$$

where $\Phi \in \mathbb{R}^{p\times q}$ is a semi-orthogonal basis matrix for the envelope $\mathcal{E}_{\Lambda^{-1}\Sigma_X\Lambda^{-1}}(\Lambda\mathcal{B})$, and the remaining parameter structure is as discussed previously. From this starting point, Cook and Su (2016) developed likelihood-based
methods for estimating Λ and the corresponding envelopes, and showed good
results in simulations and data analyses. See the original papers and Cook
(2018) for additional discussion and details of the methodology.
In Chapter 3, we study PLS algorithms for estimating the response and
predictor envelopes in multivariate regressions. As mentioned earlier in this
section, these PLS methods are not invariant or equi-variant under rescaling
and consequently tend to work best when the predictor or responses are mea-
sured in the same units. There is a tradition in some areas of standardizing the
variables so they all have the same sample standard deviations. However, there
is no theoretical support for this operation and it may make matters worse. For
instance, we know that in model (2.29) we can rescale the elements of Y = (yj )
by the diagonal elements of $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$ – $y_j \mapsto y_j/\lambda_j$, $j = 1, \ldots, r$ – to get an envelope model with response envelope $\mathcal{E}_{\Sigma_Y}(\mathcal{B}') = \mathrm{span}(\Gamma)$. But this
scaling is very different from the usual scaling to get unit standard deviations.
In the population, the usual scaling operation is equivalent to rescaling by the
marginal population standard deviation of yj :
$$\{\mathrm{var}(y_j)\}^{1/2} = \lambda_j\left[e_j^T\left\{\Gamma(\eta\eta^T + \Omega)\Gamma^T + \Gamma_0\Omega_0\Gamma_0^T\right\}e_j\right]^{1/2},$$

where ej ∈ Rr has a 1 in position j and 0’s elsewhere. Clearly, starting with


model (2.29) and rescaling by the marginal standard deviations of the re-
sponses will not have a result that benefits the analysis. Suppose instead that
the original data Y follows envelope model (2.17). Then standardizing by the
marginal standard deviations of the responses,
$$\{\mathrm{var}(y_j)\}^{1/2} = \left[e_j^T\left\{\Gamma(\eta\eta^T + \Omega)\Gamma^T + \Gamma_0\Omega_0\Gamma_0^T\right\}e_j\right]^{1/2},$$

may well destroy the envelope structure and result in estimators that are
more variable than necessary. In short, the usual scaling to get unit standard
deviations is generally questionable.
Aside from the proposal by Cook and Su (2016), there does not seem
to be PLS methodology for estimating proper rescaling parameters. Further
study along these lines is needed to expand the scope of PLS as envelope
methodology has been expanded by Cook and Su (2013, 2016).
3

PLS Algorithms for Predictor Reduction

PLS algorithms were not historically associated with specific models and, in
consequence, were not typically viewed as methods for estimating identifiable
population parameters, but were seen instead as methods for prediction. We
study in this chapter several algorithms for PLS regression, describing cor-
responding statistical models and their connection with envelopes. For the
major algorithms we begin each section with a synopsis that highlights their
main points. Each synopsis contains a description of the sample algorithm
and the corresponding population version. Subsequent to each synopsis, we
study characteristics of the sample algorithms, prove the relationship between
the population and sample algorithms and establish the connection with en-
velopes.
PLS regression algorithms are classified as PLS1 algorithms, which apply
to regressions with a univariate response, or PLS2 algorithms, which apply
to regressions with multiple responses. The algorithms presented here are all
of the PLS2 variety and become PLS1 algorithms when the response is uni-
variate. Their similarities and differences are highlighted later in this chap-
ter. To help maintain a distinction between PLS1 and PLS2 algorithms, we
use ΣX,Y = cov(X, Y ) when the response Y can be multivariate but we use
σX,Y = cov(X, Y ) when there can be only a single real response.
While the PLS algorithms in this chapter are presented and studied for
the purpose of reducing the dimension of the predictor vector, with minor
modifications they apply also to reducing the dimension of the response vector
in the multivariate (multiple response) linear model (1.1). The justification for
this arises from the discussion of Section 2.5.1, particularly Table 2.2. PLS for


response reductions is discussed in Section 3.11 after our discussion of PLS


for predictor reductions.

3.1 NIPALS algorithm


According to Martens and Næs (1989, p. 118), NIPALS – nonlinear iterative
partial least squares – arose from Herman Wold’s ideas (Wold, 1975b) for
iterative fitting of bilinear models in econometrics and social sciences. Such
models were often applied to data with several collinear predictors and a
number of observations that was insufficient for standard methods to work
reliably. NIPALS was conceived as a dimension reduction method for reducing
the predictors and thereby mitigating the collinearity and sample size issues.

3.1.1 Synopsis
We begin by centering and organizing the data (Yi , Xi ), i = 1, . . . , n, into
matrices X ∈ Rn×p and Y ∈ Rn×r , as defined in Section 1.2.1. Table 3.1(a)
gives steps in the NIPALS algorithm for reducing the predictors. The table
serves also to define notation used in this synopsis, like Xd , Yd and md . It may
not be easy to see what the algorithm is doing and the tradition of referring
to loadings ld and md , weights wd and scores sd does not seem to help with
intuition. The weight vectors require calculation of a first eigenvector $\ell_1(\cdot)$
normalized to have length 1. Early versions of the NIPALS algorithm (e.g.
Martens and Næs, 1989, Frame 3.6) contained subroutines for calculating `1
(Stocchero, 2019), which tended to obscure the underlying ideas. We have
dropped the eigenvector subroutine in Table 3.1(a). The algorithm is not gen-
erally associated with a particular model, but it does involve PLS estimation of $\beta = \Sigma_X^{-1}\Sigma_{X,Y}$.
For simplicity, we use the same notation for weight vectors wd and weight
matrices Wd in a sample and in the population. The context should be clear
from the frame of reference under discussion.
As a first step in developing the population version shown in Ta-
ble 3.1(b), we rewrite selected steps in terms of the sample covariance matrix
SXd ,Yd = n−1 XTd Yd ∈ Rp×r between the deflated predictors and the deflated

TABLE 3.1
NIPALS algorithm: (a) sample version adapted from Martens and Næs (1989) and Stocchero (2019). The $n \times p$ matrix $X$ contains the centered predictors and the $n \times r$ matrix $Y$ contains the centered responses; (b) population version derived herein.

(a) Sample Version
Initialize: $X_1 = X$, $Y_1 = Y$. Delete quantities with 0 subscript, e.g. $W_1 = w_1$.
Select: $q \le \min\{\mathrm{rank}(S_X), n-1\}$.
For $d = 1, \ldots, q$, compute
  sample covariance matrix: $S_{X_d,Y_d} = n^{-1}X_d^T Y_d$
  weights: $w_d = \ell_1(S_{X_d,Y_d}S_{X_d,Y_d}^T)$
  scores: $s_d = X_d w_d$
  X loadings: $l_d = X_d^T s_d/s_d^T s_d$
  Y loadings: $m_d = Y_d^T s_d/s_d^T s_d$
  X deflation: $X_{d+1} = X_d - s_d l_d^T$
  Y deflation: $Y_{d+1} = Y_d - s_d m_d^T$
  Append $w_d, l_d, m_d, s_d$ onto matrices $W_d = (W_{d-1}, w_d) \in \mathbb{R}^{p\times d}$, $L_d = (L_{d-1}, l_d) \in \mathbb{R}^{p\times d}$, $M_d = (M_{d-1}, m_d) \in \mathbb{R}^{r\times d}$, $S_d = (S_{d-1}, s_d)$
End
Compute reg. coefficients: $\widehat{\beta}_{\mathrm{npls}} = W_q(L_q^T W_q)^{-1}M_q^T$

(b) Population Version
Initialize: $w_1 = \ell_1(\Sigma_{X,Y}\Sigma_{X,Y}^T)$, $W_1 = (w_1)$
For $d = 1, 2, \ldots$
  End if $Q_{W_d(\Sigma_X)}^T\Sigma_{X,Y} = 0$ and then set $q = d$
  Compute weights: $w_{d+1} = \ell_1(Q_{W_d(\Sigma_X)}^T\Sigma_{X,Y}\Sigma_{X,Y}^T Q_{W_d(\Sigma_X)})$
  Append: $W_{d+1} = (W_d, w_{d+1})$
Compute reg. coefficients: $\beta_{\mathrm{npls}} = W_q(W_q^T\Sigma_X W_q)^{-1}W_q^T\Sigma_{X,Y} = W_q\beta_{Y|W_q^T X}$

(c) Notes
Orthogonal weights: $W_q^T W_q = I_q$.
Projection: $Q_{W_d(\Sigma_X)} = I - W_d(W_d^T\Sigma_X W_d)^{-1}W_d^T\Sigma_X = I - P_{W_d(\Sigma_X)}$.
Envelope connection: $\mathrm{span}(W_q) = \mathcal{E}_{\Sigma_X}(\mathcal{B})$, the $\Sigma_X$-envelope of $\mathcal{B} := \mathrm{span}(\beta)$.
Score matrix $S_d$: These are traditional intermediaries, although they are not needed in the computation of $\widehat{\beta}_{\mathrm{npls}}$.
Algorithm N: This is an instance of Algorithm N discussed in §§1.5.3 & 11.1.
PLS1 v. PLS2: Algorithm is applicable for PLS1 or PLS2 fits; see §3.10.

response and the sample covariance matrix SXd = n−1 XTd Xd ∈ Rp×p of the
deflated predictors:
$$\begin{aligned}
w_d &= \ell_1(S_{X_d,Y_d}S_{X_d,Y_d}^T) \in \mathbb{R}^p\\
s_d &= X_d w_d \in \mathbb{R}^{n\times 1}\\
m_d &= S_{X_d,Y_d}^T w_d(w_d^T S_{X_d} w_d)^{-1} \in \mathbb{R}^{r\times 1}\\
l_d &= S_{X_d} w_d(w_d^T S_{X_d} w_d)^{-1} \in \mathbb{R}^{p\times 1}\\
X_{d+1} &= X_d - X_d w_d(w_d^T S_{X_d} w_d)^{-1}w_d^T S_{X_d}\\
&= X_d Q_{w_d(S_{X_d})} \qquad (3.1)\\
&= Q_{X_d w_d}X_d \in \mathbb{R}^{n\times p} \qquad (3.2)\\
Y_{d+1} &= Y_d - X_d w_d(w_d^T S_{X_d} w_d)^{-1}w_d^T S_{X_d,Y_d}\\
&= Q_{X_d w_d}Y_d \in \mathbb{R}^{n\times r} \qquad (3.3)\\
S_{X_{d+1}} &= Q_{w_d(S_{X_d})}^T S_{X_d} Q_{w_d(S_{X_d})} \qquad (3.4)\\
S_{X_{d+1},Y_{d+1}} &= Q_{w_d(S_{X_d})}^T S_{X_d,Y_d}, \qquad (3.5)
\end{aligned}$$

where $Q_{w_d(S_{X_d})}$ is the operator that projects onto the orthogonal complement of $\mathrm{span}(w_d)$ in the $S_{X_d}$ inner product,
$$Q_{w_d(S_{X_d})} = I - w_d(w_d^T S_{X_d} w_d)^{-1}w_d^T S_{X_d} = I - P_{w_d(S_{X_d})},$$

as defined previously at (1.3). From (3.1) we see that the deflation of Xd


consists of the projection of the rows of Xd onto span⊥ (wd ) in the SXd inner
product, and that $S_{X_{d+1}} = Q_{w_d(S_{X_d})}^T S_{X_d} Q_{w_d(S_{X_d})}$. When $S_{X_d}$ is nonsingular, the deflation $Y_{d+1}$ can be expressed as a residual matrix
$$Y_{d+1} = Y_d - X_d P_{w_d(S_{X_d})}\widehat{\beta}_d = Y_d - X_d w_d\widehat{\beta}_{Y|w_d^T X_d},$$
where $\widehat{\beta}_d = S_{X_d}^{-1}S_{X_d,Y_d}$. The deflation steps essentially involve computing resid-
uals sequentially from rank 1 approximations of Xd and Yd to give the residual
matrices Xd+1 and Yd+1 at the next iteration. They may have computational
or data-analytic advantages, but are unnecessary for the computation of βbnpls ,
as will be shown later.
The population version of the NIPALS algorithm given in Table 3.1(b)
is a special case of Algorithm N described in Proposition 1.11 obtained by
setting A = ΣX,Y ΣTX,Y and M = ΣX . The population stopping criterion
QTWd (ΣX ) ΣX,Y = 0 in Table 3.1(b) depends on finding a reducing subspace of
ΣX that contains span(ΣX,Y ), as shown in Section 3.2.1. The population algo-
rithm continues to accumulate weight vectors until it finds a weight matrix Wd

that reduces ΣX and that spans a subspace containing span(ΣX,Y ). The stop-
ping criterion is then met because, by Proposition 1.3, QWd (ΣX ) = QWd and,
since span(ΣX,Y ) ⊆ span(Wd ), QWd ΣX,Y = 0. This represents the essence of
the links connecting envelopes with PLS, as discussed in Section 3.2.
The sample NIPALS algorithm of Table 3.1(a) does not contain an explicit
mechanism for choosing the stopping point q and so it must be determined
by some external criterion. The algorithm is typically run for several values
of q and then the stopping point is chosen by predictive cross validation or a
holdout sample. Additionally, the sample version does not require SX to be
nonsingular. However, when SX is nonsingular, the NIPALS estimator can be
represented as the projection of the OLS estimator βbols onto span(Wq ) in the
SX inner product:
βbnpls = PWq (SX ) βbols .

This requires that SX be positive definite and so does not have a direct sample
counterpart when n < p. If in addition, q = p then PWq (SX ) = Ip and the
NIPALS estimator reduces to the OLS estimator, βbnpls = βbols .
The NIPALS estimator βbnpls shown in Table 3.1(a) depends only on the
final weight matrix Wq = (w1 , . . . , wq ) and the two corresponding loading ma-
trices Lq = (l1 , . . . , lq ) and Mq = (m1 , . . . , mq ). The columns of these matrices
depend on the data only via SXd and SXd ,Yd , d = 1, . . . , q. In consequence,
the population version of the algorithm shown in Table 3.1(b) can be deduced
by replacing SXd ,Yd and SXd with their population counterparts, which we
represent as ΣXd ,Yd and ΣXd . There is no crisp population counterpart of
q associated with the sample algorithm, although later we will see that, as
indicated in Table 3.1(c), this is in fact the dimension of EΣX (B), the ΣX -
envelope of B. We will show later in Section 3.1.4 that the sample version of
the population algorithm in Table 3.1(b) is the NIPALS algorithm.
Starting with the population algorithm in Table 3.1(b), βbnpls is obtained
by replacing ΣX and ΣX,Y by their sample counterparts, SX and SX,Y , and
stopping at the selected value of q. Additionally, the population algorithm
shows that the intermediate quantities in Table 3.1(a) – ld , md , sd , Xd , Yd ,
and SXd ,Yd – are not necessary to get the PLS estimator βbpls , although they
might be useful for computational or diagnostic purposes.
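The population algorithm of Table 3.1(b) translates almost line for line into code. The sketch below is a minimal numpy implementation of our own (it is not the R or Python software accompanying the book); `l1` extracts a first eigenvector and the stopping rule uses a numerical tolerance in place of the exact condition $Q_{W_d(\Sigma_X)}^T\Sigma_{X,Y} = 0$.

```python
import numpy as np

def l1(M):
    """First eigenvector (unit length) of the symmetric matrix M."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1:]                      # eigenvector of the largest eigenvalue

def Q_proj(W, Sigma):
    """Q_{W(Sigma)} = I - W (W^T Sigma W)^{-1} W^T Sigma, as at (1.3)."""
    p = Sigma.shape[0]
    return np.eye(p) - W @ np.linalg.solve(W.T @ Sigma @ W, W.T @ Sigma)

def nipals_population(SigmaX, SigmaXY, tol=1e-10, max_q=None):
    """Population NIPALS of Table 3.1(b): weight matrix W and beta_npls."""
    p = SigmaX.shape[0]
    max_q = p if max_q is None else max_q
    W = l1(SigmaXY @ SigmaXY.T)              # first weight vector w_1
    for _ in range(max_q - 1):
        R = Q_proj(W, SigmaX).T @ SigmaXY    # Q^T_{W(Sigma_X)} Sigma_{X,Y}
        if np.linalg.norm(R) < tol:          # stopping criterion (3.13)
            break
        W = np.hstack([W, l1(R @ R.T)])      # append the next weight vector
    beta = W @ np.linalg.solve(W.T @ SigmaX @ W, W.T @ SigmaXY)
    return W, beta
```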

3.1.2 NIPALS example


Before turning to details of the NIPALS algorithm in the next section, we give
an illustrative example of the population algorithm to perhaps help fix ideas.
With p = 3 predictors and r = 2 responses, let
   
$$\Sigma_X = \begin{pmatrix} 1 & 4/3 & 0\\ 4/3 & 4 & 0\\ 0 & 0 & 5 \end{pmatrix}, \quad \Sigma_{X,Y} = \begin{pmatrix} 5 & 0\\ 0 & 4\\ 0 & 0 \end{pmatrix} \quad \text{and} \quad \Sigma_Y = 10\begin{pmatrix} 1 & -0.6\\ -0.6 & 4 \end{pmatrix}.$$

The response covariance matrix ΣY is not used here but will be used in sub-
sequent illustrations. The last row of ΣX,Y is zero, but this alone does not
provide clear information about the contribution of x3 to the regression. If x3
is correlated with the other two predictors, it may well be material. However,
we see from ΣX that in this regression x3 is uncorrelated with x1 and x2 and
this enables a clear conclusion about the role of x3 . In short, since the last row
of ΣX,Y is zero and since the third predictor is uncorrelated with the other
two, we can conclude immediately that at most two predictors, x1 and x2 , are
needed. One goal of this example is to illustrate how the computations play
out to reach that conclusion.
According to the initialization step in Table 3.1(b), the first eigenvector of
ΣX,Y ΣTX,Y = diag(25, 16, 0) is w1 = (1, 0, 0)T . To compute the second weight
vector we need
   
0 −4/3 0 0 0
Qw1 (ΣX ) =  0 1 0  and QTw1 (ΣX ) ΣX,Y =  −20/3 4  .
   

0 0 1 0 0

Since QTw1 (ΣX ) ΣX,Y has rank 1, the second weight vector is w2 = (0, 1, 0)T .
For the third weight vector we need QTW2 (ΣX ) ΣX,Y , where W2 = (w1 , w2 ). It
can be seen by direct calculation that QTW2 (ΣX ) ΣX,Y = 0 and so the algorithm
terminates, giving q = 2. And, as described in Table 3.1, W2T W2 = I2 .
We can reach the conclusion that q = 2 also by reasoning that span(W2 )
is a reducing subspace of ΣX :

$$\Sigma_X = W_2\Delta W_2^T + w_3\Delta_0 w_3^T,$$
where $w_3 = (0, 0, 1)^T$, $\Delta_0 = 5$ and
$$\Delta = \begin{pmatrix} 1 & 4/3\\ 4/3 & 4 \end{pmatrix}.$$

It follows from Proposition 1.3 that $Q_{W_2(\Sigma_X)} = Q_{W_2}$, which can be seen also by direct calculation:
$$Q_{W_2(\Sigma_X)} = I_3 - W_2(W_2^T\Sigma_X W_2)^{-1}W_2^T\Sigma_X = I_3 - W_2\Delta^{-1}\Delta W_2^T = Q_{W_2}.$$
The conclusion that we have reached the stopping point now follows since $\mathrm{span}(\Sigma_{X,Y}) = \mathrm{span}(W_2)$.
In short, only two reduced predictors $W_2^T X = (x_1, x_2)^T$ are needed to describe fully the regression of $Y$ on $X$.
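A quick numerical check of this example, written as a self-contained numpy sketch (our own illustration; the matrices are those displayed above), reproduces the two weight vectors up to sign and confirms that the stopping criterion is met at $q = 2$.

```python
import numpy as np

SigmaX = np.array([[1.0, 4/3, 0.0],
                   [4/3, 4.0, 0.0],
                   [0.0, 0.0, 5.0]])
SigmaXY = np.array([[5.0, 0.0],
                    [0.0, 4.0],
                    [0.0, 0.0]])

def l1(M):
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1]

w1 = l1(SigmaXY @ SigmaXY.T)                       # proportional to (1, 0, 0)^T
W1 = w1.reshape(-1, 1)
Q1 = np.eye(3) - W1 @ np.linalg.solve(W1.T @ SigmaX @ W1, W1.T @ SigmaX)
R1 = Q1.T @ SigmaXY                                # rank 1, second row (-20/3, 4)
w2 = l1(R1 @ R1.T)                                 # proportional to (0, 1, 0)^T
W2 = np.column_stack([w1, w2])
Q2 = np.eye(3) - W2 @ np.linalg.solve(W2.T @ SigmaX @ W2, W2.T @ SigmaX)
print(np.allclose(Q2.T @ SigmaXY, 0.0))            # True: the algorithm stops at q = 2
```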

3.1.3 Sample structure of the NIPALS algorithm


We see from (3.2) and (3.3) that Xd+1 and Yd+1 are constructed by projecting
the columns of Xd and Yd onto the orthogonal complement of span(Xd wd ).
Repeating this operation, we see that the columns of Xd+1 and Yd+1 are
constructed from a sequence of projections:

$$\left.\begin{aligned}
X_{d+1} &= Q_{X_d w_d}Q_{X_{d-1}w_{d-1}}\cdots Q_{X_2 w_2}Q_{X_1 w_1}X_1\\
Y_{d+1} &= Q_{X_d w_d}Q_{X_{d-1}w_{d-1}}\cdots Q_{X_2 w_2}Q_{X_1 w_1}Y_1
\end{aligned}\right\} \qquad (3.6)$$

To better understand these projections it is necessary to characterize the


relationships between the score vectors sj = Xj wj and between the weights
wj . Justifications for the following two lemmas are given in Appendices A.3.1
and A.3.2.

Lemma 3.1. Following the notation from Table 3.1(a), for the sample version
of the NIPALS algorithm

WdT Wd = Id , d = 1, . . . , q.

Lemma 3.2. For q > 1, d = 1, . . . , q − 1 and j = 1, . . . , d,

XTd+1 sj = XTd+1 Xj wj = 0.

Consequently,

Psd+1 sj = PXd+1 wd+1 Xj wj = 0


Qsd+1 Qsj = I − Psd+1 − Psj ,

again for q > 1, d = 1, . . . , q − 1 and j = 1, . . . , d.



It follows from this lemma that the score vectors sd are mutually orthog-
onal, which allows a more informative version of the deflations (3.6). Recall
from Table 3.1 that Sd = (s1 , . . . , sd ) and from (1.3) that

QSd = I − PSd = I − Sd (SdT Sd )−1 SdT .

Then, as proven in Appendix A.3.3,

Lemma 3.3. For $d = 1, \ldots, q-1$,
$$\begin{aligned}
X_{d+1} &= (I - P_{s_d} - P_{s_{d-1}} - \cdots - P_{s_2} - P_{s_1})X_1 = (I - P_{S_d})X_1 = Q_{S_d}X_1\\
Y_{d+1} &= (I - P_{s_d} - P_{s_{d-1}} - \cdots - P_{s_2} - P_{s_1})Y_1 = (I - P_{S_d})Y_1 = Q_{S_d}Y_1\\
S_{X_d,Y_d} &= S_{X_d,Y_1}\\
l_d &= X_1^T s_d/\|s_d\|^2\\
m_d &= Y_1^T s_d/\|s_d\|^2\\
w_{d+1} &= \ell_1(X_1^T Q_{S_d}Y_1 Y_1^T Q_{S_d}X_1)\\
S_d &= X_1 W_d\\
\widehat{Y}_{\mathrm{npls}} &= X_1\widehat{\beta}_{\mathrm{npls}} = P_{S_q}Y_1,
\end{aligned}$$

b npls denotes the n × r matrix of fitted responses from the NIPALS fit
where Y
and Sd is the score matrix as defined in Table 3.1.

With the NIPALS fitted values Y b npls given in Lemma 3.3, we define the
NIPALS residuals as Rb npls = Y1 − Yb npls . These quantities allow the construc-
tion of many standard diagnostic and summary quantities, like residual plots
and the multiple correlation coefficient. Standard formal inference procedures
are problematic, however, because the scores are stochastic.
The form of wd+1 given in Lemma 3.3 shows that the deflation of Y1 is
unnecessary for computing the weight matrix:

$$w_{d+1} = \ell_1(X_1^T Q_{S_d}Y_1 Y_1^T Q_{S_d}X_1) = \ell_1(X_{d+1}^T Y_1 Y_1^T X_{d+1}).$$

Consequently, the $Y$-deflation step in the NIPALS algorithm of Table 3.1 is unnecessary for computing the PLS estimator of the regression coefficients, which gives rise to the bare bones version of the NIPALS algorithm

TABLE 3.2
Bare bones version of the NIPALS algorithm given in Table 3.1(a). The notation $Y_1 = Y$ of Table 3.1(a) is not used since here there is no iteration over $Y$.

Initialize: $X_1 = X$. Quantities subscripted with 0 are to be deleted.
Select: $q \le \min\{\mathrm{rank}(S_X), n-1\}$.
For $d = 1, \ldots, q$, compute
  sample covariance matrix: $S_{X_d,Y} = n^{-1}X_d^T Y$
  X weights: $w_d = \ell_1(S_{X_d,Y}S_{X_d,Y}^T)$
  X scores: $s_d = X_d w_d$
  X loadings: $l_d = X_d^T s_d/s_d^T s_d$
  X deflation: $X_{d+1} = X_d - s_d l_d^T = Q_{s_d}X_d = X_d Q_{w_d(S_{X_d})}$
  Append: $W_d = (W_{d-1}, w_d) \in \mathbb{R}^{p\times d}$
End; set $W = W_q$.
Compute regression coefficients: $\widehat{\beta}_{\mathrm{npls}} = W(W^T S_X W)^{-1}W^T S_{X,Y}$.

shown in Table 3.2. The estimator βbnpls given in Table 3.2 is the sample
counterpart of the population version given in Table 3.1(b). It is shown in
(3.12), following the justification given in Section 3.1.4 for the population
version.
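For readers who prefer code to pseudocode, the following is a minimal numpy sketch of the bare bones sample algorithm of Table 3.2 (our own illustration; the simulated data at the bottom use hypothetical settings chosen only to exercise the function).

```python
import numpy as np

def l1(M):
    """First eigenvector of the symmetric matrix M (largest eigenvalue)."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1:]

def nipals_bare_bones(X, Y, q):
    """Bare-bones sample NIPALS of Table 3.2 on centered X (n x p) and Y (n x r)."""
    n = X.shape[0]
    Xd = X.copy()
    W = np.empty((X.shape[1], 0))
    for _ in range(q):
        S_XdY = Xd.T @ Y / n                 # S_{X_d, Y}
        wd = l1(S_XdY @ S_XdY.T)             # weight vector
        sd = Xd @ wd                         # score vector
        ld = Xd.T @ sd / (sd.T @ sd)         # X loading
        Xd = Xd - sd @ ld.T                  # X deflation
        W = np.hstack([W, wd])
    SX, SXY = X.T @ X / n, X.T @ Y / n
    beta = W @ np.linalg.solve(W.T @ SX @ W, W.T @ SXY)
    return W, beta

# Illustrative use on simulated (hypothetical) data.
rng = np.random.default_rng(1)
n, p, r, q = 100, 6, 2, 2
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
Y = X[:, :2] @ rng.normal(size=(2, r)) + 0.1 * rng.normal(size=(n, r))
Y -= Y.mean(axis=0)
W, beta = nipals_bare_bones(X, Y, q)
print(W.shape, beta.shape)                   # (6, 2) (6, 2)
```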
Lemma 3.4 indicates also how the NIPALS algorithm deals operationally
with rank-deficient regressions in which rank(SX ) < p. Let the columns of the
p × p1 matrix V be the eigenvectors of SX with non-zero eigenvalues, p1 ≤ p.
Then we can express $X_1^T = VZ_1^T$, where $Z_1^T$ is a $p_1 \times n$ matrix that contains
the coordinates of XT1 in terms of the eigenvectors of SX . Let wd∗ and s∗d denote
the weights and scores that result from applying NIPALS to data (Z1 , Y1 ).
Then

Lemma 3.4. For d = 1, . . . , q, sd = s∗d and wd = V wd∗ .

We see from Table 3.1 and Lemma 3.3 that βbnpls depends only on X1 , Y1 ,
weights Wq and the scores Sq . As a consequence of Lemma 3.4 we can apply the
NIPALS algorithm by first reducing the data to the principal component scores
Z1 , running the algorithm on the reduced data (Z1 , Y1 ) and then transforming
back to the original scale. The eigenvectors V of SX account for 100% of the
TABLE 3.3
Bare bones version in the principal component scale of the NIPALS algorithm given in Table 3.1(a).

Initialize: $p_1 = \mathrm{rank}(S_X) \le p$, $V$ = $p \times p_1$ matrix of eigenvectors of $S_X$ with non-zero eigenvalues, $X_1 = X$, $Z_1 = X_1V$. Delete quantities with 0 subscript.
Select: $q \le \min\{p_1, n-1\}$.
For $d = 1, \ldots, q$, compute
  sample covariance matrices: $S_{Z_d} = n^{-1}Z_d^T Z_d$, $S_{Z_d,Y} = n^{-1}Z_d^T Y$
  X weights: $w_d^* = \ell_1(S_{Z_d,Y}S_{Z_d,Y}^T)$
  X scores: $s_d^* = Z_d w_d^*$
  X loadings: $l_d = Z_d^T s_d^*/s_d^{*T}s_d^*$
  X deflation: $Z_{d+1} = Z_d - s_d^* l_d^T = Q_{s_d^*}Z_d = Z_d Q_{w_d^*(S_{Z_d})}$
  Append: $W_d^* = (W_{d-1}^*, w_d^*) \in \mathbb{R}^{p_1\times d}$
End; set $W^* = W_q^*$ and $W = VW^*$.
Compute regression coefficients: $\widehat{\beta}_{\mathrm{npls}} = W(W^T S_X W)^{-1}W^T S_{X,Y} = V\{W^*(W^{*T}S_Z W^*)^{-1}W^{*T}S_{Z,Y}\}$.
Notes: $V$ might be restricted to eigenvectors that account for a high, say 98, percent of the

variation in X. Since the eigenvectors of SX corresponding to its smallest


eigenvalues may be relatively unstable, it may be worthwhile to use only the
principal components that account for a high percentage, say 98%, of the
variation in X. Cross validation could be used to compare fits with various
percentages retained. This discussion is summarized in Table 3.3. A proof of
Lemma 3.4 is given in Appendix A.3.4.

3.1.4 Population details for the NIPALS algorithm


In this section we show that the population algorithm in Table 3.1(b) cor-
responds to the sample version given in Table 3.1(a). Justifications in this
and subsequent sections make use of the following lemma, whose proof is in
Appendix A.3.5. The lemma is useful in characterizing the NIPALS algorithm
because, as represented in (3.4) and (3.5), it uses projections in various inner

products, depending on the step in the iteration.

Lemma 3.5. Let V denote a p × c matrix of rank c < p, let v denote a p × 1


vector that is not contained in span(V ), let Σ denote a p × p positive definite
matrix and let ∆ = QTV (Σ) ΣQV (Σ) . Then

(a) ∆ = QTV (Σ) Σ = ΣQV (Σ)

(b) P(V,v)(Σ) = PV (Σ) + PQV (Σ) v(Σ)

(c) QV (Σ) Qv(∆) = Q(V,v)(Σ)

(d) QTv(∆) ∆Qv(∆) = QT(V,v)(Σ) ΣQ(V,v)(Σ) .

This lemma describes various relationships between projections in the Σ


inner product. They may be familiar in the usual inner product Σ = I, but are
generally less familiar in the Σ inner product. For instance, (b) describes the
projection onto span(V, v) in the Σ inner product as the sum of the projection
onto span(V ) in the Σ inner product plus the projection in the Σ inner product
of the part of v that is orthogonal to span(V ) in the Σ inner product. If Σ = I,
(b) reduces to the relatively well-known result P(V,v) = PV + PQV v . Similarly,
when Σ = I the left hand side of (d) reduces to I − PV − PQV v , while the
right hand side becomes Q(V,v) .
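The four identities in Lemma 3.5 are easy to verify numerically. The following self-contained numpy sketch (our own check; the positive definite $\Sigma$, $V$ and $v$ are randomly generated illustrative values) prints True for each part.

```python
import numpy as np

rng = np.random.default_rng(2)
p, c = 6, 2

# Random positive definite Sigma, full-rank V, and a vector v not in span(V).
A = rng.normal(size=(p, p)); Sigma = A @ A.T + p * np.eye(p)
V = rng.normal(size=(p, c))
v = rng.normal(size=(p, 1))

def P(B, Sigma):
    """P_{B(Sigma)} = B (B^T Sigma B)^{-1} B^T Sigma, as at (1.3)."""
    return B @ np.linalg.solve(B.T @ Sigma @ B, B.T @ Sigma)

def Q(B, Sigma):
    return np.eye(B.shape[0]) - P(B, Sigma)

Delta = Q(V, Sigma).T @ Sigma @ Q(V, Sigma)
Vv = np.hstack([V, v])

print(np.allclose(Delta, Q(V, Sigma).T @ Sigma))                           # (a)
print(np.allclose(P(Vv, Sigma), P(V, Sigma) + P(Q(V, Sigma) @ v, Sigma)))  # (b)
print(np.allclose(Q(V, Sigma) @ Q(v, Delta), Q(Vv, Sigma)))                # (c)
print(np.allclose(Q(v, Delta).T @ Delta @ Q(v, Delta),
                  Q(Vv, Sigma).T @ Sigma @ Q(Vv, Sigma)))                  # (d)
```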
We take it as clear that the following population algorithm follows directly
from (3.4) and (3.5):
Direct Population NIPALS Algorithm. Construct $w_1 = \ell_1(\Sigma_{X,Y}\Sigma_{X,Y}^T)$. Then for $d = 1, 2, \ldots$,
$$\left.\begin{aligned}
\Sigma_{X_{d+1}} &= Q_{w_d(\Sigma_{X_d})}^T\Sigma_{X_d}Q_{w_d(\Sigma_{X_d})}\\
\Sigma_{X_{d+1},Y_{d+1}} &= Q_{w_d(\Sigma_{X_d})}^T\Sigma_{X_d,Y_d}\\
w_{d+1} &= \ell_1(\Sigma_{X_{d+1},Y_{d+1}}\Sigma_{X_{d+1},Y_{d+1}}^T),
\end{aligned}\right\} \qquad (3.7)$$

continuing until ΣXd+1 ,Yd+1 = 0, at which point we use q = d components.


To demonstrate that the population algorithm in Table 3.1(b) corresponds
to the sample version given in Table 3.1(a), it is sufficient to show that the
population algorithm in Table 3.1(b) is equivalent to the direct population
algorithm (3.7).

Proposition 3.1. The population algorithm given in Table 3.1(b) is equiva-


lent to the direct population algorithm. That is, they produce the same semi-
orthogonal weight matrix and the same population coefficients.

Proof. The main part of the proof is by induction. Clearly, the algorithms
produce the same w1 . At d + 1 = 2, we have

$$\begin{aligned}
\Sigma_{X_2} &= Q_{w_1(\Sigma_X)}^T\Sigma_X Q_{w_1(\Sigma_X)}\\
\Sigma_{X_2,Y_2} &= Q_{w_1(\Sigma_X)}^T\Sigma_{X,Y}\\
w_2 &= \ell_1(\Sigma_{X_2,Y_2}\Sigma_{X_2,Y_2}^T) = \ell_1(Q_{w_1(\Sigma_X)}^T\Sigma_{X,Y}\Sigma_{X,Y}^T Q_{w_1(\Sigma_X)}).
\end{aligned}$$

This step matches the corresponding step of the population NIPALS algo-
rithm in Table 3.1(b).
At d + 1 = 3, we again see that the direct NIPALS algorithm matches that
from Table 3.1(b).

$$\begin{aligned}
\Sigma_{X_3} &= Q_{w_2(\Sigma_{X_2})}^T\Sigma_{X_2}Q_{w_2(\Sigma_{X_2})} = Q_{(w_1,w_2)(\Sigma_X)}^T\Sigma_X Q_{(w_1,w_2)(\Sigma_X)}\\
\Sigma_{X_3,Y_3} &= Q_{w_2(\Sigma_{X_2})}^T\Sigma_{X_2,Y_2} = Q_{(w_1,w_2)(\Sigma_X)}^T\Sigma_{X,Y}\\
w_3 &= \ell_1(\Sigma_{X_3,Y_3}\Sigma_{X_3,Y_3}^T) = \ell_1(Q_{(w_1,w_2)(\Sigma_X)}^T\Sigma_{X,Y}\Sigma_{X,Y}^T Q_{(w_1,w_2)(\Sigma_X)}).
\end{aligned}$$

The second equality for ΣX3 follows by Lemma 3.5(d) with V = (w1 ), v = w2
and Σ = ΣX . The second equality for ΣX3 ,Y3 follows by replacing ΣX2 ,Y2 =
$Q_{w_1(\Sigma_X)}^T\Sigma_{X,Y}$ to get
$$\Sigma_{X_3,Y_3} = Q_{w_2(\Sigma_{X_2})}^T Q_{w_1(\Sigma_X)}^T\Sigma_{X,Y}$$
and then using Lemma 3.5(c). The second equality for $w_3$ follows by direct
substitution.
Assume that these relationships hold for $d = k - 1 \le q - 1$ and let $W_{k-1} = (w_1, \ldots, w_{k-1})$:
$$\begin{aligned}
\Sigma_{X_k} &= Q_{w_{k-1}(\Sigma_{X_{k-1}})}^T\Sigma_{X_{k-1}}Q_{w_{k-1}(\Sigma_{X_{k-1}})} = Q_{W_{k-1}(\Sigma_X)}^T\Sigma_X Q_{W_{k-1}(\Sigma_X)} \qquad (3.8)\\
\Sigma_{X_k,Y_k} &= Q_{w_{k-1}(\Sigma_{X_{k-1}})}^T\Sigma_{X_{k-1},Y_{k-1}} = Q_{W_{k-1}(\Sigma_X)}^T\Sigma_{X,Y} \qquad (3.9)\\
w_k &= \ell_1(\Sigma_{X_k,Y_k}\Sigma_{X_k,Y_k}^T) = \ell_1(Q_{W_{k-1}(\Sigma_X)}^T\Sigma_{X,Y}\Sigma_{X,Y}^T Q_{W_{k-1}(\Sigma_X)}).
\end{aligned}$$

To show the general result by induction we need to show that the relationships
hold for d = k. Let Wk = (Wk−1 , wk ). Then we need to show

$$\begin{aligned}
\Sigma_{X_{k+1}} &= Q_{w_k(\Sigma_{X_k})}^T\Sigma_{X_k}Q_{w_k(\Sigma_{X_k})} = Q_{W_k(\Sigma_X)}^T\Sigma_X Q_{W_k(\Sigma_X)}\\
\Sigma_{X_{k+1},Y_{k+1}} &= Q_{w_k(\Sigma_{X_k})}^T\Sigma_{X_k,Y_k} = Q_{W_k(\Sigma_X)}^T\Sigma_{X,Y}\\
w_{k+1} &= \ell_1(\Sigma_{X_{k+1},Y_{k+1}}\Sigma_{X_{k+1},Y_{k+1}}^T) = \ell_1(Q_{W_k(\Sigma_X)}^T\Sigma_{X,Y}\Sigma_{X,Y}^T Q_{W_k(\Sigma_X)}).
\end{aligned}$$

The first equality for ΣXk+1 follows by definition. The second equality for
ΣXk+1 follows by first substituting for

$$\Sigma_{X_k} = Q_{W_{k-1}(\Sigma_X)}^T\Sigma_X Q_{W_{k-1}(\Sigma_X)}$$
from the induction hypothesis to get
$$\Sigma_{X_{k+1}} = Q_{w_k(\Sigma_{X_k})}^T Q_{W_{k-1}(\Sigma_X)}^T\Sigma_X Q_{W_{k-1}(\Sigma_X)}Q_{w_k(\Sigma_{X_k})}.$$
The desired conclusion for ΣXk+1 then follows from Lemma 3.5(c) with
V = Wk−1 , v = wk and Σ = ΣX . The first equality for ΣXk+1 ,Yk+1 fol-
lows by definition. The second equality for ΣXk+1 ,Yk+1 follows by replacing
ΣXk ,Yk = QTWk−1 (ΣX ) ΣX,Y to get

$$\Sigma_{X_{k+1},Y_{k+1}} = Q_{w_k(\Sigma_{X_k})}^T Q_{W_{k-1}(\Sigma_X)}^T\Sigma_{X,Y},$$
and then using Lemma 3.5(c).


It remains to show that the population version of the estimator
βpls = Wq (LTq Wq )−1 MqT given in Table 3.1(a) corresponds to βnpls in Ta-
b
ble 3.1(b), assuming that the stopping points are the same. This task re-
quires that we deal with the population loading matrices Lq = (l1 , . . . , lq ) and
Mq = (m1 , . . . , mq ). Let W0 = 0. From Table 3.1,
XTk sk XTk Xk wk SXk wk
lk = T
= T T
= .
sk sk wk Xk Xk wk wk SXk wk
For the population version we replace SXk with ΣXk , implicitly substitute the
population version of wk and then use (3.8) to get for k = 0, . . . , q − 1,
QTWk (ΣX ) ΣX QWk (ΣX ) wk+1 QWk (ΣX ) wk+1
lk+1 = T QT
= ΣX T T
wk+1 Wk (ΣX ) ΣX QWk (ΣX ) wk+1 wk+1 QWk (ΣX ) ΣX QWk (ΣX ) wk+1
QWk (ΣX ) wk+1
mk+1 = ΣTX,Y T T
,
wk+1 QWk (ΣX ) ΣX QWk (ΣX ) wk+1

where the second equality for lk+1 follows from Lemma 3.5(a) and the form
of mk+1 is found following the same general steps as we did for lk+1 .
The role of the numerators QWk (ΣX ) wk+1 is to provide a successive orthog-
onalization of columns of Wq = (w1 , . . . , wq ), while the denominators provide
a normalization in the ΣX inner product. Thus,
$$\mathrm{span}\left\{\frac{Q_{W_k(\Sigma_X)}w_{k+1}}{w_{k+1}^T Q_{W_k(\Sigma_X)}^T\Sigma_X Q_{W_k(\Sigma_X)}w_{k+1}} \;\Bigm|\; k = 0, 1, \ldots, q-1\right\} = \mathrm{span}(W_q), \qquad (3.10)$$
and in consequence there is a nonsingular $q\times q$ matrix $A$ so that $L_q = \Sigma_X W_q A$. In the same way we have $M_q = \Sigma_{X,Y}^T W_q A$.

Substituting these forms into the population version of the PLS estimator
given in Table 3.1(a), we get

$$\begin{aligned}
\beta_{\mathrm{pls}} &= W_q(L_q^T W_q)^{-1}M_q^T\\
&= W_q(A^T W_q^T\Sigma_X W_q)^{-1}A^T W_q^T\Sigma_{X,Y}\\
&= W_q(W_q^T\Sigma_X W_q)^{-1}W_q^T\Sigma_{X,Y} \qquad (3.11)\\
&= W_q\beta_{Y|W_q^T X},
\end{aligned}$$

which is βnpls given in Table 3.1(b). The semi-orthogonality of the weight


matrix Wq can be seen as follows. From Table 3.1(b),

wd+1 = `1 (QTWd (ΣX ) ΣX,Y ΣTX,Y QWd (ΣX ) ).

Thus, there is a vector $v \in \mathbb{R}^r$ so that $w_{d+1} = Q_{W_d(\Sigma_X)}^T\Sigma_{X,Y}v$. But $W_d^T Q_{W_d(\Sigma_X)}^T = 0$, and so $W_d^T w_{d+1} = 0$ for all $d = 1, \ldots, q$. □

Recall that the notation $\beta_{Y|W_q^TX}$ means the coefficients from the population OLS fit of $Y$ on the reduced predictors $W_q^TX$. From this we see that the normalization of $l_d$ and $m_d$ in Table 3.1 plays no essential role as it has no effect on the subspace equality in (3.10). The sample version $\widehat{\beta}_{\mathrm{pls}}$ of $\beta_{\mathrm{pls}}$, as shown in Table 3.2, follows immediately from (3.11):
$$\widehat{\beta}_{\mathrm{pls}} = W(W^T S_X W)^{-1}W^T S_{X,Y}, \qquad (3.12)$$
where $W$ is the sample version of $W_q$ as defined in Table 3.2.
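The agreement between the loading-based form of Table 3.1(a) and the weight-based form (3.12) also holds exactly in the sample, and it is easy to check numerically. The following self-contained numpy sketch (our own illustration; the simulated data are hypothetical) runs the sample NIPALS algorithm keeping both loading matrices and compares the two expressions.

```python
import numpy as np

def l1(M):
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1:]

def nipals_full(X, Y, q):
    """Sample NIPALS of Table 3.1(a), returning weights and both loading matrices."""
    n, p = X.shape
    Xd, Yd = X.copy(), Y.copy()
    W = np.empty((p, 0)); L = np.empty((p, 0)); M = np.empty((Y.shape[1], 0))
    for _ in range(q):
        S = Xd.T @ Yd / n
        wd = l1(S @ S.T)
        sd = Xd @ wd
        ld = Xd.T @ sd / (sd.T @ sd)
        md = Yd.T @ sd / (sd.T @ sd)
        Xd, Yd = Xd - sd @ ld.T, Yd - sd @ md.T
        W = np.hstack([W, wd]); L = np.hstack([L, ld]); M = np.hstack([M, md])
    return W, L, M

rng = np.random.default_rng(3)
n, p, r, q = 80, 5, 2, 3
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
Y = X @ rng.normal(size=(p, r)) + 0.2 * rng.normal(size=(n, r)); Y -= Y.mean(axis=0)

W, L, M = nipals_full(X, Y, q)
SX, SXY = X.T @ X / n, X.T @ Y / n
beta_loadings = W @ np.linalg.solve(L.T @ W, M.T)            # Table 3.1(a) form
beta_weights = W @ np.linalg.solve(W.T @ SX @ W, W.T @ SXY)  # form (3.12)
print(np.allclose(beta_loadings, beta_weights))              # True
```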



3.2 Envelopes and NIPALS


In this section we concentrate on population characteristics to establish first
connections between NIPALS and envelopes. The population stopping condi-
tion
QTWq (ΣX ) ΣX,Y = 0 (3.13)
will play a key role in our discussion. Some results in this section can be
deduced from properties of Algorithm N in Section 1.5.3. We provide separate
demonstrations in this section to aid intuition and establish connections with
NIPALS.

3.2.1 The NIPALS subspace reduces ΣX and contains B


The goal of this section is to show that the NIPALS subspace span(Wq )
reduces ΣX and contains B without introducing an envelope construction
per se. Assume first that the NIPALS algorithm stops after getting w1 ,
so then we have ΣX,Y = PwT1 (ΣX ) ΣX,Y , which follows directly from the
algorithm's stopping condition that $Q_{w_1(\Sigma_X)}^T\Sigma_{X,Y} = 0$. This tells us that
span(ΣX,Y ) = span(ΣX w1 ), so there is an r × 1 vector c that satisfies
$\Sigma_{X,Y} = \Sigma_X w_1 c^T$. In consequence, $\beta = \Sigma_X^{-1}\Sigma_{X,Y} = w_1c^T$, $\mathrm{rank}(\beta) = 1$ and
B = span(w1 ). Further, since w1 is an eigenvector of ΣX,Y ΣTX,Y by construc-
tion, we have for some r ×1 vector b, ΣX,Y = w1 bT and thus w1 bT = ΣX w1 cT ,
so w1 must be also an eigenvector of ΣX . Summarizing, if the population
NIPALS algorithm stops at q = 1 then
(a) β = w1 cT ,
(b) B = span(w1 ),
(c) dim{B} = 1,
(d) w1 is an eigenvector of both ΣX,Y ΣTX,Y and ΣX .
The stopping criterion for a general q implies immediately that
ΣX,Y = ΣX Wq (WqT ΣX Wq )−1 WqT ΣX,Y . (3.14)
Therefore,
$$\left.\begin{aligned}
\beta &= \Sigma_X^{-1}\Sigma_{X,Y} = P_{W_q(\Sigma_X)}\beta\\
\mathcal{B} &\subseteq \mathrm{span}(W_q)\\
\dim(\mathcal{B}) &\le q.
\end{aligned}\right\} \qquad (3.15)$$

These are the general q counterparts for conclusions (a)–(c) with q = 1. To


demonstrate an extension of conclusion (d) consider

$$\begin{aligned}
Q_{W_d(\Sigma_X)}^T\Sigma_{X,Y} &= (I - \Sigma_X W_d(W_d^T\Sigma_X W_d)^{-1}W_d^T)\Sigma_{X,Y}\\
&= \Sigma_{X,Y} - \Sigma_X W_d(W_d^T\Sigma_X W_d)^{-1}W_d^T\Sigma_{X,Y}.
\end{aligned}$$

Next, we substitute (3.14) for the first ΣX,Y on the right hand side and write
the first Wd = Wq (Id , 0)T to get

$$\begin{aligned}
Q_{W_d(\Sigma_X)}^T\Sigma_{X,Y} &= \Sigma_X W_q(W_q^T\Sigma_X W_q)^{-1}W_q^T\Sigma_{X,Y} - \Sigma_X W_q(I_d, 0)^T(W_d^T\Sigma_X W_d)^{-1}W_d^T\Sigma_{X,Y}\\
&= \Sigma_X W_q\left\{(W_q^T\Sigma_X W_q)^{-1}W_q^T - (I_d, 0)^T(W_d^T\Sigma_X W_d)^{-1}W_d^T\right\}\Sigma_{X,Y}\\
&:= \Sigma_X W_q A_d,
\end{aligned}$$

where the $q \times r$ matrix $A_d$ is defined implicitly. In consequence, we have for $d = 0, 1, \ldots, q-1$
$$w_{d+1} = \ell_1\left(\Sigma_X W_q A_d A_d^T W_q^T\Sigma_X\right).$$
Since $w_{d+1} \in \mathrm{span}(\Sigma_X W_q)$, this implies that $w_{d+1}$ can be represented as


wd+1 = ΣX Wq cd for some q × 1 vector cd and thus that there is a q × q matrix
C = (c1 , . . . cq ) so that Wq = ΣX Wq C. Since Wq has full column rank and
ΣX is nonsingular, C must be nonsingular and so

ΣX Wq = Wq C −1 . (3.16)

This result tells us that span(Wq ) must be a reducing subspace of ΣX , as


described in Definition 1.1.
To see that span(Wq ) contains span(ΣX,Y ) we substitute the right hand
side of (3.16) for the first product ΣX Wq on the right hand side of (3.14) to
get
$$\Sigma_{X,Y} = W_qC^{-1}(W_q^T\Sigma_X W_q)^{-1}W_q^T\Sigma_{X,Y},$$
from which it follows that $\mathrm{span}(\Sigma_{X,Y}) \subseteq \mathrm{span}(W_q)$.
Altogether then,

Proposition 3.2. span(Wq ) is a reducing subspace of ΣX that contains


span(ΣX,Y ) and B.

This result characterizes span(Wq ) as a reducing subspace of ΣX that con-


tains B. But to establish that span(Wq ) is an envelope, we need to demonstrate
also that it is minimal.

3.2.2 The NIPALS weights span the ΣX -envelope of B


Taken together, results (3.15) and (3.16) characterize span(Wq ) as a reducing
subspace of ΣX that contains B or, equivalently, contains span(ΣX,Y ), as men-
tioned in the discussion following Proposition 1.6. Those results were obtained
without making use of envelopes per se. We now make use of that connection to
establish further properties of the NIPALS subspaces span(Wd ), d = 1, . . . , q,
including that the NIPALS subspace span(Wq ) is the ΣX -envelope of B; that
is, the smallest reducing subspace of ΣX that contains span(ΣX,Y ).
Let Φ ∈ Rp×q denote a semi-orthogonal basis matrix for the intersection of
all reducing subspaces of ΣX that contain span(ΣX,Y ) and let (Φ, Φ0 ) ∈ Rp×p
be an orthogonal matrix. Then we know from Proposition 1.2 that ΣX can be
expressed as
ΣX = Φ∆ΦT + Φ0 ∆0 ΦT0 .
We proceed by induction to show that the NIPALS weights $w_j$, $j = 1, \ldots, q$, are all in the envelope and that the algorithm terminates with $q$ weights.
By construction $\Sigma_{X,Y} = P_\Phi\Sigma_{X,Y}$ and consequently $w_1 = \ell_1(P_\Phi\Sigma_{X,Y}\Sigma_{X,Y}^T P_\Phi)$ is in the envelope, $w_1 \in \mathrm{span}(\Phi)$. Represent $w_1 = \Phi b_1$. Then
$$\begin{aligned}
Q_{w_1(\Sigma_X)}^T\Sigma_{X,Y} &= \Sigma_{X,Y} - P_{w_1(\Sigma_X)}^T\Sigma_{X,Y}\\
&= P_\Phi\Sigma_{X,Y} - \Phi P_{b_1(\Delta)}^T\Phi^T\Sigma_{X,Y}\\
&= \Phi Q_{b_1(\Delta)}^T\Phi^T\Sigma_{X,Y}.
\end{aligned}$$

Consequently,
$$\begin{aligned}
w_2 &= \ell_1\left\{Q_{w_1(\Sigma_X)}^T\Sigma_{X,Y}\Sigma_{X,Y}^T Q_{w_1(\Sigma_X)}\right\}\\
&= \ell_1\left\{\Phi Q_{b_1(\Delta)}^T\Phi^T\Sigma_{X,Y}\Sigma_{X,Y}^T\Phi Q_{b_1(\Delta)}\Phi^T\right\}\\
&\in \mathrm{span}(\Phi Q_{b_1(\Delta)}^T) \subset \mathrm{span}(\Phi)
\end{aligned}$$

and so span(W2 ) = span(w1 , w2 ) ⊆ span(Φ). Additionally, since w1 = Φb1 it


follows that w1T ΦQTb1 (∆) = 0 and thus w1T w2 = 0, as claimed in Table 3.1(c)
and shown in Lemma 3.1.
For the induction hypothesis we assume that Wd−1 = span(w1 , . . . , wd−1 ) ⊆
span(Φ), where the wj ’s are orthogonal vectors. Then using the previous
argument by replacing w1 with Wd−1 , it follows that wd ∈ span(Φ), and
that span(Wd−1 ) is a proper subspace of span(Wd ); that is, span(Wd−1 ) ⊂
span(Wd ) ⊆ span(Φ). This demonstrates that the dimension of the NI-
PALS subspaces span(Wd ) continue to increase until d = q, at which point

span(Wq ) = span(Φ) because span(Wq ) ⊆ span(Φ) and q = dim (span(Φ)) =


dim (span(Wq )). The growth of the NIPALS subspaces stops at this point
because then the stopping criterion (3.13) is met.
Summarizing,

Proposition 3.3. The population NIPALS algorithm produces a series of


properly nested subspaces that settle on the envelope after q steps,

span(W1 ) ⊂ span(W2 ) ⊂ · · · ⊂ span(Wq ) = EΣX (B),

where q = dim[EΣX {B}].

3.3 SIMPLS algorithm


According to de Jong (1993),

. . . the name SIMPLS for [his] PLS algorithm, since it is a


straightforward implementation of a statistically inspired modification
of the PLS method according to the simple concept given in Table 1.

SIMPLS then followed in the footsteps of NIPALS as a relatively “simple”


method for reducing the predictors to alleviate collinearity and sample size
issues.

3.3.1 Synopsis
Table 3.4(a) gives the SIMPLS data-based algorithm as it appears in de Jong
(1993, Table 1). The weight vectors and weight matrices are denoted as v
and V to distinguish them from the NIPALS weights. The scores and score
matrices are denoted as s and S, and the loadings and loading matrices as l
and L. de Jong (1993, Appendix) also discussed a more elaborate version of
the algorithm that contains steps to facilitate computation. Like the NIPALS
algorithm, the SIMPLS sample algorithm does not contain a mechanism for
stopping and so q is again typically determined by predictive cross validation
or a holdout sample. Also like the NIPALS algorithm, the sample SIMPLS
algorithm does not require SX to be nonsingular. However, when SX is non-
singular, the SIMPLS estimator can be represented as the projection of the

TABLE 3.4
SIMPLS algorithm: (a) sample version adapted from de Jong (1993, Table 1). The $n \times p$ matrix $X$ contains the centered predictors and the $n \times r$ matrix $Y$ contains the centered responses; (b) population version derived herein.

(a) Sample Version
Initialize: $S_{X,Y} = n^{-1}X^T Y$, $q \le \min\{\mathrm{rank}(S_X), n-1\}$, $v_1$ = first left singular vector of $S_{X,Y}$, $V_1 = (v_1)$, $s_1 = Xv_1$, $S_1 = (s_1)$, $l_1 = X^T s_1/s_1^T s_1$, $L_1 = (l_1)$.
For $d = 2, \ldots, q$
  Compute weights: $v_d$ = first left singular vector of $Q_{L_{d-1}}S_{X,Y}$
  Compute scores: $s_d = Xv_d$
  Compute loadings: $l_d = X^T s_d/s_d^T s_d$
  Append $v_d, s_d, l_d$ onto $V_{d-1}, S_{d-1}, L_{d-1}$
End
Compute reg. coefficients: $\widehat{\beta}_{\mathrm{spls}} = V_q(V_q^T S_X V_q)^{-1}V_q^T S_{X,Y}$

(b) Population Version
Initialize: $v_1 = \ell_1(\Sigma_{X,Y}\Sigma_{X,Y}^T)$, $V_1 = (v_1)$
For $d = 1, 2, \ldots$
  End if $Q_{\Sigma_X V_d}\Sigma_{X,Y} = 0$ and then set $q = d$
  Compute weights: $v_{d+1} = \ell_1(Q_{\Sigma_X V_d}\Sigma_{X,Y}\Sigma_{X,Y}^T Q_{\Sigma_X V_d})$
  Append: $V_{d+1} = (V_d, v_{d+1})$
Compute reg. coefficients: $\beta_{\mathrm{spls}} = V_q(V_q^T\Sigma_X V_q)^{-1}V_q^T\Sigma_{X,Y} = V_q\beta_{Y|V_q^T X}$

(c) Notes
Weights: $V_q^T\Sigma_X V_q$ is a diagonal matrix.
Envelope connection: $\mathrm{span}(V_q) = \mathcal{E}_{\Sigma_X}(\mathcal{B})$, the $\Sigma_X$-envelope of $\mathcal{B} = \mathrm{span}(\beta)$.
Projections: $Q_{\Sigma_X V_d}$ and $Q_{L_{d-1}}$ use the usual inner product. See (1.3).
Scores & Loadings: $S_d$ and $L_d$ are traditional computational intermediaries, but are not needed per se for $\widehat{\beta}_{\mathrm{spls}}$.
Algorithm S: This is an instance of Algorithm S discussed in §§1.5.4 & 11.1.
PLS1 v. PLS2: Algorithm is applicable for PLS1 or PLS2 fits; see §3.10.

OLS estimator βbols onto span(Vq ) in the SX inner product:

βbspls = PVq (SX ) βbols ,

which requires that SX be positive definite and so does not have a direct
sample counterpart when n < p. Recall that projections are defined at (1.3),
so
PVq (SX ) = Vq (VqT SX Vq )−1 VqT SX
and QVq (SX ) = I − PVq (SX ) .
Maintaining the convention established for NIPALS, we use vd and Vd to
denote weight vectors and weight matrices in both the sample and population.
The population version of the SIMPLS algorithm, which will be justified
herein, is shown in Table 3.4(b). Substituting SX and SX,Y for their popu-
lation counterparts ΣX and ΣX,Y produces the same weights and the same
estimated coefficient vector as the sample version in Table 3.4(a), provided
the same value of q is used. The score and loading vectors computed in Ta-
ble 3.4(a) are not really necessary for the algorithm.
The population version in Table 3.4(b) is a special case of Algorithm S
described previously in Section 1.5.4 with A = ΣX,Y ΣTX,Y and M = ΣX .
Consequently, reasoning from (3.17), the SIMPLS algorithm can be described
also as follows, still using the notation of Table 3.4(b). Let Vi = (v1 , . . . , vi ).
Then given Vk , k < q, Vk+1 is constructed by concatenating Vk with

$$v_{k+1} = \arg\max_{v\in\mathbb{R}^p} v^T\Sigma_{X,Y}\Sigma_{X,Y}^T v \quad\text{subject to}\quad v^T\Sigma_X V_k = 0 \text{ and } v^T v = 1. \qquad (3.17)$$

This is the description of SIMPLS that Cook, Helland, and Su (2013, Section 4.3) used to establish the connection between PLS algorithms and envelopes.
The population construction algorithm (3.17) is shown in Section 3.3.3, equa-
tions (3.18) and (3.19).
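As with NIPALS, the population SIMPLS recursion of Table 3.4(b) is short enough to code directly. The following minimal numpy sketch is our own illustration (not the book's accompanying software); the stopping rule again uses a numerical tolerance in place of the exact condition $Q_{\Sigma_X V_d}\Sigma_{X,Y} = 0$.

```python
import numpy as np

def l1(M):
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1:]

def simpls_population(SigmaX, SigmaXY, tol=1e-10, max_q=None):
    """Population SIMPLS of Table 3.4(b): weight matrix V and beta_spls."""
    p = SigmaX.shape[0]
    max_q = p if max_q is None else max_q
    V = l1(SigmaXY @ SigmaXY.T)                  # v_1
    for _ in range(max_q - 1):
        SV = SigmaX @ V                          # Q_{Sigma_X V_d} uses the usual inner product
        Q = np.eye(p) - SV @ np.linalg.solve(SV.T @ SV, SV.T)
        R = Q @ SigmaXY
        if np.linalg.norm(R) < tol:              # stopping criterion (3.20)
            break
        V = np.hstack([V, l1(R @ R.T)])
    beta = V @ np.linalg.solve(V.T @ SigmaX @ V, V.T @ SigmaXY)
    return V, beta
```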

3.3.2 SIMPLS example


In this section we revisit the example of Section 3.1.2 using SIMPLS instead
of NIPALS. Recall that p = 3, r = 2,
   
$$\Sigma_X = \begin{pmatrix} 1 & 4/3 & 0\\ 4/3 & 4 & 0\\ 0 & 0 & 5 \end{pmatrix}, \qquad \Sigma_{X,Y} = \begin{pmatrix} 5 & 0\\ 0 & 4\\ 0 & 0 \end{pmatrix},$$

and that the first weight vector is the same as that for NIPALS, v1 = (1, 0, 0)T .
As in Section 3.1.2, v1 ∈ span(ΣX,Y ) and span(ΣX,Y ) reduces ΣX . Direct
computation of the second weight vector

$$v_2 = \ell_1(Q_{\Sigma_X v_1}\Sigma_{X,Y}\Sigma_{X,Y}^T Q_{\Sigma_X v_1})$$

is not as straightforward as it was for NIPALS since we are not now working in
the ΣX inner product. However, computation is facilitated by using a change
of basis for span(ΣX,Y ) that explicitly incorporates ΣX v1 = (1, 4/3, 0)T . Let
$$A = \begin{pmatrix} 5^{-1} & -5^{-1}\\ 3^{-1} & 3/16 \end{pmatrix}.$$
Then,
$$\Sigma_{X,Y}A = \begin{pmatrix} 1 & -1\\ 4/3 & 3/4\\ 0 & 0 \end{pmatrix},$$
$$P_{\Sigma_X v_1}\Sigma_{X,Y} = (P_{\Sigma_X v_1}\Sigma_{X,Y}A)A^{-1} = \begin{pmatrix} 1 & 0\\ 4/3 & 0\\ 0 & 0 \end{pmatrix}A^{-1},$$
and
$$Q_{\Sigma_X v_1}\Sigma_{X,Y} = (\Sigma_{X,Y}A - P_{\Sigma_X v_1}\Sigma_{X,Y}A)A^{-1} = \begin{pmatrix} 0 & -1\\ 0 & 3/4\\ 0 & 0 \end{pmatrix}A^{-1},$$
which has rank 1. It follows that $v_2 = a/\|a\|$ and $V_2 = (v_1, v_2)$, where $a = (-1, 3/4, 0)^T$.
The weight vectors for NIPALS are orthogonal, $w_1^Tw_2 = 0$, but here the SIMPLS weight vectors are not orthogonal, $v_1^Tv_2 \neq 0$. However, they are orthogonal in the $\Sigma_X$ inner product,
$$V_2^T\Sigma_X V_2 = \begin{pmatrix} 1 & 0\\ 0 & 4/5 \end{pmatrix}.$$

For the next iteration we need to find $Q_{\Sigma_X V_2}\Sigma_{X,Y}$. But span($V_2$) is a reducing subspace of $\Sigma_X$ and so $Q_{\Sigma_X V_2} = Q_{V_2}$. Then the algorithm terminates:
QΣX V2 ΣX,Y = QV2 ΣX,Y = 0 since span(ΣX,Y ) = span(V2 ). Although the
NIPALS weights W2 are not equal to the SIMPLS weights V2 , they span the
same subspace and in consequence βnpls = βspls .
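A short self-contained numpy check of this comparison (our own illustration, using the matrices displayed above) shows that the second NIPALS and SIMPLS weight vectors differ while their spans, and hence the coefficient matrices, agree.

```python
import numpy as np

SigmaX = np.array([[1.0, 4/3, 0.0],
                   [4/3, 4.0, 0.0],
                   [0.0, 0.0, 5.0]])
SigmaXY = np.array([[5.0, 0.0],
                    [0.0, 4.0],
                    [0.0, 0.0]])

def l1(M):
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1:]

w1 = l1(SigmaXY @ SigmaXY.T)                        # shared first weight vector

# NIPALS second weight: projection in the Sigma_X inner product.
Qn = np.eye(3) - w1 @ np.linalg.solve(w1.T @ SigmaX @ w1, w1.T @ SigmaX)
w2 = l1(Qn.T @ SigmaXY @ SigmaXY.T @ Qn)
W2 = np.hstack([w1, w2])

# SIMPLS second weight: projection onto the complement of span(Sigma_X w_1).
u = SigmaX @ w1
Qs = np.eye(3) - u @ np.linalg.solve(u.T @ u, u.T)
v2 = l1(Qs @ SigmaXY @ SigmaXY.T @ Qs)
V2 = np.hstack([w1, v2])

def beta(W):
    return W @ np.linalg.solve(W.T @ SigmaX @ W, W.T @ SigmaXY)

# Different weight vectors, same span and hence the same coefficient matrix.
print(np.allclose(w2, v2))                          # False
print(np.linalg.matrix_rank(np.hstack([W2, V2])))   # 2: same subspace
print(np.allclose(beta(W2), beta(V2)))              # True
```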

3.3.3 Details underlying the SIMPLS algorithm


In this section we take a closer look at the sample SIMPLS algorithm by
stepping through a few values of d, and in doing so we establish justification
for its population version shown in Table 3.4(b).
On the first pass (d = 1) through the sample SIMPLS algorithm we com-
pute the first left singular vector of SX,Y . This is equivalent to computing

$$v_1 = \ell_1(S_{X,Y}S_{X,Y}^T).$$

The first score vector is then computed as s1 = Xv1 , which gives the first
linear combination of the predictors. The first loading vector is

$$l_1 = \frac{X^T s_1}{s_1^T s_1} = \frac{X^T X v_1}{v_1^T X^T X v_1} = \frac{S_X v_1}{v_1^T S_X v_1}.$$

We summarize the computations for d = 1 as follows.

$$\begin{aligned}
v_1 &= \ell_1(S_{X,Y}S_{X,Y}^T)\\
s_1 &= Xv_1\\
l_1 &= \frac{S_X v_1}{v_1^T S_X v_1}\\
V_1 &= (v_1); \quad S_1 = (s_1); \quad L_1 = (l_1).
\end{aligned}$$

In the second pass through the algorithm, d = 2, we first compute the first
left singular vector of QL1 SX,Y . From the computations from the first step
d = 1, span(L1 ) = span(SX v1 ). In other words, the normalization by v1T SX v1
in the computation of l1 is unnecessary. Thus we can compute the first left
singular vector of QSX v1 SX,Y . Following the logic expressed in step d = 1, we
then have
$$v_2 = \ell_1(Q_{S_Xv_1}S_{X,Y}S_{X,Y}^T Q_{S_Xv_1}).$$

The rest of the steps in the second pass through the algorithm are similar to
those in the first pass, so we summarize the second pass as

$$\begin{aligned}
v_2 &= \ell_1(Q_{S_Xv_1}S_{X,Y}S_{X,Y}^T Q_{S_Xv_1})\\
s_2 &= Xv_2\\
l_2 &= \frac{S_Xv_2}{v_2^TS_Xv_2}\\
V_2 &= (v_1, v_2); \quad S_2 = (s_1, s_2) = XV_2;\\
L_2 &= (l_1, l_2) = S_XV_2\,\mathrm{diag}^{-1}(v_1^TS_Xv_1, v_2^TS_Xv_2).
\end{aligned}$$

For completeness, we now state the results for a general $d$-th pass through the algorithm:
$$\begin{aligned}
v_d &= \ell_1(Q_{S_XV_{d-1}}S_{X,Y}S_{X,Y}^T Q_{S_XV_{d-1}}) = \ell_1(Q_{L_{d-1}}S_{X,Y}S_{X,Y}^T Q_{L_{d-1}})\\
s_d &= Xv_d\\
l_d &= \frac{S_Xv_d}{v_d^TS_Xv_d}\\
V_d &= (v_1, v_2, \ldots, v_d); \quad S_d = (s_1, s_2, \ldots, s_d) = XV_d;\\
L_d &= (l_1, l_2, \ldots, l_d) = S_XV_d\,\mathrm{diag}^{-1}(v_1^TS_Xv_1, v_2^TS_Xv_2, \ldots, v_d^TS_Xv_d),
\end{aligned}$$
where $Q_{S_XV_{d-1}} = Q_{L_{d-1}}$. From this general step, it can be seen that the final weight matrix $V_q$, $S_X$, and $S_{X,Y}$ are all that is needed to compute the SIMPLS estimator $\widehat{\beta}_{\mathrm{spls}}$ of the coefficient matrix after $q$ steps:
$$\widehat{\beta}_{\mathrm{spls}} = V_q(V_q^TS_XV_q)^{-1}V_q^TS_{X,Y} = V_q\widehat{\beta}_{Y|V_q^TX}.$$

The score matrix Sq = XVq gives the n × q matrix of reduced predictor values.
It may be of interest for interpretation and graphical studies. The loading
matrix Lq is essentially a matrix of normalized scores.
In view of the previous observations, we can reduce the SIMPLS algorithm to the following compact version. Construct $v_1 = \ell_1(S_{X,Y}S_{X,Y}^T)$ and $V_1 = (v_1)$. Then for $d = 1, \ldots, q-1$,
$$\begin{aligned}
v_{d+1} &= \ell_1(Q_{L_d}S_{X,Y}S_{X,Y}^T Q_{L_d}) \qquad (3.18)\\
&= \arg\max_{h^Th = 1,\; h^TS_XV_d = 0} h^TS_{X,Y}S_{X,Y}^Th \qquad (3.19)\\
V_{d+1} &= (V_d, v_{d+1}).
\end{aligned}$$

Replacing SX,Y and SX with their population counterparts ΣX,Y and ΣX


and imposing the stopping criterion QΣX Vd+1 ΣX,Y = 0 gives the population
SIMPLS algorithm shown in Table 3.4(b).
To see the equivalence of the two equalities, (3.18) and (3.19), for $v_{d+1}$, we first show that (3.18) implies (3.19). Form (3.18) can be written as
$$v_{d+1} = \arg\max_{h\neq 0}\frac{h^TQ_{L_d}S_{X,Y}S_{X,Y}^TQ_{L_d}h}{h^Th}.$$

Decompose $h = Q_{L_d}h + P_{L_d}h := h_1 + h_2$ to get
$$\begin{aligned}
v_{d+1} &= \arg\max_{h_1\in\mathrm{span}^\perp(L_d),\; h_2\in\mathrm{span}(L_d)}\frac{h_1^TS_{X,Y}S_{X,Y}^Th_1}{h_1^Th_1 + h_2^Th_2}\\
&= \arg\max_{h_1\in\mathrm{span}^\perp(L_d),\; h_2 = 0}\frac{h_1^TS_{X,Y}S_{X,Y}^Th_1}{h_1^Th_1 + h_2^Th_2}\\
&= \arg\max_{h_1\in\mathrm{span}^\perp(L_d)}\frac{h_1^TS_{X,Y}S_{X,Y}^Th_1}{h_1^Th_1}.
\end{aligned}$$
Then $v_{d+1}\in\mathrm{span}^\perp(L_d)$, which holds if and only if $v_{d+1}^TS_XV_d = 0$, and so (3.19) holds.
To see that (3.19) implies (3.18), we must have vd+1 ∈ span⊥ (Ld ) from
(3.19). Consequently, we can replace h in (3.19) with QLd h and (3.19) becomes
equivalently

$$v_{d+1} = \arg\max_{h^Th=1} h^TQ_{L_d}S_{X,Y}S_{X,Y}^TQ_{L_d}h = \ell_1(Q_{L_d}S_{X,Y}S_{X,Y}^TQ_{L_d}),$$

which gives (3.18).


As mentioned in Table 3.4, the weights $V_q$ are orthogonal in the $\Sigma_X$ inner product; that is, $V_q^T\Sigma_X V_q$ is a diagonal matrix. This can be seen directly from (3.19). It follows also because $v_{d+1} \in \mathrm{span}(Q_{\Sigma_X V_d}\Sigma_{X,Y})$, which implies $V_d^T\Sigma_X v_{d+1} = 0$.
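The compact sample version (3.18) also codes directly. The following minimal numpy sketch (our own illustration; the simulated data are hypothetical) runs sample SIMPLS and verifies the $\Sigma_X$-orthogonality of the weights just discussed, here with $S_X$ in place of $\Sigma_X$.

```python
import numpy as np

def l1(M):
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1:]

def simpls_sample(X, Y, q):
    """Sample SIMPLS in the compact form (3.18) on centered X (n x p) and Y (n x r)."""
    n, p = X.shape
    SXY = X.T @ Y / n
    SX = X.T @ X / n
    V = l1(SXY @ SXY.T)                          # v_1
    for _ in range(q - 1):
        L = SX @ V                               # span(L_d) = span(S_X V_d)
        Q = np.eye(p) - L @ np.linalg.solve(L.T @ L, L.T)
        V = np.hstack([V, l1(Q @ SXY @ SXY.T @ Q)])
    beta = V @ np.linalg.solve(V.T @ SX @ V, V.T @ SXY)
    return V, beta

# Illustrative check on simulated (hypothetical) data.
rng = np.random.default_rng(4)
n, p, r, q = 120, 6, 2, 3
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
Y = X[:, :3] @ rng.normal(size=(3, r)) + 0.1 * rng.normal(size=(n, r)); Y -= Y.mean(axis=0)
V, beta = simpls_sample(X, Y, q)
G = V.T @ (X.T @ X / n) @ V
print(np.allclose(G, np.diag(np.diag(G))))       # True: V^T S_X V is diagonal
```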

3.4 Envelopes and SIMPLS


Cook et al. (2013) and Cook (2018) showed that the weights Vq from the
population version of the SIMPLS algorithm span a reducing subspace of ΣX
that contains B. In this section we sketch a different justification for the same
result. Our justification here relies on the SIMPLS stopping criterion

QΣX Vq ΣX,Y = 0. (3.20)

Although key details differ, the argument here follows closely that for NIPALS
given in Section 3.2. Also, some results in this section may be deduced from
properties of Algorithm S in Section 1.5.4. We provide separate demonstra-
tions in this section to aid intuition and establish connections with SIMPLS.

3.4.1 The SIMPLS subspace reduces ΣX and contains B


The goal of this section is to use the SIMPLS algorithm itself to deduce that
the SIMPLS subspace span(Vq ) reduces ΣX and contains B. This then will
establish a relationship between SIMPLS and reducing subspaces of ΣX . To
establish a connection with envelopes per se, we need to demonstrate that
span(Vq ) is minimal, which is done in the next section.
Assume that $\beta \neq 0$ and that the SIMPLS algorithm stops after getting $v_1$,
so then we have q = 1 and stopping condition QΣX v1 ΣX,Y = 0, which is equiv-
alent to ΣX,Y = PΣX v1 ΣX,Y . This tells us that span(ΣX,Y ) = span(ΣX v1 ),
so there is an r × 1 vector c that satisfies ΣX,Y = ΣX v1 cT . In consequence,
$\beta = \Sigma_X^{-1}\Sigma_{X,Y} = v_1c^T$, $\mathrm{rank}(\beta) = 1$ and $\mathcal{B} = \mathrm{span}(v_1)$. Further, since $v_1$ is an
eigenvector of ΣX,Y ΣTX,Y , we have for some r × 1 vector b, ΣX,Y = v1 bT and
thus v1 bT = ΣX v1 cT , so v1 must be also an eigenvector of ΣX . Summarizing,
if the population SIMPLS algorithm stops at q = 1 then

(a) β = v1 cT ,
(b) B = span(v1 ),
(c) dim{B} = 1,
(d) v1 is an eigenvector of both ΣX,Y ΣTX,Y and ΣX .

To study the behavior of the population SIMPLS algorithm at a general


stopping point q, we have for d = 1, . . . , q − 1 the population weight vectors
constructed as
vd+1 = `1 QΣX Vd ΣX,Y ΣTX,Y QΣX Vd .

(3.21)

The stopping criterion (3.20) implies that

ΣX,Y = ΣX Vq (VqT Σ2X Vq )−1 VqT ΣX ΣX,Y , (3.22)

and therefore
B ⊆ span(Vq ). (3.23)

It remains to demonstrate that span(Vq ) is a reducing subspace of ΣX :

QΣX Vd ΣX,Y = (I − ΣX Vd (VdT Σ2X Vd )−1 VdT ΣX )ΣX,Y


= ΣX,Y − ΣX Vd (VdT Σ2X Vd )−1 VdT ΣX ΣX,Y .

Next, we substitute (3.22) for the first $\Sigma_{X,Y}$ on the right hand side and write $V_d = V_q(I_d, 0)^T$ to get
$$\begin{aligned}
Q_{\Sigma_XV_d}\Sigma_{X,Y} &= \Sigma_XV_q(V_q^T\Sigma_X^2V_q)^{-1}V_q^T\Sigma_X\Sigma_{X,Y} - \Sigma_XV_q(I_d,0)^T(V_d^T\Sigma_X^2V_d)^{-1}V_d^T\Sigma_X\Sigma_{X,Y}\\
&= \Sigma_XV_q\left\{(V_q^T\Sigma_X^2V_q)^{-1}V_q^T - (I_d,0)^T(V_d^T\Sigma_X^2V_d)^{-1}V_d^T\right\}\Sigma_X\Sigma_{X,Y}\\
&:= \Sigma_XV_qA_d,
\end{aligned}$$

where the $q \times r$ matrix $A_d$ is defined implicitly. In consequence, we have for $d = 1, \ldots, q-1$
$$v_{d+1} = \ell_1\left(\Sigma_XV_qA_dA_d^TV_q^T\Sigma_X\right).$$


Since vd+1 ∈ span(ΣX Vq ), this implies that vd+1 can be represented as vd+1 =
ΣX Vq md for some q × 1 vector md and thus that there is a q × q matrix
M = (m1 , . . . , mq ) so that Vq = ΣX Vq M . Since Vq has full column rank and
ΣX is non-singular, M must be nonsingular and so

ΣX Vq = Vq M −1 . (3.24)

This result tells us that span(Vq ) is a reducing subspace of ΣX . Thus, in


addition to (3.22) we also have (see Proposition 1.6)

span(ΣX,Y ) ⊆ span(Vq ). (3.25)

In summary,

Proposition 3.4. The span span(Vq ) of the weight matrix Vq from the popula-
tion SIMPLS algorithm is a reducing subspace of ΣX that contains span(ΣX,Y )
and B.

3.4.2 The SIMPLS weights span the ΣX -envelope of B


Taken together, results (3.23) and (3.24) characterize span(Vq ) as a reducing
subspace of ΣX that contains B or, equivalently, contains span(ΣX,Y ). Those
results were obtained without making use of envelopes per se. We now make
use of that connection to establish further properties of the SIMPLS subspaces
span(Vd ), d = 1, . . . , q, including that the SIMPLS subspace span(Vq ) is the
ΣX -envelope of B; that is, the smallest reducing subspace of ΣX that contains
span(ΣX,Y ).

Let Φ ∈ Rp×q denote a semi-orthogonal basis matrix for EΣX(B), the in-
tersection of all reducing subspaces of ΣX that contain span(ΣX,Y ), and let
(Φ, Φ0 ) ∈ Rp×p be an orthogonal matrix. This is the same notation we used
when dealing with NIPALS and envelopes in Section 3.2.2. Then we know
from Proposition 1.2 that ΣX can be expressed as

ΣX = Φ∆ΦT + Φ0 ∆0 ΦT0 .

Turning to the first step in the SIMPLS algorithm, ΣX,Y = PΦ ΣX,Y


by construction and consequently v1 is in the envelope, v1 ∈ span(Φ) and
ΣX v1 = Φ∆ΦT v1 := Φa1 , where a1 = ∆ΦT v1 . And

QΣX v1 ΣX,Y = QΦa1 ΣX,Y = ΣX,Y − PΦa1 ΣX,Y


= PΦ ΣX,Y − PΦa1 ΣX,Y
= ΦQa1 ΦT ΣX,Y .

Consequently,

v2 = ℓ1 (QΣX v1 ΣX,Y ΣTX,Y QΣX v1 )
   = ℓ1 (ΦQa1 ΦT ΣX,Y ΣTX,Y ΦQa1 ΦT )
   ∈ span(Φ),

and so V2 = span(v1 , v2 ) ⊆ span(Φ). Additionally, since v2 ∈


span(QΣX v1 ΣX,Y ) and v1T ΣX QΣX v1 ΣX,Y = 0 it follows that v2T ΣX v1 = 0,
as we have seen.
Proceeding by induction, assume that Vd−1 = span(v1 , . . . , vd−1 ) ⊆
span(Φ). Then, using the previous argument with v1 replaced by Vd−1 , it
follows that vd ∈ span(Φ), that vdT ΣX Vd−1 = 0 and that span(Vd−1 ) is a
proper subspace of span(Vd ); that is, span(Vd−1 ) ⊂ span(Vd ). This demon-
strates that the dimensions of the SIMPLS subspaces span(Vd ) continue to
increase until d = q, at which point span(Vq ) = span(Φ) because span(Vq ) ⊆
span(Φ) and q = dim(span(Φ)) = dim(span(Vq )). We used the same rationale
when proving Proposition 3.3.
Summarizing,

Proposition 3.5. The population SIMPLS algorithm of Table 3.4 produces


a series of properly nested subspaces that settle on the envelope after q steps,

span(V1 ) ⊂ span(V2 ) ⊂ · · · ⊂ span(Vq ) = EΣX (B),

where q = dim[EΣX {B}].



3.5 SIMPLS v. NIPALS


In this section we compare the NIPALS and SIMPLS algorithms, starting with
their population versions shown in Tables 3.1(b) and 3.4(b).

3.5.1 Estimation
While the first weight vectors from the population algorithms are the same,
w1 = v1 , the second and subsequent weight vectors differ:

w2 = ℓ1 (QTw1 (ΣX ) ΣX,Y ΣTX,Y Qw1 (ΣX ) )
v2 = ℓ1 (QΣX w1 ΣX,Y ΣTX,Y QΣX w1 ),

where we have used w1 in place of v1 for the calculation of v2 . These weight


vectors are distinct because of the different projections – Qw1 (ΣX ) and QΣX w1
– used to remove the first weight vector w1 from consideration. However, we
know from results in Sections 3.2 and 3.4 that at termination

span(Wq ) = span(Vq ) = EΣX (B).

Since βnpls and βspls depend only on the subspaces spanned by the correspond-
ing weight matrices, it follows that βnpls = βspls , and so NIPALS and SIMPLS
produce the same result in the population.
The sample estimators are generally different, βbnpls ≠ βbspls , with two im-
portant exceptions as given in the following proposition (de Jong, 1993).

Proposition 3.6. Recall that βbnpls and βbspls denote the sample NIPALS and
SIMPLS estimators. Assume that these estimators are each constructed with
q components.

1. If a single component is used, so q = 1, then the sample w1 = v1 and


consequently βbnpls = βbspls .

2. If the response is univariate, so r = 1, then again βbnpls = βbspls .

Proof. The first conclusion is straightforward. To demonstrate the second con-


clusion, consider the sequence of weight vectors for NIPALS starting with

w1 ∝ SX,Y and

w2 ∝ QTw1 (SX ) SX,Y
   = SX,Y − SX SX,Y (w1T SX w1 )−1 w1T SX,Y
   := SX,Y − SX SX,Y c1

where c1 = (w1T SX w1 )−1 w1T SX,Y is a scalar. From this we have the represen-
tation

span(W2 ) = span{SX0 SX,Y , SX1 SX,Y }.

We next show by induction that for d = 1, . . . , q,

span(Wd ) = span{SX0 SX,Y , SX1 SX,Y , . . . , SXd−1 SX,Y }.    (3.26)

Under the induction hypothesis,

span(Wd−1 ) = span{SX0 SX,Y , SX1 SX,Y , . . . , SXd−2 SX,Y }    (3.27)

and

wd ∝ QTWd−1 (SX ) SX,Y
   = SX,Y − SX Wd−1 (Wd−1T SX Wd−1 )−1 Wd−1T SX,Y .

We need to show that

span{SX0 SX,Y , SX1 SX,Y , . . . , SXd−1 SX,Y } = span(Wd−1 , wd ),

where span(Wd−1 ) is as given under the induction hypothesis (3.27).
Since wd depends on Wd−1 only via span(Wd−1 ), we substitute the induc-
tion hypothesis for the Wd−1 ’s and let c = (Wd−1T SX Wd−1 )−1 Wd−1T SX,Y ∈ Rd−1 .
Then

wd ∝ SX,Y − (SX1 SX,Y , . . . , SXd−1 SX,Y )c.

It follows that span(Wd−1 , wd ) = span(Wd ) as defined in (3.26). Using


the same logic for SIMPLS, it follows that span(Vd ) = span(Wd ) for all
d = 1, 2, . . . , q and thus that βbnpls = βbspls as long as the stopping point is
the same. □

From this we see that the orthogonal NIPALS weight vectors w1 , . . . , wq
result from a sequential residualization of the Krylov sequence (see Section 1.3)

{SX0 SX,Y , SX1 SX,Y , . . . , SXq−1 SX,Y },

which led Manne (1987) and Helland (1990) to claim that the NIPALS algo-
rithm is a version of the Gram-Schmidt procedure (see Section 3.6).
Turning to the population to complete the discussion, the SIMPLS weight
vectors also result in an orthogonalization of the Krylov sequence, except
now the weight vectors v1 , . . . , vq are orthogonal in the ΣX inner product. In
particular,

• v1 = σX,Y ,
• v2 is the vector of residuals from the regression of σX,Y on ΣX v1 ,
• v3 is the vector of residuals from the regression of σX,Y on ΣX (v1 , v2 ),
  ...
• vk is the vector of residuals from the regression of σX,Y on
ΣX (v1 , v2 , . . . , vk−1 ).

In consequence, we see that for a univariate response, the NIPALS and SIM-
PLS sample and population vectors can be viewed as different orthogonaliza-
tions of the Krylov vectors. This implies that in the population and sample
span(Wk ) = span(Vk ), k = 1, . . . , q; that is, we have demonstrated that

Proposition 3.7. For single-response regressions and 1 ≤ k ≤ q,

Population: span(Wk ) = span(Vk ) = Kk (ΣX , σX,Y ),

Sample: span(Wk ) = span(Vk ) = Kk (SX , SX,Y ),

where Kk was defined at (1.22).
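
To make Proposition 3.7 concrete, the following sketch in base R (our own illustrative code, not the book's accompanying software; the choices of ΣX and σX,Y are arbitrary) computes the first few population NIPALS, SIMPLS and Krylov vectors for a single-response regression and checks numerically that they span the same nested subspaces.

## Illustrative check of Proposition 3.7 (population quantities, r = 1).
set.seed(1)
p <- 6
A <- matrix(rnorm(p * p), p, p)
SigmaX  <- crossprod(A) + diag(p)        # an arbitrary positive definite Sigma_X
sigmaXY <- SigmaX %*% rnorm(p)           # sigma_{X,Y} = Sigma_X beta for an arbitrary beta
proj <- function(M) M %*% solve(crossprod(M), t(M))   # orthogonal projection onto span(M)

q <- 3
W <- V <- K <- NULL
for (d in 1:q) {
  ## NIPALS: w_d proportional to Q^T_{W(Sigma_X)} sigma_{X,Y}
  w <- if (is.null(W)) sigmaXY else
    sigmaXY - SigmaX %*% W %*% solve(t(W) %*% SigmaX %*% W, t(W) %*% sigmaXY)
  ## SIMPLS: v_d proportional to Q_{Sigma_X V} sigma_{X,Y}
  v <- if (is.null(V)) sigmaXY else {
    SV <- SigmaX %*% V
    sigmaXY - SV %*% solve(crossprod(SV), t(SV) %*% sigmaXY)
  }
  ## Krylov: k_d = Sigma_X^{d-1} sigma_{X,Y}
  k <- if (is.null(K)) sigmaXY else SigmaX %*% K[, d - 1, drop = FALSE]
  W <- cbind(W, w); V <- cbind(V, v); K <- cbind(K, k)
  cat("d =", d,
      " ||P_W - P_V|| =", signif(norm(proj(W) - proj(V), "F"), 2),
      " ||P_W - P_K|| =", signif(norm(proj(W) - proj(K), "F"), 2), "\n")
}

In exact arithmetic the printed norms are zero at every step (Proposition 3.7); numerically they are at rounding-error level, reflecting that the three constructions differ only in how they orthogonalize the same Krylov subspaces.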

3.5.2 Summary of common features of NIPALS and SIMPLS


As we have seen, the SIMPLS and NIPALS algorithms share a number of
features:

• Both algorithms give the envelope EΣX (B) in the population and in that
sense are aiming at the same target. However, the algorithms give different
sample weight vectors, as discussed previously.

• In the population, the NIPALS algorithm makes use of the ΣX inner prod-
uct to generate its orthogonal weights, WqT Wq = Iq , while the SIMPLS
algorithm uses the identity inner product to generate its weight vectors
that are orthogonal in the ΣX inner product, so VqT ΣX Vq is a diagonal
matrix. Nevertheless, span(Wq ) = span(Vq ) = EΣX(B).

• With q = dim{EΣX (B)} known and fixed, p fixed and n → ∞, both algo-
rithms after q steps produce √n-consistent estimators of β, because the
algorithms are smooth functions of SX,Y and SX , which are √n-consistent
estimators of ΣX,Y and ΣX .

• The number of components q ≤ min{rank(SX ), n − 1} is often selected
by predictive cross validation or a hold-out sample, particularly in chemo-
metrics. Underestimation of q produces biased estimators, while overesti-
mation results in estimators that are more variable than need be. Our
experience indicates that underestimation is the more serious error and
that typically a little overestimation may be tolerable.

• Neither algorithm requires that SX be positive definite and both are gen-
erally serviceable in high-dimensional regressions. Asymptotic properties
as n, p → ∞ are discussed in Chapter 4.

• βbnpls = βbspls when q = 1 or r = 1. If SX is non-singular and q = p then
βbnpls = βbspls = βbols . Otherwise, βbnpls ≠ βbspls .

• Both population algorithms are invariant under full rank linear transfor-
mations YA = AY of the response vector. The coefficient matrix with the
transformed responses is βA := βAT . Since span(βA ) = B, the envelope is
invariant under this transformation,

EΣX (span(βA )) = EΣX (span(βAT )) = EΣX(B).

The population target is thus invariant under choice of A.


However, the sample versions of the algorithms are susceptible to change
under full rank linear transformations of the response, as illustrated in the
sketch following this list. This can be appreciated by noting that w1 = v1
changes with the transformation. In the original scale w1 = v1 = ℓ1 (SX,Y STX,Y ),
while in the transformed scale w1 = v1 = ℓ1 (SX,Y AT ASTX,Y ). This begs the
question: What is the “best” choice of A? The likelihood-based estimators
discussed in Chapters 3 and 11 suggest that when r ≪ n, A = SY−1/2 is a
good choice.

• Neither algorithm is invariant under full rank linear transformations of X.
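
To illustrate the response-transformation point numerically, the following sketch (simulated data, our own code rather than the book's software) computes the first sample weight vector w1 = ℓ1 (SX,Y STX,Y ) before and after a full rank transformation Y ↦ AY ; the two estimated one-dimensional spans typically differ.

## Sample w1 is not invariant under Y -> AY (illustrative sketch).
set.seed(2)
n <- 50; p <- 8; r <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1:2] %*% matrix(c(1, -1, 2, 0.5, 1, 1), 2, r) + matrix(rnorm(n * r), n, r)

first_weight <- function(X, Y) {
  Xc <- scale(X, scale = FALSE); Yc <- scale(Y, scale = FALSE)
  Sxy <- crossprod(Xc, Yc) / nrow(X)                       # S_{X,Y}, p x r
  eigen(Sxy %*% t(Sxy), symmetric = TRUE)$vectors[, 1]     # first eigenvector of S_XY S_XY'
}

A   <- matrix(rnorm(r * r), r, r)       # a full rank transformation (with probability one)
w1  <- first_weight(X, Y)
w1A <- first_weight(X, Y %*% t(A))      # rows Y_i are replaced by A Y_i
acos(min(1, abs(sum(w1 * w1A)))) * 180 / pi   # angle between the two spans, usually nonzero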



3.6 Helland’s algorithm


Manne (1987) and Helland (1990) emphasized a connection between univariate
PLS regression and the Krylov subspace shown in (3.26) (see also Section 1.3).
Frank and Friedman (1993) referred to the corresponding algorithm as “Hel-
land’s algorithm,” as described in Table 3.5. While the Krylov basis vectors
SXd SX,Y , d = 0, . . . , q − 1, are linearly independent, they can exhibit a degree
of multicollinearity that can cause serious problems for the algorithm. The or-
thogonalized versions NIPALS and SIMPLS should be computationally more
stable in such instances, although the orthogonalization may become prob-
lematic if the vectors in the Krylov sequence become highly collinear. This
possibility plays a role in the asymptotic collinearity measure ρ(p) introduced
in Section 4.6.1.
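
The sample version in Table 3.5(a) is short enough to write out directly. The sketch below is our own base-R rendering of it (the function name helland_pls is ours, not the book's), included to show how little is involved beyond building the Krylov basis.

## Helland's algorithm, sample version of Table 3.5(a) -- illustrative sketch.
helland_pls <- function(X, y, q) {
  Xc  <- scale(X, scale = FALSE)                  # centered predictors
  yc  <- y - mean(y)                              # centered response
  n   <- nrow(Xc)
  Sx  <- crossprod(Xc) / n                        # S_X
  Sxy <- crossprod(Xc, yc) / n                    # S_{X,Y}
  K   <- Sxy                                      # K_1 = (S_{X,Y})
  if (q > 1) for (d in 2:q) K <- cbind(K, Sx %*% K[, d - 1])   # k_d = S_X^{d-1} S_{X,Y}
  beta <- K %*% solve(t(K) %*% Sx %*% K, t(K) %*% Sxy)         # K_q (K_q' S_X K_q)^{-1} K_q' S_{X,Y}
  list(beta = beta, K = K)
}

## toy usage
set.seed(3)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X %*% rep(c(1, 0), each = 5) + rnorm(100)
helland_pls(X, y, q = 2)$beta

As the discussion above warns, the columns of Kq can be nearly collinear, in which case KqT SX Kq is ill conditioned and the orthogonalized NIPALS or SIMPLS forms are preferable numerically.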

3.7 Illustrative example


In this section, we use an example from Chun and Keleş (2010) to illustrate
selected results from the developments in this chapter.
Consider a regression in which the n × p data matrix X of predictor values
can be represented as

Xn×p = (H1 1T1 + E1 , H2 1T2 + E2 , H3 1T3 + E3 ),

where Hj is an n × 1 vector of independent standard normal variates,
j = 1, 2, 3, and 11 , 12 , and 13 are vectors of ones having dimensions p1 , p2
and p3 with Σj pj = p. The Ej ’s are matrices of conforming dimensions also
consisting of independent standard normal variates. The random quantities
Hj and Ej , j = 1, 2, 3, are mutually independent. A typical p × 1 predictor
vector can be represented as

X = (h1 1T1 , h2 1T2 , h3 1T3 )T + e,
where the hj ’s are independent standard normal random variables and e is a
p × 1 vector of independent standard normal variates.
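
The construction is easy to reproduce. The following sketch (our own illustrative code; the sizes n, p1, p2, p3 and the error standard deviation are arbitrary choices) generates X and Y as described and checks that the sample covariance SX,Y behaves as the population calculations that follow predict.

## Simulating the illustrative design (arbitrary sizes; illustrative sketch).
set.seed(4)
n <- 200; p1 <- 10; p2 <- 20; p3 <- 15; sigma <- 1
p <- p1 + p2 + p3
H <- matrix(rnorm(n * 3), n, 3)                          # H1, H2, H3
E <- matrix(rnorm(n * p), n, p)                          # predictor errors
X <- cbind(outer(H[, 1], rep(1, p1)),
           outer(H[, 2], rep(1, p2)),
           outer(H[, 3], rep(1, p3))) + E                # X = (H1 1', H2 1', H3 1') + E
Y <- H[, 1] - H[, 2] + sigma * rnorm(n)                  # Y = H1 - H2 + error

Sxy <- crossprod(scale(X, scale = FALSE), Y - mean(Y)) / n
round(tapply(drop(Sxy), rep(1:3, c(p1, p2, p3)), mean), 2)   # roughly 1, -1, 0 by block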

TABLE 3.5
Helland’s algorithm for univariate PLS regression: (a) sample version adapted
from Table 2 of Frank and Friedman (1993). The n × p matrix X contains the
centered predictors and the n × 1 vector Y contains the centered responses;
(b) population version derived herein.

(a) Sample Version
Compute sample covariance matrices: SX,Y = n−1 XT Y, SX = n−1 XT X
Initialize: K1 = (SX,Y ), the first Krylov basis
Select q ≤ min{rank(SX ), n − 1}
For d = 2, . . . , q
    Compute Krylov vector kd = SXd−1 SX,Y
    Append Kd = (Kd−1 , kd )
End
Compute regression coefficients: βbHpls = Kq (KqT SX Kq )−1 KqT SX,Y

(b) Population Version
Initialize: K1 = (ΣX,Y )
For d = 2, . . .
    Compute Krylov vector kd = ΣXd−1 ΣX,Y
    Append Kd = (Kd−1 , kd )
End when first QKd−1 kd = 0 and then set q = d − 1
Compute regression coefficients: βHpls = Kq (KqT ΣX Kq )−1 KqT ΣX,Y = Kq βY |KqT X
Envelope connection: span(Kq ) = EΣX (B), the ΣX -envelope of B, where B = span(β)
Algorithm K: This is an instance of Algorithm K discussed in §1.5.2.

The variance-covariance matrix of the predictors X is

ΣX = blockdiag(11 1T1 , 12 1T2 , 13 1T3 ) + Ip .
From this structure, we see that the predictors consist of three indepen-
dent blocks of sizes p1 , p2 and p3 . Let uT1 = (1T1 , 0, 0), uT2 = (0, 1T2 , 0) and
uT3 = (0, 0, 1T3 ) denote three eigenvectors of ΣX with non-zero eigenvalues.

Then ΣX can be expressed equivalently as

ΣX = (1 + p1 )Pu1 + (1 + p2 )Pu2 + (1 + p3 )Pu3 + Q,

where P(·) is the projection onto the subspace spanned by the indicated eigen-
vector and Q is the projection onto the p − 3 dimensional subspace that is
orthogonal to span(u1 , u2 , u3 ).
Next, with a single response r = 1, the n × 1 vector of responses was
generated as the linear combination Y = H1 − H2 + ε, where ε is an n × 1
vector of independent normal variates with mean 0 and variance σ 2 , which
gives

ΣX,Y = u1 − u2
β = u1 /(1 + p1 ) − u2 /(1 + p2 ).
Consequently, if p1 ≠ p2 then both β and ΣX,Y are linear combinations of
two eigenvectors of ΣX and only q = 2 components are needed to characterize
the regression, specifically uT1 X and uT2 X. Equivalently, the two-dimensional
envelope EΣX(B) = span(u1 , u2 ). If p1 = p2 then EΣX(B) = span(u1 − u2 ). This
follows because span(ΣX,Y ) = B = span(u1 − u2 ) has dimension 1.
To calculate the first two NIPALS weight vectors we need

ΣX ΣX,Y = (1 + p1 )u1 − (1 + p2 )u2


ΣTX,Y ΣX ΣX,Y = (1 + p1 )p1 + (1 + p2 )p2
ΣTX,Y ΣX,Y = p1 + p2 .

With this we can now calculate the first two NIPALS weight vectors as

w1 ∝ ΣX,Y = u1 − u2
w2 ∝ QTΣX,Y (ΣX ) ΣX,Y
   = ΣX,Y − ΣX ΣX,Y (ΣTX,Y ΣX ΣX,Y )−1 ΣTX,Y ΣX,Y
   = [(p2 − p1 )/{(1 + p1 )p1 + (1 + p2 )p2 }] (p2 u1 + p1 u2 ).
Clearly, span(W2 ) = span(u1 , u2 ) provided p1 ≠ p2 . If p1 = p2 then ΣX
reduces to
ΣX = (1 + p1 )P(u1 ,u2 ) + (1 + p3 )Pu3 + Q,
u1 and u2 belong to the same eigen-space and, in consequence, only one com-
ponent is needed and EΣX(B) = span(u1 − u2 ). Also, when p1 = p2 , the
stopping criterion is met at w2 = 0, giving q = 1.

Turning to the SIMPLS weights, we need in addition to the previous
preparatory calculations,

ΣTX,Y Σ2X ΣX,Y = (1 + p1 )2 p1 + (1 + p2 )2 p2 .

Then the first two SIMPLS weights are

v1 ∝ ΣX,Y = u1 − u2
v2 ∝ QΣX ΣX,Y ΣX,Y
   = ΣX,Y − ΣX ΣX,Y (ΣTX,Y Σ2X ΣX,Y )−1 ΣTX,Y ΣX ΣX,Y
   = [(p2 − p1 )/{(1 + p1 )2 p1 + (1 + p2 )2 p2 }] ((1 + p2 )p2 u1 + (1 + p1 )p1 u2 ).

The interpretation here is similar to that for the NIPALS weights.


The first three Krylov terms in Helland’s algorithm are

ΣX,Y = u1 − u2
ΣX ΣX,Y = (1 + p1 )u1 − (1 + p2 )u2
Σ2X ΣX,Y = (1 + p1 )2 u1 − (1 + p2 )2 u2 .

From this we have the Krylov basis K2 = (u1 − u2 , (1 + p1 )u1 − (1 + p2 )u2 ). If
p1 ≠ p2 then span(K2 ) = span(u1 , u2 ) and so QK2 {(1+p1 )2 u1 −(1+p2 )2 u2 } =
0 and q = 2. If p1 = p2 then span(K1 ) = span(u1 − u2 ), QK1 {(1 + p1 )u1 −
(1 + p2 )u2 } = 0 and q = 1.
The three algorithms – NIPALS, SIMPLS, Helland – all agree that if
p1 ≠ p2 then EΣX(B) = span(u1 , u2 ), while EΣX(B) = span(u1 − u2 ) if p1 = p2 .

3.8 Likelihood estimation of predictor envelopes


As mentioned at the outset of this chapter, PLS regression algorithms were
not historically associated with specific models and, in consequence, were not
typically viewed as methods for estimating identifiable population parameters
but were seen instead as methods for prediction. Nevertheless, the algorithms
of this chapter, and more generally those of Section 1.5, can be viewed as
moment-based methods for fitting the predictor envelope model (2.5), an ap-
proach that has often been called “soft modeling” in the chemometrics liter-
ature. Soft modeling by its nature does not require fully laid foundations, re-
sulting in methodology with a rather elusive quality. In contrast, model-based

“hard modeling” has the advantage of laying bare all of the structural and
stochastic foundations of a method, allowing the investigator to study perfor-
mance under deviations from those foundations as necessary for a particular
application. There seems to be a feeling within the chemometrics community
that the success of PLS algorithms over the past four decades does not fully
compensate for the lack of adequate foundations (Stocchero, 2019).
Since we have shown in this chapter that NIPALS and SIMPLS give the
envelope in the population, we could use likelihood theory to get maximum
likelihood estimators when the sample size is sufficiently large. This approach
was described in Section 2.3 and a preview of the connection to PLS was given
in Section 2.3.2.

3.9 Comparisons of likelihood and PLS estimators


Turning to comparisons of the likelihood-based method with PLS methods,
we see first that Lq (G), given in (2.11), depends on the response only through
its standardized version Z = SY−1/2 Y . On the other hand, PLS depends on the
scale of the response. When q = 1, the PLS estimator of EΣX (B) is the span
of the first eigenvector w1 of SX,Y STX,Y . After performing a full rank transfor-
mation of the response Y ↦ AY , the PLS estimator of EΣX (B) is the span of
the first eigenvector w̃1 of SX,Y AT ASTX,Y . Generally, span(w1 ) ≠ span(w̃1 ),
so the estimates of EΣX (B) differ, although ΣX,Y ΣTX,Y and ΣX,Y AT AΣTX,Y
span the same subspace. It is customary to standardize the individual re-
sponses marginally yj ↦ yj /{v̂ar(yj )}1/2 , j = 1, . . . , r, prior to application of
PLS, but it is evidently not customary to standardize the responses jointly
Yi ↦ Zi = SY−1/2 Yi . Of course, the PLS algorithm could be applied after
replacing Y with jointly standardized responses Z.
The methods also differ on how they utilize information from SX . The like-
lihood objective function (2.11) can be written as (Cook, Li, and Chiaromonte,
2010, Section 4.3.2)

−1
Lq (G) = log |GT SX G| + log |GT SX G| + log |SZ|GT X |.

The sum of the first two addends on the right hand side is non-negative and
zero when the columns of G correspond to any subset of q eigenvectors of SX .

To see this, suppose the columns of G correspond to a subset of q eigenvectors
of SX with eigenvalues λi given by the diagonal elements of the diagonal
matrix Λ. Then

log |GT SX G| = log |GT GΛ| = log λ1 + · · · + log λq
log |GT SX−1 G| = log |GT GΛ−1 | = −(log λ1 + · · · + log λq )

and the conclusion follows. Consequently, the role of these addends is to pull
the solution toward subsets of q eigenvectors of SX . This in effect imposes
a sample counterpart of the characterization in Proposition 1.9, which states
that in the population EΣX (B) is spanned by a subset of the eigenvectors of
ΣX . There is no corresponding operation in the PLS methods. The first SIM-
PLS vector v1 does not incorporate direct information about SX . The second
PLS vector incorporates SX by essentially removing the subspace span(SX v1 )
from consideration, but the choice of span(SX v1 ) is not guided by the rela-
tionship between v1 and the eigenvectors of SX . Subsequent SIMPLS vectors
operate similarly in successively smaller spaces. In application, PLS methods
often require more directions to match the performance of the likelihood-based
method (Cook et al., 2013).
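
The eigenvector property of the first two addends of Lq (G) is easy to verify numerically. The sketch below (our own illustrative code, with an arbitrary stand-in for SX ) evaluates log |GT SX G| + log |GT SX−1 G| at a subset of eigenvectors of SX and at a random semi-orthogonal basis.

## log|G'S_X G| + log|G'S_X^{-1} G| is zero for eigenvector subsets, positive otherwise.
set.seed(5)
p <- 8; q <- 3
Sx <- crossprod(matrix(rnorm(p * p), p, p)) / p          # arbitrary stand-in for S_X
addends <- function(G, S) {
  determinant(t(G) %*% S %*% G)$modulus +                # log|G'SG| +
    determinant(t(G) %*% solve(S) %*% G)$modulus         # log|G'S^{-1}G| (log scale by default)
}
G_eig  <- eigen(Sx, symmetric = TRUE)$vectors[, c(1, 4, 6)]   # a subset of q eigenvectors
G_rand <- qr.Q(qr(matrix(rnorm(p * q), p, q)))                # a random semi-orthogonal basis
c(eigenvectors = addends(G_eig, Sx), random = addends(G_rand, Sx))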
Our discussion of likelihood-based estimation has so far been restricted to
regressions in which n ≫ max(p, r) or n → ∞ with p and r fixed. Rimal,
Trygve, and Sæbø (2019) adapted likelihood-based estimation to accommo-
date regressions with n < p by selecting the principal components of X that
account for 97.5 percent of its sample variation and then using likelihood-
based estimation on the principal component predictors. Using large-scale
simulations and data analyses, they compared the predictive performance of
envelope methods, principal component regression and the kernel PLS method
(Lindgren, Geladi, and Wold, 1993) from the PLS package in R (R Core Team,
2022) in contexts that reflect chemometric applications. They summarized
their overall findings as follows . . .

Analysis using both simulated data and real data has shown that
the envelope methods are more stable, less influenced by [predictor
collinearity] and in general, performed better than PCR and PLS meth-
ods. These methods are also found to be less dependent on the number
of components.

The envelope methods used by Rimal et al. (2019) are likelihood-based and
they differ from the PLS methods only on the method of estimating EΣX(B).
In contrast, principal component regression is not designed specifically to esti-
mate EΣX(B) and this may in part account for its relatively poor performance.
Subsequently, Rimal, Trygve, and Sæbø (2020) reported the results of a second
study, this time focusing on the estimative performance of the methods.
Comparisons based on asymptotic approximations are presented in Chap-
ter 4.

3.10 PLS1 v. PLS2


As mentioned previously, PLS1 and PLS2 refer to PLS regression with univari-
ate and multivariate responses. Helland’s algorithm of Table 3.5 is necessarily
of the PLS1 type. The NIPALS and SIMPLS algorithms of Tables 3.1 and 3.4
are of the PLS2 variety, but of course they apply when r = 1 and so reduce to
PLS1 algorithms. The key distinction between PLS1 and PLS2 algorithms is
how they are used to fit a regression with a multivariate response, with PLS2
as described in Tables 3.1 and 3.4.
To motivate the use of PLS1 algorithms to fit multivariate response re-
gressions, let Yj , j = 1, . . . , r, denote the individual responses and consider
the OLS estimator of β given in (1.7)
βbols = SX−1 SX,Y = (SX−1 SX,Y1 , SX−1 SX,Y2 , . . . , SX−1 SX,Yr )
     := (βbols,1 , βbols,2 , . . . , βbols,r ),

where βbols,j ∈ Rp denotes the coefficient vector from the OLS fit of Yj on
X. From this we see that the j-th column of the OLS coefficient matrix βbols
consists of the coefficient vector from the univariate-response regression of Yj
on X. Following this lead, we can use NIPALS or SIMPLS to construct a
different PLS estimator by performing r univariate PLS regressions:

βbnpls1 := (βbnpls,1 , βbnpls,2 , . . . , βbnpls,r )
βbspls1 := (βbspls,1 , βbspls,2 , . . . , βbspls,r )    (3.28)

where βbnpls,j is the estimator of the coefficient vector from the NIPALS fit of
the j-th response on X, with a similar definition for βbspls,j .
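
In code, the PLS1 construction in (3.28) is just a loop of univariate PLS fits over the columns of Y . The sketch below (our own illustration) reuses the helland_pls function sketched in Section 3.6 as the univariate fitter; a NIPALS or SIMPLS routine could be substituted without changing the structure.

## PLS1-style estimator: one univariate PLS fit per response column (illustrative).
pls1_beta <- function(X, Y, q) {
  Y <- as.matrix(Y)
  sapply(seq_len(ncol(Y)), function(j)
    helland_pls(X, Y[, j], q = q[min(j, length(q))])$beta)   # q may differ by column
}

## toy usage: r = 2 responses with different numbers of components
set.seed(6)
X <- matrix(rnorm(150 * 12), 150, 12)
Y <- cbind(X %*% rep(1, 12) + rnorm(150),     # many predictors matter for the first response
           X[, 1] + rnorm(150))               # only one matters for the second
beta_hat <- pls1_beta(X, Y, q = c(1, 2))      # a p x r coefficient matrix
dim(beta_hat)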

Partition β on its columns, β = (β1 , . . . , βr ) and let Bj = span(βj ). The
columns of the PLS1 estimators βbnpls1 and βbspls1 are based on estimators of
the corresponding envelopes EΣX (Bj ), j = 1, . . . , r, while the PLS2 estimators
βbnpls and βbspls are based on estimators of the overall envelope EΣX (B) =
EΣX (B1 + · · · + Br ), since B = Σrj=1 Bj . The following lemma gives a relation-
ship between these subspace constructions. Its proof is in Appendix A.3.6.

Lemma 3.6. Let Σ ∈ Sp be positive definite and let Bj be a subspace of Rp ,


j = 1, 2. Then
EΣ (B1 + B2 ) = EΣ (B1 ) + EΣ (B2 ),
and

max{dim(EΣ (B1 )), dim(EΣ (B2 ))} ≤ dim(EΣ (B1 + B2 ))


≤ dim(EΣ (B1 )) + dim(EΣ (B2 )).

This lemma implies a key fact about the relationship between PLS1 and PLS2
envelopes:

EΣX (B) = EΣX (B1 + · · · + Br ) = EΣX (B1 ) + · · · + EΣX (Br ).

According to this result, if our goal is to estimate EΣX(B) in order to facilitate


estimation of β or prediction then, in the population, it does not matter if
we use PLS1 or PLS2 regressions. However, estimating β via an estimator
of EΣX(B) is not the only way to use envelopes to improve estimation. An
alternative is to pursue column-wise estimation of β, using EΣX (Bj ) to im-
prove estimation of the j-th column βj , j = 1, . . . , r, and then assembling the
estimators into an estimator of β, as shown in (3.28).
This distinction reflects a potential for PLS1 fits to do better than PLS2
fits, because PLS1 is better able to adapt to the individual response regres-
sions, at least theoretically. For instance, suppose that r = 2, dim{EΣX (B1 )} =
1 and dim{EΣX (B2 )} = p. Then substantial variance reduction is possible
when estimating β1 from the NIPALS, say, fit of Y1 on X. But no variance re-
duction is possible when estimating β2 from the regression of Y2 on X. Know-
ing the dimensions and using NIPALS, the PLS1 estimator of β takes the
form βbnpls1 = (βbnpls,1 , βbols,2 ). On the other hand, because dim(EΣX(B)) = p,
the overall NIPALS estimator must be simply βbnpls = βbols . In this case a PLS1
fit may be better than PLS2.
If dim{EΣX (Bj )} = 1 and Bj is a unique one-dimensional eigenspace of
ΣX , j = 1, . . . , r, so dim(EΣX(B)) = r, then, knowing the dimensions, each

NIPALS estimator βbnpls,j for the univariate regressions will be fitted with
q = 1. But the overall NIPALS estimator will be fitted with q = r, since
dim(EΣX(B)) = r. Again, a PLS1 fit may be better than PLS2.
If dim{EΣX (B)} = 1 and dim{EΣX (Bj )} = 1, j = 1, . . . , r then, knowing
the dimensions, each NIPALS estimator βbnpls,j for the univariate regressions
will be fitted with q = 1. But the overall NIPALS estimator will also be fitted
with q = 1, since dim(EΣX(B)) = 1. In this case a PLS2 fit may be better than
PLS1.
This discussion indicates that, with known envelope dimensions (number
of components), either PLS1 or PLS2 may be preferable, depending on the
regression. On the other hand, PLS1 methods require in application that each
dimension dim{EΣX (Bj )}, j = 1, . . . , r, be estimated separately, while PLS2
methods need only the one dimension dim{EΣX (B)} to be estimated. This means that
in PLS1 regressions there are more chances for mis-estimation of the number
of components. In practice it may be advantageous to try both methods and
use cross validation or a hold-out sample to assess their relative strengths.
While the previous discussion was cast in terms of PLS estimators, the
same reasoning and conclusions apply to the more general algorithms intro-
duced in Section 1.5. For instance, in multivariate regressions we can think
of applying Algorithms N and S one response at a time, leading to
estimators of the form

βbN1 := (βbN,1 , βbN,2 , . . . , βbN,r )
βbS1 := (βbS,1 , βbS,2 , . . . , βbS,r )    (3.29)

where βbN,j is the estimator of the coefficient vector from using Algorithm N
to fit the j-th response on X, with a similar definition for βbS,j .

3.11 PLS for response reduction


The various algorithms in this chapter were presented as methods for reduc-
ing the dimension of the predictor vector in linear regression. However, they
also apply to reducing the dimension of the response vector in multivariate
(multiple response) linear regression. To reduce a response vector using PLS
simply interchange the roles of the predictor vector X and the response vector

Y in any of the algorithms presented in this chapter. In other words, apply an


algorithm to the regression of X on Y . The justification for this stems from
the envelope theory presented in Section 2.4: The predictor envelope EΣX(B)
for the regression of Y on X is also the response envelope for the regression
of X on Y . Hence, to use PLS for response reduction, we simply specify Y as
the predictor vector and X as the response vector in any of the algorithms in
this chapter. The resulting semi-orthogonal weight matrix, say Gu ∈ Rr×u to
distinguish it from the weight matrix for reducing the predictors, is used to
reduce the response vector Y ↦ GTu Y .
However, the PLS estimators of β given along with the reductive algorithms
in this chapter are not appropriate following response reduction. Instead, the
PLS estimator is

βb = SX−1 SX,GTu Y GTu .

This estimator can be used in conjunction with the multivariate linear model
and cross validation to select an appropriate value for the number u of response
components, in the same way that cross validation is used to select the number
of predictor components.
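
Given a weight matrix Gu, the estimator is a short computation. The sketch below is our own illustration; Gu is supplied as an argument because it would come from running a PLS algorithm on the regression of X on Y , and SX is assumed nonsingular.

## beta-hat = S_X^{-1} S_{X, Gu'Y} Gu' for a given response weight matrix Gu (illustrative).
beta_response_reduced <- function(X, Y, Gu) {
  Xc <- scale(X, scale = FALSE)
  Yc <- scale(as.matrix(Y), scale = FALSE)
  n  <- nrow(Xc)
  Sx    <- crossprod(Xc) / n                   # S_X, assumed nonsingular here
  SxGuY <- crossprod(Xc, Yc %*% Gu) / n        # S_{X, Gu'Y}
  solve(Sx, SxGuY %*% t(Gu))                   # p x r coefficient matrix
}

Cross validation over the number of response components u then proceeds exactly as it does for predictor components.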
Response and predictor reductions can be combined to achieve simultane-
ous response-predictor reduction in the multivariate linear model. Methodol-
ogy for this is discussed in Chapter 5.
4

Asymptotic Properties of PLS

In this chapter we consider asymptotic properties of PLS estimators, as


n → ∞ and as n, p → ∞, in regressions with a real response, r = 1.
Sections 4.2 through 4.5 are restricted to regressions with one component,
q = dim(EΣX(B)) = 1. This is a notable limitation. However, there are many
instances of regressions requiring only a single component so these results do
have practical value. For instance, in chemometrics, application of Beer’s law
indicates that samples based on a single analyte will follow a one-component
regression (Nadler and Coifman, 2005). For additional examples, see Berntsson
and Wold (1986, Section 3.2), Cook (2018, Section 4.4.4), and Kettaneh-Wold
(1992, pg. 65). Chiappini et al. (2021) designed experiments based on a single
analyte. Data from these experiments are thus analyzed appropriately with a
one-component PLS fit.
The results for single-component regressions will serve also to set the stage
for our study of multi-component consistency in Section 4.6, still with a real
response. We consider one-component and multi-component regressions sep-
arately because much more is known about one-component regressions and
results for multi-component regressions tend to be more intricate. Illustra-
tions, simulations, and examples are used throughout the chapter. Notably
missing from this chapter are results on the asymptotic distribution of PLS
estimators in multi-component regressions. Such results were apparently un-
available during the writing of this book, but hopefully will be available in
the future.




4.1 Synopsis
Our study of asymptotic properties of PLS led us to believe that classical
PLS methods can be effective for studying the class of high dimensional abun-
dant regressions (Cook, Forzani, and Rothman, 2012). Abundance is defined
for one-component PLS regressions in Definition 4.1. It is defined for multi-
component regressions in Definition 4.3 and a characterization is given in
Proposition 4.2. Informally, an abundant regression is one in which many
predictors contribute information about the response. In some abundant re-
gressions, estimators of regression coefficients and fitted values can converge
at the √n rate as n, p → ∞ without regard to the relationship between n
and p. This phenomenon is described in Corollaries 4.3–4.5 for one compo-
nent regressions and in Theorems 4.7 and 4.8 for multi-component regressions,
relying mostly on the results of Cook and Forzani (2018, 2019).
Abundant regressions apparently occur frequently in the applied sciences.
For instance, in chemometrics calibration, spectra are often digitized to give
hundreds or even thousands of predictor variables (Martens and Næs, 1989)
and it is generally expected that many points along the spectrum contribute in-
formation about the analyte of interest. Wold, Kettaneh, and Tjessem (1996)
argued against the tendency to “. . . drastically reduce the number of variables
. . . ,” in effect arguing for abundance in chemometrics. In contrast to abun-
dance, a sparse regression is one in which few predictors contain response
information so the signal coming from the predictors is finite and bounded.
Classical PLS methods are generally inconsistent in sparse regressions, but
modifications have been developed to handle these cases, as discussed in
Section 11.4.
In Section 4.3 we discuss traditional asymptotic approximations based on
the one-component model and letting n → ∞ with p fixed. This material
mostly comes from Cook, Helland, and Su (2013). Although PLS is often
associated with high dimensional regressions in which n < p, it is still service-
able in traditional regression contexts and knowledge of this setting will help
paint an overall picture of PLS. In particular, there are settings in which PLS
asymptotically outperforms standard methods like ordinary least squares and
there are also settings in which PLS underperforms.

Using results from Basa, Cook, Forzani, and Marcos (2022), in Section 4.5
we describe the asymptotic normal distribution as n, p → ∞ of a user-selected
univariate linear combination of the PLS coefficients estimated from a one-
component PLS fit and show how to construct asymptotic confidence intervals
for mean and predicted values. We concluded from Theorem 4.2 and Corol-
lary 4.6 that the conditions for asymptotic normality are more stringent than
those for consistency and, as illustrated in Figure 4.4, that there is a potential
for asymptotic bias.

4.2 One-component regressions


4.2.1 Model
We modify certain notation in model (2.5) to emphasize the restriction to a
real response. We use σY2 |X in place of ΣY |X , σY2 for var(Y ) and, as mentioned
at the outset of Chapter 3, we use σX,Y in place of ΣX,Y . Since model (2.5)
is now restricted to a single component, Φ ∈ Rp is a length-one basis vector
for EΣX(B) and ∆ is a real number. To emphasize this structure, we use δ in
place of ∆. By direct calculation from (2.5), we get that σX,Y = Φηδ, and
so without loss of generality, we can define Φ = σX,Y /kσX,Y k, where for a
vector V , kV k = (V T V )1/2 , as defined in the notation section. The conclusion
that this Φ is a basis vector of EΣX(B) follows also from the discussion of the
NIPALS and SIMPLS algorithms in Sections 3.2 and 3.4.
Construct Φ0 ∈ Rp×(p−1) so that (Φ, Φ0 ) ∈ Rp×p is an orthonormal matrix.
Then by construction Y ⫫ X | ΦTX and cov(ΦTX, ΦT0 X) = 0. The indepen-
dence condition means that, given ΦTX, there is no more information in X
about Y , while the covariance condition ensures that ΦTX cannot be obscured
by high correlations. In consequence we again see ΦTX as the material part of
X, the part that furnishes information about Y , while ΦT0 X is the immaterial
part of X. Define the scalar var(ΦTX) = δ and the positive definite matrix
var(ΦT0 X) = ∆0 = ΦT0 ΣX Φ0 ∈ R(p−1)×(p−1) . Then, since EΣX(B) reduces ΣX ,
we have the representation ΣX = δΦΦT + Φ0 ∆0 ΦT0 , which shows that δ is an
eigenvalue of ΣX with eigenvector Φ. Model (2.5) with a univariate response

and a one-dimensional envelope can now be written as

= µ + δ −1 kσX,Y k ΦTXi + i , i = 1, . . . , n,

Yi (4.1)
ΣX = δΦΦT + Φ0 ∆0 ΦT0 (4.2)
σY2 |X = var() > 0.

In reference to the general form of the linear model for predictor envelopes
given at (2.5), η = δ−1 kσX,Y k. While δ is an eigenvalue of ΣX with corre-
sponding normalized eigenvector Φ,

δ = ΦT ΣX Φ = σTX,Y ΣX σX,Y /kσX,Y k2 ,    (4.3)

it need not be the largest eigenvalue of ΣX . We see also that the vector of
regression coefficients

β = ΣX−1 σX,Y = ΣX−1 ΦkσX,Y k = δ−1 σX,Y = {kσX,Y k2 /(σTX,Y ΣX σX,Y )}σX,Y

is an (unnormalized) eigenvector of ΣX . This type of connection between
the conditional mean E(Y | X) and the predictor variance ΣX sets PLS,
and envelope methodology generally, apart from other dimension reduction
paradigms. It is a consequence of postulating that the marginal distribution
of X contains information relevant to the regression.
It follows from (4.1), (4.2), and (4.3) that

σY2 = var(Y ) = δ−1 kσX,Y k2 + σY2 |X    (4.4)
cov(Y, ΦTX) = kσX,Y k    (4.5)

and that the squared population correlation coefficient between Y and ΦTX
is

ρ2 (Y, ΦTX) = δ−1 kσX,Y k2 /(δ−1 kσX,Y k2 + σY2 |X ).    (4.6)
Since Y and X are required to have a joint distribution, the covariance matrix
of the concatenated variable C = (X T , Y )T is similar to (2.9):

ΣC = ( ΣX , σX,Y ; σTX,Y , σY2 ) = ( δΦΦT + Φ0 ∆0 ΦT0 , δηΦ ; δηΦT , σY2 ),    (4.7)

where the semicolons separate the rows of the 2 × 2 block matrices.

4.2.2 Estimation
The usual estimators of ΣX and σX,Y are SX and SX,Y , as given in Section 1.2.
With a single component, the estimated PLS weight vector is

Wc ∝ SX,Y ,
Φb = SX,Y /kSX,Y k,
δbpls = STX,Y SX SX,Y /kSX,Y k2 .

Let (Φb, Φb0 ) ∈ Rp×p be an orthonormal matrix. Then, the PLS estimators of
∆0 and ΣX are

∆b0,pls = ΦbT0 SX Φb0 ,
ΣbX,pls = δbpls ΦbΦbT + Φb0 ∆b0,pls ΦbT0 .

The PLS estimator of β is then

βbpls = {kSX,Y k2 /(STX,Y SX SX,Y )} SX,Y = SX,Y /δbpls .

The estimator of µ is µbpls = Ȳ − βbTpls X̄, so with centered predictors the fitted
values can be expressed as Ybi = Ȳ + βbTpls (Xi − X̄). The predicted value of Y
at a new value XN of X is then YbN = Eb(Y | XN ) = Ȳ + βbTpls (XN − X̄).
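
These one-component estimators amount to a few lines of matrix code. The following sketch (our own base-R illustration, not the book's software) computes βbpls , µbpls , the fitted values and predictions at new predictor values.

## One-component (q = 1) PLS estimator of Section 4.2.2 -- illustrative sketch.
pls_one_component <- function(X, y) {
  n   <- nrow(X)
  Xc  <- scale(X, scale = FALSE)
  Sxy <- drop(crossprod(Xc, y - mean(y))) / n            # S_{X,Y}
  Sx  <- crossprod(Xc) / n                               # S_X
  delta <- drop(t(Sxy) %*% Sx %*% Sxy) / sum(Sxy^2)      # delta-hat_pls
  beta  <- Sxy / delta                                   # beta-hat_pls = S_{X,Y} / delta-hat_pls
  list(beta = beta,
       mu = mean(y) - sum(beta * colMeans(X)),           # mu-hat
       fitted = mean(y) + drop(Xc %*% beta))
}
predict_pls1 <- function(fit, Xnew) drop(fit$mu + Xnew %*% fit$beta)

## toy usage; note that S_X need not be invertible (p can exceed n)
set.seed(7)
X <- matrix(rnorm(60 * 80), 60, 80)
y <- drop(X %*% rep(0.1, 80)) + rnorm(60)
fit <- pls_one_component(X, y)
head(predict_pls1(fit, X))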

4.3 Asymptotic distribution of βbpls as n → ∞ with p fixed


In this section we discuss the asymptotic approximations of the distribution
of βbpls based on letting n → ∞ with p fixed. The discussion is confined to
the regression structure of Section 4.2.1: a one-component (q = 1) regression
with a univariate response (r = 1) based on an independent and identically
distributed sample (Yi , Xi ), i = 1, . . . , n, which has finite fourth moments.
Under this structure Cook, Helland, and Su (2013, Proposition 10) proved
the following.

Proposition 4.1. Assume the one-component regression structure given in


(4.1) and (4.7) with p fixed for an independent and identically distributed
sample (Yi , Xi ), i = 1, . . . , n, which has finite fourth moments. Then

(i) The PLS estimator βbpls of β has the expansion

√n(βbpls − β) = n−1/2 δ−1 Σni=1 {(Xi − µX )εi + QΦ (Xi − µX )(Xi − µX )T β} + Op (n−1/2 ),

where ε is the error for model (4.1).

(ii) √n(βbpls − β) is asymptotically normal with mean 0 and variance

avar(√nβbpls ) = δ−2 {ΣX σY2 |X + var(QΦ (X − µX )(X − µX )T β)}.

(iii) If, in addition, PΦ X ⫫ QΦ X then

avar(√nβbpls ) = δ−1 σY2 |X PΦ + δ−2 σY2 Φ0 ∆0 ΦT0 .

The proof of Proposition 4.1 is rather long and so Cook, Helland, and Su
(2013) provided only a few key steps. We provide more details in the proof
given in Appendix A.4.1.
According to the results in parts (i) and (ii) of Proposition 4.1, βbpls is
asymptotically normal and its asymptotic covariance depends on fourth
moments of the marginal distribution of X. However, if PΦ X is independent
of QΦ X, as required in part (iii), then only second moments are needed. The
condition of part (iii) is implied when X is normally distributed:

cov(PΦ X, QΦ X) = PΦ ΣX QΦ = 0

since the envelope reduces ΣX .


Under the conditions of Proposition 4.1 (iii),

avar(√nβbols ) = σY2 |X (δ−1 PΦ + Φ0 ∆0−1 ΦT0 ).

Using this result, the asymptotic covariance in part (iii) of Proposition 4.1
can be expressed equivalently as

avar(√nβbpls ) = avar(√nβbols ) + Φ0 ∆0−1/2 {(σY2 /σY2 |X )(∆20 /δ 2 ) − Ip−1 }∆0−1/2 ΦT0 σY2 |X .

From this we see that the performance of PLS relative to OLS depends
on the strength of the regression as measured by the ratio σY2 /σY2 |X ≥ 1

and on the level of collinearity as measured by ∆20 /δ 2 . For every level of


collinearity there is a regression so that PLS does worse asymptotically
than OLS, and for every regression strength, there is a level of collinear-
ity so that PLS does better than OLS. For instance, if ΣX = σ2X Ip then
avar(√nβbpls ) = avar(√nβbols ) + Φ0 ΦT0 (σY2 − σY2 |X ) and the asymptotic covari-
ance of the PLS estimator is never less than that of the OLS estimator. In
contrast, recall from Proposition 2.1 that the maximum likelihood envelope
estimator never does worse than OLS, avar(√nβb) ≤ avar(√nβbols ). Since β ≠ 0
and r = 1, it follows from Corollary 2.1 that these asymptotic variances are
equal when ΣX = σ2X Ip .

4.3.1 Envelope vs. PLS estimators


Since the envelope estimator βb of β under model (4.1) with multivariate nor-
mality is the maximum likelihood estimator, it follows from general likelihood
theory that avar(√nβb) ≤ avar(√nβbpls ). The following corollary, which makes
use of Proposition 2.1 and is discussed in Appendix 4.1, addresses the dif-
ference between the envelope and PLS estimators in a special case of model
(4.1).

Corollary 4.1. Assume that the regression of Y ∈ R1 on X ∈ Rp follows
model (4.1) with ∆0 = δ0 Ip−1 and that C = (Y, X T )T correspondingly follows
a multivariate normal distribution. Then

avar(√nβb) = σY2 |X δ−1 ΦΦT + η 2 {η 2 δ0 /σY2 |X + (δ0 /δ)(1 − δ/δ0 )2 }−1 Φ0 ΦT0
avar(√nβbpls ) = σY2 |X δ−1 ΦΦT + (σY2 δ0 /δ 2 )Φ0 ΦT0 ,

where avar(√nβb) was obtained as a special case of Proposition 2.1 and
η = δ−1 kσX,Y k, as defined following (4.2).

The first addends on the right-hand side of the asymptotic variances are
the same and correspond to the asymptotic variances when Φ is known. This
is as it must be since the envelope and PLS estimators are identical when Φ
is known. Thus, the second addends on the right-hand side of the asymptotic
variances correspond to the cost of estimating the envelope. To compare the
asymptotic variances we compare the costs of estimating the envelope. Let

Cost for βb = η 2 {η 2 δ0 /σY2 |X + (δ0 /δ)(1 − δ/δ0 )2 }−1
Cost for βbpls = σY2 δ0 /δ 2 .



It follows from Corollary 4.1 that

(Cost for βb)/(Cost for βbpls ) = τ (1 − τ )/{(δ0 /δ − τ )2 + τ (1 − τ )} ≤ 1,

where τ = σY2 |X /σY2 , which is essentially one minus the population multi-
ple correlation coefficient for the regression of Y on X. The relative cost of
estimating the envelope is always less than or equal to one, indicating that
PLS is the more variable method asymptotically. This is expected since the
envelope estimator inherits optimal properties from general likelihood theory.
Otherwise, the cost ratio depends on the signal strength as measured by τ and
the level of collinearity, as measured by δ0 /δ. The envelope estimator tends
to do much better than PLS in low signal regressions where τ is close to 1
and in high signal regressions where τ is close to 0. If there is a high degree of
collinearity so δ0 /δ is small and (δ0 /δ − τ )2 ≈ τ 2 then the cost ratio reduces to
1 − τ and again the envelope estimator will do better than PLS in low signal
regressions. On the other hand, if the level of collinearity is about the same
as the signal strength δ0 /δ − τ ≈ 0 then the PLS estimator will do about the
same as the envelope estimator asymptotically.
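
Because the cost ratio is a simple scalar function of τ and δ0 /δ, its behavior is easy to tabulate; the short sketch below (our own illustration) evaluates it on a small grid.

## Relative cost of estimating the envelope: tau(1-tau) / {(delta0/delta - tau)^2 + tau(1-tau)}.
cost_ratio <- function(tau, ratio0) {            # ratio0 = delta0 / delta
  tau * (1 - tau) / ((ratio0 - tau)^2 + tau * (1 - tau))
}
tau    <- c(0.1, 0.5, 0.9)                       # tau = sigma^2_{Y|X} / sigma^2_Y
ratio0 <- c(0.1, 0.5, 1, 5)                      # collinearity delta0 / delta
round(outer(tau, ratio0, cost_ratio), 3)         # all entries are <= 1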
The comparison of this section is based on approximations achieved by
letting n → ∞ with p fixed. They may have little relevance in regression where
n is not large relative to p. When n < p, PLS regression is still serviceable,
while the maximum likelihood estimation based on model (4.1) is not.

4.3.2 Mussels muscles


Cook (2018, Section 4.4.4) used PLS to analyze data from an ecological study
of horse mussels sampled from the Marlborough Sounds off the coast of New
Zealand. The response variable Y is the logarithm of the mussel’s muscle mass,
and four predictors are the logarithms of shell height, shell width, shell length
(each in millimeters), and shell mass (in grams). The sample size is n = 79 and
p = 4. Cook (2018) obtained a clear inference that the regression required only
a single component, q = 1, so these data fit the requirements of this section.
As background, Figure 4.1 shows a plot of the response versus fitted values
Ybi = Eb(Y | Xi ) = µbpls + βbTpls Xi from the one-component PLS fit. The plot shows
a clear linear trend with no sign of notable anomalies. Figure 4.2 shows a plot
of the fitted values from the one-component PLS fit against the OLS fitted
values from the fit of Y on all four predictors.

FIGURE 4.1
Mussels’ data: Plot of the observed responses Yi versus the fitted values Ybi
from the PLS fit with one component, q = 1


FIGURE 4.2
Mussels’ data: Plot of the fitted values Ybi from the PLS fit with one component,
q = 1, versus the OLS fitted values. The diagonal line y = x was included for
clarity.

TABLE 4.1
Coefficient estimates and corresponding asymptotic standard deviations
(S.D.) from three fits of the mussels’ data: ordinary least squares (OLS),
partial least squares with one component (PLS), and envelope with one di-
mension (ENV). Results for OLS and ENV are from Cook (2018).
OLS PLS, q = 1 ENV, q = 1
Estimate S.D. Estimate S.D. Estimate S.D.
βb1 0.741 0.410 0.142 0.0063 0.141 0.0052
βb2 −0.113 0.399 0.153 0.0067 0.154 0.0056
βb3 0.567 0.118 0.625 0.0199 0.625 0.0194
βb4 0.170 0.304 0.206 0.0086 0.206 0.0073

The coefficient estimates from the OLS, PLS, and envelope fits along with
their asymptotic standard errors are shown in Table 4.1. The OLS standard
errors are the usual ones. The PLS standard errors were obtained by plugging
in estimates into the asymptotic variances given in Proposition 4.1. The enve-
lope standard errors were obtained similarly from the asymptotic covariance
matrix given in Proposition 2.1. The PLS and envelope standard errors were
computed under multivariate normality of X. We see that the OLS standard
errors are between about 6 and 60 times those of the PLS estimator. These
types of differences are not unusual in the envelope literature, as illustrated
in examples from Cook (2018). The PLS standard errors are between about
2 and 20 percent larger than the corresponding envelope standard errors, but
are still much smaller than the OLS standard errors.
Shown in Table 4.2 are the estimates of the predictor covariance matrix ΣX
from the envelope, OLS, and PLS fits of the mussels regression. The estimate
of ΣX from the OLS fit is SX as defined in Section 1.2.1. The estimates from
the envelope and PLS fits were obtained by substituting the corresponding
estimates into (4.2). Estimation from the PLS fit is discussed in Section 4.2.2.
We see that the envelope and PLS estimates are both close to the OLS esti-
mate, which supports the finding that the dimension of the envelope is 1.
TABLE 4.2
Mussels’ muscles: Estimates of the covariance matrix ΣX from the envelope,
OLS and PLS (q = 1) fits.

Envelope
X log H log L log S log W
log H 0.030 0.031 0.123 0.041
log L 0.035 0.134 0.045
log S 0.550 0.180
log W 0.063
OLS
X log H log L log S log W
log H 0.030 0.031 0.123 0.041
log L 0.035 0.134 0.045
log S 0.550 0.180
log W 0.063
PLS
X log H log L log S log W
log H 0.031 0.032 0.126 0.042
log L 0.036 0.136 0.046
log S 0.557 0.182
log W 0.064

4.4 Consistency of PLS in high-dimensional regressions


In this section we discuss consistency in high-dimensional regressions, essen-
tially reporting on the work of Cook and Forzani (2018). The context begins
with that used for Proposition 4.1 – the data (Yi , Xi ), i = 1, . . . , n, arise as in-
dependent copies of (Y, X), which follows the single-response, one-component
PLS regression described in Section 4.2.1. Here we require, in addition, that
(Y, X) follows a non-singular (p + 1)-dimensional multivariate normal distri-
bution, and we allow n and p to diverge in various alignments. The normality
assumption is used to facilitate asymptotic calculations and to connect with
related results in the literature; nevertheless, simulations and experience in
practice indicate that it is not essential for the results of this section to be
useful elucidations. To avoid trivial cases, we assume throughout that β ≠ 0.
Estimators are as described in Section 4.2.2.

Recall from Section 1.2 that Y0 = (Y1 , . . . , Yn )T denotes the vector of un-
centered responses and that X denotes the n × p matrix with rows (Xi − X̄)T ,
i = 1, . . . , n. Recall also that SX,Y = n−1 XT Y0 and SX = n−1 XT X ≥ 0
represent the usual moment estimators of σX,Y and ΣX using n for the
divisor. We use Wq (M ) to denote the Wishart distribution with q degrees of
freedom and scale matrix M . With W = XT X ∼ Wn−1 (ΣX ), we can represent
SX = W/n, SX,Y = n−1 (W β + XT ε), where ε is the n × 1 vector with the
model errors εi as elements. (This use of the W notation is different from
the weight matrix in PLS.) We use the notation “ak ≍ bk ” to mean that, as
k → ∞, ak = O(bk ) and bk = O(ak ), and describe ak and bk as then being
asymptotically equivalent.

4.4.1 Goal
Our general goal is to gain insights into the predictive performance of βbpls as
n and p grow in various alignments. In this section we describe how that goal
will be pursued.
Let YN = µ + β T (XN − E(X)) + εN denote a new observation on Y at
a new independent observation XN of X. The PLS predicted value of YN at
XN is YbN = Ȳ + βbTpls (XN − X̄), giving a difference of

YbN − YN = Ȳ − µ + (βbpls − β)T (XN − E(X)) − (βbpls − β)T (X̄ − E(X))
         − β T (X̄ − E(X)) + εN .

The first term Ȳ − µ = Op (n−1/2 ). Since σY2 = β T ΣX β + σY2 |X must remain
constant as p grows, β ≠ 0 and ΣX > 0, we see that β T ΣX β = O(1) as p → ∞.
Thus the fourth term β T (X̄ − E(X)) = Op (n−1/2 ) by Chebyshev’s inequality:

var(β T (X̄ − E(X))) = β T ΣX β/n → 0 as n, p → ∞.

The term (βbpls − β)T (X̄ − E(X)) must have order smaller than or equal to
the order of (βbpls − β)T (XN − E(X)), which will be at least Op (n−1/2 ).
Consequently, we have the essential asymptotic representation

YbN − YN = Op {(βbpls − β)T (XN − E(X))} + εN as n, p → ∞.

Since εN is the intrinsic error in the new observation, the n, p-asymptotic be-
havior of the prediction YbN is governed by the estimative performance of βbpls
as measured by

DN := (βbpls − β)T ωN
    = {STX,Y Φb(ΦbT SX Φb)−1 ΦbT − σTX,Y Φ(ΦT ΣX Φ)−1 ΦT }ωN ,    (4.8)

where ωN = XN − E(X) ∼ N (0, ΣX ). Consistency of βbpls is discussed at the


end of Section 4.6.4. Until then, we focus exclusively on predictions via DN .
We next give characterizations of population constructions that will be
helpful when studying the asymptotic properties of DN .

4.4.2 Quantities governing asymptotic behavior of DN


Recall that β = ηΦ, η = δ −1 kσX,Y k and, from (4.3), δ = ΦT ΣX Φ. Properties
of DN depend in part on the asymptotic behavior of

β T ΣX β = η 2 ΦT ΣX Φ = δ −2 kσX,Y k2 ΦT ΣX Φ = δ −1 kσX,Y k2 (4.9)

as p → ∞. We know that σY2 = β T ΣX β + σY2 |X . Since σY2 remains fixed,


σY2 ≥ β T ΣX β ≥ 0 as p → ∞. It is technically possible to have β T ΣX β → σY2
so in the limit σY2 |X = 0. We find this implausible as a useful approximation
and so we assume that σY2 |X is bounded away from 0.
On the other extreme, it is possible for β T ΣX β → 0 as p → ∞, in which
case there is no signal left in the limit and ρ2 (Y, ΦTX) → 0, where ρ is as given
in (4.6). To gain intuition into scenarios where this might happen, consider
β T ΣX β = kσX,Y k2 /δ = cov2 (ΦTX, Y )/var(ΦTX),
where the second equality follows from (4.5) and δ = var(ΦTX) = ΦT ΣX Φ
from (4.3). The notions of abundant and sparse regression provide useful in-
tuition on the asymptotic behavior of PLS. At the beginnings of Chapters 1
and 4 we characterized an abundant regression informally as one in which
many predictors contribute information about the response, while a sparse
regression is one in which few predictors contain response information. We
now formalize those ideas in the context of one-component regressions. This
definition is generalized to multi-component regressions in Section 4.6.1.

Definition 4.1. A one-component regression (4.1) is said to be abundant if


kσX,Y k → ∞ as p → ∞. It is said to be sparse if kσX,Y k is bounded as p → ∞.

We assume for illustration and to gain intuition about abundance and spar-
sity in the context of one-component regressions that X can be represented as
X = V ν + ζ, where ν is a real stochastic latent variable that is related to X
via the non-stochastic vector V and the X-errors ζ ⫫ (ν, Y ). Without loss of
generality we take var(ν) = 1. We know that, for our one-component model,
σX,Y is a basis for the one-dimensional envelope EΣX(B). Accordingly, in terms

of our illustrative model for X, σX,Y = V cov(ν, Y ) with cov(ν, Y ) ≠ 0 not
depending on p, so we can take our normalized basis vector to be Φ = V /kV k
and then re-express our illustrative model as X = ΦkV kν + ζ. In consequence,

β T ΣX β = cov2 (ΦTX, Y )/var(ΦTX)    (4.10)
         = kV k2 cov2 (ν, Y )/(kV k2 + ΦT Σζ Φ)
         = kσX,Y k2 /{kσX,Y k2 /cov2 (ν, Y ) + ΦT Σζ Φ},
where Σζ = var(ζ) > 0. This result enables us to gain intuition on a few
different types of regressions.
Sparse regression: Perhaps the most common instance of a sparse regres-
sion is one in which only a finite number of elements of σX,Y are non-zero.
Correspondingly, the same elements of V and Φ are non-zero. Result (4.10)
shows that in such sparse regressions β T ΣX β is bounded away from 0 as
p → ∞ and so there will always be a signal present, although that signal
may be hard to find in application.

Abundant regression, bounded Σζ : Abundant regressions are those in


which kV k2 → ∞ as p → ∞. Since the eigenvalue of ΣX that goes with
eigenvector Φ is

δ = var(ΦTX) = kV k2 + ΦT Σζ Φ,

it is generally unreasonable in abundant regressions to require the eigen-


values of ΣX to be bounded. However, it does seem reasonable to require
bounded eigenvalues of Σζ , in which case β T ΣX β is again bounded away
from 0 as p → ∞.

Abundant regression, unbounded Σζ : It is possible in abundant regres-


sions to have Σζ unbounded and yet β T ΣX β is still bounded away from
0. As an instance of this, assume that V is a vector of ones V = 1p and
that Σζ is compound symmetric with off-diagonal element 1 > ρ > 0 and
diagonal element 1. Then, as shown in Appendix A.4.3,

Σζ = ρ1p 1Tp + (1 − ρ)Ip = {(p − 1)ρ + 1}P1p + (1 − ρ)Q1p .

Then, as p → ∞,

β T ΣX β = p cov2 (ν, Y )/{p + (p − 1)ρ + 1} → cov2 (ν, Y )/(ρ + 1).
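
A quick numerical check of this limit (our own illustration; the values of ρ and cov(ν, Y ) are arbitrary):

## beta' Sigma_X beta = p cov^2(nu, Y) / {p + (p - 1) rho + 1} in the construction above.
rho <- 0.4; cov_nuY <- 0.8
signal <- function(p) p * cov_nuY^2 / (p + (p - 1) * rho + 1)
round(sapply(c(10, 100, 1000, 1e5), signal), 4)   # increases toward the limit
cov_nuY^2 / (1 + rho)                             # the limiting value as p -> infinity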

Since the eigenvalue of Σζ associated with Φ increases linearly with p,


β T ΣX β is still bounded away from 0.

For β T ΣX β to converge to 0 in one-component regressions giving rise to


(4.10), we need ΦT Σζ Φ to increase faster than kV k2 . While that is technically
possible, we find it practically implausible. As a consequence of the discussion
so far, we assume that β T ΣX β is bounded away from zero and infinity as
p → ∞, implying that there will always be some regression signal with σY2 >
σY2 |X > 0. PLS has been found to be serviceable in many application areas,
particularly chemometrics, so we expect that this is not a severe restriction.
These restrictions can be imposed succinctly by requiring that β T ΣX β ≍ 1 as
p → ∞. Since now β T ΣX β ≍ 1, we have from (4.9) that δ−1 kσX,Y k2 ≍ 1 and

δ−2 kσX,Y k2 ΦT ΣX Φ = δ−2 σTX,Y ΣX σX,Y ≍ 1.

In consequence,

δ ≍ kσX,Y k2 and σTX,Y ΣX σX,Y ≍ kσX,Y k4 .

This implies that δ, the eigenvalue of ΣX associated with Φ, provides a mea-
sure of the signal that is asymptotically equivalent to kσX,Y k2 .

4.4.3 Consistency results


The asymptotic properties of PLS predictions depend also on the relationship
between tr(∆0 ), which reflects the variation of the noise in X, and kσX,Y k2 ,
which is a measure of the signal coming from X. The following theorem, which
stems from Cook and Forzani (2018) and is proven in Appendix A.4.4, pro-
vides a characterization of the asymptotic behavior of DN in terms of the noise
to signal ratio ∆σ := ∆0 /kσX,Y k2 . In preparation, define for j = 1, 2, 3, 4,

Kj (n, p) = tr(∆j0 )/(nkσX,Y k2j ) = tr(∆jσ )/n.    (4.11)
In this section we will use only K1 (n, p) and K2 (n, p). We will make use
of K3 (n, p) and K4 (n, p) when considering asymptotic distributions in Sec-
tion 4.5.

Theorem 4.1. If model (4.1) holds, β T ΣX β ≍ 1 and Kj (n, p), j = 1, 2,
converge to 0 as n, p → ∞ then

DN = Op [n−1/2 + K1 (n, p) + {K2 (n, p)}1/2 ].

This theorem can be used to gain insights into specific types of regressions.
In particular, in many regressions the eigenvalues of ∆0 may be bounded as
p → ∞, reflecting that the noise in X is bounded asymptotically. For instance,
this structure can occur when the predictor variation is compound symmetric:

ΣX = π 2 (1 − ρ + ρp)P1p + (1 − ρ)Q1p ,


where π 2 = var(Xj ) for all j and ρ = corr(Xj , Xk ) for all j ≠ k. Since


we are restricting consideration to q = 1, EΣX(B) must fall in one of the
two eigenspaces of ΣX : either EΣX(B) = span(1p ) or EΣX(B) ⊆ span⊥ (1p ). If
EΣX(B) = span(1p ) then ∆0 = (1 − ρ)Ip−1 which has bounded eigenvalues.

We then have Φ = 1p /√p, δ = π 2 (1 − ρ + ρp), signal kσX,Y k2 ≍ δ ≍ p and
tr(∆0 ) = tr((1 − ρ)Q1p ) ≍ p.
If the eigenvalues of ∆0 are bounded as p → ∞ then so are the eigenvalues
of ∆k0 for any fixed k. Consequently, tr(∆kσ ) ≍ p/kσX,Y k2k and we have

K1 (n, p) ≍ p/(nkσX,Y k2 )
{K2 (n, p)}1/2 ≍ √p/(√n kσX,Y k2 ).

Comparing the magnitudes of the Kj ’s as they approach 0, we have

Corollary 4.2. Assume the conditions of Theorem 4.1 and that the eigenval-
ues of ∆0 are bounded as p → ∞. Then

I. DN = Op {1/√n + √p/(√n kσX,Y k2 ) + p/(nkσX,Y k2 )}.

II. (Abundant) If kσX,Y k2 ≍ p then DN = Op {1/√n}.

III. (Sparse) If kσX,Y k2 ≍ 1 then DN = Op {√p/√n}.

The first conclusion tells us that with bounded noise the predictive performance of PLS methods depends on an interplay between the sample size $n$, the number of predictors $p$ and the signal as measured by $\|\sigma_{X,Y}\|^2$. The second and third conclusions relate to specific instances of that interplay. The second conclusion says informally that if most predictors are correlated with the response then PLS predictions will converge at the usual root-$n$ rate, even if $n < p$. The third conclusion says that if few predictors are correlated with the response or $\|\sigma_{X,Y}\|$ increases very slowly, then for predictive consistency the sample size

needs to be large relative to the number of predictors. The third case clearly suggests a sparse solution, while the second case does not. Sparse versions of PLS regression have been proposed by Chun and Keleş (2010) and Liland et al. (2013). In view of the apparent success of PLS regression over the past four decades, it is reasonable to conclude that many regressions are closer to abundant than sparse. The compound symmetry example with $\mathcal{E}_{\Sigma_X}(\mathcal{B}) = \mathrm{span}(1_p)$ is covered by Corollary 4.2(II) and so the convergence rate is $O_p(n^{-1/2})$.

The next two corollaries give more nuanced results by tying the signal $\|\sigma_{X,Y}\|^2$ and the number of predictors $p$ to the sample size $n$. Corollary 4.3 is intended to provide insights when the sample size exceeds the number of predictors, while Corollary 4.4 is for regressions where the sample size is less than the number of predictors.

Corollary 4.3. Assume the conditions of Theorem 4.1 and that the eigenvalues of $\Delta_0$ are bounded as $p \to \infty$. Assume also that $p \asymp n^a$ for $0 < a \leq 1$ and that $\|\sigma_{X,Y}\|^2 \asymp p^s \asymp n^{as}$ for $0 \leq s \leq 1$. Then

I. $D_N = O_p\left(n^{-1/2} + n^{a(1/2-s)-1/2} + n^{a(1-s)-1}\right) = O_p\left(n^{-1/2} + n^{a(1/2-s)-1/2}\right)$.

II. (Abundant) $D_N = O_p\left(n^{-1/2}\right)$ if $s \geq 1/2$.

III. (In-between) $D_N = O_p\left(n^{-1/2+a(1/2-s)}\right)$ if $0 < s \leq 1/2$.

IV. (Sparse) $D_N = O_p\left(n^{-(1-a)/2}\right)$ if $s = 0$.
The requirement from Theorem 4.1 that $K_j(n,p)$ converge to 0 forces the terms in the order given in conclusion I to converge to 0 to ensure consistency, which limits the values of $a$ and $s$ jointly. The corollary predicts that $s = 1/2$ is a breakpoint for $\sqrt{n}$-convergence of PLS predictions in high-dimensional regressions. If the signal accumulates at a rate of $\|\sigma_{X,Y}\|^2 \asymp p^{1/2}$ or greater, so $s \geq 1/2$, then predictions converge at the usual root-$n$ rate. This indicates that there is considerably more leeway in the signal rate to obtain $\sqrt{n}$-convergence than that described by Corollary 4.2(I). Otherwise a price is paid in terms of a slower rate of convergence. For example, if $\|\sigma_{X,Y}\|^2 \asymp p^{1/4}$ and $p \asymp n$, so $s = 1/4$ and $a = 1$, then $D_N = O_p(n^{-1/4})$, which is considerably slower than the root-$n$ rate of case II. This corollary also gives additional characterizations of how PLS predictions will do in sparse regressions. From conclusion IV, we see that if $s = 0$, so $\|\sigma_{X,Y}\|^2 \asymp 1$, and $a = 0.8$, then $D_N = O_p(n^{-0.1})$, which would not normally yield useful results but could likely be improved by using a sparse fit. On the other hand, if $a$ is small, say

$a = 0.2$ so $p \asymp n^{0.2}$, then $D_N = O_p(n^{-0.4})$, which is a reasonable rate that may yield useful conclusions. The in-between category is meant to signal that those regressions can behave rather like an abundant regression if $s$ is close to $1/2$ or like a sparse regression when $s$ is close to 0.
The next corollary allows p > n.

Corollary 4.4. Assume the conditions of Theorem 4.1 and that the eigenvalues of $\Delta_0$ are bounded as $p \to \infty$. Assume also that $p \asymp n^a$ for $a \geq 1$ and that $\|\sigma_{X,Y}\|^2 \asymp p^s \asymp n^{sa}$ for $0 \leq s \leq 1$. Then

I. $D_N = O_p\left(n^{-1/2} + n^{a(1/2-s)-1/2} + n^{a(1-s)-1}\right) = O_p\left(n^{-1/2} + n^{a(1-s)-1}\right)$.

II. (Abundant) $D_N = O_p(n^{-1/2})$ if $a(1-s) \leq 1/2$.

III. (In-between) $D_N = O_p(n^{a(1-s)-1})$ if $1/2 \leq a(1-s) < 1$.

IV. (Sparse) No convergence if $s = 0$.

Theorem 4.1 requires that $K_1(n,p) \to 0$. Thus, in the context of this corollary, for consistency we need, as $n, p \to \infty$,
$$K_1(n,p) \asymp \frac{p}{n\|\sigma_{X,Y}\|^2} \asymp n^{a(1-s)-1} \to 0,$$
which requires $a(1-s) < 1$. This is indicated also in conclusion I of this corollary. The corollary does not indicate an outcome when $a(1-s) \geq 1$, although here PLS might be inconsistent depending on the specific values of $a$ and $s$. The usual root-$n$ convergence rate is achieved when $a(1-s) \leq 1/2$. For instance, if $a = 2$, so $p \asymp n^2$, then we need $s \geq 3/4$ for root-$n$ convergence. However, in contrast to Corollary 4.3, here there is no convergence with sparsity. If $s = 0$, then we need $a < 1$ for convergence, which violates the corollary's hypothesis.
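The rate calculus of Corollaries 4.3 and 4.4 is easy to automate. The following R helper (our own illustration, not from the book) returns the exponent $r$ in $D_N = O_p(n^{-r})$ implied by part I of the two corollaries; a non-positive value signals that consistency is not guaranteed.

```r
# Rate exponent implied by D_N = O_p(n^{-1/2} + n^{a(1/2-s)-1/2} + n^{a(1-s)-1}):
# the exponent of convergence is the smallest of the three negative exponents.
pls_rate_exponent <- function(a, s) {
  min(1/2, 1/2 - a * (1/2 - s), 1 - a * (1 - s))
}

pls_rate_exponent(a = 1,   s = 1/4)   # 0.25: the n^{-1/4} example of Corollary 4.3
pls_rate_exponent(a = 0.2, s = 0)     # 0.4 : sparse, but p grows slowly
pls_rate_exponent(a = 2,   s = 3/4)   # 0.5 : root-n despite p ~ n^2
pls_rate_exponent(a = 2,   s = 0)     # -1  : no convergence indicated
```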
Figure 4.3 gives a visual representation of the division of the $(a, s)$ plane according to the convergence properties of PLS from parts II and III of Corollaries 4.3 and 4.4. The figure is constructed to convey the main parts of the conclusions; not all aspects of these corollaries are represented. The abundant and in-between categories occupy most of the plane, while sparse fits are indicated only in the upper left corner, which represents high-dimensional problems with weak signals, $p > n$ and $\|\sigma_{X,Y}\| \leq \sqrt{p}$.
FIGURE 4.3
Division of the $(a, s)$ plane according to the convergence properties given in conclusions II and III of Corollaries 4.3 and 4.4.

The previous three corollaries require that the eigenvalues of $\Delta_0$ be bounded. The next corollary relaxes this condition by allowing a finite number of eigenvalues $\delta_{0,j}$ of $\Delta_0$ to be asymptotically similar to $p$ ($\delta_{0,j} \asymp p$ for a finite collection of indices $j$), while keeping the remaining eigenvalues bounded. In this case, $\mathrm{tr}(\Delta_0^j) \asymp p^j$, $j = 1, 2, 3$. For instance, returning to the compound symmetry example, we may have $\mathcal{E}_{\Sigma_X}(\mathcal{B}) \subseteq \mathrm{span}^\perp(1_p)$. Then the eigenvalues of $\Delta_0$ are $\pi^2(1 + (p-1)\rho)$ with multiplicity 1, which accounts for the $(1 - \rho + \rho p)P_{1_p}$ term in $\Sigma_X$, and $\pi^2(1-\rho)$ with multiplicity $p - 2$. Consequently, $\delta_{0,1} \asymp p$ while the remaining eigenvalues of $\Sigma_X$ are all bounded.

Corollary 4.5. Assume the conditions of Theorem 4.1 and that $\delta_{0,j} \asymp p$ for a finite collection of indices $j$ while the other eigenvalues of $\Delta_0$ are bounded as $p \to \infty$. Assume also that $p \asymp n^a$ for $a \geq 1$ and that $\|\sigma_{X,Y}\|^2 \asymp p^s$ for $0 \leq s \leq 1$. Then
$$D_N = O_p\left(n^{-1/2 + a(1-s)}\right).$$

The conditions of Theorem 4.1 in the context of Corollary 4.5 imply that for consistency we need $a(1-s) < 1/2$, with the usual root-$n$ convergence rate being essentially achieved when $a(1-s)$ is small. If $s = 1$, then $D_N = O_p(n^{-1/2})$, which agrees with the conclusion of Corollary 4.2. This highlights one important conclusion from Corollary 4.5: PLS predictions can still have root-$n$ convergence when some of the eigenvalues of $\Delta_0$ increase like $p$, but for this to happen we need an abundant signal, $\|\sigma_{X,Y}\|^2 \asymp p$. Second, Corollary 4.5 shows the interaction between the number of predictors and the signal rate in high-dimensional regression. Write
$$n^{-1/2 + a(1-s)} = \frac{1}{\sqrt{n}}\,\frac{n^a}{n^{as}} \asymp \frac{1}{\sqrt{n}}\,\frac{p}{p^s}.$$
Thinking of $p^s$ roughly as the number of active predictors, this says that the number of predictors per active predictor must be small relative to the square root of the sample size for a good convergence rate. For instance, with $n = 625$, $p = 1000$ and about 250 active predictors, so $a \approx 1.075$ and $s \approx 0.8$, we get a corresponding convergence rate of about $n^{0.3}$. If we increase the number of active predictors to 500, the corresponding convergence rate becomes about $n^{0.4}$.
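The two rates quoted in this example follow from the Corollary 4.5 exponent $1/2 - a(1-s)$ with $a = \log p/\log n$ and $s = \log(\text{active})/\log p$. A one-line R check of that arithmetic (our own illustration):

```r
# Quick check of the rates quoted above under Corollary 4.5.
rate_cor45 <- function(n, p, active) {
  a <- log(p) / log(n); s <- log(active) / log(p)
  1/2 - a * (1 - s)
}
rate_cor45(625, 1000, 250)   # about 0.28, i.e., roughly n^{0.3}
rate_cor45(625, 1000, 500)   # about 0.39, i.e., roughly n^{0.4}
```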

4.5 Distributions of one-component PLS estimators


Let $G \in \mathbb{R}^p$ be a vector of user-selected, non-stochastic constants. We continue our study of PLS by describing in Section 4.5.1 the asymptotic distribution of $(\hat\beta_{pls} - \beta)^T G$ in one-component regressions as developed in Section 4.2. Our discussion is based largely on the results of Basa et al. (2022). Selecting $G^T = (1, -1, 0, \ldots, 0)$, the asymptotic distribution of $(\hat\beta_{pls} - \beta)^T G$ could be used to aid inference about the difference between the first two elements of $\beta$. In Section 4.5.2 we discuss how the distribution can be used to form asymptotic confidence intervals for $G^T\beta$. We expand the discussion in Section 4.5.3.2 by studying the asymptotic orders of various constituents. A few simulation results are presented in Section 4.5.4.
For a new randomly selected value $X_N$ of $X$, we report in Section 4.5.5 on the asymptotic distribution of $\hat\mu + \hat\beta_{pls}^T X_N$ as a tool for studying the conditional mean $E(Y \mid X) = \mu + \beta^T X$. In this section we give only the main

results of Basa et al. (2022); full proofs and additional details are available
from the supplement to their paper.

4.5.1 Asymptotic distributions and bias


We first introduce quantities that will be used in pursuing the asymptotic distribution of $(\hat\beta_{pls} - \beta)^T G$. Define
$$V(G) = \frac{G^T\left(\Sigma_X\sigma_Y^2 - \sigma_{X,Y}\sigma_{X,Y}^T\right)G}{n\delta^2} \qquad (4.12)$$
$$\phantom{V(G)} = \frac{(G^T\Phi)^2\sigma^2_{Y|X}}{n\delta} + \frac{\sigma_Y^2\,G^T\Phi_0\Delta_0\Phi_0^T G}{n\delta^2}, \qquad (4.13)$$
$$b = -\left(\frac{\|\sigma_{X,Y}\|^2}{\delta} - \sigma^2_{Y|X}\right)K_1 - \frac{\sigma_Y^2\|\sigma_{X,Y}\|^2}{\delta}\left(K_1^2 + K_2\right) = O\left(\max(K_1, K_2)\right), \qquad (4.14)$$
$$J(n,p) = n^{-1/2} + K_1(n,p) + K_2^{1/2}(n,p),$$
where we have suppressed the arguments $(n,p)$ to $K_j$. $V^{1/2}(G)$ will be used to scale $(\hat\beta_{pls} - \beta)^T G$, and $b$ is related to a potential for asymptotic bias. We use $\xrightarrow{D}$ to denote convergence in distribution.
The following theorem describes the asymptotic distribution reported by
Basa et al. (2022).

Theorem 4.2. Assume that the one-component model (4.1) holds with $\beta^T\Sigma_X\beta \asymp 1$. Assume also that (a) $X \sim N_p(\mu_X, \Sigma_X)$, (b) $E(\epsilon^4)$ from model (4.1) is bounded, (c) $K_1(n,p)$ and $K_2(n,p)$ converge to 0 as $n, p \to \infty$, and (d) $K_3(n,p)$ is bounded and $K_4(n,p)$ converges to 0 as $n, p \to \infty$.

(I) If $JV^{-1/2}\beta^T G b \to 0$ as $n, p \to \infty$, then
$$V^{-1/2}\left\{\hat\beta_{pls} - \beta(1 + b)\right\}^T G \xrightarrow{D} N(0, 1).$$

(II) If, as $n, p \to \infty$, $n^{1/4}K_j(n,p) \to 0$ for $j = 1, 2$, then $JV^{-1/2}\beta^T G b \to 0$.

Condition (a) requires normality for $X$, and condition (b) is the usual requirement of finite fourth moments. Condition (c) plus the constraint $\beta^T\Sigma_X\beta \asymp 1$ are the same as the conditions for consistency in Theorem 4.1. Condition (d) is new and is needed to insure stable asymptotic distributions. Like conditions (b) and (c), condition (d) is judged to be mild.

Although the non-stochastic sequence represented by $b$ converges to 0 as $K_1$ and $K_2$ converge to zero, it nevertheless represents a potential for bias when using Theorem 4.2 as the basis for inference in applications. In the following discussion we assume conditions (a)–(d) of Theorem 4.2 and focus on constraints that control the bias.

Since $b$ is a real scalar, inference about ratios of elements of $\beta$ will be relatively free of any bias effect. If $G^T\beta = 0$ there is no bias contribution since then the asymptotic mean is 0. If $G^T\beta \neq 0$, bias may play a notable role. Writing
$$V^{-1/2}G^T\left\{\hat\beta_{pls} - \beta(1+b)\right\} = V^{-1/2}G^T\left(\hat\beta_{pls} - \beta\right) - V^{-1/2}G^T\beta b,$$
we see by Slutsky's Theorem that the bias will be unimportant asymptotically when the second addend on the right side, $V^{-1/2}G^T\beta b$, converges to 0 as $p \to \infty$. Since we require $J \to 0$, the condition $V^{-1/2}G^T\beta b \to 0$ implies the sufficient condition of Theorem 4.2(I), $JV^{-1/2}\beta^T G b \to 0$, but not conversely. We use lower bounds on $V$ to find manageable sufficient conditions under which $V^{-1/2}G^T\beta b \to 0$. Direct calculation gives
$$V^{-1}(G^T\beta)^2 = n\,\frac{(\|\sigma_{X,Y}\|^2/\delta)\,(G^T\sigma_{X,Y})^2}{\sigma^2_{Y|X}(G^T\sigma_{X,Y})^2 + \sigma_Y^2(\|\sigma_{X,Y}\|^2/\delta)\,G^T\Phi_0\Delta_0\Phi_0^T G} \;\leq\; n\,\frac{\|\sigma_{X,Y}\|^2}{\delta\sigma^2_{Y|X}}, \qquad (4.15)$$
with the upper bound being attained when $G \in \mathrm{span}(\beta)$. Since the choice of $G$ does not affect $b$, this indicates that the bias effects will be most prominent when $G \in \mathrm{span}(\beta)$.

It follows from (4.15) that
$$V^{-1/2}|G^T\beta b| \leq \sqrt{n}\,|b|\left(\|\sigma_{X,Y}\|^2/\delta\sigma^2_{Y|X}\right)^{1/2} \asymp \sqrt{n}\,|b|. \qquad (4.16)$$
In consequence, $\sqrt{n}|b| \to 0$ is a sufficient condition for avoiding the bias effects when $G \notin \mathrm{span}^\perp(\beta)$. Inspecting $\sqrt{n}|b|$ we have
$$\sqrt{n}\,|b| \leq \sqrt{n}\left\{\left(\frac{\|\sigma_{X,Y}\|^2}{\delta} - \sigma^2_{Y|X}\right)K_1 + \frac{\sigma_Y^2\|\sigma_{X,Y}\|^2}{\delta}\left(K_1^2 + K_2\right)\right\}.$$
This inequality gives rise to conditions for $\sqrt{n}|b| \to 0$. Since $\sigma_Y^2$ and $\|\sigma_{X,Y}\|^2/\delta - \sigma^2_{Y|X}$ are bounded, it is sufficient to require $\sqrt{n}K_j(n,p) \to 0$, $j = 1, 2$. We summarize these findings in the following corollary.

Corollary 4.6. Under conditions (a)–(d) of Theorem 4.2,

(I) If $\sqrt{n}|b| \to 0$ as $n, p \to \infty$, then
$$V^{-1/2}\left(\hat\beta_{pls} - \beta\right)^T G \xrightarrow{D} N(0, 1).$$

(II) If, as $n, p \to \infty$, $\sqrt{n}K_j(n,p) \to 0$, $j = 1, 2$, then $\sqrt{n}|b| \to 0$.

Recall that for consistency Theorem 4.1 requires $K_j(n,p) \to 0$ as $n, p \to \infty$, $j = 1, 2$. In contrast, Theorem 4.2 and Corollary 4.6 require that $n^{1/4}K_j(n,p)$ and $\sqrt{n}K_j(n,p)$, $j = 1, 2$, converge to 0 for asymptotic normality with and without the bias term, respectively. In consequence, more stringent conditions are needed for asymptotic normality than are needed for consistency alone.

We conclude this section with a result in the next theorem that is closely related to Cook and Forzani (2018, Theorem 1).

Theorem 4.3. Under the conditions of Theorem 4.2,
$$(\hat\beta_{pls} - \beta)^T\Sigma_X(\hat\beta_{pls} - \beta) = O_p\left\{n^{-1} + K_1^2(n,p) + K_2(n,p)\right\}.$$
Therefore, for $X_N - \mu_X \sim N(0, \Sigma_X)$ with $\Sigma_X$ as specified in (4.2),
$$(\hat\beta_{pls} - \beta)^T(X_N - \mu_X) = O_p\left\{n^{-1/2} + K_1(n,p) + K_2^{1/2}(n,p)\right\}.$$
This establishes consistency of $\hat\beta_{pls}$ in the $\Sigma_X$ inner product.

4.5.2 Confidence intervals


Theorem 4.2 allows the straightforward construction of asymptotic confidence intervals for $(1+b)\beta^T G$:
$$CI_\alpha\left((1+b)\beta^T G\right) = \left[\hat\beta_{pls}^T G \pm z_{\alpha/2}\,V^{1/2}(G)\right], \qquad (4.17)$$
where $z_{\alpha/2}$ denotes a selected percentile of the standard normal distribution. Under Corollary 4.6 this same interval becomes a confidence interval for $\beta^T G$, in which case we refer to the interval as $CI_\alpha(\beta^T G)$.
It is not possible to use (4.17) in applications because $V$ is unknown. However, a sample version of interval (4.17) can be constructed by using plug-in estimators for $V$. To construct an estimator of $V$ we simply plug in the estimators of the constituents of representation (4.12),
$$\hat V(G) = \frac{G^T\left(S_X S_Y - S_{X,Y}S_{X,Y}^T\right)G}{n\hat\delta_{pls}^2}, \qquad (4.18)$$

where
$$\hat\delta_{pls} = \hat\Phi^T S_X\hat\Phi = S_{X,Y}^T S_X S_{X,Y}/S_{X,Y}^T S_{X,Y},$$
as described in Section 4.2.2. The following theorem implies, by Slutsky's Theorem, that asymptotically it is reasonable to replace $V$ with $\hat V$ in the construction of confidence intervals:

Theorem 4.4. Under the conditions of Theorem 4.2, $\hat V/V \to 1$ in probability as $n, p \to \infty$.

Additionally, we have conditions that allow us to replace $b$ with $\hat b$, and therefore obtain a confidence interval $CI_\alpha(\beta^T G)$ for $\beta^T G$ under only Theorem 4.2(I). Let $\hat b$ denote the plug-in estimator of $b$:
$$\hat b = -\left(\frac{\|S_{X,Y}\|^2}{\hat\delta_{pls}} - S^2_{Y|X}\right)\hat K_1 - \frac{S_Y^2\|S_{X,Y}\|^2}{\hat\delta_{pls}}\left(\hat K_1^2 + \hat K_2\right),$$
where
$$\hat K_j = \frac{\mathrm{tr}(\hat\Delta^j_{0,pls})}{n\|S_{X,Y}\|^{2j}}$$
and $\hat\Delta_{0,pls}$ is as defined in Section 4.2.2.

Theorem 4.5. Under all conditions of Theorem 4.2(II), $n(\hat b - b) \to 0$ in probability as $n, p \to \infty$. In consequence, an asymptotic confidence interval for $\beta^T G$ is
$$CI_\alpha(\beta^T G) = \frac{1}{1 + \hat b}\left[\hat\beta_{pls}^T G \pm z_{\alpha/2}\,\hat V^{1/2}(G)\right].$$
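To make the plug-in construction concrete, the following R sketch (our own illustration, not code from the book or its accompanying software) computes $\hat\delta_{pls}$, $\hat V(G)$, $\hat b$ and the intervals of (4.17) and Theorem 4.5 from a predictor matrix and response vector. The function and variable names are assumptions, and the plug-ins used for $S^2_{Y|X}$ and $\hat\Delta_{0,pls}$ are natural choices that may differ in detail from the book's Section 4.2.2.

```r
# Hedged R sketch of the plug-in confidence intervals for beta^T G in a
# one-component PLS regression.
pls1_ci <- function(X, y, G, alpha = 0.05) {
  n <- nrow(X); p <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  Sxy <- drop(crossprod(Xc, yc)) / n               # sample cov(X, Y)
  Sx  <- crossprod(Xc) / n                         # sample var(X)
  Sy  <- sum(yc^2) / n                             # sample var(Y)

  # One-component PLS estimator and delta-hat
  delta_hat <- drop(t(Sxy) %*% Sx %*% Sxy) / sum(Sxy^2)
  beta_hat  <- Sxy / delta_hat
  Sy_x      <- Sy - sum(Sxy^2) / delta_hat         # assumed plug-in for sigma^2_{Y|X}

  # Plug-in V-hat of (4.18) and K-hats feeding the bias estimate b-hat
  V_hat <- drop(t(G) %*% (Sx * Sy - tcrossprod(Sxy)) %*% G) / (n * delta_hat^2)
  Phi_hat <- Sxy / sqrt(sum(Sxy^2))
  Qhat <- diag(p) - tcrossprod(Phi_hat)
  D0 <- Qhat %*% Sx %*% Qhat                       # assumed plug-in for Delta0-hat
  ev <- eigen(D0, symmetric = TRUE, only.values = TRUE)$values
  K1 <- sum(ev)   / (n * sum(Sxy^2))
  K2 <- sum(ev^2) / (n * sum(Sxy^2)^2)
  b_hat <- -(sum(Sxy^2) / delta_hat - Sy_x) * K1 -
            Sy * sum(Sxy^2) / delta_hat * (K1^2 + K2)

  z <- qnorm(1 - alpha / 2)
  est <- sum(beta_hat * G)
  list(ci_biased   = est + c(-1, 1) * z * sqrt(V_hat),                  # for (1+b) beta^T G
       ci_adjusted = (est + c(-1, 1) * z * sqrt(V_hat)) / (1 + b_hat))  # Theorem 4.5
}
```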

4.5.3 Theorem 4.2, Corollary 4.6 and their constituents


In this section we assume that the one-component model (4.1) holds with
β T ΣX β  1, and that conditions (a) and (b) of Theorem 4.2 hold. Our objec-
tive is to study the remaining constituents of the theorem, including conditions
(c) and (d), and its Corollary 4.6 in plausible scenarios that may help with
intuition in practice. We now consider settings in which,

1. The one-component model (4.1) holds with β T ΣX β  1, and (a) and (b)
of Theorem 4.2.

2. kσX,Y k2  pt for 0 ≤ t ≤ 1.

3. Eigenvalue constraints:

(i). The eigenvalues of $\Delta_0$ are bounded, so $\lambda_{\max}(\Delta_0) \asymp 1$ and $\mathrm{tr}(\Delta_0^k) \asymp p$ for all $k$. In this case we redefine $K_j(n,p) = p/np^{jt}$, which is equivalent to definition (4.11) for the purpose of assessing conditions (c) and (d) of Theorem 4.2.

(ii). A finite number of eigenvalues of $\Delta_0$ increase as $p$, so $\lambda_{\max}(\Delta_0) \asymp p$ and $\mathrm{tr}(\Delta_0^k) \asymp p^k$ for all $k$. In this case we redefine $K_j(n,p) = p^j/np^{jt}$.

Conditions 1 and 2, and either 3(i) or 3(ii), are assumed for the rest of this section.

4.5.3.1 Kj (n, p)

It should be understood in the following that we deal with $K_j(n,p)$ for only $j = 1, \ldots, 4$. Under condition 3(i), we have for $j \geq 1$
$$K_j(n,p) = \frac{p}{np^{jt}} = \frac{1}{np^{jt-1}}, \qquad K_1(n,p) \geq K_j(n,p) \text{ for all } n, p. \qquad (4.19)$$
From here we see that $K_j(n,p) \to 0$ if $p^{1-t} = o(n)$.

Under condition 3(ii),
$$K_j(n,p) = \frac{p^j}{np^{jt}} = \frac{1}{np^{j(t-1)}}, \qquad K_i(n,p) \leq K_j(n,p), \quad i \leq j. \qquad (4.20)$$
From here we see that $K_j(n,p) \to 0$ if $K_4(n,p) \to 0$ or, equivalently, if $p^{4(1-t)} = o(n)$. Sparse regressions with $t = 0$ place the most severe requirements on $n$: under condition 3(i) we must have $p/n \to 0$ and under condition 3(ii) we must have $p^4/n \to 0$.

To summarize for later reference, conditions (c) and (d) of Theorem 4.2 will hold when
$$\frac{p^{1-t}}{n} \to 0 \quad\text{under 3(i)}, \qquad (4.21)$$
$$\frac{p^{4(1-t)}}{n} \to 0 \quad\text{under 3(ii)}. \qquad (4.22)$$

4.5.3.2 Scaling quantity V (G)

Two equal forms for V (G) were presented in (4.12) and (4.13). The form
given in (4.12) was used in previous developments. The form in (4.13) derives

immediately from the asymptotic variance for the fixed p case presented in
Proposition 4.1(iii):
√  √ 
GT avar nβbpls G avar nGT βbpls
V (G) = = .
n n
That is, the V (G) derived from n-asymptotics is the same as that in n, p-
asymptotics. When letting n → ∞ with p fixed, nV is static; it does not
change since p is static. In consequence, with p static, the result from Propo-
√ D
sition 4.1 can be stated alternatively as nGT (βbpls − β)−→N (0, nV ), leading

to the conclusion that convergence is at the usual n rate.
The same scaling quantity V is appropriate for our n, p-asymptotic results:
D D
V −1/2 (βbpls − β(1 + b))T G−→N (0, 1) and V −1/2 (βbpls − β)T G−→N (0, 1).

However, we do not necessarily have n convergence because, in our con-
text, nV is dynamic, changing as p → ∞. Let λmax (·) denote the maximum
eigenvalue of the argument matrix. Then
1 n T 2 2 o
V (G) = (G Φ) σY |X + δ −1 σY2 GT Φ0 ∆0 ΦT0 G

1 n o
≤ kGk2 σY2 |X + δ −1 σY2 kGk2 λmax (∆0 )

σY2 |X
( )
2
2 kσX,Y k kσX,Y k2 2 λmax (∆0 )
= kGk + σY .
δ nkσX,Y k2 δ nkσX,Y k4

Under conditions 1, 2 and 3(i), as n, p → ∞,

V (G) = O kGk2 n−1 max kσX,Y k−2 , kσX,Y k−4


 

= O kGk2 n−1 max 1/pt , 1/p2t = O(kGk2 /npt ).


 

In consequence, the variance of the limiting distribution is approached at rate



at least npt/2 /kGk. If the regression is sparse, t = 0, and kGk is bounded,

we get the usual n convergence rate. But if t > 0, then there is a synergy
between the n and p, with the sample size being magnified by pt . If t = 1 then

the convergence rate is np/kGk, for instance. In many applications we may
√ √
have kGk ≤ p and then the convergence rate is at least n. This discussion
is summarized in the first column of the second row of Table 4.3.
Under conditions 1, 2 and 3(ii), as n, p → ∞,

V (G) = O kGk2 n−1 max{kσX,Y k−2 , pkσX,Y k−4 }




= O kGk2 n−1 max{1/pt , 1/p2t−1 } = O kGk2 /np2t−1 ,


 

and now the variance of the limiting distribution is approached at least at rate $\sqrt{n}\,p^{t-1/2}/\|G\|$. This is summarized in the first column of the third row of Table 4.3.

This discussion is summarized in the first column of Table 4.3. The necessary conditions (4.21) and (4.22) to guarantee conditions (c) and (d) of Theorem 4.2 also insure that $V(G) \to 0$ under conditions 3(i) and 3(ii).

4.5.3.3 Bias b

It follows from (4.14) and (4.19) that under condition 3(i), $b = O(1/np^{t-1})$, which is reported in the second column, second row of Table 4.3. That $1/np^{t-1} \to 0$ is implied by (4.21). For $t = 0$, $b = O(p/n)$; for $t = 1/2$, $b = O(\sqrt{p}/n)$; and for $t = 1$, $b = O(1/n)$. When hypotheses (a)–(d) of Theorem 4.2 are met in this example, $n^{1/4}K_j \to 0$, $j = 1, 2$, is sufficient to have the limiting distribution stated in the theorem. Under condition 3(i), $K_1 \geq K_2$ and so
$$n^{1/4}K_1 = \frac{1}{n^{3/4}p^{t-1}} \to 0$$
is sufficient for Theorem 4.2. This condition is listed in the third column, second row of Table 4.3.

It follows from (4.14) and (4.20) that under condition 3(ii), $b = O\left(1/np^{2(t-1)}\right)$, as given in the second column, third row of Table 4.3. That $1/np^{2(t-1)} \to 0$ is implied by (4.22). For $t = 0$, $b = O(p^2/n)$; for $t = 1/2$, $b = O(p/n)$; and for $t = 1$, $b = O(1/n)$. Still under condition 3(ii), $n^{1/4}K_j \to 0$, $j = 1, 2$, is sufficient to have the limiting distribution stated in the theorem. But in our example, $K_1 \leq K_2$ and so
$$n^{1/4}K_2 = \frac{1}{n^{3/4}p^{2(t-1)}} \to 0$$
is sufficient for Theorem 4.2 given that the remaining conditions are satisfied. This condition is listed in the third column, third row of Table 4.3.

The conditions from Corollary 4.6 that are needed to eliminate the bias term are shown in the column headed $\sqrt{n}\,b$. The entries in that column are "between" those in the columns headed $b$ and $n^{1/4}K_j(n,p)$. The factors listed in the first and second columns of Table 4.3 all converge to 0 provided $\|G\|$ does not grow too fast, since (4.21) and (4.22) are required for Theorem 4.2 to hold in the first place; note that (4.22) amounts to requiring that the last entry in the last column of Table 4.3 converge to 0.

TABLE 4.3
The first and second columns give the orders $O(\cdot)$ of $V^{1/2}$ and $b$ under conditions 1–3 as $n, p \to \infty$. The fourth column, headed $\sqrt{n}\,b$, gives from Corollary 4.6 the order of the quantity $\sqrt{n}|b|$ that must converge to 0 for the bias term to be eliminated.

                   $V^{1/2}$                       $b$                 $n^{1/4}K_j(n,p)$, $j=1,2$   $\sqrt{n}\,b$
Condition 3(i)     $\|G\|/(\sqrt{n}\,p^{t/2})$     $1/np^{t-1}$        $1/n^{3/4}p^{t-1}$           $1/\sqrt{n}\,p^{t-1}$
Condition 3(ii)    $\|G\|/(\sqrt{n}\,p^{t-1/2})$   $1/np^{2(t-1)}$     $1/n^{3/4}p^{2(t-1)}$        $1/\sqrt{n}\,p^{2(t-1)}$

4.5.3.4 Simulation to illustrate bias

It seems clear from Table 4.3 that the bias plays a notable role in the convergence. Nevertheless, in abundant regressions with $\|\sigma_{X,Y}\|^2 \asymp p$, the quantities in Table 4.3 all converge to 0 at the $\sqrt{n}$ rate or better when $\|G\|/\sqrt{p} \asymp 1$.

The results in Table 4.3 hint that, as $n, p \to \infty$, the proper scaling of the asymptotic normal distribution may be reached relatively quickly, while achieving close to the proper location may take a larger sample size. Basa et al. (2022) conducted a series of simulations to illustrate the relative impact that bias and scaling can have on the asymptotic distribution of $G^T\hat\beta_{pls}$. In reference to the one-component model (4.1), they set $n = p/2$, $\mu = 0$, $\sigma_{Y|X} = 1/2$, $\delta = \|\sigma_{X,Y}\|^2$ and $\Delta_0 = I_{p-1}$. The covariance vector $\sigma_{X,Y}$ was generated with $\lfloor p^{1/2}\rfloor$ standard normal elements and the remaining elements equal to 0, so $\|\sigma_{X,Y}\|^2 \asymp \sqrt{p}$. Following the discussion of (4.15), they selected $G = \sigma_{X,Y}$ to emphasize bias. Then $X$ was generated as $N(0, \Sigma_X)$ and $Y \mid X$ was then generated according to (4.1) with $N(0, \sigma^2_{Y|X})$ error. In reference to Table 4.3, this simulation is an instance of condition 3(i) with $t = 1/2$.
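A single replication of this design is easy to code. The following R sketch is our own reconstruction (not Basa et al.'s code) of one replication; it returns the two scaled quantities, defined in the next paragraph, whose histograms appear in Figure 4.4. The construction of $\Sigma_X$ from $\Phi$, $\delta$ and $\Delta_0 = I_{p-1}$ is an assumption consistent with the stated design.

```r
# Hedged R sketch: one replication of the bias simulation with n = p/2,
# sigma_{Y|X} = 1/2, delta = ||sigma_XY||^2, Delta0 = I_{p-1}, G = sigma_XY.
set.seed(1)
p <- 256; n <- p / 2
sigma_xy <- c(rnorm(floor(sqrt(p))), rep(0, p - floor(sqrt(p))))
nrm2 <- sum(sigma_xy^2)                                    # ||sigma_XY||^2 = delta
Sigma_X <- diag(p) + tcrossprod(sigma_xy) * (1 - 1 / nrm2) # Phi delta Phi' + Phi0 Phi0'
beta <- sigma_xy / nrm2
sigma2_yx <- 0.25
sigma2_y  <- sigma2_yx + 1                                 # beta' Sigma_X beta = 1 here
G <- sigma_xy

# Population scaling V(G) of (4.12) and bias b of (4.14)
V  <- drop(t(G) %*% (Sigma_X * sigma2_y - tcrossprod(sigma_xy)) %*% G) / (n * nrm2^2)
K1 <- (p - 1) / (n * nrm2); K2 <- (p - 1) / (n * nrm2^2)
b  <- -(1 - sigma2_yx) * K1 - sigma2_y * (K1^2 + K2)

# Generate data and compute the one-component PLS estimator
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma_X)
Y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2_yx))
Xc <- scale(X, scale = FALSE); Yc <- Y - mean(Y)
s  <- drop(crossprod(Xc, Yc)) / n
Sx <- crossprod(Xc) / n
beta_hat <- s * sum(s^2) / drop(t(s) %*% Sx %*% s)

D1 <- sum(G * (beta_hat - beta * (1 + b))) / sqrt(V)   # centered at beta(1 + b)
D2 <- sum(G * (beta_hat - beta)) / sqrt(V)             # centered at beta
c(D1 = D1, D2 = D2)
```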
For each selection of $n = p/2$ this setup was replicated 500 times and side-by-side histograms drawn of $D_1 = V^{-1/2}G^T\{\hat\beta_{pls} - \beta(1+b)\}$ and $D_2 = V^{-1/2}G^T(\hat\beta_{pls} - \beta)$. Their results are shown graphically in Figure 4.4. Since $K_j(n,p) \asymp p^{-j/2}$, conditions (a)–(d) of Theorem 4.2 hold. Further, since $n^{1/4}K_j(n,p) \asymp p^{(-2j+1)/4} \to 0$ for $j = 1, 2, 3, 4$, it follows from Theorem 4.2(II) that $D_1$ converges in distribution to a standard normal. The convergence rate for the largest of these, $n^{1/4}K_1(n,p) \asymp p^{-1/4}$, is quite slow, so it may take a

p=8 p = 16 p = 32

−6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6

p = 64 p = 128 p = 256

−6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6

p = 512 p = 1024 p = 2048

−6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6

FIGURE 4.4
Simulation results on bias:
 The right histogram in each  plot is of
−1/2 T b −1/2 T b
V G βpls − β (1 + b) and the left histogram is of V G βpls − β .
The standard normal reference density is also shown and in all cases n = p/2.
(Plot was constructed with permission using the data that Basa et al. (2022)
used for their Fig. 1.)

large value of $p$ before the asymptotic distribution is a useful approximation. Because $\sqrt{n}K_1(n,p) \asymp 1$, we cannot use Corollary 4.6 to conclude that $D_2$ converges to a standard normal.

We see from Figure 4.4 that the histogram of $D_1$ moves to coincide with the standard normal density as $n$ and $p$ increase, although it takes around 2,000

FIGURE 4.5
Plots of the means (panel a) and standard deviations (panel b) versus $\log p$ corresponding to the simulations of Figure 4.4.

predictors before the approximation seems quite close to the standard normal.
This is in qualitative agreement with the slow rate of convergence mentioned
previously for this example. The histograms for D2 do not converge to a stan-
dard normal density. Visually, it seems that the histogram of D1 gets to the
right scaling faster than it achieves the right location, which is in agreement
with the discussion of Table 4.3. This is supported by the plots in Figure 4.5.

4.5.4 Simulation results


Basa et al. (2022) constructed a series of simulations to illustrate the behavior of the confidence intervals (4.17) as $n = \sqrt{p}$ grows with $p$. For each $p$ they constructed $\sigma_{X,Y}$ to have $\lfloor p^{3/4}\rfloor$ elements generated independently from a standard normal distribution, with the remaining $p - \lfloor p^{3/4}\rfloor$ elements set equal to 0. For $p = 1024$, $\sigma_{X,Y}$ had 181 non-zero elements. They set $\delta = \|\sigma_{X,Y}\|^2$, so the ratio $\delta/\|\sigma_{X,Y}\|^2$ is bounded. They also set $\Delta_0 = I_{p-1}$, constructed $\Sigma_X$ according to (4.2), $\beta = \delta^{-1}\sigma_{X,Y}$, and generated the data $(Y_i, X_i)$, $i = 1, \ldots, n$, independently as $X_i \sim N(0, \Sigma_X)$ and $Y \mid X_i \sim N(\beta^T X_i, 1)$. The contrast vector $G$ was generated as a standard normal vector and nominal 95% confidence intervals were constructed according to (4.17). This scenario has relatively small bias because the choices $\delta = \|\sigma_{X,Y}\|^2$ and $\sigma^2_{Y|X} = 1$ have the effect
TABLE 4.4
Estimated coverage rates of nominal 95% confidence intervals (4.17) for the parameters indicated in the first row. The third and fourth columns are for the setting in which $\sqrt{n}|b| \to 0$, the fifth and sixth columns are for the setting in which $\sqrt{n}|b| \not\to 0$, and the last column is for the adjusted interval given in Theorem 4.5.

                       $\sqrt{n}|b| \to 0$                    $\sqrt{n}|b| \not\to 0$
  n      p       $\beta^T G$   $(1+b)\beta^T G$      $\beta^T G$   $(1+b)\beta^T G$      $\beta^T G$
  16     256     0.924         0.897                 0.766         0.887                 0.856
  22     512     0.949         0.920                 0.734         0.927                 0.903
  32     1024    0.947         0.946                 0.742         0.952                 0.920

of eliminating the first addend in the bias $b$ (4.14). The part of $G$ that falls in $\mathrm{span}^\perp(\beta)$ will also mitigate the bias. The coverage rates, estimated by repeating the procedure 1,000 times and counting the number of intervals that covered $\beta^T G$ and $(1+b)\beta^T G$, are shown in the third and fourth columns of Table 4.4. Theorem 4.2 holds in this simulation scenario. For instance, $K_j(n,p) = 1/p^{3j/4 - 1/2}$. Corollary 4.6(I) also holds since $\sqrt{n}|b| \asymp n^{-1/2}$. However, the sufficient condition given in Corollary 4.6(II) does not hold since $\sqrt{n}K_1(n,p) = 1$ for all $p$.

To emphasize the bias, Basa et al. (2022) conducted another simulation with the same settings, except they set $\sigma^2_{Y|X} = 1/2$, $G = \sigma_{X,Y}$ and, to enhance the contrast, the true value of $V$ was used in the interval construction. According to the discussion following (4.15), this choice of $G$ will maximize the bias effect and the first addend of (4.14) will now contribute to the bias. The results are shown in the fifth and sixth columns of Table 4.4. We see now that $CI_{0.05}(\beta^T G)$ suffers in comparison to $CI_{0.05}((1+b)\beta^T G)$ since $\sqrt{n}|b|$ does not converge to 0; $\sqrt{n}|b| \asymp 1$. However, by (4.16),
$$JV^{-1/2}|G^T\beta b| \leq \sqrt{n}\,J|b|\left(\|\sigma_{X,Y}\|^2/\delta\sigma^2_{Y|X}\right)^{1/2} = \sqrt{2n}\,J|b| \to 0,$$
and thus Theorem 4.2(I) holds. Additionally, the conditions for Theorem 4.2(II) hold and so the adjusted confidence interval of Theorem 4.5 is applicable. The rates for that adjusted interval are shown in the last column of Table 4.4.

4.5.5 Confidence intervals for conditional means


The confidence intervals discussed in Section 4.5.2 require that $G$ be a non-stochastic vector. The estimated conditional mean at a new value $X_N$ of $X$ is $\hat E(Y \mid X_N) = \bar Y + \hat\beta_{pls}^T(X_N - \bar X)$. Although $X_N$ is fixed, $\bar X$ is stochastic and so the corresponding $G = X_N - \bar X$ is stochastic. In consequence, the confidence intervals of Section 4.5.2 may not be appropriate for a conditional mean $E(Y \mid X_N)$ or a prediction.

Let $M(Y \mid X_N) = \mu_Y + (1+b)\beta^T(X_N - \mu_X)$, $F_N = \hat E(Y \mid X_N) - M(Y \mid X_N)$ and let $\omega(X_N) = V(X_N - \mu_X) + \sigma^2_{Y|X}/n$, where $V(\cdot)$ is as defined at (4.12). The asymptotic distribution of $F_N$ is given in the following theorem from Basa et al. (2022).

Theorem 4.6. Assume that the data $(Y_i, X_i)$, $i = 1, \ldots, n$, are independent observations on a multivariate normal random vector of dimension $p + 1$ that follows the one-component model (4.1) and has the asymptotic properties needed for Theorem 4.2. Then
$$\omega^{-1/2}(X_N)\,F_N \xrightarrow{D} N(0, 1). \qquad (4.23)$$

Theorem 4.6 allows us to construct confidence statements for $M(Y \mid X_N)$ based on $\omega^{-1/2}(X_N)F_N$. However, our main interest is in confidence statements for $E(Y \mid X)$ based on $\hat E(Y \mid X) - E(Y \mid X) = F_N + b\beta^T(X_N - \mu_X)$. The following corollary addresses these confidence intervals.

Corollary 4.7. Assume the conditions of Theorem 4.6 and that $\sqrt{n}|b| \to 0$ as $n, p \to \infty$. Then $\omega^{-1/2}(X_N)\left\{\hat E(Y \mid X) - E(Y \mid X)\right\} \xrightarrow{D} N(0, 1)$.

As a consequence of Corollary 4.7, approximate confidence intervals for $E(Y \mid X_N)$ and $Y_N := E(Y \mid X_N) + \epsilon_N$ can then be constructed by using the plug-in method to get an estimator $\hat\omega$ of $\omega$:
$$CI_\alpha(E(Y \mid X_N)) = \left[\hat E(Y \mid X_N) \pm z_{\alpha/2}\,\hat\omega^{1/2}(X_N)\right], \qquad (4.24)$$
$$CI_\alpha(Y_N) = \left[\hat E(Y \mid X_N) \pm z_{\alpha/2}\left\{\hat\omega(X_N) + \hat\tau^2\right\}^{1/2}\right]. \qquad (4.25)$$

Similar to the interpretation of the confidence intervals in Section 4.5.2, if $\sqrt{n}|b|$ does not converge to zero, then (4.24) and (4.25) can always be interpreted under Theorem 4.6 as confidence intervals for $M(Y \mid X_N)$ and $M(Y \mid X_N) + \epsilon_N$.
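The following self-contained R sketch (our own illustration, not the book's software) assembles the plug-in interval (4.24). The estimators of $\sigma^2_{Y|X}$ and $\tau^2$ used here are assumptions: the same residual-variance plug-in is used for both, which may differ from the book's choice.

```r
# Hedged R sketch of the plug-in intervals (4.24) and (4.25) for a new X_N.
ci_cond_mean <- function(X, y, x_new, alpha = 0.05) {
  n <- nrow(X)
  xbar <- colMeans(X); ybar <- mean(y)
  Xc <- sweep(X, 2, xbar); yc <- y - ybar
  Sxy <- drop(crossprod(Xc, yc)) / n
  Sx  <- crossprod(Xc) / n
  Sy  <- sum(yc^2) / n
  delta_hat <- drop(t(Sxy) %*% Sx %*% Sxy) / sum(Sxy^2)
  beta_hat  <- Sxy / delta_hat
  sig2_yx   <- Sy - sum(Sxy^2) / delta_hat         # assumed plug-in sigma^2_{Y|X}

  G <- x_new - xbar                                 # stochastic contrast
  V_hat <- drop(t(G) %*% (Sx * Sy - tcrossprod(Sxy)) %*% G) / (n * delta_hat^2)
  omega_hat <- V_hat + sig2_yx / n

  fit <- ybar + sum(beta_hat * G)                   # estimated E(Y | X_N)
  z <- qnorm(1 - alpha / 2)
  list(mean_ci = fit + c(-1, 1) * z * sqrt(omega_hat),            # (4.24)
       pred_ci = fit + c(-1, 1) * z * sqrt(omega_hat + sig2_yx))  # (4.25), tau2 := sig2_yx
}
```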

4.6 Consistency in multi-component regressions


The asymptotic behavior of PLS predictions in regressions with multiple q > 1
components is qualitatively similar to the behavior of single-component regres-
sions described in Section 4.4. However, there are also important differences
that serve to distinguish the two cases. In this section we give an expanded
discussion of the consistency results of Cook and Forzani (2019) for multiple
component regressions. Full technical details are available in the supplement
to their article.
Our context for studying the asymptotic behavior of PLS predictions
in multi-component regressions is the same as that described for single-
component regressions: the single-response model is as given in (2.5) with
r = 1 and the notational changes indicated at the outset of Section 4.2.1.
The goal is still to characterize the asymptotic behavior of DN as given in
(4.8). The main difference is that now we allow the fixed envelope dimension
q = dim(EΣX(B)) > 1 so that the column dimension of its semi-orthogonal
basis matrix Φ ∈ Rp×q is allowed to exceed 1. Subsequent differences arise be-
cause the orders for multiple-component regressions are not as tight as those
given for one-component regressions. Nevertheless, as will be discussed in this
section, there are multiple-component regressions in which PLS predictions

can converge at or near the n rate.
In this section we revert back to usual notation and use ΣX,Y to denote the
covariance between X and Y , while still using σX,Y for the covariance when
r = 1. This notational convention will facilitate distinguishing regressions with
r = 1 from those with r > 1.

4.6.1 Definitions of the signal η(p), noise κ(p) and collinearity ρ(p)
Cook and Forzani (2019) found by using its envelope structure that the performance of a multi-component PLS regression in high dimensions depends primarily on the signal coming from the material predictors, the noise arising from the immaterial predictors and collinearity among the Krylov vectors, as discussed in Section 3.6. Of these, the two most important are perhaps the signal as measured by the variation in the material predictors $\Phi^T X$ and the noise arising from the immaterial predictors as measured by the variation in $\Phi_0^T X$:

Definition 4.2. The signal $\eta$ and noise $\kappa$ in a multi-component regression are
$$\eta(p) = \mathrm{tr}(\mathrm{var}(\Phi^T X)) = \mathrm{tr}(\Delta) = \mathrm{tr}(P_{\mathcal{E}}\Sigma_X),$$
and
$$\kappa(p) = \mathrm{tr}(\mathrm{var}(\Phi_0^T X)) = \mathrm{tr}(\Delta_0) = \mathrm{tr}(Q_{\mathcal{E}}\Sigma_X).$$
In Definition 4.1 and at the outset of Section 4.4.3 we used $\|\sigma_{X,Y}\|^2 \asymp \delta = \mathrm{tr}(\Delta)$ as a measure of the signal in one-component regressions. Our definition of signal in multi-component regressions then reduces to the previous one when $q = 1$. In our treatment of one-component regressions, we used the noise-to-signal ratio $\Delta_\sigma := \Delta_0/\|\sigma_{X,Y}\|^2$ to characterize in Theorem 4.1 the asymptotic behavior of $D_N$ in terms of $K_1(n,p) = \mathrm{tr}(\Delta_\sigma)/n$ and $K_2(n,p) = \mathrm{tr}(\Delta_\sigma^2)/n$. A corresponding fine level of analysis was not maintained in the multi-component case. Instead, Cook and Forzani (2019) consistently used additional bounding of the form $\mathrm{tr}(A^k) \leq \mathrm{tr}^k(A)$ when characterizing the asymptotic behavior of PLS predictions. For instance, this means that $K_2(n,p) \leq nK_1^2(n,p)$. In this way, we obtain a coarser version of Theorem 4.1 depending only on the sample size and noise-to-signal ratio via an extended definition of $K_1$:
$$K_1(n,p) = \frac{\kappa(p)}{n\eta(p)}.$$
This is similar to the signal rate found by Cook, Forzani, and Rothman (2013) in their study of abundant high-dimensional linear regression.
It follows from our discussion of Section 3.5.1 that the population NIPALS weight vectors arise from a sequential orthogonalization of the $q$ vectors $\Sigma_X^j\sigma_{X,Y}$, $j = 0, \ldots, q-1$, in the Krylov sequence $K_q = (\sigma_{X,Y}, \Sigma_X\sigma_{X,Y}, \ldots, \Sigma_X^{q-1}\sigma_{X,Y})$. While these basis vectors are linearly independent by construction, for this orthogonalization to be stable as $p \to \infty$, the Krylov basis vectors cannot be too collinear. Let $R_j^2$ denote the squared multiple correlation coefficient from the linear regression of the $j$-th coordinate $\sigma_{X,Y}^T\Sigma_X^{j-1}X$ of $K_q^T X$ onto the rest of its coordinates,
$$\left(\sigma_{X,Y}^T X, \ldots, \sigma_{X,Y}^T\Sigma_X^{j-2}X,\ \sigma_{X,Y}^T\Sigma_X^{j}X, \ldots, \sigma_{X,Y}^T\Sigma_X^{q-1}X\right).$$
Then the collinearity among the Krylov basis vectors arises asymptotically as the rate of increase in the sum of variance inflation factors
$$\rho(p) = \sum_{j=1}^{q}\left(1 - R_j^2\right)^{-1}.$$
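Because the variance inflation factors of a random vector are the diagonal elements of the inverse of its correlation matrix, $\rho(p)$ can be computed directly from $\Sigma_X$ and $\sigma_{X,Y}$. The following R sketch is our own illustration of that computation; names are assumptions.

```r
# Hedged R sketch: the collinearity measure rho(p) as the sum of variance
# inflation factors of the Krylov coordinates K_q' X.
rho_p <- function(Sigma_X, sigma_XY, q) {
  p <- length(sigma_XY)
  K <- matrix(NA_real_, p, q)
  v <- sigma_XY
  for (j in 1:q) { K[, j] <- v; v <- Sigma_X %*% v }   # sigma, Sigma sigma, ...
  Vk <- t(K) %*% Sigma_X %*% K                         # var(K_q' X)
  Rk <- cov2cor(Vk)                                    # its correlation matrix
  sum(diag(solve(Rk)))                                 # sum of VIFs = rho(p)
}
```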

PLS regressions have the best performance asymptotically when $\rho(p)$ is bounded, but can also perform reasonably when $\rho(p)$ increases with $p$ at a slower rate than $\sqrt{n}$. A sufficient condition for $\rho(p)$ to be bounded comes from the regression of $Y$ on the material predictors scaled by the signal rate, $Z_p = \Phi^T X/\sqrt{\eta(p)}$. If, as $p \to \infty$, $\mathrm{var}(Z_p) = \eta^{-1}(p)\Delta > 0$ and the rank of the matrix of Krylov vectors $K_{Z_p} = (\sigma_{Z_p,Y}, \Sigma_{Z_p}\sigma_{Z_p,Y}, \ldots, \Sigma_{Z_p}^{q-1}\sigma_{Z_p,Y})$ for the regression of $Y$ on $Z_p$ is stable at $q$, then $\rho(p)$ is bounded (Cook and Forzani, 2019, Prop. 1). These conditions ensure that $q = \dim(\mathcal{E}_{\Sigma_X}(\mathcal{B}))$ is stable as $p \to \infty$. Since $\mathrm{tr}(\mathrm{var}(Z_p)) = 1$, the first condition implies that the eigenvalues $\delta_j$ of $\Delta$ are similarly sized, so that $\delta_j \asymp \eta(p)$, $j = 1, \ldots, q$. In contrast, if $\delta_q/\eta(p) \to 0$ as $p \to \infty$ then asymptotically we drop at least one dimension of $\mathcal{E}_{\Sigma_X}(\mathcal{B})$. In application, if the eigenvalues of $\Delta$ have quite dissimilar sizes then PLS methods may be inclined to miss the material predictors associated with the smaller eigenvalues. This may not be surprising since all standard regression methods have relative difficulty finding predictors with quite small effects.

These results imply that collinearity within the material part of $X$ matters, while collinearity within the immaterial part does not. In consequence, observing collinearity among the columns of $X$ does not necessarily signal problems. The results hint that if the computations are stable then the basis-vector collinearity described here will not be an issue.

4.6.2 Abundance v. sparsity


At the beginning of this chapter, we informally defined an abundant regression to be one in which many predictors contribute information about the response, while a sparse regression is one in which few predictors contain response information. Then in the context of one-component regressions we defined an abundant regression as one in which $\|\sigma_{X,Y}\|^2 \to \infty$, while a sparse regression is one in which $\|\sigma_{X,Y}\|^2$ is bounded as $n \to \infty$. We now adapt those ideas to multi-component regressions.

Definition 4.3. If $\eta(p) \to \infty$ as $p \to \infty$ then the regression is said to be abundant. The regression is said to be sparse if $\eta(p)$ is bounded as $p \to \infty$.

There is also a more easily interpretable sufficient condition for abundance given in the following proposition. Its proof is in Appendix A.4.5.

Proposition 4.2. Assume that the eigenvalues of $\Delta_0$ are bounded and $\beta^T\Sigma_X\beta \asymp 1$. Then $\kappa(p) \asymp p$ and $\|\sigma_{X,Y}\|_2 \to \infty$ if and only if $\eta(p) \to \infty$, where $\|\cdot\|_2$ denotes the Euclidean norm.

Practically, if there are many predictors that are correlated with the re-
sponse and if basis-vector collinearity as described above is not an issue then
we have an abundant regression.
To perhaps provide a little intuition, we next describe how $\|\sigma_{X,Y}\|^2$ connects with $\eta(p)$ and the eigenvalues of $\Sigma_X$. Let $\lambda_1(A) \geq \cdots \geq \lambda_m(A)$ denote the ordered eigenvalues of the symmetric $m\times m$ matrix $A$. Since $\eta(p) = \sum_{j=1}^{q}\lambda_j(\Phi^T\Sigma_X\Phi)$, it follows that (Cook and Forzani, 2019, eq. (4.4))
$$\sigma_Y^2 > \frac{\|\sigma_{X,Y}\|^2}{\eta(p)}. \qquad (4.26)$$
The response variance $\sigma_Y^2$ is a finite constant and does not change with $p$. Consequently, regardless of the value of $p$, the ratio in this relationship is bounded above by $\sigma_Y^2$. If $\|\sigma_{X,Y}\|^2$ increases as we add more predictors then $\eta(p)$ must correspondingly increase to maintain the bound (4.26) for all $p$. If $\|\sigma_{X,Y}\|^2 \to \infty$ as $p \to \infty$ then $\eta(p)$ must also diverge in a way that maintains the bound. Equivalently, if $\eta(p)$ is bounded, so the regression is sparse, then $\|\sigma_{X,Y}\|^2$ must also be bounded. In sum, $\|\sigma_{X,Y}\|^2$ may serve as a useful surrogate for the signal $\eta(p)$.
There is also a relationship between $\eta(p)$ and the eigenvalues of $\Sigma_X$ (Rao, 1979, Thm. 2.1):
$$\eta(p) \leq \sum_{j=1}^{q}\lambda_j(\Sigma_X). \qquad (4.27)$$
Accordingly, if the regression is abundant, so $\eta(p) \to \infty$, then at least the largest eigenvalue $\lambda_1(\Sigma_X) \to \infty$ as $p \to \infty$. And if the eigenvalues of $\Sigma_X$ are bounded then the signal must be bounded, giving a sparse regression. This result clarifies a potentially mis-interpreted result by Chun and Keleş (2010), as discussed following Theorem 4.8.

4.6.3 Universal conditions


Before discussing asymptotic results in the next section, we summarize the
overarching conditions used by Cook and Forzani (2019). Several of these
conditions are carryovers from the conditions required for the one-component
case discussed in Section 4.4.3.

C1. Model (2.5) holds. The response $Y$ is real ($r = 1$), $(Y, X)$ follows a non-singular multivariate normal distribution and the data $(Y_i, X_i)$, $i = 1, \ldots, n$, arise as independent copies of $(Y, X)$. To avoid the trivial case, we assume that the coefficient vector $\beta \neq 0$, which implies that the dimension of the envelope $q \geq 1$. This is the same as the model adopted for the one-component case, except here the number of components is allowed to exceed one.

C2. The error standard deviation $\sigma_{Y|X}$ is bounded away from 0 as $p \to \infty$, and $\beta^T\Sigma_X\beta \asymp 1$. These conditions were used in the one-component case.

C3. The number of components $q$, which is the same as the dimension of the envelope $\mathcal{E}_{\Sigma_X}(\mathcal{B})$, is known and fixed for all $p$. This is the same structure adopted for the one-component case, except here the number of components is allowed to exceed one.

C4. $K_1$ and $\rho/\sqrt{n}$ converge to 0 as $n, p \to \infty$, where $K_1$ and $\rho$ are defined in Section 4.6.1.

C5. $\eta = O(\kappa)$ as $p \to \infty$, where $\eta \geq 1$, and $\eta$ and $\kappa$ are defined in Section 4.6.1. Although it is technically possible for $\kappa$ to be of smaller order than $\eta$, we find such situations either implausible or bordering on deterministic. This condition was used implicitly when discussing one-component consistency in Section 4.4.3.

C6. $\Sigma_X > 0$ for all $p$. This restriction allows $S_X$ to be singular, which is a scenario PLS was designed to handle. We do not require as a universal condition that the eigenvalues of $\Sigma_X$ be bounded as $p \to \infty$.

Additional conditions will be needed for various results.

4.6.4 Multi-component consistency results


The definition of $D_N$ given at (4.8) holds for multi-component as well as one-component regressions. Depending on properties of the regression, the asymptotic behavior of $D_N$ in multi-component regressions can depend crucially on all of the quantities described in Section 4.6.3: $n$, $q$, $\eta$, $\kappa$ and $K_1$. In this section we summarize the main results of Cook and Forzani (2019) along with a few special scenarios that may provide useful intuition in practice. All of the asymptotic results in this section should be understood to hold as $n, p \to \infty$. Proofs are available from the online supplement to Cook and Forzani (2019).

The results of Theorem 4.7 are the most general, requiring for potentially good results in practice only that C1–C6 hold and that the terms characterizing the orders go to zero as $n, p \to \infty$. In particular, the eigenvalues of $\Sigma_X$ need not be bounded.

Theorem 4.7. As $n, p \to \infty$,
$$D_N = O_p\left(\rho/\sqrt{n}\right) + O_p\left\{\rho^{1/2}n^{-1/2}(\kappa/\eta)^q\right\}.$$
In particular,

I. If $\rho \asymp 1$ then $D_N = O_p\left\{n^{-1/2}(\kappa/\eta)^q\right\}$.

II. If $\kappa \asymp \eta$ then $D_N = O_p\left(\rho/\sqrt{n}\right)$.

III. If $q = 1$ then $D_N = O_p\left(\sqrt{n}\,K_1\right)$.

We see from this that the asymptotic behavior of PLS depends crucially on the relative sizes of the signal $\eta$ and noise $\kappa$ in $X$. It follows from the general result that if $\kappa \asymp p$, as likely occurs in many applications, particularly spectral applications in chemometrics, and if $\eta \asymp p$, so the regression is abundant, then $D_N = O_p(\rho/\sqrt{n})$. If, in addition, $\rho \asymp 1$ then PLS fitted values converge at the usual $\sqrt{n}$-rate, regardless of the relationship between $n$ and $p$.

On the other hand, if the signal in $X$ is small relative to the noise in $X$, so $\eta = o(\kappa)$, then it may take a very large sample size for PLS prediction to be consistent. For instance, suppose that the regression is sparse and only $q$ predictors matter, and thus $\eta \asymp 1$. Then it follows reasonably that $\rho \asymp 1$ and, from part I, $D_N = O_p\{n^{-1/2}\kappa^q\}$. If, in addition, $\kappa \asymp p$ then $D_N = O_p\{p^q n^{-1/2}\}$. Clearly, if $q$ is not small, then it could take a huge sample size for PLS prediction to be usefully accurate.

Theorem 4.7 places no constraints on the rate of increase in $\kappa(p) = \mathrm{tr}(\Delta_0)$. In many regressions it may be reasonable to assume that the eigenvalues of $\Delta_0$ are bounded so that $\kappa(p) \asymp p$ as $p \to \infty$. In the next theorem we describe the asymptotic behavior of PLS predictions when the eigenvalues of $\Delta_0$ are bounded. It is a special case of Cook and Forzani (2019, Theorem 2).

Theorem 4.8. If the eigenvalues of $\Delta_0$ are bounded as $p \to \infty$ then $\kappa \asymp p$ and
$$D_N = O_p\left(\rho/\sqrt{n}\right) + O_p\left\{(\rho p/n\eta)^{1/2}\right\}.$$
In particular,

I. If $\rho \asymp 1$ or if $q = 1$ then
$$D_N = O_p\left\{\left(\frac{p}{n\eta}\right)^{1/2}\right\}.$$

II. If $\eta \asymp p$ then $D_N = O_p\left(\rho/\sqrt{n}\right)$.

The order of $D_N$ now depends on a balance between the sample size $n$, the variance inflation factors as measured through $\rho$ and the noise-to-signal ratio in $K_1$, but it no longer depends on the dimension $q$. We see in the conclusion to case I that there is synergy between the sample size and the signal, the signal serving to multiply the sample size. For instance, if we assume a modest signal of $\eta(p) \asymp \sqrt{p}$ then in case I we must have $n$ large relative to $\sqrt{p}$ for the best results. If $\eta \asymp p$ and $\rho \asymp 1$ then from case II we again get convergence at the $\sqrt{n}$ rate.

Chun and Keleş (2010) concluded that PLS can be consistent in high-dimensional regressions only if $p/n \to 0$. However, they required the eigenvalues of $\Sigma_X$ to be bounded as $p \to \infty$ and that $\rho(p) \asymp 1$. If the eigenvalues of $\Sigma_X$ are bounded then the eigenvalues of $\Delta_0$ are bounded, so $\kappa \asymp p$, and from (4.27) the signal must be bounded as well, $\eta \asymp 1$. It then follows from Theorem 4.8 that $D_N = O_p\{(p/n)^{1/2}\}$, which is the rate obtained by Chun and Keleş (2010). By requiring the eigenvalues of $\Sigma_X$ to be bounded, Chun and Keleş (2010) in effect restricted their conclusion to sparse regressions.
So far our focus has been on the rate of convergence of predictions as measured by $D_N$. There is a close connection between the rate for $D_N$ and the rate of convergence of $\hat\beta_{pls}$ in the $\Sigma_X$ inner product. Let
$$V_{n,p} = \left\{(\hat\beta_{pls} - \beta)^T\Sigma_X(\hat\beta_{pls} - \beta)\right\}^{1/2}.$$
Then, as shown by Cook and Forzani (2019, Supplement Section S8), $V_{n,p}$ and $D_N$ have the same order as $n, p \to \infty$, so Theorems 4.7 and 4.8 hold with $D_N$ replaced by $V_{n,p}$. It follows that the special cases of Theorems 4.7 and 4.8 and the subsequent discussions apply to $V_{n,p}$ as well. In particular, estimative convergence for $\hat\beta_{pls}$, as measured in the $\Sigma_X$ inner product, will be at or near the root-$n$ rate under the same conditions as predictive convergence.

4.7 Bounded or unbounded signal?


The distinction between a bounded or unbounded signal $\eta(p)$ has been a common theme in this chapter. If the signal is bounded, $\eta(p) \asymp 1$, we would not normally expect PLS to do well in high-dimensional $n < p$ regressions, and then methods tailored to sparse regressions would be necessary. But if the signal is abundant, $\eta(p) \to \infty$, then we would normally expect PLS to give serviceable results, particularly when $\eta(p) \asymp p^{1/2}$ or greater. How this dichotomy plays out in practice depends on the area of application. For instance, it seems that many Chemometrics regressions are of the abundant variety.
Goicoechea and Oliveri (1999) used PLS to develop a predictor of tetracy-
cline concentration in human blood. Fifty training samples were constructed
by spiking blank sera with various amounts of tetracycline in the range 0–4
µg mL−1 . A validation set of 57 samples was constructed in the same way.
For each sample, the values of the predictors were determined by measuring
fluorescence intensity at p = 101 equally spaced points in the range 450–550
nm. The authors determined using leave-one-out cross validation that the best
predictions of the training data were obtained with q = 4 linear combinations
of the original 101 predictors.
Cook and Forzani (2019) used these data to illustrate the behavior of PLS predictions in Chemometrics as the number of predictors increases. They used PLS with $q = 4$ to predict the validation data based on $p$ equally spaced spectra, with $p$ ranging between 10 and 101. The root mean squared error (MSE) is shown in Figure 4.6 for five values of $p$. PLS fits were determined by using the pls package in R (R Core Team, 2022). We see a relatively steep drop in MSE for small $p$, say less than 30, and a slow but steady decrease in MSE thereafter. Since we are dealing with actual out-of-sample prediction, there is no artifactual reason why the root MSE should continue to decrease with $p$.
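A validation curve of this kind is straightforward to compute with the pls package. The sketch below is our own illustration, not the authors' code: the data objects (train_X, train_y, test_X, test_y) and the way equally spaced wavelengths are selected are assumptions.

```r
# Hedged R sketch: validation root MSE of a q = 4 component PLS fit as the
# number of equally spaced spectral predictors grows.
library(pls)

root_mse_curve <- function(train_X, train_y, test_X, test_y,
                           p_grid = c(10, 20, 33, 50, 101), ncomp = 4) {
  sapply(p_grid, function(p) {
    idx <- round(seq(1, ncol(train_X), length.out = p))   # equally spaced spectra
    train <- data.frame(y = train_y, X = I(train_X[, idx]))
    fit <- plsr(y ~ X, ncomp = ncomp, data = train, validation = "none")
    pred <- predict(fit, newdata = data.frame(X = I(test_X[, idx])), ncomp = ncomp)
    sqrt(mean((test_y - drop(pred))^2))                    # validation root MSE
  })
}

# e.g., rmse <- root_mse_curve(train_X, train_y, test_X, test_y)
# plot(c(10, 20, 33, 50, 101), rmse, type = "b", xlab = "p", ylab = "Root MSE")
```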

FIGURE 4.6
Tetracycline data: The open circles give the validation root MSE from 10, 20, 33, 50, and 101 equally spaced spectra. (From Fig. 4 of Cook and Forzani (2019) with permission.)

4.8 Illustration

We use in this section variations on the example presented in Section 3.7 to illustrate selected asymptotic results, including the distinction between abundant and sparse regressions (Cook and Forzani, 2020). A detailed analysis of a one-component regression is described in Section 4.8.1, and a cursory analysis of a two-component regression is given in Section 4.8.2.

4.8.1 One-component regression


Beginning with the example of Section 3.7, we now set $p_1 = p_2$ and increase the variance of the predictors from 1 to 25. The $n\times p$ data matrix $\mathbb{X}$ of predictor values can be represented as
$$\mathbb{X}_{n\times p} = (H_1 1_1^T + E_1,\ H_2 1_2^T + E_2,\ H_3 1_3^T + E_3),$$
where now $H_j$ is an $n\times 1$ vector of independent normal variates with mean 0 and variance 25, $j = 1, 2, 3$, and $1_1$, $1_2$, and $1_3$ are vectors of ones having dimensions $p_1 = p_2 = (p-d)/2$ and $p_3 = d$, and we restrict $p - d$ to be even. The $E_j$'s are matrices of conforming dimensions also consisting of independent standard normal variates. A typical $p\times 1$ predictor vector can still be represented as
$$X = \begin{pmatrix} h_1 1_1 \\ h_2 1_2 \\ h_3 1_3\end{pmatrix} + e,$$

but now the $h_j$'s are independent normal random variables with mean 0 and variance 25. The variance–covariance matrix of the predictors $X$ is now
$$\Sigma_X = 25\begin{pmatrix} 1_1 1_1^T & 0 & 0\\ 0 & 1_2 1_2^T & 0\\ 0 & 0 & 1_3 1_3^T\end{pmatrix} + I_p.$$
Again, the predictors consist of three independent blocks of sizes $(p-d)/2$, $(p-d)/2$ and $d$. The correlation between predictors in the same block is about 0.96. Recall that $u_1^T = (1_1^T, 0, 0)$, $u_2^T = (0, 1_2^T, 0)$, and $u_3^T = (0, 0, 1_3^T)$. Then $\Sigma_X$ can be expressed equivalently as
$$\Sigma_X = (1 + 25p_1)P_{u_1} + (1 + 25p_2)P_{u_2} + (1 + 25p_3)P_{u_3} + Q
= \{1 + 25(p-d)/2\}P_{(u_1,u_2)} + (1 + 25d)P_{u_3} + Q,$$
where $P_{(\cdot)}$ is the projection onto the subspace spanned by the indicated vectors and $Q$ is the projection onto the $(p-3)$-dimensional subspace that is orthogonal to $\mathrm{span}(u_1, u_2, u_3)$. We see that $\Sigma_X$ has three eigenspaces: $\mathrm{span}(u_1, u_2)$, $\mathrm{span}(u_3)$ and $\mathrm{span}^\perp(u_1, u_2, u_3)$. With $r = 1$, the $n\times 1$ vector of responses $\mathbb{Y}$ was generated as the linear combination $\mathbb{Y} = 3H_1 - 4H_2 + \varepsilon$, where $\varepsilon$ is an $n\times 1$ vector of independent normal variates with mean 0 and finite variance. This gives
$$\begin{aligned}
\sigma_{X,Y} &= 75u_1 - 100u_2,\\
\|\sigma_{X,Y}\|^2 &= (75^2 + 100^2)(p-d)/2,\\
\beta &= \frac{75}{1 + 25p_1}u_1 - \frac{100}{1 + 25p_2}u_2 = \frac{1}{1 + 25(p-d)/2}\,\sigma_{X,Y},\\
\beta^T\Sigma_X\beta &= (75^2 + 100^2)\,\frac{(p-d)/2}{1 + 25(p-d)/2}.
\end{aligned}$$
Consequently, both $\beta$ and $\sigma_{X,Y}$ are eigenvectors of $\Sigma_X$ since these vectors fall in the one-dimensional envelope $\mathcal{E}_{\Sigma_X}(\mathcal{B}) = \mathrm{span}(3u_1 - 4u_2)$. This implies that $X$ can be replaced with the single predictor $3u_1^T X - 4u_2^T X$ without loss of information on the regression. Normalized bases for the envelope and its orthogonal complement are then
$$\Phi = \frac{3u_1 - 4u_2}{\|3u_1 - 4u_2\|}, \qquad \Phi_0 = \left(\frac{4u_1 + 3u_2}{\|4u_1 + 3u_2\|},\ \frac{u_3}{\|u_3\|},\ U\right),$$

where the columns of $U \in \mathbb{R}^{p\times(p-3)}$ form an orthonormal basis for $\mathrm{span}^\perp(u_1, u_2, u_3)$. All of the results of this chapter are applicable here since this is a one-component regression.

According to Definition 4.1, this regression is abundant if $\|\sigma_{X,Y}\| \to \infty$ and sparse otherwise. The first two blocks of $(p-d)/2$ predictors shown in $\mathbb{X}$ are material to the regression, while the last block of $d$ predictors is immaterial. If the total $(p-d)$ of material predictors is large relative to the total $d$ of immaterial predictors then $\|\sigma_{X,Y}\|$ will be large, so PLS and envelopes should do well, depending on the sample size. Otherwise, there will be some degradation in the performance of the methods, with perhaps extreme failure when $d$ is large relative to $(p-d)$, unless $n \gg p$.

From Definition 4.2, the predictor signal and noise measures are
$$\eta(p) = \Phi^T\Sigma_X\Phi = 1 + 25(p-d)/2, \qquad \kappa(p) = \mathrm{tr}(\Phi_0^T\Sigma_X\Phi_0) = -1 + 25(p+d)/2 + p.$$
The eigenvalues of $\Delta_0$ are $1 + 25(p-d)/2$, $1 + 25d$, and 1 with multiplicity $p - 3$. The effectiveness of envelopes and PLS is now governed by $n$, $p$, and $d$. Corollaries 4.2–4.4 are not applicable in this case since one and perhaps two of the eigenvalues of $\Delta_0$ are unbounded. However, Corollary 4.5 is applicable since only finitely many of the eigenvalues of $\Delta_0$ are unbounded. The following three cases illustrate the application of Corollary 4.5. Recall that $a$ and $s$ are defined as constants such that $p \asymp n^a$ with $a \geq 1$ and $\|\sigma_{X,Y}\|^2 \asymp p^s$, with $p^s$ informally interpreted as the number of material predictors, and that $D_N = O_p(n^{-1/2+a(1-s)})$, so the convergence rate is $n^{1/2-a(1-s)}$.

Case 1: When $d$ is fixed, $\|\sigma_{X,Y}\|^2 \asymp p$ and so $s = 1$ and $D_N$ converges at the rate $\sqrt{n}$, regardless of the relationship between $n$ and $p$. In this case, all but finitely many predictors are material, rather like an antithesis of sparsity.

Case 2: If $d = p - p^{2/3}$, so there are $p - d = p^{2/3}$ material predictors, then $\|\sigma_{X,Y}\|^2 \asymp p^{2/3}$, which gives $s = 2/3$. If, in addition, $p \asymp n$ then $a = 1$ and $D_N$ converges at the rate $n^{1/2-a(1-s)} = n^{1/6}$, which is quite slow.

Case 3: If $d = p - 2$ then $s = 0$, $a \geq 1$, there are only 2 material predictors and no convergence is indicated by the corollary.

FIGURE 4.7
Illustration of the behavior of PLS predictions in abundant and sparse regressions. Lines correspond to different numbers of material predictors. Reading from top to bottom as the lines approach the vertical axis, the first, bold, line is for 2 material predictors. The second, dashed, line is for $p^{2/3}$ material predictors. The third, solid, line is for $p - 40$ material predictors, and the last line, with circles at the predictor numbers used in the simulations, is for $p$ material predictors. The vertical axis is the squared norm $\|\beta - \hat\beta\|^2_{S_X}$ and always $n = p/3$. (This figure was constructed with permission using the same data as Cook and Forzani (2020) used for their Fig. 1.)

Figure 4.7 shows the results of a small simulation to reinforce these comments. The vertical axis is the squared difference between the centered mean value $\beta^T X$ and its PLS estimator $\hat\beta^T X$, averaged over the sample: $\|\beta - \hat\beta\|^2_{S_X} = n^{-1}\sum_{i=1}^{n}(\beta^T X_i - \hat\beta^T X_i)^2$. The horizontal axis is the number of predictors $p$ and the sample size is $n = p/3$. The lines on the plot are for different numbers of material predictors. The curve for 2 material predictors corresponds to a sparse regression and, as expected, the results support Case 3, as there is no visual evidence that it is converging to 0. The other three curves are for abundant regressions with varying rates of information accumulation.

The curve for "$p$" material predictors is the best that can be achieved in the context of the simulated regression. For the "$p - 40$" curve all predictors are material except for 40. As the total number of predictors increases, the 40 immaterial predictors cease to play a role and the "$p - 40$" curve coincides with the "$p$" curve. These results support Case 1.

The remaining, dashed, curve for "$p^{2/3}$" material predictors represents Case 2, an abundant regression with a very slow rate of information accumulation that is less than that for the "$p$" and "$p - 40$" curves. The theory predicts that this curve will eventually coincide with the "$p$" curve, but demonstrating that result by simulation will surely take a very, very large value of $p$.
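One point on a curve of Figure 4.7 can be reproduced, at least in spirit, with a few lines of R. The sketch below is our own reconstruction of the data-generating model of this section and of the criterion $\|\beta - \hat\beta\|^2_{S_X}$ (it is not the authors' simulation code, and the particular values of $p$ and $d$ are arbitrary assumptions).

```r
# Hedged R sketch: generate (X, Y) from the one-component illustration and
# compute ||beta - beta_hat||^2_{S_X} for a one-component PLS fit.
sim_one <- function(p, d, seed = 1) {
  stopifnot((p - d) %% 2 == 0)
  set.seed(seed)
  n  <- p / 3
  p1 <- (p - d) / 2
  H  <- matrix(rnorm(n * 3, sd = 5), n, 3)              # block factors, variance 25
  X  <- cbind(H[, 1] %*% t(rep(1, p1)),
              H[, 2] %*% t(rep(1, p1)),
              H[, 3] %*% t(rep(1, d))) + matrix(rnorm(n * p), n, p)
  Y  <- 3 * H[, 1] - 4 * H[, 2] + rnorm(n)
  beta <- c(rep(75, p1), rep(-100, p1), rep(0, d)) / (1 + 25 * p1)

  Xc <- scale(X, scale = FALSE); Yc <- Y - mean(Y)
  s  <- drop(crossprod(Xc, Yc)) / n
  Sx <- crossprod(Xc) / n
  beta_hat <- s * sum(s^2) / drop(t(s) %*% Sx %*% s)     # one-component PLS

  diff <- beta - beta_hat
  drop(t(diff) %*% Sx %*% diff)                          # ||beta - beta_hat||^2_{S_X}
}

# Sparse (2 material predictors) versus abundant (all p material):
c(sparse = sim_one(p = 300, d = 298), abundant = sim_one(p = 300, d = 0))
```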

4.8.2 Two-component regression


A one-component regression was imposed in the previous section by requiring that $p_1 = p_2 = (p-d)/2$ and $p_3 = d$. If we drop that requirement and assume that $p_1$, $p_2$, and $p_3$ are distinct, then we obtain a two-component regression. In this section, we give a few characteristics of that two-component regression. The quantities $\mathbb{X}$, $X$, and $Y$ are as given in the previous section, except we no longer require $p_1 = p_2$. Thus we still have
$$\begin{aligned}
\Sigma_X &= (1 + 25p_1)P_{u_1} + (1 + 25p_2)P_{u_2} + (1 + 25p_3)P_{u_3} + Q,\\
\sigma_{X,Y} &= 75u_1 - 100u_2,\\
\beta &= \frac{75}{1 + 25p_1}u_1 - \frac{100}{1 + 25p_2}u_2,
\end{aligned}$$
which are the same as the corresponding quantities given in the previous section, but are repeated here for ease of reference. Now $\Sigma_X$ has four eigenspaces: $\mathrm{span}(u_j)$, $j = 1, 2, 3$, and $\mathrm{span}^\perp(u_1, u_2, u_3)$. The envelope is two-dimensional, $\mathcal{E}_{\Sigma_X}(\mathcal{B}) = \mathrm{span}(u_1, u_2)$, and its orthogonal complement $\mathcal{E}_{\Sigma_X}^\perp(\mathcal{B}) = \mathrm{span}(u_3, U)$ has dimension $p - 2$, where the columns of $U \in \mathbb{R}^{p\times(p-3)}$ form an orthonormal basis for $\mathrm{span}^\perp(u_1, u_2, u_3)$, as defined in the previous section. Now $X$ can be replaced by two univariate predictors $(u_1^T X, u_2^T X)$ without loss of information about the regression. Since this is a two-component regression, we now rely on the results of Section 4.6 to characterize $D_N$ asymptotically. For this we need
$$\begin{aligned}
\Phi &= (u_1/\sqrt{p_1},\ u_2/\sqrt{p_2}),\\
\eta(p) &= \mathrm{tr}(\Phi^T\Sigma_X\Phi) = 2 + 25(p_1 + p_2),\\
\kappa(p) &= -2 + p + 25p_3,
\end{aligned}$$
where $\kappa(p)$ comes from adding the eigenvalues of $\Delta_0$: $1 + 25p_3$, and 1 with multiplicity $p - 3$. From here we can appeal to Theorems 4.7 and 4.8 to characterize the asymptotic behavior of PLS predictions.

If $\rho \asymp 1$ and $p_3$ is bounded, then the eigenvalues of $\Delta_0$ are bounded, $\kappa \asymp p$, and we can appeal to part I of Theorem 4.8 to get $D_N$. For this we need
$$\frac{p}{n\eta(p)} = \frac{p_1 + p_2 + p_3}{n\{2 + 25(p_1 + p_2)\}} \asymp \frac{1}{n}.$$
In this case then we have root-$n$ convergence, $D_N = O_p(n^{-1/2})$. Further, since $p_3$ is bounded, $\eta \asymp p$ and part II of Theorem 4.8 holds. That is, $D_N = O_p(\rho/\sqrt{n})$.

If $\rho \asymp 1$ and $p_3$ is unbounded then we can use part I of Theorem 4.7. For this theorem we need
$$\frac{\kappa}{\eta} = \frac{-2 + p + 25p_3}{2 + 25(p_1 + p_2)} \asymp 1 + \frac{p_3}{p_1 + p_2}.$$
In consequence,
$$D_N = O_p\left\{\frac{1}{\sqrt{n}}\left(1 + \frac{p_3}{p_1 + p_2}\right)^2\right\}.$$
This rate depends on the total number of material predictors $p_1 + p_2$ relative to the number of immaterial predictors. If $p_3/(p_1 + p_2)$ is bounded as $p \to \infty$ we again achieve root-$n$ convergence. In this case $\kappa \asymp \eta$ and we recover the result of part II of Theorem 4.7. However, if the ratio $p_3/(p_1 + p_2)$ is unbounded, then the sample size needs to be large enough to compensate and again force $n^{-1/2}(\kappa/\eta)^2 \to 0$. This discussion once more reflects the idea of abundance: for the best results, the number of material predictors $p_1 + p_2$ should be large relative to the number of immaterial predictors $p_3$.
5

Simultaneous Reduction

Response and predictor envelope methods have the potential to increase effi-
ciency in estimation and prediction. It might be anticipated then that combin-
ing response and predictor reductions may have advantages over either method
applied individually. We discuss simultaneous predictor-response reduction in
this chapter. Our discussion is based mostly on two relatively recent papers:
Cook and Zhang (2015b) developed maximum likelihood estimation under a
multivariate model for simultaneous reduction of the response and predictor
vectors, studied asymptotic properties under different scenarios and proposed
two algorithms for getting estimators. Cook, Forzani, and Liu (2023b) devel-
oped simultaneous PLS estimation based on the same type of multivariate
linear regression model used by Cook and Zhang (2015b).
As in previous developments, we first discuss in Section 5.1 conditional
independence foundations for simultaneous reduction. We then incorporate
the multivariate linear model, turning to likelihood-based estimation in Sec-
tion 5.2. Simultaneous PLS estimation is discussed in Section 5.3 and other
related methods are discussed briefly in Section 5.4. Since PLS is a focal point
of this book, the main thrust of our discussion follows Cook, Forzani, and Liu
(2023b), with results from Cook and Zhang (2015b) integrated as relevant.

5.1 Foundations for simultaneous reduction


Recall that X ∈ Rp, Y ∈ Rr and that, from (2.2), the condition
(Y, PS X) ⫫ QS X is a foundation for predictor envelopes. It ensures that QS X
is immaterial to the regression since it is independent of Y and PS X jointly.
Similarly, the foundational condition for response envelopes is obtained by


interchanging the roles of X and Y : (PR Y, X) ⫫ QR Y from (2.14). We com-


bine these two conditions in the following proposition for simultaneous reduc-
tion of X and Y considered jointly, without designating one as the response
and one as the predictor. Its proof is given in Appendix A.5.1.

Proposition 5.1. Define the subspaces S ⊆ Rp and R ⊆ Rr. Then the two
conditions

    (a) QS X ⫫ (Y, PS X)   and   (b) QR Y ⫫ (X, PR Y)

hold if and only if the following two conditions hold:

    (I) QR Y ⫫ QS X   and   (II) (PR Y, PS X) ⫫ (QR Y, QS X).

To implement Proposition 5.1 in practice, we replace conditions (I) and


(II) with corresponding zero covariance conditions:

(I∗ ) cov(QR Y, QS X) = 0 and (II∗ ) cov {(PR Y, PS X), (QR Y, QS X)} = 0.


(5.1)
These covariance conditions imply straightforwardly that

    var(PR Y, PS X, QR Y, QS X) =
        [ PR ΣY PR       PR ΣY,X PS    0              0
          PS ΣX,Y PR     PS ΣX PS      0              0
          0              0             QR ΣY QR       0
          0              0             0              QS ΣX QS ].        (5.2)
Recall that C = (Xᵀ, Yᵀ)ᵀ ∈ R^{p+r} denotes the random vector con-
structed by concatenating X and Y, and that SC denotes the sample version
of ΣC = var(C). Equation (5.2) is expressed in terms of subspaces S and R,
but for the development of methodology we use it to express the covariance
matrix ΣC of the observable vector C in terms of S and R:

ΣX = var(X) = var(PS X + QS X)
= var(PS X) + var(QS X)
= PS ΣX PS + QS ΣX QS ,

where the second equality follows because from (5.2), cov(PS X, QS X) = 0.


Similarly, ΣY = PR ΣY PR + QR ΣY QR . This then implies that
    ΣC = [ ΣX        ΣX,Y
           ΣY,X      ΣY   ]
       = [ PS ΣX PS + QS ΣX QS    PS ΣX,Y PR
           PR ΣY,X PS             PR ΣY PR + QR ΣY QR ].                  (5.3)

The structure of (5.3) does not specify X or Y as the response. Proceeding


with these variables in their traditional roles of regressing Y on X, (5.3) is
satisfied by setting S to be the predictor envelope S = EΣX (CX,Y ) = EΣX(B)
and R to be the response envelope R = EΣY (CY,X ) = EΣY |X (B 0 ), as de-
scribed in Table 2.2. (Recall from the discussion following Proposition 1.8
that EΣY (CY,X ) = EΣY (B 0 ) = EΣY |X (B 0 ).)
A coordinate representation of ΣC may be of help when introducing the
linear model. Recall from (2.5) that Φ ∈ Rp×q is a semi-orthogonal ba-
sis matrix for EΣX(B), that (Φ, Φ0 ) is orthogonal, that ∆ = var(ΦT X) =
ΦT ΣX Φ ∈ Rq×q is the covariance matrix of the material predictors and that
∆0 = var(ΦT0 X) = ΦT0 ΣX Φ0 ∈ R(p−q)×(p−q) is the covariance matrix of the
immaterial predictors. Similarly, from (2.17), Γ ∈ Rr×u is a semi-orthogonal
basis matrix for EΣY |X (B 0 ) and for EΣY (CY,X ), and (Γ, Γ0 ) is orthogonal. De-
fine Θ = var(ΓT Y ) = ΓT ΣY Γ ∈ Ru×u , which is the variance of the material
part of the response vector and Θ0 = var(ΓT0 Y ) = ΓT0 ΣY Γ0 ∈ R(r−u)×(r−u) ,
which is the variance of the immaterial part of the response vector. Define
also K = ΦT ΣX,Y Γ ∈ Rq×u , which contains the coordinates of ΣX,Y relative
to Φ and Γ. Then the coordinate form of ΣC can be represented as
    ΣC = [ ΣX        ΣX,Y
           ΣY,X      ΣY   ]
       = [ Φ∆Φᵀ + Φ0∆0Φ0ᵀ     ΦKΓᵀ
           ΓKᵀΦᵀ              ΓΘΓᵀ + Γ0Θ0Γ0ᵀ ].                           (5.4)

As in our discussion of predictor envelopes in Section 2.2, no relationship is


assumed between the eigenvalues of ∆ and ∆0 : The eigenvalues of ∆ could be
any subset of the eigenvalues of ΣX . Similar flexibility holds for the eigenvalues
of ΣY .
It follows straightforwardly from (5.4) that the covariance matrices for the
material and immaterial parts of C agree with (5.2),
    Σ_{(Φ⊕Γ)ᵀC} = [ ∆     K
                    Kᵀ    Θ ]

and

    Σ_{(Φ0⊕Γ0)ᵀC} = [ ∆0    0
                      0     Θ0 ],
where the direct sum operator ⊕ is defined in Section 1.2.1.
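To make the coordinate form (5.4) concrete, the following Python sketch (our own illustration, not the book's companion code) builds ΣC from randomly generated constituents Φ, Γ, ∆, ∆0, Θ, Θ0 and K, and then verifies numerically that the compressed vectors (Φ ⊕ Γ)ᵀC and (Φ0 ⊕ Γ0)ᵀC have the covariance matrices displayed above, with zero covariance between them.

import numpy as np

rng = np.random.default_rng(0)
p, r, q, u = 6, 4, 2, 2

# Semi-orthogonal bases Phi (p x q), Gamma (r x u) and their completions
Phi_full = np.linalg.qr(rng.standard_normal((p, p)))[0]
Gam_full = np.linalg.qr(rng.standard_normal((r, r)))[0]
Phi, Phi0 = Phi_full[:, :q], Phi_full[:, q:]
Gam, Gam0 = Gam_full[:, :u], Gam_full[:, u:]

def spd(k):
    # a random symmetric positive definite matrix
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

Delta, Delta0 = spd(q), spd(p - q)
Theta, Theta0 = spd(u), spd(r - u)
K = 0.1 * rng.standard_normal((q, u))      # coordinates of Sigma_{X,Y}

Sigma_X  = Phi @ Delta @ Phi.T + Phi0 @ Delta0 @ Phi0.T
Sigma_XY = Phi @ K @ Gam.T
Sigma_Y  = Gam @ Theta @ Gam.T + Gam0 @ Theta0 @ Gam0.T
Sigma_C  = np.block([[Sigma_X, Sigma_XY], [Sigma_XY.T, Sigma_Y]])

B  = np.block([[Phi,  np.zeros((p, u))],     [np.zeros((r, q)),     Gam]])
B0 = np.block([[Phi0, np.zeros((p, r - u))], [np.zeros((r, p - q)), Gam0]])

print(np.allclose(B.T @ Sigma_C @ B,
                  np.block([[Delta, K], [K.T, Theta]])))            # True
print(np.allclose(B0.T @ Sigma_C @ B0,
                  np.block([[Delta0, np.zeros((p - q, r - u))],
                            [np.zeros((r - u, p - q)), Theta0]])))  # True
print(np.allclose(B.T @ Sigma_C @ B0, 0))                           # True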

To see the roles of dimension and rank, let

d = rank(ΣX,Y ) = rank(K) ≤ min(q, u), (5.5)

where q ≤ p and u ≤ r. The equality rank(ΣX,Y ) = rank(K) follows since,


from (5.4), ΣX,Y = ΦKΓT and Φ and Γ both have full column rank. From
the definitions of the X- and Y -envelopes, if d = r < p then EΣY |X (B 0 ) = Rr ,
Γ = Ir and we may reduce X only. Similarly, if d = p < r then EΣX (B) = Rp ,
Φ = Ip and reduction is possible only in the response space. Hence, we will
assume d < min(r, p) from now on and discuss the general situation where
simultaneous reduction is possible.

5.1.1 Definition of the simultaneous envelope


Model (5.4) is formulated in terms of two envelopes, the response envelope
EΣY |X (B 0 ) with basis matrix Γ and the predictor envelope EΣX (B) with basis
matrix Φ. Model estimation requires that we estimate these envelopes, either
individually using the methodology described in Chapter 2 or jointly based
on the likelihood stemming from model (5.4). However, using Lemma 1.4, it
is possible to describe model (5.4) in terms of a single simultaneous envelope:

EΣY |X (B 0 ) ⊕ EΣX (B) = EΣY (B 0 ) ⊕ EΣX (B) = EΣY ⊕ΣX (B 0 ⊕ B). (5.6)

In consequence, we can view our goal as the estimation of a single envelope


EΣY ⊕ΣX (B 0 ⊕ B), although the methodology described in this chapter still
deals with separate envelopes.

5.1.2 Bringing in the linear model


So far we have not made explicit use of a regression model. Assuming now
that linear model (1.1) holds, it follows from (5.4) that
    β = ΣX⁻¹ ΣX,Y
      = Φ∆⁻¹ΦᵀΦKΓᵀ
      = ΦηΓᵀ,  where η = ∆⁻¹K;
    ΣY|X = ΣY − var(βᵀX)
         = ΓΘΓᵀ + Γ0Θ0Γ0ᵀ − ΓKᵀ∆⁻¹KΓᵀ
         = Γ(Θ − Kᵀ∆⁻¹K)Γᵀ + Γ0Θ0Γ0ᵀ
         = ΓΩΓᵀ + Γ0Ω0Γ0ᵀ,

where Ω = Θ − Kᵀ∆⁻¹K = Θ − ηᵀ∆η and Ω0 = Θ0.



This structure leads to the following linear model for the simultaneous
reduction of predictors and responses,

    Y = α + ΓηᵀΦᵀX + ε,
    ΣY|X = ΓΩΓᵀ + Γ0Ω0Γ0ᵀ,                                                (5.7)
    ΣX = Φ∆Φᵀ + Φ0∆0Φ0ᵀ,
where η contains the coordinates of β relative to bases Γ and Φ. In the follow-
ing sections we discuss estimation under model (5.7) by maximum likelihood,
PLS and a related two-block method.
Model (5.7) is the same as that used by Cook and Zhang (2015b) in their
development of a likelihood-based simultaneous reduction method. However,
instead of starting with general reductive conditions given in Proposition 5.1,
Cook and Zhang (2015b) took a different route, which we now describe since
it may furnish additional insights.

5.1.3 Cook-Zhang development of (5.7)


The approach used by Cook and Zhang (2015b) essentially combines the en-
velopes for predictor reduction, model (2.5), and response reduction, model
(2.17), to achieve simultaneous reduction in the multivariate linear model
(1.1) in which Y and X are still jointly stochastic. To see how they ac-
complished this union of envelopes, recall that d = rank(β) = rank(K)
and consider the singular value decomposition β = U DV T , where U ∈
Rp×d and V ∈ Rr×d are orthogonal matrices, U T U = Id = V T V , and
D = diag(λ1 , . . . , λd ) is a diagonal matrix with elements λ1 ≥ · · · ≥ λd > 0
being the d singular values of β. This decomposition of β provides the essential
constituents, U and V, for constructing the envelopes. We have denoted the
column space (the left eigenspace) and row space (the right eigenspace) of β as
B = span(U ) and B 0 = span(V ). Parallel to our discussion of K, if d = r < p
then EΣY |X (B 0 ) = Rr and we may reduce X only. Similarly, if d = p < r
then EΣX (B) = Rp and reduction is possible only in the response space, which
leads to the reminder that we assume d < min(r, p). Since β = U DV T , where
span(U ) = B ⊆ span(Φ) and span(V ) = B 0 ⊆ span(Γ) by the definition of the
X- and Y -envelopes, we can represent

β = U DV T = (ΦA)D(B T ΓT ) = ΦηΓT

for some semi-orthogonal matrices A ∈ Rq×d and B ∈ Ru×d , where


η = ADB T . Bringing in the structured covariance matrices from the response

and predictor envelopes, we again arrive at the simultaneous reductive model


(5.7).

5.1.4 Links to canonical correlations


Result (5.2) shows a similarity between simultaneous envelope reduction and
canonical correlation analysis: the components that capture the linear rela-
tionships between X and Y are uncorrelated with the rest of the components.
Canonical correlation analysis is widely used for the purpose of simulta-
neously reducing the predictors and the responses. In the population, it finds
canonical pairs of directions {ai , bi }, i = 1, . . . , d, so that the correlations be-
tween aᵢᵀX and bᵢᵀY are maximized subject to the constraints aⱼᵀΣX aₖ = 0,
aⱼᵀΣX,Y bₖ = 0 and bⱼᵀΣY bₖ = 0 for all j ≠ k, and aⱼᵀΣX aⱼ = 1 and bⱼᵀΣY bⱼ = 1
for all j. The solution is then {aᵢ, bᵢ} = {ΣX^{−1/2} eᵢ, ΣY^{−1/2} fᵢ}, where {eᵢ, fᵢ}
is the i-th left-right eigenvector pair of the matrix ρ = ΣX^{−1/2} ΣX,Y ΣY^{−1/2}.
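The canonical directions can be computed directly from this recipe. The Python sketch below is ours (the function names are not from the book's software); it forms ρ, takes its singular value decomposition to obtain the {eᵢ, fᵢ} pairs, and back-transforms them to {aᵢ, bᵢ}.

import numpy as np

def inv_sqrt(S):
    # inverse symmetric square root of a positive definite matrix
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def canonical_pairs(Sigma_X, Sigma_Y, Sigma_XY):
    """Population canonical directions {a_i, b_i}; at most
    d = rank(Sigma_XY) pairs are available."""
    Sx, Sy = inv_sqrt(Sigma_X), inv_sqrt(Sigma_Y)
    rho = Sx @ Sigma_XY @ Sy
    E, svals, Ft = np.linalg.svd(rho, full_matrices=False)
    d = int(np.sum(svals > 1e-10))
    A = Sx @ E[:, :d]        # columns a_1, ..., a_d
    B = Sy @ Ft.T[:, :d]     # columns b_1, ..., b_d
    return A, B, svals[:d]   # svals[:d] are the canonical correlations

By Lemma 5.1 below, span(A) ⊆ EΣX(B) and span(B) ⊆ EΣY|X(B′), so these directions could serve, for example, as rough starting values, although they may miss material parts of X or Y.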
The following lemma from Cook and Zhang (2015b, Lemma 2) connects
envelopes and canonical correlations. A proof is sketched in Appendix A.5.2.

Lemma 5.1. Under the simultaneous envelope model (5.7), canonical cor-
relation analysis can find at most d directions in the population, where
d = rank(ΣX,Y ) as defined in (5.5). Moreover, the directions are contained in
the simultaneous envelope as

span(a1 , . . . , ad ) ⊆ EΣX (B), span(b1 , . . . , bd ) ⊆ EΣY |X (B 0 ). (5.8)

Canonical correlation analysis may thus miss some information about the
regression by ignoring some material parts of X and/or Y . For example, when
r is small, it can find at most r linear combinations of X, which can be insuf-
ficient for regression. Cook and Zhang (2015b) found in the simulation studies
that the performance of predictions based on estimated canonical correlation
reductions a1ᵀX, . . . , adᵀX and b1ᵀY, . . . , bdᵀY varied widely for different covari-
ance structures and was generally poor.

5.2 Likelihood-based estimation


Assuming that C is normally distributed, Cook and Zhang (2015b) devel-
oped maximum likelihood estimation under model (5.7). They also studied
asymptotic properties under different scenarios and proposed two algorithms
for getting estimators G and W of the semi-orthogonal bases Γ and Φ in model

(5.7). One is a likelihood-based algorithm that alternates between iterations


on two non-convex objective functions for predictor and response reductions.
That algorithm is generally reliable, but can be slow and can get caught in a lo-
cal optimum. Their second algorithm optimizes over one basis vector at a time.
It is generally faster and nearly as efficient as the likelihood-based algorithm.
They also prove that if the data Ci , i = 1, . . . , n, are independent observa-
tions on C which has finite fourth moments then jointly the vectorized maxi-
mum likelihood estimators of β, ΣX and ΣY |X are asymptotically distributed
as a normal random vector. If C is normally distributed that asymptotic dis-
tribution simplifies considerably.

5.2.1 Estimators
As with other envelope models encountered in this book, the most difficult part
of parameter estimation is determining estimators for the basis matrices Φ
and Γ. Once this is accomplished, the estimators of the remaining parameters
in model (5.7) are relatively straightforward to construct.
Let G and W denote semi-orthogonal matrices that estimate versions of Γ
and Φ. These may come from maximum likelihood estimation, PLS estimation
or some other methods. Estimators of the remaining parameters in (5.7) are
given in the following lemma. The derivation is omitted since it is quite similar
to the derivations of the estimators for predictor and response envelopes.

Lemma 5.2. After compressing the centered predictor X ↦ WᵀX and the
response Y ↦ GᵀY vectors and assuming multivariate normality for C, the
maximum likelihood estimators of the parameters in model (1.1) via model
(5.7) are

    α̂ = Ȳ
    η̂ = S_{WᵀX}⁻¹ S_{WᵀX,GᵀY}
    β̂ = W S_{WᵀX}⁻¹ S_{WᵀX,GᵀY} Gᵀ
    ∆̂ = S_{WᵀX}
    ∆̂0 = S_{W0ᵀX}
    Ω̂ = S_{GᵀY|WᵀX}
    Ω̂0 = S_{G0ᵀY}
    Σ̂_{Y|X} = GΩ̂Gᵀ + G0Ω̂0G0ᵀ
    Σ̂_X = W∆̂Wᵀ + W0∆̂0W0ᵀ,

where (W, W0) and (G, G0) are orthogonal matrices.
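Given weight matrices W and G from any source, the estimators in Lemma 5.2 reduce to simple moment calculations. The following Python sketch is our own illustration (not the book's companion code) of α̂, η̂ and β̂; the remaining estimators are the indicated sample covariance matrices of the compressed variables.

import numpy as np

def cov(A, B):
    # sample covariance (divisor n) between columns of centered A and B
    return A.T @ B / A.shape[0]

def simultaneous_estimators(X, Y, W, G):
    """Estimators of Lemma 5.2 for semi-orthogonal weight matrices
    W (p x q) and G (r x u); X (n x p) and Y (n x r) hold the data row-wise."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    WX, GY = Xc @ W, Yc @ G                    # compressed variables
    Delta_hat = cov(WX, WX)                    # S_{W'X}
    eta_hat = np.linalg.solve(Delta_hat, cov(WX, GY))
    beta_hat = W @ eta_hat @ G.T               # p x r coefficient matrix
    alpha_hat = Y.mean(axis=0)                 # intercept, with X centered
    return alpha_hat, eta_hat, beta_hat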



The next lemma gives an instructive form of the residuals from the fit of
model (5.7). Its proof is in Appendix A.5.3.

Lemma 5.3. Using the sample estimators from Lemma 5.2, the sample co-
variance matrix of the residuals from model (5.7),

    Sres = n⁻¹ Σᵢ₌₁ⁿ {(Yᵢ − Ȳ) − Gη̂ᵀWᵀXᵢ}{(Yᵢ − Ȳ) − Gη̂ᵀWᵀXᵢ}ᵀ,

can be represented as

    Sres = Σ̂_{Y|X} + P_G S_{Y|WᵀX} Q_G + Q_G S_{Y|WᵀX} P_G,

where Σ̂_{Y|X} is as given in Lemma 5.2.

From this lemma we see that Sres = Σ̂_{Y|X} if and only if span(G) is a reducing
subspace of S_{Y|WᵀX}. As this holds in the population, it fits well with the ba-
sic theory leading to model (5.7). It also suggests that lack-of-fit diagnostics
might be developed from the term P_G S_{Y|WᵀX} Q_G.
An objective function for estimation under model (5.7) can be constructed
under the same general setup as we used for predictor envelopes in Sec-
tion 2.3. We again base estimation on the objective function Fq(SC, ΣC) =
log |ΣC| + tr(SC ΣC⁻¹) that stems from the log likelihood of the multivariate
normal family after replacing the population mean vector with the vector of
sample means, although here we do not require C to have a multivariate nor-
mal distribution. Rather, as discussed in Section 2.3.1, we are again using
Fq as a multi-purpose objective function in the same spirit as least squares
objective functions are often used.
We use (5.4), (5.7), η = ∆−1 K and Θ = Ω + K T ∆−1 K to structure ΣC as

    ΣC = (Φ ⊕ Γ) [ ∆       ∆η
                   ηᵀ∆     Ω + ηᵀ∆η ] (Φ ⊕ Γ)ᵀ + (Φ0 ⊕ Γ0)(∆0 ⊕ Ω0)(Φ0 ⊕ Γ0)ᵀ.  (5.9)

From this we see that the simultaneous envelope is a reducing subspace of


ΣC with special structure, just as the response and predictor envelopes are
reducing subspaces of their respective covariance matrices. Then following the
general ideas used in Section 2.3 for maximizing the likelihood for predictor
envelopes, the objective function Fq (SC , ΣC ) can be minimized explicitly over
∆, ∆0 , Ω, Ω0 and η with Γ and Φ held fixed. The resulting partially minimized
objective function can be written as

    F(SC, Φ ⊕ Γ) = log |(Φ ⊕ Γ)ᵀ SC (Φ ⊕ Γ)| + log |(Φ ⊕ Γ)ᵀ (SX ⊕ SY)⁻¹ (Φ ⊕ Γ)|.

Let

    (Φ̂, Γ̂) = argmin_{W,G} F(SC, W ⊕ G),

where the minimization is over p × q semi-orthogonal matrices W and r × u
semi-orthogonal matrices G. Once these estimators are found, the estimators
of the constituent parameters are as given in Lemma 5.2, replacing W with Φ̂
and G with Γ̂.
If the observations Ci, i = 1, . . . , n, are independent and identically dis-
tributed with finite fourth moments and if the dimensions are known then the
joint asymptotic distribution of the maximum likelihood estimators is multi-
variate normal. The asymptotic covariance matrix is rather complicated and
not amenable to use in applications, although it simplifies considerably if C is
normally distributed. Nevertheless, the residual bootstrap works well (Cook
and Zhang, 2015b) and that is what we recommend for constructing inference
statements.

5.2.2 Computing
Cook and Zhang (2015b) discussed three algorithms for minimizing the si-
multaneous objective function, one of which uses an algorithm that alternates
between predictor and response reduction. If we fix Φ as an arbitrary orthog-
onal basis, then the objective function F (SC , Φ ⊕ Γ) can be re-expressed as
an objective function in Γ for response reduction as given in (2.21):

    F(Γ | Φ) = log |Γᵀ S_{Y|ΦᵀX} Γ| + log |Γᵀ S_Y⁻¹ Γ|.                    (5.10)

Similarly, if we fix Γ, the objective function F (SC , Φ ⊕ Γ) reduces to the con-


ditional objective function for predictor reduction as given in (2.11),
    F(Φ | Γ) = log |Φᵀ S_{X|ΓᵀY} Φ| + log |Φᵀ S_X⁻¹ Φ|.                    (5.11)

The following alternating algorithm based on (5.10) and (5.11) can be used
to obtain a minimizer of the objective function F (SC , Φ ⊕ Γ).
1. Initialization: Set the starting value Φ(0) and get Γ(0) = arg minΓ F (Γ|Φ(0) ).
2. Alternating: For the k-th stage, obtain Φ(k) = arg minΦ F (Φ|Γ = Γ(k−1) )
and Γ(k) = arg minΓ F (Γ|Φ = Φ(k) ).
3. Convergence criterion: Evaluate F(Φ(k−1) ⊕ Γ(k−1)) − F(Φ(k) ⊕ Γ(k)) and
   return to the alternating step if it is bigger than a tolerance; otherwise,
   stop the iteration and use Φ(k) ⊕ Γ(k) as the final estimator.
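The conditional objective functions (5.10) and (5.11) that drive the alternating steps are easy to evaluate. The Python sketch below is ours; the inner minimizations over semi-orthogonal matrices require a Grassmann or Stiefel optimizer, which we do not show.

import numpy as np

def logdet(M):
    return np.linalg.slogdet(M)[1]

def F_gamma(G, Phi, S_X, S_Y, S_XY):
    # objective (5.10) for the response basis G with Phi fixed
    S_PX = Phi.T @ S_X @ Phi
    S_Y_given_PX = S_Y - S_XY.T @ Phi @ np.linalg.solve(S_PX, Phi.T @ S_XY)
    return logdet(G.T @ S_Y_given_PX @ G) + logdet(G.T @ np.linalg.solve(S_Y, G))

def F_phi(W, Gam, S_X, S_Y, S_XY):
    # objective (5.11) for the predictor basis W with Gamma fixed
    S_GY = Gam.T @ S_Y @ Gam
    S_X_given_GY = S_X - S_XY @ Gam @ np.linalg.solve(S_GY, Gam.T @ S_XY.T)
    return logdet(W.T @ S_X_given_GY @ W) + logdet(W.T @ np.linalg.solve(S_X, W))

# Both functions take sample covariance matrices S_X, S_Y, S_XY and
# semi-orthogonal candidate bases, e.g. W = np.linalg.qr(np.random.randn(p, q))[0].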
According to Cook and Zhang (2015b), as long as good initial values are used,
the alternating algorithm, which monotonically decreases F (Φ ⊕ Γ), will con-
verge after only a few cycles. Root-n consistent starting values are particularly
important to mitigate potential problems caused by multiple local minima and
to ensure efficient estimation. For instance, under joint normality of X and
Y, one Newton-Raphson iteration from any √n-consistent estimator will be
asymptotically as efficient as the maximum likelihood estimator, even if local
minima are present (Small et al., 2000).

5.3 PLS estimation


The Cook-Zhang likelihood-based analysis under the joint reduction model
(5.7) is not serviceable unless the sample size is sufficiently large. This is
where PLS comes in. We have seen in Proposition 5.1 that at the population
level, the requirements for simultaneous reduction of X and Y are obtained
by combining the requirements for separate reductions. This suggests that we
proceed in an analogous fashion using PLS: Obtain the weight matrix W for
predictor compression by running a PLS algorithm, NIPALS or SIMPLS, as
usual. Then interchange the roles of X and Y and run the algorithm again
to obtain the weight matrix G for response compression. Since we are run-
ning two separate applications of the NIPALS or SIMPLS algorithms, basic
asymptotic behavior of the reductions is as described in Propositions 2.1, 2.2
and the discussions that follow them.
The weight matrices W and G can be used in Lemma 5.2 to estimate the
remaining parameters. For clarity, NIPALS-type algorithms for constructing
W and G are shown in Table 5.1. Table 5.1(a) is the same as the bare-bones
version in Table 3.2.
TABLE 5.1
NIPALS algorithm for computing the weight matrices W and G for compress-
ing X and Y. The n × p matrix X contains the centered predictors and the
n × r matrix Y contains the centered responses. ℓ1(·) denotes the eigenvector
corresponding to the largest eigenvalue of the matrix argument.

(a) Predictor reduction, W

Initialize    X1 = X. Quantities subscripted with 0 are to be deleted.
Select        q ≤ min{rank(SX), n − 1}
For d = 1, . . . , q, compute
  sample covariance matrix   S_{Xd,Y} = n⁻¹ Xdᵀ Y
  X weights                  wd = ℓ1(S_{Xd,Y} S_{Xd,Y}ᵀ)
  X scores                   sd = Xd wd
  X loadings                 ld = Xdᵀ sd / sdᵀ sd
  X deflation                X_{d+1} = Xd − sd ldᵀ = Q_{sd} Xd = Xd Q_{wd(S_{Xd})}
  Append                     Wd = (W_{d−1}, wd) ∈ R^{p×d}
End           Set W = Wq
(b) Response reduction, G

Initialize    Y1 = Y
Select        u ≤ min{rank(SY), n − 1}
For d = 1, . . . , u, compute
  sample covariance matrix   S_{Yd,X} = n⁻¹ Ydᵀ X
  Y weights                  gd = ℓ1(S_{Yd,X} S_{Yd,X}ᵀ)
  Y scores                   cd = Yd gd
  Y loadings                 hd = Ydᵀ cd / cdᵀ cd
  Y deflation                Y_{d+1} = Yd − cd hdᵀ = Q_{cd} Yd = Yd Q_{gd(S_{Yd})}
  Append                     Gd = (G_{d−1}, gd) ∈ R^{r×d}
End           Set G = Gu
(c) Compute   β̂ = W S_{WᵀX}⁻¹ S_{WᵀX,GᵀY} Gᵀ

(d) Notes     Switching the roles of X and Y, parts (a) and (b) of this
              algorithm are identical. This is in line with the theory of
              Section 2.4 and summarized in Table 2.2. A difference arises
              in how WᵀX and GᵀY are used in computing β̂.
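A bare-bones Python sketch of the dimension-reduction arm in Table 5.1 is given below; it is our own illustration, not the book's companion software. Run with (X, Y, q) it returns W as in part (a); with the roles of the matrices switched, (Y, X, u), it returns G as in part (b).

import numpy as np

def nipals_weights(Z, T, ncomp):
    """Weight matrix for compressing the columns of the centered n x pz
    matrix Z against the centered target matrix T, following Table 5.1."""
    n = Z.shape[0]
    Zd = Z.copy()
    weights = []
    for _ in range(ncomp):
        S = Zd.T @ T / n                        # S_{Z_d, T}
        w = np.linalg.eigh(S @ S.T)[1][:, -1]   # first eigenvector l_1(S S')
        s = Zd @ w                              # scores
        l = Zd.T @ s / (s @ s)                  # loadings
        Zd = Zd - np.outer(s, l)                # deflation
        weights.append(w)
    return np.column_stack(weights)

# W = nipals_weights(Xc, Yc, q); G = nipals_weights(Yc, Xc, u)
# beta_hat then follows from step (c) of Table 5.1 or from Lemma 5.2.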

To help fix ideas we revisit the example introduced in Section 3.1.2 in


which p = 3, r = 2,

    ΣX = 10 [ 1    4/3  0
              4/3  4    0
              0    0    5 ],

    ΣX,Y = [ 5  0
             0  4
             0  0 ],   and

    ΣY = 10 [  1    −0.6
              −0.6   4   ].
We demonstrated in Section 3.1.2 using NIPALS and in Section 3.3.2 using
SIMPLS that only two linear combinations of the predictors are material to
the regression with span(W2 ) = span(V2 ) = span((1, 0, 0)T , (0, 1, 0)T ). To re-
duce the 2 × 1 response vector we interchange the roles of the predictors and
responses, treating X as the response and Y as the predictor. The first re-
sponse weight vector is then the first eigenvector of

    ΣY,X ΣX,Y = [ 25  0
                  0   16 ],

which is w1 = (1, 0)ᵀ. The second weight vector (0, 1)ᵀ is constructed using

    Q_{w1(ΣY)} = [ 0  −0.6
                   0   1   ],    Q_{w1(ΣY)}ᵀ ΣY,X = [  0  0  0
                                                      −3  4  0 ],
and

    Q_{w1(ΣY)}ᵀ ΣY,X ΣX,Y Q_{w1(ΣY)} = [ 0  0
                                         0  1 ].
In consequence, no response reduction is possible since span(w1 , w2 ) = R2 .
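This conclusion is easy to check numerically. The short Python sketch below is ours; it uses the projection convention P_{w(Σ)} = w(wᵀΣw)⁻¹wᵀΣ, so intermediate matrices may differ from the display in sign or scale, but the extracted weight vectors, and hence the conclusion that span(w1, w2) = R², are unaffected.

import numpy as np

Sigma_X  = 10 * np.array([[1, 4/3, 0], [4/3, 4, 0], [0, 0, 5]])
Sigma_XY = np.array([[5.0, 0], [0, 4], [0, 0]])
Sigma_Y  = 10 * np.array([[1, -0.6], [-0.6, 4]])
Sigma_YX = Sigma_XY.T

def top_eigvec(M):
    return np.linalg.eigh(M)[1][:, -1]

w1 = top_eigvec(Sigma_YX @ Sigma_XY)            # (1, 0)' up to sign
P = np.outer(w1, w1 @ Sigma_Y) / (w1 @ Sigma_Y @ w1)
Q = np.eye(2) - P                               # deflation projection
w2 = top_eigvec(Q.T @ Sigma_YX @ Sigma_XY @ Q)  # (0, 1)' up to sign
print(np.round(w1, 6), np.round(w2, 6))         # together they span R^2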

5.4 Other methods


5.4.1 Bilinear models for simultaneous reduction
Bilinear models have also been used as guides to simultaneous reduction of Y
and X. A common formulation of a bilinear model for simultaneous reduction
is (e.g. Rosipal and Krämer, 2006; Rosipal, 2011; Weglin, 2000; Wold, 1975a)
    X0,n×p = 1n μXᵀ + V_{n×dv} C_{p×dv}ᵀ + E,
    Y0,n×r = 1n μYᵀ + T_{n×dt} D_{r×dt}ᵀ + F,                              (5.12)
where X0 and Y0 are the matrices of uncentered predictors and responses, V
and T are matrices of random scores with rows viT and tTi , i = 1, . . . , n. We

assume that the vi's and ti's are independent copies of random vectors v and
t, both with mean 0. The models are of the same form after nonsingular linear
transformations of t and v and, in consequence, we assume without loss of
generality that var(t) = I_{dt} and var(v) = I_{dv}. However, t and v are required
to be correlated, cov(t, v) = Σ_{t,v}, for the predictors and responses to be cor-
related. C and D are non-stochastic loading matrices with full column ranks,
and E and F are matrices of random errors.
Model (5.12) is an instance of a structural equation model. Investigators
using structural equation models have historically taken a rather casual atti-
tude toward the random errors, focusing instead on the remaining structural
part of the model (Kruskal, 1983). Often, properties of the error matrices E
and F are not mentioned, presumably with the implicit understanding that
the errors are small relative to the structural components. That view may be
adequate for some purposes, but not here. We assume that the rows eTi and
fiT , i = 1, . . . , n, of E and F are independently distributed with means 0 and
constant variance-covariance matrices Σe and Σf, and E ⫫ F.
Written in terms of the rows XiT of X0 and the rows YiT of Y0 , both
uncentered, these models become for i = 1, . . . , n
    Xi = μX + C vi + ei,
    Yi = μY + D ti + fi.                                                   (5.13)

The vectors vi and ti are interpreted as latent vectors that control the ex-
trinsic variation in X and Y . These latent vectors may be seen as imaginable
constructs as in some applications like path analyses (e.g. Haenlein and Ka-
plan, 2004; Tenenhaus and Vinzi, 2005) or as convenient devices to achieve
dimension reduction. The X and Y models are connected because t and v are
correlated with covariance matrix Σt,v . If t and v are uncorrelated then so are
X and Y .
To develop a connection with envelopes, let C = span(C) and D = span(D)
and let Φ ∈ Rp×q and Γ ∈ Rr×u be semi-orthogonal basis matrices for
EΣe (C) and EΣf (D), respectively, q ≥ dv , u ≥ dt , and let (Φ, Φ0 ) ∈ Rp×p
and (Γ, Γ0 ) ∈ Rr×r be orthogonal matrices. With this we have

Σe = ΦWe ΦT + Φ0 We,0 ΦT0


Σf = ΓWf ΓT + Γ0 Wf,0 ΓT0
ΣY,X = PΓ Σt,v PΦ , (5.14)

where the W(·)'s are positive definite matrices that can be expressed in
terms of quantities in model (5.13). Such expressions of the W(·) ’s will play
no role in what follows and so are not given. We see from (5.14) that
ΓT0 ΣY,X = ΣΓT0 Y,X = 0 and that ΣY,X Φ0 = ΣY,ΦT0 X = 0, and thus that
the covariance between X and Y is captured entirely by the covariance be-
tween ΦTX and ΓT Y . In effect, ΓT Y represents the predictable part of Y and
ΦT X represents the material part of X.
In short, the bilinear model (5.12) leads to the envelope structure of (5.3)
that stems from Proposition 5.1. See Section 11.2 for additional discussion of
bilinear models as a basis for PLS.

5.4.2 Another class of bilinear models


The phrase ‘bilinear model’ generally means that somehow there are two lin-
ear structures represented. The precise meaning of ‘bilinear model’ can change
depending on context. The book Bilinear Regression Analysis by vonRosen
(2018) contains a thorough treatment of bilinear models represented, in von-
Rosen’s notation, as Xp×n = Ap×q Bq×k Ck×n + E, where X is observable, A
and C are known non-stochastic matrices, B is unknown to be estimated and
E is a matrix of errors. While this class of models is outside the scope of this
book, we show how it can arise in the context of model (1.1) to emphasize
how it differs from model (5.12).
Starting with the full matrix representation (1.6) of model (1.1), suppose
that we model the columns of βᵀ as being contained in a known subspace
S with known r × k basis matrix Z: βᵀ = ZBᵀ, where the p × k matrix B
contains the coordinates of βᵀ relative to basis Z. Substituting this into (1.6),
we get the model
    Y0 = 1n αᵀ + XBZᵀ + E.

This represents linear models in the rows and columns of Y and is the same
model form as that studied by vonRosen (2018). Models of this form arise in
analyses of longitudinal data, as discussed in Section 6.4.4.

5.4.3 Two-block algorithm for simultaneous reduction


Staying with the theme of connecting PLS algorithms with envelopes, we next
consider the population version of the two-block algorithm that has been used
frequently for simultaneous X-Y reduction in conjunction with (5.12) (e.g.
TABLE 5.2
Population version of the common two-block algorithm for simultaneous X-Y
reduction based on the bilinear model (5.12). (Not recommended.)

Initialize              (c1, d1) = argmax_{‖c‖=‖d‖=1} cᵀ ΣX,Y d,
                        C1 = (c1), D1 = (d1)
For k = 1, 2, . . .
  Compute weights       (c_{k+1}, d_{k+1}) = argmax_{‖c‖=‖d‖=1} cᵀ Q_{Ck(ΣX)}ᵀ ΣX,Y Q_{Dk(ΣY)} d
  Append                C_{k+1} = (Ck, c_{k+1}),  D_{k+1} = (Dk, d_{k+1})
End                     when first max_{‖c‖=‖d‖=1} cᵀ Q_{Ck(ΣX)}ᵀ ΣX,Y Q_{Dk(ΣY)} d = 0,
                        and then define k̄ = k
Compute regression
  coefficients          β̂_ipls = C_{k̄}(C_{k̄}ᵀ ΣX C_{k̄})⁻¹ C_{k̄}ᵀ ΣX,Y P_{D_{k̄}}
Envelope connection     span(C_{k̄}) ⊆ E_ΣX(B),  span(D_{k̄}) ⊆ E_ΣY(B′),
                        where B = span(β), B′ = span(βᵀ)

Rosipal and Krämer, 2006; Rosipal, 2011; Weglin, 2000; Wold, 1975a). The
derivation of the population two-block algorithm from the data-based algo-
rithm is available in Appendix A.5.4, since it follows the same general steps
as the derivations of the population algorithms for NIPALS and SIMPLS dis-
cussed previously in Sections 3.1 and 3.3.
The population version of the two-block algorithm developed in Ap-
pendix A.5.4 is shown in Table 5.2. The sample version, which is obtained by
replacing ΣX , ΣY , and ΣX,Y with their sample versions SX , SY , and SX,Y ,
does not require SX and SY to be nonsingular. Perhaps the main drawback of
this algorithm is that the column dimensions of the weight matrices Ck̄ and
Dk̄ must both equal k̄, the number of components, so the number of mate-
rial response linear combinations will be the same as the number of material
predictor linear combinations. In consequence, we are only guaranteed in the
population to generate subsets of EΣX (B) and EΣY (B 0 ), and a material part
of the response or predictors must necessarily be missed, except perhaps in
special cases.
We use the example of Sections 3.1.2 and 3.3.2 to illustrate an important
limitation of the algorithm in Table 5.2. That example had p = 3 predictors

and r = 2 responses. For ease of reference,

    ΣX = 10 [ 1    4/3  0
              4/3  4    0
              0    0    5 ],

    ΣX,Y = [ 5  0
             0  4
             0  0 ],   and

    ΣY = 10 [  1    −0.6
              −0.6   4   ].

According to the initialization step in Table 5.2, we first need to calculate

    (c1, d1) = argmax_{‖c‖=‖d‖=1} cᵀ ΣX,Y d

to get c1 = (1, 0, 0)T and d1 = (1, 0)T . The next step is to check the stop-
ping criterion, and for that we need to calculate Q_{c1(ΣX)}ᵀ ΣX,Y Q_{d1(ΣY)}. The
projections in this quantity are

    Q_{c1(ΣX)} = [ 0  −4/3  0
                   0   1    0
                   0   0    1 ]   and   Q_{d1(ΣY)} = [ 0  0.6
                                                       0  1.0 ].

Direct calculation then gives

    Q_{c1(ΣX)}ᵀ ΣX,Y Q_{d1(ΣY)} = 0,                                       (5.15)

so we stop with (c1 , d1 ), which implies that we need only cT1 X = x1 and
dT1 Y = y1 to characterize the regression of Y on X. However, we know from
Sections 3.1.2 and 3.3.2 that two linear combinations of the predictors are
required for the regression, and from Section 5.3 that two linear combinations
of the response are required. In consequence, the two-block algorithm missed
key response and predictor information, and for this reason it cannot be rec-
ommended for simultaneous reduction of the predictors and responses. The
simultaneous envelope methods discussed in Sections 5.2 and 5.3 are definitely
preferred.
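The calculations behind this example can be reproduced with a few lines of Python (ours, not the book's code): the initialization is the leading singular vector pair of ΣX,Y, and the stopping criterion (5.15) evaluates to the zero matrix after the first pair.

import numpy as np

Sigma_X  = 10 * np.array([[1, 4/3, 0], [4/3, 4, 0], [0, 0, 5]])
Sigma_XY = np.array([[5.0, 0], [0, 4], [0, 0]])
Sigma_Y  = 10 * np.array([[1, -0.6], [-0.6, 4]])

# Initialization: (c1, d1) maximizes c' Sigma_XY d over unit vectors
U, svals, Vt = np.linalg.svd(Sigma_XY)
c1, d1 = U[:, 0], Vt[0, :]                      # (1,0,0)' and (1,0)' up to sign

def Q_proj(a, S):
    # Q_{a(S)} = I - a (a'Sa)^{-1} a'S, the projection used in Table 5.2
    return np.eye(len(a)) - np.outer(a, a @ S) / (a @ S @ a)

crit = Q_proj(c1, Sigma_X).T @ Sigma_XY @ Q_proj(d1, Sigma_Y)
print(np.round(crit, 10))                       # zero matrix, so the algorithm stops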

5.5 Empirical results


In this section we present simulation results and brief analyses of three
datasets from Cook et al. (2023b) and its supplement. Five methods are com-
pared for predicting Y and estimating β in model (1.1): (1) ordinary least

squares (OLS), (2) the two block method (Section 5.4.3), (3) sample size per-
mitting, the envelope-based likelihood reduction method for simultaneously
reducing X and Y (XY -ENV, Section 5.2), (4) PLS for predictor reduction
only (X-PLS, Table 3.1), and (5) the newly proposed PLS method for reducing
both X and Y (XY -PLS, Section 5.3).
Our overall conclusions are that any of these five methods can give com-
petitive results depending on characteristics of the regression. The XY-ENV
and XY-PLS methods were judged best, with XY-ENV being preferred when
n is sufficiently large. Overall, however, we judged XY-PLS to be the most
serviceable: lacking detailed knowledge of the regression, even if n is sufficiently
'large', we would likely choose XY-PLS for use in predictive applications.

5.5.1 Simulations
The performance of a method can depend on the response and predictor di-
mensions r and p, the dimensions of the response and predictor envelopes u
and q, the components ∆, ∆0 , Ω, and Ω0 of the covariance matrices ΣY |X and
ΣX given at (5.7), β = ΦηΓT and the distribution of C. Following Cook et al.
(2023b) we focused on the dimensions and the covariance matrices. Given all
dimensions, the elements of the parameters Φ, η, and Γ were generated using
independent uniform (0, 1) random variables. The distribution of C was taken
to be multivariate normal. This gives an edge to the XY -ENV method so we
felt comfortable thinking of it as the gold standard in large samples.
Following Cook and Zhang (2015b) we set

∆ = aIq , ∆0 = Ip−q
Ω = bIu , Ω0 = 10Ir−u ,

where a and b were selected constants. We know from our discussion at the
end of Section 2.5.1 and from previous studies (Cook and Zhang, 2015b; Cook,
2018) that the effectiveness of predictor envelopes tends to increase with a,
while the effectiveness of response envelopes increases as b decreases. Here
we are interested mainly in the effectiveness of the joint reduction methods
as they compare to each other and to the marginal reduction methods. We
compared methods based on the average root mean squared prediction error

per response and the total mean squared error


v
m uX r
1 Xu t (Yij − Ybij )2 ,
predRMSE =
m i=1 j=1
n r
1 XX
totalMSE = (Yij − Ybij )2 ,
rn i=1 j=1

where Ŷᵢⱼ denotes a fitted value and Yᵢⱼ is an observation from an indepen-
dent testing sample of size m = 1000 generated with the same parameters as
the data for the fitted model. To focus on estimation, we also computed the
average root mean squared estimation error per coefficient
    betaRMSE = (1/p) Σᵢ₌₁ᵖ { Σⱼ₌₁ʳ (βᵢⱼ − β̂ᵢⱼ)² }^{1/2}.

The true dimensions were used for all the methods. Results are given in Ta-
bles 5.3–5.4. Some cells in these tables are missing data because a method ran
into rank issues and could not be implemented cleanly.
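For reference, the three error measures translate directly into code. The Python sketch below is ours; Y and Yhat hold the test responses and fitted values row-wise, and beta and beta_hat are p × r matrices.

import numpy as np

def pred_rmse(Y, Yhat):
    # average over test cases of the root summed squared prediction error
    return np.mean(np.sqrt(np.sum((Y - Yhat) ** 2, axis=1)))

def total_mse(Y, Yhat):
    n, r = Y.shape
    return np.sum((Y - Yhat) ** 2) / (r * n)

def beta_rmse(beta, beta_hat):
    # average over rows of beta of the root summed squared estimation error
    return np.mean(np.sqrt(np.sum((beta - beta_hat) ** 2, axis=1)))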

5.5.2 Varying sample size


The results in Table 5.3 are for p = 50 predictors with q = 10 components
and r = 4 responses with u = 3 components. At n = 1000 the XY -ENV
had the best performance, as expected, with XY -PLS a close second. The
performance of all methods deteriorated as the sample size decreased. After
n = 57, the rank requirements for OLS and XY -ENV were no longer met
and so computing ceased for these methods. At n = 50 and n = 25, the two-
block method deteriorated badly, while XY -PLS continued to be serviceable.
The difference between the performances of X-PLS and XY -PLS supports
response reduction in addition to predictor reduction.
The settings for Table 5.4 are the same as those for Table 5.3 except q
was increased to 40. The observed errors generally increased. The two-block
method and X-PLS do deteriorate, but the two simultaneous methods again
dominate. The performance is similar to the q = 10 case.
The results in Table 5.5 are for p = 30 predictors with q = 20 components
and r = 50 responses with u = 2 components. For n ≥ 85, the XY -ENV again
had the best performance, as expected, with XY -PLS a close second. The
TABLE 5.3
Error measures with a = 50, b = 0.01, p = 50, r = 4, q = 10, and u = 3.

n = 1000 OLS Two Block XY -ENV X-PLS XY -PLS


predRMSE 2.64 3.52 2.51 2.63 2.53
totalMSE 2.61 3.30 2.55 2.61 2.55
betaRMSE 0.07 0.04 0.00 0.07 0.01
n = 200
predRMSE 3.38 7.18 2.51 3.07 2.58
totalMSE 2.94 4.78 2.54 2.82 2.59
betaRMSE 0.18 0.08 0.01 0.14 0.03
n = 100
predRMSE 5.17 11.74 2.55 3.39 2.67
totalMSE 3.63 6.06 2.57 2.98 2.66
betaRMSE 0.33 0.11 0.02 0.16 0.04
n = 70
predRMSE 9.48 15.69 2.57 3.74 2.76
totalMSE 4.89 6.95 2.59 3.14 2.73
betaRMSE 0.53 0.13 0.03 0.16 0.05
n = 57
predRMSE 26.99 20.20 2.68 3.93 2.87
totalMSE 7.92 7.84 2.70 3.25 2.81
betaRMSE 0.93 0.15 0.07 0.16 0.06
n = 50
predRMSE 21.12 4.16 2.94
totalMSE 8.02 3.36 2.86
betaRMSE 0.15 0.16 0.06
n = 25
predRMSE 36.93 6.74 4.37
totalMSE 10.54 4.42 3.66
betaRMSE 0.21 0.17 0.10

two-block method does relatively better in this case because there is substan-
tial response reduction possible, while X-PLS does relatively worse because
it neglects response reduction. We again conclude that with sufficiently large
sample size, XY -ENV performs the best, while XY -PLS is serviceable overall.
TABLE 5.4
Error measures with a = 50, b = 0.01, p = 50, r = 4, q = 40, and u = 3.

n = 1000 OLS Two Block XY -ENV X-PLS XY -PLS


predRMSE 2.63 20.86 2.50 2.63 2.51
totalMSE 2.60 8.15 2.54 2.60 2.54
betaRMSE 0.04 0.15 0.00 0.04 0.00
n = 200
predRMSE 3.40 89.17 2.53 3.40 2.57
totalMSE 2.95 16.49 2.55 2.95 2.57
betaRMSE 0.10 0.32 0.01 0.10 0.01
n = 100
predRMSE 5.08 148.64 2.54 5.03 2.65
totalMSE 3.61 21.26 2.56 3.60 2.64
betaRMSE 0.16 0.42 0.01 0.16 0.03
n = 70
predRMSE 9.26 185.52 2.58 8.11 2.83
totalMSE 4.83 23.72 2.60 4.59 2.79
betaRMSE 0.27 0.47 0.02 0.23 0.05
n = 57
predRMSE 27.70 212.60 2.75 13.04 3.35
totalMSE 8.10 25.42 2.73 5.86 3.16
betaRMSE 0.49 0.50 0.04 0.26 0.08
n = 50
predRMSE 229.25 22.42 4.87
totalMSE 26.41 7.75 3.87
betaRMSE 0.52 0.29 0.10
n = 25
predRMSE 317.80
totalMSE 31.08
betaRMSE 0.61

5.5.3 Varying the variance of the material components via a


In Table 5.6, all simulation parameters were held fixed, except for a which was
varied. The XY -PLS method is judged to be the best over the table, except
at a = 0.5. The performance of the two-block method improves as a decreases
and is the best at a = 0.5.
TABLE 5.5
Error measures with a = 50, b = 0.01, p = 30, r = 50, q = 20, and u = 2.

OLS Two Block XY -ENV X-PLS XY -PLS


n = 1000
predRMSE 9.91 9.89 9.62 9.81 9.64
totalMSE 22.15 22.12 21.82 22.04 21.84
betaRMSE 0.41 0.08 0.00 0.08 0.03
n = 200
predRMSE 11.39 10.99 9.66 10.76 9.76
totalMSE 23.73 23.29 21.86 23.07 21.98
betaRMSE 0.98 0.19 0.01 0.19 0.06
n = 100
predRMSE 13.97 12.19 9.72 12.18 9.95
totalMSE 26.26 24.45 21.93 24.53 22.18
betaRMSE 1.54 0.25 0.03 0.29 0.09
n = 85
predRMSE 15.25 12.63 9.84 12.83 10.01
totalMSE 27.42 24.85 22.06 25.17 22.25
betaRMSE 1.74 0.27 0.06 0.32 0.10
n = 70
predRMSE 17.46 13.06 13.79 10.09
totalMSE 29.31 25.24 26.08 22.34
betaRMSE 2.05 0.29 0.36 0.11
n = 50
predRMSE 26.08 14.27 16.66 10.33
totalMSE 35.61 26.29 28.61 22.60
betaRMSE 2.95 0.34 0.47 0.13
n = 25
predRMSE 16.99 44.08 11.68
totalMSE 28.46 45.61 24.00
betaRMSE 0.42 1.20 0.25

The settings for Table 5.7 are the same as those for Table 5.6 except
b = 0.1. Now, XY -PLS does the best for a ≥ 50 and XY -ENV is the best for
a = 5, 0.5. As we have seen in other tables, the relative sizes of the material
and immaterial variation matters.
TABLE 5.6
Error measures with n = 57, b = 5, p = 50, r = 4, q = 40, and u = 3.

a = 1000 OLS Two Block XY -ENV X-PLS XY -PLS


predRMSE 69.72 4203.06 44.55 23.18 17.06
totalMSE 14.98 112.51 11.99 8.91 7.67
betaRMSE 0.88 0.50 0.67 0.04 0.03
a = 500
predRMSE 69.58 2104.98 44.53 23.59 17.17
totalMSE 14.97 79.66 11.99 8.99 7.70
betaRMSE 0.88 0.50 0.68 0.07 0.05
a = 50
predRMSE 68.54 216.87 45.75 28.69 19.81
totalMSE 14.88 25.75 12.18 9.87 8.24
betaRMSE 0.91 0.50 0.70 0.35 0.25
a=5
predRMSE 68.64 28.26 48.34 41.53 26.98
totalMSE 14.90 9.62 12.52 11.73 9.52
betaRMSE 1.18 0.52 0.96 0.87 0.66
a = 0.5
predRMSE 69.56 10.69 53.46 46.60 37.95
totalMSE 14.96 6.11 12.99 12.43 11.12
betaRMSE 2.66 0.70 2.25 2.15 1.87

5.5.4 Varying sample size with non-diagonal covariance


matrices
In Table 5.3 the matrices Ω, Ω0 , ∆, and ∆0 were all generated as constants
times an identity matrix. For contrast, Table 5.8 was as Table 5.3 but the
matrices were generated as M M T where M had the same size as the corre-
sponding covariance matrix and was filled with independent realizations from
a uniform (0, 1) random variable. The conclusions are qualitatively similar to
those from Table 5.3, with the important exception that the relatively good
performance of XY-ENV at small sample sizes is not apparent. At n = 57 in
Table 5.3, the predRMSE is 2.68 for XY-ENV and 2.87 for XY-PLS,
while in Table 5.8 it is 15.46 for XY -ENV and 2.57 for XY -PLS. The table
sustains our general conclusion that XY -PLS is overall the best from among
those considered.
TABLE 5.7
Error measures with n = 57, b = 0.1, p = 50, r = 4, q = 40, u = 3.

a = 1000 OLS Two Block XY -ENV X-PLS XY -PLS


predRMSE 28.47 4199.55 4.04 9.18 3.04
totalMSE 8.40 112.45 3.50 4.94 2.97
betaRMSE 0.49 0.50 0.10 0.03 0.02
a = 500
predRMSE 28.46 2101.22 4.19 9.51 3.08
totalMSE 8.40 79.56 3.55 5.03 3.01
betaRMSE 0.49 0.50 0.10 0.05 0.03
a = 50
predRMSE 28.38 212.71 3.99 13.29 3.60
totalMSE 8.39 25.43 3.51 5.99 3.34
betaRMSE 0.51 0.50 0.11 0.26 0.08
a=5
predRMSE 25.27 24.00 3.68 19.19 4.36
totalMSE 7.99 8.75 3.40 7.08 3.70
betaRMSE 0.64 0.51 0.15 0.54 0.18
a = 0.5
predRMSE 25.96 6.33 3.65 22.57 22.45
totalMSE 8.06 4.58 3.39 7.54 7.60
betaRMSE 1.42 0.66 0.33 1.32 1.33

5.5.5 Conclusions from the simulations


Setting a = 50 and b = 0.01 in Tables 5.3–5.5 was intended to represent
regressions in which predictor and response compressions have a reasonable
chance of reducing prediction variation. Changing these values enough may
certainly change the results materially. For instance, we know from Corol-
lary 2.1 that if ΣX = σx2 Ip and β ∈ Rp×r has rank r then the estimator of β
based on predictor envelopes is asymptotically equivalent to the OLS estima-
tor. Nevertheless, we conclude from Tables 5.3–5.5 that, in regressions where
response and predictor envelopes are appropriate, joint envelope reduction via
XY -ENV and joint PLS reduction via XY -PLS are the clear winners, even
doing better than OLS when n is large.
In Tables 5.6 and 5.7 we held the sample size fixed at n = 57, varied b be-
tween the tables and varied the variance of the material predictors by varying
TABLE 5.8
Settings p = 50, r = 4, q = 10, u = 3 with Ω, Ω0 , ∆ and ∆0 generated as M M T
where M had the same size as the corresponding covariances and was filled
with uniform (0, 1) observations.

n = 1000 OLS Two Block XY -ENV X-PLS XY -PLS


predRMSE 1.24 1.71 1.28 1.26 1.27
totalMSE 1.98 2.37 2.01 2.00 2.01
betaRMSE 0.33 0.19 0.10 0.10 0.10
n = 200
predRMSE 1.59 3.28 1.57 1.55 1.56
totalMSE 2.24 3.06 2.22 2.23 2.24
betaRMSE 0.77 0.23 0.19 0.14 0.14
n = 100
predRMSE 2.42 4.46 2.11 1.92 1.92
totalMSE 2.76 3.48 2.57 2.49 2.49
betaRMSE 1.34 0.24 0.38 0.17 0.17
n = 70
predRMSE 4.55 8.46 3.66 2.27 2.24
totalMSE 3.76 4.45 3.31 2.71 2.69
betaRMSE 2.21 0.27 1.30 0.19 0.20
n = 57
predRMSE 13.96 9.48 15.46 2.61 2.57
totalMSE 6.38 4.75 6.55 2.89 2.87
betaRMSE 4.34 0.28 4.59 0.21 0.21
n = 50
predRMSE 10.62 2.77 2.74
totalMSE 5.02 2.98 2.96
betaRMSE 0.28 0.22 0.22
n = 25
predRMSE 17.77 4.27 4.20
totalMSE 6.44 3.66 3.62
betaRMSE 0.30 0.25 0.25

a within the tables. At b = 5 in Table 5.6, the XY -PLS method performed


the best overall. The only exception was that the two-block method did better
at a = 0.5. At b = 0.1 in Table 5.7, the XY -PLS method also performed well,
except at a = 0.5 when the XY -ENV method performed the best, with the
two-block method not far behind.

In Tables 5.3–5.7 the various covariance matrices were set to constants


times an identity matrix so their relative magnitudes could be controlled eas-
ily. In Table 5.8 the covariance matrices were not diagonal but had the form
M M T , as described previously. Still, the XY -PLS method did best overall.
From this and other simulations not reported, we concluded that the XY -
PLS method is preferable at small sample sizes. When n is ‘large’ XY -PLS and
XY -ENV are both serviceable, with XY -PLS being computationally easier to
implement.
Rimal et al. (2019) adapted X-ENV to small n regressions by replacing
X with its principal components sufficient to absorb 97.5% of the variation.
While they reported good empirical results, to our knowledge this method has
not been studied for XY compressions.

5.5.6 Three illustrative data analyses


The slump flow of concrete depends on its ingredients: cement, fly ash, slag,
water, super plasticizer, coarse aggregate, and fine aggregate. These ingredi-
ents, which were measured in kilograms per cubic meter of concrete, comprise
the p = 7 predictors for this regression. The r = 3 response variables were
slump, flow, and 28-day compressive strength. The data set (Yeh, 2007) con-
tains 103 multivariate observations, which are comprised of 78 original records
and 25 new records that were added later. Cook et al. (2023b) used the first
78 records as a training set and the new 25 records as the testing set. Predic-
tions were evaluated against the testing set using predRMSE. The dimensions
were determined via 10-fold cross validation for the two-block, XY -PLS and
X-PLS. BIC was used to determine the dimensions for XY -ENV. For com-
pleteness, we also included the PLS1 method (see Section 3.10) with dimension
determined by 10-fold cross validation. The results shown in Table 5.9 con-
form to the general conclusions given at the outset of Section 5.5, although
there is no advantage to compressing the response in addition to the predictors.
As a second illustration we report the results from a study by Skagerberg
et al. (1992) of a low density tubular polyethylene reactor. There are n = 56
multivariate observations, each with p = 22 predictors and r = 6 responses.
The predictor variables consist of 20 temperatures measured at equal distances
along the reactor together with the wall temperature of the reactor and the
feed rate. The responses are the output characteristics of the polymers pro-
duced. Because the distributions of the values of all the response variables are
TABLE 5.9
Prediction errors predRMSE for three datasets

Dataset OLS Two Block XY -ENV X-PLS XY -PLS X-PLS1


Concrete 25.02 19.58 13.29 13.17 13.17 13.22
Reactor 2.22 1.14 1.88 0.94 0.93 0.95
Biscuit NA 2.13 NA 2.07 1.29 2.26

highly skewed to the right, the responses were transformed to the log scale
and then standardized to have unit variance. Dimensions were determined as
described in the concrete example. The prediction errors predRMSE were de-
termined using the leave-one-out method to compare with their results. The
results shown in Table 5.9 suggest that for these data there is little advantage
in compressing the responses although q = 3.
Our third illustration comes from a near-infrared spectroscopy study on
the composition of biscuit dough (Osborne et al., 1984). The original data
has n = 39 samples in a training dataset and 31 samples in a testing set.
The two sets were created on separate occasions and are not the result of a
random split of a larger dataset. Each sample consists of r = 4 responses,
the percentages of fat, sucrose, flour and water, and spectral readings at 700
wavelengths. Cook and Zhang (2015b) used a subset of these data to illustrate
application of XY -ENV. They constructed the data subset by reducing the
spectral range to restrict the number of predictors to 20 from a potential of
700. This allowed them to avoid ‘n < p’ issues and again dimension was chosen
by cross-validation. In a separate study, Li, Cook, and Tsai (2007) reasoned
that the leading and trailing wavelengths contain little information, which
motivated them to use middle wavelengths, ending with p = 64 predictors.
Using the subset of wavelengths constructed by Li et al. (2007), Cook et al.
(2023b) applied the four methods that do not require n > p, which gave the
results in the last row of Table 5.9. We see that XY -PLS again performed
the best. Since its performance was better than that of X-PLS, we see an
advantage to compressing the responses.
6

Partial PLS and Partial Envelopes

It can happen when striving to reduce the predictors in the regression of


Y ∈ Rr on X ∈ Rp that it is desirable and proper to shield part of X from
the reduction. Partition X into two sub-vectors X1 ∈ Rp1 and X2 ∈ Rp2 ,
p = p1 + p2 , where we wish to compress X1 while leaving X2 untouched by
the dimension reduction. How do we compress X1 while correctly adjusting for
X2 ? This general class of problems is referred to as partial dimension reduction
(Chiaromonte, Cook, and Li, 2002).
Skagerberg, MacGregor, and Kiparissides (1992) collected n = 56 obser-
vations to study the polymerization reaction along a reactor. The r = 6 re-
sponse variables were polymer properties: number-average molecular weight,
weight-average molecular weight, frequency of long chain branching, frequency
of short chain branching, the content of vinyl groups and vinylidene groups
in the polymer chain. The predictors were twenty temperatures measured at
equal distances along the reactor plus the wall temperature of the reactor and
the solvent feed rate. The last two predictors measure different characteristics
than the first 20 predictors and it may be desirable to exclude them from the
dimension reduction, aiming to find a low dimensional index that characterizes
temperature along the reactor. Instead or in addition, it may be relevant to
compress the 20 temperature measurements along the reactor in the hope of
improving estimation of the effects of wall temperature and the solvent feed
rate. In either case, the p = 22 predictors would then be partitioned into
the p1 = 20 temperatures along the reactor X1 and the p2 = 2 remaining
predictors X2 .
In this chapter we discuss foundations of partial envelopes, leading to PLS
and maximum likelihood methodology for compressing the response or the

predictors X1 in the partitioned linear model

    Y = α + βᵀX + ε
      = α + β1ᵀX1 + β2ᵀX2 + ε,                                             (6.1)

where Y ∈ Rr, X = (X1ᵀ, X2ᵀ)ᵀ, β = (β1ᵀ, β2ᵀ)ᵀ, and the errors have mean 0,
variance-covariance matrix ΣY|X, and are independent of the predictors. We
assume throughout this chapter that
p2 ≪ min(n, p1). There are several contexts in which partial reduction may
be useful, in addition to the kinds of application emphasized in the reac-
tion example. For instance, there may be a linear combination Gᵀβ of the
components of β that is of special interest, in which case reducing the dimen-
sion while shielding Gᵀβ from the reduction may improve its estimation (see
Section 6.2.4).

6.1 Partial predictor reduction


Chiaromonte, Cook, and Li (2002) were the first to consider partial sufficient
dimension reduction, defining the partial central subspace S^{(X2)}_{Y|X1} as the in-
tersection of all subspaces S ⊆ R^{p1} that satisfy the conditional independence
statement

    Y ⫫ X1 | PS X1, X2,                                                    (6.2)

which is equivalent to requiring that Y ⫫ X | PS X1, X2. The partial central
subspace does not always exist, as the intersection of all subspaces that satisfy
(6.2) does not necessarily satisfy (6.2). But the intersection does satisfy (6.2)
under quite mild conditions and so we always assume the existence of S^{(X2)}_{Y|X1}.
Chiaromonte et al. (2002) developed partial sliced inverse regression (SIR)
for estimating S^{(X2)}_{Y|X1} when X2 is categorical, serving to identify a number of
subpopulations. They motivated their approach in part by the regression of
lean body mass on logarithms of height, weight and three hematological vari-
ables X1 plus an indicator variable X2 for gender based on a sample of 202
athletes from the Australian Institute of Sport. They concluded that for this
regression S^{(X2)}_{Y|X1} has dimension 1 and thus that a single linear combination
of the hematological variables, called the first partial SIR predictor, is suffi-
cient to characterize the regression of lean body mass on (X1 , X2 ). Shown in
Figure 6.1 is a summary plot, marked by gender, of lean body mass versus
the first partial SIR predictor. It can be inferred from their analysis that lean
body mass for males and females responds to the same linear combination
of the hematological variables, but also that the gender-specific regressions
differ. Dimension reduction in this example serves to improve prediction and
the estimation of the gender effect. Bura, Duarte, and Forzani (2016) devel-
oped a method for estimating the sufficient reductions in regressions with
exponential family predictors. Applying their method to the data from the
Australian Institute of Sport, they inferred that SY |X has dimension 1. They
gave a summary plot that is essentially identical to that shown in Figure 6.1.

FIGURE 6.1
Plot of lean body mass versus the first partial SIR predictor based on data
from the Australian Institute of Sport. circles: males; exes: females.

Shao, Cook, and Weisberg (2009) adapted sliced average variance esti-
mation for estimating the partial central subspace. Wen and Cook (2007) ex-
tended the minimum discrepancy approach to SDR developed by Cook and Ni
(2005) to allow the partial central subspace to be estimated. In this chapter we
describe PLS and maximum likelihood methods for compressing X1 while leav-
ing X2 unchanged in the context of linear models. Park, Su, and Chung (2022)
recently applied partial PLS methods to the study of cytokine-based biomarker
analysis for COVID-19. We will link to their work as this chapter progresses.

6.2 Partial predictor envelopes


6.2.1 Synopsis of PLS for compressing X1
As developed in this section, to use PLS to compress the predictors X1 in linear
model (6.1), follow these steps. We will use numbers 1 and 2 as substitutes for
X1 and X2 in subscripts, unless a fuller notation seems necessary for clarity.

Get residuals: Construct the residuals R̂_{Y|2,i} and R̂_{1|2,i}, i = 1, . . . , n, from
the OLS fits of the multivariate linear regressions of Y on X2 and X1 on X2.
Because we are confining attention to regressions in which p2 ≪ min(p1, n),
there should be no dimension or sample size issues in these regressions.

Set components: Select the number of components q1 ≤ p1 .

Run PLS: Run the dimension reduction arm of a PLS algorithm with di-
mension q1, response R̂_{Y|2,i} and predictor vector R̂_{1|2,i}, and get the output
matrix Ŵ ∈ R^{p1×q1} of weights.

Compress: X1 ↦ ŴᵀX1. This gives the compressed version of X1.

Predict: Fit the linear regression of Y on ŴᵀX1 and X2 to get a prediction
rule.

This algorithm can be used in conjunction with cross validation and a holdout
sample to select the number of components and estimate the prediction error
of the final estimated model. It is also possible to use maximum likelihood to
estimate a partial reduction, as described in Section 6.2.3.
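A compact Python sketch of these steps follows. It is our own illustration rather than the book's companion R/Python code; the number of components q1 would ordinarily be chosen by cross validation as just described.

import numpy as np

def _center(A):
    return A - A.mean(axis=0)

def _ols_resid(T, X2):
    # residuals from the OLS fit of centered T on centered X2
    X2c, Tc = _center(X2), _center(T)
    B = np.linalg.lstsq(X2c, Tc, rcond=None)[0]
    return Tc - X2c @ B

def partial_pls(Y, X1, X2, q1):
    """Compress X1 by q1 NIPALS components computed from the residual
    regression of R_{Y|2} on R_{1|2}, then refit Y on (W'X1, X2) by OLS."""
    Y = Y.reshape(-1, 1) if Y.ndim == 1 else Y
    RY, R1 = _ols_resid(Y, X2), _ols_resid(X1, X2)
    n = R1.shape[0]
    Zd, weights = R1.copy(), []
    for _ in range(q1):                      # dimension-reduction arm of NIPALS
        S = Zd.T @ RY / n
        w = np.linalg.eigh(S @ S.T)[1][:, -1]
        s = Zd @ w
        Zd = Zd - np.outer(s, Zd.T @ s / (s @ s))
        weights.append(w)
    W = np.column_stack(weights)             # p1 x q1 weight matrix
    Z = _center(np.column_stack([X1 @ W, X2]))
    coef = np.linalg.lstsq(Z, _center(Y), rcond=None)[0]
    return W, coef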

6.2.2 Derivation
Our pursuit of methods for partial predictor reduction proceeds via partial
predictor envelopes which are introduced by adding a second conditional inde-
pendence condition to (6.2) and then following the reasoning that led to (2.2):

    (a) Y ⫫ X1 | PS X1, X2   and   (b) PS X1 ⫫ QS X1 | X2.                 (6.3)

The rationale here is the same as that leading to (2.2): we use condition (b)
to induce a measure of clarity in the separation of X1 into its material and
immaterial parts. Conditions (a) and (b) hold if and only if (Cook, 1998,
Proposition 4.6)
    (c) (Y, PS X1) ⫫ QS X1 | X2.                                           (6.4)

Consequently (Y, PS X1 ) is required to be independent of QS X1 at every value


of X2 . The conditional independence condition (b) is hard to meet in prac-
tice so we follow the logic leading to (2.3) and replace it with a conditional
covariance condition,

cov(PS X1 , QS X1 | X2 ) = PS var(X1 | X2 )QS = 0.

This relaxes independence, requiring only that there be no linear association


between PS X and QS X at any value of X2 , and leads to the new pair of
conditions

    (a) Y ⫫ X1 | PS X1, X2   and   (b) PS var(X1 | X2)QS = 0.              (6.5)

Replacing condition (a) with an operational version, we get the conditions
that Park et al. (2022, eq. (8)) used to motivate their study:

    (i) cov(Y, QS X1 | PS X1, X2) = 0   and   (ii) PS var(X1 | X2)QS = 0.

We can now define a partial envelope based on (6.5):

Definition 6.1. The partial predictor envelope E^{(X2)}_{var(X1|X2)}(S^{(X2)}_{Y|X1}) for com-
pressing X1 in the regression of Y on (X1, X2) is the intersection of all sub-
spaces that contain the partial central subspace S^{(X2)}_{Y|X1} and reduce var(X1 | X2)
for all values of X2. Let q1 denote the dimension of the partial envelope.
for all values of X2 . Let q1 denote the dimension of the partial envelope.

This serves as a conceptual starting point for our discussion. We now


describe additional structure that facilitates our pursuit of methodology. To
facilitate the development, we restrict the rest of our discussion of partial

predictor envelopes to regressions with a real response Y ∈ R1 , but much


of the development can be extended rather straightforwardly to multivari-
ate responses. We return to multivariate responses when considering partial
response envelopes in Section 6.4.
Let β_{1|2} ∈ R^{p2×p1} denote the population OLS regression coefficient matrix from
the multivariate linear regression of X1 on X2 and let R_{1|2} = X1 − E(X1) −
β_{1|2}ᵀ(X2 − E(X2)) denote a population residual vector from that same re-
gression. In addition, let β_{Y|2} ∈ R^{p2} denote the population OLS regres-
sion coefficient vector from the linear regression of Y on X2 and let R_{Y|2} =
Y − E(Y) − β_{Y|2}ᵀ(X2 − E(X2)) denote a population residual from that same
regression. Then conditions (6.5) are equivalent to the same conditions ex-
pressed in terms of the residuals R_{1|2} and R_{Y|2},

    (a) R_{Y|2} ⫫ R_{1|2} | PS R_{1|2}, X2   and   (b) PS var(R_{1|2} | X2)QS = 0,   (6.6)

and, as a consequence, the partial envelopes are identical:

    E^{(X2)}_{var(R_{1|2}|X2)}(S^{(X2)}_{R_{Y|2}|R_{1|2}}) = E^{(X2)}_{var(X1|X2)}(S^{(X2)}_{Y|X1}).

A brief justification for (6.6) is given in Appendix A.6.1.


The replacement of the conditional independence condition (6.3b) with the
conditional covariance condition (6.5b) is a minor structural adaptation. We
now make an assumption that may be suitable for linear models but may not
be so in more general contexts. By construction cov((RY |2 , R1|2 ), X2 ) = 0. We
take this a step further and assume that

A1: (RY|2 , R1|2 ) ⊥⊥ X2 .

This condition plays a central role in dimension reduction generally. For in-
stance, Cook (1998, Section 13.3) used it in the development of Global Net
Effects Plots. With assumption A1, conditions (6.6) reduce immediately to

(a) RY|2 ⊥⊥ R1|2 | PS R1|2 and (b) PS var(R1|2 )QS = 0, (6.7)

and the partial envelope reduces to an ordinary envelope,


   
E^{(X2)}_{var(R1|2|X2)}(S^{(X2)}_{RY|2|R1|2}) = E_{var(R1|2)}(S_{RY|2|R1|2}).

Similar constructions involving just the central subspace were given by Cook
(1998, Chapter 7).

These conditions are identical to those given in (2.3) with Y and X replaced
by RY |2 and R1|2 . Thus we can learn how to compress X1 by applying PLS to
the regression of RY |2 on R1|2 , except that these variables are not observed and
must be estimated. Recall we are requiring p2 to be small relative to n and p1 :

A2: p2 ≪ min(n, p1 ).

With this condition, the estimators R̂Y|2 and R̂1|2 of RY|2 and R1|2 can be
constructed straightforwardly by using OLS fits. Let BY|2 and B1|2 denote
the coefficient vector and matrix from the OLS fits of Y and X1 on X2 . Then,
for i = 1, . . . , n,

R̂Y|2,i = Yi − Ȳ − BY|2^T (X2i − X̄2 )
R̂1|2,i = X1i − X̄1 − B1|2^T (X2i − X̄2 ).

We now perform PLS dimension reduction with input data (R̂Y|2,i , R̂1|2,i ),
i = 1, . . . , n, or equivalently with the sample covariance matrices S1|2 and
SR1|2,RY|2 constructed from R̂1|2,i and R̂Y|2,i ,

S1|2 = S11 − S12 S22^{-1} S21
SR1|2,RY|2 = SR1|2,Y .

Writing

SR1|2,Y = n^{-1} Σ_{i=1}^n [X1i − X̄1 − B1|2^T (X2i − X̄2 )]Yi^T
        = n^{-1} Σ_{i=1}^n [X1i − X̄1 − S1,2 S22^{-1} (X2i − X̄2 )]Yi^T
        = S1,Y − S1,2 S22^{-1} S2,Y ,

we have

SR1|2,RY|2 = S1,Y − S1,2 S22^{-1} S2,Y .

This then leads directly to the algorithm given by the synopsis of Section 6.2.1.
Since the sample covariance matrix between R̂1|2 and R̂Y|2 is the same as that
between R̂1|2 and Y , the algorithm could be run with Y in place of R̂Y|2 . In
this case the dimension reduction arms of the SIMPLS and NIPALS algo-
rithms are instances of Algorithms S and N with A = SR1|2,RY|2 SR1|2,RY|2^T and
M = S1|2 . This version of Algorithm S was used by Park et al. (2022) as a
partial PLS algorithm.
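
To make the synopsis concrete, the following Python sketch implements the residual-based construction above under stated assumptions: Y, X1 and X2 are numpy arrays with n rows, q1 is fixed in advance, and scikit-learn's PLSRegression is used only as a convenient stand-in for the dimension-reduction arm of a NIPALS-type algorithm. The function name partial_pls and its return values are illustrative, not part of any package.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def partial_pls(Y, X1, X2, q1):
    # Illustrative sketch of the partial PLS synopsis of Section 6.2.1.
    n = len(Y)
    Y = np.asarray(Y, dtype=float).reshape(n, -1)
    X2c = np.column_stack([np.ones(n), X2])                      # intercept plus X2
    R_Y2 = Y - X2c @ np.linalg.lstsq(X2c, Y, rcond=None)[0]      # residuals R_hat_{Y|2}
    R_12 = X1 - X2c @ np.linalg.lstsq(X2c, X1, rcond=None)[0]    # residuals R_hat_{1|2}
    # Dimension-reduction arm: weights from a PLS fit of R_Y2 on R_12.
    W = PLSRegression(n_components=q1, scale=False).fit(R_12, R_Y2).x_weights_
    # OLS of R_Y2 on the compressed predictors R_12 W, mapped back to the X1 scale.
    eta = np.linalg.lstsq(R_12 @ W, R_Y2, rcond=None)[0]
    beta1 = W @ eta                                              # estimate of beta_1
    # Coefficients of X2 from the OLS fit of Y - X1 beta_1 on X2.
    b = np.linalg.lstsq(X2c, Y - X1 @ beta1, rcond=None)[0]
    return beta1, b[1:], b[0]                                    # beta_1, beta_2, intercept
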

The algorithm of Section 6.2.1 can also be deduced by proceeding infor-
mally based on the partitioned linear model (6.1). Let X̂1|2,i = X̄1 + B1|2^T (X2i −
X̄2 ) denote the i-th fitted value from the OLS fit of the linear regression of
X1 on X2 . Then write the partitioned linear model as

Y = α + β1^T (X1 − X̂1|2 ) + β1^T X̂1|2 + β2^T X2 + ε
  = α* + β1^T R̂1|2 + β2*^T X2 + ε,

where β2* = β2 + B1|2 β1 and constants have been absorbed by α*. Since R̂1|2 and
X2 are uncorrelated in the sample, β̂1 can be obtained from the regression of
Y on R̂1|2 and β̂2* from the regression of Y on X2 , which leads back to the al-
gorithm of Section 6.2.1. The OLS estimator B1 of β1 can then be represented
as the coefficient vector from the OLS fit of Yi on R̂1|2,i :

B1 = SR1|2^{-1} SR1|2,Y = S1|2^{-1} SR1|2,Y . (6.8)

6.2.3 Likelihood-based partial predictor envelopes


In this section, continuing to restrict discussion to regressions with a real re-
sponse, we consider likelihood-based partial reduction under model (6.1) as a
large-sample counterpart to partial PLS summarized in Section 6.2.1. Under
that model, since Y ⊥⊥ X1 | (β1^T X1 , X2 ), the partial central subspace is

S^{(X2)}_{Y|X1} = span(β1 ) := B1

and consequently the partial envelope is

E^{(X2)}_{var(X1|X2)}(S^{(X2)}_{Y|X1}) = E^{(X2)}_{var(X1|X2)}(B1 ).

Since the predictors are not ancillary in this treatment, the likelihood
should be based on the joint distribution of (Y, X1 , X2 ). Without loss of gen-
erality we assume that all variables are centered and so have marginal means
of zero. We assume also that in model (6.1)

X2 ∼ N (0, Σ2 )
X1 | X2 ∼ N (β1|2^T X2 , Σ1|2 )
Y | (X1 , X2 ) ∼ N (β1^T X1 + β2^T X2 , σ^2_{Y|X} ).

Cast in terms of the parameters of this model, the partial envelope is

E^{(X2)}_{var(X1|X2)}(B1 ) = E^{(X2)}_{Σ1|2}(B1 ),
with q1 now denoting the dimension of E^{(X2)}_{Σ1|2}(B1 ). Let Φ ∈ R^{p1×q1} be a semi-
orthogonal basis matrix for it and let (Φ, Φ0 ) be an orthogonal matrix. Parallel
to the development in Section 2.2, we then have representations

β1 = Φη
Σ1|2 = Φ∆ΦT + Φ0 ∆0 ΦT0 .

With this structure we can write our normal model in terms of the envelope
basis,

X2 ∼ N (0, Σ2 )
X1 | X2 ∼ N (β1|2^T X2 , Φ∆Φ^T + Φ0 ∆0 Φ0^T )
Y | (X1 , X2 ) ∼ N (η^T Φ^T X1 + β2^T X2 , σ^2_{Y|X} ).

From here the log likelihood is straightforwardly


log L = −(n/2) log |Σ2 | − (1/2) Σ_{i=1}^n X2i^T Σ2^{-1} X2i − (n/2) log |∆| − (n/2) log |∆0 |
        − (1/2) Σ_{i=1}^n (X1i − β1|2^T X2i )^T (Φ∆^{-1}Φ^T + Φ0 ∆0^{-1}Φ0^T )(X1i − β1|2^T X2i )
        − (n/2) log σ^2_{Y|X} − (1/(2σ^2_{Y|X})) Σ_{i=1}^n (Yi − η^T Φ^T X1i − β2^T X2i )^2 .

Details of how to maximize this log likelihood function are available in Ap-
pendix A.6.2. Here we report only the final estimators.
After maximizing the log likelihood over all parameters except Φ we find
that the MLE of the partial envelope is
Ê^{(X2)}_{Σ1|2}(B1 ) = span{arg min_G (log |G^T S1|Y,2 G| + log |G^T S1|2^{-1} G|)}, (6.9)

where the minimization is over all p1 × q1 semi-orthogonal matrices G, S1|Y,2


is the sample covariance matrix of the residuals from the multivariate linear
regression of X1 on Y and X2 , and S1|2 is the sample covariance matrix of
the residuals from the multivariate linear regression of X1 on X2 . Let Φ̂ be
any semi-orthogonal basis matrix for this estimator and let (Φ̂, Φ̂0 ) be an
orthogonal matrix. Then the compressed version of X1 is Φ̂^T X1 . The MLEs of the location

parameters are

η̂ = (Φ̂^T S1|2 Φ̂)^{-1} Φ̂^T SR1|2,Y
β̂1 = Φ̂η̂
   = PΦ̂(S1|2) B1
β̂2 = S2^{-1}[S2,Y − S2,1 Φ̂η̂]
   = S2^{-1}[S2,Y − S2,1 β̂1 ],

where B1 was defined at (6.8). In particular, we see that the MLE of β1 is the
projection of B1 onto span(Φ̂) in the S1|2 inner product. The MLEs of the
scale parameters are

Σ̂2 = S2 ,
∆̂ = Φ̂^T S1|2 Φ̂,
∆̂0 = Φ̂0^T S1|2 Φ̂0 ,
Σ̂1|2 = PΦ̂ S1|2 PΦ̂ + QΦ̂ S1|2 QΦ̂ ,
σ̂^2_{Y|X} = SY|2 − SR1|2,Y^T Φ̂[Φ̂^T S1|2 Φ̂]^{-1} Φ̂^T SR1|2,Y .

Our envelope estimator (6.9) is the same as that found by Park, Su, and
Chung (2022, eq. (11)), who also gave asymptotic distributions of β̂1 and
β̂2 in their Proposition 2. Park et al. (2022) named the subsequent estimators
envelope-based partial least squares estimators (EPPLS). We see this as a mis-
nomer. Historically and throughout this book, partial least squares has been
associated with specific algorithms, which we have generalized to Algorithms N
and S. Only relatively recently was it found that the dimension reduction arm
of PLS estimates an envelope (Cook, Forzani, and Rothman, 2013). We think
communication will be clearer by continuing the historical practice of linking
PLS with a specific class of algorithms as described in Chapter 3, particularly
since the algorithms are serviceable in n < p regressions, while EPPLS is not.
When n < p1 the covariance matrices S1|Y,2 and S1|2 in (6.9) are singular.
Park et al. (2022) suggested replacing these matrices with their sparse
permutation invariant covariance estimators (SPICE, Rothman, Bickel, Lev-
ina, and Zhu, 2008), but this option does not seem to have been studied in
detail.
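
To fix ideas, the objective function in (6.9) is easy to write down once S1|Y,2 and S1|2 are in hand. The Python sketch below is illustrative only and the function names are ours: it evaluates the criterion at a candidate semi-orthogonal basis G and, as a crude stand-in for the optimization described in Appendix A.6.2, searches over bases formed from subsets of eigenvectors of S1|2, which is feasible only for small p1.

import numpy as np
from itertools import combinations

def envelope_objective(G, S1_Y2, S1_2):
    # Criterion of (6.9): log|G' S_{1|Y,2} G| + log|G' S_{1|2}^{-1} G|.
    _, ld1 = np.linalg.slogdet(G.T @ S1_Y2 @ G)
    _, ld2 = np.linalg.slogdet(G.T @ np.linalg.inv(S1_2) @ G)
    return ld1 + ld2

def crude_envelope_basis(S1_Y2, S1_2, q1):
    # Exhaustive search over q1-subsets of eigenvectors of S1|2 (small p1 only);
    # a serious implementation would minimize over the Stiefel manifold instead.
    _, vecs = np.linalg.eigh(S1_2)
    best_val, best_G = np.inf, None
    for idx in combinations(range(S1_2.shape[0]), q1):
        G = vecs[:, list(idx)]
        val = envelope_objective(G, S1_Y2, S1_2)
        if val < best_val:
            best_val, best_G = val, G
    return best_G
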

6.2.4 Adaptations of partial predictor envelopes


It often happens when studying the regression of a real response Y on X via
the linear model Y = α + β T X + ε that there is a linear combination GTβ of
the components of β that is of special interest, where the p × 1 vector G is
known. The usual estimators of GTβ are the OLS estimator GTB, the MLE en-
velope estimator GT βb and the corresponding PLS envelope estimator GT βbpls .
However, it is possible that a partial envelope or a partial PLS estimator will
outperform these possibilities.
To develop a partial estimator of G^T β, let G0 ∈ R^{p×(p−1)} be a semi-
orthogonal basis matrix for span⊥ (G). Then

β^T X = β^T QG X + β^T PG X
     = β^T G0 G0^T X + (β^T G)G^T X/||G||^2
     = φ1^T Z1 + φ2 Z2 ,

where φ2 = β^T G is the new parameter of interest, φ1 = G0^T β are nuisance
parameters for the purpose of inferring about φ2 , Z2 = G^T X/||G||^2 and
Z1 = G0^T X are the observable transformed predictors that go with the nui-
sance parameters. The model in terms of the new parameters and predictors is

Y = α + φT1 Z1 + φ2 Z2 + ε. (6.10)

In this model we wish to reduce the dimension of Z1 without impacting Z2


to improve the estimation of the parameter of interest φ2 . This falls neatly
within the context of Sections 6.2 and 6.2.3 from which we learned that
 
φ̂2 = SZ2^{-1} (SZ2,Y − SZ2,Z1 φ̂1 ), (6.11)

where φ̂1 can come from either the partial PLS fit or the MLE of the partial
envelope E^{(Z2)}_{var(Z1|Z2)}(span(φ1 )), where

Z1 = G0^T X
Z2 = G^T X/||G||^2
SZ2 = G^T SX G/||G||^4
SZ2,Y = G^T SX,Y /||G||^2
SZ2,Z1 = G^T SX G0 /||G||^2
φ̂2 = ||G||^4 (G^T SX G)^{-1} {G^T SX,Y /||G||^2 − (G^T SX G0 /||G||^2 )φ̂1 }
   = ||G||^2 (G^T SX G)^{-1} G^T {SX,Y − SX G0 φ̂1 }.

Informally, if either estimator reduces the variation in the estimator of φ1 then


we would expect improved estimation of φ2 .
Partial least squares fits are often used for constructing rules to predict Y
at a new value Xnew of X. Partial envelopes fitted by either PLS or maximum
likelihood offer a route to an alternative estimator that has the potential to
outperform the usual predictions. Centering the predictors to facilitate expo-
sition, the population and sample forms of a prediction based on the linear
model Y = α + β T X + ε are

E(Y | Xnew ) = E(Y ) + β^T (Xnew − E(X))
Ŷ = Ê(Y | Xnew )
  = Ȳ + β̃^T (Xnew − X̄),

where β̃ stands for a generic estimator of β. In consequence, we can adapt φ2
for prediction by setting G = Xnew − X̄ and, with this value of G, following the
development that led to (6.10) and (6.11). The prediction is then Ŷ = Ȳ + φ̂2 .
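
A minimal Python sketch of this prediction adaptation follows. It assumes a fitting routine such as the partial PLS sketch given after Section 6.2.2, passed in as partial_fit and returning the coefficient of its first predictor block; the names are illustrative.

import numpy as np
from scipy.linalg import null_space

def partial_prediction(Y, X, x_new, q1, partial_fit):
    # Set G = x_new - Xbar, reduce Z1 = G0'X, leave Z2 = G'X/||G||^2 alone,
    # and predict with Ybar + phi2_hat as in (6.10)-(6.11).
    n = len(Y)
    Xbar = X.mean(axis=0)
    G = (np.asarray(x_new, dtype=float) - Xbar).reshape(-1, 1)
    G0 = null_space(G.T)                                  # basis of the complement of span(G)
    Z1, Z2 = (X - Xbar) @ G0, (X - Xbar) @ G / (G.T @ G).item()
    phi1 = partial_fit(Y, Z1, Z2, q1)[0]                  # coefficient of Z1
    Yc = np.asarray(Y, dtype=float).reshape(n, -1) - np.mean(Y)
    Z1c, Z2c = Z1 - Z1.mean(0), Z2 - Z2.mean(0)
    S_Z2, S_Z2Y, S_Z2Z1 = Z2c.T @ Z2c / n, Z2c.T @ Yc / n, Z2c.T @ Z1c / n
    phi2 = np.linalg.solve(S_Z2, S_Z2Y - S_Z2Z1 @ phi1)   # equation (6.11)
    return (np.mean(Y) + phi2).item()
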

6.2.5 Partial predictor envelopes in the Australian Institute


of Sport
We introduced partial predictor reduction in Section 6.1 by using data from
the Australian Institute of Sport. Using SIR as an implementation of sufficient
dimension reduction, Chiaromonte et al. (2002) concluded that only one linear
combination of the X1 variates is necessary to explain the regression of lean
body mass (LBM) on X1 and gender X2 , with the plot in Figure 6.1 serving
as a summary of the regression of LBM on X1 and X2 .
We applied partial PLS and partial predictor envelope to the same data
and concluded via cross validation that each method gave q1 = 4, indicating
little reduction of X1 . How could partial sufficient reduction via SIR lead to
q1 = 1 component, while partial PLS and partial envelopes lead to q1 = 4
components? Recall that partial sufficient reduction is grounded in condition
(6.3a), while partial PLS and partial envelopes require both conditions (6.3a)
and (6.3b). If the variables in X1 are correlated to such an extent that con-
dition (6.3b) cannot be satisfied with one component then naturally we will
end with q1 > 1 components. We expect that is the case here.
In the next section we give a more detailed application of partial predictor
envelopes.

6.3 Partial predictor envelopes in economic growth


prediction
6.3.1 Background
Governance and institutional quality can play a significant role in a country’s
economic growth as measured by the growth of its per capita gross domes-
tic product (GDP) (Arvin, Pradhan, and Nair, 2021; Emara and Chiu, 2015;
Fawaz et al., 2021; Gani, 2011; Han et al., 2014; Radulović, 2020). The World-
wide Governance Indicators (WGI) from the World Bank are usually used to
measure aspects of governance. Such indicators have been used as covariates
in regression models, either jointly or in separate models due to the high corre-
lations between them (Gani, 2011; Radulović, 2020; Fawaz et al., 2021). Other
authors (e.g. AlShiab, Al-Malkawi, and Lahrech, 2020; Arvin, Pradhan, and
Nair, 2021; Asongu and Nnanna, 2019; Emara and Chiu, 2015; Han, Khan, and
Zhuang, 2014) use the first principal component of the WGI variates
as a composite governance index then, along with other socioeconomic control
variables, estimate a regression model to analyze the impact of this governance
index on economic growth. Here the dimension of the reduction is set to q1 = 1
to facilitate interpretability and then each loading of the principal component
is interpreted as the weight of each governance indicator in the index.
In this application we follow the economist’s approach in the sense that
we apply dimension reduction on the governance variables, but incorporat-
ing two variants: First, extensions that use supervised reduction via partial
least squares and envelopes; and second, in order to improve predictions, we
allow for the possibility that q1 > 1 in both the supervised reduction and
non-supervised reduction via PCA. In addition, we contemplate this in a con-
text of partial reductions, i.e. we only want to reduce the set of variables that
measure governance, leaving the other socioeconomic variables as a control
without reducing their dimension.

6.3.2 Objectives
Given the empirical evidence of governance as a driver of per capita GDP,
here we analyze the power of governance to predict the economic growth.
Therefore, the objective is the prediction of the per capita GDP using

governance indicators. Specifically, we study the impact of governance on eco-


nomic growth in the twelve South American countries as measured by the
percentage change in yearly per capita GDP (Y ) using WGI’s from the World
Bank. This application is novel in several ways:

1. The objective is to obtain a good prediction of economic growth.

2. In particular, this can be extended to problems where you want to predict


a variable that is difficult or expensive to calculate from an indicator that
uses variables that are easier or less expensive to capture.

3. We compare supervised methodologies for construction of composite gov-


ernance indicators in search of better predictive results, including method-
ologies that are serviceable in high-dimensional “n < p” settings.

4. The study of this phenomenon in South American countries has been
little explored.

6.3.3 Model and data


For prediction we consider the linear model

Y = β0 + β1T X1 + β2T X2 + ε, (6.12)

where X1 are p1 governance measures and X2 are p2 economic variables taken


as a control. We have a total of p1 = 24 governance variables and, in view
of the limited sample sizes, expect that reducing these to Composite Gover-
nance Indicators will improve prediction of the response Y . Additionally, we
have included p2 = 19 economic control variables, including dummy variables
for specific country effects. Therefore, in (6.12) we have a total of p = 43
predictor variables, but hope to reduce the dimension of X1 .
The response and predictors were measured for 12 South American coun-
tries over a period of up to 16 years, giving a maximum sample size of
12×16 = 192. However, data for several year-country combinations are missing
leading to the sample sizes and periods listed in the first column of Table 6.1.

6.3.3.1 Governance variables, X1

The World Bank considers six aggregate WGI’s that combine the views of a
large number of enterprise, citizen and expert survey respondents: control of
TABLE 6.1
Leave-one-out cross validation MSE for GDP growth prediction. For PPLS-q1, PPCA-q1 and PPENV-q1 the cross-validation choice of q1 is shown in parentheses.

n           Period     Countries  PPLS-q1       PPCA-q1        PPLS-1  PPCA-1  LASSO  FULL   PPENV-q1      PPENV-1
161         2003–2018  12         9.37 (q1=7)   9.04 (q1=11)   10.05   10.14   11.63  10.29  9.60 (q1=5)   10.36
109         2008–2018  12         6.75 (q1=8)   6.80 (q1=20)   7.07    7.34    7.54   7.06   6.58 (q1=12)  7.30
86          2010–2018  12         4.93 (q1=3)   5.12 (q1=12)   4.99    5.24    6.49   6.61   4.95 (q1=5)   5.25
53          2010–2014  12         5.68 (q1=2)   5.85 (q1=8)    7.72    6.66    6.68   22.40  5.39 (q1=5)   6.66
51          2013–2018  10         7.10 (q1=3)   6.97 (q1=3)    8.32    8.25    6.70   37.8   –             –
41 (n < p)  2014–2018  9          4.98 (q1=1)   5.29 (q1=2)    4.98    5.30    5.49   –      –             –
32 (n < p)  2015–2018  9          1.80 (q1=1)   1.94 (q1=1)    1.80    1.94    7.54   –      –             –

corruption, rule of law, regulatory quality, government effectiveness, political


stability and voice and accountability. The basic measure of each indicator is a
weighted average standardized to have mean zero and standard deviation one,
with values from −2.5 to 2.5, approximately, where higher values correspond
to better governance. All six are highly positively correlated, and are all posi-
tively correlated with the per capita GDP; i.e., economic growth is positively
associated with better governance indicators. In addition to these six variables,
we have included several other governance measures such as the percentile
rank among all countries. In total we have p1 = 24 governance variables.

6.3.3.2 Economic control variables, X2

Following the related literature, we include foreign direct investment (net
inflows as % of GDP), domestic credit to the private sector (as % of GDP), age
dependency ratio (% of working-age population), population growth (annual
%), secondary school enrollment (gross % of population), an inflation factor
(GDP deflator) and the logged lag of per capita GDP. Including a trend vari-
able and country-specific effects (through dummy variables) we have a total
of p2 = 19 control variables that do not enter in the reduction.

6.3.4 Methods and results


To evaluate the predictive performance of partial PLS and envelope methods,
we ran a leave-one-out cross validation experiment, comparing the following
methodological alternatives:

• PPLS-q1 : Partial PLS with optimal q1 obtained via cross validation, as


described in Section 6.2.

• PPLS-1: Partial PLS with q1 = 1, as is usually used for a composite indicator,
as described in Section 6.2.

• PPCA-q1 : Partial principal component regression with optimal q1 obtained
via cross validation. It involved simply selecting the q1 best principal com-
ponents from the p1 = 24 governance variables.

• PPCA-1: This is the usual implementation for governance indicators in


a regression context. That is, we used the first principal component as a
governance index and as a replacement for the 24 governance variables.
198 Partial PLS and Partial Envelopes

• LASSO: Regression with variable selection using LASSO with the penaliza-
tion parameter selected by cross validation. This is intended to represent
penalization methods.

• FULL: OLS Regression on all p = 43 governance and control variables if


allowed by sample size.

• PPENV-q1 : Partial predictor envelopes, as described in Section 6.2.3, with
q1 selected by leave-one-out cross validation.

• PPENV-1: Partial predictor envelopes, as described in Section 6.2.3, with


q1 = 1, as allowed by sample size.
Shown in Table 6.1 are the leave-one-out predictive mean squared errors,
predMSE = (1/n) Σ_{i=1}^n ||Yi − Ŷ−i ||_2^2 = (1/n) Σ_{i=1}^n Σ_{j=1}^r (Yij − Ŷ−i,j )^2 ,

where Ŷ−i is the prediction of the i-th response vector based on the data with-
out that vector. The second form of predMSE expresses the same quantity in
terms of the individual responses Yij .
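
The leave-one-out criterion itself is simple to compute once a fitting routine is available. The short Python sketch below assumes a user-supplied function fit_predict(Y_train, X_train, X_test), a hypothetical name, that returns predictions for the held-out rows.

import numpy as np

def loo_pred_mse(Y, X, fit_predict):
    # predMSE = (1/n) sum_i ||Y_i - Yhat_{-i}||^2 over leave-one-out fits.
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    sq_errors = []
    for i in range(len(Y)):
        keep = np.arange(len(Y)) != i
        y_hat = fit_predict(Y[keep], X[keep], X[i:i + 1])
        sq_errors.append(np.sum((Y[i] - np.ravel(y_hat)) ** 2))
    return float(np.mean(sq_errors))
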
Several observations are apparent.
• The full and PPENV methods cannot handle relatively small sample sizes,
as expected, while the PLS and PCA methods are apparently serviceable
in such cases.

• As expected, there is appreciable gain over the FULL method when the
sample size is relatively small.

• The PLS and envelope methods are close competitors, although the PLS
methods are serviceable when n < p, while the envelope methods are not.

• The methods with q1 = 1 component are dominated by other methods


and cannot be recommended based on these results.

• There is little reason to recommend the PCA and LASSO methods, al-
though these methods may be in the running on occasion.
To round out the discussion, Figure 6.2 shows plots of the response Y ver-
sus the leave-one-out predicted values from the partial PLS fit with q1 = 7 and
the lasso for the data in the first row of Table 6.1 comprising 12 South Amer-
ican countries, 2003–2018, n = 161. The visual impression seems to confirm
the MSE’s shown in Table 6.1.


FIGURE 6.2
Economic growth prediction for 12 South American countries, 2003–2018, n =
161: Plot of the response Y versus the leave-one-out fitted values from (a) the
partial PLS fit with q1 = 7 and (b) the lasso.

6.4 Partial response reduction


In Section 6.2 we considered PLS and maximum likelihood envelope methods
for compressing X1 in linear model (6.1). Starting in Section 6.2.2, that dis-
cussion was restricted to regressions with a real response. In this section we
return to model (6.1) with multivariate responses Y ∈ Rr , r > 1, and consider
methods for compressing Y . Our goal here is the same as it was in Section 6.2,
improved estimation of β1 , but now we pursue that goal via compression of
Y rather than compression of X1 . The predictors in this section need not be
stochastic, as they are not used to inform the reduction of Y .

6.4.1 Synopsis of PLS for Y compression to estimate β1


To use PLS to compress the response Y for the purpose of estimating β1 in
linear model (6.1), follow these steps.

Center the predictors: This step is technically unnecessary, but it makes


subsequent descriptions somewhat easier. Center the predictors so X̄1 = 0
and X̄2 = 0.
200 Partial PLS and Partial Envelopes

Construct residuals: Construct the residuals R̂1|2,i and R̂Y|2,i , i = 1, . . . , n,
from the OLS fits of the multivariate linear regressions of X1 on X2 and of
Y on X2 . We restrict attention to regressions in which p2 ≪ min(p1 , r, n)
so there should be no dimension or sample size issues in these regressions.

Set components: Select the number of components u1 ≤ r.

Run PLS: Run the dimension reduction arm of a PLS algorithm with re-
sponse vector R̂1|2,i and predictor vector R̂Y|2,i , i = 1, . . . , n, and get the
output matrix Ŵ ∈ R^{r×u1} of weights. The projection onto span(Ŵ ) is a
√n-consistent estimator of EΣY|X (span(β1^T )), the smallest reducing sub-
space of the conditional covariance matrix ΣY|X that contains span(β1^T ).

Compress and estimate: Compress Y ↦ Ŵ^T Y and then fit the multivari-
ate regression of Ŵ^T Y on (X1 , X2 ):

Ŵ^T Y = a + ηX1 + γX2 + e.

If the sample size permits, use OLS; otherwise use PLS to reduce the
predictors. Let η̂ denote the estimated coefficients of X1 . Then the PLS
estimator of β1^T is Ŵ η̂.

The estimator of β2 is the estimated coefficient vector β̂2 in the regression
of the residuals Y − Ȳ − Ŵ η̂X1 on X2 .

Predict: The rule for predicting Y is then Ŷ = Ȳ + Ŵ η̂X1 + β̂2^T X2 .

This algorithm can be used in conjunction with cross validation and a holdout
sample to select the number of components and estimate the prediction error
of the final estimated model. It is also possible to use maximum likelihood to
estimate a partial reduction, as described in Section 6.4.3.
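
A schematic Python version of this synopsis is given below, again using scikit-learn's PLSRegression only as a stand-in for the dimension-reduction arm; the function name and return values are illustrative.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def partial_response_pls(Y, X1, X2, u1):
    # Sketch of the Y-compression synopsis of Section 6.4.1; Y is n x r, X1 is n x p1, X2 is n x p2.
    n, r = Y.shape
    X2c = np.column_stack([np.ones(n), X2])
    R_Y2 = Y - X2c @ np.linalg.lstsq(X2c, Y, rcond=None)[0]
    R_12 = X1 - X2c @ np.linalg.lstsq(X2c, X1, rcond=None)[0]
    # Dimension-reduction arm with "response" R_12 and "predictor" R_Y2: the
    # weight matrix W (r x u1) estimates a basis of the response envelope.
    W = PLSRegression(n_components=u1, scale=False).fit(R_Y2, R_12).x_weights_
    # Multivariate OLS of the compressed response W'Y on (X1, X2).
    Z = np.column_stack([np.ones(n), X1, X2])
    coefs = np.linalg.lstsq(Z, Y @ W, rcond=None)[0]
    eta = coefs[1:1 + X1.shape[1]].T                      # u1 x p1 coefficients of X1
    beta1_T = W @ eta                                     # r x p1 estimate of beta_1'
    # beta_2 from the OLS fit of the residuals Y - Ybar - X1 beta_1 on X2.
    resid = Y - Y.mean(axis=0) - X1 @ beta1_T.T
    beta2 = np.linalg.lstsq(X2c, resid, rcond=None)[0][1:]
    return W, beta1_T, beta2
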

6.4.2 Derivation
To compress the response vector using PLS for the purpose of estimating β1
in model (6.1), we strive to project Y onto the smallest subspace E of R^r that
satisfies
(a) QE Y ⊥⊥ X1 | X2 and (b) PE Y ⊥⊥ QE Y | (X1 , X2 ). (6.13)

Let Γ ∈ Rr×u1 be a semi-orthogonal basis matrix for E and let (Γ, Γ0 ) be an


orthogonal matrix. Condition (a) requires that span(β1T ) ⊆ E. In consequence,

β1 can be represented in terms of basis Γ, β1^T = Γη, Γ0 is orthogonal to the row
space of β1 , and the distribution of Γ0^T Y | (X1 , X2 ) does not depend on X1 :

Γ^T Y = Γ^T α + ηX1 + Γ^T β2^T X2 + Γ^T ε
Γ0^T Y = Γ0^T α + Γ0^T β2^T X2 + Γ0^T ε.

Condition (b) requires that PE ΣY |X QE = 0 so E is a reducing subspace of


ΣY |X . It follows that E can be characterized as an envelope. Specifically, E is
the smallest reducing subspace of ΣY|X that contains B1^0 = span(β1^T ), which
is denoted EΣY|X (B1^0 ).
Conditions (a) and (b) of (6.13) are equivalent to the combined condition

(c) (X1 , PE Y ) ⊥⊥ QE Y | X2 . (6.14)

Comparing this to condition (6.4) we see that the two conditions are the same,
except the roles of X1 and Y are interchanged. In consequence, we can achieve
partial dimension reduction of the response vector by following the PLS al-
gorithm of Section 6.2.1, but interchanging X1 and Y and using a different
fitting scheme. This then is how we arrived at the algorithm of Section 6.4.1.

6.4.3 Likelihood-based partial response envelopes


It follows from our discussion in Section 6.4.2 that the partial response enve-
lope version of model (6.1) is

Y = α + ΓηX1 + β2^T X2 + ε,
ΣY|X = ΓΩΓ^T + Γ0 Ω0 Γ0^T , (6.15)

where var(ΓT Y | X) = Ω and var(ΓT0 Y | X) = Ω0 . Assuming that Y | X is


normally distributed, Su and Cook (2011) derived the maximum likelihood es-
timators under this model. Their results were reviewed by Cook (2018, Chap-
ter 3). Here we give only the estimators. Let B1^0 = span(β1^T ).
The maximum likelihood estimator of the envelope EΣY|X (B1^0 ) is

ÊΣY|X (B1^0 ) = span{arg min_G (log |G^T SRY|2|R1|2 G| + log |G0^T SRY|2 G0 |)}
            = span{arg min_G (log |G^T SY|X G| + log |G^T SY|2^{-1} G|)}, (6.16)

where min_G is over r × u1 semi-orthogonal matrices and (G, G0 ) ∈ R^{r×r} is an
orthogonal matrix. The function to be optimized is of the general form of Al-
gorithm L discussed in Section 1.5.5. Comparing (6.16) with that for partial

predictor reduction (6.9) we see that one can be obtained from the other by
interchanging the roles of Y and X1 , although X1 need not be stochastic for
(6.16). This is in line with our justification around (6.14) for the PLS com-
pression algorithm described in Section 6.4.
Following determination of ÊΣY|X (B1^0 ), the maximum likelihood estimators
of the remaining parameters are as follows. The maximum likelihood estima-
tor β̂1 of β1 is obtained from the projection onto ÊΣY|X (B1^0 ) of the maximum
likelihood estimator B1^T of β1^T from model (6.1),

β̂1^T = PÊ1 B1^T ,

where PÊ1 denotes the projection operator for ÊΣY|X (B1^0 ). The maximum like-
lihood estimator β̂2 of β2 is the coefficient matrix from the ordinary least
squares fit of the residuals Y − Ȳ − β̂1^T X1 on X2 . If X1 and X2 are orthogonal
then β̂2 reduces to the maximum likelihood estimator of β2 from the standard
model. Let Γ̂ be a semi-orthogonal basis matrix for ÊΣY|X (B1^0 ) and let (Γ̂, Γ̂0 )
be an orthogonal matrix. The maximum likelihood estimator Σ̂Y|X of ΣY|X
is then

Σ̂Y|X = PÊ1 SRY|2|R1|2 PÊ1 + QÊ1 SRY|2 QÊ1 = Γ̂Ω̂Γ̂^T + Γ̂0 Ω̂0 Γ̂0^T
Ω̂ = Γ̂^T SRY|2|R1|2 Γ̂
Ω̂0 = Γ̂0^T SRY|2 Γ̂0 .

The asymptotic distributions of β̂1 and β̂2 were derived and reported by Su
and Cook (2011). Here we give only their result for β̂1 since that will typically
be of primary interest in applications. See also Cook (2018, Chapter 3).
As we discussed for response reduction in Section 2.5.3, it is appropriate to
treat the predictors as non-stochastic and ancillary. Accordingly, Su and Cook
(2011) treated the predictors as non-stochastic when deriving the asymptotic
distribution of βb1 . Recall from Section 2.5.3 that when X is non-stochastic we
define ΣX as the limit of SX as n → ∞. Partition ΣX = (Σi,j ) according to
the partitioning of X (i, j = 1, 2) and let

Σ1|2 = Σ1,1 − Σ1,2 Σ2,2^{-1} Σ2,1 .

The matrix Σ1|2 is constructed in the same way as the covariance matrix
for the conditional distribution of X1 | X2 when X is normally distributed,
although here X is fixed. Define

U1|2 = Ω0^{-1} ⊗ ηΣ1|2 η^T + Ω0^{-1} ⊗ Ω + Ω0 ⊗ Ω^{-1} − 2Iu1(r−u1) .

The limiting distribution of β̂1 is stated in the following proposition.

Proposition 6.1. Under the partial envelope model (6.1) with non-stochastic
predictors, √n{vec(β̂1 ) − vec(β1 )} converges in distribution to a normal ran-
dom vector with mean 0 and covariance matrix

avar{√n vec(β̂1 )} = ΓΩΓ^T ⊗ Σ1|2^{-1} + (Γ0 ⊗ η^T )U1|2^† (Γ0^T ⊗ η),

where † denotes the Moore-Penrose inverse.


The form of the asymptotic variance of β̂1 given in this proposition is iden-
tical to the form of the asymptotic variance of β̂ from the full envelope model
given in Proposition 2.2, although the definitions of components are different.
For instance, avar{√n vec(β̂)} requires ΣX , while avar{√n vec(β̂1 )} uses Σ1|2
in its place. The asymptotic standard error for an element of β̂1 is then

se{(β̂1 )ij } = {âvar[√n (β̂1 )ij ]}^{1/2} / √n .

6.4.4 Application to longitudinal data


In Section 2.5.2 we motivated response envelopes by pointing out their po-
tential utility when the model matrix connecting the treatments with a lon-
gitudinal response vector is unknown. In this section we address a perhaps
more typical setting in which the model matrix is presumed known and we
use envelopes to improve estimation of the effects vector. The results here are
from Cook, Forzani, and Liu (2023a), which contains additional developments
along the same lines.
Let Y denote a vector of r longitudinal responses and let U ∈ Rr×k de-
note a known model matrix connecting the predictors X ∈ Rp to the response
vector via the generalization of model (2.19)

Yi = U α0 + U αXi + εi , i = 1, . . . , n, (6.17)

where α0 ∈ Rk , α ∈ Rk×p , and the error vectors are independent copies of a


N (0, ΣY |X ) random vector. In longitudinal data analyses we normally imagine
X as non-stochastic reflecting the treatment assigned to the i-th experimen-
tal unit. We adopt that view here, unless stated otherwise. The first step in
developing envelope/PLS estimators for α is to transform model (6.17) to a
form that is more amenable to analysis. Let U = span(U ), let U0 be a semi-
orthogonal basis matrix for U ⊥ , and let W = (U (U T U )−1 , U0 ) := (W1 , W2 ).

Then the transformed model becomes


W^T Yi := (YDi^T , YSi^T )^T with YDi = W1^T Yi = α0 + αXi + W1^T εi and
YSi = W2^T Yi = W2^T εi , i = 1, . . . , n, (6.18)

where YDi ∈ R^k and YSi ∈ R^{r−k} . The transformed variance can be represented
block-wise as ΣW := var(W^T ε) = (Wi^T ΣY|X Wj ), i, j = 1, 2. The mean E(YD |
X) depends non-trivially on X and thus, as indicated by the subscript D, we


think of YD as providing direct information about the regression. On the other
hand, E(YS | X) = 0 and thus YS provides no direct information but may
provide useful subordinate information by virtue of its association with YD .
To find the maximum likelihood estimators from model (6.18), we write
the full log likelihood as the sum of the log likelihoods for the marginal model
from YS | X and the conditional model from YD | (X, YS ):

YSi | Xi = eSi (6.19)


YDi | (Xi , YSi ) = α0 + αXi + φD|S YSi + eD|Si , (6.20)

where

φD|S = (U T U )−1 U T ΣY |X U0 (U0T ΣY |X U0 )−1 ∈ Rk×(r−k)


eD|S = W1T ε
eS = W2T ε.

The variances of the errors are ΣS := var(eS ) = U0T ΣY |X U0 and ΣD|S :=


var(eD|S ) = (U^T ΣY|X^{-1} U )^{-1} . The number of free real parameters in this con-
ditional model is Ncm (k) = k(p + 1) + r(r + 1)/2. The subscript “cm” is used
also to indicate estimators arising from the conditional model (6.20). The
maximum likelihood estimator and its asymptotic variance are
β̂cm = U α̂cm = U SD,RX|(1,S) SX|S^{-1} = U (SD,X − SD,S SS^{-1} SS,X )SX|S^{-1}
avar(√n vec(β̂cm )) = ΣX^{-1} ⊗ U ΣD|S U^T ,

where α̂cm and β̂cm are the MLEs of α and β in model (6.18).
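
The conditional-model estimator can be sketched directly from these sample covariances. The Python fragment below is illustrative only: it assumes centered data matrices Y and X and a model matrix U of full column rank, and it uses the second, covariance-based expression for the estimator of α given above.

import numpy as np

def beta_cm(U, Y, X):
    # beta_cm_hat = U alpha_cm_hat with alpha_cm_hat = (S_{D,X} - S_{D,S} S_S^{-1} S_{S,X}) S_{X|S}^{-1};
    # Y (n x r) and X (n x p) are assumed centered, U (r x k) of full column rank with k < r.
    n = Y.shape[0]
    W1 = U @ np.linalg.inv(U.T @ U)
    Q, _ = np.linalg.qr(U, mode='complete')
    U0 = Q[:, U.shape[1]:]                               # basis of the complement of span(U)
    YD, YS = Y @ W1, Y @ U0
    S = lambda A, B: A.T @ B / n                         # sample covariance of centered matrices
    S_S = S(YS, YS)
    S_X_given_S = S(X, X) - S(X, YS) @ np.linalg.solve(S_S, S(YS, X))
    alpha_cm = (S(YD, X) - S(YD, YS) @ np.linalg.solve(S_S, S(YS, X))) @ np.linalg.inv(S_X_given_S)
    return U @ alpha_cm
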
Model (6.20) is in the form of a partitioned multivariate linear model (6.1)
and so it is amenable to compression of the response vector for the purpose
of estimating α. Let A = span(α) and parameterize (6.20) in terms of a semi-
orthogonal basis matrix Γ ∈ Rk×u for EΣD|S (A), the ΣD|S -envelope of A. Let

η ∈ Ru×p be an unconstrained matrix giving the coordinates of α in terms


of semi-orthogonal basis matrix Γ, so α = Γη, and let (Γ, Γ0 ) ∈ Rk×k be an
orthogonal matrix. Then the envelope version of model (6.20) conforms to our
current context:

YDi | (Xi , YSi ) = α0 + ΓηXi + φD|S YSi + eD|Si ,
ΣD|S = ΓΩΓ^T + Γ0 Ω0 Γ0^T , (6.21)

where Ω ∈ Ru×u and Ω0 ∈ R(k−u)×(k−u) are positive definite matrices. This


model can now be fitted with the PLS methods of Section 6.4.1 or the like-
lihood methods of Section 6.4.3. Many additional details and results on this
envelope-based analysis of longitudinal data are available from Cook et al.
(2023a).

6.4.5 Reactor data


The data for this illustration arose from an experiment involving a low den-
sity tubular polyethylene reactor (Skagerberg et al., 1992). These data were
used by Cook and Su (2016) in their study of scaled predictor envelopes and
partial least squares. The predictor variables consist of 20 temperatures mea-
sured at equal distances along the reactor together with the wall temperature
of the reactor and the feed rate. The 6 responses are the output characteristics
of the polymers produced: y1 , the number-average molecular weight; y2 , the
weight-average molecular weight; y3 , the frequency of long chain branching;
y4 , the frequency of short chain branching; y5 , the content of vinyl groups; y6 ,
the content of vinylidene groups. Because the distributions of the values of all
the response variables are highly skewed to the right, the analysis was per-
formed using the logarithms of their corresponding values. For interpretational
convenience all responses were standardized to unit variance. Responses y1 and
y2 are seen to be strongly correlated, and y4 , y5 and y6 form another strongly
correlated group. The third response y3 is more weakly correlated with the
others. The predictive accuracy of each method was estimated through leave-
one-out cross-validation. Overall, there are a total of n = 56 observations on
p = 22 predictors and r = 6 responses.
Since the first p1 = 20 predictors X1 are temperatures along the reactor,
they differ from the remaining two predictors X2 , and so we consider dimension
reduction treating X1 and X2 differently. Specifically, we use model (6.1) as the
basis for partial response reduction using the PLS algorithm of Section 6.4.1
TABLE 6.2
Leave-one-out cross validation of the mean prediction error predRMSE for the
reactor data.

                           Partial Predictor      Partial Response
Method          OLS        PLS      Envelope      PLS      Envelope
No. components  p1 = 20    16       16            2        2
predRMSE        0.44       0.37     0.37          0.34     0.33

and the envelope estimator described in Section 6.4.3. For comparison, we also
studied partial predictor reduction using the model of Section 6.2.3 as the ba-
sis for the PLS algorithm of Section 6.2.1 and the likelihood-based partial
envelope estimation of Section 6.2.3. The comparison criterion is prediction
of the response vector using leave-one-out cross validation to form the mean
prediction error predRMSE as defined in Section 5.5.1.
The results shown in Table 6.2 indicate that partial response reduction
is somewhat better than partial predictor reduction and that both reduction
types do noticeably better than OLS. The two components indicated in the
partial response applications imply that the basis matrix Γ in model (6.15) is
6 × 2. Consequently, ΓT Y is the part of the response vector that is influenced
by the changes in the predictors, while ΓT0 Y is unaffected by changes in the
predictors.
7

Linear Discriminant Analysis

Partial least squares has gained some recognition as a method of discriminat-


ing between r + 1 populations or classifications based on a vector X ∈ Rp
of features, although it was not originally designed for that task. Define
Y = (0, . . . , 0, 1j , 0, . . . 0)T as an r × 1 indicator vector with a single one
indicating class j = 1, . . . , r and the zero vector Y = (0, . . . , 0)T indicating
class j = 0, which is occasionally called one-hot encoding. In this way the r ×1
vector Y serves to indicate r + 1 classes. Using one of the algorithms discussed
in Chapter 3, PLS regression is then used to reduce the feature vector X in
the data (Xi , Yi ), i = 1, . . . , n, followed by use of Bayes’ rule for the actual
classification of a new observation. Barker and Rayens (2003) pointed out cer-
tain basic connections between this partial least squares discriminant analysis
(PLS-DA) and Fisher’s linear discriminant analysis (LDA). For instance, if
SX > 0 and the number of components q = p then PLS-DA is the same as
LDA. Liu and Rayens (2007) reinforced the arguments of Barker and Rayens
(2003) by providing illustrations on real data sets. Brereton and Lloyd (2014)
provided a tutorial on PLS-DA, motivating it by using a somewhat ques-
tionable version of the bilinear calibration model described by Martens and
Næs (1989, p. 91). They compared PLS-DA to LDA and to other methods of
discrimination. PLS-DA is now available in most software for PLS regressions.
In this chapter, we revisit PLS-DA but differ from the literature by con-
necting it with principal fitted components (PFC), envelopes and reduced-rank
(RR) regression. Results in previous chapters help to explain why it might be
expected to be serviceable when the number of features is larger than the sam-
ple size. Familiarity with classical normal-theory linear and quadratic discrim-
inant analysis is assumed here and in Chapter 8. (See for example McLachlan,
2004).




7.1 Envelope discriminant subspace


Given an observed vector X ∈ Rp of features associated with an object, the
problem of estimating the class C to which it belongs from among r + 1
possible classes indexed by the set χ = {0, . . . , r} is frequently addressed by
using Bayes’ rule to assign the object to the class φ(X) with the maximum
probability:
φ(X) = arg max Pr(C = k | X).
k∈χ

It is widely recognized that we might lower substantially the chance of mis-


classification if we can reduce the dimension of X without loss of information
on φ(X). With that goal in mind, Cook and Yin (2001) considered reducing,
without loss of information, a vector X of continuous features by projecting
it onto a subspace S ⊆ Rp and then employing Bayes’ rule on the reduced
features φS (X) := arg maxk∈χ Pr(C = k | PS X):

Definition 7.1. If
φ(X) = φS (X) (7.1)

then S is called a discriminant subspace. If the intersection of all discriminant


subspaces is itself a discriminant subspace, it is called the central discriminant
subspace and denoted as DC|X .

In this way the central discriminant subspace captures all of the classifi-
cation information that X has about class C and thus has the potential to
reduce the dimension of X without loss of information on φ(X). For exam-
ple, suppose that there are three classes, χ = {0, 1, 2}, two normal features
X = (X1 , X2 )T and that Pr(C = k | X1 , X2 ) = Pr(C = k | X1 + X2 ). Then
only the sum of the two predictors is relevant for classification and (7.1) holds
with S = span((1, 1)T ). This reflects the kind of setting we have in mind when
envelopes are employed a bit later.
To illustrate the potential importance of discriminant subspaces, con-
sider a stylized problem adapted from Cook and Yin (2001), still using three
classes and two normal features. Define the conditional distribution of C | X
according to the following:

Pr(C = 0 | X1 < 0, X2 < 0) = Pr(C = 0 | X1 > 0, X2 > 0) = 1,


Pr(C = 0 | X1 > 0, X2 < 0) = Pr(C = 0 | X1 < 0, X2 > 0) = ω,
Pr(C = 1 | X1 > 0, X2 < 0) = 1−ω
Pr(C = 2 | X1 < 0, X2 > 0) = 1 − ω,

where 0 < ω < 1. These conditional distributions are depicted in Figure 7.1.
The central subspace SC|X = R2 and no linear dimension reduction is possi-
ble. However, dimension reduction is possible for φ(X). Although we would
not normally expect applications to be so intricate, the example does illustrate
the potential relevance of discriminant subspaces.
If ω > 1/2 then φ(X) = 0 for all X, DC|X = span((0, 0)T ) is a discriminant
subspace and consequently the two features provide no information to aid in
classification. In this case we are able to discard X completely. If ω < 1/2
then φ(X) depends non-trivially on both predictors, DC|X = R2 and so no
dimension reduction is possible.

FIGURE 7.1
Illustration of the importance of the central discriminant subspace.


To bring envelopes into the discussion, recall that conditions (2.3), restated
here with Y replaced by C for ease of reference

(a) C ⊥⊥ X | PS X and (b) PS ΣX QS = 0, (7.2)

are operational versions of the more general envelope conditions (2.2) for re-
ducing the dimension of X in the regression of Y on X. Condition (7.2a)
requires that PS X capture all of ways in which X can inform on the distribu-
tion of C. Dimension reduction for discriminant analysis is different since we
are not interested in capturing all aspects of X that provide information about
C. Rather we pursue only the part of X that affects φ(X). Adapting to this
distinction, Zhang and Mai (2019) replaced condition (7.2a) with requirement
(7.1), leading to consideration of subspaces S ⊆ Rp with the properties

(a) φ(X) = φS (X) and (b) PS ΣX QS = 0. (7.3)

The rationale for including (7.3b) is similar to that discussed in Section 2.1:
methodology based on condition (7.3a) alone may not be effective when p > n
or the features are highly collinear because then it is hard in application to
distinguish the material part PS X of X that is required for φ(X) from the
complementary part QS X that is immaterial. The role of condition (7.3b) is
then to induce a measure of clarity in the separation of X into parts that
are material and immaterial to φ(X). Zhang and Mai (2019) formalized the
notion of the smallest subspace that satisfies (7.3) as follows:

Definition 7.2. If the intersection of all subspaces that satisfy (7.3) is it-
self a subspace that satisfies (7.3), then it is called the envelope discriminant
subspace.

A variety of methods have been proposed for implementing aspects of these


general ideas. Most focus on modeling Pr(C = k | X = x) directly, or indi-
rectly via the conditional density f of X given (C = k) (e.g. Cook and Yin,
2001, eq. (5)):
Pr(C = k | X = x) = πk f (x | C = k) / Σ_{j=0}^r πj f (x | C = j), (7.4)

where πk = Pr(C = k) is the marginal prior probability of class k ∈ χ.


Beginning with a model for X | C and specification of prior probabilities
πk , this represents the Bayes posterior probability of class k. However, if
Pr(C = k | X = x) is modeled directly, say with logistic regression when

there are two classes, this Bayes interpretation may not play a useful role.
PLS discriminant analysis (e.g. Brereton and Lloyd, 2014) arises by modeling
X | (C = k) as a conditional normal, leading to methodology closely associ-
ated with Fisherian LDA. While this may seem simple relative to the range
of methods available, Hand (2006) argued that simple methods can outper-
form more intricate methods that do not adapt well to changing circumstances
between classifier development and application.
We next turn to LDA where we bring in PLS methods. Quadratic discrim-
inant analysis is considered in Chapter 8.

7.2 Linear discriminant analysis


The nominal stochastic structure underlying LDA for predicting the class
C ∈ χ based on a vector of features X ∈ Rp is

X | (C = k) ∼ N (µk , ΣX|(C=k) ), Pr(C = k) = πk , k ∈ χ, (7.5)


where πk > 0, Σ_{k=0}^r πk = 1 and the intra-class covariance matrices ΣX|(C=k)
are assumed to be equal over the classes, ΣX|(C=k) = ΣX|(C=j) := ΣX|C . Un-
der this model, the Bayes’ rule is to classify a new unit with feature vector X
onto the class with maximum posterior probability, giving the classification
function from (7.4)

φlda (X) = arg max_{k∈χ} Pr(C = k | X)
         = arg max_{k∈χ} {log πk + µk^T ΣX|C^{-1} (X − µk /2)} (7.6)
         = arg max_{k∈χ} {log(πk /π0 ) + βk^T ΣX|C^{-1} (X − (µk + µ0 )/2)}
         = arg max_{k∈χ} {log(πk /π0 ) + βk^T ΣX|C^{-1} (X − (βk + 2µ0 )/2)}, (7.7)

where the vectors βk = (µk − µ0 ) ∈ Rp , k = 1, . . . , r, are the mean deviations


relative to the reference population µ0 . The prior probabilities πk are assumed
to be known, perhaps πk = (1 + r)−1 or, when a training sample is available,
πk is set equal to the observed fraction of cases in class C = k. Given the
stochastic setup (7.5), the success of a prediction rule in application depends
on having good estimators of the βk ’s and ΣX|C . These parameters can be
well-estimated by using a multivariate regression model when n  p.

Let β = (β1 , . . . , βr )T ∈ Rr×p and recall that Y = (0, . . . , 0, 1j , 0, . . . 0)T


denotes the r × 1 vector indicating class j ∈ χ. Then µ0 , β, and ΣX|C can be
estimated by fitting the multivariate linear model implied by (7.5),

Xi = µ0 + β T Yi + εi , i = 1, . . . , n, (7.8)

where the errors are independent copies of ε ∼ N (0, ΣX|C ). Maximum likeli-
hood estimators of the parameters in this model were reviewed in Section 1.2.2.
Since the roles of X and Y are different in (7.8) we restate the estimators for
ease of reference:

β̂ols = SY^{-1} SY,X
Σ̂X|C = SX|Y
µ̂0 = X̄ − β̂ols^T Ȳ .

These estimators are then substituted into φlda (X) to estimate the class with
the maximum posterior probability, a process that characterizes classical LDA.
However, this estimation procedure can be inefficient if β has less than full
row rank or if only a part of X is informative about C. In the extreme, if it
were known that rank(β) = 1 then it may be possible to improve the classical
method considerably. Incorporating the central discriminant subspace allows
for the possibility that β is rank deficient, and envelopes were designed to deal
with the possibility that only a portion of X informs on C.
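
For reference, the classical construction just described can be sketched in a few lines of Python; the code below is illustrative only and simply packages the regression-based estimators of this section with the rule (7.6).

import numpy as np

def lda_from_regression(X, C, priors=None):
    # LDA via the multivariate regression (7.8): X is n x p, C holds labels 0, ..., r.
    classes = np.unique(C)
    n = X.shape[0]
    Y = (np.asarray(C)[:, None] == classes[None, 1:]).astype(float)   # one-hot, class 0 as reference
    Yc, Xc = Y - Y.mean(0), X - X.mean(0)
    B_ols = np.linalg.solve(Yc.T @ Yc / n, Yc.T @ Xc / n)             # S_Y^{-1} S_{Y,X}
    Sigma = (Xc - Yc @ B_ols).T @ (Xc - Yc @ B_ols) / n               # S_{X|Y}
    mu0 = X.mean(0) - B_ols.T @ Y.mean(0)
    means = np.vstack([mu0, mu0 + B_ols])                             # class means, (r+1) x p
    pri = np.full(len(classes), 1.0 / len(classes)) if priors is None else np.asarray(priors)
    Sinv = np.linalg.inv(Sigma)
    def classify(x):
        # Rule (7.6): maximize log pi_k + mu_k' Sigma^{-1} (x - mu_k / 2).
        scores = [np.log(pri[k]) + means[k] @ Sinv @ (x - means[k] / 2) for k in range(len(classes))]
        return classes[int(np.argmax(scores))]
    return classify
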

7.3 Principal fitted components


As used previously, let B^0 = span(β^T ) ⊆ R^p . Zhang and Mai (2019) noted, as
may be clear from (7.7), that under model (7.5),

DC|X = span(ΣX|C^{-1} β^T ) = ΣX|C^{-1} B^0 , (7.9)

so estimating the central discriminant subspace reduces to estimation of


span(ΣX|C^{-1} β^T ) for model (7.8), including its dimension. To incorporate this
into model (7.8), let d = dim(B 0 ) ≤ min(p, r), let B ∈ Rp×d be a semi-
orthogonal basis matrix for B 0 and define coordinate vectors bk so that
βk = Bbk , k = 1, . . . , r. Let b = (b1 , . . . , br ) ∈ Rd×r , which has rank d.

Then β T = Bb and model (7.8) becomes

Xi = µ0 + BbYi + εi , i = 1, . . . , n. (7.10)

If d = r then β T has full column rank and this model reduces to model (7.8).
If d < r this becomes an instance of the general model for PFC (Cook, 2007;
Cook and Forzani, 2008b), which is an extension to regression of the Tipping-
Bishop model that yields probabilistic principal components (Tipping and
Bishop, 1999). The general PFC model allows the Y -vector to be any user-
specified function of a response, but in discriminant analysis Y is properly
an indicator vector as defined previously. A key characteristic of model (7.10)
is summarized in the following proposition (Cook (2007, Prop 6); Cook and
Forzani (2008b, Thm 2.1)). In preparation, define a p × d basis matrix for the
central discriminant subspace as

Φ = ΣX|C^{-1} B ∈ R^{p×d} .

Proposition 7.1. Under model (7.10),

X | (Y, ΦT X) ∼ X | ΦT X.

Proof. Let (B, B0 ) be an orthogonal matrix. We suppress the subscript i in


this proof for notational simplicity. Transforming X in model (7.10) with
(Φ, B0 )T , the model becomes
Φ^T X = Φ^T µ0 + Φ^T BbY + Φ^T ε,
B0^T X = B0^T µ0 + B0^T ε.

We see from this model that B0^T X ⊥⊥ Y , so marginally B0^T X carries no
information on the class C. Further,

cov(Φ^T ε, B0^T ε) = Φ^T var(ε)B0 = Φ^T ΣX|C B0 = B^T ΣX|C^{-1} ΣX|C B0 = 0.

Since ε is normally distributed, it follows that Φ^T X ⊥⊥ B0^T X | Y . This plus the
previous conclusion B0^T X ⊥⊥ Y implies that B0^T X ⊥⊥ (Φ^T X, Y ) and thus that
B0^T X ⊥⊥ Y | Φ^T X (Cook, 1998, Proposition 4.6). Since (B0 , Φ)^T X is a full rank
linear transformation of X, this last conclusion implies that X ⊥⊥ Y | Φ^T X.
The desired conclusion – X | (Y, ΦT X) ∼ X | ΦT X – follows.

In consequence of Proposition 7.1, Y ⊥⊥ X | Φ^T X, which implies that Φ^T X


holds all of the information that X has about Y and thus, together with (7.9),

SY |X = DC|X = span(Φ). (7.11)

Accordingly, we lose no information when basing LDA classifications on the


reduced features ΦTX instead of the full features, and may gain by reduc-
ing substantially the probability of misclassification. In terms of the reduced
features, the classification function is

φlda (Φ^T X) = arg max_{k∈χ} Pr(C = k | Φ^T X)
            = arg max_{k∈χ} {log(πk /π0 ) + βk^T Φ(Φ^T ΣX|C Φ)^{-1} Φ^T (X − (βk + 2µ0 )/2)}
            = arg max_{k∈χ} {log(πk /π0 ) + (PB(ΣX|C^{-1}) βk )^T ΣX|C^{-1} PB(ΣX|C^{-1}) (X − (βk + 2µ0 )/2)}
            = arg max_{k∈χ} {log(πk /π0 ) + bk^T Φ^T (X − (Bbk + 2µ0 )/2)}, (7.12)

where still βk = (µk − µ0 ) = Bbk , k = 1, . . . , r. Result (7.12) was obtained by


substituting

Φ(Φ^T ΣX|C Φ)^{-1} Φ^T = ΣX|C^{-1} PB(ΣX|C^{-1})
                     = PB(ΣX|C^{-1})^T ΣX|C^{-1} PB(ΣX|C^{-1}) .

It might be concluded from (7.12) that DC|X = span(Φb). But span(Φb) =


span(Φ) since b has full row rank, and so we again arrive at (7.11).
Representation (7.12) expresses the classification in terms of the reduced
features Φ^T X, while (7.7) expresses it in terms of the full features. Comparing
(7.7) to the steps leading to (7.12) we see that in the classification with re-
duced features, the features have been projected onto span(B) in the Σ−1 X|C
inner product. In consequence, we would expect the reduced predictors to yield
more reliable categorizations, at least in large samples, because directions of
extraneous variation have been removed. There are several methods that can
be used to obtain estimators of the unknown quantities in φlda (ΦTX). These
estimators are based on variations of model (7.10) that all have the same mean
structure but differ on the error structures. We discuss some of these in the
following sections, leading to connections with envelopes and PLS methods in
Sections 7.3.2.

7.3.1 Discrimination via PFC with Isotropic errors


Isotropic errors – ΣX|C = σ 2 Ip – may be appropriate when measurement er-
ror dominates model (7.10) and, given the correct classification, the features
are conditionally independent and have the same variance. They may also be
useful as an approximation to avoid estimating a large variance-covariance
matrix with limited sample size. In particular, this isotropic classifier might
be serviceable in some applications when n  p.
With isotropic errors, Y ⊥⊥ X | B^T X, PB(ΣX|C^{-1}) = PB , and the classification
function reduces to

φlda (Φ^T X) = φlda (B^T X)
            = arg max_{k∈χ} {log(πk /π0 ) + σ^{-2} βk^T PB (X − (βk + 2µ0 )/2)}
            = arg max_{k∈χ} {log(πk /π0 ) + σ^{-2} bk^T B^T (X − (Bbk + 2µ0 )/2)}.

The maximum likelihood estimator of φlda (B TX) can be constructed as


follows. Let Y denote the n × r matrix with rows (Yi − Ȳ )^T , let X denote
the n × p matrix with rows (Xi − X̄)^T , let X̂ = PY X denote the n × p ma-
trix of centered fitted values from the fit of the full feature model (7.8) and
let λ̂1 , . . . , λ̂d and φ̂1 , . . . , φ̂d denote the first d eigenvalues and corresponding
eigenvectors of Sfit = X̂^T X̂/n, the sample covariance matrix of the fitted vec-
tors. The following estimators arise from model (7.10) with d specified and
isotropic errors (Cook, 2007):

B̂ = (φ̂1 , . . . , φ̂d )
b̂ = B̂^T X^T Y(Y^T Y)^{-1}
µ̂0 = X̄ − B̂b̂Ȳ
β̂ = (Y^T Y)^{-1} Y^T XPB̂ = (β̂1 , . . . , β̂r )^T
σ̂^2 = p^{-1} {tr(X^T X/n) − Σ_{i=1}^d λ̂i }.

These estimators, which require n > r and n > d but not n > p, can now
be substituted into φlda (B^T X) for a sample classifier. If classes have equal
prior probabilities, πk = πj , then the classification rule simplifies a bit and an
estimator of σ^2 is no longer necessary:

φlda (B^T X) = arg max_{k∈χ} {bk^T B^T (X − (Bbk + 2µ0 )/2)}.

The dimension d of DC|X = span(B) can be estimated straightforwardly by
using cross validation, with a holdout sample reserved for estimation of the
classification error of the final classification function.
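
A compact Python sketch of the isotropic PFC estimators is given below; it is illustrative only, assuming class labels 0, . . . , r in C, every class observed, and n > max(r, d). The function name is ours.

import numpy as np

def isotropic_pfc(X, C, d):
    # Estimators of Section 7.3.1: B_hat, b_hat, mu0_hat and sigma2_hat.
    classes = np.unique(C)
    n, p = X.shape
    Y = (np.asarray(C)[:, None] == classes[None, 1:]).astype(float)
    Yc, Xc = Y - Y.mean(0), X - X.mean(0)
    Xfit = Yc @ np.linalg.solve(Yc.T @ Yc, Yc.T @ Xc)       # centered fitted values P_Y X
    Sfit = Xfit.T @ Xfit / n
    vals, vecs = np.linalg.eigh(Sfit)                        # ascending eigenvalues
    B = vecs[:, ::-1][:, :d]                                 # first d eigenvectors of Sfit
    b = B.T @ Xc.T @ Yc @ np.linalg.inv(Yc.T @ Yc)           # d x r coordinates
    mu0 = X.mean(0) - B @ b @ Y.mean(0)
    sigma2 = (np.trace(Xc.T @ Xc / n) - vals[::-1][:d].sum()) / p
    return B, b, mu0, sigma2
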

7.3.2 Envelope and PLS discrimination under model (7.8)


As suggested previously, methodology based on model (7.8) alone may not
be effective when the features are highly collinear because then it is hard in
application to distinguish the material part of X that is required for classifica-
tion from the complementary part of X that does nothing more than induce
extraneous variation. This is where envelopes play a key role via condition
(7.3b), which induces a measure of clarity in the separation of X into its
parts that are material and immaterial to φ(X). Operationally, an envelope
parametrization of model (7.8) leads to a model with a structure for ΣX|C
that can adapt to changing circumstances through the choice of dimension.
The maximum likelihood estimators based on model (7.8) are not service-
able for classification when p > n because then SX|Y is singular and direct
estimation of the classification function (7.7) is not possible. This is where
PLS classification may be particularly useful, as it provides a method of esti-
mating the material part of the feature vector when p > n. It may be useful
also when n  p and the features are collinear, although here likelihood-based
envelopes may dominate.
Since DC|X = span(Φ) under model (7.10), it follows from Definition 7.2
and model (7.10) that the envelope discriminant subspace is the ΣX|C -
envelope of span(Φ), which leads to the following envelope equality after ap-
plying (1.26),

EΣX|C (DC|X ) = EΣX (B 0 ).

Let u = dim(EΣX (B 0 )). As in previous chapters, we can now parameterize


model (7.8) in terms of a semi-orthogonal basis matrix Γ ∈ Rp×u for EΣX (B 0 ).
Let (Γ, Γ0 ) be an orthogonal matrix. Then the PFC model becomes

Xi = µ0 + ΓηYi + εi , i = 1, . . . , n, (7.13)
ΣX|C = ΓΩΓ^T + Γ0 Ω0 Γ0^T .

This response envelope model (See Sections 2.4 and 2.5.3) was anticipated by
Cook (2007), who proposed it as an extension (EPFC) of PFC model (7.10),

but without the key understanding that can derive from the envelope struc-
ture. It was subsequently proposed specifically for discrimination by Zhang
and Mai (2019), who called it the envelope discriminate subspace (ENDS)
model. Based on model (7.13) we determine class membership by using (7.7)
in combination with the multivariate model for the reduced features

Γ^T Xi = α0 + ηYi + εi , i = 1, . . . , n, (7.14)
ΣΓ^TX|C = Ω,
η = (η1 , . . . , ηr ),
α0 = Γ^T µ0 .

This results in the central discriminant subspace DC|ΓT X = span(Ω−1 η) and


corresponding classification function (7.12) expressed in terms of the reduced
features

φlda (Γ^T X) = arg max_{k∈χ} {log(πk /π0 ) + ηk^T Ω^{-1} (Γ^T X − (ηk + 2Γ^T µ0 )/2)}. (7.15)

7.3.2.1 Maximum likelihood envelope estimators

Maximum likelihood estimators of the unknown parameters in (7.15) can be


constructed by adapting the estimators given in Section 2.5.3 using the fea-
tures X as the response and the class indicator vector Y as the predictor:

Γ̂ = arg min_G (log |G^T SX|Y G| + log |G^T SX^{-1} G|) (7.16)
η̂ = Γ̂^T SX,Y SY^{-1} = (η̂1 , . . . , η̂r )
Γ̂η̂ = PΓ̂ SX,Y SY^{-1} ,
µ̂0 = X̄ − Γ̂η̂Ȳ
Ω̂ = Γ̂^T SX|Y Γ̂,
Ω̂0 = Γ̂0^T SX Γ̂0 ,
Σ̂X|C = Γ̂Ω̂Γ̂^T + Γ̂0 Ω̂0 Γ̂0^T ,

where min_G is over all semi-orthogonal matrices G ∈ R^{p×u} and (Γ̂, Γ̂0 ) is an or-
thogonal matrix. The fully maximized log-likelihood for fixed u is then

L̂u = −(np/2) log(2π) − np/2 − (n/2) log |SX |
     − (n/2) log |Γ̂^T SX|Y Γ̂| − (n/2) log |Γ̂^T SX^{-1} Γ̂|.
X

The estimators Γ̂, η̂k , µ̂0 and Ω̂ are now substituted into (7.15) to determine
the class with the maximum estimated probability.

7.3.2.2 Discrimination via PLS

Recall from the discussion of Chapter 3 that NIPALS and SIMPLS estimate
the predictor envelope EΣX(B) in the regression of Y on X. However, we also
know from the discussions of Sections 2.4 and 3.11 that this envelope is the
same as the response envelope in the regression of X on Y , which is exactly
what appears in model (7.13). In other words, beginning with classification
indicator vectors Yi and corresponding features Xi we can use either NIPALS
or SIMPLS weight matrices as estimates of a basis Γ b pls for EΣ (B), which is
X

then used in place of Γ b from (7.16) to construct the remaining estimators for
substitution into (7.15). The adaptation of PLS algorithms to discrimination
problems is then seen to be straightforward, the methodology for constructing
Γ̂pls being covered by the general discussions in previous chapters. In partic-
ular, the PLS algorithms are not hindered by the requirement that n > p
and their asymptotic behavior in high dimensions is governed by the results
summarized in Chapter 4.
To be clear, a procedure for computing the PLS discrimination function is
outlined as follows.

1. Run a NIPALS or SIMPLS algorithm with response Y and predictor X and


extract the weight matrix W, normalized if necessary so that W TW = Iu.
Set Γ̂pls = W.

2. Construct the required estimators based on the estimators at (7.16) as
follows.

\begin{align*}
\widehat{\eta}_{\mathrm{pls}} &= \widehat{\Gamma}_{\mathrm{pls}}^T S_{X,Y} S_Y^{-1} = (\widehat{\eta}_{1,\mathrm{pls}}, \ldots, \widehat{\eta}_{r,\mathrm{pls}}) \\
\widehat{\mu}_{0,\mathrm{pls}} &= \bar{X} - \widehat{\Gamma}_{\mathrm{pls}}\widehat{\eta}_{\mathrm{pls}}\bar{Y} \\
\widehat{\Omega}_{\mathrm{pls}} &= \widehat{\Gamma}_{\mathrm{pls}}^T S_{X|Y} \widehat{\Gamma}_{\mathrm{pls}}.
\end{align*}

3. Substitute these estimators into (7.15) to get the estimated PLS discrim-
inant function:

\[
\widehat{\phi}_{\mathrm{lda}}(\Gamma^T X) = \arg\max_{k \in \chi}
\left\{ \log(\pi_k/\pi_0) + \widehat{\eta}_{k,\mathrm{pls}}^T \widehat{\Omega}_{\mathrm{pls}}^{-1}
\bigl(\widehat{\Gamma}_{\mathrm{pls}}^T X - (\widehat{\eta}_{k,\mathrm{pls}} + 2\widehat{\Gamma}_{\mathrm{pls}}^T \widehat{\mu}_{0,\mathrm{pls}})/2\bigr) \right\}.
\]
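A minimal sketch of the three steps is given below, assuming scikit-learn's NIPALS-based PLSRegression is an acceptable stand-in for the NIPALS weight computation (its x_weights_ matrix is taken as W); the function names and the simulated data are ours, not the authors' software. Applying standard LDA to the compressed features W TX reproduces the rule in step 3 up to the usual plug-in estimates.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pls_lda_fit(X, classes, n_components):
    """Step 1: PLS weights from the indicator response; steps 2-3: LDA on W'X."""
    labels = np.unique(classes)
    Y = (classes[:, None] == labels[None, 1:]).astype(float)  # class 0 is the baseline
    pls = PLSRegression(n_components=n_components, scale=False).fit(X, Y)
    W, _ = np.linalg.qr(pls.x_weights_)                       # normalize so W'W = I_u
    lda = LinearDiscriminantAnalysis().fit(X @ W, classes)
    return W, lda

def pls_lda_predict(W, lda, X_new):
    return lda.predict(X_new @ W)

# Illustrative use on simulated data with three classes and p = 50 features.
rng = np.random.default_rng(1)
n, p = 90, 50
classes = rng.integers(0, 3, size=n)
X = rng.standard_normal((n, p))
X[:, 0] += 1.5 * (classes == 1)          # separate class 1 along the first feature
X[:, 1] += 1.5 * (classes == 2)          # separate class 2 along the second feature
W, lda = pls_lda_fit(X, classes, n_components=3)
print(np.mean(pls_lda_predict(W, lda, X) == classes))
```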

7.4 Discrimination via PFC with ΣX|C > 0


Discrimination via PFC with isotropic errors ΣX|C = σ 2 Ip was discussed in
Section 7.3.1. In this section we discuss PFC with anisotropic errors ΣX|C > 0.
In the context of model (7.10), which has anisotropic errors ΣX|C > 0,
PFC and RR regression models (Anderson, 1999; Izenman, 1975; Reinsel and
Velu, 1998) are the same and produce the same maximum likelihood estima-
tors of µ0 , β and ΣX|C . Aspects of these models are related to our discussion
of bilinear PLS models in Section 5.12. Estimators, including methods for
inferring about d, were described by Cook and Forzani (2008b, Sec 3.1) in
the context of PFC and by Cook, Forzani, and Zhang (2015, Sec 2.1) in the
context of RR regression. The estimators can be obtained from either source,
although those given by Cook et al. (2015) are more succinct. The estimators
were also summarized by Cook (2018, Sec. 9.10.2).
On the other hand, PFC regressions include subspace estimation, while
RR regressions normally do not. The PFC family of methods covers possibili-
ties, like the isotropic and extended PFC models, that are natural extensions of
probabilistic principal components (Tipping and Bishop, 1999) to regression
problems and that are not related to RR regression. Nevertheless, the meth-
ods are equivalent under model (7.10) and so we use the reference acronym
PFC-RR to denote model (7.10) and quantities derived therefrom.
Define

\[
C_{X,Y} = S_X^{-1/2} S_{X,Y} S_Y^{-1/2}
\]

to be the matrix of sample correlations between the elements of the stan-
dardized vectors S_X^{-1/2}X and S_Y^{-1/2}Y, with singular value decomposition
C_{X,Y} = U D V^T. Extending this notation, let C^{(d)}_{X,Y} = U_d D_d V_d^T, where
U_d ∈ Rp×d and V_d ∈ Rr×d consist of the first d columns of U and V, and
D_d is the diagonal matrix consisting of the first d singular values of C_{X,Y}. We
also use C_{Y,X} = C^T_{X,Y}. Then, assuming normal errors, that d is given and that
r < p, the maximum likelihood estimators of the parameters in model (7.10)
are (Cook et al., 2015, Sec 2.1)

\begin{align}
\widehat{\mu}_0 &= \bar{X} - \widehat{\beta}^T \bar{Y} \tag{7.17} \\
\widehat{\beta}^T &= (\widehat{\beta}_1, \ldots, \widehat{\beta}_r) = S_X^{1/2} C^{(d)}_{X,Y} S_Y^{-1/2} \tag{7.18} \\
\widehat{\Sigma}_{X|C} &= S_X - \widehat{\beta}^T S_{Y,X}
  = S_X^{1/2}\bigl(I_p - C^{(d)}_{X,Y} C^{(d)}_{Y,X}\bigr) S_X^{1/2}. \tag{7.19}
\end{align}

The usual OLS estimators are obtained by setting d = r so there is no rank


reduction. Once the PFC-RR estimators given in (7.17) – (7.19) are available
they can be substituted into (7.7) to obtain the estimated classification func-
tion. The dimension d can then be selected by cross validation using the error
rate as a criterion.
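The estimators (7.17)–(7.19) translate directly into code. The sketch below is ours (the helper names are illustrative): Y is the indicator coding of the classes used earlier in this chapter, and it presumes n ≫ p so that SX is nonsingular, as the text notes.

```python
import numpy as np

def _mat_power(S, power):
    """Symmetric matrix power via the eigendecomposition (S assumed positive definite)."""
    w, V = np.linalg.eigh(S)
    return (V * w**power) @ V.T

def pfc_rr_fit(X, Y, d):
    """PFC-RR estimators (7.17)-(7.19); X is n x p features, Y is the n x r indicator matrix."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    S_X, S_Y, S_XY = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    C = _mat_power(S_X, -0.5) @ S_XY @ _mat_power(S_Y, -0.5)     # sample correlations C_{X,Y}
    U, D, Vt = np.linalg.svd(C, full_matrices=False)
    C_d = U[:, :d] @ np.diag(D[:d]) @ Vt[:d, :]                  # rank-d truncation C^{(d)}_{X,Y}
    beta_T = _mat_power(S_X, 0.5) @ C_d @ _mat_power(S_Y, -0.5)  # (7.18), a p x r matrix
    mu0 = X.mean(0) - beta_T @ Y.mean(0)                         # (7.17)
    Sigma_XgC = S_X - beta_T @ (Yc.T @ Xc / n)                   # (7.19)
    return mu0, beta_T, Sigma_XgC
```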

General estimation methods like that based on PFC-RR model (7.10) and
the corresponding PFC discriminant function require that n ≫ p to ensure
that Σ^{-1}_{X|C} can be well estimated. The same is true of the envelope discrimina-
tion discussed in Section 7.3.2. But PLS discrimination discussed at the end of
Section 7.3.2 and PFC discrimination with isotropic errors may be serviceable
without requiring n ≫ p. Of these two, PLS fitting is surely the more versatile
since it does not require isotropic errors.

7.5 Discrimination via PLS and PFC-RR


7.5.1 Envelopes and PFC-RR model (7.10) combined
Discrimination based on PFC-RR model (7.10) is designed to exploit aspects
of regression of X on Y that are different from those exploited by envelope
model (7.13). In PFC-RR model (7.10) there is no modeled connection be-
tween β T = Bb and the error covariance matrix ΣX|C as there is in (7.13).
Envelope model (7.13) makes no direct allowance for the possibility that β
is rank deficient, rank(β T ) = dim(B 0 ) = d < min(p, r). These distinctions
have consequences. In binary classification, r = 1 and p > 1, so β T ∈ Rp and
its only possible ranks are 0 and 1. In this case PFC-RR is not useful, while
an envelope can still lead to substantial gains if span(β T ) is contained in a
low-dimensional reducing subspace of ΣX|C . More generally, PFC-RR offers
no gain when β is full rank, while envelopes can still produce substantial gain.
On the other hand, it is also possible to have situations where envelopes offer
no gain, while PFC-RR provides notable gain. For instance, if r > 1, p > 1 and
d = 1, so PFC-RR gives maximal gain, it is still possible that EΣX|C (B 0 ) = Rp
so that envelopes produce no gain.

There are also notable differences between the ways in which gains are
produced by envelope and by PFC-RR regressions. The gain from PFC-RR
regression results primarily from the reduction in the number of real param-
eters needed to specify β (Cook, 2018, Sec. 9.2.1). On the other hand, the gain
from a response envelope is due to the reduction in the number of parameters
and to the structure of ΣX|C = ΓΩΓT + Γ0 Ω0 ΓT0 , with massive gains possible
depending on the relationship between kΩk and kΩ0 k.
These contrasts lead to the conclusion that envelope and PFC-RR regres-
sions (7.10) are distinctly different methods of dimension reduction with differ-
ent operating characteristics. Reasoning in the equivalent context of reduced-
rank regression, Cook et al. (2015) combined PFC-RR and response envelopes,
leading to a new dimension reduction paradigm that can automatically choose
the better of the two methods and, if appropriate, can also give an estimator
that does better than both of them.
When formulating envelope model (7.13) no explicit accommodation was
included for the dimension d of B0. The PFC-RR envelope model includes
such an accommodation by starting with model (7.10) and then incorporating
a semi-orthogonal basis matrix Γ ∈ Rp×u for EΣX|C (span(B)):

\[
X_i = \mu_0 + \Gamma\eta b\, Y_i + \varepsilon_i, \quad i = 1, \ldots, n, \tag{7.20}
\]
\[
\Sigma_{X|C} = \Gamma\Omega\Gamma^T + \Gamma_0\Omega_0\Gamma_0^T,
\]

where β T = Γηb, η ∈ Ru×d, b ∈ Rd×r as defined for (7.10) and the re-
maining parameters are as defined in (7.13). This model contains two tuning
dimensions, u and d, that need to be determined subject to the constraints
0 ≤ d ≤ u ≤ min(p, r). Maximum likelihood estimators of the parameters in
this model as well as suggestions for determining the tuning parameters are
available from Cook et al. (2015). These estimators can then be substituted
into classification function (7.7).

7.5.2 Bringing in PLS


Cook et al. (2015) showed that, in large samples, PFC-RR envelope estima-
tion converges to a two-stage procedure with the first stage based on envelope
model (7.13) and the second based on a PFC-RR regression model starting
from envelope-reduced model (7.14). Recognizing the advantages of using PLS
instead of maximum likelihood estimation in the first stage when n is not large
relative to min(p, r), leads to the following outline of the possibilities.

First, neglect d and estimate the envelope basis Γ ∈ Rp×u using either

(i) maximum likelihood for envelope model (7.13) if n  min(p, r), or


(ii) PLS from the regression of Y on X, as discussed at the end of Sec-
tion 7.3.2.

Standard methods based on cross validation or a holdout sample can be


used to determine its dimension u. Let Γ̂ denote the estimated basis from
either PLS or MLE. Selecting from the list following (7.16), the estimators
needed from this fit are Γ̂, µ̂0 and Σ̂X|C.

Second, substitute Γ̂ into model (7.14) and then use that as the basis for a
PFC-RR regression. This corresponds to fitting the working RR model

\[
\widehat{\Gamma}^T X_i = \alpha_0 + G b\, Y_i + e_i, \quad i = 1, \ldots, n,
\qquad \Sigma_{\widehat{\Gamma}^T X|C} = \Omega,
\]

where G ∈ Ru×d, b ∈ Rd×r and, in reference to model (7.14), η = Gb. Let
α̃0, G̃, b̃, and Ω̃ denote the estimators coming from the reduced rank fit of
this working model. Then the corresponding estimator of η is η̃ = G̃b̃. The
dimension d can be determined by cross validation or a holdout sample
based on this working RR model.

Recalling that in the population α0 = ΓT µ0, the estimated classification
function is then found by substituting α̃0, η̃, and Ω̃ into (7.15), giving an
estimated classification function based on the compressed features Γ̂TX:

\[
\phi(\widehat{\Gamma}^T X) = \arg\max_{k \in \chi}
\left\{ \log(\pi_k/\pi_0) + \tilde{\eta}_k^T \tilde{\Omega}^{-1}
\bigl\{\widehat{\Gamma}^T X - (\tilde{\eta}_k + 2\tilde{\alpha}_0)/2\bigr\} \right\}. \tag{7.21}
\]
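A compact sketch of the two-stage procedure is given below, again leaning on scikit-learn's PLSRegression for the first stage; the second stage mirrors the PFC-RR computation shown after (7.19), applied to the compressed features Γ̂TX. All names are ours and the code is illustrative rather than a transcription of the authors' software.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def _mat_power(S, power):
    w, V = np.linalg.eigh(S)
    return (V * w**power) @ V.T

def two_stage_fit(X, classes, u, d):
    """Stage 1: PLS weights W (u components). Stage 2: rank-d fit of the working
    model for W'X, returning alpha0~, eta~ = G~ b~ and Omega~ for use in (7.15)."""
    labels = np.unique(classes)
    Y = (classes[:, None] == labels[None, 1:]).astype(float)   # indicators, class 0 baseline
    W, _ = np.linalg.qr(PLSRegression(n_components=u, scale=False).fit(X, Y).x_weights_)
    Z = X @ W                                                  # compressed features, n x u
    n = Z.shape[0]
    Zc, Yc = Z - Z.mean(0), Y - Y.mean(0)
    S_Z, S_Y, S_ZY = Zc.T @ Zc / n, Yc.T @ Yc / n, Zc.T @ Yc / n
    C = _mat_power(S_Z, -0.5) @ S_ZY @ _mat_power(S_Y, -0.5)
    U, D, Vt = np.linalg.svd(C, full_matrices=False)
    C_d = U[:, :d] @ np.diag(D[:d]) @ Vt[:d, :]
    eta = _mat_power(S_Z, 0.5) @ C_d @ _mat_power(S_Y, -0.5)   # u x r rank-d coefficient
    alpha0 = Z.mean(0) - eta @ Y.mean(0)
    Omega = S_Z - eta @ (Yc.T @ Zc / n)
    return W, alpha0, eta, Omega
```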

7.6 Overview of LDA methods


We have now considered seven procedures for estimating the discriminant
function φ. These are based on the five models listed below along with the
number of real parameters in each:

1. Standard model (7.8): p + pr + p(p + 1)/2



2. PFC-RR model (7.10): p + d(p − d) + rd + p(p + 1)/2

3. Isotropic PFC model, Sec. 7.3.1: p + d(p − d) + rd + 1

4. MLE based on envelope model (7.13): p + ru + p(p + 1)/2

5. PLS based on envelope model (7.13): p + ru + p(p + 1)/2

6. MLE based on envelope PFC-RR model (7.20): p+d(u−d)+rd+p(p+1)/2

7. PLS based on envelope PFC-RR model (7.20): p+d(u−d)+rd+p(p+1)/2

Our personal preference would be to use envelope methods 4 and 6 if


n ≫ min(p, r) since these methods subsume the standard and PFC-RR mod-
els 1 and 2. We would use the isotropic PFC model when there is good reason
to set ΣX|C = σ 2 Ip . The PLS methods 5 and 7 are at their best when n is not
large relative to min(p, r), perhaps n < min(p, r). In all cases we tend to pre-
fer cross validation based on classification rates for dimension determination,
although information criteria are available for the likelihood-based methods.
We would normally use a holdout sample to estimate the rate of misclassifi-
cation for the selected method. Finally, recalling the discussion of Chapter 4,
the envelope and PLS methods discussed here should work well in abundant
discrimination problems where many features contribute information on the
classification. They are not recommended in sparse problems where p is large
and yet few features inform on classification.
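For dimension determination by cross validation based on classification rates, a sketch along the following lines can be used for the PLS method; the helper name, the use of scikit-learn's StratifiedKFold, and the candidate range for u are our choices and are illustrative only.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold

def cv_rate(X, classes, u, n_splits=10, seed=0):
    """Estimated correct-classification rate for u PLS components, by K-fold CV."""
    labels = np.unique(classes)
    hits = 0
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, classes):
        Y = (classes[tr][:, None] == labels[None, 1:]).astype(float)
        pls = PLSRegression(n_components=u, scale=False).fit(X[tr], Y)
        W, _ = np.linalg.qr(pls.x_weights_)
        lda = LinearDiscriminantAnalysis().fit(X[tr] @ W, classes[tr])
        hits += np.sum(lda.predict(X[te] @ W) == classes[te])
    return hits / len(classes)

# Pick u as a maximizer of the estimated rate over a candidate range, e.g.
# u_hat = max(range(1, 16), key=lambda u: cv_rate(X, classes, u))
```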

7.7 Illustrations
The likelihood-based classification methods listed as items 2, 4, and 6 in Sec-
tion 7.6 will eventually dominate for a sufficiently large sample, since the
methods inherit optimality properties from general likelihood theory. Our pri-
mary goal for this section is to provide some intuition into the methods that
do not require a large sample size, principally isotropic PFC and PLS, meth-
ods 3, 5 and 7. We choose two data sets – Coffee data and Olive Oil data –
from the literature because these data sets have been used in studies to com-
pare classification methods for small samples. That enabled us to compare our
methods with other methods without the need to implement them.

7.7.1 Coffee
Zheng, Fu, and Ying (2014), using data from Downey, Briandet, Wilson, and
Kemsley (1997), compared five discrimination methods on their ability to
predict one of two coffee varieties, Arabica and Robusta. The data consist of
29 Arabica and 27 Robusta species with corresponding Fourier transform in-
frared spectral features obtained by sampling at p = 286 wavelengths. The five
methods they compared, including PLS discriminant analysis, are named in
Table 7.1. Detailed descriptions of the methods are available from their article.
The last five entries in the second row of Table 7.1 give the rates of correct
classification, estimated by using leave-one-out cross validation, from Zheng
et al. (2014, Table 1). The second and third entries in the second row are the
rates based on leave-one-out cross validation that we observed by applying
methods 3 (ISO) and 5 (PLS) listed in Section 7.6. Using 10-fold cross val-
idation, we chose 3 components for PLS and 4 components for classification
via isotropic PFC. For the fourth entry, PLS+PFC, we first applied PLS and
then used PFC to further reduce the compressed PLS features. This resulted
in 3 PLS components and one PFC component. The linear classification rule
was then based on the PLS+PFC compressed feature. Our implementation of
PLS did better than three of the five methods used by Zheng et al. (2014) and
did the same as two of their methods at 100% accuracy.
These data are sensitive to the particular partition used to conduct the 10-fold
cross validation. Depending on the seed to start the pseudo-random sampling,

TABLE 7.1
Olive oil and Coffee data: Estimates of the correct classification rates (%) from
leave one out cross validation.

From Zheng et al. (2014)


Dataset PLS ISO PLS+PFC KNN LS-SVM PLS-DA BP-ANN ELM
Coffee 100 91.1 100 82.2 97.5 100 94.8 100
Olive Oil 99.2 88.3 93.3 83.2 95.1 93.1 90.0 97.4

ISO and PLS refer to methods 3 and 5 as listed in Section 7.6. The remaining
designations are those used by Zheng et al. (2014). KNN: k-nearest neighbor.
LS-SVM: least-squares support vector machine. PLS-DA: partial least-squares
discriminant analysis. BP-ANN: back propagation artificial neural network.
ELM: extreme learning machine.


FIGURE 7.2
Coffee data: Plots of estimated classification rate (accuracy) versus the number
of components

we estimated 3 or 5 components. With 3 components, the subsequent leave-


one-out validation gave 100 percent accuracy, as reported in Table 7.1. With
5 components we observed one misclassification, resulting in a correct classi-
fication rate of 55/56 = 0.9821. Shown in Figure 7.2 is a plot of the rate of
correct classification versus the number of components. As long as one avoids
1, 2 and 5 components, the percent of correct classification is 100.
The marked plot of the first two PLS components in Figure 7.3a shows
good separation, although PLS required three components in total. The plot of
the PLS+PFC direction in Figure 7.3b shows perfect separation. For contrast,
Figure 7.3c shows a marked scatterplot of the first two ISO components.
As a benchmark, we randomly permuted the classification labels for the
coffee data and re-ran our implementation of PLS. Again using 10-fold cross
validation, we selected 4 components, which subsequently gave a leave-one-
out classification rate of 48.21%, arising from 27 correct classifications.

7.7.2 Olive oil


Zheng et al. (2014) also employed a spectral dataset from Tapp, Defernez,
and Kemsley (2003) to see if their methods could distinguish olive oils by

FIGURE 7.3
Coffee data: Plots of PLS, PFC, and Isotropic projections. (a) PLS projections; (b) PLS+PFC projections; (c) Isotropic (ISO) projections.

their country of origin. There were 60 authenticated extra virgin olive oils
from four countries: 10 from Greece, 17 from Italy, 8 from Portugal, and 25
from Spain. The p = 570 features were obtained from Fourier transform in-
frared spectroscopy of each of the 60 samples. The analyses of these data
parallel those for the Coffee data in Section 7.7.1. The percentages of correct

FIGURE 7.4
Olive oil data: Plots of PLS, PLS+PFC, and Isotropic projections. (a) PLS projections; (b) PLS+PFC projections; (c) Isotropic (ISO) projections.

classification are shown in the third row of Table 7.1 and the corresponding
graphics are shown in Figure 7.4. Our implementation of PLS gave notably
better results than the PLS-DA method of Zheng et al. (2014). We have no
explanation for the difference.
Our implementation of 10-fold cross validation gave 28 components for
PLS and 28 components for ISO. Marked plots of the first two PLS and ISO
components are shown in panels a and c of Figures 7.4. The separation is not

very clear, perhaps signaling the need for more components. Application of
PLS+PFC resulted in 17 PLS components and 2 PFC components based on
the PLS components. A marked plot of the two PFC components is shown
in Figure 7.4b, where we observe perfect separation of the four classes. One
advantage of the PLS+PFC method is its ability to allow informative low
dimensional plots, as illustrated here.
8

Quadratic Discriminant Analysis

Quadratic discriminant analysis (QDA) proceeds under the same model (7.5)
as linear discriminant analysis, except that now the conditional covariance ma-
trices are no longer assumed to be the same, so we may have ΣX|C=k ≠ ΣX|C=j
for k ≠ j. Let Σk = ΣX|(C=k). The nominal stochastic structure underlying
QDA for predicting the class C ∈ χ based on a vector of continuous features
X ∈ Rp is then

X | (C = k) ∼ N (µk , Σk ), Pr(C = k) = πk , k ∈ χ. (8.1)


As in linear discriminant analysis, χ = {0, 1, . . . , r}, πk > 0 and $\sum_{k=0}^{r} \pi_k = 1$,
but here the intra-class covariance matrices Σk are permitted to be unequal
over the classes. Under this model, the Bayes’ rule is to classify a new unit
with feature vector X onto the class with maximum posterior probability.
From (7.4) the classification function is

\begin{align}
\phi_{\mathrm{qda}}(X) &= \arg\max_{k \in \chi} \Pr(C = k \mid X) \notag \\
 &= \arg\max_{k \in \chi} \left\{ \log(\pi_k) + \tfrac12 \log|\Sigma_k| + \tfrac12 (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) \right\} \tag{8.2} \\
 &= \arg\max_{k \in \chi} \left\{ a_k - X^T(\Sigma_k^{-1}\mu_k - \Sigma_0^{-1}\mu_0) + \tfrac12 X^T(\Sigma_k^{-1} - \Sigma_0^{-1})X \right\} \tag{8.3} \\
 &= \arg\max_{k \in \chi} \Bigl\{ a_k - X^T\bigl[(\Sigma_k^{-1} - \Sigma^{-1})\mu_k + \Sigma^{-1}(\mu_k - \mu)\bigr] + \tfrac12 X^T(\Sigma_k^{-1} - \Sigma^{-1})X \Bigr\}, \tag{8.4}
\end{align}

where

\begin{align*}
a_k &= \log(\pi_k) + (1/2)\log|\Sigma_k| + (1/2)\mu_k^T \Sigma_k^{-1} \mu_k \\
\Sigma &= \sum_{k \in \chi} \pi_k \Sigma_k = E(\mathrm{var}(X \mid C)) \\
\mu &= \sum_{k \in \chi} \pi_k \mu_k.
\end{align*}

Each of these three forms (8.2)–(8.4) may be useful depending on con-


text. Form (8.2) arises directly from inspection of the posterior probability
Pr(C = k | X). An expanded version of this form was used by Pardoe, Yin,
and Cook (2007) in a study of graphical methods for quadratic discrimina-
tion. In form (8.3), which was used by Zhang and Mai (2019) in their study
of quadratic discrimination, the terms are shifted by using the first category
(µ0 , Σ0 ) as a reference point. The average variance and average mean are used
for centering in (8.4). Similar forms arise in the likelihood-based sufficient di-
mension reduction methodology developed by Cook and Forzani (2009). In
developing (8.2)–(8.4), addends not depending on k were removed.
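For reference, classification to the class with the largest estimated posterior can be computed directly when every class covariance matrix is nonsingular. The sketch below is ours; it uses the sample class proportions, means and covariance matrices as plug-ins together with scipy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def qda_fit(X, classes):
    """Plug-in estimates of (pi_k, mu_k, Sigma_k) for each class."""
    params = {}
    for k in np.unique(classes):
        Xk = X[classes == k]
        params[k] = (len(Xk) / len(X), Xk.mean(0), np.cov(Xk, rowvar=False))
    return params

def qda_predict(params, X_new):
    """Classify each row of X_new to the class with the largest estimated posterior."""
    keys = list(params)
    scores = np.column_stack([
        np.log(pi) + multivariate_normal(mu, Sig).logpdf(X_new)
        for pi, mu, Sig in (params[k] for k in keys)])
    return np.asarray(keys)[np.argmax(scores, axis=1)]
```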
The maximum posterior probability for LDA contains only linear terms
in X and the central discriminant subspace (Definition 7.1) is DC|X = Σ^{-1}_{X|C}B0,
while φqda(X) contains both linear and quadratic terms in X. If the intraclass
covariance matrices are equal, then Σk = Σ = ΣX|C, where ΣX|C is the nota-
tion used in Chapter 7 for the common intra-class covariance matrix. Using Σ
to represent the common covariance matrix, the three forms (8.2)–(8.4) reduce
straightforwardly to a form that is equivalent to (7.6):

\begin{align*}
\phi_{\mathrm{qda}}(X) &= \arg\max_{k \in \chi} \left\{ \log(\pi_k) + (1/2)(X - \mu_k)^T \Sigma^{-1} (X - \mu_k) \right\} \\
 &= \arg\max_{k \in \chi} \left\{ a_k - X^T \Sigma^{-1} (\mu_k - \mu_0) \right\} \\
 &= \arg\max_{k \in \chi} \left\{ a_k - X^T \Sigma^{-1} (\mu_k - \mu) \right\},
\end{align*}

where

\begin{align*}
a_k &= \log(\pi_k) + (1/2)\mu_k^T \Sigma^{-1} \mu_k \\
\Sigma &= \Sigma_{X|C}.
\end{align*}

If the intraclass means are equal, µk = µ where µ represents the common
mean, then the three forms (8.2)–(8.4) of the classification function reduce to

\begin{align*}
\phi_{\mathrm{qda}}(X) &= \arg\max_{k \in \chi} \left\{ \log(\pi_k) + \tfrac12\log|\Sigma_k| + \tfrac12 (X - \mu)^T \Sigma_k^{-1} (X - \mu) \right\} \\
 &= \arg\max_{k \in \chi} \left\{ a_k - X^T(\Sigma_k^{-1} - \Sigma_0^{-1})\mu + \tfrac12 X^T(\Sigma_k^{-1} - \Sigma_0^{-1})X \right\} \\
 &= \arg\max_{k \in \chi} \left\{ a_k - X^T(\Sigma_k^{-1} - \Sigma^{-1})\mu + \tfrac12 X^T(\Sigma_k^{-1} - \Sigma^{-1})X \right\},
\end{align*}

where

\begin{align*}
a_k &= \log(\pi_k) + (1/2)\log|\Sigma_k| + (1/2)\mu^T \Sigma_k^{-1} \mu \\
\Sigma &= \sum_{k \in \chi} \pi_k \Sigma_k.
\end{align*}

With sufficient observations per class, φqda(X) can be estimated consis-
tently by substituting sample versions of Σk and µk, k = 0, . . . , r. However,
as with linear discriminant analysis, we strive to reduce the dimension of
the feature vectors without loss of information and thereby reduce the rate
of misclassifications. It may be clear from the above forms that the feature
vectors furnish classification information through the scaled mean deviations
Σ^{-1}(µk − µ) and the precision deviations Σk^{-1} − Σ^{-1}. These deviations will play a
key role when pursuing dimension reduction for quadratic discriminant anal-
ysis.
Beginning with a simple random sample (Ci, Xi), i = 1, . . . , n, let nk de-
note the number of observations in class C = k so the total sample size can be
represented as n = $\sum_{k=0}^{r} n_k$. In the remainder of this section, we frequently
take πk = nk/n, as will often be appropriate in applications. This also
facilitates presentation, particularly connections with other methods.

8.1 Dimension reduction for QDA


Zhang and Mai (2019) showed that for model (8.1) the central discriminant
subspace is again equal to the central subspace, DC|X = SC|X . Consequently
we can pursue dimension reduction via the central subspace, which was in-
troduced in Section 2.1. In their development of likelihood-based sufficient di-
mension reduction, Cook and Forzani (Likelihood acquired directions (LAD);

2009, Thm 1 and Prop 1) proved the key result shown in Proposition 8.1. In
preparation, define

βk = µk − µ, k ∈ χ
T
β = (β0 , β1 , . . . , βr )
B0 = span(β T ).

This definition of βk differs from that used for linear discriminant analysis
(see just below (7.7)). Here µ is used for centering rather than µ0 . Recall also
that, from Definition 2.1, a subspace S ⊆ Rp with the property C ⫫ X | PS X
is called a dimension reduction subspace for the regression of C on X and
that the central subspace SC|X is the intersection of all dimension reduction
subspaces.

Proposition 8.1. Assume model (8.1), let S be a subspace of Rp with semi-
orthogonal basis matrix Ψ and let d = dim(S). Then S is a dimension reduc-
tion subspace for the regression of C on X if and only if the following two
conditions are satisfied

(i) Σ^{-1}B0 ⊆ S

(ii) Σk^{-1} = Σ^{-1} + Ψ∆k(Ψ)ΨT for all k ∈ χ,

where

\begin{align*}
\Delta_k(\Psi) &= (\Psi^T \Sigma_k \Psi)^{-1} - (\Psi^T \Sigma \Psi)^{-1} \\
 &= \{\mathrm{var}(\Psi^T X \mid C = k)\}^{-1} - \{\mathrm{ave}_{j \in \chi}\, \mathrm{var}(\Psi^T X \mid C = j)\}^{-1}.
\end{align*}

Additionally, condition (ii) is equivalent to condition

(iii) Σk = Σ + P^T_{Ψ(Σ)}(Σk − Σ)P_{Ψ(Σ)} for all k ∈ χ.

In Proposition 8.1 the dimension reduction subspace S need not be the


smallest; that is, it need not be the central subspace. Later in this
chapter we will be using this proposition for applications in which S is not
necessarily minimal.
Condition (i) of Proposition 8.1 tells us that Σ^{-1}(µk − µ) ∈ S for all k ∈ χ.
Consequently, for each k there is a bk ∈ Rd so that Σ^{-1}(µk − µ) = Ψbk. Sub-
stituting this and condition (ii) of Proposition 8.1 into φqda (X) given in (8.4)
we obtain a reduced form of the quadratic discrepancy function. This reduced

discrepancy function can be seen to arise by reducing X ↦ ΨTX ∈ Rd and


then rewriting (8.4) in terms of the moments of ΨTX:

E(ΨTX | C = k) = ΨT µk
var(ΨTX | C = k) = Ψ T Σk Ψ
E{var(ΨTX | C)} = ΨT ΣΨ.

Using these moments in conjunction with (8.4) gives the reduced classification
function

\[
\phi_{\mathrm{qda}}(\Psi^T X) = \arg\max_{k \in \chi} \bigl\{ a_k - L_k(\Psi^T X) + Q_k(\Psi^T X) \bigr\}, \tag{8.5}
\]

where

\[
a_k = \log(\pi_k) + (1/2)\log|\Psi^T \Sigma_k \Psi| + (1/2)\mu_k^T \Psi(\Psi^T \Sigma_k \Psi)^{-1} \Psi^T \mu_k,
\]

the linear terms are represented by

\[
L_k(\Psi^T X) = X^T \Psi\bigl[\bigl\{(\Psi^T \Sigma_k \Psi)^{-1} - (\Psi^T \Sigma \Psi)^{-1}\bigr\}\Psi^T \mu_k + (\Psi^T \Sigma \Psi)^{-1}\Psi^T(\mu_k - \mu)\bigr],
\]

the quadratic terms by

\[
Q_k(\Psi^T X) = (1/2)\, X^T \Psi\bigl[(\Psi^T \Sigma_k \Psi)^{-1} - (\Psi^T \Sigma \Psi)^{-1}\bigr]\Psi^T X,
\]




and Σ and µ are as defined previously. This classification function is computed


in the same way as (8.4), except it is based on d features ΨTX instead of the
original p features X. For this to be useful as a basis for a method for QDA,
we need to specify a dimension reduction subspace S and have available a
method of estimating a basis. Once a basis estimator is available, a sample
version of φqda (ΨTX) can be constructed by substituting ΨTX and its sample
moments into (8.5).
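In practice, once an estimated basis, say Ψ̂, is in hand, the sample version of (8.5) amounts to compressing the features and carrying out QDA on Ψ̂TX. The sketch below (ours) uses scikit-learn's QuadraticDiscriminantAnalysis for that step, which matches (8.5) up to the usual plug-in details.

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def reduced_qda(X, classes, Psi_hat, X_new):
    """QDA on the compressed features Psi_hat' X, a sample analog of (8.5)."""
    qda = QuadraticDiscriminantAnalysis().fit(X @ Psi_hat, classes)
    return qda.predict(X_new @ Psi_hat)
```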
We discuss basis estimation methods in the next sections. The first is based
on taking S to be the central subspace S = SC|X and estimating a basis by
maximum likelihood under model (8.1). The second is based on taking S to
be the Σ-envelope of SC|X , S = EΣ (SC|X ), and estimating a basis by using
maximum likelihood or partial least squares. Choosing S = SC|X with maxi-
mum likelihood estimation is reasonable when the covariance matrices Σk are
well-estimated by their sample versions, so nk ≫ p, k ∈ χ, and intraclass

collinearity is negligible to moderate, since likelihood estimation tends to de-


grade generally if either if these conditions is not met. Choosing S = EΣ (SC|X )
with maximum likelihood estimation is advisable when the covariance matri-
ces Σk are well-estimated by their sample versions and intraclass collinearity
is high in some classes. Choosing S = EΣ (SC|X ) with partial least squares es-
timation should be considered when for some groups Σk is not well-estimated
or the intraclass collinearity is moderate to high.
Recall that we can represent the marginal covariance matrix of the features
X as

\begin{align*}
\Sigma_X &= E(\mathrm{var}(X \mid C)) + \mathrm{var}(E(X \mid C)) \\
 &= \Sigma + \sum_{k \in \chi} \pi_k (\mu_k - \mu)(\mu_k - \mu)^T.
\end{align*}

The ΣX -envelope of SC|X is equal to the Σ-envelope of SC|X , so either could


be used for an envelope version of QDA. We chose EΣ (SC|X ) because it fits
well with past work (e.g. Cook and Forzani, 2009).

8.2 Dimension reduction with S = SC|X


The subspace
\[
\mathcal{L} := \Sigma^{-1}\mathcal{B}_0 = \mathrm{span}\{\Sigma^{-1}(\mu_k - \mu_0) \mid k \in \chi\} = \mathrm{span}\{\Sigma^{-1}(\mu_k - \mu) \mid k \in \chi\},
\]

which is the focus of Proposition 8.1(i), captures the classification information
available from the class means. The subspace

\[
\mathcal{Q} := \mathrm{span}\{\Sigma_k^{-1} - \Sigma_0^{-1} \mid k \in \chi\} = \mathrm{span}\{\Sigma_k^{-1} - \Sigma^{-1} \mid k \in \chi\},
\]

which is the subject of Proposition 8.1(ii), captures the classification infor-


mation available from the class variances. The following corollary restates the
conditions of Proposition 8.1 in terms of the subspaces L and Q and adds the
conclusion that SC|X = L + Q.

Corollary 8.1. Assume model (8.1) and let S be a subspace of Rp . Then S


is a dimension reduction subspace if and only if the following two conditions
are satisfied

(i) L ⊆ S

(ii) Q ⊆ S.

In addition,

(iii) SC|X = DC|X = L + Q.

Conditions (i) and (ii) are restatements of conditions (i) and (ii) in Propo-
sition 8.1. The conclusion that SC|X = L + Q follows because L + Q is a
dimension reduction subspace that is contained in all dimension reduction
subspaces. The conclusion that DC|X = SC|X was demonstrated by Zhang
and Mai (2019).
Cook and Forzani (2009, Thm 2) showed that the maximum likelihood
estimator Φ̂ of a basis Φ for SC|X can be constructed as follows. Let SX and
Sk denote the sample versions of ΣX, the marginal covariance matrix of X,
and Σk, the covariance matrix of X restricted to class C = k. Also let
S = $\sum_{k=0}^{r} (n_k/n) S_k$ denote the sample version of the average covariance matrix Σ.
Then under model (8.1) with fixed dimension d = dim(SC|X ) and Sk > 0 for
all k ∈ χ, the maximum likelihood estimator of a basis matrix Φ for SC|X is

\[
\widehat{\Phi} = \arg\max_{H} \ell_d(H),
\]

where the maximization is over all semi-orthogonal matrices H ∈ Rp×d and
ℓd(H) is the log likelihood function maximized over all parameters except Φ
(Cook and Forzani, 2008a),

\begin{align*}
\ell_d(H) &= C + (n/2)\log|H^T S_X H| - (n/2)\sum_{k=0}^{r} (n_k/n)\log|H^T S_k H| \\
C &= -(np/2)(1 + \log(2\pi)) - (n/2)\log|S_X|.
\end{align*}

Once Φ̂ is available, the sample version of the reduced quadratic classification
function is constructed as

\[
\phi_{\mathrm{qda}}(\widehat{\Phi}^T X) = \arg\max_{k \in \chi} \bigl\{ a_k - L_k(\widehat{\Phi}^T X) + Q_k(\widehat{\Phi}^T X) \bigr\}, \tag{8.6}
\]

where

\[
a_k = \log(\pi_k) + (1/2)\log|\widehat{\Phi}^T S_k \widehat{\Phi}| + (1/2)\bar{X}_k^T \widehat{\Phi}(\widehat{\Phi}^T S_k \widehat{\Phi})^{-1} \widehat{\Phi}^T \bar{X}_k,
\]

and the linear terms are represented by

\[
L_k(\widehat{\Phi}^T X) = X^T \widehat{\Phi}\Bigl[\bigl\{(\widehat{\Phi}^T S_k \widehat{\Phi})^{-1} - (\widehat{\Phi}^T S \widehat{\Phi})^{-1}\bigr\}\widehat{\Phi}^T \bar{X}_k + (\widehat{\Phi}^T S \widehat{\Phi})^{-1} \widehat{\Phi}^T(\bar{X}_k - \bar{X})\Bigr].
\]

The quadratic terms in (8.6) are

\[
Q_k(\widehat{\Phi}^T X) = (1/2)\, X^T \widehat{\Phi}\bigl[(\widehat{\Phi}^T S_k \widehat{\Phi})^{-1} - (\widehat{\Phi}^T S \widehat{\Phi})^{-1}\bigr]\widehat{\Phi}^T X,
\]

where X̄k is the average feature vector in class C = k and X̄ is the overall
average. This reduced classification function may be reasonable when the in-
traclass sample sizes are relatively large, nk ≫ p for all k ∈ χ, and intraclass
collinearity is negligible to moderate.
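As with (7.16), maximizing ℓd(H) over semi-orthogonal matrices requires optimization on a Grassmann manifold and is not shown here. The sketch below (ours, with illustrative names) only evaluates ℓd(H) up to the constant C for a candidate basis H, which is the quantity an optimizer would work with.

```python
import numpy as np

def lad_objective(H, S_X, S_list, n_list):
    """l_d(H), up to the constant C, for a candidate semi-orthogonal basis H of S_{C|X}."""
    n = float(sum(n_list))
    value = (n / 2) * np.linalg.slogdet(H.T @ S_X @ H)[1]
    for S_k, n_k in zip(S_list, n_list):
        value -= (n / 2) * (n_k / n) * np.linalg.slogdet(H.T @ S_k @ H)[1]
    return value
```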

8.3 Dimension reduction with S = EΣ (SC|X )


We begin this section by using envelopes to enhance the structure of model
(8.1) .

8.3.1 Envelope structure of model (8.1)


Let Γ be a semi-orthogonal basis matrix for EΣ (SC|X ), the Σ-envelope of SC|X
with dimension u. The population classification function is then φqda (ΓTX),
so only ΓTX is material to classification. Select Γ0 ∈ Rp×p−u so that (Γ, Γ0 ) is
an orthogonal matrix. It follows from Proposition 1.2 and Corollary 1.1 that,

Σ = ΓΩΓT + Γ0 Ω0 ΓT0 (8.7)


−1
Σ = ΓΩ−1 ΓT + Γ0 Ω−1 T
0 Γ0 .

We see from this that one role of the envelope is to separate Σ into a part ΓΩΓT
that is material to classification and a complementary part Γ0 Ω0 ΓT0 that is
immaterial. Since SC|X ⊆ EΣ (SC|X ), the envelope space EΣ (SC|X ) is a dimen-
sion reduction subspace for the regression of C on X and, from Corollary 8.1,
L ⊆ EΣ (SC|X ) and Q ⊆ EΣ (SC|X ). The following corollary to Proposition 8.1
describes key relationships that will guide the methodology in this section.

Corollary 8.2. Assume the hypotheses in Proposition 8.1. Then

(I) B0 ⊆ Σ SC|X ⊆ EΣ (Σ SC|X ) = EΣ (SC|X ),

(II) span(Σ − Σk) ⊆ Σ SC|X ⊆ EΣ (Σ SC|X ) = EΣ (SC|X ),

(III) span(Σ^{-1} − Σk^{-1}) ⊆ EΣ (Σ SC|X ) = EΣ (SC|X ).

Proof. Since SC|X is a dimension reduction subspace by construction and


since SC|X ⊆ EΣ (SC|X ), it follows that EΣ (SC|X ) is also a dimension reduc-
tion subspace.
To show conclusion (I), we have that Proposition 8.1(i) implies that

B 0 ⊆ ΣSC|X ⊆ Σ EΣ (SC|X ).

Since Σ trivially commutes with itself, Proposition 1.5 implies that

Σ EΣ (SC|X ) = EΣ (Σ SC|X ) = EΣ (SC|X ),

which implies conclusion (I).


Turning to conclusion (II), Proposition 8.1(iii) implies that for each k ∈ χ

span(Σ − Σk ) ⊆ ΣSC|X ⊆ Σ EΣ (SC|X ).

Conclusion (II) now follows by using the argument in the justification of con-
clusion (I).
Conclusion (III) follows immediately from Proposition 8.1(ii). □

To describe the structure of ΣX, we have from Corollary 8.2(I) that there
is a positive definite matrix U so that

\[
\mathrm{var}(E(X \mid C)) = \sum_{k=0}^{r} \pi_k (\mu_k - \mu)(\mu_k - \mu)^T = \sum_{k=0}^{r} \pi_k \beta_k \beta_k^T = \Gamma U \Gamma^T, \tag{8.8}
\]

where Γ is a semi-orthogonal basis matrix for EΣ (SC|X ) as defined at the
outset of this section. Using this result and (8.7) we have

\begin{align}
\Sigma_X &= E(\mathrm{var}(X \mid C)) + \mathrm{var}(E(X \mid C)) \tag{8.9} \\
 &= \Sigma + \Gamma U \Gamma^T \notag \\
 &= \Gamma(\Omega + U)\Gamma^T + \Gamma_0\Omega_0\Gamma_0^T. \tag{8.10}
\end{align}

We next discuss two ways in which this structure can be used to esti-
mate the corresponding reduced classification function φqda (ΓTX). The first
is likelihood-based, requiring relatively large sample settings where nk ≫ p
and so Sk > 0 for all k ∈ χ. The second is when nk is not large relative to p
for some k. This is where PLS comes into play since then the likelihood-based
classification function may be unserviceable or not sufficiently reliable.

8.3.2 Likelihood-based envelope estimation


Su and Cook (2013, eq. (2.3)) and later Zhang and Mai (2019) showed that,
under the QDA model (8.1) with Sk > 0 for all k ∈ χ, a maximum likelihood
estimator of a basis Γ for EΣ (SC|X ) is

\[
\widehat{\Gamma} = \arg\min_{G} \left\{ \log|G^T S_X^{-1} G| + \sum_{k=0}^{r} (n_k/n)\log|G^T S_k G| \right\}. \tag{8.11}
\]

This objective function is similar in structure to ℓd(H) for estimating SC|X,
as described in Section 8.2. There are consequential differences, however, be-
cause ℓd(H) is designed for estimation of the central discriminant subspace,
while Γ̂ is an estimated basis for an upper bound EΣ (SC|X ) on that subspace.
As discussed previously, the bound is intended to accommodate collinearity
among the features.
The maximum likelihood estimators of other key parameters associated
with model (8.1) are (Su and Cook, 2013)

\begin{align*}
\widehat{\mu} &= \bar{X} \\
\widehat{\mu}_k &= \bar{X} + P_{\widehat{\Gamma}}(\bar{X}_k - \bar{X}) \\
\widehat{\Sigma}_k &= P_{\widehat{\Gamma}} S_k P_{\widehat{\Gamma}} + Q_{\widehat{\Gamma}} S_X Q_{\widehat{\Gamma}} \\
\widehat{\Sigma} &= P_{\widehat{\Gamma}} S P_{\widehat{\Gamma}} + Q_{\widehat{\Gamma}} S_X Q_{\widehat{\Gamma}} \\
\widehat{\Sigma}_X &= P_{\widehat{\Gamma}} S_X P_{\widehat{\Gamma}} + Q_{\widehat{\Gamma}} S_X Q_{\widehat{\Gamma}},
\end{align*}

where S = $\sum_{k \in \chi} (n_k/n) S_k$. Substituting these estimators into (8.5) gives the
n o
b TX) = arg max ak − Lk (Γ
φqda (Γ b TX) + Qk (Γ
b TX) (8.12)
k∈χ

where

b T Sk Γ|
ak = log(πk ) + (1/2) log |Γ b + (1/2)X̄kT Γ(
b Γ b −1 Γ
b T Sk Γ) b T X̄k ,

the linear terms are represented by


hn o i
b TX) = X T Γ
Lk (Γ b (Γ b −1 − (Γ
b T Sk Γ) b −1 Γ
b T S Γ) b TX̄k + (Γ b −1 Γ
b T S Γ) b T (X̄k − X̄) ,

the quadratic terms by


n o
b TX) = (1/2)X T Γ
Qk (Γ b (Γ b −1 − (Γ
b T Sk Γ) b −1 Γ
b T S Γ) b TX,

where X̄k is the average feature vector in class C = k and X̄ is the overall
average. This reduced classification function may be advisable when the in-
traclass sample sizes are relatively large, nk ≫ p for all k ∈ χ, and intraclass
collinearity is high for some classes.
Rounding out the discussion, Su and Cook (2013) developed envelope es-
timation of multivariate means from populations with different covariance
matrices. The basic model that they used to develop envelope estimation is
the same as (8.1), although they were interested only in estimation of popula-
tion means and not subsequent classification. They based their methodology
on the following definition of a generalized envelope.

Definition 8.1. Let M be a collection of real t×t symmetric matrices and let
V ⊆ span(M ) for all M ∈ M. The M-envelope of V, denoted EM (V), is the in-
tersection of all subspaces of Rt that contain V and reduce each member of M.

They applied this definition with M = {Σk | k ∈ χ} and V = B 0 , and


showed that the maximum likelihood estimator of a basis for EM (B 0 ) is given
by (8.11), suggesting that there is an intrinsic connection between the en-
velopes EM (B 0 ) and EΣ (SC|X ).

8.3.3 Quadratic discriminant analysis via algorithms N and S


When the intraclass sample sizes are relatively small, perhaps nk < p for
some k ∈ χ or intraclass collinearity is high for some classes, it is wise to
consider estimating EΣ (SC|X ) by using Algorithms N and S as introduced in
Section 1.5 since these methods do not require nonsingular empirical class co-
variance matrices. In reference to those general algorithms, the first selections
might be M = Σ and A = SC|X . To use these choices in applications, we need
to have an estimator of Σ and an estimator of (a basis for) SC|X . The average
covariance matrix Σ can be estimated straightforwardly by using S. However,
carrying this idea to fruition, we find that S needs to be nonsingular for the
estimation of SC|X , sending us back to consider likelihood-based methods.
Corollary 8.2 tells us that Σ SC|X ⊆ EΣ (Σ SC|X ) = EΣ (SC|X ), which pro-
vides an alternative choice: M = Σ and A = ΣSC|X , which does not require S
to be nonsingular when used as an estimator of Σ. The next corollary describes
the population foundation for an estimator of ΣSC|X .

Corollary 8.3. Assume the hypotheses in Proposition 8.1. Then

\[
\mathrm{span}\Bigl\{\sum_{k \in \chi} \pi_k (\Sigma_X - \Sigma_k)^2\Bigr\} = \Sigma\, \mathcal{S}_{C|X}.
\]

Proof. Let Π = diag(π0, π1, . . . , πr), and recall from (8.8) that

\[
\mathrm{var}(E(X \mid C)) = \sum_{k=0}^{r} \pi_k \beta_k \beta_k^T.
\]

Then it follows from Corollary 8.2(I) that

\[
\mathrm{span}\{\mathrm{var}(E(X \mid C))\} = \mathrm{span}(\beta^T \Pi \beta) \subseteq \Sigma\, \mathcal{S}_{C|X},
\]

and from Corollary 8.2(II) that

\[
\mathrm{span}(\Sigma - \Sigma_k) \subseteq \Sigma\, \mathcal{S}_{C|X}.
\]

Combining these we have that

\[
\mathrm{span}(\Sigma - \Sigma_k + \beta^T \Pi \beta) \subseteq \Sigma\, \mathcal{S}_{C|X}.
\]

But using (8.9)

\begin{align*}
\Sigma + \beta^T \Pi \beta - \Sigma_k &= E(\mathrm{var}(X \mid C)) + \mathrm{var}(E(X \mid C)) - \Sigma_k \\
 &= \Sigma_X - \Sigma_k,
\end{align*}

which implies that span(ΣX − Σk) ⊆ ΣSC|X and thus that

\[
\mathrm{span}\Bigl\{\sum_{k \in \chi} \pi_k (\Sigma_X - \Sigma_k)^2\Bigr\} \subseteq \Sigma\, \mathcal{S}_{C|X}. \tag{8.13}
\]

The left-hand side of (8.13) is the span of the kernel function for sliced av-
erage variance estimation (SAVE) as proposed by Cook and Weisberg (1991)
and developed further by Cook (2000) and Shao, Cook, and Weisberg (2007,
2009). A comprehensive treatment is available from Li (2018, Ch. 5). Equality
follows from Cook and Forzani (2009, Discussion of Prop. 3).
While normality guarantees equality in (8.13), containment holds under
much weaker conditions. In particular, if var(X | PSC|X X) is a non-random

matrix and if the linearity condition as given in Definition 9.3 holds then the
containment represented in (8.13) is assured. See Li (2018, Ch. 5) for further
discussion.
Algorithms N and S can now be used in applications by setting

\[
M = S \quad \text{and} \quad A = \sum_{k \in \chi} (n_k/n)(S_X - S_k)^2, \tag{8.14}
\]

and using predictive cross validation to determine the dimension of the enve-
lope.
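Algorithm N and Algorithm S themselves are described in Section 1.5 and are not reproduced here; the sketch below (ours, with an illustrative function name) only forms their inputs M and A of (8.14) from a sample.

```python
import numpy as np

def qda_pls_inputs(X, classes):
    """Sample M = S (average class covariance) and A = sum_k (n_k/n)(S_X - S_k)^2
    as defined in (8.14), the inputs to Algorithm N or S."""
    n = len(X)
    Xc = X - X.mean(0)
    S_X = Xc.T @ Xc / n
    M = np.zeros_like(S_X)
    A = np.zeros_like(S_X)
    for k in np.unique(classes):
        Xk = X[classes == k]
        Xkc = Xk - Xk.mean(0)
        S_k = Xkc.T @ Xkc / len(Xk)
        M += (len(Xk) / n) * S_k
        D = S_X - S_k
        A += (len(Xk) / n) * (D @ D)
    return M, A
```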
The methodology implied by Corollary 8.3 implicitly treats the mean and
variance components – β T Πβ and Σ − Σk – of Corollary 8.2 equally. In some
applications, it may be useful to differentially weight these components, par-
ticularly if they contribute unequally to classification. Let 0 ≤ a ≤ 1 and
combine the weighted components (1 − a)(Σ − Σk) and aβ T Πβ as

\[
\sum_{k \in \chi} \pi_k \{(1 - a)(\Sigma - \Sigma_k) + a\beta^T \Pi \beta\}^2
  = (1 - a)^2 \sum_{k \in \chi} \pi_k (\Sigma - \Sigma_k)^2 + a^2 (\beta^T \Pi \beta)^2,
\]

where the equality holds because the cross product terms sum to zero. With-
out loss of generality, we can rescale the right-hand side to give

\[
\mathrm{span}\Bigl\{(1 - \lambda) \sum_{k \in \chi} \pi_k (\Sigma - \Sigma_k)^2 + \lambda(\beta^T \Pi \beta)^2\Bigr\} \subseteq \Sigma\, \mathcal{S}_{C|X},
\]

where λ = a2/{(1 − a)2 + a2}. This then implies that we use the sample version
of Algorithm N or S with M = S and

\[
A_\lambda = (1 - \lambda) \sum_{k \in \chi} (n_k/n)(S - S_k)^2 + \lambda(S_X - S)^2. \tag{8.15}
\]

The coefficient λ and the number of PLS components can be determined by
cross validation with the error rate of the final choices estimated from a hold-
out sample. The resulting weight matrix is then substituted for Γ̂ in (8.12) to
give the PLS classification function.
When λ = 0,

\[
\mathrm{span}\Bigl\{\sum_{k \in \chi} \pi_k (\Sigma - \Sigma_k)^2\Bigr\} \subseteq \Sigma\, \mathcal{S}_{C|X},
\]

which is closely related to the methods for studying covariance matrices pro-
posed by Cook and Forzani (2008a).

8.4 Overview of QDA methods


All methods summarized begin with the QDA model (8.1) and then use various
classification functions and estimation methods depending on the dimension
reduction paradigm. The primary methods are listed below along with the
number of parameters N and notes on application.

1. QDA: Classification function (8.4) with no dimension reduction:

N = (r + 1){p + p(p + 1)/2}.

This method is most appropriate when µk and Σk are well estimated,


k ∈ χ. Even in this setting, dimension reduction may still reduce the
misclassification error materially. We illustrate this method in Section 8.5.

2. Classification function (8.6) based on reduction via the central subspace


SC|X of dimension d estimated using LAD (Cook and Forzani, 2009):

N = p + rd + p(p + 1)/2 + d(p − d) + rd(d + 1)/2.

This method is appropriate when µk and Σk are well estimated, k ∈ χ, and


dimension reduction is effective. The dimension d of the central subspace
can be estimated using likelihood testing, an information criterion, or cross valida-
tion (Cook and Forzani, 2009). Cross validation tends to be preferred as
it is less dependent on underlying distribution requirements.
In addition to LAD, a variety of other methods are available to estimate
the central subspace. We use LAD, sliced inverse regression (SIR; Li, 1991)
and SAVE (Cook and Weisberg, 1991; Cook, 2000) in Section 8.5 to illus-
trate classification via the central subspace.

3. Classification function (8.12) based on dimension reduction via the en-


veloped central subspace SC|X of dimension u estimated by using the
maximum likelihood function from Zhang and Mai (2019):

N = p + ru + p(p + 1)/2 + ru(u + 1)/2.

This method may be useful when collinearity is present. Cross validation


can be used to determine the dimension u of the envelope. It is illustrated
in Section 8.5.

4. AN-Q1: Classification function (8.12) based on dimension reduction via the


enveloped central subspace SC|X of dimension u estimated by using PLS:

N = p + ru + p(p + 1)/2 + ru(u + 1)/2.

The weight matrix U is obtained by using Algorithm N or S of Section 1.5


with M = S and A = $\sum_{k \in \chi} (n_k/n)(S_X - S_k)^2$, and using predictive cross
validation to determine the dimension of the envelope. The function (8.12)


is then used for classification after replacing Γ̂ with U in the function itself
b with U in the function itself
and in the estimators leading to that function.
This method may be particularly useful when µk is well-estimated but
estimation of Σk is questionable because of collinearity, k ∈ χ. It is illus-
trated in Section 8.5. There it is called AN-Q1, indicating Algorithm N,
first quadratic method.

5. AN-Q2: Classification function (8.12) based on dimension reduction via the


enveloped central subspace SC|X of dimension u estimated by using PLS:

N = p + ru + p(p + 1)/2 + ru(u + 1)/2.

The weight matrix U is obtained by using Algorithm N or S of Section 1.5


with M = S and Aλ as given in (8.15). Predictive cross validation can be
used to determine the dimension of the envelope and the mixing parameter
λ. The function (8.12) is then used for classification after replacing Γ̂ with
b with
U in the function itself and in the estimators leading to that function.
This method may be particularly useful for assessing classification errors
as they depend on the relative contributions λ of the means and variances.
It is illustrated in Section 8.5 using Algorithm N. There it is called AN-Q2,
indicating Algorithm N, second quadratic method.

Methods 3–5 all depend on the same classification function but differ on
the method of estimating the compressed predictors.

8.5 Illustrations
In this section we compare the PLS-type methods AN-Q1 and AN-Q2 to other
classification methods studied in the literature. We confine attention mostly
TABLE 8.1
Birds-planes-cars data: Results from the application of three classification
methods. Accuracy is the percent correct classification based on leave-one-out
cross validation.

Dim. Reduction Method u φqda Accuracy (%)


AN-Q2 (λ = 0.6) 4 (8.15) 98.8
AN-Q1 3 (8.14) 97.6
QDA, no reduction 13 (8.4) 92.1

to methods that require and thus benefit from having nonsingular class covari-
ance matrices Sk . We see this as a rather stringent test for AN-Q1 and AN-Q2,
which do not require the class covariance matrices Sk to be nonsingular.

8.5.1 Birds, planes, and cars


This illustration is from a study to distinguish birds, planes and cars by the
sounds they make. A two-hour recording was made in the city of Ermont,
France, and then 5-second snippets of sounds were selected. This resulted in 58
recordings identified as birds, 43 as cars and 64 as planes. Each recording was
processed and ultimately represented by 13 Scale Dependent Mel-Frequency
Cepstrum Coefficients, which constitute the features for distinguishing birds,
planes and cars. This classification problem then has p = 13 features and
n = 165 total observations. These data were used by Cook and
Forzani (2009) to illustrate dimension reduction via LAD and by Zhang and
Mai (2019) in their development of classification methodology based on the
envelope discriminant subspace (ENDS) for QDA.
We applied the three methods shown in Table 8.1. The number u of con-
densed features and λ were selected by 10-fold cross validation for AN-Q1 and
AN-Q2. The estimated classification functions were assessed using leave-one-
out cross validation. Plots of the first two compressed features for AN-Q1 and
AN-Q2 are shown in Figure 8.1.
Zhang and Mai (2019, Table 4) compared their proposed method ENDS to
11 other classification methods, including classifiers based on Bayes’ rule, the
support vector machine, SIR and SAVE. Their results are listed in columns
5–11 of Table 8.2 along with those from Table 8.1. For QDA without reduc-
tion they reported a misclassification rate of 7.7% which corresponds to an

FIGURE 8.1
Bird-plane-cars data: Plots of the first two projected features for AN-Q1 and AN-Q2. (a) AN-Q2; (b) AN-Q1.

accuracy rate of 92.3%. This agrees well with our estimate of 92.1% in Ta-
ble 8.1 and in the fourth column of Table 8.2. Due to this agreement, we feel
comfortable comparing our results based on Algorithm N with their results for
12 other methods. Of those 12 methods, their new method ENDS had the best
accuracy at 96.2%, which is less than our estimated accuracy rates for the two
methods based on Algorithm N. Zhang and Mai (2019, Figure 3) gave a plot of
the first two feature vectors compressed by using ENDS. Their plot is similar
to those in Figure 8.1 but the point separation does not appear as crisp.
All of the methods studied by Zhang and Mai (2019) require that the
intraclass covariance matrices Sk be nonsingular, except perhaps for their im-
plementation of SVM. However, being based on Algorithm N, methods AN-Q2
and AN-Q1 do not require the Sk ’s to be nonsingular. We take this as a gen-
eral indication of the effectiveness of the methods based on Algorithm N since
they are able to perform on par with or better than methods that rely on rel-
atively large sample sizes. A similar conclusion arises from the fruit example
of the next section.
TABLE 8.2
Birds-planes-cars data: Estimates of the correct classification rates (%) from
leave one out cross validation.

From Zhang and Mai (2019)


Dataset AN-Q2 AN-Q1 QDA NB SVM QDA ENDS SIR SAVE DR
Birds+ 98.2 97.6 92.1 94.0 87.4 92.3 96.2 93.0 71.8 91.1

Classification rates for columns 5–11 were taken from Zhang and Mai (2019).
NB: Naive Bayes classifier (Hand and Yu, 2001). SVM: support vector ma-
chine. QDA: quadratic discriminant analysis with no reduction. ENDS: en-
velope discrimination. SIR: sliced inverse regression. SAVE: sliced average
variance estimation. DR: directional regression (Li and Wang, 2007).

8.5.2 Fruit
This dataset contains a collection of 983 infrared spectra collected from straw-
berry, 351 samples, and non-strawberry, 632 samples, fruit purees. The spec-
tral range was restricted to 554–11,123 nm, and each spectrum contained 235
variables. This then is a classification problem with p = 235 features for clas-
sifying fruit purees as strawberry or non-strawberry. It was used by Zheng,
Fu, and Ying (2014) to compare several classification methods.
Percentages of correct classifications are shown in Table 8.3. The number
of compressed features, determined by 10-fold cross validation, for AN-Q1 and
AN-Q2 were 14 and 18. The value of λ for AN-Q2 was estimated similarly to
be 1. The rates of correct classification shown in Table 8.3 were determined by
leave-one-out cross validation. Except for the relatively poor performances by
KNN and PLS-DA, there is little to distinguish between the methods. Plots of
the first two compressed feature vectors are shown in Figure 8.2, although the
results of Zheng et al. (2014) show that more than two compressed features
are needed for each method.

8.6 Coffee and olive oil data


Recall that the sizes of the coffee and olive oil data discussed in Sections 7.7.1
and 7.7.2 are n = 56, p = 286 and n = 60, p = 570. Recall also from Table 7.1

TABLE 8.3
Fruit data: Estimates of the correct classification rates (%) from leave one out
cross validation.

From Zheng et al. (2014)


Dataset AN-Q1 AN-Q2 KNN LS-SVM PLS-DA BP-ANN ELM
Fruit 95.63 95.73 67.01 96.01 85.07 95.33 95.05

Classification rates for columns 4–8 were taken from Zheng et al. (2014). KNN:
k-nearest neighbor. LS-SVM: least-squares support vector machine. PLS-DA:
partial least-squares discriminant analysis. BP-ANN: back propagation artifi-
cial neural network. ELM: extreme learning machine.

FIGURE 8.2
Fruit data: Plots of the first two projected features for AN-Q1 and AN-Q2. (a) AN-Q2; (b) AN-Q1.

that the PLS methods discussed in Chapter 7 did well for these data when
compared to the methods of Zheng et al. (2014).
The coffee and olive oil datasets may not be large enough to give compelling
evidence that the intraclass covariance matrices are unequal and consequently
the PLS classification methods of Chapter 7 would be a natural first choice.
It could also be argued reasonably that, to err on the safe side, quadratic
methods should be tried as well. Shown in columns 2 and 3 of Table 8.4 are the
rates of correct classification from applying AN-Q1 and AN-Q2 to the coffee

and olive oil data. For comparison, we again listed the results from Zheng et al.
(2014). For the coffee data, these quadratic methods did as well as the best
method studied by Zheng et al. (2014) and, from Table 7.1, as well as the best
PLS method. Viewing the results for the olive oil data, the quadratic methods
did reasonably, but not as well as the PLS or PLS+PFC results in Table 7.1.

TABLE 8.4
Olive oil and Coffee data: Estimates of the correct classification rates (%)
from leave one out cross validation. Results in columns 4–8 are as described
in Table 7.1.

From Zheng et al. (2014)


Dataset AN-Q1 AN-Q2 KNN LS-SVM PLS-DA BP-ANN ELM
Coffee 100 100 82.2 97.5 100 94.8 100
Olive oil 95.0 94.2 83.2 95.1 93.1 90.0 97.4

The results for the coffee and olive oil data support the notion that AN-Q1
and AN-Q2 may be serviceable methods in classification problems where the
class covariance matrices are singular. Combining this with our conclusions
from the birds-planes-cars and fruit datasets leads to the conclusion that AN-Q1
and AN-Q2 may be serviceable methods without much regard for the class
sample sizes.
sample sizes.
9

Non-linear PLS

The behavior of predictions stemming from the PLS regression algorithms


discussed in Chapter 3 depends in part on the accuracy of the multivariate
linear model (1.1). In particular, the usefulness of those algorithms may be
called into question if the conditional mean E(Y | X) is a non-linear function
of X, leading to a possibly non-linear model that we represent as

Yi = E(Y | Xi ) + εi , i = 1, . . . , n, (9.1)

where the errors ε are independent and identically distributed random vectors
with mean 0 and variance ΣY |X . Model (9.1) is intended to be the same as
model (1.1) except for the possibility that the mean is a non-linear function of
X. If E(Y | X) = β0 +β TX then model (9.1) reduces to model (1.1). It is recog-
nized that predictions from PLS algorithms based on (1.1) are not serviceable
when the mean function E(Y | X) has significant non-linear characteristics
(e.g. Shan et al., 2014).
Following Cook and Forzani (2021), in this chapter we study the behav-
ior of the PLS regression algorithms under model (9.1), without necessarily
specifying a functional form for the mean. We restrict attention to univariate
responses (r = 1) starting in Section 9.5, but until that point the response
may be multivariate. Our discussion is based mainly on the NIPALS algorithm
(Table 3.1) although our conclusions apply equally to SIMPLS (Table 3.4) and
to Helland’s algorithm (Table 3.5).




9.1 Synopsis
We bring in two new ideas to facilitate our discussion of PLS algorithms
under non-linearity (9.1). The first is a construction – the central mean sub-
space (CMS) – that is used to characterize the mean function E(Y | X) in a
way that is compatible with envelopes (Section 9.2). The second is a linearity
condition that is used to constrain the marginal distribution of the predictors
(Section 9.3). This condition is common in sufficient dimension reduction (e.g.
Cook, 1998; Li, 2018) and is used to rule out anomalous predictor behavior
by requiring that certain regressions among the predictors themselves are all
linear. Using these ideas we conclude in this chapter that

1. Plots of the responses against NIPALS fitted values derived from the algo-
rithm in Table 3.1 can be used to diagnose non-linearity in the mean func-
tion E(Y | X). Linearity in such plots supports linear model (1.1), while a
clearly non-linear trend contradicts model (1.1) in support of model (9.1)
(see Proposition 9.4).

2. Under the linearity condition, NIPALS may be serviceable for dimension


reduction under non-linear model (9.1). However, while linear model (1.1)
provides for straightforward predictions based on PLS weights, non-linear
model (9.1) does not. This means that additional analysis is required to
formulate predictions under non-linear model (9.1). Possibilities for pre-
diction are discussed in Section 9.6.

3. By introducing a generalization of the logic leading to Krylov subspaces,


we argue in Section 9.5 that PLS algorithms can be generalized to yield
methodology that can adapt to non-linearities without prespecifying a
parametric form for E(Y | X).

9.2 Central mean subspace


Cook and Li (2002) introduced the CMS, which is designed for dimension
reduction targeted at the conditional mean E(Y | X) without requiring a

pre-specified parametric model. The CMS plays a central role in characteriz-


ing the behavior of PLS regression algorithms under (9.1). We first introduce
the idea of a mean dimension-reduction subspace, which is also from Cook
and Li (2002):

Definition 9.1. A subspace S ⊆ Rp is a mean dimension-reduction subspace


for the regression of Y on X if Y ⫫ E(Y | X) | PS X.

The intuition behind this definition is that the projection PS X carries all
of the information that X has about the conditional mean E(Y | X). Let
α ∈ Rp×dim(S) be a basis for a mean dimension reduction subspace S. Then if
S were known, we might expect that E(Y | X) = E(Y | αT X), thus reducing
the dimension of X for the purpose of estimating the conditional mean. This
expectation is confirmed by the following proposition (Cook and Li, 2002)
whose proof is sketched in Appendix A.7.1.

Proposition 9.1. The following statements are equivalent.

(i) Y ⫫ E(Y | X) | αT X,

(ii) cov{(Y, E(Y | X)) | αT X} = 0,

(iii) E(Y | X) is a function of αT X.

Statement (i) is equivalent to Definition 9.1, although here it is stated in


terms of a basis α. Statement (ii) says that the conditional covariance between
Y and E(Y | X) is 0, and statement (iii) tells us that E(Y | X) = E(Y | αT X).
As in our discussion of the central subspace, the most parsimonious reduction
is provided by the smallest mean dimension-reduction subspace:

Definition 9.2. If the intersection of all mean dimension-reduction subspaces


is itself a mean dimension-reduction subspace then it is called the CMS and
denoted as SE(Y |X) with dimension q = dim(SE(Y |X) ).

The CMS does not always exist, but it does exist under mild conditions
that should not be worrisome in practice (Cook and Li, 2002). We assume
existence of the CMS throughout this chapter.

Single index model.

Suppose r = 1 and that the regression of Y on X follows the single index model

Y = f (β1T X) + ε, (9.2)

where X ⫫ ε, β1 ∈ Rp and the function f is unknown. Any sub-


space S ⊆ Rp that contains span(β1 ) is a mean dimension-reduction sub-
space because E(Y | X) is constant conditional on PS X and then trivially
Y ⫫ E(Y | X) | PS X. Conclusion (ii) of Proposition 9.1 holds also because
E(Y | X) is constant conditional on αTX, where α is still a basis for S. When
referring to this model in subsequent developments, we will always require
that ΣX,Y ≠ 0, unless stated otherwise.
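As a small illustration of the diagnostic mentioned in Section 9.1, the following sketch (ours; the simulated single-index regression and the choice f(t) = exp(t/2) are assumptions made purely for illustration) plots the response against the one-component NIPALS fitted values from scikit-learn. The visible curvature in the plot flags the non-linear mean function.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression

# Simulate a single-index regression Y = f(beta1' X) + eps and plot Y against the
# one-component PLS fitted values; curvature in the plot indicates a non-linear mean.
rng = np.random.default_rng(2)
n, p = 200, 30
beta1 = np.zeros(p); beta1[:3] = 1.0
X = rng.standard_normal((n, p))                 # normal predictors satisfy the linearity condition
Y = np.exp(X @ beta1 / 2) + 0.2 * rng.standard_normal(n)
fit = PLSRegression(n_components=1, scale=False).fit(X, Y)
plt.scatter(fit.predict(X).ravel(), Y)
plt.xlabel("NIPALS fitted values"); plt.ylabel("Y")
plt.show()
```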

Double index model.

If r = 1, then the regression of Y on X follows a double index model if

Y = f (β1T X, β2T X) + ε, (9.3)

where X ⫫ ε, βj ∈ Rp, j = 1, 2 and the function f is unknown. For example,


we might have
\[
Y = \gamma_0 + \beta_1^T X + \gamma_1 e^{-\beta_2^T X} + \varepsilon.
\]
Any subspace S ⊆ Rp that contains span(β1 , β2 ) is a mean dimension-
reduction subspace because E(Y | X) is constant conditional on PS X and
then trivially Y ⫫ E(Y | X) | PS X. Conclusion (ii) of Proposition 9.1 holds
also because E(Y | X) is constant conditional on αTX, where α is a basis for S.
We defined in Section 2.1 the central subspace SY |X as the intersection of
all subspaces S such that Y ⫫ X | PS X, while the CMS SE(Y |X) is defined
as the intersection of all subspaces S such that Y ⫫ E(Y | X) | PS X. The
central subspace captures all of the ways in which X affects the distribution
of Y , while the CMS captures all the ways in which just E(Y | X) depends on
X. Clearly, we must have SE(Y |X) ⊆ SY |X . Under models (9.2) and (9.3) Y
depends on X only via its mean function and in those cases SE(Y |X) = SY |X .
In consequence, the model-free predictor envelope described in Definition 2.2
reduces to EΣX (SY |X ) = EΣX (SE(Y |X) ), which becomes one tool for our study
of PLS regression algorithms under (9.1). If linear model (1.1) holds, then
the coefficient matrix β is well-defined and SE(Y |X) = SY |X = span(β). If
non-linear model (9.1) holds, then there is no model-based β defined and, as
mentioned previously, we have only SE(Y |X) = SY |X .

The ability to estimate vectors in the CMS or its ΣX -envelope


EΣX (SE(Y |X) ) is crucial for understanding PLS regression algorithms under
model (9.1). Estimation is facilitated when the predictors satisfy a linearity
condition on their marginal distribution.

9.3 Linearity conditions


We first introduce the linearity condition in terms of an essentially arbitrary
random vector and then show how it is applied in the regression context.

Definition 9.3. Let W ∈ Rp be a random vector with mean 0 and covariance


matrix ΣW > 0. Let α ∈ Rp×q , q ≤ p, be any basis matrix for a subspace S ⊆
Rp . We say that W satisfies the linearity condition relative to S if there exists
a non-stochastic p × q matrix M such that E(W | αT W = u) = M u for all u.

The essence of the linearity condition is that E(W | αT W ) = E(W | PS W )


must be a linear function of αT W . If the linearity condition holds for a sub-
space S it need not necessarily hold for any other distinct subspace. For in-
stance, the linearity condition holds trivially for S = Rp , but that does not
imply that it holds for any proper subspace of Rp . The following proposi-
tion, which is adapted from Cook (1998, Prop. 9.1), gives properties of M
under Definition 9.3. Its proof is sketched in Appendix A.7.2. The notion of a
subspace reducing a matrix was described in Definition 1.1.

Proposition 9.2. Assume that W ∈ Rp , a random vector with mean 0 and


covariance matrix ΣW > 0, satisfies the linearity condition relative to S under
Definition 9.3. Let α be a basis for S. Then

1. M = ΣW α(αT ΣW α)−1 .

2. M T is a generalized inverse of α.

3. αM T is the orthogonal projection operator for S relative to the ΣW inner


product (see equation 1.3).

4. E(W | αT W ) − µW = PS(ΣW )T (W − µW ), where for completeness we have
   allowed µW = E(W ) to be non-zero.



5. If S reduces ΣW then PS(ΣW ) = PS .
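
For readers who want to verify these algebraic facts computationally, the following minimal sketch (in Python with numpy) checks conclusions 1–3 for a randomly generated ΣW > 0 and basis matrix α. The snippet and its names are illustrative only; it is not part of the book's accompanying software.

```python
# Numeric check of conclusions 1-3 of Proposition 9.2 (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 2
A = rng.normal(size=(p, p))
Sigma_W = A @ A.T + p * np.eye(p)          # a positive definite Sigma_W
alpha = rng.normal(size=(p, q))            # basis matrix for S

M = Sigma_W @ alpha @ np.linalg.inv(alpha.T @ Sigma_W @ alpha)   # conclusion 1

# conclusion 2: M^T alpha = I_q, so M^T is a (left) generalized inverse of alpha
print(np.allclose(M.T @ alpha, np.eye(q)))

# conclusion 3: alpha M^T is the projection onto S in the Sigma_W inner product
P = alpha @ M.T
print(np.allclose(P @ P, P), np.allclose(P @ alpha, alpha))      # idempotent, fixes S
```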

If W ∈ Rp is elliptically contoured with mean µW and variance ΣW , which


includes the multivariate normal family, then it is known that W satisfies the
linearity condition for any subspace S ⊆ Rp with basis matrix α (Cook, 1998,
Section 7.3.2),

E(W | αT W ) = µW + PS(ΣW )T (W − µW ).

Moreover, Eaton (1986) has shown that a random vector W ∈ Rp satisfies


the linearity condition for all subspaces S ⊆ Rp if and only if W is elliptically
contoured. Being elliptically contoured is then sufficient to imply the linearity
condition under Definition 9.3, although it is not necessary since the defini-
tion requires linearity for only the selected subspace and not all subspaces.
Diaconis and Freedman (1984) and Hall and Li (1993) showed that almost all
projections of high-dimensional data are approximately normal and thus sat-
isfy the linearity condition. For all of these reasons, the linearity condition is
largely seen as mild when p ≫ q. It should hold well in the analysis of spectral
data in chemometrics, since then p is often in the hundreds and q is relatively
small. See Li and Wang (2007) and Cook (1998, 2018) for further discussion.
We will be requiring the linearity condition of the predictors relative to
subspaces associated with the regression of Y on X. For these applications,
it may be sufficient to have the distribution of X given Y satisfy the lin-
earity condition relative to the central subspace. The proof of the following
proposition is available in Appendix A.7.3.

Proposition 9.3. Assume that the distribution of X | Y satisfies the linearity


condition relative to SY |X for each value of Y . Then the marginal distribution
of X also satisfies the linearity condition relative to SY |X .

The next corollary and lemma describe special settings for the linearity
condition.

Corollary 9.1. (I) If model (9.1) holds and if X | Y satisfies the linearity
condition relative to SE(Y |X) for each value of Y , then X satisfies the linearity
condition relative to SE(Y |X) . (II) If X | Y is elliptically contoured, then X
satisfies the linearity condition relative to SY |X .

Proof. (I) The result follows because under model (9.1), SE(Y |X) = SY |X . (II)
If X | Y is elliptically contoured, then it satisfies the linearity condition for

all values of Y relative to any subspace S ⊆ Rp . The conclusion follows from


Proposition 9.3. □

9.4 NIPALS under the non-linear model


As a first use of the linearity condition, consider the NIPALS algorithm un-
der model (9.1). Let η be a basis matrix for SE(Y |X) , the CMS, and define
β = Σ−1X ΣX,Y , the OLS coefficients from the linear population regression of
Y on X, and correspondingly let B = span(β). In this use of β we are not as-
suming that linear model (1.1) holds. Rather, β is simply a parameter vector
constructed from ΣX and ΣX,Y . The following proposition connects β with
the CMS.

Proposition 9.4. Assume the linearity condition for X relative to SE(Y |X)
and that non-linear model (9.1) holds. Then span(ΣX,Y ) ⊆ ΣX SE(Y |X) and
consequently B ⊆ SE(Y |X) .

Proof. We step through the proof so the role of various structures can be
seen. Recall that η is a basis matrix for SE(Y |X) . Expanding the expectation
operator, we first have

ΣX,Y = EX,Y {(X − µX )(Y − µY )T }


= EX EY |X {(X − µX )(Y − µY )T }
= EX {(X − µX )(E(Y | X) − µY )T }.

From Proposition 9.1, E(Y | X) = E(Y | η T X) and so

ΣX,Y = EX {(X − µX )(E(Y | η T X) − µY )T }


= EηT X EX|ηT X {(X − µX )(E(Y | η T X) − µY )T }
= EηT X {(E(X | η T X) − µX )(E(Y | η T X) − µY )T }.

From the assumed linearity condition and conclusion 4 of Proposition 9.2 we


have

E(X | η T X) − µX = Pη(ΣX )T (X − µX ).

Substituting back, we get

ΣX,Y = EηT X {Pη(ΣX )T (X − µX )(E(Y | η T X) − µY )T }
     = Pη(ΣX )T ΣX,Y ,                                    (9.4)

so that span(ΣX,Y ) ⊆ ΣX span(η) = ΣX SE(Y |X) and span(β) = span(Σ−1X ΣX,Y ) ⊆ SE(Y |X) ,

and our claim follows. □

In the proof of Proposition 9.4, the linearity condition is applied only to


the marginal distribution of the predictors; it does not involve the response.
This will also be true for all subsequent applications of the linearity condition.
The following corollary gives Proposition 9.4 when restricted to single in-
dex models (9.2) with non-zero β. Its proof is straightforward and omitted.

Corollary 9.2. Assume the linearity condition for X relative to SE(Y |X) . As-
sume also that the regression of Y on X follows the single index model (9.2)
and that ΣX,Y ≠ 0. Then B = SE(Y |X) .

The next proposition provides a formal connection between the predictor


envelope EΣX(B) and the envelope of the CMS EΣX (SE(Y |X) ). Its proof is in
Appendix A.7.4.

Proposition 9.5. Assume the linearity condition for X relative to SE(Y |X) .
Then under the non-linear model (9.1), we have

EΣX(B) ⊆ EΣX (SE(Y |X) ).

Moreover, if the single index model (9.2) holds and ΣX,Y ≠ 0 then EΣX(B) =
EΣX (SE(Y |X) ).

We know from the discussion in Section 3.2 that the weight matrix from
the NIPALS algorithm provides an estimator of a basis for the predictor en-
velope EΣX(B), which by Proposition 9.5 is contained in the corresponding
envelope of the CMS. It follows that the compressed predictors W TX are as
relevant for the non-linear regression model (9.1) as they are for the linear
model (1.1), provided that the linearity condition holds. In some regressions
it may happen that EΣX(B) is a proper subset of EΣX (SE(Y |X) ), in which case
the NIPALS compression may miss relevant directions in which the mean
function is non-linear. However, if the single index model (9.2) holds then

EΣX(B) = EΣX (SE(Y |X) ) and NIPALS will not miss any relevant directions in
the population.
These results have implications for graphical diagnostics to detect non-
linearities in a regression. The following steps may be helpful in regressions
with a real response.

1. Fit the data with NIPALS, SIMPLS or a PLS regression variation thereof
and get the estimated coefficient matrix βbpls .
2. Plot the response Y versus the fitted values βbplsT X; a code sketch of these steps follows this list. If the plot shows clear
curvature, then the serviceability of the envelope model (2.5) is question-
able. If no worrisome non-linearities are observed, then the model is sus-
tained.
The conclusion EΣX(B) ⊆ EΣX (SE(Y |X) ) from Proposition 9.5 implies that
EΣX(B) could be a proper subset of EΣX (SE(Y |X) ) and thus that some
directions in the CMS envelope are missed. However, if the underlying re-
gression is a single index model, then the plot of the response versus the
fitted values is sufficient for detecting non-linearity.

3. A closer inspection may be useful for some regressions. In these cases,


plots of the response versus selected linear combinations, aT W TX with
a ∈ Rq , of the compressed predictors W TX may be informative. Since
the only role of the weight matrix is to provide an estimate of a basis
for the envelope, the linear combinations to plot may be unclear. Plotting
the response versus the principal components of W TX may be a useful
default choice. Another potentially useful possibility is to plot the com-
binations wjTX against Y , where wj is the j-th weight vector generated
by the NIPALS algorithm. The first few linear combinations will likely be
most important. Clear non-linearity in a plot of Y versus wjTX is then an
indication that the mean function is non-linear.
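
The following minimal sketch illustrates steps 1 and 2 in Python with numpy, assuming data arrays X (n × p) and Y (length n). The helper krylov_weights builds sample weights by Gram–Schmidt orthogonalization of the Krylov sequence Kq (Σb X , σb X,Y ); by the discussion in Section 3.5.1 its span matches the span of the NIPALS weights for a real response, so the fitted values below should agree with a NIPALS fit. The function and variable names are ours, not those of the book's accompanying software.

```python
# Sketch of diagnostic steps 1-2: compute PLS weights, fitted values, and plot.
import numpy as np
import matplotlib.pyplot as plt

def krylov_weights(X, Y, q):
    """Sample weights spanning K_q(S_X, s_XY), built by Gram-Schmidt."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    S_X = Xc.T @ Xc / len(Y)
    v = Xc.T @ Yc / len(Y)                  # s_XY, the first Krylov vector
    W = []
    for _ in range(q):
        for w in W:                         # orthogonalize against earlier weights
            v = v - (w @ v) * w
        v = v / np.linalg.norm(v)
        W.append(v)
        v = S_X @ v                         # next Krylov vector
    return np.column_stack(W)

W = krylov_weights(X, Y, q=4)               # q chosen by the analyst
Xc = X - X.mean(axis=0)
b = np.linalg.lstsq(Xc @ W, Y - Y.mean(), rcond=None)[0]
beta_pls = W @ b                            # coefficient vector in span(W)

plt.scatter(Xc @ beta_pls, Y)               # step 2: look for clear curvature
plt.xlabel("PLS fitted values")
plt.ylabel("Y")
plt.show()
```

The same weight matrix W can be reused for step 3, for instance by plotting Y against the principal components of the compressed predictors Xc W.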

If non-linearities are detected, the usual predictive procedures for selecting


the number of components q may not be useful here because they rely on a
linear model, and predictions based on the linear model may no longer be ser-
viceable. Different methods may be needed in such cases. Several possibilities
are illustrated in Section 9.6.

9.5 Extended PLS algorithms for dimension reduction


It is known from our discussion in Section 3.5.1 that for single-response lin-
ear regressions in the context of model (1.1) the population weight matrix W
from the NIPALS algorithm arises as Gram-Schmidt orthogonalization of the
Krylov sequence Kq (ΣX , σX,Y ) (Helland, 1990; Manne, 1987). We also know
from Proposition 1.10 that the Krylov subspace Kq (ΣX , σX,Y ) = EΣX(B), so

span(β) ⊆ EΣX(B) = span(W ) = Kq (ΣX , σX,Y ), (9.5)

which is the fundamental population justification for using the NIPALS al-
gorithm for dimension reduction in conjunction with using linear model (1.1)
for prediction. In particular, it implies that there is a vector η ∈ Rq so that
β = W η, leading to the reduced linear model Y = α + η T W TX + ε.
We propose in Section 9.5.1 a procedure for generalizing PLS algorithms for
dimension reduction in regressions covered by non-linear model (9.1). These
generalizations effect dimension reduction but not prediction. Methods for
prediction are discussed in Section 9.6. We show in Section 9.5.2 that the gen-
eralized procedure can produce the Krylov sequence (9.5) and in Section 9.5.3
we show how to use it to remove linear trends to facilitate detecting non-
linearities. We expect that there will be many other applications of the gen-
eralized method in the future.
We confine discussion to regressions with a real response, r = 1, in the
remainder of this chapter.

9.5.1 A generalization of the logic leading to Krylov spaces


Recall the following from our discussion of invariant subspaces and envelopes
in Section 1.3. If S is an invariant subspace of ΣX and if x ∈ S, then we
can iteratively transform x by ΣX to find additional vectors in S, eventually
collecting enough vectors to span S. The NIPALS algorithm for a univariate
response operates with S = EΣX(B) and x = σX,Y , which gives rise to the
Krylov sequence Kq (ΣX , σX,Y ). This idea of iteratively transforming a vec-
tor in a selected invariant subspace to collect more vectors in that subspace is
the fundamental idea behind NIPALS and our generalization. In the follow-
ing proposition, a version of which was first proven by Cook and Li (2002,
Theorem 3), our subspace is S = EΣX (SE(Y |X) ).

Proposition 9.6. Assume model (9.1) with a real response. Let Γ be a basis
matrix for EΣX (SE(Y |X) ) and let U and V be real-valued functions of ΓTX.
Assume also that X satisfies the linearity condition relative to EΣX (SE(Y |X) ).
Then
E{(U Y + V )(X − µX )} ∈ EΣX (SE(Y |X) ).

Proof. Assume without loss of generality that µX = 0. Then

E((U Y + V )X) = EX {(U E(Y | X) + V )X}.

Let η be basis matrix for SE(Y |X) . By Proposition 9.1, E(Y | X) = E(Y | η TX).
Since SE(Y |X) ⊆ EΣX (SE(Y |X) ), it follows that E(Y | X) = E(Y | η T X) =
E(Y | ΓTX). Consequently,

E((U Y + V )X) = EX {(U E(Y | ΓTX) + V )X}


= EΓT X {(U E(Y |ΓT X) + V )E(X | ΓTX)}.

Since X satisfies the linearity condition relative to EΣX (SE(Y |X) ) = span(Γ),
we know from conclusions 4 and 5 of Proposition 9.2 that E(X | ΓTX) = PΓ X.
Using the condition that U and V are real-valued functions of ΓT X, we have

E((U Y + V )X) = EΓT X {(U E(Y |ΓT X) + V )E(X | ΓTX)}


= EΓT X {(U E(Y |ΓT X) + V )PΓX}
= PΓ E{(U Y + V )X}
∈ EΣX (SE(Y |X) ).

The importance of this proposition is as follows. Suppose we know one


vector ν0 in the envelope EΣX (SE(Y |X) ). We can find another vector in
EΣX (SE(Y |X) ) by defining functions u : R → R and v : R → R, which
are then used to form a constructed response Y ∗ = u(ν0T X)Y + v(ν0T X).
The covariance between the constructed response and X is in the envelope:
ν1 = cov(Y ∗ , X) ∈ EΣX (SE(Y |X) ). This can then be iterated to find additional
vectors in the envelope:

νj = cov{u(νj−1T X)Y + v(νj−1T X), X} ∈ EΣX (SE(Y |X) ), j = 1, 2, . . . .

9.5.2 Generating NIPALS vectors


In this section we show how to use Proposition 9.6 to generate the Krylov se-
quence. This then is another way to reach Proposition 9.5 and the conclusions
that follow it. This argument shows how to achieve generalizations of NIPALS
dimension reduction, which may be useful in some regressions.
Setting U = 1, V = 0 we have a first vector

E((U Y + V )X) = σX,Y


span(σX,Y ) ⊆ EΣX (SE(Y |X) ).

There are many ways to construct second and subsequent vectors in


EΣX (SE(Y |X) ) depending on the choices for U and V . Our selections here
were chosen to generate the Krylov sequence Kq (ΣX , σX,Y ) associated with
the NIPALS algorithm.
Take U = 0 and V = v(σX,Y T X) = σX,Y T X. Then

E((U Y + V )X) = E(XX T σX,Y ) = ΣX σX,Y ∈ EΣX (SE(Y |X) )

is a second vector in the envelope. To get a third vector, we use the sec-
ond vector since ΣX σX,Y ∈ EΣX (SE(Y |X) ). Then with U = 0 and V =
v{(ΣX σX,Y )T X} = (ΣX σX,Y )T X, we have

E((U Y + V )X) = Σ2X σX,Y ∈ EΣX (SE(Y |X) ).

Continuing by induction, we generate the Krylov subspace,

span(σX,Y , ΣX σX,Y , Σ2X σX,Y , . . .) ⊆ EΣX (SE(Y |X) ).
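
The construction can be illustrated numerically. With the sample versions SX and sX,Y in place of ΣX and σX,Y , the sample covariance between each constructed response and X reproduces the next Krylov vector exactly. The sketch below assumes centered data in numpy arrays X (n × p) and Y (length n); the names are ours.

```python
# Sample illustration of the constructed-response device for generating Krylov vectors.
import numpy as np

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()
n = len(Y)
S_X = Xc.T @ Xc / n
s_XY = Xc.T @ Yc / n

v1 = Xc.T @ Yc / n                 # U = 1, V = 0: the sample sigma_{X,Y}
v2 = Xc.T @ (Xc @ v1) / n          # U = 0, V = v1' X: reproduces S_X sigma_{X,Y}
v3 = Xc.T @ (Xc @ v2) / n          # U = 0, V = v2' X: reproduces S_X^2 sigma_{X,Y}

print(np.allclose(v2, S_X @ s_XY), np.allclose(v3, S_X @ S_X @ s_XY))
```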

Under the non-linear model we still have strict monotone containment of the
Krylov subspaces following (1.23), with equality after q components are
reached. However, here Kq (ΣX , σX,Y ) is contained in EΣX (SE(Y |X) ) without
necessarily being equal:

K1 (ΣX , σX,Y ) ⊂ · · · ⊂ Kq (ΣX , σX,Y ) = Kq+1 (ΣX , σX,Y ) = · · · ⊆ EΣX (SE(Y |X) ).

As we have used in the past, q is the number of NIPALS components. However,


in this context, q is not necessarily equal to the dimension of the correspond-
ing envelope, q ≤ dim{EΣX (SE(Y |X) )}. Putting it all together we have the
non-linear model counterpart to (9.5),

Corollary 9.3. Under the hypotheses of Proposition 9.6,

Kq (ΣX , σX,Y ) = span(W ) = EΣX(B) ⊆ EΣX (SE(Y |X) ), (9.6)

with equality EΣX(B) = EΣX (SE(Y |X) ) under the single index model.

This result reinforces the conclusions of Proposition 9.5 and the discussion
that follows it: the NIPALS algorithm can be used to diagnose non-linearity
in the mean function. In fact, the mild linearity condition for the envelope is
the only novel condition needed.
The graphical diagnostic procedure discussed at the end of Section 9.4
will display linear as well as non-linear trends. For the purpose of diagnosing
non-linearity in the mean function, it may be desirable to remove linear trends
first and use the NIPALS residuals
R = Y − E(Y ) − βnplsT (X − µX )

as a response, where βnpls denotes the population NIPALS coefficient vector.


However, using the residuals to generate a Krylov sequence Kp−1 (ΣX , σX,r )
may not be effective because the residual covariance σX,r will likely be small
and uninformative; σX,r = 0 when the OLS residuals are used. In the next
section we propose a different diagnostic method based on an application of
Proposition 9.6 with residuals.

9.5.3 Removing linear trends in NIPALS


Diagnosing non-linearity in the mean function is most effectively done by first
removing linear trends and then dealing with the residuals. To accomplish
that we define the third-moment matrix

ΣRXX = E{R(X − E(X))(X − E(X))T },

where the expectation is with respect to the joint distribution of (R, X), and
then apply Proposition 9.6 with first vector βnpls ∈ EΣX (SE(Y |X) ). To get
the second vector, assume without loss of generality that E(X) = 0 and use
U = u(βnplsT X) = βnplsT X and V = v(βnplsT X) = −βnplsT X E(Y ) − (βnplsT X)2 .
This gives

E((U Y + V )X) = E({Y X T βnpls − E(Y )X T βnpls − (X T βnpls )2 }X)


= E({Y − E(Y ) − X T βnpls }XX T βnpls )
= E(RXX T )βnpls = ΣRXX βnpls ∈ EΣX (SE(Y |X) ),

where the final containment follows from Proposition 9.6. For the third vector
we operationally replace βnpls in u and v with ΣRXX βnpls to get

U = u((ΣRXX βnpls )T X) = (ΣRXX βnpls )T X


V = v((ΣRXX βnpls )T X) = −(ΣRXX βnpls )T XE(Y ) − ((ΣRXX βnpls )T X)2 .

This gives the third vector

E((U Y + V )X) = Σ2RXX βnpls ∈ EΣX (SE(Y |X) ).

Repeating this operation iteratively, we get by induction

span(βnpls , ΣRXX βnpls , . . . , ΣjRXX βnpls , . . .) ⊆ EΣX (SE(Y |X) ), (9.7)

which implies that we can run NIPALS with Σb RXX in place of Σb X and βbnpls in
place of σb X,Y . Using residuals in Σb RXX has the advantage that the linear trend
is removed, making non-linearities easier to identify.
The graphical diagnostic for non-linearity in the mean function can be
summarized as follows:

1. Run NIPALS(Σb X , σb X,Y ), obtain βbnpls and then form the sample residuals

   Rbi = Yi − Ȳ − βbnplsT (Xi − X̄), i = 1, . . . , n.

2. Construct the third moment matrix

   Σb RXX = n−1 ∑ni=1 Rbi (Xi − X̄)(Xi − X̄)T .

3. Run NIPALS(Σ b RXX , βbnpls ) and extract the weights hj , j = 1, . . . , p − 1.


Here we use hj to denote the vectors of weights to distinguish them from
the usual weights wj defined in Table 3.1.

4. Plot Y or residuals versus hTj X for j = 1, 2, . . . and look for clear non-
linear trends. Since βnpls is the first vector listed on the left-hand side of
(9.7), h1 = βbnpls /‖βbnpls ‖, where the normalization arises from the eigen-
vector computation in Table 3.1. In consequence, a plot of Y or residuals
against hT1 X is effectively the same as a plot of Y or residuals against the
NIPALS fitted values βbnplsT Xi , i = 1, . . . , n. A code sketch of steps 1–4
follows this list.
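
A minimal sketch of steps 1–4 follows, assuming the centered predictors Xc, the response Y and the fitted coefficient vector beta_pls from the sketch in Section 9.4 are in hand. Generating the hj by a Gram–Schmidt recursion on the sequence in (9.7) is our reading of step 3; the NIPALS weights span the same space, although the individual vectors may differ by the orthogonalization convention.

```python
# Sketch of the residual-based diagnostic (steps 1-4); names are illustrative.
import numpy as np
import matplotlib.pyplot as plt

R = Y - Y.mean() - Xc @ beta_pls                    # step 1: sample residuals
Sigma_RXX = (Xc * R[:, None]).T @ Xc / len(Y)       # step 2: third-moment matrix

H = []                                              # step 3: weights h_1, h_2, ...
v = beta_pls / np.linalg.norm(beta_pls)             # h_1 = beta_hat / ||beta_hat||
for _ in range(3):                                  # first few weight vectors
    H.append(v)
    v = Sigma_RXX @ v                               # next vector in the sequence (9.7)
    for h in H:                                     # Gram-Schmidt against earlier h_j
        v = v - (h @ v) * h
    v = v / np.linalg.norm(v)
H = np.column_stack(H)

plt.scatter(Xc @ H[:, 1], R)                        # step 4: residuals vs h_2' X
plt.xlabel("h2' X")
plt.ylabel("residuals")
plt.show()
```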

If the linear model is incorrect, we expect that E(R | X) will exhibit


quadratic or high-order behavior, giving a clear signal that the model is some-
how deficient. Further interpretation of non-linear trends can be problematic
because it is possible that dim(SE(R|X) ) > dim(SE(Y |X) ), implying that the
residual regression R | X is more complicated than the original regression
Y | X because the residual regression can introduce extraneous directions
into the picture. If non-linearity is detected in residual plots, it may be wise
to use the plots at the end of Section 9.4 to guide model development.

9.6 Prediction
The graphical methods described in Sections 9.5.2 and 9.5.3 are used to as-
sess the presence of non-linearity in the mean function. If no notable non-
linearity is detected then the linear methods described in Chapter 3 may be
serviceable for prediction. If non-linearity is detected, the compressed pre-
dictors w1T X, w2T X, . . . can still serve for dimension reduction under Corol-
lary 9.3, but there is no model or rule for prediction and consequently there
is no way to determine the number of components. To complete the anal-
ysis we need a method or model for predicting Y from w1T X, . . . , wdT X,
d = 1, 2, . . . , min{p, n − 1}, along with the associated estimated predictive
mean squared error used in selecting the number of components.

9.6.1 Available approaches to prediction


A variety of methods have been proposed to handle non-linear PLS applica-
tions, ranging from models that describe relationships between latent variables
in a non-linear way (Wold et al., 2001), to quadratic PLS (Baffi et al., 1999;
Li et al., 2005), to spline PLS (Wold, 1992), to neural network PLS (Baffi,
Martin, and Morris, 1999; Chiappini et al., 2020; Chiappini, Goicoechea, and
Olivieri, 2020; Olivieri, 2018). A tutorial on the use of neural networks for
multivariate calibration is available from Despagne and Luc Massart (1998).
Shan et al. (2014) reviewed these and other methods, pointing out their rel-
ative strengths and weaknesses. They then proposed a new method based on
slicing the response into non-overlapping regions within which the relation-
ship between Y and X follows linear model (1.1), and they demonstrated good

performance against a variety of other methods. Nevertheless, being the basis


for sliced inverse regression, the slicing operation is well understood in suf-
ficient dimension reduction. In particular, overall efficiency tends to drop off
as the number of slices increases because the number of observations per slice
decreases resulting in a loss of information for each intra-slice linear model fit.
The available non-linear methods seem to either ignore dimension reduc-
tion or to have a new dimension reduction paradigm built in. However, we
have seen as a consequence of Corollary 9.3 that standard PLS regression al-
gorithms like NIPALS and SIMPLS can be used for dimension reduction even
when the mean function E(Y | X) is an unknown non-linear function of X.
Thus, as indicated previously in Section 9.5.2, we need a method or model
for predicting Y from w1T X, . . . , wdT X, d = 1, 2, . . . , min{p, n − 1}, along with
the associated estimated predictive mean squared error for determining the
number of components. Dimension reduction is no longer an issue per se.
Since dimension reduction is no longer an issue, the regression of Y on
w1T X, . . . , wdT X might be developed by using classical methods. In some set-

tings a simple transformation of the response may be enough to induce lin-


earity in the mean function sufficient to give good predictions. The Box-Cox
(Box and Cox, 1964) family of power transformations (see also Cook and
Weisberg, 1999) should be sufficient for positive responses, while the rela-
tively recent method by Hawkins and Weisberg (2017) can be used for non-
positive responses. In other cases adding all quadratic (wkTX)2 and cross prod-
uct wkTXwjTX terms may be sufficient, provided d is not too large since there
will be a total of d(d + 1)/2 terms of these forms. Another possibility is to
use classical non-parametric regression (e.g. Green and Silverman, 1994), but
again tuning is required and problems may arise if d is too large. Methods
based on decision trees may be useful in some settings.
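
As one concrete version of the quadratic option mentioned above, the compressed predictors can be augmented with their d(d + 1)/2 squares and cross products and the expanded set fitted by ordinary least squares. The sketch below assumes a score matrix Z of compressed predictors and a response vector Y; the function and variable names are ours.

```python
# Augment the compressed predictors with all squares and cross products
# (d(d+1)/2 additional columns), then fit by OLS; an illustrative sketch.
import numpy as np
from itertools import combinations_with_replacement

def quadratic_expand(Z):
    pairs = combinations_with_replacement(range(Z.shape[1]), 2)
    quads = [Z[:, j] * Z[:, k] for j, k in pairs]
    return np.column_stack([Z] + quads)

Z_aug = quadratic_expand(Z)
design = np.column_stack([np.ones(len(Z_aug)), Z_aug])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
```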
We present in the next section a novel method of prediction adapted from
the general proposal of Adragni and Cook (2009). Our method gives good pre-
dictions for continuous predictors, does not require an entirely new analysis
for each value of d, is not hindered by large values of d and, based on our case
studies, performs as well as or better than existing methods.

9.6.2 Prediction via inverse regression


For notational convenience, let Zd = (w1T X, . . . , wdT X)T . The premise under-
lying the methodology of this section is that it is generally easier to adequately

model the inverse regression of Zd on Y than it is to model the forward regres-


sion of Y on Zd since the regression of Zd on Y consists of d simple regressions.
The inverse regression of Zd on Y can itself be inverted to provide a method
for estimating the forward mean function E(Y | X) without specifying explic-
itly a model for the forward regression. We denote the density of Zd | Y by
g(zd | Y ). Then

E(Y | Zd = zd ) = E{Y g(zd | Y )} / E{g(zd | Y )},            (9.8)
where all right-hand side expectations are with respect to the marginal dis-
tribution of Y . A sample version of (9.8) provides an estimator of the forward
mean function:
Eb(Y | Zd = zd ) = ∑ni=1 ωi (zd )Yi ,                          (9.9)

ωi (zd ) = gb(zd | Yi ) / ∑nj=1 gb(zd | Yj ),
where gb denotes an estimated density. This estimator may be reminiscent of
a non-parametric kernel estimator (e.g. Simonoff, 1996, Ch. 4), but there are
important differences. The weights in a kernel estimator do not depend on the
response, while the weights ωi here do. Multivariate kernels are usually taken
to be the product of univariate kernels, corresponding here to constraining the
components of Zd to be independent, but (9.9) requires no such constraint.
Finally, there is no explicit bandwidth in our weights as they are determined
entirely from gb, which eliminates the need for bandwidth estimation by, for
example, cross validation .
To get an operational version of (9.9), we need to decide how to get the
estimated densities. One straightforward method is based on the multivariate
normal model
Zd = µd + Θd f (Y ) + εd , (9.10)
where f ∈ Rs is a known user-specified vector-valued function of the response,
Θd is a d × s matrix of regression coefficients and εd ∼ Nd (0, ΣZd |Y ) is inde-
pendent of Y .
For instance, suppose that by inspecting plots of the d = 10 components
of Zd versus Y we conclude correctly that each element of E(Zd | Y ) is at
most a quadratic function of Y , implying that f (Y ) = (Y, Y 2 )T and that
Z10 = µ + Θ10×2 (Y, Y 2 )T + ε.

Let SZd |Y denote the covariance matrix of the residuals and let

Zbd,i = µbd + Θbd f (Yi ), i = 1, . . . , n

denote the fitted values from a fit of (9.10). Then based on model (9.10), we
have

ωi (zd ) ∝ exp{−(1/2)(zd − Zbd,i )T S−1Zd |Y (zd − Zbd,i )}.
These weights are then substituted into (9.9) to obtain the predictions
Yb (zd ) = Eb(Y | Zd = zd ), which can be used with a holdout sample or cross
validation to estimate the predictive mean squared error.
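
The following minimal sketch assembles the pieces of this section in Python with numpy, assuming the compressed predictors are held in an n × d array Zd with response vector Y, and taking f (Y ) = (Y, Y 2 )T as in the example above. The names are illustrative and the snippet is a sketch of (9.9)–(9.10), not the book's accompanying software.

```python
# Sketch of the inverse-regression predictor: fit Zd on (1, f(Y)), estimate the
# residual covariance, then weight the observed responses by normal densities.
import numpy as np

def f(Y):                                        # user-specified f, here (Y, Y^2)
    return np.column_stack([Y, Y**2])

F = np.column_stack([np.ones(len(Y)), f(Y)])     # design matrix [1, f(Y)]
B, *_ = np.linalg.lstsq(F, Zd, rcond=None)       # rows hold (mu_d, Theta_d)
Zd_fit = F @ B                                   # fitted values Zhat_{d,i}
resid = Zd - Zd_fit
S_inv = np.linalg.inv(resid.T @ resid / len(Y))  # inverse of S_{Zd|Y}

def predict(zd_new):
    dev = zd_new - Zd_fit                        # (n, d) deviations from fitted values
    logw = -0.5 * np.einsum('ij,jk,ik->i', dev, S_inv, dev)
    w = np.exp(logw - logw.max())
    return (w / w.sum()) @ Y                     # estimate of E(Y | Zd = zd_new)

Yhat = np.array([predict(z) for z in Zd])        # in-sample predictions
```

Out-of-sample predictions are obtained by applying predict to the compressed predictors of a holdout case, and the predictive mean squared error can then be estimated by cross validation or a holdout sample as described above.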
The performance of this predictive methodology depends on the approxi-
mate normality of Zd and the choice of f . Since low-dimensional projections
of high-dimensional data are approximately normal (Diaconis and Freedman,
1984; Hall and Li, 1993), it is reasonable to rely on approximate normality
of Zd . Consistency and the asymptotic distribution for prediction were es-
tablished by Forzani et al. (2019) for the case of p fixed and n → ∞. There
are several generic possibilities for choice of f , perhaps guided by graphics
as implied earlier. Fractional polynomials (Royston and Sauerbrei, 2008) or
polynomials deriving from a Taylor approximation,

f (Y ) = {Y, Y 2 , Y 3 , . . . , Y s }T ,

are one possibility for single-response regressions. Periodic behavior could be modeled using a Fourier series
form
f (Y ) = {cos(2πY ), sin(2πY ), . . . , cos(2πkY ), sin(2πkY )}T
as perhaps in signal processing applications. Here, k is a user-selected integer
and s = 2k. Splines and other types of non-parametric constructions could
also be used to form a suitable f . A variety of basis functions are available
from Adragni (2009).
Another option with a single continuous response consists of “slicing”
the observed values of Y into H bins Ch , h = 1, . . . , H. We can then set
s = H − 1 and specify the h-th element of f to be Jh (Y ), where J is the
indicator function, Jh (Y ) = 1 if Y is in bin h and 0 otherwise. This has the
effect of approximating each component E(wjT X | Y ) of E(Zd | Y ) as a step
function of Y with H steps,
E(Zd | Y ) ≈ µ + ξJ,

where J = (J1 (Y ), J2 (Y ), . . . , JH−1 (Y ))T .
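
A small sketch of this slicing construction follows, assuming the bins Ch are formed from sample quantiles of Y; the matrix J below supplies the H − 1 indicator columns of f (Y ). The names are ours.

```python
# Build the slice indicators J_1(Y), ..., J_{H-1}(Y) from H quantile bins of Y.
import numpy as np

H = 5
edges = np.quantile(Y, np.linspace(0, 1, H + 1))[1:-1]    # H - 1 interior cut points
bins = np.digitize(Y, edges)                              # bin index 0, ..., H - 1
J = (bins[:, None] == np.arange(H - 1)).astype(float)     # drop the last bin
```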



Cook and Forzani (2008b) studied a generalized version of model (9.10)


that leads to principal fitted components. Their large sample results apply
to our present case and help us understand the importance of normality
and the choice of f . Consider model (9.10) but with possibly non-normal
errors that are independent of Y and have finite fourth moments. The fit-
ted model has mean function µd + Θd f (Y ) but we no longer assume that
E(Zd | Y ) = µd + Θd f (Y ), thus allowing for the possibility that f is
incorrect. It follows from Cook and Forzani (2008b, Section 3.2) that our es-
timators are √n-consistent with p fixed and n → ∞ if and only if the matrix
of correlations between E(Zd | Y ) and f (Y ) has rank q. This indicates that we
can still expect useful results when f is misspecified provided there is sufficient
correlation with the true mean function.

9.7 Tecator data


The Tecator data is a well-known chemometrics dataset that consists of 215
NIR absorbance spectra of meat samples, recorded on a Tecator Infratec Food
and Feed Analyzer (Boggaard and Thodberg, 1992). The three response vari-
ables are the percent moisture, fat, and protein content in each meat sample
as determined by analytic chemistry. The spectra ranged from 850 to 1050 nm,
discretized into p = 100 wavelength values which are the predictors. Training
and testing datasets contain 172 and 43 samples. A plot of the absorbance
versus wavelength is available from Shan et al. (2014).
Figure 9.1 shows four diagnostic plots for the response fat. Plot (a) corre-
sponds to the PLS1 graphical procedure described in Section 9.4. Under the
linearity condition for the CMS, we have from (9.4) that β ∈ SE(Y |X) . Accord-
ingly, the trend in a plot of fat vs. βbnplsT X should appear linear if model (2.5)
holds and non-linear under model (9.1). Because the plot is clearly curved
as seen by the fitted quadratic, we conclude that there is evidence for model
(9.1). This is supported by plot (b) which is a scatterplot of the residuals
ri = Yi − Ȳ − βbnplsT Xi vs. βbnplsT Xi . In plots (a) and (b), the horizontal axis can
also be interpreted as R1T Xi , as discussed under step 4 of the graphical pro-
cedure described in Section 9.5.3. Plots (c) and (d) correspond to the second
and third plots of step 4. These plots can show if there is non-linearity present

(a) Fat vs. βbnplsT X        (b) Residuals, ri vs. βbnplsT X
(c) Residuals, ri vs. R2T X   (d) Residuals, ri vs. R3T X


FIGURE 9.1
Diagnostic plots for fat in the Tecator data. Plot (a) was smoothed with a
quadratic polynomial. The other three plots were smoothed with splines of
order 5. (From Fig. 1 of Cook and Forzani (2021) with permission.)

beyond that displayed in plots (a) and (b). In this case there does not seem
to be any notable non-linearity remaining and consequently we could proceed
with modeling based on plot (a). In particular, a reasonable model would have
E(fat | X) = f (β T X) for some scalar-valued function f that is likely close to
a quadratic.
Figures 9.2 and 9.3 show for protein and water content the same construc-
tions as Figure 9.1 does for fat content. Our interpretation of these plots is
qualitatively similar to that for Figure 9.1: In this case there does not seem

(a) Protein vs. βbnplsT X     (b) Residuals, ri vs. βbnplsT X
(c) Residuals, ri vs. R2T X   (d) Residuals, ri vs. R3T X


FIGURE 9.2
Diagnostic plots for protein in the Tecator data. Plot (a) was smoothed with
a quadratic polynomial. The other three plots were smoothed with splines of
order 5.

to be any notable non-linearity remaining and consequently we could pro-


ceed in each case with modeling based on plot (a). Since each plot (a) seems
well-described by a quadratic, the overall three-component model, with one
component for each response, suggested by these plots is

Y = (Fat, Protein, Water)T
  = (β10 + β11 (α1T X) + β12 (α1T X)2 , β20 + β21 (α2T X) + β22 (α2T X)2 ,
     β30 + β31 (α3T X) + β32 (α3T X)2 )T + ε,                        (9.11)

where ‖αj ‖ = 1, j = 1, 2, 3. Sample size permitting, this model can now



(a) Water vs. βbnplsT X       (b) Residuals, ri vs. βbnplsT X
(c) Residuals, ri vs. R2T X   (d) Residuals, ri vs. R3T X


FIGURE 9.3
Diagnostic plots for water in the Tecator data. Plot (a) was smoothed with
a quadratic polynomial. The other three plots were smoothed with splines of
order 5.

be fitted and studied using traditional methods for non-linear regression.


When the sample size is deficient, the αj ’s can be replaced by their esti-
mates αbj = βbnpls /kβbnpls k coming from a PLS algorithm, yielding a model in
one compressed predictor, α bjT X, j = 1, 2, 3, for each response.
   
Fat β10 + β11 (b α1T X)2
α1T X) + β12 (b
Y =  Protein  =  β20 + β21 (b α2T X)2  + ε.
α2T X) + β22 (b (9.12)
   

Water β30 + β31 (b α3T X)2


α3T X) + β32 (b

Standard inferences based on this model will be optimistic because they do



not take the variation in the αbj 's into account, although assessing predictive
performance using cross validation or a holdout sample is still feasible.
In some regressions it may be worthwhile to entertain the notion that each
response is responding to the same linear combination of the predictors, so
α1 = α2 = α3 := α. This leads to the model
Y = β0 + β T ((αT X), (αT X)2 )T ,

where β0 is a 3 × 1 vector of intercepts and β is a 2 × 3 matrix of regression


coefficients. For the Tecator data, the correlations among the fitted vectors
βbnplsT X are

            Fat      Water    Protein
  Fat      1.000    −0.989    −0.862
  Water   −0.989     1.000     0.820
  Protein −0.862     0.820     1.000


Because the absolute correlation between Fat and Water fits is high, it may be
reasonable to conjecture that they are controlled by the same linear combina-
tion of the predictors. This conjecture in terms of model (9.11) is that α1 = α2 .
Contrasting the fits of model (9.11) with the version in which α1 = α2 ≠ α3
would then be a natural next step.
We next describe five methods that we used to illustrate how prediction
might be carried out in a non-linear setting:

1. Linear forward PLS.

Here we used NIPALS as summarized in Table 3.1 to fit each response indi-
vidually. The number of components was determined by using 5-fold predictive
cross validation on the training data. This gave q = 13, 13, and 17 for fat, pro-
tein and water. In view of the curvature present in panel (a) of Figures 9.1–9.3
we would expect non-linear methods to give smaller mean squared prediction
errors.

2. Quadratic forward PLS.

The fitted model for prediction is as shown in (9.12) with the coefficients
determined by using linear PLS. The number of components was again deter-
mined by using 5-fold predictive cross validation on the training data, giving
q = 12, 13, and 16 for fat, protein and water. This method will perform best

when the CMS has dimension 1, as suggested by the plots in Figures 9.1–
9.3, and the linearity condition holds. If the dimension of the CMS is greater
than 1, this method could still give prediction mean squared errors that are
smaller than those from linear PLS, provided that the response surface is well
approximated by a quadratic.

3. Non-parametric forward PLS.


This method is similar to quadratic PLS, except the regression of Y on βbnplsT X
was fitted using a non-parametric method instead of a quadratic. We used the
R package np that uses kernel regression estimates. Details of this method are
available in the literature that accompanies the package. Our choice here was
based mainly on convenience; many other non-parametric fitting methods are
available.
The number of components was again determined by using 5-fold predictive
cross validation on the training data, giving q = 13, 14, and 17 for fat, protein
and water. Like quadratic PLS, this method will perform best under the single
index model; that is, when the CMS has dimension 1 and the linearity condi-
tion holds, as described in Corollary 9.2 and Proposition 9.5. If the dimension
of the CMS is greater than 1, this method could still give prediction mean
squared errors that are smaller than those from linear PLS, without requiring
that the response surface be well approximated by a quadratic. Asymptotic
results, including consistency, are available from Forzani, Rodriguez, and Sued
(2023).

4. Non-parametric inverse PLS (NP-I-PLS) using βbnplsT X.

This is a version of the method described in Section 9.6.2 where the weights
(9.9) were determined using Zd = βbnplsT
X and f (Y ) = (Y, Y 2 , Y 1/2 ) was se-
lected by graphical inspection. The number of components for βbnpls was again
selected by using 5-fold predictive cross validation on the training data, giving
q = 13, 18, and 18 for fat, protein and water. Like quadratic PLS, and non-
parametric forward PLS, this method will perform best when the CMS has
dimension 1 and the linearity condition holds but may still be useful otherwise.

5. Non-parametric inverse PLS (NP-I-PLS) using W TX.

This is the method described in Section 9.6.2 using f (Y ) = (Y, Y 2 , Y 1/2 ). The
number of components for W was again determined by using 5-fold predictive
TABLE 9.1
Tecator Data: Number of components based on 5-fold cross validation using
the training data for each response. NP-PLS denotes non-parametric PLS and
NP-I-PLS denotes non-parametric inverse PLS.

Method Fat Protein Water


1. Linear PLS 13 13 17
2. Quadratic PLS 12 13 16
3. NP-PLS 13 14 17
4. NP-I-PLS with βbnplsT X 13 18 18
5. NP-I-PLS with W TX 15 17 22

cross validation on the training data, giving q = 15, 17, and 22 for fat, pro-
tein and water. The previous four methods will all be at their best when the
dimension of the CMS is 1. This method does not have that restriction and
should work well even when the dimension of the CMS is greater than one.
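
To illustrate how the number of components was chosen, the following sketch performs 5-fold predictive cross validation for linear PLS (method 1), reusing the krylov_weights helper from the sketch in Section 9.4. The data arrays X and Y, the candidate range for q, and the fold assignment are assumptions; the names are ours.

```python
# 5-fold cross validation for the number of PLS components (linear PLS, method 1).
# Assumes krylov_weights from the earlier sketch is in scope.
import numpy as np

def pls_fit_predict(Xtr, Ytr, Xte, q):
    mx, my = Xtr.mean(axis=0), Ytr.mean()
    W = krylov_weights(Xtr, Ytr, q)                       # p x q weight matrix
    b = np.linalg.lstsq((Xtr - mx) @ W, Ytr - my, rcond=None)[0]
    return my + (Xte - mx) @ (W @ b)

def cv_rmse(X, Y, q, folds=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(Y))
    sq_errors = []
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        pred = pls_fit_predict(X[train], Y[train], X[test], q)
        sq_errors.append((pred - Y[test]) ** 2)
    return np.sqrt(np.mean(np.concatenate(sq_errors)))

q_best = min(range(1, 21), key=lambda q: cv_rmse(X, Y, q))
```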
The number of components for each method-response combination is sum-
marized in Table 9.1 and the root mean squared prediction errors

{(1/43) ∑43i=1 (Ybtest,i − Ytest,i )2 }1/2

determined from the 43 observations in the testing dataset are given in Ta-
ble 9.2(a). The root mean squared prediction errors typically changed rela-
tively little when the number of components was varied by, say, ±2. This is
illustrated in Figure 9.4, which shows the cross validation root mean squared
prediction error for linear PLS, method 1, and non-parametric inverse PLS
with βbnplsT X, method 4. The prediction errors are much larger for q outside
the range of the horizontal axis shown in Figure 9.4.
The number of training cases in this dataset n = 172 is greater than the
number of predictors, p = 100, but none of the prediction methods used here
requires that n > p and all are serviceable when n < p.
The first method, linear PLS, is the same as the standard NIPALS algo-
rithm outlined in Table 3.1. From Table 9.2(a) it seems clear that in this exam-
ple any of methods 2–4 is preferable to the NIPALS algorithm. This is perhaps
not surprising in view of the curvature shown in panel (a) of Figures 9.1–9.3.
Methods 2–4 are all based on the NIPALS fitted values βbnplsT X. As mentioned

FIGURE 9.4
Tecator data: root mean squared prediction error versus number of compo-
nents for fat. Upper curve is for linear PLS, method 1 in Table 9.1; lower curve
is for inverse PLS prediction with βbnplsT X, method 4. (From Fig. 2 of Cook
and Forzani (2021) with permission.)

previously, these methods will do their best when the CMS has dimension 1.
We judge their relative behavior in this example to be similar, except it might
be argued that non-parametric PLS demonstrates a slight advantage. Method
5, non-parametric inverse PLS with W TX, has the smallest root mean squared
prediction errors in this example and is clearly the winner. This method does
not rely on the CMS having one dimension, and we expect that it will dom-
inate in any analysis where linear PLS regression is judged to be serviceable
apart from the presence of non-linearity in the conditional mean E(Y | X).
As discussed in Section 9.6.1, Shan et al. (2014) propose a new method,
called PLS-SLT, of non-linear PLS based on slicing the response into non-
overlapping bins within which the relationship between Y and X is assumed
to follow linear model (1.1). In their analysis of the Tecator data, PLS-SLT
showed superior predictive performance against linear PLS and five compet-
ing non-linear methods. Table 9.2(b) gives the root mean squared prediction
error reported by Shan et al. (2014, Table 6) for linear PLS and their new
method PLS-SLT. They used 12 components for each of the six scenarios
shown in Table 9.2(b). Although we used 13 components for fat and protein,
the root mean squared prediction errors for linear PLS shown in Table 9.2
are nearly identical. We used 17 components for the linear PLS analysis of
water, while Shan et al. (2014) used 12. This may account for the discrepancy
TABLE 9.2
Tecator Data:(a) Root mean squared training and testing prediction errors
for five methods of prediction.

Fat Protein Water


(a) Method Train Test Train Test Train Test
1. Linear PLS 2.04 2.10 0.61 0.62 1.48 1.72
2. Quadratic PLS 1.69 1.85 0.56 0.60 1.30 1.51
3. NP-PLS 1.29 1.90 0.51 0.58 1.19 1.32
T
4. NP-I-PLS with βbnpls X 1.66 1.91 0.54 0.62 1.20 1.41
T
5. NP-I-PLS with W X 1.40 1.49 0.47 0.54 0.88 1.32
(b) Method
I. Linear PLS – 2.10 – 0.63 – 2.03
II. PLS-SLT – 1.66 – 0.60 – 1.36

The three responses were analyzed separately as in PLS1. NP-PLS denotes


non-parametric PLS and NP-I-PLS denotes non-parametric inverse PLS. (b)
Root mean squared testing prediction error for two methods of prediction from
Shan et al. (2014). PLS-SLT is the slice transform version of PLS proposed
by Shan et al. (2014).

in linear PLS prediction errors for water, 1.72 and 2.03 in parts (a) and (b)
of Table 9.2. Most importantly, the root mean squared prediction error for
PLS-SLT is comparable to or smaller than the prediction errors for methods
1–4 in Table 9.2(a), but its prediction errors are larger than those for method
5, non-parametric inverse PLS with W TX.
The methods discussed here all require a number of user-selected tuning
parameters or specifications. Our method 5, non-parametric inverse regression
with Zd = W TX, requires f (y) and the number of components q represented in
W . Recall that we selected f (y) = (y, y 2 , y 1/2 )T based on graphical inspection
of inverse regression functions, and we selected q by using 5-fold cross valida-
tion on the training data. PLS-SLT requires selecting the number of bins and
the number of components, which we still represented as q. Shan et al. (2014)
restricted q to be at most 15 and the number of bins to be at most 10. For
each number of bins, they determined the optimal value of q by using a test-
ing procedure proposed by Haaland and Thomas (1988). The number of bins
giving the best predictive performance was then selected. We conclude from
these descriptions that the procedures for selecting the tuning parameters are

an integral part of the methods; changing those procedures in effect changes


the method. Thus, although based on different numbers of components, it is
fair to compare the root mean squared prediction errors in Table 9.2.

9.8 Etanercept data


Chiappini et al. (2020) used a multi-layer perceptron (MLP) artificial neural
network to develop a novel multivariate calibration strategy for the online
prediction of etanercept, a protein used in the treatment of autoimmune dis-
eases like rheumatoid arthritis. An artificial neural network was used instead
of a traditional PLS algorithm because of the expected non-linear relationship
between etanercept concentrations and the spectral predictors.
Etanercept samples were studied using fluorescence excitation-emission
data generated in the spectral ranges 225.0–495.0 nm and 250.0–599.5 nm for
the excitation and emission modes, respectively. Their experimental protocol re-
sulted in 38,500 data points for the vectorized set of excitation and emis-
sion wavelengths. After deletion of one outlier, their full data set consisted of
n = 35 independent samples, each consisting of r = 1 etanercept concentra-
tion and p = 38, 500 spectral readings as predictors. However, it is difficult if
not impossible to adequately train an artificial neural network with so many
predictors (Despagne and Luc Massart, 1998) and, in consequence, Chiappini
et al. (2020) first reduced the number of predictors by systematically removing
wavelengths to yield a more manageable p = 9,623 predictors. This number
is still too large and so the authors further reduced these predictors to be-
tween 4 and 12 principal components, which then constituted the input layer
for their network. They trained their MLP network on a subset of n = 26
training samples, reserving the remaining 9 samples for testing and estimat-
ing the root mean squared prediction error. Their training protocol consisted
of running a Box-Behnken design with 5 center points; 4, 8, and 12 input nodes
(the principal components computed from the p = 9,623 spectral readings);
2, 6, and 10 nodes for layer 1; 0, 4, and 8 nodes for layer 2 and one output
node. They used a desirability function to select the optimum settings for their
MLP network. After considerable analysis, their protocol resulted in 12 input
variables, two nodes at layer 1 and 0 nodes at layer 2. For their final MLP
TABLE 9.3
Etanercept Data: Numbering of the methods corresponds to that in Table 9.2.
Part (a): Number of components was based on leave-one-out cross validation
using the training data with p = 9,623 predictors. RMSEP is the root mean
squared error of prediction for the testing data. Part (b): The number of com-
ponents for MLP is the number of principal components selected for the input
layer of the network. Part (c): Number of components was based on leave-one-
out cross validation using all 35 data points. CVRPE is the cross validation
root mean square error of prediction based on the selected 6 components.

(a) Method, p = 9,623 Components RMSEP


1. Linear PLS 7 15.28
3. NP-PLS 4 15.10
4. NP-I-PLS with βbnplsT X 5 12.43
5. NP-I-PLS with W TX 6 9.20
(b) MLP Network, p = 12 12 8.60
(c) Method, p = 38,500 Components CVRPE
NP-I-PLS with W TX 6 8.46

network, the root mean squared prediction error for the testing data was 8.6,
which is considerably smaller than the corresponding error of 15.28 from their
implementation of PLS.
We applied four of the methods used for Table 9.2 to the same training
and testing data with p = 9,623. Quadratic PLS (method 2 in Table 9.2) was
not considered because the response surface was noticeably non-quadratic.
The fitting function f (Y ) = (Y, Y 2 , Y 3 , Y 1/2 )T for inverse PLS with W TX
(see Section 9.6.2) was selected by smoothing a few plots of predictors versus
the response from the training data. Using the testing data, the root mean
squared prediction errors for these four methods are shown in Table 9.3(a),
along with the prediction error for the final MLP network in part (b). The
prediction error of 9.20, which is the best of those considered, is a bit larger
than the MLP error found by Chiappini et al. (2020) but still considerably
smaller than the PLS error.
Wold et al. (2001) recommended that “A good nonlinear regression tech-
nique has to be simple to use.” We agree subject to the further condition that
a simple technique must produce competitive or useful results. The general

approach behind the MLP network requires considerable analysis and many
subjective decisions, including the initial reduction to 9,623 spectral measure-
ments followed by reduction to 12 principal components, outlier detection
methodology, the type of experimental design used to train the network, the
desirability function, specific characteristics of the network construction, divi-
sion of the data into testing and training sets, and so on. Because of all these
required data-analytic decisions, it is not clear to us that the relative perfor-
mance of the neural network approach displayed in the etanercept data would
hold in future analyses. In contrast, the inverse PLS approach with W TX
proved best in the Tecator data and had a strong showing in the analysis of
the etanercept data with many fewer data analytic decisions required.
To gain intuition into what might be achieved by minimizing the number
of data-analytic decisions required, we conducted another analysis using NP-
I-PLS with W TX based on the original 38,500 predictors and n = 35 data
points. For each fixed number of components, the root mean square error of
prediction was estimated by using leave-one-out cross validation with the fit-
ting function used previously. For each subset of 34 observations, the number
of components minimizing the prediction error was always 6. The cross val-
idation root mean square error of prediction was determined to be 8.46 as
shown in part (c) of Table 9.3. This relatively simple method requires only
two essential data-analytic choices: the number of components and the fitting
function f (Y ), and yet it produced a root mean square error of prediction
that is smaller than that found by the carefully tuned MLP network.

9.9 Solvent data


Lavoie, Muteki, and Gosselin (2019) reaffirmed the general inappropriate-
ness of linear PLS in the presence of non-linear relationships and proposed a
new non-linear PLS algorithm. They used constrained piecewise cubic splines
to model non-linear relationships iteratively, and calculated variable weights
using a method originally proposed by Indahl (2005). While we find their
method to be somewhat elusive, it performed quite well against four other
proposed non-linear PLS methods in the three case studies that they reported.
One of their case studies used a dataset on properties of different solvents. We

FIGURE 9.5
Solvent data: Predictive root mean squared error PRMSE versus number of
components for three methods of fitting. A: linear PLS. B: non-parametric
inverse PLS with W TX, as discussed in Section 9.7. C: The non-linear PLS
method proposed by Lavoie et al. (2019).

used the same solvent data to compare the method proposed by Lavoie et al.
(2019) to linear PLS and to non-parametric inverse PLS with W TX.
We studied the solvent data following the steps described by Lavoie et al.
(2019). For instance, we removed the 5 observations that Lavoie et al. (2019)
flagged as outliers, leaving n = 98 observations on p = 8 chemical properties of
the solvents as predictors and one response, the dielectric constant. Ten-fold
cross validation was used to measure the predictive root mean squared error,
PRMSE.
The results of our comparative study are shown in Figure 9.5. Curve A
gives the PRMSE for linear PLS and it closely matches the results of Lavoie
et al. (2019, Fig. 6(b), curve A). Curve C gives the PRMSE for the Lavoie et
al. method read directly from their Figure 6. Curve B gives the PRMSE of our
proposed NP-I-PLS with W TX using the fractional polynomial (e.g. Royston
and Sauerbrei, 2008) fitting function f (Y ) = (Y 1/3 , Y 1/2 , Y, Y 2 , Y 3 , log Y )T .
Figure 9.5 reinforces our previous conclusions that the dimension reduction

step of linear PLS algorithms is serviceable in the presence of non-linearity


and that the non-parametric inverse PLS with W TX method competes well
against all other methods.
10

The Role of PLS in Social Science Path Analyses

In this chapter, we describe the current and potential future roles for partial
least squares (PLS) algorithms in path analyses. After reviewing the present
debate on the value of PLS for studying path models in the social sciences
and establishing a context, we conclude that, depending on specific objectives,
PLS methods have considerable promise, but that the present social science
method identified as PLS is only weakly related to PLS and is perhaps more
akin to maximum likelihood estimation. Developments necessary for integrat-
ing proper PLS into the social sciences are described. A critique of covariance-
based structural equation modeling (cb|sem), as it relates to PLS, is given as
well. The discussion in this chapter follows Cook and Forzani (2023).

10.1 Introduction
Path modeling is a standard way of representing social science theories. It
often involves concepts like “customer satisfaction” or “competitiveness” for
which there are no objective measurement scales. Since such concepts can-
not be measured directly, multiple surrogates, which may be called indicators,
observed or manifest variables, are used to gain information about them indi-
rectly. One role of a path diagram is to provide a representation of the rela-
tionships between the concepts, which are represented by latent variables, and
the indicators. A fully executed path diagram is in effect a model that can be
used to guide subsequent analysis, rather like an algebraic model in statistics.




A path model is commonly translated to a structural equation model


(SEM) for analysis. There are two common paradigms for studying a path di-
agram after translation to an SEM, called partial least squares (pls|sem) and
covariance-based analysis (cb|sem). The acronyms just given are used here to
distinguish between the PLS methods used to study SEMs and PLS methods
more generally. These paradigms have for decades been used for estimation
and inference, but their advocates now seem at loggerheads over relative merit,
due in part to publications that are highly critical of pls|sem methods.

10.1.1 Background on the pls|sem vs cb|sem debate


Much discussion has taken place in the social sciences literature about the role
of pls|sem in path modeling, including Akter, Wamba, and Dewan (2017), Ev-
ermann and Rönkkö (2021), Goodhue, Lewis, and Thompson (2023), Henseler
et al. (2014), Rönkkö and Evermann (2013), Russo and Stol (2023), Sarstedt
et al. (2016), Sharma et al. (2022), and Rönkkö et al. (2016b).
Rönkkö et al. (2016b) asserted that “. . . the majority of claims made in
PLS literature should be categorized as statistical and methodological myths
and urban legends,” and recommended that the use of “. . . PLS should be
discontinued until the methodological problems explained in [their] article
have been fully addressed.” In response, proponents of pls|sem appealed for
a more equitable assessment. Henseler and his nine coauthors (Henseler et al.,
2014) tried to reestablish a constructive discussion on the role pls|sem in path
analysis while arguing that, rather than moving the debate forward, Rönkkö
and Evermann (2013) created new myths. McIntosh, Edwards, and Anton-
akis (2014) attempted to resolve the issues separating Rönkkö and Evermann
(2013) and Henseler et al. (2014) in an “even-handed manner,” concluding by
lending their support to Rigdon’s recommendation that steps be taken to di-
vorce pls|sem path analysis from its competitors by developing methodology
from a purely PLS perspective (Rigdon, 2012). The appeal for balance was
evidently found wanting by the Editors of The Journal of Operations Man-
agement, who established a policy of desk-rejecting papers that use pls|sem
(Guide and Ketokivi, 2015), and by Rönkkö et al. (2016b), who restated and
reinforced the views of Rönkkö and Evermann (2013).
Later, Evermann and Rönkkö (2021) tempered their ardent criticism of
pls|sem when writing “. . . to ensure that [Information Systems (IS)] re-
searchers have up-to-date methodological knowledge of PLS if they decide to

use it.” Although dressed up a bit to highlight new developments, their criti-
cisms are essentially the same as those they leveled in previous writings. There
was much give-and-take in the subsequent discussion, both for and against
pls|sem. Goodhue et al. (2023) recommended against using PLS in path
analysis, arguing that it “. . . violates accepted norms for statistical inference.”
They further recommended that key journals convene a task force to assess the
advisability of accepting PLS-based work. Russo and Stol (2023) took excep-
tion to some of Evermann and Rönkkö's (2021) conclusions, while commending
their efforts. Sharma et al. (2022) expressed matter-of-factly that the pls|sem
claims leveled by Evermann and Rönkkö (2021) are misleading, extraordinary
and questionable. They then set about “. . . to bring a positive perspective to
this debate and highlight the recent developments in PLS that make it an
increasingly valuable technique in IS and management research in general”.
There is a substantial literature that bears on this debate (Rönkkö et al.,
2016b, cites about 150 references). The preponderance of articles rely mostly
on intuition and simulations to support sweeping statements about intrinsi-
cally mathematical/statistical issues without sufficient supporting theory. But
adequately addressing the methodological issues in path analysis requires in
part avoiding ambiguity by employing a degree of context-specific theoretical
specificity.
In this chapter, we use the acronym PLS to designate the partial least
squares methods that stem from the theoretical foundations by Cook, Hel-
land, and Su (2013). These link with early work by Wold (Geladi, 1988) and
cover chemometrics applications (Cook and Forzani, 2020, 2021; Martens and
Næs, 1989), as well as a host of subsequent methods, particularly the PLS
methods for simultaneous reduction of predictors and responses discussed in
Chapter 5, which are relevant to the analysis of the path models. We rely on
these foundations in this chapter.

10.1.2 Chapter goals


We elucidate pls|sem by describing what it can and cannot achieve and by
comparing it to widely accepted methods like maximum likelihood. We also de-
scribe the potential of PLS methods in path analyses and discuss cb|sem as a
necessary backdrop for pls|sem. We restrict discussion to reflective path mod-
els so the context is clear, as it seems to us that many claims are made without
adequate foundations. By appealing to specific and relatively straightforward
path models, we are able to state clearly results that carry over qualitatively
to more intricate settings. And we are able to avoid terms and phrases that
do not seem to be understood in the same way across the community of path
modelers (e.g. McIntosh et al., 2014; Sarstedt et al., 2016). We took the articles
by Wold (1982), Dijkstra (1983, 2010), and Dijkstra and Henseler (2015a,b)
as the gold standard for technical details on pls|sem. Output from our im-
plementation of their description of a pls|sem estimator agreed with output
from an algorithm by Rönkkö et al. (2016a), which supports our assessment of
the method. We confine our discussion largely to issues encountered in Wold’s
first-stage algorithm (Wold, 1982, Section 1.4.1).

10.1.3 Outline
To establish a degree of context-specific theoretical specificity, we cast our de-
velopment in the framework of common reflective path models that are stated
in Section 10.2 along with the context and goals. These models cover simula-
tion results by Rönkkö et al. (2016b, Figure 1), which is one reason for their
adoption. We show in Section 10.3 the apparently novel observation that
this setting implies a reduced rank regression (RRR) model for the observed
variables (Anderson, 1951; Cook, Forzani, and Zhang, 2015; Izenman, 1975;
Reinsel and Velu, 1998). From here we address identifiability and estimation
using RRR. Estimators are discussed in Sections 10.4 and 10.5. The approach
we take and the broad conclusions that we reach should be applicable to other
perhaps more complicated contexts.
Some of the concerns expressed by Rönkkö et al. (2016b) involve contrasts
with cb|sem methodology. Building on RRR, we bring cb|sem methodology
into the discussion in Section 10.5 and show that under certain key assump-
tions additional parameters are identifiable under the SEM model. In Sec-
tion 10.6 we quantify the bias and revisit the notion of consistency at large.
The chapter concludes with simulation results and a general discussion.

10.2 A reflective path model


10.2.1 Path diagrams used in this chapter
The upper and lower path diagrams presented in Figure 10.1 are typical reflec-
tive path models. In each diagram elements of Y = (Yj ) ∈ Rr , X = (Xj ) ∈ Rp
denote observable random variables, which are called indicators, that are as-
sumed to reflect information about underlying real latent constructs ξ, η ∈ R.
This is indicated by the arrows in Figure 10.1 leading from the constructs to
the indicators, which imply that the indicators reflect the construct. Boxed
variables are observable while circled variables are not. This restriction to
univariate constructs is common practice in the social sciences, but is not re-
quired. An important path modeling goal in this setting is to infer about the
association between the latents η and ξ, as indicated by the double-headed
curved paths between the constructs. We take this to be a general objective
of both pls|sem and cb|sem, and it is one focal point of this chapter. Each
indicator is affected by an unobservable error. The absence of paths between
these errors indicates that, in the upper path diagram, they are independent
conditional on the latent variables. They are allowed to be conditionally de-
pendent in the lower path diagram, as indicated by the double-arrowed paths
that join them. We refer to the lower model as the correlated error (core)
model and to the upper model as the uncorrelated error (uncore) model.
The models of Figure 10.1 occur frequently in the literature on path models
(e.g. Lohmöller, 1989). The uncore model is essentially the same as that de-
scribed by Wold (1982, Fig. 1) and it is the basis for Dijkstra and Henseler’s
studies (Dijkstra, 1983; Dijkstra and Henseler, 2015a,b) of the asymptotic
properties of pls|sem. We have not seen a theoretical analysis of pls|sem
under the core model.
For instance, Bollen (1989, pp. 228–235, Fig. 7.3) described a case study
on the evolution of political democracy in developing countries. The variables
X1 , . . . , Xp were indicators of specific aspects of a political democracy in de-
veloping countries in 1960, like measures of the freedom of the press and the
fairness of elections. The latent construct ξ was seen as a real latent construct
representing the level of democracy in 1960. The variables Y1 , . . . , Yr were the
same indicators from the same source in 1965, and correspondingly η was inter-
preted as a real latent construct that measures the level of democracy in 1965.
One goal was to estimate the correlation between the levels of democracy in
1960 and 1965.
The path diagrams in Figure 10.1 are uncomplicated relative to those that
may be used in sociological studies. For example, Vinzi, Trinchera, and Amato
(2010) described a study of fashion in which an uncore model was imbedded
in a larger path diagram with the latent constructs “Image” and “Character”
serving as reflective indicators of a third latent construct “Brand Preference”
with its own indicators. Our discussion centers largely on the path diagrams
of Figure 10.1 since these are sufficient for the objectives of this chapter.

FIGURE 10.1
Reflective path diagrams relating two latent constructs ξ and η with their
respective univariate indicators, X1 , . . . , Xp and Y1 , . . . , Yr . ϕ denotes either
cor(ξ, η) or Ψ = cor{E(ξ | X), E(η | Y )}. Upper and lower diagrams are the
uncore and core models. (From Fig. 2.1 of Cook and Forzani (2023) with
permission.)
The models of Figure 10.1 are called reflective because the indicators reflect
properties of the constructs, as indicated by the arrows pointing from the
constructs to the indicators. These can also be viewed as regressions where the
constructs represent a single predictor and the indicators collectively repre-
sent a multivariate response. If the direction of the arrows is reversed, so they
point from the indicators to the constructs, then the path models are called
formative because the indicators form the constructs. From a regression view,
the indicators now become the predictors and the construct is the response.

10.2.2 Implied model


Assuming that all effects are additive, the upper and lower path diagrams
in Figure 10.1 can be represented as a common multivariate (multi-response)
regression model with the following stipulations:
    (ξ, η)^T = (µξ , µη )^T + ε(ξ,η) ,  where ε(ξ,η) ∼ N2 (0, Σ(ξ,η) ).            (10.1)

and

    Y = µY + βY |η (η − µη ) + εY |η ,  where εY |η ∼ Nr (0, ΣY |η )
                                                                                   (10.2)
    X = µX + βX|ξ (ξ − µξ ) + εX|ξ ,  where εX|ξ ∼ Np (0, ΣX|ξ ).

We use σξ2 , σξ,η and ση2 to denote the elements of Σ(ξ,η) ∈ R2×2 , and we
further assume that ε(ξ,η) , εX|ξ and εY |η are mutually independent so jointly
X, Y , ξ, and η follow a multivariate normal distribution. Normality per se
is not necessary. However, certain implications of normality like linear regres-
sions with constant variances are needed in this chapter. Assuming normality
overall avoids the need to provide a list of assumptions, which might obscure
the overarching points about the role of PLS. As defined in Chapter 1, we use
the notation ΣU,V to denote the matrix of covariances between the elements
of the random vectors U and V , and we use βU |V = ΣU,V Σ−1 V to indicate the
matrix of population coefficients from multi-response linear regression of U on
V , where ΣV = var(V ). Lemma A.5 in Appendix A.8.1 gives the mean and
variance of the joint multivariate normal distribution of X, Y , ξ, and η. Since
only X, Y are observable, all estimation and inference must be based on the
joint multivariate distribution of (X, Y ).
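
Since all estimation is based on the joint distribution of (X, Y ), it is convenient for later illustrations to be able to draw samples from (10.1)–(10.2). The following minimal Python/NumPy sketch does this under the normality assumptions above with zero means; the function and argument names are ours and are not part of the book's accompanying software.

    import numpy as np

    def simulate_reflective(n, beta_x, beta_y, Sigma_xi_eta,
                            Sigma_X_given_xi, Sigma_Y_given_eta, seed=None):
        # Draw n observations (X, Y) from the reflective model (10.1)-(10.2),
        # taking all means to be zero.  beta_x (p,) and beta_y (r,) play the
        # roles of beta_{X|xi} and beta_{Y|eta}.
        rng = np.random.default_rng(seed)
        latent = rng.multivariate_normal(np.zeros(2), Sigma_xi_eta, size=n)
        xi, eta = latent[:, 0], latent[:, 1]
        X = np.outer(xi, beta_x) + rng.multivariate_normal(
            np.zeros(len(beta_x)), Sigma_X_given_xi, size=n)
        Y = np.outer(eta, beta_y) + rng.multivariate_normal(
            np.zeros(len(beta_y)), Sigma_Y_given_eta, size=n)
        return X, Y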

10.2.3 Measuring association between η and ξ


The association between η and ξ has been measured in the path modeling liter-
ature via the conditional correlation Ψ := cor{E(η | Y ), E(ξ | X)} and via the
marginal correlation cor(η, ξ). These measures can reflect quite different views
of a reflective setting. The marginal correlation cor(η, ξ) implies that η and
ξ represent concepts that are uniquely identified regardless of any subjective
choices made regarding the variables Y and X that are reasoned to reflect their
properties up to linear transformations. Two investigators studying the same
constructs would naturally be estimating the same correlation even if they se-
lected a different Y or a different X. In contrast, Ψ = cor{E(η | Y ), E(ξ | X)},
the correlation between population regressions, suggests a conditional view:
η and ξ exist only by virtue of the variables that are selected to reflect their
properties, and so attributes of η and ξ cannot be claimed without citing the
corresponding (Y, X). Two investigators studying the same concepts could un-
derstandably be estimating different correlations if they selected a different Y
or a different X. For instance, the latent construct “happiness” might reflect
in part a combination of happiness at home and happiness at work. Two sets
of indicators X and X ∗ that reflect these happiness subtypes differently might
yield different correlations with a second latent construct, say “generosity.”
The use of Ψ as a measure of association can be motivated also by an ap-
peal to dimension reduction. Under the models (10.2), X ⊥⊥ ξ | E(ξ | X) and
Y ⊥⊥ η | E(η | Y ). In consequence, the constructs affect their respective indi-
cators only through the linear combinations given by the conditional means,
E(ξ | X) = Σ_{X,ξ}^T Σ_X^{−1}(X − µX ) and E(η | Y ) = Σ_{Y,η}^T Σ_Y^{−1}(Y − µY ).
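
In the population, Ψ can be computed directly from the covariance blocks of the joint normal distribution of (X, Y, ξ, η). A short Python/NumPy sketch (our own function name; ΣX and ΣY assumed nonsingular):

    import numpy as np

    def population_Psi(Sigma_X, Sigma_Y, Sigma_YX, Sigma_X_xi, Sigma_Y_eta):
        # Psi = cor{E(eta | Y), E(xi | X)} computed from population covariances.
        a = np.linalg.solve(Sigma_X, Sigma_X_xi)    # Sigma_X^{-1} Sigma_{X,xi}
        b = np.linalg.solve(Sigma_Y, Sigma_Y_eta)   # Sigma_Y^{-1} Sigma_{Y,eta}
        cov = b @ Sigma_YX @ a                      # cov{E(eta|Y), E(xi|X)}
        return cov / np.sqrt((Sigma_X_xi @ a) * (Sigma_Y_eta @ b))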

10.2.4 Construct interpretation


Rigdon (2012) questioned whether constructs like ξ and η are necessarily the
same as the theoretical concepts they are intended to capture or whether they
are approximations of the theoretical concepts allowed by the selected indi-
cators. If the latter view is taken, then an understanding of the constructs
cannot be divorced from the specific indicators selected for the study.
Further, Rigdon, Becker, and Sarstedt (2019) argued that there is always
a gap between a targeted concept like consumer satisfaction or trustworthi-
ness and its representation as a construct via selected indicators. Attempts
to reify a concept through a specific set of indicators must naturally reflect
the researchers’ vision, insights and ideology, with the unavoidable implication
that cor(η, ξ) is a researcher-specific parameter. Two independent researcher
groups studying the same concepts, but using different indicators, will inher-
ently be dealing with distinct correlations cor(η, ξ). Whether they arrive at
the same substantive conclusions is a related issue that can also be relevant,
but is beyond the scope of this book. The correlation Ψ brings these issues to
the fore; its explicit dependence on the indicators manifests important trans-
parency and it facilitates the reification of a concept, although it does not
directly address discrepancies between the target concept and construct.
The diagrams in Figure 10.1 can be well described as instances of oper-
ational measurement theory in which concepts are defined in terms of the
operations used to measure them (e.g. Bridgman, 1927; Hand, 2006). In op-
erationalism there is no assumption of an underlying objective reality; the
concept being pursued is precisely that which is being measured, so concept
and construct are the same. In the alternate measurement theory of represen-
tationalism, the measurements are distinct from the underlying reality. There
is a universal reality underlying the concept of length regardless of how it is
measured. However, a universal understanding of “happiness” must rely on
unanimity over the measurement process.

10.2.5 Constraints on ξ and η


The constructs η and ξ are not well-defined since any non-degenerate linear
transformations of them lead to an equivalent model and, in consequence, it
is useful to introduce harmless constraints to facilitate estimation, inference
and interpretation. To this end, we consider two sets of constraints:

Regression constraints : µξ = µη = 0 and var{E(ξ | X)} = var{E(η | Y )} = 1.


Marginal constraints : µξ = µη = 0 and var(ξ) = var(η) = 1.

The regression constraints fix the variances of the conditional means E(ξ | X)
and E(η | Y ) at 1, while the marginal constraints fix the marginal vari-
ances of ξ and η at 1. Under the regression constraints, Ψ = cor(E(η | Y ),
E(ξ | X)) = cov(E(η | Y ), E(ξ | X)). Under the marginal constraints,
cor(ξ, η) = cov(ξ, η). The regression and marginal constraints are related via
the variance decompositions

var(ξ) = var{E(ξ | X)} + E{var(ξ | X)}


var(η) = var{E(η | Y )} + E{var(η | Y )}.

It seems that the choice of constraints is mostly a matter of taste: We show
later in Lemma 10.2 that cov(ξ, η) is unaffected by the choice of constraint.

10.2.6 Synopsis of estimation results


cb|sem gives the maximum likelihood estimator of cov(ξ, η), provided that
there are identified off-diagonal elements of the conditional covariance matri-
ces ΣX|ξ and ΣY |η that are known to be zero and set to zero in the model.
It seems common in application to assume that ΣX|ξ and ΣY |η are diagonal
matrices, in effect assuming that ξ and η account for all of the dependence
between the indicators. If there are no identified off-diagonal elements that are
known to be zero, then cov(ξ, η) is not identified. In the models of Figure 10.1,
cb|sem can be used to estimate cov(ξ, η) in the uncore model, but not in
the core model. Schönemann and Wang (1972), Henseler et al. (2014, Fig.
1) and others have argued that the uncore model rarely holds in practice.
Nevertheless, it still seems to be used frequently in the social sciences.
The maximum likelihood estimator |Ψ̂mle | of |Ψ| can be derived from the
RRR of Y on X and X on Y , as described in Section 10.3. We will see in Sec-
tion 10.4.1 that |Ψ̂mle | is simply the first canonical correlation between X and
Y . A corresponding moment estimator Ψ̂mt is introduced in Section 10.4.2.
PLS methods are moment-based envelope methods that give estimators
of Ψ. There are no restrictions on ΣX|ξ and ΣY |η , other than the general
restriction that they are positive definite.
cb|sem and the PLS methods have two distinct targets, |cov(ξ, η)| and |Ψ|.
PLS methods do not provide generally serviceable estimators of |cov(ξ, η)| and
cb|sem does not provide a serviceable estimator of |Ψ|.

10.3 Reduced rank regression


PLS path algorithms gain information on the latent constructs from the multi-
response linear regressions of Y on X and X on Y . Consequently, it is useful
to provide a characterization of these regressions.
In the following lemma we state that for the model presented in (10.1) and
(10.2) of Section 10.2.2, the regression of Y on X qualifies as a RRR. The es-
sential implication of the lemma is that, under the reflective model described
in Section 10.2, the coefficient matrix βY |X in the multi-response linear re-
gression of Y on X must have rank 1 because cov(Y, X) = ΣY,X has rank 1. A
similar conclusion applies to the regression of X on Y . Proofs for Lemma 10.1
and Proposition 10.1 are available in Appendix Sections A.8.3 and A.8.4.

Lemma 10.1. For the core model presented in (10.1) and (10.2), and without
imposing either the regression or marginal constraints, we have E(Y | X) =
µY + βY |X (X − µX ), where βY |X = AB^T Σ_X^{−1} , with A ∈ R^{r×1} , B ∈ R^{p×1} ,
and ΣY,X = AB^T .

It is known from the literature on RRR that the vectors A and B are not
identifiable, while AB T is identifiable (see, for example, Cook et al., 2015). As
a consequence of being a RRR model, we are able to state in Proposition 10.1
which parameters are identifiable in the reflective model of (10.1) and (10.2).

Proposition 10.1. The parameters ΣX , ΣY , ΣY,X = AB^T , βY |X , µY , and
µX are identifiable in the reduced rank model of Lemma 10.1, but A and B are
not. Additionally, under the regression constraints, the quantities σξ,η /(ση² σξ²),
Ψ, E(η|Y ), ΣX,ξ , and ΣY,η are identifiable in the reflective model except for
sign, while cor(η, ξ), σξ² , ση² , σξ,η , βX|ξ , βY |η , ΣY |η , and ΣX|ξ are not identi-
fiable. Moreover,

    |Ψ| = |σξ,η |/(ση² σξ²) = tr^{1/2}(βX|Y βY |X )                                (10.3)
        = tr^{1/2}(ΣX,Y Σ_Y^{−1} ΣY,X Σ_X^{−1}).                                   (10.4)

10.4 Estimators of Ψ
10.4.1 Maximum likelihood estimator of Ψ
Although A and B are not identifiable, |Ψ| is identifiable since from (10.4) it
depends on the identifiable quantities ΣX , ΣY , and on the rank 1 covariance
matrix ΣY,X = AB^T in the RRR of Y on X. Let Σ̂Y,X denote the maximum
likelihood estimator of ΣY,X from fitting the RRR model of Lemma 10.1, and
recall that SX and SY denote the sample versions of ΣX and ΣY . Then the
maximum likelihood estimator of |Ψ| can be obtained by substituting these
estimators into (10.4):

    |Ψ̂mle | = tr^{1/2}(Σ̂X,Y S_Y^{−1} Σ̂Y,X S_X^{−1}).

Let ΣỸ,X̃ = Σ_Y^{−1/2} ΣY,X Σ_X^{−1/2} denote the standardized version of ΣY,X
that corresponds to the rank one regression of Ỹ = Σ_Y^{−1/2} Y on X̃ = Σ_X^{−1/2} X.

Then |Ψ| can be written also as

    |Ψ| = tr^{1/2}(ΣX̃,Ỹ ΣỸ,X̃ ) = ‖ΣỸ,X̃ ‖F ,                                      (10.5)

which can be seen as the Frobenius norm ‖ · ‖F of the standardized covariance
matrix ΣỸ,X̃ .
The maximum likelihood estimator |Ψ̂mle | can be computed in the follow-
ing steps starting with the data (Yi , Xi ), i = 1, . . . , n (Cook et al., 2015):

1. Standardize Ỹi = S_Y^{−1/2}(Yi − Ȳ ) and X̃i = S_X^{−1/2}(Xi − X̄), i = 1, . . . , n.

2. Construct Σ̂Ỹ,X̃ , the matrix of sample correlations between the elements
of the standardized vectors X̃ and Ỹ .

3. Form the singular value decomposition Σ̂Ỹ,X̃ = U DV^T and extract U1 and
V1 , the first columns of U and V ; D1 is the corresponding (largest)
singular value of Σ̂Ỹ,X̃ .

4. Then |Ψ̂mle | = D1 .

In short, the maximum likelihood estimator |Ψ̂mle | is the largest singu-
lar value of Σ̂Ỹ,X̃ . Equivalently, it is the first sample canonical correlation
between X and Y .

Wold (1982, Section 5.2) pointed out that the first canonical correlation
arises under formative models, which established a connection between
canonical correlations and the analysis of Wold’s two-block, two-LV soft
models.
This maximum likelihood estimator makes use of the reduced dimension
that arises because ΣX,Y has rank 1. In a sense then, it is a dimension reduc-
tion estimator. However, it is not formally related to envelope or PLS estima-
tors, which are also based on reducing dimensions. Envelope and reduced rank
estimators are distinct methods that were combined by Cook et al. (2015) to
yield a reduced-rank envelope estimator.

10.4.2 Moment estimator


Define
    |Ψ̂mt | = tr^{1/2}(SX,Y S_Y^{−1} SY,X S_X^{−1}),                                (10.6)

where tr denotes the trace operator. This moment estimator of |Ψ|, which is
constructed by simply substituting moment estimators for the quantities in
(10.4), may not be as efficient as the maximum likelihood estimator under
the model of 10.2.2, but it might possess certain robustness properties. This
estimator does not use the reduced dimensions that arise from the rank of ΣX,Y
or from envelope constructions and, for this reason, it is likely to be inferior.
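
A direct plug-in version of (10.6) can be sketched in the same way; again the function name is ours and SX , SY are assumed nonsingular.

    import numpy as np

    def psi_moment(X, Y):
        # Plug-in estimator (10.6): tr^{1/2}(S_XY S_Y^{-1} S_YX S_X^{-1})
        Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
        n = X.shape[0]
        SX, SY = Xc.T @ Xc / n, Yc.T @ Yc / n
        SXY = Xc.T @ Yc / n
        M = SXY @ np.linalg.solve(SY, SXY.T) @ np.linalg.inv(SX)
        return np.sqrt(np.trace(M))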

10.4.3 Envelope and PLS estimators of Ψ


The results of Chapter 5 together with our starting models provide a frame-
work for deriving the envelope and PLS estimators of Ψ. We begin by estab-
lishing notation for describing the response envelopes for model (10.1)–(10.2).
Let BY |η = span(βY |η ), let BX|ξ = span(βX|ξ ), let Γ denote a semi-
orthogonal basis matrix of the response envelope EΣY |η (BY |η ) for the Y model
in (10.2) and let Φ denote a semi-orthogonal basis matrix of the response en-
velope EΣX|ξ (BX|ξ ) for the X model in (10.2). Let u and q denote the dimensions
of EΣY |η (BY |η ) and EΣX|ξ (BX|ξ ) and let (Γ, Γ0 ) ∈ Rr×r and (Φ, Φ0 ) ∈ Rp×p be
orthogonal matrices.
The envelope versions of models (10.2) can now be written as

    Y = µY + Γγ(η − µη ) + εY |η ;  εY |η ∼ Nr (0, ΓΩΓ^T + Γ0 Ω0 Γ0^T )
                                                                                   (10.7)
    X = µX + Φφ(ξ − µξ ) + εX|ξ ;  εX|ξ ∼ Np (0, Φ∆Φ^T + Φ0 ∆0 Φ0^T ).

Here γ ∈ Ru and φ ∈ Rq give the coordinates of βY |η and βX|ξ in terms of the
basis matrices Γ and Φ, and ∆, ∆0 , Ω, and Ω0 are positive definite matrices
with dimensions that conform to the indicated matrix multiplications.
It follows from the envelope models given by (10.7) that

    ΣY   = Γ(γγ^T ση² + Ω)Γ^T + Γ0 Ω0 Γ0^T
    ΣY,X = Γγφ^T Φ^T σξ,η
    ΣX   = Φ(φφ^T σξ² + ∆)Φ^T + Φ0 ∆0 Φ0^T
    |Ψ|  = |cor{E(ξ | Φ^T X), E(η | Γ^T Y )}|.

The essential implication of this last result is as follows. Once estimators Γ̂
and Φ̂ are known, we can use the estimated composite indicators Γ̂^T Y and
Φ̂^T X in place of Y and X to estimate |Ψ|. Using these in combination with
maximum likelihood estimation, we can again see that |Ψ| is estimated as the
first canonical correlation between Γ̂^T Y and Φ̂^T X. Moment estimation (10.6)
proceeds similarly by replacing Y and X with Γ̂^T Y and Φ̂^T X.
If Γ̂^T Y and Φ̂^T X are real, then the maximum likelihood and moment es-
timators both yield the absolute sample correlation between these composite
indicators. To see this result, consider the four computational steps in Sec-
tion 10.4: Ỹi = S_{Γ̂^T Y}^{−1/2}(Γ̂^T Yi − Γ̂^T Ȳ ) and X̃i = S_{Φ̂^T X}^{−1/2}(Φ̂^T Xi − Φ̂^T X̄) are both
real and so Σ̂Ỹ,X̃ is just the sample covariance between Γ̂^T Y and Φ̂^T X, which
equals the usual sample correlation in view of the standardization. The mo-
ment estimator is evaluated in the same way.
Envelope and PLS methods are distinguished then by how they construct
the estimated composite indicators:

Envelope estimators.
The envelope estimator |Ψ̂env | is obtained by using the algorithms of Sec-
tion 5.2.2 to construct Γ̂ and Φ̂.

PLS estimators.
The PLS estimator |Ψ̂pls | is obtained by using the algorithms of Ta-
ble 5.1 to construct Γ̂ and Φ̂, which are denoted as G and W in the
table.
These are envelope and PLS estimators that do not take into account that
ΣY,X has rank 1. Nevertheless, they are still serviceable dimension reduction
methods that can compete effectively with the maximum likelihood and mo-
ment estimators of Sections 10.4.1 and 10.4.2. This potential is demonstrated
by the simulations of Section 10.7.

10.4.4 The pls|sem estimator used in structural equation
       modeling of path diagrams
Using Dijkstra (1983) and Dijkstra and Henseler (2015a) as guides, the method
designated as PLS in the social sciences differs from those in Section 10.4.3
in two crucial aspects. First, the data are scaled by normalizing each real in-
dicator with its sample standard deviation, so the resulting scaled indicator
has sample mean 0 and sample variance 1. Second, the method is based on
the condition that u = q = 1. Working in the scaled variables, the pls|sem
method then effectively assumes that all reflective information is captured in
one linear combination of the scaled indicators.


Specific steps to compute the pls|sem estimator are as follows:

1. Scale and center the indicators marginally, X^(s) = diag^{−1/2}(SX )(X − X̄)
and Y^(s) = diag^{−1/2}(SY )(Y − Ȳ ), where diag(·) denotes the diagonal ma-
trix with diagonal elements the same as those of the argument.

2. Construct the first eigenvector ℓ1 of S_{Y^(s)X^(s)} S_{X^(s)Y^(s)} and the first eigen-
vector r1 of S_{X^(s)Y^(s)} S_{Y^(s)X^(s)} .

3. Construct the proxy latent variables ξ̄ = r1^T X^(s) and η̄ = ℓ1^T Y^(s) .

4. Then the estimated covariance and correlation between the proxies η̄ and
ξ̄ are

       ĉov(η̄, ξ̄) = ℓ1^T S_{Y^(s)X^(s)} r1
       ĉor(η̄, ξ̄) = ℓ1^T S_{Y^(s)X^(s)} r1 / {ℓ1^T S_{Y^(s)} ℓ1 · r1^T S_{X^(s)} r1}^{1/2}.      (10.8)

Aside from the scaling, the reduction to the proxy variables ξ̄ and η̄ is the
same as that from the simultaneous PLS method developed by Cook, Forzani,
and Liu (2023b, Section 4.3) and discussed in Section 5.3 when u = q = 1.
However, if we standardize jointly, then we recover the MLE. Rewriting the
MLE algorithm given in Section 10.4.1 to reflect its connection with pls|sem,
we have

1. Standardize the indicators jointly, X̃ = S_X^{−1/2}(X − X̄) and Ỹ = S_Y^{−1/2}(Y − Ȳ ).

2. Construct the first eigenvector U1 of SỸ,X̃ SX̃,Ỹ and the first eigenvector
V1 of SX̃,Ỹ SỸ,X̃ .

3. Construct the proxy latent variables ξ̄ = V1^T X̃ and η̄ = U1^T Ỹ .

4. Then the estimated correlation between the proxies η̄ and ξ̄ is

       ĉor(η̄, ξ̄) = U1^T SỸ,X̃ V1 = D1 .

Consequently, we reach the following three essential conclusions:


I. If joint standardization is used for the indicator vectors, then the pls|sem
estimator ĉor(η̄, ξ̄) is the same as the MLE of |Ψ| under the model (10.1)–
(10.2), as illustrated in the code sketch following these conclusions. This
involves no dimension reduction beyond that arising from the reduced rank
model of Section 10.3. It is unrelated to PLS.
We recommend that pls|sem be based on joint standardization and not
marginal scaling, when permitted by the sample size.

II. If no standardization is used for the indicator vectors, then the pls|sem
estimator ĉor(η̄, ξ̄) is the same as the simultaneous PLS estimator (see Sec-
tion 5.3) when u = q = 1. But we may lose information if u ≠ 1 or q ≠ 1.

III. If marginal scaling is used, it’s not clear what ĉor(η̄, ξ̄) is estimating. It
might be viewed as an approximation of the MLE and simultaneous PLS
estimators of Ψ, but this has not been studied to our knowledge.
We have not seen a compelling reason to use marginal scaling. Marginal
scaling might be useful when the sample size is relatively small, but we
are not aware of work on this possibility.
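
To make the role of scaling concrete, here is a small Python/NumPy sketch (function names are ours, and SX and SY are assumed nonsingular) that computes the proxy correlation of this section under either marginal scaling or joint standardization. With scaling='joint' it returns the first sample canonical correlation, that is, the MLE of |Ψ| from Section 10.4.1; with scaling='marginal' it follows steps 1–4 above and computes (10.8).

    import numpy as np

    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    def plssem_proxy_corr(X, Y, scaling="marginal"):
        # Proxy correlation of Section 10.4.4 under the chosen scaling.
        Xc, Yc = X - X.mean(0), Y - Y.mean(0)
        n = X.shape[0]
        SX, SY = Xc.T @ Xc / n, Yc.T @ Yc / n
        if scaling == "joint":
            AX, AY = inv_sqrt(SX), inv_sqrt(SY)          # S_X^{-1/2}, S_Y^{-1/2}
        else:                                            # marginal: diag^{-1/2}(S)
            AX = np.diag(np.diag(SX) ** -0.5)
            AY = np.diag(np.diag(SY) ** -0.5)
        Xs, Ys = Xc @ AX, Yc @ AY
        Syx = Ys.T @ Xs / n
        # leading singular vectors of Syx are the first eigenvectors of
        # Syx Sxy and Sxy Syx used in step 2
        U, D, Vt = np.linalg.svd(Syx)
        ell1, r1 = U[:, 0], Vt[0, :]
        num = ell1 @ Syx @ r1
        den = np.sqrt((ell1 @ (Ys.T @ Ys / n) @ ell1) *
                      (r1 @ (Xs.T @ Xs / n) @ r1))
        return abs(num / den)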

10.5 cb|sem
As hinted in Proposition 10.1, the parameter cor(η, ξ) is not identifiable with-
out further assumptions. We give now sufficient conditions to have identi-
fication of cor(ξ, η). The conditions needed are related to the identification
of ΣX|ξ and ΣY |η . As seen in Lemma A.6 of Appendix A.8.4, under model
(10.1)–(10.2),

    ΣX = ΣX|ξ + σξ^{−2}(B^T Σ_X^{−1} B)^{−1} BB^T                                   (10.9)
    ΣY = ΣY|η + ση^{−2}(A^T Σ_Y^{−1} A)^{−1} AA^T.                                  (10.10)

The terms (B^T Σ_X^{−1} B)^{−1} BB^T and (A^T Σ_Y^{−1} A)^{−1} AA^T in (10.9) and (10.10)
are identifiable. Since the goal of cb|sem is to estimate cor(ξ, η) and since we
know from Proposition 10.1 that, under the regression constraints,

    |Ψ| = |cor{E(ξ | X), E(η | Y )}| = |σξ,η |/(σξ² ση²)                            (10.11)
is identifiable, we will be able to identify

|cor(ξ, η)| = |σξ,η |/σξ ση = σξ ση |Ψ|

if we can identify σξ2 and ση2 . From (10.9) and (10.10) that is equivalent to
identifying ΣY |η and ΣX|ξ . We show in Proposition A.4 of Appendix A.8.4
that, under the regression constraints, (a) if ΣY |η and ΣX|ξ are identifiable,
then σξ2 , ση2 , |σξ,η |, βX|ξ , βY |η are identifiable and that (b) ΣY |η and ΣX|ξ are
identifiable if and only if σξ2 , ση2 are so.
The next proposition gives conditions that are sufficient to ensure iden-
tifiability; its proof is given in Appendix A.8.5. Let (M )ij denote the ij-th
element of the matrix M and (V )i the i-th element of the vector V .

Proposition 10.2. Under the regression constraints, (I) if ΣX|ξ contains
an off-diagonal element that is known to be zero, say (ΣX|ξ )ij = 0, and
if (B)i (B)j ≠ 0, then ΣX|ξ and σξ² are identifiable. (II) If ΣY |η contains
an off-diagonal element that is known to be zero, say (ΣY |η )ij = 0, and if
(A)i (A)j ≠ 0, then ΣY |η and ση² are identifiable.

Corollary 10.1. Under the regression constraints, if ΣY |η and ΣX|ξ are di-
agonal matrices and if A and B each contain at least two non-zero elements
then ΣY |η , ΣX|ξ , σξ2 , and ση2 are identifiable.

The usual assumption in SEM is that ΣX|ξ and ΣY |η are diagonal matrices
(e.g. Henseler et al., 2014; Jöreskog, 1970). We see from (10.11) and Corol-
lary 10.1 that this assumption along with the regression constraints is sufficient
to guarantee that |cor(ξ, η)| is identifiable provided B and A contain at least
two non-zero elements. However, from Proposition 10.2, we also see that it is
not necessary for ΣY |η and ΣX|ξ to be diagonal. The assumption that ΣX|ξ and
ΣY |η are diagonal matrices means that, given ξ and η, the elements of Y and
X must be independent. In consequence, elements of X and Y are correlated
only by virtue of their association with η and ξ. The presence of any residual
correlations after accounting for ξ and η would negate the model and possibly
lead to spurious conclusions. See Henseler et al. (2014) for a related discussion.
In full, the usual SEM requires that ΣY |η and ΣX|ξ are diagonal matrices,
and it adopts the marginal constraints instead of the regression constraints.
By Proposition 10.2, our ability to identify parameters is unaffected by the
constraints adopted. However, we need also to be sure that the meaning of
cor(ξ, η) is unaffected by the constraints adopted. The proof of the following
lemma is in Appendix A.8.6.

Lemma 10.2. Starting with (10.2), cor(ξ, η) is unaffected by choice of con-
straint, σξ² = ση² = 1 or var{E(ξ|X)} = var{E(η|Y )} = 1.

Now, to estimate the identifiable parameters with the regression or
marginal constraints, we take the joint distribution of (X, Y ) and insert the
conditions imposed. The maximum likelihood estimators can then be found
under normality, as shown in Proposition A.3 of Appendix A.8.2. From this
we conclude that the same type of maximization problem arises under either
set of constraints.
A connection between envelopes and cor(η, ξ) similar to that between
envelopes and Ψ is problematic. Since only span(Γ) = EΣY |η (BY |η ) and
span(Φ) = EΣX|ξ (BX|ξ ) are identifiable, arbitrary choices of bases for these
subspaces will not likely result in ΣΓT Y |η = Ω and ΣΦT X|ξ = ∆ being diagonal
matrices, as required by cb|sem. Choosing arbitrary bases Γ and Φ and using
the marginal constraints, the envelope composites have the following structure,

ΣΓT Y = γγ T + Ω
ΣΓT Y,ΦT X = γφT cor(ξ, η)
ΣΦT X = φφT + ∆,

where the notation is as used for model (10.7). From this we see that the joint
distribution of the envelope composites (ΓT Y, ΦT X) has the same structure
as the SEM shown in equation (A.31) of Appendix A.8.6, except that assum-
ing Ω and ∆ to be diagonal matrices is untenable from this structure alone.
Additionally, Ω and ∆ are not identifiable because they are confounded with
γγ T and φφT .
In short, it does not appear that there is a single method that can provide
estimators of both of the parameters |Ψ| and cov(ξ, η). However, an estimator
of |Ψ| can provide an estimated lower bound on |cov(ξ, η)|, as discussed in
Section 10.6.
10.6 Bias
Bias is a malleable concept, depending on the context, the estimator and the
quantity being estimated.
If the goal is to estimate |cor(ξ, η)| via maximum likelihood while assum-
ing that ΣY |η and ΣX|ξ are diagonal matrices, bias might not be a worrisome
issue. Although maximum likelihood estimators are generally biased, the bias
typically vanishes at a fast rate as the sample size increases. Bias may be an
issue when the sample size is not large relative to the number of parameters
to be estimated while SX and SY are still nonsingular. This issue is outside
the scope of this chapter.
If the goal is to estimate |Ψ| without assuming diagonal covariance ma-
trices and n > min(p, r) + 1 then its maximum likelihood estimator, the first
canonical correlation between X and Y , can be used and again bias might not
be a worrisome issue.
In some settings we may wish to use an estimator of |Ψ| also as an esti-
mator of |cor(ξ, η)| without assuming diagonal covariance matrices. The im-
plications of doing so are a consequence of the next proposition. Its proof is
in Appendix A.8.7.

Proposition 10.3. Under the model that stems from (10.1) and (10.2),

    Ψ = {cor(ξ, η)/(σξ ση )} [var{E(ξ | X)} var{E(η | Y )}]^{1/2} .

Under either the regression constraints or the marginal constraints

    |Ψ| ≤ |cor(η, ξ)|.                                                              (10.12)

Relationship (10.12) enables us to define a population bias in this case as
the difference
    0 ≤ |cor(ξ, η)| − |Ψ| ≤ 1,
which agrees with Dijkstra’s (Dijkstra, 1983, Section 4.3) conclusion that PLS
will underestimate |cor(ξ, η)|. Under the marginal constraints,

    σξ² = 1 = E{var(ξ | X)} + var{E(ξ | X)}
    ση² = 1 = E{var(η | Y )} + var{E(η | Y )},

this bias will be small when E{var(ξ | X)} and E{var(η | Y )} are small, so
ξ and η are well predicted by X and Y . This may happen with a few highly
informative indicators. It may also happen as the number of informative in-
dicators increases, a scenario that is referred to as an abundant regression in
statistics (Cook, Forzani, and Rothman, 2012, 2013). On the other extreme, if
ξ and η are not well predicted by X and Y , then it is possible to have |Ψ| close
to 0 while |cor(ξ, η)| is close to 1, in which case the bias is close to 1. Like the
assumption that ΣY |η and ΣX|ξ are diagonal matrices, it may be effectively
impossible to gain support from the data for claiming that E{var(ξ | X)} and
E(var(η | Y )) are small.
Under the regression constraints,

    σξ² = E{var(ξ | X)} + 1
    ση² = E{var(η | Y )} + 1,

this bias will again be small when E{var(ξ | X)} and E{var(η | Y )} are small.
In short, an estimator of |Ψ| is also an estimator of a lower bound on |cor(ξ, η)|.
Under either the marginal or regression constraints, the bias |cor(ξ, η)| − |Ψ|
will be small when the indicators are good predictors of the constructs.

10.7 Simulation results


In this section, we provide simulation results to support our general conclu-
sions and to demonstrate what can happen. In all simulations, the constructs
η and ξ are real random variables with 0 ≤ cor(η, ξ) ≤ 2/3. In Section 10.7.1
we use simulations in the manner of Rönkkö et al. (2016b) to illustrate that
the PLS methods in the social sciences may overestimate both cor(ξ, η) and
Ψ, while envelope methods, including PLS, perform notably better for esti-
mating Ψ. The simulation results in Section 10.7.2 are intended to underscore
the observation that cb|sem and pls|sem methods may materially underesti-
mate Ψ and cor(ξ, η) when ΣX|ξ and ΣY |η are not diagonal matrices, while the
envelope-based methods continue to do well. We illustrate in Section 10.7.3
that cb|sem and pls|sem methods may again exhibit substantial bias when
there are multiple reflective composites.
10.7.1 Simulation results with ΣX|ξ = ΣY |η = I3


Rönkkö et al. (2016b) illustrated some of their concerns via simulations with
p = r = 3 and sample size n = 100 under the marginal constraints with
0 ≤ cor(η, ξ) ≤ 2/3 and ΣX|ξ = ΣY |η = I3 .
We follow their general setup by simulating data with coefficient vec-
tors for the linear regressions of Y on η and X on ξ set equal to L =
(8/10, −7/10, 6/10)T . Their choice for the covariance matrices implies that
u = q = 1, so there is only one material composite of the indicators, which
agrees with the conditions for pls|sem. It follows from Proposition 10.3 that

Ψ = 0.5984 × cor(η, ξ), (10.13)

which agrees with our general conclusion (10.12). Following Rönkkö et al.
(2016b), we simulated data with sample sizes N = 100 and N = 1000 obser-
vations on (X, Y ) according to these settings with various values for cor(η, ξ).
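
The factor 0.5984 in (10.13) can be checked numerically from Proposition 10.3; the short Python/NumPy sketch below uses only the settings just stated (marginal constraints, ΣX|ξ = ΣY |η = I3 and the coefficient vector L).

    import numpy as np

    L = np.array([0.8, -0.7, 0.6])        # coefficient vector for X on xi and Y on eta
    SigmaX = np.outer(L, L) + np.eye(3)   # Sigma_X = L L^T + I_3 under these settings
    v = L @ np.linalg.solve(SigmaX, L)    # var{E(xi | X)} = L^T Sigma_X^{-1} L
    # Proposition 10.3 with sigma_xi = sigma_eta = 1 and identical X- and Y-models:
    print(round(np.sqrt(v * v), 4))       # prints 0.5984, the factor in (10.13)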
The estimators that we used are as follows:

MLE: This is the estimator described in Section 10.4.1. Recall that it is the
same as the pls|sem estimator with joint standardization, as developed in
Section 10.4.4.

Matrixpls: This R program, which was used to compute a version of pls|sem,
was provided to us by Mikko Rönkkö (Rönkkö et al., 2016b, Footnote 1).

ENV: This envelope estimator was computed using the methods described
in Cook and Zhang (2015b). It is as discussed in Section 10.4.3.

cb|sem: This estimator was discussed in Section 10.5. It was computed using
the lavaan package based on Rosseel (2012).

pls|sem: This estimator is as discussed in Section 10.4.4. We computed it
using our own code, but found the estimates to be quite close to those
from Matrixpls.

PLS: This designates the PLS estimator discussed in Section 10.4.3. It was
computed using the methods described by Cook, Forzani, and Liu (2023b,
Table 2).

We can see from the results shown in Figure 10.2, which are the aver-
ages over 10 replications, that at N = 100, Matrixpls tends to underes-
timate Ψ, while the other five estimators of Ψ clearly overestimate for all
cor(η, ξ) ∈ (0, 2/3), with ENV doing a bit better than the others. Because the
indicators are independent with constant variances conditional on the con-
structs, we did not expect much difference between MLE and pls|sem. Other
than cb|sem, none of these estimators made use of the fact that ΣX|ξ and ΣY |η
are diagonal matrices. At N = 1000, shown in Figure 10.3, the performance of
ENV is essentially flawless, while the other PLS-type estimators show a slight
propensity to overestimate Ψ.

FIGURE 10.2
Simulations with N = 100, ΣX|ξ = ΣY |η = I3 . Horizontal axes give the
true correlations; vertical axes give the estimated correlations. The lines
represent x = y.

FIGURE 10.3
Simulations with N = 1000, ΣX|ξ = ΣY |η = I3 . Horizontal axes give the
true correlations; vertical axes give the estimated correlations. The lines
represent x = y.

10.7.2 Non-diagonal ΣX|ξ and ΣY |η


Simulating with ΣX|ξ = ΣY |η = I3 , as we did in the construction of Fig-
ures 10.2 and 10.3, conforms to the standard assumption that these matrices
be diagonal and partially accounts for the success of cb|sem in Figures 10.2
and 10.3. To illustrate the importance of these assumptions, we next took
ΣX|ξ = ΣY |η = L(LT L)−1 LT + 8L0 LT0 with LT0 L0 = I2 and LT L0 = 0.
Otherwise, the settings are as described in Section 10.7.1, including the
relationship between Ψ and cor(ξ, η) given at (10.13). With this structure, the
usual PLS assumption that q = u = 1 is still correct, the material information
being L^T X and L^T Y . At N = 100, shown in Figure 10.4, ENV is still the best
with the MLE coming in second. The performance of the other four estimators
is terrible, as there seems to be little relationship between the estimator and
the estimand. At N = 1000, shown in Figure 10.5, the performance of ENV
is essentially flawless, while the MLE does well. The other four estimators all
have a marked tendency toward underestimation, particularly cb|sem.

FIGURE 10.4
Simulations with N = 100, ΣX|ξ = ΣY |η = L(L^T L)^{−1} L^T + 3L0 L0^T . Horizontal
axes give the true correlations; vertical axes give the estimated correlations.
The lines represent x = y.

FIGURE 10.5
Results of simulations with N = 1000, ΣX|ξ = ΣY |η = L(L^T L)^{−1} L^T + 3L0 L0^T .
Horizontal axes give the true correlations; vertical axes give the estimated
correlations. The lines represent x = y.

10.7.3 Multiple reflective composites


Figure 10.6 gives the results of a simulation with q = 2 reflective composites
for X and u = 2 reflective composites for Y . The simulation was structured
like that described in Section 10.7.2, except now p = r = 21, L is a 21 × 2
matrix and

    Ψ = 0.9997224 × cor(η, ξ).                                                      (10.14)

Specifically, let 1_k be a k × 1 vector of ones, let L1 = (8, −0.7, 60, 1_9^T , −1_9^T )^T ,
L2 = (1, 0.3, −0.59/60, −1_9^T , 1_9^T )^T and L = (L1 , L2 ). The conditional means
and variances were generated as E(X | ξ) = 0.1(L1 + 0.9L2 )ξ, ΣX|ξ =
5L(L^T L)^{−1} L^T + 0.1L0 L0^T and E(Y | η) = 0.1(L1 + 0.9L2 )η, ΣY |η =
5L(L^T L)^{−1} L^T + 0.1L0 L0^T . From this structure we have ΣY,η = ΣX,ξ =
0.1(L1 + L2 ), and

    ΣY,X = 0.01(L1 + L2 )(L1 + L2 )^T = 0.01 L 1_2 1_2^T L^T .

It follows from this structure that Φ = Γ = L(L^T L)^{−1/2} and so q = u = 2.

FIGURE 10.6
Results of simulations with two reflective composites for X and Y , q = u = 2.
Horizontal axes give the true correlations; vertical axes give the estimated
correlations. (Constructed following Fig. 6.3 of Cook and Forzani (2023) with
permission.)

We see from Figure 10.6 that the ENV estimator does the best at all
sample sizes, while PLS does well at the larger sample sizes. The Matrixpls
estimator from Rönkkö et al. (2016b) underestimates the true correlation Ψ at
all displayed sample sizes because it implicitly assumes that q = u = 1. The
cb|sem estimator also underestimates its target cor(ξ, η) because it cannot
deal with non-diagonal covariance matrices.

10.8 Discussion
This chapter is based on the relatively simple path diagram of Figure 10.1.
Phrased in terms of a construct ξ and a corresponding vector of indicators
X, the following overarching conclusions about the role of PLS in path anal-
yses apply regardless of the complexity of the path model. The relationship
between the indicators and the construct can be reflective or formative. The
same conclusions hold with ξ and X replaced with η and Y .

Path modeling.

At its core, path modeling hinges on the ability of the investigators to identify
sets of indicators that are related to the constructs in the manner specified.
Our view is that an understanding of a construct should not be divorced from
the specific indicators selected for its study. This view leads naturally to us-
ing E(ξ | X) as a means of construct reification. Here, PLS and ENV can
have a useful role in reducing the dimension of X without loss of informa-
tion on ξ, which allows X to be replaced by reduced predictors XR so that
E(ξ | X) = E(ξ | XR ).
On the other hand, if marginal characteristics of the constructs like
cov(ξ, η) are of sole interest, then we see PLS as having little or no relevance
to the analysis, unless the lower bound |Ψ| ≤ |cov(ξ, η)| is useful.

PLS|SEM

The success of this method for estimating Ψ depends critically on the stan-
dardization/scaling used.
No scaling or standardization works best when one real composite of the
indicators, say ΦT X, extracts all of the available information from X about ξ.
That is, once the real composite Φ^T X is known and fixed, ξ and X are inde-
pendent or at least uncorrelated. However, we found no rationale for adopting
this one-composite framework, which we see as tying the hands of pls|sem.
Envelopes and their descendent PLS methods include methodology for esti-
mating the number of composites needed to extract all of the available infor-
mation. Expanding the one-composite framework now used by pls|sem has
the potential to increase its efficacy considerably.
Standardization using sample covariance matrices produces the maximum
likelihood estimator under the core model. We expect that it will also be
called for in more complicated models, but further investigation is needed to
affirm this. We see no compelling reason to use marginal scaling.

CB|SEM

This likelihood-based method for estimating cor(η, ξ) requires constraints on


covariance matrices for identifiability. We find the common assumption of di-
agonal covariance matrices to be quite restrictive.

PLS versus unit-weighted summed composites

The unit-weighted summed composite is the sum of the individual indicators,
Xsum = ∑_{j=1}^p Xj . The effectiveness of Xsum depends on its relationship to the
material composites. In one-component path models, if the multiple correla-
tion between Xsum and Φ^T X is high then Xsum may be a useful composite. On
the other extreme, if the multiple correlation between Xsum and the immate-
rial composites ΦT0 X is large then Xsum is effectively an immaterial composite
and no useful results based on Xsum should be expected. In short, the use-
fulness of summed composites depends on the anatomy of the paths. In some
path analyses they might be quite useful, while useless in others.

Bias

We do not see bias as playing a dominant role in the debate over methodology.
If the goal is to estimate cor(η, ξ) using cb|sem, then estimation bias may,
depending on the sample size, play a notable role in cb|sem, as maximum
likelihood estimators are generally biased in finite samples but asymptotically unbiased. If the goal is to
estimate cor(η, ξ) using PLS, then structural bias is relevant. But if the goal
is to estimate Ψ, then structural bias has no special relevance if PLS is used.
11

Ancillary Topics

In this chapter we present various sidelights and some extensions to enrich our
discussions from previous chapters. In Section 11.1 we discuss the NIPALS and
SIMPLS algorithms as instances of general algorithms N and S introduced in
Section 1.5. In Section 11.2 we discuss bilinear models that have been used
to motivate PLS algorithms, particularly the simultaneous reduction of re-
sponses and predictors, and show that they rely on an underlying envelope
structure. This connects with the discussion in Chapter 5 on simultaneous
reduction. The relationship between NIPALS, SIMPLS, and conjugate gra-
dient algorithms is discussed in Section 11.3. Sparse PLS is discussed briefly
in Section 11.4, and Section 11.5 has an introductory discussion of PLS for
multi-way (tensor-valued) predictors. A PLS algorithm for generalized linear
regression is proposed in Section 11.6.

11.1 General application of the algorithms


In this section, we discuss issues related to using algorithms N and S for the
construction of an estimator of a general envelope EM (A). As described in
Sections 1.5.3 and 1.5.4, these algorithms require symmetric positive semi-
definite matrices A and M and their estimators. Let M̂ be a √n-consistent
estimator of M and let Â be a symmetric √n-consistent estimator of a basis
matrix for A. If we start with a non-symmetric basis estimator Û, then we
simply set Â = Û Û^T . For a particular sample application, we denote these
algorithms as N(Â, M̂) and S(Â, M̂), which provide for √n-consistent estima-
tors of a projection onto EM (A) when M > 0. Application of these algorithms
in this general context depends on the problem and the goals, and may suggest
different versions of the algorithms even for predictor reduction.


We know from Chapter 3 that the NIPALS and SIMPLS algorithms for
predictor reduction provide estimators of the envelope EΣX(B) and that they
depend on the data only via SX,Y and SX . To emphasize this aspect of the
algorithms, we denote them as NIPALS(SX,Y , SX ) and SIMPLS(SX,Y , SX ).
With Û = SX,Y and M̂ = SX we have the following connection between these
algorithms for predictor reduction in linear models:

    NIPALS(SX,Y , SX ) = N(SX,Y S_{X,Y}^T , SX )
    SIMPLS(SX,Y , SX ) = S(SX,Y S_{X,Y}^T , SX ).

However, the underlying theory allows for many other options for us-
ing N and S to estimate EΣX(B). Recall from Proposition 1.6 that, for all
k, EM (M^k A) = EM (A) and, for k ≠ 0, E_{M^k}(A) = EM (A). In particular,
EΣX(B) = EΣX (CX,Y ). This suggests that when n ≫ p, we could also use the
algorithms N(β̂ols β̂ols^T , SX ) and S(β̂ols β̂ols^T , SX ) to estimate EΣX(B). As a second
instance, it follows also that

    EΣX(B) = E_{Σ_X^2}(CX,Y ) = EΣX (span(Σ_X^2 ΣX,Y )),

which indicates that we could use also N(SX,Y S_{X,Y}^T , S_X^2), N(S_X^2 SX,Y S_{X,Y}^T S_X^2,
SX ), or the corresponding versions from algorithm S to estimate EΣX(B). The
essential point here is that there are many choices for M̂ and Â, and conse-
quently many different versions of NIPALS and SIMPLS, that give the same
envelope in the population but can produce different estimates in applica-
tion. Likelihood-based approaches like that for predictor envelopes discussed
in Section 2.3 can help alleviate this ambiguity.
To illustrate the associated reasoning, recall that the likelihood-based
development in Section 2.3.1 was put forth as a basis for estimating
EΣX(B), but further reasoning is needed to see its implications for PLS al-
gorithms. We start by rewriting the partially maximized likelihood function
from (2.10),

    Lq (G) = log |G^T SX|Y G| + log |G^T S_X^{−1} G|
           = log |G^T SX|Y G| + log |G^T {SX|Y + SX◦Y }^{−1} G|
           = log |G^T SX|Y G| + log |G^T {SX|Y + SX,Y S_Y^{−1} SY,X }^{−1} G|.

Comparing this to the kernel of algorithm L in Proposition 1.14, we see that
in the population this function forms the basis for estimating EM (A), where
M = ΣX|Y and A = ΣX,Y Σ_Y^{−1} ΣY,X . From Proposition 1.8 and the discussion
of (1.26) we have the following equivalences
of (1.26) we have the following equivalences
−1/2 −1/2 −1/2
EM (A) = EΣX|Y (ΣX,Y ΣY ) = EΣX (ΣX,Y ΣY ) = EΣX(B)ΣY = EΣX(B).

As pointed out in Section 3.9, we again see that the likelihood is based
−1/2
on the standardized response vector Z = SY Y , which suggests that NI-
PALS and SIMPLS algorithms also be based on the standardized responses:
−1/2 −1/2
NIPALS(SX,Y SY , SX ) and SIMPLS(SX,Y SY , SX ), with corresponding
adaptations to algorithms N and S. This idea was introduced in Section 3.9,
but here it is used as an illustration of the general recommendation that a
likelihood can be used to guide the implementation of a PLS algorithm. The
algorithms NIPALS(SX,Y , SX ) and SIMPLS(SX,Y , SX ) with the original un-
standardized responses might be considered when SY is singular.

11.2 Bilinear models


Bilinear models have long been used to motivate PLS algorithms, particu-
larly NIPALS. The idea of using bilinear models originated around 1975 in
the early work of H. Wold (Martens and Næs, 1989, p. 118). We introduced
bilinear models for simultaneous reduction of the predictors and the responses
in Section 5.4.1, but that is not the only version of a bilinear model that has
been proposed as a basis for PLS. In this section, we continue our discussion
of bilinear models, starting with that proposed by Martens and Næs (1989).

11.2.1 Bilinear model of Martens and Næs


The bilinear calibration model described by Martens and Næs (1989, p. 91),
which they used in part to motivate the NIPALS algorithms for dimension re-
duction in X, is formulated in terms of the centered data matrices X ∈ Rn×p
and Y ∈ Rn×r :

    X_{n×p} = T R_{p×u}^T + E_{n×p}
    Y_{n×r} = T U_{r×u}^T + F_{n×r}
    T_{n×u} = X W_{p×u}

where min(p, r) ≥ u and the matrix W of weights has full column rank. The
rows of R and U represent the loadings and the rows of T represent the scores.
Descriptions of this bilinear model in the PLS literature rarely mention any
stochastic properties of E and F , regarding them generally as unstructured
residual or error matrices. Martens and Næs (1989), as well as others, seem
to treat the bilinear model as a data description rather than as a statistical
model per se. To develop the connection with envelopes, it is helpful to re-
formulate the model in terms of uncentered random vectors XiT and YiT that
correspond to the rows of X and Y, and in terms of independent zero mean
error vectors eTi and fiT , representing the rows of E and F . Then written in
vector form the bilinear model becomes for i = 1, . . . , n

    Xi = αX + R ti + ei
    Yi = αY + U ti + fi                                                             (11.1)
    ti = W^T Xi ,

where αX and αY are intercept vectors that are needed because the model
is in terms of uncentered data, and e ⊥⊥ f . In the bilinear model (5.13) for
simultaneous reduction of responses and predictors, the latent vectors vi and
ti are assumed to be independent, while in (11.1) the corresponding latent
vector are the same and this common latent vector is a linear function of X.
As shown in the following discussion, the condition ti = W T Xi has negligible
impact on the X model but it does lead to envelopes in terms of the Y model.
It follows from (11.1) that we can take W to be semi-orthogonal without
any loss of generality, so we assume that in the following. Substituting for t
in the Y equation,
    Yi = αY + U W^T Xi + fi .                                                       (11.2)

Thinking of this Y -model in the form of the multivariate linear model (1.1),
we must have B ⊆ span(W ). Without structure beyond assuming that the
error vectors fi are independent copies of a random vector f with mean 0
and positive definite variance, this is a reduced rank multivariate regression
model (Cook, Forzani, and Zhang, 2015; Izenman, 1975). With W regarded
as known, the estimator for β is

    W Û^T = W (W^T SX W )^{−1} W^T SX,Y ,

which is the same form as given in Tables 3.1 and 3.4. The issue then is how
we estimate W , the basis for which must come from the X-model.
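
With W regarded as known, the estimator displayed above is a single projected least-squares computation; a minimal Python/NumPy sketch (our own function name) is:

    import numpy as np

    def beta_given_W(X, Y, W):
        # Reduced-rank estimator W (W^T S_X W)^{-1} W^T S_{X,Y} for the
        # coefficient matrix of the Y-model (11.2), with W regarded as known.
        Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
        n = X.shape[0]
        SX = Xc.T @ Xc / n
        SXY = Xc.T @ Yc / n
        return W @ np.linalg.solve(W.T @ SX @ W, W.T @ SXY)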
Bilinear models 313

Substituting ti into the X-model, we have

    Xi = αX + R(W^T Xi ) + ei .                                                     (11.3)

This represents a model for the conditional distribution of X | W T X. Multi-
plying both sides of the X-model (11.3) by W T , we then have,

    W^T Xi = W^T αX + W^T R(W^T Xi ) + W^T ei .

Since this equality holds for all values of W T X and W has full column rank,
we must have W T αX = 0, W T R = Iu , and W T e ≡ 0. In consequence, we can
take R = W without loss of generality. Recalling that W is semi-orthogonal,
the X-model (11.3) then reduces to Xi = αX + PW Xi + QW ei . This implies
that QW X = QW αX + QW e and so the X-model reduces further to simply

X = αX + PW X + e
= PW αX + QW αX + PW X + QW e
= PW αX + PW X + QW X
= PW X + QW X, (11.4)

since W T αX = 0. This holds for any W , so the X-model doesn’t really add
any restrictions to the problem. The only restriction on W arises from the
Y -model, which implies that span(W ) must contain span(β). This line of rea-
soning does not give rise to envelopes directly because (11.4) holds for any
span(W ) that contains span(β). The final step to reach envelopes is to require
that cov(PW X, QW X) = cov(PW X, QW e) = 0. With this and the previous
conclusion that B ⊆ span(W ), we see that span(W ) is a reducing subspace of
ΣX that contains B, and then u = q becomes the number of components. As
we have seen previously in this chapter, PLS algorithms NIPALS and SIMPLS
require this condition in the population, although we have not seen it stated
in the literature as part of a bilinear model.

11.2.2 Bilinear probabilistic PLS model


Recall from the discussion following (2.5) that the constituent parameters of
envelope models or PLS formulations are not identifiable, while key param-
eters are identifiable. el Bouhaddani et al. (2018) proposed a bilinear PLS
model in which the parameters are identifiable up to a sign, and they showed
how to use the EM algorithm to estimate those parameters. We describe their
bilinear PLS model in this section.
The el Bouhaddani et al. (2018) model, which was motivated by the work
of Tipping and Bishop (1999) on probabilistic principal component analysis,
begins with the requirement that the response and predictor vectors Y ∈ Rr
and X ∈ Rp are related via q × 1 latent vectors t ∼ N (0, diag(σ_{t1}², . . . , σ_{tq}²))
and u by the following bilinear probabilistic PLS model:

    X = µX + W t + e,  with e ∼ N (0, σe² Ip )
    Y = µY + V u + f,  with f ∼ N (0, σf² Ir )                                      (11.5)
    u = Bt + h,  with h ∼ N (0, σh² Iq ),

where µ(·)’s are unknown intercept vectors, W ∈ R^{p×q} , V ∈ R^{r×q} , B =
diag(b1 , . . . , bq ) and the error vectors e, f and h are mutually independent. Ad-
ditionally, W and V have full column ranks, t is independent of {e, f, h}, and
u is independent of {e, f }. As defined in Section 2.3.1, we let C = (X T , Y T )T
denote the random vector constructed by concatenating X and Y . It is also
required that q ≤ min(p, r); otherwise, simultaneous reduction of X and Y
dimension reduction may have no benefits in this context.
It then follows from the model that
Σ_C = ( Σ_X       Σ_{X,Y} )  =  ( W Σ_t W^T + σ_e^2 I_p     W Σ_t B^T V^T                          )
      ( Σ_{Y,X}   Σ_Y     )     ( V B Σ_t W^T               V(B Σ_t B^T + σ_h^2 I_q)V^T + σ_f^2 I_r ).

In consequence,

β = Σ_X^{-1} Σ_{X,Y} = (W Σ_t W^T + σ_e^2 I_p)^{-1} W Σ_t B^T V^T.

We next demonstrate that span(W ) is an envelope; that is, it is the smallest


reducing subspace of ΣX that contains B = span(β) = span(W ). First we
show that span(W ) reduces ΣX and contains B = span(W ).
To show that span(W ) reduces ΣX , define a q × q nonsingular matrix
A so that W̃ = W A is semi-orthogonal, and choose W̃0 so that (W̃ , W̃0 ) is
orthogonal. Then

ΣX = W̃ (A−1 Σt A−1 + σe2 Iq )W̃ T + σe2 W̃0 W̃0T . (11.6)

It follows from Proposition 1.24 that span(W ) reduces ΣX .



Using (11.6), the following calculations show that span(β) ⊆ span(W ).

β = Σ_X^{-1} Σ_{X,Y}
  = (W̃(A^{-1} Σ_t A^{-1} + σ_e^2 I_q)W̃^T + σ_e^2 W̃_0 W̃_0^T)^{-1} W Σ_t B^T V^T
  = {W̃(A^{-1} Σ_t A^{-1} + σ_e^2 I_q)^{-1} W̃^T + σ_e^{-2} W̃_0 W̃_0^T} W Σ_t B^T V^T
  = {W̃(A^{-1} Σ_t A^{-1} + σ_e^2 I_q)^{-1} W̃^T} W̃ A^{-1} Σ_t B^T V^T
  = W̃(A^{-1} Σ_t A^{-1} + σ_e^2 I_q)^{-1} A^{-1} Σ_t B^T V^T.

Since (A^{-1} Σ_t A^{-1} + σ_e^2 I_q)^{-1} A^{-1} Σ_t B^T has rank q, r ≥ q and V has full column
rank, we have that

span((A^{-1} Σ_t A^{-1} + σ_e^2 I_q)^{-1} A^{-1} Σ_t B^T V^T) = R^q

and, in consequence, B = span(β) = span(W) and E_{Σ_X}(B) = span(W). It


follows also from these results that span(W ) is equal to the span of the first
q eigenvectors of ΣX , and thus that the material predictors W̃ T X correspond
to the first q principal components. Under this bilinear model then, principal
component regression coincides with partial least squares in the population.
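The population conclusions above are easy to check numerically. The Python sketch below builds Σ_C from the parameters of model (11.5), with W taken semi-orthogonal and parameter values chosen arbitrarily by us, and confirms that span(β) ⊆ span(W) and that span(W) coincides with the span of the first q eigenvectors of Σ_X.

    import numpy as np

    rng = np.random.default_rng(1)
    p, r, q = 8, 4, 2
    sig_e2 = 0.3                                     # sigma_e^2 (our choice)

    W, _ = np.linalg.qr(rng.normal(size=(p, q)))     # W taken semi-orthogonal (A = I)
    V, _ = np.linalg.qr(rng.normal(size=(r, q)))
    Sigma_t = np.diag([4.0, 2.0])                    # diag(sigma_t1^2, sigma_t2^2)
    B = np.diag([1.5, -0.8])

    Sigma_X = W @ Sigma_t @ W.T + sig_e2 * np.eye(p)
    Sigma_XY = W @ Sigma_t @ B.T @ V.T
    beta = np.linalg.solve(Sigma_X, Sigma_XY)

    # span(beta) is contained in span(W): projecting onto span(W) changes nothing.
    P_W = W @ W.T
    print(np.allclose(P_W @ beta, beta))             # True

    # span(W) is spanned by the first q eigenvectors of Sigma_X.
    evals, evecs = np.linalg.eigh(Sigma_X)
    lead = evecs[:, np.argsort(evals)[::-1][:q]]
    print(np.allclose(lead @ lead.T, P_W))           # True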
The bilinear probabilistic PLS model may seem similar to the Martin-Næs
model described in Section 11.2.1, but there are important differences. Most
notably, the Martin-Næs model is not formulated in terms of latent variables,
but is instead described directly in terms of loadings and scores. As indicated
previously, the requirement that the errors e and f be isotropic in the prob-
abilistic PLS model links it to the probabilistic principal component model
PPCA proposed by Tipping and Bishop (1999). As a result, the latent vari-
ables t and u must account for all variation in X and Y up to isotropic errors. One
consequence of this can be seen in the covariance matrices for the material and
immaterial predictors: From (11.6), the variation in the material predictors
var(W̃ T X) = A−1 Σt A−1 + σe2 Iq is unconstrained, while the variation in the
immaterial predictors is modeled as isotropic, var(W̃0T X) = σe2 Ip−q . This is
much more restrictive than the envelope model for predictor reduction (2.5).
Zhang and Chen (2020) discuss the limitations of having isotropic errors as
a key ingredient of probabilistic principal component analysis. In view of the
isotropic errors required by the probabilistic PLS model (11.5), their criticisms
are applicable here as well. In sum, it seems that this probabilistic PLS model
solved a paltry issue with an overly restrictive model, resulting in methodology
that is essentially equivalent to reduction by principal components. Etiévant

and Viallon (2022) raised additional issues and proposed a generalization that
addresses some of them.
Multiple versions of the bilinear model have been used to motivate PLS.
Our assessment is that they can be confusing and more of a hindrance than
a help, particularly since PLS can be motivated fully using envelopes.

11.3 Conjugate gradient, NIPALS, and SIMPLS


11.3.1 Synopsis
There are many techniques available for solving the linear systems of equations
(e.g. Björck, 1966; Elman, 1994),

Aω = b, (11.7)

where b ∈ Rr is known and A ∈ Sr×r is positive definite and known. The


solution ω̃ can of course be represented as ω̃ = A−1 b. However, there are sit-
uations in which computing this naive solution directly may not be the best
course, depending on the relationship between b and the reducing subspaces
of A. Let G ∈ Rr×s be a semi-orthogonal basis matrix for EA (span(b)) and
let (G, G0 ) ∈ Rr×r be an orthogonal matrix. Then, from Proposition 1.2 and
Corollary 1.1, A and A−1 can be represented as

A = GHG^T + G_0 H_0 G_0^T
A^{-1} = GH^{-1}G^T + G_0 H_0^{-1} G_0^T,

where H = G^T A G and H_0 = G_0^T A G_0. It follows that we can compute ω̃ as

ω̃ = GH −1 GT b
= G(GT AG)−1 GT b,

which serves also to highlight the fact that only span(G) = EA (span(b)) mat-
ters. This form for ω̃ could be particularly advantageous if G were known and
its column dimension s = dim{EA (span(b))} were small relative to r, or if the
eigenvalues of H0 were small enough to cause numerical difficulties. Of course,
implementations of this idea require an accurate numerical approximation of
a suitable G. We might consider developing approximations of G by using the

general PLS-type algorithms N(b, A) or S(b, A), but in this discussion there
is not necessarily a statistical context associated with (11.7), so it may be
unclear how to use cross validation or a holdout sample to aid in selecting a
suitable dimension s. Nevertheless, we demonstrate in this section that the
highly regarded conjugate gradient method for solving (11.7) is in fact an en-
velope method that relies on NIPALS and SIMPLS for an approximation of
G (e.g. Phatak and de Hoog, 2002; Stocchero, de Nardi, and Scarpa, 2020).
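Before turning to that statistical context, the envelope representation of ω̃ can be illustrated with a short Python sketch; the construction of A, b, and G below is ours, chosen only so that span(G) is a reducing subspace of A that contains span(b).

    import numpy as np

    rng = np.random.default_rng(2)
    r, s = 10, 3

    # Build A > 0 with span(G) a reducing subspace containing span(b).
    Q, _ = np.linalg.qr(rng.normal(size=(r, r)))
    G, G0 = Q[:, :s], Q[:, s:]
    H = rng.normal(size=(s, s)); H = H @ H.T + s * np.eye(s)
    H0 = rng.normal(size=(r - s, r - s)); H0 = H0 @ H0.T + (r - s) * np.eye(r - s)
    A = G @ H @ G.T + G0 @ H0 @ G0.T                 # span(G) reduces A
    b = G @ rng.normal(size=s)                       # so E_A(span(b)) ⊆ span(G)

    omega_full = np.linalg.solve(A, b)               # naive solution A^{-1} b
    omega_env = G @ np.linalg.solve(G.T @ A @ G, G.T @ b)   # G (G^T A G)^{-1} G^T b
    print(np.allclose(omega_full, omega_env))        # True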
In keeping with the theme of this book, we now consider (11.7) in the
context of model (1.1) with a univariate response and our standard notation
A = var(X) = ΣX , ω = β and b = cov(X, Y ) = σX,Y . In this context, (11.7)
becomes the normal equations for the population, ΣX β = σX,Y . Sample ver-
sions are discussed later in this section.
Table 11.1 gives the conjugate gradient algorithm in the context of solving
the normal equations ΣX β = σX,Y for a regression with a real response. It
was adapted from Elman (1994) and it applies to solving any linear system
Aω = b for A symmetric and positive definite.

11.3.2 Details for the conjugate gradient algorithm of


Table 11.1
To gain an understanding of this algorithm, we consider the first two itera-
tions, as we have done for other algorithms. This then will lead via induction
to a general conclusion about the relationship between the conjugate gradient
algorithm (CGA), PLS, and envelopes. Recall from Tables 3.1 and 3.4 that
Wq = (w1 , . . . , wq ) and Vq = (v1 , . . . , vq ) refer to the NIPALS and SIMPLS
weight matrices with q components.
At k = 0, p_0 = r_0 = σ_{X,Y}, α_0 = σ_{X,Y}^T σ_{X,Y} (σ_{X,Y}^T Σ_X σ_{X,Y})^{-1} and

β_1 = α_0 p_0 = p_0 (p_0^T Σ_X p_0)^{-1} p_0^T σ_{X,Y}
    = {σ_{X,Y}^T σ_{X,Y} (σ_{X,Y}^T Σ_X σ_{X,Y})^{-1}} σ_{X,Y}
    = {σ_{X,Y} (σ_{X,Y}^T Σ_X σ_{X,Y})^{-1} σ_{X,Y}^T} σ_{X,Y}
r_1 = r_0 − α_0 Σ_X p_0
    = σ_{X,Y} − {σ_{X,Y}^T σ_{X,Y} (σ_{X,Y}^T Σ_X σ_{X,Y})^{-1}} Σ_X σ_{X,Y}
    = {I − Σ_X σ_{X,Y} (σ_{X,Y}^T Σ_X σ_{X,Y})^{-1} σ_{X,Y}^T} σ_{X,Y}
    = Q^T_{σ_{X,Y}(Σ_X)} σ_{X,Y}.

From this we see that span(p0 ) = span(r0 ) = span(w1 ) = span(v1 ) and


TABLE 11.1
Conjugate gradient algorithm: Population version of the conjugate gradient
method for solving the linear system of equations Σ_X β = σ_{X,Y} for β.

(a) Population version

Initialize           β_0 = 0, r_0 = σ_{X,Y}, p_0 = r_0 and tolerance ε.
                     R_0 = (r_0), P̄_0 = (p_0)
For k = 0, 1, 2, ...
                     α_k = r_k^T r_k / p_k^T Σ_X p_k
                     β_{k+1} = β_k + α_k p_k
                     r_{k+1} = r_k − α_k Σ_X p_k
Append               R_{k+1} = (R_k, r_{k+1})
End                  when first ‖r_{k+1}‖ < ε‖σ_{X,Y}‖, and then set β_cg = β_{k+1} and
                     q = k + 1
                     B_k = r_{k+1}^T r_{k+1} / r_k^T r_k
                     p_{k+1} = r_{k+1} + B_k p_k
Append               P̄_{k+1} = (P̄_k, p_{k+1})

(b) Notes: w_{k+1} and v_{k+1} denote NIPALS and SIMPLS weights.

Weights              span(r_k) = span(w_{k+1}), span(p_k) = span(v_{k+1});
                     for all j < k, r_j^T r_k = 0, p_j^T Σ_X p_k = 0, p_j^T r_k = 0.
Envelope connection  span(R_q) = E_{Σ_X}(B), the Σ_X-envelope of
                     B = span(β) when r_q = 0.

β1 = βnpls = βspls when the number of components is q = 1 (cf. Tables 3.1 and
3.4). The estimators are thus identical when the respective stopping criteria
are met. For CGA to stop at β_1 we need ‖Q^T_{σ_{X,Y}(Σ_X)} σ_{X,Y}‖ < ε‖σ_{X,Y}‖. For NIPALS
to stop with q = 1 we need, from Table 3.1, QTσX,Y (ΣX ) σX,Y = 0. Thus aside
from a relatively minor difference in the population stopping criterion, the
CGA and NIPALS are so far identical.
Assuming that the stopping criterion is not met, the next part of CGA is
to compute

B_0 = r_1^T r_1 / r_0^T r_0 = σ_{X,Y}^T Q_{σ_{X,Y}(Σ_X)} Q^T_{σ_{X,Y}(Σ_X)} σ_{X,Y} / σ_{X,Y}^T σ_{X,Y}
p_1 = r_1 + B_0 p_0
    = Q^T_{σ_{X,Y}(Σ_X)} σ_{X,Y} + P_{σ_{X,Y}} Q_{σ_{X,Y}(Σ_X)} Q^T_{σ_{X,Y}(Σ_X)} σ_{X,Y}.

Key characteristics of this second step are, as shown in Appendix A.9.1,



span(p_1) = span(v_2). Since span(p_0) = span(v_1), p_0^T Σ_X p_1 = 0. Additionally, since

r_1^T σ_{X,Y} = σ_{X,Y}^T Q_{σ_{X,Y}(Σ_X)} σ_{X,Y} = 0,

we have

p_1^T σ_{X,Y} = r_1^T σ_{X,Y} + σ_{X,Y}^T Q_{σ_{X,Y}(Σ_X)} Q^T_{σ_{X,Y}(Σ_X)} σ_{X,Y} = r_1^T r_1.

Using these results, we get

β_2 = β_1 + (r_1^T r_1 / p_1^T Σ_X p_1) p_1
    = p_0 (p_0^T Σ_X p_0)^{-1} p_0^T σ_{X,Y} + p_1 (p_1^T Σ_X p_1)^{-1} p_1^T σ_{X,Y}
    = V_2 (V_2^T Σ_X V_2)^{-1} V_2^T σ_{X,Y}
    = β_spls when q = 2,

where V2 is the second SIMPLS weight matrix.


The following proposition and its proof covers these initial observations
and casts them into a general result connecting NIPALS, SIMPLS and CGA.
The notation is that used in Tables 3.1, 3.4, and 11.1. The index i in the
proposition is offset by 1 for NIPALS and SIMPLS. This arises because the
PLS algorithms are indexed from i = 1, while CGA is indexed from i = 0.

Proposition 11.1. For a single-response regression with ΣX > 0 and dimen-


sion q envelope EΣX(B), we have

1. span(ri ) = span(wi+1 ), i = 0, 1, . . . , q − 1

2. span(pi ) = span(vi+1 ), i = 0, 1, . . . , q − 1

3. span(r0 , . . . , rq−1 ) = span(Wq ) = span(Vq ) = EΣX(B)

4. βcg = βnpls = βspls .

Proof. The proof in Appendix A.9.1 is by induction on q.

This proposition tells us that in effect CGA uses NIPALS and SIMPLS to
pursue an envelope solution to the linear system of equations ΣX β = σX,Y . All
three algorithms – CGA, NIPALS, and SIMPLS – can be regarded as meth-
ods for estimating β based on the predictor envelope EΣX(B). The conjugate
gradient method (Hestenes and Stiefel, 1952) preceded NIPALS and SIMPLS

and, being based on a positive definite sample covariance matrix SX , it pro-


ceeds by updating the estimate of β. NIPALS and SIMPLS do not require SX
to be positive definite and they proceed by generating the weights wj and vj
reserving estimation of β until the end. There are relatively recent studies of
the conjugate gradient method when A is singular (see, for example, Hayami,
2020) and those might be closer to the PLS algorithms.
Proposition 11.1 also holds in the sample, provided SX > 0, and we as-
sume that all three algorithms stop at the same point. In that case, ri , pi ,
wi+1 and vi+1 are based on the sample, the last equality in item 3 becomes
span(Vq ) = EbΣX (B), the estimated envelope, and the fourth item becomes
βbcg = βbnpls = βbspls . In application, CGA may differ appreciably from the PLS
methods because of the nature of their stopping criteria: CGA uses a sample
version of the PLS population stopping criteria described in Tables 3.1 and
3.4, while the stopping criterion for the PLS methods is typically based on
cross validation or a holdout sample. Of course, cross validation could be used
in conjunction with CGA as well.
Recall from Tables 3.1 and 3.4 that at termination NIPALS and SIMPLS
employ matrices in the computation of their estimates of β. In contrast,
the CGA estimator β̂_cg is a simple additive update of the previous iterate; no
matrix evaluation or storage is required. In consequence, CGA may be compu-
tationally more stable than NIPALS or SIMPLS.
A version of Proposition 11.1 holds also when SX is singular. It is well-
known that the linear system SX βb = σ bX,Y always has at least one solution; it
has infinitely many solutions when SX is singular. To find a solution that fits
with Proposition 11.1 when SX is singular, we can use the NIPALS method dis-
cussed in Section 3.1.3. Let H be a semi-orthogonal basis matrix for span(S_X).
Then S_{H^T X} > 0 and the linear system S_{H^T X} α̂ = σ̂_{H^T X, Y} is covered by Propo-
sition 11.1. The solution to the original equations is then β̂_cg = H α̂. Since
the PLS methods use the same technique for dealing with rank deficiency, the
connections between CGA and the PLS methods described in Proposition 11.1
still hold.
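A minimal Python sketch of this device for a singular S_X, with simulated n < p data of our choosing, is as follows; H collects the eigenvectors of S_X with nonzero eigenvalues.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 20, 50                                   # n < p, so S_X is singular
    X = rng.normal(size=(n, p))
    y = X @ rng.normal(size=p) * 0.1 + rng.normal(size=n)

    Xc, yc = X - X.mean(0), y - y.mean()
    S_X = Xc.T @ Xc / n
    s_XY = Xc.T @ yc / n

    # H: semi-orthogonal basis for span(S_X), from its nonzero eigenvectors.
    evals, evecs = np.linalg.eigh(S_X)
    H = evecs[:, evals > 1e-10]

    # The reduced system in the coordinates H^T X is nonsingular.
    S_red = H.T @ S_X @ H                           # S_{H^T X}
    s_red = H.T @ s_XY                              # sigma-hat_{H^T X, Y}
    beta_hat = H @ np.linalg.solve(S_red, s_red)    # map the reduced solution back

    print(np.allclose(S_X @ beta_hat, s_XY))        # beta_hat solves S_X beta = s_XY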

11.3.3 Origins of CGA


In the previous section we described the connection between CGA, NIPALS,
SIMPLS, and envelopes. However, the origins of CGA are quite different, trac-
ing back to the basic notion of steepest descent (see Björck, 1966, Ch. 7, for

an historical overview.) The numerical goal is to find an approximate solution


to ΣX β = σX,Y ∈ Rp by minimizing φ(β) = ||σX,Y − ΣX β||22 . The general
idea is to start with a possible solution β0 and then iterate to a final solu-
tion, moving from step k to step k + 1 in the direction of steepest descent
dk := −∇φ(βk ) = σX,Y − ΣX βk . The (k + 1)-st iterate is then defined as
βk+1 = βk + αdk , with α chosen to minimize φ(βk + αdk ). This gives αk =
dTk dk /dTk ΣX dk . The algorithm stops when the update dk is sufficiently small.
An alternative version of steepest descent optimizes over all α coefficients
simultaneously at each step:

1. Choose β1 to minimize φ over {β0 + αd0 , α ∈ R}. We get β1 = β0 + δ0 d0 .


Define d1 = σX,Y − ΣX β1 .

2. Choose β_2 to minimize φ over {β_0 + α_0 d_0 + α_1 d_1, α_0, α_1 ∈ R}. We find
β_2 = β_0 + δ_0 d_0 + δ_1 d_1 and define d_2 = σ_{X,Y} − Σ_X β_2.

3. Choose βk+1 to minimize φ over {β0 + α0 d0 + · · · + αk dk , αi ∈ R}, and


define dk+1 accordingly.

4. Stop when ‖d_{k+1}‖ < ε‖σ_{X,Y}‖.

This alternative version of steepest descent is in principle better than the basic
version because at each step it optimizes simultaneously over the coefficients
α_j of all directions d_j, but it is relatively complicated and rather unwieldy
in application. However, there is an equivalent algorithm that gives the same
solution at the end (but not in the intermediate steps) and updates only from
the last step. The key in this variation is to find a single direction that gives the
same iterate at each step. This leads then to CGA (Björck, 1966; Elman, 1994):

1. Initialize β0 = 0, d0 = σX,Y − ΣX β0 , and p0 = d0 . At initialization the


new direction p0 is the same as the original direction.

2. Set β_1 = β_0 + δ_0 p_0, where

   δ_0 = arg min_δ φ(β_0 + δp_0) = d_0^T d_0 / p_0^T Σ_X p_0,

   d_1 = d_0 − δ_0 Σ_X p_0 and p_1 = d_1 + γ_0 p_0 with γ_0 = d_1^T d_1 / d_0^T d_0.

3. At the (k + 1)-st step, update β_k in the direction of p_k: β_{k+1} = β_k + δ_k p_k,
   where

   δ_k = arg min_δ φ(β_k + δp_k) = d_k^T d_k / p_k^T Σ_X p_k,

   d_{k+1} = d_k − δ_k Σ_X p_k and p_{k+1} = d_{k+1} + γ_k p_k with

   γ_k = d_{k+1}^T d_{k+1} / d_k^T d_k.

4. Terminate when ‖d_{k+1}‖ < ε‖σ_{X,Y}‖; a code sketch of this algorithm follows the list.
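A compact Python rendering of the iteration just described, with eps playing the role of the tolerance ε and a small simulated system used only as a check, is sketched below.

    import numpy as np

    def conjugate_gradient(Sigma_X, sigma_XY, eps=1e-8):
        """Solve Sigma_X @ beta = sigma_XY by the CGA iteration described above."""
        beta = np.zeros_like(sigma_XY)
        d = sigma_XY - Sigma_X @ beta                # initial direction d_0
        p_dir = d.copy()                             # p_0 = d_0
        while np.linalg.norm(d) >= eps * np.linalg.norm(sigma_XY):
            delta = (d @ d) / (p_dir @ Sigma_X @ p_dir)
            beta = beta + delta * p_dir
            d_new = d - delta * Sigma_X @ p_dir
            gamma = (d_new @ d_new) / (d @ d)
            p_dir = d_new + gamma * p_dir
            d = d_new
        return beta

    # Quick check against the direct solution on a small simulated system.
    rng = np.random.default_rng(4)
    p = 6
    M = rng.normal(size=(p, p)); Sigma_X = M @ M.T + np.eye(p)
    sigma_XY = rng.normal(size=p)
    print(np.allclose(conjugate_gradient(Sigma_X, sigma_XY),
                      np.linalg.solve(Sigma_X, sigma_XY)))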

We find it rather remarkable that the CGA turns out to be an envelope


method that relies on NIPALS and SIMPLS for iteration improvement since
the rationale for the algorithm makes no direct appeal to reducing subspaces.

11.4 Sparse PLS


As mentioned in Chapter 4, sparse versions of PLS regression have been
proposed by Chun and Keleş (2010), Liland et al. (2013), and Zhu and Su
(2020). From our experience, we conclude that most chemometric regressions
are closer to abundant than sparse. Nevertheless, assuming sparsity can be a
sound step if there is prior supporting information, but it should not be taken
for granted just because one may be dealing with high dimensions. Chun and
Keleş (2010) used penalization to induce a sparse version of the PLS algorithm.
The method is relatively popular in statistics but its theoretical properties are
largely unknown. Zhu and Su (2020) developed a sparse version of envelope
methodology that performs much better than the proposal by Chun and Keleş
(2010). They referred to their sparse methods as envelope-based sparse PLS.
To introduce sparsity into envelope model (2.5), Zhu and Su (2020)
first partitioned X into active predictors X_A ∈ R^{p_A} and inactive predictors
X_I ∈ R^{p_I}, with X = (X_A^T, X_I^T)^T, p = p_A + p_I. Under sparsity, the rows of φ
corresponding to the inactive predictors are all zero,

φ = ( φ_A )
    (  0  ),

which, from model (2.5), leads to

β = φη = ( φ_A η )  :=  ( β_A )
          (   0   )      (  0  ).

Recall from (2.11) that the predictor envelope estimator of φ in model (2.5)
is based on minimizing over semi-orthogonal matrices G ∈ Rp×q the objective

function

L_q(G) = log|G^T S_{X|Y} G| + log|G^T S_X^{-1} G|.
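As an aside, this objective is simple to evaluate; the following Python sketch (function and variable names are ours) computes L_q(G) for a candidate semi-orthogonal G from the sample matrices.

    import numpy as np

    def envelope_objective(G, S_X, S_XgivenY):
        """L_q(G) = log|G' S_{X|Y} G| + log|G' S_X^{-1} G| for semi-orthogonal G."""
        _, ld1 = np.linalg.slogdet(G.T @ S_XgivenY @ G)
        _, ld2 = np.linalg.slogdet(G.T @ np.linalg.inv(S_X) @ G)
        return ld1 + ld2

    # Illustrative use with simulated data of our choosing.
    rng = np.random.default_rng(5)
    n, p, q = 100, 5, 2
    X = rng.normal(size=(n, p)); Y = X[:, :1] + 0.5 * rng.normal(size=(n, 1))
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    S_X, S_XY, S_Y = Xc.T @ Xc / n, Xc.T @ Yc / n, Yc.T @ Yc / n
    S_XgY = S_X - S_XY @ np.linalg.solve(S_Y, S_XY.T)   # S_{X|Y}
    G, _ = np.linalg.qr(rng.normal(size=(p, q)))        # one candidate G
    print(envelope_objective(G, S_X, S_XgY))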

Zhu and Su (2020) based their development of envelope-based sparse PLS


on a functionally equivalent objective function, say L∗ , developed by Cook,
Forzani, and Su (2016) that can be solved using unconstrained minimiza-
tion over R(p−q)×q rather than constrained minimization in Rp×q . Then the
Zhu-Su envelope-based sparse PLS estimator of φ is found by optimizing a
constrained version of L∗ with the sparse parameterization. Their method
is more akin to maximum likelihood and is not based on a PLS-type algo-
rithm. In consequence, we see their label “envelope-based sparse PLS” as a
misnomer. They proved that their method has desirable theoretical properties
and demonstrated via simulation that it can perform much better than the
method by Chun and Keleş (2010).

11.5 PLS for multi-way predictors


In this book we have dealt with regression analysis in which the predictors
take the form of vectors. In chemometrics, these are typically vectorial spec-
tral data, like those captured through near-infrared (NIR) sample absorption.
With increasingly sophisticated instrumentation, predictors may take on higher-
dimensional forms, like matrices or three-dimensional arrays.
Imagine that we aim to determine the concentration of a substance in water
using single excitation and emission wavelength fluorescence measurements,
where the sole emitter is the substance in question. Classical calibration in-
volves using a range of substance standards of different known concentrations
and constructing a regression for substance detection. The same methodol-
ogy is not feasible when gauging the concentration of a substance in samples
containing other fluorescent components. However, we could measure fluores-
cence spectra for a set of calibration samples containing both our substance
and possible interferences that might occur in new samples. This is a first-order
multivariate calibration method that allows quantification of the substance in
mixtures. But the original question remains: Can we do this using standards
of the pure substance? Generally, the answer is expected to be “no.”

However, if we take measurements of matrix excitation-emission fluores-


cence data, we can indeed predict the concentrations of substances in new
samples, having calibrated with pure substance standards. This highlights the
primary advantage of multiway calibration. The goal of the earliest paper on
this topic, published in 1978, was to determine the polycyclic aromatic hy-
drocarbon perylene in mixtures with anthracene, calibrated solely with pery-
lene solutions, through appropriate processing of multiway fluorescence data.
Multiway calibration can therefore be seen as relatively recent in the field
of analytical chemistry. It results in models where Y is a real number but
the predictors X are matrices. See Olivieri and Escandar (2014) for further
explanation of these kinds of data.
There are more intricate experiments where the predictors may be of higher
dimensions, but for simplicity we will now focus on matrix-valued predictors
X ∈ Rp×s and real responses Y . The model will follow this structure.
Vectorizing X and adopting a linear model in vec(X),

Y = β_0 + β^T vec(X) + ε, with ε ∼ N(0, σ^2_{Y|X}), (11.8)

results in a setting that is covered by the dimension reduction methods in


this book. This includes PLS regression methods of Tables 3.1 and 3.4 with
X replaced by vec(X). This type of analysis is commonly called U-PLS in
analytic chemistry. Recall that the NIPALS and SIMPLS algorithms in the
population give consistent estimators for the envelope model:

β = Γη
Σvec(X) = ΓΩΓT + Γ0 Ω0 ΓT0

for some Γ ∈ Rps×q , semi-orthogonal, η ∈ Rq and positive definite matrices Ω


and Ω0 . PLS analyses based on this vectorization are quite flexible, but un-
fortunately they muddle the two-dimensional structure of the predictor and
may lead to a loss in chemical interpretability.
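A rough Python sketch of U-PLS along these lines is given below; it uses a NIPALS-type weight recursion of the form in Table 3.1 and the estimator W_q(W_q^T S_X W_q)^{-1} W_q^T s_{X,Y}, with simulated matrix-valued predictors and a row-major vectorization convention of our choosing.

    import numpy as np

    def npls_weights(S_X, s_XY, q):
        """NIPALS-type weights: w_1 ∝ s_XY, w_{d+1} ∝ Q'_{W_d(S_X)} s_XY."""
        W = np.zeros((S_X.shape[0], 0))
        for _ in range(q):
            if W.shape[1]:
                P = W @ np.linalg.solve(W.T @ S_X @ W, W.T @ S_X)   # P_{W(S_X)}
                w = s_XY - P.T @ s_XY                               # Q'_{W(S_X)} s_XY
            else:
                w = s_XY.copy()
            W = np.column_stack([W, w / np.linalg.norm(w)])
        return W

    # U-PLS: vectorize the matrix-valued predictors and apply ordinary PLS.
    rng = np.random.default_rng(6)
    n, p, s, q = 100, 8, 5, 2
    X = rng.normal(size=(n, p, s))                     # matrix predictors X_i in R^{p x s}
    Xv = X.reshape(n, p * s)                           # one fixed vectorization of X_i
    y = Xv @ rng.normal(size=p * s) * 0.2 + rng.normal(size=n)

    Xc, yc = Xv - Xv.mean(0), y - y.mean()
    S_X, s_XY = Xc.T @ Xc / n, Xc.T @ yc / n
    W = npls_weights(S_X, s_XY, q)
    beta_upls = W @ np.linalg.solve(W.T @ S_X @ W, W.T @ s_XY)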
In Bro (1998) a multiway regression method called N-way partial least
squares (N-PLS) was presented. According to the authors, the developed al-
gorithm is superior to the vectorization methods, primarily owing to a sta-
bilization of the decomposition. Nevertheless, this method does not work as

expected. As explained in Olivieri and Escandar (2014), this is due to the fact
that the bilinear structure is only partially used in the estimation process.
Following the ideas for response envelopes given by Ding and Cook (2018),
we propose here a new algorithm that uses the structure of bilinearity from
the beginning. Specifically, consider the bilinear model

Y = β_0 + β_1^T X β_2 + ε, with ε ∼ N(0, σ^2_{Y|X}), (11.9)

with X ∈ Rp×s , β1 ∈ Rp , β2 ∈ Rs , Σvec(X) = Σ2 ⊗ Σ1 , Σ1 ∈ Rp×p and Σ2 ∈


Rs×s . Also, assuming that E(X) = 0 without loss of generality, it follows that

cov(vec(X), Y) = E(vec(X)Y) = E(vec(X)vec^T(X))(β_2 ⊗ β_1) = Σ_2 β_2 ⊗ Σ_1 β_1,
unvec(cov(vec(X), Y)) = Σ_1 β_1 β_2^T Σ_2.

As a stepping stone to PLS methods, we define a Kronecker envelope as
span(W_{q_1} ⊗ V_{q_2}), where span(W_{q_1}) and span(V_{q_2}) are the smallest subspaces of
R^p and R^s, respectively, whose semi-orthogonal basis matrices W_{q_1} and V_{q_2}
satisfy

β_1 = W_{q_1} A_1
β_2 = V_{q_2} A_2
Σ_1 = W_{q_1} Ω_1 W_{q_1}^T + W_{q_1,0} Ω_{1,0} W_{q_1,0}^T
Σ_2 = V_{q_2} Ω_2 V_{q_2}^T + V_{q_2,0} Ω_{2,0} V_{q_2,0}^T.

As a consequence, a PLS estimator of the envelope for model (11.9) is given


in Table 11.2 and will be denoted as K-PLS (Kronecker-PLS). Further work
is needed to compare the estimators and study their asymptotic behavior.
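To fix ideas, the following Python sketch mirrors the weight construction of Table 11.2 on simulated data; the moment estimators and the Kronecker ordering below reflect our reading of that table and of Σ_vec(X) = Σ_2 ⊗ Σ_1, and should be treated as assumptions rather than a settled implementation.

    import numpy as np

    def npls_weights(S, s, q):                      # same helper as in the U-PLS sketch
        W = np.zeros((S.shape[0], 0))
        for _ in range(q):
            w = s if W.shape[1] == 0 else s - (W @ np.linalg.solve(W.T @ S @ W, W.T @ S)).T @ s
            W = np.column_stack([W, w / np.linalg.norm(w)])
        return W

    rng = np.random.default_rng(7)
    n, p, s, q1, q2 = 200, 6, 4, 1, 1
    X = rng.normal(size=(n, p, s))
    beta1, beta2 = rng.normal(size=p), rng.normal(size=s)
    y = np.einsum('j,ijk,k->i', beta1, X, beta2) + 0.5 * rng.normal(size=n)

    Xc, yc = X - X.mean(0), y - y.mean()
    # Moment estimates in the spirit of Table 11.2 (averaging over the other index).
    S1Y = np.einsum('ijk,i->j', Xc, yc) / (n * s)           # p-vector, for Sigma_1Y
    S2Y = np.einsum('ijk,i->k', Xc, yc) / (n * p)           # s-vector, for Sigma_2Y
    S1 = np.einsum('ijk,ihl->jh', Xc, Xc) / (n * s**2)      # p x p, for Sigma_1
    S2 = np.einsum('ijk,ihl->kl', Xc, Xc) / (n * p**2)      # s x s, for Sigma_2

    W1 = npls_weights(S1, S1Y, q1)
    V2 = npls_weights(S2, S2Y, q2)
    b1 = W1 @ np.linalg.solve(W1.T @ S1 @ W1, W1.T @ S1Y)   # direction for beta_1
    b2 = V2 @ np.linalg.solve(V2.T @ S2 @ V2, V2.T @ S2Y)   # direction for beta_2
    # Ordering matches Sigma_vec(X) = Sigma_2 ⊗ Sigma_1 (column-major vec);
    # scale is not resolved by this construction.
    beta_kpls = np.kron(b2, b1)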
In this section we have restricted attention to matrix-valued predictors. The
ideas extend straightforwardly to general tensor-valued predictors, although
the associated algebra is notably more challenging. In the statistics literature
Zhang and Li (2017) developed a PLS estimator for tensor predictors that
requires the inverse of the covariance matrix. As we know, PLS methods are
generally applicable because they do not require inverting large matrices. More
work is needed to determine if these algorithms can be implemented without
inverting such matrices.
TABLE 11.2
NIPALS K-PLS algorithm.

Population model (11.9)   Σ_vec(X) = Σ_2 ⊗ Σ_1, E(vec(X)Y) = Σ_{2Y} ⊗ Σ_{1Y},
                          unvec(E(vec(X)Y)) = Σ_{1Y} Σ_{2Y}^T

(a) Population Version

Initialize               w_1 ∈ R^p ∝ Σ_{1Y}, v_1 ∈ R^s ∝ Σ_{2Y}, W_1 = (w_1), V_1 = (v_1)
For d = 1, 2, . . . and l = 1, 2, . . .
  compute weights        w_{d+1} ∝ Q^T_{W_d(Σ_1)} Σ_{1Y}
  compute weights        v_{l+1} ∝ Q^T_{V_l(Σ_2)} Σ_{2Y}
  Append                 W_{d+1} = (W_d, w_{d+1}) and V_{l+1} = (V_l, v_{l+1})
  End for W              when first Q^T_{W_d(Σ_1)} Σ_{1Y} = 0, and set q_1 = d
  End for V              when first Q^T_{V_l(Σ_2)} Σ_{2Y} = 0, and set q_2 = l
Regression coefficients  β_npls = W_{q_1}(W_{q_1}^T Σ_1 W_{q_1})^{-1} W_{q_1}^T Σ_{1,Y} ⊗
                         V_{q_2}(V_{q_2}^T Σ_2 V_{q_2})^{-1} V_{q_2}^T Σ_{2,Y}

(b) Sample Version. Substitute for the population versions

Σ̂_{1Y} = (1/(sn)) [ Σ_{k=1}^{s} Σ_{i=1}^{n} X_{ijk}(Y_i − Ȳ) ]_{j=1,...,p}
Σ̂_{2Y} = (1/(pn)) [ Σ_{j=1}^{p} Σ_{i=1}^{n} X_{ijk}(Y_i − Ȳ) ]_{k=1,...,s}
Σ̂_1 = (1/(ns^2)) [ Σ_{k=1}^{s} Σ_{t=1}^{s} Σ_{i=1}^{n} (X_{ijk} − X̄_{jk})(X_{iht} − X̄_{ht}) ]_{j,h=1,...,p}
Σ̂_2 = (1/(np^2)) [ Σ_{j=1}^{p} Σ_{h=1}^{p} Σ_{i=1}^{n} (X_{ijk} − X̄_{jk})(X_{iht} − X̄_{ht}) ]_{t,k=1,...,s}

(c) Notes

Orthogonal weights    W_{q_1}^T W_{q_1} = I_{q_1} and V_{q_2}^T V_{q_2} = I_{q_2}
Envelope connection   span(W_{q_1}) = E_{Σ_1}(B_1), the Σ_1-envelope of B_1 := span(β_1)
                      span(V_{q_2}) = E_{Σ_2}(B_2), the Σ_2-envelope of B_2 := span(β_2)

11.6 PLS for generalized linear models


As described at the outsets of Chapters 1 and 2, the goal of envelope method-
ology is to separate with clarity information in the data that is material to
the goals of the analysis from that which is immaterial. Pursuing this notion
in the case of the multivariate linear regression model (1.1) led to serviceable
dimension reduction paradigms and close connections with PLS. Adaptations
for linear and quadratic discriminant analysis and for non-linear regression
were discussed in Chapters 7–9.
Cook and Zhang (2015a) developed a foundation for extending envelopes
to relatively general settings, including generalized linear models (GLMs). In
this section, we briefly review their foundation and its implications for GLMs.
This will then lead to PLS for predictor reduction in GLMs. To facilitate
making connections with literature, we use the notation of Cook and Zhang
(2015a) provided that it does not conflict with notation that we used previ-
ously. We assume familiarity with GLMs, providing background only to set
the stage for envelopes and PLS.

11.6.1 Foundations
Instead of proposing a specific modeling environment for envelopes, Cook
and Zhang (2015a) started with an asymptotically normal estimator. Let
θ ∈ Θ ⊆ Rm denote a parameter vector, which we decompose into a vec-
tor φ ∈ Rp , p ≤ m, of targeted parameters and a vector ψ ∈ Rm−p nuisance

parameters. We require that n(φb − φ) converge in distribution to a normal
random vector with mean 0 and covariance matrix Vφφ (θ) > 0 as n → ∞.
Allowing Vφφ (θ) to depend on the full parameter vector θ means that the
variation in φb is can depend on the parameters of interest φ in addition to the
nuisance parameters ψ. In many problems we may construct φ and ψ to be
orthogonal parameters in the sense of Cox and Reid (1987). In the remainder
of this section, we suppress notation indicating that Vφφ (θ) may depend on θ
and write instead Vφφ in place of Vφφ (θ).

In the context of linear model (1.1),

θ = (α^T, vec^T(β), vech^T(Σ_{Y|X}), vech^T(Σ_X))^T;   ψ = (α^T, vech^T(Σ_{Y|X}), vech^T(Σ_X))^T;   φ = vec(β).

When targeting the OLS estimator φb = vec(βbols ) for improvement, we have


from (1.18),

V_φφ = avar(√n vec(β̂_ols) | X_0) = Σ_{Y|X} ⊗ Σ_X^{-1},

which depends on the nuisance parameters but not on β.


Let F = span(φ). Then we construct an envelope for improving φb as follows
(Cook and Zhang, 2015a) .

Definition 11.1. The envelope for the parameter φ ∈ Rp is defined as the


smallest reducing subspace of Vφφ that contains F, which is represented as
EVφφ (F) ⊆ Rp .

This definition expands our previous approaches in three fundamental


ways. First, it links the envelope to a particular pre-specified method of esti-
mation through the covariance matrix Vφφ ; a model is not required. Second,
the matrix to be reduced – here Vφφ – is dictated by the method of estimation.
Third, the matrix to be reduced can now depend on the parameter being es-
timated, in addition to perhaps other parameters. Definition 11.1 reproduces
all of the envelope methods discussed in this book (Cook and Zhang, 2015a).
The potential improvement from using envelopes in this general context
arises in much the same way as we discussed in Section 2.2 for predictor re-
duction. For this illustration, we assume that a basis matrix Γ for the envelope
E_{V_φφ}(F) is known and we construct the envelope estimator φ̂_env of φ by pro-
jecting φ̂ onto span(Γ), φ̂_env = P_Γ φ̂. Then, since span(Γ) reduces V_φφ, we have
that

V_φφ = P_Γ V_φφ P_Γ + Q_Γ V_φφ Q_Γ
avar(√n φ̂_env) = avar(√n P_Γ φ̂) = P_Γ V_φφ P_Γ ≤ V_φφ.

See Cook and Zhang (2015a) for methods of estimation in the general setting.

11.6.2 Generalized linear models


We restrict discussion to a GLM in which Y | X follows a one-parameter
exponential family with probability mass or density function f (y | ϑ) =
exp{yϑ − b(ϑ) + c(y)}, where ϑ is the canonical parameter and y ∈ R1 .
We consider only regressions based on a one-parameter family with the
canonical link function, ϑ(α, β) = α + β TX, which gives θ = (α, β T )T ,
µ(ϑ) = E(Y | ϑ) = b′(ϑ) and var(Y | ϑ) = b″(ϑ), where ′ and ″ denote first
and second derivatives with respect to the argument evaluated at its popula-
tion value. The overarching objective is to reduce the dimension of X ∈ Rp
with the goal of improving the estimation of β, which is the parameter vector
of interest; α is the nuisance parameter.
For a sample (yi , Xi ), i = 1, . . . , n, the log likelihood for the i-th observa-
tion, which varies for different exponential family distributions of Y |X, is

C(ϑi | yi ) := log f (yi |ϑi ) = yi ϑi − b(ϑi ) + c(yi ) = C(ϑi ) + c(yi ),

where ϑi = α+β TXi and C(ϑi ) = yi ϑi −b(ϑi ) is the kernel of the log likelihood.
The full log likelihood can be written as

C_n(α, β) = Σ_{i=1}^{n} C(ϑ_i | y_i) = Σ_{i=1}^{n} C(ϑ_i) + Σ_{i=1}^{n} c(y_i).

Different log likelihood functions are summarized in Table 11.3 via the kernel.
We next briefly review Fisher scoring, which is the standard iterative
method for maximizing Cn (α, β). At each iteration of the Fisher scor-
ing method, the update step for βb can be summarized in the form of a

TABLE 11.3
A summary of one-parameter exponential families. For the normal, σ = 1.
A(ϑ) = 1 + exp(ϑ). C′(ϑ) and C″(ϑ) are the first and second derivatives of
C(ϑ) evaluated at the true value.

              E(Y|X)          C(ϑ)              C′(ϑ)                −C″(ϑ) (∝ ω)
Normal        ϑ               Yϑ − ϑ^2/2        Y − ϑ                1
Poisson       exp(ϑ)          Yϑ − exp(ϑ)       Y − exp(ϑ)           exp(ϑ)
Logistic      exp(ϑ)/A(ϑ)     Yϑ − log A(ϑ)     Y − exp(ϑ)/A(ϑ)      exp(ϑ)/A^2(ϑ)
Exponential   −ϑ^{-1} > 0     Yϑ − log(−ϑ)      Y − 1/ϑ              ϑ^{-2}

weighted least squares (WLS) estimator where the weights are defined as
ω(ϑ) = −C″(ϑ). With the canonical link, as we are assuming, −C″(ϑ) =
b″(ϑ) = var(Y | ϑ). For a sample of size n, we define the population weights as

ω_i = ω(ϑ_i) = var(Y | ϑ_i) / Σ_{j=1}^{n} var(Y | ϑ_j),   i = 1, . . . , n,

which are normalized so that Σ_{i=1}^{n} ω_i = 1. Estimated weights are obtained by
simply substituting estimates for the ϑi ’s. In keeping with our convention, we
use the same notation for population and estimated weights, which should be
clear from context. Let Ω = diag(ω1 , . . . , ωn ) and define the weighted sample
estimators, which use sample weights,
X̄_(Ω) = Σ_{i=1}^{n} ω_i X_i
S_{X(Ω)} = Σ_{i=1}^{n} ω_i [X_i − X̄_(Ω)][X_i − X̄_(Ω)]^T
S_{X,Ẑ(Ω)} = Σ_{i=1}^{n} ω_i [X_i − X̄_(Ω)][Ẑ_i − Z̄_(Ω)]^T,

where Ẑ_i = ϑ̂_i + {Y_i − µ(ϑ̂_i)}/ω_i is a pseudo-response variable at the cur-
rent iteration. The weighted covariance S_{X(Ω)} is the sample version of the
population-weighted covariance matrix

Σ_{X(ω)} = E{ω[X − E(ωX)][X − E(ωX)]^T},

and the weighted cross-covariance S_{X,Ẑ(Ω)} is the sample estimator of the
weighted cross-covariance matrix

Σ_{X,Z(ω)} = E{ω[X − E(ωX)][Z − E(ωZ)]^T},

where Z is the population version of Ẑ.
Then the updated estimator of β at the k-th iteration can now be repre-
sented as a WLS estimator

β̂ ← S_{X(Ω)}^{-1} S_{X,Ẑ(Ω)}.   (11.10)

The corresponding updated estimator of α can be determined as α̂ =
arg max_α C_n(α, β̂). Upon convergence of the Fisher scoring process, the final
WLS estimator, denoted β̂_wls, is a function of α̂, β̂, and ω. This iterative form
allows β to be represented in the population using a weighted construction,

β = Σ_{X(ω)}^{-1} Σ_{X,Z(ω)}.

The asymptotic covariance matrix of β̂_wls is (e.g. Cook and Zhang, 2015a)

avar(√n β̂) = V_ββ(θ) = {E(−C″) · Σ_{X(ω)}}^{-1},

while E_{V_ββ}(B) is the corresponding envelope for improving estimation of β,
where still B = span(β). Writing E_{V_ββ}(B) in a bit more detail and using
Proposition 1.6 and the discussion that immediately follows, we have

E_{V_ββ}(B) = E_{{E(−C″)·Σ_{X(ω)}}^{-1}}(B) = E_{Σ_{X(ω)}}(span(Σ_{X,Z(ω)})).

From this we see that the envelope for improving β in a GLM has the same
form as the envelope EΣX(B) = EΣX (CX,Y ) for linear predictor reduction. This
implies that we can construct PLS-type estimators of β in GLM’s by first per-
forming dimension reduction using a NIPALS or SIMPLS algorithm. Specifi-
cally, implement a sample version of the NIPALS algorithm in Table 3.1b or
the SIMPLS algorithm in Table 3.4b, substituting SX(Ω) for ΣX and SX,Z(Ω) b
for ΣX,Y . Following reduction, prediction can be based on the GLM regres-
sion of Y on the reduced predictors W TX. Let νb denote the estimator of the
coefficient vector from this regression. The corresponding PLS estimator of β
is W νb.
Following these ideas, Table 11.4 gives an algorithm for using a PLS to
fit a one-parameter family GLM. The algorithm has two levels of iteration.
The outer level, which is shown in the table, consists of the score-based iterations
for fitting a GLM. The inner iterations, which are not shown explicitly, occur
when calling a PLS algorithm during each outer GLM iteration. The “For
k = 1, 2, . . . ,” instruction indexes the GLM iterations. For each value of k
there is an instruction to “call PLS algorithm” with the current parameter
values. The calls to a PLS algorithm all have the same number of components
q and so these PLS iterations terminate after q stages. The overall algorithm
stops when the coefficient estimates no longer change materially. The algo-
rithm does not require n > p.
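For concreteness, the following Python sketch implements the iteration of Table 11.4 for a Bernoulli (logistic) GLM; as a simplification, the inner GLM regression of Y on W^T X is replaced here by the corresponding weighted least squares update on the reduced predictors, and all function and variable names are ours.

    import numpy as np

    def npls_weights(S, s, q):
        """NIPALS-type weights: w_1 ∝ s, w_{d+1} ∝ Q'_{W_d(S)} s."""
        W = np.zeros((S.shape[0], 0))
        for _ in range(q):
            w = s if W.shape[1] == 0 else s - (W @ np.linalg.solve(W.T @ S @ W, W.T @ S)).T @ s
            W = np.column_stack([W, w / np.linalg.norm(w)])
        return W

    def pls_logistic(X, y, q, n_iter=100, tol=1e-6):
        """Rough sketch of the Table 11.4 iteration for a Bernoulli GLM."""
        n, p = X.shape
        omega = np.full(n, 1.0 / n)                       # initial weights
        beta = np.zeros(p)
        alpha = np.log((y.mean() + 1e-6) / (1 - y.mean() + 1e-6))
        for _ in range(n_iter):
            theta = alpha + X @ beta                      # canonical parameter
            mu = 1.0 / (1.0 + np.exp(-theta))             # E(Y | theta)
            Z = theta + (y - mu) / np.maximum(omega, 1e-10)   # pseudo responses
            Xc = X - omega @ X                            # weight-centered data
            S_Xw = Xc.T @ (omega[:, None] * Xc)           # S_X(Omega)
            s_XZw = Xc.T @ (omega * (Z - omega @ Z))      # S_X,Z(Omega)
            W = npls_weights(S_Xw, s_XZw, q)
            beta_new = W @ np.linalg.solve(W.T @ S_Xw @ W, W.T @ s_XZw)
            for _ in range(25):                           # 1-D Newton steps for alpha
                mu_a = 1.0 / (1.0 + np.exp(-(alpha + X @ beta_new)))
                alpha += np.sum(y - mu_a) / max(np.sum(mu_a * (1 - mu_a)), 1e-10)
            if np.linalg.norm(beta_new - beta) <= tol:
                beta = beta_new
                break
            beta = beta_new
            var = mu_a * (1 - mu_a)
            omega = var / var.sum()                       # re-computed weights
        return alpha, beta

    # e.g., alpha_hat, beta_hat = pls_logistic(X, y, q=1)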

11.6.3 Illustration
In this section we use the relatively straightforward simulation scenario
of Cook and Zhang (2015a, Sec. 5.1) to support the algorithm of Ta-
ble 11.4. We generated n = 150 observations according to a logistic regression
Y | X ∼ Bernoulli(logit(β TX)), where β = (0.25, 0.25)T and X follows a
TABLE 11.4
A PLS algorithm for predictor reduction in GLMs.

(a) Sample Version

Select               Number of components, q
Initialize           ω_i^(1) = n^{-1}, Ω^(1) = diag(ω_1^(1), . . . , ω_n^(1)),
                     β̂^(1) = estimator of β from a PLS fit of Y on X,
                     α^(1) = ȳ + β̂^(1)T X̄.
For k = 1, 2, . . . , compute
  weighted means        X̄^(k) = Σ_{i=1}^{n} ω_i^(k) X_i
  weight-centered data  X^(k) is n × p with rows (X_i − X̄^(k))^T
  weighted variance     S_{X(Ω)}^(k) = X^(k)T Ω^(k) X^(k)
  pseudo responses      Ẑ_i^(k) = α̂^(k) + β̂^(k)T X_i + [y_i − µ(ϑ̂_i^(k))]/ω_i^(k), i = 1, . . . , n
  sample covariance     S_{X,Ẑ(Ω)}^(k) = Σ_{i=1}^{n} ω_i^(k) [X_i − X̄^(k)][Ẑ_i^(k) − Z̄^(k)]
  call PLS algorithm    Send S_{X(Ω)}^(k) and S_{X,Ẑ(Ω)}^(k) to a PLS algorithm
                        and return the p × q matrix W^(k+1) of PLS weights.
  GLM regression        Fit the GLM regression of Y on W^(k+1)T X, giving
                        coefficient vector ν̂^(k+1) and β̂^(k+1) = W^(k+1) ν̂^(k+1).
End if convergence   If ‖β̂^(k) − β̂^(k+1)‖ ≤ ε,
                     then set β̂_pls = β̂^(k+1) and α̂_pls = arg max_α C_n(α, β̂_pls)
Otherwise
  estimate α̂           α̂^(k+1) = arg max_α C_n(α, β̂^(k+1))
  re-compute weights    ω_i^(k+1) = ω_i(ϑ̂^(k+1)), i = 1, . . . , n

(b) Notes

Weights              All ω weights are sample versions, and are distinct from
                     those that occur in the PLS algorithm called during
                     iteration.
n < p                The algorithm does not require that S_{X(Ω)}^(k) be non-
                     singular.
Components           The call to a PLS algorithm requires that the number of
                     components q be specified. This should be the same for
                     all calls. q can be estimated by using predictive cross
                     validation appropriate for the distribution of Y | X.

bivariate normal distribution with mean 0 and variance

ΣX = (10/kβk2 )ββ T + 0.1β0 β0T ,

where β0 is a 2×1 vector of length 1 that is orthogonal to β. It follows that the


ΣX -envelope of span(β) is span(β) itself, since β/kβk is an eigenvector of ΣX .
The correlation between the two predictors is about 0.98, so estimation of β

[Figure 11.1 appears here: estimated density plots of the ENV, GLM, and PLS estimators of β_1 in three panels — a. Linear envelope start; b. Linear PLS start; c. True value start.]

FIGURE 11.1
PLS-GLM data: Estimates of the densities of the estimators of the first com-
ponent β1 in the simulation to illustrate the PLS algorithm for GLMs in
Table 11.4. Linear envelope and PLS starts refer to starting iteration at the
envelope and PLS estimators from a fit of the linear model, ignoring the GLM
structure.

may be a challenge for standard likelihood-based methods. We repeated this


simulation 500 times and for each dataset we estimated β using the standard
GLM estimator, the envelope estimator of Cook and Zhang (2015a) and the
PLS estimator from Table 11.4, the latter estimators using q = 1. To get some
intuition on the importance of starting values, where necessary we ignored
the logistic structure and started iterations using the estimators from (a) the

enveloped linear regressions of Y on X, (b) the NIPALS linear regression of Y


on X and (c) the true value of β. Starting values (a) and (b) were as discussed
in Chapters 2 and 3.
The densities, estimated from the 500 simulations, of the estimators of
the first component β1 of β are given in Figure 11.1. Several observations are
noteworthy. Regarding starting value, the results in Figures 11.1a,b are very
close, suggesting that there is little difference between the envelope and PLS
starting values based on the linear model. From these starting values, PLS
GLM from Table 11.4 does much better than the standard GLM estimator
and the envelope GLM estimator of Cook and Zhang (2015a). The means for
PLS GLM (0.26) and the standard GLM estimator (0.23) are close to the
true value (0.25), while the mean for the envelope GLM estimates (0.07) of
Cook and Zhang (2015a) is not. The apparent bias of the estimator may be
due to the inherent multimodal nature of envelope objective functions, which
tends to be less important as the sample size increases. From Figure 11.1c,
we see that, when starting at the true value, the performance of the envelope
estimator improves greatly.
In short, the envelope GLM estimator seems sensitive to starting values,
while the PLS GLM estimator is relatively stable. The envelope GLM esti-
mator has the potential to do much better than the PLS GLM estimator, but
achieving that potential seems problematic. The PLS GLM estimator, on the
other hand, is relatively stable and easily beats the standard GLM estimator.
A Proofs of Selected Results

A.1 Proofs for Chapter 1


A.1.1 Justification for (1.7)
β̂_ols = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X_0^T Y = S_X^{-1} S_{Y,X}.

Let Ŷ_i = Ȳ + β̂_ols X_i and r_i = Y_i − Ŷ_i denote the i-th vectors of fitted values and
residuals, i = 1, . . . , n, and let D = β − β̂_ols. Then after substituting Ȳ for α,
the remaining log likelihood L(β, Σ_{Y|X}) to be maximized can be expressed as

(2/n)L(β, Σ_{Y|X}) = c − log|Σ_{Y|X}| − n^{-1} Σ_{i=1}^{n} (Y_i − Ȳ − βX_i)^T Σ_{Y|X}^{-1} (Y_i − Ȳ − βX_i)
                  = c − log|Σ_{Y|X}| − n^{-1} Σ_{i=1}^{n} (r_i − DX_i)^T Σ_{Y|X}^{-1} (r_i − DX_i)
                  = c − log|Σ_{Y|X}| − n^{-1} tr( Σ_{i=1}^{n} r_i r_i^T Σ_{Y|X}^{-1} ) − n^{-1} tr( D Σ_{i=1}^{n} X_i X_i^T D^T Σ_{Y|X}^{-1} )
                  = c − log|Σ_{Y|X}| − n^{-1} tr( Σ_{i=1}^{n} r_i r_i^T Σ_{Y|X}^{-1} ) − tr( D S_X D^T Σ_{Y|X}^{-1} ),

where c = −r log(2π) and the penultimate step follows because Σ_{i=1}^{n} r_i X_i^T = 0.
Consequently, L(β, Σ_{Y|X}) is maximized over β by setting β = β̂_ols so D = 0,




leaving the partially maximized log likelihood

(2/n)L(Σ_{Y|X}) = −r log(2π) − log|Σ_{Y|X}| − n^{-1} tr( Σ_{i=1}^{n} r_i r_i^T Σ_{Y|X}^{-1} ).

It follows that the maximum likelihood estimator of Σ_{Y|X} is S_{Y|X} :=
n^{-1} Σ_{i=1}^{n} r_i r_i^T and that the fully maximized log likelihood is

L̂ = −(nr/2) log(2π) − nr/2 − (n/2) log|S_{Y|X}|.

A.1.2 Lemma 1.1


Restatement. Let R be a u dimensional subspace of Rr and let M ∈ Rr×r .
Then R is an invariant subspace of M if and only if, for any A ∈ Rr×s with
span(A) = R, there exists a B ∈ Rs×s such that M A = AB.

Proof. Suppose there is a B that satisfies M A = AB. For every v ∈ R there


is a t ∈ Rs so that v = At. Consequently, M v = M At = ABt ∈ R, which
implies that R is an invariant subspace of M .
Suppose that R is an invariant subspace of M , and let aj , j = 1, . . . , s
denote the columns of A. Then M aj ∈ R, j = 1, . . . , s. Consequently,
span(M A) ⊆ R, which implies there is a B ∈ Rs×s such that M A = AB. 2

A.1.3 Proposition 1.2


Restatement. R reduces M ∈ Rr×r if and only if M can be written as

M = PR M PR + QR M QR . (A.1)

Proof. Assume that M can be written as in (A.1). Then for any v ∈ R,


M v ∈ R, and for v ∈ R⊥ , M v ∈ R⊥ . Consequently, R reduces M .
Next, assume that R reduces M . We must show that M satisfies (A.1).
Let u = dim(R). It follows from Lemma 1.1 that there is a B ∈ Ru×u
that satisfies M A = AB, where A ∈ Rr×u and span(A) = R. This implies
QR M A = 0 which is equivalent to QR M PR = 0. By the same logic applied
to R⊥ , PR M QR = 0. Consequently,

M = (PR + QR )M (PR + QR ) = PR M PR + QR M QR .

2

With this we have the following alternate definition of a reducing subspace

Definition A.1. A subspace R ⊆ Rr is said to be a reducing subspace of


the real symmetric r × r matrix M if R decomposes M as M = PR M PR +
QR M QR . If R is a reducing subspace of M , we say that R reduces M .

A.1.4 Corollary 1.1


Restatement. Let R reduce M ∈ Rr×r , let A ∈ Rr×u be a semi-orthogonal
basis matrix for R, and let A0 be a semi-orthogonal basis matrix for R⊥ . Then

1. M and PR , and M and QR commute.

2. R ⊆ span(M ) if and only if AT M A is full rank.

3. |M | = |AT M A| × |AT0 M A0 |.

4. If M is full rank then

   M^{-1} = A(A^T M A)^{-1} A^T + A_0(A_0^T M A_0)^{-1} A_0^T   (A.2)
         = P_R M^{-1} P_R + Q_R M^{-1} Q_R.   (A.3)

5. If R ⊆ span(M ) then

M † = A(AT M A)−1 AT + A0 (AT0 M A0 )† AT0 .

Proof. The first conclusion follows immediately from Proposition 1.2.


To show the second conclusion, first assume that AT M A is full rank. Then,
from Lemma 1.1, B must be full rank in the representation M A = AB. Conse-
quently, any vector in R can be written as a linear combination of the columns
of M and thus R ⊆ span(M ). Next, assume that R ⊆ span(M ). Then there is
a full rank matrix V ∈ Rr×u such that M V = A and thus that AT M V = Iu .
Substituting M from Proposition 1.2, we have (AT M A)(AT V ) = Iu . It follows
that AT M A is of full rank.
To demonstrate the third conclusion, write PR = AAT , QR = A0 AT0 and

M = AAT M AAT + A0 AT0 M A0 AT0 (A.4)


!
T
A MA 0
= (A, A0 ) (A, A0 )T .
0 AT0 M A0

The conclusion follows since (A, A0 ) is an orthogonal matrix and thus has
determinant 1.

For the fourth conclusion, since M is full rank, R ⊆ span(M) and
R^⊥ ⊆ span(M). Consequently, both A^T M A and A_0^T M A_0 are full rank. Thus
R ⊆ span(M ). Consequently, both AT M A and AT0 M A0 are full rank. Thus
both addends on the right hand side of (A.2) are defined. Multiplying (A.4)
and the right hand side of (A.2) completes the proof of (A.2).
We use the first conclusion to prove (A.3). Since M and P_R commute,
M^{-1} and P_R must also commute. Thus, P_R M^{-1} P_R = M^{-1} P_R. Similarly,
QR M −1 QR = M −1 QR , which gives

PR M −1 PR + QR M −1 QR = M −1 PR + M −1 QR = M −1 .

The final conclusion follows similarly: Since R ⊆ span(M ), AT M A is full


rank. The conclusion follows by checking the conditions for the Moore-Penrose
inverse. 2

A.1.5 Lemma 1.3


Restatement. Suppose that M ∈ Sr×r is positive definite and that the
column-partitioned matrix (A, A0 ) ∈ Rr×r is orthogonal with A ∈ Rr×u . Then

I. |AT0 M A0 | = |M | × |AT M −1 A|

II. log |AT M A| + log |AT0 M A0 | ≥ log |M |

III. log |AT M A| + log |AT M −1 A| ≥ 0.

Proof. Part I. Define the r × r matrix


K = ( I_u    A^T M A_0
      0      A_0^T M A_0 ).

Since (A, A0 ) is an orthogonal matrix and |K| = |AT0 M A0 |,

|AT0 M A0 | = |(A, A0 )K(A, A0 )T | = |AAT + AAT M A0 AT0 + A0 AT0 M A0 AT0 |


= |AAT + (AAT + A0 AT0 )M A0 AT0 |
= |AAT + M A0 AT0 |
= |M − (M − Ip )AAT | = |M ||Iu − AT (Ir − M −1 )A|
= |M ||AT M −1 A|.

Part II. Let O = (A, A_0). Then

|M| = |O^T M O| = | A^T M A     A^T M A_0   |
                  | A_0^T M A   A_0^T M A_0 |
    = |A^T M A| × |A_0^T M A_0 − A_0^T M A (A^T M A)^{-1} A^T M A_0|
    ≤ |A^T M A| × |A_0^T M A_0|.

Part III: This conclusion follows straightforwardly by combining the results


of parts I and II.
2
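For readers who like numerical reassurance, the determinant identities of Lemma 1.3 can be checked directly; the Python sketch below uses an arbitrary positive definite M and an arbitrary orthogonal split (A, A_0) of our choosing.

    import numpy as np

    rng = np.random.default_rng(8)
    r, u = 7, 3
    M = rng.normal(size=(r, r)); M = M @ M.T + r * np.eye(r)   # M > 0
    Q, _ = np.linalg.qr(rng.normal(size=(r, r)))
    A, A0 = Q[:, :u], Q[:, u:]                                 # (A, A0) orthogonal

    # Part I: |A0' M A0| = |M| |A' M^{-1} A|
    lhs = np.linalg.det(A0.T @ M @ A0)
    rhs = np.linalg.det(M) * np.linalg.det(A.T @ np.linalg.inv(M) @ A)
    print(np.isclose(lhs, rhs))

    # Part II: log|A' M A| + log|A0' M A0| >= log|M|
    lhs2 = np.linalg.slogdet(A.T @ M @ A)[1] + np.linalg.slogdet(A0.T @ M @ A0)[1]
    print(lhs2 >= np.linalg.slogdet(M)[1] - 1e-10)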

A.1.6 Proposition 1.3


Restatement. If R reduces M ∈ Sr×r and R ⊆ span(M ) then PR(M ) = PR .

Proof. Since R reduces M we have by Proposition 1.2

M = PR M PR + QR M QR
= R(RT M R)RT + QR M QR
:= RΩRT + QR M QR .

Thus, RT M = ΩRT . From conclusion 2 of Corollary 1.1, (RT M R) is full rank,


and consequently we have

PR(M ) = R(RT M R)−1 RT M = RΩ−1 ΩRT = PR .

A.1.7 Proposition 1.4


Restatement. The intersection of any two reducing subspaces of M ∈ Rr×r
is also a reducing subspace of M .

Proof. Let R1 and R2 be reducing subspaces of M . Then by definition


M R1 ⊆ R1 and M R2 ⊆ R2 . Clearly, if v ∈ R1 ∩ R2 then M v ∈ R1 ∩ R2 ,
and it follows that the intersection is an invariant subspace of M . The same
argument shows that if v ∈ (R1 ∩ R2 )⊥ = R⊥ ⊥ ⊥
1 + R2 then M v ∈ R1 + R2 :

340 Proofs of Selected Results

If v ∈ R⊥ ⊥ ⊥
1 + R2 then it can be written as v = v1 + v2 , where v ∈ R1 and
v ∈ R⊥ ⊥
2 . Then M v = M v1 + M v2 ∈ R1 + R2 .

A.1.8 Lemma 1.4


Restatement. EM1 (S1 ) ⊕ EM2 (S2 ) = EM1 ⊕M2 (S1 ⊕ S2 ).
Proof. From Proposition 1.9, we have E_{M_i}(S_i) = Σ_{j=1}^{q_i} P_{ij} S_i, i = 1, 2, where
P_{ij} is the projection onto the j-th eigenspace of M_i. The eigen-projections of
M_1 ⊕ M_2 are of the forms P_{1j} ⊕ 0_{p_2}, j = 1, . . . , q_1, and 0_{p_1} ⊕ P_{2k}, k = 1, . . . , q_2.
Therefore, applying Proposition 1.9 again, we have

E_M(S) = { Σ_{j=1}^{q_1} [(P_{1j} ⊕ 0)(S_1 ⊕ S_2)] } + { Σ_{k=1}^{q_2} [(0 ⊕ P_{2k})(S_1 ⊕ S_2)] }
       = {E_{M_1}(S_1) ⊕ 0} + {0 ⊕ E_{M_2}(S_2)}
       = E_{M_1}(S_1) ⊕ E_{M_2}(S_2).

A.1.9 Proposition 1.8


Restatement. Let ∆ ∈ Sr×r be a positive definite matrix and let S be a
u-dimensional subspace of Rr . Let G ∈ Rr×u be a semi-orthogonal basis ma-
trix for S and let V ∈ Su×u be positive semi-definite. Define Ψ = ∆ + GV GT .
Then ∆−1 S = Ψ−1 S and

E∆ (S) = EΨ (S) = E∆ (∆−1 S) = EΨ (Ψ−1 S) = EΨ (∆−1 S) = E∆ (Ψ−1 S).

Proof. Using a variant of the Woodbury identity for matrix inverses we have

Ψ−1 = ∆−1 − ∆−1 G(V −1 + GT ∆−1 G)−1 GT ∆−1 ,


∆−1 = Ψ−1 − Ψ−1 G(−V −1 + GT Ψ−1 G)−1 GT Ψ−1 .

Multiplying both equations on the right by G, the first implies span(Ψ−1 G) ⊆


span(∆−1 G); the second implies span(∆−1 G) ⊆ span(Ψ−1 G). Hence
Ψ−1 S = ∆−1 S. From this we have also that EΨ (Ψ−1 S) = EΨ (∆−1 S) and
E∆ (Ψ−1 S) = E∆ (∆−1 S).

We next show that E∆ (S) = EΨ (S) by demonstrating that R ⊆ Rr is a


reducing subspace of ∆ that contains S if and only if it is a reducing subspace
of Ψ that contains S. Suppose R is a reducing subspace of ∆ that contains
S. Let α ∈ R. Then Ψα = ∆α + GV GT α. ∆α ∈ R because R reduces ∆;
the second term on the right is a vector in R because S ⊆ R. Thus, R is a
reducing subspace of Ψ and by construction it contains S. Next, suppose R
is a reducing subspace of Ψ that contains S. The reverse implication follows
similarly by reasoning in terms of ∆α = Ψα − GV GT α. We have Ψα ∈ R
because R reduces Ψ; the second term on the right is a vector in R because
S ⊆ R. The remaining equalities follow immediately from Proposition 1.6.
2

A.2 Proofs for Chapter 2


A.2.1 Lemma 2.1
Restatement. Let S ⊆ Rp . Then S reduces ΣX if and only if
cov(PS X, QS X) = 0.

Proof. Since PS + QS = Ip ,

Σ_X = (P_S + Q_S) Σ_X (P_S + Q_S)
    = P_S Σ_X P_S + Q_S Σ_X Q_S + P_S Σ_X Q_S + (P_S Σ_X Q_S)^T.

The conclusion now follows from Proposition 1.2 which implies that S reduces
ΣX if and only if
ΣX = PS ΣX PS + QS ΣX QS .
2

A.3 Proofs for Chapter 3


A.3.1 Lemma 3.1
Restatement. Following the notation from Table 3.1(a), for the sample
version of the NIPALS algorithm

WdT Wd = Id , d = 1, . . . , q.

Proof. Since the columns of Wd are all eigenvectors with length one, the diag-
onal elements of WdT Wd must all be 1. We show orthogonality by induction.
For d = 2,

w_2 = ℓ_1(X_2^T Y_2 Y_2^T X_2)
    = ℓ_1(Q^T_{w_1(S_{X_1})} X_1^T Y_2 Y_2^T X_2),

where the second step follows by substituting (3.1) for X2 . It follows that
w1T w2 = 0.
For d = 3, substituting (3.1) twice,

w_3 = ℓ_1(X_3^T Y_3 Y_3^T X_3)
    = ℓ_1(Q^T_{w_2(S_{X_2})} X_2^T Y_3 Y_3^T X_3)
    = ℓ_1(Q^T_{w_2(S_{X_2})} Q^T_{w_1(S_{X_1})} X_1^T Y_3 Y_3^T X_3).

Clearly, w2T w3 = 0. For w1 ,

Qw2 (SX2 ) w1 = w1 − w2 (w2T SX2 w2 )−1 w2T SX2 w1

But it follows from (3.4) that SX2 w1 = 0 and consequently Qw2 (SX2 ) w1 = w1
and it follows that w1T w3 = 0. The rest of the justification follows straightfor-
wardly by induction and is omitted.

A.3.2 Lemma 3.2


Restatement.
XTd+1 sj = XTd+1 Xj wj = 0

for d = 1, . . . , q − 1, j = 1, . . . , d.

Proof. The proof is by induction. For d = 1, substituting from Table 3.1 or


using (3.2),

X2 = X1 − s1 l1T
= X1 − X1 w1 (w1T XT1 X1 w1 )−1 w1T XT1 X1
= QX1 w1 X1 .

Clearly,
XT2 s1 = XT2 X1 w1 = XT1 QX1 w1 X1 w1 = 0

so the conclusion holds for d = 1.



For d = 2 and j = 2, XT3 X2 w2 = XT2 QX2 w2 X2 w2 = 0, where the first


equality comes from substituting from (3.2). For d = 2 and j = 1,

XT3 X1 w1 = XT1 QX1 w1 QX2 w2 X1 w1


= XT1 QX1 w1 (I − PX2 w2 )X1 w1
= XT1 QX1 w1 X1 w1
= 0,

where the third equality holds because XT2 X1 w1 = 0.


Under the induction hypothesis, XTd Xj wj = 0 for j = 1, . . . , d−1. To prove
the lemma we must show that XTd+1 Xj wj = 0 for j = 1, . . . , d. Again using
(3.2)

XTd+1 Xj wj = XTd QXd wd Xj wj


= XTd (I − PXd wd )Xj wj .

By the induction hypothesis, XTd Xj wj = 0 for j = 1, . . . , d − 1 and so we have


the desired conclusion for j = 1, . . . , d − 1: XTd+1 Xj wj = 0 for j = 1, . . . , d − 1.
For j = d,
XTd+1 Xd wd = XTd (I − PXd wd )Xd wd = 0,
and the conclusion follows. 2

A.3.3 Lemma 3.3


Restatement. For d = 1, . . . , q,

Xd+1 = (I − Psd − Psd−1 − · · · − Ps2 − Ps1 )X1


= (I − PSd )X1 = QSd X1 (A.5)
Yd+1 = (I − Psd − Psd−1 − · · · − Ps2 − Ps1 )Y1
= (I − PSd )Y1 = QSd Y1 (A.6)
SXd ,Yd = SXd ,Y1
ld = XT1 sd /ksd k2
md = YT1 sd /ksd k2
wd+1 = `1 (XT1 QSd Y1 YT1 QSd X1 ) (A.7)
Sd = X1 W d
Y
b npls = X1 βbnpls = PSq Y1 .

Proof. The representations for Xd+1 and Yd+1 follow straightforwardly from
Lemma 3.2, (3.6) and the definition of sd = Xd wd : Since, by Lemma 3.2,
Qsd+1 Qsj = I − Psd+1 − Psj , the conclusion follows by using (3.6).
For the fitted values we take β̂_npls from Table 3.1 to get

Ŷ_npls = X_1 β̂_npls = X_1 W_q (L_q^T W_q)^{-1} M_q^T.

The covariance matrix identity is a consequence of (A.5) and (A.6):

SXd ,Yd = n−1 XTd Yd = n−1 XT1 QSd−1 QSd−1 Y1 = n−1 XT1 QSd−1 Y1 = SXd ,Y1 .

The representation of wd+1 follows similarly.


To show that Sd = X1 Wd , consider a typical diagonal element sk+1 of Sd ,

sk+1 = Xk+1 wk+1


= QSk X1 wk+1 .

It follows from (A.7) that X1 wk+1 must fall into the orthogonal com-
plement of the subspace spanned by the columns of Sk . Consequently,
sk+1 = QSk X1 wk+1 = X1 wk+1 . Thus, Sd = X1 Wd .
Let Dd = diag(ks1 k2 , . . . , ksd k2 ). As defined in Table 3.1, ld = XTd sd /ksd k2 .
Substituting the form of Xd from (A.5), we get

ld = XTd sd /ksd k2 = XT1 QSd−1 sd /ksd k2 = XT1 sd /ksd k2 ,

where the second equality follows from the first consequence of Lemma 3.2.
This implies that Ld = XT1 Sd Dd−1 . Similarly, Md = YT1 Sd Dd−1 . Substituting
these into Y
b npls and using the fact that Sq = X1 Wq we get

Y
b npls = X1 Wq (LTq Wq )−1 MqT
= X1 Wq (Dq−1 SqT X1 Wq )−1 Dq−1 SqT Y1
= Sq (SqT Sq )−1 SqT Y1 = PSq Y1 .

A.3.4 Lemma 3.4


Recall that V denotes the p × p1 , p1 ≤ p, of eigenvectors of SX with non-zero
eigenvalues, and that XT1 = V ZT1 , where ZT1 is an p1 × n matrix that contains

the coordinates of XT1 in terms of the eigenvectors of SX . Recall also that wd∗
and s∗d denote the weights and scores that result from applying NIPALS to
data (Z1 , Y1 ).
Lemma 3.4 states that, for d = 1, . . . , q, (a) sd = s∗d and (b) wd = V wd∗ .
We see from the form of wd that, for d = 1, . . . , q, wd ∈ span(SX ),

wd+1 = `1 (XT1 QSd Y1 YT1 QSd X1 )


= `1 (V ZT1 QSd Y1 YT1 QSd Z1 V T )
= V `1 (ZT1 QSd Y1 YT1 QSd Z1 ).

If conclusion (a) holds then Sd = Sd∗ and we have

wd+1 = V `1 (ZT1 QSd∗ Y1 YT1 QSd∗ Z1 )


= V wd∗ ,

and conclusion (b) follows.


We next demonstrate (a) by induction. For d = 1,

s1 = X1 `1 (XT1 Y1 YT1 X1 )
= Z1 V T `1 (V ZT1 Y1 YT1 Z1 V T )
= Z1 `1 (ZT1 Y1 YT1 Z1 )
= s∗1 .

For d = 2

s2 = X1 `1 (XT1 Qs1 Y1 YT1 Qs1 X1 )


= Z1 V T `1 (V ZT1 Qs∗1 Y1 YT1 Qs∗1 Z1 V T )
= Z1 `1 (ZT1 Qs∗1 Y1 YT1 Qs∗1 Z1 )
= s∗2 .

Under the induction hypothesis, assume that the conclusion holds for d < q.
Then we have for the next term in the sequence

sd+1 = X1 `1 (XT1 QSd Y1 YT1 QSd X1 )


= Z1 V T `1 (V ZT1 QSd∗ Y1 YT1 QSd∗ Z1 V T )
= Z1 `1 (ZT1 QSd∗ Y1 YT1 QSd∗ Z1 )
= s∗d+1 ,

where the second equality follows from the induction hypothesis.



A.3.5 Lemma 3.5


Restatement. Let V denote a p × c matrix of rank c, let v denote a p × 1
vector that is not contained in span(V ), let Σ denote a p × p positive definite
matrix and let ∆ = QTV (Σ) ΣQV (Σ) . Then

(a) ∆ = QTV (Σ) Σ = ΣQV (Σ)

(b) P(V,v)(Σ) = PV (Σ) + PQV (Σ) v(Σ)

(c) QV (Σ) Qv(∆) = Q(V,v)(Σ)

(d) QTv(∆) ∆Qv(∆) = QT(V,v)(Σ) ΣQ(V,v)(Σ) .

Proof. Part (a) follows by straightforward algebra and its proof is omitted.
For part (b) we have

P_{(V,v)(Σ)} = Σ^{-1/2} P_{(Σ^{1/2}V, Σ^{1/2}v)} Σ^{1/2}
            = Σ^{-1/2} P_{Σ^{1/2}V} Σ^{1/2} + Σ^{-1/2} P_{Q_{Σ^{1/2}V} Σ^{1/2}v} Σ^{1/2}
            = P_{V(Σ)} + Σ^{-1/2} P_{Q_{Σ^{1/2}V} Σ^{1/2}v} Σ^{1/2}.

Next, consider the second addend on the right hand side:

Q_{Σ^{1/2}V} Σ^{1/2} = Σ^{1/2} Q_{V(Σ)}
P_{Q_{Σ^{1/2}V} Σ^{1/2}v} = Σ^{1/2} Q_{V(Σ)} v {v^T Q^T_{V(Σ)} Σ Q_{V(Σ)} v}^{-1} v^T Q^T_{V(Σ)} Σ^{1/2}
Σ^{-1/2} P_{Q_{Σ^{1/2}V} Σ^{1/2}v} Σ^{1/2} = Q_{V(Σ)} v {v^T Q^T_{V(Σ)} Σ Q_{V(Σ)} v}^{-1} v^T Q^T_{V(Σ)} Σ
                                         = P_{Q_{V(Σ)}v(Σ)}.   (A.8)

This establishes part (b).


To show part (c), first write

Qv(∆) = I − v(v T ∆v)−1 v T ∆


= I − v(v T QTV (Σ) ΣQV (Σ) v)−1 v T QTV (Σ) ΣQV (Σ)
= I − v(v T QTV (Σ) ΣQV (Σ) v)−1 v T QTV (Σ) Σ,

where the second equality follows by substituting for ∆ and the third follows
from part (a). Next, multiplying on the left by QV (Σ) and using (A.8) we have

QV (Σ) Qv(∆) = QV (Σ) − QV (Σ) v(v T QTV (Σ) ΣQV (Σ) v)−1 v T QTV (Σ) Σ
= QV (Σ) − PQV (Σ) v(Σ)
= I − PV (Σ) − PQV (Σ) v(Σ) .

From part (b) we then have the desired result:

QV (Σ) Qv(∆) = I − P(V,v)(Σ) = Q(V,v)(Σ) .

To show part (d) we begin by substituting for the middle ∆ and then using
part (c):

QTv(∆) ∆Qv(∆) = QTv(∆) QTV (Σ) ΣQV (Σ) Qv(∆)


= QT(V,v)(Σ) ΣQ(V,v)(Σ) .

A.3.6 Lemma 3.6


Restatement. Let Σ ∈ Sp be positive definite and let Bj be a subspace of
Rp , j = 1, 2. Then

EΣ (B1 + B2 ) = EΣ (B1 ) + EΣ (B2 ).

Proof.
B1 + B2 ⊆ EΣ (B1 ) + EΣ (B2 ) ⊆ EΣ (B1 + B2 ). (A.9)
The first containment seems clear, as Bj ⊆ EΣ (Bj ), j = 1, 2. For the second
containment, EΣ (B1 ) ⊆ EΣ (B1 + B2 ) and EΣ (B2 ) ⊆ EΣ (B1 + B2 ). The second
containment follows since EΣ (B1 + B2 ) is a subspace.
We next wish to show that EΣ (B1 ) + EΣ (B2 ) reduces Σ. The envelopes
EΣ (B1 ) and EΣ (B2 ) both reduce Σ and so by definition,

ΣEΣ (Bj ) ⊆ EΣ (Bj ), j = 1, 2.

In consequence

ΣEΣ (B1 ) + ΣEΣ (B2 ) = Σ(EΣ (B1 ) + EΣ (B2 ))


⊆ EΣ (B1 ) + EΣ (B2 ).

It follows that EΣ (B1 ) + EΣ (B2 ) is an invariant subspace of Σ. That it is a


reducing subspace follows from the symmetry of Σ.
In short, EΣ (B1 ) + EΣ (B2 ) is a reducing subspace of Σ that, by (A.9)
contains B1 + B2 . But EΣ (B1 + B2 ) is by construction the smallest reducing
subspace of Σ that contains B1 + B2 . Thus,

EΣ (B1 + B2 ) ⊆ EΣ (B1 ) + EΣ (B2 ).

Together with (A.9) we have the desired conclusion. 2



A.4 Proofs for Chapter 4


A.4.1 Proposition 4.1
The proof in this section was adapted from the unpublished notes of Cook,
Helland, and Su (2013).

Restatement. Assume the regression structure given in (4.1) and (4.7) with
p fixed. Then

(i) The PLS estimator βbpls of β has the expansion


n
√ 1 Xn o 1
n(βbpls −β) = √ (Xi − µX )i + QΦ (Xi − µX )(Xi − µX )T β +Op ( √ ),
δ n i=1 n

where  is the error for model (4.1).



(ii) √n(β̂_pls − β) is asymptotically normal with mean 0 and variance

avar(√n β̂_pls) = δ^{-2}{Σ_X σ^2_{Y|X} + var(Q_Φ(X − µ_X)(X − µ_X)^T β)}.

(iii) If, in addition, PΦ X is independent of QΦ X then



avar(√n β̂_pls) = δ^{-1} σ^2_{Y|X} P_Φ + δ^{-2} σ^2_Y Φ_0 ∆_0 Φ_0^T.

Proof. For notational convenience in this proof, we let σ = σX,Y , Σ = ΣX ,


kσk2 = σ T σ and recall that Pσ is the projection onto span(σ). Without loss of
generality we prove the proposition using the centered variables x = X − µX
and y = Y − µY . Then

βbpls = σ σT σ
b(b b)(b b)−1 .
σ T SX σ

We first expand σ σT σ
b(b b) and σbT SX σ
b. For this, we need the following expan-
sions (see Cook and Setodji, 2003)
n
√ 1
X 1
σ − σ) =
n(b n− 2 (xi yi − σ) + Op (n− 2 ),
i=1
n
√ − 12
X 1
n(SX − Σ) = n (xi xTi − Σ) + Op (n− 2 ).
i=1

Step I: Expand σ̂‖σ̂‖^2.

σ̂(σ̂^T σ̂) = (σ̂ − σ + σ)^T(σ̂ − σ + σ)(σ̂ − σ + σ)
          = (σ̂ − σ)‖σ‖^2 + σ(σ̂ − σ)^T σ + σσ^T(σ̂ − σ) + σ‖σ‖^2 + O_p(n^{-1}),

so

√n(σ̂‖σ̂‖^2 − σ‖σ‖^2) = √n{(σ̂ − σ)‖σ‖^2 + σσ^T(σ̂ − σ) + σσ^T(σ̂ − σ)} + O_p(n^{-1/2}),   (A.10)
                      = √n(σ̂ − σ)‖σ‖^2 + 2√n σσ^T(σ̂ − σ) + O_p(n^{-1/2}),
                      = ‖σ‖^2 √n{(σ̂ − σ) + 2P_σ(σ̂ − σ)} + O_p(n^{-1/2}),
                      = ‖σ‖^2 {I_p + 2P_σ} √n(σ̂ − σ) + O_p(n^{-1/2}),   (A.11)
                      = ‖σ‖^2 {I_p + 2P_σ} n^{-1/2} Σ_{i=1}^{n} (x_i y_i − σ) + O_p(n^{-1/2}).
T −1
Step II. Expand (b b) .
σ SX σ
√ √
σ T SX σ
n(b b − σ T Σσ) = n{(bσ − σ + σ)T (SX − Σ + Σ)(b
σ − σ + σ) − σ T Σσ}
√ √ √
= n(bσ − σ)T Σσ + nσ T (SX − Σ)σ + nσ T Σ(b σ − σ)
1
+ Op (n− 2 ),
√ √ 1
σ − σ) + Op (n− 2 ).
= nσ T (SX − Σ)σ + 2 nσ T Σ(b
b A ∈ Rq×q
Next, we derive a general result for inverse expansions. Let A,
√ b
with A nonsingular. Assume that n(A − A) converges in distribution at rate
√ b−1 = A−1 + Op (n− 12 ). Then
n and that A

n(A
bAb−1 − I) = 0 ⇒

b − A + A)(A
n{(A b−1 − A−1 + A−1 ) − I} = 0 ⇒

n{(Ab − A)(Ab−1 − A−1 ) + (A
b − A)A−1 + A(A b−1 − A−1 ) = 0.
√ b − A)(Ab−1 − A−1 ) = Op (n− 12 ), we have
Since n{(A

n{(Ab − A)A−1 + A(A b−1 − A−1 )} = Op (n− 12 ) ⇒
√ √
b−1 − A−1 ) = − nA−1 (A
n(A b − A)A−1 + Op (n− 12 ).

Selecting Ab=σ bT SX σb, we get


√ √
σ T SX σ
n{(b b − A) + Op (n− 12 )
b)−1 − (σ T Σσ)−1 } = − n(σ T Σσ)−2 (A
√ √ 1
= − n(σ T Σσ)−2 σ T (SX − Σ)σ − 2 n(σ T Σσ)−2 σ T Σ(bσ − σ) + Op (n− 2 ).
(A.12)
350 Proofs of Selected Results

Step III. Combining the previous results.


√ √
n(βbpls − β) = σ k2 (b
σ kb
n{b b)−1 − σkσk2 (σ T Σσ)−1 }
σ T SX σ

= n{(b σ k2 − σkσk2 )(σ T Σσ)−1 }
σ kb
√ 1
+ nσkσk2 {(b b)−1 − (σ T Σσ)−1 } + Op (n− 2 )
σ T SX σ

= (σ T Σσ)−1 kσk2 {Ip + 2Pσ } n(b σ − σ) (From (A.11))
2 T −2
√ T
− σkσk (σ Σσ) nσ (SX − Σ)σ (From (A.12))
√ 1
− 2σkσk2 (σ T Σσ)−2 nσ T Σ(b σ − σ) + Op (n− 2 )

= (σ T Σσ)−1 kσk2 (Ip + 2Pσ ) − 2kσk4 (σ T Σσ)−2 Pσ Σ

σ − σ)
n(b
2 T −2
√ T − 12
− σkσk (σ Σσ) nσ (SX − Σ)σ + Op (n ).

Letting Φ = σ/kσk, we have


√ √
n(βbpls − β) = {(ΦT ΣΦ)−1 (Ip + 2PΦ ) − 2(ΦT ΣΦ)−2 PΦ Σ} n(b
σ − σ)
T −2
√ − 12
−(Φ ΣΦ) PΦ n(SX − Σ)σ + Op (n ).

b − σ and SX − Σ.
Step IV. Substitute the expansions for σ
n
√ 1
X
n(βbpls − β) = n− 2 (ΦT ΣΦ)−1 [{Ip + 2PΦ − 2(ΦT ΣΦ)−1 PΦ Σ}(xi yi − σ)
i=1
1
−1
− (Φ ΣΦ) T
PΦ (xi xTi − Σ)σ] + Op (n− 2 ).

Since q = 1, PΦ Σ = Φ∆ΦT = δPΦ , and

(ΦT ΣΦ)−1 = δ −1 ⇒ (ΦT ΣΦ)−1 PΦ Σ = PΦ .

Then we have a representation as the sum of independent and identically


distributed terms,
n
√ 1
X 1
n(βbpls − β) = n− 2 δ −1 {(xi yi − σ) − δ −1 PΦ (xi xTi − Σ)σ} + Op (n− 2 ).
i=1

Substituting yi = xTi β + i , and using that

β = Σ−1 σ = Φδ −1 ΦT σ = Φδ −1 kσk = σδ −1
Σσ/δ = σ,

we see that this expression is the same as conclusion (i) in the proposition.
Proofs for Chapter 4 351

We next need to calculate the variance of

R = δ −1 {(xy − σ) − δ −1 PΦ (xxT − Σ)σ}.

Clearly, E(R) = 0. Substituting y = xT β +  = xT σ/δ + ,

R = δ −1 {(x + xxT σδ −1 − σ) − δ −1 PΦ (xxT − Σ)σ},

where  x.
Step V. Study R.

R = δ −1 (x + xxT σδ −1 − PΦ xxT σδ −1 − σ + δ −1 Σσ).

But δ −1 Σσ = PΦ σ = σ. So we get

R = δ −1 (x + xxT σδ −1 − PΦ xxT σδ −1 )


= δ −1 (x + QΦ xxT σδ −1 ) = δ −1 (x + QΦ xxT β).

Since  x, the two terms in R are uncorrelated,

var(R) = δ −2 (var(x) + var(QΦ xxT β)).

Next, var(x) = var(x)var() = ΣσY2 |X , since E(x) = 0, E() = 0. In conse-


quence,  
var(R) = δ −2 ΣσY2 |X + var(QΦ xxT β) ,
which proves conclusion (ii) of the proposition.
To prove conclusion (iii) of the proposition, we first have

var(QΦ xxT σ) = var{Φ0 (ΦT0 x)(xT σ)} = Φ0 var(ΦT0 x · xT σ)ΦT0 .

Assuming that ΦT0 x xT σ, we get

var(QΦ xxT σ) = Φ0 var(ΦT0 x)var(σ T x)ΦT0


= Φ0 (ΦT0 ΣΦ0 )ΦT0 σ T Σσ
= Φ0 ∆0 ΦT0 δkσk2 .

Thus,

var(R) = δ −2 (ΣσY2 |X + Φ0 ∆0 ΦT0 kσk2 δ −1 )


= δ −2 {ΦδΦT σY2 |X + Φ0 ∆0 ΦT0 (σY2 |X + kσk2 δ −1 )}
= δ −1 ΦΦT σY2 |X + Φ0 ∆0 ΦT0 δ −2 (σY2 |X + kσk2 δ −1 )
= δ −1 ΦΦT σY2 |X + Φ0 ∆0 ΦT0 δ −2 σY2 .
352 Proofs of Selected Results

Where the final step follows because

σY2 = β T Σβ + σY2 |X = δ −2 σ T Σσ + σY2 |X = δ −1 kσk2 + σY2 |X .

This then gives conclusion (iii) from Proposition 4.1:

var(R) = δ −1 PΦ σY2 |X + Φ0 ∆0 ΦT0 δ −2 σY2 .

A.4.2 Notes on Corollary 4.1


Restatement.
Assume that the regression of Y ∈ R1 on X ∈ Rp follows model (4.1) with
∆0 = δ0 Ip−1 and that C = (Y, X T )T correspondingly follows a multivariate
normal distribution. Then
√  −1
avar( nβ)b = σ 2 δ −1 ΦΦT + η 2 η 2 δ0 /σ 2 + (δ0 /δ)(1 − δ/δ0 )2 Φ0 ΦT0
Y |X Y |X

avar( nβbpls ) = σY2 |X δ −1 ΦΦT + (σY2 δ0 /δ 2 )Φ0 ΦT0 .
√ b
avar( nβ) follows as a special case of Proposition 2.1 which was proved
by Cook, Forzani, and Rothman (2013). Briefly, substituting the hypotheses
of Proposition 4.1 into the conclusion of Proposition 2.1 we have

b = ΣY |X ⊗ Φ∆−1 ΦT + (η T ⊗ Φ0 )M † (η ⊗ ΦT ),
avar{ nvec(β)} 0

= (σY2 |X /δ)ΦΦT + η 2 Φ0 M † ΦT0 ,

where

M = ηΣ−1 T −1
Y |X η ⊗ ∆0 + ∆ ⊗ ∆0 + ∆
−1
⊗ ∆0 − 2Iq ⊗ Ip−q
= (η 2 δ0 /σY2 |X )Ip−1 + (δ/δ0 )Ip−1 + (δ0 /δ)Ip−1 − 2Ip−1
n o
= (η 2 δ0 /σY2 |X ) + (δ0 /δ)(1 − δ/δ0 )2 Ip−1 .

Thus, we can use ordinary inverses and



b = (σ 2 /δ)ΦΦT + η 2 Φ0 M −1 ΦT = (σ 2 /δ)ΦΦT
avar{ nvec(β)} Y |X 0 Y |X

+ η 2 {(η 2 δ0 /σY2 |X ) + (δ0 /δ)(1 − δ/δ0 )2 }−1 Φ0 ΦT0 .

Dividing η 2 {(η 2 δ0 /σY2 |X ) + (δ0 /δ)(1 − δ/δ0 )2 }−1 , the cost for the envelope es-
b by the corresponding PLS cost (σ 2 δ0 /δ 2 ) from avar{√nvec(βbpls )},
timator β, Y
which comes directly from Cook et al. (2013), and using the relationship
σY2 = η 2 δ + σY2 |X leads to cost ratio given below Corollary 4.1.
Proofs for Chapter 4 353

A.4.3 Form of ΣX under compound symmetry


Under compound symmetry, var(Xj ) = π 2 and cov(Xj , Xk ) = ρ. Let 1p de-
note the p × 1 vector of ones. Then in matrix form

ΣX = π 2 {ρ1p 1Tp + (1 − ρ)Ip }


= π 2 {pρP1p + (1 − ρ)(P1p + Q1p )}
= π 2 {(1 − ρ + pρ)P1p + (1 − ρ)Q1p }.

A.4.4 Theorem 4.1


Restatement. If model (4.1) holds, β T ΣX β  1 and Kj (n, p), j = 1, 2,
converges to 0 as n, p → ∞ then
n o
1/2
DN = Op n−1/2 + K1 (n, p) + K2 (n, p) .

To ease notion, in this section we let σ = σX,Y , and we let

tr(∆0 )
K1 (n, p) =
nkσk2
tr(∆20 )
K2 (n, p) =
nkσk4
tr1/2 (∆30 )
K3 (n, p) = = tr1/2 (∆3σ )/n,
nkσk3

where ∆σ is the signal to noise ratio defined near (4.11).


The following proposition will be used to prove Theorem 4.1 by implica-
tion.

Proposition A.1. If model (4.1) holds, β T ΣX β  1 and Kj (n, p), j = 1, 2, 3,


converges to 0 as n, p → ∞ then

DN = (βbpls − βpls )T ωN
bT σ b)−1 σ
σ T SX σ bT − σ T σ(σ T ΣX σ)−1 σ T ωN

= σ b(b (A.13)
1/2
= Op {n−1/2 + K1 (n, p) + K2 (n, p) + K3 (n, p)}, (A.14)

where ωN = XN − E(X) ∼ N (0, ΣX ) as defined near (4.8), (A.13) is a re-


statement of the definitions and (A.14) is what we need to demonstrate.

This proposition, which is a restatement of Theorem 1 in Cook and


Forzani (2018), contains an extra addend K3 (n, p) that does not appear in
354 Proofs of Selected Results

Theorem 4.1. It turns out that the addend K3 (n, p) is superfluous because
the hypothesis of Theorem 4.1, Kj (n, p) → 0 for j = 1, 2, implies that
K3 (n, p) → 0:
1
K3 (n, p) ≤ (K1 (n, p)K2 (n, p))1/2 ≤ √ (K1 (n, p) + K2 (n, p)) ,
2
which establishes that K3 is at most the order of K1 + K2 . Nevertheless, to
maintain a connection with the literature, we prove Proposition A.1, which
then implies Theorem 4.1.

Proof. The proof of Proposition A.1 is facilitated by establishing key inter-


mediate results, the first of which involves the asymptotic behavior of certain
ratios:

Proposition A.2. Assume the hypothesis of Theorem 4.1. Then

bT SX σ̂
σ
= 1 + Op {n−1/2 + K1 (n, p) + K2 (n, p) + K3 (n, p)}. (A.15)
σ T ΣX σ
bT σ̂
σ
= 1 + Op {n−1/2 + K1 (n, p)}. (A.16)
σT σ
bT σ
σ σT σ
= Op (1). (A.17)
b
σ T
b SX σ σT Σ
b Xσ

Proof of Proposition A.2. Turning to the estimated terms in (A.13) and


b = n−1 XT Y and SX = n−1 XT X ≥ 0, we have
recalling that σ
1  T
bT σ
σ = Y XXT Y
n2
b
1 
(Xβpls + ε)T XXT (Xβpls + ε }

=
n2
1 T 2 T 1
= β W 2 βpls + 2 βpls W XT ε + 2 εT XXT ε (A.18)
n2 pls n n
1 T 2 T 1
bT SX σ
σ = β W 3 βpls + 3 βpls W 2 XT ε + 3 εT XW XT ε, (A.19)
n3 pls
b
n n
where W ∼ Wn−1 (ΣX ). The justification below is intended to give only an
indication of the steps necessary to obtain the result. Additional details are
available from the Supplement to Cook and Forzani (2018).
Since we are going to need the expectation of powers of Wishart marices
to determine the orders of (A.18) and (A.19), we begin with the following
results from Letac and Massam (2004). An online Wishart moment calculator
is available from Forzani, Percı́ncula, and Toledano (2022).
Proofs for Chapter 4 355

Lemma A.1. Let U ∼ Wn (Θ). Then

E(U ) = nΘ
E(U 2 ) = nΘtr(Θ) + n(n + 1)Θ2
E(U 3 ) = nΘtr2 (Θ) + n(n + 1)(Θtr(Θ2 ) + 2Θ2 tr(Θ)) + n(n2 + 3n + 4)Θ3
E(U 5 ) = n5 + 10 n4 + 65 n3 + 160 n2 + 148 n Θ5


+ 4 n4 + 6 n3 + 21 n2 + 20 n (tr Θ) Θ4
 

2
+ 6 n3 + 3 n2 + 4 n (tr Θ)


+3 n4 + 5 n3 + 14 n2 + 12 n (tr Θ2 ) Θ3
 
n
3
+ 4 n2 + n (tr Θ) + 4 2 n3 + 5 n2 + 5 n (tr Θ2 )(tr Θ)
 
o
+2 n4 + 5 n3 + 14 n2 + 12 n (tr Θ3 ) Θ2

n 2
4 2
+ n(tr Θ) + 6 n2 + n (tr Θ2 )(tr Θ) + 2 n3 + 5 n2 + 5 n (tr Θ2 )
 

+4 n3 + 3 n2 + 4 n (tr Θ3 )(tr Θ)

o
+ n4 + 6 n3 + 21 n2 + 20 n (tr Θ4 ) Θ.


The next three lemmas, each with its own Proof section, give ingredients
to establish (A.15)–(A.17) from (A.18) and (A.19).

Lemma A.2. As n, p → ∞ subject to the conditions of Theorem 4.1,

εT XXT ε
 
1 tr(∆0 )
= Op (1) + (A.20)
n2 σ T σ n nσ T σ
εT XW XT ε tr(∆20 ) tr2 (∆0 )
 
1
= Op (1) + + . (A.21)
n3 σ T Σσ n nσ T Σσ n2 σ T Σσ

Proof of Lemma A.2. Since both quantities are positive we need only to
compute their expectations using Lemma A.1 and then employ Markov’s in-
equality. Recall from the preamble to Section 4.4 that ε is the n × 1 vector
with model errors i as elements.

σY2 |X
E(n−2 εT XXT ε) = E(tr(W )) = O(n−1 )tr(ΣX )
n2
σY2 |X tr(Σ2X ) tr2 (ΣX )
 
E(n−3 εT XW XT ε) = E(tr(W 2
)) = O(1) + .
n3 n n2

Therefore, since σ T ΣX σ  (σ T σ)2 , δ  σ T σ and ΣX = δΦΦT + Φ0 ∆0 ΦT , we


356 Proofs of Selected Results

have
εT XXT ε
 
tr(ΣX )
E = O(1)
n2 σ T σ nσ T σ
 
1 tr(∆0 )
= O(1) +
n nσ T σ
 
1
= O(1) + K1 (n, p)
n

εT XW XT ε tr(Σ2X ) tr2 (ΣX )


   
1
E = O(1) T 2 +
n 3 σ T ΣX σ (σ σ) n n2
2 2
 
1 tr(∆0 ) tr (∆0 )
= O(1) + +
n n(σ T σ)2 n2 (σ T σ)2
 
1
= O(1) + K2 (n, p) + K12 (n, p) .
n
2

Lemma A.3. As n, p → ∞ subject to the conditions of Theorem 4.1,


!
T
βpls W XT ε
var = O(n−1 ) (A.22)
n2 σ T σ

!
T
βpls W 2 XT ε
 
1
var = O(1) + K22 (n, p) + K32 (n, p) . (A.23)
n3 σ T Σ X σ n

Proof of Lemma A.3. Each term has expectation zero, so we compute their
variances with the help of Lemma A.1.
!
T
βpls W XT ε τ2 τ2
var 2 T
T
= 4 T 2 E(βpls W XT XW βpls ) = 4 T 2 E(βpls
T
W 3 βpls )
n σ σ n (σ σ) n (σ σ)
1
= O(1) n3 βpls
T
Σ3X βpls + n2 βpls
T
Σ2X βpls tr(ΣX )
n4 (σ T σ)2
+n2 βpls
T
ΣX βpls tr(Σ2X ) + nβpls
T
ΣX βpls tr2 (ΣX ) .


T
Now, βpls Σ3X βpls = σ T ΣX σ  (σ T σ)2 and therefore we have
!
T
βpls W XT ε
 
1 −1 2
var = O(1) + n {K1 (n, p) + K2 (n, p) + K1 (n, p)} .
n2 σ T σ n
Proofs for Chapter 4 357

Conclusion (A.22) follows by hypothesis since the three terms involving Kj


are all o(1/n).
Conclusion (A.23) follows in a similar vein:

σY2 |X
!
T
βpls W 2 XT ε
var = T
E(βpls W 2 XT XW 2 βpls )
n 2 σ T ΣX σ n6 (σ T ΣX σ)2
σY2 |X
= T
E(βpls W 5 βpls ).
n6 (σ T ΣX σ)2

E(W 5 ) can now be evaluated using Lemma A.1 and the results simplified to
yield (A.23). 2

Lemma A.4. As n, p → ∞ subject to the conditions of Theorem 4.1,


T
βpls W 2 βpls
 
1
− 1 = O p √ + K 1 (n, p) (A.24)
n2 σ T σ n
T
βpls W 3 βpls
 
1
− 1 = Op √ + K1 (n, p) + K2 (n, p) . (A.25)
n3 σ T ΣX σ n
Proof. The proof follows the same logic as the proofs of the other lemmas.
See Cook and Forzani (2018) for details. 2

Lemmas A.2–A.4 give the orders of scaled versions of all six addends on
the right hand sides of (A.18) and (A.19). These are next used to determine
the orders (A.15)–(A.17). By combining (A.20), (A.22) and (A.24) we see that
bT σ
σ b/kσk is of the order of
   
1 1 1
+ K1 (n, p) + √ + √ + K1 (n, p) ,
n n n

which has the order given in (A.16)


 
1
√ + K1 (n, p) .
n

Combining (A.21), (A.23) and (A.25) and using that σ T ΣX σ  (σ T σ)2 ,


bT SX σ
we find the order of σ b/σ T ΣX σ is equal to the order of
   
1 2 1
+ K2 (n, p) + K1 (n, p) + √ + K2 (n, p) + K3 (n, p)
n n
 
1
+ √ + K2 (n, p) + K1 (n, p) .
n
358 Proofs of Selected Results

Following the previous logic, we arrive at the stated order (A.15). (A.17)
follows immediately from (A.16) and (A.15).

End Proof of Proposition A.2. 2

Continuing now with the proof of Proposition A.1, the next step is to
rewrite (A.13) in a form that makes use of Proposition A.2. Recall that
δ = σ T ΣX σ/kσk2 is the eigenvalue of ΣX that is associated with the ba-
sis vector Φ = σ/kσk of the envelope EΣX(B). Let δb = σbT SX σ σ Tσ
b/b b. From
(A.13) then, we need to find the order of

DN = (δb−1 σ
b − δ −1 σ)T ωN
= δb−1 (b
σ − σ)T ωN − δb−1 (b b − σ T ΣX σ)(σ T ΣX σ)−1 σ T ωN
σ T SX σ
b − σ T σ)(σ T ΣX σ)−1 σ T ωN .
σT σ
+(b

It follows from Equation (A.17) that δb−1 δ = Op (1). Consequently, multiplying


the first two addends of DN by δδ −1 we have

DN = (δb−1 δ)δ −1 (b
σ − σ)T ωN − (δb−1 δ)δ −1 (b b − σ T ΣX σ)(σ T ΣX σ)−1 σ T ωN
σ T SX σ
b − σ T σ)(σ T ΣX σ)−1 σ T ωN .
σT σ
+ (b

Therefore an order for DN can be found by adding the orders of the following
three terms.

I = δ −1 (b
σ − σ)T ωN .
II = δ −1 (b b − σ T ΣX σ)(σ T ΣX σ)−1 σ T ωN .
σ T SX σ
III = (b b − σ T σ)(σ T ΣX σ)−1 σ T ωN .
σT σ

The orders of these terms is as follows.

Term I.

σ )  n−1 (var(Y )ΣX + σσ T ) (Cook, Forzani, and Rothman, 2013)


Since var(b
we have

var(I) = δ −2 E((b
σ − σ)T ΣX (bσ − σ))  δ −2 tr{var(b σ )ΣX }
2 T
var(Y )tr(ΣX ) + σ ΣX σ
 δ −2
n
 n−1 δ −2 var(Y ) δ 2 + tr(∆20 ) + n−1 δ −2 σ T ΣX σ
 n−1 + K2 (n, p).
Proofs for Chapter 4 359

Consequently, since E(I) = 0,


 
1/2
I = Op n−1/2 + K2 (n, p) .

Term II.

From conclusion (A.15) of Proposition A.2,


 
b−σ T ΣX σ)(σ T ΣX σ)−1 = Op n−1/2 + K1 (n, p) + K2 (n, p) + K3 (n, p)
σ T SX σ
(b

and
2
var(δ −1 σ T eN ) = δ −1 σ T ΣX σ = (σ T σ)2 (σ T ΣX σ)−1  1.

Therefore
 
II = Op n−1/2 + K1 (n, p) + K2 (n, p) + K3 (n, p) .

Term III.

It follows from conclusion (A.16) of Proposition A.2 that


 
b − σ T σ)(σ T σ)−1 = O n−1/2 + K1 (n, p)
σT σ
(b

and, from term II that var δ −1 σ T wN  1. Accordingly,




 
III = O n−1/2 + K1 (n, p) .

Thus,

DN = I + II + III
 
1/2
= Op n−1/2 + K2 (n, p) + K1 (n, p) + K2 (n, p) + K3 (n, p)
 
1/2
= Op n−1/2 + K2 (n, p) + K1 (n, p) + K3 (n, p) ,

which is the conclusion claimed.

A.4.5 Proposition 4.2


Restatement. Assume that the eigenvalues of ∆0 are bounded and
β T ΣX β  1. Then κ(p)  p and, if kΣX,Y k2 → ∞, then η(p) → ∞, where
k · k2 denotes the Euclidean norm.

Proof. First κ(p) = tr(∆0 )  p is immediate since the eigenvalues of ∆0 are


bounded.
360 Proofs of Selected Results

For the second conclusion, recall that in the multicomponent model


ΣX = Φ∆ΦT + Φ0 ∆0 Φ0 . We assume without loss of generality that
∆ = diag(δ1 , . . . , δq ). The initial part of the proof is driven by the bound-
edness of β T ΣX β:

β T ΣX β T
= σX,Y Σ−1
X σX,Y
T
Φ∆−1 ΦT + Φ0 ∆−1

= σX,Y 0 Φ0 σX,Y
T −1 T

= σX,Y Φ∆ Φ σX,Y ,

where the last step follows because σX,Y is contained in EΣX(B), which has
semi-orthogonal basis Φ. Let φi denote the i-th column of Φ and define
T
σX,Y Pφi σX,Y
wi = T
, i = 1, . . . , q,
σX,Y PΦ σX,Y

which are positive and sum to 1. Then

β T ΣX β T
Φ∆−1 ΦT σX,Y

= σX,Y
q
X
= wi (kσX,Y k2 /δi ).
i=1

If the regression is abundant so η(p) = tr(∆) → ∞ then kσX,Y k2 → ∞


since β T ΣX β  1. If kσX,Y k2 → ∞ then we must have δi → ∞ since again
β T ΣX β  1. 2

A.5 Proofs for Chapter 5


A.5.1 Proposition 5.1
Restatement. Define the subspaces S ⊆ Rp and R ⊆ Rr . Then the two
condition
(a) QS X (Y, PS X) and (b) QR Y (X, PR Y )

hold if and only if the following two condition hold:

(I) QR Y QS X and (II) (PR Y, PS X) (QR Y, QS X).


Proofs for Chapter 5 361

Proof. For notational convenience, let M = (PR Y, PS X). We first show that
conditions (a) and (b) are equivalent to the conditions

(I0 ) QR Y QS X | (PR Y, PS X) and (II) (PR Y, PS X) (QR Y, QS X).

Assume that conditions (a) and (b) hold. Let R ∈ Rr×u and R0 ∈ Rr×r−u
be semi-orthogonal basis matrices for R and its orthogonal complement R⊥ .
Then (R, R0 ) is an orthogonal matrix and condition (a) holds if and only
if QS X (RT Y, R0T Y, PS X). Consequently, condition (a) implies that (see
(Cook, 1998, Proposition 4.6) for background on conditional independence.)

QS X (PR Y, QR Y, PS X) ⇒ QS X (M, QR Y )
⇒ (a1) QS X QR Y | M and (a2) QS X M
⇒ (a3) QS X M | QR Y and (a4) QS X QR Y.

Similarly, condition (b) implies

QR Y (PS X, QS X, PR Y ) ⇒ QR Y (M, QS X)
⇒ (b1) QS X QR Y | M and (b2) QR Y M
⇒ (b3) QR Y M | QS X and (b4) QS X QR Y.

Condition (I0 ) follows immediately from either condition (a1) or condition


(b1). Condition (II) is implied by (a3) and (b2). Thus condition (a) and (b)
imply conditions (I0 ) and (II).
Assume that conditions (I0 ) and (II) hold. Then these imply that

(I0 ) QR Y QS X | M, (II1) M QR Y and (II2) M QS X.

(I0 ) and (II1) imply that QR Y (M, QS X), while conditions (I0 ) and (II2)
imply that QS X (M, QR Y ). Replacing M with its definition, these impli-
cations give

QR Y (PS X, QS X, PR Y ) and QS X (PR Y, QR Y, PS X).

The required condition (a) and (b) follow from here since if A is independent
of B then A is independent of any non-stochastic function of B. That is, these
statements imply that

QR Y (PS X + QS X, PR Y ) and QS X (PR Y + QR Y, PS X).


362 Proofs of Selected Results

This established that conditions (I0 ) and (II) are equivalent to conditions (a)
and (b).
To establish (a) and (b) with (I) and (II), first assume conditions (a) and
(b). Then (I0 ) and (II) hold. But (II) implies that (I) holds if and only if (I0 )
holds. Next, assume that (I0 ) and (II) hold. Then (a) and (b) hold and, again,
(II) implies that (I) holds if and only if (I0 ) holds.
2

A.5.2 Proof of Lemma 5.1


Restatement. Under the simultaneous envelope model (5.7), canonical
correlation analysis can find at most d directions in the population, where
d = rank(ΣX,Y ) as defined in (5.5). Moreover, the directions are contained in
the simultaneous envelope as

span(a1 , . . . , ad ) ⊆ EΣX (B), span(b1 , . . . , bd ) ⊆ EΣY |X (B 0 ). (A.26)

Proof. Recall that the canonical correlation directions are the pairs of vec-
−1/2 −1/2
tors {ai , bi } = {ΣX ei , ΣY fi }, where {ei , fi } is the i-th left-right eigen-
−1/2 −1/2
vector pair of the correlation matrix ρ = ΣX ΣX,Y ΣY , i = 1, . . . , d,
d = rank(ΣX,Y ). Now, the conclusion follows for the ai ’s because
−1/2
span(a1 , . . . , ad ) = span(Σ−1
X ΣX,Y ΣY ) ⊆ span(Φ) = EΣX(B).

The conclusion for the bi ’s follows similarly. 2

A.5.3 Proof of Lemma 5.3


Restatement. Using the sample estimators from Lemma 5.2, the sample
covariance matrix of the residuals from model (5.7),
n
X
Sres = n−1 η T W TXi }{(Yi − Ȳ ) − Gb
{(Yi − Ȳ ) − Gb η T W TXi }T ,
i=1

can be represented as

Sres = Σ
b Y |X + PG SY |W TX QG + QG SY |W TX PG ,

where Σ
b Y |X is as given in Lemma 5.2.
Proofs for Chapter 5 363
−1
Proof. Substituting ηb = SW TX SW TX,GT Y and expanding, we have

n
X
Sres = n−1 η T W TXi }{(Yi − Ȳ ) − Gb
{(Yi − Ȳ ) − Gb η T W TXi }T
i=1
−1
= SY − GSGT Y,W T X SW T Y SW T X,Y

−1 T −1 T
−SY,W T X SW T X SW T X,GT Y G + GSGT Y,W T X SW T X SW T X,GT Y G .

−1
Let M = SY,W T X SW T X SW T X,Y so that SY − M = SY |W T X . Then

Sres = SY − PG M − M PG + PG M PG
= SY − PG M QG − QG M PG − PG M PG .

We next expand

SY = PG SY PG + QG SY PG + PG SY QG + QG SY QG ,

substitute into Sres and rearrange terms to get

Sres = PG (SY − M )PG + QG (SY − M )PG + PG (SY − M )QG + QG SY QG


= GSGT Y |W T X GT + QG (SY − M )PG + PG (SY − M )QG + GT0 SGT0 Y G0

b Y |X + QG SY |W T X PG + PG SY |W T X QG .

A.5.4 Justification of the two-block algorithm, Section 5.4.3


Here we give a justification of the two-block algorithm as described in Ta-
ble 5.2. We start with the data-based algorithm from Weglin (2000, Section
4.2, PLS-W2A) from which the population version will follow. Reall that Xn×p
and Yn×r are the matrices of centered predictor and response data.

1. k ← 1.

2. X(1) ← X and Y(1) ← Y.

3. Compute the first singular vectors of (X(k) )T Y(k) , uk left and vk right.

4. ξk ← X(k) uk and ωk ← Y(k) vk .

5. Regress X(k) on ξk and Y(k) on ωk obtaining first order approximations


b (k) = ξk (ξ T ξk )−1 ξ T X(k) and Y
X b (k) = ωk (ω T ωk )−1 ω T Y(k) .
k k k k
364 Proofs of Selected Results

6. Let SX (k) = X(k)T X(k) /n and SY (k) = Y(k)T Y(k) /n. Construct the residu-
als:

X(k+1) ← X(k) − X
b (k) = X(k) − ξk (ξkT ξk )−1 ξkT X(k)
= X(k) − X(k) uk (uTk X(k)T X(k) uk )−1 uTk X(k)T X(k)
= X(k) Quk (SX (k) )
Y(k+1) ← Y(k) − Y
b (k) = Y(k) − ωk (ωkT ωk )−1 ωkT Y(k)
= Y(k) − Y(k) vk (vkT Y(k)T Y(k) vk )−1 vkT Y(k)T Y(k)
= Y(k) Qvk (SY (k) ) .

7. If (X(k+1) )T Y(k+1) = 0 stop and k is the rank of the PLS model. Other-
wise, k ← k + 1 and go to step 3.

Now, if k = 1, u1 and v1 are the first left and right singular values of SX,Y
from the initialization step in Table 5.2. Then ξ1 = X(1) u1 , ω1 = Y(1) v1 and

X(2) = X(1) Qu1 (SX (1) )


SX (2) = (X(2) )T X(2) /n = QTu1 (S ) SX (1) Qu1 (SX (1) )
X (1)
(2) (1)
Y = Y Qv1 (SY (1) )
SY (2) = (Y(2) )T Y(2) /n = QTv1 (S ) SY (1) Qv1 (SY (1) )
Y (1)

SX (2) ,Y (2) = (X(2) )T Y(2) /n = QTu1 (S )) SX (1) ,Y (1) Qv1 (SY (1) ) . (A.27)
X (1)

Now, consider u2 and v2 , the first left and right singular vectors of SX (2) ,Y (2) ,
and

X(3) = X(2) Qu2 (SX (2) )


= X(1) Qu1 (SX (1) ) Qu2 (SX (2) )
= X(1) Q(u1 ,u2 )(SX (1) ) ,

where the last step follows from the form of SX (2) and Lemma 3.5. Similarly,
Y(3) = Y(1) Q(v1 ,v2 )(SY (1) ) .
We next compute first left and right singular vectors of

SX (2) ,Y (2) = (X(3) )T Y(3) /n = n−1 QT(u1 ,u2 )(S ) (X


(1) T
) Y(1) Q(v1 ,v2 )(SY (1) )
X (1)

= QT(u1 ,u2 )(S ) SX,Y Q(v1 ,v2 )(SY (1) ) . (A.28)


X (1)

We seen then that when SX = SX (1) , SX,Y SX (1) ,Y (1) and SY (1) are re-
placed with their population versions, the initialization and steps k = 1, 2 of
Proofs for Chapter 6 365

this algorithm correspond to the population version in Table 5.2. Subsequent


steps can be justified in the same way by induction.

A.6 Proofs for Chapter 6


A.6.1 Justification for the equivalence of (6.5) and (6.6)
Restatement. (6.5),

(a) Y X1 | PS X1 , X2 and (b) PS var(X1 | X2 )QS = 0,

is equivalent to (6.6),

(a) RY |2 R1|2 | PS R1|2 , X2 and (b) PS var(R1|2 | X2 )QS = 0.

The equivalence of (6.5) and (6.6) relies on Proposition 4.4 from Cook
(1998), which states that for U , V and W random vectors, U V | W if and
only if U (V, W ) | W . From this we have that (6.5a) holds if and only if

(Y, X2 ) (X1 , X2 ) | PS X1 , X2 .

Since (Y, X2 ) is a one-to-one function of (RY |2 , X2 ) and (X1 , X2 ) is a one-to-


one function of (R1|2 , X2 ), this last statement holds if and only if

(RY |2 , X2 ) (R1|2 , X2 ) | PS X1 , X2 .

Applying Proposition 4.4 from Cook (1998) again, the last statement holds if
and only if
RY |2 R1|2 | PS X1 , X2 .
We obtain (6.6a) by applying the same logic to the conditioning argument,
(PS X1 , X2 ), and to (6.5b).

A.6.2 Derivation of the MLEs for the partial envelope,


Section 6.2.3
Recall from Section 6.2.3 that the partial envelope model is

X2 ∼ N (0, Σ2 )
T
X1 | X2 ∼ N (β1|2 X2 , ΦΩΦT + Φ0 Ω0 ΦT0 )
Y | (X1 , X2 ) ∼ N (η T ΦT X1 + β2T X2 , σY2 |X ),
366 Proofs of Selected Results

where Y , X1 , and X2 are taken to be centered without loss of generality.


Recall also that the corresponding log likelihood is
n
n 1 X T −1 n n
log L = − log |Σ2 | − tr X2 i Σ2 X2 i − log |Ω| − log |Ω0 |
2 2 i=1 2 2
n
1X
− T
(X1 i − β1|2 X2 i )T (ΦΩ−1 ΦT + Φ0 Ω−1 T T
0 Φ0 )(X1 i − β1|2 X2 i )
2 i=1
n
n 1 X
− log σY2 |X − 2 (yi − η T ΦT X1 i − β2T X2 i )2 .
2 2σY |X i=1

The MLE of β1|2 is βb1|2 = S2−1 S2,1 and then R1|2 i is the i-th residual from
the fit of X1 on X2 ,

T
R1|2 i = X1 i − βb1|2 X2 i
n
X
S1|2 = n−1 T
R1|2 i R1|2 i
i=1
= S1 − S1,2 S2−1 S2,1 .

The MLE of Σ2 is S2 . Substituting these we get the partially maximized log


likelihood
n np2 n n
log L1 = − log |S2 | − − log |Ω| − log |Ω0 |
2 2 2 2
n n
1 X 1 X
− RT ΦΩ−1 ΦT R1|2 i − RT Φ0 Ω−1 T
0 Φ0 R1|2 i
2 i=1 1|2 i 2 i=1 1|2 i
n
n 1 X
− log σY2 |X − 2 (yi − η T ΦT X1 i − β2T X2 i )2
2 2σY |X i=1
n np2 n n
= − log |S2 | − − log |Ω| − log |Ω0 |
2 2 2 2
n n
n X n X
tr S1|2 ΦΩ−1 ΦT − tr S1|2 Φ0 Ω−1 T
 
− 0 Φ0
2 i=1 2 i=1
n
n 1 X
− log σY2 |X − 2 (yi − η T ΦT X1 i − β2T X2 i )2 .
2 2σY |X i=1

b = ΦT S1|2 Φ, Ω
From this we have the MLEs Ω b 0 = ΦT S1|2 Φ0 and thus the
0
Proofs for Chapter 6 367

next partially maximized log likelihood is


n np2
log L2 = − log |S2 | −
2 2
n n np1
− log |Φ S1|2 Φ| − log |ΦT0 S1|2 Φ0 | −
T
2 2 2
n
n 1 X
− log σY2 |X − 2 (yi − η T ΦT X1 i − β2T X2 i )2 .
2 2σY |X i=1

The next step is to determine and substitute the values of η and β2 that
maximize the log likelihood with Φ and σ 2 held fixed. To facilitate this we
orthogonalize the terms in the sum. To do this we use the coefficients βb1|2 Φ
Pn
from the OLS fit of ΦT X1 i on X2 i . Let SS = i=1 (yi − η T ΦT X1 i − β T X2 i )2 .
Then write
n
X
SS = (yi − η T (ΦT X1i − ΦT βb1|2
T
X2 i + ΦT βb1|2
T
X2 i ) − β2T X2 i )2
i=1
n
X
= (yi − η T ΦT (X1i − βb1|2
T
X2 i ) − η T ΦT βb1|2
T
X2 i − β2T X2 i )2
i=1
n
X
= (yi − η T ΦT R1|2 i − β2∗T X2 i )2 ,
i=1

Pn
T
where R1|2 i = X1 i − βb1|2 X2 i and β2∗ = β2 + βb1|2 Φη. Since i=1 R1|2 i X2Ti = 0,
we can fit the two terms in the last sum separately, getting straightforwardly
n
X
n−1 ΦT R1|2 i Yi = ΦT {S1Y − S1,2 S2−1 S2,Y }
i=1
S1|2 = S1 − S1,2 S2−1 S2,1
η = (ΦT S1|2 Φ)−1 ΦT SR1|2 ,Y
SR1|2 ,Y = S1Y − S1,2 S2−1 S2,Y
β2∗ = S2−1 S2,Y

as the values of η and β2∗ that minimize the sum. Substituting these values
and simplifying we get the next partially maximized log likelihood
n np
log L3 = − log |S2 | −
2 2
n n
− log |Φ S1|2 Φ| − log |ΦT0 S1|2 Φ0 |
T
2 2
n 2 n
− log σY |X − 2 {SY |2 − SR T
1|2 ,Y
Φ[ΦT S1|2 Φ]−1 ΦT SR1|2 ,Y }.
2 2σY |X
368 Proofs of Selected Results

This is maximized at the value

σY2 |X = SY |2 − SR
T
1|2 ,Y
Φ[ΦT S1|2 Φ]−1 ΦT SR1|2 ,Y ,

giving the next partially maximized log likelihood


n n(p + 1)
log L4 = − log |S2 | −
2 2
n n
− log |Φ S1|2 Φ| − log |ΦT0 S1|2 Φ0 |
T
2 2
n n o
− log SY |2 − SR T
1|2 ,Y
Φ[ΦT S1|2 Φ]−1 ΦT SR1|2 ,Y .
2
Writing the third and fifth terms together
n n(p + 1) n
log L4 = − log |S2 | − − log |ΦT0 S1|2 Φ0 |
2 2 2
n n o
T T
− log |Φ S1|2 Φ| SY |2 − SR 1|2 ,Y
Φ[ΦT S1|2 Φ]−1 ΦT SR1|2 ,Y .
2
Let
n o
−1 T
T4 = log |ΦT S1|2 Φ| SY |2 − SR
T
1|2 ,Y
Φ[Φ T
S1|2 Φ] Φ SR 1|2 ,Y .

For clarity we use det rather than | · | to denote the determinant operator in
the remainder of this proof. Then
!
ΦT S1|2 Φ ΦT SR1|2 ,Y
T4 = det T
SR 1|2 ,Y
Φ SY |2
n o
= det ΦT S1|2 Φ − ΦT SR1|2 ,Y SY−1|2 SR T
1|2 ,Y
Φ
n o
= det ΦT [S1|2 − SR1|2 ,Y SY−1|2 SRT
1|2 ,Y ]Φ .

Ignoring constant terms, the overall objective function to minimize becomes


n o  
F (Φ) = det ΦT [S1|2 − SR1|2 ,Y SY−1|2 SR
T
1|2 ,Y
−1
]Φ + det ΦT S1|2 Φ .

Since R1|2 is uncorrelated in the sample with X2 , we have SR1|2 ,Y = SR1|2 ,RY |2
and so
n o  
F (Φ) = det ΦT [S1|2 − SR1|2 ,RY |2 SY−1|2 SR
T
R
1|2 Y |2
]Φ + det Φ T −1
S1|2 Φ .

Thus this corresponds to defining the response as RY |2 i and predictor as R1|2 i


and then run the usual envelope or PLS for the regression of RY |2 i on R1|2 i .
The asymptotic variances under predictor reduction should apply to this con-
text.
Proofs for Chapter 9 369

Finally, recognizing that

S1|Y,2 = S1|2 − SR1|2 ,RY |2 SY−1|2 SR


T
1|2 RY |2

gives the objective function at (6.9). The estimators are obtained by piecing
together the various parameter functions that maximize the log likelihood.

A.7 Proofs for Chapter 9


A.7.1 Proposition 9.1
Restatement. The following statements are equivalent.
(i) Y E(Y | X) | αT X,

(ii) cov{(Y, E(Y | X)) | αT X} = 0,

(iii) E(Y | X) is a function of αT X.


Proof. That (i) implies (ii) and (iii) implies (i) are immediate because if
E(Y | X) is a function of αT X then given αT X the mean function is a
constant. It remains to show that (ii) implies (iii). The idea is to show that
(ii) implies that var[E(Y | X) | αT X] = 0, from which it follows that E(Y |X)
is constant given αT X. From (ii),

E[Y E(Y |X) | αT X] = E(Y | αT X)E{E(Y | X) | αT X}.

The left hand side is

E[Y E(Y |X) | αT X] = E[E{Y E(Y | X) | X} | αT X]


= E[E(Y | X)E(Y | X) | αT X]
= E[{E(Y | X)}2 | αT X].

The right hand side is

E(Y | αT X)E{E(Y | X) | αT X} = {E[E(Y | X) | αT X]}2 .

Consequently,

E[{E(Y | X)}2 | αT X] = {E[E(Y | X) | αT X]}2 ,

erom which it follows that var[E(Y | X) | αT X] = 0. 2


370 Proofs of Selected Results

A.7.2 Proposition 9.2


Restatement. Assume that W satisfies the linearity condition relative to
S under Definition 9.3. Then

1. M = ΣW α(αT ΣW α)−1 .

2. M T is a generalized inverse of α.

3. αM T is the orthogonal projection operator for S relative to the ΣW inner


product.

4.
E(W | αT W ) − E(W ) = PS(Σ
T
W)
(W − E(W )),

where for completeness we have allowed E(W ) to be non-zero.

5. If S reduces ΣW then PS(ΣW ) = PS .

Proof. To see conclusion 1, we have E(W | αT W ) = M αT W . Multiplying


both sides on the right by W T α and taking expectations we get for the left
hand side

E{E(W | αT W )W T α} = E{E(W W T α | αT W )}
= E(W W T α)
= ΣW α,

and for the right hand side

E(M αT W W T α) = M αT ΣW α.

Consequently,
Σ W α = M α T ΣW α

and conclusion 1 follows. Conclusions 2–4 follow straightforwardly from con-


clusion 1 and the definitions of a generalized inverse and projection. Conclu-
sion 5 follows from Proposition 1.3. 2
Proofs for Chapter 9 371

A.7.3 Proposition 9.3


Restatement.
Assume that the distribution of X | Y satisfies the linearity condition rel-
ative to SY |X for each value of Y . Then the marginal distribution of X also
satisfies the linearity condition relative to SY |X .

Proof. Let α be a basis matrix for SY |X . The random vector X − E(X|Y = y)


has mean 0 for each value y of Y . Then by hypothesis and Definition 9.3 we
have for each value of Y

E(X | αT X, Y = y) − E(X | Y = y) = MY αT (X − E(X | Y = y)),

where MY is now permitted to depend on the value of Y . This implies alge-


braically that

E(X | αT X, Y = y) = MY αT X − MY αT E(X | Y = y) + E(X | Y = y).

Since α is a basis for SY |X , the left hand side does not depend on the value
of Y . This implies that E(X | αT X, Y = y) = E(X | αT X) and thus that

E(X | αT X) − E(X | Y = y) = MY αT (X − E(X | Y = y)).

Taking the expectation with respect to the marginal distribution of Y we have

E(X | αT X) − µX = E(MY )αT X − E(MY αT E(X | Y )). (A.29)

The left and right hand sides are a function of only αT X. Since the expecta-
tion of the left hand side in αT X is 0, the expectation of the right hand side
must also be 0,
E(MY αT E(X | Y )) = E(MY )αT µX .

Replacing the second term on the right hand side of (A.29) with E(MY )αT µX ,
it follows that

E(X | αT X) − µX = E(MY )αT (X − µX ).

2
372 Proofs of Selected Results

A.7.4 Proposition 9.5


Restatement. Assume the linearity condition for X relative to SE(Y |X) .
Then under the non-linear model (9.1) we have

EΣX(B) ⊆ EΣX (SE(Y |X) ).

Moreover, if the single index model holds and ΣX,Y 6= 0 then EΣX(B) =
EΣX (SE(Y |X) ).

Proof. We know from Proposition 9.4 and Corollary 9.2 that B ⊆ SE(Y |X)
with equality under the single index model (9.2) with ΣX,Y 6= 0. Let A1
be semi-orthogonal basis matrix for B and extent it so that (A1 , A2 ) is a
semi-orthogonal basis matrix for SE(Y |X) . Then SE(Y |X) = span((A1 , A2 )) =
span(A1 ) + span(A2 ). Applying Proposition 1.7 with M = ΣX , S1 =
span(A1 ) = B and S2 = span(A2 ), we have SE(Y |X) = EΣX(B)+EΣX (span(A2 ),
which implies the desired conclusion, EΣX(B) ⊆ EΣX (SE(Y |X) ).
When the single index model holds, B = SE(Y |X) , which implies equality
SE(Y |X) = EΣX(B).
2

A.8 Proofs for Chapter 10


A.8.1 Model from Section 10.2
Lemma A.5. From (10.1) and (10.2), we have
   
X µX
   
 Y   µY 
∼
 ξ   µ  + X,Y,ξ,η
 
   ξ 
η µη
where var(X,Y,ξ,η ) = Σ(X,Y,ξ,η) with
 
T
ΣX|ξ + βX|ξ βX|ξ σξ2 βX|ξ βYT |η σξ,η βX|ξ σξ2 βX|ξ σξ,η
ΣY |η + βY |η βYT |η ση2 βY |η ση2 
 
 ... βY |η σξ,η
Σ(X,Y,ξ,η) = .

 ... ... σξ2 σξ,η 

... ... ... ση2
(A.30)
Proofs for Chapter 10 373

Proof. The first two elements of the diagonal are direct consequence of the
fact that cov(Z) = E(cov(Z|H))+var(E(Z|H)). For ΣX,ξ and ΣY,η we use the
fact that βX|ξ = σξ−2 ΣX,ξ and βY |η = ση−2 ΣY,η . To compute ΣY,η we use that

ΣY,η = E((Y − µY )(η − µη ))


= Eη [E((Y − µY )(η − µη ))|η]
= βY |η Eη (η − µη )2
= βY |η ση2 .

Analogously for ΣX,ξ . Now for ΣX,Y we use that

ΣY,X = E((Y − µY )(X − µX )T )


= Eη,ξ [EY,X|(η,ξ) {(Y − µY )(X − µX )T }]
T
= βY |η Eη,ξ (η − µη )(ξ − µξ )βX|ξ
T
= βY |η σξ,η βX|ξ .

Now, for ΣXη

ΣXη = E((X − µX )(η − µη ))


= Eη,ξ E((X − µX )(η − µη )|(η, ξ))
= βX|ξ Eη,ξ E((ξ − µξ )(η − µη )|(η, ξ))
= βX|ξ σξ,η .

The proof of ΣY,ξ is analogous. 2

A.8.2 Constraints in cb|sem


Proposition A.3. The SEM models under the marginal and regression con-
straints are

• Marginal constraints,
! !
X µX
∼ + X,Y
Y µY
where
!
DX|ξ + c2 BB T BAT
var(X,Y ) := Σ(X,Y ) = . (A.31)
AB T DY |η + d2 AAT

and c2 d2 cor2 (η, ξ) = 1.


374 Proofs of Selected Results

• Regression constraints
! !
X µX
∼ + X,Y
Y µY

From (A.43)
 
−1
DX|ξ + (B T DX|ξ B)−1 BB T /(σξ2 − 1)
AB T
 
 
Σ(X,Y ) = 

 BAT 

DY |η + (AT DY−1|η A)−1 AAT /(ση2 − 1)

where cor2 (η, ξ)(ση2 − 1)−1 (AT DY−1|η A)−1 (σξ2 − 1)−1 (B T DX|ξ
−1
B)−1 = 1.

Proof. For (A.31) we only need to prove ΣX = DX|ξ + BB T c2 and ΣY =


DY |η + AAT d2 with c2 d2 cor(η, ξ) = 1.
Now, ΣX = DX|ξ + ΣX,ξ Σξ,X and ΣY = DY |η + ΣY,η Ση,Y . And since
ΣY,X = AB T = ΣY,η cor(η, ξ)Σξ,X we have that ΣY,η = Ad and ΣX,ξ =
Bc with dc × cor(η, ξ) = 1 from what follows (A.31). Then the conclu-
sion for the regression constraints follows from the proof of Lemma A.6.
2

A.8.3 Lemma 10.1


Restatement. For model presented in Section 10.2 and without impos-
ing either the regression or marginal constraints, we have E(Y | X) =
µY + βY |X (X − µX ), where βY |X = AB T Σ−1
X , and A ∈ R
r×1
and B ∈ Rp×1
are such that ΣY,X = AB T .
T
Proof. This is direct consequence of Lemma A.5 since ΣY,X = βY |η σξ,η βX|ξ =
−2 −2
ΣY,η ση σξ,η σξ Σξ,X and therefore rank of ΣY,X is one and the lemma follows.
2

A.8.4 Proposition 10.1


Restatement. The parameters ΣX , ΣY , ΣY,X = AB T , βY |X , µY and µX
are identifiable in the reduced rank model of Lemma 10.1, but A and B are
not. Additionally, under the regression constraints, the quantities σξ,η /(ση2 σξ2 ),
Proofs for Chapter 10 375

cor{E(ξ|X), E(η|Y )}, E(ξ|X), E(η|Y ), ΣX,ξ and ΣY,η are identifiable in the
reflexive model except for sign, while cor(η, ξ), σξ2 , ση2 , σξ,η , βX|ξ , βY |η , ΣY |η
and ΣX|ξ are not identifiable. Moreover,

|cor{E(ξ|X), E(η|Y )}| = |σξ,η |/(ση2 σξ2 ) = tr1/2 (βX|Y βY |X ) (A.32)


= tr 1/2
(ΣX,Y Σ−1 −1
Y ΣY,X ΣX ) (A.33)

where | · | denotes the absolute value of its argument.

Proof. The first part follows from reduced rank model literature (see for ex-
ample Cook, Forzani, and Zhang (2015)). Now, using (A.30) and the fact that
µη = 0,

E(η|Y ) = Ση,Y Σ−1


Y (Y − µY ),

and therefore

var(E(η|Y )) = Ση,Y Σ−1 T


Y Ση,Y = 1, (A.34)

where we use the hypothesis that var(E(η|Y )) = 1 in the last equal. In the
same way
var(E(ξ|X)) = Σξ,X Σ−1 T
X Σξ,X = 1. (A.35)

Now, since

ΣY,X = AB T = ΣY,η ση−2 σξ,η σξ−2 Σξ,X , (A.36)

we have for some mη and mξ that

ΣY,η = Amη , (A.37)


ΣX,ξ = Bmξ . (A.38)

(A.36), (A.37), and (A.38) together makes AB T = Amη ση−2 σξ,η σξ−2 mξ B T and
therefore

mη ση−2 σξ,η σξ−2 mξ = 1. (A.39)

Now, using (A.34) and (A.35) together with (A.37) and (A.38) we have that

m2η = (AT Σ−1


Y A)
−1
(A.40)
m2ξ = (B T Σ−1
X B)
−1
. (A.41)
376 Proofs of Selected Results

Plugging this into (A.39),

ση−2 σξ,η σξ−2 = (mξ mη )−1


= (AT Σ−1
Y A)
1/2
(B T Σ−1
X B)
1/2

= (AT Σ−1 T −1
Y AB ΣX B)
1/2

= tr1/2 ((AB T )T Σ−1 T −1


Y AB ΣX )

= tr1/2 (βY |X βX|Y ).

Let us note that from (A.40) and (A.41), m2ξ and m2η are not unique, never-
theless m2η m2ξ is unique since any change of A and B should be such AB T is
the same. As a consequence mη mξ is unique except for a sign. And therefore
ση−2 σξ,η σξ−2 is unique except for a sign.
Now, coming back to equations (A.37) and (A.38) we have that

ΣY,η = A(AT Σ−1


Y A)
−1/2

ΣX,ξ = B(B T Σ−1


X X)
−1/2

and again are unique except by a sign.


Now, using again (A.30), the fact that µη = µξ = 0 and var(E(ξ|X)) =
var(E(η|Y )) = 1 we have

E(η|Y ) = Ση,Y Σ−1


Y (Y − µY )

ση2 = E(var(η|Y )) + 1
E(ξ|X) = Σξ,X Σ−1
X (X − µX )

σξ2 = E(var(ξ|X)) + 1.

Replacing ΣY,X = ΣY,η ση−2 σξ,η σξ−2 Σξ,X we have

cov{E(η|Y ), E(ξ|X)} = Ση,Y Σ−1 −1


Y ΣY,X ΣX ΣX,ξ

= Ση,Y Σ−1 −2 −2 −1
Y ΣY,η ση σξ,η σξ Σξ,X ΣX ΣX,ξ .

cor{E(η|Y ), E(ξ|X)} = (Ση,Y Σ−1


Y ΣY,η ) ση σξ,η σξ−2 (Σξ,X Σ−1
1/2 −2
X ΣX,ξ )
1/2

= ση−2 σξ,η σξ−2


= (mη mξ )−1
= (AT Σ−1
Y A)
1/2
(B T Σ−1
X B)
1/2
.

Now, we will prove that σξ and ση are not identifiable. For that, since
ση and σξ have to be greater than 1 and ση σξ should be constant we can
Proofs for Chapter 10 377

change ξ by Cξ and η by C −1 η in such a way that Cσξ > 1 and C −1 ση > 1.


We take any C such that σξ−1 < C < ση and none of the other parame-
ters change. From this follows that σξ,η and βX|ξ , βY |η is not identifiable.
It is left to prove that ΣY |η and ΣX|ξ are not identifiable. For that we use
ΣX = ΣX|ξ + ΣX,ξ σξ−2 Σξ,X . If ΣX|ξ is identifiable, since ΣX and ΣX,ξ are
identifiable, σξ is identifiable, which is a contradiction since we have already
proven that it is not so. The same argument holds for ΣY |η .
2

Lemma A.6. Under the reflexive model of Section 10.2 and the regression
constraints,

ΣX = ΣX|ξ + σξ−2 (B T Σ−1


X B)
−1
BB T (A.42)
1
= ΣX|ξ + 2 (B T Σ−1
X|ξ B)
−1
BB T . (A.43)
σξ − 1
ΣY = ΣY |η + ση−2 (AT Σ−1
Y A)
−1
AAT (A.44)
1
= ΣY |η + 2 (AT Σ−1
Y |η A)
−1
AAT . (A.45)
ση − 1

Proof. Expressions (A.42) and (A.44) are equivalent to (10.9) and (10.10) in
Chapter 10.
By the covariance formula and the fact that by Proposition 10.1 we have
ΣX,ξ = B(B T Σ−1X B)
−1/2
and ΣY,η = A(AT Σ−1 Y A)
−1/2
from where we get
(A.42) and (A.44). Now, taking inverse and using the Woodbury inequality
we have
B T Σ−1 T −1
X|ξ BB ΣX B
B T Σ−1
X B = σξ2 .
σξ2 B T Σ−1 T −1
X B + B ΣX|ξ B

As a consequence

σξ2 − 1
B T Σ−1
X B = B T Σ−1
X|ξ B σξ2

and (A.43) follows replacing this into (A.42). The proof of (A.45) follows anal-
ogously. 2
378 Proofs of Selected Results

Proposition A.4. Assume the regression constraints. If ΣY |η and ΣX|ξ are


identifiable then σξ2 , ση2 , |σξ,η |, βX|ξ , βY |η , and cor(η, ξ) are identifiable, and

cor(η, ξ) = cor{E(η|Y ), E(ξ|X)}σξ ση



σξ2 = (A.46)
Hξ − 1

ση2 = , (A.47)
Hη − 1

where Hξ = Σξ,X Σ−1 −1


X|ξ ΣX,ξ and Hη = Ση,Y ΣY |η ΣY,η . Moreover, ΣY |η and
ΣX|ξ are identifiable if and only if σξ2 , ση2 are so.

Proof. By (A.42)–(A.45) and using Proposition 10.1 we have that σξ and ση


are identifiable and as a consequence the rest of the parameters are identifi-
able. Now, to prove (A.46) let us use the formula

ΣX = ΣX|ξ + ΣX,ξ σξ−2 Σξ,X .

Using Woodbury inequality

Σ−1
X = Σ−1 −1 2 −1
X|ξ − ΣX|ξ ΣX,ξ (σξ + Σξ,X ΣX,ξ ΣX,ξ )
−1
Σξ,X Σ−1
X|ξ . (A.48)

Using the fact that Σξ,X Σ−1X ΣX,ξ = 1 proven in Proposition 10.1 and multi-
plying to the left and to the right of (A.48) by Σξ,X and ΣX,ξ we have

1 = Σξ,X Σ−1 −1 2 −1
X|ξ ΣX,ξ − Σξ,X ΣX|ξ ΣX,ξ (σξ + Σξ,X ΣX|ξ ΣX,ξ )
−1
Σξ,X Σ−1
X|ξ ΣX,ξ

= Hξ − Hξ (σξ2 + Hξ )−1 Hξ
Hξ σξ2
=
σξ2 + Hξ

from where we get (A.46). Analogously we get (A.47). 2

A.8.5 Proposition 10.2


Restatement. Under the regression constraints, (I) if ΣX|ξ contains an
off-diagonal element that is known to be zero, say (ΣX|ξ )ij = 0, and if
(B)i (B)j 6= 0 then ΣX|ξ and σξ2 are identifiable. (II) if ΣY |η contains an
off-diagonal element that is known to be zero, say (ΣY |η )ij = 0, and if
(A)i (A)j 6= 0 then ΣY |η and ση2 are identifiable.
Proofs for Chapter 10 379

Proof. Identification of ΣX|ξ means that if we have two different matrices ΣX


and Σ̃X satisfying (10.9) then we should conclude that ΣX|ξ = Σ̃X|ξ and that
σξ2 = σ̃ξ2 . The same logic applies to ΣY |η and ση2 . More specifically, from (10.9)
the equality

ΣX|ξ + σξ−2 (B T Σ−1


X B)
−1
BB T = Σ̃X|ξ + σ̃ξ−2 (B T Σ−1
X B)
−1
BB T (A.49)

must imply that ΣX|ξ = Σ̃X|ξ and σξ2 = σ̃ξ2 . If σξ2 is identifiable, so
σξ2 = σ̃ξ2 , then (A.49) implies ΣX|ξ = Σ̃X|ξ since (B T Σ−1
X B)
−1
BB T is identi-
fiable. Similarly, if ΣX|ξ is identifiable, so ΣX|ξ = Σ̃X|ξ , then (A.49) implies
that σξ2 = σ̃ξ2 and thus that σξ2 is identifiable.
Now, assume that elements (i, j) and (j, i) of ΣX|ξ and Σ̃X|ξ are known to
be 0 and let ek denote the p×1 vector with a 1 in position k and 0’s elsewhere.
Then multiplying (A.49) on the left by ei and on the right by ej gives

σξ−2 (B T Σ−1
X B)
−1
(B)i (B)j = σ̃ξ−2 (B T Σ−1
X B)
−1
(B)i (B)j

Since (B T Σ−1
X B)
−1
BB T is identifiable and (B)i (B)j 6= 0, this implies that
σξ−2 = σ̃ξ−2 and thus σξ2 is identifiable, which implies that ΣX|ξ is identifiable.
2

A.8.6 Lemma 10.2


Restatement. In the SEM model, cor(ξ, η) is unaffected by choice of con-
straint, σξ2 = ση2 = 1 or var{E(ξ|X)} = var{E(η|Y )} = 1.

Proof. If we have σξ2 = ση2 = 1 and we define ξ˜ = cξ ξ, η̃ = cη ξ with


cξ = [var{E(ξ|X)}]−1/2 (we can defined because varE(ξ|X) is identifible)
we have that cor(η̃, ξ)˜ = cor(ξ, η). Now, if var{E(η|Y )} = var{E(ξ|X)} = 1
and ΣX|ξ and ΣY |η are identifiable we can identify σξ and ση and we could
define ξ˜ = cξ ξ, η̃ = cη ξ with cξ = {σξ }−1/2 and cη = {ση }−1/2 since cξ and
cη are identifiable.
2
380 Proofs of Selected Results

A.8.7 Proof of Proposition 10.3


Restatement. Starting from the model that stems from (10.1) and (10.2),
cor(ξ, η)
Ψ = [var{E(ξ | X)}var{E(η | Y )}]1/2 .
σξ ση
Under either the regression constraints or the marginal constraints,

|Ψ| ≤ |cor(η, ξ)|. (A.50)

Proof. We start with

cov{E(ξ | X), E(η | Y )} = cov(Σξ,X Σ−1 −1


X X, Ση,Y ΣY Y )

= Σξ,X Σ−1 −1
X ΣX,Y ΣY ΣY,η .

From (A.30),

ΣX,Y = βX|ξ βYT |η σξ,η


σξ,η
= ΣX,ξ Ση,Y 2 2
σξ ση
cor(ξ, η)
= ΣX,ξ Ση,Y .
σξ ση
Substituting we get
cor(ξ, η)
cov{E(ξ | X), E(η | Y )} = Σξ,X Σ−1 −1
X ΣX,ξ Ση,Y ΣY ΣY,η .
σξ ση
As in the proof of Proposition 10.1, we using (A.30) and the fact that µη = 0
to get, E(η|Y ) = Ση,Y Σ−1 −1 T
Y (Y − µY ). Therefore var(E(η|Y )) = Ση,Y ΣY Ση,Y .
−1 T
In the same way var(E(ξ|X)) = Σξ,X ΣX Σξ,X . This allows us to express

cor(ξ, η)
cov{E(ξ | X), E(η | Y )} = var{E(ξ | X)}var{E(η | Y )} .
σξ ση
In consequence, we get the first conclusion,

Ψ = cor{E(ξ | X), E(η | Y )}


cor(ξ, η)
= [var{E(ξ | X)}var{E(η | Y )}]1/2 .
σξ ση
To demonstrate the inequality, we use that

σξ2 = E{var(ξ | X)} + var{E(ξ | X)}


ση2 = E{var(η | Y )} + var{E(η | Y )}.
Proofs for Chapter 11 381

Under the marginal constraints, σξ2 = ση2 = 1 and so the variances of


the regression functions must be less than 1: var{E(ξ | X)} ≤ 1 and
var{E(η | Y )} ≤ 1, which implies that |Ψ| ≤ |cor(η, ξ)|. Under the regres-
sion constraints, the variances of the regression functions are equal to 1:
var{E(ξ | X)} = 1 and var{E(η | Y )} = 1. This implies that σξ2 ≥ 1 and
ση2 ≥ 1. It again follows that |Ψ| ≤ |cor(η, ξ)|.
2

To perhaps aid intuition, we can see the result in another way by using
the Woodbury identity to invert ΣX = ΣX,ξ Σξ,X + ΣX|ξ we have

var{E(ξ | X)} = Σξ,X Σ−1


X ΣX,ξ = Σξ,X (ΣX,ξ Σξ,X + ΣX|ξ )−1 ΣX,ξ
Σξ,X Σ−1
X|ξ ΣX,ξ
= .
1 + Σξ,X Σ−1
X|ξ ΣX,ξ

Similarly,

Ση,Y Σ−1
Y |η ΣY,η
var{E(η | Y )} = Ση,Y Σ−1
Y ΣY,η = ,
1 + Ση,Y Σ−1
Y |η ΣY,η

and thus
#1/2
Σξ,X Σ−1 Ση,Y Σ−1
"
X|ξ ΣX,ξ Y |η ΣY,η
Ψ= cor(ξ, η).
1 + Σξ,X Σ−1 −1
X|ξ ΣX,ξ 1 + Ση,Y ΣY |η ΣY,η

This form shows that |Ψ| ≤ |cor(ξ, η)| and provides a more detailed expression
of their ratio.

A.9 Proofs for Chapter 11


A.9.1 Proposition 11.1
Restatement. For a single-response regression with ΣX > 0 and dimension
q envelope EΣX(B), we have

1. span(ri ) = span(wi+1 ), i = 0, . . . , q − 1

2. span(pi ) = span(vi+1 ), i = 0, . . . , q − 1
382 Proofs of Selected Results

3. span(r0 , . . . , rq−1 ) = span(Wq ) = span(Vq ) = EΣX(B)

4. βcg = βnpls = βspls

The proof of this proposition, which is by induction, relies in part on the


following special case of a result by Elman (1994).

Lemma A.7. In the context of the single-response linear model, the vectors
generated by CGA satisfy

(i) rkT pi = rkT ri = 0 for i < k

(ii) pTk ΣX pi = 0 for i < k

(iii) span(r0 , . . . , rk−1 ) = span(p0 , . . . , pk−1 ) = Kk (ΣX , σX,Y )

Proof. See Elman (1994, Lemma 2.1). 2

Recall from our discussion in Section 11.3.2 that r0 = p0 = σX,Y and so


span(r0 ) = span(w1 ) and span(p0 ) = span(v1 ). Recall that q = dim(EΣX(B)).
For clarity of exposition, let βj denote the population CGA iterate of β when
q = j. Additionally, β1 = p0 (pT0 ΣX p0 )−1 pT0 σX,Y and β1 = βnpls = βspls when
q = 1. We will also use going forward that pT0 σX,Y = r0T r0 . The proposition
therefore holds for i = 0.
For i = 1 we have
r0T r0
r1 = r0 − ΣX r0
r0T ΣX r0
T
σX,Y σX,Y
= σX,Y − T
ΣX σX,Y
σX,Y ΣX σX,Y
T
= σX,Y − ΣX σX,Y (σX,Y ΣX σX,Y )−1 σX,Y
T
σX,Y
= QTσX,Y (ΣX ) σX,Y
span(r1 ) = span(w2 )
r1T r0 T
= σX,Y QσX,Y (ΣX ) σX,Y = 0
r1T p0 = 0.
Proofs for Chapter 11 383

Turning to p1 :
r1T r1
p1 = r1 + p0
r0T r0
T
σX,Y σX,Y
= σX,Y − T
ΣX σX,Y
σX,Y ΣX σX,Y
T
σXY QσX,Y (ΣX ) QTσX,Y (ΣX ) σX,Y
+ T
σX,Y
σX,Y σX,Y
T
σX,Y T
σX,Y σX,Y Σ2X σX,Y
= T Σ σ 2
(σXY X X,Y )
T
Σ2X σX,Y )−1 σX,Y
T

× σX,Y − ΣX σX,Y (σX,Y ΣX σX,Y
T
σX,Y T
σX,Y σX,Y Σ2X σX,Y
= T Σ σ 2
QΣX σX,Y σX,Y ∝ v2 = QΣX σX,Y σX,Y
(σXY X X,Y )
span(p1 ) = span(v2 ), (from Table 3.4).

Additionally,

pT0 ΣX p1 = 0 since span(p0 ) = span(v1 ), span(p1 ) = span(v2 ) and v1T ΣX v2 = 0


r1T r1 T
pT1 σX,Y = r1T σX,Y + p σX,Y
r0T r0 0
= r1T σX,Y + r1T r1
= r1T r1 since r1T σX,Y = 0. (A.51)

For the next regression coefficients,


r1T r1
β2 = β1 + T
p1
p1 ΣX p1
= p0 (pT0 ΣX p0 )−1 pT0 σX,Y + p1 (pT1 ΣX p1 )−1 pT1 σX,Y
= βnpls = βspls when q = 2,

where the second equality follows from (A.51) and we used in the last equality
the fact that pT0 ΣX p1 = 0.
Since the proposition holds for q = 1, 2, we now suppose that it holds for
q = 1, . . . , k − 1. We next prove that it holds for q = k; that is, we show that
span(pk ) = span(vk+1 ), span(rk ) = span(wk+1 ) and that βk+1 = βnpls = βspls
when the number of PLS components is k + 1.
For the induction hypotheses we have for q = 1, . . . , k

1. span{r0 , . . . , rq−1 } = span{p0 , . . . , pq−1 } = span{v1 , . . . , vq } =


span{w1 , . . . , wq }
384 Proofs of Selected Results

2. For i = 0, . . . , q − 1, span(ri ) = span(wi+1 ), span(pi ) = span(vi+1 ),


Pq−1
3. βq = i=0 pi (pTi ΣX pTi )−1 σX,Y = βnpls = βspls ,
T
4. rq−1 ri = 0, for i = 0, . . . , k − 2,

5. pTq−1 ΣX pi = 0 for i = 0, . . . , q − 2,
T
6. rq−1 pi = 0 for i = 0, . . . , q − 2,

7. pTi σX,Y = riT ri for i = 0, . . . , q − 1.

Items 4–6 are implied by Lemma A.7. To prove the proposition we first show
that items 1 and 2 hold for q = k+1. To do this, we first show that {r0 , . . . , rk }
is an orthogonal set. Although this is implied by Lemma A.7(i), we provide
an alternate demonstration for completeness.
Claim: {r0 , . . . , rk } is an orthogonal set.

Proof. We know from induction hypothesis 4 that {r0 , . . . , rk−1 } is an orthog-


onal set so we need to show that rkT ri = 0 for i = 0, . . . , k − 1. We know from
item 5 that for q = k, pTk−1 ΣX pi = 0 for i = 0, . . . , k − 2. From hypothesis 1,

span{r0 , . . . , rk−2 } = span{p0 , . . . , pk−2 }

and consequently we have also that pTk−1 ΣX ri = 0 for i = 0, . . . , k − 2. By


construction rk = rk−1 − αk−1 ΣX pk−1 . Multiplying both sides by ri we have

riT rk = riT rk−1 − αk−1 riT ΣX pk−1 = 0, i = 0, . . . , k − 2.

T
It remains to show that rk−1 rk = 0. By construction
T
T T rk−1 rk−1
rk−1 rk = rk−1 rk−1 − T
rk−1 ΣX pk−1
pk−1 ΣX pk−1
( )
T rk−1 ΣX pk−1
= rk−1 rk−1 1 − T
pk−1 ΣX pk−1
= 0,

T
where the last equality follows because rk−1 ΣX pk−1 = pTk−1 ΣX pk−1 . To see
this we have by construction

pk−1 = rk−1 + Bk−2 pk−2 .


Proofs for Chapter 11 385

Multiplying both sides by pTk−1 ΣX and recognizing that pTk−1 ΣX pk−2 = 0 by


induction hypothesis 5 gives the desired result. 2

Claim: span(rk ) = span(wk+1 ) and span(pk ) = span(vk+1 )

Proof. The proof follows from Proposition 3.7 and Lemma A.7(iii).
In a bit more detail, we know from induction hypothesis 2 that for
q = 1, . . . , k and i = 0, . . . , q −1, span(ri ) = span(wi+1 ). From Lemma A.7(iii)

EΣX(B) = span{w1 , . . . , wk+1 } = span{r0 , . . . , rk } = Kk+1 (ΣX , ΣX,Y ).

Since rk is orthogonal to (r0 , . . . , rk−1 ) and wk+1 is orthogonal to (w1 , . . . , wk ),


we must have span(rk ) = span(wk+1 ). The same rational can be used to
show that span(pk ) = span(vk+1 ) since both have pTi ΣX pk = viT ΣX vk = 0.
2

Pk
Claim: βk+1 = i=0 pi (pTi ΣX pTi )−1 σX,Y = βnpls = βspls

Proof.

βk+1 = βk + αk pk
k−1
X rkT rk
= pi (pTi ΣX pi )−1 pTi σX,Y + T
pk
i=0
pk ΣX pk
k−1
X
= pi (pTi ΣX pi )−1 pTi σX,Y + pk (pTk ΣX pk )−1 rkT rk
i=0
k−1
X
= pi (pTi ΣX pi )−1 pTi σX,Y + pk (pTk ΣX pk )−1 pTk σX,Y .
i=0

From Table 11.1,

pTk σX,Y = rkT σX,Y + Bk−1 pTk−1 σX,Y


pTk−1 σX,Y
= rkT rk T r
rk−1 k−1

= rkT rk

since pTk−1 σX,Y /rk−1


T
rk−1 = 1 by induction hypothesis 7. 2
Bibliography

Adragni, K. and R. D. Cook (2009). Sufficient dimension reduction and


prediction in regression. Philosophical Transactions of the Royal Society
A 367, 4385–4405.

Adragni, K. P. (2009). Some basis functions for principal fitted components.


http://userpages.umbc.edu/∼kofi/reprints/BasisFunctions.pdf.

Akter, S., S. F. Wamba, and S. Dewan (2017). Why pls-sem is suitable for
complex modelling? an empirical illustration in big data analytics quality.
Production Planning & Control 28 (11–12), 1011–1021.

Aldrich, J. (2005). Fisher and regression. Statistical Science 20 (4), 401–417.

Allegrini, F. and A. C. Olivieri (2013). An integrated approach to the simul-


taneous selection of variables, mathematical pre-processing and calibration
samples in partial least-squares multivariate calibration. Talanta 115,
755–760.

Allegrini, F. and A. C. Olivieri (2016). Sensitivity, prediction uncertainty,


and detection limit for artificial neural network calibrations. Analytical
Chemistry 88, 7807–7812.

AlShiab, M. S. I., H. A. N. Al-Malkawi, and A. Lahrech (2020). Revisit-


ing the relationship between governance quality and economic growth.
International Journal of Economics and Financial Issues 10(4), 54–63.

Anderson, T. W. (1951). Estimating linear restrictions on regression coef-


ficients for multivariate normal distributions. Ann. Math. Statist. 22 (3),
327–351.

Anderson, T. W. (1999). Asymptotic distribution of the reduced-rank


regression estimator under general conditions. Annals of Statistics 27 (4),
1141–1154.

387
388 Bibliography

Arvin, M., R. P. Pradhan, and M. S. Nair (2021). Are there links between
institutional quality, government expenditure, tax revenue and economic
growth? Evidence from low-income and lower middle-income countries.
Economic Analysis and Policy 70(C), 468–489.

Asongu, S. A. and J. Nnanna (2019). Foreign aid, instability, and governance


in Africa. Politics & Policy 47(4), 807—848.

Baffi, G., E. Martin, and J. Modrris, A (1999). Non-linear projection to


latent structures revisited: the quadratic PLS algorithm. Computers &
Chemical Engineering 23 (3), 395–411.

Baffi, G., E. B. Martin, and A. J. Morris (1999). Non-linear projection to


latent structures revisited (the neural network pls algorithm). Computers
& Chemical Engineering 23 (9), 1293–1307.

Barker, M. and W. Ravens (2003). Partial least squares for discrimination.


Journal of Chemometrics 17, 166–173.

Basa, J., R. D. Cook, L. Forzani, and M. Marcos (2022, December).


Asymptotic distribution of one-component PLS regression estimators.
Canadian Journal of Statistics Published online, 23 December 2022,
https://doi.org/10.1002/cjs.11755.

Berntsson, P. and S. Wold (1986). Comparison between x-ray crystal-


lographic data and physicochemical parameters with respect to their
information about the calcium channel antagonist activity of 4-phenyl-1,4-
dihydropyridines. Molecular Informatics 5, 45–50.

Björck, Å. (1966). Numerical Methods for Least Squares Problems. SIAM.

Boggaard, C. and H. H. Thodberg (1992). Optimal minimal neural


interpretation of spectra. Analytical Chemistry 64 (5), 545–551.

Bollen, K. A. (1989). Structural Equations with Latent Variables. New York:


Wiley.

Box, G. E. P. and D. R. Cox (1964). An analysis of transformations. Journal


of the Royal Statistical Society B 26, 211–246.

Breiman, L. (2001). Statistical modeling: The two cultures (with comments


and a rejoinder by the author). Statistical Science 16 (3), 199–231.
Bibliography 389

Brereton, R. G. and G. R. Lloyd (2014). Partial least squares discriminant


analysis: taking the magic away. Journal of Chemometrics 28 (4), 213–225.

Bridgman, P. W. (1927). The Logic of Modern Physics. Interned Archive,


https://archive.org/details/logicofmodernphy0000pwbr/page/8/mode/2up.

Bro, R. (1998). Multi-way Analysis in the Food Industry, Model, Algorithms


and Applications. PhD. thesis, Doctoral Thesis, University of Amsterdam.

Bura, E., S. Duarte, and L. Forzani (2016). Sufficient reductions in regres-


sions with exponential family inverse predictors. Journal of the American
Statistical Association 111 (515), 1313–1329.

Chakraborty, C. and Z. Su (2023). A comprehensive Bayesian framework for


envelope models. Journal of the American Statistical Association Published
online, 24 August 2023 https://doi.org/10.1080/01621459.2023.2250096.

Chiappini, F. A., H. Allegrini, H. C. Goicoechea, and A. C. Olivieri (2020).


Sensitivity for multivariate calibration based on multilayer perceptron
artificial neural networks. Analytical Chemistry 92, 12265–12272.

Chiappini, F. A., H. C. Goicoechea, and A. C. Olivieri (2020). Mvc1-gui: A


matlab graphical user interface for first-order multivariate calibration. An
upgrade including artificial neural networks modelling. Chemometrics and
Intelligent Laboratory Systems 206, 104162.

Chiappini, F. A., F. Gutierrez, H. C. Goicoechea, and A. C. Olivieri (2021).


Interference-free calibration with first-order instrumental data and multi-
variate curve resolution. When and why? Analytica Chimica Acta 1161,
338465.

Chiappini, F. A., C. M. Teglia, A. G. Forno, and H. C. Goicoechea (2020).


Modelling of bioprocess non-linear fluorescence data for at-line prediction
of etanercept based on artificial neural networks optimized by response
surface methodology. Talanta 210, 120664.

Chiaromonte, F., R. D. Cook, and B. Li (2002). Sufficient dimension reduction


in regressions with categorical predictors. The Annals of Statistics 30 (2),
475–497.
390 Bibliography

Chun, H. and S. Keleş (2010). Sparse partial least squares regression for
simultaneous dimension reduction and predictor selection. Journal of the
Royal Statistical Society B 72 (1), 3–25.

Conway, J. (1994). A Course in Functional Analysis. Graduate Texts in


Mathematics. Springer New York.

Cook, R. D. (1994). Using dimension-reduction subspaces to identify impor-


tant inputs in models of physical systems. In Proceedings of the Section on
Engineering and Physical Sciences, pp. 18–25. Alexandria, VA: American
Statistical Association.

Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions


Through Graphics. Wiley Series in Probability and Statistics. New York:
Wiley.

Cook, R. D. (2000). SAVE: A method for dimension reduction and graphics in


regression. Communications in Statistics – Theory and Methods 29 (9–10),
2109–2121.

Cook, R. D. (2007). Fisher lecture: Dimension reduction in regression.


Statistical Science 22 (1), 1–26.

Cook, R. D. (2018). An Introduction to Envelopes: Dimension Reduction for


Efficient Estimation in Multivariate Statistics. Wiley Series in Probability
and Statistics. New York: Wiley.

Cook, R. D. (2020). Envelope methods. Wiley Interdisciplinary Reviews:


Computational Statistics 12 (2), e1484.

Cook, R. D. and L. Forzani (2008a). Covariance reducing models: An


alternative to spectral modelling of covariance matrices. Biometrika 95 (4),
799–812.

Cook, R. D. and L. Forzani (2008b). Principal fitted components for


dimension reduction in regression. Statistical Science 23 (4), 485–501.

Cook, R. D. and L. Forzani (2009). Likelihood-based sufficient dimension re-


duction. Journal of the American Statistical Association 104 (485), 197–208.

Cook, R. D. and L. Forzani (2018). Big data and partial least squares
prediction. The Canadian Journal of Statistics/La Revue Canadienne de
Statistique 47 (1), 62–78.
Bibliography 391

Cook, R. D. and L. Forzani (2019). Partial least squares prediction in high-dimensional regression. Annals of Statistics 47 (2), 884–908.

Cook, R. D. and L. Forzani (2020). Envelopes: A new chapter in partial least squares regression. Journal of Chemometrics 34 (10), e3294.

Cook, R. D. and L. Forzani (2021). PLS regression algorithms in the presence of nonlinearity. Chemometrics and Intelligent Laboratory Systems 213, 194307.

Cook, R. D. and L. Forzani (2023, November). On the role of partial least squares in path analysis for the social sciences. Journal of Business Research 167, 114132.

Cook, R. D., L. Forzani, and L. Liu (2023a, January). Envelopes for multivariate linear regression with linearly constrained coefficients. Scandinavian Journal of Statistics. Published online, 12 October 2023, https://doi.org/10.1111/sjos.12690.

Cook, R. D., L. Forzani, and L. Liu (2023b). Partial least squares for simultaneous reduction of response and predictor vectors in regression. Journal of Multivariate Analysis 196, https://doi.org/10.1016/j.jmva.2023.105163.

Cook, R. D., L. Forzani, and A. J. Rothman (2012). Estimating sufficient reductions of the predictors in abundant high-dimensional regressions. The Annals of Statistics 40 (1), 353–384.

Cook, R. D., L. Forzani, and A. J. Rothman (2013). Prediction in abundant high-dimensional linear regression. Electronic Journal of Statistics 7, 3059–3088.

Cook, R. D., L. Forzani, and Z. Su (2016, September). A note on fast envelope estimation. Journal of Multivariate Analysis 150, 42–54.

Cook, R. D., L. Forzani, and X. Zhang (2015). Envelopes and reduced-rank regression. Biometrika 102 (2), 439–456.

Cook, R. D., I. S. Helland, and Z. Su (2013). Envelopes and partial least squares regression. Journal of the Royal Statistical Society B 75 (5), 851–877.

Cook, R. D. and B. Li (2002). Dimension reduction for the conditional mean in regression. Annals of Statistics 30 (2), 455–474.

Cook, R. D., B. Li, and F. Chiaromonte (2010). Envelope models for parsimonious and efficient multivariate linear regression. Statistica Sinica 20 (3), 927–960.

Cook, R. D. and L. Ni (2005). Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. Journal of the American Statistical Association 100 (470), 410–428.

Cook, R. D. and C. M. Setodji (2003). A model-free test for reduced rank in multivariate regression. Journal of the American Statistical Association 46, 421–429.

Cook, R. D. and Z. Su (2013). Scaled envelopes: Scale-invariant and efficient estimation in multivariate linear regression. Biometrika 100 (4), 939–954.

Cook, R. D. and Z. Su (2016). Scaled predictor envelopes and partial least-squares regression. Technometrics 58 (2), 155–165.

Cook, R. D. and S. Weisberg (1982). Residuals and Influence in Regression. London: Chapman and Hall.

Cook, R. D. and S. Weisberg (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association 86 (414), 328–332.

Cook, R. D. and S. Weisberg (1999). Applied Regression Including Computing and Graphics (1st ed.). New York: Wiley.

Cook, R. D. and X. Yin (2001). Special invited paper: Dimension reduction and visualization in discriminant analysis (with discussion). Australian and New Zealand Journal of Statistics 43, 147–199.

Cook, R. D. and X. Zhang (2015a). Foundations for envelope models and methods. Journal of the American Statistical Association 110 (510), 599–611.

Cook, R. D. and X. Zhang (2015b). Simultaneous envelopes and multivariate linear regression. Technometrics 57 (1), 11–25.

Cook, R. D. and X. Zhang (2016). Algorithms for envelope estimation. Journal of Computational and Graphical Statistics 25, 284–300.

Cook, R. D. and X. Zhang (2018). Fast envelope algorithms. Statistica Sinica 28, 1179–1197.

Cox, D. R. and N. Reid (1987). Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society B 49, 1–39.

de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 18 (3), 251–263.

Despagne, F. and D. Luc Massart (1998). Neural networks in multivariate calibration. The Analyst 123 (11), 157R–178R.

Diaconis, P. and D. Freedman (1984). Asymptotics of graphical projection pursuit. Annals of Statistics 12 (3), 793–815.

Dijkstra, T. (1983). Some comments on maximum likelihood and partial least squares methods. Journal of Econometrics 22, 67–90.

Dijkstra, T. (2010). Latent variables and indices: Herman Wold's basic design and partial least squares. In E. V. Vinzi, W. W. Chin, J. Henseler, and H. Wang (Eds.), Handbook of Partial Least Squares, Chapter 1, pp. 23–46. New York: Springer.

Dijkstra, T. and J. Henseler (2015a). Consistent and asymptotically normal PLS estimators for linear structural equations. Computational Statistics and Data Analysis 81, 10–23.

Dijkstra, T. and J. Henseler (2015b). Consistent partial least squares path modeling. MIS Quarterly 39 (2), 297–316.

Ding, S. and R. D. Cook (2018). Matrix-variate regressions and envelope models. Journal of the Royal Statistical Society B 80, 387–408.

Ding, S., Z. Su, G. Zhu, and L. Wang (2021). Envelope quantile regression. Statistica Sinica 31 (1), 79–105.

Downey, G., R. Briandet, R. H. Wilson, and E. K. Kemsley (1997). Near- and mid-infrared spectroscopies in food authentication: Coffee varietal identification. Journal of Agricultural and Food Chemistry 45 (11), 4357–4361.

Eaton, M. L. (1986). A characterization of spherical distributions. Journal of Multivariate Analysis 20, 260–271.

el Bouhaddani, S., H.-W. Uh, C. Hayward, G. Jongbloed, and J. Houwing-Duistermaat (2018). Probabilistic partial least squares model: Identifiability, estimation and application. Journal of Multivariate Analysis 167, 331–346.

Elman, H. C. (1994). Iterative methods for linear systems. In J. Gilbert and D. Kershaw (Eds.), Large-Scale Matrix Problems and the Numerical Solution of Partial Differential Equations, pp. 69–117. Oxford, England: Clarendon Press.

Emara, N. and I. M. Chiu (2015). The impact of governance environment on economic growth: The case of Middle Eastern and North African countries. Journal of Economics Library 3 (1), 24–37.

Etiévant, L. and V. Viallon (2022). On some limitations of probabilistic models for dimension-reduction: Illustration in the case of probabilistic formulations of partial least squares. Statistica Neerlandica 76 (3), 331–346.

Evermann, J. and M. Rönkkö (2021). Recent developments in PLS. Communications of the Association for Information Systems 44, 123–132.

Fawaz, F., A. Mnif, and A. Popiashvili (2021, June). Impact of governance on economic growth in developing countries: A case of HIDC vs. LIDC. Journal of Social and Economic Development 23 (1), 44–58.

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 222 (594–604), 309–368.

Forzani, L., C. A. Percíncula, and R. Toledano (2022). Wishart moments calculator. Technical report, https://antunescarles.github.io/wishart-moments-calculator/.

Forzani, L., D. Rodriguez, E. Smucler, and M. Sued (2019). Sufficient dimension reduction and prediction in regression: Asymptotic results. Journal of Multivariate Analysis 171, 339–349.

Forzani, L., D. Rodriguez, and M. Sued (2023). Asymptotic results for nonparametric regression estimators after sufficient dimension reduction estimation. Technical report, arxiv.org/abs/2306.10537.

Forzani, L. and Z. Su (2020). Envelopes for elliptical multivariate linear regression. Statistica Sinica 31 (1), 301–332.

Frank, I. E. and J. H. Friedman (1993). A statistical view of some chemometrics regression tools. Technometrics 35 (2), 102–246.

Friedman, J., T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu (2004). Discussion of three papers on boosting. The Annals of Statistics 32 (1), 102–107.

Gani, A. (2011). Governance and growth in developing countries. Journal of Economic Issues 45 (1), 19–40.

Garthwaite, P. H. (1994). An interpretation of partial least squares. Journal of the American Statistical Association 89, 122–127.

Geladi, P. (1988). Notes on the history and nature of partial least squares (PLS) modeling. Journal of Chemometrics 2, 231–246.

Goh, G. and D. K. Dey (2019). Asymptotic properties of marginal least-square estimator for ultrahigh-dimensional linear regression models with correlated errors. The American Statistician 73 (1), 4–9.

Goicoechea, H. C. and A. C. Olivieri (1999). Enhanced synchronous spectrofluorometric determination of tetracycline in blood serum by chemometric analysis. Comparison of partial least-squares and hybrid linear analysis calibrations. Analytical Chemistry 71, 4361–4368.

Goodhue, D., W. Lewis, and R. Thompson (2023). Comments on Evermann and Rönkkö: Recent developments in PLS. Communications of the Association for Information Systems 52, 751–755.

Green, P. and B. Silverman (1994). Nonparametric Regression and Generalized Linear Models. Boca Raton, FL: CRC Press.

Guide, J. B. and M. Ketokivi (2015, July). Notes from the editors: Redefining some methodological criteria for the journal. Journal of Operations Management 37, v–viii.

Haaland, D. M. and E. V. Thomas (1988). Partial least-squares methods for spectral analyses. 1. Relation to other quantitative calibration methods and the extraction of qualitative information. Analytical Chemistry 60 (11), 1193–1202.

Haenlein, M. and A. M. Kaplan (2004). A beginner's guide to partial least squares analysis. Understanding Statistics 3 (4), 283–297.

Hall, P. and K. C. Li (1993). On almost linearity of low dimensional projections from high dimensional data. Annals of Statistics 21, 867–889.

Han, X., H. Khan, and J. Zhuang (2014). Do governance indicators explain development performance? A cross-country analysis. Asian Development Bank Economics Working Paper Series, No. 417.

Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science 21 (1), 1–14.

Hand, D. J. and K. Yu (2001). Idiot's Bayes – not so stupid after all? International Statistical Review 69, 385–398.

Hawkins, D. M. and S. Weisberg (2017). Combining the Box-Cox power and generalised log transformations to accommodate nonpositive responses in linear and mixed-effects linear models. South African Statistical Journal 51, 317–328.

Hayami, K. (2020). Convergence of the conjugate gradient method on singular systems. Technical report, arXiv: https://arxiv.org/abs/1809.00793.

Helland, I. S. (1990). Partial least squares regression and statistical models. Scandinavian Journal of Statistics 17 (2), 97–114.

Helland, I. S. (1992). Maximum likelihood regression on relevant components. Journal of the Royal Statistical Society B 54 (2), 637–647.

Helland, I. S., S. Sæbø, T. Almøy, and R. Rimal (2018). Model and estimators for partial least squares regression. Journal of Chemometrics 32 (9), e3044.

Henderson, H. and S. Searle (1979). Vec and vech operators for matrices, with some uses in Jacobians and multivariate statistics. The Canadian Journal of Statistics/La Revue Canadienne de Statistique 7 (1), 65–81.

Henseler, J., T. Dijkstra, M. Sarstedt, C. Ringle, A. Diamantopoulos, D. Straub, D. J. Ketchen Jr., J. F. Hair, T. Hult, and R. Calantone (2014). Common beliefs and reality about PLS: Comments on Rönkkö & Evermann (2013). Organizational Research Methods 17 (2), 182–209.

Hestenes, M. R. and E. Stiefel (1952). Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49 (6), 409–436.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24 (6), 417–441.

Indahl, U. (2005). A twist to partial least squares regression. Journal of Chemometrics 19 (1), 32–44.

Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis 5 (2), 248–264.

Jöreskog, K. G. (1970). A general method for estimating a linear structural equation model. ETS Research Bulletin Series 1970 (2), i–41.

Kalivas, J. H. (1997). Two data sets of near infrared spectra. Chemometrics and Intelligent Laboratory Systems 37 (2), 255–259.

Kenward, M. G. (1987). A method for comparing profiles of repeated measurements. Journal of the Royal Statistical Society C 36 (3), 296–308.

Kettaneh-Wold, N. (1992). Analysis of mixture data with partial least squares. Chemometrics and Intelligent Laboratory Systems 14, 57–69.

Khare, K., S. Pal, and Z. Su (2016). A Bayesian approach for envelope models. Annals of Statistics 45 (1), 196–222.

Kruskal, J. B. (1983). Statistical Data Analysis, Volume 28 of Proceedings of Symposia in Applied Mathematics, Chapter 4, pp. 75–104. Providence, Rhode Island: American Mathematical Society.

Lavoie, F. B., K. Muteki, and R. Gosselin (2019). A novel robust NL-PLS regression methodology. Chemometrics and Intelligent Laboratory Systems 184, 71–81.

Letac, G. and H. Massam (2004). All invariant moments of the Wishart distribution. Scandinavian Journal of Statistics 31 (2), 295–318.

Li, B. (2018). Sufficient Dimension Reduction: Methods and Applications with R. Chapman & Hall/CRC Monographs on Statistics and Applied Probability. Boca Raton, Florida: CRC Press.

Li, B., P. A. Hassel, A. J. Morris, and E. B. Martin (2005). A non-linear nested partial least squares algorithm. Computational Statistics & Data Analysis 48, 87–101.

Li, B. and S. Wang (2007). On directional regression for dimension reduction. Journal of the American Statistical Association 102, 997–1008.

Li, K. C. (1991). Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association 86 (414), 316–342.

Li, L., R. D. Cook, and C.-L. Tsai (2007). Partial inverse regression. Biometrika 94 (3), 615–625.

Li, L. and X. Zhang (2017). Parsimonious tensor response regression. Journal of the American Statistical Association 112 (519), 1131–1146.

Liland, K. H., M. Høy, H. Martens, and S. Sæbø (2013). Distribution based truncation for variable selection in subspace methods for multivariate regression. Chemometrics and Intelligent Laboratory Systems 122, 103–111.

Lindgren, F., P. Geladi, and S. Wold (1993). The kernel algorithm for PLS. Journal of Chemometrics 7 (1), 44–59.

Liu, Y. and W. Rayens (2007). PLS and dimension reduction for classification. Computational Statistics 22, 189–208.

Lohmöller, J.-B. (1989). Latent Variable Path Modeling with Partial Least Squares. New York: Springer.

Magnus, J. R. and H. Neudecker (1979). The commutation matrix: Some properties and applications. Annals of Statistics 7 (2), 381–394.

Manne, R. (1987). Analysis of two partial-least-squares algorithms for multivariate calibration. Chemometrics and Intelligent Laboratory Systems 1, 187–197.

Martens, H. and T. Næs (1989). Multivariate Calibration. New York: Wiley.

Martin, A. (2009). A comparison of nine PLS1 algorithms. Journal of Chemometrics 23 (10), 518–529.

McIntosh, C. N., J. R. Edwards, and J. Antonakis (2014). Reflections on partial least squares path modeling. Organizational Research Methods 17 (2), 210–251.

McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. New York, NY: Wiley.

Muirhead, R. J. (2005). Aspects of Multivariate Statistical Theory. New York: Wiley.

Nadler, B. and R. R. Coifman (2005). Partial least squares, Beer's law and the net analyte signal: Statistical modeling and analysis. Journal of Chemometrics 19, 435–54.

Næs, T. and I. S. Helland (1993). Relevant components in regression. Scandinavian Journal of Statistics 20 (3), 239–250.

Neudecker, H. and T. Wansbeek (1983). Some results on commutation matrices, with statistical applications. Canadian Journal of Statistics 11 (3), 221–231.

Nguyen, D. V. and D. M. Rocke (2004). On partial least squares dimension reduction for microarray-based classification: A simulation study. Computational Statistics and Data Analysis 46 (3), 407–425.

Olivieri, A. and G. M. Escandar (2014). Practical Three-Way Calibration. Elsevier.

Olivieri, A. C. (2018). Introduction to Multivariate Calibration: A Practical Approach. New York, NY: Springer.

Osborne, B. G., T. Fearn, A. R. Miller, and S. Douglas (1984). Application of near infrared reflectance spectroscopy to compositional analysis of biscuits and biscuit doughs. Journal of the Science of Food and Agriculture 35 (1), 99–105.

Pardoe, I., X. Yin, and R. D. Cook (2007). Graphical tools for quadratic discriminant analysis. Technometrics 49 (2), 172–183.

Park, Y., Z. Su, and D. Chung (2022). Envelope-based partial partial least squares with application to cytokine-based biomarker analysis for COVID-19. Statistics in Medicine 41 (23), 4578–4592.

Phatak, A. and F. de Hoog (2002). Exploring the connection between PLS, Lanczos methods and conjugate gradients: Alternative proofs of some properties of PLS. Journal of Chemometrics 16, 361–367.

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Radulović, M. (2020). The impact of institutional quality on economic growth: A comparative analysis of the EU and non-EU countries of Southeast Europe. Economic Annals 65 (225), 163–182.

Rao, C. R. (1979). Separation theorems for singular values of matrices and their applications in multivariate analysis. Journal of Multivariate Analysis 9, 362–377.

Reinsel, G. C. and R. P. Velu (1998). Multivariate Reduced-Rank Regression: Theory and Applications. New York: Springer.

Rekabdarkolaee, H. M., Q. Wang, Z. Naji, and M. Fuentes (2020). New parsimonious multivariate spatial model: Spatial envelope. Statistica Sinica 30 (3), 1583–1604. https://arxiv.org/abs/1706.06703.

Rigdon, E. E. (2012). Rethinking partial least squares path modeling: In praise of simple methods. Long Range Planning 45, 341–354.

Rigdon, E. E., J.-M. Becker, and M. Sarstedt (2019). Factor indeterminacy as metrological uncertainty: Implications for advancing psychological measurement. Multivariate Behavioral Research 54 (3), 429–443.

Rimal, R., T. Almøy, and S. Sæbø (2019). Comparison of multi-response prediction methods. Chemometrics and Intelligent Laboratory Systems 190, 10–21.

Rimal, R., T. Almøy, and S. Sæbø (2020). Comparison of multi-response estimation methods. Chemometrics and Intelligent Laboratory Systems 205, 104093.

Rönkkö, M. and J. Evermann (2013). A critical examination of common beliefs about partial least squares path modeling. Organizational Research Methods 16 (3), 425–448.

Rönkkö, M., C. N. McIntosh, J. Antonakis, and J. R. Edwards (2016a). Appendix A – Analysis file for R (R code for Rönkkö et al., 2016b). https://www.researchgate.net/publication/304253732 Appendix A - Analysis file for R.

Rönkkö, M., C. N. McIntosh, J. Antonakis, and J. R. Edwards (2016b). Partial least squares path modeling: Time for some serious second thoughts. Journal of Operations Management 47–48 (SC), 9–27.

Rosipal, R. (2011). Nonlinear partial least squares: An overview. In H. Lodhi and Y. Yamanishi (Eds.), Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques, Chapter 9, pp. 169–189. Hershey, Pennsylvania: IGI Global.

Rosipal, R. and N. Krämer (2006). Overview and recent advances in partial least squares. In C. Saunders, M. Grobelnik, S. Gunn, and J. Shawe-Taylor (Eds.), Subspace, Latent Structure and Feature Selection Techniques, pp. 34–51. New York: Springer.

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software 48 (2), 1–36.

Rothman, A. J., P. J. Bickel, E. Levina, and J. Zhu (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics 2, 495–515.

Royston, P. and W. Sauerbrei (2008). Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables. New York: Wiley.

Russo, D. and K.-J. Stol (2023). Don't throw the baby out with the bathwater: Comments on "Recent developments in PLS". Communications of the Association for Information Systems 52, 700–704.

Samadi, Y. S. and H. M. Herath (2021). Reduced-rank envelope vector autoregressive model. Technical report, School of Mathematical and Statistical Science, Southern Illinois University, Carbondale, IL.

Sarstedt, M., J. F. Hair, C. M. Ringle, K. O. Thiele, and S. P. Gudergan (2016). Estimation issues with PLS and CBSEM: Where the bias lies! Journal of Business Research 69 (10), 3998–4010.

Schönemann, P. H. and M.-M. Wang (1972). Some new results on factor indeterminacy. Psychometrika 37 (1), 61–91.

Shan, P., S. Peng, Y. Bi, L. Tang, C. Yang, Q. Xie, and C. Li (2014). Partial least squares–slice transform hybrid model for nonlinear calibration. Chemometrics and Intelligent Laboratory Systems 138, 72–83.

Shao, Y., R. D. Cook, and S. Weisberg (2007). Marginal tests with sliced average variance estimation. Biometrika 94 (2), 285–296.

Shao, Y., R. D. Cook, and S. Weisberg (2009). Partial central subspace and sliced average variance estimation. Journal of Statistical Planning and Inference 139 (3), 952–961.

Sharma, P. N., B. D. Liengaard, M. Sarstedt, J. F. Hair, and C. Ringle (2022). Extraordinary claims require extraordinary evidence: A comment on "Recent developments in PLS". Communications of the Association for Information Systems 52, 739–742.

Simonoff, J. S. (1996). Smoothing Methods in Statistics. New York, NY: Springer.

Skagerberg, B., J. MacGregor, and C. Kiparissides (1992). Multivariate data analysis applied to low-density polyethylene reactors. Chemometrics and Intelligent Laboratory Systems 14, 341–356.

Small, C. G., J. Wang, and Z. Yang (2000). Eliminating multiple root problems in estimation (with discussion). Statistical Science 15 (4), 313–341.

Stocchero, M. (2019). Iterative deflation algorithm, eigenvalue equations, and PLS2. Journal of Chemometrics 33 (10), e3144.

Stocchero, M., M. de Nardi, and B. Scarpa (2020). An alternative point of view on PLS. Chemometrics and Intelligent Laboratory Systems 222, 104513.

Su, Z. and R. D. Cook (2011). Partial envelopes for efficient estimation in multivariate linear regression. Biometrika 98 (1), 133–146.

Su, Z. and R. D. Cook (2012). Inner envelopes: Efficient estimation in multivariate linear regression. Biometrika 99 (3), 687–702.

Su, Z. and R. D. Cook (2013). Estimation of multivariate means with heteroscedastic errors using envelope models. Statistica Sinica 23, 213–230.

Tapp, H. S., M. Defernez, and E. K. Kemsley (2003). FTIR spectroscopy and multivariate analysis can distinguish the geographic origin of extra virgin olive oils. Journal of Agricultural and Food Chemistry 51 (21), 6110–6005.

Tenenhaus, M. and E. V. Vinzi (2005). PLS regression, PLS path modeling, and generalized Procrustean analysis: A combined approach for multiblock analysis. Journal of Chemometrics 19 (3), 145–153.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58, 267–288.

Tipping, M. E. and C. M. Bishop (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society B 61 (3), 611–622.

Vinzi, E. V., L. Trinchera, and S. Amato (2010). PLS path modeling: From foundations to recent developments and open issues for model assessment and improvement. In E. V. Vinzi, W. W. Chin, J. Henseler, and H. Wang (Eds.), Handbook of Partial Least Squares, Chapter 2, pp. 47–82. Berlin: Springer-Verlag.

von Rosen, D. (2018). Bilinear Regression Analysis. Number 220 in Lecture Notes in Statistics. Switzerland: Springer.

Wang, L. and S. Ding (2018). Vector autoregression and envelope model. Stat 7 (1), e203.

Wegelin, J. A. (2000). A survey of partial least squares (PLS) methods, with emphasis on the two-block case. Technical Report 371, Department of Statistics, University of Washington.

Wen, X. and R. D. Cook (2007). Optimal sufficient dimension reduction in regressions with categorical predictors. Journal of Statistical Planning and Inference 137, 1961–1975.

Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P. R. Krishnaiah (Ed.), Multivariate Analysis, pp. 392–420. New York: Academic Press.

Wold, H. (1975a). Path models with latent variables: The NIPALS approach. In H. M. Blalock, A. Aganbegian, F. M. Borodkin, R. Boudon, and V. Capecchi (Eds.), Quantitative Sociology, Chapter 11, pp. 307–357. London: Academic Press.

Wold, H. (1975b). Soft modelling by latent variables: The non-linear iterative partial least squares (NIPALS) approach. Journal of Applied Probability 12 (S1), 117–142.

Wold, H. (1982). Soft modeling: The basic design and some extensions. In K. G. Jöreskog and H. Wold (Eds.), Systems Under Indirect Observation: Causality, Structure, Prediction, Vol. 2, pp. 1–54. Amsterdam: North-Holland.

Wold, S. (1992). Nonlinear partial least squares modeling II. Spline inner relation. Chemometrics and Intelligent Laboratory Systems 14 (1–3), 71–84.

Wold, S., N. Kettaneh, and K. Tjessem (1996). Hierarchical multiblock PLS models for easier model interpretation and as an alternative to variable selection. Journal of Chemometrics 10, 463–482.

Wold, S., H. Martens, and H. Wold (1983). The multivariate calibration problem in chemistry solved by the PLS method. In A. Ruhe and B. Kågström (Eds.), Proceedings of the Conference on Matrix Pencils, Lecture Notes in Mathematics, Vol. 973, pp. 286–293. Heidelberg: Springer Verlag.

Wold, S., J. Trygg, A. Berglund, and H. Antti (2001). Some recent developments in PLS modeling. Chemometrics and Intelligent Laboratory Systems 7, 131–150.

Yeh, I. C. (2007). Modeling slump flow of concrete using second-order regressions and artificial neural networks. Cement and Concrete Composites 29 (6), 474–480.

Zhang, J. and X. Chen (2020). Principal envelope model. Journal of Statistical Planning and Inference 206, 249–262.

Zhang, X. and L. Li (2017). Tensor envelope partial least-squares regression. Technometrics 59 (4), 426–436.

Zhang, X. and Q. Mai (2019). Efficient integration of sufficient dimension reduction and prediction in discriminant analysis. Technometrics 61 (2), 259–272.

Zheng, W., X. Fu, and Y. Ying (2014). Spectroscopy-based food classification with extreme learning machine. Chemometrics and Intelligent Laboratory Systems 139, 42–47.

Zhu, G. and Z. Su (2020). Envelope-based sparse partial least squares. Annals of Statistics 48 (1), 161–182.
Index

Abundant regressions, see Asymptotic traits of PLS, Abundance; Regression, abundant
Asymptotic traits of PLS, 110–154
    Abundance, 111
    Multi-component regressions, 111, 142–149
        Abundance v. sparsity, 144, 155
        Bounded ΣX, 145, 148
        Consistency, 146–148
        Illustration, 154
        Noise, 143
        Signal, 143, 149
    One-component regressions, 112–141
        Abundance, 111, 122–124, 126–129
        Bias, 130–139
        Confidence intervals, 112, 132–133, 141
        Consistency, 120–129
        Distribution of PLS estimators, p → ∞, 129–141
        Distribution of PLS estimators, fixed p, 114–120
        Illustration, 150–154
        Model, 112
        n < p, 127–129
        Notation, 112
        Sparsity, 123, 126–129, 152
    Sparsity, 111
    Synopsis, 111–112

Bilinear model, 167, 311–316
Bilinear probabilistic PLS model, 313–316
Box-Cox transformations, 264

Canonical correlation, 161
Cayley-Hamilton theorem, 17
Central mean subspace, see Subspaces, Central mean
Central subspace, see Subspaces, Central
Collinearity, 36, 38, 59
Conjugate gradient algorithm, 316–322
Cross validation, 3, 4, 6, 73, 78, 86, 99, 108, 149, 180, 193, 197, 216, 222, 223, 242–244, 246, 265, 266, 277, 320, 332

Dimension reduction subspace, see Subspaces, Dimension reduction
Discriminant analysis, see Linear and Quadratic discriminant analysis
Double index models, 252

Elliptically contoured variate, 254
Envelopes, 2, 20–34
    Algorithms, 22–34
        NIPALS generalization, 25–31, 72, 108, 309–311
        SIMPLS generalization, 31–32, 87, 108, 309–311
        1D, 33
        Eigenspace, 22
        Krylov, 23, 100
        Likelihood-based, 32, 45, 58
        Component screening, 34
        Non-Grassman, 33
    Bayesian, 36
    Definition, 20
    Discriminant subspace, ENDS, 238, 244
    Generalized, 239
    Goal, 3, 35
    Partial, definition of, 186
    Predictor, see Predictor envelopes
    Properties of, 20–22
    Response, see Response envelope
    Simultaneous, 159
    Variable scaling, 64–68
Examples
    Australian Institute of Sport, 183–184
    Birds, planes, cars, 244–245
    Coffee data, 224–225, 246–248
    Composition of biscuit dough, 180
    Corn moisture, 3
    Economic growth prediction, 194–198
    Etanercept data, 276–278
    Fruit, 246
    Gasoline data, 59–61
    Gut worm in cattle, 54
    Ingredients in concrete, 180
    Meat protein, 6
    Mussels muscles, 117–119
    Olive oil, 225–228, 246–248
    Polyethylene reactor, 180
    Reactor data, 205–206
    Serum tetracycline, 6
    Solvent data, 278–280
    Tecator data, 267–276

Fisher scoring, 329

Generalized linear models, 328–332
    PLS algorithm for, 331
    PLS GLM illustration, 331
Gram-Schmidt orthogonalization, 98
Grassmannian, 33–34

Helland's algorithm, 100–104, 106

LAD, 40
Likelihood acquired directions, see LAD
Likelihood v. PLS estimators, 104–106
Linear discriminant analysis, 211–212
    Classification function, 211
    Illustrations, 223–228
    Overview, 222
    Reduced-rank versions, 220–222
    via Envelopes, 216–218
    via PFC, 215
    via PLS, 218
Linearity condition, see Nonlinear PLS, Linearity condition for
Longitudinal data, 53–59, 203–205

Measurement theories, 289
Multi-way predictors and PLS, 323–325
Multivariate linear model, 7
    Estimation, 12–14
        OLS estimator, 12
        Standardized OLS estimator, 13
        Z-score, 14

n < p, 32, 36, 38, 41
    Asymptotic results, 127–129
    Likelihood-based estimators, 105
    NIPALS, 73, 77
    PLS, 1, 48
Nadler-Coifman model, 49, 62
NIPALS, 1, 2, 18, 23, 48, 62, 64, 106–108
    for nonlinear regression, see Nonlinear PLS
NIPALS for predictor reduction, 70–86
    Algorithm, 71–73, 326
    Algorithm N, 72, 108
    Bare bones algorithm, 77, 78
    Conjugate gradient interpretation, 316–320
    Cross validation, 78
    Envelope structure, 83–86
    Fitted values, 76
    Population details, 78
    Predictor transformations and, 99
    Principal components and, 77–79
    Rank deficiency, 77
    Response transformations and, 99
    Sample structure, 75
    Synopsis, 70
    Weights, 85
Nonlinear PLS, 36, 249–280
    Illustrations, 267–280
    Linearity condition for, 253–255
    NIPALS, 249, 250, 255–257, 260–263
    Prediction, 263–267
        Approaches, 263–264
        Five methods for, 271–276
        Number of components, 263
        via Inverse regression, 264–267
    Removing linear trends, 261–263
    SIMPLS, 249
    Synopsis, 250
Notation, xxi, 9–12

One-hot coding, 207

Path modeling, 281, 284–290
    cb|sem analysis, 282, 308
    pls|sem analysis, 282, 307
    Bias, 299–300, 308
    Constructs, 285, 288–289
    CORE, 285
    Estimators of construct association, 291–296
    Formative, 287
    Indicators, 285
    Lavaan package, 301
    Matrixpls, 301
    PLS vs CB debate, 282–283
    Reflective, 284, 286
    Scaling indicators, 295, 296, 307, 308
    UNCORE, 285
    Unit-weighted summed composites, 308
PCA, 197–198
PFC, 212–214
PLS regression
    Culture, 1, 2
    Designed experiments, 51
    NIPALS, see NIPALS
    Origins, 1
    Partial, 185, 192, 197–198
    PLS1 v. PLS2, 12, 69, 106–108
    SIMPLS, see SIMPLS
    Stochastic predictors in, 9
    Variable scaling, 68
Predictor envelopes
    Model-free, 38–41
        Definition, 39
        vs SIR, SAVE, LAD, 40
    Multivariate linear model, 41–48
        Likelihood estimation, 43–47
        PLS link, 48
        Parameter count, 42
    Partial, 185–198
    Response envelope connection, 50–51
preface, xvii
Principal component analysis, see PCA
Principal fitted components, see PFC
Projection
    Column space of X, 13
    Definition, 10
    in Conditional expectations, 10
    Relationships between, 79
    when PR(M) = PR, 19

Quadratic discriminant analysis, 229–248
    Classification function, 229–231
    Dimension reduction for, 231–234
    Envelope discriminant subspace, ENDS, 238, 244
    Illustrations, 243–248
    Overview, 242–243
    via Algorithms N and S, 239–241
    via Central subspace, 234–236
    via Envelopes, 236–237

Reduced rank regression, 284, 290–292
Regression
    Abundant, 1–2, 111, 122–129, 137, 143–149, 152, 300
    Added variable plot, 15
    Centering, 8
    Collinearity, 2
    Lasso, 3, 7
    PLS, see PLS regression
    RRR, see Reduced rank regression
    Sparse, 2
    Stochastic predictors, 8
Response envelope
    Ancillary predictors, 51–52
    Definition, 50
    Gasoline data, 59–61
    Likelihood estimation, 55–59
    Longitudinal data, 53–55
    Model-free, 48–51
    Multivariate linear model, 52–62
    Partial, 199–206
    PLS link, 62
    Predictor envelope connection, 50–51
Response reduction, 35, 48, 51, 55, 70
    Partial, 199–206
    via NIPALS and SIMPLS, 62, 108–109

SAVE, 40, 185
SIMPLS, 2, 48, 62, 64, 106–108
SIMPLS for predictor reduction, 86–95
    Algorithm S, 87
    Conjugate gradient interpretation, 316–320
    Cross validation, 86
    Envelope structure, 92
    Sample structure, 90
    Synopsis, 86
    Weights, 94
SIMPLS v. NIPALS, 96–99
    Common features, 98
Simultaneous predictor-response reduction, 156–181
    Envelope, 159
    NIPALS algorithm, 166
    two-block algorithm, 169
Single index models, 252
SIR, 40
    Partial, 183
Sliced average variance estimation, see SAVE
Sliced inverse regression, see SIR
Soft modeling v. hard modeling, 103
Sparse PLS, 322
Subspaces
    Central, 37
    Central mean, 250–253
    Cyclic invariant, 17
    Dimension reduction, 37
    Discriminant, 208, 231–234
    Envelope discriminant, 208–211
        Definition, 210
    Invariant and reducing, 15–19, 41, 72, 74, 84, 89, 92, 93
    Krylov, 17
    NIPALS, 83
    Partial central, 183, 185

Two-block algorithm, 169

Vandermonde matrix, 24
Variance inflation factors, 143, 148
