
Multivariate Data Analysis

– In Practice
5th Edition

An Introduction to
Multivariate Data Analysis
and Experimental Design

Kim H. Esbensen
Ålborg University, Esbjerg

with contributions from


Dominique Guyot
Frank Westad
Lars P. Houmøller

CAMO Software AS, Nedre Vollgate 8, N-0158 Oslo, Norway


Tel: +47 2239 6300
Fax: +47 2239 6322
www.camo.com
This book was produced using Doc-to-Help together with Microsoft Word. Visio and
Excel were used to make some of the illustrations. The screen captures were taken with
Paint Shop Pro.

Trademark Acknowledgments
Doc-To-Help is a trademark of WexTech Systems, Inc.
Microsoft is a registered trademark and Windows 95, Windows NT, Excel and Word
are trademarks of the Microsoft Corporation.
PaintShop Pro is a trademark of JASC, Inc.
Visio is a trademark of the Shapeware Corporation.

Information in this book is subject to change without notice. No part of this document
may be reproduced or transmitted in any form or by any means, electronic or
mechanical, for any purpose, without the express written permission of CAMO Software
AS.

ISBN 82-993330-3-2

 1994 – 2002 CAMO Software AS

All rights reserved.

5th edition. Re-print August 2002



Preface
October 2001
Learning to do multivariate data analysis is in many ways like learning
to drive a car: You are not let loose on the road without mandatory
training, theoretical and practical, as required by current concern for
traffic safety. As a minimum you need to know how a car functions and
you need to know the traffic code. On the other hand, everybody would
agree that the real practical learning only begins after you have
obtained your driver's license. This is when your personal experience
really starts to accumulate. There is a strong interaction between the
theory absorbed and the practice gained in this secondary, personal
training period.

Please substitute "multivariate data analysis" for "driving a car" in all of
the above. In this context, too, you are not let loose on the data analytical
road without mandatory training, theoretical and practical. The
analogy is actually very apt!

This book presents a basic theoretical foundation for bilinear


(projection-based) multivariate data modeling and gives a conceptual
framework for starting to do your own data modeling on the data sets
provided. There are some 25 data sets included in this training package.
By doing all exercises included you’re off to a flying start!

Driving your newly acquired multivariate data analysis car is very much
an evolutionary process: this introductory textbook is filled with
illustrative examples, many practical exercises and a full set of self-
examination real-world data analysis problems (with corresponding data
sets). If, after all of this, you are able to work confidently on your own
applications, you’ll have reached the goal set for this book.


This is the 5th revised edition of this book. The first three editions were
mainly reprints, the only major change being the inclusion of a
completely revised chapter, "Introduction to experimental design",
which first appeared in the 3rd edition (CAMO). The 4th revised
edition, however (published March 2000), saw many major
extensions and improvements:

• Text completely rewritten by the senior author, based on five years of
extensive use in teaching at both university and dedicated course
levels. More than 5,500 copies in use.

• 30% new theory & text material added, reflecting extensive student
response, full integration of PCA, PLS1 & PLS2 NIPALS algorithms
and explanations.

• Text revised with an augmented self-learning objective throughout.

• Four new master data sets added (with extended self-exercise potential):

1. Master violin data (PCA/PLS)


2. Norwegian car dealerships (PCA/PLS)
3. Vintages (PCA/PLS)
4. Acoustic chemometric calibration (PCR/PLS)

• Additional chapter on experimental design: new features include
mixture designs and D-optimal designs.

• New chapter on the powerful, novel "Martens' Uncertainty Test".

• Comprehensive glossary of terms.

This 5th edition also includes essential additional revisions and improvements:

• Lars P. Houmøller, Ålborg University Esbjerg, has carried out a
complete work-through of all demonstrations and exercises. Many of
these had not been updated with respect to several of the intervening
UNSCRAMBLER software versions. We are happy to have finally
eliminated this most frustrating nuisance.


About the authors


Kim H. Esbensen, Ph.D., has more than 20 years of experience in
multivariate data analysis and applied chemometrics. He was professor
of chemometrics at the Norwegian Telemark Institute of Technology
(HIT/TF), Institute of Process Technology (PT), 1995-2001, where he
was also head of the Chemometrics Department at Tel-Tek, Telemark
Industrial R&D Center, Porsgrunn. Between these institutions he
founded ACRG, the Applied Chemometrics Research Group (HIT/TF-
Tel-Tek), which among other things hosted SSC6, the 6th Scandinavian
Symposium on Chemometrics, in August 1999, as well as numerous other
international courses, workshops and meetings.

On July 1st, 2001 he moved to a position as research professor in Applied


Chemometrics at Ålborg University, Esbjerg, Denmark (AUE), where he
is currently leading ACACSRG: the Applied Chemometrics, Analytical
Chemistry and Sampling Research Group. As the name implies, applied
chemometrics activities continue in Esbjerg while new activities are
added – most notably through close collaboration with assoc. prof. Lars
P. Houmøller, who independently built up the area of analytical
chemistry/chemometrics at AUE before Prof. Esbensen’s arrival. Most
recently the discipline of sampling (proper sampling) has been added, in
recognition of the immense importance of sampling in any data
analytical discipline, including chemometrics.

Kim H. Esbensen has published more than 60 papers and technical


reports on a wide range of chemical, geochemical, industrial,
technological, remote sensing, image analytic and acoustic chemometric
applications. Together with Paul Geladi he has been instrumental in co-
developing the concept of Multivariate Image Analysis (MIA); with
ACRG he pioneered the development of the novel area of acoustic
chemometrics.

His M.Sc. is from the University of Aarhus, Denmark, in 1978 (geology,
geochemistry), while a Ph.D. was conferred on him by the Technical
University of Denmark (DTH) in 1981 within the areas of metallurgy,
meteoritics and multivariate data analysis. He then did post-doctoral
work for two years with the Research Group for Chemometrics at the
University of Umeå, 1980-1981, after which he worked in a Swedish
geochemical exploration company, Terra Swede, for two more years.
After moving to Norway, this was followed by eight years as a data
analytical research scientist at the Norwegian Computing Center (NCC), Oslo,

after which he became a senior research scientist at SINTEF, the


Norwegian Foundation for Industrial and Technological Research for
four additional years. In between these two assignments he was a
visiting guest professor at Norsk Hydro’s Research Center in Bergen,
Norway. He also holds a position as Chercheur associé (now Chercheur
affilié) du Centre de Recherche en Géomatique, Université Laval,
Quebec. He is a member of the editorial board of Journal of
Chemometrics, Wiley Publishers, and is a member of ICS, AGU and
several other geological, data analytical and statistical associations.

Dominique Guyot, educated in Statistics, Economics and


Biomathematics (ENSAE and Université de Paris 7, France), has 15
years of experience in the field of chemometrics. She gained industrial
experience from her work in the pharmaceutical and cosmetic industries,
before joining CAMO, where she worked from 1995 until 2000. At CAMO,
Dominique worked as a Senior Consultant and was particularly involved in food
applications. She put together a practical strategy for efficient product
development, based on experimental design and multivariate data
analysis. This strategy was implemented in the Guideline®+ software
package, complemented by an integrated training course focusing on
multivariate methods for food product developers. Dominique is now
studying music and singing at the Conservatoire of Trondheim, Norway.

Frank Westad has an M.Sc. in physical chemistry from the University of
Trondheim, Norway. He has 13 years of experience in applied multivariate
data analysis, and he completed a Ph.D. in multivariate regression in
2000. Frank has given numerous courses in experimental design and
multivariate analysis for companies in Europe and in the U.S.A. His
main research fields include variable selection, shift modelling and
image analysis.

Lars P. Houmøller has an M.Sc. in chemistry and physics from the


University of Aarhus, Denmark. He has 12 years of experience in
analytical chemistry and has worked 5-7 years with chemometrics. His
teaching experiences include chemometrics, analytical chemistry,
spectroscopy, physical chemistry, general and technical chemistry,
organic and inorganic chemistry, unit operations and fluid dynamics. His
research field covers NIR spectroscopic applications over a very broad
industrial spectrum. He also has experience from working in the Danish
food production industry.


E-mail interaction with the authors:


Kim Esbensen [email protected]

Dominique Guyot [email protected]


Frank Westad [email protected]
Lars P. Houmøller [email protected]

About this book


Since 1986, when CAMO ASA first commercialized and started
marketing THE UNSCRAMBLER, many customers have asked for
basic, easy-to-understand literature on chemometrics. In 1993 a group of
data analysts at different competence levels was invited to a one-day
seminar at CAMO, Trondheim, to discuss their experience from both
learning and teaching chemometrics. The result was a blueprint outline
for what came to be this introductory book: the specifications called
for a comprehensive training package, involving basic, practical,
easy-to-read, largely non-mathematical theory, with plenty of hands-on
examples and exercises on real-world data sets. CAMO contracted
SINTEF to write this book (first three editions), and the parties agreed to
cooperate on completing the full training package.

In the intervening years, this book was published in some 4,500 copies
and was used for the introductory basic training in some 15 universities
and in several hundred industrial companies; reactions were many and
largely constructive. We learned a lot from these criticisms; we thank all
who contributed!

Come 1999, the time was ripe for a complete revision of the entire
package. This was undertaken by the senior author in the summer of 1999
with significant assistance from his then Ph.D. student Jun Huang (now
with CAMO, Norway); Frank Westad (Matforsk), who wrote chapter 14
(Martens' Uncertainty Test); Dominique Guyot (CAMO), who wrote the
entirely new chapter 17 (Complex Experimental Design Problems);
and with further invaluable editorial and managerial
contributions from Michael Byström (CAMO) and Valérie Lengard
(CAMO). A most sincere thank you goes to Peter Hindmarch (CAMO,
UK) for very effective linguistic streamlining of the 4th edition! The
authors and CAMO also take this opportunity to acknowledge Suzanne
Schönkopf’s (CAMO) contribution to editions previous to the 4th one.


The present edition of this book still bears the fruit of her very important
past efforts.

The publication of the 4th edition, in March 2000, was unfortunately
somewhat marred by a less than complete revision of the exercises and
illustrative UNSCRAMBLER runs in the book, which was not
considered fatal at the time. This soon proved to be a serious mistake:
disappointment and frustration from several generations of students, who
wanted to follow all the exercises closely, followed rapidly. A Danish
university teacher who had himself experienced this frustration at close
quarters when using the book for his own teaching, assoc. prof. Lars P.
Houmøller at the University of Ålborg, Esbjerg, voluntarily took it upon
himself to carry out a complete work-through of this essential didactic
aspect of the book. His very valuable demo and exercise revisions, as
well as a very thorough text consistency check, have now been included
in toto in the 5th edition.

Today, this book is a collaborative effort between the senior author and
CAMO Process AS; the tie with SINTEF is now defunct.

There is little academic glamour in writing an introductory-level
textbook, as the senior author has learned well - not that glamour was
ever the goal. On the other hand, the introductory level is definitely
where the largest audience and potential market exist, as CAMO has
learned equally well. The senior author has used the book for six
consecutive years teaching introductory chemometrics, largely to
engineering (M.Sc.) students, as well as for extensive course work in
industrial and foreign university environments. The accumulated
response from some 500 students has made this author happy, while some
5,500 sales have made CAMO equally satisfied.

Thus all is well with the training package! We hope that this revised 5th
edition, now in an improved form, will continue to meet the challenging
demands of the market. Writing for precisely this introductory
audience/market constitutes the highest scientific and didactic
challenge, and is thus (still) irresistible!


Acknowledgements
The authors wish to thank the following persons, institutions and
companies for their very valuable help in the preparation of this training
package:

Hans Blom, Østlandskonsult AS, Fredrikstad, Norway


Frode Brakstad, Norsk Hydro F-Center, Porsgrunn, Norway
Rolf Carlson, Department of Chemistry, University of Tromsø, Norway
Chevron Research & Technology Co, Richmond, CA, USA
Lennart Eriksson, Dept. of Organic Chemistry, University of Umeå,
Sweden (now with Umetrics, Inc.)
Professor Magni Martens, The Royal Veterinary & Agricultural
University, Denmark
Geological Survey of Greenland, Denmark
IKU, Institute for Petroleum Research, Trondheim, Norway
Norwegian Food Research Institute (MATFORSK), Ås, Norway
Norwegian Society of Process Control
Norwegian Chemometrics Society
International Chemometrics Society
UOP Guided Wave, CA, USA
Pierre Gy, Cannes, France (for a gentleman’s introduction to the finest
French wines)
Zander & Ingerström, Oslo, Norway
Tomas Öberg Konsult AB, Karlskoga, Sweden
KAPITAL (weekly Norwegian economic magazine), no 14/1994, p50-55
Hlif Sigurjonsdottir, Reykjavik, Iceland (owner of G. Sgarabotto “violin
no 9”)
Birgitta Spur, LSO, Reykjavik, Iceland (permission to use the Sgarabotto
oeuvre data)
Sensorteknikk A/S, Bærum, Oslo (Bjørn Hope: sensor technology
entrepreneur extraordinaire; Evy: for innumerable occasions: warm
company, coffee and waffles, waffles, waffles)
Thorbjørn T. Lied, Maths Halstensen, Tore Gravermoen, Rune Mathisen
a.o. (for enormous help in developing acoustic chemometrics)
“Anonymous wine importer”, Odense, Denmark.
Helpful wine assessors (partly anonymous), Manson, Wa, USA.

Finally the author(s) and CAMO wish to thank all THE


UNSCRAMBLER users during the last seven years for their close
relationships with us, which have given us so much added experience in


teaching multivariate data analysis. And thanks for all the constructive
criticism to the earlier editions of this book. Last, but certainly not least,
a warm thank you to all the students at HIT/TF, at Ålborg University,
Esbjerg and many, many others, who have been associated with the
teachings of the authors, nearly all of whom have been very constructive
in their ongoing criticism of the entire teaching system embedded in this
training package. We even learned from the occasional not-so-friendly
criticisms…

Communication
The training package has come of age after a formative period of seven
years. By now we are actually beginning to be rather satisfied with it!

And yet: The author(s) and CAMO always welcome all critical
responses to the present text. They are seriously needed in order for this
work to keep improving.


Contents

1. Introduction to Multivariate Data Analysis -


Overview 1
1.1 Indirect Observations and Correlation 1
1.2 Hidden Data Structures 7
1.3 Multivariate Data Analysis vs. Multivariate Statistics 9
1.4 Main Objectives of Multivariate Data Analytical Techniques 9
1.5 Multivariate Techniques as Projections 11

2. Getting Started - with Descriptive Statistics 13


2.1 Purpose 13
2.2 Data Set 1: Quality of Green Peas 13
2.3 Data set 2: Economic Characteristics of Car Dealerships in
Norway 17

3. Principal Component Analysis (PCA) –


Introduction 19
3.1 Representing the Data as a Matrix 19
3.2 The Variable Space - Plotting Objects in p Dimensions 20
3.3 Plotting Objects in Variable Space 21
3.3.1 Exercise - Plotting Raw Data (People) 22
3.4 The First Principal Component 27
3.5 Extension to Higher-Order Principal Components 30
3.6 Principal Component Models - Scores and Loadings 31
3.6.1 Model Center 32
3.6.2 Loadings - Relations Between X and PCs 33
3.6.3 Scores - Coordinates in PC Space 34
3.6.4 Object Residuals 35
3.7 Objectives of PCA 35
3.8 Score Plot - “Map of Samples” 36
3.9 Loading Plot - “Map of Variables” 40


3.10 Exercise: Plotting and Interpreting a PCA-Model (People) 47


3.11 PC-Models 54
3.11.1 The PC Model: X = TP T + E = Structure + Noise 54
3.11.2 Residuals - The E-Matrix 58
3.11.3 How Many PCs to Use? 61
3.11.4 Variable Residuals 64
3.11.5 More about Variances - Modeling Error Variance 65
3.12 Exercise - Interpreting a PCA Model (Peas) 66
3.13 Exercise - PCA Modeling (Car Dealerships) 68
3.14 PCA Modeling – The NIPALS Algorithm 72

4. Principal Component Analysis (PCA) - In Practice 75


4.1 Scaling or Weighting 75
4.2 Outliers 78
4.2.1 Scaling, Transformation and Normalization are Highly
Problem Dependent Issues 80
4.3 PCA Step by Step 81
4.3.1 The Unscrambler and PCA 84
4.4 Summary of PCA 85
4.4.1 Interpretation of PCA-Models 88
4.4.2 Interpretation of Score Plots – Look for Patterns 89
4.4.3 Summary - Interpretation of Score Plots 93
4.4.4 Summary - Interpretation of Loading Plots 94
4.5 PCA - What Can Go Wrong? 95
4.6 Exercise - Detecting Outliers (Troodos) 97

5. PCA Exercises – Real-World Application Examples 105


5.1 Exercise - Find Clusters (Iris Species Discrimination) 105
5.2 Exercise - PCA for Experimental Design (Lewis Acids) 107
5.3 Exercise - Mud Samples 109
5.4 Exercise - Scaling (Troodos) 112

6. Multivariate Calibration (PCR/PLS) 115


6.1 Multivariate Modeling (X,Y): The Calibration Stage 115
6.2 Multivariate Modeling (X, Y): The Prediction Stage 116
6.3 Calibration Set Requirements (Training Data Set) 118
6.4 Introduction to Validation 120
6.5 Number of Components (Model Dimensionality) 122
6.6 Univariate Regression (y|x) and MLR 124


6.6.1 Univariate Regression (y|x) 124


6.6.2 Multiple Linear Regression, MLR 125
6.7 Collinearity 127
6.8 PCR - Principal Component Regression 128
6.8.1 Exercise - Interpretation of Jam (PCR) 130
6.8.2 Weaknesses of PCR 136
6.9 PLS- Regression (PLS-R) 137
6.9.1 PLS - A Powerful Alternative to PCR 137
6.9.2 PLS (X,Y): Initial Comparison with PCA(X), PCA(Y) 137
6.9.3 PLS2 – NIPALS Algorithm 139
6.9.4 Interpretation of PLS Models 143
6.9.5 The PLS1 NIPALS Algorithm 144
6.9.6 Exercise - Interpretation of PLS1 (Jam) 145
6.9.7 Exercise - Interpretation PLS2 (Jam) 147
6.10 When to Use which Method? 149
6.10.1 Exercise - Compare PCR and PLS1 (Jam) 150
6.11 Summary 153

7. Validation: Mandatory Performance Testing 155


7.1 The Concept of Test Set Validation 155
7.1.1 Calculating the Calibration Variance (Modeling Error) 157
7.1.2 Calculating the Validation Variance (Prediction Error) 158
7.1.3 Studying the Calibration and Validation Variances 159
7.2 Requirements for the Test Set 161
7.3 Cross Validation 163
7.4 Leverage Corrected Validation 168

8. How to Perform PCR and PLS-R 171


8.1 PLS and PCR - Step by Step 171
8.2 Optimal Number of Components in Modeling 172
8.3 Information in Later PCs 173
8.4 Exercises on PLS and PCR: the Heart-of-the-Matter! 173
8.4.1 Exercise - PLS2 (Peas) 174
8.4.2 Exercise - PLS1 or PLS2? (Peas) 177
8.4.3 Exercise - Is PCR better than PLS? (Peas) 179

9. Multivariate Data Analysis – in Practice:


Miscellaneous Issues 181
9.1 Data Constraints 181


9.1.1 Data Matrix Dimensions 183


9.1.2 Missing Data 183
9.2 Data Collection 184
9.2.1 Use Historical Data 184
9.2.2 Monitoring Data from an On-Going Process 185
9.2.3 Data Generated by Planned Experiments 185
9.2.4 Perform Experiments or Collect Data - Always by
Careful Reflection 186
9.2.5 The Random Design – A Powerful Alternative 187
9.3 Selecting from Abundant Data 188
9.3.1 Selecting a Calibration Data Set from Abundant
Training Data 188
9.3.2 Selecting a Validation Data Set 189
9.4 Error Sources 190
9.5 Replicates - A Means to Quantify Errors 190
9.6 Estimates of Experimental - and Measurement Errors 191
9.6.1 Error in Y (Reference Method): Reproducibility 192
9.6.2 Stability over Consecutive Measurements: Repeatability 193
9.7 Handling Replicates in Multivariate Modeling 195
9.8 Validation in Practice 198
9.8.1 Test Set 198
9.8.2 Cross Validation 198
9.8.3 Leverage Correction 199
9.8.4 The Multivariate Model – Validation Alternatives 199
9.9 How Good is the Model: RMSEP and Other Measures 200
9.9.1 Residuals 200
9.9.2 Residual Variances (Calibration, Prediction) 201
9.9.3 Correction for Degrees of Freedom 203
9.9.4 RMSEP and RMSEC - Average, Representative Errors
in Original Units 203
9.9.5 RMSEP, SEP and Bias 205
9.9.6 Comparison Between Prediction Error and Measurement
Error 206
9.9.7 Compare RMSEP for Different Models 207
9.9.8 Compare Results with Other Methods 207
9.9.9 Other Measures of Errors 208
9.10 Prediction of New Data 209
9.10.1 Getting Reliable Prediction Results 209
9.10.2 How Does Prediction Work? 209
9.10.3 Prediction Used as Validation 210


9.10.4 Uncertainty at Prediction 210


9.10.5 Study Prediction Objects and Training Objects in the
Same Plot 211
9.11 Coding Category Variables: PLS-DISCRIM 211
9.12 Scaling or Weighting Variables 213
9.13 Using the B- and the Bw-Coefficients 214
9.14 Calibration of Spectroscopic Data 215
9.14.1 Spectroscopic Data: Calibration Options 216
9.14.2 Interpretation of Spectroscopic Calibration Models 217
9.14.3 Choosing Wavelengths 219

10. PLS (PCR) Exercises: Real-World Application


Examples - I 221
10.1 Exercise - Prediction of Gasoline Octane Number 221
10.2 Exercise - Water Quality 230
10.3 Exercise - Freezing Point of Jet Fuel 233
10.4 Exercise - Paper 236

11. PLS (PCR) Multivariate Calibration – In Practice 241


11.1 Outliers and Subgroups 242
11.1.1 Scores 242
11.1.2 X-Y Relation Outlier Plots (T vs. U Scores) 244
11.1.3 Residuals 245
11.1.4 Dangerous Outliers or Interesting Extremes? 246
11.2 Systematic Errors 248
11.2.1 Y-Residuals Plotted Against Objects 249
11.2.2 Residuals Plotted Against Predicted Values 249
11.2.3 Normal Probability Plot of Residuals 251
11.3 Transformations 252
11.3.1 Logarithmic Transformations 253
11.3.2 Spectroscopic Transformations 254
11.3.3 Multiplicative Scatter Correction 256
11.3.4 Differentiation 259
11.3.5 Averaging 259
11.3.6 Normalization 259
11.4 Non-Linearities 260
11.4.1 How to Handle Non-Linearities? 262
11.4.2 Deleting Variables 263
11.5 Procedure for Refining Models 264


11.6 Precise Measurements vs. Noisy Measurements 265


11.7 How to Interpret the Residual Variance Plot 267
11.8 Summary: The Unscrambler Plots Revealing Problems 270

12. PLS (PCR) Exercises: Real-World Applications - II 273


12.1 Exercise ~ Log-Transformation (Dioxin) 273
12.2 Exercise - Multiplicative Scatter Correction (Alcohol) 276
12.3 Exercise – “Dirty Data” (Geologic Data with Severe
Uncertainties) 284
12.4 Exercise - Spectroscopy Calibration (Wheat) 291
12.5 Exercise QSAR (Cytotoxicity) 293

13. Master Data Sets: Interim Examination 303


13.1 Sgarabotto Master Violin Data Set 305
13.2 Norwegian Car Dealerships - Revisited 313
13.3 Vintages 317
13.4 Acoustic Chemometrics (a. c.) 321

14. Uncertainty Estimates, Significance and Stability


(Martens’ Uncertainty Test) 327
14.1 Uncertainty Estimates in Regression Coefficients, b 327
14.2 Rotation of Perturbed Models 328
14.3 Variable Selection 329
14.4 Model Stability 330
14.4.1 Introduction 330
14.4.2 An Example Using the Paper Data 330
14.5 Exercise - Paper - Uncertainty Test and Model Stability 332

15. SIMCA: An Introduction to Classification 335


15.1 SIMCA - Fields of Use 339
15.2 How to Make SIMCA Class-Models? 340
15.2.1 Basic SIMCA Steps: A Standard Flow-Sheet 340
15.3 How Do we Classify new Samples? 341
15.4 Classification Results 341
15.4.1 Statistical Significance Level and its Use: An
Introduction 342
15.5 Graphical Interpretation of Classification Results 344
15.5.1 The Coomans Plot 344
15.5.2 The Si vs. Hi Plot (Distance vs. Leverage) 345


15.5.3 Si/S0 vs. Hi 347


15.5.4 Model Distance 348
15.5.5 Variable Discrimination Power 349
15.5.6 Modeling Power 350
15.6 SIMCA-Exercise – IRIS Classification 351

16. Introduction to Experimental Design 361


16.1 Experimental Design 361
16.2 Screening Designs 375
16.2.1 Full Factorial Designs 376
16.2.2 Fractional Factorial Designs 378
16.2.3 Plackett-Burman Designs 382
16.3 Analyzing a Screening Design 383
16.3.1 Significant effects 386
16.3.2 Using F-Test and P-Values to Determine Significant
Effects 387
16.3.3 Exercise - Willgerodt-Kindler Reaction 391
16.4 Optimization Designs 395
16.4.1 Central Composite Designs 396
16.4.2 Box-Behnken Designs 400
16.5 Analyzing an Optimization Design 402
16.5.1 Exercise - Optimization of Enamine Synthesis 403
16.6 Practical Aspects of Making an Experimental Design 414
16.7 Extending a Design 428
16.8 Validation of Designed Data Sets 430
16.9 Problems in Designed Data Sets 431
16.9.1 Detect and Interpret Effects 433
16.9.2 How to Separate Confounded Effects? 436
16.9.3 Blocking and Repeated Response Measurements 436
16.9.4 Fold-Over Designs 438
16.9.5 What Do We Do if We Cannot Keep to the Planned
Variable Settings? 439
16.9.6 A “Random Design” 440
16.9.7 Modeling Uncoded Data 440
16.10 Exercise - Designed Data with Non-Stipulated Values
(Lacotid) 441
16.11 Experimental Design Procedure in The Unscrambler 444

17. Complex Experimental Design Problems 447


17.1 Introduction to Complex Experimental Design Problems 447


17.1.1 Constraints Between the Levels of Several Design
Variables 447
17.1.2 A Special Case: Mixture Situations 450
17.1.3 Alternative Solutions 451
17.2 The Mixture Situation 455
17.2.1 An Example of Mixture Design 455
17.2.2 Screening Designs for Mixtures 457
17.2.3 Optimization Designs for Mixtures 460
17.2.4 Designs that Cover a Mixture Region Evenly 461
17.3 How To Deal With Constraints 463
17.3.1 Introduction to the D-Optimal Principle 463
17.3.2 Non-Mixture D-Optimal Designs 466
17.3.3 Mixture D-Optimal Designs 467
17.3.4 Advanced Topics 469
17.4 How To Analyze Results From Constrained Experiments 474
17.4.1 Use of PLS Regression For Constrained Designs 474
17.4.2 Relevant Regression Models 476
17.4.3 The Mixture Response Surface Plot 478
17.5 Exercise ~ Build a Mixture Design - Wines 479

18. Comparison of Methods for Multivariate Data


Analysis - And their Validation 489
18.1 Comparison of Selected Multivariate Methods 489
18.1.1 Principal Component Analysis (PCA) 490
18.1.2 Factor Analysis (FA) 492
18.1.3 Cluster Analysis (CA) 494
18.1.4 Linear Discriminant Analysis (LDA) 496
18.1.5 Comparison: Projection Dimensionality in Multivariate
Data Analysis 498
18.1.6 Multiple Linear Regression, (MLR) 498
18.1.7 Principal Component Regression (PCR) 499
18.1.8 Partial Least Squares Regression (PLS-R) 500
18.1.9 Increasing Projection Dimensionality in Regression
Modeling 501
18.2 Choosing Multivariate Methods Is Not Optional! 501
18.2.1 Problem Formulation 501
18.3 Unsupervised Methods 502
18.4 Supervised Methods 503


18.5 A Final Discussion about Validation 505


18.5.1 Test Set Validation 505
18.5.2 Cross Validation 506
18.5.3 Leverage Corrected Validation 508
18.5.4 Selecting a Validation Approach in Practice 509
18.6 Summary of Basic Rules for Success 510
18.7 From Here – You Are on Your Own. Good Luck! 511

19. Literature 513

20. Appendix: Algorithms 519


20.1 PCA 519
20.2 PCR 520
20.3 PLS1 521
20.4 PLS2 524

21. Appendix: Software Installation and User


Interface 527
21.1 Welcome to The Unscrambler 527
21.2 How to Install and Configure The Unscrambler 527
21.3 Problems You Can Solve with The Unscrambler 529
21.4 The Unscrambler Workplace 530
21.4.2 The Editor 532
21.4.3 The Viewer 534
21.4.4 Dockable Views 537
21.4.5 Dialogs 537
21.4.6 The Help System 539
21.4.7 Tooltips 540
21.5 Using The Unscrambler Efficiently 540
21.5.1 Analyses 540
21.5.2 Some Tips to Make Your Work Easier 545

Glossary of Terms 549

Index 587


1. Introduction to Multivariate Data


Analysis - Overview
This chapter introduces the most important basic concepts and definitions for
acquiring a first overview of the elements of multivariate data analysis.

1.1 Indirect Observations and Correlation


The World is Multivariate
Nature is multivariate, as are most (if not all) technological and other data-
generating systems in the sciences, in the sense that any particular
phenomenon we would like to study in detail usually depends on several
factors. For instance, the weather depends on influencing variables such as wind,
air pressure, temperature and dew point, besides its obvious seasonal variations
(also giving a time series aspect). The health status of a human individual certainly
also depends on several factors, including genes, social position, eating habits,
stress, environment and so on. The tone quality of a violin? The environmental
impact of car exhaust gases? Certainly each depends on a score of factors. The same
holds for even the simplest technological systems; think of the absorption spectrum
of just one analyte. No matter which science one directs data analytical attention to,
a property only very, very rarely depends on one - and only one - variable.

This is of course also the case within many other scientific disciplines in which
underlying causal relationships give rise to manifest observable data, for instance
economics and sociology. In this book we shall primarily pay attention to a wide-
ranging series of scientific and technological examples of multivariate problems
and associated multivariate data.

Accordingly, data analytical methods dealing with only one variable at a time, so-
called univariate methods, will very often turn out to be of limited use in modern,
more complex data analysis. It is still necessary to master these univariate methods,
however, as they often carry important marginal information and they are the
only natural stepping stone into the multivariate realm - always realizing that they
are insufficient for a complete data analysis!


Direct and Indirect Observations


It is often necessary to sample, observe, study or measure more than one variable
simultaneously. When the measuring or recording instrument(s) correspond directly
to the phenomenon being investigated everything is fine. For instance, if you wish
to determine the temperature, you could perhaps use a thermometer - this would be
a direct univariate observation. Unfortunately, this is very seldom the situation
with anything but the simplest of systems we would like to analyze. When you
cannot measure or observe a desired parameter or variable directly, you are forced
to turn to indirect observations, which is the situation in which multivariate data is
most often generated. So, for example, if the temperature is high enough to melt the
thermometer or it is simply not practical to use a thermometer, you would have to
determine the temperature in another way - indirectly. The inside of a blast furnace
is a good example of something that is both impractical and too hot to measure with
a thermometer.

To determine the temperature of the furnace load, you could perhaps instead use
IR-emission spectroscopy and then estimate the temperature of the furnace from the
recorded IR-spectrum. This would be an indirect observation - measuring
something else to determine what you really want to know. Observe how here we
would be using many spectral wavelengths to do the job, in other words: an indirect
multivariate characterization. This is a typical feature of nearly all types of such
indirect observation and therefore shows an obvious need for a multivariate
approach in the ensuing data analysis.

Data Must Carry Useful Information


The basic assumption underlying the use of multivariate analysis is that your
measured data carries information about what you want to know. If you want to
find, say, the concentration of a particular chemical substance in a liquid mixture,
the measurements you perform on the mixture must of course in some way reflect
the concentration of that substance. Quite clearly the mere act of measuring many
variables is in itself not a sufficient condition for multivariate data analysis to bring
forth information. This may not always seem to be an obvious requirement in many
complex real-world situations, but throughout this book you will see how powerful
multivariate methods are; indeed it will often be tempting to use them on any
measuring problem. But not even multivariate methods can help you if your data
does not contain relevant information about the property that you are seeking.

The amount of information in your data will depend on how well you have defined
your problem, and whether you have performed the observations, the measurements
or the experiments accordingly. The data analyst has a very clear responsibility to
provide, or request, meaningful data. Actually it is much more important which
variables have been measured than simply how many observations have been made.
It is equally important that you have chosen the appropriate ranges of these
measurements. This is in contrast to “standard” statistical methods where the
minimum number of observations depends on the number of parameters to be
determined: here the number of observations must exceed the number of
parameters to be determined. This is not necessarily the case with the multivariate
projection methods treated in this book. In one particular sense, variables and
objects may, at least to some degree, stabilize one another; (much) more of this
will be revealed later.

There must be a quantitative relationship between the set of measured variables
and the property of interest. If the measurement variables change, the value of the
indirect property must change as a consequence. Mathematically, this is formulated
in such a way that the desired property, called Y, is a function of the measured
variables, termed X. Thus Y is typically dependent on several variables and X is
therefore usually represented as a vector, called the measurement vector or the
object vector. The property of interest (Y) may for example often represent an
expensive type of measurement (i.e. that has to be carried out by a more
cumbersome, laborious, or expensive analytical method). The X-variables on the
other hand may often be “cheaper variables” (which typically could be easier, less
expensive, more efficient to obtain), for instance spectroscopic measurements
and/or as can be carried out instrumentally, automatically or otherwise. This X/Y
relationship is central to multivariate data analysis and we shall use it extensively
in what is to follow.

Now it is time to introduce the first definitions before continuing (Frame 1.1).

Frame 1.1 - Basic definitions


Object Observations on one sample; usually a vector.
X-variables Easily available/”inexpensive” observations on the same object.
Y-variables “Expensive” observations on the same object.
n Number of objects.
p Number of X-variables.
q Number of Y-variables.


These two types of measurements are usually organized in two matrices, as shown
in Figure 1.1:

Figure 1.1 Matrices for the two types of measurements

$\mathbf{X} = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix} \qquad \mathbf{Y} = \begin{pmatrix} Y_{11} & \cdots & Y_{1q} \\ Y_{21} & \cdots & Y_{2q} \\ \vdots & & \vdots \\ Y_{n1} & \cdots & Y_{nq} \end{pmatrix}$

The n objects are the rows, while the p X-variables and the q Y-variables are the
columns of X and Y, respectively.
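To make this layout concrete, here is a minimal NumPy sketch (our addition; the dimensions and random numbers are made up purely for illustration) of how X and Y can be held as an (n × p) and an (n × q) array.

```python
import numpy as np

# n objects (rows); p "cheap" X-variables and q "expensive" Y-variables (columns).
n, p, q = 5, 3, 1

rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))   # e.g. spectroscopic measurements
Y = rng.normal(size=(n, q))   # e.g. reference laboratory values

print(X.shape, Y.shape)       # (5, 3) (5, 1)
```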

Variance, Covariance and Correlation


In Frame 1.2 a number of definitions are listed, pertaining to the basic univariate,
so-called summary statistics. We shall briefly discuss the meaning of some of them.
The points brought up are highly relevant to all the chapters below.

The variance of a variable is a measure of the spread of the variable values, i.e.
how large a range is covered by the n measured values. This entity
is of critical importance for things to come. One should learn to always keep this
univariate concept of "variance" in the back of one's mind, also when dealing with
more complex multivariate data structures. It is traditional to express the
measure of this spread in the same units as the raw measurements themselves;
hence the commonly used measure √variance = the standard deviation (std).

The covariance between two variables, x1 and x2, is a measure of their linear
association. If large values of variable x1 occur together with large values of
variable x2, the covariance will be positive. Conversely, if large values of variable
x1 occur together with small values of variable x2, and vice versa (small values of
variable x1 together with large values of variable x2), the covariance will be
negative. A large covariance (in absolute values) means that there is a "strong"
linear dependence between the two variables. If the covariance is “small”, the two
variables are not very dependent on one another: if variable x1 changes, this does
not affect the corresponding values of variable x2 very much.
Notice the similarity between the equations for variance, which concerns one
variable, and covariance, which concerns two.
But, as everybody knows, talk of "large" and "small" is very imprecise; we need to
define what is "much", what is "small" and what is just "a little". For example, if
the covariance between pressure and temperature in a system is 512 °C·atm, is that
a large or a small covariance? Does it mean that temperature and pressure follow
each other closely, or that they are nearly independent? And what about the
covariance between temperature and the concentration of a substance, say, if the
covariance is 12 °C·(mg·dm⁻³)? How does that compare with the covariance
between temperature and pressure? As you see, the scale of the covariance measure
depends directly on the units of the variables, which is why this measure is not so
very useful for mixed variables.
To put everything on an "equal footing", and in order to compare linear dependencies,
the correlation is a much more practical measure. The correlation between two
variables is calculated by dividing the covariance by the product of their
respective standard deviations. Correlation is thus a unit-less, scaled covariance
measure. In general it is the most useful measure of interdependence between
variables, as two or more correlation coefficients are directly comparable
whatever units the variables are measured in. Pearson's correlation coefficient r is
defined below; r² is often used as a measure of the fraction of the total variance
that can be modeled by this linear association measure.
Frame 1.2 Basic univariate statistical measures

Mean:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Variance:
$\mathrm{Var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Standard deviation (std):
$S_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Covariance (between x and y):
$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

Correlation (between x and y):
$r = \frac{\mathrm{cov}(x, y)}{S_x S_y}$
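For readers who want to check these formulas numerically, the following small Python/NumPy sketch (our addition, with made-up example numbers) computes the Frame 1.2 quantities and also illustrates the point made above: rescaling a variable changes the covariance but leaves the correlation unchanged.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. temperature readings (made up)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # e.g. pressure readings (made up)

n = len(x)
mean_x = x.sum() / n                                         # mean
var_x = ((x - mean_x) ** 2).sum() / (n - 1)                  # variance
std_x = np.sqrt(var_x)                                       # standard deviation

cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)   # covariance
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))                 # correlation

# Covariance depends on the measurement units; correlation does not.
cov_rescaled = np.cov(100 * x, y, ddof=1)[0, 1]   # x expressed in 1/100 of its unit
r_rescaled = np.corrcoef(100 * x, y)[0, 1]
print(cov_xy, cov_rescaled)   # differ by a factor of 100
print(r, r_rescaled)          # identical
```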

Correlations always lie between -1.0 and +1.0 (Figure 1.2).


• A correlation of 0.0 means there is no correlation; in other words: no relationship
at all between the variables.
• A correlation of +1 means there is an exactly linear positive relationship between
the variables.
• A correlation of -1 means there is an exactly linear negative relationship between
the variables.
r is the most common form of expressing correlation, but note that the sign of r is
lost in the squared r² measure.


Figure 1.2 Correlations
[Three representative x/y scatter plots showing different r and r² values: an
approximately linear positive relationship (r ≈ 1), an approximately linear negative
relationship (r ≈ -1), and a near-zero relationship (r ≈ 0).]


Causality versus Correlation


While correlation is a statistical concept for linear relationships, it is basically a
neutral, phenomenological measure. Cause and effect deal with the interpretation
of deterministic relationships. Many people think of cause and effect when they use
the term correlation. This is wrong. One should specifically use all application
and domain-specific knowledge when interpreting correlated variables with the aim
of determining cause and effect. For example, a statistical survey amongst a number
of small Danish rural towns showed that the number of babies born in the towns
could be correlated fairly well to the number of storks found in the area (squared
correlation of about 0.75 – no less!). However few people believe that storks bring
babies nowadays! Examples of this confusion between descriptive statistics and
causal interpretation are common in science and technology – be aware.

1.2 Hidden Data Structures


We have so far introduced the concepts of indirect observations and correlations.
Obviously there will always be some correlation between the set of variables we
measure and the property we wish to estimate in the observations, if nothing else
because we choose to use many characterizing variables. In general, the higher the
correlation, the more accurate the estimate may be. In many cases it is the
simultaneous contribution from several different variables that enables a
multivariate modeling of the property. Only in very special cases will the property
depend on only one variable, in which case the correlation between property and
variable necessarily must be close to 1.0. This is called a “selective variable”.
Multivariate data analysis typically deals with "non-selective" variables, and
invites the use of many (very many if need be) of these in lieu of selectivity.

The measurements or observations we make will always contain elements
(quantitative components) that are irrelevant to the property we seek. These may be
effects that have absolutely nothing to do with what we seek, things that are
uncorrelated with that property. Instrumental noise and other random measurement
errors will always be present, but there may also be other phenomena that often just
“happen” to be measured at the same time, always closely problem-specific of
course.

An example of this would be the case where we wish to find the concentration of
substance A in a mixture which also contain substances B and C. We may for
example use spectroscopy to determine the A-concentration. But the measured
spectra will not only contain spectral bands from A, which is what we seek, but in
general necessarily also bands from the other, irrelevant, compounds which we
cannot avoid measuring at the same time (Figure 1.3).
Figure 1.3 Overlapping spectra
[Absorbance vs. wavelength, with overlapping spectral bands from substances B, A and C.]

The problem will therefore be to find which contributions come from A, and which
come from B and C. Since it is substance A that we want to determine, B and C can
here be considered as “noise”. Whether we consider the B and C signals as noise is
of course strongly dependent on the problem definition; if B was the substance of
interest, A and C would now be considered as noise. In still another problem
context we might be interested in measuring the contributions from both A and B
simultaneously (this is one of the particular strengths of the so-called
multivariate calibration realm, see chapters below). In the latter case only the
contributions from C would now be considered noise. The issue here is that it is the
context of your problem alone that determines what to consider as “signal” and
what as “noise”.
Multivariate observations can therefore be thought of as a sum of two parts:

Observations = Data structure + Noise

The data structure is the signal part that is correlated to the property we are
interested in. The noise part is “everything else”; that is to say contributions from
other components, instrumental noise etc. - this is always a strongly problem-specific
issue. One often wishes to keep the structured part and throw away the
noise part. The problem is that our observations are always a sum of both of these
parts, and the structure part will at first be “hidden” in the raw data. We cannot
immediately see what should be kept and what should be discarded (note however,
that we do in fact also make good use of the noise part, as an important measure of
model fit).
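This decomposition can be mimicked with a tiny simulation. The sketch below is our own toy construction (one arbitrary hidden "structure" direction plus random noise, with an arbitrary noise level); it simply builds observations as structure plus noise so that the two parts can be compared.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy example: 50 objects, 10 variables, one hidden structure direction.
n, p = 50, 10
t = rng.normal(size=(n, 1))              # hidden property ("scores")
loading = rng.normal(size=(1, p))        # how the property shows up in each variable

structure = t @ loading                  # structured part: correlated variables
noise = 0.1 * rng.normal(size=(n, p))    # unstructured part: random "instrumental" noise

observations = structure + noise         # what we would actually measure

# In this construction the structure carries most of the total variance.
print(structure.var(), noise.var(), observations.var())
```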


This is where multivariate data analysis enters the scene. One of its most important
objectives is to make use of the intrinsic variable correlations in a given data set to
separate these two parts. There are quite a number of such multivariate methods.
We will exclusively be working with methods that use covariance/correlation
directly in this signal/noise separation, or decomposition, of the data. In fact one
may say that the inter-variable correlations act as a “driving force” for the
multivariate analysis.

1.3 Multivariate Data Analysis vs.


Multivariate Statistics
Multivariate data analysis and multivariate statistics are related fields with no clear-
cut division, but each has a slightly different focus. In multivariate statistics one is
in addition to the signal part also very interested in the stochastic (random) error
part of the data, the noise part. This part is essential for statistical inference,
hypothesis testing and other statistical considerations. In multivariate data analysis
on the other hand, the overall focus is mainly related to the practical, problem-
specific use of the structure part. But the noise part is also important even here,
amongst other things helping to get the most out of the structure part. And one
always needs to know quantitatively how much of the empirical, observable, data
variance is information (structure) and how much is noise in the data.
Thus a high wall does not separate these two fields, and there are many situations
where they tend to overlap. For typical multivariate data analysis purposes the
emphasis in this book will almost exclusively be on the structure part of the data.
Readers who are more interested in multivariate statistics are referred to the
dedicated literature on this subject.

1.4 Main Objectives of Multivariate Data


Analytical Techniques
There are very many multivariate data analysis techniques. Which method to
choose depends on the type of answer you want to get out of the data analysis. It is
very important that you have formulated your data analytical problem in such a way
that the goal for the analysis is crystal clear and that the data are in a form suited
for reaching this goal. This is not always as simple as it may seem. But if the
problem and goal are both well specified, the choice of technique is generally not
difficult, and mostly it will be obvious which technique to use. But first one must
acquire an experience-based overview.

Multivariate data analysis is used for a number of distinct, different purposes. We
will here divide the objectives into three main groups:
• Data description (explorative data structure modeling),
• Discrimination and classification,
• Regression and prediction.

Data Description (Explorative Data Structure Modeling)


A large part of multivariate analysis is concerned with simply “looking” at data,
characterizing it by useful summaries and very often displaying the intrinsic data
structures visually by suitable graphic plots. As a case in point, the data in question
can be state parameter values monitored in an industrial process at several
locations, or measured variables (temperature, refractive indices, reflux times, etc.)
from a series of organic syntheses – in general any p-dimensional characterization
of n samples.

The objective of univariate and multivariate data description can be manifold:
determination of simple means and standard deviations, etc., as well as correlations
and functional regression models. For example, in the case of organic synthesis,
you may naturally be interested in seeing which variables affect the product yield
the most, or the selectivity of the yield. The variables from the synthesis could also
be used to answer questions like: how correlated is temperature with yield? Is
distillation time of importance for the refraction index? In the next chapter we will
introduce Principal Component Analysis, a method frequently used for data
description and explorative data structure modeling of any generic (n, p)-
dimensional data matrix.

Discrimination and Classification


Discrimination deals with the separation of groups of data. Suppose that you have
a large number of measurements of apples and, after the data analysis, it turns out
that the measurements are clustered in two groups - perhaps corresponding to sweet
and sour apples. You now have the possibility to derive a quantitative data model in
order to discriminate between these two groups. Classification has a somewhat
similar purpose, but here you typically know before the analysis a set of relevant
groupings in the data set, that is to say which groups are relevant to model.


In the apple example, this would mean that you already at the outset know that
there are differences between sweet and sour (“supervised pattern recognition” to
introduce a term which will be more elaborated below). The aim of the data
analysis would then be to assign, to classify, new apples (based on new
measurements) to the classes of sweet or sour apples. Classification thus requires
an a priori class description. Interestingly, here also Principal Component Analysis
can be used to great advantage (see the SIMCA approach below), but there are
certainly many other competing multivariate classification methods. Note that
discrimination/classification deals with dividing a data matrix into two, or more
groups of objects (measurements).

Regression and Prediction


Regression is an approach for relating two sets of variables to each other. It
corresponds to determining one (or several) Y-variables on the basis of a well-
chosen set of relevant X-variables, where X in general must consist of more than,
say, three variables. Note that this is often related to indirect observations as
discussed earlier. The indirect observation would be X and the property we are
really interested in would be Y. Regression is widely used in science and
technological fields.

We will mainly work with the regression methods Principal Component Regression
(PCR) and Partial Least Squares Regression (PLS-R) in this book, while also
making some reference to the statistical approach Multiple Linear Regression
(MLR).

Prediction means determining Y-values for new X-objects, based on a previously
estimated (calibrated) X-Y model, thus only relying on the new X-data.
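In this book the regression and prediction work is carried out in The Unscrambler, but the calibrate-then-predict idea itself is tool-independent. Purely as an illustration (not The Unscrambler, and with simulated data), the sketch below uses the open-source scikit-learn implementation of PLS regression; the variable names, data sizes and number of components are all made up.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)

# Calibration (training) stage: X is "cheap" to measure, y is the "expensive" property.
X_train = rng.normal(size=(30, 8))
y_train = X_train @ rng.normal(size=8) + 0.05 * rng.normal(size=30)

model = PLSRegression(n_components=2)   # the number of components must be validated!
model.fit(X_train, y_train)

# Prediction stage: only new X-data are needed.
X_new = rng.normal(size=(5, 8))
y_pred = model.predict(X_new)
print(y_pred.ravel())
```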

1.5 Multivariate Techniques as Projections


There are several different ways to learn about the multivariate techniques treated
in this book. The methods PCA, PCR, and PLS may conveniently be introduced to
the uninitiated reader as projection methods; they are also known under the name
“bilinear methods”. These two aspects denote a more geometrical and a more
mathematical approach respectively. One may opt, for instance, to start with the
fundamental mathematics and statistics, which lie behind these methods; this is
often the preferred approach in statistical textbooks.


We will not do that. Instead we will use a geometric representation of the data
structures to explain how the different methods work especially by using a central
concept of projections. This approach is strongly visual and thereby utilizes the fact
that there is no better pattern recognizer than the human brain. You will soon come
to view data sets primarily as points and swarms of points in a data space. This will
let you grasp the essence of most multivariate concepts and methods in an efficient
manner, so that you should be able to start to perform multivariate data analysis
without having to master too much of the underlying mathematics and statistics - a
very ambitious objective! This book represents some 20 years of accumulated
teaching experience; we hope that you will feel suitably empowered when you have
reached the end.

It is also befitting, however, to point out that it is not our belief that the present
geometrical approach is all you will ever need in the multivariate realm. On the
contrary, this is an introductory textbook only. After completing it, you will
certainly want to turn to higher-level textbooks for a more solid additional
theoretical background in the mathematics and statistics of these methods, to
deepen your understanding of why they work so well. Indeed, this is exactly our
objective with this introductory book.


2. Getting Started - with Descriptive


Statistics
2.1 Purpose
The purpose of the following two exercises is to allow you to familiarize yourself
with the elementary statistical concepts presented in chapter 1, and to introduce you
to The Unscrambler’s user interface – in fact, to let you get started doing data
analyses right away.

We shall work on getting a first overview of two data sets using descriptive
univariate statistics. The following two data sets are of interest here simply as two
examples of data matrices, and accordingly we do not present their full details now.
A brief introduction is sufficient, as both data sets will appear again at several
places later in this book.

2.2 Data Set 1: Quality of Green Peas


The first data set represents sensory assessments of pea quality. The data were
collected in order to make a survey of which sensory attributes characterize the
quality of green peas. The samples consist of five different pea cultivars (A-E),
harvested at five different times (1-5). There are 12 sensory variables and 60
different samples. Ten different judges in a taste panel evaluated each sample
twice. The sensory properties are all given a rating on a scale from 1 to 9 by each
judge. This gives a total of 1200 samples (60 samples x 2 replicates x 10 judges). It
is common practice to average data of this type to compensate for differences in
interpretation of the scale between individual judges.

Tasks
1. Reduce data by averaging over judges.
2. Plot raw data.
3. Calculate statistics.


How to Do it

1. Open the Data File and Average over the Replicates


First you must read the raw data into an Editor. Select File - Open and find the
Examples directory where the data files are stored. Look for a file of type
“Data” called Peasraw, mark it and select Open. An Editor is created with the
data table inside.

At the bottom of your screen you should now see that your data table has the
size 1200 samples times 15 variables.

Study the data table. The first three variables identify the samples. You can see
that each of the ten judges (3-12) has tasted each sample twice for 12 variables,
and that there are 20 recordings (samples) for each sample number.

Average the data over the replicates and judges, i.e. over the objects: two replicates and ten judges give 20 rows to average for each sample. Select Modify - Transform - Reduce (Average).
Use the following parameters:

Scope: All samples, All variables
Reduce Along: Samples
Reduction Factor: 20

On the status bar you can see that the number of samples has now been reduced
to 60.

Delete the first three variables, which are only used for object identification:
Mark the variables by clicking on the column numbers while pressing the
CTRL-key. Select Edit - Delete.

Again the size of the data table is reduced, so the status bar should now read 60
samples times 12 variables.

Save the data table in the Editor on a new file with the name PEAS1 using File-
Save As.
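If you prefer to see the same reduction as explicit computations, the following NumPy sketch mimics the GUI steps above. It assumes, hypothetically, that the raw pea table has been exported to a comma-separated file with the same 1200 x 15 layout, and that the 20 rows belonging to each sample are stored consecutively (as the recording order described above suggests).

```python
import numpy as np

# Hypothetical export of the raw pea table: 1200 rows x 15 columns
# (3 identifier variables + 12 sensory variables), one header row.
raw = np.loadtxt("peasraw.csv", delimiter=",", skiprows=1)

# Reduce along samples with factor 20: average each consecutive block of
# 20 rows (2 replicates x 10 judges) into one row per sample.
reduced = raw.reshape(60, 20, raw.shape[1]).mean(axis=1)   # 60 x 15

# Delete the first three identifier columns, keeping the 12 sensory variables.
peas1 = reduced[:, 3:]                                      # 60 x 12
print(peas1.shape)
```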

2. Plot Raw Data


A picture says more than a thousand words. To get an impression of your data, it
is always a good idea to plot your raw data before doing any analysis. One way
of looking at the data is to plot all samples and all variables in a so-called matrix plot. Mark the whole data table (either manually or by choosing Edit - Select
All) and select Plot - Matrix.

Now you can see the value of all variables (columns) for all objects (rows). The
plot is displayed in a window, which we call a Viewer. A Viewer is a graphical
representation of data in an Editor window or of a matrix on file.

Once you have studied the matrix plot, close the viewer: Window - Close
Current, or with the close button. Then unmark the table selection with Edit -
Unselect All or with Esc.

Study the Correlation


Another common way to get an impression of the data is to perform a simple
linear regression for any pair of variables. You simply plot any two variables in
a 2D Scatter plot. In the data table Editor, mark the first two variables, “Pea
Flavour” and “Sweet”. Then select: Plot - 2D Scatter.

Now that you have plotted two variables versus each other, for example X1 vs. X2, it is possible to fit a regression line and to study the corresponding correlation.
Select View - Trend Lines - Regression line to see the regression line. Then
choose: View - Plot Statistics to display the regression statistics (i.e. regression
slope, offset and correlation coefficient r). Notice that these two variables are
highly correlated (all points lie near a straight line and the correlation r ≈ 0.95).
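The regression line and correlation shown in the plot can of course also be computed directly. A minimal sketch follows; the column indices for “Pea Flavour” and “Sweet”, and the file name, are assumptions.

```python
import numpy as np

peas1 = np.loadtxt("peas1.csv", delimiter=",", skiprows=1)  # hypothetical export of PEAS1 (60 x 12)
x, y = peas1[:, 0], peas1[:, 1]                             # assumed columns: Pea Flavour, Sweet

slope, offset = np.polyfit(x, y, 1)    # least-squares regression line y = slope*x + offset
r = np.corrcoef(x, y)[0, 1]            # correlation coefficient
print(f"slope={slope:.2f}  offset={offset:.2f}  r={r:.2f}")   # r should come out near 0.95
```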

Try plotting other combinations of variables of your choice by selecting other combinations of columns. Try column 1 vs. column 12 for a particularly low correlation (use the Ctrl key for easier selection of the variables).
Close the Viewers after you have studied the plots in order not to get too many
open windows at the same time.

To turn off the regression line and statistics, simply toggle the respective
commands once more.

Histogram
You can also study how the observations in each variable are distributed by
looking at the frequency histogram. Mark variable 1 and select Plot- Histogram.

Now you have a histogram on the screen. To show the statistics choose: View-
Plot Statistics. Now compare this with the histogram for variable 12. It would be useful to have both histograms on the screen. The Unscrambler lets you do
this by simply going back to the Editor, marking a new variable and plotting the
histogram for this variable. You can have several Viewer and Editor windows
open at the same time. Use Window - Tile to see all windows at the same time.

Which variable best describes the differences between peas? Why?


Close all open Viewers before you continue.

3. Calculate Statistics
You can calculate sample statistics with the View - Sample Statistics or
variable statistics with the View - Variable Statistics command. A new Editor
with the most common statistical measures is created. More statistical information can be found by making a statistics model. Select Task - Statistics and use the sets All Samples and All Variables. Click OK to make the model and View to look at the results. Let the cursor rest over a variable for a moment, then click with the left mouse button.

Study the results using other plots by Plot - Statistics.


Which variable has the largest variance?

Which variable do you expect to describe the pea quality best? Is your answer
the same as when you compared the histograms? By which properties can the
assessors best distinguish between peas? Do the assessors use the scale in the
same way for all variables?

Now save the results in a file for later use by File - Save. Use the file name
PEASTATS, for example.
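The variable statistics computed here boil down to simple column-wise summaries. A small sketch of the same quantities (mean, variance, standard deviation per variable), again assuming a hypothetical comma-separated export of the PEAS1 table:

```python
import numpy as np

peas1 = np.loadtxt("peas1.csv", delimiter=",", skiprows=1)   # hypothetical export of PEAS1 (60 x 12)

means = peas1.mean(axis=0)           # one mean per sensory variable
stds = peas1.std(axis=0, ddof=1)     # sample standard deviation per variable
variances = stds**2

for k, (m, v) in enumerate(zip(means, variances), start=1):
    print(f"variable {k:2d}: mean = {m:5.2f}   variance = {v:5.2f}")
```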

Summary
This simple exercise has demonstrated how to read a data file and how to study the
raw data both as numbers and as graphical displays. You should now be able to
start using The Unscrambler in general.

The first six variables have the largest variances, which by using classical
descriptive statistics suggests that these variables will be the best to describe pea
quality. However, in later exercises we will see that another approach may perhaps
be more useful.


The judges do not seem to use the scale in the same way for the different variables.
We can see this because some variables have a high mean value and others have a
low mean value. However, we do not know for certain if this is because most
samples really have, for example, few skin wrinkles, or if the judges just use low
values on the scale for the variable “Skin”.

2.3 Data Set 2: Economic Characteristics of Car Dealerships in Norway
The data set CARDEALS is a compilation of 10 key economic performance
indicators for a set of car dealerships in Norway (1997). The complete data matrix
has been lifted from an economic magazine (“Kapital”), which in 1994 carried out
an economic analysis of the way these 172 car dealerships went about their trade.
In fact the basis for an economic analysis of the “health status” of the Norwegian
car dealership sector was a series of similar univariate descriptive statistics and
histograms, as was carried out above for the peas example.

The “problem” (as judged by the senior author of the present book) was just that: univariate descriptive data analyses were the only analyses carried out; each variable in the table was analyzed individually and separately. As will become very clear as the reader progresses through this book, this is not the best approach. “Clearly” a full-fledged multivariate analysis will be able to tell more.

We shall investigate this data set closely in other exercises below; but as a start let
us simply do some entry-level univariate characterizations of this data, exactly as
for the peas example (means, variances -standard deviations-, histogram, etc.) - in
order to “get a feel” for this new data set. Quickly run through this exercise for all
ten variables. Do you see any pattern emerging?

You will probably quickly have appreciated that there is something rather special to
this data set - many (all?) of the variables are very symmetrically distributed
indeed. What could this mean?


3. Principal Component Analysis (PCA) – Introduction
In this chapter we shall introduce Principal Component Analysis (PCA). PCA
constitutes the most basic “work horse” of all of multivariate data analysis. PCA
involves decomposing one data matrix, X, into a “structure” part and a “noise” part.
There is no Y-matrix, no properties, at this stage. (The Y-matrix will be very
prominent in later chapters). Chapters 3-5 should be worked through very closely, and reflected upon carefully, as they will show you how to “think” multivariate data analysis and describe a basic philosophy that is fundamental also for many of the methods described later.

3.1 Representing the Data as a Matrix


The starting point is an X-matrix with n objects and p variables, namely an n by p
matrix (as in Frame 1.1 in chapter 1). This matrix is often called the “data matrix”,
the “data set” or simply “the data”. The objects can be observations, samples,
experiments etc., while the variables typically are “measurements” for each object.
The important issue is that the p variables collectively characterize each, and all, of
the n objects. The exact configuration of the X-matrix, such as which variables to
use – for which set of objects, is of course a strongly problem-dependent issue. The
main up front advantage of PCA - for any X-matrix - is that you are free to use
practically any number of variables for your multivariable characterization.

The purpose of all multivariate data analysis is to decompose the data in order to
detect, and model, the “hidden phenomena”. The concept of variance is very
important. It is a fundamental assumption in multivariate data analysis that the
underlying “directions with maximum variance” are more or less directly related to
these “hidden phenomena”. All this may perhaps seem a bit unclear now, but what
PCA does will become very clear through this chapter and the accompanying
exercises in chapters 4-5.


3.2 The Variable Space - Plotting Objects in p Dimensions
Plotting the Data in 1-D and 2-D Space
The data matrix X, with its p variable columns and n object rows, can be
represented in a Cartesian (orthogonal) co-ordinate system of dimension p.

Consider for the moment the first variable, i.e. column, X1. The individual entries
can be plotted along a 1-dimensional axis (see Figure 3.1). The axis must have an
origin, usually a zero point, as well as a direction and a measurement unit of length.
If X1 is a series of measured weights, for example, the unit would be mg, kg or
some other unit of weight. We can extend this approach to take in also the next
variable, X2 (see Figure 3.2). This would result in a 2-dimensional plot, often
called a “bivariate” scatter plot.

The axes for the variables are orthogonal and have a common origin, but may have
different measurement units. This is of course nothing other than what you did with
the variables in the exercises in chapter 2. You can continue the extension until all
p variables are covered, in all the pertinent variable pairs. Exercise - Plotting Raw
Data (People) on page 22 is a good workout for this!

Figure 3.1 A 1-D axis system (a single axis x1 with origin 0)
Figure 3.2 A 2-D axis system (orthogonal axes x1 and x2 with a common origin)

The Variable Space and Dimensions


The p-dimensional co-ordinate system described above is called the variable space,
meaning the space spanned by the p variables. We say that the dimension of this
space is p. The dimension related to the rank of the matrix representation (mathematically: the number of independent basis vectors; statistically: the number of independent sources of variation within the data matrix) may often be less than p.

Multivariate data analysis aims at getting at this “operative” or “effective” dimensionality. However, as a starting point we will assume p dimensions before
we perform the analysis. Most, if not all, of multivariate data analysis as presented
in this book, will benefit immensely from the concept of plotting X-data in the
p-dimensional variable space.

Visualization in 3-D (or More)


In this book we want you to think of multivariate data as a swarm of data points in
this variable space. Of course, as p increases above 3, we cannot any more visualize
this on paper. However this is of no serious consequence. It is not necessary to be
able to picture in your mind anything more complex than 3-dimensional systems to
learn to understand multivariate data analysis. We will be using 1-, 2- and
3-dimensional plots as exemplars, and the insights and data space “feeling” are actually directly applicable to higher dimensions.

3.3 Plotting Objects in Variable Space


Assume that we have an X matrix with n objects and only 3 variables, i.e. p=3. The
variable space will have 3 axes: X1, X2 and X3. We plot the x-values for each
object in the variable space. Object number 1 has the set of variable measurements
x11, x12 and x13; object number 2 is characterized by the set x21, x22 and x23 and so
on. In the Cartesian co-ordinate system each object can be characterized by its
coordinates (x1, x2, x3).

Each object can therefore be represented - plotted - as a point in this variable space.
When all the X-values for all objects are plotted in the variable space, the result is a
swarm of points for example as shown in Figure 3.3. There are now only n points
described in this p-dimensional space. Observe, for example, how this rendition of
the (n,p) dimensional two-way data matrix, allows you direct geometrical insight
into the hidden data structure.

In this particular illustration, we suddenly get an appreciation of the fact that there
is a prominent trend among the objects, a trend that is so prominent that in some
sense we might in fact call it a “hidden linear association” among all three variables plotted. This revealing of the underlying covariance structure is really the backbone
of Principal Component Analysis.

Figure 3.3 Data plotted as a swarm of points in the variable space (x1, x2, x3), revealing a hidden trend

3.3.1 Exercise - Plotting Raw Data (People)


Purpose
This exercise illustrates how to study single variable relationships and object
similarities using the traditional standard descriptive statistical tools, so-called
univariate data analysis. This data set will be used again immediately both for
PCA modeling and later as well as for PLS-analysis. We now start using the skills
we learned in chapter 2.

Data Set
We have selected an excerpt from a pan-European demographic survey. For
reasons of didactic introduction we have selected only a small, manageable set of
32 persons, i.e. 32 objects, of which 16 represent northern Europe (Scandinavia: A)
with a corresponding number of representatives from the Mediterranean regions
(B). An equal number of 16 males (M) and 16 females (F) were chosen for balance.

The data table, stored in the file “PEOPLE”, consists of 12 different X-variables:

Height (height in centimeters)
Weight (weight in kilograms)
Hair length (short: -1; long: +1)
Shoe size (European standard)
Age (years)
Income (Euro)
Beer consumption (liters per year)
Wine consumption (liters per year)
Gender (Sex) (male: -1; female: +1)
Swimming ability (index based on 500 m timed swimming)
Regional belonging (A/B) (A: -1 (Scandinavia); B: +1 (Mediterranean))
Intelligence Quotient (IQ) (European standardized IQ-test)

Among these variables we observe that Sex, Hair length and Region (A/B) are discrete variables with only two possible realizations (-1 or +1), i.e. dichotomous or binary variables. The remaining 9 variables are all quantitative. This data set will be discussed further later, in the PCA and PLS modeling exercises.

Tasks
1. Load and examine the data. Make univariate data descriptions of all variables.
2. Select any two variables and plot them against each other using a suitable
2-vector function. Study their interrelationship and data set similarities.
3. Select any three variables and plot them using a suitable 3-vector function.
Study the variable interrelationships and data set similarities.
4. Determine the most important (strongest) two-variable, as well as three-variable
interrelationships, expressed as the strongest correlations. Evaluate all possible
combinations (if possible). What do we know about this data set now?

How to do it
Open the data from the file PEOPLE. Take a look at the data in the Editor.
Observe the numerical tabulation of the 12 variables. This is not the optimal
approach for human pattern cognition!

Plot for instance variable 1 vs. variable 2 by marking the variables and select
Plot-2D Scatter.

You should now observe a plot with points marked by the sample names: FA,
MA, FB or MB indicating the persons' gender and region. If they appear only as
numbers or dots, as in Figure 3.4, Edit-Options can be selected in the menu or
context menu (right mouse-button). Is it easier to interpret the result from
names or from numbers in this plot? This depends on the use of the plot, more
of which later. It is important that you develop an appreciation that it is possible to include coded information via the “names” of the objects. In this exercise we
have included the dichotomous sex and region information in the “PEOPLE”-
names.

Figure 3.4 Vector plot (Height vs. Weight)


[2-D scatter plot of Height (abscissa) vs. Weight (ordinate) for the 32 objects, with plot statistics: slope 1.45, offset -186.5, correlation 0.96]

The primary relationship between the Height and the Weight is shown in Figure
3.4. It is quite obvious that the Height is proportional to the Weight in this selection
of people. Observe that The Unscrambler automatically calculates a set of useful
standard statistics in this plot; one observes, for example, that the Height/Weight correlation coefficient is 0.96. The general trend accordingly shows that taller persons are heavier. Invoking the “name” plotting-symbol option, you will also be able to appreciate that the men (M) are generally heavier and taller than the women (F). As can be clearly seen, object 1 is the heaviest and tallest person,
while 12 and 21 are situated at the extreme other end of a fitted regression line,
representing the lightest/shortest people. Use View-Plot Statistics to obtain other
statistical measures besides the correlation, namely the fitted regression slope,
offset, bias and so on.

Try to study a few other, randomly selected pairs of variables by marking and
plotting them directly from the data table. It is recommended here that at least some
“sensible” combinations, like Age vs. Income and Wine vs. Beer consumption
(Figure 3.5) be tried out.


Figure 3.5 Vector plot of Wine - vs. Beer consumption


[2-D scatter plot of Beer consumption (abscissa) vs. Wine consumption (ordinate) for the 32 objects; correlation approximately -0.65]

One will readily observe that even this comparatively small data set contains both strongly positively correlated as well as some intermediately negatively correlated relations. There are also essentially uncorrelated (“random scatter”) interrelationships to be found. Have you tried out all possible two-variable pairs yet? Probably not: with p variables there are a total of p x (p-1)/2 such combinations - 66 pairs for the 12 variables here. Surely there must be an easier way?
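One pragmatic stop-gap, short of the full multivariate analysis introduced in the next chapter, is to let the computer enumerate all p(p-1)/2 pairs and rank them by correlation. A sketch, assuming a hypothetical comma-separated export of the PEOPLE table:

```python
import numpy as np
from itertools import combinations

people = np.loadtxt("people.csv", delimiter=",", skiprows=1)   # hypothetical export (32 x 12)

p = people.shape[1]
R = np.corrcoef(people, rowvar=False)          # full p x p correlation matrix
pairs = list(combinations(range(p), 2))        # p*(p-1)/2 = 66 pairs for p = 12

# Rank the pairs by absolute correlation to spot the strongest relationships
ranked = sorted(pairs, key=lambda ij: abs(R[ij]), reverse=True)
for i, j in ranked[:5]:
    print(f"variables {i + 1} and {j + 1}: r = {R[i, j]:+.2f}")
```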

You could also study three variables at the same time using the appropriate 3-D
plot. Simply mark any three desired variables in the Editor and then select Plot-3D
Scatter. Here we first again choose the Height and the Weight variables (since we
know these well already) plus the Swim (Swimming ability); see Figure 3.6. This
time you are looking at a 3-D plot. You may use Window-Identification to identify
the variables along the axes.

To view the plot from a different angle, you can choose View-Rotate or View-
Viewpoint- Change. This is a very powerful correlation inspection tool!


Figure 3.6 Vector plot of variables (Height, Weight and Swim)

[3-D scatter plot of Height, Weight and Swim for the 32 objects]

Is the variable Swim correlated to Height and Weight? Do the women differ from
the men with respect to their swimming ability? Are there any groups in the plot?
Which persons are the most distinguishing ones? Which persons are similar and
which ones are not? What about some other combinations of similar three-variable
interrelationships, are there similar correlations among them?

Summary
In this exercise you have tried to study a total data set by looking at 2- and 3-D
plots of selected variables. The plots shown here indicate that the taller the people
are, the heavier they are among this group of people - and so on for the other pairs
you selected yourself. The swimming ability was found to be correlated to the
height and weight, but what about other three-variable intercorrelations? Discrete
variables are difficult to visualize and very difficult to interpret!

You will most probably have appreciated that even while these simple two-variable
and three-variable plotting routines are immensely powerful in their own right, the
number of variables (in this particular case only 12, or even only 9, if you discard
the binary ones) very quickly makes it impractical to investigate all the pertinent
combinations. One of the reasons for PCA, to be introduced shortly, is to make it
possible to survey all pertinent inter-variable relationships simultaneously. Thus we will postpone other data analysis objectives until such time as we are in a position to
investigate these with PCA.

3.4 The First Principal Component


Maximum Variance Directions
The data swarm in Figure 3.7 should appear familiar. One intuitively “feels” that a
central axis could be drawn through the swarm and that this line would describe the
data swarm almost as efficiently as all the original p (p=3) variables (Figure 3.8).
This is so because this particular set of variables co-vary to a great extent - they are
strongly correlated. Therefore the effective dimension is not 3, but rather closer to
1. This central axis - which we can put here because the swarm "looks" linear - is
actually positioned along the direction of maximum variance. The variance, the
spread of the objects, will be the largest along this axis. In the general case this axis
needs of course not be parallel to any of the original X-axes; in fact it very seldom
is. It is usually so that two, three, or more... variables collectively support this type
of linear association. You will soon become very familiar with this “trend” aspect
of PCA.

We speak of “the variance” - but the variance of what? The feature in question is
the variance of the direction described/represented by the central axis - whatever
this unknown “new variable” may represent. This is really what is meant by the
term: “modeling a hidden phenomenon”. There is a co-varying, linear behavior
along this central axis due to “something” unknown (at least at the onset of the data
analysis). If we look only at the original X1-, X2- and X3-variables, there is no
such apparent connection, except that their pair-wise covariances are large. But this
simple geometrical plotting reveals this hidden data structure very effectively. All
Principal Component Analysis does is allow for this geometrical understanding to
be generalized into any arbitrary higher p-dimensionality.


Figure 3.7 Data (point) swarm in 3-D (axes x1, x2, x3)
Figure 3.8 Data swarm with PC1

This central axis is called the first Principal Component, in short PC1. PC1 thus
lies along the direction of maximum variance in the data set. We may say that there
is a hidden, compound variable associated with this new axis (the Principal
Component Axis). At this stage we do not usually know what this new variable
“means”. PCA will give us this first - and similar other - Principal Components, but
it is up to us to interpret what they mean or which phenomena they describe. We
will return to the issue of interpretation soon enough.

In this first example we only had 3 variables, so we recognized the linear behavior
immediately by plotting the objects in the 3-D variable space. When there are more
– many more – variables (like in spectroscopy where each row of the matrix is a
spectrum of perhaps several hundred, or thousands of wavelengths), this procedure
is of course not feasible any longer. Identification of this type of linear behavior in
a space with several thousand dimensions of course cannot any longer be done by
visual inspection. Here PCA can help us to discover the hidden structures however,
with its powerful projection characteristics.

The First Principal Component as a Least Squares Fit


We have looked at the central new variable as being defined as the axis along
which the Principal Component variance is maximized. There is also a
complementary way to view this axis. Assume that you have the same swarm of
points and now an arbitrary axis through the swarm (Figure 3.9). This is just a
temporary proxy-PC-direction to illustrate the concept - not the actual final axis we
want to find. We now project each point perpendicularly down onto this line. The
(perpendicular) distance from point (object) i is denoted ei (called the object
residual). As you see from Figure 3.9, each point is situated at a certain
“transverse” distance from the line. We can think of the first PC as finding the line that is a best simultaneous fit to all the points through the use of the least-squares
optimization principle. We want to find the line that minimizes the sum of all the
squared transverse distances – in other words the line that minimizes Σ(ei)².

This line is the exact same PC-axis that we found more “intuitively” earlier! When
using the Least Squares approach we now possess a completely objective
algorithmic approach to calculate the first PC, through a simple sum-of-squares
optimization.

One must appreciate that the n objects contribute differently to the determination of
the axis through their individual orthogonal projection distances. Objects lying far
away from the PC-axis in this transverse sense will “pull heavily” on the axis’
direction because the residual distances count by their squared contributions.
Conversely, objects situated in the immediate vicinity of the overall “center of the
swarm of points” will contribute very little. Objects lying far out “along the
PC-axis” may, or may not, display similarly large (or small) residual distances.
However, only the transverse component is reflected in the least square
minimization criterion.

Figure 3.9 Projections onto a PC


[Sketch: objects i and j in the variable space (x1, x2, x3) are projected onto a candidate PC-axis; the transverse projection distances are ei and ej]

We now have two approaches, or criteria, for finding the (first) principal
component: the principal component is the direction (axis) that maximizes the
longitudinal (“along axis”) variance or the axis that minimizes the squared
projection (transverse) distances. Some thought will show that these two criteria
really are two sides of the same coin. Any deviation from the maximum variance
direction in any elongated swarm of points must necessarily also result in an increase of Σ(ei)² - and vice versa. It will prove advantageous to have reflected
upon these two simple geometrical models of a principal component.
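For readers who like to see the geometry in code, here is a minimal NumPy sketch with simulated data. It illustrates that the direction of maximum variance (here obtained via the SVD) is exactly the least-squares line: the variance along PC1 plus the mean squared transverse residual equals the total variance.

```python
import numpy as np

# Simulated 3-variable swarm driven by one hidden phenomenon t, plus a little noise.
rng = np.random.default_rng(1)
t = rng.normal(size=100)
X = np.outer(t, [0.6, 0.7, 0.4]) + 0.05 * rng.normal(size=(100, 3))

Xc = X - X.mean(axis=0)                     # center at the average object
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                                 # direction of maximum variance

scores1 = Xc @ pc1                          # "foot points" of the objects on PC1
E = Xc - np.outer(scores1, pc1)             # transverse residual vectors e_i

along = np.var(scores1)                     # variance along PC1
across = (E**2).sum(axis=1).mean()          # mean squared transverse distance
print(along + across, np.var(Xc, axis=0).sum())   # the two numbers coincide
```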

3.5 Extension to Higher-Order Principal Components
Variance
The example illustrated above can now - easily - be generalized to higher-order
components. The swarm of points could be approximated quite well with just one
Principal Component (from now on denoted PC). Suppose now that the X data
swarm of points in fact is not so simple as to be modeled by only one PC (Figure
3.10). There is only one thing to do after having found PC1, and that is to find one
more, called PC2, as the data swarm of points is in fact quite planar in appearance.
The second principal component – per definition - will lie along a direction
orthogonal to the first PC and in the direction of the second largest variance
(Figure 3.11).

Again we are speaking about the variance of some “unknown” phenomenon or new
hidden compound variable which is represented by the second principal
component.

One may - perhaps – at this stage be wondering how actually to find, to calculate
the principal components. We will return to this later. At this stage it is only
important that one grasps the geometric concepts of the mutually orthogonal
principal components. The mathematics behind and the algorithmic procedure to
find them are very simple and will be described in due order.


Figure 3.10 Data swarm (quasi-planar)
Figure 3.11 Data swarm with 2 PCs (PC1 and PC2)

We can continue finding still higher-order components. Thus PC3 will be orthogonal to both PC1 and PC2 while simultaneously lying along the direction of
the third largest variance and so on for PC4, PC5 etc. The final PC-system will
consist of a number of orthogonal PCs, each lying along a maximum variance
direction in decreasing order. This system of PCs actually constitutes a new
coordinate system relative to the “old” one with the p variables. In fact we now
have a new set of “variables”, one for each PC, which are uncorrelated with each
other (since they are manifestly mutually orthogonal).

These new variables - let us call them PC-variables for the moment - do not co-
vary. By introducing the PCs we have made good use of the correlations between
the original variables and thereby constructed a new independent, orthogonal
coordinate system. Going from the original Cartesian co-ordinate system, one is
effectively substituting the inter-variable correlations with a new set of orthogonal
co-ordinate axes, the PC model. We shall find almost unbelievable data analytical
power in PCA, the Principal Component Analysis concept.

3.6 Principal Component Models - Scores and Loadings
We shall now develop the systematics of Principal Component modeling.

Definition: A Principal Component model is an approximation to a given data matrix X, i.e. a model of X that we use instead of the full original X. It is of course assumed that this substitution has some advantages for our data analysis/interpretation purpose(s).

Maximum number of Principal Components


There is an upper limit to the number of Principal Components that can be derived
from an X-matrix. The largest number of components is either n-1 (number of
objects -1) or p (number of variables), depending on which is the smaller. For
example, if X is a 40*2000 - dimensional matrix (e.g. 40 spectra, each with 2000
variables), the maximum number of PCs is 39. In this case, the largest number of
potential PCs is limited by the number of objects. Notice that we are discussing the
maximum number of components. In general we will be using many fewer.

From a data analytical point of view, using the maximum number of PCs
corresponds to a simple change of co-ordinate system from the p-variable space to
the new PC-space, which is also orthogonal (and with uncorrelated PC-axes).
Mathematically, the effective full dimension of the PC-space, the space spanned by
the PCs, is given by the rank of X.

PCA: X = Structure + Noise


All the PCs we care to calculate are always orthogonal to each other, and they
represent successively smaller and smaller variances, that is to say smaller and
smaller “spreads” of the object data swarm along the higher-order component
directions. Thus the last PCs will lie along directions where there is very little
spread in the objects, where, in effect, there are no longer any underlying
phenomena to “stretch” out the spread of the objects, indeed perhaps only
stochastic noise. These higher-order directions can progressively be thought of as
“noise” directions. This gives an indication of how PCA can decompose the
original data matrix X into a structured part (the first PCs that span the largest
variance directions), and the noise part (directions in the data swarm where the
variance/elongation is small enough to be neglected).

3.6.1 Model Center


A Principal Component model consists of a set of orthogonal axes determined as
maximum variance directions. They have a common origin, as can be seen from
Figure 3.12. There are a number of ways to choose this origin. Sometimes it can be
the same origin as the origin for the original p variables, but this is usually not
optimal. The most frequent choice of origin for the PCs is as the average object,
the average point in the data swarm:

(x̄1, x̄2, … , x̄p)

where x̄k = (1/n) Σi=1..n xik is the mean of variable k, taken over all n objects.

Figure 3.12 Centering the data at the average object (corresponding to a translation of the original axes x1, x2, x3 to new axes x1’, x2’, x3’ centered on the data swarm)

This PC-origin can also be viewed as a translation of the origin in variable space to
the “center-of-gravity” of the swarm of points. This procedure is called centering
and the common origin of the principal components is called the mean center.
Observe that the “average point” may well be an abstraction: it does not have to
correspond to any physical object present among the available samples. It is a very
useful abstraction however.
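In code, centering is nothing more than subtracting the column means; a minimal sketch with invented numbers:

```python
import numpy as np

# Invented 4 x 3 data matrix: 4 objects, 3 variables.
X = np.array([[1.0, 10.0, 3.0],
              [2.0, 12.0, 5.0],
              [3.0, 11.0, 4.0],
              [4.0, 13.0, 6.0]])

x_mean = X.mean(axis=0)      # the "average object"; it need not equal any real sample
Xc = X - x_mean              # translated data: the PC model is fitted to this centered matrix

print(x_mean)                # the model center
print(Xc.mean(axis=0))       # the centered columns now all have mean 0
```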

3.6.2 Loadings - Relations Between X and PCs
The PCs really are variance-scaled vectors in variable space, whose directions are
determined as detailed above. Any one principal component can be represented as a
linear combination of the p unit vectors of the variable space (i.e. unit vectors
along each original axis in the variable space). The linear combination for each PC
will contain p coefficients, one for each of the p unit vectors. We will call the
coefficients pka, where k is the index for the p variables and a is the index for the
principal component direction coefficients. An example is p23, which would be the
coefficient for the second p-basis vector in the linear combination that makes up
PC3 in variable space.


These coefficients are called loadings and there are thus p loadings for each PC.
The loadings for all the PCs constitute a matrix P. This matrix can be thought of as
the transformation matrix between the original variable space and the new space
spanned by the Principal Components. For PCA, the loading vectors - the columns
in P - are orthogonal.

Loadings give us information about the relationship between the original p variables and the Principal Components. In a way they constitute a bridge between
variable space and PC space. We may also say that the loadings construct the
directions of each PC relative to the original co-ordinate system, since the PCs can
be viewed as a linear combination of the original unit vectors.

Normally the loadings refer to variable space where the origin is centered, i.e. the
origin of the variable co-ordinate space is moved to the average object. This
corresponds to a simple translation as was shown in Figure 3.12.

We will discuss loadings in great detail (especially how to interpret them), and also
work much more with them, in several subsequent exercises.

3.6.3 Scores - Coordinates in PC Space


We earlier briefly introduced PCs as resulting from a projection of objects onto
axes with particular features. For this reason one will often see PCA and similar
methods designated as projection methods. Consider for instance PC1. Consider
object i and project it down onto PC1 (Figure 3.13). It will have a projection “foot
point” on this PC-axis, a distance (co-ordinate) relative to the PC-origin (either
negative or positive). This co-ordinate is called the score for object i. This is
designated as ti1. The projection of object i onto PC2 will give the score ti2, the projection onto PC3 the score ti3, and so forth. Just like in variable space, the (projected) object i corresponds to a
point in the new co-ordinate system, only now with co-ordinates, scores (ti1, ti2, ... ,
tiA). Observe how the object has been projected down onto an A-dimensional
surface (an “A-flat”); usually A < p.

Each object will thus have its own set of scores in this dimensionality-reduced
subspace. The number of scores, i.e. number of subspace co-ordinates for each
object, will be the same as the number of PCs. If we collect all the scores for all the
objects, we get a score matrix T. Notice that the scores for an object make up a row
in T. The columns in the score matrix T are orthogonal, a very important property
that will be of great use.


Figure 3.13 Scores as PC-coordinates


[Sketch: object i in the variable space (x1, x2, x3) is projected onto the PC1-PC2 plane, giving the score co-ordinates ti1 and ti2]

We will often have reason to refer to score vectors. A score vector is a column of
T. It is thus not the scores for a single object, but the scores for one entire Principal
Component; it is the vector of “foot-points” from all the objects projected down
onto one particular principal component. Therefore there will be a score vector for
each Principal Component. It will have the same number of elements as there are
objects, n. The general term “score” can be ambiguous. Usually, "scores" means
“elements in the T-matrix” without any further specification.
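The relationship between X, scores and loadings can be sketched very compactly on simulated data via the singular value decomposition of the centered matrix; a PCA algorithm such as NIPALS gives equivalent scores and loadings (up to sign), so the sketch is illustrative only.

```python
import numpy as np

# Simulated X-matrix: 20 objects, 6 variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 6))
Xc = X - X.mean(axis=0)                       # center first

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2                                         # number of PCs retained
T = U[:, :A] * s[:A]                          # score matrix T (n x A): one row per object
P = Vt[:A].T                                  # loading matrix P (p x A): one column per PC

print(np.allclose(T.T @ T, np.diag((T**2).sum(axis=0))))   # score vectors are orthogonal
print(np.allclose(P.T @ P, np.eye(A)))                      # loading vectors are orthonormal
print((s**2 / (s**2).sum())[:A])                            # fraction of total variance per PC
```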

3.6.4 Object Residuals


There are many advantages in using a PC-model of the X-data, i.e. substituting the
scores for the p variables, but there is also a price to be paid. This price is
expressed by the size of the projection distances ei. They represent “information
lost” by changing the representation, namely by the projection. The projection
rendition is but an approximation to the original data set. The distances ei are called
object residuals. Here it pays to remember that the PCs can be thought of as the
result of minimizing the sum of squared object distances (object residuals). If these
distances, the object residuals, are large this implies that the fit is not good; the
model does not represent the original data very well. Many insights can be gained
from inspection of various statistics based on the object residuals (“misfit
statistics”).
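A short, self-contained continuation of the sketch above computes the object residuals explicitly: what is left of each (centered) object after the A-component reconstruction T P'.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 6))                 # same simulated data as in the previous sketch
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2
T, P = U[:, :A] * s[:A], Vt[:A].T

E = Xc - T @ P.T                             # residual matrix E (n x p)
object_residuals = (E**2).sum(axis=1)        # squared residual distance per object
print(object_residuals.round(2))             # unusually large values flag poorly fitted objects
```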

3.7 Objectives of PCA


The objective of Principal Component Analysis, PCA, is to substitute the
representation of the objects, from the initial representation in the form of the p
original variables into the new Principal Component coordinate space. In fact we
do not simply wish to change the co-ordinate system; we also want the advantage of dropping the noisy, higher-order PC-directions (Figure 3.14). Thus PCA serves a dual objective: a transformation into a more
relevant co-ordinate system (which lies directly in the center of the data swarm of
points), and a dimensionality reduction (using only the first principal components
which reflect the structure in the data). The only “problem” is: how many PCs do
we wish to use?

Figure 3.14 The PC coordinate system (PC1, PC2, PC3) within the variable space (x1, x2, x3)

In Figure 3.13 one may appreciate how the new PC co-ordinate system is also
reducing the dimensionality from 3 to 2. This is of course not an especially
overwhelming reduction in itself, but it should be kept in mind that PCA handles
the case of, say, 300 → 2, or even 3000 → 2 equally easily. The 3-D → 2-D (or
1-D) reduction is only a particularly useful conceptual image of the dimensionality
reduction potential of PCA, since it can be rendered on a piece of paper or on a computer
screen. In fact, a large number of variables can often be compressed into a
relatively small number, e.g. 2,3,4,5 PCs or so, which allows us to actually see the
data structures regardless of the original dimensionality of the data matrix with but
a few plots (but then only as projections!).

3.8 Score Plot - “Map of Samples”


We are now ready to present one of the most powerful tools that “Principal
Component”-based methods can offer us - the score plot. A score plot is simply any
two pair of score vectors plotted against each other (Figure 3.15). It is important to
remember that the score vectors are only the “footprints” of the objects projected
down onto the Principal Components. Plotting the score vectors corresponds to
plotting the objects in a pertinent PC sub-space. Score plots are typically referred to by their score designations, for example t1t2 for the PC1-PC2 score sub-space. Score
plots can be viewed as particularly useful 2-D “windows” into PC-space, where one
observes how the objects are related to one another. The PC-space may certainly
not always be fully visualized in just one 2-D plot, in which case two, or more
score plots is all you need. You are of course necessarily restricted to 2- or 3-
dimensional representations when plotting on paper or working on VDU-screens.

Figure 3.15 Example of a score plot of PC1 and PC2 (t1t2). Objects are designated by plotting symbols, which allows for interpretations (see further in text)
[Pea data score plot; PC1 (abscissa) and PC2 (ordinate) explain 94% and 3% of the X-variance respectively]

The most commonly used plot in multivariate data analysis is the score vector for
PC1 versus the score vector for PC2. This is easy to understand, since these are the
two directions along which the data swarm exhibits the largest and the second
largest variances. In Chapter 2, exercise "Quality of Green Peas" (descriptive
statistics) the problem concerned sensory data for peas. A plot of PC1 scores versus
PC2 scores, the t1t2 plot, is shown for this data set in Figure 3.15. Scores for PC1
are along the “x-axis” (abscissa) and the scores for PC2 are along the “y-axis”
(ordinate). Notice that objects are plotted in the score plot in their relative
dispositions with respect to this (t1t2)-plane, and that we have here used the very
powerful option of having one, two (or more) of the object name characters serving
as plotting symbol. This option will greatly facilitate interpretation of the meaning
of the inter-object dispositions.
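A score plot is easy to reproduce outside The Unscrambler once the scores have been exported. The sketch below (file names hypothetical) labels each object by its name code, exactly the trick used in Figure 3.15.

```python
import numpy as np
import matplotlib.pyplot as plt

scores = np.loadtxt("pea_scores.csv", delimiter=",")        # hypothetical export: columns t1, t2, ...
names = open("pea_names.txt").read().split()                # hypothetical name codes, e.g. "A1", "B3"

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=5)
for (t1, t2), name in zip(scores[:, :2], names):
    ax.annotate(name, (t1, t2), fontsize=8)                 # use the name code as plotting symbol
ax.set_xlabel("PC1 scores (t1)")
ax.set_ylabel("PC2 scores (t2)")
plt.show()
```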


Interpreting Score Plots


Looking at Figure 3.15 you will see that there is a singular object in the upper left
corner, A1 (in fact there are three replicate measurements of A1). It has the most
negative score for PC1, approx. -6.0, and a very large positive score for PC2,
approximately 1.3. Another observation would be that there seems to be a time
trend along PC1. The objects are labeled with the letters A, B, C, etc. to denote the type of peas (“cultivars”), while the numbers 1 - 5 indicate harvest time. If you
follow the objects from left to right, you see that this symbol for harvest time
increases systematically along the first PC. From this you can conclude that PC1
has something to do with harvest time. The “PC1”-variable would appear to be
strongly associated with the progress of harvest time. In other words PC1 lies along
the maximum variance of this “hidden” variable; the “hidden phenomenon” will
here be “progressively later harvest time”.

But also notice that “harvest time” is not a variable in the original X-matrix. Still
we can see it clearly in the score-plot because of this option of also being able to
use information recorded in the names of the objects. For this particular example,
one is led to the conclusion that time of harvest is important for the taste of the
peas. Since PC1 is the most dominant Principal Component (in fact, it carries no
less than 93% of the total X-variance), harvest time seems to be a very important
factor for the sensory results.

The same reasoning can be followed if we are interested in PC2. One can here
easily see another pattern in the plotting symbols as we move down the plot.
Objects with letter A are placed highest and objects with the letter E lowest. Thus
PC2 clearly has something to do with discriminating between pea types (A, B…E).

Note how PCA decomposes the original X-matrix data into a set of orthogonal
components, which may be interpreted individually (the PC1 phenomenon may be
viewed as taking place irrespective of the phenomenon along PC2 etc.). In reality -
of course - both these phenomena must act out their role simultaneously, as the raw
data without doubt come from the X-matrix altogether.

This possibility for orthogonal, decomposed interpretation is one of the most powerful aspects of PCA. But these are only simple examples of the type of information you can get out of score plots; there are many more uses. Score plots are used for outlier identification, identification of trends, identification of groups, exploration of replicate similarities and more. When interpreted together with loadings, the influence of the original variables can also be deduced. This will be
discussed later in section 3.9 on page 40 and in chapter 4.

Choice of Score Plots


The (t1t2) score plot is the “work horse” of PCA, which is almost always viewed
first. However, any pair of principal components may be plotted against each other
in the same way as t1t2 of course. Which components actually need to be plotted is
another matter. Suppose, for the sake of argument, that the number of variables (p)
is less than number of objects. There will be p*(p-1)/2 possible 2-D scatter plots.
The number of possible plots quickly becomes practically unmanageable. For p=10
there are 45 plots, for p=15 there are 105 and p=25 gives 300 alternative plots.
Clearly it becomes impossible to study all plots, even for powerful computers – and
fortunately this is not called for either. There is in general little new information in
higher order PC scatter plots as compared to the lower order plots.

There are two rules of thumb concerning score plots:

1. Always use the same principal component as abscissa (x-axis) in all the score
plots: look at t1t2, t1t3, t1t4,… In this way you will be “measuring” all the other PC-
phenomena against the same yardstick, t1. This will greatly help getting the desired
overview of the compound data structure.

2. Use the principal component that has the largest “problem relevant” variance as
this basis (x-axis) plotting component. For many applications this will turn out to
be PC1, but it is entirely possible in other cases that PC1 lies along a direction that
for some problem-specific reason is not interesting. - If the time of harvesting in the
pea example above was, say, described in PC3 and place of harvesting in PC4, it
would not make much sense to plot PC1 vs. PC2 for studying these aspects. PC1
and PC2 would certainly describe “something” (other), but not what we were
looking for. In general PC1 describes the largest structural variation in any data set,
and in many situations this - per se - is often an “interesting” feature, but this does
not necessarily mean that this variation always is the most important for our
particular interpretation purpose. Nor is correlation per se equivalent to causality.

These rules of thumb are very general and there are many exceptions. Our advice is
to start all data analysis following these simple rules, but always look out for
possible deviations. After an initial analysis, you may for example find that higher-
order score plots are necessary for interpretation after all. There are also many interesting cases in which the problem-specific information is to be found “swamped” in many “other” components. Sometimes, both the first, largest
components, and the truly insignificant higher-order ones have to be discarded: the
particular subject matter can be revealed in one, or more, of the intermediate
components.

As a case in point: the senior author was earlier involved in geochemical prospecting in Sweden; the following features are courtesy of Terra Swede AB
(now defunct, but transformed into Terra Mining AB).

Working with 1 kg moraine overburden samples, for which over 30 geochemical main and trace elements were recorded, an appropriate multivariate approach was
clearly necessary and both PCA and PLS-R (see later) were employed. The
geochemical prospecting campaign was directed towards finding new, hidden
mineralisations of gold (only one of more than 30 X-variables).

The specific gold-correlated variables (proprietary information) could all be isolated in PC-3, which only accounted for 8% of the total variance (PC1: 57%;
PC2: 17%). The first two components were related to major geological processes
responsible for the overall moraine deposits and their chemistry, i.e. the chemistry
not related to rock fragments originating from buried mineralizations. Still, these
glacial processes and their impact on the overall chemical makeup of the primary
moraine samples were of a sufficient magnitude to “control” the first two principal
components (accounting for 74% of the total variance). It was only after successful
isolation of this particular PC3 that the geochemical mapping gained momentum –
all based on the geographical disposition of the scores of component 3. The raw
geochemical data (over 2000 samples) were much too overwhelming: consider, for
example, interpreting more than 30 individual geochemical maps simultaneously -
an impossible challenge even to trained geologists.

3.9 Loading Plot - “Map of Variables”


Loading vectors, or simply loadings, can be viewed as the bridge between the
variable space and the Principal Component space. As with score vectors, loading
vectors can be plotted against each other. These plots are equally valuable for
interpretation in their own right, but especially when compared with their
corresponding score plots.


Interpreting the Loading Plot


In Figure 3.16 we have again turned to the pea data set as an example. This is the
(p1,p2)-plot, which corresponds to plotting loading vector PC1 versus loading
vector PC2. Note that the points plotted here are variables instead of objects. The
loading plot provides a projection view of the inter-variable relationships (variable
similarities).

The loading plot shows how much each variable contributes to each PC. Recall that
the PCs can be represented as linear combinations of the original unit vectors
(pa = Σk pka ek, where the ek are the unit vectors along the original variable axes). The loadings themselves are the coefficients in these linear combinations. Each variable can contribute to more than one PC. In Figure 3.16 the
x-axis denotes the coefficients for all the variables making up PC1. The y-axis,
correspondingly, denotes the coefficients defining PC2 in the variable space.

Figure 3.16 Loading plot of PC1 and PC2 (p1,p2)


[Loading plot for the pea data, PC1 (abscissa) vs. PC2 (ordinate), showing the variables Sweet, Fruity, Pea_Flav, Off_flav, Hardness and Mealiness; X-variance explained: 94%, 3%]

The variables “Sweet”, “Fruity” and “Pea Flav” are located to the right of the plot.
“Sweet” contributes strongly to PC1 but not at all to PC2, since the value on the
PC2-axis is 0. Our earlier look at the (t1,t2)-score plot for the peas found that PC1
could be related to harvest time, and the inferred relation to pea flavor can, strictly speaking, only be appreciated once this loading plot is available; “Pea Flav” loads very high in the positive PC1 direction indeed. From this we can also deduce that
measurements of “Sweet” can be used, together with the other similar variables
“Fruity” and “Pea Flav”, to evaluate harvest time. We can also say that the later the
peas are harvested, the sweeter they are. From the loading plot we see that other variables also contribute to PC1, but at the opposite end (displaying negative loadings).

Above we deduced that Sweetness has nothing to do with the property described by
PC2. Identifying this property is a task for exercise 3.12. For now we can say that if
we wish to determine this property, we certainly should not measure “Sweet”.

Some variables display positive loadings, that is to say positive coefficients in the
linear combinations, while others have negative loadings. For instance, Sweetness
contributes positively to PC1, while Off Flav has a negative PC1-loading (as well
as a high positive PC2 loading). PCA-loadings are usually normalized to the
interval (-1,1), more of which later.
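The corresponding loading plot can be drawn with exactly the same recipe as the score plot sketched earlier, only now the points are variables and the labels are variable names (file name hypothetical; the variable list is taken from the plot above).

```python
import numpy as np
import matplotlib.pyplot as plt

loadings = np.loadtxt("pea_loadings.csv", delimiter=",")   # hypothetical export: columns p1, p2, ...
var_names = ["Pea_Flav", "Sweet", "Fruity", "Off_flav", "Hardness", "Mealiness"]

fig, ax = plt.subplots()
ax.scatter(loadings[:, 0], loadings[:, 1], s=5)
for (p1, p2), name in zip(loadings[:, :2], var_names):
    ax.annotate(name, (p1, p2), fontsize=8)                 # label each point with its variable name
ax.set_xlabel("PC1 loadings (p1)")
ax.set_ylabel("PC2 loadings (p2)")
plt.show()
```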

Co-varying Variables
The 3-dimensional plotting facility, as for scores, can also be used to study how the
original variables co-vary - in a 3-D loading plot. If the variables are situated close
together geometrically (i.e. display similar loadings), they co-vary positively.

In this example (Figure 3.17) Fruity, Sweet and Pea Flav are positively correlated
through PC1 because they make more or less the same contribution to PC1, but we
now also see that Sweetness, in addition, displays a high loading on PC3.

Figure 3.17 Three-vector loading plot (3-D plot of inter-variable relationships)
[Pea data loadings for PC1, PC2 and PC3; X-variance explained: 94%, 3%, 2%]


The variables Off_Flav, Hardness and Mealiness are also positively correlated with
each other at the negative PC1 end (these three variables have more or less equal
loadings for this component), but definitely only through PC1. This is clear from
their disposition in both the PC2 and PC3 directions in which these variables
occupy rather disparate positions in the 3-D plot. Simultaneously, these two sets of
three variables each (those with positive PC1-loadings and those with negative
PC1-loadings) are negatively correlated to each other since their PC1-loadings have
opposite signs. A strong correlation can either be positive or negative, and the
loading plot shows both these relationships unambiguously.

Some of these findings were already quite clear using the 2-dimensional loading
plot. The augmented 3-dimensional loading plots are usually more useful when the
number of X-variables is much higher than that of this simple illustration.

Comparison of Score Plot and Loading Plot


The corresponding score and loading plots are complementary and give the most
valuable information about both the objects and the variables when studied together
(Figure 3.18). Here the objects in the score plot are numbered instead of, as
previously, shown by their name code. The use of optional name-character(s) as
plotting symbols has many advantages that will gradually become clear after the
examples and exercises below.

Figure 3.18 Score and loading plots together


(Upper: score plot, PC1 vs. PC2, with the 60 pea samples shown by number; lower: the corresponding loading plot, PC1 vs. PC2. Pea Senso6, X-expl: 94%, 3%)

Sample (= object) 51 has a position in the score plot that corresponds to the
direction of variable Off-flavor in the loading plot. This means that sample 51 has a
high value for the variable Off-flavor. Sample 22 is very sweet, “9” is mealy and
hard and so on. If we take into account what we know about PC1, then early
harvesting time seems to give off-flavored peas that are hard and mealy, while late
harvesting (positive scores) results in sweet peas with strong pea flavor. Now it will
also be clear that PC2 can be interpreted as an “unwanted pea taste”-axis (going
from hard and mealy peas to distinctly off-flavored peas).

There is of course probably nothing particularly surprising to food science in the
above interpretations of this simple data analysis. The complementary
interpretation strategy - using the score plot to come up with the how and the
loading plot to understand why - should be well illustrated however. To be specific:

How? How are the objects distributed with respect to each other, as shown by the
decomposed PC score plots? The score plot shows the object interrelationships.
Why? Which variables go together, as a manifestation of the correlations defining
the PCs? The loading plot shows the variable interrelationships. The loading plot
is used for interpreting the reasons behind the object distribution.

In many cases one uses the 2-D score/loading plots illustrated above, but not
always.

The 1-Dimensional Loading Plot


In some cases the use of many variables may restrict the usefulness of the
2-dimensional loading plots. In spectroscopy, for example, where there can be up to
several thousand variables, the 2-dimensional loading plots are generally too
complex for simple interpretation, but can be useful for the detection of selective
variables for example. The type of detailed interpretation that we did with the peas
is virtually impossible. It is then necessary to use the 1-vector loading plots for one
PC at a time. These are often referred to as "loading spectra", simply because they
often look like “typical spectra”, but only in an open generic sense of course. This
1-D option is certainly not restricted to spectroscopy data.
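A 1-D loading plot is nothing more than the elements of one loading vector plotted against the variable index (e.g. a wavenumber channel). The following matplotlib sketch is our own, generic illustration on synthetic, spectrum-like data - not the IR data set discussed below:

import numpy as np
import matplotlib.pyplot as plt

# Stand-in "spectral" data: 30 spectra measured at 500 channels (random walks per row)
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 500)).cumsum(axis=1)
Xc = X - X.mean(axis=0)                          # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc = 3                                           # loadings of PC4 (zero-based index 3)
plt.plot(Vt[pc, :])                              # the "loading spectrum" for this PC
plt.xlabel("variable index (e.g. wavenumber channel)")
plt.ylabel("loading, PC%d" % (pc + 1))
plt.title("1-D loading plot (loading spectrum)")
plt.show()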

Figure 3.19 is a 1-dimensional PC4 loading plot from a PCA of a set of IR spectra
of complex gas mixtures. In spectroscopy, the 1-D loading plots are often a great
advantage, as they are very useful for the assignment of diagnostic spectral bands
for example. As can be seen from Figure 3.19, such a loading plot indeed shows
great similarity to a spectrum. This loading plot is from a data set consisting of IR-
spectra of a system with 23 mixed gases. The loadings in Figure 3.19 belong to a
PC that can partially be related to the presence of dichloromethane. For
comparison, the original spectrum of pure dichloromethane is shown in Figure
3.20.

Figure 3.19 One-vector loading plot (1-D loading plot) – PC4
Figure 3.20 Spectrum of dichloromethane (one of 23 mixture components)

Spectral features in the lower wavenumber region (i.e. the leftmost variables) can
be recognized in the loading plot. The loading plot shows the largest loadings,
which correspond to the most important diagnostic variables in this range (and in a
few other bands). Since dichloromethane absorbs in this region we can conclude
that PC4 - among other things - models the presence of dichloromethane. Note that
this system of 23 gases is actually rather complex so the PC-loading spectrum also
contains numerous smaller contributions and interferences from the other
compounds. Still, this is a very realistic example of the typical way one goes about
interpreting the “meaning” of a particular principal component in PCA of
spectroscopic data.

A final note on the graphical display of loadings and loading plots. 2-D score-plots
can be understood as 2-D variance-maximised maps showing the projected inter-
object relationships in variable space, directed along the directions of the particular
two PC-components chosen. By a symmetric analogy: 2-D loading plots may be
viewed as 2-D maps, showing projected inter-variable relationships, directed along
the pertinent PC-component directions in the complementary object space.

The object space can be thought of as the complementary co-ordinate system made
up of axes corresponding to the n objects in matrix X. In this space, one can
conveniently plot all variables and get a similar graphical rendition of their
similarities, which will show up in their relative geometrical dispositions,
especially reflecting their correlations. The particular graphical analogy between
these two complementary graphical display spaces is matched by a direct symmetry
in the method used to calculate the scores and loadings by the NIPALS algorithm
(see section 3.14 on page 72).

3.10 Exercise: Plotting and Interpreting a PCA-Model (People)
Purpose
The objective of this exercise is to demonstrate how to study all variable- and
object-interrelationships in the data set simultaneously through PCA modeling.
Here you will learn how to build a PCA model and get started to use it to interpret
the results.

Data Set
The background information about the “PEOPLE” data set was given in exercise
3.3.1 on page 22. There are 32 persons from two regions in Europe, A and B. For
reasons of anonymity the persons’ real names are not included; instead we have
added codes in each object-name. The first position represents M/F (male/female),
whereas the second position in the name represents regions A/B respectively. Just
as in exercise 3.3.1 , the option of using both, or only one, of these codes as an
information-carrying plotting symbol will be made use of also in the context of
PCA-plots. The standard option is to use the running-number identifications for
objects 1 through N=32.

Tasks
1. Study the PCA-results. Focus is on the score plot and the loading plot.
2. Interpret the variable relationships and the object groupings.

How to Do it
The PCA model for the “people data” set has already been made. The modeling
results are stored on a particular file, which can be accessed by selecting the
model from Result-PCA. Select the file “people.11D” and press View. Here you
will observe the first graphical overview of the PCA-model results, which
consists of a score plot, a loading plot, an influence plot, and a residual variance
plot. We will first focus on an interactive study of the score plot and the loading
plots.

Score Plot
The score plot is on the upper left corner of the PCA overview. A plot of PC1
vs. PC2 can be seen. If the numbers of the persons do not show, display the
numbers instead of the object-names using Edit-Options. You may also want to

enlarge this plot with the menu Window-Full Screen or with the corresponding
icon.

You will first of all observe a very clear grouping in the PC1-PC2 plot. In fact
you observe four distinct clusters of objects. This “grouping”, “clustering” or, as
we sometimes would want to express it (see later), this “data class delineation”
is optimally seen when using identical plotting symbols. In Figure 3.21 we have
used the sample numbers as object identification. If so desired, we could
actually have used the same symbol (x, +, o…) for all objects.

Please try and find out how to do this by using Edit-Options. You will find that
this is extremely useful for what might be termed the “initial overall pattern
recognition phase” of an exploratory PCA. However, as soon as we would like
to go any deeper into the data structure revealed, for example to find out about
the specific characteristics for each of these four groups, we need to use more
“plotting symbol information”, i.e. to use object-name information instead.

Figure 3.21 Score plot (PC1 vs. PC2)
(Plot: scores for the 32 persons, identified by sample number. RESULT3, X-expl: 54%, 19%)

Who are the extreme persons? Which persons are similar to each other? Are
these the same conclusions as from plotting the raw data? Try to interpret the
meaning of the axes. For example, what do the persons to the left-hand side of
the plot have in common? And what about those to the right, upper part and
lower part?

Sample 21 is the leftmost person and Sample 1 the rightmost. Why? Is that
consistent with what you saw when plotting the raw data?

Now you may invoke the “object-name” information plotting option, this time
choosing to use both characters as the plotting symbol, as in Figure 3.22.
Observe the dramatic difference this makes with respect to immediate
interpretation of the four groups revealed!

What can you now say about these four groups? Do people from region A have
something in common? Do the people in region A differ from those in region B?

Figure 3.22 Score plot (PC1 vs. PC2) with names marked
(Plot: the same scores, now with the two-character codes MA, FA, MB, FB as plotting symbols. RESULT3, X-expl: 54%, 19%)

As can be seen from this score plot (PC1 vs. PC2), the four groups of people
represent a double two-fold grouping, regional belonging (A/B) as well as
gender differences (M/F); each of the four possible combinations occupies its own,
separate region of the plot. The males are on the right-hand side and the females on the
left-hand side, while along PC2 the people from region A and B seem to lie at
the lower and upper parts of the plot respectively. Trying to interpret the
meaning of PC1 and PC2 is thus easy: PC1 is a “sex discriminating” component,
and PC2 is a "region discriminating" component.

Now we proceed to study the PC1 vs. PC3 score plot using the Plot- Scores
option and see what can be observed from this plot. No similar clear groups can

be seen in the plot, even if you try the “plotting symbol” trick from above again.
However, if you use Edit - Options - Sample Grouping - Enable Sample
Grouping - Separate with Colors - Group by - Value of Variable and mark
variable Age, you are able to infer that PC3 turns out to span/separate the older
from the younger people.

The age would appear to increase from the upper part to lower part, i.e. along
PC3, but this is not an easy thing to appreciate, since we did not have any age-
related information present in the object-names.

Typically binary (-1/1) information, such as gender and region in the above
example, lends itself easily to this type of plotting-symbol coding. Sometimes a
discrete (or categorical) information type might be used in a similar fashion, for
example A/B/C/D.
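If you wish to reproduce this kind of colored score plot outside The Unscrambler, the grouping idea can be sketched in a few lines of Python. The score values and the Age variable below are random placeholders that you would replace with the actual PEOPLE results:

import numpy as np
import matplotlib.pyplot as plt

# Placeholders: in practice, load the PEOPLE data and compute the PCA scores first
rng = np.random.default_rng(2)
scores = rng.normal(size=(32, 3))        # stand-in t1, t2, t3 for the 32 persons
age = rng.integers(18, 65, size=32)      # stand-in Age variable

# Score plot PC1 vs. PC3, colored by the value of a chosen variable (here Age),
# mimicking Sample Grouping - Separate with Colors - Group by Value of Variable
sc = plt.scatter(scores[:, 0], scores[:, 2], c=age, cmap="viridis")
plt.colorbar(sc, label="Age")
plt.xlabel("PC1 scores (t1)")
plt.ylabel("PC3 scores (t3)")
plt.title("Score plot colored by a continuous variable")
plt.show()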

Figure 3.23 Score plot (PC1 vs. PC3)
(Plot: scores for the 32 persons, PC1 vs. PC3. RESULT3, X-expl: 54%, 13%)

We will now turn to the loading plot.

Loading Plot
You should always study the corresponding loading plot to see why the data
structure is grouped, or why a particular sample is located in a specific location
in the score plot. For instance, objects on the left-hand side of the score plot will
have relatively large values for the variables on the left-hand side of the loading
plot and vice versa.

Figure 3.24 Loading plot (PC1 vs. PC2)
(Plot: X-loadings for the variables Height, Weight, Shoesize, Swim, Beer, Wine, Sex, Hairleng, Age, Income, IQ and A/B. RESULT3, X-expl: 54%, 19%)

What is the characteristic of the persons on the left-hand side? And in the lower
part of the score plot? (Hint: Study both the corresponding score - and loading
plot interactively). Which variables are located along PC1 and which ones
along PC2? Do the people with larger height and weight have better swimming
ability? Is this dependent on Sex? Do the older people have higher salary than
the younger ones? Are there any variables that can differentiate people in
region A from those in region B? Do the people in regions A and B have
similar wine and beer consumption? Do they have the same IQ? Is IQ
correlated to the physical variables? If so, which?

Also study the loading plot (PC1 vs. PC3) using Plot-Loadings. Can you see any
variables distributed along PC3? By now you should have a clearer idea of the
meaning of PC3. Compare these conclusions with those you get from the
score plots. Are they consistent with each other?

Figure 3.25 Loading plot (PC1 vs. PC3)
(Plot: X-loadings for the same variables, with Age and Income in the negative PC3 direction. RESULT3, X-expl: 54%, 13%)

Another important function of the loading plot is to study the variable
relationships. Variables plotted in the same direction from the center are
positively correlated, while those on opposite sides of the origin are negatively
correlated. Display the loading plot on the whole screen to see the variable
names more easily. Select Window-Full Screen from the menu or the tool bar.

Which variables are highly correlated in the PEOPLE data set? If they are
correlated, is there any physical relationship? Which of them are negatively
correlated? What about the IQ (plot PC1 against PC4)? Are there ANY
surprises at all compared to what one would generally expect in a sample of
pan-European people?

Summary
This exercise showed how to display these two most important model plots: the
score plot and the loading plot. It illustrated how the patterns of the samples in the
score plot (inter-object groupings, trends) can be “explained” from the variable
loadings. An object or a group of similar objects on the left-hand side in the score
plot has a high value of the variable(s) on the left-hand side in the loading plot.

As an example we saw how the first PC could be interpreted as describing the
gender of the persons; all females were on the left-hand side and all males on the
right-hand side. The second PC could denote that all people from region B lie on

the upper part of the plot and all those from region A lie on the lower part. The
third PC could be used for separating relatively older people (with correlated higher
salary) from younger ones with relatively lower salary. The fourth PC describes IQ
only, not correlated to any other variable.

There is also a tendency that people in the upper-left part of the PC1-PC2 plot drink
more wine and less beer than those in the lower-right part. This trend can be
observed “diagonally” in the score plot with enough scrutiny, but it is also revealed
in the appropriate loading plot, PC1-PC2: variable “Wine” has a distinctly high
PC2 loading, while “Beer” loads high on PC1 as well as on the negative PC2
direction, in fact “creating” the diagonal object relationship observed in the score plot.

Furthermore all males display relatively large values of the variables height,
weight, shoe-size, swimming ability and beer consumption. In addition there is a
tendency for the males to have shorter hair than the females in this data set. The
interesting thing is that IQ does not seem to be related to any other variables,
because it displays effectively zero loadings on all of the first three components.

The score plot also showed similarities and differences; people located near each
other in the plot are similar, such as samples 7 and 8. It is easy to make whatever
pertinent interpretations characterizing the data structure revealed. For example,
coupled with the loading plot, the score plot shows a simultaneous tendency that
people in region A drink more beer and less wine and have relatively higher salary
than those in region B.

Data analytical co-variation does not necessarily mean physical correlation in a
causal sense. One must of course always use common sense and specific
application knowledge to separate “numerical co-variation” from “causal
relationship”. Weight, shoe size and height in general must correlate for
adolescents, but age and height probably do not correlate as strongly for mature
adults. IQ does not depend on the regions and physical characteristics in this
sample. Incidentally, this is why PCA historically was developed in the psychology
sciences in the early 1900’s, as it was actually believed that for example IQ and
criminal behavior could be explained by physical characteristics and social
background (sic). It is therefore important to note that PCA gives us a completely
unbiased view of the data – we are not imposing on it any preconceived ideas of
how the data should be modeled.

Comparing the PCA-results above with the earlier plots of paired variables, we see
that a PCA model gives a much more comprehensive overview of the whole data

set, “all in one glance” (- well four plots rather). In fact, going from a large set of
isolated, bivariate scatter plots (p * (p-1)/2) to a very few score/loading plots,
demonstrates one of the strongest features of PCA as used for exploratory data
analysis. This exercise gives you a first view of the power of PCA modeling. More
will be shown later in the coming chapters.

3.11 PC-Models
We shall now present a more formal description of PC-modeling. We will also look
at the more practical aspects of constructing adequate PC-models.

3.11.1 The PC Model: X = TP^T + E = Structure + Noise
As was briefly already mentioned above, PCA is concerned with decomposition of
the raw data matrix X into a structure part and a noise part. The equation in the title
of this section hints at how (by describing the separation with respect to E: the
error part).

Centering
The X-matrix used in the equations above is not cast precisely as our raw data set.
The original variables have first been centered:

Equation 3.1 (centering): $x_{ik} \leftarrow x_{ik} - \bar{x}_k$

We have mentioned centering previously, both in connection with the translation of
the origin and with the first description of the loadings. It is sometimes claimed to
be desirable to analyze without centering (but in fact this only applies to very
special situations). We shall here follow an established procedure within
chemometrics, and use a notation which does not differentiate between centered
and uncentered data. When we discuss residuals below we will touch upon this
theme again, but generally: X is used as the designation for the centered data
matrix.

Equation 3.2 (the PC-model, centered): $X = TP^T + E$

In PCA we start with the assumption that X can be, indeed should be, split into the
sum of a matrix product, TP^T, and a residual matrix E. As you have probably
gathered, T is simply the score matrix described above, and P^T is the
accompanying loading matrix (transposed). We wish to determine T and P and use
their product instead of X, which will then, most conveniently, have been
stripped of its error component, E:

The PC-model is the matrix product TP^T.

E is not part of the model per se; it constitutes the so-called residual matrix. It
is simply that part of X which cannot be accounted for by the available PC
components; in other words, E is not “explained” by the model. Thus E becomes
the part of X that is not modeled by (included in) the product TP^T. E is
therefore also a good measure of “lack-of-fit”, which tells us how close our model
is to the original data. While the data analytical use of PCA-models is mainly
concerned with the data structure part, TP^T, we could not do without the
complementary “goodness-of-fit” measure residing in E (a large E corresponds to a
poor model fit and vice versa).
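As a minimal numerical sketch of this decomposition (our own illustration, based on NumPy's singular value decomposition rather than the NIPALS algorithm presented in section 3.14), the scores T, the loadings P and the residual E of an A-component model can be obtained as follows:

import numpy as np

def pca_model(X, A):
    """Decompose a raw data matrix X into Xc = T P^T + E using A components."""
    Xc = X - X.mean(axis=0)                       # centering, cf. Equation 3.1
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :A] * s[:A]                          # scores, n x A
    P = Vt[:A, :].T                               # loadings, p x A
    E = Xc - T @ P.T                              # residual matrix
    return T, P, E

# Example on an arbitrary 10 x 4 data matrix:
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
T, P, E = pca_model(X, A=2)
print("Residual sum of squares with A = 2:", np.sum(E**2))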

Step by Step Calculation of PCs


Equation 3.2 is compact. It is useful to write the equation as individual PC-
contributions, as individual outer vector products as in Equation 3.3:

Equation 3.3: $X = t_1 p_1^T + t_2 p_2^T + \cdots + t_A p_A^T + E$

An outer product results in a matrix of rank 1. Here, t_a is the score vector for PC a
and is n x 1-dimensional. p_a is the corresponding loading vector; since it is p x 1-
dimensional, p_a^T is thus 1 x p-dimensional. Each outer product t_a p_a^T is therefore
n x p-dimensional, i.e. of the same dimension as X and E, but each of these t_a p_a^T
matrices has mathematical rank exactly 1. This is illustrated graphically in Figure 3.26.

Figure 3.26 The PC model as a sum of outer vector products


(Schematic: X = t1 p1^T + t2 p2^T + . . . + E, each outer product being an n x p matrix of rank 1)

Equation 3.3 is closer to the actual PC-calculation than the compact matrix
equation. PCs are in fact often calculated one at a time. Let us first outline how, as
an introductory overview (a short numerical sketch follows the outline):

1. Calculate t1 and p1 from X.
2. Subtract the contribution of PC1 from X: E1 = X - t1 p1^T
   (Note: at this stage X = t1 p1^T + E1)
3. Calculate t2 and p2 from E1.
4. Subtract the contribution of PC2 from E1: E2 = E1 - t2 p2^T
   (Note: at this stage X = t1 p1^T + t2 p2^T + E2)
5. Calculate t3 and p3 from E2.

and so forth until you have calculated A components. The subtractions involved
are often referred to as updating (deflating) the current X-matrix. One of the most
generally useful features of PCA is that usually A << p; A is the dimension of the
structure sub-space used.
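The same deflation can be written out directly. The sketch below is our own illustration (it extracts each single component with an SVD instead of the NIPALS iterations of section 3.14), but the updating of the X-matrix follows the steps just listed:

import numpy as np

def first_component(E):
    """Return the score and loading vectors (t, p) of the single largest PC of E."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    return U[:, 0] * s[0], Vt[0, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
E = X - X.mean(axis=0)            # E0: the centered X-matrix
A = 3
for a in range(1, A + 1):
    t, p = first_component(E)     # calculate t_a and p_a from the current residual
    E = E - np.outer(t, p)        # subtract (deflate) the contribution of PC a
    print("After %d PC(s): residual sum of squares = %.3f" % (a, np.sum(E**2)))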

Notice that we use the letter E from the moment we start subtracting PC-
contributions. When we discuss the residual matrix below, you will see why E is
also used in the step-by-step model calculation.

A Preliminary Comment about the Calculation Algorithm: NIPALS
The basis for the step-by-step calculation of PCs is the NIPALS algorithm, see
section 3.14 on page 72. The fact that T and P are orthogonal results in a very
efficient algorithm for calculation of the PCs. Suffice it here to highlight that it is the
specific delimitation between the data structure part, TP^T, and the noise part, E,
which is carried out by establishing the numerical value of A. When A has been

determined (correctly), the PCA-model can be considered complete. Of course the
BIG question remains: what does “correctly” mean?

The Number of Principal Components - A


In the above sections, A signifies the number of PCs to be calculated; A << p.
Recall that we earlier discussed that the maximum number of possible PCs is
min(n-1,p). If the number of PCs calculated is deliberately chosen to be A ≡
min(n-1,p), we have calculated a so-called “full” model. This means that
decomposition of X simply amounts to replacing the original co-ordinate system
(the variable space) with a new co-ordinate system (the PC-space) with the exact
same (full) dimensionality, p. In this case the new origin corresponds to the mean
object, due to centering, but the number of PCs (and therefore the dimensionality)
has not been reduced at all. From a multivariate data analysis point of view, the full
set of PCs is never optimal. One of the main advantages with these methods is,
after all, a reduction in dimensions.

Only by determining A << p are we certain to achieve the extremely important
advantage of splitting X into a structure part and a noise part. When
we use the full set of PCs, we have not “truncated” the data, because we are still
dragging along all the noise contributions. If we use the full set of PCs, E will be 0.
We may have a more convenient representation of our data, but we are not
exploiting the practical usefulness of the PCA method to the maximum.

The structure part of X is made up of the TP^T product. The noise (the residuals)
resides in the E-matrix. In this context the choice of A, i.e. how many PCs to include,
corresponds to determining the split between structure and noise.

From this we can conclude that a choice of “A” for an optimum fit must be made.

We, or the PCA software, must choose A so that the model, TP^T, contains
the “relevant structure” and so that the “noise”, as far as possible, is collected in E.
This is in fact a central theme in most of multivariate data analysis. This objective
is not trivial and there are many pitfalls. It is always up to you to decide how many
PCs to use in the model. There are many situations where the human eye and brain
excel in ways that simply cannot be programmed, regardless of how far the field of
Artificial Intelligence has developed. Most PCA software will of course try
to give you information on which to base your decision, but this information is only
algorithmically derived, and it all hinges on which optimality criterion is used.
Rather than skip this important point (by relying on an algorithmic approach

only), this book most emphatically demands that the reader takes it upon
him/herself to learn the underlying principles behind PCA. There can thus be no
two ways about it. The question (always) is: “how do we find the optimum number
of PC-model components, A?”

The answer is, fortunately, simple: keep track of the E-matrix.

3.11.2 Residuals - The E-Matrix


The residual term E can be used to monitor an accumulating PC-model fit; large E
= insufficient model, small E = good model. E is of course directly connected to A,
the number of PCs, as well as to the data structure present in X, but generally: a
small A ⇒ more noise in E; a large A ⇒ less noise in E.

However, terms like “small”, “large” and “good” are imprecise. What is a large E?
After all E is a matrix consisting of n x p matrix elements - what if some are large
and some are small? What are we comparing with when we say that something is
small or large? It is obvious that we must define and quantify these terms precisely.

Evaluation of E is always relative to the total variance as calculated with respect to
the centered origin. This origin is a new, derived point in variable space, given by
the average of each variable, namely the average of each column in the uncentered
X-matrix. This point is the new, common point (0,0, ... 0) in PC-space, the new
origin for all the PCs to be calculated. For systematic reasons it is useful to think of
this point as a “zeroth Principal Component”. If we think of developing a model via
step-by-step approximations to the data, each step being more refined than the
previous, the very first approximation is to represent X through the average object,
the object with average variable values. Thus the first approximation to X would be
the zeroth Principal Component: the average object.

We would then subtract this contribution from X and continue approximating the
residual. Subtraction of the zeroth Principal Component is identical to mean
centering of the raw data matrix X. Thus for A=0, the residual matrix termed E0, is
the same as the centered X. E0 plays a fundamental role as the reference when
quantifying the (relative) size of E.

Residual Variance
The residuals will change as we calculate more PCs and subtract them. This is
reflected in the notation for E. E will in general have an index indicating how many

PCs have been calculated in the current model. These residuals are compared with
E0, our starting point. E0 is the X in Equation 3.4:

Equation 3.4 (the PC-model, centered): $X = TP^T + E$

where there is no TP^T term (yet).

It is often advantageous to compare the residuals in terms of fractions of E0. So for
A=0, E = 100% of E0. The residual variance is 100% and the modeled or explained
variance is 0%. Note that we are back to additive variances. The size of E is
expressed in terms of squared deviations, variance. E is after all an error term, so it
is natural to evaluate it in this proper statistical, squared, fashion.

We will be dealing with summations of the squares of the individual E-matrix
elements, the individual error terms. There are two ways that we can do these
summations: either along the rows, which gives us object residuals, or down the
columns, which results in variable residuals.

Object Residuals
The squared residual of an object i, ei², is given by Equation 3.5:

Equation 3.5: $e_i^2 = \sum_{k=1}^{p} e_{ik}^2$

and the residual variance is ResVari = ei²/p. This sum is simply a number. If we
take the square root of this sum, the result corresponds to the geometric distance
between object i and the model "hyper plane", i.e. the “flat” or space spanned by
the current A PCs as expressed in the original variable space. Thus the object
residual is a measure of the distance between the object in variable space and the
model representation (the projection) of the object in PC-space. The smaller this
distance is, the closer the PC-representation (the model) of the object is to the
original object. In other words, the rows in E are directly related to how well the
model fits the original data – by using the current number of A components.

The Total Squared Object Residual


Above we defined the squared residual of one object (Equation 3.5). When
developing a PCA model, we want it to fit all the objects as well as possible
simultaneously. Therefore we need to define a total squared residual that accounts

for all the objects. For this purpose, we define the total squared object residual to
be the sum of all the individual squared object residuals (Equation 3.6).

Equation 3.6: $e_{tot}^2 = \sum_{i=1}^{n} e_i^2$

and the total residual variance is ResVarTot = e²tot/(p·n). In general we will refer
to the “total residual variance” without specifying objects.
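In NumPy terms (a sketch using the same notation, where E is the residual matrix of an A-component model), the object residuals of Equation 3.5 are row-wise sums of squares and the total residual variance of Equation 3.6 follows directly:

import numpy as np

def object_residuals(E):
    """Squared residual per object: row-wise sums of squares of E (Equation 3.5)."""
    return np.sum(E**2, axis=1)

def total_residual_variance(E):
    """Total residual variance: the total squared residual of Equation 3.6 divided by p*n."""
    n, p = E.shape
    return np.sum(E**2) / (p * n)

# Example with an arbitrary residual matrix:
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(12, 5))
e_i2 = object_residuals(E)
print("Largest object residual: object no.", int(np.argmax(e_i2)) + 1)
print("Total residual variance:", total_residual_variance(E))
# The square root of e_i2[i] is the distance from object i to the model hyper-plane.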

Plotting the Individual Object Residuals


The residual variance can be plotted per object (Figure 3.27).

Figure 3.27 - Relative residual variances of twelve objects


(Plot: residual X-variance (calibration variance) per sample for 12 samples. Jam senso, PC: 3)

This plot is used mainly to assess the relative size of the object residuals. For
example, in Figure 3.27 we can see that sample number 10 has a larger residual
variance than the other objects. The model does not fit or “explain” this object as
well as the others. The plot thus may indicate that sample number 10 is perhaps an
outlier; it is not like the rest. We will return to the concept of outliers several times
in this book, as they are very important. An object like number 10 may be the result
of erroneous measurements or data transfer errors, in which case it should perhaps
be removed from the data set. Or it may be a legitimate and significant datum,
containing some very important phenomena that the other objects do not include to
the same extent. In the multivariate data analysis realm everything is always
“problem dependent”.

The Total Residual Variance Plot


A PCA model is thus calculated in this step-by-step manner, including PCs one by
one. We start with a total residual variance given by the centered X-matrix, E0.
Then we calculate t1 and p1 and subtract the outer product from E0, which gives us
the matrix E1. E1 will give a new total residual variance, e²tot,1, which is compared
with the previous total residual variance, e²tot,0. This new e²tot,1 must by definition be
less than e²tot,0 - we are after all approximating the X-data using a criterion that
minimizes the distances between model and objects, and the residual variances are
measures for these distances. For each new PC we get a new total residual variance
which is smaller than the previous one. This can be plotted as the total residual
variance as a function of the number of PCs (Figure 3.28). This is the third type of
key PCA-plot (scores, loadings, residuals); these three standard plots must always
be evaluated together.

For the total residual variance plot, the graph function must be a decreasing
function of the number of current components, A. It must decrease towards exactly
0 when A reaches its maximum, i.e. is equal to min(n,p).

Figure 3.28 - Total residual variance


(Plot: total residual X-variance (calibration variance) as a function of the number of PCs, PC_00 to PC_06. Pea Senso6-6, Variable: c.Total)

3.11.3 How Many PCs to Use?


The details in the residual variance plot, Figure 3.28, help us find the optimum A.
There is an empirical rule which says that a large break in this function - going
from a steep slope to a much slower decrease - often represents the “effective”

dimensionality; the optimum A. In Figure 3.28 this is at PC number 1 (or possibly 3).

There is a logical argument behind this rule of thumb. Recall that the PCs are
placed along directions of maximum variances, that is to say along the elongations
of the data swarm, in decreasing order. When placed along these directions, the
total distance measure from the objects to the PC-model in general will decrease.
Remember the duality of maximum elongation variances and minimization of the
residual (transverse) distances for each component. As long as there are (still) new
directions in the data swarm with relatively “large” variances, the total residual
distance measure will decrease significantly. This again leads to a relatively large
decrease in the total residual variance from PC to PC, corresponding to a set of
relatively steep slopes in the plot, from one component to the next.

You may use the mental picture, that the next PC still would have something “to
bite into”, or that there still is some definite direction in the remaining projections
of the data swarm for the next component to model. This goes on until the
remaining data swarm does not show preferred orientations (elongations) any
longer. At this point there will no longer be any marked gain with respect to adding
any further PCs (“gain” is here to be understood as modeling gain, thus adding a
significant total variance reduction). Consequently the total residual variance plot
will flatten out and the gain per additional PC will be significantly less than before.
Hence the break at this point. Once the noise region has been reached, all of the PCs will be of
similar size (as they are modeling random variation and the residual data will form
a hyper-sphere), so the residuals will flatten out.

To conclude: the optimal number of PCs to include often is the number of PCs that
gives the clearest break point in the total residual variance plot. But this is only a
first rule-of-thumb; there are, alas, also plenty of exceptions from this rule.
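The rule of thumb can also be sketched numerically: compute the total residual variance for A = 0, 1, 2, ... and look for the component after which the decrease levels off. The following is our own illustration on synthetic data with two structured directions; the 10% cut-off used to flag the break is an arbitrary choice for this example, not a fixed rule.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two "real" directions of decreasing size, plus noise
T_true = rng.normal(size=(50, 2)) @ np.diag([5.0, 2.0])
P_true = np.linalg.qr(rng.normal(size=(8, 2)))[0]
X = T_true @ P_true.T + 0.1 * rng.normal(size=(50, 8))

Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
ss = s**2
res_var = 1.0 - np.cumsum(ss) / np.sum(ss)        # residual variance fraction for A = 1, 2, ...
res_var = np.concatenate(([1.0], res_var))        # prepend A = 0 (100% residual variance)
drops = -np.diff(res_var)                         # variance removed by each additional PC
print("Total residual variance per A:", np.round(res_var, 3))
# Crude break detector: the first PC whose gain is below 10% of the first PC's gain
# marks the flattening; the suggested A is the number of PCs before that point.
A_suggested = int(np.argmax(drops < 0.1 * drops[0]))
print("Suggested A (rule-of-thumb only):", A_suggested)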

A Note on the Number of PCs


Interpretation of the PCs is generally necessary in PCA. Often this may be nothing
more than carefully comparing the patterns in the score plots with knowledge about
the specific problem. The data analytical main issue often is arriving at the
“correct” number of components first, before one can perform any meaningful
interpretations of the meaning of the A components. If too few components are
chosen (resulting in what is termed an underfitted model), the interpretation runs
the risk of relating only to the most dominating parts of the data structure, with an
absolute certainty of leaving something significant out. Think of the gold

mineralisation PCA, if only the irrelevant first two PCs were extracted - and not the
gold-related PC3! Using too many components on the other hand (clearly leading to
an overfitted model) is equally bad, because you then risk interpreting parts of the
noise structure.

Why not Include more PCs?


It is a basic PCA-assumption that directions of “small” variances do not correspond
to significant data structure(s), but rather to noise. In PCA we tacitly assume that
large variances correspond to systematic phenomena (for example size, order,
concentration variations...) and that these dominate over lesser variance
contributions like random noise (e.g. measurement errors, sampling errors...). When
we analyze a specific data set we are in general looking for quite definite
phenomena, which we assume are represented in the data set. The philosophical
stand behind PCA and related multivariate methods is that the “large” PCs may be
correlated with the information we seek, while the “smaller” ones usually are noise
and thus irrelevant to the data structure elucidation problem in itself. These lesser
PCs should therefore not be included in the modeling; they should remain in E.
Thereby we “filter” the noise out of our data and can concentrate on the structured,
in principle noiseless part. We generally use the total residual variance plot to
assess where the modeled structure stops and the noise starts.

A very important lesson here is that we always carry out these evaluations in the
problem-specific context of all our knowledge about the problem, or situation
from which the data matrix X originates. It is bad form indeed to analyze data
without this critical regard for the problem context – indeed no interpretation is
possible without!

A Doubtful Case – Using External Evidence


Unfortunately things are not always as simple as we would like. Consider, for the
sake of argument, the case in which we know, from irrefutable external evidence,
that there is some definite level of residual variance below which we cannot go. Let
us for example say we are analyzing a number of chemical variables, which happen
to be very comparable and all have the same definite measurement uncertainty, say
12%. Accept for this illustration then that there is no point in measuring with a
relative uncertainty better than 12%. When trying to model the data matrix X then,
it would not make sense to include more PCs than the number best suited to give a
residual variance above this level (but of course as close as possible). If the next
component for example takes the residual variance from 15% to 6%, we should not
use this last component.

The point here is that external, indisputable evidence must override all internal data
modeling results. However, the external evidence must of course be proved beyond
doubt. There are cases where the external factors have not held true after all; the
modeling results using more components were later found to be correct. If ever in a
situation like this, one should neither reject the modeling results immediately,
nor ignore the external evidence. It will be most prudent to reflect carefully
again on the results and the evidence before deciding. In fact, there is no substitute
for building up as large a personal data analytical experience as indeed possible.
For this reason (also) we have included many examples and exercises in this book.

3.11.4 Variable Residuals


So far we have studied only squared object residuals, which are calculated by
summing squared E-elements along the rows in E. If instead we perform the
summation down the columns, we get the corresponding variable residual
variances. Just as is the case for objects, we can define a squared residual per
variable:

Equation 3.7: $e_k^2 = \sum_{i=1}^{n} e_{ik}^2$

And we can also define a total squared variable residual:

Equation 3.8: $e_{tot,v}^2 = \sum_{k=1}^{p} e_k^2$

Here we will only discuss the former. The residual variance per variable can be
used to identify non-typical variables, “outlying” variables, in a somewhat similar
fashion as for the objects. We cannot, however, interpret them in an exact
analogous fashion in terms of distances without introducing a complementary
object space in which variables can be plotted. This concept lies outside the scope
of this book.
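By symmetry with the object residuals, the variable residuals of Equation 3.7 are simply column-wise sums of squares of E; a minimal NumPy sketch:

import numpy as np

def variable_residuals(E):
    """Squared residual per variable: column-wise sums of squares of E (Equation 3.7)."""
    return np.sum(E**2, axis=0)

# A variable with a conspicuously large value here is poorly described by the model
# and may deserve the same scrutiny as an outlying object.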

3.11.5 More about Variances - Modeling Error Variance
Above we discussed at length the total residual variance for the determination of
the optimum number of PCs. The total residual variance is determined from the
E-matrix, which is a measure of the model error. This is the error stemming from
the fact that we are modeling a complex data structure but with a limited number of
PC-components A << p; this could be termed the modeling error variance, or the
modeling error for short. In PCA this modeling error denotes the deviation between
model and real data (X). This is opposed to the modeling error associated with the
methods PCR and PLS, to be discussed later, where the most interesting measure of
variance is often the prediction error (related to Y).

We have also briefly used the term “explained variance” above. Remember that the
residual variance is compared to the total residual variance for A=0. At this point
the total residual variance is 100% and the explained variance is 0%. When
A=min(n,p) the residual variance is 0% because E is 0, and the explained variance
is 100%. The explained variance is the variance accounted for by the model, the
TP^T product (always relative to the starting level, the mean-centered X, i.e. E0). An
easy relation to remember:

%Explained variance + %Residual variance = 100%

Figure 3.29 - Explained and unexplained residual variance, always totaling 100%
(Plot: explained variance increasing and unexplained variance decreasing, in percent, as a function of the number of PCs, 0 to 5)

3.12 Exercise - Interpreting a PCA Model (Peas)
Purpose
Find which of the sensory variables are best suited to describe the pea quality in the
pea data already analyzed before.

Data Set
The model “Peas0” is based on the same data set as has been used several times
before (see Chapter 2). In fact we have also been using parts of these results several
times above when introducing the various aspects of PCA. After reformatting there
are 60 pea samples (objects). The names of the samples again reflect harvest time
(1 .. 5) and pea type (A .. E). The variables were not presented properly earlier. The
X-variables in fact consist of sensory ratings of all the pea samples, on a scale
from 0-9, as carried out by a panel of trained sensory judges. Whereas earlier we
were mainly interested in the geometrical plotting relationships, here we want
you to carry out the complete principal component analysis yourself.

Tasks
Study score-, loading- and variance-plots of the data and interpret the PCA-model.

How to Do it
The model is already prepared. Go to Results - PCA and specify model Peas0.
Start by looking at how much of the variance is described by the model. Make
the plot in the lower right quadrant active, select View - Source - Explained
Variance, and un-select View - Source - Validation Variance. Is there a clear
break in this plot? You can see that the first two PCs explain around 75% of the
variance in the data. This is regarded as good for sensory analysis, due to the
high noise level in this kind of data as measurements are based on human
judgement.

Now study the score plot. Interpret the meaning of the PCs. Use Edit - Options
to replace sample numbers with names; if you click twice on the second cell of
the Name field in the dialog box, only the fraction of the name coding for
Harvesting Time will appear.

Study the loading plots. Try to answer the following type of questions: What do
the loadings represent? How can we interpret the plot? Which sensory
characteristics are the most important? Which vary the most? Which seem to
co-vary? Which variables describe the main variations of peas?

Summary
There is no clear break in the variance plot, but two PCs describe 75% of the total
variance while the third explains some 10% more. Two PCs are simple to interpret
and are probably sufficient to determine the most important variables for
description of pea quality. The clue here is to note carefully the fractions of the
total variance associated with each PC.

The score plot shows that the PC1 direction describes the harvesting time. We can
see that the samples are distributed from left to right according to their Harvesting
Time numbering. There is no similar obvious clear pattern in PC2 at the first glance
of the score plot.

The loading plot shows that Pea-flavor, Fruitiness and Sweetness co-vary.
Hardness, Mealiness, and Off-flavor are also positively correlated to each other,
while they are negatively correlated to Pea-flavor, Sweetness, and Fruitiness, since
the two groups are on opposite sides of the origin. This means that PC1 mostly

describes how the peas taste and feel in the mouth, which is perhaps not such a
surprising first direction, given what trained sensory judges base their assessment
on. The corresponding score plot indicates that taste is related to harvest time - the
riper the peas, the sweeter they taste.

Along PC2, we can see Color 1 and Whiteness to the top, negatively correlated to
Color 2 and Color 3 projected near the bottom of the plot. This means that peas
samples projected to the top of the score plot are whiter, while those projected to
the bottom are more colorful.

3.13 Exercise - PCA Modeling (Car Dealerships)
Purpose
You are now set free to perform your very first own PCA. The purpose of PCA of
the car dealership data is to carry out the complete works of a PCA, completely
from finding the appropriate data file (“cardeals”), to doing the entire analysis and
the necessary interpretations yourself. We want you to find out what you can about
the data structure in the CARDEALS data matrix. But we want first to give you a
full background briefing, of course.

Data Set
The objects represent 172 Norwegian car dealerships. For reasons of anonymity we
have identified these companies only by a running number 1-172. In point of fact,
there is also information available as to the particular brand of car each individual
car dealership offers to the market (e.g. Volvo, Mitsubishi, Toyota...), but this is
mainly of interest when related to the univariate (1-D) data analyses carried out in
the original magazine article. Here we are exclusively interested in what can be
gleaned from the multivariate PCA perspective in comparison.

The X-variables consist of 10 economic indicator variables, taken directly from the
magazine tabulations. It is evident that these variables represent a standard
framework within which to carry out an economic analysis of a whole branch of the
wholesale market - in this case all Norwegian car dealerships. The data represent
the 1993 accounts. For the moment we shall only refer to these variables
as X1 – X10.

You (the novice multivariate data analyst) are at the outset specifically not allowed
any other details of the meaning of the chosen set of key economic indicators.
Indeed this is the whole point of this exercise - what can be said about the
multivariate data structure of this particular data set without detailed economic
understanding? By treating these presumably fine-tuned economic variables just
like any other set of p multivariate attributes for the set of N (=172) objects (which
just happens to be car dealerships), what can be achieved by a proper principal
component analysis? Surely it would be nice if such an analysis were to turn up
new insight not revealed by a standard univariate economic analysis. We
would then be in the enviable position of being able to teach a professional economic
magazine a thing or two about the interrelationships and correlations between the
economic indicators, a feature completely left out by the run-of-the-mill, one-
variable-at-a-time approach presented in the feature article from the magazine
“Kapital” (14/94 p. 50-54).

Tasks
Study score-, variance- and loading-plots of the data and interpret the PCA-model.

From a score point of view, we would like to assess how the individual car
dealerships relate to one another, and how they might be clustered, grouped or how
their interrelationships might show trending. From a loading point of view, we
would be extremely interested in which economic indicators correlate with which
(do you want to scale this data set, or not? - Why? Note: Scaling will be introduced
in chapter 4.1). From a combined scores/loading assessment, which indicators are
responsible for the data structure as revealed by the score plot disposition of all 172
dealerships?

Make up your own questions to the objectives of the PCA as you go along, based
on your interim results – and on your thinking on what might possibly be the
driving forces of the car dealership market in a small, but supposedly, rather
representative European country. On the other hand, also remember that currently
Norway is a rich country (the world’s second largest oil exporting country). In
general, people in Norway buy new cars.

How to Do it
Entirely up to you. This is an interim summation of your newly developed
PCA-skills!

Summary
One must always be prepared for surprises in the multivariate data analysis realm.
Many of the examples and illustrations in standard textbooks employ well-
structured data sets with a more-or-less clear story to tell – well conceived of
course; compare (hopefully) the PEOPLE and PEASRAW data sets above, for
example.

How does this tally with the car dealership data? There are data sets and there are
data sets. At the outset, this apparently VERY INTERESTING data set turned out
geometrically to be little more than an almost hyperspherical, p-dimensional
data swarm. What might this mean?

A tentative interpretation: the car dealership market (demand vs. supply) forms
a particularly strong competitive sector. There is simply very little room for
anything but to operate your (own) business precisely as do all your competitors,
each striving valiantly for that little extra comparative advantage! There were only
two marginally noteworthy car dealerships, and actually the “only” question in this
data analysis would appear to be whether to regard these two as outliers, or not.
Also: irrespective of whether these two objects are deleted or not, the very same
general correlation structure is observed when interpreting the comparative
loading plots. The only possible conclusion is that the competition makes for a very
homogeneous market – as measured by the present set of standard economic
variables that is.

The above little discourse serves to show that even a seemingly “dull” data structure might
nevertheless very well carry its own significant information. The particular, almost
hyper-spherical, data disposition encountered here might for example be interpreted
to reflect that the competition results in a very tight clustering of objects, with only
the smallest signs of differentiation between the individual dealerships. What are
the relevant fractions of the total variance for the first 2-3 PCs? – Here’s a difficult
question: How would you have formulated an alternative market interpretation,
had this data set revealed itself as a marked, perhaps more familiarly, elongated
trend?

The most prominent observation to be made, however, was the virtually complete
absence of any groupings or trends in the score plots. To be sure, there were these
two dealerships which set themselves apart, albeit marginally. What exactly
characterizes these two? How will this influence our ability to reveal the
intricacies of the economic interrelationships for the remaining 170 car
dealerships? For this part of the interpretations, there is now a need to know a little
more of the meaning behind the specific indicator variables, for which purpose we
now list the designations and explanations of all the ten variables in full:

Variable X1: Turnover 1993 (1000’s NOK)
Variable X2: Turnover increase: 1992-1993 (%)
Variable X3: Gross profit (%)
Variable X4: Earnings before tax (1000’s NOK)
Variable X5: Profit margin (%)
Variable X6: Turnover 1992 (1000’s NOK)
Variable X7: Stock value end 1992
Variable X8: Salary, Man. Dir.
Variable X9: No. employees
Variable X10: Turnover/No. employees (X1/X9)

Armed with this additional information it should be possible to begin an
interpretative analysis of the way these economic indicators interrelate and thus to
be able, hopefully, to take a closer look over the shoulder of the economic
journalists. – By the way, maybe we have judged economic journalism slightly too
hard. We note that the standard economic indicator (variable no. 10) actually does
represent a start into the multivariate realm, as the ratio X1/X9. Still, from one
ratio-variable to the complete layout of all the pertinent loadings, it is a long step!

In spite of the apparently enticing original data compilation, after having wrestled
with the data in every possible way, there is only one surprising conclusion: This
PCA did not reveal any particularly new secrets of the car dealership community
and the way it conducts its business. But then again, this exercise was deliberately
given in order to teach the lesson that even though the prospects for interesting
interpretations and conclusions would all appear to be in the cards – we do not
know the data structure until after our most careful data analysis. For this particular
data set, all our hopes for an illuminating new insight apparently just vanished.
But all is not completely lost. Though our triumphant visit to the editorial offices of
the magazine “Kapital” will have to wait, here is a sneak preview of coming
attractions. One of the X-variables will - upon some in-depth reflection - turn out to
be of a sufficiently special nature relative to the remaining nine variables that
a certain re-formulation of the whole objective of the data analysis may be deemed
worthwhile, more of which later (in relation to PCR/PLS-R). In fact we may have a
bona fide y-variable present in this otherwise manifest X-variable company. If you
cannot find out which now, simply wait until you have reached chapter 13.

In chapter 3 we have so far taught you quite a lot of the overall understanding of
PCA, described all the most important elements, and especially led you through a
number of useful exercises – without going into any mathematical or algorithmic
details. We also strongly hope that by now you should be sufficiently primed to
appreciate the basic NIPALS algorithm, which lies behind all PCA-calculations in
The Unscrambler. The completion of chapter 3 will be just that: The NIPALS
algorithm.

Note!
If you are not interested in a mathematical approach, you may skip
section 3.14 for now and move on directly to chapter 4, where you will
learn more about the practical use of PCA.

3.14 PCA Modeling – The NIPALS Algorithm
In 1966 Herman Wold invented the NIPALS algorithm. This is claimed to have
taken place on the back of an envelope - literally - which goes a long way toward
explaining why no advanced mathematical training is required in order to grasp the
essentials. For the present exposition we may take this acronym to signify
Non-linear Iterative Projections by Alternating Least-Squares, although the
original meaning was slightly different (we shall present the original meaning in
chapter 6).

This algorithm has since the 1970s been the standard workhorse for much of the
computing behind bilinear modeling (first and foremost PCA and PLS-R),
primarily through the pioneering work by one of the co-founding fathers of
chemometrics, Svante Wold (Herman’s son). The history of the NIPALS algorithm
has been told by Geladi (1988), Geladi and Esbensen (1990), Esbensen & Geladi
(1990). The latter two references actually deal with “The history of chemometrics”,
a topic of interest to some of the readers, hopefully.

In this introductory course on multivariate data analysis, we shall present the main
features of this algorithmic approach for two reasons: 1) for a deeper understanding
of the bilinear PCA-method, and 2) for ease of understanding of the subsequent
PLS-R methods and algorithms.

Thus we shall not go into any particular depth regarding specific numerical issues;
it suffices to appreciate the specific projection/regression characteristics of NIPALS.


0. Center and scale appropriately (if necessary) the X-matrix, X
   Index initialization, f: f = 1; Xf = X
1. For tf choose any column in Xf (initial proxy-t vector)
2. pf = XfT tf / | XfT tf |
3. tf = Xf pf
4. Check convergence: if | tf.new – tf.old | < criterion, stop; else go to step 2.
5. Xf+1 = Xf – tf pfT
6. f = f + 1

Repeat 1 through 6 until f = A (optimum no. of components), or min(p,n)

Explanation to the NIPALS algorithm:

0. The process starts with the centered (optionally scaled) X-matrix.

1. It is necessary to start the algorithm with a proxy t-vector. Any column vector
of X will do, but it is advantageous to choose the largest column, max |Xi|.

2. Calculation of loading vector, pf, for iteration no. f.


3. Calculation of score vector, tf, for iteration no. f.

Step 3 can be seen to represent the well-known projection of the object vectors
down onto the fth PC-component in the variable space. By analogy one may view
step 2 as the symmetric operation projecting the variable vectors onto the
corresponding fth component in the object space. Note how these projections
also correspond to the regression formalism for calculating regression
coefficients, for which reason steps 2 & 3 have been described as the “criss-
cross regressions” heart-of-the-matter of the NIPALS algorithm. “Criss-cross
projections” may be an equally good understanding.

4. Convergence? NIPALS usually converges to a stable t-vector solution in less
than, say, 20-40 iterations (empirical experience). The stopping criterion may
be 10⁻⁶, or less, as desired. A difference smaller than this for the consecutive
t-iterations signifies that the NIPALS algorithm has reached a stable solution,
that is to say that the proxy-PC in variable space has stabilized to the
maximum variance direction sought.

5. Updating: Xf+1 = Xf - tfpfT.


The updating step is often also called deflation: Subtraction of component no. f.

The principal component model, TPT, is calculated for one component
dimension at a time. After convergence, this rank-one model, tf pfT, is
subtracted from the current Xf. A very important consequence of the way NIPALS goes
about its business is that both the set of score vectors and the set of loading vectors are
mutually orthogonal for all f. This is directly responsible for the superior
interpretation features of PCA.

The primary characteristic of the NIPALS algorithm is that the principal
components are deliberately calculated one component at a time. NIPALS performs
this iterative PC-calculation by working directly on the raw X-matrix alone
(appropriately centered and scaled). This numerical approach to bilinear
analysis sets it apart from several other numerical calculation methods, such as
the Singular Value Decomposition method (SVD) and the so-called direct XTX
diagonalisation methods, the description of which falls outside the present scope,
however. Appropriate references for this endeavor can be found in Martens & Næs (1987).
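
To make the criss-cross projections above concrete, here is a minimal sketch of the NIPALS iteration in Python/NumPy. This is our own illustration, not the code used in The Unscrambler; the function name nipals_pca, the tolerance default and the maximum iteration count are our own choices.

import numpy as np

def nipals_pca(X, n_components, tol=1e-6, max_iter=500):
    """Minimal NIPALS sketch: returns scores T (n x A) and loadings P (p x A)."""
    Xf = np.asarray(X, dtype=float)
    Xf = Xf - Xf.mean(axis=0)                  # step 0: centering (any scaling is left to the user)
    n, p = Xf.shape
    T = np.zeros((n, n_components))
    P = np.zeros((p, n_components))
    for f in range(n_components):
        # step 1: use the column with the largest sum of squares as the proxy score vector
        t = Xf[:, np.argmax(np.sum(Xf**2, axis=0))].copy()
        for _ in range(max_iter):
            p_f = Xf.T @ t
            p_f /= np.linalg.norm(p_f)         # step 2: loading vector, normalized to unit length
            t_new = Xf @ p_f                   # step 3: score vector = projection onto p_f
            if np.linalg.norm(t_new - t) < tol:  # step 4: convergence check
                t = t_new
                break
            t = t_new
        T[:, f], P[:, f] = t, p_f
        Xf = Xf - np.outer(t, p_f)             # step 5: deflation, subtract the rank-one model
    return T, P

With the loading vectors normalized to unit length, the variance captured by component f is simply the sum of squares of the corresponding score vector - which is also how the explained-variance percentages reported in the following chapters can be computed.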


4. Principal Component Analysis (PCA) - In Practice
You should now have sufficient background to start more of your own serious PCA
modeling, but first we should look at some very important additional practical
details.

4.1 Scaling or Weighting


Preprocessing the data prior to performing PCA is often very important. In fact
sometimes proper preprocessing can make the difference between a useful model
and no model at all! The purpose of preprocessing is to try to transform the data
into the most suitable form for the analysis. Sometimes there may be background
effects we wish to remove by some suitable transformation, or the X-variables may
be measured in sufficiently different units that standardization is deemed necessary
etc.

For a comprehensive introduction to the very broad theme of data analytical
preprocessing the reader is referred to chapter 3 in Beebe, Pell & Seasholtz (1998).
We shall however give you a first introduction below. Note that in many software
packages (The Unscrambler is no exception), weighting means the same as
scaling.

Standardization
There are many ways to scale or weight the data. The most usual scaling factor is
the inverse of the standard deviation. Each element in the X-matrix is multiplied
with the term 1/SDev:

Equation 4.1        xik(scaled) = xik ∗ (1/SDevk)

Recall that standard deviation was defined in frame 1.2 (Chapter 1). By scaling
each column in X with the inverse of the standard deviation of the corresponding
variable, we ensure that each scaled variable gets the same variance. Try to
calculate the variances of the scaled variables and verify that they all have the same
variance, all equal to 1.0.

This is a very common scaling method when you analyze variables measured in
different units, so that some display large variances compared to others. For
example, the variance of one variable could be in the order of several 1000’s,
while the variances of others are perhaps in the order of 0.001. This is surely a
case demonstrating the need for inverse standard deviation weighting, i.e.
standardization. By this means all variable variances become comparable. No one
variable is allowed to dominate over another solely because of its range (i.e.
because of its measuring unit), and thereby unduly influence the model. A simple
example would be if one mass variable
was measured in Kg whilst another was measured in mg. Standardization would put
these on the same variance scale.

Since we are looking for the systematic variations in PCA, standardization allows
subtle variations to play the same role in the analysis as the larger ones; this is
a very powerful consequence of using the very simple standardization option.

Autoscaling
The combination of mean centering and scaling by 1/SDev is often called
autoscaling. Figure 4.1 shows what happens to the data set during autoscaling.

Figure 4.1 - Centering and scaling with 1/SDev = Autoscaling


Original Centered Centered and scaled

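In code, autoscaling amounts to centering each column and dividing it by its standard deviation. A minimal NumPy sketch (our own illustration, not The Unscrambler's implementation; the small matrix is invented to mimic variables with very different numerical ranges):

import numpy as np

def autoscale(X):
    """Center each variable (column) and scale it by 1/SDev."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Two variables with very different ranges (e.g. one in the 1000's, one around 0.01)
X = np.array([[1200.0, 0.02], [1500.0, 0.05], [900.0, 0.01], [1800.0, 0.04]])
Xs = autoscale(X)
print(Xs.var(axis=0, ddof=1))   # -> [1. 1.]: every scaled variable now has unit variance

After autoscaling, no variable can dominate the analysis purely because of its measuring unit.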

A Note of Caution: Scaling of Measurements with Quasi-Equal Variances
In spectroscopy for example, scaling with 1/SDev is not always considered
exclusively advantageous, because it is sometimes thought amongst experienced
spectroscopists that scaling and standardization may result in losing out somewhat
with respect to the interpretability of the loading plots. The main danger is that noise
variables are over-emphasized, but this can be overcome, e.g. by selective
standardization, or by using an offset in the standardization, e.g. 1/(h + SDev).

However, this is certainly not specific to spectroscopy. The situation envisaged is
but one form of the more general situation in which the empirical variance is
more-or-less comparable across the entire set of p variables. It is the set of p
internally comparable, empirical variances which determines whether the
arguments for standardization (autoscaling) hold up or not - this has nothing to
do with one or other particular type of X-data.

Thus autoscaling may not always be the obvious form of scaling to use when
the variables are measured in the same unit. There is no fixed rule, however; you still
have to investigate the empirical variances and their comparability. There are
indeed also many cases where 1/SDev scaling of spectroscopic data gives the best
results. The cost is the loss of direct spectroscopic interpretability of the loadings,
but the data analytical model may very well still serve its purpose better. This small
spectroscopic side issue has caused a great deal of confusion – especially amongst
data analytical beginners, naturally enough.

Variables with Empirical Variances in the Interval (0,1)


If a variable has a sufficiently small variation and/or is measured in a unit such that
the empirical variance is less than 1.0 (variances will of course always be larger
than zero), the weight 1/SDev will be larger than 1. Multiplying
such a variable (multiplying all elements of a particular matrix column) with a large
number means that the covariance impact in the multivariate analysis of this
variable increases.

Sometimes, in order to avoid this situation, it has been suggested that an arbitrary
pre-multiplication of an afflicted variable by, say, a factor of 10, will rectify this
undesired result. Clearly, however, this trick only applies to the situation in which
the subsequent data analysis is carried out without any further preprocessing. The
particular relative covariance structure of this “rectified” variable, in relation to all
others in the multivariate analysis, will be the same when using standardization or

autoscaling, which often was the reason to worry about the impact of the (0,1)-
sized variance in the first place.

If you have changed units in a particular data analysis, you must remember also to
present the data analytical results in their original units, in reports etc. A more
systematic overview of scaling, sufficient for most uses stemming from this
introductory course, is given in chapter 9.

4.2 Outliers
In the previous sections we have briefly mentioned outliers, atypical objects or
variables on a few occasions. If outliers are the result of erroneous measurements,
or if they represent truly aberrant data etc., they should of course be removed
otherwise our model will not be correct.

On the other hand “outliers” may in fact be very important, though somewhat
extreme, but still legitimate, representatives for your data, in which case it is
essential to keep them. Thus if they represent a significant or important
phenomenon and you remove them, the model will also be “incorrect”. It will lack,
and will consequently be unable to explain, this property which is in fact present in
the data. The model will be an equally poor approximation to the real world. This
may appear to be a major problem - that you will create a false model if you include
true outliers and also if you remove false outliers. Fortunately, outliers must either
be one or the other.

The “bad news” is that it is always up to you to decide whether an outlier should be
kept or discarded. In fact, the only problem about this is that it will take some
experience to get to know these things by their appearances in the appropriate plots,
but it is a feasible learning assignment – and it is one which is absolutely essential
to master. Textbook examples, exercises and personal experience will quickly get
you up to speed in this task.

It is perhaps most important to realize that there are essentially two major outlier
detection modes:
1. Data analytical: the relative (geometrical) distribution of objects in e.g. the
score plots is all you have to go by. Decision must be based wholly on
experience.


2. Domain, problem-specific knowledge may in certain/many other situations
constitute an absolute reference with which to make the difficult decision as to
whether a particular object, or group of objects, is outlying or not.

Score plots are particularly good for outlier detection. An outlying object will
appear in score plots as separate from the rest of the objects, to a larger or smaller
degree. This is the result of one, or more, excessively high or low scores as
compared to the other objects. Figure 4.2 shows two cases of outliers. In the left-
hand one the object is a potential outlier, but some observers may decide that it
nevertheless still “fits” in general, while the object in the right panel is
considerably more doubtful when assessed together with all the remaining objects,
and their trend.
You will see - and learn much more about - outliers later in this book.

Figure 4.2. - Mild and extreme outliers in score plots


[Two panels, each a t1 vs. t2 score plot]

The problem of how to quantitatively identify - and thus be able automatically to
delete - outliers has been the subject of many suggestions in the data analytical
literature. This is because whilst it may be acceptable to take out outliers
“manually” for data exploration purposes, for automatic data analysis, e.g. process
control, there will of course not be the time, nor resources, to perform a continuing
manual inspection of score plots.

Residual object/variable variances can be used for this purpose, as indeed can the
relevant plots as well. This latter manual option may sometimes involve a lot of
work though, especially when there are many variables and/or objects. We shall
also show you some examples of this outlier detection approach, but the general
issue of automatic outlier deletion mainly falls outside the scope of this
introduction to multivariate data analysis.


4.2.1 Scaling, Transformation and Normalization are Highly Problem Dependent Issues
As mentioned above, PCA in general depends on the type of pre-treatment used.
You will find that you will often need flexibility in the initial stages of data
analysis, a flexibility that can sometimes be provided by alternative pre-treatments
of the data set. On the other hand, different types of pre-treatment are almost
certain to result in different score plots, perhaps even leading to different
interpretations. Thus consider the following alternative t1t2 score plots in Figure 4.3
and Figure 4.4.

In Figure 4.3 the data set has been scaled with 1/SDev before PCA, whilst in Figure
4.4 no such scaling has been performed. The two distributions of objects are quite
different, even though we do find the same overall groupings present.

Figure 4.3 - Score plot from weighted data set


[Score plot, PC1 vs. PC2; p-0, PC(expl): <1(54%),2(15%)>]


Figure 4.4 - Score plot from non-weighted data set


[Score plot, PC1 vs. PC2; p-1, PC(expl): <1(87%),2(9%)>]

To Scale, or not to Scale? – That is the Question!


Which is correct? Apart from the different distributions, the two plots are mirrored,
which is of no importance. This often happens and depends only on how the
particular algorithm starts. All PC-directions are uniquely defined by the NIPALS
algorithm, except for the polarity (+ vs. -). From a strictly data analytical point of
view, both alternatives are quite admissible. Which is the most useful depends on
the specific problem context; the term “problem-specific” will very often come to
the fore in the type of data analytical responsibility this book aims at generating.

This simply means that in a great many situations, there is much (often all) to be
gained by paying the closest attention to the context surrounding the generation of
the data. The data analyst simply cannot learn enough about the available data in all
specific data analytical situations – never mind the overall, general principles of
multivariate data analysis, which unfortunately makes up the range of what can be
learned from the mere reading of a textbook. Experience rules!

4.3 PCA Step by Step


The following is a first, broad description of how to perform Principal Component
Analysis. Enough should have been learned previously that one may well begin to
gather a more fundamental overview of all the most important issues involved.


Step 1: Problem Formulation


Define exactly why you want to do the analysis and what type of information you
are looking for. It is your responsibility to ensure that the data set contains enough
relevant information to solve the problem! This translates into a question not often
considered carefully enough: which variables, and which objects - naturally this is
the most problem-specific issue of all.

On the other hand, there is a major advantage: it does not matter if your data set
contains any amount of additional information. The multivariate analysis will find
this easily enough and you will have an unexpected bonus. Occasionally, it turns
out that important information was to be found in this additional realm, even
though in the outset this was not thought to be the case. Time spent in analyzing
particular data analytical objectives is generally very well spent. But, there is a
downside too: even the best data analytical method in the world cannot compensate
for a lack of information, i.e. a bad, or ill-informed choice of variables/objects.

Step 2: Plotting Raw Data: Getting a “Feel” for the Particular Data Set
Use the matrix plot option to get an initial overview of the whole data set. It will
often also help you to see whether the data needs scaling, transformations and so
on.

Step 3: The First Runs


Always start with an initial, open-minded PCA of the total data set. These initial
runs help you to familiarize yourself with the overall data set structure and are
useful for screening purposes. Remember always to center the data. You may also
try out various pre-processing schemes, although in general you should already
have formed a qualified opinion about which scaling method to use. So far we have
only talked about 1/SDev scaling; there are several other preprocessing options.
More will be revealed throughout this book, but the range of all pre-processing
alternatives is a great challenge not to be mastered in many years.

In the first runs you should of course freely compute “too many” PCs, to be sure
that there are more than enough to cover the essential data structures. There is a
great risk in missing the slightly more subtle data structures, for want of a few more
components in the initial runs. Until the data set is internally consistent (free from
all significant outlying objects and/or variables etc.), there is no point in
determining the optimum number of PCs, as this number may change depending on
what you do with the data set next.

Step 4: Exploration
The first few score plots are investigated to determine the presence of major
outliers, groups, clusters and trends etc. If the objects are collected in two separate
clusters, as an example, you should naturally determine which phenomena separate
them and decide whether the clusters rather should be modeled separately, or
whether the original objective of analyzing the total data set still stands.

Be especially aware if the score plots show suspect outliers, as they will also affect
the loadings, usually in severe fashions. In this case do not use the loadings to
detect outlying variables at this stage of the data analysis. Although in general one
should have good reasons for removing anything from the data set, on the other
hand, too much caution can be equally dangerous. You may have to perform
several “initial runs” and successively exclude outliers before the data set can be
satisfactorily characterized. One excludes all outlying objects before one embarks
upon the more subtle pruning away of information-lacking variables; this order of
outlier exclusion (objects before variables) is extremely important. The most
common error that inexperienced data analysts often make is to leave “too much”
as it is; in other words, one does not take sufficient personal responsibility with
respect to deleting outlying objects or variables, dividing into sub-groups, etc.

Step 5: The Later Runs


At this stage the data set may now be slightly to somewhat reduced, and perhaps
divided into subsets if there are clusters that have to be modeled separately etc.
Make sure that you know what you have done to arrive at the present data set:
which objects and variables were removed and why; the best type of pre-
processing; etc. The later runs also consist of PCA calculation but now on the final
data set.

Step 6: Inspection of Variance Plot to Determine the Optimum Number of PCs
There is no point in evaluating the optimum number of PCs before the data set is in
this satisfactory final form. Now use the total residual variance plot, or
alternatively, the total explained variance plot in order to arrive at the optimum
number of principal components to use, A. Always also relate to external
knowledge pertaining to the specific data set. When A-optimum has been
determined:


Step 7: Inspection of Scores and Loadings - Interpretation


Now it is time to interpret the PCA model by looking at the complementary scores
and loading plots, for exactly this number of PC-components of course. This can be
considered to be the heart of the matter in PCA. Often a re-evaluation of the
original data analysis objective may have arisen along the way.

Step 8: Analysis of the Error Matrix E


It is important that you check the residuals after the “successful” A-component
model has finally been calculated, and evaluated. The object residual variances and
the variable residual variances should all now be at acceptable levels. If there are
problems here, e.g. if you find still one more outlier, you simply have to go back to
the modeling phase, repeating steps 2 - 8. If you have done the modeling correctly,
there should be no surprises for you in the E matrix but this is really a matter of
experience!

4.3.1 The Unscrambler and PCA


The main steps in practical use of The Unscrambler are as follows:

1. Collect a representative set of calibration objects, the training set.


2. Enter the input data from the keyboard (File-New) or import it from an
external file (File-Import). Choose File-Save to save the data in Unscrambler
format.

Note!
This option is not applicable with the training version, where only File-
Open can be used to access existing Unscrambler data files.

3. Plot the data to get a first impression. Mark the data and choose Plot - Matrix.
4. Perform pre-processing if necessary. (Modify - Transform).

Note!
Scaling and centering is done from the Task menu (see hereafter).

5. Open the Task - PCA menu. This starts the PCA dialog box.
• Select the samples to be analyzed, from the “Samples” tab. If necessary
click Define… to create a new sample set.
• Select the variables to be taken into account, from the “variables” tab.
Check the current weighting options, and if relevant change the weights

by clicking Weights… In the Set Weights dialog which then pops up,
select the variables to be weighted, then pick up the desired weights at the
bottom of the dialog box (A/Sdev + B, or 1.0, or constant), then click
Update to apply the weights. Finally click OK to close the Set Weights
dialog box.
• Back to the main PCA dialog: choose validation method = Leverage
correction (at this stage – before you have learned more about
validation).
• Choose an appropriate number of Principal Components.
• Make sure that option Center Data is active.
• You may now start the computations by clicking OK.

6. Evaluate the present model by plotting the results (View). Go back to step 5
and use the option Keep out of calculation for the detected outliers. It is
normal to repeat this several times during an analysis.

4.4 Summary of PCA


One particularly useful way of looking at a principal component model is as a
transformation in which many original dimensions are transformed into another
coordinate system with (far) fewer dimensions.

The transformation is achieved through projection. For example, we may transform
a data structure from a 3-dimensional coordinate system into a 1-dimensional
system by projecting the data elements (objects) onto the particular linear feature
wanted, or calculated. The loading vector expresses this direction for the relevant
PC. If the variations along, say, the x3 axis are relatively small, and the x1-x2
relations show up as a strong correlation, we will not lose too much information by
using the appropriate correlation feature alone. The lost information of course
equates to the model error (Figure 4.5).


Figure 4.5 - Summary illustration of score and loading projections


[Diagram: samples plotted in the x1-x2-x3 variable space; the PC direction through the mean is described by p, and the distance along that direction for sample (a) is its score t.]

Translated into the principal components model, the new coordinate system has
fewer dimensions than the original set of variables, and the directions of the new
coordinate axes, called principal components, factors, or t-variables, have been
chosen to describe the largest variations. This is called decomposition or data
structure modeling, because we try to describe important variations, and
covariances in the data, using fewer PCs, i.e. by decomposing into orthogonal
components, which are also supposed to be easier to interpret.

The coordinates of the samples in the new system, i.e. their coordinates related to
the principal components, are called scores. The corresponding relationships
between the original variables and the new principal components are called
loadings. The differences between the coordinates of the samples in the new and
the old system, lost information due to projection onto fewer dimensions, can be
regarded as the modeling error or their lack-of-fit with respect to the chosen model.

Figure 4.6 illustrates PCA decomposition on a very small data set. The data set is
made simply to make it easy to see the principal components in relation to the
original variables. The data set contains 8 samples and 2 variables.


Figure 4.6 - Relationship between original and PC-coordinate systems


[Upper panel: the eight samples plotted against the two original variables (x,y); lower panel: the corresponding PC1 vs. PC2 score plot; Very small data…, X-expl: 100%,0%]

Figure 4.6 (upper part) shows the samples displayed in the original X-Y variable
space. Note that the axes do not have the same scale. Approximate principal
components lines are drawn by hand in the upper panel. Figure 4.6 (lower part)
shows the exact scores for PC1 and PC2 after a PCA. PC1 actually explains close
to 100% of the variance, as is also made clear by the fact that the principal components
are rotated relative to the original variable axes.
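
To see the same thing numerically, here is a tiny sketch with eight invented samples and two strongly correlated variables (the numbers are not those behind Figure 4.6, only chosen to behave similarly):

import numpy as np
from sklearn.decomposition import PCA

# Eight samples, two correlated variables (invented values for illustration)
X = np.array([[0.5, 0.9], [1.0, 2.1], [1.5, 2.9], [2.0, 4.2],
              [2.5, 5.1], [3.0, 5.9], [3.5, 7.1], [4.0, 8.0]])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)   # roughly [0.999, 0.001]: PC1 carries nearly all variance
print(pca.transform(X)[:, 0])          # PC1 scores: the samples spread out along a single new axis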


4.4.1 Interpretation of PCA-Models


From a data analytical point of view the only option when determining the number
of components appears to be the total residual variance. You look for the most
significant break in the residual variance plot - if such a feature exists.

One should of course always try to compare the data analytical result with the best
estimate one may come up with regarding the “expected” dimensionality. As an
example, consider the case of an NIR-spectroscopic investigation of mixtures of
alcohol in water. Here we might expect one PC to be appropriate for example,
reflecting a two-component end-member mixing system. However mixing alcohol
with water also gives rise to physical interference effects which require one, or
maybe even two additional PCs for a complete description. The number of PCs in
practice is therefore not 1, but rather 2-3. On the other hand, if your PCA on the
alcohol/water spectra came up with, say, 5-7 components, one would naturally be
very suspicious. Such a large number of components for this system clearly implies
a gross overfitting – unless, say, contaminants were at play.

In the pea example only the type of peas and the harvest time have been varied. We
should not therefore expect to see more than a couple of PCs. For the car dealership
data, we simply have no clue at the outset of the analysis. All we may surmise is
that there certainly cannot be more than 10 independent economic indicators
present in this data set, equal to p, but surely some correlation amongst these is to
be expected in such a sufficiently interacting system as the economic performance
of multi-million dollar dealerships. Still we in fact do not know whether to expect
A close to one, or closer to 10 in this case.

This is of course an example of the problems associated with completely explorative
data analysis. Nor is it always easy to assess expected model dimensions
even from a more well-known problem at hand; in spectroscopic calibration for
instance, instrumental faults/artifacts and/or background effects etc. will often give
rise to additional PCs which are then included in the model. One can only aim to
have sufficient knowledge of the particular data analysis problem to be able to sort
out such possible data analytical artifacts.

The main challenge in PCA is how many PCs to use - and how to interpret them.


4.4.2 Interpretation of Score Plots – Look for Patterns
Figure 4.7 is a (t1,t2)-score plot for a system of spectra from a calibration series of
dye/milk/water-solutions, i.e. a three-component system as characterized by NIR-
spectroscopy. In this type of situation it is advisable to use coded object-names in a
way that reflects their compositions or other known relevant external properties. In
this case the object name 2-0.5 indicates that this sample contains 2 ml of dye and
0.5 ml milk, and so on for all other objects.

To look for patterns, it might for example be useful to draw lines between objects
with identical dye level and to encircle groups of objects with the same milk level,
as has been done in Figure 4.7, manually by all means, if need be, or aided by some
relevant computer software (it is perfectly admissible to use hand-drawn guide
lines on any type of plot – in fact this is recommended throughout the modeling
phase in which information is emerging). The issue here is not how to do this - the
issue is to use whatever appropriate annotation to the plot in question, which will
help your particular interpretation.

The annotated plot shows that dye concentration increases along a (virtual) axis
that would go from lower left to upper right; i.e. both PCs contribute to the
determination of dye concentration. Similarly, milk content increases along a
direction that, although not quite straight within each dye concentration, could be
summarized by an axis roughly going from upper left to lower right. So both PCs
contribute to the determination of both compounds, even in the decomposed score-
plot (which above has been claimed to result in orthogonal, individually
interpretable components). Well, both yes and no!

It is - perhaps regrettably - not always this simple. There is no guarantee that all
data systems will necessarily be structured in such a simple fashion so as to be
stringently decomposable only into one-to-one phenomena-PC-component
relationships. But the PCA-results are nevertheless decomposed into easily
interpretable axes, a milk-concentration axis and a dye-concentration axis. By
careful inspection it may in fact be appreciated that these two mixing-phenomena
axes are only nearly orthogonal in their mutual relationship; it is just that the
simultaneous description of both these more complex phenomena requires that both
the first and the second axes are involved.


Figure 4.7 - The annotated score plot.


Use problem-specific, meaningful names to ease interpretation!

As you can see at the bottom of Figure 4.7, The Unscrambler always lists two
numbers, “X-expl: (92%),(7%)”. These correspond to the “explained variances of
X” along each component shown. PC1 explains 92% of the original variance in X
while PC2 explains 7%. This shows that PC1 and PC2 together describe 99% of the
total variation in the X-matrix. Higher order PCs therefore account for less than 1%
of the variation, so interpretation of, or outliers in, higher order PCs would be
absolutely irrelevant.
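
Such explained-variance percentages can be reproduced directly from a model's scores. A minimal illustration (our own), assuming the loading vectors are normalized to unit length as in the NIPALS sketch of section 3.14:

import numpy as np

def explained_variance_percent(Xc, T):
    """Percent of the total variance of the centered (and scaled) matrix Xc captured per PC."""
    total_ss = np.sum(Xc**2)            # total sum of squares in Xc
    comp_ss = np.sum(T**2, axis=0)      # sum of squares captured by each score vector
    return 100 * comp_ss / total_ss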

Score Plots - Outliers


Figure 4.8 is a score plot from a later exercise with two severe outliers, objects 25
and 26. In this case one can easily observe the enormous effect outliers can have on
a PC-model. PC1 is used almost exclusively to describe these two objects in
opposition to all others. The other objects lie nicely on a line that is mostly
described by PC2. If objects 25 and 26 are erroneous or non-representative (note:
they may not be - even if they at first look very different from the others), they
distort the whole model because of their large effect. If/when they are removed, the
other samples/objects become evenly distributed throughout the whole score plot
(Figure 4.9).


Figure 4.8 - Score plot with apparent gross outliers (objects 25 & 26)

Score Plots - Groups and Similarities


Figure 4.9 now also shows several groupings.

Figure 4.9 - Score plot after removal of outliers 25 & 26; cf. Figure 4.8

In fact the outlines of these groupings might now - in retrospect - also be
discovered in Figure 4.8, but much less clearly defined. This particular data
structure was “swamped” by inclusion of these outliers in the data analysis.
Clusters or groups of objects do not always imply problems, but if such groups are
clearly separated, it might be necessary to model each group separately.


In general objects close to each other are similar. Samples 16 and 19 are very
similar in Figure 4.9, whilst 18 and 7 are very dissimilar. Note also that these
particular two samples are actually the two most dominating samples defining PC2
(however, the latter would almost be identically defined were these two samples
removed).

The two score plots in Figure 4.10 and Figure 4.11 are the results of PCA on
another data set, with a different pretreatment in each case. The data is again a set
of NIR spectra. The score plot in Figure 4.10 was made from an analysis of the raw
spectra. The spectra were then pre-processed with a particular method called
Multiplicative Scatter Correction (MSC- to be explained later), resulting in the
alternative score plot in Figure 4.11.

Once again the score plots are relatively different, although the same overall
disposition of all objects is pretty much recognizable in both renditions - but which
is correct? In this case one would probably use the score plot in Figure 4.11,
because of its more homogeneous layout of objects. This plot is claimed to be
easier to interpret than Figure 4.10, by experienced data analysts. This data set will
appear again in some of the later exercises, in which some (more) argumentation
for this stand (Figure 4.11 over Figure 4.10) will also become apparent.

Figure 4.10 - Score plot (from Figure 4.8) for un-preprocessed data

[Score plot, PC1 vs. PC2; Alcohol raw, X-expl: 79%,16%]


Figure 4.11 - Score plot for pre-processed spectra (MSC)

[Score plot, PC1 vs. PC2; Alcohol correct…, X-expl: 89%,7%]

Pre-treatment comprises e.g. scaling, normalization or transformations. We have
already dealt with 1/SDev scaling performed column-wise, illustrated in Figure 4.1.
In addition there are a number of other transformations, e.g. logarithmic
transformations, performed over all variables or on individual variables.

Normalization, e.g. normalizing to 100%, corresponds to scaling each object (row)
so that its elements sum to the same fixed constant. This normalization is
frequently used on chromatographic data to compensate for individual differences
in the absolute amounts of solute injected into the chromatographic column.
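
As a small illustration of the difference between column-wise standardization and row-wise normalization to 100% (our own sketch with invented numbers; the constant 100 mimics “normalizing to 100%”):

import numpy as np

X = np.array([[2.0, 5.0, 3.0],
              [4.0, 10.0, 6.0],
              [1.0, 2.0, 1.0]])    # three objects, three variables (invented values)

# Column-wise standardization: each variable ends up with unit standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Row-wise normalization to 100%: each object's variables sum to the same constant
X_norm = 100 * X / X.sum(axis=1, keepdims=True)

print(X_norm.sum(axis=1))   # -> [100. 100. 100.]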

There are plenty of other optional pre-treatments. In general it is bad form to try out
all alternative scalings, transformations or normalisations indiscriminately without
problem-specific justification. Some of the most important pre-treatments will be
described in detail in section 11.3. At this point all you need to know is that there
are many possibilities, that they are all problem dependent, and that a wrong choice
may unfortunately lead to interpretations that are not relevant to your specific
problem.

4.4.3 Summary - Interpretation of Score Plots


• Objects close to the PC-coordinate origin are the most typical, the most
“average”. The ones far away from the origin may be extremes, or even outliers,
but they may also be legitimate end-members. It is your problem-specific
responsibility to decide in these issues.


• Objects close to each other are similar, those far away from each other are
dissimilar.
• Objects in clear groups are similar to each other and dissimilar to other groups.
Well-separated groups may indicate that a model for each separate group will be
appropriate.
• “Isolated objects” may be outliers - objects that do not fit in with the rest.
• In the ideal case, objects typically should be “well spread” over the whole plot.
If they are not, your problem-specific, domain knowledge must be brought in.
• By using well-reflected object names that are related to the most important
external properties of the different objects, one may better understand the
meaning of the principal components as directly related to the problem context.
• The layout of the overall object structure in score plots must be interpreted by
studying the corresponding loading plots.

4.4.4 Summary - Interpretation of Loading Plots
Important Variables
Variables with a high degree of systematic variation typically have large absolute
variances, and consequently large loadings. In a 2-vector loading plot they lie far
away from the origin. Variables of little importance lie near the origin. This is a
general statement, which is always scaling dependent, however. When assessing
importance, it is mandatory also to consider the proportions of the total explained
variance along each component - If e.g. PC1 explains 75% and PC2 only 5%, then
variables with large loadings in PC1 are much more important than those with large
loadings in PC2 – in fact 15 times as important!

Correlation / Covariance
Variables close to each other, situated out towards the periphery of the loading
plots, covary strongly, in proportion to their distance from the PC-origin
(relative to the overall total variance fractions explained by the pertinent
components). If the variables lie on the same side of the origin, they covary in a
positive sense, i.e. they have a positive correlation. If they lie on opposite sides of
the origin, more or less (some latitude here) along a straight line through the PC-
origin, they are negatively correlated. Correlation does not automatically reflect a
causal relation; interpretation is always necessary. Also: loadings at 90 degrees to
each other through the origin are independent. Loadings close to a PC

axis are significant only to that PC. Variables with a large loading on two PCs are
significant on both PCs.

Spectroscopic Data
In spectroscopic applications, and other data sets with many X-variables, the 1-vector
loading plot is often the more appropriate. Again, large loading values imply
important variables (e.g. wavelengths).

4.5 PCA - What Can Go Wrong?


There are quite a number of things that can go wrong in a PCA. Unfortunately it is
not always simple to detect errors in the PCA results themselves. Looking at
residuals is one way of checking, but the residuals need not always show that
something is wrong. The most important check you have is that the results should
comply with the specific problem understanding and that they are internally
consistent. Your best strategy is to avoid making mistakes during the analysis!

Below we have listed some of the most common potential pitfalls. While it is not
completely comprehensive (even the senior author of this book has not finished
making illuminative errors), one may certainly use it as a useful checklist.

1. Your Data Set Does not Contain the Information Expected/Needed
This can to some extent be observed, e.g. the interpretations do not make sense.
However, there are also some very subtle mistakes that can be made here. Take the
case where you carry out a PCA on a data set and come to the conclusion that PC1
corresponds to viscosity, say, while in reality it may just as well correspond to
temperature (but you did not take enough time to reflect on the problem
formulation, and thus you “missed” specifying that temperature should also be
measured). You would not be able to discover this without having thought about
the temperature dependence of viscosity and making the experimental protocol
accordingly, for instance by keeping the temperature constant in some of the
experiments. In a direct PCA this kind of lack of reflection on the entire problem
domain can easily lead to misinterpretations; in multivariate regression it can be a
severe problem. It all boils down to knowing the maximum about your problem and
your data set!


2. You Use too Few PCs in the Model


This means that you are not fully exploiting the potential information in your data.
You will lose out on the total potential information. It is not the worst mistake you
can make, but should of course be avoided through careful analysis.

3. You Use too Many PCs in the Model


This can be a serious mistake, indeed. You are including noise in your model. The
noise contribution must lead to erroneous interpretations; the analysis will always
be wrong, at least partly.

4. You Did not Remove Outliers which are Truly Due to Erroneous Data
Obviously this also gives an invalid model. You will then be modeling errors instead of the
interesting variations, to some significant extent.

5. You Removed Outliers that Contain Important Information
To put it bluntly, you are bound to miss something important. Your model will be
inferior, as it will not describe all the phenomena hidden in the data set in this case.

6. You Did not Explore the Score Plots Sufficiently


If you do not study the score plots carefully, you may miss important clues. Errors
4 and 5 above are connected to this mistake.

7. You Interpret Loadings with the Wrong Number of PCs


This may give rise to serious misinterpretations. You may even remove variables
that are important because they seem to show up as outliers. Remember that the
loadings constitute the bridge between variable space and PC space. If you have
chosen the “wrong” PC space, the “bridge” will not take you to the right place. The
bridge will be the wrong one; then the loadings cannot be trusted in any way.

8. You Rely too Much on the Standard Diagnostics in the Computer Program without Thinking for Yourself
This is a very common mistake - and the most serious of them all! The diagnostics
may be adequate and helpful most of the time, but one must always use one’s own
problem understanding and check the consequences. Remember that the computer
program has no knowledge of your specific problem - it runs along standard

procedures that may not apply to your specific data set, The Unscrambler included.

9. You Use the “Wrong” Data Pretreatment


This is rather a tricky point. Pre-processing/pre-treatment of the data set is essential
for relevant and valid modeling results. The correct type of pre-treatment is
generally given by the type of problem, but this is certainly up to you to decide; the
software unfortunately cannot be made clairvoyant. The wrong pre-processing may
well nearly always give rise to misinterpretations! This introductory textbook will
deal with some of the most important types of pre-treatment in a later section.

Hopefully you have not been put off completely by this list of possible errors, some
of which cannot even be detected when they arise! Experience, experience – and
still more experience is the only thing that will help you through many of these
pitfalls. Below in chapter 5 you will find a selected series of representative real-
world data sets, all of which show one or more interesting particularities. Which is
just the stuff experience comes from! But first we will lead you through a
particularly interesting case.

4.6 Exercise - Detecting Outliers (Troodos)


Finding outliers is always crucial. Here is an exercise using data from a real-world
case that shows some interesting issues.

Purpose
To learn about outliers and how to recognize them from the score plot and the
influence plot, which we have not introduced before.

To learn that in the end it is you, the data analyst, who must assume responsibility
and decide on the outlier designation issue(s). There is no other way.

Data Set
The Troodos area of Cyprus is a region of particular geological interest. There has
been quite some dispute over this part of Cyprus’ geological history, which
however need not be given in all details here, in order for the data to be used in the
present context. At our disposal we have 143 rock samples from different locations
in the areas underlying the pertinent section of the Troodos Mountains of Cyprus.


The rocks were painstakingly collected by the senior author’s geologist-friend and
colleague since college days, Peter Thy, in a series of strenuous one-man field
campaigns in Cyprus. The data analysis was carried out many years later.

The values in Table 4.1 are measurements of the concentrations of ten rock
geochemical compounds. Geologists often use such data in order to discriminate
between “families” of rocks, or rather, chemically related rock series, in order -
they hope - to be able to discriminate between genetically different, or similar rock
groups, clusters or rock series.

Table 4.1 - Description of Troodos variables


Var Name Description Var Name Description
X1 SiO2 Conc. of SiO2 X6 MgO Conc. of MgO
X2 TiO2 Conc. of TiO2 X7 CaO Conc. of CaO
X3 Al2O3 Conc. of Al2O3 X8 Na2O Conc. of Na2O
X4 FeO Conc. of FeO X9 K2O Conc. of K2O
X5 MnO Conc. of MnO X10 P2O5 Conc. of P2O5

Characterization by these so-called “major element oxides” accounts for the
dominant geochemical makeup of any rock, usually to the amount of some 95+%.
The rock samples are collected in the field as “not weathered”. They are supposed
to be pristine “representative samples” as determined by the responsible field
geologist. They are then analyzed in the geochemical laboratory and the relevant
compositions of the major element oxides are determined. Our task is now to model
Table 4.1 using a standard PCA and to look for patterns.

One cardinal question is: is there more than one overall group of samples?
If the rocks are all geochemically similar, one would expect the whole area to have
been formed geologically at the same time. If there are clear groups in the locations
of rocks, due to different geological backgrounds, we might draw other conclusions
about the formation of the area. This work was originally carried out to help settle a
major controversy regarding the entire geological history of Cyprus, see Thy and
Esbensen (1993) for details.

The data are stored on the file TROODOS.

Tasks
1. Make a PCA model; find all significant patterns that may impinge on the
objectives as laid out above.


2. Identify and remove outliers, or else...(!)
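
Before turning to the step-by-step instructions in The Unscrambler below, readers who prefer scripting can reproduce the gist of these two tasks with a sketch like the following (our own illustration; it assumes the TROODOS data have been exported to a hypothetical file troodos.csv with the ten oxide variables as columns, and the 3-times-the-mean cut-offs are ad hoc thresholds for illustration only):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("troodos.csv", index_col=0)       # hypothetical export of the TROODOS file
X = StandardScaler().fit_transform(data.values)      # autoscaling: centering + 1/SDev weighting

pca = PCA(n_components=4).fit(X)
T = pca.transform(X)

# Influence-plot style diagnostics for a 2-PC model
A = 2
E = X - T[:, :A] @ pca.components_[:A]               # residuals after A components
resid_var = np.sum(E**2, axis=1) / X.shape[1]
leverage = 1/len(X) + np.sum(T[:, :A]**2 / np.sum(T[:, :A]**2, axis=0), axis=1)

# Flag samples that are extreme in both leverage and residual variance (ad hoc cut-offs)
suspects = data.index[(leverage > 3 * leverage.mean()) & (resid_var > 3 * resid_var.mean())]
print(list(suspects))

Flagged samples should then be inspected against the raw data and, if warranted, removed one or two at a time before the model is re-fitted - exactly the iterative procedure described next.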

How to Do it
1. Read the data file and study the data table in a matrix plot. Mark the whole
data table and choose Plot - Matrix. Are all the variables in the same value
range? Do they vary to the same extent? Is scaling necessary?

Close the viewer, and use the View - Variable statistics menu to check the
mean values and standard deviation of the variables; then close the Variable
Statistics window.

2. Go to the Task - PCA menu and make a model with the following parameters.
Samples: All samples Variables: All variables
Weights: 1/SDev Validation: Leverage correction
Number of PCs: 10 Warning Limits Outlier limits (Field 2 - 7): 3.5

We use leverage correction here to make the modeling faster but, as you will
learn later, another validation method is more appropriate before we complete
the analysis.

Study the PCA Progress box. You can see how many outlier warnings were
given at the computation of each PC. Hit View to continue. The model
overview appears.

Study the warnings by selecting Windows - Warning List - Outliers and note
the outliers in the first PCs, especially 65, 66, 129 and 130. When you have
finished, close the Warning List.

3. Study the score plot for PC1 vs. PC2. Look for samples that are far away from
other samples.

Such lonely samples might indeed be outliers, but they can also be extremes.
You have to bear the original problem context and the raw data in mind while
working with the plots. It is also quite normal to look at the original data
matrix again when assessing potential outliers. In fact from Figure 4.12 it
would appear that objects 65-66 are indeed very atypical – especially with
respect to the dominating trend made up of all other samples. Whilst we would
not be justified in regarding either object 129 or 130 in a similar vein, it is a fairly
safe initial bet that objects 65-66 are indeed gross outliers, while object 129 is

in all likelihood an extreme end-member only. The status for object 130
probably needs further investigation.

Figure 4.12 - Score plot of PC1 vs. PC2 (Troodos)

Figure 4.12 is thus a typical picture of a model where such problems may
occur. A “good model” spans the variations in the data such that the resulting
score plots show the samples “well spread over the whole PC space”. In the
present plot the samples are well spread in PC1, but only a very few samples
represent the major variations in PC2. All the other samples are situated at, or
very close to the origin (zero score values in PC2), i.e. they have very little
influence on the model in this direction. We also note the <54%,23%>
partitioning of the total variance captured along <PC1,PC2> respectively.

4. Select Plot - Residuals (you may double-click on the miniature screen in the
dialog box to make your plot fill the whole screen) and plot the Influence plot
for PC1 - PC4 (write the range “1-4” under Components). Observe how
samples 65 and 66 move. This is a typical behavior of outliers and is illustrated
in Figure 4.13.

The plot shows residual variance on the ordinate axis and leverage along the
abscissa. High residual variance means poor model fit. High leverage means
having a large effect on the model. Therefore samples in the upper right corner
(large contribution to the model and high residual variance) are potentially
dangerous outliers.


When you add more PCs the residual variance decreases and even outliers will
eventually be fitted better and better to the model. The model thus
concentrates on describing the variations due to these few different samples
instead of modeling the variations in the whole data set.

Figure 4.13 – Typical outlier behavior in the Influence Plot


[Influence plot: residual variance of objects (ordinate) vs. leverage (abscissa), with the sample positions traced for PC1 through PC4; troo-0, # of PCs: 1]

5. Plot the 2D Scatter loadings: Plot – Loadings – General, Plot Type: 2D


Scatter, Vector 1: 1, Vector 2: 2. In which variables do the outliers have
extreme, maybe erroneous values? Check the raw data table to confirm this! If
we want to model the 139 other samples, perhaps we should remove the
extreme ones? (When you know the details of your own application, of course,
you do not remove any samples without having first checked the raw data and
understood why they are outlying.) Before you go on, take a look at the
Variance by Plot – Variances & RMSEP. Do not select any particular
variable, tick Total. How much explained variance do we get from 2 and 4
PCs?

Make a model without the outliers


An important issue in general is to CHECK the suspicious data. This is
especially true if the data has been typed in manually. Typing in any
reasonable number of results is almost guaranteed to produce some errors.

6. Go back to the data table (Window, Troodos) and select Task - PCA dialog.
Select Keep Out of Calculation in the sample tab. Type 65-66. (We always
remove only a few outliers at a time, starting with the most serious ones). Re-model.
Study the new warnings, the variance, the score plots, and the influence
plots. Are there any more outlier candidates? Has the amount of explained
variance increased?

7. Remove the next one, or two most obvious outlier candidate(s) (129-130). Re-
calibrate again and study the resulting new scores and variance. Does the
model look “better” now? Has the explained variance increased? Are there
more outliers? Do you see any signs of groups? Now take a look at the
loadings to see which variables influence the model.

Summary
In this exercise you worked with a data table, which after some initial standard
PCA apparently contained only four outliers. The difference between outliers and
extremes can indeed be small. Remove only one or two at a time, make a new
model, and study the new model to see how the changes imparted manifest
themselves. After removing the first two outliers, the explained variance was
slightly higher at 2 PCs. Removing the next two outliers did not really change the
explained variance any further.

The first 2 or 3 PCs describe 78-87% of the total variance. It is not necessarily an
objective in itself to achieve as high as possible a fraction of the total variance
explained in the first Principal Components at the exclusion of other data analytical
objectives; but it is of course often an important secondary goal even so. For the
present case: in the last score plot for PC1 vs. PC2 we now see clear signs of two
data groupings on either side of the ordinate. Finding out whether there was only
one, or several rock groups was the overall objective for this data analysis.

It is of course difficult for you to interpret the meaning of PC1 without more
detailed geological knowledge about the samples. What is objectively clear
however, is that the corresponding PC1 vs. PC2 loading plot indicates that variables
6 (MgO) and 7 (CaO) pull one group to the left, and the rest (except no. 3 Al2O3)
pull the other group to the right (see for yourself!). Thus there is a very clear two-
fold grouping of these 10 variables (one lonely variable would appear to make up a
third group all on its own along the PC2-direction, which we shall not be interested
in here). While there is a total of ten variables, there are in fact only two underlying
geochemical phenomena present, and the one portrayed by PC1 involves no less
than nine of these variables (and they are all pretty well correlated with one
another; two of them are negatively correlated to the seven others).


Note that these groupings of objects as well as variables were not at all obvious
until the four outliers were removed. Objects 19 and 20 are now also seen as
potential candidates for removal. We could continue to pick out more “mild
outliers” of course, but the main objective - to look for separate groups - has been
achieved after removing only four outlying, severely disturbing samples. This
revealed “hidden grouping” resulted in a new, interesting geological hypothesis,
Thy & Esbensen (1993).

From a geochemical point of view, one could also study the outliers in more detail
to understand how they are different, as well as going into the detailed
interpretations of each of the three PC components in the final model, including the
interesting meaning of the singular PC-2 variable, but this is of course a task
rightfully reserved for the geologists, and we here leave these results for them to
mull over further, ibid.


5. PCA Exercises – Real-World Application Examples

5.1 Exercise - Find Clusters (Iris Species Discrimination)
Purpose
In this exercise you will use PCA to look for significant data groups, data classes.

Problem
The data set used in this exercise is taken from Fisher (1936). This is a famous data
set in the world of statistics. The variables are measurements of sepal length/width,
and petal length/width of twenty-five plants from each of three subspecies of Iris:
Iris Setosa, Iris Versicolor and Iris Virginica. The data are eventually to be used to
test the botanical hypothesis that Iris Versicolor is a hybrid of the two other
species. This hypothesis is based on the fact that Iris Setosa is a diploid species
with 38 chromosomes, Iris Virginica is a tetraploid and Iris Versicolor is a
hexaploid having 108 chromosomes.

Data Set
The file IRIS contains several data sets. The sample set Training contains
measurements of four variables (see above) for 75 samples: 25 samples of each Iris
type.

Tasks
1. Make a PCA model and identify clusters.
2. Find the most important variables to discriminate between the clusters found.

How to Do it
1. Open the file called IRIS and investigate it using plots and statistics.


2. Make a PCA model with these parameters:


Samples: Training            Number of PCs: 4
Variables: All variables     Validation: Leverage correction
Weights: 1/Sdev

There are four outlier warnings. For the moment we will disregard them. We
will now look at the results directly.
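If you would like to reproduce a comparable standardized PCA outside The Unscrambler, a hedged scikit-learn sketch follows. It uses scikit-learn's built-in copy of Fisher's iris data (all 150 flowers) as a stand-in for the 75-sample Training set in the IRIS file, and it does not reproduce leverage correction.

```python
import numpy as np
from sklearn.datasets import load_iris            # stand-in for the book's IRIS file
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                              # 150 x 4 sepal/petal measurements
X_std = StandardScaler().fit_transform(X)         # autoscaling, i.e. 1/SDev weighting

pca = PCA(n_components=4).fit(X_std)
scores = pca.transform(X_std)

# cumulative explained variance: how many PCs for ~70%? for ~95%?
print(np.cumsum(pca.explained_variance_ratio_))
# a scatter plot of scores[:, 0] vs. scores[:, 1] shows the class structure
```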

3. Variance
If necessary, change the residual variance plot to explained variance. Select
View - Source- Explained Variance.

How many PCs must be applied to explain, say, 70% of the variance? 95%?

4. Scores
Interpret the score plot for PC1 and PC2. How many groups do you see? How
does that comply with our prior knowledge about the data? Is there a clear gap
between the versicolor and the virginica groupings? What does PC1-PC3 show?

5. Loadings
Study the 2D Scatter loading plot. Which variables are the most important?
Which variable discriminates between setosa and the other two? Try to plot also
the loadings as line plots. Edit - Options may be useful to plot the results as
bars instead of lines.

Summary
The first two PCs account for approximately 96% of the variance. There are three
classes: one very distinct (the setosa class) and two others which are not as well
separated from each other. This plot, of course, clearly indicates that the versicolor
and virginica species are most alike, while the setosa species is distinctly different
from these two. It may also suggest that it can be difficult to differentiate versicolor
from virginica, but we have only taken our very first shot at this yet. We can see
however, that Iris versicolor lies between setosa and virginica, perhaps supporting
the hypothesis that it is a hybrid - or at least not contradicting this. Perhaps more
information is needed to be in any position to address this important scientific
question?

All the variables are used to differentiate between the three different species. Of
course, in an up-to-date study we would have used many more morphological
variables. This particular 4-variable data set of Fisher (1936) has become a
statistical standard over the years, and been used to test a great many new
methodological approaches; see e.g. Wold (1976). We shall return to the Iris data
set later.

5.2 Exercise - PCA for Experimental Design (Lewis Acids)
Purpose
This exercise demonstrates how PCA can be used to make an experimental design.
This problem is taken from Rolf Carlson’s book “Design and Optimization in
Organic Synthesis” (Elsevier, 1992).

Problem
Assume that you are about to test a new reaction for which electrophilic catalysis is
strongly believed to be beneficial. For this purpose the addition of a Lewis acid
catalyst would be worth testing. As there are many potentially useful Lewis acids to
choose from, the problem is to find a good one, preferably the optimal one. In
totally new reactions it may be difficult to make a good guess, so we need to make
some experiments. But which ones should we test? How do we design such an
experiment?

A good idea would be to select a limited number of test catalysts to cover a range
of their molecular properties. Using PCA we may describe a range of different
catalysts in terms of principal properties - i.e. the principal components that
describe the main variations.

We may describe different catalysts by known chemical descriptors, such as


essential functional groups, dipolarity, intrinsic properties, etc. Such chemical data
are readily available in standard organic chemistry reference handbooks.

Data Set
The data table in the file LEWIS contains chemical descriptors for 28 Lewis acids.
The following descriptors are believed to contain relevant information for the
problem (Table 5.1):


Table 5.1 – Major chemical descriptors

No    Description
X1    Dipole moment, vapor phase (D)
X2    Negative of standard enthalpy of formation (kcal mol⁻¹)
X3    Negative of standard Gibbs energy of formation (J mol⁻¹ K⁻¹)
X4    Standard entropy of formation (J mol⁻¹ K⁻¹)
X5    Heat capacity (J mol⁻¹ K⁻¹)
X6    Mean bond length (Å)
X7    Mean bond energy (kJ mol⁻¹)
X8    Dielectric constant
X9    Ionization potential (eV)
X10   Magnetic and diamagnetic susceptibility (10⁻⁶ cgs)

Tasks
Make a PCA model to select catalysts that are the most different, i.e. span the
variations.

How to Do it
1. Open the file LEWIS and make a PCA model. Should the data be
standardized?

Validate with leverage correction and calculate the maximum number of PCs.
Interpret the score plot using the two first PCs. How many PCs do you need to
explain more than 50% of the variance? Select 9 Lewis acids with “the most
different” properties from the plot!

2. Study the loading plot. Are any of the descriptors unnecessary? Which
variables show the highest covariation? Which variables show a negative
covariation with variable 9?

The organic chemists who conducted these experiments originally chose nine
different acids, but they also based their decisions on certain chemical factors,
which we are unaware of. They chose samples 1, 4, 11, 12, 13, 16, 19, 26 and
28, which cover the variations well and also include a few around the middle of
the score plot. Why also choose these latter?

The experimental results obtained in the reactions using these selected catalysts
fully confirmed independent conclusions in the literature on preferred catalysts.


Lewis acid number 1 (AlCl3) gave the best results. Sample 18 has also been
reported to be a superior catalyst in Friedel-Crafts reactions, and you can see
that samples 1 and 18 do indeed lie close to each other in the score plot.

Summary
In this exercise you have used the score plot to find samples that differ greatly from
each other, i.e. representative samples that span the experimental domain as much
as possible. Samples lying close together in the plot of course have similar
properties. The extreme samples all lie far away from the origin. All variables have
large contributions. Variables 2 and 3 covary. Variable 9 has a negative covariation
with number 10, both in PC1 and PC2.
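One simple way to mimic this kind of "span the score plot" selection in code is sketched below; it is not necessarily the procedure used by the original chemists, and the LEWIS descriptor matrix X (28 acids x 10 descriptors) is assumed to be loaded by you.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: (28, 10) array of Lewis acid descriptors, assumed loaded from the LEWIS file
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

dist = np.linalg.norm(scores, axis=1)   # distance from the origin in the PC1-PC2 plane
extreme = np.argsort(dist)[-7:]         # e.g. the seven most "different" acids
central = np.argsort(dist)[:2]          # plus a couple near the middle of the plot
print(np.sort(np.concatenate([extreme, central])) + 1)   # 1-based sample numbers
```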

5.3 Exercise - Mud Samples


Purpose
In this exercise you will use PCA to look for patterns in a large data set, and
investigate the use of different scaling techniques.

Problem
The data in this exercise were kindly provided by IKU (Institute for Petroleum
Research), Trondheim, Norway. During oilrig drilling operations, mud (barium
sulfate with other chemical components) is sometimes released into the sea and
may thus cause pollution, primarily along the main current direction. The oil
authorities demand regular monitoring of the pollution level around the platforms;
if it is too high, the mud must be removed or the concentrations of harmful
substances must be reduced in another way.

Samples are collected regularly and analyzed by gas chromatography.


Chromatograms from non-polluted areas show only background noise. These have
total hydrocarbon contents (THC) of approx. 3, while chromatograms from polluted
areas show higher THC-concentrations.

77 mud samples were collected and their chromatograms recorded. About 3000
peaks were reduced to 1041 by maximum entropy pretreatment in a selected
chromatographic retention interval. Normally several of the peaks are integrated,
but this does not really catch all the important variations. In addition it may be
difficult to compare many chromatograms and interpret the variations.


It is normal to quantify the THC (total hydrocarbon contents), but using PCA we
can instead:
• get a qualitative measure
• get an overview of the variation in a compressed way
• interpret loadings to find the interesting peaks
• look for patterns in the score plot
• classify the samples with regard to level of pollution

Data Set
The data file MUD contains 77 chromatograms with 527 variables
(chromatographic peaks). Originally there were 3000, reduced as described above.
In addition we have deleted other variables with little information, so that you can
analyze the data even if your PC has little memory.

Tasks
Make a PCA of the raw data table. Investigate if there are significant patterns that
reflect polluted and non-polluted samples. Then make the model on standardized
data instead.

How to Do it
1. Open the data file MUD and plot the data as lines to get an overview.
Typical unpolluted samples are no. 1, 2 and 3; typical polluted samples are 77,
23 and 22 for example. Do as follows: Edit – Select Samples, Samples: 1-3,
22, 23, 77, OK. Plot – Line, All Variables, OK. Edit – Options, Curve ID,
Labels Layout: Position on File.

2. Make a PCA model without weighting. Validate with leverage correction and
set the warning limit for outlier detection (field 2 - 5) to 5.0. Calculate 4 PCs.

3. Study the residual variance. How many PCs do we need?

Study the score plot. How much of the total variation is described by the first
two PCs? How can you interpret PC1? Find the samples listed as polluted
above and compare to the unpolluted; draw your conclusions.

Based on the score plot, would you think sample number 52 is polluted or non-
polluted? What about sample 66? Is sample 72 more polluted than sample 22?


4. Study the line loading plot (activate the upper right plot, Plot – Loadings –
General, Line, Vector 1: 1-2, OK) to look for interesting peak areas. At which
retention times are the chromatograms most different?
Save the model: File – Save As, Mud1, Save.

We would normally consider standardizing the data, when there are such large
differences between the variables as you could see in the line plot. The aim is to
also allow the more subtle variations to play a role in the analysis. Run a new
PCA model with Weights = 1/Sdev: activate the data editor, use the Task menu,
etc. Give the model a new name.
Close the data editor and select Window – Tile Vertically so that you can
compare the two models. Study the explained variance and the scores. Is this
model different? Does it change your overall interpretations? Try to explain
why the models give the same results!

Summary
There is a break in the variance curve at 1 PC, but with so many variables and
samples we should also be able to use 2 PCs to get a good 2-vector score plot. 2
PCs explain 93% of the total variations in the 77 chromatograms. The explained
validated variance is 88% using 2 PCs. The unpolluted samples lie to the left in the
score plot, while the polluted ones lie to the right. The first PC thus seems to
describe the overall, general pollution level. Sample 52 is unpolluted, while no. 66
and 72 are polluted. The more to the right the samples appear, the more polluted
they are.

The loadings in PC1 are largest at retention times between 100 and 300, so this is
where the most interesting peak information lies for this data set. In PC2 variables
105-110 have the largest loadings.

The model based on standardized data shows the same general patterns. The score
plot is reversed along PC2, which doesn’t matter; only the relative patterns count
in PCA. In this case the systematic variations in the important variables are very
large, both in the standardized and the non-scaled data. Therefore they will
dominate both models. The loading plot naturally also shows a reversed PC2, and
has a somewhat different shape.

Normally you must be very cautious when dividing variables with values between 0
and 1 by their standard deviation. If the standard deviation is small you will divide
by a small number that is dangerously close to zero, which may cause an
unnaturally large amplification of the scaled variances and can sometimes result in
numerical instability in the calculations. In this case all the variables are between 0
and 1, so they were all affected in the same way.
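The caution above is easy to encode. A minimal sketch (the data array X and the minimum-standard-deviation threshold are assumptions for illustration) that applies 1/SDev weighting but refuses to divide by a near-zero standard deviation:

```python
import numpy as np

def autoscale(X, min_std=1e-8):
    """Center and weight by 1/SDev; refuse to divide by a near-zero
    standard deviation, which would blow up noise in that variable."""
    std = X.std(axis=0, ddof=1)
    if np.any(std < min_std):
        bad = np.where(std < min_std)[0] + 1
        raise ValueError(f"variables {bad} have (almost) zero variance; "
                         "consider leaving them unscaled or removing them")
    return (X - X.mean(axis=0)) / std
```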

It will now be possible to make a PCA model using only normal background
samples for example (i.e. all the samples to the left) and use classification to see if
new samples are polluted or not, see below SIMCA. In this example we have used
PCA as an initial “data-scope” on which to see the first exploratory, overall
patterns of the X-matrix we was given to start out with. We may opt for carrying on
in the manner indicated etc.

5.4 Exercise - Scaling (Troodos)


Above you analyzed geological rock samples (Troodos mountains, Cyprus) where
we were mainly interested in looking for significant groupings and outliers.
Because the geochemical variables were different with respect to their empirical
variance, their standard deviation and their mean values, you scaled the data.

Purpose
In this exercise you will run a PCA on the same Troodos data without scaling. Compare this
model with the earlier one and investigate the effects of the scaling. If you did not
save the pertinent model results from earlier, observe how quickly you can
regenerate these now that you are already a somewhat accomplished data analyst.

Data Set
File: TROODOS (143 rock samples and 10 geochemical variables).

Tasks
Make a PCA model without scaling. Study the scores, loadings, and explained
variance plots, and explain why the results are so different compared to those from
the earlier auto-scaled analysis.

How to Do it
1. Open the data and run PCA with weights = 1.0, leverage correction and
outlier limit = 3.0.

Study the list of outlier warnings.


2. Study the explained X-variance. How much variance is explained by 2 PCs?


Compare with the first model you made on the Troodos data.

3. Study the loadings for PC1 versus PC2. Are there any insignificant
variables? Which? Why do fewer variables explain more of the variation in this
case? If necessary compute the statistics for the variables again.
Do you find the same outliers and groups in this model?

Summary
Two PCs explain about 90% of the variance in this model, while the scaled model
needed four PCs to explain the same variance level. The loadings show that
variables no. 5, 9, and 10 now have no contribution to the model. Their variance in
absolute numbers is very, very small compared to the others. Variables 5, 9, and 10
are so unimportant that they are not even included in the total variation. The other
variables therefore dominate the model completely and the model explains 90% of
the variance in these variables.

One should not ponder too much over these findings. There is no objective basis on
which to compare different data sets (e.g. data sets with a different number of
variables). In effect, we have only a seven-variable data set when not auto-scaling.
Every data set analyzed with PCA has its own relative initial total variance, which
is set to 100%. The relative internal %-levels in any PCA-analysis cannot be
compared between two, externally different PCA-analyses.

We do find the same outliers and groups with both models, though in a
different order, so everything was not totally incommensurable between these two
alternative scalings.

We have seen that when the raw variables are characterized by (very) different
variances for one specific data set, it matters very much that these numerical
variance differences are not allowed to dominate. We choose the autoscaled PCA
in this and in any similar situation.
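A quick, hedged way to see this effect on your own copy of the data is to fit the two alternative models side by side (the Troodos array X is assumed to be loaded; four components are used as in the exercises above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: (143, 10) Troodos data, assumed loaded from the TROODOS file
for name, data in [("raw (weights = 1.0)", X),
                   ("autoscaled (1/SDev)", StandardScaler().fit_transform(X))]:
    evr = PCA(n_components=4).fit(data).explained_variance_ratio_
    print(name, np.round(100 * np.cumsum(evr), 1), "% explained after 1-4 PCs")
```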


6. Multivariate Calibration
(PCR/PLS)
The central issue in this book (after the necessary introductions to projections and
PCA) is multivariate calibration. This involves relating two sets of data, X and Y,
by regression. In a systematic context multivariate calibration can be called
multivariate modeling (X, Y). We first address multivariate calibration in general
terms before we introduce the most important methods in more detail. So far we
have not worked with a Y-matrix at all, but now Y will become a very important
issue.

6.1 Multivariate Modeling (X,Y): The Calibration Stage
There are many kinds of multivariate data modeling. We are already quite familiar
with PCA-modeling, which comprises modeling of one X-matrix alone; this could
be termed multivariate modeling (X). In PCA we make a principal component
model of the essential covariance/correlation structure of X. The PCA-model is
displayed as score plots, loading plots, variance plots, etc. Interpretation is an
important problem-specific issue.

In contrast, multivariate calibration concerns two matrices, X and Y. The Y matrix
consists of the dependent variable(s) whilst X contains the corresponding
independent variables (in traditional regression modeling terms). At this point it
does not matter whether Y consists of one variable or several. The focus will be on
general understanding of the concept of multivariate calibration – multivariate
regression.


Figure 6.1 – Calibration: establishing a regression model from known X and Y data
[Schematic: the known X and Y data are combined to establish the Model.]

The multivariate model for (X,Y) is simply a regression relationship between the
empirical (X,Y) relations. We establish the model through multivariate calibration.
Thus the first stage of multivariate modeling (X,Y) is the calibration stage.

But calibration is rarely just establishing/finding a model for description of the
connection between X and Y. Mostly we also want to use the model, for example
for future prediction. We wish to use the model to find new Y-values from new
measurements of X, i.e. to predict Y from X. In general, prediction is therefore the
second stage of multivariate calibration.

It is mandatory to start with a known set of corresponding X and Y data. From
these we develop the relevant multivariate regression model. The model may then
subsequently be used on new X measurements to predict new Y-values.

6.2 Multivariate Modeling (X, Y): The Prediction Stage
Let us repeat the latter issue, so the purpose of this entire chapter is quite clear:

First a multivariate regression model of the (X, Y) relationship must be established.


The statistically correct way to describe this is that we estimate the parameters of
the (X, Y) regression model.


Figure 6.2 - Using the multivariate regression model to predict new Y-values


X + Model ⇒ Y

The regression model is secondly used on a new set of X-measurements for the
specific purpose of predicting new Y-values. The reason is that this makes it
possible to use only X-measurements for future Y-determinations, instead of
making more Y-measurements.

Why Use X Instead of Y?


There are many reasons why multivariate calibration is so extensively used. One of
the main reasons is that one often wishes to make as few Y-measurements as
possible. Typically the Y-measurements may be expensive, difficult, ethically
undesirable, labor intensive, time consuming, dangerous, etc. All these
characteristics of the Y-measurements have one thing in common: it would be
desirable to replace them with X-measurements if these are simpler, cheaper, faster,
etc. There are very many practical application examples of this type.

Spectroscopic methods, for example, can often be implemented as fast methods that
can measure many simultaneous chemical and physical parameters indirectly. In the
same vein, there are also cases where it would be advantageous to substitute
several, perhaps slow and cumbersome, physical or wet chemical measurement
methods with one spectroscopic method. These spectra would then constitute the X
matrix and the sought for parameters, for example chemical composition, flavor,
viscosity, quality, etc. would constitute the Y matrix.

Another example would be “biological activity”, e.g. toxicity. The biological
activity of a compound can be difficult both to identify and quantify directly,
because this may involve animal experiments or similar. By using a set of
compounds with known biological activities, we can connect physical parameters,
which are easily determined, e.g. molecular structure, molecular weight, etc., with
the resulting biological activity. This field is called QSAR, Quantitative
Structure-Activity Relationships, in which indirect multivariate calibration really
comes to the fore. A related field, which is equally dependent on multivariate
calibration, is QSPR, Quantitative Structure-Property Relationships.

In the biological and environmental sciences and in technology - as well as in a
legion of industrial applications - there is a tremendous need for multivariate
calibration.
We will attempt to illustrate these applications as fully as possible, but a complete
coverage of the field would be impossible due to the virtually unlimited potential of
the methods.

6.3 Calibration Set Requirements (Training Data Set)
The starting point is always a set of known measurements collected for the data
matrix X. These should be characterized with the method to be used in the future,
when one wishes to exploit the more desirable X-data. For each object (row) in the
X- matrix, it is of course mandatory also to measure the corresponding Y-value. Y
is measured with the method one would like to substitute for in future work, often
called “the reference method”, so one also always knows the pertinent Y-matrix
(Frame 1.1 chapter 1). This then is the completely general starting point for the
development of any multivariate calibration model. The matrices X and the
corresponding Y are collectively called the calibration set or the training set.

The calibration set is of critical importance in any multivariate calibration
procedure. It must meet a number of requirements. The most important is that the
calibration set is representative of the future population from which the new
X-measurements are to be sampled, and clearly the measuring conditions should be
as similar as possible.

For instance, consider the case where you wish to use spectroscopy to measure the
amount of fat in ground meat, instead of the more time consuming laboratory wet
chemical fat determination methods. If the future samples all will have a fat content
between 1 - 10% only, then obviously we cannot use the spectra of meat with a fat
content of 60 - 75% for the calibration set etc. This simplistic example may sound
trivial, but the issue is not.

The demand that the training set (and the test set, see further below) be
representative covers all aspects of all conceivable variations in the conditions
influencing a multivariate calibration. In the case of the ground meat, if for example
one was to say: “We only want to determine fat in these meat samples. Therefore
let us make our own training set mixtures of fatty compounds in the laboratory to
keep things simple. We would know precisely how much fat we have from how we
made the training samples, concentrating on the fat component and adding the other
meat components in correct proportions etc. Then we would not have to collect
complex, real-world meat samples and do those tiresome fat measurements”.
Never entertain such thoughts! This idea is based upon univariate calibration
theory, which, in reality, would seriously limit your creativity.

In this case, your laboratory spectra would be so different from the spectra of the
“real” meat that your laboratory-quality “fat model” would not apply at all.
Naturally so, because the artificial laboratory training samples, no matter how
precisely they would appear to have been created, would not - at all - correspond to
the complexities of the real-world, processed meat samples. This may well be
mostly because of significant interferences between the various meat components,
as occurring in the natural state – in spite of their quantitatively correct proportions.

A more stringent way of formulating the representativity requirement is that the
calibration set must span the X-space, as well as the Y-space, as widely and
representatively as possible, again in the specific sense of the future usage of the
final, correctly validated (see further below), prediction model. Sometimes
designed experiments play an important role here.

Experimental design ensures that the calibration set covers the necessary ranges of
the phenomena involved. However, there are always some related constraints being
put on the training set. The most common is that the number of available samples
is more-or-less severely restricted. At other times we simply have to accept the
training data set as presented in a specific situation.

Irrespective of one’s own situation, one must always be aware of the range, the
span, of the calibration set, since this defines the application region of the model in
future prediction situations. Only very rarely will one be so lucky that the
application range can be extensively extrapolated beyond the range of the
calibration set. Data constraints will also be further discussed in section 9.1.


6.4 Introduction to Validation


Validation of a model means testing its performance according to an a priori given
set of test result specifications. For example, in prediction model validation,
validation testing is concerned with its prediction ability on a new data set, which
has not been used in the development of the model. This new data set is called the
test set. It is used to test the model under realistic, future conditions, specifically
because it has been sampled so as to represent these future conditions. Indeed if
possible one should even use several test sets! “Realistic” here means that the test
set should be chosen from the same target population as the calibration set, and that
the measuring conditions of both the training set and the test set are as
representative of the future use as indeed possible. However, this does not mean
that the test set should be too closely similar to the training set. For instance, it will
not do simply to divide the training set into two halves (provided the original set is
large enough), as has unfortunately sometimes been recommended in chemometrics.
This would decidedly be wrong!

At first sight, this issue may appear somewhat difficult, but it will be discussed in
more detail at the appropriate points. Validation will be discussed in full detail in
chapter 7, and also in sections 9.8 and 18.5. The brief overview below is intended
only to introduce those important issues of validation which must be borne in mind
when specifying a multivariate calibration. From a properly conducted validation
one gets some very important quantitative results, especially the “correct” number
of components to use in the calibration model, as well as proper, statistically
estimated, assessments of the future prediction error levels.

Test Set Validation


The procedure introduced above - using a completely new data set for validation -
is called test set validation. There is an important point here; we also have to know
the pertinent Y-values for the test set, just as we did for the calibration set. The
procedure involved in test set validation is to let the calibrated model predict the Y-
values and then to compare these independently predicted values of the test set
with the known, real Y-values, which have been kept out of the modeling as well as
the prediction so far. We shall call predicted Y-values Ypred and the known, real
Y-values Yref (hence the term “reference” values).

An ideal test set situation is to have a sufficiently large number of training set
measurements for both X and Y, appropriately sampled from the target population.
This data set is then used for the calibration of the model. Now an independent,
second sampling of the target population is carried out, in order to produce a test
set to be used exclusively for testing/validating of the model – i.e. by comparing
Ypred with Yref.

The comparison results can be expressed as prediction errors, or residual variances,
which now quantify both the accuracy and precision of the predicted Y-values, i.e.
the error levels which can be expected in future predictions.
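In code, test set validation is nothing more than fitting on the calibration set and then comparing Ypred with Yref on the independently sampled test set. A hedged sketch follows; the arrays X_cal, y_cal, X_test and y_test are assumed to come from the two separate samplings, and an ordinary linear regression stands in for whichever multivariate method is actually used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X_cal, y_cal)   # calibration stage
y_pred = model.predict(X_test)                 # prediction stage on the test set

errors = y_pred - y_test                       # Ypred - Yref
print("bias (accuracy):      ", errors.mean())
print("error spread (precision):", errors.std(ddof=1))
print("RMSEP (overall error):", np.sqrt(np.mean(errors ** 2)))
```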

Other Validation Methods


There is no better validation than test set validation: testing on an entirely “new”
data set. One should always strive to use validation by test set. TEST IS BEST!

There is, however, a price to pay. Test set validation entails taking twice as many
samples as would be necessary with the training set alone. However desirable, there
are admittedly situations in which this is manifestly not possible. For
example, when measuring the Y-values is (too) expensive, unacceptably
dangerous, or the test set sampling is otherwise limited, e.g. for ethical reasons or
when preparing samples is extremely difficult, etc. For this situation there is a
viable alternative approach, called cross validation, see chapter 7. Cross validation
can, in the most favorable of situations, be almost as good as test set validation, but
only almost - it can never substitute for a proper test set validation! And the
most favorable of situations do not occur very often either....

Finally, there is a “quick and dirty” validation method called leverage-corrected
validation. This is actually the one used so far, because we had not yet introduced
the concept of validation. This method uses the same calibration set to also validate
the model, but now “leverage-corrected”. It is obvious that this may be a
questionable validation procedure, all depending on the quality of the corrections
employed. Furthermore, this often gives results which are too optimistic. However,
during initial modeling, where the validation is not really on the agenda yet, this
method can be useful as it saves time.

In chapter 7, we shall later explain in detail how these other validation methods
work and how they are related to test set validation.

Modeling Error
How well does the model fit to the X-data and to the Y-data? How small are the
modeling residuals? One may perhaps feel that a good modeling fit implies a good
prediction ability, but this is generally not so, in fact only very rarely, as we shall
discuss later in more detail.

Initial Modeling
Detection of outliers, groupings, clusters, trends etc. is just as important in
multivariate calibration as in PCA, and these tasks should in general always be first
on the agenda. In this context one may use any validation method in the initial
screening data analytical process, because the actual number of dimensions of a
multivariate regression model is of no real interest until the data set has passed this
stage, i.e. until it is cleaned up for outliers and is internally consistent etc. In
general, removal of outlying objects or variables often influences the model
complexity significantly, i.e. the number of components will often change as a
result anyway. However, the final model must be properly validated, preferably by
a test set (alternatively with cross validation), but never with just leverage
correction.

6.5 Number of Components (Model Dimensionality)
As for PCA, the “correct” number of components is also essential for the
multivariate regression methods. We have not yet introduced an operative
procedure to assess this number. In fact, for multivariate calibration there is a very
close connection between validation and finding this “optimal” number of
components.

Test set validation, cross validation, and leverage correction are all designed to
assess the prediction ability, i.e. the accuracy and precision associated with Ypred.
To do this, Ypred must be compared to the reference values Yref. The smaller the
difference between predicted and real Y-values, the better. The more PCs we use,
the smaller this difference will be, but only up to a point, which is the optimal
number of components. Let us see how this is done.

Minimizing the Prediction Error


The prediction error is often expressed as the residual Y-variance, based on the
validation. It is expressed in several forms and is typically studied for a varying
number of components.


In Figure 6.3 the x-axis shows the number of components included in the prediction
model. The y-axis denotes a measure for the prediction variance, which usually
comes in two forms: 1) the residual Y-variance (also called prediction variance,
Vy_Val) or 2) the RMSEP (Root Mean Square Error of Prediction); the latter is
simply the square root of the former.

Equation 6.1    $\mathrm{RMSEP} = \sqrt{\dfrac{\sum_{i=1}^{n} \left(y_i - y_{i,\mathrm{ref}}\right)^2}{n}} = \sqrt{V_{y,\mathrm{Val}}}$
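Equation 6.1 translates directly into a few lines of code; a minimal sketch:

```python
import numpy as np

def rmsep(y_pred, y_ref):
    """Root Mean Square Error of Prediction, Equation 6.1."""
    y_pred, y_ref = np.asarray(y_pred), np.asarray(y_ref)
    return np.sqrt(np.mean((y_pred - y_ref) ** 2))
```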

Figure 6.3 - RMSEP vs. number of PCs
[Line plot of RMSEP (Methanol), y-axis 0 to 30, against the number of components PC_0 to PC_4; the curve has its minimum at 3 PCs.]

As is obvious from Equation 6.1, the overall prediction ability is best when the
prediction variance (prediction error) is at its lowest. This is where the prediction
error, the deviation between predicted values and real values, has been minimized
in the ordinary statistical sum-of-squared-deviations sense. The plot in Figure 6.3
shows a clear minimum at 3 PCs, which indicates that this number of components
is optimal, i.e. the number where the prediction variance (residual Y-variance) is
minimized. Inclusion of more components may improve the specific modeling fit,
but will clearly reduce the prediction ability, because the RMSEP goes up again
after this number. From the practical point of view of prediction optimization, this
minimum corresponds to the “optimal” complexity of the model, i.e. the “correct”
number of prediction model components. Note that the specific determination of
the optimum is intimately tied in with the validation. It is therefore very easy
indeed to obtain the correct dimensionality of any multivariate calibration model –
all one has to do is to carry out an appropriate validation. This is somewhat
different from the case of PCA, in which only the residual X-variance plot
was at hand.

These introductory remarks on multivariate calibration in sections 6.1 to 6.5 may
now serve as a sufficient background upon which to focus on the specific
regression methods selected.

6.6 Univariate Regression (y|x) and MLR


We shall now present the bilinear regression methods themselves: Principal
Component Regression (PCR) and the Partial Least Squares Regression (PLS-R)
methods PLS1 and PLS2. For comparison we will start by looking at traditional
univariate regression and MLR.

6.6.1 Univariate Regression (y|x)


The simplest form of regression is the so-called univariate regression. One variable
only is measured, x, and one response property is modeled, y. This is described by
the vector relationship y = a + bx.

Univariate regression is undoubtedly the most often used regression method. It has
been studied extensively in the statistical literature and it is part of any university
or academy curriculum in the sciences and within technology. We assume that the
reader is sufficiently familiar with this basic regression technique, but also refer to
the relevant statistical texts in the literature section if need be.

Figure 6.4 - Univariate regression: overlapping spectra in the wavelength region (x-axis)
[Two absorbance-vs-wavelength panels: the left shows the isolated peak of compound “I”; the right shows the peak of “I” overlapped by an unknown compound “II”.]

There is a serious problem with this approach however, as there are no modeling
and prediction diagnostics available. It is de facto impossible to detect situations
where univariate regression gives false estimates. An example of this is given in
Figure 6.4. In the left-hand panel one can predict the concentration (y) of
compound I in a solution from the spectroscopic absorbance at one
selected wavelength (x), usually chosen to correspond to the peak height. If
there are one or more unknown compounds II present in the solution, however,
and if there is a significant overlap between these spectra in the wavelength region
in which we measure (x), the measured absorbance will be the sum of the
absorbances of compounds I and II at the wavelength chosen. This is illustrated in
the right hand panel. If the spectra are overlapping, we cannot use only one
wavelength (x) to determine the concentration of compound I. This is a direct
consequence of the fact that the contribution from the compound II spectrum is
unknown in the calibration. The univariate calibration approach therefore fails
completely, without the data analyst knowing – not a happy thought!
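This failure mode is easy to simulate. The following hedged toy example uses synthetic Gaussian "peaks" rather than real spectra: an unknown interferent II overlaps the single wavelength used for the univariate calibration, and the prediction error for compound I never disappears, no matter how carefully that one wavelength is chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
wl = np.linspace(0, 100, 200)                    # "wavelength" axis
peak_I  = np.exp(-(wl - 45) ** 2 / 200)          # spectrum of compound I
peak_II = np.exp(-(wl - 55) ** 2 / 200)          # strongly overlapping interferent II

c_I  = rng.uniform(0, 1, 20)                     # concentrations (unknown in practice)
c_II = rng.uniform(0, 1, 20)
spectra = np.outer(c_I, peak_I) + np.outer(c_II, peak_II)

j = np.argmax(peak_I)                            # single wavelength at the peak of I
slope, offset = np.polyfit(spectra[:, j], c_I, 1)   # univariate calibration
pred = slope * spectra[:, j] + offset
print("univariate RMSE for compound I:", np.sqrt(np.mean((pred - c_I) ** 2)))
# the error is dominated by the unknown contribution from II at wl[j];
# a multivariate model using the whole spectrum can separate I from II
```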

6.6.2 Multiple Linear Regression, MLR


Multiple linear regression, MLR, is the classical method that combines a set of
several X-variables in a linear combination, which correlates as closely as possible
with the corresponding single y-vector; see Figure 6.5.

Figure 6.5 - MLR: regressing one Y-variable on a set of X-variables
[Schematic: the X-matrix is regressed directly onto the single y-vector.]

In MLR a direct regression is performed between Y and the X-matrix. Here we
will only look at one column vector y, i.e. we shall work with one y-variable for
simplicity, but the method can readily be extended to a whole Y-matrix. In the
latter case one can make independent MLR-models, one for each y-variable, based
on the same X-matrix.

We start with the following MLR model equation:


Equation 6.2    $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + f$

This can be compressed into the convenient matrix form:


Equation 6.3 y = Xb + f

We wish to find the vector of regression coefficients b so that f, the error term, is
the smallest possible. To do this one uses the least squares criterion on the squared
error terms: find b so that fᵀf is minimized. This leads to the following well-known
statistical estimate of b:

Equation 6.4    $\hat{\mathbf{b}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$
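As a hedged numerical illustration of Equation 6.4 (on simulated, well-conditioned data), and of the least-squares solver most libraries use instead of forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))                     # 30 samples, 4 X-variables
b_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ b_true + 0.1 * rng.normal(size=30)       # y = Xb + f

b_hat = np.linalg.inv(X.T @ X) @ X.T @ y         # Equation 6.4, literally
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically safer equivalent
print(np.round(b_hat, 2), np.round(b_lstsq, 2))  # both recover b_true closely
```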

As is well known, estimating b involves inversion of the matrix (XᵀX) - and this may
cause severe problems with MLR. If there are collinearities in X, i.e. if the
X-variables correlate with each other, matrix inversion may become increasingly
difficult and in severe cases may not be possible at all. The (XᵀX)⁻¹ inversion will
become increasingly unstable (it will in fact increasingly correspond to “dividing
by zero”). With intermediate to strong correlations in X, the probability of this ill-
behaved collinearity is overwhelming and MLR will in the end not work. To avoid
this numerical instability it is standard statistical practice to delete variables in X so
as to make X become of full rank. At best this may mean throwing away
information. To make things worse, it is definitely not easy to choose which
variables should go and which should stay. In the worst case we may be unable to
cope with the collinearities at all and have to give up.

MLR may fail when there is:


• collinearity in X
• noise, errors in X
• more variables than samples
• interference between variables in X

Furthermore, MLR-solutions are not as easy to interpret as the bilinear projection
models. The differences between the methods will be further discussed in section
18.1.

Again, these “explanations” are only a first non-statistical introduction into matters
that of course also should be studied in their proper mathematical and statistical
context, chapter 18.


6.7 Collinearity
Collinearity means that the X-variables are intercorrelated to a non-negligible
degree, i.e. that the X-variables are linearly dependent to some degree; for example
X1 = f(X2, X3, ..., Xp).

If there is a high collinearity between X1 and X2 (see Figure 6.6), the variation
along the solid line in the X1/X2 plane is very much larger than across this line. It
will then be difficult to estimate precisely how Y varies in this latter direction of
small variation in X. If this minor direction in X is important for the accurate
prediction of Y, then collinearity represents a serious problem for the regression
modeling. The MLR-solution is graphically represented by a plane through the data
points in the X1/X2/y-space. In fact the MLR-model can directly be depicted as a
plane optimally least square fitted to all data points. This plane will easily be
subjected to a tilt at even the smallest change in X, e.g. due to an error in the
X-measurements, and thus become unstable, and thereby more or less unsuited for
Y-prediction purposes.

In such a case one usually tries to pick out a few variables that do not covary (or
which correlate the least), and use the information in a combination of these. This
is the idea behind the so-called stepwise regression methods, and this may
sometimes work well in some applications but certainly not in all. Also note that
we have to follow the demands of a particular calculation method; this is surely
something all true data analysts dislike! There are in general many problems in
relation to step-wise methods, for which we may refer to Høskuldsson (1996).

However, if the minor, “transverse” directions in X are more or less irrelevant for
the prediction of Y (which may be the case also in spectroscopy), this collinearity
is not a problem anymore, provided that a method other than MLR is chosen.
Bilinear projection methods, the chemometric approaches chosen in this book,
actually utilize the collinearity feature constructively, and choose a solution
coinciding with the variation along the solid line. This type of solution is thus
stable with respect to collinearity.
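The instability described above can be provoked in a few lines; a hedged simulation in which X2 is nearly a copy of X1, so that a tiny "measurement error" in X is enough to tip the fitted MLR plane:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
x2 = x1 + 1e-4 * rng.normal(size=50)            # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=50)

for trial in range(3):
    Xp = X + 1e-4 * rng.normal(size=X.shape)    # tiny measurement error in X
    b = np.linalg.solve(Xp.T @ Xp, Xp.T @ y)
    print(np.round(b, 1))   # b1 + b2 stays near 1, but the individual
                            # coefficients are huge and change from trial to trial
```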


Figure 6.6 - Collinearity in X-space – leading to unstable y-prediction

6.8 PCR - Principal Component Regression


PCA Scores in MLR
The score vectors, t, in PCA are clearly orthogonal to each other, no matter how
many you care to calculate. Suppose you did a full PCA and used all these PC
components, in the form of score vectors, directly in MLR? Since you use all the
PCs, there would be no loss of X-information. On the other hand we would have
created just the type of orthogonal (independent) variables which MLR has been
designed to handle in an optimal fashion.

PCR can therefore be thought of as a two-step procedure: first a PCA is used to
transform X. The resulting T-matrix is then plugged directly into the MLR model
(Equation 6.3), now giving

Equation 6.5 y = Tb + f

instead of y = Xb + f. This “MLR”, now called PCR for obvious reasons, would
thus be stable. But not only that - by using the advantages of PCA, we also get
additional benefits, in the form of scores and loadings and variances etc. which can
be interpreted with ease.


Do we Need all the Possible PCs?


No, you don’t need to use a full PC-decomposition in the regression step. In fact it
would be better if we used fewer components, since the later components generally
correspond to noise. There is no a priori reason that using all the available
variations in X will necessarily create an optimal model for prediction of y. In fact
there may easily be structured elements in X that have nothing to do with y at all,
i.e. which are uncorrelated to y. These variance elements should be omitted or else
they will disturb the optimal regression modeling of Y.

Figure 6.7 - PCR: Using the PCA scores (T) in MLR instead of X
[Schematic: X is first decomposed into the score matrix T, which is then regressed against Y.]

How do we solve this selection problem? Perhaps it comes as no surprise: we will
of course use prediction validation to determine the correct number of PCs to
include in the regression model. The procedure is to continuously increase the
number of PCs in the regression one by one (i.e. increasing the dimension of T in
Equation 6.5). Then we use the resulting model (the resulting vector b, for the
augmented number of PC-components) in the prediction, thereby arriving at a
proper validation variance plot, in principle identical to Figure 6.3 above. Straight
from the book: we let the validation results determine the optimal number of PCs to
use!
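A hedged scikit-learn sketch of exactly this procedure (PCA scores fed into an ordinary regression, with cross-validation standing in for the validation step; the arrays X and y are assumed to be loaded from your own calibration set):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rmsep = []
for a in range(1, 7):                                   # try 1 to 6 components
    pcr = make_pipeline(StandardScaler(),
                        PCA(n_components=a),
                        LinearRegression())
    mse = -cross_val_score(pcr, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    rmsep.append(np.sqrt(mse))

print("validated RMSEP for 1-6 PCs:", np.round(rmsep, 3))
# pick the number of PCs at the (first local) minimum, as discussed above
```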

This validated number will in general differ from the optimal number of PCs found
from an isolated PCA of X without regard to y. This is because we now let the
prediction ability determine the number of components, not the PCA modeling fit
with respect to X alone. This is the first time we meet with this very important
distinction between the alternative statistical criteria, modeling fit optimization vs.
prediction error minimization, but it will certainly not be the last we see of this in
chemometrics; see Høskuldsson (1996) for a comprehensive overview.


Because we have earlier built up the relevant competencies regarding PCA, MLR
and validation in a planned stepwise manner, it has now been possible to introduce
all the essentials of PCR in the three small sections 6.6 to 6.8. And now it is time
for an exercise!

6.8.1 Exercise - Interpretation of Jam (PCR)


Purpose
In this exercise you will make a PCR model, i.e. use a combination of PCA and
MLR. In PCR you work with the PCA scores instead of the original data set in
making the MLR model. The goal in this exercise is to learn how to set up the
model, how to interpret it and how to evaluate the model fit and prediction ability,
i.e. how to carry out the validation.

Problem
The data set you will be using in this and the next two exercises is about jam
quality, or rather about how to assess jam quality. We want to quantify the (human)
sensory quality of raspberry jam, especially to determine which parameters are
relevant to the perceived quality and to try to replace costly sensory or preference
measurements (laboratories full of trained, expensive taste assessors, etc.) with
(much) cheaper instrumental methods. This is a highly realistic multivariate
calibration context, in fact the data come directly from a real-world industry project
from the former Norwegian Food Science Research Institute (now known under the
acronym “MATFORSK”).

The data set consists of the instrumental variables (X), as well as two types of
subjective variables (Y), which need not however lead to unnecessary confusion if
reflected upon carefully. Basically one may carry out two alternative calibrations
for these two alternative Y-data sets, both performed on the basis of the one-and-
the-same X data set.

Data
The analysis will be based on 12 samples of jam, selected to span normal quality
variations. The way the problem specification was originally formulated is given
below (so that the data set is not entirely “served up” perfectly for your data
analyzing pleasure, but you have to “get under the skin” of this particular problem
personally). This is so that you fully understand the organization of this slightly
complex data analysis exercise.

Objects = Samples, annotated with agronomic production variables.


The jam samples are made from raspberries picked in four different places (C1, C2,
C3 and C4) and harvested at three different times (H1, H2 and H3). This is listed in
Table 6.1.

Sensory variables (Y1)


Trained sensory (taste) panelists have judged 12 different sensory variables, using a
1-9 point intensity scale. This Y-variable set is called Sensory. The sensory
variables are listed in Table 6.2.

Table 6.1 - Jam objects

No  Name   Place  Harvest time
1   C1_H1  1      1
2   C2_H1  2      1
3   C3_H1  3      1
4   C4_H1  4      1
5   C1_H2  1      2
6   C2_H2  2      2
7   C3_H2  3      2
8   C4_H2  4      2
9   C1_H3  1      3
10  C2_H3  2      3
11  C3_H3  3      3
12  C4_H3  4      3

Table 6.2 - Sensory variables (Y1)

No  Name      Type
1   Red       Redness
2   Color     Color intensity
3   Shiny     Shininess
4   Smell     Raspberry smell
5   Flavor    Raspberry flavor
6   Sweet     Sweetness
7   Sour      Sourness
8   Bitter    Bitterness
9   Off-flav  Off-flavor
10  Juicy     Juiciness
11  Viscos    Viscosity/thickness
12  Chewing   Chewing resistance

Chemical/instrumental variables (X)


Naturally, we have also measured 6 pertinent chemical and instrumental variables;
see Table 6.3. This variable set is called Instrumental.


Table 6.3 - Instrumental & chemical variables (X)

No  Name     Method
1   L        Spectrophotometric color measurement
2   a        Same as above
3   b        Same as above
4   Absorb   Absorbency
5   Soluble  Soluble solids (%)
6   acidity  Titrable acidity (%)

Were these the only data at our disposal, we could easily set up the appropriate
multivariate calibration formalism: X ➜ Y,
in this case: X(instrumental) ➜ Y(sensory)

Consumer preference variables (Y2)


In addition to this, 114 representative consumers have also tasted the same 12 jam
samples and given them their preference scores on a scale from 1-9. The data used
here are the mean values (over 114 preferences) for each sample/object. This
variable set is called Preference. There is here an important difference between
sensory red (how red it is) and redness preference (how redness is liked).

An alternative calibration scheme would of course now be:


X(inst.) ➜ Y(Preference)

However - just to make matters really interesting – the context of the problem
would also allow for an exploratory multivariate calibration between the two sets of
alternative Y- profiling data, i.e. Y1 versus Y2. Clearly the most expensive
profiling of jam quality is the one involving a large number of consumers (114 in
this case). Were these to be replaced by the taste panelist data (Sensory), this could
result in serious cost reductions. The appropriate calibration would correspond to a
very special X ➜ Y setup, namely one between the two Y-data sets:
Y1(Sensory) ➜ Y2(Preference)
We shall use this data set extensively also later on, so we don’t exhaust all the
above calibration combinations yet.

How to do it
1. Study the data
All three variable sets (X,Y1,Y2) are stored on The Unscrambler file JAM.


Note that the agronomic production variables are not used as quantitative variables
in any of the matrices, but they are exclusively “known external information”, and
will thus be very valuable as object annotations when interpreting the results of the
data analysis. This information has been coded into the names of the samples –
comparable to Figure 3.15 in chapter 3, for example.

2. Make the model


Use Task-Regression to make a PCR model with the calibration specifications
shown below. Autoscale the data. Sensory variables are in the same units, but in
regression modeling we have no prejudices about important variables.
Standardization makes all the X-variables equally important.

Regression method: PCR        Validation: Leverage correction
Samples: Training             Number of PCs: 6
X-variables: Sensory          Weights: 1/SDev
Y-variables: Preference       Weights: 1/SDev

3. Study the model overview


When the model has been calculated you get the model overview as usual. Select
Close and give the model a name (Jam1). Go to Results - Regression and mark
the model. In the file information field you see that the program suggests the
optimal model with 5 PCs, but we must choose the one with the first local
minimum variance to avoid overfitting.

Go to View-Toolbars-Source to get a toolbar of the variances, which you can
toggle between:
• Calibration and/or Validation variance,
• X and/or Y variables,
• Explained and/or Residual variance.

The prediction error (Y-residual variance) has a local minimum after 3 PCs.
According to what we know about the particular problem and the data only two
factors varied (growth location and harvesting time), so 2-3 PCs would not at all be
unreasonable from a data analytical point of view.

The calibration variances in X and Y show how well the data have been modeled in
X and Y respectively. You see that one PC describes 43% of the X-data, but
completely fails to model Y - the explained variance is only 1%. With two PCs,
however, we explain 71.6% of X and 58% of Y. Apparently two PCR-components
manage to describe X better than Y.

The validation variances in X and Y are based on the testing (this time using
leverage correction). The residual validation variance of Y is an expression of the
prediction error - what we can expect when predicting new data. The validation
variance is usually higher than the corresponding calibration variance; more on this
later. The error in Y increases with only one PC, but then decreases again.

4. Variance plot
Look at the residual variance plot for variable Y (preference). Add the calibration
variance by using the toolbar buttons. Approximately how much of the variance of
Y is explained by 2 and 3 PCs? Why is the calibration variance lower? Do you
think it is wise to use 4 or 5 PCs instead of 3?

5. Score plot
Using the menu Plot-Scores, take a look at the score plot using the two first PCs.
Also plot PC1 vs. PC3. You may also try a 3D Scatter score plot of PC1 vs. PC2 vs.
PC3. Try Edit - Options – Vertical Line and – Name. Can you see specific
patterns? What does PC1 model? And PC2?

6. Loading plot
Look at the loading plot for PC1 vs. PC2 and PC1 vs. PC3. Notice that the
Y-loading PREFEREN is also plotted.

Which variables describe the jam quality best? Which sensory variables correlate
most with Preference? Why does Preference have a small loading value in PC1?
Which variable is negatively correlated with both Raspberry Smell and Raspberry
Flavor? Is Sweetness an important variable?

The scores and loading plots are the same as if you had used PCA. In fact that is
what you have done: first the PCA was calculated, and then the MLR regression on
the scores was invoked.

7. Study scores and loadings together


Go to Plot - Scores and Loadings - Four Plots. Select PC1 and PC2 for plots 1 and 3,
PC2 and PC3 for plots 2 and 4, and “X and Y Loadings”. Do samples harvested early or late
have a characteristic taste of raspberry? Which samples are most preferred by
consumers? Why?


8. Compare predicted and measured Y-values


Select Plot - Predicted vs. Measured – Samples: Calibration and plot using 1, 2,
and 3 PCs into three different windows. You should see the samples forming a line
diagonally across the plot. Turn on the regression line and the statistics from the
View - Trend Lines and View - Plot - Statistics menus. An ideal model has a
slope close to 1, a correlation close to 1 and an offset close to 0. Why is it optimal
that the points lie close to a line with slope 1 and offset 0?
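
The slope, offset and correlation referred to here are easy to compute yourself once the predicted and measured values are available; the snippet below is a hypothetical illustration (the function name is ours), not an Unscrambler feature.

```python
import numpy as np

def pred_vs_meas_stats(y_pred, y_meas):
    """Statistics of the Predicted vs. Measured plot: an ideal model has
    slope close to 1, offset close to 0 and correlation close to 1."""
    slope, offset = np.polyfit(y_meas, y_pred, 1)   # regress predicted on measured
    r = np.corrcoef(y_meas, y_pred)[0, 1]
    return slope, offset, r
```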

Summary
You have learned to make a PCR-model, and to decide how many PCs to use by
one specific validation method. Later in this training package you will be directed
to re-do this exercise using more appropriate validation methods.

Your PCR model has its optimum solution at 3 PCs, not the 5 PCs suggested by the
leverage corrected procedure; already we are becoming adept at taking the controls,
based on our understanding of multivariate data analysis.

Here we also for the first time looked at one singularly important way to determine
how good the model is, by using the Predicted vs. Measured plot and its
associated prediction assessment statistics. Some specific jam-related
interpretations include the following.

PCR calculates scores and loadings just as PCA does. Note the structure related to
harvesting time. There is a group for harvesting time 1 (H1) in the third quadrant,
starting to spread for harvesting time 2, and becoming widely spread at harvesting
time 3, where it hardly forms a group at all. This indicates that the quality of jam
made from different berries picked early in the season varies less than later in the
season. There is also some structure related to the harvesting sites, denoted with C1
to C4, in PC1.

The taste variables describe the jam samples best (43% of the variations along
PC1), but color and consistency are not much worse (28% along PC2). The color
variables correlate most with Preference, both in the 2nd and 3rd PC. Preference
was not modeled at all in PC1 (the explained variance was 1%). Therefore
Preference only has a very small loading value in PC1. Off Flavor is negatively
correlated with the smell and flavor of raspberries, both in PC1 and PC2 (you can
draw a straight line between them through the origin). Sweetness does not look
important if you just look at (PC1, PC2): it is close to the origin. But if you study
(PC1, PC3) you will notice that sweetness is the most important variable along
PC3.
The late-harvested samples picked at sites 3 and 4 have the most characteristic
raspberry taste, but jams based on berries from sites 1 and 2 are preferred by most
consumers because of their intense color.

Note that all three PCs should be studied, since Preference needs 3 PCs to be
adequately modeled!

6.8.2 Weaknesses of PCR


PCR is a powerful weapon against collinear X-data, and it combines two of the
most studied and most used multivariate methods, MLR and PCA. Despite this, it
has been widely claimed that PCR is not necessarily the final solution.

Observe how PCR is a distinct two-stage process: first a PCA is carried out on X,
then we use the derived T-matrix as input for the MLR stage, usually in a truncated
fashion: we only use the A “largest” components, as determined by an appropriate
validation. There is no objection to this as long as we use enough PCR-components,
but we do not want to use too “many” components either - then the whole idea of
projection compression is lost.
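
For readers who like to see the two stages written out, here is a minimal NumPy sketch of such a truncated PCR (PCA via SVD, then MLR on the first A score vectors). It assumes X and y are already centered/scaled; it is an illustration only, not the code used by The Unscrambler.

```python
import numpy as np

def pcr_fit(X, y, A):
    """Two-stage PCR: (1) PCA of X, (2) MLR of y on the first A scores."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :A] * s[:A]                       # scores of the A largest components
    P = Vt[:A].T                               # corresponding X-loadings
    b = np.linalg.lstsq(T, y, rcond=None)[0]   # MLR stage on the truncated scores
    return P, b

# Prediction sketch for new (centered/scaled) samples: y_hat = (Xnew @ P) @ b
```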

In PCR there is one cardinal aspect which is still not optimized, no matter what:
There is no guarantee that the separate PC-decomposition of the X-matrix
necessarily produces exactly what we want - only the structure which is correlated
to the y-variable. There is no built-in certainty that the first A (“large”) principal
components contain only that information which is correlated to the particular
Y-variable of interest. There may very well be other variance components
(variations) present in these A components. Worse still, there may also remain
y-correlated variance proportions in the higher order PCs that never get into the
PC-regression stage, simply because the magnitudes of other X-structure parts
(which are irrelevant in an optimal (X,Y)-regression sense) dominate. What to do?
– PLS-R is the answer!


6.9 PLS-Regression (PLS-R)


6.9.1 PLS - A Powerful Alternative to PCR
It is possible to obtain the same prediction results, but based on a smaller number
of components, by allowing the y-data structure to intervene directly in the
X-decomposition. This is done by condensing the two-stage PCR process into just one:
PLS-R (Partial Least Squares Regression). Usually the term used is just PLS,
which has also been interpreted to signify Projection to Latent Structures.

PLS has seen an unparalleled application success, both in chemometrics and other
fields. Amongst other features, the PLS approach gives superior interpretation
possibilities, which can best be explained and illustrated by examples. PLS claims
to do the same job as PCR, only with fewer bilinear components.

6.9.2 PLS (X,Y): Initial Comparison with PCA(X), PCA(Y)
In order to facilitate a clear comparison between PCR and PLS, let us first focus on
the regression part. PLS uses the y-data structure, the y-variance, directly as a
guiding hand in decomposing the X-matrix, so that the outcome constitutes an
optimal regression, precisely in the strict prediction validation sense introduced
above. It is no easy matter to explain exactly how this is accomplished without
getting into a higher-level statistical exposition, or by covering the details of the
PLS-algorithm. This latter we shall do shortly, but first we give a simple
introduction to PLS by comparing it with PCA and PCR.

Let us follow the geometrical approach and picture PLS in the same way that we
introduced PCR. Frame 6.1 presents a simplified overview of PLS, or rather the
matrices and vectors involved. And already some help will be at hand from the
earlier PCA algorithm accomplishments, e.g. the specific meaning of the t- and
p-vectors depicted.

A very first approximation to an understanding of how the PLS-approach works
(though not entirely correct) is tentatively and simply to view it as two
simultaneous PCA-analyses, PCA of X and PCA of Y. The equivalent PCA
equations are presented at the bottom of Frame 6.1. Note how the score and loading
complements in X are called T and P respectively (X also has an alternative
W-loading in addition to the familiar P-loading, see further below), while these are
called U and Q respectively for the Y-space.

Note that we are treating the general case of several Y-variables (q) here. This is
not a coincidence. For the uninitiated reader it is easier to be presented with the general
PLS-regression concepts in this fully developed scenario (PLS2) than the opposite
(PLS1); this strategy is almost exclusive to this present book. Most other textbooks
on the subject have chosen to start out with PLS1 and later to generalize to PLS2.
We have found that the general PLS-concepts are far more easily related to both
PCA as well as PCR beginning with PLS2. The case of one y-variable (PLS1) will
later be considered as but a simple boundary case of this more general situation.

However PLS does not really perform two independent PCA-analyses on the two
spaces. On the contrary, PLS actively connects the X- and Y-spaces by specifying
the u-score vector(s) to act as the starting points for (actually instead of) the t-score
vectors in the X-space decomposition. Thus the starting proxy-t1 is actually u1 in
the PLS-R method, thereby letting the Y-data structure directly guide the otherwise
much more “PCA-like” decomposition of X. u1 is subsequently substituted by
t1 at the relevant stage in the PLS-algorithm in which the Y-space is decomposed.

The crucial point is that it is the u1 (reflecting the Y-space structure) that first
influences the X-decomposition leading to calculation of the X-loadings, but these
are now termed ”w” (for “loading-weights”). Then the X-space t-vectors are
calculated, formally in a “standard” PCA fashion, but necessarily based on this
newly calculated w-vector. This t-vector is now immediately used as the starting
proxy-u1-vector, i.e. instead of u1, as described above only symmetrically with the
X- and the Y-space interchanged. By this means, the X-data structure also
influences the ”PCA (Y)-like” decomposition. This is sufficient for a first overview
comparison of the PLS-approach.


Frame 6.1 - Partial Least Squares Regression: schematic overview

[Schematic: the X-block is decomposed into scores T, loadings P and loading
weights W; the Y-block is decomposed into scores U and loadings Q.]

X = Σ (over A components) T·Pᵀ + E
Y = Σ (over A components) U·Qᵀ + F

The PLS-algorithm, which was merely sketched above, is specifically designed
around these interdependent u1⇒t1 and t1⇒u1 substitutions in an iterative way
until convergence. At convergence, a final set of (t,w) and corresponding (u,q)
vectors have been calculated for the current PLS-component (f), for the X-space
and the Y-space respectively (there are really only a few minor matters remaining
in the full PLS-algorithm, soon to be detailed below).

Thus, what might at first sight appear as two sets of independent PCA
decompositions is in fact based on these interchanged score vectors. In this way we
have achieved the goal of modeling the X- and Y-space interdependently. By
balancing both the X- and Y-information, PLS actively reduces the influence of
large X-variations which do not correlate with Y, and thereby removes the
two-stage weakness of PCR described above.

The PLS2 NIPALS algorithm will now be outlined (with reference to section 3.14
above).

6.9.3 PLS2 – NIPALS Algorithm


We do not here go into any particular numerical issues; it is sufficient to appreciate
the specific projection/regression characteristics of the PLS2 NIPALS.


0. Center and scale both the X and Y-matrices appropriately (if necessary)
   Index initialization, f: f = 1; Xf = X; Yf = Y

1. For uf choose any column in Y (initial proxy-u vector)
2. wf = Xfᵀuf / |Xfᵀuf|   (w is normalized)
3. tf = Xf wf
4. qf = Yfᵀtf / |Yfᵀtf|   (q is normalized)
5. uf = Yf qf
6. Convergence: if |tf,new − tf,old| < convergence limit, proceed to step 7; else go to
   step 2 (uf may be used alternatively)
7. pf = Xfᵀtf / tfᵀtf
8. bf = ufᵀtf / tfᵀtf   (PLS inner relation)
9. Xf+1 = Xf − tf pfᵀ;   Yf+1 = Yf − bf tf qfᵀ
10. f = f + 1
Repeat 1 through 10 until f = A (optimum number of PLS-components by
validation)
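
To make the ten steps concrete, here is a compact NumPy sketch of the same procedure (our own illustration; the variable names and convergence tolerance are choices made for readability, and X and Y are assumed already centered/scaled as in step 0).

```python
import numpy as np

def pls2_nipals(X, Y, A, tol=1e-10, max_iter=500):
    """PLS2 NIPALS sketch: returns scores (T, U), loading weights W,
    loadings (P, Q) and the inner-relation coefficients b."""
    Xf, Yf = X.copy(), Y.copy()
    n, p = X.shape
    q = Y.shape[1]
    T, U = np.zeros((n, A)), np.zeros((n, A))
    W, P, Q = np.zeros((p, A)), np.zeros((p, A)), np.zeros((q, A))
    b = np.zeros(A)
    for f in range(A):
        u = Yf[:, np.argmax(np.linalg.norm(Yf, axis=0))]   # step 1: largest Y-column
        t_old = np.zeros(n)
        for _ in range(max_iter):
            w = Xf.T @ u; w /= np.linalg.norm(w)            # step 2
            t = Xf @ w                                       # step 3
            qv = Yf.T @ t; qv /= np.linalg.norm(qv)          # step 4
            u = Yf @ qv                                      # step 5
            if np.linalg.norm(t - t_old) < tol:              # step 6: convergence
                break
            t_old = t
        pv = Xf.T @ t / (t @ t)                              # step 7
        bf = (u @ t) / (t @ t)                               # step 8: inner relation
        Xf = Xf - np.outer(t, pv)                            # step 9: deflation
        Yf = Yf - bf * np.outer(t, qv)
        T[:, f], U[:, f], W[:, f], P[:, f], Q[:, f], b[f] = t, u, w, pv, qv, bf
    return T, U, W, P, Q, b
```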

Explanation of the NIPALS algorithm for PLS2:

1. It is necessary to start the algorithm with a proxy u-vector. Any column vector
of Y will do, but it is advantageous to choose the largest column, max |Yi|.

2. Calculation of loading weight vector, wf, for iteration no. f.

3. Calculation of score vector, tf, for iteration no. f.

4. Calculation of loading vector, qf, for iteration no. f.

5. Calculation of score vector, uf, for iteration no. f.

Steps 3 and 5 represent projection of the object vectors down onto the fth PLS-
component in the X- and Y- variable spaces respectively. By analogy one may view
steps 2 and 4 as the symmetric operations projecting the variable vectors, w and q,
onto the corresponding fth PLS-component in the corresponding object spaces. We
also note that these projections all correspond to the regression formalism for
calculating regression coefficients. Thus the PLS-NIPALS algorithm has also been
described as a set of four interdependent, “criss-cross” X-/Y-space regressions.


Note that in PLS the “loading weights vector”, w, is the appropriate representative
for the PLS-component directions in X-space. Without going into all pertinent
details, one particular central understanding is that of the w-vector as representing
the direction which simultaneously maximizes both the X-variance and the
Y-variance in the conventional least-squares sense. Another way of expressing this
is to note that - after convergence - w reflects the direction which maximizes the
(t,u)-covariance, or correlation (auto-scaled data) between the two spaces,
Høskuldsson (1996).

6. Convergence. The PLS-NIPALS algorithm actually very often converges to a
stable solution in fewer iterations than in the equivalent PCA situation, because
the correlated data structures in the two spaces interdependently support each
other. At convergence, for the stable solution, the PLS optimization criterion is
composed of the product of a modeling optimization term and a prediction
error minimization term; the combined criterion is known as the H-principle, ibid.
7. Calculation of p-loadings for the X-space. These are mainly needed for the
subsequent updating, although some statisticians prefer also to do the X-space
interpretations based on these p-loadings (as opposed to the w-loading weights).

8. Calculation of the regression coefficient for the inner X-Y space regression.
This so-called “inner relation” of the PLS-model, graphically depicted as the “T
vs. U plot” (TvsU), constitutes the central score plot of PLS, occupying a
similar interpretation role as does the equivalent (t,t)-plot for PCA. There is of
course also a double set of these (t,t)- and (u,u)- score-plots available. It bears
noting that the central inner PLS relation is made up of nothing but a standard
univariate regression of u upon t. This PLS inner relation is literally to be
understood as the operative X-Y link in the PLS-model. It is characteristic that
this link is estimated one dimension at a time (partial modeling), hence the
original PLS acronym: Partial Least Squares regression, whereas a more
modern re-interpretation is often quoted as: Projection to Latent Structures.

9. Updating: Xf+1 = Xf − tf pfᵀ;   Yf+1 = Yf − uf qfᵀ   (or Yf+1 = Yf − bf tf qfᵀ)

This step is often also called deflation: Subtraction of component no. f for both
spaces. This is where the p-loadings come into play. By using the p-vectors
instead of the w-vectors for updating X, the desired orthogonality for the
t-vectors is secured.

10. The PLS model: TPᵀ and UQᵀ is also calculated - and deflated - one
component dimension at a time. After convergence, the rank-one models tf pfᵀ
and uf qfᵀ are subtracted appropriately, the latter expressed as Yf+1 = Yf − bf tf qfᵀ
by inserting the inner relation, so as to allow for appreciation of how Y is
related to the X-scores, t.

PLS1 and PLS2


What is the main difference between PLS and PCR? PLS uses the information in Y
actively to find the Y-relevant structure in X, with w representing the maximum
(t,u)-covariance/correlation. Thus PLS focuses as much on the Y-variance as well
as the X-variance, and we are really most interested in the co-varying relationship
between these two spaces, be this expressed as covariance or as correlation. In
general this results in simpler models (fewer components).

From a method point of view, there are two versions of PLS: PLS1, which models
only one Y-variable, and PLS2, which models several Y-variables simultaneously.

PLS2 gives one set of X- and Y-scores and one set of X- and Y-loadings, which are
valid for all of the Y-variables simultaneously. If instead you make one PLS1
model for each Y-variable, you will get one set of X- and Y-scores and one set of
X- and Y-loadings for each Y-variable. PCR also produces only one set of scores
and loadings for each Y-variable, even if there are several Y-variables. PCR can
only model one Y-variable at a time. Thus PCR and PLS1 are a natural pair to
match and to compare, while PLS2 would appear to be in a class of its own.

From a data analysis point of view the use of PLS2 was for many years thought of
as the epitome of the power of PLS-regression: complete freedom – modeling any
arbitrary number of Y-response variables simultaneously. Gradually however, as
chemometric experiences accumulated, everything pointed to the somewhat
surprising fact that marginally better prediction models were always to be obtained
by using a series of PLS1 models on the pertinent set of Y-variables. The reason for
this is easily enough understood - especially with 20/20 hindsight. Here we will
mostly let the exercises teach you this lesson - better didactics!

PLS-Components and Principal Components


Note that the components in PLS are not principal components but PLS-
components, found in a different way. For simplicity we often still use PCs or
components to denote both principal components and PLS-components. Both types
of components represent the relevant latent dimension of the models. It is usually
obvious from the context which types of components are referred to.


6.9.4 Interpretation of PLS Models


In principle PLS models are interpreted in much the same way as PCA and PCR
models. Plotting the X- and the Y-loadings in the same plot allows you to study the
inter-variable relationships, now also including the relationships between the X-
and Y-variables.

Since PLS focuses on Y, the Y-relevant information is usually expected already in
early components. There are however situations where the variation related to Y is
very subtle, so many components will be necessary to explain enough of Y.
Modeling protein in wheat from NIR-spectra, for example, may require an
exceptional 8-18 components, because the dominating variations in the data may be
related also to grain size, packing, chemistry, etc. By studying how the Y-variance
develops with increasing numbers of components, you may see which components
explain most of the Y-variance. Several examples below will illustrate these
features.

Loadings (p) and Loading weights (w)


All PLS-calibrations result in two sets of X-loadings for the same model. They are
called loadings, P, and loading weights (or just weights or PLS-weights), W.

The P-loadings are very much like the well-known PCA-loadings; they express the
relationships between the raw data matrix X and its scores, T (in PLS these may be
called PLS-scores). You may use and interpret these loadings in the same way as in
PCA or PCR, so long as you remember that the scores have been calculated by
PLS. In many PLS applications P and W are quite similar. This means that the
dominant structures in X “happen” to be directed more or less along the same
directions as those with maximum correlation to Y. In all these cases the difference
is not very interesting - the p and w vectors are pretty much identical. The duality
between P and W will only be important in the situation where the P and the W
directions differ significantly.

The loading weights, W, however represent the effective loadings directly
connected to building the sought-for regression relationship between X and Y.
Vector w1 characterizes the first PLS-component direction in X-space, which is the
direction onto which all the objects are projected. In general, this direction is not
identical to the p1 direction (which can be found by PCA on the X-data alone),
because of PLS’s simultaneous (t,u)-covariance maximization. Therefore the
differences between p1 and w1 may be smaller or larger. It is precisely this


difference between these alternative component directions that tells us how much
the Y-guidance has influenced the decomposition of X; one may think of the PCA
t-score as being tilted because of the PLS-constraint. One illuminating way to
display this relation is to plot both these alternative loadings in the same 1-D
loading plot.

The sequential updating in the PLS-algorithm implies that a similar relationship to
that between p1 and w1 also holds for the higher-order PLS-components 2, 3, ... The
w-vectors that make up the W matrix are all mutually orthogonal. Therefore we
may also inspect and interpret the loading weights in 2-D vector plots, e.g. w1
versus w2, much as we are used to for PCA - always remembering that W relates
directly to the sought-for regression between X and Y. W tells the PLS story with
respect to the inter-variable relationships.

In PLS there is also a set of Y-loadings, Q, which are the regression coefficients
from the Y-variables onto the scores, U. Q and W may be used to interpret
relationships between the X- and Y-variables, and to interpret the patterns in the
score plots related to these loadings. The specific use of these double sets of scores
(T,U) and loadings (W,Q) shall be amply illustrated by the many practical PLS-
analytical examples and exercises to be presented below.

The fact that both P and W are important however, is clear from construction of the
formal regression equation Y = XB from any specific PLS solution with A
components. This B matrix is calculated from:
B = W (PᵀW)⁻¹ Qᵀ

This B-matrix is often used for practical (numerical) prediction purposes, see
section 9.13.
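
As an illustration of how this works in practice, the hypothetical helper below builds B from the W, P and Q matrices of a fitted PLS model (for example the PLS2 sketch earlier) and uses it for prediction; it is not an Unscrambler function.

```python
import numpy as np

def pls_regression_matrix(W, P, Q):
    """B = W (P'W)^-1 Q' for a model with A components.
    W, P are p x A; Q is q x A; the result B is p x q."""
    return W @ np.linalg.inv(P.T @ W) @ Q.T

# Prediction sketch for new, centered/scaled samples:
#   Y_hat = Xnew @ pls_regression_matrix(W, P, Q)
```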

6.9.5 The PLS1 NIPALS Algorithm


Now for the final NIPALS algorithm to be presented: the PLS1 approach. Because
we have dug somewhat into the PLS2 algorithm above, it will be easy both to
understand PLS1’s own specific features and to appreciate its “marginal”
position in comparison to PLS2.

In fact, by noting the matrix-vector substitution, Y vs. y, it will be possible to
follow how the steps in the corresponding PLS2 algorithm that concern the score
u all simply collapse, and the convergence test is also made redundant. The result is
a much simpler, non-iterative calculation procedure. As usual, centering and scaling
come first:

Center and scale both the X and Y-matrices appropriately (if necessary)
Index initialization, f: f = 1; Xf = X; yf = y
The y-vector is its own proxy “u-vector” (there is only one Y-column)

1. wf = Xfᵀyf / |Xfᵀyf|   (w is normalized)
2. tf = Xf wf
3. qf = tfᵀyf / tfᵀtf
4. pf = Xfᵀtf / tfᵀtf
5. Xf+1 = Xf − tf pfᵀ;   yf+1 = yf − qf tf
6. f = f + 1

Repeat 1 through 6 until f = A (optimum number of PLS-components by
validation).
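
The corresponding NumPy sketch is even shorter than the PLS2 one; again this is only an illustration (X and y assumed centered/scaled), not The Unscrambler's implementation.

```python
import numpy as np

def pls1_nipals(X, y, A):
    """Non-iterative PLS1 following steps 1-6 above."""
    Xf, yf = X.copy(), y.copy()
    n, p = X.shape
    T, W, P = np.zeros((n, A)), np.zeros((p, A)), np.zeros((p, A))
    q = np.zeros(A)
    for f in range(A):
        w = Xf.T @ yf; w /= np.linalg.norm(w)   # step 1
        t = Xf @ w                               # step 2
        qf = (t @ yf) / (t @ t)                  # step 3
        pf = Xf.T @ t / (t @ t)                  # step 4
        Xf = Xf - np.outer(t, pf)                # step 5: deflation of X ...
        yf = yf - qf * t                         # ... and of y
        T[:, f], W[:, f], P[:, f], q[f] = t, w, pf, qf
    return T, W, P, q
```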

The PLS1 algorithm – and procedure – is as simple as this. There are no other bells
or whistles. Because of its computational simplicity PLS1 is very easy to perform,
but there are other, more important - indeed salient - reasons why PLS1 has become
the single most important multivariate regression method. We will demonstrate
these reasons by using examples.

6.9.6 Exercise - Interpretation of PLS1 (Jam)


Purpose
In this exercise you will do almost the same as in the previous PCR exercise, but by
using PLS1. For the moment we will however still allow the use of the inferior
leverage corrected validation approach – again for demonstration purposes: one
needs to know the problems which exist with the methods, before one is able to
take the appropriate countermeasures.


Data set
The data set is the same as in the previous exercise. Again we start the analysis
with the variable sets (Y1) Sensory, used as X, and (Y2) Preference, as the proper Y, in
the file JAM.

Tasks
Make an identical PLS model, find the optimal number of PCs, and investigate how
good the model is. Also interpret the relevant scores and loadings. One other reason
to use leverage corrected validation here is that this allows direct comparison with
the earlier PCR model. – As a later exercise, you will benefit greatly from
duplicating this very same PCR/PLS1 comparison, using a common cross-
validation.

How to do it
1. Go to the Task-Regression menu and change calibration method to PLS1. The
other parameters are unchanged. Give the model another name, for example
Jam2. Calibrate for 6 PCs.

2. Study the Variance plot and find the optimal number of PCs from the validated
Y-variance. Why do we now only need 1 PC to explain about 90% of the
Preference? What does it mean that 27% of the variation in X explains 91% of
the variation in Y in a 1-PLS-component model?

Plot Predicted vs. Measured using your choice of PCs. Also see how the results
change with 1, 2 and 3 PCs. Are the results significantly better with more PCs? Or
are they actually worse?

3. Study the 2D Scatter Loading plot. Which variables are positively or negatively
correlated with Preference? How much of the total explained X-variance do
they contribute to? Hint: Study the figures below the plot.

Summary
In this exercise you have made a PLS1-model, and used the Predicted vs. Measured
plot to get an idea of how good the model is. Two PCs are probably optimal.
Adding more PCs always implies a risk of overfitting, so we would like to play it
safe. Also as this was a leverage corrected validation, great caution against possible
overfitting is needed. Only 27% of the sensory variables are needed to predict 91%
of the Preference. We thus have evidence of quite a lot of redundant/irrelevant
information in X, with respect to Y. This is a demonstration of the power of PLS1.

The loading plot shows that color, sweetness and thickness correlate most with
Preference. The more intense the color and the sweeter the jam, the more the jam is
liked, while the thicker jams are less liked. All variables with small loading values
in PC1 are unimportant for determining the preference.

Since PLS focuses on Y, this method will immediately look for the Y-relevant
structure in X. Therefore it will give a lower residual Y-variance with fewer PCs.

6.9.7 Exercise - Interpretation PLS2 (Jam)


Purpose
PLS2 differs from PLS1 by allowing several variables in the Y-matrix
simultaneously. In this exercise you will learn how to set up and interpret a PLS2
model.

Problem
Now we compare the instrumental (X) and sensory data (Y1) to find out if the
instrumental and chemical variables give a good enough description of the jam
quality. Would it be possible to predict variations in the quality by using only these
instrumental variables? In that case we might replace costly taste panels with
cheaper instrumental measurements.

Data set
The variable sets Instrumental (X) and Sensory (Y) reside in the file JAM.

Tasks
Make a PLS2 model for prediction of sensory data (Y) from instrumental data (X).
Carry out a complete interpretation.


How to do it
1. Read the data table from the JAM file. Use Instrumental as X-variables and
Sensory as Y-variables. Make a PLS2-model by changing Calibration method to
PLS2. All other parameters should be the same as in the previous exercise.

With PLS2 you also need to think about scaling (weighting) Y. In this exercise it is
natural to standardize both X and Y, so 1/SDev is a suitable weights option.

Name the model Jam3. Calibrate with all variables and e.g. 5 PCs.

The calibration overview, which is displayed when the model is complete, does not
show very promising results. The Total Y-variance is not well explained. However,
this is a measure for all the Y-variables together, so hopefully some of them may be
better described individually. This is a typical PLS2 situation; there is no need to
worry at this stage.

2. Plot the results


Look at the residual validation Y-variance for each Y-variable, by plotting the
residual Y-variances for all Y-variables using Plot - Variances and RMSEP,
Variables: Y, All (and un-tick Total), Samples: Validation. Which variables are
well explained by a model with 2 PCs?

Study the loadings for these first two PCs. Which sensory variables seem to
correlate with which instrumental variables? Can raspberry flavor judgments be
replaced by instrumental measurements?

Study the scores for the first two PCs. Which property is modeled by PC1? How
much of the X-variance is explained by PC1? Hint: see below the score plot.

Study scores and loadings together by using two windows. Which variables are
related to harvesting time?

Look at and analyze the variance and loadings as in previous exercises.

Summary
It seems that the spectrometric color measurements (L, a and b) are strongly
negatively correlated with color intensity and redness. Sweetness is, as expected,
rather strongly negatively correlated with measured Acidity, but the flavor shows


weak correlation to all of the instrumental variables and is not at all well described
by 2 PCs (small loadings).

The variance plot shows that color, redness and thickness are best modeled with 1-
2 PCs. To model the others we need at least 5 PCs, which implies a large risk of
overfitting.

PLS component 1 models harvesting time, which is mainly related to color and
thickness, just as we found in the previous models.

By studying the loading plot in a previous exercise we learned that jam quality
varied with respect to color, flavor and sweetness. The chemical and instrumental
variables mainly predict variations in color and sweetness only (which is also
indicated by the low explained Y-variance). This means that we cannot replace the
Y-variable flavor with the present set of X-variables. Using other instrumental X-
variables, e.g. gas chromatographic data, could possibly have increased the flavor
prediction ability.

6.10 When to Use which Method?


The BIG question in multivariate calibration - to some - is whether to use PCR or
PLS. Many statisticians prefer PCR because it is statistically well studied and
defined, while many applied data analysts and scientists outside statistics find the
PLS-approach easy to understand conceptually and to be preferred because it is
direct and effective. PLS is said to produce results which are easier to interpret,
because they are less complex (using fewer components). Often PCR may give
prediction errors as low as those of PLS, but almost invariably by using more PCs
to do the job. So for many purposes this boils down to a legitimate choice – but
certainly PLS-R seems to be the choice of many chemometricians, precisely
because of the interpretation and the number of components issues.

PLS2 is a natural method to start with when there are many Y-variables. You
quickly get an overview of the basic patterns and see if there is significant
correlation between the Y-variables. PLS2 may actually in a few cases even give
better results if Y is collinear, because it utilises all the available information in Y.
This is a rare situation however. The drawback is that you may need different
numbers of PCs for the different Y-variables, which you must remember at
interpretation and prediction. This is however also the case with PCR, when there
are several Y-variables, since each Y-variable may need more or fewer PCs. There


are also cases where PLS2 fails to model some of the Y-variables well. Then one
should try separate PLS1 models anyway. You will definitely need to interpret each
model separately.

To conclude: Comparing PCR and PLS is an interesting issue, since you will need
to reflect on why the results are different. - PLS will probably give results faster. As
for PLS2 vs. PLS1: PLS2 is always useful for screening with multiple Y-variables,
but one will very often need separate PLS1 models to get the most satisfactory
prediction models. Nearly all individual PLS1 models will be superior, since the X-
decomposition can be optimised with respect to just one Y-variable - as opposed to
just doing an average job for all Y-variables. These PLS1/PLS2 distinctions are
dominantly founded on a large base of chemometric experience.

6.10.1 Exercise - Compare PCR and PLS1 (Jam)
Purpose
To learn how to compare models using the Viewer, and to compare PCR and PLS.

How to do it
1. Use the models you made in exercises 6.8.1 (Jam1) and 6.9.6 (Jam2) to compare
PCR and PLS1.
Close all viewers and data editors. Use the Results menu to plot the model
overviews in two Viewers and compare results:
Results - Regression, mark Jam1, hold Ctrl down and mark Jam2 as well, View –
Window – Tile Horizontally.

2. Variance
Compare the residual Y-variance between the two models.
A PLS1 model with 2 PCs is better than a PCR model with 3 PCs, and only slightly
worse than the PCR model with 5 PCs.


Figure 6.8 Residual variance in the PCR (Jam1) and the PLS (Jam2) models

Note that the PCR model actually displays an increase in the prediction error for the
first PC. In general this is a bad sign (and it is never acceptable in PLS). However,
in PCR you may well accept this, because the first PCR components may very
well be modeling only X-structures which are irrelevant to Y.

Important note: You cannot normally compare different models by using the
prediction variance (validation Y-variance) alone. This can only be done when the
models are based on the same data set and you have used the same Model center
and weighting in each model (and used the same validation procedure). This is the
case here, so we can use this prediction variance as a measure of how good the
alternative models are.

More generally we use the measure RMSEP (Root Mean Square Error of
Prediction) to compare different models. RMSEP gives the errors in the same unit
of measure as used in the variables in the Y-matrix, and is therefore suitable for
general comparison. This will be discussed in detail later.

3. You can also try to plot RMSEP for these two models. You find it under Plot -
Variances and RMSEP (double-click on the miniature screen in the dialog box
so that your plot fills up the whole viewer).

4. Loadings
Study the Loadings (PC1 vs. PC2, Variables: X and Y) in a 2D Scatter plot. The
PLS loading plot is turned clockwise through almost 90º. The most obvious change
is that the Preference variable (the Y-matrix) is explained far better by the first
component with PLS1. This shows that the two methods use the same X-data in a
very different way. PCR (PCA really) extracts the systematic variations in the
X-data independently of Y, while PLS performs the interdependent decomposition
of both the X- and Y-matrices. In the loading plot we clearly see how the Y-data
have influenced the decomposition of X.

How would you go about using the loading-weights for a similar comparison? We
do have the loading weights from the PLS1-solution all right – but what would be
the corresponding item from the PCR-solution?

Figure 6.9 - Loading plots

5. Scores
Study scores for the two models. When you compare the score plot for PCR and
PLS1 you see some of the same general structures in both plots, but they are
actually clearer for PCR. Structures for harvesting time and for harvesting place are
present in both plots.

Figure 6.10 - Score plots


Summary
PLS1 needs fewer PCs to explain the data than PCR and the final model performs
better. PLS1 and PCR utilize the X-data in very different ways.

In general PLS uses fewer PCs to model Y than PCR, which gives a minimum
residual Y-variance earlier. PCR may end up with as low a prediction error as PLS,
but with more PCs.

6.11 Summary
MLR
Multiple linear regression is the most widely used multivariate regression method,
but it has profound weaknesses. There are no diagnostics to tell e.g. whether
interferents are present or not. MLR also has severe problems when the X-data are
collinear. The MLR prediction solution is inherently unstable due to numerical
properties in collinear data sets.

PCR/PLS
PCR and PLS are shown to be strong alternative multivariate techniques, both with
many advantages. Interferents and erroneous measurements (outliers) are easily
detected using the diagnostics inherent in these methods. The different plots
available make it possible to interpret many generic relationships in the data set
both between variables and objects. The approach is highly visual, making the
methods available to a wider range of users than just skilled statisticians. Many
chemometricians have been proponents of the PLS-method over many years, and
many statisticians prefer PCR. Luckily, by proper validation of the relative
prediction model performances, this choice will always be subject to an objective
assessment. There are however still many skirmishes when these two schools meet
and exchange pleasantries....

Because these methods use projection, the collinearity problem is turned around to
a powerful advantage. Collinear data sets may in fact be modeled completely
without difficulty. It is then possible to use e.g. full spectra instead of just a few
selected wavelengths.

The rest about PLS - at least at this introductory level - concerns the many practical
applications where PLS has been found useful. We shall present many examples


and illustrations below, but first the critical, indeed essential, issue of validation must
be presented in its full context. A proper validation understanding is an absolute
must in order to be able to get the maximum out of these immensely powerful
methods, PCR, PLS1 and PLS2.


7. Validation: Mandatory Performance Testing
In this chapter we shall elaborate on the important issue of validation. The most
general statement one can make about the purpose of validation, is that it is a
measure of the performance of a multivariate model, be this related to modeling,
discrimination, classification or prediction. In this book we shall almost exclusively
be concerned with prediction performance.

In this context the purpose of validation is two-fold.

First of all validation is absolutely essential in order to make sure that the model
will work in the future for new, similar data sets, and indeed do this in a
quantitative way. This can be viewed as prediction error estimation.

Secondly, validation is often also used in order to find the optimal dimensionality
of a multivariate model (X,Y), i.e. to avoid either overfitting or underfitting. One
should not get confused by the fact that usually the dimensionality validation has to
be carried out before any prediction validation is put on the agenda.

Both test set and cross validation can be applied to any regression model made by
either MLR, PCR, PLS (PCA models can also be validated but this regards the
modeling performance; see section 7.1.3 on page 159). Test set and cross validation
are equally applicable to augmented regression models like non-linear regression
and neural networks, for example, and are perhaps even more important for
methods which involve estimates of many parameters as these imply even greater
risks of overfitting.

7.1 The Concept of Test Set Validation


The test set validation approach requires access to two data sets, each with known
X- and Y-values. The two data sets should be “similar” with respect to the
sampling conditions, i.e. representative of each other, and generally have the same
quality. But, more importantly, both should be representative of the future situation
in which the calibrated prediction model is to be used. One of the data sets is used


exclusively to calibrate the model; this set is called the calibration set. The other,
the validation set, is expressly used only for the validation.

Figure 7.1 - Data sets present for modeling (cal) and validation (val)

[Schematic: calibration set (Xcal, Ycal) and validation (test) set (Xval, Yval)]

Usually the economics of a multivariate calibration in practice allows only for a
restricted number of samples, so not infrequently the optimum dimensionality
validation is also carried out on the training data set, since, hopefully, the resulting
prediction validation is to be carried out using the test set only. But indeed, in the
ideal case, even the dimensionality validation should also be carried out on a test
set.

It is not fatal if this initial tentative dimensionality does not already give us the
final answer; the validation principles you are about to learn will always allow you
to pin down the correct dimensionality, which is the critical basis needed for the
final prediction validation testing. What matters is that you take responsibility for
getting your thinking about it right, not relying on validation as one ready-to-use-
for-all-situations standard procedure. Most unfortunately, validation is a somewhat
confused issue both within chemometrics and outside.

Therefore, we will spend some time on this crucial issue so that you have a full
understanding of the purposes of validation. These important principles can then be
applied in all situations.

We would like you to think of the following chapters on validation as providing a
safety net of what has been termed: “Principles of Proper Validation”, Esbensen &
Huang (2000).

All validations produce a measure of the prediction error, i.e. the error we can
expect when using the model to predict new objects. There are also other pertinent
measures of the prediction performance of the model, which we shall all introduce
in due order. We often also calculate a complementary measure of the modeling
error, which we will deal with first.

7.1.1 Calculating the Calibration Variance (Modeling Error)
First you make a model based on Xcal and Ycal, i.e. you calibrate the multivariate
model (X,Y). We here assume that we tentatively know the optimal number of
components, A, from earlier results or otherwise (see also section 7.1.3).

Figure 7.2 - Calibrating a multivariate model (X,Y)

[Schematic: Xcal and Ycal are used to calibrate the Model]

Then we feed the Xcal values right back into the model to “predict” ŷcal.
Equation 7.1    Xcal + Model → ŷcal
Comparing the predicted and measured Ycal values gives us an expression of the
modeling error, due to the fact that we have only used A components in the model:

Equation 7.2    Modeling error = ŷcal − ycal

This is calculated for each object. Summing the squared differences and taking
their mean over all n objects gives the calibration residual Y-variance:

Equation 7.3    Residual variance(cal) = Σ (ŷcal − ycal)² / n

The square root of this (divided by the appropriate weights used for scaling at
calibration if necessary) gives us RMSEC, (Root Mean Square Error of
Calibration), the modeling error, expressed in original measuring units.


Equation 7.4    RMSEC = √[ Σ(i=1..n) (ŷi,cal − yi,cal)² / n ]

Clearly RMSEC = 0 only if all potential components are used. For A < min(n,p)
RMSEC is a good measure of the error when only A components are used in the
model. We would like RMSEC to be as small as possible but there is a competing
consideration to be fully exposed immediately.

7.1.2 Calculating the Validation Variance (Prediction Error)
Now we apply the model to the Xval values (the test set) and this time we really
predict the proper ŷval. Note that neither Xval nor Yval of the test set has been
involved in the calibration.

Equation 7.5    Xval + Model → ŷval

Next we compare the predicted and measured Yval values to get an expression of
the prediction error:

Equation 7.6    Prediction error = ŷval − yval

This is calculated for each validation object. Again, by summing the squared
differences and taking their mean over all objects in the test set we get the
validation residual Y-variance:
Equation 7.7    Residual variance(val) = Σ (ŷval − yval)² / n

The square root of this expression (divided by the weights used for scaling in the
calibration) gives us RMSEP, (Root Mean Square Error of Prediction), the
prediction error, again in original units.


Equation 7.8    RMSEP = √[ Σ(i=1..n) (ŷi,val − yi,val)² / n ]
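
Both error measures are straightforward to compute once predicted and measured Y-values are available; the sketch below (our own illustration, with a hypothetical model_predict function) applies Equations 7.4 and 7.8.

```python
import numpy as np

def rmse(y_hat, y):
    """Root mean square error: RMSEC when y_hat/y come from the calibration
    set (Equation 7.4), RMSEP when they come from the test set (Equation 7.8)."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

# Hypothetical usage:
#   rmsec = rmse(model_predict(X_cal), y_cal)
#   rmsep = rmse(model_predict(X_val), y_val)
```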

7.1.3 Studying the Calibration and Validation Variances
In practice, when calibrating a multivariate model, the program usually takes care
of validation at the same time. Both the calibration and validation variance are in
fact calculated completely automatically at each step of the model, i.e. after 1 PC,
after 2 PCs, and so on. Therefore we can study the error variances for models of
different, increasing complexity numerically as well as graphically. Remembering
that a model is accurately defined by, and critically dependent on the number of
components involved, the error variance tells about the model fit and prediction
ability after adding one more component, then one more, etc. It is very convenient
to plot variances as a function of the number of components to include in the model
and use this plot to determine how many PCs to use; see The V-Shape of the
Prediction Error Curve below.

The V-Shape of the Prediction Error Curve


The prediction validation variance as a function of the number of components
(Figure 7.3) is composed of two parts:
1. The modeling error, which always decreases as you add more components, and
2. The error associated with estimating the regression parameters (the statistical
uncertainty error), which always increases as more components are added.
The sum curve of these two opposing trends will therefore generally display a
more or less well defined minimum which corresponds to the optimal number of
components, A.

Be aware, though, that there may be exceptions to this clear “V-rule” for data sets
with non-trivial internal data structures (also influenced by non-deleted outliers,
etc.)
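
As a very small illustration of this "V-rule" (with the caveat just mentioned), one might pick the optimal A from the validated residual Y-variance as sketched below; this naive rule simply takes the first point where the curve starts to rise again, which is reasonable for PLS but, as discussed later, not always for PCR.

```python
def first_minimum(residual_var):
    """residual_var[a] = validated residual Y-variance using a components
    (index 0 = 0 components, i.e. the mean model). Returns the number of
    components at the first local minimum of the curve."""
    for a in range(1, len(residual_var)):
        if residual_var[a] > residual_var[a - 1]:
            return a - 1
    return len(residual_var) - 1
```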


Figure 7.3 - Empirical prediction error – the sum of two parts (modeling &
estimation errors). This is the powerful plot which will always allow you to
determine the optimal number of components in a multivariate model (X,Y)
[Figure: vertical axis: Error of prediction; horizontal axis: Complexity of cal. model.
The Modeling error curve decreases and the Estimation error curve increases, with
Underfitting to the left and Overfitting to the right of the minimum of their sum.]

Why Is an Overfitted Model Bad?


The calibration Y-variance usually decreases all the time (at least in PLS), since the
PCs are found in such a way that the residuals are minimized in each step. However
a “good” model fit does not necessarily mean that the model will be optimal also in
the predictive sense, i.e. for a future new data set. A generally valid model
describes only the systematic variations, not the random variations (noise).
Therefore we have to validate - testing the model on new data - and choose the
model that gives the minimum prediction error. Usually we prefer to use as few
components as possible. If the model describes noise it will fail to predict new
objects with optimal accuracy. The model will be “too detailed”; it will have too
many PCs. This is of course an overfitted model.

The prediction validation Y-variance will usually decrease until a certain point,
after which it generally starts increasing again. In prediction testing this minimum
corresponds to the optimum number of components, and one should never go
beyond this point!

Using the Validation Variance to Detect Problems


The prediction validation variance also indicates potential problems. In PLS any
increase in the validation variance is a bad sign, usually because of the presence of
outliers, noise, non-linearities, etc. On the other hand, in PCR it is relatively
common to observe an increase in the early PCs – when this happens this is
because these then model that part of the X-structure which is not relevant for Y.
Interpreting the prediction variance curve will be discussed in section 11.7.


X-Variances, Calibrated and Validated


We routinely also calculate calibration and validation variances on the X-data
alone; you are already familiar with the calibration variance relationships in
connection with PCA. In calibration, the X-variances tell you about which
X-variables contribute to the model, and which components they influence. In
contrast, it is often necessary to inspect both the X- and Y-variances to characterize
a PCR/PLS-model.

The calibration X-variance is based on a comparison of the measured and projected
X-values of the calibration objects, as was detailed above. Thus one may for
example validate a PCA model by projecting the test objects onto the
A-dimensional principal components. The resulting validated X-variance is then
directly related to the difference between the projected and measured X-values of
the test objects. Cross validation and leverage correction can be used in the same
way here.

Therefore we may also study the validation X-variance in PCA, to check that the
model is not overfitted. The conventional calibration (modeling) X-variance is the
most often used of these two. Until now you have only used leverage correction in
the PCA exercises, but with more knowledge about validation you may also use a
test set or cross validation before you are fully satisfied with your PCA model.

7.2 Requirements for the Test Set


It is often claimed that the test set should be totally unknown to the analyst to really
make the test as significant and valid as possible. This is a serious mistake. If one
does not know anything about the test set, how can one know whether it is
representative of the future prediction situation? If the test set used exactly the
same objects as in the calibration (remember the modeling validation above?), then
the “prediction error” would indeed be small. Yet there has been no prediction
testing whatsoever. Such a “prediction test” is worthless.

There are several possible problems regarding proper test set validation and its
requirements. Perhaps the most important issue is the somewhat surprising notion
that the prediction testing itself is not a completely objective procedure - that it
actually matters, to some degree at least, which validation method one chooses.


Status of the Prediction Error


If the validation samples are representative of all future prediction samples, then
the validated prediction error can be considered a fair estimate of the error level to
be expected in the future. Remember: there is no such thing as a “true” error. The
error is an estimate. In projection methods we make no assumptions as to the
statistical distribution of samples or errors. Therefore we cannot make proper
statistical inference estimates. The only thing we can do is to draw the test set from
a population as similar as possible to the one from which we will carry out
prediction in the future, calculate an average prediction error from this, and expect
similar average results later. There is a heavy emphasis on practical empirical
prediction validation in this context; it is perhaps here that the distinction between
multivariate statistics proper and the present multivariate data analytical approach
is the clearest.

Test Set and Calibration Set – Twins, or Sisters?


The calibration set and the test set must be as similar as possible, both with regard
to population and sampling conditions. They should cover the same ranges and
display the same features. If you made one model on the calibration set and another
on the test set, ideally they should give fairly similar models; the same general
patterns in the score plots and loading plots, the same number of PCs and similar
residual variances, both in X and Y are to be expected, except for the small(er), or
larger, sampling differences.

However, the two sets must not be too similar. If the two data sets are identical,
then the only difference between them would be the sampling variance, i.e. the
variance due to two independent samplings from the target population. In real life it
may be more or less difficult to obtain two almost identical drawings, both
representative with respect to the all-important future drawings, which represent
the eventual use of the prediction model. However, since this is such an important
issue, one must consider all these aspects of validation even as early as when you
plan the initial data collection. A problem may be how to pick out a representative
test set from all the available samples in the target population, but this is at least a
practical problem which can be confronted directly.

The calibration set must always be large enough to calibrate a model satisfactorily.
The test set must also be large enough to provide a satisfactory basis for the test.
Both these requirements call for a balance between the size of the sampling, the
number of samples and the representativity of both. Test set validation may
therefore at times require relatively more objects than a straightforward calibration
using cross-validation (to be presented immediately below), which may be too
expensive or difficult to achieve in a particular situation. What do we do then?

It is of paramount importance to realize the distinction between the optimum test
set concept - and the need for also being able to handle situations in which test set
validation unfortunately simply is not possible. While one is required to master
both the test set validation and the cross-validation approaches, there are serious
hidden dangers in opting for the latter too easily.

7.3 Cross Validation


Starting Out: the Ideal Test Set Switch
In cross validation we only use the available training set objects, making models on
parts of the data and testing on other parts. There is no independently drawn test
set. The ideal situation would be the alternative in which there are enough data
available to simply divide the initial data set into two parts: A and B. Then one calibrates a model on A and tests it on B, after which one makes the switch: model on B and test on A. The prediction error is then calculated for both B and A, and we
take the mean to obtain the estimate of the total prediction error.

This situation is often called the test set switch. This may indeed seem to be a
promising method. The test set switch situation is equivalent to the ideal of having
“enough samples for a good calibration” x 2, which translates into having plenty of calibration samples. Test set switch is in fact almost identical to the proper test set situation – with one crucial difference: there have not been two independent drawings from the target population, only one! This has some specific implications,
which shall be more fully explored below.
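
To make the test set switch concrete, here is a minimal sketch in Python (numpy and scikit-learn), using a simulated data set; the variable names, the 40-object size and the 2-component choice are illustrative assumptions only, not part of the original material.

# Minimal sketch of the test set switch: one data set, split into halves A and
# B; model on A and test on B, then switch, and average the two error estimates.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))                          # hypothetical X-data (40 objects)
Y = X[:, :2] @ np.array([[1.0], [0.5]]) + 0.1 * rng.normal(size=(40, 1))

idx = rng.permutation(len(X))
A_idx, B_idx = idx[: len(X) // 2], idx[len(X) // 2:]  # the single drawing, split in two

def rmsep(train, test, n_components=2):
    """Calibrate on one half, predict the other, return the prediction error."""
    model = PLSRegression(n_components=n_components).fit(X[train], Y[train])
    residuals = Y[test] - model.predict(X[test])
    return np.sqrt(np.mean(residuals ** 2))

error_estimate = np.mean([rmsep(A_idx, B_idx), rmsep(B_idx, A_idx)])
print(f"Test set switch RMSEP estimate: {error_estimate:.3f}")

Note that both halves still stem from one and the same drawing, which is precisely the crucial difference pointed out above.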

But what do we do when the available number of samples is clearly below this
level - i.e. when we simply do have “too few samples” at our disposal for a proper
test set validation? Cross-validation now must come into play.

Full Cross Validation


Switching test sets did not solve the problem of having too few samples. Full cross
validation, or Leave-One-Out validation (LOO), is the correct thing to do here. Full cross validation means that you make as many sub-models as there are objects, each time leaving out just one of the objects and using only this one for the testing. If
there are 20 objects, each sub-model will thus be made on 19 samples. The squared
difference between the predicted and the measured Y-value for each omitted sample is summed and averaged, giving the usual validation Y-variance, apparently in the exact
exact same sense as for test set prediction testing.

The full training set model is based on all 20 samples however. This means that the
LOO cross-validation error estimate is not based exactly on the full model, but on
20 almost identical sub-models, each with only 19 samples. For the series of these
20 sub-models, each pair of sub-models will have 18 objects in common. Does cross-validation seem a bit like cheating? We have actually never really performed
the validation procedure on a truly independent test set. This does not make sense!

Unreflected use of this cross-validation scheme may make it appear as if we have created a true test set out of nothing – or at least out of the very same training data set with which we have also carried out the calibration. What is wrong here?

Full cross-validation is used very extensively within chemometrics. It has been claimed to be just as powerful as test set validation, but in reality this cannot be so. Full cross-validation can never be a complete alternative to test set validation. Why? The crucial difference is that all the samples at our disposal were in fact sampled simultaneously from the parent population. No matter what we care to do with this collection of objects, we are still using this one-and-the-same calibration set. This is what is “wrong” with the cross-validation concept – but it is ONLY wrong with respect to the high ideal of the perfect test set. All training set objects do impact on the underlying full modeling however, so the calculation and interpretation of the calibration model parameters are quite unaffected.

Cross validation is the best, and indeed the only alternative we have when there are
not enough samples for a separate test set. There are two, apparently different,
types of cross-validation which we are about to present in more detail; full cross-
validation and segmented cross-validation. In reality these two types are very
closely related however, and once you’ve mastered the one, the other follows
directly.

Actually there are many myths about full cross validation. Many textbooks and experts recommend full cross validation as a general approach to prediction testing,
claiming that this should give the most comprehensive testing of the model. Many
statistical and data analytical programs include full cross-validation as the default
procedure. Certainly not everyone agrees with this however, because of the fatal
weaknesses discussed above.


Segmented Cross Validation


There is also another powerful solution falling between full cross-validation and
test set validation, called segmented cross validation. This method is used when
there is a relative abundance of samples in the training set, but we do not know how
to pick out representative ones, or perhaps when full cross validation would be too
time consuming. (Full cross validation of e.g. 1000 objects would take 1000 times
longer than making one model.) In segmented cross validation you divide the
calibration set into segments with, for example, 10% of the samples in each. You
leave out all the objects in one segment and model the rest. Then you leave out the
next segment, but take in the first again, and so on. How large should the segments be? There are many suggestions, e.g. 10%, 25% or 33%; in reality this choice is strongly problem-dependent.
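
A hedged sketch of segmented cross-validation along the same lines (scikit-learn's KFold is used here as the segmenting device; the 10-segment default is only one of the suggestions quoted above, and setting the number of segments equal to the number of objects reproduces full cross-validation):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def rmsecv_segmented(X, Y, n_segments=10, n_components=2, seed=0):
    """Leave out one segment at a time, model the rest, and pool the prediction errors."""
    squared_errors = []
    segments = KFold(n_splits=n_segments, shuffle=True, random_state=seed)
    for train_idx, test_idx in segments.split(X):
        sub_model = PLSRegression(n_components=n_components).fit(X[train_idx], Y[train_idx])
        residuals = Y[test_idx] - sub_model.predict(X[test_idx])
        squared_errors.extend((residuals ** 2).ravel().tolist())
    return np.sqrt(np.mean(squared_errors))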

Systematics of Cross-Validation: Introduction


Let us now tackle this issue head on. On the one hand we have too few samples for
a test set validation – on the other hand we have “somewhat” more samples than
the minimum, which would require us to choose LOO, the full cross-validation
option. We are facing the “intermediate” case of wanting to squeeze more than one
sample into each segment. Incidentally, full cross-validation is easily seen as but
the limiting case of exactly one object per segment. Putting two samples into each
segment will allow for a 50% reduction in the number of segments, i.e. the number
of sub-models to be calculated. With three samples in each segment you achieve a
67% reduction, and so on.

Now it is in fact easily seen that in every segmented cross-validation situation there
is a definite range for the number of segments that can be chosen, corresponding to
(2,3,4,…,n-2,n-1,n). With this realization, the systematics of cross-validation is in
fact easily mastered: Each cross-validation necessitates a choice of the number of
segments to be made by you, the data analyst, not by the software program, and
most emphatically not by any algorithmic approach. This choice is always in the
range (2,3,4,…,n-2,n-1,n).

At the outset of trying to master multivariate calibration, this will almost certainly
appear as a very unfortunate situation for the novice data analyst. It would be so
much better (read: easier) were this “difficult choice” to be made by the “method”
itself. But this is not the correct way to tackle this issue. On the contrary, it will be
necessary for you, as a responsible data analyst, to make an informed choice of the
number of segments to be used in all cross-validation. We would perhaps now be
expected to give you another list of rules-of-thumb for this endeavor.


Unfortunately, there is only one place where relevant help is to be found concerning this choice – from the very data structure itself.

The issue of how to select the “correct” number of segments revolves around a
novel way to interpret the entire cross-validation situation. A full disclosure is
outside the scope of the present introduction, for which see Esbensen & Huang
(2000), but the critical essentials are easily enough presented:

1. There is no universal validation procedure, to be used on all data sets (blindly).


2. Rather, each validation situation is related to the actual data structure present.
3. Choose test set validation if at all possible. Only if all possibilities are
exhausted:
4. Segmented cross-validation, including the LOO option, comes to the fore.
5. Systematically one has to choose between (2,3,4,...,n-2,n-1,n) segments.
6. The relevant T vs. U score plots show the pertinent data structure(s).
7. All cross-validation is only a simulation of the ideal test set validation.

In summary, all cross-validation is ever doing, in fact all it can ever do, is trying to simulate the ideal case of test set validation. With this in mind you should have a much more relaxed attitude towards assuming the personal responsibility which we have already talked about at great length above. This may seem like a difficult obligation for a trainee data analyst, but there is no other way than to start out on your own forays. In reality all that is needed is a few reflections on the effect of disturbing the data structure, as it is manifested in the T vs. U plots, by selective removal of (1,2,3,…,n-2,n-1,n) objects simultaneously.

The kind of reflections involved revolves around picturing in one’s own mind the effect(s) of removing e.g. one object from the data set: how would this affect the pertinent y|x-regression line direction in the T vs. U plot? How about removing segments with 2, 3, 4, … objects? What about removing 50% of the training data set? The issue here is to be able to imagine how the pertinent segment deletion will affect the y|x-regression line direction, viewed as a response to a perturbation of the data set structure, the specific perturbation being the removal of a certain fraction of the overall number of objects. Some practice in this context is surely needed, and for this reason we have arranged for plenty of multivariate calibration exercises, complete with our offer of a “correct” number of segments, to be included in this book.

But the central issue involved is, strictly speaking, not just the y|x-regression
direction perturbation responses, but rather this: how large a proportion of the samples must be taken away in one segment – in order to simulate the effects of the
second target population sampling that was never carried out in this cross-
validation setting?

This is the core issue of the inter-related test set validation/cross-validation systematics, which we are trying to describe for your deeper understanding in this
first introductory chapter on validation. This test set simulation role of cross-
validation is completely novel with the present approach; it is explained in full
detail in Esbensen & Huang (2000).

Cross Validation in Practice


Let us go back to the earlier reported misconception that full cross-validation is an all-powerful validation procedure. Is it at all realistic to believe that the removal of only one object will affect the y|x-regression line direction so much as to simulate a completely new drawing from the target population? We most emphatically deny this notion!

It is a more “realistic simulation of a test set situation” to divide the calibration set
into a few segments, e.g. with at least 10% of the samples in each. It would appear very unlikely, for example, that one singular left-out sample will result in a significantly greater sampling variance than that resulting from the 10% segmentation removal, unless the one left-out sample happens to be an outlier. We must assume here that all such outliers have been screened away by you, the experienced data analyst, of course! By this reasoning, full cross validation on any reasonably balanced data set simply must lead to over-optimistic validation results.
This makes the minimum 10% segmentation approach seem more realistic.

This type of segmentation is often done randomly. However, there is no guarantee that a random selection gives an even distribution (unless we have a truly large data set). In addition you cannot repeat this procedure under the same conditions, as required for formal quality assurance procedures. Apart from “random selection”, there are also options for manual as well as systematic segmentation, which brings us back to the problem of how to pick out representative segments. Nobody promised that validation would necessarily be easy. However, with some experience it will tend to become manageable, we promise!

As long as you plan what to do, understand the consequences and analyze the results
accordingly, the chance of making big mistakes is minimized. Omitting validation
would be much, much worse!


Segmentation Strategy Depends on the Data Set


It should now be easy to appreciate that the size of the segments depends on both
the size and the structure of the data set and on its complexity (the number of
significant components in the model is often called the complexity of the model). In
a small data set with e.g. only 10 samples, a separate test set is of course
impossible. We need all 10 points to make a model. Full cross validation is the only
option here.

If all 10 samples are equally important to span the variations, even leaving out one
sample may cause serious problems. A good example is a data set constructed by a
fractional factorial design. Then, we may need to disregard validation, but simply
check that the model fit is adequate and not be too optimistic about using the model
for safe future predictions. If we make this model in order to understand the
relationships in our system, for screening, or to investigate the possibility of using
multivariate modeling for future indirect measurements, perhaps we can live with
insufficient validation of the screening model. There is however also a third option:
leverage correction.

7.4 Leverage Corrected Validation


Leverage corrected validation has been called a “quick and dirty” method, an apt
designation. “Quick” - because it only requires us to make one model, i.e. it is as
fast as test set validation. “Dirty” - because the prediction error always must an
over-optimistic estimate. You have already used it in many of the previous
exercises, merely because we had not yet introduced the other methods of
validation. Leverage correction can be used as an alternative in the initial modeling
stages, to save time, but should always be replaced by a more conservative and
reliable method for validation of the final model.

The leverage-corrected model is exactly the same as the alternative test set
validated, or the cross-validated ones. They give exactly the same scores and
loadings, since all the calibration samples are used to make the model, but in
general the prediction error will be estimated to be lower and it may sometimes
even indicate fewer PCs.

Leverage is a measure of the effect of an object on the model, which to a large


extent is related to its distance from the model center. The leverage is scaled so that
it always has a value between 0 and 1. An extreme object, far away from the model center, will have a high effect (leverage close to 1). A typical object, close to the
model center will have a small effect (leverage close to 0).

Equation 7.9
h_i = \frac{1}{n} + \sum_{a=1}^{A} \frac{t_{ia}^{2}}{t_a^T t_a}
where
A = number of PCs calculated
t_ia = the score value for object i in PC a
h_i = leverage for object i
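
As a small illustration of Equation 7.9, the leverages can be computed directly from a score matrix T (n objects x A components); this is a hedged sketch assuming centered scores, not the program's own implementation.

import numpy as np

def leverages(T):
    """h_i = 1/n + sum over a of t_ia^2 / (t_a' t_a), for every object i."""
    n = T.shape[0]
    component_norms = np.sum(T ** 2, axis=0)     # t_a' t_a for each component a
    return 1.0 / n + np.sum(T ** 2 / component_norms, axis=1)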

In leverage-corrected validation, one is correcting the individual object prediction error estimates before these are squared and summed appropriately in the formal
expression for RMSEP etc. The corrected error estimate is related to the leverage
for each object. The effect of leverage-correction is to increase the weight of the
extreme (faraway) samples and to give a corresponding relative reduction in the
weight of the more “typical” ones, closer to the average object.

For each variable Y_j, the raw residual for an object, f_ij, is divided by the leverage expression (1 − h_i):

Equation 7.10
f_{ij}^{corrected} = \frac{f_{ij}}{1 - h_i}

The residual validation variance is calculated as usual, from the mean squared
error, but the individual error contributions are now leverage corrected according to
Equation 7.10. In this way, the otherwise small residuals from influential objects now contribute to increasing the corrected prediction error estimate, which is only fair, since these objects – because of their relatively far-out positions – were very influential in describing the data structure model. This increase in the weight of the error contributions from unduly influential data is what constituted the original motivation for introducing the concept of leverage correction. This is an extensively used feature in statistics and data analysis. The data are “unduly influential” because, in the least-squares sense of bilinear modeling, such data points will necessarily lie close to the model PC-directions, with the unwanted consequence that they will always contribute unrealistically small prediction errors – if not properly corrected!
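
A corresponding sketch of Equation 7.10 applied to a whole matrix of raw Y-residuals F (n objects x q Y-variables), again only an illustration of the principle:

import numpy as np

def leverage_corrected_rmse(F, h):
    """Inflate each raw residual f_ij by 1/(1 - h_i), then square and average as usual."""
    corrected = F / (1.0 - h)[:, None]           # divide every row i by (1 - h_i)
    return np.sqrt(np.mean(corrected ** 2))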


In general this correction approach often works rather well, but there are also
several situations in which grave distortions may arise, especially when dealing
with special data structures in which there is a strong collinearity. Leverage
corrected validation is never to be used for the really important final validations.

There is an analogous leverage statistic for each variable as well, which may be
used for detailed interpretations of the relative importance between variables.


8. How to Perform PCR and PLS-R


The main steps in the practical working with PLS-R and PCR will be introduced
below. This chapter will teach you only the basic necessary “mechanics” of
performing standard PCR- or PLS-R multivariate calibrations and their validation.
It is intended to give you a chance to digest the theoretical introductions in chapters
6 and 7 in a relatively simple practical context, i.e. to start to use these methods
immediately. Chapter 9 goes into much more detail of many additional issues, the
understanding of which is greatly helped by the following practice.

8.1 PLS and PCR - Step by Step


1. Collect a representative set of calibration objects, the training set, and
preferably also a set of validation objects, the test set. If a separate test set is not
possible, consider how the model is to be validated already when collecting the
training set.
2. Enter the input data from the keyboard or import it from an external file (File-
Import). Key in the data and choose File - Save to save the data in Unscrambler format. (Not applicable with the training version, which only accepts the set of accompanying data sets.)
3. Plot the raw data to get a first impression. Mark the data and choose Plot-
Matrix or Plot - Line.
4. Perform the necessary – and appropriate - preprocessing, if any (Modify-
Transform). Note that autoscaling is done in the Task - Regression menu.
5. Open the Task - Regression menu. Select model parameters appropriate for
your data analysis: Center Data, Weights =1 or 1/SDev, Method = PLS1, PLS2 or
PCR and Validation method = Test set, Cross validation or Leverage correction.
Start the modeling (OK).
6. The calibration screen output and the model overview which pops up after calibration indicate outlier warnings, etc.
7. Evaluate the calibrated model by plotting the results (View). Study Variance,
Scores, Loadings and Predicted/Measured. With PLS it is important to see that the residual variance does not increase in the first PCs. If the Y-variance plot, scores or T vs. U scores indicate outliers, check whether they are due to data
transcription errors. If not, go back to step 5 and use the Keep out of
calculation option for any detected outliers. It is normal to iterate this several
times during an analysis.


8. The prediction performance is evaluated by looking e.g. at the RMSEP and the
other validation options introduced above.
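
For readers who want to mirror these steps outside The Unscrambler, the following is a hedged sketch of an analogous workflow in Python with scikit-learn; the file name and column layout are assumptions, and scikit-learn's PLS may differ in algorithmic details from the program described here (its scale=True option plays roughly the role of the 1/SDev weighting).

import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

data = pd.read_csv("training_set.csv")            # step 2: import (assumed file name)
X = data.filter(like="x_").to_numpy()             # assumed X-variable columns
Y = data.filter(like="y_").to_numpy()             # assumed Y-variable columns

# Steps 4-5: centering is built in; scale=True corresponds to 1/SDev weighting.
model = PLSRegression(n_components=2, scale=True)

# Step 8: validated prediction performance (here 10-segment cross-validation).
Y_pred = cross_val_predict(model, X, Y, cv=10)
rmsecv = np.sqrt(np.mean((Y - Y_pred) ** 2, axis=0))
print("RMSECV per Y-variable:", np.round(rmsecv, 3))

# Steps 6-7: fit on all objects and inspect scores/loadings for interpretation.
model.fit(X, Y)
scores, loadings = model.transform(X), model.x_loadings_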

8.2 Optimal Number of Components in


Modeling
Above it was outlined how to select the optimal number of PCs by studying the
validation residual variance plot. One should choose the number of PCs
corresponding to the first clear V-minimum, or a break from monotonically
decreasing residual variance, i.e. where the prediction error is minimized, all the
while bearing in mind the “expected” number of “phenomena” in the data. In the
model overviews, The Unscrambler program suggests an optimal number of PCs.
This is a help to inexperienced users only, to avoid serious overfitting. The optimal
number of PCs in The Unscrambler is calculated according to the formula:

in PLS or PCR: Min over a of [ Vytot_val,PC0 * 0.01 * a + Vytot_val,PCa ]

in PCA: Min over a of [ Vxtot_val,PC0 * 0.01 * a + Vxtot_val,PCa ]

where
a = current dimensionality (PC number)
Vytot = Total residual Y-variance at validation
Vxtot = Total residual X-variance at validation
Index PC0 = at PC number zero
Index PCa = at PC number a.

In other words, for each new PC the program adds 1% of the initial variance (the variance at zero PCs, before the calculations start) per component to the residual variance at PC number a. This correction factor ensures that you do not select still another PC unless you gain at least this much by doing so. In this way you are encouraged to use fewer PCs. If increasing the number of PCs does not bring much improvement, one should avoid using more PCs, as degrees of freedom would be lost without any further residual variance reduction.
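
The rule can be written out in a few lines of Python; this is only a re-statement of the formula above with made-up variance numbers, not the program's source code.

import numpy as np

def suggested_n_components(residual_val_variance):
    """residual_val_variance[a] = total validated residual variance at a PCs (a = 0, 1, ..., A)."""
    v = np.asarray(residual_val_variance, dtype=float)
    a = np.arange(len(v))
    penalized = v[0] * 0.01 * a + v               # 1% of the zero-PC variance per extra PC
    return int(np.argmin(penalized))

# Example: the variance keeps creeping down after 3 PCs, but not by enough to
# pay the 1% penalty, so the rule suggests 3 components.
print(suggested_n_components([1.00, 0.40, 0.15, 0.10, 0.098, 0.097]))    # prints 3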

However, you must also check the present residual variance relations carefully. If
the “optimum” solution comes after a local increase, one must be very cautious.
Outliers and/or large(r) data set irregularities are on the prowl! Always study the
particular shape of the variance curve. This is further discussed in section 11.7. The
bottom line is that the standard The Unscrambler suggestion for the optimal number of modeling components is just that: a standard notion, only completely valid for “well-behaved” data structures. It is always up to the informed data
analyst to choose this dimension in the final analysis!

8.3 Information in Later PCs


How does one interpret models where the significant information comes in later
PCs? Since PLS focuses on the (X,Y) correlation directly, the reference values (the
so-called “measured Y”) will always guide the decomposition of the X-matrix. In
most cases the first PCs will then describe most of the variations in Y, i.e. the
information will come in early PCs. However, there are situations where PLS needs
more PCs to find subtle information. By studying the variance plots you can see
which PCs describe most of the residual Y-variance, thus identifying the most
important PCs. Studying the scores and loadings for these PCs allows you to draw
fairly accurate conclusions about the samples and characteristics described by the
PCs. See also section 11.7.

8.4 Exercises on PLS and PCR: the Heart-


of-the-Matter!
The following exercises in chapter 8 have many aspects in common with the
exercises in later sections. Now you will learn more about the difference between
PLS1 and PLS2, and between PCR and PLS, but this time you will have to do more
of the work yourself. You will be asked several questions along the way to help you
focus on the various important new aspects.

In several later chapters you will use PLS and PCR on many different application
examples, involving typical real-world data analytical problems, in which the
informed choice as to the validation will be increasingly up to you, and your
accumulating multivariate modeling expertise. It cannot be stressed enough that it
is the personal experience which counts here. It is all fine to learn the theory, and
the methodological finesses, algorithms etc. pertaining to multivariate calibration,
but this does not make a good data analyst per se! The only way to really learn the
trade in this realm is personally to apply the theory to representative, real-world
problems. We have worked hard to select precisely this type of exercises in this
book. In this context, chapters 10-13 are perhaps the most important for your own
practical multivariate modeling learning curve, but first things first:


You will study the familiar data set on green peas, but in a form modified from the one encountered earlier.

8.4.1 Exercise - PLS2 (Peas)


Purpose
To calibrate and interpret a full PLS2 model.

Problem
Pea quality is mainly described by sweetness and texture, as experienced whilst in
the mouth. Using trained judges, panelists, is often standard practice in these
matters. However, peas are harvested and valued by their texture, often measured
by tendrometer readings. Tendrometer measurements are relatively inaccurate
however. Sugar contents indicate sweetness. Again, the main objective behind establishing a multivariate calibration model is this: is it possible to replace the costly sensory panel evaluations by inexpensive instrumental measurements for routine quality control?

Data Set
The data are stored in the file PEAS. The variable set Chemical contains 6
chemical measurement variables (X). The Y-variables are stored in the variable set
Important Sensory which contains 60 samples with the six most important
variables, in fact the ones you found in a previous exercise. The data are averaged
over 2 replicates and 10 judges.

Table 8.1 - Variables in the X-data set: Chemical


No Description No Description
X1 Tendrometer value X4 % sucrose
X2 % dry matter X5 % glucose 1
X3 % dry matter after freezing X6 % glucose 2

Tasks
1. Calibrate a PLS2 model.
2. Interpret the results.


How to Do it
1. Take an overview look at the data. Does pre-processing or weighting seem
necessary?

2. Go to the task menu and choose a PLS2 regression model with default
weights set to 1/SDev. Why is weighting (autoscaling) necessary? Use the
mean as the model center and set the outlier limit to 3. Choose leverage
correction as the validation method for the first run. Why do we use PLS2?

3. Study the model overview


Study the validation residual variance after each principal component. How
many PCs seem enough or optimal? Why?

4. Plot the results


Go to the model overview and plot the explained validation variance, totally for
all the Y-variables. How many PCs to use? Why? Do we get much more
explained variance by using more than two PCs? How much is explained by the
first PC alone?

Now plot the variance for the individual Y-variables: use Plot – Variances and
RMSEP to plot the variance for Y-variable 1, 2, 3 and so on in the same plot
(Variables: Y, 1-6, un-tick Total, Samples: Validation. Double-click on the
miniature screen in the dialog box to make your plot fill up the whole viewer).
Which Y-variable has the highest prediction error? Are all the Y-variables well
modeled?

5. Scores and loading plots


Plot the scores for PC1 and PC2 in a 2-vector plot, Variables: X and Y. Is the
overall distribution satisfactory? Are there patterns or outliers? Which property
is modeled by PC1?

Plot the loadings in a 2 vector plot for PC1 and PC2. Which variables are the
most important? Which seem to correlate? Which chemical measurements
correlate most with sensory data? Should we pay more attention to PC2?

6. Predicted vs. Measured


Plot predicted vs. measured for each Y-variable. How many PLS components
should we use here? Use View - Trend lines to turn on the regression line and
Windows - Statistics to see the statistics. Which variables are predicted best? Which are worst? Is there any relationship between these results and the
variance results? Also check the results using only 1 PC.

7. Outliers
Relation outliers, i.e. outliers due to errors in the relationship between X and Y
may be difficult to find in the normal score plot (t1-t2), which of course is a
picture of the data structure in the X-space alone. Check for possible relation
outliers by plotting X-scores (T) versus Y-scores (U). The “X-Y Relation
outliers” plot does this (Components: Double, 1 and 2). This is an extremely
important plot!

8. RMSEP
Check the expected future prediction error in original units by displaying the
RMSEP (Y-variables: All, Samples: Validation, make the plot fill up the whole
viewer). Select Window- Identification to see what you plotted. Does the
RMSEP provide the same interpretation as the Predicted vs. Measured-plot?

Draw your conclusions regarding this PLS2/leverage-correction validated model. Does the model appear OK? How many PCs do we need? How large is
the prediction error?

Save the model under the name PeasPLS2.

Summary
One should always standardize data when they are measured in different units, as is
the case with the chemical X-variables in this exercise. Sensory data are always
standardized in PLS and PCR because the scale may be used differently for the
different variables by the different judges.

PLS2 is suitable when there are several Y-variables, at least to get a good first
overview. If the prediction error is high, one may alternatively try several separate
PLS1-models.

The calibration variances express the model fit in the X-space and the Y-space
respectively. Validation variances are calculated using the chosen validation
method and say more about how well the model will work for predicting new data
(Y).


One PC explains about 85% of the variance of each Y-variable, except for Off-
flavor. We can safely use two PCs since the explained variance increases to about
90% and we know that there are only two independent variation factors (time and
place).

The samples are well spread in the score plot and there are no obvious outliers or
clear groups. The first component describes the variation due to harvesting time.

All the variables vary a lot in PC1, but the flavor and sugar variables also have
some contribution in PC2. However, PC2 only accounts for 2% of the variation in
X and 2-6% in Y (over all the Y-variables). Sweetness correlates with %sucrose
and Off-flavor is negatively correlated with those. The texture variables correlate
with tendrometer and dry matter measurements. It seems that ripe peas are sweet
and fruity while early harvested peas are hard, mealy and have Off-flavor –no
surprises here.

We studied prediction results with two PCs. Using only one PC gives slightly
poorer predictions. Off-flavor has the worst correspondence between predicted and
measured values. This is natural, since the explained Y-variance was lowest for this
variable.

RMSEP is the estimate of the mean prediction error of all “test samples”. In this
exercise we used leverage corrected validation instead of an independent test set, so
the estimate of the prediction error is probably too optimistic. The curve of RMSEP
against number of components has the same shape as the validation residual Y-
variance and is scaled back to original units. We see that RMSEP is larger for those
variables that have a worse Predicted/Measured relationship. RMSEP is between
0.3 and 0.6 for all Y-variables. This is reasonably “good” since the panelist
judgement measurements are on a scale of 1-9. The relative error is thus max. 0.6/1
= 60% on the low value levels and max. 0.6/9 = 7% on the high levels. Considering
the inaccurate nature of sensory assessments this is probably acceptable.

8.4.2 Exercise - PLS1 or PLS2? (Peas)


Purpose
In this exercise you will make two PLS1 models of variable 2, sweetness, from the
Y-matrix, to find out whether PLS1 gives a better model than PLS2. Normally we
expect PLS1 to give a better model on one Y-variable than PLS2 on an enlarged
Y-variable set. There are, however, also cases where this is not true, due to the fact that PLS1 does not use all the information available in the Y-matrix. What you will
also learn in this exercise is how to compare models and decide which is the better
one.

Data Set and Problem


The same as in the previous exercises, PEAS.

Tasks
1. Calibrate two PLS1 models (same Y-variable, different validation procedures)
2. Compare them with the PLS2 model just completed above.

How to Do it
1. Go to Task - Regression. Use Chemical as X and define variable
Sweetness as a new set. Use the same model parameters as in the last model,
but change the calibration method to PLS1. Calibrate for e.g. 6 PCs.

2. Interpret the model. Remember to check the RMSEP by plotting it. How
many components to use now? Which model is the better, PLS1 or PLS2?
Which criteria did you use to establish this finding?

Save the model under the name PeasPLS1_LC.

Interim Summary
In this case the leverage corrected PLS2 model and the leverage corrected PLS1 model turned out to be equal with respect to the size of the estimated prediction error – compared for the same number of model components of course. You never know
in advance which one will be best. You must always try both PLS2/PLS1, if you do
not go directly to PLS1 for “external reasons” that is.

Purpose
Introducing cross-validation, now to be applied on the same data set, PEAS.
Comparing three alternative PLS-models, based on two alternative validations.

Tasks
Calibrate another version of the PLS1 model for sweetness, but this time using full
cross-validation (LOO).


How to Do it
1. Change the validation method in Task - Regression to Full Cross validation.
Calibrate again for e.g. 6 PCs. How does cross-validation work? While waiting
for the calibration to finish, reflect on the functioning of cross validation. Do
you expect the prediction error to increase or decrease? Why?

Save the model under the name PeasPLS1_CV.

2. Compare the two alternatively validated PLS1-models.


Do this by plotting the scores from the different models in separate Viewers.
Use Results - Regression to plot scores for previous models. Are the models
equal?

3. Now do the same comparison both for the loadings and for the loading-weights.

4. Compare the prediction errors (RMSEP). Are the prediction errors equal?
What is the difference? Which error is lowest? Why? Which is highest? Why?

Summary
The leverage corrected PLS1 model could perhaps have been interpreted as slightly
“better” than the cross validated one, because its RMSEP was in fact a bit lower.
Scores and loadings were exactly equal for the two PLS1 models, of course,
because it is only the prediction error estimate that is a function of the particular
validation method used. We got the same model in both cases. But cross validation
gave a more conservative - and probably more realistic - prediction error estimate.
Note that leverage correction in this, as in many other data sets, may be too
optimistic.

8.4.3 Exercise - Is PCR better than PLS?


(Peas)
Purpose
In this exercise you are to compare PLS and PCR in much the same way as you
have compared PLS1 and PLS2 earlier.


Data Set
PEAS, the same as in the previous exercises.

Tasks
1. Now calibrate a PCR model
2. Compare the PCR and PLS models

How to Do it
1. Make the PCR model as you made the PLS models. Change the calibration
method to PCR. First use leverage correction as the validation method. Why?

2. Compare the PCR model against both the PLS2 model and the leverage
corrected PLS1 model from the exercises above. For a final model, if appropriately validated, RMSEP is the most important single criterion to check.
You should compare RMSEP for all the Y-variables pairwise. Remember that
the PCR model includes regression to each of the Y-variables.

When comparing PCR and PLS1, remember that sweetness is variable number
2. The Window - Identification command is handy to see what you plotted.

Are the models different? Which is best? Why did we use leverage correction in
the PCR model? Why did you not suggest the use of e.g. cross-validation in this
exercise?

Summary
You should now have gathered some initial experience in making and comparing
bilinear prediction models. In general PLS models result in a lower prediction error
than PCR models, using fewer PCs. However, PCR may be forced to just as low an
error, often by including more PCs. Remember to ask yourself when making
models and interpreting results – comparing with the standard theoretical
expectations: Why is this particular result like this? Are the results as expected?

We mostly used leverage correction above, simply because the other models we
compared with were also leverage corrected.

You are now to compare all the different models above using both test set
validation (if/where possible), and especially you are to compare the LOO cross-
validation, and appropriately set up segmented cross-validation alternatives. Good
luck!

9. Multivariate Data Analysis – in


Practice: Miscellaneous Issues
This chapter summarizes a range of important major and minor, but all non-trivial,
issues related to multivariate data analysis – in practice, including both PCA and
PCR/PLS-regression. The issues treated here all relate to what must be mastered in
addition to the basic calibration experience. We have introduced and expanded on
some of these topics in earlier chapters, but the range of what follows is new. The
really important issues probably cannot be emphasized too often when starting out
in the multivariate modeling.

These issues are presented in a comprehensive fashion, which will allow you to
gain more experience and understanding before we introduce a series of varied and
realistic real-world exercises in the next few chapters. Many of the miscellaneous
issues in this chapter will not find use simultaneously in any one data analysis, but
we have found no other way than to present them here in a somewhat kaleidoscopic
fashion. This is a chapter to read now (i.e. between chapters 8 and 10), but you will
get much more out of it when re-reading it after you have completed the entire
book, and especially after having completed all the exercises involved in the rest of
the book. The practical experience thus gained will allow for an enhanced
awareness of the kind of very important issues addressed here. There will also be a
further deepening of the validation issue.

9.1 Data Constraints


A multivariate calibration model relating Y-variables to X-variables has to be
determined from the most representative set of calibration objects possible to
obtain. Without enough representative information in this data set, you cannot
expect reliable prediction results later on.

Some general requirements for the calibration data set are given below.

• The calibration objects must be representative of the population of objects for which Y is to be predicted later, with regard to both average values and variability.


• A sufficient number of the typical or the most important (X,Y)-levels must be included in the calibration to ensure the best prediction ability for such objects in the long run.

• The model will never be better than the accuracy of the reference method (Ycal).
Collect some selected replicates (at least) to ensure that the measurement
inaccuracy in the reference method is covered, and can be estimated (see also
further below).

• A sufficient variability must always be included in the calibration objects. The levels of every major independent factor or phenomenon must vary to the fullest
possible extent in the training set. Note however that the actual quantitative levels
or values of these effects do not necessarily all have to be known in advance - for
empirical calibration that is. Experimental design is a quite different area, which
is discussed in chapter 16.

• All major interferents (physical, chemical and other) should vary as widely as
possible in the calibration set, otherwise the multivariate calibration model cannot
distinguish between them. Failure to achieve this may for instance cause
erroneous Y-predictions in the future with some objects being identified
(wrongly) as abnormal outliers by the too limited model coverage. All factors
involved should thus display representative covariation in the calibration set in
addition to the individual maximum variable variances. Otherwise the model will
surely fail to compensate for their interaction.

• There must be a reasonable proportion between the number of objects, the number of variables, and the number of PCs used. The number of PCs will in general be of the same order of magnitude as the number of independent phenomena in the data.
Projection methods can handle more variables than objects (which is not possible
with MLR).

• For the most efficient calibration design, one should select training objects that
are as typical as possible. They should also span each of the individual
phenomena, which can be “controlled” as well as possible, and include enough
randomly selected objects to ensure a chance of also spanning the non-
controllable phenomena and their interactions. All this certainly would appear a
very demanding task for the novice and the expert alike - but luckily these are but
ideal objectives to be aimed for. When facing the practical world, other limitations such as economic constraints, tradition, ethical considerations, available time and many other real-world parameters intervene.

• One should always strive for perfection when designing the protocol for the
training data set (and test set) sampling. On the other hand, one should not be
overly discouraged when the real-world practicalities force one to make some
necessary compromises.

• Balance is what is called for – and experience rules the day.

9.1.1 Data Matrix Dimensions


Bilinear projection methods can handle both “wide” (many variables/few objects)
and “long” matrices (few variables/many objects) to a surprising degree, but there
are limitations.

If the X-variables are more or less independent from each other (i.e. non-
correlated), you may, in well-behaved data sets, easily handle up to, say 10-50
times as many X-variables as there are objects. An “ideal data set” means a strong
X-Y data correlation with a comparatively low model dimension, A. If the X-
variables are strongly correlated then the number of X-variables may be much
larger. For example, 20 - 50 samples and thousands of spectral wavelengths may
not be a problem at all. However, the ideal situation of course calls for a less
extreme, rectangular X-matrix.

If models are found to be malfunctioning, one reason may be that there are too few
objects in relation to the number of independent phenomena and X-variables. This
is of course also the case for Y; few objects and many Y-variables may be difficult
to handle with PLS2. Then it may help to make several PLS1 models.

In general, the need for more/many calibration objects increases with the levels of
the measurement noise in the X and/or Y. It also increases with the number of
interference types in the X-variables of course.

9.1.2 Missing Data


It may often be the case that data are missing from either of the matrices (X,Y).
What should one do? There are many different strategies to handle this problem.
The Unscrambler represents missing values by “m” or “-0.9973E+24”. Missing
data points are simply kept out of every computation when appropriate. This means
that the scores, loadings and residuals are computed as usual, but the missing
values do not influence the results. However, objects with missing X-values do not
get predicted Y-values either!

Another possibility would be to replace missing values by the mean value of the variable in which the missing value occurs. This is fine if the object is a typical one with respect to the variable in question, but it will certainly be a big mistake if the object otherwise is an extreme member. How can we know this in advance? This is impossible as the object has a missing element! A much better strategy would be to find the two most similar objects, or the two most correlated variables, in the full multivariate sense, including the pertinent missing values, and to interpolate the missing value(s) aided by the pairwise correlation etc. In this way one gets a local average replacement, consistent with the overall correlation structure, which can better be used in the computations.

A much augmented development involving this general principle, but with a solid
statistical underpinning, goes under the name of “Multiple Imputations” and is a
distinct sub-discipline by itself. A very useful first reference is Rubin (1987).

Missing values should never be replaced by 0. This will certainly cause false
results from the computations, and lead to very unreliable interpretations.
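
The correlation-guided replacement strategy sketched above can be illustrated with a few lines of Python; this is only an illustration of the principle (not The Unscrambler's own handling of missing values), and the fallback to the column mean is an added assumption.

import numpy as np

def impute_by_best_correlate(X):
    """X: 2-D array with np.nan for missing values; returns a filled copy."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)
    filled = np.where(np.isnan(X), col_means, X)          # provisional mean fill
    corr = np.corrcoef(filled, rowvar=False)
    np.fill_diagonal(corr, 0.0)
    for i, j in zip(*np.where(np.isnan(X))):
        k = int(np.argmax(np.abs(corr[j])))               # most correlated variable
        slope = corr[j, k] * np.nanstd(X[:, j]) / np.nanstd(X[:, k])
        # Local, correlation-guided replacement instead of the global mean.
        X[i, j] = col_means[j] + slope * (filled[i, k] - col_means[k])
    return X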

9.2 Data Collection


Data may be generated by planned experiments, collected from a process by well
thought out criteria, randomly collected from an ongoing process, or represent
historical data that “happen” to be available (the literature is a virtual bonanza for
readily available data) - as well as by several other problem-dependent means. The
more carefully one selects data, the greater the chances are that the data will
contain the type of information desired. Because the data should satisfy the requirements of section 9.1, each data collection technique has its own advantages and disadvantages.

9.2.1 Use Historical Data


The advantage of this approach is obvious: data are readily available. However,
historical data have very often been collected for reasons other than the objective(s)
of the present multivariate analysis, and may therefore be neither relevant nor representative for your purposes. They may have been collected under different sampling conditions and contain many uncontrolled phenomena that may now be
difficult, if not impossible, to trace.

You will, however, have many good tools to detect these weaknesses using
projection methods: lack of variability, noise, outliers, trends, groupings, etc.
Studying available external data may also show you which types of informative data (objects/variables) are not present and give you ideas about which additional types you need to collect. In any case, one should always look at all the data one has at one's disposal, but in general one should perhaps not expect too much
from historical data.

9.2.2 Monitoring Data from an On-Going


Process
Clearly, this is an easy way of getting much more relevant data and will probably
generate data that are typical of the process, provided the monitoring instruments
have been carefully chosen. There is however a large risk of lacking variability,
since a main objective of real-world monitoring of course often is to keep the
process stable and within strict product set-point specifications. Thus many types of
the more varying samples may well be lacking. There would be no coverage of
extreme conditions, and probably many useful combinations of process settings
would be missed. This implies that you may need to collect data over a very long
time, to be able to get samples that vary satisfactorily by natural causes. One way to
handle this is to update the model, as “new samples” become available. Data that
appear to be close to limits of the set-point specifications are of special interest.
Such samples are really the most important for spanning the calibration data space,
especially in relation to the possibilities of detecting (predicting) off spec. samples
and process conditions in the future.

9.2.3 Data Generated by Planned Experiments


The aim of experimental design is to provide the maximum potential information from a minimum number of experiments, meaning designing experimental data plans that ensure the widest possible span of the important known variations. Such
a design may be more or less complete; a small fractional design may contain too
little information to determine interactions, while a screening design often lacks
information about possible curvature (non-linear effects), etc. In general, by the
very nature of this endeavor, there is never an abundance of data available when experimental design is on the agenda, so special considerations are always called for.

Screening designs are primarily used to assess the significance of effects from
potential factors of variation in the data, whilst optimization designs give a more
detailed description of the relationships between the response(s) and the design
variables. A data set resulting from an experimental design can usually be modeled
and analyzed by traditional statistical methods such as ANOVA or MLR, but may
of course also be modeled by projection methods. Experimental design is given a
full presentation in chapter 16.

There are, of course, many practical situations where you cannot disturb a process,
at least not as much as preferred for an ideal experimental design. In many plants,
process excitation is not allowed at all. Collecting data from the on-going process is
then the only alternative, unless you can make experiments in the lab or have pilot
plant facilities.

There are also situations where you cannot perform all the desired experiments as
planned; some variable settings may be impossible in real life for example. This is
often the case in process operations. If the experiment (measurement) actually
performed turns out to be far from the planned specifications, the basis for classical
analysis of variance is no longer valid. Thus the traditional methods for the
determination of significant effects would be void, since they often require
orthogonality in the design matrix. PCR or PLS may sometimes save this situation,
but there is in general practically never a remedy for a badly thought out
experimental design.

9.2.4 Perform Experiments or Collect Data -


Always by Careful Reflection
Using the philosophy of experimental design is always a very good way to collect a
data set that spans all important known, or controllable variations. It points out
which combinations of variable settings to use, without the need to make all
possible combinations. Set up an experimental plan and use this as a guide to
which experiments to perform, even if you cannot keep the perfect settings. Also
collect some “random” samples, if possible, to increase the chances of also
spanning non-controlled phenomena.


Data from designed experiments may be useful to analyze simple, or more complex
response relationships in data, but a designed experiment often does not generate
enough samples for a model intended for prediction, since the resulting data set
often only consists of measurements on two or three levels. You may also often
need to get several additional samples around and between the initial design
settings if this is the objective too. The interrelationships between the small sample
experimental design situation(s) and the general multivariate calibration training
situation(s) are not a simple issue. It is not possible to cover all aspects in this
introduction. After the next section and chapters 16, 17 and 18 (experimental
design) have been considered, some further reflections on this issue will be offered.

9.2.5 The Random Design – A Powerful


Alternative
Here we end this discussion by presenting a strong alternative to designed
experiments, when the situation does not allow for such an endeavor. This could be
caused for instance by severe limits on the possibility of taking the required number
of samples when there truly are many experimental factors (say, above 5-8). In this
situation, even the simplest orthogonal designs which include interactions, will
simply demand too many experiments to be run. Alternatively, the complexity of
the system investigated may simply be beyond rational control with the help of
designs which in practice only allow two (at the most three) levels.

The alternative is called the random design, Esbensen et al. (1998). The setup is
easy to comprehend. The number of available experiments is set by the boundary
conditions; in the situations sketched above the need for at least 30-50 objects
should be justified. Consider that the system to be characterized is “complex”
(factors, levels, interactions), but that a number of 42 experiments has been
accepted as the absolute maximum (economic constraints, practical and/or time
limits etc.). The same analysis of the problem which resulted in these constraints also established specific minimum and maximum levels for all factors involved; this is a prerequisite for any experimentation, designed, random or otherwise!

The random design is now set up by a generator which works on each factor
individually: the interval (max – min) is divided by the number of experiments
allowed (42 in this case). A random number table is scaled to the interval [1,42],
and 42 selections (no replacements) from this table are taken. Each random number
will indicate a specific level between the pertinent minimum and corresponding
maximum (continuous variables are binned in some problem-specific fashion).


Thus we are given a set of 42 randomly chosen levels, which neatly spans the entire
experimental domain pertaining to experimental factor 1. This is repeated for the
remaining factors in turn. At the end of this iterative procedure we are left with the
required total of 42 compound level settings for all factors involved, with their
combinations entirely determined by a random changing of the individual
selections. We thus have arrived at 42 experimental settings completely spanning
all the pertinent factor intervals, at levels that are maximally faithful for capturing
both the individual factor ranges as well as their interactions.
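
A minimal sketch of such a random design generator in Python; the factor names and ranges are made-up illustrations, and np.linspace plus a random permutation stands in for the scaled random-number table described above.

import numpy as np

def random_design(factor_ranges, n_experiments=42, seed=0):
    """For each factor, n_experiments levels spanning [min, max], visited in random order."""
    rng = np.random.default_rng(seed)
    design = {}
    for name, (low, high) in factor_ranges.items():
        levels = np.linspace(low, high, n_experiments)    # the interval divided into levels
        design[name] = rng.permutation(levels)            # selection without replacement
    return design

# Experiment i is given by entry i of every factor's (shuffled) level array.
plan = random_design({"temperature": (20, 90), "pH": (3, 9), "flow": (0.5, 2.0)})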

If the number of experimental factors were (only) three, the geometrical model of
the random design is particularly easy to depict in one’s mind. It is a cube with 42
objects sprinkled randomly and homogeneously all over the interior of the cube
volume, including some objects very close to (or perhaps a few actually lying on)
the edges/corners. If the actual number of experimental factors is larger than three
as it would be in a complex system, this conceptual model still applies: simply
think of a generalized hypercube with the same “interior volume” characteristics.
There are some problems with this direct geometrical generalization, Rucker
(1984), but they only relate to the strict hyper-geometrical aspects, not to the
practical use of this 3-dimensional geometrical model of the random design for the
more complex situations.

Esbensen et al. (1998, 1999) give several examples of the use of the random design
for suitably complex technological systems, in which the practical set up and use of
the random design is laid out clearly. The resulting spanning of the pertinent multivariate calibration training data sets is particularly evident.

9.3 Selecting from Abundant Data


The aim is to get homogeneous, representative and well-distributed data sets. If
there are truly many samples available, it may be a good idea to choose a well-
balanced calibration subset from them, but this is probably not a particularly
frequent situation, as has already been commented upon several times. Still, we
need to master this case too.

9.3.1 Selecting a Calibration Data Set from


Abundant Training Data
One method is to make a preliminary PCA-model of all your available data. The
score plot will then show you the extremes and the typical ones, groupings and patterns. You will see whether some types of samples are over-represented or
whether there are “holes” in the overall population distribution. Study the score
plots and try to pick out a subset that spans the variations in each of the relevant
PCs, with but only the necessary minimum number of training samples. More
typical objects, but certainly not just the “average” samples, will give a robust
model for such samples. Many extremes will allow for a more global, but perhaps
less accurate model. Again the particular balance one should make is always
problem-dependent. This procedure may be a bit difficult if there are many relevant
PCs, but it works well in principle. Do not give up on complex problems too easily.
When there are many relevant PCs you may use a factorial design to pick
calibration and validation samples from them, see chapter 16.

Usually this type of training data set screening is carried out in the X-space. An
alternative is to systematically pick out samples based on the Y-value distributions,
for example from every third consecutive Y-level. This works very well because
samples that span large variations in Y necessarily also span the variations in X,
assuming of course that a relevant (X,Y) correlation does exist. However, if the
distribution of the Y-values is very uneven, for example one sample with Y-value 3
and fifteen samples between 35 and 40, the model will of course be very different
depending on whether it contains the first sample or not! Problem-dependent
common sense is very often the best remedy for such non-standard problems. There
is one exercise below (“Geo – dirty samples”) which is fraught with exactly this
type of practical problem. Experience with many types of data sets is mandatory.
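One simple way to automate this kind of score-based subset selection is sketched below
(Python with NumPy and scikit-learn, which are outside the workflow of the software used
in this book). The selection rule, extreme objects along each relevant PC plus a few
typical objects near the model center, is only one of many reasonable choices, and the
function name and parameters are hypothetical.

    import numpy as np
    from sklearn.decomposition import PCA

    def select_spanning_subset(X, n_components=3, n_extremes=2, n_center=5):
        """Pick sample indices that span the first PCs, plus a few typical samples."""
        scores = PCA(n_components=n_components).fit_transform(X)
        chosen = set()
        for a in range(n_components):                         # extremes along each PC
            order = np.argsort(scores[:, a])
            chosen.update(order[:n_extremes].tolist())        # low-score end
            chosen.update(order[-n_extremes:].tolist())       # high-score end
        dist = np.linalg.norm(scores, axis=1)                 # distance to model center
        chosen.update(np.argsort(dist)[:n_center].tolist())   # a few typical samples
        return sorted(chosen)

    # calib_idx = select_spanning_subset(X)   # X: (objects x variables) data array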

9.3.2 Selecting a Validation Data Set


If one has a “many-object” data set at one’s disposal, one may, for example, sub-
sample in order to define the calibration data set, using one of the methods above,
and then use the rest for testing. Again, it is just as important to span the variations
in the test set as in the calibration set. The calibration data must be representative
of what you want to model and/or predict in the future, as must the validation data.
If the validation data are very different from the calibration data, you will get a
high prediction error and a bias.

And, of course, the above does not constitute a proper test set validation setting; we
are back in the vicious circle of the singular sampling test set switch.

9.4 Error Sources


It is vital to reflect on uncertainty in the measurements in all stages of multivariate
calibration, indeed from before planning experiments or data collection. During
multivariate calibration, you expressly wish to replace important Y-measurements
with predictions based on X-measurements. In this calibration context, the Y-values
are regarded as the "true" reference values. If these are noisy, erroneous or
inaccurate, you can never expect the model to predict with levels of accuracy and
precision better than this and, in point of fact, it must always be (at least slightly)
poorer than this.

There is an error component accumulating from each stage in the whole chain of
sampling, preparation and measurement through to data analysis. These errors all
contribute to the model error and to the prediction error. To be able to evaluate
how good a final prediction model is you need to be aware of these typical error
sources.

Here are just a few major, often dominant, error components:

• Inhomogeneity, e.g. differently packed powders or grains, mixtures that are not
well blended, solid samples that are not homogeneous, e.g. meat, rocks, alloys.
• Sample preparation, e.g. different laboratory assistants may perform procedures
in slightly different ways; samples collected at different points in time which may
incur slight sampling differences, aging of chemicals over time etc.
• Instrument inaccuracy, drifts, faults, both in X and the reference method Y.
• Modeling errors

The exercise "Geo – dirty samples" is characterized by several of these error
components.

9.5 Replicates - A Means to Quantify Errors


A way to quantify inaccuracies in sample preparation and measurement is to
measure the same samples several times and to compare the results
(“repeatability”). Unfortunately this standard procedure is often neglected, in which
case you may be more or less totally unaware just how bad, or good, your
measurements are (which is equally bad). Repeating measurements on a small
subset of representative samples could easily give you quite a shock, especially if
you routinely regard your lab results as “the truth”.

Measurement replicates are defined as samples which have been measured twice,
or more, in X and/or in Y. When measured three times, they are called triplicates,
etc. Several Y-measurements of the same sample are called repeated response
measurements. In some situations you may choose to repeat the whole experiment,
or you may choose to prepare the sample again. Which choice to make is of course
problem-dependent.

If one divides a sample in two and measures each part once, one may either consider
it as a repeated measurement or as a replicated sample, depending on how
homogeneous the original sample was. Giving the same type of pea twice to a taste
panel can probably be regarded as a repeated sensory measurement. Cutting a piece
of meat in two may be regarded as a replicated sample measurement, or as two
different samples, because one piece of the meat may still be rather different from
the other, e.g. it may contain more fat. Blending three alcohols twice in the same
proportions may be regarded as two samples, because the experimentation contains
variation in itself; these two samples would give you a measure of the laboratory
preparation (mixing) error component etc.

In the following text we will use the term replicate both for repeated X- and
repeated Y-measurements.

What you decide to do depends very much on where you believe the largest
inaccuracy is to occur, and when you feel the need to determine its size. But always
be aware of potential error sources and keep this in mind when analyzing data (also
at a more general level), when selecting validation methods, and when evaluating
model performances etc.

The practical handling of replicates is discussed further in section 9.7 on page 195.

9.6 Estimates of Experimental - and Measurement Errors
In the following section we consider the distinction between replicated sampling
and measurement replicates. This issue concerns e.g. the situation where you
measure two individually sampled objects ten times (measurement replication)
resulting in a total of 20 "measurements". The same total is acquired by performing
two repeated measurements on each of 10 individually sampled objects. Clearly
these two sets, though both totaling 20 "objects", do not contain the same
information, and consequently proper data analysis cannot treat them identically
either.

9.6.1 Error in Y (Reference Method): Reproducibility
SDD, the Standard Deviation of Difference, is often used to quantify the analytical
inaccuracy. It is based on the difference between several repeated Y-measurements
of the same sample. Depending on how costly or time consuming each
measurement is, you may choose to measure each of the total of n samples several
times (costly, but precise), or do it on only a few, or even on just one sample
(inexpensive, but progressively less precise). In the latter case you assume, or hope,
that the measurement inaccuracy is the same also at all other measurement levels
than the one(s) selected, for which reason one often chooses a mean or intermediate
level. This is of course taking a risk (potentially taking a huge risk in some
situations), but this type of risk-taking is often needed to some degree in practical
work.

In experimental design the so-called center point corresponds to the experiment in
which all experimental variables (factors) are precisely at their mean value. When
response measurements are costly, one often chooses to use only such center points
to determine the analytical inaccuracy.

How many measurements you make on each sample depends on cost, tradition,
regulations and also on the expected size of the measurement inaccuracy. If one
does not know how many replicates to use, a good suggestion is always to take a
few pilot measurements, say three to five, and see how much they differ. If this
empirical measurement variation is large, take a few more until you have a feeling
for the stability of this replicate distribution. Then decide how many to use in
future measurements.

The Standard Deviation of Difference compares differences in replicates, normally
paired replicates of each sample in the whole measurement series.

Equation 9.1    $SDD = \sqrt{\frac{\sum_{i=1}^{n} (d_i - d_m)^2}{n - 1}}$

where
d_i = difference in Y between the two replicates of sample i, $d_i = y_{i1} - y_{i2}$
d_m = mean value of all replicate differences, $d_m = \frac{1}{n}\sum_i d_i$
n = number of samples

A way to improve the precision of the Y-measurements is to increase the number of
replicates. For more than two replicates, the d_i are all paired differences and d_m is
the mean of all paired differences. When using, say, triplicates (3 measurements of
each sample) replace n with 2n in the formula for SDD.

When calculating SDD one normally assumes that each sample is homogeneous, so
the analytical inaccuracy, which is estimated by SDD, primarily consists of the
variation in the reference measurement method, preparation procedure and
instrument uncertainty.
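For duplicate reference measurements, the SDD of Equation 9.1 can be computed in a few
lines; the sketch below (Python/NumPy, with purely illustrative numbers) assumes y1 and
y2 hold the first and second Y-measurement of each sample.

    import numpy as np

    def sdd(y1, y2):
        """Standard Deviation of Difference (Equation 9.1) for paired replicates."""
        d = np.asarray(y1, float) - np.asarray(y2, float)    # d_i = y_i1 - y_i2
        return np.sqrt(np.sum((d - d.mean()) ** 2) / (len(d) - 1))

    y1 = [10.2, 11.5, 9.8, 12.1]       # hypothetical first measurements
    y2 = [10.0, 11.9, 9.9, 12.4]       # hypothetical second measurements
    print(round(sdd(y1, y2), 3))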

9.6.2 Stability over Consecutive Measurements: Repeatability
The Standard Deviation, SDev (also called SD), is used to express the repeatability
of measurements, for example in the follow-up to routine measurements where a
calibration model is used to predict Y-values based on X-measurements. The
repeatability is checked to make sure that the instrument gives the same answer
every time. SDev is assumed to include variations due both to inhomogeneity in
samples and instrument errors.

SDev is simply the square root of the total variance of all the repeated
Y-measurements.

SDev is calculated either by measuring one sample several times, or by measuring
several samples a few times. The latter method is of course more time consuming
but gives a more reliable result.

Alternative 1: Measure one sample several times

Equation 9.2    $SDev = \sqrt{\frac{\sum_{i=1}^{n} (y_i - y_m)^2}{n - 1}}$

where
y_m = average measurement of y for all the replicates
y_i = y-measurement of replicate i
n = number of replicates

Alternative 2: Measure many samples a few times


n samples are measured J times each, e.g. three times. Calculate the pooled average
variance and then the standard deviation (SDev is not additive, but the variance is).

1) Calculate the replicate variance for each sample i:

Equation 9.3    $V_i = \frac{\sum_{j=1}^{J} (y_{ij} - \bar{y}_i)^2}{J - 1}$

where
J = number of replicate measurements on one sample
i = sample number

2) Calculate the pooled average variance for all n samples:

Equation 9.4    $\bar{V} = \sum_{i=1}^{n} \frac{V_i}{n}$

where
n = number of samples

3) Calculate the average standard deviation:

Equation 9.5    $SDev = \sqrt{\bar{V}}$

The full formula is:

Equation 9.6    $SDev = \sqrt{\frac{1}{n (J - 1)} \sum_{i=1}^{n} \sum_{j=1}^{J} (y_{ij} - \bar{y}_i)^2}$
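A minimal sketch of the pooled calculation (Equations 9.3-9.6) in Python/NumPy is given
below, assuming Y is an array with one row per sample and one column per replicate; the
triplicate numbers are purely illustrative.

    import numpy as np

    def pooled_sdev(Y):
        """Pooled average standard deviation (Equations 9.3-9.6)."""
        Y = np.asarray(Y, float)                       # (n samples) x (J replicates)
        n, J = Y.shape
        Vi = np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2, axis=1) / (J - 1)
        return np.sqrt(Vi.mean())                      # sqrt of pooled average variance

    Y = [[10.1, 10.3, 10.2],                           # hypothetical triplicates
         [11.8, 11.9, 12.1],
         [ 9.7,  9.5,  9.6],
         [12.4, 12.2, 12.5]]
    print(round(pooled_sdev(Y), 3))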

9.7 Handling Replicates in Multivariate Modeling
The above definitions may be familiar from elementary statistics, but they relate to
only one variable at a time. So how does one handle replicates and repeated
measurements in multivariate modeling and prediction? A good idea is first to
make a PCA model based on all samples including their replicates. In the score plot
you will get a good overall impression of how different the replicates are, giving a
very important visual impression of the replicate variation in its relation to the
overall variance. It is a good rule to name the objects in such a way that you can
easily identify different samples and their replicates – with but the absolute
minimum of annotation though!

Figure 9.1 - Replicates in a score plot (replicates have been given identical "names")
[Score plot of PC1 vs. PC2 for the Wheat-0 data (X-expl: 93%, 7%); the replicates of each sample are plotted under the same name and form small, tight groups]

Hopefully you will be able to see the small groups (triplets) consisting of replicates,
well separated from other object replicate groups. In Figure 9.1 you can see the
triplicates quite clearly.

If the variation between replicates is comparable to the variation between different
samples, or even larger, there is of course a major problem. The replicate
inaccuracy is at least as large as the inter-sample variation. (Look also for samples
with "outlying" replicates, which may show an erroneous replicate measurement).

In Figure 9.2, for example, one replicate of sample 40 is more similar to sample 23
than the other replicates of sample 40.

Figure 9.2 - Replicates in a data set with more noise
[Score plot of PC1 vs. PC2 for the Wheat-1 noisy data (X-expl: 75%, 7%); the replicate groups are less well separated, and one replicate of sample 40 lies close to sample 23]

Averaging
After having checked that the inter-replicate variation is not in danger of destroying
the data analysis proper, it may often be an advantage to prepare two sets of
pertinent score plots, one including all replicates, but also one in which only the
"average samples" from the replicates are depicted. By using these averages one has,
to a certain degree, damped the inaccuracies in the model, but by also having access
to the full score plot it will always be possible to factor in the replicate variation
again later.

The Average function in The Unscrambler (Modify - Transform menu) calculates the
average of n adjacent (neighboring) samples. You must therefore make sure that you
have the same number of replicates of each sample before you average, when using
this option.

What should you do if you, for example, have three replicates of most samples, but
only two of some? It is of course possible simply to add a third sample where one is

lacking by adding the average of the other two. If the score plot shows the same
replicate variation for this sample as for the other samples, this is quite safe.
Likewise, if there are a few extra replicates in a few samples, you may replace
these by an average. Again, check the score plot first in order to fully grasp which
of these averaging “tricks” is necessary.

Note!
The number of samples, n, will be inflated when you have replicates. If
your data set for example contains triplicate X-measurements, the data
analysis is not really carried out on 3n independent objects. Many
standard statistics are based on the number of effective samples
present, n or (n-1). It is entirely up to you to keep track of how and where
to avoid this pitfall.

Cross-Validation with Replicates


An alternative is to keep all replicates as they are, but you must then take the
appropriate precautions at validation. If, for example, you want to use full cross-
validation, this would mean that all replicated samples are always in the model.
This has the possible consequence that you will easily be in the position of
estimating how well a sample predicts its own replicate. Of course this would lead to
overly optimistic interpretations, i.e. you could easily end up getting an assessment
more of the replicate error than of the prediction error. This depends both on
how different the samples are between themselves and on the replication variances.

To avoid this, you should keep all replicates of particular sample together in the
validation process. This is done by selecting cross-validation with exactly as many
segments as there are sets of replicates. The Unscrambler has an easy option for
this “systematic selection” of cross-validation segments to allow you to put all
replicates of a sample in the same segment. This requires the same number of
replicates for all samples, which can be achieved as described above.
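Outside The Unscrambler the same idea can be expressed with, for example, scikit-learn's
grouped cross-validation, as in the hedged sketch below: every replicate of a sample
carries the same group label, so all replicates end up in the same validation segment.
The data here are just random numbers used to demonstrate the mechanics.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

    rng = np.random.default_rng(0)
    groups = np.repeat(np.arange(10), 3)      # 10 samples measured in triplicate
    X = rng.normal(size=(30, 50))             # illustrative random X-data
    y = rng.normal(size=30)                   # illustrative random Y-data

    # One cross-validation segment per set of replicates
    y_cv = cross_val_predict(PLSRegression(n_components=2), X, y,
                             groups=groups, cv=LeaveOneGroupOut())
    rmsecv = np.sqrt(np.mean((y_cv.ravel() - y) ** 2))
    print(round(rmsecv, 3))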

Replicates Regarded as New Samples


If you can truly regard the replicates as new samples, e.g. they are made up again
individually (sampled/prepared/observed again), using the same sampling and
sample preparation procedure, you may well use standard “random selection”
cross-validation or even use some of the “replicates” as the test set. This usually
works fine because now you have also carefully built in the relevant sampling and
experimentation errors in your data set.

From these simple illustrations above it will again be apparent that validation is
certainly not just some standard procedure to be applied to whatever data set is at
hand. Careful reflection on the entire data analytical process is needed.

9.8 Validation in Practice


When selecting training objects for calibration, the validation issues are thus
already also on the agenda. A prediction model that is not properly validated is in
principle quite useless. Of course, such unvalidated models may well be used, but one
will never know their prediction accuracy and precision, and so they are scientifically
and practically worthless!

Practical validation can be done in three ways. Sometimes starting with leverage
correction is relevant in a specific problem-context, always remembering of course
to revalidate properly for the final model. This may then either be test set validation
or cross-validation if there really are not enough objects for a test set.

Comments on these three methods from a practical point of view are given below in
sections 9.8.1 to 9.8.4. The basic approaches were described in detail in chapter 7
and a final overview is also given in chapter 18.

9.8.1 Test Set


This is the best validation if only there are enough objects. Select a set of test
objects to be used only for validation. The test set must be representative of future
predictions. Ideally it should be approximately as large as the training set, but at the
minimum at least 25% of the size of the training set if and when samples are in
short supply. The models you make will be tested against this set of test objects.
The ideal case is in fact to be able to have two test sets, the first one used for model
dimensionality optimization, the other for assessment of the particular future
prediction strength (accuracy and precision).

9.8.2 Cross Validation


Use cross-validation only if and when there are not enough objects for a separate
test set or when you cannot pick out a representative test set. Make the calibration
set as large and representative as possible. Now a series of sub-model calibrations
will automatically be run, based on the different segments chosen (by you), and the
appropriate cross-validation statistics will be summed up over the corresponding
segmentation series. You can choose alternative ways to select these validation
segments, as was described above.

If the data set is relatively small, one would use segments that are at most 10% of
the total number of objects. Full cross-validation will become more and more
relevant if/when the number of available objects decreases even further. If the data
set is large and you only choose two segments, you will of course not get the same
effect as running two proper test sets (on two calibration sets of n/2 samples), as is
well known by now...

On the other hand, if there are very many samples, one may consider using
segmented cross-validation to imitate a series of three, four... test set validations.
In general, such a "many-sample" situation is no problem at all.

But always be careful about replicates. If running a full cross-validation on a data
set containing many measurement replicates, one will actually validate the model
on essentially the same samples as the model was built upon. This of course results
in an artificially low residual variance, so the model will look fine, but now you
should know better. Use systematic selection of cross-validation segments in such a
way that all measurement replicates go into the same segment.

9.8.3 Leverage Correction


This is only an approximate validation; a “quick and dirty” method. Use it only in
the first runs to get an impression of how good the model is.

The influence of each object on the model, the leverage, will be computed. The
residuals of the calibration objects will then be "corrected" according to the
reciprocals of their leverage influences, i.e. their weighting is increased or decreased
according to their modeling influences. Since all objects are used for both modeling
and validation, this method may often give estimates of the prediction error that
are too optimistic.

9.8.4 The Multivariate Model – Validation Alternatives
The scores and loadings calculated with either test set validation, cross-validation
or leverage correction are always the same however. Validation can be pictured as
a “piggyback” procedure, evaluating the stability of the multivariate model(s).

They only differ in the manner validation objects are either brought in from outside
(test set validation), or by the particular way they are sub-sampled from the training
set (cross-validation). The leverage-corrected validation simply tries to counter-
weight the effect of wrongly using the objects twice according to their distance-to-
model-center leverages.

This means that the multivariate model will always be the same, irrespective of the
particular validation method chosen. The consequences of this choice are that, in
general, the estimates of the prediction error will differ, but hopefully not by much.
There is one important imperative associated with this method choice; it is not a
subjective choice to be made based on the data analyst's preferences. On the
contrary, it is the specific data structure, and this data structure alone, which
determines the specific choice of validation method. The only subjectivity in this
context would be that based on differences with respect to the relevant practical
experience(s) of the data analyst. One should certainly not go searching for the
particular validation alternative which happens to give the smallest prediction error.
This would be very bad form indeed, never to be undertaken!

9.9 How Good is the Model: RMSEP and Other Measures
There are several measures of how good a model is. Model fit or the corresponding
lack of fit both say something about how well the A-dimensional model has been
fitted to the training data set. The prediction error is an expression of the error we
can expect when using a calibration model in future predictions. The correlation
between predicted and measured values (from the predicted vs. measured fit) is
another way to see how well the model performs. Residuals show how well each
individual object is modeled or predicted. There are alternative measures for many
of these concepts and various ways to plot them, which may not only tell us how
good or bad the model is, but also indicate problems.

9.9.1 Residuals
The residuals, E or F, are the deviations between the measured and modeled or
predicted values in each object. The residuals are what has not been modeled.

Equation 9.7    $E = X - X_{estimated}$        $F = Y - Y_{predicted}$


Here “Xestimated“ is the projected (modeled) X-values for each object and “Ypredicted“
is of course the predicted Y-variable values for each object.

The residuals may conveniently be plotted for all samples, for example in run order
(the order of the objects in data matrix X) or versus the size of the predicted
Y-values. A large residual for an object means that this object has not been well
fitted (it may even be an outlier). The residuals should be randomly distributed,
meaning that the remaining unexplained variations in the data should be similar to
white noise. A systematic pattern indicates that some systematic variation in the
data remains, which is not described.

9.9.2 Residual Variances (Calibration, Prediction)
The calibration variance is a measure of the model fit - how well has the model
been fitted to the training data set. However, as should be obvious now, this does
not say much about how well the model will work for new data, be this modeling
new X-data or predicting from new X-data. If the purpose of the modeling is
prediction, we need to determine the error associated with predicting new samples
in the future: the prediction error. The validation residual variance is an expression
of the model’s ability to predict new data, in PCR and PLS, the prediction error.

Validation provides an estimate of the prediction error, provided that

• an appropriate validation method has been chosen


• the calibration samples are representative of future prediction samples
• the validation samples are representative of both the calibration samples and
future prediction samples.

The residual variance is the summarized error expression: the sum of the mean
squares of the residuals. The calibration variance for X or Y is a measure for how
well the X and Y data, respectively, have been modeled. The validation variance
for Y expresses how well the model will perform for similar, new data. The
validation variance for X indicates how well the validation data have been
projected onto the PCA - or the PLS components model.

Both the calibration and validation variance can be plotted either as residual
variance (which is supposed to decrease as the number of PCs increases), or as
explained variance, which shows which percentage of the total variance has been
explained by the increasing number of components. The residual and the explained
variance plots are but two alternative expressions based on exactly the same data;
they always sum to 100%, see Figure 9.3.

Figure 9.3 - Residual calibration and validation variances; both can alternatively be expressed as explained variances as well
[Two panels: residual Y-variance and % explained Y-variance, each plotted against the number of PCs, showing the calibration (c.Tot) and validation (v.Tot) curves]

Usually one studies the residual variance curves by focusing on their characteristic
shapes, searching for the first local or for the global minimum. This plot’s primary
use (dimensionality optimization) is to find how many PCs to use in the model.
Because the residual variance is dependent upon the original measurement units
and/or model scaling, it may be difficult to use it to compare different models
directly.

9.9.3 Correction for Degrees of Freedom


Note that the calibration variances in The Unscrambler are corrected for degrees
of freedom. This correction is not applied in all statistical packages, and there are,
furthermore, different ways to do this. This is done to correct for estimation errors;
if the modeling is based on many samples or variables, this error is reduced because
we assume that more information is available. At the same time the prediction error
estimate corresponding to a complex model is increased, because you have
extracted a lot of information for each new PC you calculate. The reason for this
correction is to avoid the danger of overfitting, and to give a realistic impression of
the errors.

9.9.4 RMSEP and RMSEC - Average, Representative Errors in Original Units
RMSEP (Root Mean Square Error of Prediction) and RMSEC (Root Mean Square
Error of Calibration) are direct estimates of the prediction error and the modeling
error in Y respectively, expressed in original measurement units. If one measures a
concentration in, say, the unit “ml”, the RMSEP estimates the prediction error in
“ml”; if Y is measured in %, say, RMSEP will be in % and so on.

RMSEP is defined as the square root of the average of the squared differences
between predicted and measured Y-values of the validation objects:

Equation 9.8    $RMSEP = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}$

RMSEP is also the square root of the validation variance for Y, divided by the
weights used at calibration (scaling).

RMSEC is the corresponding measure for the model fit, calculated from the
calibration objects only.

RMSEP expresses the average error to be expected in future predictions. Therefore
one may conveniently give predicted Y-values using 2 x RMSEP as the estimated
precision; e.g. the concentration in a new sample is 59 ml ± 0.5 ml, where 0.5 ml
would be 2 x RMSEP.

Because bilinear projection modeling makes no assumptions about the statistical
distribution of errors, we cannot give the prediction error as proper statistical
interval estimates (such as twice the standard deviation etc.). In fact with this data
analytical approach we cannot make any statistical inferences at all. RMSEP is the
practical average prediction error as estimated by the validation set. So if you did
a proper job at model validation, and both the calibration set and the validation set
were representative of future prediction sets, then RMSEP should serve as a valid
empirical error estimate.

Some authors are concerned with the fact that RMSEP is estimated using reference
values which are themselves also error-prone (measurement errors in the reference
method), and have consequently introduced the term apparent RMSEP etc. It is
claimed that a correction for this is easily invoked. While in the laboratory
calibration context it may well be possible to estimate the true measurement error
and carry out the pertinent corrections, there are certainly also many other
practical situations, we might term them field situations (technical monitoring,
production plants, the environment or biosystems), in which it is precisely this
comparison with the error-containing reference samples that carries the practical
meaning of the validation.

Note that RMSEP is the average error, composed of large and small errors
altogether. This is often well illustrated in the important Predicted vs. measured
plot, where many samples in general will be predicted well and some badly. In
Figure 9.4, sample 15 has a much larger error than sample 45.

Figure 9.4 - Predicted versus measured plot
[Predicted vs. measured Y-values for the pap-5 data (Y-variable PRINTHRU, 8 PCs); most samples lie close to the line, while e.g. sample 15 deviates considerably more than sample 45]

9.9.5 RMSEP, SEP and Bias


SEP and Bias are two other statistical measures closely connected to RMSEP.

Bias represents the averaged difference between predicted and measured Y-values
for all samples in the validation set.

Equation 9.9    $Bias = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)}{n}$

Bias is a commonly used measure of the accuracy of a prediction model. Bias is
also used to check if there is a systematic difference between the average values of
the training set and the validation set. If there is no such difference, the Bias will be
zero.

SEP, the Standard Error of Performance, on the other hand expresses the precision
of results, corrected for the Bias.
Equation 9.10    $SEP = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i - Bias)^2}{n - 1}}$

If one can reasonably expect a normal distribution of the samples included in the
calculation of the Bias, 95% of them will have $|\hat{y}_i - y_i| \leq 2 \cdot SEP$ and
5% of them will have $|\hat{y}_i - y_i| > 2 \cdot SEP$. Therefore 2*SEP may at times
be regarded as comparable to a 95% confidence interval.

This also means that the uncertainty of the Bias is directly dependent on SEP. SEP
increases when the Y-values (the reference values) are inaccurate.

The relationship between RMSEP, SEP and Bias is statistically well known:

Equation 9.11    $RMSEP^2 \approx SEP^2 + Bias^2$

Normally, indeed hopefully, there is no Bias, which then leads to RMSEP = SEP.

If RMSEP is unusually high, given an otherwise reasonable data structure, the
reason may either be a bad model specification or a non-representative validation
set. The latter is the case if the Bias is large.
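The three quantities are easily computed from the predicted and measured Y-values of the
validation set; the sketch below (Python/NumPy, hypothetical numbers) follows Equations
9.8-9.10 and prints both sides of the approximate identity in Equation 9.11.

    import numpy as np

    def rmsep_sep_bias(y_pred, y_meas):
        d = np.asarray(y_pred, float) - np.asarray(y_meas, float)
        n = len(d)
        rmsep = np.sqrt(np.mean(d ** 2))                       # Equation 9.8
        bias = d.mean()                                        # Equation 9.9
        sep = np.sqrt(np.sum((d - bias) ** 2) / (n - 1))       # Equation 9.10
        return rmsep, sep, bias

    y_pred = [59.2, 60.8, 58.5, 61.1]           # hypothetical predicted values
    y_meas = [59.0, 61.0, 58.9, 60.7]           # corresponding reference values
    rmsep, sep, bias = rmsep_sep_bias(y_pred, y_meas)
    print(rmsep ** 2, sep ** 2 + bias ** 2)     # approximately equal (Equation 9.11)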

9.9.6 Comparison Between Prediction Error and Measurement Error
What does an absolute estimate of, say, RMSEP = 0.5 ml mean? Is this good or
bad? This of course depends on what you plan to use the model for. A prediction
model for indirect measurements of dioxin emission may require a high accuracy,
whereas a model used to suggest promising sites for gold exploration may be quite
rough, and still be used to find potential mineralisations. This is discussed further
in chapter 11. In any case, both prediction errors and modeling errors may be
compared to the measurement errors, e.g. the error in the reference method
(Y-values). We can never expect the prediction error to be better than this basic
analytical error.

It is therefore natural to compare RMSEP with SDD, or more accurately - to
compare SEP with SDD (see previous section). If there is no Bias, RMSEP equals
SEP, so we continue to focus on SEP. Ideally one would like SEP to be
approximately equal to SDD. However since SEP includes uncertainty in both X
and Y, we usually accept SEP < 2*SDD for many practical applications.

What should we do if SDD is just at the level of accuracy required (for instance by
regulations), and SEP=2*SDD? You can only solve this problem by making better,
more precise Y-measurements, i.e. by reducing SDD. SDD is a function of the
number of repeated measurements, so SDD based on four replicates is only half as
large as SDD based on two replicates.

It is also natural to compare SEP or RMSEP with the measurement level. If
concentrations are measured between 1 and 50 ml, then an RMSEP of 5 ml may be
satisfactory at high levels but most probably totally unacceptable at low levels.
Plotting Predicted versus measured Y-values illustrates this well. If low Y-values
seem to be well predicted and there are other samples with large deviations, then
we may conclude that the prediction error at low levels will be lower than RMSEP.

It is very often of considerable interest to express RMSEP in some (problem-
dependent) relative fashion. You may for instance calculate RMSEP divided by the
mean measurement level, by the half-range of the measured data, or by the highest
level, to get a relative figure, e.g. 10%. The analytical accuracy is often given as a
similar percentage, e.g. ± 6%, which then serves as a good comparison. This
particular relative Y-gauging of the validated RMSEP-results will be seen to be a
particularly problem-dependent issue.

9.9.7 Compare RMSEP for Different Models


A most useful feature of RMSEP is that you can use it to compare different models,
regardless of how the models were made with regard to weighting, preprocessing
of the X-variables, or the number of components used etc.

Note however that if you transform Y, e.g. to log Y, the RMSEP will be in log Y
units. But you cannot back-transform RMSEP directly; exp(RMSEP(log Y)) is not
equal to RMSEP (Y). In this situation a little spreadsheet back-calculation is
necessary.
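The "back-calculation" simply means back-transforming the individual predicted and
measured values before the error is computed, as in this small sketch (Python/NumPy,
assuming a log10 transform of Y for illustration; use np.exp instead if natural logarithms
were used). The numbers are purely illustrative.

    import numpy as np

    log_y_pred = np.array([1.72, 2.05, 1.38])   # hypothetical predictions on log10(Y)
    log_y_meas = np.array([1.70, 2.10, 1.35])   # measured reference values on log10(Y)

    y_pred = 10 ** log_y_pred                   # back-transform to original units first
    y_meas = 10 ** log_y_meas
    rmsep = np.sqrt(np.mean((y_pred - y_meas) ** 2))
    print(round(rmsep, 2))                      # RMSEP expressed in original Y-units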

9.9.8 Compare Results with Other Methods


Chemometricians and other users of projection methods have methods of prediction
validation, which are not always the same as in other disciplines. You will probably
sometimes discuss your results with people who use other methods and other error
estimates. Make sure that you understand how everybody calculates their error
estimates, and that you can explain how you got yours.

For example, some Neural Net and MLR implementations do not validate based on
similar external principles, so here one might see apparent prediction results that
are really only model error estimates proper. As is known by now, an overfitted
model may have a very low apparent error (if evaluated in this fashion), but in
reality will have an unreliable prediction ability.

A commonly used error measure in MLR applications is Residual Standard
Deviation, RSD.

Equation 9.12    $RSD = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n - k - 1}}$
where
n = number of samples
k = number of variables

RSD is, therefore, more of an expression of the modeling error. The value of RSD
becomes very small when there are few variables. For example, in spectroscopy
applications with filter instruments, using only a few wavelengths, the RSD will
seem very low. RSD does not estimate prediction errors (estimation errors of the
regression coefficients), but only errors in the model. It cannot be used for
prediction or validation samples since it is related to the number of samples. RSD
is, therefore, not equal to the prediction error in the above chemometric definition.

The best way to compare an MLR model with a PCR or PLS model is to calculate
an appropriate RMSEP for the same validation or prediction set.

9.9.9 Other Measures of Errors


PRESS
PRESS is defined as the Predicted Residual Error Sum of Squares:

$PRESS = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$

PRESS is the residual Y-variance summed over all validation objects. PRESS is
often used to assess whether an individual, new ("next") component represents a
significant addition to a model, whilst the residual variance as defined in this book
comprises all the included components simultaneously. It is a viable alternative to
use a "PRESS vs. no. of components" plot instead of the conventional residual
validation variance plot, since the Y-axes of the two plots are but proportional to each
other, with a multiplication factor equal to the fixed number of validation samples
involved.

9.10 Prediction of New Data

9.10.1 Getting Reliable Prediction Results


For reliable results, when predicting new samples using PCR or PLS, the model
should be extensively evaluated and have a good prediction ability. When using
The Unscrambler for prediction, automatic warnings will be issued if you try to
predict samples that do not fit in with the model prerequisites.

Once we have found a satisfactory prediction model, satisfactorily validated that is,
the quality of the predicted values is expected to be approximately as good as that
for the average calibration/validation object. Except for outlier warnings and
prediction uncertainty limits, there is no way to check whether the prediction
objects are good or bad, i.e. whether they correspond in general to the data
structure for the training data set.

9.10.2 How Does Prediction Work?


In PCR/PLS prediction, new X-data are projected onto the A model components. Y
is usually estimated using these projected scores and loading matrices, T and P.
The traditional regression equation

Equation 9.13    $Y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n$

can also be calculated from T, W, Q and P; all PLS-models may be "compacted"
into this standard regression formalism. So, alternatively Y may also be predicted
using this traditional equation.

Thus, one set of B-coefficients is calculated for the model with 1 PC included only.
Another set is calculated for the model including 2 PCs, and so on. You should of
course use the set of B-coefficients corresponding to the appropriate optimum
number of components, A.

Note!
Prediction using these B-coefficients gives exactly the same predicted
numerical Y-values as the projection model equations using A PLS-
components! Sometimes, inherent rounding-off errors may produce
small discrepancies; however, they should never be of any quantitative
consequence.

The traditional regression equation is therefore often used for e.g. downloading
prediction models to spectroscopic instruments etc. and for automatic predictions.
The only significant drawback in prediction using the B-vector is that you lose the
outlier detection and interpretation capabilities available with the projection
models.
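For automatic prediction the exported coefficients are all that is needed, as in the
following sketch (Python/NumPy); the intercept b0, the B-vector and the new
X-measurements are all hypothetical values standing in for coefficients exported from a
validated model.

    import numpy as np

    b0 = 2.35                                   # hypothetical intercept
    b = np.array([0.12, -0.04, 0.51, 0.08])     # hypothetical B-coefficients (A PCs)
    X_new = np.array([[1.2, 0.8, 3.1, 2.0],     # new X-measurements, one row per object
                      [0.9, 1.1, 2.7, 2.4]])

    y_pred = b0 + X_new @ b                     # Equation 9.13 applied to new objects
    print(y_pred)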

9.10.3 Prediction Used as Validation


With automatic prediction, the Y-values, the reference values, are usually not
present. But if such Y-values were to be present, perhaps not for all
X-measurements, prediction can now be used as an additional form of validation.
The X-data are used for the prediction and the Y-data are used only to compare the
predicted and the known Y-values. Plotting Predicted vs. Reference values is a
useful way to illustrate this, just as in the validation phases of multivariate
calibration.

This procedure is often used to follow up calibration models used in routine
measurements. At regular intervals you should also take reference measurements
and check the validity of the predictions.

9.10.4 Uncertainty at Prediction


We usually do not make any assumptions about statistical distribution of errors or
samples in bilinear projection methods. When predicting new samples we would
like to give an indication of how precise the predicted values are, using an absolute
error estimate or uncertainty measure. For the reasons stated above this is
impossible however. RMSEP is really often the best we can do. But if we did a
proper job of validation, and if both the validation data and the training data are
representative of the future prediction data, we are justified to assume that RMSEP
represents a fair estimate of the average prediction error. This will, in many
practical situations, be absolutely acceptable.

There have, in addition, been attempts at giving some uncertainty indications valid
for each specific prediction sample. Using The Unscrambler during prediction,
predicted Y-values are given an uncertainty limit called deviations (Dev). These
limits are calculated from the validation variances, the residual variances, and the
leverage of the X-data in the prediction objects. This is based on an empirical
formula originally developed in the 1980’s, then further improved in 1998. If the
X-data for the prediction sample is very similar to the training X-data, then the Dev
interval will be smaller and the prediction more reliable. If the new sample is more
different from the training data, the Dev interval will be larger. This prediction
deviation interval is really most useful when comparing predictions. Note that these
uncertainty limits (Dev’s) mostly indicate to what extent you can trust a particular
predicted value, i.e. a form of outlier detection.

If a predicted value is “bad”, e.g. gives outlier warnings, has large uncertainty
limits or seems to fit badly with the model, the reason may either be that the
prediction object is dissimilar to the calibration samples, or that the validation set
was different from the calibration set.

9.10.5 Study Prediction Objects and Training Objects in the Same Plot
Scores for the prediction set are also calculated during prediction. You may display
scores for the model and the prediction objects in the same plot, (using the Results
- General view function and Add plot). This will show how close or far apart the
two data sets are in the X-space, which is a very illustrative way to compare the
training and the test sets directly, and will usually also help explain the reason for
bad predictions etc.

9.11 Coding Category Variables: PLS-DISCRIM
Category variables such as indicator variables, discrete variables or "dummy
variables", can be used in addition to continuous variables to some degree in the
present kind of bilinear models. For dichotomous, or binary, data for example, you
may use a special variable which could be called "Yes/No", "present/not present"
or similar. It is preferable to use the values -1, +1 for such variables rather than
0, 1, because of better symmetry.

The above dichotomous dummy variable facility often works very well, provided
there are not too many of these alongside the dominating continuous variables.
However, the simple analogy between (-1,1) and any dichotomous category
classification cannot be carried over to the case where more than two categories are
involved. It is most emphatically wrong to try to code multiple categories, e.g. (A,
B, C, D) into a discrete variable realization space, e.g. (1,2,3,4). There is no
guarantee that the “metric” discriminating between categories (A,B,C,D) should
happen to correspond to the equidistant, rational metric set out by (1,2,3,4). This is
a very serious mistake to make.

If a category variable can take more than two values, then we must use one
category variable for each category type. Example:

Table 9.1 - So-called "re-coded variables", for multiple-category assignment.
There must be one binary dummy variable for each category

Type    X1    X2    X3
A       +1    -1    -1
B       -1    +1    -1
C       -1    -1    +1
A       +1    -1    -1

This is often referred to as “re-coding”. With this approach, we are in fact able to
make use of these, at best, semi-quantitative category variables. In fact quite an
important issue in PLS-modeling concerns what has been termed “PLS-
discrimination modeling”, or PLS-DISCRIM for short. In chemometrics there have
been some spectacular application showcases based on this simple, yet enormously
powerful concept. We have found this issue so important that we have included
some PLS-DISCRIM problems amongst the new master data sets in this revised 4th
edition.
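The re-coding of Table 9.1 is easy to automate; the sketch below (Python with pandas,
outside The Unscrambler) builds one -1/+1 dummy variable per category from a
hypothetical Type column.

    import pandas as pd

    types = pd.Series(["A", "B", "C", "A"], name="Type")   # hypothetical category data
    dummies = pd.get_dummies(types).astype(float)          # 0/1 indicator columns A, B, C
    dummies = 2.0 * dummies - 1.0                          # map 0/1 to -1/+1 for symmetry
    print(dummies)                                         # reproduces Table 9.1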

9.12 Scaling or Weighting Variables


Scaling is most important when different types of variables are analyzed together.
The most frequent scaling is of course standardization, i.e. dividing all variables by
their empirical standard deviation, nearly always in the guise of autoscaling.

How to Handle Different Kinds of Variables?


If the variables are of different types and the data consequently appear in
significantly different empirical ranges, the standard use is weight = 1/SDev. This
makes small absolute variations play a correspondingly larger role because all
variables now get the same variance. Here is an overview of some of the more
important weighting options (Table 9.2).

Table 9.2 - Weighting of variables

Type                      Example                     Values                       Weighting
Coded -1/+1               Low or high, A or B         -1, +1                       None or 1/SDev
Coded 0/1 ¹               Low or high, A or B         0/1 or replace by -1/+1      None or 1/SDev
0-100 %                   Vary between 0 and 100      0.5, 30, 99...               1/SDev
0–1                       Vary between 0 and 1.0      0.01, 0.1, 0.9...            1/SDev
0-1 %                     Vary between 0 and 1        0.1, 0.7, 0.9...             1/SDev
Interactions and squares  x1*x1, x1*x2, x1*x2*x2...   any                          1/SDev

¹ 0 and 1 can be replaced by -1 and +1. If you should ever happen to import a data set
in which such variables have been coded (0,1), to replace (0,1) in a variable (e.g.
V1) with -1 and +1, select Modify - Compute, and type V1=(V1-0.5)/0.5. For the
whole matrix, type X=(X-0.5)/0.5, and specify the range of variables when the
program asks for it.

Another way to avoid amplifying the effect of noise is to scale by 1/(SDev + C),
where C is a small constant related to the accuracy of the data. This prevents
variables with a very low standard deviation from producing a very high value of
1/SDev.
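A hedged sketch of this weighting in Python/NumPy is given below; the offset C = 0.01 is
only an illustrative default and should in practice be chosen from the known accuracy of
the data.

    import numpy as np

    def autoscale_with_offset(X, C=0.01):
        """Center each variable and weight it by 1/(SDev + C)."""
        X = np.asarray(X, float)
        sdev = X.std(axis=0, ddof=1)                   # per-variable standard deviation
        return (X - X.mean(axis=0)) / (sdev + C)       # weight = 1/(SDev + C)

    # X_scaled = autoscale_with_offset(X)              # X: (objects x variables) array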

Note!
You may use the Weights button to allocate individual weights to individual
variables or to sets of variables, which should be treated identically. Such variable
sets are often referred to as "blocks". Block-scaling is sometimes useful, for
example when an X-matrix is composed of a combination of variables from several
instruments.

9.13 Using the B- and the Bw-Coefficients


The regression coefficients in the traditional regression equation (Equation 9.14)

Equation 9.14    $Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_n X_n$,   i.e. $Y = XB$

can be calculated directly from the pertinent PLS-loadings, and they give exactly
the same predicted Y-values. The B-vector can thus be used to predict new
Y-values with a minimum of fuss, when we are truly only interested in the
prediction result, like e.g. in automation and process control etc.

As we have mentioned before, however, the B-vector may be difficult to interpret
with respect to both important and unimportant X-variables. A large b-coefficient
may indicate an important variable - or a variable with small absolute values but
"large" relative differences! Small b-values may also be caused by several variables
that counteract each other. If there is interaction between X-variables, even the sign
of the b-coefficients may be misleading. The B-vector should therefore not be used
for interpretation on its own.

Always study the B-vector together with the appropriate loading-weights. Check
this by plotting the loadings for the same PCs.

The B-vector is cumulative (the B-vector is for one complete model with A
components), while there is one set of loadings for each PLS-component – much
more informative and much easier to interpret!

Bw-coefficients are always calculated in The Unscrambler if the data have been
weighted (scaled at the Task menu). They should be used to predict new Y-values
from new weighted X-values.

Equation 9.15    $Y = X_{weighted} B_w$

The Bw-coefficients take the weighting into account, with the aim of disregarding
small or large original variable values. Therefore a large Bw indicates an important
X-variable. The sign may, however, still be "wrong" if there is interaction. The Bw
option is primarily used for example for export from The Unscrambler, since all
internal prediction automatically is carried out with the appropriate weights, etc.

9.14 Calibration of Spectroscopic Data


Spectroscopy data and other similar full-spectrum (multi-channel) measurements
are ideal for PLS applications, or rather, PLS-regression is ideal for multivariate
calibration of this kind of data. Spectroscopic data are usually highly collinear,
which methods like MLR generally handle unsatisfactorily but which projection
methods handle well.

Spectroscopic data usually comprise rather large data sets, often with hundreds – or
even thousands – of variables, but because they are collinear PLS can handle them
even with few objects. Chromatography data, acoustic spectra, vibration data and
NMR are examples of similar types of data with the same characteristics. In the
rest of this book the term “spectroscopic” may often be taken in a more general
generic sense, meaning also applying to data with similar characteristics to the
spectroscopic data.

The X-variables typically represent wavelengths while the X-data themselves are
often absorbance, reflectance or transmission readings etc. In chromatography the
X-variables are usually retention times and the X-values are peak heights (single
peaks, or integrated). The Y-variables (often called constituents or properties) may
be chemical concentrations, protein contents or physical parameters such as octane
number a.o.

The dominating PLS-application in spectroscopy is for indirect measurement and
calibration, where the aim often is to replace costly Y-reference measurements with
reliable predictions from fast and inexpensive spectroscopic measurements (X).

PLS is well established within NIR today, because NIR applications often require
methods based on many wavelengths due to non-selective, full-spectrum
wavelength responses. PLS-applications are however being continuously developed
also in other wavelength ranges, such as IR, UV or VIS, where potential
information remained partly hidden earlier when univariate methods or MLR were
the only calibration options. There are also many recent developments within
acoustic chemometrics, which relies heavily on PLS-calibration, see one of the
master data sets in chapter 13.

9.14.1 Spectroscopic Data: Calibration Options
You make a calibration model for spectroscopic data in the same way as for other
data, with a few exceptions. PLS is usually preferred since it focuses directly on the
Y-values. Spectra usually contain so much information that the same X-spectra may
easily be used to calibrate for many types of constituents, so PCR will typically use
many PCs to relate to the Y-variable of interest.

Standardization is sometimes claimed to be unnecessary in spectroscopic
calibration, because the variables are of the same type, and information is usually
considered to be related to the broader peaks. Standardization is therefore claimed
to actually amplify noise. However, there is nothing special – at all – about
spectroscopic data. It is always - only - the relative differences between the
absolute variable ranges of the n X-vectors that determine whether or not
standardization is necessary, i.e. whether the p X-variables display significantly
different variances, or not.

The automatic outlier detection limits should typically be raised, because this kind
of spectra is usually very precise and a low limit will flag too many objects as
outlying. A usual limit would be 4.0-6.0 in this case. As this modification is highly
application-dependent, you will have to find suitable limits for your own data. At a
low limit many objects are indicated as outliers. Studying the list of outlier warnings
shows you which are just above the limit and which are far higher. Adjust the limit
accordingly - good domain-specific knowledge is of course necessary.

Preprocessing is often required, for example either logarithmic transformation of Y
or other more specific spectroscopic transformations of X. Diffuse reflectance data
should always be transformed into Kubelka-Munk units. Spectra may often also
display various sorts of scatter effects, and thus Multiplicative Scatter Correction,
or perhaps differentiation, may be relevant. If the spectra are noisy, some sort of
smoothing may also be appropriate. However, always start by modeling the raw
data and only gradually try tentative transformations if the modeling needs this.
Again, your own personal experience with preprocessing is the thing to strive for,
indeed as quickly as possible. This cannot be learned from a book, but a good
source of information on preprocessing is Beebe, Pell & Seasholtz (1998).
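
To give a feel for what such pretreatments involve, here is a minimal, hypothetical
Python sketch of two common ones - multiplicative scatter correction (MSC) and
Savitzky-Golay smoothing/differentiation. It only illustrates the principle, not how
any particular package implements it; the synthetic array X stands in for your own
spectra (one row per sample).

import numpy as np
from scipy.signal import savgol_filter

def msc(X, reference=None):
    # Multiplicative scatter correction: regress each spectrum on a reference
    # spectrum (here the mean spectrum) and remove its offset and slope
    ref = X.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(X, dtype=float)
    for i, spectrum in enumerate(X):
        slope, offset = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - offset) / slope
    return corrected

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(10, 100))).cumsum(axis=1) / 50.0   # synthetic "spectra"

X_msc = msc(X)                                                        # scatter correction
X_smoothed = savgol_filter(X, window_length=11, polyorder=2, axis=1)  # smoothing
X_deriv1 = savgol_filter(X, window_length=11, polyorder=2, deriv=1, axis=1)  # 1st derivative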


9.14.2 Interpretation of Spectroscopic Calibration Models
Interpret spectroscopic models as usual. However, the 2-vector loading plot is
usually unsuitable: since the X-variables usually are highly correlated, they will be
displayed as a non-resolvable tangle across the plot, and these intercorrelations are
not very interesting in this type of plot anyway. If the Y-loadings are displayed in
the same plot, all the X-variables will typically cluster around the origin, due to
scaling.

In PLS2 the 2-vector Y-loading plot shows the relationships within the set of
Y-variables and this is often useful information.

With spectroscopic data the 1-vector loading-weight plot is often very useful, e.g.
for understanding the chemistry of particular applications. Large loading-weights
point to wavelengths where there is, for example, significant absorption related to
the constituent of interest. This type of interpretation is a vital source of
information to help you understand the chemistry of the samples. The same applies,
for instance, to acoustic spectra, where the loading-weights show which frequency
responses are related to the property of interest.

By studying the specific patterns in the loading-weights you may also - with some
experience - begin to be able to interpret which “effect” is being modeled in
specific calibration situations: peaks, shifts, double-peaks, scatter, or combinations
of these effects. Professional interpretation of loadings and loading-weights (by
spectroscopists, analytical chemists, etc.) requires extensive practical experience,
but there is also a lot of literature on this particular subject.


Figure 9.5 - Spectra with distinct scatter effects

[Line plot of the sample spectra: absorbance (1.2 to 3.6) vs. X-variable index
(20 to 100); samples 5, 6, 7, 8ny, 9 and 12 shown]

As a typical example, strong scatter, as seen in Figure 9.5, often gives loadings in
the first PLS-component with the pattern shown in Figure 9.6.

Figure 9.6 - 1-D loading plot representing scatter effect in PC1


[Loadings (0.06 to 0.14) vs. X-variables (20 to 100); model test-1,
PC1 explains 99%]

Another example concerns shifts in the X-spectra. This gives the type of loadings
shown in Figure 9.7. The shift form in the loading curve is shown in all significant
PCs.


Figure 9.7 - Loading plots showing typical shift effects in PC1, PC2...
[Two loading plots vs. X-variables (20 to 100); left: test-0, PC1 (51% explained);
right: test-0, PC2 (27% explained)]

Determination of the optimal number of PCs follows the usual rules. In addition it
is natural to study the B-vector and the 1-vector loadings for a varying number of
PCs. The 1-vector loading-weights become noisy (around an effective zero-level)
when you have passed the optimum, because you start to model noise and overfit.
This can be a powerful alternative way to determine the number of components to
use in more complex applications.
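
Outside The Unscrambler the same check can be sketched with, for example,
scikit-learn (a hypothetical illustration on synthetic data; here the cross-validated
RMSEP plays the role of the residual validation variance, and the loading weights
are extracted for inspection):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 100)).cumsum(axis=1)           # synthetic spectra
y = 0.05 * X[:, 40] + rng.normal(scale=0.1, size=30)    # synthetic reference values

for a in range(1, 8):                                   # try 1 to 7 components
    pls = PLSRegression(n_components=a, scale=False)
    y_cv = cross_val_predict(pls, X, y, cv=5).ravel()   # cross-validated predictions
    print("A = %d   RMSEP = %.3f" % (a, np.sqrt(np.mean((y_cv - y) ** 2))))

# Loading weights of a fitted model: one column per component; the columns
# beyond the optimum typically look noisy around an effective zero-level
w = PLSRegression(n_components=7, scale=False).fit(X, y).x_weights_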

9.14.3 Choosing Wavelengths


In spectroscopic applications it is often appropriate to refine the model by deleting
wavelengths. How should one select what to delete?

Experience and experiments show clearly that PLS manages to model structures
both on full-spectrum data sets and data from filter instruments.

If you reduce the number of wavelengths by using only the ones that carry most
information (the ones with high PLS-loading values), the model may be safer and
easier to interpret (fewer factors) and the prediction error may be reduced. This is
the result of avoiding wavelengths with large noise fractions and irrelevant
information, or non-linearities.

If you reduce the number of wavelengths by removing the most important
wavelengths, PLS may often still give you a good model (!), but with more
components, because of the powerful full-spectrum calibration facilities of PLS. In
this case you need more components to find the information left in the undeleted
wavelengths, but the risk of overfitting increases.


Choosing fewer wavelengths arbitrarily still may give rather good models, but an
intelligent, and problem-dependent, selection technique will always improve the
results considerably.

In practice you should study the B-coefficients that give the accumulated picture of
the most important wavelengths - for the final, completely validated model. For a
model with, say, four valid PCs (giving the minimum residual variance), you study
the B-coefficients for 4 PCs. Select the variables with the highest absolute
B-values. Then recalibrate with only these variables and evaluate the results anew.
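
As an illustration of this procedure outside The Unscrambler, the hypothetical
Python/scikit-learn sketch below ranks the wavelengths by the absolute value of the
B-coefficients of a 4-PC model and recalibrates on the 20 top-ranked ones (both
numbers are arbitrary assumptions for the example, and the data are synthetic):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmsep(model, X, y, cv=5):
    # Cross-validated root mean square error of prediction
    y_cv = cross_val_predict(model, X, y, cv=cv).ravel()
    return np.sqrt(np.mean((y_cv - y) ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 111)).cumsum(axis=1)           # synthetic spectra
y = 0.03 * X[:, 40] + rng.normal(scale=0.1, size=60)    # synthetic reference values

full = PLSRegression(n_components=4, scale=False)
print("RMSEP, all wavelengths:", round(rmsep(full, X, y), 3))

b = np.ravel(full.fit(X, y).coef_)            # B-coefficients, one per wavelength
keep = np.argsort(np.abs(b))[::-1][:20]       # indices of the 20 largest |B| values

reduced = PLSRegression(n_components=4, scale=False)
print("RMSEP, reduced model  :", round(rmsep(reduced, X[:, keep], y), 3))

In a real application you would of course compare the two RMSEP values, and the
interpretability of the two models, before settling on the reduced wavelength set.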

Wavelengths chosen to give optimal MLR solutions will also be useful for PLS.
There is a lot of work to be done in this field, and much to be gained from
optimizing an automated procedure for selecting the best wavelengths. A great
many papers have been published in recent years on this issue, which however is
just outside the scope of this introduction to (generic spectroscopic) multivariate
calibration.

This long, but essential, chapter on the many additional practical aspects of
multivariate calibration - the crucial application experience - has now come full
circle. There is little more we can teach you here; the rest is practice.
We can now progress to three full chapters of realistic, real-world multivariate
calibration exercises, many of which display very interesting non-standard issues,
all ready for you (chapters 10, 12 and 13).


10. PLS (PCR) Exercises: Real-World Application Examples - I

10.1 Exercise - Prediction of Gasoline Octane Number
Purpose
This exercise illustrates outlier detection at calibration and prediction, based on
spectroscopy data.

Context and Problem


The data were originally provided by UOP Guided Wave Inc., USA for an
application note to illustrate the particular power of PLS combined with NIR-
technology. The data were used in a feasibility study for one of Guided Wave’s
petrochemical customers, where on-line monitoring of octane number in gasoline
production was required for efficient quality control.

The reference method for octane number measurements is (very) time consuming
and relatively expensive, involving comparative use of two test engines (one for a
reference distillation product mixture, the other for the mixture to be graded),
which have to run for 24 hours. If it were possible to replace such measurements
(Y) with fast, inexpensive NIR-spectroscopy measurements (X), routine quality
control could be effectively rationalized and indeed made very much cheaper.
example we do not concern ourselves with a score of other potential problems
related to optimization of the practical implementation of NIR-technology in the oil
refinery setting (these have all been overcome, although at no small effort), but we
shall assume that the prediction of octane number is the sole objective at hand.

Data Set
The sample set Training consists of twenty-six production gasoline samples that
were collected over a sufficient period of time, considered to span all the most
important variations in the production. All the data are stored in the file Octane.
Two of the samples contain added alcohol, which increases the octane number. The
variable set NIR spectra consists of NIR absorbance spectra over 226 wavelengths.
The variable set Octane consists of the corresponding reference measurements of
the octane number.

There is also a sample set Test of 13 new, “unknown” samples. (The corresponding
octane numbers are of course also available for control and validation purposes.)

Tasks
Make a PLS model on the sample set Training. Detect outliers using all available
means, interpret, delete the outliers and re-model. When you have found a good
model, use it to predict the octane number of the sample set Test.

How to Do it
1. Studying raw data
There are several ways to study raw data.

First, you can have a look at the numerical data:

Open the file Octane in the Editor. Study the raw data values as well as possible.
A second possibility is to plot the raw data. You can use a matrix plot to display
the whole spectrum for a whole set of samples; this will show you the general
shape of the spectrum, and may enable you to spot special samples. To do this:
Use Edit – Select Variables and choose the set NIR spectra. Use Plot- Matrix
and take the defined sample set “Training” which corresponds to the 26 first
rows of the data table. Click OK. If necessary, go to Edit - Options to select a
plot of type Landscape.

To plot an individual variable, you can choose a line plot:


Go back to the original Octane Editor window. Mark the variable Octane, and
use Plot - Line and take the sample set “Training”. You may use Edit - Options
to select a plot of type Bars.

The last possibility is to plot a summary of the data. For one individual variable,
a histogram is a good summary of the distribution of the values:
Mark the variable Octane again, use Plot - Histogram and select sample set
Training.
What is the range of variation of the octane number? Are there any groups of
samples?


For a whole set of related variables, for instance the X-matrix, descriptive
statistics are also a powerful way to summarize data distributions. To use them:
Go back to the original Octane Editor window, and choose Task - Statistics.
Select sample set Training, variable set NIR spectra. View the results.
The upper plot displays the minimum and maximum, lower and upper quartiles,
and median of each wavelength as a boxplot.
The lower plot displays the average (as a bar), and standard deviation (as an
error bar).
Can you easily see the common shape of the spectra for all samples? What is
happening at the right end of the spectrum?

2. Making a first PLS model


Make a PLS1 model:
Go back to the original Octane Editor window. Choose Task - Regression. Use
sample set Training, X-variables NIR spectra, Y-variables Octane. Before
clicking OK, check the following elements:
Select appropriate weights for X- and Y-variables. Is there any need for
weighting? Hint: we are dealing with spectra.
Validation: since this is but our very first attempt at this data set, you may select
leverage correction to save time.
Number of components: try 7 PCs, since we do not know much about how many
physical phenomena varied in the chosen calibration data set. Let us hope 7 will
be enough in this first run.
Set the warning limits to 4.0 for sample and variable outlier warnings.

You can now start the calibration. Study the screen during PLS1 regression
progress.
Are there any warnings in the early PCs? Approximately how many PCs will
you need? Does the Y-variance decrease as it should?

3. Viewing model results


Click View. You will now get a regression overview. The first things to check
are whether the model looks OK: searching for irregularities in the data or
outliers.

Study the Residual Validation Y-variance.


What do you think causes the increase in the first PC?
Approximately how many PCs do we need to model the octane number
correctly?


Study the Scores plot for PC1 versus PC2. You can use Window - Warning list or
View – Outlier List to get more information about the warnings you noticed
during regression progress.
Which samples contribute most to make up the model span?
How do you interpret the narrow horizontal group of samples around the origin
along the first PC?
Are there any outliers? Which?

4. Spotting outliers
The X-Y relation outlier plot, which displays U scores vs. T scores, is only
available in PLS. This plot shows you directly how the regression works and
gives a good overview of the relationship between X and Y for one particular
PC. If the regression works well, you will see that all the samples form a straight
regression line. Outliers “stick out” orthogonally from this line. Extreme values
lie at the ends. Noise is typically being modeled when the samples start to
spread, i.e. you have gone past the optimal number of PCs.

Plot X-Y Relation outliers, Quadruple plot for PC1 - PC4. You can use View-
Trend lines to add a regression line to each plot.

Can you spot any outliers? Mark them (using the toolbar or Edit - Mark), and
notice how they are now marked on all plots, a facility called brushing.

Which PC starts getting noisy?
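
If you ever want to reproduce this kind of plot outside The Unscrambler, a rough
equivalent can be sketched with scikit-learn, whose PLS transform returns the T
(X) and U (Y) scores of the training samples. The data below are only synthetic
stand-ins for the Octane set:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(26, 226)).cumsum(axis=1)           # synthetic "NIR spectra"
y = 0.01 * X[:, 120] + rng.normal(scale=0.05, size=26)  # synthetic "octane numbers"

pls = PLSRegression(n_components=4, scale=False).fit(X, y)
T, U = pls.transform(X, y)                              # T (X-scores) and U (Y-scores)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for a, ax in enumerate(axes.ravel()):                   # one T vs. U plot per component
    ax.scatter(T[:, a], U[:, a])
    ax.set_xlabel("t%d scores" % (a + 1))
    ax.set_ylabel("u%d scores" % (a + 1))
plt.tight_layout()
plt.show()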

5. Checking performance of model with outliers


To check how well the model performs for individual samples, you will use
Predicted vs. measured plots. To assess its global performance, you will plot
RMSEC and RMSEP.

Plot Predicted vs Measured for a varying number of PCs (e.g. from 1 to 4), one
in each quarter window. Toggle between Cal (calibration) and Val (Validation)
using the toolbar. If you do not have the Cal and Val buttons on your toolbar,
turn them on by choosing View – Toolbar – Source.
If you wish to take additional information into account, use Edit - Options-
Sample grouping. Try to Separate with Colors and Group by Value of Leveled
variable 2 (the category variable containing information about octane number
range). It will help you to spot misplaced samples and see groups.


Can you see the outliers? What happens with the predicted values for 1 PC? 2
PCs? How many PCs do you need to get regularly distributed predictions?

Use Plot – Variances and RMSEP - RMSE to plot the calibration error (RMSEC)
or prediction error (RMSEP) in original units. Before clicking OK in the dialog
box, double-click on the plot preview to get it as a full window. This way the
produced plot will use the single subview window. Point at the minimum and
click to see the value!

How large is the RMSEP in this model with outliers?


How many PCs do you need to get a minimum RMSEP?
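
As a reminder of what is being plotted here: RMSEC and RMSEP are both root
mean square errors - the square root of the average squared difference between
predicted and reference Y-values - computed from the calibration fit and from the
validation predictions, respectively. In Python the formula is simply:

import numpy as np

def rmse(y_reference, y_predicted):
    # Root mean square error: sqrt of the mean squared residual, in original Y-units
    y_reference = np.asarray(y_reference, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return np.sqrt(np.mean((y_reference - y_predicted) ** 2))

# Illustration with made-up octane values: residuals 0.3, -0.4 and 0.3
print(rmse([88.0, 90.5, 92.1], [88.3, 90.1, 92.4]))   # about 0.34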

6. Investigating and handling the outliers properly


We usually do not permanently remove outliers unless we know what they
represent. Samples 25 and 26 were in fact the two samples with added alcohol.

Plot Loadings - Line - X for Component 1 (which separates those two samples
from the others). Point and click to see the wavelengths.
Can you deduce which band is mostly absorbed by alcohol-related compounds?

Try the Sample outliers plots and study how they reveal the problems with
sample 25 and 26. You may use several options and select Validation only on
some plots to make things clearer.

Based on the damage the outliers are doing to the model, and on the additional
knowledge of why they are outlying, make a decision about whether to keep
them or exclude them.

7. Building and checking a new model without outliers


Check that the outliers are still marked (otherwise select Edit - Mark and mark
them), then use Task - Recalculate without Marked.
Are there any outlier warnings now?

Study the Variances and RMSEP and X-Y Relation outliers plots for the new
model.
Do you see new outliers?
Is there still an increase in the prediction error?
How many PCs should we use?
How large is RMSEP with that number of components?


Plot the scores.


How are the samples located? Do you see groups? How can we analyze the
groups? Hint: Use Edit - Options - Sample grouping. Try to Separate with
Colors and Group by Value of Variable Y.
Are there any signs that the groups are harmful for the model and thus should
be modeled separately?

Plot Loading Weights – Line - X to see which wavelengths are absorbed by
octane number related compounds. Plot the first 2-3 PCs in the same plot. Large
loading weights indicate important bands (they show that a wavelength plays an
important role in the model).
Which important bands can you detect?

Save the model under the name Octane2.

8. Towards a final model


Select Task again (you can do this directly from the Viewer, by selecting
Recalculate without marked while no samples are marked) and make a new
model with a more conservative Validation method, such as full cross-validation
(use the Setup button, Method, Full cross-validation).
How large is the RMSEP now, for the appropriate number of components?
Has the model changed?

Also try with some “sensible” segmented cross-validation approach and compare
with the full cross-validation (for example 10% segments, or larger?).

Plot Predicted vs Measured for varying number of PCs. Use View - Trend
lines to put on a Regression line. View - Plot Statistics gives additional
information.
Do you recognize the groups? How many PCs should we use? Are you satisfied
with the distribution of the samples around the regression line?

Make a conclusion about the model’s prediction ability by comparing RMSEP with
the range of measured values for octane number.
Which measure should we compare RMSEP with if available?

Save this model.


9. Using the final model for prediction


Now that you have found a good model, and decided on the most appropriate
validation hereof, it is time to use it for prediction:
Select Task - Predict. Check the following elements before clicking OK:
Choose Sample set Test. Select the appropriate X-variables. Also tick the
Include Y-reference box from Y-reference, with Variable set Octane. The
Y-variable will not be used in the prediction, but the Y-values will be stored
together with the predicted results to make the comparisons with known
reference values possible.
Specify which Model to use and how many PCs to use. The Variance button
gives you information about the chosen model.
How many PCs should we use? Why?

If you are hesitant to choose between two numbers of PCs, the smallest number
will be called the “conservative” choice, whilst the larger one may signal
problems!

10. Viewing prediction results


Click View to look at the prediction results. The default prediction plot is
Predicted with deviation.
This plot shows the predicted values (white line) of each sample. The size of the
bars indicates the deviations, the prediction uncertainty measure described in
chapter 9.
Plot it for varying numbers of components, including the one you had previously
decided on.
Which predicted values cannot be trusted? What do you think is wrong with
these samples?
What is the difference between RMSEP and the uncertainty deviation?

Open the Warning list from the Window menu.


Are there any outliers?

11. Checking predictions against reference values.


Plot Predicted vs. reference, for varying numbers of components. Add a
regression line and a target line.
Why is the fit to the regression line so bad for some models?
Which samples have a bad correspondence between Ypred and Yref?
What is wrong with these samples?
What do you think of samples happening to have a good fit between predicted
and reference values, but large deviations?

Summary
Some would say we do not need to weight these spectra, since they are so
extremely similar over the entire X-wavelength range. We will need about 3 to 5
PCs, but the Y-variance increases in the first PC, which is a sure sign of problems -
usually outliers. Samples 25 and 26 were indicated as extreme outliers. These
outliers were actually seen in all the sample-related plots. PC1 in the first model
described mainly the difference between samples 25-26 and the other samples. The
outliers also caused an increase in the prediction error in the first PC. The narrow
group of samples around the origin in the score plot are in fact all the remaining
samples, whose variations are only small compared to the difference between them
and sample 25-26. Obviously no. 25 and 26 make up most of the model variance
alone. The X-Y Relation outliers plot shows the same thing, and in PC4 the
samples spread out, indicating overfitting. The loadings for PC1 were largest in
band 1400-1420 nm. Samples 25-26 were the samples with added alcohol;
obviously they are so dissimilar to the others that we cannot make a model for both
types. (An alternative might be to try to add more samples with alcohol, but there is
a great risk of the two types of samples being so different that the result would
most probably be an inaccurate global model.) The outliers were also visible in
Predicted vs. Measured. In some PCs samples got misplaced. RMSEP was about
0.3 with 3 PCs.

The second model had no local increase in prediction error; 2 PCs already seem
OK, but the model is also further improved with 3 PCs, giving an RMSEP around
0.25 octane number. The score plot indicates groups, but since the explained
variance of Y is around 98%, these are probably not harmful. The groups are
actually composed of Low, Medium, and High octane. In the Predicted vs.
Measured plot, the correlation between predicted and measured Y is close to 0.99,
and the regression line for validation samples is very close to “y=x”. This is indeed
a very good model, and comparing a RMSEP of 0.25 with octane numbers between
87 and 93 gives an average relative error of less than 0.5%. However, since the
error in the reference method, SDD, is unknown to us, we cannot say anything
more specific about how well the model predicts compared to traditional octane
measurements.

Samples 10-13 get large outlier warnings at prediction and therefore cannot be
trusted at all. Their predicted values are too high with a model including 3 PCs.
According to their spectra, they probably also contain alcohol and, since those
samples were removed from the calibration set, we cannot expect the prediction
model to handle such samples either. It seems that the model can also be used to
detect new prediction samples that do not fit in, so now nobody can try to cheat by
adding alcohol to raise the octane number! This is a very powerful illustration of
the possibilities of detecting “non-similar” training set objects when using
multivariate calibration.

The RMSEP is the average prediction error, estimated in the validation stage. If
new prediction samples are of the same kind and in the same range as the training
samples, we should expect roughly the same average prediction error. The
uncertainty deviation at prediction tries to consider the particular new sample that
is to be predicted, indicating a larger uncertainty if the new sample is (very)
different from the calibration samples. We can see this as a convenient form of
outlier detection at the prediction stage.

When plotting Predicted vs. Reference and drawing a regression line to fit the
points, the result seems bad because of the second group of outlying samples 10-13.
If you disregard these samples, you will notice that all the “normal” samples are
close to the target line.

Samples 10 - 13 have nice predictions using 2 PCs, but the deviations are large.
This should make you worry, because the samples are obviously different from the
samples used to make the model. It may therefore be pure luck that the predictions
fall nicely into the range of the others.

We hope you have not used leverage-corrected validation in the final evaluations!
If you did not: congratulations! If you did, here is what you still have to do:

Take a close look at the initial Matrix-plot of the X-data again. Extreme
collinearity and redundancy. This is a typical situation in which there is a real
danger of the “famous” over-optimistic leverage corrected validation. It is in fact
necessary to perform the entire calibration again using a more appropriate
validation procedure. However, the repeat is readily done now that you know all
the outliers, including the potential last three candidates (samples 10-13). In fact
this makes the whole re-analysis boil down to simply choosing the more
appropriate validation method to be used directly on the outlier-screened data set
inherited from above.

Compare the leverage-corrected validation with your more appropriate validation method.


What about not scaling, or weighting, these data? Why did you - perhaps -
decide not to use this option on these particular data? There is nothing special
about spectral data. True, the X-spectra were all measured in the same units, and
across the entire X-interval these data appear very nearly identical, with
enormous redundancy (for the outlier-screened data set, to be sure). That
would actually weigh in favor of auto-scaling, so as to help bring forth the
minuscule differences between this set of very similar objects, very similar spectra
(in the X-space). This is a salutary lesson in not following any myths about scaling
or not scaling; in fact, the whole scaling/no-scaling issue can be put to a much
easier test: just do it! You’ll have to carry out the entire PLS-analysis again, only
now using the alternative auto-scaled data.

Compare the scaled PLS-model results with the earlier un-scaled model version(s).
In this particular case, what can we conclude with respect to the issue of the merits
of scaling?

10.2 Exercise - Water Quality


Purpose
In this exercise you will analyze a data set from a sewage water outlet, and make a
mathematical model that describes the relationships between four important
variables. You will also find the traditional regression equation for the model.

Problem
This exercise is based on studies of municipal water quality at an outlet of a sewage
plant at the 1996 Olympic Winter Games Village Lillehammer, Norway.

Make a model of the target Y-variable BOF7 from the other available
measurements and find an equation: BOF7 = f (tot-P, Cl, Susp). BOF7
measurements cost about $140 each so taking one measurement per day may be
rather costly for a medium-sized municipality (of course viewed in the context of
all the other analytical requirements in the municipality’s environmental protection
division). A good PLS-model would be valuable to the sewage plant monitoring
economics, and will ease the load on the wet laboratory resources.

Data Set
The samples were measured twice a month over a five-month period. In the data
tables they are listed in time order. The data file name is Water:


Table 10.1 - Variable descriptions


Variable Description Name Unit
BOF7 Biological oxygen consumption (7 days) BOF7 mg O/l
x1 Total phosphorus Tot-P mg P/l
x2 Chloride Cl mg Cl/l
x3 Suspended materials Susp mg/l

Table 10.2 - Variable set descriptions


Set Name Description
All All measurements from the water outlet
Important Measure Three variables from All: tot-P, Cl and Susp
BOF7 BOF7 from All

Tasks
Make a PLS1 model.
Plot or read the B-coefficients and estimate the traditional regression equation.

How to Do it
1. Make a PLS1 model using the variable set Important Measure as
X-variables (Weights: 1/Sdev) and BOF7 as Y-variables; keep out sample 9.
Interpret the model and look for outliers. How many components should we
use? Re-calibrate with an appropriate validation method when you are sure
of the model. Compare RMSEP with the typical y-value levels. Can this
model be used to replace BOF7 with the cheaper X-measurements?
Save the model.

The B-coefficients may be used with new unweighted data. To study the
B-coefficients, use File - Import - Unscrambler Results.
Note!
It is NOT possible to import files or results in the training version of The
Unscrambler. This feature is only available for The Unscrambler full
version.

Specify the model you just made and select the B-coefficients for the optimal
PC. Read also B0. Will predictions using the B-coefficients give different
y-values from those using scores and y-loadings?


2. Read the data again and pretend that they are new samples. (This is of course
only for this exercise purpose!) Select Task - Predict, specify your samples,
variables and model. Predict with one PC. Note the predicted Y-value for the
first sample.

Now predict using the B-vector instead. Append a new variable into the Editor.
Mark columns 1-3 and 12. Select Modify - Compute and type for example:
V12 = -27.5 + 10.4*V1 + 1.59*V2 + 0.0868*V3
using of course your own B-values.

Alternative procedure:
First, replicate the row containing the B-coefficients in the imported editor (by
using Edit – Copy then Paste 8 times). You now have as many B-coefficient
rows in this window as samples in the Water data table. Select all B-coefficient
columns, and drag and drop them to the Water editor (choose “Insert as 4 new
columns”). Then insert a new variable, and use Modify – Compute to perform
the manual predictions by typing in for example:
V1 = V2*V6 + V3*V7 + V4*V8 + V5
where V1 is the newly inserted variable, V2-V4 are the B-coefficients, V5 is B0,
and V6-V8 are the Important Measure variables.

Is Y for the first sample equal to the predicted result from The Unscrambler?
How would you indicate the predicted result, given the validation results?

Summary
In this data set one PC is enough to explain 81% of the validated Y-variance, giving
an RMSEP of 8.7 (with full cross-validation) that we can compare to the values of
BOF7 (which range from 15 to 70). RMSEP is then equal to some 23% of the mean
value of BOF7. The regression equation obtained from the B-coefficients is:
y = -29 + 10.25*tot-P + 1.63*Cl + 0.087*Susp
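
Once the B-coefficients are known, applying the model to new, unweighted
measurements is a single line of arithmetic. A small hypothetical Python
illustration (the new-sample values below are made up):

def predict_bof7(tot_p, cl, susp):
    # Regression equation obtained from the B-coefficients above (BOF7 in mg O/l)
    return -29 + 10.25 * tot_p + 1.63 * cl + 0.087 * susp

# made-up new sample: tot-P = 3.2 mg P/l, Cl = 14 mg Cl/l, Susp = 120 mg/l
print(predict_bof7(3.2, 14, 120))   # approximately 37

Remember, though, that predicting this way gives no outlier statistics or
uncertainty deviations, as the next paragraph points out.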

The samples are fairly well distributed. There is no obvious time trend. All three
X-variables have positive loadings.

Prediction using the B-coefficients gives the same predicted BOF7 values as a
prediction using scores and y-loadings, of course. The drawback of using the B-
coefficients for prediction is that we do not get outlier statistics or uncertainty
measures. The predicted value of sample 1 is 26.9 ± 4.3 if we use the prediction
deviation to build an approximate confidence interval.

Replacing BOF7 with the cheaper measurements indeed has a clear cost saving
potential, but the model needs to be improved, for instance by adding more
calibration samples. It is not yet precise enough, but we might agree that it shows
potential?

10.3 Exercise - Freezing Point of Jet Fuel


Problem
The effective freezing point of jet fuel has to be as low as possible, both following
aviation agency specifications, as well as for your flying pleasure: at typical
commercial jet altitudes, the ambient temperature outside the fuselage is way below
the freezing point of water. The reason why freezing points need to be depressed
relative to that of water is that jet engines do not operate that well if the fuel lines
are clogged with ice!

Determination of the freezing points is done manually, by optical inspection of a
sample in a super-cooled test tube. This tube has been cooled to a temperature
below 0°C. The laboratory technician observes whether ice crystals begin to form
in the test tube. If so, the temperature is considered to be below the effective
freezing point of the fuel mixture under test. Through an iterative procedure, where
cooling is tried at slightly lower and higher temperatures, the “correct” freezing
point is determined. Even if the illumination and observation conditions are
standardized, the observation uncertainty is relatively large. Typical estimates of
this error are ± 5°C.

It would be of great interest to carry out these tests by some faster instrumental
procedure, preferably one with a smaller uncertainty.

Because of the satisfactory results obtained in predicting octane numbers directly
from spectroscopic NIR-measurements of gasolines, it was decided to try a similar
approach for these freezing point depressions. Note however that there is not
necessarily a simple and easily modeled relationship between the physical freezing
point and NIR-spectra. Octane number (a measure of combustion efficiency, a
chemical property) may very well be more related to the organic chemistry
(supposedly) embedded in the pertinent NIR-spectra.


We also know that the present set of jet fuel samples come from four different
refineries. There is an assumption that the four refineries produce identical jet
fuels. If this is true, there should be no differences between these four sub-types of
jet fuel with respect to freezing point depression. And we should be able to make
one global model for all four refineries.

This data set has kindly been provided by Chevron Research & Technology Co.,
Richmond, CA, USA. The original problem context has been slightly modified
(conceptually as well as numerically) for the present educational purpose.

Data Set
The file Fuel contains two variable sets:
Spectr: 65 sample spectra with absorbance readings at 111 wavelengths in the
NIR range (1100 to 1320 nm).
Freeze: Freezing point determinations (°C) for the 65 samples.

Tasks
Try to make a model that predicts freezing point from spectra, with the aim of
replacing the cumbersome reference method with fast and cheap NIR-
measurements. Check if there are significant differences between samples from
each of the four refineries.

How to Do it
1. Always plot raw data to get acquainted with the data matrices. Before you do
that, you may change the data type for variable set Spectr to “Spectra”
(Modify – Edit Set etc.).
You may notice that variable No. 2 is constant. You needn’t delete it, it will
be enough to keep it out of the analysis.

2. Make an initial PCA to see if there is structure in the spectra. Should we
scale the data? How many PCs are needed to describe the spectral
variations? Do the refineries form groups? Hint: Study all combinations of
score plots for up to 4 PCs.

3. Make a PLS1 model. Start with leverage correction. Try to find outliers.
Look for groups. Study the X-Y Relation outliers plot to identify outliers.


Can we discriminate between samples from the four different refineries? Or is
our hypothesis tenable? Hint: Study all combinations of score plots for up
to 5 PCs.

4. Refine the model by removing a few outliers at a time. Hint: We found three
obvious outliers (using the X-Y Relation Outliers plot), and two that may be
from a specific refinery (using the Influence plot: Plot – Residuals –
Influence Plot, Components: 1-5, Variables: X). See how the resulting
validation Y-variance improves without local increases. How many PCs
should we use? Check the RMSEP. Is the prediction error significantly
lower when the outliers are removed?

5. Also study the 1-vector Loading Weights plots, to see at which wavelengths
there is freezing point related information.

6. Re-calibrate with full cross-validation. Does this change the RMSEP and the
number of PCs drastically?

7. Also recalibrate with segmented cross-validation, for example with 6
randomly selected segments. Compare RMSEP with the full cross validation
results.

8. Make conclusions: Can this model be used to replace the reference method?
Compare RMSEP with the uncertainty in the reference method. Can we
expect a smaller prediction error than the analytical error? Suggest an
approach to get a better model.

Summary
We do not scale the X-data because they are of the same type (absorbance values).
The PCA model shows that there is strong spectral structure. About 99.5% of the
X-variance is explained by 4 PCs. The score plot for PC3 vs. PC4 is the only one
that shows clear groupings, however, both in PCA and PLS. Observe that we
see this grouping only at higher order components. This grouping is significant - of
what?

There does not seem to be a strong enough relationship between the NIR spectra
and the freezing point to constitute even a first working prediction model: only
about 55-65% of the validated Y-variance is explained, no matter what we do to
try to make a good model - and this is indeed the same message we get from
whichever validation approach we choose.

The RMSEP is about 14°C at its lowest, using leverage correction. From a data
analytical point of view this is about what can be expected, since RMSEP is about
twice the analytical error. RMSEP is composed of uncertainty both in the X- and
the Y-measurements. However the R&D lab needs a smaller prediction uncertainty
than this. The only chance to get a lower RMSEP would be with better reference
measurements, for example by finding a more accurate method or by using a higher
number of replicates.

Samples 2, 46 and 47 are obvious outliers, easily found in the X-Y Relation outliers
plot. This plot also indicates that the regression does not work very well. Samples
37 and 63 are also outlying. They do not have extreme Y-values and may originate
from a certain, undisclosed refinery. If they are also removed, the residual Y-
variance curve is now almost smooth; the RMSEP does not get much better
however. Using leverage correction, the RMSEP curve flattens out after 11 PCs,
and using full cross validation the optimal number is 12.

Without specific chemical and spectroscopic application knowledge, it is difficult
to interpret the loading-weight plot. There is a significant absorbance in many
bands in the different PCs. The loadings of PC3 and PC4 might help to understand
why the refineries produce fuels which are different, but we really need more
information than what has been given here, in order to get any further with this.

10.4 Exercise - Paper


Purpose
This exercise illustrates outlier detection and variable reduction.

Problem
A paper mill monitors the quality of its newsprint product by applying ink to one
side of the paper in a certain, heavily standardized layout pattern. By measuring the
reflectance of light on the reverse side of the paper, a reliable, practical measure of
how visible the ink is on the opposite side is obtained. This property, Print
through, is an important quality parameter for paper mills in general. The paper is
also analyzed with regard to several other production parameters as well as other
pertinent raw material characteristics.


The paper mill wants to make a model, which can be used for quality control and
for production management. For example, it may be possible to rationalize the
quality control process by reducing the number of significant parameters measured.
Preferably, it should also be possible to predict the Print through for new, not too
dissimilar, paper compositions.

Data Set
The data is stored in the file Paper. You are going to use the sample sets Raw data
(106 samples) and Prediction (12 samples), and the variable sets Process (15
variables) and Quality (1 variable). The variables in the Process (X) set are shown
in Table 10.3.

Table 10.3 - Variable description in the X-variable set Process


X-var Name Explanation
X1 Weight weight / sq. m
X2 Ink amount of ink
X3 Brightness brightness of the paper
X4 Scatter light scattering coefficient
X5 Opacity Opacity
X6 Roughness surface roughness
X7 Permeability Permeability
X8 Density Density
X9 PPS Parker Print Surf number
X10 Oil absorb a measure of the paper's ability to absorb oil
X11 Ground wood the % of ground wood pulp in the paper
X12 Thermo pulp the % of thermomechanical pulp
X13 Waste paper the % of recycled paper
X14 Magenf the amount of additive
X15 Filler the % of filler

The samples were collected from the production line over a considerable time
interval, in the hope that the measurements would span all the important variations
in the newsprint production. The sample names show in which sequence the
samples were collected. The samples are sorted by increasing levels of the variable
Print through. In order to check the model, twelve new samples are stored in the
test set Prediction. These are used to check how the model performs for prediction.

Tasks
1. Find outliers in the sample set Raw data and remove them.
2. Reduce the model by also finding the less important variables and make a
new model without them.


3. Predict Print through for the new samples in the sample set Prediction.
4. Try to solve the same problem using PCR instead of PLS.

How to Do it
1. Make an initial PLS model
Read the data from the file Paper. View the statistics, plot the raw data, and
decide on an appropriate preprocessing and weighting. Make a quick initial PLS
model with leverage correction using the sample set Raw data, the variable set
Process as X-variables and Quality as Y-variable.

2. Detect outliers and keep them out of calculation


Make use of all possible means to detect outliers. Use Plot – X-Y Relation
Outliers, Double. Make sure that you have numbers as markers (Edit –
Options, Markers Layout: Number). You should be able to detect three
outliers.
Did all the outliers appear in the list of warnings from the calibration? What
does this mean?

Remove the outliers, always only one, or a few at a time, and make a new
model. When you have reached the final model, plot the variance and decide
how many PCs to use.
How much of the X-variance is explained? How much of the Y-variance does
the first PC explain? Check the RMSEP!

Now we have a model without outliers.


Are the samples well distributed in the score plot?

3. Reduce the number of variables


The task is now to see whether it is possible to make a simpler model by
reducing the number of variables in the model. We need to take a close look at
the loading-weights plot (Variables: X and Y).
Which variables co-vary most with Print through?
Important variables have large regression coefficients (which summarize the
relationship between X- and Y-variables over several PCs). Plot the regression
coefficients for a relevant number of PCs. If the plot shows up as a curve, use
Edit – Options: Bars. Mark the 7 variables with the largest coefficients (in
absolute value) by Edit – Mark – One by One.


4. Make a new PLS model


Make a new model by Task - Recalculate with Marked.
How well do the predicted Print-through values correspond to the measured
ones? How large is RMSEP now for the number of PCs you chose?

Compare the new regression coefficients to those of the previous model, and
check whether they are similar.
Close the viewer with the latest model and answer No to Save.

5. Use a more conservative validation method


Re-calibrate with cross-validation, using your own choice of number of samples
per segment. Check the RMSEP again. Is it still OK?
Close the viewer and save the model as PaperPLS1final. Close the other viewers
and answer No to Save.

6. Predict Print through for new samples


New samples are stored in the sample set Prediction. Check if there are outlier
warnings or large deviations.
Which conclusions can you draw? Why does the prediction of some samples get
large uncertainties?

7. Make a PCR model and compare to the PLS model


Make a PCR model of the data. Try to find the outliers in this model also.
Are the results comparable to those with PLS? Explain the difference.
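
Outside The Unscrambler, the PCR/PLS comparison can be sketched by treating
PCR as PCA followed by ordinary least squares regression on the scores. The
hypothetical scikit-learn example below (synthetic data) illustrates the typical
outcome: for the same number of components PLS usually reaches a lower
cross-validated error than PCR, because the PCR components are chosen without
regard to Y.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(103, 15))                                    # synthetic process data
y = X[:, [0, 3, 4]].sum(axis=1) + rng.normal(scale=0.3, size=103)

def rmsecv(model):
    # Cross-validated RMSEP over 10 segments
    y_cv = cross_val_predict(model, X, y, cv=10).ravel()
    return np.sqrt(np.mean((y_cv - y) ** 2))

for a in (1, 2, 3, 6):
    pls = PLSRegression(n_components=a, scale=True)               # standardized PLS
    pcr = make_pipeline(StandardScaler(), PCA(n_components=a), LinearRegression())
    print("A = %d   PLS RMSECV = %.2f   PCR RMSECV = %.2f" % (a, rmsecv(pls), rmsecv(pcr)))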

Summary
The X-variables should be standardized because they are all in very different units
and value ranges. Only some outliers are shown in the list of warnings, compared to
those you find in the X-Y Relation outliers plot (T vs. U scores). The so-called
“relation outliers” (discrepancy between X and Y) show up in this plot only.
Samples number 105 and 106 are clear outliers and are removed from the
calculations first. Sample 104 seems outlying too, and is kept out of calculation
from the final model. You may argue that sample 98 is an outlier also. There are no
definite rules, it is your experience that counts.

In the refined model (all outliers removed), only 18% of the variations in X are
modeled by the first PC. However, these 18% explain all of 84% of the variation in
Y. Obviously, there is a lot of irrelevant X-information in these data. RMSEP is
around 3.4 using 1 PC, 2.8 using 2 PCs, and 2.6 using 3 PCs. The samples are also
well distributed over the score plot in PC1 vs. PC2, indicating that important
variations are well spanned.

Variables Weight, Opacity, Scatter and Filler (however they are measured) co-
vary most, in a negative way, with Print through. This makes sense: a high
weight/sq. m. makes the paper thicker and thus less transparent, light scatter is
reflected light, opacity is by definition the opposite of Print through, and filler is
added to counteract Print through. The seven largest regression coefficients belong
to the four above mentioned variables, with the addition of Brightness, Density and
Ink. The rest of the variables have very small coefficients, and we may thus feel
motivated to remove them in the variable reduction context.

In the resulting reduced model using 2 PCs, the predicted Y-values correspond well
with the measured ones, giving an RMSEP of 2.6. Cross validation gives an
RMSEP of approximately 2.7, (depending on the specific segment selection). This
is the more conservative error estimate and is still satisfactory compared to the
measurement range of Print through (30-69).

Some samples in the test set Prediction are not well predicted. These are outlying,
probably due to X-errors. Outliers in the prediction set can be detected by their
deviations, or because they turn up in the list of outlier warnings.

Using PCR, we need no less than 6 PCs to get an RMSEP of about 3.3 on the data
set with all variables in, but the outliers removed. The outliers were very difficult
to find without access to X-Y Relation outliers plot. (In PCR T scores = U scores.)
Only sample 98 showed up in the score plot for PC1 vs. PC2. The reduced model,
with fewer variables and no outliers, gave a RMSEP of 2.7 using 6 PCs. The PCR
model resulted in pretty much the same interpretation as the PLS model, but
because there were now more PCs, the patterns were more difficult to see.


11. PLS (PCR) Multivariate Calibration – In Practice
Data can seldom be used in its raw format, unless you have been overseeing the
entire data gathering process yourself. There may be outliers, or unsuitable
variables, or there is a need to transform some of the variables, for linearization
purposes for instance. In any event, it is quite normal to make several data
analytical runs, starting with the raw data, and simply “try out” a few standard
routines to see how things are.

Multivariate data analysis is very much an iterative undertaking.

What is a “Good” or “Bad” Model?


In order to determine whether a model is “good” or “bad”, it is of course necessary
to specify the exact purpose of the data modeling. For example, a model intended
for prediction of the protein content in wheat (Y), using fast and inexpensive
spectroscopic (NIR) measurements (X), instead of time consuming laboratory
methods, will in all likelihood require a high prediction accuracy. Authorities or
customers may demand that the prediction error (RMSEP) must be within strictly
predefined limits, usually in the same range as the reference method used for the Y-
data of the calibration set.

In contrast, a model made in order to understand which process variables influence
the quality of a highly variable product, for instance apples, can often be accepted
with much less accuracy. An explained Y-variance as low as 60-75% may be
enough to get on the track of bad quality. The purpose of this model is more to
interpret the patterns of scores and loadings, i.e. to find the “significant variables”.
This may well be achieved without aiming to find the absolutely largest possible
explained Y-variance.

Various expressions of model fit and prediction ability are used to assess how good
a model is. Every solution must still satisfy the basic minimum rules for sound data
modeling though: no outliers and no systematic errors. You should also -
preferably always - be able to interpret the model relationships, ensuring that they
comply with your application knowledge.


Signs of Unsatisfactory Models – A Useful Checklist


• The prediction error is “too high” for the problem at hand (external knowledge)
• The residual variance displays an increase before the minimum, or does not
decrease at all (in PLS)
• There are “lonely objects” (possible outliers), in scores, residuals or influence
plots, or groups of similar outlying objects.
• Similarly for isolated, outlying variables, although the interpretation of
unnecessary, or bad, variables is not identical to that pertaining to objects.
• The distribution of objects in T vs. U score-plots (X-Y Relation Outlier plot) has a
non-linear shape or shows the presence of groups, etc.
• There are systematic patterns in the residuals
• New data are not predicted well

Possible Reasons for Bad Modeling, or Validation Results


• Outliers are (still) present in the data
• Data are not representative: for instance, the calibration data are not
representative of future prediction data, or the validation data are not
representative with respect to the calibration data
• Unsatisfactory validation (wrong method, wrong data)
• Lack of variability in calibration range(s)
• Need for pretreatment of raw data (e.g. linearization)
• Inhomogeneous data (subgroups)
• Errors in sample preparation
• Systematic errors in experiments
• Instrument errors, e.g. drift
• Lack of information - there is in fact no X-Y relationship (always a possibility)!
• Strongly non-linear X-Y relationships

11.1 Outliers and Subgroups

11.1.1 Scores
As a typical example, in Figure 11.1 samples 25 and 26 are situated dramatically
apart from all the other samples. They contribute unreasonably to the overall
model; PC1 is actually used almost exclusively to describe the difference between
these two objects and all others. They may of course still belong with the others,
but chances are generally (very) high that they are significant outliers. As always
the specific decision as to the status of such objects is problem-dependent.

Figure 11.1 - Score plot, showing two examples of significant outliers

Using name markers may show unexpected sample locations. For example, an “A”
sample in the B-group may be an outlier, or it may represent a simpler labeling
mistake, see Figure 11.2.

Figure 11.2 - Schematic illustration of a possible labeling error, or an outlier


[Schematic PC1 vs. PC2 score plot with an A-group and a B-group; one “A”
sample lies inside the B-group]


11.1.2 X-Y Relation Outlier Plots (T vs. U Scores)
In “well-behaved” regression models, all samples should lie more or less close to a
straight regression line in the relevant X-Y Relation Outliers plots (T vs. U scores),
otherwise there is no basis for a model in the first place. The two plots in Figure
11.3 show how some samples can be located off the line, revealing themselves as
outliers, such as objects 104, 106, 105 and 98 along PC1, and samples 95, 105 and
106 along PC2.

Figure 11.3 – X-Y Relation Outliers plots
[Top: PC1, with samples 104, 106, 105 and 98 off the regression line;
bottom: PC2, with samples 95, 105 and 106 off the line]

11.1.3 Residuals
Figure 11.4 shows Y-residuals versus a run order listing of the samples. Large
residuals in this plot indicate possible outliers. Three samples have been flagged.

Figure 11.4 - Individual Y-residuals for each object (in run order listing)

In Figure 11.5, which is a Normal Distribution probability plot of residuals, objects
off a fitted straight line through the point (0,50) are possible outliers. Figure 11.4
and Figure 11.5 are from the same data set. The same objects turn up in both plots
as possible outliers. There are always many different reflections of an abnormal
multivariate behavior. The objective is to have been acquainted with all of them in
your training, so as to use them more and more expertly later on.

Figure 11.5 - Normal Distribution probability plot of object residuals


[Normal probability plot: cumulative probability (%) vs. Y-residuals (-1.0 to 0.5);
Wheat-2, (Y-var, PC): (WATER, 2)]

11.1.4 Dangerous Outliers or Interesting Extremes?
Objects that lie far away from the others, or which do not fit in well with the rest in
the plot, are possible outliers, because they are different from the others. However,
this does not necessarily mean that they should be removed from the data set. You
are responsible for analyzing why they are different, and making a choice of
whether to keep them or not. Such objects may represent extreme end-member
objects that actually help span the calibration/validation variations; removing them
would then result in lower variability for a model, which consequently can then
only handle the more typical, average samples.

Always Check the Raw Data!


One should always go back to the raw data and check what makes the outlying
sample special. You may even need to consult the people who collected the data.
Maybe it is an erroneous value caused by an instrument breakdown or a reading
mistake, or typing/data transcription mistake? Maybe this object has been collected
under very different conditions than the rest. Or maybe it is just an accidental
extreme. Checking the raw data gives valuable information to be used later in the
analysis.

Which Objects Should be Removed?


From a strictly data analytical point of view, objects that do not fit in with the rest
should usually be removed, otherwise they may harm the model. It is always wise
to remove only one or two outliers at a time, starting with the most extreme ones in
the first components. Very often removal of a serious outlier results in significant
changes in the remaining data structure, when modeled again. As one example of
more or less counterintuitive results, the two “less serious” outliers in one run may
behave more like the norm in the next run. You are strongly advised to do several
outlier-screening runs iteratively instead of trying to catch all in one go.

If you remove an “outlier” that is really only an extreme end-member object, the
model may not get better. There may be no change in the prediction error or it may
even increase. Extreme objects actively help to span the model. Extreme end-
member samples are easy to spot. They occupy the extreme ends of the T vs. U
regression lines, while outliers lie off this line (perpendicular to the model). This
distinction is much more difficult to appreciate if you only use numerical outlier
warnings, leverage index, etc. Always study the score plots carefully (T-U). In
regression modeling, X-Y Relation Outliers plots are really all you need – for the
modeling purposes. There may of course also be problem-specific reasons to dig
into how the objects are distributed in the X-space in a specific PLS-solution, via
the T-T scores plots, i.e. for interpretation purposes.

Subgroupings in the PLS-Regimen


It may be difficult to make a global model from a data set that consists of several separate groupings. In such cases the model may become too general - it may perhaps indicate which group a new sample belongs to, but the accuracy within the subgroups is poor. It will probably be much more appropriate to make one model for each subgroup.

The octane application in exercise 12.1 is a good example. The outliers were
different from the other samples; they contained alcohol, resulting in clearly
different spectra. If we include more such samples, there are two possible
scenarios:
1. They would not seem so different in the score plot, but there is a great risk that
the model will be too inaccurate.
2. There will be two clearly different groups that will make an inaccurate model.


After removing these two outliers, a distinct grouping appeared in the score plot,
which consists of samples at different general octane number levels (called Low,
Medium and High), see Figure 11.6. The L, M, and H octane samples described in
this plot are a good example of the interpretation use of a T-T score plot from a
PLS-solution.

The prediction error in this model was satisfactory; RMSEP was approx. 0.3 units
to be compared with the measurement range of 87 - 93 octane numbers, i.e. a
relative error of about 0.5%. Apparently these groups are “harmless” in the greater
picture. This T-T grouping is in fact duplicated in the Y-space, which is why in the
more appropriate T-U plot they are all nearly perfectly aligned along the relevant
regression lines. This is a very important distinction. If you wish, return to exercise
10.1 for a moment and look at the appropriate T vs. U plots.

Figure 11.6 - PLS solution, T-T score plot, showing distinct X-groupings

11.2 Systematic Errors


The whole point of multivariate data modeling is to decompose the data into signal
(systematic variation) and noise. This means that the relevant first components of
the model describe structure. The rest, i.e. what is unexplained, should be random
noise. “The rest” - the residuals - express the lack-of-fit or modeling/prediction
errors. Residuals should ideally only be the accumulated random errors. They are
often - ideally - assumed to be normally distributed, or at least to be distributed
symmetrically. This is a characteristic of the residuals that may be checked easily.

In a “good” model, the residuals are randomly and symmetrically distributed.


Consequently, if the residuals show a systematic pattern, there is something wrong. Always remember that we are only right in expecting a final “good” model if our initial regression hypothesis is a sound reflection of reality - and if we have made corresponding, sensible choices for n, p, and the calibration data set.
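
As a simple numerical companion to this idea, the following minimal Python/numpy sketch (with made-up residual values; Python is not part of The Unscrambler) checks whether a set of residuals is roughly symmetrically distributed around zero, which is what we hope to see in a good model:

import numpy as np

# Made-up Y-residuals from some model (for illustration only)
residuals = np.array([0.12, -0.08, 0.03, -0.15, 0.07, -0.02, 0.10, -0.06, 0.01, -0.04])

mean_res = residuals.mean()                     # should be close to 0
pos_spread = residuals[residuals > 0].mean()    # average positive residual
neg_spread = residuals[residuals < 0].mean()    # average negative residual

# Roughly symmetric residuals: mean near zero and comparable spreads
print(round(mean_res, 3), round(pos_spread, 3), round(neg_spread, 3))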

11.2.1 Y-Residuals Plotted Against Objects


The residuals should be evenly distributed with respect to the Y-zero line. If the
objects are sorted in run order, a systematic pattern may reveal effects due to time
dependencies, for instance larger residuals for later experiments. This could be
caused, for example, by aging of chemicals or the accumulating contamination of
detectors.

Figure 11.7 - Residuals (X or Y) for objects, plotted in run order or in temporal order

[Schematic plot: Residuals (vertical axis) vs. Objects (horizontal axis)]

11.2.2 Residuals Plotted Against Predicted Values
In a good model the residuals should show a random scatter, such that the pattern of a marginal projection onto the Y-axis conforms to a Normal Distribution (see Figure 11.8). Systematic deviations from this behavior may indicate insufficient modeling, i.e. that there is still systematic information left in the data at the present number of components.


Figure 11.8 - Residual plot with random scatter


[Plot: Y-residual vs. Predicted Y; oct-1, (Yvar,PC): (octane,2)]

If, for example, the upper and lower bounds diverge and the pattern is funnel-like, as in Figure 11.9, the error is clearly different in different parts of the experimental range. We may then try to transform the Y-variables in appropriate ways to counteract the problem observed, and make a new model. Iterate as needed.

Figure 11.9 - Residual plot with “funnel-effect” (highly schematic)


[Schematic plot: Y-Residual vs. Predicted Y, with funnel-shaped spread]

If for example, the upper and lower bounds are parallel but not horizontal, see
Figure 11.10, there is a systematic error, in this case a trend. This could be because
a linear term is missing from the model, or it could indicate a scatter-effect not
(yet) corrected for.

All these residual inspection plots are very useful in telling you that your current
model is still not complete. What should you do about it then? Since this is
problem-dependent, you must use your application knowledge and try to find the
reasons.


Figure 11.10 - Residual plot with systematic error (highly schematic)


[Schematic plot: Y-Residual vs. Predicted Y, with a systematic trend]

11.2.3 Normal Probability Plot of Residuals


This is an alternative, statistical way to study the distribution of residuals. If the residuals can be expected to be normally distributed, i.e. follow a Gaussian distribution, they should line up along a straight line through (0,50) in the Normal Distribution probability plot. If they do not comply with this, something is wrong.

Figure 11.11 - Normal probability plot of residuals

[Normal probability plot: cumulative probability (%) vs. Y-residual; OCT-0, PC = 4, Yvar = octane]

The objects in Figure 11.11 are indeed found distributed along a straight line
through the point (0,50). This is a good indication that the model has taken care of
all the structure in the data and there are no outliers left, in contrast to Figure 11.12.


Figure 11.12 - Normal probability plot with outliers (objects 5,9,10,12)

[Normal probability plot: cumulative probability (%) vs. Y-residual; will-0, PC = 1, Yvar = Yield; objects 5, 9, 10 and 12 deviate from the straight line]
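
The same check can also be scripted outside The Unscrambler. The following is a minimal Python sketch (using numpy and scipy, with made-up residual values) of how the cumulative plotting positions of a normal probability plot are computed, and how the straight-line behavior can be quantified by correlating the sorted residuals with theoretical normal quantiles:

import numpy as np
from scipy import stats

# Made-up Y-residuals, for illustration only
residuals = np.array([-0.21, 0.05, 0.13, -0.08, 0.02, 0.30, -0.15, 0.07,
                      -0.03, 0.11, -0.26, 0.18, 0.01, -0.09, 0.06])

# Sort the residuals and compute the cumulative plotting positions (in %),
# i.e. the vertical axis of the plots shown above.
sorted_res = np.sort(residuals)
n = len(sorted_res)
cum_percent = 100.0 * (np.arange(1, n + 1) - 0.5) / n
print("plotting positions (%):", np.round(cum_percent[:3], 1), "...")

# Equivalent check without plotting: compare the sorted residuals with the
# corresponding theoretical normal quantiles. A correlation close to 1
# means the points fall close to a straight line through (0, 50).
theoretical = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
r = np.corrcoef(theoretical, sorted_res)[0, 1]
print("correlation with normal quantiles:", round(r, 3))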

11.3 Transformations
The field of transformations, the major class of pre-processing, is very varied and complex. It is an area with a lot of topical interest and activity. We can only touch upon some of the basics here: logarithmic transformations, spectroscopic transformations, MSC (Multiplicative Scatter Correction), differentiation (computing derivatives), averaging, and normalization. This should nevertheless serve as an introduction. Bilinear models basically assume linear data relationships. It is, however, not necessary that Y displays a linear relationship to each X-variable. Y may be a linear function of the combination of several non-linear X-variables (a linear combination of non-linear X-variables, or a non-linear combination of suitably transformed X-variables), see Figure 11.13.


Figure 11.13 - Nonlinear X-X relation, linear X-Y relation (highly schematic)
[Schematic 3D plot: Y as a function of X1 and X2]

In this schematic illustration, we see that it is possible to fit a regression plane through the (X,Y) data. This shows that there is a linear relationship between the two X-variables and Y; generalizing to p X-variables, this still holds.

In many cases you may succeed in linearizing data by analogous problem-dependent variable transformations, and thus allow the bilinear multivariate methods to work well also on transformed data.

All the transformations in the following sections are available from the Modify
menu in The Unscrambler.

11.3.1 Logarithmic Transformations


Many phenomena (e.g. physical, biological, geological or medical) display skewed
distribution characteristics that require a logarithmic transformation:
X* = log(X)
This is easily done using Modify - Compute.


Figure 11.14 - Logarithmic transformation - “before and after”

[Two histograms: top - skewed distribution (raw data, dispers.UNS DispAbil); bottom - the same variable after log transformation (test.UNS DispAbil)]

If you have no prior knowledge about the data, study histograms of the variables. If the frequency distribution is very skewed, a variance-stabilizing transformation such as the logarithmic function may help (Figure 11.14). Here the skewness was reduced from 1.02 to -0.14 after the logarithmic transformation.

You may use natural (Ln) or base 10 (Log) logarithms.
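
For readers who want to check such a transformation outside The Unscrambler, the following is a minimal Python/numpy sketch (with made-up data and a simple moment-based skewness estimate) of how skewness can be compared before and after taking logarithms:

import numpy as np

def skewness(x):
    # Simple moment-based skewness estimate
    x = np.asarray(x, dtype=float)
    m = x.mean()
    s = x.std()
    return np.mean(((x - m) / s) ** 3)

# Made-up, strongly right-skewed raw data (all values must be > 0 for log)
raw = np.array([1.2, 2.5, 3.1, 4.0, 5.5, 7.2, 9.8, 14.0, 22.0, 55.0])

log10_data = np.log10(raw)   # base-10 logarithm, X* = log(X)
ln_data = np.log(raw)        # natural logarithm is equally valid

print("skewness before:", round(skewness(raw), 2))
print("skewness after :", round(skewness(log10_data), 2))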

11.3.2 Spectroscopic Transformations


Most spectroscopists prefer to work with absorbance data, because they are more
familiar with this type of data, and feel more at home interpreting absorbance
spectra.


Many modern multi-channel analytical instruments can provide spectra as absorbance readings, using some form of correction and transformation. If you do not know which formula is used in the instrument software, it may be wise to import the raw spectra instead and make the appropriate transformations yourself.

In general it is recommended that you start by analyzing the absorbance spectra. If this does not work, you should try to transform the data.

Transmission data are often non-linear, so they are “always” transformed into, e.g., absorbance data, using a modified logarithmic transformation (see below). Diffuse reflectance data are “always” transformed into Kubelka-Munk units, but exceptions may occur in more problem-specific cases.

The Multiplicative Scatter Correction is another very useful transformation for spectroscopic data, see below. The Unscrambler has many ready-to-use functions for spectroscopic data from the Modify menu.

Reflectance to Absorbance
We shall here assume, without loss of generality, that the instrument readings R
(Reflectance), or T (Transmittance), are expressed in fractions between 0 and 1.
The readings may then be transformed to apparent Absorbance (Optical Density)
according to Equation 11.1.

Equation 11.1        Mnew(i, k) = log(1 / M(i, k))

Absorbance to Reflectance
An absorbance spectrum may be transformed to Reflectance/ Transmittance
according to Equation 11.2.

Equation 11.2        Mnew(i, k) = 10^(-M(i, k))

Reflectance to Kubelka-Munk
A reflectance spectrum may be transformed into Kubelka-Munk units according to
Equation 11.3.


Equation 11.3        Mnew(i, k) = (1 - M(i, k))^2 / (2 ⋅ M(i, k))

Absorbance to Kubelka-Munk
In addition the apparent absorbance units may also be transformed into the
pertinent Kubelka-Munk units by performing two steps: first transform absorbance
units to reflectance units, and then reflectance to Kubelka-Munk.

Transmission to Absorbance and Back


Use the Modify - Compute function with the expression X = -log(X) to transform X from transmission into absorbance data. X = 10^(-X) gives transmission data again.
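
The transformations above (Equations 11.1 - 11.3) are simple element-wise operations. A minimal Python/numpy sketch, assuming the readings are stored as a matrix of fractions between 0 and 1 with objects as rows (the data and function names below are made up for illustration and are not Unscrambler functions):

import numpy as np

def reflectance_to_absorbance(M):
    # Equation 11.1: Mnew(i, k) = log10(1 / M(i, k))
    return np.log10(1.0 / M)

def absorbance_to_reflectance(A):
    # Equation 11.2: Mnew(i, k) = 10 ** (-M(i, k))
    return 10.0 ** (-A)

def reflectance_to_kubelka_munk(M):
    # Equation 11.3: Mnew(i, k) = (1 - M(i, k))**2 / (2 * M(i, k))
    return (1.0 - M) ** 2 / (2.0 * M)

# Made-up reflectance readings (2 objects x 4 wavelengths), fractions in (0, 1)
R = np.array([[0.80, 0.65, 0.40, 0.55],
              [0.78, 0.60, 0.35, 0.50]])

A = reflectance_to_absorbance(R)
KM = reflectance_to_kubelka_munk(R)
# Absorbance to Kubelka-Munk in two steps, as described above:
KM_from_A = reflectance_to_kubelka_munk(absorbance_to_reflectance(A))
print(np.allclose(KM, KM_from_A))   # True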

There is no room to go into the area of spectroscopic transformations in any great detail within the scope of this introductory book. For this - as well as for all the other transformations treated here - we are happy to refer the interested reader to the excellent chapter 3 in the textbook by Beebe, Pell & Seasholtz (1998). The book presents a wide coverage of pre-processing, scaling and transformation, complete with very useful illustrations, case histories and examples.

11.3.3 Multiplicative Scatter Correction


Spectroscopic measurements of powders, aggregates of grains of different particle
sizes, slurries and other particulate-laden solutions often display light scattering
effects. This especially applies to NIR data, but is also relevant to other types of
spectra; scatter effects in IR spectra may be caused by background effects, varying
optical path lengths, temperature, and pressure variations. Raman spectra also often
suffer from background scattering. In UV-VIS varying path lengths and pressure
may cause scatter. These effects are in general composed both of a so-called
multiplicative effect as well as an additive effect.

Other types of measurements may also suffer from similar multiplicative and/or
additive effects, such as instrument baseline shift, drift, interference effects in
mixtures, etc.

Multiplicative Scatter Correction is a transformation method that can be used to compensate for both multiplicative and additive effects. MSC was originally designed to deal specifically with light scattering. However, a number of analogous effects can also be successfully treated with MSC. In The Unscrambler MSC transformation can be done from the Modify - Transform menu.

The idea behind MSC is that these two undesired general effects, amplification (multiplicative) and offset (additive), should be removed from the raw spectral signals to prevent them from dominating over the chemical signals or other similar signals, which are often of lesser magnitude. Thus we may well save one or more PLS-components in our modeling of the relevant Y-phenomena if we are able to eliminate (most of) these effects before multivariate calibration. This, in general, will enable us to proceed with more precise and accurate modeling, based on the cleaned-up spectra. MSC can be a very powerful general pre-processing tool.

A short description of how MSC works on a given data set is given below.
• All object spectra are plotted against their average spectrum; see Figure 11.15.
Note that in The Unscrambler, this plot is automatically generated by doing
Task – Statistics, viewing the results, then choosing Plot – Statistics – Scatter.

Figure 11.15 - All individual object spectra plotted against their average
spectrum
[Schematic scatter plot: individual spectral values (vertical axis) vs. average spectral values (horizontal axis)]

• A standard regression is fitted to these data, with offset A(i) and slope B(i), called the common offset and the common amplification respectively. The index i runs over all individual objects in the data set.

• The rationale behind MSC is to compensate for these so-called common effects,
i.e. to correct for both the amplification and/or offset (see the options below).


The MSC function replaces every element in the original X-matrix according to one
of the equations below.

Equation 11.4 - MSC corrections

Common offset:          Mnew(i, k) = M(i, k) - A(i)

Common amplification:   Mnew(i, k) = M(i, k) / B(i)

Full MSC:               Mnew(i, k) = (M(i, k) - A(i)) / B(i)

Common offset only corrects additive effects, while common amplification only
corrects multiplicative effects.

In practice you select a range of X-variables (spectra) to base the correction on.
One should preferably pick out a part that contains no clear specific chemical
information. It is important that this “MSC-base” should only comprise background
wavelengths, in so far as this is possible. If you do not know where this is, you may
try to use the whole set of variables. The larger the range, the better, but this also
implies a risk of including noise in the correction. Or worse, we may thus
accidentally include (some of) the “chemical specific” wavelengths in this
correction. Omit test samples from the base calculation.

The MSC-base is calculated on this selected X-range and the MSC coefficients (A
and B) are calculated accordingly. The whole data set, including the test set, will be
corrected using this MSC base. Finally you make a calibration model on the
corrected spectra. Before any future prediction, the new samples must of course
also be corrected using the same MSC-base.

The A and B coefficients may in some cases contain a certain kind of signal
information in them. In certain advanced applications these vectors may in fact be
added as extra variables in the X-block. In exercise 12.2 you will try out MSC in
practice. Likewise, in addition to the original MSC concept, still other uses have
been found for MSC, consequently now known as Multiplicative Signal Correction.
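
For readers who want to see the mechanics outside The Unscrambler, the following is a minimal Python/numpy sketch of full MSC under simplifying assumptions: the spectra are stored in a matrix with objects as rows, and the MSC-base is taken as the whole wavelength range (in practice you would restrict it to background wavelengths, as described above). The data and function names are made up for illustration:

import numpy as np

def msc_fit(X_cal):
    # The MSC-base: the average calibration spectrum
    return X_cal.mean(axis=0)

def msc_correct(X, mean_spectrum):
    # Fit each object spectrum against the average spectrum:
    # x_i(k) ~ A(i) + B(i) * mean(k), then apply the full MSC correction
    X_new = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        B, A = np.polyfit(mean_spectrum, x, 1)   # slope B(i), offset A(i)
        X_new[i] = (x - A) / B                   # full MSC (Equation 11.4)
    return X_new

# Made-up "spectra": one underlying shape with different offsets and scalings
base = np.sin(np.linspace(0, 3, 50)) + 2.0
X_cal = np.vstack([1.2 * base + 0.3, 0.8 * base - 0.1, 1.0 * base + 0.05])

mean_spec = msc_fit(X_cal)
X_corrected = msc_correct(X_cal, mean_spec)
# New (prediction) samples must be corrected with the same MSC-base:
X_new = msc_correct(np.vstack([0.9 * base + 0.2]), mean_spec)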


11.3.4 Differentiation
The first or second derivatives are common transformations on continuous function data where noise is a problem, and are often applied in spectroscopy. Some local information gets lost in the differentiation, but the “peakedness” is supposed to be amplified, and this trade-off is often considered advantageous. It is always possible to “try out” differentiated spectra, since it is easy to see if the model gets any better or not. As always, however, you should preferentially have a specific reason to choose a particular transformation. And again, this is really not to be understood as a trial-and-error optional supermarket - experience, reflection, and more experience! The first derivative is often used to correct for baseline shifts. The second derivative is an often used alternative for handling scatter effects, the other option being MSC, which handles the same effects.

In The Unscrambler differentiation is available from the Modify – Transform – Derivatives menu.
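
As a simple illustration of the principle (The Unscrambler and dedicated packages offer smoothed derivatives, e.g. Savitzky-Golay, which are usually preferable for noisy spectra), a minimal Python/numpy sketch of first and second derivatives by finite differences, applied to a made-up spectrum, could look like this:

import numpy as np

# Made-up spectrum sampled at equidistant wavelengths
wavelengths = np.linspace(1100, 1600, 101)
spectrum = np.exp(-((wavelengths - 1350) / 40.0) ** 2) + 0.002 * wavelengths

step = wavelengths[1] - wavelengths[0]
first_derivative = np.gradient(spectrum, step)            # removes constant offsets
second_derivative = np.gradient(first_derivative, step)   # also removes linear baselines

# The additive linear baseline term (0.002 * wavelength) becomes a constant in the
# first derivative and essentially disappears in the second derivative.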

11.3.5 Averaging
Averaging is used when the goal is to reduce the number of variables or objects in
the data set, to reduce uncertainty in measurements, to reduce the effect of noise,
etc. Data sets with many replicates of each sample can often be averaged over all
sets of replicates to ease handling regarding validation and to facilitate
interpretation. The result of averaging is a smoother data set.

A typical situation in routine applications is fast instrumental measurements, for instance spectroscopic X-measurements that replace time-consuming Y-reference methods. It is not unusual for several scans to be done for each sample. Should we average the scans and predict one Y-value for each sample, or should we make several predictions and average these? Both of these give the same answer, which is why averaging can also be done on the calibration data and its reference values. Averaging is available from The Unscrambler’s Modify – Transform – Reduce (Average) menu.
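
A minimal Python/numpy sketch of this kind of replicate averaging, assuming the replicate scans are stored as consecutive rows of the data matrix (the data and the helper name are made up for illustration):

import numpy as np

def average_replicates(X, n_replicates):
    # Average each consecutive block of n_replicates rows into one row
    n_samples = X.shape[0] // n_replicates
    return X[:n_samples * n_replicates].reshape(n_samples, n_replicates, -1).mean(axis=1)

# Made-up data: 6 scans (2 samples x 3 replicate scans) of 4 variables
X = np.arange(24, dtype=float).reshape(6, 4)
X_avg = average_replicates(X, n_replicates=3)
print(X_avg.shape)   # (2, 4)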

11.3.6 Normalization
Normalization is concerned with putting all objects on an even footing. So far we have mostly treated the so-called column transformations, i.e. pre-processings or transformations which act on one column vector individually (single-variable transformations).


Normalization is performed individually on the objects (rows), not on the variables (columns). Each object vector is re-scaled - normalized - into a common sum, for example 1.00 or 100%.

The row sum of all variable elements is computed for each object. Each variable
element is then divided by this object sum. The result is that all objects now display
a common size - they have become “normalized” to the same sum area in this case.
Normalization is a row analogy to column scaling (1/SDev).
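
A minimal Python/numpy sketch of this row-wise normalization to a common sum of 1.00 (equivalently 100%), with made-up data:

import numpy as np

# Made-up data: 3 objects (rows) x 4 variables (columns)
X = np.array([[2.0, 3.0, 5.0, 10.0],
              [1.0, 1.0, 2.0,  4.0],
              [4.0, 6.0, 10.0, 20.0]])

row_sums = X.sum(axis=1, keepdims=True)   # one sum per object
X_norm = X / row_sums                     # every row now sums to 1.00

print(X_norm.sum(axis=1))                 # [1. 1. 1.]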

Normalization is a common object transformation. For instance, in chromatography it is used to compensate for (smaller or larger) variations in the amount of analyte injected into the chromatograph. Clearly it will be of considerable help in the analytical process if this particular measurement variance can be controlled by a simple data analytic pre-processing like normalization, otherwise a whole extra PLS-component would have to be included in order to model these input variations.

There are several other data analysis problems where normalization can be used in a similar fashion, even though the physical or chemical reasons for the phenomena compensated for may be very different from the chromatographic regimen mentioned above. This scenario is analogous to the augmented MSC usages for other purposes mentioned earlier.

Normalization is available from the Modify – Transform – Normalize… menu in The Unscrambler, with several options.

11.4 Non-Linearities
As mentioned above, Y may be a linear function of the combination of several non-
linear X-variables and thus present no problem for bilinear regression. In many
cases PLS can handle non-linearities up to second-order T-U relationships by using
a few more PLS-components. How do we see if there are non-linearities in the
data? There are many plots that reveal this:

Signs of Non-Linearities
The objects show a curved pattern in the X-Y Relation Outlier plots. Residuals have
a curved pattern, U-form, S-form, or arch form. Y-residuals may be plotted against:
• Objects sorted by Y-value, or in run order, or some other suitable order that may suit the residual display (problem-dependent, of course)


• Normal probability of Y-residuals
• Predicted Y-values
• X-variables (which may show where the non-linearity is dominating)
• PLS-components (e.g. Y-residuals vs. T or T vs. U)

Figure 11.16 - Weak non-linearity in the X-Y Relation Outlier plot; a slight, but distinct curvature

[Plot: U Scores vs. T Scores; Alcohol, PC(X-expl,Y-expl): 1(89%,49%)]

Figure 11.17 - Non-linearities in normal probability residual plot


[Normal probability plot: cumulative probability (%) vs. Y-residuals; Alcohol, (Y-var, PC): (Ethanol,2)]


The “inverse” S-shape of the data is typical in data sets where non-linearities are
present.

11.4.1 How to Handle Non-Linearities?


• Try to use a few extra components, but be even more aware than usual of the
danger of overfitting!

• Try a problem-dependent transformation like a logarithmic transformation, square root, exponential, etc. (Needless to say - experience is critical).

• Add, for example, a second-degree term to the model (i.e. square or cross terms of the variables, e.g. x1*x1 and x1*x2). This gives models that are easier to interpret and you avoid mixing up non-linearities with the general error. These additional model terms are automatically generated in The Unscrambler by clicking Interaction and squares in the Modify – Edit Set dialog; a small sketch of how such terms can be generated is also given after this list.

Do not use this approach unless the application is likely to have second order characteristics, you know well what you are doing, and you are willing to take full responsibility for the data analysis results, because it implies a large risk of overfitting!

• An alternative is to first compute T (the scores) and then add square and cross
terms of these score vectors, for each PC. This expanded scores matrix is then
used alongside the “ordinary” X-variables in the regression, with added columns
such as t1*t1, t1*t2. The first PCs are the most stable, so it is safer to use the
scores from only a few of the first PCs. Using more terms again increases the risk
of overfitting. (To perform traditional MLR-regression with The Unscrambler,
run PLS or PCR with max. number of PCs.)

• Try to delete “bad” variables - see also further below.

• If preprocessing or rank reduction do not help, and the model is not good enough
for your needs, try a non-linear modeling method instead. Examples are non-
linear PLS, spline-PLS or a Neural Network. Since very few applications suffer
from such severe non-linearities that the above tricks are in vain, it is beyond the
scope of this book to go into these. However we will give a few hints on Neural
Networks below.

Keep things as simple as possible to avoid making mistakes!
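
As referred to in the bullet on second-degree terms above, square and interaction (cross) terms can also be generated outside The Unscrambler. The following is a minimal Python/numpy sketch with a made-up X-matrix; the helper name add_squares_and_interactions is hypothetical, not an Unscrambler function:

import numpy as np

def add_squares_and_interactions(X):
    # Append xj*xk columns (j <= k) to the original X-matrix,
    # i.e. squares (j == k) and interactions (j < k)
    n_obj, n_var = X.shape
    extra = [X[:, j] * X[:, k] for j in range(n_var) for k in range(j, n_var)]
    return np.column_stack([X] + extra)

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 0.5],
              [0.5, 3.0, 2.0]])

X_expanded = add_squares_and_interactions(X)
print(X_expanded.shape)   # (3, 9): 3 original variables + 6 square/cross terms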


Neural Networks - Hot but Risky


Artificial Neural Networks have become very popular for handling non-linear
problems. Many implementations somehow lack one or more of the critically
important facilities like interpretation, diagnostics, outlier detection, and validation.
Hence there are, unfortunately, lots of severely overfitted applications around.

An interesting approach to correct this was developed by the Danish Meat Research
Institute and is implemented in the Neural-UNSC package as an add-on module to
The Unscrambler. Data are preprocessed by ordinary PCR or PLS, giving you
access to interpretation of the main variations, outlier detection, diagnostics and
transformations. If there are clear signs of remaining non-linearities, the PLS-score
matrix is used as input to the Neural Network. This gives a small network that is
fast to train, and guarantees the minimum solution of PCR or PLS. Neural-UNSC
also offers test set validation, which makes modeling quite safe.

11.4.2 Deleting Variables


Lack of Variability
Variables with low or no variability after autoscaling should often be deleted
because they may disturb the modeling (at least if present in large quantities). Very
low residual (X-) variances of the scaled data may indicate a lack of variation, as
do very small loadings or loading weights. Descriptive statistics (small variance)
indicate the same thing.

Noisy Variables
A model may likewise be improved by deleting especially noisy variables. These
usually make little contribution to the model, with small loadings and loading
weights. Plot the loadings and/or the loading weights for the relevant components
in the same plot and find these variables. Then make a new model without these
variables, using Task – Recalculate without marked, or manually selecting them
in the Keep Out field of the regression dialog, and see if the model gets better.
Remove variables by weighting them by zero or deleting them completely from the
raw data matrix.

It may be difficult to study the results for many components until some experience
with large data sets has been acquired. An alternative is to also study the
B-coefficients (regression coefficients), since there is only one B-vector for a
model with A relevant components. A small B-value may be a sign of a noisy or
unimportant variable, which you may try to delete. Note the risks when interpreting


the B-vector however! A small B-value for a variable may still be due to large
measurement values, or due to interaction between several variables. So as a
precaution, always re-model and check that the results really are improved.

If the data are scaled by 1/SDev, you should study the Bw-vector instead, which
takes into account both large and small variable values.

Problems with PLS2


There are situations where PLS2 gives models with low explained Y-variance. This
typically arises when we are trying to model several different (i.e. non-correlated)
Y-variables simultaneously. A more efficient approach is here to run PLS1 on one
Y-variable at a time and find those that are not well described (low explained
variance). Remove the badly modeled Y-variables and try PLS2 again – or it may
turn out to be best to stick to the sequential PLS-modeling.

11.5 Procedure for Refining Models


In which order should we try the different approaches to improve a model? There is
no strict canonical order, but the summary below starts with the most obvious steps.
If you still get unexpected results, you should always first reflect on whether the
data are truly representative and the chosen validation method appropriate.

1. Start with PLS2 if there are several Y-variables. It will give a good overview of
the data without too much modeling work. Use leverage correction for
validation in the initial rounds, unless you have data suitable for test set
validation. This is much faster than cross validation and gives exactly the same
model, except for the estimate of the prediction error, which you will not trust at
this stage anyhow.

2. Look for outliers. Correct errors. In the case of “true” outliers, remove them, a
few at a time, always starting with the ones that appear first in the earliest PCs.
Study the validation variances to check that the model is improved as you
remove objects, but do not use RMSEP as the only indicator. It is time to start
developing a more holistic feel for a multivariate calibration model.

3. Check the residual variances in X. Variables with small variability can usually
be omitted. There are also other ways to inspect the variable set for “outliers”.


4. If the model still seems strange, for instance the scores pattern is very different from what you expect, and there are no obvious outliers, consider data transformations. Check whether the pertinent score plots look better after the transformation, usually using the X-Y Relation Outliers plots. Use RMSEP to compare the quantitative prediction ability of the different models.

5. Try separate PLS1 models for each variable. In most cases they will have better
prediction ability than a global PLS2 or PCR model. If PLS2 does not give good
enough results, you may model each Y-variable separately by PLS1 to find
Y-variables that are badly modeled and remove these.

6. Try to add second order terms (interactions and squares) if the data appear
(very) non-linear and the application is likely to have second order
characteristics.

7. Delete noisy variables. Check the loading-weights and the B-coefficients to find X-variables with low contributions, which is often due to noise. In many cases this may result in a lower prediction error.

8. Always remember that the goal of all the “rules” in this book is just to provide
a safety net. The objective is to make you a self-confident, creative data analyst
as soon as possible.

9. The only way to become just that: a self-confident, creative data analyst is to
start performing as many PCA (PCR) and PLS-R multivariate calibrations as
indeed possible, so please continue with chapters 12-13!

11.6 Precise Measurements vs. Noisy Measurements
We may often be in a situation in which our X-data (or Y-data) are very precise, for
example spectroscopic measurements or other chemical or physical measurements
(weights, digital readings, etc.). And at other times we may be faced with X or Y
data which are distinctly imprecise (noisy), in relative or absolute terms. An
example would be geochemical data from geological samples (e.g.
%-measurements of SiO2, Al2O3) with significant measurement errors, of the order
of several percents or more.


It is of paramount importance to be able to handle these two drastically different situations - as well as anything in between of course. Quite a lot of unnecessary confusion and indeed debate within chemometric circles is nothing but a manifestation of a researcher generalizing far more than what is strictly legitimate based on experience limited to one field.

In many instrumental measurements we typically have relative measurement errors significantly lower than 1%, down to the ppm-level. This is very precise data. Spectroscopic peak intensities (nowadays in the form of direct, or electronically integrated, digital output) certainly fall in this category. In a typical precise data situation, for instance a full-spectrum spectroscopic calibration against a chemical analyte, we would be interested in a model with the smallest possible prediction error. It matters very much whether the relative RMSEP is 1% or 0.1% in this type of situation.

When data are noisy, we must in general take a more relaxed attitude towards both
the fraction of Y-variance that can be modeled and the obtainable RMSEP.
Consider a set of geological samples (objects) characterized by a relevant set of
geochemical variables. These may include both the so-called major element oxides
(compositionally in the % range) and trace element concentrations (in the ppm
range). An average measurement error of some 5% or more is not unusual here;
typical ranges would cover 1-15%.

Add to this a significant sampling error (e.g. localization error of the individual
geological samples from an inhomogeneous rock type) which will also lead to a
significantly reduced signal to noise ratio. In such an imprecise data situation, there
are quite different quantitative goals for the fraction of explained X and Y variance,
as well as of the obtainable RMSEP. It may for example be quite satisfactory to
model X and predict Y with an explained variance of the order of, say, 50-60%.

Process chemometric applications are another area where inherently noisy data may occur, and there are many other analogous examples in practical multivariate data applications: biological data, environmental data, survey data. All of which tells us to bear in mind the simple distinction between precise and noisy data.


11.7 How to Interpret the Residual Variance Plot
Increasing Residual Y-Variance
In PLS an increase in the residual variance is a sure-fire indication of problems! Very often these problems are related to outliers, in which case they are easily dealt with: if the increase is in the early components, look for outliers in the appropriate X-Y Relation Outlier plot.

If the residual variance increases all the time (in all PCs), the model is certainly wrong! Maybe a transformation is required, or maybe there are no systematic relationships between X and Y. If you are using PLS2, try to run PLS1 for each Y at a time, and see if this gives any better results.

If the Y residual variance increases, it means that the model does not describe the
relationships between X and Y, quite the opposite. In PLS you should then of
course not even inspect the loading-weight plots or try to interpret your model.

However, in PCR you model the X-structure as it is, all the way. In the PCR
situation, one may in fact, therefore, see such an increase in the residual Y-variance
for the first components, which does not necessarily signify problems. The first
components model the dominant X-structure no matter what, so more PCs are
needed to also model the Y-variations in this case. Such a model may eventually
end up being acceptable, though necessarily with more components than its
alternative PLS-cousin.

How to Choose the Number of PCs?


In general the best model is the one with the most unambiguously determined number of PCs at the minimum in the residual Y-variance plot, i.e. showing the clearest V-shaped minimum in this empirical prediction validation. However, there are plenty of real-world data sets which do not exactly correspond with such an ideal and distinct V-definition at the pertinent PRESS (or total prediction error) minimum, or where in fact the minimum is manifested by a very flat disposition over several components. Thus if some, several or many PCs have almost the same low variance, we must study other results, for instance Predicted vs. Measured and RMSEP, to understand the meaning of this. This particular prediction error manifestation may be the result of a few large and many small object residuals, but
it may also be the result of all object residuals having a similar size. The rule here
is very clear: use the first minimum (or the left-most part of the V-shape).

Why should you ever choose fewer PCs if it gives a larger residual variance? It is
because your validation set may not necessarily be totally representative of the
future prediction samples. Choosing fewer PCs gives a more robust model, which
is less sensitive to noise and errors, especially the unavoidable sampling errors. It is
the representativity of the validation, i.e. the representativity of the prediction error
estimation, which is at stake here, not the minimum RMSEP as such.
It may be a great help to look at the pertinent 1-D loading-weights and
B-coefficients. When this curve is noisy in some particular components, this part of
the model is unstable. You should then select fewer components, where the loading
weights are not so noisy.

When Can You Accept a Small Increase in the Residual Y-Variance?
In some situations you may encounter a small local increase before the obvious overall minimum, both in PLS and PCR. This may often be attributed to a slight outlier, but it may also be a sign of a validation set which is not quite representative of the calibration set. You may accept such models, if you can argue why.

Interpretation of Scores and Loading-Weights from “Bad” Variance Curves
How do you interpret early components, which display an increase in the variance?
In PCR the first PCs do indeed describe the real relationships within X (provided
that X-calibration variances decrease), but these relationships are not directly
connected to Y. Thus you cannot say which variables have a large impact on Y, but
you can still interpret the X-variable relationships and sample patterns.

However, if you get a decrease in the residual variance but the prediction error is
too high, then you may interpret the loading-weight plots as an indication of real
relationships. (The RMSEP for a variable is approx. the square root of the residual
variance divided by the weights used. RMSEP is the error in Y in its original units.)
A low explained variance of course indicates an unsatisfactory model and the
loading-weight plot corresponding to the first components only shows you the most
general relationships related to that degree of explanation. For instance an
explained variance of 10% given by the first two PCs means that only 10% of the
variations are explained and that these variations follow the pattern of the loading-
weight plots for PC1 & PC2.

It is always a distinctly useful option to plot the calibration X-variance or the accompanying validation X-variance for individual X-variables, to see how each X-variable is being modeled. You will then see which ones are easily modeled, and modeled early, and which ones are not well modeled. With many different X-variables, it may be an idea to try to model only the most important ones. For all multivariate calibration work it is naturally of importance to know how much of the X-variance goes into doing the work of the Y-modeling, which has been quantified by the Y-variance plot.

Cross Validation
One must always select the pertinent cross validation options very carefully. As a telling example, if you have selected full cross-validation and one singular sample fits very badly with the model, this may make the total prediction error look much worse than it would with this particular (yes, you guessed right) outlier policed away.

Running a tentative model with leverage correction will give you an indication of
the prediction error, without the risk of making this type of mistake in selecting
cross validation options. The model will be the same as with cross-validation, but
the prediction error may be too optimistic (i.e. too low).

Variance/PC
For PCA and PCR: PC1 is responsible for the largest fraction of the X-variance,
PC2 takes care of the second largest fraction, and so on.

For PLS: PLS-component 1 takes care of the largest fraction of the Y-variance
modeled, which normally decreases with each additional component, and so on, no
matter what fraction of the X-variance is modeled. There are several instructive
examples from recent chemometrics with only some 10% (or even less) of the
X-variance in use for the first component, in which PLS was indeed able to isolate
just the right miniscule fraction of the X-variance, for example accounting for 70-
80% of the Y-variance alone. Examples in which the second PLS-component
accounted for a much larger fraction of the X-variance for a much more modest
additional Y-fraction can also be found.

Also with small data sets you may sometimes observe a similar feature, i.e. that a
“later PC” is accounting for a larger fraction of the variance than an earlier one.
When working with small data sets there are three basic causes for this:


1. Small data sets where the raw data are distributed in a certain way: for instance,
cigar-shaped along the dominant direction(s), but with a circular cross-section.
2. One of the variables in your data set is somewhat special, for example nearly constant.
3. The data set is so small that the internal program iterations are oscillating
between two fixed positions.

There are many other “strange” effects that may occur when analyzing small data
sets. Mostly this is due to the fact that a very small data set displays a rigid,
irregular data structure compared to larger data sets. For example a data set with
only eight objects can barely support two components. There is absolutely no point
in aiming for complete modeling, validation and prediction on such a small data
set. It will only work if all 8 points line up very regularly in a linear way.

Always inspect your raw data (for example by PCA) before you perform any
regression modeling. You can avoid many frustrations by deciding at the outset that
a particular data set cannot really support multivariate modeling at all.

11.8 Summary: The Unscrambler Plots Revealing Problems
Residual variances, RMSEP: the shape and level may indicate problems such as
outliers, bad prediction ability and lack of fit.

Scores (X-Y Relation outliers): may show outliers, subgroups, and non-linearities.

Residuals: may show outliers, non-linearities, and systematic errors.

Loading-weights, B-coefficients: may show noisy variables or variables lacking variability.

Histograms of variables: may show skewed initial distributions.

Descriptive statistics: may show lack of variability.

Predicted vs. measured Y: may show systematic errors, inaccuracy, and deviations between predicted and measured values.


Regarding the Predicted vs. measured Y plot: It is bad form indeed to use the Pred/Meas plot as a modeling tool, for instance using this plot to spot outliers. You must aim at developing sufficient modeling experience to “catch all outliers”, and indeed be able to do all your modeling before this last plot is ever invoked for evaluation of the established model’s prediction strength!

Now you are ready to graduate to the next two sets of PLS-exercises, chapters 12-13.


12. PLS (PCR) Exercises: Real-World Applications - II
12.1 Exercise - Log-Transformation (Dioxin)
Purpose
This exercise illustrates how log transformations can lead to improved results in
preprocessed models compared with using the original data only.

Problem
Combustion of waste and fuel residuals containing chlorine is a well-known source
of emission of organic micro-pollutants, especially of chlorinated organics.
Sampling and analysis of micro-pollutants in flue gases is both complicated and
expensive. Both complexity and costs increase when the detection limit is lowered,
e.g. as regulations become stricter. This applies in particular to ultratrace toxic
components like polychlorinated dioxins (PCDD) and dibenzofurans (PCDF).

With suitable indicator parameters it may however sometimes be feasible both to enhance the precision in the measurements and to reduce costs by using multivariate regression analysis. The correlation between chlorinated benzenes and PCDD/PCDF is the basis for novel indirect measurements, i.e. to predict the emission levels of specific isomers without actually measuring them. Clearly such an undertaking is critically dependent upon a thorough knowledge of representative correlations among key variables in emission gases, etc. For the present study such guarantees are at hand.

This exercise is based on a study made by Swedish environmental consultants Tomas Öberg Konsult AB. The input data are from different Swedish steel plants’ flue gases.


Data Set
The data are stored in the file DIOXIN, with variable sets X and Y. The
X-variables contain the measured levels of ten different isomers of chlorinated
benzenes in seven samples. The Y-variables consist of the measured TCDD-equivalent levels in nanograms in the same seven samples.

Task
Make a PLS model that predicts TCDD (dioxin) from the associated chlorinated
benzenes. There may be a need for preprocessing.

How to Do it
1. Read the file DIOXIN and plot the data using a matrix plot. Use View -
Variable Statistics to have a look at min, max etc… for all variables. Also
plot histograms of each variable; Use View-Plot Statistics to display the
skewness of the distribution. Are the distributions skewed? Mildly, or
severely?

Note!
A skewness around ± 1.0 indicates a rather severe asymmetry.

Make two PLS1 models with leverage correction, one with and one without
standardization, computing up to 4 PCs. Study the explained Y-variance curve
for each of the models. Are these models satisfactory?

When variable distributions are very skewed, a Log transformation may often
help. Such data actually often follows a log-normal distribution, so it is a good
first try to log transform the data. Similar effects will occur for analogous
transformations, e.g. roots.
Go to Modify - Compute to compute a function. Make sure all samples and
all variables are selected in the Scope field and type X=Log(X) in the
Expression field. Save the data table with a new name, e.g. Dioxin Log.

2. Make a new PLS1 model. Should we standardize or not? Look at the Y-residual variances and decide how many PCs are optimal. How large is the explained Y-variance? Is the distribution of samples satisfactory? (hint: look at the X-Y relation outliers plot)

Interpret the scores, then the loadings. Is there something wrong when all the
loadings are positive along PC1? Which X-variables are highly correlated with the level of TCDD? Plot the regression coefficients for the optimal number of
PCs. If reduction of the X-matrix is needed, can some X-variables be left out
without decreasing the quality of the model?

3. Try this by making a new model where you keep the variables with smaller
regression coefficients out of calculation. You do this by marking the
unimportant variables on the plot, then selecting Task-Recalculate Without
Marked. Does the explained variance change significantly? Plot Predicted
vs. Measured to study the prediction ability of the two models.

4. Make a new model version now using full cross validation. Is it possible to
use test set validation on this data set? Is cross validation a problem? Is the
model still ok? Can we use RMSEP as a representative measure of the future
prediction uncertainty for this model?

5. Look at the predictive ability of the model. Plot Predicted vs. Measured.
Is this satisfactory?

Summary
The histograms show that most of the variables have rather skewed distributions,
which certainly will cause problems in the analysis. The models based on non-
transformed data were not very good; they needed 4 PCs to explain a fair portion of
the Y-variance.

Since the log-transformed variables have varying standard deviations, we need to standardize them. The model based on Log(X) and Log(Y) was much better in this case; two components explained about 94 % of the Log(Y).

Nothing is wrong when all loadings are positive along PC1. It just reflects the
overall correlation among the X-variables. Since we have standardized the
X-variables, all regression coefficients share a common scale and can be compared
to each other. Variables X3, X4, X5 and X6 have smaller coefficients than the
others and can be removed. The reduced model needs only one PC and has a lower
RMSEP than the previous one. You may also have tried to remove X1, which
contributes to the model mostly through PC2: it does not harm the model.

You may argue that candidates for variable deletion are those which strongly
correlate, since this perhaps should imply that one of them would be enough. This is not however the case. The collinearity of several variables is often important to
stabilize a model. You can easily try this out of course, and see what happens.

The scores plot shows a satisfactory distribution of samples, with no clear groups,
but there are few intermediate samples – each sample has an influence on the
model. We can try full cross validation; this proves to work fine in this study. In
this case leverage correction also gave a good estimate for the prediction error. Test
set would probably not work here, since the data set is extremely small. We need
all the data present to make a model. Note however that it is actually possible to
carry out a meaningful PLS-regression if/when all samples participate in a more-or-
less homogeneous spread of the Y-space (and the corresponding X-data are well
correlated).

Since we have log-transformed the data, we have to compute Y and Y-predicted again, and calculate RMSEP manually to find a RMSEP which can be compared with the RMSEP on non-transformed data. RMSEP using predicted results for 2 PCs is about 1.4, which is quite good compared to the observed measurement levels of TCDD. We do not know the SDD in this example however.
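
A minimal Python/numpy sketch of this back-transformation step, with made-up log10-scale reference and predicted values (not the actual DIOXIN results):

import numpy as np

# Made-up values on the log10 scale (as modeled), for illustration only
y_ref_log = np.array([0.30, 0.85, 1.20, 0.60, 1.45, 0.95, 1.10])
y_pred_log = np.array([0.41, 0.78, 1.31, 0.52, 1.50, 1.02, 1.04])

# Back-transform to the original concentration scale
y_ref = 10.0 ** y_ref_log
y_pred = 10.0 ** y_pred_log

# RMSEP in original units, comparable with models on non-transformed data
rmsep = np.sqrt(np.mean((y_pred - y_ref) ** 2))
print(round(rmsep, 2))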

12.2 Exercise - Multiplicative Scatter Correction (Alcohol)
Purpose
This exercise illustrates spectroscopic calibration, the practical construction of a set
of test samples, PLS2, MSC transformation, and of course, the perennial outlier
detection.

Problem
We want to generate a PLS-model to predict alcohol concentrations in mixtures of
methanol, ethanol, and propanol. This is to take place by using spectroscopic
transmission data (NIR) over 101 wavelengths from a specific instrument (Guided
Wave, Inc.). This particular application has also previously been used by Martens
& Næs in their textbook “Multivariate Calibration”. The calibration samples have
been constructed as a triangular mixture design.

All mixture samples have been carefully prepared in the laboratory. The mixing
proportions are used directly as reference Y-values. The inaccuracy in the reference
method thus only consists of the variations in sample preparation, volume measurements, etc. Some 16 samples are used as calibration samples. In addition 11 new samples have also been prepared from scratch, to be used as a pristine test set. These samples are therefore regarded as ideal for testing purposes; they cover the same Y-span as the calibration samples and the only difference is in the “sampling variation”, which in this particular situation mostly consists of the preparation procedure replicate variation, i.e. the repeatability.

The spectra have been transformed to absorbance units. No single wavelength can
be used alone, because of strongly overlapping spectra. NIR spectra of mixtures
may often exhibit scatter effects due to interference. This causes shifts in the
spectra, which is often a very big obstacle in the multivariate calibration game. But
this can be corrected for by MSC.

Data Set
The samples are characterized by different mixing proportions of the three alcohols
methanol, ethanol and propanol, always adding up to 100%. The three pure
alcohols are also included.

The data are stored in the file ALCOHOL. The total sample set, called Training (27
samples), and variable sets, Spectra (101 wavelengths) and Concentrations (three
Y-variables) are used in the calibration. The first 16 samples (A1 - A16) should be
used as calibration samples. Samples 17 - 27 (B1 - B11) should be used as test
samples to validate the model.

The sample set New may be used for prediction. The sample set MSCorrected
contains MSC transformed spectra for comparison.

Tasks
1. Make a full PLS2 model.
2. Look for outliers and consider transformations, if necessary.

How to Do it
1. Study raw data
Read the data file ALCOHOL by File - Open. Study the data in the Editor and
by View - Variable Statistics. Close the Editor with the statistics after you have
looked at the results.


Display the spectra in a matrix plot to have an initial look. Use Edit - Select
Variables and select the variable set Spectra. Choose Plot - Matrix, and select
as scope sample set “Training”.

Plot variables 1-3 as a general 3D Scatter plot. Use the total sample set Training
(27 samples) for the plot.
Do you recognize the design in the Y-data? If you have trouble with this, click
on a few points and study their coordinates (X = Methanol, Y = Ethanol,
Z = Propanol).

You may also try Edit - Options – Vertical Line or View – Viewpoint- Change,
or View – Rotate. Notice the contents of samples no. 1, 2, 3 and 17, 18, 19.
Close the Viewer when you are finished looking at the plot.

2. Make a PLS2 model


Make a PLS2 model by Task - Regression. Reflect on the following questions
before clicking OK:
Is weighting necessary for the X-variables? What about the Y-variables?

How many PCs do you expect to find in this mixture data set? Why?
Of course you should calibrate for a few more than that. Why?

Here are the suggested options:

Sample Set: Training
X-variables: Spectra
Y-variables: Concentrations
Weights: 1.0 in X, 1/SDev in Y
Validation: Test Set (Samples 17 - 27)
Number of components: 6

To configure test set validation, check the Test Set box, then click Setup…;
choose Manual Selection, and select samples 17-27.
Change the warning limits for outlier detection: press Warning limits and change
fields 2 – 7 to a value of 5.5.

Now you can click OK. Study the model overview in the progress dialog.
Are there any outliers?
How many PCs seem enough or optimal? Why?


3. Study model results


Look at the score plot for PC1 vs. PC2. Use Edit - Options and enable sample
grouping. Use color separation and group by the different Y-variables to check
whether the grouping reflects the alcohol composition of the samples.
Would you expect to see groups in this triangular design? Hopefully not! But
grouping there is!

The only factors varied were the proportions of the three alcohols, and the
design was overall symmetric. Something is wrong here!

Make the residual variance plot active. Check the residual Y-variance for all
individual Y-variables by Plot – Variances and RMSEP, clicking “All” and
removing “Total”.
What is wrong in this plot?

Go back to the Editor and plot a few individual spectra (e.g. 4 - 10) as lines. If
necessary, use Edit – Options and choose to display the plot as curves. Use
View - Scaling - Min / Max to enlarge the detail in the picture between variables
20 and 60. Scale the ordinate axis to the range 0 - 0.5.
Do all spectra have the same baseline?

Variables 34 to 48 (1250 - 1320 nm) may be assumed not to contain chemical


information, i.e. this is a pure background absorbance region. The absorbance in
the spectra should then ideally be equal in these variables, and should touch the
base line. This is utilized in the MSC correction to be invoked below.
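
As a rough illustration only - a generic sketch, not necessarily the exact algorithm behind the program's Common Offset option - an MSC-type correction based on such a baseline region can be written as a regression of each spectrum on the mean calibration spectrum over the chosen channels; the names X, cal_idx and region below are assumptions.

# Generic MSC sketch: regress each spectrum on the mean calibration spectrum
# over a chosen baseline region, then remove the offset (and optionally the slope).
import numpy as np

def msc(X, cal_idx, region, multiplicative=False):
    """X: spectra row-wise; cal_idx: calibration rows; region: baseline channels."""
    ref = X[cal_idx][:, region].mean(axis=0)           # mean baseline, cal. samples only
    Xc = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        slope, offset = np.polyfit(ref, x[region], 1)  # fit x_i ~ offset + slope * ref
        Xc[i] = (x - offset) / slope if multiplicative else x - offset
    return Xc

For the present exercise a call like msc(X, np.arange(16), np.arange(30, 45)) would mimic basing the correction on calibration samples 1-16 and variables 31-45, while still correcting all rows of X (test and prediction samples included).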

4. MSC transform the data


Close the Viewers, unmark all selections in the Editor, and go to Modify-
Transform - MSC.

Select the sample set Training and variable set Spectra in the Scope field. The
correction method is Common Offset. We want to use variables 31 to 45 as the
basis for this correction; this is done by omitting all the other variables
(1-30,46-101) in the selection field at the bottom of the dialog box. Also exclude
samples 17 - 27 because the correction should only be based on the calibration
samples. The test samples will still be corrected, since they are included in the
scope you have chosen. Click OK.

A dialog box pops up, asking you whether you want to save the MSC model.
Reply Yes, and give the model a name. This will save the model coefficients for
future use (on a prediction data set, for instance). The spectra are now MSC-
corrected. Save the corrected data in the Editor by File - Save As with a new
name, e.g. Alcohol corrected.

Launch a general Viewer by Results - General View and plot the first 10
samples from the original data file (Alcohol) using the variable Set Spectra:
Plot, Line, Browse, Select Alcohol.00d, Samples: 1-10, Var. Set: Spectra,
OK. Select Window - Copy To - 2. Activate the lower plot window and plot the
first 10 corrected samples (from file Alcohol corrected).

Re-scale the plots using either View – Scaling – Min/Max or View – Scaling-
Frame, and see how (well) the MSC correction has transformed the spectra.
Close the Viewer.

5. Make a new PLS2 model and study the results


Make a new model from the MSC-corrected data using the same parameters as
for the previous model. Study the outlier warnings.

Study the score plot.


Do you now recognize the mixing design triangle?
Are there outliers?

Study the validation Y-variance for all three variables.


How many PCs seem enough or optimal? Why?

Look at the plot X-Y Relation Outliers for components 1 - 4 in a quadruple plot.
This plot visualizes the relationship between X and Y along each component.
The samples should lie close to the target line. Outliers stick out from this line
and the samples "collapse" when the optimal number of components is
exceeded. Note: make sure that both Cal and Val samples are plotted!
Which samples are outlying in the present model?

6. Remove outliers and recalibrate


Select Edit - Mark - One by One and mark the outlier(s). Check whether your
choice agrees with the system’s one: try Edit – Mark – Outliers Only. Make a
new model by Task - Recalculate Without Marked. Calculate one or two PCs
more than necessary.
Are there still outliers?

Plot the results and interpret.

Study the design as it is depicted in the score plot and interpret PC1 and PC2.
Use the sample grouping feature to see how the samples are distributed
according to the levels in the Y-variables.

Plot Y-loadings on the subview below the score plot, for PC1 and PC2. You will
see the Y-loadings forming a triangle.

Compare the loading plot and the score plot to interpret the meaning of PC1 and
PC2.
Can you see a slight curvature at the base of the triangle in the score plot?

Plot the residual variances for Y-variables 1, 2 and 3 in the same plot or plot the
RMSEP for all Y-variables.
How many PCs would you use?

Plot the loading-weights as a line plot for PC 1, 2 and 3 together (type “1-3” in
the Vector 1 field).
How can we interpret this plot? Here one must assume the role of a
spectroscopist!

Plot predicted versus measured for each variable, with varying numbers of PCs.
View plot statistics. Toggle between Calibration and Validation samples by
using View – Source or by switching the “Cal” and “Val” buttons alternatively
on and off.
How many components give the best results? Why? Are the test samples as well
predicted as the calibration samples?

Make conclusions: Is the model OK? What do the components “mean”? How
many components should be used? How large is the prediction error for each
alcohol? Is this satisfactory? What do you think about the model’s ability to
predict samples between the test points?

Save the model and close the Viewer before you continue.

7. MS-corrected prediction samples


Before prediction you need to transform the new data the same way as the
calibration samples. Select Modify - Transform - MSC. Select the sample set
New and the variable set Spectra in the Scope field. Now select Use Existing
MSC Model and find the MSC model you made earlier in this exercise.

You do not need to save the MSC coefficients after the correction is done.

8. Predict new samples


Go to Task - Predict. Choose sample set “New”, X-variable set “spectra”. In
this case we also have access to known reference values, so you can tick the
Include Y-reference box and select Concentrations as Y-variable set.

Specify the appropriate model name. Use the optimal number of PLS-
components that you found in the final model evaluation above. Click OK.

Press View and study the prediction results for each response variable. What is
the meaning of the deviations around each predicted value? Do you notice
anything suspicious? Why does the difference between the predicted values and
real concentrations (e.g. at the 0% level) have a larger error than the RMSEP
from the calibration?

Plot Predicted vs. Reference for each response (without table) using 3 quadrants
of the Viewer. Mark the outlier so that you can spot it on all 3 plots.
Is it as easy to detect on the Methanol plot as on the other two plots? What if
you plot Predicted with Deviations?

Save the results and close the viewer. To see how the scores of the new samples
are placed compared to the calibration samples, select Results – General View
– Plot – 2D Scatter and click Browse next to the Abscissa box. Find the
prediction results file and select Tai in the Matrix box, PCs: 1. For the Ordinate
the same is specified except PCs: 2. Now select Edit – Add Plot and fill out the
same as above for the PLS2 model saved earlier.

Summary
In general, X-spectra rarely need to be weighted (though they may equally well be),
while in PLS2 it is often mandatory to standardize the Y-variables. In this rather
special case all three Y-variables are in the same units and measurement range, and
they have roughly the same variance because of the triangular design, so weighting
is not strictly necessary (but it does not hurt either).
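
If you want to verify this numerically, a two-line check will do (a sketch; Y is the assumed 27 x 3 concentration matrix). Dividing each Y-column by its standard deviation - the 1/SDev weighting - is exactly what standardization means here.

# Quick check: 1/SDev weighting gives every Y-variable unit variance.
import numpy as np

Yw = Y / Y.std(axis=0, ddof=1)
print(Y.std(axis=0, ddof=1))    # roughly equal already, thanks to the triangular design
print(Yw.std(axis=0, ddof=1))   # all 1.0 after weighting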

The first model is distinctly bad, with a low explained Y-variance for PC1 and
PC2. The samples are not well spread in the score plot; actually they form two
groups plus one isolated sample, no. 20. It is easy to think that no. 20 is an extreme
outlier, but things would not get better if you removed it. Since we never throw away
outliers before checking the raw data, we plotted the spectra and found signs of a
significant base line shift. Such scatter effects are not unusual in spectroscopy,
especially in fluid mixtures, or spectra of powders and grains.

Multiplicative Scatter Correction can be used to correct for both additive (e.g. base
line shifts) and multiplicative effects. We calculated the common correction
coefficients based on the calibration samples, and the test samples were
automatically corrected with this “base”.

The second model was found to be much better, with a good decrease in prediction
error. Sample number 16 is now an outlier, easily found from the X-Y Relation
Outliers plot.

The prediction error of the final model was minimized at 3 PCs. We would expect
to need only 2 PCs, since the three alcohols always add up to 100%. (If we vary the
contents of Propanol and Methanol, the Ethanol content is given.) You see this both
in the 2-vector score plot, loading plot, and the Y-variance plot for each of the
Y-variables. Propanol and Methanol are negatively correlated in PC1, and
Ethanol is negatively correlated to the combination of the other two in PC2. This
means that PC1 describes the variation in the proportions of Propanol and
Methanol, while PC2 describes the variation of Ethanol against the other two.

The score plot clearly shows the original triangular design. The slight non-linearity
in the triangle base is due to physico-chemical interference effects often associated
with mixtures, but PC3 takes care of this easily. That is why we need 3 PCs in this
“2 phenomena” application.

The RMSEP was about 1.7% for Methanol, 1.9% for Ethanol, and 1.2% for
Propanol using 3 PCs based on the test set validation. This is very good at higher
concentration levels. However as we have only tested new samples at the same
levels as the calibration samples, we cannot be perfectly sure that the model will
work equally well for points in between (but we would certainly expect it to!).
Nonetheless, since there were no signs of severe non-linearities and the calibration
samples cover the whole design space quite well, the model can safely be used for
other mixing combinations.

Normally we do not have reference values for the prediction samples. Note that the
reference value was not used in the prediction; it is only stored to be used in
plotting Predicted Y versus Yref.

Prediction gives a few negative concentration values. One (prediction sample 7)
was an outlier; the others are simply due to the overall uncertainty. We saw,
however, that some of them had a larger error than RMSEP. RMSEP is an average
value of the error, so some errors are smaller and some larger than this.
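
This is just the usual definition of RMSEP as a root mean square of the prediction errors, so individual errors scatter around it. A minimal sketch (the arrays y_pred and y_ref for one response are assumptions):

# RMSEP is the root-mean-square prediction error; individual errors may be
# smaller or larger than it.
import numpy as np

errors = y_pred - y_ref
rmsep = np.sqrt(np.mean(errors ** 2))
print(rmsep, np.abs(errors).min(), np.abs(errors).max())
# Reporting a prediction as value +/- RMSEP (1 std) is one pragmatic convention.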

The bars in the Pred+/-Dev plot indicate the prediction uncertainty. The deviations
are based on the similarity of spectra in the prediction and calibration samples, and
on their leverage. We can therefore use these deviations for outlier detection, and
thus we should not trust predictions with large deviations. A more correct way to give
predicted values is to include the RMSEP as an indication of the uncertainty, for
example 33 ± 1.3% (1 std).

Because there were outliers in the prediction set, the Pred/ref picture initially
seemed bad (the regression line is based on all the plotted elements), while the fit
of the non-outliers is really very good. When the outliers are removed from the data
set and we include only the correctly predicted ones, the statistics improve greatly,
of course.

12.3 Exercise – “Dirty Data” (Geologic Data with Severe Uncertainties)
Purpose
The purpose of this exercise is to analyze really dirty data, identify ambiguous
outliers, and learn more about scaling. The difference between outliers and extreme
values will also be a central issue in this exercise. You will be required to take a
large proportion of personal responsibility for this exercise; we shall mostly only
point out the major direction of the broad avenues along which you’ll have to fend
for yourself in order to do the “proper” data analysis.

Problem
The chemical element Tungsten (W) is an important alloying ingredient in many
types of steel for example, and is of course used extensively in light bulb filaments.
Geologically, Tungsten occurs predominantly in only one particular mineral, called
Scheelite. Scheelite mineralisations are thus a natural target for geological
exploration campaigns. Scheelite is an example of what is known as a “heavy
mineral” and this fact is used for mineralogical exploration purposes, since it tends
to be concentrated in certain peripheral bottom areas of rivers and streams. A
standard geochemical exploration technique is to prospect for “heavy minerals”
where one grabs one liter of “FFSS” - Fine Fraction Stream Sediments.

This sample medium is easy to collect (using a standard sieved bucket), but there
are never many Scheelite mineral grains in a standard FFSS sample. Geological
background levels are low, often in the range 0-10 grains (out of several thousands,
or in the ten thousands). Nearby mineralisations, however, may easily raise this
number of grains by factors of 3-15 as heavy mineral grains are freed by erosion
over geological time. The number of Scheelite grains in a standard FFSS-sample is
thus a very important Y-variable for exploration purposes. It is also a labor
intensive Y-variable. A professional mineralogist has to sift through 1 liter of
FFSS-sediment under a microscope and specifically count the critical number of
Scheelite grains. This is very laborious and very expensive indeed, especially when
it is pointed out that stream sediment exploration campaigns easily comprise
hundreds, even thousands of samples.

Since the FFSS-samples can be collected easily and in great abundance when an
exploration field campaign is first launched, there is a clear interest in trying to
calibrate these Y-measurements against less laborious and less expensive chemical
analytical techniques; preferably instrumental methods that can be applied directly
to the FFSS. If this is possible, we might do away with the mineralogist altogether!
Of course his expert knowledge is needed to make a good calibration first (a
universally applicable calibration model). Actually there are many other tasks that
can now be assigned to the mineralogist, tasks which are much more interesting for
him than the perpetual FFSS screenings.

In the present scenario we most certainly do not have an ideal calibration data set.
In fact, it is a very imprecise data set. We do have a standard chemical FFSS X-data
set (XRF-analyses of 17 variables), but unfortunately Tungsten itself cannot be
analyzed by this method. There is however good geological reason to hypothesize
that the remaining XRF-data set (X) should carry sound geochemical evidence of
possible Scheelite mineralisations close to the sampling locality. From general
geological knowledge there is firm evidence that the 17 X-variables available for
XRF-analysis usually may act as an indirect measure of Scheelite content.

There are even more difficulties with the present data set. The X-samples (FFSS)
and the Y-samples (grain counts) do not originate from the same field campaign. In
fact, the mineralogical Scheelite counts come from an altogether different
geological campaign some three years earlier. This means that we can only be
certain that the geographical correspondence between the X-samples and the
Y-samples in the field is of the order of ± 20 meters or so. When the first campaign
was carried out, no one could foresee that the samples would later be used for this
quite different purpose. This introduces a significant new type of localization
variance in the X→Y modeling, in addition to all the usual instrumental analytical
error components associated with the present kind of routine analysis.

The calibration data set is thus extraordinarily “dirty”, due to extremely large
uncertainties and error sources:

• Inaccurate localizations of two different sets of samples (X and Y)
• Several X-variables have large analytical errors at low concentrations
(instrumental XRF)
• Regional geological variation may exceed local fluctuations, or vice versa.
• There is even some uncertainty whether all possible Y-grains were always
detected (though the professional mineralogist will disagree strongly of course!)

A conservative estimate of the average uncertainty level (X) in this problem may
well be way beyond 10-15% in the more extreme cases. We must therefore make
full use of the best possible data analysis method to handle that much error and
noise. We shall investigate whether it is possible to model the Scheelite grain
counts (Y) as a PLS1-model of the 17 other chemical FFSS variables (X). The
overall geological background is consistent with the possibility that such a complex
indirect model just might work, but only barely - it is a long shot indeed!

The data set was originally released by the former Geological Survey of Greenland
(GGU). We have slightly modified the problem description for use in this exercise
in order to stretch the multivariate calibration scenario to the limit. The situation for
the exploration department at GGU in collaboration with the senior author was not
exactly this challenging, but close enough. This is really a most difficult calibration
problem, with several unusually severe extraordinary error components added for
good measure.

Data Set
Calibration data are stored in the file GEO. The sample set Training contains 23
FFSS samples, which are used to make the model. The sample set New contains 51
FFSS samples, from other areas, in which the number of Scheelite grains has not
been counted.

The variable set XRF-analyses contains 17 variables (concentrations by XRF in
ppm or % of major oxides). The variable set Tungsten contains the number of
Scheelite grains found in the samples. The variable set Location gives the
geographical location (latitude, longitude) of the 51 samples in XNew.

Tasks
Two alternative strategies are suggested:
1. Start building a PLS1 model right away, and make extensive use of the
diagnostic tools in order to detect irregularities and abnormal samples. Try to
improve the model.
2. First, have a closer look at the data and decide on possible ways to make it
better suited for analysis. Then build a PLS1 model, diagnose it and improve it.
Once you are satisfied with your model (properly validated), predict the new
samples.

How to Do it
1. Read the data from the file GEO. Study the data and determine whether or
not to use weighting. Make an appropriate PLS1 model with warnings
enabled.
Which validation method would you recommend for this data set? Why?

Find possible outliers using the regression progress window, score plots, the
X-Y Relation outliers plot and other available means.
Are the suspected samples really outliers or just extreme values?

Try to remove the potential outliers, but only one at a time. You should play
around quite a lot with this data set. Try removing some “apparent outliers”, and
observe what happens to Validation variance, RMSEP and Predicted vs.
Measured. Try to remove both extreme end-members and outliers. Do not
hesitate to build 3 or 4 different models at least, and check the impact of
removing a sample or including it again.

Compare the different models with respect to prediction error. You can plot
RMSEP using Results - General View and add plot for the other models.

When you are close to your final model, make sure that you use a valid
validation method to determine the number of components and check the
performance of your final model.

2. Go back to the raw data and see whether you can improve the quality of the
data by transformations. Hint: have a look at the histograms of the various
variables.
How are most variables distributed? Can you distinguish between extreme
(abnormal) values and skewed distributions? Which transformation(s) might
make the distributions more symmetrical? Is it necessary to have normal
(gaussian) distributions?

Make a new model based on suitably transformed data. Check it for outliers, and
if necessary remove them or try replacing some individual matrix elements by
missing values.
Are there any outliers now? Are they the same as with raw data?
How does the X-Y relationship look now? Is it improved compared to the
previous models?

Use a proper validation method. Which type of cross-validation does this data
set invite?

Check whether all variables are important; you may try to improve the model by
variable reduction.

3. Use the best model you settle on to make a prediction from the data in the
sample set New. Check that you have transformed the new samples the same
way as you transformed the data used to make your preferred model. Look at
the predicted results and their uncertainty limits, and evaluate.

Make predictions based on the two alternative strategies.


Is it possible to do this ambitious PLS modeling? Argue carefully for your
answer.

Summary
This exercise made you work with a very noisy data set, for which it was probably
difficult to find a really good model. Furthermore you practiced the detection,
identification and removal of outliers in a severe context of uncertainty. You may
have noticed that it can be very difficult to determine whether a sample is an outlier
or not and you had to be very careful.

The first approach diagnoses samples 4, 6 and 18 as either extreme or abnormal. If
you took out those three samples, then you could also detect sample 22 as
influential. But removing samples 4 and 6, which have the two highest Y values, is
not a very good idea since in the overall problem context you are particularly
interested in samples with a high grain count! (Much information has been lost in
the process, as can be seen from the validation variances.)

You might have taken samples 4 and 6 back in, keeping 18 out since it is so
obviously influential, and then sample 22 was not such an obvious outlier after all.
You might even have spotted other candidates in this game of: “What if...?”

If you tried cross-validation after having started out with leverage correction, you
experienced a serious drop in explained validation variance. You may also have
noticed the strong curvature in the X-Y relationship along the first component, and
a non-uniform distribution of the prediction errors. So obviously even the best
model you can build with this approach is not quite satisfactory.

The second approach shows an obvious need for transformations to make variable
distributions more symmetrical. In all cases requiring a transformation (most of the
X-variables, and the Y-response variable as well), you may have tried both a
logarithm and a square root, and concluded that the logarithm performed better.
Thus all distributions were made roughly symmetrical, some of them looking
perfectly normal, others bimodal; the variables with a large number of zero values
could be improved by using a constant inside the logarithmic term perhaps,
although they couldn’t be made completely symmetrical. Only very few extreme
values remained after these transformations.
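
As a sketch of the kind of transformation discussed here (the constant c is an assumed, tunable choice), a logarithm with a small additive term handles the zero values:

# Log-transform skewed geochemical variables; a small additive constant
# keeps the many zero values finite.
import numpy as np

def log_transform(x, c=1.0):
    return np.log10(np.asarray(x, dtype=float) + c)

print(log_transform([0, 1, 10, 100]))   # 0.00, 0.30, 1.04, 2.00 (roughly)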

A first model on transformed data then showed that sample 18 was a very
influential sample, while samples 4 and 6, although still extreme, fitted much better
into the overall picture. But most importantly, the shape of the X-Y relationship
was now much closer to a straight line. Replacing a few individual values by
“missing” got rid of the remaining outlier warnings.

Full cross-validation leads to a choice of only 3 PCs with this approach; then
variable reduction based on the regression coefficients can be applied to get a
simpler model, which requires only 2 PCs and performs slightly better. The
residuals and predicted values were well distributed.

To evaluate the quality of your model, you could apply a pragmatic criterion; the
RMSEP was of the same order of magnitude as the response value for the first non-
zero level, and looked small enough to ensure that at least samples with a higher
grain count than this could be detected, which is what matters most.

A Possible Conclusion
Remember that the goal here was not to make a statistically perfect model, nor
necessarily one with the lowest RMSEP, but simply one that works with the given
extreme uncertainty levels in the problem context. There was no prior knowledge
as to how a possible model might look, and what the most appropriate
transformation and validation method might be.

The fact that there are really only four “effective” Y-levels in these data (levels of
roughly 0, 15, 25, 40 scheelite grains) strongly guides all the data analysis efforts.
Noting this critical point is pivotal to the possibilities of doing anything reasonable
with this data set. One should be extremely hesitant to declare any sample from the
two highest Y-levels as outliers, if it is at all possible to include these objects in the
model. This would make us effectively lose the overwhelmingly most important
part of the spanning of the Y-space. On the other hand, there are really many
samples with an effective zero count. Any of these might easily be discarded if
this would streamline our model in the X-Y Relation outliers-plot. There are still
enough other samples to anchor the model at the equally important “zero scheelite
grains” end.

It is vital to note that we cannot select outliers from a plot of the X-space alone, for
instance the t1-t2 score plot. You would get thoroughly deceived in this case! One
never, ever, performs outlier delineation in the X-space alone when doing
multivariate calibration – ONLY the appropriate T vs U plot(s) will do!
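
To make the point concrete with generic tools (a sketch, not the program's own plot): with scikit-learn's PLS the T vs U plot for a given component is simply a column of the X-scores plotted against the corresponding column of the Y-scores; a fitted PLSRegression object named pls is assumed.

# T vs U plot for component a of a fitted sklearn PLSRegression model pls;
# X-Y relation outliers show up here, not in a t1-t2 (X-only) score plot.
import matplotlib.pyplot as plt

a = 0                                   # 0 = first component
plt.scatter(pls.x_scores_[:, a], pls.y_scores_[:, a])
plt.xlabel("t%d (X-scores)" % (a + 1))
plt.ylabel("u%d (Y-scores)" % (a + 1))
plt.show()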

Also, there is no single correct solution to this modeling exercise. There may be
several equally valid models. It is only important that you are able to argue your
specific choices of the particular data analysis strategy you have chosen – always
(and only) with respect to the actual problem specifics present.

12.4 Exercise - Spectroscopy Calibration (Wheat)
Purpose
In this exercise you should again try to solve the task without too many detailed
instructions. The exercise concerns wheat quality monitoring (prediction), based on
NIR spectra. You will need to handle replicates, scatter, PLS2, PLS1, evaluation of
model performance, and refining models by variable selection (in this case:
wavelength deletion).

Problem
The aim of a particular Norwegian grain mill is to keep a constant wheat flour
quality, to meet the bakers’ requirements (who naturally prefer constant baking
characteristics). The requirement for protein contents in this case is 13.4 ± 0.3%.

By NIR-analysis of wheat it is possible to determine not only the water and protein
contents, but also the ash content. Ash is the residue after complete combustion of
the wheat, indicating the extraction rate after milling.

To rationalize quality control we would like to have a model that predicts


concentrations of water, protein and ash in wheat samples, directly from NIR
diffuse reflection data.

Data Set
A set of 55 wheat samples, considered to span the most important variations, was
collected at the mill. Each wheat sample was packed in the instrumental
measurement container three times (uniform packing is critical in powder NIR
diffuse spectroscopy). NIR spectra were recorded for each of these triplicates with
a Bran+Luebbe Infralyzer 450 (diffuse reflection) filter instrument with 19 standard
wavelengths. There is also one extra, special wavelength, believed to enhance
results for ashes. All 55 samples were also analyzed in the chemical lab, for
protein, water, ash, and ash (dry matter); these are the reference Y-data.

The log-values of the absorbance spectra at 20 wavelengths (X-variables) were
recorded for 165 (3 x 55) “representative” wheat samples (we shall have to take the
data suppliers at their words for this critical issue). The data are stored on the file
WHEAT. The spectra are stored in the variable set Spectra and the constituents in
the variable set Constituents.

Task
Make a multivariate calibration model of the data. Concentrate on trying to model
Ash and Protein, as these are the more difficult Y-variables in this problem.

How to Do it
1. Start with a PLS2. Try an outlier limit of 4. Use Leverage Correction in
the screening process (to find outliers), then cross validation for the
following calibrations (for example systematic 111222… with 3 samples per
segment for raw data and full cross validation for averaged spectra).

You may also transform the spectra to reflectance or Kubelka-Munk units and
see how this affects the models.
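
If you prefer to try these transformations outside the program, the standard relations for log(1/R) absorbance data are R = 10^(-A) and the Kubelka-Munk function K/S = (1 - R)^2 / (2R); a small sketch (the function names are my own):

# Convert log(1/R) absorbance spectra to reflectance and Kubelka-Munk units.
import numpy as np

def to_reflectance(A):
    return 10.0 ** (-np.asarray(A, dtype=float))

def kubelka_munk(A):
    R = to_reflectance(A)
    return (1.0 - R) ** 2 / (2.0 * R)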

Use all available means to look for outliers. Compare results with raw data to
find extremes. Use the X-Y Relation Outliers plot and the Sample Outliers plot
to spot outliers and study how the shape of the Y-variance curve changes when you
remove them. If a local increase in the error disappears when you remove a
sample, this indicates that it was indeed an outlier. There is however not much
use in removing samples if this does not reduce the prediction error.

Study replicate variations and see what happens if you average the spectra over
each replicate set. Also try MSC instead of the raw data. Using RMSEP to
compare these two alternative models is fully justified.
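
A sketch of the replicate handling in generic code (assuming the 165 rows of X, 165 x 20, and Y, 165 x 4, are ordered as consecutive triplicates): average over each triplet, and, for the raw data, cross-validate with one segment per wheat sample so that replicates of the same sample never end up on both sides of the split.

# Average triplicate spectra and cross-validate with replicate-wise segments.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

X_avg = X.reshape(55, 3, -1).mean(axis=1)    # 165 spectra -> 55 averaged spectra
Y_avg = Y.reshape(55, 3, -1).mean(axis=1)    # model these with full cross validation

groups = np.repeat(np.arange(55), 3)         # segment id = wheat sample number
pls = PLSRegression(n_components=8, scale=False)
y_cv = cross_val_predict(pls, X, Y, groups=groups, cv=GroupKFold(n_splits=55))
rmsecv = np.sqrt(((y_cv - Y) ** 2).mean(axis=0))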

2. Try separate PLS1 models to see if the more “difficult” variables are any
easier to model. Remember to define new variable sets, as you need it.

Also check how the model performs if you keep some wavelengths out of
calculation. This introduces you to the important area of variable selection, which
you will learn more about in Chapter 14.3. Study the B-vector as well as the
loading-weights for this.

Summary
Plotting the raw data indicates some inter-replicate variations. Since the replicates
were in fact packed individually and since we are analyzing aggregate powder
samples, we may suspect that scatter effects are at work here. An initial PLS2 gives
an overview of the possibilities for modeling these data. Standardization of the
Y-values is absolutely necessary. PLS2 indicates that modeling the four
constituents requires different numbers of PCs for each. Water is easiest to model.
Protein needs the most PCs.

For raw data and MSC corrected spectra, samples 103-105 are outliers (large X-
variances) and should be removed. For averaged spectra, sample 35 is an outlier.
For PLS2 models on raw spectra, there are only small differences in RMSEP using
absorbance, transmittance, or Kubelka-Munk. The best overall model is obtained
using MSC on absorbance spectra.

By computing the average spectra we include the natural replicate variation and
imitate the real world situation, where scans are averaged before prediction or the
predictions from several scans are averaged.

We obtain the lowest number of PCs in the models for protein and ash when using
separate PLS1 models on MSC corrected absorbance spectra.

The RMSEP for Protein is about 0.09 for 6-8 PCs. This is about the same order as
the reference method. RMSEP of Ash is about 0.014-0.015 using 7-8 PCs. We do
not know the corresponding laboratory inaccuracy of the Ash measurements.

By studying the regression coefficients for the optimal number of PCs we can try to
make a model based on fewer wavelengths. Remove wavelengths with small
regression coefficients - if the loading-weight plot(s) agree with this. Remaining
variables number 2, 4, 12, 18, and 20 give a 4 PC model. It has an RMSEP of about
0.018 for ash, which is slightly worse than the 20-filter model. This is still quite
satisfactory if you prefer a simpler model. The best model for Ash was based on
MSC corrected spectra and 20 wavelengths though.
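
A rough sketch of such coefficient-based variable selection (the threshold logic is an assumption; in practice you should cross-check it against the loading-weights, as suggested above). A fitted PLS1 model pls, the spectral matrix X and a single response y are assumed.

# Keep the wavelengths with the largest absolute regression coefficients,
# then refit a smaller model and compare its validated error with the full model.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

b = np.ravel(pls.coef_)                    # regression coefficients for one response
keep = np.argsort(np.abs(b))[-5:]          # e.g. retain the 5 largest |b|
pls_small = PLSRegression(n_components=4, scale=False).fit(X[:, keep], y)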

12.5 Exercise QSAR (Cytotoxicity)


Purpose
This exercise illustrates the concept of QSAR (Quantitative Structure Activity
Relationships), the selection of calibration data using experimental design on the
scores (principal properties), and strategies for the validation of small data sets.

Problem
This exercise is based on work done by Lennart Eriksson et al., Dept of Organic
Chemistry, University of Umeå, regarding strategies for ranking chemicals
occurring in the environment, and it is used here with kind permission.

We would like to know both long term and acute biological effects of all the
chemical compounds continually being released into the environment. It has been
estimated that there are 20,000-70,000 different chemicals in frequent use in
industry. Testing all these compounds is impossible, due to reasons of cost, time
and ethical considerations (animals are often used for testing).

QSAR is a technique that allows the prediction of biological effects of untested
compounds.

In this application the cytotoxicity towards human cervical cancer cell lines (HeLa
cells) was to be determined for a series of halogenated aliphatic hydrocarbons. The
cytotoxicity is expressed as the inhibitory concentration lowering cell viability by
50%, IC50.

The principal properties of 58 compounds were analyzed by PCA based on 8
chemical descriptors (X). From this a calibration set of 10 compounds and a test set
of 5 compounds were chosen. They were chosen by applying experimental design
on the PCA scores. The purpose of the PCA here really was to form a basis for a
statistically designed selection of test compounds.

Laboratory experiments were performed to measure the cytotoxicity of these 15
samples. A PLS model was made with both cross validation of the calibration set
and validation using the test set. The model showed that the cytotoxicity of the
compounds critically depends on their hydrophobic and steric properties. The PLS
model was then used to predict the cytotoxicity of 40 new, untested compounds
with similar chemical properties.

Data Set
The data are stored in QSAR. Sample set New contains 58 samples (16-73) with
non-missing data for 8 (1-8) descriptors of the compounds. The names of the
samples are given in Table 12.1. This data set will be used in the first part of the
exercise, PCA.

Sample set Training contains 15 samples; 10 descriptors of the compounds are
included in variable set X, 1 descriptor (IC50) in variable set Y.
The samples to be used as “Cal” range from 1 to 10, and the last 5 should be used
as test samples.
The calibration set and test set will be used to make the PLS model later in the
exercise.

Table 12.1 - Compounds in sample set New


No. Name No. Name
1 CH3Cl 30 CH3-CH2Br
2 CH2Cl2 31 CHBr3
3 CHCl3 32 CH3-CH2F
4 CHCl2F 33 CH3-CHBr2
5 CHClF2 34 CBr2ClF
6 CCl4 35 CH2Br2
7 CCl3F 36 CH3I
8 CCl2F2 37 CH2BrCl
9 CH3Br 38 CBrF3
10 CH3-CH2Cl 39 CBr3F
11 CH2Cl-CH2Cl 40 CH2Br-CH2F
12 CH2Br-CH2Cl 41 CH3-CHF2
13 CH2Cl-CHCl2 42 CH3-CH2I
14 CH3-CCl3 43 CH2Br-CH2-CH2Br
15 CHCl2-CHCl2 44 CH2Br-CH2-CH2F
16 CHCl2-CCl3 45 CH3-CH2-CH2Br
17 CCl3-CCl3 46 CH3-CHBr-CH3
18 CCl2F-CClF2 47 CH3-CH2-CH2Cl
19 CH2Br-CH2Br 48 CH3-CHCl-CH3
20 CH3-CHCl2 49 CH3-CH2-CH2I
21 CClF2-CClF2 50 CH3-CHI-CH3
22 CH3-CHCl-CH2Cl 51 CH2Br-CH2-CH2-CH2Br
23 CH2Cl-CHCl-CH2Cl 52 CH3-CH2-CH2-CH2Br
24 CH3-CH2-CH2F 53 (CH3)3-CBr
25 CH2F-CH2-CH2F 54 CH3-CH2-CH2-CH2Cl
26 CH3-CF2-CH3 55 (CH3)3-CCl
27 CH2Cl-CHCl-CHCl2 56 CH3-CH2-CH2-CH2I
28 CH2F-CF2-CH2Cl 57 (CH3)3-CI
29 CH3-CF2-CH2Cl 58 CH3-CH2-CHI-CH3

Table 12.2 – Descriptors of the compounds (X)


No. Name Description
1 Mw Molecular weight
2 BP Boiling point
3 MP Melting point
4 D Density
5 nD Refractive index
6 VdW Van der Waals volume
7 LogP Log P
8 Ip Ionization potential
9 LC1 Log retention time, 1st HPLC system
10 LC2 Log retention time, 2nd HPLC system

Tasks
In this exercise we will make the PCA model (the principal properties model), see
how experimental design can be used to select a representative subset of samples,
make a PLS model, and validate with both “internal” cross validation and
“external” test set.

How to Do it
1. PCA
Make a PCA on sample set New, variable set X. Variables 9 and 10 which only
have missing values will automatically be kept out. Is standardization
necessary? Save the model under a meaningful name (e.g. QSAR New
compounds).

Study scores and loadings. How many PCs should we use? Is the sample
distribution of the score plot satisfactory? Which variables are important in the
first PC? And the second? If you are a chemist, try to interpret the meaning of
the PCs! Which variables dominate PC3 and PC4?

Save the model as QSAR New Compounds.
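
A minimal sketch of this principal properties step with generic tools (X here is the assumed 58 x 8 descriptor matrix): standardize the descriptors, then compute the PCA scores.

# Standardize the 8 descriptors (they are in very different units), then PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Xs = StandardScaler().fit_transform(X)           # mean 0, std 1 per column
pca = PCA(n_components=4).fit(Xs)
scores = pca.transform(Xs)                       # the "principal properties", 58 x 4
print(pca.explained_variance_ratio_.cumsum())    # about 0.94 after 4 PCs (see Summary)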

2. Pick out a calibration set


To select test compounds, we can use a fractional factorial design on the scores.
Such a design for four variables corresponds to a representation of the
compounds in a space with four dimensions, where each dimension comprises a
so-called principal property.

Choose File – New Design – From Scratch – Fractional Factorial, and build a
fractional design for four design variables (corresponding to four PCs). Also add
two center points to get a few points in the “middle”. Do not put any effort into
choosing names and so on, and use only -1 and +1 for low and high value.
Choose as Design Type “Fractional Factorial Resolution IV” (8 experiments),
and include the default 2 center samples.

Click Next until you reach Finish. By clicking Finish you will display the
designed data table. It should look roughly like this:

A B C D
Cube 001a -1 -1 -1 -1
Cube 002a 1 -1 -1 1
… …
Cent-b 0 0 0 0

The idea is now to find compounds with values of their principal properties
(scores) corresponding as well as possible to these design points.
(Geometrically, the selected points form the corners of a hypercube in the space
defined by the design variables). This means finding compounds with score
values in PC1, PC2, PC3, and PC4 that match patterns of the design points. For
example the design point +1 -1 +1 -1 corresponds to a compound whose score
values are + - + - for the four first PCs.
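
One way to mimic this matching of design sign patterns to score sign patterns programmatically is sketched below; the selection logic is an assumption of mine, not the published procedure, and the practical restrictions discussed next are not included.

# Match each +/-1 design row to a compound whose PC1-PC4 score signs agree,
# restricted to compounds within about 2/3 of the score range (sketch).
import numpy as np

def pick_compounds(design, scores):
    limit = (2.0 / 3.0) * np.abs(scores).max(axis=0)   # avoid extreme compounds
    chosen = []
    for row in design:                                 # row of +/-1 values, length 4
        sign_ok = np.all(np.sign(scores) == row, axis=1)
        size_ok = np.all(np.abs(scores) <= limit, axis=1)
        candidates = np.where(sign_ok & size_ok)[0]
        if candidates.size:                            # pick the best-spread candidate
            chosen.append(candidates[np.argmax(np.abs(scores[candidates]).sum(axis=1))])
    return chosen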

However, there are always practical restrictions to consider. To avoid chemicals
with extreme properties, only compounds located within about 2/3 of the max
and min values of the scores axes were selected. Furthermore, since you can not
always find compounds that correspond exactly to the design points, try
primarily to find compounds that match the patterns in PC1 and PC2 (that
describe most of the variance). Of course you should also strive to match the
patterns in PC3 and PC4.

Practical constraints like boiling points which are too low and the availability of
the compounds also restrict the candidate list of possible calibration compounds.
The two center points (with score values close to zero) were used to get
information about curvature and variability.

Import the scores matrix by File - Import - Unscrambler Results. Select the
PCA model QSAR New Compounds. Select matrix Tai. Remove the last four
rows, so as to keep only the scores along PC1-PC4. Then choose Modify -
Transform - Transpose. Try to pick out 10 compounds according to the
directions given above. Save it using File - Save.

Eriksson et al. made the following calibration set selections (Table 12.3):

Table 12.3
Design Score PC1 Score PC2 Score PC3 Score PC4 Sample no. Name
---- -1.40 -0.71 -0.22 -0.17 30 CH3CH2Br
-+-+ -2.03 0.80 -0.45 0.16 48 CH3CHClCH3
+--+ 1.59 -0.99 0.23 -0.18 33 CH3CH2F
++-- 0.50 1.09 -1.16 -0.20 52 CH3CH2CH2CH2Br
--++ -1.87 -0.59 0.71 0.28 2 CH2Cl2
+-+- 3.2 -1.26 0.23 -1.97 39 CBr3F
-++- -0.94 0.48 0.03 -1.25 7 CCl3F
++++ 1.8 0.70 0.62 0.38 15 CHCl2CHCl2
0000 -0.4 -0.1 0.68 0.11 3 CHCl3
0 0 0 0 -0.55 0.03 0.87 1.2 11 CH2ClCH2Cl

3. Choose a test set


For testing we shall select an additional smaller set with only 5 compounds,
including two “center” compounds. We may use a 2^(3-1) fractional factorial design
with one center point to pick these out, based only on the first 3 PCs. Eriksson et
al. chose compounds 23, 19, 37, 12, 6, trying to match the design pattern, but
also due to the restrictions described above.

Experiments were now performed on the selected compounds. For the studies of
IC50 Eriksson et al. included a few more variables in addition to the chemical
descriptor variables. These were log retention times for two HPLC systems
(LC1, LC2), and a few others. (Since those “others” proved to be insignificant,
we will skip them here.) This means that we now have 10 X-variables.

4. PLS
Make a PLS model using the sample set Training, X-variable set X (all of
them!), Y-variable set Y. Choose Test Set Validation. Set up the test set using
samples 11-15 as test samples. Save this first model as e.g. “Cyto 1”. Also make
a PLS model based on samples 1-10 only, where you validate the calibration by
full cross validation (10 random segments) and save this model as e.g. “Cyto 2”.

Interpret the model by studying Y-variance, scores and loadings. How many
components should we use? How large is the explained Y-variance using test set
and internal cross validation, respectively? Which X-variables dominate the
model? How do you interpret PC1 and PC2? Explain the difference between the
prediction error using internal cross validation and external test set validation!
How large is RMSEP?

Make two new PLS models for IC50 based on the five dominant variables, using
both test set and internal cross validation, to check if your interpretation about
important variables holds. Does this model have a better or worse prediction
ability with respect to explained variance and RMSEP?
Save your favorite model as e.g. “Cyto Reduced 1”

Make a PLS model based on only the five important variables again, but now
include all 15 samples. Call this model e.g. “Cyto Reduced 2”. Compare the
prediction error of this model with Cyto Reduced 1.

Also study Predicted vs. Measured for varying number of PCs. How many
components should we use?

5. Expand the model and predict untested compounds


Since both test set validation and internal cross validation indicate a good
model, we can now use it for prediction of the 43 untested compounds. These
samples have not been in the model. You can quickly define a new sample set
containing those 43 samples; call it “Prediction”. We may consider expanding
the model to include all of the tested compounds.

Since the HPLC measurements are not available in the literature like the
chemical descriptors, make a new PLS model based on all the tested
compounds, but using only the three most important variables. (Exclude LC1
and LC2). Use full cross validation. Interpret and check the RMSEP and the
explained Y-variance. Is this cross validated model much worse?
Save this model as e.g. “Cyto for Prediction”.

Predict IC50 for the 43 compounds in sample set “Prediction”, using your last
model. Can you trust all the predicted values? Why not?

Compare extremely low predicted cytotoxicity values with the measured
toxicity range of the calibration set.

Summary
PCA
The data should be standardized. There are no signs of direct problems, but sample
no. 4 seems potentially outlying in several PCs. Four PCs describe about 94% of
the variance. All variables except Ip have large loadings on PC1; thus PC1 can be
interpreted as related to the size/bulk of the compounds. PC2 is dominated by log P,
Van der Waals volume and density, which can be interpreted as reflecting a
combination of size and lipophilicity/hydrophilicity; Ip also has a large loading on
PC2. The interpretation of PC3 and PC4 is ambiguous. Clearly the ionization
potential (Ip) is important for PC3, and melting point (MP) dominates PC4.

PLS
The PLS model based on test set validation needs one PC to describe about 82% of
the Y-variance with an RMSEP of 1.0. Using internal cross validation, two PCs
describe 73% of the Y-variance with an RMSEP of 1.5. This means that the test set
is too small to be predicted with more than one PC. IC50 depends primarily on the
hydrophobic and steric properties of the compounds, Mw, VdW and log P, and LC1
and LC2. The larger and more hydrophobic the compound is, the more cytotoxic it
is.

The reduced model with test set validation is better, with an explained Y-variance
of 90% using only one PC. Using 2 PCs gives overfitting. The internally cross
validated model needs four PCs to explain 80% of IC50. Both data sets are too small
to give consistent estimates of the prediction error, but we will only use them to
indicate toxicity levels - not to predict accurate levels.

The cross-validated “Cyto for Prediction” explains 85% of IC50 with 3 PCs. There
is little risk of overfitting since we know that all the three variables are necessary.

Prediction
Sample 51 appears to be an outlier, having very large uncertainty limits at prediction.
The compounds outside the validity of the model are of course also uncertain. In
this case “compounds outside the validity of the model” means samples that lie
outside the area from which you picked the calibration set in the PCA score plot.
The model has to extrapolate to predict these samples.

Six compounds have predicted IC50 values lower than 0.5 mM. Since the lowest
cytotoxicity value measured for the calibration set is 0.9 mM, giving specified
predictions below 0.5 mM does not seem quite justifiable. Maybe we can give these
predictions as “< 0.5 mM”.

The predictions may now be used as a starting point for further risk assessment of
the non-tested compounds, preferably integrated with other data of environmental
relevance.

13. Master Data Sets: Interim Examination
This chapter introduces four data sets never before published in this training
package or elsewhere, which will serve as a test of the reader’s acquired skills
in multivariate data analysis. This revolves around PCA and PLS in general,
but is specifically also rather more related to the need for more problem-
specific modification of standard procedures. Thus the data analytical
problems to be presented will in particular focus on the need to compose
one’s own data analytical strategy. In other words, these problem scenarios
necessitate that YOU think carefully about the “proper ways” to approach the
problem descriptions given in a fashion most suited for the kind of
multivariate data analysis learned. A certain measure of “problem re-
formulation” will always be necessary – but there is (much) more in store
here.

We shall give a thorough description of each major data set, its origin and
general background only. The data sets themselves will not be given in the
same ready-to-use, fully prepared form as has been necessary for all the
exercises presented hitherto. All four problems are genuine real-world data
sets however, presented directly as is, without any preparatory help. We are
confident that you understand why – and that you concur.

There will for sure be a need for many of the contingencies outlined above in
chapters 1 through 12, at some point(s) in the preparation of a particular data
analysis strategy or problem re-formulation below. Thus you should be well
prepared for the easy issues such as which pre-processing to choose (and
why?), outlier detection (objects and/or variables), groupings, etc. Also look
out specifically for the more subtle ones as well, such as “typical” problems,
which on second reflection may well have to be re-formulated into another
more useful form, or the definition of what actually constitute proper
problem-dependent variables and objects, say. In point of fact we have
prepared a rather interesting palette for your own painting here. All these data
analysis scenarios have been tested thoroughly during the senior author’s
teaching experiences in the last five years, so far with only positive student
responses. It is to be hoped that the same applies to YOU – good luck!

These major data analytical assignments are your own exclusively. We will
not debase your learning efforts so far by outlining a "correct solution"
immediately following the individual descriptions - we have devised a route
for you to follow that is much more respectful of your examination efforts.
For three of these data problems there will be on record one possible full
solution, perhaps amongst many other alternatives, but it will only be
available to you after you have carried out and documented your own
solution. There will be three full data set reports available, and a meta-
principle solution suggestion for the fourth data problem. The latter is to be
published in full in an upcoming high-level textbook, Esbensen & Bro: “PLS -
theory & practice in science, technology and industry” to be published in
2001 (Wiley).

13.1 Sgarabotto Master Violin Data Set


Introducing the father-son pair of master violinmakers, Gaetano and Pietro
Sgarabotto.

Photograph 1 - Gaetano and Pietro Sgarabotto (Parma, 1926 & 1957)

Gaetano S. (born 20.9.1878 – died 15.12.1959) and his son Pietro S. (born
1903 – retired 1971, but still producing masterworks as late as 1977; died
04.05.1999) both worked for their whole life in Italy as master violinmakers.
Gaetano S. worked initially in the city of Vicenza for many years before
moving to Parma in 1926, where he stayed almost without interruptions until
his death in 1959. His son, Pietro, continued in his father’s profession and
also became a master violinmaker in his own right. Today one speaks within
initiated circles with the utmost admiration of the brief, but prestigious
dynasty of violinmakers Sgarabotti. This refers to the historical presence of
both these masters who, apart from creating their own master instruments,
spent much of their time passing on experience to young violin makers.
Indeed the activity of the Sgarabotto makers was very influential in the violin
making school of Parma, Cremona and elsewhere. Much could be said about
their combined influence on the cultural heritage within the musical world of
string instruments, and of the enormous regard and esteem with which all
their students and musicians held the masters. The greatest interest by
posterity, of course, centers on their combined oeuvre, on the set of master
violins from their hand left for us to play and admire.
The works of both the Sgarabotto makers can be readily identified by their
meticulous choice of materials, the workmanship always being exquisitely
“manual” in every phase of the making of each instrument, and showing
extreme precision and loving care to detail. The violin making of the
Sgarabotto makers is always graceful, that of the father Gaetano said to be
presenting a lighter touch whilst the thicknessing used by Pietro is more
consistent. The sonority and general musical quality of these master violins of
course cannot be expressed in any way fair by numerical values, but shall
forever be residing in their handling by the musicians and experts who are
fortunate enough to play a Sgarabotto violin, viola, violoncello….

From the hands of these masters a distinguished set of 32 masterworks are on
record (there are numerous other instruments which also have been
definitively attributed to them, but which are perhaps not all in the same
master violin class). In the interest of documentation, as well as for the very
serious reason of guarding against the most basic crime of all in this context –
frauds, a catalogue of these 32 master violins has been published (Spotti,
1991). It is very complete, with individual descriptions and detailed
photographs of each violin as well as an overview table of the physical
dimensions of the instruments.

There are 24 major physical indicators making up a standard description of
the external physical characteristics of a violin. These are shown in Figure
13.1, in which the original letter designations have been used (instead of the
data analytically more conventional X1-X24 variable designations). All
elements in this data table represent the salient physical dimensions as
expressed in mm (millimeters).

Figure 13.1 - The 24 key physical violin dimensions A:Z

Table 13.1 below lists 18 master violins from the hand of the father Gaetano
S. Accompanying these are a further 14 master violins originating with the
son, Pietro S. Each individual work is dated with the year in which it was
made as well as in which city. We may also observe that Pietro Sgarabotto
adopted the tradition of assigning an individual name to every instrument in
the master class, sometimes dedicating them to a historical or prestigious
personality. The running numbering in Table 13.1 reflects a chronological
listing of Gaetano Sgarabotto’s entire master violin oeuvre (numbers 1-18),
followed by that of his son (numbers 19-32). This numbering listing is to be
used for a shorthand identification of these 32 objects.

Table 13.1 - Oeuvre of Gaetano (1-18) and Pietro Sgarabotto (19-32).
The physical dimension data reside on the file VIOLINS.
Gaetano Sgarabotto
1 Viola – 1905 – Milano A:Z dimensions (mm)
2 Violino (undated) A:Z dimensions (mm)
3 Violoncello – Quartetto Piatti – 1926 – Vicenza A:Z dimensions (mm)
4 Violino – Messia – 1927 – Parma A:Z dimensions (mm)
5 Violino – 1927 – Parma A:Z dimensions (mm)
6 Violino – 1928 – Parma A:Z dimensions (mm)
7 Violino – 1928 – Parma A:Z dimensions (mm)
8 Violincello – Il biondo Nazareno – 1929 – Parma A:Z dimensions (mm)
9 Violino – 1930 – Parma A:Z dimensions (mm)
10 Violino – 1930 – Parma A:Z dimensions (mm)
11 Violino – 1931 – Parma A:Z dimensions (mm)
12 Violino – 1933 – Parma A:Z dimensions (mm)
13 Violino – 1937 – Parma A:Z dimensions (mm)
14 Violino – 1940 – Parma A:Z dimensions (mm)
15 Violino – 1942 – Parma A:Z dimensions (mm)
16 Violino – 1942 – Milano A:Z dimensions (mm)
17 Violino – 1944 – Milano A:Z dimensions (mm)
18 Violino – P.P. Rubens – 1954 – Brescia A:Z dimensions (mm)
Pietro Sgarabotto
19 Violino ¼ - Piccolo mondo antico – 1930 – Parma A:Z dimensions (mm)
20 Violino – Al Dalmata Padre Lino – 1950 – Parma A:Z dimensions (mm)
21 Violino – Dostoievski Fedoro – 1954 – Parma A:Z dimensions (mm)
22 Violino – Amore e psyche – 1958 – Parma A:Z dimensions (mm)
23 Viola – La Wally – 1958 – Parma A:Z dimensions (mm)
24 Viola – I. Pizetti: La figlia de Jorio – 1960 – Cremona A:Z dimensions (mm)
25 Violino – Dante Alighieri: Paradiso – 1962 – Cremona A:Z dimensions (mm)
26 Violino – Giulio Cesare – 1962 – Cremona A:Z dimensions (mm)
27 Violino – Dante: L’inferno – 1963 – Cremona A:Z dimensions (mm)
28 Violino – Giotti Raffaello – 1963 – Cremona A:Z dimensions (mm)
29 Violino – Galileo Galilei – 1974 – Parma A:Z dimensions (mm)
30 Violino – Verdi: Va pensiero – 1975 – Parma A:Z dimensions (mm)
31 Violino – Schubert Ave Maria Accardo – 1976 – A:Z dimensions (mm)
Parma
32 Violino – G.B. Tiepolo – 1977 – Parma A:Z dimensions (mm)

The electronic data table Violins conveys a comparative overview of the
physical aspects of the entire master violin output from these two masters.
There have been many scholarly debates on what exactly constitutes the
distinctive features of the works of father vs. son Sgarabotto. In this context
one must of course put emphasis on the choice of materials as well as the
crafting of the wood itself, on the equally important aspect of the varnishing
of the finished wood, and of course on the ultimate issue of the musical
qualities of each violin. Still there have also been arguments that certain
overall physical characteristics of the “manual style” of the entire set of
representative violins from each of the Sgarabotto masters also show up as a
distinctive hallmark. This should, allegedly, also pertain to the simple
physical dimensions described in file VIOLINS, but especially as concerns
their “overall physical harmony”.

However, this particular part of the evaluation of the works from these two
master violin makers has never been put on a quantitative footing, being
traditionally carried out in decidedly non-quantitative artistic, humanistic,
craftsmanship or musicology related terms. But precisely the apparently
somewhat elusive feature of the “overall physical harmony” actually lends
itself to a translation into the kind of issues that are well known in our current
data analysis traditions. For “overall physical harmony” read “interacting
variables” or better still “correlated variables”! By use of bilinear modeling it
will be possible to put this artistic expert impression also in a more objective
quantitative context. This should of course be of immediate interest to the
data analysis community, i.e. to be able to express such subtle artistic
impressions in an objective language and thus, perhaps, to be able to
contribute towards greater clarity and precision in this scholarly debate.
Whether the same interest and appreciation is reciprocated from our musician
friends is not known, but it would probably be looked upon with at least some
initial frowning! Nevertheless we hasten to point out that the following
quantitative analysis of the combined oeuvres of the Sgarabotto violins should
in no way be taken to indicate anything but a modest contribution to the
totality, and the integrity, of the violin making art and craftsmanship
tradition.

That said and done: there would appear to be an extremely interesting
opportunity in making a careful data analysis of file VIOLINS. Here is an
irresistible opportunity literally to “look over the shoulder” of two master
violin makers and, perhaps, to find out some essentials of what exactly are the
characteristics of the “manual style” with which each composes (creates) his
master violins.

Of course this can only concern the relative dimensions of the framework and
the sculpturing of the violin. This will of course never be but a weak
reflection of the whole appreciation of violin making. Still – what a
tremendously interesting data analysis context!

Photograph 2 - The art of violin making: Gaetano S. in his studio (Parma 1929)

A master violin maker, while necessarily focusing mostly on each of the
contemporary, interconnecting parts of the crafting process, presumably must
have his mind focused both on the totality as well as the individuality of all
the essential elements involved in making such a complex, harmonic artwork
as a violin. Such elements may be the choice of wood materials, cutting,
shaping, sculpting, and varnishing. It would appear likely that the final
outcome of this complex process will possess at least some inter-correlated
features, which have come into existence more or less by a non-conscious
emerging totality of the material object, the violin, as more and more of the
essential parts of the process are added to the product of labor.

What is conjectured here is that at least some of the interacting features in the
totality and identity of a violin come into existence as a non-conscious sum-
of-parts, rather than as a deliberate act of Gestalt creation. The interacting,
correlated relationships between the external physical dimensions of a
finished violin would be one prominent representative of such emergent
properties. While presumably to a large extent non-conscious, the totality of
the manual crafting involved in the individual violin making, will end up as a
final, overall, integrated hallmark of the style of violin making craftsmanship,
supposedly distinctive and characteristic of each individual violin maker. In
any event, this constitutes the working hypothesis for undertaking a data
analytical examination of the VIOLINS data!

Thus the objective for this chapter’s first data analytical assignment will be:

Make a complete bilinear analysis of the data stored in file VIOLINS, with the
ultimate aim of finding out if there are (any?) distinctive features which can
be used to discriminate between the representative works of father and son
Sgarabotto violin makers. You may keep in mind the previous assertion
“...that of the father Gaetano said to be presenting a lighter touch whilst the
thicknessing used by Pietro is more consistent.” How would YOU shed new
(quantitative) light on this issue?

Hint: We are firstly looking for the violin-discriminating features here; what
could possibly constitute outliers in this context? Next, for the violins proper,
how can we look behind the massive similarities, which surely must be
present when comparing entities which, to all but the most erudite experts and
musicians, are virtual look-alikes? Indeed these similarities almost
overwhelm the innocent data analyst at first sight. A certain creative data
analysis insight must be found here, lest we get stuck in this similarity swamp.

Once the initial major question has been correctly solved: whether to use
auto-scaled, only standardized, only mean-centered, or raw data, the
remaining key hint is that a certain liberating re-formulation (based on some
of the interim results, which can be obtained relatively easily) will be
necessary. It will absolutely not be possible to delve directly into these data.
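As a purely illustrative aside, the four options mentioned above differ only in
whether the variable means and/or standard deviations are removed before the
bilinear modeling. A minimal sketch of these alternatives in Python/numpy is
given below (an outside illustration only; inside The Unscrambler the same
choices are made through the weighting options):

```python
import numpy as np

def pretreat(X, center=True, scale=True):
    """Pre-treatment of a data matrix X (objects x variables).

    center=True,  scale=True  -> auto-scaling (centering + 1/SD weighting)
    center=True,  scale=False -> mean-centering only
    center=False, scale=True  -> 1/SD weighting ("standardization") only
    center=False, scale=False -> raw data
    """
    Xp = np.asarray(X, dtype=float).copy()
    if center:
        Xp = Xp - Xp.mean(axis=0)
    if scale:
        Xp = Xp / Xp.std(axis=0, ddof=1)
    return Xp
```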

Above all, this compilation of invaluable data must be approached with the
utmost reverence!

13.2 Norwegian Car Dealerships - Revisited
We dealt with the car dealer issue at length in Exercise 3.13 above (Chapter
3). After a series of seven meager years, in 1994 it was again a profitable
business to sell new imported cars. One of the companies turned out to be
“No. 1” – hence a certain cocky attitude (Photograph 3).

Photograph 3 - “Now they are smiling again: after seven meager years, the
car dealership business reports results that are again really on the move”.

But how are they reporting their results? That is the issue at hand in this last
exercise on these data. And how (good or bad) is the managing director paid?

In the above data analysis of the master violin data, it was found necessary to
re-formulate the initial PCA into a pointed, problem-specific PLS analysis. It
was found that one particular X-variable played such a key role that a
re-assignment of this variable into a specific PLS Y-variable might prove
profitable. The data analysis continued, guided by the
comparative results from the relevant PCA and PLS-analyses. This issue
could be termed “internal re-formulation”, signifying that the reasons for a re-
formulation of the entire data analysis objective (PCA → PLS) came about by
an “internal evaluation” (interim results of the previous data analysis efforts).

Similarly there might well exist “external” re-formulation reasons, signifying
that reasons for a re-formulation of the data analysis objective (PCA → PLS,
or otherwise) can come about from an “external analysis” of the original
objectives of the data analysis, or of the problem specifications. If we stick to
the exact same re-formulation method as for the violin data above, please now
consider whether something similar might be relevant for the overall objective
of the “Norwegian car dealerships” data. We have admittedly gone over this
data set already quite extensively at the end of chapter 3, but with a meager
result. There was apparently very little distinctive overall data structure
present, only an almost hyper-spherical data swarm, but we did discuss, rather
at length, the possible, still very useful market-relevant interpretations that
could be formed. So what could possibly be the potential benefit of a further
analysis on these data?

Reference must here be made to the extraordinarily powerful result of such a
simple X-variable re-assignment in the master violin data case. It was because
of this singling out of (only) one “instrumental variable” (no pun intended),
which very clearly served the data analysis objective(s) better by its
alternative role as a PLS y-variable, that the final insight and conclusive
results of the violin making data scenario could be established at all. Thus this
PCA → PLS re-formulation played the absolutely essential role.

A word from us to you, which is to be taken entirely on trust here: there do
exist “external reasons” for a similar re-assignment opportunity of one
(perhaps two) analogous X-variables also in the case of the Norwegian car
dealership data! And in view of the ensuing spectacular results in the violin
data case, there are no two ways about it but to re-evaluate the car
dealership case.

The “external reason” must in this case be provided by a reassessment of the
overall data analysis objective for the set of ten economic X-variables. Please
first refer to Exercise 3.13 (Chapter 3) and refresh your background
understanding of the data set. The data are still stored in the file Cardeals.

It will probably not be too difficult to single out one particular X-variable
which, after only a little reflection, is not strictly in the same field as all of the
others. One particular X-variable, which – in principle at least – need not
exclusively contain data only originating via strong passive correlations to all
the other economic fellow travelers. A variable whose data actually to some
extent might be externally controlled in the sense that the particular values for
this variable actually can be fixed somewhat irrespective of the remaining
sales- and value-related data?

We are first of all talking of X-variable No. 8: “Salary, managing director”.
Clearly, the salary level for the managing director may be set either as a
reward for results obtained (last year), or as an encouragement for results to
be obtained (this year, or next year), for example.
X8 is the only variable for which the data recorded in the data table need not
exclusively be objective summary data obtained by an economic survey of the
sales figures.

We may formulate the following working hypothesis for the car dealership
data case – in analogy with the data analysis experiences from the violin data
case – that variable X8: “Salary, managing director” may constitute an
analogous PCA → PLS re-assignment variable of similar potentially increased
data analysis insight value.

Pertaining to this, we thus want to analyze the data set explicitly from the
standpoint of seeing quantitatively how well the set of the nine other
economic indicator variables are able to model (or to predict, rather) the level
of salary assigned to the managing director. This would be a direct
investigation of the extent to which the overall economics of the company is
directly related to the corresponding salary level of the managing director – or not.
Any significant positive deviation (i.e. a positive residual with respect to the
predicted salary following from the PLS- model) would constitute evidence of
violations of a strict market and sales-related salary policy. Opposite
deviations (i.e. negative residuals) would signify managing directors
accepting to be underpaid, certainly a more interesting situation. Thus for
economic analysis, the simple PCA vs. PLS switch option could, perhaps,
offer drastically different points-of-departure. But, really, in doing this kind of
Y-guided data modeling, we are actually more interested in the embedded X-
decomposition than in this salary modeling per se.
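To make the intended re-assignment concrete, here is a minimal, purely
illustrative sketch in Python (numpy/scikit-learn). It assumes the Cardeals
table has been exported to a hypothetical comma-separated file with the ten
economic variables as columns, X8 being the eighth column; the file name,
layout and number of components are assumptions, and the exercise itself is of
course carried out in The Unscrambler:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical export of the Cardeals table: rows = dealerships,
# columns = the ten economic variables; X8 = "Salary, managing director".
data = np.loadtxt("cardeals.csv", delimiter=",", skiprows=1)

y = data[:, 7]                    # X8 re-assigned as the PLS1 y-variable
X = np.delete(data, 7, axis=1)    # the nine remaining economic indicators
Xs = StandardScaler().fit_transform(X)   # auto-scaling of the X-variables

pls = PLSRegression(n_components=2, scale=False).fit(Xs, y)

# Residual salary: positive = paid above what the company economics "predict",
# negative = managing director apparently accepting to be underpaid.
residuals = y - pls.predict(Xs).ravel()
for i, r in enumerate(residuals, start=1):
    print(f"dealership {i:2d}  salary residual: {r:+.1f}")

# The embedded X-decomposition is the real interest: t1-t2 scores and
# loading weights are interpreted exactly as in the usual plots.
t12 = pls.x_scores_[:, :2]
w12 = pls.x_weights_[:, :2]
```

Naturally, such a residual inspection would in practice only be trusted after
the usual calibration issues (pre-processing, outliers, validation) have been
settled.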

You should thus be able to re-formulate the objective of the car dealership
data analysis accordingly, and to repeat the analysis of the interrelationships
of Norwegian car dealerships, but now as expressed by an appropriate PLS1
analysis. It will still be the pertinent t1-t2 score- and accompanying loading-
weight plots, which will be of highest interest. This approach in fact
constitutes a nice example of a situation in which the PLS-data analysis only
forms the framework in which the internal X-relationships actually are the
prime interest. For a successful model to be possible there are – as always –
the usual calibration issues to address first (pre-processing, outliers,
validation).

After having negotiated these issues, please try to formulate the relevant new
interpretations of the PLS data analysis results, and compare them with their
counterparts from Exercise 3.13 above. Are we now finally in a position to
approach the editorial offices of the magazine “CAPITAL”? – And, if yes,
what can we teach the economic journalists about the significant issues of
interacting, correlated variables? This would be the crux of the matter!

13.3 Vintages
Here is another completely new data set, which you have never seen before.
This particular data set originates as a small guide for beginner wine-
aficionados, as a first-hand overview of the most important background
knowledge, which is essential for any wine expert, beginner or experienced.

Everybody loves wine. Everybody would like to be able to appreciate wine
quality as well as the professional wine assessors. It is a task of patience and
experience, and a cultivated and pleasant one at that. You simply have to
start to accumulate your own impressions, your own experiences and then
compare with the opinions of the professional wine tasters.

The compilation of vintage assessments, which forms the background for the
present examination data set, stems from an extended survey carried out by a
major Scandinavian wine importer, who would prefer to remain anonymous.
But we are very grateful for permission to use the data compilation below.

Photograph 4

In the data file Wines a selection of major wine types and selected
representatives from some of the most important French wine-producing
regions and other countries have been assigned vintage assessments for a
series of important recent years (1975-1994). Please observe that each entry in
this data table represents a carefully pruned and trimmed average overall
quality assessment as carried out by the local wine controlling authorities, not
by the producers themselves. Clearly there is a complex averaging procedure
involved behind each entry, the particulars of which need not concern us here
however, except for the fact that the same averaging procedure is used for all
individual entries listed, making them eminently comparable. In this table,
each vintage is expressed by an average assessment-index on a scale from 0 to
20 - obviously with 0 representing an outlandish, completely failed and utterly
unacceptable quality, while 20 represents the sublime, the perfect, the
quintessential quality…

Everybody knows that the vocabulary of true wine connoisseurs is second to
none in breadth, depth and associative power ---- oh la la! But less will also
do: everybody is equally able to conduct a systematic evaluation of more and
more wines, so let us dig into the land of wines and their appreciation,
amateurs and experts alike.

Photograph 5 - Wine assessors (not yet complete professionals) at work on champagne

The compilation of expert vintage assessments in Table 13.2 below is to be
taken seriously – it contains very broad experience compressed into a handy
quantitative format.

Photograph 6 - Wines need storage for clearing, maturing, and improving…

This data set is of general interest, amongst other reasons, because by the very
nature of this subject-matter there will always be a major bias present towards
the higher end of this scale! In point of fact there are no values below 8 in this
27 x 20 table, while more than 45% of the entries lie in the interval 15–19,
although, naturally enough, on the other hand there are (very) few “20”-
values! The data are thus both heavily censored and heavily skewed and as
such of very high tutorial value. But there is more.
We list an excerpt from the entire data table below, Table 13.2 (it is to be
found in its entirety on the CD accompanying this training package, alongside
all other exercise data sets) in order to focus on another important issue for
this particular data set. As listed here by the original wine importer, the rows
of the data table are made up either of individual wines (e.g. Saint-
Emilion), or wine regions (e.g. Bourgogne, white), or even individual country
aggregates (as detailed for red/white wine respectively), totaling 27 rows. The
columns represent the wine assessment years in question: 1975,
1976… 1994 – totaling 20 columns in all. There are a few, quite
understandable, missing values in this data table as well (e.g. the years when
grape harvests were destroyed). However these are so few and so irregularly
located that no serious hampering of the data analysis is likely to occur –
except for the finding that no information is available for the entire 1975-
1982 interval on Beaujolais wines, at least in this compilation. Whether your
œnologic (i.e. wine tasting) experiences are such that this comes as no
surprise to you, or not, we are confident that you will be able to take the
appropriate data analysis action pertaining to this row.

Now: You are kindly asked to perform an appropriate data analysis on the
WINES data table.

There are only a total of (27 x 20): 540 potential elements in this small data
matrix, less some 19 missing values, totaling 521 actual values, so what could
possibly be the problems involved in what at first sight would appear as a
straightforward PCA?

You will – hopefully – be greatly surprised: good luck with your ongoing
advanced learning!

Hint: The foundations of PCA, with its accompanying basics of mean
centering and/or standardization, may have to be called critically into
question here. Indeed the issue of re-formulation will appear in a new
context, the like of which you have probably never met before. The very
definition of variables and objects may need careful reflection.
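As a small, hypothetical illustration of one practical point only - the
handling of the "m" entries - the excerpt could be read and prepared along the
following lines in Python/numpy (file name, tab-delimited layout and the simple
mean imputation are all assumptions; NIPALS-based software such as The
Unscrambler handles missing values directly):

```python
import numpy as np

# Hypothetical plain-text export of the Wines table, "m" = missing assessment.
raw = np.loadtxt("wines.txt", dtype=str, delimiter="\t", skiprows=1)
labels, values = raw[:, 0], raw[:, 1:]
X = np.where(values == "m", "nan", values).astype(float)

# Simple column-wise mean imputation of the few missing entries before PCA.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)

# Depending on the chosen re-formulation, the objects may be the wines (rows)
# or the vintage years (columns); transposing X swaps these two roles.
X_years_as_objects = X_filled.T
```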

Table 13.2 - Enologic vintage assessments for selected wines and regions
for the years 1975-1994, excerpt. Complete file: Wines
Vintage 75 76 77 78 79 80 81 82
Bordeaux, red, Médoc/Graves 17 14 10 17 17 14 16 18
Saint-Emilion/Pomerol 19 16 10 16 17 13 16 18
Bordeaux, white (dry) 19 16 13 16 17 14 18 19
Sauternes/Barsac 17 16 12 15 14 m 15 14
Bourgogne, red, Côtes de Nuits 10 16 11 17 15 13 14 12
Côtes de Beaune 7 14 9 19 15 13 12 14
Bourgogne, white 13 15 13 17 17 12 14 17
Beaujolais m m m m m m m m
Rhône – north 14 14 13 19 15 14 13 17
Rhône – south 13 15 12 18 16 14 17 15
Loire, Muscadet/Touraine/Anjou 15 15 12 14 16 12 12 14
Pouilly-Fumé/Sancerre 18 18 8 15 14 13 14 17
Alsace 14 19 11 15 14 8 14 14
Champagne, vintage 18 16 m 15 16 14 15 18
Germany, Rhine valley 12 19 12 13 15 14 13 12
Mosel 17 18 8 9 15 8 15 11

Italy, Toscana 16 8 14 18 16 16 14 18
Piemonte 10 10 10 20 16 14 12 20
Portugal – vintage port 13 m 19 13 m 16 m 13
Spain – Rioja 17 16 19 17 12 16 18 19
California, red 16 16 15 18 16 17 17 17
Washington State, red 18 18 16 17 18 18 17 15
Oregon, red 17 12 18 16 16 14 15 16
Chile, red 14 13 14 15 15 13 14 12
Australia, red 15 16 16 16 17 16 15 18
Argentina, red 14 16 19 14 13 15 16 18
South Africa, red 11 17 14 16 14 13 14 19

13.4 Acoustic Chemometrics (a. c.)


The Applied Chemometrics Research Group (ACRG), Telemark Institute of
Technology, has for some 5+ years been engaged in developing a novel
methodology for acoustic chemometrics (a.c.): applying sound and/or
vibration sensors to physical phenomena with the aim of new, enhanced
process and product characterization.

“Acoustic chemometrics – from noise to information” (Esbensen et al., 1998),
and “Acoustic chemometrics – II: a small constriction will go a long way!”
(Esbensen et al., 1999) outline all pertinent details of this new technology.
This work has from the start been carried out in a continuing, very close
collaboration with the construction firm Sensorteknikk AS, Oslo, headed by
Mr. Bjørn Hope.

Acoustic chemometrics has, amongst many other application areas, been
applied to quantify:
1) Grain size distributions in aggregate materials, especially powders, sands,
granulates, etc.
2) Trace concentrations of water-pollutants (oil spillage waters/environmental
monitoring a.o.)
3) Extrusion plastic products
4) Paper pulp compositions (wet end paper machine process/product
monitoring)
5) Industrial granulation/agglomeration/coating process monitoring
6) Pneumatic powder transportation monitoring and control

A representative acoustic chemometric undertaking is that of non-invasive on-
line characterization of a range of fluid mixtures by the deployment of
(“clamp-on”) accelerometer sensors on an appropriate constricting
construction detail in an otherwise ordinary pipeline (an orifice plate or a
suitable valve for example). Figure 13.2 sketches the pertinent details for such
a typical setup.

Figure 13.2 - Acoustic chemometrics, showing the principal features of the
ACRG experimental rig for fluid mixture quantification. Insert: orifice
plate/accelerometer measurement device

Downstream of the constriction a turbulent regime is set up, which turns out to
be both selective and diagnostic of many of the factors involved in generating
this modified flow regime, for example the concentrations of the influencing
mixture components.

Figure 13.2 shows ACRG’s experimental laboratory rig on which many of the
fundamental studies on the oil-in-water micro-pollution system have been
carried out (in the 0-300 PPM range). In particular it has been found that real-
time FFT (Fast Fourier Transform) of the raw time-domain acoustic signals
will result in a suitable spectral format (so-called power spectra) for this kind
of data, which can be used directly as X-inputs to, for example, PLS-
calibration. The Y-variable in this context would of course be any relevant
intensive parameter for which prediction from such acoustic measurements is
desirable, e.g. the trace oil concentrations. Esbensen et al. (1999) details
many other systems which have also been characterized by acoustic
chemometrics.
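
As a rough, hypothetical illustration of this signal route (the file name,
sampling rate and block length below are assumptions, not the actual ACRG
settings), one raw time-domain trace could be turned into a single
power-spectrum row for the X-matrix along these lines in Python/numpy:

```python
import numpy as np

def power_spectrum(signal, n_fft=512):
    """Averaged FFT power spectrum of a 1-D time-domain acoustic signal.

    The trace is cut into consecutive blocks of n_fft samples; each block is
    windowed and transformed, and the squared magnitudes are averaged,
    giving one spectral row (object) for the X-matrix.
    """
    window = np.hanning(n_fft)
    blocks = len(signal) // n_fft
    spectra = [np.abs(np.fft.rfft(signal[b * n_fft:(b + 1) * n_fft] * window)) ** 2
               for b in range(blocks)]
    return np.mean(spectra, axis=0)

# Hypothetical raw recording sampled at 50 kHz (giving a 0-25 kHz spectrum)
raw = np.loadtxt("accelerometer_trace.txt")
x_row = power_spectrum(raw)                       # one object for X
freqs = np.fft.rfftfreq(512, d=1.0 / 50000.0)     # frequency axis in Hz
```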

In order to highlight ACRG’s most distinguished acoustic chemometrics
collaborator, we have here chosen to present a contemporary data set relating
to the efforts of Sensorteknikk AS to implement a working system for
continuous surveillance of the de-icing agent glycol at Oslo’s new airport
Gardermoen.

Photograph 7 (below) shows two representative illustrations from this
prototype industrial implementation, in which one in particular may note the
vertical orientation of the constricted flow. This sensor can be seen
sandwiched between the two flanges in the blowup (right-hand side panel).
This setup constitutes a typical indirectly calibrated analytical instrument, in
this case for acoustic chemometrics glycol determination. In other words, a
completely standard multivariate calibration would appear to be what is called for,
with the acoustic FFT-spectra (X) and glycol-concentrations (Y) from a
suitable training data set, which is easily enough brought about.

Photograph 7 (left/right) – Sensorteknikk AS’s prototype industrial acoustic
chemometrics implementation for glycol determination at Oslo’s new
Gardermoen airport (left: general overview; right panel: close-up of
acoustic chemometrics measuring area)

This setup includes such advanced features as programmable automatic re-
calibration facilities, on-site LabView control as well as telemetric remote
monitoring and control. The fourth master data set has been recorded on this
setup in April 1999.

The file Glycol1 contains 256 training spectra; note 18 replicates for each
concentration in the general glycol interval between 0.0% (corresponding to
pure Gardermoen groundwater) and 3.0%, which is the range of interest for
the Norwegian pollution authorities. The acoustic frequencies range between
0 – 25 kHz, which is probably severe frequency overkill. From our combined
acoustic chemometrics experiences one is led to expect that one, or more,
coherent frequency-bands embedded in this overall broad band signal would
be optimal to do the job of quantifying the concentration of glycol. These are
high-precision data, and there are many options when contemplating to use a
problem-specific averaged version of the 18 replicates.

What would you do in this context? How to go about this calibration?

There is also a completely pristine test set to be found for the Glycol data set!
This is called Glycol2 (which in fact carries a different number of replicates).

This novel Sensorteknikk AS-ACRG acoustic chemometric technology is
still partly in its industrial implementation infancy. Accordingly, some of
the by now well-known practical calibration problems, such as outlier policing
or pre-processing, may of course show up when carrying out what at first sight
should appear to be a pretty standard multivariate calibration. The range of
the “usual” potential calibration issues will still have
to be checked carefully, as always. This is the very last calibration data set
presented to the reader to practice on in this book – so it is anyone’s guess
just how many subtleties you will find embedded in this last examination
project: be aware!

But when all that has been well taken care of, there is a big bonus waiting:
The validation issue is almost ideal in this particular case: we have at our
disposal a bona fide real-world test set, acquired according to all the most
stringent validation prerequisites, compare Chapter 7. Thus we may direct our
attention (after your own preliminary calibration has been arrived at, of
course) to the interesting issues involved in using a full-fledged test data set.

Indeed the main issue of this application is assessing the specific prediction
robustness of the technology involved. For this purpose the test set was
acquired only after considerable time had elapsed since the training
calibration (several days), and for good measure the whole acoustic
chemometrics apparatus had been shut down for service in this period as well.
Therefore we are in a position to focus our calibration experience mainly on
the prediction issues in general, and the specific robustness issues in particular.
For example: which part(s) of the original spectrum would appear to be the
most robust in the above sense? It may very well be necessary to perform
most robust in the above sense. It may very well be necessary to perform
some exploratory, radical surgery on the original spectral X-variable range in
order to find the desired bands, which will increase the robustness.

The objective for this application context is thus clear: You are to find the
most robust subset of the original spectral range employed. Use whatever
means you may command at this stage. The main issue here turns out to be
problem-specific (validation-specific) variable selection.

Hint: Without some form of standardized variable selection algorithmic
approach, there simply will be no single “correct solution”. There may in fact
very well be several “best” solutions found.
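
One simple, hypothetical way of organizing such a search - an outside sketch of
the idea only, not a prescribed solution - is to calibrate a PLS model on each
contiguous frequency band of Glycol1 and rank the bands by their RMSEP on the
untouched Glycol2 test set. File names, data layout, band width and the number
of components are all assumptions here, and for brevity the 18 replicates are
used as individual objects rather than averaged:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rmsep(model, X, y):
    return float(np.sqrt(np.mean((y - model.predict(X).ravel()) ** 2)))

# Hypothetical layout: last column = glycol concentration (Y), the remaining
# columns = the acoustic FFT spectrum (X). Glycol1 = training, Glycol2 = test.
train = np.loadtxt("glycol1.csv", delimiter=",", skiprows=1)
test = np.loadtxt("glycol2.csv", delimiter=",", skiprows=1)
Xtr, ytr = train[:, :-1], train[:, -1]
Xte, yte = test[:, :-1], test[:, -1]

# Scan contiguous frequency bands and rank them by test-set RMSEP
band_width, step = 50, 10          # number of adjacent X-variables per band
results = []
for start in range(0, Xtr.shape[1] - band_width, step):
    cols = slice(start, start + band_width)
    pls = PLSRegression(n_components=3).fit(Xtr[:, cols], ytr)
    results.append((rmsep(pls, Xte[:, cols], yte), start))

for err, start in sorted(results)[:5]:
    print(f"band starting at X-variable {start:4d}: RMSEP = {err:.3f}")
```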

14. Uncertainty Estimates, Significance and Stability (Martens’ Uncertainty Test)
This chapter deals with uncertainty estimates of the model parameters in
multivariate methods such as PCA, PCR and PLSR. The exercise will deal with
PLSR, although the methodology is also implemented for PCR. It is recommended
that you read this chapter when you have fully understood the concept of cross-
validation, as described in Chapter 7.

Recall that the cross-validation gives a number of individual sub-models that are
used to predict the samples kept out in that particular segment. Therefore, we have
perturbed loadings, loading weights, regression coefficients and scores to be
compared to the full model. The differences, or rather variances, between the
individual models and the full model will reflect the stability towards removing one
or more of the samples. The sum of these variances will be utilized to estimate
uncertainties for the model parameters.

We will discuss the uncertainty estimates in terms of:


• Variable selection
• Prediction performance
• Stability

14.1 Uncertainty Estimates in Regression Coefficients, b
The approximate uncertainty variance of the PCR and PLS regression coefficients
b can be estimated by jack-knifing (Equation 14.1).

Equation 14.1        s2b = [ Σm=1…M (b - bm)2 ] · (N − 1) / N
where
N = number of samples

s2b = estimated uncertainty variance of b
b = regression coefficient at the cross validated AOpt components using all the N
samples
bm = regression coefficient at the rank A using all objects except the object(s) left
out in cross validation segment m.

On the basis of such jack-knife estimates of the uncertainty of the model
parameters, useless or unreliable variables may be eliminated, in order to simplify
the final model and make it more reliable. This is done by significance tests,
where a t-test is performed for each element in b relative to the square root of its
estimated uncertainty variance s2b, giving the significance level for each parameter.
The uncertainties for the regression coefficients are estimated for a specific number
of components, preferably the optimum number, AOpt. See chapter 7.1.3 for
discussion of how to find AOpt. The Unscrambler will always give a suggested
number of components, but note that in some situations this number might be too
conservative, for instance when you have spectroscopic data as X-variables.

An informative and visual approach is to show the b-coefficients ± 2 standard
deviations, as this corresponds to a confidence interval of approximately 95%. The
more formal statistical test is that the confidence interval is b ± t(0.05/2, df)·sb.
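
A minimal, purely illustrative sketch of these jack-knife calculations
(Equation 14.1 and the corresponding significance marking) is given below in
Python with scikit-learn. The data arrays, the choice of 20 segments and the
degrees of freedom used in the t-test are assumptions made for the illustration
only; they are not The Unscrambler's exact implementation:

```python
import numpy as np
from scipy import stats
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def jackknife_b_uncertainty(X, y, a_opt, n_segments=20, alpha=0.05):
    """Jack-knife uncertainty of the PLS regression coefficients (Equation 14.1)."""
    N = len(y)
    b_full = PLSRegression(n_components=a_opt).fit(X, y).coef_.ravel()

    sq_diffs = []
    for train_idx, _ in KFold(n_splits=n_segments).split(X):
        b_m = PLSRegression(n_components=a_opt).fit(X[train_idx], y[train_idx])
        sq_diffs.append((b_full - b_m.coef_.ravel()) ** 2)

    s2_b = np.sum(sq_diffs, axis=0) * (N - 1) / N     # Equation 14.1
    s_b = np.sqrt(s2_b)

    # t-test per coefficient; b +/- 2*s_b roughly corresponds to the 95% limits.
    # (The degrees of freedom chosen below are an assumption for this sketch.)
    t_val = np.abs(b_full) / s_b
    p_val = 2 * stats.t.sf(t_val, df=n_segments - 1)
    return b_full, s_b, p_val < alpha
```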

14.2 Rotation of Perturbed Models


The individual models have a rotational ambiguity, thus Tm, Pm, Qm and Wm from
cross-validated segments must be rotated before the uncertainties are estimated. A
rotation matrix Cm satisfies the relationship

Equation 14.2 Tm[ PmT,QmT] = TmCmCm-1[PmT,QmT]

When Cm is estimated, for instance by orthogonal Procrustes rotation of Tm
vs. T (Equation 14.2), the individual models m=1,2,...,M may be rotated towards the
full model:

Equation 14.3 T(m) = TmCm
for scores, T, and

Equation 14.4 [ PT,Q](m) = Cm-1[PmT,QmT]
for x- and y-loadings, P, Q.

After rotation, the rotated parameters T(m) and [PT, QT](m) may be compared to the
corresponding parameters from the common model T and [PT, QT]. The loading
weights, W, are rotated correspondingly to the scores, T. The uncertainties are then
estimated as for b; thus the significance for the loadings and loading weights is
also estimated. This can be used for component-wise variable selection in
x-loadings and loading weights, but it also gives an optional criterion for finding
AOpt from the significance of the y-loadings, Q.
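
As an illustration of Equations 14.2-14.4, the sketch below rotates one
sub-model towards the full model using the orthogonal Procrustes solution from
SciPy. The array names and shapes are assumptions (scores as objects x
components, loadings and loading weights as variables x components), and this
is of course not The Unscrambler's internal code:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def rotate_submodel(T_full, rows_m, T_m, P_m, Q_m, W_m):
    """Rotate one cross-validation sub-model towards the full model.

    C_m is estimated by orthogonal Procrustes rotation of the sub-model scores
    T_m against the corresponding rows of the full-model scores (Equation 14.2).
    Scores and loading weights are rotated by C_m (Equation 14.3); since C_m is
    orthogonal, Equation 14.4 for the loadings reduces to the same
    right-multiplication by C_m.
    """
    C_m, _ = orthogonal_procrustes(T_m, T_full[rows_m])  # T_m @ C_m ~ T_full rows
    return T_m @ C_m, P_m @ C_m, Q_m @ C_m, W_m @ C_m
```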

14.3 Variable Selection


It is a general principle that a parsimonious model, i.e. a model with fewer
components and variables than the full PLS model, will give a lower estimation
error (Fig. 7.3), since the number of model parameters is smaller. This affects the
RMSEP values as well as the uncertainty (or deviation) in the predicted values
themselves, named “YDev” in our terminology.

The general rule is thus that if a model with fewer variables/components is as
good as or better than the full model with respect to predictability, we would prefer
the simpler model. However, if the objective is also to interpret the PCs and the
underlying structure, it could be an advantage in some cases to keep some of the
non-significant variables to span the multidimensional space. An example of this is
the visualization aspect of plotting the new predicted samples’ scores in the score
plot from the calibration model.

NIR spectroscopy is a field where multivariate calibration has been shown to be an
efficient tool with its ability to embed unknown phenomena (interfering
compounds, temperature variations etc.) in the calibration model. There is, however,
still a large number of applications that utilize only two or three wavelengths in
routine prediction. These applications have shown that the full PLS model is
sometimes inferior to a model based on a relatively small number of variables
found by various methods for variable selection. This is partly due to the
redundancy and the large amount of noisy, irrelevant variables in NIR spectra.
Recent results show that variable selection based on jack-knife estimates is a fast
and reliable method with low risk of overfitting.

14.4 Model Stability


Model stability can be visualized in scores and loading plots by plotting all
perturbed and rotated model parameters together with the full model. Examples
will be given below in this section, but let us first advocate for the use of these
stability plots.

14.4.1 Introduction
We have previously covered the procedure of removing outliers in our multivariate
models, and the leverage measure, hi, is a tool to find influential samples; samples
which have a high impact on the direction of the PCs. The simultaneous
interpretation of scores and loadings plots gives us information about which
variables span a specific direction and which samples are extreme in this direction.
The loading plots can also give information about correlation between variables,
both x and y, but the explained variance is also an important part here: is it valid to
interpret this PC at all? Even if there seems to be a high correlation, it is essential
to find out how the two variables are correlated, i.e. a 2D scatter plot of the
variables will reveal the structure. These aspects are particularly relevant for
models with a low number of samples.

14.4.2 An Example Using the Paper Data


A new sample set has been generated from the Paper data set: All samples without
missing values in the Training set, 66 samples, were subjected to a PLS regression.
The option Edit - Mark - Evenly Distributed Samples Only was used to select 18
samples that spanned the first 3 PCs. These samples were extracted by use of Task
- Extract Data From marked, and a PLS regression was made with LOO [leave
one out] cross-validation. Figure 14.1 shows the score plot with the individual
score values shown for all sub-models after rotation. The scores from the full
model are the small circles in the center of each “cluster”. The sample numbers are
omitted to increase the readability.

Figure 14.1 - Stability plot. Note the position of sample 14 when it was not part of
the sub-model (the crossed circle marked with an arrow).
(Score plot, PC1 vs. PC2.)

Figure 14.2 - Stability plot of loading weights/y-loadings. Note the position of
x-variable Permeability for one of the segments - the segment when sample 14 is
left out.
(Plot of X-loading weights and Y-loadings, PC1 vs. PC2; the variables Print-through
and Permeability are highlighted, with one point annotated “Sample 14 left out”.)

A 2D scatter plot of Permeability vs. Print-through reveals that sample 14 has a
very high value for permeability compared to the rest of the samples. Thus, the
relation (or correlation) between these variables is changed when sample 14 is kept
out. The stability plots based on LOO cross-validation enable us to find and
interpret subtle structures in the data very efficiently in the realm of our objective:
to establish a regression model between x and y.

14.5 Exercise - Paper - Uncertainty Test and Model Stability
Purpose
In this exercise you will make PLS models with cross-validation and estimate the
uncertainties as a basis for significance tests. You will also visualize the model
stability by plotting scores and loadings from individual sub-models.

Data Set
PAPER, the same as in exercise 10.4. Part of the description is given below.

The data is stored in the file Paper. You are going to use the sample sets Training
(103 samples) and Prediction (12 samples), and the variable sets Process (15
variables) and Quality (1 variable).

Tasks
Make a PLS regression and estimate uncertainties. Mark the significant variables
and make a new model. Predict the 12 Prediction samples with the two models.
Visualize and interpret the stability plots in scores and loadings (loading weights).

How to do it
1. Make a PLS model.
Go to Task - Regression. We will use “Systematic 123123123” with 20
segments as cross-validation method to yield identical results in this exercise.
The data are sorted after increasing y-values, which means that the main
structure is retained in all segments by this segment selection. Check the box
named Uncertainty test in the regression dialogue. There is an option to how
to determine how many components to use: the Opt #PCs as default from The
Unscrambler, or to manually select the optimum number of components. These
uncertainty calculations are often performed after outliers have been removed
and the “correct” number of components has been decided upon. Use 3 in this
case.

2. Make a new model with the significant x-variables


Plot the regression coefficients in one of the plots (Plot - Regression
coefficients, PCs: 3), and mark the significant x-variables (Edit - Mark -
Significant X-Variables only or click on the icon). Do the marked variables
match the ones you marked manually in exercise 10.4? Visualize the
uncertainties as ± 2 Std.Dev. by selecting View – Uncertainty Test -
Uncertainty limits. Recalculate with the significant variables and use two
components in the uncertainty test calculations. Compare RMSEP for the two
models and save them (File - Save As).

Try different cross-validation options, and see how the significance is affected
by the number of segments, and how they are selected. Experience from other
data shows that the uncertainty estimates are quite stable regardless of whether
you use LOO, 20, 10 or 5 segments in the cross-validation, assuming that extreme
outliers have been removed.

3. Predict the samples in the sample set Prediction


Make predictions of the 12 samples for both models from Task - Predict, and
compare the predicted values and Deviations. The average deviation is 2.1 for
the model with seven variables, versus 2.5 for the full model.

4. Visualize stability
The rotated and perturbed scores and loadings can be activated from View-
Uncertainty Test - Stability Plot or the icon on the toolbar. Since this is a
20 segment cross-validation, each individual score is a result from keeping five
or six samples out. You might want to make a model with 15-20 samples with
LOO cross-validation to interpret these plots in more detail, and see how a
single sample will affect the model. There is information about which sample
was left out; click on the points in the score plot and you’ll see the segment
number. Plot both x- and y-loadings and loading weights/y-loadings and explain
the differences.

Summary
The automatic marking of the significant x-variables gave the same important
variables as from the manual selection, and a model based on this variable selection
gave lower RMSEP after 2 PCs than the full model with 3 PCs. It also seemed that
the deviations in prediction of new samples were smaller on average. The stability
plots give information about the structures of the data, such as correlation patterns
and the impact of influential samples on the model.


15. SIMCA: An Introduction to Classification
We present SIMCA, the major chemometric classification approach, in this chapter
as an introduction to the more general field of pattern recognition, which will be
discussed further in chapter 18.

Classification Based on Similarities


Classification is an example of a supervised pattern recognition objective used, for
example, to see if one or more new samples belong to an already existing group of
similar samples. For this, and related, purposes the Soft Independent Modeling of
Class Analogy (SIMCA) approach is the unsurpassed methodology.

The philosophy behind this classical chemometric technique - it was in fact the
very first chemometric method to be formulated, Wold (1976) - is that objects in
one class, or group, show similar rather than identical behavior. This can at first
simply be described as meaning that objects belonging to the same class show a
particular class pattern, which makes all these samples more similar to each other
than to objects of any other group or class. The goal of classification is to assign new
objects to the class to which they show the largest similarity. With this approach
you specifically allow the objects to display intrinsic individualities as well as their
common patterns, but you model only the common properties of the classes.

The easy part of mastering SIMCA is that you have already learned 90% of this
approach, because you have mastered the application of PCA. SIMCA is nothing
other than a flexible, problem-dependent, multi-application use of PCA-modeling.
A practical introduction of how to use SIMCA is given below, without spelling out
all the technical background at first. The SIMCA approach has been retold
numerous times since its inception (see reference list), never surpassed though is
Wold’s classical paper (1976) for the full traditional introduction, complete with all
technical details.

Let us first of all observe how grouping or clustering appears to the bilinear eye
(Figure 15.1).

Figure 15.1 - Grouping (clustering) as revealed in the initial overview score plot

SIMCA classification is simply based on using separate bilinear modeling for each
bona fide data class, which concept was originally called disjoint modeling. The
individual data class models are most often PCA models (because in the simplest
SIMCA formulation there is no Y-information present).

Classification is only applicable if you have several objects in each class because
every class has to support an A-dimensional PCA model. A complete SIMCA
classification model usually consists of several PC-models, one for each class
recognized, but of course the marginal case of just one class is also an important
option.

SIMCA-Modeling - A Two Stage Process: First Overview


Disjoint class-modeling is always the first step in the classification procedure,
called the training stage. Here we make individual models of the data classes in
question, using our extensive experience with PCA.

Figure 15.2 - Each data class from Figure 15.1 as modeled by a separate
PC-model (SIMCA)

The subsequent classification stage then uses these established class models to
assess to which classes new objects belong. Results from the classification stage
allow us to study the modeling and discrimination power of the individual
variables. In addition a number of very useful graphic plots are available, which let
us study the classified objects’ membership of the different classes in more detail,
as well as quantify the differences between classes – and (much) more.

If the data classes are known in advance, i.e. if we know the specific class-
belonging of all the training set objects, it is very easy to make a SIMCA-model of
each class. This is called supervised classification. Otherwise, if we do not know in
advance any relevant class-belongings, we have to identify the data classes by
pattern recognition first, e.g. by using PCA on the entire training data sets – to look
for groupings, clusters etc.

Thus there are at the outset two primary ways into a SIMCA classification:
• Classes are known in advance (which objects belong to which classes)
• There is no a priori class membership knowledge available

For the second case, any problem-relevant data analytical method which lets us find
patterns, groupings, clusters etc. in our data may be relevant (assuming of course
that a pattern recognition problem indeed is at hand), e.g. cluster analysis. For many
applications, however, there is already such a method readily available, namely
using PCA on the entire data matrix present. It may be as simple as that!

It is not a critical issue, however, by which technique a training pattern has been
delineated; what matters is that this pattern be representative of the classification
situation.

In any event, when/after the problem-specific data class setup is known, the
SIMCA-procedure(s) is simple, direct and incredibly effective.

Advantages of SIMCA over Traditional Methods


There are several powerful advantages of the SIMCA approach compared to
methods like e.g. Linear Discriminant Analysis (LDA) or cluster analysis.
Firstly, SIMCA is not restricted to situations in which the number of objects is
significantly larger than the number of variables as is invariably the case with
classical statistical techniques (this has to be so in order for the pertinent model
parameters to be estimated with statistical certainty). Not so with the present
bilinear methods, which are stable with respect to any significant imbalance in the
ratio objects/ variables, be it either (very) many objects with respect to variables -
or vice versa. Because of the score-loading outer product nature of bilinear models,
the entire data structure in a particular data matrix will be modeled well even in the
case where the one dimension of the data matrix is (very) much smaller than the
other, within reasonable limits of course.

SIMCA is usually thought of as a “hard” classification approach, i.e. a one-class-
membership methodology, but it is in fact equally applicable to cases where an
object belongs to more than one class. For instance, a food-component compound
may be both salty and sweet by sensory evaluation, and should thus, by rights,
simultaneously fall into both these two classes. It is relatively easy to use SIMCA
also for this multi-class belonging purpose, either by itself or in connection with
using dummy classification Y-variables, as in the PLS-DISCRIM approach. This
last issue is treated in full detail by Vong et al. (1988).

In addition collinearities can be handled easily, as is well known.

Another advantage is that all the pertinent results can be displayed graphically with
an exceptional insight regarding the specific data structures behind the modeled
patterns.

Measures to Evaluate SIMCA Results


In the classification of a new set of data, SIMCA calculates extensive statistics.
With these statistics you can quantify “model envelopes”, or classification spaces,
surrounding the classes. By defining suitable criteria for class-belonging, e.g.
the distance between the envelope and the model, you can easily assess
geometrically whether new samples lie inside or outside a given model. It is thus
possible to judge the model distances very flexibly almost entirely by graphical
means, although the full numerical result complement is of course also available in
the background.

Thus, different distance measures are used to evaluate the class membership of
new objects: the object distance to the model and the distance from the model
center. Many plots are available to help you to interpret the object/class-model
relationships. The primary tools are called the Coomans plot and the model-to-
model plot. The Coomans plot gives compressed information about the class
membership to any two models simultaneously. The model-to-model plot gives
information about the degree of similarity between models. If the distance between
models is small, there is little difference and the classification model is unable to
distinguish between these classes.

15.1 SIMCA - Fields of Use


For which purposes can we then use classification?
• To identify and quantitatively characterize subgroups within a given set of
samples. Each SIMCA class is composed of similar objects. Samples in
different groups are necessarily dissimilar. We are interested in identifying the
objective groups, and to find the characteristics of each such class. A typical
application could be to characterize fish meat in different stages of freshness (or
rather, in different stages of decomposition). Another example: to characterize
the dietary eating habits of populations from different parts of the world.
• To assess whether a new sample is similar to other samples, or to which group it
belongs, for example to identify the status of a new, unknown fish meat sample
(of the same type as those for which a SIMCA-model has been established).
• To identify samples that are dissimilar compared to some standard, for example
to detect raw material batches that are outside specifications.

15.2 How to Make SIMCA Class-Models?


The flow-sheet outline below specifies the most important steps in a classification
analysis. The first step is always a class modeling, i.e. group the different objects
into homogeneous classes or clusters. (If the groups are already known, one of
course omits this step.) Then center, scale and model each class separately. In other
situations one has already made the class library beforehand. Finally, new objects
can be allocated to the established classes – or they may fall outside all “known”
patterns. This is a very special situation, which is not as bad as it may appear at
first glance – more of which later.

15.2.1 Basic SIMCA Steps: A Standard Flow-Sheet
1. Preprocess the data appropriately, if necessary (You know all about this
already).

2. Make a projection model of all objects to start to identify the individual
classes. If appropriate, scale or weight the entire set of variables appropriately.

3. Delineate the pattern-specific classes, while simultaneously discriminating
between classes. Study all the relevant score plots and identify all the problem-
specific groups or clusters. Find out which objects belong to each subgroup in
this training stage. There is always a great deal of interaction between the
general problem context and the initial data analysis results in this stage. Ideally
you should know all there is to know about the training data set background
yourself, or you should not be afraid to ask all the appropriate and relevant
questions needed for a truly representative class description!

4. Make a separate model for each class. You can use different modeling contexts
for the different classes. Use individual data pre-treatment (e.g. standardization,
weighting, or more advanced preprocessings if necessary) for each class to
assure maximum disjoint class modeling. Validate each class properly. All
classes must be validated in the exact same way, or the membership limits will
not be comparable.

5. Remove outliers and remodel as deemed necessary. Also study the appropriate
score plots to see if there should be more classes present than what is “known”
in advance, as the conventional wisdom is often wrong! If so, repeat from step 3.

Determine the optimal number of PCs for each class. We now have completed
the classification modeling stage. This may take place immediately before the
next step (see below), or this may represent a modeling task carried out earlier,
on the basis of which the present classification is to be carried out, all is
problem dependent.

6. Classify new objects. Read the new data into the program and enter the Task-
Classify menu. Select the models you want to test the objects against. Then
choose the appropriate number of PCs for each class model, the number you
determined in step 5. For more details, see section 15.3.

7. Evaluate the present classification by studying the results and using the Plot
menu. Which plots to use and how to interpret them are described in section
15.5.

15.3 How Do We Classify New Samples?


Enter the new data set to be classified - which of course must be described by the
same set of variables as for the training class models. If the calibration data (used
for making the class models) were in some way transformed, the new data must, of
course, also be transformed in the exact same way before the classification.

Select the class models to be used for the pattern recognition and specify how
many PCs to use in each model (this is strongly problem-dependent). The
appropriate number of PCs depends on the data set, the goal, and the application.
Then start the classification routine. Results can be studied both numerically and
graphically.
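
To make the two SIMCA stages tangible, here is a heavily simplified,
hypothetical sketch in Python (scikit-learn/SciPy) of one disjoint PCA class
model and an F-type membership check. The residual-distance definition, the
pooled variance and the degrees of freedom are simplifications chosen for the
illustration; they are not The Unscrambler's exact statistics:

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.decomposition import PCA

class SimcaClass:
    """One disjoint PCA class model (the training stage of SIMCA)."""

    def __init__(self, X_class, n_components):
        self.pca = PCA(n_components=n_components).fit(X_class)
        # Pooled residual variance of the training objects defines a simplified
        # class "envelope" (proper degrees of freedom are ignored in this sketch).
        resid = X_class - self.pca.inverse_transform(self.pca.transform(X_class))
        self.s0_sq = float(np.mean(resid ** 2))
        self.n_train = X_class.shape[0]

    def distance(self, x_new):
        """Residual distance (Si) of a new object to this class model."""
        x_new = np.atleast_2d(x_new)
        resid = x_new - self.pca.inverse_transform(self.pca.transform(x_new))
        return float(np.sqrt(np.mean(resid ** 2)))

    def belongs(self, x_new, alpha=0.05):
        """F-type test of the new object's residual against the class limit."""
        f_crit = f_dist.ppf(1 - alpha, 1, self.n_train - 1)
        return self.distance(x_new) ** 2 / self.s0_sq < f_crit

def classify(x_new, class_models, alpha=0.05):
    """Classification stage: return all classes whose limits the object satisfies."""
    return [name for name, model in class_models.items()
            if model.belongs(x_new, alpha)]
```

A new object may thus end up in exactly one class, in several classes, or in
none of them - the three outcomes discussed in section 15.4 below.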

15.4 Classification Results


In The Unscrambler a results table shows all objects and their classified
memberships:

For each object, a star is shown in the column belonging to a model whenever the
object in question belongs to this highlighted model with the current significance,
that is to say when it is simultaneously satisfying both the Si and Hi limits set. Non-
marked objects do not belong to any of the tested classes.

Multivariate Data Analysis in Practice


342 15. SIMCA: An Introduction to Classification

Classifying a new object can result in several different results:


1. The object is uniquely allocated to one class, i.e., it fits one model only within
the given limits. Furthermore the distance to the next closest class is typically
then much larger than the accepted distance with respect to this class.

2. The object may fit several classes, i.e. it has a distance that is within the critical
limits of several classes simultaneously. This ambiguity can be due to two
reasons; either the given data are insufficient to distinguish between the
different classes or the object actually does belong to several classes. It may be
a borderline case or have properties of several classes (for example being both
sweet and salty at the same time). If such an object is classified to fit several
classes one may for example study both the object distance (Si) and the
Leverage (Hi) to determine the best fit; at comparable object distances, the
object is probably closest to the model for which it displays the smallest
leverage.

3. The object fits none of the classes within the given limits. This is a very
important result in spite of its apparent negative character. This may mean that
the object is of a new type, i.e. it belongs to a class that was unknown until now
or - at least - to a class that has not been used in the classification. Alternatively
it may simply be an outlier.

One of the most important scientific potentials of the SIMCA approach is related to
this very powerful aspect of “failed” pattern recognition: one must always be
prepared to accept that one or more objects actually do not comply with the
assumed data structure pattern(s). Clearly it is important to be able to identify such
potentially important “new objects” with some measure of objectivity. At some
point in this pattern recognition process it will become important to be able to
specify the statistical significance level(s) associated with this “discovery”. Hence
a few remarks on the use of the statistical significance level.

15.4.1 Statistical Significance Level and its Use: An Introduction
Statistical significance tests are based on the concern of making mistakes. The
initial statistical hypothesis, H0, is always that a new object belongs to the class in
question. The statistical classification test (an F-test) checks the risk of rejecting
this hypothesis by mistake (a so-called type I error). This means that - statistically -
one never tests the probability that an object actually belongs to a particular
class! In the setup used in SIMCA-classification, the test carried out quantifies the
risk of saying that a particular object lies outside a specific model envelope - even
if it truly belongs. If you have had no formal statistical training all this may -
perhaps - appear a little confusing, but let us see how this works in practice.

In The Unscrambler you may study classification results with varying
significance levels - usually between 0.1% and 25%. You study whether an object
falls within (belongs to) or outside the two closest models at this a priori
significance level, chosen by the data analyst before the classification is carried
out. The chosen significance level is intimately related to the problem at hand, or
at least it should be. There are, however, also many application studies which rely
more-or-less completely on “standard” statistical rules-of-thumb (which may be all
there is in some of these situations).

The “normal” statistical significance level used is 5%. In very practical data
analytical classification terms this “means” that there is a 5% risk that a particular
object falls outside the class, even if it truly belongs to it; 95% of the objects which
truly belong will thus fall inside the class. At opposing ends of a spectrum of
significance levels typically used, we may illustrate these issues in the following
manner:

A high significance level (e.g. 25%) means that you are being stricter - only very
“certain” objects will belong to the class, and (many) more “doubtful” cases will lie
outside it. Fewer objects that truly belong will fall inside the class (in this case,
75%).

A low significance level (e.g. 1%) on the other hand means that you are being very
“sloppy” - doubtful cases will still be classified as belonging to the class. You will
get more objects classified as members, i.e. almost all of the true member objects
(i.e. 99%) will be classified as members of the class in this case.

It is important to understand that the significance test only checks the object with
respect to “transverse” object-to-model distance, Si, which is compared to a
representative measure of the total Si -variation exhibited by all the objects making
up the class, called S0. A standard F-test is used. A fixed limit (depending on the
class model) is used for the leverage.
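
As an illustration only, the kind of F-test described above can be sketched in a few lines; the exact degrees-of-freedom convention used by The Unscrambler may differ in detail from the one assumed here, and the function name is our own.

```python
# Hedged sketch of the Si-versus-S0 membership test; degrees-of-freedom
# conventions vary between SIMCA implementations.
from scipy.stats import f

def within_class_limit(Si, S0, K, A, I, significance=0.05):
    """True if the new object's residual distance Si is within the class limit."""
    F_observed = (Si / S0) ** 2
    F_critical = f.ppf(1.0 - significance, dfn=K - A, dfd=(I - A - 1) * (K - A))
    return F_observed <= F_critical
```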

15.5 Graphical Interpretation of Classification Results

15.5.1 The Coomans Plot


Purpose
This plot shows the orthogonal (transverse) distances from all new objects to two
selected models (classes) at the same time. The critical, cut-off class membership
limits (S0) are also indicated. These limits may be changed by toggling the
significance level (but only for exploratory purposes – you have already set the
significance level before calling upon the classification routine!). The higher the
significance level, the more strictly the new objects will be judged with respect to
“true” membership. This means that only “certain” cases will be recognized as
belonging to the “nearest class”; the “doubtful” cases will fall outside.

The Coomans plot shows the object-to-model distances for both the new objects as
well as the calibration objects, which is very useful when evaluating classification
results.

Interpretation
If an object truly belongs to a model (class) it should fall within the membership
limit, that is to the left of the vertical line or below the horizontal line in this plot.
Objects that are within both lines, i.e. near the origin, must be classified as
belonging to both models. The Coomans plot looks only at the orthogonal distance
of an object with respect to the model. To achieve a correct classification the
leverages should also be studied, e.g. in the Si vs. Hi plot.

If an object falls outside the limits, i.e. in the upper right corner, it belongs to
neither of the models. It is very important that - after having decided on the
significance level before you carry out the classification - you respect the
classification results.

To make interpretation easier, the objects are color coded in The Unscrambler.
NB: What follows applies if you are using the default color scheme with a black
background (the colors for a white background are given in parentheses).

Yellow (green) objects are the new objects being classified. Cyan (blue) objects
represent the calibration objects in model one, while magenta (magenta) ones
represent the calibration objects from model number two.

Figure 15.3 - Coomans plot
(Sample Distance to Model Iris Versicolor vs. Sample Distance to Model Iris Setosa;
SIMCA Iris, Significance = 5.0%, Model 1: Iris Setosa, Model 2: Iris Versicolor)

15.5.2 The Si vs. Hi Plot (Distance vs. Leverage)
Purpose
The Si vs. Hi plot could also be called a membership plot because it actually shows
the limits used in the classification, both for the distance to the model (the residual
standard deviation) and for the leverage (distance to the model center), for each
object. Objects that fall inside these limits are very likely to belong to the class
(model) at the chosen significance level.

This plot is similar to the Influence plot, used to detect outliers in PCA-calibration.

Interpretation
The Si vs. Hi plot shows the object-to-model distance and the leverage for each
new object. The leverage can be read from the abscissa and the distance from the
ordinate. The class limits are shown as gray lines; horizontal for the object-to-
model distance and vertical for the leverage limit.

The limit for the object-to-model distance depends on the significance level chosen.
The leverage limit depends on the number of PCs used and is fixed for a given
classification.

The leverage value shows the distance from each object to the model center. It
summarizes the information contained in the model, i.e. the variation described by
the PCs.

Objects near the origin, within both gray lines, are classified as bona fide members
of the model. Objects outside these lines are classified as not belonging to the
model. This either/or aspect of the classification results is what has given rise to the
terminology “hard classification”. The further away from the origin of the plot they
lie, the more different the objects are. Objects close to the abscissa have short
distances to the model, but may be extreme (they may well have large leverages).

Objects in the upper right quadrant, for example object 45 in Figure 15.4, do not
belong to the class in question. Objects in the lower left corner, for example object
1, are well within all limits of the test. The ones in the lower right quadrant, for
example object 15, have short distances to the PC-model but at a high(er) leverage,
so they are in the sense specified by the chosen significance level “extreme” and
may in fact not belong for that reason. You should check this based on your
knowledge of the problem.

Figure 15.4 - The Si vs. Hi plot (“membership” plot)
(Sample to Model Distance vs. Leverage; SIMCA Iris, Significance = 5.0%, Model: Iris Virginica)

This is actually all the help you can get from SIMCA classification. It is now up to
you to decide how to view samples which, for example, lie just outside the appropriate limit.
Statistically, these samples lie outside because you decided on the significance
level in advance. It is questionable, indeed unscientific, to toggle the limits after a
classification has been carried through because of a specific result that “clearly can
be improved if only I lower the significance level marginally”.

15.5.3 Si/S0 vs. Hi


Purpose
The Si/S0 vs. Hi plot is similar to the Si vs. Hi plot above, and is used the same
way. However, the values on the ordinate axis in the Si vs. Hi plot are absolute
values. In the present Si/S0 vs. Hi plot we instead use the object distance relative
to the representative average distance for the model (S0), thereby making it easier
to relate to the distance measure.

Interpretation
The plot is interpreted in exactly the same way as the Si vs. Hi plot, i.e. objects in
the lower left corner belong to the class in question within all pre-set limits.

Figure 15.5 - The relative distance-leverage plot, Si/S0 vs. Hi (same data as in Figure 15.4)
(Si/S0 vs. Leverage; SIMCA Iris, Significance = 10.0%, Model: Iris Virginica)

Figure 15.5 shows an example of this plot. It is from the same data set, which was
classified in Figure 15.4, but note that a different significance level has been used
for illustration purposes. Object no. 7, just outside the limits, is however close to
the class at approximately twice the average distance (Si/S0 = 2).

15.5.4 Model Distance


Purpose
You can use a model distance plot to visualize the distance between models, i.e. to
quantify if they are really different. A large inter-model distance indicates clearly
separated models. The model distance is found by first fitting objects from two
given classes to their own model as well as to the other model, in turn. The model
distance measures can then be calculated from pooled residual standard deviations.
The Unscrambler manual carries the appropriate formulae. The technical
intricacies we leave out for now; it is however very easy to learn how to use this
measure.

Interpretation
A useful rule-of-thumb is that a model distance greater than 3 indicates models
which are significantly different. A model distance close to 1 suggests that the two
models are virtually identical (with respect to the given data). The distance from a
model to itself is of course, by definition, 1.0. In the distance range 1-3 models
overlap to some degree.

Figure 15.6 - A model distance plot for the three IRIS classes met with before
(Model Distance (0-100) for the models Setosa, Versicolor, Virginica; SIMCA Iris, Model: Iris Versicolor)

The example in Figure 15.6 is taken from exercise 15.6 (classification of Iris
species), where the three earlier met species of Iris are classified using the four
classical X-variables. Using these variables only, it is known that two of the species
are very similar. This is also reflected in the model distance plot where the distance
from model Versicolor is shown. The distance to the first model (Setosa) is very
large (around 100), but the distance to the last model (Virginica) is small, around 3-
4, i.e. they are to some degree similar. The second bar in Figure 15.6 is the distance
to the Versicolor model itself, i.e. 1.0.
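
For orientation, one common formulation of this inter-model distance is sketched below; the exact expression used by The Unscrambler is given in its manual, and the notation (S_pq for the residual standard deviation of class p objects fitted to the model of class q) is ours.

```python
# Sketch of one common SIMCA model-distance formula; The Unscrambler's manual
# documents the exact expression it uses.
import numpy as np

def model_distance(S_pp, S_qq, S_pq, S_qp):
    """S_pp: class p objects fitted to model p; S_pq: class p objects fitted to model q; etc."""
    return np.sqrt((S_pq ** 2 + S_qp ** 2) / (S_pp ** 2 + S_qq ** 2))

# For a model compared with itself the numerator equals the denominator,
# so the distance is 1.0, in line with the rule-of-thumb above.
```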

15.5.5 Variable Discrimination Power


Purpose
A measure analogous to the above model distance but calculated from one model to
all other alternatives can be calculated for the individual variables. The
discrimination power of a variable thus gives information about its ability to
discriminate between any two models. The discrimination power is calculated from
the residuals by fitting the objects from one model to all the alternative models and
to their proper model. Again The Unscrambler manual carries technical details.

If you have a poor classification, deleting the variables with both a low
discrimination power and a low modeling power may sometimes help. The
rationale for this specific deletion is of course that variables which partake neither
in the data structure modeling nor in the inter-class discrimination are not
interesting variables - at least not in a classification perspective (they may be
otherwise interesting of course).

Interpretation
The discrimination power plot shows the discrimination power of each variable in a
given two-model comparison. A value near 1 indicates no discrimination power at
all, while a high value, i.e. >3, indicates good discrimination for a particular
variable.

Figure 15.7 - Discrimination power plot for the four classical IRIS-variables
(Discrimination Power for the X-variables Sepal L, Sepal W, Petal L, Petal W;
SIMCA Iris, Data onto model: Iris Setosa onto Iris Versicolor)

The example in Figure 15.7 again shows a plot from the IRIS species classification.
Here the data from model Iris Setosa are projected onto model Iris Versicolor. The
diagram thus shows which variables are the most important in distinguishing
between these two species models. All the variables have a discrimination power
larger than 3 and all are therefore useful in the overall classification.

15.5.6 Modeling Power


Purpose
The modeling power is used to quantify the relevance of a particular variable in the
modeling of the individual models. The modeling power tells you how much of the
variable’s variance is used to describe any particular model.

The modeling power can thus be a useful tool for improving an individual class
model. Even with careful variable selection, some variables may still contain little
or no information about the specific class properties. Thus these variables may
have a different variation pattern from the others, and consequently, they may cause
the model to deteriorate. Different variables may show different modeling power in
different models however, so one must always keep a strict perspective with respect
to the overall classification objective(s) when dealing with a multi-class problem.

Variables with a large modeling power have a large influence on the model. If the
modeling power is low, i.e. below 0.3, the variable may make the model worse and
therefore be a candidate for deletion, particularly if its discrimination power is also
low.

Interpretation
The modeling power is always between 0 and 1. A rule-of-thumb is that variables
with a value equal to or lower than 0.3 are less important.

Figure 15.8 - A modeling power plot
(Modeling Power (0.1-0.5) for the X-variables Sepal L, Sepal W, Petal L, Petal W; SIMCA Iris, Model: Iris Setosa)

In Figure 15.8 the modeling power for the Iris Setosa model is shown. The last two
variables have a very low modeling power indeed and may therefore possibly be
deleted if our only interest lies in modeling Iris Setosa. However, in Figure 15.7,
the discrimination power for the same two variables is very high. These variables
cannot therefore be deleted if the goal is also to discriminate between these two
classes. Even with as small a multiple-class number as three, one must keep the
overall perspective crystal clear.
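
The two variable-wise diagnostics discussed above can be sketched as follows from the residual matrices of objects fitted to the class models. These are common textbook formulations, computed on the same preprocessed scale as the models; the exact expressions implemented by The Unscrambler are in its manual, and the function names are our own.

```python
# Hedged sketches of per-variable modeling power and discrimination power,
# computed from residual matrices (E) of objects fitted to class models.
import numpy as np

def modeling_power(E_own, Z_own):
    """1 - (residual std per variable) / (initial std per variable) for one class;
    Z_own is the class data on the same (preprocessed) scale as the model."""
    return 1.0 - E_own.std(axis=0, ddof=1) / Z_own.std(axis=0, ddof=1)

def discrimination_power(E_p_on_p, E_q_on_q, E_p_on_q, E_q_on_p):
    """Per-variable discrimination between models p and q; E_p_on_q is the residual
    matrix of class p objects fitted to model q, and so on."""
    numerator = (E_p_on_q ** 2).mean(axis=0) + (E_q_on_p ** 2).mean(axis=0)
    denominator = (E_p_on_p ** 2).mean(axis=0) + (E_q_on_q ** 2).mean(axis=0)
    return np.sqrt(numerator / denominator)
```

With these definitions identical models give discrimination power values near 1 and well-modeled variables give modeling power values near 1, consistent with the rules-of-thumb above.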

15.6 SIMCA-Exercise – IRIS Classification


Purpose
The data to be classified in this exercise were taken from the classical paper by Sir
Ronald Fisher. The task is to see whether these three different species of the iris
flowers can be classified by the four taxonomic measurements made on them; the
length and width of the sepals and petals respectively. This is a wonderful classic
data analysis standard data set, which was also used in the original SIMCA paper
by Wold (1976), as well as in a host of similar classification method papers. ANY
(new) classification method MUST compare its merits against the Fisher IRIS data.

Data Set
The data table is stored in the file IRIS and contains 75 training (calibration)
samples and 75 test samples – here we have a well-balanced and completely
satisfactory test data set.

The training samples are a priori divided into three training data sets, each
containing 25 samples. These three sets are Setosa, Versicolor, and Virginica. The
sample set Testing is used to test the efficiency of the established classification.

Four traditional taxonomic variables are measured: Sepal length, Sepal width, Petal
length, and Petal width. The measurements are given in centimeters.

1. Graphical Clustering Based on Score Plots


It is always a good idea to start a classification with an overview PCA of all
samples. If you do not know the classes in advance, this is one feasible way of
doing the clustering. In this case the calibration samples are already assigned to the
three different classes.

Task
Add a new variable to the data table so that the three classes can be identified on
PCA plots, then make a PCA model of all calibration samples.

How To Do It
Open the file IRIS. Mark the first column in the table, then choose Edit –
Insert – Category Variable. Enter a name for the category variable, e.g.
Class. In the Method frame, select “I want my levels to be based on a collection
of sample sets”, then click Next. Move the three sets Setosa, Versicolor, and
Virginica from the “Available Sets” to the “Selected Sets” by selecting them
and clicking Add. Click Finish. You are now back in the data table and you can
see the new “Class” variable in column 1. Its name is written in blue, to show
that this is a special type of variable, used for labeling purposes only.

Make a PCA model with the following parameters:

Samples: Training          Variables: Measurements
Weighting: 1/SDev          Validation: Leverage correction
Number of PCs: 4

We assume that you are thoroughly familiar with making PCA models by now.
Refer to one of the previous exercises if needed.

Why do we auto-scale these data? Could we make do without? Let us assume
you could not answer this last question. One would naturally be inclined to
simply do both alternative classifications. Now, what if they showed (radically)
different results? Which would be termed “better”? Why? One must always
decide on performance criteria before carrying out comparative multivariate
data modeling.

Note that there are few outlier warnings and most of the variance is explained
by three PCs. Click View to look at the modeling results.

Activate the residual variance plot and select Plot – Variances and RMSEP.
Remove the number in the variables field so that only the total variance is
displayed, select only the validated variance in the samples field and (if
necessary) change the plot from residual to explained variance (View –
Source - Explained Variance).

We see that two to three PCs are enough to describe most of the variation
present.

Activate the score plot and select Edit - Options. Select the Sample
Grouping tab; enable sample grouping, separate with Colors and select Value
of Variable in the Group By field. Make sure Levelled Variable 1 is selected.

Note that there are three classes in the data; one very distinct (Setosa) and two
that are not so well separated (Versicolor and Virginica). The score plot
indicates that it may be difficult to differentiate Versicolor from Virginica.

Close the Viewer before you continue (answer No to Save).

2. Make Individual Class Models: SIMCA


Before we can classify new samples, each individual class must be described by a
separate PCA model. These models should be made independently of each other.

This means that the number of components must be determined for each model
individually, outliers found and removed separately etc.

Task
Make individual PCA models for the three classes Setosa, Versicolor, and
Virginica.

How To Do It
Select Task - PCA and make a model with the following parameters:

Samples: Setosa            Variables: Measurements
Weighting: 1/SDev          Validation: Leverage correction
Number of PCs: 4

Repeat the modeling using the sample sets Versicolor and Virginica. Name each
model after the sample sets (close the viewer and save the model after each
calculation).

The program suggests three PCs as the optimal for all models. Here we overrule
this and suggest using one PC for all models. The data contains only four
variables and two of the residual variance plots have a “break” at PC1, which
indicates that one PC may be enough.

Close the open Viewers before you continue.

3. Classify Unknown Samples – Classify the Test Data Set


When the different class models are made and new samples are collected (the test
data set), it is time to assign them to the known classes.

Task
Assign the sample set Testing to the classes Setosa, Versicolor, and Virginica.

How To Do It
Select Task - Classify. Use the following parameters:
Samples: Testing Variables: Measurements

Make sure that Centered Models is checked. Add the three models Setosa,
Versicolor, and Virginica. Mark each model and change the number of PCs to
use from three to one.

Click OK to start the classification.

4. Interpretation of Classification Results


The classification results are displayed directly in a table, but you may also
investigate the classification model closer by some plots.

Tasks
Interpret the classification results in suitable plots.
Look at the Cooman’s and Si vs. Hi plots.

How To Do It
Click View when the classification is finished.

A table plot is displayed where the samples with a star in the column of a model
are classified to the corresponding model. These samples are within the limits as
defined by the significance level chosen and the leverage limit.

The significance level can be toggled using the Significance Level field in the
toolbar. We see that all samples are “recognized” by the correct class model.
However, some samples are indeed classified as belonging to two classes
simultaneously.

If a sample is doubly classified, you should study both Si (sample-to-model
distance) and Hi (leverage) to find the best fit (see below); at similar Si levels,
the sample is probably closest to the model for which it has the smallest Hi.

The classification results are well displayed in the Cooman’s plot. Select Plot-
Classification and choose the Cooman’s plot for models Setosa and
Versicolor.

This plot displays the sample-to-model distance for each sample to two models.
The new test set samples are displayed in green color (if you have chosen a
white background for your plots), while the calibration samples for the two
models are displayed in magenta and blue. Yellow, magenta and cyan are used if
the background is black. It may in many instances be advantageous to remove
most, if not all, of the naming annotations for the individual objects; this often
leads to “information overkill”, as we have chosen to illustrate below. In very
many cases the complete object name “information” is highly redundant and
unwanted in these types of plots. It actually would have been enough to use the
simple running number object-IDs.

Figure 15.9

The Cooman’s plot for classes Setosa and Versicolor nevertheless shows that all
Setosa samples are classified uniquely as belonging to the Setosa model only.
All Setosa samples are located to the left of the vertical line indicating
membership. We also see that almost all the Versicolor samples are also
classified correctly. Nonetheless, it seems like some of the Virginica samples
are also classified as belonging to this model. We also have to look at the
distance from the model center to the projected location of the sample, i.e. the
leverage.

This is done in the Si vs. Hi plot. Select Plot - Classification and choose Si
vs. Hi for model Versicolor.

Some Virginica samples are indeed classified as belonging to the class
Versicolor, but most samples that are not Versicolor are found outside the lower
left quadrant. The reason for the apparently partly ambiguous classification
between Versicolor and Virginica is that these models are in fact partly
overlapping, as was already observed in the pertinent score plots. They are very
similar with respect to the width and length of the sepal and petal.

We can illustrate this by focusing on a specific sample: no. 12 (vers23). Locate
this sample on the Si vs. Hi plot for model Versicolor (you may use Edit –
Options – Markers Layout: Number to find it more easily). Mark sample
12, then use Window – Copy To - 2. Click on the lower window (the empty
one) then plot Si vs. Hi for model Virginica. Sample 12 is marked there as well.
Click on this sample so as to read its abscissa (leverage Hi) and ordinate
(distance Si). Compare the Si values for sample 12 between model Versicolor
and model Virginica: which model is closest?

Figure 15.10

5. Diagnosing the Classification Model


In addition to the Coomans and Si vs. Hi plots, there are three more plots that give
us information regarding the classification results.

Task
Look at the model-to-model distance, discrimination power, and modeling power
plots.

How To Do It
Select Plot - Classification and choose the Model Distance plot for the
Versicolor model (you may double-click on the miniature screen in the dialog
box so that your plot uses the full window). If necessary, use Edit – Options,
Plot Layout: Bars.

This plot compares different models. A distance larger than 3 indicates a good
class separation. The models are then sufficiently different for most practical
classification and discrimination purposes.

It is clear from this plot that the Setosa model is very different from the
Versicolor, while the distance to Virginica is small. It is barely over three.

Figure 15.11

Select Plot - Classification and choose the discrimination power for
Versicolor projected onto the Virginica class.

This plot tells which of the variables describe the difference between the two
models well. A rule-of-thumb says that a discrimination power larger than three
indicates a good discrimination. The overall discrimination power can be
increased by deleting variables with a particularly low discrimination power, if
they also have low modeling power.

Figure 15.12

The plot above tells us that all variables have low discrimination power. This
tells us that none of the measured variables are very helpful in describing the
difference between these two types of IRIS, which is another indication of their
partly overlapping nature.

Select Plot - Classification and choose the modeling power for Versicolor.

Variables with a modeling power near a value of one are important for the
model. A rule-of-thumb says that variables with modeling power less than 0.3
are of little importance for the model.

Figure 15.13

The plot tells us that all variables have a modeling power larger than 0.3, which
means that all variables are important for describing the model. If not, we might
have wanted to remove the unimportant ones (on the basis of their modeling
power), but we must be very careful in such an isolated venture. It is much
better to use the aggregate information pertaining to both the modeling as well
as the discriminating power for all variables involved.

Summary of SIMCA-Modeling of the IRIS Data


Careful SIMCA-modeling of the entire IRIS data set will lead to important
experience regarding pattern recognition in general, SIMCA in particular.

Note that we have used the test data set strictly for classification in order to get a
feeling for the classification efficiency. One might instead perform a similar,
independent SIMCA analysis in its own right on the test data set, assuming the
internal three-fold data structure is the same, i.e. that the first 25 samples are
known to belong to the Setosa species, etc. as for the first training data set, and
compare these two independent IRIS classifications.

How to compare two alternative classifications

Alternatively these two data sets may be pooled, assuming correct class-
membership for the second data set, and a new SIMCA-modeling may be
performed on the pooled data set.

In fact, it is entirely possible to use the test set in complete accordance with the
methodology of test set validation, which was described for regression prediction
assessment above.

How would you set up, and perform, a classification test set validation?

We leave these interesting new tasks to the reader’s discretion to develop. Good
luck!

16. Introduction to Experimental Design

Carefully selected samples increase the chances of extracting useful information
from your data. When you have the possibility to actively perturb your system
(experiment with the variables), these chances become even greater. The critical part
is to decide which variables to change, the intervals for this variation, and the pattern
of the experimental points.

We want to investigate a phenomenon, create a new product, improve an existing
one or optimize a process. Whatever our specific objective, we know that it can only
be achieved by performing experiments. We need to run experiments in order to gain
knowledge about the way things work in our product, our recipe or our process,
because we cannot rely on any theoretical model to tell us what happens when our
twenty ingredients are mixed, stirred, heated up then cooled down.

Experiments are usually costly and time-consuming; therefore, we are interested in
minimizing the total number of experiments we perform, while ensuring that each
single experiment gives us as much value for our money as possible.

16.1 Experimental Design


Experimental design is a large area to cover. This section must be seen as a light
coverage of different aspects you should be aware of when using experimental
design and The Unscrambler.

Why is Experimental Design Useful?


Statistical experimental design is a powerful technique for making efficient
experiments. Instead of varying one variable at a time and keeping the rest constant
(which is the traditional way of making experiments), you vary many factors
simultaneously in a systematic and smart way using the concept of factorial designs.
The purposes of experimental designs are:

• Efficiency - get more information from fewer experiments


• Focusing - collect the information you really need

The Ad Hoc Approach


Many people make one experiment, and based on the outcome use their experience
and decide what to do next. Sometimes they are lucky and get the results they want
pretty soon, but usually they must continue with more experiments.

There are three major problems with this approach. First, people who apply it will
rarely understand how their system really works, so it may be difficult to transfer
the knowledge to a new application. Second, since there is usually some amount of
variability in the outcome of each experiment, interpreting the results of just two
successive experiments can be misleading, because a difference due to chance alone
can be mistaken for a true, informative difference. Lastly, there is also a large risk
that their solution is not optimal. In some situations this does not matter, but if they
want to find a solution close to optimum, or understand the application better, an
alternative strategy is recommended.

The Traditional Approach - Vary One Variable at A Time


Most people have learnt that varying only one variable at a time is the safest way to
control experimental conditions so as to best interpret the results from their
experiments. This is often a delusion.

Let us show this with a simple example from an investigation of the conditions of
bread baking:

We list the variables, such as ingredients and process parameters, that may have an
influence on the volume of the bread, then study each of them separately. Let us say
that the input parameters we wish to study are the following: type of yeast, amount
of yeast, resting time and resting temperature.
First, you set type of yeast, amount of yeast and resting temperature to arbitrary
values (for instance, those you are most used to: e.g. the traditional yeast, 15g per kg
of flour, with a resting temperature of 37 degrees); then you study how bread volume
varies with time, by changing resting time from 30 to 60 minutes. A rough graph
drawn through the points leads to the conclusion that under these fixed conditions,
the best resting time is 40 minutes, which gives a volume of 52 cl for 100g of dough
(Figure 16.1).

Figure 16.1 - Volume vs. Resting Time. First set of experiments, with the traditional
yeast, 15g per kg of flour, and a temperature of 37 degrees
(Volume vs. Resting time, with the maximum at 40 minutes)

Then you can start working on the amount of yeast, with resting time set at its “best”
value (40 minutes) and the other settings unchanged, while changing the amount of
yeast from 12 to 20g as in Figure 16.3. At this “best” value of resting time, the best
value of the amount of yeast is not far from the 15g used in the first series of runs,
giving a volume of about 52 cl. Now the conclusion might seem justified that an
overall maximum volume is achieved with the conditions “amount of yeast = 15g,
resting time = 40 minutes”.

Figure 16.3 - Volume vs. Yeast. Second set of experiments, with the traditional
yeast, a resting time of 40 minutes and a resting temperature of 37 degrees
(Volume vs. Yeast, with the maximum at about 15g/kg)

The graphs show that, if either amount of yeast or resting time is individually
increased or decreased from these conditions, volume will be reduced. But they do
not reveal what would happen if these variables were changed together, instead of
individually!

To understand the possible nature of the synergy, or interaction, between the amount
of yeast and resting time, you may study the contour plot below, see Figure 16.5,
which shows how bread volume varies for any combination of amount of yeast and
resting time within the investigated ranges. It corresponds to the two individual
graphs above. However, if the contour plot represents the true relationship between
volume and yeast and time, the actual maximum volume will be about 61 cl, not 52
cl! A volume of 52 cl would also be achieved at for example 50 minutes and 18g/kg,
which is quite different from the conditions found by the One-Variable-at-a Time
method. The maximum volume of 61 cl is achieved at 45 minutes and 16.5g/kg.

Figure 16.5 - Optimum in Response Surface. Contour plot showing the values of
bread volume for all possible combinations of resting time and amount of yeast.
(Yeast (12-18 g/kg) vs. Resting time (30-50 minutes), with volume contours from 30 to 60 cl and the actual optimum marked)

Figure 16.5 illustrates that the One Variable at a Time strategy very often fails
because it assumes that finding the optimum value for one variable is independent
from the level of the other. Usually this is not true.

The trouble with that approach is that nothing guarantees that the optimal amount of
yeast is unchanged when you modify resting time. On the contrary, it is generally the
case that the influence of one input parameter may change when the others vary: this
phenomenon is called an interaction. For instance, if you make a sports drink that
contains both sugar and salt, the perceived sweetness does not only depend on how
much sugar it contains, but also on the amount of salt. This is because salt interacts
with the perception of sweetness as a function of sugar level.

To summarize, with the classical approach we have the following problems:

• We are likely to miss interactions between two input variables.
• We cannot distinguish random variations from true effects.
• We cannot predict what would happen for an experiment we have not run.
• We do not know in advance how many experiments we will need to achieve our
goal.

The Alternative Approach


There is a flaw in the classical approach. It lies in the assumption that causal effects
can only be proven if the potential causes are investigated separately. Experimental
design is based upon a mathematical theory which makes it possible to investigate
all potential causes together and still draw safe conclusions about all individual
effects, independently from each other. To make things even better, this
mathematical foundation also ensures that the impact of the experimental error on
the final results is minimal, provided that all experimental results are interpreted
together (and not sequentially as in the classical approach).

In short, experimental design has the following advantages:

• We know exactly how many experiments we will need to get the information we
want
• The individual effects of each potential cause, and the way these causes interact,
can be studied independently from each other from a single set of designed
experiments
• We analyze the results with a model which enables us to predict what would
happen for any experiment within a given range
• We can conclude about the significance of the observed effects, that is to say,
distinguish true effects from random variations

The successive steps of building a new design and interpreting its results are listed
hereafter.

Experimental Design in Practice


Here is how you design your experiments:

1. Define which output variables you want to study (we call them responses). You
will measure their values for each experiment.
2. Define which input variables you want to investigate (we call them design
variables). You will choose and control their values.
3. For each design variable, define a range of variation or a list of the levels you
wish to investigate.
4. Define how much information you want to gain. The alternatives are:
a- find out which variables are the most important (out of many)
b- study the individual effects and interactions of a rather small number of
design variables
c- find the optimum values of a small number of design variables.
5. Choose the type of design which achieves your objective in the most economical
way.

The various types of designs to choose from are detailed in the next sections.

Here is how you analyze your experimental results:


1. Define which model is compatible with your objective. The alternatives are:
a- to find out which variables are the most important: a linear model (studies
main effects)
b- to study individual effects and interactions: a linear model with interaction
effects
c- to find an optimum: a quadratic model (includes main, interaction and
square effects).
2. Compute the observed effects based on the chosen model, and conclude on their
significance.
3. Interpret the significant effects and use this information to reach your goal.

The Concept of Factorial Designs


In factorial designs, you investigate a precisely defined experimental region to see
whether a change in an experimental parameter (a design variable) makes the
responses change - has an effect on the responses. For example, will changing an
ingredient or process parameter make the outcome of the experiment (product
quality) change?

Each design variable is studied at only a few levels, usually two: it varies from a low
to a high level. You investigate different combinations, for example low temperature
(5 degrees) and low time (5 minutes), high temperature (75 degrees) and high time
(20 minutes), low temperature (5 degrees) and high time (20 minutes), and vice
versa.

The Concept of Full Factorial Designs


Let us start with Full Factorial Designs, which are easy to illustrate. Consider three
parameters: the amounts of salt, sugar, and lemon. We vary each of them from a low
(-) to a high (+) level and make all combinations:

Figure 16.7 - Full Factorial Design
(cube diagram with axes X1, X2, X3 and corners from (- - -) to (+ + +), alongside the design table)

run #  Salt  Sugar  Lemon
1      -     -      -
2      -     -      +
3      -     +      -
4      -     +      +
5      +     -      -
6      +     -      +
7      +     +      -
8      +     +      +

Each point in the cube in Figure 16.7 is an experiment. Three design variables varied
at 2 levels give 2³ = 8 experiments using all combinations. The table shows the low
(-) and the high (+) settings in each experimental run in a systematic order - standard
order.
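
If you ever need to write such a design out by hand, the full set of runs in the order of Figure 16.7 can be generated with a few lines of Python; this is only a sketch, since the program of course builds the design for you.

```python
# Sketch: generate the 2^3 full factorial runs of Figure 16.7 with itertools.
from itertools import product

factors = ["Salt", "Sugar", "Lemon"]
runs = list(product([-1, +1], repeat=len(factors)))   # 2**3 = 8 combinations
for run_no, levels in enumerate(runs, start=1):
    print(run_no, dict(zip(factors, levels)))
```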

In the same way,

2 variables at 2 levels will give 2² = 4 experiments
3 variables at 2 levels will give 2³ = 8 experiments
4 variables at 2 levels will give 2⁴ = 16 experiments
5 variables at 2 levels will give 2⁵ = 32 experiments
6 variables at 2 levels will give 2⁶ = 64 experiments
7 variables at 2 levels will give 2⁷ = 128 experiments
8 variables at 2 levels will give 2⁸ = 256 experiments

As you can see, the number of experiments increases dramatically when there are
many design variables. The advantage of Full Factorial Designs is that you can
estimate the main effects of all design variables and all interaction effects. The
program generates the experimental design automatically. All you have to do is
define which design variables to use and the low and high levels.

Effects
The effects are also calculated by the program, and we will deal with that later in the
section about Analysis of Effects. However, to
understand what “effect” means and how to interpret an effect, we will go through
the definition here.

The variation in a response generated by varying a design variable from its low to its
high level is called the main effect of that design variable on that response. It is
computed as the linear variation of the response over the whole range of variation of
the design variable. There are several ways to judge the importance of a main effect,
for instance significance testing or use of a normal probability plot of effects.

Some variables need not have an important impact on a response by themselves to be
called important. The reason is that they can also be involved in an interaction.
There is an interaction between two variables when changing the level of one of
those variables modifies the effect of the second variable on the response.

Interaction effects are computed using the products of several variables (cross-
terms). There can be various orders of interaction: two-factor interactions involve
two design variables, three-factor interactions involve three of them, and so on. The
importance of an interaction can be assessed with the same tools as for main effects.

Important Variables
Design variables that have an important main effect are important variables.
Variables that participate in an important interaction, even if their main effects are
negligible, are also important variables.

Effects - Definition and Calculation


Consider a simple bread baking experiment where we investigate the bread volume
by trying different combinations of low (35°) and high (37°) temperature and yeast
type C1 or C2. The measured values of Volume for the 4 different combinations of
the design variables (the 4 experiments) are written inside the cube in Figure 16.9.

Figure 16.9 - Calculation of Effects
(Volume measured at each combination of Temperature and Type of yeast)

Type of yeast     35°C    37°C
C2                 40      60
C1                 30      20

A main effect reflects the effect on the response variable of a change in a given
design variable, while all other design variables are kept at their mean value:

Main effect of Temperature = Mean response at high Temperature - Mean response at low Temperature

Effects are Changes

Figure 16.11 - Volume vs. Yeast and Temperature
(two plots of Volume (10-60): one vs. Temperature (35°C, 37°C) and one vs. Yeast (C1, C2))

The Main Effect of Temperature


Volume increases from 35 (=(40+30)/2) to 40 (=(60+20)/2) when we change the
temperature from 35 to 37 degrees. Thus the main effect of Temperature on Volume
is +5.

Figure 16.13 - Effect of Temperature
(average Volume rises from 35 at 35°C to 40 at 37°C; Effect: +5, Average Volume: 35 ⇒ 40)

The Main Effect of Yeast


Volume increases from 25 (=(30+20)/2) to 50 (=(40+60)/2) when we change the
yeast type from C1 to C2. Thus the main effect of Yeast on Volume is +25.

Figure 16.15 - Effect of Yeast
(average Volume rises from 25 with yeast C1 to 50 with yeast C2; Effect: +25)
An interaction reflects how much the effect of a first design variable changes when
you shift a second design variable from its average value to its high level (which
amounts to the same as shifting it halfway between low and high):

Interaction = (1/2) × (Effect of design variable A at the high level of B − Effect of design variable A at the low level of B)

The interaction effect between yeast and temperature


The Volume increases by +20 (= 60 - 40) when we change the temperature from 35
to 37 degrees if we use yeast C2.

But Volume decreases by 10 (-10 = 20 - 30) when we change the temperature from
35 to 37 degrees if we use yeast C1.

Figure 16.17 - Interaction Effect
(effect of temperature for the different yeasts: +20 with C2, -10 with C1)

That is, we get an increase with one yeast, but a decrease with another; the effect of
temperature depends on which yeast we use. So the interaction effect is
(1/2)*(20 - (-10)) = 30/2 = +15.

Interaction - the effect of one variable depends on the level of another variable, as
illustrated in Figure 16.19.

Figure 16.19 - Interaction Effect
(Volume vs. Temperature (35°C, 37°C), with one line for yeast C1 and one for C2; the lines have opposite slopes)
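
The whole bread example can be condensed into a few lines of Python; this is only a check of the arithmetic above, using the four Volume values from Figure 16.9.

```python
# The four measured volumes from Figure 16.9, one per combination of yeast and temperature.
v_C1_35, v_C1_37, v_C2_35, v_C2_37 = 30, 20, 40, 60

main_temp   = (v_C1_37 + v_C2_37) / 2 - (v_C1_35 + v_C2_35) / 2   # 40 - 35 = +5
main_yeast  = (v_C2_35 + v_C2_37) / 2 - (v_C1_35 + v_C1_37) / 2   # 50 - 25 = +25
interaction = 0.5 * ((v_C2_37 - v_C2_35) - (v_C1_37 - v_C1_35))   # 0.5*(20 - (-10)) = +15
print(main_temp, main_yeast, interaction)
```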

How to Calculate Effects When You Have Many Experiments

If we have three variables, varied using a Full Factorial Design like in Figure 16.7, you
get Table 16.1 with the settings of the design variables (e.g. amounts of Salt, Sugar,
and Lemon). You can also include the computed interactions (between Salt and
Sugar: Salt*Sugar, between Salt and Lemon: Salt*Lemon, and so on) and the
measured response variables (here only one, e.g. Sweetness).

Table 16.1
RUN Salt Sugar Lemon Salt*Sugar Salt*Lemon Sugar*Lemon Sugar*Salt*Lemon Sweet

1 - - - + + + - 1.7
5 - - + + - - + 4.5
3 - + - - + - + 5.2
7 - + + - - + - 7.2
2 + - - - - + + 3.5
6 + - + - + - - 2.1
4 + + - + - - - 2.8
8 + + + + + + + 4.8

The main effect of Salt on response Sweetness in the table above is -1.35.

Calculation:
The main effect of Salt on Sweetness =
(3.5 + 2.1 + 2.8 + 4.8)/4 - (1.7 + 4.5 + 5.2 + 7.2)/4 = 3.3 - 4.65 = -1.35

Interpretation: This means that by increasing Salt from its low to its high level, the
response Sweetness will decrease by 1.35 units.
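
The same contrast can be checked numerically from Table 16.1; the small sketch below simply enters the Salt column and the Sweetness values in the order the rows are listed.

```python
# Check of the Salt main effect from Table 16.1.
import numpy as np

salt  = np.array([-1, -1, -1, -1, +1, +1, +1, +1])    # Salt column, rows as listed
sweet = np.array([1.7, 4.5, 5.2, 7.2, 3.5, 2.1, 2.8, 4.8])
effect_salt = sweet[salt == +1].mean() - sweet[salt == -1].mean()
print(round(effect_salt, 2))   # -1.35
```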

Calculation of Effects by Using Regression


The effects may also be found by fitting the experimental data to the common
regression equation

Equation 16.1: y = b0 + b1 x1 + b2 x2 + … + bn xn + b12 x1*x2 + …

i.e. by finding the regression coefficients bi. This can be done using several methods.
MLR is the most usual, but PLS or PCR may also be used. If we have three design
variables and want to investigate one response, the following expression will be used:

Response = b0 + b1 A + b2 B + b3 C + b4 AB + b5 AC + b6 BC + b7 ABC

Mean Effect
The mean effect is the average response of all the experiments and equals b0 in the
regression equation. In ANOVA it is simply the average response value.

Main Effects
The main effect of variable A is an average of the observed difference of the
response when A is varied from the low to the high level. The estimated effect
equals twice the b-coefficient for variable A in the regression equation, and so on.

Interaction Effects
An interaction effect AB means that the influence of changing variable A will
depend on the setting of variable B. This is analyzed by comparing the effects of A
when B is at different levels. If these effects are equal, then there is no interaction
effect AB. If they are different, then there is an interaction effect. Estimated
interaction effects again equal twice the corresponding b-coefficients.
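
As a sketch of this regression route (here with plain least squares in numpy rather than the MLR, PCR or PLS machinery of the program), the Table 16.1 data can be fitted to the full model with interactions; for this coded (-1/+1) design the estimated effects are then twice the b-coefficients, as stated above.

```python
# Least-squares fit of the interaction model to the Table 16.1 data;
# effects = 2 * regression coefficients for this coded (-1/+1) design.
import numpy as np

salt  = np.array([-1, -1, -1, -1, +1, +1, +1, +1])
sugar = np.array([-1, -1, +1, +1, -1, -1, +1, +1])
lemon = np.array([-1, +1, -1, +1, -1, +1, -1, +1])
sweet = np.array([1.7, 4.5, 5.2, 7.2, 3.5, 2.1, 2.8, 4.8])

X = np.column_stack([np.ones(8), salt, sugar, lemon,
                     salt * sugar, salt * lemon, sugar * lemon, salt * sugar * lemon])
b, *_ = np.linalg.lstsq(X, sweet, rcond=None)
effects = 2 * b[1:]               # main and interaction effects
print(round(effects[0], 2))       # Salt main effect: -1.35
```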

The Concept of Fractional Designs


Since Full Factorial Designs require so many experiments when you have many
design variables, we need a more economical alternative. It is not necessary to
perform all combinations to find out about the main effects. Often several of the
chosen design variables have no effect at all on the responses, so that we can use the
same number of experiments more efficiently. By choosing a smart subset (a
fraction), we can actually make much fewer experiments and still estimate both main
effects and interactions.

Figure 16.21 - Fractional Design
(half of the cube's corners are selected, alongside the design table)

X1       X2      X3
Sugar    Salt    Sugar*Salt
-        -       +
-        +       -
+        -       -
+        +       +

The smart subset of combinations of three design variables gives us only 2³⁻¹ = 4
experiments. This is therefore called the half-fraction of a Full Factorial Design, or a
Fractional Factorial Design with a degree of fractionality of one.

The Smart Subset is Found as Follows:


You set up all combinations of low and high levels for the first two design variables
(X1 and X2, or named Sugar and Salt), and code them with a plus (+) or minus (-)
sign. Then you find the sign coding for the third variable, X3 (Lemon), by
multiplying the signs for the first two. For instance, in the first experiment, you
multiply a negative sign for X1 with a negative sign for X2 to find the sign for X3: a
positive sign. Of course, you do not have to do this manually, the program will do it
for you.

In the table, you can see that the sign for variable X3 is the same as for the
interaction between X1 and X2 - which is X1 times X2 (X1* X2).
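
The construction of this smart subset can be sketched in a few lines of Python (again, the program does this for you):

```python
# Sketch: full factorial in X1 and X2, with the generator X3 = X1*X2.
from itertools import product

half_fraction = [(x1, x2, x1 * x2) for x1, x2 in product([-1, +1], repeat=2)]
for run in half_fraction:
    print(run)   # (-1, -1, +1), (-1, +1, -1), (+1, -1, -1), (+1, +1, +1)
```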

Confounding
The price to be paid for performing fewer experiments is called confounding, which
means that sometimes you cannot tell whether variation in a response is caused by,
for instance, Sugar, or the interaction between Salt and Lemon. The reason is that
you may not be able to study all the main effects and all the interactions for all the
design variables if you do not use the full factorial set of experiments. This happens
because of the way those fractions are built: some of the resources that would
otherwise have been devoted to the study of interactions are instead used to study
the main effects of more variables.

This side effect of some fractional designs is called confounding. Confounding
means that some effects cannot be studied independently of each other.

The type of design is described by its resolution, which is a formalized way of
defining how severe the confounding is. However, all you have to pay attention to is
the number of experiments to run and the confounding pattern.

The list of confounding patterns in the program shows which effects can be
estimated with the current number of experimental runs. For instance, A=BC means
that the main effect of A will be mixed up (confounded) with the interaction (BC)
between B and C.

What is the Consequence of Confounding?


Effects may be mixed up with more than one interaction, and interactions may be
mixed up with each other. Two-variable interactions (e.g. BC) are often important -
at least in advanced screening designs, but three-variable interactions (e.g. ABC) are
usually negligible and can be disregarded. If a main effect is confounded with a
three-variable interaction or higher, you can usually just pretend that that
confounding does not exist!

If you are interested in the interactions themselves and use a design where two-
variable interactions are confounded with each other, you will only be able to detect
whether some of them are important, but not tell for sure which are the important
ones. For instance, if AD (confounded with BC, “AD=BC”) turns out as significant,
you will not know whether AD or BC (or a combination of both) is responsible for
the observed change in the response.

However, if you use a well-planned sequential strategy, confounding is not an
insurmountable problem. It is quite OK to use a design with some amount of
confounding as the first step in your investigations. This way, you will at least be
able to prove that some of the variables have an effect! Afterwards, you will always
have the possibility to make a few extra experiments to find out which is which.

Other Fractional Designs


There are many alternative fractional designs, like:
3 variables at 2 levels may give 2³⁻¹ = 4 experiments
4 variables at 2 levels may give 2⁴⁻¹ = 8 experiments
7 variables at 2 levels may give 2⁷⁻³ = 16 experiments
9 variables at 2 levels may give 2⁹⁻⁴ = 32 experiments

You can see that despite the large number of variables we can still keep the number
of experiments on a manageable level by using Fractional Factorial Designs.

An alternative to Fractional Factorial Designs is Plackett-Burman designs. They can
only be used for the first screening of many variables when you are not interested in
interactions.

16.2 Screening Designs


When you start a new project, there is usually a large number of potentially
important variables. At that stage, the aim of the experimentation is to find out
which are the most important variables. This is achieved by including many
variables in the design, and roughly estimating the effect of each design variable on
the responses. The variables which have “large” effects can be considered as
important.

Design variables that have an important main effect are important variables.
Variables that participate in an important interaction, even if their main effects are
negligible, are also important variables.

16.2.1 Full Factorial Designs


Full Factorial Designs are used for screening purposes, in the following cases:
1. You want to study all main effects of the individual variables, and all
interactions between any combination of two design variables, independently
from each other;
2. Some of your design variables have more than two levels.

Note!
Conditions 1) and 2) can apply separately, or together. Note that in case
1), there may be other valid designs than a full factorial, whereas in case
2), no other type of design can be built.

Building the Design


A Full Factorial Design is defined as a set of experiments which combines all levels
of all design variables.

Example: A Full Factorial Design investigating the effects of variables


Temperature (28 to 33°C) and Mixer type (X or Y) consists of four experiments,
listed hereafter.
Exp. 1: Temperature=28 Mixer X
Exp. 2: Temperature=33 Mixer X
Exp. 3: Temperature=28 Mixer Y
Exp. 4: Temperature=33 Mixer Y
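
For readers who like to see the mechanics, the same enumeration can be sketched in a
few lines of Python. This is only an illustration under our own assumptions (variable
names, use of Python); The Unscrambler builds the table for you:

    from itertools import product

    # Levels of each design variable (taken from the example above).
    levels = {
        "Temperature": [28, 33],   # degrees C
        "Mixer type": ["X", "Y"],
    }

    # A full factorial design = every combination of all levels.
    design = list(product(*levels.values()))
    for i, run in enumerate(design, start=1):
        settings = ", ".join(f"{name}={value}" for name, value in zip(levels, run))
        print(f"Exp. {i}: {settings}")

    # The number of experiments is the product of the numbers of levels (here 2 x 2 = 4).
    print("Total experiments:", len(design))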


Figure 16.23 - Geometrical representation of a Full Factorial Design with
3 levels * 4 levels (Type of Flour vs. Mixing Speed)

Figure 16.25 - Geometrical representation of a Full Factorial Design with 3 variables
(Temperature, % yeast, Time) studied at 2 levels each; the experiments form the
corners of a cube, from (- - -) to (+ + +)

How Many Experiments?


The design includes as many experiments as the product of the numbers of levels of
all design variables.

Example: The Full Factorial Design for 5 variables with 2 levels each includes
2x2x2x2x2 = 32 experiments.

Example: The Full Factorial Design studying the effects of variable A (3 levels),
variable B (2 levels) and variable C (5 levels) includes 3x2x5 = 30 experiments.

In addition, center samples can be included in the design whenever all design
variables have a continuous range of variation. The center samples are experiments
which combine the mid-levels of all design variables. They are useful for checking
what happens between Low and High (it may be non-linear). They are usually


replicated, i.e. the experiment is run several times, so as to check how large the
experimental error is.

Example: If you are studying the effects of variables Temperature (28 to 33°C)
and Amount of yeast (8 to 12 g), it is recommended to include a center sample,
replicated three times. These three experiments will all have
Temperature=30.5°C and Amount of yeast=10 g.

Analyzing the Results


The results from a Full Factorial Design are analyzed with a linear model with
interactions.

16.2.2 Fractional Factorial Designs


Fractional Factorial Designs are used for screening purposes, whenever:

1. You want to study the effects of a rather large number of variables (3 to 15),
with fewer experiments than a Full Factorial Design would require
2. And all your design variables have two levels (either Low and High limits of a
continuous range, or two categories, e.g. Starch A / Starch B).

Building the Design


A Fractional Factorial Design is built by pretending that one has fewer design
variables than the true number, for instance 4 instead of 5. A Full Factorial Design is
made from the reduced set of design variables. The table showing the composition of
the experiments is then modified by adding one new column for each of the design
variables which were omitted. These new columns are built in such a way that each
“new” design variable is combined with the others in a balanced way.

The important point is that the design now enables us to study 5 variables with no
more experiments than required for 4 variables, i.e. 2x2x2x2 = 16 experiments
instead of 32.

Table 16.2 and Figure 16.27 hereafter illustrate this principle in the simpler case of 3
variables studied with 4 experiments instead of 8. You can see how variable Time,
introduced in the design originally built for variables Temperature and %Yeast only,
is combined with the other two variables in a balanced way.


Table 16.2 - Symbolic representation of a Fractional Factorial Design with 3


variables
Experiment Temperature % Yeast Time
1 Low (-) Low (-) High (+)
2 Low (-) High (+) Low (-)
3 High (+) Low (-) Low (-)
4 High (+) High (+) High (+)

Figure 16.27 - Geometrical representation of a Fractional Factorial Design with
3 variables (Temperature, % yeast, Time); the four experiments occupy half of the
corners of the cube
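
The construction in Table 16.2 can be mimicked in a short Python sketch (coded -1/+1
levels; this is an illustration of the principle, not The Unscrambler's algorithm).
The third column is generated as the product of the first two, which is exactly why
the main effect of Time ends up confounded with the Temperature x %Yeast interaction:

    from itertools import product

    # Full factorial in the two "real" variables, coded -1 / +1.
    temp_yeast = list(product([-1, +1], repeat=2))

    # The extra variable Time is generated from the product column (Time = Temp * %Yeast).
    design = [(t, y, t * y) for t, y in temp_yeast]

    for t, y, time in design:
        print(f"Temperature={t:+d}  %Yeast={y:+d}  Time={time:+d}")
    # The output reproduces Table 16.2: the Time column is identical to the
    # Temperature x %Yeast interaction column, so the two cannot be separated.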

How Many Experiments?


Fractional Factorial Designs require a number of experiments which is always a
power of two (since all variables have two levels). The number of experiments has to
be larger than the number of variables, and smaller than the number included in the
corresponding full factorial.

Example: If you want to investigate the effects of 6 design variables, there are
three possible Fractional Factorial Designs, including respectively 8, 16 and 32
experiments (the full factorial requires 64 experiments).

In addition, as with full factorials, center samples can be included in the design
whenever all design variables have a continuous range of variation. The center
samples are experiments which combine the mid-levels of all design variables. They
are useful for checking what happens between Low and High (it may be non-linear).
They are usually replicated, i.e. the experiment is run several times, so as to check
how large the experimental error is.


The Price to Be Paid


Unfortunately, there is often a price to be paid for studying the same number of
variables with fewer experiments: the results will be less precise. In practice, this
means one of the following:

• Main effects cannot be distinguished from 2-variable interactions: you have
chosen to run very few experiments, and there is room for main effects only.
Such designs are said to be “Resolution III designs”. They are useful when you
need to sort out among many variables.
• Main effects can be interpreted, but 2-variable interactions cannot be
distinguished from each other. Such designs are said to be “Resolution IV
designs”. They are useful in intermediate situations where you have too many
variables to study all interactions, but want to detect whether some of the
possible interactions are important.
• Main effects and 2-variable interactions can be interpreted independently from
each other: your fractional design has so many experiments that it is almost as
good as the full factorial. Such designs are said to be “Resolution V (or higher)
designs”. They are an economical and efficient alternative to Full Factorial
Designs.

The fact that several effects cannot be separated at the interpretation stage is called
confounding. Whenever two effects are confounded with each other, they cannot
mathematically be computed separately. They appear as only one effect in the list of
significant effects. You can still see if the confounded effects are significant, but you
cannot know for sure which of the two is responsible for the observed changes in the
response values.

Which Design to Choose in Practice


As you can see, there has to be a tradeoff between the cost of the investigation and
the amount of information you will get from your experiments. Thus it is important
to choose the right type of design at the right time: it will ensure that you use your
resources optimally throughout your project.

1. First screening of many variables: you wish to find out which variables are the
most important. In practice, if you have more than 8 design variables, you need at
least 32 experiments to detect interactions (with a Resolution IV design). If you
cannot afford to run so many experiments, choose a Resolution III design. Then
only main effects will be detected. If you have more than 15 variables, no


Fractional Factorial Design is available; you will need a Plackett-Burman instead


(see hereafter).

2. Screening of a reduced number of variables: you wish to know about the main
effects and interactions of a reasonable number of variables (4 to 8). This may
either be your first stage, or follow a first screening where these variables have
been found the most important. You should select at least a Resolution IV design
(available for 4, 6, 7, 8 design variables), which does not require more than 16
experiments. If you have 5 design variables, 16 experiments will give you a
Resolution V design. If you have 6 design variables, 32 experiments will give you
a Resolution VI design (even better). If you have 4 design variables, the only way
to study all interactions is a Full Factorial (see above) with 16 experiments.

3. Last stage before you start an optimization: you have run at least one screening
before, and identified 3 to 6 most important variables. Before you build a more
complex optimization design (see hereafter), you need to identify all interactions
with certainty. With 3 or 4 design variables, the design you need is a Full
Factorial. For 5 design variables, choose the Resolution V design with 16
experiments. If you have 6 design variables, choose the Resolution VI design
with 32 experiments.

Analyzing the Results


The results from Fractional Factorial Designs are analyzed with different models
depending on the resolution.

1. For Resolution III designs, only a linear model is relevant.

2. Resolution IV designs can be analyzed with either a purely linear model, or a
linear model with interactions. In practice, it is recommended to start with the
most complete model (with interactions). If it turns out that few or none of the
interactions are significant, you will have the possibility to remove some effects
from the model and recalculate.

3. Resolution V (or higher) designs require a linear model with interactions.


16.2.3 Plackett-Burman Designs


Plackett-Burman designs are used for a first screening, whenever:

a) You want to study the effects of a very large number of variables (up to 32), with
as few experiments as possible
b) And all your design variables have two levels (either Low and High limits of a
continuous range, or two categories, e.g. Starch A / Starch B).

Principles and Properties


Plackett-Burman designs are based on a mathematical theory which makes it
possible to study the main effects of a large number (n) of design variables with no
more than n+4 experiments. The number of experiments is always a multiple of
four: 8, 12, 16, …
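
A quick Python sketch of this run-count rule (our own illustration; a design with N
runs can screen up to N - 1 variables, and N is rounded up to the nearest multiple
of four, which is never more than n + 4):

    def plackett_burman_runs(n_variables):
        """Smallest multiple of 4 giving room for n_variables main effects."""
        runs = 4
        while runs - 1 < n_variables:
            runs += 4
        return runs

    for n in (7, 8, 11, 19, 32):
        print(n, "variables ->", plackett_burman_runs(n), "runs")
    # e.g. 11 variables -> 12 runs, 32 variables -> 36 runs.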

As with factorial designs, each design variable is combined with the others in a
balanced way. Unlike Fractional Factorial Designs, however, they have complex
confounding patterns where each main effect can be confounded with an irregular
combination of several interactions. Thus there may be some doubt about the
interpretation of significant effects, and it is therefore recommended not to base final
conclusions on a Plackett-Burman design only. In practice, an investigation
conducted by means of a Plackett-Burman design should always be followed by a
more precise investigation with a Fractional Factorial Design of Resolution IV or
more.

When to Use a Plackett-Burman Design


Use this kind of design in your first screening stage whenever you have more than
15 variables to screen, or if it is essential to keep the number of experiments to an
absolute minimum.

Example: We want to investigate the effects of 11 process parameters by running


large-scale experiments on the production line. We cannot use our pilot plant
because we suspect that scale effects will make it impossible to extrapolate the
results. By using a Plackett-Burman design, we can identify the most important of
these 11 process parameters with only 12 experiments.


The other area of use is for feasibility studies, when you do not want to invest too
much money into the first set of experiments which will tell you whether you can
obtain any valuable information at all.

Example: We have listed 9 variables as having a potential influence on the


viscosity of our product. We have run experiments in the past, and never been
able to interpret their results in a clear way. We suspect that there are some
uncontrolled variations we are unaware of. We choose to start our project with a
Plackett-Burman design which will tell us, with only 12 experiments, whether we
can detect any significant effects at all.

Analyzing the Results


The results from a Plackett-Burman design are analyzed with a linear model.

16.3 Analyzing a Screening Design


The aim of screening designs like Factorial, Fractional Factorial or Plackett-Burman
is to find the significant effects, i.e. determine which design variables (and
interactions) have influence on the response.

No matter how you generated your data, you have to analyze them if you expect to
obtain information from them. After briefly introducing the logical steps in a data
analysis, this chapter focuses on the many different ways to analyze your
experimental results.

Logical Steps in Data Analysis


Before presenting the details of the methods which will extract the information you
need from the data you have just collected, we will draw the outline of a sensible
strategy for data analysis. Let us first introduce the main purposes of these stages.

Data Checks
If you have ever worked with data - and you most probably have - you will recognize
this statement as true:
A data table usually contains at least one error.

This being a fact, there is only one way to ensure that you get valid results from the
analysis of experimental data: detect the error(s) and correct them! The sooner you
do this in your analysis, the better. Imagine going through your whole sequence of


successive analyses, producing your final results, and realizing that these results do
not make sense. Or, even worse: not realizing anything, and presenting your results -
your wrong results! To avoid the awkward and dangerous consequences of such a
situation, there is one recipe and only one: include error detection as the first step in
a data analysis.

The first step in a series of successive analyses consists in producing a simple


summary of the data, where errors of large magnitude will be visible. Furthermore,
the interpretation of the results of each new analysis will start by studying a few
diagnostic plots. These diagnostics tell you whether any samples or variables have
unusual or suspicious behaviors, so that errors of a lesser magnitude can also be
detected (by the disturbances they introduce in the consistency of the results) and
corrected before a more thorough result interpretation.

Descriptive Analysis
No matter what the ultimate objective of your data analysis is, for instance
predicting fat content or understanding process malfunction, you will increase your
chances of reaching that objective by starting with a descriptive phase.

Descriptive methods for data analysis are tools which give you a feeling for your
data. In short, they replace dry numbers (your raw data) with something which
appeals to your imagination and intuition: simple plots, striking features.

Once you have in a way “revealed” what was hidden in your data, you can digest it
and transform it into information. It is also your duty to have a critical view of this
newly extracted information: compare the structures or features you have just
discovered, with your a priori expectations. If they do not match, it means that either
your hypotheses were wrong, or there is an error in the data which generates
abnormal features in the results. Thus descriptive analysis is also a powerful tool for
data checking and error detection.

Inferential Analysis
Whenever you are drawing general conclusions from a selection of observations, you
are making inferences. For instance, using the results from designed experiments to
determine which input variables have a significant influence on your responses, is a
typical case of inferential analysis, called significance testing.

Once you have cleaned up your data and revealed their information content, it is
time to start making inferences. Remember that experimental design is the only way


to prove the existence of a causal relationship between two facts! Which means that,
in practice, we will use inferential analysis mostly as a stage in the analysis of
designed data. Non-designed data will be analyzed with descriptive methods, and
predictive techniques if our objective requires it. Read what follows to understand
the differences between the two approaches.

Predictive Analysis
While we use inference to increase our knowledge, i.e. build up new rules which we
will then apply to reach a goal, predictive methods can help us obtain immediate
practical benefits. Let us illustrate the difference through an example: inferential
analysis may tell us that a certain process variable has a significant effect on, say,
fat content; a predictive model goes one step further and predicts the fat content
itself for any new combination of settings, so it can be used directly in routine
operation.

Analysis of Screening Experiments


The variables taken into account in the data checks and descriptive stages are
response variables only; the values of the design variables are already known, and
their variations completely controlled in the experiments, so that they do not require
any specific check or description.

The last stage of the analysis brings the two groups of variables together, and
enables you to draw final conclusions.

Table 16.3 - A schematic strategy for the analysis of screening results


Stage (1): Data Checks (*)
  Methods: Plotting Raw Data; Statistics; Principal Component Analysis (PCA) if several responses.
  Results: Data transcription errors are detected and corrected; more serious problems are identified (if any).

Stage (2): Descriptive Analysis (*)
  Methods: Statistics; Principal Component Analysis (PCA) if several responses.
  Results: Range of variation of each response; amount of variation which makes sense (structure); correlation among various responses.

Stage (3): Inferential Analysis
  Methods: Analysis of Effects with ANOVA and significance testing.
  Results: Effect of each design variable on each response; significance of each effect; important variables identified.

(*) Data checks and descriptive analysis are usually performed simultaneously, since descriptive
methods are an efficient tool for detecting errors in the data.


To Make the Model by PLS or PCR


If you use PLS1, you should study the results corresponding to one PLS-component.
If there are several responses, you may use PLS2 initially and study results for as
many components as there are responses (e.g. 3 PCs for three responses).

In PCR, study the results for as many PCs as there are variables (i.e. max. number of
PCs). This corresponds to the MLR solution.

Note! Artifacts
Factorial designs give only one PLS component per Y-variable because
you vary all variables equally! For the same reason the calibration
X-variance will be zero.

16.3.1 Significant effects


Size of effects
Plotting the B-coefficients as a standard 1-vector plot (e.g. with Bars) shows the
values of the B-coefficients which are half the size of the effects: Effect = 2 * B.
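
As a rough numerical illustration of the relation Effect = 2 * B, the sketch below fits
the b-coefficients of a coded (-1/+1) two-level design by ordinary least squares.
Numpy is used here and the response values are invented, so this is not The
Unscrambler's output:

    import numpy as np
    from itertools import product

    # A 2^3 full factorial in coded units, with an invented response (e.g. Yield).
    X_design = np.array(list(product([-1, 1], repeat=3)), dtype=float)
    y = np.array([54, 60, 50, 64, 58, 66, 53, 70], dtype=float)

    # Model matrix: intercept plus the three main effects (interactions could be appended).
    X = np.column_stack([np.ones(len(y)), X_design])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)

    for name, coef in zip(["b0", "A", "B", "C"], b):
        print(f"{name}: b = {coef:6.2f}   effect = {2 * coef:6.2f}")
    # For a coded two-level design, the effect of a variable (mean response at +1
    # minus mean response at -1) is exactly twice its b-coefficient.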

It may be difficult from this plot to say where the limit between significant and
insignificant effects lies.

There are several ways to find the significant effects, either using an F-test or a
p-value, or by studying the effects relative to each other in a Normal probability plot
of effects (the Normal B-plot).

Figure 16.29 - B-coefficients plotted as a 1-vector bar plot
(ordinate: B-coefficient; abscissa: X-variables; tutd-0, (Yvar,PC): (Yield,1))


16.3.2 Using F-Test and P-Values to


Determine Significant Effects
F-Test
Equation 16.2:

    F = [effect^2 / (m - 1)] / [RSD^2 / (n - 1)]

where
m - 1 = number of degrees of freedom for the effect
n - 1 = number of degrees of freedom of the error

and RSD is the residual standard deviation estimated from the replicates
(ȳ is the average of the n replicate responses):

    RSD = sqrt( Σ (y_i - ȳ)^2 / (n - 1) ),  summed over i = 1, …, n

m = number of experiments (except center points)
n = number of replicates for the estimation of the error

F is compared with the critical value of the F-distribution with m-1 and n-1 degrees
of freedom, at the chosen significance level (typically 95%). You find this critical
F-value in a statistical table. If F is larger than the critical F-value, the effect is
regarded as significant.

P-Value
A complementary measure is the P-value. The P-value is the probability of obtaining
an effect at least as large as the one observed if the true effect were zero. For
instance, PA = 0.01 means that effect A is significant at the 99% confidence level.

If the P-value you get from the statistical table is small, then the effect is significant.
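
The following sketch shows how Equation 16.2 and the corresponding P-value could be
evaluated with scipy instead of a printed statistical table. All numbers are invented
for illustration and the variable names are our own:

    from scipy import stats

    effect = 30.0     # estimated effect of a design variable
    rsd = 2.0         # residual std. dev. from the replicated center points
    m, n = 16, 3      # m experiments (excl. center points), n replicates

    F = (effect**2 / (m - 1)) / (rsd**2 / (n - 1))
    F_crit = stats.f.ppf(0.95, m - 1, n - 1)    # 95% critical value of F(m-1, n-1)
    p_value = stats.f.sf(F, m - 1, n - 1)       # probability of a larger F by chance

    print(f"F = {F:.1f}, critical value = {F_crit:.1f}, p = {p_value:.4f}")
    print("significant" if F > F_crit else "not significant")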

Unfortunately this approach has a few pitfalls and, even worse, most people are
unaware of them!

The P-values are unreliable if the number of degrees of freedom for the estimate of
the error is small, i.e. if there were few replicates to estimate the error.
Conversely, all P-values become low - that is, all effects seem significant - if the
error estimate is very small. This may occur if the reference method is very accurate.


If this is the situation, a better strategy may be to look at the effects relative to each
other, for instance in the Normal B-plot!

Figure 16.31 - Scatter plot of data with a very small P-value despite a weak visual
correlation (R = -0.42, P = 0.0006)

In Figure 16.31 the P-value is very small, implying a significant effect. However
most people have difficulties in seeing a “significant” trend and correlation in the
plotted data!

Since the F-test and the P-value are two sides of the same coin, this also applies for
the results of the F-test.

Find Significant Effects in the Normal B-Plot


The easiest way to examine which effects are significant is by a plot of the normal
probability of effects, or the Normal B-plot.


Figure 16.33 - Normal probability plot of B-coefficients
(ordinate: probability; abscissa: B-coefficient; tutd-0, PC = 1, Yvar = Yield)

The abscissa axis shows the size of the b-coefficients. The ordinate axis shows the
probability. (An F-test is used to calculate this probability.)

Significant effects represent systematic variations of the response related to the


design variables and their interactions. Systematic variation is the opposite of
random noise. In the normal probability plot all random effects are located along a
more or less straight line through (0, 50), and all significant effects are located
outside this line. To be precise: to the upper right or the lower left of the line.
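
A normal probability plot of effects can also be sketched outside The Unscrambler,
for example with scipy and matplotlib as below. The effect values are invented, and
the plot uses normal quantiles on the ordinate rather than the probability scale of
the Normal B-plot, but the interpretation is the same:

    import matplotlib.pyplot as plt
    from scipy import stats

    # Invented effect estimates (main effects and interactions).
    effects = {"A": 14.2, "B": 9.8, "C": 17.5, "AB": -1.1, "AC": -6.3,
               "BC": -7.0, "D": 0.8, "E": -0.5}

    names, values = zip(*sorted(effects.items(), key=lambda kv: kv[1]))
    (quantiles, ordered), _ = stats.probplot(values, dist="norm")

    plt.scatter(ordered, quantiles)
    for x, yq, name in zip(ordered, quantiles, names):
        plt.annotate(name, (x, yq))
    plt.xlabel("Effect")
    plt.ylabel("Normal quantile")
    plt.title("Normal probability plot of effects")
    plt.show()
    # Effects behaving like random noise fall on a straight line through the middle;
    # effects far off the line (upper right or lower left) are likely to be significant.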

In some cases it may be difficult to determine whether an effect is on or off the


straight line. This can be further investigated by checking for example the residuals.

Discrete Variables in Designed Experiments


The Normal B-plot only tells you which of the design variables are significant. In the
case of a discrete variable, e.g. catalyst A or B, which one has the most effect? Study
the experimental plan and compare the settings for the variable in question with the
response. You find the most effective setting where the response is high.

Since the center points are not used to estimate the effects, it does not matter which
setting you use for discrete variables in these experiments.

Curvature Check
The b0 is an estimate of the average response value. If b0 is different from the
average response at the center point(s), then the response surface is probably curved


and you may need to continue with a response surface design to make a quadratic
model. If you have no center points, you may instead calculate the average response
of all the experiments.
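
A tiny numerical sketch of this curvature check, with invented yield values and plain
numpy rather than The Unscrambler:

    import numpy as np

    cube_yields = np.array([54, 60, 50, 64, 58, 66, 53, 70], dtype=float)   # factorial points
    center_yields = np.array([72, 74, 71], dtype=float)                     # replicated center points

    b0 = cube_yields.mean()     # for a coded factorial design, b0 equals the cube average
    print(f"b0 = {b0:.1f}, center average = {center_yields.mean():.1f}, "
          f"difference = {center_yields.mean() - b0:.1f}")
    # A difference clearly larger than the experimental error (estimated from the
    # spread of the center replicates) suggests curvature, i.e. that a response
    # surface design and a quadratic model may be needed.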

Check that the Model is Adequate


When you interpret the Normal B-plot you determine which variables and
interactions have a significant influence on the response. To verify that this
assumption is correct, you should make a new model based only on the significant
effects.

An easy way to do this in The Unscrambler is to make a new model where you
weight all the insignificant effects to zero. Then plot the residuals of this reduced
model, for example as a Normal probability plot of residuals. They should now be
random if all systematic variation in the response is explained by the significant
design variables; the residuals should then form a more or less straight line through (0, 50).

Sometimes it is difficult to see which of the effects are significant; are they on or off
the line in the Normal B-plot? Start by making reduced models where you first keep
only the obvious ones. If you have deleted a significant design variable, there will be
unexplained systematic variations in the reduced model. The residuals will not be
random and there will be no straight line in the Normal Residuals plot! You should
then put the doubtful variables back into the model, one by one, until you are
satisfied.

Studying the Response Surface


The response may be plotted as a function of two of the design variables, keeping the
rest of the design variables at constant levels. The response surface is then seen
either as a 3D plot, or seen from above as a contour plot. Now you can change the
constant levels of the other design variables and see how the response surface
changes. The changes of the design variables may for example be from low to high,
or mean, or a level that is relevant for the application. You may of course also
replace the two varying variables with others. If there are many design variables, you
may need to play around a bit to fully grasp how the response depends on the design
variables.

Normally the response surface plot is associated with optimization designs, but the
plot illustrates well the interactions and how the response varies with the design
variable settings. Note that a surface based on a screening model (linear with
interactions) will be a poor description of the true response surface if there is curvature!
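
As an illustration of such a plot, the sketch below draws a contour (the response
surface seen from above) for a model with invented coefficients, varying two coded
design variables while a third is held at its center level. Matplotlib is used here,
whereas the book's plots come from The Unscrambler:

    import numpy as np
    import matplotlib.pyplot as plt

    def predicted_response(x1, x2, x3=0.0):
        # Invented coefficients: main effects, one interaction and one square term.
        return 60 + 8 * x1 + 5 * x2 + 2 * x3 - 4 * x1 * x2 - 3 * x1**2

    g1, g2 = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
    z = predicted_response(g1, g2, x3=0.0)   # third design variable fixed at its center level

    contours = plt.contourf(g1, g2, z, levels=15)
    plt.colorbar(contours, label="Predicted response")
    plt.xlabel("Design variable 1 (coded)")
    plt.ylabel("Design variable 2 (coded)")
    plt.title("Response surface (contour plot)")
    plt.show()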


16.3.3 Exercise - Willgerodt-Kindler Reaction


Purpose
To illustrate layout of a design plan and analysis of significant effects.

Problem and Context


One very peculiar oxidation/rearrangement reaction is known as the Willgerodt-
Kindler reaction. In this reaction aryl alkyl ketones rearrange into
ω-arylthiocarboxamides when heated in the presence of elemental sulfur and amines.
This reaction has been described in a huge number of papers. Despite the large
efforts spent on studying this reaction, no mechanism had been found, and the
reaction was known as rather useless for preparative purposes due to its poor yield
(< 70%). If the yield were larger, the product might be of interest as synthons for
more complicated molecules.
One reaction was studied by R. Carlson and co-workers (1986) using experimental
design. The aim was to optimize the yield, as follows.

Task
Find which variables have significant influence on the yield. Use a (fractional)
factorial design. From the literature, there is an indication that the following
variables could influence the yield and that the ranges given are appropriate:

Table 16.4 - Description of the variables in the exercise


Variable (description)                       Low (-1)    High (+1)
The amount of sulfur/ketone                      5           11
The amount of morpholine/ketone                  6           10
The reaction temperature (°C)                  100          140
The particle size of sulfur (mesh)             120          240
The stirring rate (rpm)                        300          700

How to Do it
1. Make the design
Select a suitable design to study the influence of the five design variables on the
yield. (The Yield is the response variable.)


Go to File – New Design, choose From Scratch and hit NEXT, select Create
Fractional Factorial Design, and hit NEXT.
Select New to enter a new design variable. Fill in a name and levels of variable A.
Hit OK and do the same for the rest.
(You can edit your entries with Properties or by double-clicking on the list.)
Then enter a New Response variable; Yield. When satisfied - press Next.

Select Design type. Choose a design for max. 16 experiments (not counting center
points). The design should enable you to estimate all main effects and to see if
there are any interactions between any two of the variables.

Study the confounding pattern and which effects can be estimated.


How many experiments will you need to get all main effects unconfounded?
What are the confoundings in the chosen design? Is this acceptable?
Will it be difficult to study two-variable interactions with this design? Why/why
not?

Select Next when satisfied.

2. Fill in Design Details.


Select 2 center points but no extra replicates of the cube points.
Why do we use center points? Why do we replicate them?
Will you get information about significance if you have no center points?
How many experiments do you get in total?

When satisfied, select Finish.

3. Look at the design.


From the View menu, toggle between Standard Sample Sequence and Experiment
Sample Sequence to have a look at the design in standard order or in random
order. You can also use the option View - Point Names.
Why is the design randomized?
Which order shows you the structure of the design? Which is to be used when
performing the experiments?
Do you understand how the design was built?

4. Enter the lab results into the program


Since numerical editing is not allowed in the training package, you cannot enter
the response values from the experimentation. Therefore select File - Open and


look for Files of type Designed Data instead, and select WILLKIND which
already contains the response values.

5. Make a model to estimate the effects.


Select Task - Analysis of Effects. Select Samples (Cube & Center), X-variables
(Main and 2-variable Interaction Effects) and Y-variables (Response variables).

6. Find the significant effects


View the results. Use Plot - Effects with the default choices.
Which are the significant effects? What are their signs?

Then use Plot - Effects - Details and select Normal probability plot and Include
Table.
Which are the largest effects? Which ones are likely to be significant?
Do you get the same information as in the Effects Overview?

7. Check the model


Plot Residuals, for example Normal Probability of Residuals.
Are the residuals large? Would you expect that?

8. Check precision and curvature


Click the data editor and select Task - Statistics. Compute statistics for all the
samples and variables. View results. The upper plot shows the percentiles. Here
you can visualize the whole range of variation of the response over the design
samples.
Is the range of variation of yield large? Is it symmetrically distributed?

The lower plot shows Mean and Std. Deviation for all design samples.
Select Plot - Statistics and plot Mean and Std. Deviation for Group containing
both Design Samples and Center Samples.
Is the standard deviation of the cube samples much larger than the standard
deviation of the replicated center points? What can you conclude from this
regarding the experimental error and/or the precision of the response
measurement?

How large is the average response value (yield) in the cube samples? How large
is the average response in the center points?
Is the relationship between yield and the design variables linear? How can you
tell?


Summary
The Fractional Factorial Resolution V design requires 16 experiments (which we can
afford in this case). All main effects will be estimated without any confounding
problems. The two-variable interaction effects will be confounded with three-
variable interactions, but those are in general negligible, so this is not a problem.

The Fractional Resolution III design requires only 8 experiments, but all main
effects will be confounded with one or two interaction effects. If there are
interactions it will therefore be difficult to interpret the results from this design.

We therefore choose the Resolution V design.

Center points are used to check curvature. If we replicate them we can also use them
to estimate the experimental error. This gives us 18 experiments in total. If there are
no center points it is difficult to get good estimates of significance. Then you will
need to use the Normal probability of effects plot to get an indication. We normally
randomize the experiments to avoid systematic effects, so the lab report we print
out should be randomized.

The list of experiments in standard order shows that the design was built as if there
were only 4 design variables (A, B, C, D) combined factorially, and that the fifth
variable (E) was generated from the interaction column ABCD.

The Effects Overview plot shows that Temperature, Sulfur and Morpholine have
positive significant effects. The AC and BC interactions are significant, negative
effects.
The other effects are not significant at the 5% level.
The Normal Probability plot indicates similar results, but you do not have
significance levels - you have to interpret which effects are likely to be significant:
Large effects that are far from the normal distribution line are likely to be
significant. The significant effects detected by ANOVA are the only ones that stand
out from the normal distribution line.

The detailed ANOVA table displayed together with the normal probability plot lists
the values of the effects, their significance level (p-value), and the confounding
pattern. It is a good summary of the overall results.

Residuals are very small. In fact, it is not meaningful to look at them because the
design has exactly as many cube samples as there are terms in the model (i.e. the


model is saturated), so the fit is perfect by construction. You might also have noticed
that the model had an R-square of 1.000, for the same reason.

Statistics
The percentile plot shows a very wide range of variation for yield: from 10 to 90
approximately. This indicates that the levels chosen for the design variables
generated enough variation in the response. The distribution of Yield values is
slightly asymmetrical: half the measured values are above 75 approximately. The
standard deviation of the cube samples (left bar) is much larger (about 50) than that
of the replicated center points (about 5). This means that the experimental error is so
small compared to the overall variation that the results can be trusted. (If the
variations due to experimental or instrumental variability were of the same order of
magnitude as the variation in the whole experiment series - caused by changing the
design variable settings - then we could not draw any conclusions about effects.)

The average response in the cube samples is close to the center samples’ average
value, so we can conclude that the relationship looks linear.

When there are no replicated center points (or reference points) and the design is
saturated, there are no residual degrees of freedom to estimate the significance. The
only significance testing method that applies to such a case is COSCIND. The
Effects overview table is then different from the ordinary one. The effects are
displayed by increasing order of absolute value. You should read the p-values until
you find the first significant effect; then all larger effects are assumed to be at least
as significant. The Normal probability plot of effects is a useful complement to the
COSCIND method, to make sure that all important effects are detected.

16.4 Optimization Designs


We have reached the stage in our project where we have identified the important
variables. We may also have found the best values for some of them, but there is a
small number of design variables, say 2 to 5, for which we need to collect more
information.

The purpose of an optimization design is to investigate the remaining variables at


more than two levels, so that we can analyze the results with a more complex model,
which will tell us which values of the design variables lead to the best response
values.


Since the point is to study what happens anywhere within a given range of variation,
optimization designs can only investigate design variables which vary over a
continuous range. As a consequence, if you have previously investigated any
category variables, you have to select their best level according to your screening
results, and fix them at the optimization stage.

Building the Design


The design consists of a set of experiments which combine at least 3 levels of the
design variables in a balanced way.

Two very different approaches are possible; each defines a particular type of
optimization design.

• Central Composite Designs use 5 levels of each design variable.
• Box-Behnken designs use 3 levels of each design variable.

You can read more about these types of design in the next two chapters.

Analyzing the Results


Optimization experiments are analyzed with a quadratic model. The model contains
the following elements:

• A linear part which consists of the main effects of the design variables.
• An interaction part which consists of the 2-variable interactions.
• A square part which consists of the square effects of the design variables,
necessary to study the curvature of the response surface.
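
A minimal numpy sketch of such a quadratic model, with an invented two-variable design
and invented response values (The Unscrambler builds and fits this model for you):

    import numpy as np
    from itertools import combinations

    def quadratic_terms(X):
        """Expand a coded design matrix into [1, x_i, x_i*x_j, x_i^2] columns."""
        n, k = X.shape
        cols = [np.ones(n)]
        cols += [X[:, i] for i in range(k)]                                   # linear part
        cols += [X[:, i] * X[:, j] for i, j in combinations(range(k), 2)]     # interaction part
        cols += [X[:, i] ** 2 for i in range(k)]                              # square part
        return np.column_stack(cols)

    # Invented design in coded units: 4 cube, 4 star and 3 center points.
    X = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1],
                  [-1.41, 0], [1.41, 0], [0, -1.41], [0, 1.41],
                  [0, 0], [0, 0], [0, 0]])
    y = np.array([52, 61, 58, 64, 50, 66, 55, 63, 68, 69, 67], dtype=float)

    coeffs, *_ = np.linalg.lstsq(quadratic_terms(X), y, rcond=None)
    print("b0, b1, b2, b12, b11, b22 =", np.round(coeffs, 2))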

The model results are visualized by means of one or several response surface plots,
where you can read the value of a response variable for any combination of values of
the design variables.

16.4.1 Central Composite Designs


Central Composite Designs are used for optimization purposes, whenever:

a) You want to optimize one or several responses with respect to 2 to 6 design


variables.
b) Optionally, if you want to re-use some experiments from an existing full factorial
or Fractional Factorial Design.


Note!
Case b) only applies if the ranges of variation of your design variables are
the same at the screening and the optimization stage. In addition, if your
optimization is based on fewer design variables than the screening,
re-using the previous experiments is only possible if the variables you are
dropping from the investigation are fixed at their Low or High level.

Building the Design


The design consists of two sets of experiments:

• The Cube and Center samples from a Full Factorial Design.
• Star samples, which provide the additional levels necessary to compute a
quadratic model.

The star samples combine the center level of all variables but one, with an extreme
level of the last variable. The star levels (Low star and High star) are respectively
lower than Low cube and higher than High cube. Usually, these star levels are such
that the star samples have the same distance from the center as the cube samples. In
other words, all experiments in a Central Composite Design are located on a sphere
around the Center sample.

This property is called rotatability. It ensures that each experiment contributes


equally to the total information. As a consequence, the model will have the same
precision in all directions from the center.
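
The layout of the experiments can be sketched as follows in Python (coded units; our
own illustration, not The Unscrambler's algorithm). The star distance is set to
sqrt(k) so that the star points lie at the same distance from the center as the cube
corners, as described above; other conventions for the star distance also exist:

    import numpy as np
    from itertools import product

    def central_composite(k, n_center=3):
        cube = np.array(list(product([-1.0, 1.0], repeat=k)))            # 2^k cube points
        alpha = np.sqrt(k)                                                # star distance
        star = np.vstack([sign * alpha * np.eye(k)[i]
                          for i in range(k) for sign in (-1, 1)])         # 2k star points
        center = np.zeros((n_center, k))                                  # replicated center points
        return np.vstack([cube, star, center])

    design = central_composite(3)
    print(design.shape)      # 8 cube + 6 star + 3 center = 17 experiments for 3 variables
    print(np.round(design, 2))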

Figure 16.35 - Geometrical representation of a Central Composite Design


with 3 variables


How Many Experiments?


The number of experiments in a Central Composite Design is fixed according to the
number of design variables. The number of center samples can be tuned between an
“economical” value and an “optimal” value.

Table 16.5 - Number of experiments in a Central Composite Design


Design variables Cube Star Center Total exp.
2 4 4 3 to 5 11 to 13
3 8 6 3 to 6 17 to 20
4 16 8 3 to 7 27 to 31
5 32 10 3 to 10 45 to 52
6 64 12 3 to 15 79 to 91

Note!
From these numbers, it is pretty obvious that it is not recommended to
build a Central Composite Design with 6 variables. The total number of
experiments is so large that something is bound to go wrong, which will
prevent you from interpreting the results clearly enough. In practice, if you
have 6 design variables to investigate, run one more screening before
starting an optimization

What if We Cannot Go Out of the Cube?


You may wish to extend an existing factorial design to Central Composite and have
a constraint regarding the possible values of your design variables. Sometimes this
makes the usual values for the Star levels impossible to reach.

Example: We have investigated three design variables with a Full Factorial


Design, in the following ranges:
Temperature: Low=35°C High=38°C
Amount of yeast: Low=8 g High=12 g
Amount of stabilizer: Low=0 g High=1 g
Now we want to re-use the results from our screening experiments in a Central
Composite Design. But we have a problem with the Low star levels: for Amount
of stabilizer, we cannot go below zero!

The solution consists in tuning down the distance between Star samples and Center
of the design, until we reach possible values for all Low star and High star levels. In
the most extreme case, the Star levels will be the same as the Cube levels. Then the
star samples are located on the centers of the faces of the cube - see Figure 16.37 for
an illustration.


There is a disadvantage, however, to changing the distance between Star samples


and Center. Since the Star samples are no longer on the sphere defined by the Cube
samples, the design is no longer rotatable. The star samples bring in slightly less
information than the cube samples.

Figure 16.37 - Central Composite Design where


Low star=Low cube, High star=High cube

What if We Want to Run the Experiments in Two Blocks?


You may wish to perform your experiments in two independent series, in such a way
that, even if there is a consistent difference between the first and the second series,
the results can still be used to estimate the main effects, interactions and square
effects reliably. This is called blocking.

Example: We have built a Central Composite Design studying 3 process


variables. We are going to run large-scale experiments. This means that we need
large batches of raw material. Actually, our main ingredient is only available in
batches large enough for 10 experiments, but the design includes 17 experiments.
How can we divide 2 batches among our 17 experiments so as to minimize the
resulting disturbances linked to the fact that our trials are not run under exactly
the same conditions?

Fortunately, the Central Composite Design consists of two main sets of experiments:
Cube and Star samples. These two groups have the mathematical property that they
contribute to the estimation of a quadratic model independently from each other. As
a consequence, if some of the experimental conditions vary slightly between the first
group and the second one, it will of course generate some “background noise”, but it
will not change the computed effects.


So the recipe for blocking is quite simple:

• The first block contains all Cube samples and half of the Center samples.
• The second block contains all Star samples and the other half of the Center
samples.

Figure 16.39 - Blocking with a Central Composite Design
(legend: Block 1, Block 2, Block 1 / Block 2)

16.4.2 Box-Behnken Designs


Box-Behnken designs are used for optimization purposes, whenever:

a) You want to optimize one or several responses with respect to 3 to 6 design


variables.
b) Optionally, you want to avoid extreme situations (all variables at their Low or
High levels together).
c) Optionally, you want to stay inside the “cube” and still have a rotatable design.

Building the Design


The experiments will include three levels of each design variable: Low, Center,
High. The design consists of all combinations of the extreme levels of 2 or 3
variables with the center level of the remaining ones.

So the design does indeed avoid extreme situations: if you study Figure 16.41 you
will see that the corners of the cube are not included. All experiments actually lie on
the centers of the edges of the cube.


As a consequence, it is also obvious that all experiments lie on a sphere around the
center: the design is rotatable.

And finally, since only 3 levels are used, there is no risk of including “impossible”
levels once you have defined a valid range of variation for each design variable.
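
For 3 design variables the construction can be sketched as below (coded units, our own
illustration): every pair of variables is run through a 2x2 factorial at -1/+1 while
the remaining variable is kept at its center level. This simple all-pairs scheme
reproduces the run counts of Table 16.6 below for 3 to 5 variables; the 6-variable
design in the table is built with a more elaborate scheme:

    import numpy as np
    from itertools import combinations, product

    def box_behnken(k, n_center=3):
        runs = []
        for i, j in combinations(range(k), 2):           # every pair of design variables
            for a, b in product([-1.0, 1.0], repeat=2):  # 2x2 factorial in that pair
                point = np.zeros(k)
                point[i], point[j] = a, b
                runs.append(point)
        runs += [np.zeros(k)] * n_center                 # replicated center points
        return np.array(runs)

    design = box_behnken(3)
    print(design.shape)    # 3 pairs x 4 runs + 3 center points = 15 experiments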

Figure 16.41 - Geometrical representation of a Box-Behnken design


with 3 design variables

How Many Experiments?


The number of experiments in a Box-Behnken design is fixed according to the
number of design variables. The number of center samples can be tuned between an
“economical” value and an “optimal” value.

Table 16.6 - Number of experiments in a Box-Behnken design


Design variables “Cube” Center Total exp.
3 12 3 15
4 24 3 27
5 40 3 to 6 43 to 46
6 48 3 to 6 51 to 54

Note!
If you compare these numbers with those of the Central Composite
Design (see Table 16.5), you will notice that the Box-Behnken generally
requires fewer experiments. So if you do not have any particular reason
for using a Central Composite Design, Box-Behnken is an economical
alternative.

Which Optimization Design to Choose in Practice


Here are a few rules to help you.
• If you wish to re-use some of the results from your previous factorial design: the
Central Composite Design is the only choice.
• If you need blocking (read about that in the Central Composite chapter above):
the Central Composite is the only type of design with that possibility.
• If you are investigating 2 design variables only: a Central Composite Design is
the only choice.
• If you wish to avoid extreme situations (because they are likely to be difficult to
handle, or because you already know that the optimum is not in the corners): the
Box-Behnken design is preferable.
• If you cannot go out of the cube but want a rotatable design: the Box-Behnken
design is the only one with that combination of properties.
• If you do not have any special constraints, except budget: the Box-Behnken
design is more economical than the Central Composite.

16.5 Analyzing an Optimization Design


As is the case for screening, the variables taken into account in the data checks and
descriptive stages are response variables only.

The last stages of the analysis involve both inference and prediction: it is important
to know which effects included in the model are useful and which can be taken out,
so that the model finally used for prediction is simple, effective and robust.

Note that, once this analysis is completed, there will be a confirmation stage if
satisfactory conditions have been identified; the analysis of the confirmation
experiments will consist mostly of a descriptive stage where the results are checked
against the expectations.


Table 16.7 - A schematic strategy for the analysis of optimization results


Stage (1): Data Checks (*)
  Methods: Plotting Raw Data; Statistics; Principal Component Analysis (PCA) if several responses.
  Results: Data transcription errors are detected and corrected; more serious problems are identified (if any).

Stage (2): Descriptive Analysis (*)
  Methods: Statistics; Principal Component Analysis (PCA) if several responses.
  Results: Range of variation of each response; amount of variation which makes sense (structure); correlation among various responses.

Stage (3): Inferential Analysis (**)
  Methods: Response Surface analysis, with ANOVA and response surface plots.
  Results: Effect of each design variable on each response; significance of each effect; useful effects retained, others kept out of the model.

Stage (4): Predictive Analysis (**)
  Methods: Response Surface analysis, with ANOVA and response surface plots; PLS2 regression if several responses, with response surface plots.
  Results: The shape of the response surface is described, a maximum/minimum is found; the values of the responses are predicted all over the experimental region; the best compromise is identified.

(*) Data checks and descriptive analysis are usually performed simultaneously, since descriptive
methods are an efficient tool for detecting errors in the data.
(**) Inferential and predictive analysis are often performed simultaneously, since Response Surface
analysis includes ANOVA (inference) and response surface plots (prediction).

16.5.1 Exercise - Optimization of Enamine


Synthesis
Purpose
This exercise illustrates the following terms, functions, and tasks:
• Building suitable designs for screening and optimization purposes.
• Analysis of Effects.
• Response Surface Modeling.


Problem Description
This exercise was built from the enamine synthesis example published by R. Carlson
in his book “Design and Optimization in Organic Synthesis”, Elsevier, 1992.

A standard method for the synthesis of enamine from a ketone gave some problems
and a modified procedure was investigated. A first series of experiments gave two
important results:
• A new procedure was built up, which shortened reaction time considerably.
• It was shown that the optimal operational conditions were highly dependent on
the structure of the original ketone.

So a new investigation had to be conducted to study the specific case of the


formation of morpholine enamine from methyl isobutyl ketone.
It was decided to adopt a 2-step strategy:
• First, at a screening stage, study the main effects of 4 factors (relative amounts
of the reagents, stirring rate and reaction temperature) and their possible
interactions.
• Second, conduct an optimization investigation with a reduced number of
factors.

Data Table
From the previous experiments, reasonable ranges of variation were selected for the
following 4 design variables:

Table 16.8 – Design variables and their ranges of variation


Variable                                           Low (-1)       Mid (0)         High (+1)
A: The amount of TiCl4/ketone (mol/mol)              0.57           0.75            0.93
B: The amount of morpholine/ketone (mol/mol)         3.7            5.5             7.3
C: The reaction temperature (°C)                    25             32.5            40
D: The stirring rate (rpm)                          no stirring    intermediate    high stirring


Building a Screening Design


Screening designs are used to identify which design variables influence the
responses significantly.

Tasks
Select a screening design requiring a maximum of 11 experiments that will make it
possible to estimate all main effects and detect the existence of 2-factor interactions.
Note: with 4 design variables, you need a Fractional Factorial Design to keep the
number of experiments lower than 16 (2^4).

How To Do It
Go to File – New Design, choose From Scratch and hit Next. Then select Create
Fractional Factorial Design, and hit Next.

Then you may define your variables. From the Define Variables window, use
New in the Design Variables box to add each new design variable. From the Add
Design Variable window, name each new design variable (e.g. TiCl4,
Morpholine, Temperature, Stirring), select Continuous, enter the low and high
levels (look up the levels in the table on the previous page; use only the Low and High
levels), and validate with New.

Note: in order to be allowed to specify center samples, you will have to define
Stirring rate as a continuous variable; you can give it the arbitrary levels -1 and 1,
where -1 stands for “no stirring” and 1 stands for “high stirring”.

After all four design variables have been defined, the Design Variables box
should contain the following:

Table 16.9
ID Name Data Type Levels
A TiCl4 Continuous 2 (0.6;0.9)
B Morpholine Continuous 2 (3.7;7.3)
C Temperature Continuous 2 (25.0;40.0)
D Stirring Continuous 2 (-1.0;1.0)

From the Non-design Variables window, use New to define the response variable
(Yield).


Now you are ready to choose your design type more specifically.
Use Next to get into the Design Type window.

You will notice that the default choice is set to a Fractional Factorial Resolution
IV design, which consists of 8 experiments. Try other choices by toggling
Number of Experiments to Run up or down. (Actually, there is only one possible
Fractional Factorial Design with 4 variables; if you go up to 16 samples, then you
have a Full Factorial Design.)

Study the confounding pattern of the suggested design. You can see that all main
effects are confounded with 3-factor interactions, which is acceptable if we
assume that those interactions are unlikely to be significant. The 2-factor
interactions are confounded two by two.

The last step consists in setting the numbers of replicates and center samples.

Figure 16.43 – Design Details and Last Checks dialogs

Use Next to get into the Design Details window (Figure 16.43). Keep Number of
Replicates to 1, and add 3 Center Samples. Click Next twice until you reach the
Last Checks window (Figure 16.43). A summary is displayed to make sure that
all your design parameters have the correct values. If not, use Back and make
corrections.

Once you are satisfied with your design specifications, use Finish to exit. The
generated design is automatically displayed on screen. You can use the View
menu to toggle between display options. Try Sample Names and Point Names,
Standard Sample Sequence and Experiment Sample Sequence (randomized order).


It would now be safe to store your new data table into a file, using File - Save As;
give it a name, e.g. Enam FRD. Note that you should not overwrite the existing
file Enam_FRD. You need this file later in the exercise.

Estimation of the Effects


After the experiments have been performed and the responses have been measured,
you have to analyze the results using a suitable method.

Tasks
Study the main effects of the four design variables, and check whether there are any
significant interactions. The simplest way to do this is to run an Analysis of Effects.
Then, interpret the results.

How To Do It
First, you should enter the response values. Since this has already been done, you
just need to read the complete file. Use File - Open, and select among the
Designed Data list the file named Enam_FRD, which already contains the
response values.

Running an Analysis of Effects


To start the analysis, choose Task - Analysis of Effects....

From the Analysis of Effects window use the Samples, X-variables and
Y-variables boxes to select the appropriate samples and variables. Sample Set
should be Cube & Center Samples (11). X-variables should be Design Vars + Int
(4+3). Y-variable set should be Cont Non-Design Vars (1).


Figure 16.45 – Analysis of Effects dialog

Validate your final choices with OK.

After the calibration has completed successfully, click “View” to get an overview
of the model results. Before doing anything else, use File - Save to save the results
file with a name like “Enam FRD AoE-a”, for example.

Interpreting the Results from Analysis of Effects


The Effects Overview shows which effects are significant. By default, the
Significance Testing Method is “Center”. Go to Plot - Effects and select COSCIND
as Significance Testing Method.

You can see that three effects are considered to be significant: Main effect TiCl4
(A), Interaction AB or CD, and Main effect Morpholine (B).

Go to Window - Copy to - 2 and use the empty window to plot those effects. To do
that, go to Plot - Effects - Details and select Normal Probability only.


Figure 16.47 – Normal probability Plot

The normal probability plot of the effects (Figure 16.47) confirms the results of
the Effects Overview: the effect of Morpholine (B) is clearly very significant, and
AB=CD and TiCl4 (A) are also likely to be significant.
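
For readers who want to reproduce this kind of plot outside The Unscrambler, the following Python sketch (with made-up effect values, not the ones from this exercise) shows the idea behind a normal probability plot of effects: effects that are pure noise fall close to a straight line, while significant effects stand out from it.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical estimated effects from a screening design (values are invented
# for illustration; use the effects reported by your own analysis).
effects = {"A": 4.1, "B": 11.9, "C": 0.4, "D": -0.7,
           "AB=CD": 5.2, "AC=BD": -0.3, "AD=BC": 0.6}
values = np.array(list(effects.values()))

# probplot sorts the effects and pairs them with theoretical normal quantiles;
# effects due only to noise should lie close to the fitted straight line.
(osm, osr), (slope, intercept, r) = stats.probplot(values, dist="norm")

plt.plot(osm, osr, "o")
plt.plot(osm, slope * osm + intercept, "-")
plt.xlabel("Theoretical normal quantiles")
plt.ylabel("Estimated effect")
plt.show()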

Checking the Model


Since there are as many terms in the model as the number of cube samples in the
design, studying the residuals is not relevant.

So we should just check the model for non-linearities. To do that, go back to the
Editor window and select Task - Statistics. Choose Cube & Center Samples (11),
then OK. View the results.

The upper plot shows the range of variation of the response (Yield).
The lower plot shows mean and standard deviation over all samples. Click that
plot and use Plot - Statistics, selecting Mean and Std. Dev. for Sample Groups
Design Samples and Center Samples; validate your choices with OK.

The lower plot now displays the mean and standard deviation of all Design
samples compared to that of the Center samples only.

You can see that the standard deviation for the Center samples is about half the
overall standard deviation. This indicates some lack of reproducibility in the
Center samples; this is why most of the effects observed in the Analysis of
Effects were not found significant according to the Center significance testing
method. If you go back to the Editor and study the Yield values, you will notice
that Center sample Cent-c has a very different value from Cent-a and -b; maybe
that experiment was performed wrongly.


The other important information conveyed by that plot is that there is a strong
non-linearity in the actual relationship between Yield and the design variables:
the mean value for the Center samples is much higher than for the overall design.

Drawing a Conclusion from the Screening Design


The final conclusions of the screening experiments are the following.

Three effects were found likely to be significant. One of them is a confounded
interaction, but since the main effects of A and B are the only significant ones,
we can try an educated guess and assume that the significant interaction is AB.

There was some lack of reproducibility in the Center samples, although the
remaining part of the design showed a clear structure (according to the
COSCIND and Normal probability results). If new experiments are performed, it
will be useful to replicate the Center samples a few more times.

There seems to be a strong non-linearity in the relationship between Yield and
(TiCl4, Morpholine). Furthermore, since the Center samples have a higher yield
than the majority of the Design samples, the optimum is likely to be somewhere
inside the investigated region.

So the next sensible step would be to perform an optimization, using only
variables TiCl4 and Morpholine.

Building an Optimization Design


After finding the important variables from a screening design, it is natural to proceed
to the next step: finding the optimal levels of those variables. This is achieved by an
optimization design.

Task
Build a Central Composite Design to study the effects of the two important variables
(TiCl4 and Morpholine) in more detail. NB: the other two variables investigated in
the screening design have been set to their most convenient values: no stirring, and
Temperature = 40°C.

How To Do It
Choose File - New Design to start the dialog that will enable you to generate a
designed data table as in the previous exercise. Select Create Central Composite
Design. From the Define Variables window, define the two design variables
TiCl4 and Morpholine with the same ranges of variation as previously, and the
response variable (Yield).

Check that the Design Variables box indicates the correct Star Points Distance
from Center, namely 1.41.

Once you are satisfied with your variable definitions, use Next to get into the
Design Details window. Set Number of Replicates to 1, and Number of Center
Samples to 5.

Check the summary displayed to the right of the window to make sure that all
your design parameters have the correct values. The design should include a total
of 13 experiments. Otherwise, use Back.

Once you are satisfied with your design specifications, use Finish to exit. The
generated design is automatically displayed on screen.

You may view the list of experiments in standard order to better understand the
structure of the design.
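
As a side note, the structure of this design is easy to reproduce by hand. The Python sketch below (a generic textbook construction, not The Unscrambler's output) generates the 13 coded experiments: 4 cube points, 4 star points at distance 1.41, and 5 center samples.

import itertools
import numpy as np

# Central composite design for 2 factors in coded units.
factorial = np.array(list(itertools.product([-1, 1], repeat=2)))  # 4 cube points
alpha = np.sqrt(2)                                                # star distance ~1.41
star = np.array([[-alpha, 0], [alpha, 0], [0, -alpha], [0, alpha]])
center = np.zeros((5, 2))                                         # 5 center samples

design = np.vstack([factorial, star, center])
print(design.shape)   # (13, 2) -> 13 experiments in total, as in the exercise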

Then you should save your design for further use.

Computation of the Response Surface


After the new experiments have been performed and their results collected, you are
ready to analyze them so as to find the optimum.

Task
Find the levels of TiCl4 and Morpholine that give the best possible yield. You will
need to use a Response Surface Analysis.

How To Do It
First, you should enter the response values, but this has already been done. Use
File - Open and select from the Designed Data list the file named Enam_CCD,
which already contains the response values.

Running a Response Surface Analysis


To start the analysis, choose Task - Response Surface.


From the Response Surface window, check that the Samples, X-variables and
Y-variables boxes contain the appropriate selections. Select a Quadratic Model.
Click OK to start the analysis.

When the computations are finished, click View to study the results. But before
you start interpreting them, do not forget to save the result file!

The viewer displays a response surface overview, which consists of 4 plots:
Analysis of Variance, Residuals, Response Surface visualized as a contour plot,
and Response Surface visualized as a landscape plot.

Interpreting Analysis of Variance Results for Response Surface


First, study the ANOVA results. Use Shift-Click on the upper part of the
ANOVA window, to blow it up to full screen. You can adjust the width of the
various columns of the table if necessary. Study in turn: Summary, Model Check,
Variables, and Lack of Fit.

The Summary shows that the model is globally significant, so we can go on with
the interpretation.

The Model Check indicates that the quadratic part of the model is significant,
which shows that the interactions and square terms included in the model are
useful.

The Variables ANOVA displays the values of the b-coefficients, and their
significance. You see that the most significant coefficients are for the linear and
quadratic effects of Morpholine; the quadratic effect of TiCl4 is close to the 0.05
significance level. That section of the table also tells you that the maximum point
is reached for TiCl4=0.835 and Morpholine= 6.504; the information displayed on
top of the table shows a Predicted Max Point Value of 96.747.
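
The maximum point reported in the table is simply the stationary point of the fitted quadratic polynomial. The Python sketch below shows that computation for a two-variable quadratic model; the coefficient values are hypothetical placeholders, not the ones from this exercise.

import numpy as np

# Stationary point of a fitted quadratic model
#   y = b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2
# Replace the placeholder values with the b-coefficients from your ANOVA table.
b0, b1, b2, b11, b22, b12 = 95.0, 1.2, 2.5, -3.0, -2.0, 0.8

b = np.array([b1, b2])
B = np.array([[b11, b12 / 2],
              [b12 / 2, b22]])

x_star = np.linalg.solve(-2 * B, b)            # gradient = 0  ->  2*B*x + b = 0
y_star = b0 + b @ x_star + x_star @ B @ x_star
print(x_star, y_star)                          # coded coordinates and predicted value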

The Lack of Fit section tells you that, with a p-value around 0.19, there is no
significant lack of fit in the model. Thus we can trust the model to describe the
response surface adequately.

Checking the Residuals for Response Surface


The upper right window shows a Normal Probability plot of the residuals. That
plot can be used to detect any outliers.


Here, you see that the residuals form two groups (positive residuals and negative
ones). Apart from that, they lie roughly along a straight line, and no extreme
residual is to be found outside that line. This means that there is no apparent
outlier.

From that window, go to Plot - Residuals and select Y-residuals vs Predicted Y.


Try alternatively the two options Residuals (which shows the raw residuals) and
Studentized (which shows transformed residuals that can be compared to a
Student distribution).

On the Studentized residuals plot, all values are within the (-2;+2) range, which
confirms that there are no outliers. Furthermore, there is no clear pattern in the
residuals, so nothing seems to be wrong with the model.
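
If you wish to compute such studentized residuals yourself, a minimal Python sketch (assuming a least-squares model with model matrix X, including the intercept column, and response vector y) could look like this:

import numpy as np

def studentized_residuals(X, y):
    # (Internally) studentized residuals for an ordinary least-squares fit.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
    s2 = resid @ resid / (n - p)                  # residual variance
    return resid / np.sqrt(s2 * (1 - np.diag(H)))

# Values outside roughly +/-2 would be flagged as potential outliers.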

Go to Plot - Predicted vs Measured and select Predicted vs Measured. If
necessary, use View - Trend Lines - Regression Line to visualize the “y=x” line.
You can see how the design samples are spread around that line; in particular, the
Center samples to the right of the plot show a considerable spread. This is why so
few effects in the model are very significant: there is quite a large amount of
experimental variability.

Interpreting Response Surface Plots


Now that the model has been thoroughly checked, you can use it for the final
interpretation. This is most easily done by studying the two plots which visualize
the response surface.

The landscape plot displayed in the lower right quadrant shows you the shape of
the response surface: a kind of round hill with a maximum somewhere between
the center and maximum values of the design variables.

That plot is not precise enough to spot the coordinates of the maximum; the
contour plot displayed on the left is better suited to that purpose. For instance, you
can change the scaling to zoom in around the optimum, so as to locate its coordinates
more accurately. Check that they match what is displayed in the ANOVA table.

You can also click at various points in the neighborhood of the optimum, to see
how fast the predicted values decrease. You will notice that the top of the surface
is rather flat, but that the further away you go, the more steeply the Yield decreases.


Finally, you may also have noticed that the Predicted Max Point Value is smaller
than several of the actually observed Yield values (sample Cube004a for instance
has a Yield of 98.7). This is not paradoxical, since the model smoothes the
observed values; those high observed values might not be reproduced if you
performed the same experiments again.

Drawing a Conclusion from the Optimization Design


The Response Surface Analysis gave a significant model, in which the quadratic
part in particular was significant, thus justifying the optimization experiments.

Since there was no apparent lack of fit, no outlier and the residuals showed no
clear pattern, the model could be considered valid and its results interpreted more
thoroughly.

The response surface showed an optimum predicted Yield of 96.747 for
TiCl4=0.835 and Morpholine=6.504; the predicted Yield is larger than 95 in the
neighboring area, so that even small deviations from the optimal settings of the
two variables will give quite acceptable results.

16.6 Practical Aspects of Making an Experimental Design
Before you set up an experimental design you have to think through your problem in
depth, formulate a clear objective, and consider how to avoid problems that might
occur.

Experimental design is a strategy to gather empirical knowledge, i.e. knowledge
based on the analysis of experimental data and not on theoretical models.

Building a design consists in carefully choosing a small number of experiments that
are to be performed under controlled conditions. There are four interrelated steps in
building a design:

• Define an objective for the experiment, for example “better understand” or “sort
out important variables” or “find optimum”.
• Define the variables that will be controlled during the experiment (design
variables), and their levels or ranges of variation.


• Define the variables that will be measured to describe the outcome of the
experimental runs (response variables), and examine their precision.
• Choose among available standard designs the one that is compatible with your
objective, number of design variables and precision of measurements, and has a
reasonable cost.

In the following sections we will go through the practical issues to consider, in the
same order as you enter them into the program when creating a new design.

Brainstorm - Problem Definition


Building a research and development project should always start with a
brainstorming session. This is where you are going to define your objective and
outline a strategy to achieve it. The brainstorming session may be the most important
step in the project!

Brainstorm - Creative Stage


In the creative part of the brainstorming session, it is very useful to involve everyone
with application knowledge and discuss everything that can vary in your product or
your process. If you make mistakes here you may need to redo the whole work. Here
is a nice check list:

• Participants - Make sure that the following people take part in the brainstorming
session:
• Project leader.
• People who know the application, the product, the process.
• People knowledgeable in measurement methods.
• Somebody who has some experience in experimental design and data analysis.
• People who will actually perform the experiments.

• Objective - Define:
• The application you are interested in. Which product? Which process?
• The output parameters (responses) you are going to study. Which properties? Are
they precisely defined? Make sure that all relevant responses are included!
• The measurement methods and protocols you will use. Is there a standard
measurement method for each of your responses? Have you got the necessary
instruments? Are your sensory descriptors adequate? Is the panel well trained?
• Your target values, for each response. Do you want the response value to be
maximum? Minimum? Within a certain range? As close as possible to a reference
value? No target value, just detect variations?


• Potential factors - List:
• The main components of your application. For example: product recipe,
production process, packaging, storage.
• Then, for each component, more detailed parts. For example, regarding product
recipe: list all the ingredients. Then, for each ingredient, list all possible
substitutes and note whether the concentration of the ingredient can vary.
• Do not forget to list uncontrolled factors, environmental conditions, etc.
• Do not omit a parameter just because “nobody has ever changed that before”!
• Be creative; do not eliminate any possibility at this stage.
• Never ignore the possibility of interactions.
• Also investigate conditions that give poor responses. Poor responses (but not too
poor) can give valuable information.
• Consider how to deal with variables that you cannot control which may affect
results.
• Do not assume that all chosen levels and combinations will work and give
reasonable responses.
• Do not continue with the experiment if it is clearly going wrong.
• Do not make optimization experiments until you know which variables really are
important.
• Be aware of possible outliers in the data.
• Do not focus too hard on non-linearities. The structure may be simpler than you
think!
• Do not overfit by including too many variables and interactions. You may end up
modeling errors instead of the underlying effects.

Reducing the Number of Potential Factors


Once you are through with this creative process, it is time to introduce some
structure in the list and start removing some of the possible factors of variation. Here
are some rules:

• Take practical constraints into account. If there is only one supplier for raw
material A, it is of no use to consider a possible change of supplier. If the
regulations fix the amount of ingredient B, this is not a potential variable any
more.
• Use your previous knowledge. If earlier studies have shown that, for a
representative set of samples, the preservative has no effect on the sensory
properties of your product, then you do not need to investigate the effect of the
preservative any more.


• Do not forget your common sense. The taste of the product does not depend on
the color of the package! But it might be influenced by the packaging material.
• But do not leave any potential factor aside just because you assume that it has no
effect or because it has always remained fixed before. Use reasonable arguments
to agree on what can vary and what may have an influence. If a parameter can
vary and if its influence cannot be excluded, then it should be studied.

Outlining a Strategy
After you have reduced the list of potential factors to its minimum size, you have to
check whether the remaining number of factors is compatible with the precision of
your objective. Usually you wish to describe the variations of your responses
precisely depending on the values of your input parameters. You may understand
intuitively that this is easier to achieve if you have a small number of input
parameters to study!

As a rule of thumb, if you have more than 4 or 5 potential factors of variation, we
advise you not to start with an optimization right away. It is much safer to break
down your project into a series of smaller projects, each of which has its own,
intermediate objective. Start with a screening stage, to detect main effects and/or
interactions and reduce the number of potential factors. If you have more than 6 to 8
potential factors to start with, you will need a first screening to sort out which are
really important, followed by a more advanced screening investigating the
interactions between a reduced number of factors. Once you have gotten the number
of important variables down to no more than five, you can perform your
optimization!

Steps and associated design types in an efficient experimental strategy.

Step:                First Screening          Advanced Screening        Optimization
Number of factors:   6 or more            ->  4 to 8                ->  2 to 5
Designs:             Fractional Factorial     Fractional Factorial      Central Composite
                     or                       (high resolution)         or
                     Plackett-Burman          or Full Factorial         Box-Behnken


Start Building Your Design


In The Unscrambler, in order to start the concrete definition of your experimental
plan, you will first have to choose what kind of design to build for the current step of
your investigation. This is easy if you have had a thorough brainstorm.

The designs available in The Unscrambler are the most common standard designs,
dealing with several continuous or category variables that can be varied
independently of each other.

They belong to one of the following families:

• Screening designs: Full or Fractional Factorial Designs, Plackett-Burman
Designs.
• Optimization designs: Central Composite Designs, Box-Behnken Designs.

Just select the type of design which matches the number of potential factors and
complexity level of your current step. For example, if you are ready to start an
advanced screening with 4 variables, choose a Full Factorial Design.

Since you already know how the main types of designs work, it is easy for you to
check that they will not lead to too many experiments. If you are unsure, try the most
economical type of design first. For instance, suppose you wish to study the main
effects and interactions of five parameters (advanced screening with 5 variables):
you could build a Full Factorial Design, with 2^5 = 32 experiments. But there is a
chance that a 2^(5-1) Fractional Factorial Design will give you as much information
with just 16 experiments (not counting center samples).

Define the Design Variables


Performing designed experiments is based on controlling the variations of the
variables whose effects you want to study. These variables with controlled variations
are called design variables. They are sometimes also referred to as just “factors”.

How to Select Design Variables


During the brainstorm, you have listed the potential factors and broken down your
global project into smaller parts. Each of these investigations will require its own
experimental design, for which you have to select exactly which design variables to
investigate, and how.


For a first screening, the most important rule is: do not leave out a variable that
might have an influence on the responses, unless you know that you cannot control it
in practice. It would be more costly to have to include one more variable at a later
stage, than to include one more into the first screening design.

For a more extensive screening, variables that are known not to interact with other
variables can be left out. If those variables have a negligible linear effect, you can
choose whatever constant value you wish for them (like for instance: the least
expensive). If those variables have a significant linear effect, then you should fix
them to the most suitable level to get the desired effect on the response.

The previous rule also applies to optimization designs, if you also know that the
variables in question have no quadratic effect. If you suspect that a variable can have
a non-linear effect, you should include it in the optimization stage.

In The Unscrambler, a design variable is completely defined by:

• Its name.
• Its type: continuous or category.
• Its levels.

Continuous Variables
All variables that have numerical values and that can be measured quantitatively are
called continuous variables. This is somewhat of an abuse of terminology in the case
of discrete quantitative variables, such as counts. It reflects the implicit use which
is made of these variables, namely modeling their variations using continuous functions.

Examples of continuous variables are: temperature, concentrations of ingredients (in
g/kg, or %…), pH, length (in mm), age (in years), number of failures in one year...

The variations of continuous design variables are usually set within a pre-defined
range, which goes from a lower level to an upper level. At least those two levels
have to be specified when defining a continuous design variable.

You can also choose to specify more levels if you wish to study some values
specifically.

If only two levels are specified, the other necessary levels will be computed
automatically. This applies to center samples (which use a mid-level, half-way
between lower and upper), and star samples in optimization designs (which use
extreme levels outside the predefined range).

Note!
If you have specified more than two levels, then center samples will be
disabled.

Category Variables
In The Unscrambler, all non-continuous variables are called category variables.
Their levels can be named, but not measured quantitatively. Examples of category
variables are: color (Blue, Red, Green), type of texture agent (starch, xanthane, corn
starch), supplier (Hoechst, Dow Chemicals, Unilever).

A special case of category variables is represented by binary variables, which have
only two levels. Binary variables symbolize an alternative.
Examples of binary variables are: use of a catalyst (Yes/No), recipe (New/Old), type
of sweetener (Artificial/Natural)...

For each category design variable, you have to specify all levels. Since there is a
kind of quantum jump from one level to another (there is no intermediate level in-
between), you cannot directly define center samples when there are category
variables.

Ranges of Variation of the Design Variables: How to Select Levels
Once you have decided which variables to investigate, appropriate ranges of
variation should be defined. Since we are investigating whether the response is
affected by a change of a design variable, we need to choose a large enough range.
Usually we think of the usual or natural value and choose a considerably lower and a
considerably higher value. They have to be different enough so that there is a good
chance that the responses will change when we go from the low to the high level, but
not so wide apart that the reaction “goes wild”. We choose a region that is rather
normal and meaningful to investigate.

Ranges of Variation for Screening Designs


For screening designs, you are generally interested in covering as large a region as
possible. On the other hand, no information is available in the regions between the
levels of the experimental factors (design variables) unless you assume that the
response behaves smoothly enough as a function of the design variables. Selecting
the adequate levels is a trade-off between these two aspects. However, do not
choose a range that gives only “good” quality! Bad results are also useful for
understanding how the system works. We need to select a range for each design
variable that spans all important variations!

Thus, a rule of thumb can be applied: make the range large enough to give an effect
and small enough to be realistic. If you suspect that two of the designed experiments
will give extreme, opposite results, perform those first. If the two results are indeed
different from each other, this means that you have generated enough variation. If
they are too far apart, you have generated too much variation, and you should shrink
the ranges a bit. If they are too close, try a center sample: you might just have a very
strong curvature!

Ranges of Variation for Optimization Designs


Since optimization designs are usually built after some kind of screening, you should
already know roughly in what area the optimum lies. Therefore, unless you are
building a Central Composite Design as an extension of a previous factorial design,
you should try to select a smaller range of variation, preferably in the most useful
direction. This is illustrated in the fish soup example on page 418. This way a
quadratic model will be more likely to approximate the true response surface
correctly. Often you may need to move to another region. It is also important to use
appropriate levels of the ingredients or process parameters that are to be kept
constant during the experiment.

If you are not sure that your ranges of variation are suitable, perform a few pre-trials.
Section “Do a Few Pre-Trials!” gives you practical rules for that.

Define the Non-Design Variables


When you perform the experiments you will measure the results; these
measurements are your responses. They may be sensory variables, chemical
parameters, instrumental measurements, quality characteristics, consumer
preferences, or numbers of microorganisms per cubic centimeter. You should later
enter these data into the program for statistical analysis.

In The Unscrambler, all variables appearing in the context of designed
experiments, which are not themselves design variables, are called non-design
variables. This includes the following three cases:
• Response variables, i.e. measured output variables that describe the outcome of
the experiments.


• Constant variables, i.e. variables that might have an influence on the outcome of
the experiments, and that are kept constant so as not to interfere with the design
variables.
• Non-controlled variables, i.e. variables that might have an influence on the
outcome of the experiments, and which you cannot control. In order to have a
possibility to take them into account in further analyses, you can record their
observed values during the experiments. In case they indeed vary, they may also
influence the results of the experiments and disturb the estimation of effects
using classical statistics like ANOVA (see the section on Analysis of Variance later in this chapter).
However, since The Unscrambler includes multivariate analysis methods like
PLS, they can still be analyzed using these methods instead, taking the
uncontrolled variables into account too.

Select Design Type and the Number of Experiments to Run


Once you have completely defined your design variables, as well as the non-design
variables, you are ready to define more precisely which type of design you will use,
and how many experiments to run. This is especially true for Fractional Factorial
Designs, which are available with varying degrees of fractionality. You can choose
your design type depending on the number of experiments you can afford and the
amount of information you need (the resolution of the design). The program shows
different options and the confounding patterns. By default, it suggests the most
economical alternative.

When you select among the available design types, you can interactively change
either the number of experiments to run or the resolution, which are linked. When
you make a change you can see how the confounding pattern changes, see page 374.
There is usually a trade-off between fewer experiments and less confounding.

Design Details
The next stage of your design specification concerns how to deal with possible
errors and uncertainty, by adding extra experimental points. These extra samples can
be of three types:

• Replicated design samples (replicate the whole design, in case of huge
variability).
• Center samples, i.e. experiments with intermediate levels for all design variables.
• Reference samples, for example today’s production recipe or similar, to use for
comparisons.


As soon as you enter new samples, an overview tells you how many experiments you
will now get in total.

Replicates
Replicates are experiments performed several times. They should not be confused
with repeated measurements, where the samples are only prepared once but the
measurements are performed several times on each.

When you try to find the effect on the response of changing from a low level to a
high level, you are really fitting a line between two experimental points. However,
do not forget that these points have observed values that are a combination of actual
(theoretical, unobserved) values and some error (experimental or measurement error)
that interferes with the observations. Therefore, each observed point has a certain
amount of imprecision that will reflect on the slope of the fitted line and make it
imprecise too. This is why you may include replicates into your design: by making
two or three experiments with the same settings, you will have the opportunity to
collect observed values that are slightly different from one another and thus estimate
the variability of your response.

Figure 16.48 - Replicates (assumed slope and range of possible slopes)

By making all experiments twice you get a better precision of the results. “One
replicate” means that you make each experiment only once, while “two replicates”
means that you make them twice. Whether you decide to replicate the whole design
or not depends on cost and reproducibility. If you know that there is a lot of
uncontrolled or unexplained variability in your experiments, it may be wise to
replicate the whole design (make it twice).


Why Should You Include Replicates?


Replicates are included in a design in order to make estimation of the experimental
error possible. This is doubly useful:
• It gives information about the average experimental error in itself.
• It enables you to compare response variations due to controlled causes (i.e. due to
variation in the design variables) with uncontrolled response variations. If the
“explainable” variation in a response is no larger than its random variation, then
it means that the changes in this response cannot be related to the levels of the
design variables.

Center Samples
Now fitting a straight line assumes that the underlying phenomenon is linear; but
what if it is not? You should have the means to detect such a situation. Center
samples are used to diagnose non-linearities. By also making an experiment where
all design variables take their mid-levels (in the middle between low and high), you
will have a chance to compare the response values at this point with the calculated
average response. If they are not equal, it means that the relationship is non-linear. In
the case of high curvature, you will have to build a new design to describe a
quadratic relationship, for example a Central Composite or Box-Behnken design.

Figure 16.50 - Center Samples (assumed shape and possible other shape)

Since replicating the whole experimental series usually is rather expensive, you may
instead include a replicated center point, i.e. you perform the “average” experiment
twice or three times. It can thus be used to check both the reproducibility of the
experiments (at least in the middle) and possible non-linearities. Of course, you can
never be sure that the level of imprecision is exactly the same in the center samples
as for the extreme levels of the design variables, but it is likely to be close to the
average variability.

We therefore recommend you to always make at least two center samples if possible!

Why Should You Include Center Samples?


• To check if the system is linear or non-linear.
• To check the experimental error, the reproducibility, and the precision of the
response measurements.

Center Samples when there are Category Variables


Center samples cannot be used when there are category design variables. The reason
is simple: how do you define the intermediate level between texture agents A and B?

A practical way to overcome this problem is to make a “center point” for each level
of the category variable, for example one using agent A and all other design
variables at their mid-level, and one using agent B and all other design variables at
their mid-level. Such pseudo-center samples can be included as reference samples,
see page 425. If you cannot make true center samples we recommend that you
replicate these reference samples (or one of them) and use them to check the
reproducibility of the experiments instead.

Center Samples in Optimization Designs


Optimization designs automatically include at least one center sample, which is
necessary as a kind of anchor point to the quadratic model. Furthermore, it is
strongly recommended to have more than one. The default number of center samples
for central composite and Box-Behnken designs is automatically computed so as to
achieve uniform precision of the model over the whole experimental region.

Reference Samples
Reference samples are experiments which do not belong to a standard design, but
which you choose to include for various purposes. If you want to compare the
designed samples with today’s production or competitors’ samples, these should be
included too. You will not enter the values of the design variables for these samples
(you seldom know the competitor’s recipe!) but their response values can usually be
measured and included in the analysis.


Another use of reference samples is to compensate for the fact that center samples
cannot be used when there are category design variables, as described on page 425.
Make pseudo-center samples as reference samples instead.

Why Should You Include Reference Samples?


• If you are trying to improve an existing product or process, you might use the
current recipe or process settings as reference. It may also be wise to include a
sample produced in the plant.
• If you are trying to copy an existing product, of which you do not know the
recipe, you might still include it as reference and measure your responses on that
sample as well as on the others, in order to know how close you have come to
that product.
• To check curvature in the case where some of the design variables are category
variables, you can include one reference sample with center levels of all
continuous variables for each level (or combination of levels) of the category
variable(s).

How to Include Replicates


The usual strategy consists in specifying several replicates of the center sample. This
has the advantage of being rather economical, and providing an estimation of the
experimental error in “average” conditions.

When no center sample can be defined (because of category variables, or of
variables with more than two levels), you may specify replicates for one or several
reference samples instead.

Randomization
Randomization consists in performing the experiments in random order, as opposed
to the standard order which is sorted according to the levels of the design variables.

Why is Randomization Useful?


Very often, the experimental conditions are likely to vary somewhat in time along
the course of the investigation. This is the case, for instance, when temperature and
humidity vary according to external meteorological conditions, or when the
experiments are carried out by a new employee who is better trained at the end of the
investigation than at the beginning. It is crucial not to risk confusing the effect of a
change over time with the effect of one of the investigated variables. To avoid such
misinterpretation, the order in which the experimental runs must be performed is
usually randomized.


Therefore the program sorts the experiments in random order when printing out the
lab report.

Incomplete Randomization
Sometimes, however, it is very impractical to perform all experiments in random
order. For example, the temperature may be very difficult or time-consuming to tune,
so the experiments will be performed much more efficiently if you tune that
parameter only a few times. It would be much easier to first run all experiments with
a low temperature and then all with a high temperature.

In The Unscrambler you can tick the box “Sorting Required During
Randomization” at the bottom of the Design details dialog before you select Finish.
Then you can select which variables you do not want to randomize. As a result, the
experimental runs will be sorted according to the non-randomized variable(s). This
will generate groups of samples with a constant value for those variables. Inside each
such group, the samples will be randomized according to the remaining variables.
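
The same logic is easy to emulate outside the program. The Python sketch below (with a hypothetical two-variable plan and Temp as the hard-to-change variable) sorts the runs on the non-randomized variable and shuffles the run order only within each of its levels:

import random

# Incomplete randomization: keep one hard-to-change variable sorted and
# randomize the run order only within each of its levels.
runs = [
    {"run": 1, "Temp": 20, "Conc": 1}, {"run": 2, "Temp": 20, "Conc": 3},
    {"run": 3, "Temp": 20, "Conc": 5}, {"run": 4, "Temp": 60, "Conc": 1},
    {"run": 5, "Temp": 60, "Conc": 3}, {"run": 6, "Temp": 60, "Conc": 5},
]

rng = random.Random(0)
order = []
for level in sorted({r["Temp"] for r in runs}):   # groups with constant Temp
    block = [r for r in runs if r["Temp"] == level]
    rng.shuffle(block)                            # randomize inside the group
    order.extend(block)

print([r["run"] for r in order])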

But remember that you have done this and be aware of possible systematic effects.
This may be detected by studying the so-called residuals.

Another case for incomplete randomization is blocking.

Do a Few Pre-Trials!
Sometimes there are combinations of low and high values for some variables that
cannot be accomplished. We recommend that you always do a few initial
experiments, for example the experiment where all design variables have their low
level values, the one with only high level values, and perhaps the center point where
all design variables have their average value. You should perform those two or three
experiments first, regardless of the randomization.

In this way you can easily check that the chosen range is wide enough. If the
responses are about the same for both extreme experiments, your selected
variables have no effect on the responses, or the range is too narrow. You can also
check that the experiments can be conducted and the responses measured as planned,
and have a chance to alter the procedure or the test plan before you have wasted
more effort. These initial experiments should thus also be used to check the
reproducibility and the measurement errors.


These experiments should normally give the most different samples, so if they are
too similar - rethink! You want to generate different samples in order to investigate
how the responses vary and to compare them.

16.7 Extending a Design


Once you have performed a series of designed experiments, analyzed their results,
and drawn a conclusion from them, two situations can occur:

• Either the experiments have provided you with all the information you needed:
then your project is completed.
• Or the experiments have given you valuable information, which you can use to
build a new series of experiments that will lead you closer to your objective.

In the latter case, sometimes the new series of experiments can be designed as a
complement to the previous design, in such a way that you minimize the number of
new experimental runs, and that the whole set of results from the two series of runs
can be analyzed together. This is called extending a design.

Which Designs Can Be Extended?


In The Unscrambler full and Fractional Factorial Designs, and in a limited way
Central Composite Designs, can be extended in various manners. Use the on-line
help for details.

Why Extend a Design?


In principle, you should make use of the extension feature whenever possible,
because it enables you to go one step further in your investigations with a minimum
of additional experimental runs, since it takes into account the already performed
experiments.

Extending an existing design is also a nice way to build a new, similar design that
can be analyzed together with the original one. For instance, if you have investigated
a baking process and recipe using a specific type of yeast, you might then want to
investigate another type of yeast in the same conditions as the first one, in order to
compare their performances. This can be achieved by adding a new design variable,
namely type of yeast, to the existing design.

Last but not least, you can use extensions as a basis for an efficient sequential
experimental strategy. That strategy consists in breaking your initial problem into a
series of smaller, intermediate problems, and investing in a small number of
experiments to achieve each of the intermediate objectives. Thus, if something goes
wrong at one stage, the losses are cut. And if all goes well, you will anyway end up
solving the initial problem at low cost, compared to a huge design from the
beginning.

When and How to Extend a Design


Let us now go briefly through the most common extension cases.

• Add levels: Whenever you are interested in investigating more levels of already
included design variables, especially for category variables.

• Add a design variable: Whenever a parameter that has been kept constant
previously is suspected of having a potential influence on the responses. Also,
whenever you wish to duplicate an existing design so as to apply it to new
conditions that differ by the values of one specific variable (continuous or
category), and analyze the results together. For instance, you have just
investigated a baking process using a specific yeast, and now wish to study
another similar yeast for the same process, and compare its performances to the
other one’s. The simplest way to do this is to extend the first design by adding a
new variable: type of yeast.

• Delete a design variable: If one or a few of the variables in the original design
have been determined as clearly non-significant by the analysis of effects, you
can increase the power of your conclusions by deleting these variables and
reanalyzing the design. Deleting a design variable can also be a first step before
extending a screening design into an optimization design. You should use this
option with caution if the effect of the removed variable is close to significance.
Also make sure that the variable you intend to remove does not participate in any
significant interaction!

• Add more replicates: If the first series of experiments shows that the
experimental error is unexpectedly high, replicating all experiments once more
might make your results clearer.

• Add more center samples: If you wish to get a better estimation of the
experimental error, adding a few center samples is a good and inexpensive
solution.


• Add more reference samples: Whenever new references are of interest, or if
you wish to include more replicates of the existing reference samples in order to
get a better estimation of the experimental error.

• Extend to higher resolution: Use this option for Fractional Factorial Designs
where some of the effects you are interested in are confounded with each other.
You can use that option whenever some of the confounded interactions are
significant, and you wish to find out which ones exactly. This is only possible if
there is a higher resolution Fractional Factorial Design. Otherwise, you can
extend to full factorial instead.

• Extend to full factorial: This applies to Fractional Factorial Designs where
some of the effects you are interested in are confounded with each other, when
there is no higher resolution fractional factorial.

• Extend to central composite: This option completes a Full Factorial Design by
adding star samples and (optionally) a few more center samples. Fractional
Factorial Designs can also be completed that way, by adding the necessary cube
samples as well. This should be used only when the number of design variables is
small; an intermediate step may be to delete a few variables first.

Caution!
Whichever kind of extension you use, remember that all the experimental
conditions not represented in the design variables must be the same for
the new experimental runs as for the previous runs.

16.8 Validation of Designed Data Sets


Validating designed data sets demands a little more consideration than other data
sets. Because all variables in X are varied equally and at random, the validation
methods may behave strangely.

Cross validation is impossible for Full Factorial Designs unless there are systematic
replicates, because each sample is equally important to the model. The validated
Y-variance using leverage correction may be useless too, because this method
simulates full cross validation.

If the main effects dominate, validation causes no problems. Validation may thus
primarily be a problem if interaction effects dominate. Use leverage correction
during calibration, but disregard the validation Y-variance. Study the calibration
Y-variance, which is a measure of the model fit, and study the Y-residuals.

With methods like PLS, it is possible to use RMSEC, Root Mean Square Error of
Calibration, or the calibration Y-variance. These measures express how well the
model has been fitted to the data.

16.9 Problems in Designed Data Sets


In the first descriptive stages of a data analysis, we want to get acquainted with each
of the variables we have measured, check their variations and detect possible data
transcription errors.

Before using more sophisticated classical or multivariate methods, we need a
summary of the raw data as they are: no data compression, no inference, just plain
observed values.

Descriptive statistics consist of a few measures extracted from the raw data, either
by picking out some key values, or by very simple calculations.

Percentiles are values extracted from the raw data. The Unscrambler gives you the
following percentiles:

• Minimum and Maximum: the extreme values encountered in the current group of
samples.
• Quartiles: the values inside which the middle half of the observed values are to be
found - or outside which the 25% largest and the 25% smallest values are
encountered.
• Median: the value which cuts the observed values into two equal halves; in other
words, 50% of the samples have a larger value than the median, the remaining
50% have a smaller value.

Percentiles are sometimes referred to as non-parametric summaries of a variable
distribution. No matter how the individual values are distributed, the percentiles
always have the same interpretation.


How to Detect Out-of-Range Values


To detect values outside the expected range of a given variable, study the
percentiles. You will immediately notice if the minimum or maximum are outside
the limits you expected.

Mean and Standard Deviation


The Mean and Standard deviation are computed from the observed values of a
variable, in the following way:

• The Mean is the average of the observed values, i.e. the sum of the values,
divided by the number of samples in the group.
• The Standard deviation (abbreviated Sdev) is the square root of the variance; the
variance is itself computed as the sum of squares of the deviations from the mean,
divided by the number of samples minus one.

The mean is supposed to give an indication of the central location of the samples, i.e.
a value around which the most typical samples are located. The standard deviation
provides a measure of the spread of the observed values around the average, i.e. how
much any sample taken from the same population is likely to vary around the
average.

These measures are often referred to as parametric summaries of a variable
distribution, because they can be used as estimators for the parameters of a normal
distribution. Thus, they should be interpreted with caution if the distribution is far
from normal. For instance, the mean value of an asymmetrically distributed variable
is influenced by the extreme values, and does not reflect the central location of most
samples.
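
Both kinds of summaries are straightforward to compute yourself. The short Python sketch below (on made-up response values) illustrates the percentiles as well as the mean and standard deviation with the number of samples minus one in the denominator:

import numpy as np

# Descriptive summary of one measured response (hypothetical values).
y = np.array([12.1, 13.4, 11.8, 15.2, 12.9, 14.0, 13.1, 12.5])

# Non-parametric summary: percentiles
p_min, q1, median, q3, p_max = np.percentile(y, [0, 25, 50, 75, 100])

# Parametric summary: mean and standard deviation (n - 1 in the denominator)
mean = y.mean()
sdev = y.std(ddof=1)

print(p_min, q1, median, q3, p_max)
print(mean, sdev)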

How to Detect Problems in the Replicated Center Samples


When analyzing the results from designed experiments, study the mean and standard
deviation for two groups of samples: the design samples, and the replicated center
(or reference) samples. Since the design samples are all different from each other,
you expect their response values to vary more than for one single replicated
experiment. So if it turns out that the standard deviation for the replicated center
samples is as large - or of the same magnitude - as for the whole design, you have a
problem. Check your raw data! The explanation may be one of the following:

• One of the replicated samples has an erroneous value, or the corresponding
experiment was performed under wrong conditions.


• The response variable varies very little over the design samples, or its variations
are due mostly to uncontrolled conditions.

How to Detect Curvature in a Screening Design


In a screening design, the effects are estimated under the assumption that the values
of the response variables vary linearly when the level of a design variable
changes. This may not be true! In practice, it is a simplification of a more complex
reality, made so as to be able to detect effects and test their significance. However, a
more complex description may be required, and this will be the purpose of a later
optimization design.

To know whether you need an optimization, have a look at the mean for the Center
samples, and compare it to the mean for the design samples (for each response
variable). If they differ noticeably, it means that at least one of the design variables
has a non-linear effect on the response. You will not be able to conclude with
certainty about which design variable has a non-linear effect, but at least you will
know with certainty that you need an optimization stage to describe the variations of
your response adequately.
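
As a simple illustration, the Python sketch below (with made-up response values) compares the mean of the replicated center samples with the mean of the design samples; a clear difference points to curvature:

import numpy as np

# Curvature check for a screening design (hypothetical response values).
y_design = np.array([62.0, 75.0, 58.0, 90.0, 64.0, 78.0, 61.0, 88.0])  # cube samples
y_center = np.array([85.0, 87.0, 84.0])                                # center samples

curvature = y_center.mean() - y_design.mean()
print(curvature)   # a value far from zero suggests a non-linear relationship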

16.9.1 Detect and Interpret Effects


Analysis of Effects in Short
Analysis of Effects is a collection of methods which enable you to summarize the
results from screening experiments. It gives the following information:

• Estimated value of the main effect of each design variable on each response.
• Estimated value of the interaction effect between two design variables for each
response (if the resolution of the design allows for these interactions to be
studied).
• Significance of these effects.
• In case of design variables with more than two levels, which levels generate
significantly different response values.

Once you have checked your raw data and corrected any data transcription errors,
and possibly completed the descriptive stage by performing a multivariate data
analysis, you are ready to start the inferential stage, i.e. draw conclusions from your
design.


The purpose of Analysis of Effects is to find out which design variables have the
largest influence on the response variables you have selected, and how significant
this influence is. It especially applies to screening designs.

Analysis of Effects includes ANOVA, Multiple comparisons and Significance
testing of the effects.

Analysis of Variance (ANOVA)


ANalysis Of VAriance (abbreviated as ANOVA) is based on breaking down a
response’s variations into several parts that can be compared to each other for
significance testing.

To test the significance of a particular effect, you have to compare the response’s
variance accounted for by that effect to the residual variance which summarizes
experimental error. If the “structured” variance (due to the effect) is no larger than
the “random” variance (error), then the effect can be considered negligible. Else it is
regarded as significant.

In practice, this is achieved through a series of successive computations.

• First, several sources of variation are defined. For instance, if the purpose of the
ANOVA model is to study the main effects of all design variables, each design
variable is a source of variation. Experimental error is also a source of variation.
• Each source of variation has a limited number of independent ways to cause
variation in the data. This number is called number of degrees of freedom (DF).
• Response variation associated to a specific source is measured by a sum of
squares (SS).
• Response variance associated to the same source is then computed by dividing
the sum of squares by the number of degrees of freedom. This ratio is called
mean square (MS).
• Once mean squares have been determined for all sources of variation, F-ratios
associated to every tested effect are computed as the ratio of MS(effect) to
MS(error). These ratios, which compare structured variance to residual variance,
have a statistical distribution which is used for significance testing. The higher
the ratio, the more important the effect.
• Under the null hypothesis that an effect’s true value is zero, the F-ratio has a
Fisher distribution. This makes it possible to estimate the probability of getting
such a high F-ratio under the null hypothesis. This probability is called p-value;
the smaller the p-value, the more likely it is that the observed effect is not due to
chance. Usually, an effect is declared significant if p-value<0.05 (significance at
the 5% level). Other classical thresholds are 0.01 and 0.005.

The ANOVA results are traditionally presented as a table, in the format illustrated in
Table 16.10.

Table 16.10 - Schematic representation of the ANOVA table

Source of    df                 SS                MS               F-ratio          p-value
variation
A            dfA = #levels-1    SSA               MSA = SSA/dfA    FA = MSA/MSerr   0.001
B            dfB = #levels-1    SSB               MSB = SSB/dfB    FB = MSB/MSerr   0.29
C            dfC = #levels-1    SSC               MSC = SSC/dfC    FC = MSC/MSerr   0.02
Error        dferr = n-1        SSerr = SSTot     MSerr =
             -dfA -dfB -dfC     -SSA -SSB -SSC    SSerr/dferr

Note!
The underlying computations of ANOVA are based on the MLR algorithm.
The effects are computed from the regression coefficients, according to
the following formula:

Main effect of a variable = 2•(b-coefficient of that variable).
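
To make the link between the regression coefficients, the effects and the F-test concrete, here is a small Python sketch for an unreplicated 2^3 factorial with made-up responses, where only the main effects are modeled and the remaining terms are pooled into the error (a generic illustration, not The Unscrambler's implementation):

import itertools
import numpy as np
from scipy import stats

# Coded 2^3 design, intercept plus main effects A, B, C.
cube = np.array(list(itertools.product([-1, 1], repeat=3)))
X = np.column_stack([np.ones(8), cube])
y = np.array([60.0, 72.0, 58.0, 75.0, 65.0, 79.0, 61.0, 83.0])   # hypothetical responses

b, *_ = np.linalg.lstsq(X, y, rcond=None)
effects = 2 * b[1:]                      # main effect = 2 * b-coefficient
print(dict(zip("ABC", effects)))

# F-test for effect A: MS(A) / MS(error); for a coded +/-1 design SS(A) = n * b_A^2.
resid = y - X @ b
df_err = len(y) - X.shape[1]
ms_err = resid @ resid / df_err
F_A = (len(y) * b[1] ** 2) / ms_err
p_A = stats.f.sf(F_A, 1, df_err)         # p-value from the Fisher distribution
print(F_A, p_A)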

Multiple Comparisons
Multiple comparisons apply whenever a design variable with more than two levels
has a significant effect. Their purpose is to determine which levels of the design
variable have significantly different response mean values.

The Unscrambler uses one of the most well-known procedures for multiple
comparisons: Tukey’s test. The levels of the design variable are sorted according to
their average response value, and non-significantly different levels are displayed
together.

Methods for Significance Testing


Apart from ANOVA, which tests the significance of the various effects included in
the model, using only the cube samples, Analysis of Effects also provides several

Multivariate Data Analysis in Practice


436 16. Introduction to Experimental Design

other methods for significance testing. They differ from each other by the way the
experimental error is estimated. In The Unscrambler, five different sources of
experimental error determine different methods. Read more about those methods in
the reference manual.

16.9.2 How to Separate Confounded Effects?


Specific confounded effects may be separated by carrying out a few extra
experiments. You do this only if the confounded effect has been determined to be
significant, of course.

Let us start with a simple example: a Fractional Factorial Design in four design
variables with reduction 1, i.e. a 2^(4-1) design of resolution IV. If the confounded
effect of AC and BD is significant (AC+BD), how do we separate them? Everywhere in
the design the variation of AC is always the same as that of BD. To be able to
separate them we must therefore run additional experiments, in which the variation of
AC is different from the variation of BD. This can actually be achieved by only one
extra experiment, in which the settings of AC and BD differ.

We pick one of the experiments, e.g. number 8, and then make a new one, number 9,
thus:

Table 16.11
No A B C D ... AC BD
8 +1 +1 +1 +1 ... +1 +1 (existing experiment)
9 +1 +1 -1 +1 ... -1 +1 (new experiment)

This extra run allows us to estimate AC and BD separately. This illustrates a great
advantage of Fractional Factorial Designs: confounded effects can be separated by
doing complementary runs. Clearly such a sequential experimentation strategy is very
efficient and economical.
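The mechanics can be checked numerically. The sketch below assumes the standard
generator D = ABC for the 2^(4-1) fraction (the generator is not restated here in the
text); it shows that the AC and BD columns are identical over the eight fractional
runs, and that adding run 9 from Table 16.11 makes them differ:

```python
import numpy as np
from itertools import product

# 2^(4-1) resolution IV fraction, assuming the generator D = ABC
base = np.array(list(product([-1, 1], repeat=3)), dtype=float)   # columns A, B, C
D = base[:, 0] * base[:, 1] * base[:, 2]
design = np.column_stack([base, D])                              # 8 runs: A B C D

AC = design[:, 0] * design[:, 2]
BD = design[:, 1] * design[:, 3]
print(np.array_equal(AC, BD))    # True: AC and BD vary identically -> confounded

# One extra run in which AC and BD take different signs (cf. Table 16.11, run 9)
extra = np.array([[1, 1, -1, 1]])        # A=+1, B=+1, C=-1, D=+1
design9 = np.vstack([design, extra])
AC9 = design9[:, 0] * design9[:, 2]
BD9 = design9[:, 1] * design9[:, 3]
print(np.array_equal(AC9, BD9))  # False: the two interactions can now be separated
```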

16.9.3 Blocking and Repeated Response Measurements

In cases where you suspect experimental conditions to vary from time to time or
from place to place, and when it is possible to perform only some of the experiments
under constant conditions, you may consider using blocking of your set of
experiments instead of free randomization. This means that you incorporate an extra
design variable for the blocks. Experimental runs must then be randomized within
each block.

Typical examples of blocking factors are:

• day (if several experimental runs can be performed the same day).
• operator or machine or instrument (when several of them must be used in parallel
to save time).
• batches (or shipments) of raw material (in case one batch is insufficient for all
runs).

Blocking is not handled automatically in The Unscrambler, but it can be done
manually using one or several additional design variables. Those variables should be
left out of the randomization.

Repeated Response Measurements


Repeating response measurements (for instance letting the judges try each sample
several times) is a way to decrease the uncertainty in response measurements. The
number of repeated measurements you make depends on the precision of the
measurement method. If it is very imprecise you need to make more repetitions.

You can also measure the standard deviation of the measurements on each sample,
as an expression for the precision of the measurement method, see Equation 16.3 (or
rather, the measurement error).

Equation 16.3        SDev = sqrt( Σ(i=1..I) (yi − ȳ)² / (I − 1) )

where
I = the number of reference measurements
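As a small numerical illustration of Equation 16.3 (with invented measurement values),
the standard deviation of the repeated measurements can be computed like this:

```python
import numpy as np

# Hypothetical repeated measurements of one response on a single sample
y = np.array([10.2, 10.5, 10.1, 10.4])
I = len(y)

sdev = np.sqrt(np.sum((y - y.mean()) ** 2) / (I - 1))   # Equation 16.3
print(sdev)                                             # same as np.std(y, ddof=1)
```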

Note!
It is important not to mix up repeated response measurements with
replicated experiments. With repeated measurements, each experiment is
carried out once; only the measurements are repeated several times.
Therefore, if you have repeated measurements but no true replicates, do
not specify your design as replicated! There are other ways to handle the
repeated response values in practice.


Before Analysis of Effects, PCA, etc., you will calculate the average of all repeated
measurements for each sample. There are two alternatives:

Average Data Manually Before Entering the Data


Average the measurements manually before entering the final results straight into the
design file.

Enter all Data and Check the Judges if Applicable


Enter the data for each of the judges and each replicate in a new data file. Analyze
the judges to make sure that they are all reliable. Then use the average functions in
The Unscrambler to average over judges and replicates. Finally import the
resulting data into the design file.

It is important that you enter the data in a sequence that makes it easy to analyze and
average. Use the following scheme:

Sample   Glossiness                                      Redness   …
         Judge 1                    Judge 2              …
         Repl A   Repl B   Repl C   Repl A   Repl B   Repl C   …
1
2
3
Then you can easily
• Define a Variable Set for each attribute and analyze the judges by Concordance
analysis.
• Average over the replicates and judges in one go, as sketched below.
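A minimal averaging sketch, assuming the data have been exported to a long-format
table with one row per (sample, judge, replicate) and using the pandas Python library
rather than The Unscrambler's own averaging functions; all values are invented:

```python
import pandas as pd

# Hypothetical long-format table: one row per (sample, judge, replicate)
data = pd.DataFrame({
    "Sample":     [1, 1, 1, 1, 2, 2, 2, 2],
    "Judge":      ["J1", "J1", "J2", "J2", "J1", "J1", "J2", "J2"],
    "Replicate":  ["A", "B", "A", "B", "A", "B", "A", "B"],
    "Glossiness": [5.1, 5.3, 4.9, 5.0, 7.2, 7.0, 6.8, 7.1],
    "Redness":    [2.0, 2.1, 2.3, 2.2, 3.5, 3.4, 3.6, 3.3],
})

# Average over judges and replicates in one go: one mean value per sample
averaged = data.groupby("Sample")[["Glossiness", "Redness"]].mean()
print(averaged)   # these averages would then be imported back into the design file
```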

16.9.4 Fold-Over Designs


Fold-over designs offer a means of finding a complementary fraction to separate all
main effects from confounded two-variable interactions. This may be useful if, for
example, you have made a Plackett-Burman design and discover that there probably
are interactions. You find a good example of the idea behind fold-over designs in the
Bible, in the Book of Daniel, chapter 1!

In a fold-over design you simply switch the signs of all the experimental settings of
the variables in the first design. Here is an example:


Table 16.12
First design Fold-over design
A B C AB AC BC ABC A B C AB AC BC ABC
-1 -1 -1 -1 -1 -1 -1 +1 +1 +1 +1 +1 +1 +1
-1 +1 -1 +1 -1 +1 +1 +1 -1 +1 -1 +1 -1 -1
+1 +1 -1 -1 +1 +1 -1 -1 -1 +1 +1 -1 -1 +1
-1 -1 +1 -1 +1 +1 +1 +1 +1 -1 +1 -1 -1 -1
+1 -1 +1 +1 -1 +1 -1 -1 +1 -1 -1 +1 -1 +1
-1 +1 +1 +1 +1 -1 -1 +1 -1 -1 -1 -1 +1 +1
+1 -1 -1 +1 +1 -1 +1 -1 +1 +1 -1 -1 +1 -1
+1 +1 +1 -1 -1 -1 +1 -1 -1 -1 +1 +1 +1 -1

When analyzing the two parts together, the main effects will be free from
confounding with two-variable interactions. Two-variable interactions may (still) be
confounded with each other. But you may often make an “educated” guess of which
term dominates in such confoundings; significant interaction effects are generally
found for variables which also have significant main effects.

Read more about the powerful fold-over designs in the design literature.
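The construction itself is easy to reproduce. The sketch below, using a 2^3 full
factorial purely as an illustration (not the design of Table 16.12), switches the
signs of all settings and verifies on the combined runs that a main-effect column is
orthogonal to a two-variable interaction column:

```python
import numpy as np
from itertools import product

# First design: a 2^3 full factorial in A, B, C (coded -1/+1)
first = np.array(list(product([-1, 1], repeat=3)), dtype=float)

# Fold-over: switch the signs of all settings, then analyze both halves together
fold_over = -first
combined = np.vstack([first, fold_over])

# In the combined design, every main-effect column is orthogonal to every
# two-variable interaction column, so the main effects are free of that confounding
A, B = combined[:, 0], combined[:, 1]
print(np.dot(A, A * B))   # 0.0 -> A is orthogonal to the AB interaction
```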

16.9.5 What Do We Do if We Cannot Keep to the Planned Variable Settings?

There are situations where you fail to keep to the planned levels of the variable
settings; either they cannot be kept constant or you cannot tune them accurately. If
the settings deviate too much from the plan, the whole basis for the standard analysis
of significant effects is violated and the estimated results may be wrong. What do we
do then?

We enter the real variable values and use PCA on the responses, and PCR or PLS to
make an ordinary multivariate projection model relating response variation to the
design variables. We disregard the Normal B-plot and instead use the loadings plot
to study the most important variables “as usual”. Even if the design was not run as
planned, you have probably generated a data set which spans the most important
variations as well as the interacting covariations.


The Importance of Having Measurements for All Design Samples

Analysis of Effects and Response Surface modeling, analysis methods which are
specially tailored for designed data sets, can only be run if response values are
available for all the designed samples. The reason is that those methods need
balanced data to be applicable.

As a consequence, you should be especially careful to collect response values for all
experiments. If you do not, for instance due to some instrument failure, it might be
advisable to re-do the experiment later so as to collect the missing values.

If, for some reason, some response values simply cannot be measured, you will still
be able to use the standard multivariate methods described in this book: PCA on the
responses, and PCR or PLS to relate response variation to the design variable.

16.9.6 A “Random Design”


Finally consider the situation where we know that there are many design variables
and many interactions (the higher order interactions cannot be ignored). In short,
every conceivable design plan will be far too complex. What do we do? Here is an
immediate idea: Use uncoded design variables. Make sure that they span all the
appropriate ranges. Even more important - make sure that there are many samples
from the intermediate ranges. In this fashion you have maximized the possibility for
all factors to interact. Now you can analyze this random design with the standard
PLS approach, using scores and loadings to interpret the relationships. For such a
big experimental design task as this one, we only have to pay the price of using a
relatively large number of experimental data. There is a price to pay for everything.

16.9.7 Modeling Uncoded Data


Uncoded data consist of the actual X-variable values and are available in the matrix
UnCode in The Unscrambler design files. A model based on these is practical if the
purpose is future prediction, since the prediction X-data will probably not be coded.

Remember to autoscale the variables in this situation. If some of the variables vary
between 0 and 1, scale them in an appropriate fashion to get values larger than 1
before you divide by their standard deviation. For instance, if a variable varies
between 0 and 1 gram, give its levels in milligrams instead.


Calculate enough components and study the residual Y-variance to find the
appropriate number of components to use for interpretation.

If the model is bad, for instance if the prediction error is high, one reason may be
non-linearities. Try to expand the X-matrix with cross- and square terms. Remember
to subtract the mean value of each variable before expansion. Then make a new
model and see if the added cross- and square terms do the trick.
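A minimal sketch of this preprocessing, assuming a small matrix of uncoded settings
(the values and the helper name expand_and_autoscale are invented for illustration);
it mean-centers the variables, appends cross- and square terms, and then autoscales
the expanded matrix before it would be passed to PLS:

```python
import numpy as np

def expand_and_autoscale(X):
    """Mean-center X, append cross- and square terms, then autoscale everything.

    A generic sketch; it does not reproduce The Unscrambler's internal steps."""
    Xc = X - X.mean(axis=0)                  # subtract the mean before expansion
    n, k = Xc.shape
    cross = [Xc[:, i] * Xc[:, j] for i in range(k) for j in range(i + 1, k)]
    square = [Xc[:, i] ** 2 for i in range(k)]
    Xfull = np.column_stack([Xc] + cross + square)
    # Autoscale: center and divide each column by its standard deviation
    return (Xfull - Xfull.mean(axis=0)) / Xfull.std(axis=0, ddof=1)

# Hypothetical uncoded settings for two process variables
X = np.array([[6.0, 5.0], [18.0, 5.0], [6.0, 15.0], [18.0, 15.0], [12.0, 10.0]])
print(expand_and_autoscale(X).shape)         # 5 samples x 5 expanded variables
```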

This procedure may also be an alternative if the basis for a chosen design has failed.
Suppose you make a Plackett-Burman design but there are clear interactions. The
data set then contains too little information for the ambition of the stated modeling.
Make a fold-over design, add the new experiments to the data set, and make a PLS-
model as usual based on the uncoded (real) values expanded with cross terms. Note
that the results from a Plackett-Burman design cannot be used to estimate curvature
since it does not contain any axial points.

16.10 Exercise - Designed Data with Non-Stipulated Values (Lacotid)

Purpose
This exercise is based on data from a factorial design where the planned settings
could not be maintained. The data must therefore be analyzed by PLS instead of the
classical analysis of significant effects. Such data give rise to some “unexpected”
phenomena during bilinear modeling that it is useful to be aware of. This exercise
also illustrates the interpretation of effects.

Problem
Lacotid is a white crystalline powder used in medicine. The synthesis of Lacotid is a
two-stage process: 1) synthesis, 2) crystallization. The synthesis of the raw products
is performed in a methanol solution (MeOH). A slurry of the raw product is then
pumped into a new container where crystallization takes place. Crystallization is
performed by gradually adding isopropanol (C3H7OH) to the slurry. The producers
of Lacotid wanted to increase the yield and make the production more stable and
closer to optimum, as the yield was only 50% and there were large variations in
quality. After some
initial experiments the factory concluded that the main variations occurred in the
crystallization stage. They therefore planned to improve the monitoring of the
crystallization process to ensure stable and optimal production of Lacotid with
respect to yield and quality. To achieve this they first needed to find which process
parameters have significant effects on the yield.

It was decided to study the crystallization process using a factorial experimental
design to determine the main effects. Some parameters were assumed to have little
or no effect on yield and quality, and were thus not investigated.

In factorial designs all X-variables take only two values (high or low), because the
goal is to investigate if Y is affected by a change in each X-variable. Because of
problems in keeping some of the variables at their planned levels, the data could not
be analyzed by the traditional methods used to analyze experimental designs. They
therefore had to work with the real x-values and use a PLS model instead to interpret
the variable relationships.

The data for this exercise were provided by the Norwegian Society for Process
Control at a workshop in 1993. They are based on a real application.

Data Set
LACOTID with the variable Sets X-Data and Y-Data.
The X variables are:

Table 16.13 - Variable description


Name Description Units
MeOHrest Rest of methanol MeOH in Lacotid at start. %
MeOHprop Proportion of methanol in methanol+propanol %
solution
Feed_vel Feed velocity of propanol %
Stirring Stirring speed at addition of propanol rpm
Temp Temperature of propanol °C
Crystemp Lowest crystallization-temperature °C
Duration Duration of crystallization hours

There are two responses in Y: Yield in % and purity (GC) of the yield (measured by
gas chromatography). Twenty experiments were made using a 2^(7-3) factorial design
with 4 center points. The replicated center points are marked R. Center points have
the mean values of all X-variables. They are replicated to check the error in the
reference method.


Task
Make a PLS model to study which process variables influence the yield and quality.

How to Do it
1. Plot the raw data and study the data table to get acquainted with the data.
Calculate general statistics of the variables (both X and Y) in the Task-
Statistics menu. Note the SDev of Y1 and Y2.

2. Make a PLS2 model. Should we standardize the data? Use leverage correction
initially.

Study the variance from the Model overview. You should see that the explained
calibration X variance is very small. In factorial designs, by definition, all the
X-variables have been systematically varied in the same way. That is, no one
variable varies more than another, so normally the explained calibration
X-variance will be zero. (In this case the planned settings could not be kept
completely constant, so there is a small variation, and even a small decrease in
PC2.) In factorial designs, do not pay attention to these X-variances as they have
no meaning here.

Study the residual Y-variance. Observe the big difference in variance for Yield
and GC. What does this mean? Also look at RMSEP. How many components do
we need? How much of the variance of Y is explained?

Take a look at the score plot. Is it really meaningful to interpret PC2? The center
points represent the average samples. Study the loading plot. Which process
variables have highest effect on the Yield?

3. Calculate the SDev of the variable GC for the four samples marked R. Compare
this SDev to the standard deviation of all samples in Y2. Use this to explain why
the modeling of Y2 is so bad.

4. Run a new model on Yield only. Study the Y-variance, the score plot and the
loading plot. Check your conclusions from the first model. How many
components should we use? Can we say anything about possible interactions?

Summary
Data should be standardized if they are not coded. Yield is explained by one PLS-
component, but GC is not explained at all. Normally we get one PLS-component per


Y-variable with PLS2 of factorial designs, but since the X-data deviate a little from
the planned design we may accept one or two components more.

SDev of GC for the replicates is 0.33 and the standard deviation of GC for all
samples is 0.47, i.e. the error in the reference method is of the same order as the
spread of the response in all the samples. Obviously the chosen measurement method
cannot be used, therefore you cannot model Purity based on these data. We do not
know from this whether or not the chosen design variables have any effect on the
purity.

Either 1 or 2 PCs should be used. In the PLS2 model it is difficult to say how many
PCs to use, since the error has a local minimum in PC1. In that model we do not
know if the increase in PC2 is caused by real problems or overfitting. The PLS2
model suggests that 2 components can be used. Since these data are only generated
to find the most important variables, not to make the perfect prediction model, we
can look at the 2 PC model for convenience (it is a bit easier to study the
2-vector plots than the 1-vector plots, but pay most attention to PC1).

PC1 suggests variable X1 and X2 as the most important for the Yield. X6 has some
contribution in PC2. If we continue with optimization experiments, we could
perhaps include X6.

From the loading plots it is clear that variable X1 and X2 covary. We cannot say if
they also interact, unless we add the variable X1*X2 in the X-matrix. The high
degree of explained Y-variance (80% at 2 PCs) without the interaction term suggests
that this is not a significant effect. (If you are interested in experimenting, you can
delete the non-significant effects from the X-matrix and make a new model including
only X1 and X2. Study the Normal probability plot of residuals.)

16.11 Experimental Design Procedure in The Unscrambler

Here is a brief listing of the work procedure:

Part I Set Up the Design


1. Define your goal for the experiments.
2. Define which variables you consider important.
3. Open the File – New Design menu.


4. Choose From Scratch, and choose the Design type that you need.
5. Define the low and high levels of the design variables (the ones you plan to
vary, X-variables), enter variable names, units and blocks (e.g. experimenters).
Define the number of responses (Y-variables) you plan to measure.
6. Choose the suitable number of experiments and resolution for your purposes.
In factorial designs, check whether the effects you are most interested in will be
clear or confounded. If you are not satisfied, select a different design.
7. Define the number of repeated measurements, and the number of center
points per block, as required. Here we can add reference samples as well.
8. We can also choose to sort during randomization.
9. The program now calculates an experimental pattern that spans all variables
optimally and generates a randomized experimental plan for the lab. The
X-matrix is automatically expanded with all interaction effects as additional
X-variables, and stored as a Design file for later use.
10. Preview and Print out the randomized experimental plan for the lab.
11. Re-randomize if you are not happy with the randomization.

Part II Analysis
12. When the experiments have been performed, enter the lab results (the
responses) as Y-variables. Note! Entering response values is disabled in the
training version of the program.
13. The next analyses are found in Plot and Task menu.
14. Data checks: Plot the raw data using line, scatter or histogram plots, and use
Statistics and PCA if there are several responses. The purpose is to find
transcription errors and to identify more serious problems.
15. Descriptive analysis: Statistics or PCA. The purpose is to check ranges of
variation for each response and correlation among responses.
16. Inferential analysis: Analysis of effects with ANOVA and significance testing.
At this stage we want to find the significance of each effect and choose to leave
out non-significant effects.
17. Predictive analysis (optimization designs): Response Surface analysis with
ANOVA and PLS2 if several responses.

Make more experiments if necessary, for example using optimization designs, fold-
over designs or complementary runs to separate confoundings. With just a little work
you will quickly develop enough personal experience to build a sequential
experimental strategy.


17. Complex Experimental Design Problems

This chapter introduces the “tricky” situations in which classical designs based
upon the factorial principle do not apply. Here, you will learn about two specific
cases:
1. Constraints between the levels of several design variables;
2. A special case: mixture situations.
Each of these situations will then be described extensively in the next sections.

17.1 Introduction to Complex Experimental Design Problems

17.1.1 Constraints Between the Levels of Several Design Variables

A manufacturer of prepared foods wants to investigate the impact of several
processing parameters on the sensory properties of cooked, marinated meat. The
meat is to be first immersed in a marinade, then steam-cooked, and finally deep-
fried. The steaming and frying temperatures are fixed; the marinating and cooking
times are the process parameters of interest.
The process engineer wants to investigate the effect of the three process variables
within the following ranges of variation:

Table 17.1: Ranges of the process variables for the cooked meat design
Process variable Low High
Marinating time 6 hours 18 hours
Steaming time 5 min 15 min
Frying time 5 min 15 min

A full factorial design would lead to the following “cube” experiments:


Table 17.2: The cooked meat full factorial design


Sample Mar. Time Steam. Time Fry. Time
1 6 5 5
2 18 5 5
3 6 15 5
4 18 15 5
5 6 5 15
6 18 5 15
7 6 15 15
8 18 15 15

When seeing this table, the process engineer expresses strong doubts that
experimental design can be of any help to him. “Why?” asks the statistician in
charge. “Well,” replies the engineer, “if the meat is steamed then fried for 5
minutes each it will not be cooked, and at 15 minutes each it will be overcooked
and burned on the surface. In either case, we will not get any valid sensory ratings,
because the products will be far beyond the ranges of acceptability.”
After some discussion, the process engineer and the statistician agree that an
additional condition should be included:
“In order for the meat to be suitably cooked, the sum of the two cooking times
should remain between 16 and 24 minutes for all experiments”.
This type of restriction is called a multi-linear constraint. In the current case, it
can be written in mathematical form as two inequalities:

Steam + Fry ≥ 16 and Steam + Fry ≤ 24
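In practice such a constraint simply acts as a filter on the candidate experimental
points. The Python sketch below (with the candidate levels taken from Table 17.3)
keeps only the combinations of marinating, steaming and frying times that satisfy
the two inequalities:

```python
from itertools import product

# Candidate levels for the three process variables (levels as used in Table 17.3)
marinating = [6, 18]          # hours
steaming = [5, 9, 11, 15]     # minutes
frying = [5, 9, 11, 15]       # minutes

# Keep only the combinations that satisfy 16 <= Steaming + Frying <= 24
feasible = [(m, s, f)
            for m, s, f in product(marinating, steaming, frying)
            if 16 <= s + f <= 24]
print(len(feasible))     # the runs of Table 17.3 are the corner points of this region
for run in feasible[:4]:
    print(run)
```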

The impact of these constraints on the shape of the experimental region is shown in
Figure 17.1 and Figure 17.2:


Figure 17.1: The cooked meat experimental region - no constraint
[Figure: the cube spanned by Steaming (5-15 min), Frying (5-15 min) and
Marinating (6-18 h)]

Figure 17.2: The cooked meat experimental region - multi-linear constraints
[Figure: the same axes, with the two constraints 16 ≤ Steaming + Frying ≤ 24
cutting away parts of the cube]

The constrained experimental region is no longer a cube! As a consequence, it is
impossible to build a full factorial design in order to explore that region.
The design that best spans the new region is given in Table 17.3.


Table 17.3: The cooked meat constrained design


Sample Mar. Time Steam. Time Fry. Time
1 6 5 11
2 6 5 15
3 6 9 15
4 6 11 5
5 6 15 5
6 6 15 9
7 18 5 11
8 18 5 15
9 18 9 15
10 18 11 5
11 18 15 5
12 18 15 9

As you can see, it contains all "corners" of the experimental region, in the same
way as the full factorial design does when the experimental region has the shape of
a cube.

Depending on the number and complexity of multi-linear constraints to be taken
into account, the shape of the experimental region can be more or less complex. In
the worst cases, it may be almost impossible to imagine! Therefore, building a
design to screen or optimize variables linked by multi-linear constraints requires
special methods. Chapter 17.1.3 "Alternative Solutions" will briefly introduce two
ways to build constrained designs.

17.1.2 A Special Case: Mixture Situations


A colleague of our process engineer, working in the Product Development
department, has a different problem to solve: optimize a pancake mix. The mix
consists of the following ingredients: wheat flour, sugar and egg powder. It will be
sold in retail units of 100 g, to be mixed with milk for reconstitution of a pancake
dough.

The product developer has learnt about experimental design, and tries to set up an
adequate design to study the properties of the pancake dough as a function of the
amounts of flour, sugar and egg in the mix. She starts by plotting the region that
encompasses all possible combinations of those three ingredients, and soon
discovers that it has quite a peculiar shape:


Figure 17.3: The pancake mix experimental region
[Figure: the (Flour, Sugar, Egg) space with each axis running from 0 to 100%; all
possible blends lie on a triangular surface whose corners are 100% Flour, 100% Sugar
and 100% Egg, whose edges are the two-ingredient blends (only Flour and Egg, only
Sugar and Egg, only Flour and Sugar) and whose interior holds the mixtures of all
3 ingredients]

The reason, as you will have guessed, is that the mixture always has to add up to a
total of 100 g. This is a special case of multi-linear constraint, which can be written
with a single equation:

Flour + Sugar + Egg = 100

This is called the mixture constraint: the sum of all mixture components is 100%
of the total amount of product.

The practical consequence, as you will also have noticed, is that the mixture region
defined by three ingredients is not a three-dimensional region! It is contained in a
two-dimensional surface called a simplex.
Therefore, mixture situations require specific designs. Their principles will be
introduced in the next chapter.

17.1.3 Alternative Solutions


There are several ways to deal with constrained experimental regions. We are going
to focus on two well-known, proven methods:
• Classical mixture designs take advantage of the regular simplex shape
that can be obtained under favorable conditions.


• In all other cases, a design can be computed algorithmically by applying the
D-optimal principle.

Designs Based on a Simplex


Let us continue with the pancake mix example. We will have a look at the pancake
mix simplex from a very special point of view. Since the region defined by the
three mixture components is a two-dimensional surface, why not forget about the
original three dimensions and focus only on this triangular surface?

Figure 17.4: The pancake mix simplex
[Figure: the simplex drawn as a triangle with corners 100% Flour, 100% Sugar and
100% Egg; each edge corresponds to 0% of the opposite component, and the centroid
is the blend with 33.3% Flour, 33.3% Sugar and 33.3% Egg]

This simplex contains all possible combinations of the three ingredients flour,
sugar and egg. As you can see, it is completely symmetrical. You could substitute
egg for flour, sugar for egg and flour for sugar in the figure, and still get exactly the
same shape.

Classical mixture designs take advantage of this symmetry. They include a varying
number of experimental points, depending on the purposes of the investigation. But
whatever this purpose and whatever the total number of experiments, these points
are always symmetrically distributed, so that all mixture variables play equally
important roles. These designs thus ensure that the effects of all investigated
mixture variables will be studied with the same precision. This property is
equivalent to the properties of factorial, central composite or Box-Behnken designs
for non-constrained situations.

Figure 17.5 shows two examples of classical mixture designs.


Figure 17.5: Two classical designs for 3 mixture components
[Figure: two (Flour, Sugar, Egg) simplexes showing the design points of the two
designs described below]

The first design in Figure 17.5 is very simple. It contains three corner samples
(pure mixture components), three edge centers (binary mixtures) and only one
mixture of all three ingredients, the centroid.

The second one contains more points, spanning the mixture region regularly in a
triangular lattice pattern. It contains all possible combinations (within the mixture
constraint) of five levels of each ingredient. It is similar to a 5-level full factorial
design - except that many combinations, like "25%,25%,25%" or
"50%,75%,100%", are excluded because they are outside the simplex.

You can read more about classical mixture designs in Chapter 17.2 "The Mixture
Situation".

D-Optimal Designs
Let us now consider the meat example again (see Chapter 17.1.1 "Constraints
Between the Levels of Several Design Variables"), and simplify it by focusing on
Steaming time and Frying time, and taking into account only one constraint:
Steaming time + Frying time ≤ 24. Figure 17.6 shows the impact of the constraint
on the variations of the two design variables.


Figure 17.6: The constraint cuts off one corner of the "cube"
[Figure: the square spanned by Steaming (5-15) and Frying (5-15), with the line
S + F = 24 cutting off its upper corner]

If we try to build a design with only 4 experiments, as in the full factorial
design, we will automatically end up with an imperfect solution that leaves a
portion of the experimental region unexplored. This is illustrated in Figure 17.7.

Figure 17.7: Designs with 4 points leave out a portion of the experimental region
[Figure: two alternative 4-point designs, I and II, drawn in the constrained region
whose corners are numbered 1-5; each design leaves a different portion unexplored]

From this figure it can be seen that design II is better than design I, because the
left-out area is smaller. A design using points (1,3,4,5) would be equivalent to (I),
and a design using points (1,2,4,5) would be equivalent to (II). The worst solution
would be a design with points (2,3,4,5): it would leave out the whole corner
defined by points 1, 2 and 5.

Thus it becomes obvious that, if we want to explore the whole experimental region,
we need more than 4 points. Actually, in the above example, the five points
(1,2,3,4,5) are necessary. These five crucial points are the extreme vertices of the
constrained experimental region. They have the following property: if you were to

Multivariate Data Analysis in Practice


17. Complex Experimental Design Problems 455

wrap a sheet of paper around those points, the shape of the experimental region
would appear, materialized formed by your wrapping.

Every time you add a constraint, you INCREASE the number of vertices.

When the number of variables increases and more constraints are introduced, it is
not always possible to include all extreme vertices into the design. In these cases
you need a decision rule to select the best possible subset of points to include in
your design. There are many possible rules; one of them is based on the so-called
D-optimal principle, which consists in enclosing the maximum volume within the
selected points. In other words, you know that a wrapping of the selected points
will not exactly reconstitute the experimental region you are interested in, but you
want to leave out the smallest possible portion.

Read more about D-optimal designs and their various applications in Section 17.3,
"How To Deal With Constraints".

17.2 The Mixture Situation


This chapter addresses the classical mixture case, where at least three ingredients
are combined to form a blend, and three additional conditions are fulfilled:
1. The total amount of the blend is fixed (e.g. 100%);
2. There are no other constraints linking the proportions of two or more of
the ingredients;
3. The ranges of variation of the proportions of the mixture ingredients are
such that the experimental region has the regular shape of a simplex (see
Section 17.3.4, "When is the Mixture Region a Simplex?").

These conditions will be clarified and illustrated by an example. Then three
possible applications will be considered, and the corresponding designs will be
presented.

17.2.1 An Example of Mixture Design


This example, taken from John A. Cornell’s reference book “Experiments With
Mixtures”, illustrates the basic principles and specific features of mixture designs.

A fruit punch is to be prepared by blending three types of fruit juice: watermelon,
pineapple and orange. The purpose of the manufacturer is to use their large supplies
of watermelons by introducing watermelon juice, of little value by itself, into a
blend of fruit juices. Therefore, the fruit punch has to contain a substantial amount
of watermelon - at least 30% of the total. Pineapple and orange have been selected
as the other components of the mixture, since juices from these fruits are easy to
get and inexpensive.

The manufacturer decides to use experimental design to find out which
combination of those three ingredients maximizes consumer acceptance of the taste
of the punch. The ranges of variation selected for the experiment are as follows:

Table 17.4: Ranges of variation for the fruit punch design


Ingredient Low High Centroid
Watermelon 30% 100% 54%
Pineapple 0% 70% 23%
Orange 0% 70% 23%

You can see at once that the resulting experimental design will have a number of
features which make it very different from a factorial or central composite design.
Firstly, the ranges of variation of the three variables are not independent. Since
Watermelon has a low level of 30%, the high level of Pineapple cannot be higher
than 100 - 30 = 70%. The same holds for Orange.

The second striking feature concerns the levels of the three variables for the point
called “centroid”: these levels are not half-way between “low” and “high”, they are
closer to the low level. The reason is, once again, that the blend has to add up to a
total of 100%.

Since the levels of the various concentrations of ingredients to be investigated
cannot vary independently from each other, these variables cannot be handled in
the same way as the design variables encountered in a factorial or central composite
design. To mark this difference, we will refer to those variables as mixture
components (or mixture variables).

Whenever the low and high levels of the mixture components are such that the
mixture region is a simplex (as shown in Chapter 17.1.2, "A Special Case: Mixture
Situations"), classical mixture designs can be built. Read more about the necessary
conditions in Section 17.3.4, "When is the Mixture Region a Simplex?".


These designs have a fixed shape, depending only on the number of mixture
components and on the objective of your investigation. For instance, we can build a
design for the optimization of the concentrations of Watermelon, Pineapple and
Orange juice in Cornell's fruit punch, as shown in Figure 17.8.

Figure 17.8: Design for the optimization of the fruit punch composition
[Figure: the fruit punch simplex inside the (Watermelon, Pineapple, Orange)
triangle; it is bounded below by 30% W and reaches 100% W at the top corner, with
Pineapple and Orange each ranging from 0 to 70%]

The next chapters will introduce the three types of mixture designs which are the
most suitable for three different objectives:
• 1- Screening of the effects of several mixture components;
• 2- Optimization of the concentrations of several mixture components;
• 3- Even coverage of an experimental region.

17.2.2 Screening Designs for Mixtures


In a screening situation, you are mostly interested in studying the main effects of
each of your mixture components.

What is the best way to build a mixture design for screening purposes? To answer
this question, let us go back to the concept of main effect.

The main effect of an input variable on a response is the change occurring in the
response values when the input variable varies from Low to High, all experimental
conditions being otherwise comparable.


In a factorial design, the levels of the design variables are combined in a balanced
way, so that you can follow what happens to the response value when a particular
design variable goes from Low to High. It is mathematically possible to compute
the main effect of that design variable, because its Low and High levels have been
combined with the same levels of all the other design variables.

In a mixture situation, this is no longer possible. Look at the previous figure: while
30% Watermelon can be combined with (70% P, 0% O) and (0% P, 70% O),
100% Watermelon can only be combined with (0% P, 0% O)!

To find a way out of this dead end, we have to transpose the concept of "otherwise
comparable conditions" to the constrained mixture situation. To follow what
happens when Watermelon varies from 30% to 100%, let us compensate for this
variation in such a way that the mixture still adds up to 100%, without disturbing
the balance of the other mixture components. This is achieved by moving along an
axis where the proportions of the other mixture components remain constant, as
shown in Figure 17.9.

Figure 17.9: Studying variations in the proportion of Watermelon
[Figure: the Watermelon axis of the simplex. W varies from 30 to 100% while P and O
compensate in fixed, equal proportions; the points along the axis are
(30% W, 70% [1/2 P + 1/2 O]), (53% W, 47% [1/2 P + 1/2 O]),
(77% W, 23% [1/2 P + 1/2 O]) and (100% W, 0% [1/2 P + 1/2 O])]

The most "representative" axis to move along is the one where the other mixture
components have equal proportions. For instance, in the above figure, Pineapple
and Orange each use up one half of the remaining volume once Watermelon has
been determined.


Mixture designs based upon the axes of the simplex are called axial designs. They
are the best suited for screening purposes because they manage to capture the main
effect of each mixture component in a simple and economical way.

A more general type of axial design is represented, for 4 variables, in the next
figure. As you can see, most of the points are located inside the simplex: they are
mixtures of all 4 components. Only the four corners, or vertices (containing the
maximum concentration of an individual component) are located on the surface of
the experimental region.

Figure 17.10: A 4-component axial design
[Figure: a tetrahedral 4-component simplex showing the four vertices, the axial
points, the overall centroid and the optional end points]

Each axial point is placed halfway between the overall centroid of the simplex
(25%,25%,25%,25%) and a specific vertex. Thus the path leading from the
centroid ("neutral" situation) to a vertex (extreme situation with respect to one
specific component) is well described with the help of the axial point.

In addition, end points can be included; they are located on the surface of the
simplex, opposite to a vertex (they are marked by crosses in the figure). They
contain the minimum concentration of a specific component. When end points are
included in an axial design, the whole path leading from minimum to maximum
concentration is studied.


17.2.3 Optimization Designs for Mixtures


If you wish to optimize the concentrations of several mixture components, you need
a design that enables you to predict with a high accuracy what happens for any
mixture - whether it involves all components or only a subset.

It is a well-known fact that peculiar behaviors often happen when a concentration
drops down to zero. For instance, to prepare the base for a Dijon mayonnaise, you
need to blend Dijon mustard, egg and vegetable oil. Have you ever tried - or been
forced by circumstances - to remove the egg from the recipe? If you do, you will
get a dressing with a different appearance and texture. This illustrates the
importance of interactions (e.g. between egg and oil) in mixture applications.

Thus, an optimization design for mixtures will include a large number of blends of
only two, three, or more generally a subset of the components you want to study.
The most regular design including those sub-blends is called simplex-centroid
design. It is based on the centroids of the simplex: balanced blends of a subset of
the mixture components of interest. For instance, to optimize the concentrations of
three ingredients, each of them varying between 0 and 100%, the simplex-centroid
design will consist of:
• 1- The 3 vertices: (100,0,0), (0,100,0) and (0,0,100);
• 2- The 3 edge centers (or centroids of the 2-dimensional sub-simplexes defining
binary mixtures): (50,50,0), (50,0,50) and (0,50,50);
• 3- The overall centroid: (33,33,33).

A more general type of simplex-centroid design is represented, for 4 variables, in
Figure 17.11.


Figure 17.11: A 4-component simplex-centroid design
[Figure: a tetrahedral 4-component simplex showing the vertices, the 2nd order
centroids (edge centers), the 3rd order centroids (face centers), the overall
centroid and the optional interior points]

If all mixture components vary from 0 to 100%, the blends forming the simplex-
centroid design are as follows:
• 1- The vertices are pure components;
• 2- The second order centroids (edge centers) are binary mixtures with equal
proportions of the selected two components;
• 3- The third order centroids (face centers) are ternary mixtures with equal
proportions of the selected three components;
• …..
• N- The overall centroid is a mixture where all N components have equal
proportions.

In addition, interior points can be included in the design. They improve the
precision of the results by "anchoring" the design with additional complete
mixtures. The most regular design is obtained by adding interior points located
halfway between the overall centroid and each vertex. They have the same
composition as the axial points in an axial design.
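The enumeration above is easy to reproduce. The following sketch generates the blends
of a simplex-centroid design for q components, each varying from 0 to 100%; it is a
generic illustration (with an invented function name), not The Unscrambler's routine:

```python
from itertools import combinations

def simplex_centroid(q=3, total=100.0):
    """Simplex-centroid design: for every non-empty subset of the q mixture
    components, one blend with equal proportions of that subset and 0 elsewhere."""
    points = []
    for size in range(1, q + 1):
        for subset in combinations(range(q), size):
            blend = [0.0] * q
            for j in subset:
                blend[j] = round(total / size, 1)
            points.append(tuple(blend))
    return points

for blend in simplex_centroid(q=3):
    print(blend)   # 3 vertices, 3 edge centers and the overall centroid: 7 blends
```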

17.2.4 Designs that Cover a Mixture Region Evenly

Sometimes you may not be specifically interested in a screening or optimization
design. In fact, you may not even know whether you are ready for a screening! For


example, you just want to investigate what would happen if you mixed three
ingredients which you have never tried to mix before.

This is one of the cases when your main purpose is to cover the mixture region as
evenly and regularly as possible. Designs which address that purpose are called
simplex-lattice designs. They consist of a network of points located at regular
intervals between the vertices of the simplex. Depending on how thoroughly you
want to investigate the mixture region, the network will be more or less dense,
including a varying number of intermediate levels of the mixture components. As
such, it is quite similar to an N-level full factorial design. Figure 17.12 illustrates
this similarity.

Figure 17.12: A 4th degree simplex-lattice design is similar to a 5-level full
factorial
[Figure: left, the (Flour, Sugar, Egg) simplex covered by a regular triangular
lattice of blends; right, a 5-level full factorial grid in Baking temperature and
Time]

In the same way as a full factorial design (which, depending on the number of
levels, can be used for screening, optimization or other purposes), simplex-lattice
designs have a wide variety of applications, depending on their degree (the number
of intervals between points along the edge of the simplex). Here are a few:
- Feasibility study (degree 1 or 2): are the blends feasible at all?
- Optimization: with a lattice of degree 3 or more, there are enough points to fit a
precise response surface model.
- Search for a special behavior or property which only occurs in an unknown,
limited sub-region of the simplex.
- Calibration: prepare a set of blends on which several types of properties will be
measured, in order to fit a regression model to these properties. For instance, you
may wish to relate the texture of a product, as assessed by a sensory panel, to the
parameters measured by a texture analyzer. If you know that texture is likely to
vary as a function of the composition of the blend, a simplex-lattice design is
probably the best way to generate a representative, balanced calibration data set.
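For completeness, here is a small sketch that enumerates the blends of a simplex-lattice
design of a given degree (as defined above) for components varying from 0 to 100%; again
a generic illustration with an invented function name, not The Unscrambler's routine:

```python
from itertools import product

def simplex_lattice(q=3, degree=4, total=100.0):
    """All blends of a {q, degree} simplex lattice: every proportion is a multiple
    of total/degree and the proportions always add up to total."""
    step = total / degree
    return [tuple(round(k * step, 2) for k in combo)
            for combo in product(range(degree + 1), repeat=q)
            if sum(combo) == degree]

lattice = simplex_lattice(q=3, degree=4)
print(len(lattice))   # 15 blends for 3 components and degree 4
print(lattice[:3])    # e.g. (0.0, 0.0, 100.0), (0.0, 25.0, 75.0), ...
```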

17.3 How To Deal With Constraints


In this chapter, you will learn about the more general cases of constrained designs,
which apply whenever there is no pre-defined solution like a classical mixture
design.

Since there is no "template" that can automatically be applied, the design will have
to be computed algorithmically to fit your own particular situation. The main
principle underlying these computations is the D-optimal principle. The chapters to
come explain this principle and its practical implications.

17.3.1 Introduction to the D-Optimal Principle

If you are familiar with factorial designs, you probably know that their most
interesting feature is that they allow you to study all effects independently from
each other. This property, called orthogonality, is vital for relating variations of
the responses to variations in the design variables. It is what allows you to draw
conclusions about cause and effect relationships. It has another advantage, namely
minimizing the error in the estimation of the effects.

Constrained Designs are Not Orthogonal


As soon as multi-linear constraints are introduced among the design variables, it is
no longer possible to build an orthogonal design. This can be grasped intuitively if
you understand that orthogonality is equivalent to the fact that all design variables
are varied independently from each other. As soon as the variations in one of the
design variables are linked to those of another design variable, orthogonality cannot
be achieved.

In order to minimize the negative consequences of a deviation from the ideal
orthogonal case, you need a measure of the "lack of orthogonality" of a design.
This measure is provided by the condition number, defined as follows:
Cond# = square root (largest eigenvalue / smallest eigenvalue)


which is linked to the elongation or degree of "non-sphericity" of the region
actually explored by the design. The smaller the condition number, the more
spherical the region, and the closer you are to an orthogonal design.

Note!
An eigenvalue gives a measure of the size or significance of a dimension
(or PC). The NIPALS algorithm extracts PCs in order of decreasing
Eigenvalues. That is to say that if the eigenvalues are equal, then each
dimension is equivalent to the others so the space is spherical. If the
largest dimension is much larger than the smallest dimension, then the
region is “flat”.
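The condition number defined above is easy to compute from the singular values of the
experimental matrix (the square roots of the eigenvalues of X'X). A short numpy sketch,
using a 2^3 full factorial purely as an example:

```python
import numpy as np
from itertools import product

def condition_number(X):
    """Cond# = sqrt(largest eigenvalue / smallest eigenvalue) of X'X, which equals
    the ratio of the largest to the smallest singular value of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return s.max() / s.min()

# An orthogonal 2^3 full factorial in coded -1/+1 units has condition number 1
X_orth = np.array(list(product([-1, 1], repeat=3)), dtype=float)
print(condition_number(X_orth))        # 1.0

# Removing one corner run destroys orthogonality and the condition number grows
print(condition_number(X_orth[:-1]))   # about 1.26
```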

Small Condition Number Means Large Enclosed Volume


Another important property of an experimental design is its ability to explore the
whole region of possible combinations of the levels of the design variables. It can
be shown that, once the shape of the experimental region has been determined by
the constraints, the design with the smallest condition number is the one that
encloses maximal volume.

In the ideal case, if all extreme vertices are included into the design, it has the
smallest attainable condition number. If that solution is too expensive, however,
you will have to make a selection of a smaller number of points. The automatic
consequence is that the condition number will increase and the enclosed volume
will decrease. This is illustrated by Figure 17.13.

Figure 17.13: With only 8 points, the enclosed volume is not optimal
[Figure: the region of interest, with the portion left unexplored by the 8 selected
points indicated]

How to Build a D-Optimal Design


First, the purpose of the design has to be expressed in the form of a mathematical
model. The model does not have the same shape for a screening design as for an
optimization design.

Once the model has been fixed, the condition number of the "experimental
matrix", which contains one column per effect in the model, and one row per
experimental point, can be computed.

The D-optimal algorithm will then consist in:


1- Deciding how many points the design should include. Read more about that in
17. "How Many Experiments Are Necessary?".
2- Generating a set of candidate points, among which the points of the design will
be selected. The nature of the relevant candidate points depends on the shape of the
model. Read the next chapters for more details.
3- Selecting a subset with the desired number of points more or less randomly, and
computing the condition number of the resulting experimental matrix.
4- Exchanging one of the selected points with a left-over point, and comparing the
new condition number to the previous one. If it is lower, the old point is replaced
by the new one; otherwise another left-over point is tried. This process can be
re-iterated a large number of times.

When the exchange of points does not give any further improvements, the
algorithm stops and the subset of candidate points giving the lowest condition
number is selected.
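The following Python sketch illustrates the exchange idea on an invented candidate set;
it is a simplified point-exchange loop driven by the condition number, not the actual
algorithm implemented in The Unscrambler:

```python
import numpy as np

def exchange_design(candidates, n_points, n_iter=2000, seed=0):
    """Simplified point-exchange loop driven by the condition number.
    Illustrative only; not The Unscrambler's implementation."""
    rng = np.random.default_rng(seed)

    def cond(ix):
        # Experimental matrix for a linear model: one column per effect
        M = np.column_stack([np.ones(len(ix)), candidates[ix]])
        s = np.linalg.svd(M, compute_uv=False)
        return s.max() / s.min() if s.min() > 1e-12 else np.inf

    idx = rng.choice(len(candidates), size=n_points, replace=False)
    best = cond(idx)
    for _ in range(n_iter):
        out = rng.integers(n_points)                   # drop one selected point
        pool = np.setdiff1d(np.arange(len(candidates)), idx)
        trial = idx.copy()
        trial[out] = rng.choice(pool)                  # try one left-over point
        c = cond(trial)
        if c < best:                                   # keep the swap if it helps
            idx, best = trial, c
    return idx, best

# Invented candidate points: extreme vertices of some constrained region (3 variables)
cands = np.array([[-1, -1, -1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1],
                  [1, 1, -1], [1, -1, 1], [-1, 1, 1], [0.5, 0.5, 1]])
subset, cond_no = exchange_design(cands, n_points=6)
print(subset, round(cond_no, 2))
```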

How Good is my Design?


The excellence of a D-optimal design is expressed by its condition number, which,
as we have seen previously, depends on the shape of the model as well as on the
selected points.

In the simplest case of a linear model, an orthogonal design such as a full
factorial would have a condition number of 1. It follows that the condition number
of a D-optimal design will always be larger than 1. A D-optimal design with a
linear model is acceptable up to a cond# around 10.

If the model gets more complex, it becomes more and more difficult to control the
increase in the condition number. For practical purposes, one can say that a design
including interaction and/or square effects is usable up to a cond# around 50.

If you end up with a cond# much larger than 50 no matter how many points you
include in the design, it probably means that your experimental region is too
constrained. In such a case, it is recommended to re-examine all your design

Multivariate Data Analysis in Practice


466 17. Complex Experimental Design Problems

variables and constraints with a critical eye, and search for ways to simplify your
problem (see 17.3.4 , "Advanced Topics"). Else you run the risk of starting an
expensive series of experiments which will not give you any useful information at
all.

17.3.2 Non-Mixture D-Optimal Designs


D-optimal designs for situations which do not involve a blend of constituents with a
fixed total will be referred to as "non-mixture" D-optimal designs. To differentiate
them from mixture components, we will call the design variables involved in non-
mixture designs process variables.

A non-mixture D-optimal design is the solution to your experimental design
problem every time you want to investigate the effects of several process variables
linked by one or more multi-linear constraints. It is built according to the D-optimal
principle described in the previous chapter.

D-Optimal Designs for Screening Stages


If your purpose is to focus on the main effects of your design variables, and
optionally to describe some or all of the interactions among them, you will need a
linear model, optionally with interaction effects.

The set of candidate points for the generation of the D-optimal design will then
consist mostly of the extreme vertices of the constrained experimental region. If the
number of variables is small enough, edge centers and higher order centroids can
also be included.

In addition, center samples are automatically included in the design (whenever they
apply); they are not submitted to the D-optimal selection procedure.

D-Optimal Designs for Optimization Purposes


When you want to investigate the effects of your design variables with enough
precision to describe a response surface accurately, you need a quadratic model.
This model requires intermediate points (situated somewhere between the extreme
vertices) so that the square effects can be computed.


The set of candidate points for a D-optimal optimization design will thus include:
- all extreme vertices;
- all edge centers;
- all face centers and constraint plane centroids.

To imagine the result in three dimensions, you can picture yourself a combination
of a Box-Behnken design (which includes all edge centers) and a Cubic Centered
Faces design (with all corners and all face centers). The main difference is that the
constrained region is not a cube, but a more complex polyhedron.

The D-optimal procedure will then select a suitable subset from these candidate
points, and several replicates of the overall center will also be included.

17.3.3 Mixture D-Optimal Designs


The D-optimal principle can solve mixture problems in two situations:
1- The mixture region is not a simplex.
2- Mixture variables have to be combined with process variables.

Pure Mixture Experiments


When the mixture region is not a simplex (see 17.3.4, "When is the Mixture
Region a Simplex?"), a D-optimal design can be generated in a way similar to the
process cases described in the previous chapter.

Here again, the set of candidate points depends on the shape of the model. You may
look up Section 17.4.2, "Relevant Regression Models", for more details on mixture
models.

The overall centroid is always included in the design, and is not subject to the
D-optimal selection procedure.

Note!
Classical mixture designs have much better properties than D-optimal
designs. Remember this before establishing additional constraints on
your mixture components!

Section 17.3.4 "How to Select Reasonable Constraints" tells you more about how
to avoid unnecessary constraints.


How to Combine Mixture and Process Variables


Sometimes the product properties you are interested in depend on the combination
of a mixture recipe with specific process settings. In such cases, it is useful to
investigate mixture and process variables together.

The Unscrambler offers three different ways to build a design combining mixture
and process variables. They are described below.

The Mixture Region is a Simplex


When your mixture region is a simplex, you may combine a classical mixture
design, as described in Chapter 17.2, "The Mixture Situation", with the levels of
your process variables, in two different ways.

The first solution is useful when several process variables are included in the
design. It applies the D-optimal algorithm to select a subset of the candidate points,
which are generated by combining the complete mixture design with a full factorial
in the process variables.

Note!
The D-optimal algorithm will usually select only the extreme vertices of
the mixture region. Be aware that the resulting design may not always be
relevant!

The D-optimal solution is acceptable if you are in a screening situation (with a
large number of variables to study) and the mixture components have a lower limit.
If the latter condition is not fulfilled, the design will include only pure components,
which is probably not what you had in mind!

The alternative is to use the whole set of candidate points. In such a design, each
mixture is combined with all levels of the process variables. The figure below
illustrates two such situations.


Figure 17.14: Two full factorial combinations of process variables with complete
mixture designs
[Figure: left, Screening: an axial design in the (Flour, Sugar, Egg) simplex
combined with a 2-level factorial; right, Optimization: a simplex-centroid design
combined with a 3-level factorial]

This solution is recommended (if the number of factorial combinations is
reasonable) whenever it is important to explore the mixture region precisely.

The Mixture Region is Not a Simplex


If your mixture region is not a simplex, you have no choice: the design has to be
computed by a D-optimal algorithm. The candidate points consist of combinations
of the extreme vertices (and optionally lower-order centroids) with all levels of the
process variables. From these candidate points, the algorithm will select a subset of
the desired size.

Note!
When the mixture region is not a simplex, only continuous process
variables are allowed.

17.3.4 Advanced Topics


This last section focuses on more technical or "tricky" issues related to the
computation of constrained designs.


When is the Mixture Region a Simplex?


In a mixture situation where all concentrations vary from 0 to 100%, we have seen
in previous chapters that the experimental region has the shape of a simplex. This
shape reflects the mixture constraint (sum of all concentrations = 100%).

Note that if some of the ingredients do not vary in concentration, the sum of the
mixture components of interest (called Mix Sum in the program) is smaller than
100%, to leave room for the fixed ingredients. For instance if you wish to prepare a
fruit punch by blending varying amounts of Watermelon, Pineapple and Orange,
with a fixed 10% of sugar, Mix Sum is then equal to 90% and the mixture
constraint becomes "sum of the concentrations of all varying components = 90%".
In such a case, unless you impose further restrictions on your variables, each
mixture component varies between 0 and 90% and the mixture region is also a
simplex.

Whenever the mixture components are further constrained, like in the example
shown in Figure 17.15, the mixture region is usually not a simplex.

Figure 17.15: With a multi-linear constraint, the mixture region is not a simplex.
(Simplex with vertices Watermelon, Orange and Pineapple; the experimental region
is the part where W ≥ 2*P, bounded by the line W = 2*P.)

In the absence of multi-linear constraints, the shape of the mixture region depends
on the relationship between the lower and upper bounds of the mixture
components.
It is a simplex if:
The upper bound of each mixture component is larger than or equal to
Mix Sum - (sum of the lower bounds of the other components).

Figure 17.16 illustrates one case where the mixture region is a simplex, and one
case where it is not.


Figure 17.16: Changing the upper bound of Watermelon (W) affects the shape of the
mixture region. Left: with lower bounds of 17% on Watermelon, Orange (O) and
Pineapple (P) and an upper bound of 66% on Watermelon, the mixture region is a
simplex. Right: with the upper bound of Watermelon lowered to 55%, the mixture
region is not a simplex.

In the leftmost case, the upper bound of Watermelon is 66% = 100% - (17% + 17%): the
mixture region is a simplex. If the upper bound of Watermelon is shifted down to 55%,
it becomes smaller than 100% - (17% + 17%) and the mixture region is no longer a
simplex.

Note!
When the mixture components only have Lower bounds, the mixture
region is always a simplex.
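The rule above can be checked mechanically. Here is a minimal Python sketch (plain Python; the function name is ours, not part of The Unscrambler), using the bounds of Figure 17.16 as an illustration:

def is_simplex(lower, upper, mix_sum=100.0):
    # The mixture region is a simplex when, for every component, its upper
    # bound is at least mix_sum minus the sum of the other components'
    # lower bounds (the rule stated above).
    for name in lower:
        others_low = sum(v for k, v in lower.items() if k != name)
        if upper[name] < mix_sum - others_low:
            return False
    return True

low = {"Watermelon": 17, "Orange": 17, "Pineapple": 17}
print(is_simplex(low, {"Watermelon": 66, "Orange": 66, "Pineapple": 66}))  # True
print(is_simplex(low, {"Watermelon": 55, "Orange": 66, "Pineapple": 66}))  # False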

How to Deal with Small Proportions


In a mixture situation, it is important to notice that variations in the major
constituents are only marginally influenced by changes in the minor constituents.
For instance, an ingredient varying between 0.02 and 0.05% will not noticeably
disturb the mixture total; thus it can be considered to vary independently from the
other constituents of the blend.

This means that ingredients which are represented in the mixture with a very small
proportion can, in a way, "escape" from the mixture constraint.

So whenever one of the minor constituents of your mixture plays an important role
in the product properties, you can investigate its effects by treating it as a process
variable. See 17.3.3 "How to Combine Mixture and Process Variables" for more
details.


Do You Really Need a Mixture Design?


A special case occurs when all the ingredients of interest have small proportions.
Let us consider the following example: a water-based soft drink consists of about
98% water, plus an artificial sweetener, a coloring agent, and plant extracts. Even if
the sum of the "non-water" ingredients varies from 0 to 3%, the impact on the
proportion of water will be negligible.

It does not make any sense to treat such a situation as a true mixture; it will be
better addressed by building a classical orthogonal design (full or fractional
factorial, central composite, Box-Behnken, depending on your objectives).

How to Select Reasonable Constraints


There are various types of constraints on the levels of design variables. At least
three different situations can be considered.
1. Some of the levels or their combinations are physically impossible. For instance:
a mixture with a total of 110%, or a negative concentration.
2. Although the combinations are feasible, you know that they are not relevant, or
that they will result in difficult situations. Examples: some of the product properties
cannot be measured, or there may be discontinuities in the product properties.
3. Some of the combinations which are physically possible and would not lead to
any complications are not desired, for instance because of the cost of the
ingredients.

When you start defining a new design, think twice about any constraint you intend
to introduce. An unnecessary constraint will not help you solve your problem
faster; on the contrary, it will make the design more complex, and may lead to more
experiments or poorer results.

Physical constraints
The first two cases mentioned above can be called "real constraints". You cannot
disregard them; if you do, you will end up with missing values in some of your
experiments, or uninterpretable results.

Constraints of cost
The third case, however, can be referred to as "imaginary constraints". Whenever
you are tempted to introduce such a constraint, examine the impact it will have on
the shape of your design. If it turns a perfectly regular and symmetrical situation,
which can be solved with a classical design (factorial or classical mixture), into a
complex problem requiring a D-optimal algorithm, you will be better off just
dropping the constraint.

Build a standard design, and take the constraint into account afterwards, at the
result interpretation stage. For instance, you can add the constraint to your response
surface plot, and select the optimum solution within the constrained region.

This also applies to Upper bounds on mixture components. As mentioned in section
17.3.4 "When is the Mixture Region a Simplex?", if all mixture components have
only Lower bounds, the mixture region will automatically be a simplex. Remember
that, and avoid imposing an Upper bound on a constituent playing a similar role to
the others just because it is more expensive and you would like to limit its usage to
a minimum. It will be soon enough to do this at the interpretation stage, selecting
the mixture which gives you the desired properties with the smallest amount of that
constituent.

How Many Experiments Are Necessary?


In a D-optimal design, the minimum number of experiments can be derived from
the shape of the model, according to the basic rule that
In order to fit a model studying p effects, you need at least n=p+1
experiments.

Note that if you stick to that rule without allowing for any extra margin, you will
end up with a so-called saturated design, that is to say without any residual degrees
of freedom. This is not a desirable situation, especially in an optimization context.

Therefore, The Unscrambler uses the following default number of experiments (n),
where p is the number of effects included in the model:
- For screening designs: n = p + 4 + 3 center samples;
- For optimization designs: n = p + 6 + 3 center samples.

A D-optimal design computed with the default number of experiments will have, in
addition to the replicated center samples, enough additional degrees of freedom to
provide a reliable and stable estimation of the effects in the model.

However, depending on the geometry of the constrained experimental region, the
default number of experiments may not be the ideal one. Therefore, whenever you
choose a starting number of points, The Unscrambler automatically computes 4
designs, with n-1, n, n+1 and n+2 points. The best two are selected and their
condition number is displayed, allowing you to choose one of them, or decide to
give it another try.
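For readers who want to reproduce this kind of comparison outside the program, here is a minimal sketch (assuming numpy; "candidate_designs" is a hypothetical placeholder for your own model-expanded candidate matrices, not an object provided by The Unscrambler) of the condition number used to rank candidate designs:

import numpy as np

def condition_number(model_matrix):
    # Ratio of the largest to the smallest singular value of the model matrix;
    # lower is better, and 1 corresponds to a perfectly orthogonal design.
    s = np.linalg.svd(np.asarray(model_matrix, dtype=float), compute_uv=False)
    return s[0] / s[-1]

# Hypothetical use: compare candidate designs with n-1, n, n+1 and n+2 points.
# for X in candidate_designs:
#     print(len(X), condition_number(X))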

Read more about the choice of a model in Chapter 17.4.2 , "Relevant Regression
Models".

17.4 How To Analyze Results From Constrained Experiments
In this section, you will learn how to analyze the results from constrained
experiments with methods that take into account the specific features of the design.

17.4.1 Use of PLS Regression For Constrained Designs
PLS regression is a projection method that decomposes variations within the
X-space (predictors, e.g. design variables or mixture proportions) and the Y-space
(responses to be predicted) along separate sets of PLS components (referred to as
PCs). For each dimension of the model (i.e. PC1, PC2, etc.), the summary of X is
"biased" so that it is as correlated as possible to the summary of Y. This is how the
projection process manages to capture the variations in X that can "explain"
variations in Y.

A side effect of the projection principle is that PLS not only builds a model of
Y=f(X), it also studies the shape of the multidimensional swarm of points formed
by the experimental samples with respect to the X-variables. In other words, it
describes the distribution of your samples in the X-space.

Thus any constraints present when building a design, will automatically be detected
by PLS because of their impact on the sample distribution. A PLS model therefore
has the ability to implicitly take into account multi-linear constraints, mixture
constraints, or both. Furthermore, the correlation or even the linear relationships
introduced among the predictors by these constraints, will not have any negative
effects on the performance or interpretability of a PLS model, contrary to what
happens with MLR.
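A minimal illustration of this point (assuming numpy and scikit-learn as stand-ins; this is not The Unscrambler's implementation, and the response values are invented for the sketch): PLS fits happily on the perfectly collinear X-matrix of a mixture design, where MLR would run into trouble.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Simplex-centroid design in three components; each row sums to 1, so the
# X-columns are exactly collinear (the mixture constraint).
X = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5],
    [1/3, 1/3, 1/3],
])
y = np.array([2.8, 1.4, 1.8, 2.4, 2.8, 1.9, 2.5])   # invented response values

pls = PLSRegression(n_components=2, scale=True)
pls.fit(X, y)
print(pls.coef_.ravel())   # effects of each component relative to the design centroid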


Analyzing mixture designs with PLS


When you build a PLS model on the results of mixture experiments, here is what
happens:
1. The X-data are centered, i.e. further results will be interpreted as deviations from
an average situation, which is the overall centroid of the design;
2. The Y-data are also centered, i.e. further results will be interpreted as an increase
or decrease compared to the average response values;
3. The mixture constraint is implicitly taken into account in the model, i.e. the
regression coefficients can be interpreted as showing the impact of variations in
each mixture component when the other ingredients compensate with equal
proportions.

In other words: the regression coefficients from a PLS model tell you exactly what
happens when you move from the overall centroid towards each corner, along the
axes of the simplex.

This property is extremely useful for the analysis of screening mixture experiments:
it enables you to interpret the regression coefficients quite naturally as the main
effects of each mixture component.

The mixture constraint has even more complex consequences on a higher degree
model necessary for the analysis of optimization mixture experiments. Here again,
PLS performs very well, and the mixture response surface plot enables you to
interpret the results visually (see Chapter 17.4.3 , "The Mixture Response Surface
Plot" for more details).

Analyzing D-optimal designs with PLS


PLS regression deals with badly conditioned experimental matrices (i.e. non-
orthogonal X-variables) much better than MLR would do. Actually, the larger the
condition number, the more PLS outperforms MLR.

Thus PLS regression is the method of choice to analyze the results from D-optimal
designs, no matter whether they involve mixture variables or not.

How Significant Are the Results?


The significance of the effects can be assessed visually by looking at the size of the
regression coefficients. This is an approximate assessment using the following rule
of thumb:


- If the regression coefficient for a variable is larger than 0.2 in absolute value, then
the effect of that variable is most probably significant.
- If the regression coefficient is smaller than 0.1 in absolute value, then the effect is
negligible.
- Between 0.1 and 0.2: "gray zone" where no certain conclusion can be drawn.

Note!
In order to be able to compare the relative sizes of your regression
coefficients, do not forget to standardize all variables (both X and Y)!

The best and easiest way to check the significance of the effects is to use Martens’
Uncertainty test, which allows The Unscrambler to detect and mark the
significant X-variables (see Chapter 14).
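The rule of thumb above is easy to automate once the model has been fitted on standardized variables. A minimal sketch (plain Python; the coefficient values are invented for illustration only):

def effect_assessment(b):
    # Rule of thumb for a standardized regression coefficient b.
    if abs(b) > 0.2:
        return "most probably significant"
    if abs(b) < 0.1:
        return "negligible"
    return "gray zone - no certain conclusion"

for name, b in {"A": 0.35, "B": -0.05, "A*B": 0.15}.items():
    print(name, effect_assessment(b))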

17.4.2 Relevant Regression Models


The shape of your regression model has to be chosen bearing in mind the objective
of the experiments and their analysis. Moreover, the choice of a model plays a
significant role in determining which points to include in a design; this applies to
classical mixture designs as well as D-optimal designs.

Therefore, The Unscrambler asks you to choose a model immediately after you
have defined your design variables, prior to determining the type of classical
mixture design or the selection of points building up the D-optimal design which
best fits your current purposes.

The minimum number of experiments also depends on the shape of your model;
read more about it in section 17.3.4 "How Many Experiments Are Necessary?".

Models for Non-mixture situations


For constrained designs which do not involve any mixture variables, the choice of a
model is straightforward.

Screening designs are based on a linear model, with or without interactions. The
interactions to be included can be selected freely among all possible products of
two design variables.


Optimization designs require a quadratic model, which consists of linear terms
(main effects), interaction effects, and square terms making it possible to study the
curvature of the response surface.

Models for Mixture Variables


As soon as your design involves mixture variables, the mixture constraint has a
remarkable impact on the possible shapes of your model. Since the sum of the
mixture components is constant, each mixture component can be expressed as a
function of the others. As a consequence, the terms of the model are also linked and
you are not free to select any combination of linear, interaction or quadratic terms
you may fancy.

In a mixture design, the interaction and square effects are linked and
cannot be studied separately.

Example: A, B and C vary from 0 to 1, and A + B + C = 1 for all mixtures. Therefore,
C can be re-written as 1 - (A + B).

As a consequence, the square effect C*C (or C²) can also be re-written as
(1 - (A + B))² = 1 + A² + B² - 2A - 2B + 2A*B: it does not make any sense to try to
interpret square effects independently from main effects and interactions.
In the same way, A*C can be re-expressed as A*(1 - A - B) = A - A² - A*B, which
shows that interactions cannot be interpreted without also taking into account main
effects and square effects.

Here are therefore the basic principles for building relevant mixture models.

For screening purposes, use a purely linear model (without any interactions) with
respect to the mixture components. If your design includes process variables, their
interactions with the mixture components may be included, provided that each
process variable is combined with either all or none of the mixture variables. No
restriction is placed on the interactions among the process variables themselves.

For optimization purposes, you will choose a full quadratic model with respect to
the mixture components. If any process variables are included in the design, their
square effects may or may not be studied, independently of their interactions and of
the shape of the mixture part of the model. But as soon as you are interested in
process-mixture interactions, the same restriction as before applies.
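To make the structure of a full quadratic mixture model concrete, here is a minimal sketch (assuming numpy; a generic expansion of our own, not the exact term ordering used by The Unscrambler) that builds the model matrix with all linear, interaction and square terms:

import numpy as np

def quadratic_mixture_terms(X):
    # X holds the mixture proportions, one column per component.
    # Returns [linear terms | all products Xi*Xj for i <= j], i.e. the full
    # quadratic expansion (squares included when i == j).
    X = np.asarray(X, dtype=float)
    cols = [X]
    p = X.shape[1]
    for i in range(p):
        for j in range(i, p):
            cols.append((X[:, i] * X[:, j]).reshape(-1, 1))
    return np.hstack(cols)

# Example: 3 components -> 3 linear + 3 squares + 3 interactions = 9 terms
print(quadratic_mixture_terms([[1, 0, 0], [1/3, 1/3, 1/3]]).shape)   # (2, 9)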


17.4.3 The Mixture Response Surface Plot


Since the mixture components are linked by the mixture constraint, and the
experimental region is a simplex, a mixture response surface plot has a special
shape and is computed according to special rules.

Instead of having two coordinates, the mixture response surface plot uses a special
system of 3 coordinates. Two of the coordinate variables are varied independently
from each other (within the allowed limits of course), and the third one is computed
as the difference between MixSum and the other two.

Examples of mixture response surface plots, with or without additional constraints,
are shown in Figure 17.17.

Figure 17.17: Unconstrained and constrained mixture response surface plots.
Left: a response surface from a simplex (Simplex-Centroid) design; right: from a
D-optimal design. Both show contour levels of the response Y over the mixture
components A, B and C (0-100%).

Similar response surface plots can also be built when the design includes one or
several process variables.


17.5 Exercise ~ Build a Mixture Design - Wines
Purpose
You will learn how to build and analyze a classical Mixture design.

Context
A wine producer wants to blend 3 different types of wines together: Carignan,
Grenache and Syrah. All three types can vary between 0 and 100% in proportion.
He is aiming at finding out what proportion of the three makes the most preferred
wine, but to simplify his production work he is mostly interested in blending only
two types of wine together. He is also concerned about the production cost.

Tasks
In this exercise, you will build a mixture design with 3 mixture variables (Carignan,
Grenache and Syrah).

This exercise will lead you through data input of the responses of interest:
Preference and Cost. We will analyze the data and take into account both
preference and cost in the search of a compromise.

1. Define the variables.
2. Build a mixture design and visualize the design.
3. Type in response data.
4. Check the data with Statistics.
5. Check the symmetry of the data.
6. Build a PLS2 regression model.
7. Check the quality of the model.
8. Find the most preferred wine combination.
9. Conclude on the wine producer’s options for a taste/cost compromise.

How to Do It
1. Define the variables
Go to File - New Design and choose to build your design From Scratch.
Click Next.
Select to build a Mixture design and click Next.


Define three mixture variables: Carignan, Grenache and Syrah, varying from 0
to 100% each. As there are no additional constraints to the mixture constraint,
do not tick the Multi-linear constraints box. Click Next.
There are no process variables involved, so you do not need to define any. Click
Next.
Enter 2 responses: “Preference” and “Cost”, then click Next.

2. Build a mixture design and visualize the design.


In the Define Model dialog, select Mixture Interactions and Squares. We
want to optimize a wine blend, so we are going to build a quadratic model with
interactions in order to describe variations as precisely as possible. In the
Define Design Purpose dialog, choose “optimization”. Click Next.

In the Design Type dialog, the default choice recommended by The Unscrambler
is a Simplex-Centroid design with interior points. However, the wine producer
would like to study two-wine combinations in particular, in order to make his final
production work easier. So we will choose a design which does not include interior
points but more edge points instead: a Simplex-Lattice design of degree 3.

Mark the experiments included in each of these two designs on the simplexes
below to see the difference.

(Two blank simplexes to mark on: left, a Simplex-Centroid design with interior
points (3 mixture variables); right, a Simplex-Lattice design of degree 3
(3 mixture variables).)

How many experiments are included in a Simplex-Lattice design of degree 3 for
3 mixture variables?
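If you would like to verify your answer, here is a minimal sketch (plain Python; a generic enumeration of our own, not The Unscrambler's point ordering) that lists all points of a simplex-lattice design of degree m in q components:

from itertools import product

def simplex_lattice(q, m):
    # All mixtures of q components whose proportions are multiples of 1/m
    # and sum to 1.
    return [tuple(c / m for c in combo)
            for combo in product(range(m + 1), repeat=q) if sum(combo) == m]

points = simplex_lattice(q=3, m=3)
print(len(points))     # number of distinct design points (before replicates
for p in points:       # and center samples are added)
    print(p)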

Back to The Unscrambler…


Click Next to access the Design Details dialog. Take 1 replicate and 2 center
samples; click Next. In the randomization details dialog, click Next. In the Last
Checks dialog, hit Preview to take a first look at your design.
Check if the resulting design is what you expected. Is the number of experiments
as you expected? Do the experimental points meet with your expectations?
Click Finish in the Last Checks dialog.
Save your design under the name of your choice by going to the menu: File-
Save As…

Now you are going to make a geometrical representation of this 3-variable
mixture design in order to see its shape precisely. Select the design variables
columns with the mouse and go to the menu Plot - 3D Scatter. Draw the plot
on all samples. You can draw lines in order to visualize the borders of the
experimental region: right-click to access the context menu, go to Edit - Insert
Draw Item - Line.

Note!
If you need to delete a line, press “Alt Gr” and click on the line to select
it, then press “Delete”.

Do you recognize the simplex shape? Notice how the 3-dimensional variable space
(a cube) collapses into a 2-dimensional region (a flat triangle).

Minimize the 3D scatter plot, and create a new identical one (Plot - 3D
Scatter). We are going to observe the surface by a different method.
Go to View - Rotate and rotate the points horizontally (use the keyboard
arrows or the mouse) until all the design points are lined up. To rotate the plot
one degree at a time, press “Ctrl” as you rotate.

3. Type in response data


The preference data was averaged over 83 male and 30 female consumers aged
between 25 and 60. The scale used ranged from 1 to 3. A major group of 99
consumers with a similar preference was clearly identified in a PCA analysis.
Their evaluations were sorted and averaged.

The production cost was computed for each sample according to the amount of
each wine type with a linear equation.

Type in the calculated production cost and the averaged preference evaluations
into your design table in The Unscrambler, according to Table 17.5.

Do not forget to save the new table with results! (File - Save)

Table 17.5 – Design variables and responses

Sample     Carignan  Grenache  Syrah  Preference  Cost
Cube001a   1         0         0      2.8         3.50
Cube002a   0.67      0.33      0      2.4         3.17
Cube003a   0.67      0         0.33   2.8         3.67
Cube004a   0.33      0.67      0      1.9         2.83
Cube005a   0.33      0         0.67   2.7         3.83
Cube006a   0         1         0      1.4         2.50
Cube007a   0         0.67      0.33   1.8         3.00
Cube008a   0         0.33      0.67   2.0         3.50
Cube009a   0         0         1      1.8         4.00
Cent-a     0.33      0.33      0.33   m*          3.33
Cent-b     0.33      0.33      0.33   2.5         3.33

* The box containing the samples for Cent-a fell off the lorry and was never delivered
for evaluation.

4. Check the data with Statistics


We are going to check the raw data and their distribution with Percentiles, Mean
& Standard Deviation plots.

Go to Task - Statistics and choose “All samples” in the sample set selection,
“Response variables” in the variable set selection. Click OK. Hit the View
button to view the results, and go straight away to File - Save in order to save
this new Statistics result file under a meaningful name.

Now you can start interpreting the results.


Can you detect any out-of-range value? Is there a variation in the samples? Is
the distribution of the data symmetrical?

You may want to compare the center sample to the design samples. Go to Plot-
Statistics and in the Compressed tab tick “Center” as well as “Design” in the
sample groups field. Click OK.
Is the center sample (33% Carignan, 33% Grenache, 33% Syrah) well
appreciated by the consumers compared to the rest of the samples?

5. Check the symmetry of the data


In addition to the Percentiles plot, you can check the distribution of the data
with histogram plots.


Close or minimize your statistics results and select the column Preference with
the mouse. Go to the menu Plot - Histogram and choose “All samples” in the
sample set selection. To access the skewness value, which corresponds to how
symmetrical the data is, go to View - Plot Statistics. The closer to zero, the
more symmetrical the distribution.

Does the skewness value confirm your opinion about the symmetry of the data?
Do you need to perform any pretreatment before starting a multivariate
analysis?

Check the symmetry for Cost with a similar procedure.
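The same skewness check can be reproduced outside the program. A minimal sketch (assuming scipy is available), using the Preference values of Table 17.5 and leaving out the missing Cent-a value:

from scipy.stats import skew

preference = [2.8, 2.4, 2.8, 1.9, 2.7, 1.4, 1.8, 2.0, 1.8, 2.5]
print(skew(preference))   # close to zero => roughly symmetrical distribution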

6. Build a PLS2 regression model


You are going to build a regression model relating the design variables (types of
wines) to Preference and Cost.

Go to the data table and choose the menu Task - Regression.

Samples and variables selection


In the Regression dialog, select a PLS2 model (we have more than 1 response).
In the Samples selection, take “All samples” and keep out sample 10 in the
“Keep out” field (we do not have any preference value for sample 10 “Cent-a”).
In the X-Variables selection take Design def. Model. This selection includes
the variables, interactions and square terms that we selected when defining the
model earlier in this exercise. Here, this variable set includes: 3 mixture variables,
3 two-way interaction effects and 3 squares of the mixture variables.
In the Y-variables selection, select “Response Variables”.

Weights: Standardize all X and Y variables.

Number of Components: Choose 6 components to be computed.

Validation Method
Choose Cross Validation as a validation method, and make sure that “Full
Cross Validation” is selected in the Setup.
Tick the Jack-Knife and check that the number of PCs used for Jack-knifing
will be the “Optimal number of PCs”. Click OK.


According to the Regression Progress, how many PCs carry meaningful
information?

View the results and go straightaway to the menu File - Save to save this PLS2
model under a meaningful name.

7. Check the quality of the model


Check the number of PCs used for the model
Look at the Explained Validation Variance plot (in the bottom left corner). How
many PCs carry useful information?

Look at the plot in the bottom right corner (Predicted vs. Measured), and check
how many PCs The Unscrambler has used to compute the results (right under
the plot). What is the optimal number of PCs according to The Unscrambler?
Compare your number to The Unscrambler’s finding. Do you agree with The
Unscrambler’s choice?

Find out which wine types have a significant effect on Preference


Go to Plot - Regression Coefficients, select “Preference” as a Y-variable
and click OK. Click on the Jack-knifing icon (a Swiss knife) to mark the
significant effects on Preference.

Which variables have a significant effect on consumer preference? Are these
effects positive or negative? Was it necessary to include interactions and
squares in the model?

Go to View - Jack-knife - Uncertainty limits. You can notice that for all
the non-significant variables, the uncertainty is such that we cannot even know
for sure whether the regression coefficient is positive or negative.

Important note!
Even though most interaction and square effects are not significant, we
cannot remove them from the model: all these terms are tightly related
because of the mixture constraint!

The only option would be to remove all interactions and squares from the
model, but we would then ignore the significant square effect of C (Syrah). So
we decide to keep our model as it is.


8. Find the most preferred wine combination


In a first stage, we are going to focus on consumer preference.

Study the relationship between wine types and Preference


Click on the upper right corner to select this part of the screen as a location for
the plot we are going to produce. Go to Plot - Loadings and in the tab called
General, select “X and Y” as variables. The plot type should be a “2D
scatter”, and the Vectors used should be “1” and “2” (PC1 and PC2). Click OK.

The X- and Y-Loadings plot shows that on PC1 and PC2, 69% of the variation
in X (wine types) explain 80% of the variation in consumer preference and cost.
The model performs very well!

The plot reveals the significant effects detected before (marked variables). You
can notice that some of the variables are projected far from the center of the plot
on PC1 and PC2; however, they are not marked as significant. This is due to the
fact that we are looking at two components only, whereas the model is actually
based on 5 components.

To get a bigger view of the plot, go to Window - Copy to and select figure 1.
Now the X- and Y-Loadings plot takes the whole viewer screen.

In which direction is Preference plotted compared to the significant variables?
Which of the significant effects are correlated positively to Preference? Which
are correlated negatively to Preference? Which wine types would you combine
in order to reach a high level of consumer acceptance?

Find out an optimal combination of wines to reach the highest Consumer
Preference
Go to Plot - Response Surface, check that the default selections are fine and
click OK.

What percentages of the three wines give the highest acceptance? Does this
combination match with your expectations? What is the predicted Preference
value for this optimal wine combination?

You can also display the response surface as a landscape (right-click to access
the context menu, Edit - Options). Go to View - Rotate or select the Rotate
icon to rotate the plot.


9. Conclude on the wine producer’s options for a taste/cost compromise


Now we are going to introduce Cost in our interpretation.

Find out which wine types have a significant effect on Cost


Choose Window - Go to and pick figure 4 to go back to the normal viewer
stage. Go to Plot - Regression Coefficients and select response number 2:
“Cost” as the Y-variable. Click OK to display the plot.

How big are the regression coefficients for interactions and squares? Can you
explain that?

Hint: The cost was computed as a linear function of the amounts of Grenache,
Syrah and Carignan. The regression coefficients equal to zero show that PLS2
was able to detect the linearity!

Click on the Jack-knife icon to mark the variables that have a significant effect
on Cost. Note that the X- and Y-loadings plot is updated at the same time.
Which wine types have a significant effect on the production cost? Are these
effects positive or negative?

Relate your findings to what you can see on the X- and Y-loadings plot.
Which significant effects are negatively/positively correlated to Cost?

Find out an optimal combination of wines to reach different goals


Go to Window - Go To and choose Figure 2. The viewer stage is now divided
into an upper and a lower part. Go to Plot - Response Surface and select
response number 2, “Cost” as the Y-Variable.
Which wine is the cheapest to produce? The most expensive?

To be able to compare easily the preference and the cost for different wine
combinations, display the response surface plot for Preference in the lower part
of the screen.
How much is the production cost for the most preferred wine combination?

The wine producer knows from the consumer study that a number of consumers
would consider the price first, then the taste when choosing wine. However, a
grade under 2 in the consumer study clearly meant “rejected wine”.
Help the wine producer find a wine combination for a production cost lower
than 3 and a preference as close as possible to 2.
Is it possible to find such a combination involving only two wines?

Summary
We built a Simplex-Lattice design in order to focus our study on two-wine blends.
Because of the mixture constraint (C+G+S=100%), we worked in a two-dimensional
space, that is to say a surface, even though we included three variables in the design.
Before starting the analysis, we checked the raw data and found no suspicious out-
of-range value. The distribution of the data was quite symmetrical for both
response variables, Preference and Cost.
A PLS2 regression was performed with the design variables and their interactions
and squares as X, and the response variables as Y. Automatic marking by Jack-
knifing showed us that interactions and squares were needed for the model.
Regarding the consumers’ preference, a positive effect of Carignan, a negative
effect of Grenache and a negative square effect of Syrah were shown.
The cost was linearly influenced by the quantities of Carignan, Grenache and
Syrah: Syrah was the most expensive wine to produce and Grenache the cheapest.
The most preferred wine combination was 71% Carignan - 29% Syrah, for a
preference of 2.9 on the scale from 1 to 3 and a cost of 3.6.
To keep his production cost lower than 3 but ensure a preference above 2 by
mixing only two wines, the producer should mix 55% Grenache with 45% Carignan.


18. Comparison of Methods for Multivariate Data Analysis - And their Validation
In this final chapter we will repeat the essence of multivariate data analysis, but
with a perspective that will also give the reader a first overview of competing
methods and approaches. We will also try to argue for a holistic attitude towards
the family of bilinear methods. The choice between alternative methods is always
problem-dependent, and there are indeed many modeling methods to choose from.
In section 18.1 we will give a comparative overview of a selection of the most
important methods, in a direct comparison with the well-known projection
methods. In section 18.5 we will finish by discussing the validation issue once
again, this time focusing on the crucial relation to the problem definition.

18.1 Comparison of Selected Multivariate Methods
It is not the goal of this overview to give a complete statistical and algorithmic
presentation of the various methods to be compared. The objective is to give a brief
practical presentation of the way each method is used. We shall do this by focusing
on their projection characteristics, and - often - this may be all that is needed in
order to be able to choose the method best suited to a specific problem. Clearly,
though, such a utilitarian overview is not sufficient for a complete mastering of
these methods. We have included ample literature references for in-depth follow-
ups if desired.

The underlying feature that will serve as a common backbone with which to
interrelate the various methods will be the effective projection dimension of each
of the methods. This dimension has been called “A” in this book. Principal
Component Analysis and Factor Analysis, for instance, may be seen as methods
that project the original p-dimensional data (recall the variable space defined by the
original p variables) onto a low-dimensional subspace of dimension A. At times A
may be very low, for instance 1, 2 or 3, and in general A << p.


18.1.1 Principal Component Analysis (PCA)


PCA - as has been fully exposed above - is extremely useful for projecting a higher-
dimensional data matrix onto a low-component subspace, which is used as
“windows” into the p-dimensional data space.

The idea is that the signal part, i.e. the main multivariate structure(s) projected onto
the A-dimensional subspace, very often corresponds to the most useful part of the
data. As soon as the subspace dimension, A, has been determined, we can usually
do away with the complementary (p-A) dimensions. According to the underlying
assumptions of PCA-decomposition, these dimensions represent the “noise”, or
error part, which we are not interested in and which should not be allowed to
continue to confound the data.

In some ways the action of projection methods such as PCA is like using a
magnifying glass. The essential data structures are enhanced, while the irrelevant
noise is screened away. It is this “truncated” use of PCA that has been particularly
useful - and therefore popular - in science and technology in general, and in
chemometrics in particular.
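A minimal sketch of this "truncated" use of PCA (assuming numpy and scikit-learn; the simulated data are only an illustration): keep A components as the signal subspace and discard the rest as noise.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Simulated data with a 2-dimensional signal structure plus small noise, p = 10 variables:
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(50, 10))

pca = PCA(n_components=2)          # A = 2: the signal subspace
T = pca.fit_transform(X)           # scores ("windows" into variable space)
P = pca.components_                # loadings
X_signal = T @ P + pca.mean_       # structure part of the data
E = X - X_signal                   # residuals: the discarded noise part
print(pca.explained_variance_ratio_)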

In this type of application PCA performs as a powerful tool for Exploratory Data
Analysis (EDA). The graphic score and loadings plots are used to visualize the
underlying structure in the data after removal of the noise contributions, and for
problem-dependent interpretations. But PCA can also be used for quite another
purpose than EDA - it can be used for classification and modeling of individual
data classes. An example of this is given in Figure 18.1.


Figure 18.1 - Principal Component Analysis, here used for SIMCA-modeling.
(Three data classes in the variable space spanned by x1, x2 and x3, modeled with
local PCA models of dimension A = 0, 1 and 2 respectively.)

This illustration is a case where the data swarm in variable space (here p=3) is
found to be grouped into three separate clusters – three data classes. An initial PCA
on this data set would reveal these groupings in the relevant score plots. Each group
can now be modeled and indeed interpreted separately. This will reveal the
effective dimension within each class; in Figure 18.1 the classes have dimensions A
= 0, 1 and 2 respectively. Notice the cluster with A=0. In this class the data points
are quasi-spherically distributed, i.e. there is no preferred direction of maximum
variance - all directions are equal. In this (admittedly very rare) case of isotropic
data variance the “best” model of the data swarm is simply the mean object!

When PCA is used for classification purposes like this, the overall data
decomposition is controlled by the structures inherent in the entire data set. It is a
“let us see what we get” approach, after which SIMCA may proceed if found
appropriate etc.

Another approach, which will be discussed later, is when the data analyst knows
beforehand exactly which classes to expect (and which not to expect), e.g. the data
is made up of red or green apples and nothing else, in which case one would expect
only one class for each color of apples. Is it then acceptable to “force” the
decomposition onto the classes in question through the use of the class of
“dummy”-regression methods, PLS-DISCRIM for example? This would be wrong!
What about the possible presence of rotten apples? We trust the reader can see
behind the extreme simplicity of this illustration, and be able to carry over the
mental notion of “rotten apples” to his/her own data analysis situation, i.e. always
be on the lookout for outliers - always!

It is important to understand the fundamental difference between these two
approaches - the unsupervised or “passive” PCA and the supervised or “forced”
decomposition (which will be described in more detail later).

18.1.2 Factor Analysis (FA)


Factor analysis may be considered as a statistical older brother of PCA. While PCA
separates noise from signal and models only the signal part of the data, FA is
concerned with modeling both these parts of the data.

FA is based on a number of statistical requirements that ideally must be fulfilled
(but which are often hard to substantiate in real-world data analysis). Be that as it
may, in FA it is assumed that the data matrix X can be decomposed according to
Frame 18.1.

First of all notice the clear similarity to PCA - FA also operates with scores and
loadings matrices. In PCA we attempt to take out the noise contributions and
collect them in the error matrix E, by determining the dividing effective dimension
A. The scores and loadings in PCA thus represent the signal part of the data and the
rest is in the E matrix. In FA the errors are also supposed to follow a quite specific
statistical distribution structure, detailed in Frame 18.1. Consequently, the
modeling of the error components is also intrinsic to FA. Thus there are basic
statistical assumptions both for the error matrix E as well as for the scores matrix
F, and these assumptions lead to the expressions given for the covariance of X and
the covariance between X and F. The expressions Cov(X) = AA´ + diag and
Cov(X,F) = A of course express the basic factor analysis model. FA is clearly a
statistically much more elaborate method than PCA.

A main objective of FA is still finding the “correct” dimension of the subspace that
represents the signal-structure part. There are a large number of statistically based
methods available for this purpose. We will not go into any detail on this point; the
interested reader is referred to the literature on FA. Here we will highlight Joliffe
(1986) and Jackson (1991) in particular; the literature on factor analysis is very
comprehensive.


Frame 18.1 - Factor analytical model (orthogonal); schematic summary

X (n x p) = m + A·F´ + E

A = loading matrix
F´ = score matrix (factor matrix)
m = average vector
E = error matrix

Statistical model:
E(F) = 0; Cov(F) = I
E(E) = 0; Cov(E) = diag
=> Cov(X) = AA´ + diag
Cov(X,F) = A

Another important feature of FA concerns rotations of the “primary solutions”. The
primary solutions are the score vectors (and corresponding loadings) that represent
the structure part of the data. The goal of such rotations is allegedly to improve the
interpretability, vide the expression “rotate to simple structure”. Whether this is, or
can be, achieved has been the source of a very long and still unfinished debate in
FA circles, see e.g. Jackson (1991). It should be mentioned that it is also possible to
rotate PCA solutions; this can sometimes improve even their interpretability. The
interesting issue here is that any rotation actually leaves the explained (modeled)
variances unchanged, so in many cases the difference in merit between FA and PCA
is very much a function of the underlying model chosen. The choice is thus not so
much oriented towards the utility or the validity of the results of these competing
methods, but often appears to be a traditional, and perhaps partly futile, academic
matter of preference and discussion. Please note that there do indeed exist more
advanced methods in which, for example, target rotation forms an integral part.

In general, however, most data analysts are happy with the primary solutions - but
from very advanced and experienced FA-analysts, be prepared for a lecture on the
“primitive” PCA method! It may of course do the reader no harm to dig somewhat
deeper into this issue, but hopefully not before having gained some - or rather a lot
of - personal experience with PCA, especially concerning its practical use.

Besides, PCA and FA very often give quite similar numerical results, especially
when the errors are small. Thus the practical use of FA is very much the same as
for PCA. It has been postulated that in some instances FA is superior to PCA
because it attempts to determine the “true factor structure”, which may encompass
both the signal and noise parts. In other words, FA attempts to model the noise as
well - to get the noise “under control” - as opposed to PCA, where the objective is
to discard the noise and pay no more attention to it. Whether or not FA is superior
to PCA in this general sense, FA has indeed seen some spectacular successes, but
nearly always in the hands of (very) experienced users with more than just a passing
interest in the underlying statistics (biometrics, psychometrics etc.). Thus, FA is not
a method recommended for the novice - there are too many traps and pitfalls.

Still, an overview of the main distinctions between PCA and FA may be of help at
this stage:

Table 18.1 - Comparison of Factor Analysis and Principal Component Analysis
(modified from Jackson, 1991)

Factor Analysis | Principal Component Analysis
FA models correlations | PCA models variances
FA-factors are uncorrelated | PC-components are uncorrelated
Residuals are uncorrelated because of independent communality estimation | Residuals usually correlated (not a problem; residuals are discarded)
Several estimation procedures; estimates are not unique | One unique estimation procedure
Adding one more factor may change the earlier ones | Adding one more component will not change earlier ones
Some solutions are invariant with respect to scaling | PCA is always scale-dependent
Computational problems for fully specified statistical models of complex data | PCA computations are always simple!
Many, competing rotation methods; “ideological schools” of usage abound | Usually no rotational ambiguity; only rarely are secondary rotations used

18.1.3 Cluster Analysis (CA)


The purpose of CA is to model, in a broad sense, the objective grouping into
“clusters” of data, hence the very apt name. CA consists of a large number of
alternative, but closely related, numerical and algebraic techniques. An important
distinction between the various techniques is that for a very few of them no
assumptions have to be made about the number of data groups or the group
structure, while others require such knowledge as input. Inputs to CA are in general
“similarity measures” of some kind - criteria against which the grouping can be
performed. Clustering is then done on the basis of the dissimilarities between the
objects/variables. There is a confusingly large number of such (dis)similarity
measures. In addition, CA comes in quite a number of different algorithmic
disguises.

Equation 18.1 - Cluster analysis

Similarity measures (examples of often used measures):

a) d(X,Y) = [ Σ |Xi - Yi|^m ]^(1/m),  m = 1, 2, 3, ...   (Minkowski distance)

b) d(X,Y) = Σ XiYi / sqrt( ΣXi² · ΣYi² )   (cosine similarity measure)

Figure 18.2 - Dendrogram. (Samples/groupings along one axis, the similarity
measure d(X,Y) along the other.)

The common characteristic of very many of the related CA-approaches is the well-
known type of visual display, the dendrogram, which has proven extremely popular;
see Figure 18.2. This type of display manages to compress the data structure, as far
as the data grouping is concerned, onto an apparently 2-dimensional chart. The
calculated grouping is displayed along one dimension, with the degree of
relatedness between these groups along the other. These features are in fact
intimately related, and the listing of the ordered set(s) of samples does not
constitute a proper dimensionality by itself. CA must therefore be viewed as a
projection onto a “1.5”-dimensional subspace, as it were.
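For orientation only, here is a minimal sketch (assuming numpy, scipy and matplotlib; not tied to any particular CA "school", and the data are simulated) that clusters a small data set with the Euclidean special case (m = 2) of measure (a) in Equation 18.1 and draws a dendrogram like Figure 18.2:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (5, 2)),      # two simulated groups
               rng.normal(3, 0.3, (5, 2))])

D = pdist(X, metric="euclidean")    # pairwise dissimilarities d(X,Y)
Z = linkage(D, method="average")    # one of many possible clustering criteria
dendrogram(Z)                       # samples along one axis, similarity along the other
plt.show()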

There is a snag with the many alternative CA possibilities, however, especially for
someone who is inexperienced with CA. You are usually advised to try out several
similarity measures and cluster methods “to see which corresponds best with the
data structures”. This many-methods attitude was probably devised in the hope that
if the results from several methods are more or less the same, these data structures
reflect reality. But in reality this points to a very serious weakness with CA. The
different, competing similarity criteria are not unique!

Also, a little thought brings us to the following problem: if several measures, and/or
cluster algorithms give different solutions, which do we choose? This may very
well (and frequently does) happen. The problem with CA is that nowhere is there to
be found any general optimization criterion for the many different clusterings
possible on the very same data set! Thus CA runs into the same sort of problem as
did FA (albeit in a distinctly different setting: CA is a classification method) - non-
uniqueness of the primary solutions. There is only one absolute certainty here, and
it in fact applies to all multivariate analysis:

There is no substitute for a proper understanding of how a particular method
works - but perhaps even more important, why it works or why it fails. Nowhere is
this more important than concerning the plethora of methods of CA.

The above extremely short introduction to CA really does not pay the necessary
respect to this venerable approach, but in the interest of the present introductory
book (on bilinear methods), we simply have to accept some boundaries. The reader
is however referred with great enthusiasm to the excellent CA-textbook by
Rosenberg (1987) which is a superb introduction to CA.

18.1.4 Linear Discriminant Analysis (LDA)


Discriminant Analysis (DA) seeks to describe mathematically, but often also
graphically, the discriminating features that separate objects into different data
classes. Thus DA finds “discriminant axes”, linear combinations of the initial p
variables, that optimally separate two or more data classes. Figure 18.3 shows the
classical linear DA (LDA) method, in this case the determination of the line that
best separates the two classes: “dots” vs. “crosses”. As is clearly seen from this
schematic illustration, the border between the 2 classes is “fuzzy” as reflected by
the two individual class distribution curves - they partially overlap.

DA is partly a supervised EDA technique, but is also often used for supervised
pattern recognition in a subsequent step. With DA you need to know some initial
means or characteristics in order to start dividing the objects into two or more data
classes. The overwhelming majority of LDA applications are concerned with only
two groupings.


It may also be seen from Figure 18.3 that LDA in fact can be viewed as a projection
onto A=1 dimensions. If the objects are projected onto the “LDF-axis” in Figure
18.3, this axis could be viewed as a “component vector” separating the two classes,
i.e. a 1-dimensional representation. This singular discriminant axis may be
extended to higher dimensions in more advanced versions of DA (e.g. quadratic
DA), but a great many of the classic methods stay with this very low 1-D dimension
of the subspace employed. For situations with two classes this makes good sense.
There are, however, many real-world data sets where this simple, low dimensional
picture is not enough. These systems are simply of sufficiently high complexity
that a 1-dimensional approximation is a gross misrepresentation.

An important assumption in LDA is that there is a common covariance structure
for both classes. This is an assumption which is patently not easily upheld in very
many real-world data sets!

One last point on LDA, which is also important, concerns collinearity. LDA
suffers from the same collinearity problems as does MLR, and there are no
remedies of the PCR kind in this case.

Figure 18.3 - Linear Discriminant Analysis (LDA). (Two classes, "dots" and
"crosses", in the x1-x2 plane, separated by Fisher's linear discriminant function,
LDF (A=1).)
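A minimal two-class sketch (assuming numpy and scikit-learn; simulated "dots" and "crosses") showing that the fitted discriminant is a single linear combination of the variables, i.e. a projection onto A = 1 dimension:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (20, 2)),       # class "dots"
               rng.normal(3, 1, (20, 2))])      # class "crosses"
y = np.array([0] * 20 + [1] * 20)

lda = LinearDiscriminantAnalysis()
scores = lda.fit_transform(X, y)                 # projection onto the 1-D LDF axis
print(lda.coef_)                                 # the discriminant direction
print(lda.score(X, y))                           # training classification accuracy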


18.1.5 Comparison: Projection Dimensionality in Multivariate Data Analysis
Principal Component Analysis (PCA), Factor Analysis (FA), Cluster Analysis (CA)
and Discriminant Analysis (DA) have been described above in an order where the
effective projection dimensionality decreases from A (PCA) down to 1 (LDA). The
comparison, based on this effective projection dimensionality, is designed to
illustrate the relative usefulness and weakness associated with these methods for
data description and classification. It is especially emphasized that some of the
classical statistical and data analytical methods have rather severely restricting
premises or are of limited use, due to A being too low for the actual data structures
present. Full subspace projection methods are the most flexible, because they allow
the data themselves to help determine the optimal dimensionality A. Thus the user
is never left with the possibility of using too narrow, or indeed to large, a model
support. Also the A-dimensional bilinear methods, PCA and FA, allow the user to
make up his/her own data analysis strategy dependent upon the interim results of
iterative runs through the data analysis, utilizing the full versatility of these
methods, especially in the “truncated” PCA tradition. The main problem with one-
shot methods is just that: they do not allow for any iterative data analysis – which
renders them inferior for many purposes.

It is possible to acquire an important insight into the interrelationships between
these methods by focusing on the decreasing effective projection dimensionality of
this series: PCA/FA - CA - LDA.

18.1.6 Multiple Linear Regression (MLR)


Linear regression, LR, is probably the most often used method of the multivariate
family, in science, technology as well as in nearly any other scientific discipline in
which quantitative characterization plays a role. Unhappily it is also the most
frequently misused method; perhaps not deliberately, but because of an unfortunate
lack of information about its proper practical use.

Multiple Linear Regression, MLR, is designed for the regression of one Y-variable
on a set of p so-called “independent” X-variables. It is implicit in the classical LR
formulation that X has full mathematical rank. The concept of “full rank” means
that the columns in X are linearly independent, i.e. they are not collinear. “Full
rank” and “not collinear” in practice means that the variables are uncorrelated.


This is a very important point. In science and technology collinearity is very often
the case for the sets of variables employed. Use of MLR on collinear data can lead
to serious misinterpretations and, worst of all, this may remain undiscovered, if
sufficient warnings against this have gone unnoticed.

There is still far too little emphasis in the applied sciences on critical inspection of
data structures (e.g. outliers, groupings, etc.) before running the data through one
of the popular and plentiful packaged MLR routines available. For the professional
user there is a very large apparatus of “regression diagnostics” with which to assess
whether this, and other critical assumptions are upheld, but very little help is
offered in the case they are not. Regression analysis is a huge area for professional
statisticians, and it is with the greatest respect that we here nevertheless largely
dismiss this class of regression. Reasons have been given however in chapter 6 and
elsewhere.

Another MLR-premise that is often violated concerns the fact that the X-variables
are often associated with errors, be it measurement errors, sampling errors, or
otherwise. In the classical case, only the Y-variable is assumed to be affected by
errors. Take for example the LR least-squares fitting criteria that is related to
Y-variable variance only - it is implicitly assumed that the X-values are noise free.

MLR is, in fact, ideally concerned specifically with an orthogonal X-matrix, (as is
reflected by its name in statistical terminology: the design matrix). For truly
orthogonal, i.e. uncorrelated X-variables, LR works just fine.

However, this is certainly no justification to continue to misuse MLR on collinear
data and/or data with significant X-errors. After all, there are many other alternative
approaches, e.g. PCR and PLS-R, which are readily available.

18.1.7 Principal Component Regression (PCR)


If you want a method which both avoids the collinearity problem and copes better
with significant X-errors, PCR is a strong candidate. PCR performs a two-stage
operation in which the X-variables are first subjected to a standard principal
components decomposition, exactly as described in chapter 3. Then the
Y-variable(s) is/are regressed onto this decomposed X-matrix. PCA(X) both assures
orthogonal score vectors, ideal for MLR for example, and utilizes the PCA
projection onto the signal part, the A-dimensional subspace, as a screening device
for the X-errors.
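A minimal sketch of this two-stage idea (assuming numpy and scikit-learn; the data are simulated, collinear X-variables): PCA on X, then ordinary regression of y on the first A orthogonal score vectors.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 8))   # collinear X (rank 3)
y = X[:, 0] - 2 * X[:, 4] + 0.1 * rng.normal(size=30)

pcr = make_pipeline(PCA(n_components=3), LinearRegression())   # A = 3
pcr.fit(X, y)
print(pcr.score(X, y))    # R^2: the scores are orthogonal, so the regression is stable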


The PCA decomposition is, however, still carried out completely without regard to the
Y-data structure. This means that you may very well decompose the X-matrix in a
way that is not optimal for Y-variable prediction, especially if you are modeling
more than one Y-variable at the same time. Here the PLS method is a far better
choice. On the other hand, in the case of only one Y-variable, PCR is statistically
the best studied and most well-known method. Many of the references in this book
give in-depth information on MLR and PCR from the statistical point of view. In
The Unscrambler package PCR and PLS are the two obvious choices.
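
A minimal sketch of this two-stage PCR operation, again in illustrative Python/NumPy
with hypothetical names, assuming X and y have already been centered: PCA of X via
the singular value decomposition, followed by regression of y on the first A
(orthogonal) score vectors.

import numpy as np

def pcr_fit(Xc, yc, A):
    # Two-stage PCR on centered data: (1) PCA of X, (2) regression of y
    # on the first A orthogonal score vectors.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :A] * s[:A]           # scores of the first A components
    P = Vt[:A].T                   # corresponding loadings
    q = (T.T @ yc) / (s[:A] ** 2)  # y-loadings; scores are orthogonal, so no matrix inversion
    return P @ q                   # regression coefficients in the (centered) X-units

# usage: y_pred = (X_new - x_mean) @ b + y_mean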

18.1.8 Partial Least Squares Regression (PLS-R)

The PLS approach has been designed to cope with the fully multivariate regression
case (both X- and Y-spaces are multivariate). It is the prime achievement of PLS
that it both handles the weaknesses of MLR and is an improvement over PCR in
terms of prediction ability, achieved with fewer, and more interpretable,
components. In addition PLS has optimal prediction ability in the strict statistical
sense. PLS can handle one or several co-varying Y-variables equally well. In this
latter context PLS may be viewed as one of the most generalized multivariate
regression techniques, with complete control over both collinearity and X-errors. In
the case of only one Y-variable, PCR and PLS compare closely, though PLS often
achieves its goal with fewer components than PCR. There is also a “Discriminant-
PLS” (PLS-DISCRIM) version, which avoids most of the LDA problems
mentioned previously, although it is not on a completely equal statistical footing -
yet. PLS-DISCRIM is very useful in practical data analysis however.

As was the case with PCA, PLS can also be used as a supervised calibration tool
that simultaneously classifies the new X-vectors submitted for prediction. This
latter feature is unique to multivariate calibration and is used extensively, for
example for automatic outlier-warning. Within chemometrics there is a complete
strategy for multivariate calibration in science and technology in general. The book
by Martens & Næs (1989), “Multivariate Calibration”, is still a leading authority on
the subject, even with more than ten years on its back!

18.1.9 Increasing Projection Dimensionality in Regression Modeling

The succession of MLR ⇒ PCR ⇒ PLS can be seen as representing an increase in
the effective projection dimensionality, A, of the subspace employed. MLR uses a
one-dimensional linear combination of the p original variables only; PCR uses an
A-dimensional decomposition of X, but one that is unrelated to Y before the
regression stage. PLS makes use of a similar A-dimensional subspace as PCR, but
lets the Y-variables “control” the decomposition of X, leading to a model that is
optimal for prediction of the Y-variables as well as for interpretation of the X-space
(you have both the p- and the w-loadings).

Understanding these two sequences, PCA ⇒ FA ⇒ CA ⇒ LDA and MLR ⇒
PCR ⇒ PLS respectively, will allow the user to appreciate the general similarities
as well as the even more important distinctions between these seven most used
families of methods. There is a very significant difference between regarding the
methods described here as merely a multivariate supermarket (driven by more or
less subjective choices) and accepting the obligation to master the relationships
between problem formulation and the appropriate method to use. Indeed, getting to
know how to use one particular method from the present effective projection
dimension viewpoint is considerably more scientific than using a mere statistical
“cookbook recipe”, which is bound to fail in many cases anyway. The relationship
between a proper understanding of the data analytical problem and the eventual
choice of the (most) appropriate method may at first seem formidable for the new
data analyst, but there are some fundamental guiding principles that are by now
well within our grasp, because we have worked so hard at learning - and
mastering - the bilinear methods.

18.2 Choosing Multivariate Methods Is Not Optional!

18.2.1 Problem Formulation


Unsupervised methods are used in much the same way as fishing tackle; if you are
an adept angler, your angling skills contribute more to a successful result than the
quality (and perhaps the price) of the fishing equipment itself. Likewise for the
methods used in EDA, exploratory data analysis. Anybody can feed a given data set
(a data matrix, X) into one of the all too easily available software packages
boasting this and that “well known” statistical or analytical routine. It is very
doubtful however that this act alone guarantees you a relevant and intelligible
answer to the specific data analytical problem at hand, despite the fact that there is
always some output. Even though multivariate methods may appear difficult to
grasp at first sight, these methods are in fact very easy to use. We have in any event
worked hard to press home exactly this point in this book - and if you have
carefully carried out all (well most of) the exercises herein, you will have gotten
this message by now.

An appropriate analogy could be that of a cookie cutter. Cookie cutters stamp out a
predefined form from whatever is placed beneath them. But not only is the form of
the cookie cutter important - what is placed under it is also of equal importance of
course. Trying to stamp out cookie forms from, for instance, spaghetti would be a
senseless thing to do.

Translating this analogy to multivariate analysis: It is of paramount importance
which data set is subjected to which method and – especially – why?

• Which data set: which data are measured/observed in your problem context.
• Which method: the (multivariate) method must comply with your problem
formulation.
• Why: why do I measure/observe these data? The problem definition determines
which (multivariate) method is to be used!

One of the most fundamental distinctions in multivariate data analysis concerns the
alternative data analysis modes: unsupervised methods vs. supervised methods.

18.3 Unsupervised Methods


Which n samples should go into the (n,p) data matrix (which n samples, described
by which p attributes)? It matters very much how you sample your observations,
and this is of course intimately related to the objective of your investigation. In fact,
the choice of an appropriate method is in many instances already determined from
the moment you have set down a specific objective for your sampling. For some
problems substitute “observe/observation” for “sampling/sample” in the above –
the central issue is the same however.

In EDA you are searching for patterns, groupings, outliers, etc. - in short,
performing an act of Pattern Cognition (PAC). You may use whatever appropriate
unsupervised method you feel comfortable with (e.g. PCA), but you should most
definitely not use e.g. MLR if you have not formulated some form of functional
X ⇒ Y regression concept derived from “external” knowledge pertaining to your
problem formulation, your sampling scheme and your general knowledge of the
data context. Otherwise MLR is almost prohibited in this context, because it is
specifically a regression method and therefore assumes a regression objective for
the analysis, i.e. that there is a Y that can be explained by an X. Alternatively you
may use PCA or another method which does not partition the variables into an
X-block and a Y-block.

On a slightly more general level: unsupervised methods are used for unsupervised
purposes (and of course: supervised pattern recognition methods are used for
supervised pattern recognition purposes). What does this apparently circular dictum
mean then? When you do not know of any specific data analysis purpose (e.g.
regression, classification) from the original problem specification, all you in fact
can do is to perform an unsupervised data analysis, since there simply is no
supervising guidance to be had. On the contrary:

18.4 Supervised Methods


If you find data groupings (data classes) in an initial EDA run, these may then
become the basis for subsequent Pattern Recognition (PARC) by a suitable PARC
method, e.g. some classification method (DA, class modeling PCA, SIMCA, etc.).
If you formulate a classification/discrimination objective, this may for instance be
dependent upon your findings in the earlier EDA run, or on other external
knowledge in analogous situations. But in other circumstances the classification
objective may be given from the outset of the analysis (remember the red and
green apples), and then of course only supervised classification/discrimination
methods are suitable for this task.

There is no harm done however, if you prefer to perform some EDA first. In fact it
is useful to do this every time. You will get to know the data structure and you may
perhaps even find that your initial assumptions do not hold up - and surely nobody
will object to that type of cautious data analysis in the initial stages.

Supervised methods always comprise a two-stage process:

1. Establishing a model for the X ⇒ X or X ⇒ Y relationship (e.g. DA, MLR, PCR,
PLS). This is called the training, the modeling - or calibration stage. DA can
also be seen as a marginal subset of the regression case. This stage can in some
sense be considered a “passive” modeling stage, because the data itself pretty
much determines the (soft) data model.

2. Using this model for whatever purpose your original objective dictates, e.g.
PARC for DA, prediction for MLR, PCR and PLS, or classification. This
may be called the “active” stage, the classification or the prediction stage etc.

The validity and efficiency of any supervised data analysis method is totally
dependent on the representativity of the initial data relationship used as a training
basis. The principle of GIGO (Garbage In - Garbage Out) applies to all supervised
methods! It is the responsibility of the data analyst - YOU - to specify the training
data set in as relevant and representative a manner as possible (which n samples?
Why? Characterized by which p variables?). Data classes and the samples therein
must be representative with respect to future sampling of the populations modeled
by the particular supervised method employed for the specific problem.

The systematic relationships between data analytical problem formulations and the
appropriate multivariate methodological choices are presented in full detail by
Esbensen et al. (1988).

In the sections above, the twofold unsupervised/supervised division, as well as the
threefold division into EDA, classification and regression approaches, was laid out
in order to suggest some mainline approaches to practical multivariate data
analysis. Unfortunately data analysis is often first called upon to perform a rescue
operation when you are all but drowning in (unstructured) data. Such rescue
operations often fail because the problem formulation has not been designed from
the outset to give structured, meaningful information after the analysis. Obviously
there is often not much to be gained if there is no structuring objective behind the
data analysis.

The review in this chapter has been successful if you have formed an opinion that
this structuring is intimately connected to the original problem objective, and that
an appropriate data analytical method will almost suggest itself if only seen in this
proper context.

The “how to” of multivariate methods is relatively easily mastered (especially in
relation to the very many comprehensive software packages available for the PC
and other computer systems). The “why” is a matter that rests solely with the user -
YOU. Even if this appears burdensome at first sight to the novice user, it is in fact
a blessing because it relieves the pressure of technical expertise and emphasizes the
data context imperative, and this is where all your domain specific expert
knowledge comes in. The interplay between the expert knowledge of the problem
context and the multivariate data analysis knowledge (and experience) is really
where most of the fun in multivariate data analysis is to be found.

18.5 A Final Discussion about Validation


Since validation is the most important issue in multivariate analysis, we shall finish
this book by reviewing the alternative approaches.

The purpose of model validation is twofold. First, to avoid overfitting or
underfitting a model by finding the optimal number of components to use;
validation is used here in the modeling stage. Secondly, to be the instrument for
assessing the empirical prediction error associated with the future use of the
prediction model. Thus validation may be used both for optimal modeling and for
prediction validation.

18.5.1 Test Set Validation


The test set validation concept is designed around the availability of - at least - a
second data sampling, the test set. It should be drawn from the parent population so
as to be as closely comparable to the calibration data set as possible with respect to
the number of objects, the sampling conditions and the sampling time - and most of
all with respect to the representativity of the target population. When it is at all
possible to plan for the subsequent data analysis ahead of the actual sampling
procedure, you should always try to secure as many identical samplings of the
target population as possible. This is especially important if the representativity
and/or the testing of the final model are critical for its future use, e.g. in
automated plant monitoring etc.

Quite clearly, if we have secured two such data sets (a calibration set and a well
balanced test set), there can effectively be only one variance component that will
differ between them: the sampling variance. This sampling variance will comprise
those differences between the two data sets that can only be explained by the (two)
different samplings of n objects, made under conditions which are otherwise as
identical as possible. This is the essence of the concept of test set validation.

Let us assume then that you have arrived at a “reasonably representative calibration
model”. The central idea behind all prediction model validation now is to evaluate
the prediction strength of this model. All validation is based on a comparison
between the model-based prediction results and the test set reference values. The
entire basis on which we will judge the usefulness of the prediction model is the
degree of correspondence between these “correct” reference values and the values
predicted by the model. The test set validation is the best possible option for this
critical issue – there is none better!
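
As a concrete illustration of this comparison, here is a minimal sketch, in
illustrative Python/NumPy with hypothetical names, of how the test set prediction
error is summarized as RMSEP (in the original units of Y) once a calibrated model
has produced predictions for the independently drawn test set.

import numpy as np

def rmsep(y_ref, y_pred):
    # Root mean square error of prediction, in the original Y-units
    y_ref, y_pred = np.asarray(y_ref, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_ref - y_pred) ** 2)))

# usage sketch: the model was calibrated on (X_cal, y_cal) only;
# (X_test, y_test) is the independently drawn test set
# y_pred = model_predict(X_test)      # hypothetical prediction function
# print("RMSEP =", rmsep(y_test, y_pred))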

18.5.2 Cross Validation


The test set approach above will never fail, but there are unfortunately practical
situations in which we just cannot produce a separate representative test set. Cross
validation is then the most common alternative. It is absolutely essential to fully
understand that cross-validation is not a proper validation method in itself – it is
“only” an ingenious substitute for test set validation.

Conceptually, we may start out by dividing the calibration data set into two halves.
These should preferably be chosen randomly. Each half is now characterized by n/2
objects. If the data set is “large enough”, there is no problem with this, except that
each model is now based only on n/2 objects. This is the only difference between
two-segment cross validation and test set validation, but – as was outlined in
chapter 7 – a crucial disqualifying one: there was never any second drawing from
the parent populations in this case!

With a sufficiently large calibration set, this procedure will then be the most
adequate substitute for validation with a full test set. The only problem is that
we rarely - very rarely - are in a position to demand this large a data set. Indeed,
the very reason cross-validation was called in was that we could not obtain enough
samples to delineate a proper test set!

Cross validation using more segments than two is characterized by a steadily
decreasing number of samples available for both modeling and prediction testing.

Cross validation is strongly problem dependent. There is in general no given
number of segments that ensures an optimal cross validation result. It is up
to the data analyst to choose the number of segments; there is no general rule to
lean on other than your own full understanding of the cross validation approach,
indeed of the entire validation issue. There has been an almost complete lack of proper
understanding of, and respect for, these critical issues in just about all of chemometrics,
as indeed well beyond. Remedying this troublesome state of affairs has been one of
the primary objectives of this book.

During cross validation of, say, 150 samples, we could for example use two
75-sample data sets, one for calibration and one for validation. We could also have
divided the data into three segments, each supported by 50 samples. Or perhaps
five data sets, each supported by 30 samples, or 10 segments, each with 15
samples... It is always possible to set up a segmentation list of the following form
(2,3,4,...n segments), where the last number of segments, n, is that pertaining to full
cross validation. In this leave-one-out cross validation, each sample will be taken
out of the calibration set once and only once. The remaining n-1 data vectors make
up the model support, with the purpose of predicting the Y-value of the temporarily
left-out sample. This can be carried out exactly n times, and each time one
specific Y-value will have been predicted. It is easily appreciated that we have
gradually built up a case apparently very similar to the separate test set validation
situation. There is but this one crucial difference, however. During this sequential
substitution of all calibration samples, we have never had an independent new
realization of the target population sampling. We are only performing an internal
permutation in the same calibration set. This is the crucial difference between test
set validation and segmented/full cross validation. In this precise sense, segmented
cross-validation gives access to an assessment of the internal model stability rather
than of the future prediction error.
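
The segmentation list idea above translates directly into a small amount of code.
Below is a minimal sketch, in illustrative Python/NumPy with hypothetical fit/predict
functions, of segmented cross validation: split the calibration set into S segments,
leave each segment out in turn, refit on the rest, predict the left-out samples and
summarize the deviations as an RMSE-type cross validation error. With S equal to
the number of samples this reduces to full (leave-one-out) cross validation.

import numpy as np

def segmented_cv_error(X, y, fit, predict, n_segments):
    # Segmented cross validation: 'fit(X, y)' returns a model,
    # 'predict(model, X)' returns predictions (both are hypothetical
    # stand-ins for whatever modeling method is being validated).
    n = len(y)
    segments = [np.arange(n)[s::n_segments] for s in range(n_segments)]
    sq_err = []
    for test_idx in segments:                       # leave one segment out at a time
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        model = fit(X[train_idx], y[train_idx])
        y_hat = predict(model, X[test_idx])
        sq_err.extend((y[test_idx] - y_hat) ** 2)
    return float(np.sqrt(np.mean(sq_err)))          # RMSE over all left-out samples

# systematic segmentation is used here; choosing the segments randomly,
# or to mirror the sampling structure of the problem, is often preferable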

When both alternative validation results are available (test and cross-validation),
we can make an estimate of the sampling variance. It is based on the difference
between the prediction variance from cross validation and that of the test set
validation. When everything else is equal, the cross validation variance must be
smaller than the test set variance, since the test set validation also includes the
sampling variance. This interesting exercise has almost never been carried out – but
it is of course a must for the exercises in this book in the cases where a true test set
is available. This is left to the reader’s discretion.

There are many myths about the different types of cross validation - for example
that full cross validation is the most comprehensive validation possible. This is
never the case, however, except for small-sample data sets. Otherwise it is very
unlikely that one left-out sample alone will induce any significant sampling
variance in any well-structured model (no outliers left in, no sub-groupings etc.) so
as to simulate the missing test set drawing. Full cross validation is however de
rigueur the smaller the number of samples available for validation. Put bluntly, it is
only in the case with really few samples that full cross validation is really
beneficial. In all other situations a carefully designed variant of segmented cross
validation will give a more realistic error estimate. How shall we choose the
number of segments?

The best way is to choose as many segments as will realistically simulate an
appropriate re-sampling of the target population. Instead of full cross validation of the
150 sample-set above, it is more realistic to use, say, 10 segments. Each of these
different segments chosen (randomly or systematically) actually now performs a
10% re-sampling of the calibration set target population. As you can see we are
deliberately trying to get a situation that is as close to the ideal test set situation as
possible. The drawback may be the small basis for our sub-models, but the test set
validation principle of simulating the sampling variance is intact! In general full
cross validation does not produce a sufficient re-sampling of the target population.
For each problem, there is in principle only one - or at most a few - way(s) to
choose segments so the test set case is optimally simulated. There will also be an
effective lower bound as to the smallest number of samples that meaningfully can
make up a test set.

To conclude: cross validation is not a compulsory statistical validation technique.
Cross validation is the problem dependent, modified equivalent of the optimal, but
unavailable, test set validation. As a data analyst you need to think carefully about
how to set up the segmentation for optimal results, and relate this to your specific
problem context. In taking personal responsibility for really getting to the bottom of
the validation issues, you gain the advantage that test set validation and
systematic, segmented cross-validation can be used for any validation task, not only
for prediction strength assessment but also for any similar performance testing (e.g.
classification). Esbensen & Huang (2000) deal with the full methodology of
proper validation.

18.5.3 Leverage Corrected Validation


The essence of leverage correction is to go through the calibration set only once,
and arrive at an approximate prediction error assessment, or an approximate
indication of the number of components, that is useful (enough) for the initial stages of
multivariate modeling. This is achieved by using a penalty weight for each object
residual (and each variable).

What would happen if we based the validation directly on the calibration set alone,
i.e. ran the validation on the same data set as was used for the calibration, instead
of the test set? This would naturally result in an over-optimistic assessment of the
prediction error, which would invariably be estimated too low. The same set of
objects would be used both for establishing the model and for “testing” its
prediction strength. Of course this is not acceptable as such. Still, this is exactly what
leverage correction validation does, but with the important “punishment factor”,
the leverage, factored in.
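
The book does not spell out the exact penalty weight here, but one common
formulation of leverage correction (take the details below as an assumption, not a
definition quoted from this text) inflates each object residual by 1/(1 - h_i), where
h_i is the leverage of object i computed from its scores. A minimal illustrative
Python/NumPy sketch:

import numpy as np

def leverage_corrected_rmse(e, T):
    # e: y-residuals (one per object) from a centered bilinear model
    # T: the corresponding score matrix (objects x components)
    # NOTE: the 1/(1 - h_i) penalty is a common convention assumed here,
    # not a formula taken from this book.
    n = T.shape[0]
    h = 1.0 / n + np.sum(T @ np.linalg.inv(T.T @ T) * T, axis=1)  # object leverages
    e_corr = e / (1.0 - h)                                        # penalty-weighted residuals
    return float(np.sqrt(np.mean(e_corr ** 2)))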

Therefore leverage correction should only be used as a preliminary validation
procedure and never for the final model assessment. Consequently we often use it
only in the initial modeling stages to screen outliers, establish a homogeneous data
set for further calibration work, etc. In short, use it until you finally have to decide
on exactly how many PCs to use for future prediction. Then we always use either
test set or cross validation.

18.5.4 Selecting a Validation Approach in Practice

The three alternative validation procedures all address the same objectives: how many
components to use in a given model, and what is a representative estimate of the
prediction error?

The answer depends on the chosen validation approach. All three approaches aim
at determining the prediction strength and in general they should not give results
which are too dissimilar. Always be aware though that any real-world data set may
well break every rule-of-thumb in the multivariate data analysis world. When the
three approaches really do present different results, there is a rigid preference for
test set over cross validation, followed by leverage correction. Observe this at all
times. The key issue is that the proper segmentation of the calibration set (from the
leave-one-out approach through all potential segmentations to the two-segment
approach) is very much related to the observable model structure. It is therefore
your responsibility to study the pertinent data structures carefully, most often in the
form of the appropriate score or T vs. U plots etc. and then to decide this issue.

There is no substitute for extensive personal experience in multivariate data
analysis. We hope this book has offered a good start by presenting an extended
series of real-world data sets that illustrate the most important issues as widely as
possible. Chapter 13, “Master data sets....”, contains a set of very representative real-
world data analysis problems/data sets, which has stood the test of some five
consecutive years of university teaching. Beware: here are the final challenging
problems on offer in this book from which to learn and grow!

18.6 Summary of Basic Rules for Success


If you have already learned to apply the basic rules below, and how to interpret
scores, loadings, loading-weights, residual variances (X,Y) as well as validation,
multivariate data modeling will be quite safe, and the risk of making serious
mistakes will be small.

1. Make sure to use only representative calibration and validation data. Enough
evenly distributed samples, spanning all major variations will usually do the job.
There are exceptions.

2. Select the appropriate validation method for all final model testing.

3. Always look for outliers. Decide whether to keep, or remove the candidates
spotted. Extreme samples may carry important information, while erroneous
data will destroy the model. Outlier detection is only feasible in the proper data
analysis problem context.

4. Decide on the “correct”, optimal number of principal components to use, by
e.g. studying the shape of the residual X-variance, but always compare with the
specific problem knowledge. Always be conservative: select the first local
minimum. Do not use too many PCs in relation to the dimensions of the data set,
which will always lead to overfit! Use all the different tools available for this
task.

5. Interpret the structural patterns in the data, by studying the appropriate score
plot. Be careful if there are clear, separate sub-groups. This may indicate that a
model should be made for each subgroup. Interpreting score structures is always
a problem-specific task. In the PCA regimen, t-t-plots reign, while in the PLSR/
(PCR) context, the t-u-plots take over.

6. Interpret variable relationships by studying the loading plot, but only on
models that you are completely satisfied with, i.e. models that have been
carefully worked through all the conditions described above. Use loading-
weights in the regression regimen.

7. Do not transform or preprocess data unless you know what you are doing and
why! Always use weights (1/SDev) if the variables are of widely different ranges;
this is usually done in the form of autoscaling. There are exceptions.

8. Check how well prediction models will perform when predicting new data, by
using the full validation concepts outlined above. The RMSEP prediction error is
given in original units. It should be compared to the measurement precision
levels and the accuracy of the reference method. Remember that the prediction
error depends on the validation method used as well as on the selection of
calibration and validation samples.

9. Beware of the possibility for exceptions to all “rules” (including the above).

10. Feel free to contact CAMO ASA, or the author if you have any questions:
[email protected]
[email protected]

18.7 From Here – You Are on Your Own. Good Luck!

This is where we leave you then – starting out on your own data analysis career.
The work done so far in reading, contemplating and internalizing the introductory,
theoretical part of the book has given you a solid foundation upon which to build;
call this stage 1. When you have worked yourself through the illustrations,
demonstrations and especially all the self-study exercises you have added the
second stage yourself.
When this has been achieved, the ascent has begun in earnest. Congratulations!

The rest of the way to the top is of course completely your own responsibility. The
pinnacle of chemometric data analysis is now in sight, and actually within reach.
While there is still quite a distance to go, there is nothing to hinder you from starting
on these last stages – and the view from the top is spectacular!
Good luck – Have fun!

The Eiffel tower: four stages and a pinnacle. This book should have elevated you to
the top of the second stage!

19. Literature
C. Albano, W. Dunn III, U. Edlund, E. Johansson, B. Nordén, M. Sjöström & S.
Wold (1978), Four levels of pattern recognition, Anal. Chim. Acta, 103, pp 429 - 443

C. Albano, G. Blomqvist, D. Coomans, W.J. Dunn III, U. Edlund, B. Eliasson, S.
Hellberg, E. Johansson, B. Nordén, M. Sjöström, B. Söderström, H. Wold & S. Wold
(1981), Pattern recognition by means of disjoint principal components models
(SIMCA), Philosophy and methods, Proc. Symp. on Appl. Stat., Copenhagen, pp 183 - 218

K.R. Beebe & B. Kowalski (1987), An introduction to Multivariate Calibration and
Analysis, Anal. Chem., vol. 59, No. 17, pp 1007A - 1017A

K.R. Beebe, R.J. Pell & M.B. Seascholtz (1998) Chemometrics: A Practical Guide.
John Wiley & Sons, Inc., New York, 1998, ISBN 0-471-12451-6

H.R. Bjørsvik & H. Martens (1989), Data Analysis: Calibration of NIR instruments
by PLS regression in Burns, D.A. & Ciurczak, E.W. (Editors) Handbook of Near-
Infrared Analysis, Marcel Dekker Inc., New York

G.E.P. Box, W.G. Hunter, J.S. Hunter (1978), Statistics for experimenters, Wiley &
Sons Ltd ISBN 0-471-09315-7

R. Brereton (1990) Chemometrics, Applications of mathematics and statistics to
laboratory systems. Ellis Horwood Ltd. Series in chemical computation, statistics
and information. ISBN 0-13-131350-9

P.J. Brown (1992), Wavelength selection in multicomponent Near-infrared
calibration, Jour. Chemometrics, vol 6, pp 151 - 162

R. Carlson (1992), Design and optimization in organic synthesis, Elsevier Publ.,
Amsterdam, ISBN 0-444-89201-X

R. Carlson (1992), Preludes to a screening experiment. A tutorial. Chemometrics and
Intelligent Laboratory Systems, 14: 103-114

S.N. Deming, J.A. Palasota & J.M. Nocerino (1993), The geometry of multivariate
object preprocessing, Jour. Chemometrics, vol 7, pp 393 - 425

A.S.C. Ehrenburg (1978) Data Reduction – Analysing and interpreting statistical
data. Wiley Publishers. ISBN 0-471-23398-6

K.H. Esbensen, M. Halstensen, T. T. Lied, A. Saudland, J. Svalestuen, S. de Silva,
B. Hope (1998) Acoustic chemometrics – From noise to information. Elsevier,
Chemometrics and Intelligent Laboratory Systems vol. 44, pp. 61-76

K.H. Esbensen, B. Hope, T. T. Lied, M. Halstensen, T. Gravermoen, K. Sundberg
(1999) Acoustic chemometrics for fluid flow quantifications – II: A small
constriction will go a long way. Jour. Chemometrics vol. 13, pp 209-239

K.H. Esbensen & J. Huang (2000) Principles of Proper Validation (submitted). Jour.
Chemometrics

K.H. Esbensen, S. Wold & P. Geladi (1988), Relationships between higher-order
data array configurations and problem formulations in multivariate data analysis,
Jour. Chemometrics, vol 3,1 pp 33 - 48.

R.A. Fisher (1936), The use of multiple measurements in taxonomic problems, Ann.
Eugenics, vol 7, pp 179 - 188

M. Forina, G. Drava, R. Boggia, S. Lanteri & P. Conti (1994), Validation procedures
in near-infrared spectrometry, Anal. Chim. Acta, vol. 295, No. 1-2, pp 109 - 118.

J.-P. Gauchi (1995), Utilisation de la régression PLS pour l’analyse des plans
d’expériences en chimie de formulation. Revue Statistique Appliquée, 1995, XLIII
(1), 65-89

P. Geladi (1988) Notes on the history and the nature of partial least squares (PLS)
modelling. Jour. Chemometrics, vol 2, pp 231 – 246.

P. Geladi & K. Esbensen (1990), The start and early history of chemometrics:
Selected interviews. Part 1. Jour. Chemometrics, vol 4, pp 337 - 354

P. Geladi & K. Esbensen (1990), The start and early history of chemometrics:
Selected interviews. Part 2. Jour. Chemometrics, vol 4, pp 389 - 412

P. Geladi & B.R. Kowalski (1986), Partial Least Squares Regression: A tutorial,
Anal. Chim. Acta, 185, pp 1 - 17

A. Höskuldsson (1996), Prediction methods in Science and Technology, Vol. 1:
Basic Theory. Thor Publishing, Denmark. ISBN 87-985941-0-9

J.E. Jackson (1991) A User’s Guide to Principal Components. Wiley. Wiley series in
probability and mathematical statistics. Applied probability and statistics. ISBN 0-
471-62267-2

R.A. Johnson & D.W. Wichern (1988), Applied multivariate statistical analysis,
Prentice-Hall 607 p.

I.T. Jolliffe (1986) Principal Component Analysis. Springer-Verlag, New York.
ISBN 0-387-96269-7

N. Kettaneh-Wold (1992), Analysis of mixture data with PLS. Chemometrics and
Intelligent Laboratory Systems, 14: 57-69, Elsevier Science Publishers B.V.,
Amsterdam.

J. Kaufmann (1991), Un Repas, Quels Vins? Editions S.A.E.P. F-68040 Ingersheim.
Colmar, France. ISBN 2-7372-2061-0

W.J. Krzanowski (1988) Principles of Multivariate Analysis – A user’s perspective.
Oxford Science Publications. Oxford Statistical Science Series No. 3. ISBN 0-19-852230-4.

K.V. Mardia, J.T. Kent & J.M. Bibby (1979), Multivariate Analysis, Academic Press
Inc., London, ISBN 0-12-471252-5

H. Martens & T. Næs (1989), Multivariate Calibration, Wiley & Sons Ltd, ISBN 0-
471-90979-3

D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte & L. Kaufman (1988),
Chemometrics: A text book, Elsevier Publ., Amsterdam, ISBN 0-444-42660

E. Morgan (1991), Chemometrics: Experimental design, Wiley & Sons Ltd, ISBN 0-
471-92903-4

T. Næs & T. Isakson (1991), SEP or RMSEP, which is best?, NIR News, vol 2, No.
4, p 16

T. Næs & H. Martens (1984), Multivariate Calibration. II. Methods, Trends in


Analytical Chemistry, 3, 10, pp 266-271

J.R. Piggott (Ed.) (1986) Statistical Procedures in Food Research. Elsevier Applied
Science Publishers. ISBN 1-85166-032-1

H.C. Romesburg (1984) Cluster Analysis for Researchers. Lifetime Learning
Publications, Belmont, California. Reprint edition Feb-1990, Krieger Publishing
Company. ISBN 0894644262

D.B. Rubin (1987), Multiple imputation for non-response in surveys. Wiley, New
York. Wiley series in probability and mathematical statistics. Applied probability
and statistics. ISBN 0-471-08705-x

R. Rucker (1984) The 4th Dimension - Toward a geometry of higher reality.
Houghton Mifflin Company, Boston. ISBN 0-395-34420-4

M. Sergent, D. Mathieu, R. Phan-Tan-Luu & G. Drava (1994), Tutorial. Correct and
incorrect use of multilinear regression, Elsevier, Chemometrics and Intelligent
Laboratory Systems, 153-162

G. Spotti (Ed.) (1991) Gaetano e Pietro SGARABOTTO. Liutai – Violin makers 1878
– 1990. Editrice Turris. Cremona. Italia. ISBN 88-7929-000-2

M. Tenenhaus (1998), La Régression PLS. Théorie et Pratique. Editions Technip,
Paris. ISBN 2-7108-0735-1

P. Thy & K. Esbensen (1993), Seafloor spreading and the ophiolitic sequences of the
Troodos complex: A principal component analysis of lava and dike compositions,
Jour. Geophysical research, vol 98 B7, pp 11799 - 11805

R. Vong, P. Geladi, S. Wold & K. Esbensen (1988), Source contributions to ambient
aerosol calculated by discriminant partial least squares regression (PLS), Jour.
Chemometrics, vol 2, pp 281 – 296

P. Williams & K. Norris (1987), Near Infrared Technology in the Agricultural and
Food Industries, American Association of Cereal Chemists Inc., ISBN 0-913250-49X

S. Wold (1976), Pattern recognition by means of disjoint principal components
models, Pattern Recognition, 8, pp 127 – 139

S. Wold (1978), Cross validatory estimation of the number of components in factor
and principal components models, Technometrics, 20, pp 397 - 406

S. Wold, C. Albano, W.J. Dunn III, U. Edlund, K. Esbensen, P. Geladi, S. Hellberg,
E. Johansson, W. Lindberg & M. Sjöström (1984), Multivariate Data Analysis in
Chemistry, in B.R. Kowalski (Ed), Chemometrics, Mathematics and Statistics in
Chemistry, D. Reidel Publ., pp 17 - 195, ISBN 90-277-1846-6

S. Wold, K. Esbensen & P. Geladi (1987), Principal Component Analysis - A
tutorial, Elsevier, Chemometrics and Intelligent Laboratory Systems, 2, pp 37-52

20. Appendix: Algorithms


20.1 PCA
The general form of the PCA model is: X = T ⋅ P' + E

Usually the PCA model is centered, which gives:

X = 1 \cdot x'_{mean} + T_{(A)} \cdot P'_{(A)} + E_{(A)}

Another way to put this is:

x_{ik} = x_{mean,k} + \sum_{a=1}^{A} t_{ia} p'_{ka} + e_{ik(A)}

Frame 20.1
The NIPALS algorithm for PCA

The algorithm extracts one factor at a time. Each factor is obtained iteratively by
repeated regressions of X on the scores t to obtain improved p, and of X on these p
to obtain improved t. The algorithm proceeds as follows:

Pre-scale the X-variables to ensure comparable noise levels. Then center the
X-variables, e.g. by subtracting the calibration means x', forming X_0. Then for
factors a = 1, 2, ..., A compute t_a and p_a from X_{a-1}:

Start:
Select start values, e.g. t_a = the column in X_{a-1} that has the highest remaining
sum of squares. Repeat points i) to v) until convergence.

i) Improve the estimate of the loading vector p_a for this factor by projecting the
matrix X_{a-1} on t_a, i.e.
p'_a = (t'_a t_a)^{-1} t'_a X_{a-1}

ii) Scale the length of p_a to 1.0 to avoid scaling ambiguity:
p_a = p_a (p'_a p_a)^{-0.5}

iii) Improve the estimate of the score t_a for this factor by projecting the matrix
X_{a-1} on p_a:
t_a = X_{a-1} p_a (p'_a p_a)^{-1}

iv) Improve the estimate of the eigenvalue τ_a:
τ_a = t'_a t_a

v) Check convergence: if τ_a minus τ_a from the previous iteration is smaller than a
certain small pre-specified constant, e.g. 0.0001 times τ_a, the method has converged
for this factor. If not, go to step i).

Subtract the effect of this factor:
X_a = X_{a-1} - t_a p'_a
and go to Start for the next factor
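
For readers who want to experiment outside The Unscrambler, the frame above
translates into the following minimal Python/NumPy sketch (an illustration only,
not The Unscrambler's implementation); X0 is assumed to be pre-scaled and centered.

import numpy as np

def nipals_pca(X0, n_factors, tol=1e-4, max_iter=500):
    # NIPALS PCA following Frame 20.1; returns scores T and loadings P.
    X = X0.astype(float).copy()
    n, p = X.shape
    T, P = np.zeros((n, n_factors)), np.zeros((p, n_factors))
    for a in range(n_factors):
        # start value: the column with the highest remaining sum of squares
        t = X[:, np.argmax((X ** 2).sum(axis=0))].copy()
        tau_old = 0.0
        for _ in range(max_iter):
            p_a = X.T @ t / (t @ t)              # i)  loadings by projecting X on t
            p_a /= np.sqrt(p_a @ p_a)            # ii) scale loading vector to length 1
            t = X @ p_a                          # iii) scores by projecting X on p
            tau = t @ t                          # iv) eigenvalue estimate
            if abs(tau - tau_old) < tol * tau:   # v)  convergence check
                break
            tau_old = tau
        X = X - np.outer(t, p_a)                 # subtract the effect of this factor
        T[:, a], P[:, a] = t, p_a
    return T, P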

20.2 PCR
PCR is performed as a two-step operation: first X is decomposed by PCA (see
section 20.1 above); then the principal component regression is obtained by regressing
y on the t’s.

Principal Component Regression Equation


Principal component regression of J different Y-variables on K X-variables is
equivalent to J separate principal component regressions on the same K X-variables
- one for each Y-variable. Thus we here only give attention to the case of one single
Y-variable, y.

The principal component regression is obtained by regressing y on the t’s obtained
from the PCA of X. The regression coefficients b for each y can be written

b = P q                                                        (20.1)

where the X-loadings P = {p_{ka}, k = 1, 2, ..., K and a = 1, 2, ..., A} represent the
PCA loadings of the A factors employed, and the Y-loadings q = (q_1, ..., q_A)' are
found in the usual way by least squares regression of y on T from the model
y = Tq + f. Since the scores in T are uncorrelated, this solution is equal to

q = diag(1/τ_a) T'y                                            (20.2)

Inserting this in (20.1) and replacing T by XP, the PCR estimate of b can be
written as

b = P diag(1/τ_a) P'X'y                                        (20.3)

which is frequently used as the definition of PCR (Gunst and Mason, 1979). When
the number of factors A equals K, PCR gives the same b as MLR. But the
X-variables are often intercorrelated and somewhat noisy, and then the optimal A is
less than K. In such cases MLR would imply division by eigenvalues τ_a close to
zero, which makes the MLR estimate of b unstable. In contrast, PCR attains a
stabilized estimation of b by dropping such unreliable eigenvalues.
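
Equation (20.3) can be transcribed directly; the following illustrative Python/NumPy
sketch assumes centered X and y, and obtains the eigenvalues τ_a from the squared
singular values of X.

import numpy as np

def pcr_coefficients(Xc, yc, A):
    # PCR regression coefficients via eq. (20.3): b = P diag(1/tau) P'X'y
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T              # PCA loadings for the A retained factors
    tau = s[:A] ** 2          # eigenvalues tau_a = t_a' t_a
    return P @ np.diag(1.0 / tau) @ P.T @ Xc.T @ yc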

20.3 PLS1
The general form of the PLS model is: X = T ⋅ P' + E and Y = T ⋅ Q' + F

Frame 20.2 - Orthogonalized PLSR algorithm for one Y-variable (PLS1)

Calibration:

C1  The scaled input variables X and y are first centered, e.g.
    X_0 = X - 1x'  and  y_0 = y - 1ȳ
    Choose Amax to be higher than the number of phenomena expected in X.
    For each factor a = 1, ..., Amax perform steps C 2.1 - C 2.5:

C 2.1  Use the variability remaining in y to find the loading weights w_a, using LS
       and the local 'model'
       X_{a-1} = y_{a-1} w'_a + E
       and scale the vector to length 1. The solution is
       w_a = c X'_{a-1} y_{a-1}
       where c is the scaling factor that makes the length of the final w_a equal to 1, i.e.
       c = (y'_{a-1} X_{a-1} X'_{a-1} y_{a-1})^{-0.5}

C 2.2  Estimate the scores t_a using the local 'model'
       X_{a-1} = t_a w'_a + E
       The LS solution is (since w'_a w_a = 1)
       t_a = X_{a-1} w_a

C 2.3  Estimate the spectral loadings p_a using the local 'model'
       X_{a-1} = t_a p'_a + E
       which gives the LS solution
       p_a = X'_{a-1} t_a / (t'_a t_a)

C 2.4  Estimate the chemical loading q_a using the local 'model'
       y_{a-1} = t_a q_a + f
       which gives the solution
       q_a = y'_{a-1} t_a / (t'_a t_a)

C 2.5  Create new X and y residuals by subtracting the estimated effect of this factor:
       E = X_{a-1} - t_a p'_a
       f = y_{a-1} - t_a q_a
       Compute various summary statistics on these residuals after a factors,
       summarizing e_ik over objects i and variables k, and summarizing f_i over
       objects i (see Chapters 4 and 5).
       Replace the former X_{a-1} and y_{a-1} by the new residuals E and f and
       increase a by 1, i.e. set
       X_a = E
       y_a = f
       a = a + 1


C3  Determine A, the number of valid PLS factors to retain in the calibration model.

C4  Compute b_0 and b for A PLS factors, to be used in the predictor
    y = 1b_0 + Xb
    (optional, see P4 below)
    b = W (P'W)^{-1} q
    b_0 = ȳ - x̄'b
________________________________________________________________
Prediction:

Full prediction

For each new prediction object i = 1, 2, ... perform steps P1 to P3, or alternatively, step P4.

P1  Scale the input data x_i like the calibration variables. Then compute
    x'_{i,0} = x'_i - x̄'
    where x̄ is the center of the calibration objects.
    For each factor a = 1, ..., A perform steps P 2.1 - P 2.2.

P 2.1  Find t_{i,a} according to the formula in C 2.2, i.e.
       t_{i,a} = x'_{i,a-1} w_a

P 2.2  Compute the new residual x'_{i,a} = x'_{i,a-1} - t_{i,a} p'_a
       If a < A, increase a by 1 and go to P 2.1. If a = A, go to P3.

P3  Predict y_i by
    y_i = ȳ + \sum_{a=1}^{A} t_{i,a} q_a
    Compute outlier statistics on x_{i,A} and t_i (Chapters 4 and 5).

Short prediction

P4  Alternatively to steps P1 - P3, find y_i by using b_0 and b from C4, i.e.
    y_i = b_0 + x'_i b

Note that P and Q are not normalized. T and W are normalized to 1 and orthogonal.
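
The calibration loop in Frame 20.2 translates almost line for line into code. Below is
a minimal illustrative Python/NumPy sketch (not The Unscrambler's implementation),
assuming X and y are already scaled; the centering of step C1 is done inside.

import numpy as np

def pls1(X, y, A):
    # Orthogonalized PLS1 following Frame 20.2 (steps C1-C4);
    # returns b0 and b for the short predictor y = b0 + x'b (P4).
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xa, ya = X - x_mean, y - y_mean                # C1: centering
    p_vars = X.shape[1]
    W, P, q = np.zeros((p_vars, A)), np.zeros((p_vars, A)), np.zeros(A)
    for a in range(A):
        w = Xa.T @ ya
        w /= np.sqrt(w @ w)                        # C2.1: loading weights, length 1
        t = Xa @ w                                 # C2.2: scores
        p_a = Xa.T @ t / (t @ t)                   # C2.3: X-loadings
        q_a = ya @ t / (t @ t)                     # C2.4: y-loading
        Xa = Xa - np.outer(t, p_a)                 # C2.5: deflate X ...
        ya = ya - t * q_a                          #       ... and y
        W[:, a], P[:, a], q[a] = w, p_a, q_a
    b = W @ np.linalg.solve(P.T @ W, q)            # C4: b = W (P'W)^-1 q
    b0 = y_mean - x_mean @ b
    return b0, b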

Multivariate Data Analysis in Practice


524 20. Appendix: Algorithms

20.4 PLS2
Frame 20.3
Simultaneous PLSR calibration for several Y-variables ('PLS2 regression')

If we replace the vectors y, f and q in Frame 20.2 by the matrices Y (dim I*J),
F (dim I*J) and Q (dim J*A), the calibration in PLS2 is almost the same as for the
orthogonalized PLS1. The exceptions are that y_{a-1} in C 2.1 is replaced by a
temporary Y-score for this factor, u_a, and that two extra steps are needed between
C 2.4 and C 2.5:

C 2.1  Use the temporary Y-factor u_a, which summarizes the remaining variability
       in Y, to find the loading-weights w_a by LS, using the local 'model'
       X_{a-1} = u_a w'_a + E
       and scale the vector to length 1. The LS solution is
       w_a = c X'_{a-1} u_a
       where c is the scaling factor that makes the length of the final w_a equal to 1, i.e.
       c = (u'_a X_{a-1} X'_{a-1} u_a)^{-0.5}
       The first time this step is encountered, u_a has been given some start values,
       e.g. the column in Y_{a-1} with the largest sum of squares.

The following two extra stages are then needed between C 2.4 and C 2.5:

C 2.4b  Test whether convergence has occurred, e.g. by checking that the elements
        have no longer changed meaningfully since the last iteration.

C 2.4c  If convergence has not been reached, then estimate the temporary factor
        scores u_a using the 'model'
        Y_{a-1} = u_a q'_a + F
        giving the LS solution
        u_a = Y_{a-1} q_a (q'_a q_a)^{-1}
        and go to C 2.1.

If convergence has been reached, then go to step C 2.5 in Frame 20.2.

The expression for B is the same in this PLS2 algorithm as in the PLS1 algorithm, i.e.
B = W (P'W)^{-1} Q'
and
b'_0 = ȳ' - x̄'B
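
For completeness, a compact illustrative Python/NumPy sketch of Frame 20.3 (same
assumptions and caveats as for the PLS1 sketch above); the inner loop is the extra
u_a iteration that distinguishes PLS2 from PLS1.

import numpy as np

def pls2(X, Y, A, tol=1e-10, max_iter=500):
    # Orthogonalized PLS2 following Frame 20.3;
    # returns B and b0 for the predictor Y = 1 b0' + X B.
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xa, Ya = X - x_mean, Y - y_mean
    p_vars, j_vars = X.shape[1], Y.shape[1]
    W, P, Q = np.zeros((p_vars, A)), np.zeros((p_vars, A)), np.zeros((j_vars, A))
    for a in range(A):
        u = Ya[:, np.argmax((Ya ** 2).sum(axis=0))].copy()  # start value for u
        w_old = np.zeros(p_vars)
        for _ in range(max_iter):
            w = Xa.T @ u
            w /= np.sqrt(w @ w)                  # C2.1: loading weights from u
            t = Xa @ w                           # C2.2: scores
            q = Ya.T @ t / (t @ t)               # C2.4: Y-loadings
            if np.sqrt(np.sum((w - w_old) ** 2)) < tol:   # C2.4b: convergence test
                break
            w_old = w
            u = Ya @ q / (q @ q)                 # C2.4c: new temporary Y-scores
        p_a = Xa.T @ t / (t @ t)                 # C2.3: X-loadings
        Xa = Xa - np.outer(t, p_a)               # C2.5: deflation of X ...
        Ya = Ya - np.outer(t, q)                 #        ... and Y
        W[:, a], P[:, a], Q[:, a] = w, p_a, q
    B = W @ np.linalg.solve(P.T @ W, Q.T)        # B = W (P'W)^-1 Q'
    b0 = y_mean - x_mean @ B
    return B, b0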


21. Appendix: Software Installation and User Interface

21.1 Welcome to The Unscrambler
Congratulations on entering the world of The Unscrambler! The Unscrambler is
a software package which helps product developers do their job faster and more
efficiently by providing them with the opportunity to use experimental design and
multivariate data analysis during the developmental phase, instead of the more
traditional sequential approach, which focuses on one variable at a time. As you
begin to explore The Unscrambler’s powerful analytical tools, we are sure that
you will learn to appreciate how much it can help you with your work.

The Unscrambler is developed and published by CAMO ASA. Created in the
mid-1980’s, the software has been continually improved ever since. In 1996, a
completely reengineered version for Windows 95 and Windows NT was released.

21.2 How to Install and Configure The Unscrambler

Installing The Unscrambler is straightforward, provided that you ensure that the
requirements given below are fulfilled, and follow the installation procedure.

Hardware Requirements
We recommend that you use at least a Pentium PC running at 100 MHz or more.
Memory space is an important issue, at least 16 MB of RAM should be available,
preferably 32 MB. Using a more powerful PC improves performance significantly
and is advisable if your data tables are large.

Software Requirements
The Unscrambler software is written for the Windows 95 and Windows NT (3.51
or later) operating systems. The program does not run on Windows 3.x or Windows
for Workgroups platforms.

Installation Procedure
The Unscrambler is supplied on a set of floppy disks or a single CD-ROM. If you
have got a floppy version, insert disk 1 into your floppy drive and use the File
Manager or Windows Explorer to run SETUP.EXE on the floppy disk. If you have
got a CD-ROM version, the SETUP.EXE program can be found in the DISK1
directory.

Follow the on-screen instructions to complete the installation.

Supervisor Responsibilities
The Unscrambler requires that one person is appointed as supervisor (system
manager). The supervisor’s main task is to maintain the user accounts.

The supervisor must log in after installation and define the users who are allowed
access to The Unscrambler before they can begin to work with the program.

Start The Unscrambler and log in as supervisor by clicking on the caption bar in
the login window with the right mouse button or pressing <Ctrl>+<Shift>+<S> (see
Figure 21.1). The default supervisor password at delivery is SYSOP.

Figure 21.1 - Dialog: The Unscrambler Startup

User accounts are maintained from Project - System Setup. Select the Users tab
in the System Setup dialog (shown in Figure 21.2). New users are added by
pressing New. Select a user from the Users list and press Modify to set or change
the password.

Figure 21.2 - Dialog: System Setup, Users sheet

The supervisor also defines how missing values should be handled by default when
users import or export data. Finally, the supervisor can also move the data directory
to a new location by pressing Change on the Directories sheet (see Figure 21.3).

Figure 21.3 - Dialog: System Setup, Directories sheet

Note that the data files are copied to the new location, not physically moved. This
ensures that a backup exists if the location change fails for some reason. The
previous data directory can be removed manually if desired.

21.3 Problems You Can Solve with The Unscrambler

The main purpose of The Unscrambler is to provide you with tools which can
help you analyze multivariate data. By this we mean finding variations, co-
variations and other internal relationships in data matrices (tables). You can also
use The Unscrambler to design the experiments you need to perform to achieve
results which you can analyze.

The following are the five basic types of problems which can be solved using The
Unscrambler:
• Design experiments, analyze effects and find optima;
• Find relevant variation in one data matrix;
• Find relationships between two data matrices (X and Y);
• Predict the unknown values of a response variable;
• Classify unknown samples into various possible categories.

You should always remember, however, that there is no point in trying to analyze
data if they do not contain any meaningful information. Experimental design is a
valuable tool for building data tables which give you such meaningful information.
The Unscrambler can help you do this in an elegant way.

21.4 The Unscrambler Workplace


A short overview of the user interface is given in this section. You can improve the
way you analyze multivariate data matrices by taking advantage of all The
Unscrambler’s features.

Windows 95 and Windows NT


The Unscrambler runs under Windows 95 and Windows NT. We will assume that
you are already familiar with the operating system you are using. If not, we
recommend that you become fully acquainted with it before starting to work with
The Unscrambler. This manual does not explain in detail important points such as
the Windows graphical interface, common use of the mouse, etc. Refer to your
Windows user guide for more information.

The descriptions and screen dumps in this manual are taken from a Windows 95
installation. Some dialogs may differ in appearance on Windows NT systems,
although their functions remain the same.

The Main Window


When you start The Unscrambler, you enter the Main Window, as seen in Figure
21.4.

Figure 21.4 - The Unscrambler main window

The Menu Bar


All operations in The Unscrambler are performed with the help of the menus and
options available to you on the Menu bar. Figure 21.4 shows the default menus
which are enabled when you load The Unscrambler, which all in turn contain
several sub-menus and options. When you have an Editor or a Viewer open (see
chapters 21.4.2 and 21.4.3 respectively), more menus will be available. Some sub-
menus and options may be invalid in a given context; these are grayed out.

Context Sensitive Menus


The Unscrambler also features so-called context sensitive menus. You access
these by clicking the right mouse button while the cursor rests on the area on which
you want to perform an operation. The context sensitive menus are a kind of short-
cut, as they contain only the options which are valid for the selected area, which
will save you the work of having to click your way through all the menus on the
Menu bar.

The Toolbar
The Toolbar buttons give you shortcuts to the most frequently used commands.
When you let the mouse cursor rest on a toolbar button, a short explanation of its
function appears.

The Status Bar


The Status bar at the bottom of the screen displays concise information. A short
explanation of the current menu option is displayed to the left. On the right-hand
side, additional information, such as the value of the current cell in the Editor and
the size of the data table, is displayed.

21.4.2 The Editor


The Editor, as seen in Figure 21.5, handles data in The Unscrambler. Each time
you open or create a data file, its contents will appear in an Editor window. You
can open several Editors with different contents at the same time, switching from
one to another as you wish. Each time we mention the Editor in this manual and the
Help system, we mean a window where data or results are displayed in a tabular
form.

Figure 21.5 - Unscrambler Editor

Basic Notions
The Editor consists of a data table made up of rows and columns. The intersection
of a column and a row is called a cell; each cell holds a data value. The rows and
columns correspond to samples and variables respectively. Samples and variables
are identified by a number and a name.

Active Cell and Cell Selection


At any given time, one cell in an Editor is active. The active cell is marked with a
frame. Activate a cell by using the arrow keys to position the cursor or by clicking
with the left mouse button on it.

You can also select a range of cells in the Editor, i.e. one or more columns, or one
or more rows.

A whole row or column can be selected by clicking with the left mouse button on
the sample or variable number (the gray area between the names and the data table
itself). Keep the button down and drag the cursor to select more rows or columns.
Selecting a new range removes the last range.

To add new samples or variables to an existing selection and to make a range, press
the <Ctrl> key while you click on the appropriate samples or variables. The range
may be continuous or non-continuous. You can also deselect a sample or variable
by pressing the <Ctrl> key while clicking on the object you want to remove from
the range, in toggle action. This is only possible with the mouse.

Hold down the <Shift> key while you make the selection if you want to select a
continuous block of samples or variables between the last selection and the present
selection.

When you make a selection, you always mark either samples or variables, i.e. you
either select some variables for all samples or some samples for all variables. You
can also mark the whole matrix, but the selection is still sample or variable
oriented. The difference is important because you define sets (see chapter 21.5.1 )
based on either samples or variables. You see whether you are marking samples or
variables by looking at the shape of the mouse pointer as you make the selection;
see Figure 21.6.

Figure 21.6 - The shape of the mouse pointer when marking samples and
variables respectively

Screen Layout
If the data table is larger than the screen, you can scroll the Editor.

Information about the active cell is displayed in The Unscrambler’s status bar.
Variable names are displayed in black if the variable is continuous and in blue if it
is a category variable. Locked cells, e.g. design variables, are grayed out to show
that they cannot be edited.

Plotting from the Editor


You can easily plot from the Editor: Select the samples or variables you want to
look at graphically and select the plot type you want from the Plot menu as seen in
Figure 21.7. You can choose between several different plots, depending on how
many samples/variables you have selected. A dialog will appear, in which you
select which set to plot.

Figure 21.7 - Options in the Plot menu when one variable is selected

21.4.3 The Viewer


In the Viewer, data and results are visualized graphically in an interactive manner.
Whenever you make a plot, it appears in a Viewer. Every time the Viewer is
mentioned throughout this manual and Help system, we are referring to a window
where a plot is displayed.

Several Viewers can be open at the same time. In addition, one Viewer can display
several plots. This is possible because the Viewer is divided into seven so-called
sub-views, organized as shown in Table 21.1.

Table 21.1 - Organization of sub-views

Sub-view 1 covers the whole Viewer window; sub-views 2 and 3 correspond to the
upper and lower halves of the window; sub-views 4 to 7 correspond to the four
quadrants.

Figure 21.8 shows a typical Viewer with sub-views 4–7.

Figure 21.8 - The Unscrambler Viewer

Plotting from the Viewer


Data and results can be plotted in three different ways, of which the last two are
done from the Viewer:
• Display a selected part of the data table in The Editor (see chapter “Plotting
from the Editor” on page 533);
• Display data from any data table or result matrix;
• Display predefined result plots from an analysis.

Display Data from a Table or Result Matrix


You do this by selecting Results - General View. An empty Viewer appears,
giving you access to all data and result files from the Plot menu.

Display Predefined Plots


This option makes use of The Unscrambler’s many predefined result plots. After
each analysis you can choose to see an overview plot of the most important results
by pressing View. The Plot menu then consists of the appropriate result plots for the
type of analysis you have performed. You can also access the results plots from
File - Open at any time.

Plot Information
The Unscrambler gives a lot of information about the data in the current plot. If
the Plot ID is turned on, a line at the bottom of the plot displays basic information.
Toggle the Plot ID on and off using View - Plot ID. Table 21.2 shows some typical
ways of identifying plots.

Table 21.2 - Plot ID syntax

Score plot, 2D Scatter. Typical ID line: Alcohol, X-expl: 70%, 14% Y-expl: 29%, 28%.
The results file is Alcohol. The explained X-variance is 70% for PC 1 and 14% for
PC 2. The explained Y-variance is 29% for PC 1 and 28% for PC 2.

Loading plot, line. Typical ID line: Octane, PC(X-expl, Y-expl): 1(70%,29%).
The results file is Octane. The explained X-variance is 70% and the explained
Y-variance is 29% for PC 1.

Predicted vs. Measured. Typical ID line: Alcohol, (Y-var, PC): (Methanol,3).
The results file is Alcohol. The predicted vs. measured results for the Y-variable
Methanol are plotted using 3 PCs.

Other information about the plotted data such as data source, explanation of colors
and symbols, etc., may also be shown in a separate window using Window-
Identification. These windows are dockable views.

Use View - Plot Statistics to display the most relevant statistical measures.

Information on each object in the plot can be displayed simply by letting the mouse
cursor rest on the object in the plot. A brief explanation of the data point then
appears. Click with the left mouse button to display more detailed information
about the data object.

Use of Colors
There are two pre-set color schemes in The Unscrambler; Black background and
White background. You can change the color on any of the items of the Viewer. This
is done through File - System Setup - Viewer - Define colors…. It is possible to
use different color schemes for the screen and the printer. Note that items other
than the background and the axis (foreground) also differ in the two preset color
schemes; see Table 21.3 for details:

Table 21.3 - The Unscrambler color schemes


Item Black background White background
Foreground White Black
Curve 1 Cyan Blue
Curve 2 Magenta Magenta
Curve 3 Yellow Green
Curve 4 Blue Cyan
Curve 5 Green Brown

It is also possible to set the color for a specific item. The changes will be shown on
the preview screen.

21.4.4 Dockable Views


The Unscrambler shows different kinds of information in dockable views. A
dockable view is a window that “floats” on the desktop and which can be “glued”
to the edges of The Unscrambler workspace at wish, hence the term “dockable”.

Dockable views are toggled on and off in the Window or View menu. Dockable
Views in the Window menu are Identification and Warning List, in the View menu
Outlier List.

Click the title bar of the dockable view to drag it around the screen. The shape of
the view changes when you get close to the edge of The Unscrambler workplace.
When you release the mouse button, the view is glued to the edge. To move it
again, click inside the docked view and drag it away. When you get outside or well
inside the edges of The Unscrambler workspace, the shape changes again and it
has become a floating window.

21.4.5 Dialogs
When you are working in The Unscrambler, you will often have to enter
information or make choices in order to be able to complete your project, such as
specifying the names of files you want to work with or the sets which you want to
analyze, or how many PCs you want to compute (see chapter 21.5.1 on page 540
for an explanation of sets). This is done in dialogs, which will normally look
something like the one pictured in Figure 21.9.

Figure 21.9 - Unscrambler dialog, with its typical elements: radio buttons, drop-down
lists, tabs giving access to more options, lists of values or ranges, buttons opening a
new dialog, tick boxes and spin buttons

This particular dialog is the one you enter when you want to run a Regression on
your data. Items that are predefined, such as sets, file filters, etc., are selected from
a drop-down list. Ranges of samples or variables are entered as shown in the Keep
Out of Calculation field in the figure. You can use a comma to separate two items
in a field, and a hyphen to specify the whole range between two values.
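
For example (the sample numbers here are arbitrary, only meant to illustrate the
syntax): entering 3, 7-10, 15 in the Keep Out of Calculation field keeps out sample 3,
samples 7 through 10, and sample 15.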

Options which are mutually exclusive are selected via radio buttons. Tick boxes are
used to select multiple options. For example, you may center data and issue
warnings at the same time.

Plot Preview in Plot Dialogs


Plot dialogs show you a preview of any plot type you are about to make, enabling
you to check that your choice of plot was correct; see Figure 21.10. You do however
have to keep in mind that this is not a preview of your own data, it just shows the
general shape of the plot and indicates in which sub-view the plot will be displayed.

Figure 21.10 - Plot Preview

Extracted File Information


Dialogs involving management of files on disk have an information field at the
bottom. The Unscrambler searches the files for information that tells you more
about the contents of the file. This information typically includes:

Type of file
File name
Directory name
Date
Owner of the file
Size of the matrices
Set information
Weighting information
Calibration method
Validation method

A preview screen (see Figure 21.11) in the information area of the file dialogs shows
the true residual variance curve for the currently selected model whenever you select
a result file that has the residual variance saved. The dot shows the optimal number
of components suggested by The Unscrambler.

Figure 21.11 - Preview screen for true residual variance curve

You also have access to the variance as a numerical table, and the warnings from the
information field of the dialogs.

21.4.6 The Help System


The Help system has been implemented to give you the help and advice you need
when you are working with The Unscrambler. Help is available on the following
topics:
• Use of the dialogs
• Use of the methods
• Interpretation of plots

Access the Help system at any time by pressing the <F1> button or clicking on the
help button in the dialogs. The Help file is automatically opened at the appropriate
topic.

You may also open the help system by selecting Help - Unscrambler Help Topics;
this displays all the contents of the Help file. From there you can click your way to
the items you are interested in, just as you would open a book. Use the Index tab to
search for keywords.

Several levels of help are available. Click on underlined words to follow built-in
links to related help topics.

21.4.7 Tooltips
Whenever you let the cursor rest on one of The Unscrambler’s buttons or icons, a
small yellow label pops up to tell you its function. This is the quickest way to learn
the functions of toolbar buttons.

21.5 Using The Unscrambler Efficiently


The Unscrambler contains many powerful tools for handling and analyzing your
data. In this chapter, we will try to describe the basic mechanisms involved in this
process in a concrete way, to give you an idea of what you can do with The
Unscrambler, and how to do it efficiently.

21.5.1 Analyses
Most projects involve many different problems; yours probably will too. Let us
illustrate this with an example, a study of bananas involving different types of
measurements of many different properties. The aim of the study is to find answers
to questions like:
• Are there any correlations between sensory measurements and color
measurements?
• Can preference measurements be predicted from sensory measurements?
• Can sensory measurements be predicted from chemical measurements?

Often, measurements like these will be stored in different matrices, especially if
they were taken at different times, by different people, or even at different
locations. It is however difficult to maintain multiple files which contain the same
data. Therefore, we recommend that you keep all related data in one file, as one
large data table. With “related data” we mean samples that it is natural to analyze
together because they contain measurement values of the same variables. Data
tables are displayed in the Editor.

It is possible to combine data tables like this, because one table can contain several
“matrices”:

One Data Table, Several Matrices


If we define a matrix as a set of several rows and columns, Figure 21.12 illustrates
how a matrix can cover a part of a large data table.

Figure 21.12 - Matrix as a part of a larger data table

A complete data table consists of a number of samples (e.g. objects, cases,
observations, experiments). For each sample you will usually have collected
variable measurements of a number of properties (e.g. taste, absorbance,
concentrations, pressure, temperature, origin, color, etc.). The resulting data table
consists of a collection of samples organized as rows and variables organized as
columns. It is useful to have all your data in one table, but you may not always
want to work with the whole table. Therefore, The Unscrambler lets you define
several different matrices within one data table.

Matrices Are Defined by Sets


To ease analysis and interpretation, The Unscrambler offers practical functions
which allow you to define specific parts of your data by giving them separate
names. We call these parts sets. If you for instance want to make a regression
model, you can simply select which set (i.e. which part of the data) to use as X
(predictors), and which set to use as Y (responses), instead of having to make a new
data table containing only the samples and variables you want to use for this
specific model. Without this feature, you would have to store copies of the same
data in several different files, making it harder to maintain all copies, and
increasing the risks of losing valuable work through data being lost, destroyed, or
mixed up.

The Unscrambler by default predefines the special Sets “All Variables” and
“Currently selected Variables” (available if you have marked variables in the
Editor) to make selection easier. You can define as many additional sets as you
need in the Set Editor, which you enter by selecting Modify - Edit Set.

A matrix is completely defined by combining a Variable Set (columns) and a
Sample Set (rows). You can regard sets as many virtual matrices. You may define
overlapping sets, i.e. different sets which share some of their samples or variables.
If you change your raw data (e.g. by deleting a variable), all relevant sets (i.e. all
variable sets containing this variable) are automatically updated. Working with sets
therefore gives you complete control of your data.

Variable Sets
In the banana example discussed above you could define one Variable Set called
“Sensory”, which contains the sensory measurements. You could also define
another set “Pref” for the preference measurements, the set “Chemistry” for the
chemical measurements, etc. (see Figure 21.13).

Figure 21.13 - Various Variable Sets in one data table

Depending on which problem you want to solve, it is then easy to select which
variables to use as X and which as Y.

Note that a set can contain non-continuous selections. The “Sensory” set in Figure
21.13 is split in two by the “Chemistry” set. We also see that some of the variables
in the “Sensory” set are part of a “Fourth” set as well; the sets overlap.

Note!
A regression model cannot be based on two overlapping sets used as X
and Y respectively. The program issues a warning if you have defined a
model like this.

Sample Sets
In practice, you do not always want to use all the available samples in a particular
analysis. For example, you might have collected one group of samples early in the
season and another group later, with a third group taken at a different site (B).
From these three groups of samples it is possible to define a range of different
Sample Sets, such as “Late”, “Early”, “Late A”, “Late B”, etc. (see Figure 21.14). It
is then easy to make different models based on different parts of the data table.
Remember that “All Samples” is a predefined set.

Figure 21.14 - Sample Sets

Note!
Do not define separate Sample Sets for calibration and validation
samples. Samples used for validation are always taken from the Sample
Set that you use when you make the model from the Task menu.

Define The Sets before You Start Your Analysis


Ideally you should organize your data before you start an analysis. When you have
imported or punched in the data, think about the problems you need to solve.
Which analyses will be necessary? What relationships should be investigated?

Give meaningful names to your variables and samples, so that you can remember
what they are later on. Add and edit category variables, and define Sample and Variable Sets,
again using appropriate names.

If you find that you need to define a new set at a later stage of your work in The
Unscrambler, this can be done in the appropriate analysis dialog under the Task
menu when you are about to select which data to analyze.

Keep Out of Calculation


In an analysis dialog, you can use the function Keep Out of Calculation to omit
variables or samples from the chosen Variable or Sample Set respectively. This can
be useful e.g. if you want to remove outliers.

Note that this function does not affect the raw data; it simply defines data which is
to be disregarded during the analysis. Never delete data from the raw data file if
there is a chance that you might need it in the future; instead, you should use Keep
Out of Calculation to remove the data temporarily.

Selected Samples or Variables


It is possible to make a Sample or Variable Set by marking samples or variables in
a data table using the data Editor: Whenever you select an option from the Task
menu you will see a set option called Selected samples/variables. This set consists of
the samples or variables that are marked in the Editor, and can be used in analyses
just like any other set. However, you can only select either a Sample or a Variable
Set by this method, not both at the same time.

Note!
The “Selected samples/variables” Set is not saved on disk, so do not use
this option for important sets! However, a copy of the set used to make a
model is always saved in the result file.

Enter the Set Properties


All Variable Sets are marked as either Spectra or Non-Spectra. This property is used
to change the predefined plots for you automatically, as plots are used differently
depending on whether you are working with spectra or some other kind of data. For
example, when interpreting your PCA results, it is easier to understand the
contribution of each wavelength to PC1 by plotting the loadings as a line. This
looks like a spectrum and is easy to read. On the other hand, if you are interpreting
a PCA on the sensory properties of bananas, a 2D loading plot (in association with
a 2D score plot) will tell you which sensory properties have high values for which
samples. For these reasons, results that come from Variable Sets marked Spectra
have a different set of predefined plots than results from other data.

You specify the set properties when you define a new Variable Set, and you can
change it later using Modify - Edit Set.

Note!
The set containing the design variables is set to Non-Spectra by default;
this setting cannot be changed.

21.5.2 Some Tips to Make Your Work Easier


Using Several Plot Windows in the Same Viewer
The Unscrambler remembers seven different sub-views for each Viewer (see page
534). The Viewer can display up to four visible sub-views at a time. In predefined
plots, this option is utilized to display several plots simultaneously. You can also
do this yourself:

You may have a residual plot filling the whole Viewer and want to look at another
result together with this plot. Select Window - Copy To - 2. The Viewer window is
split in two and the residual plot is copied to the upper half. Click the lower sub-
view to activate it (the active sub-view is indicated by a light blue frame) and create
the other plot from the Plot menu.

Another frequently occurring situation is this: After an analysis you open the
Viewer to look at an overview of your model results. But you also want to look at a
fifth plot from the same results file. You can easily plot the fifth plot without
ruining the four plots in the model overview by selecting Window - Go To - 1. An
empty sub-view pops up, allowing you to plot the desired predefined result plot
from the Plot menu. Go back to the overview by selecting Window - Go To - 4 (or
5, 6 or 7).

View Plots and Raw Data Simultaneously


When you plot results from an analysis, it is often convenient to look at the raw
data at the same time. Open an Editor with the raw data from the result Viewer by
selecting View - Raw Data. Mark a sample or variable in the Viewer and the same
objects are automatically marked in the Editor and vice versa.

You can use this feature to detect a sample you cannot find in the score plot: Mark
the sample in the Editor and you will see immediately where it is located on the
score plot.

Using Context Sensitive Menus


Context sensitive menus are used extensively throughout the whole program. We
recommend that you utilize this way of accessing different commands rather than
click your way through the menus.

The context sensitive menus are accessed by clicking the right mouse button while
the cursor rests within the area on which you want to perform an operation. The
menus that appear give you access to the most common commands for the current
task. Figure 21.15 shows a typical context sensitive menu which applies to the
selected area in the data table.

Figure 21.15 - Context sensitive menus

The file dialogs contain functions that are not available from the ordinary menus in
addition to the regular commands. File deletion is for example only possible from
the Open File dialog by clicking the right mouse button and selecting Delete.

Multivariate Data Analysis in Practice


21. Appendix: Software Installation and User Interface 547

In the Editor and Viewer the context sensitive menus are more like short cuts to the
most used commands.

Using Toolbars
Toolbars provide shortcuts to the most frequently used commands in the form of
predefined icons, so that you will not have to search through the
menus. The Toolbars are normally placed right below the Menu bar. You can drag
the Toolbars onto the workspace, where they stay floating over your Editors and
Viewers.

If a Toolbar disappears for you, you can toggle it on again in View - Toolbars.

Glossary of Terms

Accuracy
The accuracy of a measurement method is its faithfulness, i.e. how close
the measured value is to the actual value.

Accuracy differs from precision, which has to do with the spread of
successive measurements performed on the same object.

Additive Noise
Noise on a variable is said to be additive when its size is independent of
the level of the data value. The range of additive noise is the same for
small data values as for larger data values.

Analysis Of Effects
Calculation of the effects of design variables on the responses. It
consists mainly of Analysis of Variance (ANOVA), various Significance
Tests, and Multiple Comparisons whenever they apply.

Analysis Of Variance
Classical method to assess the significance of effects by decomposition
of a response’s variance into explained parts, related to variations in the
predictors, and a residual part which summarizes the experimental error.

The main ANOVA results are: Sum of Squares (SS), number of Degrees
of Freedom (DF), Mean Square (MS=SS/DF), F-value, p-value.

The effect of a design variable on a response is regarded as significant if
the variations in the response value due to variations in the design
variable are large compared with the experimental error. The
significance of the effect is given as a p-value: usually, the effect is
considered significant if the p-value is smaller than 0.05.

ANOVA
See Analysis of Variance

Axial Design
One of the three types of mixture designs with a simplex-shaped
experimental region. An axial design consists of extreme vertices,
overall center, axial points, end points. It can only be used for linear
modeling, and therefore it is not available for optimization purposes.

Axial Point
In an axial design, an axial point is positioned on the axis of one of the
mixture variables, and must be above the overall center, opposite the end
point.

B-Coefficient
See Regression Coefficient.

Bias
Systematic difference between predicted and measured values. The bias
is computed as the average value of the residuals.
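
Expressed as a formula, for n samples with residuals e1, e2, …, en:
Bias = (e1 + e2 + … + en) / n.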

Bilinear Modeling
Bilinear modeling (BLM) is one of several possible approaches for data
compression.

The bilinear modeling methods are designed for situations where
collinearity exists among the original variables. Common information in
the original variables is used to build new variables, that reflect the
underlying (“latent”) structure. These variables are therefore called
latent variables. The latent variables are estimated as linear functions of
both the original variables and the observations, thereby the name
bilinear.

PCA, PCR and PLS are bilinear methods.

Observation = Data Structure + Error
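
For PCA, for instance, this decomposition is commonly written X = T P' + E, where
T holds the scores, P the loadings and E the residuals; the structure part T P' is the
bilinear part of the model.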

Box-Behnken Design
A class of experimental designs for response surface modeling and
optimization, based on only 3 levels of each design variable. The mid-
levels of some variables are combined with extreme levels of others. The
combinations of only extreme levels (i.e. cube samples of a factorial
design) are not included in the design.

Box-Behnken designs are always rotatable. On the other hand, they
cannot be built as an extension of an existing factorial design, so they
are more recommended when changing the ranges of variation for some
of the design variables after a screening stage, or when it is necessary to
avoid too extreme situations.

Box-plot
The Box-plot represents the distribution of a variable in terms of
percentiles: minimum value, 25% percentile, median, 75% percentile and
maximum value.

Calibration
Stage of data analysis where a model is fitted to the available data, so
that it describes the data as well as possible.

After calibration, the variation in the data can be expressed as the sum of
a modeled part (structure) and a residual part (noise).

Calibration Samples
Samples on which the calibration is based. The variation observed in the
variables measured on the calibration samples provides the information
that is used to build the model.

If the purpose of the calibration is to build a model that will later be
applied on new samples for prediction, it is important to collect
calibration samples that span the variations expected in the future
prediction samples.

Candidate Point
In the D-optimal design generation, a number of candidate points are
first calculated. These candidate points consist of extreme vertices and
centroid points. Then, a number of candidate points is selected
D-optimally to create the set of design points.

Category Variable
A category variable is a class variable, i.e. each of its levels is a category
(or class, or type), without any possible quantitative equivalent.

Examples: type of catalyst, choice among several instruments, wheat
variety, etc.

Center Sample
Sample for which the value of every design variable is set at its mid-
level (halfway between low and high).

Center samples have a double purpose: introducing one center sample in
a screening design enables curvature checking, and replicating the center
sample provides a direct estimation of the experimental error.

Center samples can be included when all design variables are
continuous.

Centering
See Mean Centering.

Central Composite Design


A class of experimental designs for response surface modeling and
optimization, based on a two-level factorial design on continuous design
variables. Star samples and center samples are added to the factorial
design, to provide the intermediate levels necessary for fitting a
quadratic model.

Central Composite designs have the advantage that they can be built as
an extension of a previous factorial design, if there is no reason to
change the ranges of variation of the design variables.

If the default star point distance to center is selected, these designs are
rotatable.

Centroid Design
See Simplex-centroid design.

Centroid Point
A centroid point is calculated as the mean of the extreme vertices on the
design region surface associated with this centroid point. It is used in
Simplex-centroid designs, axial designs and D-optimal mixture/non-
mixture designs.

Classification
Data analysis method used for predicting class membership.
Classification can be seen as a predictive method where the response is a
category variable. The purpose of the analysis is to be able to predict
which category a new sample belongs to. The main classification
method implemented in The Unscrambler is SIMCA classification.

Classification can for instance be used to determine the geographical
origin of a raw material from the levels of various impurities, or to
accept or reject a product depending on its quality.

To run a classification, you need


• one or several PCA models (one for each class) based on the same
variables;
• values of those variables collected on known or unknown samples.

Each new sample is projected onto each PCA model. According to the
outcome of this projection, the sample is either recognized as a member
of the corresponding class, or rejected.

Collinearity
Linear relationship between variables. Two variables are collinear if the
value of one variable can be computed from the other, using a linear
relation. Three or more variables are collinear if one of them can be
expressed as a linear function of the others.

Variables which are not collinear are said to be linearly independent.


Collinearity - or near-collinearity, i.e. very strong correlation - is the
major cause of trouble for MLR models, whereas projection methods
like PCA, PCR and PLS handle collinearity well.


Component
See Principal Component.

Condition Number
It is the square root of the ratio of the highest eigenvalue to the smallest
eigenvalue of the experimental matrix. The higher the condition number,
the more spread the region. On the contrary, the lower the condition
number, the more spherical the region. The ideal condition number is 1;
the closer to 1 the better.

Confounded Effects
Two (or more) effects are said to be confounded when variation in the
responses cannot be traced back to the variation in the design variables
to which those effects are associated.

Confounded effects can be separated by performing a few new
experiments. This is useful when some of the confounded effects have
been found significant.

Confounding Pattern
The confounding pattern of an experimental design is the list of the
effects that can be studied with this design, with confounded effects
listed on the same line.

Constrained Design
Experimental design involving multi-linear constraints between some of
the designed variables. There are two types of constrained designs:
classical Mixture designs and D-optimal designs.

Constrained Experimental Region


Experimental region which is not only delimited by the ranges of the
designed variables, but also by multi-linear constraints existing between
these variables. For classical Mixture designs, the constrained
experimental region has the shape of a simplex.

Constraint
See Multi-linear constraint.

Continuous Variable
Quantitative variable measured on a continuous scale.

Examples of continuous variables are: Amounts of ingredients (in kg,
liters, etc.); recorded or controlled values of process parameters
(pressure, temperature, etc.).

Corner Sample
See vertex sample.

Correlation
A unitless measure of the amount of linear relationship between two
variables.

The correlation is computed as the covariance between the two variables
divided by the product of their standard deviations (i.e. the square roots of
their variances). It varies from -1 to +1.
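
In formula form, for two variables x and y:
r(x, y) = cov(x, y) / (sx . sy)
where sx and sy are the standard deviations of x and y.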

Positive correlation indicates a positive link between the two variables,
i.e. when one increases, the other has a tendency to increase too. The
closer to +1, the stronger this link.

Negative correlation indicates a negative link between the two variables,
i.e. when one increases, the other has a tendency to decrease. The closer
to -1, the stronger this link.

COSCIND
A method used to check the significance of effects using a scale-
independent distribution as comparison. This method is useful when
there are no residual degrees of freedom.

Covariance
A measure of the linear relationship between two variables.

The covariance is given on a scale which is a function of the scales of
the two variables, and may not be easy to interpret. Therefore, it is
usually simpler to study the correlation instead.
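
In formula form, for two variables x and y measured on n samples:
cov(x, y) = [ (x1 - mean(x))(y1 - mean(y)) + … + (xn - mean(x))(yn - mean(y)) ] / (n - 1)
i.e. the sum of cross-products of deviations from the means, corrected for degrees of
freedom.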

Cross Terms
See interaction effects.

Cross-Validation
Validation method where some samples are kept out of the calibration
and used for prediction. This is repeated until all samples have been kept
out once. Validation residual variance can then be computed from the
prediction residuals.

In segmented cross-validation, the samples are divided into subgroups or
“segments”. One segment at a time is kept out of the calibration. There
are as many calibration rounds as segments, so that predictions can be
made on all samples. A final calibration is then performed with all
samples.

In full cross-validation, only one sample at a time is kept out of the
calibration.
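
As an illustration with arbitrary numbers: with 20 calibration samples divided into
5 segments of 4 samples each, segmented cross-validation performs 5 calibration
rounds, each time keeping one segment of 4 samples out and predicting these from
the remaining 16; full cross-validation on the same data would instead perform 20
rounds, keeping out one sample at a time.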

Cube Sample
Any sample which is a combination of high and low levels of the design
variables, in experimental plans based on two levels of each variable.

In Box-Behnken designs, all samples which are a combination of high or
low levels of some design variables, and center level of others, are also
referred to as cube samples.

Curvature
Curvature means that the true relationship between response variations
and predictor variations is non-linear.

In screening designs, curvature can be detected by introducing a center
sample.

Data Compression
Concentration of the information carried by several variables onto a few
underlying variables.

The basic idea behind data compression is that observed variables often
contain common information, and that this information can be expressed
by a smaller number of variables than originally observed.

Degree Of Fractionality
The degree of fractionality of a factorial design expresses how much the
design has been reduced compared to a full factorial design with the
same number of variables. It can be interpreted as the number of design
variables that should be dropped to compute a full factorial design with
the same number of experiments.

Example: with 5 design variables, one can either build

• a full factorial design with 32 experiments (2^5);
• a fractional factorial design with a degree of fractionality of 1,
which will include 16 experiments (2^(5-1));
• a fractional factorial design with a degree of fractionality of 2,
which will include 8 experiments (2^(5-2)).

Degrees Of Freedom
The number of degrees of freedom of a phenomenon is the number of
independent ways this phenomenon can be varied.

Degrees of freedom are used to compute variances and theoretical
variable distributions. For instance, an estimated variance is said to be
“corrected for degrees of freedom” if it is computed as the sum of square
of deviations from the mean, divided by the number of degrees of
freedom of this sum.

Design Def Model


In The Unscrambler, predefined set of variables, interactions and
squares available for multivariate analyses on Mixture and D-optimal
data tables. This set is defined according to the I&S terms included in the
the model when building the design (Define Model dialog).

Design Variable
Experimental factor for which the variations are controlled in an
experimental design.

Design
See Experimental Design.

Distribution
Shape of the frequency diagram of a measured variable or calculated
parameter. Observed distributions can be represented by a histogram.

Some statistical parameters have a well-known theoretical distribution
which can be used for significance testing.

D-Optimal Design
Experimental design generated by the DOPT algorithm. A D-optimal
design takes into account the multi-linear relationships existing between
design variables, and thus works with constrained experimental regions.
There are two types of D-optimal designs: D-optimal Mixture designs
and D-optimal Non-Mixture designs, according to the presence or
absence of Mixture variables.

D-Optimal Mixture Design


D-optimal design involving three or more Mixture variables and at least
one Process variable. In a D-optimal Mixture design, multi-linear
relationships can be defined between Mixture variables and/or between
Process variables.

D-Optimal Non-Mixture Design


D-optimal design in which some of the Process variables are multi-
linearly linked; and which does not involve any Mixture variable.

D-Optimal Principle
Principle consisting in the selection of a sub-set of candidate points
which define a maximal volume region in the multi-dimensional space.
The D-optimal principle aims at minimizing the condition number.

Edge Center Point


In D-optimal and Mixture designs, the edge center points are positioned
in the center of the edges of the experimental region.

End Point
In an axial or a simplex-centroid design, an end point is positioned at the
bottom of the axis of one of the mixture variables, and is thus positioned
on the side opposite to the axial point.

Experimental Design
Plan for experiments where input variables are varied systematically
within predefined ranges, so that their effects on the output variables
(responses) can be estimated and checked for significance.

Experimental designs are built with a specific objective in mind, namely
screening or optimization.

The number of experiments and the way they are built depends on the
objective and on the operational constraints.

Experimental Error
Random variation in the response that occurs naturally when performing
experiments.

An estimation of the experimental error is used for significance testing,
as a comparison to structured variation that can be accounted for by the
studied effects.

Experimental error can be measured by replicating some experiments
and computing the standard deviation of the response over the replicates.
It can also be estimated as the residual variation when all “structured”
effects have been accounted for.

Experimental Region
N-dimensional area investigated in an experimental design with N
design variables. The experimental region is defined by:
• the ranges of variation of the design variables,
• if any, the multi-linear relationships existing between design
variables.
In the case of multi-linear constraints, the experimental region is said to
be constrained.

Explained Variance
Share of the total variance which is accounted for by the model.

Explained variance is computed as the complement to residual variance,
divided by total variance. It is expressed as a percentage.

For instance, an explained variance of 90% means that 90% of the
variation in the data is described by the model, while the remaining 10%
are noise (or error).
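
In formula form:
Explained variance = (total variance - residual variance) / total variance,
expressed as a percentage.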

F-Distribution
Fisher Distribution is the distribution of the ratio between two variances.

The F-distribution assumes that the individual observations follow an
approximate normal distribution.

Fixed Effect
Effect of a variable for which the levels studied in an experimental
design are of specific interest.

Multivariate Data Analysis in Practice


Glossary of Terms 561

Examples are effect of the type of catalyst, effect of temperature.

The alternative to a fixed effect is a random effect.

Fractional Factorial Design


A reduced experimental plan often used for screening of many variables.
It gives as much information as possible about the main effects of the
design variables with a minimum of experiments. Some fractional
designs also allow two-variable interactions to be studied. This depends
on the resolution of the design.

In fractional factorial designs, a subset of a full factorial design is
selected so that it is still possible to estimate the desired effects from a
limited number of experiments.

The degree of fractionality of a factorial design expresses how fractional
it is, compared with the corresponding full factorial.

F-Ratio
The F-ratio is the ratio between explained variance (associated to a
given predictor) and residual variance. It shows how large the effect of
the predictor is, as compared with random noise.

By comparing the F-ratio with its theoretical distribution
(F-distribution), we obtain the significance level (given by a p-value) of
the effect.
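
In the terms of the ANOVA table (see Analysis of Variance):
F = MS(effect) / MS(error)
i.e. the mean square associated with the predictor divided by the residual mean
square.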

Full Factorial Design


Experimental plan where all levels of all design variables are combined.

Such designs are often used for extensive study of the effects of few
variables, especially if some variables have more than two levels. They
are also appropriate as advanced screening designs, to study both main
effects and interactions, especially if no Resolution V design is
available.

Higher Order Interaction Effects


HOIE is a method to check the significance of effects by using higher
order interactions as comparison. This requires that these interaction
effects are assumed to be negligible, so that variation associated with
those effects is used as an estimate of experimental error.

Histogram
A plot showing the observed distribution of data points. The data range
is divided into a number of bins (i.e. intervals) and the number of data
points that fall into each bin is summed up.

The height of the bar in the histograms shows how many data points fall
within the data range of the bin.

Influence
A measure of how much impact a single data point (or a single variable)
has on the model. The influence depends on the leverage and the
residuals.

Interaction
There is an interaction between two design variables when the effect of
the first variable depends on the level of the other. This means that the
combined effect of the two variables is not equal to the sum of their
main effects.

An interaction that increases the main effects is a synergy. If it goes in
the opposite direction, it can be called an antagonism.

Intercept
=Offset. The point where a regression line crosses the ordinate (Y-axis).

Interior Point
Point which is not located on the surface, but inside of the experimental
region. For example, an axial point is a particular kind of interior point.
Interior points are used in classical mixture designs.

Lack Of Fit
In Response Surface Analysis, the ANOVA table includes a special
chapter which checks whether the regression model describes the true
shape of the response surface. Lack of fit means that the true shape is
likely to be different from the shape indicated by the model.

If there is a significant lack of fit, you can investigate the residuals and
try a transformation.

Lattice Degree
The degree of a Simplex-Lattice design corresponds to the maximal
number of experimental points -1 for a level 0 of one of the Mixture
variables.

Lattice Design
See Simplex-lattice design.

Least Square Criterion


Basis of classical regression methods, that consists in minimizing the
sum of squares of the residuals. It is equivalent to minimizing the
average squared distance between the original response values and the
fitted values.

Leveled Variables
A leveled variable is a variable which consists of discrete values instead
of a range of continuous values. Examples are design variables and
category variables.

Leveled variables can be used to separate a data table into different
groups. This feature is used by the Statistics task, and in sample plots
from PCA, PCR, PLS, MLR, Prediction and Classification results.

Levels
Possible values of a variable. A category variable has several levels,
which are all possible categories. A design variable has at least a low
and a high level, which are the lower and higher bounds of its range of
variation. Sometimes, intermediate levels are also included in the design.

Leverage Correction
A quick method to simulate model validation without performing any
actual predictions.

It is based on the assumption that samples with a higher leverage will be
more difficult to predict accurately than more central samples. Thus a
validation residual variance is computed from the calibration sample
residuals, using a correction factor which increases with the sample
leverage.

Note! For MLR, leverage correction is strictly equivalent to full cross-
validation. For other methods, leverage correction should only be used
as a quick-and-dirty method for a first calibration, and a proper
validation method should be employed later on to estimate the optimal
number of components correctly.

Leverage
A measure of how extreme a data point or a variable is compared to the
majority.

In PCA, PCR and PLS, leverage can be interpreted as the distance
between a projected point (or projected variable) and the model center.
In MLR, it is the object distance to the model center.

Average data points have a low leverage. Points or variables with a high
leverage are likely to have a high influence on the model.

Limits For Outlier Warnings


Leverage and Outlier limits are the threshold values set for automatic
outlier detection. Samples or variables that give results higher than the
limits are reported as suspect in the list of outlier warnings.

Linear Effect
See Main Effect.

Linear Model
Regression model including as X-variables the linear effects of each
predictor. The linear effects are also called main effects.

Linear models are used in Analysis of Effects in Plackett-Burman and
Resolution III fractional factorial designs. Higher resolution designs
allow the estimation of interactions in addition to the linear effects.

Loading Weights
Loading weights are estimated in PLS regression. Each X-variable has a
loading weight along each model component.

The loading weights show how much each predictor (or X-variable)
contributes to explaining the response variation along each model
component. They can be used, together with the Y-loadings, to represent
the relationship between X- and Y-variables as projected onto one, two
or three components (line plot, 2D scatter plot and 3D scatter plot
respectively).

Loadings
Loadings are estimated in bilinear modeling methods where information
carried by several variables is concentrated onto a few components.
Each variable has a loading along each model component.

The loadings show how well a variable is taken into account by the
model components. You can use them to understand how much each
variable contributes to the meaningful variation in the data, and to
interpret variable relationships. They are also useful to interpret the
meaning of each model component.

Lower Quartile
The lower quartile of an observed distribution is the variable value that
splits the observations into 25% lower values, and 75% higher values. It
can also be called 25% percentile.

Main Effect
Average variation observed in a response when a design variable goes
from its low to its high level.

The main effect of a design variable can be interpreted as linear
variation generated in the response, when this design variable varies and
the other design variables have their average values.
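
For a two-level design variable A this can be written:
Main effect of A = (average response at the high level of A) - (average response at
the low level of A).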

Mean
Average value of a variable over a specific sample set. The mean is
computed as the sum of the variable values, divided by the number of
samples.

The mean gives a value around which all values in the sample set are
distributed. In Statistics results, the mean can be displayed together with
the standard deviation.

Mean Centering
Subtracting the mean (average value) from a variable, for each data
point.
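
In formula form, for each value xij of variable j measured on sample i:
xij (centered) = xij - mean(xj)
where mean(xj) is the average of variable j over all samples.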

Median
The median of an observed distribution is the variable value that splits
the distribution in its middle: half the observations has a lower value
than the median, and the other half has a higher value. It can also be
called 50% percentile.

MixSum
See Mixture Sum.

Mixture Components
Ingredients of a mixture. There must be at least three components to
define a mixture. A single component cannot be called a mixture; two
components mixed together do not require a Mixture design to be
studied: study the variation in quantity of one of them as a classical
process variable.

Mixture Constraint
Multi-linear constraint between Mixture variables. The general equation
for the Mixture constraint is
X1 + X2 +…+ Xn = S

where the Xi represent the ingredients of the mixture, and S is the total
amount of mixture. In most cases, S is equal to 100%.

Mixture Design
Special type of experimental design, applying to the case of a Mixture
constraint. There are three types of classical Mixture designs: Simplex-
Lattice design, Simplex-Centroid design, and Axial design. Mixture
designs that do not have a simplex experimental region are generated
D-optimally; they are called D-optimal Mixture designs.

Mixture Region
Experimental region for a Mixture design. The Mixture region for a
classical Mixture design is a simplex.

Mixture Sum
In The Unscrambler, global proportion of a mixture. Generally, the
mixture sum is equal to 100%. However, it can be lower than 100% if
the quantity in one of the components has a fixed value.
The mixture sum can also be expressed as fractions, with values varying
from 0 to 1.

Mixture Variable
Experimental factor for which the variations are controlled in an
experimental mixture design or D-optimal mixture design. Mixture
variables are multi-linearly linked by a special constraint called mixture
constraint.

There must be at least three mixture variables to define a mixture design.


See Mixture Components.

Model Center
The model center is the origin around which variations in the data are
modeled. It is the (0,0) point on a score plot.

If the variables have been centered, samples close to the average will lie
close to the model center.

Model
Mathematical equation summarizing variations in a data set.

Models are built so that the structure of a data table can be understood
better than by just looking at all raw values.

Statistical models consist of a structure part and an error part. The
structure part (information) is intended to be used for interpretation or
prediction, and the error part (noise) should be as small as possible for
the model to be reliable.

Model Check
In Response Surface Analysis, a section of the ANOVA table checks
how useful the interactions and squares are, compared with a purely
linear model. This section is called Model Check.

If one part of the model is not significant, it can be removed so that the
remaining effects are estimated with a better precision.

Multi-Linear Constraint
They are linear relationships between two variables or more. The
constraints have the general form:
A1 . X1 + A2 . X2 +…+ An . Xn + A0 ≥ 0

where Xi are designed variables (mixture or process), and each constraint
is specified by the set of constants A0 … An.

A multi-linear constraint cannot involve both Mixture and Process


variables.

Multiple Comparison Tests


Tests showing which levels of a category design variables can be
regarded as causing real differences in response values, compared to
other levels of the same design variable.

For continuous or binary design variables, analysis of variance is
sufficient to detect a significant effect and interpret it. For category
variables, a problem arises from the fact that, even when analysis of
variance shows a significant effect, it is impossible to know which levels
are significantly different from others. This is why multiple comparisons
have been implemented. They are to be used once analysis of variance
has shown a significant effect for a category variable.

Multiple Linear Regression (MLR)


A method for relating the variations in a response variable (Y-variable)
to the variations of several predictors (X-variables), with explanatory or
predictive purposes.

An important assumption for the method is that the X-variables are
linearly independent, i.e. that no linear relationship exists between the
X-variables. When the X-variables carry common information, problems
can arise due to exact or approximate collinearity.

Noise
Random variation that does not contain any information.

The purpose of multivariate modeling is to separate information from
noise.

Non-Linearity
Deviation from linearity in the relationship between a response and its
predictors.

Normal Distribution
Frequency diagram showing how independent observations, measured
on a continuous scale, would be distributed if there were an infinite
number of observations and no factors caused systematic effects.

A normal distribution can be described by two parameters:


• a theoretical mean, which is the center of the distribution;
• a theoretical standard deviation, which is the spread of the
individual observations around the mean.

Normal Probability Plot


The normal probability plot (or N-plot) is a 2-D plot which displays a
series of observed or computed values in such a way that their
distribution can be visually compared to a normal distribution.

The observed values are used as abscissa, and the ordinate displays the
corresponding percentiles on a special scale. Thus if the values are
approximately normally distributed around zero, the points will appear
close to a straight line going through (0,50%).

A normal probability plot can be used to check the normality of the
residuals (they should be normal; outliers will stick out), and to visually
detect significant effects in screening designs with few residual degrees
of freedom.

Offset
See Intercept.

Optimization
Finding the settings of design variables that generate optimal response
values.

Orthogonal
Two variables are said to be orthogonal if they are completely
uncorrelated, i.e. their correlation is 0.

In PCA and PCR, the principal components are orthogonal to each other.

Factorial designs, Plackett-Burman designs, Central Composite designs
and Box-Behnken designs are built in such a way that the studied effects
are orthogonal to each other.

Orthogonal Designs
All classical designs available in The Unscrambler are built in such a
way that the studied effects are orthogonal to each other. They are called
orthogonal designs.

Outlier
An observation (outlying sample) or variable (outlying variable) which
is abnormal compared to the major part of the data.

Extreme points are not necessarily outliers; outliers are points that
apparently do not belong to the same population as the others, or that are
badly described by a model.

Outliers should be investigated before they are removed from a model,
as an apparent outlier may be due to an error in the data.

Overfitting
For a model, overfitting is a tendency to describe too much of the
variation in the data, so that not only consistent structure is taken into
account, but also some noise or uninformative variation.

Overfitting should be avoided, since it usually results in a lower quality
of prediction. Validation is an efficient way to avoid model overfitting.

Partial Least Squares Regression (PLS)


A method for relating the variations in one or several response variables
(Y-variables) to the variations of several predictors (X-variables), with
explanatory or predictive purposes.

This method performs particularly well when the various X-variables
express common information, i.e. when there is a large amount of
correlation, or even collinearity.

Partial Least Squares Regression is a bilinear modeling method where
information in the original X-data is projected onto a small number of
underlying (“latent”) variables called PLS components. The Y-data are
actively used in estimating the “latent” variables to ensure that the first
components are those that are most relevant for predicting the
Y-variables. Interpretation of the relationship between X-data and
Y-data is then simplified, as this relationship is concentrated on the
smallest possible number of components.

By plotting the first PLS components one can view main associations
between X-variables and Y-variables, and also interrelationships within
X-data and within Y-data.

PCA
See Principal Component Analysis.

PCR
See Principal Component Regression.

Percentile
The X% percentile of an observed distribution is the variable value that
splits the observations into X% lower values, and 100-X% higher
values.

Quartiles and median are percentiles. The percentiles are displayed
using a box-plot.
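
For example, with numpy on a small set of hypothetical values:

    import numpy as np

    values = np.array([2.0, 3.5, 4.1, 4.8, 5.0, 5.2, 6.3, 7.9, 8.4, 9.1])
    print(np.percentile(values, 25))   # lower quartile: 25% of the values lie below this point
    print(np.percentile(values, 50))   # median
    print(np.percentile(values, 75))   # upper quartile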

Plackett-Burman Design
A very reduced experimental plan used for a first screening of many
variables. It gives information about the main effects of the design
variables with the smallest possible number of experiments.

No interactions can be studied with a Plackett-Burman design, and
moreover, each main effect is confounded with a combination of several
interactions, so that these designs should be used only as a first stage, to
check whether there is any meaningful variation at all in the investigated
phenomena.

PLS Discriminant Analysis (PLS-DA)
Classification method based on modeling the differences between
several classes with PLS.

If there are only two classes to separate, the PLS model uses one
response variable, which codes for class membership as follows: -1 for
members of one class, +1 for members of the other one. The PLS1
algorithm is then used.

If there are three classes or more, PLS2 is used, with one response
variable (-1/+1 or 0/1, which is equivalent) coding for each class.
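
A sketch of this coding idea on hypothetical data, using scikit-learn (The Unscrambler sets up the coding for you): each class gets its own 0/1 response column, a PLS2 model is fitted, and a sample is assigned to the class with the largest predicted value.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (20, 5)),       # hypothetical class A samples
                   rng.normal(2, 1, (20, 5))])      # hypothetical class B samples
    labels = np.array(["A"] * 20 + ["B"] * 20)

    classes = np.unique(labels)
    Y = (labels[:, None] == classes[None, :]).astype(float)   # 0/1 coding, one column per class

    plsda = PLSRegression(n_components=2).fit(X, Y)
    pred = plsda.predict(X)                                   # predicted class indicators
    assigned = classes[np.argmax(pred, axis=1)]               # class with the highest predicted value
    print((assigned == labels).mean())                        # fraction correctly classified (calibration)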

PLS1
Version of the PLS method with only one Y-variable.

PLS2
Version of the PLS method in which several Y-variables are modeled
simultaneously, thus taking advantage of possible correlations or
collinearity between Y-variables.

Precision
The precision of an instrument or a measurement method is its ability to
give consistent results over repeated measurements performed on the
same object. A precise method will give several values that are very
close to each other.

Precision can be measured by standard deviation over repeated
measurements.

If precision is poor, it can be improved by systematically repeating the
measurements over each sample, and replacing the original values by
their average for that sample.

Precision differs from accuracy, which has to do with how close the
average measured value is to the target value.
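
A small numerical illustration, with hypothetical repeated measurements on one sample whose target value is 10.0:

    import numpy as np

    repeats = np.array([9.8, 10.3, 9.9, 10.1, 10.2])   # repeated measurements on one sample

    precision = np.std(repeats, ddof=1)                # spread of the repeats (standard deviation)
    accuracy_error = abs(repeats.mean() - 10.0)        # distance from the average to the target value
    print(precision, accuracy_error)

    print(repeats.mean())   # replacing the repeats by their average improves precision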

Prediction
Computing response values from predictor values, using a regression
model.

To make predictions, you need
• a regression model (PCR or PLS), calibrated on X- and Y-data;
• new X-data collected on samples which should be similar to the
ones used for calibration.

The new X-values are fed into the model equation (which uses the
regression coefficients), and predicted Y-values are computed.
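
Schematically, and assuming the intercept and regression coefficients of an already calibrated model are available (all numbers below are hypothetical), prediction is simply the model equation applied to the new X-data:

    import numpy as np

    # b0 (intercept) and b (regression coefficients) come from a calibrated PCR or PLS model.
    b0 = 1.2
    b = np.array([0.5, -0.3, 0.8])

    X_new = np.array([[2.0, 1.0, 0.5],     # new samples, measured on the same X-variables
                      [1.5, 0.7, 0.9]])

    y_pred = b0 + X_new @ b                # predicted Y-values from the model equation
    print(y_pred)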

Predictor
Variable used as input in a regression model. Predictors are usually
denoted X-variables.

Principal Component Analysis
PCA is a bilinear modeling method which gives an interpretable
overview of the main information in a multidimensional data table.

The information carried by the original variables is projected onto a
smaller number of underlying (“latent”) variables called principal
components. The first principal component covers as much of the
variation in the data as possible. The second principal component is
orthogonal to the first and covers as much of the remaining variation as
possible, and so on.

By plotting the principal components, one can view interrelationships
between different variables, and detect and interpret sample patterns,
groupings, similarities or differences.
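
A minimal sketch with scikit-learn on hypothetical data (the book's exercises carry out the same steps in The Unscrambler):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(4)
    X = rng.normal(size=(25, 8))             # hypothetical data table: 25 samples, 8 variables

    pca = PCA(n_components=2).fit(X)         # the data are centered internally
    scores = pca.transform(X)                # sample coordinates along PC1 and PC2 (score plot)
    loadings = pca.components_.T             # variable contributions to each PC (loading plot)
    print(pca.explained_variance_ratio_)     # fraction of the total variation covered by each PC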

Principal Component Regression
PCR is a method for relating the variations in a response variable
(Y-variable) to the variations of several predictors (X-variables), with
explanatory or predictive purposes.

This method performs particularly well when the various X-variables
express common information, i.e. when there is a large amount of
correlation, or even collinearity.

Principal Component Regression is a two-step method. First, a Principal
Component Analysis is carried out on the X-variables. The principal
components are then used as predictors in a Multiple Linear Regression.
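
The two steps translate directly into a short scikit-learn pipeline; this is only a sketch on hypothetical data, not the procedure used by the book's software.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(5)
    X = rng.normal(size=(30, 12))                        # hypothetical predictors
    y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=30)

    # Step 1: compress X into a few principal components; step 2: MLR on those components.
    pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X, y)
    print(pcr.predict(X[:5]))                            # fitted response values for the first samples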

Principal Component
Principal Components (PCs) are composite variables, i.e. linear
functions of the original variables, estimated to contain, in decreasing
order, the main structured information in the data. A PC is the same as a
score vector, and is also called a latent variable.

Principal components are estimated in PCA and PCR. PLS components
are also denoted PCs.

Process Variable
In The Unscrambler, experimental factor for which the variations are
controlled in an experimental design, and to which the mixture variable
definition does not apply (see Mixture Variable).

Projection
Principle underlying bilinear modeling methods such as PCA, PCR and
PLS.

In those methods, each sample can be considered as a point in a multi-dimensional space. The model will be built as a series of components
onto which the samples - and the variables - can be projected. Sample
projections are called scores, variable projections are called loadings.

The model approximation of the data is equivalent to the orthogonal
projection of the samples onto the model. The residual variance of each
sample is the squared distance to its projection.

Proportional Noise
Noise on a variable is said to be proportional when its size depends on
the level of the data value. The range of proportional noise is a
percentage of the original data values.

p-Value
The p-value measures the probability that a parameter estimated from
experimental data should be as large as it is, if the real (theoretical, non-
observable) value of that parameter were actually zero. Thus, p-value is
used to assess the significance of observed effects or variations: a small
p-value means that you run little risk of mistakenly concluding that the
observed effect is real.

The usual limit used in the interpretation of a p-value is 0.05 (or 5%). If
p-value < 0.05, you have reason to believe that the observed effect is not
due to random variations, and you may conclude that it is a significant
effect.

p-value is also called “significance level”.
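
For instance, given a t-value and its residual degrees of freedom (both hypothetical here), the two-sided p-value can be computed with scipy:

    from scipy import stats

    t_value = 2.8     # ratio of an estimated effect to its standard error
    df = 10           # residual degrees of freedom

    p_value = 2 * stats.t.sf(abs(t_value), df)   # two-sided tail probability of the t-distribution
    print(p_value)                               # about 0.02 < 0.05: the effect would be called significant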

Quadratic Model
Regression model including as X-variables the linear effects of each
predictor, all two-variable interactions, and the square effects.

With a quadratic model, the curvature of the response surface can be approximated in a satisfactory way.
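
The X-part of such a model can be generated mechanically from the design variables; here is a sketch with scikit-learn's PolynomialFeatures on two hypothetical coded variables.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[-1.0, -1.0],
                  [ 1.0, -1.0],
                  [ 0.0,  0.0],
                  [ 1.0,  1.0]])                # two design variables in coded levels

    quad = PolynomialFeatures(degree=2, include_bias=False)
    X_quad = quad.fit_transform(X)              # columns: x1, x2, x1^2, x1*x2, x2^2
    print(quad.get_feature_names_out(["x1", "x2"]))
    print(X_quad)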

Random Effect
Effect of a variable for which the levels studied in an experimental
design can be considered to be a small selection of a larger (or infinite)
number of possibilities.

Examples: Effect of using different batches of raw material, effect of
having different persons perform the experiments.

The alternative to a random effect is a fixed effect.

Random Order
Randomization is the random mixing of the order in which the
experiments are to be performed. The purpose is to avoid systematic
errors which could interfere with the interpretation of the effects of the
design variables.

Reference Sample
Sample included in a designed data table to compare a new product
under development to an existing product of a similar type.

The design file will contain only response values for the reference
samples, whereas the input part (the design part) is missing (m).

Regression Coefficient
In a regression model equation, regression coefficients are the numerical
coefficients that express the link between variation in the predictors and
variation in the response.

Regression
Generic name for all methods relating the variations in one or several
response variables (Y-variables) to the variations of several predictors
(X-variables), with explanatory or predictive purposes.

Regression can be used to describe and interpret the relationship
between the X-variables and the Y-variables, and to predict the Y-values
of new samples from the values of the X-variables.

Repeated Measurement
Measurement performed several times on one single experiment or
sample.

The purpose of repeated measurements is to estimate the measurement
error, and to improve the precision of an instrument or measurement
method by averaging over several measurements.

Replicate
Replicates are experiments that are carried out several times. The
purpose of including replicates in a data table is to estimate the
experimental error.

Replicates should not be confused with repeated measurements, which
give information about measurement error.

Residual
A measure of the variation that is not taken into account by the model.

The residual for a given sample and a given variable is computed as the
difference between observed value and fitted (or projected, or predicted)
value of the variable on the sample.

Residual Variance
The mean square of all residuals, sample- or variable-wise.

This is a measure of the error made when observed values are
approximated by fitted values, i.e. when a sample or a variable is
replaced by its projection onto the model.

The complement to residual variance is explained variance.
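
As a small numerical illustration of these definitions, with hypothetical observed and fitted values:

    import numpy as np

    observed = np.array([2.1, 3.0, 4.2, 5.1, 6.0])
    fitted   = np.array([2.0, 3.2, 4.0, 5.3, 5.9])    # model approximations of the observed values

    residuals = observed - fitted
    residual_variance = np.mean(residuals ** 2)                   # mean square of the residuals
    total_variance = np.mean((observed - observed.mean()) ** 2)

    explained_variance = 1 - residual_variance / total_variance   # the complement, as a fraction of the total
    print(residual_variance, explained_variance)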

Resolution
Information on the degree of confounding in fractional factorial designs.

Resolution is expressed as a roman number, according to the following
code:
• in a Resolution III design, main effects are confounded with
2-factor interactions;
• in a Resolution IV design, main effects are free of confounding
with 2-factor interactions, but 2-factor interactions are confounded
with each other;
• in a Resolution V design, main effects and 2-factor interactions are
free of confounding.

More generally, in a Resolution R design, effects of order k are free of
confounding with all effects of order less than R-k.

Response Surface Analysis
Regression analysis, often performed with a quadratic model, in order to
describe the shape of the response surface precisely.

This analysis includes a comprehensive ANOVA table, various
diagnostic tools such as residual plots, and two different visualizations
of the response surface: contour plot and landscape plot.

Note!
Response surface analysis can be run on designed or non-designed data.

Response Variable
Observed or measured parameter which a regression model tries to
predict.

Responses are usually denoted Y-variables.

RMSEC
Root Mean Square Error of Calibration. A measurement of the average
difference between predicted and measured response values, at the
calibration stage.

RMSEC can be interpreted as the average modeling error, expressed in
the same units as the original response values.

RMSED
Root Mean Square Error of Deviations. A measurement of the average
difference between the abscissa and ordinate values of data points in any
2D scatter plot.

RMSEP
Root Mean Square Error of Prediction. A measurement of the average
difference between predicted and measured response values, at the
prediction or validation stage.

RMSEP can be interpreted as the average prediction error, expressed in
the same units as the original response values.
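
The computation is the same for RMSEC and RMSEP; only the data differ (calibration fit versus prediction on new samples). A numpy sketch with hypothetical values, which also shows the related bias and SEP:

    import numpy as np

    y_measured  = np.array([10.2, 11.5, 9.8, 12.1, 10.9])   # reference values for the test samples
    y_predicted = np.array([10.0, 11.9, 9.5, 12.4, 11.1])   # values predicted by the model

    errors = y_predicted - y_measured
    rmsep = np.sqrt(np.mean(errors ** 2))   # average prediction error, in the units of Y
    bias = np.mean(errors)                  # systematic part of the error
    sep = np.std(errors, ddof=1)            # SEP: spread of the prediction errors
    print(rmsep, bias, sep)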

Sample
Object or individual on which data values are collected, and which
builds up a row in a data table.

In experimental design, each separate experiment is a sample.

Scaling
See Weighting.

Scatter Effects
In spectroscopy, scatter effects are effects that are caused by physical
phenomena, like particle size, rather than chemical properties. They
interfere with the relationship between chemical properties and shape of
the spectrum. There can be additive and multiplicative scatter effects.

Additive and multiplicative effects can be removed from the data by
different methods. Multiplicative Scatter Correction removes the effects
by adjusting the spectra from ranges of wavelengths supposed to carry
no specific chemical information.
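
A compact sketch of the Multiplicative Scatter Correction idea in numpy, on simulated spectra; this full-spectrum version ignores the option of restricting the fit to selected wavelength ranges.

    import numpy as np

    rng = np.random.default_rng(6)
    base = np.sin(np.linspace(0, 3, 50)) + 2.0            # hypothetical "pure" spectrum
    slopes = rng.uniform(0.8, 1.2, size=10)               # multiplicative scatter per sample
    offsets = rng.uniform(-0.2, 0.2, size=10)             # additive scatter per sample
    spectra = offsets[:, None] + slopes[:, None] * base   # 10 samples x 50 wavelengths

    reference = spectra.mean(axis=0)                      # common reference: the mean spectrum
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        b, a = np.polyfit(reference, x, deg=1)            # fit x = a + b * reference
        corrected[i] = (x - a) / b                        # remove the additive and multiplicative effects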

Scores
Scores are estimated in bilinear modeling methods where information
carried by several variables is concentrated onto a few underlying
variables. Each sample has a score along each model component.

The scores show the locations of the samples along each model
component, and can be used to detect sample patterns, groupings,
similarities or differences.

Screening
First stage of an investigation, where information is sought about the
effects of many variables. Since many variables have to be investigated,
only main effects, and optionally interactions, can be studied at this
stage.

There are specific experimental designs for screening, such as factorial
or Plackett-Burman designs.

Significance Level
See p-value.

Significant
An observed effect (or variation) is declared significant if there is a
small probability that it is due to chance.

SIMCA Classification
Classification method based on disjoint PCA modeling.

SIMCA focuses on modeling the similarities between members of the
same class. A new sample will be recognized as a member of a class if it
is similar enough to the other members; else it will be rejected.

Simplex
Specific shape of the experimental region for a classical mixture design.
A simplex has N corners but N-1 independent variables in an
N-dimensional space. This results from the fact that whatever the
proportions of the ingredients in the mixture, the total amount of mixture
has to remain the same: the Nth variable depends on the N-1 other ones.
When mixing three components, the resulting simplex is a triangle.

Simplex-Centroid Design
One of the three types of mixture designs with a simplex-shaped
experimental region. A Simplex-centroid design consists of extreme
vertices, center points of all "sub-simplexes", and the overall center. A
"sub-simplex" is a simplex defined by a subset of the design variables.
Simplex-centroid designs are available for optimization purposes, but
not for a screening of variables.

Simplex-Lattice Design
One of the three types of mixture designs with a simplex-shaped
experimental region. A Simplex-lattice design is a mixture variant of the
full-factorial design. It is available for both screening and optimization
purposes, according to the degree of the design (See lattice degree).

Square Effect
Average variation observed in a response when a design variable goes
from its center level to an extreme level (low or high).

The square effect of a design variable can be interpreted as the curvature
observed in the response surface, with respect to this particular design
variable.

Standard Deviation
Sdev is a measure of a variable’s spread around its mean value,
expressed in the same unit as the original values.

Standard deviation is computed as the square root of the mean square of
deviations from the mean.
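
For example, with hypothetical values:

    import numpy as np

    x = np.array([4.0, 5.5, 6.1, 5.8, 4.6])

    sdev = np.sqrt(np.mean((x - x.mean()) ** 2))   # square root of the mean square of deviations
    print(sdev)
    print(np.std(x))                               # same result; ddof=1 gives the usual sample estimate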

Standard Error Of Performance (SEP)
Variation in the precision of predictions over several samples.
SEP is computed as the standard deviation of the residuals.

Standardization Of Variables
Widely used preprocessing that consists in first centering the variables,
then scaling them to unit variance.

The purpose of this transformation is to give all variables included in an
analysis an equal chance to influence the model, regardless of their
original variances.

In The Unscrambler, standardization can be performed automatically
when computing a model, by choosing 1/Sdev as variable weights.
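
Outside The Unscrambler the same preprocessing can be written in a few lines of numpy (the data table below is hypothetical):

    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.normal(loc=[10.0, 200.0, 0.5], scale=[1.0, 30.0, 0.05], size=(20, 3))

    X_centered = X - X.mean(axis=0)                # 1. center each variable
    X_std = X_centered / X.std(axis=0, ddof=1)     # 2. scale to unit variance (weight = 1/Sdev)

    print(X_std.std(axis=0, ddof=1))               # every variable now has standard deviation 1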

Star Points Distance To Center
In Central Composite designs, the properties of the design vary
according to the distance between the star samples and the center
samples. This distance is measured in normalized units, i.e. assuming
that the low cube level of each variable is -1 and the high cube level +1.

Three cases can be considered:
• The default star distance to center ensures that all design samples
are located on the surface of a sphere. In other words, the star
samples are as far away from the center as the cube samples are. As
a consequence, all design samples have exactly the same leverage.
The design is said to be “rotatable”;
• The star distance to center can be tuned down to 1. In that case, the
star samples will be located at the centers of the faces of the cube.
This ensures that a Central Composite design can be built even if
levels lower than “low cube” or higher than “high cube” are
impossible. However, the design is no longer rotatable;
• Any intermediate value for the star distance to center is also
possible. The design will not be rotatable.

Star Samples
In optimization designs of the Central Composite family, star samples
are samples with mid-values for all design variables except one, for
which the value is extreme. They provide the necessary intermediate
levels that will allow a quadratic model to be fitted to the data.

Star samples can be centers of cube faces, or they can lie outside the
cube, at a given distance (larger than 1) from the center of the cube.

Steepest Ascent
On a regular response surface, the shortest way to the optimum can be
found by using the direction of steepest ascent.

Student t-Distribution
Also called the t-distribution. Frequency diagram showing how independent
observations, measured on a continuous scale, are distributed around
their mean when the mean and standard deviation have been estimated
from the data and when no factor causes systematic effects.

When the number of observations increases towards an infinite number,
the Student t-distribution becomes identical to the normal distribution.

A Student t-distribution can be described by two parameters: the mean
value, which is the center of the distribution, and the standard deviation,
which is the spread of the individual observations around the mean.
Given those two parameters, the shape of the distribution further
depends on the number of degrees of freedom, usually n-1, if n is the
number of observations.

Test Samples
Additional samples which are not used during the calibration stage, but
only to validate an already calibrated model.

The data for those samples consist of X-values (for PCA) or of both X-
and Y-values (for regression). The model is used to predict new values
for those samples, and the predicted values are then compared to the
observed ones.

Test Set Validation
Validation method based on the use of different data sets for calibration
and validation. During the calibration stage, calibration samples are
used. Then the calibrated model is used on the test samples, and the
validation residual variance is computed from their prediction residuals.
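
A sketch of the procedure with scikit-learn on hypothetical data; any regression method could stand in for the PLS model used here.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(8)
    X = rng.normal(size=(60, 10))
    y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=60)

    # Split the data into a calibration set and a separate test set.
    X_cal, X_test, y_cal, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = PLSRegression(n_components=2).fit(X_cal, y_cal)    # calibration stage
    y_pred = model.predict(X_test).ravel()                     # predictions for the test samples
    rmsep = np.sqrt(np.mean((y_pred - y_test) ** 2))           # validation estimate of the prediction error
    print(rmsep)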

Training Samples
See Calibration Samples.

T-Scores
The scores found by PCA, PCR and PLS in the X-matrix.

See Scores for more details.

Tukey's Test
A multiple comparison test (see Multiple Comparison Tests for more
details).

t-Value
The t-value is computed as the ratio between deviation from the mean
accounted for by a studied effect, and standard error of the mean.

By comparing the t-value with its theoretical distribution (Student
t-distribution), we obtain the significance level of the studied effect.

Underfit
A model that leaves aside some of the structured variation in the data is
said to underfit.

Upper Quartile
The upper quartile of an observed distribution is the variable value that
splits the observations into 75% lower values, and 25% higher values. It
can also be called 75% percentile.

U-Scores
The scores found by PLS in the Y-matrix.

See Scores for more details.

Validation Samples
See Test Samples.

Validation
Validation means checking how well a model will perform for future
samples taken from the same population as the calibration samples. In
regression, validation also allows for estimation of the prediction error
in future predictions.

The outcome of the validation stage is generally expressed by a
validation variance. The closer the validation variance is to the
calibration variance, the more reliable the model conclusions.

When explained validation variance stops increasing with additional
model components, it means that the noise level has been reached. Thus
the validation variance is a good diagnostic tool for determining the
proper number of components in a model.

Validation variance can also be used as a way to determine how well a
single variable is taken into account in an analysis. A variable with a
high explained validation variance is reliably modeled and is probably
quite precise; a variable with a low explained validation variance is
badly taken into account and is probably quite noisy.

Three validation methods are available in The Unscrambler:
• test set validation;
• cross-validation;
• leverage correction.
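
As an illustration of the second method in the list above, a full cross-validation can be sketched with scikit-learn on hypothetical data:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(9)
    X = rng.normal(size=(40, 8))
    y = X[:, 0] + 0.2 * rng.normal(size=40)

    # Full (leave-one-out) cross-validation: each sample is predicted by a model
    # calibrated on all the other samples.
    y_cv = cross_val_predict(PLSRegression(n_components=2), X, y, cv=len(y)).ravel()
    rmse_cv = np.sqrt(np.mean((y_cv - y) ** 2))   # validation error estimated by cross-validation
    print(rmse_cv)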

Variable
Any measured or controlled parameter that has varying values over a
given set of samples.

A variable determines a column in a data table.

Variance
A measure of a variable’s spread around its mean value, expressed in
square units as compared to the original values.

Variance is computed as the mean square of deviations from the mean. It
is equal to the square of the standard deviation.

Vertex Sample
A vertex is a point where two lines meet to form an angle. Vertex
samples are used in Simplex-centroid, axial and D-optimal mixture/non-
mixture designs.

Weighting
A technique to modify the relative influences of the variables on a
model. This is achieved by giving each variable a new weight, i.e.
multiplying the original values by a constant which differs between
variables. This is also called scaling.

The most common weighting technique is standardization, where the weight is the inverse of the variable's standard deviation (1/Sdev).
