Multivariate Data Analysis
– In Practice
5th Edition
An Introduction to
Multivariate Data Analysis
and Experimental Design
Kim H. Esbensen
Ålborg University, Esbjerg
Trademark Acknowledgments
Doc-To-Help is a trademark of WexTech Systems, Inc.
Microsoft is a registered trademark and Windows 95, Windows NT, Excel and Word
are trademarks of the Microsoft Corporation.
PaintShop Pro is a trademark of JASC, Inc.
Visio is a trademark of the Shapeware Corporation.
Information in this book is subject to change without notice. No part of this document
may be reproduced or transmitted in any form or by any means, electronic or
mechanical, for any purpose, without the express written permission of CAMO Software
AS.
ISBN 82-993330-3-2
Preface
October 2001
Learning to do multivariate data analysis is in many ways like learning
to drive a car: You are not let loose on the road without mandatory
training, theoretical and practical, as required by current concern for
traffic safety. As a minimum you need to know how a car functions and
you need to know the traffic code. On the other hand, everybody would
agree that it is only after you have obtained your driver's license that the
real practical learning begins. This is when your personal experience
really starts to accumulate. There is a strong interaction between the
theory absorbed and the practice gained in this secondary, personal
training period.
Driving your newly acquired multivariate data analysis car is very much
an evolutionary process: this introductory textbook is filled with
illustrative examples, many practical exercises and a full set of self-
examination real-world data analysis problems (with corresponding data
sets). If, after all of this, you are able to work confidently on your own
applications, you’ll have reached the goal set for this book.
This is the 5th revised edition of this book. The first three editions were
mainly reprints, the only major change being the inclusion of a
completely revised chapter on ”Introduction to experimental design”,
which first appeared in the 3rd edition (CAMO). The 4th revised
edition however (published March 2000) saw very many major
extensions and improvements:
• 30% new theory & text material added, reflecting extensive student
response, full integration of PCA, PLS1 & PLS2 NIPALS algorithms
and explanations.
In the intervening years, this book appeared in some 4,500 copies
and was used for the introductory basic training in some 15 universities
and in several hundred industrial companies; reactions were many and
largely constructive. We learned a lot from these criticisms; we thank all
who contributed!
By 1999, the time was ripe for a complete revision of the entire
package. This was undertaken by the senior author in the summer 1999
with significant assistance from his then Ph.D. student Jun Huang (now
with CAMO, Norway); Frank Westad (Matforsk) who wrote chapter 14
(Martens’ Uncertainty Test), Dominique Guyot (CAMO) who wrote the
original new entire chapter 17 (Complex Experimental Design
Problems), and with further invaluable editorial and managerial
contributions from Michael Byström (CAMO) and Valérie Lengard
(CAMO). A most sincere thank you goes to Peter Hindmarch (CAMO,
UK) for very effective linguistic streamlining of the 4th edition! The
authors and CAMO also take this opportunity to acknowledge Suzanne
Schönkopf’s (CAMO) contribution to editions previous to the 4th one.
The present edition of this book still bears the fruit of her very important
past efforts.
Today, this book is a collaborative effort between the senior author and
CAMO Process AS; the tie with SINTEF is now defunct.
Thus all is well with the training package! We hope that this revised 5th
edition will continue to meet the challenging demands of the market,
hopefully now in an improved form. Writing for precisely this
introductory audience/market constitutes the highest scientific and
didactic challenge, and is thus (still) irresistible!
Acknowledgements
The authors wish to thank the following persons, institutions and
companies for their very valuable help in the preparation of this training
package:
teaching multivariate data analysis. And thanks for all the constructive
criticism to the earlier editions of this book. Last, but certainly not least,
a warm thank you to all the students at HIT/TF, at Ålborg University,
Esbjerg and many, many others, who have been associated with the
teachings of the authors, nearly all of whom have been very constructive
in their ongoing criticism of the entire teaching system embedded in this
training package. We even learned from the occasional not-so-friendly
criticisms…
Communication
The training package has now matured over a formative period of seven years. By now we are actually beginning to be rather satisfied with it!
And yet: The author(s) and CAMO always welcome all critical
responses to the present text. They are seriously needed in order for this
work to be continually improving.
This is of course also the case within many other scientific disciplines in which
underlying causal relationships give rise to manifest observable data, for instance
economics and sociology. In this book we shall primarily pay attention to a wide-
ranging series of scientific and technological examples of multivariate problems
and associated multivariate data.
Accordingly data analytical methods dealing with only one variable at a time, so-
called univariate methods, will very often turn out to be of limited use in modern,
more complex data analysis. It is still necessary to master these univariate methods
however, as they often carry important marginal information – and they are the
only natural stepping stone into the multivariate realm – always realizing that they
are insufficient for a complete data analysis!
To determine the temperature of the furnace load, you could perhaps instead use
IR-emission spectroscopy and then estimate the temperature of the furnace from the
recorded IR-spectrum. This would be an indirect observation - measuring
something else to determine what you really want to know. Observe how here we
would be using many spectral wavelengths to do the job, in other words: an indirect
multivariate characterization. This is a typical feature of nearly all types of such
indirect observation and therefore shows an obvious need for a multivariate
approach in the ensuing data analysis.
The amount of information in your data will depend on how well you have defined
your problem, and whether you have performed the observations, the measurements
or the experiments accordingly. The data analyst has a very clear responsibility to
provide, or request, meaningful data. Actually it is much more important which
variables have been measured than simply how many observations have been made.
It is equally important that you have chosen the appropriate ranges of these
measurements. This is in contrast to “standard” statistical methods where the
minimum number of observations depends on the number of parameters to be
determined: here the number of observations must exceed the number of
parameters to be determined. This is not necessarily the case with the multivariate
projection methods treated in this book. In one particular sense, variables and
objects may – at least to some degree – stabilize one another, (much) more of
which will be revealed.
Now it is time to introduce the first definitions before continuing (Frame 1.1).
These two types of measurements are usually organized in two matrices, as shown
in Figure 1.1:
Figure 1.1 The X-matrix and the Y-matrix
The variance of a variable is a measure of the spread of the variable values, corresponding to how large a range the n measured values cover. This entity
is of critical importance for things to come. One should learn always to have this
univariate concept of “variance” in the back of one’s mind also when dealing with
more complex multivariate data structures. It is often the tradition to express the measure of this spread in the same units as the raw measurements themselves; hence the commonly used measure, the standard deviation (std), which is the square root of the variance.
The covariance between two variables, x1 and x2, is a measure of their linear
association. If large values of variable x1 occur together with large values of
variable x2, the covariance will be positive. Conversely, if large values of variable
x1 occur together with small values of variable x2, and vice versa (small values of
variable x1 together with large values of variable x2), the covariance will be
negative. A large covariance (in absolute values) means that there is a “strong”
linear dependence between the two variables. If the covariance is “small”, the two
variables are not very dependent on one another: if variable x1 changes, this does
not affect the corresponding values of variable x2 very much.
Notice the similarity between the equations for variance, which concerns one
variable, and covariance, which concerns two.
But, as everybody knows, talk of “large” and “small” is very imprecise; we need to
define what is “much”, what is “small” and what is just “a little”. For example, if the covariance between pressure and temperature in a system is 512 °C·atm, is that
a large or a small covariance? Does it mean that temperature and pressure follow
each other closely or that they are nearly independent? And what about the
covariance between temperature and the concentration of a substance, say, if the
covariance is 12 °C·(mg·dm⁻³)? How does that compare with the covariance between temperature and pressure? As you see, the magnitude of the covariance depends directly on the units of the variables, which is why this measure is not so useful for comparing mixed variables.
To put everything on an “equal footing”, in order to compare linear dependencies,
the correlation is a much more practical measure. The correlation between two
variables is calculated by dividing the covariance by the product of their
respective standard deviations. Correlation is thus a unit-less, scaled covariance
measure. In general it is the most useful measure of interdependence between
variables, as two or more correlation coefficients are directly comparable
whatever units the variables are measured in. Pearson’s correlation coefficient r is
defined below; r2 is often used as a measure of the fraction of the total variance
that can be modeled by this linear association measure.
Frame 1.2 Basic univariate statistical measures

Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Variance:
$$\mathrm{Var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Representative x/y relationships showing different r and r²: three scatter plots with r ≈ 1, r ≈ -1 and r ≈ 0 respectively.
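As a concrete illustration of these univariate measures, here is a minimal numpy sketch; the two variables and their values are purely hypothetical, chosen only to demonstrate the calculations:

```python
import numpy as np

# Two hypothetical variables measured on the same n = 6 objects
x1 = np.array([2.1, 3.4, 4.0, 5.2, 6.8, 7.1])
x2 = np.array([1.9, 3.1, 4.2, 5.0, 6.5, 7.3])

mean_x1 = x1.mean()               # mean
var_x1 = x1.var(ddof=1)           # variance, with the (n - 1) denominator
std_x1 = x1.std(ddof=1)           # standard deviation = sqrt(variance)

cov_x1_x2 = np.cov(x1, x2)[0, 1]  # covariance (depends on the units used)
r = np.corrcoef(x1, x2)[0, 1]     # Pearson correlation (unit-less, between -1 and 1)

print(mean_x1, var_x1, std_x1, cov_x1_x2, r, r ** 2)
```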
An example of this would be the case where we wish to find the concentration of
substance A in a mixture which also contains substances B and C. We may for
example use spectroscopy to determine the A-concentration. But the measured
spectra will not only contain spectral bands from A, which is what we seek, but in
general necessarily also bands from the other, irrelevant, compounds which we
cannot avoid measuring at the same time (Figure 1.3).
Figure 1.3 Overlapping spectra
(Absorbance spectrum with overlapping bands from substances B, A and C)
The problem will therefore be to find which contributions come from A, and which
come from B and C. Since it is substance A that we want to determine, B and C can
here be considered as “noise”. Whether we consider the B and C signals as noise is
of course strongly dependent on the problem definition; if B was the substance of
interest, A and C would now be considered as noise. In still another problem
context we might be interested in measuring the contributions from both A and B
simultaneously (this is one of the particular strengths of the so-called
multivariate calibration realm, see chapters below). In the latter case only the
contributions from C would now be considered noise. The issue here is that it is the
context of your problem alone that determines what to consider as “signal” and
what as “noise”.
Multivariate observations can therefore be thought of as a sum of two parts:
Observations = Data structure + Noise
The data structure is the signal part that is correlated to the property we are
interested in. The noise part is “everything else”; that is to say contributions from
other components, instrumental noise, etc. - this is always a strongly problem-
specific issue. One often wishes to keep the structured part and throw away the
noise part. The problem is that our observations always are a sum of both of these
parts, and the structure part will at first be “hidden” in the raw data. We cannot
immediately see what should be kept and what should be discarded (note however,
that we do in fact also make good use of the noise part, as an important measure of
model fit).
This is where multivariate data analysis enters the scene. One of its most important
objectives is to make use of the intrinsic variable correlations in a given data set to
separate these two parts. There are quite a number of such multivariate methods.
We will exclusively be working with methods that use covariance/correlation
directly in this signal/noise separation, or decomposition, of the data. In fact one
may say that the inter-variable correlations act as a “driving force” for the
multivariate analysis.
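The following sketch illustrates this decomposition idea on synthetic data, using scikit-learn's PCA as one possible projection method; the data, the noise level and the choice of a single component are all assumptions made purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: one hidden phenomenon (rank-1 structure) plus random noise
n, p = 50, 6
hidden = rng.normal(size=(n, 1))                 # the "hidden phenomenon"
structure = hidden @ rng.normal(size=(1, p))     # correlated variables
X = structure + 0.1 * rng.normal(size=(n, p))    # observations = structure + noise

pca = PCA(n_components=1).fit(X)
X_structure = pca.inverse_transform(pca.transform(X))  # modeled (structured) part
X_noise = X - X_structure                               # residual ("noise") part

print("variance captured by the structure part:", pca.explained_variance_ratio_[0])
```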
difficult, and mostly it will be obvious which technique to use. But first one must
acquire an experience-based overview.
In the apple example, this would mean that you already at the outset know that
there are differences between sweet and sour (“supervised pattern recognition” to
introduce a term which will be more elaborated below). The aim of the data
analysis would then be to assign, to classify, new apples (based on new
measurements) to the classes of sweet or sour apples. Classification thus requires
an a priori class description. Interestingly, here also Principal Component Analysis
can be used to great advantage (see the SIMCA approach below), but there are certainly many other competing multivariate classification methods. Note that
discrimination/classification deals with dividing a data matrix into two, or more
groups of objects (measurements).
We will mainly work with the regression methods Principal Component Regression
(PCR) and Partial Least Squares Regression (PLS-R) in this book, while also
making some reference to the statistical approach Multiple Linear Regression
(MLR).
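As a rough orientation only, here is how PCR and PLS-R might be set up in scikit-learn; the book itself works with The Unscrambler, and the synthetic data and the choice of three components below are assumptions for illustration:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=40)

# PCR: regress y on the first few principal components of X
pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X, y)

# PLS-R: finds components that maximize the covariance between X and y
pls = PLSRegression(n_components=3).fit(X, y)

print("PCR R^2:", pcr.score(X, y))
print("PLS R^2:", pls.score(X, y))
```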
We will not do that. Instead we will use a geometric representation of the data
structures to explain how the different methods work especially by using a central
concept of projections. This approach is strongly visual and thereby utilizes the fact
that there is no better pattern recognizer than the human brain. You will soon come
to view data sets primarily as points and swarms of points in a data space. This will
let you grasp the essence of most multivariate concepts and methods in an efficient
manner, so that you should be able to start to perform multivariate data analysis
without having to master too much of the underlying mathematics and statistics - a
very ambitious objective! This book represents some 20 years of accumulated
teaching experience; we hope that you will feel suitably empowered when you have
reached the end.
It is also befitting, however, to point out that it is not our belief that the present
geometrical approach is all you ever need in the multivariate realm. On the
contrary, this is an introductory textbook only. - After completion you should
certainly want to turn to higher level textbooks for a more solid additional
theoretical background in the mathematics and statistics of these methods, to
deepen your understanding of why they work so well. Indeed this is exactly our
objective with this introductory book.
We shall work on getting a first overview of two data sets using descriptive
univariate statistics. We shall be interested in the following two data sets as but two
examples of data matrices, and accordingly we do not present the full details to
these data sets now. It is sufficient to make a brief introduction only as both data
sets will appear again at several places later in this book.
Tasks
1. Reduce data by averaging over judges.
2. Plot raw data.
3. Calculate statistics.
How to Do it
At the bottom of your screen you should now see that your data table has the
size 1200 samples times 15 variables.
Study the data table. The first three variables identify the samples. You can see
that each of the ten judges (3-12) has tasted each sample twice for 12 variables,
and that there are 20 recordings (samples) for each sample number.
Average the data over all the judges, i.e. over the replicate objects. Two replicates and ten judges
give us 20 samples to average. Select Modify - Transform - Reduce (Average).
Use the following parameters:
On the status bar you can see that the number of samples has now been reduced
to 60.
Delete the first three variables, which are only used for object identification:
Mark the variables by clicking on the column numbers while pressing the
CTRL-key. Select Edit - Delete.
Again the size of the data table is reduced, so the status bar should now read 60
samples times 12 variables.
Save the data table in the Editor on a new file with the name PEAS1 using File-
Save As.
plot. Mark the whole data table (either manually or by choosing Edit - Select
All) and select Plot - Matrix.
Now you can see the value of all variables (columns) for all objects (rows). The
plot is displayed in a window, which we call a Viewer. A Viewer is a graphical
representation of data in an Editor window or of a matrix on file.
Once you have studied the matrix plot, close the viewer: Window - Close
Current, or with the corresponding toolbar button. Then unmark the table selection with Edit -
Unselect All or with Esc.
Now that you have plotted two variables versus each other, for example X1 vs. X2, it is possible to fit a regression line and to study the corresponding correlation.
Select View - Trend Lines - Regression line to see the regression line. Then
choose: View - Plot Statistics to display the regression statistics (i.e. regression
slope, offset and correlation coefficient r). Notice that these two variables are
highly correlated (all points lie near a straight line and the correlation r ≈ 0.95).
To turn off the regression line and statistics, simply toggle the respective
commands once more.
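Outside The Unscrambler, the same regression statistics (slope, offset and correlation r) could be reproduced with a short scipy sketch; the values below are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two variables measured on the same samples
x = np.array([3.1, 4.2, 5.0, 5.8, 6.5, 7.2, 8.0])
y = np.array([2.9, 4.0, 5.2, 5.6, 6.8, 7.0, 8.3])

result = stats.linregress(x, y)
print("slope:", result.slope)
print("offset (intercept):", result.intercept)
print("correlation r:", result.rvalue)
```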
Histogram
You can also study how the observations in each variable are distributed by
looking at the frequency histogram. Mark variable 1 and select Plot- Histogram.
Now you have a histogram on the screen. To show the statistics choose: View-
Plot Statistics. Now compare this with the histogram for variable 12. It would
be useful to have both histograms on the screen. The Unscrambler lets you do
this by simply going back to the Editor, marking a new variable and plotting the
histogram for this variable. You can have several Viewer and Editor windows
open at the same time. Use Window - Tile to see all windows at the same time.
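An equivalent frequency histogram can be drawn outside The Unscrambler, for example with matplotlib; the values below are again purely hypothetical:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical scores for one sensory variable
values = np.random.default_rng(2).normal(loc=5.0, scale=1.2, size=60)

plt.hist(values, bins=10, edgecolor="black")
plt.xlabel("score")
plt.ylabel("frequency")
plt.title("Frequency histogram of one variable")
plt.show()
```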
3. Calculate Statistics
You can calculate sample statistics with the View - Sample Statistics or
variable statistics with the View - Variable Statistics command. A new Editor
with the most common statistical measures is created. More statistical information can be found by making a statistics model. Select Task - Statistics and
use the sets All Samples and All Variables. Click OK to make the model and
View to look at the results. Let the cursor rest over a variable for a while, click
with the left mouse button.
Which variable do you expect to describe the pea quality best? Is your answer
the same as when you compared the histograms? By which properties can the
assessors best distinguish between peas? Do the assessors use the scale in the
same way for all variables?
Now save the results in a file for later use by File - Save. Use the file name
PEASTATS, for example.
Summary
This simple exercise has demonstrated how to read a data file and how to study the
raw data both as numbers and as graphical displays. You should now be able to
start using The Unscrambler in general.
The first six variables have the largest variances, which by using classical
descriptive statistics suggests that these variables will be the best to describe pea
quality. However, in later exercises we will see that another approach may perhaps
be more useful.
The judges do not seem to use the scale in the same way for the different variables.
We can see this because some variables have a high mean value and others have a
low mean value. However, we do not know for certain if this is because most
samples really have, for example, few skin wrinkles, or if the judges just use low
values on the scale for the variable “Skin”.
The “problem” (as judged by the senior author of the present book) was just that:
univariate descriptive data analyses were the only data analyses carried out, that is
to say that each variable in the table was only analyzed individually, separately. As
will become very clear from the reader’s progression through this book, this
approach is not the best approach. “Clearly” a full-fledged multivariate analysis
will be able to tell more.
We shall investigate this data set closely in other exercises below; but as a start let
us simply do some entry-level univariate characterizations of this data, exactly as
for the peas example (means, variances/standard deviations, histograms, etc.) - in
order to “get a feel” for this new data set. Quickly run through this exercise for all
ten variables. Do you see any pattern emerging?
You will probably quickly have appreciated that there is something rather special to
this data set - many (all?) of the variables are very symmetrically distributed
indeed. What could this mean?
The purpose of all multivariate data analysis is to decompose the data in order to
detect, and model, the “hidden phenomena”. The concept of variance is very
important. It is a fundamental assumption in multivariate data analysis that the
underlying “directions with maximum variance” are more or less directly related to
these “hidden phenomena”. All this may perhaps seem a bit unclear now, but what
PCA does will become very clear through this chapter and the accompanying
exercises in chapters 4-5.
Consider for the moment the first variable, i.e. column, X1. The individual entries
can be plotted along a 1-dimensional axis (see Figure 3.1). The axis must have an
origin, usually a zero point, as well as a direction and a measurement unit of length.
If X1 is a series of measured weights, for example, the unit would be mg, kg or
some other unit of weight. We can extend this approach to take in also the next
variable, X2 (see Figure 3.2). This would result in a 2-dimensional plot, often
called a “bivariate” scatter plot.
The axes for the variables are orthogonal and have a common origin, but may have
different measurement units. This is of course nothing other than what you did with
the variables in the exercises in chapter 2. You can continue the extension until all
p variables are covered, in all the pertinent variable pairs. Exercise - Plotting Raw
Data (People) on page 22 is a good workout for this!
Figure 3.1 Plotting the values of one variable, x1, along a single axis with origin 0
Each object can therefore be represented - plotted - as a point in this variable space.
When all the X-values for all objects are plotted in the variable space, the result is a
swarm of points for example as shown in Figure 3.3. There are now only n points
described in this p-dimensional space. Observe, for example, how this rendition of the (n,p)-dimensional two-way data matrix allows you direct geometrical insight into the hidden data structure.
In this particular illustration, we suddenly get an appreciation of the fact that there
is a prominent trend among the objects, a trend that is so prominent that in some
sense we might in fact call it a “hidden linear association” among all three variables
plotted. This revealing of the underlying covariance structure is really the backbone
of Principal Component Analysis.
Figure 3.3 A swarm of n points in the variable space (x1, x2, ...)
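To experiment with this idea yourself, the following hypothetical sketch generates three variables driven by one hidden phenomenon and plots the resulting point swarm in 3-D (matplotlib assumed; the data are synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# One hidden phenomenon t drives all three variables (plus a little noise)
t = rng.normal(size=100)
x1 = 1.0 * t + 0.1 * rng.normal(size=100)
x2 = 0.8 * t + 0.1 * rng.normal(size=100)
x3 = -0.5 * t + 0.1 * rng.normal(size=100)

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(x1, x2, x3)          # the swarm of points falls along one direction
ax.set_xlabel("x1"); ax.set_ylabel("x2"); ax.set_zlabel("x3")
plt.show()
```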
Data Set
We have selected an excerpt from a pan-European demographic survey. For
reasons of didactic introduction we have selected only a small, manageable set of
32 persons, i.e. 32 objects, of which 16 represent northern Europe (Scandinavia: A)
with a corresponding number of representatives from the Mediterranean regions
(B). An equal number of 16 males (M) and 16 females (F) were chosen for balance.
The data table, stored in the file “PEOPLE”, consists of 12 different X-variables:
Age (years)
Income (Euro)
Beer consumption (liters per year)
Wine consumption (liters per year)
Gender (Sex) (male: -1; female: +1)
Swimming ability (index based on 500 m timed swimming)
Regional belonging (A/B) (A:-1(Scandinavia); B: +1(Mediterranean))
Intelligence Quotient (IQ) (European standardized IQ-test)
Height
Weight
Shoe size
Hair length
Among these variables we observe that Sex, Hair length and Region (A/B) are discrete variables with only two possible values (dichotomous or binary variables), coded as -1 or +1. The remaining nine variables are all quantitative. This data set will be described further in the PCA- and PLS-modeling exercises later in the book.
Tasks
1. Load and examine the data. Make univariate data descriptions of all variables.
2. Select any two variables and plot them against each other using a suitable
2-vector function. Study their interrelationship and data set similarities.
3. Select any three variables and plot them using a suitable 3-vector function.
Study the variable interrelationships and data set similarities.
4. Determine the most important (strongest) two-variable, as well as three-variable
interrelationships, expressed as the strongest correlations. Evaluate all possible
combinations (if possible). What do we know about this data set now?
How to do it
Open the data from the file PEOPLE. Take a look at the data in the Editor.
Observe the numerical tabulation of the 12 variables. This is not the optimal
approach for human pattern cognition!
Plot for instance variable 1 vs. variable 2 by marking the variables and select
Plot-2D Scatter.
You should now observe a plot with points marked by the sample names: FA,
MA, FB or MB indicating the persons' gender and region. If they appear only as
numbers or dots, as in Figure 3.4, Edit-Options can be selected in the menu or
context menu (right mouse-button). Is it easier to interpret the result from
names or from numbers in this plot? This depends on the use of the plot, more
of which later. It is important that you develop an appreciation that it is possible
to include coded information via the “names” of the objects. In this exercise we
have included the dichotomous sex and region information in the “PEOPLE”-
names.
Figure 3.4 Scatter plot of Height and Weight for the PEOPLE data (points labeled by sample number)
The primary relationship between the Height and the Weight is shown in Figure
3.4. It is quite obvious that the Height is proportional to the Weight in this selection
of people. Observe that The Unscrambler automatically calculates a set of useful
standard statistics in this plot; one observes, for example, that the desired
Height/Weight correlation coefficient is 0.96. The general trend accordingly shows
that the taller persons are heavier. Invoking the “name” plotting-symbol option, you
will be able also to appreciate that the men (M) are normally heavier and taller than
the women (F). As can be clearly seen, object 1 is the heaviest and tallest person,
while 12 and 21 are situated at the extreme other end of a fitted regression line,
representing the lightest/shortest people. Use View-Plot Statistics to obtain other
statistical measures besides the correlation, namely the fitted regression slope,
offset, bias and so on.
Try to study a few other, randomly selected pairs of variables by marking and
plotting them directly from the data table. It is recommended here that at least some
“sensible” combinations, like Age vs. Income and Wine vs. Beer consumption
(Figure 3.5) be tried out.
One will readily observe that even this comparatively small data set contains both strongly positively correlated and moderately negatively correlated relations. There are also some essentially uncorrelated ("random shot") interrelationships to be found. Have you tried out all
possible two-variable pairs yet? Want to know why? As a rule, with p variables
there are a total of p x (p-1)/2 such combinations. That is why! Surely there must
be an easier way?
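A quick sketch of this combinatorial count, and of one "easier way" to survey all pairwise relations at once (the full correlation matrix), might look as follows in numpy; the random matrix only stands in for a real data table:

```python
import numpy as np

p = 12
print("number of two-variable combinations:", p * (p - 1) // 2)   # 66 for p = 12

# Surveying all pairs at once: the full p x p correlation matrix
X = np.random.default_rng(4).normal(size=(32, p))   # stand-in for a data table
R = np.corrcoef(X, rowvar=False)                    # pairwise r for every variable pair
print(R.shape)
```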
You could also study three variables at the same time using the appropriate 3-D
plot. Simply mark any three desired variables in the Editor and then select Plot-3D
Scatter. Here we first again choose the Height and the Weight variables (since we
know these well already) plus the Swim (Swimming ability); see Figure 3.6. This
time you are looking at a 3-D plot. You may use Window-Identification to identify
the variables along the axes.
To view the plot from a different angle, you can choose View-Rotate or View-
Viewpoint- Change. This is a very powerful correlation inspection tool!
Figure 3.6 3-D scatter plot of (Height, Weight, Swim) for the PEOPLE data
Is the variable Swim correlated to Height and Weight? Do the women differ from
the men with respect to their swimming ability? Are there any groups in the plot?
Which persons are the most distinguishing ones? Which persons are similar and
which ones are not? What about some other combinations of similar three-variable
interrelationships, are there similar correlations among them?
Summary
In this exercise you have tried to study a total data set by looking at 2- and 3-D
plots of selected variables. The plots shown here indicate that the taller the people
are, the heavier they are among this group of people - and so on for the other pairs
you selected yourself. The swimming ability was found to be correlated to the
height and weight, but what about other three-variable intercorrelations? Discrete
variables are difficult to visualize and very difficult to interpret!
You will most probably have appreciated that even while these simple two-variable
and three-variable plotting routines are immensely powerful in their own right, the
number of variables (in this particular case only 12, or even only 9, if you discard
the binary ones) very quickly makes it impractical to investigate all the pertinent
combinations. One of the reasons for PCA, to be introduced shortly, is to make it
possible to survey all pertinent inter-variable relationships simultaneously. Thus we will postpone other data analysis objectives until we are in a position to investigate them with PCA.
We speak of “the variance” - but the variance of what? The feature in question is
the variance of the direction described/represented by the central axis - whatever
this unknown “new variable” may represent. This is really what is meant by the
term: “modeling a hidden phenomenon”. There is a co-varying, linear behavior
along this central axis due to “something” unknown (at least at the outset of the data
analysis). If we look only at the original X1-, X2- and X3-variables, there is no
such apparent connection, except that their pair-wise covariances are large. But this
simple geometrical plotting reveals this hidden data structure very effectively. All
Principal Component Analysis does is allow for this geometrical understanding to
be generalized into any arbitrary higher p-dimensionality.
(Figure: the first principal component, PC1, drawn through the swarm of points in the x1-x2 variable space)
This central axis is called the first Principal Component, in short PC1. PC1 thus
lies along the direction of maximum variance in the data set. We may say that there
is a hidden, compound variable associated with this new axis (the Principal
Component Axis). At this stage we do not usually know what this new variable
“means”. PCA will give us this first - and similar other - Principal Components, but
it is up to us to interpret what they mean or which phenomena they describe. We
will return to the issue of interpretation soon enough.
In this first example we only had 3 variables, so we recognized the linear behavior
immediately by plotting the objects in the 3-D variable space. When there are more
– many more – variables (like in spectroscopy where each row of the matrix is a
spectrum of perhaps several hundred, or thousands of wavelengths), this procedure
is of course not feasible any longer. Identification of this type of linear behavior in
a space with several thousand dimensions of course cannot any longer be done by
visual inspection. Here PCA can help us to discover the hidden structures however,
with its powerful projection characteristics.
that is a best simultaneous fit to all the points through the use of the least-squares
optimization principle. We want to find the line that minimizes the sum of all the
squared transverse distances – in other words the line that minimizes Σ(ei)2.
This line is the exact same PC-axis that we found more “intuitively” earlier! When
using the Least Squares approach we now possess a completely objective
algorithmic approach to calculate the first PC, through a simple sum-of-squares
optimization.
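A minimal numerical sketch of this least-squares view, using the singular value decomposition of the centered data as one way to obtain the first PC (the synthetic data below are an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
t = rng.normal(size=200)
X = np.column_stack([t, 0.7 * t, -0.4 * t]) + 0.05 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)                    # center the swarm of points
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                                # direction of the first PC (unit vector)

scores = Xc @ pc1                          # "foot-points" along PC1
residuals = Xc - np.outer(scores, pc1)     # transverse residual vectors e_i
print("sum of squared residual distances:", (residuals ** 2).sum())
```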
One must appreciate that the n objects contribute differently to the determination of
the axis through their individual orthogonal projection distances. Objects lying far
away from the PC-axis in this transverse sense will “pull heavily” on the axis’
direction because the residual distances count by their squared contributions.
Conversely, objects situated in the immediate vicinity of the overall “center of the
swarm of points” will contribute very little. Objects lying far out “along the
PC-axis” may, or may not, display similarly large (or small) residual distances.
However, only the transverse component is reflected in the least square
minimization criterion.
(Figure: objects i and j with their transverse projection distances ei and ej onto the PC-axis in the x1-x2 variable space)
We now have two approaches, or criteria, for finding the (first) principal
component: the principal component is the direction (axis) that maximizes the
longitudinal (“along axis”) variance or the axis that minimizes the squared
projection (transverse) distances. Some thought will show that these two criteria
really are two sides of the same coin. Any deviation from the maximum variance
direction in any elongated swarm of points must necessarily also result in an
increase of Σ(ei)2 - and vice versa. It will prove advantageous to have reflected
upon these two simple geometrical models of a principal component.
Again we are speaking about the variance of some “unknown” phenomenon or new
hidden compound variable which is represented by the second principal
component.
One may - perhaps – at this stage be wondering how actually to find, to calculate
the principal components. We will return to this later. At this stage it is only
important that one grasps the geometric concepts of the mutually orthogonal
principal components. The mathematics behind and the algorithmic procedure to
find them are very simple and will be described in due course.
(Figure: principal component axes drawn in the x1-x2 variable space)
These new variables - let us call them PC-variables for the moment - do not co-
vary. By introducing the PCs we have made good use of the correlations between
the original variables and thereby constructed a new independent, orthogonal
coordinate system. Going from the original Cartesian co-ordinate system, one is
effectively substituting the inter-variable correlations with a new set of orthogonal
co-ordinate axes, the PC model. We shall find almost unbelievable data analytical
power in PCA, the Principal Component Analysis concept.
From a data analytical point of view, using the maximum number of PCs
corresponds to a simple change of co-ordinate system from the p-variable space to
the new PC-space, which is also orthogonal (and with uncorrelated PC-axes).
Mathematically, the effective full dimension of the PC-space, the space spanned by
the PCs, is given by the rank of X.
$(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)$

where $\bar{x}_k = \frac{\sum_{i=1}^{n} x_{ik}}{n}$ is the mean of variable index k, taken over all objects.
Figure 3.12 Centering: translation of the original variable axes (x1, x2, x3) to the centered axes (x1', x2', x3') with origin at the average object
This PC-origin can also be viewed as a translation of the origin in variable space to
the “center-of-gravity” of the swarm of points. This procedure is called centering
and the common origin of the principal components is called the mean center.
Observe that the “average point” may well be an abstraction: it does not have to
correspond to any physical object present among the available samples. It is a very
useful abstraction however.
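Mean centering itself is a one-line operation; here is a small numpy sketch with made-up numbers:

```python
import numpy as np

X = np.array([[2.0, 10.0, 0.5],
              [4.0, 12.0, 0.7],
              [6.0, 14.0, 0.9]])

x_mean = X.mean(axis=0)    # the "average object" (mean of each variable)
Xc = X - x_mean            # centering: translate the origin to the mean center
print(x_mean)
print(Xc.mean(axis=0))     # each centered variable now has mean 0
```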
These coefficients are called loadings and there are thus p loadings for each PC.
The loadings for all the PCs constitute a matrix P. This matrix can be thought of as
the transformation matrix between the original variable space and the new space
spanned by the Principal Components. For PCA, the loading vectors - the columns
in P - are orthogonal.
Normally the loadings refer to variable space where the origin is centered, i.e. the
origin of the variable co-ordinate space is moved to the average object. This
corresponds to a simple translation as was shown in Figure 3.12.
We will discuss loadings in great detail (especially how to interpret them), and also
work much more with them, in several subsequent exercises.
Each object will thus have its own set of scores in this dimensionality-reduced
subspace. The number of scores, i.e. number of subspace co-ordinates for each
object, will be the same as the number of PCs. If we collect all the scores for all the
objects, we get a score matrix T. Notice that the scores for an object make up a row
in T. The columns in the score matrix T are orthogonal, a very important property
that will be of great use.
(Figure: the scores ti1 and ti2 of object i, i.e. its co-ordinates along PC1 and PC2, in the x1-x2 variable space)
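A compact sketch of how the loading matrix P and the score matrix T could be obtained (here via the SVD, which yields the same decomposition as PCA), including a check of the orthogonality properties just described; the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 5))   # some correlated data
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt.T                 # loadings: one column per principal component
T = Xc @ P               # scores: one row per object, one column per PC

# Both the loading vectors and the score vectors are orthogonal
print(np.allclose(P.T @ P, np.eye(P.shape[1])))         # True
print(np.allclose(T.T @ T, np.diag(np.diag(T.T @ T))))  # True (off-diagonals ~ 0)
```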
We will often have reason to refer to score vectors. A score vector is a column of
T. It is thus not the scores for a single object, but the scores for one entire Principal
Component; it is the vector of “foot-points” from all the objects projected down
onto one particular principal component. Therefore there will be a score vector for
each Principal Component. It will have the same number of elements as there are
objects, n. The general term “score” can be ambiguous. Usually, "scores" means
“elements in the T-matrix” without any further specification.
of dropping the noisy, higher-order PC-directions (Figure 3.14 The PC- coordinate
system). Thus PCA performs a dual objective: a transformation into a more
relevant co-ordinate system (which lies directly in the center of the data swarm of
points), and a dimensionality reduction (using only the first principal components
which reflect the structure in the data). The only “problem” is: how many PCs do
we wish to use?
(Figure: the PC co-ordinate system embedded in the original variable space)
In Figure 3.13 one may appreciate how the new PC co-ordinate system is also
reducing the dimensionality from 3 to 2. This is of course not an especially
overwhelming reduction in itself, but it should be kept in mind that PCA handles
the case of, say, 300 → 2, or even 3000 → 2 equally easily. The 3-D → 2-D (or
1-D) reduction is only a particularly useful conceptual image of the dimensionality reduction potential of PCA, since it can be rendered on a piece of paper or on a computer
screen. In fact, a large number of variables can often be compressed into a
relatively small number, e.g. 2,3,4,5 PCs or so, which allows us to actually see the
data structures regardless of the original dimensionality of the data matrix with but
a few plots (but then only as projections!).
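The "how many PCs" question is usually approached via the explained variance per component; a hedged scikit-learn sketch on synthetic data (the 95% threshold below is only an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 300)) @ np.diag(np.linspace(2, 0.01, 300))  # many variables

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("PCs needed to explain 95% of the variance:", np.argmax(cumulative >= 0.95) + 1)
```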
by their score designations, for example t1t2 for the PC1-PC2 score sub-space. Score
plots can be viewed as particularly useful 2-D “windows” into PC-space, where one
observes how the objects are related to one another. The PC-space may certainly
not always be fully visualized in just one 2-D plot, in which case two or more score plots are all you need. You are of course necessarily restricted to 2- or 3-
dimensional representations when plotting on paper or working on VDU-screens.
Figure 3.15 Score plot (t1 vs. t2) for the pea sensory data; the plotting symbols carry pea type (A-E) and harvest time (1-5). Pea Senso6, X-expl: 94%, 3%
The most commonly used plot in multivariate data analysis is the score vector for
PC1 versus the score vector for PC2. This is easy to understand, since these are the
two directions along which the data swarm exhibits the largest and the second
largest variances. In Chapter 2, exercise "Quality of Green Peas" (descriptive
statistics) the problem concerned sensory data for peas. A plot of PC1 scores versus
PC2 scores, the t1t2 plot, is shown for this data set in Figure 3.15. Scores for PC1
are along the “x-axis” (abscissa) and the scores for PC2 are along the “y-axis”
(ordinate). Notice that objects are plotted in the score plot in their relative
dispositions with respect to this (t1t2)-plane, and that we have here used the very
powerful option of having one, two (or more) of the object name characters serving
as plotting symbol. This option will greatly facilitate interpretation of the meaning
of the inter-object dispositions.
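A score plot with object-name characters as plotting symbols can also be mimicked outside The Unscrambler; the sketch below uses matplotlib with entirely hypothetical scores and names:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(8)
t1 = rng.normal(size=20)                      # hypothetical PC1 scores
t2 = rng.normal(size=20)                      # hypothetical PC2 scores
names = [f"{letter}{i % 5 + 1}" for i, letter in
         enumerate(np.repeat(list("ABCDE"), 4))]   # e.g. "A1", "B3", ...

fig, ax = plt.subplots()
for x, y, name in zip(t1, t2, names):
    ax.text(x, y, name, ha="center", va="center")  # use the name as plotting symbol
ax.set_xlabel("PC1 (t1)"); ax.set_ylabel("PC2 (t2)")
ax.set_xlim(t1.min() - 1, t1.max() + 1); ax.set_ylim(t2.min() - 1, t2.max() + 1)
plt.show()
```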
But also notice that “harvest time” is not a variable in the original X-matrix. Still
we can see it clearly in the score-plot because of this option of also being able to
use information recorded in the names of the objects. For this particular example,
one is led to the conclusion that time of harvest is important for the taste of the
peas. Since PC1 is the most dominant Principal Component (in fact, it carries no
less than 93% of the total X-variance), harvest time seems to be a very important
factor for the sensory results.
The same reasoning can be followed if we are interested in PC2. One can here
easily see another pattern in the plotting symbols as we move down the plot.
Objects with letter A are placed highest and objects with the letter E lowest. Thus
PC2 clearly has something to do with discriminating between pea types (A, B…E).
Note how PCA decomposes the original X-matrix data into a set of orthogonal
components, which may be interpreted individually (the PC1 phenomenon may be
viewed as taking place irrespective of the phenomenon along PC2, etc.). In reality - of course - both these phenomena act out their roles simultaneously, as the raw data all stem from one and the same X-matrix.
loadings, the influence of the original variables can also be deduced. This will be
discussed later in section 3.9 on page 40 and in chapter 4.
1. Always use the same principal component as abscissa (x-axis) in all the score
plots: look at t1t2, t1t3, t1t4,… In this way you will be “measuring” all the other PC-
phenomena against the same yardstick, t1. This will greatly help getting the desired
overview of the compound data structure.
2. Use the principal component that has the largest “problem relevant” variance as
this basis (x-axis) plotting component. For many applications this will turn out to
be PC1, but it is entirely possible in other cases that PC1 lies along a direction that
for some problem-specific reason is not interesting. - If the time of harvesting in the
pea example above was, say, described in PC3 and place of harvesting in PC4, it
would not make much sense to plot PC1 vs. PC2 for studying these aspects. PC1
and PC2 would certainly describe “something” (other), but not what we were
looking for. In general PC1 describes the largest structural variation in any data set,
and in many situations this - per se - is often an “interesting” feature, but this does
not necessarily mean that this variation always is the most important for our
particular interpretation purpose. Correlation is not per se equivalent to causality.
These rules of thumb are very general and there are many exceptions. Our advice is
to start all data analysis following these simple rules, but always look out for
possible deviations. After an initial analysis, you may for example find that higher-
order score plots are necessary for interpretation after all. There are also many
The loading plot shows how much each variable contributes to each PC. Recall that
the PCs can be represented as linear combinations of the original unit vectors
($\vec{p}_a = \sum_k p_{ka}\,\vec{e}_k$). The loadings themselves are the coefficients in these linear
combinations. Each variable can contribute to more than one PC. In Figure 3.16 the
x-axis denotes the coefficients for all the variables making up PC1. The y-axis,
correspondingly, denotes the coefficients defining PC2 in the variable space.
Figure 3.16 Loading plot (PC1 vs. PC2) for the pea sensory data, showing Fruity, Sweet, Pea_Flav, Hardness and Mealiness
The variables “Sweet”, “Fruity” and “Pea Flav” are located to the right of the plot.
“Sweet” contributes strongly to PC1 but not at all to PC2, since the value on the
PC2-axis is 0. Our earlier look at the (t1,t2)-score plot for the peas found that PC1
could be related to harvest time, and the inferred relation to pea flavor can, strictly
speaking, first be appreciated with this loading plot available; “Pea Flav” loads
very high on the positive PC1 direction indeed. From this we can also deduce that
measurements of “Sweet” can be used, together with the other similar variables
“Fruity” and “Pea Flav”, to evaluate harvest time. We can also say that the later the
peas are harvested, the sweeter they are. From the loading plot we see that other
variables also contribute to PC1, but at the opposite end (displaying negative loadings).
Above we deduced that Sweetness has nothing to do with the property described by
PC2. Identifying this property is a task for exercise 3.12. For now we can say that if
we wish to determine this property, we certainly should not measure “Sweet”.
Some variables display positive loadings, that is to say positive coefficients in the
linear combinations, while others have negative loadings. For instance, Sweetness
contributes positively to PC1, while Off Flav has a negative PC1-loading (as well
as a high positive PC2 loading). PCA-loadings are usually normalized to the
interval (-1,1), more of which later.
Co-varying Variables
The 3-dimensional plotting facility, as for scores, can also be used to study how the
original variables co-vary - in a 3-D loading plot. If the variables are situated close
together geometrically (i.e. display similar loadings), they co-vary positively.
In this example (Figure 3.17) Fruity, Sweet and Pea Flav are positively correlated
through PC1 because they make more or less the same contribution to PC1, but we
now also see that Sweetness, in addition, displays a high loading on PC3.
Figure 3.17 3-D loading plot (PC1, PC2, PC3) for the pea sensory data, showing Sweet, Off_flav, Mealiness, Fruity, Pea_Flav and Hardness. Pea Senso6, X-expl: 94%, 3%, 2%
The variables Off_Flav, Hardness and Mealiness are also positively correlated with
each other at the negative PC1 end (these three variables have more or less equal
loadings for this component), but definitely only through PC1. This is clear from
their disposition in both the PC2 and PC3 directions in which these variables
occupy rather disparate positions in the 3-D plot. Simultaneously, these two sets of
three variables each (those with positive PC1-loadings and those with negative
PC1-loadings) are negatively correlated to each other since their PC1-loadings have
opposite signs. A strong correlation can either be positive or negative, and the
loading plot shows both these relationships unambiguously.
Some of these findings were already quite clear using the 2-dimensional loading
plot. The augmented 3-dimensional loading plots are usually more useful when the
number of X-variables is much higher than that of this simple illustration.
(Figure: the t1-t2 score plot for the pea sensory data with sample numbers as plotting symbols, shown together with the corresponding PC1-PC2 loading plot. Pea Senso6, X-expl: 94%, 3%)
Sample (= object) 51 has a position in the score plot that corresponds to the
direction of variable Off-flavor in the loading plot. This means that sample 51 has a
high value for the variable Off-flavor. Sample 22 is very sweet, “9” is mealy and
hard and so on. If we take into account what we know about PC1, then early
harvesting time seems to give off-flavored peas that are hard and mealy, while late
harvesting (positive scores) results in sweet peas with strong pea flavor. Now it will
also be clear that PC2 can be interpreted as an “unwanted pea taste”-axis (going
from hard and mealy peas to distinctly off-flavored peas).
interpretation strategy - using the score plot to come up with the how and the
loading plot to understand why - should be well illustrated however. To be specific:
How? (How are the objects distributed with respect to each other, as shown by the decomposed PC score plots?) The score plot shows the object interrelationships.
Why? (Which variables go together, as a manifestation of the correlations defining the PCs?) The loading plot shows the variable interrelationships.
The loading plot is used for interpreting the reasons behind the object distribution.
In many cases one uses the 2-D score/loading plots illustrated above, but not
always.
Figure 3.19 is a 1-dimensional PC4 loading plot from a PCA of a set of IR spectra
of complex gas mixtures. In spectroscopy, the 1-D loading plots are often a great
advantage, as they are very useful for the assignment of diagnostic spectral bands
for example. As can be seen from Figure 3.19, such a loading plot indeed shows
great similarity to a spectrum. This loading plot is from a data set consisting of IR-
spectra of a system with 23 mixed gases. The loadings in Figure 3.19 belong to a
PC that can partially be related to the presence of dichloromethane. For
comparison, the original spectrum of pure dichloromethane is shown in Figure
3.20.
Spectral features in the lower wavenumber region (i.e. the leftmost variables) can
be recognized in the loading plot. The loading plot shows the largest loadings,
which correspond to the most important diagnostic variables in this range (and in a
few other bands). Since dichloromethane absorbs in this region we can conclude
that PC4 - among other things - models the presence of dichloromethane. Note that
this system of 23 gases is actually rather complex so the PC-loading spectrum also
contains numerous smaller contributions and interferences from the other
compounds. Still, this is a very realistic example of the typical way one goes about
interpreting the “meaning” of a particular principal component in PCA of
spectroscopic data.
A final note on the graphical display of loadings and loading plots. 2-D score plots can be understood as 2-D variance-maximised maps showing the projected inter-object relationships in variable space, oriented along the particular two PC-components chosen. By a symmetric analogy: 2-D loading plots may be viewed as 2-D maps showing the projected inter-variable relationships, oriented along the pertinent PC-component directions in the complementary object space.
The object space can be thought of as the complementary co-ordinate system made
up of axes corresponding to the n objects in matrix X. In this space, one can
conveniently plot all variables and get a similar graphical rendition of their
similarities, which will show up in their relative geometrical dispositions,
especially reflecting their correlations. The particular graphical analogy between
these two complementary graphical display spaces is matched by a direct symmetry
in the method used to calculate the scores and loadings by the NIPALS algorithm
(see section 3.14 on page 72).
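For orientation, a compact sketch of the NIPALS iteration for extracting principal components one at a time is shown below (written in Python for illustration; the book's own description is in section 3.14, and this sketch makes its own simplifying assumptions about the starting vector and convergence criterion):

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """Extract principal components one at a time with the NIPALS iteration."""
    X = X - X.mean(axis=0)              # centering
    n, p = X.shape
    T = np.zeros((n, n_components))     # scores
    P = np.zeros((p, n_components))     # loadings
    for a in range(n_components):
        t = X[:, np.argmax(X.var(axis=0))].copy()   # start score vector
        for _ in range(max_iter):
            p_vec = X.T @ t / (t @ t)               # project X onto t -> loadings
            p_vec /= np.linalg.norm(p_vec)          # normalize the loading vector
            t_new = X @ p_vec                       # project X onto p -> scores
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        T[:, a], P[:, a] = t, p_vec
        X = X - np.outer(t, p_vec)                  # deflate: remove this PC
    return T, P

# Tiny usage example on random correlated data
rng = np.random.default_rng(9)
X = rng.normal(size=(20, 4)) @ rng.normal(size=(4, 4))
T, P = nipals_pca(X, n_components=2)
print(T.shape, P.shape)
```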
Data Set
The background information about the “PEOPLE” data set was given in exercise
3.3.1 on page 22. There are 32 persons from two regions in Europe, A and B. For
reasons of anonymity the persons’ real names are not included; instead we have
added codes in each object-name. The first position represents M/F (male/female),
whereas the second position in the name represents regions A/B respectively. Just
as in exercise 3.3.1, the option of using both, or only one, of these codes as an
information-carrying plotting symbol will be made use of also in the context of
PCA-plots. The standard option is to use the running-number identifications for
objects 1 through N=32.
Tasks
1. Study the PCA-results. Focus is on the score plot and the loading plot.
2. Interpret the variable relationships and the object groupings.
How to Do it
The PCA model for the “people data” set has already been made. The modeling
results are stored on a particular file, which can be accessed by selecting the
model from Result-PCA. Select the file “people.11D” and press View. Here you
will observe the very first graphical overview of the PCA-model results, which
consists of a score plot, a loading plot, an influence plot, and a residual variance
plot. We will first focus on an interactive study of the score plot and the loading
plots.
Score Plot
The score plot is on the upper left corner of the PCA overview. A plot of PC1
vs. PC2 can be seen. If the numbers of the persons do not show, display the
numbers instead of the object-names using Edit-Options. You may also want to
enlarge this plot with the menu Window-Full Screen or with the corresponding
icon.
You will first of all observe a very clear grouping in the PC1-PC2 plot. In fact
you observe four distinct clusters of objects. This “grouping”, “clustering” or, as
we sometimes would want to express it (see later), this “data class delineation”
is optimally seen when using identical plotting symbols. In Figure 3.21 we have
used the sample numbers as object identification. If so desired, we could
actually have used the same symbol (x, +, o…) for all objects.
Please try and find out how to do this by using Edit-Options. You will find that
this is extremely useful for what might be termed the “initial overall pattern
recognition phase” of an exploratory PCA. However, as soon as we would like
to go any deeper into the data structure revealed, for example to find out about
the specific characteristics for each of these four groups, we need to use more
“plotting symbol information”, i.e. to use object-name information instead.
Figure 3.21 PC1-PC2 score plot of the PEOPLE data with sample numbers as plotting symbols. RESULT3, X-expl: 54%, 19%
Who are the extreme persons? Which persons are similar to each other? Are
these the same conclusions as from plotting the raw data? Try to interpret the
meaning of the axes. For example, what do the persons to the left-hand side of
the plot have in common? And what about those to the right, upper part and
lower part?
Sample 21 is the leftmost person and Sample 1 the rightmost. Why? Is that
consistent with what you saw when plotting the raw data?
Now you may invoke the “object-name” information plotting option, this time
choosing to use both characters as the plotting symbol, as in Figure 3.22.
Observe the dramatic difference this makes with respect to immediate
interpretation of the four groups revealed!
What can you now say about these four groups? Do people from region A have
something in common? Do the people in region A differ from those in region B?
Figure 3.22 PC1-PC2 score plot of the PEOPLE data with the two-character object-name codes (MA, FA, MB, FB) as plotting symbols. RESULT3, X-expl: 54%, 19%
As can be seen from this score plot (PC1 vs. PC2), the four groups of people
represent a double two-fold grouping, regional belonging (A/B) as well as
gender differences (M/F); each possible quartering in this classification sense is
located separately. The males are on the right-hand side and the females on the
left-hand side, while along PC2 the people from region A and B seem to lie at
the lower and upper parts of the plot respectively. Trying to interpret the
meaning of PC1 and PC2 is thus easy: PC1 is a “sex discriminating” component,
and PC2 is a "region discriminating" component.
Now we proceed to study the PC1 vs. PC3 score plot using the Plot- Scores
option and see what can be observed from this plot. No similar clear groups can
be seen in the plot, even if you try the “plotting symbol” trick from above again.
However, if you use Edit - Options - Sample Grouping - Enable Sample
Grouping - Separate with Colors - Group by - Value of Variable and mark
variable Age, you are able to infer that PC3 turns out to span/separate the older
from the younger people.
The age would appear to increase from the upper part to lower part, i.e. along
PC3, but this is not an easy thing to appreciate, since we did not have any age-
related information present in the object-names.
Typically binary (-1/1) information, such as gender and region in the above
example, lends itself easily to this type of plotting-symbol coding. Sometimes a
discrete (or categorical) information type might be used in a similar fashion, for
example A/B/C/D.
(Figure: score plot PC1 vs. PC3 with sample numbers as labels; RESULT3, X-expl: 54%,13%)
Loading Plot
You should always study the corresponding loading plot to see why the data
structure is grouped, or why a particular sample is located in a specific location
in the score plot. For instance, objects on the left-hand side of the score plot will
have relatively large values for the variables on the left-hand side of the loading
plot and vice versa.
(Figure: loading plot PC1 vs. PC2 for the variables Height, Weight, Shoesize, Swim, IQ, Age, Hairleng, Sex, Income and Beer; RESULT3, X-expl: 54%,19%)
What characterizes the persons on the left-hand side? And those in the lower
part of the score plot? (Hint: study the corresponding score and loading plots
interactively). Which variables are located along PC1 and which ones along PC2?
Do the people with larger height and weight have better swimming ability? Is
this dependent on Sex? Do the older people have higher salary than the younger
ones? Are there any variables that can differentiate people in region A from
those in region B? Do the people in regions A and B have similar wine and beer
consumption? Do they have the same IQ? Is IQ correlated to the physical
variables? If so, which?
Also study the loading plot (PC1 vs. PC3) using Plot - Loadings. Can you see any
variables spread along PC3? By now you should know more about the meaning of
PC3. Compare the conclusions with those you get from the score plots. Are they
consistent with each other?
(Figure: loading plot PC1 vs. PC3 showing Hairleng, A/B, Wine, Income and Age; RESULT3, X-expl: 54%,13%)
Which variables are highly correlated in the PEOPLE data set? If they are
correlated, is there any physical relationship? Which of them are negatively
correlated? What about the IQ (plot PC1 against PC4)? Are there ANY
surprises at all compared to what one would generally expect in a sample of
pan-European people?
Summary
This exercise showed how to display these two most important model plots: the
score plot and the loading plot. It illustrated how the patterns of the samples in the
score plot (inter-object groupings, trends) can be “explained” from the variable
loadings. An object or a group of similar objects on the left-hand side in the score
plot has a high value of the variable(s) on the left-hand side in the loading plot. In the
PEOPLE data the males thus lie on the right-hand side of the PC1-PC2 score plot and the
females on the left-hand side, while all people from region B lie on the upper part of the
plot and all those from region A lie on the lower part. The third PC could be used for
separating relatively older people (with correlated higher salary) from younger ones with
relatively lower salary. The fourth PC describes IQ only, not correlated to any other
variable.
There is also a tendency that people in the upper-left part of the PC1-PC2 plot drink
more wine and less beer than those in the lower-right part. This trend can be
observed “diagonally” in the score plot with enough scrutiny, but it is also revealed
in the appropriate loading plot, PC1-PC2: variable “Wine” has a distinctly high PC2
loading, while “Beer” loads high both on PC1 and on the negative PC2 direction, in fact
“creating” the diagonal object relationship observed in the score plot.
Furthermore all males display relatively large values of the variables height,
weight, shoe-size, swimming ability and beer consumption. In addition there is a
tendency for the males to have shorter hair than the females in this data set. The
interesting thing is that IQ does not seem to be related to any other variables,
because it displays effective “zero” loadings on all first 3 components.
The score plot also showed similarities and differences; people located near each
other in the plot are similar, such as samples 7 and 8. It is easy to make pertinent
interpretations characterizing the data structure revealed. For example, coupled with the
loading plot, the score plot shows a simultaneous tendency that people in region A drink
more beer and less wine and have relatively higher salary
than those in region B.
Comparing the PCA-results above with the earlier plots of paired variables, we see
that a PCA model gives a much more comprehensive overview of the whole data
set, “all in one glance” (well, four plots rather). In fact, going from a large set of
isolated, bivariate scatter plots (p*(p-1)/2 of them) to a very few score/loading plots
demonstrates one of the strongest features of PCA as used for exploratory data analysis.
This exercise gives you a first view of the power of PCA modeling. More will be shown in
the coming chapters.
3.11 PC-Models
We shall now present a more formal description of PC-modeling. We will also look
at the more practical aspects of constructing adequate PC-models.
Centering
The X-matrix used in the equations above is not cast precisely as our raw data set.
The original variables have first been centered, i.e. the mean of each variable has been
subtracted from every element in the corresponding column:

x_{ik}^{centered} = x_{ik} - \bar{x}_k

where \bar{x}_k is the mean of variable k over all n objects.
In PCA we start with the assumption that X can be, indeed should be, split into a
sum of a matrix product, TP^T, and a residual matrix E. As you have probably
gathered, T is simply the score matrix described above, and P^T is the
accompanying loading matrix (transposed). We wish to determine T and P and use
their product instead of X, which will then, most conveniently, have been
stripped of its error component, E:

X = TP^T + E
E is not a part of the model per se; it builds up the so-called residual matrix. It
is simply that part of X which cannot be accounted for using the available PC
components; in other words, E is not “explained” by the model. Thus E becomes
the part of X that is not modeled by (included in) the product TP^T. E is
therefore also a good measure of “lack-of-fit”, which tells us how close our model
is to the original data. While the data analytical use of PCA-models is mainly
concerned with the first, data structure part, TP^T, we could not do without the
complementary “goodness-of-fit” measure residing in E (a large E corresponds to a
poor model fit and vice versa).
An outer product results in a matrix of rank 1. Here, t_a is the score vector for PC_a
and is n x 1-dimensional. p_a is the corresponding loading vector; since it is p x 1-
dimensional, p_a^T is thus 1 x p-dimensional. Each outer product t_a p_a^T is therefore
n x p-dimensional, i.e. of the same dimension as X and E, but all these t_a p_a^T
matrices have the exact mathematical rank 1. This is illustrated graphically in Figure 3.26.
Figure 3.26 - X decomposed as a sum of rank-one outer products plus a residual:
X = t_1 p_1^T + t_2 p_2^T + ... + t_A p_A^T + E
Equation 3.3 is closer to the actual PC-calculation than the compact matrix
equation. PCs are in fact often calculated one at a time. Let us outline how in an
introductory overview first: starting from the centered data, E_0, the first component
is calculated and subtracted, E_1 = E_0 - t_1 p_1^T; the second component is then
calculated from E_1 and subtracted, E_2 = E_1 - t_2 p_2^T, and so forth until you have
calculated “A” components. The subtractions involved
are often also referred to as updating the current X-matrix. It is one of the most
general features of the usefulness of PCA that usually A << p. A is the dimension
of the structure sub-space used.
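For readers who like to see this in computational form, here is a minimal numpy sketch of the one-component-at-a-time calculation with updating (deflation). It is an illustration only, not The Unscrambler's own routine; for simplicity each component is taken from the leading singular vector of the current residual rather than from the NIPALS iteration presented at the end of this chapter, and the function name pca_by_deflation is our own.

    import numpy as np

    def pca_by_deflation(X, A):
        """Illustrative PCA: extract A components one at a time by deflation."""
        E = X - X.mean(axis=0)                   # E0: the mean-centered data
        n, p = E.shape
        T = np.zeros((n, A))                     # score matrix
        P = np.zeros((p, A))                     # loading matrix
        for a in range(A):
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            T[:, a] = U[:, 0] * s[0]             # score vector t_a
            P[:, a] = Vt[0, :]                   # loading vector p_a (unit length)
            E = E - np.outer(T[:, a], P[:, a])   # "update" the current X-matrix
        return T, P, E                           # centered X is approximately T @ P.T + E

With A much smaller than p, the product T @ P.T usually reproduces the centered X closely, which is exactly the point made above about A << p.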
Notice that we use the letter E from the moment we start subtracting PC-
contributions. When we come to the residual matrix below, you will see why we also
use E in the step-by-step model calculation.
The structured part of X is made up by the TP^T product. The noise (the residuals)
resides in the E-matrix. In this context the choice of A, how many PCs to include,
corresponds to determining the split between structure and noise.
From this we can conclude that a choice of “A” for an optimum fit must be made.
We, or the software PCA-program, must choose A so that the model, TP^T, contains
the “relevant structure“ and so that the “noise”, as far as possible, is collected in E.
This is in fact a central theme in most of multivariate data analysis. This objective
is not trivial and there are many pitfalls. It is always up to you to decide how many
PCs to use in the model. There are many situations where the human eye and brain
excel, which simply cannot be programmed regardless of how far the field of
Artificial Intelligence has been developed. Most PCA software, of course, will try
to give you information on which to base your decisions, but this info is only
algorithmically derived, and it all hinges on which optimality criterion is used.
Rather than skip this important point (by relying on an algorithmic approach
only), this book most emphatically demands that the reader takes it upon
him/herself to learn the underlying principles behind PCA. There can thus be no
two ways about it. The question (always) is: “how do we find the optimum number
of PC-model components, A?”
However, terms like “small”, “large” and “good” are imprecise. What is a large E?
After all E is a matrix consisting of n x p matrix elements - what if some are large
and some are small? What are we comparing with when we say that something is
small or large? It is obvious that we must define and quantify these terms precisely.
The mean of the data may be regarded as a “zeroth” Principal Component: we subtract
this contribution from X and continue approximating the residual. Subtraction of the
zeroth Principal Component is thus identical to mean centering of the raw data matrix X,
and for A=0 the residual matrix, termed E0, is the same as the centered X. E0 plays a
fundamental role as the reference when quantifying the (relative) size of E.
Residual Variance
The residuals will change as we calculate more PCs and subtract them. This is
reflected in the notation for E: E will in general carry an index indicating how many
PCs have been calculated in the current model. These residuals are compared with
E0, our starting point; E0 is the centered X of Equation 3.4.
Object Residuals
The squared residual of an object i, ei2, is given by Equation 3.5:
Equation 3.5:   e_i^2 = \sum_{k=1}^{p} e_{ik}^2

and the residual variance is ResVar_i = e_i^2 / p. This sum is simply a number. If we
take the square root of this sum, the result corresponds to the geometric distance
between object i and the model "hyper plane", i.e. the “flat” or space spanned by
the current A PCs as expressed in the original variable space. Thus the object
residual is a measure of the distance between the object in variable space and the
model representation (the projection) of the object in PC-space. The smaller this
distance is, the closer the PC-representation (the model) of the object is to the
original object. In other words, the rows in E are directly related to how well the
model fits the original data – by using the current number of A components.
We also need a similar measure across all the objects. For this purpose, we define the
total squared object residual to be the sum of all the individual squared object
residuals (Equation 3.6).
Equation 3.6:   e_{tot}^2 = \sum_{i=1}^{n} e_i^2

and the total residual variance is ResVarTot = e_{tot}^2 / (n \cdot p). In general we will refer
to the “total residual variance” without specifying objects.
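As a small illustration of Equations 3.5 and 3.6, the following numpy sketch computes the object residual variances, the total residual variance and the object-to-model distances from a residual matrix E (for example the E returned by the deflation sketch above); the function name is our own.

    import numpy as np

    def residual_variances(E):
        """Object and total residual variances from an n x p residual matrix E."""
        n, p = E.shape
        e2_obj = np.sum(E ** 2, axis=1)      # Equation 3.5: squared residual per object
        res_var_obj = e2_obj / p             # ResVar_i = e_i^2 / p
        e2_tot = np.sum(e2_obj)              # Equation 3.6: total squared residual
        res_var_tot = e2_tot / (n * p)       # ResVarTot = e_tot^2 / (n * p)
        dist_to_model = np.sqrt(e2_obj)      # distance from each object to the model "hyperplane"
        return res_var_obj, res_var_tot, dist_to_model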
Figure 3.27 - Residual variance per sample (Jam senso, PC: 3)
This plot is used mainly to assess the relative size of the object residuals. For
example, in Figure 3.27 we can see that sample number 10 has a larger residual
variance than the other objects. The model does not fit or “explain” this object as
well as the others. The plot thus may indicate that sample number 10 is perhaps an
outlier; it is not like the rest. We will return to the concept of outliers several times
in this book, as they are very important. An object like number 10 may be the result
of erroneous measurements or data transfer errors, in which case it should perhaps
be removed from the data set. Or it may be a legitimate and significant datum,
containing some very important phenomena that the other objects do not include to
the same extent. In the multivariate data analysis realm everything is always
“problem dependent”.
For the total residual variance plot, the curve must be a decreasing function of the
current number of components, A. It must decrease towards exactly 0 when A reaches
its maximum, i.e. equals min(n,p).
(Figure: total residual variance plotted against the number of PCs)
There is a logical argument behind this rule of thumb. Recall that the PCs are
placed along directions of maximum variances, that is to say along the elongations
of the data swarm, in decreasing order. When placed along these directions, the
total distance measure from the objects to the PC-model in general will decrease.
Remember the duality of maximum elongation variances and minimization of the
residual (transverse) distances for each component. As long as there are (still) new
directions in the data swarm with relatively “large” variances, the total residual
distance measure will decrease significantly. This again leads to a relatively large
decrease in the total residual variance from PC to PC, corresponding to a set of
relatively steep slopes in the plot, from one component to the next.
You may use the mental picture, that the next PC still would have something “to
bite into”, or that there still is some definite direction in the remaining projections
of the data swarm for the next component to model. This goes on until the
remaining data swarm does not show preferred orientations (elongations) any
longer. At this point there will no longer be any marked gain with respect to adding
any further PCs (“gain” is here to be understood as modeling gain, thus adding a
significant total variance reduction). Consequently the total residual variance plot
will flatten out and the gain per additional PC will be significantly less than before.
Thus a break appears here. Once the noise region has been reached, all of the PCs will be of
similar size (as they are modeling random variation and the residual data will form
a hyper-sphere), so the residuals will flatten out.
To conclude: the optimal number of PCs to include often is the number of PCs that
gives the clearest break point in the total residual variance plot. But this is only a
first rule-of-thumb; there are, alas, also plenty of exceptions to this rule.
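To make the rule of thumb concrete, the sketch below (numpy, illustration only) computes the total residual variance for A = 0, 1, 2, ... and applies one crude break-point heuristic; the 0.25 threshold is an arbitrary assumption of ours, and the final choice of A of course remains yours.

    import numpy as np

    def total_residual_variance_curve(X, max_pcs):
        """Total residual variance for A = 0 .. max_pcs, by repeated deflation."""
        E = X - X.mean(axis=0)                       # A = 0 reference level (E0)
        n, p = E.shape
        curve = [np.sum(E ** 2) / (n * p)]
        for _ in range(max_pcs):
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            E = E - np.outer(U[:, 0] * s[0], Vt[0, :])
            curve.append(np.sum(E ** 2) / (n * p))
        return np.array(curve)

    def suggest_break_point(curve, ratio=0.25):
        """Return the number of PCs after which the variance reduction flattens markedly."""
        drops = curve[:-1] - curve[1:]               # variance reduction per added PC
        for a in range(1, len(drops)):
            if drops[a] < ratio * drops[a - 1]:      # slope flattens: a candidate break
                return a
        return len(drops)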
Using too few components means that relevant structure is missed: important information
would, for example, have been lost in the mineralisation PCA if only the irrelevant first
two PCs were extracted - and not the gold-related PC3! Using too many components on the
other hand (clearly leading to an overfitted model) is equally bad, because you then risk
interpreting parts of the noise structure.
A very important lesson here is that we always carry out these evaluations in the
problem-specific context of all our knowledge about the problem, or situation
from which the data matrix X originates. It is bad form indeed to analyze data
without this critical regard for the problem context – indeed no meaningful
interpretation is possible without it!
The point here is that external, indisputable evidence must override all internal data
modeling results. However, the external evidence must of course be proved beyond
doubt. There are cases where the external factors have not held true after all; the
modeling results using more components were later found to be correct. If you are ever in a
situation like this, you should neither reject the modeling results immediately nor
ignore the external evidence. It will be most prudent to reflect carefully
again on the results and the evidence before deciding. In fact, there is no substitute
for building up as large a personal data analytical experience as indeed possible.
For this reason (also) we have included many examples and exercises in this book.
Equation 3.7:   e_k^2 = \sum_{i=1}^{n} e_{ik}^2

Equation 3.8:   e_{tot,v}^2 = \sum_{k=1}^{p} e_k^2
Here we will only discuss the former. The residual variance per variable can be
used to identify non-typical variables, “outlying” variables, in a somewhat similar
fashion as for the objects. We cannot, however, interpret them in an exactly
analogous fashion in terms of distances without introducing a complementary
object space in which variables can be plotted. This concept lies outside the scope
of this book.
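By analogy with the object residuals, a per-variable residual variance can be computed from the columns of E; a small numpy sketch is given below. Dividing the sum in Equation 3.7 by n to turn it into a variance is our own reading, chosen by analogy with ResVar_i.

    import numpy as np

    def variable_residual_variances(E):
        """Residual variance per variable from an n x p residual matrix E."""
        n = E.shape[0]
        e2_var = np.sum(E ** 2, axis=0)     # squared residual per variable (Equation 3.7)
        return e2_var / n                   # one value per variable; large values flag "outlying" variables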
We have also briefly used the term “explained variance” above. Remember that the
residual variance is compared to the total residual variance for A=0. At this point
the total residual variance is 100% and the explained variance is 0%. When
A=min(n,p) the residual variance is 0% because E is 0, and the explained variance
is 100%. The explained variance is the variance accounted for by the model, the
TP^T-product (always relative to the starting level, the mean-centered X, i.e. E0). An
easy relation to remember: explained variance (%) + residual variance (%) = 100%.
(Figure: explained variance (%) and unexplained variance (%) plotted against the number of PCs, 0 to 5)
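This relation can also be checked numerically. Given a curve of total residual variances as a function of A (for instance from the sketch in the previous section - an assumption of ours, not an Unscrambler output), the explained variance in percent is simply:

    import numpy as np

    def explained_variance_percent(curve):
        """curve[A] = total residual variance after A PCs; curve[0] is the A = 0 reference."""
        residual_pct = 100.0 * np.asarray(curve) / curve[0]
        return 100.0 - residual_pct          # explained + unexplained = 100% at every A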
Data Set
The model “Peas0” is based on the same data set as has been used several times
before (see Chapter 2). In fact we have also been using parts of these results several
times above when introducing the various aspects of PCA. After reformatting there
are 60 pea samples (objects). The names of the samples again reflect harvest time
(1 .. 5) and pea type (A .. E). The variables were not presented properly earlier. The
X-variables in fact consist of sensory ratings of all the pea samples, on a scale
from 0-9, as carried out by a panel of trained sensory judges. Whereas earlier we were
interested almost only in the geometrical plotting relationships, here we want you to
carry out the complete principal component analysis yourself.
Tasks
Study score-, loading- and variance-plots of the data and interpret the PCA-model.
How to Do it
The model is already prepared. Go to Results - PCA and specify model Peas0.
Start by looking at how much of the variance is described by the model. Make
the plot in the lower right quadrant active, select View - Source - Explained
Variance, and un-select View - Source - Validation Variance. Is there a clear
break in this plot? You can see that the first two PCs explain around 75% of the
variance in the data. This is regarded as good for sensory analysis, due to the
high noise level in this kind of data as measurements are based on human
judgement.
Now study the score plot. Interpret the meaning of the PCs. Use Edit - Options
to replace sample numbers with names; if you click twice on the second cell of
the Name field in the dialog box, only the fraction of the name coding for
Harvesting Time will appear.
Study the loading plots. Try to answer the following type of questions: What do
the loadings represent? How can we interpret the plot? Which sensory
characteristics are the most important? Which vary the most? Which seem to
co-vary? Which variables describe the main variations of peas?
Summary
There is no clear break in the variance plot, but two PCs describe 75% of the total
variance while the third explains some 10% more. Two PCs are simple to interpret
and are probably sufficient to determine the most important variables for
description of pea quality. The clue here is to note carefully the fractions of the
total variance associated with each PC.
The score plot shows that the PC1 direction describes the harvesting time. We can
see that the samples are distributed from left to right according to their Harvesting
Time numbering. There is no similarly obvious pattern in PC2 at first glance of the
score plot.
The loading plot shows that Pea-flavor, Fruitiness and Sweetness co-vary.
Hardness, Mealiness, and Off-flavor are also positively correlated to each other,
while they are negatively correlated to Pea-flavor, Sweetness, and Fruitiness, since
the two groups are on opposite sides of the origin. This means that PC1 mostly
describes how the peas taste and feel in the mouth, which is perhaps not such a
surprising first direction, given what trained sensory judges base their assessment
on. The corresponding score plot indicates that taste is related to harvest time - the
riper the peas, the sweeter they taste.
Along PC2, we can see Color 1 and Whiteness at the top, negatively correlated to
Color 2 and Color 3 projected near the bottom of the plot. This means that pea
samples projected to the top of the score plot are whiter, while those projected to
the bottom are more colorful.
Data Set
The objects represent 172 Norwegian car dealerships. For reasons of anonymity we
have identified these companies only by a running number 1-172. In point of fact,
there is also information available as to the particular brand of car each individual
car dealership offers to the market (e.g. Volvo, Mitsubishi, Toyota...), but this is
mainly of interest when related to the univariate (1-D) data analyses carried out in
the original magazine article. Here we are exclusively interested in what can be
gleaned from the multivariate PCA perspective in comparison.
The X-variables consist of 10 economic indicator variables, taken directly from the
magazine tabulations. It is evident that these variables represent a standard
framework within which to carry out an economic analysis of a whole branch of the
wholesale market - in this case all Norwegian car dealerships. The data represent
the 1993 accounts. For the moment we shall only refer to these variables
as X1 – X10.
You (the novice multivariate data analyst) are at the outset specifically not allowed
any other details of the meaning of the chosen set of key economic indicators.
Indeed this is the whole point of this exercise - what can be said about the
multivariate data structure of this particular data set without detailed economic
understanding? By treating these presumably fine-tuned economic variables just
like any other set of p multivariate attributes for the set of N (=172) objects (which
just happens to be car dealerships), what can be achieved by a proper principal
component analysis? Surely it would be nice if such an analysis were to turn up
new insight, insight not revealed by a standard univariate economic analysis. We
would then be in the enviable position of being able to teach a professional economic
magazine a thing or two about the interrelationships and correlations between the
economic indicators, a feature completely left out by the run-of-the-mill, one-
variable-at-the-time approach presented in the feature article from the magazine
“Kapital” (14/94 p. 50-54).
Tasks
Study score-, variance- and loading-plots of the data and interpret the PCA-model.
From a score point of view, we would like to assess how the individual car
dealerships relate to one another, and how they might be clustered, grouped or how
their interrelationships might show trending. From a loading point of view, we
would be extremely interested in which economic indicators correlate with which
(do you want to scale this data set, or not? - Why? Note: Scaling will be introduced
in chapter 4.1). From a combined scores/loading assessment, which indicators are
responsible for the data structure as revealed by the score plot disposition of all 172
dealerships?
Make up your own questions to the objectives of the PCA as you go along, based
on your interim results – and on your thinking on what might possibly be the
driving forces of the car dealership market in a small, but supposedly, rather
representative European country. On the other hand, also remember that currently
Norway is a rich country (the world’s second largest oil exporting country). In
general, people in Norway buy new cars.
How to Do it
Entirely up to you. This is an interim summation of your newly developed
PCA-skills!
Summary
One must always be prepared for surprises in the multivariate data analysis realm.
Many of the examples and illustrations in standard textbooks employ well-structured
data sets with a more-or-less clear story to tell – well conceived of course; compare
(hopefully) the PEOPLE and PEASRAW data sets above, for example.
How does this tally with the car dealership data? There are data sets and there are
data sets. At the outset this apparently VERY INTERESTING data set turned out
geometrically to be nothing more noteworthy than a p-dimensional hyperspherical
data swarm. What might this mean?
A tentative interpretation: The car dealership market (demand vs. supply) forms
a particularly strong competitive sector. There is simply very little room for
anything but to operate your (own) business precisely as do all your competitors,
each striving valiantly for that little extra comparative advantage! There were only
two marginally noteworthy car dealerships, and actually the “only” question in this
data analysis would appear to be whether to regard these two as outliers, or not.
Also: Irrespective of this two-object deletion, or not, the very same general
correlation structure was to be observed by interpretation of the comparative
loading plots. The only conclusion possible is that the competition makes for a very
homogeneous market – as measured by the present set of standard economic
variables that is.
The above little discourse shows that even a seemingly “dull” data structure might
nevertheless very well carry its own significant information. The particular, almost
hyper-spherical, data disposition encountered here might for example be interpreted
to reflect that the competition results in a very tight clustering of objects, with only
the smallest signs of differentiation between the individual dealerships. What are
the relevant fractions of the total variance for the first 2-3 PCs? – Here’s a difficult
question: How would you have formulated an alternative market interpretation,
had this data set revealed itself as a marked, perhaps more familiarly, elongated
trend?
The most prominent observation to be made, however, was the virtually complete
absence of any groupings or trends in the score plots. For sure there were these
two dealerships which set themselves apart, albeit marginally. What exactly
characterizes these two? How will this influence our ability to reveal the
intricacies of the economic interrelationships for the remaining 170 car
dealerships? For this part of the interpretations, there is now a need to know a little
more of the meaning behind the specific indicator variables, for which purpose we
now list the designations and explanations of all the ten variables in full:
In spite of the apparently enticing original data compilation, after having wrestled
with the data in every possible way, there is only one surprising conclusion: This
PCA did not reveal any particularly new secrets of the car dealership community
and the way it conducts its business. But then again, this exercise was deliberately
given in order to teach the lesson that even though the prospects for interesting
interpretations and conclusions would all appear to be in the cards – we do not
know the data structure until after our most careful data analysis. For this particular
data set, all our hopes for an illuminating new insight apparently just vanished.
But all is not completely lost. Though our triumphant visit to the editorial offices of
the magazine “Kapital” will have to wait, here is a sneak preview of coming
attractions. One of the X-variables will - upon some in-depth reflection - turn out to
be of a sufficiently special nature relative to all the remaining 9 other variables, that
a certain re-formulation of the whole objective of the data analysis may be deemed
worthwhile, more of which later (in relation to PCR/PLS-R). In fact we may have a
bona fide y-variable present in this otherwise manifest X-variable company. If you
cannot find out which now, simply wait until you have reached chapter 13.
In chapter 3 we have so far taught you quite a lot of the overall understanding of
PCA, described all the most important elements, and especially led you through a
number of useful exercises – without going into any mathematical or algorithmic
details. We also strongly hope that by now you should be sufficiently primed to
appreciate the basic NIPALS algorithm, which lies behind all PCA-calculations in
The Unscrambler. The completion of chapter 3 will be just that: The NIPALS
algorithm.
Note!
If you are not interested in a mathematical approach, you may skip
section 3.14 for now and move on directly to chapter 4, where you will
learn more about the practical use of PCA.
This algorithm has since the 1970s been the standard workhorse for the computations
behind bilinear modeling (first and foremost PCA and PLS-R),
primarily through the pioneering work by one of the co-founding fathers of
chemometrics, Svante Wold (Herman’s son). The history of the NIPALS algorithm
has been told by Geladi (1988), Geladi and Esbensen (1990), Esbensen & Geladi
(1990). The latter two references actually deal with “The history of chemometrics”,
a topic of interest to some of the readers, hopefully.
In this introductory course on multivariate data analysis, we shall present the main
features of this algorithmic approach for two reasons. 1) - for deeper understanding
of the bilinear PCA-method. 2) - for ease of understanding of the subsequent
PLS-R methods and algorithms.
Thus we shall not go into any particular depth regarding specific numerical issues;
it suffices to appreciate the specific projection/regression characteristics of NIPALS.
1. It is necessary to start the algorithm with a proxy t-vector. Any column vector
of X will do, but it is advantageous to choose the largest column, max |Xi|.
Step 3 can be seen to represent the well-known projection of the object vectors
down onto the fth PC-component in the variable space. By analogy one may view
step 2 as the symmetric operation projecting the variable vectors onto the
corresponding fth component in the object space. Note how these projections
also correspond to the regression formalism for calculating regression
coefficients, for which reason steps 2 & 3 have been described as the “criss-
cross regressions” heart-of-the-matter of the NIPALS algorithm. “Criss-cross
projections” may be an equally good understanding.
The updating step is often also called deflation: Subtraction of component no. f.
The primary characteristics for the NIPALS algorithm are that the principal
components are deliberately calculated one-component-at-a-time. NIPALS goes
about this iterative PC-calculation by working directly on the raw X-matrix alone
(appropriately centered and scaled). This numerical approach to bilinear analysis sets
it apart from several other calculation methods, such as the Singular Value
Decomposition (SVD) and the so-called direct X^T X diagonalisation methods, the
description of which falls outside the present scope. Appropriate references for this
endeavor can be
found in Martens & Næs (1987).
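A compact numpy sketch of the NIPALS iteration just described is given below. It follows the criss-cross projection steps and the deflation, but the exact starting vector, convergence criterion and tolerances are implementation details that vary between programs (The Unscrambler's internals are not reproduced here); it assumes X has already been centered, and scaled if appropriate, and that A does not exceed the rank of X.

    import numpy as np

    def nipals_pca(X, A, tol=1e-8, max_iter=500):
        """Illustrative NIPALS PCA on an already centered (and scaled) matrix X."""
        E = X.astype(float).copy()
        T = np.zeros((E.shape[0], A))
        P = np.zeros((E.shape[1], A))
        for f in range(A):
            # Step 1: proxy score vector - the column of the current E with the largest norm
            t = E[:, np.argmax(np.linalg.norm(E, axis=0))].copy()
            for _ in range(max_iter):
                p = E.T @ t / (t @ t)            # Step 2: project the variable vectors onto t
                p = p / np.linalg.norm(p)        # normalize the loading vector
                t_new = E @ p                    # Step 3: project the object vectors onto p
                if np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new):
                    t = t_new
                    break
                t = t_new
            T[:, f], P[:, f] = t, p
            E = E - np.outer(t, p)               # updating/deflation: subtract component no. f
        return T, P, E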
Standardization
There are many ways to scale or weight the data. The most usual scaling factor is
the inverse of the standard deviation. Each element in the X-matrix is multiplied
with the term 1/SDev:
Equation 4.1:   x_{ik}^{scaled} = x_{ik} \cdot \frac{1}{SDev_k}
Recall that standard deviation was defined in frame 1.2 (Chapter 1). By scaling
each column in X with the inverse of the standard deviation of the corresponding
variable, we ensure that each scaled variable gets the same variance. Try to
calculate the variances of the scaled variables and verify that they all have the same
variance, all equal to 1.0.
This is a very common scaling method when you analyze variables measured in different
units, so that some display large variances compared to others. For example, the
variance of one variable could be of the order of several thousands, while the
variances of others are perhaps of the order of 0.001. This is surely a case
demonstrating the need for inverse standard deviation weighting, i.e. standardization.
By this means all variable variances become comparable. No one variable is allowed to
dominate over another solely because of its range, and thereby unduly influence the
model (because of its measuring unit). A simple example would be if one mass variable
was measured in kg whilst another was measured in mg. Standardization would put these
on the same variance scale.
Since we are looking for the systematic variations in PCA, standardization allows
subtle variations to play the same role in the analysis as the larger variations; this is
a very powerful result of using the very simple standardization option.
Autoscaling
The combination of mean centering and scaling by 1/SDev is often called
autoscaling. Figure 4.1 shows what happens to the data set during autoscaling.
Thus autoscaling is not always the obvious form of scaling to use when the variables
are measured in the same unit. This is not a fixed rule, however; you still
have to investigate the empirical variances and their comparability. There are
indeed also many cases where 1/SDev scaling of spectroscopic data gives the best
results. The cost is a loss of direct spectroscopic interpretability of the loadings,
but the data analytical model may very well still serve its purpose better. This small
spectroscopic aside has caused a great deal of confusion – especially amongst
data analytical beginners, naturally enough.
Sometimes, in order to avoid this situation, it has been suggested that an arbitrary
pre-multiplication of an afflicted variable by, say, a factor of 10, will rectify this
undesired result. Clearly, however, this trick only applies to the situation in which
the subsequent data analysis is carried out without any further preprocessing. The
particular relative covariance structure of this “rectified” variable, in relation to all
others in the multivariate analysis, will be the same when using standardization or
autoscaling, which often was the reason to worry about the impact of the (0,1)-
sized variance in the first place.
If you have changed units in a particular data analysis, you must remember also to
present the data analytical results in their original units, in reports etc. A more
systematic overview of scaling, sufficient for most uses stemming from this
introductory course, is given in chapter 9.
4.2 Outliers
In the previous sections we have briefly mentioned outliers, atypical objects or
variables on a few occasions. If outliers are the result of erroneous measurements,
or if they represent truly aberrant data etc., they should of course be removed,
otherwise our model will not be correct.
On the other hand “outliers” may in fact be very important, though somewhat
extreme, but still legitimate, representatives for your data, in which case it is
essential to keep them. Thus if they represent a significant or important
phenomenon and you remove them, the model will also be “incorrect”. It will lack,
and will consequently be unable to explain, this property which is in fact present in
the data. The model will be an equally poor approximation to the real world. This
may appear to be a major problem - that you will create a false model if you include
true outliers and also if you remove false outliers. Fortunately, an outlier must be
either one or the other.
The “bad news” is that it is always up to you to decide whether an outlier should be
kept or discarded. In fact, the only problem about this is that it will take some
experience to get to know these things by their appearances in the appropriate plots,
but it is a feasible learning assignment – and it is one which is absolutely essential
to master. Textbook examples, exercises and personal experience will quickly get
you up to speed in this task.
It is perhaps most important to realize that there are essentially two major outlier
detection modes:
1. Data analytical: the relative (geometrical) distribution of objects in e.g. the
score plots is all you have to go by. Decision must be based wholly on
experience.
Score plots are particularly good for outlier detection. An outlying object will
appear in score plots as separate from the rest of the objects, to a larger or smaller
degree. This is the result of one, or more, excessively high or low scores as
compared to the other objects. Figure 4.2 shows two cases of outliers. In the left-
hand one the object is a potential outlier, but some observers may decide that it
nevertheless still “fits” in general, while the object in the right panel is
considerably more doubtful when assessed together with all the remaining objects,
and their trend.
You will see - and learn much more about - outliers later in this book.
(Figure 4.2 - two score plots (t1 axes): a borderline outlier, left, and a clear outlier, right)
Residual object/variable variances can be used for this purpose, as indeed can the
relevant plots as well. This latter manual option may sometimes involve a lot of
work though, especially when there are many variables and/or objects. We shall
also show you some examples of this outlier detection approach, but the general
issue of automatic outlier deletion mainly falls outside the scope of this
introduction to multivariate data analysis.
In Figure 4.3 the data set has been scaled with 1/SDev before PCA, whilst in Figure
4.4 no such scaling has been performed. The two distributions of objects are quite
different, even though we do find the same overall groupings present.
(Figure: score plot with sample numbers; p-0, PC(expl): <1(54%),2(15%)>)
This simply means that in a great many situations, there is much (often all) to be
gained by paying the closest attention to the context surrounding the generation of
the data. The data analyst simply cannot learn enough about the available data in all
specific data analytical situations – never mind the overall, general principles of
multivariate data analysis, which unfortunately makes up the range of what can be
learned from the mere reading of a textbook. Experience rules!
On the other hand, there is a major advantage: it does not matter if your data set
contains any amount of additional information. The multivariate analysis will find
this easily enough and you will have an unexpected bonus. Occasionally, it turns
out that important information was to be found in this additional realm, even
though at the outset this was not thought to be the case. Time spent in analyzing
particular data analytical objectives is generally very well spent. But, there is a
downside too: even the best data analytical method in the world cannot compensate
for a lack of information, i.e. a bad, or ill-informed choice of variables/objects.
In the first runs you should of course freely compute “too many” PCs, to be sure
that there are more than enough to cover the essential data structures. There is a
great risk in missing the slightly more subtle data structures, for want of a few more
components in the initial runs. Until the data set is internally consistent (free from
all significant outlying objects and/or variables etc.), there is no point in
determining the optimum number of PCs, as this number may change depending on
what you do with the data set next.
Step 4: Exploration
The first few score plots are investigated to determine the presence of major
outliers, groups, clusters and trends etc. If the objects are collected in two separate
clusters, as an example, you should naturally determine which phenomena separate
them and decide whether the clusters rather should be modeled separately, or
whether the original objective of analyzing the total data set still stands.
Be especially aware if the score plots show suspect outliers, as they will also affect
the loadings, usually severely. In this case do not use the loadings to
detect outlying variables at this stage of the data analysis. Although in general one
should have good reasons for removing anything from the data set, on the other
hand, too much caution can be equally dangerous. You may have to perform
several “initial runs” and successively exclude outliers before the data set can be
satisfactorily characterized. One excludes all outlying objects before one embarks
upon the more subtle pruning away of information-lacking variables; this order of
outlier exclusion (objects before variables) is extremely important. The most
common error that inexperienced data analysts often make is to leave “too much”
as it is; in other words one does not take sufficient personal responsibility with
respect to deleting outlying objects or variables, dividing into sub-groups, etc.
Note!
This option is not applicable with the training version, where only File-
Open can be used to access existing Unscrambler data files.
3. Plot the data to get a first impression. Mark the data and choose Plot - Matrix.
4. Perform pre-processing if necessary. (Modify - Transform).
Note!
Scaling and centering are done from the Task menu (see hereafter).
5. Open the Task - PCA menu. This starts the PCA dialog box.
• Select the samples to be analyzed, from the “Samples” tab. If necessary
click Define… to create a new sample set.
• Select the variables to be taken into account, from the “variables” tab.
Check the current weighting options, and if relevant change the weights
by clicking Weights… In the Set Weights dialog which then pops up,
select the variables to be weighted, then pick up the desired weights at the
bottom of the dialog box (A/Sdev + B, or 1.0, or constant), then click
Update to apply the weights. Finally click OK to close the Set Weights
dialog box.
• Back to the main PCA dialog: choose validation method = Leverage
correction (at this stage – before you have learned more about
validation).
• Choose an appropriate number of Principal Components.
• Make sure that option Center Data is active.
• You may now start the computations by clicking OK.
6. Evaluate the present model by plotting the results (View). Go back to step 5
and use the option Keep out of calculation for the detected outliers. It is
normal to repeat this several times during an analysis.
(Figure, panel (a): each point is a sample; the mean is marked; the direction of the PC is described by the loading vector p; axes x1 and x2)
Translated into the principal components model, the new coordinate system has
fewer dimensions than the original set of variables, and the directions of the new
coordinate axes, called principal components, factors, or t-variables, have been
chosen to describe the largest variations. This is called decomposition or data
structure modeling, because we try to describe important variations, and
covariances in the data, using fewer PCs, i.e. by decomposing into orthogonal
components, which are supposed to be easier for interpretation as well.
The coordinates of the samples in the new system, i.e. their coordinates related to
the principal components, are called scores. The corresponding relationships
between the original variables and the new principal components are called
loadings. The differences between the coordinates of the samples in the new and
the old system, lost information due to projection onto fewer dimensions, can be
regarded as the modeling error or their lack-of-fit with respect to the chosen model.
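In matrix terms the scores are obtained by projecting the centered samples onto the loadings. A two-line numpy sketch is shown below, assuming x_mean and P come from an already fitted PC-model; the function and argument names are our own illustration choices.

    import numpy as np

    def project_onto_model(X_new, x_mean, P):
        """Scores (coordinates along the PCs) and residuals for samples X_new."""
        Xc = X_new - x_mean                  # center with the model's variable means
        T_new = Xc @ P                       # scores: coordinates in the PC coordinate system
        E_new = Xc - T_new @ P.T             # lack-of-fit with respect to the chosen model
        return T_new, E_new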
Figure 4.6 illustrates PCA decomposition on a very small data set. The data set is
made simply to make it easy to see the principal components in relation to the
original variables. The data set contains 8 samples and 2 variables.
Figure 4.6 - Upper: the 8 samples in the original (x,y) variable space; lower: the exact PC1 vs. PC2 scores (Very small data…, X-expl: 100%,0%)
Figure 4.6 (upper part) shows the samples displayed in the original X-Y variable
space. Note that the axes do not have the same scale. Approximate principal
components lines are drawn by hand in the upper panel. Figure 4.6 (lower part)
shows the exact scores for PC1 and PC2 after a PCA. PC1 actually explains close
to 100% of the variance, as is also made clear by the fact that the principal components
are now rotated relative to the original variable axes.
One should of course always try to compare the data analytical result with the best
estimate one may come up with regarding the “expected” dimensionality. As an
example, consider the case of an NIR-spectroscopic investigation of mixtures of
alcohol in water. Here we might expect one PC to be appropriate for example,
reflecting a two-component end-member mixing system. However mixing alcohol
with water also gives rise to physical interference effects which require one, or
maybe even two additional PCs for a complete description. The number of PCs in
practice is therefore not 1-2 but rather 2-3. On the other hand, if your PCA on the
alcohol/water spectra came up with, say, 5-7 components, one would naturally be
very suspicious. Such a large number of components for this system clearly implies
a gross overfitting – unless, say, contaminants were at play.
In the pea example only the type of peas and the harvest time have been varied. We
should not therefore expect to see more than a couple of PCs. For the car dealership
data, we simply have no clue at the outset of the analysis. All we may surmise is
that there certainly cannot be more than 10 independent economic indicators
present in this data set, equal to p, but surely some correlation amongst these is to
be expected in such a sufficiently interacting system as the economic performance
of multi-million dollar dealerships. Still we in fact do not know whether to expect
A close to one, or closer to 10 in this case.
The main challenge in PCA is how many PCs to use - and how to interpret them.
To look for patterns, it might for example be useful to draw lines between objects
with identical dye level and to encircle groups of objects with the same milk level,
as has been done in Figure 4.7, manually by all means, if need be, or aided by some
relevant computer software (it is perfectly admissible to use hand-drawn guide
lines on any type of plot – in fact this is recommended throughout the modeling
phase in which information is emerging). The issue here is not how to do this - the
issue is to use whatever appropriate annotation to the plot in question, which will
help your particular interpretation.
The annotated plot shows that dye concentration increases along a (virtual) axis
that would go from lower left to upper right; i.e. both PCs contribute to the
determination of dye concentration. Similarly, milk content increases along a
direction that, although not quite straight within each dye concentration, could be
summarized by an axis roughly going from upper left to lower right. So both PCs
contribute to the determination of both compounds, even in the decomposed score-
plot (which above has been claimed to result in orthogonal, individually
interpretable components). Well, both yes and no!
It is - perhaps regrettably - not always this simple. There is no guarantee that all
data systems will necessarily be structured in such a simple fashion so as to be
stringently decomposable only into one-to-one phenomena-PC-component
relationships. But the PCA-results are nevertheless decomposed into easily
interpretable axes, a milk-concentration axis and a dye-concentration axis. By
careful inspection it may in fact be appreciated that these two mixing-phenomena
axes are still very nearly orthogonal to each other; it is just that the
simultaneous description of both these more complex phenomena requires that both the
first and the second PC axes be involved.
As you can see at the bottom of Figure 4.7, The Unscrambler always lists two
numbers, “X-expl: (92%),(7%)”. These correspond to the “explained variances of
X” along each component shown. PC1 explains 92% of the original variance in X
while PC2 explains 7%. This shows that PC1 and PC2 together describe 99% of the
total variation in the X-matrix. Higher order PCs therefore account for less than 1%
of the variation, so interpretation of, or outliers in, the higher order PCs would be
absolutely irrelevant.
Figure 4.8 - Score plot with apparent gross outliers (objects 25 & 26)
Figure 4.9 - Score plot after removal of outliers 25 & 26; cf. Figure 4.8
In general objects close to each other are similar. Samples 16 and 19 are very
similar in Figure 4.9, whilst 18 and 7 are very dissimilar. Note also that these
particular two samples are actually the two most dominating samples defining PC2
(however, the latter would almost be identically defined were these two samples
removed).
The two score plots in Figure 4.10 and Figure 4.11 are the results of PCA on
another data set, with a different pretreatment in each case. The data is again a set
of NIR spectra. The score plot in Figure 4.10 was made from an analysis of the raw
spectra. The spectra were then pre-processed with a particular method called
Multiplicative Scatter Correction (MSC- to be explained later), resulting in the
alternative score plot in Figure 4.11.
Once again the score plots are relatively different, although the same overall
disposition of all objects is pretty much recognizable in both renditions - but which
is correct? In this case one would probably use the score plot in Figure 4.11,
because of its more homogeneous layout of objects. This plot is claimed to be
easier to interpret than Figure 4.10, by experienced data analysts. This data set will
appear again in some of the later exercises, in which some (more) argumentation
for this stand (Figure 4.11 over Figure 4.10) will also become apparent.
Figure 4.10 - Score plot (from Figure 4-8) for un-preprocessed data
(Score plot for the raw spectra; Alcohol raw, X-expl: 79%,16%)
Figure 4.11 - Score plot for the same spectra after MSC pre-processing
There are plenty of other optional pre-treatments. In general it is bad form to try out
all alternative scalings, transformations or normalisations indiscriminately without
problem-specific justification. Some of the most important pre-treatments will be
described in detail in section 11.3. At this point all you need to know is that there
are many possibilities, that they are all problem dependent, and that a wrong choice
may unfortunately lead to interpretations that are not relevant to your specific
problem.
• Objects close to each other are similar, those far away from each other are
dissimilar.
• Objects in clear groups are similar to each other and dissimilar to other groups.
Well-separated groups may indicate that a model for each separate group will be
appropriate.
• “Isolated objects” may be outliers - objects that do not fit in with the rest.
• In the ideal case, objects typically should be “well spread” over the whole plot.
If they are not, your problem-specific, domain knowledge must be brought in.
• By using informative object names that reflect the most important
external properties of the different objects, one may better understand the
meaning of the principal components as directly related to the problem context.
• The layout of the overall object structure in score plots must be interpreted by
studying the corresponding loading plots.
Correlation / Covariance
Variables close to each other, situated out towards the periphery of the loading
plots, covary strongly, in proportion to their distance from the PC-origin
(relative to the overall total variance fractions explained by the pertinent
components). If the variables lie on the same side of the origin, they covary in a
positive sense, i.e. they have a positive correlation. If they lie on opposite sides of
the origin, more or less (some latitude here) along a straight line through the PC-
origin, they are negatively correlated. Correlation is not automatically reflecting a
causal relation; interpretation is always necessary. Also: loadings at 90 degrees to
each other through the origin are independent. Loadings close to a PC
axis are significant only to that PC. Variables with a large loading on two PCs are
significant on both PCs.
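Correlations read off a loading plot can always be cross-checked numerically. The following numpy sketch lists the strongest pairwise correlations in a (pre-processed) data matrix; the function name, the variable-name list and the default k are our own illustration choices.

    import numpy as np

    def strongest_correlations(X, names, k=5):
        """Return the k variable pairs with the largest absolute correlation."""
        R = np.corrcoef(X, rowvar=False)            # p x p correlation matrix of the variables
        rows, cols = np.triu_indices_from(R, k=1)   # upper triangle, excluding the diagonal
        order = np.argsort(-np.abs(R[rows, cols]))[:k]
        return [(names[rows[j]], names[cols[j]], round(float(R[rows[j], cols[j]]), 2))
                for j in order]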
Spectroscopic Data
In spectroscopic applications, and similar data sets with many X-variables, the one-vector
loading plot is often the more appropriate. Again, large loading values imply
important variables (e.g. wavelengths).
Below we have listed some of the most common potential pitfalls. While it is not
completely comprehensive (even the senior author of this book has not finished
making illuminative errors), one may certainly use it as a useful checklist.
procedures that may not apply to your specific data set, the present The
Unscrambler included.
Hopefully you have not been put off completely by this list of possible errors, some
of which cannot even be detected when they arise! Experience, experience – and
still more experience is the only thing that will help you through many of these
pitfalls. Below in chapter 5 you will find a selected series of representative real-
world data sets, all of which show one or more interesting particularities. Which is
just the stuff experience come from! But first we will lead you through a
particularly interesting case.
Purpose
To learn about outliers and how to recognize them from the score plot and the
influence plot, which we have not introduced before.
To learn that in the end it is you, the data analyst, who must assume responsibility
and decide on the outlier designation issue(s). There is no other way.
Data Set
The Troodos area of Cyprus is a region of particular geological interest. There has
been quite some dispute over this part of Cyprus’ geological history, which
however need not be given in all details here, in order for the data to be used in the
present context. At our disposal we have 143 rock samples from different locations
in the areas underlying the pertinent section of the Troodos Mountains of Cyprus.
The rocks were painstakingly collected by the senior author’s geologist-friend and
colleague since college days, Peter Thy, in a series of strenuous one-man field
campaigns in Cyprus. The data analysis was carried out many years later.
The values in Table 4.1 are measurements of the concentrations of ten rock
geochemical compounds. Geologists often use such data in order to discriminate
between “families” of rocks, or rather, chemically related rock series, in order -
they hope - to be able to discriminate between genetically different, or similar rock
groups, clusters or rock series.
One cardinal question is: is there more than one overall group of samples?
If the rocks are all geochemically similar, one would expect the whole area to have
been formed geologically at the same time. If there are clear groups in the locations
of rocks, due to different geological backgrounds, we might draw other conclusions
about the formation of the area. This work was originally carried out to help settle a
major controversy regarding the entire geological history of Cyprus, see Thy and
Esbensen (1993) for details.
Tasks
1. Make a PCA model; find all significant patterns that may impinge on the
objectives as laid out above.
How to Do it
1. Read the data file and study the data table in a matrix plot. Mark the whole
data table and choose Plot - Matrix. Are all the variables in the same value
range? Do they vary to the same extent? Is scaling necessary?
Close the viewer, and use the View - Variable statistics menu to check the
mean values and standard deviation of the variables; then close the Variable
Statistics window.
2. Go to the Task - PCA menu and make a model with the following parameters.
Samples: All samples
Variables: All variables
Weights: 1/SDev
Validation: Leverage correction
Number of PCs: 10
Warning Limits - Outlier limits (Field 2 - 7): 3.5
We use leverage correction here to make the modeling faster but, as you will
learn later, another validation method is more appropriate before we complete
the analysis.
Study the PCA Progress box. You can see how many outlier warnings were
given at the computation of each PC. Hit View to continue. The model
overview appears.
Study the warnings by selecting Windows - Warning List - Outliers and note
the outliers in the first PCs, especially 65, 66, 129 and 130. When you have
finished, close the Warning List.
3. Study the score plot for PC1 vs. PC2. Look for samples that are far away from
other samples.
Such lonely samples might indeed be outliers, but they can also be extremes.
You have to bear the original problem context and the raw data in mind while
working with the plots. It is also quite normal to look at the original data
matrix again when assessing potential outliers. In fact from Figure 4.12 it
would appear that objects 65-66 are indeed very atypical – especially with
respect to the dominating trend made up of all other samples. Whilst we would
not be justified in regarding either object 129 or 130 in a similar vein, it is a fairly
safe initial bet that objects 65-66 are indeed gross outliers, while object 129 is
in all likelihood an extreme end-member only. The status for object 130
probably needs further investigation.
Figure 4.12 is thus a typical picture of a model where such problems may
occur. A “good model” spans the variations in the data such that the resulting
score plots show the samples “well spread over the whole PC space”. In the
present plot the samples are well spread in PC1, but only a very few samples
represent the major variations in PC2. All the other samples are situated at, or
very close to the origin (zero score values in PC2), i.e. they have very little
influence on the model in this direction. We also note the <54%,23%>
partitioning of the total variance captured along <PC1,PC2> respectively.
4. Select Plot - Residuals (you may double-click on the miniature screen in the
dialog box to make your plot fill the whole screen) and plot the Influence plot
for PC1 - PC4 (write the range “1-4” under Components). Observe how
samples 65 and 66 move. This is a typical behavior of outliers and is illustrated
in Figure 4.13.
The plot shows residual variance on the ordinate axis and leverage along the
abscissa. High residual variance means poor model fit. High leverage means
having a large effect on the model. Therefore samples in the upper right corner
(large contribution to the model and high residual variance) are potentially
dangerous outliers.
When you add more PCs the residual variance decreases and even outliers will
eventually be fitted better and better to the model. The model thus
concentrates on describing the variations due to these few different samples
instead of modeling the variations in the whole data set.
Figure 4.13 - Influence plot for the Troodos data: residual variance vs. leverage (0 - 0.06), PC1 to PC4, model troo-0
6. Go back to the data table (Window, Troodos) and select Task - PCA dialog.
Select Keep Out of Calculation in the sample tab. Type 65-66. (We always
remove only a few outliers at a time, starting with the most serious ones). Re-
model. Study the new warnings, the variance, the score plots, and the influence
plots. Are there any more outlier candidates? Has the amount of explained
variance increased?
7. Remove the next one, or two most obvious outlier candidate(s) (129-130). Re-
calibrate again and study the resulting new scores and variance. Does the
model look “better” now? Has the explained variance increased? Are there
more outliers? Do you see any signs of groups? Now take a look at the
loadings to see which variables influence the model.
Summary
In this exercise you worked with a data table, which after some initial standard
PCA apparently contained only four outliers. The difference between outliers and
extremes can indeed be small. Remove only one or two at a time, make a new
model, and study the new model to see how the changes imparted manifest
themselves. After removing the first two outliers, the explained variance was
slightly higher at 2 PCs. Removing the next two outliers did not really change the
explained variance any further.
The first 2 or 3 PCs describe 78-87% of the total variance. It is not necessarily an
objective in itself to achieve as high as possible a fraction of the total variance
explained in the first Principal Components to the exclusion of other data analytical
objectives; but it is of course often an important secondary goal even so. For the
present case: in the last score plot for PC1 vs. PC2 we now see clear signs of two
data groupings on either side of the ordinate. Finding out whether there was only
one, or several rock groups was the overall objective for this data analysis.
It is of course difficult for you to interpret the meaning of PC1 without more
detailed geological knowledge about the samples. What is objectively clear
however, is that the corresponding PC1 vs. PC2 loading plot indicates that variables
6 (MgO) and 7 (CaO) pull one group to the left, and the rest (except no. 3 Al2O3)
pull the other group to the right (see for yourself!). Thus there is a very clear two-
fold grouping of these 10 variables (one lonely variable would appear to make up a
third group all on its own along the PC2-direction, which we shall not be interested
in here). While there is a total of ten variables, there are in fact only two underlying
geochemical phenomena present, and the one portrayed by PC1 involves no less
than nine of these variables (and they are all pretty well correlated with one
another; two of them are negatively correlated to the seven others).
Note that these groupings of objects as well as variables were not at all obvious
until the four outliers were removed. Objects 19 and 20 are now also seen as
potential candidates for removal. We could continue to pick out more “mild
outliers” of course, but the main objective - to look for separate groups - has been
achieved after removing only four outlying, severely disturbing samples. This
revealed “hidden grouping” resulted in a new, interesting geological hypothesis,
Thy & Esbensen (1993).
From a geochemical point of view, one could also study the outliers in more detail
to understand how they are different, as well as going into the detailed
interpretations of each of the three PC components in the final model, including the
interesting meaning of the singular PC-2 variable, but this is of course a task
rightfully reserved for the geologists, and we here leave these results for them to
mull over further, ibid.
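Readers who want to reproduce this kind of outlier screening outside The Unscrambler may find the following minimal Python/NumPy sketch of the influence-plot logic useful: sample leverage versus residual X-variance on a standardized two-PC PCA model. The file name troodos.csv and the cut-off thresholds are hypothetical illustrations, not part of the original exercise.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.loadtxt("troodos.csv", delimiter=",")       # hypothetical file: 143 samples x 10 variables

Xs = StandardScaler().fit_transform(X)             # centering + 1/SDev weighting
pca = PCA(n_components=2).fit(Xs)
T = pca.transform(Xs)                              # scores, 143 x 2

# Leverage of each sample on the 2-PC model: h_i = 1/n + sum_a t_ia^2 / (t_a' t_a)
n = Xs.shape[0]
leverage = 1.0 / n + np.sum(T**2 / np.sum(T**2, axis=0), axis=1)

# Residual X-variance per sample after 2 PCs
E = Xs - pca.inverse_transform(T)
residual_variance = np.mean(E**2, axis=1)

# Samples high on BOTH axes are the potentially dangerous outlier candidates
suspects = np.where((leverage > 3 * leverage.mean()) &
                    (residual_variance > 2 * residual_variance.mean()))[0]
print("Outlier candidates (0-based row indices):", suspects)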
Problem
The data set used in this exercise is taken from Fisher (1936). This is a famous data
set in the world of statistics. The variables are measurements of sepal length/width,
and petal length/width of twenty-five plants from each of three subspecies of Iris:
Iris Setosa, Iris Versicolor and Iris Virginica. The data are eventually to be used to
test the botanical hypothesis that Iris Versicolor is a hybrid of the two other
species. This hypothesis is based on the fact that Iris Setosa is a diploid species
with 38 chromosomes, Iris Virginica is a tetraploid and Iris Versicolor is a
hexaploid having 108 chromosomes.
Data Set
The file IRIS contains several data sets. The sample set Training contains
measurements of four variables (see above) for 75 samples: 25 samples of each Iris
type.
Tasks
1. Make a PCA model and identify clusters.
2. Find the most important variables to discriminate between the clusters found.
How to Do it
1. Open the file called IRIS and investigate it using plots and statistics.
There are four outlier warnings. For the moment we will disregard them. We
will now look at the results directly.
3. Variance
If necessary, change the residual variance plot to explained variance. Select
View - Source- Explained Variance.
How many PCs must be applied to explain, say, 70% of the variance? 95%?
4. Scores
Interpret the score plot for PC1 and PC2. How many groups do you see? How
does that comply with our prior knowledge about the data? Is there a clear gap
between the versicolor and the virginica groupings? What does PC1-PC3 show?
5. Loadings
Study the 2D Scatter loading plot. Which variables are the most important?
Which variable discriminates between setosa and the other two? Try to plot also
the loadings as line plots. Edit - Options may be useful to plot the results as
bars instead of lines.
Summary
The first two PCs account for approximately 96% of the variance. There are three
classes: one very distinct (the setosa class) and two others which are not as well
separated from each other. This plot, of course, clearly indicates that the versicolor
and virginica species are most alike, while the setosa species is distinctly different
from these two. It may also suggest that it can be difficult to differentiate versicolor
from virginica, but we have only taken our very first shot at this yet. We can see
however, that Iris versicolor lies between setosa and virginica, perhaps supporting
the hypothesis that it is a hybrid - or at least not contradicting this. Perhaps more
information is needed to be in any position to address this important scientific
question?
All the variables are used to differentiate between the three different species. Of
course, in an up-to-date study we would have used many more morphological
variables. This particular 4-variable data set of Fisher (1936) has become a
statistical standard over the years, and been used to test a great many new
methodological approaches; see e.g. Wold (1976). We shall return to the Iris data
set later.
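For comparison outside The Unscrambler, here is a minimal Python sketch of the same analysis, using the full 150-sample Fisher iris set bundled with scikit-learn (not the 75-sample Training set in the IRIS file), so the exact percentages will differ slightly from those quoted above.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
Xs = StandardScaler().fit_transform(iris.data)      # four standardized morphological variables

pca = PCA(n_components=2).fit(Xs)
T = pca.transform(Xs)                               # scores used for the cluster picture

print("Explained variance per PC:", np.round(pca.explained_variance_ratio_, 2))
for k, name in enumerate(iris.target_names):        # setosa, versicolor, virginica
    print(name, "mean (PC1, PC2) score:", np.round(T[iris.target == k].mean(axis=0), 2))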
Problem
Assume that you are about to test a new reaction for which electrophilic catalysis is
strongly believed to be beneficial. For this purpose the addition of a Lewis acid
catalyst would be worth testing. As there are many potentially useful Lewis acids to
choose from, the problem is to find a good one, preferably the optimal one. In
totally new reactions it may be difficult to make a good guess, so we need to make
some experiments. But which ones should we test? How do we design such an
experiment?
A good idea would be to select a limited number of test catalysts to cover a range
of their molecular properties. Using PCA we may describe a range of different
catalysts in terms of principal properties - i.e. the principal components that
describe the main variations.
Data Set
The data table in the file LEWIS contains chemical descriptors for 28 Lewis acids.
The following descriptors are believed to contain relevant information for the
problem (Table 5.1):
Tasks
Make a PCA model to select catalysts that are the most different, i.e. span the
variations.
How to Do it
1. Open the file LEWIS and make a PCA model. Should the data be
standardized?
Validate with leverage correction and calculate the maximum number of PCs.
Interpret the score plot using the two first PCs. How many PCs do you need to
explain more than 50% of the variance? Select 9 Lewis acids with “the most
different” properties from the plot!
2. Study the loading plot. Are any of the descriptors unnecessary? Which
variables show the highest covariation? Which variables show a negative
covariation with variable 9?
The organic chemists who conducted these experiments originally chose nine
different acids, but they also based their decisions on certain chemical factors,
which we are unaware of. They chose samples 1, 4, 11, 12, 13, 16, 19, 26 and
28, which cover the variations well and also include a few around the middle of
the score plot. Why also choose these latter?
The experimental results obtained in the reactions using these selected catalysts
fully confirmed independent conclusions in the literature on preferred catalysts.
Lewis acid number 1 (AlCl3) got the best results. Sample 18 has also been
reported to be a superior catalyst in Friedel-Crafts reactions, and you can see
that samples 1 and 18 do indeed lie close to each other in the score plot.
Summary
In this exercise you have used the score plot to find samples that differ greatly from
each other, i.e. representative samples that span the experimental domain as much
as possible. Samples lying close together in the plot of course have similar
properties. The extreme samples all lie far away from the origin. All variables have
large contributions. Variables 2 and 3 covary. Variable 9 has a negative covariation
with number 10, both in PC1 and PC2.
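Picking the “most different” samples from a score plot can also be mimicked algorithmically. The sketch below performs a greedy maximin selection in PC-score space; the file name lewis.csv, the use of two PCs and the distance criterion are illustrative assumptions, not the procedure the original chemists followed.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.loadtxt("lewis.csv", delimiter=",")          # hypothetical file: 28 Lewis acids x descriptors
T = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

def maximin_select(points, k):
    # Start from the sample farthest from the origin, then repeatedly add the
    # sample whose minimum distance to the already-selected set is largest.
    chosen = [int(np.argmax(np.linalg.norm(points, axis=1)))]
    while len(chosen) < k:
        d = np.min(np.linalg.norm(points[:, None, :] - points[None, chosen, :], axis=2), axis=1)
        d[chosen] = -np.inf                         # never re-pick a selected sample
        chosen.append(int(np.argmax(d)))
    return chosen

print("Suggested spanning set (0-based):", maximin_select(T, 9))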
Problem
The data in this exercise were kindly provided by IKU (Institute for Petroleum
Research), Trondheim, Norway. During oilrig drilling operations, mud (barium
sulfate with other chemical components) is sometimes released into the sea and
may thus cause pollution, primarily along the main current direction. The oil
authorities demand regular monitoring of the pollution level around the platforms;
if it is too high, the mud must be removed or the concentrations of harmful
substances must be reduced in another way.
77 mud samples were collected and their chromatograms recorded. About 3000
peaks were reduced to 1041 by maximum entropy pretreatment in a selected
chromatographic retention interval. Normally several of the peaks are integrated,
but this does not really catch all the important variations. In addition it may be
difficult to compare many chromatograms and interpret the variations.
It is normal to quantify the THC (total hydrocarbon contents), but using PCA we
can instead:
• get a qualitative measure
• get an overview of the variation in a compressed way
• interpret loadings to find the interesting peaks
• look for patterns in the score plot
• classify the samples with regard to level of pollution
Data Set
The data file MUD contains 77 chromatograms with 527 variables (chromatographic
peaks). Originally there were about 3000 peaks, reduced to 1041 as described
above; further variables with little information have also been deleted, leaving
527, so that you can analyze the data even if your PC has little memory.
Tasks
Make a PCA of the raw data table. Investigate if there are significant patterns that
reflect polluted and non-polluted samples. Then make the model on standardized
data instead.
How to Do it
1. Open the data file MUD and plot the data as lines to get an overview.
Typical unpolluted samples are no. 1, 2 and 3; typical polluted samples are 77,
23 and 22 for example. Do as follows: Edit – Select Samples, Samples: 1-3,
22, 23, 77, OK. Plot – Line, All Variables, OK. Edit – Options, Curve ID,
Labels Layout: Position on File.
2. Make a PCA model without weighting. Validate with leverage correction and
set the warning limit for outlier detection (field 2 - 5) to 5.0. Calculate 4 PCs.
Study the score plot. How much of the total variation is described by the first
two PCs? How can you interpret PC1? Find the samples listed as polluted
above and compare to the unpolluted; draw your conclusions.
Based on the score plot, would you think sample number 52 is polluted or non-
polluted? What about sample 66? Is sample 72 more polluted than sample 22?
4. Study the line loading plot (activate the upper right plot, Plot – Loadings –
General, Line, Vector 1: 1-2, OK) to look for interesting peak areas. At which
retention times are the chromatograms most different?
Save the model: File – Save As, Mud1, Save.
We would normally consider standardizing the data, when there are such large
differences between the variables as you could see in the line plot. The aim is to
also allow the more subtle variations to play a role in the analysis. Run a new
PCA model with Weights = 1/Sdev: activate the data editor, use the Task menu,
etc. Give the model a new name.
Close the data editor and select Window – Tile Vertically so that you can
compare the two models. Study the explained variance and the scores. Is this
model different? Does it change your overall interpretations? Try to explain
why the models give the same results!
Summary
There is a break in the variance curve at 1 PC, but with so many variables and
samples we should also be able to use 2 PCs to get a good 2-vector score plot. 2
PCs explain 93% of the total variations in the 77 chromatograms. The explained
validated variance is 88% using 2 PCs. The unpolluted samples lie to the left in the
score plot, while the polluted ones lie to the right. The first PC thus seems to
describe the overall, general pollution level. Sample 52 is unpolluted, while no. 66
and 72 are polluted. The more to the right the samples appear, the more polluted
they are.
The loadings in PC1 are largest at retention times between 100 and 300, so this is
where the most interesting peak information lies for this data set. In PC2 variables
105-110 have the largest loadings.
The model based on standardized data shows the same general patterns. The score
plot is reversed along PC2, which does not matter - only the relative patterns count
in PCA. In this case the systematic variations in the important variables are very
large, both in the standardized and in the non-scaled data, and therefore they
dominate both models. The loading plot naturally also shows a reversed PC2, and
has a somewhat different shape.
Normally you must be very cautious when dividing variables with values between 0
and 1 by their standard deviation. If the standard deviation is small, you will be
dividing by a number dangerously close to zero, which may amplify that variable
unnaturally in the scaled data and can sometimes cause numerical instability in the
calculations. In this case all the variables lie between 0 and 1, so they were all
affected in much the same way.
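A minimal sketch of this cautionary point: standardize, but guard against near-zero standard deviations so that an almost constant variable is not blown up. The threshold used is an arbitrary illustrative choice, not a value recommended by the text.

import numpy as np

def safe_standardize(X, eps=1e-8):
    Xc = X - X.mean(axis=0)                 # always center
    s = X.std(axis=0, ddof=1)
    s_safe = np.where(s > eps, s, 1.0)      # leave near-constant columns unscaled
    return Xc / s_safe

# Example: the second column is almost constant and would otherwise be amplified enormously
X = np.column_stack([np.random.rand(10), np.full(10, 0.5) + 1e-12 * np.random.rand(10)])
print(safe_standardize(X).std(axis=0, ddof=1))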
It will now be possible to make a PCA model using only normal background
samples for example (i.e. all the samples to the left) and use classification to see if
new samples are polluted or not; see SIMCA below. In this example we have used
PCA as an initial “data-scope” with which to see the first exploratory, overall
patterns in the X-matrix we were given to start out with. We may then opt to carry
on in the manner indicated.
Purpose
In this exercise you will run a PCA on the same Troodos data, this time without
scaling. Compare this model with the earlier one and investigate the effects of the
scaling. If you did not save the pertinent model results earlier, observe how quickly
you can regenerate them now that you are already a somewhat accomplished data analyst.
Data Set
File: TROODOS (143 rock samples and 10 geochemical variables).
Tasks
Make a PCA model without scaling. Study the scores, loadings, and explained
variance plots, and explain why the results are so different from the earlier
auto-scaled analysis.
How to Do it
1. Open the data and run PCA with weights = 1.0, leverage correction and
outlier limit = 3.0.
3. Study the loadings for PC1 versus PC2. Are there any insignificant
variables? Which? Why do fewer variables explain more of the variation in this
case? If necessary compute the statistics for the variables again.
Do you find the same outliers and groups in this model?
Summary
Two PCs explain about 90% of the variance in this model, while the scaled model
needed four PCs to explain the same variance level. The loadings show that
variables no. 5, 9, and 10 now contribute practically nothing to the model. Their
variance in absolute numbers is very small compared to the others - so small that
they barely register in the total variation. The other variables therefore dominate
the model completely, and the model explains 90% of the variance in these variables.
One should not ponder these findings overly. There is no objective basis on which
to compare different data sets (e.g. data sets with a different number of variables);
in effect, we have only a seven-variable data set when not auto-scaling. Every data
set analyzed with PCA has its own relative initial total variance, which is set to
100%. The relative internal %-levels in any PCA analysis cannot be compared
between two externally different PCA analyses.
We do find the same outliers and groups with both models, though in a different
order, so the two alternative scalings are not totally incommensurable.
We have seen that when the raw variables of a specific data set have (very)
different variances, it matters very much that these numerical variance differences
are not allowed to dominate. We therefore choose auto-scaled PCA in this, and in
any similar, situation.
6. Multivariate Calibration
(PCR/PLS)
The central issue in this book (after the necessary introductions to projections and
PCA) is multivariate calibration. This involves relating two sets of data, X and Y,
by regression. In a systematic context multivariate calibration can be called
multivariate modeling (X, Y). We first address multivariate calibration in general
terms before we introduce the most important methods in more detail. So far we
have not worked with a Y-matrix at all, but now Y will become a very important
issue.
Figure 6.1 - Establishing the multivariate model by calibration: X + Y ⇒ Model
The multivariate model for (X,Y) is simply a regression relationship between the
empirical X- and Y-data. We establish the model through multivariate calibration;
thus the first stage of multivariate modeling (X,Y) is the calibration stage.
Figure 6.2 - Using the multivariate regression model to predict new Y-values: X + Model ⇒ Ŷ
The regression model is then used on a new set of X-measurements for the
specific purpose of predicting new Y-values. This makes it possible to use only
X-measurements for future Y-determinations, instead of making more
Y-measurements.
Spectroscopic methods, for example, can often be implemented as fast methods that
can measure many simultaneous chemical and physical parameters indirectly. In the
same vein, there are also cases where it would be advantageous to substitute
several, perhaps slow and cumbersome, physical or wet chemical measurement
methods with one spectroscopic method. These spectra would then constitute the X
matrix and the sought for parameters, for example chemical composition, flavor,
viscosity, quality, etc. would constitute the Y matrix.
For instance, consider the case where you wish to use spectroscopy to measure the
amount of fat in ground meat, instead of the more time consuming laboratory wet
chemical fat determination methods. If the future samples all will have a fat content
between 1 - 10% only, then obviously we cannot use the spectra of meat with a fat
content of 60 - 75% for the calibration set etc. This simplistic example may sound
trivial, but the issue is not.
The demand that the training set (and the test set, see further below) be
representative covers all aspects of all conceivable variations in the conditions
influencing a multivariate calibration. In the case of the ground meat if for example
one was to say: “We only want to determine fat in these meat samples. Therefore
let us make our own training set mixtures of fatty compounds in the laboratory to
keep things simple. We would know precisely how much fat we have from how we
made the training samples, concentrating on the fat component and adding the other
meat components in correct proportions etc. Then we would not have to collect
complex, real-world meat samples and do those tiresome fat measurements”.
Never entertain such thoughts! This idea is based upon univariate calibration
theory, which, in reality, would seriously limit your creativity.
In this case, your laboratory spectra would be so different from the spectra of the
“real” meat that your laboratory-quality “fat model” would not apply at all.
Naturally so, because the artificial laboratory training samples, no matter how
precisely they would appear to have been created, would not - at all - correspond to
the complexities of the real-world, processed meat samples. This may well be
mostly because of significant interferences between the various meat components,
as occurring in the natural state – in spite of their quantitative correct proportions.
Experimental design ensures that the calibration set covers the necessary ranges of
the phenomena involved. However, there are always some related constraints on
the training set, the most common being that the number of available samples is
more or less severely restricted. At other times we simply have to accept the
training data set as presented in a specific situation.
Irrespective of one’s own situation, one must always be aware of the range, the
span, of the calibration set, since this defines the application region of the model in
future prediction situations. Only very rarely will one be so lucky that the
application range can be extensively extrapolated beyond the range of the
calibration set. Data constraints will also be further discussed in section 9.1.
At first sight, this issue may appear somewhat difficult, but it will be discussed in
more detail at the appropriate points. Validation will be discussed in full detail in
chapter 7, and also in sections 9.8 and 18.5. The brief overview below is intended
only to introduce those important issues of validation which must be borne in mind
when specifying a multivariate calibration. From a properly conducted validation
one gets some very important quantitative results, especially the “correct” number
of components to use in the calibration model, as well as proper, statistically
estimated, assessments of the future prediction error levels.
An ideal test set situation is to have a sufficiently large number of training set
measurements for both X and Y, appropriately sampled from the target population.
This data set is then used for the calibration of the model. Now an independent,
second sampling of the target population is carried out, in order to produce a test
set to be used exclusively for testing/validating of the model – i.e. by comparing
Ypred with Yref.
There is, however, a price to pay. Test set validation entails taking twice as many
samples as would be necessary with the training set alone. However desirable, there
are admittedly situations in which this is simply not possible, for example when
measuring the Y-values is (too) expensive or unacceptably dangerous, when the
test set sampling is otherwise limited e.g. for ethical reasons, or when preparing
samples is extremely difficult. For this situation there is a viable alternative
approach, called cross validation, see chapter 7. Cross validation can, in the most
favorable of situations, be almost as good as test set validation - but only almost; it
can never substitute for a proper test set validation! And the most favorable of
situations do not occur very often either...
In chapter 7, we shall later explain in detail how these other validation methods
work and how they are related to test set validation.
Modeling Error
How well does the model fit to the X-data and to the Y-data? How small are the
modeling residuals? One may perhaps feel that a good modeling fit implies a good
prediction ability, but this is generally not so, in fact only very rarely, as we shall
discuss later in more detail.
Initial Modeling
Detection of outliers, groupings, clusters, trends etc. is just as important in
multivariate calibration as in PCA, and these tasks should in general always be first
on the agenda. In this context one may use any validation method in the initial
screening data analytical process, because the actual number of dimensions of a
multivariate regression model is of no real interest until the data set has passed this
stage, i.e. until it is cleaned up for outliers and is internally consistent etc. In
general, removal of outlying objects or variables often influences the model
complexity significantly, i.e. the number of components will often change as a
result anyway. However, the final model must be properly validated, preferably by
a test set (alternatively with cross validation), but never with just leverage
correction.
Test set validation, cross validation, and leverage correction are all designed to
assess the prediction ability, i.e. the accuracy and precision associated with Ypred.
To do this, Ypred must be compared to the reference values Yref. The smaller the
difference between predicted and real Y-values, the better. The more PCs we use,
the smaller this difference will be, but only up to a point, which is the optimal
number of components. Let us see how this is done.
In Figure 6.3 the x-axis shows the number of components included in the prediction
model. The y-axis denotes a measure for the prediction variance, which usually
comes in two forms: 1) the residual Y-variance (also called prediction variance,
Vy_Val) or 2) the RMSEP (Root Mean Square Error of Prediction); the latter is
simply the square root of the former.
Equation 6.1   $\mathrm{RMSEP} = \sqrt{\dfrac{\sum_{i=1}^{n} \left( y_i - y_{i,\mathrm{ref}} \right)^2}{n}} = \sqrt{V_{y,\mathrm{Val}}}$
Figure 6.3 - RMSEP (Methanol) as a function of the number of PCs (PC_0 to PC_4)
As is obvious from Equation 6.1, the overall prediction ability is best when the
prediction variance (prediction error) is at its lowest. This is where the prediction
error, the deviation between predicted values and real values, has been minimized
in the ordinary statistical sum-of-squared-deviations sense. The plot in Figure 6.3
shows a clear minimum at 3 PCs, which indicates that this number of components
is optimal, i.e. the number where the prediction variance (residual Y-variance) is
minimized. Inclusion of more components may improve the specific modeling fit,
but will clearly reduce the prediction ability, because the RMSEP goes up again
after this number. From the practical point of view of prediction optimization, this
minimum corresponds to the “optimal” complexity of the model, i.e. the “correct”
number of prediction model components. Note that the specific determination of
the optimum is intimately tied in with the validation. It is therefore very easy
indeed to obtain the correct dimensionality of any multivariate calibration model –
all one has to do is to carry out an appropriate validation. This is somewhat
different in relation to the case for PCA, in which only the residual X-variance plot
was at hand.
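As a sketch of how Equation 6.1 and Figure 6.3 translate into practice, the following Python fragment computes RMSEP on an independent test set for an increasing number of components and picks the minimum. It uses scikit-learn's PLSRegression purely as a stand-in regression method, and the arrays X_train, y_train, X_test and y_test are assumed to be available.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rmsep(y_pred, y_ref):
    return float(np.sqrt(np.mean((y_pred - y_ref) ** 2)))

def rmsep_curve(X_train, y_train, X_test, y_test, max_pc=6):
    errors = []
    for a in range(1, max_pc + 1):
        model = PLSRegression(n_components=a).fit(X_train, y_train)
        errors.append(rmsep(model.predict(X_test).ravel(), y_test.ravel()))
    return errors

# Optimal complexity = the component count with the lowest RMSEP, e.g.:
# errors = rmsep_curve(X_train, y_train, X_test, y_test)
# print("Optimal number of components:", int(np.argmin(errors)) + 1)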
Univariate regression is undoubtedly the most often used regression method. It has
been studied extensively in the statistical literature and it is part of any university
or academy curriculum in the sciences and within technology. We assume that the
reader is sufficiently familiar with this basic regression technique, but also refer to
the relevant statistical texts in the literature section if need be.
There is a serious problem with this approach however, as there are no modeling
and prediction diagnostics available. It is de facto impossible to detect situations in
which the calibration no longer applies, for example when abnormal or outlying
samples are measured.
Equation 6.2   $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + f$
We wish to find the vector of regression coefficients b so that f, the error term, is
as small as possible. To do this one uses the least squares criterion: find b so that
fᵀf is minimized. This leads to the following well-known statistical estimate of b.
Equation 6.4   $\hat{b} = (X^T X)^{-1} X^T y$
As is well known, estimating b involves matrix inversion of (XᵀX) - and this may
cause severe problems for MLR. If there are collinearities in X, i.e. if the
X-variables correlate with each other, the matrix inversion becomes increasingly
difficult and in severe cases may not be possible at all. The inversion (XᵀX)⁻¹ will
become increasingly unstable (it will in fact increasingly correspond to “dividing
by zero”). With intermediate to strong correlations in X, the probability of this ill-
behaved collinearity is overwhelming and MLR will in the end not work. To avoid
this numerical instability it is standard statistical practice to delete variables in X so
as to make X of full rank. At best this means throwing away information. To make
things worse, it is definitely not easy to choose which variables should go and
which should stay. In the worst case we may be unable to cope with the
collinearities at all and have to give up.
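The instability described above is easy to demonstrate numerically. The sketch below applies Equation 6.4 directly to synthetic data with two nearly identical X-variables; the huge condition number of XᵀX and the exploding, typically opposite-signed coefficients illustrate the “dividing by zero” behavior. The data are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)          # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])    # include an intercept column
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

XtX = X.T @ X
print("Condition number of X'X:", np.linalg.cond(XtX))   # huge => unstable inversion
b_hat = np.linalg.solve(XtX, X.T @ y)                     # b = (X'X)^-1 X'y
print("MLR coefficients:", b_hat)                         # typically wildly large, opposite-signed slopes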
Again, these “explanations” are only a first, non-statistical introduction to matters
that should of course also be studied in their proper mathematical and statistical
context, see chapter 18.
6.7 Collinearity
Collinearity means that the X-variables are intercorrelated to a non-negligible
degree, i.e. that the X-variables are linearly dependent to some degree; for example
X1 = f(X2, X3, ..., Xp).
If there is a high collinearity between X1 and X2 (see Figure 6.6), the variation
along the solid line in the X1/X2 plane is very much larger than across this line. It
will then be difficult to estimate precisely how Y varies in this latter direction of
small variation in X. If this minor direction in X is important for the accurate
prediction of Y, then collinearity represents a serious problem for the regression
modeling. The MLR-solution is graphically represented by a plane through the data
points in the X1/X2/y-space. In fact the MLR-model can directly be depicted as a
plane optimally least square fitted to all data points. This plane will easily be
subjected to a tilt at even the smallest change in X, e.g. due to an error in the
X-measurements, and thus become unstable, and thereby more or less unsuited for
Y-prediction purposes.
In such a case one usually tries to pick out a few variables that do not covary (or
which correlate the least), and use the information in a combination of these. This
is the idea behind the so-called stepwise regression methods, and this may
sometimes work well in some applications but certainly not in all. Also note that
we have to follow the demands of a particular calculation method; this is surely
something all true data analysts dislike! There are in general many problems in
relation to step-wise methods, for which we may refer to Høskuldsson (1996).
However, if the minor, “transverse” directions in X are more or less irrelevant for
the prediction of Y (which may be the case also in spectroscopy), this collinearity
is not a problem anymore, provided that a method other than MLR is chosen.
Bilinear projection methods, the chemometric approaches chosen in this book,
actually utilize the collinearity feature constructively, and choose a solution
coinciding with the variation along the solid line. This type of solution is thus
stable with respect to collinearity.
Equation 6.5 y = Tb + f
instead of y = Xb + f. This “MLR”, now called PCR for obvious reasons, would
thus be stable. But not only that - by using the advantages of PCA, we also get
additional benefits, in the form of scores and loadings and variances etc. which can
be interpreted with ease.
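The two-stage idea (PCA on X, then regression of y on the scores T as in Equation 6.5) can be sketched compactly with scikit-learn building blocks; the function names are ours and the arrays X and y are assumed given.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_fit(X, y, n_components):
    scaler = StandardScaler().fit(X)
    pca = PCA(n_components=n_components).fit(scaler.transform(X))
    T = pca.transform(scaler.transform(X))          # scores become the new regressors
    reg = LinearRegression().fit(T, y)              # y = T b + f
    return scaler, pca, reg

def pcr_predict(model, X_new):
    scaler, pca, reg = model
    return reg.predict(pca.transform(scaler.transform(X_new)))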
This validated number will in general differ from the optimal number of PCs found
from an isolated PCA of X without regard to y. This is because we now let the
prediction ability determine the number of components, not the PCA modeling fit
with respect to X alone. This is the first time we meet this very important
distinction between the alternative statistical criteria, modeling fit optimization vs.
prediction error minimization, but it will certainly not be the last we see of it in
chemometrics; see Høskuldsson (1996) for a comprehensive overview.
Because we have earlier built up the relevant competencies regarding PCA, MLR
and validation in a planned stepwise manner, it has now been possible to introduce
all the essentials of PCR in the three small sections 6.6 to 6.8. And now it is time
for an exercise!
Problem
The data set you will be using in this and the next two exercises is about jam
quality, or rather about how to assess jam quality. We want to quantify the (human)
sensory quality of raspberry jam, - especially to determine which parameters are
relevant to the perceived quality and to try to replace costly sensory or preference
measurements (laboratories full of trained, expensive taste assessors, etc.) with
(much) cheaper instrumental methods. This is a highly realistic multivariate
calibration context, in fact the data come directly from a real-world industry project
from the former Norwegian Food Science Research Institute (now known under the
acronym “MATFORSK”).
The data set consists of the instrumental variables (X), as well as two types of
subjective variables (Y), which need not lead to confusion if reflected upon
carefully. Basically one may carry out two alternative calibrations for these two
alternative Y-data sets, both performed on the basis of one and the same X data set.
Data
The analysis will be based on 12 samples of jam, selected to span normal quality
variations. The way the problem specification was originally formulated is given
below (so the data set is not entirely “served up” perfectly for your data analyzing
pleasure; you have to “get under the skin” of this particular problem personally).
This is in order that you fully understand the organization of this slightly complex
data analysis exercise.
Were these the only data at our disposition, we could easily set up the appropriate
multivariate calibration formalism: X ➜ Y,
in this case: X(instrumental) ➜ Y(sensory)
However - just to make matters really interesting – the context of the problem
would also allow for an exploratory multivariate calibration between the two sets of
alternative Y- profiling data, i.e. Y1 versus Y2. Clearly the most expensive
profiling of jam quality is the one involving a large number of consumers (114 in
this case). Were these to be replaced by the taste panelist data (Sensory), this could
result in serious cost reductions. The appropriate calibration would correspond to a
very special X ➜ Y setup, namely one between the two Y-data sets:
Y1(Sensory) ➜ Y2(Preference)
We shall use this data set extensively also later on, so we don’t exhaust all the
above calibration combinations yet.
How to do it
1. Study the data
All three variable sets (X, Y1, Y2) are stored in The Unscrambler file JAM.
Note that the agronomic production variables are not used as quantitative variables
in any of the matrices, but they are exclusively “known external information”, and
will thus be very valuable as object annotations when interpreting the results of the
data analysis. This information has been coded into the names of the samples –
comparable to Figure 3.15 in chapter 3, for example.
The prediction error (Y-residual variance) has a local minimum after 3 PCs.
According to what we know about the particular problem and the data only two
factors varied (growth location and harvesting time), so 2-3 PCs would not at all be
unreasonable from a data analytical point of view.
The calibration variances in X and Y show how well the data have been modeled in
X and Y respectively. You see that one PC describes 43% of the X-data, but
completely fails to model Y - the explained variance is only 1%. With two PCs
The validation variances in X and Y are based on the testing (this time using
leverage correction). The residual validation variance of Y is an expression of the
prediction error - what we can expect when predicting new data. The validation
variance is usually higher than the corresponding calibration variance, more of
which later. The error increases in Y with only one PC, but then decreases again.
4. Variance plot
Look at the residual variance plot for variable Y (preference). Add the calibration
variance by using the toolbar buttons. Approximately how much of the variance of
Y is explained by 2 and 3 PCs? Why is the calibration variance lower? Do you
think it is wise to use 4 or 5 PCs instead of 3?
5. Score plot
Using the menu Plot-Scores, take a look at the score plot using the two first PCs.
Also plot PC1 vs. PC3. You may also try a 3D Scatter score plot of PC1 vs. PC2 vs.
PC3. Try Edit - Options – Vertical Line and – Name. Can you see specific
patterns? What does PC1 model? And PC2?
6. Loading plot
Look at the loading plot for PC1 vs. PC2 and PC1 vs. PC3. Notice that the
Y-loading PREFEREN is also plotted.
Which variables describe the jam quality best? Which sensory variables correlate
most with Preference? Why does Preference have a small loading value in PC1?
Which variable is negatively correlated with both Raspberry Smell and Raspberry
Flavor? Is Sweetness an important variable?
The scores and loading plots are the same as if you had used PCA. In fact that is
what you have done, since first the PCA was calculated and then the MLR
regression was invoked.
Summary
You have learned to make a PCR-model, and to decide how many PCs to use by
one specific validation method. Later in this training package you will be directed
to re-do this exercise using more appropriate validation methods.
Your PCR model has its optimum solution at 3 PCs, not the 5 PCs that were
suggested by the leverage corrected procedure; already we are becoming adept at
taking the controls based on our understanding of multivariate data analysis.
Here we also for the first time looked at one singularly important way to determine
how good the model is, by using the Predicted vs. Measured plot and its
associated prediction assessment statistics. Some specific jam-related
interpretations include the following.
PCR calculates scores and loadings just as PCA does. Note the structure related to
harvesting time. There is a group for harvesting time 1 (H1) in the third quadrant,
starting to spread for harvesting time 2, and becoming widely spread at harvesting
time 3, where it hardly forms a group at all. This indicates that the quality of jam
made from different berries picked early in the season varies less than later in the
season. There is also some structure related to the harvesting sites, denoted with C1
to C4, in PC1.
The taste variables describe the jam samples best (43% of the variations along
PC1), but color and consistency are not much worse (28% along PC2). The color
variables correlate most with Preference, both in the 2nd and 3rd PC. Preference
was not modeled at all in PC1 (the explained variance was 1%). Therefore
Preference only has a very small loading value in PC1. Off Flavor is negatively
correlated with the smell and flavor of raspberries, both in PC1 and PC2 (you can
draw a straight line between them through the origin). Sweetness does not look
important if you just look at (PC1, PC2): it is close to the origin. But if you study
(PC1, PC3) you will notice that sweetness is the most important variable along
PC3.
The late harvested samples picked at site 3 and 4 have the most characteristic
raspberry taste but jams based on berries from site 1 and 2 are preferred by most
consumers, because of their intense color.
Note that all three PCs should be studied, since Preference needs 3 PCs to be
adequately modeled!
Observe how PCR is a distinct two-stage process: First a PCA is carried out on X,
then we use this derived T-matrix as input for the MLR stage, usually in a
truncated fashion; we only use the A “largest” components, as determined by an
appropriate validation. There are no objections to this if we use enough PCR
components, but we do not want to use too many components either, or the whole
idea of projection compression is lost.
In PCR there is one cardinal aspect which is still not optimized, no matter what:
there is no guarantee that the separate PC-decomposition of the X-matrix produces
exactly what we want, namely only the structure which is correlated to the
y-variable. There is no built-in certainty that the A first (“large”) principal
components contain only that information which is correlated to the particular
Y-variable of interest. There may very well be other variance components
(variations) present in these A components. Worse still, there may also remain
y-correlated variance proportions in the higher order PCs that never get into the
PC-regression stage, simply because the magnitudes of other X-structure parts
(which are irrelevant in an optimal (X,Y)-regression sense) dominate. What to do?
– PLS-R is the answer!
PLS has seen an unparalleled application success, both in chemometrics and other
fields. Amongst other features, the PLS approach gives superior interpretation
possibilities, which can best be explained and illustrated by examples. PLS claims
to do the same job as PCR, only with fewer bilinear components.
Let us follow the geometrical approach and picture PLS in the same way that we
introduced PCR. Frame 6.1 presents a simplified overview of PLS, or rather the
matrices and vectors involved. And already some help will be at hand from the
earlier PCA algorithm accomplishments, e.g. the specific meaning of the t- and
p-vectors depicted.
For the X-space the score and loading matrices are called T and P (with a new
W-loading in addition to the familiar P-loading, see further below), while these are
called U and Q respectively for the Y-space.
Note that we are treating the general case of several Y-variables (q) here. This is
not a coincidence. For the uninitiated reader it is easier to be presented the general
PLS-regression concepts in this fully developed scenario (PLS2) than the opposite
(PLS1); this strategy is almost exclusive to this present book. Most other textbooks
on the subject have chosen to start out with PLS1 and later to generalize to PLS2.
We have found that the general PLS-concepts are far more easily related to both
PCA as well as PCR beginning with PLS2. The case of one y-variable (PLS1) will
later be considered as but a simple boundary case of this more general situation.
However PLS does not really perform two independent PCA-analyses on the two
spaces. On the contrary, PLS actively connects the X- and Y-spaces by specifying
the u-score vector(s) to act as the starting points for (actually instead of) the t-score
vectors in the X-space decomposition. Thus the starting proxy-t1 is actually u1 in
the PLS-R method, thereby letting the Y-data structure directly guide the otherwise
much more “PCA-like” decomposition of X. Subsequently u1 is later substituted by
t1 at the relevant stage in the PLS-algorithm in which the Y-space is decomposed.
The crucial point is that it is the u1 (reflecting the Y-space structure) that first
influences the X-decomposition leading to calculation of the X-loadings, but these
are now termed ”w” (for “loading-weights”). Then the X-space t-vectors are
calculated, formally in a “standard” PCA fashion, but necessarily based on this
newly calculated w-vector. This t-vector is now immediately used as the starting
proxy-u1-vector, i.e. instead of u1, as described above only symmetrically with the
X- and the Y-space interchanged. By this means, the X-data structure also
influences the ”PCA (Y)-like” decomposition. This is sufficient for a first overview
comparison of the PLS-approach.
Frame 6.1 - The matrices and vectors of PLS2: X-space scores T, loading weights W and loadings P; Y-space scores U and loadings Q, with the bilinear models
$X = \sum_{a=1}^{A} t_a p_a^T + E$
$Y = \sum_{a=1}^{A} u_a q_a^T + F$
Thus, what might at first sight appear as two sets of independent PCA
decompositions is in fact based on these interchanged score vectors. In this way we
have achieved the goal of modeling the X- and Y-space interdependently. By
balancing both the X- and Y-information, PLS actively reduces the influence of
large X-variations which do not correlate with Y, and so removes the two-stage
weakness of PCR.
The PLS2 NIPALS algorithm will now be outlined (with reference to section 3.14
above).
0. Center and scale both the X- and Y-matrices appropriately (if necessary).
Index initialization, f: f = 1; Xf = X; Yf = Y
Steps 3 and 5 represent projection of the object vectors down onto the fth PLS-
component in the X- and Y- variable spaces respectively. By analogy one may view
steps 2 and 4 as the symmetric operations projecting the variable vectors, w and q,
onto the corresponding fth PLS-component in the corresponding object spaces. We
also note that these projections all correspond to the regression formalism for
calculating regression coefficients. Thus the PLS-NIPALS algorithm has also been
described as a set of four interdependent, “criss-cross” X-/Y-space regressions.
Note that in PLS the “loading weights vector”, w, is the appropriate representative
for the PLS-component directions in X-space. Without going into all pertinent
details, one particular central understanding is that of the w-vector as representing
the direction which simultaneously maximizes both the X-variance and the
Y-variances in the conventional least-squares sense. Another way of expressing this
is to note that - after convergence - w reflects the direction which maximizes the
(t,u)-covariance, or correlation (auto-scaled data) between the two spaces,
Høskuldsson (1996).
8. Calculation of the regression coefficient for the inner X-Y space regression.
This so-called “inner relation” of the PLS-model, graphically depicted as the “T
vs. U plot” (TvsU), constitutes the central score plot of PLS, occupying a
similar interpretation role as does the equivalent (t,t)-plot for PCA. There is of
course also a double set of these (t,t)- and (u,u)- score-plots available. It bears
noting that the central inner PLS relation is made up of nothing but a standard
univariate regression of u upon t. This PLS inner relation is literally to be
understood as the operative X-Y link in the PLS-model. It is characteristic that
this link is estimated one dimension at a time (partial modeling), hence the
original PLS acronym: Partial Least Squares regression, whereas a more
modern re-interpretation is often quoted as: Projection to Latent Structures.
This step is often also called deflation: Subtraction of component no. f for both
spaces. This is where the p-loadings come into play. By using the p-vectors
instead of the w-vectors for updating X, the desired orthogonality for the
t-vectors is secured.
10. The PLS model, $TP^T$ and $UQ^T$, is also calculated - and deflated - one
component dimension at a time. After convergence, the rank-one models $t_f p_f^T$
and $u_f q_f^T$ are subtracted appropriately, the latter expressed as
$Y_{f+1} = Y_f - b\,t_f q_f^T$ by inserting the inner relation, so as to allow for
appreciation of how Y is related to the X-scores, t.
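The following is a minimal NumPy sketch of the extraction of one PLS2 NIPALS component, following the flow outlined above: a Y-column as the starting proxy for t, loading weights w, the t/u exchange, the inner-relation coefficient b, and deflation with p. It is an illustrative reconstruction, not The Unscrambler's implementation, and it assumes X and Y have already been centered (and scaled).

import numpy as np

def pls2_nipals_component(X, Y, tol=1e-10, max_iter=500):
    u = Y[:, [0]]                                  # a Y-column starts as the proxy for t
    for _ in range(max_iter):
        w = X.T @ u / (u.T @ u)                    # X loading weights, guided by u
        w /= np.linalg.norm(w)
        t = X @ w                                  # X scores
        q = Y.T @ t / (t.T @ t)                    # Y loadings, guided by t
        q /= np.linalg.norm(q)
        u_new = Y @ q                              # Y scores (q is normalized)
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    p = X.T @ t / (t.T @ t)                        # X loadings, used for deflation
    b = (u.T @ t / (t.T @ t)).item()               # inner relation: u is regressed on t
    X_next = X - t @ p.T                           # deflate X with p (keeps the t-vectors orthogonal)
    Y_next = Y - b * t @ q.T                       # deflate Y via the inner relation
    return w, t, p, q, u, b, X_next, Y_next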
From a method point of view, there are two versions of PLS: PLS1, which models
only one Y-variable, and PLS2, which models several Y-variables simultaneously.
PLS2 gives one set of X- and Y-scores and one set of X- and Y-loadings, which are
valid for all of the Y-variables simultaneously. If instead you make one PLS1
model for each Y-variable, you will get one set of X- and Y-scores and one set of
X- and Y-loadings for each Y-variable. PCR also produces only one set of scores
and loadings for each Y-variable, even if there are several Y-variables. PCR can
only model one Y-variable at a time. Thus PCR and PLS1 are a natural pair to
match and to compare, while PLS2 would appear to be in a class of its own.
From a data analysis point of view the use of PLS2 was for many years thought of
as the epitome of the power of PLS-regression: complete freedom – modeling any
arbitrary number of Y-response variables simultaneously. Gradually however, as
chemometric experiences accumulated, everything pointed to the somewhat
surprising fact that marginally better prediction models were always to be obtained
by using a series of PLS1 models on the pertinent set of Y-variables. The reason for
this is easily enough understood - especially with 20/20 hindsight. Here we will
mostly let the exercises teach you this lesson - better didactics!
The P-loadings are very much like the well-known PCA-loadings; they express the
relationships between the raw data matrix X and its scores, T. (in PLS these may be
called PLS scores.) You may use and interpret these loadings in the same way as in
PCA or PCR, so long as you remember that the scores have been calculated by
PLS. In many PLS applications P and W are quite similar. This means that the
dominant structures in X “happen” to be directed more or less along the same
directions as those with maximum correlation to Y. In all these cases the difference
is not very interesting - the p and w vectors are pretty much identical. The duality
between P and W will only be important in the situation where the P and the W
directions differ significantly.
It is the difference between these alternative component directions that tells us how
much the Y-guidance has influenced the decomposition of X; one may think of the
PCA t-score direction as being tilted because of the PLS-constraint. One
illuminating way to display this relation is to plot both these alternative loadings in
the same 1-D loading plot.
In PLS there is also a set of Y-loadings, Q, which are the regression coefficients
from the Y-variables onto the scores, U. Q and W may be used to interpret
relationships between the X- and Y-variables, and to interpret the patterns in the
score plots related to these loadings. The specific use of these double sets of scores
(T,U) and loadings (W,Q) shall be amply illustrated by the many practical PLS-
analytical examples and exercises to be presented below.
The fact that both P and W are important, however, is clear from the construction
of the formal regression equation Y = XB from any specific PLS solution with A
components. This B matrix is calculated from:
$B = W (P^T W)^{-1} Q^T$
This B-matrix is often used for practical (numerical) prediction purposes, see
section 9.13.
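A small sketch of how this B-matrix is assembled and used for prediction, assuming W and P are the (p x A) matrices of loading weights and loadings, Q is the (m x A) matrix of Y-loadings regressed on the X-scores, and the new X-data have been pretreated (centered and scaled) like the calibration data; all names are hypothetical.

import numpy as np

def pls_regression_matrix(W, P, Q):
    # B = W (P'W)^-1 Q'  ->  a (p x m) matrix of regression coefficients
    return W @ np.linalg.inv(P.T @ W) @ Q.T

def pls_predict(X_new_pretreated, B, y_mean):
    # prediction on centered/scaled X, adding the Y-center back at the end
    return X_new_pretreated @ B + y_mean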
With only one y-variable, the iterative calculations involving u all simply collapse,
and the convergence loop is made redundant. The result is a much simpler,
non-iterative calculation procedure. As usual, centering and scaling first:
Center and scale both the X- and y-matrices appropriately (if necessary).
Index initialization, f: f = 1; Xf = X; yf = y
The y-vector is its own proxy “u-vector” (there is only one Y-column).
The PLS1 algorithm – and procedure – is as simple as this. There are no other bells
or whistles. Because of its computational simplicity PLS1 is very easy to perform,
but there are other, more important - indeed salient - reasons why PLS1 has become
the single most important multivariate regression method. We will demonstrate
these reasons by using examples.
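A minimal NumPy sketch of the non-iterative PLS1 procedure for A components is given below; with a single y the convergence loop disappears, exactly as stated above. X (n x p) and y (n,) are assumed centered (and scaled), and the sketch is illustrative only.

import numpy as np

def pls1(X, y, A):
    X, y = X.copy().astype(float), y.copy().astype(float)
    W, P, T, q = [], [], [], []
    for _ in range(A):
        w = X.T @ y
        w /= np.linalg.norm(w)                     # loading weights, taken directly from y
        t = X @ w                                  # X scores
        tt = t @ t
        p = X.T @ t / tt                           # X loadings
        q_a = y @ t / tt                           # scalar Y-loading / inner-relation coefficient
        X -= np.outer(t, p)                        # deflate X
        y -= q_a * t                               # deflate y
        W.append(w)
        P.append(p)
        T.append(t)
        q.append(q_a)
    return np.array(W).T, np.array(P).T, np.array(T).T, np.array(q)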
Data set
The data set is the same as in the previous exercise. Again we start the analysis
with the variable set Sensory (Y1) used as X, and Preference (Y2) as the Y proper,
in the file JAM.
Tasks
Make an identical PLS model, find the optimal number of PCs, and investigate how
good the model is. Also interpret the relevant scores and loadings. One other reason
to use leverage corrected validation here is that this allows direct comparison with
the earlier PCR model. – As a later exercise, you will benefit greatly from
duplicating this very same PCR/PLS1 comparison, using a common cross-
validation.
How to do it
1. Go to the Task-Regression menu and change calibration method to PLS1. The
other parameters are unchanged. Give the model another name, for example
Jam2. Calibrate for 6 PCs.
2. Study the Variance plot and find the optimal number of PCs using the validated
Y-variance. Why do we now need only 1 PC to explain about 90% of the
Preference? What does it mean that 27% of the variation in X explains 91% of
the variations in Y with a 1-component PLS model?
Plot Predicted vs. Measured using your choice of PCs. Also see how the results
change with 1, 2 and 3 PCs. Are the results significantly better with more PCs? Or
are they actually worse?
3. Study the 2D Scatter Loading plot. Which variables are positively or negatively
correlated with Preference? How much of the total explained X-variance do
they contribute to? Hint: Study the figures below the plot.
Summary
In this exercise you have made a PLS1-model, and used the Predicted vs. Measured
plot to get an idea of how good the model is. Two PCs are probably optimal.
Adding more PCs always implies a risk of overfitting, so we would like to play it
safe. Also, as this was a leverage corrected validation, great caution against possible
overfitting is needed. Only 27% of the variance in the sensory variables is needed to
predict 91% of the variation in Preference.
The loading plot shows that color, sweetness and thickness correlate most with
Preference. The more intense the color and the sweeter the jam, the more the jam is
liked, while the thicker jams are less liked. All variables with small loading values
in PC1 are unimportant for determining the preference.
Since PLS focuses on Y, this method will immediately look for the Y-relevant
structure in X. Therefore it will give a lower residual Y-variance with fewer PCs.
Problem
Now we compare the instrumental (X) and sensory data (Y1) to find out if the
instrumental and chemical variables give a good enough description of the jam
quality. Would it be possible to predict variations in the quality by using only these
instrumental variables? In that case we might replace costly taste panels with
cheaper instrumental measurements.
Data set
The variable sets Instrumental (X) and Sensory (Y) reside in the file JAM.
Tasks
Make a PLS2 model for prediction of sensory data (Y) from instrumental data (X).
Carry out a complete interpretation.
How to do it
1. Read the data table from the JAM file. Use Instrumental as X-variables and
Sensory as Y-variables. Make a PLS2-model by changing Calibration method to
PLS2. All other parameters should be the same as in the previous exercise.
With PLS2 you also need to think about scaling (weighting) Y. In this exercise it is
natural to standardize both X and Y, so 1/SDev is a suitable weights option.
Name the model Jam3. Calibrate with all variables and e.g. 5 PCs.
The calibration overview, which is displayed when the model is complete, does not
show very promising results. The Total Y-variance is not well explained. However,
this is a measure for all the Y-variables together, so hopefully some of them may be
better described individually. This is a typical PLS2 situation; there is no need to
worry at this stage.
Study the loadings for these first two PCs. Which sensory variables seem to
correlate with which instrumental variables? Can raspberry flavor judgments be
replaced by instrumental measurements?
Study the scores for the first two PCs. Which property is modeled by PC1? How
much of the X-variance is explained by PC1? Hint: see below the score plot.
Study scores and loadings together by using two windows. Which variables are
related to harvesting time?
Summary
It seems that the spectrometric color measurements (L, a and b) are strongly
negatively correlated with color intensity and redness. Sweetness is, as expected,
rather strongly negatively correlated with measured Acidity, but the flavor shows
weak correlation to all of the instrumental variables and is not at all well described
by 2 PCs (small loadings).
The variance plot shows that color, redness and thickness are best modeled with 1-
2 PCs. To model the others we need at least 5 PCs, which implies a large risk of
overfitting.
PLS component 1 models harvesting time, which is mainly related to color and
thickness, just as we found in the previous models.
By studying the loading plot in a previous exercise we learned that jam quality
varied with respect to color, flavor and sweetness. The chemical and instrumental
variables mainly predict variations in color and sweetness only (which is also
indicated by the low explained Y-variance). This means that we cannot replace the
Y-variable flavor with the present set of X-variables. Using other instrumental X-
variables, e.g. gas chromatographic data, could possibly have increased the flavor
prediction ability.
PLS2 is a natural method to start with when there are many Y-variables. You
quickly get an overview of the basic patterns and see if there is significant
correlation between the Y-variables. PLS2 may actually in a few cases even give
better results if Y is collinear, because it utilises all the available information in Y.
This is a rare situation however. The drawback is that you may need different
numbers of PCs for the different Y-variables, which you must keep in mind during
interpretation and prediction. This is however also the case with PCR, when there
are several Y-variables, since each Y-variable may need more or fewer PCs. There
are also cases where PLS2 fails to model some of the Y-variables well. Then one
should try separate PLS1 models anyway. You will definitely need to interpret each
model separately.
To conclude: Comparing PCR and PLS is an interesting issue, since you will need
to reflect on why the results are different. - PLS will probably give results faster. As
for PLS2 vs. PLS1: PLS2 is always useful for screening with multiple Y-variables,
but one will very often need separate PLS1 models to get the most satisfactory
prediction models. Nearly all individual PLS1 models will be superior, since the X-
decomposition can be optimised with respect to just one Y-variable - as opposed to
just doing an average job for all Y-variables. These PLS1/PLS2 distinctions are
dominantly founded on a large base of chemometric experience.
How to do it
1. Use the models you made in exercise 6.8.1 (jam1) and 6.9.6 (jam2) to compare
PCR and PLS1.
Close all viewers and data editors. Use the Results menu to plot the model
overviews in two Viewers and compare results:
Results - Regression, mark Jam1, hold Ctrl down and mark Jam2 as well, View –
Window – Tile Horizontally.
2. Variance
Compare the residual Y-variance between the two models.
A PLS1 model with 2 PCs is better than a PCR model with 3 PCs, and only slightly
worse than the PCR model with 5 PCs.
Figure 6.8 Residual variance in the PCR (Jam1) and the PLS (Jam2) models
Note that the PCR model actually displays an increase in the prediction error in the
first PC. In general this is a bad sign (and it is never acceptable in PLS). However,
in PCR you may often well accept this, because the first PCR components may very
well only be modeling X-structures, which are irrelevant to Y.
Important note: You cannot normally compare different models by using the
prediction variance (validation Y-variance) alone. This can only be done when the
models are based on the same data set and you have used the same Model center
and weighting in each model (and used the same validation procedure). Since this is the
case here, we can use this prediction variance as a measure of how good the
alternative models are.
More generally we use the measure RMSEP (Root Mean Square Error of
Prediction) to compare different models. RMSEP gives the errors in the same unit
of measure as used in the variables in the Y-matrix, and is therefore suitable for
general comparison. This will be discussed in detail later.
3. You can also try to plot RMSEP for these two models. You find it under Plot -
Variances and RMSEP (double-click on the miniature screen in the dialog box
so that your plot fills up the whole viewer).
4. Loadings
Study the Loadings (PC1 vs. PC2, Variables: X and Y) in a 2D Scatter plot. The
PLS loading plot is turned clockwise through almost 90º. The most obvious change
is that the Preference variable (the Y-matrix) is explained far better by the first
component with PLS1. This shows that the two methods use the same X-data in a
very different way. PCR (in effect PCA) extracts the systematic variation in the
X-data alone, independently of Y, while PLS performs an interdependent
decomposition of both the X- and Y-matrices. In the loading plot we clearly see how
the Y-data have influenced the decomposition of X.
How would you go about using the loading-weights for a similar comparison? We
do have the loading weights from the PLS1-solution all right – but what would be
the corresponding item from the PCR-solution?
5. Scores
Study scores for the two models. When you compare the score plot for PCR and
PLS1 you see some of the same general structures in both plots, but they are
actually clearer for PCR. Structures for harvesting time and for harvesting place are
present in both plots.
Summary
PLS1 needs fewer PCs to explain the data than PCR and the final model performs
better. PLS1 and PCR utilize the X-data in very different ways.
In general PLS uses fewer PCs to model Y than PCR, which gives a minimum
residual Y-variance earlier. PCR may end up with as low a prediction error as PLS,
but with more PCs.
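For readers who want to reproduce this kind of comparison outside The Unscrambler, the following is a minimal sketch in Python (numpy and scikit-learn, which are not part of this book's software). It assumes a numeric X-matrix and y-vector are already loaded, e.g. from your own export of the JAM data, and computes cross-validated RMSEP curves for PCR and PLS1 so the two methods can be compared at the same number of components.

```python
# Sketch: cross-validated RMSEP for PCR and PLS1 as a function of the number
# of components. X (objects x variables) and y (objects,) are assumed to be
# numpy arrays loaded beforehand, e.g. from your own export of the JAM data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def rmsep_curves(X, y, max_pcs=6, n_segments=5):
    cv = KFold(n_splits=n_segments, shuffle=True, random_state=0)
    results = []
    for a in range(1, max_pcs + 1):
        pcr = make_pipeline(StandardScaler(), PCA(n_components=a), LinearRegression())
        pls = make_pipeline(StandardScaler(), PLSRegression(n_components=a))
        y_pcr = cross_val_predict(pcr, X, y, cv=cv)
        y_pls = cross_val_predict(pls, X, y, cv=cv).ravel()
        results.append((a,
                        np.sqrt(np.mean((y_pcr - y) ** 2)),   # RMSEP, PCR
                        np.sqrt(np.mean((y_pls - y) ** 2))))  # RMSEP, PLS1
    return results

# for a, e_pcr, e_pls in rmsep_curves(X, y):
#     print(f"{a} PCs: RMSEP(PCR) = {e_pcr:.3f}, RMSEP(PLS1) = {e_pls:.3f}")
```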
6.11 Summary
MLR
Multiple linear regression is the most widely used multivariate regression method,
but it has profound weaknesses. There are no diagnostics to tell e.g. whether
interferents are present or not. MLR also has severe problems when the X-data are
collinear. The MLR prediction solution is inherently unstable due to numerical
properties in collinear data sets.
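The collinearity problem is easy to demonstrate numerically. The following hedged sketch (Python with numpy and scikit-learn; the data are synthetic and purely illustrative) fits MLR to two almost identical X-variables and shows how a tiny perturbation of y changes the MLR coefficients dramatically, while a one-component PLS model barely moves.

```python
# Sketch: MLR coefficient instability under collinearity (synthetic data).
# x2 is almost identical to x1; a tiny perturbation of y changes the MLR
# coefficients drastically, whereas a 1-component PLS model stays stable.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)          # nearly perfectly collinear
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)
y_pert = y + 0.01 * rng.normal(size=n)       # tiny perturbation of y

def mlr_coefficients(X, y):
    # ordinary least squares with an intercept column
    Xc = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xc, y, rcond=None)[0][1:]

print("MLR coefficients           :", mlr_coefficients(X, y))
print("MLR coefficients, perturbed:", mlr_coefficients(X, y_pert))

pls = PLSRegression(n_components=1).fit(X, y)
pls_pert = PLSRegression(n_components=1).fit(X, y_pert)
print("PLS coefficients           :", pls.coef_.ravel())
print("PLS coefficients, perturbed:", pls_pert.coef_.ravel())
```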
PCR/PLS
PCR and PLS are shown to be strong alternative multivariate techniques, both with
many advantages. Interferents and erroneous measurements (outliers) are easily
detected using the diagnostics inherent in these methods. The different plots
available make it possible to interpret many generic relationships in the data set
both between variables and objects. The approach is highly visual, making the
methods available to a wider range of users than just skilled statisticians. Many
chemometricians have been proponents of the PLS-method over many years, and
many statisticians prefer PCR. Luckily - by proper validation of the relative
prediction model performances, this will always be subject to an objective
assessment. There are however still many skirmishes when these two schools meet
and exchange pleasantries....
Because these methods use projection, the collinearity problem is turned around to
a powerful advantage. Collinear data sets may in fact be modeled completely
without difficulty. It is then possible to use e.g. full spectra instead of just a few
selected wavelengths.
The rest about PLS - at least at this introductory level - concerns the many practical
applications where PLS has been found useful. We shall present many examples
and illustrations below, but first the critical, indeed essential, issue of validation must
be presented in its full context. A proper validation understanding is an absolute
must in order to be able to get the maximum out of these immensely powerful
methods, PCR, PLS1 and PLS2.
First of all validation is absolutely essential in order to make sure that the model
will work in the future for new, similar data sets, and indeed do this in a
quantitative way. This can be viewed as prediction error estimation.
Secondly, validation is often also used in order to find the optimal dimensionality
of a multivariate model (X,Y), i.e. to avoid either overfitting or underfitting. One
should not get confused by the fact that usually the dimensionality validation has to
be carried out before any prediction validation is put on the agenda.
Both test set and cross validation can be applied to any regression model made by
either MLR, PCR, PLS (PCA models can also be validated but this regards the
modeling performance; see section 7.1.3 on page 159). Test set and cross validation
are equally applicable to augmented regression models like non-linear regression
and neural networks, for example, and are perhaps even more important for
methods which involve estimates of many parameters as these imply even greater
risks of overfitting.
In test set validation the available objects are divided into two sets. One set is used exclusively to calibrate the model; this set is called the calibration set. The other, the validation set, is expressly used only for the validation.
Figure 7.1 - Data sets present for modeling (cal) and validation (val): the calibration set (Xcal, Ycal) and the validation set (Xval, Yval)
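As a small illustration of Figure 7.1, the following Python sketch (numpy only; the 30% fraction and the random selection are arbitrary illustrations, not recommendations) divides the available objects into a calibration part and a validation part. In practice the selection should of course be made so that both sets are representative, not merely random.

```python
# Sketch: dividing the available objects into a calibration set (Xcal, Ycal)
# and a validation set (Xval, Yval) as in Figure 7.1.
import numpy as np

def split_cal_val(X, Y, val_fraction=0.3, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(round(val_fraction * len(X)))
    val_idx, cal_idx = idx[:n_val], idx[n_val:]
    return X[cal_idx], Y[cal_idx], X[val_idx], Y[val_idx]

# Xcal, Ycal, Xval, Yval = split_cal_val(X, Y)
```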
It is not fatal if this initial tentative dimensionality does not already give us the
final answer; the validation principles you are about to learn will always allow you
to pin down the correct dimensionality, which is the critical basis needed for the
final prediction validation testing. What matters is that you take responsibility for
getting your thinking about it right, not relying on validation as one ready-to-use-
for-all-situations standard procedure. Most unfortunately, validation is a somewhat
confused issue both within chemometrics and outside it.
Therefore, we will spend some time on this crucial issue so that you have a full
understanding of the purposes of validation. These important principles can then be
applied in all situations.
All validations produce a measure of the prediction error, i.e. the error we can
expect when using the model to predict new objects. There are also other pertinent
measures of the prediction performance of the model, all of which we shall introduce below.
A calibration model is first made from the calibration data Xcal and Ycal.
Then we feed the Xcal values right back into the model to “predict” ycal .
Equation 7.1: $X_{cal} + \text{Model} \rightarrow \hat{y}_{cal}$
Comparing the predicted and measured Ycal values gives us an expression of the
modeling error, due to the fact that we have only used A components in the model:
$\hat{y}_{i,cal} - y_{i,cal}$. This is calculated for each object. Summing the squared differences and taking
their mean over all n objects gives the calibration residual Y-variance, $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i,cal} - y_{i,cal})^2$.
The square root of this (divided by the appropriate weights used for scaling at
calibration if necessary) gives us RMSEC, (Root Mean Square Error of
Calibration), the modeling error, expressed in original measuring units.
Equation 7.4: $\mathrm{RMSEC} = \sqrt{\dfrac{\sum_{i=1}^{n} (\hat{y}_{i,cal} - y_{i,cal})^2}{n}}$
Clearly RMSEC = 0 only if all potential components are used. For A < min(n,p)
RMSEC is a good measure of the error when only A components are used in the
model. We would like RMSEC to be as small as possible but there is a competing
consideration to be fully exposed immediately.
Next we compare the predicted and measured Yval values to get an expression of
the prediction error: $\hat{y}_{i,val} - y_{i,val}$.
This is calculated for each validation object. Again, by summing the squared
differences and taking their mean over all objects in the test set we get the
validation residual Y-variance:
Equation 7.7: $\text{Residual variance}_{val} = \dfrac{\sum_{i=1}^{n} (\hat{y}_{i,val} - y_{i,val})^2}{n}$
The square root of this expression (divided by the weights used for scaling in the
calibration) gives us RMSEP, (Root Mean Square Error of Prediction), the
prediction error, again in original units.
Equation 7.8: $\mathrm{RMSEP} = \sqrt{\dfrac{\sum_{i=1}^{n} (\hat{y}_{i,val} - y_{i,val})^2}{n}}$
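A minimal numpy sketch of Equations 7.4 and 7.8, assuming the measured and predicted y-values for the calibration and validation objects are already available (the variable names are our own):

```python
# Sketch: RMSEC (Equation 7.4) and RMSEP (Equation 7.8) computed directly from
# measured and predicted y-values, expressed in the original measurement units.
import numpy as np

def rmse(y_measured, y_predicted):
    y_measured = np.asarray(y_measured, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return np.sqrt(np.mean((y_predicted - y_measured) ** 2))

# RMSEC = rmse(y_cal, yhat_cal)   # calibration (modeling) error
# RMSEP = rmse(y_val, yhat_val)   # validation (prediction) error
```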
Be aware, though, that there may be exceptions to this clear “V-rule” (the prediction
error first decreasing and then increasing with the number of components) for data sets
with non-trivial internal data structures (also influenced by non-deleted outliers, etc.).
Figure 7.3 - Empirical prediction error – the sum of two parts (modeling &
estimation errors). This is the powerful plot which will always allow you to
determine the optimal number of components in a multivariate model (X,Y)
(Plot annotations: error of prediction vs. number of components, with an underfitting region and an overfitting region; the two error contributions are the modeling error and the estimation error.)
The prediction validation Y-variance will usually decrease until a certain point,
after which it generally starts increasing again. In prediction testing this minimum
corresponds to the optimum number of components, and one should never go
beyond this point!
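A minimal sketch of this rule, assuming the validation residual Y-variance (or RMSEP) has already been computed for 0, 1, 2, ... components; the minimum should of course always be checked against the shape of the whole curve, as discussed below.

```python
# Sketch: pick the number of components at the minimum of the validation
# residual Y-variance (or RMSEP). Index 0 corresponds to the 0-component
# model (centring only). Always inspect the whole curve as well.
import numpy as np

def optimal_components(validation_variance):
    return int(np.argmin(np.asarray(validation_variance, dtype=float)))

# A typical curve: decreases, flattens out, then rises again.
print(optimal_components([1.00, 0.40, 0.18, 0.15, 0.17, 0.22]))   # prints 3
```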
Therefore we may also study the validation X-variance in PCA, to check that the
model is not overfitted. The conventional calibration (modeling) X-variance is the
most often used of these two. Until now you have only used leverage correction in
the PCA exercises, but with more knowledge about validation you may also use a
test set or cross validation before you are fully satisfied with your PCA model.
There are several possible problems regarding proper test set validation and its
requirements. Perhaps the most important issue is the somewhat surprising notion
that the prediction testing itself is not a completely objective procedure - that it
actually matters, to some degree at least, which validation method one chooses.
However, the two sets must not be too similar. If the two data sets are identical,
then the only difference between them would be the sampling variance, i.e. the
variance due to two independent samplings from the target population. In real life it
may be more or less difficult to obtain two almost identical drawings, both
representative with respect to the all-important future drawings, which represent
the eventual use of the prediction model. However, since this is such an important
issue, one must consider all these aspects of validation even as early as when you
plan the initial data collection. A problem may be how to pick out a representative
test set from all the available samples in the target population, but this is at least a
practical problem which can be confronted directly.
The calibration set must always be large enough to calibrate a model satisfactorily.
The test set must also be large enough to provide a satisfactory basis for the test.
Both these requirements call for a balance between the size of the sampling, the
number of samples and the representativity of both. Test set validation may
therefore at times require relatively more objects than a straightforward calibration.
This situation is often called the test set switch. This may indeed seem to be a
promising method. The test set switch situation is equivalent to the ideal of having
“enough samples for a good calibration” x 2, which translates into having plenty of
calibration samples. Test set switch is in fact almost identical to the proper test set
situation – with one crucial difference: there have not been two independent
drawings from the target population, only one! This has some specific implications,
which shall be more fully explored below.
But what do we do when the available number of samples is clearly below this
level - i.e. when we simply do have “too few samples” at our disposal for a proper
test set validation? Cross-validation now must come into play.
In full (leave-one-out, LOO) cross-validation each sample is kept out in turn and predicted by a sub-model made from the remaining samples; the squared prediction residuals are then summed and averaged, giving the usual validation Y-variance, apparently in the exact same sense as for test set prediction testing.
The full training set model is based on all 20 samples however. This means that the
LOO cross-validation error estimate is not based exactly on the full model, but on
20 almost identical sub-models, each with only 19 samples. For the series of these
20 sub-models, each pair of sub-models will have 18 objects in common. Does
cross-validation seem a bit like cheating? We have actually never really performed
the validation procedure on a truly independent test set. This does not make sense!
Unreflected use of this cross-validation scheme may well make it appear as if we have
created a true test set out of nothing – or at least out of the very same training data
set with which we also completed the calibration. What is wrong here?
Cross validation is the best, and indeed the only alternative we have when there are
not enough samples for a separate test set. There are two, apparently different,
types of cross-validation which we are about to present in more detail; full cross-
validation and segmented cross-validation. In reality these two types are very
closely related however, and once you’ve mastered the one, the other follows
directly.
Actually there are many myths about full cross validation. Many text books and
experts recommend full cross validation as a general approach to prediction testing,
claiming that this should give the most comprehensive testing of the model. Many
statistical and data analytical programs include full cross-validation as the default
procedure. Certainly not everyone agrees with this however, because of the fatal
weaknesses discussed above.
Now it is in fact easily seen that in every segmented cross-validation situation there
is a definite range for the number of segments that can be chosen, corresponding to
(2,3,4,…,n-2,n-1,n). With this realization, the systematics of cross-validation is in
fact easily mastered: Each cross-validation necessitates a choice of the number of
segments to be made by you, the data analyst, not by the software program, and
most emphatically not by any algorithmic approach. This choice is always in the
range (2,3,4,…,n-2,n-1,n).
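The following Python sketch (numpy and scikit-learn; the number of components and the random segmentation are illustrative assumptions) shows how the analyst's choice of the number of segments enters a segmented cross-validation, from 2 segments up to n segments (which is full cross-validation):

```python
# Sketch: segmented cross-validation with an analyst-chosen number of segments.
# n_segments = n (the number of objects) corresponds to full (LOO) cross-
# validation. X, y are numpy arrays; 2 PLS components are assumed here.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def segmented_cv_rmsep(X, y, n_components=2, n_segments=10, seed=0):
    cv = KFold(n_splits=n_segments, shuffle=True, random_state=seed)
    press = 0.0                                  # accumulated squared errors
    for cal_idx, val_idx in cv.split(X):
        model = PLSRegression(n_components=n_components).fit(X[cal_idx], y[cal_idx])
        y_hat = model.predict(X[val_idx]).ravel()
        press += np.sum((y_hat - y[val_idx]) ** 2)
    return np.sqrt(press / len(y))

# Try a few alternative segmentations - the choice is yours, not the software's:
# for s in (2, 5, 10, len(y)):
#     print(s, "segments:", segmented_cv_rmsep(X, y, n_segments=s))
```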
At the outset of trying to master multivariate calibration, this will almost certainly
appear as a very unfortunate situation for the novice data analyst. It would be so
much better (read: easier) were this “difficult choice” to be made by the “method”
itself. But this is not the correct way to tackle this issue. On the contrary, it will be
necessary for you, as a responsible data analyst, to make an informed choice of the
number of segments to be used in all cross-validation. We would perhaps now be
expected to give you another list of rules-of-thumb for this endeavor.
The issue of how to select the “correct” number of segments revolves around a
novel way to interpret the entire cross-validation situation. A full disclosure is
outside the scope of the present introduction, for which see Esbensen & Huang
(2000), but the critical essentials are easily enough presented:
In summary, all cross-validation is ever doing, in fact all it can ever do, is trying to
simulate the ideal case of test set validation. With this in mind you should have a
much more relaxed attitude towards assuming the personal responsibility, which we
have already talked about at great lengths above. This may seem like a difficult
obligation for a trainee data analyst, but there is no other way than to start out on
your own forays. In reality all that is needed is a few reflections on the effect of
disturbing the data structure, as it is manifested in the T vs. U plots, by selective
removal of (1, 2, 3, ..., n-2, n-1, n) objects simultaneously.
The kind of reflections involved revolve around picturing in one’s own mind the
effect(s) of removing e.g. one object from the data set: how would this affect the
pertinent y|x-regression line direction in the T vs. U plot? How about e.g. removing
segments with 2, 3, 4, ... objects? What about removing 50% of the training data
set? The issue here is to be able to imagine how the pertinent segmentation deletion
will affect the y|x-regression line direction, viewed as a response to a perturbation
to the data set structure, the specific perturbation being the removal of a certain
fraction of the overall number of objects. Some practice in this context is surely
needed, and for this reason we have arranged for plenty of multivariate calibration
exercises, complete with our offer of a “correct” number of segments, to be
included in this book.
But the central issue involved is, strictly speaking, not just the y|x-regression
direction perturbation responses, but rather this: how large a proportion of the
samples must be taken away in one segment – in order to simulate the effects of the
second target population sampling that was never carried out in this cross-
validation setting?
It is a more “realistic simulation of a test set situation” to divide the calibration set
into a few segments, e.g. with at least 10% of the samples in each. It would appear
very unlikely, for example, that one single left-out sample will result in a
significantly greater sampling variance than that resulting from the 10%
segmentation removal, unless the one left-out sample happens to be an outlier. We
must assume here that all such outliers have been screened away by you, the
experienced data analyst, of course! By this reasoning, full cross validation on any
reasonably balanced data set simply must lead to over-optimistic validation results.
This makes the minimum 10% segmentation approach seem more realistic.
So long as you plan what to do, understand the consequences and analyze the results
accordingly, the chance of making big mistakes is minimized. Omitting validation
would be much, much worse!
If all 10 samples are equally important to span the variations, even leaving out one
sample may cause serious problems. A good example is a data set constructed by a
fractional factorial design. Then, we may need to disregard validation, but simply
check that the model fit is adequate and not be too optimistic about using the model
for safe future predictions. If we make this model in order to understand the
relationships in our system, for screening, or to investigate the possibility of using
multivariate modeling for future indirect measurements, perhaps we can live with
insufficient validation of the screening model. There is however also a third option:
leverage correction.
The leverage-corrected model is exactly the same as the alternative test set
validated, or the cross-validated ones. They give exactly the same scores and
loadings, since all the calibration samples are used to make the model, but in
general the prediction error will be estimated to be lower and it may sometimes
even indicate fewer PCs.
An object far from the model center will have a high effect (leverage close to 1). A typical object, close to the model center, will have a small effect (leverage close to 0).
For each variable Yj, the raw residual for an object, fij is divided by the leverage
expression (1-hi ):
Equation 7.10: $f_{ij}^{corrected} = \dfrac{f_{ij}}{1 - h_i}$
The residual validation variance is calculated as usual, from the mean squared
error, but the individual error contributions are now leverage corrected according to
Equation 7.10. In this way, what would have been only small residuals from
influential objects now contributes to increasing the corrected prediction
error estimate, which is only fair, since these objects – because of their relatively
far-out positions – were very influential in describing the data structure model. This
increase in the weight of the error contributions from unduly influential data is what
constituted the original motivation for introducing the concept of leverage
correction. This is an extensively used feature in statistics and data analysis. The
data are “unduly influential” because in the least-square sense of bilinear modeling,
such data points will necessarily lie close to the model PC-directions with the
unwanted consequence, that they will always contribute unrealistically small
prediction errors – if not properly corrected!
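A minimal numpy sketch of the leverage correction in Equation 7.10, assuming the score matrix T of the calibration model and the raw Y-residuals F are available. The leverage is computed here with the common convention h_i = 1/n + sum over components of t_ia^2/(t_a't_a), which may differ in detail from any particular software implementation.

```python
# Sketch: leverage-corrected residuals (Equation 7.10). T is the score matrix
# (n objects x A components) of the calibration model, F the raw Y-residuals
# (n objects x number of Y-variables).
import numpy as np

def leverages(T):
    T = np.asarray(T, dtype=float)
    # h_i = 1/n + sum over components of t_ia^2 / (t_a' t_a)
    return 1.0 / len(T) + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)

def leverage_corrected_residuals(F, T):
    h = leverages(T)
    return np.asarray(F, dtype=float) / (1.0 - h)[:, None]

# Leverage-corrected validation Y-variance (all Y-variables pooled):
# np.mean(leverage_corrected_residuals(F, T) ** 2)
```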
In general this correction approach often works rather well, but there are also
several situations in which grave distortions may arise, especially when dealing
with special data structures in which there is strong collinearity. Leverage
corrected validation is never to be used for the really important final validations.
There is an analogous leverage statistic for each variable as well, which may be
used for detailed interpretations of the relative importance between variables.
8. The prediction performance is evaluated by looking e.g. at the RMSEP and the
other validation options introduced above.
where
a = current dimensionality (PC number)
Vytot = Total residual Y-variance at validation
Vxtot = Total residual X-variance at validation
Index PC0 = at PC number zero
Index PCa = at PC number a.
In other words, for each new PC the program adds 1% of the initial variance (the
variance at PC number zero, before the calculations start) to the variance at PC number a. This correction factor ensures
that you do not select still another PC unless you gain at least this much by doing
so. In this way you are encouraged to use fewer PCs. If increasing the number of
PCs does not bring much improvement, one must avoid using more PCs, as more
degrees of freedom will have been lost completely without any further residual
variance reduction.
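The following sketch illustrates this type of parsimony rule as just described; it is our own reading of the prose above, not The Unscrambler's actual code, and the exact formula used by the program may differ.

```python
# Sketch of such a parsimony rule (an interpretation of the text above, not
# necessarily The Unscrambler's exact formula): each extra component is
# penalised by 1% of the 0-component variance before the minimum is located.
import numpy as np

def suggested_components(validation_variance, penalty=0.01):
    v = np.asarray(validation_variance, dtype=float)   # v[0] = variance at 0 PCs
    penalised = v + penalty * v[0] * np.arange(len(v))
    return int(np.argmin(penalised))

print(suggested_components([1.00, 0.40, 0.18, 0.175, 0.174, 0.18]))   # prints 2
```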
However, you must also check the present residual variance relations carefully. If
the “optimum” solution comes after a local increase, one must be very cautious.
Outliers and/or large(r) data set irregularities are on the prowl! Always study the
particular shape of the variance curve. This is further discussed in section 11.7. The
bottom line is that the standard suggestion from The Unscrambler for the optimal number of components should never be accepted uncritically.
In several later chapters you will use PLS and PCR on many different application
examples, involving typical real-world data analytical problems, in which the
informed choice as to the validation will be increasingly up to you, and your
accumulating multivariate modeling expertise. It cannot be stressed enough that it
is the personal experience which counts here. It is all fine to learn the theory, and
the methodological finesses, algorithms etc. pertaining to multivariate calibration,
but this does not make a good data analyst per se! The only way to really learn the
trade in this realm is personally to apply the theory to representative, real-world
problems. We have worked hard to select precisely this type of exercises in this
book. In this context, chapters 10-13 are perhaps the most important for your own
practical multivariate modeling learning curve, but first things first:
You will study the familiar data set on green peas, but in a modified form from the
one encountered earlier.
Problem
Pea quality is mainly described by sweetness and texture, as experienced whilst in
the mouth. Using trained judges, panelists, is often standard practice in these
matters. However, peas are harvested and valued by their texture, often measured
by tendrometer readings. Tendrometer measurements are, however, relatively
inaccurate. Sugar contents indicate sweetness. Again the main objective behind
establishing a multivariate calibration model: Is it possible to replace the costly
sensory panel evaluations by inexpensive instrumental measurements for routine
quality control?
Data Set
The data are stored in the file PEAS. The variable set Chemical contains 6
chemical measurement variables (X). The Y-variables are stored in the variable set
Important Sensory which contains 60 samples with the six most important
variables, in fact the ones you found in a previous exercise. The data are averaged
over 2 replicates and 10 judges.
Tasks
1. Calibrate a PLS2 model.
2. Interpret the results.
How to Do it
1. Take an overview look at the data. Does pre-processing or weighting seem
necessary?
2. Go to the task menu and choose a PLS2 regression model with default
weights set to 1/SDev. Why is weighting (autoscaling) necessary? Use the
mean as the model center and set the outlier limit to 3. Choose leverage
correction as the validation method for the first run. Why do we use PLS2?
Now plot the variance for the individual Y-variables: use Plot – Variances and
RMSEP to plot the variance for Y-variable 1, 2, 3 and so on in the same plot
(Variables: Y, 1-6, un-tick Total, Samples: Validation. Double-click on the
miniature screen in the dialog box to make your plot fill up the whole viewer).
Which Y-variable has the highest prediction error? Are all the Y-variables well
modeled?
Plot the loadings in a 2 vector plot for PC1 and PC2. Which variables are the
most important? Which seem to correlate? Which chemical measurements
correlate most with sensory data? Should we pay more attention to PC2?
Which are worst? Is there any relationship between these results and the
variance results? Also check the results using only 1 PC.
7. Outliers
Relation outliers, i.e. outliers due to errors in the relationship between X and Y
may be difficult to find in the normal score plot (t1-t2), which of course is a
picture of the data structure in the X-space alone. Check for possible relation
outliers by plotting X-scores (T) versus Y-scores (U). The “X-Y Relation
outliers” plot does this (Components: Double, 1 and 2). This is an extremely
important plot!
8. RMSEP
Check the expected future prediction error in original units by displaying the
RMSEP (Y-variables: All, Samples: Validation, make the plot fill up the whole
viewer). Select Window- Identification to see what you plotted. Does the
RMSEP provide the same interpretation as the Predicted vs. Measured-plot?
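For step 7 above (the X-Y relation outlier plot), a minimal Python sketch using scikit-learn and matplotlib is given below. It assumes X and y are numeric arrays and simply plots the X-scores against the Y-scores for a chosen component; this is the same idea as the t-u plot in The Unscrambler, although the numerical details of any particular implementation may differ.

```python
# Sketch: an X-Y relation outlier plot, i.e. X-scores (t) against Y-scores (u)
# for one chosen component of a fitted PLS model.
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression

def plot_t_vs_u(X, y, component=1, n_components=2):
    pls = PLSRegression(n_components=n_components).fit(X, y)
    t = pls.x_scores_[:, component - 1]
    u = pls.y_scores_[:, component - 1]
    plt.scatter(t, u)
    plt.xlabel(f"t{component} (X-scores)")
    plt.ylabel(f"u{component} (Y-scores)")
    plt.title("X-Y relation outliers lie far from the t-u trend")
    plt.show()
```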
Summary
One should always standardize data when they are measured in different units, as is
the case with the chemical X-variables in this exercise. Sensory data are always
standardized in PLS and PCR because the scale may be used differently for the
different variables by the different judges.
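A minimal numpy sketch of the 1/SDev weighting (autoscaling) discussed here, assuming X and Y are numeric arrays; in The Unscrambler this is of course done by choosing the 1/SDev weights option rather than by hand.

```python
# Sketch: 1/SDev weighting (autoscaling) applied to both X and Y before a
# PLS2 calibration, when the variables are measured in different units/scales.
import numpy as np

def autoscale(M):
    M = np.asarray(M, dtype=float)
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

# Xw, Yw = autoscale(X), autoscale(Y)   # then calibrate the PLS2 model on Xw, Yw
```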
PLS2 is suitable when there are several Y-variables, at least to get a good first
overview. If the prediction error is high, one may alternatively try several separate
PLS1-models.
The calibration variances express the model fit in the X-space and the Y-space
respectively. Validation variances are calculated using the chosen validation
method and say more about how well the model will work for predicting new data
(Y).
One PC explains about 85% of the variance of each Y-variable, except for Off-
flavor. We can safely use two PCs since the explained variance increases to about
90% and we know that there are only two independent variation factors (time and
place).
The samples are well spread in the score plot and there are no obvious outliers or
clear groups. The first component describes the variation due to harvesting time.
All the variables vary a lot in PC1, but the flavor and sugar variables also have
some contribution in PC2. However, PC2 only accounts for 2% of the variation in
X and 2-6% in Y (over all the Y-variables). Sweetness correlates with %sucrose
and Off-flavor is negatively correlated with those. The texture variables correlate
with tendrometer and dry matter measurements. It seems that ripe peas are sweet
and fruity while early harvested peas are hard, mealy and have Off-flavor – no surprises here.
We studied prediction results with two PCs. Using only one PC gives slightly
poorer predictions. Off-flavor has the worst correspondence between predicted and
measured values. This is natural, since the explained Y-variance was lowest for this
variable.
RMSEP is the estimate of the mean prediction error of all “test samples”. In this
exercise we used leverage corrected validation instead of an independent test set, so
the estimate of the prediction error is probably too optimistic. The curve of RMSEP
against number of components has the same shape as the validation residual Y-
variance and is scaled back to original units. We see that RMSEP is larger for those
variables that have a worse Predicted/Measured relationship. RMSEP is between
0.3 and 0.6 for all Y-variables. This is appropriately “good” since the panelist
judgement measurements are on a scale of 1-9. The relative error is thus max. 0.6/1
= 60% on the low value levels and max. 0.6/9 = 7% on the high levels. Considering
the inaccurate nature of sensory assessments this is probably acceptable.
Recall that PLS1 does not use all the information available in the Y-matrix. What you will
also learn in this exercise is how to compare models and decide which is the better
one.
Tasks
1. Calibrate two PLS1 models (same Y-variable, different validation procedures)
2. Compare them with the PLS2 model just completed above.
How to Do it
1. Go to Task - Regression. Use Chemical as X and define variable
Sweetness as a new set. Use the same model parameters as in the last model,
but change the calibration method to PLS1. Calibrate for e.g. 6 PCs.
2. Interpret the model. Remember to check the RMSEP by plotting it. How
many components to use now? Which model is the better, PLS1 or PLS2?
Which criteria did you use to establish this finding?
Interim Summary
In this case the leverage corrected PLS2 model and the leverage corrected PLS1
model turned out to be equal with respect to the size of the estimated prediction error –
compared at the same number of model components, of course. You never know
in advance which one will be best. You must always try both PLS2/PLS1, if you do
not go directly to PLS1 for “external reasons” that is.
Purpose
Introducing cross-validation, now to be applied on the same data set, PEAS.
Comparing three alternative PLS-models, based on two alternative validations.
Tasks
Calibrate another version of the PLS1 model for sweetness, but this time using full
cross-validation (LOO).
How to Do it
1. Change the validation method in Task - Regression to Full Cross validation;
Calibrate again for e.g. 6 PCs. How does cross-validation work? While waiting
for the calibration to finish, reflect on the functioning of cross validation. Do
you expect the prediction error to increase or decrease? Why?
3. Now do the same comparison both for the loadings as well as for loading-
weights
4. Compare the prediction errors (RMSEP). Are the prediction errors equal?
What is the difference? Which error is lowest? Why? Which is highest? Why?
Summary
The leverage corrected PLS1 model could perhaps have been interpreted as slightly
“better” than the cross validated one, because its RMSEP was in fact a bit lower.
Scores and loadings were exactly equal for the two PLS1 models, of course,
because it is only the prediction error estimate that is a function of the particular
validation method used. We got the same model in both cases. But cross validation
gave a more conservative - and probably more realistic - prediction error estimate.
Note that leverage correction in this, as in many other data sets, may be too
optimistic.
Data Set
PEAS, the same as in the previous exercises.
Tasks
1. Now calibrate a PCR model
2. Compare the PCR and PLS models
How to Do it
1. Make the PCR model as you made the PLS models. Change the calibration
method to PCR. First use leverage correction as the validation method. Why?
2. Compare the PCR model against both the PLS2 model and the leverage
corrected PLS1 model from exercises above. For a final model, if
appropriately validated, RMSEP is the most important single criterion to check.
You should compare RMSEP for all the Y-variables pairwise. Remember that
the PCR model includes regression to each of the Y-variables.
When comparing PCR and PLS1, remember that sweetness is variable number
2. The Window - Identification command is handy to see what you plotted.
Are the models different? Which is best? Why did we use leverage correction in
the PCR model? Why did you not suggest the use of e.g. cross-validation in this
exercise?
Summary
You should now have gathered some initial experience in making and comparing
bilinear prediction models. In general PLS models result in a lower prediction error
than PCR models, using fewer PCs. However, PCR may be forced to just as low an
error, often by including more PCs. Remember to ask yourself when making
models and interpreting results – comparing with the standard theoretical
expectations: Why is this particular result like this? Are the results as expected?
We mostly used leverage correction above, simply because the other models we
compared with were also leverage corrected.
You are now to compare all the different models above using both test set
validation (if/where possible), and especially you are to compare the LOO cross-
validation, and appropriately set up segmented cross-validation alternatives. Good
luck!
9. Multivariate Data Analysis – in Practice: Miscellaneous Issues
These issues are presented in a comprehensive fashion, which will allow you to
gain more experience and understanding before we introduce a series of varied and
realistic real-world exercises in the next few chapters. Many of the miscellaneous
issues in this chapter will not find use simultaneously in any one data analysis, but
we have found no other way than to present them here in a somewhat kaleidoscopic
fashion. This is a chapter to read now (i.e. between chapters 8 and 10), but you will
get much more out of it when re-reading it after you have completed the entire
book, and especially after having completed all the exercises involved in the rest of
the book. The practical experience thus gained will allow for an enhanced
awareness of the kind of very important issues addressed here. There will also be a
further deepening of the validation issue.
Some general requirements for the calibration data set are given below.
• The model will never be better than the accuracy of the reference method (Ycal).
Collect some selected replicates (at least) to ensure that the measurement
inaccuracy in the reference method is covered, and can be estimated (see also
further below).
• All major interferents (physical, chemical and other) should vary as widely as
possible in the calibration set, otherwise the multivariate calibration model cannot
distinguish between them. Failure to achieve this may for instance cause
erroneous Y-predictions in the future with some objects being identified
(wrongly) as abnormal outliers by the too limited model coverage. All factors
involved should thus display representative covariation in the calibration set in
addition to the individual maximum variable variances. Otherwise the model will
surely fail to compensate for their interactions.
• For the most efficient calibration design, one should select training objects that
are as typical as possible. They should also span each of the individual
phenomena, which can be “controlled” as well as possible, and include enough
randomly selected objects to ensure a chance of also spanning the non-
controllable phenomena and their interactions. All this certainly would appear a
very demanding task for the novice and the expert alike - but luckily these are but
ideal objectives to be aimed for. When facing the practical world, other limitations will inevitably arise.
• One should always strive for perfection when designing the protocol for the
training data set (and test set) sampling. On the other hand, one should not be
overly discouraged when the real-world practicalities force one to make some
necessary compromises.
If the X-variables are more or less independent from each other (i.e. non-
correlated), you may, in well-behaved data sets, easily handle up to, say 10-50
times as many X-variables as there are objects. An “ideal data set” means a strong
X-Y data correlation with a comparatively low model dimension, A. If the X-
variables are strongly correlated then the number of X-variables may be much
larger. For example, 20 - 50 samples and thousands of spectral wavelengths may
not be a problem at all. However, the ideal situation of course calls for a less
extreme, rectangular X-matrix.
If models are found to be malfunctioning, one reason may be that there are too few
objects in relation to the number of independent phenomena and X-variables. This
is of course also the case for Y; few objects and many Y-variables may be difficult
to handle with PLS2. Then it may help to make several PLS1 models.
In general, the need for more/many calibration objects increases with the levels of
the measurement noise in the X and/or Y. It also increases with the number of
interference types in the X-variables of course.
Missing values are handled in such a way that the scores, loadings and residuals are computed as usual, but the missing
values do not influence the results. However, objects with missing X-values do not
get predicted Y-values either!
Other possibilities would be to replace missing values by the mean value of the
variable in which the missing value occurs. This is fine if the object is a typical one
with respect to the variable in question, but it will certainly be a big mistake if the
object otherwise is an extreme member. How can we know this in advance? This is
impossible as the object has a missing element! A much better strategy would be to
find the two most similar objects, or the two most correlated variables, in the full
multivariate sense, relative to the object or variable with the pertinent missing values, and to
interpolate the missing value(s) aided by the pairwise correlation etc. In this way
one gets a local average replacement, consistent with the overall correlation structure
around the missing value, which can better be used in the computations.
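As a hedged illustration of this "local" replacement idea (not a recipe from this book, and far simpler than proper multiple imputation), the following Python sketch replaces each missing value using the most correlated other variable, falling back to the column mean when no useful correlation is available.

```python
# Sketch: replacing a missing value with the help of the most correlated other
# variable, instead of the plain column mean. X is a 2-D numpy array with
# np.nan marking missing entries. Only a crude illustration of the idea.
import numpy as np

def impute_by_correlation(X):
    X = np.array(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)
    for i, j in zip(*np.where(np.isnan(X))):
        best_k, best_r = None, 0.0
        for k in range(X.shape[1]):                       # most correlated variable
            if k == j or np.isnan(X[i, k]):
                continue
            ok = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            if ok.sum() < 3:
                continue
            r = np.corrcoef(X[ok, j], X[ok, k])[0, 1]
            if abs(r) > abs(best_r):
                best_k, best_r = k, r
        if best_k is None:
            X[i, j] = col_mean[j]                         # fall back to the mean
        else:                                             # regression-type estimate
            ok = ~np.isnan(X[:, j]) & ~np.isnan(X[:, best_k])
            slope = best_r * np.std(X[ok, j]) / np.std(X[ok, best_k])
            X[i, j] = np.mean(X[ok, j]) + slope * (X[i, best_k] - np.mean(X[ok, best_k]))
    return X   # never replace missing values by 0
```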
A much augmented development involving this general principle, but with a solid
statistical underpinning, goes under the name of “Multiple Imputations” and is a
distinct sub-discipline by itself. A very useful first reference is Rubin (1987).
Missing values should never be replaced by 0. This will certainly cause false
results from the computations, and lead to very unreliable interpretations.
Historical data have typically been collected under varying sampling conditions and contain many uncontrolled phenomena that may now be difficult, if not impossible, to trace.
You will, however, have many good tools to detect these weaknesses using
projection methods: lack of variability, noise, outliers, trends, groupings, etc.
Studying available external data may also show you which types of informative
data (objects/variables) are not present and give you ideas about which additional
types you need to collect. In any case, one should always look at all the data one
has at one’s disposition, but in general one should perhaps not expect too much
from historical data.
Screening designs are primarily used to assess the significance of effects from
potential factors of variation in the data, whilst optimization designs give a more
detailed description of the relationships between the response(s) and the design
variables. A data set resulting from an experimental design can usually be modeled
and analyzed by traditional statistical methods such as ANOVA or MLR, but may
of course also be modeled by projection methods. Experimental design is given a
full presentation in chapter 16.
There are, of course, many practical situations where you cannot disturb a process,
at least not as much as preferred for an ideal experimental design. In many plants,
process excitation is not allowed at all. Collecting data from the on-going process is
then the only alternative, unless you can make experiments in the lab or have pilot
plant facilities.
There are also situations where you cannot perform all the desired experiments as
planned; some variable settings may be impossible in real life for example. This is
often the case in process operations. If the experiment (measurement) actually
performed turns out to be far from the planned specifications, the basis for classical
analysis of variance is no longer valid. Thus the traditional methods for the
determination of significant effects would be void, since they often require
orthogonality in the design matrix. PCR or PLS may sometimes save this situation,
but there is in general practically never a remedy for a badly thought out
experimental design.
Data from designed experiments may be useful to analyze simple, or more complex
response relationships in data, but a designed experiment often does not generate
enough samples for a model intended for prediction, since the resulting data set
often only consists of measurements on two or three levels. You may also often
need to get several additional samples around and between the initial design
settings if this is the objective too. The interrelationships between the small sample
experimental design situation(s) and the general multivariate calibration training
situation(s) are not a simple issue. It is not possible to cover all aspects in this
introduction. After the next section and chapters 16, 17 and 18 (experimental
design) have been considered, some further reflections on this issue will be offered.
The alternative is called the random design, Esbensen et al. (1998). The setup is
easy to comprehend. The number of available experiments is set by the boundary
conditions; in the situations sketched above the need for at least 30-50 objects
should be justified. Consider that the system to be characterized is “complex”
(factors, levels, interactions), but that a number of 42 experiments has been
accepted as the absolute maximum (economic constraints, practical and/or time
limits etc.). The same analysis of the problem which resulted in these constraints
also furthered specific minimum and maximum levels for all factors involved; this
is a prerequisite for any experimentation, designed or random, or otherwise!
The random design is now set up by a generator which works on each factor
individually: the interval (max – min) is divided by the number of experiments
allowed (42 in this case). A random number table is scaled to the interval [1,42],
and 42 selections (no replacements) from this table are taken. Each random number
will indicate a specific level between the pertinent minimum and corresponding
maximum (continuous variables are binned in some problem-specific fashion).
Thus we are given a set of 42 randomly chosen levels, which neatly spans the entire
experimental domain pertaining to experimental factor 1. This is repeated for the
remaining factors in turn. At the end of this iterative procedure we are left with the
required total of 42 compound level settings for all factors involved, with their
combinations entirely determined by a random changing of the individual
selections. We thus have arrived at 42 experimental settings completely spanning
all the pertinent factor intervals, at levels that are maximally faithful for capturing
both the individual factor ranges as well as their interactions.
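A minimal Python sketch of the random design generator just described; the factor names in the example are hypothetical, and a software random number generator is used in place of a random number table.

```python
# Sketch: the random design generator described above. For each factor the
# [min, max] range is divided into as many levels as there are experiments,
# and the levels are then assigned to the experiments in a random order
# (selection without replacement), independently for each factor.
import numpy as np

def random_design(factor_ranges, n_experiments=42, seed=0):
    """factor_ranges: dict mapping factor name -> (min, max)."""
    rng = np.random.default_rng(seed)
    design = {}
    for name, (lo, hi) in factor_ranges.items():
        levels = np.linspace(lo, hi, n_experiments)   # evenly spread levels
        design[name] = rng.permutation(levels)        # random order, no replacement
    return design          # experiment i = the i-th entry of each factor column

# Example with three hypothetical factors:
# d = random_design({"temperature": (20, 80), "pH": (3, 9), "time": (1, 10)})
```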
If the number of experimental factors were (only) three, the geometrical model of
the random design is particularly easy to depict in one’s mind. It is a cube with 42
objects sprinkled randomly and homogeneously all over the interior of the cube
volume, including some objects very close to (or perhaps a few actually lying on)
the edges/corners. If the actual number of experimental factors is larger than three
as it would be in a complex system, this conceptual model still applies: simply
think of a generalized hypercube with the same “interior volume” characteristics.
There are some problems with this direct geometrical generalization, Rucker
(1984), but they only relate to the strict hyper-geometrical aspects, not to the
practical use of this 3-dimensional geometrical model of the random design for the
more complex situations.
Esbensen et al. (1998, 1999) give several examples of the use of the random design
for suitably complex technological systems, in which the practical set up and use of
the random design is laid out clearly. The resulting spanning of the pertinent
multivariate calibration training data sets is particularly evident.
patterns. You will see whether some types of samples are over-represented or
whether there are “holes” in the overall population distribution. Study the score
plots and try to pick out a subset that spans the variations in each of the relevant
PCs, with but only the necessary minimum number of training samples. More
typical objects, but certainly not just the “average” samples, will give a robust
model for such samples. Many extremes will allow for a more global, but perhaps
less accurate model. Again the particular balance one should make is always
problem-dependent. This procedure may be a bit difficult if there are many relevant
PCs, but it works well in principle. Do not give up on complex problems too easily.
When there are many relevant PCs you may use a factorial design to pick
calibration and validation samples from them, see chapter 16.
Usually this type of training data set screening is carried out in the X-space. An
alternative is to systematically pick out samples based on the Y-value distributions,
for example from every third consecutive Y-level. This works very well because
samples that span large variations in Y necessarily also span the variations in X,
assuming of course that a relevant (X,Y) correlation does exist. However, if the
distribution of the Y-values is very uneven, for example one sample with Y-value 3
and fifteen samples between 35 and 40, the model will of course be very different
depending on whether it contains the first sample or not! Problem-dependent
common sense is very often the best remedy for such non-standard problems. There
is one exercise below (“Geo – dirty samples”) which is fraught with exactly this
type of practical problem. Experience with many types of data sets is mandatory.
And, of course, the above does not constitute a proper test set validation setting; we
are back in the vicious circle of the singular sampling test set switch.
There is an error component accumulating from each stage in the whole chain of
sampling, preparation and measurement through to data analysis. These errors all
contribute to the model error and to the prediction error. To be able to evaluate
how good a final prediction model is you need to be aware of these typical error
sources.
• Inhomogeneity, e.g. differently packed powders or grains, mixtures that are not
well blended, solid samples that are not homogeneous, e.g. meat, rocks, alloys.
• Sample preparation, e.g. different laboratory assistants may perform procedures
in slightly different ways; samples collected at different points in time which may
incur slight sampling differences, aging of chemicals over time etc.
• Instrument inaccuracy, drifts, faults, both in X and the reference method Y.
• Modeling errors
Measurement replicates are defined as samples, which have been measured twice,
or more, in X and/or in Y. When measured three times, they are called triplicates,
etc. Several Y-measurements of the same sample are called repeated response
measurements. In some situations you may choose to repeat the whole experiment,
or you may choose to prepare the sample again. Which choice to make is of course
problem-dependent.
If one divides a sample in two and measure each part once, one may either consider
it as a repeated measurement or as a replicated sample, depending on how
homogeneous the original sample was. Giving the same type of pea twice to a taste
panel can probably be regarded as a repeated sensory measurement. Cutting a piece
of meat in two may be regarded as a replicated sample measurement, or as two
different samples, because one piece of the meat may still be rather different from
the other, e.g. it may contain more fat. Blending three alcohols twice in the same
proportions may be regarded as two samples, because the experimentation contains
variation in itself; these two samples would give you a measure of the
laboratory preparation (mixing) error component etc.
In the following text we will use the term replicate both for repeated X- and
repeated Y-measurements.
What you decide to do depends very much on where you believe the largest
inaccuracy is to occur, and when you feel the need to determine its size. But always
be aware of potential error sources and keep this in mind when analyzing data (also
at a more general level), when selecting validation methods, and when evaluating
model performances etc.
The practical handling of replicates is discussed further in section 9.7 on page 195.
these two sets, though both totaling 20 “objects” do not contain the same
information, and consequently proper data analysis cannot treat them identically
either.
How many measurements you make on each sample depends on cost, tradition,
regulations and also on the expected size of the measurement inaccuracy. If one
does not know how many replicates to use, a good suggestion is always to take a
few pilot measurements, say three to five, and see how much they differ. If this
empirical measurement variation is large, take a few more until you have a feeling
of stability of this replicate distribution. Then decide how many to use in future
measurements.
Equation 9.1: $\mathrm{SDD} = \sqrt{\dfrac{\sum (d_i - d_m)^2}{n - 1}}$
where
di = the difference between the replicate measurements of sample i
dm = the mean of these differences
n = the number of samples
When calculating SDD one normally assumes that each sample is homogeneous, so
the analytical inaccuracy, which is estimated by SDD, primarily consists of the
variation in the reference measurement method, preparation procedure and
instrument uncertainty.
SDev is simply the square root of the total variance of all the repeated
Y-measurements.
Equation 9.2: $\mathrm{SDev} = \sqrt{\dfrac{\sum (y_i - y_m)^2}{n - 1}}$
where
ym = average measurement of y for all the replicates
yi = y-measurement of replicate i
n = number of replicates
Equation 9.3: $V_i = \dfrac{\sum_{j=1}^{J} (y_{ij} - \bar{y}_i)^2}{J - 1}$
where
J = number of replicate measurements on one sample
i = sample number
Equation 9.4: $V = \sum_{i=1}^{n} \dfrac{V_i}{n}$
where
n = number of samples
Equation 9.6: $\mathrm{SDev} = \sqrt{\dfrac{1}{n(J-1)} \sum_{i=1}^{n} \sum_{j=1}^{J} (y_{ij} - \bar{y}_i)^2}$
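A minimal numpy sketch of Equations 9.3, 9.4 and 9.6, assuming the repeated Y-measurements are arranged with one row per sample and one column per replicate:

```python
# Sketch: replicate-based error measures (Equations 9.3, 9.4 and 9.6).
# Y is an (n samples x J replicates) array of repeated Y-measurements.
import numpy as np

def replicate_error(Y):
    Y = np.asarray(Y, dtype=float)
    n, J = Y.shape
    Vi = np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2, axis=1) / (J - 1)  # Eq. 9.3
    V = np.sum(Vi) / n                                                       # Eq. 9.4
    SDev = np.sqrt(V)                                                        # Eq. 9.6
    return Vi, V, SDev
```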
Hopefully you will be able to see the small groups (triplets) consisting of replicates,
well separated from other object replicate groups. In Figure 9.1 you can see the
triplicates quite clearly.
In Figure 9.2, for example, one replicate of sample 40 is more similar to sample 23
than the other replicates of sample 40.
Averaging
After having checked that the inter-replicate variation is not in danger of destroying
the data analysis proper, it may often be an advantage to prepare two sets of
pertinent score plots, one including all replicates, and one in which only the
“average samples” from the replicates are depicted. By using these averages one has, to
a certain degree, damped the inaccuracies in the model, but by also having access to
the full score plot it will always be possible to factor in the replicate variation again
later.
What should you do if you, for example, have three replicates of most samples, but
only two of some? It is of course possible simply to add a third sample where one is
lacking by adding the average of the other two. If the score plot shows the same
replicate variation for this sample as for the other samples, this is quite safe.
Likewise, if there are a few extra replicates in a few samples, you may replace
these by an average. Again, check the score plot first in order to fully grasp which
of these averaging “tricks” is necessary.
Note!
The number of samples, n, will be inflated when you have replicates. If
your data set for example contains triplicate X-measurements, the data
analysis is not really carried out on 3n independent objects. Many
standard statistics are based on the number of effective samples
present, n or (n-1). It is entirely up to you to keep track of how and where
to avoid this pitfall.
To avoid this, you should keep all replicates of particular sample together in the
validation process. This is done by selecting cross-validation with exactly as many
segments as there are sets of replicates. The Unscrambler has an easy option for
this “systematic selection” of cross-validation segments to allow you to put all
replicates of a sample in the same segment. This requires the same number of
replicates for all samples, which can be achieved as described above.
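Outside The Unscrambler, the same "keep replicates together" segmentation can be sketched as follows (Python with scikit-learn; the group labels, i.e. one sample id per replicate row, are assumed to be available):

```python
# Sketch: cross-validation that keeps all replicates of a sample in the same
# segment, using a "sample id" group label for each replicate row.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GroupKFold

def replicate_safe_rmsep(X, y, groups, n_components=2, n_segments=10):
    cv = GroupKFold(n_splits=n_segments)
    press = 0.0
    for cal_idx, val_idx in cv.split(X, y, groups):
        model = PLSRegression(n_components=n_components).fit(X[cal_idx], y[cal_idx])
        press += np.sum((model.predict(X[val_idx]).ravel() - y[val_idx]) ** 2)
    return np.sqrt(press / len(y))
```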
From these simple illustrations above it will again be apparent that validation is
certainly not just some standard procedure to be applied to whichever data set at
hand. Careful reflection on the entire data analytical process is needed.
Practical validation can be done in three ways. Sometimes starting with leverage
correction is relevant in a specific problem-context, always remembering of course
to revalidate properly for the final model. This may then either be test set validation
or cross-validation if there really are not enough objects for a test set.
Comments on these three methods from a practical point of view are given below in
sections 9.8.1 to 9.8.4. The basic approaches were described in detail in chapter 7
and a final overview is also given in chapter 18.
segmentation series. You can choose alternative ways to select these validation
segments, as was described above.
If the data set is relatively small, one would use segments that are at most 10% of
the total number of objects. Full cross-validation will become more and more
relevant if/when the number of available objects decreases even further. If the data
set is large and you only choose two segments, you will of course not get the same
effect as running two proper test sets (on two calibration sets of n/2 samples), as is
well known by now...
On the other hand, if there are very many samples, one may consider using
segmented cross-validation to imitate a series of three, four... test set validations.
In general, such a "many-sample" situation poses no problem at all.
The influence of each object on the model, the leverage, will be computed. The
residuals in the calibration objects will then be “corrected” according to the
reciprocals of their leverage influences, i.e. their weighting increased or decreased
according to their modeling influences. Since all objects are used for both modeling
and validation, this method may often give estimates of the prediction error, which
are too optimistic.
They only differ in the manner validation objects are either brought in from outside
(test set validation), or by the particular way they are sub-sampled from the training
set (cross-validation). The leverage-corrected validation simply tries to counter-
weight the effect of wrongly using the objects twice according to their distance-to-
model-center leverages.
This means that the multivariate model will always be the same, irrespective of the
particular validation method chosen. The consequences of this choice are that, in
general, the estimates of the prediction error will differ, but hopefully not by much.
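As an aside on the leverage-correction idea above: The Unscrambler's exact formula is not reproduced in this book, so the following is only a generic sketch of the principle, assuming the common convention of dividing each calibration residual by (1 - hi), where hi is the leverage of object i computed from the score matrix.

```python
import numpy as np

def leverage_corrected_rmse(T, residuals):
    """T: calibration score matrix (n x A); residuals: calibration Y-residuals (n,).
    Each residual is inflated according to the object's leverage h_i (assumed 1/(1-h_i) convention)."""
    n = T.shape[0]
    h = 1.0 / n + np.einsum("ij,jk,ik->i", T, np.linalg.inv(T.T @ T), T)   # leverages
    corrected = residuals / (1.0 - h)       # objects far from the model centre are inflated most
    return np.sqrt(np.mean(corrected ** 2))

# toy usage with arbitrary scores and residuals
rng = np.random.default_rng(3)
print(leverage_corrected_rmse(rng.normal(size=(20, 3)), rng.normal(scale=0.2, size=20)))
```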
There is one important imperative associated with this method choice: it is not a
subjective choice to be made based on the data analyst's preferences. On the
contrary, it is the specific data structure, and this data structure alone, which
determines the choice of validation method. The only subjectivity in this context
lies in differences in the relevant practical experience of the data analyst. One
should certainly not go searching for whichever validation alternative happens to
give the smallest prediction error. That would be very bad form indeed, never to be
undertaken!
9.9.1 Residuals
The residuals, E or F, are the deviations between the measured and modeled or
predicted values in each object: the residuals are what has not been modeled.
In matrix terms, $E = X - X_{estimated}$ and $F = Y - Y_{predicted}$. Here
"Xestimated" contains the projected (modeled) X-values for each object and
"Ypredicted" is of course the predicted Y-variable values for each object.
The residuals may conveniently be plotted for all samples, for example in run order
(the order of the objects in data matrix X) or versus the size of the predicted
Y-values. A large residual for an object means that this object has not been well
fitted (it may even be an outlier). The residuals should be randomly distributed,
meaning that the remaining unexplained variations in the data should be similar to
white noise. A systematic pattern indicates that some systematic variation in the
data remains, which is not described.
The residual variance is the summarized error expression: the sum of the mean
squares of the residuals. The calibration variance for X or Y is a measure for how
well the X and Y data, respectively, have been modeled. The validation variance
for Y expresses how well the model will perform for similar, new data. The
validation variance for X indicates how well the validation data have been
projected onto the PCA - or the PLS components model.
Both the calibration and validation variance can be plotted either as residual
variance, (which is supposed to decrease as the number of PCs increases), or as
explained variance which shows which percentage of the total variance has been
explained by the increasing number of components. The residual and the explained
variance plots are but two alternative expressions based on exactly the same data;
they always sum to 100%, see Figure 9.3.
% Y-variance expl.
100
90
80 v.Tot
70
c.Tot
60
50
40
30
20
10
0 PCs
0 3 6 9
t-0, variable: v.Tot c.Tot
Usually one studies the residual variance curves by focusing on their characteristic
shapes, searching for the first local or for the global minimum. This plot’s primary
use (dimensionality optimization) is to find how many PCs to use in the model.
Because the residual variance is dependent upon the original measurement units
and/or model scaling, it may be difficult to use it to compare different models
directly.
RMSEP is defined as the square root of the average of the squared differences
between predicted and measured Y-values of the validation objects:
Equation 9.8
$$RMSEP = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}$$
RMSEP is also the square root of the validation variance for Y, divided by the
weights used at calibration (scaling).
RMSEC is the corresponding measure for the model fit, calculated from the
calibration objects only.
RMSEP expresses the average error to be expected for future predictions. Therefore
one may conveniently report predicted Y-values together with ±2 × RMSEP as an
approximate uncertainty interval.
Some authors are concerned with the fact that RMSEP is estimated using reference
values which themselves are also error-prone (measurement errors in the reference
method), and have consequently introduced the term apparent RMSEP etc. It is
claimed that a correction for this is easily invoked. While in the laboratory
calibration context it may well be possible to estimate the true measurement error
and carry out the pertinent corrections, there are also many other practical
situations, which we might term field situations, e.g. technical monitoring,
production plant, environment or biosystems in which it is precisely this
comparison with the error-containing reference samples that carries the practical
meaning of the validation.
Note that RMSEP is the average error, composed of large and small errors
altogether. This is often well illustrated in the important Predicted vs. measured
plot, where many samples in general will be predicted well and some badly. In
Figure 9.4, sample 15 has a much larger error than sample 45.
Bias represents the averaged difference between predicted and measured Y-values
for all samples in the validation set.
Equation 9.9
$$Bias = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)}{n}$$
SEP, the Standard Error of Performance, on the other hand expresses the precision
of results, corrected for the Bias.
Equation 9.10
$$SEP = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i - Bias)^2}{n-1}}$$
If one can reasonably expect a normal distribution of the samples included in the
calculation of the Bias, the uncertainty of the Bias may be estimated as
approximately SEP/√n.
This also means that the uncertainty of the Bias is directly dependent on SEP. SEP
increases when the Y-values (the reference values) are inaccurate.
The relationship between RMSEP, SEP and Bias is statistically well known:
RMSEP² ≈ SEP² + Bias². Normally, indeed hopefully, there is no Bias, which then
leads to RMSEP = SEP.
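As a small numerical check of Equations 9.8 to 9.10 and of this relationship (a sketch with made-up predicted and reference values):

```python
import numpy as np

y_ref  = np.array([10.2, 11.0, 12.4, 13.1, 14.8, 15.5])   # reference (measured) values
y_pred = np.array([10.6, 11.3, 12.2, 13.6, 15.1, 15.9])   # predicted values (validation)

d = y_pred - y_ref
n = d.size
rmsep = np.sqrt(np.mean(d ** 2))                     # Equation 9.8
bias  = np.mean(d)                                   # Equation 9.9
sep   = np.sqrt(np.sum((d - bias) ** 2) / (n - 1))   # Equation 9.10

# exact bookkeeping: RMSEP^2 = Bias^2 + (n-1)/n * SEP^2, i.e. approximately SEP^2 + Bias^2
assert np.isclose(rmsep ** 2, bias ** 2 + (n - 1) / n * sep ** 2)
```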
What should we do if SDD is just at the level of accuracy required (for instance by
regulations), and SEP = 2*SDD? You can only solve this problem by making better,
more precise Y-measurements, i.e. by reducing SDD. SDD decreases with the square
root of the number of repeated measurements that are averaged, so SDD based on
four replicates is only half as large as SDD based on a single measurement.
Note however that if you transform Y, e.g. to log Y, the RMSEP will be in log Y
units. But you cannot back-transform RMSEP directly; exp(RMSEP(log Y)) is not
equal to RMSEP (Y). In this situation a little spreadsheet back-calculation is
necessary.
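A minimal sketch of such a back-calculation, assuming the model was built on the natural logarithm of Y:

```python
import numpy as np

y_ref      = np.array([12.0, 25.0, 40.0, 63.0])      # reference values in original units
y_pred_log = np.array([2.52, 3.18, 3.72, 4.10])      # predictions from a model built on log(Y)

rmsep_log  = np.sqrt(np.mean((y_pred_log - np.log(y_ref)) ** 2))   # RMSEP in log units
y_pred     = np.exp(y_pred_log)                                    # back-transform the predictions first
rmsep_orig = np.sqrt(np.mean((y_pred - y_ref) ** 2))               # then compute RMSEP in original units

# note: np.exp(rmsep_log) is NOT equal to rmsep_orig
```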
estimates. Make sure that you understand how everybody calculates their error
estimates, and that you can explain how you got yours.
For example, some Neural Net and MLR implementations do not validate based on
similar external principles, so here one might see apparent prediction results that
are really only model error estimates proper. As is known by now, an overfitted
model may have a very low apparent error (if evaluated in this fashion), but in
reality will have an unreliable prediction ability.
Equation 9.12
$$RSD = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n-k-1}}$$
where
n = number of samples
k = number of variables
RSD is, therefore, more of an expression of the modeling error. The value of RSD
becomes very small when there are few variables. For example, in spectroscopy
applications with filter instruments, using only a few wavelengths, the RSD will
seem very low. RSD does not estimate prediction errors (estimation errors of the
regression coefficients), but only errors in the model fit. It cannot be used for
prediction or validation samples, since it is calculated on the calibration samples
only. RSD is, therefore, not equal to the prediction error in the above chemometric
definition.
The best way to compare an MLR model with a PCR or PLS model is to calculate
an appropriate RMSEP for the same validation or prediction set.
$$PRESS = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
PRESS is the residual Y-variance multiplied by the number of validation objects.
PRESS is often used to assess whether an individual, new ("next") component
represents a significant addition to a model, whilst the residual variance as defined
in this book comprises all the included components simultaneously. It is a viable
alternative to use a "PRESS vs. no. of components" plot instead of the conventional
residual validation variance plot, since the Y-axes in these two plots are simply
proportional to each other, with a multiplication factor equal to the fixed number of
validation samples involved.
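The proportionality between PRESS and the residual validation Y-variance can be made concrete in a few lines (hypothetical validation residuals):

```python
import numpy as np

resid = np.array([0.3, -0.1, 0.4, -0.2, 0.1])   # validation residuals (y_hat - y) for one model size
n = resid.size

press = np.sum(resid ** 2)                      # PRESS
res_val_var = press / n                         # residual validation Y-variance (as used in this book)
# plotted against the number of components, the two quantities give curves of identical shape
```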
Once we have found a satisfactory prediction model, satisfactorily validated that is,
the quality of the predicted values is expected to be approximately as good as that
for the average calibration/validation object. Except for outlier warnings and
prediction uncertainty limits, there is no way to check whether the prediction
objects are good or bad, i.e. whether they correspond in general to the data
structure for the training data set.
Thus, one set of B-coefficients is calculated for the model with 1 PC included only.
Another set is calculated for the model including 2 PCs, and so on. You should of
course use the set of B-coefficients corresponding to the appropriate optimum
number of components, A.
Note!
Prediction using these B-coefficients gives exactly the same predicted
numerical Y-values as the projection model equations using A PLS-
components! Sometimes, inherent rounding-off errors may produce
small discrepancies; however, they should never be of any quantitative
consequence.
The traditional regression equation is therefore often used for e.g. downloading
prediction models to spectroscopic instruments etc. and for automatic predictions.
The only significant drawback in prediction using the B-vector is that you lose the
outlier detection and interpretation capabilities available with the projection
models.
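A minimal sketch of prediction with the B-vector (the B-values and new X-measurements below are purely hypothetical):

```python
import numpy as np

b0 = 1.25                                   # hypothetical constant term (B0)
b  = np.array([0.42, -0.17, 0.08])          # hypothetical B-coefficients for three X-variables

X_new = np.array([[2.1, 8.0, 12.0],         # new, unweighted X-measurements
                  [1.8, 7.5, 14.0]])

y_pred = b0 + X_new @ b                     # same numbers as the projection model would give
```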
represents a fair estimate of the average prediction error. This will, in many
practical situations, be absolutely acceptable.
There have, in addition, been attempts at giving some uncertainty indications valid
for each specific prediction sample. Using The Unscrambler during prediction,
predicted Y-values are given an uncertainty limit called deviations (Dev). These
limits are calculated from the validation variances, the residual variances, and the
leverage of the X-data in the prediction objects. This is based on an empirical
formula originally developed in the 1980’s, then further improved in 1998. If the
X-data for the prediction sample is very similar to the training X-data, then the Dev
interval will be smaller and the prediction more reliable. If the new sample is more
different from the training data, the Dev interval will be larger. This prediction
deviation interval is really most useful when comparing predictions. Note that these
uncertainty limits (Dev’s) mostly indicate to what extent you can trust a particular
predicted value, i.e. a form of outlier detection.
If a predicted value is “bad”, e.g. gives outlier warnings, has large uncertainty
limits or seems to fit badly with the model, the reason may either be that the
prediction object is dissimilar to the calibration samples, or that the validation set
was different from the calibration set.
or similar. It is preferable to use the values -1, +1 for such variables rather than 0, 1,
because of the better symmetry.
The above dichotomous dummy variable facility often works very well, provided
there are not too many of these alongside the dominating continuous variables.
However, the simple analogy between (-1,1) and any dichotomous category
classification cannot be carried over to the case where more than two categories are
involved. It is most emphatically wrong to try to code multiple categories, e.g. (A,
B, C, D) into a discrete variable realization space, e.g. (1,2,3,4). There is no
guarantee that the “metric” discriminating between categories (A,B,C,D) should
happen to correspond to the equidistant, rational metric set out by (1,2,3,4). This is
a very serious mistake to make.
If a category variable can take more than two values, then we must use one
indicator variable for each category level; a sketch of such a coding is given below
as an example.
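Since the worked example from the original text is not reproduced here, the following is a hedged sketch (using pandas; the category variable refinery and its levels A-D are hypothetical): each level gets its own indicator variable, coded -1/+1 as recommended above.

```python
import pandas as pd

df = pd.DataFrame({"refinery": ["A", "B", "C", "D", "A", "C"]})

# one indicator column per category level (refinery_A, refinery_B, ...)
indicators = pd.get_dummies(df["refinery"], prefix="refinery").astype(float)
indicators = indicators * 2.0 - 1.0      # map 0/1 to -1/+1 for better symmetry

# the original category column is dropped before modeling; only the indicators enter X
X_cat = indicators.to_numpy()
```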
This is often referred to as “re-coding”. With this approach, we are in fact able to
make use of these, at best, semi-quantitative category variables. In fact quite an
important issue in PLS-modeling concerns what has been termed “PLS-
discrimination modeling”, or PLS-DISCRIM for short. In chemometrics there have
been some spectacular application showcases based on this simple, yet enormously
powerful concept. We have found this issue so important that we have included
some PLS-DISCRIM problems amongst the new master data sets in this revised 4th
edition.
0 and 1 can be replaced by -1 and 1. If you should ever happen to import a data set
in which such variables have been coded (0,1), to replace (0,1) in a variable (e.g.
V1) with -1 and +1, select Modify - Compute, and type V1=(V1-0.5)/0.5. For the
whole matrix, type X=(X-0.5)/0.5, and specify the range of variables when the
program asks for it.
Another way to avoid amplifying the effect of noise is to scale by 1/(SDev + C),
where C is a small constant, related to the accuracy of the data. This prevents
variables with a very low standard deviation from producing a very high value of
1/SDev.
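A minimal sketch of this "softened" standardization, with C a small constant chosen from the known accuracy of the data:

```python
import numpy as np

def soft_autoscale(X, C=0.01):
    """Centre X and weight each variable by 1/(SDev + C) instead of 1/SDev."""
    sdev = X.std(axis=0, ddof=1)
    return (X - X.mean(axis=0)) / (sdev + C)

X = np.random.default_rng(1).normal(size=(20, 5))
X[:, 4] *= 1e-4                   # a near-constant, noise-dominated variable
X_w = soft_autoscale(X)           # its influence is no longer blown up by 1/SDev
```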
Note!
You may use the Weights button to allocate individual weights to individual
variables or to sets of variables, which should be treated identically. Such variable
sets are often referred to as “blocks”. Block-scaling is sometimes useful, for
can be calculated directly from the pertinent PLS-loadings, and they give exactly
the same predicted Y-values. The B-vector can thus be used to predict new
Y-values with a minimum of fuss, when we are truly only interested in the
prediction result, like e.g. in automation and process control etc.
Always study the B-vector together with the appropriate loading-weights. Check
this by plotting the loadings for the same PCs.
The B-vector is cumulative (the B-vector is for one complete model with A
components), while there is one set of loadings for each PLS-component – much
more informative and much easier to interpret!
Bw-coefficients are always calculated in The Unscrambler if the data have been
weighted (scaled at the Task menu). They should be used to predict new Y-values
from new weighted X-values.
The Bw-coefficients take the weighting into account, with the aim of disregarding
small or large original variable values. Therefore a large Bw indicates an important
X-variable. The sign may, however, still be “wrong” if there is interaction. The Bw
option is primarily used for export from The Unscrambler, for example, since all
internal prediction is automatically carried out with the appropriate weights, etc.
Spectroscopic data usually comprise rather large data sets, often with hundreds – or
even thousands – of variables, but because they are collinear PLS can handle them
even with few objects. Chromatography data, acoustic spectra, vibration data and
NMR are examples of similar types of data with the same characteristics. In the
rest of this book the term “spectroscopic” may often be taken in a more general
generic sense, meaning also applying to data with similar characteristics to the
spectroscopic data.
The X-variables typically represent wavelengths while the X-data themselves are
often absorbance, reflectance or transmission readings etc. In chromatography the
X-variables are usually retention times and the X-values are peak heights (single
peaks, or integrated). The Y-variables (often called constituents or properties) may
be chemical concentrations, protein contents or physical parameters such as octane
number, among others.
PLS is well established within NIR today, because NIR applications often require
methods based on many wavelengths due to non-selective, full-spectrum
wavelength responses. PLS-applications are however being continuously developed
also in other wavelength ranges, such as IR, UV or VIS, where potential
information remained partly hidden earlier when univariate methods or MLR were
the only calibration options. There are also many recent developments within
acoustic chemometrics, which relies heavily on PLS-calibration, see one of the
master data sets in chapter 13.
The automatic outlier detection limits should typically be raised, because this kind
of spectra is usually very precise and a low limit will display too many objects as
outlying. A usual limit would be 4.0-6.0 in this case. As this modification is highly
application-dependent, you will have to find suitable limits for your own data. At a low
limit many objects are indicated as outliers. Studying the list of outlier warnings
shows you which are just above the limit and which are far higher. Adjust the limit
accordingly - good domain-specific knowledge is of course necessary.
In PLS2 the 2-vector Y-loading plot shows the relationships within the set of
Y-variables and this is often useful information.
With spectroscopic data the 1-vector loading-weight plot is often very useful in e.g.
understanding the chemistry of particular applications. Large loading-weights
imply wavelengths in which there is for example significant absorption related to
the constituent of interest. This type of interpretation is a vital source of
information to help you to understand the chemistry of the samples. Similar
interpretations apply, for instance, to acoustic spectra, where the loading-weights
show the relationships between the frequency-variable responses.
By studying the specific patterns in the loading-weights you may also - with some
experience - begin to be able to interpret which “effect” is being modeled in
specific calibration situations: peaks, shifts, double-peaks, scatter, or combinations
of these effects. Professional interpretation of loadings and loading-weights (by
spectroscopists, analytical chemists, etc.) requires extensive practical experience,
but there is also a lot of literature on this particular subject.
Figure 9.5 - Raw spectra (samples 5, 6, 7, 8ny, 9, 12) showing strong scatter effects
As a typical example, strong scatter, as seen in Figure 9.5, often gives loadings in
the first PLS-component with the pattern shown in Figure 9.6.
Figure 9.6 - X-loadings of the first PLS-component for scatter-dominated spectra
(explained X-variance: 99%)
Another example concerns shifts in the X-spectra. This gives the type of loadings
shown in Figure 9.7. The shift form in the loading curve is shown in all significant
PCs.
Figure 9.7 - Loading plots showing typical shift effects in PC1, PC2...
Determination of the optimal number of PCs follows the usual rules. In addition it
is natural to study the B-vector and the 1-vector loadings for a varying number of
PCs. The 1-vector loading-weights become noisy (around an effective zero-level)
when you have passed the optimum, because you start to model noise and overfit.
This can be a powerful alternative way to determine the number of components to
use in more complex applications.
Experience and experiments show clearly that PLS manages to model structures
both on full-spectrum data sets and data from filter instruments.
If you reduce the number of wavelengths by using only the ones that carry most
information (the ones with high PLS-loading values), the model may be safer and
easier to interpret (fewer factors) and the prediction error may be reduced. This is
the result of avoiding wavelengths with large noise fractions and irrelevant
information, or non-linearities.
Choosing fewer wavelengths arbitrarily still may give rather good models, but an
intelligent, and problem-dependent, selection technique will always improve the
results considerably.
In practice you should study the B-coefficients that give the accumulated picture of
the most important wavelengths - for the final, completely validated model. For a
model with, say, four valid PCs (giving the minimum residual variance), you study
the B-coefficients for 4 PCs. Select the variables with the highest absolute
B-values. Then recalibrate with only these variables and evaluate the results anew.
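A sketch of this selection-and-recalibration loop, here written with scikit-learn's PLSRegression rather than The Unscrambler (the number of variables to keep and the number of components are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmsep_cv(X, y, n_comp, cv=10):
    """Cross-validated RMSEP for a PLS model of the given size."""
    y_val = cross_val_predict(PLSRegression(n_components=n_comp), X, y, cv=cv).ravel()
    return np.sqrt(np.mean((y_val - y) ** 2))

def select_by_b(X, y, n_comp=4, keep=20):
    """Keep the variables with the largest |B| from the full model, then re-validate."""
    pls = PLSRegression(n_components=n_comp).fit(X, y)
    importance = np.abs(np.asarray(pls.coef_)).ravel()   # |B| for the chosen model size
    selected = np.argsort(importance)[-keep:]            # wavelengths with the largest |B|
    return selected, rmsep_cv(X[:, selected], y, n_comp)
```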
Wavelengths chosen to give optimal MLR solutions will also be useful for PLS.
There is a lot of work to be done in this field, and much to be gained from
optimizing an automated procedure for selecting the best wavelengths. A great
many papers have been published in recent years on this issue, which is however
just outside the scope of this introduction to (generic spectroscopic) multivariate
calibration.
This long, but essential, chapter on the range of additional practical aspects of
applying multivariate calibration has now come full circle. There is nothing more
we can teach you.
We can now progress to three full chapters on realistic, real-world multivariate
calibration exercises, many of which display very interesting non-standard issues,
all ready for you (chapters 10, 12 and 13).
The reference method for octane number measurements is (very) time consuming
and relatively expensive, involving comparative use of test engines (two: one for a
reference distillation product mixture, the other for the mixture to be graded),
which has to run for 24 hours. If it were possible to replace such measurements
(Y) with fast, inexpensive NIR-spectroscopy measurements (X), routine quality
control could be effectively rationalized and indeed made very much cheaper. In this
example we do not concern ourselves with a score of other potential problems
related to optimization of the practical implementation of NIR-technology in the oil
refinery setting (these have all been overcome, although at no small effort), but we
shall assume that the prediction of octane number is the sole objective at hand.
Data Set
The sample set Training consists of twenty-six production gasoline samples that
were collected over a sufficient period of time, considered to span all the most
important variations in the production. All the data are stored in the file Octane.
Two of the samples contain added alcohol, which increases the octane number. The
variable set NIR spectra consists of NIR absorbance spectra over 226 wavelengths.
The variable set Octane consists of the corresponding reference measurements of
the octane number.
There is also a sample set Test of 13 new, “unknown” samples. (The corresponding
octane numbers are of course also available for control and validation purposes.)
Tasks
Make a PLS model on the sample set Training. Detect outliers using all available
means, interpret, delete the outliers and re-model. When you have found a good
model, use it to predict the octane number of the sample set Test.
How to Do it
1. Studying raw data
There are several ways to study raw data.
Open the file Octane in the Editor. Study the raw data values as well as possible
A second possibility is to plot the raw data. You can use a matrix plot to display
the whole spectrum for a whole set of samples; this will show you the general
shape of the spectrum, and may enable you to spot special samples. To do this:
Use Edit – Select Variables and choose the set NIR spectra. Use Plot- Matrix
and take the defined sample set “Training” which corresponds to the 26 first
rows of the data table. Click OK. If necessary, go to Edit - Options to select a
plot of type Landscape.
The last possibility is to plot a summary of the data. For one individual variable,
a histogram is a good summary of the distribution of the values:
Mark the variable Octane again, use Plot - Histogram and select sample set
Training.
What is the range of variation of the octane number? Are there any groups of
samples?
For a whole set of related variables, for instance the X-matrix, descriptive
statistics are also a powerful way to summarize data distributions. To use them:
Go back to the original Octane Editor window, and choose Task - Statistics.
Select sample set Training, variable set NIR spectra. View the results.
The upper plot displays the minimum and maximum, lower and upper quartiles,
and median of each wavelength as a boxplot.
The lower plot displays the average (as a bar), and standard deviation (as an
error bar).
Can you easily see the common shape of the spectra for all samples? What is
happening at the right end of the spectrum?
You can now start the calibration. Study the screen during PLS1 regression
progress.
Are there any warnings in the early PCs? Approximately how many PCs will
you need? Does the Y-variance decrease as it should?
Study the Scores plot for PC1 versus PC2. You can use Window - Warning list or
View – Outlier List to get more information about the warnings you noticed
during regression progress.
Which samples contribute most to make up the model span?
How do you interpret the narrow horizontal group of samples around the origin
along the first PC?
Are there any outliers? Which?
4. Spotting outliers
The X-Y relation outlier plot, which displays U scores vs. T scores, is only
available in PLS. This plot shows you directly how the regression works and
gives a good overview of the relationship between X and Y for one particular
PC. If the regression works well, you will see that all the samples form a straight
regression line. Outliers “stick out” orthogonally from this line. Extreme values
lie at the ends. Noise is typically being modeled when the samples start to
spread, i.e. you have gone past the optimal number of PCs.
Plot X-Y Relation outliers, Quadruple plot for PC1 - PC4. You can use View-
Trend lines to add a regression line to each plot.
Can you spot any outliers? Mark them (using the toolbar or Edit - Mark), and
notice how they are now marked on all plots, a facility called brushing.
Plot Predicted vs Measured for a varying number of PCs (e.g. from 1 to 4), one
in each quarter window. Toggle between Cal (calibration) and Val (Validation)
using the toolbar. If you do not have the Cal and Val buttons on your toolbar,
turn them on by choosing View – Toolbar – Source.
If you wish to take additional information into account, use Edit - Options-
Sample grouping. Try to Separate with Colors and Group by Value of Leveled
variable 2 (the category variable containing information about octane number
range). It will help you to spot misplaced samples and see groups.
Can you see the outliers? What happens with the predicted values for 1 PC? 2
PCs? How many PCs do you need to get regularly distributed predictions?
Use Plot – Variances and RMSEP - RMSE to plot the calibration error (RMSEC)
or prediction error (RMSEP) in original units. Before clicking OK in the dialog
box, double-click on the plot preview to get it as a full window. This way the
produced plot will use the single subview window. Point at the minimum and
click to see the value!
Plot Loadings - Line - X for Component 1 (which separates those two samples
from the others). Point and click to see the wavelengths.
Can you deduce which band is mostly absorbed by alcohol-related compounds?
Try the Sample outliers plots and study how they reveal the problems with
sample 25 and 26. You may use several options and select Validation only on
some plots to make things clearer.
Based on the damage the outliers are doing to the model, and on the additional
knowledge of why they are outlying, make a decision about whether to keep
them or exclude them.
Study the Variances and RMSEP and X-Y Relation outliers plots for the new
model.
Do you see new outliers?
Is there still an increase in the prediction error?
How many PCs should we use?
How large is RMSEP with that number of components?
Plot Predicted vs Measured for varying number of PCs. Use View - Trend
lines to put on a Regression line. View - Plot Statistics gives additional
information.
Do you recognize the groups? How many PCs should we use? Are you satisfied
with the distribution of the samples around the regression line?
If you are hesitant to choose between two numbers of PCs, the smallest number
will be called the “conservative” choice, whilst the larger one may signal
problems!
Summary
Some would say we do not need to weight these spectra, since they are so
extremely similar over the entire X-wavelength range. We will need about 3 to 5
PCs, but the Y-variance increases in the first PC, which is a sure sign of problems -
usually outliers. Samples 25 and 26 were indicated as extreme outliers. These
outliers were actually seen in all the sample-related plots. PC1 in the first model
described mainly the difference between samples 25-26 and the other samples. The
outliers also caused an increase in the prediction error in the first PC. The narrow
group of samples around the origin in the score plot are in fact all the remaining
samples, whose variations are only small compared to the difference between them
and sample 25-26. Obviously no. 25 and 26 make up most of the model variance
alone. The X-Y Relation outliers plot shows the same thing, and in PC4 the
samples spread out, indicating overfitting. The loadings for PC1 were largest in
band 1400-1420 nm. Samples 25-26 were the samples with added alcohol;
obviously they are so dissimilar to the others that we cannot make a model for both
types. (An alternative might be to try to add more samples with alcohol, but there is
a great risk of the two types of samples being so different that the result would
most probably be an inaccurate global model.) The outliers were also visible in
Predicted vs. Measured. In some PCs samples got misplaced. RMSEP was about
0.3 with 3 PCs.
The second model had no local increase in prediction error; 2 PCs already seem
OK, but the model is also further improved with 3 PCs, giving an RMSEP around
0.25 octane number. The score plot indicates groups, but since the explained
variance of Y is around 98%, these are probably not harmful. The groups are
actually composed of Low, Medium, and High octane. In the Predicted vs.
Measured plot, the correlation between predicted and measured Y is close to 0.99,
and the regression line for validation samples is very close to “y=x”. This is indeed
a very good model, and comparing a RMSEP of 0.25 with octane numbers between
87 and 93 gives an average relative error of less than 0.5%. However, since the
error in the reference method, SDD, is unknown to us, we cannot say anything
more specific about how well the model predicts compared to traditional octane
measurements.
Samples 10-13 get large outlier warnings at prediction and therefore cannot be
trusted at all. Their predicted values are too high with a model including 3 PCs.
According to their spectra, they probably also contain alcohol and, since those
samples were removed from the calibration set, we cannot expect the prediction
model to handle such samples either. It seems that the model can also be used to
detect new prediction samples that do not fit in, so now nobody can try to cheat by
adding alcohol to raise the octane number! This is a very powerful illustration of
the possibilities of detecting “non-similar” training set objects when using
multivariate calibration.
The RMSEP is the average prediction error, estimated in the validation stage. If
new prediction samples are of the same kind and in the same range as the training
samples, we should expect roughly the same average prediction error. The
uncertainty deviation at prediction tries to consider the particular new sample that
is to be predicted, indicating a larger uncertainty if the new sample is (very)
different from the calibration samples. We can see this as a convenient form of
outlier detection at the prediction stage.
When plotting Predicted vs. Reference and drawing a regression line to fit the
points, the result seems bad because of the second group of outlying samples 10-13.
If you disregard these samples, you will notice that all the “normal” samples are
close to the target line.
Samples 10 - 13 have nice predictions using 2 PCs, but the deviations are large.
This should make you worry, because the samples are obviously different from the
samples used to make the model. It may therefore be pure luck that the predictions
fall nicely into the range of the others.
We hope you have not used leverage-corrected validation in the final evaluations!
If you did not: congratulations! If you did, here is what you still have to do:
Take a close look at the initial Matrix-plot of the X-data again. Extreme
collinearity and redundancy. This is a typical situation in which there is a real
danger of the “famous” over-optimistic leverage corrected validation. It is in fact
necessary to perform the entire calibration again using a more appropriate
validation procedure. However, the repeat is readily done now that you know all
the outliers, including the potential last three candidates (samples 10-13). In fact
this makes the whole re-analysis boil down to simply choosing the more
appropriate validation method to be used directly on the outlier-screened data set
inherited from above.
What about not scaling, or weighting, these data? Why did you - perhaps -
decide not to use this option on these particular data? There is nothing special
about spectral data. True, the X-spectra were all measured in the same units, and
across the entire X-interval these data appear very much alike, with enormous
redundancy (for the outlier-screened data set, to be sure). That would actually
weigh in favor of auto-scaling, so as to help bring forth the minuscule differences
between this set of very similar objects, very similar spectra (in X-space). The
lesson is not to follow any myths about scaling or not scaling; in fact, the whole
scaling/no-scaling issue can be put to a much easier test: just do it! You'll have to
carry out the entire PLS-analysis again, only now using the alternative auto-scaled
data.
Compare the scaled PLS-model results with the earlier un-scaled model version(s).
In this particular case, what can we conclude with respect to the issue of the merits
of scaling?
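If you want to run the same comparison outside The Unscrambler, a minimal sketch with scikit-learn could look like this (synthetic data stand in for the outlier-screened octane spectra):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmsep(X, y, n_comp, scale):
    model = PLSRegression(n_components=n_comp, scale=scale)   # scale=True ~ auto-scaling
    y_val = cross_val_predict(model, X, y, cv=10).ravel()
    return np.sqrt(np.mean((y_val - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 50))                     # stand-in for the outlier-screened spectra
y = X[:, 10] - 0.5 * X[:, 30] + rng.normal(scale=0.05, size=24)

for scale in (False, True):
    print("auto-scaled:" if scale else "unscaled:  ", round(rmsep(X, y, 3, scale), 3))
```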
Problem
This exercise is based on studies of municipal water quality at an outlet of a sewage
plant at the 1994 Olympic Winter Games Village, Lillehammer, Norway.
Make a model of the target Y-variable BOF7 from the other available
measurements and find an equation: BOF7 = f (tot-P, Cl, Susp). BOF7
measurements cost about $140 each so taking one measurement per day may be
rather costly for a medium-sized municipality (of course viewed in the context of
all the other analytical requirements in the municipality’s environmental protection
division). A good PLS-model would be valuable to the sewage plant monitoring
economics, and will ease the load on the wet laboratory resources.
Data Set
The samples were measured twice a month over a five-month period. In the data
tables they are listed in time order. The data file name is Water:
Tasks
Make a PLS1 model.
Plot or read the B-coefficients and estimate the traditional regression equation.
How to Do it
1. Make a PLS1 model using the variable set Important Measure as
X-variables (Weights: 1/Sdev) and BOF7 as Y-variables; keep out sample 9.
Interpret the model and look for outliers. How many components should we
use? Re-calibrate with an appropriate validation method when you are sure
of the model. Compare RMSEP with the typical y-value levels. Can this
model be used to replace BOF7 with the cheaper X-measurements?
Save the model.
The B-coefficients may be used with new unweighted data. To study the
B-coefficients, use File - Import - Unscrambler Results.
Note!
It is NOT possible to import files or results in the training version of The
Unscrambler. This feature is only available for The Unscrambler full
version.
Specify the model you just made and select the B-coefficients for the optimal
PC. Read also B0. Will predictions using the B-coefficients give different
y-values from those using scores and y-loadings?
2. Read the data again and pretend that they are new samples. (This is of course
only for this exercise purpose!) Select Task - Predict, specify your samples,
variables and model. Predict with one PC. Note the predicted Y-value for the
first sample.
Now predict using the B-vector instead. Append a new variable into the Editor.
Mark columns 1-3 and 12. Select Modify - Compute and type for example:
V12 = -27.5 + 10.4*V1 + 1.59*V2 + 0.0868*V3
using of course your own B-values.
Alternative procedure:
First, replicate the row containing the B-coefficients in the imported editor (by
using Edit – Copy then Paste 8 times). You now have as many B-coefficient
rows in this window as samples in the Water data table. Select all B-coefficient
columns, and drag and drop them to the Water editor (choose “Insert as 4 new
columns”). Then insert a new variable, and use Modify – Compute to perform
the manual predictions by typing in for example:
V1 = V2*V6 + V3*V7 + V4*V8 + V5
where V1 is the newly inserted variable, V2-V5 are the Bs (including B0, which
enters without a multiplier) and V6-V8 are the Important Measure variables.
Is Y for the first sample equal to the predicted result from The Unscrambler?
How would you indicate the predicted result, given the validation results?
Summary
In this data set one PC is enough to explain 81% of the validated Y-variance, giving
an RMSEP of 8.7 (with full cross-validation) that we can compare to the values of
BOF7 (which range from 15 to 70). RMSEP is then equal to some 23% of the mean
value of BOF7. The regression equation obtained from the B-coefficients is:
y = -29 + 10.25*tot-P + 1.63*Cl + 0.087*Susp
The samples are fairly well distributed. There is no obvious time trend. All three
X-variables have positive loadings.
Prediction using the B-coefficients gives the same predicted BOF7 values as a
prediction using scores and y-loadings, of course. The drawback of using the B-
coefficients for prediction is that we do not get outlier statistics or uncertainty
measures. The predicted value of sample 1 is 26.9 ± 4.3 if we use the prediction
deviation to build an approximate confidence interval.
Replacing BOF7 with the cheaper measurements indeed has a clear cost saving
potential, but the model needs to be improved, for instance by adding more
calibration samples. It is not yet precise enough, but we might agree that it shows
potential?
It would be of great interest to carry out these tests by some faster instrumental
procedure, preferably one with a smaller uncertainty.
We also know that the present set of jet fuel samples come from four different
refineries. There is an assumption that the four refineries produce identical jet
fuels. If this is true, there should be no differences between these four sub-types of
jet fuel with respect to freezing point depression. And we should be able to make
one global model for all four refineries.
This data set has kindly been provided by Chevron Research & Technology Co.,
Richmond, CA, USA. The original problem context has been slightly modified
(conceptually as well as numerically) for the present educational purpose.
Data Set
The file Fuel contains two variable sets:
Spectr: 65 sample spectra with absorbance readings at 111 wavelengths in the
NIR range (1100 to 1320 nm).
Freeze: Freezing point determinations (°C) for the 65 samples.
Tasks
Try to make a model that predicts freezing point from spectra, with the aim of
replacing the cumbersome reference method with fast and cheap NIR-
measurements. Check if there are significant differences between samples from
each of the four refineries.
How to Do it
1. Always plot raw data to get acquainted with the data matrices. Before you do
that, you may change the data type for variable set Spectr to “Spectra”
(Modify – Edit Set etc.).
You may notice that variable No. 2 is constant. You needn’t delete it, it will
be enough to keep it out of the analysis.
3. Make a PLS1 model. Start with leverage correction. Try to find outliers.
Look for groups. Study the X-Y Relation outliers plot to identify outliers.
4. Refine the model by removing a few outliers at a time. Hint: We found three
obvious outliers (using the X-Y Relation Outliers plot), and two that may be
from a specific refinery (using the Influence plot: Plot – Residuals –
Influence Plot, Components: 1-5, Variables: X). See how the resulting
validation Y-variance improves without local increases. How many PCs
should we use? Check the RMSEP. Is the prediction error significantly
lower when the outliers are removed?
5. Also study the 1-vector Loading Weights plots, to see at which wavelengths
there is freezing point related information.
6. Re-calibrate with full cross-validation. Does this change the RMSEP and the
number of PCs drastically?
8. Make conclusions: Can this model be used to replace the reference method?
Compare RMSEP with the uncertainty in the reference method. Can we
expect a smaller prediction error than the analytical error? Suggest an
approach to get a better model.
Summary
We do not scale the X-data because they are of the same type (absorbance values).
The PCA model shows that there is strong spectral structure. About 99.5% of the
X-variance is explained by 4 PCs. The score plot for PC3 versus PC4 is the only
one that shows clear groupings, however, both in PCA and PLS. Observe that we
see this grouping only at higher order components. This grouping is significant - of
what?
There does not seem to be a strong enough relationship between the NIR spectra
and the freezing point to constitute even a first working prediction model, since
only about 55-65% of validated Y-variance is explained. This holds no matter what
we do to try to make a good model - and it is indeed the same message we get from
whichever validation approach we choose.
The RMSEP is about 14 °C at its lowest, using leverage correction. From a data
analytical point of view this is about what can be expected, since RMSEP is about
twice the analytical error. RMSEP is composed of uncertainty both in the X- and
the Y-measurements. However the R&D lab needs a smaller prediction uncertainty
than this. The only chance to get a lower RMSEP would be with better reference
measurements, for example by finding a more accurate method or by using a higher
number of replicates.
Samples 2, 46 and 47 are obvious outliers, easily found in the X-Y Relation outliers
plot. This plot also indicates that the regression does not work very well. Samples
37 and 63 are also outlying. They do not have extreme Y-values and may originate
from a certain, undisclosed refinery. If they are also removed, the residual Y-
variance curve is now almost smooth; the RMSEP does not get much better
however. Using leverage correction, the RMSEP curve flattens out after 11 PCs,
and using full cross validation the optimal number is 12.
Problem
A paper mill monitors the quality of its newsprint product by applying ink to one
side of the paper in a certain, heavily standardized layout pattern. By measuring the
reflectance of light on the reverse side of the paper, a reliable, practical measure of
how visible the ink is on the opposite side is obtained. This property, Print
through, is an important quality parameter for paper mills in general. The paper is
also analyzed with regard to several other production parameters as well as other
pertinent raw material characteristics.
The paper mill wants to make a model, which can be used for quality control and
for production management. For example, it may be possible to rationalize the
quality control process by reducing the number of significant parameters measured.
Preferably, it should also be possible to predict the Print through for new, not too
dissimilar, paper compositions.
Data Set
The data is stored in the file Paper. You are going to use the sample sets Raw data
(106 samples) and Prediction (12 samples), and the variable sets Process (15
variables) and Quality (1 variable). The variables in the Process (X) set are shown
in Table 10.3.
The samples were collected from the production line over a considerable time
interval, in the hope that the measurements would span all the important variations
in the newsprint production. The sample names show in which sequence the
samples were collected. The samples are sorted by increasing levels of the variable
Print through. In order to check the model, twelve new samples are stored in the
test set Prediction. These are used to check how the model performs for prediction.
Tasks
1. Find outliers in the sample set Raw data and remove them.
2. Reduce the model by also finding the less important variables and make a
new model without them.
3. Predict Print through for the new samples in the sample set Prediction.
4. Try to solve the same problem using PCR instead of PLS.
How to Do it
1. Make an initial PLS model
Read the data from the file Paper. View the statistics, plot the raw data, and
decide on an appropriate preprocessing and weighting. Make a quick initial PLS
model with leverage correction using the sample set Raw data, the variable set
Process as X-variables and Quality as Y-variable.
Remove the outliers, always only one, or a few at a time, and make a new
model. When you have reached the final model, plot the variance and decide
how many PCs to use.
How much of the X-variance is explained? How much of the Y-variance does
the first PC explain? Check the RMSEP!
Compare the new regression coefficients to those of the previous model, and
check whether they are similar.
Close the viewer with the latest model and answer No to Save.
Summary
The X-variables should be standardized because they are all in very different units
and value ranges. Only some outliers are shown in the list of warnings, compared to
those you find in the X-Y Relation outliers plot (T vs. U scores). The so-called
“relation outliers” (discrepancy between X and Y) show up in this plot only.
Samples number 105 and 106 are clear outliers and are removed from the
calculations first. Sample 104 seems outlying too, and is kept out of calculation
from the final model. You may argue that sample 98 is an outlier also. There are no
definite rules, it is your experience that counts.
In the refined model (all outliers removed), only 18% of the variations in X are
modeled by the first PC. However, these 18% explain all of 84% of the variation in
Y. Obviously, there is a lot of irrelevant X-information in these data. RMSEP is
around 3.4 using 1 PC, 2.8 using 2 PCs, and 2.6 using 3 PCs. The samples are also
well distributed over the score plot in PC1 vs. PC2, indicating that important
variations are well spanned.
Variables Weight, Opacity, Scatter and Filler (however they are measured) co-
vary most, in a negative way, with Print through. This makes sense: a high
weight/sq. m. makes the paper thicker and thus less transparent, light scatter is
reflected light, opacity is by definition the opposite of Print through, and filler is
added to counteract Print through. The seven largest regression coefficients belong
to the four above mentioned variables, with the addition of Brightness, Density and
Ink. The rest of the variables have very small coefficients, and we may thus feel
motivated to remove them in the variable reduction context.
In the resulting reduced model using 2 PCs, the predicted Y-values correspond well
with the measured ones, giving an RMSEP of 2.6. Cross validation gives an
RMSEP of approximately 2.7, (depending on the specific segment selection). This
is the more conservative error estimate and is still satisfactory compared to the
measurement range of Print through (30-69).
Some samples in the test set Prediction are not well predicted. These are outlying,
probably due to X-errors. Outliers in the prediction set can be detected by their
deviations, or because they turn up in the list of outlier warnings.
Using PCR, we need no less than 6 PCs to get an RMSEP of about 3.3 on the data
set with all variables in, but the outliers removed. The outliers were very difficult
to find without access to X-Y Relation outliers plot. (In PCR T scores = U scores.)
Only sample 98 showed up in the score plot for PC1 vs. PC2. The reduced model,
with fewer variables and no outliers, gave a RMSEP of 2.7 using 6 PCs. The PCR
model resulted in pretty much the same interpretation as the PLS model, but
because there were now more PCs, the patterns were more difficult to see.
Various expressions of model fit and prediction ability are used to assess how good
a model is. Every solution must still satisfy the basic minimum rules for sound data
modeling though: no outliers and no systematic errors. You should also -
preferably always - be able to interpret the model relationships, ensuring that they
comply with your application knowledge.
11.1.1 Scores
As a typical example, in Figure 11.1 samples 25 and 26 are situated dramatically
apart from all the other samples. They contribute unreasonably to the overall
model; PC1 is actually used almost exclusively to describe the difference between
these two objects and all others. They may of course still belong with the others,
but chances are generally (very) high that they are significant outliers. As always
the specific decision as to the status of such objects is problem-dependent.
Using name markers may show unexpected sample locations. For example, an “A”
sample in the B-group may be an outlier, or it may represent a simpler labeling
mistake, see Figure 11.2.
11.1.3 Residuals
Figure 11.4 shows Y-residuals versus a run order listing of the samples. Large
residuals in this plot indicate possible outliers. Three samples have been flagged.
Figure 11.4 - Individual Y-residuals for each object (in run order listing)
Figures 11.4 and 11.5 are from the same data set. The same objects turn up in both plots
as possible outliers. There are always many different reflections of an abnormal
multivariate behavior. The objective is to have been acquainted with all of them in
your training, so as to use them more and more expertly later on.
extreme. Checking the raw data gives valuable information to be used later in the
analysis.
If you remove an “outlier” that is really only an extreme end-member object, the
model may not get better. There may be no change in the prediction error or it may
even increase. Extreme objects actively help to span the model. Extreme end-
member samples are easy to spot. They occupy the extreme ends of the T vs. U
regression lines, while outliers lie off this line (perpendicular to the model).
distinction is much more difficult to appreciate if you only use numerical outlier
warnings, leverage index, etc. Always study the score plots carefully (T-U). In
regression modeling, X-Y Relation Outliers plots are really all you need – for the
modeling purposes. There may of course also be problem-specific reasons to dig
into how the objects are distributed in the X-space in a specific PLS-solution, via
the T-T scores plots, i.e. for interpretation purposes.
The octane application in exercise 12.1 is a good example. The outliers were
different from the other samples; they contained alcohol, resulting in clearly
different spectra. If we include more such samples, there are two possible
scenarios:
1. They would not seem so different in the score plot, but there is a great risk that
the model will be too inaccurate.
2. There will be two clearly different groups that will make an inaccurate model.
After removing these two outliers, a distinct grouping appeared in the score plot,
which consists of samples at different general octane number levels (called Low,
Medium and High), see Figure 11.6. The L, M, and H octane samples described in
this plot are a good example of the interpretation use of a T-T score plot from a
PLS-solution.
The prediction error in this model was satisfactory; RMSEP was approx. 0.3 units
to be compared with the measurement range of 87 - 93 octane numbers, i.e. a
relative error of about 0.5%. Apparently these groups are “harmless” in the greater
picture. This T-T grouping is in fact duplicated in the Y-space, which is why in the
more appropriate T-U plot they are all nearly perfectly aligned along the relevant
regression lines. This is a very important distinction. If you wish, return to exercise
10.1 for a moment and look at the appropriate T vs. U plots.
Figure 11.6 - PLS solution, T-T score plot, showing distinct X-groupings
Figure - Individual Y-residuals versus predicted Y (oct-1 model, octane, 2 PCs)
If for example, the upper and lower bounds diverge and the pattern is funnel-like,
as in Figure 11.9, the error is clearly different in different parts of the experimental
range. We may then try to transform the Y-variables in appropriate ways, for
example, to counteract the problem observed, and make a new model. Iterate as
need be.
If for example, the upper and lower bounds are parallel but not horizontal, see
Figure 11.10, there is a systematic error, in this case a trend. This could be because
a linear term is missing from the model, or it could indicate a scatter-effect not
(yet) corrected for.
All these residual inspection plots are very useful in telling you that your current
model is still not complete. What should you do about it then? Since this is
problem-dependent, you must use your application knowledge and try to find the
reasons.
Figure 11.11 - Normal probability plot of Y-residuals (OCT-0 model, octane, 4 PCs)
The objects in Figure 11.11 are indeed found distributed along a straight line
through the point (0,50). This is a good indication that the model has taken care of
all the structure in the data and there are no outliers left, in contrast to Figure 11.12.
Figure 11.12 - Normal probability plot of Y-residuals (will-0 model, Yield, 1 PC)
11.3 Transformations
The field of transformations, the major class of pre-processing, is very varied and
complex. It is an area with a lot of topical interest and activity. We can only touch
upon some of the basics here, logarithmic, spectroscopic, MSC (Multiplicative
Scatter Correction), differentiation (computing derivatives), averaging,
normalization, but this should serve as an introduction. Bilinear models basically
assume linear data relationships. It is, however, not necessary that Y displays a
linear relationship to each X-variable. Y may be a linear function of the
combination of several non-linear X-variables (linear combination of non-linear
X-variables, or a non-linear combination of suitably transformed X-variables), see
Figure 11.13.
Figure 11.13 - Nonlinear X-X relation, linear X-Y relation (highly schematic)
All the transformations in the following sections are available from the Modify
menu in The Unscrambler.
Figure 11.14 - Histograms of the variable DispAbil before (skewed raw data) and
after logarithmic transformation
If you have no prior knowledge about the data, study histograms of the variables. If
the frequency distribution is very skewed, a variance stabilizing transformation
such as the logarithmic function may help (Figure 11.14). Here the skewness was
reduced from 1.02 to -0.14 after the logarithmic transformation.
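A quick way to check this effect on your own variables (a sketch using scipy, with a made-up right-skewed variable):

```python
import numpy as np
from scipy.stats import skew

x = np.random.default_rng(2).lognormal(mean=1.0, sigma=0.8, size=500)   # strongly right-skewed
print(round(skew(x), 2), round(skew(np.log10(x)), 2))   # skewness before and after log transform
```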
Transmission data are often non-linear, so they are “always” transformed into, e.g.
absorbance data, using a modified logarithmic transformation (see below). Diffuse
reflectance data are “always” transformed into Kubelka-Munk units, but exceptions
may be around in more problem-specific cases.
Reflectance to Absorbance
We shall here assume, without loss of generality, that the instrument readings R
(Reflectance), or T (Transmittance), are expressed in fractions between 0 and 1.
The readings may then be transformed to apparent Absorbance (Optical Density)
according to Equation 11.1.
Equation 11.1        Mnew(i,k) = log(1 / M(i,k))
Absorbance to Reflectance
An absorbance spectrum may be transformed to Reflectance/ Transmittance
according to Equation 11.2.
Reflectance to Kubelka-Munk
A reflectance spectrum may be transformed into Kubelka-Munk units according to
Equation 11.3.
Absorbance to Kubelka-Munk
In addition the apparent absorbance units may also be transformed into the
pertinent Kubelka-Munk units by performing two steps: first transform absorbance
units to reflectance units, and then reflectance to Kubelka-Munk.
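A minimal sketch of these unit conversions, written with the standard spectroscopic formulas assumed here (Equation 11.1 for apparent absorbance, its inverse for reflectance, and the usual Kubelka-Munk function), could look as follows:

import numpy as np

def reflectance_to_absorbance(R):
    # Equation 11.1: apparent absorbance (optical density) from fractional R or T
    return np.log10(1.0 / R)

def absorbance_to_reflectance(A):
    # Inverse of Equation 11.1 (standard form, assumed): R = 10**(-A)
    return 10.0 ** (-A)

def reflectance_to_kubelka_munk(R):
    # Standard Kubelka-Munk function (assumed form): (1 - R)**2 / (2R)
    return (1.0 - R) ** 2 / (2.0 * R)

def absorbance_to_kubelka_munk(A):
    # Two-step route described in the text: absorbance -> reflectance -> Kubelka-Munk
    return reflectance_to_kubelka_munk(absorbance_to_reflectance(A))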
Other types of measurements may also suffer from similar multiplicative and/or
additive effects, such as instrument baseline shift, drift, interference effects in
mixtures, etc. Such effects can also be successfully treated with MSC. In The
Unscrambler the MSC transformation can be done from the Modify - Transform
menu.
The idea behind MSC is that these two undesired general effects, amplification
(multiplicative) and offset (additive), should be removed from the raw spectral
signals to prevent them from dominating over the chemical signals or other similar
signals, which often are of lesser magnitude. If we are able to eliminate (most of)
these effects before multivariate calibration, we may well save one or more
PLS-components in our modeling of the relevant Y-phenomena. This will in general
enable more precise and accurate modeling, based on the cleaned-up spectra. MSC
can be a very powerful general pre-processing tool.
A short description of how MSC works on a given data set is given below.
• All object spectra are plotted against their average spectrum; see Figure 11.15.
Note that in The Unscrambler, this plot is automatically generated by doing
Task – Statistics, viewing the results, then choosing Plot – Statistics – Scatter.
Figure 11.15 - All individual object spectra plotted against their average
spectrum
• A standard regression is fitted to these data, with offset A(i) and slope B(i), called
the common offset and the common amplification respectively. The index i runs
over all individual objects in the data set.
• The rationale behind MSC is to compensate for these so-called common effects,
i.e. to correct for the amplification and/or the offset (see the options below).
The MSC function replaces every element in the original X-matrix according to one
of the equations below.
Common offset only corrects additive effects, while common amplification only
corrects multiplicative effects.
In practice you select a range of X-variables (spectra) to base the correction on.
One should preferably pick out a part that contains no clear specific chemical
information. It is important that this “MSC-base” should only comprise background
wavelengths, in so far as this is possible. If you do not know where this is, you may
try to use the whole set of variables. The larger the range, the better, but this also
implies a risk of including noise in the correction. Or worse, we may thus
accidentally include (some of) the “chemical specific” wavelengths in this
correction. Omit test samples from the base calculation.
The MSC-base is calculated on this selected X-range and the MSC coefficients (A
and B) are calculated accordingly. The whole data set, including the test set, will be
corrected using this MSC base. Finally you make a calibration model on the
corrected spectra. Before any future prediction, the new samples must of course
also be corrected using the same MSC-base.
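The whole procedure can be summarized in a minimal Python sketch (not The Unscrambler implementation), assuming the spectra are stored row-wise in an array X and that base_cols selects the MSC-base wavelengths:

import numpy as np

def msc(X, mean_spectrum=None, base_cols=None):
    # Returns (X_corrected, mean_spectrum). Pass the calibration mean_spectrum
    # when correcting test or prediction samples with the same MSC-base.
    X = np.asarray(X, dtype=float)
    if mean_spectrum is None:
        mean_spectrum = X.mean(axis=0)        # the MSC-base (average spectrum)
    base = slice(None) if base_cols is None else base_cols
    X_corrected = np.empty_like(X)
    for i, x in enumerate(X):
        # Fit x ~ a + b * mean_spectrum over the selected base wavelengths
        b, a = np.polyfit(mean_spectrum[base], x[base], deg=1)
        X_corrected[i] = (x - a) / b          # remove common offset, then amplification
    return X_corrected, mean_spectrum

Correcting the calibration set first and reusing the returned mean spectrum on new samples mimics the "same MSC-base" requirement mentioned above.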
The A and B coefficients may in some cases themselves carry a certain kind of
signal information. In certain advanced applications these vectors may in fact be
added as extra variables in the X-block. In exercise 12.2 you will try out MSC in
practice. In addition to the original concept, still other uses have since been found
for MSC, which is consequently now also known as Multiplicative Signal Correction.
11.3.4 Differentiation
The first or second derivatives are common transformations on continuous function
data where noise is a problem, and are often applied in spectroscopy. Some local
information gets lost in the differentiation but the “peakedness” is supposed to be
amplified and this trade-off is often considered advantageous. It is always possible
to “try out” differentiated spectra, since it is easy to see whether the model gets any
better or not. As always, however, you should preferably have a specific reason to
choose a particular transformation. And again, this is really not to be understood as
a trial-and-error optional supermarket - experience, reflection, and more experience!
The first derivative is often used to correct for baseline shifts. The second
derivative is an often-used alternative for handling scatter effects; MSC is the other
main option, addressing the same effects.
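A common practical choice for computing such derivatives is the Savitzky-Golay filter (not named in the text above); a minimal sketch, with simulated spectra stored row-wise, could be:

import numpy as np
from scipy.signal import savgol_filter

# Simulated stand-in for 10 spectra measured at 101 wavelengths
X = np.random.default_rng(2).random((10, 101))

# Smoothed first and second derivatives along the wavelength axis
first_deriv = savgol_filter(X, window_length=11, polyorder=2, deriv=1, axis=1)
second_deriv = savgol_filter(X, window_length=11, polyorder=2, deriv=2, axis=1)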
11.3.5 Averaging
Averaging is used when the goal is to reduce the number of variables or objects in
the data set, to reduce uncertainty in measurements, to reduce the effect of noise,
etc. Data sets with many replicates of each sample can often be averaged over all
sets of replicates to ease handling regarding validation and to facilitate
interpretation. The result of averaging is a smoother data set.
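A minimal sketch of replicate averaging, assuming a hypothetical table with a sample_id column identifying the replicates, could be:

import pandas as pd

data = pd.DataFrame({
    "sample_id": ["A1", "A1", "A1", "A2", "A2", "A2"],
    "x1": [0.11, 0.12, 0.10, 0.21, 0.19, 0.20],
    "x2": [1.02, 0.98, 1.00, 1.51, 1.49, 1.50],
})

# One averaged (smoother) row per sample, in place of three replicates
averaged = data.groupby("sample_id").mean()
print(averaged)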
11.3.6 Normalization
Normalization is concerned with putting all objects on an even footing. So far we
have mostly treated the so-called column transformations, i.e. pre-processings or
transformations that act on one column-vector at a time (single-variable
transformations).
The row sum of all variable elements is computed for each object. Each variable
element is then divided by this object sum. The result is that all objects now display
a common size - they have become “normalized” to the same sum area in this case.
Normalization is a row analogy to column scaling (1/SDev).
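A minimal sketch of this row-wise normalization to a constant sum, using simulated non-negative data, could be:

import numpy as np

# Simulated stand-in for an objects-by-variables data matrix (non-negative values)
X = np.abs(np.random.default_rng(3).random((5, 8)))

row_sums = X.sum(axis=1, keepdims=True)
X_normalized = X / row_sums      # every object (row) now sums to 1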
There are several other data analysis problems where normalization can be used in
a similar fashion, even though the physical or chemical reasons for the phenomena
compensated for may be very different from the chromatographic regimen
mentioned above. This scenario is an analogue to the earlier mentioned augmented
MSC-usages for other analogous purposes.
11.4 Non-Linearities
As mentioned above, Y may be a linear function of the combination of several non-
linear X-variables and thus present no problem for bilinear regression. In many
cases PLS can handle non-linearities up to second-order T-U relationships by using
a few more PLS-components. How do we see if there are non-linearities in the
data? There are many plots that reveal this:
Signs of Non-Linearities
The objects show a curved pattern in the X-Y Relation Outlier plots. Residuals have
a curved pattern, U-form, S-form, or arch form. Y-residuals may be plotted against:
• Objects sorted by Y-value, or in run order, or some other suitable order that
may suit the residual display (problem-dependent, of course)
Figure: X-Y Relation (T vs. U) plots for the Alcohol data set; PC(X-expl,Y-expl): 1(89%,49%) and (Y-var, PC): (Ethanol,2)
The “inverse” S-shape of the data is typical in data sets where non-linearities are
present.
• Add for example a second-degree term in the model (i.e. square or cross terms of
the variables, e.g. x1*x1 and x1*x2). This gives models that are easier to interpret
and you avoid mixing up non-linearities with the general error. These additional
model terms are automatically generated in The Unscrambler by clicking
Interaction and squares in the Modify – Edit Set dialog.
Do not use this approach unless you know well what you are doing, are willing
to take full responsibility for the data analysis results, and the application is likely
to have genuine second order characteristics, because it implies a large risk of
overfitting!
• An alternative is to first compute T (the scores) and then add square and cross
terms of these score vectors, for each PC. This expanded scores matrix is then
used alongside the “ordinary” X-variables in the regression, with added columns
such as t1*t1, t1*t2; a small sketch of this score expansion is given after this list.
The first PCs are the most stable, so it is safer to use the scores from only a few of
the first PCs. Using more terms again increases the risk of overfitting. (To perform
traditional MLR-regression with The Unscrambler, run PLS or PCR with the max.
number of PCs.)
• If preprocessing or rank reduction do not help, and the model is not good enough
for your needs, try a non-linear modeling method instead. Examples are non-
linear PLS, spline-PLS or a Neural Network. Since very few applications suffer
from such severe non-linearities that the above tricks are in vain, it is beyond the
scope of this book to go into these. However we will give a few hints on Neural
Networks below.
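A minimal sketch of the score expansion referred to in the list above (an assumed workflow, not a built-in Unscrambler feature) could be:

import numpy as np
from itertools import combinations_with_replacement

def expand_scores(T, n_use=2):
    # T: objects-by-components score matrix; only the first n_use (stable) PCs are used
    T = np.asarray(T)[:, :n_use]
    cross_terms = [T[:, i] * T[:, j]
                   for i, j in combinations_with_replacement(range(n_use), 2)]
    # Columns for n_use = 2: t1, t2, t1*t1, t1*t2, t2*t2
    return np.column_stack([T] + cross_terms)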
An interesting approach to correct this was developed by the Danish Meat Research
Institute and is implemented in the Neural-UNSC package as an add-on module to
The Unscrambler. Data are preprocessed by ordinary PCR or PLS, giving you
access to interpretation of the main variations, outlier detection, diagnostics and
transformations. If there are clear signs of remaining non-linearities, the PLS-score
matrix is used as input to the Neural Network. This gives a small network that is
fast to train, and guarantees at least the PCR or PLS solution as a baseline.
Neural-UNSC also offers test set validation, which makes modeling quite safe.
Noisy Variables
A model may likewise be improved by deleting especially noisy variables. These
usually make little contribution to the model, with small loadings and loading
weights. Plot the loadings and/or the loading weights for the relevant components
in the same plot and find these variables. Then make a new model without these
variables, using Task – Recalculate without marked, or manually selecting them
in the Keep Out field of the regression dialog, and see if the model gets better.
Remove variables by weighting them by zero or deleting them completely from the
raw data matrix.
It may be difficult to study the results for many components until some experience
with large data sets has been acquired. An alternative is to also study the
B-coefficients (regression coefficients), since there is only one B-vector for a
model with A relevant components. A small B-value may be a sign of a noisy or
unimportant variable, which you may try to delete. Note the risks when interpreting
the B-vector however! A small B-value for a variable may still be due to large
measurement values, or due to interaction between several variables. So as a
precaution, always re-model and check that the results really are improved.
If the data are scaled by 1/SDev, you should study the Bw-vector instead, which
takes into account both large and small variable values.
1. Start with PLS2 if there are several Y-variables. It will give a good overview of
the data without too much modeling work. Use leverage correction for
validation in the initial rounds, unless you have data suitable for test set
validation. This is much faster than cross validation and gives exactly the same
model, except for the estimate of the prediction error, which you will not trust at
this stage anyhow.
2. Look for outliers. Correct errors. In the case of “true” outliers, remove them, a
few at a time, always starting with the ones that appear first in the earliest PCs.
Study the validation variances to check that the model is improved as you
remove objects, but do not use RMSEP as the only indicator. It is time to start
developing a more holistic feel for a multivariate calibration model.
3. Check the residual variances in X. Variables with small variability can usually
be omitted. There are also other ways to inspect the variable set for “outliers”.
4. If the model still seems strange, for instance the scores pattern is very different
from what you expect, and there are no obvious outliers, consider data
transformations. Check whether the pertinent score plots look better after the
transformation, usually using the X-Y Relation outliers plot. Use RMSEP to
compare the quantitative prediction ability of the different models.
5. Try separate PLS1 models for each variable. In most cases they will have better
prediction ability than a global PLS2 or PCR model. If PLS2 does not give good
enough results, you may model each Y-variable separately by PLS1 to find
Y-variables that are badly modeled and remove these.
6. Try to add second order terms (interactions and squares) if the data appear
(very) non-linear and the application is likely to have second order
characteristics.
8. Always remember that the goal of all the “rules” in this book is just to provide
a safety net. The objective is to make you a self-confident, creative data analyst
as soon as possible.
9. The only way to become just that: a self-confident, creative data analyst is to
start performing as many PCA (PCR) and PLS-R multivariate calibrations as
indeed possible, so please continue with chapters 12-13!
When data are noisy, we must in general take a more relaxed attitude towards both
the fraction of Y-variance that can be modeled and the obtainable RMSEP.
Consider a set of geological samples (objects) characterized by a relevant set of
geochemical variables. These may include both the so-called major element oxides
(compositionally in the % range) and trace element concentrations (in the ppm
range). An average measurement error of some 5% or more is not unusual here;
typical ranges would cover 1-15%.
Add to this a significant sampling error (e.g. localization error of the individual
geological samples from an inhomogeneous rock type) which will also lead to a
significantly reduced signal to noise ratio. In such an imprecise data situation, there
are quite different quantitative goals for the fraction of explained X and Y variance,
as well as of the obtainable RMSEP. It may for example be quite satisfactory to
model X and predict Y with an explained variance of the order of, say, 50-60%.
Process chemometric applications are another area where inherently noisy data may
occur, and there are many other analogous examples in practical multivariate data
applications: biological data, environmental data, survey data. All of which tells us
to bear in mind the simple distinction between precise and noisy data.
If the residual variance increases all the time (in all PCs), the model is certainly
wrong! Maybe a transformation is required, or maybe there are no systematic
relationships between X and Y. If PLS2, try to run PLS1 for each Y at a time, and
see if this gives any better results.
If the Y residual variance increases, it means that the model does not describe the
relationships between X and Y, quite the opposite. In PLS you should then of
course not even inspect the loading-weight plots or try to interpret your model.
However, in PCR you model the X-structure as it is, all the way. In the PCR
situation, one may in fact, therefore, see such an increase in the residual Y-variance
for the first components, which does not necessarily signify problems. The first
components model the dominant X-structure no matter what, so more PCs are
needed to also model the Y-variations in this case. Such a model may eventually
end up being acceptable, though necessarily with more components than its
alternative PLS-cousin.
It may also be the result of all object residuals having a similar size. The rule here
is very clear: use the first minimum (or the left-most part of the V-shape).
Why should you ever choose fewer PCs if it gives a larger residual variance? It is
because your validation set may not necessarily be totally representative of the
future prediction samples. Choosing fewer PCs gives a more robust model, which
is less sensitive to noise and errors, especially the unavoidable sampling errors. It is
the representativity of the validation, i.e. the representativity of the prediction error
estimation, which is at stake here, not the minimum RMSEP as such.
It may be a great help to look at the pertinent 1-D loading-weights and
B-coefficients. When this curve is noisy in some particular components, this part of
the model is unstable. You should then select fewer components, where the loading
weights are not so noisy.
However, if you get a decrease in the residual variance but the prediction error is
too high, then you may interpret the loading-weight plots as an indication of real
relationships. (The RMSEP for a variable is approx. the square root of the residual
variance divided by the weights used. RMSEP is the error in Y in its original units.)
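The RMSEP itself is nothing more than the root mean square difference between predicted and reference Y over the validation samples; a minimal sketch of this definition:

import numpy as np

def rmsep(y_predicted, y_reference):
    # Root mean square error of prediction, in the original units of Y
    y_predicted = np.asarray(y_predicted, dtype=float)
    y_reference = np.asarray(y_reference, dtype=float)
    return float(np.sqrt(np.mean((y_predicted - y_reference) ** 2)))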
A low explained variance of course indicates an unsatisfactory model and the
loading-weight plot corresponding to the first components only shows you the most
general relationships related to that degree of explanation. For instance an
explained variance of 10% given by the first two PCs means that only 10% of the
variations are explained and that these variations follow the pattern of the loading-
weight plots for PC1 & PC2.
Cross Validation
One must always select the pertinent cross validation options very carefully. As a
telling example, if you have selected full cross-validation and one singular
sample fits very badly with the model, this may make the total prediction error look
much worse than with this particular (yes, you guessed right) outlier policed away.
Running a tentative model with leverage correction will give you an indication of
the prediction error, without the risk of making this type of mistake in selecting
cross validation options. The model will be the same as with cross-validation, but
the prediction error may be too optimistic (i.e. too low).
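For readers who want to see the mechanics, a minimal Python sketch of full (leave-one-out) cross-validation for a PLS1 model, using scikit-learn rather than The Unscrambler and simulated data, could be:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Simulated calibration data: 20 objects, 15 X-variables, one Y-variable
rng = np.random.default_rng(4)
X = rng.random((20, 15))
y = X[:, :3].sum(axis=1) + 0.05 * rng.standard_normal(20)

pls = PLSRegression(n_components=3)
y_cv = cross_val_predict(pls, X, y, cv=LeaveOneOut())   # each sample predicted when left out
rmsep_cv = float(np.sqrt(np.mean((y_cv.ravel() - y) ** 2)))
print("full cross-validation prediction error:", round(rmsep_cv, 3))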
Variance/PC
For PCA and PCR: PC1 is responsible for the largest fraction of the X-variance,
PC2 takes care of the second largest fraction, and so on.
For PLS: PLS-component 1 takes care of the largest fraction of the Y-variance
modeled, which normally decreases with each additional component, and so on, no
matter what fraction of the X-variance is modeled. There are several instructive
examples from recent chemometrics with only some 10% (or even less) of the
X-variance in use for the first component, in which PLS was indeed able to isolate
just the right minuscule fraction of the X-variance, for example accounting for 70-
80% of the Y-variance alone. Examples in which the second PLS-component
accounted for a much larger fraction of the X-variance for a much more modest
additional Y-fraction can also be found.
Also with small data sets you may sometimes observe a similar feature, i.e. that a
“later PC” is accounting for a larger fraction of the variance than an earlier one.
When working with small data sets there are three basic causes for this:
1. Small data sets where the raw data are distributed in a certain way: for instance,
cigar-shaped along the dominant direction(s), but with a circular cross-section.
2. One of the variables in your data set is somewhat special, for example nearly
constant.
3. The data set is so small that the internal program iterations are oscillating
between two fixed positions.
There are many other “strange” effects that may occur when analyzing small data
sets. Mostly this is due to the fact that a very small data set displays a rigid,
irregular data structure compared to larger data sets. For example a data set with
only eight objects can barely support two components. There is absolutely no point
in aiming for complete modeling, validation and prediction on such a small data
set. It will only work if all 8 points line up very regularly in a linear way.
Always inspect your raw data (for example by PCA) before you perform any
regression modeling. You can avoid many frustrations by deciding at the outset that
a particular data set cannot really support multivariate modeling at all.
Scores (X-Y Relation outliers): may show outliers, subgroups, and non-linearities.
Regarding the Predicted vs. Measured Y plot: it is bad form indeed to use the
Pred/Meas plot as a modeling tool, for instance using this plot to spot outliers. You
must aim at developing sufficient modeling experience to “catch all outliers”, and
indeed be able to do all your modeling before this last plot is ever invoked for the
evaluation of the established model’s prediction strength!
Now you are ready for graduating onto the next two sets of PLS-exercises, chapters
12-13.
Problem
Combustion of waste and fuel residuals containing chlorine is a well-known source
of emission of organic micro-pollutants, especially of chlorinated organics.
Sampling and analysis of micro-pollutants in flue gases is both complicated and
expensive. Both complexity and costs increase when the detection limit is lowered,
e.g. as regulations become stricter. This applies in particular to ultratrace toxic
components like polychlorinated dioxins (PCDD) and dibenzofurans (PCDF).
Data Set
The data are stored in the file DIOXIN, with variable sets X and Y. The
X-variables contain the measured levels of ten different isomers of chlorinated
benzenes in seven samples. The Y-variables consist of the measured TCDD-
equivalent levels in nanograms in the same seven samples.
Task
Make a PLS model that predicts TCDD (dioxin) from the associated chlorinated
benzenes. There may be a need for preprocessing.
How to Do it
1. Read the file DIOXIN and plot the data using a matrix plot. Use View -
Variable Statistics to have a look at min, max etc… for all variables. Also
plot histograms of each variable; Use View-Plot Statistics to display the
skewness of the distribution. Are the distributions skewed? Mildly, or
severely?
Note!
A skewness around ± 1.0 indicates a rather severe asymmetry.
2. Make two PLS1 models with leverage correction, one with and one without
standardization, computing up to 4 PCs. Study the explained Y-variance curve
for each of the models. Are these models satisfactory?
When variable distributions are very skewed, a Log transformation may often
help. Such data actually often follows a log-normal distribution, so it is a good
first try to log transform the data. Similar effects will occur for analogous
transformations, e.g. roots.
Go to Modify - Compute to compute a function. Make sure all samples and
all variables are selected in the Scope field and type X=Log(X) in the
Expression field. Save the data table with a new name, e.g. Dioxin Log.
Interpret the scores, then the loadings. Is there something wrong when all the
loadings are positive along PC1? Which X-variables are highly correlated with
the level of TCDD? Plot the regression coefficients for the optimal number of
PCs. If reduction of the X-matrix is needed, can some X-variables be left out
without decreasing the quality of the model?
3. Try this by making a new model where you keep the variables with smaller
regression coefficients out of calculation. You do this by marking the
unimportant variables on the plot, then selecting Task-Recalculate Without
Marked. Does the explained variance change significantly? Plot Predicted
vs. Measured to study the prediction ability of the two models.
4. Make a new model version now using full cross validation. Is it possible to
use test set validation on this data set? Is cross validation a problem? Is the
model still ok? Can we use RMSEP as a representative measure of the future
prediction uncertainty for this model?
5. Look at the predictive ability of the model. Plot Predicted vs. Measured.
Is this satisfactory?
Summary
The histograms show that most of the variables have rather skewed distributions,
which certainly will cause problems in the analysis. The models based on non-
transformed data were not very good; they needed 4 PCs to explain a fair portion of
the Y-variance.
Nothing is wrong when all loadings are positive along PC1. It just reflects the
overall correlation among the X-variables. Since we have standardized the
X-variables, all regression coefficients share a common scale and can be compared
to each other. Variables X3, X4, X5 and X6 have smaller coefficients than the
others and can be removed. The reduced model needs only one PC and has a lower
RMSEP than the previous one. You may also have tried to remove X1, which
contributes to the model mostly through PC2: it does not harm the model.
You may argue that candidates for variable deletion are those which strongly
correlate, since this perhaps should imply that one of them would be enough. This
is not however the case. The collinearity of several variables is often important to
stabilize a model. You can easily try this out of course, and see what happens.
The scores plot shows a satisfactory distribution of samples, with no clear groups,
but there are few intermediate samples – each sample has an influence on the
model. We can try full cross validation; this proves to work fine in this study. In
this case leverage correction also gave a good estimate for the prediction error. Test
set would probably not work here, since the data set is extremely small. We need
all the data present to make a model. Note however that it is actually possible to
carry out a meaningful PLS-regression if/when all samples participate in a more-or-
less homogeneous spread of the Y-space (and the corresponding X-data are well
correlated).
Problem
We want to generate a PLS-model to predict alcohol concentrations in mixtures of
methanol, ethanol, and propanol. This is to take place by using spectroscopic
transmission data (NIR) over 101 wavelengths from a specific instrument (Guided
Wave, Inc.). This particular application has also previously been used by Martens
& Næs in their textbook “Multivariate Calibration”. The calibration samples have
been constructed as a triangular mixture design.
All mixture samples have been carefully prepared in the laboratory. The mixing
proportions are used directly as reference Y-values. The inaccuracy in the reference
method thus only consists of the variations in sample preparation and volume
measurement.
The spectra have been transformed to absorbance units. No single wavelength can
be used alone, because of strongly overlapping spectra. NIR spectra of mixtures
may often exhibit scatter effects due to interference. This causes shifts in the
spectra, which is often a very big obstacle in the multivariate calibration game. But
this can be corrected for by MSC.
Data Set
The samples are characterized by different mixing proportions of the three alcohols
methanol, ethanol and propanol, always adding up to 100%. The three pure
alcohols are also included.
The data are stored in the file ALCOHOL. The total sample set, called Training (27
samples), and variable sets, Spectra (101 wavelengths) and Concentrations (three
Y-variables) are used in the calibration. The first 16 samples (A1 - A16) should be
used as calibration samples. Samples 17 - 27 (B1 - B11) should be used as test
samples to validate the model.
The sample set New may be used for prediction. The sample set MSCorrected
contains MSC transformed spectra for comparison.
Tasks
1. Make a full PLS2 model.
2. Look for outliers and consider transformations, if necessary.
How to Do it
1. Study raw data
Read the data file ALCOHOL by File - Open. Study the data in the Editor and
by View - Variable Statistics. Close the Editor with the statistics after you have
looked at the results.
Display the spectra in a matrix plot to have an initial look. Use Edit - Select
Variables and select the variable set Spectra. Choose Plot - Matrix, and select
as scope sample set “Training”.
Plot variables 1-3 as a general 3D Scatter plot. Use the total sample set Training
(27 samples) for the plot.
Do you recognize the design in the Y-data? If you have trouble with this, click
on a few points and study their coordinates (X = Methanol, Y = Ethanol,
Z = Propanol).
You may also try Edit - Options – Vertical Line or View – Viewpoint- Change,
or View – Rotate. Notice the contents of samples no. 1, 2, 3 and 17, 18, 19.
Close the Viewer when you are finished looking at the plot.
How many PCs do you expect to find in this mixture data set? Why?
Of course you should calibrate for a few more than that. Why?
To configure test set validation, check the Test Set box, then click Setup…;
choose Manual Selection, and select samples 17-27.
Change the warning limits for outlier detection: press Warning limits and change
fields 2 – 7 to a value of 5.5.
Now you can click OK. Study the model overview in the progress dialog.
Are there any outliers?
How many PCs seem enough or optimal? Why?
The only factors varied were the proportions of the three alcohols, and the
design was overall symmetric. Something is wrong here!
Make the residual variance plot active. Check the residual Y-variance for all
individual Y-variables by Plot – Variances and RMSEP, clicking “All” and
removing “Total”.
What is wrong in this plot?
Go back to the Editor and plot a few individual spectra (e.g. 4 - 10) as lines. If
necessary, use Edit – Options and choose to display the plot as curves. Use
View - Scaling - Min / Max to enlarge the detail in the picture between variables
20 and 60. Scale the ordinate axis to the range 0 - 0.5.
Do all spectra have the same baseline?
Select the sample set Training and variable set Spectra in the Scope field. The
correction method is Common Offset. We want to use variables 31 to 45 as the
basis for this correction; this is done by omitting all the other variables
(1-30, 46-101) in the selection field at the bottom of the dialog box. Also exclude
samples 17 - 27 because the correction should only be based on the calibration
samples. The test samples will still be corrected, since they are included in the
scope you have chosen. Click OK.
A dialog box pops up, asking you whether you want to save the MSC model.
Reply Yes, and give the model a name. This will save the model coefficients for
future use (on a prediction data set, for instance). The spectra are now MSC-
corrected. Save the corrected data in the Editor by File - Save As with a new
name, e.g. Alcohol corrected.
Launch a general Viewer by Results - General View and plot the first 10
samples from the original data file (Alcohol) using the variable Set Spectra:
Plot, Line, Browse, Select Alcohol.00d, Samples: 1-10, Var. Set: Spectra,
OK. Select Window - Copy To - 2. Activate the lower plot window and plot the
first 10 corrected samples (from file Alcohol corrected).
Re-scale the plots using either View – Scaling – Min/Max or View – Scaling-
Frame, and see how (well) the MSCorrection has transformed the spectra.
Close the Viewer.
Look at the plot X-Y Relation Outliers for components 1 - 4 in a quadruple plot.
This plot visualizes the relationship between X and Y along each component.
The samples should lie close to the target line. Outliers stick out from this line
and the samples "collapse" when the optimal number of components is
exceeded. Note: make sure that both Cal and Val samples are plotted!
Which samples are outlying in the present model?
Study the design as it is depicted in the score plot and interpret PC1 and PC2.
Use the sample grouping feature to see how the samples are distributed
according to the levels in the Y-variables.
Plot Y-loadings on the subview below the score plot, for PC1 and PC2. You will
see the Y-loadings forming a triangle.
Compare the loading plot and the score plot to interpret the meaning of PC1 and
PC2.
Can you see a slight curvature at the base of the triangle in the score plot?
Plot the residual variances for Y-variables 1, 2 and 3 in the same plot or plot the
RMSEP for all Y-variables.
How many PCs would you use?
Plot the loading-weights as a line plot for PC 1, 2 and 3 together (type “1-3” in
the Vector 1 field).
How can we interpret this plot? Here one must assume the role of a
spectroscopist!
Plot predicted versus measured for each variable, with varying numbers of PCs.
View plot statistics. Toggle between Calibration and Validation samples by
using View – Source or by switching the “Cal” and “Val” buttons alternatively
on and off.
How many components give the best results? Why? Are the test samples as well
predicted as the calibration samples?
Make conclusions: Is the model OK? What do the components “mean”? How
many components should be used? How large is the prediction error for each
alcohol? Is this satisfactory? What do you think about the model’s ability to
predict samples between the test points?
Save the model and close the Viewer before you continue.
Select the sample set New and the variable set Spectra in the Scope field. Now select Use Existing
MSC Model and find the MSC model you made earlier in this exercise.
You do not need to save the MSC coefficients after the correction is done.
Specify the appropriate model name. Use the optimal number of PLS-
components that you found in the final model evaluation above. Click OK.
Press View and study the prediction results for each response variable. What is
the meaning of the deviations around each predicted value? Do you notice
anything suspicious? Why does the difference between the predicted values and
real concentrations (e.g. at the 0% level) have a larger error than the RMSEP
from the calibration?
Plot Predicted vs. Reference for each response (without table) using 3 quadrants
of the Viewer. Mark the outlier so that you can spot it on all 3 plots.
Is it as easy to detect on the Methanol plot as on the other two plots? What if
you plot Predicted with Deviations?
Save the results and close the viewer. To see how the scores of the new samples
are placed compared to the calibration samples, select Results – General View
– Plot – 2D Scatter and click Browse next to the Abscissa box. Find the
prediction results file and select Tai in the Matrix box, PCs: 1. For the Ordinate
the same is specified except PCs: 2. Now select Edit – Add Plot and fill out the
same as above for the PLS2 model saved earlier.
Summary
In general X-spectra need rarely be weighted (but they may equally well be), while
in PLS2 it is often mandatory to standardize the Y-variables. In this rather special
case all three Y-variables are in the same units and measurement range, and they
have the same variance because of the triangular design, so weighting is not strictly
necessary (but it does not hurt either).
The first model is distinctly bad, with a low explained Y-variance for PC1 and
PC2. The samples are not well spread in the score plot; actually they form two
groups plus one isolated sample, no. 20. It is easy to think that no. 20 is an extreme
outlier, but things would not get better if you removed it. Since we never throw away
outliers before checking the raw data, we plotted the spectra and found signs of a
significant base line shift. Such scatter effects are not unusual in spectroscopy,
especially in fluid mixtures, or spectra of powders and grains.
Multiplicative Scatter Correction can be used to correct for both additive (e.g. base
line shifts) and multiplicative effects. We calculated the common correction
coefficients based on the calibration samples, and the test samples were
automatically corrected with this “base”.
The second model was found to be much better, with a good decrease in prediction
error. Sample number 16 is now an outlier, easily found from the X-Y Relation
Outliers plot.
The prediction error of the final model was minimized at 3 PCs. We would expect
to need only 2 PCs, since the three alcohols always add up to 100%. (If we vary the
contents of Propanol and Methanol, the Ethanol content is given.) You see this both
in the 2-vector score plot, loading plot, and the Y-variance plot for each of the
Y-variables. Propanol and Methanol are negatively correlated in PC1, and
Ethanol is negatively correlated to the combination of the other two in PC2. This
means that PC1 describes the variation in the proportions of Propanol and
Methanol, while PC2 describes the variation of Ethanol relative to the other two.
The score plot clearly shows the original triangular design. The slight non-linearity
in the triangle base is due to physico-chemical interference effects often associated
with mixtures, but PC3 takes care of this easily. That is why we need 3 PCs in this
“2 phenomena” application.
The RMSEP was about 1.7% for Methanol, 1.9% for Ethanol, and 1.2% for
Propanol using 3 PCs based on the test set validation. This is very good at higher
concentration levels. However as we have only tested new samples at the same
levels as the calibration samples, we cannot be perfectly sure that the model will
work equally well for points in between (but we would certainly expect it to!).
Nonetheless, since there were no signs of severe non-linearities and the calibration
samples cover the whole design space quite well, the model can safely be used for
other mixing combinations.
Normally we do not have reference values for the prediction samples. Note that the
reference value was not used in the prediction; it is only stored to be used in
plotting Predicted Y versus Yref.
The bars in the Pred+/-Dev plot indicate the prediction uncertainty. The deviations
are based on the similarity of spectra in the prediction and calibration samples, and
on their leverage. We can therefore use these deviations for outlier detection, and
thus we do not trust predictions with large deviations. A more correct way to give
predicted values is to include the RMSEP as an indication of the uncertainty, for
example 33 ± 1.3% (1 std).
Because there were outliers in the prediction set, the Pred/ref picture initially
seemed bad (the regression line is based on all the plotted elements), while the fit
of the non-outliers is really very good. When the outliers are removed from the data
set and we include only the correctly predicted ones, the statistics improve greatly,
of course.
Problem
The chemical element Tungsten (W) is an important alloying ingredient in many
types of steel for example, and is of course used extensively in light bulb filaments.
Geologically, Tungsten occurs predominantly in only one particular mineral, called
Scheelite. Scheelite mineralisations are thus a natural target for geological
exploration.
This sample medium (FFSS) is easy to collect (using a standard sieved bucket), but there
are never many Scheelite mineral grains in a standard FFSS sample. Geological
background levels are low, often in the range 0-10 grains (out of several thousands,
or in the ten thousands). Nearby mineralisations, however, may easily raise this
number of grains by factors of 3-15 as heavy mineral grains are freed by erosion
over geological time. The number of Scheelite grains in a standard FFSS-sample is
thus a very important Y-variable for exploration purposes. It is also a labor
intensive Y-variable. A professional mineralogist has to sift through 1 liter of
FFSS-sediment under a microscope and specifically count the critical number of
Scheelite grains. This is very laborious and very expensive indeed, especially when
it is pointed out that stream sediment exploration campaigns easily comprise
hundreds, even thousands of samples.
Since the FFSS-samples can be collected easily and in great abundance when an
exploration field campaign is first launched, there is a clear interest in trying to
calibrate these Y-measurements against less laborious and less expensive chemical
analytical techniques; preferably instrumental methods that can be applied directly
to the FFSS. If this is possible, we might do away with the mineralogist altogether!
Of course his expert knowledge is needed to make a good calibration first (a
universally applicable calibration model). Actually there are many other tasks that
can now be assigned to the mineralogist, tasks which are much more interesting for
him than the perpetual FFSS screenings.
In the present scenario we most certainly do not have an ideal calibration data set.
In fact, it is a very imprecise data set. We do have a standard chemical FFSS X-data
set (XRF-analyses of 17 variables), but unfortunately Tungsten itself cannot be
analyzed by this method. There is however good geological reason to hypothesize
that the remaining XRF-data set (X) should carry sound geochemical evidence of
possible Scheelite mineralisations close to the sampling locality. From general
geological knowledge there is firm evidence that the 17 X-variables available for
XRF-analysis usually may act as an indirect measure of Scheelite content.
There are even more difficulties with the present data set. The X-samples (FFSS)
and the Y-samples (grain counts) do not originate from the same field campaign.
The calibration data set is thus extraordinarily “dirty”, due to extremely large
uncertainties and error sources.
A conservative estimate of the average uncertainty level (X) in this problem may
well be way beyond 10-15% in the more extreme cases. We must therefore make
full use of the best possible data analysis method to handle that much error and
noise. We shall investigate whether it is possible to model the Scheelite grain
counts (Y) as a PLS1-model of the 17 other chemical FFSS variables (X). The
overall geological background is consistent with the possibility that such a complex
indirect model just might work, but only barely - it is a long shot indeed!
The data set was originally released by the former Geological Survey of Greenland
(GGU). We have slightly modified the problem description for use in this exercise
in order to stretch the multivariate calibration scenario to the limit. The situation for
the exploration department at GGU in collaboration with the senior author was not
exactly this challenging, but close enough. This is really a most difficult calibration
problem, with several unusually severe extraordinary error components added for
good measure.
Data Set
Calibration data are stored in the file GEO. The sample set Training contains 23
FFSS samples, which are used to make the model. The sample set New contains 51
FFSS samples, from other areas, in which the number of Scheelite grains has not
been counted.
Tasks
Two alternative strategies are suggested:
1. Start building a PLS1 model right away, and make extensive use of the
diagnostic tools in order to detect irregularities and abnormal samples. Try to
improve the model.
2. First, have a closer look at the data and decide on possible ways to make it
better suited for analysis. Then build a PLS1 model, diagnose it and improve it.
Once you are satisfied with your model (properly validated), predict the new
samples.
How to Do it
1. Read the data from the file GEO. Study the data and determine whether or
not to use weighting. Make an appropriate PLS1 model with warnings
enabled.
Which validation method would you recommend for this data set? Why?
Find possible outliers using the regression progress window, score plots, the
X-Y Relation outliers plot and other available means.
Are the suspected samples really outliers or just extreme values?
Try to remove the potential outliers, but only one at a time. You should play
around quite a lot with this data set. Try removing some “apparent outliers”, and
observe what happens to Validation variance, RMSEP and Predicted vs.
Measured. Try to remove both extreme end-members and outliers. Do not
hesitate to build 3 or 4 different models at least, and check the impact of
removing a sample or including it again.
Compare the different models with respect to prediction error. You can plot
RMSEP using Results - General View and add plot for the other models.
When you are close to your final model, make sure that you use a valid
validation method to determine the number of components and check the
performance of your final model.
2. Go back to the raw data and see whether you can improve the quality of the
data by transformations. Hint: have a look at the histograms of the various
variables.
How are most variables distributed? Can you distinguish between extreme
(abnormal) values and skewed distributions? Which transformation(s) might
make the distributions more symmetrical? Is it necessary to have normal
(gaussian) distributions?
Make a new model based on suitably transformed data. Check it for outliers, and
if necessary remove them or try replacing some individual matrix elements by
missing values.
Are there any outliers now? Are they the same as with raw data?
How does the X-Y relationship look now? Is it improved compared to the
previous models?
Use a proper validation method. Which type of cross-validation does this data
set invite?
Check whether all variables are important; you may try to improve the model by
variable reduction.
3. Use the best model you settle on to make a prediction from the data in the
sample set New. Check that you have transformed the new samples the same
way as you transformed the data used to make your preferred model. Look at
the predicted results and their uncertainty limits, and evaluate.
Summary
This exercise made you work with a very noisy data set, for which it was probably
difficult to find a really good model. Furthermore you practiced the detection,
identification and removal of outliers in a severe context of uncertainty. You may
have noticed that it can be very difficult to determine whether a sample is an outlier
or not and you had to be very careful.
You might have taken samples 4 and 6 back in, keeping 18 out since it is so
obviously influential, and then sample 22 was not such an obvious outlier after all.
You might even have spotted other candidates in this game of: “What if...?”
If you tried cross-validation after having started out with leverage correction, you
experienced a serious drop in explained validation variance. You may also have
noticed the strong curvature in the X-Y relationship along the first component, and
a non-uniform distribution of the prediction errors. So obviously even the best
model you can build with this approach is not quite satisfactory.
The second approach shows an obvious need for transformations to make variable
distributions more symmetrical. In all cases requiring a transformation (most of the
X-variables, and the Y-response variable as well), you may have tried both a
logarithm and a square root, and concluded that the logarithm performed better.
Thus all distributions were made roughly symmetrical, some of them looking
perfectly normal, others bimodal; the variables with a large number of zero values
could be improved by using a constant inside the logarithmic term perhaps,
although they couldn’t be made completely symmetrical. Only very few extreme
values remained after these transformations.
A first model on transformed data then showed that sample 18 was a very
influential sample, while samples 4 and 6, although still extreme, fitted much better
into the overall picture. But most importantly, the shape of the X-Y relationship
was now much closer to a straight line. Replacing a few individual values by
“missing” got rid of the remaining outlier warnings.
Full cross-validation leads to a choice of only 3 PCs with this approach; then
variable reduction based on the regression coefficients can be applied to get a
simpler model, which requires only 2 PCs and performs slightly better. The
residuals and predicted values were well distributed.
To evaluate the quality of your model, you could apply a pragmatic criterion; the
RMSEP was of the same order of magnitude as the response value for the first non-
zero level, and looked small enough to ensure that at least samples with a higher
grain count than this could be detected, which is what matters most.
A Possible Conclusion
Remember that the goal here was not to make a statistically perfect model, nor
necessarily one with the lowest RMSEP, but simply one that works with the given
extreme uncertainty levels in the problem context. There was no prior knowledge
as to how a possible model might look, and what the most appropriate
transformation and validation method might be.
The fact that there are really only four “effective” Y-levels in these data (levels of
roughly 0, 15, 25, 40 scheelite grains) strongly guides all the data analysis efforts.
Noting this critical point is pivotal to the possibilities of doing anything reasonable
with this data set. One should be extremely hesitant to declare any sample from the
two highest Y-levels as outliers, if it is at all possible to include these objects in the
model. This would make us effectively lose the overwhelmingly most important
part of the spanning of the Y-space. On the other hand, there are really many
samples with an effective zero count. Any of these might easily be discarded if
this would streamline our model in the X-Y Relation outliers-plot. There are still
enough other samples to anchor the model at the equally important “zero scheelite
grains” end.
It is vital to note that we cannot select outliers from a plot of the X-space alone, for
instance the t1-t2 score plot. You would get thoroughly deceived in this case! One
never, ever, performs outlier delineation in the X-space alone when doing
multivariate calibration – ONLY the appropriate T vs U plot(s) will do!
Also, there is no single correct solution to this modeling exercise. There may be
several equally valid models. It is only important that you are able to argue your
specific choices of the particular data analysis strategy you have chosen – always
(and only) with respect to the actual problem specifics present.
Problem
The aim of a particular Norwegian grain mill is to keep a constant wheat flour
quality, to meet the bakers’ requirements (who naturally prefer constant baking
characteristics). The requirement for protein contents in this case is 13.4 ± 0.3%.
By NIR-analysis of wheat it is possible to determine not only the water and protein
contents, but also the ash content. Ash is the residue after complete combustion of
the wheat, indicating the extraction rate after milling.
Data Set
A set of 55 wheat samples, considered to span the most important variations, was
collected at the mill. Each wheat sample was packed in the instrumental
measurement container three times (uniform packing is critical in powder NIR
diffuse spectroscopy). NIR spectra were recorded for each of these triplicates with
a Bran+Luebbe Infralyzer 450 (diffuse reflection) filter instrument with 19 standard
wavelengths. There is also one extra, special wavelength, believed to enhance
results for ashes. All 55 samples were also analyzed in the chemical lab, for
protein, water, ash, and ash (dry matter); these are the reference Y-data.
Task
Make a multivariate calibration model of the data. Concentrate on trying to model
Ash and Protein, as these are the more difficult Y-variables in this problem.
How to Do it
1. Start with a PLS2. Try an outlier limit of 4. Use Leverage Correction in
the screening process (to find outliers), then cross validation for the
following calibrations (for example systematic 111222… with 3 samples per
segment for raw data and full cross validation for averaged spectra).
You may also transform the spectra to reflectance or Kubelka-Munk units and
see how this affects the models.
Use all available means to look for outliers. Compare results with raw data to
find extremes. Use the X-Y Relation Outliers plot and the Sample Outliers plot
to spot outliers and study how the shape of the Y-variance curve changes when you
remove them. If a local increase in the error disappears when you remove a
sample, this indicates that it was indeed an outlier. There is however not much
use in removing samples if this does not reduce the prediction error.
Study replicate variations and see what happens if you average the spectra over
each replicate set. Also try MSC instead of the raw data. Using RMSEP to
compare these two alternative models is fully justified here.
2. Try separate PLS1 models to see if the more “difficult” variables are any
easier to model. Remember to define new variable sets as you need them.
Also check how the model performs if you keep some wavelengths out of
calculation. This introduces you to the important area of variable selection, which
you will learn more about in Chapter 14.3. Study the B-vector as well as the
loading-weights for this.
Summary
Plotting the raw data indicates some inter-replicate variations. Since the replicates
were in fact packed individually and since we are analyzing aggregate powder
samples, we may suspect that scatter effects are at work here. An initial PLS2 gives
an overview of the possibilities for modeling these data. Standardization of the
Y-values is absolutely necessary. PLS2 indicates that modeling the four
constituents requires different numbers of PCs for each. Water is easiest to model.
Protein needs the most PCs.
For raw data and MSC corrected spectra, samples 103-105 are outliers (large X-
variances) and should be removed. For averaged spectra, sample 35 is an outlier.
For PLS2 models on raw spectra, there are only small differences in RMSEP using
absorbance, transmittance, or Kubelka-Munk. The best overall model is obtained
using MSC on absorbance spectra.
By computing the average spectra we include the natural replicate variation and
imitate the real world situation, where scans are averaged before prediction or the
predictions from several scans are averaged.
We obtain the lowest number of PCs in the models for protein and ash when using
separate PLS1 models on MSC corrected absorbance spectra.
The RMSEP for Protein is about 0.09 for 6-8 PCs. This is about the same order as
the reference method. RMSEP of Ash is about 0.014-0.015 using 7-8 PCs. We do
not know the corresponding laboratory inaccuracy of the Ash measurements.
By studying the regression coefficients for the optimal number of PCs we can try to
make a model based on fewer wavelengths. Remove wavelengths with small
regression coefficients - if the loading-weight plot(s) agree with this. Remaining
variables number 2, 4, 12, 18, and 20 give a 4 PC model. It has an RMSEP of about
0.018 for ash, which is slightly worse than the 20-filter model. This is still quite
satisfactory if you prefer a simpler model. The best model for Ash was based on
MSC corrected spectra and 20 wavelengths though.
Problem
This exercise is based on work done by Lennart Eriksson et al., Dept of Organic
Chemistry, University of Umeå, regarding strategies for ranking chemicals
occurring in the environment, and it is used here with kind permission.
We would like to know both long term and acute biological effects of all the
chemical compounds continually being released into the environment. It has been
estimated that there are 20.000-70.000 different chemicals in frequent use in
industry. Testing all these compounds is impossible, due to reasons of cost, time
and ethical considerations, (animals are often used for testing).
In this application the cytotoxicity towards human cervical cancer cell lines (HeLa
cells) was to be determined for a series of halogenated aliphatic hydrocarbons. The
cytotoxicity is expressed as the inhibitory concentration lowering cell viability by
50%, IC50.
Data Set
The data are stored in QSAR. Sample set New contains 58 samples (16-73) with
non-missing data for 8 (1-8) descriptors of the compounds. The names of the
samples are given in Table 12.1. This data set will be used in the first part of the
exercise, PCA.
Tasks
In this exercise we will make the PCA model (the principal properties model), see
how experimental design can be used to select a representative subset of samples,
make a PLS model, and validate with both “internal” cross validation and
“external” test set.
How to Do it
1. PCA
Make a PCA on sample set New, variable set X. Variables 9 and 10 which only
have missing values will automatically be kept out. Is standardization
necessary? Save the model under a meaningful name (e.g. QSAR New
compounds).
Study scores and loadings. How many PCs should we use? Is the sample
distribution of the score plot satisfactory? Which variables are important in the
first PC? And the second? If you are a chemist, try to interpret the meaning of
the PCs! Which variables dominate PC3 and PC4?
Choose File – New Design – From Scratch – Fractional Factorial, and build a
fractional design for four design variables (corresponding to four PCs). Also add
two center points to get a few points in the “middle”. Do not put any effort into
choosing names and so on, and use only -1 and +1 for low and high value.
Choose as Design Type “Fractional Factorial Resolution IV” (8 experiments),
and include the default 2 center samples.
Click Next until you reach Finish. By clicking Finish you will display the
designed data table. It should look roughly like this:
            A    B    C    D
Cube 001a  -1   -1   -1   -1
Cube 002a   1   -1   -1    1
…
Cent-b      0    0    0    0
The idea is now to find compounds with values of their principal properties
(scores) corresponding as well as possible to these design points.
(Geometrically, the selected points form the corners of a hypercube in the space
defined by the design variables). This means finding compounds with score
values in PC1, PC2, PC3, and PC4 that match patterns of the design points. For
example the design point +1 -1 +1 -1 corresponds to a compound whose score
values are + - + - for the four first PCs.
Practical constraints like boiling points which are too low and the availability of
the compounds also restrict the candidate list of possible calibration compounds.
The two center points (with score values close to zero) were used to get
information about curvature and variability.
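The matching idea itself is easy to automate. The sketch below (Python, with random placeholder scores instead of the real PCA results) builds the eight sign patterns of a 2^(4-1) resolution IV design with the generator D = ABC and, for each design corner, picks the compound whose scaled PC1-PC4 scores lie closest to it. The scaling step and the distance criterion are our own simple choices, not part of the original exercise.

```python
# Sketch only: match compounds to the corners of a 2^(4-1) resolution IV design (generator D = ABC).
import numpy as np
from itertools import product

design = np.array([[a, b, c, a * b * c] for a, b, c in product([-1, 1], repeat=3)])

rng = np.random.default_rng(1)
scores = rng.normal(size=(58, 4))               # placeholder for the PC1-PC4 score matrix

scaled = scores / np.abs(scores).max(axis=0)    # bring the scores onto the -1/+1 design scale

for point in design:
    # the compound whose scaled scores lie closest to this design corner
    best = np.argmin(np.linalg.norm(scaled - point, axis=1))
    print("design point", point, "-> candidate compound (row index)", best)
```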
Import the scores matrix by File - Import - Unscrambler Results. Select the
PCA model QSAR New Compounds. Select matrix Tai. Remove the last four
rows, so as to keep only the scores along PC1-PC4. Then choose Modify -
Transform - Transpose. Try to pick out 10 compounds according to the
directions given above. Save it using File - Save.
Eriksson et al. made the following calibration set selections (Table 12.3):
Table 12.3
Design   Score PC1  Score PC2  Score PC3  Score PC4  Sample no.  Name
----        -1.40      -0.71      -0.22      -0.17       30      CH3CH2Br
-+-+        -2.03       0.80      -0.45       0.16       48      CH3CHClCH3
+--+         1.59      -0.99       0.23      -0.18       33      CH3CH2F
++--         0.50       1.09      -1.16      -0.20       52      CH3CH2CH2CH2Br
--++        -1.87      -0.59       0.71       0.28        2      CH2Cl2
+-+-         3.2       -1.26       0.23      -1.97       39      CBr3F
-++-        -0.94       0.48       0.03      -1.25        7      CCl3F
++++         1.8        0.70       0.62       0.38       15      CHCl2CHCl2
0000        -0.4       -0.1        0.68       0.11        3      CHCl3
0000        -0.55       0.03       0.87       1.2        11      CH2ClCH2Cl
Experiments were now performed on the selected compounds. For the studies of IC50, Eriksson et al. included a few more variables in addition to the chemical descriptor variables. These were the log retention times for two HPLC systems (LC1, LC2), and a few others. (Since those "others" proved to be insignificant, we will skip them here.) This means that we now have 10 X-variables.
4. PLS
Make a PLS model using the sample set Training, X-variable set X (all of
them!), Y-variable set Y. Choose Test Set Validation. Set up the test set using
samples 11-15 as test samples. Save this first model as e.g. “Cyto 1”. Also make
a PLS model based on samples 1-10 only, where you validate the calibration by
full cross validation (10 random segments) and save this model as e.g. “Cyto 2”.
Interpret the model by studying Y-variance, scores and loadings. How many
components should we use? How large is the explained Y-variance using test set
and internal cross validation, respectively? Which X-variables dominate the
model? How do you interpret PC1 and PC2? Explain the difference between the
prediction error using internal cross validation and external test set validation!
How large is RMSEP?
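For readers who want to reproduce the two validation strategies outside The Unscrambler, here is a minimal scikit-learn sketch: an external test set (samples 11-15) versus full cross validation on samples 1-10. The X- and Y-data are random placeholders and the 2-component choice is arbitrary; only the comparison of RMSEP and RMSECV is the point.

```python
# Sketch only: external test set (samples 11-15) vs. full cross validation on samples 1-10.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 10))                   # placeholder for the 10 X-variables of set "Training"
y = rng.normal(size=15)                         # placeholder for IC50

X_cal, y_cal = X[:10], y[:10]                   # samples 1-10
X_test, y_test = X[10:], y[10:]                 # samples 11-15, the "external" test set

pls = PLSRegression(n_components=2).fit(X_cal, y_cal)
rmsep_test = mean_squared_error(y_test, pls.predict(X_test)) ** 0.5

y_cv = cross_val_predict(PLSRegression(n_components=2), X_cal, y_cal,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
rmsecv = mean_squared_error(y_cal, y_cv) ** 0.5

print(f"RMSEP (test set) = {rmsep_test:.2f}, RMSECV (internal) = {rmsecv:.2f}")
```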
Make two new PLS models for IC50 based on the five dominant variables, using
both test set and internal cross validation, to check if your interpretation about
important variables holds. Does this model have a better or worse prediction
ability with respect to explained variance and RMSEP?
Save your favorite model as e.g. “Cyto Reduced 1”
Make a PLS model based on only the five important variables again, but now
include all 15 samples. Call this model e.g. “Cyto Reduced 2”. Compare the
prediction error of this model with Cyto Reduced 1.
Also study Predicted vs. Measured for varying number of PCs. How many
components should we use?
Since the HPLC measurements are not available in the literature like the
chemical descriptors, make a new PLS model based on all the tested
compounds, but using only the three most important variables. (Exclude LC1
and LC2). Use full cross validation. Interpret and check the RMSEP and the
explained Y-variance. Is this cross validated model much worse?
Save this model as e.g. “Cyto for Prediction”.
Predict IC50 for the 43 compounds in sample set “Prediction”, using your last
model. Can you trust all the predicted values? Why not?
Summary
PCA
The data should be standardized. There are no signs of direct problems, but sample
no. 4 seems potentially outlying in several PCs. Four PCs describe about 94% of
the variance. All variables except Ip have large loadings on PC1; thus PC1 can be
interpreted as related to the size/bulk of the compounds. PC2 is dominated by log P,
Van der Waals volume and density, which can be interpreted as reflecting a
combination of size and lipophilicity/hydrophilicity; Ip also has a large loading on
PC2. The interpretation of PC3 and PC4 is ambiguous. Clearly the ionization
potential (Ip) is important for PC3, and melting point (MP) dominates PC4.
PLS
The PLS model based on test set validation needs one PC to describe about 82% of
the Y-variance with an RMSEP of 1.0. Using internal cross validation, two PCs
describe 73% of the Y-variance with an RMSEP of 1.5. This means that the test set
is too small to be predicted with more than one PC. IC50 depends primarily on the
hydrophobic and steric properties of the compounds, Mw, VdW and log P, and LC1
and LC2. The larger and more hydrophobic the compound is, the more cytotoxic it
is.
The reduced model with test set validation is better, with an explained Y-variance
of 90% using only one PC. Using 2 PCs gives overfitting. The internally cross
validated model needs four PCs to explain 80% of IC50. Both data sets are too small
to give consistent estimates of the prediction error, but we will only use them to
indicate toxicity levels - not to predict accurate levels.
The cross-validated “Cyto for Prediction” explains 85% of IC50 with 3 PCs. There
is little risk of overfitting since we know that all the three variables are necessary.
Prediction
Sample 51 appears to be an outlier, having very large uncertainty limits at prediction.
The compounds outside the validity of the model are of course also uncertain. In
this case “compounds outside the validity of the model” means samples that lie
outside the area from which you picked the calibration set in the PCA score plot.
The model has to extrapolate to predict these samples.
Six compounds have predicted IC50 values lower than 0.5 mM. Since the lowest
cytotoxicity value measured for the calibration set is 0.9 mM, giving specified
predictions below 0.5 mM does not seem quite justifiable. Maybe we can give these
predictions as “< 0.5 mM”.
The predictions may now be used as a starting point for further risk assessment of
the non-tested compounds, preferably integrated with other data of environmental
relevance.
For each major data set we shall give only a thorough description of its origin and general background. The data sets themselves will not be given in the same ready-to-use, fully prepared form as has been necessary for all the exercises presented hitherto. All four problems are genuine real-world data sets however, presented directly as is, without any preparatory help. We are confident that you understand why – and that you concur.
Many of the contingencies outlined in chapters 1 through 12 will certainly be needed at some point(s) in the preparation of a particular data analysis strategy or problem re-formulation below. Thus you should be well prepared for the easy issues, such as which pre-processing to choose (and why?), outlier detection (objects and/or variables), groupings, etc. Also look out specifically for the more subtle ones, such as "typical" problems,
which on second reflection may well have to be re-formulated into another
more useful form, or the definition of what actually constitute proper
problem-dependent variables and objects, say. In point of fact we have
prepared a rather interesting palette for your own painting here. All these data
analysis scenarios have been tested thoroughly during the senior author’s
teaching experiences in the last five years, so far with only positive student
responses. It is to be hoped that the same applies to YOU – good luck!
These major data analytical assignments are your own exclusively. We will
not debase your learning efforts so far by outlining a "correct solution"
immediately following the individual descriptions - we have devised a route
for you to follow that is much more respectful of your examination efforts.
For three of these data problems there will be on record one possible full
solution, perhaps amongst many other alternatives, but it will only be
available to you after you have carried out and documented your own
solution. There will be three full data set reports available, and a meta-
principle solution suggestion for the fourth data problem. The latter is to be
published in full in an upcoming high-level textbook, Esbensen & Bro: “PLS -
theory & practice in science, technology and industry” to be published in
2001 (Wiley).
Gaetano S. (born 20.9.1878 – died 15.12.1959) and his son Pietro S. (born
1903 – retired 1971, but still producing masterworks as late as 1977; died
04.05.1999) both worked their whole lives in Italy as master violinmakers.
Gaetano S. worked initially in the city of Vicenza for many years before
moving to Parma in 1926, where he stayed almost without interruptions until
his death in 1959. His son, Pietro, continued in his father’s profession and
also became a master violinmaker in his own right. Today one speaks within
initiated circles with the utmost admiration of the brief, but prestigious
dynasty of violinmakers Sgarabotti. This refers to the historical presence of
both these masters who, apart from creating their own master instruments,
spent much of their time passing on experience to young violin makers.
Indeed the activity of the Sgarabotto makers was very influential in the violin
making school of Parma, Cremona and elsewhere. Much could be said about
their combined influence on the cultural heritage within the musical world of
string instruments, and of the enormous regard and esteem with which all
their students and musicians held the masters. The greatest interest by
posterity, of course, centers on their combined oeuvre, on the set of master
violins from their hand left for us to play and admire.
The works of both the Sgarabotto makers can be readily identified by their
meticulous choice of materials, the workmanship always being exquisitely
“manual” in every phase of the making of each instrument, and showing
extreme precision and loving care to detail. The violin making of the
Sgarabotto makers is always graceful, that of the father Gaetano said to be
presenting a lighter touch whilst the thicknessing used by Pietro is more
consistent. The sonority and general musical quality of these master violins can of course not be expressed in any fair way by numerical values, but will forever reside in their handling by the musicians and experts who are fortunate enough to play a Sgarabotto violin, viola, violoncello…
Table 13.1 below lists 18 master violins from the hand of the father Gaetano
S. Accompanying these are a further 14 master violins originating with the
son, Pietro S. Each individual work is dated with the year in which it was
made as well as in which city. We may also observe that Pietro Sgarabotto
adopted the tradition of assigning an individual name to every instrument in
the master class, sometimes dedicating them to a historical or prestigious
personality. The running numbering in Table 13.1 reflects a chronological
listing of Gaetano Sgarabotto’s entire master violin oeuvre (numbers 1-18),
followed by that of his son (numbers 19-32). This numbering is to be used as a shorthand identification of these 32 objects.
However, this particular part of the evaluation of the works from these two
master violin makers has never been put on a quantitative footing, being
traditionally carried out in decidedly non-quantitative artistic, humanistic,
craftsmanship or musicology related terms. But precisely the apparently
somewhat elusive feature of the “overall physical harmony” actually lends
itself to a translation into the kind of issues that are well known in our current
data analysis traditions. For “overall physical harmony” read “interacting
variables” or better still “correlated variables”! By use of bilinear modeling it
will be possible to put this artistic expert impression also in a more objective
quantitative context. This should of course be of immediate interest to the
data analysis community, i.e. to be able to express such subtle artistic
impressions in an objective language and thus, perhaps, to be able to
contribute towards greater clarity and precision in this scholarly debate.
Whether the same interest and appreciation is reciprocated from our musician
friends is not known, but it would probably be looked upon with at least some
initial frowning! Nevertheless we hasten to point out that the following quantitative analysis of the combined oeuvres of the Sgarabotto violins should in no way be taken to indicate anything but a modest contribution, offered in full respect of the totality and the integrity of the violin making art and craftsmanship tradition.
Of course this can only concern the relative dimensions of the framework and
the sculpturing of the violin. This will of course never be but a weak
reflection of the whole appreciation of violin making. Still – what a
tremendously interesting data analysis context!
The master violin maker must have his mind focused both on the totality and on the individuality of all
the essential elements involved in making such a complex, harmonic artwork
as a violin. Such elements may be the choice of wood materials, cutting,
shaping, sculpting, and varnishing. It would appear likely that the final
outcome of this complex process will possess at least some inter-correlated
features, which have come into existence more or less by a non-conscious
emerging totality of the material object, the violin, as more and more of the
essential parts of the process are added to the product of labor.
What is conjectured here is that at least some of the interacting features in the
totality and identity of a violin come into existence as a non-conscious sum-
of-parts, rather than as a deliberate act of Gestalt creation. The interacting,
correlated relationships between the external physical dimensions of a
finished violin would be one prominent representative of such emergent
properties. While presumably to a large extent non-conscious, the totality of the manual crafting involved in the individual violin making will end up as a final, overall, integrated hallmark of the style of violin making craftsmanship, supposedly distinctive and characteristic of each individual violin maker. In
any event, this constitutes the working hypothesis for undertaking a data
analytical examination of the VIOLINS data!
Thus the objective for this chapter’s first data analytical assignment will be:
Make a complete bilinear analysis of the data stored in file VIOLINS, with the
ultimate aim of finding out if there are (any?) distinctive features which can
be used to discriminate between the representative works of father and son
Sgarabotto violin makers. You may keep in mind the previous assertion
“...that of the father Gaetano said to be presenting a lighter touch whilst the
thicknessing used by Pietro is more consistent.”. How would YOU shed new
(quantitative) light on this issue?
Hint: We are firstly looking for the violin-discriminating features here; what
could possibly constitute outliers in this context? Next, for the violins proper, how can we look behind the massive similarities which surely must be present when comparing entities which, to all but the most erudite experts and musicians, are virtually look-alikes? Indeed these similarities positively almost
overwhelm the innocent data analyst at first sight. A certain creative data
analysis insight must be found here, lest we get stuck in this similarity swamp.
Once the initial major question has been correctly solved: whether to use
auto-scaled, only standardized, only mean-centered, or raw data, the
remaining key hint is that a certain liberating re-formulation (based on some
of the interim results, which can be obtained relatively easily) will be
necessary. It will absolutely not be possible to delve directly into these data.
Above all, this compilation of invaluable data must be approached with the
utmost reverence!
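One of the interim questions above – raw, mean-centered or standardized data – can at least be explored mechanically. The sketch below compares, on a random placeholder matrix with 32 rows (one per instrument; the column count is arbitrary), how much of the total sum of squares the first two PCs account for under each pre-treatment. It is meant only as a reminder of how strongly the pre-treatment choice can change the picture, not as a solution to the assignment.

```python
# Sketch only: how strongly the pre-treatment choice changes a two-component description.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(32, 11))                   # random placeholder with 32 rows (one per instrument)

def ssq_two_pcs(X):
    # PCA via SVD with no implicit pre-treatment: fraction of the total sum of
    # squares accounted for by the first two components
    s = np.linalg.svd(X, full_matrices=False)[1]
    return (s[:2] ** 2).sum() / (s ** 2).sum()

variants = {
    "raw":           X,
    "mean-centered": X - X.mean(axis=0),
    "standardized":  (X - X.mean(axis=0)) / X.std(axis=0, ddof=1),
}
for name, Xp in variants.items():
    print(f"{name:14s}: first two PCs account for {100 * ssq_two_pcs(Xp):.1f}% of the sum of squares")
```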
Photograph 3 - “Now they are smiling again: after seven meager years, the
car dealership business reports results that are again really on the move”.
But how are they reporting their results? That is the issue at hand in this last
exercise on these data. And how (good or bad) is the managing director paid?
In the above data analysis of the master violin data, it was found necessary to
re-formulate the initial PCA into a pointed, problem-specific PLS-
reformulation. It was found that one particular X-variable played such a key
role, that a re-assignment of this variable into a specific PLS Y-variable
perhaps would be found profitable. The data analysis continued, guided by the
comparative results from the relevant PCA and PLS-analyses. This issue
could be termed “internal re-formulation”, signifying that the reasons for a re-
formulation of the entire data analysis objective (PCA → PLS) came about by
an “internal evaluation” (interim results of the previous data analysis efforts).
It will probably not be too difficult to single out one particular X-variable which, after only a little reflection, is not strictly in the same field as all of the others: an X-variable which – in principle at least – need not contain data originating exclusively via strong passive correlations with all the other economic fellow travelers; a variable whose values might actually, to some extent, be externally controlled, in the sense that they can be fixed somewhat irrespective of the remaining sales- and value-related data.
We may formulate the following working hypothesis for the car dealership
data case – in analogy with the data analysis experiences from the violin data
case – that variable X8: “Salary, managing director” may constitute an
analogous PCA → PLS re-assignment variable of similar potentially increased
data analysis insight value.
Pertaining to this, we thus want to analyze the data set explicitly from the
standpoint of seeing quantitatively how well the set of the nine other
economic indicator variables are able to model (or to predict, rather) the level
of salary assigned to the managing director. This would be a direct
investigation of to what extent the macroeconomics of the company is directly
related to the corresponding level of salary to the managing director – or not.
Any significant positive deviation (i.e. a positive residual with respect to the predicted salary following from the PLS model) would constitute evidence of violations of a strict market and sales-related salary policy. Opposite deviations (i.e. negative residuals) would signify managing directors accepting to be underpaid, certainly a more interesting situation. Thus, for economic analysis, the simple PCA vs. PLS switch option could perhaps offer drastically different points of departure. But, really, in doing this kind of
Y-guided data modeling, we are actually more interested in the embedded X-
decomposition than in this salary modeling per se.
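A minimal sketch of this re-formulated analysis is given below, assuming random placeholder data (the number of dealerships, 25, is also a placeholder; only the 10-variable layout with X8 as salary follows the text). It fits a PLS1 model of the salary on the nine other indicators, cross validates it, and ranks the dealerships by their salary residuals.

```python
# Sketch only: PLS1 model of X8 (salary) on the nine other indicators, and the salary residuals.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, LeaveOneOut

rng = np.random.default_rng(4)
data = rng.normal(size=(25, 10))                # placeholder for the car dealership table

y = data[:, 7]                                  # X8: "Salary, managing director" re-assigned as Y
X = np.delete(data, 7, axis=1)                  # the nine remaining economic indicators

y_cv = cross_val_predict(PLSRegression(n_components=2), X, y, cv=LeaveOneOut()).ravel()
residuals = y - y_cv

# positive residual: salary higher than the economic indicators "justify";
# negative residual: a managing director apparently accepting to be underpaid
for i in np.argsort(residuals)[::-1][:3]:
    print(f"dealership {i}: salary residual = {residuals[i]:+.2f}")
```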
You should thus be able to re-formulate the objective of the car dealership
data analysis accordingly, and to repeat the analysis of the interrelationships
of Norwegian car dealerships, but now as expressed by an appropriate PLS1
analysis. It will still be the pertinent t1-t2 score plot and the accompanying loading-weight plots that carry the main interpretation.
After having negotiated these issues, please try to formulate the relevant new
interpretations of the PLS data analysis results, and compare them with their
counterparts from Exercise 3.13 above. Are we now finally in a position to
approach the editorial offices of the magazine “CAPITAL”? – And, if yes,
what can we teach the economic journalists about the significant issues of
interacting, correlated variables? This would be the crux of the matter!
13.3 Vintages
Here is another completely new data set, which you have never seen before.
This particular data set originates as a small guide for beginner wine-
aficionados, as a first-hand overview of the most important background
knowledge, which is essential for any wine expert, beginner or experienced.
The compilation of vintage assessments, which forms the background for the
present examination data set, stems from an extended survey carried out by a
major Scandinavian wine importer, who would prefer to remain anonymous.
But we are very grateful for permission to use the data compilation below.
Photograph 4
In the data file Wines a selection of major wine types and selected
representatives from some of the most important French wine-producing
regions and other countries have been assigned vintage assessments for a
series of important recent years (1975-1994). Please observe that each entry in
this data table represents a carefully pruned and trimmed average overall
quality assessment as carried out by the local wine controlling authorities, not
by the producers themselves. Clearly there is a complex averaging procedure
involved behind each entry, the particulars of which need not concern us here
however, except for the fact that the same averaging procedure is used for all
individual entries listed, making them eminently comparable. In this table,
each vintage is expressed by an average assessment-index on a scale from 0 to
20 - obviously with 0 representing an outlandish, completely failed and utterly
unacceptable quality, while 20 represents the sublime, the perfect, the
quintessential quality…
This data set is of general interest, amongst other reasons, because by the very
nature of this subject-matter there will always be a major bias present towards
the higher end of this scale! In point of fact there are no values below 8 in this
27 x 20 table, while more than 45% of the entries lie in the interval 15–19,
although, naturally enough, on the other hand there are (very) few “20”-
values! The data are thus both heavily censored and heavily skewed and as
such of very high tutorial value. But there is more.
We list an excerpt from the entire data table below, Table 13.2 (it is to be
found in its entirety on the CD accompanying this training package, alongside
all other exercise data sets) in order to focus on another important issue for
this particular data set. As listed here by the original wine importer, the rows
of the data table are made up of either of the individual wines (e.g. Saint-
Emilion), or wine regions (e.g. Bourgogne, white), or even individual country
aggregates (as detailed for red/white wine respectively), totaling 27 rows. The
columns are representing the wine assessment years in question: 1975,
1976… 1994 – totaling 20 columns in all. There are a few, quite
understandable, missing values in this data table as well (e.g. the years when
grape harvests were destroyed). However these are so few and so irregularly
located that no serious hampering of the data analysis is likely to occur –
except from the finding that no information is available for the entire 1975-
1982 interval on Beaujolais wines, at least in this compilation.
Now: You are kindly asked to perform an appropriate data analysis on the
WINES data table.
There are only a total of (27 x 20): 540 potential elements in this small data
matrix, less some 19 missing values, totaling 521 actual values, so what could
possibly be the problems involved in what at first sight would appear as a
straightforward PCA?
You will – hopefully – be greatly surprised: good luck with your ongoing
advanced learning!
Table 13.2 - Enologic vintage assessments for selected wines and regions
for the years 1975-1994, excerpt. Complete file: Wines
Vintage 75 76 77 78 79 80 81 82
Bordeaux, red, Médoc/Graves 17 14 10 17 17 14 16 18
Saint-Emilion/Pomerol 19 16 10 16 17 13 16 18
Bordeaux, white (dry) 19 16 13 16 17 14 18 19
Sauternes/Barsac 17 16 12 15 14 m 15 14
Bourgogne, red, Côtes de Nuits 10 16 11 17 15 13 14 12
Côtes de Beaune 7 14 9 19 15 13 12 14
Bourgogne, white 13 15 13 17 17 12 14 17
Beaujolais m m m m m m m m
Rhône – north 14 14 13 19 15 14 13 17
Rhône – south 13 15 12 18 16 14 17 15
Loire, Muscadet/Touraine/Anjou 15 15 12 14 16 12 12 14
Pouilly-Fumé/Sancerre 18 18 8 15 14 13 14 17
Alsace 14 19 11 15 14 8 14 14
Champagne, vintage 18 16 m 15 16 14 15 18
Germany, Rhine valley 12 19 12 13 15 14 13 12
Mosel 17 18 8 9 15 8 15 11
Italy, Toscana 16 8 14 18 16 16 14 18
Piemonte 10 10 10 20 16 14 12 20
Portugal – vintage port 13 m 19 13 m 16 m 13
Spain – Rioja 17 16 19 17 12 16 18 19
California, red 16 16 15 18 16 17 17 17
Washington State, red 18 18 16 17 18 18 17 15
Oregon, red 17 12 18 16 16 14 15 16
Chile, red 14 13 14 15 15 13 14 12
Australia, red 15 16 16 16 17 16 15 18
Argentina, red 14 16 19 14 13 15 16 18
South Africa, red 11 17 14 16 14 13 14 19
SIGNAL ANALYSIS
Downstream of the constriction a turbulent regime is set up, which turns out to be both selective and diagnostic with respect to many of the factors involved in generating this modified flow regime, for example the concentrations of the influencing mixture components.
Figure 13.2 shows ACRG’s experimental laboratory rig on which many of the
fundamental studies on the oil-in-water micro-pollution system have been
carried out (in the 0-300 ppm range). In particular it has been found that real-
time FFT (Fast Fourier Transform) of the raw time-domain acoustic signals
will result in a suitable spectral format (so-called power spectra) for this kind
of data, which can be used directly as X-inputs to for example PLS-
calibration. The Y-variable in this context would of course be any relevant
intensive parameter for which prediction from such acoustic measurements is
desirable, e.g. the trace oil concentrations. Esbensen et al. (1999) detail many other systems which have also been characterized by acoustic chemometrics.
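The signal-processing chain described above is easy to sketch. In the code below the sampling rate, the record length and the simulated acoustic records are all assumptions of ours; the point is only the sequence raw time-domain signal -> FFT -> power spectrum -> X-block for PLS calibration against the oil concentration.

```python
# Sketch only: raw acoustic records -> FFT power spectra -> X-block for PLS calibration.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
fs = 50_000                                     # assumed sampling rate (Hz); 0-25 kHz is then the full band
n = 1024                                        # assumed samples per time-domain record

signals = rng.normal(size=(40, n))              # placeholder raw acoustic records
conc = rng.uniform(0, 300, size=40)             # placeholder oil concentrations (ppm)

spectra = np.abs(np.fft.rfft(signals, axis=1)) ** 2     # power spectra used directly as X
freqs = np.fft.rfftfreq(n, d=1 / fs)                    # frequency axis, 0 ... fs/2

pls = PLSRegression(n_components=3).fit(spectra, conc)
print("X-block:", spectra.shape, "frequency range:", freqs[0], "-", freqs[-1], "Hz")
```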
The file Glycol1 contains 256 training spectra; note 18 replicates for each
concentration in the general glycol interval between 0.0% (corresponding to
pure Gardermoen groundwater) and 3.0%, which is the range of interest for
the Norwegian pollution authorities. The acoustic frequencies range between
0 – 25 kHz, which is probably severe frequency overkill. From our combined
acoustic chemometrics experiences one is led to expect that one, or more,
coherent frequency-bands embedded in this overall broad band signal would
be optimal to do the job of quantifying the concentration of glycol. These are
high-precision data, and there are many options when contemplating to use a
problem-specific averaged version of the 18 replicates.
There is also a completely pristine test set to be found for the Glycol data set!
This is called Glycol2 (which in fact carries a different number of replicates).
But when all that has been well taken care of, there is a big bonus waiting:
The validation issue is almost ideal in this particular case: we have at our
disposition a bona fide real-world test set, acquired according to all the most
stringent validation prerequisites, compare Chapter 7. Thus we may direct our
attention (after your own perfunctory calibration has been arrived at, of
course) at the interesting issues involved in using a full-fledged test data set.
Indeed the main issue of this application is assessing the specific prediction
robustness of the technology involved. For this purpose the test set was
acquired only after considerable time had elapsed since the training
calibration (several days), and for good measure the whole acoustic
chemometrics apparatus had been shut down for service in this period as well.
Therefore we are in a position to focus our calibration experience mainly on this robustness issue.
The objective for this application context is thus clear: You are to find the
most robust subset of the original spectral range employed. Use whatever
means you may command at this stage. The main issue here turns out to be
problem-specific (validation-specific) variable selection.
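One blunt way to attack this variable (frequency-band) selection is simply to refit the calibration on candidate bands and judge each sub-model on the pristine test set. The sketch below does exactly that with random placeholder data and arbitrarily chosen bands; in practice the candidate bands would of course be guided by the loading weights and regression coefficients of the full-range model.

```python
# Sketch only: judge candidate frequency bands by their RMSEP on the independent test set.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(6)
n_freq = 256
X_train, y_train = rng.normal(size=(256, n_freq)), rng.uniform(0, 3, 256)   # placeholder for Glycol1
X_test,  y_test  = rng.normal(size=(90, n_freq)),  rng.uniform(0, 3, 90)    # placeholder for Glycol2

def rmsep(cols):
    model = PLSRegression(n_components=3).fit(X_train[:, cols], y_train)
    return mean_squared_error(y_test, model.predict(X_test[:, cols])) ** 0.5

bands = {"full range": np.arange(n_freq),       # arbitrarily chosen candidate bands
         "low band":   np.arange(0, 64),
         "mid band":   np.arange(64, 160),
         "high band":  np.arange(160, 256)}
for name, cols in bands.items():
    print(f"{name:10s}: RMSEP on the test set = {rmsep(cols):.3f}")
```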
Recall that the cross-validation gives a number of individual sub-models that are
used to predict the samples kept out in that particular segment. Therefore, we have
perturbed loadings, loading weights, regression coefficients and scores to be
compared to the full model. The differences, or rather variances, between the
individual models and the full model will reflect the stability towards removing one
or more of the samples. The sum of these variances will be utilized to estimate
uncertainties for the model parameters.
Equation 14.1

s_b^2 = \sum_{m=1}^{M} (b - b_m)^2 \cdot \frac{N - 1}{N}

where
N = number of samples
M = number of cross-validation segments (i.e. the number of sub-models)
b, b_m = regression coefficients of the full model and of sub-model m, respectively
After rotation, the rotated parameters T(m) and [PT, QT](m) may be compared to the
corresponding parameters from the common model T and [PT, QT]. The loading
weights, W, are rotated correspondingly to the scores, T. The uncertainties are then estimated as for b; thus the significance of the loadings and loading weights is also estimated. This can be used for component-wise variable selection in the x-loadings and loading weights, but it also gives an optional criterion for finding AOpt from the significance of the y-loadings, Q.
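A numerical sketch of Equation 14.1 for the regression coefficients is given below (the rotation of scores and loadings is left out). It uses scikit-learn, plain contiguous segments instead of the systematic segment selection of The Unscrambler, and random placeholder data of the same size as the PAPER set; the |t|-like ratio at the end is only a rough significance indicator.

```python
# Sketch only: jackknife uncertainty of the regression coefficients (Equation 14.1).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
X, y = rng.normal(size=(103, 15)), rng.normal(size=103)   # placeholder of the same size as PAPER
N, A = len(y), 3

b_full = PLSRegression(n_components=A).fit(X, y).coef_.ravel()

b_sub = []                                      # one sub-model per cross-validation segment
for train_idx, _ in KFold(n_splits=20).split(X):
    b_sub.append(PLSRegression(n_components=A).fit(X[train_idx], y[train_idx]).coef_.ravel())
b_sub = np.array(b_sub)                         # M x (number of x-variables)

s2_b = ((b_full - b_sub) ** 2).sum(axis=0) * (N - 1) / N   # Equation 14.1
t = b_full / np.sqrt(s2_b)                                 # rough significance measure per x-variable
print("approximate |t| per x-variable:", np.round(np.abs(t), 2))
```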
The general rule is thus that if a model with fewer variables/components is as good
or better with respect to predictability as the full model, we would prefer the
simpler model. However, if the objective is also to interpret the PCs and the
underlying structure, it could be an advantage in some cases to keep some of the
non-significant variables to span the multidimensional space. An example of this is
the visualization aspect of plotting the new predicted samples’ scores in the score
plot from the calibration model.
14.4.1 Introduction
We have previously covered the procedure of removing outliers from our multivariate models, and the leverage measure, hi, is a tool for finding influential samples: samples which have a high impact on the direction of the PCs. The simultaneous
interpretation of scores and loadings plots gives us information about which
variables span a specific direction and which samples are extreme in this direction.
The loading plots can also give information about correlation between variables,
both x and y, but the explained variance is also an important part here: is it valid to
interpret this PC at all? Even if there seems to be a high correlation, it is essential
to find out how the two variables are correlated, i.e. a 2D scatter plot of the
variables will reveal the structure. These aspects are particularly relevant for
models with a low number of samples.
Figure 14.1 - Stability plot. Note the position of sample 14 when it was not part of
the sub-model (the crossed circle marked with an arrow).
(Score stability plot for PC1 vs. PC2 with sample 14 marked, and the corresponding loading stability plot along PC1 for variables including Print-through and Permeability.)
The stability plots based on LOO cross-validation enable us to find and
interpret subtle structures in the data very efficiently in the realm of our objective:
to establish a regression model between x and y.
Data Set
PAPER, the same as in exercise 10.4. Part of the description is given below.
The data is stored in the file Paper. You are going to use the sample sets Training
(103 samples) and Prediction (12 samples), and the variable sets Process (15
variables) and Quality (1 variable).
Tasks
Make a PLS regression and estimate uncertainties. Mark the significant variables
and make a new model. Predict the 12 Prediction samples with the two models.
Visualize and interpret the stability plots in scores and loadings (loading weights).
How to do it
1. Make a PLS model.
Go to Task - Regression. We will use "Systematic 123123123" with 20 segments as the cross-validation method, so that everyone obtains identical results in this exercise. The data are sorted by increasing y-values, which means that the main structure is retained in all segments by this segment selection. Check the box named Uncertainty test in the regression dialogue. There is an option for how to determine the number of components to use: either the optimal number suggested by The Unscrambler (Opt. #PCs, the default), or a manually selected number of components. These uncertainty calculations are often performed after outliers have been removed and the "correct" number of components has been decided upon. Use 3 in this case.
Try different cross-validation options, and see how the significance is affected by the number of segments and by how they are selected. Experience from other data shows that the uncertainty estimates are quite stable regardless of whether you use LOO, 20, 10 or 5 segments in the cross-validation, provided that extreme outliers have been removed.
4. Visualize stability
The rotated and perturbed scores and loadings can be activated from View-
Uncertainty Test - Stability Plot or the icon on the toolbar. Since this is a
20 segment cross-validation, each individual score is a result from keeping five
or six samples out. You might want to make a model with 15-20 samples and LOO cross-validation to interpret these plots in more detail, and see how a single sample affects the model. There is also information about which sample was left out: click on the points in the score plot and you will see the segment number. Plot both x- and y-loadings and loading weights/y-loadings and explain the differences.
Summary
The automatic marking of the significant x-variables gave the same important
variables as from the manual selection, and a model based on this variable selection
gave lower RMSEP after 2 PCs than the full model with 3 PCs. It also seemed that
the deviations in prediction of new samples were smaller on average. The stability
plots give information about the structures of the data, such as correlation patterns
and the impact of influential samples on the model.
The philosophy behind this classical chemometric technique - it was in fact the
very first chemometric method to be formulated, Wold (1976) - is that objects in
one class, or group, show similar rather than identical behavior. This can at first simply be taken to mean that objects belonging to the same class show a particular class pattern, which makes all these samples more similar to each other than to the objects of any other group or class. The goal of classification is to assign new
objects to the class to which they show the largest similarity. With this approach
you specifically allow the objects to display intrinsic individualities as well as their
common patterns, but you model only the common properties of the classes.
The easy part of mastering SIMCA is that you have already learned 90% of this
approach, because you have mastered the application of PCA. SIMCA is nothing
other than a flexible, problem-dependent, multi-application use of PCA-modeling.
A practical introduction to how to use SIMCA is given below, without spelling out all the technical background at first. The SIMCA approach has been retold numerous times since its inception (see reference list), though Wold's classical paper (1976) remains unsurpassed as the full traditional introduction, complete with all technical details.
Let us first of all observe how grouping or clustering appears to the bilinear eye
(Figure 15.1).
Figure 15.1 - Grouping (clustering) as revealed in the initial overview score plot
SIMCA classification is simply based on using separate bilinear modeling for each
bona fide data class, which concept was originally called disjoint modeling. The
individual data class models are most often PCA models (because in the simplest
SIMCA formulation there is no Y-information present).
Classification is only applicable if you have several objects in each class because
every class has to support an A-dimensional PCA model. A complete SIMCA
classification model usually consists of several PC-models, one for each class
recognized, but of course the marginal case of just one class is also an important
option.
Figure 15.2 - Each data class from Figure. 15-1 as modeled by a separate
PC-model (SIMCA)
The subsequent classification stage then uses these established class models to
assess to which classes new objects belong. Results from the classification stage
allow us to study the modeling and discrimination power of the individual
variables. In addition a number of very useful graphic plots are available, which let us study the classified objects' membership of the different classes in more detail, quantify the differences between classes – and (much) more.
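To make the disjoint-modeling idea concrete, here is a small Python sketch: one PCA model per class, an object-to-model distance Si for new objects, and a simple F-test of (Si/S0)^2 for membership. The simulated classes, the one-component models and the degrees-of-freedom convention are all our own assumptions (implementations differ in the latter, and the leverage Hi limit is omitted), so treat it as an illustration of the principle rather than as The Unscrambler's exact procedure.

```python
# Sketch only: disjoint PCA class models with an Si-based membership test.
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import f as f_dist

rng = np.random.default_rng(8)
class_a = rng.normal(0.0, 1.0, size=(25, 4))     # placeholder training objects, class A
class_b = rng.normal(3.0, 1.0, size=(25, 4))     # placeholder training objects, class B
new_obj = rng.normal(0.0, 1.0, size=(5, 4))      # new objects to be classified
A = 1                                            # number of PCs per class model

def fit_class_model(X):
    """Disjoint PCA model of one class plus its pooled residual standard deviation S0."""
    pca = PCA(n_components=A).fit(X)
    resid = X - pca.inverse_transform(pca.transform(X))
    s0 = np.sqrt((resid ** 2).sum() / ((X.shape[0] - A - 1) * (X.shape[1] - A)))
    return pca, s0

def distance_to_model(pca, X):
    """Object-to-model distance Si (RMS residual over the p - A residual dimensions)."""
    resid = X - pca.inverse_transform(pca.transform(X))
    return np.sqrt((resid ** 2).sum(axis=1) / (X.shape[1] - A))

for name, X_class in (("A", class_a), ("B", class_b)):
    pca, s0 = fit_class_model(X_class)
    si = distance_to_model(pca, new_obj)
    # membership limit from an F-test on (Si/S0)^2 at the 5% significance level
    # (the leverage Hi limit and The Unscrambler's exact df conventions are omitted)
    f_crit = f_dist.ppf(0.95, new_obj.shape[1] - A,
                        (X_class.shape[0] - A - 1) * (X_class.shape[1] - A))
    print(f"class {name}: Si = {np.round(si, 2)}, member = {(si / s0) ** 2 < f_crit}")
```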
If the data classes are known in advance, i.e. if we know the specific class-
belonging of all the training set objects, it is very easy to make a SIMCA-model of
each class. This is called supervised classification. Otherwise, if we do not know in
advance any relevant class-belongings, we have to identify the data classes by
pattern recognition first, e.g. by using PCA on the entire training data sets – to look
for groupings, clusters etc.
Thus there are at the outset two primary ways into a SIMCA classification:
• Classes are known in advance (which objects belong to which classes)
• There is no a priori class membership knowledge available
For the second case, any problem-relevant data analytical method which lets us find patterns, groupings, clusters etc. in our data may be used (assuming of course that a pattern recognition problem is indeed at hand), e.g. cluster analysis. For many applications, however, there is already such a method readily available, namely PCA on the entire data matrix present. It may be as simple as that!
It is not a critical issue, however, by which technique a training pattern has been
delineated, what matters is that this pattern be representative of the classification
situation.
In any event, when/after the problem-specific data class setup is known, the
SIMCA-procedure(s) is simple, direct and incredibly effective.
Another advantage is that all the pertinent results can be displayed graphically with
an exceptional insight regarding the specific data structures behind the modeled
patterns.
Thus, different distance measures are used to evaluate the class membership of
new objects: the object distance to the model and the distance from the model
center. Many plots are available to help you to interpret the object/class-model
relationships. The primary tools are called the Coomans plot and the model-to-
model plot. The Coomans plot gives compressed information about the class
membership to any two models simultaneously. The model-to-model plot gives
information about the degree of similarity between models. If the distance between
models is small, there is little difference and the classification model is unable to
distinguish between these classes.
4. Make a separate model for each class. You can use different modeling contexts
for the different classes. Use individual data pre-treatment (e.g. standardization,
weighting, or more advanced preprocessings if necessary) for each class to
assure maximum disjoint class modeling. Validate each class properly. All
classes must be validated in the exact same way, or the membership limits will
not be comparable.
5. Remove outliers and remodel as deemed necessary. Also study the appropriate score plots to see if there should be more classes present than what is "known" in advance, as the conventional wisdom is often wrong! If so, repeat from step 3.
Determine the optimal number of PCs for each class. We now have completed
the classification modeling stage. This may take place immediately before the
next step (see below), or this may represent a modeling task carried out earlier,
on the basis of which the present classification is to be carried out, all is
problem dependent.
6. Classify new objects. Read the new data into the program and enter the Task-
Classify menu. Select the models you want to test the objects against. Then
choose the appropriate number of PCs for each class model, the number you
determined in step 5. For more details, see section 15.3.
7. Evaluate the present classification by studying the results and using the Plot
menu. Which plots to use and how to interpret them are described in section
15.5.
Select the class models to be used for the pattern recognition and specify how
many PCs to use in each model (this is strongly problem-dependent). The
appropriate number of PCs depends on the data set, the goal, and the application.
Then start the classification routine. Results can be studied both numerically and
graphically.
For each object, a star is shown in the column belonging to a model whenever the object in question belongs to this highlighted model at the current significance level, that is to say when it simultaneously satisfies both the Si and Hi limits set. Non-
marked objects do not belong to any of the tested classes.
2. The object may fit several classes, i.e. it has a distance that is within the critical
limits of several classes simultaneously. This ambiguity can have two reasons: either the given data are insufficient to distinguish between the different classes, or the object actually does belong to several classes. It may be a borderline case or have properties of several classes (for example being both sweet and salty at the same time). If such an object is classified as fitting several classes one may for example study both the object distance (Si) and the leverage (Hi) to determine the best fit; at comparable object distances, the object is probably closest to the model for which it displays the smallest leverage.
3. The object fits none of the classes within the given limits. This is a very
important result in spite of its apparent negative character. This may mean that
the object is of a new type, i.e. it belongs to a class that was unknown until now
or - at least - to a class that has not been used in the classification. Alternatively
it may simply be an outlier.
One of the most important scientific potentials of the SIMCA approach is related to
this very powerful aspect of “failed” pattern recognition – one must always be
prepared to accept that one or more objects actually do not comply with the
assumed data structure pattern(s). Clearly it is important to be able to identify such
potentially important "new objects" with some measure of objectivity. At some point in this pattern recognition process it will become important to be able to specify the statistical significance level(s) associated with this "discovery". Hence a few remarks on the use of the statistical significance level.
In the setup used in SIMCA-classification, the test carried out quantifies the
risk of saying that a particular object lies outside a specific model envelope - even
if it truly belongs. If you have had no formal statistical training all this may -
perhaps - appear a little confusing, but let us see how this works in practice.
The “normal” statistical significance level used is 5%. In very practical data
analytical classification terms this “means” that there is a 5% risk that a particular
object falls outside the class, even if it truly belongs to it; 95% of the objects which
truly belong will thus fall inside the class. At opposing ends of a spectrum of
significance levels typically used, we may illustrate these issues in the following
manner:
A high significance level (e.g. 25%) means that you are being stricter - only very
“certain” objects will belong to the class, and (many) more “doubtful” cases will lie
outside it. Fewer objects that truly belong will fall inside the class (in this case,
75%).
A low significance level (e.g. 1%) on the other hand means that you are being very
“sloppy” - cases, which are doubtful, will still be classified as belonging to the
class. You will get more objects classified as members, i.e. almost all of the true
member objects (i.e. 99%) will be classified as members of the class in this case.
It is important to understand that the significance test only checks the object with
respect to “transverse” object-to-model distance, Si, which is compared to a
representative measure of the total Si -variation exhibited by all the objects making
up the class, called S0. A standard F-test is used. A fixed limit (depending on the
class model) is used for the leverage.
The Coomans plot shows the object-to-model distances for both the new objects as
well as the calibration objects, which is very useful when evaluating classification
results.
Interpretation
If an object truly belongs to a model (class) it should fall within the membership
limit, that is to the left of the vertical line or below the horizontal line in this plot.
Objects that are within both lines, i.e. near the origin, must be classified as
belonging to both models. The Coomans plot looks only at the orthogonal distance
of an object with respect to the model. To achieve a correct classification the
leverages should also be studied, e.g. in the Si vs. Hi plot.
If an object falls outside the limits, i.e. in the upper right corner, it belongs to
neither of the models. It is very important that - after having decided on the
significance level before you carry out the classification - you respect the
classification results.
To make interpretation easier, the objects are color coded in The Unscrambler.
NB: What follows applies if you are using the default color scheme with black
background (or with white background in parentheses).
Yellow (green) objects are the new objects being classified. Cyan (blue) objects
represent the calibration objects in model one, while magenta (magenta) ones
represent the calibration objects from model number two.
(Coomans plot – SIMCA Iris, Significance = 5.0%, Model 1: Iris Setosa, Model 2: Iris Versicolor; abscissa: Sample Distance to Model Iris Setosa.)
This plot is similar to the Influence plot used to detect outliers in PCA calibration.
Interpretation
The Si vs. Hi plot shows the object-to-model distance and the leverage for each
new object. The leverage can be read from the abscissa and the distance from the
ordinate. The class limits are shown as gray lines; horizontal for the object-to-
model distance and vertical for the leverage limit.
The limit for the object-to-model distance depends on the significance level chosen.
The leverage limit depends on the number of PCs used and is fixed for a given
classification.
The leverage value shows the distance from each object to the model center. It
summarizes the information contained in the model, i.e. the variation described by
the PCs.
Objects near the origin, within both gray lines, are classified as bona fide members
of the model. Objects outside these lines are classified as not belonging to the
model. This either/or aspect of the classification results is what has given rise to the
terminology “hard classification”. The further away from the origin of the plot they
lie, the more different the objects are. Objects close to the abscissa have short
distances to the model, but may be extreme (they may well have large leverages).
Objects in the upper right quadrant, for example object 45 in Figure 15.4, do not
belong to the class in question. Objects in the lower left corner, for example object
1, are well within all limits of the test. The ones in the lower right quadrant, for
example object 15, have short distances to the PC-model but at a high(er) leverage,
so they are in the sense specified by the chosen significance level “extreme” and
may in fact not belong for that reason. You should check this based on your
knowledge of the problem.
(Figure 15.4 – Si vs. Hi plot: object-to-model distance (ordinate) vs. Leverage (abscissa), with the class limits shown as gray lines.)
This is actually all the help you can get from SIMCA classification. It is now up to
you to decide how to view samples which, for example, lie just outside the appropriate limit.
Statistically, these samples lie outside because you decided on the significance
level in advance. It is questionable, indeed unscientific, to toggle the limits after a
classification has been carried through because of a specific result that “clearly can
be improved if only I lower the significance level marginally”.
Interpretation
The plot is interpreted in exactly the same way as the Si vs. Hi plot, i.e. objects in
the lower left corner belong to the class in question within all pre-set limits.
(Figure 15.5 – Si/S0 vs. Hi plot: relative object-to-model distance vs. Leverage.)
Figure 15.5 shows an example of this plot. It is from the same data set, which was
classified in Figure 15.4, but note that a different significance level has been used
for illustration purposes. Object no. 7, just outside the limits, is however close to
the class at approximately twice the average distance (Si/S0 = 2).
Interpretation
A useful rule-of-thumb is that a model distance greater than 3 indicates models which are significantly different. A model distance close to 1 suggests that the two models are virtually identical (with respect to the given data). The distance from a
model to itself is of course, by definition, 1.0. In the distance range 1-3 models
overlap to some degree.
Figure 15.6 - A model distance plot for the three IRIS classes met with before
The example in Figure 15.6 is taken from exercise 15.6 (classification of Iris
species), where the three earlier met species of Iris are classified using the four
classical X-variables. Using these variables only, it is known that two of the species
are very similar. This is also reflected in the model distance plot where the distance
from model Versicolor is shown. The distance to the first model (Setosa) is very
large (around 100), but the distance to the last model (Virginica) is small, around 3-
4, i.e. they are to some degree similar. The second bar in Figure 15.6 is the distance
to the Versicolor model itself, i.e. 1.0.
If you have a poor classification, deleting the variables with both a low
discrimination power and a low modeling power may sometimes help. The
rationale for this specific deletion is of course justified by the fact that variables
which do not partake in either the data structure modeling nor in the inter-class
discrimination are not interesting variables – at least not in a classification
perspective (they may be otherwise interesting of course).
Interpretation
The discrimination power plot shows the discrimination power of each variable in a
given two-model comparison. A value near 1 indicates no discrimination power at
all, while a high value, i.e. >3, indicates good discrimination for a particular
variable.
Figure 15.7 - Discrimination power plot for the four classical IRIS-variables
The example in Figure 15.7 again shows a plot from the IRIS species classification.
Here the data from model Iris Setosa are projected onto model Iris Versicolor. The
diagram thus shows which variables are the most important in distinguishing
between these two species models. All the variables have a discrimination power
larger than 3 and all are therefore useful in the overall classification.
The modeling power can thus be a useful tool for improving an individual class
model. Even with careful variable selection, some variables may still contain little
or no information about the specific class properties. Thus these variables may
have a different variation pattern from the others, and consequently, they may cause
the model to deteriorate. Different variables may show different modeling power in
different models however, so one must always keep a strict perspective with respect
to the overall classification objective(s) when dealing with a multi-class problem.
Variables with a large modeling power have a large influence on the model. If the modeling power is low, i.e. below 0.3, the variable may make the model worse and may therefore be considered for deletion.
Interpretation
The modeling power is always between 0 and 1. A rule-of-thumb is that variables
with a value equal to or lower than 0.3 are less important.
(Figure 15.8 – Modeling power of the X-variables for the Iris Setosa model.)
In Figure 15.8 the modeling power for the Iris Setosa model is shown. The last two
variables have a very low modeling power indeed and may therefore possibly be
deleted if our only interest lies in modeling Iris Setosa. However, in Figure 15.7,
the discrimination power for the same two variables is very high. These variables
cannot therefore be deleted if the goal is also to discriminate between these two
classes. Even with as small a multiple-class number as three, one must keep the
overall perspective crystal clear.
Data Set
The data table is stored in the file IRIS and contains 75 training (calibration)
samples and 75 test samples – here we have a well-balanced and completely
satisfactory test data set.
The training samples are a priori divided into three training data sets, each
containing 25 samples. These three sets are Setosa, Versicolor, and Virginica. The
sample set Testing is used to test the efficiency of the established classification.
Four traditional taxonomic variables are measured: Sepal length, Sepal width, Petal
length, and Petal width. The measurements are given in centimeters.
Task
Add a new variable to the data table so that the three classes can be identified on
PCA plots, then make a PCA model of all calibration samples.
How To Do It
Open the file IRIS. Mark the first column in the table, then choose Edit –
Insert – Category Variable. Enter a name for the category variable, e.g.
Class. In the Method frame, select “I want my levels to be based on a collection
of sample sets”, then click Next. Move the three sets Setosa, Versicolor, and
Virginica from the “Available Sets” to the “Selected Sets” by selecting them
and clicking Add. Click Finish. You are now back in the data table and you can
see the new “Class” variable in column 1. Its name is written in blue, to show
that this is a special type of variable, used for labeling purposes only.
We assume that you are thoroughly familiar with making PCA models by now.
Refer to one of the previous exercises if needed.
Note that there are few outlier warnings and most of the variance is explained
by three PCs. Click View to look at the modeling results.
Activate the residual variance plot and select Plot – Variances and RMSEP.
Remove the number in the variables field so that only the total variance is
displayed, select only the validated variance in the samples field and (if
necessary) change the plot from residual to explained variance (View –
Source - Explained Variance).
We see that two to three PCs are enough to describe most of the variation
present.
Activate the score plot and select Edit - Options. Select the Sample
Grouping tab; enable sample grouping, separate with Colors and select Value
of Variable in the Group By field. Make sure Levelled Variable 1 is selected.
Note that there are three classes in the data; one very distinct (Setosa) and two
that are not so well separated (Versicolor and Virginica). The score plot
indicates that it may be difficult to differentiate Versicolor from Virginica.
This means that the number of components must be determined for each model
individually, outliers found and removed separately etc.
Task
Make individual PCA models for the three classes Setosa, Versicolor, and
Virginica.
How To Do It
Select Task - PCA and make a model with the following parameters:
Repeat the modeling using the sample sets Versicolor and Virginica. Name each
model after the sample sets (close the viewer and save the model after each
calculation).
The program suggests three PCs as optimal for all models. Here we overrule this and use one PC for all models. The data contain only four variables, and two of the residual variance plots have a "break" at PC1, which indicates that one PC may be enough.
Task
Assign the sample set Testing to the classes Setosa, Versicolor, and Virginica.
How To Do It
Select Task - Classify. Use the following parameters:
Samples: Testing Variables: Measurements
Make sure that Centered Models is checked. Add the three models Setosa,
Versicolor, and Virginica. Mark each model and change the number of PCs to
use from three to one.
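If you want to mimic this classification outside The Unscrambler, the sketch below uses scikit-learn's built-in Iris data as a stand-in for the IRIS file (so the 75/75 split of the book is replaced by a random 50/50 split), one-PC disjoint PCA models per class, and a simplified assignment to the class with the smallest object-to-model distance, i.e. without the Si/Hi membership limits used by the program.

```python
# Sketch only: one-PC disjoint PCA models per class on scikit-learn's built-in Iris data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                         # stand-in for the IRIS file
X_cal, X_test, y_cal, y_test = train_test_split(X, y, train_size=0.5,
                                                stratify=y, random_state=0)

models = {c: PCA(n_components=1).fit(X_cal[y_cal == c]) for c in np.unique(y_cal)}

def si(pca, X):
    # object-to-model distance: RMS residual after projection onto the class model
    resid = X - pca.inverse_transform(pca.transform(X))
    return np.sqrt((resid ** 2).mean(axis=1))

# simplified assignment: the class model with the smallest distance (no Si/Hi limits)
dists = np.column_stack([si(m, X_test) for m in models.values()])
assigned = dists.argmin(axis=1)
print("fraction assigned to the correct class:", round(float((assigned == y_test).mean()), 2))
```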
Tasks
Interpret the classification results in suitable plots.
Look at the Cooman’s and Si vs. Hi plots.
How To Do It
Click View when the classification is finished.
A table plot is displayed where the samples with a star in the column of a model
are classified to the corresponding model. These samples are within the limits as
defined by the significance level chosen and the leverage limit.
The significance level can be toggled using the Significance Level field in the
toolbar. We see that all samples are “recognized” by the correct class model.
However, some samples are indeed classified as belonging to two classes
simultaneously.
The classification results are well displayed in the Cooman’s plot. Select Plot-
Classification and choose the Cooman’s plot for models Setosa and
Versicolor.
This plot displays the sample-to-model distance for each sample to two models.
The new test set samples are displayed in green color (if you have chosen a
white background for your plots), while the calibration samples for the two
models are displayed in magenta and blue. (Yellow, magenta and cyan are used if you have chosen a dark background instead.)
Figure 15.9
The Cooman’s plot for classes Setosa and Versicolor nevertheless shows that all
Setosa samples are classified uniquely as belonging to the Setosa model only.
All Setosa samples are located to the left of the vertical line indicating
membership. We also see that almost all the Versicolor samples are also
classified correctly. Nonetheless, it seems like some of the Virginica samples
are also classified as belonging to this model. We also have to look at the
distance from the model center to the projected location of the sample, i.e. the
leverage.
This is done in the Si vs. Hi plot. Select Plot - Classification and choose Si
vs. Hi for model Versicolor.
Figure 15.10
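The leverage Hi on the abscissa of this plot measures how far the projected sample lies from the model center. One common way to compute it (the exact definition used by the program may differ in details) is shown in this small Python sketch, where T_train and t_new are assumed to hold the scores of the class model:

import numpy as np

def leverage(t_new, T_train):
    """Hi = 1/n + t_new' (T'T)^-1 t_new, where T_train holds the scores of the n
    training samples of the class model and t_new the scores of the new sample."""
    n = T_train.shape[0]
    cov_inv = np.linalg.inv(T_train.T @ T_train)
    return 1.0 / n + float(t_new @ cov_inv @ t_new)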
Task
Look at the model-to-model distance, discrimination power, and modeling power
plots.
How To Do It
Select Plot - Classification and choose the Model Distance plot for the
Versicolor model (you may double-click on the miniature screen in the dialog
box so that your plot uses the full window). If necessary, use Edit – Options,
Plot Layout: Bars.
This plot compares different models. A distance larger than 3 indicates a good
class separation. The models are then sufficiently different for most practical
classification and discrimination purposes.
It is clear from this plot that the Setosa model is very different from the
Versicolor model, while the distance to Virginica is small, barely over three.
Figure 15.11
This plot tells which of the variables describe the difference between the two
models well. A rule-of-thumb says that a discrimination power larger than three
indicates a good discrimination. The overall discrimination power can be
increased by deleting variables with a particularly low discrimination power, if
they also have low modeling power.
Figure 15.12
The plot above tells us that all variables have low discrimination power. This
tells us that none of the measured variables are very helpful in describing the
difference between these two types of IRIS, which is another indication of their
partly overlapping nature.
Select Plot - Classification and choose the modeling power for Versicolor.
Variables with a modeling power near a value of one are important for the
model. A rule-of-thumb says that variables with modeling power less than 0.3
are of little importance for the model.
Figure 15.13
The plot tells us that all variables have a modeling power larger than 0.3, which
means that all variables are important for describing the model. If not, we might
have wanted to remove the unimportant ones (on the basis of their modeling
power), but we must be very careful in such an isolated venture. It is much
better to use the aggregate information pertaining to both the modeling as well
as the discriminating power for all variables involved.
Note that we have used the test data set strictly for classification in order to get a
feeling for the classification efficiency. One might instead perform a similar,
independent SIMCA analysis in its own right on the test data set, assuming the
internal three-fold data structure is the same, i.e. that the first 25 samples are
known to belong to the Setosa species, etc. as for the first training data set, and
compare these two independent IRIS classifications.
Alternatively these two data sets may be pooled, assuming correct class-
membership for the second data set, and a new SIMCA-modeling may be
performed on the pooled data set.
In fact, it is entirely possible to use the test set in complete accordance with the
methodology of test set validation, which was described above for regression prediction assessment.
How would you set up, and perform, a classification test set validation?
We leave these interesting new tasks to the reader’s discretion to develop. Good
luck!
There are three major problems with this approach. First, people who apply it will
rarely understand how their system really works, so it may be difficult to transfer
the knowledge to a new application. Second, since there is usually some amount of
variability in the outcome of each experiment, interpreting the results of just two
successive experiments can be misleading because a difference due to chance only
can be mistaken for a true, informative difference. Lastly, there is also a large risk
that their solution is not optimal. In some situations this does not matter, but if they
want to find a solution close to optimum, or understand the application better, an
alternative strategy is recommended.
Let us show this with a simple example from an investigation of the conditions of
bread baking:
We list the variables, such as ingredients and process parameters, that may have an
influence on the volume of the bread, then study each of them separately. Let us say
that the input parameters we wish to study are the following: type of yeast, amount
of yeast, resting time and resting temperature.
First, you set type of yeast, amount of yeast and resting temperature to arbitrary
values (for instance, those you are most used to: e.g. the traditional yeast, 15g per kg
of flour, with a resting temperature of 37 degrees); then you study how bread volume
varies with time, by changing resting time from 30 to 60 minutes. A rough graph
drawn through the points leads to the conclusion that under these fixed conditions,
the best resting time is 40 minutes, which gives a volume of 52 cl for 100g of dough
(Figure 16.1).
Then you can start working on the amount of yeast, with resting time set at its “best”
value (40 minutes) and the other settings unchanged, while changing the amount of
yeast from 12 to 20g as in Figure 16.3. At this “best” value of resting time, the best
amount of yeast turns out to be not far from the 15g used in the first series of runs,
giving a volume of about 52 cl. Now the conclusion might seem justified that an
overall maximum volume is achieved with the conditions “amount of yeast = 15g,
resting time = 40 minutes”.
Figure 16.3 - Volume vs. Yeast. Second set of experiments, with the traditional
yeast, a resting time of 40 minutes and a resting temperature of 37 degrees
(Volume plotted against the amount of yeast, in g/kg of flour).
The graphs show that, if either amount of yeast or resting time is individually
increased or decreased from these conditions, volume will be reduced. But they do
not reveal what would happen if these variables were changed together, instead of
individually!
To understand the possible nature of the synergy, or interaction, between the amount
of yeast and resting time, you may study the contour plot below, see Figure 16.5,
which shows how bread volume varies for any combination of amount of yeast and
resting time within the investigated ranges. It corresponds to the two individual
graphs above. However, if the contour plot represents the true relationship between
volume and yeast and time, the actual maximum volume will be about 61 cl, not 52
cl! A volume of 52 cl would also be achieved at for example 50 minutes and 18g/kg,
which is quite different from the conditions found by the One-Variable-at-a Time
method. The maximum volume of 61 cl is achieved at 45 minutes and 16.5g/kg.
Figure 16.5 - Contour plot of bread volume for any combination of amount of yeast (12 to 18 g/kg) and resting time (30 to 50 minutes).
Figure 16.5 illustrates that the One Variable at a Time strategy very often fails
because it assumes that finding the optimum value for one variable is independent
from the level of the other. Usually this is not true.
The trouble with that approach is that nothing guarantees that the optimal amount of
yeast is unchanged when you modify resting time. On the contrary, it is generally the
case that the influence of one input parameter may change when the others vary: this
phenomenon is called an interaction. For instance, if you make a sports drink that
contains both sugar and salt, the perceived sweetness does not only depend on how
much sugar it contains, but also on the amount of salt. This is because salt interacts
with sugar in determining the perceived sweetness.
• We know exactly how many experiments we will need to get the information we want
• The individual effects of each potential cause, and the way these causes interact, can be studied independently from each other from a single set of designed experiments
• We analyze the results with a model which enables us to predict what would happen for any experiment within a given range
• We can draw conclusions about the significance of the observed effects, that is to say, distinguish true effects from random variations
The successive steps of building a new design and interpreting its results are listed
hereafter.
1. Define which output variables you want to study (we call them responses). You
will measure their values for each experiment.
2. Define which input variables you want to investigate (we call them design
variables). You will choose and control their values.
3. For each design variable, define a range of variation or a list of the levels you
wish to investigate.
4. Define how much information you want to gain. The alternatives are:
a- find out which variables are the most important (out of many)
b- study the individual effects and interactions of a rather small number of
design variables
c- find the optimum values of a small number of design variables.
5. Choose the type of design which achieves your objective in the most economical
way.
The various types of designs to choose from are detailed in the next sections.
Each design variable is studied at only a few levels, usually two: it varies from a low
to a high level. You investigate different combinations, for example low temperature
(5 degrees) and low time (5 minutes), high temperature (75 degrees) and high time
(20 minutes), low temperature (5 degrees) and high time (20 minutes), and vice
versa.
Each point in the cube in Figure 16.7 is an experiment. Three design variables varied
at 2 levels give 2³ = 8 experiments using all combinations. The table shows the low
(-) and the high (+) settings in each experimental run in a systematic order - standard
order.
As you can see, the number of experiments increases dramatically when there are
many design variables. The advantage of Full Factorial Designs is that you can
estimate the main effects of all design variables and all interaction effects. The
program generates the experimental design automatically. All you have to do is
define which design variables to use and the low and high levels.
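As an illustration of what the program does behind the scenes, the list of all 2³ = 8 combinations in standard order can be generated with a few lines of Python (the variable names are simply those of the bread example):

from itertools import product

variables = ["Temperature", "% yeast", "Time"]      # three two-level design variables

# In standard order the first variable alternates fastest, the last one slowest
runs = [tuple(reversed(levels)) for levels in product([-1, +1], repeat=len(variables))]

for run_no, run in enumerate(runs, start=1):
    print(run_no, dict(zip(variables, run)))        # e.g. 1 {'Temperature': -1, '% yeast': -1, 'Time': -1}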
Effects
The effects are also calculated by the program, and we will deal with that later in the
section about Analysis of Effects later in this chapter. However, to
understand what “effect” means and how to interpret an effect, we will go through
the definition here.
The variation in a response generated by varying a design variable from its low to its
high level is called main effect of that design variable on that response. It is
computed as the linear variation of the response over the whole range of variation of
the design variable. There are several ways to judge the importance of a main effect,
for instance significance testing or use of a normal probability plot of effects.
Interaction effects are computed using the products of several variables (cross-
terms). There can be various orders of interaction: two-factor interactions involve
two design variables, three-factor interactions involve three of them, and so on. The
importance of an interaction can be assessed with the same tools as for main effects.
Important Variables
Design variables that have an important main effect are important variables.
Variables that participate in an important interaction, even if their main effects are
negligible, are also important variables.
Figure: Volume plotted against Temperature (35°C, 37°C) for the two types of yeast. With yeast C2 the Volume goes from 40 to 60; with yeast C1 it goes from 30 to 20.
A main effect reflects the effect on the response variable of a change in a given
design variable, while all other design variables are kept at their mean value:
Figure: illustration of a main effect. Averaged over Temperature (35°C and 37°C), yeast C1 gives a Volume of 25 and yeast C2 gives 50, so the main effect of the type of yeast is +25.
An interaction reflects how much the effect of a first design variable changes when
you shift a second design variable from its average value to its high level (which
amounts to the same as shifting it halfway between low and high).
With yeast C2, Volume increases by 20 (+20 = 60 - 40) when we change the temperature
from 35 to 37 degrees. But Volume decreases by 10 (-10 = 20 - 30) for the same change
in temperature if we use yeast C1.
That is, we get an increase with one yeast, but a decrease with another; the effect of
temperature depends on which yeast we use. So the interaction effect is
(1/2)*(20 - (-10)) = 30/2 = +15.
Interaction - the effect of one variable depends on the level of another variable, as
illustrated in Figure 16.19.
and Lemon). You can also include the computed interactions (between Salt and
Sugar: Salt*Sugar, between Salt and Lemon: Salt*Lemon, and so on) and the
measured response variables (here only one, e.g. Sweetness).
Table 16.1
RUN Salt Sugar Lemon Salt*Sugar Salt*Lemon Sugar*Lemon Sugar*Salt*Lemon Sweet
1 - - - + + + - 1.7
5 - - + + - - + 4.5
3 - + - - + - + 5.2
7 - + + - - + - 7.2
2 + - - - - + + 3.5
6 + - + - + - - 2.1
4 + + - + - - - 2.8
8 + + + + + + + 4.8
The main effect of Salt on response Sweetness in the table above is -1.35.
Calculation:
The main effect of Salt on Sweetness =
(3.5 + 2.1 + 2.8 + 4.8)/4 - (1.7 + 4.5 + 5.2 + 7.2)/4 = 3.3 - 4.65 = -1.35
Interpretation: This means that by increasing Salt from its low to its high level, the
response Sweetness will decrease by 1.35 units.
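The same arithmetic can be checked with a few lines of Python; the coded columns and responses below are copied from Table 16.1, with the rows in the order they are listed:

import numpy as np

salt  = np.array([-1, -1, -1, -1, +1, +1, +1, +1])
sugar = np.array([-1, -1, +1, +1, -1, -1, +1, +1])
lemon = np.array([-1, +1, -1, +1, -1, +1, -1, +1])
sweet = np.array([1.7, 4.5, 5.2, 7.2, 3.5, 2.1, 2.8, 4.8])

def effect(column, response):
    """Effect = mean response at the high (+) level minus mean response at the low (-) level."""
    return response[column == +1].mean() - response[column == -1].mean()

print(effect(salt, sweet))           # -1.35, as computed above
print(effect(salt * sugar, sweet))   # the Salt*Sugar interaction effect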
The effects can also be obtained by fitting a regression model to the data, i.e. by
finding the regression coefficients bi. This can be done using several methods.
MLR is the most usual, but PLS or PCR may also be used. If we have three design
variables and want to investigate one response, the following expression will be used:
y = b0 + b1x1 + b2x2 + b3x3 + b12x1x2 + b13x1x3 + b23x2x3 + b123x1x2x3
Mean Effect
The mean effect is the average response of all the experiments and equals b0 in the
regression equation. In ANOVA it is simply the average response value.
Main Effects
The main effect of variable A is an average of the observed difference of the
response when A is varied from the low to the high level. The estimated effect
equals twice the b-coefficient for variable A in the regression equation, and so on.
Interaction Effects
An interaction effect AB means that the influence of changing variable A will
depend on the setting of variable B. This is analyzed by comparing the effects of A
when B is at different levels. If these effects are equal, then there is no interaction
effect AB. If they are different, then there is an interaction effect. Estimated
interaction effects again equal twice the corresponding b-coefficients.
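A short Python sketch shows this link between effects and b-coefficients for the sweetness example above; the data are the same as in Table 16.1, and the saturated model with all interactions is an assumption made for the illustration:

import numpy as np

salt  = np.array([-1, -1, -1, -1, +1, +1, +1, +1])
sugar = np.array([-1, -1, +1, +1, -1, -1, +1, +1])
lemon = np.array([-1, +1, -1, +1, -1, +1, -1, +1])
sweet = np.array([1.7, 4.5, 5.2, 7.2, 3.5, 2.1, 2.8, 4.8])

# Coded design matrix: intercept, main effects, two- and three-factor interactions
X = np.column_stack([np.ones(8), salt, sugar, lemon,
                     salt * sugar, salt * lemon, sugar * lemon, salt * sugar * lemon])
b, *_ = np.linalg.lstsq(X, sweet, rcond=None)

print(round(b[0], 3))       # b0 = mean effect (average of all responses)
print(round(2 * b[1], 3))   # 2 * b(Salt) = main effect of Salt = -1.35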
X1 (Sugar)   X2 (Salt)   X3 (= Sugar*Salt)
    -            -              +
    -            +              -
    +            -              -
    +            +              +
The smart subset of combinations of three design variables gives us only 2³⁻¹ = 4
experiments. This is therefore called the half-fraction of a Full Factorial Design, or a
Fractional Factorial Design with a degree of fractionality of one.
In the table, you can see that the sign for variable X3 is the same as for the
interaction between X1 and X2 - which is X1 times X2 (X1* X2).
Confounding
The price to be paid for performing fewer experiments is called confounding, which
means that sometimes you cannot tell whether variation in a response is caused by,
for instance, Sugar, or the interaction between Salt and Lemon. The reason is that
you may not be able to study all the main effects and all the interactions for all the
design variables, if you do not use the full factorial set of experiments. This happens
because of the way those fractions are built: some of the resources that would
otherwise have been devoted to the study of interactions are instead used to study
the main effects of more variables.
The list of confounding patterns in the program shows which effects can be
estimated with the current number of experimental runs. For instance, A=BC means
that the main effect of A will be mixed up (confounded) with the interaction (BC)
between B and C.
If you are interested in the interactions themselves, using a design where two-
variable interactions are confounded with each other, you will only be able to detect
whether some of them are important, but not to tell for sure which are the important
ones. For instance, if AD (confounded with BC, “AD=BC”) turns out as significant,
you will not know whether AD or BC (or a combination of both) is responsible for
the observed change in the response.
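A tiny numerical experiment makes confounding concrete. In the half-fraction shown earlier, the column used for X3 is literally the product of the X1 and X2 columns, so a response driven purely by the X1*X2 interaction is indistinguishable from one driven purely by a main effect of X3 (the response values below are invented for the illustration):

import numpy as np

x1 = np.array([-1, -1, +1, +1])
x2 = np.array([-1, +1, -1, +1])
x3 = x1 * x2                                  # the generator of the half-fraction: X3 = X1*X2

y_from_interaction = 5 + 2 * (x1 * x2)        # hypothetical response driven only by X1*X2
y_from_main_x3     = 5 + 2 * x3               # hypothetical response driven only by X3

print(np.allclose(y_from_interaction, y_from_main_x3))   # True: the design cannot tell them apart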
You can see that despite the large number of variables we can still keep the number
of experiments on a manageable level by using Fractional Factorial Designs.
which are the most important variables. This is achieved by including many
variables in the design, and roughly estimating the effect of each design variable on
the responses. The variables which have “large” effects can be considered as
important.
Design variables that have an important main effect are important variables.
Variables that participate in an important interaction, even if their main effects are
negligible, are also important variables.
Note!
Conditions 1) and 2) can apply separately, or together. Note that in case
1), there may be other valid designs than a full factorial, whereas in case
2), no other type of design can be built.
Figures: a design variable (Mixing Speed) studied at the levels 50, 60, 70 and 80, and the design cube for the three two-level variables Temperature, % yeast and Time, where each corner, e.g. (- - -) or (+ - -), is one experiment.
Example: The Full Factorial Design for 5 variables with 2 levels each includes
2x2x2x2x2 = 32 experiments.
Example: The Full Factorial Design studying the effects of variable A (3 levels),
variable B (2 levels) and variable C (5 levels) includes 3x2x5 = 30 experiments.
In addition, center samples can be included in the design whenever all design
variables have a continuous range of variation. The center samples are experiments
which combine the mid-levels of all design variables. They are useful for checking
what happens between Low and High (it may be non-linear). They are usually
replicated, i.e. the experiment is run several times, so as to check how large the
experimental error is.
Example: If you are studying the effects of variables Temperature (28 to 33°C)
and Amount of yeast (8 to 12 g), it is recommended to include a center sample,
replicated three times. These three experiments will all have
Temperature=30.5°C and Amount of yeast=10 g.
1. You want to study the effects of a rather large number of variables (3 to 15),
with fewer experiments than a Full Factorial Design would require
2. And all your design variables have two levels (either Low and High limits of a
continuous range, or two categories, e.g. Starch A / Starch B).
The important point is that the design now enables us to study 5 variables with no
more experiments than required for 4 variables, i.e. 2x2x2x2 = 16 experiments
instead of 32.
Table 16.2 and Figure 16.27 hereafter illustrate this principle in the simpler case of 3
variables studied with 4 experiments instead of 8. You can see how variable Time,
introduced in the design originally built for variables Temperature and %Yeast only,
is combined with the other two variables in a balanced way.
Figure 16.27: the design cube for Temperature, % yeast and Time; the four selected corners, such as (- - -) and (+ - -), are the experiments of the half-fraction.
Example: If you want to investigate the effects of 6 design variables, there are
three possible Fractional Factorial Designs, including respectively 8, 16 and 32
experiments (the full factorial requires 64 experiments).
In addition, as with full factorials, center samples can be included in the design
whenever all design variables have a continuous range of variation. The center
samples are experiments which combine the mid-levels of all design variables. They
are useful for checking what happens between Low and High (it may be non-linear).
They are usually replicated, i.e. the experiment is run several times, so as to check
how large the experimental error is.
The fact that several effects cannot be separated at the interpretation stage is called
confounding. Whenever two effects are confounded with each other, they cannot
mathematically be computed separately. They appear as only one effect in the list of
significant effects. You can still see if the confounded effects are significant, but you
cannot know for sure which of the two is responsible for the observed changes in the
response values.
1. First screening of many variables: you wish to find out which variables are the
most important. In practice, if you have more than 8 design variables, you need at
least 32 experiments to detect interactions (with a Resolution IV design). If you
cannot afford to run so many experiments, choose a Resolution III design. Then
only main effects will be detected. If you have more than 15 variables, no
Fractional Factorial Design is available; consider a Plackett-Burman design instead (see below).
2. Screening of a reduced number of variables: you wish to know about the main
effects and interactions of a reasonable number of variables (4 to 8). This may
either be your first stage, or follow a first screening where these variables have
been found the most important. You should select at least a Resolution IV design
(available for 4, 6, 7, 8 design variables), which does not require more than 16
experiments. If you have 5 design variables, 16 experiments will give you a
Resolution V design. If you have 6 design variables, 32 experiments will give you
a Resolution VI design (even better). If you have 4 design variables, the only way
to study all interactions is a Full Factorial (see above) with 16 experiments.
3. Last stage before you start an optimization: you have run at least one screening
before, and identified 3 to 6 most important variables. Before you build a more
complex optimization design (see hereafter), you need to identify all interactions
with certainty. With 3 or 4 design variables, the design you need is a Full
Factorial. For 5 design variables, choose the Resolution V design with 16
experiments. If you have 6 design variables, choose the Resolution VI design
with 32 experiments.
a) You want to study the effects of a very large number of variables (up to 32), with
as few experiments as possible
b) And all your design variables have two levels (either Low and High limits of a
continuous range, or two categories, e.g. Starch A / Starch B).
As with factorial designs, each design variable is combined with the others in a
balanced way. Unlike Fractional Factorial Designs, however, they have complex
confounding patterns where each main effect can be confounded with an irregular
combination of several interactions. Thus there may be some doubt about the
interpretation of significant effects, and it is therefore recommended not to base final
conclusions on a Plackett-Burman design only. In practice, an investigation
conducted by means of a Plackett-Burman design should always be followed by a
more precise investigation with a Fractional Factorial Design of Resolution IV or
more.
The other area of use is for feasibility studies, when you do not want to invest too
much money into the first set of experiments which will tell you whether you can
obtain any valuable information at all.
No matter how you generated your data, you have to analyze them if you expect to
obtain information from them. After briefly introducing the logical steps in a data
analysis, this chapter focuses on the many different ways to analyze your
experimental results.
Data Checks
If you have ever worked with data - and you most probably have - you will recognize
this statement as true:
A data table usually contains at least one error.
This being a fact, there is only one way to ensure that you get valid results from the
analysis of experimental data: detect the error(s) and correct them! The sooner you
do this in your analysis, the better. Imagine going through your whole sequence of
successive analyses, producing your final results, and realizing that these results do
not make sense. Or, even worse: not realizing anything, and presenting your results -
your wrong results! To avoid the awkward and dangerous consequences of such a
situation, there is one recipe and only one: include error detection as the first step in
a data analysis.
Descriptive Analysis
No matter what the ultimate objective of your data analysis is, for instance
predicting fat content or understanding process malfunction, you will increase your
chances of reaching that objective by starting with a descriptive phase.
Descriptive methods for data analysis are tools which give you a feeling for your
data. In short, they replace dry numbers (your raw data) with something which
appeals to your imagination and intuition: simple plots, striking features.
Once you have in a way “revealed” what was hidden in your data, you can digest it
and transform it into information. It is also your duty to have a critical view of this
newly extracted information: compare the structures or features you have just
discovered, with your a priori expectations. If they do not match, it means that either
your hypotheses were wrong, or there is an error in the data which generates
abnormal features in the results. Thus descriptive analysis is also a powerful tool for
data checking and error detection.
Inferential Analysis
Whenever you are drawing general conclusions from a selection of observations, you
are making inferences. For instance, using the results from designed experiments to
determine which input variables have a significant influence on your responses, is a
typical case of inferential analysis, called significance testing.
Once you have cleaned up your data and revealed their information content, it is
time to start making inferences. Remember that experimental design is the only way
to prove the existence of a causal relationship between two facts! Which means that,
in practice, we will use inferential analysis mostly as a stage in the analysis of
designed data. Non-designed data will be analyzed with descriptive methods, and
predictive techniques if our objective requires it. Read what follows to understand
the differences between the two approaches.
Predictive Analysis
While we use inference to increase our knowledge, i.e. build up new rules which we
will then apply to reach a goal, predictive methods can help us obtain immediate
practical benefits. Let us illustrate the difference through an example.
The last stage of the analysis brings the two groups of variables together, and
enables you to draw final conclusions.
In PCR, study the results for as many PCs as there are variables (i.e. max. number of
PCs). This corresponds to the MLR solution.
Note! Artifacts
Factorial designs give only one PLS component per Y-variable because
you vary all variables equally! For the same reason the calibration
X-variance will be zero.
It may be difficult from this plot to say where the limit between significant and
insignificant effects lies.
There are several ways to find the significant effects, either using an F-test or a
p-value, or by studying the effects relative to each other in a Normal probability plot
of effects (the Normal B-plot).
Figure: plot of the estimated effects for the response Yield against the X-variables (tutd-0, (Yvar,PC): (Yield,1)).
RSD = √( Σi=1..n (yi − ŷi)² / (n − 1) )
F is compared with the critical value of the F-distribution with m-1 and n-1 degrees of
freedom at the chosen significance level (typically 95%). You find this F-value in a
statistical table.
If F is larger than the tabulated F-value, the effect is regarded as significant.
P-Value
A complementary measure is the P-value. The P-value is the probability of observing an
effect this large if the true effect were zero. For instance, PA = 0.01 means that effect A
is significant at the 99% confidence level.
If the P-value you get from the statistical table is small, then the effect is significant.
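Instead of a printed statistical table, the critical F-value and the P-value can be looked up in software; here is a minimal Python sketch, where the F statistic and the degrees of freedom are arbitrary example numbers:

from scipy import stats

F = 7.2              # hypothetical F statistic for an effect
dfn, dfd = 1, 3      # hypothetical degrees of freedom for the effect and for the error

critical = stats.f.ppf(0.95, dfn, dfd)   # tabulated F-value at the 95% significance level
p_value = stats.f.sf(F, dfn, dfd)        # probability of an F at least this large if the effect were zero
print(critical, p_value, F > critical)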
Unfortunately this approach has a few pitfalls and, even worse, most people are
unaware of them!
The P-values are totally irrelevant if the number of degrees of freedom for the
estimate of the error is small, i.e. there were few replicates to estimate the error.
All P-values become low, that is all effects seem significant, if the error is very
small. This may occur if the reference method is very accurate.
If this is the situation, a better strategy may be to look at the effects relative to each
other, for instance in the Normal B-plot!
Figure 16.31: scatter plot of two weakly related variables; R = -0.42, P = 0.0006.
In Figure 16.31 the P-value is very small, implying a significant effect. However
most people have difficulties in seeing a “significant” trend and correlation in the
plotted data!
Since the F-test and the P-value are two sides of the same coin, this also applies for
the results of the F-test.
Figure: Normal probability plot of effects (Normal B-plot) for the response Yield (tutd-0, PC = 1). The abscissa shows the B-coefficients (about -10 to 20) and the ordinate the cumulative probability (3.33% to 96.67%); the points are labeled with the main effects A-E and their interactions.
The abscissa axis shows the size of the b-coefficients. The ordinate axis shows the
probability. (An F-test is used to calculate this probability.)
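For readers who want to reproduce such a plot outside The Unscrambler, here is a rough Python sketch; the effect values are invented, and the ordinate uses normal quantiles, which is equivalent to the cumulative probability scale of the figure:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

effects = {"A": 9.5, "B": 6.1, "C": 12.3, "D": -0.8, "E": 0.5,
           "AB": 1.1, "AC": -7.1, "BC": -6.0}                 # hypothetical estimated effects
names, values = zip(*sorted(effects.items(), key=lambda item: item[1]))

n = len(values)
quantiles = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # plotting positions

plt.scatter(values, quantiles)
for value, q, name in zip(values, quantiles, names):
    plt.annotate(name, (value, q))
plt.xlabel("Effect"); plt.ylabel("Normal quantile")
plt.show()

Effects lying far from the straight line formed by the small, central effects are the candidates for significance.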
Since the center points are not used to estimate the effects, it does not matter which
setting you use for discrete variables in these experiments.
Curvature Check
The b0 is an estimate of the average response value. If b0 is different from the
average response at the center point(s), then the response surface is probably curved
and you may need to continue with a response surface design to make a quadratic
model. If you have no center points, you may instead calculate the average response
of all the experiments.
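The curvature check itself is just a comparison of two averages; a small Python sketch with invented numbers:

import numpy as np

cube_yield   = np.array([10, 35, 42, 90, 28, 55, 61, 88])   # hypothetical responses of the factorial runs
center_yield = np.array([72, 70, 74])                       # hypothetical replicated center points

b0 = cube_yield.mean()                    # estimate of the average response from the factorial part
curvature = center_yield.mean() - b0      # a large difference suggests a curved response surface
print(b0, center_yield.mean(), curvature)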
An easy way to do this in The Unscrambler is to make a new model where you
weight all the insignificant effects to zero. Then plot the residuals of this reduced
model, for example as a Normal probability plot of residuals. They should now be
random if all systematic variation in the response is explained by the significant
design variables, and they should then form a more or less straight line through (0, 50).
Sometimes it is difficult to see which of the effects are significant; are they on or off
the line in the Normal B-plot? Start by making reduced models where you first keep
only the obvious ones. If you have deleted a significant design variable, there will be
unexplained systematic variations in the reduced model. The residuals will not be
random and there will be no straight line in the Normal Residuals plot! You should
then put the doubtful variables back into the model, one by one, until you are
satisfied.
Normally the response surface plot is associated with optimization designs but the
plot illustrates well the interactions and how the response varies with the design
variable settings. Note that such a plot will be a poor description of the true
response surface if there is curvature!
Task
Find which variables have significant influence on the yield. Use a (fractional)
factorial design. From the literature, there is an indication that the following
variables could influence the yield and that the ranges given are appropriate:
How to Do it
1. Make the design
Select a suitable design to study the influence of the five design variables on the
yield. (The Yield is the response variable.)
Go to File – New Design, choose From Scratch and hit Next. Select Create
Fractional Factorial Design, and hit Next.
Select New to enter a new design variable. Fill in a name and levels of variable A.
Hit OK and do the same for the rest.
(You can edit your entries with Properties or by double-clicking on the list.)
Then enter a New Response variable; Yield. When satisfied - press Next.
Select Design type. Choose a design for max. 16 experiments (not counting center
points). The design should enable you to estimate all main effects and to see if
there are any interactions between any two of the variables.
look for Files of type Designed Data instead, and select WILLKIND which
already contains the response values.
Then use Plot - Effects - Details and select Normal probability plot and Include
Table.
Which are the largest effects? Which ones are likely to be significant?
Do you get the same information as in the Effects Overview?
The lower plot shows Mean and Std. Deviation for all design samples.
Select Plot - Statistics and plot Mean and Std. Deviation for Group containing
both Design Samples and Center Samples.
Is the standard deviation of the cube samples much larger than the standard
deviation of the replicated center points? What can you conclude from this
regarding the experimental error and/or the precision of the response
measurement?
How large is the average response value (yield) in the cube samples? How large
is the average response in the center points?
Is the relationship between yield and the design variables linear? How can you
tell?
Summary
The Fractional Factorial Resolution V design requires 16 experiments (which we can
afford in this case). All main effects will be estimated without any confounding
problems. The two-variable interaction effects will be confounded with three-
variable interactions, but those are in general negligible, so this is not a problem.
The Fractional Resolution III design requires only 8 experiments, but all main
effects will be confounded with one or two interaction effects. If there are
interactions it will therefore be difficult to interpret the results from this design.
Center points are used to check curvature. If we replicate them we can also use them
to estimate the experimental error. This gives us 18 experiments in total. If there are
no center points it is difficult to get good estimates of significance. Then you will
need to use the Normal probability of effects plot to get an indication. We normally
randomize the experiments to avoid systematic effects, so the lab report we print
out should be randomized.
The list of experiments in standard order shows that the design was built as if there
were only 4 design variables (A, B, C, D) combined factorially, and that the fifth
variable (E) was generated from the interaction column ABCD.
The Effects Overview plot shows that Temperature, Sulfur and Morpholine have
positive significant effects. The AC and BC interactions are significant, negative
effects.
The other effects are not significant at the 5% level.
The Normal Probability plot indicates similar results, but you do not have
significance levels - you have to interpret which effects are likely to be significant:
Large effects that are far from the normal distribution line are likely to be
significant. The significant effects detected by ANOVA are the only ones that stand
out from the normal distribution line.
The detailed ANOVA table displayed together with the normal probability plot lists
the values of the effects, their significance level (p-value), and the confounding
pattern. It is a good summary of the overall results.
Residuals are very small. In fact, it is not meaningful to look at them because the
design has exactly as many cube samples as there are terms in the model (i.e. the
model is saturated), so the fit is perfect by construction. You might also have noticed
that the model had an R-square of 1.000, for the same reason.
Statistics
The percentile plot shows a very wide range of variation for yield: from 10 to 90
approximately. This indicates that the levels chosen for the design variables
generated enough variation in the response. The distribution of Yield values is
slightly asymmetrical: half the measured values are above 75 approximately. The
standard deviation of the cube samples (left bar) is much larger (about 50) than that
of the replicated center points (about 5). This means that the experimental error is so
small compared to the overall variation that the results can be trusted. (If the
variations due to experimental or instrumental variability were of the same order of
magnitude as the variation in the whole experiment series - caused by changing the
design variable settings - then we could not draw any conclusions about effects.)
The average response in the cube samples is close to the center samples’ average
value, so we can conclude that the relationship looks linear.
When there are no replicated center points (or reference points) and the design is
saturated, there are no residual degrees of freedom to estimate the significance. The
only significance testing method that applies to such a case is COSCIND. The
Effects overview table is then different from the ordinary one. The effects are
displayed by increasing order of absolute value. You should read the p-values until
you find the first significant effect; then all larger effects are assumed to be at least
as significant. The Normal probability plot of effects is a useful complement to the
COSCIND method, to make sure that all important effects are detected.
Since the point is to study what happens anywhere within a given range of variation,
optimization designs can only investigate design variables which vary over a
continuous range. As a consequence, if you have previously investigated any
category variables, you have to select their best level according to your screening
results, and fix them at the optimization stage.
Two very different approaches are possible; each defines a particular type of
optimization design.
You can read more about these types of design in the next two chapters.
• A linear part which consists of the main effects of the design variables.
• An interaction part which consists of the 2-variable interactions.
• A square part which consists of the square effects of the design variables, necessary to study the curvature of the response surface.
The model results are visualized by means of one or several response surface plots,
where you can read the value of a response variable for any combination of values of
the design variables.
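A quadratic model of this kind can be written out and fitted by ordinary least squares; the Python sketch below uses two coded design variables and invented response values, purely to show the structure of the model (linear, interaction and square terms):

import numpy as np

# Coded settings of a small Central Composite Design (hypothetical) and invented responses
X = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1],
              [-1.41, 0], [1.41, 0], [0, -1.41], [0, 1.41],
              [0, 0], [0, 0], [0, 0]], dtype=float)
y = np.array([60, 70, 72, 80, 58, 75, 65, 85, 90, 91, 89], dtype=float)

x1, x2 = X[:, 0], X[:, 1]
terms = np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])
b, *_ = np.linalg.lstsq(terms, y, rcond=None)

print(dict(zip(["b0", "x1", "x2", "x1*x2", "x1^2", "x2^2"], np.round(b, 2))))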
Note!
Case b) only applies if the ranges of variation of your design variables are
the same at the screening and the optimization stage. In addition, if your
optimization is based on fewer design variables than the screening,
re-using the previous experiments is only possible if the variables you are
dropping from the investigation are fixed at their Low or High level.
The star samples combine the center level of all variables but one, with an extreme
level of the last variable. The star levels (Low star and High star) are respectively
lower than Low cube and higher than High cube. Usually, these star levels are such
that the star samples have the same distance from the center as the cube samples. In
other words, all experiments in a Central Composite Design are located on a sphere
around the Center sample.
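The geometry described above is easy to write down explicitly; here is a plain Python sketch for two design variables in coded units (4 cube corners, 4 star samples at distance 1.41 = √2, and, as an example, 5 replicated center samples):

import numpy as np
from itertools import product

cube = np.array(list(product([-1, +1], repeat=2)), dtype=float)   # the 4 cube corners
alpha = np.sqrt(2)                                                # star distance for 2 variables
star = np.array([[+alpha, 0], [-alpha, 0], [0, +alpha], [0, -alpha]])
center = np.zeros((5, 2))                                         # replicated center samples

design = np.vstack([cube, star, center])                          # 13 experiments in coded units
print(np.linalg.norm(cube, axis=1))    # all cube corners lie at distance sqrt(2) from the center
print(np.linalg.norm(star, axis=1))    # the star samples lie at the same distance: the design is rotatable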
Note!
From these numbers, it is pretty obvious that it is not recommended to
build a Central Composite Design with 6 variables. The total number of
experiments is so large that something is bound to go wrong, which will
prevent you from interpreting the results clearly enough. In practice, if you
have 6 design variables to investigate, run one more screening before
starting an optimization.
The solution consists in tuning down the distance between Star samples and Center
of the design, until we reach possible values for all Low star and High star levels. In
the most extreme case, the Star levels will be the same as the Cube levels. Then the
star samples are located on the center of the faces of the cube - see Figure 16.37 for
an illustration.
Fortunately, the Central Composite Design consists of two main sets of experiments:
Cube and Star samples. These two groups have the mathematical property that they
contribute to the estimation of a quadratic model independently from each other. As
a consequence, if some of the experimental conditions vary slightly between the first
group and the second one, it will of course generate some “background noise”, but it
will not change the computed effects.
y The first block contains all Cube samples and half of the Center samples.
y The second block contains all Star samples and the other half of the Center
samples.
So the design does indeed avoid extreme situations: if you study Figure 16.41 you
will see that the corners of the cube are not included. All experiments actually lie on
the centers of the edges of the cube.
As a consequence, it is also obvious that all experiments lie on a sphere around the
center: the design is rotatable.
And finally, since only 3 levels are used, there is no risk of including “impossible”
levels once you have defined a valid range of variation for each design variable.
Note!
If you compare these numbers with those of the Central Composite
Design (see Table 16.5), you will notice that the Box-Behnken generally
requires fewer experiments. So if you do not have any particular reason
for using a Central Composite Design, Box-Behnken is an economical
alternative.
• If you need blocking (read about that in the Central Composite chapter above): the Central Composite is the only type of design with that possibility.
• If you are investigating 2 design variables only: a Central Composite Design is the only choice.
• If you wish to avoid extreme situations (because they are likely to be difficult to handle, or because you already know that the optimum is not in the corners): the Box-Behnken design is preferable.
• If you cannot go out of the cube but want a rotatable design: the Box-Behnken design is the only one with that combination of properties.
• If you do not have any special constraints, except budget: the Box-Behnken design is more economical than the Central Composite.
The last stages of the analysis involve both inference and prediction: it is important
to know which effects included in the model are useful and which can be taken out,
so that the model finally used for prediction is simple, effective and robust.
Note that, once this analysis is completed, there will be a confirmation stage if
satisfactory conditions have been identified; the analysis of the confirmation
experiments will consist mostly of a descriptive stage where the results are checked
against the expectations.
Problem Description
This exercise was built from the enamine synthesis example published by R.
Carlsson in his book “Design and Optimization in Organic Synthesis”, Elsevier,
1992.
A standard method for the synthesis of enamine from a ketone gave some problems
and a modified procedure was investigated. A first series of experiments gave two
important results:
• A new procedure was built up, which shortened reaction time considerably.
• It was shown that the optimal operational conditions were highly dependent on
the structure of the original ketone.
Data Table
From the previous experiments, reasonable ranges of variation were selected for the
following 4 design variables:
Tasks
Select a screening design requiring a maximum of 11 experiments, that will make it
possible to estimate all main effects and detect the existence of 2-factor interactions.
Note: with 4 design variables, you need a Fractional Factorial Design to keep the
number of experiments lower than 16 (2⁴).
How To Do It
Go to File – New Design, choose From Scratch and hit Next. Then select Create
Fractional Factorial Design, and hit Next.
Then you may define your variables. From the Define Variables window, use
New in the Design Variables box to add each new design variable. From the Add
Design Variable window, name each new design variable (e.g. TiCl4,
Morpholine, Temperature, Stirring), select Continuous, enter the low and high
levels (look up the levels in the table on the previous page; use only the Low and High
levels), and validate with New.
Note: in order to be allowed to specify center samples, you will have to define
Stirring rate as a continuous variable; you can give it the arbitrary levels -1 and 1,
where -1 stands for “no stirring” and 1 stands for “high stirring”.
After all four design variables have been defined, the Design Variables box
should contain the following:
Table 16.9
ID Name Data Type Levels
A TiCl4 Continuous 2 (0.6;0.9)
B Morpholine Continuous 2 (3.7;7.3)
C Temperature Continuous 2 (25.0;40.0)
D Stirring Continuous 2 (-1.0;1.0)
From the Non-design Variables window, use New to define the response variable
(Yield).
Now you are ready to choose your design type more specifically.
Use Next to get into the Design Type window.
You will notice that the default choice is set to a Fractional Factorial Resolution
IV design, which consists of 8 experiments. Try other choices by toggling
Number of Experiments to Run up or down. (Actually, there is only one possible
Fractional Factorial Design with 4 variables; if you go up to 16 samples, then you
have a Full Factorial Design.)
Study the confounding pattern of the suggested design. You can see that all main
effects are confounded with 3-factor interactions, which is acceptable if we
assume that those interactions are unlikely to be significant. The 2-factor
interactions are confounded two by two.
The last step consists in setting the numbers of replicates and center samples.
Use Next to get into the Design Details window (Figure 16.43). Keep Number of
Replicates to 1, and add 3 Center Samples. Click Next twice until you reach the
Last Checks window (Figure 16.43). A summary is displayed to make sure that
all your design parameters have the correct values. If not, use Back and make
corrections.
Once you are satisfied with your design specifications, use Finish to exit. The
generated design is automatically displayed on screen. You can use the View
menu to toggle between display options. Try Sample Names and Point Names,
Standard Sample Sequence and Experiment Sample Sequence (randomized order).
It would now be safe to store your new data table into a file, using File - Save As;
give it a name, e.g. Enam FRD. Note that you should not overwrite the existing
file Enam_FRD. You need this file later in the exercise.
Tasks
Study the main effects of the four design variables, and check whether there are any
significant interactions. The simplest way to do this is to run an Analysis of Effects.
Then, interpret the results.
How To Do It
First, you should enter the response values. Since this has already been done, you
just need to read the complete file. Use File - Open, and select among the
Designed Data list the file named Enam_FRD, which already contains the
response values.
From the Analysis of Effects window use the Samples, X-variables and
Y-variables boxes to select the appropriate samples and variables. Sample Set
should be Cube & Center Samples (11). X-variables should be Design Vars + Int
(4+3). Y-variable set should be Cont Non-Design Vars (1).
After the calibration has completed successfully, click “View” to get an overview
of the model results. Before doing anything else, use File - Save to save the results
file with a name like “Enam FRD AoE-a”, for example.
You can see that three effects are considered to be significant: Main effect TiCl4
(A), Interaction AB or CD, and Main effect Morpholine (B).
Go to Window - Copy to - 2 and use the empty window to plot those effects. To do
that, go to Plot - Effects - Details and select Normal Probability only.
The normal probability plot of the effects (Figure 16.47) confirms the results of
the Effects Overview: the effect of Morpholine (B) is clearly very significant, and
AB=CD and TiCl4 (A) are also likely to be significant.
So we should just check the model for non-linearities. To do that, go back to the
Editor window and select Task -Statistics. Choose Cube & Center Samples (11),
then OK. View the results.
The upper plot shows the range of variation of the response (Yield).
The lower plot shows mean and standard deviation over all samples. Click that
plot and use Plot - Statistics, selecting Mean and Std. Dev. for Sample Groups
Design Samples and Center Samples; validate your choices with OK.
The lower plot now displays the mean and standard deviation of all Design
samples compared to that of the Center samples only.
You can see that the standard deviation for the Center samples is about half the
overall standard deviation. This indicates some lack of reproducibility in the
Center samples; this is why most of the effects observed in the Analysis of
Effects were not found significant according to the Center significance testing
method. If you go back to the Editor and study the Yield values, you will notice
that Center sample Cent-c has a very different value from Cent-a and -b; maybe
that experiment was performed wrongly.
The other important information conveyed by that plot is that there is a strong
non-linearity in the actual relationship between Yield and the design variables:
the mean value for the Center samples is much higher than for the overall design.
There was some lack of reproducibility in the Center samples, although the
remaining part of the design showed a clear structure (according to the
COSCIND and Normal probability results). If new experiments are performed, it
will be useful to replicate the Center samples a few more times.
Task
Build a Central Composite Design to study the effects of the two important variables
(TiCl4 and Morpholine) in more detail. NB: the other two variables investigated in
the screening design have been set to their most convenient values: No stirring, and
Temperature=40°C.
How To Do It
Choose File - New Design to start the dialog that will enable you to generate a
designed data table as in the previous exercise. Select Create Central Composite
Design. From the Define Variables window, define the two design variables
TiCl4 and Morpholine with the same ranges of variation as previously, and the
response variable (Yield).
Check that the Design Variables box indicates the correct Star Points Distance
from Center, namely 1.41.
Once you are satisfied with your variable definitions, use Next to get into the
Design Details window. Set Number of Replicates to 1, and Number of Center
Samples to 5.
Check the summary displayed to the right of the window to make sure that all
your design parameters have the correct values. The design should include a total
of 13 experiments. Otherwise, use Back.
Once you are satisfied with your design specifications, use Finish to exit. The
generated design is automatically displayed on screen.
You may view the list of experiments in standard order to better understand the
structure of the design.
Task
Find the levels of TiCl4 and Morpholine that give the best possible yield. You will
need to use a Response Surface Analysis.
How To Do It
First, you should enter the response values, but this has already been done. Open
the Designed Data list the file named Enam_CCD, which already contains the
response values.
From the Response Surface window, check that the Samples, X-variables and
Y-variables boxes contain the appropriate selections. Select a Quadratic Model.
Click OK to start the analysis.
When the computations are finished, click View to study the results. But before
you start interpreting them, do not forget to save the result file!
The Summary shows that the model is globally significant, so we can go on with
the interpretation.
The Model Check indicates that the quadratic part of the model is significant,
which shows that the interactions and square terms included in the model are
useful.
The Variables ANOVA displays the values of the b-coefficients, and their
significance. You see that the most significant coefficients are for the linear and
quadratic effects of Morpholine; the quadratic effect of TiCl4 is close to the 0.05
significance level. That section of the table also tells you that the maximum point
is reached for TiCl4=0.835 and Morpholine= 6.504; the information displayed on
top of the table shows a Predicted Max Point Value of 96.747.
The Lack of Fit section tells you that, with a p-value around 0.19, there is no
significant lack of fit in the model. Thus we can trust the model to describe the
response surface adequately.
Here, you see that the residuals form two groups (positive residuals and negative
ones). Apart from that, they lie roughly along a straight line, and no extreme
residual is to be found outside that line. This means that there is no apparent
outlier.
On the Studentized residuals plot, all values are within the (-2;+2) range, which
confirms that there are no outliers. Furthermore, there is no clear pattern in the
residuals, so nothing seems to be wrong with the model.
The landscape plot displayed in the lower right quadrant shows you the shape of
the response surface: a kind of round hill with a maximum somewhere between
the center and maximum values of the design variables.
That plot is not precise enough to spot the coordinates of the maximum; the
contour plot displayed left is more suited to that purpose. For instance, you can
change the scaling to zoom around the optimum, so as to locate its coordinates
more accurately. Check that they match what is displayed in the ANOVA table.
You can also click at various points in the neighborhood of the optimum, to see
how fast the predicted values decrease. You will notice that the top of the surface
is rather flat, but that the further away you go, the more steeply the Yield decreases.
Finally, you may also have noticed that the Predicted Max Point Value is smaller
than several of the actually observed Yield values (sample Cube004a for instance
has a Yield of 98.7). This is not paradoxical, since the model smoothes the
observed values; those high observed values might not be reproduced if you
performed the same experiments again.
Since there was no apparent lack of fit, no outlier and the residuals showed no
clear pattern, the model could be considered valid and its results interpreted more
thoroughly.
• Define the variables that will be measured to describe the outcome of the
experimental runs (response variables), and examine their precision.
• Choose among available standard designs the one that is compatible with your
objective, number of design variables and precision of measurements, and has a
reasonable cost.
In the following sections we will go through the practical issues to consider, in the
same order as you enter them into the program when creating a new design.
Participants - Make sure that the following people take part in the brainstorming
session:
Project leader.
People who know the application, the product, the process.
People knowledgeable in measurement methods.
Somebody who has some experience in experimental design and data analysis.
People who will actually perform the experiments.
Objective - Define:
The application you are interested in. Which product? Which process?
The output parameters (responses) you are going to study. Which properties? Are
they precisely defined? Make sure that all relevant responses are included!
The measurement methods and protocols you will use. Is there a standard
measurement method for each of your responses? Have you got the necessary
instruments? Are your sensory descriptors adequate? Is the panel well trained?
Your target values, for each response. Do you want the response value to be
maximum? Minimum? Within a certain range? As close as possible to a reference
value? No target value, just detect variations?
• Take practical constraints into account. If there is only one supplier for raw
material A, it is of no use to consider a possible change of supplier. If the
regulations fix the amount of ingredient B, this is not a potential variable any
more.
• Use your previous knowledge. If earlier studies have shown that, for a
representative set of samples, the preservative has no effect on the sensory
properties of your product, then you do not need to investigate the effect of the
preservative any more.
• Do not forget your common sense. The taste of the product does not depend on
the color of the package! But it might be influenced by the packaging material.
• But do not leave any potential factor aside just because you assume that it has no
effect or because it has always remained fixed before. Use reasonable arguments
to agree on what can vary and what may have an influence. If a parameter can
vary and if its influence cannot be excluded, then it should be studied.
Outlining a Strategy
After you have reduced the list of potential factors to its minimum size, you have to
check whether the remaining number of factors is compatible with the precision of
your objective. Usually you wish to describe the variations of your responses
precisely depending on the values of your input parameters. You may understand
intuitively that this is easier to achieve if you have a small number of input
parameters to study!
Number of factors:   6 or more             →   4 to 8                    →   2 to 5
Designs:             Fractional Factorial      Fractional Factorial          Central Composite
                     or Plackett-Burman        (high resolution)             or Box-Behnken
                                               or Full Factorial
The designs available in The Unscrambler are the most common standard designs,
dealing with several continuous or category variables that can be varied
independently of each other.
Just select the type of design which matches the number of potential factors and
complexity level of your current step. For example, if you are ready to start an
advanced screening with 4 variables, choose a Full Factorial Design.
Since you already know how the main types of designs work, it is easy for you to
check that they will not lead to too many experiments. If you are unsure, try the most
economical type of design first. For instance, you wish to study the main effects and
interactions of five parameters (advanced screening with 5 variables): you could
build a Full Factorial Design, with 2^5 = 32 experiments. But there is a chance that a
Fractional Factorial Design 2^(5-1) will give you as much information with just 16
experiments (not counting center samples).
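If you like to see the bookkeeping spelled out, the coded runs of both alternatives can be listed in a few lines of Python. This is only a sketch outside The Unscrambler, and the generator E = ABCD used for the half fraction is just one common choice:

    # Sketch: coded runs of a full 2^5 factorial versus a 2^(5-1) half fraction.
    # Assumes the common generator E = A*B*C*D; other generators are possible.
    from itertools import product

    full = list(product([-1, 1], repeat=5))          # 2^5 = 32 runs in A, B, C, D, E
    half = [(a, b, c, d, a * b * c * d)              # 2^(5-1) = 16 runs, E = ABCD
            for a, b, c, d in product([-1, 1], repeat=4)]

    print(len(full), "runs in the full factorial")   # 32
    print(len(half), "runs in the half fraction")    # 16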
For a first screening, the most important rule is: do not leave out a variable that
might have an influence on the responses, unless you know that you cannot control it
in practice. It would be more costly to have to include one more variable at a later
stage, than to include one more into the first screening design.
For a more extensive screening, variables that are known not to interact with other
variables can be left out. If those variables have a negligible linear effect, you can
choose whatever constant value you wish for them (like for instance: the least
expensive). If those variables have a significant linear effect, then you should fix
them to the most suitable level to get the desired effect on the response.
The previous rule also applies to optimization designs, if you also know that the
variables in question have no quadratic effect. If you suspect that a variable can have
a non-linear effect, you should include it in the optimization stage.
• Its name.
• Its type: continuous or category.
• Its levels.
Continuous Variables
All variables that have numerical values and that can be measured quantitatively are
called continuous variables. This is somewhat of a misnomer in the case of discrete
quantitative variables, such as counts. It reflects the implicit use made of these
variables, namely modeling their variations with continuous functions.
The variations of continuous design variables are usually set within a pre-defined
range, which goes from a lower level to an upper level. At least those two levels
have to be specified when defining a continuous design variable.
You can also choose to specify more levels if you wish to study some values
specifically.
If only two levels are specified, the other necessary levels will be computed
automatically. This applies to center samples (which use a mid-level, half-way
between lower and upper), and star samples in optimization designs (which use
extreme levels outside the predefined range).
Note!
If you have specified more than two levels, then center samples will be
disabled.
Category Variables
In The Unscrambler, all non-continuous variables are called category variables.
Their levels can be named, but not measured quantitatively. Examples of category
variables are: color (Blue, Red, Green), type of texture agent (starch, xanthan, corn
starch), supplier (Hoechst, Dow Chemicals, Unilever).
For each category design variable, you have to specify all levels. Since there is a
kind of quantum jump from one level to another (there is no intermediate level in-
between), you cannot directly define center samples when there are category
variables.
the adequate levels is a trade-off between these two aspects. However, do not choose
a range that gives only “good” quality! Bad results are also useful for understanding
how the system works. We need to select a range for each design variable that spans
all the important variations!
Thus, a rule of thumb can be applied: make the range large enough to give effect and
small enough to be realistic. If you suspect that two of the designed experiments will
give extreme, opposite results, perform those first. If the two results are indeed
different from each other, this means that you have generated enough variation. If
they are too far apart, you have generated too much variation, and you should shrink
the ranges a bit. If they are too close, try a center sample: you might just have a very
strong curvature!
If you are not sure that your ranges of variation are suitable, perform a few pre-trials.
Section “Do a Few Pre-Trials!” gives you practical rules for that.
• Constant variables, i.e. variables that might have an influence on the outcome of
the experiments, and that are kept constant so as not to interfere with the design
variables.
• Non-controlled variables, i.e. variables that might have an influence on the
outcome of the experiments, and which you cannot control. In order to have a
possibility to take them into account in further analyses, you can record their
observed values during the experiments. In case they indeed vary, they may also
influence the results of the experiments and disturb the estimation of effects
using classical statistics like ANOVA.
However, since The Unscrambler includes multivariate analysis methods like
PLS, they can still be analyzed using these methods instead, taking the
uncontrolled variables into account too.
When you select among the available design types, you can interactively change
either the number of experiments to run or the resolution, which are linked. When
you make a change you can see how the confounding pattern changes, see page 374.
There is usually a trade-off between fewer experiments and less confounding.
Design Details
The next stage of your design specification concerns how to deal with possible
errors and uncertainty, by adding extra experimental points. These extra samples can
be of three types:
As soon as you enter new samples, an overview tells you how many experiments you
will now get in total.
Replicates
Replicates are experiments performed several times. They should not be confused
with repeated measurements, where the samples are only prepared once but the
measurements are performed several times on each.
When you try to find the effect on the response of changing from a low level to a
high level, you are really fitting a line between two experimental points. However,
do not forget that these points have observed values that are a combination of actual
(theoretical, unobserved) values and some error (experimental or measurement error)
that interferes with the observations. Therefore, each observed point has a certain
amount of imprecision that will reflect on the slope of the fitted line and make it
imprecise too. This is why you may include replicates into your design: by making
two or three experiments with the same settings, you will have the opportunity to
collect observed values that are slightly different from one another and thus estimate
the variability of your response.
(Figure: a line fitted between two experimental points, showing the assumed slope and
the range of possible slopes caused by measurement error.)
By making all experiments twice you get better precision in the results. “One
replicate” means that you make each experiment only once, while “two replicates”
means that you make them twice. Whether you decide to replicate the whole design
or not depends on cost and reproducibility. If you know that there is a lot of
uncontrolled or unexplained variability in your experiments, it may be wise to
replicate the whole design (make it twice).
Center Samples
Now, fitting a straight line assumes that the underlying phenomenon is linear; but
what if it is not? You should have the means to detect such a situation. Center
samples are used to diagnose non-linearities. By also making an experiment where
all design variables take their mid-levels (in the middle between low and high), you
will have a chance to compare the response value at this point with the calculated
average response. If they are not equal, it means that the relationship is non-linear. In
the case of high curvature, you will have to build a new design to describe a
quadratic relationship, for example a Central Composite or Box-Behnken design.
(Figure: assumed shape of the response relationship.)
Since replicating the whole experimental series usually is rather expensive, you may
instead include a replicated center point, i.e. you perform the “average” experiment
twice or three times. It can thus be used to check both the reproducibility of the
experiments (at least in the middle) and possible non-linearities. Of course, you can
never be sure that the level of imprecision is exactly the same in the center samples
as for the extreme levels of the design variables, but it is likely to be close to the
average variability.
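As a small illustration of these two uses of center samples, here is a sketch with made-up numbers (not from the book): the mean of the replicated center points is compared with the mean of the factorial points to reveal curvature, and their spread gives a rough estimate of the experimental error.

    # Sketch with hypothetical response values.
    import statistics

    factorial_y = [78.2, 81.5, 75.9, 84.1]   # responses at the cube (factorial) points
    center_y    = [86.0, 85.4, 86.8]         # replicated center samples

    curvature = statistics.mean(center_y) - statistics.mean(factorial_y)
    error_sd  = statistics.stdev(center_y)   # reproducibility estimate (n - 1 in the denominator)

    print(f"curvature estimate: {curvature:.2f}")
    print(f"experimental error (SD of center replicates): {error_sd:.2f}")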
We therefore recommend that you always make at least two center samples if possible!
A practical way to overcome this problem is to make a “center point” for each level
of the category variable, for example one using agent A and all other design
variables at their mid-level, and one using agent B and all other design variables at
their mid-level. Such pseudo-center samples can be included as reference samples,
see page 425. If you cannot make true center samples we recommend that you
replicate these reference samples (or one of them) and use them to check the
reproducibility of the experiments instead.
Reference Samples
Reference samples are experiments which do not belong to a standard design, but
which you choose to include for various purposes. If you want to compare the
designed samples with today’s production or competitors’ samples, these should be
included too. You will not enter the values of the design variables for these samples
(you seldom know the competitor’s recipe!) but their response values can usually be
measured and included in the analysis.
Another use of reference samples is to compensate for the fact that center samples
cannot be used when there are category design variables, as described on page 425.
Make pseudo-center samples as reference samples instead.
Randomization
Randomization consists in performing the experiments in random order, as opposed
to the standard order which is sorted according to the levels of the design variables.
Therefore the program sorts the experiments in random order when printing out the
lab report.
Incomplete Randomization
Sometimes, however, it is very impractical to perform all experiments in random
order. For example, the temperature may be very difficult or time-consuming to tune,
so the experiments will be performed much more efficiently if you tune that
parameter only a few times. It would be much easier to first run all experiments with
a low temperature and then all with a high temperature.
In The Unscrambler you can tick the box “Sorting Required During
Randomization” at the bottom of the Design details dialog before you select Finish.
Then you can select which variables you do not want to randomize. As a result, the
experimental runs will be sorted according to the non-randomized variable(s). This
will generate groups of samples with a constant value for those variables. Inside each
such group, the samples will be randomized according to the remaining variables.
But remember that you have done this and be aware of possible systematic effects.
This may be detected by studying the so-called residuals.
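The mechanics of such an incomplete randomization can be sketched as follows; the variable names and run list are hypothetical, not taken from the program:

    # Sketch: sort by the non-randomized variable, randomize within each group.
    import random

    runs = [{"temp": t, "pressure": p} for t in (-1, 1) for p in (-1, 0, 1)]

    random.shuffle(runs)                   # full randomization first
    runs.sort(key=lambda r: r["temp"])     # then group by the hard-to-change variable

    for i, r in enumerate(runs, start=1):  # run order is random within each temp group
        print(i, r)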
Do a Few Pre-Trials!
Sometimes there are combinations of low and high values for some variables that
cannot be accomplished. We recommend that you always do a few initial
experiments, for example the experiment where all design variables have their low
level values, the one with only high level values, and perhaps the center point where
all design variables have their average value. You should perform those two or three
experiments first, regardless of the randomization.
In this way you can easily check that the chosen ranges are wide enough. If the
responses are about the same for both extreme experiments, either your selected
variables have no effect on the responses, or the ranges are too narrow. You can also
check that the experiments can be conducted and the responses measured as planned,
and you have a chance to alter the procedure or the test plan before you have wasted
more effort. These initial experiments should thus also be used to check the
reproducibility and the measurement errors.
These experiments should normally give the most different samples, so if they are
too similar - rethink! You want to generate clearly different samples in order to
investigate how the responses vary, and to compare them.
• Either the experiments have provided you with all the information you needed:
then your project is completed.
• Or the experiments have given you valuable information, which you can use to
build a new series of experiments that will lead you closer to your objective.
In the latter case, sometimes the new series of experiments can be designed as a
complement to the previous design, in such a way that you minimize the number of
new experimental runs, and that the whole set of results from the two series of runs
can be analyzed together. This is called extending a design.
Extending an existing design is also a nice way to build a new, similar design that
can be analyzed together with the original one. For instance, if you have investigated
a baking process and recipe using a specific type of yeast, you might then want to
investigate another type of yeast in the same conditions as the first one, in order to
compare their performances. This can be achieved by adding a new design variable,
namely type of yeast, to the existing design.
Last but not least, you can use extensions as a basis for an efficient sequential
experimental strategy. That strategy consists in breaking your initial problem into a
sequence of smaller designs, each one building on the results of the previous one. The
most common types of extension are:
• Add levels: Whenever you are interested in investigating more levels of already
included design variables, especially for category variables.
• Add a design variable: Whenever a parameter that has been kept constant
previously is suspected of having a potential influence on the responses. Also,
whenever you wish to duplicate an existing design so as to apply it to new
conditions that differ by the values of one specific variable (continuous or
category), and analyze the results together. For instance, you have just
investigated a baking process using a specific yeast, and now wish to study
another similar yeast for the same process, and compare its performances to the
other one’s. The simplest way to do this is to extend the first design by adding a
new variable: type of yeast.
• Delete a design variable: If one or a few of the variables in the original design
have been found to be clearly non-significant by the analysis of effects, you
can increase the power of your conclusions by deleting these variables and reanalyzing
the design. Deleting a design variable can also be a first step before extending a
screening design into an optimization design. You should use this option with
caution if the effect of the removed variable is close to significance. Also make
sure that the variable you intend to remove does not participate in any significant
interaction!
• Add more replicates: If the first series of experiments shows that the
experimental error is unexpectedly high, replicating all experiments once more
might make your results clearer.
• Add more center samples: If you wish to get a better estimation of the
experimental error, adding a few center samples is a good and inexpensive
solution.
• Extend to higher resolution: Use this option for Fractional Factorial Designs
where some of the effects you are interested in are confounded with each other.
You can use that option whenever some of the confounded interactions are
significant, and you wish to find out which ones exactly. This is only possible if
there is a higher resolution Fractional Factorial Design. Otherwise, you can
extend to full factorial instead.
Caution!
Whichever kind of extension you use, remember that all the experimental
conditions not represented in the design variables must be the same for
the new experimental runs as for the previous runs.
Cross validation is impossible for Full Factorial Designs unless there are systematic
replicates, because each sample is equally important to the model. The validated
Y-variance using leverage correction may be useless too, because this method
simulates full cross validation.
If the main effects dominate, validation causes no problems. Validation may thus
primarily be a problem if interaction effects dominate. Use leverage correction
during calibration, but disregard the validation Y-variance. Study the calibration
Y-variance, which is a measure of the model fit, and study Y residuals.
With methods like PLS, it is possible to use RMSEC, Root Mean Square Error of
Calibration, or the calibration Y-variance. These measures express how well the
model has been fitted to the data.
Descriptive statistics consist of a few measures extracted from the raw data, either
by picking out some key values, or by very simple calculations.
Percentiles are values extracted from the raw data. The Unscrambler gives you the
following percentiles:
• Minimum and Maximum: the extreme values encountered in the current group of
samples.
• Quartiles: the values inside which the middle half of the observed values are to be
found - or outside which the 25% largest and the 25% smallest values are
encountered.
• Median: the value which cuts the observed values into two equal halves; in other
words, 50% of the samples have a larger value than the median, the remaining
50% have a smaller value.
• The Mean is the average of the observed values, i.e. the sum of the values,
divided by the number of samples in the group.
• The Standard deviation (abbreviated Sdev) is the square root of the variance; the
variance is itself computed as the sum of squares of the deviations from the mean,
divided by the number of samples minus one.
The mean is supposed to give an indication of the central location of the samples, i.e.
a value around which the most typical samples are located. The standard deviation
provides a measure of the spread of the observed values around the average, i.e. how
much any sample taken from the same population is likely to vary around the
average.
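If you want to check such numbers outside the program, the same summary statistics are a one-liner each, for instance with numpy (the response values are made up):

    # Sketch: the descriptive statistics listed above, computed with numpy.
    import numpy as np

    y = np.array([12.1, 14.3, 13.8, 15.0, 12.9, 16.2, 13.4, 14.7])  # hypothetical response

    print("Min / Max :", y.min(), y.max())
    print("Quartiles :", np.percentile(y, [25, 75]))
    print("Median    :", np.median(y))
    print("Mean      :", y.mean())
    print("Sdev      :", y.std(ddof=1))    # divides by n - 1, as in the text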
• The response variable varies very little over the design samples, or its variations
are due mostly to uncontrolled conditions.
To know whether you need an optimization, have a look at the mean for the Center
samples, and compare it to the mean for the design samples (for each response
variable). If they differ noticeably, it means that at least one of the design variables
has a non-linear effect on the response. You will not be able to conclude with
certainty about which design variable has a non-linear effect, but at least you will
know with certainty that you need an optimization stage to describe the variations of
your response adequately.
• Estimated value of the main effect of each design variable on each response.
• Estimated value of the interaction effect between two design variables for each
response (if the resolution of the design allows for these interactions to be
studied).
• Significance of these effects.
• In case of design variables with more than two levels, which levels generate
significantly different response values.
Once you have checked your raw data and corrected any data transcription errors,
and possibly completed the descriptive stage by performing a multivariate data
analysis, you are ready to start the inferential stage, i.e. draw conclusions from your
design.
The purpose of Analysis of Effects is to find out which design variables have the
largest influence on the response variables you have selected, and how significant
this influence is. It especially applies to screening designs.
To test the significance of a particular effect, you have to compare the response’s
variance accounted for by that effect to the residual variance which summarizes
experimental error. If the “structured” variance (due to the effect) is no larger than
the “random” variance (error), then the effect can be considered negligible. Otherwise
it is regarded as significant.
• First, several sources of variation are defined. For instance, if the purpose of the
ANOVA model is to study the main effects of all design variables, each design
variable is a source of variation. Experimental error is also a source of variation.
• Each source of variation has a limited number of independent ways to cause
variation in the data. This number is called the number of degrees of freedom (DF).
• Response variation associated with a specific source is measured by a sum of
squares (SS).
• Response variance associated with the same source is then computed by dividing
the sum of squares by the number of degrees of freedom. This ratio is called the
mean square (MS).
• Once mean squares have been determined for all sources of variation, F-ratios
associated with every tested effect are computed as the ratio of MS(effect) to
MS(error). These ratios, which compare structured variance to residual variance,
have a statistical distribution which is used for significance testing. The higher
the ratio, the more important the effect.
• Under the null hypothesis that an effect’s true value is zero, the F-ratio has a
Fisher distribution. This makes it possible to estimate the probability of getting
such a high F-ratio under the null hypothesis. This probability is called the p-value;
the smaller the p-value, the more likely it is that the observed effect is not due to
chance.
The ANOVA results are traditionally presented as a table, in the format illustrated in
Table 16.10.
Source   DF                  SS      MS                   F-ratio
B        dfB = #levels - 1   SS_B    MS_B = SS_B / dfB    F_B = MS_B / MS_err = 0.29
C        dfC = #levels - 1   SS_C    MS_C = SS_C / dfC    F_C = MS_C / MS_err = 0.02
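To make the arithmetic behind such a table concrete, here is a sketch of how one F-ratio and its p-value can be computed from a sum of squares and its degrees of freedom; the numbers are hypothetical, not those of Table 16.10:

    # Sketch: from SS and DF to MS, F-ratio and p-value for one effect.
    # The sums of squares and degrees of freedom below are hypothetical.
    from scipy.stats import f

    ss_b, df_b     = 4.2, 1        # effect B
    ss_err, df_err = 58.1, 4       # experimental error

    ms_b   = ss_b / df_b
    ms_err = ss_err / df_err
    F      = ms_b / ms_err
    p      = f.sf(F, df_b, df_err)   # upper-tail probability under the Fisher distribution

    print(f"MS(B) = {ms_b:.2f}, MS(err) = {ms_err:.2f}, F = {F:.2f}, p = {p:.3f}")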
Note!
The underlying computations of ANOVA are based on the MLR algorithm.
The effects are computed from the regression coefficients: for a two-level
design in coded (-1/+1) units, the effect of a variable is twice its
regression coefficient (effect = 2·b).
Multiple Comparisons
Multiple comparisons apply whenever a design variable with more than two levels
has a significant effect. Their purpose is to determine which levels of the design
variable have significantly different response mean values.
The Unscrambler uses one of the most well-known procedures for multiple
comparisons: Tukey’s test. The levels of the design variable are sorted according to
their average response value, and non-significantly different levels are displayed
together.
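If you wish to reproduce such a multiple comparison outside The Unscrambler, the statsmodels package offers a comparable Tukey test; the supplier levels and response values below are invented for illustration:

    # Sketch: Tukey's multiple comparison for a three-level category variable.
    # Data are made up for illustration.
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    response = [7.1, 7.4, 6.9, 8.8, 9.1, 8.7, 7.2, 7.0, 7.3]
    supplier = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

    result = pairwise_tukeyhsd(endog=response, groups=supplier, alpha=0.05)
    print(result)   # pairs whose interval contains 0 are not significantly different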
other methods for significance testing. They differ from each other by the way the
experimental error is estimated. In The Unscrambler, five different sources of
experimental error determine different methods. Read more about those methods in
the reference manual.
Let us start with a simple example: a Fractional Factorial Design with four design
variables and one degree of fractionation, i.e. a 2^(4-1) design of resolution IV. If the
confounded effect of AC and BD is significant (AC+BD), how do we separate them?
Everywhere in the design the
variation of AC is always the same as BD. To be able to separate them we must
therefore run additional experiments, in which the variation of AC is different from
the variation of BD. This can actually be achieved by only one extra experiment,
where the setting of AC and BD is different.
We pick one of the experiments, e.g. number 8, and then make a new one, number 9,
thus:
Table 16.11
No A B C D ... AC BD
8 +1 +1 +1 +1 ... +1 +1 (existing experiment)
9 +1 +1 -1 +1 ... -1 +1 (new experiment)
This extra run allows us to estimate AC and BD. This illustrates the great advantage
of Fractional Factorial Designs. Confounded effects can be separated by doing
complementary runs. Clearly such a sequential experimentation strategy is very
efficient and economical.
design variable for the blocks. Experimental runs must then be randomized within
each block.
• day (if several experimental runs can be performed the same day).
• operator or machine or instrument (when several of them must be used in parallel
to save time).
• batches (or shipments) of raw material (in case one batch is insufficient for all
runs).
You can also measure the standard deviation of the measurements on each sample,
as an expression of the precision of the measurement method (or rather, of the
measurement error); see Equation 16.3.
Equation 16.3:    SDev = \sqrt{\frac{1}{I-1}\sum_{i=1}^{I}(y_i - \bar{y})^2}
where
I = the number of reference measurements, y_i = the i-th measurement and \bar{y} = their mean
Note!
It is important not to mix up repeated response measurements with
replicated experiments. With repeated measurements, each experiment is
carried out once; then the measurements only are repeated several times.
Therefore, if you have repeated measurements but no true replicates, do
not specify your design as replicated! There are other ways to handle the
repeated response values in practice.
Before Analysis of Effects, PCA, etc., you will calculate the average of all repeated
measurements for each sample. There are two alternatives:
It is important that you enter the data in a sequence that makes it easy to analyze and
average. Use the following scheme:
In a fold-over design you simply switch the signs of all the experimental settings of
the variables in the first design. Here is an example:
Table 16.12
First design Fold-over design
A B C AB AC BC ABC A B C AB AC BC ABC
-1 -1 -1 -1 -1 -1 -1 +1 +1 +1 +1 +1 +1 +1
-1 +1 -1 +1 -1 +1 +1 +1 -1 +1 -1 +1 -1 -1
+1 +1 -1 -1 +1 +1 -1 -1 -1 +1 +1 -1 -1 +1
-1 -1 +1 -1 +1 +1 +1 +1 +1 -1 +1 -1 -1 -1
+1 -1 +1 +1 -1 +1 -1 -1 +1 -1 -1 +1 -1 +1
-1 +1 +1 +1 +1 -1 -1 +1 -1 -1 -1 -1 +1 +1
+1 -1 -1 +1 +1 -1 +1 -1 +1 +1 -1 -1 +1 -1
+1 +1 +1 -1 -1 -1 +1 -1 -1 -1 +1 +1 +1 -1
When analyzing the two parts together, the main effects will be free from
confounding with two-variable interactions. Two-variable interactions may (still) be
confounded with each other. But you may often make an “educated” guess of which
term dominates in such confoundings; significant interaction effects are generally
found for variables which also have significant main effects.
Read more about the powerful fold-over designs in the design literature.
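Since the fold-over is nothing more than a sign switch, it is easy to generate by hand or in a few lines of code; a sketch (the run order shown is illustrative):

    # Sketch: the fold-over of a two-level design is obtained by switching all signs.
    first_design = [
        (-1, -1, -1), (-1, 1, -1), (1, 1, -1), (-1, -1, 1),
        (1, -1, 1),   (-1, 1, 1),  (1, -1, -1), (1, 1, 1),
    ]   # coded settings of A, B, C (illustrative order)

    fold_over = [tuple(-x for x in run) for run in first_design]

    for original, mirrored in zip(first_design, fold_over):
        print(original, "->", mirrored)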
We enter the real variable values and use PCA on the responses, PCR or PLS in
order to make an ordinary multivariate projection model, to relate response variation
to the design variable. We disregard the Normal B-plot and instead use the loadings
plot to study the most important variables “as usual”. Even if the design was not as
planned, you have probably generated a data set which spans the most important
variations as well as the interacting covariations.
As a consequence, you should be especially careful to collect response values for all
experiments. If you do not, for instance due to some instrument failure, it might be
advisable to re-do the experiment later so as to collect the missing values.
If, for some reason, some response values simply cannot be measured, you will still
be able to use the standard multivariate methods described in this book: PCA on the
responses, and PCR or PLS to relate response variation to the design variable.
Remember to autoscale the variables in this situation. If some of the variables are
varying between 0 and 1, scale them in an appropriate fashion to get values larger
than 1 before you divide by their standard deviation. For instance, if a variable varies
between 0 and 1 gram, give its levels in milligrams instead.
Calculate enough components and study the residual Y-variance to find the
appropriate number of components to use for interpretation.
If the model is bad, for instance if the prediction error is high, one reason may be
non-linearities. Try to expand the X-matrix with cross- and square terms. Remember
to subtract the mean value of each variable before expansion. Then make a new
model and see if the added cross- and square terms do the trick.
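A sketch of that expansion for two mean-centered variables, using numpy (the data are hypothetical):

    # Sketch: mean-center X, then append cross- and square terms as new columns.
    import numpy as np

    X = np.array([[1.0, 10.0],
                  [2.0, 12.0],
                  [3.0, 15.0],
                  [4.0, 11.0]])          # hypothetical design variables x1, x2

    Xc = X - X.mean(axis=0)              # subtract the mean of each variable first
    cross   = (Xc[:, 0] * Xc[:, 1]).reshape(-1, 1)   # x1*x2
    squares = Xc ** 2                                # x1^2, x2^2

    X_expanded = np.hstack([Xc, cross, squares])
    print(X_expanded.shape)              # (4, 5): two linear, one cross, two square terms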
This procedure may also be an alternative if the basis for a chosen design has failed.
Suppose you make a Plackett-Burman design but there are clear interactions. The
data set then contains too little information for the ambition of the stated modeling.
Make a fold-over design, add the new experiments to the data set, and make a PLS-
model as usual based on the uncoded (real) values expanded with cross terms. Note
that the results from a Plackett-Burman design cannot be used to estimate curvature
since it does not contain any axial points.
Problem
Lacotid is a white crystalline powder used in medicine. The synthesis of Lacotid is a
two stage process: 1) synthesis 2) crystallization. The synthesis of the raw products
is performed in a methanol solution (MeOH). A slurry of the raw product is then
pumped into a new container where crystallization takes place. Crystallization is
performed by gradually adding isopropanol (C3H7OH) to the slurry. The producers
of Lacotid wanted to increase the yield and make the production more stable and
closer to optimal, as the yield was only 50% and there were large variations in quality. After some
initial experiments the factory concluded that the main variations occurred in the
crystallization stage. They therefore planned to improve the monitoring of the
crystallization process to ensure stable and optimal production of Lacotid with
respect to yield and quality. To achieve this they first needed to find which process
parameters have significant effects on the yield.
In factorial designs all X-variables take only two values (high or low), because the
goal is to investigate if Y is affected by a change in each X-variable. Because of
problems in keeping some of the variables at their planned levels, the data could not
be analyzed by the traditional methods used to analyze experimental designs. They
therefore had to work with the real x-values and use a PLS model instead to interpret
the variable relationships.
The data for this exercise were provided by the Norwegian Society for Process
control at a workshop in 1993. They are based on a real application.
Data Set
LACOTID with the variable Sets X-Data and Y-Data.
The X variables are:
There are two responses in Y: Yield in % and purity (GC) of the yield (measured by
gas chromatography). Twenty experiments were made using a 2^(7-3) factorial design
with 4 center points. The replicated center points are marked R. Center points have
mean values of all x-variables. They are replicated to check the error in the reference
method.
Task
Make a PLS model to study which process variables influence the yield and quality.
How to Do it
1. Plot the raw data and study the data table to get acquainted with the data.
Calculate general statistics of the variables (both X and Y) in the Task-
Statistics menu. Note the SDev of Y1 and Y2.
2. Make a PLS2 model. Should we standardize the data? Use leverage correction
initially.
Study the variance from the Model overview. You should see that the explained
calibration X variance is very small. In factorial designs, by definition, all the
X-variables have been systematically varied in the same way. That is, no one
variable varies more than another, so normally the explained calibration
X-variance will be zero. (In this case the planned settings could not be kept
completely constant, so there is a small variation, and even a small decrease in
PC2.) In factorial designs, do not pay attention to these X-variances as they have
no meaning here.
Study the residual Y- Variance. Observe the big difference in variance for Yield
and GC. What does this mean? Also look at RMSEP. How many components do
we need? How much of the variance of Y is explained?
Take a look at the score plot. Is it really meaningful to interpret PC2? The center
points represent the average samples. Study the loading plot. Which process
variables have highest effect on the Yield?
3. Calculate the SDev of the variable GC for the four samples marked R. Compare
this SDev to the standard deviation of all samples in Y2. Use this to explain why
the modeling of Y2 is so bad.
4. Run a new model on Yield only. Study the Y-variance, the score plot and the
loading plot. Check your conclusions from the first model. How many
components should we use? Can we say anything about possible interactions?
Summary
Data should be standardized if they are not coded. Yield is explained by one PLS-
component, but GC is not explained at all. Normally we get one PLS-component per
Y-variable with PLS2 of factorial designs, but since the X-data deviate a little from
the planned design we may accept one or two components more.
SDev of GC for the replicates is 0.33 and the standard deviation of GC for all
samples is 0.47, i.e. the error in the reference method is of the same order as the
spread of the response in all the samples. Obviously the chosen measurement method
cannot be used, therefore you cannot model Purity based on these data. We do not
know from this whether or not the chosen design variables have any effect on the
purity.
Either 1 or 2 PCs should be used. In the PLS2 model it is difficult to say how many
PCs to use, since the error has a local minimum in PC1. In that model we do not
know if the increase in PC2 is caused by real problems or overfitting. The PLS2
model suggests that 2 components can be used. Since these data are only generated
to find the most important variables, not to make the perfect prediction model, we
can therefore look at the 2 PC model for convenience (it is a bit easier to study the
2-vector plots than the 1-vector plots, but pay most attention to PC1).
PC1 suggests variable X1 and X2 as the most important for the Yield. X6 has some
contribution in PC2. If we continue with optimization experiments, we could
perhaps include X6.
From the loading plots it is clear that variable X1 and X2 covary. We cannot say if
they also interact, unless we add the variable X1*X2 in the X-matrix. The high
degree of explained Y-variance (80% at 2 PCs) without the interaction term suggests
that this is not a significant effect. (If you are interested in experimenting, you can
delete the non-significant effects from the X-matrix and make a new model including
only X1 and X2. Study the Normal probability plot of residuals.)
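If you would like to repeat this kind of check outside The Unscrambler, a rough equivalent can be sketched with scikit-learn; the data below are random stand-ins for the standardized Lacotid X-data and Yield, not the real data set:

    # Sketch: fit a PLS model on standardized X plus an added X1*X2 interaction column.
    # The data below are random stand-ins, not the Lacotid data set.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 7))                      # 20 runs, 7 process variables
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=20)

    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize (1/SDev weighting)
    X = np.column_stack([X, X[:, 0] * X[:, 1]])       # add the X1*X2 interaction term

    pls = PLSRegression(n_components=2).fit(X, y)
    print("X weights, PC1:", np.round(pls.x_weights_[:, 0], 2))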
4. Choose From Scratch, and choose the Design type that you need.
5. Define the low and high levels of the design variables (the ones you plan to
vary, X-variables), enter variable names, units and blocks (e.g. experimenters).
Define the number of responses (Y-variables) you plan to measure.
6. Choose the suitable number of experiments and resolution for your purposes.
In factorial designs, check whether the effects you are most interested in will be
clear or confounded. If you are not satisfied, select a different design.
7. Define the number of repeated measurements, and the number of center
points per block, as required. Here we can add reference samples as well.
8. We can also choose to sort during randomization.
9. The program now calculates an experimental pattern that spans all variables
optimally and generates a randomized experimental plan for the lab. The
X-matrix is automatically expanded with all interaction effects as additional
X-variables, and stored as a Design file for later use.
10. Preview and Print out the randomized experimental plan for the lab.
11. Re-randomize if you are not happy with the randomization.
Part II Analysis
12. When the experiments have been performed, enter the lab results (the
responses) as Y-variables. Note! Entering response values is disabled in the
training version of the program.
13. The next analyses are found in Plot and Task menu.
14. Data checks: Plot the raw data using line, scatter or histogram plots, and use
Statistics and PCA if there are several responses. The purpose is to find
transcription errors and to identify more serious problems.
15. Descriptive analysis: Statistics or PCA. The purpose is to check ranges of
variation for each response and correlation among responses.
16. Inferential analysis: Analysis of effects with ANOVA and significance testing.
At this stage we want to find the significance of each effect and choose to leave
out non-significant effects.
17. Predictive analysis (optimization designs): Response Surface analysis with
ANOVA and PLS2 if several responses.
Make more experiments if necessary, for example using optimization designs, fold-
over designs or complementary runs to separate confoundings. With just a little work
you will quickly develop enough personal experience to build a sequential
experimental strategy.
Table 17.1: Ranges of the process variables for the cooked meat design
Process variable Low High
Marinating time 6 hours 18 hours
Steaming time 5 min 15 min
Frying time 5 min 15 min
When seeing this table, the process engineer expresses strong doubts that
experimental design can be of any help to him. “Why?” asks the statistician in
charge. “Well,” replies the engineer, “if the meat is steamed then fried for 5
minutes each it will not be cooked, and at 15 minutes each it will be overcooked
and burned on the surface. In either case, we will not get any valid sensory ratings,
because the products will be far beyond the ranges of acceptability.”
After some discussion, the process engineer and the statistician agree that an
additional condition should be included:
“In order for the meat to be suitably cooked, the sum of the two cooking times
should remain between 16 and 24 minutes for all experiments”.
This type of restriction is called a multi-linear constraint. In the current case, it
can be written in a mathematical form requiring two equations, as follows:
Steaming time + Frying time ≥ 16 minutes
Steaming time + Frying time ≤ 24 minutes
The impact of these constraints on the shape of the experimental region is shown in
Figure 17.1 and Figure 17.2:
(Figures 17.1 and 17.2: the experimental region plotted against Steaming (5-15 min),
Frying (5-15 min) and Marinating (6-18 hours), without and with the constraint on the
total cooking time.)
As you can see, it contains all "corners" of the experimental region, in the same
way as the full factorial design does when the experimental region has the shape of
a cube.
The product developer has learnt about experimental design, and tries to set up an
adequate design to study the properties of the pancake dough as a function of the
amounts of flour, sugar and egg in the mix. She starts by plotting the region that
encompasses all possible combinations of those three ingredients, and soon
discovers that it has quite a peculiar shape:
(Figure: mixtures of the 3 ingredients plotted against Flour, Sugar and Egg axes from
0 to 100; the possible combinations form a flat triangular region, with the "only Flour
and Sugar" blends along one edge and 100% Flour and 100% Sugar at two of its corners.)
The reason, as you will have guessed, is that the mixture always has to add up to a
total of 100 g. This is a special case of multi-linear constraint, which can be written
with a single equation:
Flour + Sugar + Egg = 100%
This is called the mixture constraint: the sum of all mixture components is 100%
of the total amount of product.
The practical consequence, as you will also have noticed, is that the mixture region
defined by three ingredients is not a three-dimensional region! It is contained in a
two-dimensional surface called a simplex.
Therefore, mixture situations require specific designs. Their principles will be
introduced in the next chapter.
(Figure: the simplex of Flour/Sugar/Egg mixtures, with 100% Flour, 100% Sugar and
100% Egg at the corners, 0% of one component along each edge, and the centroid at
33.3% Flour, 33.3% Sugar, 33.3% Egg.)
This simplex contains all possible combinations of the three ingredients flour,
sugar and egg. As you can see, it is completely symmetrical. You could substitute
egg for flour, sugar for egg and flour for sugar in the figure, and still get exactly the
same shape.
Classical mixture designs take advantage of this symmetry. They include a varying
number of experimental points, depending on the purposes of the investigation. But
whatever this purpose and whatever the total number of experiments, these points
are always symmetrically distributed, so that all mixture variables play equally
important roles. These designs thus ensure that the effects of all investigated
mixture variables will be studied with the same precision. This property is
equivalent to the properties of factorial, central composite or Box-Behnken designs
for non-constrained situations.
The first design in Figure 17.5 is very simple. It contains three corner samples
(pure mixture components), three edge centers (binary mixtures) and only one
mixture of all three ingredients, the centroid.
The second one contains more points, spanning the mixture region regularly in a
triangular lattice pattern. It contains all possible combinations (within the mixture
constraint) of five levels of each ingredient. It is similar to a 5-level full factorial
design - except that many combinations, like "25%,25%,25%" or
"50%,75%,100%", are excluded because they are outside the simplex.
You can read more about classical mixture designs in Chapter 17.2 "The Mixture
Situation".
D-Optimal Designs
Let us now consider the meat example again (see Chapter 17.1.1 "Constraints
Between the Levels of Several Design Variables"), and simplify it by focusing on
Steaming time and Frying time, and taking into account only one constraint:
Steaming time + Frying time ≤ 24. Figure 17.6 shows the impact of the constraint
on the variations of the two design variables.
Figure 17.6: The constraint cuts off one corner of the "cube"
(Axes: Steaming time and Frying time, both from 5 to 15 min; the line S + F = 24 cuts
off the upper right corner of the square.)
If we try to build a design with only 4 experiments, as in the full factorial
design, we will automatically end up with an imperfect solution that leaves a
portion of the experimental region unexplored. This is illustrated in Figure 17.7.
Figure 17.7: Designs with 4 points leave out a portion of the experimental region
(Two candidate designs, I and II, each built on 4 of the 5 corner points numbered 1 to
5; the shaded areas mark the unexplored portion of the region.)
From this figure it can be seen that design II is better than design I, because the
left-out area is smaller. A design using points (1,3,4,5) would be equivalent to (I),
and a design using points (1,2,4,5) would be equivalent to (II). The worst solution
would be a design with points (2,3,4,5): it would leave out the whole corner
defined by points 1,2 and 5.
Thus it becomes obvious that, if we want to explore the whole experimental region,
we need more than 4 points. Actually, in the above example, the five points
(1,2,3,4,5) are necessary. These five crucial points are the extreme vertices of the
constrained experimental region. They have the following property: if you were to
wrap a sheet of paper around those points, the shape of the experimental region
would appear, materialized by your wrapping.
Every time you add a constraint, you INCREASE the number of vertices.
When the number of variables increases and more constraints are introduced, it is
not always possible to include all extreme vertices into the design. In these cases
you need a decision rule to select the best possible subset of points to include in
your design. There are many possible rules; one of them is based on the so-called
D-optimal principle, which consists in enclosing maximum volume into the
selected points. In other words, you know that a wrapping of the selected points
will no exactly re-constitute the experimental region you are interested in, but you
want to leave out the smallest possible portion.
Read more about D-optimal designs and their various applications in section 17.3,
"How To Deal With Constraints".
You can see at once that the resulting experimental design will have a number of
features which make it very different from a factorial or central composite design.
Firstly, the ranges of variation of the three variables are not independent. Since
Watermelon has a low level of 30%, the high level of Pineapple cannot be higher
than 100 - 30 = 70%. The same holds for Orange.
The second striking feature concerns the levels of the three variables for the point
called “centroid”: these levels are not half-way between “low” and “high”, they are
closer to the low level. The reason is, once again, that the blend has to add up to a
total of 100%.
Whenever the low and high levels of the mixture components are such that the
mixture region is a simplex (as shown in Chapter 17.1.2, "A Special Case: Mixture
Situations"), classical mixture designs can be built. Read more about the necessary
conditions in section 17.3.4, "When is the Mixture Region a Simplex?".
These designs have a fixed shape, depending only on the number of mixture
components and on the objective of your investigation. For instance, we can build a
design for the optimization of the concentrations of Watermelon, Pineapple and
Orange juice in Cornell's fruit punch, as shown in Figure 17.8.
Figure 17.8: Design for the optimization of the fruit punch composition
(Simplex with Watermelon, Pineapple and Orange at the corners; Watermelon varies from
30% to 100%, so Pineapple and Orange each vary from 0% to 70%.)
The next chapters will introduce the three types of mixture designs which are the
most suitable for three different objectives:
• 1- Screening of the effects of several mixture components;
• 2- Optimization of the concentrations of several mixture components;
• 3- Even coverage of an experimental region.
What is the best way to build a mixture design for screening purposes? To answer
this question, let us go back to the concept of main effect.
The main effect of an input variable on a response is the change occurring in the
response values when the input variable varies from Low to High, all experimental
conditions being otherwise comparable.
In a factorial design, the levels of the design variables are combined in a balanced
way, so that you can follow what happens to the response value when a particular
design variable goes from Low to High. It is mathematically possible to compute
the main effect of that design variable, because its Low and High levels have been
combined with the same levels of all the other design variables.
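In coded units, this computation is just a difference between two averages; a minimal sketch with hypothetical settings and responses:

    # Sketch: main effect of A = mean response at A = +1 minus mean response at A = -1.
    # Settings and responses are hypothetical.
    a_levels = [-1, +1, -1, +1, -1, +1, -1, +1]
    response = [60.0, 72.0, 58.0, 74.0, 61.0, 70.0, 59.0, 73.0]

    high = [y for a, y in zip(a_levels, response) if a == +1]
    low  = [y for a, y in zip(a_levels, response) if a == -1]

    main_effect_A = sum(high) / len(high) - sum(low) / len(low)
    print(main_effect_A)   # 12.75 in this made-up example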
In a mixture situation, this is no longer possible. Look at the previous figure: while
30% Watermelon can be combined with (70% P, 0% O) and (0% P, 70% O),
100% Watermelon can only be combined with (0% P, 0% O)!
To find a way out of this dead end, we have to transpose the concept of "otherwise
comparable conditions" to the constrained mixture situation. To follow what
happens when Watermelon varies from 30% to 100%, let us compensate for this
variation in such a way that the mixture still adds up to 100%, without disturbing
the balance of the other mixture components. This is achieved by moving along an
axis where the proportions of the other mixture components remain constant, as
shown in Figure 17.9.
(Figure 17.9: the Watermelon axis of the simplex, with Orange and Pineapple kept in
equal proportions.)
The most "representative" axis to move along is the one where the other mixture
components have equal proportions. For instance, in the above figure, Pineapple
and Orange each use up one half of the remaining volume once Watermelon has
been determined.
Mixture designs based upon the axes of the simplex are called axial designs. They
are the best suited for screening purposes because they manage to capture the main
effect of each mixture component in a simple and economical way.
A more general type of axial design is represented, for 4 variables, in the next
figure. As you can see, most of the points are located inside the simplex: they are
mixtures of all 4 components. Only the four corners, or vertices (containing the
maximum concentration of an individual component) are located on the surface of
the experimental region.
(Figure: an axial design for 4 mixture components, showing the overall centroid, the
axial points and the optional end points.)
Each axial point is placed halfway between the overall centroid of the simplex
(25%,25%,25%,25%) and a specific vertex. Thus the path leading from the
centroid ("neutral" situation) to a vertex (extreme situation with respect to one
specific component) is well described with the help of the axial point.
In addition, end points can be included; they are located on the surface of the
simplex, opposite to a vertex (they are marked by crosses in the figure). They
contain the minimum concentration of a specific component. When end points are
included in an axial design, the whole path leading from minimum to maximum
concentration is studied.
Thus, an optimization design for mixtures will include a large number of blends of
only two, three, or more generally a subset of the components you want to study.
The most regular design including those sub-blends is called simplex-centroid
design. It is based on the centroids of the simplex: balanced blends of a subset of
the mixture components of interest. For instance, to optimize the concentrations of
three ingredients, each of them varying between 0 and 100%, the simplex-centroid
design will consist of:
• 1- The 3 vertices: (100,0,0), (0,100,0) and (0,0,100);
• 2- The 3 edge centers (or centroids of the 2-dimensional sub-simplexes defining
binary mixtures): (50,50,0), (50,0,50) and (0,50,50);
• 3- The overall centroid: (33,33,33).
(Figure: a simplex-centroid design, showing the vertices, the edge centers, the third
order centroids (face centers), the overall centroid and optional interior points.)
If all mixture components vary from 0 to 100%, the blends forming the simplex-
centroid design are as follows:
• 1- The vertices are pure components;
• 2- The second order centroids (edge centers) are binary mixtures with equal
proportions of the selected two components;
• 3- The third order centroids (face centers) are ternary mixtures with equal
proportions of the selected three components;
• …
• N- The overall centroid is a mixture where all N components have equal
proportions.
In addition, interior points can be included in the design. They improve the
precision of the results by "anchoring" the design with additional complete
mixtures. The most regular design is obtained by adding interior points located
halfway between the overall centroid and each vertex. They have the same
composition as the axial points in an axial design.
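The blends of a simplex-centroid design are easy to enumerate; the sketch below lists them for three components varying from 0 to 100% (an illustration, not The Unscrambler's own procedure):

    # Sketch: blends of a simplex-centroid design for 3 mixture components (0-100%).
    from itertools import combinations

    components = ["Flour", "Sugar", "Egg"]

    for k in range(1, len(components) + 1):            # 1 = vertices, 2 = edge centers, ...
        for subset in combinations(components, k):
            share = round(100 / k, 1)
            blend = {c: (share if c in subset else 0.0) for c in components}
            print(blend)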
example, you just want to investigate what would happen if you mixed three
ingredients which you have never tried to mix before.
This is one of the cases when your main purpose is to cover the mixture region as
evenly and regularly as possible. Designs which address that purpose are called
simplex-lattice designs. They consist of a network of points located at regular
intervals between the vertices of the simplex. Depending on how thoroughly you
want to investigate the mixture region, the network will be more or less dense,
including a varying number of intermediate levels of the mixture components. As
such, it is quite similar to an N-level full factorial design. Figure 17.12 illustrates
this similarity.
In the same way as a full factorial design can be used for screening, optimization or
other purposes depending on its number of levels, simplex-lattice designs have a wide
variety of applications, depending on their degree (the number of intervals between
points along an edge of the simplex). Here are a few:
- Feasibility study (degree 1 or 2): are the blends feasible at all?
- Optimization: with a lattice of degree 3 or more, there are enough points to fit a
precise response surface model.
- Search for a special behavior or property which only occurs in an unknown,
limited sub-region of the simplex.
- Calibration: prepare a set of blends on which several types of properties will be
measured, in order to fit a regression model to these properties. For instance, you
may wish to relate the texture of a product, as assessed by a sensory panel, to the
parameters measured by a texture analyzer. If you know that texture is likely to
Since there is no "template" that can automatically be applied, the design will have
to be computed algorithmically to fit your own particular situation. The main
principle underlying these computations is the D-optimal principle. The chapters to
come explain this principle and its practical implications.
Note!
An eigenvalue gives a measure of the size or significance of a dimension
(or PC). The NIPALS algorithm extracts PCs in order of decreasing
Eigenvalues. That is to say that if the eigenvalues are equal, then each
dimension is equivalent to the others so the space is spherical. If the
largest dimension is much larger than the smallest dimension, then the
region is “flat”.
In the ideal case, if all extreme vertices are included into the design, it has the
smallest attainable condition number. If that solution is too expensive, however,
you will have to make a selection of a smaller number of points. The automatic
consequence is that the condition number will increase and the enclosed volume
will decrease. This is illustrated by Figure 17.13.
Figure 17.13: With only 8 points, the enclosed volume is not optimal
(The figure distinguishes the region of interest from the unexplored portion left out
by the 8 selected points.)
Once the model has been fixed, the condition number of the "experimental
matrix", which contains one column per effect in the model, and one row per
experimental point, can be computed.
When the exchange of points does not give any further improvements, the
algorithm stops and the subset of candidate points giving the lowest condition
number is selected.
In the simplest case of a linear model, an orthogonal design such as a full
factorial would have a condition number of 1. It follows that the condition number
of a D-optimal design will always be larger than 1. A D-optimal design with a
linear model is acceptable up to a cond# around 10.
If the model gets more complex, it becomes more and more difficult to control the
increase in the condition number. For practical purposes, one can say that a design
including interaction and/or square effects is usable up to a cond# around 50.
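The condition number itself is cheap to compute once the experimental matrix has been set up, one column per effect; a sketch with numpy, using a 2^2 factorial for illustration:

    # Sketch: condition number of an experimental matrix (one column per effect).
    import numpy as np

    # 2^2 full factorial in coded units; columns A and B, plus the interaction A*B.
    A = np.array([-1.0,  1.0, -1.0,  1.0])
    B = np.array([-1.0, -1.0,  1.0,  1.0])
    X = np.column_stack([A, B, A * B])

    print(np.linalg.cond(X))        # 1.0: the design is orthogonal
    print(np.linalg.cond(X[:-1]))   # 2.0: dropping a corner (as a constraint might) raises it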
If you end up with a cond# much larger than 50 no matter how many points you
include in the design, it probably means that your experimental region is too
constrained. In such a case, it is recommended to re-examine all your design
variables and constraints with a critical eye, and search for ways to simplify your
problem (see 17.3.4, "Advanced Topics"). Otherwise you run the risk of starting an
expensive series of experiments which will not give you any useful information at
all.
The set of candidate points for the generation of the D-optimal design will then
consist mostly of the extreme vertices of the constrained experimental region. If the
number of variables is small enough, edge centers and higher order centroids can
also be included.
In addition, center samples are automatically included in the design (whenever they
apply); they are not submitted to the D-optimal selection procedure.
The set of candidate points for a D-optimal optimization design will thus include:
- all extreme vertices;
- all edge centers;
- all face centers and constraint plane centroids.
To imagine the result in three dimensions, you can picture a combination
of a Box-Behnken design (which includes all edge centers) and a Cubic Centered
Faces design (with all corners and all face centers). The main difference is that the
constrained region is not a cube, but a more complex polyhedron.
The D-optimal procedure will then select a suitable subset from these candidate
points, and several replicates of the overall center will also be included.
Here again, the set of candidate points depends on the shape of the model. You may
look up section 17.4.2, "Relevant Regression Models", for more details on mixture
models.
The overall centroid is always included in the design, and is not subject to the
D-optimal selection procedure.
Note!
Classical mixture designs have much better properties than D-optimal
designs. Remember this before establishing additional constraints on
your mixture components!
Section 17.3.4 "How to Select Reasonable Constraints" tells you more about how
to avoid unnecessary constraints.
The Unscrambler offers three different ways to build a design combining mixture
and process variables. They are described below.
The first solution is useful when several process variables are included in the
design. It applies the D-optimal algorithm to select a subset of the candidate points,
which are generated by combining the complete mixture design with a full factorial
in the process variables.
Note!
The D-optimal algorithm will usually select only the extreme vertices of
the mixture region. Be aware that the resulting design may not always be
relevant!
The alternative is to use the whole set of candidate points. In such a design, each
mixture is combined with all levels of the process variables. The figure below
illustrates two such situations.
(Left - Screening: an axial design combined with a 2-level factorial. Right -
Optimization: a simplex-centroid design combined with a 3-level factorial.)
Note!
When the mixture region is not a simplex, only continuous process
variables are allowed.
Note that if some of the ingredients do not vary in concentration, the sum of the
mixture components of interest (called Mix Sum in the program) is smaller than
100%, to leave room for the fixed ingredients. For instance if you wish to prepare a
fruit punch by blending varying amounts of Watermelon, Pineapple and Orange,
with a fixed 10% of sugar, Mix Sum is then equal to 90% and the mixture
constraint becomes "sum of the concentrations of all varying components = 90%".
In such a case, unless you impose further restrictions on your variables, each
mixture component varies between 0 and 90% and the mixture region is also a
simplex.
Whenever the mixture components are further constrained, like in the example
shown in Figure 17.15, the mixture region is usually not a simplex.
Figure 17.15: With a multi-linear constraint, the mixture region is not a simplex
(Ternary diagram with corners Watermelon, Orange and Pineapple; the experimental
region is the part of the simplex where W ≥ 2*P, bounded by the line W = 2*P.)
In the absence of multi-linear constraints, the shape of the mixture region depends
on the relationship between the lower and upper bounds of the mixture
components.
It is a simplex if:
The upper bound of each mixture component is larger than
Mix Sum - (sum of the lower bounds of the other components).
Figure 17.16 illustrates one case where the mixture region is a simplex, and one
case where it is not.
Figure 17.16: Changing the upper bound of Watermelon affects the shape of the
mixture region
(Two ternary diagrams with corners Watermelon (W), Orange (O) and Pineapple (P);
Orange and Pineapple each have a lower bound of 17%.)
In the leftmost case, the upper bound of Watermelon is 66% = 100 - (17 + 17): the
mixture region is a simplex. If the upper bound of Watermelon is lowered to 55%, it
becomes smaller than 100% - (17 + 17) and the mixture region is no longer a
simplex.
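The simplex rule quoted above is easy to check directly; here is a small sketch with a hypothetical helper function (bounds given in the same units as Mix Sum):

    def mixture_region_is_simplex(lower, upper, mix_sum=100.0):
        """lower, upper: per-component bounds, in the same units as mix_sum."""
        for i, up in enumerate(upper):
            others_low = sum(lo for j, lo in enumerate(lower) if j != i)
            if up < mix_sum - others_low:     # the upper bound actually cuts the simplex
                return False
        return True

    # Figure 17.16: Watermelon, Orange, Pineapple with lower bounds 0, 17 and 17%
    print(mixture_region_is_simplex([0, 17, 17], [66, 100, 100]))   # True  - simplex
    print(mixture_region_is_simplex([0, 17, 17], [55, 100, 100]))   # False - not a simplex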
Note!
When the mixture components only have Lower bounds, the mixture
region is always a simplex.
This means that ingredients which are represented in the mixture with a very small
proportion can, in a way, "escape" from the mixture constraint.
So whenever one of the minor constituents of your mixture plays an important role
in the product properties, you can investigate its effects by treating it as a process
variable. See 17.3.3 "How to Combine Mixture and Process Variables" for more
details.
It does not make any sense to treat such a situation as a true mixture; it will be
better addressed by building a classical orthogonal design (full or fractional
factorial, central composite, Box-Behnken, depending on your objectives).
When you start defining a new design, think twice about any constraint you intend
to introduce. An unnecessary constraint will not help you solve your problem
faster; on the contrary, it will make the design more complex, and may lead to more
experiments or poorer results.
Physical constraints
The first two cases mentioned above can be called "real constraints". You cannot
disregard them; if you do, you will end up with missing values in some of your
experiments, or uninterpretable results.
Constraints of cost
The third case, however, can be referred to as "imaginary constraints". Whenever
you are tempted to introduce such a constraint, examine the impact it will have on
the shape of your design. If it turns a perfectly regular and symmetrical situation,
which can be solved with a classical design (factorial or classical mixture), into a
complex problem requiring a D-optimal algorithm, you will be better off just
dropping the constraint.
Build a standard design, and take the constraint into account afterwards, at the
result interpretation stage. For instance, you can add the constraint to your response
surface plot, and select the optimum solution within the constrained region.
Note that if you stick to that rule without allowing for any extra margin, you will
end up with a so-called saturated design, that is to say without any residual degrees
of freedom. This is not a desirable situation, especially in an optimization context.
A D-optimal design computed with the default number of experiments will have, in
addition to the replicated center samples, enough additional degrees of freedom to
provide a reliable and stable estimation of the effects in the model.
Read more about the choice of a model in Chapter 17.4.2 , "Relevant Regression
Models".
A side effect of the projection principle is that PLS not only builds a model of
Y=f(X), it also studies the shape of the multidimensional swarm of points formed
by the experimental samples with respect to the X-variables. In other words, it
describes the distribution of your samples in the X-space.
Thus any constraints present when building a design, will automatically be detected
by PLS because of their impact on the sample distribution. A PLS model therefore
has the ability to implicitly take into account multi-linear constraints, mixture
constraints, or both. Furthermore, the correlation or even the linear relationships
introduced among the predictors by these constraints, will not have any negative
effects on the performance or interpretability of a PLS model, contrary to what
happens with MLR.
In other words: the regression coefficients from a PLS model tell you exactly what
happens when you move from the overall centroid towards each corner, along the
axes of the simplex.
This property is extremely useful for the analysis of screening mixture experiments:
it enables you to interpret the regression coefficients quite naturally as the main
effects of each mixture component.
The mixture constraint has even more complex consequences on a higher degree
model necessary for the analysis of optimization mixture experiments. Here again,
PLS performs very well, and the mixture response surface plot enables you to
interpret the results visually (see Chapter 17.4.3 , "The Mixture Response Surface
Plot" for more details).
Thus PLS regression is the method of choice to analyze the results from D-optimal
designs, no matter whether they involve mixture variables or not.
- If the regression coefficient for a variable is larger than 0.2 in absolute value, then
the effect of that variable is most probably significant.
- If the regression coefficient is smaller than 0.1 in absolute value, then the effect is
negligible.
- Between 0.1 and 0.2: "gray zone" where no certain conclusion can be drawn.
Note!
In order to be able to compare the relative sizes of your regression
coefficients, do not forget to standardize all variables (both X and Y)!
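If you want to apply these rules of thumb programmatically, a tiny sketch could look as follows (the 0.1 and 0.2 thresholds are the rough guidelines quoted above, valid for standardized variables only and no substitute for a proper significance test; the variable names and values are purely illustrative):

    def effect_size_category(b):
        """Classify a standardized regression coefficient by the 0.1 / 0.2 rules of thumb."""
        b = abs(b)
        if b > 0.2:
            return "most probably significant"
        if b < 0.1:
            return "negligible"
        return "gray zone - no certain conclusion"

    # Illustrative values only
    for name, b in {"A": 0.35, "B": -0.05, "A*B": 0.14}.items():
        print(name, effect_size_category(b))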
The best and easiest way to check the significance of the effects is to use Martens’
Uncertainty test, which allows The Unscrambler to detect and mark the
significant X-variables (see Chapter 14).
Therefore, The Unscrambler asks you to choose a model immediately after you
have defined your design variables, before determining which type of classical
mixture design, or which selection of points for the D-optimal design, best fits
your current purposes.
The minimum number of experiments also depends on the shape of your model;
read more about it in section 17.3.4 "How many Experiments Are Necessary?”.
Screening designs are based on a linear model, with or without interactions. The
interactions to be included can be selected freely among all possible products of
two design variables.
In a mixture design, the interaction and square effects are linked and
cannot be studied separately.
Here, therefore, are the basic principles for building relevant mixture models.
For screening purposes, use a purely linear model (without any interactions) with
respect to the mixture components. If your design includes process variables, their
interactions with the mixture components may be included, provided that each
process variable is combined with either all or none of the mixture variables. No
restriction is placed on the interactions among the process variables themselves.
For optimization purposes, you will choose a full quadratic model with respect to
the mixture components. If any process variables are included in the design, their
square effects may or may not be studied, independently of their interactions and of
the shape of the mixture part of the model. But as soon as you are interested in
process-mixture interactions, the same restriction as before applies.
Instead of having two coordinates, the mixture response surface plot uses a special
system of 3 coordinates. Two of the coordinate variables are varied independently
from each other (within the allowed limits of course), and the third one is computed
as the difference between MixSum and the other two.
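A minimal sketch (assuming NumPy; names purely illustrative) of how such a three-coordinate grid can be generated, with the third coordinate computed as Mix Sum minus the other two:

    import numpy as np

    mix_sum = 100.0
    a = np.linspace(0.0, mix_sum, 51)
    b = np.linspace(0.0, mix_sum, 51)
    A, B = np.meshgrid(a, b)
    C = mix_sum - A - B            # the third coordinate is fully determined
    inside = C >= 0.0              # keep only points inside the mixture region
    # A[inside], B[inside] and C[inside] can then be fed to the fitted model to
    # compute the predicted response values for a contour (response surface) plot.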
(Two mixture response surface plots, each in the three-coordinate system with A, B and C
ranging from 0 to 100%. Left plot ID: "Centroid quad, PC: 3, Y-var: Y"; right plot ID:
"D-opt quad2, PC: 2, Y-var: Y". The contour lines show the predicted response levels.)
Similar response surface plots can also be built when the design includes one or
several process variables.
Context
A wine producer wants to blend 3 different types of wines together: Carignan,
Grenache and Syrah. All three types can vary between 0 and 100% in proportion.
He is aiming at finding out what proportion of the three makes the most preferred
wine, but to simplify his production work he is mostly interested in blending only
two types of wine together. He is also concerned about the production cost.
Tasks
In this exercise, you will build a mixture design with 3 mixture variables (Carignan,
Grenache and Syrah).
This exercise will lead you through data input of the responses of interest:
Preference and Cost. We will analyze the data and take into account both
preference and cost in the search of a compromise.
How to Do It
1. Define the variables
Go to File - New Design and choose to build your design From Scratch.
Click Next.
Select to build a Mixture design and click Next.
Define three mixture variables: Carignan, Grenache and Syrah, varying from 0
to 100% each. As there are no additional constraints to the mixture constraint,
do not tick the Multi-linear constraints box. Click Next.
There are no process variables involved, so you do not need to define any. Click
Next.
Enter 2 responses: “Preference” and “Cost”, then click Next.
Mark the experiments included in each of these two designs on the simplexes
below to see the difference.
Click Next to access the Design Details dialog. Take 1 replicate and 2 center
samples; click Next. In the randomization details dialog, click Next. In the Last
Checks dialog, hit Preview to take a first look at your design.
Check if the resulting design is what you expected. Is the number of experiments
as you expected? Do the experimental points meet with your expectations?
Click Finish in the Last Checks dialog.
Save your design under the name of your choice by going to the menu: File-
Save As…
Note!
If you need to delete a line, press “Alt Gr” and click on the line to select
it, then press “Delete”.
Do you recognize the simplex shape? Notice how the 3-dimensional space (a
cube) collapses into a 2-dimensional space (a flat triangle).
Minimize the 3D scatter plot, and create a new identical one (Plot- 3D
Scatter). We are going to observe the surface by a different method.
Go to View - Rotate and rotate the points horizontally (use the keyboard
arrows or the mouse) until all the design points are lined up. To rotate the plot
1 degree at a time, hold down "Ctrl" as you rotate.
The production cost was computed for each sample according to the amount of
each wine type with a linear equation.
Type in the calculated production cost and the averaged preference evaluations
into your design table in The Unscrambler according to Table 17.5.
Do not forget to save the new table with results! (File - Save)
Go to Task - Statistics and choose “All samples” in the sample set selection,
“Response variables” in the variable set selection. Click OK. Hit the View
button to view the results, and go straight away to File - Save in order to save
this new Statistics result file under a meaningful name.
You may want to compare the center sample to the design samples. Go to Plot-
Statistics and in the Compressed tab tick “Center” as well as “Design” in the
sample groups field. Click OK.
Is the center sample (33% Carignan, 33% Grenache, 33% Syrah) well
appreciated by the consumers compared to the rest of the samples?
Close or minimize your statistics results and select the column Preference with
the mouse. Go to the menu Plot - Histogram and choose “All samples” in the
sample set selection. To access the skewness value, which indicates how
symmetrical the data are, go to View - Plot Statistics. The closer to zero, the
more symmetrical the distribution.
Does the skewness value confirm your opinion about the symmetry of the data?
Do you need to perform any pretreatment before starting a multivariate
analysis?
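If you want to reproduce the skewness check outside The Unscrambler, a minimal sketch (assuming SciPy, with purely illustrative numbers) is:

    from scipy.stats import skew

    preference = [1.2, 1.8, 2.1, 2.4, 2.6, 2.9, 2.0, 1.5]   # illustrative values only
    print(skew(preference))    # close to zero => roughly symmetrical distribution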
Validation Method
Choose Cross Validation as a validation method, and make sure that “Full
Cross Validation” is selected in the Setup.
Tick the Jack-Knife and check that the number of PCs used for Jack-knifing
will be the “Optimal number of PCs”. Click OK.
View the results and go straightaway to the menu File - Save to save this PLS2
model under a meaningful name.
Look at the plot in the bottom right corner (Predicted vs. Measured), and check
how many PCs The Unscrambler has used to compute the results (right under
the plot). What is the optimal number of PCs according to The Unscrambler?
Compare your number to The Unscrambler’s finding. Do you agree with The
Unscrambler’s choice?
Go to View - Jack-knife - Uncertainty limits. Notice that for all
the non-significant variables, the uncertainty is such that we cannot even know
for sure whether the regression coefficient is positive or negative.
Important note!
Even though most interaction and square effects are not significant, we
cannot remove them from the model: all these terms are tightly related
because of the mixture constraint!
The only option would be to remove all interactions and squares from the
model, but we would then ignore the significant square effect of C (Syrah). So
we decide to keep our model as it is.
The X- and Y-Loadings plot shows that on PC1 and PC2, 69% of the variation
in X (wine types) explains 80% of the variation in consumer preference and cost.
The model performs very well!
The plot reveals the significant effects detected before (marked variables). You
can notice that some of the variables are projected far from the center of the plot
on PC1 and PC2; however, they are not marked as significant. This is due to the
fact that we are looking at two components only, whereas the model is actually
based on 5 components.
To get a bigger view of the plot, go to Window - Copy to and select figure 1.
Now the X- and Y-Loadings plot takes the whole viewer screen.
What percentages of the three wines give the highest acceptance? Does this
combination match with your expectations? What is the predicted Preference
value for this optimal wine combination?
You can also display the response surface as a landscape (right-click to access
the context menu, Edit - Options). Go to View - Rotate or select the Rotate
icon to rotate the plot.
How big are the regression coefficients for interactions and squares? Can you
explain that?
Click on the Jack-knife icon to mark the variables that have a significant effect
on Cost. Note that the X- and Y-loadings plot is updated at the same time.
Which wine types have a significant effect on the production cost? Are these
effects positive or negative?
Relate your findings to what you can see on the X- and Y-loadings plot.
Which significant effects are negatively/positively correlated to Cost?
To be able to compare easily the preference and the cost for different wine
combinations, display the response surface plot for Preference in the lower part
of the screen.
How much is the production cost for the most preferred wine combination?
The wine producer knows from the consumer study that a number of consumers
would consider the price first, then the taste when choosing wine. However, a
grade under 2 in the consumer study clearly meant “rejected wine”.
Help the wine producer find a wine combination for a production cost lower
than 3 and a preference as close as possible to 2.
Is it possible to find such a combination involving only two wines?
Summary
We built a Simplex-Lattice design in order to focus our study on two-wine
blendings. Because of the mixture constraint (C+G+S=100%), we worked in a two-
dimensional space, that is to say a surface, even though we included three
variables in the design.
Before starting the analysis, we checked the raw data and found no suspicious out-
of-range value. The distribution of the data was quite symmetrical for both
response variables, Preference and Cost.
A PLS2 regression was performed with the design variables and their interactions
and squares as X, and the response variables as Y. Automatic marking by Jack-
knifing showed us that interactions and squares were needed for the model.
Regarding the consumers’ preference, a positive effect of Carignan, a negative
effect of Grenache and a negative square effect of Syrah were shown.
The cost was linearly influenced by the quantities of Carignan, Grenache and
Syrah, the three wine types differing clearly in production cost.
The most preferred wine combination was 71% Carignan - 29% Syrah, for a
preference of 2.9 on the scale from 1 to 3 and a cost of 3.6.
To keep his production cost lower than 3 but ensure a preference above 2 by
mixing only two wines, the producer should mix 55% Grenache with 45% Syrah.
The underlying feature that will serve as a common backbone with which to
interrelate the various methods will be the effective projection dimension of each
of the methods. This dimension has been called "A" in this book. Principal
Component Analysis and Factor Analysis, for instance, may be seen as methods
that project the original p-dimensional data (recall the variable space defined by the
original p variables) onto a low-dimensional subspace of dimension A. At times A
may be very low, for instance 1, 2 or 3, and in general A << p.
The idea is that the signal part, which corresponds to the main multivariate
structure(s) projected onto the A-dimensional subspace, very often corresponds to
the most useful part of the data. As soon as the subspace dimension, A, has been
determined, we can usually do away with the complementary (p-A) dimensions.
According to the underlying assumptions of PCA-decomposition, these dimensions
represent the "noise", or error part, which we are not interested in and which we do
not want to let confound the data any further.
In some ways the action of projection methods such as PCA is like using a
magnifying glass. The essential data structures are enhanced, while the irrelevant
noise is screened away. It is this “truncated” use of PCA that has been particularly
useful - and therefore popular - in science and technology in general, and in
chemometrics in particular.
In this type of application PCA performs as a powerful tool for Exploratory Data
Analysis (EDA). The graphic score and loadings plots are used to visualize the
underlying structure in the data after removal of the noise contributions, and for
problem-dependent interpretations. But PCA can also be used for quite another
purpose than EDA - it can be used for classification and modeling of individual
data classes. An example of this is given in Figure 18.1.
(Figure 18.1: A data swarm in variable space (p = 3; axes x1, x2, ...) grouped into three
separate classes, one of which has effective dimension A = 2, the others A = 0 and A = 1.)
This illustration is a case where the data swarm in variable space (here p=3) is
found to be grouped into three separate clusters – three data classes. An initial PCA
on this data set would reveal these groupings in the relevant score plots. Each group
can now be modeled and indeed interpreted separately. This will reveal the
effective dimension within each class; in Figure 18.1 the classes have dimensions A
= 0, 1 and 2 respectively. Notice the cluster with A=0. In this class the data points
are quasi-spherically distributed, i.e. there is no preferred direction of maximum
variance - all directions are equal. In this (admittedly very rare) case of isotropic
data variance the “best” model of the data swarm is simply the mean object!
When PCA is used for classification purposes like this, the overall data
decomposition is controlled by the structures inherent in the entire data set. It is a
“let us see what we get” approach, after which SIMCA may proceed if found
appropriate etc.
Another approach, which will be discussed later, is when the data analyst knows
beforehand exactly which classes to expect (and which not to expect), e.g. the data
is made up of red or green apples and nothing else, in which case one would expect
only a class for each color of apples. Is it then possible to "force" the
decomposition onto the classes in question through the use of the class of
"dummy"-regression methods, for example PLS-DISCRIM? This would be wrong!
What about the possible presence of rotten apples? We trust the reader can see
behind the extreme simplicity of this illustration, and be able to carry over the
mental notion of “rotten apples” to his/her own data analysis situation, i.e. always
be on the lookout for outliers - always!
First of all notice the clear similarity to PCA - FA also operates with scores and
loadings matrices. In PCA we attempt to take out the noise contributions and
collect them in the error matrix E, by determining the dividing effective dimension
A. The scores and loadings in PCA thus represent the signal part of the data and the
rest is in the E matrix. In FA the errors are also supposed to follow a quite specific
statistical distribution structure, detailed in Frame 18.1. Consequently, the
modeling of the error components is also intrinsic to FA. Thus there are basic
statistical assumptions both for the error matrix E as well as for the scores matrix
F, and these assumptions lead to the expressions given for the covariance of X and
the covariance between X and F. The expressions Cov(X) = AA' + diag and
Cov(X,F) = A of course constitute the basic factor analysis model. FA is clearly a
statistically much more elaborate method than PCA.
A main objective of FA is still finding the “correct” dimension of the subspace that
represents the signal-structure part. There are a large number of statistically based
methods available for this purpose. We will not go into any detail on this point; the
interested reader is referred to the literature on FA, in particular Jolliffe (1986) and
Jackson (1991); the literature on factor analysis is exceptionally comprehensive.
Frame 18.1: The factor analysis model
X = 1·m' + F·A' + E
A = loading matrix
F = score matrix (factor matrix)
m = average vector
E = error matrix
Statistical model:
E(F) = 0; Cov(F) = I
E(E) = 0; Cov(E) = diag   =>   Cov(X) = AA' + diag
Cov(X,F) = A
In general, however, most data analysts are happy with the primary solutions - but
from very advanced and experienced FA-analysts, be prepared for a lecture on the
“primitive” PCA method! It may of course do the reader no harm to dig somewhat
deeper into this issue, but hopefully not before having gained some - a lot rather -
personal experience with PCA, especially concerning its practical use.
Besides, PCA and FA very often give quite similar numerical results, especially
when the errors are small. Thus the practical use of FA is very much the same as
for PCA. It has been postulated that in some instances FA is superior to PCA
because it attempts to determine the “true factor structure”, which may encompass
both the signal and noise parts. In other words, FA attempts to model the noise as
well - to get the noise “under control” - as opposed to PCA where the objective is
to discard the noise and pay no more attention to it. Whether or not FA is superior
to PCA in this general sense, FA has indeed seen some spectacular successes, but
nearly always exclusively in the hands of (very) experienced users with more than
just a passing interest in the underlying statistics (biometrics, psychometrics etc.).
Thus, FA is not a method recommended for the novice - there are too many traps
and pitfalls.
Still, an overview of the main distinctions between PCA and FA may be of help at
this stage:
b) d(X,Y) = Σ X·Y / √(Σ X² · Σ Y²)   (a correlation-type similarity measure)

(Figure 18.2: a dendrogram; the samples/groupings are listed along one axis, the degree of
relatedness between the groups along the other.)
The common characteristic of very many of the related CA-approaches is the well-
known type of visual display, the dendrogram, which has proven extremely popular;
see Figure 18.2. This type of display manages to compress the data structure, as far
as the data grouping is concerned, onto an apparently 2-dimensional chart. The
calculated grouping is displayed along one dimension, with the degree of
relatedness between these groups along the other. These features are in fact
intimately related, and the listing of the ordered set(s) of samples does not
constitute a proper dimensionality by itself. CA must therefore be viewed as a
projection onto a “1.5”-dimensional subspace, as it were.
There is a snag with the many alternative CA possibilities, however, especially for
someone who is inexperienced with CA. You are usually advised to try out several
similarity measures and cluster methods “to see which corresponds best with the
data structures". This many-methods attitude was probably devised in the hope that
if the results from several methods are more or less the same, the data structures
found reflect reality. But in fact this points to a very serious weakness of CA: the
different, competing similarity criteria are not unique!
Also, a little thought brings us to the following problem: if several measures, and/or
cluster algorithms give different solutions, which do we choose? This may very
well (and frequently does) happen. The problem with CA is that nowhere is there to
be found any general optimization criterion for the many different clusterings
possible on the very same data set! Thus CA runs into the same sort of problem as
did FA (albeit in a distinctly different setting: CA is a classification method) - non-
uniqueness of the primary solutions. There is only one absolute certainty here, and
it in fact applies to all multivariate analysis:
The above extremely short introduction to CA really does not pay the necessary
respect to this venerable approach, but in the interest of the present introductory
book (on bilinear methods), we simply have to accept some boundaries. The reader
is, however, referred with great enthusiasm to the excellent CA-textbook by
Rosenberg (1987), a superb introduction to the field.
DA is partly a supervised EDA technique, but is also often used for supervised
pattern recognition in a subsequent step. With DA you need to know some initial
means or characteristics in order to start dividing the objects into two or more data
classes. The overwhelming majority of LDA-applications are concerned with only
two groupings.
It may also be seen from Figure 18.3 that LDA in fact can be viewed as a projection
onto A=1 dimensions. If the objects are projected onto the “LDF-axis” in Figure
18.3, this axis could be viewed as a “component vector” separating the two classes,
i.e. a 1-dimensional representation. This singular discriminant axis may be
extended to higher dimensions in more advanced versions of DA (e.g. quadratic
DA), but a great many of the classic methods stay with this single, 1-dimensional
subspace. For situations with two classes this makes good sense.
There are, however, many real-world data sets where this simple, low dimensional
picture is not enough. These systems are simply of sufficiently high complexity
that a 1-dimensional approximation is a gross misrepresentation.
One last point on LDA which is also important, concerning collinearity. LDA
suffers from the same collinearity problems as does MLR, and there are no
remedies of the PCR kind in this case.
(Figure 18.3: two classes of objects plotted against x1 and x2, separated by projection
onto the LDF axis - Fisher's linear discriminant function (A=1).)
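As an illustration of the projection just described (a sketch assuming NumPy; the function name is hypothetical, and this is only one of several equivalent LDA formulations), Fisher's discriminant direction for two classes can be computed as w = Sw^(-1)(m1 - m2); projecting the objects onto w gives the A=1 axis:

    import numpy as np

    def fisher_lda_direction(X1, X2):
        """X1, X2: (n1, p) and (n2, p) arrays of objects from the two classes."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # pooled within-class scatter matrix
        S_w = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)
        w = np.linalg.solve(S_w, m1 - m2)      # fails if S_w is singular (collinear X!)
        return w / np.linalg.norm(w)

    # scores = X @ fisher_lda_direction(X1, X2) projects any objects onto the LDF axis.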
Multiple Linear Regression, MLR, is designed for the regression of one Y-variable
on a set of p so-called “independent” X-variables. It is implicit in the classical LR
formulation that X has full mathematical rank. The concept of “full rank” means
that the columns in X are linearly independent, i.e. they are not collinear. "Full
rank" and "not collinear" in practice mean that the variables are uncorrelated.
This is a very important point. In science and technology collinearity is very often
the case for the sets of variables employed. Use of MLR on collinear data can lead
to serious misinterpretations and, worst of all, this may remain undiscovered if the
warnings against it go unnoticed.
There is still far too little emphasis in the applied sciences on critical inspection of
data structures (e.g. outliers, groupings, etc.) before running the data through one
of the popular and plentiful packaged MLR routines available. For the professional
user there is a very large apparatus of "regression diagnostics" with which to assess
whether this and other critical assumptions are upheld, but very little help is
offered in case they are not. Regression analysis is a huge area for professional
statisticians, and it is with the greatest respect that we here nevertheless largely
dismiss this class of regression. Reasons have been given however in chapter 6 and
elsewhere.
Another MLR-premise that is often violated concerns the fact that the X-variables
are often associated with errors, be it measurement errors, sampling errors, or
otherwise. In the classical case, only the Y-variable is assumed to be affected by
errors. Take for example the LR least-squares fitting criterion, which is related to
Y-variable variance only - it is implicitly assumed that the X-values are noise free.
MLR is, in fact, ideally concerned specifically with an orthogonal X-matrix, (as is
reflected by its name in statistical terminology: the design matrix). For truly
orthogonal, i.e. uncorrelated X-variables, LR works just fine.
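The contrast can be made concrete with a small simulated sketch (assuming NumPy; the data are purely synthetic): with an orthogonal second X-variable the MLR coefficients are stable, whereas a nearly collinear one inflates them wildly even though the fit itself looks fine:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x1 = rng.normal(size=n)
    x2_orth = rng.normal(size=n)                 # practically uncorrelated with x1
    x2_coll = x1 + 1e-3 * rng.normal(size=n)     # almost collinear with x1
    y = 2.0 * x1 + rng.normal(scale=0.1, size=n)

    for x2 in (x2_orth, x2_coll):
        X = np.column_stack([np.ones(n), x1, x2])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        print("b =", np.round(b, 2), " cond(X) =", round(float(np.linalg.cond(X)), 1))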
PCA decomposition is however still carried out completely without regard to the Y-
data structure. This means that you may very well decompose the X-matrix in a
way that is not optimal for Y-variable prediction, especially if you are modeling
more than one Y-variable at the same time. Here the PLS method is a far better
choice. On the other hand, in the case of only one Y-variable, PCR is statistically
the best studied and most well-known method. Many of the references in this book
give in-depth information on MLR and PCR from the statistical point of view. In
The Unscrambler package PCR and PLS are the two obvious choices.
As was the case with PCA, PLS can also be used as a supervised calibration tool
that simultaneously classifies the new X-vectors submitted for prediction. This
latter feature is unique to multivariate calibration and is used extensively, for
example for automatic outlier-warning. Within chemometrics there is by now a
complete strategy for multivariate calibration in science and technology in general.
The book by Martens & Næs (1989), "Multivariate Calibration", is still a leading
authority on the subject, even with about ten years on its back!
It is all too easy simply to feed your data (a data matrix, X) into one of the many
available software packages boasting this or that "well known" statistical or
analytical routine. It is very doubtful, however, that this act alone guarantees you a relevant and intelligible
answer to the specific data analytical problem at hand, despite the fact that there is
always some output. Even though multivariate methods may appear difficult to
grasp at first sight, these methods are in fact very easy to use. We have in any event
worked hard to press home exactly this point in this book - and if you have
carefully carried out all (well most of) the exercises herein, you will have gotten
this message by now.
An appropriate analogy could be that of a cookie cutter. Cookie cutters stamp out a
predefined form from whatever is placed beneath them. But not only is the form of
the cookie cutter important - what is placed under it is of course equally important.
Trying to stamp out cookie forms from, for instance, spaghetti would be a
senseless thing to do.
• Which data set: which data are measured/observed in your problem context.
• Which method: the (multivariate) method must comply with your problem
formulation.
• Why: why do I measure/observe these data? The problem definition determines
which (multivariate) method is to be used!
One of the most fundamental distinctions in multivariate data analysis concerns the
alternative data analysis modes: unsupervised methods vs. supervised methods.
In EDA you are searching for patterns, groupings, outliers, etc. - in short,
performing an act of Pattern Cognition (PAC). You may use whatever appropriate
unsupervised method you feel comfortable with (e.g. PCA), but you should most
definitely not use e.g. MLR unless you have formulated some form of functional
X→Y regression concept derived from "external" knowledge pertaining to your
problem formulation, your sampling scheme and your general knowledge of the
data context. Without such a concept MLR is all but prohibited, because it is
specifically a regression method and therefore assumes a regression objective for
the analysis, i.e. that there is a Y that can be explained by an X.
Alternatively you may use PCA or another method which does not partition the
variables into an X-block and a Y-block.
On a slightly more general level: unsupervised methods are used for unsupervised
purposes (and of course: supervised pattern recognition methods are used for
supervised pattern recognition purposes). What does this apparently circular dictum
mean then? When you do not know any specific data analysis purpose (e.g.
regression, classification) from the original problem specification, all you in fact
can do is to perform an unsupervised data analysis, since there simply is no
supervising guidance to be had. On the contrary:
There is no harm done however, if you prefer to perform some EDA first. In fact it
is useful to do this every time. You will get to know the data structure and you may
perhaps even find that your initial assumptions do not hold up - and surely nobody
will object to that type of cautious data analysis in the initial stages.
1. Establishing a model for the X→X or X→Y relationship (e.g. DA, MLR, PCR,
PLS). This is called the training, the modeling - or calibration stage. DA can
also be seen as a marginal subset of the regression case. This stage can in some
sense be considered as a “passive” modeling stage, because the data itself pretty
much determines the (soft) data model.
2. Using this model for whatever purpose your original objective dictates, e.g.
PARC for DA, prediction for MLR, PCR and PLS, or classification. This
may be called the "active" stage, the classification or prediction stage etc.
The validity and efficiency of any supervised data analysis method is totally
dependent on the representativity of the initial data relationship used as a training
basis. The principle of GIGO (Garbage In - Garbage Out) applies to all supervised
methods! It is the responsibility of the data analyst - YOU - to specify the training
data set in as relevant and representative a manner as possible (which n samples?
Why? Characterized by which p variables?). Data classes and the samples therein
must be representative with respect to future sampling of the populations modeled
by the particular supervised method employed for the specific problem.
The systematic relationships between data analytical problem formulations and the
appropriate multivariate methodological choices are presented in full detail by
Esbensen et al. (1988).
The review in this chapter has been successful if you have formed an opinion that
this structuring is intimately connected to the original problem objective, and that
an appropriate data analytical method will almost suggest itself if only seen in this
proper context.
The "how" is largely taken care of by the software (The Unscrambler
and other computer systems). The "why" is a matter that rests solely with the user -
YOU. Even if this appears burdensome at first sight to the novice user, it is in fact
a blessing because it relieves the pressure of technical expertise and emphasizes the
data context imperative, and this is where all your domain specific expert
knowledge comes in. The interplay between the expert knowledge of the problem
context and the multivariate data analysis knowledge (and experience) is really
where most of the fun in multivariate data analysis is to be found.
Quite clearly, if we have secured two such data sets (a calibration set and a well
balanced test set), there can effectively be only one variance component that will
differ between them; the sampling variance. This sampling variance will comprise
those differences between the two data sets that can only be explained by the (two)
different samplings of n objects, made under conditions which are otherwise as
identical as possible. This is the essence of the concept test set validation.
Let us assume then that you have arrived at a “reasonably representative calibration
model”. The central idea behind all prediction model validation now is to evaluate
the prediction strength of this model. All validation is based on a comparison
between the model-based prediction results and the test set reference values. The
entire basis on which we will judge the usefulness of the prediction model is the
degree of correspondence between these “correct” reference values and the values
predicted by the model. The test set validation is the best possible option for this
critical issue – there is none better!
Conceptually, we may start out by dividing the calibration data set into two halves.
These should preferably be chosen randomly. Each half is now characterized by n/2
objects. If the data set is “large enough”, there is no problem with this, except that
each model is now based only on n/2 objects. This is the only difference between
two-segment cross validation and test set validation, but – as was outlined in
chapter 7 – a crucial disqualifying one: there was never any second drawing from
the parent populations in this case!
With a sufficiently large calibration set, this procedure will then be the most
adequate substitute for validation with a full test set. The only problem being that
we rarely - very rarely – are in a position to demand this large a data set. Indeed the
very reason cross-validation was called in was because we could not obtain enough
samples to delineate a proper test set!
There is still a widespread lack of understanding of, and respect for, these critical
issues in nearly all of chemometrics, and indeed well beyond. Remedying this
troublesome state of affairs has been one of
the primary objectives of this book.
During cross validation of, say, 150 samples, we could for example use two
75-sample data sets, one for calibration and one for validation. We could also have
divided the data into three segments, each supported by 50 samples. Or perhaps
five data sets, each supported by 30 samples, or 10 segments, each with 15
samples... It is always possible to set up a segmentation list of the following form
(2,3,4,...n segments), where the last number of segments, n, is that pertaining to full
cross validation. In this leave-one-out cross validation, each sample will be taken
out of the calibration set once and only once. The remaining n-1 data vectors make
up the model support, with the purpose of predicting the Y-value of the temporary
left out sample. This can be carried out exactly n times, and for each time one
specific Y-value will have been predicted. It is easily appreciated that we have
gradually built up a case apparently very similar to the separate test set validation
situation. There is but this one crucial difference, however. During this sequential
substitution of all calibration samples, we have never had an independent new
realization of the target population sampling. We are only performing an internal
permutation in the same calibration set. This is the crucial difference between test
set validation and segmented/full cross validation. In this precise sense, segmented
cross validation gives us access to an assessment of the internal model stability
rather than of the future prediction error.
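A minimal sketch of segmented cross validation (NumPy only; fit and predict stand for hypothetical user-supplied modeling and prediction functions), where setting the number of segments equal to n gives full, leave-one-out cross validation:

    import numpy as np

    def segmented_cv_rmsep(X, y, fit, predict, n_segments):
        n = len(y)
        indices = np.arange(n)                      # shuffle first if appropriate
        errors = []
        for segment in np.array_split(indices, n_segments):
            train = np.setdiff1d(indices, segment)
            model = fit(X[train], y[train])         # calibrate without the held-out segment
            errors.extend(y[segment] - predict(model, X[segment]))
        return np.sqrt(np.mean(np.square(errors)))  # RMSEP-type error estimate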
When both alternative validation results are available (test and cross-validation),
we can make an estimate of the sampling variance. It is based on the difference
between the prediction variance from cross validation and that of the test set
validation. When everything else is equal, the cross validation variance must be
smaller than the test set variance, since the test set validation also includes the
sampling variance. This interesting exercise has almost never been carried out – but
it is of course a must for the exercises in this book in the cases where a true test set
is available. This is left to the reader’s discretion.
There are many myths about the different types of cross validation - for example
that full cross validation is the most comprehensive validation possible. This is
never the case, however, except for small-sample data sets. Otherwise it is very
unlikely that one left-out sample alone will induce any significant sampling
variance in any well-structured model (no outliers left in, no sub-groupings etc.) so
as to simulate the missing test set drawing. Full cross validation is however de
rigueur the smaller the number of samples available for validation. Put bluntly, it is
only in the case with really few samples that full cross validation is really
beneficial. In all other situations a carefully designed variant of segmented cross
validation will give a more realistic error estimate. How shall we choose the
number of segments?
What would happen if we based the validation directly on the calibration set alone,
i.e. ran the validation on the same data set as was used for the calibration, instead
of the test set? This would naturally result in an over-optimistic assessment of the
prediction error, which would invariably be estimated too low. The same set of
objects would be used both for establishing the model and for "testing" its
prediction strength. Of course this is not acceptable. Still, this is exactly what
leverage correction validation does, except that the important "punishment factor",
the leverage, is factored in.
The answer depends on the chosen validation approach. All three approaches aim
at determining the prediction strength and in general they should not give results
which are too dissimilar. Always be aware though that any real-world data set may
well break every rule-of-thumb in the multivariate data analysis world. When the
three approaches really do present different results, there is a rigid preference for
test set over cross validation, followed by leverage correction. Observe this at all
times. The key issue is that the proper segmentation of the calibration set (from the
leave-one-out approach, through all intermediate segmentations, to the two-segment
approach) is very much related to the observable model structure. It is therefore
your responsibility to study the pertinent data structures carefully, most often in the
form of the appropriate score or T vs. U plots etc. and then to decide this issue.
consecutive years of university teaching. Beware: here are the final challenging
problems on offer in this book from which to learn and grow!
1. Make sure to use only representative calibration and validation data. Enough
evenly distributed samples, spanning all major variations will usually do the job.
There are exceptions.
2. Select the appropriate validation method for all final model testing.
3. Always look for outliers. Decide whether to keep, or remove the candidates
spotted. Extreme samples may carry important information, while erroneous
data will destroy the model. Outlier detection is only feasible in the proper data
analysis problem context.
5. Interpret the structural patterns in the data, by studying the appropriate score
plot. Be careful if there are clear, separate sub-groups. This may indicate that a
model should be made for each subgroup. Interpreting score structures is always
a problem-specific task. In the PCA regimen, t-t plots reign, while in the PLSR/
(PCR) context the t-u plots take over.
7. Do not transform or preprocess data unless you know what you are doing and
why! Always use weights (1/SDev) when the variables span widely different
variance ranges; this usually takes the form of autoscaling. There are exceptions.
8. Check how well prediction models will perform when predicting new data, by
using all the full validation concepts outlined. The RMSEP prediction error is
given in original units. It should be compared to the measurement precision
levels and the accuracy of the reference method. Remember that the prediction
error depends on the validation method used as well as on selection of
calibration and validation samples.
9. Beware of the possibility for exceptions to all “rules” (including the above).
10. Feel free to contact CAMO ASA, or the author if you have any questions:
[email protected]
[email protected]
The rest of the way to the top is of course completely your own responsibility. The
pinnacle of chemometric data analysis is now in sight, and actually within reach.
While there is still quite a distance to go, there is nothing to hinder you from starting
on these last stages – and the view from the top is spectacular!
Good luck – Have fun!
The Eiffel tower: four stages and a pinnacle. This book should have elevated you to
the top of the second stage!
19. Literature
C. Albano, W. Dunn III, U. Edlund, E. Johansson, B. Nordén, M. Sjöström & S.
Wold (1978), Four levels of pattern recognition, Anal. Chim. Acta, 103, pp 429 - 443
K.R. Beebe, R.J. Pell & M.B. Seasholtz (1998) Chemometrics: A Practical Guide.
John Wiley & Sons, Inc., New York, 1998, ISBN 0-471-12451-6
H.R. Bjørsvik & H. Martens (1989), Data Analysis: Calibration of NIR instruments
by PLS regression in Burns, D.A. & Ciurczak, E.W. (Editors) Handbook of Near-
Infrared Analysis, Marcel Dekker Inc., New York
G.E.P. Box, W.G. Hunter, J.S. Hunter (1978), Statistics for experimenters, Wiley &
Sons Ltd ISBN 0-471-09315-7
S.N. Deming, J.A. Palasota & J.M. Nocerino (1993), The geometry of multivariate
object preprocessing, Jour. Chemometrics, vol 7, pp 393 - 425
K.H. Esbensen & J. Huang (2000) Principles of Proper Validation (submitted). Jour.
Chemometrics
R.A. Fisher (1936), The use of multiple measurements in taxonomic problems, Ann.
Eugenics, vol 7, pp 179 - 188
J.-P. Gauchi (1995), Utilisation de la régression PLS pour l’analyse des plans
d’expériences en chimie de formulation. Revue Statistique Appliquée, 1995, XLIII
(1), 65-89
P. Geladi (1988) Notes on the history and the nature of partial least squares (PLS)
modelling. Jour. Chemometrics, vol 2, pp 231 – 246.
P. Geladi & K. Esbensen (1990), The start and early history of chemometrics:
Selected interviews. Part 1. Jour. Chemometrics, vol 4, pp 337 - 354
P. Geladi & K. Esbensen (1990), The start and early history of chemometrics:
Selected interviews. Part 2. Jour. Chemometrics, vol 4, pp 389 - 412
P. Geladi & B.R. Kowalski (1986), Partial Least Squares Regression: A tutorial,
Anal. Chim. Acta, 185, pp 1 - 17
J.E. Jackson (1991) A User’s Guide to Principal Components. Wiley. Wiley series in
probability and mathematical statistics. Applied probability and statistics. ISBN 0-
471-62267-2
R.A. Johnson & D.W. Wichern (1988), Applied multivariate statistical analysis,
Prentice-Hall 607 p.
K.V. Mardia, J.T. Kent & J.M. Bibby (1979), Multivariate Analysis, Academic Press
Inc., London, ISBN 0-12-471252-5
H. Martens & T. Næs (1989), Multivariate Calibration, Wiley & Sons Ltd, ISBN 0-
471-90979-3
D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte & L. Kaufman (1988),
Chemometrics: A text book, Elsevier Publ., Amsterdam, ISBN 0-444-42660
E. Morgan (1991), Chemometrics: Experimental design, Wiley & Sons Ltd, ISBN 0-
471-92903-4
T. Næs & T. Isaksson (1991), SEP or RMSEP, which is best?, NIR News, vol 2, No.
4, p 16
J.R. Piggott (Ed.) (1986) Statistical Procedures in Food Research. Elsevier Applied
Science Publishers. ISBN 1-85166-032-1
D.B. Rubin (1987), Multiple imputation for non-response in surveys. Wiley, New
York. Wiley series in probability and mathematical statistics. Applied probability
and statistics. ISBN 0-471-08705-x
G. Spotti (Ed.) (1991) Gaetano e Pietro SGARABOTTO. Liutai – Violin makers 1878
– 1990. Editrice Turris. Cremona. Italia. ISBN 88-7929-000-2
P. Thy & K. Esbensen (1993), Seafloor spreading and the ophiolitic sequences of the
Troodos complex: A principal component analysis of lava and dike compositions,
Jour. Geophysical research, vol 98 B7, pp 11799 - 11805
P. Williams & K. Norris (1987), Near Infrared Technology in the Agricultural and
Food Industries, American Association of Cereal Chemists Inc., ISBN 0-913250-49X
Another way to put this is:
x_ik = x_mean,k + Σ (a = 1 ... A) t_ia p'_ka + e_ik(A)
Frame 20.1
The NIPALS algorithm for PCA
Start:
Select start values, e.g. t_a = the column in X_(a-1) that has the highest
remaining sum of squares.
Repeat points i) to v) until convergence.
iii) Improve the estimate of the score t_a for this factor by projecting the matrix
X_(a-1) on p_a:
t_a = X_(a-1) p_a (p'_a p_a)^(-1)
iv) Improve the estimate of the eigenvalue τ_a:
τ_a = t'_a t_a
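A compact sketch of this NIPALS loop (assuming NumPy; X is taken to be centered and, if desired, weighted beforehand; steps i), ii) and v) of the frame are not reproduced in the excerpt above, so the loadings and convergence steps below follow the standard NIPALS formulation):

    import numpy as np

    def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
        """Return score matrix T (n x A) and loading matrix P (p x A) for centered X."""
        X = X.copy()
        T, P = [], []
        for _ in range(n_components):
            t = X[:, np.argmax(np.sum(X**2, axis=0))].copy()   # start value (Frame 20.1)
            for _ in range(max_iter):
                p = X.T @ t / (t @ t)                 # improve loadings
                p /= np.linalg.norm(p)                # scale p to length 1
                t_new = X @ p                         # step iii): improve scores
                if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                    t = t_new
                    break
                t = t_new
            # step iv): the eigenvalue tau_a = t't (not needed further in this sketch)
            T.append(t); P.append(p)
            X = X - np.outer(t, p)                    # deflate before the next factor
        return np.column_stack(T), np.column_stack(P)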
20.2 PCR
PCR is performed as a two-step operation: first X is decomposed by PCA (see page
519); then the principal components regression is obtained by regressing y on the
scores t.
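A sketch of this two-step operation (assuming NumPy and the nipals_pca sketch above; the function name is hypothetical):

    import numpy as np

    def pcr_fit(X, y, n_components):
        """Principal Component Regression: PCA on X, then regress y on the scores."""
        x_mean, y_mean = X.mean(axis=0), y.mean()
        T, P = nipals_pca(X - x_mean, n_components)
        q = np.linalg.lstsq(T, y - y_mean, rcond=None)[0]   # regress y on the scores t
        b = P @ q                                            # regression vector in original X
        b0 = y_mean - x_mean @ b
        return b0, b

    # Prediction for a new object x: y_hat = b0 + x @ b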
20.3 PLS1
The general form of the PLS model is: X = T ⋅ P' + E and Y = T ⋅ Q' + F
C 2.1  Use the variability remaining in y to find the loading weights w_a,
       using LS and the local 'model'
       X_(a-1) = y_(a-1) w'_a + E
       and scale the vector to length 1. The solution is
       w_a = c X'_(a-1) y_(a-1)
       where c is the scaling factor that makes the length of the final w_a
       equal to 1, i.e.
       c = (y'_(a-1) X_(a-1) X'_(a-1) y_(a-1))^(-0.5)
C 2.2  Estimate the scores t_a using the local 'model'
       X_(a-1) = t_a w'_a + E
       The LS solution is (since w'_a w_a = 1)
       t_a = X_(a-1) w_a
C 2.3  Estimate the spectral loadings p_a using the local 'model'
       X_(a-1) = t_a p'_a + E
Full prediction
For each new prediction object i = 1,2,... perform steps P1 to P3, or alternatively, step P4.
P 1    Scale the input data x_i like the calibration variables. Then compute
       x'_i,0 = x'_i - x̄'
       where x̄ is the center (mean) of the calibration objects.
For each factor a = 1 ... A perform steps P 2.1 - P 2.2.
P 2.1  Find t_i,a according to the formula in C 2.2, i.e.
       t_i,a = x'_i,(a-1) w_a
P 2.2  Compute the new residual x_i,a = x_i,(a-1) - t_i,a p'_a
       If a < A, increase a by 1 and go to P 2.1. If a = A, go to P 3.
P 3    Predict y_i by
       ŷ_i = ȳ + Σ (a = 1 ... A) t_i,a q_a
Short prediction
P 4    As an alternative to steps P 1 - P 3, find ŷ_i by using b_0 and b from C 4,
       i.e. ŷ_i = b_0 + x'_i b
Note that P and Q are not normalized. T and W are normalized to 1 and orthogonal.
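A sketch of the PLS1 calibration steps C 2.1 - C 2.3 and the short prediction (assuming NumPy; X and y are centered internally; the q-loading update and the closed-form regression vector b = W (P'W)^(-1) q, referred to as C 4 above but not reproduced in this excerpt, are assumed to take their standard forms):

    import numpy as np

    def pls1_fit(X, y, n_components):
        """PLS1 calibration; returns b0 and b for the short prediction y = b0 + x'b."""
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xa, ya = X - x_mean, y - y_mean
        W, P, Q = [], [], []
        for _ in range(n_components):
            w = Xa.T @ ya
            w /= np.linalg.norm(w)             # C 2.1: loading weights, scaled to length 1
            t = Xa @ w                         # C 2.2: scores (since w'w = 1)
            p = Xa.T @ t / (t @ t)             # C 2.3: spectral loadings
            q = ya @ t / (t @ t)               # chemical loading (assumed standard LS step)
            Xa = Xa - np.outer(t, p)           # X-residuals for the next factor
            ya = ya - t * q                    # y-residuals for the next factor
            W.append(w); P.append(p); Q.append(q)
        W, P, Q = np.column_stack(W), np.column_stack(P), np.array(Q)
        b = W @ np.linalg.inv(P.T @ W) @ Q     # assumed standard PLS1 regression vector
        b0 = y_mean - x_mean @ b
        return b0, b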
20.4 PLS2
Frame 20.3
Simultaneous PLSR calibration for several Y-variables ('PLS2
regression')
C 2.4b Test whether convergence has occurred, e.g. by checking that the
       elements no longer change meaningfully between iterations.
C 2.4c If convergence is not reached, estimate temporary factor
       scores u_a using the 'model'
       Y_(a-1) = u_a q'_a + F
Hardware Requirements
We recommend a Pentium PC running at 100 MHz or more. Memory is an
important issue: at least 16 MB of RAM should be available, preferably 32 MB.
Using a more powerful PC improves performance significantly and is advisable if
your data tables are large.
Software Requirements
The Unscrambler software is written for the Windows 95 and Windows NT (3.51
or later) operating systems. The program does not run on Windows 3.x or Windows
for Workgroups platforms.
Installation Procedure
The Unscrambler is supplied on a set of floppy disks or a single CD-ROM. If you
have a floppy version, insert disk 1 into your floppy drive and use the File
Manager or Windows Explorer to run SETUP.EXE on the floppy disk. If you have
a CD-ROM version, the SETUP.EXE program can be found in the DISK1
directory.
Supervisor Responsibilities
The Unscrambler requires that one person is appointed as supervisor (system
manager). The supervisor’s main task is to maintain the user accounts.
The supervisor must log in after installation and define the users who are allowed
access to The Unscrambler before they can begin to work with the program.
Start The Unscrambler and log in as supervisor by clicking on the caption bar in
the login window with the right mouse button or pressing <Ctrl>+<Shift>+<S> (see
Figure 21.1). The default supervisor password at delivery is SYSOP.
User accounts are maintained from Project - System Setup. Select the Users tab
in the System Setup dialog (shown in Figure 21.2). New users are added by
pressing New. Select a user from the Users list and press Modify to set or change
the password.
The supervisor also defines how missing values should be handled by default when
users import or export data. Finally, the supervisor can also move the data directory
to a new location by pressing Change on the Directories sheet (see Figure 21.3).
Note that the data files are copied to the new location, not physically moved. This
ensures that a backup exists if the location change fails for some reason. The
previous data directory can be removed manually if desired.
You can also use The Unscrambler to design the experiments you need to perform
in order to obtain results which you can analyze.
The following are the five basic types of problems which can be solved using The
Unscrambler:
• Design experiments, analyze effects and find optima;
• Find relevant variation in one data matrix;
• Find relationships between two data matrices (X and Y);
• Predict the unknown values of a response variable;
• Classify unknown samples into various possible categories.
You should always remember, however, that there is no point in trying to analyze
data if they do not contain any meaningful information. Experimental design is a
valuable tool for building data tables which give you such meaningful information.
The Unscrambler can help you do this in an elegant way.
The descriptions and screen dumps in this manual are taken from a Windows 95
installation. Some dialogs may differ in appearance on Windows NT systems,
although their functions remain the same.
The Toolbar
The Toolbar buttons give you shortcuts to the most frequently used commands.
When you let the mouse cursor rest on a toolbar button, a short explanation of its
function appears.
On the right-hand side of the status bar, additional information, such as the value of
the current cell in the Editor and the size of the data table, is displayed.
Basic Notions
The Editor consists of a data table made up of rows and columns. The intersection
of a column and a row is called a cell; each cell holds a data value. The rows and
columns correspond to samples and variables respectively. Samples and variables
are identified by a number and a name.
You can also select a range of cells in the Editor, i.e. one or more columns, or one
or more rows.
A whole row or column can be selected by clicking with the left mouse button on
the sample or variable number (the gray area between the names and the data table
itself). Keep the button down and drag the cursor to select more rows or columns.
Selecting a new range removes the last range.
To add new samples or variables to an existing selection and to make a range, press
the <Ctrl> key while you click on the appropriate samples or variables. The range
may be continuous or non-continuous. You can also deselect a sample or variable
by pressing the <Ctrl> key while clicking on the object you want to remove from
the range; the <Ctrl>-click thus acts as a toggle. This is only possible with the mouse.
Hold down the <Shift> key while you make the selection if you want to select a
continuous block of samples or variables between the last selection and the present
selection.
When you make a selection, you always mark either samples or variables, i.e. you
either select some variables for all samples or some samples for all variables. You
can also mark the whole matrix, but the selection is still sample or variable
oriented. The difference is important because you define sets (see chapter 21.5.1 )
based on either samples or variables. You see whether you are marking samples or
variables by looking at the shape of the mouse pointer as you make the selection;
see Figure 21.6.
Figure 21.6 - The shape of the mouse pointer when marking samples and
variables respectively
Screen Layout
If the data table is larger than the screen, you can scroll the Editor.
Information about the active cell is displayed in The Unscrambler’s status bar.
Variable names are displayed in black if the variable is continuous and in blue if it
is a category variable. Locked cells, e.g. design variables, are grayed out to show
that they cannot be edited.
The options in the Plot menu are shown in Figure 21.7. You can choose between
several different plots, depending on how many samples/variables you have
selected. A dialog will appear, in which you select which set to plot.
Figure 21.7 - Options in the Plot menu when one variable is selected
Several Viewers can be open at the same time. In addition, one Viewer can display
several plots. This is possible because the Viewer is divided into seven so-called
sub-views, organized as shown in Table 21.1.
Plot Information
The Unscrambler gives a lot of information about the data in the current plot. If
the Plot ID is turned on, a line at the bottom of the plot displays basic information.
Toggle the Plot ID on and off using View - Plot ID. Table 21.2 shows some typical
ways of identifying plots.
Other information about the plotted data, such as data source, explanation of
colors and symbols, etc., may also be shown in a separate window using
Window - Identification. These windows are dockable views.
Use View - Plot Statistics to display the most relevant statistical measures.
Information on each object in the plot can be displayed simply by letting the mouse
cursor rest on the object in the plot. A brief explanation of the data point then
appears. Click with the left mouse button to display more detailed information
about the data object.
Use of Colors
There are two pre-set color schemes in The Unscrambler: Black background and
White background. You can change the color of any of the items of the Viewer.
This is done through File - System Setup - Viewer - Define colors…. It is
possible to use different color schemes for the screen and the printer. Note
that items other than the background and the axes (foreground) also differ
between the two preset color schemes; see Table 21.3 for details.
It is also possible to set the color for a specific item. The changes will be shown on
the preview screen.
Dockable views are toggled on and off in the Window or View menu. The dockable
views in the Window menu are Identification and Warning List; the Outlier List
is found in the View menu.
Click the title bar of a dockable view to drag it around the screen. The shape
of the view changes when you get close to the edge of The Unscrambler
workspace. When you release the mouse button, the view is glued to the edge. To
move it again, click inside the docked view and drag it away. When you get
outside or well inside the edges of The Unscrambler workspace, the shape
changes again and the view becomes a floating window.
21.4.5 Dialogs
When you are working in The Unscrambler, you will often have to enter
information or make choices in order to be able to complete your project, such as
specifying the names of files you want to work with or the sets which you want to
analyze, or how many PCs you want to compute (see chapter 21.5.1 on page 540
for an explanation of sets). This is done in dialogs, which will normally look
something like the one pictured in Figure 21.9.
(Figure 21.9 shows a typical dialog with radio buttons, a drop-down list, a
field for a list of values or ranges, and a button that opens a new dialog.)
This particular dialog is the one you enter when you want to run a Regression on
your data. Items that are predefined, such as sets, file filters, etc., are selected from
a drop-down list. Ranges of samples or variables are entered as shown in the Keep
Out of Calculation field in the figure. You can use a comma to separate two items
in a field, and a hyphen to specify the whole range between two values.
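As an illustration of this range notation (outside The Unscrambler itself), the following minimal Python sketch expands a hypothetical range string such as "3, 7-10, 15" into the individual sample or variable numbers it denotes:

def expand_ranges(text):
    # Expand e.g. "3, 7-10, 15" into [3, 7, 8, 9, 10, 15].
    numbers = []
    for item in text.split(","):
        item = item.strip()
        if "-" in item:
            start, stop = (int(part) for part in item.split("-"))
            numbers.extend(range(start, stop + 1))
        elif item:
            numbers.append(int(item))
    return numbers

print(expand_ranges("3, 7-10, 15"))   # [3, 7, 8, 9, 10, 15]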
Options which are mutually exclusive are selected via radio buttons. Tick boxes are
used to select multiple options. For example, you may center data and issue
warnings at the same time.
Access the Help system at any time by pressing the <F1> button or clicking on the
help button in the dialogs. The Help file is automatically opened at the appropriate
topic.
You may also open the help system by selecting Help - Unscrambler Help Topics;
this displays all the contents of the Help file. From there you can click your way to
the items you are interested in, just as you would open a book. Use the Index tab to
search for keywords.
Several levels of help are available. Click on underlined words to follow built-in
links to related help topics.
21.4.7 Tooltips
Whenever you let the cursor rest on one of The Unscrambler’s buttons or icons, a
small yellow label pops up to tell you its function. This is the quickest way to learn
the functions of toolbar buttons.
21.5.1 Analyses
Most projects involve many different problems; yours probably will too. Let us
illustrate this with an example, a study of bananas involving different types of
measurements of many different properties. The aim of the study is to find answers
to questions like:
• Are there any correlations between sensory measurements and color
measurements?
• Can preference measurements be predicted from sensory measurements?
• Can sensory measurements be predicted from chemical measurements?
It is possible to combine data tables like this, because one table can contain several
“matrices”:
The Unscrambler by default predefines the special Sets “All Variables” and
“Currently selected Variables” (available if you have marked variables in the
Editor) to make selection easier. You can define as many additional sets as you
need in the Set Editor, which you enter by selecting Modify - Edit Set.
Variable Sets
In the banana example discussed above you could define one Variable Set called
“Sensory”, which contains the sensory measurements. You could also define
another set “Pref” for the preference measurements, the set “Chemistry” for the
chemical measurements, etc. (see Figure 21.13).
Depending on which problem you want to solve, it is then easy to select which
variables to use as X and which as Y.
Note that a set can contain non-continuous selections. The “Sensory” set in Figure
21.13 is split in two by the “Chemistry” set. We also see that some of the variables
in the “Sensory” set are part of a “Fourth” set as well; the sets overlap.
Note!
A regression model cannot be based on two overlapping sets used as X
and Y respectively. The program issues a warning if you have defined a
model like this.
Sample Sets
In practice, you do not always want to use all the available samples in a particular
analysis. For example, you might have collected one group of samples early in
the season and two later groups taken at two different sites (A and B).
From these three groups of samples it is possible to define a range of different
Sample Sets, such as “Late”, “Early”, “Late A”, “Late B”, etc. (see Figure 21.14). It
is then easy to make different models based on different parts of the data table.
Remember that “All Samples” is a predefined set.
Note!
Do not define separate Sample Sets for calibration and validation
samples. Samples used for validation are always taken from the Sample
Set that you use when you make the model from the Task menu.
Give meaningful names to your variables and samples, so that you can remember
what they are later on. Add and edit category variables, and define Sample and
Variable Sets, again using appropriate names.
If you find that you need to define a new set at a later stage of your work in The
Unscrambler, this can be done in the appropriate analysis dialog under the Task
menu when you are about to select which data to analyze.
Note that the Keep Out of Calculation function does not affect the raw data; it
simply marks data to be disregarded during the analysis. Never delete data from
the raw data file if there is a chance that you might need them in the future;
instead, use Keep Out of Calculation to remove the data temporarily.
Note!
The “Selected samples/variables” Set is not saved on disk, so do not use
this option for important sets! However, a copy of the set used to make a
model is always saved in the result file.
You specify the set properties when you define a new Variable Set, and you can
change them later using Modify - Edit Set.
Note!
The set containing the design variables is set to Non-Spectra by default;
this setting cannot be changed.
You may have a residual plot filling the whole Viewer and want to look at another
result together with this plot. Select Window - Copy To - 2. The Viewer window is
split in two and the residual plot is copied to the upper half. Click the lower sub-
view to activate it (the active sub-view is indicated by a light blue frame) and create
the other plot from the Plot menu.
Another frequently occurring situation is this: After an analysis you open the
Viewer to look at an overview of your model results. But you also want to look at a
fifth plot from the same results file. You can easily display this fifth plot
without disturbing the four plots in the model overview by selecting Window - Go To - 1. An
empty sub-view pops up, allowing you to plot the desired predefined result plot
from the Plot menu. Go back to the overview by selecting Window - Go To - 4 (or
5, 6 or 7).
You can use this feature to detect a sample you cannot find in the score plot: Mark
the sample in the Editor and you will see immediately where it is located on the
score plot.
The context sensitive menus are accessed by clicking the right mouse button while
the cursor rests within the area on which you want to perform an operation. The
menus that appear give you access to the most common commands for the current
task. Figure 21.15 shows a typical context sensitive menu which applies to the
selected area in the data table.
In addition to the regular commands, the file dialogs contain functions that
are not available from the ordinary menus. For example, file deletion is only
possible from the Open File dialog, by clicking the right mouse button and
selecting Delete.
In the Editor and Viewer the context sensitive menus are more like shortcuts to
the most frequently used commands.
Using Toolbars
Toolbars give you shortcuts to the most frequently used commands as predefined
icons, so that you do not have to search through the menus. The Toolbars are
normally placed right below the Menu bar. You can drag the Toolbars onto the
workspace, where they stay floating over your Editors and Viewers.
If a Toolbar disappears, you can toggle it back on in View - Toolbars.
Glossary of Terms
Accuracy
The accuracy of a measurement method is its faithfulness, i.e. how close
the measured value is to the actual value.
Additive Noise
Noise on a variable is said to be additive when its size is independent of
the level of the data value. The range of additive noise is the same for
small data values as for larger data values.
Analysis Of Effects
Calculation of the effects of design variables on the responses. It
consists mainly of Analysis of Variance (ANOVA), various Significance
Tests, and Multiple Comparisons whenever they apply.
Analysis Of Variance
Classical method to assess the significance of effects by decomposition
of a response’s variance into explained parts, related to variations in the
predictors, and a residual part which summarizes the experimental error.
The main ANOVA results are: Sum of Squares (SS), number of Degrees
of Freedom (DF), Mean Square (MS=SS/DF), F-value, p-value.
ANOVA
See Analysis of Variance
Axial Design
One of the three types of mixture designs with a simplex-shaped
experimental region. An axial design consists of extreme vertices,
overall center, axial points, end points. It can only be used for linear
modeling, and therefore it is not available for optimization purposes.
Axial Point
In an axial design, an axial point is positioned on the axis of one of the
mixture variables, and must be above the overall center, opposite the end
point.
B-Coefficient
See Regression Coefficient.
Bias
Systematic difference between predicted and measured values. The bias
is computed as the average value of the residuals.
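As a sketch of this calculation (Python with NumPy, made-up values rather than any particular data set), the bias is simply the average residual, with residuals taken as observed minus predicted as under Residual below:

import numpy as np

y_measured  = np.array([4.1, 5.0, 6.2, 7.1])   # reference values (assumed)
y_predicted = np.array([4.3, 4.9, 6.5, 7.4])   # model predictions (assumed)

residuals = y_measured - y_predicted            # observed minus predicted
bias = residuals.mean()
print(bias)              # negative here: predictions are too high on average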
Bilinear Modeling
Bilinear modeling (BLM) is one of several possible approaches for data
compression.
Observation = Data Structure + Error
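One common way to realize such a decomposition is to project the (centered) data onto a few components, as PCA does. The following NumPy sketch, using random example data, splits a data matrix X into a structure part described by two components and a residual part E; it illustrates the principle only, not The Unscrambler's own NIPALS implementation:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))           # example data table: 10 samples, 6 variables
X = X - X.mean(axis=0)                 # mean centering

U, s, Vt = np.linalg.svd(X, full_matrices=False)
a = 2                                  # number of components kept
T = U[:, :a] * s[:a]                   # scores
P = Vt[:a].T                           # loadings

structure = T @ P.T                    # modeled part
E = X - structure                      # residual ("error") part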
Box-Behnken Design
A class of experimental designs for response surface modeling and
optimization, based on only 3 levels of each design variable. The mid-
levels of some variables are combined with extreme levels of others. The
combinations of only extreme levels (i.e. cube samples of a factorial
design) are not included in the design.
Box-plot
The Box-plot represents the distribution of a variable in terms of percentiles:
minimum value, 25% percentile, median, 75% percentile, and maximum value.
Calibration
Stage of data analysis where a model is fitted to the available data, so
that it describes the data as well as possible.
After calibration, the variation in the data can be expressed as the sum of
a modeled part (structure) and a residual part (noise).
Calibration Samples
Samples on which the calibration is based. The variation observed in the
variables measured on the calibration samples provides the information
that is used to build the model.
Candidate Point
In the D-optimal design generation, a number of candidate points are
first calculated. These candidate points consist of extreme vertices and
centroid points. Then a subset of these candidate points is selected
D-optimally to create the set of design points.
Category Variable
A category variable is a class variable, i.e. each of its levels is a category
(or class, or type), without any possible quantitative equivalent.
Center Sample
Sample for which the value of every design variable is set at its mid-
level (halfway between low and high).
Centering
See Mean Centering.
Central Composite Design
Central Composite designs have the advantage that they can be built as
an extension of a previous factorial design, if there is no reason to
change the ranges of variation of the design variables.
If the default star point distance to center is selected, these designs are
rotatable.
Centroid Design
See Simplex-centroid design.
Centroid Point
A centroid point is calculated as the mean of the extreme vertices on the
design region surface associated with this centroid point. It is used in
Simplex-centroid designs, axial designs and D-optimal mixture/non-
mixture designs.
Classification
Data analysis method used for predicting class membership.
Classification can be seen as a predictive method where the response is a
category variable. The purpose of the analysis is to be able to predict
which category a new sample belongs to. The main classification
method implemented in The Unscrambler is SIMCA classification.
Each new sample is projected onto each PCA model. According to the
outcome of this projection, the sample is either recognized as a member
of the corresponding class, or rejected.
Collinearity
Linear relationship between variables. Two variables are collinear if the
value of one variable can be computed from the other, using a linear
relation. Three or more variables are collinear if one of them can be
expressed as a linear function of the others.
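A minimal numerical illustration (Python/NumPy, made-up values): when one variable is an exact linear function of another, their correlation is +1 or -1.

import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.5 * x1 - 1.0                    # x2 is perfectly collinear with x1

print(np.corrcoef(x1, x2)[0, 1])       # 1.0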
Component
See Principal Component.
Condition Number
The square root of the ratio of the largest eigenvalue to the smallest
eigenvalue of the experimental matrix. The higher the condition number, the
more stretched out the experimental region; conversely, the lower the condition
number, the more spherical the region. The ideal condition number is 1; the
closer to 1, the better.
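A sketch of this calculation, assuming the eigenvalues are taken from X'X for a design matrix X with coded -1/+1 levels (Python/NumPy):

import numpy as np

# Full factorial design in two variables, coded -1/+1.
X = np.array([[-1.0, -1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0],
              [ 1.0,  1.0]])

eigenvalues = np.linalg.eigvalsh(X.T @ X)
condition_number = np.sqrt(eigenvalues.max() / eigenvalues.min())
print(condition_number)   # 1.0: an orthogonal design gives the ideal value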
Confounded Effects
Two (or more) effects are said to be confounded when their contributions to the
variation in the responses cannot be separated, i.e. the observed variation
cannot be traced back unambiguously to the individual design variables with
which those effects are associated.
Confounding Pattern
The confounding pattern of an experimental design is the list of the
effects that can be studied with this design, with confounded effects
listed on the same line.
Constrained Design
Experimental design involving multi-linear constraints between some of
the design variables. There are two types of constrained designs:
classical Mixture designs and D-optimal designs.
Constraint
See Multi-linear constraint.
Continuous Variable
Quantitative variable measured on a continuous scale.
Corner Sample
See vertex sample.
Correlation
A unitless measure of the amount of linear relationship between two
variables.
COSCIND
A method used to check the significance of effects using a scale-
independent distribution as comparison. This method is useful when
there are no residual degrees of freedom.
Covariance
A measure of the linear relationship between two variables.
Cross Terms
See interaction effects.
Cross-Validation
Validation method where some samples are kept out of the calibration
and used for prediction. This is repeated until all samples have been kept
out once. Validation residual variance can then be computed from the
prediction residuals.
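The following Python/NumPy sketch shows the principle with full (leave-one-out) cross-validation of a simple least-squares model on made-up data; each sample is kept out once and predicted from a model built on the remaining samples:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))                       # 12 samples, 3 predictors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=12)

press = 0.0                                        # sum of squared prediction residuals
for i in range(len(y)):
    keep = np.arange(len(y)) != i                  # all samples except sample i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    y_pred = X[i] @ b
    press += (y[i] - y_pred) ** 2

validation_residual_variance = press / len(y)
print(validation_residual_variance)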
Cube Sample
Any sample which is a combination of high and low levels of the design
variables, in experimental plans based on two levels of each variable.
Curvature
Curvature means that the true relationship between response variations
and predictor variations is non-linear.
Data Compression
Concentration of the information carried by several variables onto a few
underlying variables.
The basic idea behind data compression is that observed variables often
contain common information, and that this information can be expressed
by a smaller number of variables than originally observed.
Degree Of Fractionality
The degree of fractionality of a factorial design expresses how much the
design has been reduced compared to a full factorial design with the
same number of variables. It can be interpreted as the number of design
variables that should be dropped to compute a full factorial design with
the same number of experiments.
Degrees Of Freedom
The number of degrees of freedom of a phenomenon is the number of
independent ways this phenomenon can be varied.
Design Variable
Experimental factor for which the variations are controlled in an
experimental design.
Design
See Experimental Design.
Distribution
Shape of the frequency diagram of a measured variable or calculated
parameter. Observed distributions can be represented by a histogram.
D-Optimal Design
Experimental design generated by the DOPT algorithm. A D-optimal
design takes into account the multi-linear relationships existing between
design variables, and thus works with constrained experimental regions.
There are two types of D-optimal designs: D-optimal Mixture designs
and D-optimal Non-Mixture designs, according to the presence or
absence of Mixture variables.
D-Optimal Principle
Principle consisting in the selection of a sub-set of candidate points
which define a maximal volume region in the multi-dimensional space.
The D-optimal principle aims at minimizing the condition number.
End Point
In an axial or a simplex-centroid design, an end point is positioned at the
bottom of the axis of one of the mixture variables, and is thus positioned
on the side opposite to the axial point.
Experimental Design
Plan for experiments where input variables are varied systematically
within predefined ranges, so that their effects on the output variables
(responses) can be estimated and checked for significance.
The number of experiments and the way they are built depends on the
objective and on the operational constraints.
Experimental Error
Random variation in the response that occurs naturally when performing
experiments.
Experimental Region
N-dimensional area investigated in an experimental design with N
design variables. The experimental region is defined by:
• the ranges of variation of the design variables,
• if any, the multi-linear relationships existing between design
variables.
In the case of multi-linear constraints, the experimental region is said to
be constrained.
Explained Variance
Share of the total variance which is accounted for by the model.
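A small worked example of this share (Python/NumPy, made-up observed and fitted response values):

import numpy as np

y_observed = np.array([2.0, 3.5, 5.1, 6.4, 8.2])
y_fitted   = np.array([2.2, 3.4, 5.0, 6.6, 8.0])

ss_total    = np.sum((y_observed - y_observed.mean()) ** 2)
ss_residual = np.sum((y_observed - y_fitted) ** 2)

explained_variance = 100.0 * (1.0 - ss_residual / ss_total)
print(explained_variance)        # percent of the total variance accounted for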
F-Distribution
The F-distribution (Fisher distribution) is the distribution of the ratio
between two variances.
Fixed Effect
Effect of a variable for which the levels studied in an experimental
design are of specific interest.
F-Ratio
The F-ratio is the ratio between explained variance (associated to a
given predictor) and residual variance. It shows how large the effect of
the predictor is, as compared with random noise.
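As an illustration (Python with SciPy, made-up mean squares and degrees of freedom), the F-ratio can be compared against the F-distribution to obtain a p-value:

from scipy import stats

ms_effect, df_effect = 45.0, 1        # explained part for one predictor (assumed values)
ms_error,  df_error  =  5.0, 8        # residual / experimental error (assumed values)

f_ratio = ms_effect / ms_error
p_value = stats.f.sf(f_ratio, df_effect, df_error)   # upper-tail probability
print(f_ratio, p_value)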
Full Factorial Design
Such designs are often used for extensive study of the effects of few
variables, especially if some variables have more than two levels. They
are also appropriate as advanced screening designs, to study both main
effects and interactions, especially if no Resolution V design is
available.
Histogram
A plot showing the observed distribution of data points. The data range
is divided into a number of bins (i.e. intervals) and the number of data
points that fall into each bin is summed up.
The height of each bar in the histogram shows how many data points fall
within the data range of the corresponding bin.
Influence
A measure of how much impact a single data point (or a single variable)
has on the model. The influence depends on the leverage and the
residuals.
Interaction
There is an interaction between two design variables when the effect of
the first variable depends on the level of the other. This means that the
combined effect of the two variables is not equal to the sum of their
main effects.
Intercept
=Offset. The point where a regression line crosses the ordinate (Y-axis).
Interior Point
Point which is not located on the surface of the experimental region, but
inside it. For example, an axial point is a particular kind of interior point.
Interior points are used in classical mixture designs.
Lack Of Fit
In Response Surface Analysis, the ANOVA table includes a special
section which checks whether the regression model describes the true
shape of the response surface. Lack of fit means that the true shape is
likely to be different from the shape indicated by the model.
If there is a significant lack of fit, you can investigate the residuals and
try a transformation.
Lattice Degree
The degree of a Simplex-Lattice design corresponds to the maximal
number of experimental points -1 for a level 0 of one of the Mixture
variables.
Lattice Design
See Simplex-lattice design.
Leveled Variables
A leveled variable is a variable which consists of discrete values instead
of a range of continuous values. Examples are design variables and
category variables.
Levels
Possible values of a variable. A category variable has several levels,
which are all possible categories. A design variable has at least a low
and a high level, which are the lower and higher bounds of its range of
variation. Sometimes, intermediate levels are also included in the design.
Leverage Correction
A quick method to simulate model validation without performing any
actual predictions.
Leverage
A measure of how extreme a data point or a variable is compared to the
majority.
Average data points have a low leverage. Points or variables with a high
leverage are likely to have a high influence on the model.
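One common way to compute sample leverages is as the diagonal of the hat matrix H = X (X'X)^-1 X' for a centered X matrix; the exact formula used by a given program may differ. A Python/NumPy sketch with random example data:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))            # 8 samples, 2 variables (example data)
X = X - X.mean(axis=0)                 # centering

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
leverage = np.diag(H)
print(leverage)                        # extreme samples get the highest values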
Linear Effect
See Main Effect.
Linear Model
Regression model including as X-variables the linear effects of each
predictor. The linear effects are also called main effects.
Loading Weights
Loading weights are estimated in PLS regression. Each X-variable has a
loading weight along each model component.
The loading weights show how much each predictor (or X-variable)
contributes to explaining the response variation along each model
component. They can be used, together with the Y-loadings, to represent
the relationship between X- and Y-variables as projected onto one, two
or three components (line plot, 2D scatter plot and 3D scatter plot
respectively).
Loadings
Loadings are estimated in bilinear modeling methods where information
carried by several variables is concentrated onto a few components.
Each variable has a loading along each model component.
The loadings show how well a variable is taken into account by the
model components. You can use them to understand how much each
variable contributes to the meaningful variation in the data, and to
interpret variable relationships. They are also useful to interpret the
meaning of each model component.
Lower Quartile
The lower quartile of an observed distribution is the variable value that
splits the observations into 25% lower values, and 75% higher values. It
can also be called 25% percentile.
Main Effect
Average variation observed in a response when a design variable goes
from its low to its high level.
Mean
Average value of a variable over a specific sample set. The mean is
computed as the sum of the variable values, divided by the number of
samples.
The mean gives a value around which all values in the sample set are
distributed. In Statistics results, the mean can be displayed together with
the standard deviation.
Mean Centering
Subtracting the mean (average value) from a variable, for each data
point.
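A minimal Python/NumPy sketch of mean centering a small data table, column (variable) wise:

import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 12.0],
              [3.0, 14.0]])

X_centered = X - X.mean(axis=0)
print(X_centered)                 # every column now has mean 0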
Median
The median of an observed distribution is the variable value that splits
the distribution in its middle: half the observations have a lower value
than the median, and the other half have a higher value. It can also be
called the 50% percentile.
MixSum
See Mixture Sum.
Mixture Components
Ingredients of a mixture. There must be at least three components to
define a mixture. A single component cannot be called a mixture, and two
components mixed together do not require a Mixture design to be studied:
simply study the variation in the quantity of one of them as a classical
process variable.
Mixture Constraint
Multi-linear constraint between Mixture variables. The general equation
for the Mixture constraint is
X1 + X2 +…+ Xn = S
where the Xi represent the ingredients of the mixture, and S is the total
amount of mixture. In most cases, S is equal to 100%.
Mixture Design
Special type of experimental design, applying to the case of a Mixture
constraint. There are three types of classical Mixture designs: Simplex-
Lattice design, Simplex-Centroid design, and Axial design. Mixture
designs that do not have a simplex experimental region are generated
D-optimally; they are called D-optimal Mixture designs.
Mixture Region
Experimental region for a Mixture design. The Mixture region for a
classical Mixture design is a simplex.
Mixture Sum
In The Unscrambler, global proportion of a mixture. Generally, the
mixture sum is equal to 100%. However, it can be lower than 100% if
the quantity in one of the components has a fixed value.
The mixture sum can also be expressed as fractions, with values varying
from 0 to 1.
Mixture Variable
Experimental factor for which the variations are controlled in an
experimental mixture design or D-optimal mixture design. Mixture
variables are multi-linearly linked by a special constraint called mixture
constraint.
Model Center
The model center is the origin around which variations in the data are
modeled. It is the (0,0) point on a score plot.
If the variables have been centered, samples close to the average will lie
close to the model center.
Model
Mathematical equation summarizing variations in a data set.
Models are built so that the structure of a data table can be understood
better than by just looking at all raw values.
Model Check
In Response Surface Analysis, a section of the ANOVA table checks
how useful the interactions and squares are, compared with a purely
linear model. This section is called Model Check.
If one part of the model is not significant, it can be removed so that the
remaining effects are estimated with a better precision.
Multi-Linear Constraint
Linear relationships between two or more variables. The constraints have the
general form:
A1·X1 + A2·X2 + … + An·Xn + A0 ≥ 0
Noise
Random variation that does not contain any information.
Non-Linearity
Deviation from linearity in the relationship between a response and its
predictors.
Normal Distribution
Frequency diagram showing how independent observations, measured
on a continuous scale, would be distributed if there were an infinite
number of observations and no factors caused systematic effects.
The observed values are used as abscissa, and the ordinate displays the
corresponding percentiles on a special scale. Thus, if the values are normally
distributed, the points fall approximately on a straight line.
Offset
See Intercept.
Optimization
Finding the settings of design variables that generate optimal response
values.
Orthogonal
Two variables are said to be orthogonal if they are completely
uncorrelated, i.e. their correlation is 0.
In PCA and PCR, the principal components are orthogonal to each other.
Orthogonal Designs
All classical designs available in The Unscrambler are built in such a
way that the studied effects are orthogonal to each other. They are called
orthogonal designs.
Outlier
An observation (outlying sample) or variable (outlying variable) which
is abnormal compared to the major part of the data.
Extreme points are not necessarily outliers; outliers are points that
apparently do not belong to the same population as the others, or that are
badly described by a model.
Overfitting
For a model, overfitting is a tendency to describe too much of the
variation in the data, so that not only consistent structure is taken into
account, but also some noise or uninformative variation.
Partial Least Squares Regression (PLS)
By plotting the first PLS components one can view main associations
between X-variables and Y-variables, and also interrelationships within
X-data and within Y-data.
PCA
See Principal Component Analysis.
PCR
See Principal Component Regression.
Percentile
The X% percentile of an observed distribution is the variable value that
splits the observations into X% lower values, and 100-X% higher
values.
Plackett-Burman Design
A very reduced experimental plan used for a first screening of many
variables. It gives information about the main effects of the design
variables with the smallest possible number of experiments.
If there are only two classes to separate, the PLS model uses one
response variable, which codes for class membership as follows: -1 for
members of one class, +1 for members of the other one. The PLS1
algorithm is then used.
If there are three classes or more, PLS2 is used, with one response
variable (-1/+1 or 0/1, which is equivalent) coding for each class.
PLS1
Version of the PLS method with only one Y-variable.
PLS2
Version of the PLS method in which several Y-variables are modeled
simultaneously, thus taking advantage of possible correlations or
collinearity between Y-variables.
Precision
The precision of an instrument or a measurement method is its ability to
give consistent results over repeated measurements performed on the
same object. A precise method will give several values that are very
close to each other.
Precision differs from accuracy, which has to do with how close the
average measured value is to the target value.
Prediction
Computing response values from predictor values, using a regression
model.
The new X-values are fed into the model equation (which uses the
regression coefficients), and predicted Y-values are computed.
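A sketch of this step in Python/NumPy, with made-up regression coefficients and new X-values, for a multiple linear regression equation of the form y = b0 + X·b:

import numpy as np

b0 = 1.5                                    # intercept (offset), assumed
b  = np.array([0.8, -0.3, 2.1])             # regression coefficients, assumed

X_new = np.array([[1.0, 2.0, 0.5],          # new samples, same X-variables
                  [0.4, 1.1, 1.3]])

y_predicted = b0 + X_new @ b
print(y_predicted)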
Predictor
Variable used as input in a regression model. Predictors are usually
denoted X-variables.
Principal Component
Principal Components (PCs) are composite variables, i.e. linear
functions of the original variables, estimated to contain, in decreasing
order, the main structured information in the data. A PC is the same as a
score vector, and is also called a latent variable.
Process Variable
In The Unscrambler, experimental factor for which the variations are
controlled in an experimental design, and to which the mixture variable
definition does not apply (see Mixture Variable).
Projection
Principle underlying bilinear modeling methods such as PCA, PCR and
PLS.
Proportional Noise
Noise on a variable is said to be proportional when its size depends on
the level of the data value. The range of proportional noise is a
percentage of the original data values.
p-Value
The p-value measures the probability that a parameter estimated from
experimental data should be as large as it is, if the real (theoretical, non-
observable) value of that parameter were actually zero. Thus, p-value is
used to assess the significance of observed effects or variations: a small
p-value means that you run little risk of mistakenly concluding that the
observed effect is real.
The usual limit used in the interpretation of a p-value is 0.05 (or 5%). If
p-value < 0.05, you have reason to believe that the observed effect is not
due to random variations, and you may conclude that it is a significant
effect.
Quadratic Model
Regression model including as X-variables the linear effects of each
predictor, all two-variable interactions, and the square effects.
Random Effect
Effect of a variable for which the levels studied in an experimental
design can be considered to be a small selection of a larger (or infinite)
number of possibilities.
Random Order
Randomization is the random mixing of the order in which the
experiments are to be performed. The purpose is to avoid systematic
errors which could interfere with the interpretation of the effects of the
design variables.
Reference Sample
Sample included in a designed data table to compare a new product
under development to an existing product of a similar type.
The design file will contain only response values for the reference
samples, whereas the input part (the design part) is missing (m).
Regression Coefficient
In a regression model equation, regression coefficients are the numerical
coefficients that express the link between variation in the predictors and
variation in the response.
Regression
Generic name for all methods relating the variations in one or several
response variables (Y-variables) to the variations of several predictors
(X-variables), with explanatory or predictive purposes.
Repeated Measurement
Measurement performed several times on one single experiment or
sample.
Replicate
Replicates are experiments that are carried out several times. The
purpose of including replicates in a data table is to estimate the
experimental error.
Residual
A measure of the variation that is not taken into account by the model.
The residual for a given sample and a given variable is computed as the
difference between observed value and fitted (or projected, or predicted)
value of the variable on the sample.
Residual Variance
The mean square of all residuals, sample- or variable-wise.
Resolution
Information on the degree of confounding in fractional factorial designs.
Note!
Response surface analysis can be run on designed or non-
designed data.
Response Variable
Observed or measured parameter which a regression model tries to
predict.
RMSEC
Root Mean Square Error of Calibration. A measurement of the average
difference between predicted and measured response values, at the
calibration stage.
RMSED
Root Mean Square Error of Deviations. A measurement of the average
difference between the abscissa and ordinate values of data points in any
2D scatter plot.
RMSEP
Root Mean Square Error of Prediction. A measurement of the average
difference between predicted and measured response values, at the
prediction or validation stage.
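The same root-mean-square formula underlies RMSEC and RMSEP; only the samples it is computed on differ. A Python/NumPy sketch with made-up values:

import numpy as np

y_measured  = np.array([4.1, 5.0, 6.2, 7.1, 8.3])
y_predicted = np.array([4.3, 4.9, 6.5, 7.4, 8.1])

rmse = np.sqrt(np.mean((y_predicted - y_measured) ** 2))
print(rmse)      # average prediction error, in the units of the response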
Sample
Object or individual on which data values are collected, and which
builds up a row in a data table.
Scaling
See Weighting.
Scatter Effects
In spectroscopy, scatter effects are effects that are caused by physical
phenomena, like particle size, rather than chemical properties. They
interfere with the relationship between chemical properties and shape of
the spectrum. There can be additive and multiplicative scatter effects.
Scores
Scores are estimated in bilinear modeling methods where information
carried by several variables is concentrated onto a few underlying
variables. Each sample has a score along each model component.
The scores show the locations of the samples along each model
component, and can be used to detect sample patterns, groupings,
similarities or differences.
Screening
First stage of an investigation, where information is sought about the
effects of many variables. Since many variables have to be investigated,
only main effects, and optionally interactions, can be studied at this
stage.
Significance Level
See p-value.
Significant
An observed effect (or variation) is declared significant if there is a
small probability that it is due to chance.
SIMCA Classification
Classification method based on disjoint PCA modeling.
Simplex
Specific shape of the experimental region for a classical mixture design.
A simplex has N corners but N-1 independent variables in an
N-dimensional space. This results from the fact that whatever the
proportions of the ingredients in the mixture, the total amount of mixture
has to remain the same: the Nth variable depends on the N-1 other ones.
When mixing three components, the resulting simplex is a triangle.
Simplex-Centroid Design
One of the three types of mixture designs with a simplex-shaped
experimental region. A Simplex-centroid design consists of extreme
vertices, centroid points, and the overall center.
Simplex-Lattice Design
One of the three types of mixture designs with a simplex-shaped
experimental region. A Simplex-lattice design is a mixture variant of the
full-factorial design. It is available for both screening and optimization
purposes, according to the degree of the design (See lattice degree).
Square Effect
Average variation observed in a response when a design variable goes
from its center level to an extreme level (low or high).
Standard Deviation
Sdev is a measure of a variable’s spread around its mean value,
expressed in the same unit as the original values.
Standardization Of Variables
Widely used preprocessing that consists in first centering the variables,
then scaling them to unit variance.
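A Python/NumPy sketch of standardization (centering followed by 1/Sdev scaling), column wise, on a small made-up table:

import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 150.0],
              [3.0, 200.0]])

X_standardized = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(X_standardized.std(axis=0, ddof=1))   # every column now has standard deviation 1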
Star Samples
In optimization designs of the Central Composite family, star samples
are samples with mid-values for all design variables except one, for
which the value is extreme. They provide the necessary intermediate
levels that will allow a quadratic model to be fitted to the data.
Star samples can be centers of cube faces, or they can lie outside the
cube, at a given distance (larger than 1) from the center of the cube.
Steepest Ascent
On a regular response surface, the shortest way to the optimum can be
found by using the direction of steepest ascent.
Student t-Distribution
=t-distribution. Frequency diagram showing how independent
observations, measured on a continuous scale, are distributed around
their mean when the mean and standard deviation have been estimated
from the data and when no factor causes systematic effects.
Test Samples
Additional samples which are not used during the calibration stage, but
only to validate an already calibrated model.
The data for those samples consist of X-values (for PCA) or of both X-
and Y-values (for regression). The model is used to predict new values
for those samples, and the predicted values are then compared to the
observed ones.
Training Samples
See Calibration Samples.
T-Scores
The scores found by PCA, PCR and PLS in the X-matrix.
Tukey´s Test
A multiple comparison test (see Multiple Comparison Tests for more
details).
t-Value
The t-value is computed as the ratio between deviation from the mean
accounted for by a studied effect, and standard error of the mean.
Underfit
A model that leaves aside some of the structured variation in the data is
said to underfit.
Upper Quartile
The upper quartile of an observed distribution is the variable value that
splits the observations into 75% lower values, and 25% higher values. It
can also be called 75% percentile.
U-Scores
The scores found by PLS in the Y-matrix.
Validation Samples
See Test Samples.
Validation
Validation means checking how well a model will perform for future
samples taken from the same population as the calibration samples. In
regression, validation also allows for estimation of the prediction error
in future predictions.
Variable
Any measured or controlled parameter that has varying values over a
given set of samples.
Variance
A measure of a variable’s spread around its mean value, expressed in
square units as compared to the original values.
Vertex Sample
A vertex is a point where two lines meet to form an angle. Vertex
samples are used in Simplex-centroid, axial and D-optimal mixture/non-
mixture designs.
Weighting
A technique to modify the relative influences of the variables on a
model. This is achieved by giving each variable a new weight, i.e.
multiplying its original values by a constant.