The Unscrambler Methods
By CAMO Software AS
www.camo.com
This manual was produced using ComponentOne Doc-To-Help 2005 together with Microsoft
Word. Visio and Excel were used to make some of the illustrations. The screen captures were taken
with Paint Shop Pro.
Trademark Acknowledgments
Doc-To-Help is a trademark of ComponentOne LLC.
Microsoft is a registered trademark and Windows 95, Windows 98, Windows NT, Windows
2000, Windows ME, Windows XP, Excel and Word are trademarks of Microsoft Corporation.
Paint Shop Pro is a trademark of JASC, Inc.
Visio is a trademark of Shapeware Corporation.
Restrictions
Information in this manual is subject to change without notice. No part of the documents that make up this manual may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of CAMO Software AS.
Software Version
This manual is up to date for version 9.6 of The Unscrambler.
Document last updated on June 5, 2006.
Copyright 1996-2006 CAMO Software AS. All rights reserved.
Contents
What Is New in The Unscrambler 9.6?
Validate A Model
Make Predictions
Classification
Clustering
Interpretation Of Plots
Glossary of Terms
Index
Analysis
Automatic pre-treatments can now be registered in models of reduced size (Minimum and Micro). Access your models from the Results menu for registration.
Editor
Easy filling of missing values in a data table, using either PCA or row/column mean analysis. Use menu Edit - Fill Missing for one-time filling, or configure automatic filling using File - System Setup.
Nanometer / Wavenumber unit conversion: two new options in Modify - Transform - Spectroscopic convert your spectroscopic data from nanometers to wavenumbers and vice versa.
Mean Centering and Standard Deviation scaling are now available as pre-processing. Use new menu
option Modify - Transform - Center and Scale.
User-friendliness
Sample grouping in Editor plots provides group visualization using colors and symbols in line plots and 2D scatter plots of raw data. Use menu Edit - Options.
Remember plot selection and options in saved models. You may now change plots and options in model
Viewer. Save the model after those changes. The plots selected on screen prior to saving the model will be
displayed again when re-opening the model file.
Reduce model file size with the new Micro model format. Choosing this format when running a PCA, PCR or PLS saves fewer matrices to file, thus reducing the model file size.
File compatibility
Improved Excel Import with a new interface for importing from Excel files.
New import format allows you to import files from Brimrose instruments (BFF3).
Safety
Lock data set: locked data sets cannot be edited (satisfies the FDA's 21 CFR Part 11 guidelines). Use menu option File - Lock.
Passwords expire after 70 days (satisfies the FDA's 21 CFR Part 11 guidelines).
Analysis
Multivariate Curve Resolution: resolves mixtures by determining the number of constituents, their concentration profiles and their spectra.
Area Normalization, Peak Normalization, Unit Vector Normalization: three new normalization options for pre-processing of multi-channel data.
Norris Gap derivative, Gap-Segment derivative: two new derivatives implemented in collaboration with Dr. Karl Norris, replacing the former Norris derivative option.
The former "Norris" derivative from versions 9.2 and earlier will still be supported in auto-pretreatment in
The Unscrambler, OLUP and OLUC.
User-friendliness
File-Duplicate-As 3-D data table: converts an unfolded 2D data table into a 3D format, for modeling with
3-way PLS regression.
New theoretical chapter introducing Multivariate Curve Resolution, written by Romà Tauler and Anna de Juan.
New tutorial exercises guiding you through the use of Multivariate Curve Resolution (MCR) modeling.
File compatibility
Forward compatibility from version 9.0: Read any data or model file built in version 9.x into any other
version 9.x. (This does not apply to the new MCR models).
A new option was introduced when exporting PLS1 models in ASCII format: Export in the Unscrambler
9.1 format. This ensures maintained compatibility of Unscrambler PLS1 models with Yokogawa
analyzers.
Floating licenses: Define as many user names as you need, and give access to The Unscrambler to a
limited number of simultaneous users on your network.
No delays in receiving Unscrambler upgrades! All license types are available by download.
Analysis
Prediction from Three-Way PLS regression models. Open a 3D data table, then use menu Task-Predict.
Visualisation
Two new plots are available for Analysis of Effects results: Main effects and Interaction effects.
Compatibility with databases: Oracle, MySQL, MS Access, SQL Server 7.0, ODBC.
User-Defined Import (UDI): Import any file format into The Unscrambler!
Analysis
New analysis method: Three-Way PLS regression. Open a 3D data table, then use menu Task-Regression.
Key features include: two validation methods (Cross-Validation and Test Set), Scaling and Centering options, over 50 pre-defined plots to view the model results, and over 60 importable result matrices.
The following data pretreatments are available as automatic pretreatments in Classification and Prediction: Smoothing, Normalize, Spectroscopic, MSC, Noise, Derivatives, Baselines. Combinations of these pretreatments are also supported.
3D Editor
Toggle between the 12 possible layouts of 3D tables with submenus in the Modify menu, or using Ctrl+3.
Create Primary Variable and Secondary Variable sets for use in 3-Way analysis. Use menu Modify-Edit Set on an active 3D table.
User-friendliness
Optimized PC-Navigation toolbar. Freely switch PC numbers by a simple click on the Next horizontal
PC, Previous horizontal PC, Next vertical PC, Previous vertical PC and Suggested PC buttons,
or use the corresponding arrow keys on your keyboard. The PC-Navigation tool is available on all PCA,
PCR, PLS-R and Prediction result plots.
Importation of the *.F3D file format from Hitachi is supported. Use menu File-Import 3D-F3D.
Importation of files from Analytical Spectral Devices software is supported (file extensions: *.001 and *.asd). Use menu File-Import-Indico.
Visualisation
Passified variables are displayed in a different color from non-passified variables on Bi-Plots, so that they
are easily identified.
Plot headers and axis names are shown on 2D Scatter plots, 3D Scatter plots, histogram plots, Normal probability plots and matrix plots of raw data.
Analysis
The chosen variable weights are more accurately indicated than in previous versions in the PCA and Regression dialogs.
Weighting is free for each model term, except with the Passify option which automatically passifies all
interactions and squares of passified main effects. The user can change this default by using the
"Weights..." button in the PCA and Regression dialogs.
Visualisation
Passified variables are displayed in a different color from non-passified variables on Loadings and
Correlation Loadings plots so that they are easily identified.
When computing a PCR or PLS-R model with Uncertainty Test, the significant X-variables are marked by default when opening the results Viewer.
Importation of file formats *.asc, *.scn and *.autoscan from Guided Wave is now supported (CLASS-PA and SpectrOn software).
Importing very large ASCII data files is substantially faster than in previous versions.
User-friendliness
A Guided Expression dialog makes the Compute function simpler and more intuitive to use.
Sort Variable Sets and Sort Sample Sets are now available even in the presence of overlapping sets.
Switch PC numbers by a simple click on the Next PC and Previous PC buttons in most plots of the
PCA, PCR and PLS regression results.
Possibility to save plots in five image formats (BMP, JPEG, GIF, PNG and TIFF).
An Undo Adjust button allows you to revert forcing a simplex onto your mixture design.
Visualisation
Sample grouping options let you choose how many groups to use, which sample ID should be displayed on the plot, and how many decimals/characters to display.
Possibility to perform Sample Grouping with symbols instead of colours, which makes it possible to visualise groups even when printing plots in black & white.
The Loadings plot replaces the Loading Weights plot in Regression Overview results, thus allowing easy
access to the Correlation loadings plot.
Analysis
The raw regression coefficients are available through the Plot menu. In addition, B0 or B0W values are indicated on the regression coefficients plots.
Traceability
Data and model files information indicate the software version that was used to create the file.
The Empty button in File-Properties-Log can be disabled in the administrator system setup options,
preventing the user from deleting the log of performed operations.
You may now Passify X- or Y-variables when recalculating your PCA, PCR or PLS model. The variables are
kept in the analysis but are weighted close to zero so as not to influence the model.
A bug fix allows you to keep out Y-variables by using Recalculate Without Marked.
Loadings: Correlation Loadings are now implemented and help you interpret variable correlations in Loadings plots.
The following are the basic types of problems that can be solved using The Unscrambler:
Resolve unknown mixtures by finding the number of pure components and estimating their concentration
profiles and spectra;
Find relationships between one response data matrix (Y) and a cube of predictors (three-way data X);
The purpose of experimental design is to generate experimental data that enable you to find out which design
variables (X) have an influence on the response variables (Y), in order to understand the interactions between
the design variables and thus determine the optimum conditions. Of course, it is equally important to do this
with a minimum number of experiments to reduce costs. An experimental design program should offer
appropriate design methods and encourage good experimental practice, i.e. allow you to perform few but
useful experiments which span the important variations.
Screening designs (e.g. fractional, full factorial and Plackett-Burman) are used to find out which design
variables have an effect on the responses, and are suitable for collection of data spanning all important
variations.
Optimization designs (e.g. central composite, Box-Behnken) aim to find the optimum conditions for a process
and generate non-linear (quadratic) models. They generate data tables that describe relationships in more
detail, and are usually used to refine a model, i.e. after the initial screening has been performed.
Whether your purpose is screening or optimization, there may be multi-linear constraints among some of your
design variables. In such a case you will need a D-optimal design.
Another special case is that of mixture designs, where your main design variables are the components of a
mixture. The Unscrambler provides you with the classical types of mixture designs, with or without additional
constraints.
There are several methods for analysis of experimental designs. The Unscrambler uses Analysis Of Effects
(ANOVA) and MLR as its default methods for orthogonal designs (i.e. not mixture or D-optimal), but you can
also use other methods, such as PCR or PLS.
The Unscrambler finds this information by decomposing the data matrix into a structure part and a noise part,
using a technique called Principal Component Analysis (PCA).
Study the influence of individual samples on your model on powerful, simple to interpret graphical
representations;
Test the significance of your predictor variables and remove unimportant predictors from your PLS or
PCR model.
How to collect good data for a future analysis, with special emphasis given to experimental design
methods;
How data entry and experimental design generation are taken care of in practice in The Unscrambler.
Get hold of historical data (from a database, from plant records, etc.);
Collect new data: record measurements directly from the production line, make observations in the fish farms, etc. This will ensure that the data apply to the system that you are studying today (not another system, three years ago);
Make your own experiments by disturbing the system you are studying. Thus the data will encompass
more variation than is to be seen in a stable system running as usual.
Design your experiments in a structured, mathematical way. By choosing symmetrical ranges of variation
and applying this variation in a balanced way among the variables you are studying, you will end up with
data where effects can be studied in a simple and powerful way. You will also have better possibilities of
testing the significance of the effects and the relevance of the whole model.
Experimental design is a useful complement to multivariate data analysis because it generates structured data tables, i.e. data tables that contain a substantial amount of structured variation. This underlying structure will then be used as a basis for multivariate modeling, which will guarantee stable and robust model results.
More generally, a careful sample selection increases the chances of extracting useful information from your data. When you have the possibility to actively perturb your system (experiment with the variables), these chances become even bigger. The critical part is to decide which variables to change, the intervals for this variation, and the pattern of the experimental points.
Standard designs are well-known classes of experimental designs which can be generated automatically in The
Unscrambler as soon as you have decided on the objective, the number and nature of design variables, the
nature of the responses and the number of experimental runs you can afford. Generating such a design will
provide you with the list of all experiments you must perform to gather enough information for your purposes.
Design Variables
Performing designed experiments is based on controlling the variations of the variables for which you want to study the effects. Such variables with controlled variations are called design variables. They are sometimes also referred to as factors.
In The Unscrambler, a design variable is completely defined by:
Its name;
Its levels.
Note: in some cases (D-optimal or Mixture designs), the variables with controlled variations will be referred to
using other names: mixture variables or process variables. Read more in Designs for Simple Mixture
Situations, D-Optimal Designs Without Mixture Variables and D-Optimal Designs With Mixture Variables.
Continuous Variables
All variables that have numerical values and that can be measured quantitatively are called continuous variables. This is somewhat of an abuse of terminology in the case of discrete quantitative variables, such as counts; it reflects the implicit use which is made of these variables, namely the modeling of their variations using continuous functions.
Examples of continuous variables are: temperature, concentrations of ingredients (e.g. in %), pH, length (e.g.
in mm), age (e.g. in years), number of failures in one year, etc.
Category Variables
In The Unscrambler, all non-continuous variables are called category variables. Their levels can be named,
but not measured quantitatively.
Examples of category variables are: color (Blue, Red, Green), type of catalyst (A, B, C, D), place of origin (Africa, the Caribbean)...
Binary variables are a special type of category variables. They have only two levels and symbolize an
alternative.
Examples of binary variables are: use of a catalyst (Yes/No), recipe (New/Old), type of electric power
(AC/DC), type of sweetener (Artificial/ Natural)...
Non-design Variables
In The Unscrambler, all variables appearing in the context of designed experiments which are not themselves
design variables, are called non-design variables.
This is generally synonymous with response variables, i.e. measured output variables that describe the outcome
of the experiments.
Mixture Variables
If you are performing experiments where some ingredients have to be mixed according to a recipe, you may be
in a situation where the amounts of the various ingredients cannot be varied independently from each other. In
such a case, you will need to use a special kind of design called Mixture design, and the variables with
controlled variations are then called mixture variables.
An example of a mixture situation is blending concrete from the following three ingredients: cement, sand and water. If you increase the percentage of water in the blend by 10%, you will have to reduce the proportion of one of the other ingredients (or both) so that the blend still amounts to 100%.
However, there are many situations where ingredients are blended which do not require a mixture design. For instance, in a water solution of four ingredients whose proportions do not exceed a few percent, you may vary the four ingredients independently from each other and just add water at the end as a filler. Therefore you will have to think carefully before deciding whether your own recipe requires a mixture design or not!
Read more about Mixture designs in chapter Designs for Simple Mixture Situations p.30.
Process Variables
In a mixture situation, you may also want to investigate the effects of variations in some other variables which
are not themselves a component of the mixture. Such variables are called process variables in The
Unscrambler.
Typical process variables are: temperature, stirring rate, type of solvent, amount of catalyst, etc.
The term process variables will also be used for non-mixture variables in a design dealing with variables that
are linked by Multi-Linear Constraints (D-Optimal design). Read more about D-Optimal designs in chapter
Introduction to the D-Optimal Principle p.35.
Screening
When you start a new investigation or a new product development, there is usually a large number of
potentially important variables. At this stage, the aim of the experiments is to find out which are the most
important variables. This is achieved by including many variables in the design, and roughly estimating the
effect of each design variable on the responses with the help of a screening design. The variables which have
large effects can be considered as important.
The simplest shape is a linear model. If you choose a linear model, you will investigate main effects only;
If you are also interested in the possible interactions between several design variables, you will have to
include interaction effects in your model in addition to the linear effects.
When building a mixture or D-optimal design, you will need to choose a model shape explicitly, because the
adequate type of design depends on this choice. For other types of designs, the model choice is implicit in the
design you have selected.
Optimization
At a later stage of investigation, when you already know which variables are important, you may wish to study
the effects of a few major variables in more detail. Such a purpose will be referred to as optimization. Another
term often used for this procedure, especially at the analysis stage, is response surface modeling.
Maximizing a single response, i.e. to find out which combinations of design variable values lead to the
maximum value of a specific response, and how high this maximum is.
Minimizing a single response, i.e. to find out which combinations of design variable values lead to the
minimum value of a specific response, and how low this minimum is.
Finding a stable region, i.e. to find out which combinations of design variable values lead closely enough
to the target value of a specific response, while a small deviation from those settings would cause
negligible change in the response value.
Finding a compromise between several responses, i.e. to find out which combinations of design variable
values lead to the best compromise between several responses.
Describing response variations, i.e. to model response variations inside the experimental region as
precisely as possible in order to predict what will happen if the settings of some design variables have to
be changed in the future.
Full factorial designs for any number of design variables between 2 and 6; the design variables may be
continuous or category, with 2 to 20 levels each.
Fractional factorial designs for any number of 2-level design variables (continuous or category) between
3 and 15.
Plackett-Burman designs for any number of 2-level design variables (continuous or category) between 4
and 32.
Among other properties, full factorial designs are perfectly balanced, i.e. each level of each design variable is
studied an equal number of times in combination with each level of each other design variable.
Full factorial designs include enough experiments to allow use of a model with all interactions. Thus, they are
a logical choice if you intend to study interactions in addition to main effects.
Full factorial design 2³

Experiment   A   B   C
1            -   -   -
2            +   -   -
3            -   +   -
4            +   +   -
5            -   -   +
6            +   -   +
7            -   +   +
8            +   +   +
If we now build additional columns, computed from products of the original three columns A, B, C, we get the
new table shown hereafter. These additional columns will symbolize the interactions between the design
variables.
Full factorial design 2³ with interaction columns
Experiment   A   B   C   AB   AC   BC   ABC
1            -   -   -   +    +    +    -
2            +   -   -   -    -    +    +
3            -   +   -   -    +    -    +
4            +   +   -   +    -    -    -
5            -   -   +   +    -    -    +
6            +   -   +   -    +    -    -
7            -   +   +   -    -    +    -
8            +   +   +   +    +    +    +
We can see that none of the seven columns are equal; this means that the effects symbolized by these columns
can all be studied independently of each other, using only 8 experiments.
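This construction is easy to reproduce numerically. The following sketch (plain Python, for illustration only; The Unscrambler generates these designs for you) builds the 2³ design in standard order and derives the interaction columns as element-wise products of the main-effect columns:

    # Full factorial 2^3 in standard order: -1 = low level, +1 = high level.
    # A varies fastest and C slowest, as in the tables above.
    design = [(a, b, c) for c in (-1, 1) for b in (-1, 1) for a in (-1, 1)]

    for a, b, c in design:
        # Each interaction column is the element-wise product of its factors.
        ab, ac, bc, abc = a * b, a * c, b * c, a * b * c
        print(a, b, c, ab, ac, bc, abc)

The seven columns printed this way are all distinct, which is exactly why seven effects can be estimated independently from only 8 experiments.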
If we now use the last column to study the main effect of an additional variable, D, instead of ABC:
Fractional factorial design 2⁴⁻¹
Experiment   A   B   C   D = ABC
1            -   -   -   -
2            +   -   -   +
3            -   +   -   +
4            +   +   -   -
5            -   -   +   +
6            +   -   +   -
7            -   +   +   -
8            +   +   +   +
It is obvious that the new design allows the main effects of the 4 design variables to be studied independently
of each other; but what about their interactions? Let us try to build all 2-factor interaction columns, illustrated
in the table hereafter. Since only seven different columns can be built out of 8 experiments (except for columns
with opposite signs, which are not independent), we end up with the following table:
Fractional factorial design 2⁴⁻¹ with interaction columns
Experiment   A   B   C   D   AB = CD   AC = BD   BC = AD
1            -   -   -   -   +         +         +
2            +   -   -   +   -         -         +
3            -   +   -   +   -         +         -
4            +   +   -   -   +         -         -
5            -   -   +   +   +         -         -
6            +   -   +   -   -         +         -
7            -   +   +   -   -         -         +
8            +   +   +   +   +         +         +
As you can see, each of the last three columns is common to two different interactions (for instance, AB and
CD share the same column).
Confounding
Unfortunately, as the example shows, there is a price to be paid for saving on the experimental costs! If you
invest less, you will also harvest less...
In the case of fractional factorials, this means that if you do not use the full factorial set of experiments, you
might not be able to study the interactions as well as the main effects of all design variables. This happens
because of the way those fractions are built, using some of the resources that would otherwise have been
devoted to the study of interactions, merely to study main effects of more variables instead.
This side effect of some fractional designs is called confounding. Confounding means that some effects
cannot be studied independently of each other.
For instance, in the above example, the 2-factor interactions are confounded with each other. The practical
consequences are the following:
All main effects can be studied independently of each other, and independently of the interactions;
If you are interested in the interactions themselves, using this specific design will only enable you to detect
whether some of them are important. You will not be able to decide which are the important ones. For
instance, if AB (confounded with CD, AB=CD) turns out to be significant, you will not know whether AB or CD (or a combination of both) is responsible for the observed effect.
The list of confounded effects is called the confounding pattern of the design.
Resolution III designs: Main effects are confounded with 2-factor interactions.
Resolution IV designs: Main effects are free of confounding with 2-factor interactions, but 2-factor
interactions are confounded with each other.
Resolution V designs: Main effects and 2-factor interactions are free of confounding.
Definition: In a Resolution R design, effects of order k are free of confounding with all effects of order less
than R-k.
In practice, before deciding on a particular factorial design, check its resolution and its confounding pattern to
make sure that it fits your objectives!
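The confounding check itself can be done with the same sign arithmetic as above. A minimal sketch (plain Python, not an Unscrambler feature) builds the 2⁴⁻¹ design from the generator D = ABC and confirms its confounding pattern:

    # Half-fraction 2^(4-1): take the full 2^3 design in A, B, C
    # and set D equal to the generator ABC.
    design = [(a, b, c, a * b * c) for c in (-1, 1) for b in (-1, 1) for a in (-1, 1)]

    def column(factors):
        """Element-wise product of the chosen factor columns (0=A ... 3=D)."""
        out = []
        for row in design:
            p = 1
            for f in factors:
                p *= row[f]
            out.append(p)
        return out

    A, B, C, D = 0, 1, 2, 3
    print(column([A, B]) == column([C, D]))  # True: AB is confounded with CD
    print(column([A]) == column([B, C, D]))  # True: main effects are aliased only
                                             # with 3-factor interactions, so the
                                             # design is resolution IV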
Plackett-Burman Designs
If you are interested in main effects only, and if you have many design variables to investigate (let us say more
than 10), Plackett-Burman designs may be the solution you need. They are very economical, since they require
only 1 to 4 more experiments than the number of design variables.
[Figure: Full factorial 2³ (left) and fractional factorial 2³⁻¹ (right), drawn as cubes in the X1, X2, X3 space, with corners labeled from (- - -) to (+ + +).]
Cube samples are experiments which cross lower and upper levels of the design variables; they are the
factorial part of the design;
Center samples are the replicates of the experiment which cross the mid-levels of all design variables;
they are the inside part of the design.
Star samples are used in experiments which cross the mid-levels of all design variables except one with
the extreme (star) levels of the last variable. Those samples are specific to central composite designs.
[Figure: A central composite design for two variables, showing cube, star and center samples. The levels of each variable are Low Star, Low Cube, Center, High Cube and High Star.]
As you can see, each design variable has 5 levels: Low Star, Low Cube, Center, High Cube, High Star. Low
Cube and High Cube are the lower and upper levels that you specify when defining the design variable.
The four cube samples are located at the corners of a square (or a cube if you have 3 variables, or a hypercube if you have more), hence their name;
The four star samples are located outside the square; by default, their distance to the center is the same as the distance from the cube samples to the center, i.e. here √2 ≈ 1.41 in normalized units.
As a result, all cube and star samples are located on the same circle (or sphere if you have 3 design variables). It follows that all cube and star samples will have the same leverage, i.e. the information they carry will have equal weight on the analysis. This property, called rotatability, is important if you want to achieve uniform quality of prediction in all directions from the center.
However, if for some reason those levels are impossible to achieve in the experiments, you can tune the star
distance to center factor down to a minimum of 1. Then the star points will lie at the center of the cube faces.
Another way to keep all experiments within a manageable range when the default star levels are too extreme, is
to use the optimal star sample distance, but shrink the high and low cube levels. This will result in a smaller
investigated range, but will guarantee a rotatable design.
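For reference, the rotatable star distance is the fourth root of the number of cube samples, i.e. (2^k)^(1/4) for k design variables. Here is a small sketch of the geometry (plain Python, an illustration only, not CAMO's code):

    import itertools

    def ccd_points(k, n_center=3):
        """Central composite design in normalized units (-1/+1 = cube levels),
        using the rotatable star distance (2 ** k) ** 0.25."""
        star = (2.0 ** k) ** 0.25
        cube = [list(p) for p in itertools.product((-1.0, 1.0), repeat=k)]
        stars = []
        for i in range(k):
            for s in (-star, star):
                point = [0.0] * k
                point[i] = s
                stars.append(point)
        centers = [[0.0] * k for _ in range(n_center)]
        return cube + stars + centers

    # For k = 2 the star distance is sqrt(2) = 1.41, so the 4 cube samples and
    # the 4 star samples all lie on the same circle: the design is rotatable.
    for point in ccd_points(2):
        print(point)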
Box-Behnken Designs
Box-Behnken designs are not built on a factorial basis, but they are nevertheless good optimization designs
with simple properties.
In a Box-Behnken design, all design variables have exactly three levels: Low Cube, Center, High Cube. Each
experiment crosses the extreme levels of 2 or 3 design variables with the mid-levels of the others. In addition,
the design includes a number of center samples.
The properties of Box-Behnken designs are the following:
The actual range of each design variable is Low Cube to High Cube, which makes it easy to handle;
In the figure below, the Box-Behnken design is shown drawn in two different ways. In the left drawing you see
how it is built, while the drawing to the right shows how the design is rotatable.
Box-Behnken design
[Figure: the design drawn in two ways: as it is built (left) and showing its rotatability (right).]
Process variable    Low        High
Marinating time     6 hours    18 hours
Steaming time       5 min      15 min
Frying time         5 min      15 min
Sample   Mar. Time   Steam. Time   Fry. Time
1        6           5             5
2        18          5             5
3        6           15            5
4        18          15            5
5        6           5             15
6        18          5             15
7        6           15            15
8        18          15            15
When seeing this table, the process engineer expresses strong doubts that experimental design can be of any help to him. "Why?" asks the statistician in charge. "Well," replies the engineer, "if the meat is steamed then fried for 5 minutes each it will not be cooked, and at 15 minutes each it will be overcooked and burned on the surface. In either case, we won't get any valid sensory ratings, because the products will be far beyond the ranges of acceptability."
After some discussion, the process engineer and the statistician agree that an additional condition should be
included:
In order for the meat to be suitably cooked, the sum of the two cooking times should remain between 16 and
24 minutes for all experiments.
This type of restriction is called a multi-linear constraint. In the current case, it can be written in a mathematical form requiring two equations, as follows:
Steam + Fry >= 16
and
Steam + Fry <= 24
The impact of these constraints on the shape of the experimental region is shown in the two figures hereafter:
The cooked meat experimental region with multi-linear constraints
[Figure: Two views of the Marinating x Steaming x Frying box (Marinating 6 to 18 hours, Steaming and Frying 5 to 15 minutes); the constraints cut off the corners where Steaming + Frying falls below 16 or above 24 minutes.]
The constrained experimental region is no longer a cube! As a consequence, it is impossible to build a full
factorial design in order to explore that region.
The design that best spans the new region is given in the table hereafter.
The cooked meat constrained design

Sample   Mar. Time   Steam. Time   Fry. Time
1        6           5             11
2        6           11            5
3        6           15            5
4        6           15            9
5        6           9             15
6        6           5             15
7        18          5             11
8        18          11            5
9        18          15            5
10       18          15            9
11       18          9             15
12       18          5             15
As you can see, it contains all "corners" of the experimental region, in the same way as the full factorial design
does when the experimental region has the shape of a cube.
Depending on the number and complexity of multi-linear constraints to be taken into account, the shape of the
experimental region can be more or less complex. In the worst cases, it may be almost impossible to imagine!
Therefore, building a design to screen or optimize variables linked by multi-linear constraints requires special methods. Chapter Alternative Solutions below will briefly introduce two ways to build constrained designs.
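To see concretely how constraints reshape a region, here is a toy check with the cooked meat numbers (plain Python, illustration only). It tests which corners of the original box survive the two cooking-time constraints:

    import itertools

    marinating = (6, 18)   # hours
    steaming = (5, 15)     # minutes
    frying = (5, 15)       # minutes

    feasible = [
        (m, s, f)
        for m, s, f in itertools.product(marinating, steaming, frying)
        if 16 <= s + f <= 24   # the two multi-linear constraints
    ]
    print(len(feasible), "of 8 corners are feasible")   # prints: 4 of 8

Only the corners with Steaming + Frying = 20 minutes survive; the cut-off corners must be replaced by new vertices lying on the constraint planes, which is why the constrained design above needs 12 points instead of 8.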
[Figure: Mixtures of 3 ingredients (Flour, Sugar, Egg), each ranging from 0 to 100. The possible blends do not fill the cube; they all lie on a triangular surface inside it.]
The reason, as you will have guessed, is that the mixture always has to add up to a total of 100 g. This is a
special case of multi-linear constraint, which can be written with a single equation:
Flour + Sugar + Egg = 100
This is called the mixture constraint: the sum of all mixture components is 100% of the total amount of
product.
The practical consequence, as you will also have noticed, is that the mixture region defined by three
ingredients is not a three-dimensional region! It is contained in a two-dimensional surface called a simplex.
Therefore, mixture situations require specific designs. Their principles will be introduced in the next chapter.
Alternative Solutions
There are several ways to deal with constrained experimental regions. We are going to focus on two well-known, proven methods:
Classical mixture designs take advantage of the regular simplex shape that can be obtained under
favorable conditions.
In all other cases, a design can be computed algorithmically by applying the D-optimal principle.
[Figure: The Flour-Sugar-Egg simplex. The corners are the pure components (e.g. 100% Flour, 0% Sugar, 0% Egg); the center is the blend of 33.3% Flour, 33.3% Sugar and 33.3% Egg.]
This simplex contains all possible combinations of the three ingredients flour, sugar and egg. As you can see, it
is completely symmetrical. You could substitute egg for flour, sugar for egg and flour for sugar in the figure,
and still get exactly the same shape.
Classical mixture designs take advantage of this symmetry. They include a varying number of experimental
points, depending on the purposes of the investigation. But whatever this purpose and whatever the total
number of experiments, these points are always symmetrically distributed, so that all mixture variables play
equally important roles. These designs thus ensure that the effects of all investigated mixture variables will be
studied with the same precision. This property is equivalent to the properties of factorial, central composite or
Box-Behnken designs for non-constrained situations.
The figure hereafter shows two examples of classical mixture designs.
[Figure: Two classical mixture designs on the Flour-Sugar-Egg simplex: a simple design with corner, edge-center and centroid points (left), and a denser triangular lattice design (right).]
The first design is very simple. It contains three corner samples (pure mixture components), three edge centers
(binary mixtures) and only one mixture of all three ingredients, the centroid.
The second one contains more points, spanning the mixture region regularly in a triangular lattice pattern. It
contains all possible combinations (within the mixture constraint) of five levels of each ingredient. It is similar
to a 5-level full factorial design - except that many combinations, such as "25%,25%,25%" or
"50%,75%,100%", are excluded because they are outside the simplex.
Read more about classical mixture designs in Chapter Designs for Simple Mixture Situations p.30.
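A simplex lattice can be generated by brute force: enumerate all level combinations and keep those that satisfy the mixture constraint. A minimal sketch (plain Python, not The Unscrambler's algorithm) for the 5-level lattice just described:

    import itertools

    levels = (0, 25, 50, 75, 100)   # five levels = a lattice of degree 4

    lattice = [
        blend for blend in itertools.product(levels, repeat=3)
        if sum(blend) == 100        # the mixture constraint
    ]
    print(len(lattice))             # 15 of the 125 combinations survive
    # Blends such as (25, 25, 25) or (50, 75, 100) are excluded: they do
    # not add up to 100% and therefore lie outside the simplex.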
D-optimal designs
Let us now consider the meat example again (see Chapter Constraints Between the Levels of Several Design
Variables p.25), and simplify it by focusing on Steaming time and Frying time, and taking into account only
one constraint:
Steaming time + Frying time <= 24.
The figure hereafter shows the impact of the constraint on the variations of the two design variables.
The constraint cuts off one corner of the "cube"
[Figure: The Steaming x Frying square (5 to 15 min each) with the line S + F = 24 cutting off the upper right corner.]
If we try to build a design with only 4 experiments, as in the full factorial design, we will automatically end up
with an imperfect solution that leaves a portion of the experimental region unexplored. This is illustrated in the
next figure.
[Figure: Two possible 4-point designs, I and II, in the constrained region. The candidate points are numbered 1 to 5; each design leaves part of the region unexplored.]
On the figure, design II is better than design I, because the left out area is smaller. A design using points
(1,3,4,5) would be equivalent to (I), and a design using points (1,2,4,5) would be equivalent to (II). The worst
solution would be a design with points (2,3,4,5): it would leave out the whole corner defined by points 1,2 and
5.
Thus it becomes obvious that, if we want to explore the whole experimental region, we need more than 4
points. Actually, in the above example, the five points (1,2,3,4,5) are necessary. These five crucial points are
the extreme vertices of the constrained experimental region. They have the following property: if you were to
wrap a sheet of paper around those points, the shape of the experimental region would appear, revealed by your
wrapping.
When the number of variables increases and more constraints are introduced, it is not always possible to include all extreme vertices in the design. Then you need a decision rule to select the best possible subset of points to include in your design. There are many possible rules; one of them is based on the so-called D-optimal principle, which consists in enclosing maximum volume into the selected points. In other words, you know that a wrapping of the selected points will not exactly re-constitute the experimental region you are interested in, but you want to leave out the smallest possible portion.
Read more about D-optimal designs and their various applications in Chapter Introduction to the D-Optimal
Principle p.35.
The manufacturer decides to use experimental design to find out which combination of those three ingredients
maximizes consumer acceptance of the taste of the punch. The ranges of variation selected for the experiment
are as follows:
Ranges of variation for the fruit punch design

Ingredient    Low    High    Centroid
Watermelon    30%    100%    54%
Pineapple     0%     70%     23%
Orange        0%     70%     23%
You can see at once that the resulting experimental design will have a number of features that make it very different from a factorial or central composite design.
Firstly, the ranges of variation of the three variables are not independent. Since Watermelon has a low level of
30%, the high level of Pineapple cannot be higher than 100 - 30 = 70%. The same holds for Orange.
The second striking feature concerns the levels of the three variables for the point called "centroid": these levels are not half-way between low and high; they are closer to the low level. The reason is, once again, that the blend has to add up to a total of 100%.
Since the levels of the various concentrations of ingredients to be investigated cannot vary independently from
each other, these variables cannot be handled in the same way as the design variables encountered in a factorial
or central composite design. To mark this difference, we will refer to those variables as mixture components
(or mixture variables).
Whenever the low and high levels of the mixture components are such that the mixture region is a simplex (as
shown in Chapter A Special Case: Mixture Situations p.27), classical mixture designs can be built. Read
more about the necessary conditions in Chapter Is the Mixture Region a Simplex? p.49.
These designs have a fixed shape, depending only on the number of mixture components and on the objective
of your investigation. For instance, we can build a design for the optimization of the concentrations of
Watermelon, Pineapple and Orange juice in Cornell's fruit punch, as shown in the figure below.
Design for the optimization of the fruit punch composition
[Figure: The fruit punch design on the simplex spanned by Watermelon, Pineapple and Orange. The corners of the experimental region are (100% W, 0% P, 0% O), (30% W, 70% P, 0% O) and (30% W, 0% P, 70% O).]
The next chapters will introduce the three types of mixture designs that are most suitable for three different
objectives:
1. Axial designs, best suited for screening;
2. Simplex-centroid designs;
3. Simplex-lattice designs.
In a mixture situation, this is no longer possible. Look at the Fruit Punch image above: while 30% Watermelon
can be combined with (70% P, 0% O) and (0% P, 70% O), 100% Watermelon can only be combined with (0%
P, 0% O)!
To find a way out of this dead end, we have to transpose the concept of "otherwise comparable conditions" to
the constrained mixture situation. To follow what happens when Watermelon varies from 30% to 100%, let us
compensate for this variation in such a way that the mixture still adds up to 100%, without disturbing the
balance of the other mixture components. This is achieved by moving along an axis where the proportions of
the other mixture components remain constant, as shown in the figure below.
Studying variations in the proportion of Watermelon
[Figure: The Watermelon axis of the simplex. W varies from 30 to 100% while P and O compensate in fixed, equal proportions; the labeled points are (77% W, 23% [1/2 P + 1/2 O]) and (100% W, 0% [1/2 P + 1/2 O]).]
The most "representative" axis to move along is the one where the other mixture components have equal
proportions. For instance, in the above figure, Pineapple and Orange each use up one half of the remaining
volume once Watermelon has been determined.
Mixture designs based upon the axes of the simplex are called axial designs. They are the best suited for
screening purposes because they manage to capture the main effect of each mixture component in a simple and
economical way.
A more general type of axial design is represented, for four variables, in the next figure. As you can see, most
of the points are located inside the simplex: they are mixtures of all four components. Only the four corners, or
vertices (containing the maximum concentration of an individual component) are located on the surface of the
experimental region.
A 4-component axial design
[Figure: the vertices, axial points, overall centroid and optional end points of the 4-component simplex.]
Each axial point is placed halfway between the overall centroid of the simplex (25%,25%,25%,25%) and a
specific vertex. Thus the path leading from the centroid ("neutral" situation) to a vertex (extreme situation with
respect to one specific component) is well described with the help of the axial point.
In addition, end points can be included; they are located on the surface of the simplex, opposite to a vertex (they are marked by crosses on the figure). They contain the minimum concentration of a specific component.
When end points are included in an axial design, the whole path leading from minimum to maximum
concentration is studied.
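When all components range from 0 to 100%, these points follow directly from the simplex geometry. A short sketch (illustrative Python only):

    n = 4                          # number of mixture components
    centroid = [100.0 / n] * n     # overall centroid: (25, 25, 25, 25)

    for i in range(n):
        vertex = [0.0] * n
        vertex[i] = 100.0          # maximum concentration of component i
        # Axial point: halfway between the overall centroid and the vertex.
        axial = [(c + v) / 2.0 for c, v in zip(centroid, vertex)]
        # End point: minimum of component i; the others share the remainder
        # equally. It lies on the face of the simplex opposite the vertex.
        end = [100.0 / (n - 1)] * n
        end[i] = 0.0
        print(vertex, axial, end)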
The 3 edge centers (or centroids of the 2-dimensional sub-simplexes defining binary mixtures): (50,50,0),
(50,0,50) and (0,50,50);
A more general type of simplex-centroid design is represented, for 4 variables, in the figure below.
A 4-component simplex-centroid design
[Figure: the vertices, 3rd order centroids (face centers), optional interior points and overall centroid of the 4-component simplex.]
If all mixture components vary from 0 to 100%, the blends forming the simplex-centroid design are as follows:
1- The vertices are pure components;
2- The second order centroids (edge centers) are binary mixtures with equal proportions of the selected
two components;
3- The third order centroids (face centers) are ternary mixtures with equal proportions of the selected three
components;
...
N- The overall centroid is a mixture where all N components have equal proportions.
In addition, interior points can be included in the design. They improve the precision of the results by
"anchoring" the design with additional complete mixtures. The most regular design is obtained by adding
interior points located halfway between the overall centroid and each vertex. They have the same composition
as the axial points in an axial design.
In the same way as a full factorial design, depending on the number of levels, can be used for screening,
optimization, or other purposes, simplex-lattice designs have a wide variety of applications, depending on their
degree (number of intervals between points along the edge of the simplex). Here are a few:
Optimization: with a lattice of degree 3 or more, there are enough points to fit a precise response surface
model.
Search for a special behavior or property which only occurs in an unknown, limited sub-region of the
simplex.
Calibration: prepare a set of blends on which several types of properties will be measured, in order to fit a
regression model to these properties. For instance, you may wish to relate the texture of a product, as
assessed by a sensory panel, to the parameters measured by a texture analyzer. If you know that texture is
likely to vary as a function of the composition of the blend, a simplex-lattice design is probably the best
way to generate a representative, balanced calibration data set.
which is linked to the elongation or degree of "non-sphericity" of the region actually explored by the design.
The smaller the condition number, the more spherical the region, and the closer you are to an orthogonal
design.
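As a concrete reference point, the model matrix of a perfectly orthogonal design has condition number 1. A quick illustration (plain Python with numpy, my own example, not Unscrambler output):

    import numpy as np

    # Model matrix of a 2^2 full factorial: intercept, A, B and AB columns.
    X = np.array([[1, -1, -1,  1],
                  [1,  1, -1, -1],
                  [1, -1,  1, -1],
                  [1,  1,  1,  1]], dtype=float)

    print(np.linalg.cond(X))   # 1.0: all columns orthogonal, of equal length
    # Distorting the region (e.g. stretching the range of one variable)
    # makes the columns unequal and pushes the condition number above 1.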
3. Selecting a subset with the desired number of points more or less randomly, and computing the condition
number of the resulting experimental matrix.
4. Exchanging one of the selected points with a left over point and comparing the new condition number to
the previous one. If it is lower, the new point replaces the old one; else another left over point is tried.
This process can be re-iterated a large number of times.
When the exchange of points does not give any further improvements, the algorithm stops and the subset of
candidate points giving the lowest condition number is selected.
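In outline, the exchange process can be written in a few lines. The sketch below is a simplified illustration of the idea (plain Python with numpy, not CAMO's implementation); candidates is assumed to be the matrix of candidate points already expanded into the terms of the model:

    import numpy as np

    def d_optimal_subset(candidates, n_points, n_iter=2000, seed=0):
        """Point exchange: keep the subset of candidate rows whose
        design matrix has the lowest condition number."""
        rng = np.random.default_rng(seed)
        X = np.asarray(candidates, dtype=float)
        chosen = list(rng.choice(len(X), size=n_points, replace=False))
        best = np.linalg.cond(X[chosen])
        for _ in range(n_iter):
            i = int(rng.integers(n_points))   # position to swap out
            j = int(rng.integers(len(X)))     # candidate to swap in
            if j in chosen:
                continue
            trial = list(chosen)
            trial[i] = j
            cond = np.linalg.cond(X[trial])
            if cond < best:                   # keep only improvements
                chosen, best = trial, cond
        return chosen, best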
The set of candidate points for a D-optimal optimization design will thus include:
The D-optimal solution is acceptable if you are in a screening situation (with a large number of variables to
study) and the mixture components have a lower limit. If the latter condition is not fulfilled, the design will
include only pure components, which is probably not what you had in mind!
The alternative is to use the whole set of candidate points. In such a design, each mixture is combined with all
levels of the process variables. The figure below illustrates two such situations.
Two full factorial combinations of process variables with complete mixture designs
[Figure: Screening: an axial design on the Flour-Sugar-Egg simplex combined with a 2-level factorial (left). Optimization: a simplex-centroid design combined with a 3-level factorial (right).]
This solution is recommended (if the number of factorial combinations is reasonable) whenever it is important
to explore the mixture region precisely.
Cube Samples
Cube samples can be found in factorial designs and their extensions.
They are a combination of high and low levels of the design variables, in experimental plans based on two
levels of each variable.
This also applies to Central Composite designs (they contain the full factorial cube).
More generally, all combinations of levels of the design variables in N-level full factorials, as well as in
Simplex lattice designs, are also called cube samples.
In Box-Behnken designs, all samples that are a combination of high or low levels of some design variables,
and center level of others, are also referred to as cube samples.
Center Samples
Center samples are samples for which each design variable is set at its mid-level. They are located at the exact center of the experimental region.
Star Samples
Star samples are samples with mid-values for all design variables except one, for which the value is extreme.
They provide the necessary intermediate levels that will allow a quadratic model to be fitted to the data.
[Figure: A central composite design for two variables, showing cube, star and center samples at the levels Low Star, Low Cube, Center, High Cube and High Star of each variable.]
Star samples can be centers of cube faces, or they can lie outside the cube, at a given distance (larger than 1)
from the center of the cube.
By default, their distance to the center is the same as the distance from the cube samples to the center, i.e. here √2 ≈ 1.41 in normalized units.
Distance To Center
The properties of the Central Composite design will vary according to the distance between the star samples
and the center samples. This distance is measured in normalized units, i.e. assuming that the low cube level of
each variable is -1 and the high cube level +1.
Three cases can be considered:
1. The default star distance to center ensures that all design samples are located on the surface of a sphere. In other words, the star samples are as far away from the center as the cube samples are. As a consequence, all design samples have exactly the same leverage. The design is said to be rotatable;
2. The star distance to center can be tuned down to 1. In that case, the star samples will be located at the centers of the faces of the cube. This ensures that a Central Composite design can be built even if levels lower than Low Cube or higher than High Cube are impossible. However, the design is no longer rotatable;
3. Any intermediate value for the star distance to center is also possible. The design will not be rotatable.
Axial design: vertex samples, axial points, optional end points, overall centroid;
Simplex-centroid design: vertex samples, centroids of various orders, optional interior points, overall centroid;
Axial Point
In an axial design, an axial point is positioned on the axis of one of the mixture variables, and must be above
the overall centroid, opposite the end point.
Centroid Point
A centroid point is calculated as the mean of the extreme vertices on a given surface. Edge centers, face
centers and overall centroid are all examples of centroid points.
The number of mixture components involved in the centroid is called the centroid order. For instance, in a 4-component mixture, the overall centroid is the fourth order centroid.
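Since a centroid is simply the mean of the vertices spanning its face, it is a one-line computation. A minimal sketch (plain Python, illustration only):

    import itertools

    def centroids(vertices, order):
        """All centroids of a given order: the mean of each subset of
        `order` extreme vertices (order 2 = edge centers, 3 = face centers)."""
        return [
            tuple(sum(col) / order for col in zip(*subset))
            for subset in itertools.combinations(vertices, order)
        ]

    # Unconstrained 3-component simplex: the vertices are the pure blends.
    verts = [(100, 0, 0), (0, 100, 0), (0, 0, 100)]
    print(centroids(verts, 2))   # edge centers: (50, 50, 0), (50, 0, 50), (0, 50, 50)
    print(centroids(verts, 3))   # overall centroid: (33.3, 33.3, 33.3)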
Edge Center
The edge centers are positioned in the center of the edges of the simplex. They are also referred to as second
order centroids.
End Point
In an axial or a simplex-centroid design, an end point is positioned at the bottom of the axis of one of the
mixture variables, and is thus on the opposite side to the axial point.
Face Center
The face centers are positioned in the center of the faces of the simplex. They are also referred to as third
order centroids.
Interior Point
An interior point is not located on the surface, but inside the experimental region. For example, an axial point
is a particular kind of interior point.
Overall Centroid
The overall centroid is calculated as the mean of all extreme vertices. It is the mixture equivalent of a center
sample.
Vertex Sample
A vertex is a point where two lines meet to form an angle. Vertex samples are the corners of D-optimal or
mixture designs.
vertex samples, also called extreme vertices (see the description of a Vertex Sample above);
centroid points (see Centroid Point, Edge Center and Face Center);
Reference Samples
Reference samples are experiments which do not belong to a standard design, but which you choose to include
for various purposes.
Here are a few classical cases where reference samples are often used:
If you are trying to improve an existing product or process, you might use the current recipe or process
settings as reference.
If you are trying to copy an existing product, for which you do not know the recipe, you might still include
it as reference and measure your responses on that sample as well as on the others, in order to know how
close you have come to that product.
To check curvature in the case where some of the design variables are category variables, you can include
one reference sample with center levels of all continuous variables for each level (or combination of
levels) of the category variable(s).
Note: For reference samples, only response values can be taken automatically into account in the Analysis of
Effects and Response Surface analyses. You may, however, enter the values of the design variables manually
after converting to non-designed data table, then run a PLS analysis.
Replicates
Replicates are experiments performed several times. They should not be confused with repeated
measurements, where the samples are only prepared once but the measurements are performed several times on
each.
Replication enables you to compare response variation due to controlled causes (i.e. due to variation in the design
variables) with uncontrolled response variation. If the explainable variation in a response is no larger
than its random variation, the variations of this response cannot be related to the investigated design
variables.
Randomization
Randomization means that the experiments are performed in random order, as opposed to the standard order
which is sorted according to the levels of the design variables.
Incomplete Randomization
There may be circumstances which prevent you from using full randomization. For instance, one of the design
variables may be a parameter that is particularly difficult to tune, so that the experiments will be performed
much more efficiently if you only need to tune that parameter a few times. Another case for incomplete
randomization is blocking (see Chapter Blocking hereafter).
The Unscrambler enables you to leave some variables out of the randomization. As a result, the experimental
runs will be sorted according to the non-randomized variable(s). This will generate groups of samples with a
constant value for those variables. Inside each such group, the samples will be randomized according to the
remaining variables.
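In practice, incomplete randomization amounts to shuffling within groups. A sketch of the idea (plain Python, hypothetical variable names, not Unscrambler output):

    import random

    def incomplete_randomization(runs, hard_to_tune, seed=1):
        """Keep the hard-to-tune variable sorted; fully randomize the
        run order inside each group sharing a level of that variable."""
        random.seed(seed)
        groups = {}
        for run in runs:
            groups.setdefault(run[hard_to_tune], []).append(run)
        ordered = []
        for level in sorted(groups):
            block = groups[level]
            random.shuffle(block)   # full randomization within the group
            ordered.extend(block)
        return ordered

    # Example: 'temperature' is difficult to adjust, so it stays sorted.
    runs = [{'temperature': t, 'pH': p} for t in (20, 60) for p in (4, 7, 10)]
    for run in incomplete_randomization(runs, 'temperature'):
        print(run)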
Blocking
In cases where you suspect experimental conditions to vary from time to time or from place to place, and when
only some of the experiments can be performed under constant conditions, you may consider using blocking of your set of experiments instead of free randomization. This means that you incorporate an extra design
variable for the blocks. Experimental runs must then be randomized within each block.
Typical examples of blocking factors are:
Day (if several experimental runs can be performed the same day);
Operator or machine or instrument (when several of them must be used in parallel to save time);
Batches (or shipments) of raw material (in case one batch is insufficient for all runs).
Blocking is not handled automatically in The Unscrambler, but it can be done manually using one or several
additional design variables. Those variables should be left out of the randomization.
Extending a Design
Once you have performed a series of designed experiments, analyzed their results, and drawn a conclusion
from them, two situations can occur:
1. The experiments have provided you with all the information you needed, which means that your
project is completed.
2. The experiments have given you valuable information which you can use to build a new series of
experiments that will lead you closer to your objective.
In the latter case, the new series of experiments can sometimes be designed as a complement to, or an
extension of, the previous design. This lets you minimize the number of new experimental runs, and the whole
set of results from the two series of runs can be analyzed together.
[Table: Available extensions by design type. Two tables list, for Fractional Factorial, Full Factorial and Central Composite designs, and for D-optimal (non-mixture), Mixture and Mixture-with-Process designs, whether each type of extension applies (Yes/No): add levels, add a design variable, delete a design variable, add more replicates, add more center samples, add more reference samples, extend to higher resolution, extend to full factorial, extend to central composite, extend to centroid. Entries marked (*) or (**) apply only with restrictions.]
In addition, all designs which are not listed in the above tables can be extended by adding more center and
reference samples or replicates.
Add levels: Used whenever you are interested in investigating more levels of already included design
variables, especially for category variables.
Add a design variable: Used whenever a parameter that has been kept constant is suspected to have a
potential influence on the responses, as well as when you wish to duplicate an existing design in order to
apply it to new conditions that differ by the values of one specific variable (continuous or category), and
analyze the results together. For instance, you have just investigated a chemical reaction using a specific
catalyst, and now wish to study another similar catalyst for the same reaction and compare its
performances to the other ones. The simplest way to do this is to extend the first design by adding a new variable: type of catalyst.
Delete a design variable: If the analysis of effects has established one or a few of the variables in the
original session to be clearly non-significant, you can increase the power of your conclusions by deleting
this variable and reanalyzing the design. Deleting a design variable can also be a first step before extending
a screening design into an optimization design. You should use this option with caution if the effect of the
removed variable is close to significance. Also make sure that the variable you intend to remove does not
participate in any significant interactions.
Add more replicates: If the first series of experiments shows that the experimental error is unexpectedly
high, replicating all experiments once more might make your results clearer.
Add more center samples: If you wish to get a better estimation of the experimental error, adding a few
center samples is a good and inexpensive solution.
Add more reference samples: Whenever new references are of interest, or if you wish to include more
replicates of the existing reference samples in order to get a better estimation of the experimental error.
Extend to higher resolution: Use this option for fractional factorial designs where some of the effects you
are interested in are confounded with each other. You can use this option whenever some of the
confounded interactions are significant and you wish to find out exactly which ones. This is only possible
if there is a higher resolution fractional factorial design. Otherwise, you can extend to full factorial instead.
Extend to full factorial: This applies to fractional factorial designs where some of the effects you are
interested in are confounded with each other and no higher resolution fractional factorial designs are
possible.
Extend to central composite: This option completes a full factorial design by adding star samples and
(optionally) a few more center samples. Fractional factorial designs can also be completed this way, by
adding the necessary cube samples as well. This should be used only when the number of design variables
is small; an intermediate step may be to delete a few variables first.
Caution! Whichever kind of extension you use, remember that all the experimental conditions not represented
in the design variables must be the same for the new experimental runs as for the previous runs.
Example:
1. You want to investigate six design variables A, B, C, D, E and F. You start with a 2^(6-2) fractional
factorial design (resolution IV) with two center samples: 18 experiments.
2. After analyzing the results, it turns out (for example) that only variables A, B, C and E have significant main
effects and/or interactions. But those interactions are confounded, so you need to extend the design in order
to know which are really significant.
3. You extend the first design by deleting variables D and F and extending the remaining part (which is now a
2^(4-1), resolution IV design) to a full factorial design with one more center sample. Additional cost: 9
experiments.
4. After analyzing the new design, the significant interactions which are not confounded only involve (for
example) A, B and C. The effect of E is clear and goes in the same direction for all responses. But since your
center samples show some curvature, you need to go to the optimization stage for the remaining variables.
5. Thus, you keep variable E constant at its most interesting level, and after deleting that variable from the
design you extend the remaining 2^3 full factorial to a CCD with 6 center samples. Additional cost: 9
experiments.
6. Analysis of the final results provides you (if all goes well) with a nice optimum. Final cost: 18+9+9 = 36
experiments, which is less than half of the initial estimate.
For a first screening, the most important rule is: Do not leave out a variable that may have an influence
on the responses unless you know that you cannot control it in practice. It would be more costly to have
to include one more variable at a later stage than to include one more in the first screening design.
For a more extensive screening, variables that are known not to interact with other variables can be left
out. If those variables have a negligible linear effect, you can choose whatever constant value you wish for
them (e.g. the least expensive). If those variables have a significant linear effect, they should be fixed at
the level most likely to give the desired effect on the response.
The previous rule also applies to optimization designs, if you also know that the variables in question
have no quadratic effect. If you suspect that a variable can have a non-linear effect, you should include it
in the optimization stage.
However, in other cases when there are several residual degrees of freedom in the cube and/or star samples,
full cross validation can be used without trouble. This applies whenever the number of cube and/or star
samples is much larger than the number of effects in the model.
Whenever the mixture components are further constrained, like in the example shown below, the mixture
region is usually not a simplex.
With a multi-linear constraint, the mixture region is not a simplex
[Figure: a ternary Watermelon / Orange / Pineapple diagram where the multi-linear constraint W = 2*P cuts the simplex, leaving a non-simplex experimental region.]
In the absence of Multi-Linear Constraints, the shape of the mixture region depends on the relationship
between the lower and upper bounds of the mixture components.
It is a simplex if, for each mixture component, the upper bound is at least equal to the mixture sum minus the
sum of the lower bounds of all the other components, i.e. if no upper bound actually cuts into the region.
The figure below illustrates one case where the mixture region is a simplex and one case where it is not.
Changing the upper bound of Watermelon affects the shape of the mixture region
[Figure: two ternary Watermelon / Orange / Pineapple diagrams with lower bounds of 17% on Orange and Pineapple; with an upper bound of 66% on Watermelon the mixture region is a simplex, with a smaller upper bound it is not.]
In the leftmost case, the upper bound of Watermelon is 66% = 100 - (17 + 17): the mixture region is a simplex.
If the upper bound of Watermelon is shifted to 55%, it becomes smaller than 100% - (17 + 17) and the mixture
region is no longer a simplex.
Note: When the mixture components only have Lower bounds, the mixture region is always a simplex.
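This rule is easy to check numerically. The following minimal Python sketch (our own illustration, not part of The Unscrambler) reproduces the Watermelon / Orange / Pineapple example above:

    def mixture_region_is_simplex(lower, upper, mixsum=100.0):
        # The region is a simplex if no upper bound is active, i.e. if every
        # upper bound is at least mixsum minus the sum of the lower bounds
        # of all the other components.
        total_lower = sum(lower)
        return all(u >= mixsum - (total_lower - l)
                   for l, u in zip(lower, upper))

    # Watermelon, Orange, Pineapple with lower bounds 0%, 17%, 17%:
    print(mixture_region_is_simplex([0, 17, 17], [66, 100, 100]))  # True: simplex
    print(mixture_region_is_simplex([0, 17, 17], [55, 100, 100]))  # False: not a simplex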
Constraints on an experimental region are usually introduced for one of three reasons:
1. Some of the levels or their combinations are physically impossible. For instance: a mixture with a
total of 110%, or a negative concentration.
2. Although the combinations are feasible, you know that they are not relevant, or that they will result in
difficult situations. Examples: some of the product properties cannot be measured, or there may be
discontinuities in the product properties.
3. Some of the combinations that are physically possible and would not lead to any complications are
not desired, for instance because of the cost of the ingredients.
When you start defining a new design, think twice about any constraint that you intend to introduce. An
unnecessary constraint will not help you solve your problem faster; on the contrary, it will make the design
more complex, and may lead to more experiments or poorer results.
Physical constraints
The first two cases mentioned above can be called "real constraints". You cannot disregard them; if you do,
you will end up with missing values in some of your experiments, or uninterpretable results.
Constraints of cost
The third case, however, can be referred to as "imaginary constraints". Whenever you are tempted to introduce
such a constraint, examine the impact it will have on the shape of your design. If it turns a perfectly regular and
symmetrical situation, which can be solved with a classical design (factorial or classical mixture), into a
complex problem requiring a D-optimal algorithm, you will be better off just dropping the constraint.
Build a standard design, and take the constraint into account afterwards, at the result interpretation stage. For
instance, you can add the constraint to your response surface plot, and select the optimum solution within the
constrained region.
This also applies to Upper bounds in mixture components. As mentioned in Chapter Is the Mixture Region a
Simplex? p.49, if all mixture components have only Lower bounds, the mixture region will automatically be a
simplex. Remember that, and avoid imposing an Upper bound on a constituent playing a similar role to the
others just because it is more expensive and you would like to limit its usage to a minimum. It is soon
enough to do so at the interpretation stage, by selecting the mixture that gives you the desired properties with
the smallest amount of that constituent.
In a D-optimal design, the number of experiments must at the very least equal the number of effects in the
model. Note that if you stick to that rule without allowing for any extra margin, you will end up with a so-called
saturated design, that is to say one without any residual degrees of freedom. This is not a desirable situation,
especially in an optimization context.
Therefore, The Unscrambler uses the following default number of experiments (n), where p is the number of
effects included in the model:
- For screening designs: n = p + 4 + 3 center samples;
- For optimization designs: n = p + 6 + 3 center samples.
A D-optimal design computed with the default number of experiments will have, in addition to the replicated
center samples, enough additional degrees of freedom to provide a reliable and stable estimation of the effects
in the model.
However, depending on the geometry of the constrained experimental region, the default number of
experiments may not be the ideal one. Therefore, whenever you choose a starting number of points, The
Unscrambler automatically computes 4 designs, with n-1, n, n+1 and n+2 points. The best two are selected and
their condition number is displayed, allowing you to choose one of them, or decide to give it another try.
Read more about the choice of a model in Chapter Relevant Regression Models in the section about
analyzing results from designed experiments, further down in this document.
[Figure: three examples of 3-way data: multivariate quality control (products and quality measurements), fluorescence spectroscopy (samples, emission wavelengths and excitation wavelengths), and sensory analysis (samples, judges and attributes).]
Unscrambler users can now import and re-format their three-way data with the help of several new features
described in the following sections of this chapter. Before moving on to detailed program operation, let us first
define a few useful concepts.
Similarly, a three-way data array (in The Unscrambler we will simply refer to them as 3-D data tables) consists of
three modes. Most often, one or two of these modes correspond to Objects and the rest to Variables, which
leads to two major types of logical organization: OV2 and O2V.
3D data of type OV2
One mode corresponds to Objects, while the other two correspond to Variables.
Example: Fluorescence spectroscopy. The Objects are samples analyzed with fluorescence spectroscopy. The
Variables are the emission and excitation wavelengths. The values stored in the cells of the 3-D data table
indicate the intensity of fluorescence for a given (sample, emission, excitation) triplet.
OV2 or O2V?
Sometimes the difference between the two is subtle and can depend on the question you are trying to answer
with your data analysis. Take as an example three-way sensory data, where different products are rated by
several judges according to various attributes.
If you consider that usually several samples of the same product are prepared for evaluation by the different
judges, and that the results of the assessment of one sample are expressed as a sensory profile across the
various attributes, then you will clearly choose an O2V structure for your data. Each sample is a two-way
Object determined by a (product, judge) combination, and the Variables are the attributes used for sensory
profiling.
However, if you want to emphasize the fact that each product, as a well-defined Object, can be characterized
by the combination of a set of sensory attributes and of individual points of view expressed by the different
judges, the data structure reflecting this approach is OV2.
[Figure: unfolding a 3-D data table (OV2): the K layers of the third mode are laid out side by side, giving a 2-D table of I samples by K blocks of J variables each.]
We will call the variables defining the blocks primary variables (here: k = 1 to K), and the nested variables
secondary variables (here: j = 1 to J).
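For readers who think in terms of arrays, the same unfolding can be written as a small NumPy sketch (an illustration of the layout only, not The Unscrambler's internal code):

    import numpy as np

    I, J, K = 4, 3, 2                             # samples, secondary vars, primary vars
    X3 = np.arange(I * J * K).reshape(I, J, K)    # 3-D table with modes (i, j, k)

    # Unfold to a 2-D table of K blocks of J columns each:
    # column k*J + j holds the value for (secondary j, primary k).
    X2 = X3.transpose(0, 2, 1).reshape(I, K * J)
    assert X2[1, 1 * J + 2] == X3[1, 2, 1]        # sample i=1, secondary j=2, primary k=1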
[Figure: unfolding a 3-D data table (O2V): the K layers are stacked on top of each other, giving a 2-D table of K blocks of I samples each by J variables.]
We will call the samples defining the blocks primary samples (here: k = 1 to K), and the nested samples
secondary samples (here: i = 1 to I).
A new data table can be created with three menu options:
File - New;
File - Import;
File - Duplicate.
In addition, Drag and Drop may be used from an existing Unscrambler data table or an external source.
A short description of each menu option follows hereafter. If you need more detailed instructions, read one of
the next sections (for instance Build A Non-designed Data Table or Build An Experimental Design) for a
list of the commands answering your specific needs.
File - New
The File - New option lets you define the size of a new Editor, i.e. the number of samples and variables. It
helps you create either a plain 2-D data table, or a 3-D data table with the orientation of your choice. You can
then enter the appropriate values in the Editor manually. To name the samples and variables, double-click on
the cell where the name is to be displayed and type in the name.
File - Import
With the File - Import option, you can import a data table from another program. Once you have made all the
necessary specifications in the Import and Import from Data Set dialogs, a new Editor, which contains the
imported data, will be created in The Unscrambler.
File - Duplicate
The File - Duplicate option contains several choices that allow you to duplicate a designed data table or a
three-way data table into a new format. It also allows you to go from a 2-D to a 3-D data structure and vice-versa.
File - Convert Vector to Data Table: Create new 2-D from a Vector
File - Duplicate - As 2-D Data Table: Create new 2-D from a 3-D
File - Duplicate - As 3-D Data Table: Create new 3-D from a 2-D
Import Data
The menu options listed hereafter allow you to create a new 2-D or 3-D data table by importing from various
sources.
File - UDI: Register new DLL for User Defined Import (Supervisor only)
File - Properties: Document your data and keep log of transformations and analyses
Ready To Work?
Read the next chapters to learn how to make good use of the data in your table.
File - Print Lab Report: Print out randomized list of experiments for your Design
4. Matrix plot;
5. Normal probability plot;
6. Histogram.
In addition, to cover a few special cases, we need two more kinds of representations:
7. Table plot (which is not a plot, as we will see later);
8. Various special plots.
(See Chapter Special Cases p.69 for a detailed description of the last two plot types).
Line Plot
A line plot displays a single series of numerical values with a label for each element. The plot has two axes:
The horizontal axis shows the labels, in the same physical order as they are stored in the source file;
The vertical axis shows the scale for the plotted numerical values.
The points in this plot can be represented in several ways:
A curve linking the successive points is more relevant if you wish to study a profile, and if the labels
displayed on the horizontal axis are ordered in some way (e.g. PC1, PC2, PC3);
Bars or individual symbols are more relevant if the elements are independent of each other.
[Figure: three layouts of a line plot for a single series of monthly Turnover values: curve, bars and symbols.]
Several series of values which share the same labels can be displayed on the same line plot. The series are then
distinguished by means of colors, and an additional layout is possible:
Accumulated bars are relevant if the sum of the values for series 1, series 2, etc. has a concrete meaning
(e.g. total production).
Three layouts of a line plot for two series of values
[Figure: curve, bars and accumulated bars layouts for two series of monthly values (Detroit, Pittsburgh).]
2D Scatter Plot
A 2D scatter plot displays two series of values which are related to common elements. The values are shown
indirectly, as the coordinates of points in a 2-dimensional space: one point per element.
As opposed to the line plot, where the individual elements are identified by means of a label along one of the
axes, both axes of the 2D scatter plot are used for displaying a numerical scale (one for each series of values),
and the labels may appear beside each point.
Various elements may be added to the plot, to provide more information:
A regression line visualizing the relationship between the two series of values;
Plot statistics, including among others the slope and offset of the regression line (even if the line itself is
not displayed) and the correlation coefficient.
A 2D scatter plot with various additional elements
[Figure: the same (Detroit, Pittsburgh) monthly data plotted raw, with a regression line, and with plot statistics (elements, slope, offset, correlation, RMSED, SED, bias).]
3D Scatter Plot
A 3D scatter plot displays three series of values which are related to common elements. The values are shown
indirectly, as the coordinates of points in a 3-dimensional space: one point per element.
3D scatter plots can be enhanced by the following elements:
Vertical lines which anchor the points can facilitate the interpretation of the plot.
The plot can be rotated so as to show the relative positions of the points from a more relevant angle; this
can help detect clusters.
A 3D scatter plot with various enhancements
[Figure: a 3D scatter plot of (X,Y,Z) values shown raw and after rotation, with vertical lines anchoring the points.]
Matrix Plot
The matrix plot can be seen as the 3-dimensional equivalent of a line plot: it displays a whole table of
numerical values, with a label for each element along the 2 dimensions of the table. The plot has up to three
axes:
The first two show the labels, in the same physical order as they are stored in the source file;
The vertical axis shows the scale for the plotted numerical values.
Depending on the layout, the third axis may be replaced by a color code indicating a range of values.
The points can either be represented individually, or summarized according to one of the following layouts:
In a landscape layout, the points are linked into a continuous 3-dimensional surface;
Bars give roughly the same visual impression as the landscape plot if there are many points, else the
surface appears more rugged;
The contour plot has only two axes. A few discrete levels are selected, and points (actual or interpolated)
with exactly those values are shown as a contour line. It looks like a geographical map with altitude lines;
On a map, each point of the table is represented by a small colored square, the color depending on the
range of the individual value. The result is a completely colored rectangle, where zones sharing close
values are easy to detect. The plot looks a bit like an infra-red picture.
A matrix plot shown with two different layouts
[Figure: the same matrix of values displayed with a landscape layout and with a contour layout.]
Normal Probability Plot
A normal probability plot displays the cumulative distribution of a single series of values against a scale on
which a normal distribution forms a straight line. It is interpreted as follows:
If the points are close to a straight line, the distribution is approximately normal (gaussian);
If most points are close to a straight line but a few extreme values (low or high) are far away from the line,
these points are outliers;
If the points are not close to a straight line, but determine another type of curve, or clusters, the
distribution is not normal.
[Figure: normal probability plots of three series: an approximately normal distribution, a distribution with outliers, and a non-normal distribution.]
Histogram Plot
A histogram summarizes a series of numbers without actually showing any of the original elements. The values
are divided into ranges (or bins), and the elements within each bin are counted.
The plot displays the ranges of values along the horizontal axis, and the number of elements as a vertical bar
for each bin.
The graph can be completed by plot statistics which provide information about the distribution, including
mean, standard deviation, skewness (i.e. asymmetry) and kurtosis (i.e. flatness).
It is possible to re-define the number of bins, so as to improve or reduce the smoothness of the histogram.
A histogram with different configurations
[Figure: the same series of values plotted as a histogram with few bins and with many bins.]
How to do it:
Plot - Line
Turn on Plot Statistics if you want to know about the correlation between your two variables;
Add a Regression Line if you want to visualize the best linear approximation of the relationship between
your two variables;
How to do it:
Plot - 2D Scatter
How to do it:
Plot - 3D Scatter
How to do it:
Plot - Matrix
Plot - Matrix 3-D
The most relevant way to plot three-way data as a matrix is by selecting a sample (for OV2 data) or a variable
(for O2V) and plotting the primary and secondary variables (resp. samples) as a matrix.
How to do it:
Plot - Normal Probability
Edit - Options
How to do it:
Plot - Histogram
[Figure: two histograms with plot statistics (elements, skewness, kurtosis, mean, variance, SDev): a skewed distribution and a symmetrical distribution with 3 subgroups.]
Note: There is nothing wrong with a non-normal distribution in itself. There can, for instance, be 3 balanced
groups of values: low, medium and high. Only highly skewed distributions are dangerous for multivariate
analyses.
[Figure: two histograms of consumer preference ratings: "Most consumers dislike the product, a few find it OK" and "The consumers disagree: some like it a lot, some rather dislike it".]
Note: Configure your histograms with a relevant number of bars, to get enough details.
Special Cases
This section presents a few types of graphical data representations which do not fit in any of the 6 standard plot types
described in Chapter Various Types of Plots. These types of plots are not available for manual plotting of raw data from
the Editor.
Special Plots
This is an ad-hoc category which groups all plots that do not fit into any of the other descriptions.
Some are an adaptation of existing plot types, with an additional enhancement. For instance, Means can be
displayed as a line plot; if you wish to include standard deviations (SDev) into the same plot, the most relevant
way to do so is to
1. configure the plot layout as bars;
2. and display SDev as an error bar on top of the Mean vertical bar.
This is what has been done in the special plot Mean and Sdev.
Other special plots have been developed to answer specific needs, e.g. visualize the outcome of a Multiple
Comparisons test in a graphical way which gives immediate overview.
Two examples of special plots
[Figure: a Mean and Sdev plot and a Multiple Comparisons plot.]
Table Plot
A table plot is nothing but results arranged in a table format, displayed in a graphical interface which
optionally allows for re-sizing and sorting of the columns of the table. Although it is not a plot as such, it
allows tabulated results to be displayed in the same Viewer system as other plots.
[Figure: the Effects Overview, an example of a table plot.]
What Is Re-formatting?
Changing the layout of a data table is called re-formatting.
Here are a few examples:
1. Get a better overview of the contents of your data table by sorting variables or samples.
2. Change point of view: by transposing a data table, samples become variables and vice-versa.
3. Apply a 2-D analysis method to 3-D data: by unfolding a three-way data array, you enable the use of e.g.
PCA on your data.
What Is Pre-processing?
Introducing changes in the values of your variables, e.g. so as to make them better suited for an analysis, is
called pre-processing. One may also talk about applying a pre-treatment or a transformation.
Here are a few examples:
1. Improve the distribution of a skewed variable by taking its logarithm.
2. Remove some noise in your spectra by smoothing the curves.
3. Improve the precision in your sensory assessments by taking the average of the sensory ratings over all
panelists.
4. Allow plotting of all raw data and use of classical analysis methods by filling missing values with values
estimated from the non-missing data.
Other operations
In addition, section Make Simple Changes In The Editor shows you how to perform various editing operations,
like adding new samples or variables, or creating a Category variable. Such operations can be useful:
in order to improve the interpretation of future results (e.g. insert a category variable whose levels
describe the samples in your table qualitatively);
as a safety measure (e.g. make a copy of a variable before you take its logarithm);
as a pre-requisite before the desired re-formatting or transformation can be applied (e.g. create a new
column where you can compute the ratio of two variables).
Re-formatting and editing operations will not be described in detail here; you may look up the specific
operation you are interested in by checking section Re-formatting and Pre-processing in Practice.
Fill Missing Values
Filling missing values with estimates computed from the non-missing data allows you to:
Enable the use of transformations requiring that all values are non-missing, like for instance derivatives;
Enable the use of analysis methods requiring that all values are non-missing, like for instance MLR or
Analysis of Effects.
Two methods are available for estimating the missing values:
Principal Component Analysis performs a reconstruction of the missing values based on a PCA
model of the data with an optimal number of components. This fill missing procedure is the default
selection and the recommended method of choice for spectroscopic data.
Row Column Mean Analysis only makes use of the same column and row as each cell with missing
data. Use this method if the columns or rows in your data come from very different sources that do not
carry information about other rows or columns. This can be the case for process data.
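As an illustration of the second method, here is a minimal Python sketch that fills each missing cell from the mean of its row and the mean of its column (one simple variant of row-column mean estimation; The Unscrambler's exact estimator may differ):

    import numpy as np

    def fill_missing_row_col_mean(X):
        # Replace each NaN by the average of its row mean and its column mean,
        # both computed from the non-missing values only.
        X = np.asarray(X, dtype=float).copy()
        row_means = np.nanmean(X, axis=1)
        col_means = np.nanmean(X, axis=0)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = 0.5 * (row_means[rows] + col_means[cols])
        return X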
Smoothing
This transformation is relevant for variables which are themselves a function of some underlying variable, for
instance time, or a spectral axis with intrinsic intervals.
In The Unscrambler, you have the choice between four smoothing algorithms:
1. Moving average is a classical smoothing method, which replaces each observation with an average of the
adjacent observations (including itself). The number of observations on which to average is the user-chosen
segment size parameter.
2. Savitzky-Golay smoothing fits a polynomial to each successive curve segment, thus replacing the
original values with more regular variations. You can choose the length of the smoothing segment (or the
numbers of right and left points separately) and the order of the polynomial. It is a very useful method to
effectively remove spectral noise spikes while keeping the chemical information, as shown in the figures below.
Raw UV / Vis spectra show noise spikes
UV / Vis spectra after Savitzky-Golay smoothing with 11 smoothing points and a 2nd degree polynomial
3. Median filtering replaces each observation with the median of its neighbors. The number of observations
from which to take the median is the user-chosen segment size parameter; it should be an odd number.
4. Gaussian filtering is a weighted moving average where each point in the averaging function is assigned a
coefficient determined by a Gauss function with σ² = 2. The further away the neighbor is, the smaller the
coefficient, so that information carried by the smoothed point itself and its nearest neighbors is given
greater importance than in an un-weighted moving average.
Example:
Let us compare the coefficients in a Moving average and a Gaussian filter for a data segment of size 5.
If the data point to be smoothed is xk, the segment consists of the 5 values xk-2, xk-1, xk, xk+1 and xk+2.
With the moving average, each of the 5 points gets the same coefficient 1/5 = 0.2, and the smoothed value is
0.2*xk-2 + 0.2*xk-1 + 0.2*xk + 0.2*xk+1 + 0.2*xk+2
With the Gaussian filter (σ² = 2), the normalized coefficients are approximately 0.11, 0.24, 0.30, 0.24 and 0.11,
giving
0.11*xk-2 + 0.24*xk-1 + 0.30*xk + 0.24*xk+1 + 0.11*xk+2
As you can see, points closer to the center have a larger coefficient in the Gaussian filter than in the moving
average, while the opposite is true of points close to the borders of the segment.
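All four methods correspond to standard signal-processing operations; a minimal NumPy/SciPy sketch (segment sizes chosen for illustration only, not The Unscrambler's code):

    import numpy as np
    from scipy.signal import savgol_filter, medfilt
    from scipy.ndimage import gaussian_filter1d

    x = np.random.default_rng(0).normal(size=200).cumsum()   # a noisy curve

    seg = 5                                                  # segment size
    moving_avg = np.convolve(x, np.ones(seg) / seg, mode="same")
    sav_golay  = savgol_filter(x, window_length=11, polyorder=2)
    median     = medfilt(x, kernel_size=seg)                 # segment size must be odd
    gauss      = gaussian_filter1d(x, sigma=np.sqrt(2.0))    # Gauss function with sigma^2 = 2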
Normalization
Normalization is a family of transformations that are computed sample-wise. Its purpose is to scale samples
in order to achieve specific properties.
The following normalization methods are available in The Unscrambler:
1. Area normalization;
2. Unit vector normalization;
3. Mean normalization;
4. Maximum normalization;
5. Range normalization;
6. Peak normalization.
Area Normalization
This transformation normalizes a spectrum Xi by calculating the area under the curve for the spectrum. It
attempts to correct the spectra for indeterminate path length when there is no way of measuring it, or isolating
a band of a constant constituent.
newXi = Xi / Σj xi,j
Mean Normalization
This is the most classical case of normalization.
It consists in dividing each row of a data matrix by its average, thus neutralizing the influence of the hidden
factor.
It is equivalent to replacing the original variables by a profile centered around 1: only the relative values of the
variables are used to describe the sample, and the information carried by their absolute level is dropped. This is
indicated in the specific case where all variables are measured in the same unit, and their values are assumed to
be proportional to a factor which cannot be directly taken into account in the analysis.
For instance, this transformation is used in chromatography to express the results in the same units for all
samples, no matter which volume was used for each of them.
Caution! This transformation is not relevant if all values of the curve do not have the same sign. It was
originally designed for positive values only, but can easily be applied to all-negative values through division by
the absolute value of the average instead of the raw average. Thus the original sign is kept.
Maximum Normalization
This is an alternative to classical normalization which divides each row by its maximum absolute value
instead of the average.
Caution! The relevance of this transformation is doubtful if all values of the curve do not have the same sign.
If the sign of the values changes over the curve: either the maximum value becomes +1 or the minimum
value becomes -1.
Range Normalization
Here each row is divided by its range, i.e. max value - min value.
Peak Normalization
This transformation normalizes a spectrum Xi by the chosen k-th spectral point, which is the same for both the
training set and the "unknowns" in prediction.
newXi = Xi / xi,k
It attempts to correct the spectra for indeterminate path length. Since the chosen spectral point (usually the
maximum peak of a band of the constant constituent, or the isosbestic point) is assumed to be concentration
invariant in all samples, an increase or decrease of the point intensity can be assumed to be entirely due to an
increase or decrease in the sample path length. Therefore, by normalizing the spectrum to the intensity of the
peak, the path length variation is effectively removed.
Caution! One potential problem with this method is that it is extremely susceptible to baseline offset, slope
effects and wavelength shift in the spectrum.
The method requires that the samples have an isosbestic point, or have a constant concentration constituent and
that an isolated spectral band can be identified which is solely due to that constituent.
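All six variants divide each sample row by a single number computed from that row; a minimal sketch (the function name and argument conventions are ours, not The Unscrambler's):

    import numpy as np

    def normalize(x, method="mean", k=0):
        x = np.asarray(x, dtype=float)
        if method == "area":  return x / np.sum(x)                 # area under the curve
        if method == "unit":  return x / np.linalg.norm(x)         # unit vector length
        if method == "mean":  return x / np.mean(x)                # profile centered around 1
        if method == "max":   return x / np.max(np.abs(x))         # maximum absolute value
        if method == "range": return x / (np.max(x) - np.min(x))   # max value - min value
        if method == "peak":  return x / x[k]                      # chosen k-th spectral point
        raise ValueError(method)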
Spectroscopic Transformations
Specific transformations for spectroscopy data are simply a change of units.
The following transformations are possible:
Reflectance to absorbance,
Absorbance to reflectance,
Reflectance to Kubelka-Munk.
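These conversions follow the standard relations A = log10(1/R), R = 10^(-A) and the Kubelka-Munk function (1 - R)^2 / 2R; a minimal sketch:

    import numpy as np

    def reflectance_to_absorbance(R):
        return np.log10(1.0 / R)            # A = log10(1/R)

    def absorbance_to_reflectance(A):
        return 10.0 ** (-A)                 # R = 10^(-A)

    def reflectance_to_kubelka_munk(R):
        return (1.0 - R) ** 2 / (2.0 * R)   # K/S = (1-R)^2 / 2R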
MSC
MSC was originally designed to deal with multiplicative scattering alone. However, a number of similar effects
can be successfully treated with MSC, such as:
- path length problems,
- offset shifts,
- interference, etc.
The idea behind MSC is that the two effects, amplification (multiplicative) and offset (additive), should be
removed from the data table so that they do not dominate the information (signal) in the data table.
The correction is done by two simple transformations. Two correction coefficients, a and b, are calculated and
used in these computations, as represented graphically below:
Multiplicative Scatter Correction
[Figure: each individual spectrum (absorbance of sample i at wavelength k) is plotted against the average spectrum (absorbance of the average at wavelength k); the regression line gives the correction coefficients a and b.]
The correction coefficients are computed from a regression of each individual spectrum onto the average
spectrum. Coefficient a is the intercept (offset) of the regression line, coefficient b is the slope.
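In code, the correction amounts to one univariate regression per spectrum; a minimal NumPy sketch (ours, not The Unscrambler's implementation):

    import numpy as np

    def msc(X):
        # Rows of X are spectra; each is regressed onto the average spectrum.
        mean_spec = X.mean(axis=0)
        X_corr = np.empty_like(X, dtype=float)
        for i, spec in enumerate(X):
            b, a = np.polyfit(mean_spec, spec, deg=1)   # slope b, intercept (offset) a
            X_corr[i] = (spec - a) / b                  # remove offset, then amplification
        return X_corr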
EMSC
EMSC is an extension to conventional MSC, which is not limited to only removing multiplicative and additive
effects from spectra. This extended version allows a separation of physical light scattering effects from
chemical light absorbance effects in spectra.
In EMSC, new parameters h, d and e are introduced to account for physical and chemical phenomena that
affect the measured spectra. Parameters d and e are wavelength specific, and used to compensate regions where
such unwanted effects are present. EMSC can make estimates of these parameters, but the best result is
obtained by providing prior knowledge in the form of spectra that are assumed to be relevant for one or more of
the underlying constituents within the spectra and spectra containing undesired effects. The parameter h is
estimated on the basis of a reference spectrum representative for the data set, either provided by the user or
calculated as the average of all spectra.
Adding Noise
Contrary to the other transformations, adding noise to your data would seem to decrease the precision of the
analysis.
This is exactly the purpose of that transformation: Include some additive or multiplicative noise in the
variables, and see how this affects the model.
Use this option only when you have modeled your original data satisfactorily, to check how well your model
may perform if you use it for future predictions based on new data assumed to be more noisy than the
calibration data.
Derivatives
Like smoothing, this transformation is relevant for variables which are themselves a function of some
underlying variable, e.g. absorbance at various wavelengths. Computing a derivative is also called
differentiation.
In The Unscrambler, you have the choice among three methods for computing derivatives, as described
hereafter.
Savitzky-Golay Derivative
Enables you to compute 1st, 2nd, 3rd and 4th order derivatives. The Savitzky-Golay algorithm is based on
performing a least squares linear regression fit of a polynomial around each point in the spectrum to smooth
the data. The derivative is then the derivative of the fitted polynomial at each point. The algorithm includes a
smoothing factor that determines how many adjacent variables will be used to estimate the polynomial
approximation of the curve segment.
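In SciPy, for instance, the same computation is available through savgol_filter and its deriv argument (a sketch on synthetic data):

    import numpy as np
    from scipy.signal import savgol_filter

    wl = np.arange(600.0, 1980.0, 2.0)                     # wavelengths, 2 nm increments
    spectrum = np.exp(-((wl - 1200.0) / 60.0) ** 2) + 0.1  # a synthetic band plus an offset

    # 1st order derivative, 11-point segment, 2nd order polynomial:
    d1 = savgol_filter(spectrum, window_length=11, polyorder=2, deriv=1, delta=2.0)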
Gap-Segment Derivative
Enables you to compute 1st, 2nd, 3rd and 4th order derivatives. The parameters of the algorithm are a gap factor
and a smoothing factor that are determined by the segment size and gap size chosen by the user.
The principles of the Gap-Segment derivative can be explained shortly in the simple case of a 1st order
derivative. If the function y=f(x) underlying the observed data varies slowly compared to sampling frequency,
the derivative can often be approximated by taking the difference in y-values for x-locations separated by more
than one point. For such functions, Karl Norris suggested that derivative curves with less noise could be
obtained by taking the difference of two averages, formed by points surrounding the selected x-locations. As a
further simplification, the division of the difference in y-values, or the y-averages, by the x-separation Δx is
omitted.
Norris introduced the term segment to indicate the length of the x-interval over which y-values are averaged, to
obtain the two values that are subtracted to form the estimated derivative.
The gap is the length of the x-interval that separates the two segments that are averaged.
You may read more about Norris derivatives (implemented as Gap-Segment and Norris-Gap in The
Unscrambler) in Hopkins DW, What is a Norris derivative?, NIR News Vol. 12 No. 3 (2001), 3-5. See chapter
Method References for more references on derivatives.
Norris-Gap Derivative
It is a special case of Gap-Segment Derivative with segment size = 1.
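A minimal sketch of a 1st order Gap-Segment derivative along these lines (the exact windowing conventions are our assumption; the gap is taken as an odd number of points centered on each x-location):

    import numpy as np

    def gap_segment_first_derivative(y, segment=3, gap=5):
        # Difference of two segment averages separated by a gap; with
        # segment = 1 this reduces to the Norris-Gap derivative.
        y = np.asarray(y, dtype=float)
        half_gap = gap // 2
        d = np.full_like(y, np.nan)
        for k in range(half_gap + segment, len(y) - half_gap - segment):
            left  = y[k - half_gap - segment : k - half_gap].mean()
            right = y[k + half_gap + 1 : k + half_gap + 1 + segment].mean()
            d[k] = right - left        # division by the x-separation is omitted
        return d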
1st Derivative
The 1st derivative of a spectrum is simply a measure of the slope of the spectral curve at every point. The slope
of the curve is not affected by baseline offsets in the spectrum, and thus the 1st derivative is a very effective
method for removing baseline offsets. However, peaks in raw spectra usually become zero-crossing points in
1st derivative spectra, which can be difficult to interpret.
Example:
Public NIR transmittance spectra for an active pharmaceutical ingredient (API) recorded in the range of 600-1980 nm in 2 nm increments. API = 175.5 for spectra C1-3-345 and C1-3-55; API = 221.5 for spectra C1-3-235 and C1-3-128.
The figure below shows severe baseline offsets and possible linear tilt problems, and two levels of API spectra
are not separated.
Public NIR transmittance spectra for an active pharmaceutical ingredient (API) recorded in the range of 600-1980 nm in 2
nm increments: raw spectra
The next figure displays the 1st order derivative spectra in the region of 1100-1200 nm (Savitzky-Golay
derivative, 11 points segment and 2nd order polynomial). One can see the baseline offsets effectively
removed, and the spectra of the two levels of API separated. Note that a peak around 1206 nm crosses zero.
2nd Derivative
The 2nd derivative is a measure of the change in the slope of the curve. In addition to ignoring the offset, it is
not affected by any linear "tilt" that may exist in the data, and is therefore a very effective method for removing
both the baseline offset and slope from a spectrum. The 2nd derivative can help resolve nearby peaks and
sharpen spectral features. Peaks in raw spectra usually change sign and turn into negative peaks.
Example:
On the same data as in the previous example, a 2nd order derivative has been computed in the region of
1100-1200 nm (Savitzky-Golay derivative, 11 points segment and 2nd order polynomial). One can see the spectra
of the two levels of API separated, as well as overlapped spectral features enhanced.
2nd order derivative spectra in the region of 1100-1200 nm.
If the segment size is too small, the derivative spectra become noisy; if it is too large, the segment will no longer
represent the local behaviour of the spectrum (especially in the case of Gap-Segment), and it will smooth out
too much of the important information (especially in the case of Savitzky-Golay). Although there have been
many studies done on the appropriate size of the spectral segment to use, a good general rule is to use a
sufficient number of points to cover the full width at half height of the largest absorbing band in the spectrum.
One can also find optimum segment sizes by checking model accuracy and robustness under different segment
size settings.
Example:
The data are still the same as in the previous examples.
In the next figure, you can see what happens when the selected segment size is too small (Savitzky-Golay
derivative, 3 points segment and 2nd order polynomial): noisy features appear in the region.
Segment size is too small: 2nd order derivative spectra in the region of 1100-1200 nm.
In the figure that follows, the selected segment size is too large (Savitzky-Golay derivative, 31 points segment
and 2nd order polynomial). One can see that some relevant information has been smoothed out.
Segment size is too large: 2nd order derivative spectra in the region of 1100-1200 nm.
The main disadvantage of using derivative pre-processing is that the resulting spectra are very difficult to
interpret. For example, the PLS loadings for the calibration model represent the changes in the constituents of
interest. In some cases (especially in the case of PLS-1 models), the loadings can be visually identified as
representing a particular constituent. However, when derivative spectra are used, the loadings cannot be easily
identified. A similar situation exists in regression coefficient interpretation. In addition, the derivative makes
visual interpretation of the residual spectrum more difficult, so that, for instance, finding the spectral location
of impurities in the samples is no longer possible.
SNV
Like MSC (see Multiplicative Scatter Correction), the practical result of SNV is that it removes scatter effects
from spectral data.
An effect of SNV is that on the vertical scale, each spectrum is centered on zero and varies roughly from -2 to
+2. Apart from the different scaling, the result is similar to that of MSC. The practical difference is that SNV
standardises each spectrum using only the data from that spectrum; it does not use the mean spectrum of any
set. The choice between SNV and MSC is a matter of taste.
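In code, SNV standardizes each spectrum with its own mean and standard deviation (a sketch):

    import numpy as np

    def snv(X):
        # Rows of X are spectra; each is centered and scaled individually.
        X = np.asarray(X, dtype=float)
        return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)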
Averaging
Averaging over samples (in case of replicates) or over variables (for variable reduction, e.g. to reduce the
number of spectroscopic variables) may have, depending on the context, the following advantages:
Increase precision;
Reduce the size of the data table.
Transposition
Matrix transposition consists in exchanging rows for columns in the data table.
It is particularly useful if the data have been imported from external files where they were stored with one row
for each variable.
Shifting Variables
Shifting variables is much used on time-dependent data, such as for processes where the output measurement
is time-delayed relative to input measurements.
To make a meaningful model of such data you have to shift the variables so that each row contains
synchronized measurements for each sample.
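For instance, if an output variable is measured lag rows later than the inputs, its column can be shifted up so that every row holds synchronized values (a sketch; the vacated rows become missing values):

    import numpy as np

    def shift_up(column, lag):
        # Row t of the result holds the value measured at time t + lag (lag >= 0).
        shifted = np.full(len(column), np.nan)
        if lag > 0:
            shifted[:-lag] = column[lag:]
        else:
            shifted[:] = column
        return shifted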
User-Defined Transformations
The transformation that your specific type of data requires may not be included as a predefined choice in The
Unscrambler. If this is the case, you have the possibility to register your own transformation for use in The
Unscrambler as a User-Defined Transformation (UDT).
Such transformation components have to be developed separately (e.g. in Matlab), and installed on the
computer when needed. A wide range of modifications can be done by such components, including deleting
and inserting both variables and samples.
You may register as many UDTs as you wish.
Centering
As a rule, the first stage in multivariate modeling using projection methods is to subtract the average from each
variable. This operation, called mean-centering, ensures that all results will be interpretable in terms of
variation around the mean. For all practical purposes we recommend to center the data.
An alternative to mean-centering is to keep the origin (0-value for all variables) as model center. This is only
advisable in the special case of a regression model where you would know in advance that the linear
relationship between X and Y is supposed to go through the origin.
Note 1: Centering is included as a default option in the relevant analysis dialogs, and the computations are
done as a first stage of the analysis.
Note 2: Mean centering is also available as a transformation to be performed manually from the Editor. This
allows you for instance to plot the centered data.
Weighting
PCA, PLS and PCR are projection methods based on finding directions of maximum variation. Thus, they all
depend on the relative variance of the variables.
Depending on the kind of information you want to extract from your data, you may need to use weights based
on the standard deviation of the variables, i.e. the square root of the variance, which expresses the variation in
the same unit as the original variable. This operation is also called scaling.
Note 1: Weighting is included as a default option in the relevant analysis dialogs, and the computations are
done as a first stage of the analysis.
Note 2: Standard deviation scaling is also available as a transformation to be performed manually from the
Editor. This may help you study the data in various plots from the Editor, or prior to computing descriptive
statistics. It may for example allow you to compare the distributions of variables of different scales into one
plot.
The following weighting options are available: 1/SDev, Constant, A/SDev+B and Passify.
A constant weight of 1 represents no weighting at all, i.e. all computations are based on the raw variables.
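Mean centering followed by 1/SDev weighting amounts to autoscaling each variable; a minimal sketch:

    import numpy as np

    def center_and_scale(X, center=True, scale=True):
        # Column-wise mean centering and/or 1/SDev weighting.
        X = np.asarray(X, dtype=float)
        if center:
            X = X - X.mean(axis=0)
        if scale:
            X = X / X.std(axis=0, ddof=1)
        return X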
1. The Edit menu lets you move your data through the clipboard and modify your data table by inserting or
deleting samples or variables.
2. The Modify menu includes two options which allow you to change variable properties.
Edit - Cut: Remove data from the table and store it on the clipboard
Edit - Insert - Category Variable: Add new category variable left to cursor position
Edit - Insert - Mixture Variables: Add new mixture variables left to cursor position
Edit - Append - Samples: Add new samples at the end of the table
Edit - Append - Variables: Add new variables at the end of the table
Edit - Append - Category Variable: Add new category variable at the end of the table
Edit - Append - Mixture Variables: Add new mixture variables at the end of the table
Edit - Fill Missing: Fill empty cells with values estimated from the structure in the non-missing data
Edit - Convert to Category Variable: Convert from continuous to category (discrete or ranges)
Edit - Split Category Variable: Convert from category to indicator (binary) variables
Edit - Correct Mixture Components: Ensure that sum of mixture components is equal to Mixsum
for each sample
Modify - Properties: Change name of selected sample or variable and lookup general properties
This has the advantage of displaying all data values in one window. No need to look at several sheets to get a
full overview!
Some existing features accessible from the Editor have been adapted to 3-D data, and specific features have
been developed (see for instance section Change the Layout or Order of Your Data below).
However, some features which do not make sense for three-way data, or which would introduce
inconsistencies in the 3-D structure, are not available when editing 3-D data tables. Look up Chapter
Re-formatting and Pre-processing: Restrictions for 3D Data Tables p.88 for an overview of those limitations.
Modify - Edit Set: Define new sample or variable sets or change their definition
Sorting Operations
Modify - Sort Samples: Sort samples according to name or values of some variables
Modify - Sort Samples by Sets: Group samples according to which set they belong
Modify - Sort Variables by Sets: Group variables according to which set they belong
Modify - Transform - Transpose: Samples become variables and variables become samples
Modify - Swap 3-D Layout: Switch 3-D data from OV2 to O2V or vice-versa
Modify - Swap Samples & Variables: 6 options for swapping samples and variables in a 3-D data
table
Modify - Toggle 3-D Layouts: Quick change of layout for a 3-D data table
File - Duplicate - As 2-D Data Table: Unfold 3-D data to a 2-D structure
File - Duplicate - As 3-D Data Table: Build a 3-D data table from an unfolded 2-D structure
Apply Transformations
Transform your samples or variables to make their properties more suitable for analysis and easier to interpret.
Apply ready-to-use transformations or make your own computations.
Bilinear models, e.g. PCA and PLS, basically assume linear data. Therefore, if you have non-linearities in your
data, you may apply transformations which result in a more symmetrical distribution of the data and a better fit
to a linear model.
Note: Transformations which may change the dimensions of your data table are disabled for 3-D data tables.
General Transformations
Modify - Compute General: Apply simple arithmetical or mathematical operations (+, *, log)
Modify - Transform - Noise: Add noise to your data so as to test model robustness
Modify - Transform - Smoothing: Reduce noise by smoothing the curve formed by a series of
variables
Modify - Transform - Normalize: Scale the samples by applying normalization to a series of variables
Modify - Transform - Derivatives: Compute derivatives of the curve formed by a series of variables
Modify - Transform - SNV: Center and scale individual spectra with Standard Normal Variate
Modify - Transform - Center and Scale: Apply mean centering and/or standard deviation scaling
Modify - Transform - Reduce (Average): Average over a number of adjacent samples or variables
User-defined Transformations
Modify - User-defined: Apply a transformation that you have registered yourself (UDT)
Restrictions for 3-D Data Tables
The following operations are not available for 3-D data tables:
Operations which change the number or order of the samples (O2V layout) or variables (OV2 layout);
Operations which have to do with mixture variables, since experimental design is not implemented for
three-way arrays;
User-defined transformations.
The following menu options may be affected by these restrictions:
Edit - Paste
Edit - Insert
Modify - Transpose
Edit - Append
Modify - User-defined
Edit - Delete
Descriptive Statistics
Descriptive statistics is a summary of the distribution of one or two variables at a time. It is not supposed to tell
much about the structure of the data, but it is useful if you want to get a quick look at each separate variable
before starting an analysis.
One-way statistics - mean, standard deviation, variance, median, minimum, maximum, lower and upper
quartile - can be used to spot any out-of-range value, or to detect abnormal spread or asymmetry. You
should check this before proceeding with any further analysis, and look into the raw data if they suggest
anything suspect. A transformation might also be useful.
Two-way statistics - correlations - show how the variations of two different variables are linked in the data
you are studying.
For non-designed data tables, this means that you can group the samples according to the levels of one or
several category variables.
For designed data, in addition to optional grouping according to the levels of the design variables,
predefined groups such as Design Samples or Center Samples are automatically taken into account.
Line plots show mean or standard deviation, or mean and standard deviation together;
Box-plots show the percentiles (min, lower quartile, median, upper quartile, max).
In addition, you may graphically study the correlation between two variables by plotting them as a 2D scatter
plot. If you turn on Plot Statistics, the value of the correlation coefficient will be displayed among other
information.
View - Sample Statistics: Display descriptive statistics for your samples in a slave Editor window
View - Variable Statistics: Display descriptive statistics for your variables in a slave Editor window
Plot - 2D Scatter: Plot two variables (or samples) against each other
Plot - Normal Probability: Plot one variable (or sample) and check against a normal distribution
Plot - Histogram: Plot one variable (or sample) as number of elements in evenly spread ranges of values
View - Trend Lines - Regression Line: Add a regression line to your 2D Scatter Plot
View - Trend Lines - Target Line: Add a target line to your 2D Scatter Plot
Details:
Task - Statistics: Run the computation of Descriptive Statistics on a selection of variables and samples
Results - Statistics: Retrieve Statistics results and display them in the Viewer
Purposes Of PCA
Large data tables usually contain a large amount of information, which is partly hidden because the data are too
complex to be easily interpreted. Principal Component Analysis (PCA) is a projection method that helps you
visualize all the information contained in a data table.
PCA helps you find out in what respect one sample is different from another, which variables contribute most
to this difference, and whether those variables contribute in the same way (i.e. are correlated) or independently
from each other. It also enables you to detect sample patterns, like any particular grouping.
Finally, it quantifies the amount of useful information - as opposed to noise or meaningless variation -
contained in the data.
It is important that you understand PCA, since it is a very useful method in itself, and forms the basis for
several classification (SIMCA) and regression (PLS/PCR) methods. The following is a brief introduction; we
refer you to the book Multivariate Analysis in Practice by Kim Esbensen et al., and other references given in
the Method References chapter for further reading.
[Figure: row i of the data table plotted as a point in the space spanned by variables X1, X2 and X3, one axis per variable.]
Let us consider the whole data table geometrically. Two samples can be described as similar if they have close
values for most variables, which means close coordinates in the multidimensional space, i.e. the two points are
located in the same area. On the other hand, two samples can be described as different if their values differ a
lot for at least some of the variables, i.e. the two points have very different coordinates, and are located far
away from each other in the multidimensional space.
Principles Of Projection
Bearing that in mind, the principle of PCA is the following: Find the directions in space along which the
distance between data points is the largest. This can be translated as finding the linear combinations of the
initial variables that contribute most to making the samples different from each other.
These directions, or combinations, are called Principal Components (PCs). They are computed iteratively, in
such a way that the first PC is the one that carries most information (or in statistical terms: most explained
variance). The second PC will then carry the maximum share of the residual information (i.e. not taken into
account by the previous PC), and so on.
PCs 1 and 2 in a multidimensional space
This process can go on until as many PCs have been computed as there are variables in the data table. At that
point, all the variation between samples has been accounted for, and the PCs form a new set of coordinate axes
which has two advantages over the original set of axes (the original variables). First, the PCs are orthogonal to
each other (we will not try to prove this here). Second, they are ranked so that each one carries more
information than any of the following ones. Thus, you can prioritize their interpretation: Start with the first
ones, since you know they carry more information!
The way it was generated ensures that this new set of coordinate axes is the most suitable basis for a graphical
representation of the data that allows easy interpretation of the data structure.
Computing each component involves two steps:
1. Calibration: Fitting the component to the available data, so that it describes them as well as possible;
2. Validation: Checking whether the component describes new data well enough.
Each of those two steps requires its own set of samples; thus, we will later refer to calibration samples (or
training samples), and to validation samples (or test samples).
A more detailed description of validation techniques and their interpretation is to be found in Chapter Validate
A Model p. 121.
Variances are error measures; they tell you how much information is taken into account by the successive
PCs.
Variances
The importance of a principal component is expressed in terms of variance. There are two ways to look at it:
Residual variance expresses how much variation in the data remains to be explained once the current PC
has been taken into account.
Explained variance, often measured as a percentage of the total variance in the data, is a measurement of
the proportion of variation in the data accounted for by the current PC.
These two points of view are complementary. The variance which is not explained is residual.
These variances can be considered either for a single variable or sample, or for the whole data. They are
computed as a mean square variation, with a correction for the remaining degrees of freedom.
Variances tell you how much of the information in the data table is being described by the model. The way
they vary according to the number of model components can be studied to decide how complex the model
should be (see section How To Use Residual And Explained Variances for more details).
Loadings
Loadings describe the data structure in terms of variable correlations.
Each variable has a loading on each PC. It reflects both how much the variable contributed to that PC, and how
well that PC takes into account the variation of that variable over the data points.
In geometrical terms, a loading is the cosine of the angle between the variable and the current PC: the smaller
the angle (i.e. the higher the link between variable and PC), the larger the loading. It also follows that loadings
can range between -1 and +1.
The basic principles of interpretation are the following:
1. For each PC, look for variables with high loadings (i.e. close to +1 or -1); this tells you the meaning of
that particular PC (useful for further interpretation of the sample scores).
2. To study variable correlations, use their loadings to imagine what their angles would look like in the
multidimensional space. For instance, if two variables have high loadings along the same PC, it means
that their angle is small, which in turn means that the two variables are highly correlated. If both loadings
have the same sign, the correlation is positive (when one variable increases, so does the other). Else, it is
negative (when one variable increases, the other decreases).
For more information on score and loading interpretation, see section How To Interpret PCA Scores And
Loadings p.102, and examples in Tutorial B.
Scores
Scores describe the data structure in terms of sample patterns, and more generally show sample differences or
similarities.
Each sample has a score on each PC. It reflects the sample location along that PC; it is the coordinate of the
sample on the PC.
1. Once the information carried by a PC has been interpreted with the help of the loadings, the score of a sample along that PC can be used to characterize that sample. It describes the major features of the sample, relative to the variables with high loadings on the same PC;
2. Samples with close scores along the same PC are similar (they have close values for the corresponding variables). Conversely, samples whose scores differ greatly are quite different from each other with respect to those variables.
For more information on score and loading interpretation, see section How To Interpret PCA Scores And
Loadings p.102, and examples in Tutorial B.
In matrix terms, the PCA model can be written:
X = T . PT + E
where T is the scores matrix, P the loadings matrix and E the error matrix.
The combination of scores and loadings is the structure part of the data, the part that makes sense. What
remains is called error or residual, and represents the fraction of variation that cannot be interpreted.
When you interpret the results of a PCA, you focus on the structure part and discard the residual part. It is OK
to do so, provided that the residuals are indeed negligible. You decide yourself how large an error you can
accept.
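To make the decomposition concrete, here is a minimal sketch in Python/NumPy (not The Unscrambler's own algorithm; the toy data are invented for illustration) of how X is split into a structure part T . PT and a residual part E:

import numpy as np

# Toy data: 10 samples, 4 variables (invented for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))

Xc = X - X.mean(axis=0)                 # mean centering
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

A = 2                                   # number of PCs kept
T = U[:, :A] * s[:A]                    # scores: sample coordinates on the PCs
P = Vt[:A].T                            # loadings: one column per PC
E = Xc - T @ P.T                        # residual: X = T P' + E

# Fraction of the total variation captured by the structure part
print("total explained:", 1 - (E**2).sum() / (Xc**2).sum())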
Sample Residuals
If you look at your data from the samples' point of view, each data point is approximated by another point
which lies on the hyperplane generated by the model components.
The difference between the original location of the point and its approximated location (or projection onto the
model) is the sample residual (see figure below).
This overall residual is a vector that can be decomposed in as many numbers as there are components. Those
numbers are the sample residuals for each particular component.
Sample residuals
[Figure: a sample point in the space of variables X1, X2, X3, its projection onto the principal component, and the residual vector between them.]
Variable Residuals
From the variables' point of view, the original variable vectors are being approximated by their projections
onto the model components. The difference between the original vector and the projected one is the variable
residual.
It can also be broken down into as many numbers as there are components.
Residual Variation
The residual variation of a sample is the sum of squares of its residuals for all model components. It is
geometrically interpretable as the squared distance between the original location of the sample and its
projection onto the model.
The residual variations of variables are computed in the same way.
Residual Variance
The residual variance of a variable is the mean square of its residuals for all model components. It differs from
the residual variation by a factor which takes into account the remaining degrees of freedom in the data, thus
making it a valid expression of the modeling error for that variable.
Total residual variance is the average residual variance over all variables. This expression summarizes the
overall modeling error, i.e. it is the variance of the error part of the data.
Explained Variance
Explained variance is the complement of residual variance, expressed as a percentage of the global variance in
the data. Thus the explained variance of a variable is the fraction of the global variance of the variable taken
into account by the model.
Total explained variance measures how much of the original variation in the data is described by the model. It
expresses the proportion of structure found in the data by the model.
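As an illustration of these definitions, the following sketch computes residual and explained variances per variable and in total; the degrees-of-freedom correction mentioned above is omitted for brevity, and Xc and E are assumed to come from a decomposition like the one sketched earlier:

import numpy as np

def variances(Xc, E):
    # Xc: centered data (n x p); E: residual matrix after A components
    total_var = (Xc**2).mean(axis=0)          # total variance per variable
    resid_var = (E**2).mean(axis=0)           # residual variance per variable
    expl_pct = 100 * (1 - resid_var / total_var)   # explained variance (%)
    total_expl = 100 * (1 - (E**2).sum() / (Xc**2).sum())
    return resid_var, expl_pct, total_expl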
1. Check variances, to determine how many components the model should include and to know how much information the selected components take into account. At that stage, it is especially important to check validation variances (see Chapter Principles of Model Validation p. 121 for details on validation methods).
2. Look for outliers, i.e. samples that do not fit into the general pattern.
These two steps may have to be run several times before you are satisfied with your model.
Variable Variances
Variables with small residual variance (or large explained variance) for a particular component are well explained by the corresponding model. Variables with large residual variance for all components, or for the first 3-4 of them, have only a small or moderate relationship with the other variables.
If some variables have much larger residual variance than the others for all components (or for the first 3-4 of them), try leaving these variables out and recalculating. This may produce a model which is easier to interpret.
In PCA, outliers can be detected using score plots, residuals and leverages.
Different types of outliers can be detected by each tool:
Score plots show sample patterns according to one or two components. It is easy to spot a sample lying far
away from the others. Such samples are likely to be outliers.
Residuals measure how well samples or variables fit the model determined by the components. Samples
with a high residual are poorly described by the model, which nevertheless fits the other samples quite
well. Such samples are strangers to the family of samples well described by the model, i.e. outliers.
Leverages measure the distance from the projected sample (i.e. its model approximation) to the center
(mean point). Samples with high leverages have a stronger influence on the model than other samples; they
may or may not be outliers, but they are influential. An influential outlier (high residual + high leverage)
is the worst case; it can however easily be detected using an influence plot.
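The two sample diagnostics behind the influence plot can be computed side by side; here is a minimal sketch (T and E as in the decomposition sketched earlier; including the 1/n term for the model center is an assumption of this particular leverage formula):

import numpy as np

def pca_influence(T, E):
    # T: scores (n x A); E: sample residuals (n x p)
    n = T.shape[0]
    leverage = 1.0 / n + ((T**2) / (T**2).sum(axis=0)).sum(axis=1)
    sample_resid_var = (E**2).mean(axis=1)   # high value: poorly fitted
    return leverage, sample_resid_var        # high leverage: influential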
First, let us consider one PC at a time. Here are the rules to interpret the link between scores and loadings:
If a variable has a very small loading, whatever the sign of that loading, you should not use it for
interpretation, because that variable is badly accounted for by the PC. Just discard it and focus on the
variables with large loadings;
If a variable has a positive loading, it means that all samples with positive scores have higher than average
values for that variable. All samples with negative scores have lower than average values for that variable;
If a variable has a negative loading, it means just the opposite. All samples with positive scores have
lower than average values for that variable. All samples with negative scores have higher than average
values for that variable;
The higher the positive score of a sample, the larger its values for variables with positive loadings and vice
versa;
The more negative the score of a sample, the smaller its values for variables with positive loadings and
vice versa;
The larger the loading of a variable, the quicker sample values will increase with their scores.
To summarize, if the score of a sample and the loading of a variable on a particular PC have the same sign, the
sample has higher than average value for that variable and vice-versa. The larger the scores and loadings, the
stronger that relation.
If you now consider two PCs simultaneously, you can build a 2-vector loading plot and a 2-vector score plot.
The same principles apply to their interpretation, with a further advantage: you can now interpret any direction
in the plot - not only the principal directions.
PCA in Practice
In practice, building and using a PCA model involves 3 steps:
1. Choose and implement an appropriate pre-processing method (see Chapter Re-formatting and Pre-processing p. 71);
2. Run the PCA algorithm, choose the number of components, diagnose the model;
3. Interpret the model, using the scores and loadings plots.
The sections that follow list menu options and dialogs for data analysis and result interpretation using PCA.
For a more detailed description of each menu option, read The Unscrambler Program Operation, available as a PDF file from Camo's web site www.camo.com/TheUnscrambler/Appendices.
Run A PCA
When your data table is displayed in the Editor, you may access the Task menu to run a suitable analysis, for instance PCA.
File - Save: Save result file for the first time, or with existing name
Results - PCA: Open PCA result file or just lookup file information, warnings and variances
Results - All: Open any result file or just lookup file information, warnings and variances
Plot - Scores and Loadings: Display scores and loadings separately or as a bi-plot
View - Uncertainty Test - Stability Plot: Display stability plot for scores or loadings
PC Navigation Tool
Navigate up or down the PCs in your model along the vertical and horizontal axes of your plots:
View - Source: Select which sample types / variable types / variance type to display
Edit - Insert Draw Item: Draw a line or add text to your plot
View - Outlier List: Display list of outlier warnings issued during the analysis for each PC, sample
and/or variable
Window - Warning List: Display general warnings issued during the analysis
View - Scaling
View - Zoom In
View - Raw Data: Display the source data for the analysis in a slave Editor
Check that the currently active subview contains the right type of plot (samples or variables) before using Edit
- Mark.
Edit - Mark - One By One: Mark samples or variables individually on current plot
Edit - Mark - With Rectangle: Mark samples or variables by enclosing them in a rectangular frame (on
current plot)
Edit - Mark - Test Samples Only: Mark test samples (only available if you used test set validation)
Edit - Mark - Evenly Distributed Samples Only: Mark a subset of samples which evenly cover your
data range
Edit - Mark - Unmark All : Remove marking for all objects of the type displayed on current plot
Edit - Mark - Reverse Marking: Exchange marked and unmarked objects on the plot
Task - Recalculate with Marked: Recalculate model with only the marked samples / variables
Task - Recalculate without Marked: Recalculate model without the marked samples / variables
Task - Recalculate with Passified Marked: Recalculate model with marked variables weighted down
using Passify
Task - Recalculate with Passified Unmarked: Recalculate model with unmarked variables weighted
down using Passify
Lookup the previous section Run New Analyses From The Viewer.
View - Raw Data: Display the source data for the analysis in a slave Editor
Task - Extract Data from Marked: Extract data for only the marked samples / variables
Task - Extract Data from Unmarked: Extract data for only the unmarked samples / variables
Useful tips
To run a PCA on your 3-way data, you need to duplicate your 3-D table as 2-D data first. Then all relevant
analyses will be enabled.
For instance, you may run a PCA on unfolded 3-way spectral data, by doing the following sequence of
operations:
1. Start from your 3-D data table (OV2 layout) where each row contains a 2-way spectrum;
2. Use File - Duplicate - As 2-D Data Table: this generates a 2-D table containing unfolded spectra;
3. Save the resulting 2-D table with File - Save As;
4. Use Task - PCA to run the desired analysis.
Another possibility is to develop your own three-way analysis routine and implement it as a User-Defined
Analysis (UDA). Such analyses may then be run from the Task - User-defined Analysis menu.
What Is Regression?
Regression is a generic term for all methods attempting to fit a model to observed data in order to quantify the
relationship between two groups of variables. The fitted model may then be used either to merely describe the
relationship between the two groups of variables, or to predict new values.
Depending on the context, the two groups of variables go under different names:

Context          Predictors are called      Responses are called
General          Predictors                 Responses
                 Independent Variables      Dependent Variables
Designed Data                               Responses
Spectroscopy     Spectra                    Constituents
Regression is relevant in two typical situations:
Every time you wish to use cheap, easy-to-perform measurements as a substitute for more expensive or time-consuming ones;
When you want to build a response surface model from the results of some experimental design, i.e. describe precisely the response levels according to the values of a few controlled factors.
Noise can be random variation in the response due to experimental error, or it can be random variation in
the data values due to measurement error. It may also be some amount of response variation due to factors
that are not included in the model.
Irrelevant information is carried by predictors that have little or nothing to do with the modeled
phenomenon. For instance, NIR absorbance spectra may carry some information relative to the solvent and
not only to the compound of which you are trying to predict the concentration.
A good regression model should be able to
Pick up only relevant information, and all of it. It should leave aside irrelevant variation and focus on the
fraction of variation in the predictors which affects the response;
Avoid overfitting, i.e. distinguish between variation in the response that can be explained by variation in
the predictors, and variation caused by mere noise.
In MLR, the regression coefficients are computed directly as
b = (XᵀX)⁻¹ Xᵀy
This operation involves a matrix inversion, which leads to collinearity problems if the variables are not linearly independent. Incidentally, this is the reason why the predictors are called independent variables in MLR; the ability to vary independently of each other is a crucial requirement for variables used as predictors with this method. MLR also requires more samples than predictors, or the matrix cannot be inverted.
The Unscrambler uses Singular Value Decomposition to find the MLR solution. No missing values are
accepted.
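A minimal sketch of the MLR solution in Python/NumPy follows; np.linalg.lstsq is SVD-based, so it returns a minimum-norm solution even where a plain inversion of XᵀX would fail on collinear data:

import numpy as np

def mlr(X, y):
    # Prepend a column of ones for the intercept b0
    X1 = np.column_stack([np.ones(len(X)), X])
    b, residuals, rank, sv = np.linalg.lstsq(X1, y, rcond=None)
    return b                               # b[0] is b0, b[1:] the slopes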
More About:
How MLR compares to other regression methods in More Details About Regression Methods p.114
[Figure: PCR as a two-step method. First, PCA decomposes the X-space (X1, X2, X3) into principal components PCj = f(Xi); then MLR regresses Y on the scores: Y = f(PCj).]
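The two-step nature of PCR can be sketched directly: PCA on X, then MLR of y on the retained scores. This is a minimal illustration, not The Unscrambler's implementation:

import numpy as np

def pcr(X, y, A):
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :A] * s[:A]                        # step 1: PCA scores
    P = Vt[:A].T                                # PCA loadings
    q, *_ = np.linalg.lstsq(T, yc, rcond=None)  # step 2: MLR on the scores
    b = P @ q                                   # back to the original X-variables
    b0 = y_mean - x_mean @ b
    return b0, b                                # predict with y = b0 + X b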
More About:
How PCR compares to other regression methods in More Details About Regression Methods p.114
PLS Regression
Partial Least Squares - or Projection to Latent Structures - (PLS) models both the X- and Y-matrices simultaneously to find the latent variables in X that will best predict the latent variables in Y. These PLS components are similar to principal components, and will also be referred to as PCs.
PLS procedure
[Figure: PLS models the X-space (X1, X2, X3) with t-scores and the Y-space (Y1, Y2, Y3) with u-scores, and links the two through the inner relation u = f(t).]
PLS1 deals with only one response variable at a time (like MLR and PCR);
PLS2 models several response variables simultaneously.
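For readers who want to see the iterative X/Y exchange in code, here is a minimal PLS1 sketch (orthogonalized NIPALS; X and y are assumed to be centered float arrays, and this is an illustration rather than The Unscrambler's implementation):

import numpy as np

def pls1(X, y, A):
    E, f = X.astype(float).copy(), y.astype(float).copy()
    n, p = E.shape
    W, T = np.zeros((p, A)), np.zeros((n, A))
    P, q = np.zeros((p, A)), np.zeros(A)
    for a in range(A):
        w = E.T @ f                      # Y-information pulled into X-space
        w /= np.linalg.norm(w)           # loading weights
        t = E @ w                        # t-scores
        P[:, a] = E.T @ t / (t @ t)      # X-loadings
        q[a] = f @ t / (t @ t)           # y-loading
        E -= np.outer(t, P[:, a])        # deflate X
        f -= q[a] * t                    # deflate y
        W[:, a], T[:, a] = w, t
    return W, T, P, q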
More About:
How PLS compares to other regression methods in More Details About Regression Methods p.114
1. Calibration: Fitting the model to the available data;
2. Validation: Checking whether the component describes new data well enough.
Calibration is the fitting stage in the regression modeling process: The main data set, containing only the
calibration sample set, is used to compute the model parameters (PCs, regression coefficients).
We validate our models to get an idea of how well a regression model would perform if it were used to predict
new, unknown samples. A test set consisting of samples with known response values is usually used. Only the
X-values are fed into the model, from which response values are predicted and compared to the known, true
response values. The model is validated if the prediction residuals are low.
Each of those two steps requires its own set of samples; thus, we will later refer to calibration samples (or
training samples), and to validation samples (or test samples).
A more detailed description of validation techniques and their interpretation is to be found in Chapter
Validate A Model p. 121.
The main results provided by each regression method are summarized below:

Result                 Application   MLR   PCR   PLS
B-coefficients         I,D           X     X     X
Residuals (*)                        X     X     X
ANOVA                                X
Predicted Y-values     I,D           X     X     X
Scores (**)                                X     X
Loadings (**)                              X     X
Loading weights        I,D                       X

(*) The various residuals and error measures are available for each PC in PCR and PLS, while for MLR there is only one of each type.
(**) There are two types of scores and loadings in PLS, only one in PCR.
In short, all three regression methods give you a model with an equation expressed by the regression
coefficients (b-coefficients), from which predicted Y-values are computed. For all methods, residuals can be
computed as the difference between predicted (fitted) values and actual (observed) values; these residuals can
then be combined into error measures that tell you how well your model performs.
PCR and PLS, in addition to those standard results, provide you with powerful interpretation and diagnostic
tools linked to projection: more elaborate error measures, as well as scores and loadings.
The simplicity of MLR, on the other hand, allows for simple significance testing of the model with ANOVA and of the b-coefficients with a Student's t-test (ANOVA will not be presented hereafter; read more about it in the ANOVA section from Chapter Analyze Results from Designed Experiments p. 149).
However, significance testing is also possible in PCR and PLS, using Martens' Uncertainty Test.
B-coefficients
The regression model can be written
y = b0 + b1x1 + b2x2 + ... + bPxP + e
meaning that the observed response values are approximated by a linear combination of the values of the
predictors. The coefficients of that combination are called regression coefficients or B-coefficients.
Several diagnostic tools are associated with the regression coefficients (available only for MLR). Among them, each coefficient's t-value (the ratio of the coefficient to its standard error) can be compared to a reference t-distribution, yielding a significance level or p-value: the probability that a t-value equal to or larger than the observed one would occur if the true value of the regression coefficient were 0.
Predicted Y-values
Predicted Y-values are computed for each sample by applying the model equation with the estimated B-coefficients to the observed X-values.
For PCR or PLS models, the Predicted Y-values can also be computed using projection along the successive
components of the model. This has the advantage of diagnosing samples which are badly represented by the
model, and therefore have high prediction uncertainty. We will come back to this in Chapter Make
Predictions p. 133.
Residuals
For each sample, the residual is the difference between observed Y-value and predicted Y-value. It appears as
e in the model equation.
More generally, residuals may also be computed for each fitting operation in a projection model: thus the
samples have X- and Y-residuals along each PC in PCR and PLS models. Read more about how sample and
variable residuals are computed in Chapter More Details About The Theory Of PCA p. 99.
Residual Y-variance is the variance of the Y-residuals and expresses how much variation remains in the
observed response if you take out the modeled part. It is an overall measure of the misfit (i.e. the error
made when you compute the fitted Y-value as a function of the X-values). It takes into account the
remaining number of degrees of freedom in the data.
Explained Y-variance is the complement to residual Y-variance, and is expressed as a percentage of the
total Y-variance.
RMSEC and RMSEP measure the calibration error and prediction error in the same units as the original
response variable.
Residual and explained Y-variance are available for both calibration and validation.
2. Across variables (all X-variables or all Y-variables), to obtain a Total variance curve describing the global fit of the model. The Total Y-variance curve shows how the prediction of Y improves when you add more PCs to the model; the Total X-variance curve expresses how much of the variation in the X-variables is taken into account to predict variation in Y.
Read more about how sample and variable residuals, as well as explained and residual variances, are computed
in Chapter More Details About The Theory Of PCA p. 99.
In addition, the Y-calibration error can be expressed in the same units as the original response variable using RMSEC, and the Y-prediction error as RMSEP.
RMSEC and RMSEP also vary as a function of the number of PCs in the model.
PLS Scores
Basically, PLS scores are interpreted the same way as PCA scores: They are the sample coordinates along the
model components. The only new feature in PLS is that two different sets of components can be considered,
depending on whether one is interested in summarizing the variation in the X- or Y-space.
T-scores are the new coordinates of the data points in the X-space, computed in such a way that they
capture the part of the structure in X which is most predictive for Y.
U-scores summarize the part of the structure in Y which is explained by X along a given model
component. (Note: they do not exist in PCR!)
The relationship between t- and u-scores is a summary of the relationship between X and Y along a specific
model component. For diagnostic purposes, this relationship can be visualized using the X-Y Relation
Outliers plot.
PLS Loadings
The PLS loadings used in The Unscrambler express how each of the X- and Y-variables is related to the model
component summarized by the t-scores. It follows that the loadings will be interpreted somewhat differently in
the X- and Y-space.
P-loadings express how much each X-variable contributes to a specific model component, and can be used
exactly the same way as PCA loadings. Directions determined by the projections of the X-variables are
used to interpret the meaning of the location of a projected data point on a t-score plot in terms of
variations in X.
Q-loadings express the direct relationship between the Y-variables and the t-scores. Thus, the directions
determined by the projections of the Y-variables (by means of the q-loadings) can be used to interpret the
meaning of the location of a projected data point on a t-score plot in terms of sample variation in Y.
The two kinds of loadings can be plotted on a single graph to facilitate the interpretation of the t-scores with
regard to directions of variation both in X and Y. It must be pointed out that, contrary to PCA loadings, PLS
loadings are not normalized, so that p- and q-loadings do not share a common scale. Thus, their directions are
easier to interpret than their lengths, and the directions should only be interpreted provided that the
corresponding X- or Y-variables are sufficiently taken into account (which can be checked using explained or
residual variances).
In case of collinearity among X-variables, the b-coefficients are not reliable and the model may be
unstable;
PCR uses MLR in the regression step; a PCR model using all PCs gives the same solution as MLR (and so
does a PLS1 model using all PCs).
If you run MLR, PCR and PLS1 on the same data, you can compare their performance by checking validation
errors (Predicted vs. Measured Y-values for validation samples, RMSEP).
It can also be noted that both MLR and PCR only model one Y-variable at a time.
The difference between PCR and PLS lies in the algorithm. PLS uses the information lying in both X and Y to
fit the model, switching between X and Y iteratively to find the relevant PCs. So PLS often needs fewer PCs to
reach the optimal solution because the focus is on the prediction of the Y-variables (not on achieving the best
projection of X as in PCA).
What is an Outlier?
Lookup Chapter How To Detect Outliers in PCA p. 101.
Outliers in Regression
In regression, there are many ways for a sample to be classified as an outlier. It may be outlying according to
the X-variables only, or to the Y-variables only, or to both. It may also not be an outlier for either separate set
of variables, but become an outlier when you consider the (X,Y) relationship. In the latter case, the X-Y
Relation Outliers plot (only available for PLS) is a very powerful tool showing the (X,Y) relationship and
how well the data points fit into it.
In practice, building and using a regression model involves the following steps:
1. Choose and implement an appropriate pre-processing method (see Chapter Re-formatting and Pre-processing p. 71);
2. Build the model: calibration fits the model to the available data, while validation checks the model for new data;
3. Choose the number of components to interpret (for PCR and PLS), according to calibration and validation variances;
4. Diagnose the model, using outlier warnings, variance curves (for PCR and PLS), X-Y relation outliers (for PLS), Predicted vs. Measured;
5. Interpret the loadings and scores plots (for PCR and PLS), the loading weights plots (for PLS), Uncertainty Test results (for PCR and PLS, see Chapter Uncertainty Testing with Cross Validation p. 123), the B-coefficients, and optionally the response surface;
6. If the model is satisfactory, use it to predict response values for new samples (see Chapter Make Predictions p. 133).
Run A Regression
When your data table is displayed in the Editor, you may access the Task menu to run a suitable analysis; here, Regression.
Note: If the data table displayed in the Editor is a 3-D table, the Task - Regression menu option described
hereafter allows you to perform three-way data modeling with nPLS. For more details concerning that
application, lookup Chapter Three-way Data Analysis in Practice.
File - Save: Save result file for the first time, or with existing name
Results - Regression: Open regression result file or just lookup file information, warnings and
variances
Results - All: Open any result file or just lookup file information, warnings and variances
Plot - X-Y Relation Outliers: Display t vs. u scores along individual PCs (PLS)
Plot - Predicted vs Measured: Display plot of predicted Y values against actual Y values
Plot - Scores and Loadings: Display scores and loadings separately or as a bi-plot (PCR, PLS)
Plot - Loading Weights: Plot loading weights along selected PCs (PLS)
Plot - Important Variables: Display 2 plots to detect most important variables (PCR, PLS)
Plot - Regression and Prediction: Display Predicted vs. Measured and Regression coefficients
View - Uncertainty Test - Stability Plot: Display stability plot for scores or loadings
View - Uncertainty Test - Uncertainty Limits: Display uncertainty limits on regression coefficients
plot
Application example
If you have used the Uncertainty Test option when computing your PCR or PLS model, you may mark all
significant X-variables on a loading plot, then recalculate the model with only the marked X-variables.
The new model will usually fit as well as the original and validate better when variables with no significant
contribution to the prediction of Y are removed.
Edit - Mark - One By One: Mark samples or variables individually on current plot
Edit - Mark - With Rectangle: Mark samples or variables by enclosing them in a rectangular frame (on
current plot)
Edit - Mark - Significant X-variables Only: Mark significant X-variables (only available if you used
uncertainty testing)
Edit - Mark - Test Samples Only: Mark test samples (only available if you used test set validation)
Edit - Mark - Evenly Distributed Samples Only: Mark a subset of samples which evenly cover your
data range
Edit - Mark - Unmark All : Remove marking for all objects of the type displayed on current plot
Edit - Mark - Reverse Marking: Exchange marked and unmarked objects on the plot
Task - Recalculate with Marked: Recalculate model with only the marked samples / variables
Task - Recalculate without Marked: Recalculate model without the marked samples / variables
Task - Recalculate with Passified Marked: Recalculate model with marked variables weighted down
using Passify
Task - Recalculate with Passified Unmarked: Recalculate model with unmarked variables weighted
down using Passify
Validate A Model
Check how well your PCA or regression model may apply to new data of the same kind as your model is based upon.
What Is Validation?
Validating a model means checking how well the model will perform on new data.
A regression model is usually made to do predictions in the future. The validation of the model estimates the
uncertainty of such future predictions. If the uncertainty is reasonably low, the model can be considered valid.
The same argument applies to a descriptive multivariate analysis such as PCA: If you want to extrapolate the
correlations observed in your data table to future, similar data, you should check whether they still apply for new data.
In The Unscrambler, three methods are available to estimate the prediction error: test set validation, cross
validation and leverage correction.
Manual selection is recommended since it gives you full control over the selection of a test set;
Random selection is the simplest way to select a test set, but leaves the selection to the computer;
Group selection makes it possible for you to specify a set of samples as test set by selecting a value or
values for one of the variables. This should only be used under special circumstances. An example of such
a situation is a case where there are two true replicates for each data point, and a separate variable indicates
which replicate a sample belongs to. In such a case, one can construct two groups according to this
variable and use one of the sets as test set.
Cross Validation
With cross validation, the same samples are used both for model estimation and testing. A few samples are left
out from the calibration data set and the model is calibrated on the remaining data points. Then the values for
the left-out samples are predicted and the prediction residuals are computed. The process is repeated with
another subset of the calibration set, and so on until every object has been left out once; then all prediction
residuals are combined to compute the validation residual variance and RMSEP.
Several versions of the cross validation approach can be used:
Full cross validation leaves out only one sample at a time; it is the original version of the method;
Test-set switch divides the global data set into two subsets, each of which will be used alternatively as
calibration set and as test set.
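The full cross validation loop is easy to express in code; this sketch is method-agnostic, with fit and predict assumed to be functions supplied by the caller:

import numpy as np

def full_cross_validation(X, y, fit, predict):
    # fit(X, y) -> model; predict(model, X) -> predicted y-values
    n = len(y)
    resid = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # leave sample i out
        model = fit(X[keep], y[keep])
        resid[i] = y[i] - predict(model, X[i:i+1])[0]
    return np.sqrt((resid**2).mean())            # RMSEP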
Leverage Correction
Leverage correction is an approximation to cross validation that enables prediction residuals to be estimated
without actually performing any prediction. It is based on an equation that is valid for MLR, but is only an
approximation for PLS and PCR.
According to this equation, the prediction residual equals
(calibration residual) divided by (1 - sample leverage).
All samples with low leverage (i.e. low influence on the model) will have estimated prediction residuals very
close to their calibration residuals (the leverage being close to zero). For samples with high leverage, the
calibration residual will be divided by a smaller number, thus giving a much larger estimated prediction
residual.
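The leverage-corrected residuals follow directly from the equation above; a one-line sketch:

import numpy as np

def leverage_corrected(calib_resid, leverage):
    # Exact for MLR, only an approximation for PCR and PLS
    return np.asarray(calib_resid) / (1.0 - np.asarray(leverage))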
Validation Results
The simplest and most efficient measure of the uncertainty on future predictions is the RMSEP (Root Mean
Square Error of Prediction). This value (one for each response) tells you the average uncertainty that can be
expected when predicting Y-values for new samples, expressed in the same units as the Y-variable. The results
of future predictions can then be presented as predicted values ± 2·RMSEP. This measure is valid provided that the new samples are similar to the ones used for calibration; otherwise, the prediction error might be much higher.
Validation residual and explained variances are also computed in exactly the same way as calibration
variances, except that prediction residuals are used instead of calibration residuals. Validation variances are
used, as in PCA, to find the optimum number of model components. When validation residual variance is
minimal, RMSEP also is, and the model with an optimal number of components will have the lowest expected
prediction error.
RMSEP can be compared with the precision of the reference method. Usually you cannot expect RMSEP to be
lower than twice the precision.
Stability Plots
The results of all these calculations can also be visualized as stability plots in scores, loadings, and loading
weights plots. Stability plots can be used to understand the influence of specific samples and variables on the
model, and explain for example why a variable with a large regression coefficient is not significant. This will
be illustrated in the example that follows (see Application Example).
Application Areas
1. Spectroscopic calibrations work better if you remove noisy wavelengths.
2. Some models may be improved by adding interactions and squares of the variables, and The Unscrambler has a feature to do this automatically. However, many of these terms are irrelevant. Apply Martens' uncertainty test to identify and keep only the significant ones.
Application Example
In a work environment study, we used PLS1 to model 34 data samples corresponding to 34 departments in a company. The data was collected from a questionnaire about feeling good at work (Y), modeled from 26 questions (X1, X2, ..., X26) about repetitive tasks, inspiration from the boss, helpful colleagues, positive feedback from the boss, etc.
The model has 2 PCs assessed by full cross validation and Uncertainty Test. Thus the cross validation has
created 34 sub-models, where 1 sample has been left out in each.
The Unscrambler regression overview shown in the figure below contains a Score plot (PC1-PC2), the X-Loading Weights and Y-loadings plot (PC1-PC2), the explained variance and the Predicted vs. Measured plot for 2 PCs for this PLS1 regression model.
Regression overview from the work environment study
[Figure: regression overview for model 'pls1 bbs jack-k' (X-expl: 33%, 21%; Y-expl: 66%, 6%): Scores (PC1 vs. PC2), X-loading weights and Y-loadings (PC1 vs. PC2), Explained Y-variance (PC_00 to PC_05), and Predicted vs. Measured for 2 PCs, with plot statistics Elements: 34, Slope: 0.624272, Offset: 2.787214, Correlation: 0.775728, RMSEP: 0.517955, SEP: 0.525744, Bias: -0.000909.]

[Figure: Regression Coefficients plot for the 26 X-variables ('gentrivs', 2 PCs), with the uncertainty limits of variable X11 highlighted.]
Variable X11's regression coefficient has uncertainty limits crossing the zero line: it is not significant.
The automatic function Mark significant variables shows clearly which variables have a significant effect on
Y (see figure below).
Regression coefficients plot with marked significant variables.
15 X-variables out of 26 are significant. X11 ("Do you get help from your colleagues?") is not significant, even though its B-coefficient is not among the smallest. How come?
Stability plot of the X-loading weights (PC1 vs. PC2)
[Figure: for model 'pls1 bbs jack-k' (X-expl: 33%, 21%; Y-expl: 66%, 6%), each of the 26 X-variables appears as a swarm of sub-model loading weights; the swarm for X11 carries the pop-up label 'X11 uncertain'.]
For each variable you see a swarm of its loading weights in each sub-model. There are 26 such X-loading
weights swarms. In the middle of each swarm you see the loading weight for the variable in the total model.
They should lie close together. Usually the uncertainty is larger (the spread is larger in the swarm) for variables
close to the origin, i.e. these variables are non-significant.
If a variable has a sub-loading far away from the rest in its swarm, then this variable is strongly influenced by one of the sub-models. The segment information on the figure above indicates that sub-model 26 (or segment 26, as shown in the pop-up information) has a large influence on variable X11.
Individual samples can be very influential when included in a model. In segment 26, where sample 26 was
kept out, the sub-loading weight for variable X11 is very different from the sub-loading weights obtained from
all other sub-models, where sample 26 was included. Probably this sample has an extreme value for variable
X11, so the distribution is skewed. Therefore the estimate of the loading weight for variable X11 is uncertain,
and it becomes non-significant.
We can verify the extreme value of sample 26 by plotting X11 versus Y as shown below:
Line plot of X11 vs. Y
[Figure: X11 ('hjelp', horizontal axis) plotted against Y ('gentrivs', vertical axis) for the 34 departments.]
Only two departments (15 and 26) report that their colleagues are not helpful, so these two samples influence the sub-models strongly and twist them. Without these two samples, variable X11 would have a very small variation and the model would be different. Sample 26 clearly drags the regression line down. By removing it you would get a fairly horizontal line, i.e. no relationship at all between X11 and Y.
For each sample you see a swarm of its scores from each sub-model. There are 34 sample swarms. In the
middle of each swarm you see the score for the sample in the total model. The circle shows the projected or
rotated score of the sample in the sub-model where it was left out.
The next figure presents a zoom on sample 23. The sub-score marked with a circle corresponds to the sub-model where sample 23 was kept out. The segment information displayed on the figure points towards the sub-score for sample 23 when sample 26 was kept out. Here again, we observe the influence of sample 26 on the model.
Stability Plot on the scores: Zooming in on sample 23
If a given sample is far away from the rest of the swarm, it means that the sub-model without this sample is
very different from the other sub-models. In other words, this sample has influenced all other sub-models due
to its uniqueness.
In the work environment example, from looking at the global picture from the stability score plot we can
conclude that all samples seem OK and the model seems robust.
Each perturbed model is based on all the objects except one or more objects which were kept 'secret' in this
cross validation segment m.
If a perturbed segment model differs greatly from the common model, based on all the objects, it means that
the object(s) kept 'secret' in this cross validation segment have significantly affected the common model. These
left out objects caused some unique pattern of variation in the model parameters. Thus, a plot of how the
model parameters are perturbed when different objects are kept 'secret' in the different cross validation
segments m=1,2,...,M shows the robustness of the common model against peculiarities in the data of
individual objects or segments of objects.
These perturbations may be inspected graphically in order to acquire a general impression of the stability of the
parameter estimates, and to identify dominating sources of model instability. Furthermore, they may also be
summarized to yield estimates of the variance/covariance of the model parameters.
This is often called jack-knifing. It will here be used for two purposes:
1. Elimination of useless variables, based on the linear parameters B;
2. Stability assessment of the bilinear structure parameters T and [P', Q'].
The perturbed scores from segment m are rotated towards the common model:

T(m) = Tm C

and the uncertainty variance of B is estimated as

S²B = Σm=1..M (B - Bm)² g

where Bm is the perturbed estimate of B computed with segment m kept out of the calibration, and g is a scaling factor.
Significance Testing
When the variances for B, P, Q, and W have been estimated, they can be utilized to find significant
parameters.
As a rough significance test, a Student's t-test is performed for each element in B relative to the square root of its estimated uncertainty variance S²B, giving the significance level for each parameter. In addition to the significance for B, which gives the overall significance for a specific number of components, the significance levels for Q are useful to find in which components the Y-variables are modeled with statistical relevance.
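In the spirit of this test, a rough sketch: B is the coefficient vector from the common model, B_sub holds the M perturbed estimates, and the jack-knife scaling g = (M - 1)/M as well as the degrees of freedom passed in are assumptions of this illustration:

import numpy as np
from scipy import stats

def jackknife_significance(B, B_sub, dof):
    M = B_sub.shape[0]
    s2 = ((B_sub - B)**2).sum(axis=0) * (M - 1) / M   # uncertainty variance
    t = np.abs(B) / np.sqrt(s2)                       # t-value per coefficient
    p = 2 * stats.t.sf(t, dof)                        # two-sided p-value
    return t, p                                       # small p: significant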
Task - PCA: Starts the PCA dialog where you may choose a validation method and further specify
validation details
Task - Regression: Starts the Regression (PLS, PCR or MLR) dialog where you may choose a
validation method and further specify validation details
Validation Dialogs
The following dialogs are accessed from the PCA dialog and Regression dialog at the Task stage:
Uncertainty Test
Results - PCA: Open PCA result file or just lookup file information, warnings and variances
Results - Regression: Open regression result file or just lookup file information, warnings and
variances
Results - All: Open any result file or just lookup file information, warnings and variances
Plot - Variances and RMSEP: Plot variance curves and estimated Prediction Error (PCA, PCR, PLS)
Plot - Predicted vs Measured: Display plot of predicted Y values against actual Y values
View - Plot Statistics: Display statistics (including RMSEP) on Predicted vs Measured plot
Window - Warning List: Display general warnings issued during the analysis among others related to
validation
View - Uncertainty Test - Stability Plot: Display stability plot for scores or loadings
View - Uncertainty Test - Uncertainty Limits: Display uncertainty limits on regression coefficients
plot
Make Predictions
Use an existing regression model to predict response values for new samples.
You need a regression model (MLR or PCR or PLS) which expresses the response variable or variables
(Y) as a function of the X-variables;
The model should have been calibrated on samples covering the region your new samples belong to, i.e.
on similar samples (similarity being determined by the X-values);
The model should also have been validated on samples covering the region your new samples belong to.
Note that model validation can only be considered successful if you have dealt with outliers in a proper way (not just removed all the samples which did not fit well).
This prediction method is simple and easy to understand. However it has a disadvantage, as we will see when
we compare it to another approach presented in the next section.
However, you can also take advantage of projection onto the model components to express predicted Y-values
in a different way.
The PCR model equation can be written:
X = T . PT + E and y = T . b + f
and similarly for PLS, with several responses:
X = T . PT + E and Y = T . B + F
In both these equations, we can see that Y is expressed as an indirect function of the X-variables, using the
scores T.
The advantage of using the projection equation for prediction, is that when projecting a new sample onto the
X-part of the model (this operation gives you the t-scores for the new sample), you simultaneously get a
leverage value and an X-residual for the new sample that allow for outlier detection.
A prediction sample with a high leverage and/or a large X-residual is a prediction outlier. It cannot be
considered as belonging to the same population as the samples your regression model is based on, and
therefore you should not apply your model to the prediction of Y-values for such a sample.
Note: Using leverages and X-residuals, prediction outliers can be detected without any knowledge of the true
value of Y.
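A sketch of prediction by projection, for a PCR-type model whose loadings are orthonormal (PLS projects via the loading weights instead; all names here are illustrative):

import numpy as np

def predict_by_projection(Xnew, x_mean, P, b0, b):
    Xc = Xnew - x_mean
    t = Xc @ P                       # projection: t-scores of the new samples
    x_resid = Xc - t @ P.T           # the part the model cannot describe
    rss = (x_resid**2).sum(axis=1)   # large value -> prediction outlier
    y_pred = b0 + Xnew @ b           # same predictions as the direct equation
    return y_pred, t, rss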
Prediction in Practice
The sections that follow list menu options, dialogs and plots for prediction. For a more detailed description of each menu option, read The Unscrambler Program Operation, available as a PDF file from Camo's web site www.camo.com/TheUnscrambler/Appendices.
Run A Prediction
In practice, prediction requires three operations:
1. Build and validate a regression model, using PCR or PLS (see Chapter Multivariate Regression in
Practice p. 116) or, for three-way data, nPLS; save the final version of your model.
2. Collect X-values for new samples (for three-way data, you need both Primary and Secondary X-values);
3. Run a prediction, using the chosen regression model.
When your data table is displayed in the Editor, you may access the Task menu to run a Prediction.
Task - Predict: Run a prediction on some samples contained in the current data table
File - Save: Save result file for the first time, or with existing name
Results - Prediction: Open prediction result file or just lookup file information and warnings
Results - All: Open any result file or just lookup file information, warnings and variances
PC Navigation Tool
Navigate up or down the PCs in your model along the vertical and horizontal axes of your plots:
Edit - Insert Draw Item: Draw a line or add text to your plot
View - Plot Statistics: Display plot statistics, including RMSEP, on your Predicted vs. Reference plot
View - Outlier List: Display list of outlier warnings issued during the analysis for each PC, sample
and/or variable
Window - Warning List: Display general warnings issued during the analysis
Task - Recalculate with Marked: Recalculate predictions with only the marked samples
Task - Recalculate without Marked: Recalculate predictions without the marked samples
View - Raw Data: Display the source data for the predictions in a slave Editor
Task - Extract Data from Marked: Extract data for only the marked samples
Task - Extract Data from Unmarked: Extract data for only the unmarked samples
Classification
Use existing PCA models to build a SIMCA classification model, then classify new samples.
Classification can also be used to distinguish among the most important variables to keep in a model (variables that characterize the population).
It follows that, contrary to regression, which predicts the values of one or several quantitative variables,
classification is useful when the response is a category variable that can be interpreted in terms of several
classes to which a sample may belong.
Examples of such situations are:
- Predicting whether a product meets quality requirements, where the result is simply Yes or No (i.e.
binary response).
- Modeling various close species of plants or animals according to their easily observable characteristics, so as
to be able to decide whether new individuals belong to one of the modeled species.
- Modeling various diseases according to a set of easily observable symptoms, clinical signs or biological parameters, so as to help future diagnosis of those diseases.
SIMCA Classification
The classification method implemented in The Unscrambler is SIMCA (Soft Independent Modeling of Class
Analogy).
SIMCA is based on making a PCA model for each class in the training set. Unknown samples are then
compared to the class models and assigned to classes according to their analogy to the training samples.
Steps in Classification
Solving a classification problem requires two steps:
1. Modeling: Build one separate model for each class;
2. Classifying new samples: Fit each sample to each model and decide whether the sample belongs to
the corresponding class.
The modeling stage implies that you have identified enough samples as members of each class to be able to
build a reliable model. It also requires enough variables to describe the samples accurately.
The actual classification stage uses significance tests, where the decisions are based on statistical tests
performed on the object-to-model distances.
The classification decision rule is based on a classical statistical approach. If a sample belongs to a class, it
should have a small distance to the class model (the ideal situation being distance=0). Given a new sample,
you just need to compare its distance to the model to a class membership limit reflecting the probability
distribution of object-to-model distances around zero.
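A minimal sketch of the distance computation behind this decision rule; the class model is reduced to its mean and loadings, the membership limit is left to the caller, and degrees-of-freedom corrections are omitted:

import numpy as np

def simca_distance(x, class_mean, P):
    # P: loadings of the class PCA model (p x A)
    xc = x - class_mean
    resid = xc - (xc @ P) @ P.T          # residual after projection
    return np.sqrt((resid**2).mean())    # Si: sample-to-model distance

# A sample is assigned to the class if its Si falls below the class
# membership limit derived from the training distances.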
Model Results
For each pair of models, the Model Distance between the two models is computed.
Variable Results
Sample Results
Combined Plots
Si vs. Hi
Coomans plot.
Model Distance
This measure (which should actually be called model-to-model distance) shows how different two models
are from each other. It is computed from the results of fitting all samples from each class to their own model
and to the other one.
The value this measure should be compared to is 1 (the distance of a model to itself). A model distance much larger than 1 (for instance, 3 or more) shows that the two models are quite different, which in turn implies that the two classes are likely to be well distinguished from each other.
Modeling Power
Modeling power is a measure of the influence of a variable over a given model. It is computed as
(1 - square root of (variable residual variance / variable total variance)).
This measure has values between 0 and 1; the closer to 1, the better that variable is taken into account in the
class model, the higher the influence of that variable, and the more relevant it is to that particular class.
Discrimination Power
The discrimination power of a variable indicates the ability of that variable to discriminate between two
models. Thus, a variable with a high discrimination power (with regard to two particular models) is very
important for the differentiation between the two corresponding classes.
Like model distance, this measure should be compared to 1 (no discrimination power at all), and variables with
a discrimination power higher than 3 can be considered quite important.
Si vs. Hi
This plot is a graphical tool used to get a view of the sample-to-model distance (Si) and sample leverage (Hi)
for a given model at the same time. It includes the class membership limits for both measures, so that samples
can easily be classified according to that model by checking whether they fall inside both limits.
Coomans Plot
This is an Si vs. Si plot, where the sample-to-model distances are plotted against each other for two models.
It includes class membership limits for both models, so that you can see whether a sample is likely to belong to
one class, or both, or none.
Outcomes Of A Classification
There are three possible outcomes of a classification:
1. Unknown sample belongs to one class;
2. Unknown sample belongs to several classes;
3. Unknown sample belongs to none of the classes.
The first case is the easiest to interpret.
If the classes have been modeled with enough precision, the second case should not occur (no overlap). If it
does occur, this means that the class models might need improvement, i.e. more calibration samples and/or
additional variables should be included.
The last case is not necessarily a problem. It may be a quite interpretable outcome, especially in a one-class
problem. A typical example is product quality prediction, which can be done by modeling the single class of
acceptable products. If a new sample belongs to the modeled class, it is accepted; otherwise, it is rejected.
Binary discriminant analysis is performed using regression, with the discriminant variable coded 0 / 1 (Yes =
1, No = 0) as Y-variable in the model.
With PLS2, this can easily be extended to the case of more than two classes. Each class is represented by an indicator variable, i.e. a binary variable with value 1 for members of that class, 0 for non-members. By building a PLS2 model with all indicator variables as Y, you can directly predict class membership from the X-variables describing the samples. The model is interpreted by viewing Predicted vs. Measured for each class indicator Y-variable:
Samples with Ypred > 0.5 and a deviation that does not cross the 0.5 line are predicted members;
Samples with Ypred < 0.5 and a deviation that does not cross the 0.5 line are predicted non-members;
Samples with a deviation that crosses the 0.5 line cannot be safely classified.
See Chapter Make Predictions p. 133 for more details on Predicted with Deviations and how to run a
prediction.
Classification in Practice
The sections that follow list menu options, dialogs and plots for classification. For a more detailed description
of each menu option, read The Unscrambler Program Operation, available as a PDF file from Camo's web site www.camo.com/TheUnscrambler/Appendices.
Run A Classification
When your data table is displayed in the Editor, you may access the Task menu to run a Classification.
Prior to the actual classification, we recommend that you do two things:
1. Insert or append a category variable in your data table. This category variable should have as many levels as you have classes. The easiest way to do this is to define one sample set for each class, then build the category variable based on the sample sets (this is an option in the Category Variable Wizard). The category variable will allow you to use sample grouping on PCA and Classification plots, so that each class appears with a different color.
2. Run a PCA on the training samples (i.e. the samples with known class membership on which you are going to base the classification model). Check on the score plots for the first PCs (1 vs. 2, 3 vs. 4, 1 vs. 3, etc.) whether the classes have a good spontaneous separation. Look for outliers using warnings, score plots and influence plots. If the classes are not well separated, a transformation of some variables may be necessary before you can try a classification.
Then the classification procedure itself begins by building one PCA model for each class, diagnosing the
models and deciding how many PCs are necessary according to the variance curve (use a proper validation
method).
Once all your class PCA models are saved, you may run Task - Classify.
Modify - Edit Set: Create new sample sets (one for each class + one for all training samples)
Edit - Insert - Category Variable: Insert category variable anywhere in the table
Edit - Append - Category Variable: Add category variable at the right end of the table
File - Save: Save PCA model file for the first time, or with existing name
File - Save As: Save PCA model file under a new name
Run Classification
Later, you may also run a classification on new samples (once you have checked that the training samples
are correctly classified)
File - Save: Save result file for the first time, or with existing name
Results - Classification: Open classification result file or just lookup file information and warnings
Results - All: Open any result file or just lookup file information, warnings and variances
Edit - Options: Format your plot on the Sample Grouping sheet, group according to the levels of a
category variable
Edit - Insert Draw Item: Draw a line or add text to your plot
View - Outlier List: Display list of outlier warnings issued during the analysis
Window - Warning List: Display general warnings issued during the analysis
1. Insert or append a category variable in your data table. This category variable should have as many levels as you have classes. The easiest way to do this is to define one sample set for each class, then build the category variable based on the sample sets (this is an option in the Category Variable Wizard). The category variable will allow you to use sample grouping on PCA and Classification plots, so that each class appears with a different color.
2. Split the category variable into indicator variables. These will be your Y-variables in the PLS model. Create a new variable set containing only the indicator variables.
Modify - Edit Set: Create new sample sets (one for each class + one for all training samples)
Edit - Insert - Category Variable: Insert category variable anywhere in the table
Edit - Append - Category Variable: Add category variable at the right end of the table
Edit - Split Category Variable: Split the category variable into indicator variables
Modify - Edit Set: Create a new variable set (with all indicator variables)
Run a Regression
Task - Regression: Run a regression on all training samples; select PLS as regression method
More options for saving, viewing and refining regression results can be found in chapter Multivariate
Regression in Practice p. 116.
Run a Prediction
Task - Predict: Run a prediction on new samples contained in the current data table
More options for saving and viewing prediction results can be found in chapter Prediction in Practice p.
135.
Clustering
Use the K-Means algorithm to identify a chosen number of clusters among your samples.
Principles of Clustering
K-Means methodology is a commonly used clustering technique. In this analysis the user starts with a collection of samples and attempts to group them into a chosen number of clusters, k, based on a specific distance measure. The prominent steps involved in the K-Means clustering algorithm are given below.
1. This algorithm is initiated by creating k different clusters. The given sample set is first randomly
distributed between these k different clusters.
2. As a next step, the distance from each sample within a given cluster to its cluster centroid is calculated.
3. Each sample is then moved to the cluster whose centroid lies at the shortest distance from that sample.
As a first step of the cluster analysis the user decides on the Number of Clusters k. This parameter takes integer values, with a lower bound of 1 (in practice, 2 is the smallest relevant number of clusters) and an upper bound equal to the total number of samples.
The K-Means algorithm is repeated a number of times to obtain an optimal clustering solution, every time
starting with a random set of initial clusters.
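To make the algorithm concrete, here is a minimal NumPy sketch of the steps above (random initial assignment, centroid computation, reassignment to the nearest centroid, several random restarts). All function and variable names are ours, not part of The Unscrambler:

```python
import numpy as np

def k_means(X, k, n_restarts=10, max_iter=100, rng=None):
    """Minimal K-Means sketch: returns the best (labels, centroids) over
    several random restarts, judged by total within-cluster squared distance."""
    rng = np.random.default_rng(rng)
    best_inertia, best = np.inf, None
    for _ in range(n_restarts):
        # Step 1: randomly distribute the samples between the k clusters
        labels = rng.integers(0, k, size=len(X))
        for _ in range(max_iter):
            # Centroid of each cluster (fall back to a random sample if empty)
            centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j)
                                  else X[rng.integers(len(X))]
                                  for j in range(k)])
            # Steps 2-3: distances to the centroids, move to the nearest one
            dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dist.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        inertia = (dist[np.arange(len(X)), labels] ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best = inertia, (labels, centroids)
    return best
```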
Distance Types
The following distance types can be used for clustering.
Euclidean distance
This is the most usual, natural and intuitive way of computing a distance between two samples. It takes into
account the difference between two samples directly, based on the magnitude of changes in the sample levels.
This distance type is usually used for data sets that are suitably normalized or without any special distribution
problem.
Manhattan distance
Also known as city-block distance, this distance measurement is especially relevant for discrete data sets.
While the Euclidean distance corresponds to the length of the shortest path between two samples (i.e. as the
crow flies), the Manhattan distance refers to the sum of distances along each dimension (i.e. walking round
the block).
Correlation distance
This distance is based on the correlation coefficient r between two samples:
dp = 1 - r
It lies between 0 (when the correlation coefficient is +1, i.e. the two samples are most similar) and 2 (when the correlation coefficient is -1).
Note that the data are centered by subtracting the mean, and scaled by dividing by the standard deviation. Taking the absolute value of the correlation coefficient instead gives equal meaning to positive and negative correlations, in which case anti-correlated samples will be clustered together.
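As an illustration, the three distance types can be computed as follows (a sketch; the function names are ours):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(((x - y) ** 2).sum())    # "as the crow flies"

def manhattan(x, y):
    return np.abs(x - y).sum()              # "walking round the block"

def correlation_distance(x, y):
    # d = 1 - r: 0 when r = +1 (most similar), 2 when r = -1
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(euclidean(x, y), manhattan(x, y), correlation_distance(x, y))
# The correlation distance is 0: the two samples are perfectly correlated.
```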
Clustering in Practice
This section describes menu options for clustering.
Run A Clustering
When your data table is displayed in the Editor, you may access the Task menu to run a Clustering analysis
using Task - Clustering.
Before: check for any natural groupings; the PCA score plots may provide you with a relevant number of
clusters.
After: display the new score plots along various PCs with sample grouping according to the clustering
variable. This will help you identify which sample properties play an important role in the clustering.
Plot - Scores and Loadings: Display a score plot and the corresponding loading plot
Edit - Options: Format your plot on the Sample Grouping sheet, group according to the levels of the
category variable containing clustering results
General methods for univariate and multivariate descriptive data analysis have been described in the following
chapters:
The last chapter focuses on the use of PLS for analyzing results from constrained (non-orthogonal)
experiments.
ANOVA
Analysis of variance (ANOVA) is based on breaking down the variations of a response into several parts that
can be compared to each other for significance testing.
To test the significance of a given effect, you have to compare the variance of the response accounted for by
the effect to the residual variance, which summarizes experimental error. If the structured variance (due to
the effect) is no larger than the random variance (error), the effect can be considered negligible. If it is
significantly larger than the error, it is regarded as significant.
In practice, this is achieved through a series of successive computations, with results traditionally displayed as
a table. The elements listed hereafter define the columns of the ANOVA table, and there is one row for each
source of variation:
1. First, several sources of variation are defined. For instance, if the purpose of the model is to study the main effects of all design variables, each design variable is a source of variation. Experimental error is also a source of variation;
2. Each source of variation has a limited number of independent ways to cause variation in the data. This number is called number of degrees of freedom (DF);
3. The amount of response variation accounted for by each source is then computed as a sum of squares (SS);
4. Response variance associated to the same source is then computed by dividing the sum of squares by the number of degrees of freedom. This ratio is called mean square (MS);
5. Once mean squares have been determined for all sources of variation, f-ratios associated to every tested effect are computed as the ratio of MS(effect) to MS(error). These ratios, which compare structured variance to residual variance, have a statistical distribution which is used for significance testing. The higher the ratio, the more important the effect;
6. Under the null hypothesis (i.e., that the true value of an effect is zero), the f-ratio has a Fisher distribution. This makes it possible to estimate the probability of getting such a high f-ratio under the null hypothesis. This probability is called p-value; the smaller the p-value, the more likely it is that the observed effect is not due to chance. Usually, an effect is declared significant if p-value < 0.05 (significance at the 5% level). Other classical thresholds are 0.01 and 0.001.
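Steps 4 to 6 can be illustrated with a few lines of code (all numbers are hypothetical; SciPy's Fisher distribution supplies the p-value):

```python
from scipy.stats import f

# Hypothetical ANOVA table entries for one effect and the error term
ss_effect, df_effect = 42.0, 2      # sum of squares and degrees of freedom
ss_error,  df_error  = 30.0, 12

ms_effect = ss_effect / df_effect   # mean square (step 4)
ms_error  = ss_error  / df_error
f_ratio   = ms_effect / ms_error    # step 5: MS(effect) / MS(error)

p_value = f.sf(f_ratio, df_effect, df_error)   # step 6: upper-tail probability
print(f_ratio, p_value)             # significant at the 5% level if p < 0.05
```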
The outlined sequence of computations applies to all cases of ANOVA. Those can be the following:
Summary ANOVA: ANOVA on the global model. The purpose is to test the global significance of the
whole model before studying the individual effects.
Linear with Interactions ANOVA: Each main effect and each 2-factor interaction is studied separately.
Quadratic ANOVA: Each main effect, each 2-factor interaction and each quadratic effect is studied
separately.
Note1: Quadratic ANOVA is not a part of Analysis of Effects, but it is included in Response Surface Analysis
(see the next chapter Make a Response Surface Model).
Note2: The underlying computations of ANOVA are based on MLR (see the chapter about Multivariate
Regression). The effects are computed from the regression coefficients, according to the following formula:
Main effect of a variable = 2(b-coefficient of that variable).
Multiple Comparisons
Multiple comparisons apply whenever a design variable with more than two levels has a significant effect.
Their purpose is to determine which levels of the design variable have significantly different response mean values.
The Unscrambler uses one of the most well-known procedures for multiple comparisons: Tukey's test. The
levels of the design variable are sorted according to their average response value, and non-significantly
different levels are displayed together.
Center samples:
When HOIE cannot be used because of insufficient degrees of freedom in the cube samples, the experimental
error can be estimated from replicated center samples. This is why including several center samples is so
useful, especially in fractional factorial designs.
Reference samples:
This method is similar to center samples, and applies when there are no replicated center samples but some
reference samples have been replicated.
effects are sorted by increasing absolute value and their significance is estimated using an approximation (the Psi statistic) which is not based on the Fisher distribution. This method has an essentially different philosophy
from the others; the p-values computed from the Psi statistic have no absolute meaning. They can only be
interpreted in the context of the sorted effects. Going from the smallest effect to the largest, p-value is
compared to a significance threshold (e.g. 0.05); when the first significant effect is encountered, all the larger
effects can be interpreted as at least as significant.
Whenever such computations are possible, The Unscrambler automatically computes all results based on those
five methods. The most relevant one, depending on the context, is then selected as default when you view the
results using Effects Overview. You can view the results from the other methods if you wish, by selecting
another method manually.
Note: When the design includes variables with more than two levels, only HOIE is used.
Leverages;
Residuals;
Regression coefficients;
ANOVA;
b-coefficients: The values of the regression coefficients are displayed for each effect of the model.
Standard Error of the b-coefficients: Each regression coefficient is estimated with a certain precision,
measured as a standard error.
Lack of Fit: Whenever possible, the error part is divided into two sources of variation, pure error and
lack of fit. Pure error is estimated from replicated samples; lack of fit is what remains of the residual
sum of squares once pure error has been removed.
By computing an f-ratio defined by MS(lack of fit)/MS(pure error), the significance of the lack of fit of the model can be tested (a numeric sketch is given after this list).
A significant lack of fit means that the shape of the model does not describe the data adequately. For
instance, this can be the case if a linear model is used when there is an important curvature.
Min/Max/Saddle: Since the purpose of a quadratic model often is to find out where the optimum is, the
minimum or maximum value inside the experimental range is computed, and the design variable values
that produce this extreme are displayed as an additional column for the rows where linear effects are
tested. Sometimes the extreme is a minimum in one direction of the surface, and a maximum in another
direction; such a point is called a saddle point, and it is listed in the same column.
Model Check: This new section of the table checks the significance of the linear (main effects only) and
quadratic (interactions and squares) parts of the model. If the quadratic part is not significant, the quadratic
model is too sophisticated and you should try a linear model instead, which will describe your surface
more economically and efficiently.
For linear models with interactions, the model check (linear only vs. interactions) is included, but not
min/max/saddle.
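As a numeric illustration of the lack-of-fit test mentioned in the list above (all numbers are hypothetical):

```python
from scipy.stats import f

ss_lof,  df_lof  = 8.0, 3    # lack of fit: residual SS minus pure error SS
ss_pure, df_pure = 5.0, 6    # pure error, estimated from replicated samples

f_ratio = (ss_lof / df_lof) / (ss_pure / df_pure)
p_value = f.sf(f_ratio, df_lof, df_pure)
# A small p-value (e.g. < 0.05) signals significant lack of fit:
# the shape of the model does not describe the data adequately.
print(f_ratio, p_value)
```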
Landscape plot: This plot displays the surface in 3 dimensions, allowing you to study its concrete shape. It is the best type of plot for visualizing interactions or quadratic effects.
Contour plot: This plot displays the levels of the response variable as lines on a 2-dimensional plot (like a
geographical map with altitudes), so that you can easily estimate the response value for any combination of
levels of the design variables. This is done by keeping all variables but two at fixed levels, and plotting the
contours of the surface for the remaining two variables. The plot is best suited for final interpretation, i.e.
to find the optimum, especially when you need to make a compromise between several responses, or to
find a stable region.
1. The X-data are centered; i.e. further results will be interpreted as deviations from an average situation,
which is the overall centroid of the design;
2. The Y-data are also centered, i.e. further results will be interpreted as an increase or decrease compared to
the average response values;
3. The mixture constraint is implicitly taken into account in the model; i.e. the regression coefficients can be
interpreted as showing the impact of variations in each mixture component when the other ingredients
compensate with equal proportions.
In other words: the regression coefficients from a PLS model tell you exactly what happens when you move
from the overall centroid towards each corner, along the axes of the simplex.
This property is extremely useful for the analysis of screening mixture experiments: it enables you to interpret
the regression coefficients quite naturally as the main effects of each mixture component.
The mixture constraint has even more complex consequences on a higher degree model necessary for the
analysis of optimization mixture experiments. Here again, PLS performs very well, and the mixture response
surface plot enables you to interpret the results visually (see Chapter The Mixture Response Surface Plot p.156
for more details).
Thus PLS regression is the method of choice to analyze the results from D-optimal designs, no matter whether
they involve mixture variables or not.
If the regression coefficient for a variable is larger than 0.2 in absolute value, then the effect of that variable is most probably important.
If the regression coefficient is smaller than 0.1 in absolute value, then the effect is negligible.
Between 0.1 and 0.2: "gray zone" where no certain conclusion can be drawn.
Note: In order to be able to compare the relative sizes of your regression coefficients, do not forget to
standardize all variables (both X and Y)!
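These thresholds are easy to apply programmatically once the variables are standardized (a sketch; the function name is ours):

```python
import numpy as np

def classify_coefficients(b):
    """Label standardized regression coefficients as important,
    negligible, or 'gray zone' according to the 0.1 / 0.2 rule."""
    b = np.abs(np.asarray(b))
    return np.where(b > 0.2, "important",
           np.where(b < 0.1, "negligible", "gray zone"))

print(classify_coefficients([0.35, -0.15, 0.04]))
# ['important' 'gray zone' 'negligible']
```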
Martens Uncertainty Test in chapter Uncertainty Testing with Cross Validation p. 123
Plotting Uncertainty Test results and marking significant variables in chapter View Regression Results
p. 117
Example:
A, B and C vary from 0 to 1.
A+B+C = 1 for all mixtures.
Therefore, C can be re-written as 1 - (A+B).
As a consequence, the square effect C*C or C² can also be re-written as (1-(A+B))² = 1 + A² + B² - 2A - 2B + 2A*B:
it does not make any sense to try to interpret square effects independently from main effects and interactions.
In the same way, A*C can be re-expressed as A*(1-A-B) = A - A*A - A*B, which shows that interactions
cannot be interpreted without also taking into account main effects and square effects.
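The substitutions above are easily verified with a symbolic algebra package, for instance SymPy:

```python
from sympy import symbols, expand

A, B = symbols('A B')
C = 1 - (A + B)          # mixture constraint: A + B + C = 1

print(expand(C**2))      # A**2 + 2*A*B - 2*A + B**2 - 2*B + 1
print(expand(A*C))       # -A**2 - A*B + A
# Square effects and interactions are linear combinations of main effects,
# interactions and squares of the other components: they cannot be
# interpreted independently of each other.
```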
Here are therefore the basic principles for building relevant mixture models:
Instead of having two coordinates, the mixture response surface plot uses a special system of 3 coordinates.
Two of the coordinate variables are varied independently from each other (within the allowed limits of course),
and the third one is computed as the difference between MixSum and the other two.
Examples of mixture response surface plots, with or without additional constraints, are shown in the figure
below.
Unconstrained and constrained mixture response surface plots
[Figure: two mixture response surface contour plots over the mixture triangle with corners A=100, B=100 and C=100 (each component ranging from 0.000 to 100.0000): one for a Simplex design without additional constraints, and one for a D-optimal design with additional constraints (model: Centroid quad, PC: 3, Y-var: Y).]
Similar response surface plots can also be built when the design includes one or several process variables.
Task - Analysis of Effects: Run an Analysis of Effects on the current data table
Task - Response Surface: Run a Response Surface analysis on the current data table
Task - Regression: Run a regression on the current data table (choose method PLS for constrained
designs)
File - Save: Save result file for the first time, or with existing name
Results - PCA, Results - Statistics, etc.: Open a specific type of result file or just lookup file
information, warnings and variances
Results - All: Open any result file or just lookup file information, warnings and variances
Plot - Effects: Display the main plot of effects (and select appropriate significance testing method)
Plot - Predicted vs Measured: Display plot of predicted Y values against actual Y values
PC Navigation Tool
Navigate up or down the PCs in your model along the vertical and horizontal axes of your plots:
Edit - Insert Draw Item: Draw a line or add text to your plot
View - Outlier List: Display list of outlier warnings issued during the analysis for each PC, sample
and/or variable
Window - Warning List: Display general warnings issued during the analysis
View - Scaling
View - Zoom In
Plot - Response Surface Overview: Display the 4 main response surface plots
Plot - Response Surface: Display a response surface plot according to your specifications
Plot - Predicted vs Measured: Display plot of predicted Y values against actual Y values
Plot - Regression and Prediction: Display Predicted vs. Measured and Regression coefficients
Edit - Insert Draw Item: Draw a line or add text to your plot
View - Outlier List: Display list of outlier warnings issued during the analysis for each PC, sample
and/or variable
Window - Warning List: Display general warnings issued during the analysis
View - Scaling
View - Zoom In
What is MCR?
Multivariate Curve Resolution (MCR) methods may be defined as a group of techniques which aim at recovering the concentration profiles (pH profiles, time/kinetic profiles, elution profiles, chemical composition changes...) and response profiles (spectra, voltammograms...) of the components in an unresolved mixture, using a minimal number of assumptions about the nature and composition of these mixtures. MCR methods can be easily extended to the analysis of many types of experimental data, including multi-way data.
[Figure: an unresolved mixture is measured as a data matrix whose rows are spectra (across wavelengths) and whose columns are chromatograms (across retention times). MCR decomposes this matrix into concentration profiles c1 ... cn (the chemical model: process evolution, compound contribution, relative quantitation) and pure signals s1 ... sn collected in ST (compound identity, source identification and interpretation).]
Purposes of MCR
Multivariate Curve Resolution has been shown to be a powerful tool to describe multi-component mixture systems through a bilinear model of pure component contributions. MCR, like PCA, assumes the fulfilment of a bilinear model, i.e. X = C ST + E, where the number of components N is much smaller than I or J.
[Figure: comparison of the two bilinear decompositions. PCA: X = T PT + E, with T orthogonal and P orthonormal, PT in the direction of maximum variance; unique solutions, but without physical meaning, useful for interpretation. MCR: X = C ST + E, with other constraints (non-negativity, unimodality, local rank, ...).]
Limitations of PCA
Principal Component Analysis, PCA, produces an orthogonal bilinear matrix decomposition, where components or factors are obtained in a sequential way explaining maximum variance. Using these constraints plus normalization during the bilinear matrix decomposition, PCA produces unique solutions. These 'abstract'
unique and orthogonal (independent) solutions are very helpful in deducing the number of different sources of
variation present in the data and, eventually, they allow for their identification and interpretation. However,
these solutions are 'abstract' solutions in the sense that they are not the 'true' underlying factors causing the data
variation, but orthogonal linear combinations of them.
How unique is the MCR solution? in Rotational and Intensity Ambiguities in MCR p.165
Types of problems which MCR can solve in MCR Application Examples p.168
As a comparison, you may also read more about PCA in chapter Principles of Projection and PCA p. 95.
You may also read about the MCR-ALS algorithm in the Method References chapter, available as a separate .PDF document for easy print-out of the algorithms and formulas; download it from Camo's web site www.camo.com/TheUnscrambler/Appendices.
Residuals are error measures; they tell you how much variation remains in the data after k components
have been estimated;
Estimated concentrations describe the estimated pure components profiles across all the samples
included in the model;
Estimated spectra describe the instrumental properties (e.g. spectra) of the estimated pure components.
Residuals
The residuals are a measure of the fit (or rather, misfit) of the model. The smaller the residuals, the better the
fit.
MCR residuals can be studied from three different points of view.
Variable Residuals are a measure of the variation remaining in each variable after k components have
been estimated. In The Unscrambler, the variable residuals are plotted as a line plot where each variable is represented by one value: its residual in the k-component model.
Sample Residuals are a measure of the distance between each sample and its model approximation. In
The Unscrambler, the sample residuals are plotted as a line plot where each sample is represented by one
value: its residual after k components have been estimated.
Total Residuals express how much variation in the data remains to be explained after k components have
been estimated. Their role in the interpretation of MCR results is similar to that of Variances in PCA. They
are plotted as a line plot showing the total residual after a varying number of components (from 2 to n+1).
The three types of MCR residuals are available for two different model fits.
MCR Fitting: these are the actual values of the residuals after the data have been resolved to k pure
components.
PCA Fitting: these are the residuals from a PCA with k PCs performed on the same data.
Estimated Concentrations
The estimated concentrations show the profile of each estimated pure component across the samples included
in the MCR model.
In The Unscrambler, the estimated concentrations are plotted as a line plot where the abscissa shows the
samples, and each of the k pure components is represented by one curve.
The k estimated concentration profiles can be interpreted as k new variables telling you how much each of
your original samples contains of each estimated pure component.
Note!
Estimated concentrations are expressed as relative values within individual components. The estimated
concentrations for a sample are not its real composition.
Estimated Spectra
The estimated spectra show the estimated instrumental profile (e.g. spectrum) of each pure component across
the X-variables included in the analysis.
In The Unscrambler, the estimated spectra are plotted as a line plot where the abscissa shows the X-variables,
and each of the k pure components is represented by one curve.
The k estimated spectra can be interpreted as the spectra of k new samples consisting each of the pure
components estimated by the model. You may compare the spectra of your original samples to the estimated
spectra so as to find out which of your actual samples are closest to the pure components.
Note!
Estimated spectra are unit-vector normalized.
where ki are scalars and n refers to the number of components. Each concentration profile of the new C matrix would have the same shape as the real one, but be ki times smaller, whereas the related spectra of the new S matrix would be equal in shape to the real spectra, though ki times more intense.
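This intensity ambiguity is easily demonstrated numerically (a sketch with made-up profiles):

```python
import numpy as np

C = np.array([[1.0, 0.2], [0.5, 0.8]])               # made-up concentrations
S = np.array([[0.3, 0.7, 0.1], [0.6, 0.2, 0.9]]).T   # made-up spectra (columns)
K = np.diag([2.0, 5.0])                               # arbitrary scalars k_i

X1 = C @ S.T
X2 = (C @ np.linalg.inv(K)) @ (K @ S.T)   # shrink C, intensify S accordingly
print(np.allclose(X1, X2))                # True: the data cannot tell them apart
```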
Constraints in MCR
Although resolution does not require previous information about the chemical system under study, additional knowledge, when it exists, can be used to tailor the sought pure profiles according to certain known features and, as a consequence, to minimize the ambiguity in the data decomposition and in the results obtained.
The introduction of this information is carried out through the implementation of constraints.
What is a Constraint?
A constraint can be defined as any mathematical or chemical property systematically fulfilled by the whole system or by some of its pure contributions. Constraints are translated into mathematical language and force the iterative optimization to model the profiles respecting the desired conditions.
There is considerable freedom in the way combinations of constraints may be used for profiles in the different concentration and spectral domains. This increase in flexibility also makes it possible to apply a certain constraint with a variable degree of tolerance to cope with noisy real data, i.e., the implementation of constraints often allows for small deviations from the ideal behavior before correcting a profile. Methods to correct the profile to be constrained have evolved into smoother methodologies, which modify the misbehaving profile so that its global shape is kept as much as possible and the convergence of the iterative optimization is minimally upset.
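As an illustration of the alternating least squares idea with a simple constraint, here is a minimal sketch (not The Unscrambler's implementation; non-negativity is imposed here by plain clipping, whereas real MCR-ALS code uses the smoother corrections described above):

```python
import numpy as np

def mcr_als(X, C0, n_iter=100):
    """Sketch of MCR-ALS: X ~ C @ S.T with non-negativity on C and S.
    C0 is an initial guess of the concentration profiles (I x n)."""
    C = C0.copy()
    for _ in range(n_iter):
        S = np.linalg.lstsq(C, X, rcond=None)[0].T    # solve C S^T = X for S
        S = np.clip(S, 0.0, None)                     # non-negativity constraint
        C = np.linalg.lstsq(S, X.T, rcond=None)[0].T  # solve S C^T = X^T for C
        C = np.clip(C, 0.0, None)
    return C, S
```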
Non-negativity
The non-negativity constraint is applied when it can be assumed that the measured values in an experiment
will always be non-negative.
This constraint forces the values in a profile to be equal to or greater than zero. It is an example of an
inequality constraint.
Non-negativity constraints may be applied independently of each other to the concentration profiles and/or the spectral profiles.
Unimodality
The unimodality constraint allows the presence of only one maximum per profile.
This condition is fulfilled by many peak-shaped concentration profiles, like chromatograms, by some types of
reaction profiles and by some instrumental signals, like certain voltammetric responses.
It is important to note that this constraint does not only apply to peaks, but to profiles that have a constant
maximum (plateau) and a decreasing tendency. This is the case of many monotonic reaction profiles that show
only the decay or the emergence of a compound, such as the most protonated and deprotonated species in an
acid-base titration reaction, respectively.
Closure
The closure constraint is applied to closed reaction systems, where the principle of mass balance is fulfilled.
With this constraint, the sum of the concentrations of all the species involved in the reaction (the suitable
elements in each row of the C matrix) is forced to be equal to a constant value (the total concentration) at each
stage in the reaction. The closure constraint is an example of equality constraint.
In practice, the closure constraint in MCR forces the sum of the concentrations of all the mixture components
to be equal to a constant value (the total concentration) across all samples included in the model.
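In code, closure amounts to rescaling each row of the concentration matrix to the known total (a sketch; the function name is ours):

```python
import numpy as np

def apply_closure(C, total=1.0):
    """Force the concentrations in each row of C to sum to `total`."""
    return C * (total / C.sum(axis=1, keepdims=True))

C = np.array([[0.2, 0.3], [0.1, 0.1]])
print(apply_closure(C))   # each row now sums to 1.0
```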
Other constraints
Apart from the three constraints previously defined, other types of constraints can be applied. See literature on
curve resolution for more information about them.
Physico-chemical constraints
One of the most recent advances in chemical constraints is the implementation of a physico-chemical model into the multivariate curve resolution process. In this manner, the concentration profiles of compounds involved in a kinetic or a thermodynamic process are shaped according to the suitable chemical law. Such a strategy has been used to reconcile the separate worlds of hard- and soft-modeling and has enabled the mathematical resolution of chemical systems that could not be successfully tackled by either of these two pure methodologies alone. The strictness of the hard-model constraints dramatically decreases the ambiguity of the constrained profiles and provides fitted parameters of physico-chemical and analytical interest, such as equilibrium constants, kinetic rate constants and total analyte concentrations. The soft part of the algorithm allows for modeling of complex systems, where the central reaction system evolves in the presence of absorbing interferences.
Finally, it should be mentioned that MCR methods based on a bilinear model may be easily adapted to resolve
three-way data sets. Particular multi-way models and structures may be easily implemented in the form of
constraints during MCR optimization algorithms, such as Alternating Least Squares (see below). The
discussion of this topic is, however, out of the scope of the present chapter. When a set of data matrices is
obtained in the analysis of the same chemical system, they can be simultaneously analyzed setting all of them
together in an augmented data matrix and following the same steps as for a single data matrix analysis. The
possible data arrangements are displayed in the following figure:
[Figure: possible arrangements of augmented data matrices. Column-wise augmentation: several experiments monitored with the same technique are stacked on top of each other (X1, X2, X3, ... with concentration profiles C1, C2, C3 and a single shared ST). Row-wise augmentation: several experiments monitored with several techniques are placed side by side (X1, X2, X3 with spectra S1T, S2T, S3T and a single shared C).]
Note! What follows is not a tutorial. See the Tutorials chapter for more examples and hands-on training.
For example, consider an A → B reaction where both A and B have overlapping spectra, and the reaction profiles also overlap in the whole range of study. This is a case of strong rotational ambiguity, since many possible solutions to the problem exist. Using non-negativity (for both spectra and reaction profiles), unimodality and closure (for reaction profiles) considerably reduces the number of possible solutions.
When to apply constraints, in chapter Constraint Settings Are Known Beforehand below.
Case 2: You have no prior expectations about the number of pure components, but some of the extracted
profiles look very noisy and/or two of the estimated spectra are very similar. This indicates that the actual
number of components is probably smaller than the estimated number. Action: reduce sensitivity.
Case 3: You know that there are at least n different components whose concentrations vary in your system,
and the estimated number of pure components is smaller than n. Action: increase sensitivity.
Case 4: You know that the system should contain a trace-level component, which is not detected in the
current resolution. Action: increase sensitivity.
Case 5: You have no prior expectations about the number of pure components, and you are not sure whether
the current results are sensible or not. Action: check MCR message list.
Outliers in MCR
As in any other multivariate analysis, the available data may be more or less clean when you build your first
curve resolution model.
The main tool for diagnosing outliers in MCR consists of two plots of sample residuals, accessed with menu
option Plot - Residuals.
Any sample that sticks out on the plots of Sample Residuals (either with MCR fitting or PCA fitting) is a
possible outlier.
To find out more about such a sample (Why is it outlying? Is it an influential sample? Is that sample dangerous
for the model?), it is recommended to run a PCA on your data.
If you find out that the outlier should be removed, you may recalculate the MCR model without that sample.
Read more about:
Non-targeted wavelength regions: these variables carry virtually no information that can be of use to the
model;
Highly overlapped wavelength regions: several of the estimated components have simultaneous peaks in those regions, so that their respective contributions are difficult to disentangle.
The main tool for diagnosing noisy variables in MCR consists of two plots of variable residuals, accessed with
menu option Plot - Residuals.
Any variable that sticks out on the plots of Variable Residuals (either with MCR fitting or PCA fitting) may be
disturbing the model, thus reducing the quality of the resolution; try recalculating the MCR model without that
variable.
Here are a few rules and principles that may help you:
1. To have reliable results on the number of pure components, you should cross-check with a PCA result, try
different settings for the Sensitivity to pure components, and use the navigation bar to study the MCR results
for various estimated numbers of pure components.
2. Weak components (either low concentration or noise) are usually listed first.
3. Estimated spectra are unit-vector normalized.
4. The spectral profiles obtained may be compared to a library of similar spectra in order to identify the nature
of the pure components that were resolved.
5. Estimated concentrations are relative values within an individual component itself. Estimated concentrations
of a sample are NOT its real composition.
Application examples:
1. One can utilize estimated concentration profiles and other experimental information to analyze a chemical/
biochemical reaction mechanism.
2. One can utilize estimated spectral profiles to study the mixture composition or even intermediates during a
chemical/biochemical process.
1. Choose and implement an appropriate pre-processing method (see Chapter Re-formatting and Pre-processing);
2. Specify the model. If you already have estimations of the pure component concentrations or spectra, enter them as Initial guess. Remember to define relevant constraints: non-negative concentrations is usual, the spectra are also often non-negative, while unimodality and closure may or may not apply to your case. Finally, you may also tune the sensitivity to pure components before launching the calculations;
3. View the results and choose the number of components to interpret, according to the plots of Total residuals;
4.
5.
Run An MCR
When your data table is displayed in the Editor, you may access the Task menu to run a suitable analysis, for instance MCR.
Task - MCR: Run a Multivariate Curve Resolution on the current data table
File - Save: Save result file for the first time, or with existing name
Results - MCR: Open MCR result file or just lookup file information
Results - All: Open any result file or just lookup file information, warnings and variances
Plot - Estimated Concentrations: Plot estimated concentrations of the chosen pure components for all
samples
Plot - Estimated Spectra: Plot estimated spectra of the chosen pure components
Plot - Residuals: Display various types of residual plots. There you may choose between MCR Fitting (Sample residuals, Variable residuals or Total residuals in your MCR model, for a given number of pure components) and PCA Fitting (the corresponding residuals from a PCA on the same data).
PC Navigation Tool
Navigate up or down the PCs in your model along the vertical and horizontal axes of your plots:
View - Source: Select which sample types / variable types / variance type to display
Edit - Insert Draw Item: Draw a line or add text to your plot
View - MCR Message List: Display list of recommendations issued during the analysis, to help you
improve your MCR model
View - Scaling
View - Zoom In
View - Raw Data: Display the source data for the analysis in a slave Editor
Edit - Mark - One By One: Mark samples or variables individually on current plot
Edit - Mark - With Rectangle: Mark samples or variables by enclosing them in a rectangular frame (on
current plot)
Edit - Mark - Unmark All : Remove marking for all objects of the type displayed on current plot
Edit - Mark - Reverse Marking: Exchange marked and unmarked objects on the plot
Task - Recalculate with Marked: Recalculate model with only the marked samples / variables
Task - Recalculate without Marked: Recalculate model without the marked samples / variables
View - Raw Data: Display the source data for the analysis in a slave Editor
Task - Extract Data from Marked: Extract data for only the marked samples / variables
Task - Extract Data from Unmarked: Extract data for only the unmarked samples / variables
If you have three-way data that is not easily described with a flat table structure, read about the exciting
method to analyze those data (NPLS) using three-way data analysis. Before describing this tool though, it is
instructive to learn what three-way data actually is and how it arises.
0.02 0.08 0.17 0.05 0.03
0.06 0.32 0.64 0.19 0.13
0.10 0.50 1.00 0.30 0.20
0.06 0.32 0.64 0.19 0.13
0.02 0.08 0.17 0.05 0.03
0.00 0.01 0.02 0.01 0.00
0.00 0.00 0.00 0.00 0.00
where the third row is seen to be the same as above. In this case, every sample yields a table in itself. This is
shown graphically as follows:
[Figure: the same values laid out as a table of their own for each sample; several such tables, one per sample, are shown stacked.]
When the data from one sample can be held in a vector, it is sometimes referred to as first-order data as
opposed to scalar data one measurement per sample which is called zeroth-order data. When data of one
sample is a matrix, then the data is called second-order data (see the 1988 article by Sanchez and Kowalski; a detailed bibliography is given in the Method References chapter).
Having several sets of matrices, for example from different samples, a three-way array is obtained (see figure
below). Three-way data analysis is the analysis of such structures.
A three-way array is obtained from several sets of matrices
In the same way as going from two-way matrices to three-way arrays, it is also possible to obtain four-way,
five-way, or multi-way in general, data. Multi-way data is sometimes referred to as N-way data, which is
where the N in NPLS (see below) comes from.
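In array-programming terms, such a structure is simply an array with three indices; for instance, in NumPy (note that NumPy's "dimensions" correspond to what this chapter calls modes):

```python
import numpy as np

I, K, L = 10, 5, 3          # sizes of the first, second and third mode
X = np.zeros((I, K, L))     # an OV2 array: I samples, K x L variables each
print(X.shape)              # (10, 5, 3)
```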
Note that a three-way array is not referred to as a three-dimensional array. The term dimension is retained
for indicating the size of each mode.
The definition of which is the first, second and third mode can be seen in the figure below. The dimensions of
these modes are I, K and L respectively.
First, second and third modes in a three-way array
[Figure: the first mode (dimension I), the second mode (dimension K) and the third mode (dimension L) of a three-way array.]
Two different types of modes will be distinguished. One is a sample-mode and the other is a variable-mode.
For a typical two-way (matrix) data set, the samples are held in the first (row) mode and the variables are held
in the second (column) mode. This configuration is also sometimes called OV where O means that the first
mode is an object-mode and V means that the second mode is a variable mode. If a grey-level image is
analyzed and the image represents a measurement on a sample, then the matrix holding the data is a V 2
structure because both modes represent different measurements on the same sample.
2
Likewise, for three-way data, several types of structures such as OV , O V, V etc. can be imagined. In the
following, only OV2 data are considered in detail.
2
Note: As in two-way analysis it is common practice to keep samples in the first mode for OV data.
[Figure: a three-way array divided into I horizontal slices, K vertical slices and L frontal slices.]
It is also possible to divide further into vectors. Rather than just rows and columns, there are rows, columns
and tubes as shown below.
Rows, columns and tubes in a three-way array
[Figure: a row, a column and a tube in a three-way array.]
Fifteen food samples have been assessed using texture measurements (40 variables) after six different types of storage conditions. The subsequent data can be stored in a 15×40×6 array.
As can be seen, many types of data are conveniently seen as three-way data.
Note: There is no practical consequence of whether the second and third modes are interchanged. As long as
samples are kept in the first mode, the choice between the second and third mode is immaterial except for the
trivial interchanged interpretation.
Three-way Regression
With a three-way array X and matrix Y or vector y it is possible to build three-way regression models. The
principle in three-way regression is more or less the same as in two-way regression. The regression method NPLS is the extension of ordinary PLS to data of arbitrary order. For three-way data specifically, the term tri-PLS is used. Tri-PLS provides a model of X which predicts the dependent variable Y through an inner relation
just like in two-way PLS.
The model of X is a trilinear model which is easily shown graphically, but complicated to write in matrix
notation. Matrices are intrinsically connected to two-way data, so in order to write a three-way model in
matrices, the data and the model have to be rearranged into a two-way model. For appropriately pre-processed
data (See chapter Pre-processing of Three-way data) the tri-PLS model consists of a model of X, a model of Y
and an inner relation connecting these.
Principle in rearranging a three-way array and the corresponding one-component trilinear model to matrix form
[Figure: the frontal slices X1 and X2 of the data array are laid side by side, turning the three-way array into a two-way matrix with twice as many columns; correspondingly, the weight vectors w(1) and w(2) combine into a single vector with elements w(1)*w1(2), w(1)*w2(2), ... so that the trilinear model becomes a bilinear one with score vector t.]
A one-component model of X is also shown. More components are easily added, but one is enough to show the principle of the rearranging. The trilinear component consists of a score vector t (dim I*1), a weight vector in the first variable mode w(1) (dim K*1) and a weight vector in the second variable mode w(2) (dim L*1). These three vectors can be rearranged similarly to the data, leading to a matrix representation of the trilinear component which can then be written
X ≈ t (w(2) ⊗ w(1))T
where the Kronecker product (⊗) is used to abbreviate the expression in parentheses. While this two-way
representation looks a bit complicated, it is noteworthy that it simply expresses the trilinear model shown in the
above figure using two-way notation. Additionally, it represents the trilinear model as a bilinear model using a
score vector and a vector combined from the two weight vectors.
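The rearrangement and the Kronecker-product expression can be verified numerically (a sketch; the unfolding below follows the slice arrangement implied by the expression above):

```python
import numpy as np

rng = np.random.default_rng(0)
I, K, L = 4, 3, 2
t  = rng.random(I)    # scores
w1 = rng.random(K)    # weights in the first variable mode, w(1)
w2 = rng.random(L)    # weights in the second variable mode, w(2)

# Trilinear one-component model: x[i, k, l] = t[i] * w1[k] * w2[l]
X3 = t[:, None, None] * w1[None, :, None] * w2[None, None, :]

# Unfold to an I x (L*K) matrix, laying the L slices side by side,
# and compare with the matrix form t (w(2) kron w(1))^T
X2 = X3.transpose(0, 2, 1).reshape(I, L * K)
model = np.outer(t, np.kron(w2, w1))
print(np.allclose(X2, model))   # True
```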
X ≈ T G (W(2) ⊗ W(1))T
where the rearranged matrix G is originally the (dim A*A*A) core array that takes possible interactions into account.
The inner relation is
ua ≈ T1-a b1-a
where T1-a is a matrix containing all the first a score vectors. The model of Y is then
Y ≈ U QT
No Loadings in tri-PLS
As mentioned in chapter Three-way Regression (see for instance section Only Weights and no Loadings), a tri-PLS model is expressed with two sets of weights (similar to the loading weights in PLS), but no loadings are computed. Thus the interpretation of tri-PLS results will, as far as the predictor variables are concerned, focus on the X-weights.
Most tri-PLS results are interpreted in much the same way as in ordinary PLS (see Chapter Main Results of
Regression p. 111 for more details). Exceptions are listed in Chapter Main Results of Tri-PLS Regression
above.
Read more about specific details:
4. Diagnose the model, using variance curves, X-Y relation outliers, Predicted vs. Measured;
5. Interpret the scores and weights plots and the B-coefficients;
6. Predict response values for new data (optional).
Task - Regression: Run a tri-PLS regression on the current 3-D data table
File - Save: Save result file for the first time, or with existing name
Results - Regression: Open regression result file or just lookup file information, warnings and
variances
Results - All: Open any result file or just lookup file information, warnings and variances
Plot - X-Y Relation Outliers: Display t vs. u scores along individual PCs
Plot - Scores and Loading Weights: Display scores and weights separately or as a bi-plot
Plot - Predicted vs Measured: Display plot of predicted Y values against actual Y values
Plot - Regression and Prediction: Display Predicted vs. Measured and Regression coefficients
Useful tips
To run an analysis (other than three-way regression) on your 3-way data, you need to duplicate your 3-D table
as 2-D data first. Then all relevant analyses will be enabled.
For instance, you may run an exploratory analysis with PCA on unfolded 3-way spectral data, by doing the
following sequence of operations:
1. Start from your 3-D data table (OV² layout) where each row contains a 2-way spectrum;
2. Use File - Duplicate - As 2-D Data Table: this generates a 2-D table containing unfolded spectra;
3.
4.
Another possibility is to develop your own three-way analysis routine and implement it as a User-Defined
Analysis (UDA). Such analyses may then be run from the Task - User-defined Analysis menu.
Interpretation Of Plots
This chapter presents all predefined plots available in The Unscrambler. They are sorted by plot types:
Line;
2D Scatter;
3D Scatter;
Matrix;
Normal Probability;
Table plots;
Special plots.
Whenever viewing a plot in The Unscrambler, hitting <F1> will display the Help chapter on how to interpret
the type of plot which is currently active in your viewer.
Line Plots
Detailed Effects
(Line Plot)
This plot displays all effects for a given response variable. It is recommended to choose a bar layout to make it easier to read. Each effect (main effect, interaction) is represented by a bar.
A bar pointing upwards indicates a positive effect. A bar pointing downwards indicates a negative effect. Click
on a bar to read the exact value of the calculated effect.
Leverages
(Line Plot)
Leverages are useful for detecting samples which are far from the center within the space described by the
model. Samples with high leverage differ from the average samples; in other words, they are likely outliers. A
large leverage also indicates a high influence on the model. The figure below shows a situation where sample 5
is obviously very different from the rest and may disturb the model.
One sample has a high leverage
[Figure: line plot of leverage values for 10 samples; sample 5 stands out.]
Influence on the model is best measured in terms of relative leverage. For instance, if all samples have
leverages between 0.02 and 0.1, except for one which has a leverage of 0.3, although this value is not
extremely large, the sample is likely to be influential.
Variables with large loadings in early components are the ones that vary most. This means that these variables
are responsible for the greatest differences between the samples.
Note: Passified variables are displayed in a different color so as to be easily identified.
Y-variables with large loadings in early components are the ones that are most easily modeled as a function of
the X-variables.
Note: Passified variables are displayed in a different color so as to be easily identified.
Loading Weights
(Line Plot)
This is a plot of the X-loading weights for specified components from a PLS analysis. It can be useful for detecting which X-variables are most important for predicting Y, although it is better to use the 2D scatter plot of X-loading weights and Y-loadings.
Note 1: The X-loading weights for PC1 are exactly the same as the regression coefficients for PC1.
Note 2: Passified variables are displayed in a different color so as to be easily identified.
[Figure: line plot of loading weights along the variables axis, with curves for Whiteness, Greasiness and Meat Taste.]
p-Values of Effects
(Line Plot)
This is a plot of the p-values of the effects in the model. Small values (for instance less than 0.05 or 0.01)
indicate that the effect is significantly different from zero, i.e. that there is little chance that the observed effect
is due to mere random variation.
p-Values of Regression Coefficients
(Line Plot)
This is a plot of the p-values for the different regression coefficients (B). Small values (for instance less than
0.05 or 0.01) indicate that the corresponding variable has a significant effect on the response (given that all the
other variables are present in the model).
Since the predictors are kept in their original scales, the coefficients do not reflect the relative importance of
the X-variables in the model.
Predictors with a small coefficient are negligible. You can mark them and recalculate the model without those
variables.
With all X1-variables along the abscissa; Y is fixed (as selected in the Regression Coefficients plot
dialog), and the plot shows one curve for each X2-variable;
With all X2-variables along the abscissa; Y is fixed (as selected in the Regression Coefficients plot
dialog), and the plot shows one curve for each X1-variable.
The plot can be interpreted by looking for regions in X1 (resp. X2) with large positive or negative coefficients
for some or all of the X2- (resp. X1-) variables. In the example below, the most interesting X1-region with
respect to response Severity is around 350, with three additional peaks: 250-290, 390-400 and 550-560.
Line plot of X1-Regression Coefficients for response Severity
Coefficients close to zero indicate an unimportant variable. The coefficient value indicates the average increase in Y when the corresponding X-variable is increased by one unit, keeping all other variables constant.
The critical value for the different regression coefficients (5% level) is indicated by a straight line. A
coefficient with a larger absolute value than the straight line, is significant in the model.
The plots of the t- and p-values for the different coefficients may also be added.
RMSE
(Line Plot)
This plot gives the square root of the residual variance for individual responses, back-transformed into the
same units as the original response values. This is called
RMSEC (Root Mean Square Error of Calibration) if you are plotting Calibration results;
RMSEP (Root Mean Square Error of Prediction) if you are plotting Validation results.
The RMSE is plotted as a function of the number of components in your model. There is one curve per
response (or two if you have chosen Cal and Val together). You can detect the optimal number of components:
this is where the Val curve (i.e. RMSEP) reaches a minimum.
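Reading the optimal number of components off the RMSEP curve can be sketched as follows (hypothetical values):

```python
import numpy as np

rmsep = np.array([2.10, 1.40, 0.95, 0.90, 0.97, 1.10])  # hypothetical curve
optimal = int(np.argmin(rmsep)) + 1   # components are counted from 1
print(optimal)   # 4: where the validation curve reaches its minimum
```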
The size of the residuals tells you about the misfit of the model. It may be a good idea to compare the sample
residuals from an MCR fitting to a PCA fit on the same data (displayed on the plot of Sample Residuals, PCA
Fitting). Since PCA provides the best possible fit along a set of orthogonal components, the comparison tells
you how well the MCR model is performing in terms of fit.
Note that, in the MCR Overview, both plots are displayed side by side in the lower part of the Viewer. Check
the scale of the vertical axis on each plot to compare the sizes of the residuals.
Sample Residuals, PCA Fitting
(Line Plot)
This plot is available when viewing the results of an MCR model. It displays the sample residuals from a PCA
model on the same data.
This plot is supposed to be used as a basis for comparison with the Sample Residuals, MCR fit (the actual
residuals from the MCR model). Since PCA provides the best possible fit along a set of orthogonal
components, the comparison tells you how well the MCR model is performing in terms of fit.
Note that, in the MCR Overview, both plots are displayed side by side in the lower part of the Viewer. Check
the scale of the vertical axis on each plot to compare the sizes of the residuals.
In contrast to the variable residual plot, which gives information about residuals for all samples for a particular
variable, this plot gives information about all possible variables for a particular sample. It is therefore useful
when studying how a specific sample fits to the model.
This plot gives information about all possible variables for a particular sample (as opposed to the variable
residual plot, which gives information about residuals for all samples for a particular variable), and therefore
indicates how well a specific sample fits to the model.
Scores
(Line Plot)
This is a plot of score values versus sample number for a specified component. Although it is usually better to
look at 2D or 3D score plots because they contain more information, this plot can be useful whenever the
samples are sorted according to the values of an underlying variable, e.g. time, to detect trends or patterns.
The smaller the vertical variation (i.e. the closer the score values are to each other), the more similar the
samples are for this particular component. Look for samples which have a very large positive or negative score
value compared to the others: these may be outliers.
[Figure: line plot of score values vs. sample number; one outlier sample has an extreme score.]
Also look for systematic patterns, like a regular increase or decrease, periodicity, etc. (only relevant if the
sample number has a meaning, like time for instance).
Line plot of the scores for time-related data
[Figure: score values vs. sample number showing periodic behavior.]
Standard Error of Regression Coefficients
(Line Plot)
This is a plot of the standard errors of the different regression coefficients (B). These values can be used to
compare the precision of the estimations of the coefficients. The smaller the standard error, the more reliable
the estimated regression coefficient.
It may be a good idea to compare the total residuals from an MCR fitting to a PCA fit on the same data
(displayed on the plot of Total Residuals, PCA Fitting). Since PCA provides the best possible fit along a set of
orthogonal components, the comparison tells you how well the MCR model is performing in terms of fit.
Display the two plots side by side in the Viewer. Check the scale of the vertical axis on each plot (and adjust it
if necessary, using View - Scaling - Min/Max) before you compare the sizes of the total residuals.
Total Residuals, PCA Fitting
(Line Plot)
This plot is available when viewing the results of an MCR model. It displays the total residuals from a PCA
model on the same data.
This plot is supposed to be used as a basis for comparison with the Total Residuals, MCR fit (the actual
residuals from the MCR model). Since PCA provides the best possible fit along a set of orthogonal
components, the comparison tells you how well the MCR model is performing in terms of fit.
Display the two plots side by side in the Viewer. Check the scale of the vertical axis on each plot (and adjust it
if necessary, using View - Scaling - Min/Max) before you compare the sizes of the total residuals.
[Figure: residual variance vs. number of PCs for a good model.]
Calibration variance is based on fitting the calibration data to the model. Validation variance is computed by
testing the model on data which was not used to build the model. Compare the two variances: if they differ
significantly, there is good reason to question whether either the calibration data or the test data are truly
representative. The figure below shows a situation where the residual validation variance is much larger than
the residual calibration variance (or the explained validation variance is much smaller than the explained
calibration variance). This means that although the calibration data are well fitted (small residual calibration
variances), the model does not describe new data well (large residual validation variance).
Total residual variance curves for Calibration and Validation
[Figure: residual variance vs. number of PCs; the Validation curve remains well above the Calibration curve.]
Outliers can sometimes cause large residual variance (or small explained variance).
[Figure: residual variance vs. number of PCs for a good model.]
Calibration variance is based on fitting the calibration data to the model. Validation variance is computed by
testing the model on data which was not used to build the model. Compare the two variances: if they differ
significantly, there is good reason to question whether either the calibration data or the test data are truly
representative. The figure below shows a situation where the residual validation variance is much larger than
the residual calibration variance (or the explained validation variance is much smaller than the explained
calibration variance). This means that although the calibration data are well fitted (small residual calibration
variances), the model does not describe new data well (large residual validation variance).
Total residual variance curves for Calibration and Validation
[Figure: residual variance vs. number of PCs; the Validation curve remains well above the Calibration curve.]
Outliers can sometimes be the reason for large residual variance (or small explained variance).
The size of the residuals tells you about the misfit of the model. It may be a good idea to compare the variable
residuals from an MCR fitting to a PCA fit on the same data (displayed on the plot of Variable Residuals, PCA
Fitting). Since PCA provides the best possible fit along a set of orthogonal components, the comparison tells
you how well the MCR model is performing in terms of fit.
Display the two plots side by side in the Viewer. Check the scale of the vertical axis on each plot to compare
the sizes of the residuals.
If you find that some variables have much larger residual variance than all the other variables for all
components in your model (or for the first 3-4 of them), try rebuilding the model with these variables deleted.
This may produce a model which is easier to interpret.
Calibration variance is based on fitting the model to the calibration data. Validation variance is computed by
testing the model on data not used in calibration.
If some Y-variables have much larger residual variance than the others for all components (or for the first 3-4
of them), you will not be able to predict them correctly. If your purpose is just to interpret variable
relationships, you may keep these variables in the model, but remember that they are badly explained. If you
intend to make precise predictions, you should recalculate your model without these variables, because the
model will not succeed in predicting them anyway. Removing these variables may help the model explain the
other Y-variables with fewer components.
Calibration variance is based on fitting the model to the calibration data. Validation variance is computed by
testing the model on new data, not used at the calibration stage. Validation variance is the one which matters
most to detect which Y-variables will be predicted correctly.
X-variable Residuals
(Line Plot)
This is a plot of residuals for a specified X-variable and component number, for all the samples. The plot is
useful for detecting outlying sample/variable combinations, as shown below. An outlier can sometimes be
modeled by incorporating more components. This should, however, be avoided since it will reduce the
prediction ability of the model.
Line plot of the variable residuals: one sample is outlying
Whereas the sample residual plot gives information about residuals for all variables for a particular sample, this
plot gives information about all possible samples for a particular variable. It is therefore more useful when you
want to investigate how one specific variable behaves in all the samples.
One primary variable selected: a matrix plot shows the residuals for all samples x all secondary variables.
One secondary variable selected: a matrix plot shows the residuals for all samples x all primary variables.
One primary variable and one secondary variable selected: a line plot shows the residuals for all samples.
Samples with small residual variance (or large explained variance) for a particular component are well
explained by the corresponding model, and vice versa.
Explained X-Variance per variable (Raspberry, Color, Sweetness) for PC: 1, 2
The plot shows which components contribute most to summarizing the variations in each individual variable.
For instance, in the example above, PC1 summarizes most of the variations in Color, and PC2 does not add
anything to that summary. On the other hand, Raspberry is badly described by PC1, and PC2 is necessary to
achieve a good summary.
Use menu option Edit - Mark - Outliers Only (or its corresponding shortcut button) if you want the system
to mark the badly described variables. For instance, in the example above, variable Sweetness is badly
described by a model with 2 components. Try to re-calculate the model with one more component! If you
already have many components in your model, badly described variables are either noisy variables (they have
little meaningful variation, and can be removed from the analysis), or variables with some data errors.
Y-variable Residuals
(Line Plot)
This is a plot of residuals for a specified Y-variable and component number, for all the samples. The plot is
useful for detecting outlying sample or variable combinations, as shown in the figure below. An outlier can
sometimes be modeled by incorporating more components. This should be avoided since it will reduce the
prediction ability of the model, especially if the outlier is due to an anomaly in your original data (e.g.
experimental error).
Line plot of the variable residuals: one sample is outlying
This plot gives information about all possible samples for a particular variable (as opposed to the sample
residual plot, which gives information about residuals for all variables for a particular sample) hence it is more
useful for studying how a specific variable behaves for all the samples.
Small residual variance (or large explained variance) indicates that, for a particular number of components, the
samples are well explained by the model.
Explained Y-Variance per variable (Raspberry, Color, Sweetness) for PC: 1, 2
The plot shows which components contribute most to summarizing the variations in each individual response
variable. For instance, in the example above, PC1 summarizes most of the variations in Color, and PC2 does
not add anything to that summary. On the other hand, Raspberry is badly described by PC1, and PC2 is
necessary to achieve a good summary.
Use menu option Edit - Mark - Outliers Only (or its corresponding shortcut button) if you want the system
to mark the badly described variables. For instance, in the example above, variable Sweetness is badly
described by a model with 2 components. Try to re-calculate the model with one more component! If you
already have many components in your model, badly described response variables are either noisy variables
(they have little meaningful variation, and can be removed from the analysis), or variables with some data
errors, or responses which cannot be related to the predictors you have chosen to include in the analysis.
2D Scatter Plots
Classification Scores (2D Scatter Plot)
This is a two dimensional scatter plot or map of scores for (PC1,PC2) from a classification. The plot is
displayed for one class model at a time. All new samples (the samples you are trying to classify) are shown.
This plot shows how the new samples are projected onto the class model. Members of a particular class are
expected to be close to the center of the plot (the origin), while non-members should be projected far away
from the center.
If you are classifying known samples, this plot helps you detect classification outliers. Look for known
members projected far away from the center (false negatives), or known non-members projected close to the
center (false positives). There may be errors in the data: check your data and correct them if necessary.
Coomans Plot
This plot shows the orthogonal distances from the new objects to two different classes (models) at the same
time. The membership limits (S0) are indicated. Membership limits reflect the significance level used in the
classification.
Samples which fall within the membership limit of a class are recognized as members of that class. Different
colors denote different types of sample: new samples being classified, calibration samples for the model along
the abscissa (A) axis, calibration samples for the model along the ordinate (B) axis, as shown in the figure
below.
Coomans plot: sample distance to Model A (abscissa) vs. sample distance to Model B (ordinate). The membership limits for the two models divide the plot into four regions: samples belonging to Model A only, samples belonging to Model B only, samples belonging to both models, and samples belonging to none of the models.
independent variations. Variables Raspberry and Off-flavor are negatively correlated. Variable Sweet cannot
be interpreted in this plot, because it is very close to the center.
Loadings of 6 sensory variables (Raspberry, Thick, Sweet, Redness, Color, Off-flavor) along (PC1,PC2)
Note: Variables lying close to the center are poorly explained by the plotted PCs. You cannot interpret them in
that plot!
Predictors (X) projected in roughly the same direction from the center as a response are positively linked
to that response. In the example below, predictors Sweet, Red and Color have a positive link with response
Pref.
Predictors projected in the opposite direction have a negative link, as predictor Thick in the example
below.
Predictors projected close to the center, as Bitter in the example below, are not well represented in that plot
and cannot be interpreted.
One response (Pref) and 5 sensory predictors (Sweet, Thick, Bitter, Red, Color) along (PC1,PC2)
Caution!
If your X-variables have been standardized, you should also standardize the Y-variable so that the X- and Y-loadings have the same scale; otherwise the plot may be difficult to interpret.
X-loading Weights
(2D Scatter Plot)
This is a two dimensional scatter plot of X-loading weights for two specified components from a PLS or a tri-PLS analysis.
In PLS, this plot can be useful for detecting which X-variables are most important for predicting Y, although in
that case it is better to use the 2D scatter plot of X-loading weights and Y-loadings.
Note: Passified variables are displayed in a different color so as to be easily identified.
How to interpret scores and loadings together (example of the bi-plot), see p.217
The predicted Y-value from the model is plotted against the measured Y-value. This is a good way to check the
quality of the regression model. If the model gives a good fit, the plot will show points close to a straight line
through the origin and with slope equal to 1. Turn on Plot Statistics (using the View menu) to check the
slope and offset, and RMSEP/RMSEC.
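For readers who want to verify these statistics outside the program, here is a minimal Python sketch; the function and array names are ours for illustration, not part of The Unscrambler:

    import numpy as np

    def pred_vs_measured_stats(measured, predicted):
        # Least-squares line through the points: slope ~1 and offset ~0 mean a good fit
        slope, offset = np.polyfit(measured, predicted, deg=1)
        # Root mean square error between predicted and measured values
        rmse = np.sqrt(np.mean((predicted - measured) ** 2))
        return slope, offset, rmse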
The figures below show two different situations: one indicating a good fit, the other a poor fit of the model.
Predicted vs. Measured shows how well the model fits (left: good fit; right: bad fit; Predicted Y plotted against Measured Y)
You may also see cases where the majority of the samples lie close to the line while a few of them are further
away. This may indicate good fit of the model to the majority of the data, but with a few outliers present (see
the figure below).
Predicted vs. Measured with one outlier lying far away from the line
In other cases, there may be a non-linear relationship between the X- and Y-variables, so that the predictions
do not have the same level of accuracy over the whole range of variation of Y. In such cases, the plot may look
like the one shown below. Such non-linearities should be corrected if possible (for instance by a suitable
transformation), because otherwise there will be a systematic bias in the predictions depending on the range of
the sample.
Predicted vs. Measured shows a non-linear relationship: systematic positive bias in one part of the range, systematic negative bias in another
This is a plot of predicted Y-values versus the true (measured) reference Y-values. You can use it to check
whether the model predicts new samples well. Ideally the predicted values should be equal to the reference
values.
Note that this plot is built in the same way as the Predicted vs. Measured plot used during calibration. You can
also turn on Plot Statistics (use the View menu) to display the slope and offset of the regression line, as well
as the true value of the RMSEP for your predicted values.
Projected Influence Plot
(3 x 2D Scatter Plots)
This is the projected view of a 3D influence plot. In addition to the original 3D plot, you can see the following:
Scatter Effects
This plot shows each sample plotted against the average sample. Scatter effects appear as differences in slope
and/or offset between the lines in the plot. Differences in the slope are caused by multiplicative scatter effects.
Offset error is due to additive effects.
Applying Multiplicative Scatter Correction will improve your model if you detect these scatter effects in your
data table. The examples below show what to look for.
Two cases of scatter effects: for each sample i, Absorbance(i,k) is plotted against Absorbance(average,k) of the average spectrum, for all wavelengths k
How Multiplicative Scatter Correction works: see the corresponding section of this manual.
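As a rough illustration of the principle behind Multiplicative Scatter Correction (not The Unscrambler's implementation), each spectrum is regressed on the average spectrum and the fitted offset and slope are removed:

    import numpy as np

    def msc(spectra):
        # spectra: 2-D array, one row per sample, one column per wavelength
        mean_spectrum = spectra.mean(axis=0)
        corrected = np.empty_like(spectra, dtype=float)
        for i, x in enumerate(spectra):
            # Fit x ~ a + b * mean_spectrum (a = additive, b = multiplicative effect)
            b, a = np.polyfit(mean_spectrum, x, deg=1)
            corrected[i] = (x - a) / b
        return corrected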
Scores
This is a two dimensional scatter plot (or map) of scores for two specified components (PCs) from PCA, PCR,
or PLS. The plot gives information about patterns in the samples. The score plot for (PC1,PC2) is especially
useful, since these two components summarize more variation in the data than any other pair of components.
The closer the samples are in the score plot, the more similar they are with respect to the two components
concerned. Conversely, samples far away from each other are different from each other.
The plot can be used to interpret differences and similarities among samples. Look at the present plot together
with the corresponding loading plot, for the same two components. This can help you determine which
variables are responsible for differences between samples. For example, samples to the right of the score plot
will usually have a large value for variables to the right of the loading plot, and a small value for variables to
the left of the loading plot.
Here are some things to look for in the 2D score plot.
(Bi-plot)
This is a two dimensional scatter plot or map of scores for two specified components (PCs), with the X-loadings displayed on the same plot. It is called a bi-plot. It enables you to interpret sample properties and
variable relationships simultaneously.
Scores
The closer two samples are in the score plot, the more similar they are with respect to the two components
concerned. Conversely, samples far away from each other are different from each other.
2- Are the samples evenly spread over the whole region, or is there any accumulation of samples at one end?
The figure below shows a typical fan-shaped layout, with most samples accumulated to the right of the plot,
then progressively spreading more and more. This means that the variables responsible for the major variations
are asymmetrically distributed. If you encounter such a situation, study the distributions of those variables
(histograms), and use an appropriate transformation (most often a logarithm).
Asymmetrical distribution of the samples on a score plot (PC1 vs. PC2)
3- Are some samples very different from the rest? This can indicate that they are outliers, as shown in the
figure below. Outliers should be investigated: there may have been errors in data collection or transcription, or
those samples may have to be removed if they do not belong to the population of interest.
Score plot with one outlier lying far from the rest of the samples (PC1 vs. PC2)
Loadings
The plot shows the importance of the different variables for the two components specified. Variables with
loadings to the right in the loadings plot will be variables which usually have high values for samples to the
right in the score plot, etc.
Note: Passified variables are displayed in a different color so as to be easily identified.
Loadings of the 6 sensory variables (Raspberry, Thick, Sweet, Redness, Color, Off-flavor) along (PC1,PC2)
Bi-plot: scores of the jam samples (Jam1 to Jam9) and loadings of the sensory variables along (PC1,PC2)
Samples falling within both limits for a class are recognized as members of that class. The level of the limits is
governed by the significance level used in the classification.
Membership limits on the Si vs. Hi plot: the Si limit and the leverage limit divide the plot (Si vs. Leverage Hi) into regions where samples belong to the model, belong to the model with respect to leverage only, or do not belong to the model.
Si/S0 vs. Hi
The Si/S0 vs. Hi plot shows the two limits used for classification: the relative distance from the new sample to
the model (residual standard deviation) and the leverage (distance from the new sample to the model center).
Note: If you select None as significance level, no membership limits are drawn.
Samples which fall within both limits for a particular class are said to belong to that class. The level of the
limits is governed by the significance level used in the classification.
Membership limits on the Si/S0 vs. Hi plot: the Si/S0 limit and the leverage limit define the same regions as on the Si vs. Hi plot.
Detecting Outliers
A sample may be outlying according to the X-variables only, or to the Y-variables only, or to both. It may also
not have extreme or outlying values for either separate set of variables, but become an outlier when you
consider the (X,Y) relationship. In the X-Y Relation Outlier plot, such a sample sticks out as being far away
from the relation defined by the other samples, as shown in the figure below. Check your data: there may be a
data transcription error for that sample.
A simple X-Y outlier: U scores plotted against T scores, with one sample far from the relation defined by the others
If a sample sticks out in such a way that it is projected far away from the center along the model component,
we have an influential outlier (see the figure below). Such samples are dangerous to the model: they change the
orientation of the component. Check your data. If there is no data transcription error for that sample,
investigate more and decide whether it belongs to another population. If so, you may remove that sample (mark
it and recalculate the model without the marked sample). If not, you will have to gather more samples of the
same kind, in order to make your data more balanced.
An influential outlier: U scores plotted against T scores; the influential outlier pulls the regression line away from the line fitted without it
Curved shape of the true relationship (U scores plotted against T scores)
A sigmoid-shaped curvature may indicate that there are interactions between the predictors. Adding cross-terms
to the model may improve it.
Sample groups may indicate the need for separate modeling of each subgroup.
The presence of an outlier is shown in the example below. The outlying sample has a much larger residual than
the others; however, it does not seem to disturb the model to a large extent.
A simple outlier has a large residual (residuals plotted against Predicted Y)
The figure below shows the case of an influential outlier: not only does it have a large residual, it also attracts
the whole model so that the remaining residuals show a very clear trend. Such samples should usually be
excluded from the analysis, unless there is an error in the data or some data transformation can correct for the
phenomenon.
An influential outlier changes the structure of the residuals: the outlier has a large residual, and a clear trend appears in the remaining residuals (plotted against Predicted Y)
Small residuals (compared to the variance of Y) which are randomly distributed indicate adequate models.
3D Scatter Plots
Influence Plot, X- and Y-variance (3D Scatter Plot)
This is a plot of the residual X- and Y-variances versus leverages. Look for samples with a high leverage and
high residual X- or Y-variance.
To study such samples in more detail, we recommend that you mark them and then plot X-Y relation outliers
for several model components. This way you will detect whether they have an influence on the shape of the X-Y relationship, in which case they would be dangerous outliers.
The plot is usually easier to read in its projected version. See Projected Influence Plot (3 x 2D Scatter
Plots) for more details.
X-loading Weights
(3D Scatter Plot)
This is a three dimensional scatter plot of X-loading weights for three specified components from PLS; this
plot may be difficult to interpret, both because it is three-dimensional and because it does not include the Y-loadings. Thus we would usually recommend that you use the 2D scatter plot of X-loading weights and Y-loadings instead.
Note: Passified variables are displayed in a different color so as to be easily identified.
Scores
This is a 3D scatter plot or map of the scores for three specified components from PCA, PCR, or PLS. The plot
gives information about patterns in the samples and is most useful when interpreting components 1, 2 and 3,
since these components summarize most of the variation in the data. It is usually easier to look at 2D score
plots but if you need three components to describe enough variation in the data, the 3D plot is a practical
alternative.
As with the 2D plot, the closer the samples are in the 3D score plot, the more similar they are with respect to
the three components.
The 3D plot can be used to interpret differences and similarities among samples. Look at the score plot and the
corresponding loadings plot, for the same three components. Together they can be used to determine which
variables are responsible for differences between samples. Samples with high scores along the first component
usually have large values for variables with high loadings along the first component, etc.
Here are a few patterns to look for in a score plot.
3D score plot with one outlier sticking out from the rest of the samples
Check how much of the total variation is explained by each component (these numbers are displayed at the
bottom of the plot). If it is large, the plot shows a significant portion of the information in your data and you
can use it to interpret relationships with a high degree of certainty. If the explained variation is smaller, you
may need to study more components, consider a transformation, or there may be little information in the
original data.
Matrix Plots
Leverages
(Matrix Plot)
This is a matrix plot of leverages for all samples and all model components. It is a useful plot for studying how
the influence of each sample evolves with the number of components in the model.
If you have weighted your predictor variables with 1/Sdev (standardization), the weighted regression
coefficients (BW) take these weights into account. Since all predictors are brought back to the same scale, the
coefficients show the relative importance of those variables in the model.
Predictors with a large weighted coefficient play an important role in the regression model; a positive
coefficient shows a positive link with the response, and a negative coefficient shows a negative link.
Predictors with a small weighted coefficient are negligible. You can recalculate the model without those
variables.
The raw regression coefficients are those that may be used to write the model equation in original units:
Y = B0 + B1 * X-variable1 + B2 * X-variable2 + …
Since the predictors are kept in their original scales, the coefficients do not reflect the relative importance of
the X-variables in the model.
The raw coefficients do not reflect the importance of the X-variables in the model, because the sizes of these
coefficients depend on the range of variation (and indirectly, on the original units) of the X-variables.
A predictor with a small raw coefficient does not necessarily indicate an unimportant variable;
a predictor with a large raw coefficient does not necessarily indicate an important variable.
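As a small illustration of the model equation above, with hypothetical raw coefficients:

    import numpy as np

    B0 = 2.5                         # intercept (hypothetical)
    B = np.array([0.8, -1.2, 0.05])  # raw coefficients, one per X-variable (hypothetical)

    def predict_y(x):
        # Y = B0 + B1 * X1 + B2 * X2 + ... in original units
        return B0 + B @ x

    print(predict_y(np.array([1.0, 0.5, 100.0])))  # 2.5 + 0.8 - 0.6 + 5.0 = 7.7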
The matrix plot of X1- vs X2-regression coefficients gives you a graphical overview of the regions in your 3-D
arrays which are important for a given response. In the example below, you can see that most of the
information relevant to the prediction of response Severity is concentrated around X1= 250-400 and X2=
300-450, with an additional interesting spot around X1=550 and X2=600.
X1 vs X2 Matrix plot of Regression Coefficients for response Severity
If you have several responses, use the X1 vs Y and X2 vs Y plots to get an overview of one mode with respect
to all responses simultaneously. This will allow you to answer questions such as:
- Is there a region of mode 1 (resp. 2) which is important for several responses?
- Is the relationship between X1 and Y the same for all responses?
- Is there a region of mode 1 (resp. 2) which does not play any role for any of the responses? If so, it may be
removed from future models.
Contour plot;
Landscape plot.
If you want to interpret several responses together, print out their contour plots on color transparencies and
superimpose the maps.
Contour map over (X1, X2): the Path of Steepest Ascent indicates the direction in which to continue experimentation
All other values are between -1 and +1. A large positive value (as shown in red on the figure below) indicates
that the corresponding two variables have a tendency to increase simultaneously. A large negative value (as
shown in blue on the figure below) indicates that when the first variable increases, the other often decreases. A
correlation close to 0 (light green on the figure below) indicates that the two variables vary independently from
each other.
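The values displayed in this plot are ordinary pairwise correlation coefficients. A minimal sketch, with random data standing in for real measurements:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(20, 6))         # 20 samples x 6 variables (stand-in data)
    corr = np.corrcoef(data, rowvar=False)  # 6 x 6 matrix of pairwise correlations
    print(corr.min(), corr.max())           # diagonal is 1.0; the minimum need not reach -1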
The best layouts for studying cross-correlations are bars (used as default) or map.
Cross-correlation plot for cheese data (variables Glossy, Shape, Adh, Firm, Grainy, Cond, Sticky, Melt), shown with the Bars and Map layouts; the color scale runs from -0.952 to +1.000
Note:
Be careful when interpreting the color scale of the plot; not all data sets have correlations varying from -1 to
+1. The highest value will always be +1 (diagonal), but the lowest may not even be below zero! This may
happen for instance if you are studying several measurements that all capture more or less the same
phenomenon, e.g. texture or light absorbance in a narrow range.
Look at the values on the color scale before jumping to conclusions!
This is a normal probability plot of all the effects included in an Analysis of Effects model. Effects in the upper
right or lower left of the plot deviating from a fictitious straight line going through the medium effects are
potentially significant. The figure below shows such an example where A, B, and AB are potentially
significant. More specific results about significance can be obtained from other plots, for instance the line plot
of individual effects with p-values, or the effects table.
Two positive and one negative effect are sticking out: effects A, B and AB deviate from the straight line on the normal probability plot of the effects
You may manually draw a line on the plot with menu option Edit - Insert Draw Item - Line.
If the plot shows a strong deviation from a straight line, the residuals are not normally distributed, as in the
figure below. In some cases - but not always - this can indicate lack of fit of the model. However, it can also be
an indication that the error terms are simply not normally distributed.
The residuals have a regular but non-normal distribution (normal probability plot of the Y-residuals)
You may manually draw a line on the plot with menu option Edit - Insert Draw Item - Line.
Table Plots
ANOVA Table (Table Plot)
The ANOVA table contains degrees of freedom, sums of squares, mean squares, F-values and p-values for all
sources of variation included in the model.
The Multiple Correlation coefficient and the R-square are also presented above the main table. A value close to
1 indicates a good fit, while a value close to 0 indicates a poor fit.
For Response surface analyses, a Model check and a Lack of fit test are displayed after the Variables part of
the ANOVA table. The table may also include a significance test for the intercept, and the coordinates of
max/min/saddle points.
Model Check
The model check tests whether the non-linear part of the model is significant. It includes up to three groups of
effects:
Squares (and how they improve a model which already contains interactions);
Lack of Fit
The lack of fit part tests whether the error in response prediction is mostly due to experimental variability or to
an inadequate shape of the model. If the p-value for lack of fit is smaller than 0.05, it means that the model
does not describe the true shape of the response surface. In such cases, you may try a transformation of the
response variable.
Note that:
1. For screening designs, all terms in the ANOVA table will be missing if there are as many terms in the
model as cube samples (i.e. you have a saturated model). In such cases, you cannot use HOIE for significance
testing; try Center samples, Reference samples or COSCIND!
2. If your design has design variables with more than two levels, use Multiple Comparisons in order to see
which levels of a given variable differ significantly from each other.
3. Lack of fit can only be tested if the replicated center samples do not all have the same response values
(which may sometimes happen by accident).
The outcome of the classification depends on the significance limit; by default it is set to 5%, but you can tune
it up or down with the corresponding toolbar tool.
Look for samples that are not recognized by any of the classes, or those which are allocated to more than one
class.
Detailed Effects
(Table Plot)
This table gives the numerical values of all effects and their corresponding F-ratios and p-values, for the current
response variable. The multiple correlation coefficient and the R-square, which measure the degree of fit of the
model, are also presented above the table. A value close to 1 indicates a model with good fit and a value close
to 0 indicates bad fit.
Interpreting Effects
This table is particularly useful to display the significance of the effects together with the confounding pattern,
for fractional factorial designs where significant effects should be interpreted with caution. If there is any
significant effect in your model (p-value smaller than 0.05), check whether this effect has any confounding. If
so, you may try an educated guess to find out which of the confounded terms is responsible for the observed
effect.
Curvature Check
If you have included replicated center samples in your design, and if you are interpreting your effects with the
Center significance testing method, you will also find the p-value for the curvature test above the table. A
p-value smaller than 0.05 means that you have a significant curvature: you will need an optimization stage to
describe the relationship between your design variables and your response properly.
Symbols used in the Effects Overview table, by p-value:

p-value         Negative effect    Positive effect
> 0.05          NS                 NS
0.01 - 0.05     -                  +
0.005 - 0.01    --                 ++
< 0.005         ---                +++
Note: If some of your design variables have more than 2 levels, the Effects Overview table contains stars (*)
instead of + and - signs.
(Table Plot)
This table shows the measured and predicted Y values from the response surface model, plus their
corresponding X-values and standard error of prediction.
Special Plots
Interaction Effects (Special Plot)
This plot visualizes the interaction between two design variables.
The plot shows the average response value at the Low and High levels of the first design variable, in two
curves: one for the Low level of the second design variable, the other for its High level.
You can see the magnitude of the interaction effect (1/2 * change in the effect of the first design variable when
the second design variable changes from Low to High).
For a positive interaction, the slope of the effect for "High" is larger than for Low;
For a negative interaction, the slope of the effect for "High" is smaller than for Low.
In addition, the plot also contains information about the value of the interaction effect and its significance (p-value, computed with the significance testing method you have chosen).
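As an illustration, the interaction effect can be computed from the average responses at the four level combinations; the response values below are hypothetical:

    # Hypothetical average responses at the four (A, B) level combinations
    y_low_low,  y_high_low  = 10.0, 14.0   # B at its Low level
    y_low_high, y_high_high = 12.0, 22.0   # B at its High level

    effect_A_at_B_low  = y_high_low  - y_low_low    # 4.0
    effect_A_at_B_high = y_high_high - y_low_high   # 10.0
    # Interaction = 1/2 * change in the effect of A when B goes from Low to High
    interaction_AB = (effect_A_at_B_high - effect_A_at_B_low) / 2.0  # 3.0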
Main Effects
(Special Plot)
This plot visualizes the main effect of a design variable on a given response.
The plot shows the average response value at the Low and High levels of the design variable. If you have
included center samples, the average response value for the center samples is also displayed.
You can see the magnitude of the main effect (change in the response value when the design variable increases
from Low to High). If you have center samples, you can also detect a curvature visually.
In addition, the plot also contains information about the value of the effect and its significance (p-value,
computed with the significance testing method you have chosen).
Mean
Mean and Sdev for 3 responses (Whiteness, Elasticity, Greasiness), with groups Design samples and Center samples
Multiple Comparisons
(Special Plot)
This is a comparison of the average response values for the different levels of a design variable. It tells you
which levels of this variable are responsible for a significant change in the response. This plot displays one
design variable and one response variable at a time. Look at the plot ID to check which variables are plotted.
The names of the different levels are displayed to the right of the plot, at the same height as the average
response value. If a reference value has been defined in the dialog, it is indicated by circles to the right of
the plot.
Levels which cannot be distinguished statistically are displayed as points linked by a gray vertical bar.
Two levels have significantly different average response values if they are not linked by any bar.
Note that, if there are fewer than five samples in the data set, the percentiles are not calculated. The plot then
displays one small horizontal bar for each value (each sample). Otherwise, individual samples do not appear on
the plot, except for the maximum and minimum values.
Check that the spread (distance between Min and Max) over the Center samples is much smaller than the
spread over the Design samples. If not, either
Interpretation: Spectra
This plot can also be used as a diagnostic tool to study the distribution of a whole set of related variables, such
as, in spectroscopy, the absorbances at several wavelengths. In such cases, we recommend not using
subgroups, since otherwise the plot would be too complex to provide interpretable information.
In the figure below, the percentile plot enables you to study the general shape of the spectrum, which is
common to all samples in the data set, and also to detect which wavelengths have the largest variation; these
are probably the most informative wavelengths.
Percentile plot for variables building up a spectrum: the wavelengths with the largest variation are the most informative
Sometimes, some of the variation may not be relevant to your problem. This is the case in the figure below,
which shows an almost uniform spread over all wavelengths. This is very suspicious, since even wavelengths
with absorbances close to zero (i.e. baseline) have a large variation over the collected samples. This may
indicate a baseline shift, which you can correct using multiplicative scatter correction (MSC). Try to plot
scatter effects to check that hypothesis!
As much variation for the baseline as for the peaks is suspicious
Percentiles plotted over all variables, with a suspiciously large spread for the baseline
Predicted with Deviations
(Special Plot)
This is a plot of predicted Y-value for all prediction samples. The predicted value is shown as a horizontal line.
Boxes around the predicted value indicate the deviation, i.e. whether the prediction is reliable or not.
Predicted Y-values with deviation boxes around each prediction sample
The deviations are computed as a function of the global model error, the sample leverage, and the sample
residual X-variance. A large deviation indicates that the sample used for prediction is not similar to the
samples used to make the calibration model. This is a prediction outlier: check its values for the X-variables. If
there has been an error, correct it; if the values are correct, the conclusion is that the prediction sample does not
belong to the same population as the samples your model is based upon, and you cannot trust the predicted Y
value.
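The exact formula is not given here; as a loose illustration only, the three named ingredients could be combined along these lines:

    import numpy as np

    def prediction_deviation(global_error, leverage, res_x_var, avg_res_x_var):
        # Illustrative combination only -- NOT The Unscrambler's exact formula:
        # the deviation grows with the global model error, the sample leverage and
        # the sample residual X-variance relative to its calibration average.
        return global_error * np.sqrt(1.0 + leverage + res_x_var / avg_res_x_var)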
Glossary of Terms
2-D Data
This is the most usual data structure in The Unscrambler, as opposed to 3-D data.
3-D Data
Data structure specific to The Unscrambler which accommodates three-way arrays. A 3-D data table can be
created from scratch or imported from an external source, then freely manipulated and re-formatted. Note that
analyses meant for two-way data structures cannot be run directly on a 3-D data table. You can analyze 3-D
X-data together with 2-D Y-data in a Three-Way PLS regression model. If you want to analyze your 3-D data
with a 2-way method, duplicate it to a 2-D data layout first.
3-Way PLS
See Three-Way PLS Regression.
Accuracy
The accuracy of a measurement method is its faithfulness, i.e. how close the measured value is to the actual
value.
Accuracy differs from precision, which has to do with the spread of successive measurements performed on the
same object.
Additive Noise
Noise on a variable is said to be additive when its size is independent of the level of the data value. The range
of additive noise is the same for small data values as for larger data values.
Analysis Of Effects
Calculation of the effects of design variables on the responses. It consists mainly of Analysis of Variance
(ANOVA), various Significance Tests, and Multiple Comparisons whenever they apply.
ANOVA
see Analysis of Variance.
Axial Design
One of the three types of mixture designs with a simplex-shaped experimental region. An axial design consists
of extreme vertices, overall center, axial points, end points. It can only be used for linear modeling, and
therefore it is not available for optimization purposes.
Axial Point
In an axial design, an axial point is positioned on the axis of one of the mixture variables, and must be above
the overall center, opposite the end point.
B-Coefficient
See Regression Coefficient.
Bias
Systematic difference between predicted and measured values. The bias is computed as the average value of
the residuals.
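For example, with hypothetical measured and predicted values:

    import numpy as np

    measured  = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical reference values
    predicted = np.array([1.2, 2.1, 3.3, 4.2])   # hypothetical predictions
    residuals = measured - predicted             # one common sign convention
    bias = residuals.mean()                      # -0.2: systematic over-prediction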
Bilinear Modeling
Bilinear modeling (BLM) is one of several possible approaches for data compression.
The bilinear modeling methods are designed for situations where collinearity exists among the original
variables. Common information in the original variables is used to build new variables, that reflect the
underlying (latent) structure. These variables are therefore called latent variables. The latent variables are
estimated as linear functions of both the original variables and the observations, thereby the name bilinear.
PCA, PCR and PLS are bilinear methods.
Schematic illustration: Data = Structure + Error (one row per observation).
Box-Behnken Design
A class of experimental designs for response surface modeling and optimization, based on only 3 levels of each
design variable. The mid-levels of some variables are combined with extreme levels of others. The
combinations of only extreme levels (i.e. cube samples of a factorial design) are not included in the design.
Box-Behnken designs are always rotatable. On the other hand, they cannot be built as an extension of an
existing factorial design, so they are best suited when the ranges of variation of some design variables have
been changed after a screening stage, or when it is necessary to avoid too extreme situations.
Box-plot
The Box-plot represents the distribution of a variable in terms of percentiles.
The plot shows, from bottom to top: the minimum value, the 25% percentile, the median, the 75% percentile and the maximum value.
Calibration
Stage of data analysis where a model is fitted to the available data, so that it describes the data as well as
possible.
After calibration, the variation in the data can be expressed as the sum of a modeled part (structure) and a
residual part (noise).
Calibration Samples
Samples on which the calibration is based. The variation observed in the variables measured on the calibration
samples provides the information that is used to build the model.
If the purpose of the calibration is to build a model that will later be applied on new samples for prediction, it is
important to collect calibration samples that span the variations expected in the future prediction samples.
Category Variable
A category variable is a class variable, i.e. each of its levels is a category (or class, or type), without any
possible quantitative equivalent.
Examples: type of catalyst, choice among several instruments, wheat variety, etc.
Candidate Point
In the D-optimal design generation, a number of candidate points are first calculated. These candidate points
consist of extreme vertices and centroid points. Then, a number of candidate points is selected D-optimally to
create the set of design points.
Center Sample
Sample for which the value of every design variable is set at its mid-level (halfway between low and high).
Center samples have a double purpose: introducing one center sample in a screening design enables curvature
checking, and replicating the center sample provides a direct estimation of the experimental error.
Center samples can be included when all design variables are continuous.
Centering
See Mean Centering.
Centroid Design
See Simplex-centroid design.
Centroid Point
A centroid point is calculated as the mean of the extreme vertices on the design region surface associated with
this centroid point. It is used in Simplex-centroid designs, axial designs and D-optimal mixture/non-mixture
designs.
Classification
Data analysis method used for predicting class membership. Classification can be seen as a predictive method
where the response is a category variable. The purpose of the analysis is to be able to predict which category a
new sample belongs to. The main classification method implemented in The Unscrambler is SIMCA
classification.
Classification can for instance be used to determine the geographical origin of a raw material from the levels of
various impurities, or to accept or reject a product depending on its quality.
To run a classification, you need
one or several PCA models (one for each class) based on the same variables;
Closure
In MCR, the Closure constraint forces the sum of the concentrations of all the mixture components to be equal
to a constant value (the total concentration) across all samples.
Collinear
See Collinearity.
Collinearity
Linear relationship between variables. Two variables are collinear if the value of one variable can be computed
from the other, using a linear relation. Three or more variables are collinear if one of them can be expressed as
a linear function of the others.
Variables which are not collinear are said to be linearly independent. Collinearity - or near-collinearity, i.e.
very strong correlation - is the major cause of trouble for MLR models, whereas projection methods like PCA,
PCR and PLS handle collinearity well.
Component
1) Context: PCA, PCR, PLS: See Principal Component.
2) Context: Curve Resolution: See Pure Components.
3) Context: Mixture Designs: See Mixture Components.
Condition Number
It is the square root of the ratio of the highest eigenvalue to the smallest eigenvalue of the experimental matrix.
The higher the condition number, the more spread the region. On the contrary, the lower the condition number,
the more spherical the region. The ideal condition number is 1; the closer to 1 the better.
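A minimal numerical sketch, assuming the eigenvalues referred to are those of X'X for an experimental matrix X:

    import numpy as np

    X = np.array([[-1., -1.], [1., -1.], [-1., 1.], [1., 1.]])  # a 2^2 factorial design
    eigvals = np.linalg.eigvalsh(X.T @ X)          # eigenvalues of the information matrix
    cond = np.sqrt(eigvals.max() / eigvals.min())
    print(cond)                                    # 1.0: a perfectly "spherical" design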
Confounded Effects
Two (or more) effects are said to be confounded when variation in the responses cannot be traced back to the
variation in the design variables to which those effects are associated.
Confounded effects can be separated by performing a few new experiments. This is useful when some of the
confounded effects have been found significant.
Confounding Pattern
The confounding pattern of an experimental design is the list of the effects that can be studied with this design,
with confounded effects listed on the same line.
Constrained Design
Experimental design involving multi-linear constraints between some of the designed variables. There are two
types of constrained designs: classical Mixture designs and D-optimal designs.
Constraint
1) Context: Curve Resolution:
A constraint is a restriction imposed on the solutions to the multivariate curve resolution problem.
Many constraints take the form of a linear relationship between two variables or more:
a1·X1 + a2·X2 + … + an·Xn + a0 >= 0
or
a1·X1 + a2·X2 + … + an·Xn + a0 <= 0
where Xi are relevant variables (e.g. estimated concentrations), and each constraint is specified by the set of
constants a0 … an.
2) Context: Mixture Designs: See Multi-Linear Constraint.
Continuous Variable
Quantitative variable measured on a continuous scale.
Examples of continuous variables are:
- Amounts of ingredients (in kg, liters, etc.);
- Recorded or controlled values of process parameters (pressure, temperature, etc.).
Corner Sample
See vertex sample.
Correlation
A unitless measure of the amount of linear relationship between two variables.
The correlation is computed as the covariance between the two variables divided by the square root of the
product of their variances. It varies from -1 to +1.
Positive correlation indicates a positive link between the two variables, i.e. when one increases, the other has a
tendency to increase too. The closer to +1, the stronger this link.
Negative correlation indicates a negative link between the two variables, i.e. when one increases, the other has
a tendency to decrease. The closer to -1, the stronger this link.
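A small numerical illustration of this definition:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.5, 5.5, 8.0])
    covariance = np.cov(x, y)[0, 1]
    correlation = covariance / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))
    # Same result as np.corrcoef(x, y)[0, 1]; always between -1 and +1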
Correlation Loadings
Loading plot marking the 50% and 100% explained variance limits. Correlation Loadings are helpful in
revealing variable correlations.
COSCIND
A method used to check the significance of effects using a scale-independent distribution as comparison. This
method is useful when there are no residual degrees of freedom.
Covariance
A measure of the linear relationship between two variables.
The covariance is given on a scale which is a function of the scales of the two variables, and may not be easy
to interpret. Therefore, it is usually simpler to study the correlation instead.
Cross Terms
See Interaction Effects.
Cross Validation
Validation method where some samples are kept out of the calibration and used for prediction. This is repeated
until all samples have been kept out once. Validation residual variance can then be computed from the
prediction residuals.
In segmented cross validation, the samples are divided into subgroups or segments. One segment at a time is
kept out of the calibration. There are as many calibration rounds as segments, so that predictions can be made
on all samples. A final calibration is then performed with all samples.
In full cross validation, only one sample at a time is kept out of the calibration.
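A sketch of segmented cross validation, with placeholder fit/predict functions standing in for any calibration method (they are not a real API):

    import numpy as np

    def segmented_cv_variance(X, y, fit, predict, n_segments=5):
        # `fit` and `predict` are placeholders for any calibration method
        residuals = np.empty_like(y, dtype=float)
        indices = np.arange(len(y))
        for segment in np.array_split(indices, n_segments):
            keep = np.setdiff1d(indices, segment)   # calibrate without this segment
            model = fit(X[keep], y[keep])
            residuals[segment] = y[segment] - predict(model, X[segment])
        return np.mean(residuals ** 2)              # validation residual variance

Setting n_segments equal to the number of samples corresponds to full cross validation.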
Cube Sample
Any sample which is a combination of high and low levels of the design variables, in experimental plans based
on two levels of each variable.
In Box-Behnken designs, all samples which are a combination of high or low levels of some design variables,
and center level of others, are also referred to as cube samples.
Curvature
Curvature means that the true relationship between response variations and predictor variations is non-linear.
In screening designs, curvature can be detected by introducing a center sample.
Data Compression
Concentration of the information carried by several variables onto a few underlying variables.
The basic idea behind data compression is that observed variables often contain common information, and that
this information can be expressed by a smaller number of variables than originally observed.
Degree Of Fractionality
The degree of fractionality of a factorial design expresses how much the design has been reduced compared to
a full factorial design with the same number of variables. It can be interpreted as the number of design
variables that should be dropped to compute a full factorial design with the same number of experiments.
Example: with 5 design variables, one can either build
a fractional factorial design with a degree of fractionality of 1, which will include 16 experiments (2^(5-1));
a fractional factorial design with a degree of fractionality of 2, which will include 8 experiments (2^(5-2)).
Degrees Of Freedom
The number of degrees of freedom of a phenomenon is the number of independent ways this phenomenon can
be varied.
Degrees of freedom are used to compute variances and theoretical variable distributions. For instance, an
estimated variance is said to be corrected for degrees of freedom if it is computed as the sum of square of
deviations from the mean, divided by the number of degrees of freedom of this sum.
Design Variable
Experimental factor for which the variations are controlled in an experimental design.
Distribution
Shape of the frequency diagram of a measured variable or calculated parameter. Observed distributions can be
represented by a histogram.
Some statistical parameters have a well-known theoretical distribution which can be used for significance
testing.
D-Optimal Design
Experimental design generated by the DOPT algorithm. A D-optimal design takes into account the multi-linear
relationships existing between design variables, and thus works with constrained experimental regions. There
are two types of D-optimal designs: D-optimal Mixture designs and D-optimal Non-Mixture designs,
according to the presence or absence of Mixture variables.
D-Optimal Principle
Principle consisting in the selection of a sub-set of candidate points which define a maximal volume region in
the multi-dimensional space. The D-optimal principle aims at minimizing the condition number.
End Point
In an axial or a simplex-centroid design, an end point is positioned at the bottom of the axis of one of the
mixture variables, and is thus positioned on the side opposite to the axial point.
Experimental Design
Plan for experiments where input variables are varied systematically within predefined ranges, so that their
effects on the output variables (responses) can be estimated and checked for significance.
Experimental designs are built with a specific objective in mind, namely screening or optimization.
The number of experiments and the way they are built depends on the objective and on the operational
constraints.
Experimental Error
Random variation in the response that occurs naturally when performing experiments.
An estimation of the experimental error is used for significance testing, as a comparison to structured variation
that can be accounted for by the studied effects.
Experimental error can be measured by replicating some experiments and computing the standard deviation of
the response over the replicates. It can also be estimated as the residual variation when all structured effects
have been accounted for.
Experimental Region
N-dimensional area investigated in an experimental design with N design variables. The experimental region is
defined by:
1. the ranges of variation of the design variables,
2. if any, the multi-linear relationships existing between design variables.
In the case of multi-linear constraints, the experimental region is said to be constrained.
Explained Variance
Share of the total variance which is accounted for by the model.
Explained variance is computed as total variance minus residual variance, divided by total variance. It is
expressed as a percentage.
For instance, an explained variance of 90% means that 90% of the variation in the data is described by the
model, while the remaining 10% are noise (or error).
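In code form, the definition reads:

    residual_variance = 0.4   # hypothetical values
    total_variance = 4.0
    explained = 100.0 * (total_variance - residual_variance) / total_variance  # 90.0 %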
Explained X-Variance
See Explained Variance.
Explained Y-Variance
See Explained Variance.
F-Distribution
Fisher Distribution is the distribution of the ratio between two variances.
The F-distribution assumes that the individual observations follow an approximate normal distribution.
Fixed Effect
Effect of a variable for which the levels studied in an experimental design are of specific interest.
Examples are:
- effect of the type of catalyst on yield of the reaction;
- effect of resting temperature on bread volume.
The alternative to a fixed effect is a random effect.
F-Ratio
The F-ratio is the ratio between explained variance (associated to a given predictor) and residual variance. It
shows how large the effect of the predictor is, as compared with random noise.
By comparing the F-ratio with its theoretical distribution (F-distribution), we obtain the significance level
(given by a p-value) of the effect.
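For illustration, the p-value can be obtained from the upper tail of the F-distribution, e.g. with SciPy:

    from scipy.stats import f

    F_ratio = 6.4                    # hypothetical explained / residual variance ratio
    df_effect, df_residual = 1, 12   # hypothetical degrees of freedom
    p_value = f.sf(F_ratio, df_effect, df_residual)  # upper-tail probability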
Such designs are often used for extensive study of the effects of few variables, especially if some variables
have more than two levels. They are also appropriate as advanced screening designs, to study both main effects
and interactions, especially if no Resolution V design is available.
Gap
One of the parameters of the Gap-Segment and Norris Gap derivatives, the gap is the length of the interval that
separates the two segments that are being averaged.
Look up Segment for more information.
Histogram
A plot showing the observed distribution of data points. The data range is divided into a number of bins (i.e.
intervals) and the number of data points that fall into each bin is summed up.
The height of the bar in the histograms shows how many data points fall within the data range of the bin.
Hotelling T2 Ellipse
This 95% confidence ellipse can be included in Score plots and reveals potential outliers, lying outside the
ellipse. The Hotelling statistic is presented in the Method References chapter, which is available as a .PDF file
from CAMO's web site: www.camo.com/TheUnscrambler/Appendices.
Influence
A measure of how much impact a single data point (or a single variable) has on the model. The influence
depends on the leverage and the residuals.
Inner Relation
In PLS regression models, scores in X are used to predict the scores in Y and from these predictions, the
is found. This connection between X and Y through their scores is called the inner relation.
estimated Y
Interaction
There is an interaction between two design variables when the effect of the first variable depends on the level
of the other. This means that the combined effect of the two variables is not equal to the sum of their main
effects.
An interaction that increases the main effects is a synergy. If it goes in the opposite direction, it can be called
an antagonism.
Intercept
(Also called Offset). The point where a regression line crosses the ordinate (Y-axis).
Interior Point
Point which is not located on the surface, but inside of the experimental region. For example, an axial point is a
particular kind of interior point. Interior points are used in classical mixture designs.
Lack Of Fit
In Response Surface Analysis, the ANOVA table includes a special chapter which checks whether the
regression model describes the true shape of the response surface. Lack of fit means that the true shape is likely
to be different from the shape indicated by the model.
If there is a significant lack of fit, you can investigate the residuals and try a transformation.
Lattice Degree
The degree of a Simplex-Lattice design corresponds to the maximal number of experimental points -1 for a
level 0 of one of the Mixture variables.
Lattice Design
See Simplex-lattice design.
Leveled Variable
A leveled variable is a variable which consists of discrete values instead of a range of continuous values.
Examples are design variables and category variables.
Leveled variables can be used to separate a data table into different groups. This feature is used by the
Statistics task, and in sample plots from PCA, PCR, PLS, MLR, Prediction and Classification results.
Levels
Possible values of a variable. A category variable has several levels, which are all possible categories. A design
variable has at least a low and a high level, which are the lower and higher bounds of its range of variation.
Sometimes, intermediate levels are also included in the design.
Leverage Correction
A quick method to simulate model validation without performing any actual predictions.
It is based on the assumption that samples with a higher leverage will be more difficult to predict accurately
than more central samples. Thus a validation residual variance is computed from the calibration sample
residuals, using a correction factor which increases with the sample leverage.
Note! For MLR, leverage correction is strictly equivalent to full cross-validation. For other methods, leverage
correction should only be used as a quick-and-dirty method for a first calibration, and a proper validation
method should be employed later on to estimate the optimal number of components correctly.
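As a loose sketch of the idea (the 1/(1 - h) inflation factor below is a common choice, not necessarily the exact factor used by The Unscrambler):

    import numpy as np

    def leverage_corrected_variance(calibration_residuals, leverages):
        # Each calibration residual is inflated by a factor that grows with the
        # sample leverage h, so high-leverage samples count as harder to predict
        corrected = np.asarray(calibration_residuals) / (1.0 - np.asarray(leverages))
        return np.mean(corrected ** 2)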
Leverage
A measure of how extreme a data point or a variable is compared to the majority.
In PCA, PCR and PLS, leverage can be interpreted as the distance between a projected point (or projected
variable) and the model center. In MLR, it is the object distance to the model center.
Average data points have a low leverage. Points or variables with a high leverage are likely to have a high
influence on the model.
Linear Effect
See Main Effect.
Linear Model
Regression model including as X-variables the linear effects of each predictor. The linear effects are also called
main effects.
Linear models are used in Analysis of Effects in Plackett-Burman and Resolution III fractional factorial
designs. Higher resolution designs allow the estimation of interactions in addition to the linear effects.
Loading Weights
Loading weights are estimated in PLS regression. Each X-variable has a loading weight along each model
component.
The loading weights show how much each predictor (or X-variable) contributes to explaining the response
variation along each model component. They can be used, together with the Y-loadings, to represent the
relationship between X- and Y-variables as projected onto one, two or three components (line plot, 2D scatter
plot and 3D scatter plot respectively).
Loadings
Loadings are estimated in bilinear modeling methods where information carried by several variables is
concentrated onto a few components. Each variable has a loading along each model component.
The loadings show how well a variable is taken into account by the model components. You can use them to
understand how much each variable contributes to the meaningful variation in the data, and to interpret
variable relationships. They are also useful to interpret the meaning of each model component.
Lower Quartile
The lower quartile of an observed distribution is the variable value that splits the observations into 25% lower
values, and 75% higher values. It can also be called 25% percentile.
Main Effect
Average variation observed in a response when a design variable goes from its low to its high level.
The main effect of a design variable can be interpreted as linear variation generated in the response, when this
design variable varies and the other design variables have their average values.
MCR
See Multivariate Curve Resolution.
Mean
Average value of a variable over a specific sample set. The mean is computed as the sum of the variable
values, divided by the number of samples.
The mean gives a value around which all values in the sample set are distributed. In Statistics results, the mean
can be displayed together with the standard deviation.
Mean Centering
Subtracting the mean (average value) from a variable, for each data point.
Median
The median of an observed distribution is the variable value that splits the distribution in its middle: half the
observations have a lower value than the median, and the other half have a higher value. It can also be called
50% percentile.
MixSum
Term used in The Unscrambler for mixture sum. See Mixture Sum.
Mixture Components
Ingredients of a mixture.
There must be at least three components to define a mixture. A single component cannot be called a mixture.
Two components mixed together do not require a Mixture design to be studied: study the variation in quantity
of one of them as a classical process variable.
Mixture Constraint
Multi-linear constraint between Mixture variables. The general equation for the Mixture constraint is
X1 + X2 + ... + Xn = S
where the Xi represent the ingredients of the mixture, and S is the total amount of mixture. In most cases, S is
equal to 100%.
Mixture Design
Special type of experimental design, applying to the case of a Mixture constraint. There are three types of
classical Mixture designs: Simplex-Lattice design, Simplex-Centroid design, and Axial design. Mixture
designs that do not have a simplex experimental region are generated D-optimally; they are called D-optimal
Mixture designs.
Mixture Region
Experimental region for a Mixture design. The Mixture region for a classical Mixture design is a simplex.
Mixture Sum
Total proportion of a mixture which varies in a Mixture design. Generally, the mixture sum is equal to 100%.
However, it can be lower than 100% if the quantity of one of the components has a fixed value.
The mixture sum can also be expressed as fractions, with values varying from 0 to 1.
Mixture Variable
Experimental factor for which the variations are controlled in a mixture design or D-optimal mixture design.
Mixture variables are multi-linearly linked by a special constraint called mixture constraint.
There must be at least three mixture variables to define a mixture design. See Mixture Components.
MLR
See Multiple Linear Regression.
Mode
See Modes.
Model
Mathematical equation summarizing variations in a data set.
Models are built so that the structure of a data table can be understood better than by just looking at all raw
values.
Statistical models consist of a structure part and an error part. The structure part (information) is intended to be
used for interpretation or prediction, and the error part (noise) should be as small as possible for the model to
be reliable.
Model Center
The model center is the origin around which variations in the data are modeled. It is the (0,0) point on a score
plot.
If the variables have been centered, samples close to the average will lie close to the model center.
Model Check
In Response Surface Analysis, a section of the ANOVA table checks how useful the interactions and squares
are, compared with a purely linear model. This section is called Model Check.
If one part of the model is not significant, it can be removed so that the remaining effects are estimated with a
better precision.
Modes
In a multi-way array, a mode is one of the structuring dimensions of the array. A two-way array (standard n x p
matrix) has two modes: rows and columns. A three-way array (3-D data table, or some result matrices) has
three modes: rows, columns and planes, e.g. Samples, Primary variables and Secondary variables.
Multi-Linear Constraint
This is a linear relationship between two or more variables. A constraint has the general form:
A1.X1 + A2.X2 + ... + An.Xn + A0 >= 0
or
A1.X1 + A2.X2 + ... + An.Xn + A0 <= 0
where the Xi are designed variables (mixture or process), and each constraint is specified by the set of constants A0, ..., An.
A multi-linear constraint cannot involve both Mixture and Process variables.
Multi-Way Analysis
See Three-Way PLS Regression.
Multi-Way Data
See 3-D Data.
Noise
Random variation that does not contain any information.
Non-Linearity
Deviation from linearity in the relationship between a response and its predictors.
Non-Negativity
In MCR, the Non-negativity constraint forces the values in a profile to be equal to or greater than zero.
Normal Distribution
Frequency diagram showing how independent observations, measured on a continuous scale, would be
distributed if there were an infinite number of observations and no factors caused systematic effects.
A normal distribution can be described by two parameters: a theoretical mean, which is the center of the distribution, and a theoretical standard deviation, which is the spread of the individual observations around the mean.
NPLS
See Three-Way PLS Regression.
O2V
In The Unscrambler, three-way data structure formed of two Object modes and one Variable mode. A 3-D data table with layout O2V is displayed in the Editor as a flat (unfolded) table with as many rows as Primary samples times Secondary samples and as many columns as Variables.
Offset
See Intercept.
Optimization
Finding the settings of design variables that generate optimal response values.
Orthogonal
Two variables are said to be orthogonal if they are completely uncorrelated, i.e. their correlation is 0.
In PCA and PCR, the principal components are orthogonal to each other.
Factorial designs, Plackett-Burman designs, Central Composite designs and Box-Behnken designs are built in
such a way that the studied effects are orthogonal to each other.
Orthogonal Design
Designs built in such a way that the studied effects are orthogonal to each other are called orthogonal designs.
Examples: Factorial designs, Plackett-Burman designs, Central Composite designs and Box-Behnken designs.
D-optimal designs and classical mixture designs are not orthogonal.
Outlier
An observation (outlying sample) or variable (outlying variable) which is abnormal compared to the major part
of the data.
Extreme points are not necessarily outliers; outliers are points that apparently do not belong to the same
population as the others, or that are badly described by a model.
Outliers should be investigated before they are removed from a model, as an apparent outlier may be due to an
error in the data.
OV2
In The Unscrambler, three-way data structure formed of one Object mode and two Variable modes. A 3-D data table with layout OV2 is displayed in the Editor as a flat (unfolded) table with as many rows as Objects (samples) and as many columns as Primary variables times Secondary variables.
Overfitting
For a model, overfitting is a tendency to describe too much of the variation in the data, so that not only
consistent structure is taken into account, but also some noise or uninformative variation.
Overfitting should be avoided, since it usually results in a lower quality of prediction. Validation is an efficient
way to avoid model overfitting.
Passified
When you apply the Passify weighting option to a variable, it becomes Passified. This means that it loses all
influence on the model, but it is not removed from the analysis, so that you can study how it correlates to the
other variables, by plotting Correlation Loadings.
Variables which are not passified may be called active variables.
Passify
New weighting option which allows you, by giving a variable a very low weight in a PCA, PCR or PLS model,
to remove its influence on the model while still showing how it correlates to other variables.
PCA
See Principal Component Analysis.
PCR
See Principal Component Regression.
PCs
See Principal Component.
Percentile
The X% percentile of an observed distribution is the variable value that splits the observations into X% lower
values, and 100-X% higher values.
Quartiles and median are percentiles. The percentiles are displayed using a box-plot.
Plackett-Burman Design
A very reduced experimental plan used for a first screening of many variables. It gives information about the
main effects of the design variables with the smallest possible number of experiments.
No interactions can be studied with a Plackett-Burman design, and moreover, each main effect is confounded
with a combination of several interactions, so that these designs should be used only as a first stage, to check
whether there is any meaningful variation at all in the investigated phenomena.
PLS
See PLS Regression.
PLS Regression
By plotting the first PLS components one can view main associations between X-variables and Y-variables,
and also interrelationships within X-data and within Y-data.
PLS1
Version of the PLS method with only one Y-variable.
PLS2
Version of the PLS method in which several Y-variables are modeled simultaneously, thus taking advantage of
possible correlations or collinearity between Y-variables.
PLS-DA
See PLS Discriminant Analysis.
Precision
The precision of an instrument or a measurement method is its ability to give consistent results over repeated
measurements performed on the same object. A precise method will give several values that are very close to
each other.
Precision can be measured by standard deviation over repeated measurements.
If precision is poor, it can be improved by systematically repeating the measurements over each sample, and
replacing the original values by their average for that sample.
Precision differs from accuracy, which has to do with how close the average measured value is to the target
value.
Prediction
Computing response values from predictor values, using a regression model.
To make predictions, you need new X-data collected on samples which should be similar to the ones used for calibration. The new X-values are fed into the model equation (which uses the regression coefficients), and predicted Y-values are computed.
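As a minimal sketch (the names b0, b and X_new are illustrative, not The Unscrambler's internals), prediction with a linear regression model amounts to:

    import numpy as np

    b0 = 1.5                              # intercept from a calibrated model
    b = np.array([0.8, -0.2, 0.05])       # regression coefficients
    X_new = np.array([[1.0, 2.0, 3.0],    # new X-data, one row per sample
                      [0.5, 1.0, 4.0]])
    y_pred = b0 + X_new @ b               # predicted Y-values
    print(y_pred)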
Predictor
Variable used as input in a regression model. Predictors are usually denoted X-variables.
Primary Sample
In a 3-D data table with layout O2V, this is the major Sample mode. Secondary samples are nested within each
Primary sample.
Primary Variable
In a 3-D data table with layout OV2, this is the major Variable mode. Secondary variables are nested within
each Primary variable.
Process Variable
Experimental factor for which the variations are controlled in an experimental design, and to which the mixture
variable definition does not apply.
Projection
Principle underlying bilinear modeling methods such as PCA, PCR and PLS.
In those methods, each sample can be considered as a point in a multi-dimensional space. The model will be
built as a series of components onto which the samples - and the variables - can be projected. Sample
projections are called scores, variable projections are called loadings.
The model approximation of the data is equivalent to the orthogonal projection of the samples onto the model.
The residual variance of each sample is the squared distance to its projection.
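The following sketch illustrates the principle with plain PCA via a singular value decomposition; it is an illustration of projection, not the algorithm used by The Unscrambler.

    import numpy as np

    X = np.array([[2.0, 1.0, 0.5],
                  [0.0, 1.5, 1.0],
                  [1.0, 0.0, 2.0],
                  [3.0, 2.5, 0.0]])
    Xc = X - X.mean(axis=0)                 # center the variables
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U * s                               # scores: sample projections
    P = Vt.T                                # loadings: variable projections
    E = Xc - T[:, :2] @ P[:, :2].T          # residuals of a 2-component model
    print((E ** 2).sum(axis=1))             # squared sample-to-model distances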
Proportional Noise
Noise on a variable is said to be proportional when its size depends on the level of the data value. The range of
proportional noise is a percentage of the original data values.
Pure Components
In MCR, an unknown mixture is resolved into n pure components. The number of components and their
concentrations and instrumental profiles are estimated in a way that explains the structure of the observed data
under the chosen model constraints.
p-value
The p-value measures the probability that a parameter estimated from experimental data should be as large as it
is, if the real (theoretical, non-observable) value of that parameter were actually zero. Thus, p-value is used to
assess the significance of observed effects or variations: a small p-value means that you run little risk of
mistakenly concluding that the observed effect is real.
The usual limit used in the interpretation of a p-value is 0.05 (or 5%). If p-value < 0.05, you have reason to
believe that the observed effect is not due to random variations, and you may conclude that it is a significant
effect.
p-value is also called significance level.
Quadratic Model
Regression model including as X-variables the linear effects of each predictor, all two-variable interactions,
and the square effects.
With a quadratic model, the curvature of the response surface can be approximated in a satisfactory way.
Random Effect
Effect of a variable for which the levels studied in an experimental design can be considered to be a small
selection of a larger (or infinite) number of possibilities.
Examples:
- Effect of using different batches of raw material;
- Effect of having different persons perform the experiments.
The alternative to a random effect is a fixed effect.
Random Order
Randomization is the random mixing of the order in which the experiments are to be performed. The purpose is
to avoid systematic errors which could interfere with the interpretation of the effects of the design variables.
Reference Sample
Sample included in a designed data table to compare a new product under development to an existing product
of a similar type.
The design file will contain only response values for the reference samples, whereas the input part (the design
part) is missing (m).
Regression Coefficient
In a regression model equation, regression coefficients are the numerical coefficients that express the link
between variation in the predictors and variation in the response.
Regression
Generic name for all methods relating the variations in one or several response variables (Y-variables) to the
variations of several predictors (X-variables), with explanatory or predictive purposes.
Regression can be used to describe and interpret the relationship between the X-variables and the Y-variables,
and to predict the Y-values of new samples from the values of the X-variables.
Repeated Measurement
Measurement performed several times on one single experiment or sample.
The purpose of repeated measurements is to estimate the measurement error, and to improve the precision of
an instrument or measurement method by averaging over several measurements.
Replicate
Replicates are experiments that are carried out several times. The purpose of including replicates in a data table
is to estimate the experimental error.
Replicates should not be confused with repeated measurements, which give information about measurement
error.
Residual
A measure of the variation that is not taken into account by the model.
The residual for a given sample and a given variable is computed as the difference between observed value and
fitted (or projected, or predicted) value of the variable on the sample.
Residual Variance
The mean square of all residuals, sample- or variable-wise.
This is a measure of the error made when observed values are approximated by fitted values, i.e. when a
sample or a variable is replaced by its projection onto the model.
The complement to residual variance is explained variance.
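As a sketch (divisor conventions may differ from The Unscrambler's implementation), with residuals e(i,k) for n samples and p variables:
sample residual variance (sample i) = sum over k of e(i,k)^2 / p
variable residual variance (variable k) = sum over i of e(i,k)^2 / n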
Residual X-Variance
See Residual Variance.
Residual Y-Variance
See Residual Variance.
Resolution
1) Context: experimental design
Information on the degree of confounding in fractional factorial designs.
Resolution is expressed as a Roman numeral, according to the following code:
in a Resolution III design, main effects are confounded with 2-factor interactions;
in a Resolution IV design, main effects are free of confounding with 2-factor interactions, but 2-factor
interactions are confounded with each other;
in a Resolution V design, main effects and 2-factor interactions are free of confounding.
More generally, in a Resolution R design, effects of order k are free of confounding with all effects of order
less than R-k.
2) Context: data analysis
Extraction of estimated pure component profiles and spectra from a data matrix. See Multivariate Curve
Resolution for more details.
Response Variable
Observed or measured parameter which a regression model tries to predict.
Responses are usually denoted Y-variables.
Responses
See Response Variable.
RMSEC
Root Mean Square Error of Calibration. A measurement of the average difference between predicted and
measured response values, at the calibration stage.
RMSEC can be interpreted as the average modeling error, expressed in the same units as the original response
values.
RMSED
Root Mean Square Error of Deviations. A measurement of the average difference between the abscissa and
ordinate values of data points in any 2D scatter plot.
RMSEP
Root Mean Square Error of Prediction. A measurement of the average difference between predicted and
measured response values, at the prediction or validation stage.
RMSEP can be interpreted as the average prediction error, expressed in the same units as the original response
values.
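A minimal sketch of the computation (the function name is illustrative): applied to calibration samples it gives RMSEC, applied to prediction or validation samples it gives RMSEP.

    import numpy as np

    def rmse(y_predicted, y_measured):
        d = np.asarray(y_predicted) - np.asarray(y_measured)
        return np.sqrt(np.mean(d ** 2))

    print(rmse([10.2, 9.8, 11.1], [10.0, 10.0, 11.0]))   # same units as Y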
Sample
Object or individual on which data values are collected, and which builds up a row in a data table.
Scaling
See Weighting.
Scatter Effects
In spectroscopy, scatter effects are effects that are caused by physical phenomena, like particle size, rather than
chemical properties. They interfere with the relationship between chemical properties and shape of the
spectrum. There can be additive and multiplicative scatter effects.
Additive and multiplicative effects can be removed from the data by different methods. Multiplicative Scatter
Correction removes the effects by adjusting the spectra from ranges of wavelengths supposed to carry no
specific chemical information.
Scores
Scores are estimated in bilinear modeling methods where information carried by several variables is
concentrated onto a few underlying variables. Each sample has a score along each model component.
The scores show the locations of the samples along each model component, and can be used to detect sample
patterns, groupings, similarities or differences.
Screening
First stage of an investigation, where information is sought about the effects of many variables. Since many
variables have to be investigated, only main effects, and optionally interactions, can be studied at this stage.
There are specific experimental designs for screening, such as factorial or Plackett-Burman designs.
Secondary Sample
In a 3-D data table with layout O2V, this is the minor Sample mode. Secondary samples are nested within each
Primary sample.
Secondary Variable
In a 3-D data table with layout OV2, this is the minor Variable mode. Secondary variables are nested within
each Primary variable.
Segment
One of the parameters of Gap-Segment derivatives and Moving Average smoothing, a segment is an interval
over which data values are averaged.
In smoothing, X-values are averaged over one segment symmetrically surrounding a data point. The raw value
on this point is replaced by the average over the segment, thus creating a smoothing effect.
In Gap-Segment derivatives (designed by Karl Norris), X-values are averaged separately over one segment on
each side of the data point. The two segments are separated by a gap. The raw value on this point is replaced
by the difference of the two averages, thus creating an estimate of the derivative on this point.
SEP
See Standard Error of Performance.
Significance Level
See p-value.
Significant
An observed effect (or variation) is declared significant if there is a small probability that it is due to chance.
SIMCA
See SIMCA Classification.
SIMCA Classification
Classification method based on disjoint PCA modeling.
SIMCA focuses on modeling the similarities between members of the same class. A new sample will be
recognized as a member of a class if it is similar enough to the other members; else it will be rejected.
Simplex
Specific shape of the experimental region for a classical mixture design. A Simplex has N corners but N-1 independent variables in an N-dimensional space. This results from the fact that whatever the proportions of the ingredients in the mixture, the total amount of mixture has to remain the same: the Nth variable depends on the N-1 other ones. When mixing three components, the resulting simplex is a triangle.
Simplex-Centroid Design
One of the three types of mixture designs with a simplex-shaped experimental region. A Simplex-centroid design consists of extreme vertices, center points of all "sub-simplexes", and the overall center. A "sub-simplex" is a simplex defined by a subset of the design variables. Simplex-centroid designs are available for optimization purposes, but not for a screening of variables.
Simplex-Lattice Design
One of the three types of mixture designs with a simplex-shaped experimental region. A Simplex-lattice design
is a mixture variant of the full-factorial design. It is available for both screening and optimization purposes,
according to the degree of the design (See lattice degree).
Square Effect
Average variation observed in a response when a design variable goes from its center level to an extreme level
(low or high).
The square effect of a design variable can be interpreted as the curvature observed in the response surface, with
respect to this particular design variable.
Standard Deviation
SDev is a measure of a variable's spread around its mean value, expressed in the same unit as the original
values.
Standard deviation is computed as the square root of the mean square of deviations from the mean.
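As a sketch, the usual sample convention is
SDev = sqrt( sum over i of (x(i) - mean)^2 / (n - 1) )
for n values x(1), ..., x(n); some texts divide by n instead of n - 1.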
Standardization
Widely used pre-processing that consists in first centering the variables, then scaling them to unit variance.
The purpose of this transformation is to give all variables included in an analysis an equal chance to influence
the model, regardless of their original variances.
In The Unscrambler, standardization can be performed automatically when computing a model, by choosing
1/SDev as variable weights.
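A minimal sketch of standardization as described above, using numpy (the variable names are illustrative):

    import numpy as np

    X = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0]])
    Xc = X - X.mean(axis=0)                 # mean centering
    X_std = Xc / X.std(axis=0, ddof=1)      # weighting by 1/SDev
    print(X_std.std(axis=0, ddof=1))        # each variable now has SDev = 1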
Star Points Distance To Center
The default star distance to center ensures that all design samples are located on the surface of a sphere. In
other words, the star samples are as far away from the center as the cube samples are. As a consequence,
all design samples have exactly the same leverage. The design is said to be rotatable;
The star distance to center can be tuned down to 1. In that case, the star samples will be located at the
centers of the faces of the cube. This ensures that a Central Composite design can be built even if levels
lower than low cube or higher than high cube are impossible. However, the design is no longer
rotatable;
Any intermediate value for the star distance to center is also possible. The design will not be rotatable.
Star Samples
In optimization designs of the Central Composite family, star samples are samples with mid-values for all
design variables except one, for which the value is extreme. They provide the necessary intermediate levels
that will allow a quadratic model to be fitted to the data.
Star samples can be centers of cube faces, or they can lie outside the cube, at a given distance (larger than 1)
from the center of the cube; see Star Points Distance To Center.
Steepest Ascent
On a regular response surface, the shortest way to the optimum can be found by using the direction of steepest
ascent.
Student t-distribution
Also called the t-distribution. Frequency diagram showing how independent observations, measured on a continuous scale,
are distributed around their mean when the mean and standard deviation have been estimated from the data and
when no factor causes systematic effects.
When the number of observations increases towards an infinite number, the Student t-distribution becomes
identical to the normal distribution.
A Student t-distribution can be described by two parameters: the mean value, which is the center of the
distribution, and the standard deviation, which is the spread of the individual observations around the mean.
Given those two parameters, the shape of the distribution further depends on the number of degrees of
freedom, usually n-1, if n is the number of observations.
Test Samples
Additional samples which are not used during the calibration stage, but only to validate an already calibrated
model.
The data for those samples consist of X-values (for PCA) or of both X- and Y-values (for regression). The
model is used to predict new values for those samples, and the predicted values are then compared to the
observed ones.
Three-Way PLS
See Three-Way PLS Regression.
Training Samples
See Calibration Samples.
Tri-PLS
See Three-Way PLS Regression.
T-Scores
The scores found by PCA, PCR and PLS in the X-matrix.
See Scores for more details.
Tukey's Test
A multiple comparison test (see Multiple Comparison Tests for more details).
t-value
The t-value is computed as the ratio between deviation from the mean accounted for by a studied effect, and
standard error of the mean.
By comparing the t-value with its theoretical distribution (Student t-distribution), we obtain the significance
level of the studied effect.
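In sketch form, for an estimated effect b with standard error SE(b):
t = b / SE(b)
and the significance level (p-value) is read from the Student t-distribution with the appropriate number of degrees of freedom.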
UDA
See User-Defined Analysis.
UDT
See User-Defined Transformation.
Uncertainty Limits
Limits produced by Uncertainty Testing, helping you assess the significance of your X-variables in a
regression model. Variables with uncertainty limits that do not cross the 0 axis are significant.
Uncertainty Test
Martens' Uncertainty Test is a significance testing method implemented in The Unscrambler, which assesses
the stability of PCA or Regression results. Many plots and results are associated with the test, allowing the
estimation of the model stability, the identification of perturbing samples or variables, and the selection of
significant X-variables. The test is performed with Cross Validation, and is based on the Jack-knifing principle.
Underfit
A model that leaves aside some of the structured variation in the data is said to underfit.
Unfold
Operation consisting in mapping a three-way data structure onto a flat, two-way layout. An unfolded three-way array has one of its original modes nested into another one. In horizontal unfolding, all planes are displayed side by side, resulting in an OV2 layout, with Primary and Secondary variables. In vertical unfolding, all planes are displayed on top of each other, resulting in an O2V layout, with Primary and Secondary samples.
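The sketch below illustrates unfolding with numpy reshape; the axis ordering is an assumption made for the example, not necessarily The Unscrambler's internal layout.

    import numpy as np

    # 4 objects, 3 primary variables, 2 secondary variables
    X3 = np.arange(24).reshape(4, 3, 2)
    X_ov2 = X3.reshape(4, 3 * 2)      # horizontal unfolding -> OV2 layout

    # 4 primary samples, 3 secondary samples, 2 variables
    Y3 = np.arange(24).reshape(4, 3, 2)
    Y_o2v = Y3.reshape(4 * 3, 2)      # vertical unfolding -> O2V layout
    print(X_ov2.shape, Y_o2v.shape)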
Unimodality
In MCR, the Unimodality constraint allows the presence of only one maximum per profile.
Upper Quartile
The upper quartile of an observed distribution is the variable value that splits the observations into 75% lower
values, and 25% higher values. It can also be called 75% percentile.
U-Scores
The scores found by PLS in the Y-matrix.
See Scores for more details.
Validation Samples
See Test Samples.
Validation
Validation means checking how well a model will perform for future samples taken from the same population
as the calibration samples. In regression, validation also allows for estimation of the prediction error in future
predictions.
The outcome of the validation stage is generally expressed by a validation variance. The closer the validation
variance is to the calibration variance, the more reliable the model conclusions.
When explained validation variance stops increasing with additional model components, it means that the noise
level has been reached. Thus the validation variance is a good diagnostic tool for determining the proper
number of components in a model.
Validation variance can also be used as a way to determine how well a single variable is taken into account in
an analysis. A variable with a high explained validation variance is reliably modeled and is probably quite
precise; a variable with a low explained validation variance is badly taken into account and is probably quite
noisy.
Three validation methods are available in The Unscrambler:
test set validation;
cross validation;
leverage correction.
Variable
Any measured or controlled parameter that has varying values over a given set of samples.
A variable determines a column in a data table.
Variance
A measure of a variable's spread around its mean value, expressed in square units as compared to the original
values.
Variance is computed as the mean square of deviations from the mean. It is equal to the square of the standard
deviation.
Vertex Sample
A vertex is a point where two lines meet to form an angle. Vertex samples are used in Simplex-centroid, axial
and D-optimal mixture/non-mixture designs.
Ways
See Modes.
Weighting
A technique to modify the relative influences of the variables on a model. This is achieved by giving each
variable a new weight, i.e. multiplying the original values by a constant which differs between variables. This is
also called scaling.
The most common weighting technique is standardization, where the weight is the inverse of the standard
deviation of the variable (1/SDev).
Index
2
2-D 235
2-D data 235
2D scatter plot 59
3
3-D 235
3-D data 235
in the Editor 84
unfold 52
3-D data table
O2V 52
OV2 52
OV2 vs. O2V 52
3-D layout 251, 252, 263
3D scatter plot 59
A
absorbance to reflectance 74
accuracy 235
additive noise 235
alternating least squares 167, 235
analysis
constrained experiments 152
analysis of designed data 147
analysis of effects 148, 236
analysis of variance 148. See ANOVA
ANOVA 148, 236, 246, 249
for linear response surfaces 151
for quadratic response surfaces 151
linear 148
linear with interactions 148
quadratic 148, 149
summary 148
table plot interpretation 227
area normalization 72
averaging 80
axial design 236
axial point 236
C
calibration 108, 237
calibration samples 237
candidate point 238
category variable 237
category variables 17
binary variables 17
levels 17
center sample 238, 241
center samples 23, 40, 149
centering 80, 238
three-way data 83
central composite design 238
center samples 23
cube samples 23
star samples 23
central composite designs 23
centroid design 238
centroid point 238
classification 135, 238
Coomans plot 138
discriminant analysis 138
discrimination power 137
Hi 137
model distance 137
modeling power 137
project onto regression model 138
scores (plot) 202
Si 137
Si vs. Hi 138
SIMCA 135, 260
SIMCA modeling 136
table plot interpretation 228
classification scores
plot interpretation 202
classify
new samples 136
close file 55
closure 239
clustering 14
find groups of samples 212, 221
clustering results 145
collinear 239
collinearity 239
comparison with scale-independent distribution 149.
See COSCIND
component 239
condition number 239
confounded effects 239
confounding 20, 21, 257
confounding pattern 20, 22, 240
constrained design 240
constrained experimental region 240
constraint 240
closure 164
cost 50
non-negativity 164
other constraints in MCR 164, 165
unimodality 164
constraints
MCR 163
continuous variable 16
continuous variables 240
levels 16, 17
contour plot 151
Coomans plot 138
interpretation 202
core array 180
corner sample 240
correlation 240
correlation between variables
interpretation 206
interpretation, loading plot 205
correlation loadings 241
interpretation 206, 207, 208
COSCIND 149, 241
covariance 241
create a data table 53
cross terms 241
cross validation 120
full 120, 121
segmented 120, 121
test-set switch 120
cross-correlation
matrix plot interpretation 225
table plot interpretation 230
cross-validation 241
cube sample 39, 241
cube samples 23
curvature 40, 241
check 40
detect 189, 229
D
data compression 241
data tables, create by import 55
data tables, create new 53
data tables, create new designed 55
data tables, create new non-designed 54
degree of fractionality 242
degrees of freedom 148, 242
derivatives 76
gap 245
gap-segment 76
Norris-gap 76
Savitzky-Golay 76
segment 259
descriptive multivariate analysis 93
descriptive statistics 89
2D scatter plots 90
box plots 90
line plots 90
plots 90
descriptive variable analysis 90
design 16
Box-Behnken 24
category variables 17
center samples 40
central composite 23
continuous variables 16
design variables 16
D-optimal mixture 242
D-optimal non-mixture 243
extend 44
fractional factorial 20, 242, 244
full factorial 19, 244
mixture 248
mixture variables 17
non-design variables 17
orthogonal 252
Plackett-Burman 22, 253
process variables 18
reference samples 42
replicates 42
resolution 20, 22
screening 19
simplex-centroid 260
simplex-lattice 260
types 18
Design Def model 242
design variable 242
design variables 16, 47
category variables 17
continuous variables 16
select 47
designed data 13
detailed effects
plot interpretation 185
table plot interpretation 229
detect
curvature 189, 229
lack of fit 228
outlier 213, 217, 218, 219, 222
significant effects 228, 229
detect lack
of fit 227
detect non-linearities 113
detect outlier 227
deviations
interpretation 233
df 148. See degrees of freedom
differentiation 76
discrimination power 137
plot interpretation 185
distribution 242, 245
normal 251
visualize 61
D-optimal design 242
PLS analysis 152
D-optimal mixture design 242
D-optimal non-mixture design 243
D-optimal principle 28, 29, 243
E
edge center point 243
editing operations 69
effects
find important 226
n-plot 226
significance 228, 229
effects overview
plot interpretation 229
EMSC 75
end point 243
error measures 110
estimated concentrations 162
plot interpretation 185
F
factors 16
F-distribution 244
file properties 55
Fisher distribution 244
fixed effect 244
fractional design
resolution 20, 22
fractional factorial design 20, 240, 242, 244
f-ratio 148, 244
f-ratios
plot interpretation 186
full cross validation 120, 121
full factorial design 19, 244
G
gap 245
gap-segment derivatives 76
gaussian filtering 70, 71
group selection of test set 119, 120
groups
find groups of samples 212, 221
H
Hi 137
higher order interaction effects 149, 245
histogram 61, 242, 245
preference ratings 65
results 66
HOIE 149, 245
Hotelling T2 ellipse 245
I
import data 55
influence 245
plot interpretation 203, 204, 211, 220
influential outlier 217, 218, 219
J
jack-knifing 121, 127. See uncertainty test
K
Kubelka-Munk 74
L
lack of fit 151, 246
detect 227, 228
in regression 113. See non-linearities
landscape plot 151
lattice degree 246
lattice design 246
least square criterion 246
least squares 246
leveled variables 246
levels 246
levels of continuous variables 16, 17
leverage 245, 247
correction 120
leverage correction 246
leverages
designed data 187, 203, 205
high-leverage sample 187
influential samples 204, 205
interpretation, influence plot 203, 204
plot interpretation 186, 222
limits for outlier warnings 247
line plot 58, 90
linear effect 247
linear model 247
loading weights 111, 247
plot interpretation 189, 208, 209, 221
plot interpretation (tri-PLS) 208, 209
uncertainty 122
loadings 96, 247
p-loadings 111
plot interpretation 187, 188, 205, 206, 207, 220, 221
PLS 111
q-loadings 111
uncertainty 122
logarithmic transformation 70
lower quartile 247
M
main effect 247
main effects 18
plot interpretation 231
manual selection of test set 119, 120
Martens' Uncertainty Test 121
matrix
plot 60
matrix plot
3-D 64
maximizing single responses 19
maximum normalization 73
MCR 248
algorithm 167, 235
ambiguity 163
applications 166
co-elution 166
comparison with PCA 160
constraints 163
estimated concentrations 162
estimated spectra 162
initial guess 167
non-unique solution 163
number of components 161
purposes 160
residuals 162
sample residuals 162
spectroscopic monitoring 166
total residuals 162
variable residuals 162
MCR in practice 170
MCR-ALS 167
mean 248
plot interpretation 189, 222
mean and Sdev
plot interpretation 231
mean centering 248
mean normalization 73
Mean Square 148
mean-centering 80
median 248
median filtering 70, 71
minimize single responses 19
MixSum 248
Mixture Component 30, 31
mixture components 248
mixture constraint 248
mixture design 248
PLS analysis 152
mixture region 249
N
noise 76, 250, 255
non-continuous variables 17. See category variables
non-design variables 17
response variables 17
non-designed data 13
non-linearities 113, 151
non-linearity 251
non-negativity 251
normal distribution 251
checking 251
normal probability plot 60, 251
normalization 72
area 72
maximum 73
mean 73
peak 73
range 73
unit vector 72
Norris-gap derivatives 76
n-plot 60
n-plot of effects
plot interpretation 226
n-plot of residuals
plot interpretation 227
nPLS 262
O
O2V 52, 251
objective 16
offset 245, 251
one-way statistics 89
open file 55
optimal number of PCs 192, 195, 196
optimization 19, 251
orthogonal 251
orthogonal designs 252
outlier 99, 113, 252
detect 217, 218, 219, 222, 227
detect in PCA 99
detect in regression 113
influential 217, 218, 219
outlier detection 213
prediction 233
outlier warnings 247
OV2 52, 252
overfitting 252
P
partial least squares 107. See PLS
passified 252
passify 82, 252
PCA 253
interpret scores and loadings 99
loadings 96
purposes 93
scores 96
variances 95
PCA vs. curve resolution 94
PCR 13, 107, 253
PCs 94. See Principal Components
peak normalization 73
percentile 247, 248, 253, 264
percentiles 237
interpretation 232
plot interpretation 232
Plackett-Burman design 253
Plackett-Burman designs 22
planes 250
p-loadings 111
plot
2D scatter 59
2D scatter, raw data 62
3D scatter 59
3D scatter (raw data) 63
contour 151
histogram 61
histogram (raw data) 64
landscape 151
line 58
matrix 60
matrix (raw data) 63
normal probability 60
normal probability (raw data) 64
raw data, 2D scatter 62
raw data, 3D scatter 63
raw data, histogram 64
raw data, line 61
raw data, matrix 63
raw data, normal probability 64
response surface 151
special plots 66
stability 122
table 67
uncertainty 122
plot interpretation
ANOVA 227
bi-plot, scores and loadings 214
box-plot 232
classification scores 202
classification table 228
Coomans plot 202
cross-correlation (matrix plot) 225
cross-correlation (table plot) 230
detailed effects 185, 229
discrimination power 185
effects 226
effects overview 229
estimated concentrations 185
estimated spectra 186
f-ratios 186
influence 203, 204, 211, 220
interaction effects 230
leverages 186, 222
loading weights 189, 208, 209, 221
loadings 187, 188, 205, 206, 207, 220, 221
main effects 231
mean 189, 222
mean and Sdev 231
model distance 189
modeling power 189
multiple comparisons 232
percentiles 232
predicted and measured 189
predicted vs. measured 210, 230
predicted vs. reference 211
predicted with deviations 233
prediction 230
p-values of effects 190
p-values of regression coefficients 190
regression coefficients 190, 191, 223
residuals 225, 227
residuals vs. predicted 218
residuals vs. scores 220
response surface 224
response surface, contour 224
response surface, landscape 225
RMSE 192
sample residuals 192, 193
scatter effects 211
scores 193, 212, 221
Si vs. Hi 216
Si/S0 vs. Hi 216
standard deviation 194, 225
standard errors 194
total residuals 194
variable residuals 197, 199, 201
variance 195, 196, 197, 198, 199, 200, 201
X-Y relation outliers 217
plots
descriptive statistics 90
normal probability 251
various types 57
PLS 13, 107
for constrained designs 152
loading weights 111
loadings 111
scores 111
PLS discriminant analysis 138
PLS1 254
PLS2 254
precision 254
predicted and measured
plot interpretation 189
predicted vs. measured
plot interpretation 210, 230
predicted vs. reference 132
plot interpretation 211
predicted with deviation 132
predicted with deviations
plot interpretation 233
predicted Y-values 110
prediction 131, 254
allowed models 132
in practice 133
main results 132
projection equation 131
table plot interpretation 230
predictor 254
preference ratings
plot as histogram 65
preprocessing 12
pre-processing 69
three-way data 83
pre-treatment 69
primary objects 53
Primary Sample 254
Primary Variable 254
primary variables 53
principal component analysis 93
principal component regression 107. See PCR
principal components 94
principles of projection 94
print data 56
process variable 255
process variables 18
projection 94, 255
projection methods
error measures 110
projection to latent structures 107. See PLS
proportional noise 255
pure components 256
p-value 148, 149, 150, 256
p-values of effects
plot interpretation 190
p-values of regression coefficients
plot interpretation 190
Q
q-loadings 111
quadratic effects 19
quadratic model 256
quadratic models 19
R
random effect 256
random order 256
random selection of test set 119, 120
randomization 43, 256
range normalization 73
ranges of variation
how to select 47
raw data 12
2D scatter plot 62
3D scatter plot 63
histogram 64
line plot 61
matrix plot 63
n-plot 64
reference and center samples 149
reference sample 256
reference samples 42, 149
reflectance to absorbance 74
reflectance to Kubelka-Munk 74
re-formatting 69
fill missing 70
regression 105, 254, 257, 258
multivariate 105, 106
non-linearities 113
outlier detection 113
univariate 105, 106
regression coefficient 256
regression coefficients 109
plot interpretation 190, 191, 223
plot interpretation (tri-PLS) 191, 223
uncertainty 122
regression methods 106, 112
regression modeling 114
calibration 108
validation 108
regression models
shape 153
repeated measurement 257
replicate 257
replicates 42
residual 257
residual variance 95, 98, 257
residual variation 97
residual Y-variance 110
residuals 110, 245
MCR 162
n-plot 227
plot interpretation 225
sample 97
variable 97
residuals vs. predicted
plot interpretation 218
residuals vs. scores
plot interpretation 220
resolution 20, 22, 257
fractional design 20, 22
response surface 246, 249
mixture 155
modeling 19
plot interpretation 224
plots 151
results 150
response surface analysis 258
response surface modeling 150
response variable 258
response variables 16, 17
results
clustering 145
plot as histogram 66
SIMCA 136
RMSE
plot interpretation 192
RMSEC 110, 258
RMSED 258
RMSEP 110, 120, 258
root mean square error of prediction 120. See RMSEP
rotatability 23, 24
S
saddle point 151
sample 258
residuals 97
sample distribution
interpretation 213
sample leverage 137. See Hi
sample locations, interpretation 212
sample residuals
MCR 162
plot interpretation 192, 193
samples
primary 53
secondary 53
sample-to-model distance 137. See Si
Savitzky-Golay differentiation 76
Savitzky-Golay smoothing 70, 71
scaling 81, 259, 265
scatter effects 259
plot interpretation 211
scores 96, 259
plot interpretation 193, 212, 221
PLS 111
t 263
t-scores 111
u 264
u-scores 111
scores and loadings
bi-plot interpretation 214
screening 18, 259
interaction effects 18
interactions 18
linear model 18
main effects 18
screening designs 19
SDev 261
secondary objects 53
Secondary Sample 259
Secondary Variable 259
secondary variables 53
segment 259
segmented cross validation 120, 121
select
design variables 47
ranges of variation 47
regression method 112
sensitivity to pure components 260
shift
variables 80
Si 137
Si vs. Hi 138
plot interpretation 216
Si/S0 vs. Hi
plot interpretation 216
significance 121
significance level 260
significance testing 149
center samples 149
constrained designs 153
COSCIND 149
HOIE 149
methods 149
reference and center samples 149
reference samples 149
significance testing methods 229
significance tests 112, 113
significant 260
significant effects
detect 228, 229
SIMCA 135, 238, 260
modeling 136
SIMCA classification 260
SIMCA results 136
model results 136
sample results 136, 137
variable results 136
simplex 260
Simplex 28, 260
simplex-centroid design 260
simplex-lattice design 260
Singular Value Decomposition 107
smoothing 70
SNV 79
special plots 66
spectroscopic transformations 74
absorbance to reflectance 74
reflectance to absorbance 74
reflectance to Kubelka-Munk 74
spectroscopy
data 82
square effect 261
square root 70
SS 148
stability 122
stability plot
segment information 124
standard designs 16
standard deviation 261
plot interpretation 194, 225
standard errors
plot interpretation 194
standard normal variate 79
standardization 81, 265
standardization of variables 261
star points distance to center 261
star samples 23, 261
T
table plot 67
t-distribution 262
test samples 262
test set selection 119
group 119, 120
manual 119, 120
random 119, 120
test set validation 119, 262
tests of significance 112, 113
test-set switch 120
three-way 263
three-way data 51, 175, 235
counter-examples 179
examples 178
logical organization 52
modes 176
notation 176
OV2 and O2V 52
plot as matrix 64
pre-processing 83
ways 176
three-way PLS 13
three-way PLS Regression 262
three-way regression 179
total explained variance 98
total residual variance 98
total residuals
MCR 162
plot interpretation 194
training samples 262
transformations 69
averaging 80
derivatives 76
detect need 65
functions 70
logarithmic 70
MSC / EMSC 75
noise 76
shift variables 80
spectroscopic 74
standard normal variate (SNV) 79
transposition 80
transpose 80
tri-PLS 13, 262
A-component model 180
inner relation 181
interpretation 182
loadings 180
main results 181
max number of PCs 182
one-component model 179
orthogonality 182
scores 180
weights 180, 181
X-variables 181
tri-PLS regression modeling 182
t-scores 111, 263
Tukey's test 263
t-value 263
two-way statistics 89
types of experimental design 18
U
UDA 263
UDT 263
uncertainty limits 263
uncertainty test 121, 263
details 127
underfit 263
unfold 263
unfolding 3-D data 52
unimodality 263
unit vector normalization 72
univariate regression 105, 106
upper quartile 264
u-scores 111, 264
user-defined transformation 80
V
validation 94, 108, 241, 246, 262, 264
multivariate models 119
results 120
validation methods 119
cross validation 120
leverage correction 120
test set validation 119
validation samples 264
variable 264
active 252
passified 252
residuals 97
variable residuals
MCR 162
plot interpretation 197, 199, 201
variables
primary 53
secondary 53
variance 265
degrees of freedom 242
explained 95, 98
interpretation 200
plot interpretation 195, 196, 197, 198, 199, 200, 201
residual 95, 98
stabilization 70
total explained 98
total residual 98
variances 95
variation 93
vertex sample 265
W
ways 265
weighting 81, 265
1/SDev 261
in PLS2 and PLS1 82
in sensory analysis 82
spectroscopy data 82
three-way data 83
weights
passify 252
X
X-Y relation outliers
plot interpretation 217
X-Y relationship
interpretation 207, 209
shape 218