Xlstat Help
http://www.addinsoft.com
Table of Contents
When viewing this document in a PDF reader, click on a page number to go directly to that page.
INTRODUCTION ...................................................................................................................................... 16
SYSTEM CONFIGURATION.................................................................................................................. 17
INSTALLATION ....................................................................................................................................... 18
LICENSE .................................................................................................................................................... 18
MESSAGES ................................................................................................................................................ 24
OPTIONS .................................................................................................................................................... 25
DISCRETIZATION ................................................................................................................................... 52
DESCRIPTION ............................................................................................................................................. 52
DIALOG BOX .............................................................................................................................................. 52
RESULTS.................................................................................................................................................... 56
REFERENCES ............................................................................................................................................. 56
DATA MANAGEMENT............................................................................................................................ 58
DESCRIPTION ............................................................................................................................................. 58
DIALOG BOX .............................................................................................................................................. 59
CODING ..................................................................................................................................................... 62
DIALOG BOX .............................................................................................................................................. 62
HISTOGRAMS .......................................................................................................................................... 80
DESCRIPTION ............................................................................................................................................. 80
DIALOG BOX .............................................................................................................................................. 89
RESULTS.................................................................................................................................................... 92
EXAMPLE .................................................................................................................................................. 92
REFERENCES ............................................................................................................................................. 92
DESCRIPTION ........................................................................................................................................... 103
DIALOG BOX ............................................................................................................................................ 104
RESULTS.................................................................................................................................................. 105
REFERENCES ........................................................................................................................................... 106
EASYLABELS.......................................................................................................................................... 123
DIALOG BOX ............................................................................................................................................ 123
DIALOG BOX ............................................................................................................................................ 131
DIALOG BOX ............................................................................................................................................ 194
RESULTS.................................................................................................................................................. 197
EXAMPLE ................................................................................................................................................ 198
REFERENCES ........................................................................................................................................... 198
DESCRIPTION ........................................................................................................................................... 264
DIALOG BOX ............................................................................................................................................ 268
RESULTS.................................................................................................................................................. 275
EXAMPLE ................................................................................................................................................ 278
REFERENCES ........................................................................................................................................... 278
COCHRAN-ARMITAGE TREND TEST.............................................................................................. 343
DESCRIPTION ........................................................................................................................................... 343
DIALOG BOX ............................................................................................................................................ 344
RESULTS.................................................................................................................................................. 346
REFERENCES ........................................................................................................................................... 346
DESCRIPTION ........................................................................................................................................... 378
DIALOG BOX ............................................................................................................................................ 379
RESULTS.................................................................................................................................................. 380
EXAMPLE ................................................................................................................................................ 380
REFERENCES ........................................................................................................................................... 380
DIALOG BOX ............................................................................................................................................ 411
RESULTS.................................................................................................................................................. 413
REFERENCES ........................................................................................................................................... 413
PENALTY ANALYSIS............................................................................................................................ 448
DESCRIPTION ........................................................................................................................................... 448
DIALOG BOX ............................................................................................................................................ 449
RESULTS.................................................................................................................................................. 451
EXAMPLE ................................................................................................................................................ 452
REFERENCES ........................................................................................................................................... 452
ARIMA...................................................................................................................................................... 476
DESCRIPTION ........................................................................................................................................... 476
DIALOG BOX ............................................................................................................................................ 477
RESULTS.................................................................................................................................................. 481
EXAMPLE ................................................................................................................................................ 482
REFERENCES ........................................................................................................................................... 482
EXAMPLE ................................................................................................................................................ 490
REFERENCES ........................................................................................................................................... 491
DESCRIPTION ........................................................................................................................................... 544
DIALOG BOX ............................................................................................................................................ 546
RESULTS.................................................................................................................................................. 551
EXAMPLE ................................................................................................................................................ 554
REFERENCES ........................................................................................................................................... 554
SENSITIVITY AND SPECIFICITY ...................................................................................................... 604
DESCRIPTION ........................................................................................................................................... 604
DIALOG BOX ............................................................................................................................................ 607
RESULTS.................................................................................................................................................. 609
EXAMPLE ................................................................................................................................................ 609
REFERENCES ........................................................................................................................................... 609
EXAMPLE ................................................................................................................................................ 660
REFERENCES ........................................................................................................................................... 660
Introduction
XLSTAT was started over ten years ago in order to make a powerful, complete and user-friendly data analysis and statistical solution accessible to anyone.
The accessibility comes from the compatibility of XLSTAT with all the versions of Microsoft Excel that are used nowadays (from Excel 97 up to Excel 2007), from the interface that is available in seven languages (German, English, French, Spanish, Italian, Japanese, Portuguese), and from the permanent availability of a fully functional 30-day evaluation version on the XLSTAT website, www.xlstat.com.
The power of XLSTAT comes both from the C++ programming language and from the algorithms that are used. These algorithms are the result of many years of research by thousands of statisticians, mathematicians and computer scientists throughout the world. The development of each new feature in XLSTAT is preceded by an in-depth research phase that sometimes includes exchanges with the leading specialists of the methods of interest.
The completeness of XLSTAT is the fruit of over ten years of continuous work and of regular exchanges with the user community. User suggestions have greatly helped to improve the software by making it well adapted to a variety of requirements.
Last, the usability comes from the user-friendly interface which, after a few minutes of trying it out, makes it easy to use statistical methods that might require hours of training with other software.
The software architecture has considerably evolved over the last 5 years in order to take into account the advances of Microsoft Excel and the compatibility issues between platforms. The software relies on Visual Basic for Applications for the interface and on C++ for the mathematical and statistical computations.
As always, the Addinsoft team and the XLSTAT distributors are available to answer any
question you have, or to take into account your remarks and suggestions in order to continue
improving the software.
System configuration
XLSTAT runs under the following operating systems: Windows 95, Windows 98, Windows Me, Windows NT, Windows 2000, Windows XP, Windows Vista, and Mac OS X.
Running XLSTAT requires that Microsoft Excel is also installed on your computer. XLSTAT is compatible with the following Excel versions on Windows systems: Excel 97 (8.0), Excel 2000 (9.0), Excel XP (10.0), Excel 2003 (11.0), and Excel 2007 (12.0). Version X (10.0) or 2004 (11.0) of Excel is required on Mac OS X.
Patches and upgrades for Microsoft Office are available free of charge on the Microsoft website. We highly recommend that you download and install these patches, as some of them are critical. To check that your Excel version is up to date, please go from time to time to the following websites:
Windows: http://office.microsoft.com/officeupdate
Mac: http://www.microsoft.com/mac/downloads.aspx
Installation
Alternatively, insert the CD-ROM you received from us or from a distributor, wait until the installation procedure starts, and then follow the step-by-step instructions.
If your rights on your computer are restricted, you should ask someone who has administrator rights on the machine to install the software for you and to run it once. If you have a license key, it must be entered by someone with administrator rights on the machine. If you want to use the trial version, enter Evaluation as the temporary license key.
Once the installation is over, the administrator must give you read and write access to the folder where the XLSTAT user files are located (typically C:\Documents and Settings\User Name\Application Data\Addinsoft\XLSTAT2008\), including the corresponding subfolders.
This folder can be changed by the administrator, using the options dialog box of XLSTAT.
License
1. LICENSE. Addinsoft hereby grants you a nonexclusive license to install and use the
Software in machine-readable form on a single computer for use by a single individual if you
are using the demo version or if you have registered your demo version to use it with no time
limits. If you have ordered a multi-users license, the number of users depends directly on the
terms specified on the invoice sent to your company by Addinsoft or the authorized reseller.
2. RESTRICTIONS. Addinsoft retains all right, title, and interest in and to the Software, and
any rights not granted to you herein are reserved by Addinsoft. You may not reverse engineer,
disassemble, decompile, or translate the Software, or otherwise attempt to derive the source
code of the Software, except to the extent allowed under any applicable law. If applicable law
permits such activities, any information so discovered must be promptly disclosed to Addinsoft
and shall be deemed to be the confidential proprietary information of Addinsoft. Any attempt to
transfer any of the rights, duties or obligations hereunder is void. You may not rent, lease,
loan, or resell for profit the Software, or any part thereof. You may not reproduce or distribute
the Software except as expressly permitted under Section 1, and you may not create derivative works of the Software except with the express agreement of Addinsoft.
3. SUPPORT. Registered users of the Software are entitled to Addinsoft standard support
services. Demo version users may contact Addinsoft for support but with no guarantee to
benefit from Addinsoft standard support services.
6. TERM AND TERMINATION. This Agreement shall continue until terminated. You may
terminate the Agreement at any time by deleting all copies of the Software. This license
terminates automatically if you violate any terms of the Agreement. Upon termination you must
promptly delete all copies of the Software.
8. INDEMNITY. You agree to defend and indemnify Addinsoft against all claims, losses,
liabilities, damages, costs and expenses, including attorney's fees, which Addinsoft may incur
in connection with your breach of this Agreement.
COPYRIGHT (c) 2008 BY Addinsoft SARL, Paris, FRANCE. ALL RIGHTS RESERVED.
The XLSTAT approach
The XLSTAT interface relies entirely on Microsoft Excel, whether for inputting the data or for displaying the results. The computations, however, are completely independent of Excel, and the corresponding programs have been developed in the C++ programming language.
In order to guarantee accurate results, the XLSTAT software has been intensively tested and it
has been validated by specialists of the statistical methods of interest.
Addinsoft is committed to continuously improving the XLSTAT software suite, and welcomes any remarks or suggestions for improvement you might have. To contact Addinsoft, write to [email protected].
Data selection
As with all XLSTAT modules, data selection is done directly on an Excel sheet, preferably with the mouse. Statistical programs usually require that you first build a list of variables, then define their type, and at last select the variables of interest for the method you want to apply to them. The XLSTAT approach is completely different: you only need to select the data directly on one or more Excel sheets.
Selection by range: you select with the mouse, on the Excel sheet, all the cells of the table that correspond to the selection field of the dialog box.
Selection by columns: this mode is faster but requires that your data set starts on the first row of the Excel sheet. If this requirement is fulfilled you may select data by clicking on the name (A, B, …) of the first column of your data set on the Excel sheet, and then selecting the next columns by keeping the mouse button pressed and dragging the mouse cursor over the columns to select.
Selection by rows: this mode is the reciprocal of the selection by columns mode. It requires that your data set starts on the first column (A) of the Excel sheet. If this requirement is fulfilled you may select data by clicking on the name (1, 2, …) of the first row of your data set on the Excel sheet, and then selecting the next rows by keeping the mouse button pressed and dragging the mouse cursor over the rows to select.
Notes:
Multiple selections combined with selection by rows cannot be used if the transposition option is activated.
If you include the names of the variables in the data selection, you should make sure the Column labels or Labels included option is activated.
You can use keyboard shortcuts to quickly select data. Note that this is only possible if you have installed the latest patches for Microsoft Excel. Here is a list of the most useful selection shortcuts:
Ctrl Space: Selects the whole column corresponding to the already selected cells
Shift Space: Selects the whole row corresponding to the already selected cells
Shift Down: Selects the currently selected cells and the cells one row below
Shift Up: Selects the currently selected cells and the cells one row above
Shift Left: Selects the currently selected cells and the cells one column to the left
Shift Right: Selects the currently selected cells and the cells one column to the right
Ctrl Shift Down: Selects all the adjacent non empty cells below the currently selected
cells
Ctrl Shift Up: Selects all the adjacent non empty cells above the currently selected
cells
Ctrl Shift Left: Selects all the adjacent non empty cells to the left of the currently
selected cells
Ctrl Shift Right: Selects all the adjacent non empty cells to the right of the currently
selected cells
Shift Left: Selects one more column to the left of the currently selected columns
Shift Right: Selects one more column to the right of the currently selected columns
Ctrl Shift Left: Selects all the adjacent non empty columns to the left of the currently
selected columns
Ctrl Shift Right: Selects all the adjacent non empty columns to the right of the currently
selected columns
Shift Down: Selects one more row below the currently selected rows
Shift Up: Selects one more row above the currently selected rows
Ctrl Shift Down: Selects all the adjacent non empty rows below the currently selected
rows
Ctrl Shift Up: Selects all the adjacent non empty rows above the currently selected
rows
See also:
http://www.xlstat.com/demo-select.htm
Messages
XLSTAT uses an innovative message system to give information to the user and to report
problems.
The dialog box below is an example of what happens when an active selection field (here the Dependent variables) has been activated but left empty. The software detects the problem and displays the message box.
The information displayed in red (or in blue depending on the severity) indicates which
object/option/selection is responsible for the message. If you click on OK, the dialog box of the
method that had just been activated is displayed again and the field corresponding to the
Quantitative variable(s) is activated.
This message should be explicit enough to help you solve the problem by yourself. If a tutorial is available, a hyperlink to http://www.xlstat.com leads to a tutorial on the subject related to the problem. Sometimes an email address is displayed below the hyperlink, to allow you to send an email to Addinsoft using your usual email software, with the content of the XLSTAT message automatically included in the email message.
Options
XLSTAT offers several options in order to allow you to customize and optimize the use of the
software.
To display the options dialog box of XLSTAT, click on Options in the menu or on the
button of the XLSTAT toolbar.
: Click this button to close the dialog box. If you haven't previously saved the options, the changes you have made will not be kept.
General tab:
Language: Use this option to change the language of the interface of XLSTAT.
Memorize during one session: Activate this option if you want XLSTAT to memorize, during one session (from the opening until the closing of XLSTAT), the entries and options of the dialog boxes.
Including data selections: Activate this option so that XLSTAT records the
data selections during one session.
Memorize from one session to the next: Activate this option if you want XLSTAT to
memorize the entries and options of the dialog boxes from one session to the next.
Including data selections: Activate this option so that XLSTAT records the data selections from one session to the next. This option is useful and saves you time if you work on spreadsheets that always have the same layout.
Ask for selections confirmation: Activate this option so that XLSTAT prompts you to confirm the data selections once you have clicked on the OK button. If you activate this option, you will be able to verify the number of rows and columns of all the active selections.
Show only the active functions in menus and toolbars: Activate this option if you want only the active functions corresponding to registered modules to be displayed in the XLSTAT menu and in the toolbars.
Missing data tab:
Consider empty cells as missing data: this is the default option for XLSTAT and it cannot be
changed. Empty cells are considered by all tools as missing data.
Consider also the following values as missing data: when a cell contains a value that is in the list below this option, it will be considered as missing data, whether the corresponding selection is for numerical or categorical data.
Consider all text values as missing data: when this option is activated, any text value found in a table that should contain only numerical values will be considered by XLSTAT as missing data. Activate this option only if you are sure that such text values cannot be numerical values that were converted to text by mistake.
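To make these rules concrete, here is a minimal Python sketch of how a single cell might be interpreted under them. This is only an illustration of the behavior described above, not Addinsoft's actual implementation; the function name and the codes in extra_missing are arbitrary examples.

```python
import math

def to_numeric(cell, extra_missing=("NA", "#N/A"), text_is_missing=True):
    """Interpret one cell according to the missing-data rules above.

    - An empty cell is always missing data.
    - A value listed in extra_missing is missing data.
    - If text_is_missing is True, any other non-numeric text is missing
      too; otherwise the text is kept as-is.
    """
    if cell is None or cell == "":
        return math.nan                      # empty cells are always missing
    if isinstance(cell, str):
        if cell.strip() in extra_missing:
            return math.nan                  # user-declared missing codes
        try:
            return float(cell)               # numeric text stays numeric
        except ValueError:
            return math.nan if text_is_missing else cell
    return float(cell)

column = ["3.5", "", "NA", "7", "oops"]
print([to_numeric(c) for c in column])       # [3.5, nan, nan, 7.0, nan]
```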
Outputs tab:
Position of new sheets: If you choose the Sheet option in the dialog boxes of the XLSTAT functions, use this option to modify the position of the result sheets in the Excel workbook.
Number of decimals: Choose the number of decimals to display for the numerical results.
Notice that you always have the possibility to view a different number of decimals afterwards,
by using the Excel formatting options.
Minimum p-value: Enter the minimum p-value below which the p-values are replaced by "< p", where p is the minimum p-value.
Display titles in bold: Activate this option so that XLSTAT displays the titles of the results
tables in bold.
Display table headers in bold: Activate this option to display the headers of the results tables
in bold.
Display the results list in the report header: Activate this option so that XLSTAT displays
the results list at the bottom of the report header.
Display the project name in the report header: Activate this option to display the name of
your project in the report header. Then enter the name of your project in the corresponding
field.
Enlarge the first column of the report by a factor of X: Enter the value of the factor that is
used to automatically enlarge the width of the first column of the XLSTAT report. Default value
is 1. When the factor is 1 the width is left unchanged.
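The Minimum p-value option above amounts to a simple display rule. A hedged Python sketch of that rule follows; the default threshold and number of decimals here are illustrative assumptions, not XLSTAT's actual defaults.

```python
def format_p_value(p, min_p=0.0001, decimals=3):
    """Display rule for p-values: any value below the minimum p-value
    is shown as '< p' rather than as a very small number."""
    if p < min_p:
        return f"< {min_p:g}"
    return f"{p:.{decimals}f}"

print(format_p_value(0.023))      # 0.023
print(format_p_value(0.00002))    # < 0.0001
```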
Charts tab:
Display charts on separate sheets: Activate this option if you want the charts to be displayed on separate chart sheets. Note: when a chart is displayed on a spreadsheet you can still transform it into a chart sheet by right-clicking it, then selecting Location and then As new sheet.
Charts size:
Automatic: Choose this option if you want XLSTAT to automatically determine the size
of the charts using as a starting value the width and height defined below.
User defined: Activate this option if you want XLSTAT to display charts with
dimensions as defined by the following values:
Display charts with aspect ratio equal to one: Activate this option to ensure that there is no
distortion of distances due to different scales of the horizontal and vertical axes that could lead
to misinterpretations.
Advanced tab:
Random numbers:
Fix the seed to: Activate this option if you want to make sure that the computations involving random numbers always give the same result. Then enter the seed value.
Path for the user's files: This path can be modified if and only if you have administrator rights on the machine. You can then modify the folder where the user files are saved by clicking the [...] button, which displays a box where you can select the appropriate folder. User files include the general options as well as the options and selections of the dialog boxes of the various XLSTAT functions. The folder where the user files are stored must be accessible for reading and writing to all types of users.
Data sampling
Use this tool to generate a subsample of observations from a set of univariate or multivariate
data.
Description
Sampling is one of the fundamental data analysis and statistical techniques. Samples are
generated to:
Obtain very small tables which have the properties of the original table.
To meet these different situations, several methods have been proposed. XLSTAT offers the
following methods for generating a sample of N observations from a table of M rows:
N first rows: The sample is made of the first N rows of the initial table. This method should only be used if it is certain that the values have not been sorted according to a particular criterion that could introduce bias into the analysis;
N last rows: The sample is made of the last N rows of the initial table. This method should only be used if it is certain that the values have not been sorted according to a particular criterion that could introduce bias into the analysis;
Random without replacement: Observations are chosen at random and may occur only once
in the sample;
Random with replacement: Observations are chosen at random and may occur several times
in the sample;
Systematic from random start: From the j'th observation in the initial table, an observation is
extracted every k observations to be used in the sample. j is chosen at random from among a
number of possibilities depending on the size of the initial table and the size of the final
sample. k is determined such that the observations extracted are as spaced out as possible;
Random stratified (1): Rows are chosen at random within N sequences of observations of
equal length, where N is determined by dividing the number of observations by the requested
sample size;
Random stratified (2): Rows are chosen at random within N strata defined by the user. In each stratum, the number of sampled observations is proportional to the relative frequency of the stratum.
Random stratified (3): Rows are chosen at random within N strata defined by the user. In each stratum, the number of sampled observations is proportional to a relative frequency supplied by the user.
User defined: A variable indicates the frequency of each observation within the output sample.
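XLSTAT performs these computations internally in C++. As a rough illustration of two of the methods above, a Python sketch could look like the following; the exact choice of the random start j and the rounding of stratum quotas are simplifying assumptions.

```python
import random

def systematic_sample(rows, n):
    """Systematic sampling from a random start: keep every k-th row,
    with k chosen so the n sampled rows are as spread out as possible."""
    m = len(rows)
    k = max(1, m // n)               # spacing between sampled rows
    j = random.randrange(k)          # random start among the k possibilities
    return [rows[j + i * k] for i in range(n)]

def stratified_sample(rows, strata, n):
    """Random stratified sampling (2): within each stratum, draw a number
    of rows proportional to the stratum's relative frequency. Because of
    rounding, the quotas may not always sum exactly to n."""
    m = len(rows)
    by_stratum = {}
    for row, s in zip(rows, strata):
        by_stratum.setdefault(s, []).append(row)
    sample = []
    for members in by_stratum.values():
        quota = round(n * len(members) / m)
        sample.extend(random.sample(members, min(quota, len(members))))
    return sample
```

For example, with 100 rows and n = 10, systematic_sample keeps every 10th row starting from a random offset between 0 and 9.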
Dialog box
: Click this button to close the dialog box without doing any computation.
Sampling: Choose the sampling method (see the description section for more details).
Strata: This option is only available for random stratified sampling (2) and (3). Select in that field a column that tells to which stratum each observation belongs.
Weight of each stratum: This option is only available for random stratified sampling (3). Select a table with two columns, the first containing the strata IDs, and the second the weight of each stratum in the final sample. Whatever the unit of the weights (size, frequency, percentage), XLSTAT standardizes them so that their sum is equal to the requested sample size.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (data and
observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Display the report header: Deactivate this option if you want the sampled table to start from
the first row of the Excel worksheet (situation after output to a worksheet or workbook) and not
after the report header. You can thus select the variables of this table by columns.
Shuffle: Activate this option if you want to randomly permute the output data. If this option is
not activated, the sampled data respect the order of the input data.
References
Cochran W.G. (1977). Sampling techniques. Third edition. John Wiley & Sons, New York.
Hedayat A.S. & Sinha B.K. (1991). Design and inference in finite population sampling. John
Wiley & Sons, New York.
Distribution sampling
Use this tool to generate a data sample from a continuous or discrete theoretical distribution or
from an existing sample.
Description
To generate a sample from a theoretical distribution, you must choose the distribution and, if
necessary, enter the parameters required for this distribution.
Distributions
Bernoulli (p): the density function of this distribution is given by:

P(X = 1) = p, \quad P(X = 0) = 1 - p, \quad \text{with } p \in [0,1]

The Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli (1654-1705),
describes binary phenomena where only two events can occur, with respective probabilities
p and 1-p.
Beta (\alpha, \beta): the density function of this distribution (also called Beta type I) is given by:

f(x) = \frac{1}{B(\alpha,\beta)} x^{\alpha-1} (1-x)^{\beta-1}, \quad \text{with } \alpha, \beta > 0, \; x \in [0,1] \text{ and } B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
Beta4 (\alpha, \beta, c, d): the density function of this distribution is given by:

f(x) = \frac{1}{B(\alpha,\beta)} \frac{(x-c)^{\alpha-1} (d-x)^{\beta-1}}{(d-c)^{\alpha+\beta-1}}, \quad \text{with } \alpha, \beta > 0, \; x \in [c,d], \; c, d \in \mathbb{R}, \text{ and } B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
For the type I beta distribution, X takes values in the [0,1] range. The beta4
distribution is obtained by a variable transformation such that the distribution is on a
[c, d] interval where c and d can take any value.
Beta (a, b): the density function of this distribution (also called Beta type I) is given by:

f(x) = \frac{1}{B(a,b)} x^{a-1} (1-x)^{b-1}, \quad \text{with } a, b > 0, \; x \in [0,1] \text{ and } B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}
Binomial (n, p): the density function of this distribution is given by:

P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad \text{with } x \in \mathbb{N}, \; 0 \le x \le n

n is the number of trials, and p the probability of success. The binomial distribution
is the distribution of the number of successes for n trials, given that the probability
of success is p.
Negative binomial type I (n, p): the density function of this distribution is given by:

P(X = x) = \binom{n+x-1}{x} p^n (1-p)^x, \quad \text{with } x \in \mathbb{N} \text{ and } n, p > 0
Negative binomial type II (k, p): the density function of this distribution is given by:

P(X = x) = \frac{\Gamma(k+x)}{x!\,\Gamma(k)} \frac{p^x}{(1+p)^{k+x}}, \quad \text{with } x \in \mathbb{N} \text{ and } k, p > 0

The negative binomial type II distribution is used to represent discrete and highly
heterogeneous phenomena. As k tends to infinity, the negative binomial type II
distribution tends towards a Poisson distribution with \lambda = kp.
Chi-square (df): the density function of this distribution is given by:

f(x) = \frac{(1/2)^{df/2}}{\Gamma(df/2)} x^{df/2 - 1} e^{-x/2}, \quad \text{with } x > 0 \text{ and } df \in \mathbb{N}^*

E(X) = df and V(X) = 2df
Erlang (k, \lambda): the density function of this distribution is given by:

f(x) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{(k-1)!}, \quad \text{with } x \ge 0 \text{ and } k, \lambda > 0, \; k \in \mathbb{N}

Note: When k=1, this distribution is equivalent to the exponential distribution. The
Gamma distribution with two parameters is a generalization of the Erlang
distribution to the case where k is a real and not an integer (for the Gamma
distribution the scale parameter \beta is used).
Exponential (\lambda): the density function of this distribution is given by:

f(x) = \lambda \exp(-\lambda x), \quad \text{with } x \ge 0 \text{ and } \lambda > 0

The exponential distribution is often used for studying lifetime in quality control.
Fisher (df1, df2): the density function of this distribution is given by:

f(x) = \frac{1}{x\,B(df_1/2,\, df_2/2)} \left(\frac{df_1 x}{df_1 x + df_2}\right)^{df_1/2} \left(1 - \frac{df_1 x}{df_1 x + df_2}\right)^{df_2/2}, \quad \text{with } x \ge 0 \text{ and } df_1, df_2 \in \mathbb{N}^*

E(X) = df_2/(df_2 - 2) if df_2 > 2, and V(X) = 2 df_2^2 (df_1 + df_2 - 2) / [df_1 (df_2 - 2)^2 (df_2 - 4)] if df_2 > 4

Fisher's distribution, from the name of the biologist, geneticist and statistician
Ronald Aylmer Fisher (1890-1962), corresponds to the ratio of two Chi-square
distributions. It is often used for testing hypotheses.
Fisher-Tippett (\beta, \mu): the density function of this distribution is given by:

f(x) = \frac{1}{\beta} \exp\left(-\frac{x-\mu}{\beta} - \exp\left(-\frac{x-\mu}{\beta}\right)\right), \quad \text{with } \beta > 0

E(X) = \mu + \beta\gamma and V(X) = (\pi\beta)^2 / 6, where \gamma is the Euler-Mascheroni constant.
Gamma (k, \beta, \mu): the density function of this distribution is given by:

f(x) = \frac{(x-\mu)^{k-1} e^{-(x-\mu)/\beta}}{\beta^k\,\Gamma(k)}, \quad \text{with } x \ge \mu \text{ and } k, \beta > 0
GEV (\beta, k, \mu): the density function of this distribution is given by:

f(x) = \frac{1}{\beta} \left(1 + k\,\frac{x-\mu}{\beta}\right)^{-1/k - 1} \exp\left(-\left(1 + k\,\frac{x-\mu}{\beta}\right)^{-1/k}\right), \quad \text{with } \beta > 0

The GEV (Generalized Extreme Values) distribution is much used in hydrology for
modeling flood phenomena. k typically lies between -0.6 and 0.6.
Gumbel: the density function of this distribution is given by:

f(x) = \exp\left(-x - \exp(-x)\right)

The Gumbel distribution, named after Emil Julius Gumbel (1891-1966), is a special
case of the Fisher-Tippett distribution with \beta = 1 and \mu = 0. It is used in the study of
extreme phenomena such as precipitations, flooding and earthquakes.
Logistic (\mu, s): the density function of this distribution is given by:

f(x) = \frac{e^{-\frac{x-\mu}{s}}}{s\left(1 + e^{-\frac{x-\mu}{s}}\right)^2}, \quad \text{with } \mu \in \mathbb{R} \text{ and } s > 0
Lognormal (\mu, \sigma): the density function of this distribution is given by:

f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}, \quad \text{with } x, \sigma > 0
Normal (\mu, \sigma): the density function of this distribution is given by:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \text{with } \sigma > 0

E(X) = \mu and V(X) = \sigma^2
Standard normal: the density function of this distribution is given by:

f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}

This distribution is a special case of the normal distribution with \mu = 0 and \sigma = 1.
Pareto (a, b): the density function of this distribution is given by:

f(x) = \frac{a b^a}{x^{a+1}}, \quad \text{with } a, b > 0 \text{ and } x \ge b
The Pareto distribution, named after the Italian economist Vilfredo Pareto (1848-
1923), is also known as the Bradford distribution. This distribution was initially used
to represent the distribution of wealth in society, with Pareto's principle that 80% of
the wealth was owned by 20% of the population.
PERT (a, m, b): the density function of this distribution is given by:

f(x) = \frac{1}{B(\alpha,\beta)} \frac{(x-a)^{\alpha-1} (b-x)^{\beta-1}}{(b-a)^{\alpha+\beta-1}}, \quad \text{with } \alpha, \beta > 0, \; x \in [a,b], \; a, b \in \mathbb{R}, \text{ and } B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}

where

\alpha = \frac{4m + b - 5a}{b - a}, \quad \beta = \frac{5b - a - 4m}{b - a}
The PERT distribution is a special case of the beta4 distribution. It is defined by its
interval of definition [a, b] and by m, the most likely value (the mode). PERT is an
acronym for Program Evaluation and Review Technique, a project management
and planning methodology. The PERT methodology and distribution were
developed during the project held by the US Navy and Lockheed between 1956 and
1960 to develop the Polaris missiles launched from submarines. The PERT
distribution is useful to model the time that is likely to be spent by a team to finish a
project. The simpler triangular distribution is similar to the PERT distribution in that it
is also defined by an interval and a most likely value.
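The two shape formulas above can be checked numerically. The sketch below is our own illustration (not XLSTAT code); in the symmetric case where the mode sits halfway between a and b, both shape parameters equal 3.

```python
def pert_parameters(a, m, b):
    # Shape parameters of the underlying beta4 distribution,
    # following the PERT formulas given above.
    alpha = (4 * m + b - 5 * a) / (b - a)
    beta = (5 * b - a - 4 * m) / (b - a)
    return alpha, beta

# Symmetric case: mode halfway between a and b gives alpha = beta = 3.
alpha, beta = pert_parameters(0.0, 0.5, 1.0)
print(alpha, beta)  # 3.0 3.0
```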
Poisson (\lambda): the density function of this distribution is given by:

P(X = x) = \frac{\exp(-\lambda)\,\lambda^x}{x!}, \quad \text{with } x \in \mathbb{N} \text{ and } \lambda > 0
Student (df): the density function of this distribution is given by:

f(x) = \frac{\Gamma\left(\frac{df+1}{2}\right)}{\sqrt{\pi\,df}\;\Gamma\left(\frac{df}{2}\right)} \left(1 + \frac{x^2}{df}\right)^{-(df+1)/2}, \quad \text{with } df > 0
The English chemist and statistician William Sealy Gosset (1876-1937) used the
nickname Student to publish his work, in order to preserve his anonymity (the
Guinness brewery forbade its employees to publish following the publication of
confidential information by another researcher). Student's t distribution is the
distribution of the ratio of a standard normal variable to the square root of an
independent Chi-square variable divided by its degrees of freedom df. When df=1,
Student's distribution is a Cauchy distribution, with the particularity of having neither
expectation nor variance.
Trapezoidal (a, b, c, d): the density function of this distribution is given by:

f(x) = \frac{2(x-a)}{(d+c-b-a)(b-a)}, \quad x \in [a,b]

f(x) = \frac{2}{d+c-b-a}, \quad x \in [b,c]

f(x) = \frac{2(d-x)}{(d+c-b-a)(d-c)}, \quad x \in [c,d]

f(x) = 0, \quad x < a \text{ or } x > d

with a \le b \le c \le d

This distribution is useful to represent a phenomenon for which we know that it can
take values between two extreme values (a and d), but that it is more likely to take
values between two values (b and c) within that interval.
Triangular (a, m, b): the density function of this distribution is given by:

f(x) = \frac{2(x-a)}{(b-a)(m-a)}, \quad x \in [a,m]

f(x) = \frac{2(b-x)}{(b-a)(b-m)}, \quad x \in [m,b]

f(x) = 0, \quad x < a \text{ or } x > b

with a \le m \le b
TriangularQ (q1, m, q2, p1, p2): the density function of this distribution is a
reparametrization of the Triangular distribution. A first step requires estimating the a
and b parameters of the triangular distribution, from the q1 and q2 quantiles to which
percentages p1 and p2 correspond. Once this is done, the distribution functions can be
computed using the triangular distribution functions.
Uniform (a, b): the density function of this distribution is given by:

f(x) = \frac{1}{b-a}, \quad \text{with } b > a \text{ and } x \in [a,b]
The uniform (0,1) distribution is much used for simulations. Since the cumulative
distribution function of any distribution takes values between 0 and 1, a sample drawn
from the Uniform (0,1) distribution can be used to obtain random samples from any
distribution whose inverse cumulative distribution function can be calculated.
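This inverse-transform idea can be sketched in a few lines of Python (our own illustration, not XLSTAT code), here using the exponential distribution, whose inverse CDF is -ln(1-u)/\lambda:

```python
import random
from math import log

def inverse_transform_sample(inv_cdf, n):
    # Draw from Uniform(0,1) and map through the inverse CDF.
    return [inv_cdf(random.random()) for _ in range(n)]

# Example: exponential distribution with rate lam,
# whose inverse CDF is -ln(1 - u) / lam.
lam = 2.0
sample = inverse_transform_sample(lambda u: -log(1.0 - u) / lam, 10000)
mean = sum(sample) / len(sample)
print(round(mean, 1))  # close to 1/lam = 0.5
```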
Uniform discrete (a, b): the density function of this distribution is given by:

f(x) = \frac{1}{b-a+1}, \quad \text{with } b > a, \; (a, b) \in \mathbb{N}^2, \; x \in \mathbb{N} \text{ and } x \in [a,b]

The uniform discrete distribution corresponds to the case where the uniform
distribution is restricted to integers.
Weibull (\beta): the density function of this distribution is given by:

f(x) = \beta x^{\beta-1} \exp(-x^\beta), \quad \text{with } x \ge 0 \text{ and } \beta > 0

We have E(X) = \Gamma(1 + 1/\beta) and V(X) = \Gamma(1 + 2/\beta) - \Gamma^2(1 + 1/\beta)

Weibull (\beta, \gamma): the density function of this distribution is given by:

f(x) = \frac{\beta}{\gamma} \left(\frac{x}{\gamma}\right)^{\beta-1} \exp\left(-\left(\frac{x}{\gamma}\right)^\beta\right), \quad \text{with } x \ge 0 \text{ and } \beta, \gamma > 0

We have E(X) = \gamma\,\Gamma(1 + 1/\beta) and V(X) = \gamma^2\left[\Gamma(1 + 2/\beta) - \Gamma^2(1 + 1/\beta)\right]

\beta is the shape parameter of the distribution and \gamma the scale parameter. When \beta = 1,
the Weibull distribution is an exponential distribution with parameter 1/\gamma.

Weibull (\beta, \gamma, \mu): the density function of this distribution is given by:

f(x) = \frac{\beta}{\gamma} \left(\frac{x-\mu}{\gamma}\right)^{\beta-1} \exp\left(-\left(\frac{x-\mu}{\gamma}\right)^\beta\right), \quad \text{with } x \ge \mu \text{ and } \beta, \gamma > 0

We have E(X) = \mu + \gamma\,\Gamma(1 + 1/\beta) and V(X) = \gamma^2\left[\Gamma(1 + 2/\beta) - \Gamma^2(1 + 1/\beta)\right]

The Weibull distribution, named after the Swede Ernst Hjalmar Waloddi Weibull
(1887-1979), is much used in quality control and survival analysis. \beta is the shape
parameter of the distribution and \gamma the scale parameter. When \beta = 1 and \mu = 0, the
Weibull distribution is an exponential distribution with parameter 1/\gamma.
Dialog box
: Click this button to close the dialog box without doing any computation.
Theoretical distribution: Activate this option to sample data in a theoretical distribution. Then
choose the distribution and enter any parameters required by the distribution.
Empirical Distribution: Activate this option to sample data in an empirical distribution. Then
select the data required to build the empirical distribution.
Column labels: Activate this option if the first row of the selected data (data and
weights) contains a label.
Weights: Activate this option if the observations are weighted. Weights must be greater
than or equal to 0. If a column header has been selected, check that the "Column
labels" option is activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Workbook: Activate this option to display the results in a new workbook.
Sample size: Enter the number of values to generate for each of the samples.
Display the report header: Deactivate this option if you want the table of sampled values to
start from the first row of the Excel worksheet (situation after output to a worksheet or
workbook) and not after the report header.
Example
An example showing how to generate a random normal sample is available on the Addinsoft
website:
http://www.xlstat.com/demo-norm.htm
References
El-Shaarawi A.H., Esterby E.S. and Dutka B.J. (1981). Bacterial density in water determined
by Poisson or negative binomial distributions. Applied and Environmental Microbiology, 41(1),
107-116.
Fisher R.A. and Tippett L.H.C. (1928). Limiting forms of the frequency distribution of the
largest or smallest member of a sample. Proc. Cambridge Phil. Soc., 24, 180-190.
Gumbel E.J. (1941). Probability interpretation of the observed return periods of floods. Trans.
Am. Geophys. Union, 21, 836-850.
Jenkinson A. F. (1955). The frequency distribution of the annual maximum (or minimum) of
meteorological elements. Q. J. R. Meteorol. Soc., 81, 158-171.
Perreault L. and Bobée B. (1992). Loi généralisée des valeurs extrêmes. Propriétés
mathématiques et statistiques. Estimation des paramètres et des quantiles XT de période de
retour T. INRS-Eau, rapport de recherche no 350, Québec.
Weibull W. (1939). A statistical theory of the strength of material. Proc. Roy. Swedish Inst.
Eng. Res. 151(1), 1-45.
Variables transformation
Use this tool to quickly apply simple transformations to a set of variables.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data: Select the data in the Excel worksheet. If headers have been selected, check that the
"Column labels" option has been activated.
Column labels: Activate this option if the first row of the data selected (data and coding table)
contains a label.
Observation labels: Check this option if you want to use the observation labels. If you do not
check this option, labels will be created automatically (Obs1, Obs2, etc.). If a column header
has been selected, check that the "Column labels" option has been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Display the report header: Deactivate this option if you want the results table to start from the
first row of the Excel worksheet (situation after output to a worksheet or workbook) and not
after the report header.
Standardize (n-1): Choose this option to standardize the variables using the unbiased
standard deviation.
Other: Choose this option to use another transformation. Then click on the
Transformations tab to choose the transformation to apply.
Transformations tab:
Standardize (n): Choose this option to standardize the variables using the biased standard
deviation.
/ Standard deviation (n-1): Choose this option to divide the variables by their unbiased
standard deviation.
/ Standard deviation (n): Choose this option to divide the variables by their biased standard
deviation.
Rescale from 0 to 100: Choose this option to rescale the data from 0 to 100.
Binarize (0/1): Choose this option to convert all values that are not 0 to 1, and leave the 0s
unchanged.
Sign (-1/0/1): Choose this option to convert all values that are negative to -1, all positive
values to 1, and leave the 0s unchanged.
Box-Cox transformation: Activate this option to improve the normality of the sample; the Box-
Cox transformation is defined by the following equation:

Y_t = \frac{X_t^\lambda - 1}{\lambda}, \quad X_t \ge 0, \lambda > 0 \text{ or } X_t > 0, \lambda < 0

Y_t = \ln(X_t), \quad X_t > 0, \lambda = 0

XLSTAT accepts a fixed value of \lambda, or it can find the value that maximizes the likelihood of the
sample, assuming the transformed sample follows a normal distribution.
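The transformation and the likelihood-based choice of \lambda can be sketched as follows. This is a simplified illustration under our own function names (a crude grid search, not XLSTAT's optimizer), valid for strictly positive data:

```python
from math import log

def box_cox(x, lam):
    # Box-Cox transform for strictly positive data.
    if lam == 0:
        return [log(v) for v in x]
    return [(v ** lam - 1) / lam for v in x]

def box_cox_loglik(x, lam):
    # Profile log-likelihood assuming the transformed sample is normal
    # (up to an additive constant).
    y = box_cox(x, lam)
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    return -0.5 * n * log(var) + (lam - 1) * sum(log(v) for v in x)

def best_lambda(x, grid):
    # Grid search for the lambda that maximizes the likelihood.
    return max(grid, key=lambda lam: box_cox_loglik(x, lam))

data = [0.5, 1.2, 2.7, 3.1, 4.8, 6.0]
grid = [i / 10 for i in range(-20, 21)]
lam = best_lambda(data, grid)
print(lam, box_cox(data, lam)[0])
```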
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to remove the observations that contain missing
data.
Estimate missing data: Activate this option to estimate the missing data by using the mean of
the variables.
Create a contingency table
Use this tool to create a contingency table from two or more qualitative variables. A chi-square
test is optionally performed.
Description
A contingency table is an efficient way to summarize the relation (or correspondence) between
two categorical variables V1 and V2. It has the following structure:
V1 \ V2       Category 1  ...  Category j  ...  Category m2
Category 1    n(1,1)      ...  n(1,j)      ...  n(1,m2)
...           ...              ...              ...
Category m1   n(m1,1)     ...  n(m1,j)     ...  n(m1,m2)
where n(i,j) is the frequency of observations that show both characteristic i for variable V1, and
characteristic j for variable V2.
To create a contingency table from two qualitative variables V1 and V2, the first transformation
consists of recoding the two qualitative variables V1 and V2 as two disjunctive tables Z1 and
Z2 or indicator (or dummy) variables. For each category of a variable there is a column in the
respective disjunctive table. Each time the category c of variable V1 occurs for an observation
i, the value of Z1(i,c) is set to one (the same rule is applied to the V2 variable). The other
values of Z1 and Z2 are zero. The contingency table of the two variables is the table Z1'Z2
(where ' indicates the matrix transpose).
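The construction can be sketched in plain Python (our own illustration, not XLSTAT code), reusing the small Q1/Q2 example from the "Full disjunctive tables" section:

```python
def disjunctive(values):
    # Full disjunctive (dummy) coding: one 0/1 column per category.
    cats = sorted(set(values))
    return cats, [[1 if v == c else 0 for c in cats] for v in values]

def contingency(v1, v2):
    # The contingency table is Z1'Z2, computed here as co-occurrence counts.
    cats1, z1 = disjunctive(v1)
    cats2, z2 = disjunctive(v2)
    table = [[sum(r1[i] * r2[j] for r1, r2 in zip(z1, z2))
              for j in range(len(cats2))] for i in range(len(cats1))]
    return cats1, cats2, table

v1 = ["A", "B", "B", "A"]
v2 = ["C", "D", "E", "D"]
print(contingency(v1, v2)[2])  # [[1, 1, 0], [0, 1, 1]]
```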
The Chi-square distance has been suggested to measure the distance between two
categories. The Pearson chi-square statistic, which is the sum of the Chi-square distances, is
used to test the independence between rows and columns. It has asymptotically a Chi-square
distribution with (m1-1)(m2-1) degrees of freedom.
Inertia is a measure inspired from physics that is often used in Correspondence Analysis, a
method that is used to analyse in depth contingency tables. The inertia of a set of points is the
weighted mean of the squared distances to the center of gravity. In the specific case of a
contingency table, the total inertia of the set of points (one point corresponds to one category)
can be written as:
\Phi^2 = \frac{1}{n} \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \frac{\left(n_{ij} - \frac{n_{i.} n_{.j}}{n}\right)^2}{\frac{n_{i.} n_{.j}}{n}}, \quad \text{with } n_{i.} = \sum_{j=1}^{m_2} n_{ij} \text{ and } n_{.j} = \sum_{i=1}^{m_1} n_{ij}
and where n is the sum of the frequencies in the contingency table. We can see that the inertia
is proportional to the Pearson chi-square statistic computed on the contingency table.
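Both quantities can be computed directly from a contingency table. The sketch below is a simplified illustration (our own function name, not XLSTAT code) that returns the Pearson chi-square statistic and the total inertia, which is simply chi-square divided by n:

```python
def chi_square_and_inertia(table):
    # Pearson chi-square statistic and total inertia (= chi-square / n)
    # for a contingency table given as a list of rows.
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(row[j] for row in table) for j in range(len(table[0]))]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (nij - expected) ** 2 / expected
    return chi2, chi2 / n

table = [[10, 20], [30, 40]]
chi2, inertia = chi_square_and_inertia(table)
print(round(chi2, 3), round(inertia, 5))  # 0.794 0.00794
```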
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Row variable(s): Select the data that correspond to the variable(s) that will be used to
construct the rows of the contingency table(s).
Column variable(s): Select the data that correspond to the variable(s) that will be used to
construct the columns of the contingency table(s).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (row and column
variables, weights) includes a header.
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Variable labels" option is activated.
Options tab:
Sort the categories alphabetically: Activate this option so that the categories of all the
variables are sorted alphabetically.
Variable-Category labels: Activate this option to create the labels of the contingency table
using both the variable name and the name of the categories. If the option is not activated, the
labels are only based on the categories.
Chi-square test: Activate this option to display the statistics and the interpretation of the Chi-
square test of independence between rows and columns.
Significance level (%): Enter the significance level for the test.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to ignore the observations that contain missing
data.
Group missing values into a new category: Activate this option to group missing data into a
new category of the corresponding variable.
Outputs tab:
List of combinations: Activate this option to display the table that lists all the possible
combinations of the two variables that are used to create a contingency table, and the
corresponding frequencies.
Inertia by cell: Activate this option to display the inertia for each cell of the contingency table.
Chi-square by cell: Activate this option to display the contribution to the chi-square of each
cell of the contingency table.
Significance by cell: Activate this option to display a table indicating, for each cell, if the
actual value is equal to (=), lower (<) or higher (>) than the theoretical value, and to run a test
(Fisher's exact test on a 2x2 table having the same total frequency as the complete table,
and the same marginal sums for the cell of interest), in order to determine if the difference with
the theoretical value is significant or not.
Observed frequencies: Activate this option to display the table of the observed frequencies.
This table is almost identical to the contingency table, except that the marginal sums are also
displayed.
Theoretical frequencies: Activate this option to display the table of the theoretical frequencies
computed using the marginal sums of the contingency table.
Proportions or percentages / Row: Activate this option to display the table of proportions or
percentages computed by dividing the values of the contingency table by the marginal sums of
each row.
Proportions or percentages / Column: Activate this option to display the table of proportions
or percentages computed by dividing the values of the contingency table by the marginal sums
of each column.
Proportions or percentages / Total: Activate this option to display the table of proportions or
percentages computed by dividing the values of the contingency table by the sum of all the
cells of the contingency table.
Charts tab:
3D view of the contingency table: Activate this option to display the 3D bar chart
corresponding to the contingency table.
Full disjunctive tables
Use this tool to create a full disjunctive table from one or more qualitative variables.
Description
Dialog box
: Click this button to close the dialog box without doing any computation.
Data: Select the data in the Excel worksheet. If headers have been selected, check that the
"Variable labels" option has been activated.
Variable labels: Check this option if the first line of the selected data contains a label.
Observation labels: Check this option if you want to use the available line labels. If you do not
check this option, line labels will be created automatically (Obs1, Obs2, etc.). If a column
header has been selected, check that the "Variable labels" option has been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Display the report header: Deactivate this option if you want the full disjunctive table to start
from the first row of the Excel worksheet (situation after output to a worksheet or workbook)
and not after the report header.
Example
Input table:
Q1 Q2
Obs1 A C
Obs2 B D
Obs3 B E
Obs4 A D
Output table:
       Q1-A  Q1-B  Q2-C  Q2-D  Q2-E
Obs1     1     0     1     0     0
Obs2     0     1     0     1     0
Obs3     0     1     0     0     1
Obs4     1     0     0     1     0
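The same coding can be reproduced in a few lines of Python (our own sketch, not XLSTAT code; the variable-category label format is an assumption for illustration):

```python
def full_disjunctive_table(columns):
    # columns: dict mapping variable name -> list of category values.
    # Returns the variable-category labels and the 0/1 rows.
    header = []
    per_var_cats = {}
    for name, values in columns.items():
        cats = sorted(set(values))
        per_var_cats[name] = cats
        header += [f"{name}-{c}" for c in cats]
    n = len(next(iter(columns.values())))
    rows = []
    for i in range(n):
        row = []
        for name, values in columns.items():
            row += [1 if values[i] == c else 0 for c in per_var_cats[name]]
        rows.append(row)
    return header, rows

header, rows = full_disjunctive_table({"Q1": ["A", "B", "B", "A"],
                                       "Q2": ["C", "D", "E", "D"]})
print(header)   # ['Q1-A', 'Q1-B', 'Q2-C', 'Q2-D', 'Q2-E']
print(rows[0])  # [1, 0, 1, 0, 0]
```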
Discretization
Use this tool to discretize a numerical variable. Several discretization methods are available.
Description
Discretizing a numerical variable means transforming it into an ordinal variable. This process is
widely used in marketing, where it is often referred to as segmentation.
XLSTAT makes available several discretization methods that are more or less automatic. The
number of classes (or intervals, or segments) to generate is either fixed by the user (for
example with the method of equal ranges), or by the method itself (for example, with the 80-20
option where two classes are created).
Fisher's classification algorithm can be very slow when the size of the dataset exceeds 1000.
This method generates a number of classes that is lower than or equal to the number of
classes requested by the user, as the algorithm is able to automatically merge similar classes.
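Two of the simpler strategies, equal ranges and equal frequencies, can be sketched as follows. This is our own simplified illustration, not XLSTAT's implementation:

```python
def equal_range_classes(values, k):
    # k intervals of equal width between the min and max of the series.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Assign each value to an interval index in [0, k-1].
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_classes(values, k):
    # Classes containing, as far as possible, the same number of observations.
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, i in enumerate(order):
        classes[i] = rank * k // len(values)
    return classes

data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
print(equal_range_classes(data, 3))      # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(equal_frequency_classes(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```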
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Constant range: Choose this method to create classes that have the same range.
Then enter the value of the range. You can optionally specify the minimum that
corresponds to the lower bound of the first interval. This value must be lower than or
equal to the minimum value of the series. If the minimum is not specified, the lower
bound will be set to the minimum value of the series.
Intervals: Use this method to create a given number of intervals with the same range.
Then, enter the number of intervals. The range of the intervals is determined by the
difference between the maximum and minimum values of the series. You can
optionally specify the minimum that corresponds to the lower bound of the first
interval. This value must be lower than or equal to the minimum value of the series. If
the minimum is not specified, the lower bound will be set to the minimum value of the
series.
Equal frequencies: Choose this method so that all the classes contain, as far as
possible, the same number of observations. Then, enter the number of intervals (or
classes) to generate.
Automatic (Fisher): Use this method to create the classes using Fisher's algorithm.
When the size of the dataset exceeds 1000, the computations can be very slow. You
need to enter the number of intervals (or classes) to generate. However, this method
generates a number of classes that is lower than or equal to the number of classes
required by the user, as the algorithm is able to automatically merge similar classes.
Automatic (k-means): Choose this method to create classes (or intervals) using the k-
means algorithm. Then, enter the number of intervals (or classes) to generate.
Intervals (user defined): Choose this option to select a column containing in increasing
order the lower bound of the first interval, and the upper bound of all the intervals.
80-20: Use this method to create two classes, the first containing the first 80% of the
series, the data being sorted in increasing order, the second containing the remaining
20%.
20-80: Use this method to create two classes, the first containing the first 20% of the
series, the data being sorted in increasing order, the second containing the remaining
80%.
80-15-5 (ABC): Use this method to create three classes, the first containing the first
80% of the series, the data being sorted in increasing order, the second containing the
next 15%, and the third containing the remaining 5%. This method is sometimes
referred to as ABC classification.
5-15-80: Use this method to create three classes, the first containing the first 5% of
the series, the data being sorted in increasing order, the second containing the next
15%, and the third containing the remaining 80%.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet in the active workbook.
Variable labels: Check this option if the first line of the selected data contains a label.
Observation labels: Check this option if you want to use the available line labels. If you do not
check this option, line labels will be created automatically (Obs1, Obs2, etc.). If a column
header has been selected, check that the "Variable labels" option has been activated.
Display the report header: Deactivate this option if you do not want to display the report
header.
Options tab:
Weights: Check this option if the observations are weighted. If you do not check this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Variable labels" option is activated.
Standardize the weights: if you check this option, the weights are standardized such
that their sum equals the number of observations.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations:
For the corresponding sample: Activate this option to ignore an observation which
has a missing value only for the variables that have a missing value.
For all samples: Activate this option to ignore an observation which has a missing
value for all selected variables.
Estimate missing data: Activate this option to estimate the missing data by using the mean of
the variable.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Centroids: Activate this option to display the table of centroids of the classes.
Central objects: Activate this option to display the coordinates of the nearest object to the
centroid for each class.
Results by class: Activate this option to display a table giving the statistics and the objects for
each of the classes.
Results by object: Activate this option to display a table giving the class each object is
assigned to in the initial object order.
Charts tab:
Histograms: Activate this option to display the histograms of the samples. For a theoretical
distribution, the density function is displayed.
Bars: Choose this option to display the histograms with a bar for each interval.
Continuous lines: Choose this option to display the histograms with a continuous line.
Cumulative histograms: Activate this option to display the cumulated histograms of the
samples.
Based on the histogram: Choose this option to display cumulative histograms based
on the same interval definition as the histograms.
Empirical cumulative distribution: Choose this option to display cumulative
histograms which actually correspond to the empirical cumulative distribution of the
sample.
Ordinate of the histograms: Choose the quantity to be used for the histograms: density,
frequency or relative frequency.
Results
Summary statistics: This table displays for the selected variables, the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation.
Class centroids: This table shows the class centroids for the various descriptors.
Distance between the class centroids: This table shows the Euclidean distances between
the class centroids for the various descriptors.
Central objects: This table shows the coordinates of the nearest object to the centroid for
each class.
Distance between the central objects: This table shows the Euclidean distances between
the class central objects for the various descriptors.
Results by class: The descriptive statistics for the classes (number of objects, sum of
weights, within-class variance, minimum distance to the centroid, maximum distance to the
centroid, mean distance to the centroid) are displayed in the first part of the table. The second
part shows the objects.
Results by object: This table shows the assignment class for each object in the initial object
order.
References
Arabie P., Hubert L.J. and De Soete G. (1996). Clustering and Classification. World Scientific,
Singapore.
Everitt B.S., Landau S. and Leese M. (2001). Cluster Analysis (4th edition). Arnold, London.
Fisher W.D. (1958). On grouping for maximum homogeneity. Journal of the American
Statistical Association, 53, 789-798.
Data management
Use this tool to manage tables of data. Four functions are included in this tool: deduping,
grouping, and joining (inner and outer). These features are common in databases, but are not
included in Excel.
Description
Deduping
Grouping
Grouping is useful when you want to aggregate data. For example, imagine a table that
contains all your sales records (one column with the customer id, and one with the sales
value), and which you want to transform to have one record per customer, and the
corresponding sum of sales. XLSTAT allows you to aggregate the data and to obtain the
summary table within seconds. The sum is only one of the available possibilities.
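The Group operation described above can be sketched in a few lines of Python (the data and the `group_sum` helper are illustrative, not part of XLSTAT):

```python
from collections import defaultdict

# Hypothetical sales records: (customer id, sales value) pairs.
sales = [("C1", 120.0), ("C2", 45.0), ("C1", 30.0), ("C3", 80.0), ("C2", 5.0)]

def group_sum(records):
    """Aggregate records by key, summing the values (one row per key)."""
    totals = defaultdict(float)
    for key, value in records:
        totals[key] += value
    return dict(totals)

summary = group_sum(sales)
# summary == {"C1": 150.0, "C2": 50.0, "C3": 80.0}
```

Replacing the sum with another reduction (count, mean, minimum, …) gives the other aggregation operations.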
Joining
Joining is a common task in database management. It lets you merge two tables horizontally
on the basis of a common piece of information called the key. For example, imagine you measured
some chemical indicators on 150 sites. Then you want to add geographical information on the
sites where the data were collected. Your geographical table contains information on 1000
sites, including the 150 sites of interest. In order to avoid the tedious work of manually merging
the two tables, a join will allow you to obtain within seconds the merged table that includes
both the collected data and the geographical information.
Inner joins: the merged table includes only keys that are common to both input tables.
Outer joins: the merged table includes all keys that are available in the first, the second or
both input tables.
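The two join variants can be sketched in Python as follows (the tables and the `join` helper are hypothetical illustrations of the behaviour described above):

```python
# Hypothetical tables indexed by a site-id key:
# chemical measurements on some sites, geography on others.
chemistry = {"S1": {"pH": 6.8}, "S2": {"pH": 7.4}, "S3": {"pH": 5.9}}
geography = {"S2": {"lat": 48.1}, "S3": {"lat": 43.6}, "S4": {"lat": 51.0}}

def join(t1, t2, how="inner"):
    """Merge two key-indexed tables horizontally.
    'inner' keeps only keys common to both tables;
    'outer' keeps every key found in either table."""
    keys = t1.keys() & t2.keys() if how == "inner" else t1.keys() | t2.keys()
    return {k: {**t1.get(k, {}), **t2.get(k, {})} for k in sorted(keys)}

inner = join(chemistry, geography, "inner")  # keys S2 and S3 only
outer = join(chemistry, geography, "outer")  # keys S1, S2, S3 and S4
```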
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data: This field is displayed if the selected method is Dedupe or Group. Select the data that
correspond to the table that you want to dedupe or to aggregate.
Observation labels: This field is displayed only for the Dedupe method. Select the column
(column mode) or row (row mode) where the observations labels are available. If you do not
check this option, labels will be created automatically (Obs1, Obs2, etc.). If a column header
has been selected, check that the "Variable labels" option has been activated.
Table 1: This field is displayed if the data management method is Join. Select the data that
correspond to the first input table to use in the join procedure.
Table 2: This field is displayed if the data management method is Join. Select the data that
correspond to the second input table to use in the join procedure.
Guess types: This option is displayed only for the Group method. Activate this option if you
want XLSTAT to guess the types of the variables in the selected table. If you uncheck this
option, XLSTAT will prompt you to confirm or modify the type of the variables.
Dedupe
Group
Join (Inner)
Join (Outer)
Range: Check this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Check this option to display the results in a new worksheet in the active workbook.
Variable labels: Check this option if the first row of the selected data (data and observation
labels) contains a label.
Operation: This option is only available if the method is Group. Select the operation to apply
to the data when aggregating them.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
The following options are only displayed if the selected method is Dedupe:
Frequencies: Activate this option to display, in the last column of the deduped table, the
frequency of each observation in the input table (1 corresponds to non-repeated
observations; values greater than or equal to 2 correspond to duplicated observations).
Duplicates: Activate this option to display the duplicates that have been removed from the
original table in order to obtain the deduped table.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Coding
Use this tool to code or recode a table into a new table, using a coding table that contains the
initial values and the corresponding new codes.
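The recoding described above amounts to a lookup in a two-column table. A minimal Python sketch (the coding table and values are invented for illustration):

```python
# Hypothetical coding table: initial value -> new code.
coding = {"Low": 1, "Medium": 2, "High": 3}
data = ["Low", "High", "Medium", "Low"]

# Values absent from the coding table are kept unchanged.
recoded = [coding.get(v, v) for v in data]
# recoded == [1, 3, 2, 1]
```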
Dialog box
: Click this button to close the dialog box without doing any computation.
Data: Select the data in the Excel worksheet. If headers have been selected, check that the
"Column labels" option has been activated.
Coding table: Select a two-column table that contains in the first column the initial values, and
in the second column the codes that will replace the values. If headers have been selected,
check that the "Column labels" option has been activated.
Column labels: Activate this option if the first row of the data selected (data and coding table)
contains a label.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Display the report header: Deactivate this option if you want the results table to start from the
first row of the Excel worksheet (situation after output to a worksheet or workbook) and not
after the report header.
Presence/absence coding
Use this tool to convert a table of lists (or attributes) into a table of presences/absences
showing the frequencies of the various elements for each of the lists.
Description
This tool is used, for example, to convert a table containing p columns corresponding to p lists
of objects into a table with p rows and q columns where q is the number of different objects
contained in the p lists, and where for each cell of the table, there is a 1 if the object is present
and a 0 if it is absent.
For example, in ecology, if we have p species measurements with, for each measurement, the
different species found in columns, we will obtain a two-way table showing the presence or
absence of each of the species for each of the measurements.
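The conversion of p lists into a p-row presence/absence table can be sketched in Python (the list contents are illustrative):

```python
# Hypothetical lists of objects (e.g. species observed at two sites).
lists = {"List1": ["E1", "E1", "E2", "E1", "E3"],
         "List2": ["E3", "E1", "E4"]}

# q columns: the distinct objects found across the p lists, in sorted order.
objects = sorted({obj for members in lists.values() for obj in members})

# One row per list: 1 if the object appears in that list, 0 otherwise.
table = {name: [1 if obj in set(members) else 0 for obj in objects]
         for name, members in lists.items()}
# objects == ["E1", "E2", "E3", "E4"]
# table == {"List1": [1, 1, 1, 0], "List2": [1, 0, 1, 1]}
```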
Dialog box
: Click this button to close the dialog box without doing any computation.
Column labels: Activate this option if the first row of the selected data contains a label.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Display the report header: Deactivate this option if you want the results table to start from the
first row of the Excel worksheet (situation after output to a worksheet or workbook) and not
after the report header.
Example
Input table:
List1 List2
E1 E3
E1 E1
E2 E4
E1
E3
Presence/absence table:
E1 E2 E3 E4
List1 1 1 1 0
List2 1 0 1 1
Coding by ranks
Use this tool to recode a table with n observations and p quantitative variables into a table
containing ranks, the latter being determined variable by variable.
Description
This tool is used to recode a table with n observations and p quantitative variables into a table
containing ranks, the ranks being determined variable by variable. Coding in ranks lets you
convert a table of continuous quantitative variables into discrete quantitative variables if only
the order relationship is relevant and not the values themselves.
Two strategies are possible for taking tied values into account: either they are assigned to the
mean rank or they are assigned to the lowest rank of the tied values.
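Both tie-handling strategies can be sketched in Python (the `ranks` helper is an illustration, not XLSTAT's implementation):

```python
def ranks(values, ties="mean"):
    """Rank values from 1 (smallest); tied values all receive either the
    mean of their ranks (ties="mean") or the lowest rank (ties="minimum")."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Tied positions i..j (0-based) would get ranks i+1..j+1.
        rank = (i + j + 2) / 2 if ties == "mean" else i + 1
        for k in range(i, j + 1):
            out[order[k]] = rank
        i = j + 1
    return out

ranks([1.2, 1.6, 1.2], "mean")     # [1.5, 3.0, 1.5]
ranks([1.2, 1.6, 1.2], "minimum")  # [1, 3, 1]
```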
Dialog box
: Click this button to close the dialog box without doing any computation.
Variable labels: Check this option if the first line of the selected data contains a label.
Observation labels: Check this option if you want to use the available line labels. If you do not
check this option, line labels will be created automatically (Obs1, Obs2, etc.). If a column
header has been selected, check that the "Variable labels" option has been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Take ties into account: Activate this option to take account of the presence of tied values and
adapt the rank of tied values as a consequence.
Mean ranks: Choose this option to replace the rank of tied values by the mean of the
ranks.
Minimum: Choose this option to replace the rank of tied values by the minimum of their
ranks.
Display the report header: Deactivate this option if you want the results table to start from
the first row of the Excel worksheet (situation after output to a worksheet or workbook) and not
after the report header.
Example
Initial table:
V1 V2
Obs1 1.2 12
Obs2 1.6 11
Obs3 1.2 10

Ranks (Minimum):
R1 R2
Obs1 1 4
Obs2 4 3
Obs3 1 1
Obs4 3 2

Ranks (Mean ranks):
R1 R2
Obs1 1.5 4
Obs2 4 3
Obs3 1.5 1
Obs4 3 2
Descriptive statistics and Univariate plots
Use this tool to calculate descriptive statistics and display univariate plots (box plots,
scattergrams, etc.) for a set of quantitative and/or qualitative variables.
Description
Before using advanced analysis methods such as discriminant analysis or multiple
regression, you must first explore the data in order to identify trends, locate anomalies or
simply have at hand essential information such as the minimum, maximum or mean of a data
sample.
XLSTAT offers you a large number of descriptive statistics and charts which give you a useful
and relevant preview of your data.
Although you can select several variables (or samples) at the same time, XLSTAT calculates
all the descriptive statistics for each of the samples independently.
Let's consider a sample made up of N items of quantitative data {y1, y2, …, yN} whose
respective weights are {W1, W2, …, WN}.
Number of missing values: The number of missing values in the sample analyzed. In
the subsequent statistical calculations, values identified as missing are ignored. We
define n to be the number of non-missing values, and {x1, x2, …, xn} to be the sub-
sample of non-missing values whose respective weights are {w1, w2, …, wn}.
Sum of weights*: The sum of the weights, Sw. When all weights are 1, or when
weights are "standardized", Sw=n.
Range: The range is the difference between the minimum and maximum of the series.
1st quartile*: The first quartile Q1 is defined as the value below which 25% of the values
lie.
Median*: The median Q2 is the value below which 50% of the values lie.
3rd quartile*: The third quartile Q3 is defined as the value below which 75% of the values
lie.
Mean*: The weighted mean of the sample, defined by:

x̄ = Σ(i=1..n) wi xi / Sw

Variance (n)*: The variance of the sample, defined by:

s(n)² = Σ(i=1..n) wi (xi − x̄)² / Sw
Note 1: When all the weights are 1, the variance is the sum of the squared deviations
from the mean divided by n, hence its name.
Note 2: The variance (n) is a biased estimate of the variance which assumes that the
sample is a good representation of the total population. The variance (n-1) is, on the
other hand, calculated taking into account an approximation associated with the
sampling.
Variance (n-1)*: The variance of the sample, defined by:

s(n-1)² = Σ(i=1..n) wi (xi − x̄)² / (Sw − Sw/n)

Note 1: When all the weights are 1, the variance is the sum of the squared deviations
from the mean divided by n-1, hence its name.
Note 2: The variance (n) is a biased estimate of the variance which assumes that the
sample is a good representation of the total population. The variance (n-1) is, on the
other hand, calculated taking into account an approximation associated with the
sampling.
Standard deviation (n)*: The standard deviation of the sample defined by s(n).
Standard deviation (n-1)*: The standard deviation of the sample defined by s(n-1).
Variation coefficient: this coefficient is only calculated if the mean of the sample is
non-zero. It is defined by CV = s(n) / x̄. This coefficient measures the dispersion of a
sample relative to its mean. It is used to compare the dispersion of samples whose
scales or means differ greatly.
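A minimal Python sketch of the weighted mean and the two variance estimators described in this section (the `weighted_stats` helper is illustrative):

```python
def weighted_stats(x, w):
    """Weighted mean, variance(n) and variance(n-1),
    with Sw the sum of weights and n the number of values."""
    n = len(x)
    sw = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ss = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x))
    var_n = ss / sw                  # biased estimator
    var_n1 = ss / (sw - sw / n)      # corrected estimator
    return mean, var_n, var_n1

# With unit weights this reduces to the usual estimators.
m, vn, vn1 = weighted_stats([2.0, 4.0, 6.0], [1.0, 1.0, 1.0])
# m == 4.0, vn == 8/3, vn1 == 4.0
```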
Skewness (Pearson)*: The skewness coefficient of the sample, defined by:

γ1 = μ3 / s(n)³, with μ3 = Σ(i=1..n) wi (xi − x̄)³ / Sw

This coefficient gives an indication of the shape of the distribution of the sample. If the
value is negative (respectively positive), the distribution is concentrated on the left
(respectively right) of the mean.
Skewness (Fisher)*: Fisher's skewness coefficient, defined by:

G1 = √(Sw (Sw − Sw/n)) / (Sw − 2 Sw/n) · γ1

Unlike the previous coefficient, this one is not biased under the assumption that the data
are normally distributed. It gives an indication of the shape of the distribution
of the sample. If the value is negative (respectively positive), the distribution is
concentrated on the left (respectively right) of the mean.
Skewness (Bowley)*: Bowley's skewness coefficient, defined from the quartiles by:

A(B) = (Q1 − 2Q2 + Q3) / (Q3 − Q1)
Kurtosis (Pearson)*: The kurtosis coefficient of the sample, defined by:

γ2 = μ4 / s(n)⁴ − 3, with μ4 = Σ(i=1..n) wi (xi − x̄)⁴ / Sw

This coefficient, sometimes called excess kurtosis, gives an indication of the shape of
the distribution of the sample. If the value is negative (respectively positive), the peak
of the distribution of the sample is flatter (respectively sharper) than that of a
normal distribution.
Kurtosis (Fisher)*: Fisher's kurtosis coefficient, defined by:

G2 = (Sw − Sw/n) / [(Sw − 2 Sw/n)(Sw − 3 Sw/n)] · [(Sw + Sw/n) γ2 + 6 Sw/n]

Unlike the previous coefficient, this one is not biased under the assumption that the data
are normally distributed. This coefficient, sometimes called excess kurtosis, gives an
indication of the shape of the distribution of the sample. If the value is negative (respectively
positive), the peak of the distribution of the sample is flatter (respectively
sharper) than that of a normal distribution.
Standard error of the mean*: The standard error of the mean, defined by:

sμ = √( s(n-1)² / Sw )

Lower bound on mean (x%)*: this statistic corresponds to the lower bound of the
confidence interval at x% of the mean. This statistic is defined by:

L = x̄ − sμ t(100+x)/2

where t(100+x)/2 is the corresponding quantile of Student's t distribution.

Upper bound on mean (x%)*: this statistic corresponds to the upper bound of the
confidence interval at x% of the mean. This statistic is defined by:

U = x̄ + sμ t(100+x)/2
Standard error (Skewness (Fisher))*: The standard error of Fisher's skewness
coefficient, defined by:

se(G1) = √( 6 Sw (Sw − 1) / [(Sw − 2)(Sw + 1)(Sw + 3)] )

Standard error (Kurtosis (Fisher))*: The standard error of Fisher's kurtosis
coefficient, defined by:

se(G2) = √( 4 (Sw² − 1) se(G1)² / [(Sw − 3)(Sw + 5)] )
Mean absolute deviation*: as for the standard deviation or the variance, this coefficient
measures the dispersion (or variability) of the sample. It is defined by:

e = Σ(i=1..n) wi |xi − x̄| / Sw

Median absolute deviation*: this statistic is the median of the absolute deviations from
the median.
Geometric mean*: this statistic is only calculated if all the values are strictly positive. It
is defined by:

G = exp( Σ(i=1..n) wi Ln(xi) / Sw )

When all the weights are 1, this reduces to G = (Π(i=1..n) xi)^(1/n).

Geometric standard deviation*: defined by:

σG = exp √( Σ(i=1..n) wi (Ln(xi) − Ln(G))² / Sw )

Harmonic mean*: defined by:

H = Sw / Σ(i=1..n) (wi / xi)
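The geometric and harmonic means can be sketched in Python (the helper names are illustrative):

```python
import math

def geometric_mean(x, w):
    """exp of the weighted mean of the logs; requires strictly positive x."""
    sw = sum(w)
    return math.exp(sum(wi * math.log(xi) for wi, xi in zip(w, x)) / sw)

def harmonic_mean(x, w):
    """Sum of weights divided by the weighted sum of reciprocals."""
    sw = sum(w)
    return sw / sum(wi / xi for wi, xi in zip(w, x))

# Unit weights: the geometric mean of [1, 4, 16] is 4,
# and the harmonic mean of [2, 6] is 3.
g = geometric_mean([1.0, 4.0, 16.0], [1, 1, 1])
h = harmonic_mean([2.0, 6.0], [1, 1])
```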
(*) Statistics followed by an asterisk take the weight of observations into account.
Number of missing values: The number of missing values in the sample analyzed. In
the subsequent statistical calculations, values identified as missing are ignored. We
define n to be the number of non-missing values, and {w1, w2, …, wn} to be the sub-
sample of weights for the non-missing values.
Sum of weights*: The sum of the weights, Sw. When all the weights are 1, Sw=n.
Mode*: The mode of the sample analyzed. In other words, the most frequent category.
Frequency of mode*: The frequency of the category to which the mode corresponds.
(*) Statistics followed by an asterisk take the weight of observations into account.
Several types of chart are available for quantitative and qualitative data:
Lower limit: Linf = X(i) such that |X(i) − [Q1 − 1.5 (Q3 − Q1)]| is minimum and X(i) ≥ Q1 −
1.5 (Q3 − Q1).
Upper limit: Lsup = X(i) such that |X(i) − [Q3 + 1.5 (Q3 − Q1)]| is minimum and X(i) ≤
Q3 + 1.5 (Q3 − Q1).
Values that are outside the ]Q1 − 3 (Q3 − Q1); Q3 + 3 (Q3 − Q1)[ interval are displayed
with the * symbol; values that are in the [Q1 − 3 (Q3 − Q1); Q1 − 1.5 (Q3 − Q1)] or the
[Q3 + 1.5 (Q3 − Q1); Q3 + 3 (Q3 − Q1)] intervals are displayed with the o symbol.
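The outlier symbols described above can be sketched in Python; the quartiles are passed in directly, since quantile conventions vary (the `classify` helper is illustrative):

```python
def classify(x, q1, q3):
    """Label each value against Tukey-style fences:
    '*' outside [Q1 - 3*IQR, Q3 + 3*IQR],
    'o' between the 1.5*IQR and 3*IQR fences,
    ''  inside the inner fences."""
    iqr = q3 - q1

    def label(v):
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            return "*"
        if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            return "o"
        return ""

    return [label(v) for v in x]

# Q1=10, Q3=20 -> IQR=10: inner fences at [-5, 35], outer at [-20, 50].
classify([12, -6, 60, 34], 10, 20)  # ["", "o", "*", ""]
```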
Strip plots: These diagrams represent the data from the sample as strips. For a given
interval, the thicker or more tightly packed the strips, the more data there is.
P-P Charts (normal distribution): P-P charts (for Probability-Probability) are used to
compare the empirical cumulative distribution function of a sample with that of a normal
variable with the same mean and standard deviation. If the sample follows a normal
distribution, the points will lie along the first bisector of the plane.
Q-Q Charts (normal distribution): Q-Q charts (for Quantile-Quantile) are used to
compare the quantiles of the sample with those of a normal variable with the same mean
and standard deviation. If the sample follows a normal distribution, the points will lie
along the first bisector of the plane.
Bar charts: Check this option to represent the frequencies or relative frequencies of the
various categories of qualitative variables as bars.
Pie charts: Check this option to represent the frequencies or relative frequencies of the
various categories of qualitative variables as pie charts.
Double pie charts: These charts are used to compare the frequencies or relative
frequencies of sub-samples with those of the complete sample.
Doughnuts: this option is only available if a column of sub-samples has been selected.
These charts are used to compare the frequencies or relative frequencies of sub-
samples with those of the complete sample.
Stacked bars: this option is only available if a column of sub-samples has been
selected. These charts are used to compare the frequencies or relative frequencies of
sub-samples with those of the complete sample.
Dialog box
The dialog box is made up of several tabs corresponding to the various options for controlling
the calculations and displaying the results. A description of the various components of the
dialog box is given below.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Quantitative data: Check this option to select the samples of quantitative data you want to
calculate descriptive statistics for.
Qualitative data: Check this option to select the samples of qualitative data you want to
calculate descriptive statistics for.
Sub-sample: Check this option to select a column showing the names or indexes of the sub-
samples for each of the observations.
Range: Check this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Check this option to display the results in a new worksheet in the active workbook.
Sample labels: Check this option if the first line of the selections (quantitative data, qualitative
data, sub-samples, and weights) contains a label.
Weights: Check this option if the observations are weighted. If you do not check this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Sample labels" option is activated.
Standardize the weights: if you check this option, the weights are standardized such
that their sum equals the number of observations.
Options tab:
Descriptive statistics: Check this option to calculate and display descriptive statistics.
Normalize: Check this option to standardize the data before carrying out the analysis.
Rescale from 0 to 100: Check this option to arrange the data on a scale of 0 to 100.
Compare to total sample: this option is only available if a column of sub-samples has been
selected. Check this option so that the descriptive statistics and charts are also displayed for
the total sample.
Outputs tab:
Quantitative Data: Activate the options for the descriptive statistics you want to calculate. The
various statistics are described in the description section.
Display vertically: Check this option so that the table of descriptive statistics is
displayed vertically (one line per descriptive statistic).
Qualitative Data: Activate the options for the descriptive statistics you want to calculate. The
various statistics are described in the description section.
Display vertically: Check this option so that the table of descriptive statistics is
displayed vertically (one line per descriptive statistic).
Box plots: Check this option to display box plots (or box-and-whisker plots). See the
description section for more details.
Horizontal: Check this option to display box plots, scattergrams and strip plots
horizontally.
Vertical: Check this option to display box plots, scattergrams and strip plots vertically.
Group plots: Check this option to group together the various box plots, scattergrams
and strip plots on the same chart to compare them.
Outliers: Check this option to display the points corresponding to outliers (box plots)
with a hollowed-out circle.
Labels position: Select the position where the labels have to be placed on the box
plots, scattergrams and strip plots.
Scattergrams: Check this option to display scattergrams. The mean (red +) and the median
(red line) are always displayed.
Strip plots: Check this option to display strip plots. On these charts, a strip corresponds to an
observation.
Bar charts: Check this option to represent the frequencies or relative frequencies of the
various categories of qualitative variables as bars.
Pie charts: Check this option to represent the frequencies or relative frequencies of the
various categories of qualitative variables as pie charts.
Doubles: this option is only available if a column of sub-samples has been selected.
These charts are used to compare the frequencies or relative frequencies of sub-
samples with those of the complete sample.
Doughnuts: this option is only available if a column of sub-samples has been selected. These
charts are used to compare the frequencies or relative frequencies of sub-samples with those
of the complete sample.
Stacked bars: this option is only available if a column of sub-samples has been selected.
These charts are used to compare the frequencies or relative frequencies of sub-samples with
those of the complete sample.
Frequencies: choose this option to make the scale of the plots correspond to the
frequencies of the categories.
Relative frequencies: choose this option to make the scale of the plots correspond to
the relative frequencies of the categories.
References
Filliben J.J. (1975). The Probability Plot Correlation Coefficient Test for Normality.
Technometrics, 17(1), 111-117.
DeCarlo L.T. (1997). On the Meaning and Use of Kurtosis. Psychological Methods,
2(3), 292-307.
Sokal R.R. and Rohlf F.J. (1995). Biometry. The Principles and Practice of Statistics in
Biological Research. Third edition. Freeman, New York.
Tomassone R., Dervin C. and Masson J.P. (1993). Biométrie. Modélisation de Phénomènes
Biologiques. Masson, Paris.
Histograms
Use this tool to create a histogram from a sample of continuous or discrete quantitative data.
Description
The histogram is one of the most frequently used display tools as it gives a very quick idea of
the distribution of a sample of continuous or discrete data.
Intervals definition
One of the challenges in creating histograms is defining the intervals, as for a determined set
of data, the shape of the histogram depends solely on the definition of the classes. Between
the two extremes of the single class comprising all the data and giving a single bar and the
histogram with one value per class, there are as many possible histograms as there are data
partitions.
To obtain a visually and operationally satisfying result, defining classes may require several
attempts.
The most traditional method consists of using classes defined by intervals of the same width,
the lower bound of the first interval being determined by the minimum value or a value slightly
less than the minimum value.
To make it easier to obtain histograms, XLSTAT lets you create histograms either by defining
the number of intervals, their width or by specifying the intervals yourself. The intervals are
considered as closed for the lower bound and open for the upper bound.
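Binning with user-defined intervals, closed on the lower bound and open on the upper, can be sketched in Python (the `histogram` helper is illustrative):

```python
def histogram(data, bounds):
    """Count values into intervals [bounds[i], bounds[i+1]) — closed on
    the lower bound, open on the upper. Values outside all intervals
    are ignored."""
    counts = [0] * (len(bounds) - 1)
    for v in data:
        for i in range(len(bounds) - 1):
            if bounds[i] <= v < bounds[i + 1]:
                counts[i] += 1
                break
    return counts

histogram([1, 2, 2, 3, 5, 9], [0, 2, 4, 10])  # [1, 3, 2]
```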
Cumulative histogram
XLSTAT lets you create cumulative histograms either by cumulating the values of the
histogram or by using the empirical cumulative distribution. The use of the empirical cumulative
distribution is recommended for a comparison with a distribution function of a theoretical
distribution.
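The empirical cumulative distribution used by these cumulative histograms can be sketched in Python (the `ecdf` helper is illustrative):

```python
def ecdf(data):
    """Empirical cumulative distribution: for each sorted value x(i),
    the proportion of observations less than or equal to x(i)."""
    xs = sorted(data)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

ecdf([3, 1, 2, 2])  # [(1, 0.25), (2, 0.5), (2, 0.75), (3, 1.0)]
```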
XLSTAT lets you compare the histogram with a theoretical distribution whose parameters you
have set. However, if you want to check whether a sample follows a given distribution, you
can use the distribution fitting tool to estimate the parameters of the distribution and, if
necessary, check whether the hypothesis is acceptable.
Bernoulli (p): the density function of this distribution is given by:

P(X = 1) = p, P(X = 0) = 1 − p, with p ∈ ]0,1[

The Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli
(1654-1705), describes binary phenomena where only two events can occur, with
respective probabilities of p and 1-p.
Beta (α, β): the density function of this distribution (also called Beta type I) is given by:

f(x) = x^(α-1) (1 − x)^(β-1) / B(α, β), with α, β > 0, x ∈ [0,1] and B(α, β) = Γ(α)Γ(β) / Γ(α+β)

Beta4 (α, β, c, d): the density function of this distribution is given by:

f(x) = (x − c)^(α-1) (d − x)^(β-1) / [B(α, β) (d − c)^(α+β-1)], with α, β > 0, x ∈ [c, d],
c, d ∈ R and B(α, β) = Γ(α)Γ(β) / Γ(α+β)

For the type I beta distribution, X takes values in the [0,1] range. The beta4
distribution is obtained by a variable transformation such that the distribution is on a
[c, d] interval where c and d can take any value.
Beta (a, b): the density function of this distribution (also called Beta type I) is given by:

f(x) = x^(a-1) (1 − x)^(b-1) / B(a, b), with a, b > 0, x ∈ [0,1] and B(a, b) = Γ(a)Γ(b) / Γ(a+b)
Binomial (n, p): the density function of this distribution is given by:

P(X = x) = C(n, x) p^x (1 − p)^(n−x), with x ∈ {0, 1, …, n} and p ∈ [0,1]

E(X) = np and V(X) = np(1-p)

n is the number of trials, and p the probability of success. The binomial distribution
is the distribution of the number of successes for n trials, given that the probability
of success is p.
Negative binomial type I (n, p): the density function of this distribution is given by:
Negative binomial type II (k, p): the density function of this distribution is given by:

P(X = x) = Γ(k + x) p^x / [x! Γ(k) (1 + p)^(k+x)], with x ∈ N and k, p > 0

The negative binomial type II distribution is used to represent discrete and highly
heterogeneous phenomena. As k tends to infinity, the negative binomial type II
distribution tends towards a Poisson distribution with λ = kp.
Chi-square (df): the density function of this distribution is given by:

f(x) = (1/2)^(df/2) x^(df/2 − 1) e^(−x/2) / Γ(df/2), with x ≥ 0 and df ∈ N*
Erlang (k, λ): the density function of this distribution is given by:

f(x) = λ^k x^(k−1) e^(−λx) / (k − 1)!, with x ≥ 0, k, λ > 0 and k ∈ N

This distribution, developed by the Danish scientist A. K. Erlang (1878-1929) when
studying telephone traffic, is more generally used in the study of queuing
problems.
Note: When k=1, this distribution is equivalent to the exponential distribution. The
Gamma distribution with two parameters is a generalization of the Erlang
distribution to the case where k is a real and not an integer (for the Gamma
distribution the scale parameter is used).
Exponential (λ): the density function of this distribution is given by:

f(x) = λ exp(−λx), with x ≥ 0 and λ > 0

The exponential distribution is often used for studying lifetimes in quality control.
Fisher (df1, df2): the density function of this distribution is given by:

f(x) = [1 / (x B(df1/2, df2/2))] [df1 x / (df1 x + df2)]^(df1/2) [1 − df1 x / (df1 x + df2)]^(df2/2),
with x ≥ 0 and df1, df2 ∈ N*

E(X) = df2/(df2 − 2) if df2 > 2, and V(X) = 2 df2² (df1 + df2 − 2) / [df1 (df2 − 2)² (df2 − 4)] if df2 > 4

Fisher's distribution, from the name of the biologist, geneticist and statistician
Ronald Aylmer Fisher (1890-1962), corresponds to the ratio of two Chi-square
distributions. It is often used for testing hypotheses.
Fisher-Tippett (β, μ): the density function of this distribution is given by:

f(x) = (1/β) exp(−(x − μ)/β) exp(−exp(−(x − μ)/β)), with β > 0
Gamma (k, β, μ): the density function of this distribution is given by:

f(x) = (x − μ)^(k−1) e^(−(x − μ)/β) / (β^k Γ(k)), with x ≥ μ and k, β > 0
GEV (β, k, μ): the density function of this distribution is given by:

f(x) = (1/β) [1 − k(x − μ)/β]^(1/k − 1) exp(−[1 − k(x − μ)/β]^(1/k)), with β > 0
The GEV (Generalized Extreme Values) distribution is much used in hydrology for
modeling flood phenomena. k lies typically between -0.6 and 0.6.
Gumbel: the density function of this distribution is given by:

f(x) = exp(−x − exp(−x))

The Gumbel distribution, named after Emil Julius Gumbel (1891-1966), is a special
case of the Fisher-Tippett distribution with β = 1 and μ = 0. It is used in the study of
extreme phenomena such as precipitations, flooding and earthquakes.
Logistic (μ, s): the density function of this distribution is given by:

f(x) = e^(−(x − μ)/s) / [s (1 + e^(−(x − μ)/s))²], with μ ∈ R and s > 0
Lognormal (μ, σ): the density function of this distribution is given by:

f(x) = [1 / (x σ √(2π))] exp(−(Ln(x) − μ)² / (2σ²)), with x, σ > 0

Lognormal2: the density function has the same form as that of the Lognormal
distribution above. This distribution is just a reparametrization of the Lognormal distribution.
Normal (μ, σ): the density function of this distribution is given by:

f(x) = [1 / (σ √(2π))] exp(−(x − μ)² / (2σ²)), with σ > 0

E(X) = μ and V(X) = σ²
Normal standard: the density function of this distribution is given by:

f(x) = (1 / √(2π)) exp(−x² / 2)

This distribution is a special case of the normal distribution with μ = 0 and σ = 1.
Pareto (a, b): the density function of this distribution is given by:

f(x) = a b^a / x^(a+1), with a, b > 0 and x ≥ b
The Pareto distribution, named after the Italian economist Vilfredo Pareto (1848-
1923), is also known as the Bradford distribution. This distribution was initially used
to represent the distribution of wealth in society, with Pareto's principle that 80% of
the wealth was owned by 20% of the population.
PERT (a, m, b): the density function of this distribution is given by:

f(x) = (x − a)^(α-1) (b − x)^(β-1) / [B(α, β) (b − a)^(α+β-1)], with α, β > 0, x ∈ [a, b],
a, b ∈ R and B(α, β) = Γ(α)Γ(β) / Γ(α+β)

where

α = (4m + b − 5a) / (b − a)
β = (5b − a − 4m) / (b − a)
The PERT distribution is a special case of the beta4 distribution. It is defined by its
definition interval [a, b] and m, the most likely value (the mode). PERT is an
acronym for Program Evaluation and Review Technique, a project management
and planning methodology. The PERT methodology and distribution were
developed during the project conducted by the US Navy and Lockheed between 1956 and
1960 to develop the Polaris missiles launched from submarines. The PERT
distribution is useful to model the time that is likely to be spent by a team to finish a
project. The simpler triangular distribution is similar to the PERT distribution in that it
is also defined by an interval and a most likely value.
Poisson (λ): the density function of this distribution is given by:

P(X = x) = exp(−λ) λ^x / x!, with x ∈ N and λ > 0
Student (df): the density function of this distribution is given by:

f(x) = [Γ((df + 1)/2) / (√(π df) Γ(df/2))] (1 + x²/df)^(−(df+1)/2), with df > 0
The English chemist and statistician William Sealy Gosset (1876-1937) used the
nickname Student to publish his work, in order to preserve his anonymity (the
Guinness brewery forbade its employees to publish following the publication of
confidential information by another researcher). Student's t distribution with df
degrees of freedom is the distribution of the ratio of a standard normal variable to
the square root of a Chi-square variable divided by df. When df = 1,
Student's distribution is a Cauchy distribution, with the particularity of having neither
expectation nor variance.
Trapezoidal (a, b, c, d): the density function of this distribution is given by:
f(x) = 2 (x − a) / [(d + c − b − a)(b − a)], x ∈ [a, b]
f(x) = 2 / (d + c − b − a), x ∈ [b, c]
f(x) = 2 (d − x) / [(d + c − b − a)(d − c)], x ∈ [c, d]
f(x) = 0, x < a or x > d

with a ≤ b ≤ c ≤ d

This distribution is useful to represent a phenomenon for which we know that it can
take values between two extreme values (a and d), but that it is more likely to take
values between two values (b and c) within that interval.
Triangular (a, m, b): the density function of this distribution is given by:
f(x) = 2 (x − a) / [(b − a)(m − a)], x ∈ [a, m]
f(x) = 2 (b − x) / [(b − a)(b − m)], x ∈ [m, b]
f(x) = 0, x < a or x > b

with a ≤ m ≤ b
TriangularQ (q1, m, q2, p1, p2): the density function of this distribution is a
reparametrization of the Triangular distribution. A first step requires estimating the a
and b parameters of the triangular distribution, from the q1 and q2 quantiles to which
percentages p1 and p2 correspond. Once this is done, the distribution functions can be
computed using the triangular distribution functions.
Uniform (a, b): the density function of this distribution is given by:

f(x) = 1 / (b − a), with b > a and x ∈ [a, b]
The uniform (0,1) distribution is much used for simulations. As the cumulative
distribution function of every distribution takes values between 0 and 1, a sample
drawn from a Uniform (0,1) distribution can be used to obtain random samples from
any distribution for which the inverse cumulative distribution function can be
calculated.
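The inverse-transform idea described above can be sketched in Python, here drawing exponential samples from Uniform(0,1) values (the function name and parameters are illustrative):

```python
import math
import random

def sample_exponential(lam, n, seed=0):
    """Inverse-transform sampling: draw U ~ Uniform(0,1), then
    X = -ln(1 - U) / lam follows an exponential distribution with
    rate lam, because the exponential inverse CDF is -ln(1 - u)/lam."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

xs = sample_exponential(2.0, 10000)
# The sample mean should be close to 1/lam = 0.5.
```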
Uniform discrete (a, b): the density function of this distribution is given by:

P(X = x) = 1 / (b − a + 1), with b > a, (a, b) ∈ N², x ∈ N and x ∈ [a, b]

The uniform discrete distribution corresponds to the case where the uniform
distribution is restricted to integers.
Weibull (β): the density function of this distribution is given by:
f(x) = βx^(β-1) exp(-x^β), with x ≥ 0 and β > 0
We have E(X) = Γ(1 + 1/β) and V(X) = Γ(1 + 2/β) - Γ²(1 + 1/β)
Weibull (β, γ): the density function of this distribution is given by:
f(x) = (β/γ)(x/γ)^(β-1) exp(-(x/γ)^β), with x ≥ 0, β > 0 and γ > 0
We have E(X) = γΓ(1 + 1/β) and V(X) = γ²[Γ(1 + 2/β) - Γ²(1 + 1/β)]
β is the shape parameter of the distribution and γ the scale parameter. When β = 1,
the Weibull distribution is an exponential distribution with parameter 1/γ.
Weibull (β, γ, μ): the density function of this distribution is given by:
f(x) = (β/γ)((x - μ)/γ)^(β-1) exp(-((x - μ)/γ)^β), with x ≥ μ, β > 0 and γ > 0
We have E(X) = μ + γΓ(1 + 1/β) and V(X) = γ²[Γ(1 + 2/β) - Γ²(1 + 1/β)]
The Weibull distribution, named after the Swede Ernst Hjalmar Waloddi Weibull
(1887-1979), is much used in quality control and survival analysis. β is the shape
parameter of the distribution and γ the scale parameter. When β = 1 and μ = 0, the
Weibull distribution is an exponential distribution with parameter 1/γ.
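The moment formulas above can be checked numerically with the gamma function from the standard library (an illustrative sketch; the function name is ours):

```python
from math import gamma

def weibull_moments(beta, scale):
    """Mean and variance of the two-parameter Weibull distribution:
    E(X) = scale * Gamma(1 + 1/beta),
    V(X) = scale**2 * (Gamma(1 + 2/beta) - Gamma(1 + 1/beta)**2)."""
    g1 = gamma(1 + 1 / beta)
    g2 = gamma(1 + 2 / beta)
    return scale * g1, scale ** 2 * (g2 - g1 ** 2)

# With beta = 1 the Weibull reduces to an exponential distribution with
# parameter 1/scale: mean = scale and variance = scale squared.
print(weibull_moments(1.0, 2.0))  # -> (2.0, 4.0)
```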
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data: Select the quantitative data. If several samples have been selected, XLSTAT will carry
out the calculations for each of the samples independently while allowing you to superimpose
histograms if you want (see Charts tab). If headers have been selected, check that the
"Sample labels" option has been activated.
Data type:
Continuous: Choose this option so that XLSTAT considers your data to be continuous.
Discrete: Choose this option so that XLSTAT considers your data to be discrete.
Subsamples: Activate this option then select a column (column mode) or a row (row mode)
containing the sample identifiers. Using this option produces one histogram per subsample
and therefore allows you to compare the distribution of the data across the subsamples. If a header
has been selected, check that the "Sample labels" option has been activated.
Variable-Category labels: Activate this option to use variable-category labels when
displaying outputs. Variable-Category labels include the variable name as a prefix and
the category name as a suffix.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Sample labels: Activate this option if the first row of the selected data (data, sub-samples,
weights) contains a label.
Weights: Check this option if the observations are weighted. If you do not check this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Sample labels" option is activated.
Options tab:
Intervals: Choose one of the following options to define the intervals for the histogram:
Width: Choose this option to define a fixed width for the intervals.
User defined: Select a column containing, in increasing order, the lower bound of the
first interval followed by the upper bounds of all the intervals.
Minimum: Activate this option to enter the lower bound of the first interval. This value
must be less than or equal to the minimum of the series.
Remove observations:
For the corresponding sample: Activate this option to ignore an observation that
has a missing value only in the sample(s) where the value is missing.
For all samples: Activate this option to ignore, for all the selected samples, any
observation that has a missing value.
Estimate missing data: Activate this option to estimate the missing data by using the mean of
the sample.
Outputs tab:
Descriptive statistics: Activate this option to display the descriptive statistics of the samples.
Charts tab:
Histograms: Activate this option to display the histograms of the samples. For a theoretical
distribution, the density function is displayed.
Bars: Choose this option to display the histograms with a bar for each interval.
Continuous lines: Choose this option to display the histograms with a continuous line.
Cumulative histograms: Activate this option to display the cumulative histograms of the
samples.
Based on the histogram: Choose this option to display cumulative histograms based
on the same interval definition as the histograms.
Ordinate of the histograms: Choose the quantity to be used for the histograms: density,
frequency or relative frequency.
Display a distribution: Activate this option to compare the histograms of the selected
samples with a density function, and/or the cumulative histograms with a distribution
function. Then choose the distribution to be used and enter the values of the parameters if
necessary.
Results
Summary statistics: This table displays, for the selected samples, the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation.
Histograms: The histograms are displayed. If desired, you can change the color of the lines,
scales, titles as with any Excel chart.
Descriptive statistics for the intervals: This table displays for each interval its lower bound,
upper bound, the frequency (number of values of the sample within the interval), the relative
frequency (the number of values divided by the total number of values in the sample), and the
density (the ratio of the frequency to the size of the interval).
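These three quantities can be computed per interval as sketched below (illustrative Python; the half-open interval convention [lower, upper) is our assumption, not something stated by the manual):

```python
def interval_stats(sample, lower, upper):
    """Frequency (count), relative frequency (count / sample size) and
    density (count / interval width) for one histogram interval [lower, upper)."""
    freq = sum(1 for x in sample if lower <= x < upper)
    rel_freq = freq / len(sample)
    density = freq / (upper - lower)
    return freq, rel_freq, density

data = [1.2, 1.8, 2.5, 3.1, 3.9, 4.4]
print(interval_stats(data, 1.0, 3.0))  # -> (3, 0.5, 1.5)
```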
Example
An example showing how to create a histogram is available on the Addinsoft website at
http://www.xlstat.com/demo-histo.htm
References
Chambers J.M., Cleveland W.S., Kleiner B. and Tukey P.A. (1983). Graphical Methods for
Data Analysis. Duxbury, Boston.
Jacoby W. G. (1997). Statistical Graphics for Univariate and Bivariate Data. Sage
Publications, London.
Normality tests
Use this tool to check whether a sample can be considered to follow a normal distribution. The
distribution fitting tool enables the parameters of the normal distribution to be estimated, but
the tests it offers are not as suitable as those given here.
Description
Assuming a sample is normally distributed is common in statistics. But checking that this is
actually true is often neglected. For example, the normality of residuals obtained in linear
regression is rarely tested, even though it governs the quality of the confidence intervals
surrounding parameters and predictions.
XLSTAT offers the following normality tests:
The Shapiro-Wilk test, which is best suited to samples of fewer than 5000 observations;
The Lilliefors test, a modification of the Kolmogorov-Smirnov test, suited to the normal
case where the parameters of the distribution (the mean and the variance) are not known and
have to be estimated;
The Jarque-Bera test, which becomes more powerful as the number of values increases.
In order to check visually if a sample follows a normal distribution, it is possible to use P-P
plots and Q-Q plots:
P-P Plots (normal distribution): P-P plots (for Probability-Probability) are used to compare the
empirical distribution function of a sample with that of a sample distributed according to a
normal distribution of the same mean and variance. If the sample follows a normal distribution,
the points will lie along the first bisector of the plot.
Q-Q Plots (normal distribution): Q-Q plots (for Quantile-Quantile) are used to compare the
quantiles of the sample with those of a sample distributed according to a normal distribution of
the same mean and variance. If the sample follows a normal distribution, the points will lie
along the first bisector of the plot.
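The construction of the Q-Q points can be sketched with the standard library (illustrative only; the (i + 0.5)/n plotting positions are one common convention, chosen here by us):

```python
import statistics

def normal_qq_points(sample):
    """Pair each ordered observation with the quantile of a normal
    distribution having the same mean and standard deviation."""
    xs = sorted(sample)
    n = len(xs)
    dist = statistics.NormalDist(statistics.mean(xs), statistics.stdev(xs))
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

# For near-normal data the points lie close to the first bisector (y = x).
for theoretical, observed in normal_qq_points([4.9, 5.1, 5.0, 4.8, 5.2]):
    print(round(theoretical, 3), observed)
```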
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data: Select the quantitative data. If several samples have been selected, XLSTAT carries out
normality tests for each of the samples independently. If headers have been selected, check
that the "Sample labels" option has been activated.
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Sample labels" option is activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Sample labels: Activate this option if the first row of the selected data (data, sub-samples,
weights) contains a label.
Significance level (%): Enter the significance level for the tests.
Subsamples: Activate this option then select a column (column mode) or a row (row mode)
containing the sample identifiers. The use of this option gives one series of tests per
subsample. If a header has been selected, check that the "Sample labels" option has been
activated.
Remove observations:
For the corresponding sample: Activate this option to ignore an observation that
has a missing value only in the sample(s) where the value is missing.
For all samples: Activate this option to ignore, for all the selected samples, any
observation that has a missing value.
Estimate missing data: Activate this option to estimate the missing data by using the mean of
the sample.
Outputs tab:
Descriptive statistics: Activate this option to display the descriptive statistics of the samples.
Charts tab:
P-P plots: Activate this option to display Probability-Probability plots based on the normal
distribution.
Q-Q Plots: Activate this option to display Quantile-Quantile plots based on the normal
distribution.
Results
For each test requested, the statistics relating to the test are displayed, including, in particular,
the p-value, which is then used to interpret the test by comparing it with the chosen
significance threshold.
Example
An example showing how to test the normality of a sample is available on the Addinsoft
website:
http://www.xlstat.com/demo-norm.htm
References
Anderson T.W. and Darling D.A. (1952). Asymptotic theory of certain "Goodness of Fit"
criteria based on stochastic processes. Annals of Mathematical Statistic, 23, 193-212.
Anderson T.W. and Darling D.A. (1954). A test of goodness of fit. Journal of the American
Statistical Association, 49, 765-769.
D'Agostino R.B. and Stephens M.A. (1986). Goodness-of-fit techniques. Marcel Dekker,
New York.
Jarque C.M. and Bera A.K. (1980). Efficient tests for normality, heteroscedasticity and serial
independence of regression residuals. Economic Letters, 6, 255-259.
Lilliefors H. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance
unknown. Journal of the American Statistical Association, 62, 399-402.
Royston P. (1982). An extension of Shapiro and Wilk's W test for normality to large samples.
Applied Statistics, 31, 115-124.
Royston P. (1982). Algorithm AS 181: the W test for normality. Applied Statistics, 31, 176-180.
Royston P. (1995). A remark on Algorithm AS 181: the W test for normality. Applied Statistics,
44, 547-551.
Stephens M. A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of
the American Statistical Association, 69, 730-737.
Shapiro S. S. and Wilk M. B. (1965). An analysis of variance test for normality (complete
samples). Biometrika, 52, 3 and 4, 591-611.
Thode H.C. (2002). Testing for normality. Marcel Dekker, New York, USA.
Similarity/dissimilarity matrices (Correlations, ...)
Use this tool to calculate a proximity index between the rows or the columns of a data table.
The most classic example of the use of this tool is in calculating a correlation or covariance
matrix between quantitative variables.
Description
This tool offers a large number of proximity measurements between a series of objects
whether they are in rows (usually the observations) or in columns (usually the variables).
The correlation coefficient is a measurement of the similarity of the variables: the more the
variables are similar, the higher the correlation coefficient.
The proximity between two objects is measured by evaluating the extent to which they are
similar (similarity) or dissimilar (dissimilarity).
Quantitative data:
The similarity coefficients proposed by the calculations from the quantitative data are as
follows: Cosine, Covariance (n-1), Covariance (n), Inertia, Gower coefficient, Kendall
correlation coefficient, Pearson correlation coefficient, Spearman correlation coefficient.
The dissimilarity coefficients proposed by the calculations from the quantitative data are as
follows: Bhattacharya's distance, Bray and Curtis' distance, Canberra's distance, Chebychev's
distance, Chi² distance, Chi² metric, Chord distance, Squared chord distance, Euclidean
distance, Geodesic distance, Kendall's dissimilarity, Mahalanobis distance, Manhattan
distance, Ochiai's index, Pearson's dissimilarity, Spearman's dissimilarity.
Binary data:
The similarity and dissimilarity (by simple transformation) coefficients proposed by the
calculations from the binary data are as follows: Dice coefficient (also known as the Sorensen
coefficient), Jaccard coefficient, Kulczinski coefficient, Pearson Phi, Ochiai coefficient, Rogers
& Tanimoto coefficient, Sokal & Michener's coefficient (simple matching coefficient), Sokal &
Sneath's coefficient (1), Sokal & Sneath's coefficient (2).
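As an illustration of how two of these coefficients are computed from binary data (a sketch based on the usual presence/absence counts; not XLSTAT code):

```python
def jaccard_dice(x, y):
    """Jaccard and Dice similarity for two binary vectors.
    a = co-presences (1,1); b, c = mismatches; co-absences (0,0) are ignored."""
    a = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (1, 1))
    b = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (1, 0))
    c = sum(1 for xi, yi in zip(x, y) if (xi, yi) == (0, 1))
    return a / (a + b + c), 2 * a / (2 * a + b + c)

jac, dice = jaccard_dice([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
print(jac, dice)  # -> 0.5 and 2/3
```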
Qualitative data:
The similarity coefficients proposed by the calculations from the qualitative data are as
follows: Cooccurrence, Percent agreement.
The dissimilarity coefficients proposed by the calculations from the qualitative data are as
follows: Percent disagreement
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Data: Select a table comprising N objects described by P descriptors. If column headers have
been selected, check that the "Column labels" option has been activated.
Note (1): if the selected data type is Binary and the input data are not binary,
XLSTAT asks you whether they should be automatically transformed into binary data (all values that
are not equal to 0 are replaced by 1s).
Note (2): if the selected data type is Qualitative, the data are considered as
qualitative whatever their true type.
Row weights: Activate this option if the rows are weighted. If you do not activate this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Column labels" option is activated.
Proximity type: similarities / dissimilarities: Choose the proximity type to be used. The data
type and proximity type determine the list of possible indexes for calculating the proximity
matrix.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column labels: Activate this option if the first row of the data selections
(Observations/variables table, row labels, row weights, column weights) contains a label.
Row labels: Activate this option if observation labels are available. Then select the
corresponding data. If the Column labels option is activated you need to include a header in
the selection. If this option is not activated, the observation labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Flag similar objects: Activate this option to identify similar objects in the proximity matrix.
List similar objects: Activate this option to display the list of similar objects.
Dissimilarity threshold: Enter the threshold value of the index beyond which you consider
objects to be similar. If the chosen index is a similarity, the values will be considered as
similar if they are greater than this value. If the chosen index is a dissimilarity, the values will
be considered as similar if they are less than this value.
Bartlett's sphericity test: Activate this option to calculate Bartlett's sphericity test (only for
Pearson correlation or covariance).
Significance level (%): Enter the significance level for the sphericity test.
Results
Summary statistics: This table shows the descriptive statistics for the samples.
Proximity matrix: This table displays the proximities between the objects for the chosen index.
If the "Flag similar objects" option has been activated and the dissimilarity threshold has
been exceeded, the values for the similar objects are displayed in bold.
List of similar objects: If the "List similar objects" option has been checked and at least one
pair of objects has a similarity beyond the threshold, the list of similar objects is displayed.
Example
An example of the use of this tool is available on the Addinsoft website at
http://www.xlstat.com/demo-mds.htm
References
Everitt B.S., Landau S. and Leese M. (2001). Cluster Analysis (4th edition). Arnold, London.
Gower J.C. and P. Legendre (1986). Metric and Euclidean properties of dissimilarity
coefficients. Journal of Classification, 3, 5-48.
Jobson J.D. (1992). Applied Multivariate Data Analysis. Volume II: Categorical and
Multivariate Methods. Springer-Verlag, New York.
Legendre P. and Legendre L. (1998). Numerical Ecology. Second English Edition. Elsevier,
Amsterdam.
Sokal R.R. and Rohlf F.J. (1995). Biometry. The Principles and Practice of Statistics in
Biological Research. Third edition. Freeman, New York.
Multicollinearity statistics
Use this tool to identify multicollinearities between your variables.
Description
Variables are said to be multicollinear if there is a linear relationship between them. This is an
extension of the simple case of collinearity between two variables. For example, for three
variables X1, X2 and X3, we say that they are multicollinear if we can write:
X1 = aX2 + bX3
Principal Component Analysis (PCA) can detect the presence of multicollinearities within the
data (a number of non-null factors lower than the number of variables indicates the presence of
a multicollinearity), but it cannot identify the variables which are responsible.
To detect the multicollinearities and identify the variables involved, linear regressions must be
carried out on each of the variables as a function of the others. We then calculate:
The R² of each of the models. If the R² is 1, then there is a linear relationship between the
dependent variable of the model (the Y) and the explanatory variables (the Xs).
The tolerance for each of the models. The tolerance is (1 - R²). It is used in several methods
(linear regression, logistic regression, discriminant analysis) as a criterion for filtering
variables. If a variable has a tolerance less than a fixed threshold (the tolerance being calculated
taking into account the variables already used in the model), it is not allowed to enter the model,
as its contribution is negligible and it would risk causing numerical problems.
The VIF (Variance Inflation Factor), which is equal to the inverse of the tolerance.
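For the special case of two explanatory variables, the R² of each of these regressions equals the squared Pearson correlation between the two variables, so tolerance and VIF can be sketched as follows (illustrative code; with more variables, a full regression of each variable on all the others is needed):

```python
def tolerance_and_vif(x1, x2):
    """Tolerance = 1 - R-squared and VIF = 1 / tolerance for two explanatory
    variables, where R-squared is the squared Pearson correlation."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    r2 = sxy * sxy / (sxx * syy)
    tolerance = 1 - r2
    return tolerance, 1 / tolerance

tol, vif = tolerance_and_vif([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
# A tolerance close to 0 (equivalently a large VIF) signals collinearity.
print(tol, vif)
```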
Detecting multicollinearities within a group of variables can be useful especially in the following
cases:
To identify structures within the data and take operational decisions (for example, stop the
measurement of a variable on a production line as it is strongly linked to others which are
already being measured),
To avoid numerical problems during certain calculations. Certain methods use matrix
inversions. The inverse of a (p x p) matrix can be calculated only if it is of rank p (i.e. regular).
If it is of lower rank, in other words if there are linear relationships between its columns, then it
is singular and cannot be inverted.
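The singularity caused by linearly dependent columns can be seen on a small example: below, the third column equals 2 times the first plus the second, so the determinant is 0 and the matrix cannot be inverted (a toy illustration):

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Column 3 = 2 * column 1 + column 2, so the matrix has rank < 3.
singular = [[1, 0, 2],
            [0, 1, 1],
            [1, 1, 3]]
print(det3(singular))  # -> 0
```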
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Variable labels: Activate this option if the first row of the selection includes a header.
Weights: Check this option if the observations are weighted. If you do not check this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Variable labels" option is activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Missing data tab:
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
Charts tab:
Bar charts: Activate this option to display the bar charts of the following statistics:
Tolerance
VIF
Results
The results comprise the descriptive statistics of the selected variables, the correlation matrix
of the variables and the multicollinearity statistics (R², tolerance and VIF). Bar charts help
locate the variables that are most affected by multicollinearity.
When the tolerance is 0, the VIF is infinite and is not displayed.
References
Belsley D.A., Kuh E. and Welsch R.E. (1980). Regression Diagnostics, Identifying Influential
Data and Sources of Collinearity. Wiley, New York.
Contingency tables (descriptive statistics)
Use this tool to compute a variety of descriptive statistics on a contingency table. A chi-square
test is optionally performed. Additional tests on contingency tables are available in the Tests
on contingency tables section.
Description
A contingency table is an efficient way to summarize the relation (or correspondence) between
two categorical variables V1 and V2. It has the following structure:
V1 \ V2       Category 1   ...   Category j   ...   Category m2
Category 1    n(1,1)       ...   n(1,j)       ...   n(1,m2)
...
Category i    n(i,1)       ...   n(i,j)       ...   n(i,m2)
...
Category m1   n(m1,1)      ...   n(m1,j)      ...   n(m1,m2)
where n(i,j) is the frequency of observations that show both characteristic i for variable V1, and
characteristic j for variable V2.
The Chi-square distance has been suggested to measure the distance between two
categories. The Pearson chi-square statistic, which is the sum of the Chi-square distances, is
used to test the independence between rows and columns. It has asymptotically a Chi-square
distribution with (m1-1)(m2-1) degrees of freedom.
Inertia is a measure inspired from physics that is often used in Correspondence Analysis, a
method used to analyse contingency tables in depth. The inertia of a set of points is the
weighted mean of the squared distances to the center of gravity. In the specific case of a
contingency table, the total inertia of the set of points (one point corresponds to one category)
can be written as:
Φ² = (1/n) Σ(i=1..m1) Σ(j=1..m2) [nij - ni.n.j/n]² / (ni.n.j/n), with ni. = Σ(j=1..m2) nij and n.j = Σ(i=1..m1) nij
and where n is the sum of the frequencies in the contingency table. We can see that the inertia
is proportional to the Pearson chi-square statistic computed on the contingency table.
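That proportionality can be checked numerically (an illustrative sketch; not XLSTAT code):

```python
def chi_square_and_inertia(table):
    """Pearson chi-square statistic and total inertia of a contingency table;
    the inertia equals the chi-square divided by the grand total n."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = sum(
        (table[i][j] - row_sums[i] * col_sums[j] / n) ** 2
        / (row_sums[i] * col_sums[j] / n)
        for i in range(len(table))
        for j in range(len(table[0]))
    )
    return chi2, chi2 / n

chi2, inertia = chi_square_and_inertia([[20, 10], [10, 20]])
print(chi2, inertia)  # chi2 close to 6.667, inertia = chi2 / 60
```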
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Contingency table: Select the data that correspond to the contingency table. If row and
column labels are included, make sure that the Labels included option is checked.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Labels included: Activate this option if the row and column labels are selected.
Options tab:
Chi-square test: Activate this option to display the statistics and the interpretation of the Chi-
square test of independence between rows and columns.
Significance level (%): Enter the significance level for the test.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Replace missing data by 0: Activate this option if you consider that missing data are
equivalent to 0.
Replace missing data by their expected value: Activate this option if you want to replace the
missing data by the expected value. The expectation is given by:
E(nij) = (ni. × n.j) / n
where ni. is the row sum, n.j is the column sum, and n is the grand total of the table before
replacement of the missing data.
Outputs tab:
List of combinations: Activate this option to display the table that lists all the possible
combinations of the two variables that are used to create a contingency table, together with
the corresponding frequencies.
Inertia by cell: Activate this option to display the inertia for each cell of the contingency table.
Chi-square by cell: Activate this option to display the contribution to the chi-square of each
cell of the contingency table.
Significance by cell: Activate this option to display a table indicating, for each cell, whether the
actual value is equal to (=), lower than (<) or higher than (>) the theoretical value, and to run a
test (Fisher's exact test on a 2x2 table having the same total frequency as the complete table,
and the same marginal sums for the cell of interest) in order to determine whether the
difference with the theoretical value is significant or not.
Observed frequencies: Activate this option to display the table of the observed frequencies.
This table is almost identical to the contingency table, except that the marginal sums are also
displayed.
Theoretical frequencies: Activate this option to display the table of the theoretical frequencies
computed using the marginal sums of the contingency table.
Proportions or percentages / Row: Activate this option to display the table of proportions or
percentages computed by dividing the values of the contingency table by the marginal sums of
each row.
Proportions or percentages / Column: Activate this option to display the table of proportions
or percentages computed by dividing the values of the contingency table by the marginal sums
of each column.
Proportions or percentages / Total: Activate this option to display the table of proportions or
percentages computed by dividing the values of the contingency table by the sum of all the
cells of the contingency table.
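The three normalizations described above can be sketched as follows (illustrative Python; the function name is ours):

```python
def contingency_proportions(table):
    """Row, column and total proportions of a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    by_row = [[v / row_sums[i] for v in row] for i, row in enumerate(table)]
    by_col = [[v / col_sums[j] for j, v in enumerate(row)] for row in table]
    by_total = [[v / total for v in row] for row in table]
    return by_row, by_col, by_total

by_row, by_col, by_total = contingency_proportions([[2, 2], [1, 3]])
print(by_row)    # -> [[0.5, 0.5], [0.25, 0.75]]
print(by_total)  # -> [[0.25, 0.25], [0.125, 0.375]]
```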
Charts tab:
3D view of the contingency table: Activate this option to display the 3D bar chart
corresponding to the contingency table.
XLSTAT-Pivot
Use this module to turn an individuals/variables table into a dynamic pivot table optimized to let
you understand and analyze the phenomenon corresponding to one of the variables describing
the individuals.
Description
A dynamic pivot table makes it possible to take more than two variables into account and to
organize the table structure into a hierarchy. The table is said to be dynamic in the sense that
software functionalities allow you to navigate within the hierarchy and to create a view focused
on particular classes of particular variables.
XLSTAT-Pivot allows you to create dynamic pivot tables whose structure is optimized with
respect to a target variable. Numeric continuous or discrete explanatory variables are
automatically sliced into classes that contribute to optimize the quality of the table.
The target variable can be a binary variable (for example 0/1 or Yes/No), or a quantitative
variable.
When you run XLSTAT-Pivot you will see three dialog boxes in succession:
- The first dialog box lets you select the data and a few options.
- The second dialog box allows you to confirm or modify the data types as detected by the
XLSTAT-Pivot modeling engine.
- The third dialog box allows you to select the dimensions that you want to use in the pivot
table (up to four variables may be selected). To help you select the variables, the Ki coefficient,
which measures the explanatory power of the variable, and the Kr coefficient, which measures
the contribution to the robustness of the model, are displayed.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Response variable: Select the response variable(s) you want to model. If several variables
have been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
Quantitative: If you select this option, you must select a quantitative variable.
Binary: If you select this option, you must select a variable containing exactly two
distinct values.
X / Explanatory variables: Select one or more explanatory variables. The variables can be
quantitative and/or qualitative.
Weights: Check this option if the observations are weighted. If you do not check this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Sample labels" option is activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (response and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
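The two estimation strategies above can be sketched as follows. This is a minimal illustration of the general idea only, not XLSTAT's actual algorithm: representing missing cells as None and using a Euclidean distance over the shared columns for the nearest-neighbour search are assumptions made for this sketch.

```python
import math
from statistics import mean, mode

def impute_mean_mode(rows, is_quantitative):
    """Fill missing entries (None) column by column: mean for
    quantitative columns, mode for qualitative ones."""
    cols = list(zip(*rows))
    filled = []
    for j, col in enumerate(cols):
        known = [v for v in col if v is not None]
        fill = mean(known) if is_quantitative[j] else mode(known)
        filled.append([fill if v is None else v for v in col])
    return [list(r) for r in zip(*filled)]

def impute_nearest_neighbour(rows):
    """Fill the missing entries of a row using its nearest complete row
    (Euclidean distance on the columns both rows share)."""
    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        def dist(c):
            return math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(r, c) if a is not None))
        nn = min(complete, key=dist)
        out.append([nn[j] if v is None else v for j, v in enumerate(r)])
    return out
```

Either function returns a completed copy of the table, leaving observed values untouched.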
Outputs tab:
Contributions: Activate this option to display the contributions table and the corresponding
bar chart.
Pivot table: Activate this option to display the dynamic pivot table.
Results
Ki: This indicator, given as a percentage, corresponds to the information brought by the
explanatory variables to explain the target variable. It is similar in concept to the R² of
linear regression. The closer Ki is to 100%, the better the explanatory variables explain the
target variable.
Kr: This indicator measures the model robustness. The robustness of a model corresponds to
its capacity to adapt to new data sets. XLSTAT-Pivot uses 75% of data to adjust the model and
25% of data to validate the model. A model is said to be robust if its Kr is greater than 95%.
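The exact definitions of Ki and Kr are internal to the XLSTAT-Pivot engine, but the 75%/25% split behind Kr can be sketched with a stand-in model. The sketch below fits an ordinary least-squares line on 75% of the data and scores it with R² on the remaining 25%; the function names and the choice of model are illustrative assumptions, not XLSTAT's method.

```python
import random

def r_squared(y_true, y_pred):
    """Classical coefficient of determination."""
    ybar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - ybar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def holdout_check(xs, ys, train_frac=0.75, seed=0):
    """Fit on train_frac of the data, score on the rest."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    tr, va = idx[:cut], idx[cut:]
    a, b = fit_line([xs[i] for i in tr], [ys[i] for i in tr])
    pred = [a + b * xs[i] for i in va]
    return r_squared([ys[i] for i in va], pred)
```

A model whose holdout score stays close to its training score is robust in the sense described above.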
The first table presents the variable contributions (raw, relative % and cumulative
contribution). This table allows you to quickly see which variables have the greatest impact
on the target variable. A bar chart of the contributions is also displayed. This histogram is
an Excel chart that you can modify to suit your needs.
The most important result provided by XLSTAT-Pivot is the dynamic pivot table. Each cell
corresponds to a unique combination of the values of the explanatory variables. It is described
by the following 4 values, which can be displayed or hidden according to the user's preferences:
Target average: Percentage of the cases where the target variable is equal to 1 in the
case of a binary variable; average of the target variable calculated on the sub-
population corresponding to the combination in the case of a continuous variable;
Target size: Count of the "1" occurrences for the target variable in the case of a binary
variable; sum of the target variable calculated on the sub-population corresponding to
the combination in the case of a continuous variable;
Example
An example based on data collected for a population census in the United States is
permanently available on the Addinsoft web site. To download this data, go to:
http://www.xlstat.com/demo-pivot.htm
References
Vapnik V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Scatter plots
Use this tool to create 2- or 3-dimensional plots (the 3rd dimension being represented by the
size of the point), or indeed 4-dimensional plots (a qualitative variable can be selected). This
tool is also used to create matrices of plots to enable a study of a series of 2-dimensional plots
to be made at the same time.
Note: XLSTAT-3DPlot can create plots with much more impact thanks to its large number of
options with the possibility of representing data on a third axis.
Dialog box
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
X: In this field select the data to be used as coordinates along the X-axis.
Y: In this field select the data to be used as coordinates along the Y-axis.
Z: Check this option to select the values which will determine the size of the points on the
charts.
Use bubbles: Check this option to use charts with MS Excel bubbles.
Groups: Check this option to select the values which correspond to the identifier of the group
to which each observation belongs. On the chart, the color of the point depends on the group.
Range: Check this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Check this option to display the results in a new worksheet in the active workbook.
Variable labels: Check this option if the first line of the selected data (X, Y, Z, Groups,
Weights and observation labels) contains a label.
Observation labels: Check this option if you want to use the available line labels. If you do not
check this option, labels will be created automatically (Obs1, Obs2, etc.). If a column header
has been selected, check that the "Variable labels" option has been activated.
Weights: Check this option if the observations are weighted. If you do not check this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Variable labels" option is activated.
Options tab:
Matrix of plots: Check this option to display all possible combinations of variables in pairs in
the form of a two-entry table with X-variables in rows and Y-variables in columns.
Histograms: Activate this option so that if the X and Y variables are identical, XLSTAT
displays a histogram instead of an X/X plot.
Q-Q plots: Activate this option so that if the X and Y variables are identical, XLSTAT
displays a Q-Q plot instead of an X/X plot.
Frequencies: Check this option to display the frequencies for each point on the charts.
Only if >1: Check this option if you only want frequencies strictly greater than 1 to be
displayed.
Confidence ellipses: Activate this option to display confidence ellipses. The confidence
ellipses correspond to a 95% confidence interval for a bivariate normal distribution with the
same means and the same covariance matrix as the variables represented in abscissa and
ordinates.
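The confidence ellipse described above can be computed from the 2x2 covariance matrix of the two plotted variables: the half-axis lengths are the square roots of its eigenvalues scaled by the 95% quantile of a chi-squared distribution with 2 degrees of freedom (about 5.991), and the orientation follows the leading eigenvector. A sketch with illustrative function names, not XLSTAT code:

```python
import math

CHI2_95_2DF = 5.991  # 95% quantile of chi-squared with 2 degrees of freedom

def confidence_ellipse(cov):
    """Return (major half-axis, minor half-axis, orientation in radians)
    of the 95% ellipse for a centred bivariate normal with 2x2
    covariance matrix `cov`."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc        # eigenvalues, l1 >= l2
    angle = math.atan2(l1 - a, b)                # leading eigenvector direction
    return (math.sqrt(CHI2_95_2DF * l1),
            math.sqrt(CHI2_95_2DF * l2),
            angle)
```

Centering the ellipse on the two sample means then reproduces the region drawn on the chart.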
Legend: Check this option if you want the chart legend to be displayed.
Example
A tutorial on using Scatter plots is available on the XLSTAT website on the following page:
http://www.xlstat.com/demo-scatter.htm
References
Chambers J.M., Cleveland W.S., Kleiner B. and Tukey P.A. (1983). Graphical Methods for
Data Analysis. Duxbury, Boston.
Jacoby W. G. (1997). Statistical Graphics for Univariate and Bivariate Data. Sage
Publications, London.
Parallel coordinates plots
Use this tool to visualize multidimensional data (described by P quantitative and Q qualitative
variables) on a single two dimensional chart.
Description
This visualization method is useful for data analysis when you need to discover or validate
groups. For example, this method could be used after Agglomerative Hierarchical Clustering.
If the number of observations is too high, the visualization might not be very efficient or even
impossible due to Excel restrictions (maximum of 255 data series). In that case, it is
recommended to use the random sampling option.
Dialog box
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Quantitative Data: Check this option to select the samples of quantitative data you want to
calculate descriptive statistics for.
Qualitative Data: Check this option to select the samples of qualitative data you want to
calculate descriptive statistics for.
Weights: Check this option if the observations are weighted. If you do not check this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Variable labels" option is activated.
Groups: Check this option to select the values which correspond to the identifier of the group
to which each observation belongs. On the chart, the color of the point depends on the group.
Range: Check this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Check this option to display the results in a new worksheet in the active workbook.
Variable labels: Check this option if the first line of the selected data (quantitative data,
qualitative data, weights, groups and observation labels) contains a label.
Observation labels: Check this option if you want to use the available line labels. If you do not
check this option, line labels will be created automatically (Obs1, Obs2, etc.). If a column
header has been selected, check that the "Variable labels" option has been activated.
Rescale: Check this option so that all variables are represented on the same scale of 0% to
100% (for numeric variables, 0 corresponds to the minimum and 100 to the maximum; for
nominal variables, the categories are regularly spaced and classified in alphabetical order).
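The rescaling rule described above can be sketched as follows (the helper names are illustrative, and the handling of a constant variable is an assumption since the document does not specify it):

```python
def rescale_0_100(values):
    """Map a numeric variable onto 0-100: min -> 0, max -> 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [50.0 for _ in values]   # assumed handling of a constant variable
    return [100.0 * (v - lo) / (hi - lo) for v in values]

def category_positions(categories):
    """Place the categories of a nominal variable at regular spacings
    on 0-100, in alphabetical order."""
    levels = sorted(set(categories))
    if len(levels) == 1:
        return {levels[0]: 50.0}
    step = 100.0 / (len(levels) - 1)
    return {c: i * step for i, c in enumerate(levels)}
```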
Options tab:
Display as many lines as possible: Check this option to display as many lines as possible
(the maximum is 250 due to the limitations of Excel).
Display the descriptive statistics lines: Check this option to display lines for the following
statistics only:
Median
First quantile (%): enter the value of the first quantile (2.5% by default).
Second quantile (%): enter the value of the second quantile (97.5% by default).
Example
A tutorial on generating a parallel coordinates plot is available on the Addinsoft website at the
following address:
http://www.xlstat.com/demo-pcor.htm
References
Inselberg A. (1985). The plane with parallel coordinates. The Visual Computer, 1, pp. 69-91.
Eickemeyer J. S., Inselberg A., Dimsdale B. (1992). Visualizing p-flats in n-space Using
Parallel Coordinates. Technical Report G320-3581, IBM Palo Alto Scientific Center.
Wegman E.J. (1990). Hyperdimensional Data Analysis Using Parallel Coordinates. J. Amer.
Statist. Assoc., 85, 411, pp 664-675.
AxesZoomer
Use this tool to change the minimum and maximum values on the X- and Y-axes of a plot.
Dialog box
Important: before running this tool, you must select a scatter plot or curve.
EasyLabels
Use this tool to add labels, formatted if required, to a series of values on a chart.
Dialog box
Important: before running this tool, you must select a scatter plot or curve, or a series of points
on a plot.
: Click this button to close the dialog box without making any changes.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that the labels are in a column. If the arrow points to the right,
XLSTAT considers that the labels are in a row.
Labels: Select the labels to be added to the series of values selected on the plot.
Header in the first cell: Check this option if the first cell of the labels selected is a header and
not a label.
Use the text properties: Check this option if you want the text format used in the cells
containing the labels to also be applied to the text of labels in the chart:
Style: Check this option to use the same font style (normal, bold, italic).
Use the cell properties: Check this option if you want the format applied to the cells
containing the labels to also be applied to the labels in the chart:
Use the point properties: Check this option if you want the label color to be the same as the
color of the points:
Inside color: Check this option to use the color inside the points.
Border color: Check this option to use the border color of the points.
Reposition labels
Use this tool to change the position of observation labels on a chart.
Dialog box
: Click this button to close the dialog box without making any changes.
Corners: Check this option to place labels in the direction of the corner of the quadrant in
which the point is located.
Distance to point:
Automatic: Check this option for XLSTAT to automatically determine the most
appropriate distance to the point.
User defined: Check this option to enter the value (in pixels) of the distance between
the label and the point.
Right: Check this option to place labels to the right of the point.
Left: Check this option to place labels to the left of the point.
Apply only to the selected series: Check this option to only change the position of labels for
the series selected.
EasyPoints
Use this tool to modify the size, the color or the shape of the points that are displayed in an
Excel chart.
Dialog box
Important: before running this tool, you must select a scatter plot or curve, or a series of points
on a plot.
: Click this button to close the dialog box without making any changes.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that the labels are in a column. If the arrow points to the right,
XLSTAT considers that the labels are in a row.
Size: Activate this option and select the cells that give the size to be applied to the points. The
size of the points is determined by the values in the cells.
Header in the first cell: Check this option if the first cell of the labels selected is a header and
not a label.
Rescale: Choose the interval of sizes to use when displaying the points. The minimum must
be between 2 and 71, and the maximum between 3 and 72.
Shapes and/or color: Activate this option to change the shape and/or the color of the points.
Select the cells containing the values that tell which shape should be used for each point: 1
corresponds to a square, 2 to a diamond, 3 to a triangle, 4 to an x, 5 to a star (*), 6 to a point
(.), 7 to a dash (-), 8 to a plus (+) and 9 to a circle (o). The color of the border of the points
depends on the color of the bottom border of the cells, and the inside color of the points
depends on the background color of the cells (Note: the default color of the cells is none, so
you need to set it to white to obtain white points).
Change shapes: Check this option if you want the shapes to be changed depending on the
values selected in the Shapes and/or color field.
Use the cell properties: Check this option if you want the format applied to the cells to also be
applied to the points in the chart:
Border: Check this option to use the cell borders as the foreground color.
Background: Check this option to use the cell color as the background color.
Example
An example describing how to use the EasyPoints tool is available on the Addinsoft website at:
http://www.xlstat.com/demo-easyp.htm
Orthonormal plots
Use this tool to adjust the minimum and maximum of the X- and Y- axes so that the plot
becomes orthonormal. This tool is particularly useful if you have enlarged an orthonormal plot
produced by XLSTAT (for example after a PCA) and you want to ensure the plot is still
orthonormal.
Note: an orthonormal plot is where a unit on the X-axis appears the same size as a unit on the
Y-axis. Orthonormal plots avoid interpretation errors due to the effects of dilation or
overwriting.
Dialog box
: Click this button to close the dialog box without making any changes.
Plot transformations
Use this tool to apply one or more transformations to the points in a plot.
Dialog box
Important: before running this tool, you must select a scatter plot or curve.
: Click this button to close the dialog box without doing any computation.
Symmetry:
Horizontal axis: Check this option to apply a symmetry around the X-axis.
Vertical axis: Check this option to apply a symmetry around the Y-axis.
Note: if you select both the previous options, the symmetry applied will be a central
symmetry.
Translation:
Horizontal: Check this option to enter the number of units for a horizontal translation.
Vertical: Check this option to enter the number of units for a vertical translation.
Rotation:
Angle (°): Enter the angle in degrees for the rotation to be applied.
Left: if this option is activated, an anti-clockwise rotation is applied.
Rescaling:
Range: Check this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Check this option to display the results in a new worksheet in the active workbook.
Display the new coordinates: Check this option to display the coordinates once all the
transformations have been applied.
Update Min and Max on the new plot: Check this option for XLSTAT to automatically adjust
the minimum and maximum of the X- and Y- axes, once the transformations have been carried
out, so that all points are visible.
Orthonormal plot: Check this option for XLSTAT to automatically adjust the minimum and
maximum of the X- and Y- axes, once the transformations have been carried out, so that the
plot becomes orthonormal.
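The transformations listed above are elementary plane operations. The sketch below applies the symmetries, then the translation, then the rotation about the origin; the order in which XLSTAT chains them is an assumption of this illustration.

```python
import math

def transform_points(points, flip_x=False, flip_y=False,
                     dx=0.0, dy=0.0, angle_deg=0.0):
    """Apply, in order: axis symmetries, translation, then a rotation
    about the origin (positive angle = anti-clockwise)."""
    rad = math.radians(angle_deg)
    cos_a, sin_a = math.cos(rad), math.sin(rad)
    out = []
    for x, y in points:
        if flip_x:           # symmetry around the X-axis
            y = -y
        if flip_y:           # symmetry around the Y-axis
            x = -x
        x, y = x + dx, y + dy
        out.append((x * cos_a - y * sin_a, x * sin_a + y * cos_a))
    return out
```

Note that activating both symmetries reproduces the central symmetry mentioned in the dialog box description.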
Merge plots
Use this tool to merge multiple plots into one.
Dialog box
Important: before using this tool, you must select at least two plots of the same type (e.g. two
scatter plots).
Display title: Check this option to display a title on the merged plot.
Title of the first chart: Check this option to use the title of the first chart.
New title: Check this option to enter a title for the merged plot.
Orthonormal plot: Check this option for XLSTAT to verify that the plot resulting from the
merged plots is orthonormal.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet in the active workbook.
New chart sheet: Check this option to display the plot resulting from the merge in a new chart
sheet.
Display the report header: Clear this option to prevent the report header from being
displayed above the merged chart.
Factor analysis
Factor analysis highlights, where possible, the existence of underlying factors common to the
quantitative variables measured in a set of observations.
Description
The factor analysis method dates from the start of the 20th century (Spearman, 1904) and has
undergone a number of developments, several calculation methods having been put forward.
This method was initially used by psychometricians, but its field of application has little by little
spread into many other areas, for example, geology, medicine and finance.
It is exploratory factor analysis (EFA) which is described below and which is used by
XLSTAT. It is a method which reveals the possible existence of underlying factors which give
an overview of the information contained in a very large number of measured variables. The
structure linking factors to variables is initially unknown and only the number of factors may
be assumed.
Confirmatory factor analysis (CFA) in its traditional guise uses a method identical to EFA, but
the structure linking underlying factors to measured variables is assumed to be known. A
more recent version of CFA is linked to structural equation models.
Spearman's historical example, even if the subject of numerous criticisms and improvements,
may still be used to understand the principle and use of the method. By analyzing correlations
between scores obtained by children in different subjects, Spearman wanted to form a
hypothesis that the scores depended ultimately on one factor, intelligence, with a residual part
due to an individual, cultural or other effect.
Thus the score obtained by an individual (i) in subject (j) could be written as x(i,j) = µ + b(j)F +
e(i,j), where µ is the average score in the sample studied, F the individual's level of
intelligence (the underlying factor) and e(i,j) the residual.
Generalizing this structure to p subjects (the input variables) and to k underlying factors, we
obtain the following model:
(1) x = µ + Λf + u
where x is a vector of dimension (p x 1), µ is the mean vector, Λ is the matrix (p x k) of the
factor loadings, and f and u are random vectors of dimensions (k x 1) and (p x 1)
respectively, assumed to be independent. The elements of f are called common factors,
and those of u specific factors.
If we set the norm of f to 1, then the covariance matrix of the input variables derived from
expression (1) is written as:
(2) Σ = ΛΛ' + Ψ
Thus the variance of each of the variables can be divided into two parts: the communality
(as it arises from the common factors),
(3) hi² = Σj=1..k λij²,
and ψii, the specific variance or unique variance (as it is specific to the variable in question).
It can be shown that the method used to calculate the matrix Λ, an essential challenge in
factor analysis, is independent of the scale. It is therefore equivalent to work from the
covariance matrix or the correlation matrix.
The challenge of factor analysis is to find matrices Λ and Ψ such that equation (2) is at
least approximately verified.
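Equations (2) and (3) can be checked numerically on a small example. The sketch below rebuilds the model covariance matrix from a loadings matrix and a vector of specific variances, and computes the communalities; it is purely illustrative.

```python
def model_covariance(L, psi):
    """Sigma = L L' + Psi for loadings L (p x k) and specific
    variances psi (length p, the diagonal of Psi)."""
    p, k = len(L), len(L[0])
    S = [[sum(L[i][t] * L[j][t] for t in range(k)) for j in range(p)]
         for i in range(p)]
    for i in range(p):
        S[i][i] += psi[i]
    return S

def communalities(L):
    """h_i^2 = sum over j of lambda_ij^2 (equation 3)."""
    return [sum(v * v for v in row) for row in L]
```

With standardized variables, each communality plus its specific variance sums to 1, the diagonal of the correlation matrix.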
Note: factor analysis is sometimes lumped together with Principal Component Analysis
(PCA), as PCA is a special case of factor analysis (where k, the number of factors, equals p,
the number of variables). Nevertheless, these two methods are not generally used in the
same context. PCA is first and foremost used to reduce the number of dimensions while
preserving as much variability as possible, in order to obtain independent (non-correlated)
factors or to visualize data in a 2- or 3-dimensional space. Factor analysis, on the other
hand, is used to identify a latent structure and possibly to reduce afterwards the number of
measured variables if they are redundant with respect to the latent factors.
Extracting Factors
Principal components: this method is also used in Principal Component Analysis (PCA). It is
only offered here in order to make a comparison between the results of the three methods
possible, bearing in mind that the results from the module dedicated to PCA are more complete.
Principal factors: this method is probably the most widely used. It is an iterative method
which allows the communalities to converge gradually. The calculations are stopped when the
maximum change in the communalities is below a given threshold or when a maximum
number of iterations is reached. The initial communalities can be calculated according to
various methods.
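The iteration just described can be sketched for the one-factor case, with power iteration standing in for a full eigen-decomposition. This is an illustration of the principle only, not XLSTAT's implementation; clipping communalities at 1 mirrors the Heywood-case handling discussed later in this section.

```python
def principal_factor_1(R, h2, max_iter=50, tol=1e-4):
    """One-factor principal factors iteration: put the communalities h2
    on the diagonal of the correlation matrix R, extract the dominant
    eigenvector (power iteration), recompute the communalities, and
    repeat until the largest change falls below `tol`."""
    p = len(R)
    for _ in range(max_iter):
        A = [row[:] for row in R]
        for i in range(p):
            A[i][i] = h2[i]                 # reduced correlation matrix
        v = [1.0] * p
        for _ in range(200):                # power iteration
            w = [sum(A[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(p))
                  for i in range(p))        # Rayleigh quotient
        load = [abs(lam) ** 0.5 * x for x in v]
        new_h2 = [min(l * l, 1.0) for l in load]   # clip Heywood cases at 1
        if max(abs(a - b) for a, b in zip(new_h2, h2)) < tol:
            return load, new_h2
        h2 = new_h2
    return load, h2
```

For two variables correlated at 0.6, the iteration converges to loadings of about 0.775 and communalities of 0.6.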
Maximum likelihood: this method was first put forward by Lawley (1940). The proposal to use
the Newton-Raphson algorithm (an iterative method) dates from Jennrich (1969). It was
afterwards improved and generalized by Jöreskog (1977). This method assumes that the input
variables follow a normal distribution. The initial communalities are calculated according to the
method proposed by Jöreskog (1977). As part of this method, a goodness-of-fit test is
computed. The statistic used for the test follows a Chi² distribution with
((p - k)² - (p + k)) / 2 degrees of freedom.
Number of factors
Determining the number of factors to select is one of the challenges of factor analysis. The
"automatic" method offered by XLSTAT is based solely on the spectral decomposition of the
correlation matrix and the detection of a threshold beyond which the contribution of
information (in the sense of variability) is not significant.
The maximum likelihood method offers a goodness-of-fit test to help determine the correct
number of factors. For the principal factors method, defining the number of factors is more
difficult. The Kaiser-Guttman rule suggests that only those factors with associated
eigenvalues strictly greater than 1 should be kept; alternatively, with the scree test, the
number of factors to be kept corresponds to the first turning point found on the eigenvalue
curve. Cross-validation methods have also been suggested to achieve this aim.
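The Kaiser-Guttman rule is straightforward to apply once the eigenvalues are known; a minimal sketch:

```python
def kaiser_guttman(eigenvalues):
    """Number of factors to keep: count the eigenvalues strictly
    greater than 1 (Kaiser-Guttman rule)."""
    return sum(1 for e in eigenvalues if e > 1.0)
```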
Communalities are by definition squared correlations. They must therefore lie between 0
and 1. However, it may happen that the iterative algorithms (principal factors method or
maximum likelihood method) produce solutions with communalities equal to 1 (Heywood
cases) or greater than 1 (ultra-Heywood cases). There may be many reasons for these
anomalies (too many factors, not enough factors, etc.). When this happens, XLSTAT sets the
communalities to 1 and adjusts the elements of Λ accordingly.
Rotations
Once the results have been obtained, they may be transformed in order to make them
easier to interpret, for example by trying to arrange that the coordinates of the variables on
the factors are either high (in absolute value) or near to zero. There are two main families of
rotations:
Orthogonal rotations can be used when the factors are not correlated (hence orthogonal). The
methods offered by XLSTAT are Varimax, Quartimax, Equamax, Parsimax and Orthomax.
Varimax rotation is the most used. It ensures that for each factor there are few high factor
loadings and few that are low. Interpretation is thus made easier as, in principle, the initial
variables will mostly be associated with one of the factors.
Oblique transformations can be used when the factors are correlated (hence oblique). The
methods offered by XLSTAT are Quartimin and Oblimin.
The Promax method, also offered by XLSTAT, is a mixed procedure: it consists of an initial
Varimax rotation followed by an oblique rotation that preserves the high factor loadings while
driving the low loadings even closer to zero.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
The main data entry field is used to select one of three types of table:
Observations/variables table / Correlation matrix / Covariance matrix: Choose the option
appropriate to the format of your data, and then select the data. If your data correspond to a
table comprising N observations described by P quantitative variables select the
Observations/variables option. If column headers have been selected, check that the "Variable
labels" option has been activated. If you select a correlation or covariance matrix, and if you
include the variable names in the first row of the selection, you must also select them in the
first column.
Extraction method: Choose the factor extraction method to be used. The three possible
methods are (see the description section for more details):
Principal components
Principal factors
Maximum likelihood
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (input table, weights,
observations labels) includes a header. Where the selection is a correlation or covariance
matrix, if this option is activated, the first column must also include the variable labels.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Variable labels" option is activated.
Options tab:
Number of factors:
Automatic: Activate this option to make XLSTAT determine the number of factors
automatically.
User defined Activate this option to tell XLSTAT the number of factors to use in the
calculations.
Initial communalities: Choose the calculation method for the initial communalities (this option
is only visible for the principal factors method):
Squared multiple correlations: The initial communalities are based on a variable's level
of dependence with regard to the other variables.
Random: The initial communalities are drawn at random from the interval ]0 ; 1[.
Maximum: The initial communalities are set to the maximum value of the squared
multiple correlations.
Stop conditions:
Iterations: Enter the maximum number of iterations for the algorithm. The calculations
are stopped when the maximum number of iterations has been exceeded. Default value:
50.
Convergence: Enter the maximum value of the evolution in the communalities from one
iteration to another which, when reached, means that the algorithm is considered to
have converged. Default value: 0.0001.
Rotation: Activate this option if you want to apply a rotation to the factor coordinate matrix.
Number of factors: Enter the number of factors the rotation is to be applied to.
Method: Choose the rotation method to be used. For certain methods a parameter
must be entered (Gamma for Orthomax, Tau for Oblimin, and the power for Promax).
Kaiser normalization: Activate this option to apply Kaiser normalization during the
rotation calculation.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Pairwise deletion: Activate this option to remove observations with missing data only when
the variables involved in the calculations have missing data. For example, when calculating the
correlation between two variables, an observation will only be ignored if the data
corresponding to one of the two variables is missing.
Estimate missing data: Activate this option to estimate the missing data before the
calculation starts.
Mean or mode: Activate this option to estimate the missing data by using the mean
(quantitative variables) or the mode (qualitative variables) for the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data for an observation
by searching for the nearest neighbour to the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
Correlations: Activate this option to display the correlation or covariance matrix depending on
the type of options chosen in the "General" tab. If the Test significance option is activated,
the significant correlations at the selected significance threshold are displayed in bold.
Cronbach's Alpha: Activate this option to compute the Cronbach's alpha coefficient.
Eigenvalues: Activate this option to display the table and chart (scree plot) of eigenvalues.
Factor pattern: Activate this option to display factor loadings (coordinates of variables in the
factor space).
Factor/Variable correlations: Activate this option to display correlations between factors and
variables.
Factor pattern coefficients: Activate this option if you want the coefficients of the factor
pattern to be displayed. Multiplying the (standardized) coordinates of the observations in the
initial space by these coefficients gives the coordinates of the observations in the factor space.
Factor structure: Activate this option to display correlations between factors and variables
after rotation.
Charts tab:
Variables charts: Activate this option to display charts representing the variables in the new
space.
Vectors: Activate this option to display the initial variables in the form of vectors.
Correlations charts: Activate this option to display charts showing the correlations between
the factors and initial variables.
Vectors: Activate this option to display the initial variables in the form of vectors.
Observations charts: Activate this option to display charts representing the observations in
the new space.
Labels: Activate this option to have observation labels displayed on the charts. The
number of labels displayed can be changed using the filtering option.
Colored labels: Activate this option to show labels in the same color as the points.
N first rows: The N first observations are displayed on the chart. The Number of
observations N to display must then be specified.
N last rows: The N last observations are displayed on the chart. The Number of
observations N to display must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to display.
Results
Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the
variables selected. This includes the number of observations, the number of missing values,
the number of non-missing values, the mean and the standard deviation (unbiased).
Correlation/Covariance matrix: This table shows the data to be used afterwards in the
calculations. The type of correlation depends on the option chosen in the "General" tab in the
dialog box. For correlations, significant correlations are displayed in bold.
Cronbach's Alpha: If this option has been activated, the value of Cronbach's Alpha is
displayed.
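For reference, Cronbach's alpha can be computed by hand from the item variances and the variance of the total score. A minimal sketch (the scores below are invented for the example):

```python
def cronbach_alpha(items):
    """items: one list of scores per item, all of the same length."""
    k, n = len(items), len(items[0])

    def var(xs):  # unbiased sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(c) for c in items) / var(totals))

# Three items scored by four respondents (hypothetical data)
items = [[2, 4, 6, 8], [1, 2, 3, 4], [2, 3, 4, 5]]
print(cronbach_alpha(items))  # ≈ 0.9375
```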
Maximum change in communality at each iteration: This table is used to monitor the
maximum change in communality over the last 10 iterations. For the maximum likelihood
method, the evolution of a criterion proportional to the opposite of the maximum likelihood
is also displayed.
Goodness of fit test: The goodness of fit test is only displayed when the maximum
likelihood method has been chosen.
Reproduced correlation matrix: This matrix is the product of the factor loadings matrix with
its transpose.
Residual correlation matrix: This matrix is calculated as the difference between the variables
correlation matrix and the reproduced correlation matrix.
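The two matrices above are easy to reproduce by hand. In the sketch below the loadings and the correlation matrix are invented for illustration (one factor, three variables):

```python
import numpy as np

# Hypothetical loadings of 3 variables on 1 factor
L = np.array([[0.9], [0.8], [0.7]])
# Hypothetical observed correlation matrix
R = np.array([[1.00, 0.72, 0.63],
              [0.72, 1.00, 0.56],
              [0.63, 0.56, 1.00]])

reproduced = L @ L.T       # loadings matrix times its transpose
residual = R - reproduced  # observed minus reproduced correlations

print(np.round(reproduced, 2))
print(np.round(residual, 2))   # off-diagonal terms are 0: a perfect fit
```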
Eigenvalues: This table shows the eigenvalues associated with the various factors together
with the corresponding percentages and cumulative percentages.
Factor pattern: This table shows the factor loadings (coordinates of variables in the vector
space, also called factor pattern). The corresponding chart is displayed.
Factor/Variable correlations: This table displays the correlations between factors and
variables.
Factor pattern coefficients: This table displays the coefficients of the factor pattern.
Multiplying the (standardized) coordinates of the observations in the initial space by
these coefficients gives the coordinates of the observations in the factor space.
Where a rotation has been requested, the results of the rotation are displayed with the
rotation matrix first applied to the factor loadings. This is followed by the modified variability
percentages associated with each of the axes involved in the rotation. The coordinates of the
variables and observations after rotation are displayed in the following tables.
Factor structure: This table shows the correlations between factors and variables after
rotation.
Example
http://www.xlstat.com/demo-fa.htm
References
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245-276.
Crawford C.B. and Ferguson G.A. (1970). A general rotation criterion and its use in
orthogonal rotation. Psychometrika, 35(3), 321-332.
Cureton E.E. and Mulaik S.A. (1975). The weighted Varimax rotation and the Promax
rotation. Psychometrika, 40(2), 183-195.
Jennrich R.I. and Robinson S.M. (1969). A Newton-Raphson algorithm for maximum
likelihood factor analysis. Psychometrika, 34(1), 111-123.
Jöreskog K.G. (1977). Factor Analysis by Least-Squares and Maximum Likelihood Methods,
in Statistical Methods for Digital Computers, eds. K. Enslein, A. Ralston, and H.S. Wilf. John
Wiley & Sons, New York.
Lawley D.N. (1940). The estimation of factor loadings by the Method of maximum likelihood.
Proceedings of the Royal Society of Edinburgh. 60, 64-82.
Loehlin J.C. (1998). Latent Variable Models: an introduction to factor, path, and structural
analysis, LEA, Mahwah.
Mardia K.V., Kent J.T. and Bibby J.M. (1979). Multivariate Analysis. Academic Press,
London.
Principal Component Analysis (PCA)
Use Principal Component Analysis to analyze a quantitative observations/variables table or a
correlation or covariance matrix. This method is used to:
Obtain non-correlated factors which are linear combinations of the initial variables.
Description
Principal Component Analysis (PCA) is one of the most frequently used multivariate data
analysis methods. Given a table of quantitative data (continuous or discrete) in which n
observations (observations, products, etc.) are described by p variables (the descriptors,
attributes, measurements, etc.), if p is quite high, it is impossible to grasp the structure of the
data and the nearness of the observations by merely using univariate statistical analysis
methods or even a correlation matrix.
Uses of PCA
The study and visualization of the correlations between variables to hopefully be able to limit
the number of variables to be measured afterwards;
Obtaining non-correlated factors which are linear combinations of the initial variables so as to
use these factors in modeling methods such as linear regression, logistic regression or
discriminant analysis.
Principle of PCA
If the information carried by the first two or three axes represents a sufficient percentage of
the total variability, the observations can be represented on a 2- or 3-dimensional chart, thus
making interpretation much easier.
Correlations or covariance
From the input data, PCA computes a matrix that measures the similarity between the
variables and uses it to project the variables into a new space. The Pearson correlation
coefficient or the covariance is commonly used as the similarity index; both have the
advantage of yielding positive semi-definite matrices, whose properties are exploited in
PCA. Other indexes may nevertheless be used: XLSTAT provides Spearman and Kendall
correlations, as well as polychoric correlations for ordinal data (tetrachoric correlations are
the special case of polychoric correlations for binary data).
Traditionally, the correlation coefficient is preferred to the covariance because it removes the
effect of scale: a variable varying between 0 and 1 then weighs no more in the projection than
a variable varying between 0 and 1000. However, in certain domains, when the variables are
supposed to be on an identical scale or when we want the variance of the variables to
influence factor building, the covariance is used.
Note: where PCA is carried out on a correlation matrix, it is called normalized PCA.
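The effect of scale described above is easy to demonstrate numerically. In this sketch (simulated data, not an XLSTAT output), one variable varies on a scale 1000 times larger than the other; a covariance-based PCA is dominated by it, while in a normalized PCA the eigenvalues sum to the number of variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)            # varies on a unit scale
x2 = 1000 * rng.normal(size=200)     # varies on a scale 1000 times larger
X = np.column_stack([x1, x2])

cov_eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))
cor_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

# Covariance PCA: the large-scale variable captures almost all the variance
print(cov_eig.max() / cov_eig.sum())   # very close to 1
# Normalized PCA: eigenvalues sum to the number of variables
print(cor_eig.sum())                   # ≈ 2.0
```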
Representing the variables in a space of k factors enables the correlations between the
variables and between the variables and factors to be visually interpreted with certain
precautions.
Indeed if the observations or variables are being represented in the factor space, two points a
long distance apart in a k-dimensional space may appear near in a 2-dimensional space
depending on the direction used for the projection (see diagram below).
We can consider that the projection of a point on an axis, a plane or a 3-dimensional space is
reliable if the sum of the squared cosines on the representation axes is close to 1. The squared
cosines are displayed in the results given by XLSTAT in order to avoid any incorrect
interpretation.
If the factors are afterwards to be used with other methods, it is useful to study the relative
contribution (expressed as a percentage or a proportion) of the different variables in building
each of the factor axes so as to make the results obtained afterwards easy to interpret. The
contributions are displayed in the results given by XLSTAT.
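Both indicators can be computed directly from the factor scores. A sketch on simulated data (the formulas below are the usual unweighted ones; XLSTAT's output additionally applies observation weights):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)              # center the data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                       # coordinates of observations on the axes

# Squared cosines: quality of the projection of each observation on each axis
sq_cos = scores ** 2 / (scores ** 2).sum(axis=1, keepdims=True)

# Contributions: share of each observation in building each axis (in %)
contrib = 100 * scores ** 2 / (scores ** 2).sum(axis=0, keepdims=True)

print(np.allclose(sq_cos.sum(axis=1), 1.0))     # True: rows sum to 1
print(np.allclose(contrib.sum(axis=0), 100.0))  # True: columns sum to 100
```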
Number of factors
Two methods are commonly used for determining the number of factors to be used for
interpreting the results:
The scree test (Cattell, 1966) is based on the decreasing curve of eigenvalues. The number of
factors to be kept corresponds to the first turning point found on the curve.
We can also use the cumulative variability percentage represented by the factor axes and
decide to use only a certain percentage.
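The cumulative-percentage rule can be sketched as follows, with hypothetical eigenvalues sorted in decreasing order:

```python
eigenvalues = [3.2, 1.1, 0.4, 0.2, 0.1]   # hypothetical scree, sorted decreasing

total = sum(eigenvalues)
cumulative, running = [], 0.0
for ev in eigenvalues:
    running += ev
    cumulative.append(100 * running / total)

# Keep just enough axes to reach, say, 80% of the total variability
threshold = 80.0
k = next(i + 1 for i, c in enumerate(cumulative) if c >= threshold)

print([round(c, 1) for c in cumulative])  # [64.0, 86.0, 94.0, 98.0, 100.0]
print(k)                                  # 2 factors are enough here
```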
Graphic representations
One of the advantages of PCA is that it simultaneously provides the best view of the variables
and of the observations, as well as biplots combining both (see below). However, these
representations are only reliable if the sum of the variability percentages associated with the
axes of the representation space is sufficiently high. If this percentage is high (for example
80%), the representation can be considered reliable. If it is not, it is recommended to produce
representations on several axis pairs in order to validate the interpretation made on the first
two factor axes.
Biplots
After carrying out a PCA, it is possible to simultaneously represent both observations and
variables in the factor space. The first work on this subject dates from Gabriel (1971). Gower
and Hand (1996) and Legendre and Legendre (1998) synthesized the previous work and
extended this graphical
representation technique to other methods. The term biplot is reserved for simultaneous
representations which respect the fact that the projection of observations on variable vectors
must be representative of the input data for the same variables. In other words, the projected
points on the variable vectors must respect the order and the relative distances of the
observations for that same variable, in the input data.
Correlation biplot: This type of biplot is used to interpret the angles between the variables, as these are
directly linked to the correlations between the variables. The position of two observations
projected onto a variable vector can be used to determine their relative level for this variable.
The distance between the two observations is an approximation of the Mahalanobis distance in
the k-dimensional factor space. Lastly, the projection of a variable vector in the representation
space is an approximation of the standard deviation of the variable (the length of the vector in
the k-dimensional factor space is equal to the standard deviation of the variable).
Distance biplot: A distance biplot is used to interpret the distances between the observations
as these are an approximation of their Euclidean distance in the p-dimensional variable space.
The position of two observations projected onto a variable vector can be used to determine
their relative level for this variable. Lastly, the length of a variable vector in the representation
space is representative of the variable's level of contribution to building this space (the length
of the vector is the square root of the sum of the contributions).
Symmetric biplot: This biplot was proposed by Jobson (1992) and is half-way between the
two previous biplots. If neither the angles nor the distances can be interpreted, this
representation may be chosen as it is a compromise between the two.
XLSTAT lets you adjust the lengths of the variable vectors so as to improve the readability of
the charts. However, if you use this option with a correlation biplot, the projection of a variable
vector will no longer be an approximation of the standard deviation of the variable.
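The three biplots differ only in how the singular values of the data are split between the observation and variable coordinates. The sketch below uses one common convention (XLSTAT's exact scaling constants may differ); in every case the product of the two coordinate sets reconstructs the centered data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T

dist_obs, dist_var = U * s, V                        # distance biplot
corr_obs, corr_var = U, V * s                        # correlation biplot
symm_obs, symm_var = U * np.sqrt(s), V * np.sqrt(s)  # symmetric biplot

for obs, var in [(dist_obs, dist_var),
                 (corr_obs, corr_var),
                 (symm_obs, symm_var)]:
    print(np.allclose(obs @ var.T, Xc))  # True for all three scalings
```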
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
The main data entry field is used to select one of three types of table:
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections
(observations/variables table, weights, observation labels) includes a header. Where the
selection is a correlation or covariance matrix, if this option is activated, the first column must
also include the variable labels.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Variable labels" option is activated.
Options tab:
Filter factors: You can activate one of the following two options in order to reduce the number
of factors for which results are displayed.
Minimum %: Activate this option then enter the minimum percentage of the total
variability that the chosen factors must represent.
Maximum Number: Activate this option to set the number of factors to take into
account.
Rotation: Activate this option if you want to apply a rotation to the factor coordinate matrix.
Number of factors: Enter the number of factors the rotation is to be applied to.
Method: Choose the rotation method to be used. For certain methods a parameter
must be entered (Gamma for Orthomax, Tau for Oblimin, and the power for Promax).
Kaiser normalization: Activate this option to apply Kaiser normalization during the
rotation calculation.
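As an illustration of what an orthogonal rotation does, here is a minimal Varimax implementation (the standard iterative SVD algorithm, without Kaiser normalization; XLSTAT's implementation is richer). A clean two-factor structure is mixed by a 30-degree rotation and then recovered:

```python
import numpy as np

def varimax(L, tol=1e-8, max_iter=500):
    """Rotate the loadings matrix L to maximize the Varimax criterion.
    Returns the rotated loadings and the orthogonal rotation matrix."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return L @ R, R

# Hypothetical simple structure, deliberately mixed by a 30-degree rotation
L0 = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.6]])
theta = np.pi / 6
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
Lr, R = varimax(L0 @ mix)
print(np.round(np.abs(Lr), 2))  # close to the simple structure (up to signs)
```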
Supplementary observations: Activate this option if you want to calculate the coordinates
and represent additional observations. These observations are not taken into account for the
factor axis calculations (passive observations as opposed to active observations). Several
methods for selecting supplementary observations are provided:
Random: The observations are randomly selected. The Number of observations N to
display must then be specified.
N last rows: The last N observations are selected for validation. The Number of
observations N to display must then be specified.
N first rows: The first N observations are selected for validation. The Number of
observations N to display must then be specified.
Group variable: If you choose this option, you must then select an indicator variable
set to 0 for active observations and 1 for passive observations.
Supplementary variables: Activate this option if you want to calculate coordinates afterwards
for variables which were not used in calculating the factor axes (passive variables as opposed
to active variables).
Color observations: Activate this option so that the observations are displayed
using different colors, depending on the value of the first qualitative
supplementary variable.
Display the centroids: Activate this option to display the centroids that
correspond to the categories of the qualitative supplementary variables.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Pairwise deletion: Activate this option to remove observations with missing data only when
the variables involved in the calculations have missing data. For example, when calculating the
correlation between two variables, an observation will only be ignored if the data
corresponding to one of the two variables is missing.
Estimate missing data: Activate this option to estimate the missing data before the
calculation starts.
Mean or mode: Activate this option to estimate the missing data by using the mean
(quantitative variables) or the mode (qualitative variables) for the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data for an observation
by searching for the nearest neighbour to the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Correlations: Activate this option to display the correlation or covariance matrix depending on
the type of options chosen in the "General" tab.
Test significance: Where a correlation was chosen in the "General" tab in the dialog
box, activate this option to test the significance of the correlations.
Bartlett's sphericity test: Activate this option to perform the Bartlett sphericity test.
Significance level (%): Enter the significance level for the above tests.
Eigenvalues: Activate this option to display the table and chart (scree plot) of eigenvalues.
Factor loadings: Activate this option to display the coordinates of the variables in the factor
space.
Factor scores: Activate to display the coordinates of the observations (factor scores) in the
new space created by PCA.
Contributions: Activate this option to display the contribution tables for the variables and
observations.
Squared cosines: Activate this option to display the tables of squared cosines for the
variables and observations.
Charts tab:
Correlations charts: Activate this option to display charts showing the correlations between
the components and initial variables.
Vectors: Activate this option to display the initial variables in the form of vectors.
Observations charts: Activate this option to display charts representing the observations in
the new space.
Labels: Activate this option to have observation labels displayed on the charts. The
number of labels displayed can be changed using the filtering option.
Biplots: Activate this option to display charts representing the observations and variables
simultaneously in the new space.
Vectors: Activate this option to display the initial variables in the form of vectors.
Labels: Activate this option to have observation labels displayed on the biplots. The
number of labels displayed can be changed using the filtering option.
Type of biplot: Choose the type of biplot you want to display. See the description section for
more details.
Colored labels: Activate this option to show labels in the same color as the points.
N first rows: The first N observations are displayed on the chart. The Number of
observations N to display must then be specified.
N last rows: The last N observations are displayed on the chart. The Number of
observations N to display must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to display.
Results
Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the
variables selected. This includes the number of observations, the number of missing values,
the number of non-missing values, the mean and the standard deviation (unbiased).
Correlation/Covariance matrix: This table shows the data to be used afterwards in the
calculations. The type of correlation depends on the option chosen in the "General" tab in the
dialog box. For correlations, significant correlations are displayed in bold.
Bartlett's sphericity test: The results of the Bartlett sphericity test are displayed. They are
used to accept or reject the hypothesis that the variables are not significantly correlated.
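For reference, the Bartlett sphericity statistic can be computed as Chi² = -(n - 1 - (2p + 5)/6)·ln|R| with p(p - 1)/2 degrees of freedom. A sketch with an invented 2-variable correlation matrix:

```python
import math
import numpy as np

def bartlett_sphericity(R, n):
    """R: correlation matrix, n: number of observations.
    Returns the chi-square statistic and its degrees of freedom."""
    R = np.asarray(R, float)
    p = R.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6) * math.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return chi2, df

chi2, df = bartlett_sphericity([[1.0, 0.5], [0.5, 1.0]], n=100)
print(round(chi2, 2), df)  # 28.05 1
```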
Eigenvalues: The eigenvalues and corresponding chart (scree plot) are displayed. The
number of eigenvalues displayed is equal to the number of non-null eigenvalues.
If the corresponding output options have been activated, XLSTAT afterwards displays the
factor loadings in the new space, then the correlations between the initial variables and the
components in the new space. The correlations are equal to the factor loadings in a
normalized PCA (on the correlation matrix).
Contributions: Contributions are an interpretation aid. The variables which had the highest
influence in building the axes are those whose contributions are highest.
Squared cosines: As in other factor methods, squared cosine analysis is used to avoid
interpretation errors due to projection effects. If the squared cosines associated with the axes
used on a chart are low, the position of the observation or the variable in question should not
be interpreted.
The factor scores in the new space are then displayed. If supplementary data have been
selected, these are displayed at the end of the table.
Contributions: This table shows the contributions of the observations in building the principal
components.
Squared cosines: This table displays the squared cosines between the observation vectors
and the factor axes.
Where a rotation has been requested, the results of the rotation are displayed with the
rotation matrix first applied to the factor loadings. This is followed by the modified variability
percentages associated with each of the axes involved in the rotation. The coordinates,
contributions and cosines of the variables and observations after rotation are displayed in the
following tables.
Example
A tutorial on how to use Principal Component Analysis is available on the Addinsoft website:
http://www.xlstat.com/demo-pca.htm
References
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245-276.
Gabriel K.R. (1971). The biplot graphic display of matrices with application to principal
component analysis. Biometrika, 58, 453-467.
Gower J.C. and Hand D.J. (1996). Biplots. Monographs on Statistics and Applied Probability,
54, Chapman and Hall, London.
Jobson J.D. (1992). Applied Multivariate Data Analysis. Volume II: Categorical and
Multivariate Methods. Springer-Verlag, New York.
Jolliffe I.T. (2002). Principal Component Analysis, Second Edition. Springer, New York.
Legendre P. and Legendre L. (1998). Numerical Ecology. Second English Edition. Elsevier,
Amsterdam, 403-406.
Discriminant Analysis (DA)
Use discriminant analysis to explain and predict the membership of observations to several
classes using quantitative or qualitative explanatory variables.
Description
Discriminant Analysis (DA) is an old method (Fisher, 1936) which in its classic form has
changed little in the past twenty years. This method, which is both explanatory and predictive,
can be used to:
Check on a two- or three-dimensional chart if the groups to which observations belong are
distinct,
DA may be used in numerous applications, for example in ecology and the prediction of
financial risks (credit scoring).
Two models of DA are used depending on a basic assumption: if the covariance matrices are
assumed to be identical, linear discriminant analysis is used. If, on the contrary, it is assumed
that the covariance matrices differ in at least two groups, then the quadratic model is used.
The Box test is used to test this hypothesis (the Bartlett approximation enables a Chi²
distribution to be used for the test). We start with linear analysis and then, depending on the
results from the Box test, carry out quadratic analysis if required.
Multicollinearity issues
With linear and even more so with quadratic models, we can face problems of variables with a
null variance or of multicollinearity between variables. XLSTAT has been programmed so as
to avoid these problems: the variables responsible are automatically ignored, either for all
calculations or, in the case of a quadratic model, for the groups in which the problems arise.
Multicollinearity statistics are optionally displayed so that you can identify the variables which
are causing problems.
Stepwise methods
As for linear and logistic regression, efficient stepwise methods have been proposed. They
can, however, only be used when quantitative variables are selected as inputs, and the tests
performed on the variables assume that they are normally distributed. The stepwise method
yields a powerful model which avoids variables that contribute only little to the model.
Among the numerous results provided, XLSTAT can display the classification table (also called
confusion matrix) used to calculate the percentage of well-classified observations. When only
two classes (or categories or modalities) are present in the dependent variable, the ROC curve
may also be displayed.
The ROC curve (Receiver Operating Characteristic) displays the performance of a model and
enables a comparison to be made with other models. The terms used come from signal
detection theory.
The proportion of well-classified positive events is called the sensitivity. The specificity is the
proportion of well-classified negative events. If you vary the threshold probability from which an
event is to be considered positive, the sensitivity and specificity will also vary. The curve of
points (1-specificity, sensitivity) is the ROC curve.
Let's consider a binary dependent variable which indicates, for example, if a customer has
responded favorably to a mail shot. In the diagram below, the blue curve corresponds to an
ideal case where the n% of people responding favorably corresponds to the n% highest
probabilities. The green curve corresponds to a well-discriminating model. The red curve (first
bisector) corresponds to what is obtained with a random Bernoulli model with a response
probability equal to that observed in the sample studied. A model close to the red curve is
therefore inefficient since it is no better than random generation. A model below this curve
would be disastrous since it would be even worse than random.
The area under the curve (or AUC) is a synthetic index calculated for ROC curves. The AUC
corresponds to the probability that the model assigns a higher score to a positive event than to
a negative event. For an ideal model, AUC = 1; for a random model, AUC = 0.5. A model is
usually considered good when the AUC value is greater than 0.7. A well-discriminating model
must have an AUC of between 0.87 and 0.9. A model with an AUC greater than 0.9 is
excellent.
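The probabilistic definition of the AUC given above translates directly into code. A minimal sketch (the scores and labels are invented for the example; ties count for one half):

```python
def auc(scores, labels):
    """Probability that a positive event receives a higher score than a
    negative one (rank-sum definition of the area under the ROC curve)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]   # predicted probabilities
labels = [1, 1, 0, 1, 0]             # observed events
print(auc(scores, labels))           # 5/6, about 0.83
```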
The results of the model as regards forecasting may be too optimistic: we are effectively trying
to check if an observation is well-classified while the observation itself is being used in
calculating the model. For this reason, cross-validation was developed: to determine the
probability that an observation will belong to the various groups, it is removed from the learning
sample, then the model and the forecast are calculated. This operation is repeated for all the
observations in the learning sample. The results thus obtained will be more representative of
the quality of the model. XLSTAT gives the option of calculating the various statistics
associated with each of the observations in cross-validation mode together with the
classification table and the ROC curve if there are only two classes.
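The cross-validation loop described above can be sketched with any classifier; here a toy nearest-centroid rule stands in for the discriminant model (the data are invented for the example):

```python
import numpy as np

def nearest_centroid_loo(X, y):
    """Leave-one-out accuracy of a toy nearest-centroid classifier: each
    observation is removed in turn, the class centroids are refitted on the
    remaining data, and the held-out observation is then classified."""
    X, y = np.asarray(X, float), np.asarray(y)
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xtr, ytr = X[mask], y[mask]
        cents = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
        pred = min(cents, key=lambda c: np.linalg.norm(X[i] - cents[c]))
        correct += pred == y[i]
    return correct / len(y)

X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
y = [0, 0, 0, 1, 1, 1]
print(nearest_centroid_loo(X, y))  # 1.0: every held-out point is recovered
```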
Lastly, you are advised to validate the model on a validation sample wherever possible.
XLSTAT has several options for generating a validation sample automatically.
Where there are only two classes to predict for the dependent variable, discriminant analysis is
very much like logistic regression. Discriminant analysis is useful for studying the covariance
structures in detail and for providing a graphic representation. Logistic regression has the
advantage of offering several possible model templates and of enabling the use of stepwise
selection methods, including for qualitative explanatory variables. The user can compare the
performances of the two methods by using the ROC curves.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Dependent variables:
Qualitative: Select the qualitative variable(s) you want to model. If several variables have
been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
X / Explanatory variables:
Quantitative: Activate this option if you want to include one or more quantitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
selected data must be numerical. If a variable header has been selected, check that the
"Variable labels" option has been activated.
Qualitative: Activate this option if you want to include one or more qualitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
selected data may be of any type, but numerical data will automatically be considered as
nominal. If a variable header has been selected, check that the "Variable labels" option has
been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Observation weights: Activate this option if the observations are weighted. If you do not
activate this option, the weights will all be considered as 1. XLSTAT uses these weights for
calculating degrees of freedom. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Variable labels" option has been activated.
Options tab:
Tolerance: Enter the value of the tolerance threshold below which a variable will automatically
be ignored.
Equality of covariance matrices: Activate this option if you want to assume that the
covariance matrices associated with the various classes of the dependent variable are equal.
Prior probabilities: Activate this option if you want to take prior probabilities into account. The
probabilities associated with each of the classes are then equal to the frequencies of the
classes.
Note: this option has no effect if the prior probabilities are equal for the various groups.
Filter factors: You can activate one of the two following options in order to reduce the number
of factors used in the model:
Minimum %: Activate this option and enter the minimum percentage of total variability
that the selected factors should represent.
Maximum number: Activate this option to set the maximum number of factors to take
into account.
Significance level (%): Enter the significance level for the various tests calculated.
Model selection: Activate this option if you want to use one of the four selection methods
provided:
Stepwise (Forward): The selection process starts by adding the variable with the
largest contribution to the model. A second variable is then added if its entry probability
is lower than the entry threshold value. From the moment a third variable is added, the
impact of removing each variable already present in the model is evaluated. If the
probability of the calculated statistic is greater than the removal threshold value, the
variable is removed from the model.
Stepwise (Backward): This method is similar to the previous one but starts from a
complete model.
Forward: The procedure is the same as for stepwise selection except that variables are
only added and never removed.
Backward: The procedure starts by simultaneously adding all variables. The variables
are then removed from the model following the procedure used for stepwise selection.
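The forward part of the stepwise logic above can be sketched in a few lines. This is a simplified, hypothetical illustration (the score function, threshold, and variable names are made up; the actual implementation works with entry/removal probabilities of test statistics):

```python
# Sketch of forward selection: at each step, add the candidate variable that
# most improves a model score, and stop when no candidate improves the score
# by more than a threshold. "score" is a made-up stand-in for a model fit
# criterion; it is not XLSTAT's actual statistic.
def forward_select(candidates, score, threshold=0.0):
    selected = []
    best = score(selected)
    improved = True
    while improved and candidates:
        improved = False
        gains = {v: score(selected + [v]) - best for v in candidates}
        v, gain = max(gains.items(), key=lambda kv: kv[1])
        if gain > threshold:
            selected.append(v)
            candidates = [c for c in candidates if c != v]
            best += gain
            improved = True
    return selected
```

With a toy additive score, a weakly contributing variable is left out once its gain falls below the threshold.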
Classes weight correction: If the numbers of observations for the various classes of the
dependent variable are not uniform, there is a risk of penalizing classes with a low number of
observations when establishing the model. To overcome this problem, XLSTAT offers two options:
Corrective weights: You can select the weights to be assigned to each observation.
Validation tab:
Validation: Activate this option if you want to use a sub-sample of the data to validate the
model.
Validation set: Choose one of the following options to define how to obtain the observations
used for the validation:
Random: The observations are randomly selected. The Number of observations N
must then be specified.
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use for the validation.
Prediction tab:
Prediction: Activate this option if you want to select data to use in prediction mode. If you
activate this option, you need to make sure that the prediction dataset is structured like the
estimation dataset: same variables in the same order in the selections. However, variable
labels must not be selected: the first row of the selections listed below must correspond to
data.
Quantitative: Activate this option to select the quantitative explanatory variables. The first row
must not include variable labels.
Qualitative: Activate this option to select the qualitative explanatory variables. The first row
must not include variable labels.
Observations labels: Activate this option if observations labels are available. Then select the
corresponding data. If this option is not activated, the observations labels are automatically
generated by XLSTAT (PredObs1, PredObs2, …).
Missing data tab:
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Multicollinearity statistics: Activate this option to display the table of multicollinearity statistics.
Covariance matrices: Activate this option to display the inter-class, intra-class, intra-class
total, and total covariance matrices.
SSCP matrices: Activate this option to display the inter-class, intra-class total, and total SSCP
(Sums of Squares and Cross Products) matrices.
Distance matrices: Activate this option to display the matrices of distances between groups.
Canonical correlations and functions: Activate this option for canonical correlations and
functions.
Eigenvalues: Activate this option to display the table and chart (scree plot) of eigenvalues.
Factor loadings: Activate this option to display the coordinates of the variables in factor
space.
Factor scores: Activate this option to display the coordinates of the observations in the factor
space. The prior and posterior classes for each observation, the probabilities of assignment for
each class and the distances of the observations from their centroid are also displayed in this
table.
Confusion matrix: Activate this option to display the table showing the numbers of well- and
badly-classified observations for each of the classes.
Charts tab:
Correlation charts: Activate this option to display the charts involving correlations between
the factors and input variables.
Vectors: Activate this option to display the input variables with vectors.
Observations charts: Activate this option to display the charts that allow visualizing the
observations in the new space.
Labels: Activate this option to display the observations labels on the charts. The
number of labels can be modulated using the filtering option.
Display the centroids: Activate this option to display the centroids that correspond to
the categories of the dependent variable.
Centroids and confidence circles: Activate this option to display a chart with the centroids
and the confidence circles around the means.
Colored labels: Activate this option to display the labels with the same color as the
corresponding points. If this option is not activated the labels are displayed in black.
N first rows: The N first observations are displayed on the chart. The Number of
observations N to display must then be specified.
N last rows: The N last observations are displayed on the chart. The Number of
observations N to display must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to display.
Results
Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the
variables selected. The number of missing values, the number of non-missing values, the
mean and the standard deviation (unbiased) are displayed for the quantitative variables. For
qualitative variables, including the dependent variable, the categories with their respective
frequencies and percentages are displayed.
Correlation matrix: This table displays the correlations between the explanatory variables.
Means by class: This table provides the means of the various explanatory variables for the
various classes of the dependent variable.
Sum of weights, prior probabilities and logarithms of determinants for each class: These
statistics are used, among other places, in the posterior calculations of probabilities for the
observations.
Multicollinearity: This table identifies the variables responsible for multicollinearity between
variables. As soon as a variable is identified as being responsible for a multicollinearity (its
tolerance is less than the limit tolerance set in the "Options" tab of the dialog box), it is not
included in the multicollinearity statistics calculation for the following variables. Thus, in the
extreme case where two variables are identical, only one of the two variables is eliminated
from the calculations. The statistics displayed are the tolerance (equal to 1-R²), its inverse, and
the VIF (variance inflation factor).
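The tolerance and VIF statistics described above can be reproduced with a short script. This is a sketch under the stated definitions (tolerance = 1 − R², where R² comes from regressing each variable on the others, and VIF = 1/tolerance); the data in the test are made up:

```python
import numpy as np

# Tolerance and VIF for each column of X: regress column j on the remaining
# columns (with an intercept), compute R², then tolerance = 1 - R² and
# VIF = 1 / tolerance. A tolerance near 0 flags a (multi)collinear variable.
def tolerance_vif(X):
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # design with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        tol = 1 - r2
        out.append((tol, 1 / tol))
    return out
```

Note that a variable that is an exact linear combination of the others has tolerance 0, which is why a tolerance threshold (rather than a strict zero test) is used to exclude variables.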
SSCP matrices: The SSCP (Sums of Squares and Cross Products) matrices are proportional
to the covariance matrices. They are used in the calculations and check the following
relationship: SSCP total = SSCP inter + SSCP intra total.
Covariance matrices: The inter-class covariance matrix (equal to the unbiased covariance
matrix for the means of the various classes), the intra-class covariance matrix for each of the
classes (unbiased), the total intra-class covariance matrix, which is a weighted sum of the
preceding ones, and the total covariance matrix calculated for all observations (unbiased) are
displayed successively.
Box test: The Box test is used to test the assumption of equality of the intra-class covariance
matrices. Two approximations are available, one based on the Chi² distribution, and the other
on the Fisher distribution. The results of both tests are displayed.
Kullback's test: The Kullback test is used to test the assumption of equality of the intra-class
covariance matrices. The statistic calculated is approximately distributed according to a Chi²
distribution.
Mahalanobis distances: The Mahalanobis distance is used to measure the distance between
classes taking account of the covariance structure. If we assume the intra-class variance
matrices are equal, the distance matrix is calculated by using the total intra-class covariance
matrix which is symmetric. If we assume the intra-class variance matrices are not equal, the
Mahalanobis distance between classes i and j is calculated by using the intra-class covariance
matrix for class i which is symmetric. The distance matrix is therefore asymmetric.
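The equal-covariance case above can be illustrated numerically. The function below is a minimal sketch that assumes a pooled (total intra-class) covariance matrix is already available; replacing it with the covariance matrix of class i gives the unequal-covariance, asymmetric variant:

```python
import numpy as np

# Squared Mahalanobis distance between two class centroids mu_i and mu_j,
# accounting for the covariance structure through the matrix "cov"
# (the total intra-class covariance matrix in the equal-covariance case).
def mahalanobis_sq(mu_i, mu_j, cov):
    d = np.asarray(mu_i, float) - np.asarray(mu_j, float)
    return float(d @ np.linalg.solve(cov, d))
```

With an identity covariance matrix the distance reduces to the squared Euclidean distance, which is a convenient sanity check.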
Fisher's distances: If the covariance matrices are assumed to be equal, the Fisher distances
between the classes are displayed. They are calculated from the Mahalanobis distances and
are used for a significance test. The matrix of p-values is displayed so as to identify which
distances are significant.
Generalized squared distances: If the covariance matrices are not assumed to be equal, the
table of generalized squared distances between the classes is displayed. The generalized
distance is also calculated from the Mahalanobis distances and uses the logarithms of the
determinants of the covariance matrices together with the logarithms of the prior probabilities if
required by the user.
Wilks' Lambda test (Rao's approximation): This test is used to test the assumption of
equality of the mean vectors for the various classes. When there are two classes, the test is
equivalent to the Fisher test mentioned previously. If the number of classes is less than or
equal to three, the test is exact. From four classes onwards, the Rao approximation is required
to obtain a statistic approximately distributed according to a Fisher distribution.
Unidimensional test of equality of the means of the classes: These tests are used to test
the assumption of equality of the means between classes, variable by variable. Wilks'
univariate lambda is always between 0 and 1. A value of 1 means that the class means are
equal. A low value means that intra-class variations are low compared with inter-class
variations, hence a significant difference in class means.
Pillai's trace: This test is used to test the assumption of equality of the mean vectors for the
various classes. It is less used than Wilks' Lambda test and also uses the Fisher distribution
for calculating p-values.
Hotelling-Lawley trace: This test is used to test the assumption of equality of the mean
vectors for the various classes. It is less used than Wilks' Lambda test and also uses the
Fisher distribution for calculating p-values.
Roy's greatest root: This test is used to test the assumption of equality of the mean vectors
for the various classes. It is less used than Wilks' Lambda test and also uses the Fisher
distribution for calculating p-values.
Eigenvalues: This table shows the eigenvalues associated with the various factors together
with the corresponding discrimination percentages and cumulative percentages. In
discriminant analysis, the number of non-null eigenvalues is equal to at most (k-1) where k is
the number of classes. The scree plot is used to display how the discriminant power is
distributed between the discriminant factors. The sum of the eigenvalues is equal to the
Hotelling trace.
Eigenvectors: This table shows the eigenvectors afterwards used in the canonical
correlations, canonical functions and observation coordinates (scores) calculations.
Variables/Factors correlations: This table displays the correlations between the initial
variables and the factors, shown in a correlation circle. The correlation circle is an
aid in interpreting the representation of the observations in factor space.
Canonical correlations: The canonical correlations associated with each factor are the
square roots of L(i) / (1 + L(i)), where L(i) is the eigenvalue associated with factor i. Canonical
correlations are also a measurement of the discriminant power of the factors. The sum of their
squares is equal to the Pillai's trace.
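As a numerical illustration, and assuming the eigenvalues are the ones whose sum gives the Hotelling trace (as stated in the eigenvalues table above), the standard relation r(i)² = L(i) / (1 + L(i)) can be checked in a few lines. This is a sketch of the textbook relation, not XLSTAT's code:

```python
import math

# Canonical correlations from discriminant-analysis eigenvalues, assuming
# the eigenvalues L(i) sum to the Hotelling-Lawley trace. Under that
# convention r(i)^2 = L(i) / (1 + L(i)), and the sum of the r(i)^2 is the
# Pillai trace.
def canonical_correlations(eigenvalues):
    return [math.sqrt(l / (1 + l)) for l in eigenvalues]
```

For eigenvalues 3 and 1, the Hotelling trace is 4 and the Pillai trace is 3/4 + 1/2 = 1.25, which the sum of squared canonical correlations reproduces.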
Functions at the centroids: This table gives the evaluation of the discriminant functions for
the mean points for each of the classes.
Classification functions: The classification functions can be used to determine which class
an observation is to be assigned to using values taken for the various explanatory variables. If
the covariance matrices are assumed equal, these functions are linear. If the covariance
matrices are assumed unequal, these functions are quadratic. An observation is assigned to
the class with the highest classification function.
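The linear (equal covariance) case can be sketched as follows. The score formula used here, f_k(x) = x'S⁻¹μ_k − ½ μ_k'S⁻¹μ_k + ln(π_k), is a standard form of the linear classification function and is assumed, not taken from this manual; the means, priors, and pooled covariance in the test are made up:

```python
import numpy as np

# Assign an observation x to the class with the highest linear classification
# function, computed from the class means, a pooled covariance matrix S
# (equal-covariance assumption) and the prior probabilities.
def classify(x, means, pooled_cov, priors):
    Sinv = np.linalg.inv(pooled_cov)
    scores = []
    for mu, p in zip(means, priors):
        mu = np.asarray(mu, float)
        scores.append(x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(p))
    return int(np.argmax(scores))
```

In the unequal-covariance case the functions become quadratic in x (each class uses its own covariance matrix and an extra log-determinant term), which this linear sketch does not cover.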
Confusion matrix for the estimation sample: The confusion matrix is deduced from the prior
and posterior classifications; the overall percentage of well-classified observations is also
displayed. Where the dependent variable only comprises two classes, the ROC curve is
displayed (see the description section for more details).
Cross-validation: Where cross-validation has been requested, the table containing the
information for the observations and the confusion matrix are displayed (see the description
section for more details).
Example
http://www.xlstat.com/demo-da.htm
References
Fisher R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7, 179 -188.
Jobson J.D. (1992). Applied Multivariate Data Analysis. Volume II: Categorical and
Multivariate Methods. Springer-Verlag, New York.
Tomassone R., Danzart M., Daudin J.J. and Masson J.P. (1988). Discrimination et Classement.
Masson, Paris.
Correspondence Analysis (CA)
Use this tool to visualize the links between the categories of two qualitative variables. The
variables can be available as an observations/variables table, as a contingency table, or as a
more general type of two-way table.
Description
Correspondence Analysis is a powerful method that allows studying the association between
two qualitative variables. The research of J.-P. Benzécri, which started in the early sixties, led
to the emergence of the method. His disciples worked on several developments of the basic
method. For example, M. J. Greenacre's book (1984) contributed to the popularity of the
method throughout the world. The work of C. Lauro from the University of Naples led to a non-
symmetrical variant of the method.
Measuring the association between two qualitative variables is a complex subject that first
requires transforming the data: it is not possible to compute a correlation coefficient using the
data directly, as one could do with quantitative variables.
The first transformation consists of recoding the two qualitative variables V1 and V2 as two
disjunctive tables Z1 and Z2 or indicator (or dummy) variables. For each category of a variable
there is a column in the respective disjunctive table. Each time the category c of variable V1
occurs for an observation i, the value of Z1(i,c) is set to one (the same rule is applied to the V2
variable). The other values of Z1 and Z2 are zero. The generalization of this idea to more than
two variables is called Multiple Correspondence Analysis. When there are only two variables, it
is sufficient to study the contingency table of the two variables, that is the table Z1'Z2 (where '
indicates matrix transposition).
The Chi-square distance has been suggested to measure the distance between two
categories. To represent the distance between two categories it is not necessary to start from
the Z1 and Z2 disjunctive tables. It is enough to start from the contingency table that
algebraically corresponds to the Z1'Z2 product.
V1 \ V2       Category 1   …   Category j   …   Category m2
Category 1    n(1,1)       …   n(1,j)       …   n(1,m2)
…             …                …                …
Category m1   n(m1,1)      …   n(m1,j)      …   n(m1,m2)
where n(i,j) is the frequency of observations that show both characteristic i for variable V1, and
characteristic j for variable V2.
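The recoding described above can be illustrated with a short script; the two category lists are made-up data:

```python
import numpy as np

# Recode a qualitative variable as a disjunctive (dummy) table: one column
# per category, a 1 where the category occurs for the observation, 0 elsewhere.
def disjunctive(values, categories):
    Z = np.zeros((len(values), len(categories)), dtype=int)
    for i, v in enumerate(values):
        Z[i, categories.index(v)] = 1
    return Z

v1 = ["a", "a", "b", "b", "b"]
v2 = ["x", "y", "x", "x", "y"]
Z1 = disjunctive(v1, ["a", "b"])
Z2 = disjunctive(v2, ["x", "y"])
N = Z1.T @ Z2   # contingency table: N[i, j] counts observations with
                # category i of V1 and category j of V2
```

The product Z1'Z2 therefore recovers the contingency table directly, which is why it is enough to start from that table rather than from the disjunctive tables themselves.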
Inertia is a measure inspired from physics that is often used in Correspondence Analysis. The
inertia of a set of points is the weighted mean of the squared distances to the center of gravity.
In the specific case of CA, the total inertia of the set of points (one point corresponds to one
category) can be written as:
$$\phi^2 = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2}\frac{\left(\dfrac{n_{ij}}{n}-\dfrac{n_{i.}\,n_{.j}}{n^2}\right)^2}{\dfrac{n_{i.}\,n_{.j}}{n^2}}, \quad \text{with } n_{i.}=\sum_{j=1}^{m_2}n_{ij} \text{ and } n_{.j}=\sum_{i=1}^{m_1}n_{ij}$$
and where n is the sum of the frequencies in the contingency table. We can see that the inertia
is proportional to the Pearson chi-square statistic computed on the contingency table.
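The total inertia and its link with the chi-square statistic can be sketched in a few lines; the contingency table used in the test is made up:

```python
import numpy as np

# Total inertia of a contingency table N: the weighted sum of squared
# deviations of the relative frequencies from the independence model.
# It equals the Pearson chi-square statistic divided by the grand total n.
def total_inertia(N):
    N = np.asarray(N, float)
    n = N.sum()
    r = N.sum(axis=1) / n          # row masses n_i. / n
    c = N.sum(axis=0) / n          # column masses n_.j / n
    E = np.outer(r, c)             # expected relative frequencies under independence
    P = N / n
    return float((((P - E) ** 2) / E).sum())
```

For the 2×2 table [[10, 20], [30, 40]] the chi-square statistic works out by hand to 50/63, so the total inertia is (50/63)/100.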
The aim of CA is to represent as much of the inertia on the first principal axis as possible, a
maximum of the residual inertia on the second principal axis and so on until all the total inertia
is represented in the space of the principal axes. One can show that the number of dimensions
of the space is equal to min(m1, m2)-1.
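One standard way to obtain the principal axes (assumed here, not taken from this manual) is the singular value decomposition of the matrix of standardized residuals; the squared singular values are then the principal inertias, at most min(m1, m2) − 1 of them are non-null, and they sum to the total inertia:

```python
import numpy as np

# Principal inertias of CA via the SVD of the standardized residuals
# S = (P - rc') / sqrt(rc'), where P is the table of relative frequencies
# and r, c the row and column masses. The squared singular values are the
# principal inertias (eigenvalues).
def ca_eigenvalues(N):
    N = np.asarray(N, float)
    n = N.sum()
    r = N.sum(axis=1) / n
    c = N.sum(axis=0) / n
    S = (N / n - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sv = np.linalg.svd(S, compute_uv=False)
    return sv ** 2
```

For a 2×2 table, min(m1, m2) − 1 = 1, so a single non-trivial eigenvalue carries all the inertia.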
In the non-symmetrical variant of the method (NSCA), the association is measured by the
Goodman and Kruskal's tau of the rows given the columns:

$$\tau_{R|C}=\dfrac{\sum_{j=1}^{m_2}\sum_{i=1}^{m_1}\dfrac{n_{.j}}{n}\left(\dfrac{n_{ij}}{n_{.j}}-\dfrac{n_{i.}}{n}\right)^2}{1-\sum_{i=1}^{m_1}\left(\dfrac{n_{i.}}{n}\right)^2}$$
As with the total inertia, it is possible to compute a representation space for the categories,
such that the proportion of the Goodman and Kruskals tau represented on the chart is
maximized on the first axes.
Greenacre (1984) defined a framework (the generalized singular value decomposition) that
allows computing both CA and the related method of NSCA in a similar way.
The analysis of a subset of categories is a new method that has been recently developed by
Greenacre (2006). It allows parts of tables to be analyzed while maintaining the margins of the
whole table and thus the same weights and chi-square distances of the whole table, simplifying
the analysis of large tables by breaking down the interpretation into parts.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT loads the data:
Case where the data are in a contingency table or a more general two-way table: if the
arrow points down XLSTAT allows you to select data by columns or by range. If the
arrow points to the right, XLSTAT allows you to select data by rows or by range.
Case where the data are in an observations/variables table: if the arrow points down,
XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and
columns to observations.
General tab:
The first selection field lets you alternatively select two types of tables:
Two-way table: Select this option if your data correspond to a two-way table where the cells
contain the frequencies corresponding to the various categories of two qualitative variables (in
this case it is more precisely a contingency table), or other type of values.
Observations/variables table: Select this option if your data correspond to N observations
described by 2 qualitative variables. This type of table typically corresponds to a survey with 2
questions. During the computations, XLSTAT will automatically transform this table into a
contingency table.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Labels included: This option is visible if the selected table is a contingency table or a more
general two-way table. Activate this option if the labels of the columns and rows are included in
the selection.
Variable labels: This option is visible only if you selected the observations/variables table
format. Activate this option if the first row contains the variable labels (case of an
observations/variables table) or the category labels (case of a disjunctive table).
Weights: This option is visible only if you selected the observations/variables table format.
Activate this option if you want to weight the observations. If you do not activate this option, the
weights are considered to be equal to 1. The weights must be greater than or equal to 0. If the
Variable labels option is activated, make sure that the header of the selection has also been
selected.
Options tab:
Advanced analysis: This option is active only in the case where the input is a contingency
table or a more general two-way table. The possible options are:
Supplementary data: If you select this option you may then enter the number of
supplementary rows and/or columns. Supplementary rows and columns are passive
data that are not taken into account for the computation of the representation space.
Their coordinates are computed a posteriori. Notice that supplementary data should be
the last rows and/or columns of the data table.
Subset analysis: If you select this option you can then enter the number of rows and/or
columns to exclude from the subset analysis. See the description section for more
information on this topic. Notice that the excluded data should be the last rows and/or
columns of the data table.
Rows depend on columns: Select this option if you consider that the row variable
depends on the column variable and if you want to analyze the association between
both while taking into account this dependency.
Columns depend on rows: Select this option if you consider that the column variable
depends on the row variable and if you want to analyze the association between both
while taking into account this dependency.
Test of independence: Activate this option if you want XLSTAT to compute a test of
independence based on the chi-square statistic.
Significance level (%): Enter the value of the significance level for the test (default
value: 5%).
Filter factors: You can activate one of the two following options in order to reduce the number
of factors displayed:
Minimum %: Activate this option and then enter the minimum percentage that should
be reached to determine the number of factors to display.
Maximum number: Activate this option to set the maximum number of factors to take
into account when displaying the results.
Missing data tab:
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Replace missing data by 0: Activate this option if you consider that missing data are
equivalent to 0.
Replace missing data by their expected value: Activate this option if you want to replace the
missing data by the expected value, given by:

$$E(n_{ij}) = \dfrac{n_{i.}\,n_{.j}}{n}$$

where n_i. is the row sum, n_.j is the column sum, and n is the grand total of the table before
replacement of the missing data.
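The replacement rule is a one-liner; sketched here for clarity, with made-up numbers in the test:

```python
# Expected value used to replace a missing cell (i, j) of a contingency
# table: the product of the row sum and column sum divided by the grand
# total, all taken from the table before replacement.
def expected_cell(row_sum, col_sum, grand_total):
    return row_sum * col_sum / grand_total
```

This is the value the cell would take under independence of rows and columns, which makes the replacement neutral with respect to the association being measured.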
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to ignore the observations that contain missing
data.
Group missing values into a new category: Activate this option to group missing data into a
new category of the corresponding variable.
Outputs tab:
Descriptive statistics: Activate this option to display the descriptive statistics for the two
selected variables.
Disjunctive table: Activate this option to display the full disjunctive table that corresponds to
the qualitative variables.
Sort the categories alphabetically: Activate this option so that the categories of all the
variables are sorted alphabetically.
Inertia by cell: Activate this option to display the inertia for each cell of the contingency table.
Common options:
Eigenvalues: Activate this option to display the table and the scree plot of the eigenvalues.
Chi-square distances: Activate this option to display the chi-square distances between the
row points and between the column points.
Row and column profiles: Activate this option to display the row and column profiles.
Principal coordinates: Activate this option to display the principal coordinates of the row
points and the column points.
Standard coordinates: Activate this option to display the standard coordinates of the row
points and the column points.
Contributions: Activate this option to display the contributions of the row points and the
column points to the principal axes.
Squared cosines: Activate this option to display the squared cosines of the row points and the
column points to the principal axes.
Charts tab:
3D view of the contingency table / two-way table: Activate this option to display the 3D bar
chart corresponding to the contingency table or to the two-way table.
Symmetric plots: Activate this option to display the plots for which the row points and the
column points play a symmetrical role. These maps are based on the principal coordinates of
the row points and the column points.
Rows and columns: Activate this option to display a chart on which the row points and
the column points are displayed.
Rows: Activate this option to display a chart on which only the row points are displayed.
Columns: Activate this option to display a chart on which only the column points are
displayed.
Asymmetric plots: Activate this option to display the plots for which the row points and the
column points play an asymmetrical role. These plots use on the one hand the principal
coordinates and on the other hand the standard coordinates.
Rows: Activate this option to display a chart where the row points are displayed using
their principal coordinates, and the column points are displayed using their standard
coordinates.
Columns: Activate this option to display a chart where the row points are displayed
using their standard coordinates, and the column points are displayed using their
principal coordinates.
Labels: Activate this option to display the labels of the categories on the charts.
Colored labels: Activate this option to display the labels with the same color as the
corresponding points. If this option is not activated the labels are displayed in black.
Vectors: Activate this option to display the vectors for the standard coordinates on the
asymmetric charts.
Length factor: Activate this option to modulate the length of the vectors.
Results
Descriptive statistics: This table is displayed only if the input data correspond to an
observations/variables table.
Disjunctive table: This table is displayed only if the input data correspond to an
observations/variables table. It is an intermediate table that allows obtaining the
contingency table that corresponds to the two selected variables.
Contingency table: The contingency table is displayed at this stage. The 3D bar chart that
follows corresponds to the table.
Inertia by cell: This table displays the inertia that corresponds to each cell of the contingency
table.
Test of independence between rows and columns: This test allows us to determine if we
can reject the null hypothesis that rows and columns of the table are independent. A detailed
interpretation of this test is displayed below the table that summarizes the test statistic.
Eigenvalues and percentages of inertia: The eigenvalues and the corresponding scree plot
are displayed. Only the non-trivial eigenvalues are displayed. If a filtering has been requested
in the dialog box, it is not applied to this table, but only to the results that follow. Note: the sum
of the eigenvalues is equal to the total inertia. To each eigenvalue corresponds a principal axis
which accounts for a certain percentage of inertia. This allows us to measure the cumulative
percentage of inertia for a given set of dimensions.
A series of results is displayed afterwards, first for the row points, then for the column points:
Weights, distances and squared distances to the origin, inertias and relative inertias: This table
gives basic statistics for the points.
Chi-square distances: This table displays the chi-square distances between the profile points.
Principal coordinates: This table displays the principal coordinates which are used later to
represent projections of the profile points in symmetric and asymmetric plots.
Standard coordinates: This table displays the standard coordinates which are used later to
represent projections of unit profile points in asymmetric plots.
Contributions: The contributions are helpful for interpreting the plots. The categories that
have most influenced the calculation of the axes are those with the highest contributions. A
common approach consists of restricting the interpretation to the categories whose
contribution to a given axis is higher than the corresponding relative weight displayed in the
first column.
Squared cosines: As with other data analysis methods, the analysis of the squared cosines
allows us to avoid misinterpretations of the plots that are due to projection effects. If, for a
given category, the cosines are low on the axes of interest, then any interpretation of the
position of the category is hazardous.
The plots (or maps) are the ultimate goal of Correspondence Analysis, because they allow us
to considerably accelerate our understanding of the association patterns in the data table.
Symmetric plots: These plots are exclusively based on the principal coordinates. Depending
on the choices made in the dialog box, a symmetric plot mixing row points and column points,
a plot with only the row points, and a plot with only the column points, are displayed. The
percentage of inertia that corresponds to each axis and the percentage of inertia cumulated
over the two axes are displayed on the map.
Asymmetric plots: These plots use the principal coordinates for the rows and the standard
coordinates for the columns, or vice versa. The percentage of inertia that corresponds to each
axis and the percentage of inertia cumulated over the two axes are displayed on the map. In
an asymmetric row plot, one can study the way the row points are positioned relative to the
column vectors. The latter indicate directions: if two row points are displayed in the same
direction as a column vector, the row point that is the furthest in the column vector direction is
the one that is more associated with the columns.
Example
http://www.xlstat.com/demo-ca.htm
References
Benzécri J.P. (1969). Statistical analysis as a tool to make patterns emerge from data. In
Watanabe S. (ed.), Methodologies of Pattern Recognition. Academic Press, New York. pp 35-
60.
Benzécri J.P. (1973). L'Analyse des Données, Tome 2 : L'Analyse des Correspondances.
Dunod, Paris.
Benzécri J.P. (1992). Correspondence Analysis Handbook. Marcel Dekker, New York.
Greenacre M. J., Pardo R. (2006). Subset correspondence analysis: Visualizing relationships
among a selected set of response categories from a questionnaire survey. Sociological
Methods & Research, 35 (2), 193-218.
Lauro C., Balbi S. (1999). The analysis of structured qualitative data. Applied Stochastic
Models and Data Analysis. 15, 1-27.
Lauro N.C., D'Ambra L. (1984). Non-symmetrical correspondence analysis. In: Diday E. et al.
(eds.), Data Analysis and Informatics, III, North Holland, Amsterdam. 433-446.
Lebart L., Morineau A. & Piron M. (1997). Statistique Exploratoire Multidimensionnelle, 2ème
édition. Dunod, Paris. 67-107.
Saporta G. (1990). Probabilités, Analyse des Données et Statistique. Technip, Paris. 199-216.
Multiple Correspondence Analysis (MCA)
Use this tool to visualize the links between the categories of two or more qualitative variables.
Description
Multiple Correspondence Analysis (MCA) is a method that allows studying the association
between two or more qualitative variables. MCA is to qualitative variables what Principal
Component Analysis is to quantitative variables. One can obtain maps where it is possible to
visually observe the distances between the categories of the qualitative variables and between
the observations.
The generation of the disjunctive table is, in any case, a preliminary step of the MCA
computations. The p qualitative variables are broken down into p disjunctive tables Z1, Z2, …,
Zp, each composed of as many columns as there are categories in the corresponding variable.
Each time a category c of the jth variable corresponds to an observation i, the value of Zj(i,c) is
set to one. The other values of Zj are zero. The p disjunctive tables are concatenated into a full
disjunctive table.
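As an illustration of the coding described above, a full disjunctive table can be built in a few lines of Python (a sketch with made-up data, independent of XLSTAT):

```python
# Hypothetical data: 4 observations described by p = 2 qualitative variables.
observations = [
    {"color": "red",   "size": "S"},
    {"color": "blue",  "size": "M"},
    {"color": "red",   "size": "M"},
    {"color": "green", "size": "S"},
]
variables = ["color", "size"]

# List the categories of each variable, then code each observation as 0/1.
categories = {v: sorted({obs[v] for obs in observations}) for v in variables}
columns = [f"{v}-{c}" for v in variables for c in categories[v]]

# Z is the full disjunctive table: Z[i][k] = 1 iff observation i takes the
# category represented by column k.
Z = [[1 if obs[v] == c else 0 for v in variables for c in categories[v]]
     for obs in observations]

print(columns)
# Every row of the full disjunctive table sums to p, the number of variables.
assert all(sum(row) == len(variables) for row in Z)
```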
A series of transformations allows the computing of the coordinates of the categories of the
qualitative variables, as well as the coordinates of the observations in a representation space
that is optimal for a criterion based on inertia. In the case of MCA one can show that the total
inertia is equal to the average number of categories minus one. As a matter of fact, the inertia
does not only depend on the degree of association between the categories but is seriously
inflated. Greenacre (1993) suggested an adjusted version of inertia, inspired from Joint
Correspondence Analysis (JCA). This adjustment allows us to have higher and more
meaningful percentages for the maps.
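The effect of the adjustment can be sketched numerically. The formula below is Benzécri's correction, ((p/(p−1))·(λ − 1/p))² applied to the eigenvalues λ greater than 1/p; Greenacre's adjustment refines the denominator, and the exact variant XLSTAT uses is not detailed here, so treat this as illustrative:

```python
def adjusted_inertia_percentages(eigenvalues, p):
    """Benzécri-style adjusted inertia (illustrative; XLSTAT's exact variant
    may differ). eigenvalues: MCA eigenvalues; p: number of qualitative
    variables. Only the eigenvalues above 1/p are kept."""
    adj = [((p / (p - 1)) * (lam - 1.0 / p)) ** 2
           for lam in eigenvalues if lam > 1.0 / p]
    total = sum(adj)
    return [100.0 * a / total for a in adj]

# With p = 4 variables, the threshold 1/p = 0.25 filters out the last two
# axes, and the first axis gets a much larger, more meaningful percentage.
pcts = adjusted_inertia_percentages([0.45, 0.30, 0.20, 0.10], p=4)
print([round(x, 1) for x in pcts])   # → [94.1, 5.9]
```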
The analysis of a subset of categories is a method that has very recently been developed by
Greenacre and Pardo (2006). It allows us to concentrate the analysis on some categories only,
while still taking into account all the available information in the input table. XLSTAT allows you
to select the categories that belong to the subset.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
The first selection field lets you alternatively select two types of tables:
Observations/variables table: Select this option if your data correspond to a table with N
observations described by P qualitative variables. If the headers of the columns have also
been selected, make sure the Variable labels option is activated.
Disjunctive table: Select this option if your data correspond to a disjunctive table. If the
headers of the columns have also been selected, make sure the Variable labels option is
activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row contains the variable labels (case of an
observations/variables table) or the category labels (case of a disjunctive table).
Weights: Activate this option if you want to weight the observations. If you do not activate this
option, the weights are considered to be equal to 1. The weights must be greater than or equal to 0.
If the Variable labels option is activated make sure that the header of the selection has also
been selected.
Options tab:
Advanced analysis:
Supplementary data: If you select this option, the Supplementary data tab is
activated, and you can then modify the corresponding options.
Subset analysis: If you select this option, XLSTAT will ask you to select during the
computations the categories that belong to the subset to analyze.
Sort the categories alphabetically: Activate this option so that the categories of all the
variables are sorted alphabetically.
Variable-Category labels: Activate this option to use variable-category labels when displaying
outputs. Variable-Category labels include the variable name as a prefix and the category name
as a suffix.
Filter factors: You can activate one of the three following options in order to reduce the
number of factors displayed:
Minimum %: Activate this option and then enter the minimum percentage that should
be reached to determine the number of factors to display.
Maximum number: Activate this option to set the maximum number of factors to take
into account when displaying the results.
1/p: Activate this option to only take into account the factors whose eigenvalue is greater
than 1/p, where p is the number of variables. This is the default option.
Supplementary observations: Activate this option if you want to compute the coordinates
and to display supplementary observations. These observations are not taken into account for
the first phase of the computations. They are passive observations. Several methods are
available to identify the supplementary observations:
Random: The observations are randomly selected. The Number of observations N
must then be specified.
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use as supplementary observations.
Supplementary variables: Activate this option if you want to compute a posteriori the
coordinates of variables that are not taken into account for the computing of the principal axes
(passive variables, as opposed to active variables).
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to ignore the observations that contain missing
data.
Group missing values into a new category: Activate this option to group missing data into a
new category of the corresponding variable.
Replace missing data: Activate this option to replace missing data. When a missing value
corresponds to a quantitative supplementary variable, it is replaced by the mean of the
variable. When a missing value corresponds to a qualitative variable of the initial table (active
variables) or to a qualitative supplementary variable (passive variable), a new Missing
category is created for the variable.
Outputs tab:
Descriptive statistics: Activate this option to display the descriptive statistics for the selected
variables.
Disjunctive table: Activate this option to display the full disjunctive table that corresponds to
the selected qualitative variables.
Observations: Activate this option to display the results that concern the observations.
Variables: Activate this option to display the results that concern the variables.
Eigenvalues: Activate this option to display the table and the scree plot of the eigenvalues.
Test values: Activate this option to display the test values for the variables.
Significance level (%): Enter the significance level used to determine if the test values
are significant or not.
Charts tab:
3D view of the Burt table: Activate this option to display a 3D visualization of the Burt table.
Symmetric plots: Activate this option to display the symmetric observations and variables
plots.
Observations and variables: Activate this option to display a plot that shows both the
observations and variables.
Observations: Activate this option to display a plot that shows only the observations.
Variables: Activate this option to display a plot that shows only the variables.
Asymmetric plots: Activate this option to display plots for which observations and variables
play an asymmetrical role. These plots are based on the principal coordinates for the
observations and the standard coordinates for the variables.
Variables: Activate this option to display an asymmetric plot where the variables are
displayed using their principal coordinates, and where the observations are displayed
using their standard coordinates.
Labels: Activate this option to display the labels of the categories on the charts.
Colored labels: Activate this option to display the labels with the same color as the
corresponding points. If this option is not activated the labels are displayed in black.
Vectors: Activate this option to display the vectors for the standard coordinates on the
asymmetric charts.
Length factor: Activate this option to modulate the length of the vectors.
N first rows: The N first observations are displayed on the chart. The Number of
observations N to display must then be specified.
N last rows: The N last observations are displayed on the chart. The Number of
observations N to display must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to display.
This dialog is displayed if you selected the Advanced analysis / Subset analysis option in
the MCA dialog box.
: Click this button to start the computations.
The list of categories that corresponds to the complete set of active qualitative variables is
displayed so that you can select the subset of categories on which the analysis will be focused.
Results
Descriptive statistics: This table is displayed only if the input data correspond to an
observations/variables table.
Disjunctive table: This table is displayed only if the input data correspond to an
observations/variables table. This intermediary table is the basis of the subsequent MCA
computations.
Burt table: The Burt table is displayed only if the corresponding option is activated in the
dialog box. The 3D bar chart that follows is the graphical visualization of this table.
Eigenvalues and percentages of inertia: The eigenvalues, the percentages of inertia, the
percentages of adjusted inertia and the corresponding scree plot are displayed. Only the non-
trivial eigenvalues are displayed. If a filtering has been requested in the dialog box, it is not
applied to this table, but only to the results that follow.
A series of results is displayed afterwards, first for the variables, then for the observations:
Principal coordinates: This table displays the principal coordinates which are used later to
represent projections of profile points in symmetric and asymmetric plots.
Standard coordinates: This table displays the standard coordinates which are used later to
represent projections of unit profile points in asymmetric plots.
Contributions: The contributions are helpful for interpreting the plots. The categories that
have most influenced the computation of the axes are those with the highest contributions. A
shortcut consists of restricting the analysis to the categories whose contribution on a given
axis is higher than the corresponding relative weight displayed in the first column.
Squared cosines: As with other data analysis methods, the analysis of the squared cosines
allows us to avoid misinterpretations of the plots that are due to projection effects. If, for a
given category, the cosines are low on the axes of interest, then any interpretation of the
position of the category is hazardous.
The plots (or maps) are the ultimate goal of Multiple Correspondence Analysis, because they
considerably facilitate our interpretation of the data.
Symmetric plots: These plots are exclusively based on the principal coordinates. Depending
on the choices made in the dialog box, a symmetric plot mixing observations and variables, a
plot showing only the categories of the variables, and a plot showing only the observations, are
displayed. The percentage of adjusted inertia that corresponds to each axis and the
percentage of adjusted inertia cumulated over the two axes are displayed on the map.
Asymmetric plots: These plots use the principal coordinates for the categories of the
variables and the standard coordinates for the observations and vice versa. The percentage of
adjusted inertia that corresponds to each axis and the percentage of adjusted inertia
cumulated over the two axes are displayed on the map. On an asymmetric observations plot,
one can study how the observations are positioned relative to the category vectors. The
latter indicate directions: if two observations are displayed in the same direction as a category
vector, the observation that is furthest in the category vector direction is the more likely to have
selected that category of response.
Example
http://www.xlstat.com/demo-mca.htm
References
Greenacre M. J., Pardo R. (2006). Multiple correspondence analysis of subsets of response
categories. In Multiple Correspondence Analysis and Related Methods (eds Michael
Greenacre & Jörg Blasius), Chapman & Hall/CRC, London, 197-217.
Saporta G. (1990). Probabilités, Analyse des Données et Statistique. Technip, Paris. 217-239.
Multidimensional Scaling (MDS)
Use multidimensional scaling to represent in a two- or three-dimensional space the
observations for which only a proximity matrix (similarity or dissimilarity) is available.
Description
This example is only intended to demonstrate the performance of the method and to give a
general understanding of how it is used. In practice, MDS is often used in psychometrics
(perception analysis) and in marketing (distances between products obtained from consumer
classifications), but there are applications in a large number of domains.
If the starting matrix is a similarity matrix (a similarity is greater the nearer the objects are), it
will automatically be converted into a dissimilarity matrix for the calculations. The conversion is
carried out by subtracting the matrix data from the value of the diagonal.
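This conversion can be sketched in a couple of lines (assuming, as the rule above implies, a constant diagonal equal to the maximum self-similarity):

```python
def similarity_to_dissimilarity(S):
    """Convert a similarity matrix into a dissimilarity matrix by
    subtracting each entry from the diagonal value (assumed constant)."""
    diag = S[0][0]
    return [[diag - s for s in row] for row in S]

S = [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.5],
     [0.3, 0.5, 1.0]]
D = similarity_to_dissimilarity(S)
# The diagonal becomes 0, and the least similar pair of objects (1 and 3)
# becomes the most dissimilar pair.
print(D)
```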
There are two types of MDS depending on the nature of the dissimilarity observed:
Metric MDS: The dissimilarities are considered as continuous and giving exact information
to be reproduced as closely as possible. There are a number of sub-models:
using a near 2nd-degree polynomial relationship (the polynomial relationship
being identical for all pairs of distances).
Note: the absolute model is used to compare distances in the representation space with those
in the initial space. The other models have the advantage of speeding up the calculations.
Non-metric MDS: with this type of MDS, only the order of the dissimilarities counts. In other
words, the MDS algorithm does not have to reproduce the dissimilarities themselves but only
their order. Two models are available:
Ordinal (1): the order of the distances in the representation space must
correspond to the order of the corresponding dissimilarities. If there are two
dissimilarities of the same rank, then there are no restrictions on the
corresponding distances. In other words, dissimilarities of the same rank need
not necessarily give equal distances in the representation space.
Ordinal (2): the order of the distances in the representation space must
correspond to the order of the corresponding dissimilarities. If dissimilarities
of the same rank exist, the corresponding distances must be equal.
The MDS algorithms aim to reduce the difference between the disparity matrix from the models
and the distance matrix obtained in the representation configuration. For the absolute model,
the disparity is equal to the dissimilarity of the starting matrix. The difference is measured
through the Stress, several variations of which have been proposed:
Raw Stress:

σr = Σi<j wij (Dij − dij)²

where Dij is the disparity between individuals i and j, dij is the Euclidean
distance on the representation for the same individuals, and wij is the weight
of the (i,j) proximity (1 by default).

Normalized Stress:

σn = Σi<j wij (Dij − dij)² / Σi<j wij Dij²

Kruskal's stress 1:

σ1 = √( Σi<j wij (Dij − dij)² / Σi<j wij dij² )

Kruskal's stress 2:

σ2 = √( Σi<j wij (Dij − dij)² / Σi<j wij (dij − d̄)² )

where d̄ is the mean of the distances dij.
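The stress variants can be computed directly once the disparities Dij, the configuration distances dij and the weights wij are known for each pair i < j (a sketch with flat lists over the pairs; the unweighted mean distance used in Kruskal's stress 2 is an assumption about the exact formula):

```python
from math import sqrt

def stresses(D, d, w):
    """D: disparities, d: configuration distances, w: weights,
    all given as flat lists over the pairs i < j."""
    num = sum(wij * (Dij - dij) ** 2 for Dij, dij, wij in zip(D, d, w))
    raw = num
    normalized = num / sum(wij * Dij ** 2 for Dij, wij in zip(D, w))
    kruskal1 = sqrt(num / sum(wij * dij ** 2 for dij, wij in zip(d, w)))
    dbar = sum(d) / len(d)                  # mean configuration distance
    kruskal2 = sqrt(num / sum(wij * (dij - dbar) ** 2
                              for dij, wij in zip(d, w)))
    return raw, normalized, kruskal1, kruskal2

# A perfect configuration (d == D) yields zero stress for every variant.
print(stresses([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
# → (0.0, 0.0, 0.0, 0.0)
```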
Note: for a given number of dimensions, the weaker the stress, the better the quality of the
representation. Furthermore, the higher the number of dimensions, the weaker the stress.
To find out whether the result obtained is satisfactory and to determine which is the correct
number of dimensions needed to give a faithful representation of the data, the evolution in the
stress with the number of dimensions and the point from which the stress stabilizes may be
observed. The Shepard diagram is used to observe any ruptures in the ordination of the
distances. The more the chart looks linear, the better the representation. For the absolute
model, for an ideal representation, the points must be aligned along the first bisector.
There are several MDS algorithms including, in particular, ALSCAL (Takane et al. 1977) and
SMACOF (Scaling by MAjorizing a COnvex Function) which minimizes the "Normalized
Stress" (de Leeuw, 1977). XLSTAT uses the SMACOF algorithm.
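A minimal, pure-Python sketch of the SMACOF idea (the Guttman transform iterated from a random start) may help fix ideas; it is illustrative only and not XLSTAT's actual implementation:

```python
import random
from math import dist, sqrt

def smacof(D, n_dims=2, n_iter=300, seed=0):
    """D: symmetric dissimilarity matrix (list of lists). Returns a
    configuration of points with (locally) minimal raw stress."""
    n = len(D)
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_dims)] for _ in range(n)]
    for _ in range(n_iter):
        d = [[dist(X[i], X[j]) for j in range(n)] for i in range(n)]
        # Guttman transform: X_new_i = (1/n) * sum_j (D_ij/d_ij) (X_i - X_j)
        X = [[sum(D[i][j] / d[i][j] * (X[i][k] - X[j][k])
                  for j in range(n) if j != i and d[i][j] > 0) / n
              for k in range(n_dims)]
             for i in range(n)]
    return X

# Three points matching the corners (0,0), (1,0), (0,1) embed exactly in 2D,
# so the fitted distances converge towards the input dissimilarities.
D = [[0.0, 1.0, sqrt(2)], [1.0, 0.0, 1.0], [sqrt(2), 1.0, 0.0]]
X = smacof(D)
print(round(dist(X[0], X[2]), 2))   # close to sqrt(2) ≈ 1.41
```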
Dialog box
The dialog box is made up of several tabs corresponding to the various options for controlling
the calculations and displaying the results. A description of the various components of the
dialog box are given below.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT allows you to select data by columns or by range. If the arrow points to the
right, XLSTAT allows you to select data by rows or by range.
General tab:
The main data entry field is used to select one of two types of table:
Data: Select a similarity or dissimilarity matrix. If only the lower or upper triangle is available,
the table is accepted. If differences are detected between the lower and upper parts of the
selected matrix, XLSTAT warns you and offers to change the data (by calculating the average
of the two parts) to continue with the calculations.
Dissimilarities / Similarities: Choose the option that corresponds to the type of your data.
Model: Select the model to be used. See the description section for more details.
Dimensions: Enter the minimum and maximum number of dimensions for the object
representation space. The algorithm will be repeated for all dimensions between the two
boundaries.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet in the active workbook.
Labels included: Activate this option if you have included row and column labels in the
selection.
Weights: Activate this option if the data are weighted. You then select a weighting matrix
(without selecting labels for rows and columns). If you do not activate this option, the weights
will be considered as 1. Weights must be greater than or equal to 0.
Options tab:
Stress: Choose the type of stress to be used for returning the results, given that the SMACOF
algorithm minimizes the raw stress. See the description section for more details.
Initial configuration:
Random: Activate this option to make XLSTAT generate the starting configuration
randomly. Then enter the number of times the algorithm is to be repeated from a new
randomly-generated configuration. The default value for the number of repetitions is
100. Note: the configuration displayed in the results is the repetition for which the best
result was found.
User defined: Activate this option to select an initial configuration which the algorithm
will use as a basis for carrying out optimization.
Stop conditions:
Iterations: Enter the maximum number of iterations for the SMACOF algorithm. The stress
optimization is stopped when the maximum number of iterations has been exceeded.
Default value: 100.
Convergence: Enter the minimum improvement in stress from one iteration to the next
below which the algorithm is considered to have converged. Default value: 0.00001.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Ignore missing data: If you activate this option, XLSTAT does not include proximities
corresponding to missing data when minimizing stress.
Outputs tab:
Distances: Activate this option to display the matrix of Euclidean distances corresponding to
the optimum configuration.
Disparities: Activate this option to display the disparity matrix corresponding to the optimum
configuration.
Residual distances: Activate this option to display the matrix of residual distances
corresponding to the difference between the distance matrix and the disparity matrix.
Charts tab:
Evolution of stress: Activate this option to display the stress evolution chart according to the
number of dimensions in the configuration.
Configuration: Activate this option to display the configuration representation chart. This chart
is only displayed for the configuration in a two-dimensional space if this has been calculated.
Labels: Activate this option if you want object labels to be displayed.
Colored labels: Activate this option to show labels in the same color as the points.
Results
Stress after minimization: This table shows the final stress obtained, the number of iterations
required and the level of convergence reached for the dimensions considered. Where multiple
dimensions were considered, a chart is displayed showing the stress evolution as a function of
the number of dimensions.
The results which follow are displayed for each of the dimensions considered.
Configuration: This table shows the coordinates of objects in the representation space. If this
is a two-dimensional space, a graphic representation of the configuration is provided. If you
have XLSTAT-3DPlot, you can also display a three-dimensional configuration.
Distances measured in the representation space: This table shows the distances between
objects in the representation space.
Disparities computed using the model: This table shows the disparities calculated according
to the model chosen (absolute, interval, etc.).
Residual distances: These distances are the difference between the dissimilarities of the
starting matrix and the distances measured in the representation space.
Comparative table: This table is used to compare dissimilarities, disparities and distances and
the ranks of these three measurements for all paired combinations of objects.
Shepard diagram: This chart compares the disparities and the distances to the dissimilarities.
For a metric model, the representation is better the more the points are aligned with the first
bisector of the plane. For a non-metric model, the model is better the more regularly the line of
dissimilarities/disparities increases. Furthermore, the performance of the model can be
evaluated by observing if the (dissimilarity/distance) points are near to the
(dissimilarity/disparity) points.
Example
http://www.xlstat.com/demo-mds.htm
References
Borg I. and Groenen P. (1997). Modern Multidimensional Scaling. Theory and applications.
Springer Verlag, New York.
Cox T.C. and Cox M.A.A. (2001). Multidimensional Scaling (2nd edition). Chapman and Hall,
New York.
Heiser W.J. (1991). A general majorization method for least squares multidimensional scaling
of pseudodistances that may be negative. Psychometrika, 56,1, 7-27.
k-means clustering
Use k-means clustering to make up homogeneous groups of objects (classes) on the basis of
their description by a set of quantitative variables.
Description
k-means clustering was introduced by MacQueen in 1967. Other similar algorithms had been
developed by Forgey (1965) (moving centers) and Friedman (1967).
An object may be assigned to a class during one iteration then change class in the following
iteration, which is not possible with Agglomerative Hierarchical Clustering for which
assignment is irreversible.
By multiplying the starting points and the repetitions, several solutions may be explored.
The disadvantage of this method is that it does not give a consistent number of classes or
enable the proximity between classes or objects to be determined.
Note: if you want to take qualitative variables into account in the clustering, you must first
perform a Multiple Correspondence Analysis (MCA) and consider the resulting coordinates of
the observations on the factorial axes as new variables.
For the first iteration, a starting point is chosen which consists in associating the center of the k
classes with k objects (either taken at random or not). Afterwards the distance between the
objects and the k centers is calculated and the objects are assigned to the centers they are
nearest to. Then the centers are redefined from the objects assigned to the various classes.
The objects are then reassigned depending on their distances from the new centers. And so
on until convergence is reached.
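The loop described above can be sketched in a few lines of pure Python (Euclidean k-means with random initial centers; illustrative only, not XLSTAT's implementation):

```python
import random
from math import dist

def kmeans(points, k, n_iter=100, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # k objects as centers
    labels = []
    for _ in range(n_iter):
        # Assignment step: each object joins the class of its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its class.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(xs) / len(members)
                                    for xs in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty class's center
        if new_centers == centers:              # convergence reached
            break
        centers = new_centers
    return labels, centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels, centers = kmeans(pts, k=2)
print(labels)   # the two nearby pairs end up in two distinct classes
```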
Classification criteria
Several classification criteria may be used to reach a solution. XLSTAT offers four criteria to be
minimized.
Trace(W): The trace of W, the pooled within-class SSCP (sums of squares and cross-products)
matrix, is the most traditional criterion. Minimizing the trace of W for a given number of classes
amounts to minimizing the total within-class variance, in other words minimizing the
heterogeneity of the groups. This criterion is sensitive to scale effects. In order to avoid giving
more weight to certain variables than to others, the data must be normalized beforehand.
Moreover, this criterion tends to produce classes of the same size.
Wilks' lambda: The results given by minimizing this criterion are identical to those given by
minimizing the determinant of W. The Wilks' lambda criterion corresponds to dividing
determinant(W) by determinant(T), where T is the total inertia matrix. Dividing by the
determinant of T always gives a criterion between 0 and 1.
Trace(W) / Median: If this criterion is chosen, the class centroid is not the mean point of the
class but the median point which corresponds to an object of the class. The use of this criterion
gives rise to longer calculations.
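For the Trace(W) criterion, a small numerical check makes the idea concrete (a sketch with made-up points; the well-separated partition gives a much smaller pooled within-class sum of squares):

```python
def trace_w(points, labels, k):
    """Pooled within-class sum of squares, i.e. Trace(W)."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        center = [sum(xs) / len(members) for xs in zip(*members)]
        total += sum(sum((xi - ci) ** 2 for xi, ci in zip(p, center))
                     for p in members)
    return total

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
# Grouping the two left points together beats mixing left and right points:
print(trace_w(pts, [0, 0, 1, 1], k=2))   # → 4.0
print(trace_w(pts, [0, 1, 0, 1], k=2))   # → 100.0
```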
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Observations/variables table: Select a table comprising N objects described by P
descriptors. If column headers have been selected, check that the "Variable labels" option has
been activated.
Column weights: Activate this option if the columns are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Column labels" option is activated.
Row weights: Activate this option if the rows are weighted. If you do not activate this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Column labels" option is activated.
Classification criterion: Choose the classification criterion (see the description section for
more details).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column labels: Activate this option if the first row of the data selections
(Observations/variables table, row labels, row weights, column weights) contains a label.
Row labels: Activate this option if observation labels are available. Then select the
corresponding data. If the Column labels option is activated you need to include a header in
the selection. If this option is not activated, the observation labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Options tab:
Cluster rows: Activate this option if you want to create classes of objects in rows described
by descriptors in columns.
Cluster columns: Activate this option if you want to create classes of objects in columns
described by descriptors in rows.
Center: Activate this option if you want to center the data before starting the calculations.
Reduce: Activate this option if you want to reduce the data before starting the calculations.
You can then select whether you want to apply the transformation to the rows or the columns.
Stop conditions:
Iterations: Enter the maximum number of iterations for the k-means algorithm. The
calculations are stopped when the maximum number of iterations has been exceeded.
Default value: 500.
Convergence: Enter the minimum improvement in the chosen criterion from one iteration
to the next below which the algorithm is considered to have converged. Default value:
0.00001.
Initial partition: Use these options to choose the way the first partition is chosen, in other
words, the way objects are assigned to classes in the first iteration of the clustering algorithm.
N classes by data order: Objects are assigned to classes depending on their order.
Defined by centers: The user has to select the k centers corresponding to the k
classes. The number of rows must be equal to the number of classes and the number of
columns equal to the number of columns in the data table. If the Column labels option
is activated you need to include a header in the selection.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Results in the original space: Activate this option to display the results in the original space.
If the center/reduce options are activated and this option is not activated, the results are
provided in the standardized space.
Centroids: Activate this option to display the table of centroids of the classes.
Central objects: Activate this option to display the coordinates of the nearest object to the
centroid for each class.
Results by class: Activate this option to display a table giving the statistics and the objects for
each of the classes.
Results by object: Activate this option to display a table giving the class each object is
assigned to in the initial object order.
Charts tab:
Evolution of the criterion: Activate this option for the evolution chart of the chosen criterion.
Results
Summary statistics: This table displays for the descriptors of the objects, the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation.
Optimization summary: This table shows the evolution of the within-class variance. If several
repetitions have been requested, the results for each repetition are displayed.
Statistics for each iteration: Activate this option to see the evolution of miscellaneous
statistics calculated as the iterations for the repetition proceed, given the optimum result for the
chosen criterion. If the corresponding option is activated in the Charts tab, a chart showing the
evolution of the chosen criterion as the iterations proceed is displayed.
Note: if the values are standardized (option in the Options tab), the results for the optimization
summary and the statistics for each iteration are calculated in the standardized space. On the
other hand, the following results are displayed in the original space if the "Results in the
original space" option is activated.
Variance decomposition for the optimal classification: This table shows the within-class
variance, the inter-class variance and the total variance.
Class centroids: This table shows the class centroids for the various descriptors.
Distance between the class centroids: This table shows the Euclidean distances between
the class centroids for the various descriptors.
Central objects: This table shows the coordinates of the nearest object to the centroid for
each class.
Distance between the central objects: This table shows the Euclidean distances between
the class central objects for the various descriptors.
Results by class: The descriptive statistics for the classes (number of objects, sum of
weights, within-class variance, minimum distance to the centroid, maximum distance to the
centroid, mean distance to the centroid) are displayed in the first part of the table. The second
part shows the objects.
Results by object: This table shows the assignment class for each object in the initial object
order.
Example
http://www.xlstat.com/demo-cluster2.htm
References
Arabie P., Hubert L.J. and De Soete G. (1996). Clustering and Classification. World Scientific,
Singapore.
Everitt B.S., Landau S. and Leese M. (2001). Cluster analysis (4th edition). Arnold, London.
Forgey E. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of
classifications. Biometrics, 21, 768.
Friedman H.P. and Rubin J. (1967). On some invariant criteria for grouping data. Journal of
the American Statistical Association, 62, 1159-1178.
Jobson J.D. (1992). Applied Multivariate Data Analysis. Volume II: Categorical and
Multivariate Methods. Springer-Verlag, New York, 483-568.
MacQueen J. (1967). Some methods for classification and analysis of multivariate observations.
In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability,
281-297.
Saporta G. (1990). Probabilités, Analyse des Données et Statistique. Technip, Paris, 251-260.
Agglomerative Hierarchical Clustering (AHC)
Use Agglomerative Hierarchical Clustering to make up homogeneous groups of objects
(classes) on the basis of their description by a set of variables, or from a matrix describing the
similarity or dissimilarity between the objects.
Description
Agglomerative Hierarchical Clustering (AHC) is a classification method which has the following
advantages:
You work from the dissimilarities between the objects to be grouped together. A type of
dissimilarity can be chosen which is suited to the subject studied and the nature of the data.
One of the results is the dendrogram which shows the progressive grouping of the data. It is
then possible to gain an idea of a suitable number of classes into which the data can be
grouped.
The disadvantage of this method is that it is slow. Furthermore, the dendrogram can become
unreadable when there are too many objects.
Principle of AHC
The process starts by calculating the dissimilarity between the N objects. Then two objects
which when clustered together minimize a given agglomeration criterion, are clustered together
thus creating a class comprising these two objects. Then the dissimilarity between this class
and the N-2 other objects is calculated using the agglomeration criterion. The two objects or
classes of objects whose clustering together minimizes the agglomeration criterion are then
clustered together. This process continues until all the objects have been clustered.
These successive clustering operations produce a binary clustering tree (dendrogram), whose
root is the class that contains all the observations. This dendrogram represents a hierarchy of
partitions.
It is then possible to choose a partition by truncating the tree at a given level, the level
depending upon either user-defined constraints (the user knows how many classes are to be
obtained) or more objective criteria.
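The procedure described above can be sketched in a few lines (an illustrative toy implementation, assuming complete linkage as the agglomeration criterion and a made-up dissimilarity matrix; XLSTAT's actual implementation is more elaborate):

```python
# Naive sketch of agglomerative hierarchical clustering: repeatedly merge
# the two clusters whose fusion minimizes the agglomeration criterion.
# Here the criterion is complete linkage (largest pairwise dissimilarity
# between members of the two clusters); the dissimilarity matrix is made up.

d = {  # symmetric dissimilarities between 4 objects, keyed by (i, j), i < j
    (0, 1): 1.0, (0, 2): 4.0, (0, 3): 5.0,
    (1, 2): 3.5, (1, 3): 4.5, (2, 3): 0.5,
}

def diss(i, j):
    return d[(min(i, j), max(i, j))]

def linkage(a, b):
    # complete linkage: dissimilarity between the least similar member pair
    return max(diss(i, j) for i in a for j in b)

clusters = [frozenset([i]) for i in range(4)]
merges = []  # successive dendrogram nodes: (members, agglomeration level)
while len(clusters) > 1:
    pairs = [(linkage(a, b), a, b)
             for n, a in enumerate(clusters) for b in clusters[n + 1:]]
    level, a, b = min(pairs, key=lambda p: p[0])
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    merges.append((sorted(a | b), level))

print(merges)
```

The list of merges, read in order, is exactly the binary clustering tree: each entry is a node of the dendrogram with its agglomeration level.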
Similarities and dissimilarities
The proximity between two objects is measured by evaluating how similar (similarity) or
dissimilar (dissimilarity) they are. If the user chooses a similarity, XLSTAT converts it into
a dissimilarity, as the AHC algorithm uses dissimilarities. The conversion consists, for each
object pair, in subtracting the similarity of the pair from the maximum similarity found over
all pairs.
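The conversion can be sketched as follows (illustrative only; the similarity matrix is a made-up example):

```python
# Convert a similarity matrix into a dissimilarity matrix as described above:
# d(i, j) = (maximum similarity over all pairs) - s(i, j).
# The similarity matrix below is a made-up example.

s = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]

s_max = max(max(row) for row in s)                  # largest similarity found
d = [[s_max - sij for sij in row] for row in s]     # elementwise conversion

print(d)  # the most similar pairs now have the smallest dissimilarities
```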
The similarity coefficients proposed are as follows: Cooccurrence, Cosine, Covariance (n-1),
Covariance (n), Dice coefficient (also known as Sorensen coefficient), General similarity,
Gower coefficient, Inertia, Jaccard coefficient, Kendall correlation coefficient, Kulczinski
coefficient, Ochiai coefficient, Pearson's correlation coefficient, Pearson's Phi, Percent
agreement, Rogers & Tanimoto coefficient, Sokal & Michener coefficient (or simple matching
coefficient), Sokal & Sneath coefficient (1), Sokal & Sneath coefficient (2), Spearman
correlation coefficient.
The dissimilarity coefficients proposed are as follows: Bhattacharya's distance, Bray and
Curtis' distance, Canberra's distance, Chebychev's distance, Chi² distance, Chi² metric, Chord
distance, Squared chord distance, Dice coefficient, Euclidean distance, Geodesic distance,
Jaccard coefficient, Kendall dissimilarity, Kulczinski coefficient, Mahalanobis distance,
Manhattan distance, Ochiai coefficient, Pearson's dissimilarity, Pearson's Phi, General
dissimilarity, Rogers & Tanimoto coefficient, Sokal & Michener's coefficient, Sokal & Sneath's
coefficient (1), Sokal & Sneath's coefficient (2), Spearman dissimilarity.
Note: some of the abovementioned coefficients should be used with binary data only. If the
data are not binary, XLSTAT asks you if it should automatically transform the data into binary
data.
Agglomeration methods
To calculate the dissimilarity between two groups of objects A and B, different strategies are
possible. XLSTAT offers the following methods:
Simple linkage: The dissimilarity between A and B is the dissimilarity between the object of A
and the object of B that are the most similar. Agglomeration using simple linkage tends to
contract the data space and to flatten the levels of each step in the dendrogram. As the
dissimilarity between just two elements of A and B is sufficient to link A and B, this criterion
can lead to very long clusters (chaining effect) that are not homogeneous.
Complete linkage: The dissimilarity between A and B is the largest dissimilarity between an
object of A and an object of B. Agglomeration using complete linkage tends to dilate the data
space and to produce compact clusters.
Unweighted pair-group average linkage: The dissimilarity between A and B is the average
of the dissimilarities between the objects of A and the objects of B. Agglomeration using
Unweighted pair-group average linkage is a good compromise between the two preceding
criteria, and provides a fair representation of the data space properties.
Weighted pair-group average linkage: The average dissimilarity between the objects of A
and of B is calculated as the sum of the weighted dissimilarities, so that equal weights are
assigned to both groups. As with unweighted pair-group average linkage, this criterion
provides a fairly good representation of the data space properties.
Flexible linkage: This criterion uses a parameter β that varies within [-1, +1]; this can
generate a family of agglomeration criteria. For β = 0 the criterion is weighted pair-group
average linkage. When β is close to 1, chain-like clusters result, but as β decreases and
becomes negative, more and more dilatation is obtained.
Ward's method: This method aggregates two groups so that within-group inertia increases as
little as possible, to keep the clusters homogeneous. This criterion, proposed by Ward (1963),
can only be used in cases with quadratic distances, i.e. cases of Euclidean distance and Chi-
square distance.
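The first three criteria can be illustrated as follows (a sketch using a made-up one-dimensional example with absolute differences as the dissimilarity; any dissimilarity measure could be substituted):

```python
# Simple (single), complete and unweighted pair-group average linkage between
# two groups A and B, computed from pairwise dissimilarities.
# The 1-D coordinates and the absolute-difference dissimilarity are made-up
# examples chosen so the three criteria give visibly different values.

A = [1.0, 2.0]        # objects of group A
B = [5.0, 6.0, 9.0]   # objects of group B

def diss(x, y):
    return abs(x - y)

pairs = [diss(a, b) for a in A for b in B]

single = min(pairs)                # most similar pair links the groups
complete = max(pairs)              # least similar pair links the groups
average = sum(pairs) / len(pairs)  # mean of all between-group dissimilarities

print(single, complete, average)
```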
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Observations/variables table: Select the data, either a table comprising N objects described
by P descriptors, or a proximity matrix between N objects. If column headers
have been selected, check that the "Variable labels" option has been activated. For a proximity
matrix, if column labels have been selected, row labels must also be selected.
Proximity type: similarities / dissimilarities: Choose the proximity type to be used. The data
type and proximity type determine the list of possible indexes for calculating the proximity
matrix.
Agglomeration method: Choose the agglomeration method (see the description section for
more details).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet in the active workbook.
Column labels: Activate this option if the first row of the data selections
(Observations/variables table, row labels, row weights, column weights) contains a label.
Where the selection is a proximity matrix, if this option is activated, the first column must also
include the object labels.
Row labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Column labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Column weights: Activate this option if the columns are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Column labels" option is activated.
Row weights: Activate this option if the rows are weighted. If you do not activate this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Column labels" option is activated.
Options tab:
Cluster rows: Activate this option if you want to create classes of objects in rows described
by data in columns.
Cluster columns: Activate this option if you want to create classes of objects in columns
described by data in rows.
Center: Activate this option if you want to center the data before starting the calculations.
Reduce: Activate this option if you want to reduce the data before starting the calculations.
You can then select whether you want to apply the transformation on the rows or the columns.
Truncation: Activate this option if you want XLSTAT to automatically define the truncation
level, and therefore the number of classes to retain, or if you want to define the number of
classes to create, or the level at which the dendrogram is to be truncated.
Within-class variances: Activate this option to select the within-class variances. This option is
only active if object weights have been selected (row weights if you are clustering rows,
column weights if you are clustering columns). This option can be used if you previously
clustered the objects using another method (k-means for example) and want to use a method
such as unweighted pair group averages to cluster the groups previously obtained. If a column
header has been selected, check that the "Column labels" option is activated.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Node statistics: Activate this option to display the statistics for dendrogram nodes.
Centroids: Activate this option to display the table of centroids of the classes.
Central objects: Activate this option to display the coordinates of the nearest object to the
centroid for each class.
Results by class: Activate this option to display a table giving the statistics and the objects for
each of the classes.
Results by object: Activate this option to display a table giving the class each object is
assigned to in the initial object order.
Charts tab:
Levels bar chart: Activate this option to display the diagram of levels showing the impact of
successive clusterings.
Full: Activate this option to display the full dendrogram (all objects are represented).
Truncated: Activate this option to display the truncated dendrogram (the dendrogram
starts at the level of the truncation).
Labels: Activate this option to display object labels (full dendrogram) or classes
(truncated dendrogram) on the dendrogram.
Colors: Activate this option to use colors to represent the different groups on the full
dendrogram.
Results
Summary statistics: For each descriptor of the objects, this table displays the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation.
Node statistics: This table shows the data for the successive nodes in the dendrogram. The
first node has an index which is the number of objects increased by 1. Hence it is easy to see
at any time if an object or group of objects is clustered with another object or group of objects
at the level of a new node in the dendrogram.
Levels bar chart: This table displays the statistics for dendrogram nodes.
Dendrograms: The full dendrogram displays the progressive clustering of objects. If truncation
has been requested, a broken line marks the level at which the truncation has been carried out. The
truncated dendrogram shows the classes after truncation.
Class centroids: This table shows the class centroids for the various descriptors.
Distance between the class centroids: This table shows the Euclidean distances between
the class centroids for the various descriptors.
Central objects: This table shows the coordinates of the nearest object to the centroid for
each class.
Distance between the central objects: This table shows the Euclidean distances between
the class central objects for the various descriptors.
Results by class: The descriptive statistics for the classes (number of objects, sum of
weights, within-class variance, minimum distance to the centroid, maximum distance to the
centroid, mean distance to the centroid) are displayed in the first part of the table. The second
part shows the objects.
Results by object: This table shows the assignment class for each object in the initial object
order.
Example
http://www.xlstat.com/demo-cluster.htm
References
Arabie P., Hubert L.J. and De Soete G. (1996). Clustering and Classification. World Scientific,
Singapore.
Everitt B.S., Landau S. and Leese M. (2001). Cluster analysis (4th edition). Arnold, London.
Jobson J.D. (1992). Applied Multivariate Data Analysis. Volume II: Categorical and
Multivariate Methods. Springer-Verlag, New York, 483-568.
Legendre P. and Legendre L. (1998). Numerical Ecology. Second English Edition. Elsevier,
Amsterdam, 403-406.
Saporta G. (1990). Probabilités, Analyse des Données et Statistique. Technip, Paris, 251-260.
Ward J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58, 238-244.
Univariate clustering
Use univariate clustering to optimally cluster objects into k homogeneous classes, based on their
description using a single quantitative variable.
Description
Homogeneity is measured here using the sum of the within-class variances. To maximize the
homogeneity of the classes, we therefore try to minimize the sum of the within-class variances.
The algorithm used here is very fast and uses the method put forward by W.D. Fisher (1958).
This method can be seen as a process of turning a quantitative variable into a discrete ordinal
variable. There are many applications, e.g. in mapping applications for creating color scales or
in marketing for creating homogeneous segments.
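The idea behind Fisher's method can be sketched as a dynamic program over the sorted values (an illustrative implementation, not XLSTAT's optimized algorithm; the data and the number of classes are made up):

```python
# Optimal 1-D clustering into k classes (in the spirit of Fisher, 1958):
# minimize the total within-class sum of squares over contiguous groups
# of the sorted values, by dynamic programming. Illustrative sketch only.

def ssq(xs):
    # within-group sum of squared deviations from the group mean
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def fisher_cluster(values, k):
    xs = sorted(values)
    n = len(xs)
    INF = float("inf")
    # cost[j][i]: best total within-class SSQ for xs[:i] split into j classes
    cost = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for t in range(j - 1, i):  # last class is xs[t:i]
                c = cost[j - 1][t] + ssq(xs[t:i])
                if c < cost[j][i]:
                    cost[j][i], cut[j][i] = c, t
    # recover the class boundaries by walking the cuts backwards
    bounds, i = [], n
    for j in range(k, 0, -1):
        bounds.append((cut[j][i], i))
        i = cut[j][i]
    classes = [xs[a:b] for a, b in reversed(bounds)]
    return classes, cost[k][n]

classes, total = fisher_cluster([1, 2, 10, 11, 12, 30], 3)
print(classes, total)
```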
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Observations/variables table: Select a table comprising N objects described by P
descriptors. If column headers have been selected, check that the "Variable labels" option has
been activated.
Row weights: Activate this option if the rows are weighted. If you do not activate this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Column labels" option is activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet in the active workbook.
Column labels: Activate this option if the first row of the data selections
(Observations/variables table, row labels, row weights, column weights) contains a label.
Row labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Column labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Centroids: Activate this option to display the table of centroids of the classes.
Central objects: Activate this option to display the coordinates of the nearest object to the
centroid for each class.
Results by class: Activate this option to display a table giving the statistics and the objects for
each of the classes.
Results by object: Activate this option to display a table giving the class each object is
assigned to in the initial object order.
Results
Summary statistics: For the descriptor of the objects, this table displays the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation.
Class centroids: This table shows the class centroids for the various descriptors.
Distance between the class centroids: This table shows the Euclidean distances between
the class centroids for the various descriptors.
Central objects: This table shows the coordinates of the nearest object to the centroid for
each class.
Distance between the central objects: This table shows the Euclidean distances between
the class central objects for the various descriptors.
Results by class: The descriptive statistics for the classes (number of objects, sum of
weights, within-class variance, minimum distance to the centroid, maximum distance to the
centroid, mean distance to the centroid) are displayed in the first part of the table. The second
part shows the objects.
Results by object: This table shows the assignment class for each object in the initial object
order.
References
Fisher W.D. (1958). On grouping for maximum homogeneity. Journal of the American
Statistical Association, 53, 789-798.
Distribution fitting
Use this tool to fit a distribution to a sample of continuous or discrete quantitative data.
Description
Fitting a distribution to a data sample consists, once the type of distribution has been chosen,
in estimating the parameters of the distribution so that the sample is the most likely possible
(as regards the maximum likelihood) or that at least certain statistics of the sample (mean,
variance for example) correspond as closely as possible to those of the distribution.
Distributions
Bernoulli (p): the density function of this distribution is given by:

P(X = 1) = p, P(X = 0) = 1 - p, with p ∈ [0, 1]

The Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli (1654-1705),
describes binary phenomena where only two events can occur, with respective
probabilities p and 1 - p.
Beta (α, β): the density function of this distribution (also called Beta type I) is given by:

f(x) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1} (1-x)^{\beta-1}, with α, β > 0, x ∈ [0, 1], and B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
Beta4 (α, β, c, d): the density function of this distribution is given by:

f(x) = \frac{1}{B(\alpha, \beta)} \frac{(x-c)^{\alpha-1} (d-x)^{\beta-1}}{(d-c)^{\alpha+\beta-1}}, with α, β > 0, x ∈ [c, d],

c, d ∈ R, and B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
For the type I beta distribution, X takes values in the [0, 1] range. The beta4
distribution is obtained by a variable transformation such that the distribution is defined on an
interval [c, d] where c and d can take any value.
Beta (a, b): the density function of this distribution (also called Beta type I) is given by:

f(x) = \frac{1}{B(a, b)} x^{a-1} (1-x)^{b-1}, with a, b > 0, x ∈ [0, 1], and B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}
Binomial (n, p): the density function of this distribution is given by:

P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, with x ∈ N, n ∈ N* and p ∈ [0, 1]

n is the number of trials, and p the probability of success. The binomial distribution
is the distribution of the number of successes for n trials, given that the probability
of success is p.
Negative binomial type I (n, p): the density function of this distribution is given by:
Negative binomial type II (k, p): the density function of this distribution is given by:

P(X = x) = \frac{\Gamma(k+x)}{x!\,\Gamma(k)} \frac{p^x}{(1+p)^{k+x}}, with x ∈ N and k, p > 0

The negative binomial type II distribution is used to represent discrete and highly
heterogeneous phenomena. As k tends to infinity, the negative binomial type II
distribution tends towards a Poisson distribution with λ = kp.
Chi-square (df): the density function of this distribution is given by:

f(x) = \frac{(1/2)^{df/2}}{\Gamma(df/2)} x^{df/2 - 1} e^{-x/2}, with x ≥ 0 and df ∈ N*

E(X) = df and V(X) = 2df
Erlang (k, λ): the density function of this distribution is given by:

f(x) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{(k-1)!}, with x ≥ 0, k, λ > 0 and k ∈ N

Note: When k = 1, this distribution is equivalent to the exponential distribution. The
Gamma distribution with two parameters is a generalization of the Erlang
distribution to the case where k is a real number and not an integer (for the Gamma
distribution the scale parameter β is used).
Exponential (λ): the density function of this distribution is given by:

f(x) = λ exp(-λx), with x ≥ 0 and λ > 0

The exponential distribution is often used for studying lifetimes in quality control.
Fisher (df1, df2): the density function of this distribution is given by:
f(x) = \frac{1}{x\,B(df_1/2,\, df_2/2)} \left(\frac{df_1 x}{df_1 x + df_2}\right)^{df_1/2} \left(1 - \frac{df_1 x}{df_1 x + df_2}\right)^{df_2/2}, with x ≥ 0 and df_1, df_2 ∈ N*

E(X) = df_2/(df_2 - 2) if df_2 > 2, and V(X) = 2df_2^2(df_1 + df_2 - 2)/[df_1(df_2 - 2)^2(df_2 - 4)] if df_2 > 4
Fisher's distribution, from the name of the biologist, geneticist and statistician
Ronald Aylmer Fisher (1890-1962), corresponds to the ratio of two Chi-square
distributions. It is often used for testing hypotheses.
Fisher-Tippett (β, μ): the density function of this distribution is given by:

f(x) = \frac{1}{\beta} \exp\left[-\frac{x-\mu}{\beta} - \exp\left(-\frac{x-\mu}{\beta}\right)\right], with β > 0

E(X) = μ + βγ and V(X) = (πβ)²/6, where γ is the Euler-Mascheroni constant.
Gamma (k, β, μ): the density function of this distribution is given by:

f(x) = \frac{(x-\mu)^{k-1}}{\beta^k \Gamma(k)} e^{-(x-\mu)/\beta}, with x ≥ μ and k, β > 0
GEV (β, k, μ): the density function of this distribution is given by:

f(x) = \frac{1}{\beta} \left(1 + k\,\frac{x-\mu}{\beta}\right)^{-1/k - 1} \exp\left[-\left(1 + k\,\frac{x-\mu}{\beta}\right)^{-1/k}\right], with β > 0
The GEV (Generalized Extreme Values) distribution is much used in hydrology for
modeling flood phenomena. k lies typically between -0.6 and 0.6.
Gumbel: the density function of this distribution is given by:

f(x) = exp(-x - exp(-x))

The Gumbel distribution, named after Emil Julius Gumbel (1891-1966), is a special
case of the Fisher-Tippett distribution with β = 1 and μ = 0. It is used in the study of
extreme phenomena such as precipitations, flooding and earthquakes.
Logistic (μ, s): the density function of this distribution is given by:

f(x) = \frac{e^{-\frac{x-\mu}{s}}}{s\left(1 + e^{-\frac{x-\mu}{s}}\right)^2}, with μ ∈ R and s > 0
Lognormal (μ, σ): the density function of this distribution is given by:

f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}, with x, σ > 0
Lognormal2 (m, s): the density function of this distribution is given by:

f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}, with x, σ > 0

where μ and σ are the mean and standard deviation of ln(X), derived from m and s,
the mean and standard deviation of X.
Normal (μ, σ): the density function of this distribution is given by:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, with σ > 0

E(X) = μ and V(X) = σ²
Normal (standard): the density function of this distribution is given by:

f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}

This distribution is a special case of the normal distribution with μ = 0 and σ = 1.
Pareto (a, b): the density function of this distribution is given by:
f(x) = \frac{a\,b^a}{x^{a+1}}, with a, b > 0 and x ≥ b
The Pareto distribution, named after the Italian economist Vilfredo Pareto (1848-
1923), is also known as the Bradford distribution. This distribution was initially used
to represent the distribution of wealth in society, with Pareto's principle that 80% of
the wealth was owned by 20% of the population.
PERT (a, m, b): the density function of this distribution is given by:

f(x) = \frac{(x-a)^{\alpha-1} (b-x)^{\beta-1}}{B(\alpha, \beta)\,(b-a)^{\alpha+\beta-1}}, with α, β > 0, x ∈ [a, b],

a, b ∈ R, and B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
where

α = \frac{4m + b - 5a}{b - a}, \quad β = \frac{5b - a - 4m}{b - a}
The PERT distribution is a special case of the beta4 distribution. It is defined by its
definition interval [a, b] and m, the most likely value (the mode). PERT is an
acronym for Program Evaluation and Review Technique, a project management
and planning methodology. The PERT methodology and distribution were
developed between 1956 and 1960 during the US Navy and Lockheed project to
develop the submarine-launched Polaris missiles. The PERT distribution is useful
to model the time that a team is likely to spend to finish a project. The simpler
triangular distribution is similar to the PERT distribution in that it is also defined by
an interval and a most likely value.
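The computation of α and β from a, m and b can be sketched as follows (the values of a, m and b are a made-up example, e.g. task durations in days):

```python
# PERT parameters: derive the shape parameters alpha and beta of the
# underlying beta4 distribution from the minimum a, the most likely value m
# and the maximum b, using
#   alpha = (4m + b - 5a) / (b - a),  beta = (5b - a - 4m) / (b - a).
# The a, m, b values below are a made-up example.

def pert_shape(a, m, b):
    alpha = (4 * m + b - 5 * a) / (b - a)
    beta = (5 * b - a - 4 * m) / (b - a)
    return alpha, beta

alpha, beta = pert_shape(a=2.0, m=5.0, b=14.0)
print(alpha, beta)

# with these shapes, the PERT mean is the classical (a + 4m + b) / 6
mean = (2.0 + 4 * 5.0 + 14.0) / 6
print(mean)
```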
Poisson (λ): the density function of this distribution is given by:

P(X = x) = \frac{\exp(-\lambda)\,\lambda^x}{x!}, with x ∈ N and λ > 0
Student (df): the density function of this distribution is given by:

f(x) = \frac{\Gamma\left(\frac{df+1}{2}\right)}{\sqrt{\pi\,df}\,\Gamma\left(\frac{df}{2}\right)} \left(1 + \frac{x^2}{df}\right)^{-(df+1)/2}, with df > 0
The English chemist and statistician William Sealy Gosset (1876-1937) used the
nickname Student to publish his work, in order to preserve his anonymity (the
Guinness brewery forbade its employees to publish following the disclosure of
confidential information by another researcher). The Student's t distribution is the
distribution of the mean of df standard normal variables. When df = 1,
Student's distribution is a Cauchy distribution, which has the particularity of having
neither expectation nor variance.
Trapezoidal (a, b, c, d): the density function of this distribution is given by:
f(x) = \frac{2(x-a)}{(d+c-b-a)(b-a)}, x ∈ [a, b]

f(x) = \frac{2}{d+c-b-a}, x ∈ [b, c]

f(x) = \frac{2(d-x)}{(d+c-b-a)(d-c)}, x ∈ [c, d]

f(x) = 0, x < a or x > d

with a ≤ b ≤ c ≤ d
This distribution is useful to represent a phenomenon for which we know that it can
take values between two extreme values (a and d), but that it is more likely to take
values between two values (b and c) within that interval.
Triangular (a, m, b): the density function of this distribution is given by:
f(x) = \frac{2(x-a)}{(b-a)(m-a)}, x ∈ [a, m]

f(x) = \frac{2(b-x)}{(b-a)(b-m)}, x ∈ [m, b]

f(x) = 0, x < a or x > b

with a ≤ m ≤ b
TriangularQ (q1, m, q2, p1, p2): the density function of this distribution is a
reparametrization of the Triangular distribution. A first step requires estimating the a
and b parameters of the triangular distribution, from the q1 and q2 quantiles to which
percentages p1 and p2 correspond. Once this is done, the distribution functions can be
computed using the triangular distribution functions.
Uniform (a, b): the density function of this distribution is given by:
f(x) = \frac{1}{b-a}, with b > a and x ∈ [a, b]
The uniform (0,1) distribution is much used for simulations. As the cumulative
distribution function of every distribution takes values between 0 and 1, a sample
taken from a Uniform (0,1) distribution can be used to obtain random samples from
any distribution whose inverse cumulative distribution function can be calculated.
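This inverse-transformation idea can be sketched with the exponential distribution, whose inverse cumulative distribution function has the closed form -ln(1 - u)/λ (the rate, sample size and seed below are made-up example values):

```python
# Inverse transform sampling: draw u ~ Uniform(0,1), then apply the inverse
# cumulative distribution function of the target distribution.
# Target here: Exponential(lam). The rate, sample size and seed are
# made-up example values.
import math
import random

def exponential_sample(lam, n, seed=0):
    rng = random.Random(seed)
    # F^-1(u) = -ln(1 - u) / lam for the exponential distribution
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

sample = exponential_sample(lam=2.0, n=100_000)
print(sum(sample) / len(sample))  # close to the theoretical mean 1/lam = 0.5
```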
Uniform discrete (a, b): the density function of this distribution is given by:
f(x) = \frac{1}{b-a+1}, with b > a, (a, b) ∈ N², x ∈ N and x ∈ [a, b]
The uniform discrete distribution corresponds to the case where the uniform
distribution is restricted to integers.
Weibull (β): the density function of this distribution is given by:

f(x) = \beta x^{\beta-1} \exp(-x^{\beta}), with x ≥ 0 and β > 0

We have E(X) = \Gamma\left(\frac{1}{\beta}+1\right) and V(X) = \Gamma\left(\frac{2}{\beta}+1\right) - \Gamma^2\left(\frac{1}{\beta}+1\right)

Weibull (β, γ): the density function of this distribution is given by:

f(x) = \frac{\beta}{\gamma}\left(\frac{x}{\gamma}\right)^{\beta-1} \exp\left[-\left(\frac{x}{\gamma}\right)^{\beta}\right], with x ≥ 0 and β, γ > 0

We have E(X) = \gamma\,\Gamma\left(\frac{1}{\beta}+1\right) and V(X) = \gamma^2\left[\Gamma\left(\frac{2}{\beta}+1\right) - \Gamma^2\left(\frac{1}{\beta}+1\right)\right]

β is the shape parameter of the distribution and γ the scale parameter. When β = 1,
the Weibull distribution is an exponential distribution with parameter 1/γ.
Weibull (β, γ, μ): the density function of this distribution is given by:

f(x) = \frac{\beta}{\gamma}\left(\frac{x-\mu}{\gamma}\right)^{\beta-1} \exp\left[-\left(\frac{x-\mu}{\gamma}\right)^{\beta}\right], with x ≥ μ and β, γ > 0

We have E(X) = \mu + \gamma\,\Gamma\left(\frac{1}{\beta}+1\right) and V(X) = \gamma^2\left[\Gamma\left(\frac{2}{\beta}+1\right) - \Gamma^2\left(\frac{1}{\beta}+1\right)\right]

The Weibull distribution, named after the Swede Ernst Hjalmar Waloddi Weibull
(1887-1979), is much used in quality control and survival analysis. β is the shape
parameter of the distribution and γ the scale parameter. When β = 1 and μ = 0, the
Weibull distribution is an exponential distribution with parameter 1/γ.
Fitting method
Moments: this simple method uses the definition of the moments of the distribution as a
function of the parameters to determine the latter. For most distributions, the use of the mean
and the variance is sufficient. However, for certain distributions, the mean alone suffices (for
example the Poisson distribution), while for others the asymmetry coefficient is also required
(for example the Weibull distribution).
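As a sketch of the moments method (illustrative; the sample is made up), consider a two-parameter Gamma distribution with E(X) = kβ and V(X) = kβ², so that matching the sample moments gives k and β directly:

```python
# Method of moments sketch for a Gamma(k, beta) distribution (mu = 0):
# E(X) = k*beta and V(X) = k*beta**2, so matching the sample mean m and
# variance v gives k = m**2 / v and beta = v / m.
# The sample below is a made-up example.

sample = [2.1, 3.4, 1.8, 4.2, 2.9, 3.7, 2.4, 3.1]

n = len(sample)
m = sum(sample) / n
v = sum((x - m) ** 2 for x in sample) / n  # population variance

k_hat = m ** 2 / v
beta_hat = v / m

# sanity check: the fitted distribution reproduces the sample moments
print(k_hat * beta_hat, m)        # equal by construction
print(k_hat * beta_hat ** 2, v)   # equal by construction
```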
Likelihood: the parameters of the distribution are estimated by maximizing the likelihood of the
sample. This method, although more complex, has the advantage of being rigorous for all
distributions, and it provides approximate standard deviations for the parameter estimators. The
maximum likelihood method is offered for the negative binomial type II distribution, Fisher-
Tippett distribution, GEV distribution and Weibull distribution.
For certain distributions, the moments method gives exactly the same result as the maximum
likelihood method. This is particularly true for the normal distribution.
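The maximum likelihood approach can be sketched numerically (an illustrative one-parameter example with the exponential distribution, for which the analytical optimum is λ = 1/mean, so the moments and likelihood estimates coincide; the sample is made up):

```python
# Maximum likelihood sketch: maximize the log-likelihood of an
# Exponential(lam) sample numerically (ternary search on a concave function).
# For this distribution the analytical maximum is lam = 1 / mean, which the
# numerical search should recover. The sample is a made-up example.
import math

sample = [0.3, 1.2, 0.7, 2.5, 0.9, 1.6, 0.4, 1.1]

def log_likelihood(lam):
    # sum of log(lam * exp(-lam * x)) over the sample
    return sum(math.log(lam) - lam * x for x in sample)

lo, hi = 1e-6, 100.0
for _ in range(200):  # ternary search: the log-likelihood is concave in lam
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if log_likelihood(m1) < log_likelihood(m2):
        lo = m1
    else:
        hi = m2

lam_mle = (lo + hi) / 2
print(lam_mle, 1 / (sum(sample) / len(sample)))  # numerical vs analytical MLE
```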
Once the parameters for the chosen distribution have been estimated, the hypothesis must be
tested in order to check if the phenomenon observed through the sample follows the
distribution in question. XLSTAT offers two goodness of fit tests.
The Chi-square goodness of fit test is a parametric test using the distance (as regards Chi-
square) between the histogram of the theoretical distribution (determined by the estimated
parameters) and the histogram of the empirical distribution of the sample. The histograms are
calculated using the k intervals chosen by the user. It can be shown that the statistic
asymptotically follows a Chi-square distribution with k - 1 degrees of freedom, further reduced
by the number of parameters estimated from the sample. This test is better suited to discrete
distributions, and it is recommended to check that the expected frequency in each class is not
less than 5.
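The statistic itself can be sketched as follows (illustrative; the observed counts and the uniform "fair die" model are made up):

```python
# Chi-square goodness-of-fit statistic sketch: compare observed counts in k
# classes with the counts expected under the fitted distribution.
# The observed counts and the uniform (fair die) model are made-up examples.

observed = [18, 22, 21, 19, 24, 16]   # counts in k = 6 classes
n = sum(observed)
expected = [n / 6.0] * 6              # expected counts under a fair die

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)

# each expected frequency should be at least 5 for the test to be reliable
assert all(e >= 5 for e in expected)
```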
It may happen that the Chi-square test concludes that the distribution fits the data badly, with
one class contributing much more to the Chi-square statistic than the others. In this case,
merging the class in question with a neighbouring class makes it possible to check whether the
conclusion is due only to that class or whether the fit is actually incorrect.
The Kolmogorov-Smirnov goodness of fit test is an exact non-parametric test based on the
maximum distance between a theoretical distribution function (entirely determined by the
known values of its parameters) and the empirical distribution function of the sample. This test
can only be used for continuous distributions.
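The statistic can be sketched as follows (illustrative; the sample is made up and the theoretical distribution is a fully specified Uniform(0,1)):

```python
# Kolmogorov-Smirnov statistic sketch: maximum distance between the
# empirical distribution function of a sample and a fully specified
# theoretical distribution function (here Uniform(0,1), so F(x) = x).
# The sample is a made-up example.

sample = sorted([0.05, 0.22, 0.31, 0.48, 0.60, 0.71, 0.85, 0.93])
n = len(sample)

def F(x):
    # theoretical CDF: Uniform(0, 1)
    return x

# the empirical CDF jumps at each data point, so the maximum gap occurs
# just before or just after a jump
D = max(max(abs(i / n - F(x)), abs((i - 1) / n - F(x)))
        for i, x in enumerate(sample, start=1))
print(D)
```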
When a parameter estimation precedes the goodness of fit test, the Kolmogorov-Smirnov test
is not exact, since the parameters have been estimated precisely so as to bring the theoretical
distribution as close as possible to the data. When it confirms the goodness of fit hypothesis,
the Kolmogorov-Smirnov test therefore risks being optimistic.
For the case where the distribution used is the normal distribution, Lilliefors and Stephens (see
normality tests) have put forward a modified Kolmogorov-Smirnov test which allows
parameters to be estimated on the sample tested.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data: Select the data for which the goodness of fit test is to be calculated. You can select
several columns (columns mode) or rows (rows mode) if you want to carry out tests on several
samples at the same time.
Distribution: Choose the probability distribution to be used for the fit and/or goodness of fit
tests. See the description section for more information on the distributions offered.
Parameters: You can choose to enter the parameters for the distribution, or estimate them. If
you choose to enter the parameters, you must enter their values.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Sample labels: Activate this option if the sample labels are on the first row (columns mode) or
in the first column (rows mode) of the selected data.
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Variable labels" option is activated.
Standardize weights: If you activate this option, the weights are standardized such that
their sum equals the number of observations.
Options tab:
Tests: Choose the type of goodness of fit test (see the description section for more details on
the tests).
Significance level (%): Enter the significance level for the above tests.
Estimation method: Choose the method of estimating the parameters of the chosen
distribution (see the description section for more details on estimation methods).
Maximum likelihood: Activate this option to use the maximum likelihood method. You
can then change the convergence threshold below which the algorithm is considered
to have converged. Default value: 0.00001.
Intervals: For a Chi-square test, or if you want to compare the density of the
distribution chosen with the sample histogram, you must choose one of the following options:
Width: Choose this option to define a fixed width for the intervals.
User defined: Select a column containing in increasing order the lower bound of the
first interval, and the upper bound of all the intervals.
Minimum: Activate this option to enter the lower bound of the first interval.
This value must be less than or equal to the minimum of the series.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the samples
selected.
Charts tab:
Histograms: Activate this option to display the histograms of the samples. For a theoretical
distribution, the density function is displayed.
Bars: Choose this option to display the histograms with a bar for each interval.
Continuous lines: Choose this option to display the histograms with a continuous line.
Cumulative histograms: Activate this option to display the cumulative histograms of the
samples. For a theoretical distribution, the distribution function is displayed.
Results
Summary statistics: This table displays for the selected samples, the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation.
Estimated parameters: This table displays the parameters for the distribution.
Statistics estimated on the input data and computed using the estimated parameters of
the distribution: This table is used to compare the mean, variance, skewness and kurtosis
coefficients calculated from the sample with those calculated from the values of the distribution
parameters.
Kolmogorov-Smirnov test: The results of the Kolmogorov-Smirnov test are displayed if the
corresponding option has been activated.
Chi-square test: The results of the Chi-square test are displayed if the corresponding option
has been activated.
Comparison between the observed and theoretical frequencies: This table is displayed if a
Chi-square test was requested.
Descriptive statistics for the intervals: This table is displayed if histograms have been
requested. It shows for each interval the frequencies, the relative frequencies, together with
the densities for the samples and distribution chosen.
Example
http://www.xlstat.com/demo-dfit.htm
References
El-Shaarawi A.H., Esterby E.S. and Dutka B.J. (1981). Bacterial density in water determined
by Poisson or negative binomial distributions. Applied and Environmental Microbiology, 41(1),
107-116.
Fisher R.A. and Tippett L.H.C. (1928). Limiting forms of the frequency distribution of the
smallest and largest member of a sample. Proc. Cambridge Phil. Soc., 24, 180-190.
Gumbel E.J. (1941). Probability interpretation of the observed return periods of floods. Trans.
Am. Geophys. Union, 21, 836-850.
Jenkinson A. F. (1955). The frequency distribution of the annual maximum (or minimum) of
meteorological elements. Q. J. R. Meteorol. Soc., 81, 158-171.
Perreault L. and Bobée B. (1992). Loi généralisée des valeurs extrêmes. Propriétés
mathématiques et statistiques. Estimation des paramètres et des quantiles XT de période de
retour T. INRS-Eau, rapport de recherche no 350, Québec.
Weibull W. (1939). A statistical theory of the strength of material. Proc. Roy. Swedish Inst.
Eng. Res. 151(1), 1-45.
Linear regression
Use this tool to create a simple or multiple linear regression model for explanation or
prediction.
Description
Linear regression is without doubt the most frequently used statistical method. A distinction is
usually made between simple regression (with only one explanatory variable) and multiple
regression (several explanatory variables) although the overall concept and calculation
methods are identical.
The model is written as follows:

y_i = β0 + Σ_{j=1..p} βj x_ij + ε_i    (1)

where y_i is the value observed for the dependent variable for observation i, x_ij is the value
taken by variable j for observation i, and ε_i is the error of the model.
The statistical framework and the hypotheses which accompany it are not required for fitting
this model. Furthermore, minimization using the least squares method (the sum of squared
errors ε_i is minimized) provides an exact analytical solution. However, to be able to test the
hypothesis and measure the explanatory power of the various explanatory variables in the
model, a statistical framework is necessary.
The linear regression hypotheses are as follows: the errors ε_i follow the same normal
distribution N(0, σ²) and are independent.
The way the model with this hypothesis added is written means that, within the framework of
the linear regression model, the y_i's are the expression of random variables with mean μ_i and
variance σ², where

μ_i = β0 + Σ_{j=1..p} βj x_ij
To use the various tests proposed in the results of linear regression, it is recommended to
check retrospectively that the underlying hypotheses have been satisfied. The
normality of the residuals can be checked by analyzing certain charts or by using a normality
test. The independence of the residuals can be checked by analyzing certain charts or by using
the Durbin-Watson test.
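As a sketch of the model in equation (1), the least-squares solution can be computed directly; the simulated data and coefficient values below are illustrative assumptions:

```python
# Least-squares fit of y_i = β0 + Σ βj x_ij + ε_i on simulated data.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x])           # first column carries β0
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes Σ ε_i²
residuals = y - X @ beta                       # to be checked for normality
print(beta)                                    # close to [1.0, 2.0, -0.5]
```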
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Dependent variables:
Quantitative: Select the response variable(s) you want to model. If several variables have
been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
X / Explanatory variables:
Quantitative: Select the quantitative explanatory variables in the Excel worksheet. The data
selected must be of type numeric. If the variable header has been selected, check that the
"Variable labels" option has been activated.
Qualitative: Activate this option to perform an ANCOVA analysis. Then select the qualitative
explanatory variables (the factors) in the Excel worksheet. The selected data may be of any
type, but numerical data will automatically be considered as nominal. If the variable header has
been selected, check that the "Variable labels" option has been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Observation weights: Activate this option if the observations are weighted. If you do not
activate this option, the weights will all be taken as 1. Weights must be greater than or equal
to 0. A weight of 2 is equivalent to repeating the same observation twice. If a column header
has been selected, check that the "Variable labels" option has been activated.
Regression weights: Activate this option if you want to carry out a weighted least squares
regression. If you do not activate this option, the weights will be considered as 1. Weights
must be greater than or equal to 0. If a column header has been selected, check that the
"Variable labels" option is activated.
Options tab:
Fixed constant: Activate this option to fix the constant of the regression model to a value you
then enter (0 by default).
Tolerance: Activate this option to prevent the OLS regression calculation algorithm from taking
into account variables which might be either constant or too correlated with other variables
already used in the model (0.0001 by default).
Interactions / Level: Activate this option to include interactions in the model then enter the
maximum interaction level (value between 1 and 4).
Confidence interval (%): Enter the percentage range of the confidence interval to use for the
various tests and for calculating the confidence intervals around the parameters and
predictions. Default value: 95.
Model selection: Activate this option if you want to use one of the four selection methods
provided:
Best model: This method lets you choose the best model from amongst all the models
which can handle a number of variables varying from "Min variables" to "Max
Variables". Furthermore, the user can choose several "criteria" to determine the best
model.
o Criterion: Choose the criterion from the following list: Adjusted R², Mean
Square of Errors (MSE), Mallows' Cp, Akaike's AIC, Schwarz's SBC, Amemiya's
PC.
o Min variables: Enter the minimum number of variables to be used in the model.
o Max variables: Enter the maximum number of variables to be used in the model.
Note: this method can cause long calculation times as the total number of models
explored is the sum of the C(n,k) values for k varying from "Min variables" to "Max
variables", where C(n,k) is equal to n!/[(n-k)!k!]. It is therefore recommended that the
value of "Max variables" be increased gradually.
Stepwise: The selection process starts by adding the variable with the largest
contribution to the model (the criterion used is Student's t statistic). If a second variable
is such that the probability associated with its t is less than the "Probability for entry",
it is added to the model. The same for a third variable. After the third variable is added,
the impact of removing each variable present in the model after it has been added is
evaluated (still using the t statistic). If the probability is greater than the "Probability of
removal", the variable is removed. The procedure continues until no more variables can
be added or removed.
Forward: The procedure is the same as for stepwise selection except that variables are
only added and never removed.
Backward: The procedure starts by simultaneously adding all variables. The variables
are then removed from the model following the procedure used for stepwise selection.
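The "Best model" search described above can be sketched by enumerating all variable subsets between a minimum and maximum size and scoring each with Akaike's AIC (one of the listed criteria); the data and names below are illustrative assumptions:

```python
# Exhaustive "best model" search over variable subsets, scored by AIC.
import itertools
import numpy as np

def aic(X, y):
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(sse / n) + 2 * X.shape[1]

rng = np.random.default_rng(3)
n, p = 100, 4
x = rng.normal(size=(n, p))
y = 3.0 * x[:, 0] - 2.0 * x[:, 2] + rng.normal(scale=0.5, size=n)

best_score, best_subset = np.inf, None
for k in range(1, p + 1):                      # "Min variables" .. "Max variables"
    for subset in itertools.combinations(range(p), k):
        X = np.column_stack([np.ones(n)] + [x[:, j] for j in subset])
        score = aic(X, y)
        if score < best_score:
            best_score, best_subset = score, subset
print(best_subset)                             # includes the active variables 0 and 2
```

The loop visits the sum of the C(p,k) subsets, which is exactly why the "Max variables" value should be increased gradually.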
Validation tab:
Validation: Activate this option if you want to use a sub-sample of the data to validate the
model.
Validation set: Choose one of the following options to define how to obtain the observations
used for the validation:
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use for the validation.
Prediction tab:
Prediction: Activate this option if you want to select data to use them in prediction mode. If
you activate this option, you need to make sure that the prediction dataset is structured like
the estimation dataset: same variables in the same order in the selections. However,
variable labels must not be selected: the first row of the selections listed below must
correspond to data.
Quantitative: Activate this option to select the quantitative explanatory variables. The first row
must not include variable labels.
Qualitative: Activate this option to select the qualitative explanatory variables. The first row
must not include variable labels.
Observations labels: Activate this option if observations labels are available. Then select the
corresponding data. If this option is not activated, the observations labels are automatically
generated by XLSTAT (PredObs1, PredObs2, etc.).
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Correlations: Activate this option to display the correlation matrix for quantitative variables
(dependent or explanatory).
Analysis of variance: Activate this option to display the analysis of variance table.
Type I SS: Activate this option to display the Type I analysis of variance table (Type I Sum of
Squares).
Type III SS: Activate this option to display the Type III analysis of variance table (Type III Sum
of Squares).
Standardized coefficients: Activate this option if you want the standardized coefficients (beta
coefficients) for the model to be displayed.
Predictions and residuals: Activate this option to display the predictions and residuals for all
the observations.
Adjusted predictions: Activate this option to calculate and display adjusted predictions
in the table of predictions and residuals.
Cook's D: Activate this option to calculate and display Cook's distances in the table of
predictions and residuals.
Charts tab:
Predictions and residuals: Activate this option to display the following charts.
(1) Line of regression: This chart is only displayed if there is only one explanatory
variable and this variable is quantitative.
(2) Explanatory variable versus standardized residuals: This chart is only displayed
if there is only one explanatory variable and this variable is quantitative.
(3) Dependent variable versus standardized residuals.
(4) Predictions for the dependent variable versus the dependent variable.
(5) Bar chart of standardized residuals.
Results
Summary statistics: The tables of descriptive statistics show the simple statistics for all the
variables selected. The number of observations, missing values, the number of non-missing
values, the mean and the standard deviation (unbiased) are displayed for the dependent
variables (in blue) and the quantitative explanatory variables. For qualitative explanatory
variables the names of the various categories are displayed together with their respective
frequencies.
Correlation matrix: This table is displayed to give you a view of the correlations between the
various variables selected.
Summary of the variables selection: Where a selection method has been chosen, XLSTAT
displays the selection summary. For a stepwise selection, the statistics corresponding to the
different steps are displayed. Where the best model for a number of variables varying from p to
q has been selected, the best model for each number or variables is displayed with the
corresponding statistics and the best model for the criterion chosen is displayed in bold.
Goodness of fit statistics: The statistics relating to the fitting of the regression model are
shown in this table:
Sum of weights: The sum of the weights of the observations used in the calculations.
In the formulas shown below, W is the sum of the weights.
DF: The number of degrees of freedom for the chosen model (corresponding to the
error part).
R²: The determination coefficient for the model. This coefficient, whose value is
between 0 and 1, is only displayed if the constant of the model has not been fixed by
the user. Its value is defined by:

R² = 1 - [ Σ_{i=1..n} w_i (y_i - ŷ_i)² ] / [ Σ_{i=1..n} w_i (y_i - ȳ)² ],
where ȳ = (1/n) Σ_{i=1..n} w_i y_i
The R² is interpreted as the proportion of the variability of the dependent variable explained
by the model. The nearer R² is to 1, the better the model. The problem with the R² is that it
does not take into account the number of variables used to fit the model.

Adjusted R²: The adjusted determination coefficient for the model. The adjusted R² can
be negative if the R² is near to zero. This coefficient is only calculated if the constant of
the model has not been fixed by the user. Its value is defined by:

Adjusted R² = 1 - (1 - R²)(W - 1)/(W - p - 1)

The adjusted R² is a correction to the R² which takes into account the number of
variables used in the model.
MSE: The mean of the squares of the errors (MSE) is defined by:

MSE = [1/(W - p*)] Σ_{i=1..n} w_i (y_i - ŷ_i)²

where p* is the number of parameters of the model.
RMSE: The root mean square of the errors (RMSE) is the square root of the MSE.
MAPE: The mean absolute percentage error (MAPE) is defined by:

MAPE = (100/W) Σ_{i=1..n} w_i |(y_i - ŷ_i) / y_i|
DW: The Durbin-Watson statistic is defined by:

DW = [ Σ_{i=2..n} ((y_i - ŷ_i) - (y_{i-1} - ŷ_{i-1}))² ] / [ Σ_{i=1..n} w_i (y_i - ŷ_i)² ]
This coefficient is the order 1 autocorrelation coefficient and is used to check that the
residuals of the model are not autocorrelated, given that the independence of the
residuals is one of the basic hypotheses of linear regression. The user can refer to a
table of Durbin-Watson statistics to check if the independence hypothesis for the
residuals is acceptable.
Cp: Mallows' Cp coefficient is defined by:

Cp = SSE / σ̂² + 2p* - W

where SSE is the sum of the squares of the errors for the model with p explanatory
variables, and σ̂² is the estimator of the variance of the residuals for the model
comprising all the explanatory variables. The nearer the Cp coefficient is to p*, the less
biased the model.
AIC: Akaike's information criterion (AIC) is defined by:

AIC = W ln(SSE / W) + 2p*
This criterion, proposed by Akaike (1973) is derived from the information theory and
uses Kullback and Leibler's measurement (1951). It is a model selection criterion which
penalizes models for which adding new explanatory variables does not supply sufficient
information to the model, the information being measured through the MSE. The aim is
to minimize the AIC criterion.
SBC: Schwarz's Bayesian criterion (SBC) is defined by:

SBC = W ln(SSE / W) + ln(W) p*

This criterion, proposed by Schwarz (1978), is similar to the AIC and, as with the AIC,
the aim is to minimize it.
PC: Amemiya's prediction criterion (PC) is defined by:

PC = (1 - R²)(W + p*) / (W - p*)

This criterion, proposed by Amemiya (1980), is used, like the adjusted R², to take
account of the parsimony of the model.
Press RMSE: Press' statistic is only displayed if the corresponding option has been
activated in the dialog box. It is defined by:

Press = Σ_{i=1..n} w_i (y_i - ŷ_i(-i))²

where ŷ_i(-i) is the prediction for observation i when the latter is not used for estimating
the parameters. We then get:

Press RMSE = √( Press / (W - p*) )
Press's RMSE can then be compared to the RMSE. A large difference between the two
shows that the model is sensitive to the presence or absence of certain observations in
the model.
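Most of the statistics above can be computed directly from their definitions; the sketch below assumes unit weights (w_i = 1, so W = n) and uses p_star for the number of model parameters:

```python
# Goodness-of-fit statistics from their definitions (unweighted case).
import numpy as np

def fit_statistics(y, y_hat, p_star):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    resid = y - y_hat
    sse = float(np.sum(resid ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p_star)
    mse = sse / (n - p_star)
    dw = float(np.sum(np.diff(resid) ** 2)) / sse   # Durbin-Watson
    aic = n * np.log(sse / n) + 2 * p_star          # Akaike
    sbc = n * np.log(sse / n) + np.log(n) * p_star  # Schwarz
    return {"R2": r2, "R2_adj": r2_adj, "MSE": mse,
            "RMSE": mse ** 0.5, "DW": dw, "AIC": aic, "SBC": sbc}

stats_ = fit_statistics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.1, 3.9], p_star=2)
print(stats_["R2"], stats_["DW"])
```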
If the Type I SS and Type III SS options (SS: Sum of Squares) are activated, the corresponding
tables are displayed.
The table of Type I SS values is used to visualize the influence that progressively adding
explanatory variables has on the fitting of the model, as regards the sum of the squares of the
errors (SSE), the mean of the squares of the errors (MSE), Fisher's F, or the probability
associated with Fisher's F. The lower the probability, the larger the contribution of the variable
to the model, all the other variables already being in the model. Note: the order in which the
variables are selected in the model influences the values obtained.
The table of Type III SS values is used to visualize the influence that removing an explanatory
variable has on the fitting of the model, all other variables being retained, as regards the sum
of the squares of the errors (SSE), the mean of the squares of the errors (MSE), Fisher's F, or
the probability associated with Fisher's F. The lower the probability, the larger the contribution
of the variable to the model, all the other variables already being in the model. Note: unlike
Type I SS, the order in which the variables are selected in the model has no influence on the
values obtained.
The analysis of variance table is used to evaluate the explanatory power of the explanatory
variables. Where the constant of the model is not set to a given value, the explanatory power is
evaluated by comparing the fit (as regards least squares) of the final model with the fit of the
rudimentary model including only a constant equal to the mean of the dependent variable.
Where the constant of the model is set, the comparison is made with respect to the model for
which the dependent variable is equal to the constant which has been set.
The parameters of the model table displays the estimate of the parameters, the
corresponding standard error, the Student's t, the corresponding probability, as well as the
confidence interval.
The equation of the model is then displayed to make it easier to read or re-use the model.
The table of standardized coefficients (also called beta coefficients) is used to compare the
relative weights of the variables. The higher the absolute value of a coefficient, the more
important the weight of the corresponding variable. When the confidence interval around a
standardized coefficient includes 0 (this can easily be seen on the chart of standardized
coefficients), the weight of the variable in the model is not significant.
The predictions and residuals table shows, for each observation, its weight, the value of the
qualitative explanatory variable, if there is only one, the observed value of the dependent
variable, the model's prediction, the residuals, the confidence intervals together with the fitted
prediction and Cook's D if the corresponding options have been activated in the dialog box.
Two types of confidence interval are displayed: a confidence interval around the mean
(corresponding to the case where the prediction would be made for an infinite number of
observations with a set of given values for the explanatory variables) and an interval around
the isolated prediction (corresponding to the case of an isolated prediction for the values given
for the explanatory variables). The second interval is always wider than the first, the random
variability being larger. If the validation data have been selected, they are displayed at the end
of the table.
The charts which follow show the results mentioned above. If there is only one explanatory
variable in the model, the first chart displayed shows the observed values, the regression line
and both types of confidence interval around the predictions. The second chart shows the
normalized residuals as a function of the explanatory variable. In principle, the residuals should
be distributed randomly around the X-axis. If there is a trend or a shape, this shows a problem
with the model.
The three charts displayed next show respectively the evolution of the standardized residuals
as a function of the dependent variable, the distance between the predictions and the
observations (for an ideal model, the points would all be on the bisector), and the standardized
residuals on a bar chart. The last chart quickly shows if an abnormal number of values are
outside the interval ]-2, 2[ given that the latter, assuming that the sample is normally
distributed, should contain about 95% of the data.
If you have selected the data to be used for calculating predictions on new observations,
the corresponding table is displayed next.
Example
A tutorial on simple linear regression is available on the Addinsoft website:
http://www.xlstat.com/demo-reg.htm
A tutorial on multiple linear regression is available on the Addinsoft website:
http://www.xlstat.com/demo-reg2.htm
References
Akaike H. (1973). Information Theory and the Extension of the Maximum Likelihood Principle.
In: Second International Symposium on Information Theory. (Eds: B.N. Petrov and F. Csáki).
Akadémiai Kiadó, Budapest. 267-281.
Mallows C.L. (1973). Some comments on Cp. Technometrics, 15, 661-675.
Tomassone R., Audrain S., Lesquoy de Turckheim E. and Miller C. (1992). La Régression,
Nouveaux Regards sur une Ancienne Méthode Statistique. INRA et MASSON, Paris.
ANOVA
Use this model to carry out ANOVA (ANalysis Of VAriance) of one or more balanced or
unbalanced factors. The advanced options enable you to choose the constraints on the model
and to take account of interactions between the factors. Multiple comparison tests can be
calculated.
Description
Analysis of Variance (ANOVA) uses the same conceptual framework as linear regression. The
main difference comes from the nature of the explanatory variables: instead of quantitative,
here they are qualitative. In ANOVA, explanatory variables are often called factors.
The model is written as follows:

y_i = β0 + Σ_{j=1..p} β_{k(i,j),j} + ε_i    (1)

where y_i is the value observed for the dependent variable for observation i, k(i,j) is the index of
the category of factor j for observation i, and ε_i is the error of the model.
The hypotheses used in ANOVA are identical to those used in linear regression: the errors ε_i
follow the same normal distribution N(0, σ²) and are independent.
The way the model with this hypothesis added is written means that, within the framework of
the linear regression model, the y_i's are the expression of random variables with mean μ_i and
variance σ², where

μ_i = β0 + Σ_{j=1..p} β_{k(i,j),j}
To use the various tests proposed in the results of linear regression, it is recommended to
check retrospectively that the underlying hypotheses have been satisfied. The
normality of the residuals can be checked by analyzing certain charts or by using a normality
test. The independence of the residuals can be checked by analyzing certain charts or by using
the Durbin-Watson test.
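For a single factor, the F test behind this model can be sketched with SciPy; the three groups and their means are illustrative assumptions:

```python
# One-way ANOVA: test whether the category means of one factor differ.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group_a = rng.normal(5.0, 1.0, size=30)   # three categories of one factor
group_b = rng.normal(5.0, 1.0, size=30)
group_c = rng.normal(8.0, 1.0, size=30)   # a clearly shifted mean

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # the factor has a highly significant effect here
```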
Interactions
By interaction is meant an artificial factor (not measured) which reflects the interaction between
at least two measured factors. For example, if we carry out treatment on a plant, and tests are
carried out under two different light intensities, we will be able to include in the model an
interaction factor treatment*light which will be used to identify a possible interaction between
the two factors. If there is an interaction between the two factors, we will observe a significantly
larger effect on the plants when the light is strong and the treatment is of type 2 while the effect
is average for weak light, treatment 2 and strong light, treatment 1 combinations.
To make a parallel with linear regression, the interactions are equivalent to the products
between the continuous explanatory variables, although here obtaining an interaction requires
more than a simple multiplication between two variables. The notation used to represent the
interaction between factor A and factor B is A*B.
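One way to picture the artificial A*B factor is that each combination of categories of the two measured factors becomes a category of the interaction factor; the labels below are illustrative:

```python
# Build the artificial interaction factor treatment*light from two factors.
treatment = ["t1", "t1", "t2", "t2"]
light = ["weak", "strong", "weak", "strong"]
interaction = [f"{t}*{l}" for t, l in zip(treatment, light)]
print(interaction)  # ['t1*weak', 't1*strong', 't2*weak', 't2*strong']
```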
We talk of balanced ANOVA when the number of observations is the same for each category
of each factor. When the numbers of observations for the categories of one of the factors are
not equal, the ANOVA is said to be unbalanced. XLSTAT can handle both cases.
Constraints
During the calculations, each factor is broken down into a sub-matrix containing as many
columns as there are categories in the factor. Typically, this is a full disjunctive table.
Nevertheless, the breakdown poses a problem: if there are g categories, the rank of this sub-
matrix is not g but g-1. This leads to the requirement to delete one of the columns of the sub-
matrix and possibly to transform the other columns. Several strategies are available depending
on the interpretation we want to make afterwards:
1) a1=0: the parameter for the first category is null. This choice allows us to use the effect of
the first category as a standard. In this case, the constant of the model is equal to the mean of
the dependent variable for group 1.
2) an=0: the parameter for the last category is null. This choice allows us to use the effect of
the last category as a standard. In this case, the constant of the model is equal to the mean of
the dependent variable for group g.
3) Sum (ai) = 0: the sum of the parameters is null. This choice forces the constant of the
model to be equal to the mean of the dependent variable when the ANOVA is balanced.
4) Sum (ai) = 0 (PH): the sum of the parameters is null. The difference with the previous option
comes from how interactions are processed. Here, the sub-matrices are not calculated for
interactions by applying the same rule as for factors, but by using the horizontal product (PH)
of the sub-matrices of the factors involved in the interaction.
5) Sum (ni.ai) = 0: the sum of the parameters is null. This choice forces the constant of the
model to be equal to the mean of the dependent variable even when the ANOVA is
unbalanced.
Note: even if the choice of constraint influences the values of the parameters, it has no effect
on the predicted values and on the different fitting statistics.
One of the main applications of ANOVA is multiple comparisons testing whose aim is to check
if the parameters for the various categories of a factor differ significantly or not. For example, in
the case where four treatments are applied to plants, we want to know not only if the
treatments have a significant effect, but also if the treatments have different effects.
Numerous tests have been proposed for comparing the means of categories. The majority of
these tests assume that the sample is normally distributed. XLSTAT provides the main tests
including:
Tukey's HSD test: this test is the most used (HSD: Honestly Significant Difference).
Fisher's LSD test: this is Student's test that tests the hypothesis that all the means for the
various categories are equal (LSD: Least Significant Difference).
Bonferroni's t* test: this test, derived from Student's test, is more conservative as it takes into
account the fact that several comparisons are carried out simultaneously. Consequently, the
significance level of the test is modified according to the following formula:
α' = α / [g(g − 1)/2]
where g is the number of categories of the factor whose categories are being compared.
Dunn-Sidak's test: this test is derived from Bonferroni's test. It is more reliable in some
situations.
α' = 1 − (1 − α)^(2 / [g(g − 1)])
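The two corrections above can be computed directly. A minimal sketch in plain Python (hypothetical function names):

```python
# Adjusted significance levels for g(g-1)/2 pairwise comparisons.

def bonferroni_alpha(alpha, g):
    """Bonferroni: alpha' = alpha / [g(g-1)/2], one test per pair of categories."""
    return alpha / (g * (g - 1) / 2)

def sidak_alpha(alpha, g):
    """Dunn-Sidak: alpha' = 1 - (1 - alpha)^(2 / [g(g-1)])."""
    return 1 - (1 - alpha) ** (2 / (g * (g - 1)))

print(bonferroni_alpha(0.05, 4))  # 0.05 / 6, about 0.00833
print(sidak_alpha(0.05, 4))       # slightly larger, hence slightly less conservative
```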
The following tests are more complex as they are based on iterative procedures where the
results depend on the number of combinations remaining to be tested for each category.
Newman-Keuls's test (SNK): this test is derived from Student's test (SNK: Student Newman-
Keuls), and is very often used although not very reliable.
REGWQ test: this test is among the most reliable in a majority of situations (REGW: Ryan-
Einot-Gabriel-Welsch).
All the above tests enable comparisons to be made between all pairs of categories and belong
to the MCA test family (Multiple Comparisons of All, or All-Pairwise Comparisons).
Other tests make comparisons between all categories and a control category. These tests are
called MCC tests (Multiple Comparisons with a Control). XLSTAT offers the Dunnett test, which
is the most widely used. There are three Dunnett tests:
Two-tailed test: the null hypothesis assumes equality between the category tested and the
control category. The alternative hypothesis assumes the means of the two categories differ.
Left one-tailed test: the null hypothesis assumes equality between the category tested and
the control category. The alternative hypothesis assumes that the mean of the control category
is greater than the mean of the category tested.
Right one-tailed test: the null hypothesis assumes equality between the category tested and
the control category. The alternative hypothesis assumes that the mean of the control category
is less than the mean of the category tested.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Dependent variables:
Quantitative: Select the response variable(s) you want to model. If several variables have
been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
X / Explanatory variables:
Quantitative: Activate this option to perform an ANCOVA analysis. Then select the
quantitative explanatory variables in the Excel worksheet. The data selected must be of type
numeric. If the variable header has been selected, check that the "Variable labels" option has
been activated.
Qualitative: Select the qualitative explanatory variables (the factors) in the Excel worksheet.
The selected data may be of any type, but numerical data will automatically be considered as
nominal. If the variable header has been selected, check that the "Variable labels" option has
been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Observation weights: Activate this option if the observations are weighted. If you do not
activate this option, the weights will all be taken as 1. Weights must be greater than or equal
to 0. A weight of 2 is equivalent to repeating the same observation twice. If a column header
has been selected, check that the "Variable labels" option has been activated.
Regression weights: Activate this option if you want to carry out a weighted least squares
regression. If you do not activate this option, the weights will be considered as 1. Weights
must be greater than or equal to 0. If a column header has been selected, check that the
"Variable labels" option is activated.
Options tab:
Fixed constant: Activate this option to fix the constant of the regression model to a value you
then enter (0 by default).
Tolerance: Activate this option to prevent the OLS regression calculation algorithm from taking
into account variables which might be either constant or too correlated with other variables
already used in the model (0.0001 by default).
Interactions / Level: Activate this option to include interactions in the model then enter the
maximum interaction level (value between 1 and 4).
Confidence interval (%): Enter the percentage range of the confidence interval to use for the
various tests and for calculating the confidence intervals around the parameters and
predictions. Default value: 95.
Constraints: Details on the various options are available in the description section.
a1 = 0: Choose this option so that the parameter of the first category of each factor is set to 0.
an = 0: Choose this option so that the parameter of the last category of each factor is set to 0.
Sum (ai) = 0: for each factor, the sum of the parameters associated with the various
categories is set to 0.
Sum (ai) = 0 (PH): for each factor, the sum of the parameters associated with the various
categories is set to 0. For interactions, the sub-matrices are determined by carrying out the
horizontal product of the sub-matrices of the factors concerned.
Sum (ni.ai) = 0: for each factor, the sum of the parameters associated with the various
categories weighted by their frequencies is set to 0.
Model selection: Activate this option if you want to use one of the four selection methods
provided:
Best model: This method lets you choose the best model from amongst all the models
which can handle a number of variables varying from "Min variables" to "Max
Variables". Furthermore, the user can choose several "criteria" to determine the best
model.
o Criterion: Choose the criterion from the following list: Adjusted R², Mean
Square of Errors (MSE), Mallows Cp, Akaike's AIC, Schwarz's SBC, Amemiya's
PC.
o Min variables: Enter the minimum number of variables to be used in the model.
o Max variables: Enter the maximum number of variables to be used in the
model.
Note: this method can cause long calculation times as the total number of models
explored is the sum of the C(n,k) values for k varying from "Min variables" to "Max
variables", where C(n,k) is equal to n!/[(n−k)!k!]. It is therefore recommended that the
value of "Max variables" be increased gradually.
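The note above can be checked numerically; a short sketch of the model count using the standard binomial coefficient:

```python
import math

def models_explored(n, kmin, kmax):
    """Total models fitted by 'Best model': sum of C(n, k) for k = kmin..kmax."""
    return sum(math.comb(n, k) for k in range(kmin, kmax + 1))

# 15 candidate variables, subsets of size 1 to 5:
print(models_explored(15, 1, 5))  # 4943
```

Raising "Max variables" from 5 to 10 here would already multiply the count several times over, which is why a gradual increase is recommended.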
Stepwise: The selection process starts by adding the variable with the largest
contribution to the model (the criterion used is Student's t statistic). If a second variable
is such that the probability associated with its t is less than the "Probability for entry",
it is added to the model. The same for a third variable. After the third variable is added,
the impact of removing each variable present in the model after it has been added is
evaluated (still using the t statistic). If the probability is greater than the "Probability of
removal", the variable is removed. The procedure continues until no more variables can
be added or removed.
Forward: The procedure is the same as for stepwise selection except that variables are
only added and never removed.
Backward: The procedure starts by simultaneously adding all variables. The variables
are then removed from the model following the procedure used for stepwise selection.
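The selection procedures above can be sketched as follows. This is a deliberately simplified illustration, not XLSTAT's implementation: variables enter while they reduce the error sum of squares (SSE) by more than a fixed threshold, instead of the t-probability entry/removal criterion described above.

```python
# Simplified forward selection by SSE improvement (assumed simplification of
# the t-statistic criterion described in the text). Plain Python, no libraries.

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def sse(X, y, cols):
    """SSE of the OLS fit using a constant plus the chosen columns."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(Z[0])
    A = [[sum(r[i] * r[j] for r in Z) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(Z, y)) for i in range(p)]
    beta = solve(A, b)
    return sum((yi - sum(bi * zi for bi, zi in zip(beta, r))) ** 2
               for r, yi in zip(Z, y))

def forward_select(X, y, threshold=1e-6):
    """Add, at each step, the variable that most reduces the SSE; stop when
    the best remaining candidate no longer improves the fit."""
    chosen, remaining = [], list(range(len(X[0])))
    current = sse(X, y, chosen)
    while remaining:
        best = min(remaining, key=lambda j: sse(X, y, chosen + [j]))
        new = sse(X, y, chosen + [best])
        if current - new <= threshold:
            break
        chosen.append(best)
        remaining.remove(best)
        current = new
    return chosen

X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
y = [2.0, 4.0, 6.0, 8.0]     # y depends on the first column only
print(forward_select(X, y))  # [0]
```

Stepwise selection would add a removal pass after each entry; backward selection would start from all variables and only remove.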
Validation tab:
Validation: Activate this option if you want to use a sub-sample of the data to validate the
model.
Validation set: Choose one of the following options to define how to obtain the observations
used for the validation:
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use for the validation.
Prediction tab:
Prediction: Activate this option if you want to select data to use in prediction mode. If you
activate this option, you need to make sure that the prediction dataset has the same structure
as the estimation dataset: the same variables in the same order in the selections. However,
variable labels must not be selected: the first row of the selections listed below must
correspond to data.
Quantitative: Activate this option to select the quantitative explanatory variables. The first row
must not include variable labels.
Qualitative: Activate this option to select the qualitative explanatory variables. The first row
must not include variable labels.
Observations labels: activate this option if observations labels are available. Then select the
corresponding data. If this option is not activated, the observations labels are automatically
generated by XLSTAT (PredObs1, PredObs2, etc.).
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Correlations: Activate this option to display the correlation matrix for quantitative variables
(dependent or explanatory).
Analysis of variance: Activate this option to display the analysis of variance table.
Type I SS: Activate this option to display the Type I analysis of variance table (Type I Sum of
Squares).
Type III SS: Activate this option to display the Type III analysis of variance table (Type III Sum
of Squares).
Standardized coefficients: Activate this option if you want the standardized coefficients (beta
coefficients) for the model to be displayed.
Predictions and residuals: Activate this option to display the predictions and residuals for all
the observations.
Adjusted predictions: Activate this option to calculate and display adjusted predictions
in the table of predictions and residuals.
Cook's D: Activate this option to calculate and display Cook's distances in the table of
predictions and residuals.
Multiple comparisons:
Apply to all factors: Activate this option to compute the selected tests for all factors.
Use least squares means: Activate this option to compare the means using their least
squares estimators (obtained from the parameters of the model). If this option is not activated,
the means are computed using their estimation based on the data.
Sort up: Activate this option to sort the compared categories in increasing order, the sort
criterion being their respective means. If this option is not activated, the sort is decreasing.
Pairwise comparisons: Activate this option then choose the comparison methods.
Comparisons with a control: Activate this option then choose the type of Dunnett test you
want to carry out.
Charts tab:
Predictions and residuals: Activate this option to display the following charts.
(1) Line of regression: This chart is only displayed if there is only one explanatory
variable and this variable is quantitative.
(2) Explanatory variable versus standardized residuals: This chart is only displayed
if there is only one explanatory variable and this variable is quantitative.
(3) Dependent variable versus standardized residuals.
(4) Predictions for the dependent variable versus the dependent variable.
(5) Bar chart of standardized residuals.
Means charts: Activate this option to display the charts used to display the means of the
various categories of the various factors.
Results
Summary statistics: The tables of descriptive statistics show the simple statistics for all the
variables selected. The number of observations, missing values, the number of non-missing
values, the mean and the standard deviation (unbiased) are displayed for the dependent
variables (in blue) and the quantitative explanatory variables. For qualitative explanatory
variables the names of the various categories are displayed together with their respective
frequencies.
Correlation matrix: This table is displayed to give you a view of the correlations between the
various variables selected.
Summary of the variables selection: Where a selection method has been chosen, XLSTAT
displays the selection summary. For a stepwise selection, the statistics corresponding to the
different steps are displayed. Where the best model for a number of variables varying from p to
q has been selected, the best model for each number of variables is displayed with the
corresponding statistics, and the best model for the criterion chosen is displayed in bold.
Goodness of fit statistics: The statistics relating to the fitting of the regression model are
shown in this table:
Sum of weights: The sum of the weights of the observations used in the calculations.
In the formulas shown below, W is the sum of the weights.
DF: The number of degrees of freedom for the chosen model (corresponding to the
error part).
R²: The determination coefficient for the model. This coefficient, whose value is
between 0 and 1, is only displayed if the constant of the model has not been fixed by
the user. Its value is defined by:
R² = 1 − Σ(i=1..n) wi (yi − ŷi)² / Σ(i=1..n) wi (yi − ȳ)²,  where ȳ = (1/W) Σ(i=1..n) wi yi
The R² is interpreted as the proportion of the variability of the dependent variable explained
by the model. The nearer R² is to 1, the better the model. The problem with the R² is that it
does not take into account the number of variables used to fit the model.
Adjusted R²: The adjusted determination coefficient for the model. The adjusted R² can
be negative if the R² is near zero. This coefficient is only calculated if the constant of
the model has not been fixed by the user. Its value is defined by:
Adjusted R² = 1 − (1 − R²) (W − 1) / (W − p − 1)
The adjusted R² is a correction to the R² which takes into account the number of
variables used in the model.
MSE: The mean of the squares of the errors (MSE) is defined by:
MSE = (1/(W − p*)) Σ(i=1..n) wi (yi − ŷi)²
RMSE: The root mean square of the errors (RMSE) is the square root of the MSE.
MAPE: The Mean Absolute Percentage Error (MAPE) is defined by:
MAPE = (100/W) Σ(i=1..n) wi |(yi − ŷi) / yi|
DW: The Durbin-Watson statistic (DW) is defined by:
DW = Σ(i=2..n) [(yi − ŷi) − (yi−1 − ŷi−1)]² / Σ(i=1..n) wi (yi − ŷi)²
This coefficient is the order 1 autocorrelation coefficient and is used to check that the
residuals of the model are not autocorrelated, given that the independence of the
residuals is one of the basic hypotheses of linear regression. The user can refer to a
table of Durbin-Watson statistics to check if the independence hypothesis for the
residuals is acceptable.
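A minimal sketch computing the statistics defined above for a vector of observations and predictions, assuming unit weights and the convention p* = p + 1 (p explanatory variables plus the constant — an assumption, stated here because the text does not spell it out):

```python
import math

def fit_stats(y, yhat, p):
    """R2, adjusted R2, MSE, RMSE, MAPE and DW as defined in the text,
    with unit weights and p* = p + 1 (assumed convention)."""
    W = len(y)
    p_star = p + 1
    ybar = sum(y) / W
    e = [yi - yh for yi, yh in zip(y, yhat)]  # residuals
    sse = sum(ei ** 2 for ei in e)
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1 - sse / sst
    return {
        "R2": r2,
        "adj_R2": 1 - (1 - r2) * (W - 1) / (W - p - 1),
        "MSE": sse / (W - p_star),
        "RMSE": math.sqrt(sse / (W - p_star)),
        "MAPE": 100 / W * sum(abs(ei / yi) for ei, yi in zip(e, y)),
        "DW": sum((e[i] - e[i - 1]) ** 2 for i in range(1, W)) / sse,
    }

stats = fit_stats([2.0, 4.1, 5.9, 8.2], [2.1, 4.0, 6.0, 8.0], p=1)
print(round(stats["R2"], 4))  # 0.9966
```

A DW value near 2 suggests no order-1 autocorrelation; values near 0 or 4 suggest positive or negative autocorrelation respectively.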
Cp: Mallows' Cp coefficient is defined by:
Cp = SSE / σ̂² + 2p* − W
where SSE is the sum of the squares of the errors for the model with p explanatory
variables, and σ̂² is the estimator of the variance of the residuals for the model
comprising all the explanatory variables. The nearer the Cp coefficient is to p*, the less
biased the model.
AIC: Akaike's Information Criterion (AIC) is defined by:
AIC = W ln(SSE/W) + 2p*
This criterion, proposed by Akaike (1973), is derived from information theory and
uses Kullback and Leibler's measure (1951). It is a model selection criterion which
penalizes models for which adding new explanatory variables does not supply sufficient
information to the model, the information being measured through the MSE. The aim is
to minimize the AIC criterion.
SBC: Schwarz's Bayesian Criterion (SBC) is defined by:
SBC = W ln(SSE/W) + p* ln(W)
This criterion, proposed by Schwarz (1978), is similar to the AIC and, likewise, the aim is
to minimize it.
PC: Amemiya's Prediction Criterion (PC) is defined by:
PC = (1 − R²) (W + p*) / (W − p*)
This criterion, proposed by Amemiya (1980), is used, like the adjusted R², to take
account of the parsimony of the model.
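The AIC and SBC formulas above are straightforward to evaluate from the SSE; a short sketch (the numbers in the example are illustrative, not taken from a real dataset):

```python
import math

def aic(sse, W, p_star):
    """AIC = W ln(SSE / W) + 2 p*, as defined above."""
    return W * math.log(sse / W) + 2 * p_star

def sbc(sse, W, p_star):
    """SBC = W ln(SSE / W) + p* ln(W); penalizes extra parameters more as W grows."""
    return W * math.log(sse / W) + p_star * math.log(W)

# Model 1: SSE = 120 with 3 parameters; model 2: SSE = 118 with 4 parameters.
print(aic(120.0, 50, 3) < aic(118.0, 50, 4))  # True: the extra variable is not worth it
```

Both criteria trade goodness of fit (the SSE term) against complexity (the p* term); the model with the smaller value is preferred.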
Press RMSE: Press' statistic is only displayed if the corresponding option has been
activated in the dialog box. It is defined by:
Press = Σ(i=1..n) wi (yi − ŷi(−i))²
where ŷi(−i) is the prediction for observation i when the latter is not used for estimating
the parameters. We then get:
Press RMSE = √( Press / (W − p*) )
Press's RMSE can then be compared to the RMSE. A large difference between the two
shows that the model is sensitive to the presence or absence of certain observations in
the model.
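The Press statistic can be illustrated for a simple regression with one explanatory variable by explicitly refitting without each observation in turn. This is a sketch of the leave-one-out idea, not XLSTAT's algorithm:

```python
import math

def simple_ols(x, y):
    """Least-squares intercept and slope for one explanatory variable."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b * xbar, b

def press_rmse(x, y):
    """Refit without observation i, predict it, accumulate the squared error;
    then Press RMSE = sqrt(Press / (W - p*)) with p* = 2 here."""
    press = 0.0
    for i in range(len(x)):
        a, b = simple_ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    return math.sqrt(press / (len(x) - 2))

print(press_rmse([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1]))
```

If this value is much larger than the ordinary RMSE, the fit depends heavily on individual observations, exactly as the paragraph above describes.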
If the Type I SS and Type III SS (SS: Sum of Squares) are activated, the corresponding tables
are displayed.
The table of Type I SS values is used to visualize the influence that progressively adding
explanatory variables has on the fitting of the model, as regards the sum of the squares of the
errors (SSE), the mean of the squares of the errors (MSE), Fisher's F, or the probability
associated with Fisher's F. The lower the probability, the larger the contribution of the variable
to the model, all the other variables already being in the model. Note: the order in which the
variables are selected in the model influences the values obtained.
The table of Type III SS values is used to visualize the influence that removing an explanatory
variable has on the fitting of the model, all other variables being retained, as regards the sum
of the squares of the errors (SSE), the mean of the squares of the errors (MSE), Fisher's F, or
the probability associated with Fisher's F. The lower the probability, the larger the contribution
of the variable to the model, all the other variables already being in the model. Note: unlike
Type I SS, the order in which the variables are selected in the model has no influence on the
values obtained.
The analysis of variance table is used to evaluate the explanatory power of the explanatory
variables. Where the constant of the model is not set to a given value, the explanatory power is
evaluated by comparing the fit (as regards least squares) of the final model with the fit of the
rudimentary model including only a constant equal to the mean of the dependent variable.
Where the constant of the model is set, the comparison is made with respect to the model for
which the dependent variable is equal to the constant which has been set.
The parameters of the model table displays the estimate of the parameters, the
corresponding standard error, Student's t, the corresponding probability, as well as the
confidence interval.
The equation of the model is then displayed to make it easier to read or re-use the model.
The table of standardized coefficients (also called beta coefficients) is used to compare the
relative weights of the variables. The higher the absolute value of a coefficient, the more
important the weight of the corresponding variable. When the confidence interval around a
standardized coefficient includes 0 (this can easily be seen on the chart of standardized
coefficients), the weight of the variable in the model is not significant.
The predictions and residuals table shows, for each observation, its weight, the value of the
qualitative explanatory variable, if there is only one, the observed value of the dependent
variable, the model's prediction, the residuals, the confidence intervals together with the fitted
prediction and Cook's D if the corresponding options have been activated in the dialog box.
Two types of confidence interval are displayed: a confidence interval around the mean
(corresponding to the case where the prediction would be made for an infinite number of
observations with a set of given values for the explanatory variables) and an interval around
the isolated prediction (corresponding to the case of an isolated prediction for the values given
for the explanatory variables). The second interval is always greater than the first, the random
values being larger. If the validation data have been selected, they are displayed at the end of
the table.
The charts which follow show the results mentioned above. If there is only one explanatory
variable in the model, the first chart displayed shows the observed values, the regression line
and both types of confidence interval around the predictions. The second chart shows the
standardized residuals as a function of the explanatory variable. In principle, the residuals
should be distributed randomly around the X-axis. If there is a trend or a shape, this shows a
problem with the model.
The three charts displayed next show respectively the evolution of the standardized residuals
as a function of the dependent variable, the distance between the predictions and the
observations (for an ideal model, the points would all be on the bisector), and the standardized
residuals on a bar chart. The last chart quickly shows if an abnormal number of values are
outside the interval ]-2, 2[ given that the latter, assuming that the sample is normally
distributed, should contain about 95% of the data.
If you have selected the data to be used for calculating predictions on new observations,
the corresponding table is displayed next.
If multiple comparison tests have been requested, the corresponding results are then
displayed.
Example
A tutorial on one-way ANOVA and multiple comparisons tests is available on the Addinsoft
website:
http://www.xlstat.com/demo-ano.htm
References
Akaike H. (1973). Information Theory and the Extension of the Maximum Likelihood Principle.
In: Second International Symposium on Information Theory. (Eds: V.N. Petrov and F. Csáki).
Akadémiai Kiadó, Budapest. 267-281.
Hsu J.C. (1996). Multiple Comparisons: Theory and Methods, CRC Press, Boca Raton.
Lea P., Næs T. & Rødbotten M. (1997). Analysis of Variance for Sensory Data, John Wiley &
Sons, London.
Sahai H. and Ageel M.I. (2000). The Analysis of Variance. Birkhäuser, Boston.
Tomassone R., Audrain S., Lesquoy de Turckheim E. and Miller C. (1992). La Régression,
Nouveaux Regards sur une Ancienne Méthode Statistique. INRA et MASSON, Paris.
ANCOVA
Use this module to model a quantitative dependent variable by using quantitative and
qualitative explanatory variables as part of a linear model.
Description
ANCOVA (ANalysis of COVAriance) can be seen as a mix of ANOVA and linear regression as
the dependent variable is of the same type, the model is linear and the hypotheses are
identical. In reality it is more correct to consider ANOVA and linear regression as special cases
of ANCOVA.
If p is the number of quantitative variables, and q the number of factors (the qualitative
variables including the interactions between qualitative variables), the ANCOVA model is
written as follows:
yi = β0 + Σ(j=1..p) βj xij + Σ(j=1..q) βk(i,j),j + εi   (1)
where yi is the value observed for the dependent variable for observation i, xij is the value
taken by quantitative variable j for observation i, k(i,j) is the index of the category of factor j for
observation i, and εi is the error of the model.
The hypotheses used in ANCOVA are identical to those used in linear regression and ANOVA:
the errors εi follow the same normal distribution N(0, σ) and are independent.
One of the features of ANCOVA is to enable interactions between quantitative variables and
factors to be taken into account. The main application is to test if the level of a factor (a
qualitative variable) has an influence on the coefficient (often called slope in this context) of a
quantitative variable. Comparison tests are used to test if the slopes corresponding to the
various levels of a factor differ significantly or not. A model with one quantitative variable and a
factor with interaction is written:
yi = β0 + αk(i,1),1 + β1 xi1 + γk(i,1),1 xi1 + εi
hence we get
yi = β0 + αk(i,1),1 + (β1 + γk(i,1),1) xi1 + εi   (4)
The comparison of the parameters is used to test if the factor has an effect on the slope.
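The slope comparison can be made concrete by fitting one simple regression per level of the factor. This is an assumed simplification, equivalent in spirit to model (4), where each level carries its own intercept and its own slope:

```python
# Per-level slopes for one quantitative variable and one factor (a sketch of
# the ANCOVA interaction idea, not XLSTAT's joint estimation).

def simple_ols(x, y):
    """Least-squares intercept and slope for one explanatory variable."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b * xbar, b

def slopes_by_level(levels, x, y):
    """Slope of y on x within each level of the factor."""
    out = {}
    for lev in sorted(set(levels)):
        xs = [xi for li, xi in zip(levels, x) if li == lev]
        ys = [yi for li, yi in zip(levels, y) if li == lev]
        out[lev] = simple_ols(xs, ys)[1]
    return out

levels = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]    # slope 1 in group A, slope 2 in group B
print(slopes_by_level(levels, x, y))  # {'A': 1.0, 'B': 2.0}
```

In terms of model (4), the difference between the two slopes corresponds to the γ interaction parameters; the comparison tests ask whether that difference is significant.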
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Dependent variables:
Quantitative: Select the response variable(s) you want to model. If several variables have
been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
X / Explanatory variables:
Quantitative: Select the quantitative explanatory variables in the Excel worksheet. The data
selected must be of type numeric. If the variable header has been selected, check that the
"Variable labels" option has been activated.
Qualitative: Select the qualitative explanatory variables (the factors) in the Excel worksheet.
The selected data may be of any type, but numerical data will automatically be considered as
nominal. If the variable header has been selected, check that the "Variable labels" option has
been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Observation weights: Activate this option if the observations are weighted. If you do not
activate this option, the weights will all be taken as 1. Weights must be greater than or equal
to 0. A weight of 2 is equivalent to repeating the same observation twice. If a column header
has been selected, check that the "Variable labels" option has been activated.
Regression weights: Activate this option if you want to carry out a weighted least squares
regression. If you do not activate this option, the weights will be considered as 1. Weights
must be greater than or equal to 0. If a column header has been selected, check that the
"Variable labels" option is activated.
Options tab:
Fixed constant: Activate this option to fix the constant of the regression model to a value you
then enter (0 by default).
Tolerance: Activate this option to prevent the OLS regression calculation algorithm from taking
into account variables which might be either constant or too correlated with other variables
already used in the model (0.0001 by default).
Interactions / Level: Activate this option to include interactions in the model then enter the
maximum interaction level (value between 1 and 4).
Confidence interval (%): Enter the percentage range of the confidence interval to use for the
various tests and for calculating the confidence intervals around the parameters and
predictions. Default value: 95.
Constraints: Details on the various options are available in the description section.
a1 = 0: Choose this option so that the parameter of the first category of each factor is set to 0.
an = 0: Choose this option so that the parameter of the last category of each factor is set to 0.
Sum (ai) = 0: for each factor, the sum of the parameters associated with the various
categories is set to 0.
Sum (ai) = 0 (PH): for each factor, the sum of the parameters associated with the various
categories is set to 0. For interactions, the sub-matrices are determined by carrying out the
horizontal product of the sub-matrices of the factors concerned.
Sum (ni.ai) = 0: for each factor, the sum of the parameters associated with the various
categories weighted by their frequencies is set to 0.
Model selection: Activate this option if you want to use one of the four selection methods
provided:
Best model: This method lets you choose the best model from amongst all the models
which can handle a number of variables varying from "Min variables" to "Max
Variables". Furthermore, the user can choose several "criteria" to determine the best
model.
o Criterion: Choose the criterion from the following list: Adjusted R², Mean
Square of Errors (MSE), Mallows Cp, Akaike's AIC, Schwarz's SBC, Amemiya's
PC.
o Min variables: Enter the minimum number of variables to be used in the model.
o Max variables: Enter the maximum number of variables to be used in the model.
Note: this method can cause long calculation times as the total number of models
explored is the sum of the C(n,k) values for k varying from "Min variables" to "Max
variables", where C(n,k) is equal to n!/[(n−k)!k!]. It is therefore recommended that the
value of "Max variables" be increased gradually.
Stepwise: The selection process starts by adding the variable with the largest
contribution to the model (the criterion used is Student's t statistic). If a second variable
is such that the probability associated with its t is less than the "Probability for entry",
it is added to the model. The same for a third variable. After the third variable is added,
the impact of removing each variable present in the model after it has been added is
evaluated (still using the t statistic). If the probability is greater than the "Probability of
removal", the variable is removed. The procedure continues until no more variables can
be added or removed.
Forward: The procedure is the same as for stepwise selection except that variables are
only added and never removed.
Backward: The procedure starts by simultaneously adding all variables. The variables
are then removed from the model following the procedure used for stepwise selection.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Correlations: Activate this option to display the correlation matrix for quantitative variables
(dependent or explanatory).
Analysis of variance: Activate this option to display the analysis of variance table.
Type I SS: Activate this option to display the Type I analysis of variance table (Type I Sum of
Squares).
Type III SS: Activate this option to display the Type III analysis of variance table (Type III Sum
of Squares).
Standardized coefficients: Activate this option if you want the standardized coefficients (beta
coefficients) for the model to be displayed.
Predictions and residuals: Activate this option to display the predictions and residuals for all
the observations.
Adjusted predictions: Activate this option to calculate and display adjusted predictions
in the table of predictions and residuals.
Cook's D: Activate this option to calculate and display Cook's distances in the table of
predictions and residuals.
Multiple comparisons:
Apply to all factors: Activate this option to compute the selected tests for all factors.
Use least squares means: Activate this option to compare the means using their least
squares estimators (obtained from the parameters of the model). If this option is not activated,
the means are computed using their estimation based on the data.
Sort up: Activate this option to sort the compared categories in increasing order, the sort
criterion being their respective means. If this option is not activated, the sort is decreasing.
Pairwise comparisons: Activate this option then choose the comparison methods.
Comparisons with a control: Activate this option then choose the type of Dunnett test you
want to carry out.
Comparison of slopes: Activate this option to compare the interaction slopes between the
quantitative and qualitative variables (see the description section on this subject).
Charts tab:
Predictions and residuals: Activate this option to display the following charts.
(1) Line of regression: This chart is only displayed if there is only one explanatory
variable and this variable is quantitative.
(2) Explanatory variable versus standardized residuals: This chart is only displayed
if there is only one explanatory variable and this variable is quantitative.
(3) Dependent variable versus standardized residuals.
(4) Predictions for the dependent variable versus the dependent variable.
(5) Bar chart of the standardized residuals.
Means charts: Activate this option to display the charts used to display the means of the
various categories of the various factors.
Results
Summary statistics: The tables of descriptive statistics show the simple statistics for all the
variables selected. The number of observations, missing values, the number of non-missing
values, the mean and the standard deviation (unbiased) are displayed for the dependent
variables (in blue) and the quantitative explanatory variables. For qualitative explanatory
variables the names of the various categories are displayed together with their respective
frequencies.
Correlation matrix: This table is displayed to give you a view of the correlations between the
various variables selected.
Summary of the variables selection: Where a selection method has been chosen, XLSTAT
displays the selection summary. For a stepwise selection, the statistics corresponding to the
different steps are displayed. Where the best model for a number of variables varying from p to
q has been selected, the best model for each number of variables is displayed with the
corresponding statistics and the best model for the criterion chosen is displayed in bold.
Goodness of fit statistics: The statistics relating to the fitting of the regression model are
shown in this table:
Sum of weights: The sum of the weights of the observations used in the calculations.
In the formulas shown below, W is the sum of the weights.
DF: The number of degrees of freedom for the chosen model (corresponding to the
error part).
R²: The determination coefficient for the model. This coefficient, whose value is
between 0 and 1, is only displayed if the constant of the model has not been fixed by
the user. Its value is defined by:

R² = 1 − [ Σᵢ₌₁ⁿ wᵢ(yᵢ − ŷᵢ)² ] / [ Σᵢ₌₁ⁿ wᵢ(yᵢ − ȳ)² ],  where ȳ = (1/W) Σᵢ₌₁ⁿ wᵢyᵢ
The R² is interpreted as the proportion of the variability of the dependent variable explained
by the model. The nearer R² is to 1, the better the model. The problem with the R² is that it
does not take into account the number of variables used to fit the model.
Adjusted R²: The adjusted determination coefficient for the model. The adjusted R² can
be negative if the R² is near to zero. This coefficient is only calculated if the constant of
the model has not been fixed by the user. Its value is defined by:

Adjusted R² = 1 − (1 − R²)(W − 1) / (W − p − 1)

The adjusted R² is a correction to the R² which takes into account the number of
variables used in the model.
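As an illustration of the two formulas above, here is a minimal Python sketch (the function name and argument layout are our own, not XLSTAT's):

```python
import numpy as np

def weighted_r2(y, y_hat, w, p):
    """Weighted R-squared and adjusted R-squared, following the definitions above.

    y: observed values, y_hat: model predictions, w: observation weights,
    p: number of explanatory variables (illustrative naming)."""
    W = w.sum()                          # sum of the weights
    y_bar = (w * y).sum() / W            # weighted mean of the dependent variable
    sse = (w * (y - y_hat) ** 2).sum()   # weighted sum of squared errors
    sst = (w * (y - y_bar) ** 2).sum()   # weighted total sum of squares
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (W - 1.0) / (W - p - 1.0)
    return r2, r2_adj
```

With unit weights this reduces to the usual unweighted R².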
MSE: The mean of the squares of the errors (MSE) is defined by:

MSE = [1 / (W − p*)] Σᵢ₌₁ⁿ wᵢ(yᵢ − ŷᵢ)²
RMSE: The root mean square of the errors (RMSE) is the square root of the MSE.
MAPE: The mean absolute percentage error (MAPE) is defined by:

MAPE = (100 / W) Σᵢ₌₁ⁿ wᵢ |(yᵢ − ŷᵢ) / yᵢ|
DW: The Durbin-Watson statistic (DW) is defined by:

DW = Σᵢ₌₂ⁿ [(yᵢ − ŷᵢ) − (yᵢ₋₁ − ŷᵢ₋₁)]² / Σᵢ₌₁ⁿ wᵢ(yᵢ − ŷᵢ)²
This coefficient is the order 1 autocorrelation coefficient and is used to check that the
residuals of the model are not autocorrelated, given that the independence of the
residuals is one of the basic hypotheses of linear regression. The user can refer to a
table of Durbin-Watson statistics to check if the independence hypothesis for the
residuals is acceptable.
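The RMSE, MAPE and DW definitions above can be sketched as follows (a simplified illustration, not XLSTAT's implementation; names are our own):

```python
import numpy as np

def fit_diagnostics(y, y_hat, w, p_star):
    """RMSE, MAPE and the Durbin-Watson statistic as defined above."""
    W = w.sum()
    resid = y - y_hat                              # residuals y_i - yhat_i
    rmse = np.sqrt((w * resid ** 2).sum() / (W - p_star))
    mape = 100.0 / W * (w * np.abs(resid / y)).sum()
    # ratio of squared successive residual differences to squared residuals
    dw = ((resid[1:] - resid[:-1]) ** 2).sum() / (w * resid ** 2).sum()
    return rmse, mape, dw
```

A DW value near 2 is consistent with uncorrelated residuals.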
Cp: Mallows' Cp coefficient is defined by:

Cp = SSE / σ̂² + 2p* − W

where SSE is the sum of the squares of the errors for the model with p explanatory
variables and σ̂² is the estimator of the variance of the residuals for the model
comprising all the explanatory variables. The nearer the Cp coefficient is to p*, the less
the model is biased.
AIC: Akaike's Information Criterion is defined by:

AIC = W ln(SSE / W) + 2p*

This criterion, proposed by Akaike (1973), is derived from information theory and
uses Kullback and Leibler's measurement (1951). It is a model selection criterion which
penalizes models for which adding new explanatory variables does not supply sufficient
information to the model, the information being measured through the MSE. The aim is
to minimize the AIC criterion.
SBC: Schwarz's Bayesian Criterion is defined by:

SBC = W ln(SSE / W) + ln(W) p*

This criterion, proposed by Schwarz (1978), is similar to the AIC; as with the AIC, the
aim is to minimize it.
PC: Amemiya's Prediction Criterion is defined by:

PC = (1 − R²)(W + p*) / (W − p*)

This criterion, proposed by Amemiya (1980), is used, like the adjusted R², to take
account of the parsimony of the model.
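Under the notation above (SSE, W, p*), the four selection criteria can be computed as in this sketch (argument names are our own; the full-model residual variance σ̂² is passed in as sigma2_full):

```python
import numpy as np

def selection_criteria(sse, W, p_star, sigma2_full, r2):
    """Cp, AIC, SBC and PC following the formulas above."""
    cp = sse / sigma2_full + 2 * p_star - W           # Mallows' Cp
    aic = W * np.log(sse / W) + 2 * p_star            # Akaike (1973)
    sbc = W * np.log(sse / W) + np.log(W) * p_star    # Schwarz (1978)
    pc = (1 - r2) * (W + p_star) / (W - p_star)       # Amemiya (1980)
    return cp, aic, sbc, pc
```

Note that if sigma2_full happens to equal SSE/(W − p*) for the model at hand, Cp reduces exactly to p*, the "unbiased" benchmark mentioned above.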
Press RMSE: Press' statistic is only displayed if the corresponding option has been
activated in the dialog box. It is defined by:
Press = Σᵢ₌₁ⁿ wᵢ(yᵢ − ŷᵢ₍₋ᵢ₎)²

where ŷᵢ₍₋ᵢ₎ is the prediction for observation i when the latter is not used for estimating
the parameters. We then get:

Press RMSE = √[ Press / (W − p*) ]
Press's RMSE can then be compared to the RMSE. A large difference between the two
shows that the model is sensitive to the presence or absence of certain observations in
the model.
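Press can be illustrated with a naive leave-one-out loop (a sketch for the unweighted case; a real implementation would typically use the hat-matrix shortcut instead of refitting):

```python
import numpy as np

def press_rmse(X, y, p_star):
    """PRESS and PRESS RMSE by refitting with each observation left out.

    X: design matrix including a column of ones for the constant."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                    # drop observation i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        press += (y[i] - X[i] @ beta) ** 2          # out-of-sample residual
    return press, np.sqrt(press / (n - p_star))
```

PRESS is never smaller than the ordinary SSE, so a large gap between the two flags observations to which the model is sensitive, as described above.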
If the Type I SS and Type III SS (SS: Sum of Squares) are activated, the corresponding tables
are displayed.
The table of Type I SS values is used to visualize the influence that progressively adding
explanatory variables has on the fitting of the model, as regards the sum of the squares of the
errors (SSE), the mean of the squares of the errors (MSE), Fisher's F, or the probability
associated with Fisher's F. The lower the probability, the larger the contribution of the variable
to the model, all the other variables already being in the model. Note: the order in which the
variables are selected in the model influences the values obtained.
The table of Type III SS values is used to visualize the influence that removing an explanatory
variable has on the fitting of the model, all other variables being retained, as regards the sum
of the squares of the errors (SSE), the mean of the squares of the errors (MSE), Fisher's F, or
the probability associated with Fisher's F. The lower the probability, the larger the contribution
of the variable to the model, all the other variables already being in the model. Note: unlike
Type I SS, the order in which the variables are selected in the model has no influence on the
values obtained.
The analysis of variance table is used to evaluate the explanatory power of the explanatory
variables. Where the constant of the model is not set to a given value, the explanatory power is
evaluated by comparing the fit (as regards least squares) of the final model with the fit of the
rudimentary model including only a constant equal to the mean of the dependent variable.
Where the constant of the model is set, the comparison is made with respect to the model for
which the dependent variable is equal to the constant which has been set.
The parameters of the model table displays the estimates of the parameters, the
corresponding standard errors, Student's t, the corresponding probability, as well as the
confidence interval.
The equation of the model is then displayed to make it easier to read or re-use the model.
The table of standardized coefficients (also called beta coefficients) is used to compare the
relative weights of the variables. The higher the absolute value of a coefficient, the more
important the weight of the corresponding variable. When the confidence interval around a
standardized coefficient includes 0 (this can easily be seen on the chart of standardized
coefficients), the weight of the variable in the model is not significant.
The predictions and residuals table shows, for each observation, its weight, the value of the
qualitative explanatory variable, if there is only one, the observed value of the dependent
variable, the model's prediction, the residuals, the confidence intervals together with the fitted
prediction and Cook's D if the corresponding options have been activated in the dialog box.
Two types of confidence interval are displayed: a confidence interval around the mean
(corresponding to the case where the prediction would be made for an infinite number of
observations with a set of given values for the explanatory variables) and an interval around
the isolated prediction (corresponding to the case of an isolated prediction for the values given
for the explanatory variables). The second interval is always wider than the first, the random
variation being larger. If the validation data have been selected, they are displayed at the end of
the table.
The charts which follow show the results mentioned above. If there is only one explanatory
variable in the model, the first chart displayed shows the observed values, the regression line
and both types of confidence interval around the predictions. The second chart shows the
standardized residuals as a function of the explanatory variable. In principle, the residuals
should be distributed randomly around the X-axis. If there is a trend or a shape, this shows a
problem with the model.
The three charts displayed next show respectively the evolution of the normalized residuals
as a function of the dependent variable, the distance between the predictions and the
observations (for an ideal model, the points would all be on the bisector), and the normalized
residuals on a bar chart. The last chart quickly shows if an abnormal number of values are
outside the interval ]-2, 2[ given that the latter, assuming that the sample is normally
distributed, should contain about 95% of the data.
If you have selected the data to be used for calculating predictions on new observations,
the corresponding table is displayed next.
If multiple comparison tests have been requested, the corresponding results are then
displayed.
Example
http://www.xlstat.com/demo-anco.htm
References
Akaike H. (1973). Information Theory and the Extension of the Maximum Likelihood Principle.
In: Second International Symposium on Information Theory. (Eds: V.N. Petrov and F. Csaki).
Akadémiai Kiadó, Budapest. 267-281.
Hsu J.C. (1996). Multiple Comparisons: Theory and Methods, CRC Press, Boca Raton.
Lea P., Næs T. and Rødbotten M. (1997). Analysis of Variance for Sensory Data, John Wiley
& Sons, London.
Sahai H. and Ageel M.I. (2000). The Analysis of Variance. Birkhäuser, Boston.
Tomassone R., Audrain S., Lesquoy de Turckheim E. and Miller C. (1992). La Régression,
Nouveaux Regards sur une Ancienne Méthode Statistique. INRA et MASSON, Paris.
Logistic regression
Use logistic regression to model a binary or polytomous variable using quantitative and/or
qualitative explanatory variables.
Description
The principle of the logistic regression model is to link the occurrence or non-occurrence of an
event to explanatory variables. For example, in the phytosanitary domain, we are seeking to
find out from which dosage of a chemical agent an insect will be neutralized.
Models
Logistic and linear regression belong to the same family of models called GLM (Generalized
Linear Models): in both cases, an event is linked to a linear combination of explanatory
variables.
For linear regression, the dependent variable follows a normal distribution N(μ, σ) where μ is a
linear function of the explanatory variables. For logistic regression, the dependent variable,
also called the response variable, follows a Bernoulli distribution for parameter p (p is the
mean probability that an event will occur) when the experiment is repeated once, or a Binomial
(n, p) distribution if the experiment is repeated n times (for example the same dose tried on n
insects). The probability parameter p is here a linear combination of explanatory variables.
The most common functions used to link the probability p to the explanatory variables are the
logistic function (we refer to the Logit model) and the standard normal distribution function (the
Probit model). Both these functions are perfectly symmetric and sigmoid. XLSTAT provides
two other functions: the complementary Log-log function, which is closer to the upper
asymptote, and the Gompertz function, which on the contrary is closer to the x-axis.
Probit: p = (1 / √(2π)) ∫₋∞^X exp(−x² / 2) dx
The knowledge of the distribution of the event being studied gives the likelihood of the sample.
To estimate the parameters of the model (the coefficients of the linear function), we try to
maximize the likelihood function. Contrary to linear regression, an exact analytical solution
does not exist. So an iterative algorithm has to be used. XLSTAT uses a Newton-Raphson
algorithm. The user can change the maximum number of iterations and the convergence
threshold if desired.
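A bare-bones version of such a Newton-Raphson fit for the binary logit model might look like this (an illustrative sketch, not XLSTAT's algorithm; it omits the Firth penalty and step-size safeguards):

```python
import numpy as np

def fit_logit(X, y, max_iter=100, tol=1e-6):
    """Newton-Raphson maximum-likelihood fit of a binary logit model.

    X: (n, k) design matrix including a column of ones; y: 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities
        w = p * (1.0 - p)                     # weights for the information matrix
        grad = X.T @ (y - p)                  # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None])         # observed information
        step = np.linalg.solve(hess, grad)
        beta += step                          # Newton update
        if np.abs(step).max() < tol:          # convergence threshold
            break
    return beta
```

With separated data (as in the table below) the Hessian becomes ill-conditioned and the iterations diverge, which is exactly the situation the Firth option addresses.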
Separation problem
In the example above, the treatment variable is used to make a clear distinction between the
positive and negative cases.
Treatment 1 Treatment 2
Response + 121 0
Response - 0 85
In such cases, there is an indeterminacy on one or more parameters, whose variance grows
as the convergence threshold shrinks, which prevents a confidence interval around the
parameter from being given. To resolve this problem and obtain a stable solution, Firth (1993)
proposed the use of a penalized likelihood function. XLSTAT offers this solution as an option
and uses the results provided by Heinze (2002). If the standard deviation of one of the
parameters is very high compared with the estimate of the parameter, it is recommended to
restart the calculations with the "Firth" option activated.
Confidence interval
In most software, the calculation of confidence intervals for parameters is as for linear
regression assuming that the parameters are normally distributed. XLSTAT also offers the
alternative "profile likelihood" method (Venzon and Moolgavkar, 1988). This method is more
reliable as it does not require the assumption that the parameters are normally distributed.
Being iterative, however, it can slow down the calculations.
The multinomial logit model
The multinomial logit model corresponds to the case where the dependent variable has more
than two categories. It has a different parameterization from the binary logit model and
focuses on the probability of choosing one of the J categories given the explanatory variables.
log[ p(y = j | xᵢ) / p(y = 1 | xᵢ) ] = αⱼ + βⱼXᵢ
where the category 1 is called the reference or control category. All obtained parameters have
to be interpreted relatively to this reference category.
p(y = j | xᵢ) = exp(αⱼ + βⱼXᵢ) / [ 1 + Σₖ₌₂ᴶ exp(αₖ + βₖXᵢ) ]

p(y = 1 | xᵢ) = 1 / [ 1 + Σₖ₌₂ᴶ exp(αₖ + βₖXᵢ) ]
The model is estimated using a maximum likelihood method; the log-likelihood is as follows:

l(α, β) = Σᵢ₌₁ⁿ Σⱼ₌₁ᴶ yᵢⱼ log p(y = j | xᵢ)
To estimate the parameters of the model (the coefficients of the linear function), we try to
maximize the likelihood function. Contrary to linear regression, an exact analytical solution
does not exist. XLSTAT uses the Newton-Raphson algorithm to iteratively find a solution.
Some results that are displayed for the logistic regression are not applicable in the
multinomial case.
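The two probability formulas above, with category 1 as the reference, can be sketched as follows (names are our own; alpha and beta hold the parameters of categories 2..J, a single explanatory value x is assumed):

```python
import numpy as np

def multinomial_probs(alpha, beta, x):
    """Category probabilities for the multinomial logit model above.

    alpha, beta: parameters for categories 2..J (category 1 is the
    reference category, with all its parameters fixed at 0)."""
    eta = alpha + beta * x                   # linear predictors, categories 2..J
    denom = 1.0 + np.exp(eta).sum()
    p_ref = 1.0 / denom                      # reference category probability
    return np.concatenate(([p_ref], np.exp(eta) / denom))
```

By construction the J probabilities sum to 1, whatever the parameter values.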
XLSTAT can display the classification table (also called the confusion matrix) used to calculate
the percentage of well-classified observations for a given cutoff point. Typically, for a cutoff
value of 0.5, if the probability is less than 0.5, the observation is considered as being assigned
to class 0, otherwise it is assigned to class 1.
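The cutoff rule described above amounts to the following sketch (a hypothetical helper, not XLSTAT code):

```python
import numpy as np

def classification_table(y, probs, cutoff=0.5):
    """2x2 confusion matrix for a given cutoff.

    Rows: observed class (0/1); columns: predicted class (0/1).
    Probabilities at or above the cutoff are assigned to class 1."""
    pred = (np.asarray(probs) >= cutoff).astype(int)
    table = np.zeros((2, 2), dtype=int)
    for obs, p in zip(np.asarray(y), pred):
        table[obs, p] += 1
    return table
```

The percentage of well-classified observations is then the sum of the diagonal divided by the total.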
The ROC curve can also be displayed. The ROC curve (Receiver Operating Characteristics)
displays the performance of a model and enables a comparison to be made with other models.
The terms used come from signal detection theory.
The proportion of well-classified positive events is called the sensitivity. The specificity is the
proportion of well-classified negative events. If you vary the threshold probability from which an
event is to be considered positive, the sensitivity and specificity will also vary. The curve of
points (1-specificity, sensitivity) is the ROC curve.
Let's consider a binary dependent variable which indicates, for example, if a customer has
responded favorably to a mail shot. In the diagram below, the blue curve corresponds to an
ideal case where the n% of people responding favorably corresponds to the n% highest
probabilities. The green curve corresponds to a well-discriminating model. The red curve (first
bisector) corresponds to what is obtained with a random Bernoulli model with a response
probability equal to that observed in the sample studied. A model close to the red curve is
therefore inefficient since it is no better than random generation. A model below this curve
would be disastrous since it would be even worse than random.
The area under the curve (or AUC) is a synthetic index calculated for ROC curves. The AUC
corresponds to the probability that the model assigns a higher probability to a randomly
chosen positive event than to a randomly chosen negative event. For an ideal model, AUC = 1;
for a random model, AUC = 0.5. A model is usually considered good when the AUC value is
greater than 0.7. A well-discriminating model should have an AUC between 0.87 and 0.9. A
model with an AUC greater than 0.9 is excellent.
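Sweeping the cutoff as described above yields the ROC points and the AUC; here is a minimal sketch (trapezoidal integration; ties in the predicted probabilities are ignored, and names are our own):

```python
import numpy as np

def roc_points(y, scores):
    """(1 - specificity, sensitivity) pairs and the AUC.

    y: 0/1 observed events; scores: predicted probabilities."""
    order = np.argsort(-np.asarray(scores))       # descending probability
    y = np.asarray(y)[order]
    tps = np.concatenate(([0], np.cumsum(y)))     # true positives per cutoff
    fps = np.concatenate(([0], np.cumsum(1 - y))) # false positives per cutoff
    sens = tps / tps[-1]                          # sensitivity
    fpr = fps / fps[-1]                           # 1 - specificity
    auc = ((sens[1:] + sens[:-1]) / 2 * np.diff(fpr)).sum()
    return fpr, sens, auc
```

For a perfectly discriminating model the curve hugs the top-left corner and the AUC is 1.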
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Dependent variables:
Response variable(s): Select the response variable(s) you want to model. If several variables
have been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
Response type: Choose the type of response variable you have selected:
Binary variable: If you select this option, you must select a variable containing exactly
two distinct values. If the variable has value 0 and 1, XLSTAT will see to it that the high
probabilities of the model correspond to category 1 and that the low probabilities
correspond to category 0. If the variable has two values other than 0 or 1 (for example
Yes/No), the lower probabilities correspond to the first category and the higher
probabilities to the second.
Sum of binary variables: If your response variable is a sum of binary variables, it must
be of type numeric and contain the number of positive events (event 1) amongst those
observed. The variable corresponding to the total number of events observed for this
observation (events 1 and 0 combined) must then be selected in the "Observation
weights" field. This case corresponds, for example, to an experiment where a dose D (D
is the explanatory variable) of a medicament is administered to 50 patients (50 is the
value of the observation weights) and where it is observed that 40 get better under the
effects of the dose (40 is the response variable).
Multinomial: If your response variable has more than two categories, a multinomial
logit model is estimated. A new field called control category appears. You can choose
between the first (a1=0) and the last (an=0) category to be the reference category.
Explanatory variables:
Quantitative: Activate this option if you want to include one or more quantitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
data selected must be of the numerical type. If the variable header has been selected, check
that the "Variable labels" option has been activated.
Qualitative: Activate this option if you want to include one or more qualitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
selected data may be of any type, but numerical data will automatically be considered as
nominal. If the variable header has been selected, check that the "Variable labels" option has
been activated.
Classic: Activate this option to calculate a logistic regression on the variables selected
in the previous operations.
PCR: Activate this option to calculate a logistic regression on the principal components
extracted from the selected explanatory variables.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Observation weights: This field must be entered if the "sum of binary variables" option has
been chosen. Otherwise, this field is not active. If a column header has been selected, check
that the "Variable labels" option has been activated.
Regression weights: Activate this option if you want to weight the influence of observations to
adjust the model. If you do not activate this option, the weights will be considered as 1.
Weights must be greater than or equal to 0. If a column header has been selected, check that
the "Variable labels" option is activated.
Control category: In the multinomial case, you need to choose which category is the control.
Options tab:
Tolerance: Enter the value of the tolerance threshold below which a variable will automatically
be ignored.
Firth's method: Activate this option to use Firth's penalized likelihood (see description).
Interactions / Level: Activate this option to include interactions in the model then enter the
maximum interaction level (value between 1 and 4).
Confidence interval (%): Enter the percentage range of the confidence interval to use for the
various tests and for calculating the confidence intervals around the parameters and
predictions. Default value: 95.
Stop conditions:
Iterations: Enter the maximum number of iterations for the Newton-Raphson algorithm.
The calculations are stopped when the maximum number of iterations has been
exceeded. Default value: 100.
Convergence: Enter the maximum value of the evolution of the log of the likelihood
from one iteration to another which, when reached, means that the algorithm is
considered to have converged. Default value: 0.000001.
Options specific to the PCR logistic regression
Filter factors: You can activate one of the following two options in order to reduce the number
of factors for which results are displayed.
Minimum %: Activate this option then enter the minimum percentage of the total
variability that the chosen factors must represent.
Maximum Number: Activate this option to set the number of factors to take into
account.
Model selection: Activate this option if you want to use one of the five selection methods
provided:
Best model: This method lets you choose the best model from amongst all the models
which can handle a number of variables varying from "Min variables" to "Max
Variables". Furthermore, the user can choose several "criteria" to determine the best
model.
o Criterion: Choose the criterion from the following list: Likelihood, LR (likelihood
ratio), Score, Wald, Akaike's AIC, Schwarz's SBC.
o Min variables: Enter the minimum number of variables to be used in the model.
o Max variables: Enter the maximum number of variables to be used in the model.
Note: although XLSTAT uses a very powerful algorithm to reduce the number of
calculations required as much as possible, this method can require a long calculation
time.
Stepwise (Forward): The selection process starts by adding the variable with the
largest contribution to the model. If a second variable is such that its entry probability is
lower than the entry threshold value, then it is added to the model. After the third
variable is added, the impact of removing each variable present in the model after it has
been added is evaluated. If the probability of the calculated statistic is greater than the
removal threshold value, the variable is removed from the model.
Stepwise (Backward): This method is similar to the previous one but starts from a
complete model.
Forward: The procedure is the same as for stepwise selection except that variables are
only added and never removed.
Backward: The procedure starts by simultaneously adding all variables. The variables
are then removed from the model following the procedure used for stepwise selection.
Validation tab:
Validation: Activate this option if you want to use a sub-sample of the data to validate the
model.
Validation set: Choose one of the following options to define how to obtain the observations
used for the validation:
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use for the validation.
Prediction tab:
Prediction: Activate this option if you want to select data to use in prediction mode. If you
activate this option, you need to make sure that the prediction dataset is structured like the
estimation dataset: same variables, in the same order, in the selections. However,
variable labels must not be selected: the first row of the selections listed below must
correspond to data.
Quantitative: Activate this option to select the quantitative explanatory variables. The first row
must not include variable labels.
Qualitative: Activate this option to select the qualitative explanatory variables. The first row
must not include variable labels.
Observations labels: Activate this option if observations labels are available. Then select the
corresponding data. If this option is not activated, the observations labels are automatically
generated by XLSTAT (PredObs1, PredObs2, …).
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Correlations: Activate this option to display the explanatory variables correlation matrix.
Goodness of fit statistics: Activate this option to display the table of goodness of fit statistics
for the model.
Type III analysis: Activate this option to display the type III analysis of variance table.
Model coefficients: Activate this option to display the table of coefficients for the model.
Optionally, confidence intervals of type "profile likelihood" can be calculated (see
description).
Standardized coefficients: Activate this option if you want the standardized coefficients (beta
coefficients) for the model to be displayed.
Equation: Activate this option to display the equation for the model explicitly.
Predictions and residuals: Activate this option to display the predictions and residuals for all
the observations.
Multiple comparisons: This option is only active if qualitative explanatory variables have been
selected. Activate this option to display the results of the comparison tests.
Probability analysis: If only one explanatory variable has been selected, activate this option
so that XLSTAT calculates the value of the explanatory variable corresponding to various
probability levels.
Classification table: Activate this option to display the posterior observation classification
table using a cutoff point to be defined (default value 0.5).
Options specific to the classical PCR logistic regression:
Factor loadings: Activate this option to display the coordinates of the variables (factor
loadings). The coordinates are equal to the correlations between the principal components and
the initial variables for normalized PCA.
Factor scores: Activate to display the coordinates of the observations (factor scores) in the
new space created by PCA. The principal components are afterwards used as explanatory
variables in the regression.
Charts tab:
Correlations charts: Activate this option to display charts showing the correlations between
the components and initial variables.
Vectors: Activate this option to display the input variables in the form of vectors.
Observations charts: Activate this option to display charts representing the observations in
the new space.
Labels: Activate this option to have observation labels displayed on the charts. The
number of labels displayed can be changed using the filtering option.
Biplots: Activate this option to display charts representing the observations and variables
simultaneously in the new space.
Vectors: Activate this option to display the initial variables in the form of vectors.
Labels: Activate this option to have observation labels displayed on the biplots. The
number of labels displayed can be changed using the filtering option.
Colored labels: Activate this option to show variable and observation labels in the same color
as the corresponding points. If this option is not activated the labels are displayed in black
color.
N first rows: The N first observations are displayed on the chart. The Number of
observations N to display must then be specified.
N last rows: The N last observations are displayed on the chart. The Number of
observations N to display must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to display.
Results
XLSTAT displays a large number of tables and charts to help in analyzing and interpreting the
results.
Summary statistics: This table displays descriptive statistics for all the variables selected. For
the quantitative variables, the number of missing values, the number of non-missing values,
the mean and the standard deviation (unbiased) are displayed. For qualitative variables,
including the dependent variable, the categories with their respective frequencies and
percentages are displayed.
Correlation matrix: This table displays the correlations between the explanatory variables.
Correspondence between the categories of the response variable and the probabilities:
This table shows which categories of the dependent variable have been assigned probabilities
0 and 1.
Summary of the variables selection: Where a selection method has been chosen, XLSTAT
displays the selection summary. For a stepwise selection, the statistics corresponding to the
different steps are displayed. Where the best model for a number of variables varying from p to
q has been selected, the best model for each number of variables is displayed with the
corresponding statistics and the best model for the criterion chosen is displayed in bold.
Goodness of fit coefficients: This table displays a series of statistics for the independent
model (corresponding to the case where the linear combination of explanatory variables
reduces to a constant) and for the adjusted model.
Observations: The total number of observations taken into account (sum of the weights of
the observations);
Sum of weights: The total number of observations taken into account (sum of the weights of
the observations multiplied by the weights in the regression);
-2 Log(Like.): The logarithm of the likelihood function associated with the model;

R² (McFadden): Coefficient, like the R², between 0 and 1, which measures how well the
model is adjusted. This coefficient is equal to 1 minus the ratio of the log-likelihood of the
adjusted model to the log-likelihood of the independent model;

R² (Cox and Snell): Coefficient, like the R², between 0 and 1, which measures how well the
model is adjusted. This coefficient is equal to 1 minus the ratio of the likelihood of the
independent model to the likelihood of the adjusted model, raised to the power 2/Sw, where
Sw is the sum of weights;

R² (Nagelkerke): Coefficient, like the R², between 0 and 1, which measures how well the
model is adjusted. This coefficient is equal to the R² of Cox and Snell divided by 1
minus the likelihood of the independent model raised to the power 2/Sw;
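As an illustration of the three formulas above, the coefficients can be computed directly from the two log-likelihoods. The sketch below is not XLSTAT code; it is a minimal Python transcription in which all names are our own:

```python
import math

def pseudo_r2(ll_model, ll_null, sw):
    """Pseudo-R2 coefficients computed from log-likelihoods.

    ll_model : log-likelihood of the adjusted model
    ll_null  : log-likelihood of the independent model
    sw       : sum of the observation weights (n when all weights are 1)
    """
    mcfadden = 1.0 - ll_model / ll_null
    # Cox and Snell: 1 - (L_null / L_model)^(2/Sw)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / sw)
    # Nagelkerke: Cox and Snell rescaled so that a perfect model reaches 1
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / sw))
    return mcfadden, cox_snell, nagelkerke
```

For example, with ll_null = -100, ll_model = -60 and 200 observations of weight 1, McFadden's coefficient is 0.4, and by construction the Nagelkerke coefficient is always larger than the Cox and Snell one.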
Test of the null hypothesis H0: Y=p0: The H0 hypothesis corresponds to the independent
model which gives probability p0 whatever the values of the explanatory variables. We seek to
check if the adjusted model is significantly more powerful than this model. Three tests are
available: the likelihood ratio test (-2 Log(Like.)), the Score test and the Wald test. The three
statistics follow a Chi² distribution whose degrees of freedom are shown.
Type III analysis: This table is only useful if there is more than one explanatory variable. Here,
the adjusted model is tested against a test model where the variable in the row of the table in
question has been removed. If the probability Pr > LR is less than a significance threshold
which has been set (typically 0.05), then the contribution of the variable to the adjustment of
the model is significant. Otherwise, it can be removed from the model.
For PCR logistic regression, the first table of the model parameters corresponds to the
parameters of the model that uses the selected principal components. This
table is difficult to interpret. For this reason, a transformation is carried out to obtain model
parameters which correspond to the initial variables.
Model parameters:
Binary case: The parameter estimate, corresponding standard deviation, Wald's Chi²,
the corresponding p-value and the confidence interval are displayed for the constant
and each variable of the model. If the corresponding option has been activated, the
"profile likelihood" intervals are also displayed.
Multinomial case: In the multinomial case, (J-1)*(p+1) parameters are obtained, where
J is the number of categories and p is the number of variables in the model. Thus, for
each explanatory variable and for each category of the response variable (except for
the reference category), the parameter estimate, corresponding standard deviation,
Wald's Chi², the corresponding p-value and the confidence interval are displayed. The
odds-ratios with corresponding confidence intervals are also displayed.
The equation of the model is then displayed to make it easier to read or re-use the model.
The table of standardized coefficients (also called beta coefficients) is used to compare the
relative weights of the variables. The higher the absolute value of a coefficient, the more
important the weight of the corresponding variable. When the confidence interval around a
standardized coefficient contains 0 (this can easily be seen on the chart of standardized
coefficients), the weight of the variable in the model is not significant.
The predictions and residuals table shows, for each observation, its weight, the value of the
qualitative explanatory variable, if there is only one, the observed value of the dependent
variable, the model's prediction, the same values divided by the weights, the standardized
residuals and a confidence interval.
Classification table: Activate this option to display the table showing the percentage of well-
classified observations for both categories. If a validation sample has been extracted, this table
is also displayed for the validation data.
ROC curve: The ROC curve is used to evaluate the performance of the model by means of
the area under the curve (AUC) and to compare several models together (see the description
section for more details).
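The AUC has a simple probabilistic reading: it is the probability that a randomly drawn positive observation receives a higher predicted score than a randomly drawn negative one, ties counting for one half. A minimal sketch (not XLSTAT code; all names are our own):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the fraction of (positive, negative) pairs where the positive
    scores higher, counting ties as 1/2."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 1 means perfect separation of the two categories; 0.5 means the model does no better than chance.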
If only one quantitative variable has been selected, the probability analysis table allows you to
see which value of the explanatory variable corresponds to a given probability of success.
Example
Tutorials on how to use logistic regression and the multinomial logit model are available on the
Addinsoft website:
References
Firth D (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80, 27-38.
Furnival G. M. and Wilson R.W. Jr. (1974). Regressions by leaps and bounds.
Technometrics, 16(4), 499-511.
Hosmer D.W. and Lemeshow S. (2000). Applied Logistic Regression, Second Edition. John
Wiley and Sons, New York.
Lawless J.F. and Singhal K. (1978). Efficient screening of nonnormal regression models.
Biometrics, 34, 318-327.
Tallarida R.J. (2000). Drug Synergism & Dose-Effect Data Analysis, CRC/Chapman & Hall,
Boca Raton.
Venzon, D. J. and Moolgavkar S. H. (1988). A method for computing profile likelihood based
confidence intervals. Applied Statistics, 37, 87-94.
Nonparametric regression
This tool carries out two types of nonparametric regression: Kernel regression and LOWESS
regression.
Description
Nonparametric regression can be used when the hypotheses of the more classical regression
methods cannot be verified, or when we are mainly interested in the predictive quality of
the model and not its structure.
Kernel regression
Kernel regression is a modeling tool which belongs to the family of smoothing methods. Unlike
linear regression which is both used to explain phenomena and for prediction (understanding a
phenomenon to be able to predict it afterwards), Kernel regression is mostly used for
prediction. The structure of the model is variable and complex, the latter working like a filter or
black box. There are many variations of Kernel regression in existence.
As with any modeling method, a learning sample of size nlearn is used to estimate the
parameters of the model. A sample of size nvalid can then be used to evaluate the quality of the
model. Lastly, the model can be applied to a prediction sample of size npred, for which the
values of the dependent variable Y are unknown.
The first characteristic of Kernel Regression is the use of a kernel function, to weigh the
observations of the learning sample, depending on their "distance" from the predicted
observation. The closer the values of the explanatory variables for a given observation of the
learning sample are to the values observed for the observation being predicted, the higher the
weight. Many kernel functions have been suggested. XLSTAT includes the following kernel
functions: Uniform, Triangle, Epanechnikov, Quartic, Triweight, Tricube, Gaussian, and
Cosine.
The second characteristic of Kernel regression is the bandwidth associated with each variable.
It is involved in calculating the kernel and the weights of the observations, and differentiates or
rescales the relative weights of the variables, while at the same time reducing or augmenting
the impact of observations of the learning sample depending on how far they are from the
observation to predict. The term bandwidth comes from filtering methods. The lower the
bandwidth for a given variable and kernel function, the fewer observations will influence the
prediction.
Example: let Y be the dependent variable, and (X1, X2, …, Xk) the k explanatory variables. For
the prediction of yi from observation i (1 ≤ i ≤ nvalid), given the observation j (1 ≤ j ≤ nlearn), the
weight determined using a Gaussian kernel, with a bandwidth fixed to hl for each of the Xl
variables (l = 1, …, k), is given by:

w(i,j) = (1/k) · Σ(l=1..k) [1 / (hl·√(2π))] · exp( -(x(j,l) - x(i,l))² / (2·hl²) )
The third characteristic is the polynomial degree used when fitting the model to the
observations of the learning sample. In the case where the polynomial degree is 0 (constant
polynomial), the Nadaraya-Watson formula is used to compute the i'th prediction:

y(i) = Σ(j=1..nlearn) w(i,j)·y(j) / Σ(j=1..nlearn) w(i,j)
For the constant polynomial, the explanatory variables are only taken into account when
computing the weights of the observations in the learning sample. For higher polynomial
degrees (experience shows that higher orders are not necessary, and XLSTAT works with
polynomials of degrees 0 to 2), the variables are used in calculating a polynomial model. Once
the model has been fitted, it is applied to the validation or prediction sample in order to
estimate the values of the dependent variable.
Once the parameters of the model have been estimated, the prediction value is calculated
using the following formulae:

Degree 1: y(i) = a(0) + Σ(l=1..k) a(l)·x(i,l)

Degree 2: y(i) = a(0) + Σ(l=1..k) a(l)·x(i,l) + Σ(l=1..k) Σ(m=1..k) b(l,m)·x(i,l)·x(i,m)
Notes:

Before the parameters of the polynomial model are estimated, the observations of the
learning sample are weighted using the Nadaraya-Watson formula.

For a 1st or 2nd order model, the polynomial parameters are estimated for each
observation of the validation and prediction samples. This makes Kernel Regression a
numerically intensive method.
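The degree-0 (Nadaraya-Watson) prediction described above can be sketched as follows, here with the Gaussian kernel; this is an illustration with our own names, not XLSTAT's implementation:

```python
import math

def gaussian_weight(x_learn, x_new, h):
    """Weight of one learning observation: average, over the k
    explanatory variables, of a Gaussian kernel with bandwidth h[l]."""
    k = len(x_new)
    total = 0.0
    for l in range(k):
        u = (x_learn[l] - x_new[l]) / h[l]
        total += math.exp(-0.5 * u * u) / (h[l] * math.sqrt(2.0 * math.pi))
    return total / k

def nadaraya_watson(X_learn, y_learn, x_new, h):
    """Degree-0 prediction: kernel-weighted mean of the learning responses."""
    w = [gaussian_weight(xj, x_new, h) for xj in X_learn]
    return sum(wi * yi for wi, yi in zip(w, y_learn)) / sum(w)
```

For instance, predicting at a point equidistant from two learning observations returns the average of their responses, since both receive the same weight.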
Two strategies are suggested in order to restrict the size of the learning sample taken into
account for the estimation of the parameters of the polynomial:
Moving window: to estimate yi, we take into account a fixed number of observations
previously observed. Consequently, with this strategy, the learning sample evolves at
each step.
k nearest neighbours: this method, complementary to the previous, restricts the size of
the learning sample to a given value k.
The weight of observation j of the learning sample for the prediction of observation i is
computed as:

w(i,j) = (1/k) · Σ(l=1..k) (1/hl) · K(u(i,j,l)), where u(i,j,l) = (x(i,l) - x(j,l)) / hl

and where K(u) is the kernel function. The kernel functions available in XLSTAT are defined
as follows:

Uniform: K(u) = 1/2, for |u| ≤ 1

Triangle: K(u) = 1 - |u|, for |u| ≤ 1

Epanechnikov: K(u) = (3/4)·(1 - u²), for |u| ≤ 1

Quartic: K(u) = (15/16)·(1 - u²)², for |u| ≤ 1

Triweight: K(u) = (35/32)·(1 - u²)³, for |u| ≤ 1

Tricube: K(u) = (1 - |u|³)³, for |u| ≤ 1

Gaussian: K(u) = exp(-u²/2) / √(2π)

Cosine: K(u) = (π/4)·cos(π·u/2), for |u| ≤ 1
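As an illustration, the eight kernel functions listed above translate directly into code (a sketch with our own names, not XLSTAT's implementation):

```python
import math

# Kernel functions K(u); all except the Gaussian have support |u| <= 1.
def uniform(u):
    return 0.5 if abs(u) <= 1 else 0.0

def triangle(u):
    return 1.0 - abs(u) if abs(u) <= 1 else 0.0

def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0

def quartic(u):
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1 else 0.0

def triweight(u):
    return 35.0 / 32.0 * (1.0 - u * u) ** 3 if abs(u) <= 1 else 0.0

def tricube(u):
    return (1.0 - abs(u) ** 3) ** 3 if abs(u) <= 1 else 0.0

def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def cosine(u):
    return math.pi / 4.0 * math.cos(math.pi * u / 2.0) if abs(u) <= 1 else 0.0
```

All of them give their maximum weight at u = 0 (an observation identical to the one being predicted) and, except for the Gaussian, a zero weight beyond |u| = 1.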
LOWESS regression
LOWESS regression (Locally weighted regression and smoothing scatter plots) was
introduced by Cleveland (1979) in order to create smooth curves through scattergrams. New
versions have since been perfected to increase the robustness of the models. LOWESS
regression is very similar to Kernel regression as it is also based on polynomial regression and
requires a kernel function to weight the observations.
The LOWESS algorithm can be described as follows. For each point i to predict:

1 - The Euclidean distances d(i,j) between observation i and the observations j are computed.
The fraction f of the N closest observations to observation i is selected. The weights of the
selected points are computed by applying the Tricube kernel to the following normalized
distance:

D(i,j) = d(i,j) / Max_j(d(i,j))

Weight(j) = Tricube(D(i,j))

2 - A polynomial model is fitted on the selected observations using these weights, and a first
prediction is computed for observation i.

3 - The residuals r(j) of the fitted model are computed.

4 - New weights are computed by applying the Tricube kernel to the following quantity, and
the local polynomial model is fitted again with these weights:

D'(i,j) = r(j) / (6·Median_j(|r(j)|))

where r(j) is the residual corresponding to observation j after the previous step.

5 - Steps 3 and 4 are performed a second time. A final prediction is then computed for
observation i.
Notes:
The only input parameters of the method, apart from the observations, are the fraction f
of nearest individuals (in % in XLSTAT) and the polynomial degree.
Robust LOWESS regression is about three times more time consuming than LOWESS
regression.
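The first, non-robust pass of the algorithm (neighbour selection, Tricube weights, local degree-1 fit) can be sketched as follows; names and the fallback for degenerate fits are our own, and this is not XLSTAT's implementation:

```python
def lowess_point(x, y, x0, f=0.5):
    """Non-robust LOWESS prediction at x0: keep the fraction f of nearest
    points, weight them with the Tricube kernel, fit a weighted line."""
    n = len(x)
    r = max(2, int(round(f * n)))
    nearest = sorted(range(n), key=lambda j: abs(x[j] - x0))[:r]
    dmax = max(abs(x[j] - x0) for j in nearest) or 1.0
    w = {j: (1.0 - (abs(x[j] - x0) / dmax) ** 3) ** 3 for j in nearest}
    # Weighted least-squares line through the selected points.
    sw = sum(w[j] for j in nearest)
    swx = sum(w[j] * x[j] for j in nearest)
    swy = sum(w[j] * y[j] for j in nearest)
    swxx = sum(w[j] * x[j] * x[j] for j in nearest)
    swxy = sum(w[j] * x[j] * y[j] for j in nearest)
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:
        return swy / sw                 # degenerate fit: weighted mean
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0
```

On data that are exactly linear, the local fit recovers the underlying line whatever the weights, so the prediction is exact; the robust variant would then reweight the points by their residuals and refit, as in steps 3 to 5.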
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Dependent variables:
Quantitative: Select the response variable(s) you want to model. If several variables have
been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
X / Explanatory variables:
Quantitative: Activate this option if you want to include one or more quantitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
data selected must be of type numeric. If the variable header has been selected, check that
the "Variable labels" option has been activated.
Qualitative: Activate this option if you want to include one or more qualitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
selected data may be of any type, but numerical data will automatically be considered as
nominal. If the variable header has been selected, check that the "Variable labels" option has
been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Polynomial degree: enter the order of the polynomial if the LOWESS regression or a
polynomial method is chosen.
Options tab:
Learning samples:
Moving window: choose this option if you want the size of the learning sample to be
constant. You need to enter the size S of the window. In that case, to estimate Y(i+1),
the observations i-S+1 to i will be used, and the first observation for which XLSTAT will
be able to compute a prediction is the (S+1)th observation.
Expanding window: choose this option if you want the size of the learning sample to
expand step by step. You need to enter the initial size S of the window. In that
case, to estimate Y(i+1), observations 1 to i will be used, and the first observation for
which XLSTAT will be able to compute a prediction is the (S+1)th observation.
All: the learning and validation samples are identical. This method is of no interest for
prediction, but it is a way to evaluate the method in the case of perfect information.
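The windowing strategies above can be made concrete by listing which learning observations each one retains (a sketch with our own names and 0-based indices, not XLSTAT's implementation):

```python
def learning_indices(i, strategy, S):
    """0-based indices of the learning observations used to predict
    observation i under the two windowing strategies."""
    if strategy == "moving":
        # the S observations immediately preceding observation i
        return list(range(max(0, i - S), i))
    if strategy == "expanding":
        # all observations preceding observation i
        return list(range(0, i))
    raise ValueError("unknown strategy: %s" % strategy)
```

For example, with a moving window of size 3, observation 10 is predicted from observations 7, 8 and 9, whereas the expanding window would use observations 0 through 9.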
K nearest neighbours:
Rows: the k points retained for the analysis are k points which are the closest to the
point to predict, for a given bandwidth and a given kernel function. k is the value to enter
here.
%: the points retained for the analysis are the closest to the point to predict and
represent x% of the total learning sample available, where x is the value to enter.
Tolerance: Enter the value of the tolerance threshold below which a variable will automatically
be ignored.
Interactions / Level: Activate this option to include interactions in the model then enter the
maximum interaction level (value between 1 and 4).
Kernel: the kernel function that will be used. The possible options are: Uniform, Triangle,
Epanechnikov, Quartic, Triweight, Tricube, Gaussian, Cosine. A description of these functions
is available in the description section.
Bandwidth: XLSTAT allows you to choose a method for automatically computing the
bandwidths (one per variable), or you can fix them. The different options are:
Constant: the bandwidth is constant and equal to the fixed value. Enter the value of the
bandwidth.
Fixed: the bandwidths are defined in a vertical range of cells in an Excel sheet, which
you need to select. The number of cells must be equal to the number of explanatory
variables, and the values must be in the same order as the variables.
Range: the value hl of the bandwidth for each variable Xl is determined from the range
of the variable computed on the learning sample.
Standard deviation: the value hl of the bandwidth for each explanatory variable is
equal to the standard deviation of the variable computed on the learning sample.
Validation tab:
Validation: Activate this option if you want to use a sub-sample of the data to validate the
model.
Validation set: Choose one of the following options to define how to obtain the observations
used for the validation:
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use for the validation.
Prediction tab:
Prediction: Activate this option if you want to select data to use in prediction mode. If you
activate this option, you need to make sure that the prediction dataset is structured like the
estimation dataset: same variables in the same order in the selections. However,
variable labels must not be selected: the first row of the selections listed below must
correspond to data.
Quantitative: Activate this option to select the quantitative explanatory variables. The first row
must not include variable labels.
Qualitative: Activate this option to select the qualitative explanatory variables. The first row
must not include variable labels.
Observations labels: Activate this option if observations labels are available. Then select the
corresponding data. If this option is not activated, the observations labels are automatically
generated by XLSTAT (PredObs1, PredObs2, …).
Missing data tab:
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
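A minimal sketch of the Mean or mode option (our own names; None stands for a missing value; this is not XLSTAT's implementation):

```python
from collections import Counter

def impute_mean_mode(rows, quantitative):
    """Column-wise imputation: mean for quantitative columns (whose
    indices are listed in `quantitative`), mode for qualitative ones.
    Missing values are represented by None."""
    cols = list(zip(*rows))
    filled = []
    for idx, col in enumerate(cols):
        present = [v for v in col if v is not None]
        if idx in quantitative:
            rep = sum(present) / len(present)          # column mean
        else:
            rep = Counter(present).most_common(1)[0][0]  # column mode
        filled.append([rep if v is None else v for v in col])
    return [list(r) for r in zip(*filled)]
```

The nearest-neighbour option would instead copy the missing entries from the complete observation closest to the incomplete one.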
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Correlations: Activate this option to display the explanatory variables correlation matrix.
Goodness of fit statistics: Activate this option to display the table of goodness of fit statistics
for the model.
Predictions and residuals: Activate this option to display the predictions and residuals for all
the observations.
Charts tab:
Data and predictions: Activate this option to display the chart of observations and predictions:
As a function of X1: Activate this option to display the observed and predicted
observations as a function of the values of the X1 variable.
As a function of time: Activate this option to select the data giving the date of each
observation to display the results as a function of time.
Results
Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the
variables selected. The number of missing values, the number of non-missing values, the
mean and the standard deviation (unbiased) are displayed for the quantitative variables. For
qualitative variables, including the dependent variable, the categories with their respective
frequencies and percentages are displayed.
Correlation matrix: This table displays the correlations between the selected variables.
The determination coefficient R²;
The sum of squares of the errors (or residuals) of the model (SSE or SSR respectively);
The means of the squares of the errors (or residuals) of the model (MSE or MSR);
The root mean squares of the errors (or residuals) of the model (RMSE or RMSR).
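These statistics translate directly into code; the unweighted sketch below uses our own names and is not XLSTAT's implementation:

```python
import math

def fit_statistics(y_obs, y_pred):
    """SSE, MSE, RMSE and R2 for a fitted model (unweighted sketch)."""
    n = len(y_obs)
    sse = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    mse = sse / n
    rmse = math.sqrt(mse)
    mean = sum(y_obs) / n
    sst = sum((o - mean) ** 2 for o in y_obs)   # total sum of squares
    r2 = 1.0 - sse / sst if sst > 0 else float("nan")
    return {"SSE": sse, "MSE": mse, "RMSE": rmse, "R2": r2}
```

A perfect fit gives SSE = 0 and R² = 1; the closer R² is to 1, the larger the share of the variability of the dependent variable explained by the model.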
Predictions and residuals: Table giving for each observation the input data, the value
predicted by the model and the residuals.
Charts:
If only one quantitative explanatory variable or temporal variable has been selected ("As a
function of time" option in the "Charts" tab in the dialog box), the first chart shows the data and
the curve for the predictions made by the model. If the "As a function of X1" option has been
selected, the first chart shows the observed data and predictions as a function of the first
explanatory variable selected. The second chart displayed is the bar chart of the residuals.
Example
http://www.xlstat.com/demo-kernel.htm
References
Cleveland W.S. (1979). Robust locally weighted regression and smoothing scatterplots. J.
Amer. Statist. Assoc., 74, 829-836.
Cleveland W.S. (1994). The Elements of Graphing Data. Hobart Press, Summit, New Jersey.
Wand M.P. and Jones M.C. (1995). Kernel Smoothing. Chapman and Hall, New York.
Watson G.S. (1964). Smooth regression analysis. Sankhy Ser.A, 26, 101-116.
Nonlinear regression
Use this tool to fit data to any linear or non-linear function. The method used is least squares.
Either pre-programmed functions or functions added by the user may be used.
Description
Nonlinear regression is used to model complex phenomena which cannot be handled by the
linear model. XLSTAT provides preprogrammed functions from which the user can select the
model that describes the phenomenon to be modeled.
When the model required is not available, the user can define a new model and add it to their
personal library. To improve the speed and reliability of the calculations, it is recommended to
add derivatives of the function for each of the parameters of the model.
When this is possible (preprogrammed functions or user defined functions where the first
derivatives have been entered by the user) the Levenberg-Marquardt algorithm is used. When
the derivatives are not available, a more complex and slower but efficient algorithm is used.
This algorithm does not, however, enable the standard deviations of the parameter estimators
to be obtained.
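To illustrate the principle (this is not XLSTAT's implementation), the sketch below fits the hypothetical model y = a·exp(b·x) by a damped Gauss-Newton iteration, a simplified relative of Levenberg-Marquardt; it relies on the analytical derivatives of the function and on Iterations/Convergence stop conditions analogous to those of this tool:

```python
import math

def gauss_newton_exp(x, y, a=1.0, b=0.0, max_iter=200, tol=1e-5):
    """Least-squares fit of y = a*exp(b*x) by damped Gauss-Newton,
    using the analytical derivatives dy/da = exp(b*x) and
    dy/db = a*x*exp(b*x)."""
    def sse(a_, b_):
        return sum((yi - a_ * math.exp(b_ * xi)) ** 2 for xi, yi in zip(x, y))

    prev = sse(a, b)
    for _ in range(max_iter):
        # Build the 2x2 normal equations (J'J) d = J'r.
        j11 = j12 = j22 = g1 = g2 = 0.0
        for xi, yi in zip(x, y):
            e = math.exp(b * xi)
            d1, d2 = e, a * xi * e      # partial derivatives
            r = yi - a * e              # residual
            j11 += d1 * d1
            j12 += d1 * d2
            j22 += d2 * d2
            g1 += d1 * r
            g2 += d2 * r
        det = j11 * j22 - j12 * j12
        if abs(det) < 1e-12:
            break
        da = (j22 * g1 - j12 * g2) / det
        db = (j11 * g2 - j12 * g1) / det
        # Damping: halve the step until the SSE improves.
        step = 1.0
        while step > 1e-8 and sse(a + step * da, b + step * db) > prev:
            step *= 0.5
        a += step * da
        b += step * db
        cur = sse(a, b)
        if abs(prev - cur) < tol:       # convergence stop condition
            break
        prev = cur
    return a, b
```

When the derivatives are supplied, each iteration is cheap and, near the solution, convergence is very fast; without them, a derivative-free algorithm must be used, which is slower and does not yield the standard deviations of the estimators.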
Syntax:
The library of user functions is held in the file Models.txt, in the user directory defined during
installation or via the XLSTAT options dialog box. The library is built as follows:
Rows 4 to (3 + N1): derivatives definition for function 1
When the derivatives have not been supplied by the user, "Unknown" replaces the derivatives
of the function.
You can modify the items of this file manually, but you should be careful not to introduce errors.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Dependent variables:
Quantitative: Select the response variable(s) you want to model. If several variables have
been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
X / Explanatory variables:
Quantitative: Activate this option if you want to include one or more quantitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
data selected must be of type numeric. If the variable header has been selected, check that
the "Variable labels" option has been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will all be taken as 1. Weights must be greater than or equal to 0. If the
Variable labels option is activated you need to include a header in the selection.
Functions tab:
Built-in function: Activate this option to fit one of the functions available from the list of built-in
functions to the data. Select a function from the list.
Edit: Click this button to display the active built-in function in the "Function: Y=" field. You can
then copy the function to afterwards change it to create a new function or the derivatives of a
new function.
User defined functions: Activate this option to fit one of the functions available from the list of
user-defined functions to the data, or to add a new function.
Delete: Click this button to delete the active function from the list of user-defined functions.
Add: Click this button to add a function to the list of user-defined functions. You must then
enter the function in the "Function: Y=" field. If you wish, you can also select the derivatives of
the function for each of the parameters; this will speed up the calculations and enable the
standard deviations of the parameters to be obtained. To do this, activate the "Derivatives"
option, then select the derivatives in an Excel worksheet.

Derivatives: These will speed up the calculations and enable the standard deviations of the
parameters to be obtained.
Options tab:
Initial values: Activate this option to give XLSTAT a starting point. Select the cells which
correspond to the initial values of the parameters. The number of rows selected must be the
same as the number of parameters.
Parameters bounds: Activate this option to give XLSTAT a possible region for all the
parameters of the model selected. You must then select a two-column range, the left column
containing the lower bounds and the right column the upper bounds. The number of rows
selected must be the same as the number of parameters.
Parameters labels: Activate this option if you want to specify the names of the parameters.
XLSTAT will display the results using the selected labels instead of using generic labels pr1,
pr2, etc. The number of rows selected must be the same as the number of parameters.
Stop conditions:
Iterations: Enter the maximum number of iterations for the algorithm. The calculations
are stopped when the maximum number of iterations has been exceeded. Default value:
200.
Convergence: Enter the maximum value of the evolution in the Sum of Squares of
Errors (SSE) from one iteration to another which, when reached, means that the
algorithm is considered to have converged. Default value: 0.00001.
Validation tab:
Validation: Activate this option if you want to use a sub-sample of the data to validate the
model.
Validation set: Choose one of the following options to define how to obtain the observations
used for the validation:
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use for the validation.
Prediction tab:
Prediction: Activate this option if you want to select data to use in prediction mode. If you
activate this option, you need to make sure that the prediction dataset is structured like the
estimation dataset: same variables in the same order in the selections. However,
variable labels must not be selected: the first row of the selections listed below must
correspond to data.
Quantitative: Activate this option to select the quantitative explanatory variables. The first row
must not include variable labels.
Qualitative: Activate this option to select the qualitative explanatory variables. The first row
must not include variable labels.
Observations labels: Activate this option if observations labels are available. Then select the
corresponding data. If this option is not activated, the observations labels are automatically
generated by XLSTAT (PredObs1, PredObs2, …).
Missing data tab:

Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Correlations: Activate this option to display the explanatory variables correlation matrix.
Goodness of fit statistics: Activate this option to display the table of goodness of fit statistics
for the model.
Model parameters: Activate this option to display the values of the parameters for the model
after fitting.
Equation of the model: Activate this option to display the equation of the model once fitted.
Predictions and residuals: Activate this option to display the predictions and residuals for all
the observations.
Charts tab:
Data and predictions: Activate this option to display the chart of observations and the
curve for the fitted function.
Results
Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the
variables selected: the number of observations, the number of missing values, the number of
non-missing values, the mean and the standard deviation (unbiased).
Correlation matrix: This table displays the correlations between the selected variables.
The determination coefficient R²;
The sum of squares of the errors (or residuals) of the model (SSE or SSR respectively);
The means of the squares of the errors (or residuals) of the model (MSE or MSR);
The root mean squares of the errors (or residuals) of the model (RMSE or RMSR);
Model parameters: This table gives the value of each parameter after fitting to the model. For
built-in functions, or user-defined functions when derivatives for the parameters have been
entered, the standard deviations of the estimators are calculated.
Predictions and residuals: This table gives for each observation the input data, the value
predicted by the model and the residuals. It is followed by the equation of the model.
Charts:
If only one quantitative explanatory variable has been selected, the first chart represents the
data and the curve for the function chosen. The second chart displayed is the bar chart of the
residuals.
Example
Tutorials showing how to run a nonlinear regression are available on the Addinsoft website on
the following pages:
http://www.xlstat.com/demo-nonlin.htm
http://www.xlstat.com/demo-nonlin2.htm
References
Ramsay J.O. and Silverman B.W. (1997). Functional Data Analysis. Springer-Verlag, New York.
Ramsay J.O. and Silverman B.W. (2002). Applied Functional Data Analysis. Springer-Verlag,
New York.
295
Classification and regression trees
Classification and regression trees are methods that deliver models that meet both explanatory
and predictive goals. Two of the strengths of this method are, on the one hand, the simple
graphical representation by trees, and, on the other hand, the compact format of the natural
language rules. We distinguish the following two cases where these modeling techniques
should be used:
- use classification trees to explain and predict the belonging of objects (observations,
individuals) to a class, on the basis of quantitative and qualitative explanatory variables;
- use regression trees to build an explanatory and predictive model for a quantitative dependent
variable based on quantitative and qualitative explanatory variables.
Note: Sometimes the term segmentation tree or decision tree is employed when talking of the
abovementioned models.
Description
Classification and regression tree analysis has been proposed in several forms. AID trees
(Automatic Interaction Detection) were developed by Morgan and Sonquist (1963).
CHAID (CHi-square Automatic Interaction Detection) was proposed by Kass (1980) and later
enriched by Biggs (Biggs et al., 1991), who introduced the exhaustive CHAID procedure.
The name of the Classification And Regression Trees (CART) method comes from the title of
the book by Breiman et al. (1984). The QUEST (QUick, Efficient, Statistical Tree) method is more
recent (Loh and Shih, 1997).
These methods can be used for both explanatory and predictive purposes.
XLSTAT offers the choice between four different methods of classification and regression tree
analysis: CHAID, exhaustive CHAID, CART and QUEST. In most cases CHAID and exhaustive
CHAID give the best results. In special situations the two other methods can be of interest.
CHAID is the only algorithm implemented in XLSTAT that can lead to non-binary trees.
With all these methods, explanatory quantitative variables are transformed into discrete
variables with k categories. The discretization is performed using Fisher's method (this
method is available in the Univariate partitioning function).
CHAID and exhaustive CHAID
These two methods proceed in three steps: splitting, merging and stopping.
Splitting: Starting with the root node that contains all the objects, the best split variable is the
one for which the p-value is lower than the user-defined threshold. In the case of a quantitative
dependent variable, an ANOVA F-test is used to compare the mean values of the dependent
variable Y for each of the categories of the explanatory variable X. In the case of a qualitative
dependent variable, the user can choose between the Pearson Chi-square and the maximum
likelihood ratio statistics.
Merging: In the case of a qualitative split variable, the procedure tries to merge similar
categories of that variable into common sub-nodes. In the case of exhaustive CHAID, this step
is repeated until only two sub-nodes remain, which is why exhaustive CHAID leads to a binary
tree. During the merge, the Pearson Chi-square or the maximum likelihood ratio statistic is
computed. If the maximum p-value is greater than the user-defined threshold, the two
corresponding groups of categories are merged. This step is repeated recursively until the
maximum p-value is smaller than or equal to the threshold, or until there are only two remaining
categories.
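A simplified sketch of one merging step follows. It assumes a qualitative dependent variable with exactly two classes (so the pairwise chi-square test has one degree of freedom) and uses illustrative names and defaults; it is not XLSTAT's implementation:

```python
import math
from itertools import combinations

def chi2_stat(row_a, row_b):
    """Pearson chi-square statistic for a 2 x c contingency table
    (assumes all row and column totals are positive)."""
    col = [a + b for a, b in zip(row_a, row_b)]
    n_a, n_b, n = sum(row_a), sum(row_b), sum(col)
    stat = 0.0
    for j, c in enumerate(col):
        for obs, tot in ((row_a[j], n_a), (row_b[j], n_b)):
            exp = tot * c / n
            stat += (obs - exp) ** 2 / exp
    return stat

def p_value_df1(stat):
    # survival function of the chi-square distribution with 1 df
    # (valid here because the dependent variable has two classes)
    return math.erfc(math.sqrt(stat / 2.0))

def merge_step(tables, threshold=0.05):
    """One CHAID-style merge step: find the most similar pair of
    categories and merge it if the pair is not significantly different.
    `tables` maps each category of the split variable to the class
    counts of the dependent variable within that category."""
    (a, b), best_p = max(
        ((pair, p_value_df1(chi2_stat(tables[pair[0]], tables[pair[1]])))
         for pair in combinations(tables, 2)),
        key=lambda t: t[1])
    if best_p > threshold:          # most similar pair: merge and repeat
        merged = dict(tables)
        merged[a + "+" + b] = [x + y for x, y in zip(merged.pop(a), merged.pop(b))]
        return merged
    return tables                   # all pairs differ significantly: stop
```

The caller repeats `merge_step` until the table it gets back is unchanged or only two categories remain.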
Stopping: For every newly created sub-node the stop criteria are checked. If none of the
criteria are met, the node is treated in the same way as the root node. The stop criteria are the
following:
Pure node: The node contains only objects of one category or one value of the dependent
variable.
Maximum tree depth: The level of the node has reached the user-defined maximum tree
depth.
Minimum size for a parent-node: The node contains fewer objects than the user-defined
minimum size for a parent-node.
Minimum size for a son-node: After splitting this node, at least one sub-node would have a
size smaller than the user-defined minimum size for a son-node.
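The node-level part of these criteria can be expressed as a small check, sketched here with illustrative parameter names and defaults (the minimum son-node size is checked against candidate sub-nodes only once a split has been proposed, so it is left out of this sketch):

```python
def should_stop(node_values, depth, max_depth=5, min_parent=30):
    """Return True if the node must not be split further.
    `node_values` holds the dependent-variable value of each object
    in the node; parameter defaults are illustrative, not XLSTAT's."""
    if len(set(node_values)) <= 1:      # pure node
        return True
    if depth >= max_depth:              # maximum tree depth reached
        return True
    if len(node_values) < min_parent:   # too small to act as a parent-node
        return True
    return False
```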
CART
This method recursively verifies for each node whether a split is possible using the selected
measure. Several measures of impurity are available. In the case of a quantitative dependent
variable, a measure based on the LSD (Least Square Deviation) is used. In the case of a
qualitative dependent variable, the user has the choice between the Gini and the Twoing
indexes. For a quantitative explanatory variable, a univariate partitioning into k clusters is
carried out, and the k-1 possible split points are then tested. In the case of a qualitative
explanatory variable, every possible grouping of the k categories into 2 subsets is tested
(there are 2^(k-1) - 1 possibilities).
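As an illustration, the Gini index and the impurity decrease that a CART-style split maximizes can be written as follows (textbook formulation; not necessarily XLSTAT's exact code):

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent, left, right):
    """Impurity decrease of a binary split: the parent impurity minus the
    size-weighted impurities of the two candidate sub-nodes."""
    n = sum(parent)
    n_l, n_r = sum(left), sum(right)
    return gini(parent) - (n_l / n) * gini(left) - (n_r / n) * gini(right)
```

CART evaluates `gini_gain` for every candidate split of every variable and retains the split with the largest decrease.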
For every newly created sub-node the stop criteria are checked. If none of the criteria are met,
the node is treated in the same way as the root node.
Pure node: The node contains only objects of one class or one value of the dependent
variable.
Maximum tree depth: The level of the node has reached the user-defined maximum tree
depth.
Minimum size for a parent-node: The node contains fewer objects than the user-defined
minimum size for a parent-node.
Minimum size for a son-node: After splitting this node, at least one sub-node would have a
size smaller than the user-defined minimum size for a son-node.
QUEST
This method can only be applied to qualitative dependent variables. It carries out a
split using two separate sub-steps: first, the best splitting variable among the
explanatory variables is identified; second, the split point for that variable is calculated.
Selection of the split variable: For each quantitative explanatory variable X, an ANOVA F-test is
carried out to compare the mean values of X for the different
categories of the qualitative dependent variable Y. For each qualitative explanatory
variable, a Chi-square test is performed. We define X* as the
explanatory variable for which the p-value is the smallest. If the p-value corresponding to X* is
smaller than alpha / p, where alpha is the user-defined threshold and p is the number of
explanatory variables, then X* is chosen as the split variable. If no such X* is found,
Levene's F statistic is calculated for all the quantitative explanatory variables. We define X**
as the explanatory variable corresponding to the smallest p-value. If the p-value of X** is smaller
than alpha / (p + pX), pX being the number of quantitative explanatory variables, then X** is
chosen as the split variable. If no such X** is found, the node is not split.
Choice of the split point: In the case of a qualitative explanatory variable X, the variable is
first transformed into a quantitative variable X'. The detailed description of the transformation
can be found in Loh and Shih (1997). In the case of a quantitative variable, similar classes of Y
are first grouped together by a k-means clustering of the mean values of X until two
groups of classes are obtained. Then, a discriminant analysis using a quadratic model is carried
out on these two groups of classes, in order to determine the optimal split point for that variable.
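The grouping of the classes of Y into two superclasses can be sketched as follows. In one dimension the 2-means optimum is a contiguous partition of the sorted class means, so the sketch simply tries every cut point; this is an assumption-laden simplification of the step described above, not Loh and Shih's full procedure:

```python
def two_means_groups(class_means):
    """Split the classes of Y into two superclasses by 2-means on the
    one-dimensional class means of X. Exhaustively tries every threshold
    between sorted means, which is equivalent to 2-means in 1-D."""
    items = sorted(class_means.items(), key=lambda kv: kv[1])
    best, best_cost = None, float("inf")
    for cut in range(1, len(items)):
        left, right = items[:cut], items[cut:]
        cost = 0.0
        for group in (left, right):
            m = sum(v for _, v in group) / len(group)   # group centroid
            cost += sum((v - m) ** 2 for _, v in group)  # within-group SS
        if cost < best_cost:
            best, best_cost = ([k for k, _ in left], [k for k, _ in right]), cost
    return best
```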
Stop conditions: For every newly created sub-node the stop criteria are checked. If none of the
criteria are met, the node is treated in the same way as the root node.
Pure node: The node contains only objects of one class or one value of the dependent
variable.
Maximum tree depth: The level of the node has reached the user-defined maximum tree
depth.
Minimal parent node size: The node contains fewer objects than the user-defined minimal
parent node size.
Minimal son node size: After splitting this node, a sub-node would exist whose size would be
smaller than the user-defined minimal son node size.
Among the numerous results provided, XLSTAT can display the classification table (also called
confusion matrix) used to calculate the percentage of well-classified observations. When only
two classes are present in the dependent variable, the ROC curve may also be displayed.
The ROC curve (Receiver Operating Characteristics) displays the performance of a model and
enables a comparison to be made with other models. The terms used come from signal
detection theory.
The proportion of well-classified positive events is called the sensitivity. The specificity is the
proportion of well-classified negative events. If you vary the threshold probability from which an
event is to be considered positive, the sensitivity and specificity will also vary. The curve of
points (1-specificity, sensitivity) is the ROC curve.
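The construction of the curve can be sketched as follows (a minimal illustration; `pos_scores` and `neg_scores` are hypothetical model scores for the positive and negative events):

```python
def roc_points(pos_scores, neg_scores):
    """ROC curve points (1 - specificity, sensitivity) obtained by
    sweeping the decision threshold over every observed score."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for thr in thresholds:
        # sensitivity: share of positives classified positive at this threshold
        sens = sum(s >= thr for s in pos_scores) / len(pos_scores)
        # specificity: share of negatives classified negative
        spec = sum(s < thr for s in neg_scores) / len(neg_scores)
        points.append((1.0 - spec, sens))
    return points
```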
Let's consider a binary dependent variable which indicates, for example, if a customer has
responded favorably to a mail shot. In the diagram below, the blue curve corresponds to an
ideal case where the n% of people responding favorably corresponds to the n% highest
probabilities. The green curve corresponds to a well-discriminating model. The red curve (first
bisector) corresponds to what is obtained with a random Bernoulli model with a response
probability equal to that observed in the sample studied. A model close to the red curve is
therefore inefficient since it is no better than random assignment. A model below this curve
would be disastrous since it would be even worse than random.
The area under the curve (AUC) is a synthetic index calculated for ROC curves. The AUC
corresponds to the probability that a positive event is given a higher probability by
the model than a negative event. For an ideal model, AUC = 1, and for a random model, AUC =
0.5. A model is usually considered good when the AUC value is greater than 0.7. A well-
discriminating model must have an AUC between 0.87 and 0.9. A model with an AUC
greater than 0.9 is excellent.
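This probabilistic interpretation of the AUC can be computed directly by comparing every (positive, negative) pair of scores, as sketched below (ties counted as one half):

```python
from itertools import product

def auc(pos_scores, neg_scores):
    """AUC as the probability that the model scores a positive event
    above a negative one; ties contribute 1/2."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos_scores, neg_scores))
    return wins / (len(pos_scores) * len(neg_scores))
```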
Lastly, you are advised to validate the model on a validation sample wherever possible.
XLSTAT has several options for generating a validation sample automatically.
Classification and regression trees apply to quantitative and qualitative dependent variables. In
the case of Discriminant analysis or logistic regression, only qualitative dependent variables
can be used. In the case of a qualitative dependent variable with only two categories, the user
will be able to compare the performances of both methods by using ROC curves.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to start the computations.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Dependent variables: Select the dependent variable(s) you want to model. If several
variables have been selected, XLSTAT carries out calculations for each of the variables
separately. If a column header has been selected, check that the "Variable labels" option has
been activated.
Response type: Confirm the type of response variable you have selected:
Quantitative: Activate this option if the selected dependent variables are quantitative.
Qualitative: Activate this option if the selected dependent variables are qualitative.
X / Explanatory variables:
Quantitative: Activate this option if you want to include one or more quantitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
data selected may be of the numerical type. If a variable header has been selected, check that
the "Variable labels" option has been activated.
Qualitative: Activate this option if you want to include one or more qualitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
selected data may be of any type, but numerical data will automatically be considered as
nominal. If a variable header has been selected, check that the "Variable labels" option has
been activated.
Observation weights: Activate this option if the observations are weighted. If you do not
activate this option, the weights will all be considered as 1. Weights must be greater than or
equal to 0. If a column header has been selected, check that the "Variable labels" option has
been activated.
Method: Choose the method to be used. CHAID, exhaustive CHAID, CART and Quest are
possible choices. In the case of Quest, the Response type is automatically changed to
qualitative.
Measure: In the case of the CHAID or exhaustive CHAID methods with a qualitative response
type, the user can choose between the Pearson Chi-square and the likelihood ratio measures.
In the case of the CART method together with a qualitative response type, the user can
choose between the Gini and Twoing measures.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2…).
Options tab:
Minimum parent size: Enter the minimum number of objects that a node must contain
to be split.
Minimum son size: Enter the minimum number of objects that every newly created
node must contain after a possible split in order to allow the splitting.
Significance level (%): Enter the significance level. This value is compared to the p-values of
the F and Chi-square tests. p-values smaller than this value authorize a split. This option is not
active for the CART method.
CHAID options: these options are only active with the CHAID methods for the grouping or
splitting of qualitative explanatory variables.
Merge threshold: Enter the value of the merge significance threshold. Significance
values smaller than this value lead to merging two subgroups of categories. The
categories of a qualitative explanatory variable may be merged to simplify the
computations and the visualization of results.
Authorize redivision: Activate this option if you want to allow previously merged
categories to be split again.
o Split threshold: Enter the value of the split significance threshold. P-values
greater than this value lead to splitting the categories or groups of categories
into two subgroups of categories.
Number of intervals: This option is only active if quantitative explanatory variables have been
selected. You can choose the maximum number of intervals generated during the
discretization of the quantitative explanatory variables using univariate partitioning by
Fisher's method. The maximum value is 10.
Stop conditions: If the observations are weighted and the CHAID method is being used, the
node weights are calculated by a converging procedure. In that case the convergence
criteria can be defined.
Iterations: Enter the maximum number of iterations for the calculation of the node weights.
Validation tab:
Validation: Activate this option if you want to use a sub-sample of the data to validate the
model.
Validation set: Choose one of the following options to define how to obtain the observations
used for the validation:
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use for the validation.
Prediction tab:
Prediction: Activate this option if you want to select data to use in prediction mode. If
you activate this option, you need to make sure that the prediction dataset is structured like the
estimation dataset: same variables in the same order in the selections. However,
variable labels must not be selected: the first row of the selections listed below must
correspond to data.
Quantitative: Activate this option to select the quantitative explanatory variables. The first row
must not include variable labels.
Qualitative: Activate this option to select the qualitative explanatory variables. The first row
must not include variable labels.
Observations labels: Activate this option if observations labels are available. Then select the
corresponding data. If this option is not activated, the observations labels are automatically
generated by XLSTAT (PredObs1, PredObs2…).
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Tree structure: Activate this option to display a table of the nodes, and information on the
number of objects, the p-value of the split, and the first two son-nodes. In the case of a
qualitative dependent variable the predicted category is displayed. In the case of a quantitative
dependent variable, the expected value of the node is displayed.
Node frequencies: Activate this option to display the absolute and relative frequencies of the
different nodes.
Rules: This table displays the rules in natural language, by default only for the dominant
categories of each node. Activate the All categories option to display the rules for all the
categories of the dependent variable and all nodes.
Results by object: Activate this option to display for each observation, the observed category,
the predicted category, and, in the case of a qualitative dependent variable, the probabilities
corresponding to the various categories of the dependent variable.
Confusion matrix: Activate this option to display the table showing the numbers of well- and
badly-classified observations for each of the categories.
Charts tab:
Tree chart: Activate this option to display the classification and regression tree graphically.
Pruning can be done with the help of the contextual menu of the tree chart.
Bar charts: Choose this option so that on the tree, the relative frequencies of the
categories are displayed using a bar chart.
o Frequencies: Activate this option to display the frequencies on the bar charts.
o %: Activate this option to display the % (of the total population) on the bar
charts.
Pie charts: Choose this option so that on the tree, the relative frequencies of the
categories are displayed using a pie chart.
Contextual menu for the trees
When you click on a node on a classification tree, and then do a right click on the mouse, a
contextual menu is displayed with the following commands:
Show the entire tree: Select this option to display the entire tree and to undo previous pruning
actions.
Hide the subtree: Select this option to hide all the nodes below the selected node. Hidden
parts of the tree are indicated by a red rectangle on the corresponding parent node.
Show the subtree: Select this option to show all the nodes below the selected node.
Set the pruning level: Select this option to change the maximum tree depth.
Reset this Menu: Select this option to deactivate the context menu of the tree chart and to
activate the standard menu of Excel.
Results
Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the
variables selected. The number of missing values, the number of non-missing values, the
mean and the standard deviation (unbiased) are displayed for the quantitative variables. For
qualitative variables, including the dependent variable, the categories with their respective
frequencies and percentages are displayed.
Correlation matrix: This table displays the correlations between the explanatory variables.
Tree structure: This table displays the nodes and information on the number of objects, the
significance level of the split, and the two first son-nodes. In the case of a qualitative
dependent variable the predicted category is displayed. In the case of a quantitative dependent
variable the expected value of the node is displayed.
Split level chart: This chart shows the significance level of the split variables for the internal
nodes of the tree.
Tree chart: A legend is first displayed so that you can identify which color corresponds to
which category (qualitative dependent variable) or interval (quantitative dependent variable) of
the dependent variable. The graphical visualization of the tree allows you to quickly see how it
has been iteratively built, in order to obtain rules that are as pure as possible, meaning that the
leaves of the tree should ideally correspond to only one category (or interval).
Every node is displayed as a bar chart or a pie chart. For the pie charts, the inner circle of the
pie corresponds to the relative frequencies of the categories (or intervals) to which the objects
contained in the node correspond. The outer ring shows the relative frequencies of the
categories of the objects contained in the parent node.
The node identifier, the number of objects, their relative frequency, and the purity (if the
dependent variable is qualitative), or the predicted value (if the dependent variable is
quantitative) are displayed beside each node. Between a parent and a son node, the split
variable is displayed with a grey background. Arrows point from this split variable to the son
nodes. The values (categories in the case of a qualitative explanatory variable, intervals in the
case of a quantitative explanatory variable) corresponding to each son node are displayed in
the top left box displayed next to the son node.
Pruning can be done using the contextual menu of the tree chart. Select a node of the chart
and click on the right button of the mouse to activate the context menu. The available options
are described in the contextual menu section.
Node frequencies: This table displays the frequencies of the categories of the dependent
variable.
Rules: The rules are displayed in natural language for the dominant categories of each node.
If the option all categories is checked in the dialog box, then the rules for all categories and
every node are displayed.
Results by object: This table displays for each observation, the observed category, the
predicted category, and, in the case of a qualitative dependent variable, the probabilities
corresponding to the various categories of the dependent variable.
Confusion matrix: This table displays the numbers of well- and badly-classified observations
for each of the categories (see the description section for more details).
Example
A tutorial on how to use classification and regression trees is available on the Addinsoft
website:
http://www.xlstat.com/demo-dtr.htm
References
Biggs D., Ville B. and Suen E. (1991). A method of choosing multiway partitions for
classification and decision trees. Journal of Applied Statistics, 18(1), 49-62.
Breiman L., Friedman J.H., Olshen R. and Stone C.J. (1984). Classification and Regression
Trees. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California.
Lim T. S., Loh W. Y. and Shih Y. S. (2000). A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3),
203-228.
Loh W. Y. and Shih Y. S. (1997). Split selection methods for classification trees. Statistica
Sinica, 7, 815-840.
Morgan J.N. and Sonquist J.A. (1963). Problems in the analysis of survey data and a
proposal. Journal of the American Statistical Association, 58, 415-434.
Rakotomalala R. (1997). Graphes d'Induction, PhD Thesis, Université Claude Bernard Lyon
1.
PLS/PCR/OLS Regression
Use this module to model and predict the values of one or more dependent quantitative
variables using a linear combination of one or more quantitative and/or qualitative explanatory
variables.
Description
The three regression methods available in this module share the common characteristic of
generating models that involve linear combinations of explanatory variables. The difference
between the three methods lies in the way the correlation structures between the variables are
handled.
OLS Regression
Of the three methods, it is the most classical. Ordinary Least Squares regression (OLS) is
more commonly called linear regression (simple or multiple depending on the number of
explanatory variables).
In the case of a model with p explanatory variables, the OLS regression model writes:

Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon

where Y is the dependent variable, \beta_0 is the intercept of the model, X_j corresponds to the
j-th explanatory variable of the model (j = 1 to p), and \varepsilon is the random error with
expectation 0 and variance \sigma^2.

In the case where there are n observations, the estimation of the predicted value of the
dependent variable Y for the i-th observation is given by:

\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{ij}

The OLS method corresponds to minimizing the sum of square differences between the
observed and predicted values. This minimization leads to the following estimators of the
parameters of the model:

\hat{\beta} = (X'DX)^{-1} X'Dy

\hat{\sigma}^2 = \frac{1}{W - p^*} \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2

where \hat{\beta} is the vector of the estimators of the \beta_i parameters, X is the matrix of the
explanatory variables preceded by a vector of 1s, y is the vector of the n observed values of the
dependent variable, p* is the number of explanatory variables to which we add 1 if the intercept
is not fixed, w_i is the weight of the i-th observation, W is the sum of the w_i weights, and D is
the diagonal matrix of the w_i weights.

The vector of the predicted values writes:

\hat{y} = X (X'DX)^{-1} X'Dy
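The weighted least squares estimator can be sketched with NumPy as follows (illustrative only; XLSTAT relies on Dempster's algorithms to handle rank-deficient cases, which this sketch does not):

```python
import numpy as np

def wls_fit(X, y, w):
    """Weighted least squares: beta = (X'DX)^-1 X'Dy with D = diag(w).
    X must already contain a leading column of 1s for the intercept."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    D = np.diag(np.asarray(w, float))
    # solve the normal equations rather than inverting explicitly
    return np.linalg.solve(X.T @ D @ X, X.T @ D @ y)

# the exact line y = 1 + 2x is recovered whatever the (positive) weights
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
beta = wls_fit(X, y, [1.0, 2.0, 1.0, 1.0])
```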
The limitations of the OLS regression come from the constraint that the X'X matrix must be
inverted: it is required that the rank of the matrix is p+1, and some numerical problems may arise
if the matrix is not well behaved. XLSTAT uses algorithms due to Dempster (1969) that allow
circumventing these two issues: if the matrix rank equals q, where q is strictly lower than p+1,
some variables are removed from the model, either because they are constant or because
they belong to a block of collinear variables.
Furthermore, an automatic selection of the variables is performed if the user selects too high
a number of variables compared to the number of observations. The theoretical limit is n-1, as
with greater values the X'X matrix becomes non-invertible.
The removal of some of the variables may however not be optimal: in some cases we might
not add a variable to the model because it is almost collinear to some other variables or to a
block of variables, while it might be more relevant to remove a variable that is
already in the model and to add the new variable.
For that reason, and also in order to handle cases where there are many explanatory
variables, other methods have been developed.
PCR Regression
PCR (Principal Components Regression) can be divided into three steps: we first run a PCA
(Principal Components Analysis) on the table of the explanatory variables, then we run an OLS
regression on the selected components, and finally we compute the parameters of the model
that correspond to the input variables.
PCA allows transforming an X table with n observations described by p variables into an S table
with n scores described by q components, where q is lower than or equal to p and such that
(S'S) is invertible. An additional selection can be applied to the components so that only the r
components that are the most correlated with the Y variable are kept for the OLS regression
step. We then obtain the R table.
The OLS regression is performed on the Y and R tables. In order to circumvent the
interpretation problem with the parameters obtained from the regression, XLSTAT transforms
the results back into the initial space to obtain the parameters and the confidence intervals that
correspond to the input variables.
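The three PCR steps can be sketched as follows, as a minimal NumPy illustration with centered variables (the function name is mine, and the additional selection of components by correlation with Y is omitted):

```python
import numpy as np

def pcr_fit(X, y, q):
    """PCR sketch: PCA on centered X, OLS on the first q component
    scores, then map the coefficients back to the original variables
    (the intercept is handled via centering)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # principal directions = right singular vectors of the centered table
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:q].T                      # p x q loading matrix
    S = Xc @ V                        # n x q score table
    gamma = np.linalg.solve(S.T @ S, S.T @ (y - y_mean))
    beta = V @ gamma                  # back to the input-variable space
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# with q equal to the rank of X, PCR coincides with OLS
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]      # exactly y = 1 + 2*x1 + 3*x2
b0, b = pcr_fit(X, y, q=2)
```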
PLS Regression
This method is quick, efficient and optimal for a criterion based on covariances. It is
recommended in cases where the number of variables is high, and where it is likely that the
explanatory variables are correlated.
The idea of PLS regression is to create, starting from a table with n observations described by
p variables, a set of h components with h<p. The method used to build the components differs
from PCA, and presents the advantage of handling missing data. The determination of the
number of components to keep is usually based on a criterion that involves a cross-validation.
The user may also set the number of components to use.
Some programs differentiate PLS1 from PLS2. PLS1 corresponds to the case where there is
only one dependent variable. PLS2 corresponds to the case where there are several
dependent variables. The algorithms used by XLSTAT are such that the PLS1 is only a
particular case of PLS2.
In the case of the OLS and PCR methods, if models need to be computed for several
dependent variables, the computation of the models is simply a loop on the columns of the
dependent variables table Y. In the case of PLS regression, the covariance structure of Y also
influences the computations.
Y = T_h C_h' + E_h
  = X W_h^* C_h' + E_h
  = X W_h (P_h' W_h)^{-1} C_h' + E_h

where Y is the matrix of the dependent variables and X is the matrix of the explanatory
variables. T_h, C_h, W_h^*, W_h and P_h are the matrices generated by the PLS algorithm, and
E_h is the matrix of the residuals.
The matrix B of the regression coefficients of Y on X, with h components generated by the PLS
regression algorithm, is given by:

B = W_h (P_h' W_h)^{-1} C_h'
Note: the PLS regression leads to a linear model, just as OLS and PCR do.
Notes:
The three methods give the same results if the number of components obtained from the PCA
(in PCR) or from the PLS regression is equal to the number of explanatory variables.
The components obtained from the PLS regression are built so that they explain Y as well as
possible, while the components of the PCR are built to describe X as well as possible.
XLSTAT partly compensates for this drawback of the PCR by allowing the selection of the
components that are the most correlated with Y.
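For illustration, the PLS1 case (one dependent variable) can be sketched with a textbook NIPALS-style deflation loop; function and variable names are mine, and XLSTAT's implementation may differ:

```python
import numpy as np

def pls1_fit(X, y, h):
    """PLS1 sketch: build h components maximizing covariance with y,
    then return B = W (P'W)^-1 c mapped back to the input variables."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xh, yh = X - x_mean, y - y_mean
    W, P, c = [], [], []
    for _ in range(h):
        w = Xh.T @ yh                     # weights: max-covariance direction
        w = w / np.linalg.norm(w)
        t = Xh @ w                        # component scores
        tt = t @ t
        p = Xh.T @ t / tt                 # x-loadings
        ch = (yh @ t) / tt                # y-loading
        Xh = Xh - np.outer(t, p)          # deflate X
        yh = yh - ch * t                  # deflate y
        W.append(w); P.append(p); c.append(ch)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    B = W @ np.linalg.solve(P.T @ W, c)   # B = W (P'W)^-1 c
    intercept = y_mean - x_mean @ B
    return intercept, B

# with h = p and full-rank X, PLS coincides with OLS
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]      # exactly y = 1 + 2*x1 + 3*x2
b0, B = pls1_fit(X, y, h=2)
```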
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Dependent variables:
Quantitative: Select the dependent variable(s). The data must be numerical. If the Variable
labels option is activated make sure that the headers of the variables have also been
selected.
X / Explanatory variables:
Quantitative: Activate this option if you want to include one or more quantitative explanatory
variables. Then select the corresponding data. The data must be numerical. If the Variable
labels option is activated make sure that the headers of the variables have also been
selected.
Qualitative: Activate this option if you want to include one or more qualitative explanatory
variables. Then select the corresponding data. Whatever their Excel format, the data are
considered as categorical. If the Variable labels option is activated make sure that the
headers of the variables have also been selected.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Observation weights: Activate this option if you want to weight the observations. If you do not
activate this option, the weights are all considered to be equal to 1. The weights must be greater than or equal to 0 and must be integer values. Setting the weight of an observation to 2 is equivalent to repeating the same observation twice. If the "Variable labels" option is activated, make sure
that the header of the selection has also been selected.
Regression weights: This option is active only with PCR and OLS regression. Activate this
option if you want to run a weighted least squares regression. If you do not activate this option,
the regression weights are considered to be equal to 1. The weights must be greater or equal
to 0. If the "Variable labels" option is activated make sure that the header of the selection has
also been selected.
Options tab:
Common options:
Confidence interval (%): Enter the size in % of the confidence interval that is used for the
various tests, parameters and predictions. Default value: 95.
Stop conditions:
Automatic: Activate this option so that XLSTAT automatically determines the number
of components to keep.
Q² threshold: Activate this option to set the threshold value of the Q² criterion used to determine whether the contribution of a component is significant or not. The default value is 0.0975, which corresponds to 1 - 0.95².
Q² improvement: Activate this option to set the threshold value of the Q² improvement criterion used to determine whether the contribution of a component is significant or not. The default value is 0.05, which corresponds to a 5% improvement. This value is computed as follows:
Q²_h(Imp) = (Q²_h - Q²_(h-1)) / Q²_(h-1)
Minimum Press: Activate this option so that the number of components used in the
model corresponds to the model with the minimum Press statistic.
Max components: Activate this option to set the maximum number of components to take into account in the model.
Standardized PCA: Activate this option to run a PCA on the correlation matrix. Deactivate this option to run a PCA on the covariance matrix (unstandardized PCA).
Filter components: You can activate one of the two following options in order to reduce the
number of components used in the model:
Minimum %: Activate this option and enter the minimum percentage of total variability
that the selected components should represent.
Maximum number: Activate this option to set the maximum number of components to
take into account.
Sort components by: Choose one of the following options to determine which criterion
should be used to select the components on the basis of the Minimum %, or of the
Maximum number:
Correlations with Ys: Activate this option so that the selection of the components is based on sorting, in decreasing order, the R² coefficients between the dependent variable Y and the components. This option is recommended.
Eigenvalues: Activate this option so that the selection of the components is based on sorting, in decreasing order, the eigenvalues corresponding to the components.
Fixed intercept: Activate this option to set the intercept (or constant) of the model to a given
value. Then enter the value in the corresponding field (0 by default).
Tolerance: Activate this option to allow the OLS algorithm to automatically remove the
variables that would either be constant or highly correlated with other variables or group of
variables (Minimum and default value is 0.0001. Maximum value allowed is 1). The higher the
tolerance, the more the model tolerates collinearities between the variables.
Constraints: This option is active only if you have selected qualitative explanatory variables.
Choose the type of constraint:
a1 = 0: For each qualitative variable, the parameter of the model that corresponds to the
first category of the variable is set to 0. This type of constraint is useful when you
consider that the first category corresponds to a standard, or to a null effect.
Sum(ai) = 0: For each qualitative variable, the sum of the parameters corresponding to
the various categories equals 0.
Sum(ni.ai) = 0: For each qualitative variable, the sum of the parameters corresponding
to the various categories weighted by their frequency equals 0.
Model selection: Activate this option if you want to use one of the following model selection
methods:
Best model: This method allows choosing the best model among all the models that
are based on a number of variables that is bounded by Min variables and Max
variables. The quality of the model depends on a selection Criterion.
Criterion: Select the criterion from the following list: adjusted R², Mean Squares of Errors (MSE), Mallows' Cp, Akaike's AIC, Schwarz's SBC, Amemiya's PC.
Min variables: Enter the minimum number of variables to take into account in the
model.
Max variables: Enter the maximum number of variables to take into account in the
model.
Note: this method can lead to very long computations because the total number of models explored is the sum of the C(n,k), where k varies between Min variables and Max variables, and where C(n,k) = n!/[(n-k)!k!]. It is therefore highly recommended that you increase the value of Max variables step by step.
Stepwise: The selection process starts by adding the variable that contributes the most to the model (the criterion used here is Student's t statistic). If a second variable is such that the probability of its t is lower than the Threshold level, it is added to the model. The procedure is the same for the third variable. Then, starting with the third variable, the algorithm evaluates how the removal of one of the variables would impact the model. If the probability corresponding to the Student's t of one of the variables is greater than the Threshold level, the variable is removed. The procedure continues until no variable can be either added to or removed from the model.
Forward: The procedure is identical to the stepwise, except that there are no removal
steps.
Backward: The procedure starts with the selection of all the available variables. The
variables are then removed from the model one by one using the same methodology as
for the stepwise selection.
Threshold level: Enter the value of the threshold probability for the Student's t statistic used during the selection process.
Validation tab:
Validation: Activate this option if you want to use a sub-sample of the data to validate the
model.
Validation set: Choose one of the following options to define how to obtain the observations
used for the validation:
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use for the validation.
Prediction tab:
Prediction: Activate this option if you want to select data to use in prediction mode. If you activate this option, you need to make sure that the prediction dataset is structured like the estimation dataset: same variables, in the same order in the selections. However, variable labels must not be selected: the first row of the selections listed below must correspond to data.
Quantitative: Activate this option to select the quantitative explanatory variables. The first row
must not include variable labels.
Qualitative: Activate this option to select the qualitative explanatory variables. The first row
must not include variable labels.
Observations labels: Activate this option if observations labels are available. Then select the
corresponding data. If this option is not activated, the observations labels are automatically
generated by XLSTAT (PredObs1, PredObs2, …).
Missing data tab:
These options are available only for PCR and OLS regression. With PLS regression, the missing data are automatically handled by the algorithm.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
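As a sketch of the two estimation strategies (Python/NumPy, quantitative variables only; XLSTAT's actual implementation, for instance its distance and mode computations, may differ):

```python
import numpy as np

def impute_mean(X):
    # replace the NaNs of each column with that column's mean
    X = X.copy()
    col_mean = np.nanmean(X, axis=0)
    idx = np.where(np.isnan(X))
    X[idx] = np.take(col_mean, idx[1])
    return X

def impute_nearest_neighbour(X):
    # replace the NaNs of each row with the values of the closest complete
    # row (Euclidean distance computed on the columns observed in both)
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]   # assumes at least one complete row
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        d = np.linalg.norm(complete[:, obs] - X[i, obs], axis=1)
        X[i, ~obs] = complete[np.argmin(d), ~obs]
    return X
```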
Outputs tab:
Descriptive statistics: Activate this option to display the descriptive statistics for all the
selected variables.
Correlations: Activate this option to display the correlation matrix for the quantitative variables
(dependent and explanatory).
Standardized coefficients: Activate this option to display the standardized parameters of the model (also named beta coefficients).
Equation: Activate this option to explicitly display the equation of the model.
Predictions and residuals: Activate this option to display the table of predictions and
residuals.
t, u and ũ components: Activate this option to display the tables corresponding to the components. If this option is not activated, the corresponding charts are not displayed.
c, w, w* and p vectors: Activate this option to display the tables corresponding to the vectors
obtained from the PLS algorithm. If this option is not activated the corresponding charts are not
displayed.
VIPs: Activate this option to display the table and the charts of the Variable Importance for the
Projection.
Confidence intervals: Activate this option to compute the confidence intervals of the standardized coefficients. The computations involve a jackknife method.
Outliers analysis: Activate this option to display the table and the charts of the outliers
analysis.
Factor loadings: Activate this option to display the factor loadings. The factor loadings are
equal to the correlations between the principal components and the input variables if the PCA
is based on the correlation matrix (standardized PCA).
Correlations Factors/Variables: Activate this option to display the correlations between the principal components and the input variables.
Factor scores: Activate this option to display the factor scores (coordinates of the
observations in the new space) generated by the PCA. The scores are used in the regression
step of the PCR.
Analysis of variance: Activate this option to display the analysis of variance table.
Adjusted predictions: Activate this option to compute and display the adjusted predictions in
the predictions and residuals table.
Cook's D: Activate this option to compute and display Cook's distances in the predictions and residuals table.
Press: Activate this option to compute and display the Press statistic.
Charts tab:
Standardized coefficients: Activate this option to display a chart with the standardized
coefficients of the model, and the corresponding confidence intervals.
Predictions and residuals: Activate this option to display the following charts:
(1) Regression line: this chart is displayed only if there is one explanatory variable
and if that variable is quantitative.
(2) Explanatory variable versus standardized residuals: this chart is displayed only if
there is one explanatory variable and if that variable is quantitative.
Confidence intervals: Activate this option to display the confidence intervals on charts
(1) and (4).
Correlation charts: Activate this option to display the charts involving correlations between
the components and input variables. In the case of PCR, activate this option to display the
correlation circle.
Vectors: Activate this option to display the input variables with vectors.
Observations charts: Activate this option to display the charts that allow visualizing the observations in the new space.
Labels: Activate this option to display the observations labels on the charts. The
number of labels can be modulated using the filtering option.
Biplots: Activate this option to display the charts where the input variables and the
observations are simultaneously displayed.
Vectors: Activate this option to display the input variables with vectors.
Labels: Activate this option to display the observations labels on the biplots. The
number of labels can be modulated using the filtering option.
Colored labels: Activate this option to display the labels with the same color as the
corresponding points. If this option is not activated the labels are displayed in black.
N first rows: The N first observations are displayed on the chart. The Number of
observations N to display must then be specified.
N last rows: The N last observations are displayed on the chart. The Number of
observations N to display must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to display.
Results
Descriptive statistics: the tables of descriptive statistics display for all the selected variables
a set of basic statistics. For the dependent variables (colored in blue), and the quantitative
explanatory variables, XLSTAT displays the number of observations, the number of
observations with missing data, the number of observations with no missing data, the mean,
and the unbiased standard deviation. For the qualitative explanatory variables XLSTAT
displays the name and the frequency of the categories.
Correlation matrix: this table is displayed to allow you to visualize the correlations among the explanatory variables, among the dependent variables, and between the two groups.
The first table displays the model quality indexes. The quality corresponds here to the
cumulated contribution of the components to the indexes:
The Q²cum index measures the global contribution of the h first components to the predictive quality of the model (and of the sub-models if there are several dependent variables). The Q²cum(h) index writes:
Q²cum(h) = 1 - Π_{j=1..h} [ Σ_{k=1..q} PRESS_kj / Σ_{k=1..q} SSE_k(j-1) ]
The index involves the PRESS statistic (which requires a cross-validation) and the Sum of Squares of Errors (SSE) for a model with one component less. Searching for the maximum of the Q²cum index is equivalent to finding the most stable model.
The R²Ycum index is the sum of the coefficients of determination between the dependent variables and the h first components. It is therefore a measure of the explanatory power of the h first components for the dependent variables of the model.
The R²Xcum index is the sum of the coefficients of determination between the explanatory variables and the h first components. It is therefore a measure of the explanatory power of the h first components for the explanatory variables of the model.
A bar chart is displayed to allow the visualization of the evolution of the three indexes when the number of components increases. While the R²Ycum and R²Xcum indexes necessarily increase with the number of components, this is not the case with Q²cum.
The next table corresponds to the correlation matrix of the explanatory and dependent variables with the t and u components. A chart displays the correlations with the t components.
The next table displays the w vectors, followed by the w* vectors and the c vectors, which are directly involved in the model, as shown in the Description section. If a valid model is obtained for h=2, the projection of the x and y variables on the w*/c axes chart gives a fair idea of the sign and the relative weight of the corresponding coefficients in the model.
The next table displays the scores of the observations in the space of the t components. The
corresponding chart is displayed. If some observations have been selected for the validation,
they are displayed on the chart.
The next table displays the standardized scores of the observations in the space of the t
components. These scores are equivalent to computing the correlations of each observation
(represented by an indicator variable) with the components. This allows displaying the
observations on the correlations map that follows where the Xs, the Ys and the observations
are simultaneously displayed. An example of an interpretation of this map is available in
Tenenhaus (2003).
The next table corresponds to the scores of the observations in the space of the u and then the ũ components. The chart based on the ũ components is displayed. If some observations have been selected for the validation, they are displayed on the chart.
The table with the Q² quality indexes allows visualizing how the components contribute to the explanation of the dependent variables. The table of the cumulated Q² quality indexes allows measuring the quality that corresponds to a space with an increasing number of dimensions.
The table of the R² and redundancies between the input variables (dependent and explanatory) and the components t and ũ allows evaluating the explanatory power of the t and ũ. The redundancy between an X table (n rows and p variables) and a component c is the part of the variance of X explained by c. It is defined as the mean of the squares of the correlation coefficients between the variables and the component:
Rd(X, c) = (1/p) Σ_{j=1..p} R²(x_j, c)
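A direct transcription of this definition (Python/NumPy; the function name is ours):

```python
import numpy as np

def redundancy(X, c):
    # Rd(X, c): mean of the squared correlations between the columns of X
    # and the component c
    r2 = [np.corrcoef(X[:, j], c)[0, 1] ** 2 for j in range(X.shape[1])]
    return float(np.mean(r2))
```

For example, if every column of X is perfectly correlated (positively or negatively) with c, then Rd(X, c) = 1.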
From the redundancies one can deduce the VIPs (Variable Importance for the Projection)
that measure the importance of an explanatory variable for the building of the t components.
The VIP for the jth explanatory variable and the component h is defined by:
VIP_hj = √( p · Σ_{i=1..h} Rd(Y, t_i) w_ij² / Σ_{i=1..h} Rd(Y, t_i) )
On the VIP charts (one bar chart per component), border lines are plotted at 0.8 and 1: these thresholds, suggested by Wold (1995) and Eriksson (2001), allow identifying the variables that are moderately (0.8 < VIP < 1) or highly (VIP > 1) influential.
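The formula can be transcribed as follows (Python/NumPy sketch with names of our choosing; W holds the unit-norm weight vectors as columns). A convenient sanity check is that the mean of the squared VIPs over the p variables always equals 1, since each column of W has unit norm:

```python
import numpy as np

def vip(W, rd_y):
    # W: (p, h) matrix of unit-norm PLS weight vectors w_1..w_h
    # rd_y: length-h array of redundancies Rd(Y, t_i)
    p, h = W.shape
    return np.sqrt(p * (W**2 @ rd_y) / rd_y.sum())
```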
The next table displays the outliers analysis. The DModX (distances from each observation to the model in the space of the X variables) allow identifying the outliers for the explanatory variables, while the DModY (distances from each observation to the model in the space of the Y variables) allow identifying the outliers for the dependent variables. On the corresponding charts the threshold values DCrit are also displayed to help identify the outliers: the DMod values that are above the DCrit threshold correspond to outliers. The DCrit are computed using the threshold values classically used in box plots. The value of the DModX for the ith observation writes:
DModX_i = √( n/(n - h - 1) · Σ_{j=1..p} e(X, t)_ij² / (p - h) )
where the e(X,t)_ij (i = 1…n) are the residuals of the regression of X on the jth component. The value of the DModY for the ith observation writes:
DModY_i = √( Σ_{j=1..q} e(Y, t)_ij² / (q - h) )
where q is the number of dependent variables and the e(Y,t)_ij (i = 1…n) are the residuals of the regression of Y on the jth component.
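Both distances are direct to compute from the residual matrices (Python/NumPy sketch; E and the function names are ours, and dmody assumes q > h):

```python
import numpy as np

def dmodx(E, h):
    # E: (n, p) residuals of X after h components
    n, p = E.shape
    return np.sqrt(n / (n - h - 1) * (E**2).sum(axis=1) / (p - h))

def dmody(E, h):
    # E: (n, q) residuals of Y after h components (requires q > h)
    q = E.shape[1]
    return np.sqrt((E**2).sum(axis=1) / (q - h))
```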
The next table displays the parameters of the model corresponding to each dependent variable. It is followed by the equation corresponding to each model, provided that the number of explanatory variables does not exceed 20.
For each of the dependent variables a series of tables and charts is displayed.
Goodness of fit statistics: this table displays the goodness of fit statistics of the PLS regression model for each dependent variable. The definition of these statistics is given in the Goodness of fit statistics section below.
The table of the standardized coefficients (also named beta coefficients) allows comparing
the relative weight of the variables in the model. To compute the confidence intervals, in the
case of PLS regression, the classical formulae based on the normality hypotheses used in
OLS regression do not apply. A bootstrap method suggested by Tenenhaus et al. (2004)
allows estimating the confidence intervals. The greater the absolute value of a coefficient, the
greater the weight of the variable in the model. When the confidence interval around the
standardized coefficients includes 0, which can easily be observed on the chart, the weight of
the variable in the model is not significant.
In the predictions and residuals table, the weight, the observed value of the dependent
variable, the corresponding prediction, the residuals and the confidence intervals are displayed
for each observation. Two types of confidence intervals are displayed: an interval around the
mean (it corresponds to the case where the prediction is made for an infinite number of observations with a given set of values of the explanatory variables) and an interval around an
individual prediction (it corresponds to the case where the prediction is made for only one
observation). The second interval is always wider than the first one, as the uncertainty is of
course higher. If some observations have been selected for the validation, they are displayed
in this table.
A chart also displays the predicted values versus the observed values (for an ideal model, all the points would lie on the bisecting line).
If you have selected data to use in prediction mode, a table displays the predictions on the
new observations and the corresponding confidence intervals.
The PCR regression requires a Principal Component Analysis step. The first results concern
the latter.
Eigenvalues: the table of the eigenvalues and the corresponding scree plot are displayed.
The number of eigenvalues displayed is equal to the number of non null eigenvalues. If a
components filtering option has been selected it is applied only before the regression step.
If the corresponding outputs options have been activated, XLSTAT displays the factor
loadings (the coordinates of the input variables in the new space), then the correlations
between the input variables and the components. The correlations are equal to the factor
loadings if the PCA is performed on the correlation matrix. The next table displays the factor scores (the coordinates of the observations in the new space), which are later used for the regression step. If some observations have been selected for the validation, they are displayed
in this table. A biplot is displayed if the corresponding option has been activated.
If the filtering option based on the correlations with the dependent variables has been selected, the components used in the regression step are those that have the greatest determination coefficients (R²) with the dependent variables. The matrix of the correlation coefficients
between the components and the dependent variables is displayed. The number of
components that are kept depends on the number of eigenvalues and on the selected options
(Minimum % or Max components).
If the filtering option based on the eigenvalues has been selected, the components used in the
regression step are those that have the greatest eigenvalues. The number of components that
are kept depends on the number of eigenvalues and on the selected options (Minimum % or
Max components).
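The PCR pipeline described above — standardized PCA, regression of y on the first k factor scores, then mapping the coefficients back to the input variables — can be sketched as follows (Python/NumPy; our own minimal version, which keeps the k largest eigenvalues rather than applying XLSTAT's filtering options):

```python
import numpy as np

def pcr_fit(X, y, k):
    # PCR: standardized PCA on X, then OLS of y on the first k factor
    # scores; the coefficients are then mapped back to the original scale
    mu, sd = X.mean(0), X.std(0)
    Z = (X - mu) / sd
    vals, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(vals)[::-1][:k]
    V = vecs[:, order]                 # loadings of the k largest eigenvalues
    T = Z @ V                          # factor scores
    g, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(y)), T]), y, rcond=None)
    beta = (V @ g[1:]) / sd            # slopes on the original variables
    return g[0] - mu @ beta, beta      # intercept, slopes
```

With k equal to the number of variables, the result coincides with OLS; reducing k trades some fit for stability.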
Goodness of fit statistics: this table displays statistics that are related to the goodness of fit
of the regression model:
Observations: the number of observations taken into account for the computations. In the formulae below, n corresponds to the number of observations.
Sum of weights: the sum of weights of the observations taken into account. In the
formulae below, W corresponds to the sum of weights.
DF: the number of degrees of freedom of the selected model (corresponds to the error
DF of the analysis of variance table).
R² = 1 - Σ_{i=1..n} w_i (y_i - ŷ_i)² / Σ_{i=1..n} w_i (y_i - ȳ)², with ȳ = (1/n) Σ_{i=1..n} w_i y_i
The R² is interpreted as the proportion of the variability of the dependent variable explained by the model. The closer the R² is to 1, the better the fit of the model. The major drawback of the R² is that it does not take into account the number of variables used to fit the model.
R²adj = 1 - (1 - R²)(W - 1) / (W - p - 1)
The adjusted R² is a correction of the R² that allows taking into account the number of variables used in the model.
MSE = 1/(W - p*) · Σ_{i=1..n} w_i (y_i - ŷ_i)²
RMSE: the Root Mean Squares of Errors (RMSE) is the square root of the MSE.
MAPE = (100/W) Σ_{i=1..n} w_i |(y_i - ŷ_i) / y_i|
DW = Σ_{i=2..n} [(y_i - ŷ_i) - (y_(i-1) - ŷ_(i-1))]² / Σ_{i=1..n} w_i (y_i - ŷ_i)²
This statistic corresponds to the order-1 autocorrelation coefficient of the residuals and allows checking whether the residuals are autocorrelated. The independence of the residuals is one of the hypotheses of linear regression. The user will need to look at a Durbin-Watson table to know whether the hypothesis of independence of the residuals is accepted or rejected.
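For unit weights (so W = n), the statistics above reduce to the following sketch (Python/NumPy; p_star is p*, the number of model parameters including the intercept, and the function name is ours):

```python
import numpy as np

def fit_statistics(y, y_hat, p_star):
    # goodness-of-fit statistics for unit weights (W = n)
    n = len(y)
    resid = y - y_hat
    sse = (resid**2).sum()
    r2 = 1 - sse / ((y - y.mean())**2).sum()
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p_star)   # W - p - 1 = n - p*
    mse = sse / (n - p_star)
    mape = 100 / n * np.abs(resid / y).sum()
    dw = (np.diff(resid)**2).sum() / sse              # Durbin-Watson
    return {"R2": r2, "R2_adj": r2_adj, "MSE": mse,
            "RMSE": np.sqrt(mse), "MAPE": mape, "DW": dw}
```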
Cp = SSE / σ̂² + 2p* - W

where SSE is the sum of squares of errors for the model with p explanatory variables, and where σ̂² corresponds to the estimator of the variance of the residuals for the model that includes all the explanatory variables. The closer the Cp coefficient is to p*, the less biased the model.
AIC = W ln(SSE / W) + 2p*
This criterion, suggested by Akaike (1973), derives from information theory and is based on the Kullback and Leibler measure (1951). It is a model selection criterion that penalizes models for which the addition of a new explanatory variable does not bring sufficient information. The lower the AIC, the better the model.
SBC = W ln(SSE / W) + ln(W) p*
This criterion suggested by Schwarz (1978) is close to the AIC, and the goal is to minimize it.
PC = (1 - R²)(W + p*) / (W - p*)
This criterion, suggested by Amemiya (1980), allows, like the adjusted R², taking into account the parsimony of the model.
Press RMSE: the Press RMSE statistic is displayed only if the corresponding option has been activated in the dialog box. The Press statistic is defined by:

Press = Σ_{i=1..n} w_i (y_i - ŷ_i(-i))²

where ŷ_i(-i) is the prediction for the ith observation when it is not included in the data set used for the estimation of the parameters of the model. One then obtains:

Press RMSE = √( Press / (W - p*) )
The Press RMSE can then be compared to the RMSE. A large difference between
both indicates that the model is sensitive to the presence or absence of some
observations.
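The Press statistic can be sketched by explicit leave-one-out refitting of an OLS model with unit weights (Python/NumPy; function names ours). The classical identity Press = Σ (e_i / (1 - h_ii))², with h_ii the leverages from the hat matrix, gives the same value without refitting and serves as a cross-check:

```python
import numpy as np

def press_loo(X, y):
    # leave-one-out PRESS for OLS with intercept (unit weights)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
        press += (y[i] - Xd[i] @ beta) ** 2
    return press

def press_hat(X, y):
    # same statistic via the hat matrix: sum((e_i / (1 - h_ii))^2)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    e = y - H @ y
    return float(((e / (1 - np.diag(H))) ** 2).sum())
```

Press RMSE then follows as the square root of Press / (W - p*) and can be compared directly with the RMSE.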
The analysis of variance table allows evaluating how much information the explanatory
variables bring to the model. In the case where the intercept of the model is not fixed by the
user, the explanatory power is measured by comparing the fit of the selected model with the fit
of a basic model where the dependent variable equals its mean. When the intercept is fixed to
a given value, the selected model is compared to a basic model where the dependent variable equals the fixed intercept.
In the case of a PCR regression, the first table of model parameters corresponds to the
parameters of the model based on the selected components. This table is not easy to interpret.
For that reason a transformation is performed to obtain the parameters of the model
corresponding to the input variables. The latter table is directly obtained in the case of an OLS
regression. In this table you will find the estimate of the parameters, the corresponding
standard error, the Student's t, the corresponding probability, as well as the confidence interval.
The equation of the model is then displayed to facilitate the visualization or the reuse of the
model.
The table of the standardized coefficients (also named beta coefficients) allows comparing
the relative weight of the variables in the model. The greater the absolute value of a coefficient,
the greater the weight of the variable in the model. When the confidence interval around the
standardized coefficients includes 0, which can easily be observed on the chart, the weight of
the variable in the model is not significant.
In the predictions and residuals table, the weight, the value of the explanatory variable if there is only one, the observed value of the dependent variable, the corresponding prediction, the residuals, the confidence intervals, the adjusted prediction and the Cook's D are displayed for each observation. Two types of confidence intervals are displayed: an interval around the mean (it corresponds to the case where the prediction is made for an infinite number of observations with a given set of values of the explanatory variables) and an interval around an individual prediction (it corresponds to the case where the prediction is made for
only one observation). The second interval is always wider than the first one, as the
uncertainty is of course higher. If some observations have been selected for the validation,
they are displayed in this table.
The charts that follow allow visualizing the results listed above. If there is only one explanatory
variable in the model, and if that variable is quantitative, then the first chart allows visualizing
the observations, the regression line and the confidence intervals around the prediction. The
second chart displays the standardized residuals versus the explanatory variable. The
residuals should be randomly distributed around the abscissa axis. If a trend can be observed,
that means there is a problem with the model.
The three charts that are displayed afterwards allow visualizing respectively the standardized
residuals versus the dependent variable, the distance between the predicted and observed
values (for an ideal model, all the points would be on the bisecting line), and the bar chart of
the standardized residuals. The third chart makes it possible to quickly see if there is an
unexpected number of high residuals: the normality assumption for the residuals is such that
only 5% of the standardized residuals should be out of the ]-2, 2[ interval.
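The 5% rule of thumb is easy to verify numerically (Python/NumPy sketch; function name ours):

```python
import numpy as np

def share_outside(resid, limit=2.0):
    # share of standardized residuals outside ]-limit, limit[
    z = (resid - resid.mean()) / resid.std(ddof=1)
    return float(np.mean(np.abs(z) > limit))
```

For exactly normal residuals the expected share is about 4.6%, consistent with the 5% stated above.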
If you have selected data to use in prediction mode, a table displays the predictions on the
new observations and the corresponding confidence intervals.
OLS regression results:
If the Type I SS and Type III SS (SS: Sum of Squares) options have been activated, the
corresponding tables are displayed.
The Type I SS table allows visualizing the influence of the progressive addition of new
explanatory variables to the model. The influence is given by the Sum of Squares of Errors
(SSE), the Mean Squares of Errors (MSE), the Fisher's F statistic, and the probability corresponding to the Fisher's F. The smaller the probability, the more information the variable
brings to the model. Note: the order of selection of the variables influences the results obtained
here.
The Type III SS table allows visualizing the influence of the withdrawal of an explanatory
variable on the goodness of fit of the model, all the other variables being included. The
influence is measured by the Sum of Squares of Errors (SSE), the Mean Squares of Errors (MSE), the Fisher's F statistic, and the probability corresponding to the Fisher's F. The smaller
the probability, the more information the variable brings to the model. Note: the order of the
variables in the selection does not influence the results in this table.
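The order dependence of Type I SS can be sketched in a few lines (illustrative NumPy code, not XLSTAT's implementation; all variable names are ours): the Type I sum of squares of a variable is the reduction in the residual sum of squares when that variable is added to the model, so with correlated predictors the value changes with the variable's position in the selection.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)          # correlated predictors
y = 2.0 + 1.5 * x1 + 0.5 * x2 + rng.normal(size=n)

def rss(predictors, y):
    """Residual sum of squares of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def type1_ss(cols, y):
    """Sequential (Type I) SS: drop in RSS as each variable is added."""
    out, used, prev = [], [], rss([], y)
    for c in cols:
        used.append(c)
        cur = rss(used, y)
        out.append(prev - cur)
        prev = cur
    return out

ss_12 = type1_ss([x1, x2], y)   # x1 entered first
ss_21 = type1_ss([x2, x1], y)   # x1 entered second: different SS for x1
```

The total explained sum of squares is the same in both orders; only its split between the variables changes.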
Examples
A tutorial on how to use PLS regression is available on the Addinsoft website on the following
page:
http://www.xlstat.com/demo-pls.htm
References
Akaike H. (1973). Information Theory and the Extension of the Maximum Likelihood Principle.
In: Second International Symposium on Information Theory. (Eds: V.N. Petrov and F. Csaki).
Akademiai Kiado, Budapest. 267-281.
Bastien P., Esposito Vinzi V. and Tenenhaus M. (2005). PLS Generalised Regression.
Computational Statistics and Data Analysis, 48, 17-46.
Schwarz G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Tenenhaus M., Pagès J., Ambroisine L. and Guinot C. (2005). PLS methodology for
studying relationships between hedonic judgements and product characteristics. Food Quality
and Preference. 16, 4, 315-325.
Wold, S., Martens H. and Wold H. (1983). The Multivariate Calibration Problem in Chemistry
solved by the PLS Method. In: Ruhe A. and Kågström B. (eds.), Proceedings of the Conference
on Matrix Pencils. Springer Verlag, Heidelberg. 286-293.
Wold S. (1995). PLS for multivariate linear modelling. In: van de Waterbeemd H. (ed.), QSAR:
Chemometric Methods in Molecular Design. Vol 2. Wiley-VCH, Weinheim, Germany. 195-218.
Correlation tests
Use this tool to compute the correlation coefficients of Pearson, Spearman or Kendall,
between two or more variables, and to determine if the correlations are significant or not.
Several visualizations of the correlation matrices are proposed.
Description
Three correlation coefficients are proposed to compute the correlation between a set of
quantitative variables, whether continuous, discrete or ordinal (in the latter case, the classes
must be represented by values that respect the order):
Pearson correlation coefficient: this coefficient corresponds to the classical linear correlation
coefficient. This coefficient is well suited for continuous data. Its value ranges from -1 to 1, and
it measures the degree of linear correlation between two variables. Note: the squared Pearson
correlation coefficient gives an idea of how much of the variability of a variable is explained by
the other variable. The p-values that are computed for each coefficient allow testing the null
hypothesis that the coefficients are not significantly different from 0. However, one needs to be
cautious when interpreting these results: if two variables are independent, their correlation
coefficient is zero, but the reciprocal is not true.
Spearman correlation coefficient (rho): this coefficient is based on the ranks of the
observations and not on their value. This coefficient is adapted to ordinal data. As for the
Pearson correlation, one can interpret this coefficient in terms of variability explained, but here
we mean the variability of the ranks.
Kendall correlation coefficient (tau): as for the Spearman coefficient, it is well suited for ordinal
variables as it is also based on ranks. However, this coefficient is conceptually very different. It
can be interpreted in terms of probability: it is the difference between the probabilities that the
variables vary in the same direction and the probabilities that the variables vary in the opposite
direction. When the number of observations is lower than 50 and when there are no ties,
XLSTAT gives the exact p-value. If not, an approximation is used. The latter is known as being
reliable when there are more than 8 observations.
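The probability interpretation of Kendall's tau can be sketched as follows (a minimal pure-Python version without tie handling; XLSTAT's implementation also computes the p-values):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau (no ties): difference between the probabilities that
    the two variables vary in the same vs. the opposite direction."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# one discordant pair out of 10: tau = (9 - 1) / 10 = 0.8
print(kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))
```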
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to start the computations.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Variable labels" option is activated.
Type of correlation: Choose the type of correlation to use for the computations (see the
description section for more details).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (row and column
variables, weights) includes a header.
Significance level (%): Enter the significance level for the test on the correlations (default
value: 5%).
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Pairwise deletion: Activate this option to remove observations with missing data only when
the variables involved in the calculations have missing data. For example, when calculating the
correlation between two variables, an observation will only be ignored if the data
corresponding to one of the two variables is missing.
Estimate missing data: Activate this option to estimate the missing data before the
calculation starts.
Mean or mode: Activate this option to estimate the missing data by using the mean
(quantitative variables) or the mode (qualitative variables) for the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data for an observation
by searching for the nearest neighbour to the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
Correlations: Activate this option to display the correlation matrix that corresponds to the
correlation type selected in the General tab. If the significant correlations in bold option is
activated, the correlations that are significant at the selected significance level are displayed in
bold.
p-values: Activate this option to display the p-values that correspond to each correlation
coefficient.
Charts tab:
Correlation maps: Several visualizations of a correlation matrix are proposed.
The blue-red option represents low correlations with cold colors (blue is used for the
correlations that are close to -1) and high correlations with hot colors (correlations
close to 1 are displayed in red).
The Black and white option either displays the positive correlations in black and the
negative correlations in white (the diagonal of 1s is displayed in grey), or displays the
significant correlations in black and the correlations that are not significantly different
from 0 in white.
The Patterns option represents positive correlations by lines that rise from left to right,
and negative correlations by lines that rise from right to left. The higher the absolute
value of the correlation, the larger the space between the lines.
Scatter plots: Activate this option to display the scatter plots for all two by two combinations of
variables.
Matrix of plots: Check this option to display all possible combinations of variables in
pairs in the form of a two-entry table with the various variables displayed in rows and in
columns.
Histograms: Activate this option so that XLSTAT displays a histogram when the X and
Y variables are identical.
Q-Q plots: Activate this option so that XLSTAT displays a Q-Q plot when the X and Y
variables are identical.
Results
The correlation matrix and the table of the p-values are displayed. The correlation maps allow
identifying potential structures in the matrix, or to quickly identify interesting correlations.
Example
http://www.xlstat.com/demo-corrsp.htm
References
Best D. J. and Roberts D. E. (1975). Algorithm AS 89: The upper tail probabilities of
Spearman's rho. Applied Statistics, 24, 377-379.
Best D.J. and Gipps P.G. (1974). Algorithm AS 71, Upper tail probabilities of Kendall's tau.
Applied Statistics, 23, 98-100.
Hollander M. and Wolfe D. A. (1973). Nonparametric Statistical Methods. John Wiley &
Sons, New York.
Kendall M. (1955). Rank Correlation Methods, Second Edition. Charles Griffin and Company,
London.
Lehmann E.L (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San
Francisco.
Tests on contingency tables (chi-square, ...)
Use this tool to study the association between the rows and the columns of a contingency
table, and to test the independence between the rows and the columns.
Note: to build a contingency table from two qualitative variables you can use the Build a
contingency table feature of XLSTAT.
Description
Many association measures and several tests have been proposed to evaluate the relationship
between the R rows and the C columns of a contingency table.
Some association measures have been specifically developed for the 2x2 tables. Others can
only be used with ordinal variables.
XLSTAT always displays all the measures. However, measures that concern ordinal variables
should only be interpreted if the variables are ordinal and sorted in increasing order in the
contingency table.
Tests of independence between the rows and the columns of a contingency table
The Pearson chi-square statistic allows testing the independence between the rows
and the columns of the table, by measuring to what extent the observed table is far (in
the chi-square sense) from the expected table computed using the same marginal
sums. Writing n_ij for the observed frequencies and f_ij = n_i. n_.j / n for the expected
frequencies, the statistic writes:

χ²_P = Σ_{i=1..R} Σ_{j=1..C} (n_ij - f_ij)² / f_ij

One shows that this statistic follows a Chi-square distribution with (R-1)(C-1) degrees of
freedom. However, this result is asymptotic and, before using the test, it is
recommended to make sure that the expected frequencies in the various cells are not
too small (a classical rule of thumb is that they should be at least 5).
In the case where R=2 and C=2, a continuity correction has been suggested by Yates
(1934). The modified statistic writes:

χ²_Y = Σ_{i=1..2} Σ_{j=1..2} (|n_ij - f_ij| - 0.5)² / f_ij
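Both statistics can be computed directly from a table (an illustrative NumPy sketch; the function and variable names are ours, not XLSTAT's):

```python
import numpy as np

def pearson_chi2(table, yates=False):
    """Pearson chi-square of an R x C contingency table; with yates=True,
    the Yates continuity correction is applied (meant for 2x2 tables)."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    # expected frequencies f_ij = n_i. * n_.j / n
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    diff = np.abs(obs - expected)
    if yates:
        diff = np.maximum(diff - 0.5, 0.0)
    return float((diff ** 2 / expected).sum())

table = [[20, 30], [25, 15]]
print(pearson_chi2(table))              # compare to chi2 with (R-1)(C-1) df
print(pearson_chi2(table, yates=True))  # corrected statistic, always smaller
```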
A test based on the likelihood ratio and on the Wilks G statistic has been developed as
an alternative to the Pearson chi-square test. It consists in comparing the likelihood of
the observed table to the likelihood of the expected table defined as for the Pearson chi-
square test. G is defined by:

G = 2 Σ_{i=1..R} Σ_{j=1..C} n_ij ln(n_ij / f_ij)

As for the Pearson statistic, G follows asymptotically a Chi-square distribution with (R-
1)(C-1) degrees of freedom.
Fisher's exact test allows computing the probability that a table showing a
stronger association between the rows and the columns would be observed, the
marginal sums being fixed, and under the null hypothesis of independence between
rows and columns. In the case of a 2x2 table, the independence is measured through
the odds ratio (see below for further details) given by θ = (n11.n22)/(n12.n21). The
independence corresponds to the case where θ = 1. There are three possible
alternative hypotheses: the two-sided test corresponds to θ ≠ 1, the lower one-sided test
to θ < 1 and the upper one-sided test to θ > 1.
XLSTAT allows computing Fisher's exact two-sided test when R ≥ 2 and C ≥ 2. The
computing method is based on the network algorithm developed by Mehta (1986) and
Clarkson (1993). It may fail in some cases. The user is prompted when this happens.
Monte Carlo test: A nonparametric test based on simulations has been developed to
test the independence between rows and columns. A number of Monte Carlo
simulations defined by the user are performed in order to generate contingency tables
that have the same marginal sums as the observed table. The chi-square statistic is
computed for each of the simulated tables. The p-value is then determined by using the
distribution obtained from the simulations.
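A sketch of such a Monte Carlo test (illustrative Python, not XLSTAT's algorithm; here the simulated tables are generated by permuting one variable's labels, which preserves both marginal sums):

```python
import numpy as np

def chi2_stat(t):
    """Pearson chi-square of a contingency table."""
    e = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
    return ((t - e) ** 2 / e).sum()

def monte_carlo_pvalue(table, n_sim=2000, seed=0):
    """p-value of the chi-square independence test estimated by simulation."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(table)
    # rebuild the raw (row label, column label) observations
    rows = np.repeat(np.arange(obs.shape[0]), obs.sum(axis=1))
    cols = np.concatenate([np.repeat(np.arange(obs.shape[1]), r) for r in obs])
    chi2_obs = chi2_stat(obs.astype(float))
    hits = 0
    for _ in range(n_sim):
        sim = np.zeros_like(obs)
        np.add.at(sim, (rows, rng.permutation(cols)), 1)  # same margins
        if chi2_stat(sim.astype(float)) >= chi2_obs:
            hits += 1
    return (hits + 1) / (n_sim + 1)

print(monte_carlo_pvalue([[30, 5], [5, 30]]))  # strong association: small p
```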
A first series of association coefficients between the rows and the columns of a contingency
table is proposed:
Pearson's Phi coefficient allows measuring the association between the rows and
the columns of an RxC table. In the case of a 2x2 table, its value ranges from -1 to 1
and writes:

φ = (n11 n22 - n12 n21) / √(n1. n2. n.1 n.2)

When R>2 and/or C>2, it ranges between 0 and the minimum of the square roots of R-1
and C-1. In that case, Pearson's Phi writes:

φ_P = √(χ²_P / n)

Contingency coefficient: This coefficient, also derived from the Pearson chi-square
statistic, writes:

C = √(χ²_P / (χ²_P + n))

Cramér's V: This coefficient is also derived from the Pearson chi-square statistic. In the
case of a 2x2 table, its value has the [-1; 1] range and it writes:

V = φ

When R>2 and/or C>2, it ranges between 0 and 1 and its value is given by:

V = √( (χ²_P / n) / min(R-1, C-1) )

The closer V is to 0, the more the rows and the columns are independent.
Tschuprow's T: This coefficient is also derived from the Pearson chi-square statistic.
Its value ranges from 0 to 1 and is given by:

T = √( (χ²_P / n) / √((R-1)(C-1)) )

The closer T is to 0, the more the rows and the columns are independent.
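For example, Cramér's V in the general case can be obtained from the chi-square statistic as follows (a short illustrative sketch; the function name is ours):

```python
import math
import numpy as np

def cramers_v(table):
    """Cramer's V of an R x C table, derived from the Pearson chi-square."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / n
    chi2 = ((t - expected) ** 2 / expected).sum()
    r, c = t.shape
    return math.sqrt((chi2 / n) / min(r - 1, c - 1))

print(cramers_v([[10, 0], [0, 10]]))  # perfect association: V = 1
print(cramers_v([[5, 5], [5, 5]]))    # independence: V = 0
```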
Goodman and Kruskal tau (R/C) and (C/R): This coefficient, unlike the Pearson
coefficient, is asymmetric. It allows measuring the degree of dependence of the rows on
the columns (R/C) or vice versa (C/R).
Cohen's kappa: This coefficient is computed on RxR tables. It is useful in the case of
paired qualitative samples. For example, we ask the same question to the same
individuals at two different times. The results are summarized in a contingency table.
Cohen's kappa, whose value ranges from 0 to 1, allows measuring to what extent the
answers are identical. The closer the kappa is to 1, the higher the association between
the two variables.
Yule's Q: This coefficient is used on 2x2 tables only. It is computed using the product of
the concordant data (n11.n22) and the product of the discordant data (n12.n21). It
ranges from -1 to 1. A negative value corresponds to a discordance between the two
variables, a value close to 0 corresponds to independence, and a value close to 1 to
concordance. Yule's Q is equal to the Goodman and Kruskal Gamma when the
latter is computed on a 2x2 table.
Yule's Y: This coefficient is used on 2x2 tables only. It is similar to Yule's Q and
ranges from -1 to 1.
A second series of association coefficients between the rows and the columns of a
contingency table is proposed. Confidence ranges around the estimated values are available.
As the confidence ranges are computed using asymptotic results, their reliability increases
with the number of data.
Theil's U (R/C) and (C/R): The asymmetric uncertainty coefficient U of Theil (R/C)
allows measuring the proportion of the uncertainty of the row variable that is explained
by the column variable, and reciprocally in the C/R case. These coefficients range from
0 to 1. The symmetric version of the coefficient, which also ranges from 0 to 1, is
computed using the two asymmetric (R/C) and (C/R) coefficients.
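The R/C version can be sketched via entropies (illustrative Python; it computes U(R/C) = (H(R) - H(R|C)) / H(R), and the function name is ours):

```python
import math
import numpy as np

def theils_u(table):
    """Theil's uncertainty coefficient U(R/C): share of the entropy of the
    row variable explained by the column variable."""
    p = np.asarray(table, dtype=float)
    p /= p.sum()
    pr = p.sum(axis=1)   # row marginals
    pc = p.sum(axis=0)   # column marginals
    h_r = -sum(x * math.log(x) for x in pr if x > 0)
    # conditional entropy H(R|C)
    h_rc = -sum(p[i, j] * math.log(p[i, j] / pc[j])
                for i in range(p.shape[0]) for j in range(p.shape[1])
                if p[i, j] > 0)
    return (h_r - h_rc) / h_r

print(theils_u([[10, 0], [0, 10]]))  # columns fully determine rows: U = 1
print(theils_u([[5, 5], [5, 5]]))    # independence: U = 0
```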
Odds ratio and Log(Odds ratio): The odds ratio is given in the case of a 2x2 table by
θ = (n11.n22)/(n12.n21). θ varies from 0 to infinity. θ can be interpreted as the increase
in the chances of being in column 1 when being in row 1 compared to when being in row 2.
The case θ = 1 corresponds to no advantage. When θ > 1, the probability is θ times higher
for row 1 than for row 2. We compute the logarithm of the odds ratio because its variance is
easier to compute, and because its distribution is symmetric around 0, which allows obtaining a
confidence interval. The confidence interval of the odds ratio itself is computed by taking the
exponential of the confidence interval on the log(odds ratio).
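The log-based interval can be sketched as follows (illustrative Python, not XLSTAT's code; it assumes the usual large-sample standard error of the log odds ratio, √(1/n11 + 1/n12 + 1/n21 + 1/n22)):

```python
import math
from statistics import NormalDist

def odds_ratio_ci(n11, n12, n21, n22, alpha=0.05):
    """Odds ratio of a 2x2 table, with a confidence interval obtained by
    exponentiating the symmetric interval built on log(theta)."""
    theta = (n11 * n22) / (n12 * n21)
    # large-sample standard error of log(theta)
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    log_theta = math.log(theta)
    return theta, math.exp(log_theta - z * se), math.exp(log_theta + z * se)

theta, lo, hi = odds_ratio_ci(10, 20, 5, 40)
print(theta, lo, hi)  # the interval is asymmetric around theta
```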
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Contingency table: Select the data that correspond to the contingency table. If row and
column labels are included, make sure that the Labels included option is checked.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Labels included: Activate this option if the row and column labels are selected.
Options tab:
Chi-square test: Activate this option to display the statistics and the interpretation of the Chi-
square test of independence between rows and columns.
Likelihood ratio test: Activate this option to perform the Wilks G likelihood ratio test.
Monte Carlo method: Activate this option to compute the p-value using Monte Carlo
simulations.
Significance level (%): Enter the significance level for the test.
Fisher's exact test: Activate this option to compute Fisher's exact test. In the case of a
2x2 table, you can choose the alternative hypothesis. In the other cases, the two-sided test is
automatically used (see the description section for more details).
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Replace missing data by 0: Activate this option if you consider that missing data are
equivalent to 0.
Replace missing data by their expected value: Activate this option if you want to replace the
missing data by the expected value. The expectation is given by:

E(n_ij) = n_i. n_.j / n

where n_i. is the row sum, n_.j is the column sum, and n is the grand total of the table before
replacement of the missing data.
Outputs tab:
List of combines: Activate this option to display the table that lists all the possible
combinations of the two variables used to create the contingency table, and the corresponding
frequencies.
Inertia by cell: Activate this option to display the inertia for each cell of the contingency table.
Chi-square by cell: Activate this option to display the contribution to the chi-square of each
cell of the contingency table.
Significance by cell: Activate this option to display a table indicating, for each cell, if the
actual value is equal to (=), lower (<) or higher (>) than the theoretical value, and to run a test
(Fisher's exact test on a 2x2 table having the same total frequency as the complete table,
and the same marginal sums for the cell of interest), in order to determine if the difference with
the theoretical value is significant or not.
Association coefficients: Activate this option to display the various association
coefficients.
Observed frequencies: Activate this option to display the table of the observed frequencies.
This table is almost identical to the contingency table, except that the marginal sums are also
displayed.
Theoretical frequencies: Activate this option to display the table of the theoretical frequencies
computed using the marginal sums of the contingency table.
Proportions or percentages / Row: Activate this option to display the table of proportions or
percentages computed by dividing the values of the contingency table by the marginal sums of
each row.
Proportions or percentages / Column: Activate this option to display the table of proportions
or percentages computed by dividing the values of the contingency table by the marginal sums
of each column.
Proportions or percentages / Total: Activate this option to display the table of proportions or
percentages computed by dividing the values of the contingency table by the sum of all the
cells of the contingency table.
Charts tab:
3D view of the contingency table: Activate this option to display the 3D bar chart
corresponding to the contingency table.
Results
The results that are displayed correspond to the various statistics, tests and association
coefficients described in the description section.
References
Agresti A. (1990). Categorical data analysis. John Wiley & Sons, New York.
Agresti A. (1992). A survey of exact inference for contingency tables. Statistical Science, 7(1),
131-177.
Mehta C.R. and Patel N.R. (1986). Algorithm 643. FEXACT: A Fortran subroutine for Fisher's
exact test on unordered r*c contingency tables. ACM Transactions on Mathematical Software,
12, 154-161.
Clarkson D.B., Fan Y. and Joe H. (1993). A remark on algorithm 643: FEXACT: An algorithm
for performing Fisher's exact test in r x c contingency tables. ACM Transactions on
Mathematical Software, 19, 484-488.
Fleiss J.L. (1981). Statistical Methods for Rates and Proportions, Second Edition. John Wiley
& Sons, New York.
Saporta G. (1990). Probabilits, Analyse des Donnes et Statistique. Technip, Paris. 199-216.
Sokal R.R. and Rohlf F.J. (1995). Biometry. The Principles and Practice of Statistics in
Biological Research, Third edition. Freeman, New York.
Yates F. (1934). Contingency tables involving small numbers and the Chi-square test. Journal
of the Royal Statistical Society, Suppl.1, 217-235.
Cochran-Armitage trend test
Use this tool to test if a series of proportions, possibly computed from a contingency table, can
be considered as varying linearly with an ordinal or continuous variable.
Description
If X is the score variable, the statistic that is computed to test for the linearity is given by:

z = Σ_{i=1..r} n_i1 (X_i - X̄) / √( p1 (1 - p1) s² ),  with  s² = Σ_{i=1..r} n_i (X_i - X̄)²

where r is the number of score levels, n_i1 is the number of events out of the n_i trials
observed at score X_i, p1 is the overall proportion of events, and X̄ is the weighted mean of
the scores.
Note: if X is an ordinal variable, the minimum value of X has no influence on the value of z.
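The statistic can be sketched in a few lines (illustrative pure Python, not XLSTAT's code; events[i] successes out of totals[i] trials at score scores[i], z asymptotically standard normal under H0):

```python
import math

def cochran_armitage_z(events, totals, scores):
    """Cochran-Armitage trend statistic z for a series of proportions."""
    n = sum(totals)
    p1 = sum(events) / n                                   # overall proportion
    xbar = sum(t * x for t, x in zip(totals, scores)) / n  # weighted mean score
    num = sum(e * (x - xbar) for e, x in zip(events, scores))
    s2 = sum(t * (x - xbar) ** 2 for t, x in zip(totals, scores))
    return num / math.sqrt(p1 * (1 - p1) * s2)

z = cochran_armitage_z([10, 20, 30], [40, 40, 40], [1, 2, 3])
# shifting all scores by a constant leaves z unchanged
z_shifted = cochran_armitage_z([10, 20, 30], [40, 40, 40], [11, 12, 13])
```

The second call illustrates the note above: only the spacing of the scores matters, not their origin.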
In the case of the two-tailed (or two-sided) test, the null (H0) and alternative (Ha) hypotheses
are:
H0 : z = 0
Ha : z ≠ 0
In the one-tailed case, you need to distinguish the left-tailed (or lower-tailed or lower one-
sided) test and the right-tailed (or upper-tailed or upper one-sided) test. In the left-tailed test,
the following hypotheses are used:
H0 : z = 0
Ha : z < 0
If Ha is chosen, one concludes that the proportions decrease when the score variable
increases.
In the right-tailed test, the following hypotheses are used:
H0 : z = 0
Ha : z > 0
If Ha is chosen, one concludes that the proportions increase when the score variable
increases.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down (column mode), XLSTAT considers that rows correspond to observations and columns to
variables. If the arrow points to the right (row mode), XLSTAT considers that rows correspond
to variables and columns to observations.
General tab:
Contingency table: Select a contingency table. If the column labels of the table have been
selected, make sure the Column labels option is checked.
Proportions: Select the column (or row if in row mode) that contains the proportions. If a
column has been selected, make sure the Column labels option is checked.
Sample sizes: If you selected proportions, you must select the corresponding sample sizes. If
a column has been selected, make sure the Column labels option is checked.
Row labels: Activate this option to select the labels of the rows.
Data format:
Contingency table: Activate this option if your data are contained in a contingency
table.
Proportions: Activate this option if your data are available as proportions and sample
sizes.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column labels: Activate this option if column headers have been selected within the
selections.
Scores: You can choose between ordinal scores (1, 2, 3, ...) or user defined scores.
User defined: Activate this option to select the scores. If a column has been selected,
make sure the Column labels option is checked.
Options tab:
Alternative hypothesis: Choose the alternative hypothesis to be used for the test (see
description).
Significance level (%): Enter the significance level for the test (default value: 5%).
Asymptotic p-value: Activate this option to compute the p-value based on the asymptotic
distribution of the z statistic.
Monte Carlo method: Activate this option to compute the p-value using Monte Carlo
simulations. Enter the number of simulations to perform.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Outputs tab:
Charts tab:
Proportions: Activate this option to display a scatter plot with the scores as abscissa and the
proportions as ordinates.
Results
The results include a summary table with the input data, a chart showing the proportions as a
function of the scores. The next results correspond to the test itself, and its interpretation.
References
Agresti A. (1990). Categorical Data Analysis. John Wiley and Sons, New York.
Armitage P. (1955). Tests for linear trends in proportions and frequencies. Biometrics; 11,
375-386.
Cochran W.G. (1954). Some methods for strengthening the common Chi-square tests,
Biometrics, 10, 417-451.
Snedecor G.W. and Cochran W.G. (1989). Statistical Methods, 8th Edition. Iowa State
University Press, Ames.
Mantel test
Use this test to compute the linear correlation between two proximity matrices (simple Mantel
test), or to compute the linear correlation between two matrices knowing their correlation with a
third matrix (partial Mantel test).
Description
Mantel (1967) proposed a first statistic to measure the correlation between two proximity
(similarity or dissimilarity) symmetric matrices A and B of size n:

z(AB) = Σ_{i=1..n-1} Σ_{j=i+1..n} a_ij b_ij

The standardized Mantel statistic, easier to use because it varies between -1 and 1, is the
Pearson correlation coefficient between the two matrices:

r(AB) = 1/(n(n-1)/2 - 1) Σ_{i=1..n-1} Σ_{j=i+1..n} (a_ij - ā)(b_ij - b̄) / (s_a s_b)

where ā and s_a (resp. b̄ and s_b) are the mean and the standard deviation of the off-diagonal
elements of A (resp. B).
Notes:
In the case where the similarities or dissimilarities are ordinal, one can use the
Spearman or Kendall correlation coefficients.
In the case where the matrices are not symmetric, the computations are still possible.
While it is not a problem to compute the correlation coefficient between two sets of proximity
coefficients, testing their significance cannot be done using the usual approach that is used to
test correlations: the latter tests require assuming the independence of the data,
which is not the case here. A permutation test has been proposed to determine if the
correlation coefficient can be considered as showing a significant correlation between the
matrices or not.
In the case of the two-tailed (or two-sided) test, the null (H0) and alternative (Ha) hypotheses
are:
H0 : r(AB) = 0
Ha : r(AB) ≠ 0
In the one-tailed case, you need to distinguish the left-tailed (or lower-tailed or lower one-
sided) test and the right-tailed (or upper-tailed or upper one-sided) test. In the left-tailed test,
the following hypotheses are used:
H0 : r(AB) = 0
Ha : r(AB) < 0
In the right-tailed test, the following hypotheses are used:
H0 : r(AB) = 0
Ha : r(AB) > 0
The Mantel test consists of computing the correlation coefficients that would be obtained after
permuting the rows and columns of one of the matrices. The p-value is calculated using the
distribution of the r(AB) coefficients obtained from S permutations. In the case where n, the
number of rows and columns of the matrices, is lower than 10, all the possible permutations
can easily be computed. If n is 10 or more, one needs to randomly generate a set of S
permutations in order to estimate the distribution of r(AB).
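A minimal sketch of the simple Mantel test (illustrative NumPy code, not XLSTAT's implementation; a full version would also enumerate all permutations when n is small):

```python
import numpy as np

def mantel_test(A, B, n_perm=999, seed=0):
    """Simple Mantel test: Pearson correlation between the upper triangles
    of two symmetric matrices, with a two-sided permutation p-value obtained
    by relabelling the rows/columns of B."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)

    def corr(M):
        return np.corrcoef(A[iu], M[iu])[0, 1]

    r_obs = corr(B)
    rng = np.random.default_rng(seed)
    hits = sum(
        abs(corr(B[np.ix_(p, p)])) >= abs(r_obs)
        for p in (rng.permutation(n) for _ in range(n_perm))
    )
    return r_obs, (hits + 1) / (n_perm + 1)

# distances between six points on a line: perfectly correlated matrices
pts = np.array([0.0, 1.0, 2.0, 4.0, 7.0, 11.0])
D = np.abs(np.subtract.outer(pts, pts))
r, p = mantel_test(D, D)
```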
A Mantel test for more than two matrices has been proposed (Smouse et al., 1986): when we
have three proximity matrices A, B and C, the partial Mantel statistic r(AB.C) for the A and B
matrices knowing the C matrix is computed as a partial correlation coefficient. In order to
determine if the coefficient is significantly different from 0, a p-value is computed using random
permutations as described by Smouse et al (1986).
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click this button to reload the default options.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Matrix A: Select the first proximity matrix. If the row and column labels are included, make
sure the labels included option is checked.
Matrix B: Select the second proximity matrix. If the row and column labels are included, make
sure the labels included option is checked.
Matrix C: Activate this option if you want to compute the partial Mantel test. Then select the
third proximity matrix. If the row and column labels are included, make sure the labels
included option is checked.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Labels included: Activate this option if the row and column labels have been selected.
Options tab:
Alternative hypothesis: Choose the alternative hypothesis to be used for the test (see
description).
Significance level (%): Enter the significance level for the test.
Exact p-values: Activate this option so that XLSTAT tries to compute all the possible
permutations when possible, to obtain an exact distribution of the Mantel statistic.
Number of permutations: Enter the number of permutations to perform in the case where it is
not possible to generate all the possible permutations.
Type of correlation: Select the type of correlation to use to compute the standardized Mantel
statistic.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Charts tab:
Scatter plot: Activate this option to display a scatter plot using the values of matrix A on the X
axis and the values of the matrix B on the Y axis.
Histogram: Activate this option to display the histogram computed from the distribution of
r(AB) based on the permutations.
Results
The displayed results correspond to the standardized Mantel statistic, to the corresponding p-
value for the selected alternative hypothesis. A first level interpretation of the test is provided.
The histogram of the r(AB) distribution is displayed if the corresponding option has been
checked. The observed value of r(AB) is displayed on the histogram.
Example
An example showing how to use the Mantel test is displayed on the Addinsoft website:
http://www.xlstat.com/demo-mantel.htm
References
Legendre P. and Legendre L. (1998). Numerical Ecology. Second English Edition. Elsevier,
Amsterdam.
Mantel N. (1967). A technique of disease clustering and a generalized regression approach.
Cancer Research, 27, 209-220.
Smouse P.E., Long J.C. and Sokal R.R. (1986). Multiple regression and correlation
extension of the Mantel test of matrix correspondence. Systematic Zoology, 35, 627-632.
Sokal R.R. and Rohlf F.J. (1995). Biometry. The Principles and Practice of Statistics in
Biological Research. Third Edition. Freeman, New York.
One-sample t and z tests
Use this tool to compare the mean of a normally-distributed sample with a given value.
Description
Let the average of a sample be represented by x̄. To compare this mean with a reference
value μ0, two parametric tests are possible:
Student's t test if the true variance of the population from which the sample has been extracted
is not known; the sample variance s² is used as the variance estimator.
z test if the true variance σ² of the population is known.
These two tests are said to be parametric as their use requires the assumption that the
samples are distributed normally. Moreover, it is also assumed that the observations are
independent and identically distributed. The normality of the distribution can be tested
beforehand using the normality tests.
Three types of test are possible depending on the alternative hypothesis chosen:
For the two-tailed test, the null H0 and alternative Ha hypotheses are as follows:
H0 : μ = μ0
Ha : μ ≠ μ0
For the left-tailed test:
H0 : μ = μ0
Ha : μ < μ0
For the right-tailed test:
H0 : μ = μ0
Ha : μ > μ0
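The t statistic underlying this test can be sketched as follows (an illustrative computation, not XLSTAT's implementation; the function name and example data are invented):

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t, df) for testing H0: mu = mu0 when the population
    variance is unknown (the sample standard deviation is used)."""
    n = len(sample)
    s = stdev(sample)                       # sample standard deviation
    t = (mean(sample) - mu0) / (s / math.sqrt(n))
    return t, n - 1
```

The resulting t is then compared with a Student distribution with n - 1 degrees of freedom; for a known population variance σ², replacing s by σ gives the z statistic, referred to the standard normal distribution.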
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
One column/row per sample: Activate this option for XLSTAT to consider that each
column (column mode) or row (row mode) corresponds to a sample. You can then test
the hypothesis on several samples at the same time.
One sample: Activate this option for XLSTAT to consider that all the selected values,
whatever the number of rows or columns, belong to the same sample.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/row labels: Activate this option if the first row (column mode) or first column (rows
mode) of the selected data contain labels.
Options tab:
Alternative hypothesis: Choose the alternative hypothesis to be used for the test (see
description).
Theoretical mean: Enter the value of the theoretical mean with which the mean of the sample
is to be compared.
Significance level (%): Enter the significance level for the tests (default value: 5%).
Where a z test has been requested, the population variance value must be entered.
Estimated using samples: Activate this option for XLSTAT to estimate the variance of
the population from the sample data. This should, in principle, lead to a t test, but this
option is offered for teaching purposes only.
User defined: Enter the value of the known variance of the population.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
samples.
Results
The results displayed by XLSTAT relate to the various statistics of the tests selected and the
interpretation arising from these.
References
Sincich T. (1996). Business Statistics by Example, 5th Edition. Prentice-Hall, Upper Saddle
River.
Sokal R.R. & Rohlf F.J. (1995). Biometry. The Principles and Practice of Statistics in
Biological Research. Third Edition. Freeman, New York.
Two-sample t and z tests
Use this tool to compare the means of two normally distributed independent or paired samples.
Description
Parametric t and z tests are used to compare the means of two samples. The calculation
method differs according to the nature of the samples. A distinction is made between
independent samples (for example a comparison of annual sales by shop between two regions
for a chain of supermarkets), or paired samples (for example if comparing the annual sales
within the same region over two years).
The t and z tests are known as parametric because the assumption is made that the samples
are normally distributed. This hypothesis could be tested using normality tests.
Take a sample S1, independent of S2, comprising n1 observations, of mean x̄1 and variance
s1². Take a second sample S2 comprising n2 observations, of mean x̄2 and variance s2². Let
D be the assumed difference between the means (D is 0 when equality is assumed). Two tests
are possible depending on whether the true variances of the populations are known:
Student's t test if the true variance of the populations from which the samples are extracted is
not known;
z test if the true variance of the populations is known.
Student's t Test
The use of Student's t test requires a decision to be taken beforehand on whether variances of
the samples are to be considered equal or not. XLSTAT gives the option of using Fisher's F
test to test the hypothesis of equality of the variances and to use the result of the test in the
subsequent calculations.
If we consider that the two samples have the same variance, the common variance is
estimated by:
s² = [(n1 - 1)·s1² + (n2 - 1)·s2²] / (n1 + n2 - 2)
and the test statistic is given by:
t = (x̄1 - x̄2 - D) / (s·√(1/n1 + 1/n2))
If we consider that the variances are different, the statistic is given by:
t = (x̄1 - x̄2 - D) / √(s1²/n1 + s2²/n2)
and the degrees of freedom are approximated by (Satterthwaite):
df = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1)]
Cochran and Cox (1950) proposed an approximation to determine the p-value. It is given as an
option in XLSTAT.
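Both variants of the t statistic can be sketched in a few lines (illustrative only, not XLSTAT's implementation; the function name and the Satterthwaite approximation for unequal variances are the assumptions here):

```python
import math
from statistics import mean, variance

def two_sample_t(x, y, D=0.0, equal_var=True):
    """t statistic and degrees of freedom for H0: mu1 - mu2 = D."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)
    if equal_var:
        # pooled estimate of the common variance
        s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        t = (mean(x) - mean(y) - D) / math.sqrt(s2 * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        se2 = v1 / n1 + v2 / n2
        t = (mean(x) - mean(y) - D) / math.sqrt(se2)
        # Satterthwaite approximation for the degrees of freedom
        df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```

The statistic is then compared with a Student distribution with df degrees of freedom.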
z-Test
For the z-test, the variance σ² of the population is presumed to be known. The user can enter
this value or estimate it from the data (this is offered for teaching purposes only). The test
statistic is given by:
z = (x̄1 - x̄2 - D) / (σ·√(1/n1 + 1/n2))
If two samples are paired, they have to be of the same size. Where values are missing from
certain observations, either the observation is removed from both samples or the missing
values are estimated.
We study the mean of the differences calculated for the n observations. If d̄ is the mean of the
differences, s² the variance of the differences, and D the supposed difference, the statistic of
the t test is given by:
t = (d̄ - D) / (s/√n)
For the z test, where the variance σ² of the differences is presumed known, the statistic is
given by:
z = (d̄ - D) / (σ/√n)
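The paired t statistic above amounts to a one-sample test on the differences, which can be sketched as follows (illustrative only; the function name and example are invented):

```python
import math
from statistics import mean, stdev

def paired_t(x, y, D=0.0):
    """t statistic for paired samples: tests H0: mean of (x - y) = D."""
    assert len(x) == len(y), "paired samples must have the same size"
    d = [a - b for a, b in zip(x, y)]       # per-observation differences
    n = len(d)
    t = (mean(d) - D) / (stdev(d) / math.sqrt(n))
    return t, n - 1
```

The statistic follows a Student distribution with n - 1 degrees of freedom under the null hypothesis.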
Alternative hypotheses
Three types of test are possible depending on the alternative hypothesis chosen:
For the two-tailed test, the null H0 and alternative Ha hypotheses are as follows:
H0 : μ1 - μ2 = D
Ha : μ1 - μ2 ≠ D
For the left-tailed test:
H0 : μ1 - μ2 = D
Ha : μ1 - μ2 < D
For the right-tailed test:
H0 : μ1 - μ2 = D
Ha : μ1 - μ2 > D
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data / Sample 1: If the format of the selected data is "one column per variable", select the
data for the various samples in the Excel worksheet. If the format of the selected data is "one
column per sample" or "paired samples", select a column of data corresponding to the first
sample.
Sample identifiers / Sample 2: If the format of the selected data is "one column per variable",
select the data identifying the two samples to which the selected data values correspond. If the
format of the selected data is "one column per sample" or "paired samples", select a column of
data corresponding to the second sample.
One column/row per sample: Activate this option to select one column (or row in row
mode) per sample.
One column/row per variable: Activate this option for XLSTAT to carry out as many
tests as there are columns/rows, given that each column/row must contain the same
number of rows/columns and that a sample identifier which enables each observation to
be assigned to a sample must also be selected.
Paired samples: Activate this option to carry out tests on paired samples. You must
then select a column (or row in row mode) per sample, all the time ensuring that the
samples are of the same size.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/row labels: Activate this option if the first row (column mode) or first column (rows
mode) of the selected data contain labels.
Options tab:
Alternative hypotheses: Choose the alternative hypothesis to be used for the test (see
description).
Hypothesized difference (D): Enter the value of the supposed difference between the
samples.
Significance level (%): Enter the significance level for the tests (default value: 5%).
Weights: This option is only available if the data format is One column/row per variable or
if the data are paired. Check this option if the observations are weighted. If you do not check
this option, the weights will all be considered equal to 1. Weights must be greater than or equal
to 0. If a column header has been selected, check that the "Column/row labels" option is
activated.
Where a z test has been requested, the value of the known variance of the populations, or, for
a test on paired samples, the variance of the difference must be entered.
Estimated using samples: Activate this option for XLSTAT to estimate the variance of the
population from the sample data. This should, in principle, lead to a t test, but this option is
offered for teaching purposes only.
User defined: Enter the values of the known variances of the populations.
Assume equality: Activate this option to consider that the variances of the samples are equal.
Cochran-Cox: Activate this option to calculate the p-value by using the Cochran and Cox
method where the variances are assumed to be unequal.
Use an F test: Activate this option to use Fisher's F test to determine whether the variances of
both samples can be considered to be equal or not.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
samples.
Charts tab:
Dominance diagram: Activate this option to display a dominance diagram in order to make a
visual comparison of the samples.
Results
The results displayed by XLSTAT relate to the various statistics of the tests selected and the
interpretation arising from these.
The dominance diagram enables a visual comparison of the samples to be made. The first
sample is represented on the x-axis and the second on the y-axis. To build this diagram, the
data from the samples is sorted first of all. When an observation in the second sample is
greater than an observation in the first sample, a "+" is displayed. When an observation in the
second sample is less than an observation in the first sample, a "-" is displayed. In the case of
a tie, a "o" is displayed.
References
Cochran W.G. and Cox G.M. (1950). Experimental Designs. John Wiley & Sons, New York.
Sincich T. (1996). Business Statistics by Example, 5th Edition. Prentice-Hall, Upper Saddle
River.
Sokal R.R. and Rohlf F.J. (1995). Biometry. The Principles and Practice of Statistics in
Biological Research. Third Edition. Freeman, New York.
Tomassone R., Dervin C. and Masson J.P. (1993). Biométrie. Modélisation de Phénomènes
Biologiques. Masson, Paris.
Two-sample comparison of variances
Use this tool to compare the variances of two samples. If you want to compare the means of k
samples, use the ANOVA tool, which enables multiple comparison tests to be used.
Description
Three parametric tests are offered for the comparison of the variances of two samples. Take a
sample S1 comprising n1 observations with variance s1². Take a second sample S2
comprising n2 observations with variance s2².
Fisher's F test
The test statistic is given by:
F = s1² / (R·s2²)
where R is the hypothesized ratio of the variances (R = 1 when testing for equality).
This statistic follows a Fisher distribution with (n1-1) and (n2-1) degrees of freedom if both
samples follow a normal distribution.
Three types of test are possible depending on the alternative hypothesis chosen:
For the two-tailed test, the null H0 and alternative Ha hypotheses are as follows:
H0 : s1² = R·s2²
Ha : s1² ≠ R·s2²
For the left-tailed test:
H0 : s1² = R·s2²
Ha : s1² < R·s2²
For the right-tailed test:
H0 : s1² = R·s2²
Ha : s1² > R·s2²
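The F statistic above is a direct ratio of sample variances, which can be sketched as follows (illustrative only; the function name and example data are invented):

```python
from statistics import variance

def fisher_f(x, y, R=1.0):
    """F statistic for H0: var1 = R * var2, with its two degrees of freedom.
    Under normality it follows an F(n1 - 1, n2 - 1) distribution."""
    return variance(x) / (R * variance(y)), len(x) - 1, len(y) - 1
```

The p-value is then read from the Fisher distribution with (n1 - 1, n2 - 1) degrees of freedom according to the chosen alternative hypothesis.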
Levene's test
Levene's test can be used to compare two or more variances. It is a two-tailed test for which
the null and alternative hypotheses are given by the following for the case where two variances
are being compared:
H0 : s1² = s2²
Ha : s1² ≠ s2²
The statistic from this test is more complex than that of the Fisher test and involves absolute
deviations from the mean (original article by Levene, 1960) or from the median (Brown and
Forsythe, 1974). The use of the mean is recommended for symmetrical distributions with
moderately heavy tails. The use of the median is recommended for asymmetric distributions.
The Levene statistic follows a Fisher's F distribution with 1 and n1 + n2 - 2 degrees of freedom.
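The Levene statistic is a one-way ANOVA F computed on the absolute deviations, which can be sketched as follows (illustrative only, not XLSTAT's implementation; the function name is invented and both centering choices are shown):

```python
from statistics import mean, median

def levene_statistic(samples, center="median"):
    """Levene (center="mean") or Brown-Forsythe (center="median") statistic:
    a one-way ANOVA F on absolute deviations from each group's center.
    Compare with an F(k - 1, n - k) distribution."""
    loc = mean if center == "mean" else median
    z = [[abs(v - loc(s)) for v in s] for s in samples]
    k = len(z)
    n = sum(len(g) for g in z)
    zbar = [mean(g) for g in z]                         # per-group mean deviation
    grand = sum(sum(g) for g in z) / n                  # overall mean deviation
    between = sum(len(g) * (zb - grand) ** 2 for g, zb in zip(z, zbar)) / (k - 1)
    within = sum(sum((v - zb) ** 2 for v in g) for g, zb in zip(z, zbar)) / (n - k)
    return between / within
```

For two samples of equal spread the statistic is near zero; large values point to unequal variances.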
Bartlett's test
Bartlett's test can be used to compare two or more variances. This test is sensitive to the
normality of the data. In other words, if the hypothesis of normality of the data seems fragile, it
is better to use Levene's or Fisher's test. On the other hand, Bartlett's test is more powerful if
the samples follow a normal distribution.
This also is a two-tailed test which can be used with two or more variances. Where two
variances are compared, the hypotheses are:
H0 : s1² = s2²
Ha : s1² ≠ s2²
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data / Sample 1: If the format of the selected data is "one column per variable", select the
data for the various samples in the Excel worksheet. If the format of the selected data is "one
column per sample", select a column of data corresponding to the first sample.
Sample identifiers / Sample 2: If the format of the selected data is "one column per variable",
select the data identifying the two samples to which the selected data values correspond. If the
format of the selected data is "one column per sample", select a column of data corresponding
to the second sample.
One column/row per sample: Activate this option to select one column (or row in row
mode) per sample.
One column/row per variable: Activate this option for XLSTAT to carry out as many
tests as there are columns/rows, given that each column/row must contain the same
number of rows/columns and that a sample identifier which enables each observation to
be assigned to a sample must also be selected.
Column/row labels: Activate this option if the first row (column mode) or first column (rows
mode) of the selected data contain labels.
Fisher's F test: Activate this option to use Fisher's F test (see description).
Levene's test: Activate this option to use Levene's test (see description).
Mean: Activate this option to use Levene's test based on the mean.
Median: Activate this option to use Levene's test based on the median.
Bartlett's test: Activate this option to use Bartlett's test (see description).
Options tab:
Alternative hypothesis: Choose the alternative hypothesis to be used for the test (see
description).
Hypothesized ratio (R): Enter the value of the supposed ratio between the variances of the
samples.
Significance level (%): Enter the significance level for the tests (default value: 5%).
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
samples.
Results
The results displayed by XLSTAT relate to the various statistics of the tests selected and the
interpretation arising from these.
References
Brown M. B. and Forsythe A. B. (1974). Robust tests for the equality of variances. Journal of
the American Statistical Association, 69, 364-367.
Sokal R.R. & Rohlf F.J. (1995). Biometry. The Principles and Practice of Statistics in
Biological Research. Third Edition. Freeman, New York.
k-sample comparison of variances
Use this tool to compare the variances of k samples.
Description
Two parametric tests are offered for the comparison of the variances of k samples (k ≥ 2).
Take k samples S1, S2, …, Sk, comprising n1, n2, …, nk observations, with variances s1², s2²,
…, sk².
Levene's test
Levene's test can be used to compare two or more variances. This is a two-tailed test for
which the null and alternative hypotheses are:
H0 : s1² = s2² = … = sk²
Ha : at least one of the variances is different from another.
The statistic from this test involves absolute deviations from the mean (original article by
Levene, 1960) or from the median (Brown and Forsythe, 1974). The use of the mean is
recommended for symmetrical distributions with moderately heavy tails. The use of the median
is recommended for asymmetric distributions.
The Levene statistic follows a Fisher distribution with k - 1 and N - k degrees of freedom, where N is the total number of observations.
Bartlett's test
Bartlett's test can be used to compare two or more variances. This test is sensitive to the
normality of the data. In other words, if the hypothesis of normality of the data seems fragile, it
is better to use Levene's test. On the other hand, Bartlett's test is more powerful if the samples
follow a normal distribution.
This also is a two-tailed test which can be used with two or more variances. The hypotheses
are:
H0 : s1² = s2² = … = sk²
Ha : at least one of the variances is different from another.
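Bartlett's statistic can be sketched as follows (illustrative only, not XLSTAT's implementation; the function name is invented, and the standard correction factor C is used):

```python
import math
from statistics import variance

def bartlett_statistic(samples):
    """Bartlett statistic for H0: all k variances are equal; under normality
    it is approximately Chi2-distributed with k - 1 degrees of freedom."""
    k = len(samples)
    n = [len(s) for s in samples]
    N = sum(n)
    v = [variance(s) for s in samples]
    # pooled variance across the k groups
    sp2 = sum((ni - 1) * vi for ni, vi in zip(n, v)) / (N - k)
    num = (N - k) * math.log(sp2) - sum((ni - 1) * math.log(vi) for ni, vi in zip(n, v))
    # correction factor improving the Chi2 approximation
    corr = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (N - k)) / (3 * (k - 1))
    return num / corr
```

Identical group variances give a statistic of zero; the p-value is read from a Chi² distribution with k - 1 degrees of freedom.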
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data / Sample 1: If the format of the selected data is "one column per variable", select the
data for the various samples in the Excel worksheet. If the format of the selected data is "one
column per sample", select a column of data corresponding to the first sample.
Sample identifiers / Sample 2: If the format of the selected data is "one column per variable",
select the data identifying the k samples to which the selected data values correspond. If the
format of the selected data is "one column per sample", select a column of data corresponding
to the second sample.
One column/row per sample: Activate this option to select one column (or row in row
mode) per sample.
One column/row per variable: Activate this option for XLSTAT to carry out as many
tests as there are columns/rows, given that each column/row must contain the same
number of rows/columns and that a sample identifier which enables each observation to
be assigned to a sample must also be selected.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/row labels: Activate this option if the first row (column mode) or first column (rows
mode) of the selected data contain labels.
Levene's test: Activate this option to use Levene's test (see description).
Mean: Activate this option to use Levene's test based on the mean.
Median: Activate this option to use Levene's test based on the median.
Bartlett's test: Activate this option to use Bartlett's test (see description).
Options tab:
Significance level (%): Enter the significance level for the tests (default value: 5%).
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
samples.
Results
The results displayed by XLSTAT relate to the various statistics of the tests selected and the
interpretation arising from these.
References
Brown M. B. and Forsythe A. B. (1974). Robust tests for the equality of variances. Journal of
the American Statistical Association, 69, 364-367.
Sokal R.R. and Rohlf F.J. (1995). Biometry. The Principles and Practice of Statistics in
Biological Research. Third Edition. Freeman, New York.
Tomassone R., Dervin C. and Masson J.P. (1993). Biométrie. Modélisation de Phénomènes
Biologiques. Masson, Paris.
Multidimensional tests (Mahalanobis, ...)
Use this tool to compare two or more samples simultaneously on several variables.
Description
The tests implemented in this tool are used to compare samples described by several
variables. For example, instead of comparing the means of two samples as with the Student's t
test, we compare here, for the same samples, the means measured for several variables
simultaneously.
Compared to a procedure that would involve as many Student's t tests as there are variables,
the method proposed here has the advantage of using the covariance structure of the
variables and of reaching an overall conclusion. It may be that two samples differ on one
variable according to a Student's t test, while overall it is impossible to reject the hypothesis
that they are similar.
Mahalanobis distance
The Mahalanobis distance, named after the Indian statistician Prasanta Chandra
Mahalanobis (1893-1972), allows computing the distance between two points in a p-
dimensional space while taking into account the covariance structure across the p
dimensions. The square of the Mahalanobis distance writes:
d²M = (x1 - x2)' Σ⁻¹ (x1 - x2)
In other words, it is the transpose of the vector of the differences of coordinates for the p
dimensions between the two points, multiplied by the inverse of the covariance matrix Σ,
multiplied by the vector of differences. The Euclidean distance corresponds to the Mahalanobis
distance where the covariance matrix is the identity matrix, which means that the variables are
standardized and independent.
The Mahalanobis distance can be used to compare two groups (or samples) because the
Hotelling T² statistic defined by:
T² = [n1·n2 / (n1 + n2)] · d²M
follows a Hotelling distribution if the samples are normally distributed for all variables. The F
statistic that is used for the comparison test, where the null hypothesis H0 is that the means of
the two samples are equal, is defined by:
F = [(n1 + n2 - p - 1) / ((n1 + n2 - 2)·p)] · T²
This statistic follows a Fisher's F distribution with p and n1 + n2 - p - 1 degrees of freedom if
the samples are normally distributed for all the variables.
Note: This test can only be used if we assume that the samples are normally distributed and
have identical covariance matrices. The second hypothesis can be tested with the Box or
Kullback tests available in this tool.
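The chain from Mahalanobis distance to Hotelling's T² and to F can be sketched as follows (illustrative only, not XLSTAT's implementation; the function name and the use of a pooled covariance estimate are the assumptions here):

```python
import numpy as np

def hotelling_two_sample(X1, X2):
    """Squared Mahalanobis distance between two group means, Hotelling's T2,
    and the F statistic with (p, n1 + n2 - p - 1) degrees of freedom.
    Assumes normality and equal covariance matrices (pooled estimate)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # pooled within-group covariance matrix
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    d2 = float(diff @ np.linalg.inv(S) @ diff)          # squared Mahalanobis distance
    T2 = n1 * n2 / (n1 + n2) * d2                       # Hotelling's T2
    F = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2    # F(p, n1 + n2 - p - 1)
    return d2, T2, F
```

For p = 1 the T² statistic reduces to the square of the two-sample t statistic.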
If we want to compare more than two samples, the test based on the Mahalanobis distance
can be used to identify possible sources of the difference observed at the global level. It is then
recommended to use the Bonferroni correction for the alpha significance level. For k samples,
the following significance level should be used:
α* = 2α / [k(k - 1)]
Wilks' lambda
The Wilks' lambda statistic follows the three-parameter Wilks distribution defined by:
Λ(p, m, n) = |A| / |A + B|
where A and B are two positive semi-definite matrices that respectively follow the Wishart
distributions Wp(I, m) and Wp(I, n), where I is the identity matrix.
When we want to compare the means of p variables for k independent groups (or samples, or
classes), testing as null hypothesis H0 that the means are equal across the k groups, if we
assume that the covariance matrices are identical for the k groups, is equivalent to calculating
the following statistic:
Λ = |W| / |W + B|, which follows a Λ(p, n - k, k - 1) distribution,
where W is the within-group (intra-class) sums of squares and cross-products matrix, B is the
between-group (inter-class) one, and n is the total number of observations.
The distribution of the Wilks' lambda is complex, so we use instead Rao's F statistic, given by:
F = [(1 - Λ^(1/s)) · m2] / [Λ^(1/s) · m1]
with
s = √[(p²(k - 1)² - 4) / (p² + (k - 1)² - 5)]
m1 = p(k - 1)
m2 = s·[n - (p + k + 2)/2] - [p(k - 1)/2 - 1]
One can show that if the sample size is large, then F follows a Fisher's F distribution with m1
and m2 degrees of freedom. When p ≤ 2 or k = 2, the F statistic is exactly distributed as
F(m1, m2).
Note: This test can only be used if we assume that the samples are normally distributed and
have identical covariance matrices. The second hypothesis can be tested with the Box or
Kullback tests available in this tool.
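The Wilks' lambda and Rao's F approximation above can be sketched as follows (illustrative only, not XLSTAT's implementation; the function name is invented, and s is set to 1 when the denominator under the square root is not positive, which is the usual convention):

```python
import numpy as np

def wilks_rao(groups):
    """Wilks lambda |W| / |W + B| for k groups of p-variate data, with
    Rao's F approximation and its (m1, m2) degrees of freedom."""
    k = len(groups)
    p = groups[0].shape[1]
    n = sum(g.shape[0] for g in groups)
    grand = np.vstack(groups).mean(axis=0)
    # within-group (W) and between-group (B) sums of squares and cross-products
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    B = sum(g.shape[0] * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
            for g in groups)
    lam = np.linalg.det(W) / np.linalg.det(W + B)
    denom = p ** 2 + (k - 1) ** 2 - 5
    s = np.sqrt((p ** 2 * (k - 1) ** 2 - 4) / denom) if denom > 0 else 1.0
    m1 = p * (k - 1)
    m2 = s * (n - (p + k + 2) / 2) - (p * (k - 1) / 2 - 1)
    F = (1 - lam ** (1 / s)) * m2 / (lam ** (1 / s) * m1)
    return lam, F, m1, m2
```

For p = 1 and k = 2 the computation reduces to the usual one-way ANOVA F with (1, n - 2) degrees of freedom.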
Box test: The Box test is used to test the assumption of equality for intra-class covariance
matrices. Two approximations are available, one based on the Chi² distribution, and the other
on the Fisher distribution.
Kullback's test: Kullback's test is used to test the assumption of equality for intra-class
covariance matrices. The statistic calculated is approximately distributed according to a Chi²
distribution.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click this button to delete the data selections.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Groups: Check this option to select the values which correspond to the identifier of the group
to which each observation belongs.
Weights: Activate this option if the observations are weighted. Weights must be greater than
or equal to 0. If a column header has been selected, check that the "Column labels" option is
activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections includes a header.
Options tab:
Wilks Lambda test: Activate this option to compute the Wilks lambda test.
Mahalanobis test: Activate this option to compute the Mahalanobis distances as well as the
corresponding F statistics and p-values.
Bonferroni correction: Activate this option if you want to use a Bonferroni correction
during the computation of the p-values corresponding to the Mahalanobis distances.
Box test: Activate this option to compute the Box test using the two available approximations.
Kullback's test: Activate this option to compute Kullback's test.
Significance level (%): Enter the significance level for the tests (default value: 5%).
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
Covariance matrices: Activate this option to display the inter-class, intra-class, intra-class
total, and total covariance matrices.
Results
The results displayed by XLSTAT correspond to the various tests that have been selected.
Example
http://www.xlstat.com/demo-maha.htm
References
Jobson J.D. (1992). Applied Multivariate Data Analysis. Volume II: Categorical and
Multivariate Methods. Springer-Verlag, New York.
Legendre P. and Legendre L. (1998). Numerical Ecology. Second English Edition. Elsevier,
Amsterdam.
z-test for one proportion
Use this test to compare a proportion calculated from a sample with a given proportion.
Description
Let n be the number of observations verifying a certain property among a sample of size N.
The proportion of the sample verifying the property is defined by p1 = n / N. Let p2 be a known
proportion with which we wish to compare p1. Let D be the assumed difference (exact,
minimum or maximum) between the two proportions p1 and p2. D is usually 0.
The two-tailed (or two-sided) test corresponds to testing the difference between p1 - p2 and D,
using the null (H0) and alternative (Ha) hypotheses shown below:
H0 : p1 - p2 = D
Ha : p1 - p2 ≠ D
In the one-tailed case, you need to distinguish the left-tailed (or lower-tailed or lower one-
sided) test and the right-tailed (or right-sided or upper one-sided) test. In the left-tailed test, the
following hypotheses are used:
H0 : p1 - p2 = D
Ha : p1 - p2 < D
In the right-tailed test, the following hypotheses are used:
H0 : p1 - p2 = D
Ha : p1 - p2 > D
This test is based on the following assumptions:
the probability p1 of having the property in question is identical for all observations,
the number of observations is large enough, and the proportions are neither too close to
0 nor to 1.
Note: to determine whether N is sufficiently large, one should make sure that:
0 < p1 - 2·√(p1(1 - p1)/N)
and
p1 + 2·√(p1(1 - p1)/N) < 1
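The z statistic for this test can be sketched as follows (illustrative only, not XLSTAT's implementation; the function name is invented, and using the test proportion p0 in the standard error is one common convention):

```python
import math

def one_prop_z(n, N, p0, D=0.0):
    """z statistic for H0: p1 - p0 = D, with observed proportion p1 = n / N.
    The standard error is based on the test proportion p0."""
    p1 = n / N
    return (p1 - p0 - D) / math.sqrt(p0 * (1 - p0) / N)
```

The statistic is referred to the standard normal distribution according to the chosen alternative hypothesis.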
Dialog box
: Click this button to close the dialog box without doing any computation.
Frequency / Proportion: Enter the number of observations n for which the property is
observed (see description), or the corresponding proportion (see "data format" below).
Test proportion: Enter the value of the test proportion with which the proportion observed is to
be compared.
Data format: Choose here if you would prefer to enter the value of the number of
observations for which the property is observed, or the proportion observed.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Alternative hypotheses: Choose the alternative hypothesis to be used for the test (see
description).
Hypothesized difference (D): Enter the value of the supposed difference between the
proportions.
Significance level (%): Enter the significance level for the test (default value: 5%).
Results
The results displayed by XLSTAT relate to the various statistics of the tests selected and the
interpretation arising from these.
Example
http://www.xlstat.com/demo-prop.htm
References
Fleiss J.L. (1981). Statistical Methods for Rates and Proportions. John Wiley & Sons, New
York.
Sincich T. (1996). Business Statistics by Example, 5th Edition. Prentice-Hall, Upper Saddle
River.
z-test for two proportions
Use this tool to compare two proportions calculated for two samples.
Description
Let n1 be the number of observations verifying a certain property for sample S1 of size N1,
and n2 the number of observations verifying the same property for sample S2 of size N2. The
proportion of sample S1 verifying the property is defined by p1 = n1 / N1, and the proportion
for S2 is defined by p2 = n2 / N2. Let D be the assumed difference (exact, minimum or
maximum) between the two proportions p1 and p2. D is usually set to 0.
The two-tailed (or two-sided) test corresponds to testing the difference between p1 - p2 and D,
using the null (H0) and alternative (Ha) hypotheses shown below:
H0 : p1 - p2 = D
Ha : p1 - p2 ≠ D
In the one-tailed case, you need to distinguish the left-tailed (or lower-tailed or lower one-
sided) test and the right-tailed (or right-sided or upper one-sided) test. In the left-tailed test, the
following hypotheses are used:
H0 : p1 - p2 = D
Ha : p1 - p2 < D
In the right-tailed test, the following hypotheses are used:
H0 : p1 - p2 = D
Ha : p1 - p2 > D
The test assumes that:
- the probability p1 of having the property in question is identical for all observations in sample S1,
- the probability p2 of having the property in question is identical for all observations in sample S2,
- the numbers of observations N1 and N2 are large enough, and the proportions are neither too close to 0 nor to 1.
Note: to determine whether N1 and N2 are sufficiently large, one should make sure that:
0 < p1 - 2 √( p1 (1 - p1) / N1 )    and    0 < p2 - 2 √( p2 (1 - p2) / N2 )
and
p1 + 2 √( p1 (1 - p1) / N1 ) < 1    and    p2 + 2 √( p2 (1 - p2) / N2 ) < 1
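As an illustration of the test described above, the statistic for D = 0 can be computed with the pooled proportion, a common formulation of the two-proportion z-test. This is a sketch in Python using SciPy, not the XLSTAT implementation itself:

```python
from math import sqrt
from scipy.stats import norm

def z_test_two_proportions(n1, N1, n2, N2):
    """Two-tailed z-test of H0: p1 - p2 = 0, using the pooled proportion."""
    p1, p2 = n1 / N1, n2 / N2
    p_pooled = (n1 + n2) / (N1 + N2)              # common proportion under H0
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / N1 + 1 / N2))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))                 # two-tailed p-value
    return z, p_value

# 45 successes out of 100 vs 30 successes out of 100 (invented data)
z, p_value = z_test_two_proportions(45, 100, 30, 100)
```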
Dialog box
: Click this button to close the dialog box without doing any computation.
Frequency 1 / Proportion 1: Enter the number of observations n1 for which the property is
observed (see description), or the corresponding proportion (see "data format" below).
Frequency 2 / Proportion 2: Enter the number of observations n2 for which the property is
observed (see description), or the corresponding proportion (see "data format" below).
Data format: Choose here if you would prefer to enter the values of the number of
observations for which the property is observed, or the proportions observed.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Workbook: Activate this option to display the results in a new workbook.
Alternative hypotheses: Choose the alternative hypothesis to be used for the test (see
description).
Hypothesized difference (D): Enter the value of the supposed difference between the
proportions.
Significance level (%): Enter the significance level for the test (default value: 5%).
Results
The results displayed by XLSTAT relate to the various statistics of the tests selected and the
interpretation arising from these.
Example
http://www.xlstat.com/demo-prop.htm
References
Fleiss J.L. (1981). Statistical Methods for Rates and Proportions. John Wiley & Sons, New
York.
Sincich T. (1996). Business Statistics by Example, 5th Edition. Prentice-Hall, Upper Saddle
River.
Comparison of k proportions
Use this tool to compare k proportions, and to determine if they can be considered as equal, or
if at least one pair of proportions shows a significant difference.
Description
XLSTAT offers three different approaches to compare proportions and to determine whether
they can be considered as equal (null hypothesis H0) or if at least two proportions are
significantly different (alternative hypothesis Ha):
Chi-square test: This test is identical to that used for the contingency tables;
Monte Carlo method: The Monte Carlo method is used to calculate a distribution of the Chi2
distance based on simulations with the constraint of complying with the total number of
observations for the k groups. This results in an empirical distribution which gives a more
reliable critical value (on condition that the number of simulations is large) than that given by
the Chi2 theoretical distribution which corresponds to the asymptotic case.
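A minimal sketch of the Monte Carlo idea is shown below, assuming a common proportion under H0 and binomial resampling. This is our illustration of the general technique, not necessarily the exact scheme XLSTAT uses:

```python
import numpy as np

def monte_carlo_chi2(successes, sizes, n_sim=10000, seed=0):
    """Empirical p-value for H0: all proportions equal, via simulation.
    The chi-square distance is computed on the observed data, then on
    n_sim data sets simulated under H0 with the pooled proportion."""
    successes = np.asarray(successes, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    p0 = successes.sum() / sizes.sum()            # pooled proportion under H0

    def chi2_stat(succ):
        exp_succ = sizes * p0
        exp_fail = sizes * (1 - p0)
        return (((succ - exp_succ) ** 2) / exp_succ
                + ((sizes - succ - exp_fail) ** 2) / exp_fail).sum()

    observed = chi2_stat(successes)
    rng = np.random.default_rng(seed)
    sims = np.array([chi2_stat(rng.binomial(sizes.astype(int), p0))
                     for _ in range(n_sim)])
    p_value = (sims >= observed).mean()           # empirical p-value
    return observed, p_value

# Three groups of 100 observations with 40, 50 and 70 successes (invented)
observed, p_value = monte_carlo_chi2([40, 50, 70], [100, 100, 100], n_sim=2000)
```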
Marascuilo procedure: It is advised to use the Marascuilo procedure only if the Chi-square test
or the equivalent test based on Monte Carlo simulations reject H0. The Marascuilo procedure
compares all pairs of proportions, which enables the proportions possibly responsible for
rejecting H0 to be identified.
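The Marascuilo procedure declares the pair (i, j) significantly different when |pi - pj| exceeds the critical range √(χ²(1-α, k-1)) × √( pi(1-pi)/ni + pj(1-pj)/nj ), its usual textbook form. A sketch with invented data (function name is ours):

```python
from itertools import combinations
from math import sqrt
from scipy.stats import chi2

def marascuilo(proportions, sizes, alpha=0.05):
    """Compare |pi - pj| with the Marascuilo critical range for every
    pair; a pair is flagged significant when the difference exceeds it."""
    k = len(proportions)
    chi2_crit = chi2.ppf(1 - alpha, k - 1)
    results = []
    for i, j in combinations(range(k), 2):
        pi, pj = proportions[i], proportions[j]
        critical_range = sqrt(chi2_crit) * sqrt(pi * (1 - pi) / sizes[i]
                                                + pj * (1 - pj) / sizes[j])
        diff = abs(pi - pj)
        results.append((i, j, diff, critical_range, diff > critical_range))
    return results

# Three clearly different proportions, each estimated on 100 observations
results = marascuilo([0.2, 0.5, 0.8], [100, 100, 100])
```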
Dialog box
: Click this button to close the dialog box without doing any computation.
Frequencies / Proportions: Select the data in the Excel worksheet.
Sample sizes: Select the data corresponding to the sizes of the samples.
Sample labels: Activate this option if sample labels are available. Then select the
corresponding data. If the Column labels option is activated you need to include a header in
the selection. If this option is not activated, the row labels are automatically generated by
XLSTAT (Sample1, Sample2, …).
Data format: Choose here if you would prefer to enter the value of the number of
observations for which the property is observed, or the proportions observed.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column labels: Activate this option if the first line of the data selected
(frequencies/proportions, sample sizes and sample labels) contains a label.
Monte Carlo method: Activate this option to use the simulation method and enter the number
of simulations.
Significance level (%): Enter the significance level for the three tests (default value: 5%).
Results
The results of the Chi2 test are displayed first if the corresponding option has been activated.
For the Chi2 test and the Monte Carlo method, the p-value is compared with the significance
level in order to decide whether or not to reject the null hypothesis.
The results obtained from Monte Carlo simulations come closer to the theoretical Chi-square
results as the total number of observations and the number of simulations increase. The
difference relates to the critical value and the p-value.
The Marascuilo procedure identifies which proportions are responsible for rejecting the null
hypothesis. It is possible to identify which pairs of proportions are significantly different by
looking at the results in the "Significant" column.
Note: it may happen that the Marascuilo procedure does not identify any significant difference
among the pairs of proportions, while the Chi-square test rejects the null hypothesis. This is
because the pairwise procedure is more conservative than the global test. More in-depth
analysis might be necessary before making a decision.
Example
http://www.xlstat.com/demo-kprop.htm
References
Agresti A. (1990). Categorical Data Analysis. John Wiley & Sons, New York.
Marascuilo L. A. and Serlin R. C. (1988). Statistical Methods for the Social and Behavioral
Sciences. Freeman, New York.
Comparison of two distributions (Kolmogorov-Smirnov)
Use this tool to compare the distributions of two samples and to determine whether they can
be considered identical.
Description
The Kolmogorov-Smirnov test compares two distributions. It is used in distribution fitting, to
compare an empirical distribution determined from a sample with a known distribution, and it
can also be used to compare two empirical distributions.
Note: this test is sensitive both to the shape and to the position of the distributions.
The two-tailed test uses the following null hypothesis:
H0 : F1(x) = F2(x)
The statistic used is:
D1 = sup_x | F1(x) - F2(x) |
D1 is the maximum absolute difference between the two empirical distribution functions. Its
value therefore lies between 0 (distributions perfectly identical) and 1 (distributions perfectly
separated). The alternative hypothesis associated with this statistic is:
Ha : F1(x) ≠ F2(x)
For the one-tailed tests, the statistics used are:
D2 = sup_x ( F1(x) - F2(x) )
and
D3 = sup_x ( F2(x) - F1(x) )
The alternative hypothesis associated with D2 is Ha : F1(x) > F2(x). The alternative
hypothesis associated with D3 is Ha : F1(x) < F2(x).
Nikiforov (1994) proposed an exact method for the two-sample Kolmogorov-Smirnov test.
This method is used by XLSTAT for the three alternative hypotheses. XLSTAT also enables a
supposed difference D between the distributions to be introduced. The value must be
between 0 and 1.
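Outside XLSTAT, the same comparison can be reproduced with SciPy's ks_2samp function, which also supports the three alternative hypotheses. A sketch with made-up sample values:

```python
from scipy.stats import ks_2samp

# Two small invented samples; the statistic D is the maximum absolute
# gap between their empirical distribution functions.
sample1 = [1.0, 2.0, 3.0, 4.0, 5.0]
sample2 = [1.5, 2.5, 3.5, 4.5, 5.5]

result = ks_2samp(sample1, sample2, alternative='two-sided')
statistic, p_value = result.statistic, result.pvalue
```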
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data / Sample 1: If the format of the selected data is "one column per variable", select the
data for the various samples in the Excel worksheet. If the format of the selected data is "one
column per sample", select a column of data corresponding to the first sample.
Sample identifiers / Sample 2: If the format of the selected data is "one column per variable",
select the data identifying the two samples to which the selected data values correspond. If the
format of the selected data is "one column per sample", select a column of data corresponding
to the second sample.
One column/row per sample: Activate this option to select one column (or row in row
mode) per sample.
One column/row per variable: Activate this option for XLSTAT to carry out as many
tests as there are columns/rows, given that each column/row must contain the same
number of rows/columns and that a sample identifier which enables each observation to
be assigned to a sample must also be selected.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/Row labels: Activate this option if the first row (column mode) or first column (row
mode) of the selected data contains labels.
Kolmogorov-Smirnov test: Activate this option to run the Kolmogorov-Smirnov test (see
description).
Options tab:
Alternative hypothesis: Choose the alternative hypothesis to be used for the test (see
description).
Hypothesized difference (D): Enter the value of the maximum supposed difference between
the empirical distribution functions of the samples. The value must be between 0 and 1.
Significance level (%): Enter the significance level for the test (default value: 5%).
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
samples.
Charts tab:
Dominance diagram: Activate this option to display a dominance diagram in order to make a
visual comparison of the samples.
Cumulative histograms: Activate this option to display the chart showing the empirical
distribution functions for the samples.
Results
The results displayed by XLSTAT relate to the various statistics of the tests selected and the
interpretation arising from these.
References
Durbin J. (1973). Distribution Theory for Tests Based on the Sample Distribution Function.
SIAM, Philadelphia.
Kolmogorov A. (1941). Confidence limits for an unknown distribution function. Ann. Math.
Stat., 12, 461-463.
Nikiforov A.M. (1994). Algorithm AS 288: Exact two-sample Smirnov test for arbitrary
distributions. Applied Statistics, 43(1), 265-270.
Comparison of two samples (Wilcoxon, Mann-Whitney, ...)
Use this tool to compare two samples described by ordinal or discrete quantitative data
whether independent or paired.
Description
To get round the assumption of normality required for using the parametric tests (z-test,
Student's t test, Fisher's F test, Levene's test and Bartlett's test), non-parametric tests have
been put forward.
As for parametric tests, a distinction is made between independent samples (for example a
comparison of annual sales by shop between two regions for a chain of supermarkets), or
paired samples (for example if comparing the annual sales within the same region over two
years).
If we designate D to be the assumed difference in position between the samples (in general we
test for equality, and D is therefore 0), and P1-P2 to be the difference of position between the
samples, three tests are possible depending on the alternative hypothesis chosen:
For the two-tailed test, the null H0 and alternative Ha hypotheses are as follows:
H0 : P1 - P2 = D
Ha : P1 - P2 ≠ D
For the left-tailed test, the hypotheses are:
H0 : P1 - P2 = D
Ha : P1 - P2 < D
For the right-tailed test, the hypotheses are:
H0 : P1 - P2 = D
Ha : P1 - P2 > D
Three researchers, Mann, Whitney, and Wilcoxon, separately perfected a very similar non-
parametric test which can determine if the samples may be considered identical or not on the
basis of their ranks. This test is often called the Mann-Whitney test, sometimes the Wilcoxon-
Mann-Whitney test or the Wilcoxon Rank-Sum test (Lehmann, 1975).
We sometimes read that this test can determine if the samples come from identical populations
or distributions. This is completely untrue. It can only be used to study the relative positions of
the samples. For example, if we generate a sample of 500 observations from an N(0,1)
distribution and a second sample of 500 observations from an N(0,4) distribution, the
Mann-Whitney test will find no difference between the samples.
Let S1 be a sample made up of n1 observations (x1, x2, …, xn1) and S2 a second sample
made up of n2 observations (y1, y2, …, yn2), independent of S1. Let N be the sum of n1 and
n2.
To calculate the Wilcoxon Ws statistic which measures the difference in position between the
first sample S1 and sample S2 from which D has been subtracted, we combine the values
obtained for both samples, then put them in order. The Ws statistic is the sum of the ranks of
one of the samples. For XLSTAT, the sum is calculated on the first sample.
The Mann-Whitney U statistic is the number of pairs (xi, yj) with xi > yj, among all the
n1 × n2 possible pairs. We show that:
E(U) = n1 n2 / 2    and    V(U) = n1 n2 (N + 1) / 12
We may observe that the variances of Ws and U are identical. In fact, the relationship between
U and Ws is:
Ws = U + n1 (n1 + 1) / 2
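This relationship between Ws and U can be checked numerically. The sketch below (invented data) computes the rank sum by hand and compares it with SciPy's U statistic, which mannwhitneyu counts on the first sample:

```python
from scipy.stats import mannwhitneyu, rankdata

x = [1, 4, 6]          # sample S1 (n1 = 3)
y = [2, 3, 5]          # sample S2 (n2 = 3)
n1 = len(x)

# Rank the pooled values, then sum the ranks of the first sample.
ranks = rankdata(x + y)
Ws = ranks[:n1].sum()

# SciPy's statistic is U computed on the first sample.
U = mannwhitneyu(x, y, alternative='two-sided').statistic

# Check: Ws = U + n1 (n1 + 1) / 2
```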
When there are ties between the values in the two samples, the rank assigned to the tied
values is the mean of their initial ranks (for example, for two samples of respective sizes 3
and 3, if the ordered list of values is {1, 1.2, 1.2, 1.4, 1.5, 1.5}, the ranks are initially {1, 2, 3,
4, 5, 6} and become {1, 2.5, 2.5, 4, 5.5, 5.5} after averaging). Although this does not change
the expectations of Ws and U, the variance, on the other hand, is modified and becomes:
V(U) = n1 n2 / 12 × [ (N + 1) - Σi=1..nd (di³ - di) / (N (N - 1)) ]
where nd is the number of distinct values and di the number of observations for each of the
values.
For the calculation of the p-values associated with the statistic, XLSTAT can use an exact
method if the user requests it, provided the sample sizes allow it.
The calculations may be appreciably slowed down where there are ties. A normal
approximation has been proposed to get round this problem. We have:
P(U ≤ u) = F( (u - E(U) + c) / √V(U) )
where F is the distribution function of the standard normal distribution, and c is a continuity
correction used to improve the quality of the approximation (c is 1/2 or -1/2 depending on the
nature of the test). The approximation is more reliable the larger n1 and n2 are.
If the user requests that an exact test be used and this is not possible because of these
constraints, XLSTAT indicates in the results report that an approximation has been used.
Two tests have been proposed for the cases where samples are paired: the sign test and the
Wilcoxon signed rank test.
Let S1 be a sample made up of n observations (x1, x2, …, xn) and S2 a second sample paired
with S1, also comprising n observations (y1, y2, …, yn). Let (p1, p2, …, pn) be the n pairs of
values (xi, yi).
Sign test
Let N+ be the number of pairs where yi > xi, N0 the number of pairs where yi = xi, and N- the
number of pairs where yi < xi. We can show that N+ follows a binomial distribution with
parameters (n - N0) and probability 1/2. The expectation and the variance of N+ are therefore:
E(N+) = (n - N0) / 2    and    V(N+) = (n - N0) / 4
The p-value associated with N+ and the type of test chosen (two-tailed, right or left one-tailed)
can therefore be determined exactly.
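Since N+ follows a binomial distribution with parameters (n - N0) and probability 1/2, the exact p-value can be obtained from a binomial test. A sketch using SciPy, with invented paired data:

```python
from scipy.stats import binomtest

# Paired observations; the sign test only looks at the sign of y - x.
x = [10, 12, 11, 9, 14, 13, 8, 10, 12, 11]
y = [12, 13, 12, 8, 15, 14, 9, 12, 13, 10]

diffs = [b - a for a, b in zip(x, y)]
n_plus = sum(d > 0 for d in diffs)     # pairs where y > x
n_zero = sum(d == 0 for d in diffs)    # tied pairs are dropped
n_eff = len(diffs) - n_zero

# Exact two-tailed sign test: N+ ~ Binomial(n_eff, 1/2) under H0.
p_value = binomtest(n_plus, n_eff, p=0.5).pvalue
```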
Note: This test is called the sign test because it uses only the signs of the differences within
the n pairs. It is therefore used to compare evolutions evaluated on an ordinal scale. For
example, this test would be used to determine whether the effect of a medicine is positive
from a survey where the patient simply declares if he feels less well, not better, or better after
taking it.
The disadvantage of the sign test is that it does not take into account the size of the difference
between each pair, data which is often available.
Wilcoxon proposed a test which takes into account the size of the difference within pairs. This
test is called the Wilcoxon signed rank test, as the sign of the differences is also involved.
As for the sign test, the differences for all the pairs are calculated, then their absolute values
are ordered, and finally the positive differences S1, S2, …, Sp and the negative differences
R1, R2, …, Rm (p + m = n) are separated.
The statistic used to show whether both samples have the same position is defined as the
sum of the Si's:
Vs = Σi=1..p Si
Its expectation and variance are:
E(Vs) = n (n + 1) / 4    and    V(Vs) = n (n + 1) (2n + 1) / 24
Where there might be ties among the differences, or null differences for certain pairs, we
have:
E(Vs) = [ n (n + 1) - d0 (d0 + 1) ] / 4
V(Vs) = [ n (n + 1) (2n + 1) - d0 (d0 + 1) (2 d0 + 1) ] / 24 - Σi=1..nd (di³ - di) / 48
where d0 is the number of null differences, nd the number of distinct differences, and di the
number of values corresponding to the i'th distinct difference value (it is the same as
considering that the di's are the number of ties for the i'th distinct difference value).
Where there are no null differences or ties among the differences, if n is less than or equal to
100, XLSTAT calculates an exact p-value (Lehmann, 1975). Where there are ties, a normal
approximation is used. We have, in effect:
P(Vs ≤ v) = F( (v - E(Vs) + c) / √V(Vs) )
where F is the distribution function of the standard normal distribution, and c is a continuity
correction used to improve the quality of the approximation (c is 1/2 or -1/2 depending on the
nature of the test). The approximation is more reliable the larger n is.
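Outside XLSTAT, scipy.stats.wilcoxon implements the same test, with exact p-values for small samples without ties. A sketch with invented paired differences:

```python
from scipy.stats import wilcoxon

# Paired differences without ties or zeros, so an exact p-value is used.
# Ranks of |d|: 1, 2, 3, 4, 5; sum of positive ranks = 13, of negative = 2.
diffs = [1, -2, 3, 4, 5]

result = wilcoxon(diffs)
statistic, p_value = result.statistic, result.pvalue
# SciPy reports the smaller of the two rank sums as the statistic.
```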
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data / Sample 1: If the format of the selected data is "one column per variable", select the
data for the various samples in the Excel worksheet. If the format of the selected data is "one
column per sample" or "paired samples", select a column of data corresponding to the first
sample.
Sample identifiers / Sample 2: If the format of the selected data is "one column per variable",
select the data identifying the two samples to which the selected data values correspond. If the
format of the selected data is "one column per sample" or "paired samples", select a column of
data corresponding to the second sample.
One column/row per sample: Activate this option to select one column (or row in row
mode) per sample.
One column/row per variable: Activate this option for XLSTAT to carry out as many
tests as there are columns/rows, given that each column/row must contain the same
number of rows/columns and that a sample identifier which enables each observation to
be assigned to a sample must also be selected.
Paired samples: Activate this option to carry out tests on paired samples. You must
then select a column (or row in row mode) per sample, all the time ensuring that the
samples are of the same size.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/Row labels: Activate this option if the first row (column mode) or first column (row
mode) of the selected data contains labels.
Mann-Whitney test: Activate this option to run the Mann-Whitney test (see description).
Sign test: Activate this option to use the sign test (see description).
Wilcoxon signed rank test: Activate this option to use the Wilcoxon signed rank test (see
description).
Options tab:
Alternative hypothesis: Choose the alternative hypothesis to be used for the test (see
description).
Hypothesized difference (D): Enter the value of the supposed difference between the
samples.
Significance level (%): Enter the significance level for the test (default value: 5%).
Exact p-values: Activate this option if you want XLSTAT to calculate the exact p-value as far
as possible (see description).
Continuity correction: Activate this option if you want XLSTAT to use the continuity
correction if the exact p-values calculation has not been requested or is not possible (see
description).
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
samples.
Charts tab:
Dominance diagram: Activate this option to display a dominance diagram in order to make a
visual comparison of the samples.
Results
The results displayed by XLSTAT relate to the various statistics of the tests selected and the
interpretation arising from these.
References
Cheung Y.K. Klotz J.H. (1997). The Mann Whitney Wilcoxon distribution using linked lists.
Statistica Sinica, 7, 805-813.
Lehmann E.L (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San
Francisco.
Siegel S. and Castellan N. J. (1988). Nonparametric Statistics for the Behavioral Sciences,
Second Edition. McGraw-Hill, New York.
Comparison of k samples (Kruskal-Wallis, Friedman, ...)
Use this tool to compare k independent samples (Kruskal-Wallis test and Dunn's procedure)
or paired samples (Friedman's test and Nemenyi's procedure).
Description
To get round the assumption of normality required for using multiple comparison tests
(offered in XLSTAT after an ANOVA), non-parametric tests have been proposed.
As for parametric tests, a distinction is made between independent samples (for example, a
comparison of crop yields from fields with similar properties but treated with three different
types of fertilizer) and paired samples (for example, when comparing the scores given by 10
judges to 3 different products).
The Kruskal-Wallis test is often used as an alternative to the ANOVA where the assumption
of normality is not acceptable. It is used to test if k samples (k ≥ 2) come from the same
population or populations with identical properties as regards a position parameter (the
position parameter is conceptually close to the median, but the Kruskal-Wallis test takes into
account more information than just the position given by the median).
If Mi is the position parameter for sample i, the null H0 and alternative Ha hypotheses for the
Kruskal-Wallis test are as follows:
H0 : M1 = M2 = … = Mk
Ha : there is at least one pair (i, j) such that Mi ≠ Mj
The calculation of the K statistic from the Kruskal-Wallis test involves, as for the Mann-Whitney
test, the rank of the observations once the k samples (or groups) have been mixed. K is
defined by:
K = 12 / (N (N + 1)) × Σi=1..k Ri² / ni - 3 (N + 1)
where ni is the size of sample i, N is the sum of the ni's, and Ri is the sum of the ranks for
sample i.
When k=2, the Kruskal-Wallis test is equivalent to the Mann-Whitney test and K is equivalent
to Ws.
When there are ties, the mean ranks are used for the corresponding observations as in the
case of the Mann-Whitney test. K is then given by:
K = [ 12 / (N (N + 1)) × Σi=1..k Ri² / ni - 3 (N + 1) ] / [ 1 - Σi=1..nd (di³ - di) / (N³ - N) ]
where nd is the number of distinct values and di the number of observations for each of the
values.
For the calculation of the p-value associated with a given value of K, XLSTAT uses an
approximation of the K distribution by a Chi-square distribution with (k-1) degrees of freedom.
This approximation is reliable, except when N is small. The p-values associated with K, which
for the exact case depends on the statistic K and the k sizes of the samples, have been
tabulated for the case where k = 3 (Lehmann 1975, Hollander and Wolfe 1999).
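The K statistic can be reproduced with scipy.stats.kruskal. The sketch below also computes K directly from the formula above to check that the two agree (invented data, no ties):

```python
from scipy.stats import kruskal, rankdata

groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # three made-up samples
N = sum(len(g) for g in groups)

# Rank the pooled observations, then apply the K formula.
pooled = [v for g in groups for v in g]
ranks = rankdata(pooled)
rank_sums, start = [], 0
for g in groups:
    rank_sums.append(ranks[start:start + len(g)].sum())
    start += len(g)
K_manual = 12 / (N * (N + 1)) * sum(R ** 2 / len(g)
                                    for R, g in zip(rank_sums, groups)) - 3 * (N + 1)

K_scipy = kruskal(*groups).statistic
```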
The Friedman test is a non-parametric alternative to the ANOVA with two factors where the
assumption of normality is not acceptable. It is used to test if k paired samples (k ≥ 2) of size n,
come from the same population or from populations having identical properties as regards the
position parameter. As the context is often that of the ANOVA with two factors, we sometimes
speak of the Friedman test with k treatments and n blocks.
If Mi is the position parameter for sample i, the null H0 and alternative Ha hypotheses for the
Friedman test are as follows:
H0 : M1 = M2 = … = Mk
Ha : there is at least one pair (i, j) such that Mi ≠ Mj
Let n be the size of k paired samples. The Q statistic from the Friedman test is given by:
Q = 12 / (n k (k + 1)) × Σi=1..k Ri² - 3 n (k + 1)
Where there are ties, the average ranks are used for the corresponding observations. Q is
then given by:
Q = [ 12 / (n k (k + 1)) × Σi=1..k Ri² - 3 n (k + 1) ] / [ 1 - Σj=1..n Σi=1..nd(j) (dij³ - dij) / (n (k³ - k)) ]
where nd(j) is the number of distinct values for block j, and dij the number of observations for
each of the values.
As for the Kruskal-Wallis test, the p-value associated with a given value of Q can be
approximated by a Chi-square distribution with (k-1) degrees of freedom. This approximation
is reliable when kn is greater than 30, the quality also depending on the number of ties. The
p-values associated with Q have been tabulated for (k = 3, n ≤ 15) and (k = 4, n ≤ 8)
(Lehmann 1975, Hollander and Wolfe 1999).
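The Q statistic can be reproduced with scipy.stats.friedmanchisquare, passing one sequence per treatment (invented block scores, no ties within a block):

```python
from scipy.stats import friedmanchisquare

# Three treatments measured on four blocks; within each block the
# treatments rank 1, 2, 3, so the rank sums are R = 4, 8, 12.
treatment1 = [1, 2, 1, 1]
treatment2 = [5, 6, 6, 5]
treatment3 = [9, 8, 9, 9]

result = friedmanchisquare(treatment1, treatment2, treatment3)
Q, p_value = result.statistic, result.pvalue
```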
Whether for the Kruskal-Wallis or the Friedman test, if the p-value is such that the H0
hypothesis has to be rejected, then at least one sample (or group) is different from another. To
identify which samples are responsible for rejecting H0, a multiple comparison procedure
can be used (Dunn, 1964; Nemenyi, 1963). To take into account the fact that there are k(k-1)/2
possible comparisons, the correction of the significance level proposed by Bonferroni can be
applied. The significance level used for pairwise comparisons is:
α' = 2α / (k (k - 1))
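For instance, with k = 4 groups and α = 5%, each of the k(k-1)/2 = 6 pairwise comparisons is carried out at the corrected level:

```python
# Bonferroni-corrected significance level for the pairwise comparisons
# performed after a Kruskal-Wallis or Friedman test.
alpha = 0.05
k = 4
n_comparisons = k * (k - 1) // 2          # 6 pairwise comparisons
alpha_prime = 2 * alpha / (k * (k - 1))   # equivalently alpha / n_comparisons
```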
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click this button to reload the default options.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data / Sample 1: If the format of the selected data is "one column per variable", select the
data for the various samples in the Excel worksheet. If the format of the selected data is "one
column per sample" or "paired samples", select a column of data corresponding to the first
sample.
Sample identifiers / Sample 2: If the format of the selected data is "one column per variable",
select the data identifying the k samples to which the selected data values correspond. If the
format of the selected data is "one column per sample" or "paired samples", select a column of
data corresponding to the second sample.
One column/row per sample: Activate this option to select one column (or row in row
mode) per sample.
One column/row per variable: Activate this option for XLSTAT to carry out as many
tests as there are columns/rows, given that each column/row must contain the same
number of rows/columns and that a sample identifier which enables each observation to
be assigned to a sample must also be selected.
Paired samples: Activate this option to carry out tests on paired samples. You must
then select a column (or row in row mode) per sample, all the time ensuring that the
samples are of the same size.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/Row labels: Activate this option if the first row (column mode) or first column (row
mode) of the selected data contains labels.
Kruskal-Wallis test: Activate this option to run the Kruskal-Wallis test (see description).
Friedman test: Activate this option to run a Friedman test (see description).
Bonferroni correction: Activate this option to use the Bonferroni corrected significance
level for the multiple comparisons.
Options tab:
Significance level (%): Enter the significance level for the test (default value: 5%).
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
samples.
Results
The results displayed by XLSTAT relate to the various statistics of the tests selected and the
interpretation arising from these.
Example
A tutorial showing how to use the Friedman test is available on the Addinsoft website:
http://www.xlstat.com/demo-friedman.htm
References
Dunn O.J. (1964). Multiple Comparisons Using Rank Sums. Technometrics, 6(3), 241-252.
Lehmann E.L (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San
Francisco.
Siegel S. and Castellan N. J. (1988). Nonparametric Statistics for the Behavioral Sciences,
Second Edition. McGraw-Hill, New York.
Cochran's Q test
Use this tool to compare k ≥ 2 paired samples whose values are binary.
Description
Cochran's Q test is presented using two different approaches. Some authors present it as a
particular case of the Friedman test (comparison of k paired samples) when the variable is
binary (Lehmann, 1975), while others present it as a marginal homogeneity test for a
k-dimensional contingency table (Agresti, 1990).
As a consequence, the null (H0) and alternative (Ha) hypotheses for Cochran's Q test can be
stated either in terms of the equivalence of the k treatments, or in terms of the marginal
homogeneity of the contingency table.
XLSTAT uses the first approach, as it is the most commonly used. The term treatment is used
for the k samples that are being compared.
- You can select data in a raw format. In this case, each column corresponds to a treatment
and each row to a subject (or individual, or block).
- You can also select the data in a grouped format. Here, each column corresponds to a
treatment, and each row corresponds to a unique combination of the k treatments. You then
need to select the frequencies corresponding to each combination (field Frequencies in the
dialog box).
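For reference, the Q statistic is usually written Q = (k-1)[k ΣCj² - N²] / [k N - ΣRi²], where Cj are the treatment (column) totals, Ri the subject (row) totals and N the grand total. A pure-Python sketch on raw-format data (invented binary responses; the function name is ours):

```python
from scipy.stats import chi2

def cochran_q(data):
    """Cochran's Q test on raw binary data: one row per subject,
    one column per treatment."""
    k = len(data[0])
    col_totals = [sum(row[j] for row in data) for j in range(k)]
    row_totals = [sum(row) for row in data]
    N = sum(col_totals)
    Q = (k - 1) * (k * sum(c * c for c in col_totals) - N * N) \
        / (k * N - sum(r * r for r in row_totals))
    p_value = chi2.sf(Q, k - 1)            # asymptotic Chi-square, k-1 df
    return Q, p_value

# Four subjects, three treatments.
Q, p_value = cochran_q([[1, 1, 0],
                        [1, 0, 0],
                        [1, 1, 1],
                        [1, 0, 0]])
```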
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down (column mode), XLSTAT considers that rows correspond to observations and columns to
variables. If the arrow points to the right (row mode), XLSTAT considers that rows correspond
to variables and columns to observations.
General tab:
Subjects/Treatments table: Select a table where each row (or column if in column mode)
corresponds to a subject, and each column (or row in row mode) corresponds to a treatment. If
headers have been selected with the data, make sure the Treatment labels or Labels
included option is checked.
Data format:
Subjects/Treatments table:
o Raw: Choose that option if the input data are in a raw format (as opposed to
grouped).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Treatment labels: Activate this option if headers have been selected with the input data.
Options tab:
Significance level (%): Enter the significance level for the test (default value: 5%).
Outputs tab:
Descriptive statistics: Activate this option to compute and display the statistics that
correspond to each treatment.
Results
Descriptive statistics: This table displays the descriptive statistics that correspond to the k
treatments.
The results that correspond to the Cochran's Q test are then displayed, followed by a short
interpretation of the test.
References
Agresti A. (1990). Categorical Data Analysis. John Wiley and Sons, New York.
Cochran W.G. (1950). The comparison of percentages in matched samples. Biometrika, 37,
256-266.
Lehmann E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San
Francisco.
McNemar's test
Use this tool to compare 2 paired samples whose values are binary. The data can be
summarized in a 2x2 contingency table.
Description
The McNemar's test is a special case of the Cochran's Q test when there are only two
treatments. As for the Cochran's Q test, the variable of interest is binary. However, the
McNemar's test has two advantages over it.
In the case of the two-tailed (or two-sided) test, the null (H0) and alternative (Ha) hypotheses
are:
H0 : Treatment 1 = Treatment 2
Ha : Treatment 1 ≠ Treatment 2
In the one-tailed case, you need to distinguish the left-tailed (or lower-tailed or lower one-
sided) test and the right-tailed (or upper-tailed or upper one-sided) test. In the left-tailed test,
the following hypotheses are used:
H0 : Treatment 1 = Treatment 2
Ha : Treatment 1 < Treatment 2
In the right-tailed test, the following hypotheses are used:
H0 : Treatment 1 = Treatment 2
Ha : Treatment 1 > Treatment 2
- You can select data in a raw format. In this case, each column corresponds to a treatment
and each row to a subject (or individual, or block).
- You can also select the data in a grouped format. Here, each column corresponds to a
treatment, and each row corresponds to a unique combination of the two treatments. You then
need to select the frequencies corresponding to each combination (field Frequencies in the
dialog box).
- You can also select a contingency table with two rows and two columns. If you choose this
format, the first and second treatments are respectively considered as corresponding to the
rows and the columns. The positive response cases (or successes) are considered as
corresponding to the first row of the contingency table for the first treatment, and to the first
column for the second treatment.
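With the contingency table laid out as described above (positive responses in the first row and first column), the test statistic depends only on the two discordant cells. A minimal sketch using the standard chi-square and exact binomial formulas (the function name and counts are hypothetical, and this is not XLSTAT's implementation):

```python
from scipy.stats import binom, chi2

def mcnemar(table, exact=False):
    """McNemar's test on a 2x2 contingency table.

    `table[i][j]` counts subjects with response i under treatment 1 and
    response j under treatment 2; row/column 0 is the positive response.
    """
    b = table[0][1]   # positive under treatment 1 only
    c = table[1][0]   # positive under treatment 2 only
    if exact:
        # exact two-sided p-value from Binomial(b + c, 1/2)
        p = min(1.0, 2 * binom.cdf(min(b, c), b + c, 0.5))
        return None, p
    stat = (b - c) ** 2 / (b + c)          # asymptotic chi-square, 1 df
    return stat, chi2.sf(stat, df=1)

# Hypothetical 2x2 table: 6 and 16 discordant pairs
stat, p = mcnemar([[59, 6], [16, 80]])
_, p_exact = mcnemar([[59, 6], [16, 80]], exact=True)
```

Note that the concordant cells (59 and 80 here) do not affect the statistic; only the discordant counts b and c matter.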
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down (column mode), XLSTAT considers that rows correspond to observations and columns to
variables. If the arrow points to the right (row mode), XLSTAT considers that rows correspond
to variables and columns to observations.
General tab:
Data format:
o Raw: Choose that option if the input data are in a raw format (as opposed to
grouped).
Contingency table (2x2): Activate this option if your data are available in a 2x2
contingency table.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Treatment labels/Labels included: Activate this option if headers have been selected with
the input data. In the case of a contingency table, the row and column labels must be selected
if this option is checked.
Positive response code: Enter the value that corresponds to a positive response in your
experiment.
Options tab:
Alternative hypothesis: Choose the alternative hypothesis to be used for the test (see
description).
Significance level (%): Enter the significance level for the test (default value: 5%).
Exact p-value: Activate this option to compute the exact p-value.
Outputs tab:
This tab is only visible if the Subjects/Treatments table format has been chosen.
Descriptive statistics: Activate this option to compute and display the statistics that
correspond to each treatment.
Contingency table: Activate this option to display the 2x2 contingency table.
Results
Descriptive statistics: This table displays the descriptive statistics that correspond to the two
treatments.
Contingency table: The 2x2 contingency table built from the input data is displayed.
The results that correspond to the McNemar's test are then displayed, followed by a short
interpretation of the test.
References
Agresti A. (1990). Categorical Data Analysis. John Wiley and Sons, New York.
McNemar Q. (1947). Note on the sampling error of the difference between correlated
proportions or percentages. Psychometrika, 12, 153-157.
Lehmann E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San
Francisco.
One-sample runs test
Use this tool to test whether a series of binary events is randomly distributed or not.
Description
The first version of this nonparametric test was presented by Mood (1940) and is based on the
same runs statistic as the two-sample test by Wald and Wolfowitz (1940), which is why this
test is sometimes mistakenly referred to as the Wald and Wolfowitz runs test. However,
Mood's article makes reference to the article by Wald and Wolfowitz, and the asymptotic
distribution of the statistic also uses results given by these authors.
XLSTAT accepts continuous data or binary categorical data as input. For continuous data, a
cut point must be chosen by the user so that the data are transformed into a binary sample.
In the case of the two-tailed (or two-sided) test, the null (H0) and alternative (Ha) hypotheses
are:
H0 : The data are randomly distributed.
Ha : The data are not randomly distributed.
In the one-tailed case, you need to distinguish the left-tailed (or lower-tailed or lower one-
sided) test and the right-tailed (or upper-tailed or upper one-sided) test. In the left-tailed test,
the alternative hypothesis is that there are fewer runs than expected under randomness
(events of the same type tend to cluster); in the right-tailed test, the alternative is that there
are more runs than expected (the two types of events tend to alternate).
The expectation of the number of runs R is given by:
E(R) = 2mn/N + 1
where m is the number of events of type 1, and n the number of events of type 2, and N is the
total sample size.
The minimum value of R is always 2. The maximum value is given by 2·Min(m, n) + t, where t
is 0 if m = n, and 1 otherwise.
If r is the number of runs measured on the sample, it was shown by Wald and Wolfowitz that
asymptotically, when m or n tends to infinity,
(r − E(R)) / √V(R) → N(0, 1)
where V(R) = 2mn(2mn − N) / (N²(N − 1)) is the variance of the number of runs.
XLSTAT offers three ways to compute the p-value: the exact distribution of R, the asymptotic
normal approximation (with an optional continuity correction), or Monte Carlo resampling.
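The asymptotic version of the test can be sketched directly from the formulas above (the function name and example sequence are illustrative; the exact and Monte Carlo methods are not shown):

```python
import math
from scipy.stats import norm

def runs_test(seq):
    """One-sample runs test for randomness of a binary sequence.

    A sketch of the asymptotic test, using E(R) = 2mn/N + 1 and
    V(R) = 2mn(2mn - N) / (N^2 (N - 1)); not XLSTAT's implementation.
    """
    values = sorted(set(seq))
    assert len(values) == 2, "sequence must be binary"
    N = len(seq)
    m = sum(1 for x in seq if x == values[0])   # events of type 1
    n = N - m                                   # events of type 2
    # count runs: 1 + number of adjacent changes
    r = 1 + sum(1 for i in range(1, N) if seq[i] != seq[i - 1])
    e_r = 2 * m * n / N + 1
    v_r = 2 * m * n * (2 * m * n - N) / (N ** 2 * (N - 1))
    z = (r - e_r) / math.sqrt(v_r)
    p_two_sided = 2 * norm.sf(abs(z))
    return r, z, p_two_sided

# Hypothetical sequence of 12 binary events
r, z, p = runs_test("AABBBABAABBA")
```

Here the left-tailed p-value would be norm.cdf(z) (too few runs, clustering) and the right-tailed p-value norm.sf(z) (too many runs, alternation).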
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down (column mode), XLSTAT considers that rows correspond to observations and columns to
variables. If the arrow points to the right (row mode), XLSTAT considers that rows correspond
to variables and columns to observations.
General tab:
Data: Select a column (or row in row mode) of data corresponding to the series of data to
analyze.
Quantitative: Activate this option to select one column (or row in row mode) of
quantitative data. The data will then be transformed on the basis of the cut point (see
below).
Qualitative: Activate this option to select one column (or row in row mode) of binary
data.
Cut point: Choose the type of value that will be used to discretize the continuous data into a
binary sample.
Mean: Observations are split into two groups depending on whether they are lower or
greater than the mean.
Median: Observations are split into two groups depending on whether they are lower
or greater than the median.
User defined: Select this option to enter the value used to transform the data. The
observations are split into two groups depending on whether they are lower or greater
than the given value.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/Row labels: Activate this option if the first row (column mode) or first column (row
mode) of the selected data contains labels.
Options tab:
Alternative hypothesis: Choose the alternative hypothesis to be used for the test (see
description).
Significance level (%): Enter the significance level for the test (default value: 5%).
Exact p-value: Activate this option if you want XLSTAT to calculate the exact p-value (see
description).
Asymptotic p-value: Activate this option if you want XLSTAT to calculate the p-value based
on the asymptotic approximation (see description).
Continuity correction: Activate this option if you want XLSTAT to use the continuity
correction when computing the asymptotic p-value.
Monte Carlo method: Activate this option if you want XLSTAT to calculate the p-value based
on Monte Carlo permutations, and enter the number of random permutations to perform.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Results
The results that correspond to the one-sample runs test are displayed, followed by a short
interpretation of the test.
References
Mood A. M. (1940). The distribution theory of runs. Ann. Math. Statist., 11(4), 367-392.
Siegel S. and Castellan N. J. (1988). Nonparametric Statistics for the Behavioral Sciences,
Second Edition. McGraw-Hill, New York, 58-54.
Wald A. and Wolfowitz J. (1940). On a test whether two samples are from the same
population, Ann. Math. Stat., 11(2), 147-162.
DataFlagger
Use DataFlagger to highlight values that lie within or outside a given interval, or that are equal
to certain values.
Dialog box
: Click this button to close the dialog box without doing any change.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
Flag a value or a text: Activate this option if you want to identify or highlight a value or a
series of values in the selected range.
Value or text: Choose this option to find and flag a single value or a character string.
List values or texts: Choose this option to find and flag a series of values or texts. You
must then select the series of values or texts in question in an Excel worksheet.
Flag an interval: Activate this option if you want to identify or highlight values within or outside
an interval. You then have to define the interval.
Inside: Choose this option to find and flag values within an interval. Afterwards choose
the boundary types (open or closed) for the interval, then enter the values of the
boundaries.
Outside: Choose this option to find and flag values outside an interval. Afterwards
choose the boundary types (open or closed) for the interval, then enter the values of the
boundaries.
Font: Use the following options to change the font of the values obeying the flagging rules.
Cell: Use the following option to change the background color of the cell.
Min/Max Search
Use this tool to locate the minimum and/or maximum values in a range of values. If the
minimum value is encountered several times, XLSTAT makes a multiple selection of the
minimum values enabling you afterwards to browse between them simply using the "Enter"
key.
Dialog box
: Click this button to close the dialog box without doing any search.
Find the minimum: Activate this option to make XLSTAT look for the minimum value(s) in the
selection. If the "Multiple selection" option is activated and several minimum values are found,
they will all be selected and you can navigate between them using the "Enter" key.
Find the maximum: Activate this option to make XLSTAT look for the maximum value(s) in
the selection. If the "Multiple selection" option is activated and several maximum values are
found, they will all be selected and you can navigate between them using the "Enter" key.
Multiple selection: Activate this option to enable multiple occurrences of the minimum and/or
maximum values to be selected at the same time.
Remove text values in a selection
Use this tool to remove text values in a data set that is expected to contain only numerical
data. This tool is useful if you are importing data from a format that generates empty cells with
a text format in Excel.
Dialog box
: Click this button to close the dialog box without doing any change.
Clean only the cells with empty strings: Activate this option if you want to only clean the
cells that correspond to empty strings.
Sheets management
Use this tool to manage the sheets contained in the open Excel workbooks.
Dialog box
When you start this tool, it displays a dialog box that lists all the sheets contained in all the
workbooks, whether they are hidden or not.
Delete: Click this button to delete all the selected sheets. Warning: deleting hidden sheets is
irreversible.
Delete hidden sheets
Use this tool to delete the hidden sheets generated by XLSTAT or other applications. XLSTAT
generates hidden sheets to create certain charts. This tool is used to choose which hidden
sheets are to be deleted and which kept.
Dialog box
Hidden sheets: The list of hidden sheets is displayed. Select the hidden sheets you want to
delete.
All: Click this button to select all the sheets in the list.
None: Click this button to deselect all the sheets in the list.
Delete: Click this button to delete all the selected sheets. Warning: deleting hidden sheets is
irreversible.
Unhide hidden sheets
Use this tool to unhide the hidden sheets generated by XLSTAT or other applications. XLSTAT
generates hidden sheets to create certain charts.
Dialog box
Hidden sheets: The list of hidden sheets is displayed. Select the hidden sheets you want to
unhide.
All: Click this button to select all the sheets in the list.
None: Click this button to deselect all the sheets in the list.
Export to GIF/JPG/PNG/TIF
Use this tool to export a table, a chart, or any selected object on an Excel sheet to a GIF, JPG,
PNG or TIF file.
Dialog box
File name: Enter the name of the file to which the image should be saved, or select the file in a
folder.
Resize: Activate this option to modify the size of the graphic before saving it to a file.
Display the grid: Activate this option if you want XLSTAT to keep the gridlines that separate
the cells when generating the file. This option is only active when cells or tables are selected.
Display the main bar
Use this tool to display the main XLSTAT toolbar if it is no longer displayed, or to place the
main toolbar on the top left of the Excel worksheet.
External Preference Mapping (PREFMAP)
Use this method to model and represent graphically the preference of judges for a series of
objects depending on objective criteria or linear combinations of criteria.
Description
External preference mapping (PREFMAP) is used to display on the same chart (in two or three
dimensions) objects and indications showing the preference levels of judges (in general,
consumers) in certain points in the representation space. The preference level is represented
on the preference map in the form of vectors, ideal or anti-ideal points, or isopreference curves
depending on the type of model chosen.
These models are themselves constructed from objective data (for example physico-chemical
descriptors, or scores provided by experts on well-determined criteria) which enable the
position of the judges and the products to be interpreted according to objective criteria.
If there are only two or three objective criteria, the axes of the representation space are
defined by the criteria themselves (possibly standardized to avoid the effects of scale). On the
other hand, if the number of descriptors is higher, a method for reducing the number of
dimensions must be used. In general, PCA is used. Nevertheless, it is also possible to use
factor analysis if it is suspected that underlying factors are present, or MDS
(multidimensional scaling) if the initial data are the distances between the products. If the
descriptors used by the experts are qualitative variables, a Multiple Correspondence Analysis
(MCA) can be used to create a 2- or 3-dimensional space.
PREFMAP helps answer questions such as:
- How can I reposition a product so that it is preferred again by its core target?
Preference models
Several models have been proposed within the framework of PREFMAP. For a given judge, if
we designate yi to be their preference for product i, and X1, X2, …, Xp to be the p criteria or
combinations of criteria (in general p = 2) describing product i, the models are:
Vector:    yi = a0 + Σj=1..p aj·xij
Circular:  yi = a0 + Σj=1..p aj·xij + b·Σj=1..p xij²
Elliptic:  yi = a0 + Σj=1..p aj·xij + Σj=1..p bj·xij²
Quadratic: yi = a0 + Σj=1..p aj·xij + Σj=1..p bj·xij² + Σj=1..p-1 Σk=j+1..p cjk·xij·xik
The coefficients are estimated by multiple linear regression. Note that the models are
classified from the simplest to the most complex. XLSTAT lets you either choose one model to
use for all judges, or choose for each judge the model giving the best result as regards the
p-value of Fisher's F or the p-value of the F-ratio test. In other words, you can choose a model
which is both parsimonious and powerful at the same time.
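For the simplest case, the vector model above can be estimated by ordinary least squares. The following sketch (with a hypothetical 2-dimensional configuration and one judge's preferences) also computes the R² that governs vector length on the map:

```python
import numpy as np

def fit_vector_model(X, y):
    """Fit the PREFMAP vector model y = a0 + sum_j aj * xj by OLS.

    A sketch under the notation above: X is the (products x p) map
    configuration, y one judge's preference scores. Returns the
    coefficients (a0, a1, ..., ap) and the model's R^2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # add intercept a0
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ coefs
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    return coefs, r2

# 5 products positioned on a 2-D sensory map (hypothetical data)
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0], [4.0, 3.0]]
y = [2.0, 3.0, 7.0, 8.0, 12.0]   # one judge's preferences
coefs, r2 = fit_vector_model(X, y)
```

The circular, elliptic, and quadratic models fit into the same OLS framework by appending the squared and cross-product columns to the design matrix.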
The vector model represents individuals on the sensory map in the form of vectors. The size
of the vectors is a function of the R² of the model: the longer the vector, the better the
corresponding model. The preference of the judge will be stronger the further you are in the
direction indicated by the vector. The interpretation of the preference can be done by
projecting the different products on the vectors (product preference). The disadvantage of the
vector model is that it neglects the fact that for certain criteria, like the saltiness or temperature
for example, there can be an increase of preference to an optimum value then a decrease.
The circular model takes this concept of optimum into account. If the surface defined by the
model has a maximum in terms of preference (this happens if the estimated b coefficient is
negative), it is known as the ideal point. If the surface has a minimum in terms of preference
(this happens if the estimated b coefficient is positive), it is known as the anti-ideal point. With
the circular model, circular lines of isopreference can be drawn around the ideal or anti-ideal
points.
The elliptical model is more flexible, as it takes the effect of scale into account better. The
disadvantage of this model is that there is not always an optimum: as with the circular model, it
can generate an ideal point or an anti-ideal point if all the bj coefficients have the same sign,
but we may also obtain a saddle point (in the form of a surface shaped like a horse's saddle) if
all the bj coefficients do not have the same sign. The saddle point cannot easily be interpreted.
It corresponds only to an area where the preference is less sensitive to variations.
Lastly, the quadratic model takes more complex preference structures into account, as it
includes interaction terms. As with the elliptical model we can obtain an ideal, an anti-ideal, or
a saddle point.
Preference map
The preference map displays:
- The judges (or groups of judges if a classification of judges has been carried out
beforehand), represented according to the corresponding model by a vector, an ideal point
(labeled +), an anti-ideal point (labeled -), or a saddle point (labeled o);
- The descriptors, which correspond to the representation axes with which they are associated
(when a PCA precedes the PREFMAP, a biplot from the PCA is studied to interpret the
position of the objects as a function of the objective criteria).
The PREFMAP, with the interpretation given by the preference map, is a potentially very
powerful aid to interpretation and decision-making, since it allows preference data to be linked
to objective data. However, the models associated with the judges must fit correctly for the
interpretation to be reliable.
Preference scores
The preference score for each object for a given judge, whose value is between 0 (minimum)
and 1 (maximum), is calculated from the prediction of the model for the judge. The more the
product is preferred, the higher the score. A preference order of objects is deduced from the
preference scores for each of the judges.
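A minimal sketch of such a 0-to-1 score, assuming a simple min-max rescaling of the model predictions (the exact scaling used by XLSTAT is not detailed in this help):

```python
def preference_scores(predictions):
    """Rescale one judge's model predictions to [0, 1] preference scores.

    Assumes min-max scaling: the most preferred product gets 1,
    the least preferred gets 0.
    """
    lo, hi = min(predictions), max(predictions)
    return [(p - lo) / (hi - lo) for p in predictions]

# Hypothetical model predictions for 5 products
scores = preference_scores([2.0, 3.0, 7.0, 8.0, 12.0])
```

Ranking the scores for each judge then gives the preference order of the objects.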
Contour plot
The contour plot shows the regions corresponding to the various preference consensus levels
on a chart whose axes are the same as the preference map. At each point on the chart, the
percentage of judges for whom the preference calculated from the model is greater than their
mean preference is calculated. In the regions with cold colors (blue), a low proportion of
models give high preferences. On the other hand, the regions with hot colors (red) indicate a
high proportion of models with high preferences.
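The computation behind the contour plot can be sketched for one grid point as follows, assuming vector models for the judges (the function name, coefficients, and mean preferences below are hypothetical):

```python
def consensus_at(point, judge_models, judge_means):
    """Percentage of judges whose model predicts, at `point`, a
    preference above their own mean preference.

    A sketch of the per-point contour computation; each model is an
    (a0, a1, a2) vector-model coefficient triple (an assumption --
    XLSTAT also supports quadratic model forms).
    """
    x1, x2 = point
    above = sum(
        1
        for (a0, a1, a2), mean in zip(judge_models, judge_means)
        if a0 + a1 * x1 + a2 * x2 > mean
    )
    return 100.0 * above / len(judge_models)

# Three hypothetical judges evaluated at one point of the map
pct = consensus_at(
    (1.0, 1.0),
    judge_models=[(0.0, 1.0, 1.0), (5.0, -1.0, 0.0), (0.0, 2.0, 0.0)],
    judge_means=[1.5, 3.0, 3.0],
)
```

Evaluating this percentage over a grid of points and coloring by value (blue for low, red for high) produces the contour plot.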
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Preference data: Select the preference data. The table contains the various objects
(products) studied in the rows and the judges in the columns. This is reversed in transposed
mode. If column headers have been selected, check that the "Variable labels" option has been
activated.
Note: XLSTAT considers that preferences increase with the values (the more a judge likes an
object, the higher the value).
Center: Activate this option if you want to center the preference data before starting the
calculations.
Reduce: Activate this option if you want to reduce the preference data before starting the
calculations.
Preliminary transformation: Activate this option if you want to transform the data.
Normalization: Activate this option to standardize the data for the X-configuration
before carrying out the PREFMAP.
PCA (Pearson): Activate this option for XLSTAT to transform the selected descriptors
using a normalized Principal Component Analysis (PCA). The number of factors used
afterwards for the calculations is determined by the number of dimensions chosen.
PCA (Covariance): Activate this option for XLSTAT to transform the selected
descriptors using a non-normalized Principal Component Analysis (PCA). The number
of factors used afterwards for the calculations is determined by the number of
dimensions chosen.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Check this option if the first line of the data selected (Y, X, object labels)
contains a label.
Objects labels: Activate this option if observation labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the labels are automatically generated by XLSTAT
(Obs1, Obs2, …).
Model: Choose the type of model to use to link the preferences to the X configuration if the
option "Find the best model" (see Options tab) has not been activated.
Dimensions: Enter the number of dimensions to use for the PREFMAP model (default value:
2).
Options tab:
Find the best model: Activate this option to allow XLSTAT to find the best model for each
judge.
F-ratio: Activate this option to use the F-ratio test to select the model that is the best
compromise between quality of fit and parsimony in variables. A more complex
model is accepted if the p-value corresponding to the F is lower than the significance
level.
F: Activate this option to select the model that gives the best p-value computed from
Fisher's F.
Significance level (%): enter the significance level. The p-values of the models are displayed
in bold when they are less than this level.
Prediction tab:
Prediction: Activate this option if you want to select data to use in prediction mode. If you
activate this option, you need to make sure that the prediction dataset is structured like the
estimation dataset: same variables, in the same order in the selections. However,
variable labels must not be selected: the first row of the selections listed below must
correspond to data.
X / Configuration: Activate this option to select the configuration data to use for the
predictions. The first row must not include variable labels.
Object labels: Activate this option if you want to use object labels for the prediction data. The
first row must not include variable labels. If this option is not activated, the labels are
automatically generated by XLSTAT (PredObs1, PredObs2, etc.).
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Correlations: Activate this option to display the correlation matrix for the different variables
selected.
Analysis of variance: Activate this option to display the analysis of variance table for the
various models.
Model coefficients: Activate this option to display the parameters of the models.
Model predictions: Activate this option to display the predictions of the models.
Ranks of the preference scores: Activate this option to display the ranks for the preference
scores.
Sorted objects: Activate this option to display the objects in decreasing order of preference for
each of the judges.
If a preliminary transformation based on PCA has been requested, the following options are
available:
Factor loadings: Activate this option to display the coordinates of the variables (factor
loadings). The coordinates are equal to the correlations between the principal components and
the initial variables for normalized PCA.
Factor scores: Activate to display the coordinates of the observations (factor scores) in the
new space created by PCA. These coordinates are afterwards used for the PREFMAP.
Charts (PCA) tab:
This tab is visible only if a PCA based preliminary transformation has been requested.
Correlations charts: Activate this option to display charts showing the correlations between
the components and initial variables.
Vectors: Activate this option to display the input variables in the form of vectors.
Observations charts: Activate this option to display charts representing the observations in
the new space.
Labels: Activate this option to have observation labels displayed on the charts. The
number of labels displayed can be changed using the filtering option.
Biplots: Activate this option to display charts representing the observations and variables
simultaneously in the new space.
Vectors: Activate this option to display the initial variables in the form of vectors.
Labels: Activate this option to have observation labels displayed on the biplots. The
number of labels displayed can be changed using the filtering option.
Type of biplot: Choose the type of biplot you want to display. See the description section of
the PCA for more details.
Colored labels: Activate this option to show variable and observation labels in the same color
as the corresponding points. If this option is not activated the labels are displayed in black
color.
Charts tab:
Preference map: Activate this option to display the preference map.
Display ideal points: Activate this option to display the ideal points.
Display anti-ideal points: Activate this option to display the anti-ideal points.
Display saddle points: Activate this option to display the saddle points.
Domain restriction: Activate this option to only display the solution points (ideal, anti-
ideal, saddle) if they are within a domain to be defined. Then enter the size of the area
to be used for the display: this is expressed as a percentage of the area delimited by
the X configuration (value between 100 and 500).
Vectors length: The options below are used to determine the lengths of the vectors on
the preference map when a vector model is used.
o Coefficients: Choose this option so that the length of the vectors is only
determined by the coefficients of the vector model.
o R²: Choose this option so that the length of the vectors is only determined by
the R² value of the model. Thus, the better the model is adjusted, the longer
the corresponding vector on the map.
o Lengthening factor: Use this option to multiply the length of all vectors by an
arbitrary value (default value: 1).
Circular model:
Contour plot: Activate this option to display the contour plot (see description). Afterwards,
enter the level, relative to the judge's mean preference, above which the judge is considered
to have a preference for a product (the default value, 100, is the mean).
Results
Summary statistics: This table shows the number of non-missing values, the mean and the
standard deviation (unbiased) for all judges and all dimensions of the X configuration (before
transformation if that has been requested).
Correlation matrix: This table is displayed to give you a view of the correlations between the
various variables selected.
Model selection: This table shows which model was used for each judge. If the model is not a
vector model, the solution point type is displayed (ideal, anti-ideal, saddle) with its coordinates.
Analysis of variance: This table shows the statistics used to evaluate the goodness of fit of
the model (R², F, and Pr > F). When the p-value (Pr > F) is less than the chosen significance
level, it is displayed in bold. If the F-ratio test was chosen in the Options tab, the results of the
F-ratio test are displayed if it was successful at least once.
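As a rough illustration of how these per-judge statistics can be obtained (a minimal Python/NumPy sketch with made-up coordinates and liking scores, not XLSTAT's own implementation), a vector model amounts to an ordinary least-squares regression of one judge's preferences on the two map coordinates, from which R², F and Pr > F follow:

```python
import numpy as np
from scipy import stats

def vector_model_fit(coords, prefs):
    """OLS fit of one judge's preferences on the 2-D product coordinates.
    Returns (R2, F, p_value), as reported in the analysis-of-variance table."""
    n = len(prefs)
    X = np.column_stack([np.ones(n), coords])   # intercept + 2 map dimensions
    beta, *_ = np.linalg.lstsq(X, prefs, rcond=None)
    fitted = X @ beta
    ss_res = np.sum((prefs - fitted) ** 2)
    ss_tot = np.sum((prefs - prefs.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    k = 2                                       # model degrees of freedom
    f = (r2 / k) / ((1 - r2) / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)             # Pr > F
    return r2, f, p

# hypothetical product coordinates (e.g. from a PCA of sensory data)
coords = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, -0.5], [0.5, -1.0], [-0.5, 0.5]])
prefs = np.array([7.0, 5.0, 2.0, 4.0, 6.0])     # one judge's liking scores
r2, f, p = vector_model_fit(coords, prefs)
```

The ideal-point (circular and elliptic) models add quadratic terms to the same regression; the F test then compares the fitted model to the intercept-only model in the same way.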
Model coefficients: This table displays the various coefficients of the chosen model for each
judge.
Model predictions: This table shows the preferences estimated by the model for each judge
and each product. Note: if the preferences have been standardized, these results apply to the
standardized preferences.
Ranks of the preference scores: This table displays the ranks of the preference scores. The
higher the rank, the higher the preference.
Objects sorted by increasing preference order: This table shows, for each judge, the list of
objects in increasing order of preference. In other words, the last row corresponds to the
objects preferred by the judges according to the preference models.
The preference map and the contour plot are then displayed. On the preference map, the
ideal points are shown by (+), the anti-ideal points by (-) and saddle points by (o).
Example
http://www.xlstat.com/demo-prefmap.htm
References
Naes T. and Risvik E. (1996). Multivariate Analysis of Data in Sensory Science. Elsevier
Science, Amsterdam.
Schlich P. and McEwan J.A. (1992). Cartographie des préférences. Un outil statistique pour
l'industrie agro-alimentaire. Sciences des aliments, 12, 339-355.
Generalized Procrustes Analysis (GPA)
Use Generalized Procrustes Analysis (GPA) to transform several multidimensional
configurations so that they become as much alike as possible. A comparison of transformed
configurations can then be carried out.
Description
Procrustes (or Procustes), whose name in ancient Greek means "the one who lengthens while
stretching", is a character of Greek mythology. The name of the bandit Procrustes is
associated with the bed he used to torture the travelers to whom he offered lodging.
Procrustes laid his future victim on a bed whose size never matched: short for the tall ones
and long for the short ones. Depending on the case, he either cut off with a sword whatever
hung over the edge of the bed, or stretched the traveler's body until it reached the size of the
bed, using a mechanism that Hephaestus had built for him. In both cases the torment was
appalling. Theseus, while traveling to Athens, met the robber, saw through the trap
and lay down diagonally on the bed. When Procrustes tried to adjust the body of Theseus, he
did not understand the situation immediately and remained perplexed long enough for
Theseus to cut the brigand in two equal parts with his sword.
Concept
Let us take the example of 5 experts rating 4 cheeses according to 3 criteria, with ratings
ranging from 1 to 10. One expert may tend to be harsher in his ratings, shifting his scores
downward, while another may tend to rate around the average, without daring to use extreme
ratings. Working on an average configuration could therefore lead to false interpretations.
One can easily see that a translation of the ratings of the first expert is necessary, and that
rescaling the ratings of the second expert would bring his ratings closer to those of the other
experts.
Once the consensus configuration has been obtained, it is possible to run a PCA (Principal
Components Analysis) on the consensus configuration in order to allow an optimal
visualization in two or three dimensions.
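The transformations mentioned above (translation, rescaling, rotation/reflection) can be sketched for a single pair of configurations. The following Python sketch, with hypothetical data, uses the standard SVD solution of the least-squares Procrustes problem; it is illustrative only, not XLSTAT's implementation. GPA iterates alignments of this kind over all m configurations against their mean until the consensus stabilizes:

```python
import numpy as np

def procrustes_align(X, Y):
    """Translate, rescale and rotate configuration Y so that it best matches
    configuration X in the least-squares sense (one Procrustes superimposition).
    GPA repeats such steps, aligning each configuration to the current consensus."""
    Xc = X - X.mean(axis=0)                    # translation: center both
    Yc = Y - Y.mean(axis=0)
    U, S, Vt = np.linalg.svd(Yc.T @ Xc)        # optimal rotation/reflection
    R = U @ Vt
    s = S.sum() / np.trace(Yc.T @ Yc)          # optimal isotropic scaling factor
    return s * Yc @ R

# Y is a shifted, rescaled, rotated copy of X, so alignment should
# recover X's centered coordinates almost exactly.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))                    # 4 objects, 3 dimensions
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
Y = 2.0 * X @ Q + 5.0
aligned = procrustes_align(X, Y)
```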
Structure of the data
1. If the number and the designation of the p dimensions are identical for the m configurations,
one speaks in sensory analysis of conventional profiles.
2. If the number p and the designation of the dimensions vary from one configuration to the
other, one speaks in sensory analysis of free profiles, and the data can then only be
represented by a series of m matrices of size n x p(k), k = 1, 2, …, m.
If the labels of the dimensions vary from one configuration to the other, XLSTAT indicates by
Var(i) the ith dimension of the configurations, but it keeps the original labels when displaying
the correlations circle chart.
Data transposition
It sometimes occurs that the number (m x p) of columns exceeds the limits of Excel. To get
around that drawback, XLSTAT allows you to use transposed tables. To use transposed tables
(in that case all tables that you want to select need to be transposed), you only need to click
the blue arrow at the bottom left of the dialog box, which then becomes red.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click this button to display the help.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Configurations: Select the data that correspond to the configurations. If a column header has
been selected, check that the "Dimension labels" option has been activated.
Equal: Choose this option if the number of variables is identical for all the tables. In that
case XLSTAT automatically determines the number of variables in each table.
User defined: Choose this option to select a column that contains the number of
variables in each table. If the "Variable labels" option has been activated, the
first row must correspond to a header.
Configuration labels: Check this option if you want to use the available configuration labels. If
you do not check this option, labels will be created automatically (C1, C2, etc.). If a column
header has been selected, check that the "Dimension labels" option has been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Dimension labels: Activate this option if the first row (or column if in transposed mode) of the
selected data (configurations, configuration labels, object labels) contains a header.
Object labels: Check this option if you want to use the available object labels. If you do
not check this option, labels will be created automatically (Obs1, Obs2, etc.). If a column
header has been selected, check that the "Dimension labels" option has been activated.
Options tab:
Scaling: Activate this option to rescale the matrices during the GPA.
Rotation/Reflection: Activate this option to perform the rotation/reflection steps of the GPA.
PCA: Activate this option to run a PCA at the end of the GPA steps.
Filter factors: You can activate one of the following two options in order to reduce the number
of factors which are taken into account after the PCA.
Minimum %: Activate this option then enter the minimum percentage of the total
variability that the chosen factors must represent.
Maximum Number: Activate this option to set the number of factors to take into
account.
Tests:
Consensus test: Activate this option to use a permutation test to determine whether a
consensus is reached after the GPA transformations.
Dimensions test: Activate this option to use a permutation test to determine the
appropriate number of factors to keep.
Number of permutations: Enter the number of permutations to perform for the tests (Default
value: 300)
Significance level (%): Enter the significance level for the tests.
Stop conditions:
Iterations: Enter the maximum number of iterations for the algorithm. The calculations
are stopped when the maximum number of iterations has been exceeded. Default
value: 100.
Convergence: Enter the maximum value of the evolution in the convergence criterion
from one iteration to another which, when reached, means that the algorithm is
considered to have converged. Default value: 0.00001.
Missing data tab:
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove the observations: Activate this option to remove observations with missing data.
Outputs tab:
Residuals by object: Activate this option to display the residuals for each object.
Residuals by configuration: Activate this option to display the residuals for each
configuration.
Scaling factors: Activate this option to display the scaling factors applied to each
configuration.
Rotation matrices: Activate this option to display the rotation matrices corresponding to each
configuration.
The following options are available only if a PCA has been requested:
Consensus configuration: Activate this option to display the coordinates of the dimensions
for the consensus configuration.
Configurations: Activate this option to display the coordinates of the dimensions for each
configuration.
Objects coordinates: Activate this option to display the coordinates of the objects after the
transformations.
Presentation by object: Activate this option to display one table of coordinates per
object.
The following options are available only if a PCA has been requested:
Correlations charts: Activate this option to display the correlations charts for the consensus
configuration and individual configurations.
Vectors: Activate this option to display the dimensions in the form of vectors.
Objects coordinates: Activate this option to display the maps showing the objects.
Presentation by configuration: Activate this option to display a chart where the color
depends on the configuration.
Presentation by object: Activate this option to display a chart where the color depends
on the object.
Colored labels: Activate this option to show variable and observation labels in the same color
as the corresponding points. If this option is not activated the labels are displayed in black
color.
Type of biplot: Choose the type of biplot you want to display. See the description section of
the PCA for more details.
Charts tab:
Residuals by object: Activate this option to display the bar chart of the residuals for each
object.
Residuals by configuration: Activate this option to display the bar chart of the residuals for
each configuration.
Scaling factors: Activate this option to display the bar chart of the scaling factors applied
to each configuration.
Test histograms: Activate this option to display the histograms that correspond to the
consensus and dimensions tests.
Results
PANOVA table: Inspired by the format of the analysis of variance table of the linear model,
this table allows you to evaluate the relative contribution of each transformation to the
evolution of the variance. It displays the residual variance before and after the
transformations, and the contribution of the rescaling, rotation and translation steps to the
evolution of the variance. Fisher's F statistic is computed to compare the relative
contributions of the transformations, and the corresponding probabilities indicate whether the
contributions are significant or not.
Residuals by object: This table and the corresponding bar chart show the distribution of the
residual variance by object. They make it possible to identify the objects for which the GPA
has been least effective, in other words, the objects that are farthest from the consensus
configuration.
Residuals by configuration: This table and the corresponding bar chart show the distribution
of the residual variance by configuration. They make it possible to identify the configurations
for which the GPA has been least effective, in other words, the configurations that are
farthest from the consensus configuration.
Scaling factors for each configuration: This table and the corresponding bar chart allow you
to compare the scaling factors applied to the configurations. In sensory analysis, this helps to
understand how the experts use the rating scales.
Rotation matrices: The rotation matrices that have been applied to each configuration are
displayed if requested by the user.
Results of the consensus test: This table displays the number of permutations that have
been performed, the value of Rc which corresponds to the proportion of the original variance
explained by the consensus configuration, and the quantile corresponding to Rc, calculated
using the distribution of Rc obtained from the permutations. To evaluate if the GPA is effective,
one can set a confidence interval (typically 95%), and if the quantile is beyond the confidence
interval, one concludes that the GPA significantly reduced the variance.
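To make the logic of the consensus test concrete, here is a hedged Python sketch (hypothetical data, not XLSTAT's implementation): Rc is computed as the proportion of the total variance explained by the mean configuration, and its permutation distribution is obtained by shuffling the objects (rows) of each configuration independently:

```python
import numpy as np

rng = np.random.default_rng(0)

def consensus_rc(configs):
    """Rc: proportion of the total variance explained by the consensus (mean)
    configuration. Assumes the configurations are already centered and
    transformed (e.g. the output of the GPA steps)."""
    G = np.mean(configs, axis=0)
    total = sum(np.sum(c ** 2) for c in configs)
    residual = sum(np.sum((c - G) ** 2) for c in configs)
    return 1.0 - residual / total

def consensus_quantile(configs, n_perm=300):
    """Quantile of the observed Rc within its permutation distribution,
    obtained by shuffling the rows of each configuration independently."""
    rc_obs = consensus_rc(configs)
    below = sum(
        consensus_rc([c[rng.permutation(len(c))] for c in configs]) < rc_obs
        for _ in range(n_perm)
    )
    return below / n_perm

# three hypothetical configurations: noisy copies of a common layout,
# so the observed Rc should sit far in the upper tail of the permutations
base = rng.normal(size=(6, 2))
base -= base.mean(axis=0)
configs = [base + 0.05 * rng.normal(size=base.shape) for _ in range(3)]
q = consensus_quantile(configs, n_perm=200)
```

A quantile close to 1 means that almost no random rearrangement of the objects explains as much variance as the actual consensus, which is the evidence the test looks for.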
Results of the dimensions test: This table displays, for each factor retained at the end of the
PCA step, the number of permutations that have been performed, the F calculated after the
GPA (here, F is the ratio of the variance between the objects to the variance between the
configurations), and the quantile corresponding to F, calculated using the distribution of F
obtained from the permutations. To evaluate whether a dimension contributes significantly to
the quality of the GPA, one can set a confidence interval (typically 95%): if the quantile is
beyond the confidence interval, one concludes that the factor contributes significantly. As an
indication, the critical values and the p-value corresponding to Fisher's F distribution for the
selected alpha significance level are also displayed. The conclusions drawn from Fisher's F
distribution may differ greatly from what the permutation test indicates: using Fisher's F
distribution requires assuming the normality of the data, which is not necessarily the case.
Objects coordinates before the PCA: This table corresponds to the mean over the
configurations of the objects coordinates, after the GPA transformations and before the PCA.
Eigenvalues: If a PCA has been requested, the table of the eigenvalues and the
corresponding scree-plot are displayed. The percentage of the total variability corresponding to
each axis is computed from the eigenvalues.
Correlations of the variables with the factors: These results correspond to the correlations
between the variables of the consensus configuration before and after the transformations
(GPA and PCA if the latter has been requested). These results are not displayed on the circle
of correlations as they are not always interpretable.
Objects coordinates: This table corresponds to the mean over the configurations of the
objects coordinates, after the transformations (GPA and PCA if the latter has been requested).
These results are displayed on the objects charts.
Variance by configuration and by dimension: This table shows how the percentage of total
variability corresponding to each axis is divided up among the configurations.
Correlations of the variables with the factors: These results, displayed for all the
configurations, correspond to the correlations between the variables of the configurations
before and after the transformations (GPA and PCA if the latter has been requested). These
results are displayed on the circle of correlations.
Objects coordinates (presentation by object): This series of tables corresponds to the
objects coordinates for each configuration after the transformations (GPA and PCA if the latter
has been requested). These results are displayed on the second series of objects charts.
Example
http://www.xlstat.com/demo-gpa.htm
References
Naes T. and Risvik E. (1996). Multivariate Analysis of Data in Sensory Science. Elsevier
Science, Amsterdam.
Wakeling I.N., Raats M.M. and MacFie H.J.H. (1992). A new significance test for consensus
in generalized Procrustes analysis. Journal of Sensory Studies, 7, 91-96.
Wu W., Guo Q., de Jong S. and Massart D.L. (2002). Randomisation test for the number of
dimensions of the group average space in generalised Procrustes analysis. Food Quality and
Preference, 13, 191-200.
Penalty analysis
Use this tool to analyze the results of a survey run using a five-level JAR (Just About Right)
scale, on which the intermediate level 3 corresponds to the value preferred by the consumer.
Description
Penalty analysis is a method used in sensory data analysis to identify potential directions for
the improvement of products, on the basis of surveys performed on consumers or experts.
Two types of data are used:
- Preference data (or liking scores) that correspond to a global satisfaction index for a
product (for example, liking scores on a 9-point scale for a chocolate bar), or for a
characteristic of a product (for example, the comfort of a car rated from 1 to 10).
- Data collected on a JAR (Just About Right) 5-point scale. These correspond to ratings
ranging from 1 to 5 for one or more characteristics of the product of interest. 1
corresponds to "Not enough at all", 2 to "Not enough", 3 to "JAR" (Just About
Right), an ideal for the consumer, 4 to "Too much" and 5 to "Far too much". For
example, for a chocolate bar, one can rate the bitterness, and for the comfort of the car,
the sound volume of the engine.
The method, based on multiple comparisons such as those used in ANOVA, consists in
identifying, for each characteristic studied on the JAR scale, whether the ratings on the JAR
scale are related to significantly different liking scores. For example, if a chocolate is
too bitter, does that significantly lower the liking scores?
The word penalty comes from the fact that we are looking for the characteristics that can
penalize consumer satisfaction for a given product. The penalty is the difference between
the mean of the liking scores for the JAR category and the mean of the scores for the other
categories. The method proceeds in two steps:
1. The data of the JAR scale are aggregated: on one hand, categories 1 and 2 are grouped,
and on the other hand categories 4 and 5 are grouped, which leads to a three-point scale.
We now have three levels: Not enough, JAR, and Too much.
2. We then compute and compare the means of the liking scores for the three categories, to
identify significant differences. The differences between the means of the two non-JAR
categories and the JAR category are called mean drops.
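The two steps above can be sketched in a few lines of Python (illustrative only; the JAR ratings and liking scores below are made up):

```python
import numpy as np

def penalty_analysis(jar, liking):
    """Collapse the 5-point JAR scale to 3 levels and compute, for each
    non-JAR level, the mean drop relative to the JAR category."""
    jar = np.asarray(jar)
    liking = np.asarray(liking, dtype=float)
    groups = {
        "Not enough": jar <= 2,   # levels 1 and 2 grouped
        "JAR":        jar == 3,
        "Too much":   jar >= 4,   # levels 4 and 5 grouped
    }
    means = {name: liking[mask].mean() for name, mask in groups.items() if mask.any()}
    drops = {name: means["JAR"] - m for name, m in means.items() if name != "JAR"}
    return means, drops

jar    = [1, 2, 3, 3, 3, 4, 5, 3, 2, 4]   # hypothetical JAR ratings, one attribute
liking = [4, 5, 8, 9, 8, 6, 3, 7, 5, 5]   # liking scores of the same consumers
means, drops = penalty_analysis(jar, liking)
```

XLSTAT then applies multiple comparison tests to decide whether each mean drop is significant; the sketch stops at the descriptive step.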
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Liking scores: Select the preference data. Several columns can be selected. If a column
header has been selected, check that the "Column labels" option has been activated.
Just about right data: Select the data measured on the JAR scale. Several columns can be
selected. If a column header has been selected, check that the "Column labels" option has
been activated.
Labels of the 3 JAR levels: Activate this option if you want to use labels for the 3 point JAR
scale. There must be three rows and as many columns as in the Just about right data
selection. If a column header has been selected, check that the "Column labels" option has
been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column labels: Activate this option if the first row of the data selections (Liking scores, Just
about right data, labels of the 3 JAR levels) includes a header.
Options tab:
Threshold for population size: Enter the minimum % of the total population that a
category must represent to be taken into account for multiple comparisons.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to ignore the observations that contain missing
data.
Estimate missing data: Activate this option to estimate the missing data by using the mean of
the variables.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
Correlations: Activate this option to display the matrix of correlations of the selected
dimensions. If all data are ordinal, it is recommended to use the Spearman coefficient of
correlation.
3 levels table: Activate this option to display the JAR data once they are collapsed from 5 to 3
categories.
Penalty table: Activate this option to display the table showing the mean drops for the
non-JAR categories, as well as the penalties.
Multiple comparisons: Activate this option to run the multiple comparisons tests on the
difference between means. Several methods are available, grouped into two categories:
multiple pairwise comparisons, and multiple comparisons with a control, the latter being here
the JAR category.
Significance level (%): Enter the significance level used to determine if the differences
are significant or not.
Charts tab:
Stacked bars: Activate this option to display a stacked bars chart that allows visualizing the
relative frequencies of the various categories of the JAR scale.
3D: Activate this option to display the stacked bars in three dimensions.
Summary: Activate this option to display the charts that summarize the penalty analysis.
Mean drops vs %: Activate this option to display the chart that shows the mean drops as a
function of the corresponding % of the population of testers.
Results
After the display of the basic statistics and the correlation matrix for the liking scores and the
JAR data, XLSTAT displays a table that shows for each JAR dimension the frequencies for the
5 levels. The corresponding stacked bar diagram is then displayed.
The table of the collapsed data on three levels is then displayed, followed by the
corresponding relative frequencies table and the stacked bar diagram.
The penalty table shows the statistics for the 3-point-scale JAR data, including the
means, the mean drops, the penalties and the results of the multiple comparisons tests.
Last, the summary charts make it possible to quickly identify the JAR dimensions for which the
differences between the JAR category and the 2 non-JAR categories (Not enough, Too much)
are significant: when the difference is significant, the bars are displayed in red,
whereas they are displayed in green when the difference is not significant. The bars are
displayed in grey when the size of a group is lower than the selected threshold (see the Options
tab of the dialog box).
The mean drops vs % chart displays the mean drops as a function of the corresponding % of
the population of testers. The threshold % of the population over which the results are
considered significant is displayed with a dotted line.
Example
http://www.xlstat.com/demo-pen.htm
References
Popper P., Schlich P., Delwiche J., Meullenet J.-F., Xiong R., Moskovitz H., Lesniauskas
R.O., Carr T.B., Eberhardt K., Rossi F., Vigneau E. Qannari, Courcoux P. and Marketo C.
(2004). Workshop summary: Data Analysis workshop: getting the most out of just-about-right
data. Food Quality and Preference, 15, 891-899.
Semantic differential charts
Use this method to easily visualize on a chart, ratings given to objects by a series of judges on
a series of dimensions.
Description
The psychologist Charles E. Osgood developed the Semantic differential visualization
method in order to plot the differences between individuals' connotations for a given word.
When applying the method, Osgood asked survey participants to describe a word on a series
of scales ranging from one extreme to the other (for example favorable/unfavorable). When
patterns were significantly different from one individual to the other, or from one group of
individuals to the other, Osgood could then interpret the Semantic Differential as a mapping of
the psychological or even behavioral distance between the individuals or groups.
The method also applies to the analysis of experts' perceptions of a product (for example a
yogurt) described by a series of criteria (for example acidity, saltiness, sweetness, softness)
on similar scales (either from one extreme to the other, or on the same Likert scale for each
criterion). A Semantic differential chart allows you to quickly see which experts agree, and
whether significantly different patterns are obtained.
This tool can be used in sensory data analysis. Here are two examples:
A panel of experts rates (from 1 to 5) a chocolate bar (the object) on three criteria (the
"attributes"), namely the flavor, the texture and the odor. In this case, the input table contains
in cell (i,j) the rating given by the ith judge to the product on the jth criterion. The semantic
differential chart allows you to quickly compare the judges.
A panel of experts rates (from 1 to 5) a series of chocolate bars (the objects) on three criteria
(the "attributes"), namely the flavor, the texture and the odor. In this case, the input table
contains in cell (i,j) the average rating given by the judges to the ith product on the jth
criterion. The semantic differential chart allows you to quickly compare the objects.
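A chart of this kind can be sketched with matplotlib (hypothetical ratings; one common layout, with the descriptors on the abscissa and one line per judge — this is an illustration, not the chart XLSTAT itself draws):

```python
import matplotlib
matplotlib.use("Agg")                 # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

descriptors = ["flavor", "texture", "odor"]                    # hypothetical attributes
judges = {"Judge A": [4, 3, 5], "Judge B": [2, 4, 3], "Judge C": [5, 5, 4]}

fig, ax = plt.subplots()
x = np.arange(len(descriptors))
for name, ratings in judges.items():
    ax.plot(x, ratings, marker="o", label=name)                # one profile line per judge
ax.set_xticks(x)
ax.set_xticklabels(descriptors)
ax.set_ylim(1, 5)
ax.set_ylabel("rating (1-5)")
ax.legend()
fig.canvas.draw()                     # finalize the figure
```

Judges whose lines run roughly parallel agree on the profile of the object; crossing lines reveal disagreements at a glance.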
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Data: Select the data on the Excel worksheet. If a column header has been selected, check
that the "Descriptor labels" option has been activated.
Objects: Choose this option to create a chart where values correspond to the abscissa,
descriptors to ordinates, and the objects to the lines on the chart.
Descriptors: Choose this option to create a chart where objects correspond to the
abscissa, descriptors to ordinates, and the descriptors to the lines on the chart.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Descriptor labels: Activate this option if the first row of the selected data (data, observation
labels) contains a header.
Observation labels: Activate this option if observation labels are available. Then select the
corresponding data. If the "Descriptor labels" option is activated, you need to include a header
in the selection. If this option is not activated, the observation labels are automatically
generated by XLSTAT (Obs1, Obs2, etc.).
Charts tab:
Color: Activate this option to use different colors when displaying the lines corresponding to
the various objects/descriptors.
Results
The result that is displayed is the Semantic Differential chart. As it is an Excel chart, you can
modify it as much as you want.
Example
http://www.xlstat.com/demo-sd.htm
References
Judd C.M., Smith E.R. and Kidder L.H (1991). Research Methods in Social Relations. Holt,
Rinehart & Winston, New York.
Osgood C.E., Suci G.J. and Tannenbaum P.H. (1957). The Measurement of Meaning.
University of Illinois Press, Urbana.
Oskamp S. (1977). Attitudes and Opinions. Prentice-Hall, Englewood Cliffs, New Jersey.
Snider J. G. and Osgood C.E. (1969). Semantic Differential Technique. A Sourcebook. Aldine
Press, Chicago.
Descriptive analysis (Time Series)
Use this tool to compute the descriptive statistics that are specially suited for time series
analysis.
Description
One of the key issues in time series analysis is to determine whether the value we observe at
time t depends on what has been observed in the past or not. If the answer is yes, then the
next question is how.
The sample autocovariance function (ACVF) and the autocorrelation function (ACF) give an
idea of the degree of dependence between the values of a time series. The visualization of the
ACF or of the partial autocorrelation function (PACF) helps to identify suitable models to
explain the past observations and to make predictions. The theory shows that the PACF
of an AR(p) process - an autoregressive process of order p - is zero for lags greater than p.
The cross-correlation function (CCF) relates two time series, to determine whether they
co-vary and to what extent.
The ACVF, the ACF, the PACF and the CCF are computed by this tool.
One important step in time series analysis is the transformation of time series (see
Transforming time series), whose goal is to obtain a white noise. Obtaining a white noise
means that all deterministic and autocorrelation components have been removed. Several
white noise tests, based on the ACF, are available to test whether a time series can be
considered a white noise or not.
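As an illustration (a plain NumPy sketch on simulated data, not XLSTAT's code), the ACF can be estimated directly from its definition, and the PACF from the ACF via the Durbin-Levinson recursion; for a simulated AR(1) process the PACF indeed cuts off after lag 1:

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelation function (the usual biased estimator)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    c0 = np.dot(x, x) / n
    return np.array([np.dot(x[:n - h], x[h:]) / (n * c0) for h in range(nlags + 1)])

def pacf(x, nlags):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    rho = acf(x, nlags)
    phi = np.zeros((nlags + 1, nlags + 1))
    out = np.zeros(nlags + 1)
    out[0] = 1.0
    for k in range(1, nlags + 1):
        num = rho[k] - np.dot(phi[k - 1, 1:k], rho[k - 1:0:-1])
        den = 1.0 - np.dot(phi[k - 1, 1:k], rho[1:k])
        phi[k, k] = num / den
        phi[k, 1:k] = phi[k - 1, 1:k] - phi[k, k] * phi[k - 1, k - 1:0:-1]
        out[k] = phi[k, k]
    return out

# simulate an AR(1) process x_t = 0.7 x_{t-1} + e_t:
# its PACF should be close to 0.7 at lag 1 and near zero beyond
rng = np.random.default_rng(2)
e = rng.normal(size=2000)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.7 * x[t - 1] + e[t]
r = acf(x, 10)
phi = pacf(x, 10)
```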
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click this button to reload the default options.
General tab:
Time series: Select the data that correspond to the time series for which you want to
compute the various spectral functions.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Series labels: Activate this option if the first row of the selected series includes a header.
Options tab:
Time steps: the number of time steps for which the statistics are computed can be
automatically determined by XLSTAT, or set by the user.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Replace by the average of the previous and next values: Activate this option to estimate
the missing data by the mean of the first preceding non-missing value and of the first
following non-missing value.
Outputs tab:
Descriptive statistics: Activate this option to display the descriptive statistics of the selected
series.
Autocorrelations: Activate this option to estimate the autocorrelation function of the selected
series (ACF).
Autocovariances: Activate this option to estimate the autocovariance function of the selected
series.
Partial autocorrelations: Activate this option to compute the partial autocorrelations of the
selected series (PACF).
Confidence interval (%): Activate this option to display the confidence intervals. The value
you enter (between 1 and 99) is used to determine the confidence intervals for the estimated
values. Confidence intervals are automatically displayed on the charts.
White noise assumption: Activate this option if you want the confidence intervals to be
computed under the assumption that the time series is a white noise.
White noise tests: Activate this option if you want XLSTAT to display the results of the
normality test and the white noise tests.
h1 : Enter the minimum number of lags to compute the white noise tests.
h2 : Enter the maximum number of lags to compute the white noise tests.
s : Enter the number of lags between two successive series of white noise tests. (h2-h1)
must be a multiple of s.
Charts tab:
Autocorrelogram: Activate this option to display the autocorrelogram of the selected series.
Partial autocorrelogram: Activate this option to display the partial autocorrelogram of the
selected series.
Cross-correlations: Activate this option to display the cross-correlations diagram in the case
where several series have been selected.
Results
Summary statistics: This table displays for the selected variables, the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation (unbiased).
Normality and white noise tests: Table displaying the results of the various tests. The
Jarque-Bera normality test is computed once on the time series, while the other tests (Box-
Pierce, Ljung-Box and McLeod-Li) are computed at each selected lag. The degrees of freedom
(DF), the value of the statistics and the p-value computed using a Chi-Square(DF) distribution
are displayed. For the Jarque-Bera test, the lower the p-value, the less likely the normality of
the sample. For the three other tests, the lower the p-value, the less likely the randomness of
the data.
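As an illustration of how these statistics are built, the sample ACF and the Ljung-Box Q statistic can be sketched in a few lines of Python (hypothetical helper functions, not XLSTAT's implementation):

```python
from math import fsum

def acf(x, max_lag):
    """Sample autocorrelation function r_1 .. r_max_lag."""
    n = len(x)
    m = fsum(x) / n
    c0 = fsum((v - m) ** 2 for v in x)
    return [fsum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / c0
            for k in range(1, max_lag + 1)]

def ljung_box(x, h):
    """Ljung-Box Q(h) = n(n+2) * sum_{k=1..h} r_k^2 / (n-k)."""
    n = len(x)
    r = acf(x, h)
    return n * (n + 2) * fsum(r[k - 1] ** 2 / (n - k) for k in range(1, h + 1))
```

Q is then compared to the quantile of a Chi-square distribution with h degrees of freedom; a large Q (hence a small p-value) argues against the white noise hypothesis.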
Descriptive functions for the series: Table displaying for each time lag the values of the
various selected descriptive functions, and the corresponding confidence intervals.
Charts: For each selected function, a chart is displayed if the "Charts" option has been
activated in the dialog box.
If several time series have been selected and if the "cross-correlations" option has been
activated, the following results are displayed:
Normality and white noise tests: Table displaying the results of the various tests, Box-
Pierce, Ljung-Box and McLeod-Li, which are computed at each selected lag. The degrees of
freedom (DF), the value of the statistics and the p-value computed using a Chi-Square(DF)
distribution are displayed. The lower the p-value, the less likely the randomness of the data.
Cross-correlations: Table displaying for each time lag the value of the cross-correlation
function.
Example
A tutorial explaining how to use descriptive analysis with a time series is available on the
Addinsoft web site. To consult the tutorial, please go to:
http://www.xlstat.com/demo-desc.htm
References
Box G. E. P. and Jenkins G. M. (1976). Time Series Analysis: Forecasting and Control.
Holden-Day, San Francisco.
Brockwell P.J. and Davis R.A. (1996). Introduction to Time Series and Forecasting. Springer
Verlag, New York.
Fuller W.A. (1996). Introduction to Statistical Time Series, Second Edition. John Wiley & Sons,
New York.
Jarque C.M. and Bera A.K. (1980). Efficient tests for normality, heteroscedasticity and serial
independence of regression residuals. Economic Letters, 6, 255-259.
Ljung G.M. and Box G. E. P. (1978). On a measure of lack of fit in time series models.
Biometrika, 65, 297-303.
McLeod A.I. and Li W.K. (1983). Diagnostic checking ARMA time series models using
squared-residual autocorrelations. Journal of Time Series Analysis, 4, 269-273.
Shumway R.H. and Stoffer D.S. (2000). Time Series Analysis and Its Applications. Springer
Verlag, New York.
Time series transformation
Use this tool to transform a time series A into a time series B that has better properties:
removed trend, reduced seasonality, and better normality.
Description
XLSTAT offers four different possibilities for transforming a time series {Xt} into {Yt}, (t = 1,...,n):

Box-Cox transformation to improve the normality of the time series; the Box-Cox
transformation is defined by the following equation:

Yt = (Xt^λ - 1) / λ, for Xt > 0, λ ≠ 0
Yt = ln(Xt), for Xt > 0, λ = 0

XLSTAT accepts a fixed value of λ, or it can find the value that maximizes the likelihood, the
model being a simple linear model with the time as sole explanatory variable.
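The transformation and a crude version of the λ optimization can be sketched in Python (the grid search and helper names are illustrative assumptions; XLSTAT's optimizer is more refined than this):

```python
from math import log, exp, fsum

def box_cox(x, lam):
    """Box-Cox transform of a positive series."""
    if lam == 0.0:
        return [log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]

def best_lambda(x, grid=None):
    """Grid-search the lambda maximizing the profile log-likelihood of y = a + b*t."""
    grid = grid or [i / 4 for i in range(-8, 9)]   # lambda in -2.0 .. 2.0
    n = len(x)
    t = list(range(1, n + 1))
    log_sum = fsum(log(v) for v in x)              # Jacobian term of the transform
    best, best_ll = None, float("-inf")
    for lam in grid:
        y = box_cox(x, lam)
        tm, ym = fsum(t) / n, fsum(y) / n          # least-squares fit of y on t
        b = fsum((ti - tm) * (yi - ym) for ti, yi in zip(t, y)) \
            / fsum((ti - tm) ** 2 for ti in t)
        a = ym - b * tm
        sse = max(fsum((yi - (a + b * ti)) ** 2 for ti, yi in zip(t, y)), 1e-12)
        ll = -0.5 * n * log(sse / n) + (lam - 1.0) * log_sum
        if ll > best_ll:
            best, best_ll = lam, ll
    return best
```

For an exponentially growing series, the log transform (λ = 0) linearizes the trend, so the grid search selects a λ close to zero.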
Differencing, to remove trend and seasonalities and to obtain stationarity of the time series.
The difference equation writes:

Yt = (1 - B)^d (1 - B^s)^D Xt

where d is the order of the first differencing component, s is the period of the seasonal
component, D is the order of the seasonal component, and B is the lag operator defined by:

B Xt = Xt-1
The values of (d, D, s) can be chosen in a trial and error process, or guessed by looking at the
descriptive functions (ACF, PACF). Typical values are (1,1,s), (2,1,s). s is 12 for monthly data
with a yearly seasonality, 0 when there is no seasonality.
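The differencing operator above can be sketched as iterated first differences (illustrative helper, not XLSTAT code):

```python
def difference(x, d=0, D=0, s=0):
    """Apply (1 - B)^d (1 - B^s)^D to the series x."""
    y = list(x)
    for _ in range(d):                     # ordinary differencing, applied d times
        y = [y[t] - y[t - 1] for t in range(1, len(y))]
    for _ in range(D):                     # seasonal differencing at lag s, applied D times
        y = [y[t] - y[t - s] for t in range(s, len(y))]
    return y
```

Each pass of (1 - B) shortens the series by one value and each pass of (1 - B^s) by s values, which is why the first d + s·D transformed values are unavailable.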
Detrending and deseasonalizing, using the classical decomposition model which writes:
Xt = mt + st + εt

where mt is the trend component, st the seasonal component, and εt a N(0,1) white noise
component. XLSTAT allows you to fit this model in two separate and/or successive steps:

1 - Detrending model:

Xt = mt + εt = Σi=0..k ai t^i + εt

where k is the polynomial degree. The ai parameters are obtained by fitting a linear model to
the data. The transformed time series writes:

Yt = Xt - Σi=0..k ai t^i

2 - Deseasonalization model:

Xt = st + εt = bi + εt, with i = t mod p

where p is the period. The bi parameters are obtained by fitting a linear model to the data. The
transformed time series writes:

Yt = Xt - bi, with i = t mod p
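Both steps can be sketched in Python; the degree-1 detrending case is shown in closed form, and deseasonalization uses per-position means as the bi estimates (hypothetical helpers, not XLSTAT's code):

```python
from math import fsum

def detrend_linear(x):
    """Remove a fitted linear trend a + b*t (degree-1 case of the polynomial model)."""
    n = len(x)
    t = list(range(1, n + 1))
    tm, xm = fsum(t) / n, fsum(x) / n
    b = fsum((ti - tm) * (xi - xm) for ti, xi in zip(t, x)) \
        / fsum((ti - tm) ** 2 for ti in t)
    a = xm - b * tm
    return [xi - (a + b * ti) for ti, xi in zip(t, x)]   # residuals = detrended series

def deseasonalize(x, p):
    """Subtract the mean of each position i = t mod p (the b_i coefficients)."""
    means = [fsum(x[i::p]) / len(x[i::p]) for i in range(p)]
    return [xi - means[i % p] for i, xi in enumerate(x)]
```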
Note: there exist many other possible transformations. Some of them are available in the
transformations tool of XLSTAT-Pro (see the "Preparing data" section). Linear filters may also
be applied. Moving average smoothing methods, which are linear filters, are available in the
"Smoothing" tool of XLSTAT.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Time series: Select the data that correspond to the time series that you want to transform.
Date data: Activate this option if you want to select date or time data. These data must be
available either in the Excel date/time formats or in a numerical format. If this option is not
activated, XLSTAT creates its own time variable ranging from 1 to the number of data.
Check intervals: Activate this option so that XLSTAT checks that the spacing between
the date data is regular.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Series labels: Activate this option if the first row of the selected series includes a header.
Options tab:
Box-Cox: Activate this option to compute the Box-Cox transformation. You can either fix the
value of the Lambda parameter, or decide to let XLSTAT optimize it (see the description for
further details).
Differencing: Activate this option to compute differenced series. You need to enter the
differencing orders (d, D, s). See the description for further details.
Polynomial regression: Activate this option to detrend the time series. You need to enter the
polynomial degree. See the description for further details.
Deseasonalization: Activate this option to remove the seasonal components using a linear
model. You need to enter the period of the series. See the description for further details.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Replace by the average of the previous and next values: Activate this option to estimate
the missing data by the mean of the first preceding non-missing value and of the first following
non-missing value.
Outputs tab:
Descriptive statistics: Activate this option to display the descriptive statistics of the selected
series.
Charts tab:
Display charts: Activate this option to display the charts of the series before and after
transformation.
Results
Summary statistics: This table displays for the selected variables, the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation (unbiased).
Box-Cox transformation:
Estimates of the parameters of the model: This table is available only if the Lambda
parameter has been optimized. It displays the three parameters of the model, which are
Lambda, the Intercept of the model and slope coefficient.
Series before and after transformation: This table displays the series before and after
transformation. If Lambda has been optimized, the transformed series corresponds to the
residuals of the model. If it hasn't, then the transformed series is the direct application of the
Box-Cox transformation.
Differencing
Series before and after transformation: This table displays the series before transformation
and the differenced series. The first d + s·D data are not available in the transformed series
because of the lag introduced by the differencing itself.
Detrending (Polynomial regression)
Goodness of fit coefficients: This table displays the goodness of fit coefficients.
Estimates of the parameters of the model: This table displays the parameters of the model.
Series before and after transformation: This table displays the series before and after
transformation. The transformed series corresponds to the residuals of the model.
Deseasonalization
Goodness of fit coefficients: This table displays the goodness of fit coefficients.
Estimates of the parameters of the model: This table displays the parameters of the model.
The intercept is equal to the mean of the series before transformation.
Series before and after transformation: This table displays the series before and after
transformation. The transformed series corresponds to the residuals of the model.
Example
A tutorial explaining how to transform time series is available on the Addinsoft web site. To
consult the tutorial, please go to:
http://www.xlstat.com/demo-desc.htm
References
Box G. E. P. and Jenkins G. M. (1976). Time Series Analysis: Forecasting and Control.
Holden-Day, San Francisco.
Brockwell P.J. and Davis R.A. (1996). Introduction to Time Series and Forecasting. Springer
Verlag, New York.
Shumway R.H. and Stoffer D.S. (2000). Time Series Analysis and Its Applications. Springer
Verlag, New York.
Smoothing
Use this tool to smooth a time series and make predictions, using moving averages,
exponential smoothing, Fourier smoothing, Holt or Holt-Winters methods.
Description
Several smoothing methods are available. We define by {Yt}, (t = 1,...,n), the time series of
interest, by PtYt+h the predictor of Yt+h with minimum mean square error, and by εt a N(0,1)
white noise. The smoothing methods are described by the following equations:
Simple exponential smoothing:

Yt = β0 + εt
PtYt+h = β0, h = 1,2,...

St = αYt + (1 - α)St-1, with 0 < α < 2
Ŷt+h = PtYt+h = St, h = 1,2,...
Exponential smoothing is useful when one needs to model a value by simply taking into
account past observations. It is called "exponential" because the weight of past observations
decreases exponentially. This method is not very satisfactory in terms of prediction, as the
predictions are constant after n+1.
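A minimal sketch of the recursion, using the S1 = Y1 initialization (illustrative helper, not XLSTAT's code):

```python
def simple_exp_smoothing(y, alpha, horizon=3):
    """S_t = alpha*Y_t + (1 - alpha)*S_{t-1}, started with S_1 = Y_1 (the "Y(1)" option)."""
    s = y[0]
    for v in y[1:]:
        s = alpha * v + (1 - alpha) * s
    return [s] * horizon        # forecasts beyond n are all equal to S_n
```

Note how the returned forecasts are flat: this is exactly the limitation described above.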
Double exponential smoothing:

Yt = β0 + β1 t + εt
PtYt+h = β0 + β1(t + h), h = 1,2,...

St = αYt + (1 - α)St-1, with 0 < α < 2
Tt = αSt + (1 - α)Tt-1

Ŷt+h = PtYt+h = (2 + αh/(1 - α)) St - (1 + αh/(1 - α)) Tt, h = 1,2,...
This model is sometimes referred to as the Holt-Winters non seasonal algorithm. It allows you
to take into account a permanent component and a trend that varies with time. This model
adapts to the data more quickly than the double exponential smoothing and involves a second
parameter. The predictions for t>n take into account the permanent component and the trend
component. The equations of the model write:
Yt = β0 + β1 t + εt
PtYt+h = β0 + β1(t + h), h = 1,2,...

St = αYt + (1 - α)(St-1 + Tt-1), with 0 < α < 2
Tt = β(St - St-1) + (1 - β)Tt-1, with 0 < β < 4/α - 2

Ŷt+h = PtYt+h = St + hTt, h = 1,2,...
This method allows you to take into account a trend that varies with time and a seasonal
component with a period p. The predictions take into account the trend and the seasonality.
The model is called additive because the seasonality effect is stable and does not grow with
time. The equations of the model write:
Yt = β0 + β1 t + sp(t) + εt
PtYt+h = β0 + β1(t + h) + sp(t + h), h = 1,2,...

St = α(Yt - Dt-p) + (1 - α)(St-1 + Tt-1)
Tt = β(St - St-1) + (1 - β)Tt-1
Dt = γ(Yt - St) + (1 - γ)Dt-p

Ŷt+h = PtYt+h = St + hTt + Dt-p+h, h = 1,2,...
For the definition of the additive-invertible region please refer to Archibald (1990).
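The additive recursions can be sketched as follows; the initialization here is a simplified version of the Y(1→p) option, and all names are illustrative rather than XLSTAT's implementation:

```python
def holt_winters_additive(y, p, alpha, beta, gamma, horizon):
    """One pass of the additive Holt-Winters recursions (simplified initialization)."""
    S = y[p - 1]
    T = (y[p - 1] - y[0]) / max(p - 1, 1)
    D = [y[i] - (y[0] + T * i) for i in range(p)]       # one seasonal index per position
    for t in range(p, len(y)):
        S_prev = S
        S = alpha * (y[t] - D[t % p]) + (1 - alpha) * (S + T)
        T = beta * (S - S_prev) + (1 - beta) * T
        D[t % p] = gamma * (y[t] - S) + (1 - gamma) * D[t % p]
    # forecast at horizon h reuses the seasonal index of the matching position
    return [S + h * T + D[(len(y) + h - 1) % p] for h in range(1, horizon + 1)]
```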
This method allows you to take into account a trend that varies with time and a seasonal
component with a period p. The predictions take into account the trend and the seasonality.
The model is called multiplicative because the seasonality effect varies with time: the higher
the level of the series, the larger the seasonal fluctuations. The equations of the model write:
Yt = (β0 + β1 t) sp(t) + εt
PtYt+h = (β0 + β1(t + h)) sp(t + h), h = 1,2,...

St = α(Yt / Dt-p) + (1 - α)(St-1 + Tt-1)
Tt = β(St - St-1) + (1 - β)Tt-1
Dt = γ(Yt / St) + (1 - γ)Dt-p

Ŷt+h = PtYt+h = (St + hTt) Dt-p+h, h = 1,2,...
For the definition of the additive-invertible region please refer to Archibald (1990).
Note 1: for all the above models, XLSTAT estimates the values of the parameters that
minimize the mean square error (MSE). However, it is also possible to maximize the likelihood,
as, apart from the Holt-Winters multiplicative model, it is possible to write these models as
ARIMA models. For example, the simple exponential smoothing is equivalent to an
ARIMA(0,1,1) model, and the Holt-Winters additive model is equivalent to an
ARIMA(0,1,p+1)(0,1,0)p model. If you prefer to maximize the likelihood, we advise you to use
the ARIMA procedure of XLSTAT.
Note 2: for all the above models, initial values for S, T and D, are required. XLSTAT offers
several options, including backcasting to set these values. When backcasting is selected, the
algorithm reverses the series, starts with simple initial values corresponding to the Y(x) option,
then computes estimates and uses these estimates as initial values. The values corresponding
to the various options for each method are described hereunder:
Simple exponential smoothing:
Y(1): S1 = Y1
Mean(6): S1 = (1/6) Σi=1..6 Yi
Backcasting
Optimized

Double exponential smoothing and Holt's linear model:
Mean(6): S2 = (1/6) Σi=1..6 Yi, T1 = S2 - Y1
Backcasting

Holt-Winters seasonal additive model:
Y(1→p): S1..p = Ȳ1..p, T1..p = Ȳ1..p - Y1, Di = Yi - (Y1 + T1..p (i - 1)), i = 1,...,p
Backcasting

Holt-Winters seasonal multiplicative model:
Y(1→p): S1..p = Ȳ1..p, T1..p = Ȳ1..p - Y1, Di = Yi / (Y1 + T1..p (i - 1)), i = 1,...,p
Backcasting
Moving average
This model is a simple way to take past and, optionally, future observations into account to
predict values. It works as a filter that is able to remove noise. While with the smoothing
methods described above an observation influences all subsequent predictions (even if its
weight decays exponentially), in the case of the moving average the memory is limited to q. If
the constant l is set to zero, the prediction depends on the past q values and on the current
value; if l is set to one, it also depends on the next q values. Moving averages are often used
as filters rather than as a way to make accurate predictions. However, XLSTAT enables you to
make predictions based on the moving average model that writes:
Yt = μt + εt

μ̂t = ( Σi=-q..q·l wi Yt+i ) / ( Σi=-q..q·l wi )
where l is a constant which, when set to zero, makes the prediction depend on the q previous
values and on the current value. If l is set to one, the prediction also depends on the q next
values. The wi (i = -q,...,q·l) are the weights. Weights can be either constant, fixed by the
user, or based on existing optimal weights for a given application. XLSTAT allows you to use
the Spencer 15-points model that passes polynomials of degree 3 without distortion.
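A sketch of the weighted moving average with constant weights by default (illustrative helper; other weights, such as the Spencer ones, could be passed through the weights argument):

```python
def moving_average(x, q, l=1, weights=None):
    """Centered (l=1) or one-sided (l=0) weighted moving average of the series x."""
    w = weights or [1.0] * (q * l + q + 1)     # constant weights by default
    out = []
    for t in range(q, len(x) - q * l):
        vals = x[t - q: t + q * l + 1]         # the q past values, x[t], and q*l future values
        out.append(sum(wi * vi for wi, vi in zip(w, vals)) / sum(w))
    return out
```

The output is shorter than the input: q values are lost at the start, and q more at the end when l = 1.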
Fourier smoothing
The concept of the Fourier smoothing is to transform a time series into its Fourier coordinates,
then remove part of the higher frequencies, and then transform the coordinates back to a
signal. This new signal is a smoothed series.
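The idea can be sketched with a naive discrete Fourier transform in pure Python (O(n²), for illustration only; keep stands for the retained proportion of the spectrum):

```python
import cmath

def fourier_smooth(x, keep):
    """DFT the series, zero out the highest frequencies, inverse DFT."""
    n = len(x)
    X = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
         for k in range(n)]
    cutoff = keep * (n / 2)
    # frequency of coefficient k is min(k, n - k); zero everything above the cutoff
    X = [c if min(k, n - k) <= cutoff else 0.0 for k, c in enumerate(X)]
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]
```

A series made only of low frequencies passes through almost unchanged, while high-frequency noise is removed.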
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Time series: Select the data that correspond to the time series that you want to smooth.
Date data: Activate this option if you want to select date or time data. These data must be
available either in the Excel date/time formats or in a numerical format. If this option is not
activated, XLSTAT creates its own time variable ranging from 1 to the number of data.
Check intervals: Activate this option so that XLSTAT checks that the spacing between
the date data is regular.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Series labels: Activate this option if the first row of the selected series includes a header.
Model: Select the smoothing model you want to use (see description for more information on
the various models).
Options tab:
Method: Select the method for the selected model (see description for more information on the
various models).
Stop conditions:
Iterations: Enter the maximum number of iterations for the algorithm. The calculations
are stopped when the maximum number of iterations has been exceeded. Default value:
500.
Convergence: Enter the maximum value of the evolution in the convergence criterion
from one iteration to another which, when reached, means that the algorithm is
considered to have converged. Default value: 0.00001.
Confidence interval (%): The value you enter (between 1 and 99) is used to determine the
confidence intervals for the predicted values. Confidence intervals are automatically displayed
on the charts.
S1: Choose an estimation method for the initial values. See the description for more
information on that topic.
Depending on the model type, and on the method you have chosen, different options are
available in the dialog box. In the description section, you can find information on the various
models and on the corresponding parameters.
In the case of exponential or Holt-Winters models, you can decide to set the parameters to a
given value, or to optimize them. In the case of the Holt-Winters seasonal models, you need to
enter the value of the period.
In the case of the Fourier smoothing, you need to enter the proportion p of the spectrum that
needs to be kept after the high frequencies are removed.
For the moving average model, you need to specify the number q of time steps that must be
taken into account to compute the predicted value. You can decide to consider only the
previous q steps (the left part) of the series.
Validation tab:
Validation: Activate this option to use some data for the validation of the model.
Time steps: Enter the number of data at the end of the series that need to be used for the
validation.
Prediction tab:
Time steps: Enter the number of time steps for which you want XLSTAT to compute a
forecast.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Replace by the average of the previous and next values: Activate this option to estimate
the missing data by the mean of the first preceding non-missing value and of the first following
non-missing value.
Outputs tab:
Descriptive statistics: Activate this option to display the descriptive statistics of the selected
series.
Goodness of fit coefficients: Activate this option to display the goodness of fit statistics.
Model parameters: Activate this option to display the table of the model parameters.
Predictions and residuals: Activate this option to display the table of the predictions and the
residuals.
Charts tab:
Display charts: Activate this option to display the charts of the series before and after
smoothing, as well as the bar chart of the residuals.
Results
Goodness of fit coefficients: This table displays the goodness of fit coefficients, which
include the number of degrees of freedom (DF), the sum of squares of errors (SSE), the mean
square of errors (MSE), the root of the MSE (RMSE), the mean absolute percentage error
(MAPE), the mean percentage error (MPE), the mean absolute error (MAE) and the coefficient
of determination (R²). Note: all these statistics are computed only for the observations involved
in the estimation of the model; the validation data are not taken into account.
Model parameters: This table displays the estimates of the parameters and, if available, the
standard error of the estimates. Note: S1 corresponds to the first computed value of the S
series, and T1 to the first computed value of the T series. See the description for more
information.
Series before and after smoothing: This table displays the series before and after
smoothing. If some predictions have been computed (t>n), and if the confidence intervals
option has been activated, the confidence intervals are computed for the predictions.
Charts: The first chart displays the data, the model, and the predictions (validation + prediction
values) as well as the confidence intervals. The second chart corresponds to the bar chart of
the residuals.
Example
A tutorial explaining how to do forecasting with the Holt-Winters method is available on the
Addinsoft web site. To consult the tutorial, please go to:
http://www.xlstat.com/demo-hw.htm
References
Archibald B.C. (1990). Parameter space of the Holt-Winters' model. International Journal of
Forecasting, 6, 199-209.
Box G. E. P. and Jenkins G. M. (1976). Time Series Analysis: Forecasting and control.
Holden-Day, San Francisco.
Brockwell P.J. and Davis R.A. (1996). Introduction to Time Series and Forecasting. Springer
Verlag, New York.
Brown R.G. (1962). Smoothing, Forecasting and Prediction of Discrete Time Series. Prentice-
Hall, New York.
Brown R.G. and Meyer R.F. (1961). The fundamental theorem of exponential smoothing.
Operations Research, 9, 673-685.
Chatfield, C. (1978). The Holt-Winters forecasting procedure. Applied Statistics, 27, 264-279.
Holt C.C. (1957). Forecasting seasonals and trends by exponentially weighted moving
averages. ONR Research Memorandum 52, Carnegie Institute of Technology, Pittsburgh.
Makridakis S.G., Wheelwright S.C. and Hyndman R.J. (1997). Forecasting : Methods and
Applications. John Wiley & Sons, New York.
Shumway R.H. and Stoffer D.S. (2000). Time Series Analysis and Its Applications. Springer
Verlag, New York.
ARIMA
Use this tool to fit an ARMA (Autoregressive Moving Average), an ARIMA (Autoregressive
Integrated Moving Average) or a SARIMA (Seasonal Autoregressive Integrated Moving
Average) model, and to compute forecasts using the model, whose parameters are either
known or to be estimated.
Description
The models of the ARIMA family allow you to represent, in a compact way, phenomena that
vary with time, and to predict future values with a confidence interval around the predictions.
The mathematical writing of the ARIMA models differs from one author to another; the
differences mostly concern the signs of the coefficients. XLSTAT uses the most commonly
found writing, used by most software.
If we define by {Xt} a series with mean μ, then if the series is supposed to follow an
ARIMA(p,d,q)(P,D,Q)s model, we can write:

Yt = (1 - B)^d (1 - B^s)^D Xt

φ(B) Φ(B^s) Yt = θ(B) Θ(B^s) Zt,  with Zt ~ N(0, σ²)

with

φ(z) = 1 - Σi=1..p φi z^i,  Φ(z) = 1 - Σi=1..P Φi z^i

θ(z) = 1 + Σi=1..q θi z^i,  Θ(z) = 1 + Σi=1..Q Θi z^i
s is the period of the model (for example 12 if the data are monthly data, and if one noticed a
yearly cyclicity in the data).
Remark 1: the {Yt} process is causal if and only if, for any z such that |z| ≤ 1, φ(z) ≠ 0 and
Φ(z) ≠ 0.
Remark 2: if D=0, the model is an ARIMA(p,d,q) model. In that case, P, Q and s are
considered as null.
Remark 3: if d=0, D=0 and q=0, the model simplifies to an AR(p) model.
Remark 4: if d=0, D=0 and p=0, the model simplifies to an MA(q) model.
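As a toy illustration of how forecasts follow from the autoregressive recursion once the coefficients are known (not XLSTAT's estimation code; names are illustrative):

```python
def ar_forecast(x, phi, mu, horizon):
    """h-step forecasts of an AR(p) model with known coefficients phi and mean mu:
    X_{t+1} - mu = sum_i phi_i * (X_{t+1-i} - mu)."""
    hist = list(x)
    out = []
    for _ in range(horizon):
        pred = mu + sum(phi[i] * (hist[-1 - i] - mu) for i in range(len(phi)))
        hist.append(pred)           # each forecast feeds the next step
        out.append(pred)
    return out
```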
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Time series: Select the data that correspond to the time series for which you want to fit the
model and compute forecasts.
Date data: Activate this option if you want to select date or time data. These data must be
available either in the Excel date/time formats or in a numerical format. If this option is not
activated, XLSTAT creates its own time variable ranging from 1 to the number of data.
Check intervals: Activate this option so that XLSTAT checks that the spacing between
the date data is regular.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Series labels: Activate this option if the first row of the selected series includes a header.
p: Enter the order of the autoregressive part of the model. For example, enter 1 for an
AR(1) model or for an ARMA(1,2) model.
d: Enter the differencing order of the model. For example, enter 1 for an ARIMA(0,1,2)
model.
q: Enter the order of the moving average part of the model. For example, enter 2 for an
MA(2) model or for an ARIMA(1,1,2) model.
P: Enter the order of the autoregressive seasonal part of the model. For example, enter
1 for an ARIMA(1,1,0)(1,1,0) model. You can modify this value only if D ≠ 0. If D=0,
XLSTAT considers that P=0.
D: Enter the differencing order for the seasonal part of the model. For example, enter 1
for an ARIMA(0,1,1)(0,1,1) model.
Q: Enter the order of the moving average seasonal part of the model. For example,
enter 1 for an ARIMA(0,1,1)(0,1,1) model. You can modify this value only if D ≠ 0. If
D=0, XLSTAT considers that Q=0.
s: Enter the period of the model. You can modify this value only if D ≠ 0. If D=0, XLSTAT
considers that s=0.
Options tab:
Preliminary estimation: Activate this option if you want to use a preliminary estimation
method. This option is available only if D=0.
Burg: Activate this option to estimate the coefficients of the autoregressive AR(p) model
using Burg's algorithm.
Innovations: Activate this option to estimate the coefficients of the moving average
MA(q) model using the Innovations algorithm.
Initial coefficients: Activate this option to select the initial values of the coefficients of the
model.
Phi: Select here the value of the coefficients corresponding to the autoregressive part of
the model (including the seasonal part). The number of values to select is equal to p+P.
Theta: Select here the value of the coefficients corresponding to the moving average
part of the model (including the seasonal part). The number of values to select is equal
to q+Q.
Optimize: Activate this option to estimate the coefficients using one of the two available
methods:
Likelihood: Activate this option to maximize the likelihood of the parameters given the
data.
Least squares: Activate this option to minimize the sum of squares of the residuals.
Stop conditions:
Iterations: Enter the maximum number of iterations for the algorithm. The calculations
are stopped when the maximum number of iterations has been exceeded. Default value:
500.
Convergence: Enter the maximum value of the evolution in the convergence criterion
from one iteration to another which, when reached, means that the algorithm is
considered to have converged. Default value: 0.00001.
Confidence interval (%): The value you enter (between 1 and 99) is used to determine the
confidence intervals for the predicted values. Confidence intervals are automatically displayed
on the charts.
Validation tab:
Validation: Activate this option to use some data for the validation of the model.
Time steps: Enter the number of data at the end of the series that need to be used for the
validation.
Prediction tab:
Time steps: Enter the number of time steps for which you want XLSTAT to compute a
forecast.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Replace by the average of the previous and next values: Activate this option to estimate
the missing data by the mean of the first preceding non-missing value and of the first following
non-missing value.
Outputs tab:
Descriptive statistics: Activate this option to display the descriptive statistics of the selected
series.
Goodness of fit coefficients: Activate this option to display the goodness of fit statistics.
Model parameters: Activate this option to display the table of the model parameters.
Predictions and residuals: Activate this option to display the table of the predictions and the
residuals.
Charts tab:
Display charts: Activate this option to display the chart showing the input data together with
the model predictions, as well as the bar chart of the residuals.
Results
Summary statistics: This table displays for the selected variables, the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation (unbiased).
If a preliminary estimation and an optimization have been requested, the results for the
preliminary estimation are displayed first, followed by the results after the optimization. If initial
coefficients have been entered, the results corresponding to these coefficients are displayed
first.
Observations: The number of data used for the fitting of the model.
SSE: Sum of Squares of Errors. This statistic is minimized if the "Least Squares" option
has been selected for the optimization.
WN variance: The white noise variance is equal to the SSE divided by N. In some
software, this statistic is named sigma2 (sigma-square).
WN variance estimate: This statistic is usually equal to the previous one. In the case of
a preliminary estimation using the Yule-Walker or Burg's algorithms, a slightly different
estimate is displayed.
-2Log(Like.): This statistic is minimized if the "Likelihood" option has been selected for
the optimization. It is equal to minus two times the natural logarithm of the likelihood.
FPE: Akaike's Final Prediction Error. This criterion is adapted to autoregressive models.
AICC: This criterion has been suggested by Brockwell (Akaike Information Criterion
Corrected).
Model parameters:
Constant: the constant is null for the models that do not have an autoregressive component.
For the models that include an autoregressive part, the constant equals μ φ(1) Φ(1), the mean
multiplied by the autoregressive polynomials evaluated at 1. The constant is also null if the
"Center" option is not activated.
The following table gives the estimator for each coefficient of each polynomial, as well as the
standard deviation obtained either directly from the estimation method (preliminary estimation)
or from the Fisher information matrix (Hessian). The asymptotic standard deviations are
also computed. For each coefficient and each standard deviation, a confidence interval is
displayed. The coefficients are identified as follows:
AR(i): coefficient that corresponds to the order i coefficient of the φ(z) polynomial.
SAR(i): coefficient that corresponds to the order i coefficient of the Φ(z) polynomial.
MA(i): coefficient that corresponds to the order i coefficient of the θ(z) polynomial.
SMA(i): coefficient that corresponds to the order i coefficient of the Θ(z) polynomial.
Data, Predictions and Residuals: This table displays the data, the corresponding predictions
computed with the model, and the residuals. If the user requested it, predictions are computed
for the validation data and forecasts for future values. Standard deviations and confidence
intervals are computed for validation predictions and forecasts.
Charts: Two charts are displayed. The first chart displays the data, the corresponding values
predicted by the model, and the predictions corresponding to the values for the validation
and/or prediction time steps. The second chart corresponds to the bar chart of residuals.
Example
A tutorial explaining how to fit an ARIMA model and use it for forecasting is
available on the Addinsoft web site. To consult the tutorial, please go to:
http://www.xlstat.com/demo-arima.htm
References
Box G. E. P. and Jenkins G. M. (1984). Time Series Analysis: Forecasting and Control, 3rd
Edition.
Brockwell P.J. and Davis R.A. (2002). Introduction to Time Series and Forecasting, 2nd
Edition. Springer Verlag, New York.
Fuller W.A. (1996). Introduction to Statistical Time Series, Second Edition. John Wiley & Sons,
New York.
Mélard G. (1984). Algorithm AS197: a fast algorithm for the exact likelihood of autoregressive-
moving average models. Journal of the Royal Statistical Society, Series C, Applied Statistics,
33, 104-114.
Spectral analysis
Use this tool to transform a time series into its coordinates in the space of frequencies, and
then to analyze its characteristics in this space.
Description
This tool allows you to transform a time series into its coordinates in the space of frequencies,
and then to analyze its characteristics in this space. From the coordinates we can extract the
magnitude and the phase, build representations such as the periodogram and the spectral density,
and test whether the series is stationary. By looking at the spectral density, we can identify
seasonal components, and decide to what extent we should filter noise. Spectral analysis is a
very general method used in a variety of domains.
The spectral representation of a time series {Xt}, (t=1,…,n), decomposes {Xt} into a sum of
sinusoidal components with uncorrelated random coefficients. From there we can obtain a
decomposition of the autocovariance and autocorrelation functions into sinusoids.
The spectral density corresponds to the transform of a continuous time series. However, we
usually have only access to a limited number of equally spaced data, and therefore, we need
to obtain first the discrete Fourier coordinates (cosine and sine transforms), and then the
periodogram. From the periodogram, using a smoothing function, we can obtain a spectral
density estimate which is a better estimator of the spectrum.
Using fast and powerful methods, XLSTAT automatically computes the Fourier cosine and sine
transforms of {Xt}, for each Fourier frequency, and then the various functions that derive from
these transforms.
With n being the sample size, and [i] being the largest integer less than or equal to i, the
Fourier frequencies write:

ω_k = 2πk/n, k = -[(n-1)/2], ..., [n/2]

The Fourier cosine transform of {Xt} writes:

a_k = (2/n) Σ_{t=1..n} X_t cos(ω_k (t-1))

The Fourier sine transform of {Xt} writes:

b_k = (2/n) Σ_{t=1..n} X_t sin(ω_k (t-1))

The periodogram of {Xt} writes:

I_k = (n/2) (a_k² + b_k²)
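As a sketch of the transforms above, the cosine and sine coefficients and the periodogram can be computed directly with numpy (the function name `periodogram` is illustrative, not XLSTAT code):

```python
import numpy as np

def periodogram(x):
    """Fourier cosine/sine coefficients a_k, b_k and periodogram I_k of a
    series, following the formulas above (a sketch, not XLSTAT's own code)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(n)                      # plays the role of (t - 1)
    ks = np.arange(1, n // 2 + 1)         # positive Fourier frequencies
    a, b, I = [], [], []
    for k in ks:
        w = 2 * np.pi * k / n             # Fourier frequency omega_k
        ak = 2.0 / n * np.sum(x * np.cos(w * t))
        bk = 2.0 / n * np.sum(x * np.sin(w * t))
        a.append(ak)
        b.append(bk)
        I.append(n / 2.0 * (ak ** 2 + bk ** 2))
    return np.array(a), np.array(b), np.array(I)
```

For a pure cosine at an exact Fourier frequency, the periodogram concentrates all its mass at that frequency, which is a convenient sanity check.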
The spectral density estimate (or discrete spectral average estimator) of the time series {Xt}
writes:

f_k = Σ_{i=-p..p} w_i J_{k+i}

with J_{k+i} = I_{k+i}, if 0 ≤ k+i ≤ n
J_{k+i} = I_{-(k+i)}, if k+i < 0
J_{k+i} = I_{n-(k+i)}, if k+i > n
where p, the bandwidth, and wi, the weights, are either fixed by the user, or determined by the
choice of a kernel. XLSTAT suggests the use of the following kernels:
For each kernel, the bandwidth writes p = c·n^e, and the weights write w_i = w(i/p), where
w(x) is the kernel weight function:

Bartlett: c = 1/2, e = 1/3
w(x) = 1 - |x| if |x| ≤ 1
w(x) = 0 otherwise

Parzen: c = 1, e = 1/5
w(x) = 1 - 6x² + 6|x|³ if |x| ≤ 1/2
w(x) = 2(1 - |x|)³ if 1/2 < |x| ≤ 1
w(x) = 0 otherwise

Quadratic spectral: c = 1/2, e = 1/5
w(x) = [25 / (12π²x²)] [sin(6πx/5) / (6πx/5) - cos(6πx/5)]

Tukey-Hanning: c = 2/3, e = 1/5
w(x) = (1 + cos(πx)) / 2 if |x| ≤ 1
w(x) = 0 otherwise

Truncated: c = 1/4, e = 1/5
w(x) = 1 if |x| ≤ 1
w(x) = 0 otherwise
Note: the bandwidth p is a function of n, the size of the sample. The weights wi must be
positive and must sum to one. If they don't, XLSTAT automatically rescales them.
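The kernel-based smoothing described above can be sketched as follows, here for the Bartlett kernel; the bandwidth rounding and the boundary reflection are assumptions for illustration, not XLSTAT's exact implementation:

```python
import numpy as np

def bartlett_weights(n, c=0.5, e=1/3):
    """Bartlett kernel weights on i = -p..p with bandwidth p = c * n**e,
    rescaled so they sum to one (a sketch of the scheme described above)."""
    p = max(1, int(c * n ** e))           # rounding convention is an assumption
    i = np.arange(-p, p + 1)
    w = np.maximum(0.0, 1.0 - np.abs(i) / p)   # w(x) = 1 - |x| for |x| <= 1
    return i, w / w.sum()                      # rescale to sum to one

def smooth_periodogram(I, weights):
    """Discrete spectral average f_k = sum_i w_i * J_{k+i}; out-of-range
    ordinates are obtained by reflection, an assumption close to the
    boundary rules stated above."""
    offsets, w = weights
    n = len(I)
    f = np.zeros(n)
    for k in range(n):
        for i, wi in zip(offsets, w):
            j = k + i
            if j < 0:
                j = -j                     # reflect at the lower end
            if j >= n:
                j = 2 * (n - 1) - j        # reflect at the upper end
            f[k] += wi * I[j]
    return f
```

Smoothing a flat periodogram returns the same flat curve, since the weights sum to one.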
If a second time series {Yt} is available, several additional functions can be computed to
estimate the cross-spectrum:

The real part of the cross-periodogram of {Xt} and {Yt} writes:

Real_k = (n/2) (a_{X,k} a_{Y,k} + b_{X,k} b_{Y,k})

The imaginary part of the cross-periodogram of {Xt} and {Yt} writes:

Imag_k = (n/2) (a_{X,k} b_{Y,k} - b_{X,k} a_{Y,k})

The cospectrum estimate (real part of the cross-spectrum) of the time series {Xt} and {Yt}
writes:

C_k = Σ_{i=-p..p} w_i R_{k+i}

with R_{k+i} = Real_{k+i}, if 0 ≤ k+i ≤ n
R_{k+i} = Real_{-(k+i)}, if k+i < 0
R_{k+i} = Real_{n-(k+i)}, if k+i > n

The quadrature spectrum estimate (imaginary part of the cross-spectrum) of the time series
{Xt} and {Yt} writes:

Q_k = Σ_{i=-p..p} w_i H_{k+i}

with H_{k+i} = Imag_{k+i}, if 0 ≤ k+i ≤ n
H_{k+i} = Imag_{-(k+i)}, if k+i < 0
H_{k+i} = Imag_{n-(k+i)}, if k+i > n

The amplitude of the cross-spectrum of {Xt} and {Yt} writes:

A_k = √(C_k² + Q_k²)

The squared coherency estimate between the {Xt} and {Yt} series writes:

K_k = A_k² / (f_{X,k} f_{Y,k})
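The cross-periodogram formulas can be sketched in the same style (an unsmoothed illustration; the function names are ours, not XLSTAT's):

```python
import numpy as np

def fourier_coeffs(x):
    """Cosine/sine transforms a_k, b_k at the positive Fourier frequencies."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(n)
    ks = np.arange(1, n // 2 + 1)
    a = np.array([2 / n * np.sum(x * np.cos(2 * np.pi * k / n * t)) for k in ks])
    b = np.array([2 / n * np.sum(x * np.sin(2 * np.pi * k / n * t)) for k in ks])
    return a, b

def cross_periodogram(x, y):
    """Real and imaginary parts of the cross-periodogram of {Xt} and {Yt},
    following the formulas above (unsmoothed sketch)."""
    n = len(x)
    ax, bx = fourier_coeffs(x)
    ay, by = fourier_coeffs(y)
    real = n / 2 * (ax * ay + bx * by)
    imag = n / 2 * (ax * by - bx * ay)
    return real, imag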
White noise tests: XLSTAT optionally displays two test statistics and the corresponding p-
values for white noise: Fisher's Kappa and Bartlett's Kolmogorov-Smirnov statistic.
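A minimal sketch of the two statistics, assuming the conventional definitions (Fisher's Kappa as the largest periodogram ordinate over the mean ordinate; Bartlett's statistic as the largest gap between the normalized cumulative periodogram and the uniform line); the p-value computations XLSTAT performs are not reproduced here:

```python
import numpy as np

def white_noise_stats(I):
    """Fisher's Kappa and Bartlett's Kolmogorov-Smirnov statistics computed
    from the periodogram ordinates I (a sketch; p-values omitted)."""
    I = np.asarray(I, dtype=float)
    q = len(I)
    kappa = I.max() / I.mean()              # Fisher's Kappa
    # normalized cumulative periodogram vs. the uniform c.d.f.
    F = np.cumsum(I) / I.sum()
    uniform = np.arange(1, q + 1) / q
    ks = np.abs(F - uniform).max()          # Bartlett's K-S statistic
    return kappa, ks
```

Under white noise the ordinates are roughly equal, so Kappa stays near 1 and the K-S gap stays small; a strong periodic component inflates both.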
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Times series: Select the data that correspond to the time series for which you want to
compute the various spectral functions.
Date data: Activate this option if you want to select date or time data. These data must be
available either in the Excel date/time formats or in a numerical format. If this option is not
activated, XLSTAT creates its own time variable ranging from 1 to the number of data.
Check intervals: Activate this option so that XLSTAT checks that the spacing between
the date data is regular.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Series labels: Activate this option if the first row of the selected series includes a header.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Replace by the average of the previous and next values: Activate this option to estimate
the missing data by the mean of the first preceding non-missing value and of the first following
non-missing value.
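The replacement rule can be sketched in a few lines of Python (a hypothetical helper, not part of XLSTAT; gaps with no non-missing neighbor on one side are left missing):

```python
def fill_missing(series):
    """Replace each None by the mean of the nearest non-missing values
    before and after it, as the option above describes (a sketch)."""
    out = list(series)
    for i, v in enumerate(series):
        if v is None:
            # first preceding and first following non-missing original values
            prev = next((series[j] for j in range(i - 1, -1, -1)
                         if series[j] is not None), None)
            nxt = next((series[j] for j in range(i + 1, len(series))
                        if series[j] is not None), None)
            if prev is not None and nxt is not None:
                out[i] = (prev + nxt) / 2
    return out
```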
White noise tests: Activate this option if you want to display the results of the white noise
tests.
Cosine part: Activate this option if you want to display the Fourier cosine coefficients.
Sine part: Activate this option if you want to display the Fourier sine coefficients.
Amplitude: Activate this option if you want to display the amplitude of the spectrum.
Phase: Activate this option if you want to display the phase of the spectrum.
Spectral density: Activate this option if you want to display the estimate of spectral density.
Kernel weighting: Select the type of kernel. The kernel functions are described in the
description section.
o e: Enter the value of the e parameter. This parameter is described in the
description section.
Fixed weighting: Select on an Excel sheet the values of the fixed weights. The number
of weights must be odd. Symmetric weights are recommended (Example: 1,2,3,2,1).
Cross-spectrum: Activate this option to analyze the cross-spectra. The computations are only
done if at least two series have been selected.
Real part: Activate this option to display the real part of the cross-spectrum.
Imaginary part: Activate this option to display the imaginary part of the cross-spectrum.
Cospectrum: Activate this option to display the cospectrum estimate (real part of the
cross-spectrum).
Quadrature spectrum: Activate this option to display the quadrature spectrum estimate
(imaginary part of the cross-spectrum).
Charts tab:
Spectral density: Activate this option to display the chart of the spectral density.
Results
White noise tests: This table displays the Fisher's Kappa and Bartlett's Kolmogorov-Smirnov
statistics and the corresponding p-values. If the p-values are lower than the significance level
(typically 0.05), then you need to reject the assumption that the time series is just white
noise.
A table is displayed for each selected time series. It displays various columns:
Sine part: the sine coefficients of the Fourier transform
Charts: XLSTAT displays the periodogram and the spectral density charts on both the
frequency and period scales.
If two series or more have been selected, and if the cross-spectrum options have been
selected, XLSTAT displays additional information:
Charts: XLSTAT displays the amplitude of the estimate of the cross-spectrum on both the
frequency and period scales.
Example
An example of Spectral analysis is available on the Addinsoft web site. To consult the tutorial,
please go to:
http://www.xlstat.com/demo-spectral.htm
References
Brockwell P.J. and Davis R.A. (1996). Introduction to Time Series and Forecasting. Springer
Verlag, New York.
Davis H.T. (1941). The Analysis of Economic Time Series. Principia Press, Bloomington.
Chiu S-T (1989). Detecting periodic components in a white Gaussian time series. Journal of
the Royal Statistical Society, Series B, 51, 249-260.
Fuller W.A. (1996). Introduction to Statistical Time Series, Second Edition. John Wiley & Sons,
New York.
Nussbaumer H.J. (1982). Fast Fourier Transform and Convolution Algorithms, Second
Edition. Springer-Verlag, New York.
Shumway R.H. and Stoffer D.S. (2000). Time Series Analysis and Its Applications. Springer
Verlag, New York.
Fourier transformation
Use this tool to transform a time series or a signal to its Fourier coordinates, or to do the
inverse transformation.
Description
Use this tool to transform a time series or a signal to its Fourier coordinates, or to do the
inverse transformation. While the Excel function is limited to powers of two for the length of the
time series, XLSTAT is not restricted. Outputs optionally include the amplitude and the phase.
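For illustration, numpy's FFT behaves the same way: any series length is accepted, and amplitude and phase derive from the complex coordinates (a sketch, not the XLSTAT implementation):

```python
import numpy as np

# Length 6 is not a power of two; the FFT still applies directly.
signal = np.array([1.0, 2.0, 0.5, -1.0, 0.25, 3.0])
coords = np.fft.fft(signal)            # complex Fourier coordinates
amplitude = np.abs(coords)             # amplitude of the spectrum
phase = np.angle(coords)               # phase of the spectrum
recovered = np.fft.ifft(coords).real   # inverse transformation
```

The inverse transform recovers the original signal up to floating-point error.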
Dialog box
: Click this button to close the dialog box without doing any computation.
Real part: Activate this option and then select the signal to transform, or the real part of the
Fourier coordinates for an inverse transformation.
Imaginary part: Activate this option and then select the imaginary part of the Fourier
coordinates for an inverse transformation.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column labels: Activate this option if the first row of the data selections (real part, imaginary
part) includes a header.
Inverse transformation: Activate this option if you want to compute the inverse Fourier
transform.
Amplitude: Activate this option if you want to compute and display the amplitude of the
spectrum.
Phase: Activate this option if you want to compute and display the phase of the spectrum.
Results
Real part: This column contains the real part after the transform or the inverse transform.
Imaginary part: This column contains the imaginary part after the transform or the inverse
transform.
References
Fuller W.A. (1996). Introduction to Statistical Time Series, Second Edition. John Wiley & Sons,
New York.
XLSTAT-Sim
XLSTAT-Sim is an easy to use and powerful solution to create and run simulation models.
Introduction
XLSTAT-Sim is a module that allows you to build and compute simulation models, an innovative
method for estimating variables whose exact value is not known, but that can be estimated by
means of repeated simulation of random variables that follow certain theoretical laws. Before
running the model, you need to create the model, defining a series of input and output (or
result) variables.
Simulation models
Simulation models allow you to obtain information, such as the mean or the median, on variables
that do not have an exact value, but for which we can know, assume or compute a distribution. If
some result variables depend on these distributed variables by way of known or assumed
formulae, then the result variables will also have a distribution. XLSTAT-Sim allows you to
define the distributions, and then obtain through simulations an empirical distribution of the
input and output variables as well as the corresponding statistics.
Simulation models are used in many areas such as finance and insurance, medicine, oil and
gas prospecting, accounting, or sales prediction.
- Distributions are associated with random variables. XLSTAT gives a choice of more than 20
distributions to describe the uncertainty on the values that a variable can take (see the chapter
Define a distribution for more details). For example, you can choose a triangular distribution if
you have a quantity that you know can vary between two bounds, but with a value that
is more likely (a mode). At each iteration of the computation of the simulation model, a random
draw is performed from each distribution that has been defined.
- Scenario variables allow you to include in the simulation model a quantity that is fixed in the
model, except during the tornado analysis where it can vary between two bounds (see the
chapter Define a scenario variable for more details, and the section on tornado analysis
below).
- Result variables correspond to outputs of the model. They depend either directly or
indirectly, through one or more Excel formulae, on the random variables to which distributions
have been associated and, if available, on the scenario variables. The goal of computing the
simulation model is to obtain the distribution of the result variables (see the chapter Define a
result variable for more details).
- Statistics allow you to track a given statistic of a result variable. For example, we might want
to monitor the standard deviation of a result variable (see the chapter Define a statistic for more
details).
A correct model should comprise at least one distribution and one result. Models can contain
any number of these four elements.
A model can be limited to a single Excel sheet or can use a whole Excel workbook.
Simulation models can take into account the dependencies between the input variables
described by distributions. If you know that two variables are usually related such that the
correlation coefficient between them is 0.4, then you want the sampled values for both variables
to show the same property when you run the simulations. This is possible in XLSTAT-Sim by
entering, in the Run dialog box, the correlation or covariance matrix between some or all of the
input random variables used in the model.
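One standard way to induce such a correlation is sketched below, using a Cholesky factor on independent normal draws; whether XLSTAT-Sim uses this or a rank-based method is not stated in this manual, so treat it as an illustrative assumption:

```python
import numpy as np

def correlated_normals(n_draws, corr, seed=0):
    """Draw joint samples whose correlation matrix approximates `corr`,
    by multiplying independent standard normals by a Cholesky factor.
    A sketch only, not XLSTAT-Sim's documented method."""
    corr = np.asarray(corr, dtype=float)
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, corr.shape[0]))
    L = np.linalg.cholesky(corr)    # corr must be positive definite
    return z @ L.T                  # rows are joint draws
```

With a target correlation of 0.4, the empirical correlation of a large sample lands close to 0.4.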
Outputs
When you run the model, a series of results is displayed. While giving critical statistics
such as information on the distribution of the input and result variables, it also allows
interpreting relationships between variables. Sensitivity analysis is also available if scenario
variables have been included.
Descriptive statistics:
The report that is generated after the simulation contains information on the distributions of the
model. The user may choose from a range of descriptive statistics the most important
indicators that should be integrated into the report in order to easily interpret the results. A
selection of charts is also available to graphically display the relationships.
Details and formulae relative to the descriptive statistics are available in the description section
of the Descriptive statistics tool of XLSTAT.
Charts:
Lower limit: Linf = X(i) such that {X(i) - [Q1 - 1.5 (Q3 - Q1)]} is minimum and
X(i) ≥ Q1 - 1.5 (Q3 - Q1).
Upper limit: Lsup = X(i) such that {X(i) - [Q3 + 1.5 (Q3 - Q1)]} is minimum and
X(i) ≤ Q3 + 1.5 (Q3 - Q1).
Values that are outside the ]Q1 - 3 (Q3 - Q1); Q3 + 3 (Q3 - Q1)[ interval are displayed
with the * symbol. Values that are in the [Q1 - 3 (Q3 - Q1); Q1 - 1.5 (Q3 - Q1)] or the
[Q3 + 1.5 (Q3 - Q1); Q3 + 3 (Q3 - Q1)] intervals are displayed with the o symbol.
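The whisker limits and outlier symbols above can be sketched as follows (the quartile interpolation convention may differ from XLSTAT's):

```python
import numpy as np

def box_limits(values):
    """Whisker limits Linf/Lsup and outlier groups following the rules
    above: '*' beyond 3*IQR, 'o' between 1.5*IQR and 3*IQR (a sketch)."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # closest data points inside the 1.5*IQR fences
    linf = x[x >= lo_fence].min()
    lsup = x[x <= hi_fence].max()
    stars = x[(x < q1 - 3 * iqr) | (x > q3 + 3 * iqr)]       # extreme ('*')
    circles = x[((x >= q1 - 3 * iqr) & (x < lo_fence)) |
                ((x > hi_fence) & (x <= q3 + 3 * iqr))]      # mild ('o')
    return linf, lsup, stars, circles
```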
P-P Charts (normal distribution): P-P charts (for Probability-Probability) are used to
compare the empirical distribution function of a sample with that of a normal variable with
the same mean and standard deviation. If the sample follows a normal distribution, the data
will lie along the first bisector of the plot.
Q-Q Charts (normal distribution): Q-Q charts (for Quantile-Quantile) are used to
compare the quantiles of the sample with those of a normal variable with the same mean
and standard deviation. If the sample follows a normal distribution, the data will lie along the
first bisector of the plot.
Correlations:
Once the computations are over, the simulation report may contain information on the
correlations between the different variables included in the simulation model. Three different
correlation coefficients are available:
- Pearson correlation coefficient: this is the classical linear correlation coefficient. The
p-values computed for each coefficient allow testing the null hypothesis that the coefficients
are not significantly different from 0. However, one needs to be cautious when interpreting
these results: if two variables are independent, their correlation coefficient is zero, but the
reciprocal is not true.
- Spearman correlation coefficient (rho): This coefficient is based on the ranks of the
observations and not on their value. This coefficient is adapted to ordinal data. As for the
Pearson correlation, one can interpret this coefficient in terms of variability explained, but
here we mean the variability of the ranks.
- Kendall correlation coefficient (tau): As for the Spearman coefficient, it is well suited for
ordinal variables as it is also based on ranks. However, this coefficient is conceptually very
different. It can be interpreted in terms of probability: it is the difference between the
probabilities that the variables vary in the same direction and the probabilities that the
variables vary in the opposite direction. When the number of observations is lower than 50
and when there are no ties, XLSTAT gives the exact p-value. If not, an approximation is
used. The latter is known to be reliable when there are more than 8 observations.
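The two rank-based coefficients can be sketched with small no-ties implementations (illustrative code, not XLSTAT's; p-values are omitted):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks (no-ties sketch)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def kendall_tau(x, y):
    """Kendall's tau: concordant minus discordant pairs over all pairs,
    matching the probability interpretation above (no-ties sketch)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign((x[i] - x[j]) * (y[i] - y[j]))
    return 2.0 * s / (n * (n - 1))
```

Both coefficients reach +1 on any increasing monotone relation and -1 on a decreasing one, regardless of linearity.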
Sensitivity analysis:
The sensitivity analysis displays information about the impact of the different input variables for
one output variable. Based on the simulation results and on the correlation coefficient that has
been chosen (see above), the correlations between the input random variables and the result
variables are calculated and displayed in a declining order of impact on the result variable.
Tornado and spider analyses are not based on the iterations of the simulation but on a point by
point analysis of all the input variables (random variables with distributions and scenario
variables).
During the tornado analysis, for each result variable, each input random variable and each
scenario variable are studied one by one. We make their value vary between two bounds and
record the value of the result variable in order to know how each random and scenario variable
impacts the result variables. For a random variable, the values explored can either be around
the median or around the default cell value, with bounds defined by percentiles or deviation.
For a scenario variable, the analysis is performed between two bounds specified when defining
the variables. The number of points is an option that can be modified by the user before
running the simulation model.
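A one-at-a-time tornado analysis as described above can be sketched like this; the `model` callable and the input-specification dictionary are illustrative assumptions, not XLSTAT-Sim's API:

```python
import numpy as np

def tornado(model, inputs, n_points=5):
    """Vary each input between its two bounds while holding the others at
    their default values, and record the swing of the result variable
    (a sketch of the scheme above)."""
    defaults = {name: spec["default"] for name, spec in inputs.items()}
    swings = {}
    for name, spec in inputs.items():
        values = np.linspace(spec["low"], spec["high"], n_points)
        results = []
        for v in values:
            point = dict(defaults)
            point[name] = v               # only this input moves
            results.append(model(**point))
        swings[name] = max(results) - min(results)
    # rank inputs by decreasing impact, as in a tornado chart
    return sorted(swings.items(), key=lambda kv: -kv[1])
```

The spider analysis would keep the whole `results` list per input instead of only its swing, which is what reveals non-monotonic dependence.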
The spider analysis does not only display the maximum and minimum change of the result
variable, but also the value of the result variable for each data point of the random and
scenario variables. This is useful to check whether the dependence between distribution
variables and result variables is monotonic or not.
Toolbar
Click this icon to define a new distribution (see Define a distribution for more details).
Click this icon to define a new scenario variable (see Define a scenario variable for more
details).
Click this icon to define a new result (see Define a result variable for more details).
Click this icon to define a new statistic (see Define a statistic for more details).
Click this icon to reinitialize the simulation model and do a first simulation iteration.
Click this icon to export the simulation model. All XLSTAT-Sim functions are transformed
to comments. The formulae in the cells are stored as cell comments and the formulae are
either replaced by the default value or by the formula linking to other cells in the case of
XLSTAT_SimRes.
Click this icon to import the simulation model. All XLSTAT-Sim functions are extracted
from cell comments and exported as formulae in the corresponding cells.
Options
To display the options dialog box, click the button of the XLSTAT-SIM toolbar. Use this
dialog box to define the general options of the XLSTAT-SIM module.
General tab:
Model limited to: This option allows defining the size of the active simulation model. If
possible, limit your model to a single Excel sheet. The following options are available:
Sheet: Only the simulation functions in the active Excel sheet will be used in the
simulation model. The other sheets are ignored.
Workbook: All the simulation functions of the active workbook are included in the
simulation model. This option allows using several Excel sheets for one model.
Sampling method: This option allows choosing the method of sample generation. Two
possibilities are available:
Latin hypercubes: The samples are generated using the Latin Hypercubes method.
This method divides the distribution function of the variable into sections that have the
same size and then generates equally sized samples within each section. This leads to
a faster convergence of the simulation. You can enter the number of sections. Default
value is 500.
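The Latin Hypercubes method described above can be sketched for a single variable (the `ppf` inverse c.d.f. argument is an assumption for illustration):

```python
import numpy as np

def latin_hypercube(n_sections, ppf, seed=0):
    """Latin Hypercube draws for one variable: split [0,1] into equally
    sized sections, draw one uniform point per section, shuffle, and map
    through the inverse c.d.f. `ppf` (a sketch of the method above)."""
    rng = np.random.default_rng(seed)
    u = (np.arange(n_sections) + rng.random(n_sections)) / n_sections
    rng.shuffle(u)                 # decorrelate the draw order
    return ppf(u)                  # e.g. a normal or triangular inverse c.d.f.
```

Because every section contributes exactly one point, the sample covers the distribution evenly, which is what speeds up convergence compared to plain random draws.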
Single step memory: Enter the maximum number of simulation steps that will be stored in the
single step mode in order to calculate the statistics fields. When the limit is reached, the
window moves forward (the first iteration is forgotten and the new one is stored). The default
value is 500. This value can be larger, if necessary.
Number of iterations by step: Enter the value of the number of simulation iterations that are
performed during one step. The default value is 1.
Format tab:
Use these options to set the format of the various model elements that are displayed on the
Excel sheets:
Distributions: You can define the color of the font and the color of the background of
the cells where the definition of the input random variables and their corresponding
distributions are stored.
Scenario variables: You can define the color of the font and the color of the
background of the cells where the scenario variables are stored.
Result variables: You can define the color of the font and the color of the background
of the cells where the result variables are stored.
Statistics: You can define the color of the font and the color of the background of the
cells where the statistics are stored.
Convergence tab:
Stop conditions: Activate this option to stop the simulation if the convergence criteria are
reached.
Criterion: Select the criterion that should be used for testing the convergence. There
are three options available:
o Mean: The means of the monitored result variables (see below) of the
simulation model will be used to check if the convergence conditions are met.
Test frequency: Enter the number of iterations to perform before the convergence
criteria are checked again. Default value: 100.
Convergence: Enter the value in % of the evolution of the convergence criteria from
one check to the next, which, when reached, means that the algorithm has converged.
Default value: 3%.
Confidence interval (%): Enter the size in % of the confidence interval that is
computed around the selected criterion. The upper bound of the interval is compared to
the convergence value defined above, in order to determine if the convergence is
reached or not. Default value: 95%.
Monitored results: Select which result variables of the simulation model should be
monitored for the convergence. There are two options available:
o All result variables: All result variables of the simulation model will be
monitored during the convergence test.
o Activated result variables: Only result variables that have their ConvActive
parameter equal to 1 are monitored.
References tab:
Reference to Excel cells: Select the way the references to the Excel cells used by the
simulation model formulae are generated:
Absolute reference: XLSTAT creates absolute references (for example $A$4) to the
cell.
Relative reference: XLSTAT creates relative references (for example A4) to the cell.
Note: The absolute reference will not be changed if you copy and paste the XLSTAT_Sim
formula, contrary to the relative reference.
Results tab:
Filter level for results: Select the level of details that will be displayed in the report. This
controls the descriptive statistics tables and the histograms of the different model elements:
Activated: Details are only displayed for the elements that have a value of the Visible
parameter set to 1.
Example
Examples showing how to build a simulation model are available on the Addinsoft website at:
http://www.xlstat.com/demo-sim1.htm
http://www.xlstat.com/demo-sim2.htm
http://www.xlstat.com/demo-sim3.htm
http://www.xlstat.com/demo-sim4.htm
References
Vose, D. (2008). Risk Analysis A Quantitative Guide, Third Edition, John Wiley & Sons, New
York.
Define a distribution
Use this tool in a simulation model when there is uncertainty on the value of a variable (or
quantity) that can be described with a distribution. The distribution will be associated with the
currently selected cell.
Description
This function is one of the essential elements of a simulation model. For a more detailed
description on how a simulation model is constructed and calculated, please read the
introduction on XLSTAT-Sim.
This tool allows you to define the theoretical distribution function with known parameters that
will be used to generate a sample of a given random variable. A wide choice of distribution
functions is available.
To define the distribution that a given variable (physically, a cell on the Excel sheet) follows,
you need to create a call to one of the XLSTAT_SimX functions or to use the dialog box that
will generate for you the formula calling XLSTAT_SimX. X stands for the distribution (see the
table below for additional details).
XLSTAT_SimX syntax:
XLSTAT_SimX stands for one of the available distribution functions that are listed in the table
below. A variable based on the corresponding distribution is defined. See the table below to
see the available distributions.
VarName is a string giving the name of the variable for which the distribution is being defined.
The name of the variable is used in the report to identify the variable.
Param1 is an optional input (default is 0) that gives the value of the first parameter of the
distribution if relevant.
Param2 is an optional input (default is 0) that gives the value of the second parameter of the
distribution if relevant.
Param3 is an optional input (default is 0) that gives the value of the third parameter of the
distribution if relevant.
Param4 is an optional input (default is 0) that gives the value of the fourth parameter of the
distribution if relevant.
Param5 is an optional input (default is 0) that gives the value of the fifth parameter of the
distribution if relevant.
TruncMode is an optional integer that indicates whether and how the distribution is truncated. A
value of 0 (default value) corresponds to not truncating the distribution. A value of 1
corresponds to truncating the distribution between two bounds that must be entered. A value of
2 corresponds to truncating between two percentiles whose values must be entered.
DefaultType is an optional integer that chooses the default value of the variable: 0 (default
value) corresponds to the theoretical expected mean; 1 to the value given by the DefaultValue
argument.
DefaultValue is an optional value giving the default value displayed in the cell before any
simulation is performed. When no simulation process is ongoing, the default value will be
displayed in the Excel cell as the result of the function.
Visible is an optional input that indicates if the details of this variable should be displayed in
the simulation report. This option is only taken into account when the Filter level for results in
the Options dialog box of XLSTAT-Sim is set to Activated (see the Format tab). 0 deactivates
the display and 1 activates the display. Default value is 1.
Example:
The function will associate to the cell where it is entered a normal distribution with mean
50000 and standard deviation 5000. The cell will show 50000 (the default value). If a report is
generated afterwards, the results corresponding to that cell will be identified by Revenue Q1.
Param3, Param4 and Param5 are not entered because the Normal distribution has only
two parameters. As the other parameters are not entered, they are set to their default values.
Determination of the parameters
In general, the choice of the distribution and of its parameters is guided by an empirical
knowledge of the phenomenon, results already available, or working hypotheses.
To select the best suited distribution and the corresponding parameters you can use the
Distribution fitting tool of XLSTAT. If you have a sample of data, this tool can find the
best parameters for a given distribution.
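For the simplest case, a method-of-moments fit of a normal distribution can be sketched as follows (illustrative only; the Distribution fitting tool covers many more distributions and estimation methods):

```python
import numpy as np

def fit_normal(sample):
    """Method-of-moments fit of a normal distribution: the sample mean and
    standard deviation estimate mu and sigma (a sketch, not XLSTAT code)."""
    x = np.asarray(sample, dtype=float)
    mu = x.mean()              # matches E(X) = mu
    sigma = x.std(ddof=1)      # matches V(X) = sigma^2
    return mu, sigma
```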
Bernoulli (p): the density function of this distribution is given by:

P(X = 1) = p, P(X = 0) = 1 - p, with p ∈ ]0,1[

The Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli (1654-1705),
allows describing binary phenomena where only two events can occur, with respective
probabilities p and 1-p.
Beta (α, β): the density function of this distribution (also called Beta type I) is given by:

f(x) = x^(α-1) (1-x)^(β-1) / B(α,β), with α, β > 0, x ∈ [0,1] and B(α,β) = Γ(α)Γ(β)/Γ(α+β)

Beta4 (α, β, c, d): the density function of this distribution is given by:

f(x) = (x-c)^(α-1) (d-x)^(β-1) / [B(α,β) (d-c)^(α+β-1)], with α, β > 0, x ∈ [c,d],
c, d ∈ R and B(α,β) = Γ(α)Γ(β)/Γ(α+β)

For the type I beta distribution, X takes values in the [0,1] range. The beta4
distribution is obtained by a variable transformation such that the distribution is on a
[c, d] interval where c and d can take any value.
Beta (a, b): the density function of this distribution (also called Beta type I) is given by:

f(x) = x^(a-1) (1-x)^(b-1) / B(a,b), with a, b > 0, x ∈ [0,1] and B(a,b) = Γ(a)Γ(b)/Γ(a+b)
Binomial (n, p): the density function of this distribution is given by:

P(X = x) = C(n,x) p^x (1-p)^(n-x), with x ∈ N, n ∈ N* and p ∈ ]0,1[

n is the number of trials, and p the probability of success. The binomial distribution
is the distribution of the number of successes for n trials, given that the probability
of success is p.
Negative binomial type I (n, p): the density function of this distribution is given by:
Negative binomial type II (k, p): the density function of this distribution is given by:

P(X = x) = Γ(k+x) p^x / [x! Γ(k) (1+p)^(k+x)], with x ∈ N and k, p > 0

The negative binomial type II distribution is used to represent discrete and highly
heterogeneous phenomena. As k tends to infinity, the negative binomial type II
distribution tends towards a Poisson distribution with λ = kp.
Chi-square (df): the density function of this distribution is given by:

f(x) = [(1/2)^(df/2) / Γ(df/2)] x^(df/2 - 1) e^(-x/2), with x > 0 and df ∈ N*
Erlang (k, λ): the density function of this distribution is given by:

f(x) = λ^k x^(k-1) e^(-λx) / (k-1)!, with x ≥ 0, k, λ > 0 and k ∈ N

Note: When k=1, this distribution is equivalent to the exponential distribution. The
Gamma distribution with two parameters is a generalization of the Erlang
distribution to the case where k is a real and not an integer (for the Gamma
distribution the scale parameter β is used).
Exponential (λ): the density function of this distribution is given by:
f(x) = λ exp(−λx), with x ≥ 0 and λ > 0
The exponential distribution is often used for studying lifetime in quality control.
Fisher (df1, df2): the density function of this distribution is given by:
f(x) = [df1·x / (df1·x + df2)]^(df1/2) [1 − df1·x / (df1·x + df2)]^(df2/2) / [x B(df1/2, df2/2)],
with x ≥ 0 and df1, df2 ∈ N*
E(X) = df2/(df2−2) if df2 > 2, and V(X) = 2df2²(df1+df2−2)/[df1(df2−2)²(df2−4)] if df2 > 4
Fisher's distribution, from the name of the biologist, geneticist and statistician
Ronald Aylmer Fisher (1890-1962), corresponds to the ratio of two Chi-square
distributions. It is often used for testing hypotheses.
Fisher-Tippett (β, μ): the density function of this distribution is given by:
f(x) = (1/β) exp[−(x−μ)/β − exp(−(x−μ)/β)], with β > 0
Gamma (k, β, μ): the density function of this distribution is given by:
f(x) = (x−μ)^(k−1) e^(−(x−μ)/β) / [β^k Γ(k)], with x ≥ μ and k, β > 0
GEV (β, k): the density function of this distribution is given by:
f(x) = (1/β) (1 + k·x/β)^(−1/k − 1) exp[−(1 + k·x/β)^(−1/k)], with β > 0
The GEV (Generalized Extreme Values) distribution is much used in hydrology for
modeling flood phenomena. k lies typically between -0.6 and 0.6.
Gumbel: the density function of this distribution is given by:
f(x) = exp[−x − exp(−x)]
The Gumbel distribution, named after Emil Julius Gumbel (1891-1966), is a special
case of the Fisher-Tippett distribution with β=1 and μ=0. It is used in the study of
extreme phenomena such as precipitations, flooding and earthquakes.
Logistic (μ, s): the density function of this distribution is given by:
f(x) = e^(−(x−μ)/s) / [s (1 + e^(−(x−μ)/s))²], with μ ∈ R and s > 0
Lognormal (μ, σ): the density function of this distribution is given by:
f(x) = exp[−(ln x − μ)² / (2σ²)] / (x σ √(2π)), with x, σ > 0
Normal (μ, σ): the density function of this distribution is given by:
f(x) = exp[−(x−μ)² / (2σ²)] / (σ √(2π)), with σ > 0
E(X) = μ and V(X) = σ²
Normal standard: the density function of this distribution is given by:
f(x) = exp(−x²/2) / √(2π)
This distribution is a special case of the normal distribution with μ=0 and σ=1.
Pareto (a, b): the density function of this distribution is given by:
f(x) = a b^a / x^(a+1), with a, b > 0 and x ≥ b
The Pareto distribution, named after the Italian economist Vilfredo Pareto (1848-
1923), is also known as the Bradford distribution. This distribution was initially used
to represent the distribution of wealth in society, with Pareto's principle that 80% of
the wealth was owned by 20% of the population.
PERT (a, m, b): the density function of this distribution is given by:
f(x) = (x−a)^(α−1) (b−x)^(β−1) / [B(α, β) (b−a)^(α+β−1)], with α, β > 0, x ∈ [a, b],
a, b ∈ R, and B(α, β) = Γ(α)Γ(β) / Γ(α+β)
where
α = (4m + b − 5a) / (b − a)
β = (5b − a − 4m) / (b − a)
The PERT distribution is a special case of the beta4 distribution. It is defined by its
definition interval [a, b] and m the most likely value (the mode). PERT is an
acronym for Program Evaluation and Review Technique, a project management
and planning methodology. The PERT methodology and distribution were
developed during the project held by the US Navy and Lockheed between 1956 and
1960 to develop the Polaris missiles launched from submarines. The PERT
distribution is useful to model the time that is likely to be spent by a team to finish a
project. The simpler triangular distribution is similar to the PERT distribution in that it
is also defined by an interval and a most likely value.
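The mapping from the interval [a, b] and the mode m to the beta4 shape parameters can be sketched in a few lines of Python (an illustration of the formulas above, not XLSTAT code):

```python
def pert_parameters(a, b, m):
    """Map the PERT interval [a, b] and mode m to the beta4 shape
    parameters, using the formulas given above."""
    alpha = (4 * m + b - 5 * a) / (b - a)
    beta = (5 * b - a - 4 * m) / (b - a)
    return alpha, beta

# Symmetric case: a mode at the middle of the interval gives alpha = beta.
print(pert_parameters(0.0, 10.0, 5.0))  # -> (3.0, 3.0)
```

Moving the mode towards one bound makes the shape parameters asymmetric, which skews the resulting beta4 density towards that bound.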
Poisson (λ): the density function of this distribution is given by:
P(X = x) = exp(−λ) λ^x / x!, with x ∈ N and λ > 0
Student (df): the density function of this distribution is given by:
f(x) = Γ[(df+1)/2] (1 + x²/df)^(−(df+1)/2) / [√(π df) Γ(df/2)], with df > 0
The English chemist and statistician William Sealy Gosset (1876-1937) used the
nickname Student to publish his work, in order to preserve his anonymity (the
Guinness brewery forbade its employees to publish following the publication of
confidential information by another researcher). The Student's t distribution is the
distribution of the mean of df standard normal variables. When df=1,
Student's distribution is a Cauchy distribution with the particularity of having neither
expectation nor variance.
Trapezoidal (a, b, c, d): the density function of this distribution is given by:
f(x) = 2(x−a) / [(d+c−b−a)(b−a)], x ∈ [a, b]
f(x) = 2 / (d+c−b−a), x ∈ [b, c]
f(x) = 2(d−x) / [(d+c−b−a)(d−c)], x ∈ [c, d]
f(x) = 0, x < a or x > d
with a ≤ b ≤ c ≤ d
This distribution is useful to represent a phenomenon for which we know that it can
take values between two extreme values (a and d), but that it is more likely to take
values between two values (b and c) within that interval.
Triangular (a, m, b): the density function of this distribution is given by:
f(x) = 2(x−a) / [(b−a)(m−a)], x ∈ [a, m]
f(x) = 2(b−x) / [(b−a)(b−m)], x ∈ [m, b]
f(x) = 0, x < a or x > b
with a ≤ m ≤ b
TriangularQ (q1, m, q2, p1, p2): the density function of this distribution is a
reparametrization of the Triangular distribution. A first step requires estimating the a
and b parameters of the triangular distribution from the quantiles q1 and q2, which
correspond to the percentages p1 and p2. Once this is done, the distribution functions can be
computed using the triangular distribution functions.
Uniform (a, b): the density function of this distribution is given by:
f(x) = 1 / (b−a), with b > a and x ∈ [a, b]
The uniform (0,1) distribution is much used for simulations. As the cumulative
distribution function of all the distributions is between 0 and 1, a sample taken in a
Uniform (0,1) distribution is used to obtain random samples in all the distributions
for which the inverse can be calculated.
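This inverse-transform idea can be illustrated with a short Python sketch (not XLSTAT code): draws from Uniform(0, 1) are pushed through the inverse cumulative distribution function of a target distribution, here an exponential with the arbitrary rate λ = 2:

```python
import math
import random

def sample_exponential(lam, u):
    # Inverse of the exponential cdf F(x) = 1 - exp(-lam * x):
    # x = -ln(1 - u) / lam, where u is a draw from Uniform(0, 1).
    return -math.log(1.0 - u) / lam

random.seed(0)
draws = [sample_exponential(2.0, random.random()) for _ in range(100000)]
print(sum(draws) / len(draws))  # close to the theoretical mean 1/lam = 0.5
```

The same recipe works for any distribution whose inverse cdf can be evaluated, which is exactly the property mentioned above.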
Uniform discrete (a, b): the density function of this distribution is given by:
f(x) = 1 / (b−a+1), with b > a, (a, b) ∈ N², x ∈ N and x ∈ [a, b]
The uniform discrete distribution corresponds to the case where the uniform
distribution is restricted to integers.
Weibull (β): the density function of this distribution is given by:
f(x) = β x^(β−1) exp(−x^β), with x ≥ 0 and β > 0
We have E(X) = Γ(1/β + 1) and V(X) = Γ(2/β + 1) − Γ²(1/β + 1)
Weibull (β, γ): the density function of this distribution is given by:
f(x) = (β/γ) (x/γ)^(β−1) exp[−(x/γ)^β], with x ≥ 0 and β, γ > 0
We have E(X) = γ Γ(1/β + 1) and V(X) = γ² [Γ(2/β + 1) − Γ²(1/β + 1)]
β is the shape parameter of the distribution and γ the scale parameter. When β=1,
the Weibull distribution is an exponential distribution with parameter 1/γ.
Weibull (β, γ, μ): the density function of this distribution is given by:
f(x) = (β/γ) [(x−μ)/γ]^(β−1) exp{−[(x−μ)/γ]^β}, with x ≥ μ and β, γ > 0
We have E(X) = μ + γ Γ(1/β + 1) and V(X) = γ² [Γ(2/β + 1) − Γ²(1/β + 1)]
The Weibull distribution, named after the Swede Ernst Hjalmar Waloddi Weibull
(1887-1979), is much used in quality control and survival analysis. β is the shape
parameter of the distribution and γ the scale parameter. When β=1 and μ=0, the
Weibull distribution is an exponential distribution with parameter 1/γ.
Dialog box
: click this button to close the dialog box without doing any modification.
General tab:
Variable name: Enter the name of the random variable or select a cell where the name is
available. If you select a cell, an absolute reference (for example $A$4) or a relative reference
(for example A4) to the cell is created, depending on your choice in the XLSTAT options. (See
the Options section for more details)
Distributions: Select the distribution that you want to use for the simulation. See the
description section for more information on the available distributions.
Parameters: Enter the value of the parameters of the distribution you selected.
Absolute: Select this option, if you want to enter the lower and upper bound of the
truncation as absolute values.
Percentile: Select this option, if you want to enter the lower and upper bound of the
truncation as percentile values.
Lower bound: Enter the value of the lower bound of the truncation.
Upper bound: Enter the value of the upper bound of the truncation.
Options tab:
Default cell value: Choose the default value of the random variable. This value will be
returned when no simulation model is running. The value may be defined by one of the
following three methods:
Expected value: This option selects the expected value of the distribution as the
default cell value.
Reference: Choose a cell in the active Excel sheet that contains the default value.
Display results: Activate this option to display the detailed results for the random variable in
the simulation report. This option is only active if you selected the Activated filter level in the
simulation preferences. (See the Options section for more details).
Results
The result is a function call to XLSTAT_SimX with the selected parameters. The
corresponding formula is generated in the active Excel cell.
The background color and the font color in the Excel cell are applied according to your choices
in the XLSTAT-Sim options.
Define a scenario variable
Use this tool to define a variable whose value varies between two known bounds during the
tornado analysis.
Description
This function allows you to build a scenario variable that is used during the tornado analysis. For a
more detailed description of how a simulation model is constructed, please read the
introduction on XLSTAT-Sim.
A scenario variable is used for tornado analysis. This function gives you the possibility to
define a scenario variable by letting XLSTAT know the bounds between which it varies. To
define the scenario variable (physically, a cell on the Excel sheet), you need to create a call to
the XLSTAT_SimSVar function or to use the dialog box that will generate for you the formula
calling XLSTAT_SimSVar.
XLSTAT_SimSVar syntax
SVarName is a string that contains the name of the scenario variable. This can be a reference
to a cell in the same Excel sheet. The name is used during the report to identify the cell.
LowerBound corresponds to the lower bound of the interval of possible values for the
scenario variable.
UpperBound corresponds to the upper bound of the interval of possible values for the
scenario variable.
Type is an integer that indicates the data type of the scenario variable. 1 stands for a
continuous variable and 2 for a discrete variable. This input is optional with default value 1.
Step is a number that indicates in the case of a discrete variable the step size between two
values to be examined during the tornado analysis. This input is optional with default value 1.
DefaultType is an optional integer that chooses the default value of the variable: 0 (default
value) corresponds to the theoretical expected mean; 1 to the value given by the DefaultValue
argument.
DefaultValue is a value that corresponds to the default value of the scenario variable. The
default value is returned as the result of this function.
Visible is an optional input that indicates if the details of this variable should be displayed in
the simulation report. This option is only taken into account when the Filter level for results in
the options dialog box of XLSTAT-Sim is set to Activated (see the Format tab). 0 deactivates
the display and 1 activates the display. Default value is 1.
Dialog box
: click this button to close the dialog box without doing any modification.
General tab:
Variable name: Enter the name of the scenario variable or select a cell where the name is
available. If you select a cell, an absolute reference (for example $A$4) or a relative reference
(for example A4) to the cell is created, depending on your choice in the XLSTAT options. (See
the Options section for more details)
Lower bound: Enter the value of the lower bound or select a cell in the active Excel sheet that
contains the value of the lower bound of the interval in which the scenario variable varies.
Upper bound: Enter the value of the upper bound or select a cell in the active Excel sheet that
contains the value of the upper bound of the interval in which the scenario variable varies.
Data type:
Continuous: Choose this option to define a continuous scenario variable that can take
any value between the lower and upper bounds.
Discrete: Choose this option to define a discrete scenario variable.
o Step: Enter the value of the step or select a cell in the active Excel sheet that
contains the value of the step.
Options tab:
Default cell value: Choose the default value of the random variable. This value will be
returned when no simulation model is running. The value may be defined by one of the
following three methods:
Expected value: This option returns the center of the interval as the default cell value.
Reference: Choose a cell in the active Excel sheet that contains the default value.
Display results: Activate this option to display the detailed results for the random variable in
the simulation report. This option is only active if you selected the Activated filter level in the
simulation preferences. (See the Options section for more details).
Results
The result is a function call to XLSTAT_SimSVar with the selected parameters. The
corresponding formula is generated in the active Excel cell.
The background color and the font color in the Excel cell are applied according to your choices
in the XLSTAT-Sim options.
Define a result variable
Use this tool in a simulation model to define a result variable whose calculation is the real aim
of the simulation model.
Description
This result variable is one of the two essential elements of a simulation model. For a more
detailed description on how a simulation model is constructed and calculated, please read the
introduction on XLSTAT-Sim.
Result variables can be used to define when a simulation process should stop during a run. If,
in the XLSTAT-Sim Options dialog box, you asked that the Activated result variables are
used to stop the simulations when, for example, the mean has converged, then, if the
ConvActiv parameter of the result variable is set to 1, the mean of the variable will be used to
determine if the simulation process has converged or not.
To define the result variable (physically, a cell on the Excel sheet), you need to create a call to
the XLSTAT_SimRes function or to use the dialog box that will generate for you the formula
calling XLSTAT_SimRes.
XLSTAT_SimRes syntax:
ResName is a string that contains the name of the result variable or a reference to a cell
where the name is located. The name is used during the report to identify the result variable.
Formula is a string that contains the formula that is used to calculate the results. The formula
links directly or indirectly the random input variables and, if available, the scenario variables, to
the result variable. This corresponds to an Excel formula without the leading =.
DefaultValue of type number is optional and contains the default value of the result variable.
This value is not used in the computations.
ConvActiv is an integer that indicates if this result is checked during the convergence tests.
This option is only active, if the Activated result variables convergence option is activated in
the XLSTAT-Sim options dialog box.
Visible is an optional input that indicates if the details of this variable should be displayed in
the simulation report. This option is only taken into account when the Filter level for results in
the options dialog box of XLSTAT-Sim is set to Activated (see the Format tab). 0 deactivates
the display and 1 activates the display. Default value is 1.
Example:
This function defines in the active cell a result variable called "Forecast N+1", calculated as the
sum of cells B3 and B4 minus B5. The Visible parameter is not entered because it is only
necessary when the Filter level for the results is set to Activated (see the Options dialog
box) and because we want the result to be visible anyway.
Dialog box
: click this button to close the dialog box without doing any modification.
General tab:
Variable name: Enter the name of the result variable or select a cell where the name is
available. If you select a cell, an absolute reference (for example $A$4) or a relative reference
(for example A4) to the cell is created, depending on your choice in the XLSTAT options. (See
the Options section for more details)
Use to monitor convergence: Activate this option to include this result variable in the result
variables that are used to test for convergence. This option is only active, if you selected the
Activated results variables option in the XLSTAT-Sim convergence options. ConvActiv should
be 1 if you want the variable to be used to monitor the results. Default value is 1.
Display Results: Activate this option to display the detailed results for the result variable in the
simulation report. This option is only active, if you selected the restricted filter level in the
simulation preferences. (See the XLSTAT-Sim options for more details).
Results
A function call to XLSTAT_SimRes with the selected parameters will be generated in the
active Excel cell.
The background color and the font color in the Excel cell are applied according to your choices
in the XLSTAT-Sim options.
Define a statistic
Use this tool in a simulation model to define a statistic based on a variable of the simulation
model. The statistic is updated after each iteration of the simulation process. Results relative to
the defined statistics are available in the simulation report. A wide choice of statistics is
available.
Description
This function is one of the four elements of a simulation model. For a more detailed description
on how a simulation model is constructed and calculated, please read the introduction on
XLSTAT-Sim.
This tool allows you to create a function that calculates a statistic after each iteration of the
simulation process. The statistic is computed and stored. During the step by step simulations,
you can track how the statistic evolves. In the simulation report you can optionally see details
on the statistic. A wide choice of statistics is available.
To define the statistic function (physically, a cell on the Excel sheet), you need to create a call
to a XLSTAT_SimStatX/TheoX/SPCX function or to use the dialog box that will generate for
you the formula calling the function. X stands for the statistic as defined in the tables below. A
variable based on the corresponding statistic is created.
XLSTAT_SimStat/Theo/SPC Syntax
X stands for the selected statistic. The available statistics are listed in the tables below.
StatName is a string that contains the name of the statistic or a reference to a cell where the
name is located. The name is used during the report to identify the statistic.
Reference indicates the model variable to be tracked. This is a reference to a cell in the same
Excel sheet.
Visible is an optional input that indicates if the details of this statistic should be displayed when
the Filter level for results in the Options dialog box of XLSTAT-Sim is set to Activated (see
the Format tab). 0 deactivates the display and 1 activates the display. Default value is 1.
Descriptive statistics
Details and formulae relative to the above statistics are available in the description section of
the Descriptive statistics tool of XLSTAT.
Theoretical statistics
These statistics are based on the theoretical computation of the mean, variance and standard
deviation of the distribution, as opposed to the empirical computation based on the simulated
samples.
SPC
Statistics from the domain of SPC (Statistical Process Control) are listed hereunder. These
statistics are only available and calculated, if you have a valid license for the XLSTAT-SPC
module.
Dialog box
: click this button to close the dialog box without doing any modification.
: click this button to reload the default options.
General tab:
Name: Enter the name of the statistic or select a cell where the name is available. If you select
a cell, an absolute reference (for example $A$4) or a relative reference (for example A4) to the
cell is created, depending on your choice in the XLSTAT options. (See the Options section for more
details).
Reference: Choose a cell in the active Excel sheet that contains the simulation model variable
that you want to track with the selected statistic.
Statistic: Activate one of the following options and choose the statistic to compute:
Descriptive: Select one of the available statistics (See description section for more
details).
Theoretical: Select one of the available statistics (See description section for more
details).
SPC: Select one of the available statistics (See description section for more details).
Display Results: Activate this option to display the detailed results for the statistic in the
simulation report. This option is only active, if you selected the restricted filter level in the
simulation preferences (See the XLSTAT-Sim options section for more details).
Results
A function call to XLSTAT_SimStat/Theo/SPC with the selected parameters will be generated
in the active Excel cell.
The background color and the font color in the Excel cell are applied according to your choices
in the XLSTAT-Sim options.
Run
Once you have designed the simulation model using the four tools define a distribution,
define a scenario variable, define a result, and define a statistic, you can click the
icon of XLSTAT-SIM toolbar to display the Run dialog box that lets you define additional
options before running the simulation model and displaying the report. A description of the
results is available below.
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Number of simulations: Enter the number of simulations to perform for the model (Default
value: 300).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Labels included: Activate this option if the row and column labels are selected.
Options tab:
Tornado/Spider: Choose the options for the calculation of the tornado and spider analysis.
Number of points: Choose the number of points between the two bounds of the
intervals that are used for the tornado analysis.
Standard value: Choose how the standard value, around which the intervals checked
during the tornado and spider analysis are built, is computed for each variable.
o Median: The default value of the distribution fields is the median of the
simulated values.
o Default cell value: The default value defined for the variables is used.
Interval definition: Choose an option for the definition of the limits of the intervals of
the variables that are checked during the tornado/spider analyses.
SPC tab:
Calculate Process capabilities: Activate this option to calculate process capabilities for input
random variables, result variables and statistics.
Variable names: Select the data that correspond to the names of the variables for
which you want to calculate process capabilities.
LSL: Select the data that correspond to the lower specification limit (LSL) of the process
for the variables for which the names have been selected.
USL: Select the data that correspond to the upper specification limit (USL) of the
process for the variables for which the names have been selected.
Target: Select the data that correspond to the target of the process for the variables for
which the names have been selected.
Outputs tab:
Correlations: Activate this option to display the correlation matrix between the variables. If the
significant correlations in bold option is activated, the correlations that are significant at
the selected significance level are displayed in bold.
Type of correlation: Choose the type of correlation to use for the computations (see
the description section for more details).
Significance level (%): Enter the significance level for the test on the correlations
(default value: 5%).
p-values: Activate this option to display the p-values corresponding to the correlations.
Sensitivity: Activate this option to display the results of the sensitivity analysis.
Tornado: Activate this option to display the results of the tornado analysis.
Spider: Activate this option to display the results of the spider analysis.
Simulation details: Activate this option to display the details on the iterations of the
simulation.
Descriptive statistics: Activate this option to compute and display descriptive statistics for the
variables of the model.
Display vertically: Check this option so that the table of descriptive statistics is
displayed vertically (one line per descriptive statistic).
Charts tab:
Histograms tab:
Histograms: Activate this option to display the histograms of the samples. For a theoretical
distribution, the density function is displayed.
Bars: Choose this option to display the histograms with a bar for each interval.
Continuous lines: Choose this option to display the histograms with a continuous line.
Cumulative histograms: Activate this option to display the cumulative histograms of the
samples.
Based on the histogram: Choose this option to display cumulative histograms based
on the same interval definition as the histograms.
Intervals: Select one of the following options to define the intervals of the histogram:
Width: Choose this option to define a fixed width for the intervals.
User defined: Select a column containing in increasing order the lower bound of the
first interval, and the upper bound of all the intervals.
Minimum: Activate this option to enter the minimum value of the histogram. If the
Automatic option is chosen, the minimum is that of the sample. Otherwise, it is the value
defined by the user.
Box plots: Check this option to display box plots (or box-and-whisker plots). See the
description section for more details.
Horizontal: Check this option to display box plots and scattergrams horizontally.
Vertical: Check this option to display box plots and scattergrams vertically.
Group plots: Check this option to group together the various box plots and
scattergrams on the same chart to compare them.
Outliers: Check this option to display the points corresponding to outliers (box plots)
with a hollowed-out circle.
Scattergrams: Check this option to display scattergrams. The mean (red +) and the median
(red line) are always displayed.
Correlations tab:
The blue-red option allows representing low correlations with cold colors (blue is used
for the correlations that are close to -1) and high correlations with hot colors
(correlations close to 1 are displayed in red).
The Black and white option allows to either display in black the positive correlations
and in white the negative correlations (the diagonal of 1s is displayed in grey), or to
display in black the significant correlations, and in white the correlations that are not
significantly different from 0.
The Patterns option allows to represent positive correlations by lines that rise from left
to right, and negative correlations by lines that rise from right to left. The higher the
absolute value of the correlation, the larger the space between the lines.
Scatter plots: Activate this option to display the scatter plots for all two by two combinations of
variables.
Matrix of plots: Check this option to display all possible combinations of variables in
pairs in the form of a two-entry table with the various variables displayed in rows and in
columns.
o Q-Q plots: Activate this option so that XLSTAT displays a Q-Q plot when the X
and Y variables are identical.
o Confidence ellipses: Activate this option to display confidence ellipses. The
confidence ellipses correspond to a x% confidence interval (where x is
determined using the significance level entered in the General tab) for a
bivariate normal distribution with the same means and the same covariance
matrix as the variables represented in abscissa and ordinates.
Results
The first results are general results that display information about the model:
Distributions: This table shows for each input random variable in the model, its name, the
Excel cell where it is located, the selected distribution, the static value, the data type, the
truncation mode and bounds and the parameters of the distribution.
Scenario variables: This table shows for each scenario variable in the model, its name,
the Excel cell where it is located, the default value, the type, the lower and upper limits and the
step size.
Result variables: This table shows for each result variable in the model, its name, the Excel
cell where it is located, and the formula for its calculation.
Statistics: This table shows for each statistic in the model, its name, the Excel cell that
contains it and the selected statistic.
Convergence: If the option convergence in the simulation options has been activated, then
this table displays for each result variable that has been selected for convergence checking,
the value and the variation of the lower and upper bound of the confidence interval for the
selected convergence criterion. Under the matrix information about the selected convergence
criterion, the corresponding threshold of variation, and the number of executed iterations of
simulation are displayed.
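The exact convergence criterion is configured in the XLSTAT-Sim options and is not reproduced here, but a stopping rule of this general kind can be sketched in Python: simulate in batches, track the confidence interval on the mean of a result variable, and stop once its bounds vary by less than a threshold between two checks. The batch size, threshold and the Gaussian toy model below are illustrative assumptions.

```python
import math
import random

def mean_ci_halfwidth(sample, z=1.96):
    # Mean and half-width of an approximate 95% confidence interval.
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, z * math.sqrt(var / n)

random.seed(1)
sample, prev_bounds = [], None
while len(sample) < 200_000:  # safety cap on the number of iterations
    sample.extend(random.gauss(10.0, 2.0) for _ in range(500))  # one batch
    mean, half = mean_ci_halfwidth(sample)
    bounds = (mean - half, mean + half)
    if prev_bounds is not None:
        # Relative variation of both CI bounds between two checks.
        variation = max(abs(b - p) / abs(p) for b, p in zip(bounds, prev_bounds))
        if variation < 0.001:
            break
    prev_bounds = bounds
print(len(sample), round(mean, 2))
```

The loop ends well before the cap because the interval bounds stabilize as the sample grows, which is the behavior the convergence table summarizes.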
In the following section, details for the different model elements, distributions, scenario
variables, result variables and statistics, are displayed.
Descriptive statistics: For each type of variable, the statistics selected in the dialog box are
displayed in a table.
Descriptive statistics for the intervals: This table displays for each interval of the histogram
its lower bound, upper bound, the frequency (number of values of the sample within the
interval), the relative frequency (the number of values divided by the total number of values in
the sample), and the density (the ratio of the frequency to the size of the interval).
Sensitivity: A table with the correlations, the contributions and the absolute value of the
contributions between the input random variables is displayed for each result variable. The
contributions are then plotted on a chart.
Tornado: This table displays the minimum, the maximum and the range of the result variable
when the input random variables and the scenario variables vary in the defined ranges. Then
the minimum and the maximum are shown on a chart.
Spider: This table displays for all the points that are evaluated during the tornado analysis the
value of each result variable when the input random variables and scenario variables vary.
These values are then displayed in a chart.
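To make the mechanics behind these two tables concrete, here is a minimal Python sketch (not XLSTAT code) of a tornado scan: each input is varied over its interval at a fixed number of points while the other inputs keep their default values, and the minimum, maximum and range of the result are recorded. The model, variable names and intervals are hypothetical.

```python
def tornado(result_fn, defaults, intervals, n_points=10):
    """For each input, scan its interval while the other inputs stay at
    their default values; return (min, max, range) of the result."""
    out = {}
    for name, (lo, hi) in intervals.items():
        values = []
        for i in range(n_points):
            x = dict(defaults)
            x[name] = lo + (hi - lo) * i / (n_points - 1)
            values.append(result_fn(x))
        out[name] = (min(values), max(values), max(values) - min(values))
    return out

# Hypothetical model: profit = price * volume - cost
model = lambda x: x["price"] * x["volume"] - x["cost"]
defaults = {"price": 10.0, "volume": 100.0, "cost": 400.0}
intervals = {"price": (8.0, 12.0), "cost": (300.0, 500.0)}
print(tornado(model, defaults, intervals))
```

Sorting the inputs by the resulting range produces the familiar tornado ordering; keeping every scanned value instead of only the extremes gives the spider data.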
The correlation matrix and the table of the p-values are displayed so that you can see the
relationships between the input variables and the output variables. The correlation maps allow
identifying potential structures in the matrix, or quickly spotting interesting correlations.
Simulation details: A table showing the values of each variable at each iteration is displayed.
Subgroup Charts
Use this tool to supervise production quality, in the case where you have a group of
measurements for each point in time. The measurements need to be quantitative data. This
tool is useful to recap the mean and the variability of the measured production quality.
Integrated in this tool, you will find Box-Cox transformations, calculation of process capability
and the application of rules for special causes and Westgard rules (an alternative set of rules
to identify special causes) to complete your analysis.
Description
Control charts were first mentioned in a document by Walter Shewhart that he wrote during his
time working at Bell Labs in 1924. He described his methods completely in his book (1931).
For a long time, there was no significant innovation in the area of control charts. With the
development of CUSUM, UWMA and EWMA charts in 1936, Deming expanded the set of
available control charts.
Control charts were originally used in the area of goods production. Therefore the wording is still
from that domain. Today this approach is being applied to a large number of different fields, for
instance services, human resources, and sales. In the following chapters we will use the
wording from production and the shop floor.
Subgroup charts
The subgroup charts tool offers you the following chart types alone or in combination:
- X (X bar)
- R
- S
- S²
An X bar chart is useful to follow the mean of a production process. Mean shifts are easily
visible in the diagrams.
An R chart (Range chart) is useful to analyze the variability of the production. A large
difference in production, caused for example by the use of different production lines, will be
easily visible.
S and S² charts are also used to analyze the variability of production. The S chart draws the
standard deviation of the process and the S² chart draws the variance (which is the square of
the standard deviation).
Note 1: If you want to investigate smaller mean shifts, you can also use CUSUM group
charts, which are often preferred to subgroup control charts.
Note 2: If you have only one measurement for each point in time, then please use the control
charts for individuals.
Note 3: If you have measurements as qualitative values (for instance ok/not ok, conform/not
conform), then use the control charts for attributes.
This tool offers you the following options for the estimation of the standard deviation (sigma) of
the data set, given k subgroups and ni (i = 1, …, k) measurements per subgroup:

- Pooled standard deviation: The estimator for sigma is calculated based on the pooled variance of the k subgroups:

$$ s = \frac{1}{c_4}\sqrt{\frac{\sum_{i=1}^{k}(n_i-1)\,s_i^2}{\sum_{i=1}^{k}n_i-k}} $$
- R bar: The estimator for sigma is calculated based on the average range of the k subgroups:

$$ s = \bar{R} / d_2 $$
- S bar: The estimator for sigma is calculated based on the average of the standard deviations
of the k subgroups:

$$ s = \bar{s} / c_4, \quad \text{with } \bar{s} = \frac{1}{k}\sum_{i=1}^{k} s_i $$
Process capability
Process capability describes a process and indicates whether the process is under control and
whether the values taken by the measured variables are inside the specification limits of the
process. In the latter case, one says that the process is capable.
When interpreting the different process capability indicators, please keep in mind that some of
them assume normality, or at least symmetry, of the distribution of the measured values. You
can verify these premises using a normality test (see the Normality Tests tool in XLSTAT-Pro).
If the data are not normally distributed, you can still obtain results for the process capabilities
as follows:
- Use the Box-Cox transformation to improve the normality of the data set. Then verify again
the normality using a normality test.
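As a point of reference, the two most widely used capability indicators reported in the results, Cp and Cpk, follow the textbook definitions below. This Python sketch is illustrative only and is not XLSTAT's implementation, which also reports Cpl, Cpu, Pp, Cpm, and other indicators:

```python
def cp(usl, lsl, sigma):
    """Cp: specification spread divided by the 6-sigma process spread."""
    return (usl - lsl) / (6 * sigma)

def cpk(usl, lsl, mu, sigma):
    """Cpk: like Cp, but penalizes a process mean that is off-center."""
    return min(usl - mu, mu - lsl) / (3 * sigma)
```

A process with its mean centered between LSL = 4 and USL = 10 and sigma = 1 has Cp = Cpk = 1; shifting the mean lowers Cpk but leaves Cp unchanged.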
Box-Cox transformation
Box-Cox transformation is used to improve the normality of the time series; the Box-Cox
transformation is defined by the following equation:

$$ Y_t = \begin{cases} \dfrac{X_t^{\lambda} - 1}{\lambda}, & X_t > 0,\ \lambda \neq 0 \\ \ln(X_t), & X_t > 0,\ \lambda = 0 \end{cases} $$

where the series {Xt} (t = 1, …, n) is transformed into the series {Yt}.

Note: if λ < 0, the first equation is still valid, but Xt must be strictly positive. XLSTAT accepts a
fixed value of λ, or it can find the value that maximizes the likelihood, the model being a
simple linear model with time as the sole explanatory variable.
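The transformation itself is straightforward to sketch. The snippet below applies a fixed λ (lambda) and does not reproduce XLSTAT's likelihood-based optimization of λ:

```python
import math

def box_cox(series, lam):
    """Box-Cox transform for strictly positive data:
    (x^lam - 1) / lam, or ln(x) when lam == 0."""
    if lam == 0:
        return [math.log(x) for x in series]
    return [(x ** lam - 1) / lam for x in series]
```

For λ = 1 the transform reduces to a simple shift (x - 1), and for λ = 0 to the natural logarithm.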
Chart rules
XLSTAT offers you the possibility to apply rules for special causes and Westgard rules. Two
sets of rules are available in order to interpret control charts. You can activate and deactivate
separately the rules in each set.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to start the computations.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
Mode tab:
Subgroup charts: Activate this option if you have a data set with several
measurements for each point in time.
Individual charts: Activate this option if you have a data set with one quantitative
measurement for each point in time.
Attribute charts: Activate this option if you have a data set with one qualitative
measurement for each point.
Time weighted: Activate this option if you want to use a time weighted chart like
UWMA, EWMA or CUSUM.
At this stage, the subgroup charts family should be selected. If not, please switch to the
help corresponding to the selected chart family. The options below correspond to the
subgroup charts:
X bar chart: Activate this option if you want to calculate the X bar chart to analyze the
mean of the process.
R chart: Activate this option if you want to calculate the R chart to analyze variability of
the process.
S chart: Activate this option if you want to calculate the S chart to analyze variability of
the process.
S² chart: Activate this option if you want to calculate the S² chart to analyze variability
of the process.
X bar R chart: Activate this option if you want to calculate the X bar chart together with
the R chart to analyze the mean value and variability of the process.
X bar S chart: Activate this option if you want to calculate the X bar chart together with
the S chart to analyze the mean value and variability of the process.
X bar S² chart: Activate this option if you want to calculate the X bar chart together
with the S² chart to analyze the mean value and variability of the process.
General tab:
Columns/Rows: Activate this option for XLSTAT to take each column (in column mode)
or each row (in row mode) as a separate measurement that belongs to the same
subgroup.
One column/row: Activate this option if the measurements of the different subgroups
are all on the same column (column mode) or one row (row mode). To assign the
different measurements to their corresponding subgroup, please enter a constant group
size or select a column or row with the group identifier in it.
Data: If the data format One column/row is selected, please choose the unique column or
row that contains all the data. The assignment of the data to their corresponding subgroup
must be specified using the Groups field or by setting the common subgroup size. If you select
the Columns/rows option, please select a data area with one column/row per
measurement in a subgroup.
Groups: If the data format One column/row is selected, then activate this option to select a
column/row that contains the group identifier. Select the data that identify the corresponding
group for each element of the data selection.
Common subgroup size: If the data format One column/row is selected and the subgroup
size is constant, then you can deactivate the groups option and enter in this field the common
subgroup size.
Phase: Activate this option to supply one column/row with the phase identifier.
Observation labels: Activate this option if observation labels are available. Then select the
corresponding data. If the Variable labels option is activated, you need to include a header in
the selection. If this option is not activated, the observation labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/Row labels: Activate this option if the first row (column mode) or column (row mode)
of the data selections contains a label.
Options tab:
Bound: Activate this option if you want to enter a maximum value to accept for the
upper control limit of the process. This value will be used when the calculated upper
control limit is greater than the value entered here.
Value: Enter the upper control limit. This value will be used in place of the calculated
upper control limit.
Bound: Activate this option if you want to enter a minimum value to accept for the
lower control limit of the process. This value will be used when the calculated lower
control limit is lower than the value entered here.
Value: Enter the lower control limit. This value will be used in place of the calculated
lower control limit.
Calculate process capabilities: Activate this option to calculate process capabilities based on
the input data (see the description section for more details).
USL: If the calculation of the process capabilities is activated, please enter here the
upper specification limit (USL) of the process.
LSL: If the calculation of the process capabilities is activated, please enter here the
lower specification limit (LSL) of the process.
Target: If the calculation of the process capabilities is activated, activate this option to
add the target value of the process.
Box-Cox: Activate this option to compute the Box-Cox transformation. You can either fix the
value of the Lambda parameter, or decide to let XLSTAT optimize it (see the description
section for further details).
k Sigma: Activate this option to enter the distance between the upper and the lower control
limit and the center line of the control chart. The distance is fixed to k times the factor you enter
multiplied by the estimated standard deviation. Corrective factors according to Burr (1969) will
be applied.
alpha: Activate this option to define the size of the confidence range around the center line of
the control chart. (100 - alpha)% of the distribution of the control chart statistic lies inside the
control limits. Corrective factors according to Burr (1969) will be applied.
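For the k Sigma option, the control limits of an X bar chart follow the usual Shewhart construction. A minimal sketch, without Burr's corrective factors, which XLSTAT applies on top:

```python
import math

def xbar_limits(center, sigma, n, k=3.0):
    """Lower and upper control limits: center line +/- k standard errors,
    where the standard error of a subgroup mean is sigma / sqrt(n)."""
    half_width = k * sigma / math.sqrt(n)
    return center - half_width, center + half_width
```

With center 10, sigma 2, subgroups of size 4, and the classical k = 3, the limits are 7 and 13.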
Mean: Activate this option to enter a value for the center line of the control chart. This value
should be based on historical data.
Sigma: Activate this option to enter a value for the standard deviation of the control chart. This
value should be based on historical data. If this option is activated, then you cannot choose an
estimation method for the standard deviation in the Estimation tab.
Estimation tab:
Method for Sigma: Select an option to determine the estimation method for the standard
deviation of the control chart (see the description section for further details):
Pooled standard deviation
R-bar
S-bar
Outputs tab:
Display zones: Activate this option to display, in addition to the lower and upper control limits,
the limits of the zones A and B.
Normality Tests: Activate this option to check the normality of the data (see the Normality
Tests tool for further details).
Significance level (%): Enter the significance level for the tests.
Test special causes: Activate this option to analyze the points of the control chart according
to the rules for special causes. You can activate the following rules independently:
Apply Westgard rules: Activate this option to analyze the points of the control chart according
to the Westgard rules. You can activate the following rules independently:
Rule 1 2s
Rule 1 3s
Rule 2 2s
Rule R 4s
Rule 4 1s
Rule 10 X
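As an illustration, two of these rules can be sketched in Python. The helper names are hypothetical, and only the 1 2s warning rule and the 10 X run rule are shown; this is not XLSTAT's full rule engine:

```python
def rule_1_2s(points, mean, sigma):
    """Westgard 1 2s: flag any point more than 2 standard deviations
    from the center line (usually treated as a warning)."""
    return [abs(p - mean) > 2 * sigma for p in points]

def rule_10_x(points, mean):
    """Westgard 10 X: flag the end of any run of 10 consecutive points
    falling on the same side of the center line."""
    flags = [False] * len(points)
    run, side = 0, 0
    for i, p in enumerate(points):
        s = 1 if p > mean else (-1 if p < mean else 0)
        run = run + 1 if (s != 0 and s == side) else (1 if s != 0 else 0)
        side = s
        if run >= 10:
            flags[i] = True
    return flags
```

Each function returns one flag per point, which mirrors how the rule details table marks the subgroups for which a rule fired.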
Charts tab:
Display charts: Activate this option to display the control charts graphically.
Continuous line: Activate this option to connect the points in the control chart.
Needles view: Activate this option to display for each point of the control chart, the
minimum and maximum of the corresponding subgroup.
Box view: Activate this option to display the control charts using bars.
Connect through missing: Activate this option to connect the points, even when missing
values separate the points.
Normal Q-Q plots: Check this option to display Q-Q plots based on the normal distribution.
Display a distribution: Activate this option to compare the histograms of the selected samples
with a density function.
Run Charts: Activate this option to display a chart of the latest data points. Each individual
measurement is displayed.
Results
Estimation:
Estimated mean: This table displays the estimated mean values for the different phases.
Estimated standard deviation: This table displays the estimated standard deviation values
for the different phases.
Box-Cox transformation:
Estimates of the parameters of the model: This table is available only if the Lambda
parameter has been optimized. It displays the estimator for Lambda.
Series before and after transformation: This table displays the series before and after
transformation. If Lambda has been optimized, the transformed series corresponds to the
residuals of the model. If it has not, then the transformed series is the direct application of the
Box-Cox transformation.
Process capabilities:
Process capabilities: These tables are displayed if the process capability option has been
selected. There is one table for each phase. Each table contains the following indicators for the
process capability and, if possible, the corresponding confidence intervals: Cp, Cpl, Cpu, Cpk,
Pp, Ppl, Ppu, Ppk, Cpm, Cpm (Boyle), Cp 5.5, Cpk 5.5, Cpmk, and Cs (Wright).
For Cp, Cpl, and Cpu, information about the process performance is supplied, and for Cp,
status information is given to facilitate the interpretation.
Cp values have the following status based on Ekvall and Juran (1974):
Based on Montgomery (2001), Cp needs to have the following minimal values for the process
performance to be as expected:
1.50 for new processes or for existing processes when the variable is critical
Based on Montgomery (2001), Cpu and Cpl need to have the following minimal values for
process performance to be as expected:
1.45 for new processes or for existing processes when the variable is critical
Capabilities: This chart contains information about the specification and control limits. A line
between the lower and upper limits represents the interval, with an additional vertical mark for
the center line. The different control limits of each phase are drawn separately.
Chart information:
The following results are displayed separately for each requested chart. Charts can be
selected alone or in combination with the X bar chart.
X bar / R / S / S² chart: This table contains information about the center line and the upper and
lower control limits of the selected chart. There will be one column for each phase.
Observation details: This table displays detailed information for each subgroup. For each
subgroup the corresponding phase, the size, the mean, the minimum and the maximum
values, the center line, and the lower and upper control limits are displayed. If the information
about the zones A, B and C are activated, then the lower and upper control limits of the zones
A and B are displayed as well.
Rule details: If the rules options are activated, a detailed table about the rules will be
displayed. For each subgroup, there is one row for each rule that applies. "Yes" indicates that
the corresponding rule was fired for the corresponding subgroup and "No" indicates that the
rule does not apply.
X bar / R / S / S² chart: If the charts are activated, then a chart containing the information of the
two tables above is displayed. Each subgroup is displayed. The center line and the lower and
upper control limits are displayed as well. If the corresponding options have been activated,
the lower and upper control limits for the zones A and B are included and there are labels for
the subgroups for which rules were fired. A legend with the activated rules and the
corresponding rule number is displayed below the chart.
Normality tests:
For each of the four tests, the statistics relating to the test are displayed including, in particular,
the p-value which is afterwards used in interpreting the test by comparing with the chosen
significance threshold.
Histograms: The histograms are displayed. If desired, you can change the color of the lines,
scales, titles as with any Excel chart.
Example
A tutorial explaining how to use the SPC subgroup charts tool is available on the Addinsoft
web site. To consult the tutorial, please go to:
http://www.xlstat.com/demo-spc1.htm
References
Burr, I. W. (1967). The effect of non-normality on constants for X and R charts. Industrial
Quality Control, 23(11), 563-569.
Burr, I. W. (1969). Control charts for measurements with varying sample sizes. Journal of
Quality Technology, 1(3), 163-167.
Deming, W. E. (1993). The New Economics for Industry, Government, and Education.
Cambridge, MA: Center for Advanced Engineering Study, Massachusetts Institute of
Technology.
Ekvall, D. N. (1974). Manufacturing planning. In Quality Control Handbook, 3rd Ed. (J. M.
Juran et al., eds.), pp. 9-22-39, McGraw-Hill Book Co., New York.
Montgomery, D. C. (2001). Introduction to Statistical Quality Control, 4th edition, John Wiley &
Sons.
Nelson, L. S. (1984). The Shewhart control chart - tests for special causes. Journal of Quality
Technology, 16, 237-239.
Pyzdek, Th. (2003). The Six Sigma Handbook, Revised and Expanded, McGraw-Hill, New
York.
Ryan, Th. P. (2000). Statistical Methods for Quality Improvement, Second Edition, Wiley Series
in Probability and Statistics, John Wiley & Sons, New York.
Individual Charts
Use this tool to supervise the production quality, in the case where you have a single
measurement for each point in time. The measurements need to be quantitative variables.
This tool is useful to recap the moving mean and median and the variability of the production
quality that is being measured.
Integrated in this tool, you will find Box-Cox transformations, calculation of process capability
and the application of rules for special causes and Westgard rules (an alternative rule set to
identify special causes) available to complete your analysis.
Description
Control charts were first mentioned in a document by Walter Shewhart that he wrote during his
time working at Bell Labs in 1924. He described his methods completely in his book (1931).
For a long time, there was no significant innovation in the area of control charts. With the
development of CUSUM, UWMA and EWMA charts in 1936, Deming expanded the set of
available control charts.
Control charts were originally used in the area of goods production, and the wording still
comes from that domain. Today this approach is applied in many different fields, for
instance services, human resources, and sales. In the following lines, we use the wording from
the production and shop floors.
Individual charts
The individual charts tool offers you the following chart types alone or in combination:
- X Individual
- MR moving range
An X individual chart is useful to follow the moving average of a production process. Mean
shifts are easily visible in the diagrams.
An MR chart (moving range diagram) is useful to analyze the variability of the production. A
large difference in production, caused for example by the use of different production lines, will
be easily visible.
Note 1: If you want to investigate smaller mean shifts, you can also use CUSUM individual
charts, which are often preferred to individual control charts because they can detect smaller
mean shifts.
Note 2: If you have more than one measurement for each point in time, then you should use
the control charts for subgroups.
Note 3: If you have measurements as qualitative values (for instance ok/not ok, conform/not
conform), then use the control charts for attributes.
This tool offers you the following options for the estimation of the standard deviation (sigma) of
the data set, given n measurements:
- Average moving range: The estimator for sigma is calculated based on the average moving
range, using a window length of m measurements:

$$ s = \overline{MR} / d_2 $$

- Median moving range: The estimator for sigma is calculated based on the median of the
moving ranges, using a window length of m measurements:

$$ s = \widetilde{MR} / d_4 $$

- Standard deviation: The estimator for sigma is calculated based on the standard deviation of
the n measurements:

$$ s = s_n / c_4 $$
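These estimators can be sketched in Python. The d2 constants below are the standard tabulated values for window lengths 2 to 5, and the median-moving-range divisor is shown only for m = 2 (about 0.954); treat the constants as assumptions rather than XLSTAT's internal tables:

```python
import statistics

# Standard control-chart constants (assumed values): d2 for the average
# moving range, and the median moving range divisor for window length 2.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}
D4_MEDIAN = {2: 0.954}

def moving_ranges(x, m=2):
    """Range of each sliding window of m consecutive measurements."""
    return [max(x[i:i + m]) - min(x[i:i + m]) for i in range(len(x) - m + 1)]

def sigma_avg_mr(x, m=2):
    """Sigma estimate from the average moving range: mean(MR) / d2."""
    return statistics.mean(moving_ranges(x, m)) / D2[m]

def sigma_median_mr(x, m=2):
    """Sigma estimate from the median moving range: median(MR) / d4."""
    return statistics.median(moving_ranges(x, m)) / D4_MEDIAN[m]
```

The median variant is less sensitive to an occasional outlying measurement than the average moving range.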
Process capability
Process capability describes a process and indicates whether the process is under control and
whether the distribution of the measured variables is inside the specification limits of the
process. If the distributions of the measured variables are within the technical specification
limits, then the process is called capable.
During the interpretation of the different indicators for the process capability please pay
attention to the fact that some indicators suppose normality or at least symmetry of the
distribution of the measured values. By the use of a normality test, you can verify these
premises (see the Normality Tests in XLSTAT-Pro).
If the data are not normally distributed, you can still obtain results for the process capabilities
as follows:
- Use the Box-Cox transformation to improve the normality of the data set. Then verify again
the normality using a normality test.
Box-Cox transformation
Box-Cox transformation is used to improve the normality of the time series; the Box-Cox
transformation is defined by the following equation:

$$ Y_t = \begin{cases} \dfrac{X_t^{\lambda} - 1}{\lambda}, & X_t > 0,\ \lambda \neq 0 \\ \ln(X_t), & X_t > 0,\ \lambda = 0 \end{cases} $$

where the series {Xt} (t = 1, …, n) is transformed into the series {Yt}.

Note: if λ < 0, the first equation is still valid, but Xt must be strictly positive. XLSTAT accepts a
fixed value of λ, or it can find the value that maximizes the likelihood, the model being a
simple linear model with time as the sole explanatory variable.
Chart rules
XLSTAT offers you the possibility to apply rules for special causes and Westgard rules on the
data set. Two sets of rules are available in order to interpret control charts. You can activate
and deactivate separately the rules in each set.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click this button to display the help.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
Mode tab:
Chart family: Select the type of chart family that you want to use:
Subgroup charts: Activate this option if you have a data set with several
measurements for each point in time.
Individual charts: Activate this option if you have a data set with one quantitative
measurement for each point in time.
Attribute charts: Activate this option if you have a data set with one qualitative
measurement for each point.
Time weighted: Activate this option if you want to use a time weighted chart like
UWMA, EWMA or CUSUM.
At this stage, the individual charts family is selected. If you want to switch to another chart
family, please change the corresponding option and call the help function again to obtain more
details on the available options. The options below correspond to the individual charts:
X Individual chart: Activate this option if you want to calculate the X individual chart to
analyze the mean of the process.
MR Moving Range chart: Activate this option if you want to calculate the MR chart to
analyze variability of the process.
X-MR Individual/Moving Range chart: Activate this option if you want to calculate the
X Individual chart together with the MR chart to analyze the mean value and variability
of the process.
General tab:
Data: Please choose the unique column or row that contains all the data.
Phase: Activate this option to supply one column/row with the phase identifier.
Observation labels: Activate this option if observation labels are available. Then select the
corresponding data. If the Variable labels option is activated, you need to include a header in
the selection. If this option is not activated, the observation labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/Row labels: Activate this option if the first row (column mode) or column (row mode)
of the data selections contains a label.
Options tab:
Bound: Activate this option if you want to enter a maximum value to accept for the
upper control limit of the process. This value will be used when the calculated upper
control limit is greater than the value entered here.
Value: Enter the upper control limit. This value will be used in place of the calculated
upper control limit.
Bound: Activate this option if you want to enter a minimum value to accept for the
lower control limit of the process. This value will be used when the calculated lower
control limit is lower than the value entered here.
Value: Enter the lower control limit. This value will be used in place of the calculated
lower control limit.
Calculate Process capabilities: Activate this option to calculate process capabilities based
on the input data (see the description section for more details).
USL: If the calculation of the process capabilities is activated, please enter here the upper
specification limit (USL) of the process.
LSL: If the calculation of the process capabilities is activated, please enter here the lower
specification limit (LSL) of the process.
Target: If the calculation of the process capabilities is activated, activate this option to add the
target value of the process.
Confidence interval (%): If the Calculate Process Capabilities option is activated, please
enter the percentage range of the confidence interval to use for calculating the confidence
interval around the parameters. Default value: 95.
Box-Cox: Activate this option to compute the Box-Cox transformation. You can either fix the
value of the Lambda parameter, or decide to let XLSTAT optimize it (see the description
section for further details).
k Sigma: Activate this option to enter the distance between the upper and the lower control
limit and the center line of the control chart. The distance is fixed to k times the factor you enter
multiplied by the estimated standard deviation. Corrective factors according to Burr (1969) will
be applied.
alpha: Activate this option to enter the size of the confidence range around the center line of
the control chart. The alpha is used to compute the upper and lower control limits. (100 - alpha)%
of the distribution of the control chart statistic lies inside the control limits. Corrective factors
according to Burr (1969) will be applied.
Mean: Activate this option to enter a value for the center line of the control chart. This value
should be based on historical data.
Sigma: Activate this option to enter a value for the standard deviation of the control chart. This
value should be based on historical data. If this option is activated, then you cannot choose an
estimation method for the standard deviation in the Estimation tab.
Estimation tab:
Method for Sigma: Select an option to determine the estimation method for the standard
deviation of the control chart (see the description section for further details):
Average Moving Range: The estimator of sigma is calculated using the average moving
range.
Median Moving Range: The estimator of sigma is calculated using the median moving
range.
o MR Length: Change this value to modify the number of observations that are
taken into account in the moving range.
Standard deviation: The estimator of sigma is calculated using the standard deviation
of the n measurements.
Outputs tab:
Display zones: Activate this option to display, in addition to the lower and upper control limits,
the limits of the zones A and B.
Normality Tests: Activate this option to check the normality of the data (see the Normality
Tests tool for further details).
Significance level (%): Enter the significance level for the tests.
Test special causes: Activate this option to analyze the points of the control chart according
to the rules for special causes. You can activate the following rules independently:
Apply Westgard rules: Activate this option to analyze the points of the control chart according
to the Westgard rules. You can activate the following rules independently:
Rule 1 2s
Rule 1 3s
Rule 2 2s
Rule R 4s
Rule 4 1s
Rule 10 X
Charts tab:
Display charts: Activate this option to display the control charts graphically.
Continuous line: Activate this option to connect the points in the control chart.
Connect through missing: Activate this option to connect the points in the control charts,
even when missing values are between the points.
Display a distribution: Activate this option to compare the histograms of the selected samples
with a density function.
Run Charts: Activate this option to display a chart of the latest data points. Each individual
measurement is displayed.
Number of observations: Enter the maximal number of the last observations to be displayed
in the Run chart.
Results
Estimation:
Estimated mean: This table displays the estimated mean values for the different phases.
Estimated standard deviation: This table displays the estimated standard deviation values
for the different phases.
Box-Cox transformation:
Estimates of the parameters of the model: This table is available only if the Lambda
parameter has been optimized. It displays the estimator for Lambda.
Series before and after transformation: This table displays the series before and after
transformation. If Lambda has been optimized, the transformed series corresponds to the
residuals of the model. If it has not, then the transformed series is the direct application of the
Box-Cox transformation.
Process capability:
Process capabilities: These tables are displayed if the process capability option has been
selected. There is one table for each phase. Each table contains the following indicators for the
process capability and, if possible, the corresponding confidence intervals: Cp, Cpl, Cpu, Cpk,
Pp, Ppl, Ppu, Ppk, Cpm, Cpm (Boyle), Cp 5.5, Cpk 5.5, Cpmk, and Cs (Wright).
For Cp, Cpl, and Cpu, information about the process performance is supplied, and for Cp,
status information is given to facilitate the interpretation.
Cp values have the following status based on Ekvall and Juran (1974):
Based on Montgomery (2001), Cp needs to have the following minimal values for the process
performance to be as expected:
1.50 for new processes or for existing processes when the variable is critical
Based on Montgomery (2001), Cpu and Cpl need to have the following minimal values for
process performance to be as expected:
1.45 for new processes or for existing processes when the variable is critical
Capabilities: This chart contains information about the specification and control limits. A line
between the lower and upper limits represents the interval, with an additional vertical mark for
the center line. The different control limits of each phase are drawn separately.
Chart information:
The following results are displayed separately for each requested chart. Charts can be
selected alone or in combination with the X individual chart.
X Individual / MR moving range chart: This table contains information about the center line
and the upper and lower control limits of the selected chart. There will be one column for each
phase.
Observation details: This table displays detailed information for each observation. For each
observation, the corresponding phase, the mean or median, the center line, the lower and
upper control limits are displayed. If the information about the zones A, B and C are activated,
then the lower and upper control limits of the zones A and B are displayed as well.
Rule details: If the rules options are activated, a detailed table about the rules will be
displayed. For each observation, there is one row for each rule that applies. "Yes" indicates
that the corresponding rule was fired, and "No" indicates that the rule does not apply.
X Individual / MR moving range Chart: If the charts are activated, then a chart containing the
information of the two tables above is displayed. Each observation is displayed. The center line
and the lower and upper control limits are displayed as well. If the corresponding options have
been activated, the lower and upper control limits for the zones A and B are included and there
are labels for the observations for which rules were fired. A legend with the activated rules and
the corresponding rule number is displayed below the chart.
Normality tests:
For each of the four tests, the statistics relating to the test are displayed including, in particular,
the p-value which is afterwards used in interpreting the test by comparing with the chosen
significance threshold.
Histograms: The histograms are displayed. If desired, you can change the color of the lines,
scales, titles as with any Excel chart.
Example
A tutorial explaining how to use the SPC individual charts tool is available on the Addinsoft
web site. To consult the tutorial, please go to:
http://www.xlstat.com/demo-spc2.htm
References
Burr, I. W. (1967). The effect of non-normality on constants for X and R charts. Industrial
Quality Control, 23(11), 563-569.
Burr, I. W. (1969). Control charts for measurements with varying sample sizes. Journal of
Quality Technology, 1(3), 163-167.
Deming, W. E. (1993). The New Economics for Industry, Government, and Education.
Cambridge, MA: Center for Advanced Engineering Study, Massachusetts Institute of
Technology.
Ekvall, D. N. (1974). Manufacturing Planning. In Quality Control Handbook, 3rd Ed. (J. M.
Juran et al., eds.), pp. 9-22-39, McGraw-Hill Book Co., New York.
Montgomery, D.C. (2001), Introduction to Statistical Quality Control, 4th edition, John Wiley &
Sons.
Nelson, L.S. (1984), "The Shewhart Control Chart - Tests for Special Causes," Journal of
Quality Technology, 16, 237-239.
Pyzdek Th. (2003). The Six Sigma Handbook Revised and Expanded, McGraw Hill, New
York.
Ryan Th. P. (2000). Statistical Methods for Quality Improvement, Second Edition, Wiley Series
in probability and statistics, John Wiley & Sons, New York.
Attribute charts
Use this tool to supervise production quality in the case where you have a single
measurement for each point in time. The measurements are attributes or attribute counts of
the process.
This tool is useful to recap the categorical variables of the measured production quality.
Integrated in this tool, you will find Box-Cox transformations, calculation of process capability
and the application of rules for special causes and Westgard rules (an alternative rule set to
identify special causes) available to complete your analysis.
Description
Control charts were first mentioned in a document by Walter Shewhart that he wrote during his
time working at Bell Labs in 1924. He described his methods completely in his book (1931).
For a long time, there was no significant innovation in the area of control charts. With the
development of CUSUM, UWMA and EWMA charts in 1936, Deming expanded the set of
available control charts.
Control charts were originally used in the area of goods production, and the terminology still
comes from that domain. Today this approach is applied to a large number of different fields,
for instance services, human resources, and sales. In the following chapters we will use the
wording of production shop floors.
Attribute charts
The attribute charts tool offers you the following chart types:
- P chart
- NP chart
- C chart
- U chart
These charts analyze either nonconforming products or nonconformities. They are usually
used to inspect the quality before delivery (outgoing products) or the quality at delivery
(incoming products). Not all products necessarily need to be inspected.
Inspections are done on inspection units of a well-defined size. The size can be 1, as when
television sets are received at a warehouse, or 24, as when peaches are delivered in crates
of 24.
555
P and NP charts analyze, respectively, the fraction and the absolute number of
nonconforming products of a production process. For example, we can count the number of
nonconforming television sets, or the number of crates that contain at least one bruised peach.
C and U charts analyze, respectively, the absolute number and the rate of nonconformities
per inspection unit. For example, we can count the number of defective transistors per
inspection unit (there might be more than one faulty transistor in one television set), or the
number of bruised peaches per crate.
A P chart is useful to follow the fraction of nonconforming units of a production process.
An NP chart is useful to follow the absolute number of nonconforming units of a production
process.
A C chart is useful when each inspection unit has a constant size. It can be used to follow the
absolute number of nonconformities per inspection unit.
A U chart is useful when the size of the inspection unit is not constant. It can be used to follow
the rate of nonconformities per inspection unit.
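As a worked illustration of these definitions, the textbook center line and 3-sigma limits of a P chart can be sketched as follows (a minimal sketch in Python; the function name is hypothetical, and XLSTAT additionally applies corrective factors such as those of Burr (1969), which are not reproduced here):

```python
import math

def p_chart_limits(defectives, sizes, k=3.0):
    """Center line and k-sigma control limits of a P chart.

    defectives[i]: nonconforming units found in subgroup i
    sizes[i]:      units inspected in subgroup i
    """
    p_bar = sum(defectives) / sum(sizes)          # overall fraction nonconforming
    limits = []
    for n in sizes:
        half = k * math.sqrt(p_bar * (1 - p_bar) / n)
        limits.append((max(0.0, p_bar - half), min(1.0, p_bar + half)))
    return p_bar, limits

# 5 subgroups of 100 inspected units each
p_bar, limits = p_chart_limits([4, 6, 5, 3, 7], [100] * 5)
```

An NP chart plots the counts instead: its center line is n * p_bar and its limits are n * p_bar +/- 3 * sqrt(n * p_bar * (1 - p_bar)).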
Process capability
Process capability describes a process and indicates whether the process is under control and
whether the distribution of the measured variables lies within the specification limits of the
process. If the distribution of the measured variables is within the technical specification
limits, the process is called capable.
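For the most common indicators, the standard textbook definitions can be sketched as follows (a hypothetical helper, assuming a normal, in-control process; XLSTAT reports many more indicators and their confidence intervals):

```python
def cp_cpk(usl, lsl, mean, sigma):
    """Short-term capability indices from spec limits, process mean and sigma."""
    cp  = (usl - lsl) / (6 * sigma)      # potential capability (spread only)
    cpu = (usl - mean) / (3 * sigma)     # one-sided, upper
    cpl = (mean - lsl) / (3 * sigma)     # one-sided, lower
    cpk = min(cpu, cpl)                  # penalizes off-center processes
    return cp, cpl, cpu, cpk

# Specification 10 +/- 0.3, process running at 10.05 with sigma = 0.08
cp, cpl, cpu, cpk = cp_cpk(10.3, 9.7, 10.05, 0.08)
```

A process centered exactly on the specification midpoint has cpk equal to cp; the further the mean drifts toward one limit, the more cpk drops below cp.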
When interpreting the different process capability indicators, please pay attention to the fact
that some indicators assume normality, or at least symmetry, of the distribution of the
measured values. You can verify these premises using a normality test (see the Normality
Tests tool in XLSTAT-Pro).
If the data are not normally distributed, you have the following possibilities to obtain results for
the process capabilities.
- Use the Box-Cox transformation to improve the normality of the data set. Then verify again
the normality using a normality test.
Box-Cox transformation
Box-Cox transformation is used to improve the normality of the time series; the Box-Cox
transformation is defined by the following equation:
Yt = (Xt^λ − 1) / λ,  if Xt ≥ 0, λ ≠ 0
Yt = ln(Xt),  if Xt > 0, λ = 0

where the series {Xt} (t = 1, …, n) is transformed into the series {Yt}.

Note: if λ < 0 the first equation is still valid, but Xt must be strictly positive. XLSTAT accepts a
fixed value of λ, or it can find the value that maximizes the likelihood, the model being a
simple linear model with time as the sole explanatory variable.
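A direct sketch of this transformation (hypothetical helper; XLSTAT either uses a fixed lambda supplied by the user or optimizes it by maximum likelihood, which is not shown here):

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a series, following the equation above."""
    if lam == 0:
        return [math.log(v) for v in x]          # requires v > 0
    return [(v ** lam - 1) / lam for v in x]     # v must be > 0 when lam < 0

transformed = box_cox([1.0, 2.0, 4.0, 8.0], 0.5)
```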
Chart rules
XLSTAT offers you the possibility to apply rules for special causes and Westgard rules on the
data set. Two sets of rules are available in order to interpret control charts. You can activate
and deactivate separately the rules in each set.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
Mode tab:
Chart family: Select the type of chart family that you want to use:
Subgroup charts: Activate this option if you have a data set with several
measurements for each point in time.
Individual charts: Activate this option if you have a data set with one quantitative
measurement for each point in time.
Attribute charts: Activate this option if you have a data set with one qualitative
measurement for each point.
Time weighted: Activate this option if you want to use a time weighted chart like
UWMA, EWMA or CUSUM.
At this stage, the attribute charts family is selected. If you want to switch to another chart
family, please change the corresponding option and call the help function again if you want to
obtain more details on the available options. The options below correspond to the attribute
charts.
Chart type: Select the type of chart you want to use (see the description section for more
details):
P chart
NP chart
C chart
U chart
General tab:
Data: Please choose the unique column or row that contains all the data.
Phase: Activate this option to supply one column/row with the phase identifier.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/Row labels: Activate this option if the first row (column mode) or column (row mode)
of the data selections contains a label.
Options tab:
Bound: Activate this option if you want to enter a maximum value to accept for the
upper control limit of the process. This value will be used when the calculated upper
control limit is greater than the value entered here.
Value: Enter the upper control limit. This value will be used and overrides the
calculated upper control limit.
Bound: Activate this option if you want to enter a minimum value to accept for the
lower control limit of the process. This value will be used when the calculated lower
control limit is smaller than the value entered here.
Value: Enter the lower control limit. This value will be used and overrides the
calculated lower control limit.
Calculate Process capabilities: Activate this option to calculate process capabilities based
on the input data (see the description section for more details).
USL: If the calculation of the process capabilities is activated, please enter here the upper
specification limit (USL) of the process.
LSL: If the calculation of the process capabilities is activated, please enter here the lower
specification limit (LSL) of the process.
Target: If the calculation of the process capabilities is activated, activate this option to add the
target value of the process.
Confidence interval (%): If the Calculate Process Capabilities option is activated, please
enter the percentage range of the confidence interval to use for calculating the confidence
interval around the parameters. Default value: 95.
Box-Cox: Activate this option to compute the Box-Cox transformation. You can either fix the
value of the Lambda parameter, or decide to let XLSTAT optimize it (see the description
section for further details).
k Sigma: Activate this option to enter the distance between the upper and the lower control
limit and the center line of the control chart. The distance is fixed to k times the factor you enter
multiplied by the estimated standard deviation. Corrective factors according to Burr (1969) will
be applied.
alpha: Activate this option to enter the size of the confidence range around the center line of
the control chart. The alpha is used to compute the upper and lower control limits:
100(1 − alpha)% of the distribution of the control chart statistic lies inside the control limits.
Corrective factors according to Burr (1969) will be applied.
P bar / C bar / U bar: Activate this option to enter a value for the center line of the control
chart. This value should be based on historical data.
Outputs tab:
Display zones: Activate this option to display, in addition to the lower and upper control
limits, the limits of zones A and B.
Normality Tests: Activate this option to check normality of the data. (see the Normality Tests
tool for further details).
Significance level (%): Enter the significance level for the tests.
Test special causes: Activate this option to analyze the points of the control chart according
to the rules for special causes. You can activate the following rules independently:
None: Click this button to deselect all.
Apply Westgard rules: Activate this option to analyze the points of the control chart according
to the Westgard rules. You can activate the following rules independently:
Rule 1 2s
Rule 1 3s
Rule 2 2s
Rule R 4s
Rule 4 1s
Rule 10 X
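As an illustration, two of these rules can be checked on standardized values z = (x − center line) / sigma as follows (a minimal sketch; the function name and the exact flagging conventions are illustrative, not XLSTAT's implementation):

```python
def westgard_flags(z):
    """Flag violations of two common Westgard rules on standardized values z."""
    flags = {"1_3s": [], "2_2s": []}
    for i, v in enumerate(z):
        if abs(v) > 3:                        # 1_3s: one point beyond 3 SD
            flags["1_3s"].append(i)
        if i > 0 and min(z[i - 1], v) > 2:    # 2_2s: two consecutive points > +2 SD
            flags["2_2s"].append(i)
        if i > 0 and max(z[i - 1], v) < -2:   # ... or two consecutive points < -2 SD
            flags["2_2s"].append(i)
    return flags
```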
Charts tab:
Display charts: Activate this option to display the control charts graphically.
Continuous line: Activate this option to connect the points in the control chart.
Connect through missing: Activate this option to connect the points in the control charts,
even when missing values are between the points.
Display a distribution: Activate this option to compare histograms of samples selected with a
density function.
Run Charts: Activate this option to display a chart of the latest data points. Each individual
measurement is displayed.
Number of observations: Enter the maximal number of the last observations to be displayed
in the Run chart.
Results
Estimation:
Estimated mean: This table displays the estimated mean values for the different phases.
Estimated standard deviation: This table displays the estimated standard deviation values
for the different phases.
Box-Cox transformation:
Estimates of the parameters of the model: This table is available only if the Lambda
parameter has been optimized. It displays the estimator for Lambda.
Series before and after transformation: This table displays the series before and after
transformation. If Lambda has been optimized, the transformed series corresponds to the
residuals of the model. If it hasn't been, the transformed series is the direct application of the
Box-Cox transformation.
Process capability:
Process capabilities: These tables are displayed, if the process capability option has been
selected. There is one table for each phase. A table contains the following indicators for the
process capability and if possible the corresponding confidence intervals: Cp, Cpl, Cpu, Cpk,
Pp, Ppl, Ppu, Ppk, Cpm, Cpm (Boyle), Cp 5.5, Cpk 5.5, Cpmk, and Cs (Wright).
For Cp, Cpl, and Cpu, information about the process performance is supplied, and for Cp a
status is given to facilitate the interpretation.
Cp values have the following status based on Ekvall and Juran (1974):
Based on Montgomery (2001), Cp needs to have the following minimal values for the process
performance to be as expected:
1.50 for new processes or for existing processes when the variable is critical
Based on Montgomery (2001), Cpu and Cpl need to have the following minimal values for
process performance to be as expected:
1.45 for new processes or for existing processes when the variable is critical
Capabilities: This chart contains information about the specification and control limits. A line
between the lower and upper limits represents the interval, with an additional vertical mark for
the center line. The different control limits of each phase are drawn separately.
Chart information:
The following results are displayed separately for each requested chart. Charts can be
selected alone or in combination with the X attribute chart.
P / NP / C / U chart: This table contains information about the center line and the upper and
lower control limits of the selected chart. There will be one column for each phase.
Observation details: This table displays detailed information for each observation. For each
observation the corresponding phase, the value for P, NP, C or U, the subgroup size, the
center line, the lower and upper control limits are displayed. If the information about the zones
A, B and C are activated, then the lower and upper control limits of the zones A and B are
displayed as well.
Rule details: If the rules options are activated, a detailed table about the rules will be
displayed. For each subgroup, there is one row for each rule that applies. "Yes" indicates that
the corresponding rule was fired, and "No" indicates that it was not.
P / NP / C / U Chart: If the charts are activated, then a chart containing the information of the
two tables above is displayed. The center line and the lower and upper control limits are
displayed as well. If the corresponding options have been activated, the lower and upper
control limits for the zones A and B are included and there are labels for the subgroups for
which rules were fired. A legend with the activated rules and the corresponding rule number is
displayed below the chart.
Normality tests:
For each of the four tests, the statistics relating to the test are displayed including, in particular,
the p-value which is afterwards used in interpreting the test by comparing with the chosen
significance threshold.
Histograms: The histograms are displayed. If desired, you can change the color of the lines,
scales, titles as with any Excel chart.
Example
A tutorial explaining how to use the attributes charts tool is available on the Addinsoft web site.
To consult the tutorial, please go to:
http://www.xlstat.com/demo-spc3.htm
References
Burr, I. W. (1967). The effect of non-normality on constants for X and R charts. Industrial
Quality Control, 23(11), 563-569.
Burr, I. W. (1969). Control charts for measurements with varying sample sizes. Journal of
Quality Technology, 1(3), 163-167.
Deming, W. E. (1993). The New Economics for Industry, Government, and Education.
Cambridge, MA: Center for Advanced Engineering Study, Massachusetts Institute of
Technology.
Ekvall, D. N. (1974). Manufacturing Planning. In Quality Control Handbook, 3rd Ed. (J. M.
Juran et al., eds.), pp. 9-22-39, McGraw-Hill Book Co., New York.
Montgomery, D.C. (2001), Introduction to Statistical Quality Control, 4th edition, John Wiley &
Sons.
Nelson, L.S. (1984), "The Shewhart Control Chart - Tests for Special Causes," Journal of
Quality Technology, 16, 237-239.
Pyzdek Th. (2003). The Six Sigma Handbook Revised and Expanded, McGraw Hill, New
York.
Ryan Th. P. (2000). Statistical Methods for Quality Improvement, Second Edition, Wiley Series
in probability and statistics, John Wiley & Sons, New York.
Time Weighted Charts
Use this tool to supervise production quality, in the case where you have a group of
measurements or a single measurement for each point in time. The measurements need to be
quantitative variables.
This tool is useful to recap the mean and the variability of the measured production quality.
Integrated in this tool, you will find Box-Cox transformations, calculation of process capability
and the application of rules for special causes and Westgard rules (an alternative rule set to
identify special causes) available to complete your analysis.
Description
Control charts were first mentioned in a document by Walter Shewhart that he wrote during his
time working at Bell Labs in 1924. He described his methods completely in his book (1931).
For a long time, there was no significant innovation in the area of control charts. With the
development of CUSUM, UWMA and EWMA charts in 1936, Deming expanded the set of
available control charts.
Control charts were originally used in the area of goods production, and the terminology still
comes from that domain. Today this approach is applied to a large number of different fields,
for instance services, human resources, and sales. In the following chapters we will use the
wording of production shop floors.
The time weighted charts tool offers you the following chart types:
- CUSUM chart
- UWMA chart
- EWMA chart
A CUSUM, UWMA or EWMA chart is useful to follow the mean of a production process. Mean
shifts are easily visible in the diagrams.
These charts are not directly based on the raw data. They are based on the smoothed data.
In the case of UWMA charts, the data are smoothed using uniform weighting within a moving
window. The chart is then analyzed like a Shewhart chart.
In the case of EWMA charts, the data are smoothed using exponential weighting. The chart is
then analyzed like a Shewhart chart.
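The EWMA smoothing step can be sketched as follows (a minimal sketch; the weight lam corresponds to the Weight option, z0 to the starting value, and the control limits applied to the smoothed series are not computed here):

```python
def ewma(x, lam=0.2, z0=0.0):
    """EWMA recursion z_t = lam * x_t + (1 - lam) * z_{t-1}."""
    z, prev = [], z0
    for v in x:
        prev = lam * v + (1 - lam) * prev
        z.append(prev)
    return z

smoothed = ewma([10.2, 9.8, 10.4, 11.5], lam=0.2, z0=10.0)
```

A small lam gives heavy smoothing (good at detecting small, persistent shifts); lam = 1 reproduces the raw data and the chart degenerates into an individuals chart.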
CUSUM charts
These charts are not directly based on the raw data. They are based on the normalized data.
These charts help to detect mean shifts at a user-defined granularity. The granularity is
defined by the design parameter k, which is half of the mean shift to be detected. To detect a
1 sigma shift, k is set to 0.5.
Two kinds of CUSUM charts can be drawn: one-sided and two-sided charts. In the case of a
one-sided CUSUM chart, the upper and lower cumulated sums SH and SL are calculated
recursively. If SH or SL exceeds the threshold h, a mean shift is detected. The value of h can
be chosen by the user (h is usually set to 4 or 5).
The initial value of SH and SL at the beginning of the calculation and after detecting a mean
shift is usually 0. Using the option FIR (Fast Initial Response) can change this initial value to a
user defined value.
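The one-sided scheme described above can be sketched as follows (a minimal sketch on standardized data; the function name is hypothetical, and the fir argument mimics the Fast Initial Response option):

```python
def cusum_signals(z, k=0.5, h=4.0, fir=0.0):
    """One-sided tabular CUSUM on standardized data z.

    SH_t = max(0, SH_{t-1} + z_t - k), SL_t = max(0, SL_{t-1} - z_t - k);
    a mean shift is signaled at t when SH_t > h or SL_t > h.
    """
    sh = sl = fir                       # FIR "head start"; default 0
    signals = []
    for t, v in enumerate(z):
        sh = max(0.0, sh + v - k)       # accumulates upward shifts
        sl = max(0.0, sl - v - k)       # accumulates downward shifts
        if sh > h or sl > h:
            signals.append(t)
            sh = sl = fir               # restart after a detection
    return signals
```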
In the case of a two-sided CUSUM chart, the normalized data are calculated. The upper and
lower control limits are called a U mask or V mask; these names refer to the shape the control
limits draw on the chart. For a given data point, the maximal upper and lower limits for mean
shift detection are calculated backwards and drawn on the chart in the form of a U or V mask.
By default, the origin of the mask is the last data point; the user can change this with the
Origin option.
This tool offers you the following options for the estimation of the standard deviation (sigma) of
the data set, given k subgroups and ni (i = 1, …, k) measurements per subgroup:

- Pooled standard deviation: The estimator for sigma is calculated based on all the
measurements pooled together:

s = sqrt( Σi (ni − 1) si² / Σi (ni − 1) ) / c4

- R bar: The estimator for sigma is calculated based on the average range of the k subgroups:

s = Rbar / d2

- S bar: The estimator for sigma is calculated based on the average of the standard deviations
of the k subgroups:

s = ( (1/k) Σi si ) / c4

- Average moving range: The estimator for sigma is calculated based on the average moving
range using a window length of m measurements:

s = MRbar / d2

- Median moving range: The estimator for sigma is calculated based on the median of the
moving range using a window length of m measurements:

s = median(MR) / d4

- Standard deviation: The estimator for sigma is calculated based on the standard deviation ŝ
of the n measurements:

s = ŝ / c4

where c4, d2 and d4 are the usual bias-correction constants for the corresponding sample
sizes.
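Two of these estimators can be sketched as follows (hypothetical helpers; the d2 constants are the standard bias-correction factors, hard-coded here for their usual window or subgroup sizes rather than looked up in a full table):

```python
def sigma_from_moving_range(x, d2=1.128):
    """Sigma from the average moving range of consecutive individual values.

    d2 = 1.128 is the bias-correction constant for ranges of size 2.
    """
    mr = [abs(x[i] - x[i - 1]) for i in range(1, len(x))]
    return (sum(mr) / len(mr)) / d2

def sigma_from_rbar(ranges, d2=2.326):
    """Sigma from the average subgroup range R bar.

    d2 depends on the subgroup size (e.g. d2 = 2.326 for subgroups of 5).
    """
    return (sum(ranges) / len(ranges)) / d2
```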
Box-Cox transformation
Box-Cox transformation is used to improve the normality of the time series; the Box-Cox
transformation is defined by the following equation:
Yt = (Xt^λ − 1) / λ,  if Xt ≥ 0, λ ≠ 0
Yt = ln(Xt),  if Xt > 0, λ = 0

where the series {Xt} (t = 1, …, n) is transformed into the series {Yt}.

Note: if λ < 0 the first equation is still valid, but Xt must be strictly positive. XLSTAT accepts a
fixed value of λ, or it can find the value that maximizes the likelihood, the model being a
simple linear model with time as the sole explanatory variable.
Process capability
Process capability describes a process and indicates whether the process is under control and
whether the distribution of the measured variables lies within the specification limits of the
process. If the distribution of the measured variables is within the technical specification
limits, the process is called capable.
When interpreting the different process capability indicators, please pay attention to the fact
that some indicators assume normality, or at least symmetry, of the distribution of the
measured values. You can verify these premises using a normality test (see the Normality
Tests tool in XLSTAT-Pro).
If the data are not normally distributed, you have the following possibilities to obtain results for
the process capabilities.
- Use the Box-Cox transformation to improve the normality of the data set. Then verify again
the normality using a normality test.
Chart rules
XLSTAT offers you the possibility to apply rules for special causes and Westgard rules on the
data set. Two sets of rules are available in order to interpret control charts. You can activate
and deactivate separately the rules in each set.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click this button to delete the data selections.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
Mode tab:
Chart family: Select the type of chart family that you want to use:
Subgroup charts: Activate this option if you have a data set with several
measurements for each point in time.
Individual charts: Activate this option if you have a data set with one quantitative
measurement for each point in time.
Attribute charts: Activate this option if you have a data set with one qualitative
measurement for each point.
Time weighted: Activate this option if you want to use a time weighted chart like
UWMA, EWMA or CUSUM.
At this stage, the time weighted charts family is selected. If you want to switch to another chart
family, please change the corresponding option and call the help function again if you want to
obtain more details on the available options. The options below correspond to the time
weighted charts.
Chart type: Select the type of chart you want to use (see the description section for more
details):
CUSUM chart
UWMA chart
EWMA chart
General tab:
Columns/Rows: Activate this option for XLSTAT to take each column (in column mode)
or each row (in row mode) as a separate measurement that belongs to the same
subgroup.
Data: If the data format One column/row is selected, please choose the unique column or
row that contains all the data. The assignment of the data to their corresponding subgroup
must be specified using the Groups field or setting the common subgroup size. If you select
the data Columns/rows option, please select a data area with one column/row per
measurement in a subgroup.
Groups: If the data format one column/row is selected, then activate this option to select a
column/row that contains the group identifier. Select the data that identify the corresponding
group for each element of the data selection.
Common subgroup size: If the data format One column/row is selected and the subgroup
size is constant, then you can deactivate the groups option and enter in this field the common
subgroup size.
Phase: Activate this option to supply one column/row with the phase identifier.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/Row labels: Activate this option if the first row (column mode) or column (row mode)
of the data selections contains a label.
Standardize: In the case of a CUSUM chart, please activate this option to display the
cumulated sums and the control limits in normalized form.
Target: In the case of a CUSUM chart, please activate this option to enter the target value that
will be used during the normalization of the data. Default value is the estimated mean.
Weight: In the case of an EWMA chart, please activate this option to enter the weight factor of
the exponential smoothing.
MA Length: In the case of a UWMA chart, please activate this option to enter the length of the
window of the moving average.
Options tab:
Bound: Activate this option if you want to enter a maximum value to accept for the
upper control limit of the process. This value will be used when the calculated upper
control limit is greater than the value entered here.
Value: Enter the upper control limit. This value will be used and overrides the
calculated upper control limit.
Bound: Activate this option if you want to enter a minimum value to accept for the
lower control limit of the process. This value will be used when the calculated lower
control limit is smaller than the value entered here.
Value: Enter the lower control limit. This value will be used and overrides the
calculated lower control limit.
Calculate Process capabilities: Activate this option to calculate process capabilities based
on the input data (see the description section for more details).
USL: If the calculation of the process capabilities is activated, please enter here the upper
specification limit (USL) of the process.
LSL: If the calculation of the process capabilities is activated, please enter here the lower
specification limit (LSL) of the process.
Target: If the calculation of the process capabilities is activated, activate this option to add the
target value of the process.
Confidence interval (%): If the Calculate Process Capabilities option is activated, please
enter the percentage range of the confidence interval to use for calculating the confidence
interval around the parameters. Default value: 95.
Box-Cox: Activate this option to compute the Box-Cox transformation. You can either fix the
value of the Lambda parameter, or decide to let XLSTAT optimize it (see the description
section for further details).
k Sigma: Activate this option to enter the distance between the upper and the lower control
limit and the center line of the control chart. The distance is fixed to k times the factor you enter
multiplied by the estimated standard deviation. Corrective factors according to Burr (1969) will
be applied.
alpha: Activate this option to enter the size of the confidence range around the center line of
the control chart. The alpha is used to compute the upper and lower control limits:
100(1 − alpha)% of the distribution of the control chart statistic lies inside the control limits.
Corrective factors according to Burr (1969) will be applied.
Mean: Activate this option to enter a value for the center line of the control chart. This value
should be based on historical data.
Sigma: Activate this option to enter a value for the standard deviation of the control chart. This
value should be based on historical data. If this option is activated, then you cannot choose an
estimation method for the standard deviation in the Estimation tab.
Estimation tab:
Method for Sigma: Select an option to determine the estimation method for the standard
deviation of the control chart (see the description section for further details):
Pooled standard deviation: The standard deviation is calculated using all available
measurements. That means having n subgroups with k measurements for each
subgroup, all the n * k measurements will be weighted equally to calculate the standard
deviation.
R bar: The estimator of sigma is calculated using the average range of the n
subgroups.
S bar: The estimator of sigma is calculated using the average standard deviation of the
n subgroups.
Average Moving Range: The estimator of sigma is calculated using the average
moving range using a window length of m measurements.
Median Moving Range: The estimator of sigma is calculated using the median of the
moving range using a window length of m measurements.
o MR Length: Activate this option to change the window length of the moving
range.
Standard deviation: The estimator of sigma is calculated using the standard deviation
of the n measurements.
Design tab:
Scheme: Choose one of the following options depending on the kind of chart that you want (see
the description section for further details):
One sided (LCL/UCL): The upper and lower cumulated sums are calculated separately
for each point.
o FIR: Activate this option to change the initial value of the upper and lower
cumulated sum. Default value is 0.
Two sided (U-Mask): The normalized values are displayed. Starting from the origin
point, the upper and lower limits for mean shift detection are displayed backwards in
the form of a mask.
o Origin: Activate this option to change the origin of the mask. Default value is
the last data point.
Design: In this section you can determine the parameters of the mean shift detection (see the
description section for further details):
h: Enter the threshold for the upper and lower cumulated sum or mask, above which a
mean shift is detected.
k: Enter the granularity of the mean shift detection. k is half of the mean shift to be
detected. Default value is 0.5 to detect 1 sigma mean shifts.
Outputs tab:
Display zones: Activate this option to display, in addition to the lower and upper control
limits, the limits of zones A and B.
Normality Tests: Activate this option to check the normality of the data (see the Normality
Tests tool for further details).
Significance level (%): Enter the significance level for the tests.
Test special causes: Activate this option to analyze the points of the control chart according
to the rules for special causes. You can activate the following rules independently:
Apply Westgard rules: Activate this option to analyze the points of the control chart according
to the Westgard rules. You can activate the following rules independently:
Rule 1 2s
Rule 1 3s
Rule 2 2s
Rule R 4s
Rule 4 1s
Rule 10 x
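As an illustration of how such rules are evaluated, the sketch below checks a few of the Westgard rules on a sequence of standardized results (a simplified reading of the rules for illustration; XLSTAT's exact definitions may differ in detail):

```python
def westgard_flags(z):
    """Return the names of the Westgard rules violated by the sequence of
    standardized results z (in standard deviation units)."""
    fired = set()
    for i, v in enumerate(z):
        if abs(v) > 2:
            fired.add("1 2s")          # one point beyond 2 SD (warning rule)
        if abs(v) > 3:
            fired.add("1 3s")          # one point beyond 3 SD
        if i > 0:
            prev = z[i - 1]
            if (v > 2 and prev > 2) or (v < -2 and prev < -2):
                fired.add("2 2s")      # two consecutive points beyond 2 SD, same side
            if (v > 2 and prev < -2) or (v < -2 and prev > 2):
                fired.add("R 4s")      # two consecutive points more than 4 SD apart
    return fired

print(sorted(westgard_flags([0.1, 2.5, 2.6])))   # fires "1 2s" and "2 2s"
```

The 4 1s and 10 x rules work the same way, over runs of four points beyond 1 SD and ten points on the same side of the mean, respectively.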
Charts tab:
Display charts: Activate this option to display the control charts graphically.
Continuous line: Activate this option to connect the points in the control chart.
Box view: Activate this option to display the control charts using bars.
Connect through missing: Activate this option to connect the points in the control charts,
even when missing values are between the points.
Display a distribution: Activate this option to compare the histograms of the selected samples
with a density function.
Run Charts: Activate this option to display a chart of the latest data points. Each individual
measurement is displayed.
Results
Estimation:
Estimated mean: This table displays the estimated mean values for the different phases.
Estimated standard deviation: This table displays the estimated standard deviation values
for the different phases.
Box-Cox transformation:
Estimates of the parameters of the model: This table is available only if the Lambda
parameter has been optimized. It displays the estimator for Lambda.
Series before and after transformation: This table displays the series before and after
transformation. If Lambda has been optimized, the transformed series corresponds to the
residuals of the model. If it has not, the transformed series is the direct application of the
Box-Cox transformation.
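For reference, the direct application of the Box-Cox transformation (the case where Lambda is not optimized) can be sketched as follows; the data must be strictly positive:

```python
from math import log

def box_cox(y, lam):
    """Box-Cox transform: (y^lam - 1)/lam for lam != 0, log(y) for lam == 0.
    Requires strictly positive data."""
    if lam == 0:
        return [log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

print(box_cox([1.0, 2.0, 4.0], lam=1.0))   # linear case: values shifted by -1
print(box_cox([1.0, 2.0, 4.0], lam=0.0))   # log transform
```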
Process capabilities:
Process capabilities: These tables are displayed, if the process capability option has been
selected. There is one table for each phase. A table contains the following indicators for the
process capability and if possible the corresponding confidence intervals: Cp, Cpl, Cpu, Cpk,
Pp, Ppl, Ppu, Ppk, Cpm, Cpm (Boyle), Cp 5.5, Cpk 5.5, Cpmk, and Cs (Wright).
For Cp, Cpl, and Cpu, information about the process performance is supplied, and for Cp
status information is given to facilitate the interpretation.
Cp values have the following status based on Ekvall and Juran (1974):
Based on Montgomery (2001), Cp needs to have the following minimal values for the process
performance to be as expected:
1.50 for new processes or for existing processes when the variable is critical
Based on Montgomery (2001), Cpu and Cpl need to have the following minimal values for
process performance to be as expected:
1.45 for new processes or for existing processes when the variable is critical
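The four basic indices can be computed directly from the estimated mean, sigma, and the specification limits; here is a minimal sketch (confidence intervals and the Pp/Cpm families are omitted):

```python
def capability_indices(mean, sigma, lsl, usl):
    """Classical short-term capability indices, given the lower and upper
    specification limits LSL and USL."""
    cp = (usl - lsl) / (6 * sigma)       # potential capability (spread only)
    cpl = (mean - lsl) / (3 * sigma)     # lower one-sided capability
    cpu = (usl - mean) / (3 * sigma)     # upper one-sided capability
    cpk = min(cpl, cpu)                  # actual capability, penalizes off-center
    return cp, cpl, cpu, cpk

# A centered process with sigma = 0.5 and specification limits 8.5 .. 11.5
print(capability_indices(10.0, 0.5, 8.5, 11.5))   # all four indices equal 1.0
```

Cp ignores centering, so a badly centered process can have a high Cp but a low Cpk; comparing the two is what makes the status information useful.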
Capabilities: This chart contains information about the specification and control limits. A line
between the lower and upper limits represents the interval with an additional vertical mark for
the center line. The different control limits of each phase are drawn separately.
Chart information:
The following results are displayed separately for the requested chart.
UWMA / EWMA / CUSUM chart: This table contains information about the center line and the
upper and lower control limits of the selected chart. There will be one column for each phase.
Observation details: This table displays detailed information for each subgroup. For each
subgroup the corresponding phase, the values according to the selected diagram type, the
center line, the lower and upper control limits are displayed. If the information about the zones
A, B and C are activated, then the lower and upper control limits of the zones A and B are
displayed as well.
Rule details: If the rules options are activated, a detailed table about the rules will be
displayed. For each subgroup there is one row for each rule that applies. "Yes" indicates that
the corresponding rule fired, and "No" indicates that the rule does not apply.
UWMA / EWMA / CUSUM Chart: If the charts are activated, then a chart containing the
information of the two tables above is displayed. The center line and the lower and upper
control limits are displayed as well. If the corresponding options have been activated, the lower
and upper control limits for the zones A and B are included and there are labels for the
subgroups for which rules were fired. A legend with the activated rules and the corresponding
rule number is displayed below the chart.
Normality tests:
For each of the four tests, the statistics relating to the test are displayed including, in particular,
the p-value which is afterwards used in interpreting the test by comparing with the chosen
significance threshold.
Histograms: The histograms are displayed. If desired, you can change the color of the lines,
scales, titles as with any Excel chart.
Example
A tutorial explaining how to use the SPC time weighted charts tool is available on the Addinsoft
web site. To consult the tutorial, please go to:
http://www.xlstat.com/demo-spc4.htm
References
Burr I. W. (1967). The effect of non-normality on constants for X and R charts. Industrial
Quality Control, 23(11), 563-569.
Burr I. W. (1969). Control charts for measurements with varying sample sizes. Journal of
Quality Technology, 1(3), 163-167.
Deming W. E. (1993). The New Economics for Industry, Government, and Education.
Cambridge, MA: Center for Advanced Engineering Study, Massachusetts Institute of
Technology.
Ekvall D. N. (1974). Manufacturing Planning. In Quality Control Handbook, 3rd Ed. (J. M.
Juran et al., eds.), pp. 9-22-39, McGraw-Hill Book Co., New York.
Montgomery D.C. (2001), Introduction to Statistical Quality Control, 4th edition, John Wiley &
Sons.
Nelson L.S. (1984). The Shewhart Control Chart - Tests for Special Causes. Journal of Quality
Technology, 16, 237-239.
Pyzdek Th. (2003). The Six Sigma Handbook Revised and Expanded, McGraw Hill, New
York.
Ryan Th. P. (2000). Statistical Methods for Quality Improvement, Second Edition, Wiley Series
in probability and statistics, John Wiley & Sons, New York.
Pareto plots
Use this tool to calculate descriptive statistics and display Pareto plots (bar and pie charts) for
a set of qualitative variables.
Description
A Pareto chart draws its name from the Italian economist Vilfredo Pareto, but J. M. Juran is
credited with being the first to apply it to industrial problems.
The causes that should be investigated (e. g., nonconforming items) are listed and
percentages assigned to each one so that the total is 100 %. The percentages are then used
to construct the diagram, which is essentially a bar or pie chart. Pareto analysis uses the
ranking of causes to determine which of them should be pursued first.
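The construction just described — rank the causes by frequency, then accumulate percentages — can be sketched as follows (an illustration with made-up defect names, not XLSTAT output):

```python
from collections import Counter

def pareto_table(causes):
    """Rank causes by frequency and compute the cumulative percentage used
    for the bars and the cumulative line of a Pareto chart."""
    counts = Counter(causes).most_common()   # most frequent cause first
    total = sum(c for _, c in counts)
    table, cum = [], 0.0
    for cause, c in counts:
        cum += 100.0 * c / total
        table.append((cause, c, round(cum, 1)))
    return table

defects = ["scratch"] * 12 + ["dent"] * 5 + ["crack"] * 2 + ["stain"]
print(pareto_table(defects))
```

In this example the first cause alone accounts for 60% of the nonconforming items, which is exactly the kind of imbalance the Pareto chart is designed to expose.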
XLSTAT offers you a large number of descriptive statistics and charts which give you a useful
and relevant insight into your data.
Although you can select several variables (or samples) at the same time, XLSTAT calculates
all the descriptive statistics for each of the samples independently.
Number of missing values: The number of missing values in the sample analyzed. In
the subsequent statistical calculations, values identified as missing are ignored. We
define n to be the number of non-missing values, and {w1, w2, …, wn} to be the
sub-sample of weights for the non-missing values.
Sum of weights*: The sum of the weights, Sw. When all the weights are 1, Sw=n.
Mode*: The mode of the sample analyzed. In other words, the most frequent category.
Frequency of mode*: The frequency of the category to which the mode corresponds.
Cumulated relative frequency by category*: The cumulated relative frequency of
each of the categories.
(*) Statistics followed by an asterisk take the weight of observations into account.
Bar charts: Check this option to represent the frequencies or relative frequencies of the
various categories of qualitative variables as bars.
Pie charts: Check this option to represent the frequencies or relative frequencies of the
various categories of qualitative variables as pie charts.
Double pie charts: These charts are used to compare the frequencies or relative
frequencies of sub-samples with those of the complete sample.
Doughnuts: This option is only available if a column of sub-samples has been selected.
These charts are used to compare the frequencies or relative frequencies of sub-
samples with those of the complete sample.
Stacked bars: This option is only available if a column of sub-samples has been
selected. These charts are used to compare the frequencies or relative frequencies of
sub-samples with those of the complete sample.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Causes: Select a column (or a row in row mode) of qualitative data that represent the list of
causes you want to calculate descriptive statistics for.
Frequencies: Check this option if your data is already aggregated into a list of causes and a
corresponding list of frequencies. Then select the list of frequencies that corresponds to the
selected list of causes.
Sub-sample: Check this option to select a column showing the names or indexes of the sub-
samples for each of the observations.
Range: Check this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Check this option to display the results in a new worksheet in the active workbook.
Sample labels: Check this option if the first line of the selections (qualitative data, sub-
samples, and weights) contains a label.
Weights: Check this option if the observations are weighted. If you do not check this option,
the weights will be considered as 1. Weights must be greater than or equal to 0. If a column
header has been selected, check that the "Sample labels" option is activated.
Standardize the weights: if you check this option, the weights are standardized such
that their sum equals the number of observations.
Options tab:
Descriptive statistics: Check this option to calculate and display descriptive statistics.
Compare to total sample: This option is only available if a column of sub-samples has been
selected. Check this option so that the descriptive statistics and charts are also displayed for
the total sample.
Combine categories: Select the option that determines if and how categories of the qualitative
data should be combined.
Frequency less than: Choose this option to combine categories having a frequency
smaller than the user-defined value.
% smaller than: Choose this option to combine categories having a % smaller than the
user-defined value.
Smallest categories: Choose this option to combine the m smallest categories. The
value m is defined by the user.
Cumulated %: Choose this option to combine all the categories from the point where
the cumulative % of the Pareto plot exceeds the user-defined value.
Outputs tab:
Qualitative data: Activate the options for the descriptive statistics you want to calculate. The
various statistics are described in the description section.
Display vertically: Check this option so that the table of descriptive statistics is
displayed vertically (one line per descriptive statistic).
Charts tab:
Bar charts: Check this option to represent the frequencies or relative frequencies of the
various categories of qualitative variables as bars.
Pie charts: Check this option to represent the frequencies or relative frequencies of the
various categories of qualitative variables as pie charts.
Double pie charts: This option is only available if a column of sub-samples has been
selected. These charts are used to compare the frequencies or relative frequencies of
sub-samples with those of the complete sample.
Doughnuts: This option is only available if a column of sub-samples has been selected. These
charts are used to compare the frequencies or relative frequencies of sub-samples with those
of the complete sample.
Stacked bars: This option is only available if a column of sub-samples has been selected.
These charts are used to compare the frequencies or relative frequencies of sub-samples with
those of the complete sample.
Frequencies: choose this option to make the scale of the plots correspond to the
frequencies of the categories.
Relative frequencies: choose this option to make the scale of the plots correspond to
the relative frequencies of the categories.
Example
An example showing how to create Pareto charts is available on the Addinsoft website:
http://www.xlstat.com/demo-pto.htm
References
Juran J.M. (1960). Pareto, Lorenz, Cournot, Bernoulli, Juran and others. Industrial Quality
Control, 17(4), 25.
Pyzdek Th. (2003). The Six Sigma Handbook Revised and Expanded, McGraw Hill, New
York.
Ryan Th. P. (2000). Statistical Methods for Quality Improvement, Second Edition, Wiley Series
in probability and statistics, John Wiley & Sons, New York.
Kaplan-Meier analysis
Use this tool to build a population survival curve and to obtain essential statistics such as the
median survival time. Kaplan-Meier analysis, whose main result is the Kaplan-Meier table, is
based on irregular time intervals, contrary to life table analysis, where the time intervals are
regular.
Description
The Kaplan-Meier method (also called the product-limit method) belongs to the descriptive
methods of survival analysis, as does life table analysis. The life table analysis method was
developed first, but the Kaplan-Meier method has been shown to be superior in many cases.
Kaplan-Meier analysis allows you to quickly obtain a population survival curve and essential
statistics such as the median survival time. Kaplan-Meier analysis, whose main result is the
Kaplan-Meier table, is based on irregular time intervals, contrary to life table analysis, where
the time intervals are regular.
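The product-limit idea can be sketched in a few lines: the survival estimate S(t) is multiplied by (1 - d/n) at each distinct event time, where d events occur among the n subjects still at risk (a simplified illustration; confidence intervals and ties between events and censoring are handled more carefully in practice):

```python
def kaplan_meier(times, events):
    """Product-limit estimator. events[i] is 1 for an observed event,
    0 for a right-censored observation. Returns (time, S(t)) pairs at
    each distinct event time."""
    data = sorted(zip(times, events))
    s, at_risk = 1.0, len(data)
    curve, i = [], 0
    while i < len(data):
        t = data[i][0]
        d = withdrawn = 0
        while i < len(data) and data[i][0] == t:
            d += data[i][1]            # events at time t
            withdrawn += 1             # everyone observed at t leaves the risk set
            i += 1
        if d > 0:
            s *= 1.0 - d / at_risk     # the survival curve drops only at event times
            curve.append((t, round(s, 4)))
        at_risk -= withdrawn
    return curve

print(kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 1]))
# → [(1, 0.8), (3, 0.5333), (5, 0.0)]
```

Note how the censored observations at times 2 and 4 do not lower the curve themselves, but shrink the risk set, which makes later drops steeper.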
Kaplan-Meier analysis is used to analyze how a given population evolves with time. This
technique is mostly applied to survival data and product quality data. There are three main
reasons why a population of individuals or products may evolve: some individuals die
(products fail), while others leave the surveyed population because they get healed (repaired)
or because their trace is lost (individuals move away, the study is terminated, etc.). The first
type of data is usually called "failure data", or "event data", while the second is called
"censored data".
Left censoring: when an event is reported at time t=t(i), we know that the event occurred at t ≤
t(i).
Right censoring: when an event is reported at time t=t(i), we know that the event occurred at t ≥
t(i), if it ever occurred.
Interval censoring: when an event is reported at time t=t(i), we know that the event occurred
during [t(i-1); t(i)].
Exact censoring: when an event is reported at time t=t(i), we know that the event occurred
exactly at t=t(i).
The Kaplan-Meier method requires that the observations are independent. Second, the
censoring must be independent: if you consider two random individuals in the study at time t-1,
if one of the individuals is censored at time t, and if the other survives, then both must have
equal chances to survive at time t. There are four different types of independent censoring:
Simple type I: all individuals are censored at the same time or equivalently individuals are
followed during a fixed time interval.
Progressive type I: all individuals are censored at the same date (for example, when the study
terminates).
Type II: the study is continued until n events have been recorded.
Random: the time when a censoring occurs is independent of the survival time.
Kaplan-Meier analysis allows you to compare populations through their survival curves. For
example, it can be of interest to compare the survival times of two samples of the same
product produced in two different locations. Tests can be performed to check if the survival
curves have arisen from identical survival functions. These results can later be used to model
the survival curves and to predict probabilities of failure.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Date data: Select the data that correspond to the times or the dates when the events or the
censoring are recorded. If a column header has been selected on the first row, check that the
"Column labels" option has been activated.
Weighted data: Activate this option if, for a given time, several events are recorded on the
same row (for example, at time t=218, 10 failures and 2 censored data have been observed). If
you activate this option, the "Event indicator" field replaces the "Status indicator" field, and the
"Censoring indicator" field replaces the "Event code" and "Censored code" boxes.
Status indicator: Select the data that correspond to an event or censoring data. This field is
not available if the Weighted data option is checked. If a column header has been selected
on the first row, check that the "Column labels" option has been activated.
Event code: Enter the code used to identify an event data within the Status variable. Default
value is 1.
Censored code: Enter the code used to identify a censored data within the Status variable.
Default value is 0.
Event indicator: Select the data that correspond to the counts of events recorded at each
time. Note: this option is available only if the "weighted data" option is selected. If a column
header has been selected on the first row, check that the "Column labels" option has been
activated.
Censoring indicator: Select the data that correspond to the counts of right-censored data
recorded at a given time. Note: this option is available only if the "weighted data" option is
selected. If a column header has been selected on the first row, check that the "Column labels"
option has been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Labels included: Activate this option if the row and column labels have been selected.
Groups: Activate this option if you want to group the data. Then select the data that
correspond to the group to which each observation belongs.
Compare: Activate this option if you want to compare the survival curves and perform the
comparison tests.
Options tab:
Significance level (%): Enter the significance level for the comparison tests (default value
5%). This value is also used to determine the confidence intervals around the estimated
statistics.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Charts tab:
Survival distribution function: Activate this option to display the charts corresponding to the
survival distribution function.
-Log(SDF): Activate this option to display the -Log() of the survival distribution function (SDF).
Log(-Log(SDF)): Activate this option to display the Log(-Log()) of the survival distribution
function.
Censored data: Activate this option to identify on the charts the times when censored data
have been recorded (the identifier is a hollow circle "o").
Results
Basic statistics: This table displays the total number of observations, the number of events,
and the number of censored data.
Kaplan-Meier table: This table displays the various results obtained from the analysis,
including:
Proportion failed: proportion of individuals who "failed" (the event did occur).
Survival rate: proportion of individuals who "survived" (the event did not occur).
Mean and Median residual lifetime: A first table displays the mean residual lifetime, the
standard error, and a confidence range. A second table displays statistics (estimator, and
confidence range) for the 3 quartiles, including the median residual lifetime (50%). The median
residual lifetime is one of the key results of the Kaplan-Meier analysis as it allows you to
evaluate the time remaining for half of the population to "fail".
Charts: Depending on the selected options, up to three charts are displayed: Survival
distribution function (SDF), -Log(SDF) and Log(-Log(SDF)).
If the "Compare" option has been activated in the dialog box, XLSTAT displays the following
results:
Test of equality of the survival functions: This table displays the statistics for three different
tests: the Log-rank test, the Wilcoxon test, and the Tarone-Ware test. These tests are based
on a Chi-square test. The lower the corresponding p-value, the more significant the differences
between the groups.
Charts: Depending on the selected options, up to three charts with one curve for each group
are displayed: Survival distribution function (SDF), -Log(SDF), Log(-Log(SDF)).
Example
An example of Kaplan-Meier analysis is available on the Addinsoft website:
http://www.xlstat.com/demo-km.htm
References
Brookmeyer R. and Crowley J. (1982). A confidence interval for the median survival time.
Biometrics, 38, 29-41.
Collett D. (1994). Modeling Survival Data In Medical Research. Chapman and Hall, London.
Cox D.R. and Oakes D. (1984). Analysis of Survival Data. Chapman and Hall, London.
Elandt-Johnson R.C. and Johnson N.L. (1980). Survival Models and Data Analysis. John
Wiley & Sons, New York.
Kalbfleisch J.D. and Prentice R.L. (1980). The Statistical Analysis of Failure Time Data. John
Wiley & Sons, New York.
Life tables
Use this tool to build a survival curve for a given population, and to obtain essential statistics
such as the median survival time. Life table analysis, whose main result is the life table (also
called the actuarial table), works on regular time intervals, contrary to Kaplan-Meier analysis,
where the time intervals are taken as they are in the data set. XLSTAT enables you to take into
account censored data and grouping information.
Description
Life table analysis belongs to the descriptive methods of survival analysis, as does Kaplan-
Meier analysis. The life table analysis method was developed first, but the Kaplan-Meier
method has been shown to be superior in many cases.
Life table analysis allows you to quickly obtain a population survival curve and essential
statistics such as the median survival time. Life table analysis, whose main result is the life
table (also called the actuarial table), works on regular time intervals, contrary to Kaplan-Meier
analysis, where the time intervals are taken as they are in the data set.
Life table analysis allows you to analyze how a given population evolves with time. This
technique is mostly applied to survival data and product quality data. There are three main
reasons why a population of individuals or products may evolve: some individuals die
(products fail), while others leave the surveyed population because they get healed (repaired)
or because their trace is lost (individuals move away, the study is terminated, etc.). The first
type of data is usually called "failure data", or "event data", while the second is called
"censored data".
Left censoring: when an event is reported at time t=t(i), we know that the event occurred at t ≤
t(i).
Right censoring: when an event is reported at time t=t(i), we know that the event occurred at t ≥
t(i), if it ever occurred.
Interval censoring: when an event is reported at time t=t(i), we know that the event occurred
during [t(i-1); t(i)].
Exact censoring: when an event is reported at time t=t(i), we know that the event occurred
exactly at t=t(i).
The life table method requires that the observations are independent. Second, the censoring
must be independent: if you consider two random individuals in the study at time t-1, if one of
the individuals is censored at time t, and if the other survives, then both must have equal
chances to survive at time t. There are four different types of independent censoring:
Simple type I: all individuals are censored at the same time or equivalently individuals are
followed during a fixed time interval.
Progressive type I: all individuals are censored at the same date (for example, when the study
terminates).
Type II: the study is continued until n events have been recorded.
Random: the time when a censoring occurs is independent of the survival time.
The life table method allows you to compare populations through their survival curves. For
example, it can be of interest to compare the survival times of two samples of the same
product produced in two different locations. Tests can be performed to check if the survival
curves have arisen from identical survival functions. These results can later be used to model
the survival curves and to predict probabilities of failure.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Date data: Select the data that correspond to the times or the dates when the events or the
censoring are recorded. If a column header has been selected on the first row, check that the
"Column labels" option has been activated.
Weighted data: Activate this option if, for a given time, several events are recorded on the
same row (for example, at time t=218, 10 failures and 2 censored data have been observed). If
you activate this option, the "Event indicator" field replaces the "Status indicator" field, and the
"Censoring indicator" field replaces the "Event code" and "Censored code" boxes.
Status indicator: Select the data that correspond to an event or censoring data. This field is
not available if the Weighted data option is checked. If a column header has been selected
on the first row, check that the "Column labels" option has been activated.
Event code: Enter the code used to identify an event data within the Status variable. Default
value is 1.
Censored code: Enter the code used to identify a censored data within the Status variable.
Default value is 0.
Event indicator: Select the data that correspond to the counts of events recorded at each
time. Note: this option is available only if the "weighted data" option is selected. If a column
header has been selected on the first row, check that the "Column labels" option has been
activated.
Censoring indicator: Select the data that correspond to the counts of right-censored data
recorded at a given time. Note: this option is available only if the "weighted data" option is
selected. If a column header has been selected on the first row, check that the "Column labels"
option has been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Labels included: Activate this option if the row and column labels have been selected.
Groups: Activate this option if you want to group the data. Then select the data that
correspond to the group to which each observation belongs.
Compare: Activate this option if you want to compare the survival curves and perform the
comparison tests.
Options tab:
Significance level (%): Enter the significance level for the comparison tests (default value
5%). This value is also used to determine the confidence intervals around the estimated
statistics.
Time intervals:
Constant width: Activate this option if you want to enter a constant interval width. In
this case, the lower bound is automatically set to 0.
User defined: Activate this option to define the intervals that should be used to perform
the life table analysis. Then select the data that correspond to the lower bound of the
first interval and to the upper bounds of all the intervals.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Charts tab:
Survival distribution function: Activate this option to display the charts corresponding to the
survival distribution function.
-Log(SDF): Activate this option to display the -Log() of the survival distribution function (SDF).
Log(-Log(SDF)): Activate this option to display the Log(-Log()) of the survival distribution
function.
Censored data: Activate this option to identify on the charts the times when censored data
have been recorded (the identifier is a hollow circle "o").
Results
Basic statistics: This table displays the total number of observations, the number of events,
and the number of censored data.
Life table: This table displays the various results obtained from the analysis, including:
At risk: Number of individuals that were at risk during the time interval.
Events: Number of events recorded during the time interval.
Effective at risk: Number of individuals that were at risk at the beginning of the interval
minus half of the individuals who have been censored during the time interval.
Survival rate: Proportion of individuals who "survived" (the event did not occur) during
the time interval. Ratio of individuals who survived over the individuals who were
"effective at risk".
Conditional probability of failure: Ratio of individuals who failed over the individuals
who were "effective at risk".
Probability density function: estimated density function at the midpoint of the interval.
Hazard rate: estimated hazard rate function at the midpoint of the interval. Also called
failure rate. Corresponds to the failure rate for the survivors.
Median residual lifetime: Amount of time remaining to reduce the surviving population
(individuals at risk) by one half. Also called median future lifetime.
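The interval quantities above combine as in this sketch (the standard actuarial adjustment: subjects censored during an interval are assumed to be exposed for half of it):

```python
def life_table_row(at_risk, events, censored):
    """One interval of an actuarial life table."""
    effective = at_risk - censored / 2.0   # effective number at risk
    q = events / effective                 # conditional probability of failure
    p = 1.0 - q                            # conditional survival rate
    return effective, q, p

# 100 at risk at the start of the interval, 10 events and 4 censored during it
effective, q, p = life_table_row(100, 10, 4)
print(effective, round(q, 4), round(p, 4))
```

The overall survival function is then the running product of the conditional survival rates p over successive intervals, just as in the Kaplan-Meier product-limit construction.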
Median residual lifetime: Table displaying the median residual lifetime at the beginning of the
experiment, and its standard error. This statistic is one of the key results of the life table
analysis as it allows you to evaluate the time remaining for half of the population to "fail".
Charts: Depending on the selected options, up to five charts are displayed: Survival
distribution function (SDF), Probability density function, Hazard rate function, -Log(SDF), Log(-
Log(SDF)).
If the "Compare" option has been activated in the dialog box, XLSTAT displays the following
results:
Test of equality of the survival functions: This table displays the statistics for three different
tests: the Log-rank test, the Wilcoxon test, and the Tarone-Ware test. These tests are based
on a Chi-square test. The lower the corresponding p-value, the more significant the differences
between the groups.
Charts: Depending on the selected options, up to five charts with one curve for each group are
displayed: Survival distribution function (SDF), Probability density function, Hazard rate
function, -Log(SDF), Log(-Log(SDF)).
Example
An example of survival analysis by means of life tables is available on the Addinsoft
website:
http://www.xlstat.com/demo-life.htm
References
Brookmeyer R. and Crowley J. (1982). A confidence interval for the median survival time.
Biometrics, 38, 29-41.
Collett D. (1994). Modeling Survival Data In Medical Research. Chapman and Hall, London.
Cox D.R. and Oakes D. (1984). Analysis of Survival Data. Chapman and Hall, London.
Elandt-Johnson R.C. and Johnson N.L. (1980). Survival Models and Data Analysis. John
Wiley & Sons, New York.
Kalbfleisch J.D. and Prentice R.L. (1980). The Statistical Analysis of Failure Time Data. John
Wiley & Sons, New York.
Cox Proportional Hazards Model
Use Cox proportional hazards (also known as Cox regression) to model a survival time using
quantitative and/or qualitative covariates.
Description
The Cox proportional hazards model is a frequently used method in the medical domain (for
example, to model whether and when a patient will get well).
The principle of the proportional hazards model is to link the survival time of an individual to
covariates. For example, in the medical domain, we are seeking to find out which covariate has
the most important impact on the survival time of a patient.
Models
A Cox model is a well-recognised statistical technique for exploring the relationship between
the survival of a patient and several explanatory variables. A Cox model provides an estimate
of the treatment effect on survival after adjustment for other explanatory variables. It allows us
to estimate the hazard (or risk) of death, or other event of interest, for individuals, given their
prognostic variables.
Interpreting a Cox model involves examining the coefficients for each explanatory variable. A
positive regression coefficient for an explanatory variable means that the hazard is higher.
Conversely, a negative regression coefficient implies a better prognosis for patients with higher
values of that variable.
Cox's method does not assume any particular distribution for the survival times; rather, it
assumes that the effects of the different variables on survival are constant over time and are
additive on a particular scale.
The hazard function is the probability that an individual will experience an event (for example,
death) within a small time interval, given that the individual has survived up to the beginning of
the interval. It can therefore be interpreted as the risk of dying at time t. The hazard function
(denoted by \lambda(t, X)) can be estimated using the following equation:

\lambda(t, X) = \lambda_0(t) \exp(\beta X)
The first term depends only on time and the second one depends only on X. We are only
interested in the second term.
If we only estimate the second term, a very important hypothesis has to be verified: the
proportional hazards hypothesis. It means that the hazard ratio between two different
observations does not depend on time.
Cox developed a modification of the likelihood function, called the partial likelihood, to estimate
the coefficients without taking into account the time-dependent term of the hazard function:

\log L(\beta) = \sum_{i=1}^{n} \left[ \beta X_i - \log \sum_{j: t_{(j)} \ge t_{(i)}} \exp(\beta X_j) \right]
To estimate the parameters of the model (the coefficients of the linear function), we try to
maximize the partial likelihood function. Contrary to linear regression, an exact analytical
solution does not exist. So an iterative algorithm has to be used. XLSTAT uses a Newton-
Raphson algorithm. The user can change the maximum number of iterations and the
convergence threshold if desired.
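The estimation procedure described above can be sketched for the simplest case of a single covariate without tied event times (a simplified illustration; XLSTAT's actual implementation handles multiple covariates, ties and stratification):

```python
import math

def cox_fit(time, event, x, max_iter=100, tol=1e-6):
    """Fit a one-covariate Cox model by Newton-Raphson on the
    partial log-likelihood (no tied event times assumed)."""
    beta = 0.0
    for _ in range(max_iter):
        U = 0.0  # score (first derivative of the partial log-likelihood)
        I = 0.0  # information (minus the second derivative)
        for i, (ti, di) in enumerate(zip(time, event)):
            if not di:
                continue  # censored observations enter only the risk sets
            risk = [j for j in range(len(time)) if time[j] >= ti]
            w = [math.exp(beta * x[j]) for j in risk]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, risk))
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
            U += x[i] - s1 / s0
            I += s2 / s0 - (s1 / s0) ** 2
        step = U / I
        beta += step
        if abs(step) < tol:  # convergence threshold, as in the dialog box
            break
    return beta

beta = cox_fit([1, 2, 3, 4], [1, 1, 1, 1], [1, 0, 1, 0])
print(beta, math.exp(beta))  # coefficient and hazard ratio
```

The default stop conditions mirror those described below (at most 100 iterations, convergence threshold 0.000001); the hazard ratio reported in the results is the exponential of the estimated coefficient.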
Strata
When the proportional hazards hypothesis does not hold, the model can be stratified. If the
hypothesis holds on sub-samples, then the partial likelihood is estimated on each sub-sample
and these partial likelihoods are summed in order to obtain the estimated partial likelihood. In
XLSTAT, strata are defined using a qualitative variable.
Qualitative variables
Qualitative covariates are treated using a complete disjunctive table. In order to have
independent variables in the model, the binary variable associated with the first modality of
each qualitative variable has to be removed from the model. In XLSTAT, the first modality is
always selected and thus acts as the reference. The impacts of the other modalities are
obtained relative to the omitted modality.
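The disjunctive coding described above can be sketched as follows (a minimal illustration; the function name is ours):

```python
def disjunctive(values):
    """Complete disjunctive (dummy) coding of a qualitative variable,
    dropping the first modality so the remaining columns are
    independent; the dropped modality acts as the reference."""
    modalities = sorted(set(values))
    kept = modalities[1:]  # first modality omitted from the model
    rows = [[1 if v == m else 0 for m in kept] for v in values]
    return kept, rows

cols, table = disjunctive(["A", "B", "C", "A", "C"])
print(cols)   # columns for modalities B and C; A is the reference
for row in table:
    print(row)
```

A coefficient estimated on the "B" column is then interpreted relative to the omitted modality "A".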
Ties handling
The proportional hazards model has been developed by Cox (1972) in order to treat
continuous time survival data. However, frequently in practical applications, some observations
occur at the same time. The classical partial likelihood cannot be applied. With XLSTAT, you
can use two alternative approaches in order to handle ties:
- Breslow's method (1974) (default method): the partial likelihood has the following form:

\log L(\beta) = \sum_{i=1}^{T} \left[ \sum_{l=1}^{d_i} \beta X_l - d_i \log \sum_{j: t_{(j)} \ge t_{(i)}} \exp(\beta X_j) \right],

where T is the number of distinct event times and d_i is the number of observations
associated with time t(i).
- Efron's method (1977): the partial likelihood has the following form:

\log L(\beta) = \sum_{i=1}^{T} \left[ \sum_{l=1}^{d_i} \beta X_l - \sum_{r=0}^{d_i - 1} \log \left( \sum_{j: t_{(j)} \ge t_{(i)}} \exp(\beta X_j) - \frac{r}{d_i} \sum_{j=1}^{d_i} \exp(\beta X_j) \right) \right],

where T is the number of distinct event times and d_i is the number of observations
associated with time t(i).
If there are no ties, both partial likelihoods are equivalent to Cox's partial likelihood.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
General tab:
Date data: Select the data that correspond to the times or the dates when the events or the
censoring are recorded. If a column header has been selected on the first row, check that the
"Column labels" option has been activated.
Status indicator: Select the data that correspond to an event or censoring data. If a column
header has been selected on the first row, check that the "Column labels" option has been
activated.
Event code: Enter the code used to identify an event data within the Status variable. Default
value is 1.
Censored code: Enter the code used to identify a censored data within the Status variable.
Default value is 0.
Explanatory variables:
Quantitative: Activate this option if you want to include one or more quantitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
selected data must be numerical. If the variable header has been selected, check
that the "Column labels" option has been activated.
Qualitative: Activate this option if you want to include one or more qualitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
selected data may be of any type, but numerical data will automatically be considered as
nominal. If the variable header has been selected, check that the "Column labels" option has
been activated (see description).
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column labels: Activate this option if the first row of the data selections (time, status and
explanatory variables labels) includes a header.
Options tab:
Significance level (%): Enter the significance level for the comparison tests (default value
5%). This value is also used to determine the confidence intervals around the estimated
statistics.
Ties handling: Select the method to be used when there is more than one observation for a
given time (see description). Default method: Breslow's method.
Stop conditions:
Iterations: Enter the maximum number of iterations for the Newton-Raphson algorithm.
The calculations are stopped when the maximum number of iterations has been
exceeded. Default value: 100.
Convergence: Enter the maximum value of the evolution of the log of the likelihood
from one iteration to another which, when reached, means that the algorithm is
considered to have converged. Default value: 0.000001.
Model selection: Activate this option if you want to use one of the two selection methods
provided:
Forward: The selection process starts by adding the variable with the largest
contribution to the model. If a second variable is such that its entry probability is lower
than the entry threshold value, then it is added to the model. This process is iterated
until no new variable can be entered in the model.
Backward: This method is similar to the previous one but starts from a complete model.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Goodness of fit statistics: Activate this option to display the table of goodness of fit statistics
for the model.
Test of the null hypothesis H0: beta=0: Activate this option to display the table of statistics
associated to the test of the null hypothesis H0 (likelihood ratio, Wald statistic and score
statistic).
Model coefficients: Activate this option to display the table of coefficients for the model. The
last columns display the hazard ratios and their confidence intervals (the hazard ratio is
calculated as the exponential of the estimated coefficient).
Residuals: Activate this option to display the residuals for all the observations (deviance
residuals, martingale residuals, Schoenfeld residuals and score residuals).
Charts tab:
Survival distribution function: Activate this option to display the charts corresponding to the
cumulative survival distribution function.
-Log(SDF): Activate this option to display the -Log() of the survival distribution function (SDF).
Log(-Log(SDF)): Activate this option to display the Log(-Log()) of the survival distribution
function.
Hazard function: Activate this option to display the hazard function when all covariates are at
their mean value.
Results
XLSTAT displays a large number of tables and charts to help in analyzing and interpreting the
results.
Summary statistics: This table displays descriptive statistics for all the variables selected. For
the quantitative variables, the number of missing values, the number of non-missing values,
the mean and the standard deviation (unbiased) are displayed. For qualitative variables, the
categories with their respective frequencies and percentages are displayed.
Summary of the variables selection: When a selection method has been chosen, XLSTAT
displays the selection summary.
Goodness of fit coefficients: This table displays a series of statistics for the independent
model (corresponding to the case where there is no impact of covariates, beta=0) and for the
adjusted model.
-2 Log(Like.): -2 times the logarithm of the likelihood function associated with the model;
Test of the null hypothesis H0: beta=0: The H0 hypothesis corresponds to the independent
model (no impact of the covariates). We seek to check if the adjusted model is significantly
more powerful than this model. Three tests are available: the likelihood ratio test (-2
Log(Like.)), the Score test and the Wald test. The three statistics follow a Chi-square
distribution whose degrees of freedom are shown.
Model parameters: The parameter estimate, corresponding standard deviation, Wald's
Chi-square, the corresponding p-value and the confidence interval are displayed for each
variable of the model. The hazard ratios for each variable with confidence intervals are also
displayed.
The residual table shows, for each observation, the time variable, the censoring variable and
the value of the residuals (deviance, martingale, Schoenfeld and score).
Charts: Depending on the selected options, charts are displayed: Cumulative Survival
distribution function (SDF), -Log(SDF) and Log(-Log(SDF)), hazard function at mean of
covariates, residuals.
Example
An example of Cox proportional hazards modeling is available on the Addinsoft website:
http://www.xlstat.com/demo-cox.htm
References
Collett D. (1994). Modeling Survival Data In Medical Research. Chapman and Hall, London.
Cox D. R. (1972). Regression Models and Life Tables (with Discussion). Journal of the Royal
Statistical Society, Series B 34:187-220.
Cox D. R. and Oakes D. (1984). Analysis of Survival Data. Chapman and Hall, London.
Efron B. (1977). The efficiency of Cox's likelihood function for censored data. Journal of the
American Statistical Association, 72, 557-565.
Hill C., Com-Nougué C., Kramar A., Moreau T., O'Quigley J., Senoussi R. and Chastang
C. (1996). Analyse statistique des données de survie. 2nd Edition, INSERM,
Médecine-Sciences, Flammarion.
Kalbfleisch J.D. and Prentice R.L. (2002). The Statistical Analysis of Failure Time Data. 2nd
Edition, John Wiley & Sons, New York.
Sensitivity and Specificity
Use this tool to compute, among others, the sensitivity, specificity, odds ratio, predictive
values, and likelihood ratios associated with a test or a detection method. These indices can
be used to assess the performance of a test. For example, in medicine it can be used to
evaluate the efficiency of a test used to diagnose a disease or in quality control to detect the
presence of a defect in a manufactured product.
Description
This method was first developed during World War II to devise effective means of detecting
Japanese aircraft. It was then applied more generally to signal detection and to medicine,
where it is now widely used.
The problem is as follows: we study a phenomenon, often binary (for example, the presence or
absence of a disease) and we want to develop a test to detect effectively the occurrence of a
precise event (for example, the presence of the disease).
Let V be the binary or multinomial variable that describes the phenomenon for the N
individuals being followed. We denote by + the individuals for which the event occurs and by -
those for which it does not. Let T be a test whose goal is to detect whether the event occurred
or not. T can be a binary (presence/absence), a qualitative (for example, the color), or a
quantitative variable (for example, a concentration). For binary or qualitative variables, let t1
be the category corresponding to the occurrence of the event of interest. For a quantitative
variable, let t1 be the threshold value below or above which the event is assumed to happen.
Once the test has been applied to the N individuals, we obtain an individuals/variables table in
which, for each individual, we find whether the event occurred or not, and the result of the test.
In the example above, there are 25 individuals for whom the test has detected the presence of
the disease and 13 for whom it has detected its absence. However, for 20 individuals the
diagnosis is wrong: for 8 of them the test concludes the absence of the disease while the
patients are sick, and for 12 of them it concludes that they are sick while they are not.
True positive (TP): Number of cases that the test declares positive and that are truly positive.
False positive (FP): Number of cases that the test declares positive and that in reality are
negative.
True negative (TN): Number of cases that the test declares negative and that are truly
negative.
False negative (FN): Number of cases that the test declares negative and that in reality are
positive.
Sensitivity (equivalent to the True Positive Rate): Proportion of positive cases that are well
detected by the test. In other words, sensitivity measures how effective the test is when
used on positive individuals. The test is perfect for positive individuals when sensitivity is 1,
and equivalent to a random draw when sensitivity is 0.5. If it is below 0.5, the test is
counter-performing and it would be useful to reverse the rule so that sensitivity is higher than
0.5 (provided that this does not affect the specificity). The mathematical definition is given by:
Sensitivity = TP/(TP + FN).
Specificity (also called True Negative Rate): Proportion of negative cases that are well
detected by the test. In other words, specificity measures how effective the test is when used
on negative individuals. The test is perfect for negative individuals when the specificity is 1,
and equivalent to a random draw when the specificity is 0.5. If it is below 0.5, the test is
counter-performing and it would be useful to reverse the rule so that specificity is higher than
0.5 (provided that this does not affect the sensitivity). The mathematical definition is given by:
Specificity = TN/(TN + FP).
False Positive Rate (FPR): Proportion of negative cases that the test detects as positive (FPR
= 1 - Specificity).
False Negative Rate (FNR): Proportion of positive cases that the test detects as negative
(FNR = 1 - Sensitivity).
Prevalence: relative frequency of the event of interest in the total sample (TP+FN)/N.
Positive Predictive Value (PPV): Proportion of truly positive cases among the positive cases
detected by the test. We have PPV = TP / (TP + FP), or PPV = Sensitivity x Prevalence /
[Sensitivity x Prevalence + (1-Specificity) x (1-Prevalence)]. It is a fundamental value that
depends on the prevalence, an index that is independent of the quality of the test.
Negative Predictive Value (NPV): Proportion of truly negative cases among the negative
cases detected by the test. We have NPV = TN / (TN + FN), or NPV = Specificity x
(1-Prevalence) / [Specificity x (1-Prevalence) + (1-Sensitivity) x Prevalence]. This index also
depends on the prevalence, which is independent of the quality of the test.
Positive Likelihood Ratio (LR+): This ratio indicates to what extent an individual has a higher
chance of being positive in reality when the test says it is positive. We have LR+ = Sensitivity
/ (1-Specificity). The LR+ is a positive or null value.
Negative Likelihood Ratio (LR-): This ratio indicates to what extent an individual has a higher
chance of being negative in reality when the test says it is negative. We have LR- =
(1-Sensitivity) / Specificity. The LR- is a positive or null value.
Odds ratio: The odds ratio indicates how much more likely an individual is to be positive if the
test is positive, compared to cases where the test is negative. For example, an odds ratio of 2
means that the chance that the positive event occurs is twice as high if the test is positive as if
it is negative. The odds ratio is a positive or null value. We have Odds ratio = (TP x TN) /
(FP x FN).
Relative risk: The relative risk is a ratio that measures how much better the test behaves
when it gives a positive report than when it gives a negative one. For example, a relative risk
of 2 means that the test is twice as powerful when it is positive as when it is negative. A value
close to 1 corresponds to a case of independence between the rows and columns, and to a
test that performs as well when it is positive as when it is negative. Relative risk is a null or
positive value given by: Relative risk = [TP/(TP+FP)] / [FN/(FN+TN)].
Confidence intervals
For the various indices presented above, several methods of calculating their variance, and
therefore their confidence intervals, have been proposed. There are two families: the first
concerns proportions, such as sensitivity and specificity; the second concerns ratios, such as
LR+, LR-, the odds ratio and the relative risk.
For proportions, XLSTAT allows you to use the simple (Wald, 1939) or adjusted (Agresti and
Coull, 1998) Wald intervals, a calculation based on the Wilson score (Wilson, 1927), possibly
with a correction of continuity, or the Clopper-Pearson (1934) intervals. Agresti and Caffo
recommend using the adjusted Wald interval or the Wilson score intervals.
For ratios, the variances are calculated using a single method, with or without correction of
continuity.
Once the variance of the above statistics is calculated, we assume their asymptotic normality
(or of their logarithm for ratios) to determine the corresponding confidence intervals. Many of
the statistics are proportions and should lie between 0 and 1. If the intervals fall partly outside
these limits, XLSTAT automatically corrects the bounds of the interval.
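As an illustration of the interval families mentioned above, here is a minimal sketch of the simple/adjusted Wald and Wilson score intervals for a proportion (the function names are ours; the continuity-corrected and Clopper-Pearson variants are omitted):

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score confidence interval for a proportion
    (95% by default); no continuity correction."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    lo, hi = centre - half, centre + half
    return max(0.0, lo), min(1.0, hi)   # clip the bounds to [0, 1]

def wald_interval(successes, n, z=1.959964, adjusted=False):
    """Simple Wald interval; `adjusted` applies Agresti-Coull
    (add z^2/2 successes and z^2/2 failures before computing)."""
    if adjusted:
        n = n + z * z
        successes = successes + z * z / 2
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Interval on a sensitivity of 13/21, for instance
print(wilson_interval(13, 21))
print(wald_interval(13, 21, adjusted=True))
```

The clipping to [0, 1] mirrors the automatic bound correction described above.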
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Data format:
2x2 table (Test/Event): Choose this option if your data are available in a 2x2 contingency
table with the tests results in rows and the positive and negative events in columns. You can
then specify in which column of the table are the positive events, and on which row are the
cases detected as positive by the test. The option "Labels included" must be activated if the
labels of the rows and columns were selected with the data.
Individual data: Choose this option if your data are recorded in an individuals/variables table.
You must then select the event data that correspond to the phenomenon of interest (for
example, the presence or absence of a disease) and specify which code is associated with
positive events (for example + when a disease is diagnosed). You must also select the test
data corresponding to the value of the diagnostic test. This test may be quantitative
(concentration), binary (positive or negative) or qualitative (color). If the test is quantitative, you
must specify if XLSTAT should consider it as positive when the test is above or below a given
threshold value. If the test is qualitative or binary, you must select the value corresponding to a
positive test.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Labels included: Activate this option if the row and column labels are selected. This option is
available if you selected the 2x2 table format.
Variable labels: Activate this option if, in column mode, the first row of the selected data
contains a header, or in row mode, if the first column of the selected data contains a header.
This option is available if you selected the individual data format.
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Variable labels" option is activated.
Options tab:
Confidence intervals:
Size (%): Enter the size of the confidence interval in % (default value: 95).
Wald: Activate this option if you want to calculate confidence intervals on the various
indexes using the approximation of the binomial distribution by the normal distribution.
Activate "Adjusted" to use the adjustment of Agresti and Coull.
Wilson score: Activate this option if you want to calculate confidence intervals on the
various indexes using the Wilson score approximation.
Continuity correction: Activate this option if you want to apply the continuity correction
to the Wilson score and to the interval on ratios.
A priori prevalence: If you know that the disease involves a certain proportion of individuals in
the total population, you can use this information to adjust predictive values calculated from
your sample.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Results
The results are made of the contingency table followed by the table that displays the various
indices described in the description section.
Example
An example showing how to compute sensitivity and specificity is available on the Addinsoft
website:
http://www.xlstat.com/demo-sens.htm
References
Agresti A. (1990). Categorical Data Analysis. John Wiley and Sons, New York.
Agresti A., and Coull B.A. (1998). Approximate is better than exact for interval estimation of
binomial proportions. The American Statistician, 52, 119-126.
Agresti A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and
differences of proportions result from adding two successes and two failures. The American
Statistician, 54, 280-288.
Clopper C.J. and Pearson E.S. (1934). The use of confidence or fiducial limits illustrated in
the case of the binomial. Biometrika, 26, 404-413.
Newcombe R. G. (1998). Two-sided confidence intervals for the single proportion: comparison
of seven methods. Statistics in Medicine, 17, 857-872.
Zhou X.H., Obuchowski N.A., McClish D.K. (2002). Statistical Methods in Diagnostic
Medicine. John Wiley & Sons.
Pepe M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction,
Oxford University Press.
Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference.
Journal of the American Statistical Association, 22, 209-212.
Wald, A., & Wolfowitz, J. (1939). Confidence limits for continuous distribution functions. The
Annals of Mathematical Statistics, 10, 105-118.
ROC curves
Use this tool to generate an ROC curve, which represents the evolution of the proportion of
true positive cases (also called sensitivity) as a function of the proportion of false positive
cases (corresponding to 1 minus specificity), and to evaluate a binary classifier such as a test
to diagnose a disease, or to control the presence of defects on a manufactured product.
Description
ROC curves were first developed during World War II to devise effective means of detecting
Japanese aircraft. This methodology was then applied more generally to signal detection and
to medicine, where it is now widely used.
The problem is as follows: we study a phenomenon, often binary (for example, the presence or
absence of a disease) and we want to develop a test to detect effectively the occurrence of a
precise event (for example, the presence of the disease).
Let V be the binary or multinomial variable that describes the phenomenon for the N
individuals being followed. We denote by + the individuals for which the event occurs and by -
those for which it does not. Let T be a test whose goal is to detect whether the event occurred
or not. T is most of the time continuous (for example, a concentration) but it can also be
ordinal (to represent levels).
We want to set the threshold value below or beyond which the event occurs. To do so, we
examine a set of possible threshold values, for each of which we calculate various statistics,
among which the simplest are:
True positive (TP): Number of cases that the test declares positive and that are truly
positive.
False positive (FP): Number of cases that the test declares positive and that in reality are
negative.
True negative (TN): Number of cases that the test declares negative and that are truly
negative.
False negative (FN): Number of cases that the test declares negative and that in reality are
positive.
Prevalence: Relative frequency of the event of interest in the total sample (TP+FN)/N.
Several indices have been developed to evaluate the performance of a test at a given
threshold value:
Sensitivity (equivalent to the True Positive Rate): Proportion of positive cases that are well
detected by the test. In other words, sensitivity measures how effective the test is when
used on positive individuals. The test is perfect for positive individuals when sensitivity is 1,
and equivalent to a random draw when sensitivity is 0.5. If it is below 0.5, the test is
counter-performing and it would be useful to reverse the rule so that sensitivity is higher than
0.5 (provided that this does not affect the specificity). The mathematical definition is given by:
Sensitivity = TP/(TP + FN).
Specificity (also called True Negative Rate): Proportion of negative cases that are well
detected by the test. In other words, specificity measures how effective the test is when used
on negative individuals. The test is perfect for negative individuals when the specificity is 1,
and equivalent to a random draw when the specificity is 0.5. If it is below 0.5, the test is
counter-performing and it would be useful to reverse the rule so that specificity is higher than
0.5 (provided that this does not affect the sensitivity). The mathematical definition is given by:
Specificity = TN/(TN + FP).
False Positive Rate (FPR): Proportion of negative cases that the test detects as positive (FPR
= 1 - Specificity).
False Negative Rate (FNR): Proportion of positive cases that the test detects as negative
(FNR = 1 - Sensitivity).
Prevalence: relative frequency of the event of interest in the total sample (TP+FN)/N.
Positive Predictive Value (PPV): Proportion of truly positive cases among the positive cases
detected by the test. We have PPV = TP / (TP + FP), or PPV = Sensitivity x Prevalence /
[Sensitivity x Prevalence + (1-Specificity) x (1-Prevalence)]. It is a fundamental value that
depends on the prevalence, an index that is independent of the quality of the test.
Negative Predictive Value (NPV): Proportion of truly negative cases among the negative
cases detected by the test. We have NPV = TN / (TN + FN), or NPV = Specificity x
(1-Prevalence) / [Specificity x (1-Prevalence) + (1-Sensitivity) x Prevalence]. This index also
depends on the prevalence, which is independent of the quality of the test.
Positive Likelihood Ratio (LR+): This ratio indicates to what extent an individual has a higher
chance of being positive in reality when the test says it is positive. We have LR+ = Sensitivity
/ (1-Specificity). The LR+ is a positive or null value.
Negative Likelihood Ratio (LR-): This ratio indicates to what extent an individual has a higher
chance of being negative in reality when the test says it is negative. We have LR- =
(1-Sensitivity) / Specificity. The LR- is a positive or null value.
Odds ratio: The odds ratio indicates how much more likely an individual is to be positive if the
test is positive, compared to cases where the test is negative. For example, an odds ratio of 2
means that the chance that the positive event occurs is twice as high if the test is positive as if
it is negative. The odds ratio is a positive or null value. We have Odds ratio = (TP x TN) /
(FP x FN).
Relative risk: The relative risk is a ratio that measures how much better the test behaves
when it gives a positive report than when it gives a negative one. For example, a relative risk
of 2 means that the test is twice as powerful when it is positive as when it is negative. A value
close to 1 corresponds to a case of independence between the rows and columns, and to a
test that performs as well when it is positive as when it is negative. Relative risk is a null or
positive value given by: Relative risk = [TP/(TP+FP)] / [FN/(FN+TN)].
Confidence intervals
For the various indices presented above, several methods of calculating their variance, and
therefore their confidence intervals, have been proposed. There are two families: the first
concerns proportions, such as sensitivity and specificity; the second concerns ratios, such as
LR+, LR-, the odds ratio and the relative risk.
For proportions, XLSTAT allows you to use the simple (Wald, 1939) or adjusted (Agresti and
Coull, 1998) Wald intervals, a calculation based on the Wilson score (Wilson, 1927), possibly
with a continuity correction, or the Clopper-Pearson (1934) intervals. Agresti and Caffo (2000)
recommend using the adjusted Wald interval or the Wilson score intervals.
For ratios, the variances are calculated using a single method, with or without a continuity
correction.
Once the variance of the above statistics is calculated, we assume their asymptotic normality
(or of their logarithm for ratios) to determine the corresponding confidence intervals. Many of
the statistics are proportions and should lie between 0 and 1. If the intervals fall partly outside
these limits, XLSTAT automatically corrects the bounds of the interval.
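As a sketch of one of the proportion intervals mentioned above, the Wilson score interval (here without continuity correction) can be computed as follows; the counts and the z value are illustrative assumptions, and the code is not XLSTAT's own implementation.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson (1927) score interval for a proportion, without
    continuity correction (a sketch, not XLSTAT's exact code)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Example: 95% interval for a sensitivity estimated as 40/45.
lo, hi = wilson_interval(40, 45)
```

Unlike the simple Wald interval, the Wilson bounds always stay within [0, 1], so no truncation of the interval is needed.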
ROC curve
The ROC curve corresponds to the graphical representation of the pairs (1 - specificity,
sensitivity) for the various possible threshold values.
The area under the curve (AUC) is a synthetic index calculated for ROC curves. The AUC is
the probability that a randomly chosen positive case receives a higher value of the test than a
randomly chosen negative case. For an ideal model we have AUC = 1, while for a random
model we have AUC = 0.5. One usually considers that the model is good when the value of the
AUC is higher than 0.7. A well discriminating model should have an AUC between 0.87 and
0.9. A model with an AUC above 0.9 is excellent.
Sen (1960), Bamber (1975) and Hanley and McNeil (1982) have proposed different methods to
calculate the variance of the AUC; all are available in XLSTAT. XLSTAT also offers a test that
compares the AUC to 0.5, the value corresponding to a random classifier. This test is based
on the difference between the AUC and 0.5, divided by the standard error obtained from one
of the three proposed methods. The resulting statistic is assumed to follow a standard normal
distribution, which allows the calculation of the p-value.
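As an illustration of this test, the sketch below computes the AUC by counting concordant (positive, negative) pairs and derives a z statistic using the Hanley and McNeil (1982) variance estimate. The data are invented and the code is a sketch, not XLSTAT's implementation.

```python
import math
from itertools import product

def auc_mann_whitney(pos, neg):
    """AUC = probability that a positive case scores higher than a
    negative one (ties count 1/2)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

def auc_z_vs_half(pos, neg):
    """z statistic comparing the AUC to 0.5, using the
    Hanley-McNeil (1982) variance estimate."""
    a = auc_mann_whitney(pos, neg)
    n1, n2 = len(pos), len(neg)
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    var = (a * (1 - a) + (n1 - 1) * (q1 - a * a)
           + (n2 - 1) * (q2 - a * a)) / (n1 * n2)
    return a, (a - 0.5) / math.sqrt(var)

# Invented test values for diseased (pos) and healthy (neg) individuals.
pos = [3.1, 2.8, 3.9, 2.2, 3.5]
neg = [1.9, 2.5, 1.2, 2.0, 2.6]
auc, z = auc_z_vs_half(pos, neg)
```

The p-value then follows from the standard normal distribution, e.g. 2 * (1 - Phi(|z|)) for a two-sided test.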
The AUC can also be used to compare different tests with each other. If the tests have been
applied to different groups of individuals, the samples are independent. In this case, XLSTAT
uses a Student's t test to compare the AUCs (which requires assuming the normality of the
AUC, acceptable if the samples are not too small). If the tests were applied to the same
individuals, the samples are paired. In this case, XLSTAT calculates the covariance matrix of
the AUCs as described by DeLong and DeLong (1988) on the basis of Sen's work (1960), then
calculates the variance of the difference between two AUCs and the corresponding p-value
assuming normality.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Event data: Select the data corresponding to the phenomenon being studied (for example, the
presence or absence of a disease) and specify which code is associated with the positive
event (for example, M or + for a diseased individual).
Test data: Select the data corresponding to the value of the diagnostic test. The data must be
quantitative. If the data are ordinal, they must be recoded as quantitative data (for example,
0, 1, 2, 3, 4). You must specify whether the test is to be considered positive when the test
value is greater than or lower than a threshold value determined during the computations.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if, in column mode, the first row of the selected data
contains a header, or in row mode, if the first column of the selected data contains a header.
This option is available if you selected the individual data format.
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Variable labels" option is activated.
Groups: Check this option to select the values which correspond to the identifier of the group
to which each observation belongs.
Options tab:
Confidence intervals:
Size (%): Enter the size of the confidence interval in % (default value: 95).
Wald: Activate this option if you want to calculate confidence intervals on the various
indexes using the approximation of the binomial distribution by the normal distribution.
Activate "Adjusted" to use the adjustment of Agresti and Coull.
Wilson score: Activate this option if you want to calculate confidence intervals on the
various indexes using the Wilson score approximation.
Continuity correction: Activate this option if you want to apply the continuity correction
to the Wilson score and to the interval on ratios.
A priori prevalence: If you know that the disease involves a certain proportion of individuals in
the total population, you can use this information to adjust predictive values calculated from
your sample.
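This adjustment works through Bayes' rule. A minimal sketch, with made-up sensitivity, specificity and prevalence values, is:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values for a given a priori
    prevalence, via Bayes' rule (a sketch; values below are illustrative)."""
    tp = sensitivity * prevalence                 # P(test+, diseased)
    fp = (1 - specificity) * (1 - prevalence)     # P(test+, healthy)
    tn = specificity * (1 - prevalence)           # P(test-, healthy)
    fn = (1 - sensitivity) * prevalence           # P(test-, diseased)
    return tp / (tp + fp), tn / (tn + fn)

# A test with 90% sensitivity and 80% specificity, disease prevalence 5%:
ppv, npv = predictive_values(0.9, 0.8, 0.05)
```

Even a fairly good test can have a low PPV when the disease is rare, which is exactly why taking the a priori prevalence into account matters.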
Test on AUC: You can compare the AUC (Area Under the Curve) to 0.5, the value it would
have if the test variable were purely random. This test is conducted using the method of
calculating the variance chosen above.
Costs: Activate this option if you want to evaluate the cost associated with the various possible
decisions based on the threshold values of the test variable. You need to enter the costs that
correspond to the different situations: TP (true positive), FP (false positive), FN (false negative),
TN (true negative).
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Pairwise deletion: Activate this option to remove observations with missing data only when
the test variables involved in the calculations have missing data.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
ROC analysis: Activate this option to display the table that lists the various indices calculated
for each value of the test variable. You can choose whether or not to show the predictive
values, the likelihood ratios, and the true/false positive and negative counts.
Test on the AUC: Activate this option if you want to display the results of the comparison of
the AUC to 0.5, the value that corresponds to a random classifier.
Comparison of the AUCs: If you have selected several test variables or a group variable,
activate this option to compare the AUCs obtained for the different variables or different
groups.
Charts tab:
True/False +/- : Activate this option to display the stacked bars chart that shows the % of the
TP/TN/FP/FN for the different values of the test variable.
Decision plot: Activate this option to display the decision plot of your choice. This plot will help
you to decide what level of the test variable is best.
Comparison of the ROC curves: Activate this option to display on a single plot the ROC
curves that correspond to the various test variables or to the different groups. This option is
only available if you select two or more test variables or if a group variable has been selected.
Results
Summary statistics: In this first table you can find statistics for the selected test(s), followed
by a table recalling, for the phenomenon of interest, the number of occurrences of each
event and the prevalence of the positive event in the sample. The row displayed in bold
corresponds to the positive event.
ROC curve: The ROC curve is then displayed. The straight dotted line that goes from (0, 0) to
(1, 1) corresponds to the curve of a random test with no discrimination. The colored line
corresponds to the ROC curve. Small squares correspond to observations (one square per
observed value of the test variable).
ROC analysis: This table displays for each possible threshold value of the test variable, the
various indices presented in the description section. On the line below the table you'll find a
reminder of the rule set out in the dialog box to identify positive cases compared to the
threshold value. Below the table you will find a stacked bars chart showing the evolution of the
TP, TN, FP, FN depending on the value of the threshold value. If the corresponding option was
activated, the decision plot is then displayed (for example, changes in the cost depending on
the threshold value).
Area under the curve (AUC): This table displays the AUC, its standard error and a confidence
interval.
Comparison of the AUC to 0.5: These results allow you to compare the test to a random
classifier. The confidence interval corresponds to the difference between the AUC and 0.5.
Various statistics are then displayed, including the p-value, followed by the interpretation of
the comparison test.
Comparison of the AUCs: If you selected several test variables, once the above results are
displayed for each variable, you will find the covariance matrix of the AUCs, followed by the
table of differences for each pair of AUCs with the confidence intervals as comments, and then
the table of p-values. Values in bold correspond to significant differences. Last, a graph that
compares the ROC curves is displayed.
Example
An example showing how to compute ROC curves is available on the Addinsoft website:
http://www.xlstat.com/demo-roc.htm
References
Agresti A. (1990). Categorical Data Analysis. John Wiley and Sons, New York.
Agresti A., and Coull B.A. (1998). Approximate is better than exact for interval estimation of
binomial proportions. The American Statistician, 52, 119-126.
Agresti A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and
differences of proportions result from adding two successes and two failures. The American
Statistician, 54, 280-288.
Bamber D. (1975). The area above the ordinal dominance graph and the area below the
receiver operating characteristic graph. Journal of Mathematical Psychology, 12, 387-415.
Clopper C.J. and Pearson E.S. (1934). The use of confidence or fiducial limits illustrated in
the case of the binomial. Biometrika, 26, 404-413.
DeLong E.R., DeLong D.M., Clarke-Pearson D.L. (1988). Comparing the areas under two or
more correlated receiver operating characteristic curves: a nonparametric approach.
Biometrics, 44(3), 837-845.
Hanley J.A. and McNeil B.J. (1982). The meaning and use of the area under a receiver
operating characteristic (ROC) curve. Radiology, 143, 29-36.
Hanley J. A. and McNeil B. J. (1983). A method of comparing the area under two ROC
curves derived from the same cases. Radiology, 148, 839-843.
Newcombe R. G. (1998). Two-sided confidence intervals for the single proportion: comparison
of seven methods. Statistics in Medicine, 17, 857-872.
Pepe M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction,
Oxford University Press.
Wald A. and Wolfowitz J. (1939). Confidence limits for continuous distribution functions. The
Annals of Mathematical Statistics, 10, 105-118.
Wilson E.B. (1927). Probable inference, the law of succession, and statistical inference.
Journal of the American Statistical Association, 22, 209-212.
Zhou X.H., Obuchowski N.A., McClish D.K. (2002). Statistical Methods in Diagnostic
Medicine. John Wiley & Sons.
Canonical Correlation Analysis (CCorA)
Use Canonical Correlation Analysis (CCorA, sometimes abbreviated CCA, although we reserve
CCA for Canonical Correspondence Analysis) to study the correlation between two sets of
variables and to extract from these tables a set of canonical variables that are as highly
correlated as possible with both tables and orthogonal to each other.
Description
Canonical Correlation Analysis (CCorA, sometimes abbreviated CCA, although we reserve CCA
for Canonical Correspondence Analysis) is one of the many methods that allow you to study
the relationship between two sets of variables. Introduced by Hotelling (1936), this method has
been widely used in ecology, but it has been supplanted by RDA (Redundancy Analysis) and
by CCA (Canonical Correspondence Analysis).
This method is symmetrical, contrary to RDA, and is not oriented towards prediction. Let Y1
and Y2 be two tables, with respectively p and q variables. CCorA aims at obtaining two vectors
a(i) and b(i) such that
ρ(i) = cor(Y1a(i), Y2b(i)) = cov(Y1a(i), Y2b(i)) / sqrt( var(Y1a(i)) · var(Y2b(i)) )
is maximized. Constraints must be introduced so that the solution for a(i) and b(i) is unique. As
we are ultimately trying to maximize the covariance between Y1a(i) and Y2b(i) while minimizing
their respective variances, we might obtain components that are highly correlated with each
other but that do not explain Y1 and Y2 well. Once the solution has been obtained for i=1, we
look for the solution for i=2, where a(2) and b(2) must be orthogonal to a(1) and b(1)
respectively, and so on. The number of vectors that can be extracted is at most equal to
min(p, q).
Note: the inter-battery analysis of Tucker (1958) is an alternative in which one maximizes the
covariance between the components Y1a(i) and Y2b(i).
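A quick way to obtain canonical correlations numerically is via the SVD of Q1'Q2, where Q1 and Q2 are orthonormal bases (from QR decompositions) of the centered tables. The sketch below uses synthetic data and is not XLSTAT's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic tables Y1 (p = 3) and Y2 (q = 2) sharing one latent signal z.
z = rng.normal(size=(n, 1))
Y1 = np.hstack([z + 0.5 * rng.normal(size=(n, 1)) for _ in range(3)])
Y2 = np.hstack([z + 0.5 * rng.normal(size=(n, 1)) for _ in range(2)])

# Orthonormal bases of the centered column spaces.
Q1, _ = np.linalg.qr(Y1 - Y1.mean(axis=0))
Q2, _ = np.linalg.qr(Y2 - Y2.mean(axis=0))

# The singular values of Q1'Q2 are the canonical correlations;
# there are at most min(p, q) of them, here min(3, 2) = 2.
canonical_corrs = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
```

Because both tables are driven by the same latent variable z, the first canonical correlation comes out close to 1, while the second, which only picks up noise, is much smaller.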
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down (column mode), XLSTAT considers that rows correspond to sites and columns to
objects/variables. If the arrow points to the right (row mode), XLSTAT considers that rows
correspond to objects/variables and columns to sites.
General tab:
Y1: Select the data that corresponds to the first table. If the Column labels option is activated
(column mode) you need to include a header on the first row of the selection. If the Row
labels option is activated (row mode) you need to include a header in the first column of the
selection.
Y2: Select the data that corresponds to the second table. If the Column labels option is
activated (column mode) you need to include a header on the first row of the selection. If the
Row labels option is activated (row mode) you need to include a header in the first column of
the selection.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Column/Row labels: Activate this option if, in column mode, the first row of the selected data
contains a header, or in row mode, if the first column of the selected data contains a header.
Observation labels: Activate this option if observation labels are available. Then select the
corresponding data. If the Column labels option is activated you need to include a header in
the selection. If this option is not activated, the observation labels are automatically generated
by XLSTAT (Obs1, Obs2, ...).
Options tab:
Type of analysis: Select from which type of matrix the canonical analysis should be
performed.
Y1:
Y2:
Filter factors: You can activate one of the following two options in order to reduce the number
of factors for which results are displayed.
Minimum %: Activate this option then enter the minimum percentage of the total
variability that the chosen factors must represent.
Maximum Number: Activate this option to set the number of factors to take into
account.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
Eigenvalues: Activate this option to display the table and the scree plot of the eigenvalues.
Wilks' Lambda test: Activate this option to display the results of the Wilks' lambda test.
Variables/Factors correlations: Activate this option to display the correlations between the
initial variables of Y1 and Y2 with the canonical variables.
Squared cosines: Activate this option to display the squared cosines of the initial variables in
the canonical space.
Scores: Activate this option to display the coordinates of the observations in the space of the
canonical variables.
Charts tab:
Correlation charts: Activate this option to display the charts involving correlations between
the components and the variables.
Colored labels: Activate this option to display the labels with the same color as the
corresponding points. If this option is not activated the labels are displayed in black.
Results
Summary statistics: This table displays the descriptive statistics for the objects and the
explanatory variables.
Similarity matrix: The matrix that corresponds to the type of analysis chosen in the dialog
box is displayed.
Eigenvalues and percentages of inertia: This table displays the eigenvalues, the
corresponding inertia, and the corresponding percentages. Note: in some software, the
displayed eigenvalues are equal to L / (1 - L), where L is the eigenvalue given by
XLSTAT.
Wilks' Lambda test: This test allows you to determine whether the two tables Y1 and Y2 are
significantly related to each canonical variable.
Canonical correlations: The canonical correlations, bounded by 0 and 1, are higher when the
correlation between Y1 and Y2 is high. However, they do not indicate to what extent the
canonical variables are related to Y1 and Y2. The squared canonical correlations are equal to
the eigenvalues and thus correspond to the percentage of variability carried by the canonical
variables.
The results listed below are computed separately for each of the two groups of input variables.
Redundancy coefficients: These coefficients measure, for each set of input variables, the
proportion of the variability of the input variables that is predicted by the canonical variables.
Correlations between input variables and canonical variables (also called Structure
correlation coefficients, or Canonical factor loadings) allow understanding how the canonical
variables are related to the input variables.
The canonical variable adequacy coefficients correspond, for a given canonical variable, to
the sum of the squared correlations between the input variables and canonical variables,
divided by the number of input variables. They give the percentage of variability taken into
account by the canonical variable of interest.
Squared cosines: The squared cosines of the input variables in the space of the canonical
variables indicate whether an input variable is well represented in that space. The squared
cosines for a given input variable sum to 1. The sum over a reduced number of canonical
axes gives the communality.
Scores: The scores correspond to the coordinates of the observations in the space of the
canonical variables.
Example
An example showing how to run a Canonical Correlation Analysis is available on the Addinsoft
website:
http://www.xlstat.com/demo-ccora.htm
References
Hotelling H. (1936). Relations between two sets of variables. Biometrika, 28, 321-327.
Jobson J.D. (1992). Applied Multivariate Data Analysis. Volume II: Categorical and
Multivariate Methods. Springer-Verlag, New York.
Legendre P. and Legendre L. (1998). Numerical Ecology. Second English Edition. Elsevier,
Amsterdam.
Redundancy Analysis (RDA)
Use Redundancy Analysis (RDA) to analyze a table of response variables using the
information provided by a set of explanatory variables, and visualize on the same plot the two
sets of variables, and the observations.
Description
Redundancy Analysis (RDA) was developed by Van den Wollenberg (1977) as an alternative
to Canonical Correlation Analysis (CCorA). RDA allows studying the relationship between two
tables of variables Y and X. While CCorA is a symmetric method, RDA is non-symmetric.
In CCorA, the components extracted from both tables are such that their correlation is
maximized. In RDA, the components extracted from X are such that they are as highly
correlated as possible with the variables of Y. Then the components of Y are extracted so that
they are as highly correlated as possible with the components extracted from X.
Principles of RDA
Let Y be a table of response variables with n observations and p variables. This table can be
analyzed using Principal Components Analysis (PCA) to obtain a simultaneous map of the
observations and the variables in two or three dimensions.
Let X be a table that contains the measures recorded for the same n observations on q
quantitative and/or qualitative variables.
- An unconstrained part, which corresponds to the analysis of the residuals. The number of
dimensions for the unconstrained RDA is equal to min(n-1, p).
Partial RDA
Partial RDA adds a preliminary step. The X table is subdivided into two groups. The first group
X(1) contains conditioning variables whose effect we want to remove, as it is either known or
without interest for the study. Regressions are run on the Y and X(2) tables and the residuals
of the regressions are used for the RDA step. Partial RDA allows you to analyze the effect of
the second group of variables, after the effect of the first group has been removed.
Biplot scaling
XLSTAT offers three different types of scaling. The type of scaling changes the way the scores
of the response variables and the observations are computed, and therefore their respective
positions on the plot. Let u(ik) be the normalized score of variable i on the kth axis, v(ik) the
normalized score of observation i on the kth axis, L(k) the eigenvalue corresponding to axis k,
and T the total inertia (the sum of the L(k) for the constrained and unconstrained RDA). The
three scalings available in XLSTAT are identical to those of vegan (a package for the R
software, Oksanen, 2007). The u(ik) are multiplied by c and the v(ik) by d, where r is a
constant equal to ((n-1)T)^(1/4), and n is the number of observations.
Scaling 1: c = r · sqrt(L(k)/T), d = r
Scaling 2: c = r, d = r · sqrt(L(k)/T)
Scaling 3: c = r · (L(k)/T)^(1/4), d = r · (L(k)/T)^(1/4)
In addition to the observations and the response variables, the explanatory variables can be
displayed. The coordinates of the latter are obtained by computing the correlations between
the X table and the observation scores.
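Under the hood, the constrained step of RDA amounts to regressing Y on X and running a PCA (via an SVD) on the fitted values. Below is a minimal sketch on synthetic data; it is not XLSTAT's implementation and it ignores the scaling constants discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 4, 2
X = rng.normal(size=(n, q))                       # explanatory table
B_true = rng.normal(size=(q, p))
Y = X @ B_true + 0.3 * rng.normal(size=(n, p))    # responses driven by X

Xc = X - X.mean(axis=0)                           # center both tables
Yc = Y - Y.mean(axis=0)

# Least-squares fit of every response on X, then SVD of the fitted part.
B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
fitted = Xc @ B
U, s, Vt = np.linalg.svd(fitted, full_matrices=False)

eigenvalues = s ** 2 / (n - 1)     # constrained eigenvalues
site_scores = U * s                # observation scores (unscaled)
```

Here the fitted matrix has rank min(p, q) = 2, so only two constrained axes carry inertia; the residual part, Yc - fitted, would feed the unconstrained analysis of the residuals.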
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to start the computations.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down (column mode), XLSTAT considers that rows correspond to sites and columns to
objects/variables. If the arrow points to the right (row mode), XLSTAT considers that rows
correspond to objects/variables and columns to sites.
General tab:
Response variables Y: Select the table that corresponds to the response variables. If the
Column labels option is activated (column mode) you need to include a header on the first
row of the selection. If the Row labels option is activated (row mode) you need to include a
header in the first column of the selection.
Explanatory variables X: Select the data that correspond to the various explanatory variables
that have been measured for the same observations as for table Y.
Quantitative: Activate this option if you want to use quantitative variables and then
select these variables.
Qualitative: Activate this option if you want to use qualitative variables and then select
these variables.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Partial RDA: Activate this option to run a partial RDA. If you activate this option, a dialog box
will be displayed during the analysis, so that you can select the conditioning variables (see the
description section for further details).
Column/Row labels: Activate this option if, in column mode, the first row of the selected data
contains a header, or in row mode, if the first column of the selected data contains a header.
Observation labels: Activate this option if observation labels are available. Then select the
corresponding data. If the Column labels option is activated you need to include a header in
the selection. If this option is not activated, the observation labels are automatically generated
by XLSTAT (Obs1, Obs2, ...).
Options tab:
Filter factors: You can activate one of the following two options in order to reduce the number
of factors for which results are displayed.
Minimum %: Activate this option then enter the minimum percentage of the total
variability that the chosen factors must represent.
Maximum Number: Activate this option to set the number of factors to take into
account.
Permutation test: Activate this option if you want to use a permutation test to test the
independence of the two tables.
Number of permutations: Enter the number of permutations to perform for the test
(Default value: 500)
Significance level (%): Enter the significance level for the test.
Response variables:
Center: Activate this option to center the variables before running the RDA.
Reduce: Activate this option to standardize the variables before running the RDA.
Explanatory variables X:
Center: Activate this option to center the variables before running the RDA.
Reduce: Activate this option to standardize the variables before running the RDA.
Biplot type: Select the type of biplot you want to display. The type changes the way the
scores of the response variables and the observations are scaled (see the description section
for further details).
Missing data tab:
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
Unconstrained RDA results: Activate this option to display the results of the unconstrained
RDA.
Eigenvalues: Activate this option to display the table and the scree plot of the eigenvalues.
Scores (Observations): Activate this option to display the scores of the observations.
Scores (Response variables): Activate this option to display the scores of the response
variables.
WA scores: Activate this option to compute and display the weighted average scores.
LC scores: Activate this option to compute and display the linear combinations scores.
Contributions: Activate this option to display the contributions of the observations and the
response variables.
Squared cosines: Activate this option to display the squared cosines of the observations and
the response variables.
Scores (Explanatory variables): Activate this option to display the scores of the explanatory
variables.
Charts tab:
Response variables: Activate this option to display the response variables on the
chart.
Explanatory variables: Activate this option to display the explanatory variables on the
chart.
Labels: Activate this option to display the labels of the sites on the charts.
Colored labels: Activate this option to display the labels with the same color as the
corresponding points. If this option is not activated the labels are displayed in black.
Vectors: Activate this option to display the vectors for the standard coordinates on the
asymmetric charts.
Length factor: Activate this option to modulate the length of the vectors.
Results
Summary statistics: This table displays the descriptive statistics for the objects and the
explanatory variables.
If a permutation test was requested, its results are first displayed so that we can check if the
relationship between the tables is significant or not.
Eigenvalues and percentages of inertia: In these tables are displayed for the RDA and the
unconstrained RDA the eigenvalues, the corresponding inertia, and the corresponding
percentages, either in terms of constrained inertia (or unconstrained inertia), or in terms of total
inertia.
The scores of the observations, response variables and explanatory variables are then
displayed. These coordinates are used to produce the plot.
The charts allow you to visualize the relationships between the observations, the response
variables and the explanatory variables. When qualitative variables have been included, the
corresponding categories are displayed with a hollow red circle.
Example
An example showing how to run a Redundancy Analysis is available on the Addinsoft website:
http://www.xlstat.com/demo-rda.htm
References
Legendre P. and Legendre L. (1998). Numerical Ecology. Second English Edition. Elsevier,
Amsterdam.
Oksanen J., Kindt R., Legendre P. and O'Hara R.B. (2007). vegan: Community Ecology
Package version 1.8-5. http://cran.r-project.org/.
Van den Wollenberg A.L. (1977). Redundancy analysis. An alternative for canonical
correlation analysis. Psychometrika, 42(2), 207-219.
Canonical Correspondence Analysis (CCA)
Use Canonical Correspondence Analysis (CCA) to analyze a contingency table (typically with
sites as rows and species in columns) while taking into account the information provided by a
set of explanatory variables contained in a second table and measured on the same sites.
Description
Canonical Correspondence Analysis (CCA) was developed to allow ecologists to relate
the abundance of species to environmental variables (Ter Braak, 1986). However, this method
can be used in other domains; geomarketing and demographic analyses, among others, can
take advantage of it.
Principles of CCA
Let T1 be a contingency table corresponding to the counts on n sites of p objects. This table
can be analyzed using Correspondence Analysis (CA) to obtain a simultaneous map of the
sites and objects in two or three dimensions.
Let T2 be a table that contains the measures recorded on the same n sites for q quantitative
and/or qualitative variables.
- An unconstrained part, which corresponds to the analysis of the residuals. The number of
dimensions for the unconstrained CCA is equal to min(n-1-q, p-1).
Partial CCA
Partial CCA adds a preliminary step. The T2 table is subdivided into two groups of variables:
the first group contains conditioning variables whose effect we want to remove, as it is either
known or without interest for the study. A CCA is run using these variables. A second CCA is
run using the second group of variables, whose effect we want to analyze. Partial CCA allows
you to analyze the effect of the second group of variables, after the effect of the first group has
been removed.
PLS-CCA
Tenenhaus (1998) has shown that it is possible to relate discriminant PLS to CCA. Addinsoft is the first software editor to propose a comprehensive and effective integration of the two methods. Using a restructuring of the data based on Tenenhaus's proposal, a PLS step is applied to the data, either to create orthogonal PLS components that are optimally designed for the CCA (which avoids the constraints on the number of variables that can be used), or to select the most influential variables before running the CCA. As the calculations and results of the CCA step are identical to those of the classical CCA, users can see this approach as a selection method that identifies the variables of highest interest, either because they are selected in the model, or by looking at the chart of the VIPs (see the section on PLS regression for more information). In the case of a partial CCA, the preliminary step is unchanged.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down (column mode), XLSTAT considers that rows correspond to sites and columns to
objects/variables. If the arrow points to the right (row mode), XLSTAT considers that rows
correspond to objects/variables and columns to sites.
General tab:
Sites/Objects data: Select the contingency table that corresponds to the counts of the various
objects recorded on each different site. If the Column labels option is activated (column mode) you need to include a header on the first row of the selection. If the Row labels option is activated (row mode) you need to include a header in the first column of the selection.
Sites/Variables data: Select the data that correspond to the various explanatory variables that
have been measured on the various sites and that you want to use in the analysis.
Quantitative: Activate this option if you want to use quantitative variables and then
select these variables.
Qualitative: Activate this option if you want to use qualitative variables and then select
these variables.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Partial CCA: Activate this option to run a partial CCA. If you activate this option, a dialog box
will be displayed during the analysis, so that you can select the conditioning variables (see the
description for additional details).
Column/Row labels: Activate this option if, in column mode, the first row of the selected data
contains a header, or in row mode, if the first column of the selected data contains a header.
Sites labels: Activate this option if sites labels are available. Then select the corresponding
data. If the Column labels option is activated you need to include a header in the selection. If
this option is not activated, the sites labels are automatically generated by XLSTAT (Obs1,
Obs2, …).
CCA: Activate this option if you want to run a classical CCA.
PLS-CCA: Activate this option if you want to run a PLS-CCA (see the description section for
additional details).
Options tab:
Filter factors: You can activate one of the following two options in order to reduce the number
of factors for which results are displayed.
Minimum %: Activate this option then enter the minimum percentage of the total
variability that the chosen factors must represent.
Maximum Number: Activate this option to set the number of factors to take into
account.
Permutation test: Activate this option if you want to use a permutation test to test the
independence of the two tables.
Number of permutations: Enter the number of permutations to perform for the test
(Default value: 500).
Significance level (%): Enter the significance level for the test.
PLS-CCA: If you choose to run a PLS-CCA the following options are available.
Automatic: Select this option if you want XLSTAT to automatically determine how
many PLS components should be used for the CCA step.
User defined: Select this option to set yourself the number of PLS components to use for the CCA step.
Do not accept missing data: Activate this option so that XLSTAT does not continue
calculations if missing values have been detected.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
Row and column profiles: Activate this option to display the row and column profiles.
Unconstrained CCA results: Activate this option to display the results of the unconstrained
CCA.
Eigenvalues: Activate this option to display the table and the scree plot of the eigenvalues.
Principal coordinates: Activate this option to display the principal coordinates of the sites,
objects and variables.
Standard coordinates: Activate this option to display the standard coordinates of the sites,
objects and variables.
Contributions: Activate this option to display the contributions of the sites, objects and
variables.
Squared cosines: Activate this option to display the squared cosines of the sites, objects and
variables.
Weighted averages: Activate this option to display the weighted averages that correspond to
the variables of the sites/variables table.
Regression coefficients: Activate this option to display regression coefficients that
correspond to the various variables in the factor space.
Charts tab:
Sites and objects / Symmetric: Activate this option to display a symmetric chart that
includes both the sites and the objects. For both the sites and the objects, the principal coordinates are used.
Sites / Asymmetric: Activate this option to display the asymmetric chart of the sites.
The principal coordinates are used for the sites, and the standard coordinates are used
for the objects.
Objects / Asymmetric: Activate this option to display the asymmetric chart of the
objects. The principal coordinates are used for the objects, and the standard
coordinates are used for the sites.
Sites: Activate this option to display a chart on which only the sites are displayed. The
principal coordinates are used.
Objects: Activate this option to display a chart on which only the objects are displayed.
The principal coordinates are used.
Variables:
Correlations: Activate this option to display the quantitative and qualitative variables on
the charts, using as coordinates their correlations (equal to their standard coordinates).
Regression coefficients: Activate this option to display the quantitative and qualitative
variables on the charts, using the regression coefficients as coordinates.
Labels: Activate this option to display the labels of the sites on the charts.
Colored labels: Activate this option to display the labels with the same color as the
corresponding points. If this option is not activated the labels are displayed in black.
Vectors: Activate this option to display the vectors for the standard coordinates on the
asymmetric charts.
Length factor: Activate this option to modulate the length of the vectors.
Results
Summary statistics: This table displays the descriptive statistics for the objects and the
explanatory variables.
Inertia: This table displays the distribution of the inertia between the constrained CCA and the
unconstrained CCA.
Eigenvalues and percentages of inertia: These tables display, for the constrained CCA and the unconstrained CCA, the eigenvalues, the corresponding inertia, and the corresponding percentages, expressed either in terms of constrained (or unconstrained) inertia, or in terms of total inertia.
Weighted averages: This table displays the weighted means as well as the global weighted means.
The principal coordinates and standard coordinates of the sites, the objects and the
variables are then displayed. These coordinates are used to produce the various charts.
Regression coefficients: This table displays the regression coefficients of the variables in the
factor space.
The charts make it possible to visualize the relationships between the sites, the objects and the variables.
When qualitative variables have been included, the corresponding categories are displayed
with a hollowed red circle.
Example
http://www.xlstat.com/demo-cca.htm
References
Chessel D., Lebreton J.D. and Yoccoz N. (1987). Propriétés de l'analyse canonique des correspondances; une illustration en hydrobiologie. Revue de Statistique Appliquée, 35(4), 55-72.
Legendre P. and Legendre L. (1998). Numerical Ecology. Second English Edition. Elsevier,
Amsterdam.
Palmer M.W. (1993). Putting things in even better order: The advantages of canonical
correspondence analysis. Ecology, 74(8), 2215-2230.
Multiple Factor Analysis (MFA)
Use Multiple Factor Analysis (MFA) to simultaneously analyze several tables of variables, and to obtain results, particularly charts, that make it possible to study the relationships between the observations, the variables and the tables. Within a table, the variables must be of the same
type (quantitative or qualitative), but the tables can be of different types.
Description
Multiple Factor Analysis (MFA) makes it possible to analyze several tables of variables simultaneously, and to obtain results, in particular charts, that allow studying the relationships between the observations, the variables and the tables (Escofier and Pagès, 1984). Within a table the variables must be of the same type (quantitative or qualitative), but the tables can be of different types.
MFA is a synthesis of PCA (Principal Component Analysis) and MCA (Multiple Correspondence Analysis), which it generalizes to enable the use of quantitative and qualitative variables. The methodology of the MFA breaks down into two phases:
1. We successively carry out, for each table, a PCA or an MCA according to the type of the variables of the table. The first eigenvalue of each analysis is stored in order to weight the various tables in the second phase of the analysis.
2. We carry out a weighted PCA on the columns of all the tables, knowing that the tables of qualitative variables are transformed into a complete disjunctive table, each indicator variable having a weight that is a function of the frequency of the corresponding category. The weighting of the tables prevents the tables that include more variables from weighing too much in the analysis.
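The two phases above can be sketched for the simplest case where all tables contain quantitative variables. This is a simplified illustration, not XLSTAT's implementation; the function name is our own.

```python
import numpy as np

def mfa_scores(tables):
    """Two-phase MFA for quantitative tables only.
    tables: list of (n, p_k) arrays measured on the same n observations."""
    blocks = []
    for T in tables:
        Z = (T - T.mean(axis=0)) / T.std(axis=0)     # Pearson (n) standardization
        # phase 1: first eigenvalue of this table's own PCA
        sv = np.linalg.svd(Z, compute_uv=False)
        lambda1 = sv[0] ** 2 / Z.shape[0]
        blocks.append(Z / np.sqrt(lambda1))          # weight the table by 1 / lambda1
    G = np.hstack(blocks)
    # phase 2: global PCA on the weighted, concatenated tables
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    factor_scores = U * s
    eigenvalues = s ** 2 / G.shape[0]
    return factor_scores, eigenvalues
```

Dividing each standardized table by the square root of its first eigenvalue is what equalizes the tables' influence: after weighting, the first eigenvalue of every block equals 1.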
This method can be very useful to analyze surveys for which one can identify several groups of
variables, or for which the same questions are asked at several time intervals.
The authors who developed the method (Escofier and Pagès, 1984) particularly insisted on the use that can be made of the results of the MFA. The originality of the method is that it makes it possible to visualize, in a two- or three-dimensional space, the tables (each table being represented by a point), the variables, the principal axes of the analyses of the first phase, and the individuals. In addition, one can study the impact of the other tables on an observation by simultaneously visualizing the observation described by all the variables and the projected observations described by the variables of only one table.
Note 1: as for PCA, the qualitative variables are represented by the centroids of the categories
on the charts of the observations.
Note 2: an MFA performed on K tables that each contain one qualitative variable is equivalent to an MCA performed on the K variables.
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Number of tables: Enter the number K of tables in which the selected data are subdivided.
Table labels: Activate this option if you want to use labels for the K tables. If this option is not
activated, the names of the tables are automatically generated (Table1, Table2, …). If column
headers have been selected, check that the "Variable labels" option has been activated.
Equal: Choose this option if the number of variables is identical for all the tables. In that
case XLSTAT automatically determines the number of variables in each table.
User defined: Choose this option to select a column that contains the number of
variables contained in each table. If the "Variable labels" option has been activated, the
first row must correspond to a header.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header. Where the selection is
a correlation or covariance matrix, if this option is activated, the first column must also include
the variable labels.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Weights: Activate this option if the observations are weighted. If you do not activate this
option, the weights will be considered as 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Variable labels" option is activated.
Options tab:
PCA type: Choose the type of matrix to be used for PCA. The difference between the Pearson
(n) and the Pearson (n-1) options, only influences the way the variables are standardized, and
the difference can only be noticed on the coordinates of the observations.
Data type: Specify the type of data contained in the various tables, knowing that the type must be the same within a given table. In the case where the Mixed type is selected,
you need to select a column that indicates the type of data in each table. Use 0 for a table that
contains quantitative variables, and 1 for a table that contains qualitative variables.
Filter factors: You can activate one of the following two options in order to reduce the number
of factors for which results are displayed.
Minimum %: Activate this option then enter the minimum percentage of the total
variability that the chosen factors must represent.
Maximum Number: Activate this option to set the number of factors to take into
account.
Display charts on two axes: Activate this option if you want the graphical representations displayed after the PCA, MCA and MFA analyses to be shown only on the first two axes, without being prompted after each analysis.
Supplementary observations: Activate this option if you want to calculate the coordinates
and represent additional observations. These observations are not taken into account for the
factor axis calculations (passive observations as opposed to active observations). Several
methods for selecting supplementary observations are provided:
N last rows: The last N observations are selected for validation. The Number of
observations N to display must then be specified.
N first rows: The first N observations are selected for validation. The Number of
observations N to display must then be specified.
Group variable: If you choose this option, you must then select an indicator variable
set to 0 for active observations and 1 for passive observations.
Supplementary tables: Activate this option if you want to use some tables as supplementary
tables. The variables of these tables will not be taken into account for the computation of the
factors of the MFA. However, the separate analyses of the first phase of the MFA will be run
on these tables. Select a column that contains the indicators (0/1) that let XLSTAT know which of the K tables are active (1) and which are supplementary (0).
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to ignore the observations that contain missing
data.
Adapted strategies: Activate this option to choose strategies that are adapted to the data
type.
Quantitative variables:
o Mean: Activate this option to estimate the missing data of an observation by the
mean of the corresponding variable.
Qualitative variables:
o New category: Choose this option to group missing data into a new category of
the corresponding variable.
Outputs tab:
General:
Descriptive statistics: Activate this option to display the descriptive statistics for all the
selected variables.
Correlations: Activate this option to display the correlation matrix for the selected quantitative
variables.
Eigenvalues: Activate this option to display the table and chart (scree plot) of eigenvalues.
Squared cosines: Activate this option to display the tables of squared cosines.
PCA:
Factor loadings: Activate this option to display the coordinates of the variables in the factor
space.
Factor scores: Activate to display the coordinates of the observations (factor scores) in the
new space created by PCA.
MCA:
Disjunctive table: Activate this option to display the full disjunctive table that corresponds to
the selected qualitative variables.
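As background, the complete disjunctive table is the 0/1 indicator coding of a qualitative variable (one column per category). A minimal sketch of the transformation for one variable; the function name is our own:

```python
import numpy as np

def disjunctive_table(values):
    """Complete disjunctive (indicator) coding of one qualitative variable:
    one 0/1 column per category, a single 1 per row."""
    categories = sorted(set(values))
    return np.array([[1 if v == cat else 0 for cat in categories]
                     for v in values])
```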
Observations: Activate this option to display the results that concern the observations.
Variables: Activate this option to display the results that concern the variables.
Test-values: Activate this option to display the test-values for the variables.
Significance level (%): Enter the significance level used to determine if the test values
are significant or not.
MFA:
Tables:
Coordinates: Activate this option to display the coordinates of the tables in the MFA
space. Note: the contributions and the squared cosines are also displayed if the
corresponding options are checked in the Outputs/General tab.
Lg coefficients: Activate this option to display the Lg coefficients.
Variables:
Factor loadings: Activate this option to display the factor loadings in the MFA space.
Partial axes:
Maximum number: Enter the maximum number of factors to keep from the analyses of
the first phase that you then want to analyze in the MFA space.
Coordinates: Activate this option to display the coordinates of the partial axes in the
space obtained from the MFA.
Correlations: Activate this option to display the correlations between the factors of the
MFA and the partial axes.
Correlations between axes: Activate this option to display the correlation between the
partial axes.
Observations:
Factor scores: Activate this option to display the factor scores in the MFA space.
Coordinates of the projected points: Activate this option to display the coordinates of
the projected points in the MFA space. The projected points correspond to the
projections of the observations in spaces reduced to the number of dimensions of each
table.
Charts tab:
General:
Colored labels: Activate this option to show labels in the same color as the points.
Filter: Activate this option to modulate the number of observations displayed:
N first rows: The N first observations are displayed on the chart. The Number of
observations N to display must then be specified.
N last rows: The N last observations are displayed on the chart. The Number of
observations N to display must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to display.
PCA:
Correlation charts: Activate this option to display the charts involving correlations between
the components and the variables.
Observations charts: Activate this option to display the charts that allow visualizing the
observations in the new space.
Labels: Activate this option to display the observations labels on the charts.
Biplots: Activate this option to display the charts where the input variables and the
observations are simultaneously displayed.
Vectors: Activate this option to display the input variables with vectors.
Labels: Activate this option to display the observations labels on the biplots.
Type of biplot: Choose the type of biplot you want to display. See the description section of
the PCA for more details.
MCA:
Symmetric plots: Activate this option to display the symmetric observations and variables
plots.
Observations and variables: Activate this option to display a plot that shows both the
observations and variables.
Observations: Activate this option to display a plot that shows only the observations.
Variables: Activate this option to display a plot that shows only the variables.
Asymmetric plots: Activate this option to display plots for which observations and variables
play an asymmetrical role. These plots are based on the principal coordinates for the
observations and the standard coordinates for the variables.
Variables: Activate this option to display an asymmetric plot where the variables are
displayed using their principal coordinates, and where the observations are displayed
using their standard coordinates.
Labels: Activate this option to display the labels of the categories on the charts.
Vectors: Activate this option to display the vectors for the standard coordinates on the
asymmetric charts.
Length factor: Activate this option to modulate the length of the vectors.
MFA:
These options concern only the results of the second phase of the MFA:
Table charts: Activate this option to display the charts that make it possible to visualize the tables in the MFA space.
Correlation charts: Activate this option to display the charts involving correlations between
the components and the quantitative variables used in the MFA.
Observations charts: Activate this option to display the chart of the observations in the MFA
space.
Color observations: Activate this option so that the observations are displayed using
different colors, depending on the value of the first qualitative supplementary variable.
Display the centroids: Activate this option to display the centroids that correspond to
the categories of the qualitative variables of the supplementary tables.
Correlation charts (partial axes): Activate this option to display the correlation chart for the
partial axes obtained from the first phase of the MFA.
Charts of the projected points: Activate this option to display the chart that shows at the same time the observations in the MFA space, and the observations projected in the sub-space of each table.
Observation labels: Activate this option to display the observations labels on the
charts.
Projected points labels: Activate this option to display the labels of the projected
points.
Results
Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the
variables selected. This includes the number of observations, the number of missing values,
the number of non-missing values, the mean and the standard deviation (unbiased).
Correlation/Covariance matrix: This table shows the correlations between all the quantitative
variables. The type of coefficient depends on what has been chosen in the dialog box.
The results of the analyses performed on each individual table (PCA or MCA) are then
displayed. These results are identical to those you would obtain after running the PCA or MCA
function of XLSTAT.
Afterwards, the results of the second phase of the MFA are displayed.
Eigenvalues: The eigenvalues and corresponding chart (scree plot) are displayed. The
number of eigenvalues is equal to the number of non-null eigenvalues.
Eigenvectors: This table shows the eigenvectors obtained from the spectral decomposition.
These vectors take into account the variable weights used in the MFA.
The coordinates of the tables are then displayed and used to create the plots of the tables. These plots make it possible to visualize the distance between the tables. The coordinates of the supplementary tables are displayed in the second part of the table.
Contributions (%): Contributions are an interpretation aid. The tables which had the highest
influence in building the axes are those whose contributions are highest.
Squared cosines: As in other factor methods, squared cosine analysis is used to avoid
interpretation errors due to projection effects. If the squared cosines associated with the axes
used on a chart are low, the position of the observation or the variable in question should not
be interpreted.
RV coefficients: The RV coefficients of relationship between the tables are another measure
derived from the Lg coefficients. The value of the RV coefficients varies between 0 and 1.
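The RV coefficient between two tables measured on the same observations follows the formula of Robert and Escoufier (1976). A minimal sketch assuming quantitative, column-centered tables; the function name is our own:

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data tables sharing the same rows."""
    X = X - X.mean(axis=0)                 # column-center both tables
    Y = Y - Y.mean(axis=0)
    Sx, Sy = X @ X.T, Y @ Y.T              # cross-product (configuration) matrices
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
```

The coefficient equals 1 when the two tables describe the same configuration of the observations (for instance, a table compared with a rescaled copy of itself).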
The results that follow concern the quantitative variables. As in a PCA, the coordinates of the variables (factor loadings), their correlations with the axes, the contributions and the squared cosines are displayed.
The coordinates of the partial axes, and even more so their correlations, make it possible to visualize in the new space the link between the factors obtained from the first phase of the MFA and those obtained from the second phase.
The results that concern the observations are then displayed as they are after a PCA (coordinates, contributions in %, and squared cosines).
Last, the coordinates of the projected points in the space resulting from the MFA are
displayed. The projected points correspond to projections of the observations in the spaces
reduced to the dimensions of each table. The representation of the projected points
superimposed with those of the complete observations makes it possible to visualize at the
same time the diversity of the information brought by the various tables for a given
observation, and to visualize the relative distances from two observations according to the
various tables.
Example
References
Escofier B. and Pagès J. (1984). L'analyse factorielle multiple : une méthode de comparaison de groupes de variables. In: Sokal R.R., Diday E., Escoufier Y., Lebart L., Pagès J. (Eds), Data Analysis and Informatics III, 41-55. North-Holland, Amsterdam.
Escofier B. and Pagès J. (1994). Multiple Factor Analysis (AFMULT package). Computational Statistics and Data Analysis, 18, 121-140.
Robert P. and Escoufier Y. (1976). A unifying tool for linear multivariate methods. The RV coefficient. Applied Statistics, 25(3), 257-265.
Dose effect analysis
Use this function to model the effects of a dose on a response variable, if necessary taking into
account an effect of natural mortality.
Description
This tool uses logistic regression (Logit, Probit, complementary Log-log, Gompertz models) to
model the impact of doses of chemical components (for example a medicine or phytosanitary
product) on a binary phenomenon (healing, death).
More information on logistic regression is available in the help section on this subject.
Natural mortality
This tool takes natural mortality into account in order to model the phenomenon studied more accurately. Indeed, if we consider an experiment carried out on insects, some will die because of the dose injected, and others from other phenomena. None of these associated phenomena are relevant to the experiment concerning the effects of the dose, but they need to be taken into account. If p is the probability from a logistic regression model corresponding only to the effect of the dose, and if m is the natural mortality, then the observed probability that the insect will succumb is:
P(obs) = m + (1 - m) * p
p = (P(obs) - m) / (1 - m)
The natural mortality m may be entered by the user if it is known from previous experiments, or determined by XLSTAT.
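The two formulas above translate directly into code; a minimal sketch (the function names are illustrative, not XLSTAT identifiers):

```python
def observed_probability(p, m):
    """Observed response probability when natural mortality m acts
    in addition to the dose-induced probability p (Abbott-style correction)."""
    return m + (1 - m) * p

def dose_probability(p_obs, m):
    """Recover the dose-only probability from the observed one."""
    return (p_obs - m) / (1 - m)
```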
ED 50, ED 90, ED 99
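Assuming a logit model logit(p) = beta0 + beta1 * dose, the effective dose at any response level q (ED 50 for q = 0.5, ED 90 for q = 0.9, ED 99 for q = 0.99) can be obtained by inverting the link function. A minimal sketch under that assumption; the function name is our own:

```python
import math

def effective_dose(beta0, beta1, q):
    """Dose at which the modeled response probability equals q,
    for a logit model logit(p) = beta0 + beta1 * dose."""
    return (math.log(q / (1 - q)) - beta0) / beta1
```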
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Dependent variables:
Response variable(s): Select the response variable(s) you want to model. If several variables
have been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
Response type: Choose the type of response variable you have selected:
Binary variable: If you select this option, you must select a variable containing exactly
two distinct values. If the variable has values 0 and 1, XLSTAT will ensure that the high
probabilities of the model correspond to category 1 and that the low probabilities
correspond to category 0. If the variable has two values other than 0 or 1 (for example
Yes/No), the lower probabilities correspond to the first category and the higher
probabilities to the second.
Sum of binary variables: If your response variable is a sum of binary variables, it must be of numeric type and contain the number of positive events (event 1) amongst those observed. The variable corresponding to the total number of events observed for this observation (events 1 and 0 combined) must then be selected in the "Observation weights" field. This case corresponds, for example, to an experiment where a dose D (D is the explanatory variable) of a medicament is administered to 50 patients (50 is the value of the observation weights) and where it is observed that 40 get better under the effects of the dose (40 is the response variable).
Explanatory variables:
Quantitative: Activate this option if you want to include one or more quantitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
data selected must be numerical. If the variable header has been selected, check
that the "Variable labels" option has been activated.
Qualitative: Activate this option if you want to include one or more qualitative explanatory
variables in the model. Then select the corresponding variables in the Excel worksheet. The
selected data may be of any type, but numerical data will automatically be considered as
nominal. If the variable header has been selected, check that the "Variable labels" option has
been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Observation weights: This field must be entered if the "sum of binary variables" option has
been chosen. Otherwise, this field is not active. If a column header has been selected, check
that the "Variable labels" option has been activated.
Options tab:
Firth's method: Activate this option to use Firth's penalized likelihood (see description).
Confidence interval (%): Enter the percentage range of the confidence interval to use for the
various tests and for calculating the confidence intervals around the parameters and
predictions. Default value: 95.
Stop conditions:
Iterations: Enter the maximum number of iterations for the Newton-Raphson algorithm.
The calculations are stopped when the maximum number of iterations has been
exceeded. Default value: 100.
Convergence: Enter the maximum value of the evolution of the log of the likelihood
from one iteration to another which, when reached, means that the algorithm is
considered to have converged. Default value: 0.000001.
Take the log: Activate this option so that XLSTAT uses the logarithm of the input variables in
the model.
Optimized: Choose this option so that XLSTAT optimizes the value of the natural
mortality parameter.
User defined: Choose this option to set the value of the natural mortality parameter.
Validation tab:
Validation: Activate this option if you want to use a sub-sample of the data to validate the
model.
Validation set: Choose one of the following options to define how to obtain the observations
used for the validation:
N last rows: The N last observations are selected for the validation. The Number of
observations N must then be specified.
N first rows: The N first observations are selected for the validation. The Number of
observations N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only
0s and 1s. The 1s identify the observations to use for the validation.
Prediction tab:
Prediction: Activate this option if you want to select data to use in prediction mode. If you
activate this option, you need to make sure that the prediction dataset has the same structure
as the estimation dataset: same variables in the same order in the selections. However,
variable labels must not be selected: the first row of the selections listed below must
correspond to data.
Quantitative: activate this option to select the quantitative explanatory variables. The first row
must not include variable labels.
Qualitative: activate this option to select the qualitative explanatory variables. The first row
must not include variable labels.
Observations labels: activate this option if observations labels are available. Then select the
corresponding data. If this option is not activated, the observations labels are automatically
generated by XLSTAT (PredObs1, PredObs2, etc.).
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
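The two estimation options can be sketched as follows with NumPy. The helper names are hypothetical, and the exact distance used by XLSTAT's nearest-neighbour search is not documented here, so Euclidean distance on the shared coordinates is assumed:

```python
import numpy as np

def impute_mean(X):
    """Replace missing cells (NaN) by the mean of the corresponding column
    (the quantitative-variable case of the "Mean or mode" option)."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmean(col)
    return X

def impute_nearest_neighbour(X):
    """Replace the missing cells of a row by the values of the closest complete
    row, using Euclidean distance on the coordinates both rows share."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if miss.any():
            d = np.linalg.norm(complete[:, ~miss] - X[i, ~miss], axis=1)
            X[i, miss] = complete[np.argmin(d), miss]
    return X
```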
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Correlations: Activate this option to display the explanatory variables correlation matrix.
Goodness of fit statistics: Activate this option to display the table of goodness of fit statistics
for the model.
Type III analysis: Activate this option to display the type III analysis of variance table.
Model coefficients: Activate this option to display the table of coefficients for the model.
Optionally, confidence intervals of type "profile likelihood" can be calculated (see
description).
Standardized coefficients: Activate this option if you want the standardized coefficients (beta
coefficients) for the model to be displayed.
Equation: Activate this option to display the equation for the model explicitly.
Predictions and residuals: Activate this option to display the predictions and residuals for all
the observations.
Probability analysis: If only one explanatory variable has been selected, activate this option
so that XLSTAT calculates the value of the explanatory variable corresponding to various
probability levels.
Charts tab:
Results
XLSTAT displays a large number of tables and charts to help in analyzing and interpreting the
results.
Summary statistics: This table displays descriptive statistics for all the variables selected. For
the quantitative variables, the number of missing values, the number of non-missing values,
the mean and the standard deviation (unbiased) are displayed. For qualitative variables,
including the dependent variable, the categories with their respective frequencies and
percentages are displayed.
Correlation matrix: This table displays the correlations between the explanatory variables.
Correspondence between the categories of the response variable and the probabilities:
This table shows which categories of the dependent variable have been assigned probabilities
0 and 1.
Goodness of fit coefficients: This table displays a series of statistics for the independent
model (corresponding to the case where the linear combination of explanatory variables
reduces to a constant) and for the adjusted model.
Observations: The total number of observations taken into account (sum of the weights of
the observations);
Sum of weights: The total number of observations taken into account (sum of the weights of
the observations multiplied by the weights in the regression);
-2 Log(Like.): Minus two times the logarithm of the likelihood function associated with the
model;
R² (McFadden): Coefficient, like the R², between 0 and 1, which measures how well the
model is adjusted. This coefficient is equal to 1 minus the ratio of the log-likelihood of the
adjusted model to the log-likelihood of the independent model;
R² (Cox and Snell): Coefficient, like the R², between 0 and 1, which measures how well the
model is adjusted. This coefficient is equal to 1 minus the ratio of the likelihood of the
independent model to the likelihood of the adjusted model, raised to the power 2/Sw, where
Sw is the sum of weights;
R² (Nagelkerke): Coefficient, like the R², between 0 and 1, which measures how well the
model is adjusted. This coefficient is equal to the R² of Cox and Snell divided by 1 minus
the likelihood of the independent model raised to the power 2/Sw;
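Given the log-likelihoods of the independent (null) and adjusted models and the sum of weights Sw, the three coefficients can be computed with the standard formulas. A minimal sketch with made-up values:

```python
import math

def pseudo_r2(ll_null, ll_model, sw):
    """McFadden, Cox-Snell and Nagelkerke pseudo-R² from the log-likelihoods
    of the independent (null) and adjusted models and the sum of weights sw."""
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(2.0 / sw * (ll_null - ll_model))
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 / sw * ll_null))
    return mcfadden, cox_snell, nagelkerke

# Hypothetical values: null log-likelihood -69.3, fitted -50.0, 100 observations.
mcf, cs, nag = pseudo_r2(ll_null=-69.3, ll_model=-50.0, sw=100)
```

Nagelkerke's coefficient rescales Cox and Snell's so that its maximum attainable value is 1, which is why it is always the larger of the two.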
Test of the null hypothesis H0: Y=p0: The H0 hypothesis corresponds to the independent
model which gives probability p0 whatever the values of the explanatory variables. We seek to
check whether the adjusted model is significantly more powerful than this model. Three tests
are available: the likelihood ratio test (-2 Log(Like.)), the Score test and the Wald test. The
three statistics follow a Chi² distribution whose degrees of freedom are shown.
Type III analysis: This table is only useful if there is more than one explanatory variable. Here,
the adjusted model is tested against a test model where the variable in the row of the table in
question has been removed. If the probability Pr > LR is less than a significance threshold
which has been set (typically 0.05), then the contribution of the variable to the adjustment of
the model is significant. Otherwise, it can be removed from the model.
Model parameters: The parameter estimate, corresponding standard deviation, Wald's Chi²,
the corresponding p-value and the confidence interval are displayed for the constant and each
variable of the model. If the corresponding option has been activated, the "profile likelihood"
intervals are also displayed.
The equation of the model is then displayed to make it easier to read or re-use the model.
The table of standardized coefficients (also called beta coefficients) is used to compare the
relative weights of the variables. The higher the absolute value of a coefficient, the more
important the weight of the corresponding variable. When the confidence interval around a
standardized coefficient includes 0 (this can easily be seen on the chart of standardized
coefficients), the weight of the variable in the model is not significant.
The predictions and residuals table shows, for each observation, its weight, the value of the
qualitative explanatory variable, if there is only one, the observed value of the dependent
variable, the model's prediction, the same values divided by the weights, the standardized
residuals and a confidence interval.
If only one quantitative variable has been selected, the probability analysis table lets you
see which value of the explanatory variable corresponds to a given probability of success.
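For a single explanatory variable with a fitted equation of the form logit(p) = b0 + b1·x, the probability analysis amounts to inverting the link function. A small sketch with hypothetical coefficients:

```python
import math

def dose_for_probability(p, b0, b1):
    """Invert logit(p) = b0 + b1*x to find the x giving success probability p."""
    return (math.log(p / (1.0 - p)) - b0) / b1

# Hypothetical fitted coefficients: the dose giving a 50% response (the ED50)
# is where the linear predictor crosses 0.
ed50 = dose_for_probability(0.5, b0=-2.0, b1=0.5)   # -> 4.0
```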
Example
A tutorial on how to use the dose effect analysis is available on the Addinsoft website:
http://www.xlstat.com/demo-dose.htm
References
Abbott W.S. (1925). A method for computing the effectiveness of an insecticide. Jour. Econ.
Entomol., 18, 265-267.
Agresti A. (1990). Categorical Data Analysis. John Wiley & Sons, New York.
Finney D.J. (1971). Probit Analysis. 3rd ed., Cambridge University Press, Cambridge.
Firth D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80, 27-38.
Heinze G. and Schemper M. (2002). A solution to the problem of separation in logistic
regression. Statistics in Medicine, 21, 2409-2419.
Hosmer D.W. and Lemeshow S. (2000). Applied Logistic Regression, Second Edition. John
Wiley and Sons, New York.
Tallarida R.J. (2000). Drug Synergism & Dose-Effect Data Analysis, CRC/Chapman & Hall,
Boca Raton.
Venzon, D. J. and Moolgavkar S. H. (1988). A method for computing profile likelihood based
confidence intervals. Applied Statistics, 37, 87-94.
Four-parameter parallel lines logistic regression
Use this tool to analyze the effect of a quantitative variable on a response variable using the
four-parameter logistic model. XLSTAT lets you take a standard sample into account while
fitting the model, and automatically remove outliers.
Description
The four-parameter logistic model writes:

(1)   y = a + (d - a) / (1 + (x / c)^b)

where a, b, c, d are the parameters of the model, and where x corresponds to the explanatory
variable and y to the response variable. a and d are the parameters that respectively represent
the lower and upper asymptotes, and b is the slope parameter. c is the abscissa of the
mid-height point, whose ordinate is (a + d)/2. When a is lower than d, the curve decreases
from d to a, and when a is greater than d, the curve increases from a to d.
When a standard sample (STD) and a sample of interest (SOI) are analyzed together, the
model writes:

(2)   y = a + (d - a) / (1 + (st·x/c1 + sp·x/c2)^b)

where st is 1 if the observation comes from the standard sample (STD), and 0 if not, and
where sp is 1 if the observation is from the sample of interest (SOI), and 0 if not. This is a
constrained model because the observations corresponding to the standard sample influence
the optimization of the values of a, b, and d. From the above writing of the model, one can
see that this model generates two parallel curves whose only difference is the positioning of
the curve, the shift being given by (c2 - c1). If c2 is greater than c1, the curve corresponding
to the sample of interest is shifted to the right of the curve corresponding to the standard
sample, and vice versa.
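A minimal sketch of the standard four-parameter logistic form y = a + (d - a)/(1 + (x/c)^b), useful to check the roles of the four parameters (the parameter values are arbitrary):

```python
def four_pl(x, a, b, c, d):
    """Four-parameter logistic: y = a + (d - a) / (1 + (x / c)**b)."""
    return a + (d - a) / (1.0 + (x / c) ** b)

# With a < d the curve decreases from d (at x = 0) towards the asymptote a
# for large x, and passes through the mid-height (a + d) / 2 at x = c.
a, b, c, d = 1.0, 2.0, 5.0, 9.0
```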
If the Dixon's test option is activated, XLSTAT can test, for each sample (STD and SOI),
whether some outliers unduly influence the fit of the model. When a single sample is
analyzed, a Dixon's test is performed once model (1) is fitted. If an outlier is detected, it is
removed, the model is fitted again, and so on until no outlier is detected. When two samples
are analyzed, a Dixon's test is first performed on the STD, then on the SOI, and then model
(2) is fitted on the merge of the two samples, without the outliers.

In the two-sample case, if the sum of the sample sizes is greater than 9, a Fisher's F test is
performed to check whether the a, b and d parameters obtained for each sample with model
(1) are significantly different from those obtained with model (2).
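The help does not spell out which Dixon variant XLSTAT uses; the classical two-sided Q test (gap between the suspect extreme and its neighbour, divided by the range) can be sketched as follows, with approximate 95% critical values that should be checked against a published table before use:

```python
def dixon_q(sample):
    """Classical Dixon Q statistic (r10): gap between the suspect extreme
    value and its nearest neighbour, divided by the sample range."""
    s = sorted(sample)
    rng = s[-1] - s[0]
    q_low = (s[1] - s[0]) / rng       # is the minimum an outlier?
    q_high = (s[-1] - s[-2]) / rng    # is the maximum an outlier?
    return max(q_low, q_high)

# Approximate two-sided 95% critical values for n = 3..10 (classical table;
# verify against a published source before relying on these figures).
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def has_outlier(sample):
    return dixon_q(sample) > Q_CRIT_95[len(sample)]
```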
Dialog box
The dialog box is divided into several tabs that correspond to a variety of options ranging from
the selection of data to the display of results. You will find below the description of the various
elements of the dialog box.
: Click this button to close the dialog box without doing any computation.
: Click these buttons to change the way XLSTAT handles the data. If the arrow points
down, XLSTAT considers that rows correspond to observations and columns to variables. If
the arrow points to the right, XLSTAT considers that rows correspond to variables and columns
to observations.
General tab:
Y / Dependent variables:
Quantitative: Select the response variable(s) you want to model. If several variables have
been selected, XLSTAT carries out calculations for each of the variables separately. If a
column header has been selected, check that the "Variable labels" option has been activated.
X / Explanatory variables:
Quantitative: Select the quantitative explanatory variables to include in the model. If the
variable header has been selected, check that the "Variable labels" option has been activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections (dependent and
explanatory variables, weights, observations labels) includes a header.
Observation labels: Activate this option if observations labels are available. Then select the
corresponding data. If the Variable labels option is activated you need to include a header in
the selection. If this option is not activated, the observations labels are automatically generated
by XLSTAT (Obs1, Obs2, etc.).
Subsamples: Activate this option then select a column (column mode) or a row (row mode)
containing the sample identifier(s). The identifiers must be 0 and/or 1. If a header has been
selected, check that the "Variable labels" option has been activated.
Options tab:
Initial values: Activate this option to give XLSTAT a starting point. Select the cells which
correspond to the initial values of the parameters. The number of rows selected must be the
same as the number of parameters.
Parameters bounds: Activate this option to give XLSTAT a possible region for all the
parameters of the model selected. You must then select a two-column range, the one on the
left being the lower bounds and the one on the right the upper bounds. The number of rows
selected must be the same as the number of parameters.
Parameters labels: Activate this option if you want to specify the names of the parameters.
XLSTAT will display the results using the selected labels instead of using generic labels pr1,
pr2, etc. The number of rows selected must be the same as the number of parameters.
Stop conditions:
Iterations: Enter the maximum number of iterations for the algorithm. The calculations
are stopped when the maximum number of iterations has been exceeded. Default value:
100.
Convergence: Enter the maximum value of the evolution in the Sum of Squares of
Errors (SSE) from one iteration to another which, when reached, means that the
algorithm is considered to have converged. Default value: 0.00001.
Dixon's test: Activate this option to use Dixon's test to remove outliers from the estimation
sample.
Confidence intervals: Activate this option to enter the size of the confidence interval for
Dixon's test.
Remove observations: Activate this option to remove the observations with missing data.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the variables
selected.
Goodness of fit statistics: Activate this option to display the table of goodness of fit statistics
for the model.
Model parameters: Activate this option to display the values of the parameters for the model
after fitting.
Equation of the model: Activate this option to display the equation of the model once fitted.
Predictions and residuals: Activate this option to display the predictions and residuals for all
the observations.
Charts tab:
Data and predictions: Activate this option to display the chart of observations and the curve
for the fitted function.
Results
Summary statistics: This table displays for the selected variables, the number of
observations, the number of missing values, the number of non-missing values, the mean and
the standard deviation (unbiased).
Fisher's test assessing parallelism between curves: Fisher's F test is used to determine
whether the models corresponding to the standard sample and the sample of interest are
significantly different or not. If the probability corresponding to the F value is lower than the
significance level, then the difference can be considered significant.
The determination coefficient R²;
The sum of squares of the errors (or residuals) of the model (SSE or SSR respectively);
The mean of the squares of the errors (or residuals) of the model (MSE or MSR);
The root mean square of the errors (or residuals) of the model (RMSE or RMSR);
Model parameters: This table displays the estimator and the standard error of the estimator
for each parameter of the model. It is followed by the equation of the model.
Predictions and residuals: This table gives, for each observation, the input data and the
corresponding prediction and residual. The outliers detected by Dixon's test, if any, are
displayed at the bottom of the table.
Charts: On the first chart, the data and the curve corresponding to the standard sample are
displayed in blue, and the data and the curve corresponding to the sample of interest in red.
A chart comparing the predictions with the observed values, as well as the bar chart of the
residuals, are also displayed.
Example
A tutorial on how to use the four-parameter parallel lines logistic regression tool is available
on the Addinsoft website:
http://www.xlstat.com/demo-4pl.htm
References
Tallarida R.J. (2000). Drug Synergism & Dose-Effect Data Analysis. CRC/Chapman & Hall,
Boca Raton.
XLSTAT-PLSPM
XLSTAT-PLSPM is a module of XLSTAT that is dedicated to the Partial Least Squares Path
Modeling approach, an innovative method for representing complex relationships between
observed variables and latent variables.
Description
Partial Least Squares Path Modeling (PLS-PM) is a statistical approach for modeling complex
multivariable relationships (structural equation models) among observed and latent variables.
In recent years, this approach has enjoyed increasing popularity in several sciences
(Esposito Vinzi et al., 2007). Structural Equation Models include a number of statistical
methodologies allowing the estimation of a causal theoretical network of relationships linking
latent complex concepts, each measured by means of a number of observable indicators.
The first presentation of the finalized PLS approach to path models with latent variables was
published by Wold in 1979; the main references on the PLS algorithm are Wold (1982 and
1985).
Herman Wold opposed LISREL (Jöreskog, 1970) "hard modeling" (heavy distribution
assumptions, several hundreds of cases necessary) to PLS "soft modeling" (very few
distribution assumptions, few cases can suffice). These two approaches to Structural Equation
Modeling have been compared in Jöreskog and Wold (1982).
Furthermore, PLS Path Modeling can be used for analyzing multiple tables and it is directly
related to more classical data analysis methods used in this field. In fact, PLS-PM may also be
viewed as a very flexible approach to multi-block (or multiple table) analysis by means of both
the hierarchical PLS path model and the confirmatory PLS path model (Tenenhaus and Hanafi,
2007). This approach clearly shows how the data-driven tradition of multiple table analysis
can be merged into the theory-driven tradition of structural equation modeling so as
to allow running the analysis of multi-block data in light of current knowledge on conceptual
relationships between tables.
A PLS Path model is described by two models: (1) a measurement model relating the manifest
variables to their own latent variable and (2) a structural model relating some endogenous
latent variables to other latent variables. The measurement model is also called the outer
model and the structural model the inner model.
There exist four options for the standardization of the manifest variables, depending upon
three conditions that may hold in the data:
Condition 1: The scales of the manifest variables are comparable. For instance, in the ECSI
example the item values (between 0 and 100) are comparable. On the other hand, for
instance, weight in tons and speed in km/h would not be comparable.
Condition 2: The means of the manifest variables are interpretable. For instance, if the
difference between two manifest variables is not interpretable, the location parameters are
meaningless.
Condition 3: The variances of the manifest variables reflect their importance.
If condition 1 does not hold, then the manifest variables have to be standardized (mean 0 and
variance 1).
If condition 1 holds, it is useful to get the results based on the raw data. But the calculation of
the model parameters depends upon the validity of the other conditions:
- Conditions 2 and 3 do not hold: The manifest variables are standardized (mean 0, variance
1) for the parameter estimation phase. Then the manifest variables are rescaled to their
original means and variances for the final expression of the weights and loadings.
- Condition 2 holds, but not condition 3: The manifest variables are not centered, but are
standardized to unitary variance for the parameter estimation phase. Then the manifest
variables are rescaled to their original variances for the final expression of the weights and
loadings (to be defined later).
Lohmöller (1989) introduced a standardization parameter to select one of these four options.
2. The measurement model
2.1.1. Definition
In this model each manifest variable reflects its latent variable. Each manifest variable is
related to its latent variable by a simple regression:
(1)   xh = λh0 + λh ξ + εh,

where ξ has mean m and standard deviation 1. It is a reflective scheme: each manifest
variable xh reflects its latent variable ξ. The only hypothesis made on model (1) is called by
H. Wold the predictor specification condition:

(2)   E(xh | ξ) = λh0 + λh ξ.

This hypothesis implies that the residual εh has a zero mean and is uncorrelated with the
latent variable ξ.
In the reflective way the block of manifest variables is unidimensional in the sense of factor
analysis. On practical data this condition has to be checked. Three main tools are available to
check the unidimensionality of a block: principal component analysis of each block of
manifest variables, Cronbach's α and Dillon-Goldstein's ρ.
a) Principal component analysis of a block

A block is essentially unidimensional if the first eigenvalue of the correlation matrix of the
block's MVs is larger than 1 and the second one smaller than 1, or at least very far from the
first one. The first principal component can be built in such a way that it is positively
correlated with all (or at least a majority of) the MVs. There is a problem with MVs negatively
correlated with the first principal component.
b) Cronbach's α

For standardized variables, Cronbach's α is defined as:

(3)   α = [ Σ(h≠h') cor(xh, xh') / ( p + Σ(h≠h') cor(xh, xh') ) ] · p/(p-1).

Cronbach's α is also defined for original (raw) variables as:

(4)   α = [ Σ(h≠h') cov(xh, xh') / var( Σh xh ) ] · p/(p-1).

A block is considered unidimensional when Cronbach's α is larger than 0.7.
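Since var(Σh xh) equals the sum of the item variances plus the off-diagonal covariances, formula (4) is equivalent to p/(p-1) · (1 - Σ item variances / variance of the item sum), which gives a compact implementation. A minimal sketch with made-up data:

```python
import numpy as np

def cronbach_alpha(X):
    """Raw-variable Cronbach's alpha for an n x p block, via the equivalent
    form alpha = p/(p-1) * (1 - sum of item variances / variance of item sum)."""
    p = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return p / (p - 1) * (1.0 - item_vars.sum() / total_var)

# Example: four noisy parallel measurements of one underlying construct.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base + 0.3 * rng.normal(size=200) for _ in range(4)])
alpha = cronbach_alpha(X)
```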
c) Dillon-Goldstein's ρ

The sign of the correlation between each MV xh and its LV ξ is known by construction of the
item and is supposed here to be positive. In equation (1) this hypothesis means that all the
loadings λh are positive. A block is unidimensional if all these loadings are large. The
Dillon-Goldstein's ρ is defined as:

(5)   ρ = ( Σh λh )² Var(ξ) / [ ( Σh λh )² Var(ξ) + Σh Var(εh) ].

Let's now suppose that all the MVs xh and the latent variable ξ are standardized. An
approximation of the latent variable ξ is obtained by standardization of the first principal
component t1 of the block's MVs. Then λh is estimated by cor(xh, t1) and, using equation (1),
Var(εh) is estimated by 1 - cor²(xh, t1). So we get an estimate of the Dillon-Goldstein's ρ:
(6)   ρ̂ = ( Σh cor(xh, t1) )² / [ ( Σh cor(xh, t1) )² + Σh ( 1 - cor²(xh, t1) ) ].

A block is considered unidimensional when the Dillon-Goldstein's ρ is larger than 0.7. This
statistic is considered to be a better indicator of the unidimensionality of a block than
Cronbach's α (Chin, 1998, p. 320).
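The estimate (6) can be computed directly from the correlations between the manifest variables and the first principal component. A minimal sketch with made-up data (function name illustrative):

```python
import numpy as np

def dillon_goldstein_rho(X):
    """Dillon-Goldstein's rho estimated from the correlations between each MV
    and the first principal component t1 of the standardized block."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    t1 = Z @ eigvecs[:, -1]                               # first PC scores
    load = np.array([np.corrcoef(Z[:, h], t1)[0, 1] for h in range(Z.shape[1])])
    load = load * np.sign(load.sum())                     # resolve sign ambiguity
    return load.sum() ** 2 / (load.sum() ** 2 + (1.0 - load ** 2).sum())

rng = np.random.default_rng(1)
base = rng.normal(size=200)
X = np.column_stack([base + 0.3 * rng.normal(size=200) for _ in range(4)])
rho = dillon_goldstein_rho(X)
```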
PLS Path Modeling is a mixture of a priori knowledge and data analysis. In the reflective way,
the a priori knowledge concerns the unidimensionality of the block and the signs of the
loadings. The data have to fit this model. If they do not, they can be modified by removing
some manifest variables that are far from the model. Another solution is to change the model
and use the formative way that will now be described.
In the formative way, it is supposed that the latent variable ξ is generated by its own manifest
variables. The latent variable ξ is a linear function of its manifest variables plus a residual
term:

(7)   ξ = Σh ωh xh + δ.

In the formative model the block of manifest variables can be multidimensional. The predictor
specification condition is supposed to hold as:

(8)   E(ξ | x1, ..., xpj) = Σh ωh xh.

This hypothesis implies that the residual δ has a zero mean and is uncorrelated with the
MVs xh.
The MIMIC way mixes the reflective and formative ways:

(9)   xh = λh0 + λh ξ + εh,  for h = 1 to p1,

(10)  ξ = Σ(h = p1+1 to p) ωh xh + δ.

The p1 first manifest variables follow a reflective way and the (p - p1) last ones a formative
way. The predictor specification hypotheses still hold and lead to the same consequences as
before on the residuals.
The causality model leads to linear equations relating the latent variables between them (the
structural or inner model):

(11)  ξj = βj0 + Σi βji ξi + νj.
The standardized latent variables (mean = 0 and standard deviation = 1) are estimated as
linear combinations of their centered manifest variables:

(12)  yj ∝ ± [ Σh wjh (xjh - x̄jh) ],

where the symbol ∝ means that the left variable represents the standardized right variable
and the sign ± shows the sign ambiguity. This ambiguity is solved by choosing the sign
making yj positively correlated to a majority of the xjh.
The standardized latent variable is finally written as:

(13)  yj = Σh w̃jh (xjh - x̄jh).

The mean of the latent variable is estimated by:

(14)  m̂j = Σh w̃jh x̄jh,

and the latent variable estimate is obtained as:

(15)  ξ̂j = yj + m̂j = Σh w̃jh xjh.
When all manifest variables are observed on the same measurement scale, it is convenient
to express (Fornell, 1992) latent variable estimates on the original scale as:

(16)  ξ*j = Σh w̃jh xjh / Σh w̃jh.
Equation (16) is feasible when all outer weights are positive. Finally, most often in real
applications, latent variable estimates are required on a 0-100 scale so as to have a
reference scale to compare individual scores. From equation (16), for the i-th observed
case, this is easily obtained by the following transformation:

(17)  ξ⁰ij = 100 · (ξ*ij - xmin) / (xmax - xmin),

where xmin and xmax are, respectively, the minimum and the maximum value of the
measurement scale common to all manifest variables.
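The 0-100 transformation (17) is a plain min-max rescaling. A minimal sketch (function name illustrative):

```python
def score_0_100(xi_star, xmin, xmax):
    """Rescale a latent variable score expressed on the common manifest
    variable scale [xmin, xmax] to a 0-100 score."""
    return 100.0 * (xi_star - xmin) / (xmax - xmin)

# Items measured on a 1-10 scale: a score of 5.5 sits exactly at mid-scale.
mid = score_0_100(5.5, 1.0, 10.0)   # -> 50.0
```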
The inner estimate zj of the standardized latent variable (ξj - mj) is defined by:

(18)  zj ∝ Σ(j' : ξj' is connected with ξj) ejj' yj',

where the inner weights ejj' are equal to the signs of the correlations between yj and the yj'
connected with yj. Two latent variables are connected if there exists a link between the two
variables: an arrow goes from one variable to the other in the arrow diagram describing the
causality model. This choice of inner weights is called the centroid scheme.
Centroid scheme:
This choice shows a drawback when a correlation is approximately zero, as its sign may
change for very small fluctuations. But this does not seem to be a problem in practical
applications.
In the original algorithm, the inner estimate is the right term of (18) and there is no
standardization. We prefer to standardize because it does not change anything for the final
inner estimate of the latent variables and it simplifies the writing of some equations.
Two other schemes for choosing the inner weights exist: the factorial scheme and the path
weighting (or structural) scheme. These two new schemes are defined as follows:
Factorial scheme:
The inner weights eji are equal to the correlation between yi and yj. This is an answer to the
drawbacks of the centroid scheme described above.
Path weighting scheme:

The latent variables connected to ξj are divided into two groups: the predecessors of ξj,
which are latent variables explaining ξj, and the followers, which are latent variables
explained by ξj. For a predecessor ξj' of the latent variable ξj, the inner weight ejj' is equal to
the regression coefficient of yj' in the multiple regression of yj on all the yj' related to the
predecessors of ξj. If ξj' is a successor of ξj, then the inner weight ejj' is equal to the
correlation between yj' and yj.
These new schemes do not significantly influence the results but are very important for
theoretical reasons. In fact, they make it possible to relate PLS Path Modeling to usual
multiple table analysis methods.
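The centroid and factorial choices of inner weights can be sketched as follows (the path weighting scheme would additionally require the inner regressions, omitted here); the function and argument names are illustrative:

```python
import numpy as np

def inner_estimate(Y, adjacency, scheme="centroid"):
    """Standardized inner estimates Z from outer LV estimates Y (n x J,
    columns assumed standardized). adjacency[j, k] = 1 when LVs j and k are
    connected. "centroid" uses signs of correlations, "factorial" correlations."""
    C = np.corrcoef(Y, rowvar=False)
    Z = np.zeros_like(Y)
    for j in range(Y.shape[1]):
        e = np.where(adjacency[j] == 1,
                     np.sign(C[j]) if scheme == "centroid" else C[j], 0.0)
        z = Y @ e
        Z[:, j] = (z - z.mean()) / z.std(ddof=0)   # standardize the inner estimate
    return Z

# Toy path 0 -> 1 -> 2 with correlated outer estimates.
rng = np.random.default_rng(2)
y0 = rng.normal(size=300)
y1 = y0 + 0.5 * rng.normal(size=300)
y2 = y1 + 0.5 * rng.normal(size=300)
Y = np.column_stack([(y - y.mean()) / y.std() for y in (y0, y1, y2)])
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
Z = inner_estimate(Y, A, "centroid")
```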
There are three classical ways to estimate the weights wjh: Mode A, Mode B and Mode C.
Mode A:
In Mode A the weight wjh is the regression coefficient of zj in the simple regression of xjh on
the inner estimate zj:

(19)  wjh = cov(xjh, zj),

as zj is standardized.
Mode B:
In Mode B the vector wj of weights wjh is the regression coefficient vector in the multiple
regression of zj on the centered manifest variables (xjh - x̄jh) related to the same latent
variable ξj:

(20)  wj = (Xj'Xj)⁻¹ Xj'zj,

where Xj is the matrix with columns defined by the centered manifest variables xjh - x̄jh
related to the j-th latent variable ξj.
Mode A is appropriate for a block with a reflective measurement model and Mode B for a
formative one. Mode A is often used for an endogenous latent variable and mode B for an
exogenous one. Modes A and B can be used simultaneously when the measurement model is
the MIMIC one. Mode A is used for the reflective part of the model and Mode B for the
formative part.
In practical situations, mode B is not so easy to use because there is often strong
multicollinearity inside each block. When this is the case, PLS regression may be used instead
of OLS multiple regression. As a matter of fact, it may be noticed that mode A consists in
taking the first component from a PLS regression, while mode B takes all PLS regression
components (and thus coincides with OLS multiple regression). Therefore, running a PLS
regression and retaining a certain number of significant components may be seen as a new
intermediate mode between mode A and mode B.
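Modes A and B can be sketched as a covariance versus a multiple-regression computation of the outer weights. A minimal illustration (names illustrative), showing that for a one-variable block the two modes agree up to the variance of the variable:

```python
import numpy as np

def outer_weights(Xj, zj, mode="A"):
    """Outer weights of one block given the standardized inner estimate zj.
    Mode A: simple-regression coefficients, i.e. cov(xjh, zj).
    Mode B: multiple-regression coefficient vector (Xj'Xj)^-1 Xj'zj."""
    Xc = Xj - Xj.mean(axis=0)
    if mode == "A":
        return Xc.T @ zj / Xj.shape[0]
    return np.linalg.solve(Xc.T @ Xc, Xc.T @ zj)

# One-variable block: Mode B divides the Mode A covariance by var(x).
Xj = np.arange(10.0)[:, None]
zj = (Xj[:, 0] - Xj[:, 0].mean()) / Xj[:, 0].std()
wA = outer_weights(Xj, zj, "A")
wB = outer_weights(Xj, zj, "B")
```

With strong multicollinearity inside a block, the solve step in Mode B becomes unstable, which is exactly the situation where the text suggests substituting PLS regression.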
Mode C:
In Mode C the weights are all equal in absolute value and reflect the signs of the correlations
between the manifest variables and their latent variables:

wjh = sign(cor(xjh, zj)).
These weights are then normalized so that the resulting latent variable has unitary variance.
Mode C actually refers to a formative way of linking manifest variables to their latent variables
and represents a specific case of Mode B that is particularly intuitive to practitioners.
The PLS algorithm starts with an arbitrary vector of weights wjh. These weights are then
standardized in order to obtain latent variables with unitary variance.
A good choice for the initial weight values is to take wjh = sign(cor(xjh, ξj)) or, more simply,
wjh = sign(cor(xjh, ξj)) for h = 1 and 0 otherwise; they might also be the elements of the first
eigenvector from a PCA of each block.
Then the steps for the outer and the inner estimates, depending on the selected mode, are
iterated until convergence (convergence is guaranteed only in the two-block case, but is
almost always reached in practice, even with more than two blocks).
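In the two-block case, the alternating outer/inner estimation can be sketched as follows (a minimal illustration assuming Mode A outer estimation and the centroid inner scheme; all names are ours, not XLSTAT's):

```python
# Sketch of the iterative PLS algorithm for two blocks: Mode A outer
# weights, centroid inner scheme, iterated until the weights stabilize.
import numpy as np

def standardize(v):
    return (v - v.mean()) / v.std()

def pls_two_blocks(X1, X2, max_iter=100, tol=1e-6):
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    w1 = np.ones(X1.shape[1])            # arbitrary starting weights
    w2 = np.ones(X2.shape[1])
    for _ in range(max_iter):
        y1 = standardize(X1 @ w1)        # outer estimates
        y2 = standardize(X2 @ w2)
        # centroid inner scheme: inner weight = sign of the correlation
        e = np.sign(np.corrcoef(y1, y2)[0, 1])
        z1 = standardize(e * y2)         # inner estimates
        z2 = standardize(e * y1)
        new_w1 = X1.T @ z1 / len(z1)     # Mode A weight update
        new_w2 = X2.T @ z2 / len(z2)
        done = max(np.abs(new_w1 - w1).max(), np.abs(new_w2 - w2).max()) < tol
        w1, w2 = new_w1, new_w2
        if done:
            break
    return standardize(X1 @ w1), standardize(X2 @ w2)
```

In this two-block setting convergence is guaranteed, consistent with the remark above.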
After the last step, final results are yielded for the inner weights w̃jh and the standardized
latent variable estimates.
The latent variable estimates are sensitive to the scaling of the manifest variables in Mode A,
but not in mode B. In the latter case, the outer LV estimate is the projection of the inner LV
estimate on the space generated by its manifest variables.
The structural equations (11) are estimated by individual OLS multiple regressions where the
latent variables ξj are replaced by their estimates yj. As usual, OLS multiple regression may be
disturbed by the presence of strong multicollinearity between the estimated latent variables. In
such a case, PLS regression may be applied instead.
XLSTAT-PLSPM offers a specific treatment for missing data (Lohmöller, 1989):
1. When some cells are missing in the data, means and standard deviations of the manifest
variables are computed on all the available data.
2. All the manifest variables are centered.
3. If a unit has missing values on a whole block j, the value of the latent variable estimate yj is
missing for this unit.
4. If a unit i has some missing values on a block j (but not all), then the outer estimate yji is
defined by:

yji = Σ{h: xjhi exists} w̃jh (xjhi - x̄jh).

That means that each missing value of variable xjh is replaced by the mean x̄jh.
5. If a unit i has some missing values on its latent variables, then the inner estimate zji is
defined by:

zji = Σ{k: k is connected with j and yki exists} ejk yki.

That means that each missing value of variable yk is replaced by its mean 0.
6. The weights wjh are computed using all the available data on the basis of the following
procedures:
For Mode A: The outer weight wjh is the regression coefficient of zj in the regression of
(xjh - x̄jh) on zj, calculated on the available data.
For Mode B: When there are no missing data, the outer weight vector wj is equal to:

wj = (Xj'Xj)⁻¹ Xj'zj,

which may also be written as:

wj = [Var(Xj)]⁻¹ Cov(Xj, zj),

where Var(Xj) is the covariance matrix of Xj and Cov(Xj, zj) is the column vector of the
covariances between the variables xjh and zj.
When there are missing data, each element of Var(Xj) and Cov(Xj, zj) is computed using
all the pairwise available data, and wj is computed using the latter formula.
This pairwise deletion procedure has the drawback of possibly computing covariances on
different sample sizes and/or different statistical units. However, when there are few missing
values, it appears to be very robust. This explains why the blindfolding procedure presented in
the next section yields very small standard deviations for the parameters.
7. The path coefficients are the regression coefficients in the multiple regressions relating
some latent variables to some others. When there are some missing values, the procedure
described in point 6 (Mode B) is also used to estimate path coefficients.
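The outer estimate of step 4 above (mean imputation of missing manifest values) can be sketched as follows; the function name is hypothetical, not XLSTAT's API:

```python
# Sketch of the outer estimate with missing data: each missing xjh is
# replaced by its mean, i.e. its centered contribution becomes 0.
import numpy as np

def outer_estimate_with_missing(X, w):
    """y_ji = sum over available h of w_jh * (x_jhi - mean_jh)."""
    means = np.nanmean(X, axis=0)         # means from the available data
    Xc = X - means                        # center the manifest variables
    Xc = np.where(np.isnan(Xc), 0.0, Xc)  # missing centered value -> 0
    return Xc @ w

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])
w = np.array([0.5, 0.5])
y = outer_estimate_with_missing(X, w)     # -> [-2.0, 0.0, 2.0]
```

For the unit with the missing cell, only the available manifest variable contributes, exactly as in the formula of step 4.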
Nevertheless, missing data can also be treated with other classical procedures, such as mean
imputation, listwise deletion, multiple imputation, the NIPALS algorithm (discussed below), and
so on.
6. Model Validation
A path model can be validated at three levels: (1) the quality of the measurement model, (2)
the quality of the structural model, and (3) each structural regression equation.
The communality index measures the quality of the measurement model for each block. It is
defined, for block j, as:
Communalityj = (1/pj) Σh=1..pj cor²(xjh, yj).    (22)

The average communality is the weighted average of the block communalities:

Communality = (1/p) Σj=1..J pj Communalityj,    (23)

where p = Σj pj is the total number of manifest variables.
The redundancy index measures the quality of the structural model for each endogenous
block. It is defined, for an endogenous block j, as:

Redundancyj = Communalityj × R²(yj, {the yj' explaining yj}).    (24)
The average redundancy for all endogenous blocks can also be computed.
A global criterion of goodness-of-fit (GoF) can be proposed (Amato, Esposito Vinzi and
Tenenhaus, 2004) as the geometric mean of the average communality and the average R²:

GoF = √(Communality × R²).    (25)
As a matter of fact, differently from LISREL, PLS Path Modeling does not optimize any global
scalar function, so that it naturally lacks an index that can provide the user with a global
validation of the model (as is instead the case with χ² and related measures in LISREL). The
GoF represents an operational solution to this problem, as it may be meant as an index for
validating the PLS model globally, looking for a compromise between the performances of the
measurement model and the structural model, respectively.
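The indices (22)-(25) are easy to compute once the latent variable scores are available; a minimal sketch with our own naming, not XLSTAT's API:

```python
# Sketch of the communality, redundancy and GoF indices of equations 22-25.
import numpy as np

def communality(X, y):
    """Mean squared correlation between a block's MVs and its LV (eq. 22)."""
    return np.mean([np.corrcoef(X[:, h], y)[0, 1] ** 2
                    for h in range(X.shape[1])])

def redundancy(comm_j, r2_j):
    """Redundancy_j = Communality_j * R^2 (eq. 24)."""
    return comm_j * r2_j

def gof(communalities, r2s):
    """GoF = sqrt(average communality * average R^2) (eq. 25)."""
    return np.sqrt(np.mean(communalities) * np.mean(r2s))
```

For instance, a block whose manifest variables are all perfectly correlated with their latent variable has a communality of 1; the GoF then reduces to the square root of the average R² of the structural regressions.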
The cv-communality (cv stands for cross-validated) index measures the quality of the
measurement model for each block. It is a kind of cross-validated R-square between the block
MVs and their own latent variable calculated by a blindfolding procedure.
The quality of each structural equation is measured by the cv-redundancy index (i.e. the
Stone-Geisser Q²). It is a kind of cross-validated R-square between the manifest variables of
an endogenous latent variable and all the manifest variables associated with the latent
variables explaining the endogenous latent variable, using the estimated structural model.
Following Wold (1982, p. 30), the cross-validation test of Stone and Geisser "fits soft modeling
like hand in glove". In PLS Path Modeling, these statistics are available for each block and for
each structural regression.
The significance levels of the regression coefficients can be computed using the usual
Student's t statistic or using cross-validation methods like the jack-knife or the bootstrap.
1. The data matrix is divided into G groups. The value G = 7 is recommended by Herman
Wold. The following table gives an example on a dataset made of 12 statistical units and
5 variables. The first group is related to letter a, the second one to letter b, and so on.
2. Each group of cells is in turn removed from the data, so that a group of cells appears to be
missing (for example, all cells with letter a).
3. A PLS model is run G times, each time excluding one of the groups.
4. One way to evaluate the quality of the model consists in measuring its capacity to predict
manifest variables using other latent variables. Two indices are used: communality and
redundancy.
5. In the communality option, we get a prediction for the values of the centered manifest
variables not included in the analysis, using the latent variable estimate, by the following
formula:

Pred(xjhi - x̄jh) = πjh(-i) yj(-i),

where πjh(-i) and yj(-i) are computed on data where the i-th value of variable xjh is missing.

Sum of squared prediction errors for one MV:

SSEjh = Σi (xjhi - x̄jh - πjh(-i) yj(-i))².

CV-Communality measure for block j:

H²j = 1 - SSEj / SSOj,

where SSEj is the sum of the SSEjh over the manifest variables of block j and SSOj is the
corresponding sum of squared centered observations.
The index H 2j is the cross-validated communality index. The mean of the cv-communality
indices can be used to measure the global quality of the measurement model if they are
positive for all blocks.
6. In the redundancy option, we get a prediction for the values of the centered manifest
variables not used in the analysis by using the following formula:

Pred(xjhi - x̄jh) = πjh(-i) Pred(yj(-i)),

where πjh(-i) is the same as in the previous paragraph and Pred(yj(-i)) is the prediction for
the i-th observation of the endogenous latent variable yj using the regression model
computed on data where the i-th value of variable xjh is missing.
The following terms are also computed:

F²j = 1 - SSE'j / SSOj.

The index F²j is the cross-validated redundancy index. The mean of the various cv-redundancy
indices related to the endogenous blocks can be used to measure the global quality of the
structural model if they are positive for all endogenous blocks.
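As an illustration of these cross-validated indices, here is a simplified leave-one-cell-out version of the cv-communality H²j; the actual blindfolding procedure removes G diagonal groups of cells rather than single cells, and the names are ours:

```python
# Simplified cv-communality: each centered cell is predicted by a loading
# estimated without that cell (simple regression on the LV scores y).
import numpy as np

def cv_communality(X, y):
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    sse = sso = 0.0
    for h in range(p):
        for i in range(n):
            mask = np.ones(n, dtype=bool)
            mask[i] = False
            # loading estimated with cell (i, h) blinded
            pi = (Xc[mask, h] @ y[mask]) / (y[mask] @ y[mask])
            sse += (Xc[i, h] - pi * y[i]) ** 2   # prediction error
            sso += Xc[i, h] ** 2                 # sum of squared observations
    return 1.0 - sse / sso                       # H^2_j
```

The cv-redundancy F²j is obtained in the same way, replacing yj(-i) by its prediction from the structural model.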
The significance of PLS-PM parameters, coherently with the distribution-free nature of the
estimation method, is assessed by means of non-parametric procedures. As a matter of fact,
besides the classical blindfolding procedure, Jackknife and Bootstrap resampling options are
available.
6.3.1. Jackknife
The Jackknife procedure builds resamples by deleting a certain number of units from the
original sample (with size N). The default option consists in deleting 1 unit at a time so that
each Jackknife sub-sample is made of N-1 units. Increasing the number of deleted units leads
to a potential loss in robustness of the t-statistic because of a smaller number of sub-samples.
The complete statistical procedure is described in Chin (1998, p.318-320).
6.3.2. Bootstrap
The Bootstrap samples, instead, are built by resampling with replacement from the original
sample. The procedure produces samples consisting of the same number of units as in the
original sample. The number of resamples has to be specified. The default is 100 but a higher
number (such as 200) may lead to more reasonable standard error estimates.
We must take into account that, in PLS-PM, latent variables are defined up to the sign. It
means that yj = Σh w̃jh (xjh - x̄jh) and -yj are both equivalent solutions. In order to remove
this indeterminacy, Wold (1985) suggests retaining the solution where the correlations
between the manifest variables xjh and the latent variable yj show a majority of positive signs.
Referring to the signs of the elements in the first eigenvector obtained on the original sample is
also a way of controlling the sign in the different bootstrap re-samples.
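The sign-control rule can be sketched as a small helper applied to each resample (hypothetical code, not XLSTAT's):

```python
# Flip a resampled latent variable so that its correlations with the
# manifest variables show a majority of positive signs (Wold, 1985).
import numpy as np

def align_sign(X, y):
    signs = np.sign([np.corrcoef(X[:, h], y)[0, 1] for h in range(X.shape[1])])
    return y if signs.sum() >= 0 else -y
```

Applying this rule to every bootstrap resample keeps the latent variable estimates comparable across resamples, so that standard errors are not inflated by arbitrary sign flips.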
The roots of the PLS algorithm are in the NILES (Non linear Iterative LEast Squares
estimation) algorithm for Principal Component Analysis (Wold, 1966), which later became
NIPALS (Non linear Iterative PArtial Least Squares). We now recall H. Wold's original
algorithm and show how it can be included in the PLS-PM framework. The interest of the
NIPALS algorithm is twofold, as it shows how PLS handles missing data and how to extend
the PLS approach to more than one dimension.
The original NIPALS algorithm is used to run a PCA in the presence of missing data. This
original algorithm can be slightly modified to fit into the PLS framework by standardizing the
principal components. Once this is done, the final step of the NIPALS algorithm is exactly the Mode A of
the PLS approach when only one block of data is available. This means that PLS-PM can
actually yield the first-order results of a PCA whenever it is applied to a block of reflective
manifest variables.
The other dimensions are obtained by working on the residuals of X on the previous
standardized principal components.
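A hedged sketch of the NIPALS iteration for one component with missing data (our own minimal implementation of the idea, not XLSTAT's):

```python
# NIPALS iteration for one principal component (Wold, 1966): the score and
# loading regressions are computed over the available cells only.
import numpy as np

def nipals_component(X, max_iter=200, tol=1e-8):
    Xc = X - np.nanmean(X, axis=0)        # center using available data
    M = ~np.isnan(Xc)                     # availability mask
    Z = np.where(M, Xc, 0.0)              # zero-fill missing centered cells
    t = Z[:, 0].copy()                    # start from the first column
    for _ in range(max_iter):
        # loadings: slope of each variable on t, over the available cells
        p = (Z.T @ t) / (M.T @ (t * t))
        p /= np.linalg.norm(p)
        # scores: slope of each unit's row on p, over the available cells
        t_new = (Z @ p) / (M @ (p * p))
        if np.linalg.norm(t_new - t) < tol:
            t = t_new
            break
        t = t_new
    return t, p
```

Each regression uses only the available cells, which mirrors how missing data are handled in Mode A; the next component would be obtained by applying the same routine to the residuals of X on the previous component.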
PLS Path Modeling can also be used to recover the main data analysis methods for relating
two sets of variables. Table 1 shows the complete equivalence between PLS Path Modeling of
two data tables and four classical multivariate analysis methods. In this table, the use of the
deflation operation for the search of higher-dimension components is mentioned.
Table 1: Equivalence between the PLS algorithm applied to two blocks of variables X1 and X2
and various methods.
The analytical demonstration of the above mentioned results can be found in Tenenhaus et al.,
2005.
The various options of PLS Path Modeling (Modes A or B for outer estimation; centroid,
factorial or path weighting schemes for inner estimation) also make it possible to recover many
methods for multiple table analysis: Generalized Canonical Analysis (both Horst's (1961) and
Carroll's (1968) versions), Multiple Factor Analysis (Escofier & Pagès, 1994), Lohmöller's split
principal component analysis (1989), and Horst's maximum variance algorithm (1965).
The links between PLS and these methods have been studied on practical examples in Guinot,
Latreille and Tenenhaus (2001) and in Pagès and Tenenhaus (2001).
Let us consider a situation where J blocks of variables X1, …, XJ are observed on the same
set of statistical units. For estimating the corresponding latent variables ξj, Wold (1982) has
proposed the hierarchical model defined as follows:
An arrow scheme describing a hierarchical model for three blocks of variables is shown in
Figure 1.
Figure 1: A hierarchical model for a PLS analysis of J blocks of variables.
Table 2 summarizes the links between hierarchical PLS-PM and several multiple table
analyses, organized with respect to the choice of the outer estimation mode (A or B) and of the
inner estimation scheme (Centroid, Factorial or Path Weighting).
In the methods described in Table 2, the higher dimension components are obtained by re-
running the PLS model after deflation of the X-block.
It is also possible to obtain higher dimension orthogonal components on some Xj-blocks (or on
all of them). The hierarchical PLS model is re-run on the selected deflated Xj-blocks.
The orthogonality control for higher dimension components is a tremendous advantage of the
PLS approach (see Tenenhaus (2004) for more details and an example of application).
Finally, PLS Path Modeling may be meant as a general framework for the analysis of multiple
tables. It is demonstrated that this approach recovers usual data analysis methods in this
context but it also allows for new methods to be developed when choosing different mixtures of
estimation modes and schemes in the two steps of the algorithm (internal and external
estimation of the latent variables) as well as different orthogonality constraints. Therefore, we
can state that PLS Path Modeling provides a very flexible environment for the study of a multi-
block structure of observed variables by means of structural relationships between latent
variables. Such a general and flexible framework also enriches the data analysis methods with
non-parametric validation procedures (such as bootstrap, jackknife and blindfolding) for the
estimated parameters and fit indices for the different blocks that are more classical in a
modeling approach than in data analysis.
Projects
XLSTAT-PLSPM projects are special Excel workbook templates. When you create a new
project, its default name starts with PLSPMBook. You can then save it under the name you
want, but make sure you use the Save or Save as commands of the XLSTAT-PLSPM toolbar
to save it in the folder dedicated to PLSPM projects, using the *.ppm extension.
- D1: This sheet is empty and you need to add all the input data that you want to use into that
worksheet.
- PLSPMGraph: This sheet is blank and is used to design the model. When you select this
sheet, the Path modeling toolbar is displayed. It is made invisible when you leave that sheet.
Once a model has been designed, you can run the optimization. Results sheets are then
added after the PLSPMGraph sheet.
It is possible to record a model before adding new variables and to reload it later (see the
Toolbars section for additional information).
Options
To display the options dialog box, click the button of the XLSTAT-PLSPM toolbar. Use
this dialog box to define the general options of the XLSTAT-PLSPM module.
General tab:
Path for the XLSTAT-PLSPM projects: This path can be modified only if you have
administrator rights on the machine. You can then modify the folder where the users' files are
saved by clicking the [...] button, which displays a box where you can select the appropriate
folder. The folder must be accessible for reading and writing to all types of users.
Format tab:
Use these options to set the format of the various objects that are displayed on the
PLSPMGraph sheet:
Latent variables: You can define the color and the size of the border line of the ellipses
that represent the latent variables, as well as the color of the background, and the color
and the size of the font.
Manifest variables: You can define the color and the size of the border line of the
rectangles that represent the manifest variables, as well as the color of the background,
and the color and the size of the font.
Arrows (MV-LV): You can define the color and the size of the arrows between the
manifest and the latent variables.
Arrows (LV-LV): You can define the color and the size of the arrows between two
latent variables.
Note 1: For the changes to be taken into account once you click the OK button, you need to
optimize the display by clicking on the button.
Note 2: These options do not prevent you from changing the format of one or more objects on
the PLSPMGraph sheet. Using the Excel drawing toolbar, you can change the fill and the
borders of the objects.
Toolbars
Click this icon to open a new PLSPM project (see Projects for more details).
Click this icon to save the current PLSPM project. This icon is only active if changes have
been made in the project.
Click this icon to save the project in a new folder or under a new name.
Click this icon if you want to continue using XLSTAT but not XLSTAT-PLSPM. This allows
you to free some memory.
The second toolbar, Path modeling is only visible when you are on the PLSPMGraph sheet of
a PLSPM project.
Click this icon to add latent variables. If you double click this icon, you can add several
latent variables in a row without having to click this button each time.
Click this icon to add an arrow between two latent variables. If you double click this icon,
you can add several arrows in a row without having to click this button each time. When adding
an arrow, first click the latent variable that will be at the start of the arrow, then, keeping the
mouse button down, drag the cursor to the latent variable that will be at the end of the arrow.
Click this icon to hide the manifest variables. If a latent variable is selected when you click
this icon, it will only hide the corresponding manifest variables.
Click this icon to display the manifest variables. If a latent variable is selected when you
click this icon, it will only show the corresponding manifest variables.
Click this icon to define groups. Once groups are defined, a list with the group names is
displayed on the PLSPMGraph sheet, and the icon becomes ; click this icon to remove the
groups definition.
Unprotected/Protected(1)/Protected(2): The first option allows the user to modify the model
and the position of the objects. The second option allows the user to modify the position of the
objects. The third option does not allow the user to move the objects or to delete them.
Click this icon to completely remove all the objects on the PLSPMGraph sheet.
Click this icon to display the results of the model, if the latter has already been fitted. If the
results are already displayed, the following icon is displayed ; click it to hide the results.
Click this icon to display a dialog box that allows you to choose which results should be
displayed or not.
Click this icon to start the optimization of the model, and then display the results on both
the PLSPMGraph sheet, and on the results sheet.
Once one or more latent variables have been added on the PLSPMGraph document using the
icon of the Path modeling toolbar, you can define the manifest variables that correspond
to these variables. A latent variable must have manifest variables, even if it is a superblock
variable (a variable that is not directly related to manifest variables but to latent variables, with
arrows going from the latent variables to the superblock variable; the superblock variable
inherits the manifest variables of the constitutive latent variables).
For a superblock, you need to add all the manifest variables of the parent latent variables. This
is made easy with the XLSTAT interface.
- Right-click the mouse, and select Add manifest variables.
These actions lead to the display of a dialog box whose options are:
General tab:
Name of the latent variable: Enter the name of the latent variable.
Manifest variables: Select on the D1 sheet the data that correspond to the manifest variables.
The input variables can be either quantitative or qualitative.
Quantitative: Activate this option if you want to use quantitative variables and then
select these variables.
Qualitative: Activate this option if you want to use qualitative variables and then select
these variables.
Variable labels: Activate this option if the first row of the data selections includes a header.
Position: Select where the manifest variables should be positioned relative to the latent
variable.
Mode: Select the mode option that determines how the latent variable is constructed from the
manifest variables. The available options are Mode A (reflective way, arrows go from the
latent variable to the manifest variables), Mode B (formative way, arrows go from the
manifest variables to the latent variable), Centroid, PCA, PLS, and Mode MIMIC (a
mixture of Mode A and Mode B). If Mode MIMIC is selected, you need to select a column with
one row per manifest variable (and a header if the Variable labels option is checked), with As
for the variables with Mode A, and Bs for the variables with Mode B. See the description
section for more information on the modes. The Automatic mode is only available for
superblocks: it ensures that the mode of each manifest variable corresponds to its mode in the
latent variable that is used to create the superblock. Centroid, PCA, PLS, and Automatic
modes are only available in the expert display.
Deflation (expert display): Select the deflation mode. Deflation is used when computing the
model on the second and higher dimensions.
No deflation: Whatever the dimension, the scores of the latent variable remain
constant.
External: For the successive dimensions, the residuals are computed from the outer
model.
Internal: For the successive dimensions, the residuals are computed from the inner
model.
Internal(W): For the successive dimensions, the residuals are computed from the inner
model after re-estimating the weights.
Dimension (expert display): You can choose the number of dimensions to be studied.
Invert sign: Activate this option if you want to reverse the sign of the latent variable. This
option is useful if you notice that the influence of a latent variable is the opposite of what it
should be.
Superblock (expert display): You can activate this option only if latent variables have already
been created, and if manifest variables were selected for the latter. The list displays the latent
variables for which manifest variables have already been defined. The Superblock tab appears.
You can then select the latent variables that are used to build the superblock variable.
Interaction (expert display): You can activate this option only if latent variables have already
been created. An interaction latent variable is the product of two latent variables that have the
same successor. The interaction variable will have the same successor as the two variables
that were used to create it. The interaction tab appears.
Superblock tab:
The list of all latent variables is displayed. Select latent variables to be included in the
superblock.
Interaction tab:
Generating latent variable: Select two of the latent variables explaining the latent variable
to which the interaction variable is connected.
Treatment of the manifest variables: Select which transformation to apply to the manifest
variables prior to the product. Three options are available: raw manifest variables,
standardized manifest variables, and mean-centered manifest variables.
Stop conditions:
Automatic: Activate this option so that XLSTAT automatically determines the number
of components to keep.
Max components: Activate this option to set the maximum number of components to take
into account in the model.
Options for PLS regression in the measurement model (only active, if the PLS mode is
selected):
Stop conditions:
Automatic: Activate this option so that XLSTAT automatically determines the number
of components to keep.
Max components: Activate this option to set the maximum number of components to take
into account in the model.
Defining groups
If a qualitative variable is available and if you believe that the model could be different for the
various categories of that variable, you may use it to define groups.
To define groups, go to the PLSPMGraph sheet, then click the icon. This displays a
Groups dialog box, whose entries are:
Groups: Select on the D1 sheet the data that correspond to the qualitative variable that
indicates to which group each observation belongs.
Column label: Activate this option if the first row of the selection corresponds to a header.
Sort alphabetically: Activate this option if you want XLSTAT to sort the names of the groups
(the categories of the selected qualitative variable) alphabetically. If this option is not activated,
the categories are listed in their order of appearance.
Once you click OK, a list is added at the top right corner of the PLSPMGraph sheet. Once the
model has been computed, you can use this list to display the results of the group you want on
the PLSPMGraph sheet. The results of the model that corresponds to each group are also
displayed on distinct sheets.
Note: if you want to remove the group information, click the button of the Path modeling
toolbar.
Once you have designed the model on the PLSPMGraph sheet, and once the manifest
variables have been defined for each latent variable, you can click the icon of the Path
modeling toolbar to display the Run dialog box that lets you define additional options before
fitting the model.
General tab:
Treatment of the manifest variables: Choose if and how the manifest variables should be
transformed.
Standardized, weights on raw MV: The manifest variables are standardized before
fitting the model, and the outer weights are estimated for the raw variables.
Reduced, weights on raw MV: The manifest variables are reduced (divided by the
standard deviation) before fitting the model, and the corresponding outer weights are
estimated.
Initial weight: Choose which initial values should be used for outer weight initialization.
Values of the first eigenvector: The initial values are the values associated with the
first eigenvector.
Signs of the coordinates of the first eigenvector: Instead of taking the values of the
first eigenvector, only their signs are taken.
Weights: Activate this option if the observations are weighted. If you do not activate this
option, all weights are considered equal to 1. Weights must be greater than or equal to 0. If a
column header has been selected, check that the "Variable labels" option is activated.
Range: Activate this option if you want to display the results starting from a cell in an existing
worksheet. Then select the corresponding cell.
Sheet: Activate this option to display the results in a new worksheet of the active workbook.
Variable labels: Activate this option if the first row of the data selections includes a header.
Observation labels: Activate this option if observation labels are available. Then select the
corresponding data. If the Variable labels option is activated, you need to include a header in
the selection. If this option is not activated, the observation labels are automatically generated
by XLSTAT (Obs1, Obs2, …).
Options tab:
Internal estimation: Select the internal estimation method (see the description section for
additional details).
Structural: The inner weights are equal to the correlation between the latent variables
when estimating an explanatory (predecessor) latent variable. Otherwise they are equal
to the OLS regression coefficients.
Factorial: The inner weights are equal to the correlation between the latent variables.
Centroid: The inner weights are equal to the sign of the correlation between the latent
variables.
PLS: The inner weights are equal to the correlation between the latent variables when
estimating an explanatory (predecessor) latent variable. Otherwise they are equal to the
PLS regression coefficients.
Regression: Select the regression method that is used to estimate the path coefficients.
Dimensions: Enter the number of dimensions up to which the model should be computed.
Stop conditions:
Iterations: Enter the maximum number of iterations for the algorithm. The calculations
are stopped when the maximum number of iterations has been exceeded. Default
value: 100.
Convergence: Enter the value of the difference of criterion between two steps which,
when reached, means that the algorithm is considered to have converged. Default
value: 0.0001.
Confidence intervals: Activate this option to compute the confidence intervals. Then choose
the method used to compute the intervals:
Bootstrap: Activate this option to use a bootstrap method. Then enter the number of
Re-samples generated to compute the bootstrap confidence intervals.
Jackknife: Activate this option to use a jackknife method. Then enter the Group size
that is used to generate the samples to compute the jackknife confidence intervals.
Model quality:
Blindfolding: Activate this option to check the model quality using the blindfolding
approach (see the description section for additional details). Cross-validated values for
redundancy and communality will be computed.
Do not accept missing data: Activate this option so that XLSTAT prevents the computations
from continuing if missing data have been detected.
Remove observations: Activate this option to remove the observations that contain missing
data.
Use NIPALS: Activate this option to use the NIPALS algorithm to handle missing data (see the
description section for additional details).
Lohmöller: Activate this option to use Lohmöller's procedure to handle missing data:
pairwise deletion to compute sample means and standard deviations, and mean imputation to
compute the scores.
Use the mean: Activate this option to use the mean of the latent variables to estimate
missing data in the manifest variables.
Renormalize: Activate this option to renormalize external weights for each observation
when missing data have been found.
Note: In the case of standardized weights, the two options above lead to pairwise deletion to
compute sample means and standard deviation, and mean imputation to compute the scores.
Estimate missing data: Activate this option to estimate missing data before starting the
computations.
Mean or mode: Activate this option to estimate missing data by using the mean
(quantitative variables) or the mode (qualitative variables) of the corresponding
variables.
Nearest neighbour: Activate this option to estimate the missing data of an observation
by searching for the nearest neighbour of the observation.
Multigroup t test: Activate this option to test the equality of path coefficients from one group
to another with a t test (the number of bootstrap samples is defined in the Options tab).
Significance level (%): Enter the significance level for the t tests.
Permutation tests: Activate this option to test equality of parameters between two groups with
a permutation test.
Significance level (%): Enter the significance level for the permutation tests.
Path coefficients: Activate this option to test the equality of the path coefficients.
Standardized loadings: Activate this option to test the equality of the standardized
loadings.
Model quality: Activate this option to test the equality of the quality indices (communalities,
redundancies, and GoF).
Outputs tab:
Descriptive statistics: Activate this option to display descriptive statistics for the selected
variables.
Test significance: Activate this option to test the significance of the correlations.
Significance level (%): Enter the significance level for the above tests.
Inner model: Activate this option to display the results that correspond to the inner model.
Outer model: Activate this option to display the results that correspond to the outer model.
R² and communalities: Activate this option to display the R² of the latent variables from the
structural model and the communalities of the manifest variables.
Model quality: Activate this option to display the results of the blindfolding procedure.
Standardized: Activate this option to compute and display standardized factor scores.
Using normalized weights: Activate this option to display factor scores computed with
normalized weights.
Standardized > 0-100: Activate this option to compute standardized scores, and then
transform and display them on a 0-100 scale.
Using normalized weights > 0-100: Activate this option to compute factor scores using
normalized weights, and then transform and display the factor scores on a 0-100 scale.
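As a rough illustration of these two transformations (assuming, for the 0-100 step, a simple min-max rescaling; XLSTAT's exact convention may differ):

```python
import numpy as np

def standardize(scores):
    """Center and reduce latent variable scores (mean 0, standard deviation 1)."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std(ddof=0)

def to_0_100(scores):
    """Min-max rescale scores onto a 0-100 scale (illustrative convention)."""
    s = np.asarray(scores, dtype=float)
    return 100.0 * (s - s.min()) / (s.max() - s.min())
```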
Results options
Many results can be displayed on the PLSPMGraph sheet once the model has been fitted. It
is recommended to select only a few items in order to keep the results easy to read. To display
the options dialog box, click the icon of the Path modeling toolbar.
These options let you define which results are displayed below the latent variables.
Mean: Activate this option to display the mean of the latent variable.
Mean (Bootstrap): Activate this option to display the mean of the latent variable
computed using a bootstrap procedure.
Confidence interval: Activate this option to display the confidence interval for the
mean.
R²: Activate this option to display the R² between the latent variable and its
manifest variables.
Adjusted R²: Activate this option to display the adjusted R² between the latent
variable and its manifest variables.
R² (Boot/Jack): Activate this option to display the R² between the latent variable
and its manifest variables, computed using a bootstrap or jackknife procedure.
R² (conf. int.): Activate this option to display the confidence interval on the R²
between the latent variable and its manifest variables, computed using a bootstrap or
jackknife procedure.
Communality: Activate this option to display the communality between the latent
variable and its manifest variables.
Redundancy: Activate this option to display the redundancy between the latent variable
and its manifest variables.
D.G. rho: Activate this option to display Dillon-Goldstein's rho coefficient.
Std. deviation (Scores): Activate this option to display the standard deviation of the
estimated latent variable scores.
These options let you define which results are displayed on the arrows that relate the latent
variables.
Correlation: Activate this option to display the correlation coefficient between the two
latent variables.
Contribution: Activate this option to display the contribution of the latent variables to
the R².
Path coefficient: Activate this option to display the regression coefficient that
corresponds to the regression of the latent variable at the end of the arrow
(dependent) on the latent variable at the beginning of the arrow (predecessor or
explanatory variable).
Path coefficient (B/J): Activate this option to display the regression coefficient that
corresponds to the regression of the latent variable at the end of the arrow
(dependent) on the latent variable at the beginning of the arrow (predecessor or
explanatory variable), computed using a bootstrap or jackknife procedure.
Standard deviation: Activate this option to display the standard deviation of the path
coefficient.
Confidence interval: Activate this option to display the confidence interval for the path
coefficient.
Partial correlations: Activate this option to display partial correlations between latent
variables.
Pr > |t|: Activate this option to display the p-value that corresponds to Student's t.
Arrow thickness depends on: The thickness of the arrows can be related to:
o The p-value of Student's t (the lower the value, the thicker the arrow).
o The correlation (the higher the absolute value, the thicker the arrow; blue
arrows correspond to negative values, red arrows to positive values).
o The contribution (the higher the value, the thicker the arrow).
These options let you define which results are displayed on the arrows that relate the
manifest variables to the latent variables.
Weight (Bootstrap): Activate this option to display the weight computed using a
bootstrap procedure.
Standard deviation: Activate this option to display the standard deviation of the weight.
Confidence interval: Activate this option to display the confidence interval for the
weight.
Correlation: Activate this option to display the correlation coefficient between the
manifest variable and the latent variable.
Correlation (std. deviation): Activate this option to display the standard deviation of
the correlation coefficient between the manifest variable and the latent variable,
computed using a bootstrap or jackknife procedure.
Correlation (conf. interval): Activate this option to display the confidence interval of
the correlation coefficient between the manifest variable and the latent variable,
computed using a bootstrap or jackknife procedure.
Communalities: Activate this option to display the communality between the latent
variable and the manifest variables.
Redundancy: Activate this option to display the redundancy between the latent variable
and the manifest variables.
Arrow thickness depends on: The thickness of the arrows can be related to:
o The correlation (the higher the absolute value, the thicker the arrow; blue
arrows correspond to negative values, red arrows to positive values).
o Normalized weights.
Results
The first results are general results whose computation is done prior to fitting the path
model:
Summary statistics: This table displays, for all the manifest variables, the number of
observations, the number of missing values, the number of non-missing values, the minimum,
the maximum, the mean and the standard deviation.
Model specification (measurement model): This table displays, for each latent variable, the
number of manifest variables, the mode, the type (a latent variable that never appears as a
dependent variable is called exogenous), whether its sign has been inverted, the number of
computed dimensions, and the list of all associated manifest variables.
Model specification (structural model): This square matrix shows on its lower triangular part
if there is an arrow that goes from the column variable to the row variable.
Composite reliability: This table allows you to check the dimensionality of the blocks. For each
latent variable, a PCA is run on the covariance or correlation matrix of the manifest variables in
order to determine the dimensionality. Cronbach's alpha, Dillon-Goldstein's rho, the
critical eigenvalue (which can be compared to the eigenvalues obtained from the PCA) and the
condition number are displayed to help determine the dimensionality.
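The two reliability coefficients can be sketched from a block's data and standardized loadings (a minimal illustration using the usual textbook formulas; XLSTAT's implementation may differ in details):

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for a block of p manifest variables (rows = observations)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    item_variances = X.var(axis=0, ddof=1).sum()
    total_variance = X.sum(axis=1).var(ddof=1)   # variance of the row sums
    return p / (p - 1) * (1 - item_variances / total_variance)

def dillon_goldstein_rho(loadings):
    """Dillon-Goldstein's rho computed from the standardized loadings of a block."""
    lam = np.asarray(loadings, dtype=float)
    s = lam.sum() ** 2
    return s / (s + (1 - lam ** 2).sum())
```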
Variables/Factors correlations (Latent variable X / Dimension Y): These tables display for
each latent variable the correlations between the manifest variables and the factors extracted
from the PCA. When a block is not unidimensional, these correlations help identify how the
corresponding manifest variables can be split into unidimensional blocks.
The results that follow are obtained once the path modeling model has been fitted:
Goodness of fit index (Dimension Y): This table displays the goodness of fit index (GoF),
computed with or without bootstrap, together with its confidence interval:
Relative: Value of the relative GoF index obtained by dividing the absolute value by its
maximum value achievable for the analyzed dataset.
Outer model: Component of the GoF index based on the communalities (performance
of the measurement model).
Inner model: Component of the GoF index based on the R² of the endogenous latent
variables (performance of the structural model).
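Following Amato, Esposito Vinzi and Tenenhaus (2004), the absolute GoF is commonly defined as the geometric mean of the average communality and the average R² of the endogenous latent variables; a minimal sketch:

```python
import numpy as np

def gof(communalities, r_squared):
    """Absolute GoF: geometric mean of the average communality (outer model)
    and the average R² of the endogenous latent variables (inner model)."""
    return float(np.sqrt(np.mean(communalities) * np.mean(r_squared)))
```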
R² (Latent variable X / Dimension Y): Value of the R² index for the endogenous
variables in the structural equations.
Path coefficients (Latent variable X / 1): Value of the regression coefficients in the
structural model estimated on the standardized factor scores.
Impact and contribution of the variables to Latent variable X (Dimension Y): Value
of the path coefficients and the contributions (in percent) of the predecessor latent
variables to the R² index of the endogenous latent variables.
Model assessment (Dimension Y): This table summarizes important results associated with the
latent variable scores.
Direct effects (latent variable) / Dimension Y (expert display): This table shows the direct
effects between connected latent variables.
Indirect effect (latent variable) / Dimension Y (expert display): This table shows the indirect
effects between latent variables that are not directly connected.
Total effect (latent variable) / Dimension Y (expert display): This table shows the total effect
between latent variables. Total effect = direct effect + indirect effect.
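For an acyclic structural model, indirect effects are products of path coefficients along intermediate paths, so the total effects can be sketched from the matrix of direct path coefficients (an illustrative computation, with a hypothetical layout where B[i, j] holds the direct effect of latent variable j on latent variable i):

```python
import numpy as np

def total_effects(B):
    """Total effects = direct + all indirect paths, i.e. B + B² + B³ + ...,
    which for an acyclic model equals (I - B)^(-1) - I."""
    B = np.asarray(B, dtype=float)
    I = np.eye(B.shape[0])
    return np.linalg.inv(I - B) - I
```

For a chain LV1 → LV2 → LV3 with coefficients 0.5 and 0.4, the indirect (and total) effect of LV1 on LV3 is 0.5 × 0.4 = 0.2.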
Discriminant validity (Squared correlations < AVE) (Dimension Y): This table allows you to
check whether each latent variable represents a concept distinct from the others, or whether
some latent variables actually represent the same concept. In this table, the R² index
for any pair of latent variables should be smaller than the mean communality of each of the two
variables, which indicates that more variance is shared between each latent variable and its
block of manifest variables than with another latent variable representing a different block of
manifest variables.
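This criterion, in the spirit of the Fornell-Larcker check used in the PLS literature, can be sketched as follows (an illustrative helper, not XLSTAT's code; the mean communalities play the role of the AVE values):

```python
import numpy as np

def discriminant_validity(latent_corr, mean_communalities):
    """Return a boolean matrix: True where the squared correlation between a
    pair of latent variables is below both variables' mean communality (AVE)."""
    R2 = np.asarray(latent_corr, dtype=float) ** 2
    ave = np.asarray(mean_communalities, dtype=float)
    ok = (R2 < ave[:, None]) & (R2 < ave[None, :])
    np.fill_diagonal(ok, True)   # a variable against itself is not tested
    return ok
```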
Mean / Latent variable scores (Dimension Y): Mean values of the individual factor
scores.
Summary statistics / Latent variable scores (Dimension Y): Descriptive statistics of the
latent variable scores computed from the measurement model.
Latent variable scores (Dimension Y): Individual latent variable scores estimated as a
linear combination of the corresponding manifest variables.
Summary statistics / Scores predicted using the structural model (Dimension Y) (expert
display): Descriptive statistics of the latent variable scores computed from the structural
model.
Scores predicted using the structural model (Dimension Y) (expert display): Latent
variable scores computed as the predicted values from the structural model equations.
Worksheet PLSPM (Group): For each group, complete results are displayed in separate
worksheets.
Worksheet PLSPM (Multigroup t test): For each path coefficient, results of the t test are
summarized in a table. Each line represents a pair of groups.
Significant: If yes, the difference between the parameters is significant. If not, the
difference is not significant.
Worksheet PLSPM (Permutation test): For each type of parameter, results of the
permutation test are summarized in a table.
Significant: If yes, the difference between the parameters is significant. If not, the
difference is not significant.
Example
A tutorial on how to use the XLSTAT-PLSPM module is available on the Addinsoft website:
http://www.xlstat.com/demo-plspm.htm
References
Amato S., Esposito Vinzi V. and Tenenhaus M. (2004). A global Goodness-of-Fit index for
PLS structural equation modeling. In: Proceedings of the XLII SIS Scientific Meeting, vol.
Contributed Papers, CLEUP, Padova, 739-742.
Carroll J.D. (1968). A generalization of Canonical Correlation Analysis to three or more sets of
variables. Proc. 76th Conv. Am. Psych. Assoc., 227-228.
Chin W.W. (1998). The Partial Least Squares approach for structural equation modeling. In:
G.A. Marcoulides (Ed.), Modern Methods for Business Research, Lawrence Erlbaum
Associates, 295-336.
Esposito Vinzi V., Chin W., Henseler J. and Wang H. (2007). Handbook of Partial Least
Squares: Concepts, Methods and Applications, Springer-Verlag.
Fornell C. and Cha J. (1994). Partial Least Squares. In: R.P. Bagozzi (Ed.), Advanced
Methods of Marketing Research, Basil Blackwell, Cambridge, Ma., 52-78.
Guinot C., Latreille J. and Tenenhaus M. (2001). PLS Path Modelling and Multiple Table
Analysis. Application to the cosmetic habits of women in Ile-de-France. Chemometrics and
Intelligent Laboratory Systems, 58, 247-259.
Horst P. (1965). Factor Analysis of data matrices. Holt, Rinehart and Winston, New York.
Jöreskog K.G. (1970). A General Method for Analysis of Covariance Structure. Biometrika,
57, 239-251.
Jöreskog K.G. and Wold H. (1982). The ML and PLS Techniques for Modeling with Latent
Variables: Historical and Comparative Aspects. In: K.G. Jöreskog and H. Wold (Eds.), Systems
Under Indirect Observation, Part 1, North-Holland, Amsterdam, 263-270.
Lohmöller J.-B. (1989). Latent Variables Path Modeling with Partial Least Squares. Physica-
Verlag, Heidelberg.
Pagès J. and Tenenhaus M. (2001). Multiple Factor Analysis combined with PLS Path
Modelling. Application to the analysis of relationships between physicochemical variables,
sensory profiles and hedonic judgements. Chemometrics and Intelligent Laboratory Systems,
58, 261-273.
Tenenhaus M., Esposito Vinzi V., Chatelin Y.-M. and Lauro C. (2005). PLS Path Modeling.
Computational Statistics & Data Analysis, 48(1), 159-205.
Tenenhaus M. and Hanafi M. (2007). A bridge between PLS path modeling and multi-block
data analysis. In: Esposito Vinzi V.et al. (Eds.), Handbook of Partial Least Squares: Concepts,
Methods and Applications, Springer-Verlag.
Wold H. (1966). Estimation of Principal Components and Related Models by Iterative Least
Squares. In: P.R. Krishnaiah (Ed.), Multivariate Analysis, Academic Press, New York, 391-420.
Wold H. (1973). Non-linear Iterative PArtial Least Squares (NIPALS) modelling. Some current
developments. In: P.R. Krishnaiah (Ed.), Multivariate Analysis III, Academic Press, New York,
383-407.
Wold H. (1975). Soft Modelling by latent variables: the Non-linear Iterative PArtial Least
Squares (NIPALS) Approach. In: J. Gani (Ed.), Perspectives in Probability and Statistics:
Papers, in Honour of M.S. Bartlett on the occasion of his sixty-fifth birthday, Applied Probability
Trust, Academic, London, 117-142.
Wold H. (1979). Model construction and evaluation when theoretical knowledge is scarce: an
example of the use of Partial Least Squares. Cahier 79.06 du Département d'Économétrie,
Faculté des Sciences Économiques et Sociales. Genève: Université de Genève.
Wold H. (1982). Soft Modeling: The basic design and some extensions. In: K.G. Jöreskog and
H. Wold (Eds.), Systems under indirect observation, Part 2, North-Holland, Amsterdam, 1-54.
Wold H. (1985). Partial Least Squares. In: S. Kotz and N.L. Johnson (Eds.), Encyclopedia of
Statistical Sciences, John Wiley & Sons, New York, 6, 581-591.