Get Started Guide 22
1 author:
F. James Rohlf
State University of New York
All content following this page was uploaded by F. James Rohlf on 23 March 2015.
EXETER SOFTWARE
47 Route 25A, Suite 2
Setauket, New York 11733-2870
Information in this document is subject to change. The software described in this
document is furnished under a license agreement (single-user or site license). The software
may be used or copied only in accordance with the terms of the agreement.
Copyright © 2009 by Applied Biostatistics Inc., 10 Inwood Road, Port Jefferson, New
York 11777. All rights reserved worldwide.
ISBN: 0-925031-31-3
Preface
NTSYSpc was developed originally for use by students in a course “Taxonomia numérica
em microcomputadores” held in September 1985 at the Estação Agronómica Nacional,
Oeiras, Portugal. Many of the programs were written on a portable PC as I worked each
evening on the balcony of a hotel in Estoril, developing the programs needed for the next day's
lab projects. The beautiful surroundings and enthusiastic students seemed to have helped.
Most of the design and many programs were developed during the two-week course. It was
quickly recognized that such a program on a personal microcomputer was of general
interest and that even the primitive PC of those days was able to handle most datasets.
NTSYS was originally written in FORTRAN for the IBM 360/50 mainframe computer
at the University of Kansas in 1966. That version was developed with the help of Ron
Bartcher who also converted it for use on a GE-635 computer in 1968. In 1969 John
Kishpaugh and David Kirk helped with the conversion of NTSYS from the GE-635 back to
an IBM 360/50 and then to the Univac 1100 computer system—both at Stony Brook
University. In addition, many others contributed to its development over the years.
NTSYSpc is a new program written in Pascal. Fortunately, after all of the previous
experience with conversions, most of the computational routines in NTSYS were by now
relatively easy to convert to another language. Both the program and the documentation
have benefited greatly over the years from the help of the many users who have spotted
“glitches” in the program and the documentation. Drs. Dean Adams, Leslie Marcus,
and Dennis Slice have made a number of important contributions. NTSYSpc will continue
to be developed. New programs and features are planned so that the system can evolve to
better meet your needs. Your comments, suggestions, and criticisms are much appreciated.
NTSYSpc has gone through many revisions. Exeter’s website should be checked for
new releases that can be downloaded. The help file has been expanded and improved. It
contains technical information that was once in the printed documentation.
This Getting Started Guide is intended only as a quick introduction to the use of
NTSYSpc. Technical details about the many computational modules and examples of
their use are given in the help file.
1. Introduction
1.1 Areas of application
NTSYSpc is a system of programs that is used to find and display structure in multivariate
data. For example, one may wish to discover that a sample of data points appears to have
come from two or more distinct populations. Of equal interest is the
discovery that some subsets of variables are highly inter-correlated. The program was
originally developed for use in biology in the context of the field of numerical taxonomy
(which explains why the name of the program is NTSYS—for Numerical Taxonomy
SYStem). But the programs have also been widely used in morphometrics, ecology and in
many other disciplines in the natural sciences, engineering, and the humanities. The terms
mathematical taxonomy and automatic classification have also been used to describe this
field of application. The techniques also represent a subset of multivariate data analysis and
have close ties to some methods in the field of pattern recognition.
Within the field of systematic biology, one can distinguish two different approaches to
classification. In phenetics one is concerned with the discovery and description of the
patterns of biological diversity and forming classifications based on overall similarity
computed from multivariate data. These methods are commonly used in morphometric
studies. In cladistics one is interested in inferring the evolutionary history of the organisms
under study and using it as a basis for classification. Specialized methods have been
developed to take into account the assumption that the underlying model is of a branching
evolutionary tree. It is expected that the best biological explanation of the observed
diversity of a set of organisms will come in terms of their evolutionary history. The
methods are intended to make the best estimates of the evolutionary tree given a set of
descriptive data on a set of organisms. The most commonly used methods are justified on
the basis of the philosophical principle of parsimony (that the shortest tree that can be fitted
to a set of data should be the best estimate of the true tree) but statistically more powerful
methods based on the principle of maximum-likelihood are increasing in popularity. The
neighbor-joining method is also often used.
Many of the methods furnished in NTSYSpc are associated with the field of phenetics.
However, they are best interpreted as simply methods for multivariate data analysis. There
are programs by others that are specialized for phylogenetic methods. Some of the better
known ones are PAUP (written by David Swofford, currently distributed by the Illinois
Natural History Survey) and PHYLIP. However, Saitou and Nei's (1987) neighbor-joining
method of phylogenetic tree estimation is included in NTSYSpc. The UPGMA procedure in
the SAHN module is also often used on molecular data. A unique feature of its
implementation here is that it is able to take ties into consideration rather than simply using
an arbitrary tie-breaking rule. NTSYSpc also contains specialized methods used in geometric
morphometrics to study variation in shapes of objects.
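To illustrate what a tie means in UPGMA clustering, here is a minimal Python sketch. It is not the SAHN module's algorithm: it only reports the levels at which two or more pairs of clusters are equally close, and then merges the first such pair arbitrarily. The function name, the "+" naming of merged clusters, and the input format (a dictionary of pairwise distances) are assumptions made for this example.

```python
from itertools import combinations

def upgma(labels, dist):
    """Average-linkage (UPGMA) clustering that also reports tied levels.

    labels: list of object names.
    dist:   dict mapping frozenset({a, b}) -> distance for every pair.
    Returns (merges, tie_levels). merges lists (cluster_a, cluster_b, level)
    in joining order; tie_levels lists levels where several pairs tied.
    """
    clusters = {name: 1 for name in labels}   # cluster name -> number of members
    d = dict(dist)                            # working copy of the distances
    merges, tie_levels = [], []
    while len(clusters) > 1:
        pairs = [frozenset(p) for p in combinations(sorted(clusters), 2)]
        level = min(d[p] for p in pairs)
        closest = [p for p in pairs if d[p] == level]   # exact equality
        if len(closest) > 1:
            tie_levels.append(level)          # a tie: the choice below is arbitrary
        a, b = sorted(closest[0])
        new = a + "+" + b
        na, nb = clusters[a], clusters[b]
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
                ) / (na + nb)
        del clusters[a], clusters[b]
        clusters[new] = na + nb
        merges.append((a, b, level))
    return merges, tie_levels

# Hypothetical data: pairs A-B and C-D are equally close.
labels = ["A", "B", "C", "D"]
dist = {frozenset(p): 6 for p in combinations(labels, 2)}
dist[frozenset(("A", "B"))] = 2
dist[frozenset(("C", "D"))] = 2
merges, tie_levels = upgma(labels, dist)
```

With tied input the resulting tree can depend on which pair is merged first, which is why a tie-breaking rule matters.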
The principal journal devoted to the general theory behind many of these techniques is
the Journal of Classification. It is published for the Classification Society of North America by
Springer. Theoretical papers are also published in many statistical journals. Applications of
these techniques are published in many scientific journals in the areas of application. For
example, Systematic Biology (formerly Systematic Zoology) has published many theoretical
and applied papers with special emphasis on applications in biological taxonomy.
Most users of these techniques begin with a data matrix that contains information
about the properties (features, characters, landmark or outline coordinates, etc.) of a number
of objects (individuals, specimens, quadrats, OTUs, etc.). NTSYSpc can then be used to
compute various measures of similarity or dissimilarity between all pairs of objects and then
summarize this information either in terms of nested sets of similar objects (cluster analysis)
or in terms of a spatial arrangement along one or more coordinate axes (ordination analysis or
various types of multidimensional scaling analysis). This User Guide assumes that the reader
has some familiarity with the methods. It does not contain much advice about which
similarity coefficient or which clustering method should be used. It does, however, give
many hints about the use of the methods. To keep the account general, the neutral terms
"object" or "OTU" (for operational taxonomic unit) are usually used to refer to the things
(specimens) being analyzed and the terms "variable" or "character" are used to refer to the
properties used to describe the objects under study.
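The first step described above, computing a dissimilarity between all pairs of objects, can be sketched in a few lines of Python. This is an illustration only: it assumes each row of the data matrix is one object, uses Euclidean distance as one of many possible coefficients, and the function name is made up for the example.

```python
import math

def distance_matrix(data):
    """Return the symmetric matrix of Euclidean distances between rows.

    data: list of rows, one per object (OTU), each a list of numeric
    values for the variables (characters).
    """
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Only the upper triangle is computed; the matrix is symmetric.
            dist[i][j] = dist[j][i] = math.dist(data[i], data[j])
    return dist

# Three hypothetical objects described by two variables.
data = [[0, 0], [3, 4], [6, 8]]
dist = distance_matrix(data)
```

A matrix of this kind is the usual input to a cluster analysis or an ordination.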
Users may find the following general references helpful (the complete references are
given in the Bibliography).
• Everitt and Dunn (1992) give a good concise introduction to both cluster analysis and
multidimensional scaling analysis. They furnish examples from biology.
• Felsenstein (2004) gives a comprehensive overview of phylogenetic methods.
• Gnanadesikan (1977) describes many methods for detecting patterns in multidimensional
data. Applications are from many fields.
• Hartigan (1975) describes a large number of different clustering methods. Examples (with
test data sets) are from a great many fields.
• Jackson (1991) is an excellent mathematical text on multivariate analysis. It is much more
comprehensive than implied by its title ("A user's guide to principal components").
• Massart et al. (1978) give a discussion with applications in analytical chemistry.
2. Modes of operation
There are two modes in which NTSYSpc can be used: interactive and batch. In interactive
mode, select a module from the main window by clicking its button; this displays a
window showing the various parameters and options for that module. After
this form is filled in, click the Compute button to “run” the module and have the results
appear as a new section in the Listing window. Start batch mode by selecting the “Run
batch file…” item on the File menu or by using the convenient speed button on the toolbar.
The batch dialog box will let you select a file containing a sequence of NTSYSpc commands,
specify up to nine parameters, and run the batch file. The batch file contains commands that
call up various modules, supply parameters, and execute them automatically. Batch files are
convenient for the processing of large data sets or for processing a large number of data sets
(perhaps from a simulation).
#nexus
begin trees;
[11 mosq. extracted from Harbach & Kitching (1998)]
translate 1 Anoph1, 2 Toxo12, 3 Wyeo13, 4 Uran17, 5 Culi21, 6 Orth28,
7 Mans29, 8 Psor32, 9 Aede44, 10 Cule101, 11 Dein126;
tree (1:16,(4:2,((6:1,(5:1,2:1):3):3,
((9:1,(8:1,7:6):4):2,((10:1,11:3):1,3:30):1):2):2):8);
end;
The NEXUS file format is described in Maddison et al. (1997). The method used for
describing trees is called the “Newick Standard” and was adopted June 26, 1986 by an
informal committee meeting during the Society for the Study of Evolution meetings in
Durham, New Hampshire. James Archie, William H.E. Day, Wayne Maddison, Christopher
Meacham, F. James Rohlf, David Swofford, and Joe Felsenstein were present. The reason for
the name is that the second and final session of the committee met at Newick's restaurant in
Dover, NH. Examples and a simple description of this format are available at
http://evolution.genetics.washington.edu/phylip/newicktree.html.
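To make the nested-parentheses notation concrete, the following Python sketch reads the tree description from the example above and lists its tips using the translate table. It is a minimal reader written for this illustration, not the parser used by NTSYSpc, and it handles only plain labels and branch lengths (no quoted labels or comments). The function names are assumptions.

```python
def parse_newick(text):
    """Parse a Newick string into nested (label, length, children) tuples."""
    text = "".join(text.split())              # drop spaces and line breaks
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_until(stops):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos]

    def node():
        nonlocal pos
        children = []
        if peek() == "(":                     # an internal node: read its children
            pos += 1
            children.append(node())
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        label = read_until(",():;")           # empty for unlabeled internal nodes
        length = None
        if peek() == ":":                     # optional branch length
            pos += 1
            length = float(read_until(",();"))
        return (label, length, children)

    return node()

def leaves(tree):
    """Return the tip labels in left-to-right order."""
    label, _length, children = tree
    if not children:
        return [label]
    return [name for child in children for name in leaves(child)]

NEWICK = """(1:16,(4:2,((6:1,(5:1,2:1):3):3,
((9:1,(8:1,7:6):4):2,((10:1,11:3):1,3:30):1):2):2):8);"""
TRANSLATE = {"1": "Anoph1", "2": "Toxo12", "3": "Wyeo13", "4": "Uran17",
             "5": "Culi21", "6": "Orth28", "7": "Mans29", "8": "Psor32",
             "9": "Aede44", "10": "Cule101", "11": "Dein126"}

tree = parse_newick(NEWICK)
names = [TRANSLATE[key] for key in leaves(tree)]
```

Each pair of parentheses encloses one cluster, and the number after a colon is the length of the branch leading to that tip or cluster.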
5. NTedit
The NTedit program, included with NTSYSpc, is a data editor designed for use with
NTSYSpc data files. For each of the basic file formats (rectangular, symmetric, diagonal,
tree, and graph) it displays an appropriate arrangement of the cells in the spreadsheet.
Using NTedit ensures that the files are formatted correctly.
The program can be started in three ways.
1. Click on the NTedit icon to start the program.
2. Load this program from the File|Edit file data file menu item or the edit speed-button on
NTSYSpc’s toolbar.
3. Use a DOS command window and type ntedit and the name of a file and then press the
Enter key to start the program.
Figure 5.1. Example of NTedit with the test.nts file loaded in a grid view.
Once the program is started, you can either create a new file or load an existing file.
NTS format files can be loaded either in a spreadsheet-like grid view (Figure 5.1) or in a plain
ASCII text editing view (Figure 5.2). Excel files can only be displayed in the grid view and
nexus files can only be displayed in the text view.
In the grid mode you can enter or correct data in any of the cells. You can insert or
delete rows and columns within the table by clicking on the appropriate menu choices or the
speed buttons on the tool bar. You can also add or delete rows and columns from the end of
the table by entering new values in the edit boxes displaying the current numbers of rows
and columns. To change the labels for the rows or columns (given in the first, protected,
row or column of the data table) click on the RowLabs or Col.Labs buttons to unprotect
these entries. You can then type new information in these cells. The new names must not
have any blanks within them. Click these buttons again to re-protect these labels from
accidental change.
To create a new file use the following steps:
1. select “New” from the file menu,
2. select the proper matrix type from the list (you may receive a warning about the
possible loss of data when you change matrix types),
3. enter the correct numbers of rows and columns in the edit boxes labeled “No. rows”
and “No. cols.” (note that the new values do not take effect until your cursor leaves
the edit boxes), and then
4. start entering your data.
If there are missing data the identifying numerical code needs to be entered in the edit box
labeled “Missing.” Click on the “Comments” button if you wish to add comments to the
matrix. When you are done you can use the “Check matrix” item under the Edit menu to
check that all data values are properly formatted numbers. It also will check to make sure
there are no empty cells. This same check is made when you attempt to save the matrix to a
disk file. You will be given a chance to replace all the empty cells with whatever code you
specified for missing data (if that field is blank then zeroes will be used).
Figure 5.2. Example of NTedit with the test.nts file loaded in a text view.
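The kind of check described above can be sketched in Python as follows. This is only an illustration of the idea, not NTedit's implementation: the function name, the representation of the grid as lists of strings, and the return values are assumptions.

```python
def check_matrix(rows, missing_code=""):
    """Check a grid of cell strings and fill in empty cells.

    rows: list of rows, each a list of cell contents as strings.
    missing_code: code to put in empty cells; if blank, "0" is used.
    Returns (cleaned_rows, bad_cells), where bad_cells lists the
    (row, column) positions whose contents are not valid numbers.
    """
    fill = missing_code.strip() or "0"
    cleaned, bad_cells = [], []
    for r, row in enumerate(rows):
        new_row = []
        for c, cell in enumerate(row):
            value = cell.strip()
            if value == "":
                value = fill                  # empty cell: use the missing code
            else:
                try:
                    float(value)              # is it a properly formatted number?
                except ValueError:
                    bad_cells.append((r, c))
            new_row.append(value)
        cleaned.append(new_row)
    return cleaned, bad_cells

# Hypothetical grid with two empty cells and one non-numeric entry.
grid = [["1.5", "", "3"], ["x", "2", " "]]
cleaned, bad_cells = check_matrix(grid, missing_code="-999")
cleaned_default, _ = check_matrix(grid)
```

Reporting the positions of bad cells, rather than stopping at the first one, lets all problems be corrected in one pass.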
NTedit can also be used to view and make changes in existing files. Changes have to be
made with care as there is no “undo” feature. A limitation of this mode is that the file must
already be in a proper format. If you try to load an NTS file that is not formatted properly
you will receive an error message and NTedit will try to load the file in ASCII text mode.
More flexibility is provided when editing a file in ASCII text mode (see Figure 5.2). Text
may be freely moved around and cut and pasted from other software. An “undo” and
“redo” feature is implemented (see the Edit menu). This mode is especially useful when a
file has format problems that prevent it from being loaded in the usual spreadsheet mode.
Its use is similar to that of the Windows Notepad program. Its advantages include the
ability to load much larger files and the fact that it does not append .txt to the ends of
file names when saving.
The NTedit help can be consulted for additional information – including various
keyboard shortcuts for use in text mode.
6.2 Other options
These depend upon the plot. For MXPLOT and MOD3D there are pick lists for selecting the
variables to be plotted (select the “Variables” tab). There are also choices of whether the
data points should be identified by sequential numbers or labeled using the labels in the
input data. There are also options to control the various attributes of the points and lines
making up a plot. There are special dialog boxes to allow you to select colors, plotting
symbols, fonts, etc.
Figure 6.1. Example of graphic options window from the MXPLOT module.
7. Typical applications
Furnished below are some examples of typical applications of NTSYSpc. For simplicity, the
required steps are shown as sequences of batch command statements. This is a compact
way to describe the sequence of modules and their parameters. See the help file for more
detailed technical information about each module.
Note that lines that begin with a quote character are treated as comment lines and are
ignored by NTSYSpc.
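The comment rule can be illustrated with a short Python sketch that keeps only the lines NTSYSpc would act on. Several things here are assumptions made for the example: that the quote character is the double quote, that leading spaces before it are allowed, and that blank lines can be skipped. The batch lines are placeholders, not real NTSYSpc commands.

```python
def active_lines(batch_text, comment_char='"'):
    """Return the non-blank lines that do not start with the comment character."""
    return [
        line
        for line in batch_text.splitlines()
        if line.strip() and not line.lstrip().startswith(comment_char)
    ]

# Placeholder text standing in for a batch file.
example = '''" this line is a comment
first-command
" another comment
second-command
'''
commands = active_lines(example)
```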
Bibliography
Burnaby, T. P. 1966. Growth-invariant discriminant functions and generalized distances. Biometrics, 22:96-110.
Darroch, J. N. and J. E. Mosimann. 1985. Canonical and principal components of shape. Biometrika, 72:241-252.
Everitt, B. S. and Dunn, G. 1992. Applied multivariate data analysis. Oxford Univ. Press: New York. 304 pp.
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer, Sunderland. 644 pp.
Gabriel, K. R. 1971. The biplot graphical display of matrices with application to principal component analysis. Biometrika, 58:453-467.
Gabriel, K. 1981. Biplot display of multivariate matrices for inspection of data and diagnosis. Pp. 147-173 in Barnett, V. (ed.) Interpreting Multivariate Data. John Wiley and Sons, New York.
Gabriel, K. and Odoroff, C. L. 1986. Illustrations of model diagnosis by means of three-dimensional biplots. Pp. 257-274 in Wegman, E.J. and DePriest, D.J. (eds.). Statistical image processing and graphics, Marcel Dekker, New York.
Gascuel, O. 1997. Concerning the NJ algorithm and its unweighted version, UNJ. Pp. 149-170 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies and biology. DIMACS series in discrete mathematics and theoretical computer science, Vol. American Mathematical Society, Providence, R.I.
Gnanadesikan, R. 1977. Methods for statistical data analysis of multivariate observations. Wiley. New York. 311 pp.
Hartigan, J. A. 1975. Clustering algorithms. Wiley. New York. 351 pp.
Jackson, J. E. 1991. A user’s guide to principal components. Wiley: New York. 569 pp.
Maddison, D.R., D.L. Swofford, and W.P. Maddison. 1997. NEXUS: an extendible file format for systematic information. Systematic Biology, 46:590-621.
Mantel, N. A. 1967. The detection of disease clustering and a generalized regression approach. Cancer Res., 27:209-220.
Reyment, R. A. 1991. Multidimensional paleobiology. Pergamon Press: New York, 377 pp.
Romesburg, H. C. 1984. Cluster analysis for researchers. Lifetime Learning Publications. Belmont, California. 334 pp.
Saitou, N. and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4:406-425.
Smouse, P. E., J. C. Long, and R. R. Sokal. 1986. Multiple regression and correlation extensions of the Mantel test of matrix correspondence. Systematic Zoology, 35:627-632.
Sneath, P. H. A. and R. R. Sokal. 1973. Numerical Taxonomy. Freeman. San Francisco. 573 pp.
Sokal, R. R. 1979. Testing statistical significance of geographic variation patterns. Systematic Zool., 28:227-231.
Sokal, R. R. and P. H. A. Sneath. 1963. Principles of Numerical Taxonomy. Freeman. San Francisco. 359 pp.
Weir, B. S. 1989. Building trees with DNA sequences. Biometric Bulletin, 6(4):21-23.