BMDP 2009
Grids …
Frame …
Bins …
PR Polychotomous Stepwise Logistic Regression
3S Nonparametric Statistics
1T Univariate and Bivariate Spectral Analysis
2T Box-Jenkins Time Series Analysis
1V One-way Analysis of Variance or Covariance
2V Analysis of Variance and Covariance with Repeated Measures
3V General Mixed Model Analysis of Variance
4V Univariate and Multivariate Analysis of Variance and Covariance, including Repeated Measures
5V Unbalanced Repeated Measures Models with Structured Covariance Matrices
8V General Mixed Model Analysis of Variance: Equal Cell Sizes
Chapter 1
Introducing BMDP High Resolution Graphics
Most line-by-line printer graphical displays produced by BMDP programs are now available in
high resolution, including bivariate scatterplots, dot density displays, histograms, Q-Q
(quantile-quantile) plots, and ROC (Receiver Operating Characteristic) curves. Box and whisker
plots and shaded correlation and distance matrices have been added. A more detailed list of
displays follows below.
A quick description
After a BMDP program has run, the BMDP high resolution display appears automatically in the
BMDP Graphics Window. Graphics options available via menus and dialog boxes include:
Symbol, size, color, font, and line choices. Graph titles and axis labels
are easily customized.
The shape of a display can be altered by dragging the Graph window frame.
Options and features can be applied repeatedly to data in a given display. At any
stage the display can be saved as a Windows Enhanced Metafile (EMF) for importing
into Word, or it can be saved in a form for further editing later in the Graphics
Window.
The 44 BMDP programs provide a broad array of analyses. Most programs are identified by a
number followed by a letter like “7D, One- and two-way ANOVA with data screening”. The letters
loosely classify the programs into series:
D - data description M - multivariate analyses
F - frequency tables and log linear models L - life tables and survival analysis
R - regression analysis S - nonparametric statistics
V - analysis of variance T - time series
A few program names start with a letter like KM for K-means clustering or LR for logistic
regression, and don’t fit this naming scheme. A detailed description of analyses and features
available in each program is found under Help on the BMDP statistical software main menu.
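The naming scheme described above can be illustrated with a short Python sketch. This is not part of BMDP; the function name program_series and the choice to return None for names outside the scheme are assumptions made for illustration only.

```python
import re

# Series letters as listed in the text above.
SERIES = {
    "D": "data description",
    "M": "multivariate analyses",
    "F": "frequency tables and log linear models",
    "L": "life tables and survival analysis",
    "R": "regression analysis",
    "S": "nonparametric statistics",
    "V": "analysis of variance",
    "T": "time series",
}


def program_series(identifier):
    """Return the series for a digit-plus-letter identifier such as "7D".

    Returns None for names that do not follow the scheme (for example "KM").
    """
    match = re.fullmatch(r"(\d)([A-Z])", identifier.strip().upper())
    if match is None:
        return None
    return SERIES.get(match.group(2))


print(program_series("7D"))
print(program_series("KM"))
```

Names such as KM and LR begin with a letter, so the sketch reports them as outside the scheme rather than guessing a series.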
Graphics available in high resolution
Here is a list of high resolution displays and the programs that produce them. Some displays are
illustrated in Chapter 2 and Chapter 5. A full description of program analyses and features is
found on BMDP’s main menu under Help.
Data Screening, Within Group Distributions, and Support for Analysis of Variance
- Histogram 2D,7D
- Dot density plot by group 7D
- Box and whisker plots 2D,7D
- Bar chart of means with standard errors 7D
- Stacked histograms (showing group membership) 7D
- Cumulative histogram, cumulative frequency distribution plot,
normal probability plot, and half normal probability plot 5D
- Q-Q (quantile-quantile) plot 3D
- Row and Column profiles for frequency table data CA
- Scatterplots of data and/or computed values 30 programs
- Box-Cox diagnostic plot for a variance stabilizing transformation 7D,1R
- Miniplots of cell means for factorial or repeated measures designs 9D,2V
Regression
- Plots of (1) residuals and residuals squared against predicted values,
(2) the dependent variable, fitted values, and residuals against the
independent variable, and (3) normal probability and detrended normal
probability plot of the residuals 1R,2R,3R,4R,5R,6R,9R
- Plots of regression diagnostics 2R,9R
- Partial residual plot 2R
- Confidence curves for nonlinear regression parameters (Cook-Weisberg) 3R
- Correct and incorrect classifications as a function of cutpoints on
computed probabilities for logistic regression LR
- Histograms of predicted probabilities of each group LR
- ROC curve (Receiver Operating Characteristic) LR
Discriminant Analysis, Cluster Analysis, and Factor Analysis
- Scatterplot of the first two canonical variables in discriminant analysis 7M
- K-means Cluster Profile display (variable means with std. deviations) KM
- Shaded distance or correlation matrices 1M,2M,AM,4M,6M,6R,9R
Life Tables and Survival Analysis
- Survivor function, cumulative survivor function, log of the
survivor function, hazard function, cumulative hazard function,
and death density function 1L,2L
Time Series
- Time series snapshot—moving trimmed means vs. time 1T
- Lagged scatterplots 1T
- Complex demodulation (amplitude and phase) of a time series 1T
- Plotted periodograms, covariance functions and log spectrum vs. frequency 1T
- Confidence bands about estimated spectral density 1T
- One or several time series in one frame or separate frames 1T
Chapter 2
Three Examples
Example 1. Program 7D: Analysis of Variance with Data Screening
The problem in this example is to screen income (in US dollars) for a one-way analysis of
variance. Income is grouped by level of education—high school dropout, high school
graduate, some college, and college graduate. Community survey data shown in BMDP
Manual Output 7D.2 are used—income is rescaled to match statistics found on the web in
2008. For the ANOVA analysis, here are the BMDP instructions generated via menus and
dialog boxes.
Program 7D’s default display, the “dot density” plot, appears in the BMDP Graph Window
following text results produced by the BMDP instructions. Group means are marked by a star.
Selecting options
We now add descriptive statistics for each group and modify the title and y axis label. Note
the menu bar at top of the Graph Window—here it is enlarged:
Menu items in bold are available for the Dot Density display. Use File to Save and Print the
current display or Open a previous graph. Use Select Graph to identify which display you
want when several are generated during one computer run.
Starting with the dot density plot shown above, Options on the Histogram menu are used
to add descriptive statistics below each group and modify plot labels. The statistics are
produced by the Show Statistics option and the labels are altered using Title, Font Size, and
Color on the Labels dialog box.
[Figure: dot density plot; y axis label: Income in 2008]
Here’s how to use the Labels dialog box below to make the above changes. To add the more
informative title and print it in blue, click on the Title line in the dialog box and type the title in
the yellow box under the Text heading. Then drag the blue Color slider all the way to the
right (The RGB number now appears in the box next to the title). Many colors are possible by
moving the red, green, and blue sliders to form combinations of these colors. The graph title
is printed in Bold Size 28 and the y axis label in Bold Size 15. (The latter was done by first
highlighting the Y Axis line.)
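How the three sliders combine into one color can be sketched in Python. This manual does not say in what form BMDP shows the "RGB number", so the 0 to 255 slider range and the hexadecimal notation below are assumptions, and rgb_to_hex is a hypothetical helper.

```python
def rgb_to_hex(red, green, blue):
    """Combine three slider values (assumed range 0 to 255) into a hex color string."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each slider value must be between 0 and 255")
    return "#{:02X}{:02X}{:02X}".format(red, green, blue)


# Blue slider all the way to the right, the others at zero.
title_color = rgb_to_hex(0, 0, 255)

# Mixing sliders gives other colors, e.g. red plus green for yellow.
mixed_color = rgb_to_hex(255, 255, 0)

print(title_color, mixed_color)
```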
Selecting alternative displays
The Histogram menu has three items: Density, Bar Chart, and Options.
From Density, in addition to Dot plot, you can select other displays of the same data:
A Box plot with the group median, 25th and 75th percentiles, and robust
identification of outliers
Side-by-side Histograms
A Stacked histogram where group membership is identified by colors within each
bar
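The quantities summarized by a box plot can be computed with a few lines of Python. The manual does not state which rule BMDP uses for its robust identification of outliers; the sketch below assumes the common 1.5 x IQR fences, and the income values are made up for illustration.

```python
from statistics import quantiles


def box_plot_summary(values, fence_factor=1.5):
    """Return quartiles and the values outside the fences (assumed Tukey-style rule)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - fence_factor * iqr
    high_fence = q3 + fence_factor * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical incomes for one education group, in thousands of dollars.
incomes = [10, 20, 30, 40, 50, 60, 70, 80, 200]
summary = box_plot_summary(incomes)
print(summary)
```

For these made-up values the quartiles are 30, 50, and 70, so the fences fall at -30 and 130 and only the value 200 is flagged.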
[Figure: three panels, Box Plots, Means with Standard Errors, and Sample sizes (cell counts); y axis: Income in 2008; groups: No_Grad, HS_Grad, Some_Col, Col_Grad. A star marks the group's mean.]
[Figure: two panels, Stacked Histogram (Count vs. Income in 2008) and Dot Plot Transposed (groups No_Grad, HS_Grad, Some_Col, Col_Grad vs. Income in 2008). A star marks the group's mean. A square represents more than 1 case.]
In the Graph Window, the shape of a graph can be changed by simply dragging the window
frame. This was done for these displays. The graphs shown are available with or without
group descriptive statistics (See Options menu).
Graphs can be printed directly from the Graph window using Print on the File menu, and
graphs can be saved as an EMF (Enhanced Metafile) using Save on the File menu and
imported into WORD as a picture ready for quick resizing. A display can also be saved as a
BGF (BMDP Graphics File), which at a later time can be read in the BMDP Graphics Window
for further editing.
Example 2. Program 6D: Scatterplots with Smoother and Power Transformation
The problem in this example is to screen the bivariate relation between 1990 population in
millions for 57 countries and the population projected for each country by the UN for 2020.
That is, we ask whether the data are appropriate for estimating a correlation between these
two quantitative measures. Are there outliers? Should the data be transformed?
The instructions for running program 6D should include:
/ PLOT YVAR = pop_2020.
XVAR = pop_1990.
After 6D is executed from the main BMDP window, the default scatterplot appears in the
Graphics Window.
[Figure: scatterplot titled "(1) POP_2020 VS. POP_1990"; x axis: POP_1990]
The distribution of points in the bivariate point cloud on the graph above is far from ideal for
computing a correlation: its shape is not like that of an American football with points falling
symmetrically across the area. Points for three countries straggle upward away from those for
the other countries. To identify the country with the largest projected population, click on the
point to find its coordinates. They are 152.5 and 269.1, and the country is Brazil.
The text for the plot title, size and colors are changed using Labels under Options.
This dialog box is the same as that for Histogram shown in Example 1.
For the line of best fit, select Linear under Smoother. In the graph on the right below,
the size of the residuals (from program 2R) across the predicted population values
should be fairly equal. They are not: for countries with small predicted values, the
spread is smaller than that for countries with larger values.
Note in the default plot above, the range of the y axis is 250, while that of the x axis is
much smaller. The two variables have the same units, so in the plot below we dragged
the right side of the window for some improvement.
[Figure: two panels. Left: Projected 2020 Population vs. 1990 Population (y axis: 2020 Population; x axis: 1990 Population). Right: RESIDUAL vs. PREDICTD.]
[Figure: two panels, 2020 Pop (sqrt) vs. 1990 Pop (log) and 2020 Pop (log) vs. 1990 Pop (log); x axis of both panels: Log (1990 Population); y axis of the left panel: Sqrt (2020 Population)]
If Show Statistics had been selected from Scatterplot/ Options menu, the bottom of the
log-log plot would look like this:
Example 3. Program 6D: Can Subpopulations Alter a Linear Relation?
In this example, for two pairs of variables, the regression line is displayed for the complete
sample and then separately for each subpopulation. For one pair of variables, the correlation
within each subgroup is stronger than that for the complete sample, and, for the other pair of
variables, the opposite is true. The Fisher data for 150 iris flowers are used—the length (L)
and width (W) of sepals and petals are recorded for each flower plus its species: setosa,
versicolor, or virginica.
Here are plots of length versus width for the sepals and for the petals. In each plot the line of
best fit is drawn using Scatterplot/ Smoother/ a+b*x [Linear] and the correlation and other
statistics are obtained using Options/ Show Statistics. For the sepals, the correlation
between length and width is negative or close to zero (-0.118) and for the petals, it is 0.963.
[Figure: SEPAL_L vs. SEPAL_W with line of best fit]
N = 150, R = -0.118
         Mean     St.Dev.
X:     30.573     4.3586
Y:     58.433     8.2806
Y = 65.262 - 0.2233*X; RMS = 68.078
[Figure: PETAL_L vs. PETAL_W with line of best fit]
N = 150, R = 0.963
         Mean     St.Dev.
X:     11.993     7.6223
Y:     37.580     17.652
Y = 10.835 + 2.2299*X; RMS = 22.868
Next, lines and symbols are defined individually for each species of flower. Here are the
results. An explanation of how to do this follows. For the Setosa group (blue), the relation
between length and width of the sepals is strongly positive (r = 0.74), not negative as it was for
all species combined. For the petals, the correlation for the Setosa group is much lower (0.33)
than that for all species combined (0.96).
[Figure: two panels, SEPAL_L vs. SEPAL_W with Groups and PETAL_L vs. PETAL_W with Groups; legend in each panel: Setosa, Versicolor, Virginica]
To display different symbols, lines, and /or colors in a high resolution scatterplot, Group
Codes or Cutpoints must be defined in the BMDP program 6D instructions. Plots can be
requested for all groups together or for each group separately. In the BMDP run, these
instructions were included:
/ Group variable is IRIS.
Codes (IRIS) = 1 TO 3.
Names(IRIS) = SETOSA, VERSICOL, GVIRGINIC.
/ Plot Yvar = SEPAL_L, PETAL_L.
Xvar = SEPAL_W, PETAL_W.
Group=ALL. Group=SETOSA.
The names for two species begin with “V”, so a “G” was inserted to make the names unique.
After program 6D runs and the default scatterplot is in the Graph Window:
Select For each group from the Scatterplot menu to indicate that symbols, smoothers
(e.g., lines, curves), enhanced group names, and/or line type will be specified for one
or more groups.
Group names, symbols, and colors are specified in the Symbols box under Options.
To add each group name and color its symbol, click on a line under Group and type
the name in the yellow box. Here, for the Virginica group, the red Color slider is
dragged all the way to the right (The RGB number now appears in the box next to the
group name). Many colors are possible by moving the red, green, and blue sliders to
form combinations of these colors. While the yellow highlight remains, select a symbol
for the group from the drop-down list under Symbol in the middle of the box.
To include the group name with its associated symbol in a legend, select Show legend
under Options.
To draw a line of best fit for each group, select Scatterplot/ Smoother/ a+b*x
[Linear].
The correlations reported for the Setosa group were obtained using Show Statistics
on a plot of the Setosa group alone (not shown).
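The effect shown in this example, a pooled correlation that differs in sign from the correlation within each group, can be reproduced on toy data. The sketch below is plain Python, not BMDP output; the two groups and their values are invented, and pearson_r is a hypothetical helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data: within each group y rises with x, but group B sits
# lower and further right than group A.
groups = {
    "A": ([1, 2, 3, 4], [10, 12, 11, 13]),
    "B": ([6, 7, 8, 9], [2, 4, 3, 5]),
}

within = {name: pearson_r(xs, ys) for name, (xs, ys) in groups.items()}
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
pooled = pearson_r(all_x, all_y)

print(within, pooled)
```

Here each group has a within-group correlation of 0.8 while the pooled correlation is negative, which is the same pattern seen for the sepal measurements.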
Chapter 3
Graphics Window Menus
The menus and submenus of the main BMDP Graphics Window are displayed and described
in this chapter. Common Options follow in Chapter 4.
File Menu
Open provides access to folders where previously made BGF plots are stored.
The current display in the graphics window can be Saved As a BGF file which can be
read into the Graphics Window at a later time for further editing or as an EMF file that
is easily inserted in Word.
Click Select Graph to pick a graph to display from the list of graphs produced in the
current BMDP run.
Scatterplot Menu
Scatterplot menu features and options are illustrated in Chapter 2, Examples 2 and 3.
Histogram Menu
The use of Histogram is described in Chapter 2, Example 1.
Density provides Dot plot, Box plot, and Histogram. Dot plots and histograms may
be Stacked.
Bar Chart draws bar charts of Counts and Means.
Four Color choices are available. Gray scale provides greater resolution than the
color options and is useful for black & white printing.
Click Range to set the minimum and maximum of the scale for the correlation or distance
measures. For example, if the smallest correlation is 0.59, you wouldn’t want the scale
for 3 colors to range from zero to 1.0.
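The effect of the Range setting can be sketched in Python. The sketch assumes the shading scale is cut into equal-width intervals between the chosen minimum and maximum. That assumption is consistent with a legend printed elsewhere in this manual (0.0088 to 0.3392, 0.3392 to 0.6696, 0.6696 to 1.), but the manual does not state the rule, and the function names are hypothetical.

```python
def shading_ranges(minimum, maximum, n_shades=3):
    """Split [minimum, maximum] into n_shades equal-width (low, high) intervals."""
    width = (maximum - minimum) / n_shades
    edges = [minimum + i * width for i in range(n_shades)] + [maximum]
    return list(zip(edges[:-1], edges[1:]))


def shade_index(value, minimum, maximum, n_shades=3):
    """Return the 0-based interval that a value falls into."""
    if not minimum <= value <= maximum:
        raise ValueError("value lies outside the chosen range")
    width = (maximum - minimum) / n_shades
    return min(int((value - minimum) / width), n_shades - 1)


# With the range narrowed to 0.59 .. 1.0, all three shades are used.
for low, high in shading_ranges(0.59, 1.0):
    print(round(low, 4), "to", round(high, 4))
```

With a scale from zero to 1.0 and a smallest correlation of 0.59, the lowest shade would never appear; narrowing the range spreads the three shades over the values actually present.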
Profile Menu
This display is specific to program KM. Separately for each cluster it displays the mean and
standard deviation of the variables within the cluster. See example under KM in Chapter 5.
A name can be assigned to each cluster in the Labels & Symbols box. See Common
Options in Chapter 4 for more information.
Vertical Lines controls the color and thickness of the grand mean line drawn for each
cluster.
Horizontal Lines controls the color, thickness, and style of lines drawn to show the
spread (standard deviation) of each variable in each cluster.
Chapter 4
Common Options for Graphs
BMDP provides several options for fine tuning and customizing graphical displays. For
example, you can add a title or longer axis labels, select plot symbols and their color, log or
power transform plot scales, and more. You can repeatedly modify options to achieve the
desired look for your graph. Options specified in a dialog box are listed following those that
provide a single toggle-like instruction.
Click on a line under Text to highlight it in yellow and then type the respective
plot Title or Y axis label. For Scatterplots, also type an X Axis label.
Specify the Font, Size, and Color for each label.
The Font may be written in Bold, Italic, and/or Underlined.
For the 2D Line plot display, here is the Labels dialog box:
Specify an X Axis label and, in the next line, control how names of statistics are
written in the display—this example shows they will be written in blue with bold
Arial font size 13.
Color, Size, Font, and Style of the Variable Names and Group (Cluster) Names
can be changed
Lines below Group Names are used to (1) type a name for each cluster and (2)
control the Color, Size, and Font of the mean symbol. In the box above, the first
cluster is named “F Depressed”.
Color is checked in the box next to Apply to every group’s symbol the current,
so the mean symbols for variables in all clusters will be red. The size of the
symbol, however, will be 5 for the “F Depressed” cluster.
Symbols … see Example 3 in Chapter 2.
Scales … has one box for Scatterplots and another for Histograms.
For a Scatterplot this box appears:
Specify a power transform to alter the plot scale units: e.g., 1 for original units,
0.5 for square root, and 0 for a log scale. See Example 2 in Chapter 2 for
examples of a square root scale and a log scale.
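The power convention described here (1 for original units, 0.5 for square root, 0 for log) can be written as a small Python function. The use of the natural logarithm for power 0 is an assumption, since the manual does not say which base BMDP uses, and power_scale is a hypothetical name.

```python
from math import log


def power_scale(value, power):
    """Transform a plot coordinate: power 1 keeps the original units,
    0.5 gives the square root, and 0 stands for a log scale."""
    if power == 1:
        return value
    if value <= 0:
        raise ValueError("log and fractional powers need a positive value")
    if power == 0:
        return log(value)  # natural log assumed
    return value ** power


populations = [1.0, 25.0, 100.0]  # made-up values in millions
print([power_scale(p, 0.5) for p in populations])
print([power_scale(p, 0) for p in populations])
```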
Specify where scales should be displayed.
For a Histogram display (histogram, box plot, or dot density display) see Example 1 in
Chapter 2.
Lines …
In Histogram displays,
Thickness changes box plot lines and error bars on the bar chart of means.
Line options do not apply to histograms.
In Line plot displays,
Thickness and Style apply to lines from the plot frame to the statistics name.
Color does not apply.
Grids …
For Scatterplot displays (1) a vertical dotted line may be placed at each tick on the X axis
and/or a horizontal line at each Y tick or (2) grid lines may be placed at two user specified
positions on the X and/or Y axes.
Frame …
For a Scatterplot display,
The plot frame can be colored. Note here the Red, Green, and Blue sliders are
positioned to create the color yellow.
Frame Thickness changes the thickness of the frame. “2” is heavier than “1”.
Frame at is used to specify which sides of the plot have a frame.
Select Indent X axis or Indent Y axis to indent plot points so none are jammed
against the frame.
For a Histogram display (histogram, box plot, and dot density display)
The plot frame can be colored. Note here the Red, Green, and Blue sliders are
positioned to create the color purple.
Frame Thickness changes the thickness of the frame. “2” is heavier than “1”.
Bins …
For a Histogram display (histogram and dot density display)
Chapter 5
Program Specific Features and Options
In this section we give a brief overview of BMDP programs ordered by their two character
identifiers. The symbol marks comments about high resolution graphics. For
more details about program analyses and features, see Help on the BMDP main menu.
[Figure: histogram; x axis: Population Density; y axis: COUNT]
For the same data, here is a line plot that displays 2D’s estimates of location on a scale
that could be placed at the bottom of the histogram:
3D t Tests
For 2-sample, paired, or one-sample designs, classic t tests are printed by default. Crude
histograms are printed with each test. When outliers or distributional problems are a
concern, trimmed t tests and nonparametric tests are available.
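Q-Q (quantile-quantile) plot to compare the distributions of two groups or two
variables. When the plot points follow the line y = x, the groups have the same
distribution; points on a different straight line indicate distributions of the same shape
that differ in location or scale.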
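The points of a Q-Q plot for two samples can be computed as in the Python sketch below. It pairs matching quantiles of the two sorted samples and interpolates linearly when the sample sizes differ; this interpolation rule is an assumption, not a description of how program 3D computes its plot, and the data are invented.

```python
def quantile_at(sorted_values, i, m):
    """Quantile number i of m evenly spaced quantiles, by linear interpolation."""
    position = i * (len(sorted_values) - 1) / (m - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def qq_points(sample_a, sample_b):
    """Return (quantile of a, quantile of b) pairs; both samples need 2+ values."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    m = min(len(a), len(b))
    return [(quantile_at(a, i, m), quantile_at(b, i, m)) for i in range(m)]


# Invented samples of different sizes drawn from the same range of values.
group_1 = [5, 1, 3, 2, 4]
group_2 = [3, 5, 1]
points = qq_points(group_1, group_2)
print(points)
```

For these two samples every pair has equal coordinates, so the plotted points would lie on the line y = x.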
[Figure: three miniplot panels, Never Married, Married, and Divorced; x axis: Education (1 to 4); y axis: Income in 2008; plot symbols J, P, C, N]
[Figure: two panels; y axes: Normal Density Function and (Rank+1)/n]
[Figure: scatterplot with y axis AXIS 2; points labeled M_10 to M_90+ and F_10 to F_90+, and HANGING, DROWNING, KNIVES, JUMPING, POISON, GUNS, COOKGAS, TOXICGAS, OTHER]
1L Life Tables and Survivor Functions
1L uses either the product-limit (Kaplan and Meier) or the actuarial life table (Cutler and
Ederer) method to estimate the survivor function. If subjects are separated into treatment
groups, the survivor function can be estimated for each and tested for equality. Mantel-
Cox, Breslow, Tarone-Ware, and Peto-Prentice tests are available.
Plots include cumulative survivor function, log of the survivor function, hazard function,
cumulative hazard function, and probability density function.
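The product-limit (Kaplan and Meier) estimate named above can be sketched in a few lines of Python. This is a textbook version for illustration, not 1L's implementation; the function name and the follow-up times are invented, and subjects censored at a death time are counted as still at risk at that time.

```python
def product_limit(times, events):
    """Return [(time, estimated survival)] at each distinct death time.

    times  -- follow-up time for each subject
    events -- True if the death was observed, False if the time is censored
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(order) and order[i][0] == t:
            if order[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Invented follow-up data in months; False marks a censored subject.
months = [2, 3, 3, 5, 8]
died = [True, True, False, True, False]
curve = product_limit(months, died)
print(curve)
```

With these five subjects the estimate steps down to about 0.8, 0.6, and 0.3 at months 2, 3, and 5; the censored subjects lower the number at risk without producing a step.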
[Figure: two panels plotted against Months for two groups, Age LE 45 and Age over 45]
[Figure: two shaded matrices titled "Absolute Correlation Similarity Measure with Average Linkage"; variables: CONCENTR, ALERT, ANNOY, IRRITABL, CONTENT, TENSE, SLEEPY, TIRED, SMOKING1, SMOKING2, SMOKING3, SMOKING4; shading legend: 0 to 0.3333, 0.3333 to 0.6666, 0.6666 to 1.]
The data here are from an instrument used in a smoking cessation study. The “smoking”
items concern the subject’s desire in different situations to have a cigarette and were
randomly ordered among questions about the psychological and physical state of the
subject. The smoking items cluster tightly together and have little relation to the other
questions.
3M Block Clustering
This program is appropriate for categorical data and groups subsets of cases into clusters
that are alike for subsets of variables. 3M reorders the rows and columns of the original
data matrix and uses different symbols to identify “blocks” of data. A block symbol
diagram is available as text output.
No high resolution graphics
4M Factor Analysis
Factor analysis has three objectives: (1) to place variables into factors (groups) such that
each variable is more highly correlated with the others in its factor than with other
variables, (2) to interpret each factor according to the variables belonging to it, and (3) to
compute a score for each factor. Initial factor extraction options are principal components
analysis (PCA), maximum likelihood factor analysis, Kaiser’s second generation little jiffy,
or principal factor analysis. 4M has seven options for factor rotation plus many other
options to control the analysis.
Scatterplots of rotated and unrotated factor loadings, scatterplots of factor scores, and
a shaded correlation matrix with variables ordered by sorted loadings.
Example. See program 1M for a description of the data used in these shaded matrices:
[Figure: two shaded matrices titled "Absolute Value of Correlations sorted by Loadings"; variable order: ANNOY, IRRITABL, CONTENT, TENSE, CONCENTR, SMOKING3, SMOKING4, SMOKING2, SMOKING1, SLEEPY, TIRED, ALERT; shading legend: 0 to 0.3333, 0.3333 to 0.6666, 0.6666 to 1.]
5M Linear and Quadratic Discriminant Analysis
5M performs linear and quadratic discriminant analysis.
No high resolution graphics
6M Canonical Correlation
Canonical correlation analysis determines the linear relationships between two sets of
variables by finding coefficients for a linear combination of the x variables and another set
of coefficients for the y variables such that the correlation between the two linear
combinations (canonical variables) is maximized. 6M then derives more pairs of canonical
variables that are independent of the previous pairs.
Shaded correlation matrix for variables in the x set and the y set. Bivariate plots of
variables and canonical variables.
Example. Here are the Fisher iris data used in Example 3 in Chapter 2. Note that
in the canonical variable plot, group means are marked by .
[Figure: 2nd Canonical Variable vs. 1st Canonical Variable; legend: Setosa, Versicolor, Virginica, and the mean of each group]
(via Euclidean distance) cluster. There are four options to standardize data. For three of
these, KM allows incomplete data. The user can identify initial cluster membership and
select a categorical variable with which to crosstabulate final cluster membership. KM is
useful for large data sets and it supplies information about the role of individual variables
in the clustering.
Scatterplot of the orthogonal projection of cases into the plane defined by the three
most populous clusters, bivariate scatterplots of user selected variables, and a Cluster
profile display: for each cluster the mean of each variable is displayed relative to the
grand mean of all data and the standard deviation is indicated by the length of a
horizontal line through the mean.
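The numbers behind a cluster profile display can be sketched in Python for a single variable: the cluster mean, its offset from the grand mean, and the within-cluster standard deviation. The sketch assumes the sample (n - 1) standard deviation and uses invented ages; it is not KM's own computation.

```python
from statistics import mean, stdev


def cluster_profile(values, labels):
    """Summarize one variable by cluster: mean, offset from grand mean, and SD.

    Each cluster needs at least two cases for the standard deviation.
    """
    grand_mean = mean(values)
    profile = {}
    for label in sorted(set(labels)):
        members = [v for v, l in zip(values, labels) if l == label]
        cluster_mean = mean(members)
        profile[label] = {
            "mean": cluster_mean,
            "offset": cluster_mean - grand_mean,  # position relative to the grand mean line
            "sd": stdev(members),                 # length of the spread line (n - 1 assumed)
        }
    return grand_mean, profile


# Invented ages for two clusters.
ages = [20, 22, 24, 60, 62, 64]
cluster = [1, 1, 1, 2, 2, 2]
grand_mean, profile = cluster_profile(ages, cluster)
print(grand_mean, profile)
```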
Example. Here, using the survey data, is a cluster profile from a default KM run that
identified five clusters. The vertical line in each cluster represents the overall mean of
each variable. The profile display can help the user to characterize and name each
cluster. The TOTAL depression score in the first cluster, “F Depress”, is considerably
above the mean and appears to set the 34 young women in this cluster apart from the
other subjects. For the second cluster, “M $$$ & Educ” (male-income-education), income
and education are greater for these 50 males than for subjects in other clusters. Female
subjects are found in clusters 3 and 4, males in cluster 5. The older females tend to have
lower incomes and less education than the younger females.
[Figure: cluster profile display for five clusters, F Depress, M $$$ & Educ, F Old, F Young, and M Average; variables: SEX, TOTAL, INCOME08, EDUCATN, AGE]
[Figure: scatterplot of Income in 2008 vs. Age; legend: F old, M Average, F Young, F Depressed, M $$$ & Educ]
lines across groups. Box-Cox computations are available to determine a power
transformation to stabilize the variance of the residuals.
Scatterplots of residuals and residuals squared against predicted values. Plot of
observed(O) and predicted(P) values of the dependent variable against the observed
value of a specified independent variable, plot of the residuals against a specified
independent variable, a partial residual plot, and normal and detrended normal probability
plots of the residuals. When the Box-Cox option is requested, two scatterplots of the
residuals against the predicted value are made—one before the suggested transformation
and the other afterwards.
2R Stepwise Regression
This popular program fits a multiple linear regression equation in a stepwise manner by
entering or removing one variable at a time from a list of potential predictors. You also can
define sets of variables to enter or remove in a single step and you can force specific
variables to enter and remain in the equation.
Special note. The Caseplot option provides a useful line-by-line text display which, for
each case, features three diagnostics side-by-side: a measure of influence, a measure of
leverage, and a standardized residual.
Scatterplots of residuals and residuals squared against predicted values. Plot of
observed(O) and predicted(P) values of the dependent variable against the observed
value of a specified independent variable, plot of the residuals against a specified
independent variable, a partial residual plot, and normal and detrended normal probability
plots of the residuals. Added variable plots for variables not yet entered in the equation
(the residual of the dependent variable using already entered variables is on the y axis
and a candidate variable on the x axis). 21 regression diagnostics (measures of
influence, leverage, and residuals) are available for plotting in scatterplots.
Example. For 60 US cities, here are regression diagnostic plots for a model to predict
mortality using rainfall, % nonwhite, education, and SO2 as predictors. On the left, a
measure of influence is plotted against case number. We click on the highest point and
identify Case 37, New Orleans. In the right plot we separate influence into a deleted
standardized residual and a measure of leverage. New Orleans’ data are extreme in the
x-space and it has the largest deleted standardized residual.
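How influence separates into leverage and a deleted standardized residual can be sketched in Python. To stay self-contained the sketch fits a regression with a single predictor, whereas the example above uses four; it applies the usual textbook formulas, which are assumed rather than confirmed to match 2R's definitions, and the data are invented.

```python
from math import sqrt


def influence_measures(x, y):
    """Leverage, deleted standardized residual, and DFFITS for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    p = 2  # number of fitted coefficients: intercept and slope
    results = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        # Residual variance estimated with this case deleted.
        s2_deleted = (sse - e ** 2 / (1 - leverage)) / (n - p - 1)
        deleted_resid = e / sqrt(s2_deleted * (1 - leverage))
        dffits = deleted_resid * sqrt(leverage / (1 - leverage))
        results.append(
            {"leverage": leverage, "deleted_residual": deleted_resid, "dffits": dffits}
        )
    return results


# Invented data: the last case is far from the others in x.
x = [1, 2, 3, 4, 5, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
measures = influence_measures(x, y)
most_influential = max(range(len(x)), key=lambda i: abs(measures[i]["dffits"]))
print(most_influential, measures[most_influential])
```

In this toy data the last case has both the highest leverage and the largest deleted residual in absolute value, so it also has the largest DFFITS, the same pattern the example describes for New Orleans.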
[Figure: two panels. Left: The DFFITS Influence Measure vs. Case number (y axis: DFFITS Measure of Influence; x axis: Case Number). Right: Deleted Standardized Residual vs. HATDIAG Leverage (x axis: HATDIAG Leverage Measure).]
EXTREME                 CASE              9           3         4         6         10
STATISTICS    VALUE     NO.   LABEL       MORTALTY    RAIN      EDUCATN   NONWHITE  log_so2
DFFITS         2.1827   37    neworlLA    1113.0000   54.0000    9.7000   31.4000   0.0000
DFFITS        -1.0694   32    miamiFL      861.4000   60.0000   11.5000   13.5000   0.0000
3R Nonlinear Regression
For the built-in and user-specified functions, 3R uses analytically exact partial derivatives
in the iterative process to estimate parameters. Seven functions are built in; others can be
user-specified. Users can do maximum likelihood estimation for data from the exponential
family of distributions by iteratively reweighted least squares. A user-specified loss function
can replace the least squares criterion. Functions of parameters and their standard errors
can be estimated. 3R also provides robust regression; five functions are available to
downweight outliers.
Scatterplots of residuals and residuals squared against predicted values. Plot of
observed(O) and predicted(P) values of the dependent variable against the observed
value of a specified independent variable, plot of the residuals against a specified
independent variable, and normal and detrended normal probability plots of the residuals.
Special to 3R are confidence curves for the parameters.
Example. The model for the first plot below is one of 3R’s built-in functions. The data are
radioactivity counts in a baboon’s blood sampled over time. In the second plot, the
confidence curves provide an easy way to visualize the variability in estimated parameters
from a different study. These curves can show more than the usual Wald intervals: here
we see it is hard to estimate the upper bound with reasonable confidence.
[Figure: two panels. Left: Two-Compartment Model (sum of two exponentials), Radioactivity Count vs. Time, with Observed and Predicted values. Right: Cook & Weisberg Parameter Confidence Curves, Mitcherlitz Parameter P2 vs. T Value (df = 10), showing Upper Bound, Lower Bound, and Estimate.]
5R Polynomial Regression
A polynomial in one independent variable is fit to a dependent variable using least
squares. Orthogonal polynomials are used during computations.
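The idea of fitting through orthogonal polynomials can be sketched as follows: the columns 1, x, x^2, ... are orthogonalized over the observed x values (Gram-Schmidt here), after which each coefficient is a simple projection. This is an illustrative sketch, not 5R's code; the function names and data are made up.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def orthogonal_poly_fit(x, y, degree):
    """Least-squares polynomial fit using columns orthogonalized on the data."""
    basis = []
    for d in range(degree + 1):
        v = [xi ** d for xi in x]
        for q in basis:                        # remove components along earlier columns
            c = dot(v, q) / dot(q, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        basis.append(v)
    # Orthogonality makes each coefficient an independent projection of y.
    coefs = [dot(y, q) / dot(q, q) for q in basis]
    fitted = [sum(c * q[i] for c, q in zip(coefs, basis)) for i in range(len(x))]
    return coefs, fitted, basis

# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
x = [0, 1, 2, 3, 4, 5]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]
coefs, fitted, basis = orthogonal_poly_fit(x, y, degree=2)
```

Because the columns are orthogonal, raising the degree adds a coefficient without changing the lower-order ones, and the ill-conditioning of raw powers of x is avoided.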
[Figure: two panels. Left: shaded matrix for the variables RAIN, EDUCATN, POP_DEN, NONWHITE, MORTALTY, log_so2, and log_Nox, with shading legend 0.0088 to 0.3392, 0.3392 to 0.6696, and 0.6696 to 1. Right: normal probability plot, Expected Normal Value vs. Standardized Residual. Annotation: Y = 0.0134 + 0.9393*X; RMS = 0.04]
AR Derivative-free Nonlinear Regression
This program uses a secant method to approximate the derivatives and places a secant
plane into the response surface. Seven functions are built in; others can be user-specified.
A system of differential equations (e.g., a compartment model) can be fit to the data. A
user-specified loss function can replace the least squares criterion. Functions of parameters
and their standard errors can be estimated. The user can fix the value of a parameter or
impose upper and lower limits on individual parameters or on arbitrary combinations of
parameters. AR also offers ridge regression with a Marquardt option which, at each
subsequent iteration, tempers the correlation between parameters.
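The derivative-free idea can be illustrated in one dimension: the partial derivative of the model with respect to the parameter is replaced by the secant slope between the two most recent parameter values, and that slope is used in a Gauss-Newton style update. This is a simplified sketch, not AR's algorithm; the model `exp(-b*t)`, the function names, and the data are assumptions.

```python
import math

def model(b, t):
    return math.exp(-b * t)

def fit_by_secant(t, y, b0, b1, max_iter=50, tol=1e-12):
    """Estimate b in y = exp(-b t) without analytic derivatives."""
    prev, cur = b0, b1
    for _ in range(max_iter):
        if abs(cur - prev) < tol:
            break
        f_prev = [model(prev, ti) for ti in t]
        f_cur = [model(cur, ti) for ti in t]
        # Secant approximation to df/db at each observation.
        slope = [(fc - fp) / (cur - prev) for fc, fp in zip(f_cur, f_prev)]
        resid = [yi - fc for yi, fc in zip(y, f_cur)]
        denom = sum(s * s for s in slope)
        if denom == 0:
            break
        step = sum(s * r for s, r in zip(slope, resid)) / denom
        prev, cur = cur, cur + step
    return cur

# Hypothetical noise-free data generated with b = 0.5.
times = [0, 1, 2, 3, 4, 5]
values = [math.exp(-0.5 * ti) for ti in times]
b_hat = fit_by_secant(times, values, b0=0.3, b1=0.4)
```

With several parameters the same idea needs one more stored point than there are parameters, which is where the secant plane comes in.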
Scatterplots of residuals and residuals squared against predicted values. Plot of
observed(O) and predicted(P) values of the dependent variable against the observed
value of a specified independent variable, plot of the residuals against a specified
independent variable, and normal and detrended normal probability plots of the residuals.
Scatterplots can also be requested from a list of any x or y variables or derived variables
(predicted values, residuals, standardized residuals, weighted residuals, or the natural log
of the residual, predicted residual, or weighted residual).
[Figure: three panels. Probability Depressed by DEPRESSD group (NORMAL, Depressed), where a star marks the group's mean. Percent Correct vs. Cutpoint for Depressed, Normal, and Overall. Proportion True Positives vs. Proportion False Positives.]
The model used to predict which subjects are depressed is not that great!
PR Polychotomous Stepwise Logistic Regression
PR computes maximum likelihood estimates of parameters of logistic models for
multinomial data. The categorical values of the dependent variable may be nominal or
ordered. PR has the same capability of automatic generation of design variables as LR.
Histograms of predicted probabilities for each category of the response. For each
category of the response variable, scatterplots of the standardized residuals versus an
independent variable. Scatterplots where the user specifies the x and y variables from a
list including variables and cell descriptions (e.g., observed proportion having outcome i,
predicted probability of outcome i, observation index).
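For a nominal response, the predicted probability of each outcome under a baseline-category logistic model can be computed as in the sketch below. This covers only the nominal case, and the coefficient values, the function name `category_probabilities`, and the choice of the first category as reference are assumptions for illustration, not PR output.

```python
import math

def category_probabilities(x, coefs):
    """Predicted probabilities for a baseline-category logistic model.

    coefs holds one (intercept, slopes) pair per non-reference category;
    the reference category has a linear predictor fixed at zero.
    """
    scores = [0.0]
    for intercept, slopes in coefs:
        scores.append(intercept + sum(b * xi for b, xi in zip(slopes, x)))
    top = max(scores)                       # subtract the max for numerical stability
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical coefficients for a three-category response and two predictors.
coefs = [(0.5, [1.0, -0.3]), (-1.0, [0.2, 0.8])]
probs = category_probabilities([1.2, 0.7], coefs)
equal_probs = category_probabilities([1.2, 0.7], [(0.0, [0.0, 0.0]), (0.0, [0.0, 0.0])])
```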
3S Nonparametric Statistics
Tests in this program do not require the assumption of normality. Many use ranks; 3S
automatically converts quantitative variables or scores into ranks. Kendall and Spearman
rank correlations are available. For independent groups, 3S provides the Mann-Whitney
test (two groups) and the Kruskal-Wallis test (two or more groups), with pairwise mean
comparisons for the Kruskal-Wallis test. For differences among related or paired variables,
the Sign test and Wilcoxon signed-rank test are available. 3S also provides the Friedman
test with pairwise mean comparisons and Kendall’s coefficient of concordance.
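The conversion of scores into ranks, and a rank correlation built on it, can be sketched as follows. Tied values receive the average of the ranks they span, and the Spearman correlation is computed as the Pearson correlation of the ranks. The function names and data are illustrative and do not reflect 3S internals.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                          # extend the run of tied values
        avg = (i + j) / 2 + 1               # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

tied_ranks = average_ranks([10, 20, 20, 30])        # [1.0, 2.5, 2.5, 4.0]
rho = spearman([1, 2, 3, 4, 5], [2, 4, 8, 16, 32])  # monotone relationship
```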
No high resolution graphics
1V One-way Analysis of Variance or Covariance
While 1V does provide one-way analysis of variance, it is most frequently used for
analysis of covariance with one main effect and one or more covariates. The slopes of the
covariates are tested for equality (parallelism) among groups.
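To make the parallelism idea concrete, the sketch below computes the least-squares slope of a single covariate within each group and the pooled within-group slope that an analysis of covariance assumes is common to all groups. It does not carry out the equality test itself; the group labels, data, and function name are hypothetical.

```python
def slope_parts(x, y):
    """Return (Sxy, Sxx), the corrected cross-product and sum of squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy, sxx

# Hypothetical groups: (covariate values, response values).
groups = {
    "A": ([1, 2, 3, 4], [3, 5, 7, 9]),
    "B": ([2, 4, 6, 8], [10, 14, 18, 22]),
}

parts = {name: slope_parts(x, y) for name, (x, y) in groups.items()}
group_slopes = {name: sxy / sxx for name, (sxy, sxx) in parts.items()}
# Pooled within-group slope: sum of Sxy over groups divided by sum of Sxx.
pooled_slope = sum(p[0] for p in parts.values()) / sum(p[1] for p in parts.values())
```

When the group slopes differ markedly from the pooled slope, the parallelism assumption is in doubt and the adjusted group comparison should be interpreted with care.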
For each group, scatterplots of (1) residuals versus covariates, (2) observed(O) and
predicted(P) values versus covariates, (3) residuals versus predicted values, and (4)
residuals squared versus predicted values.
[Figure: plot with x-axis values 1 through 12 and y-axis ticks from 20 to 40; plotting symbols S (Shoulder) and H (Hip).]
5V Unbalanced Repeated Measures Models with Structured Covariance Matrices
5V analyses repeated measures data for many designs including those with unequal
variances, covariance matrices with a specific pattern, and incomplete data. Maximum
likelihood (ML) or restricted maximum likelihood (REML) is used to compute estimates of
the regression and covariance parameters. 5V provides a wider choice of covariance
structures than 2V, 3V, or 4V.
No high resolution graphics