BMDP 2009

The document introduces BMDP High Resolution Graphics, detailing the various graphical displays available, including scatterplots, histograms, and more. It outlines the customization options for graphics and the classification of BMDP programs based on their analytical capabilities. Additionally, it provides a brief overview of the features and options available in the BMDP software for data analysis.

Contents

1. Introducing BMDP High Resolution Graphics
  A quick description
  Graphics available in high resolution

2. Three Examples
  Ex. 1. Program 7D: 1-way & 2-way Analysis of Variance with Data Screening
    Selecting options
    Selecting alternative displays
  Ex. 2. Program 6D: Scatterplots with Smoothers and Power Transformations
    Requesting the line of best fit or a smoother
    Selecting a power transformation and applying a LOWESS smoother
  Ex. 3. Program 6D: Can Subpopulations Alter a Linear Relation?

3. Graphics Window Menus
  File
  Select Graph
  Scatterplot
  Histogram
  Shaded Matrix
  Line Plot
  Profile

4. Common Options for Graphs
  Show Statistics
  Show Legend
  Transpose
  Labels …
  Symbols …
  Scales …
  Lines …
  Grids …
  Frame …
  Bins …

5. Program Specific Features and Options
  1D Descriptive Statistics, Frequencies for Categories, and Data Listings
  2D Detailed Data Description including Frequencies
  3D t Tests
  5D Histograms and Univariate plots
  6D Bivariate (scatter) Plots
  7D One- and Two-way Analysis of Variance with Data Screening
  8D Correlations with Missing Data
  9D Multiway Description of Groups
  LE Maximum Likelihood Estimation
  4F Two-way and Multiway Frequency Tables: Measures of Association & Log-linear Models
  CA Correspondence Analysis
  1L Life Tables and Survivor Functions
  2L Survival Analysis with Covariates
  1M Cluster Analysis of Variables
  2M Cluster Analysis of Cases
  3M Block Clustering
  4M Factor Analysis
  5M Linear and Quadratic Discriminant Analysis
  6M Canonical Correlation
  7M Stepwise Discriminant Analysis
  8M Boolean Factor Analysis
  9M Linear scores for preference pairs
  AM Description and Estimation of Missing Data
  KM K-means Clustering of Cases
  1R Linear Regression by Groups
  2R Stepwise Regression
  3R Nonlinear Regression
  4R Regression on Principal Components and Ridge Regression
  5R Polynomial Regression
  6R Partial Correlation and Multivariate Regression
  9R All Possible Subsets Regression
  AR Derivative-free Nonlinear Regression
  LR Stepwise Logistic Regression
  PR Polychotomous Stepwise Logistic Regression
  3S Nonparametric Statistics
  1T Univariate and Bivariate Spectral Analysis
  2T Box-Jenkins Time Series Analysis
  1V One-way Analysis of Variance or Covariance
  2V Analysis of Variance and Covariance with Repeated measures
  3V General Mixed Model Analysis of Variance
  4V Univariate and Multivariate Analysis of Variance and Covariance, including Repeated Measures
  5V Unbalanced Repeated Measures Models with Structured Covariance Matrices
  8V General Mixed Model Analysis of Variance: Equal Cell Sizes

Chapter 1
Introducing BMDP High Resolution Graphics
Most line-by-line printer graphical displays produced by BMDP programs are now available in
high resolution including bivariate scatterplots, dot density displays, histograms, Q-Q (quantile-
quantile) plots, and ROC curves (Receiver Operating Characteristic). Box and whisker plots and
shaded correlation and distance matrices have been added. A more detailed list of displays
follows below.

A quick description
After a BMDP program has run, the BMDP high resolution display appears automatically in the
BMDP Graphics Window. Graphics options available via menus and dialog boxes include:
• Symbol, size, color, font, and line choices. Graph titles and axis labels
are easily customized.
• The shape of a display can be altered by dragging the Graph window frame.
• Options and features can be applied repeatedly to data in a given display. At any
stage the display can be saved in Windows Enhanced Metafile (EMF) format for importing
into Word, or it can be saved in a form for further editing later in the Graphics
Window.

The 44 BMDP programs provide a broad array of analyses. Most programs are identified by a
number followed by a letter like “7D, One- and two-way ANOVA with data screening”. The letters
loosely classify the programs into series:
D - data description
F - frequency tables and log linear models
R - regression analysis
V - analysis of variance
M - multivariate analyses
L - life tables and survival analysis
S - nonparametric statistics
T - time series
A few program names start with a letter like KM for K-means clustering or LR for logistic
regression, and don’t fit this naming scheme. A detailed description of analyses and features
available in each program is found under Help on the BMDP statistical software main menu.

Graphics available in high resolution
Here is a list of high resolution displays and the programs that produce them. Some displays are
illustrated in Chapter 2 and Chapter 5. A full description of program analyses and features is
found on BMDP’s main menu under Help. In the list below, the programs that produce each
display follow the colon.
• Data Screening, Within Group Distributions, and Support for Analysis of Variance
- Histogram: 2D, 7D
- Dot density plot by group: 7D
- Box and whisker plots: 2D, 7D
- Bar chart of means with standard errors: 7D
- Stacked histograms (showing group membership): 7D
- Cumulative histogram, cumulative frequency distribution plot,
normal probability plot, and half normal probability plot: 5D
- Q-Q (quantile-quantile) plot: 3D
- Row and Column profiles for frequency table data: CA
- Scatterplots of data and/or computed values: 30 programs
- Box-Cox diagnostic plot for a variance stabilizing transformation: 7D, 1R
- Miniplots of cell means for factorial or repeated measures designs: 9D, 2V
• Regression
- Plots of (1) residuals and residuals squared against predicted values,
(2) the dependent variable, fitted values, and residuals against the
independent variable, and (3) normal probability and detrended normal
probability plot of the residuals: 1R, 2R, 3R, 4R, 5R, 6R, 9R
- Plots of regression diagnostics: 2R, 9R
- Partial residual plot: 2R
- Confidence curves for nonlinear regression parameters (Cook-Weisberg): 3R
- Correct and incorrect classifications as a function of cutpoints on
computed probabilities for logistic regression: LR
- Histograms of predicted probabilities of each group: LR
- ROC curve (Receiver Operating Characteristic): LR
• Discriminant Analysis, Cluster Analysis, and Factor Analysis
- Scatterplot of the first two canonical variables in discriminant analysis: 7M
- K-means Cluster Profile display (variable means with std. deviations): KM
- Shaded distance or correlation matrices: 1M, 2M, AM, 4M, 6M, 6R, 9R
• Life Tables and Survival Analysis
- Survivor function, cumulative survivor function, log of the
survivor function, hazard function, cumulative hazard function,
and death density function: 1L, 2L
• Time Series
- Time series snapshot (moving trimmed means vs. time): 1T
- Lagged scatterplots: 1T
- Complex demodulation (amplitude and phase) of a time series: 1T
- Plotted periodograms, covariance functions and log spectrum vs. frequency: 1T
- Confidence bands about estimated spectral density: 1T
- One or several time series in one frame or separate frames: 1T

Chapter 2
Three Examples
Example 1. Program 7D: Analysis of Variance with Data Screening
The problem in this example is to screen income (in US dollars) for a one-way analysis of
variance. Income is grouped by level of education—high school dropout, high school
graduate, some college, and college graduate. Community survey data shown in BMDP
Manual Output 7D.2 are used—income is rescaled to match statistics found on the web in
2008. For the ANOVA analysis, here are the BMDP instructions generated via menus and
dialog boxes.

Program 7D’s default display, the “dot density” plot, appears in the BMDP Graph Window
following the text results produced by the BMDP instructions. Group means are marked by a star.

Selecting options
We now add descriptive statistics for each group and modify the title and y axis label. Note
the menu bar at top of the Graph Window—here it is enlarged:

Menu items in bold are available for the Dot Density display. Use File to Save and Print the
current display or Open a previous graph. Use Select Graph to identify which display you
want when several are generated during one computer run.

Starting with the dot density plot shown above, Options on the Histogram menu are used
to add descriptive statistics below each group and modify plot labels. The statistics are
produced by the Show Statistics option and the labels are altered using Title, Font Size, and
Color on the Labels dialog box.

[Figure: dot density plot titled "Income in 2008 by Education"; y axis "Income in 2008";
groups No_Grad, HS_Grad, Some_Col, Col_Grad. The statistics below appear under the groups.]

Groups      No_Grad  HS_Grad  Some_Col  Col_Grad
N                66      114        48        66
Mean          29.12    41.29     52.23     63.71
SD            13.94    13.67     18.45     17.62
SE             1.72     1.28      2.66      2.17
Min            4.00    10.00     29.00     37.00
Max           67.00    90.00     90.00    100.00
A star marks the group's mean.
A square represents more than 1 case.
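The SE values reported for the four education groups are consistent with the usual standard error of the mean, SD divided by the square root of N. The manual does not state the formula, so the Python sketch below is only a check under that assumption; the function name is ours and is not part of BMDP.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean, assuming SE = SD / sqrt(N)."""
    return sd / math.sqrt(n)

# (SD, N) pairs as reported in the Example 1 statistics.
groups = {
    "No_Grad": (13.94, 66),
    "HS_Grad": (13.67, 114),
    "Some_Col": (18.45, 48),
    "Col_Grad": (17.62, 66),
}

for name, (sd, n) in groups.items():
    print(f"{name:9s} SE = {standard_error(sd, n):.2f}")
```

Rounded to two decimals, the results agree with the SE row of the display.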

Here’s how to use the Labels dialog box below to make the above changes. To add the more
informative title and print it in blue, click on the Title line in the dialog box and type the title in
the yellow box under the Text heading. Then drag the blue Color slider all the way to the
right (The RGB number now appears in the box next to the title). Many colors are possible by
moving the red, green, and blue sliders to form combinations of these colors. The graph title
is printed in Bold Size 28 and the y axis label in Bold Size 15. (The latter was done by first
highlighting the Y Axis line.)
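As an illustration of how the three sliders combine, here is a small Python sketch of additive red, green, and blue mixing. It is not BMDP code. The 0 to 255 slider range and the hexadecimal notation are assumptions made for the example; the manual does not say how the RGB number in the dialog box is encoded.

```python
def slider_color(red=0, green=0, blue=0):
    """Combine three slider positions into an (R, G, B) triple.

    Assumption: each slider runs from 0 (far left) to 255 (far right).
    """
    for level in (red, green, blue):
        if not 0 <= level <= 255:
            raise ValueError("slider position must be between 0 and 255")
    return (red, green, blue)

def as_hex(color):
    """Write an (R, G, B) triple in the common #RRGGBB notation."""
    return "#{:02X}{:02X}{:02X}".format(*color)

title_blue = slider_color(blue=255)        # blue slider all the way right
yellow = slider_color(red=255, green=255)  # red plus green gives yellow
purple = slider_color(red=128, blue=128)   # red plus blue gives a purple

print(as_hex(title_blue), as_hex(yellow), as_hex(purple))
```

Intermediate slider positions give the in-between shades mentioned above.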

Selecting alternative displays
The Histogram menu has three items: Density, Bar Chart, and Options.
From Density, in addition to Dot plot, you can select other displays of the same data:
• A Box plot with the group median, 25th and 75th percentiles, and robust
identification of outliers
• Side-by-side Histograms
• A Stacked histogram where group membership is identified by colors within each
bar

From Bar Chart, select:
• A bar chart of group means with standard errors
• A bar chart of group sample sizes (counts)
Here is a selection of displays obtained using the 7D Histogram menu. Note the 7D program
is not rerun—these displays are obtained by using menu items and options in the open
Graphics Window. Thus, the same data are used in each graph, so you can quickly vet which
view best characterizes group differences.

[Figure, three panels: "Box Plots", "Means with Standard Errors", and "Sample sizes (cell counts)",
each for the groups No_Grad, HS_Grad, Some_Col, Col_Grad. A star marks the group's mean.]

[Figure, two panels: "Stacked Histogram" (Count vs. Income in 2008, bars colored by group) and
"Dot Plot Transposed" (groups vs. Income in 2008). A star marks the group's mean; a square
represents more than 1 case.]

In the Graph Window, the shape of a graph can be changed by simply dragging the window
frame. This was done for these displays. The graphs shown are available with or without
group descriptive statistics (See Options menu).

Graphs can be printed directly from the Graph window using Print on the File menu. Graphs
can also be saved as an EMF (Enhanced Metafile) using Save on the File menu and
imported into Word as a picture ready for quick resizing. A display can also be saved as a
BGF (BMDP Graphics File), which can later be read into the BMDP Graphics Window
for further editing.

Example 2. Program 6D: Scatterplots with Smoother and Power Transformation
The problem in this example is to screen the bivariate relation between 1990 population in
millions for 57 countries and the population projected for each country by the UN for 2020.
That is, we ask whether the data are appropriate for estimating a correlation between these
two quantitative measures. Are there outliers? Should the data be transformed?
The instructions for running program 6D should include:
/ PLOT YVAR = pop_2020.
XVAR = pop_1990.
After 6D is executed from the main BMDP window, the default scatterplot appears in the
Graphics Window.
[Figure: scatterplot "(1) POP_2020 VS. POP_1990"; x axis POP_1990, y axis POP_2020.
A callout marks one point: clicking on it opens a message box.]

The distribution of points in the bivariate point cloud on the graph above is far from ideal for
computing a correlation—its shape is not like that of an American football with points falling
symmetrically across the area. Points for three countries straggle upward away from those for
the other countries. To identify the country with the largest projected population, click on the
point to find its coordinates. They are 152.5 (1990) and 269.1 (2020), which identifies the country as Brazil.

Requesting the line of best fit or a smoother


Before exploring power transformations to make the data more suitable for computing a
correlation, common scatterplot enhancements are illustrated that add better labels and the
line of best fit. Scatterplot features and options are found under Scatterplot on the Graphics
Window.

• The plot title's text, size, and color are changed using Labels under Options.
This dialog box is the same as that for Histogram shown in Example 1.
• For the line of best fit, select Linear under Smoother. In the graph on the right below,
the size of the residuals (from program 2R) across the predicted population values
should be fairly equal, but it is not: for countries with small predicted values, the
spread is smaller than that for countries with larger values.
• Note in the default plot above, the range of the y axis is 250, while that of the x axis is
much smaller. The two variables have the same units, so in the plot below we dragged
the right side of the window for some improvement.
[Figure, left: "Projected 2020 Population vs. 1990 Population"; x axis 1990 Population, y axis
2020 Population. Right: "Residuals vs. Predicted Values"; x axis PREDICTD, y axis RESIDUAL.]

Selecting a power transformation and applying a LOWESS smoother


Statisticians frequently transform a variable in order to make the shape of a distribution
symmetric or to stabilize variances across groups. It takes just a few clicks in BMDP Graphics
to apply a power transformation to values on a plot scale without stopping to transform
values in the data file. For example, y^½ is y raised to the ½ power, and y^2 is y squared (y raised
to the power 2). Typing 0.5 as the power in the Scatterplot/ Options/ Scales dialog box plots
the data on a square root scale; typing 0 gives a log scale; and so on. Add a LOWESS smoother
when you are unsure of the shape of the relation between the two variables. This gives a
quick interactive way to see which transformation best meets the necessary assumption of
linearity.
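The power convention described here (1 for original units, 0.5 for square root, 0 for log) can be sketched in a few lines of Python. This is an illustration of the idea, not BMDP's implementation. The function name is hypothetical, and the use of base 10 logarithms is an assumption, since the manual does not say which base is used.

```python
import math

def power_scale(value, power):
    """Transform a positive value for plotting on a power scale.

    power = 1   -> original units
    power = 0.5 -> square root scale
    power = 0   -> log scale (base 10 assumed here)
    """
    if value <= 0:
        raise ValueError("power and log scales need positive values")
    if power == 0:
        return math.log10(value)
    return value ** power

for population in (1, 10, 100):
    print(population, power_scale(population, 0.5), power_scale(population, 0))
```

Because only the plot scale changes, the values in the data file stay in their original units.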
In the left plot below, we return to the last 6D plot in the Graph Window and request that the data
on the y-axis be displayed on a square root scale and the x-axis values on a log scale. The
LOWESS smoother added to this plot shows the relation is not linear. For the right
plot, the y-axis scale is changed from square root units to a log scale, resulting in the desired
linear relation, although the country with the lowest population values, Barbados, is an outlier.

[Figure, left: "2020 Pop (sqrt) vs. 1990 Pop (log)"; y axis Sqrt (2020 Population), x axis
Log (1990 Population). Right: "2020 Pop (log) vs. 1990 Pop (log)"; y axis Log (2020 Population),
x axis Log (1990 Population).]

If Show Statistics had been selected from Scatterplot/ Options menu, the bottom of the
log-log plot would look like this:

Example 3. Program 6D: Can Subpopulations Alter a Linear Relation?
In this example, for two pairs of variables, the regression line is displayed for the complete
sample and then separately for each subpopulation. For one pair of variables, the correlation
within each subgroup is stronger than that for the complete sample, and, for the other pair of
variables, the opposite is true. The Fisher data for 150 iris flowers are used—the length (L)
and width (W) of sepals and petals are recorded for each flower plus its species: setosa,
versicolor, or virginica.
Here are plots of length versus width for the sepals and for the petals. In each plot the line of
best fit is drawn using Scatterplot/ Smoother/ a+b*x [Linear] and the correlation and other
statistics are obtained using Options/ Show Statistics. For the sepals, the correlation
between length and width is slightly negative, close to zero (-0.118), and for the petals it is 0.963.
[Figure, left: "SEPAL_L vs. SEPAL_W". Right: "PETAL_L vs. PETAL_W". Each panel shows the line of
best fit and the statistics below.]

                   SEPAL_L vs. SEPAL_W       PETAL_L vs. PETAL_W
N                  150                       150
R                  -0.118                    0.963
X: Mean, St.Dev.   30.573, 4.3586            11.993, 7.6223
Y: Mean, St.Dev.   58.433, 8.2806            37.580, 17.652
Line               Y = 65.262 - 0.2233*X     Y = 10.835 + 2.2299*X
RMS                68.078                    22.868
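The fitted lines reported with Show Statistics can be checked against the other summary statistics: for a least-squares line, the slope equals R times the ratio of the Y and X standard deviations, and the line passes through the two means. The Python sketch below applies that standard identity to the reported iris values. It is our own check, not BMDP code, and small differences are expected because R is printed to only three decimals.

```python
def line_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Least-squares intercept and slope recovered from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Sepals: X = SEPAL_W, Y = SEPAL_L (values as reported in the display).
sepal = line_from_summary(-0.118, 30.573, 4.3586, 58.433, 8.2806)
# Petals: X = PETAL_W, Y = PETAL_L.
petal = line_from_summary(0.963, 11.993, 7.6223, 37.580, 17.652)

print("sepal: Y = %.3f %+.4f*X" % sepal)
print("petal: Y = %.3f %+.4f*X" % petal)
```

Both recomputed lines fall close to the reported equations, Y = 65.262 - 0.2233*X and Y = 10.835 + 2.2299*X.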

Next, lines and symbols are defined individually for each species of flower. Here are the
results. An explanation of how to do this follows. For the Setosa group (blue), the relation
between length and width of the sepals is strongly positive (r= 0.74), not negative as it was for
all species combined. For the petals, the correlation for the Setosa group is much lower (0.33)
than that for all species combined (0.96).
[Figure, left: "SEPAL_L vs. SEPAL_W with Groups". Right: "PETAL_L vs. PETAL_W with Groups".
Legend in each panel: Setosa, Versicolor, Virginica.]

To display different symbols, lines, and/or colors in a high resolution scatterplot, Group
Codes or Cutpoints must be defined in the BMDP program 6D instructions. Plots can be
requested for all groups together or for each group separately. In the BMDP run, these
instructions were included:
/ Group variable is IRIS.
Codes (IRIS) = 1 TO 3.
Names(IRIS) = SETOSA, VERSICOL, GVIRGINIC.
/ Plot Yvar = SEPAL_L, PETAL_L.
Xvar = SEPAL_W, PETAL_W.
Group=ALL. Group=SETOSA.
The names for two species begin with “V”, so a “G” was inserted to make the names unique.
After program 6D runs and the default scatterplot is in the Graph Window:
• Select For each group from the Scatterplot menu to indicate that symbols, smoothers
(e.g., lines, curves), enhanced group names, and/or line type will be specified for one
or more groups.
• Group names, symbols, and colors are specified in the Symbols box under Options.
To add each group name and color its symbol, click on a line under Group and type
the name in the yellow box. Here, for the Virginica group, the red Color slider is
dragged all the way to the right (the RGB number now appears in the box next to the
group name). Many colors are possible by moving the red, green, and blue sliders to
form combinations of these colors. While the yellow highlight remains, select a symbol
for the group from the drop-down list under Symbol in the middle of the box.
• To include the group name with its associated symbol in a legend, select Show legend
under Options.
• To draw a line of best fit for each group, select Scatterplot/ Smoother/ a+b*x
[Linear].
• The correlations reported for the Setosa group were obtained using Show Statistics
on a plot of the Setosa group alone (not shown).

Chapter 3
Graphics Window Menus
The menus and submenus of the main BMDP Graphics Window are displayed and described
in this chapter. Common Options follow in Chapter 4.

File Menu

• Open provides access to folders where previously made BGF plots are stored.
• The current display in the graphics window can be Saved As a BGF file, which can be
read into the Graphics Window at a later time for further editing, or as an EMF file that
is easily inserted in Word.

Select Graph Menu

• Click Select Graph to pick a graph to display from the list of graphs produced in the
current BMDP run.

Scatterplot Menu
Scatterplot menu features and options are illustrated in Chapter 2, Examples 2 and 3.

• Smoother provides 8 types of lines/curves that can be drawn on a scatterplot (see
Chapter 2, Ex. 2).
• Click For each group to draw a smoother individually for each type of symbol of a
grouping variable (defined in BMDP instructions) or generated by a program (e.g.,
Observed and Predicted).
• For Options see Chapter 4, Common Options.

Histogram Menu
The use of Histogram is described in Chapter 2, Example 1.

• Density provides Dot plot, Box plot, and Histogram. Dot plots and histograms may
be Stacked.
• Bar Chart draws bar charts of Counts and Means.
• For Options see Chapter 4, Common Options.

Shaded Matrix Menu


These options are used for shaded distance and correlation matrices found in programs
1M, 2M, AM, 4M, 6M, 6R, and 9R.

• For Labels, see Common Options in Chapter 4.
• Four Color choices are available. Gray scale provides greater resolution than the
color options and is useful for black & white printing.
• Click Range to set the minimum and maximum of the correlations or distance
measures shown. For example, if the smallest correlation is 0.59, you wouldn’t want the scale
for 3 colors to range from zero to 1.0.
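To see why the Range setting matters, consider the Python sketch below, which assigns a value to one of a fixed number of shades by splitting the chosen range into equal-width intervals. Equal-width intervals are an assumption made for illustration; the manual does not describe how BMDP assigns shades, and the function name is hypothetical.

```python
def shade_index(value, low, high, n_shades):
    """Return the shade (0 = lightest) for a value, using equal-width intervals."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)        # keep the value inside the range
    fraction = (clipped - low) / (high - low)   # position within the range, 0..1
    return min(int(fraction * n_shades), n_shades - 1)

correlations = [0.60, 0.80, 0.95]

# Range 0 to 1.0: the lightest shade is never used for these values.
print([shade_index(r, 0.0, 1.0, 3) for r in correlations])
# Range 0.59 to 1.0: all three shades are used.
print([shade_index(r, 0.59, 1.0, 3) for r in correlations])
```

With the range narrowed to start at 0.59, the three shades separate the correlations instead of leaving one shade unused.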

Line Plot Menu


The line plot is specific to program 2D and displays descriptive statistics along a line that
might be the bottom of a histogram. See example under 2D in Chapter 5.

• Labels controls how names of statistics are written in the display.
• Lines applies to lines from the plot frame to the name of each statistic.
• Display is used to select which statistics to include on the line.

Profile Menu
This display is specific to program KM. Separately for each cluster it displays the mean and
standard deviation of the variables within the cluster. See example under KM in Chapter 5.

• A name can be assigned to each cluster in the Labels & Symbols box. See Common
Options in Chapter 4 for more information.
• Vertical Lines controls the color and thickness of the grand mean line drawn for each
cluster.
• Horizontal Lines controls the color, thickness, and style of lines drawn to show the
spread (standard deviation) of each variable in each cluster.

Chapter 4
Common Options for Graphs
BMDP provides several options for fine tuning and customizing graphical displays. For
example, you can add a title or longer axis labels, select plot symbols and their color, log or
power transform plot scales, and more. You can repeatedly modify options to achieve the
desired look for your graph. Options specified in a dialog box are listed following those that
provide a single toggle-like instruction.

Show Statistics - Provides descriptive statistics in Histogram and Scatterplot
displays and, for the latter, also the equation of a linear, quadratic,
or cubic regression line. See Examples 1 and 2 in Chapter 2.
Show Legend - Identifies group membership associated with graph symbols and
colors. See Example 3 in Chapter 2.
Transpose - Switches the position of the x and y axes, or turns a display on
its side.
Labels … For Histograms and Scatterplots, see Example 1 in Chapter 2.

• Click on a line under Text to highlight it in yellow and then type the respective
plot Title or Y axis label. For Scatterplots, also type an X Axis label.
• Specify the Font, Size, and Color for each label.
• The Font may be written in Bold, Italic, and/or Underlined.

There are several variations of the Labels dialog box.

For the 2D Line plot display, here is the Labels dialog box:

• Specify an X Axis label and, in the next line, control how names of statistics are
written in the display. This example shows they will be written in blue with bold
Arial font size 13.

For Shaded Matrix correlation or distance displays,

• The Title, Text, Font, Size, and Color are as described above.
• Y Axis and X Axis, Font, Size, and Color apply to the display’s row and column
variable names.

For the KM Cluster Profile display, note labels and symbols are combined.

• Color, Size, Font, and Style of the Variable Names and Group (Cluster) Names
can be changed.
• Lines below Group Names are used to (1) type a name for each cluster and (2)
control the Color, Size, and Font of the mean symbol. In the box above, the first
cluster is named “F Depressed”.
• Under Apply to every group’s symbol the current, the Color box is checked,
so the mean symbols for variables in all clusters will be red. The size of the
symbol, however, will be 5 for the “F Depressed” cluster.

Symbols … see Example 3 in Chapter 2.

• Enhance the name of each Group.
• Change plot Symbols. Choices are a Circle, Star, Square, Pentagon, Hexagon,
Diamond, and four Triangle variants. You
may also use a letter by typing it in the box at the top of the symbol list or use a
Wingding as the Font.
• Change symbol Size and Color.
• Change group colors for the program 7D alternative displays.

Scales … has one box for Scatterplots and another for Histograms.
For a Scatterplot this box appears:

• Set Minimum and Maximum scale limits.
• Specify Number of Ticks on the y axis. E.g., 6 ticks define 5 intervals. Futzing
with the number of ticks and min and max values may produce “nicer” scale
numbers or, for negative and positive values, ensure that “0” is a tick.
• Specify a power transform to alter the plot scale units: e.g., 1 for original units,
0.5 for square root, and 0 for a log scale. See Example 2 in Chapter 2 for
examples of a square root scale and a log scale.
• Specify where scales should be displayed.

For a Histogram display (histogram, box plot, or dot density display) see Example 1 in
Chapter 2.

 Set Minimum and Maximum scale limits


 Specify Number of Ticks on the y axis. E.g., 6 ticks defines 5 intervals. Futzing
with the number of ticks and min and max values may produce “nicer” scale
numbers or, for negative and positive values, ensure that “0” is a tick.
 Specify a power transform to alter the plot scale units: e.g., 1 for original units,
0.5 for square root, and 0 for a log scale.
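
A minimal sketch of the power values listed above, assuming a plain power transform in which a power of 0 is read as the natural log. The manual does not say whether BMDP rescales the result or which log base it uses, and `power_scale` is a hypothetical name.

```python
import math


def power_scale(value, power):
    """Map a positive value to transformed plot-scale units."""
    if value <= 0:
        raise ValueError("power and log scales need positive values")
    if power == 0:
        return math.log(value)   # a power of 0 is read as a log scale
    return value ** power        # 1 keeps original units, 0.5 is a square root


for p in (1, 0.5, 0):
    print(p, [round(power_scale(v, p), 3) for v in (1, 4, 100)])
```

Powers below 1 pull in large values more than small ones, which is why they help with right-skewed data.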

Lines …

In Scatterplot displays, use


 Line Thickness and Style for a line specified by Smoother. Line Styles are Solid,
Dots, Dashes, Dot/dash, and Dot/dot/dash.
 Color for a single line requested via a Smoother. When For each group is also
specified, the color of each line is determined by the Color set for each group in
Symbol. Line color matches symbol color.

In Histogram displays,
 Thickness changes box plot lines and error bars on the bar chart of means.
 Line options do not apply to histograms.
In Line plot displays,
 Thickness and Style apply to lines from the plot frame to the statistics name.
Color does not apply.
Grids …
For Scatterplot displays (1) a vertical dotted line may be placed at each tick on the X axis
and/or a horizontal line at each Y tick or (2) grid lines may be placed at two user specified
positions on the X and/or Y axes.

Frame …
For a Scatterplot display,

 The plot frame can be colored. Note here the Red, Green, and Blue sliders are
positioned to create the color yellow.
 Frame Thickness changes the thickness of the frame. “2” is heavier than “1”.
 Frame at is used to specify which sides of the plot have a frame.
 Select Indent X axis or Indent Y axis to indent plot points so none are jammed
against the frame.

For a Histogram display (histogram, box plot, and dot density display)

 The plot frame can be colored. Note here the Red, Green, and Blue sliders are
positioned to create the color purple.
 Frame Thickness changes the thickness of the frame. “2” is heavier than “1”.

Bins …
For a Histogram display (histogram and dot density display)

 Specify the Number of Bins.


 Specify the Number of Dots per Bin. When there are more cases than room in
the bin allows, a square replaces the last symbol in the Dot density plot.
 To allow each dot to represent more than one case, specify the Number of Cases
per Dot. This can be helpful for large samples.
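
A minimal sketch of how the two dot settings interact, assuming the number of dots is the case count divided by Number of Cases per Dot and rounded up. The rounding rule and the name `dot_column` are assumptions; the manual does not spell them out.

```python
import math


def dot_column(cases_in_bin, dots_per_bin, cases_per_dot=1):
    """Return (symbols_drawn, overflow) for one bin of a dot density display."""
    dots_needed = math.ceil(cases_in_bin / cases_per_dot)
    overflow = dots_needed > dots_per_bin      # more cases than the bin has room for
    symbols_drawn = min(dots_needed, dots_per_bin)
    return symbols_drawn, overflow             # on overflow the last symbol is a square


print(dot_column(12, dots_per_bin=10))                   # bin is too small
print(dot_column(12, dots_per_bin=10, cases_per_dot=2))  # each dot stands for 2 cases
```

Raising the cases per dot is what removes the overflow in the second call.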

Chapter 5
Program Specific Features and Options

In this section we give a brief overview of BMDP programs ordered by their two character
identifiers. The symbol  marks comments about high resolution graphics. For
more details about program analyses and features, see Help on the BMDP main menu.

 1D Descriptive Statistics, Frequencies for Categories, and Data Listings


No high resolution graphics

 2D Detailed Data Description including Frequencies


This program provides the most detailed description of a variable. Its statistics and
features are available for all cases or for each category of a grouping variable. A stem and
leaf display is available as text output.
 This is the best program to obtain a high resolution histogram. It also has a line plot
showing where descriptive statistics fall across the range of the distribution.
[Figure: Histogram of Population Density. X axis: Population Density (0 to 10000);
Y axis: COUNT. A star marks the group's mean.]

For the same data, here is a line plot that displays 2D’s estimates of location on a scale
that could be placed at the bottom of the histogram:

 3D t Tests
For 2-sample, paired, or one-sample designs, classic t tests are printed by default. Crude
histograms are printed with each test. When outliers or distributional problems are a
concern, trimmed t tests and nonparametric tests are available.
 Q-Q (quantile-quantile) plot to compare the distributions of two groups or two
variables. When the plot points follow a straight line, the two distributions have the same
shape; when the points also fall on the line y = x, the distributions are the same.
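
To make the Q-Q construction concrete, here is a minimal sketch that pairs matching quantiles of two samples. The function names and the plotting positions (i + 0.5)/n are illustrative assumptions; the manual does not state which positions 3D uses.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def qq_points(sample_a, sample_b, n_points=None):
    """Pair matching quantiles of two samples; each pair is one Q-Q plot point."""
    a, b = sorted(sample_a), sorted(sample_b)
    n = n_points or min(len(a), len(b))
    probs = [(i + 0.5) / n for i in range(n)]
    return [(quantile(a, p), quantile(b, p)) for p in probs]


# Made-up data: the second group is a shifted and rescaled copy of the first.
group_a = [3, 1, 4, 1, 5, 9, 2, 6]
group_b = [2 * v + 3 for v in group_a]
for x, y in qq_points(group_a, group_b):
    print(round(x, 3), round(y, 3))
```

In this example the points lie on the straight line y = 2x + 3: the two groups share a shape but differ in location and spread.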

 5D Histograms and Univariate plots


5D prints displays for grouped or ungrouped data.
 For high resolution histograms see programs 2D or 7D. In 5D, high resolution displays
include a cumulative histogram, cumulative frequency distribution plot, normal probability
plot, detrended normal probability plot, and half normal plot.

 6D Bivariate (scatter) Plots


6D plots one variable against another and calculates the correlation, p-value, and
equation for the line of best fit. Group membership can be identified within a single frame
or data for each group can be plotted separately.
 Smoother on the Scatterplot menu provides 8 types of lines that can be drawn on a
scatterplot. See Chapter 2, Example 2, about transforming the plot scale to see the effect
of a power transformation and Example 3 for the influence of subpopulations on a linear
relation.

 7D One- and Two-way Analysis of Variance with Data Screening


By default, you get standard one-way or two-way analysis of variance results. Features
help you identify outliers, skewed distributions, unequal variances, and other anomalies.
 The default within-group high resolution dot plot is useful to scan for outliers and
skewed distributions. Alternative displays include box plots, a bar chart of means with
standard errors, histograms, and a stacked histogram. Descriptive statistics can be
displayed below each display. A Box-Cox plot is available for determining a power
transformation to stabilize group variances.
 For 7D high resolution graphical displays, see Example 1 in Chapter 2.

 8D Correlations with Missing Data


No high resolution graphics

 9D Multiway Description of Groups


This program is used to compute descriptive statistics and display plots of cell means for
data classified into cells by one or more grouping variables.
 9D plots means from (1) a factorial design with two or more factors, (2) a repeated
measures ANOVA design, or (3) two or more variables simultaneously.
Example. Here are means for income grouped by marital status, religion, and education
(for x-axis names for education groups, see Example 1 in Chapter 2). The religions are
Jewish (J), Catholic (C), Protestant (P), and None (N). Note sample sizes are very small
for divorced Jews and Catholics.

[Figure, three panels: Never Married, Married, and Divorced. Each panel plots mean
Income in 2008 (0 to 80) against Education (levels 1 to 4); the plot symbols J, C, P, and N
identify religion.]

 LE Maximum Likelihood Estimation


This program estimates the parameters that maximize the likelihood function using the
iterative Newton-Raphson algorithm. The user defines the density function or the natural
log of the density function. Given the data, the program computes analytically exact first
and second derivatives to estimate the parameters that make the density function best fit
the observed data.
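
As an illustration of the Newton-Raphson iteration, here is a minimal sketch for one parameter. It assumes an exponential density and hand-coded derivatives, whereas LE derives the derivatives from the density the user supplies; the function name, data, and starting value are hypothetical.

```python
def exponential_rate_mle(data, start, tolerance=1e-10, max_iterations=50):
    """Newton-Raphson for the rate r of the density f(x) = r * exp(-r * x)."""
    n, total = len(data), sum(data)
    rate = start
    for _ in range(max_iterations):
        first = n / rate - total      # first derivative of n*log(r) - r*sum(x)
        second = -n / rate ** 2       # second derivative of the log likelihood
        step = first / second
        rate -= step                  # Newton-Raphson update
        if abs(step) < tolerance:
            break
    return rate


# Made-up data with mean 2, so the estimate should approach 1 / 2.
print(exponential_rate_mle([1.0, 2.0, 3.0], start=0.1))
```

For this density the answer is known in closed form (one over the sample mean), which makes it easy to confirm that the iteration converges. A poor starting value can make Newton-Raphson diverge.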
 Example. Below we show that a user-defined distribution (normal) is not appropriate
for the depression score TOTAL from the survey data. Both displays show the observed
distribution is skewed to the right. In the left plot, the observed points (small circles) are
not found on the left side of the peak of the density function. In the right plot of the rank vs.
the cumulative density, it is also evident that the wrong function was used since the points
depart markedly from a straight line.
[Figure, two panels: (left) Normal Density Function at each TOTAL Score Value, with
TOTAL Score on the x axis and Normal Density Function on the y axis; (right) (Rank+1)/n
vs. Cumulative Density.]

 4F Two-way and Multiway Frequency Tables: Measures of Association & Log-linear Models
Program 4F crosstabulates and analyzes data in frequency tables. 4F can (1) form
frequency tables from the usual cases-by-variables data file or from data recorded as cell
frequencies and save the tables, (2) compute more than two dozen statistics or measures
of association for two-way tables, and (3) analyze multiway tables using the log-linear
model.
No high resolution graphics
 CA Correspondence Analysis
 Correspondence analysis is an exploratory multivariate technique that converts
frequency table data into a graphical display in which row and column categories are
depicted as points. Simple correspondence analysis involves two categorical variables
and a graphical display of the corresponding two-way frequency table. Multiple
correspondence analysis is an extension of this problem to three or more categorical
variables and resembles a principal component analysis for categorical variables.
Example: suicide method by age group and sex. From a 1985 study of more than 52,000
suicides in West Germany, data were recorded in a 34 x 8 frequency table: 34 sex-age
categories and 8 suicide methods. In such a large table it is hard to understand the
relationship between sex, age, and suicide method by scanning differences and similarities
in row (or column) percentages.
In the CA plot of row and column profiles below, the name of each of the 17 male row points
begins with M followed by the lower end of its age interval, while the names of the 17 female
row points begin with F. Note that the male row points line up on the left side and the female
points on the right, indicating a clear sex difference in profiles. The vertical dimension shows
a rough ranking by age; at the top left, however, the point for males aged 10-15 years departs
from this ordering (and has a smaller sample size). Suicide by hanging and knives is associated
with older males; guns, toxic gas, and cooking gas with younger males; poison with young
females; drowning with older females; and jumping with old and young females. Points on
Axis 1 account for 52% of the information in the table and those on Axis 2 for 38%, so the
two axes together represent 90% of the table.

[Figure: Row and Column Profile for Suicide Data. AXIS 1 (x) vs. AXIS 2 (y), each from
-0.8 to 0.8. Row points are labeled M_10 to M_90+ and F_10 to F_90+; column points are
labeled with suicide methods (e.g., HANGING, DROWNING, GUNS, POISON).]

 1L Life Tables and Survivor Functions
1L uses either the product-limit (Kaplan and Meier) or the actuarial life table (Cutler and
Ederer) method to estimate the survivor function. If subjects are separated into treatment
groups, the survivor function can be estimated for each and tested for equality. Mantel-
Cox, Breslow, Tarone-Ware, and Peto-Prentice tests are available.
 Plots include cumulative survivor function, log of the survivor function, hazard function,
cumulative hazard function, and probability density function.
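
The product-limit method named above can be sketched in a few lines. This is the textbook Kaplan-Meier estimator, not BMDP's code; the function name and the small data set are made up for illustration.

```python
def product_limit(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each distinct event time.

    events[i] is 1 if the response (e.g., death) was observed at times[i]
    and 0 if the case was censored at that time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk   # multiply in this time's survival fraction
            curve.append((t, survival))
        at_risk -= leaving    # deaths and censored cases both leave the risk set
    return curve


# Hypothetical survival times in months; 0 marks a censored case.
curve = product_limit([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print(curve)
```

Censored cases never lower the curve themselves, but they shrink the risk set used at later event times.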

 2L Survival Analysis with Covariates


2L analyzes survival data for which the time-to-response (survival time) is influenced by
covariates. Two analyses are available: the Cox proportional hazards regression model,
which presumes failure (death) rates may be modeled as log-linear functions of covariates,
and the accelerated failure time or “log-linear” model. Data may be stratified
into groups with a survival function estimated for each. Time-dependent covariates may
be used.
 Plots of the survivor function, the log minus log survivor (log cumulative hazard)
function, and residuals against covariates.
Example. The data in these plots are from a heart transplant study reported in 1977. The
proportionality assumption requires that the ratio of hazard rates for different levels of an
independent variable be constant over time. When this assumption is violated, some analysts
stratify the data, hoping that cases within each stratum conform. Here the sample is split
into younger and older patients. The difference between the two strata appears constant in
the log cumulative hazard function.
[Figure, two panels: (left) Transplant Study Survivor Function Stratified by Age; (right) Log
Cumulative Hazard Function. X axis in both panels: Months (0 to 40). Strata: Age LE 45 and
Age over 45.]

 1M Cluster Analysis of Variables


The clustering begins by joining the two most similar variables to form a cluster and
continues joining variables or clusters of variables until all variables are in one cluster.
1M provides four measures of similarity for clustering variables and three criteria for
linking or combining clusters. The text output features a tree diagram showing the clusters
formed at each step.
 Shaded correlation matrix with variables ordered by the clustering. Example. Here
Gray scale and 3 colors are selected as Colors for the shaded matrix.

[Figure, two panels, each titled “Absolute Correlation Similarity Measure with Average
Linkage”: shaded correlation matrices with rows and columns ordered CONCENTR, ALERT,
ANNOY, IRRITABL, CONTENT, TENSE, SLEEPY, TIRED, SMOKING1, SMOKING2, SMOKING3,
SMOKING4. Shading legend: 0 to 0.3333, 0.3333 to 0.6666, 0.6666 to 1.]
The data here are from an instrument used in a smoking cessation study. The “smoking”
items concern the subject’s desire in different situations to have a cigarette and were
randomly ordered among questions about the psychological and physical state of the
subject. The smoking items cluster tightly together and have little relation to the other
questions.
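
The joining procedure described for 1M can be sketched as follows. The sketch assumes a similarity matrix (e.g., absolute correlations) and average linkage, the combination named in the figure title above; the function name and the three-variable similarity values are invented for illustration and are not taken from the smoking data.

```python
def cluster_variables(names, similarity):
    """Join the most similar pair of clusters until one cluster remains.

    similarity[i][j] is the similarity between variables i and j.
    Returns the joins in order as (members_a, members_b, similarity_at_join).
    """
    clusters = [[i] for i in range(len(names))]
    joins = []

    def average_link(a, b):
        pairs = [similarity[i][j] for i in a for j in b]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                value = average_link(clusters[x], clusters[y])
                if best is None or value > best[0]:
                    best = (value, x, y)
        value, x, y = best
        joins.append(([names[i] for i in clusters[x]],
                      [names[i] for i in clusters[y]], value))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return joins


# Invented similarities: two closely related items and one unrelated item.
names = ["SMOKING1", "SMOKING2", "TIRED"]
similarity = [[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]]
joins = cluster_variables(names, similarity)
for step in joins:
    print(step)
```

The order of the joins is what a tree diagram records: tight clusters join early at high similarity, and unrelated variables join last.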

 2M Cluster Analysis of Cases


Clustering begins by joining the two most similar cases to form a cluster and continues
joining cases or clusters of cases until all cases are in one cluster. 2M provides eleven
distance measures and joins clusters using either a single, centroid, or k nearest neighbor
linkage algorithm. A tree diagram describing the sequence of cluster formation is available
as text output.
 Shaded matrix of distances between cases (similar to the correlation matrix above)
and a histogram of distance measures.

 3M Block Clustering
This program is appropriate for categorical data and groups subsets of cases into clusters
that are alike for subsets of variables. 3M reorders the rows and columns of the original
data matrix and uses different symbols to identify “blocks” of data. A block symbol
diagram is available as text output.
No high resolution graphics

 4M Factor Analysis
Factor analysis has three objectives: (1) to place variables into factors (groups) such that
each variable is more highly correlated with the others in its factor than with other
variables, (2) to interpret each factor according to the variables belonging to it, and (3) to
compute a score for each factor. Initial factor extraction options are principal components
analysis (PCA), maximum likelihood factor analysis, Kaiser’s second generation little jiffy,
or principal factor analysis. 4M has seven options for factor rotation plus many other
options to control the analysis.
 Scatterplots of rotated and unrotated factor loadings, scatterplots of factor scores, and
a shaded correlation matrix with variables ordered by sorted loadings.
Example. See program 1M for a description of the data used in these shaded matrices:
[Figure, two panels, each titled “Absolute Value of Correlations sorted by Loadings”: shaded
correlation matrices with rows and columns ordered ANNOY, IRRITABL, CONTENT, TENSE,
CONCENTR, SMOKING3, SMOKING4, SMOKING2, SMOKING1, SLEEPY, TIRED, ALERT. Shading
legend: 0 to 0.3333, 0.3333 to 0.6666, 0.6666 to 1.]
 5M Linear and Quadratic Discriminant Analysis
5M performs linear and quadratic discriminant analysis.
No high resolution graphics
 6M Canonical Correlation
Canonical correlation analysis determines the linear relationships between two sets of
variables by finding coefficients for a linear combination of the x variables and another set
of coefficients for the y variables such that the correlation between the two linear
combinations (canonical variables) is maximized. 6M then derives more pairs of canonical
variables that are independent of the previous pairs.
 Shaded correlation matrix for variables in the x set and the y set. Bivariate plots of
variables and canonical variables.

 7M Stepwise Discriminant Analysis


This popular program performs discriminant analysis between two or more groups by
computing linear classification functions in a stepwise manner. 7M provides a jackknifed
classification matrix, the percentage of correct classification, posterior probabilities and
Mahalanobis distances for each case being assigned to each group. Cases not used in
the computations may be classified.
 Scatterplot of the first two canonical variables. When there are only two groups or only
one variable enters the model, within-group histograms, box plots, or a dot density display
are available.

Example. Here are the Fisher iris data used in Examples 2 and 3 in Chapter 2. Note that
in the canonical variable plot, each group mean is marked by its own symbol (see the legend).
[Figure: 2nd Canonical Variable vs. 1st Canonical Variable. Legend: Setosa, Versicolor,
Virginica, Setosa Mean, Versicolor Mean, Virginica Mean.]

 8M Boolean Factor Analysis


8M estimates Boolean factors of dichotomous (binary) data. In this analysis, the arithmetic
used in matrix multiplication is Boolean, so the factor scores and loadings are binary. In
4M, the scores from an analysis of binary data are linear combinations of the data. In 8M,
a case has a score of 1 if it has a positive response for any of the variables dominant in
the factor and zero otherwise.
No high resolution graphics
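
A minimal sketch of the Boolean arithmetic described for 8M. The function names and the tiny binary matrices are hypothetical, and the sketch shows only the scoring rule and the Boolean product, not how 8M estimates the factors.

```python
def boolean_factor_score(case, dominant_variables):
    """Return 1 if the case responds positively on any dominant variable, else 0."""
    return int(any(case[name] == 1 for name in dominant_variables))


def boolean_product(scores, loadings):
    """Boolean matrix product: entry (i, j) is OR over k of scores[i][k] AND loadings[k][j]."""
    n_factors = len(loadings)
    n_variables = len(loadings[0])
    return [[int(any(row[k] and loadings[k][j] for k in range(n_factors)))
             for j in range(n_variables)]
            for row in scores]


# Three cases, two factors, three variables (all values invented).
scores = [[1, 0], [0, 1], [1, 1]]
loadings = [[1, 1, 0], [0, 1, 1]]
print(boolean_product(scores, loadings))
print(boolean_factor_score({"A": 0, "B": 1, "C": 0}, ["A", "B"]))
```

Because the sums are replaced by OR, a case that scores on two factors does not get a larger predicted value than a case that scores on one.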

 9M Linear scores for preference pairs


For each case, 9M constructs a score that is a linear combination of the variables where
the coefficients are based on judgments (preferences) of experts comparing two cases at
a time. The expert does not have to judge all possible pairs of cases. The variables used
in the linear function are determined in a stepwise manner. Input is a data matrix and a
matrix of preferences. Preferences from more than one expert on the same pairs of cases
can be analyzed and scores computed for each judge—and then correlated.
 Scatterplots of variables or scores.

 AM Description and Estimation of Missing Data


AM (1) describes the pattern of missing data, (2) estimates covariance or correlation
matrices by any of three computation algorithms (including an EM algorithm), and (3)
fills in (imputes) missing or out of range values using one of four methods. Data can be
described and estimated within a group.
 Scatterplots. Two plots are printed for each variable: the first shows cases with
acceptable values for both variables and the second, cases with an estimated value for at
least one variable. The absolute value of correlations of indicator variables (present,
missing) is displayed as a shaded matrix.

 KM K-means Clustering of Cases


KM partitions cases into clusters with the result that each case belongs to a cluster whose
center is closest to the case. KM standardizes the data and begins with all data in one
cluster or with user-specified clusters and at each step reallocates cases to the closest
(via Euclidean distance) cluster. There are four options to standardize data. For three of
these, KM allows incomplete data. The user can identify initial cluster membership and
select a categorical variable with which to crosstabulate final cluster membership. KM is
useful for large data sets and it supplies information about the role of individual variables
in the clustering.
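
The reallocation step can be sketched as below, assuming user-specified starting centers and data that are already standardized. KM's own standardization options, its default start from a single cluster, and its handling of incomplete data are not reproduced; the function name and data are made up.

```python
import math


def k_means(cases, initial_centers, max_iterations=100):
    """Reallocate cases to the closest center (Euclidean) until membership settles."""
    centers = [list(c) for c in initial_centers]
    membership = None
    for _ in range(max_iterations):
        new_membership = [
            min(range(len(centers)), key=lambda c: math.dist(case, centers[c]))
            for case in cases
        ]
        if new_membership == membership:
            break                      # no case changed cluster
        membership = new_membership
        for c in range(len(centers)):
            members = [case for case, m in zip(cases, membership) if m == c]
            if members:                # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return membership, centers


cases = [(1, 1), (1, 2), (9, 9), (10, 9)]
membership, centers = k_means(cases, [(0, 0), (10, 10)])
print(membership, centers)
```

Each pass assigns every case to its nearest center and then recomputes the centers, so every case ends in the cluster whose center is closest to it.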
 Scatterplot of the orthogonal projection of cases into the plane defined by the three
most populous clusters, bivariate scatterplots of user selected variables, and a Cluster
profile display—for each cluster the mean of each variable is displayed relative to the
grand mean of all data and the standard deviation is indicated by the length of a
horizontal line through the mean.
Example. Here, using the survey data, is a cluster profile from a default KM run that
identified five clusters. The vertical line in each cluster represents the overall mean of
each variable. The profile display can help the user to characterize and name each
cluster. The TOTAL depression score in the first cluster, “F Depress”, is considerably
above the mean and appears to set the 34 young women in this cluster apart from the
other subjects. For the second cluster, “M $$$ & Educ” (male-income-education), income
and education are greater for these 50 males than for subjects in other clusters. Female
subjects are found in clusters 3 and 4, males in cluster 5. The older females tend to have
lower incomes and less education than the younger females.
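[Figure: Cluster profile display with one column per cluster (F Depress, M $$$ & Educ, F Old,
F Young, M Average) and one row per variable (SEX, TOTAL, INCOME08, EDUCATN, AGE).]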

Bivariate scatterplots in KM are another way to characterize cluster membership.


[Figure: Income08 by Age with Cluster Identification. X axis: Age (10 to 90); Y axis: Income
in 2008 (0 to 100). Legend: F old, M Average, F Young, F Depressed, M $$$ & Educ.]

 1R Linear Regression by Groups


1R estimates a linear regression equation between a dependent (predicted) variable and
one or more independent (predictor) variables. Computations are performed on all cases
or on subsets or groups of cases. For the latter, 1R can test the equality of regression
lines across groups. Box-Cox computations are available to determine a power
transformation to stabilize the variance of the residuals.
 Scatterplots of residuals and residuals squared against predicted values. Plot of
observed(O) and predicted(P) values of the dependent variable against the observed
value of a specified independent variable, plot of the residuals against a specified
independent variable, a partial residual plot, and normal and detrended normal probability
plots of the residuals. When the Box-Cox option is requested, two scatterplots of the
residuals against the predicted value are made—one before the suggested transformation
and the other afterwards.

 2R Stepwise Regression
This popular program fits a multiple linear regression equation in a stepwise manner by
entering or removing one variable at a time from a list of potential predictors. You also can
define sets of variables to enter or remove in a single step and you can force specific
variables to enter and remain in the equation.
Special note. The Caseplot option provides a useful line-by-line text display which, for
each case, features three diagnostics side-by-side: a measure of influence, a measure of
leverage, and a standardized residual.
 Scatterplots of residuals and residuals squared against predicted values. Plot of
observed(O) and predicted(P) values of the dependent variable against the observed
value of a specified independent variable, plot of the residuals against a specified
independent variable, a partial residual plot, and normal and detrended normal probability
plots of the residuals. Added variable plots for variables not yet entered in the equation
(the residual of the dependent variable using already entered variables is on the y axis
and a candidate variable on the x axis). 21 regression diagnostics (measures of
influence, leverage, and residuals) are available for plotting in scatterplots.
Example. For 60 US cities, here are regression diagnostic plots for a model to predict
mortality using rainfall, % nonwhite, education, and SO2 as predictors. On the left, a
measure of influence is plotted against case number. We click on the highest point and
identify Case 37, New Orleans. In the right plot we separate influence into a deleted
standardized residual and a measure of leverage. New Orleans’ data are extreme in the
x-space and it has the largest deleted standardized residual.
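
To show how influence separates into a residual part and a leverage part, here is a minimal sketch for a regression with one predictor. It uses the textbook definitions of leverage, deleted (externally studentized) residual, and DFFITS; BMDP's exact definitions of its diagnostics may differ, and the function name and data are invented.

```python
import math


def simple_regression_diagnostics(x, y):
    """Leverage, deleted standardized residual, and DFFITS for y = a + b*x."""
    n, p = len(x), 2                       # p counts intercept and slope
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    rows = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - mean_x) ** 2 / sxx                  # leverage (hat diagonal)
        s2_deleted = (sse - e ** 2 / (1 - h)) / (n - p - 1)   # error variance without this case
        t = e / math.sqrt(s2_deleted * (1 - h))               # deleted standardized residual
        dffits = t * math.sqrt(h / (1 - h))                   # influence combines both parts
        rows.append((h, t, dffits))
    return rows


# Invented data: the last case is far from the others in x.
x = [1, 2, 3, 4, 10]
y = [1.1, 1.9, 3.2, 3.9, 14.0]
diagnostics = simple_regression_diagnostics(x, y)
for h, t, d in diagnostics:
    print(round(h, 3), round(t, 3), round(d, 3))
```

A case can have a large DFFITS value because its residual is large, because its leverage is high, or both, which is why the right-hand plot shows the two parts separately.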
[Figure, two panels: (left) The DFFITS Influence Measure vs. Case number; (right) Deleted
Standardized Residual vs. HATDIAG Leverage.]

EXTREME CASES IN THE PLOTS --

EXTREME               CASE                   9         3         4         6        10
STATISTICS     VALUE   NO.  LABEL     MORTALTY      RAIN   EDUCATN  NONWHITE   log_so2
DFFITS        2.1827    37  neworlLA 1113.0000   54.0000    9.7000   31.4000    0.0000
DFFITS       -1.0694    32  miamiFL   861.4000   60.0000   11.5000   13.5000    0.0000

 3R Nonlinear Regression
For the built-in and user-specified functions, 3R uses analytically exact partial derivatives
in the iterative process to estimate parameters. Seven functions are built in; others can be
user-specified. Users can do maximum likelihood estimation for data from the exponential
family of distributions by iteratively reweighted least squares. A user-specified loss function
can replace the least squares criterion. Functions of parameters and their standard errors
can be estimated. 3R also provides robust regression: five functions are available to
downweight outliers.
 Scatterplots of residuals and residuals squared against predicted values. Plot of
observed(O) and predicted(P) values of the dependent variable against the observed
value of a specified independent variable, plot of the residuals against a specified
independent variable, and normal and detrended normal probability plots of the residuals.
Special to 3R are confidence curves for the parameters.

Example. The model for the first plot below is one of 3R’s built-in functions. The data are
radioactivity counts in a baboon’s blood sampled over time. In the second plot, the
confidence curves provide an easy way to visualize the variability in estimated parameters
from a different study. These curves can show more than the usual Wald intervals: here
we see it is hard to estimate the upper bound with reasonable confidence.
[Figure, two panels: (left) Two-Compartment Model (sum of two exponentials), Radioactivity
Count vs. Time (0 to 180), observed and predicted; (right) Cook & Weisberg Parameter
Confidence Curves, Mitcherlitz Parameter P2 vs. T Value (df = 10), showing the estimate with
its upper and lower bounds.]

 4R Regression on Principal Components and Ridge Regression


4R produces a regression analysis for a dependent variable on a set of principal
components computed from the independent variables. Use this program when the
independent variables are highly correlated. 4R standardizes variables before computing
principal components. A ridge regression option deflates correlations among the
independent variables, thus reducing the effects of multicollinearity. You can control the
amount of ridging.
 Scatterplots of residuals and residuals squared against predicted values. Plot of
observed(O) and predicted(P) values of the dependent variable against the observed
value of a specified independent variable, plot of the residuals against a specified
independent variable, normal and detrended normal probability plots of the residuals, and
ridge trace plots (standardized regression coefficients vs. the index of the ridge factors,
standardized residual sum of squares vs. the index of ridge factors, and multiple R² vs. the
index of ridge factors).

 5R Polynomial Regression
A polynomial in one independent variable is fit to a dependent variable using least
squares. Orthogonal polynomials are used during computations.

 Scatterplots of observed(O) and predicted(P) values of the dependent variable against
the observed value of the independent variable, plot of the residuals against the
independent variable, normal and detrended normal probability plots of the residuals.

 6R Partial Correlation and Multivariate Regression


6R computes the partial correlation of a set of variables after removing the linear effect of
a second set of variables. The program can also be used for regression to predict several
dependent variables with the same set of independent variables.
 Scatterplots of any variable or residual against any other variable or residual. (The
residuals are named “R” followed by the first seven letters of the variable name.)
A normal probability plot of the residuals is also available. 6R can display a shaded
correlation matrix.
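
For the simplest case, one variable removed from a pair, the partial correlation has a closed form that can be sketched directly. 6R handles whole sets of variables; the function name and the correlation values below are illustrative assumptions.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z from both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


print(partial_correlation(0.5, 0.0, 0.0))   # z unrelated: nothing to remove
print(partial_correlation(0.25, 0.5, 0.5))  # z accounts for all of the x-y correlation
```

The second call shows how a simple correlation of 0.25 can vanish once the shared dependence on a third variable is removed.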

 9R All Possible Subsets Regression


9R estimates regression equations for “best” subsets of predictor variables and provides
detailed residual analysis. The Furnival-Wilson algorithm efficiently identifies these
subsets while computing only a small fraction of all possible regressions. Potential outliers
are identified according to the Mahalanobis distances (to each case from the mean of all
cases), standardized residuals, and Cook’s distances. Deleted and weighted residuals are
also available.
 Scatterplots can be requested for any pair of variables or derived variables (predicted
values, residuals, and distance measures). 9R also provides a normal probability plot of
the standardized residuals, and a shaded correlation matrix.
Example. As the ‘best’ overall subset to predict mortality (see 2R for a description of the
data), program 9R selected rainfall, education, % nonwhite, and log SO2 as
predictors. Here are the simple correlations among the variables available. Note that the
simple correlations with mortality for the variables in the ‘best’ model are all in green (they
range from 0.3392 to 0.6696). In the normal probability plot, the point at the top right is
Case 37, identified as New Orleans in the 2R output.
[Figure, two panels: (left) shaded matrix of the Absolute Value of Correlations among RAIN,
EDUCATN, NONWHITE, MORTALTY, POP_DEN, log_so2, and log_Nox, with shading legend
0.0088 to 0.3392, 0.3392 to 0.6696, 0.6696 to 1; (right) Expected Normal Value vs.
Standardized Residual, with fitted line Y = 0.0134 + 0.9393*X; RMS = 0.04.]

 AR Derivative-free Nonlinear Regression
This program uses a secant method to approximate the derivatives and places a secant
plane into the response surface. Seven functions are built in; others can be user-specified.
A system of differential equations (e.g., a compartment model) can be fit to the data. A
user-specified loss function can replace the least squares criterion. Functions of parameters
and their standard errors can be estimated. The user can fix the value of a parameter or
impose upper and lower limits on individual parameters or on arbitrary combinations of
parameters. AR also offers ridge regression with a Marquardt option which, at each
subsequent iteration, tempers the correlation between parameters.
 Scatterplots of residuals and residuals squared against predicted values. Plot of
observed(O) and predicted(P) values of the dependent variable against the observed
value of a specified independent variable, plot of the residuals against a specified
independent variable, and normal and detrended normal probability plots of the residuals.
Also scatterplots can be requested from a list of any x or y variables or derived variables
(predicted values, residuals, standardized residuals, weighted residuals, or the natural log
of the residual, predicted residual, or weighted residual).

 LR Stepwise Logistic Regression


LR computes parameters of the logistic model where the dependent variable is binary.
The independent variables can be continuous or categorical and entered in a stepwise
manner. LR generates three types of design variables for categorical variables and their
interactions. Case-control designs can be analyzed. Based on a user-supplied 2-by-2 cost
matrix, LR computes the cost of misclassification as a function of cutpoints.
 Histograms of predicted probabilities of being in the first group for each group (see
graph below for a model to predict depression—yes/no). The percentage of correct
classification versus a spectrum of cutpoints on computed probabilities (see middle plot
below). Scatterplots of observed proportions of the first group versus predicted
probabilities and predicted log odds. Scatterplots where the user specifies the x and y
variables from a list including variables and cell descriptions (e.g., success, predicted
probability, influence). A ROC (Receiver Operating Characteristic) plot, which shows the
proportion of true positives versus the proportion of false positives (see the plot on the right).
[Figure: three panels.
Left: "Predicted Probability of Being Depressed". Histograms of Probability Depressed for the DEPRESSD and NORMAL groups; a star marks each group's mean.
Middle: "Percentage of Correct Classification vs. Cutpoint". Percent Correct against Cutpoint for the Depressed, Normal, and Overall (*) series.
Right: "Receiver Operating Characteristic (ROC) Plot". Proportion True Positives against Proportion False Positives.]

The model used to predict which subjects are depressed is not that great!
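
The cutpoint displays described above can be sketched in a few lines of Python. This is only an illustration of the idea, not BMDP code or LR's actual algorithm: the probabilities, outcomes, cost values, and the function name classify_at are all hypothetical, and the rule "predict the first group when the probability is at or above the cutpoint" is an assumption.

```python
# Hypothetical predicted probabilities of being in the first group, with the
# observed outcome for each case (1 = first group, 0 = second group).
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.60]
truth = [0, 0, 1, 0, 0, 1, 1, 1]

# Assumed 2-by-2 cost matrix, indexed as cost[observed][predicted].
# Correct classifications cost nothing in this example.
cost = [[0.0, 1.0],   # observed 0: predicted 0, predicted 1 (false positive)
        [5.0, 0.0]]   # observed 1: predicted 0 (false negative), predicted 1


def classify_at(cutpoint):
    """Return (percent correct, true positive rate, false positive rate, cost)."""
    tp = fp = tn = fn = 0
    total_cost = 0.0
    for p, y in zip(probs, truth):
        pred = 1 if p >= cutpoint else 0   # assumed classification rule
        total_cost += cost[y][pred]
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    percent_correct = 100.0 * (tp + tn) / len(truth)
    return percent_correct, tp / (tp + fn), fp / (fp + tn), total_cost


# One row per cutpoint: the first column gives the percent-correct curve,
# the (false positive rate, true positive rate) pairs trace the ROC curve,
# and the last column is the misclassification cost at that cutpoint.
for c in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(c, classify_at(c))
```

Sweeping the cutpoint from 0 to 1 moves along the ROC curve from the upper right corner (everything classified as the first group) to the lower left corner (nothing classified as the first group).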

 PR Polychotomous Stepwise Logistic Regression
PR computes maximum likelihood estimates of parameters of logistic models for
multinomial data. The categorical values of the dependent variable may be nominal or
ordered. PR has the same capability of automatic generation of design variables as LR.
 Histograms of predicted probabilities for each category of the response. For each
category of the response variable, scatterplots of the standardized residuals versus an
independent variable. Scatterplots where the user specifies the x and y variables from a
list including variables and cell descriptions (e.g., observed proportion having outcome i,
predicted probability of outcome i, observation index).

 3S Nonparametric Statistics
Tests in this program do not require the assumption of normality. Many use ranks—3S
automatically converts quantitative variables or scores into ranks. Kendall and Spearman
rank correlations are available. For independent groups, 3S provides the Mann-Whitney test
(two groups) and the Kruskal-Wallis test (two or more groups). 3S has pairwise mean
comparisons for the Kruskal-Wallis test.
For differences among related or paired variables, the Sign test and Wilcoxon signed-rank
test are available. 3S also provides the Friedman test with pairwise mean comparisons
and Kendall’s coefficient of concordance.
No high resolution graphics
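
As a rough illustration of the rank conversion that underlies these tests, the Python sketch below replaces values by ranks and computes a Spearman rank correlation as the Pearson correlation of the ranks. It is not BMDP code; the function names are hypothetical, and giving tied values the average of their ranks (midranks) is an assumption about tie handling.

```python
def midranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(midranks(x), midranks(y))


print(midranks([10, 20, 20, 30]))   # the two tied values share rank 2.5
print(spearman([1, 2, 3, 4], [40, 30, 20, 10]))
```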

 1T Univariate and Bivariate Spectral Analysis


The data screening and analytical methods in 1T are applicable to a wide variety of time
series data and can be extended to pairs of series. 1T analyzes time series data through
its spectral decomposition obtained by a fast Fourier transform (FFT) algorithm, and it
provides filtering, re-coloring, and adjustments for trends and seasonal effects. Missing
values can be replaced by linear interpolation, local mean, or local median. Estimates of
degree of coherence and regression relation between two time series in different
frequency bands are available.
 Snapshot plot with moving trimmed means overlaid on the data to see general trend
and outliers, graphical displays of periodograms and spectra of individual or paired time
series, lagged scatter plots, complex demodulation, a plot of the amplitude and phase of a
frequency band component of a time series, and a spectral plot with confidence bands.
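
The idea of a spectral decomposition can be sketched with a raw periodogram. The Python below is a minimal illustration under stated assumptions, not 1T's implementation: it uses a direct discrete Fourier transform rather than an FFT (same result, more arithmetic), removes only the mean rather than a trend, and scales power as the squared magnitude divided by n, which is one of several conventions. The function name periodogram and the test series are hypothetical.

```python
import cmath
import math


def periodogram(series):
    """Return (frequency, power) pairs for frequencies k/n, k = 1 .. n//2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]   # remove the mean before transforming
    result = []
    for k in range(1, n // 2 + 1):
        # Direct DFT coefficient at frequency k/n cycles per observation.
        coef = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                   for t, x in enumerate(centered))
        result.append((k / n, abs(coef) ** 2 / n))
    return result


# Hypothetical series: a pure sine wave completing 4 cycles in 32 observations.
series = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
peak_freq = max(periodogram(series), key=lambda pair: pair[1])[0]
print(peak_freq)   # frequency with the largest power, in cycles per observation
```

For this series the power concentrates at frequency 4/32 = 0.125, which is how a periodogram reveals a periodic component.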

 2T Box-Jenkins Time Series Analysis


2T analyzes a parametric time domain model iteratively in three stages: the selection of a
tentative model, estimation of model parameters, and testing for adequacy of fit (residual
analysis). Once a suitable model is identified you may forecast future observations. The
class of univariate time domain models includes ARIMA (Autoregressive Integrated Moving
Average), regression, intervention, and transfer function models. When data are missing,
2T lets the user specify which block of observations to use: first, last, largest, or the range
from #1 to #2. 2T has several unique line-by-line text output displays including plots of
autocorrelations and partial autocorrelations with approximate 95% confidence intervals
for the original data.
No high resolution graphics
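
The autocorrelation display mentioned above can be illustrated with a short Python sketch. This is not 2T's code: the function names are hypothetical, and the limit of plus or minus 1.96 divided by the square root of n is the common large-sample white-noise approximation, assumed here as a stand-in for whatever approximate 95% limits 2T prints.

```python
import math


def autocorrelations(series, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)            # lag-0 sum of squares
    return [sum(dev[t] * dev[t + lag] for t in range(n - lag)) / c0
            for lag in range(1, max_lag + 1)]


def white_noise_limit(n):
    """Approximate 95% limit for an autocorrelation of white noise (assumed)."""
    return 1.96 / math.sqrt(n)


# Hypothetical short series, used only to show the calculation.
series = [1, 2, 3, 4, 5]
r = autocorrelations(series, 2)
limit = white_noise_limit(len(series))
for lag, value in enumerate(r, start=1):
    flag = "outside" if abs(value) > limit else "inside"
    print(lag, round(value, 3), flag, "the approximate 95% limits")
```

Autocorrelations that fall outside the limits at low lags are the usual cue, in the model selection stage, that the series is not yet adequately described as white noise.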

 1V One-way Analysis of Variance or Covariance
While 1V does provide one-way analysis of variance, it is most frequently used for
analysis of covariance with one main effect and one or more covariates. The slopes of the
covariates are tested for equality (parallelism) among groups.
 For each group, scatterplots of (1) residuals versus covariates, (2) observed(O) and
predicted(P) values versus covariates, (3) residuals versus predicted values, and (4)
residuals squared versus predicted values.
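
The test for equal slopes (parallelism) can be sketched for the simplest case of a single covariate. The Python below is an illustration under that assumption, not 1V's procedure, which handles one or more covariates; the function name slope_equality_f and the data are hypothetical, and only the F statistic and its degrees of freedom are computed (no p-value).

```python
def slope_equality_f(groups):
    """F statistic for equal covariate slopes across groups (one covariate).

    groups is a list of (x_values, y_values) pairs, one pair per group.
    Returns (F, numerator df, denominator df).
    """
    sxx_pool = sxy_pool = syy_pool = 0.0
    sse_separate = 0.0        # residual SS with a separate slope in each group
    n_total = 0
    for x, y in groups:
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((a - mx) ** 2 for a in x)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        syy = sum((b - my) ** 2 for b in y)
        sse_separate += syy - sxy ** 2 / sxx
        sxx_pool += sxx
        sxy_pool += sxy
        syy_pool += syy
        n_total += n
    # Residual SS with separate intercepts but one common (pooled) slope.
    sse_common = syy_pool - sxy_pool ** 2 / sxx_pool
    df1 = len(groups) - 1
    df2 = n_total - 2 * len(groups)
    f = ((sse_common - sse_separate) / df1) / (sse_separate / df2)
    return f, df1, df2


# Hypothetical data: two groups measured at the same covariate values.
group_a = ([0, 1, 2], [0, 1, 3])
group_b = ([0, 1, 2], [0, 3, 5])
print(slope_equality_f([group_a, group_b]))
```

A large F relative to the F distribution with the returned degrees of freedom suggests the slopes differ among groups, in which case a common-slope covariance adjustment is questionable.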

 2V Analysis of Variance and Covariance with Repeated Measures


2V is a popular program that handles a wide variety of designs including Latin square,
incomplete block, and fractional factorial designs. Models can have grouping factors
(between-groups or whole-plot factors), within factors (trial, split-plot, repeated measures,
or within-subjects factors), or both. Each subject must have a response at all
combinations of the trial factors, but group sizes may be unequal.
 Box-Cox diagnostic plots for determining a variance stabilizing transformation. Plots of
cell means for repeated measures designs. Example: over three weeks, patients with
shoulder and hip arthritis were given increasing doses of a pain medication. Once per
week their range of motion was measured at 2, 4, 6, and 10 hours after taking the drug—
thus, a repeated measures design with one between-subjects factor and two within-
subjects factors (3 doses and 4 times within a day). To create a compact display, we
coded the 3 weeks each having 4 range of motion measurements as 1 to 12 and drew a
gridline at 4.5 and 8.5 to separate the weeks.
[Figure: "Effectiveness of Arthritis Medication". Cell means of Range of Motion (vertical axis, 20 to 60) plotted against the coded measurement occasion 1 to 12 (horizontal axis); S marks Shoulder and H marks Hip.]
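
The coding used for the compact display in this example can be written out explicitly. The Python sketch below is only an illustration of the arithmetic described in the text; the names HOURS, N_WEEKS, and display_position are hypothetical.

```python
HOURS = [2, 4, 6, 10]   # measurement times (hours after taking the drug)
N_WEEKS = 3             # one dose level per week


def display_position(week, hour):
    """Map (week 1..3, hour) to a single horizontal position 1..12."""
    return (week - 1) * len(HOURS) + HOURS.index(hour) + 1


# Gridlines fall halfway between the last position of one week and the
# first position of the next.
gridlines = [week * len(HOURS) + 0.5 for week in range(1, N_WEEKS)]

print(display_position(1, 2), display_position(2, 2), display_position(3, 10))
print(gridlines)
```

With 4 measurements per week, week 2 starts at position 5 and week 3 at position 9, so the separating gridlines sit at 4.5 and 8.5.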

 3V General Mixed Model Analysis of Variance


Models analyzed by 3V can have several fixed effects, random effects, and/or covariates.
3V uses maximum likelihood (ML) and restricted maximum likelihood (REML) approaches.
No high resolution graphics

 4V Univariate and Multivariate Analysis of Variance and Covariance, including Repeated Measures
4V performs both univariate and multivariate analysis of variance and covariance,
including repeated measures, split-plot, and changeover designs. Effective use of this
program requires more than a casual background in ANOVA.
No high resolution graphics

 5V Unbalanced Repeated Measures Models with Structured Covariance Matrices
5V analyzes repeated measures data for many designs including those with unequal
variances, covariance matrices with a specific pattern, and incomplete data. Maximum
likelihood (ML) or restricted maximum likelihood (REML) is used to compute estimates of
the regression and covariance parameters. 5V provides more choice of the covariance
structure than found in 2V, 3V, or 4V.
No high resolution graphics

 8V General Mixed Model Analysis of Variance—Equal Cell Sizes


8V performs an analysis of variance for any complete design with equal cell sizes. This
includes nested, crossed, and partially crossed designs for fixed-effect (including repeated
measures), random-effect or variance component models, and mixed models. Because it does
not use grouping information, this program depends on the structure of the input data to
formulate analyses.
No high resolution graphics
