Chapter 6

Uploaded by

franciscovenegas
Copyright
© © All Rights Reserved
6 FIRM Examples
In this chapter, two data sets are analyzed using both CATFIRM and CONFIRM. For
more information on FIRM syntax, please see Chapter 8. An overview of this technique
is given in Chapter 7.

This chapter discusses the following examples:

• The “auto repair” example
• CATFIRM analysis of the “head injuries” data set
• CONFIRM analysis of the “head injuries” data set
• CATFIRM analysis of the “mixed” data set
• CONFIRM analysis of the “mixed” data set.

6.1 Auto Repair Example

In the study of automobile costs, you may have a database containing, in order of
appearance, the variables COUNTRY, MAKE, CYL, DISPL, TANK, PRICE, YEAR and
REPAIR, which are respectively the car’s country of origin, its make, the number of
cylinders it has, the engine displacement, fuel tank capacity, recommended retail price,
year of manufacture and annual repair costs.

The following variables may be used to predict the repair costs (REPAIR), which is thus
the dependent variable.

• MAKE : the make of the car, with 5 groupings (GM, Ford, Chrysler, Japanese and
European). MAKE is clearly nominal, and would be used as a “free” predictor.
• CYL : the number of cylinders in the engine (4, 6, 8 or 12). CYL should be used
as monotonic, in order to avoid the assumption that the difference in repair costs
between 4- and 8-cylinder cars is the same as that between 8- and 12-cylinder cars.
• PRICE : manufacturer’s suggested retail purchase price when new. PRICE is a
ratio-scale predictor that would be specified as monotonic—it seems generally
accepted that cars that are more expensive to buy are more expensive to run, too.
• YEAR : year of manufacture. YEAR is an example of a predictor that looks to be
on interval scale, but appearances may be deceptive. The best would be to use
year as a “free” predictor, thus allowing the cost to have any type of response to
year, including one in which isolated years are bad while their neighbors are
good.
6.2 CATFIRM Analysis of the “head injuries” Data Set


6.2.1 Description of the Data

The “head injuries” data set of Titterington et al (1981) is an example of a data set in
which FIRM is a potential method of analysis. The data set was gathered in an attempt to
predict the final outcome of 500 hospital patients who had suffered head injuries. The
outcome for each patient was one of the following:

• dead or vegetative (52% of the cases),
• severe disabilities (10% of the cases), or
• a moderate or good recovery (38% of the cases).

This outcome is predicted on the basis of 6 predictors assessed on the patients’ admission
to the hospital:

• age : The age of the patient. This is grouped into decades in the original data, and
is grouped the same way here. It has eight classes.
• EMV : This is a composite score of three measures—of eye-opening in response
to stimulation, motor response of best limb, and verbal response. This has seven
classes, but is not measured in all cases, so that there are eight possible codes for
this score — the seven measurements and an eighth “missing” category.
• MRP : This is a composite score of motor responses in all four limbs. This also
has seven measurement classes with an eighth class for missing information.
• Change : The change in neurological function over the first 24 hours. This was
graded 1, 2, or 3, with a fourth class for missing information.
• Eye indicator : A summary of diagnostics on the eyes. This too had three
measurement classes, with a fourth for missing information.
• Pupils : Pupil reaction to light — present, absent, or missing.

In the CATFIRM analysis the dependent variable was treated as being a categorical
variable.

6.2.2 FIRM Syntax File

The syntax for this analysis is contained in headicat.pr2 and is shown and discussed
below.
CATFIRM: den sum
1 ‘Outcome’ 3 ‘Dead/veg’ ‘Severe ‘ ‘Mod/good’
6
‘age’ 2 1 ‘m’ 0 8 ‘01234567’ 4.9000 5.0000
‘EMV’ 3 0 ‘1’ 0 8 ‘?1234567’ 4.9000 5.0000
‘MRP’ 4 0 ‘1’ 0 8 ‘?1234567’ 4.9000 5.0000
‘Change’ 5 0 ‘1’ 0 4 ‘?123’ 4.9000 5.0000
‘Eye ind’ 6 0 ‘1’ 0 4 ‘?123’ 4.9000 5.0000
‘Pupils’ 7 0 ‘f’ 0 3 ‘?12’ 4.9000 5.0000
0 1000 25 0.50000 1.00000 25 .000 0 1 0 0 0 0 0 0

Note the following:

• The first line indicates that the outcome variable is categorical and both summary
statistics (sum) and a dendrogram (den) are to be included in the output
produced during analysis.
• The second line contains information on the dependent or outcome variable. It is
the first variable in the data set, as indicated by 1 in the first field, and it is used
here as a categorical variable. The outcome variable has three categories:
‘Dead/veg’ , ‘Severe ’, and ‘Mod/good’.
• This is followed by six lines of data, one for each predictor variable. Predictors
included range from ‘age’, a monotonic variable with 8 possible values to
‘Pupils’, a free variable with 3 possible values.
• The last two lines of the syntax file contain the output options for this analysis.
The options specified are as follows:

0: Table format. The option 0 gives the table as a percentage frequency
breakdown of the dependent variable for all cases in the node, and for the
cases broken down by the different categories of the predictor.
1000: The detail output code is a single number reflecting the answers to
all three questions about the printed information, and is calculated as
follows:

Q1: Do you want details of splits not used?
Q2: Do you want cross tabs before grouping?
Q3: Do you want cross tabs after grouping?

Score 1 for Yes and 2 for No; then compute this option as

1000 × Q1 + Q2 + 2 × Q3 − 1003.

In this case, detailed analysis of splits is not requested (Q1 = 2), but cross-
tabs are requested (Q2 = Q3 = 1).
25: The minimum size a node must have to be considered for further
splitting.
0.50000: The raw significance level for a split to be made.
1.00000: The conservative significance level that a predictor must attain
for the split to be made.
25: The maximum number of nodes to be analyzed.
0.000: Constant to be added to $\chi^2$.
0: The data for this analysis is in free format. If fixed format had been
indicated, a format statement would have followed directly after this line
of input.
1: Specifies use of FIRM 2.1 methodology rather than FIRM 2.0
methodology.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
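
The detail output code described in the options list above can be checked with a few lines of Python. This is only an illustrative sketch: the function name and its boolean arguments are hypothetical, and the arithmetic simply follows the formula 1000 × Q1 + Q2 + 2 × Q3 − 1003 with Yes scored as 1 and No as 2.

```python
def detail_output_code(splits_not_used: bool, tabs_before: bool, tabs_after: bool) -> int:
    """Combine the three yes/no answers into the single detail output code."""
    # Score 1 for Yes and 2 for No, as described in the text.
    q1 = 1 if splits_not_used else 2
    q2 = 1 if tabs_before else 2
    q3 = 1 if tabs_after else 2
    return 1000 * q1 + q2 + 2 * q3 - 1003


# The run discussed here: Q1 = No, Q2 = Yes, Q3 = Yes.
print(detail_output_code(False, True, True))  # 1000
```

Answering Yes to all three questions gives 0, and No to all three gives 1003, so each combination of answers maps to a distinct code.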

For more information on syntax, see the sections on the dependent and predictor variable
specifications in Chapter 8.
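
For readers who want to inspect a syntax file programmatically, the sketch below splits one predictor line into its nine fields. It is not part of FIRM: the class and field names are hypothetical, and the positional reading (name, position in the data, use flag, type code, two range fields, category symbols, split % and merge %) is an assumption based on the examples in this chapter. Chapter 8 remains the authoritative description.

```python
import re
from dataclasses import dataclass

# Quoted fields may use straight or typographic quotes, depending on how the
# syntax file was copied; both are accepted here.
_TOKEN = re.compile(r"[‘’']([^‘’']*)[‘’']|(\S+)")


@dataclass
class PredictorSpec:  # hypothetical container, not a FIRM structure
    name: str
    position: int
    use_flag: int
    kind: str
    range_a: int
    range_b: int
    category_codes: str
    split_pct: float
    merge_pct: float


def parse_predictor_line(line: str) -> PredictorSpec:
    """Split one predictor line into its nine positional fields."""
    tokens = []
    for m in _TOKEN.finditer(line):
        # group(1) is set for quoted fields (possibly empty), group(2) otherwise.
        tokens.append(m.group(1) if m.group(1) is not None else m.group(2))
    if len(tokens) != 9:
        raise ValueError(f"expected 9 fields, got {len(tokens)}: {line!r}")
    name, pos, use, kind, a, b, codes, split, merge = tokens
    return PredictorSpec(name.strip(), int(pos), int(use), kind,
                         int(a), int(b), codes, float(split), float(merge))


spec = parse_predictor_line("'Pupils' 7 0 'f' 0 3 '?12' 4.9000 5.0000")
print(spec.name, spec.kind, len(spec.category_codes), spec.split_pct)
```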

6.2.3 CATFIRM Output

The syntax read from the syntax file is first echoed to the output file, which then shows
the analysis of each node. As can be seen from the run log below, 14 nodes were analyzed,
one for each “Starting node” line.
CATFIRM. Formal Inference-based Recursive Modeling
Program dimensions
Maximum number of predictors 1500
Maximum number of categories in
Predictors 16
Dependent variable 16

the data file (input):


D:\lisrel850\FIRMEX\HEADINJ.dat
the detailed analysis of splits (output):
D:\lisrel850\FIRMEX\HEADINJ.spl
the contingency tables made (output):
D:\lisrel850\FIRMEX\HEADINJ.tab
the split rule table of splits made (output):
D:\lisrel850\FIRMEX\HEADINJ.spr
Run now starting...
All data in. 500 cases read with
500 retained.
Starting node 1 descended from 0
Split on 6 making descendants 2 3 2
Starting node 2 descended from 1
Split on 1 making descendants 4 4 5 5 6 6 7 7
Starting node 3 descended from 1
Starting node 4 descended from 2
Split on 3 making descendants 8 8 8 8 8 8 9 9
Starting node 5 descended from 2
Split on 5 making descendants 11 10 11 12
Starting node 6 descended from 2
Split on 3 making descendants 15 13 13 13 14 14 15 15
Starting node 7 descended from 2
Split on 2 making descendants 16 16 16 16 17 17 17 18
Starting node 8 descended from 4
Starting node 9 descended from 4
Starting node 11 descended from 5
Starting node 12 descended from 5
Starting node 14 descended from 6
Starting node 15 descended from 6
Starting node 17 descended from 7

Mini-dendrogram of analysis

Legend:   node number
          splitting variable
          ------------------------------------
          horizontal line connects descendants

                          1
                          6
          ---------------------------------
          2                               3
          1
 ---------------------------------------------
   4          5             6            7
   3          5             3            2
 -----   ----------    ----------   ----------
 8   9   10  11  12    13  14  15   16  17  18

Summary information on the outcome and predictor variables is given next, followed by
a list of the options in effect for the analysis.
Outcome has 3 categories:

Dead/veg
Severe
Mod/good

There are 6 predictors as follows

Type   # cats  Cat symbols  Use?  Split%  Merge%
mono     8     01234567     May    4.90    5.00   age
float    8     ?1234567     May    4.90    5.00   EMV
float    8     ?1234567     May    4.90    5.00   MRP
float    4     ?123         May    4.90    5.00   Change
float    4     ?123         May    4.90    5.00   Eye ind
free     3     ?12          May    4.90    5.00   Pupils

Options in effect:
Tables printed as column percentages
Detail output on file split
Tables given before and after each step
To be analysed, a group must:
have at least 25 cases;
be significant at the 0.500% level;
be conservatively significant at the 1.000% level.
The run will terminate when 25 groups have been formed.
Standard Pearson X^2 statistic is used
Tests use new (FIRM 2.1) procedures

The summary of results of the first split is given below. Node number 1 is the full data
set. In node 1, all the predictors give highly significant splits on the dependent variable
Outcome. The number of groups selected by CATFIRM varies from 2 to 4: under the
heading groups we see how the categories break down. For example, using EMV would
split the cases into 4 groups. Taking the Pearson $\chi^2$ value for the 3 × 4 table and finding
its p-value under the non-asymptotic distribution would give

$P = 3.44 \times 10^{-20}\,\%$.

Making the Bonferroni adjustment for the grouping that went into this final set of
categories multiplies this p-value by 95, giving the Bonferroni significance as

$< 3.27 \times 10^{-18}\,\%$.

The multiple comparison p-value takes the original Pearson $\chi^2$ for the 3 × 8 table before
grouping the categories and enters it into the non-asymptotic distribution, giving a
p-value of

$< 5.00 \times 10^{-17}\,\%$.

Since both the Bonferroni and the multiple comparison p-values are conservative, we
take the smaller of them, $3.27 \times 10^{-18}\,\%$, to be the conservative significance level of the
predictor.
Summary of results of node number 1 predecessor node number 0
Total group
no. Name Signif % Bonf sig % MC sig % groups
1 age 6.40E-13 2.24E-11 1.62E-10 01 23 45 67
2 EMV 3.44E-20 3.27E-18 5.00E-17 12 345 ?6 7
3 MRP 2.04E-20 1.04E-18 2.32E-16 12 345 ?67
4 Change 0.0116 0.0579 0.0208 ?1 23
5 Eye ind 3.88E-19 1.94E-18 1.40E-18 1 ?2 3
6 Pupils 5.08E-20 1.52E-19 3.47E-17 ?2 1
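
The way the three significance columns combine can be sketched in Python. The function below is illustrative only (its name is hypothetical): it multiplies the raw significance by the Bonferroni multiplier and takes the smaller of that product and the multiple comparison value, as described for EMV above. Capping the Bonferroni product at 100% is an assumption, suggested by the 100.00 entries that appear in later tables.

```python
def conservative_significance(raw_pct: float, bonferroni_multiplier: float, mc_pct: float) -> float:
    """Smaller of the Bonferroni-adjusted and multiple comparison significances (in %)."""
    # Assumption: a percentage cannot exceed 100, so the product is capped.
    bonferroni_pct = min(100.0, raw_pct * bonferroni_multiplier)
    return min(bonferroni_pct, mc_pct)


# EMV at node 1: raw 3.44E-20 %, multiplier 95, MC significance 5.00E-17 %.
emv = conservative_significance(3.44e-20, 95, 5.00e-17)
print(f"{emv:.2e}")  # 3.27e-18
```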
Detail on the characteristics of the best predictor is given next. The most significant split
uses Pupils. It is a binary split, between the pooled class ? or 2 and class 1. This split has
a Bonferroni significance level of $1.52 \times 10^{-19}\,\%$ and a multiple comparison significance
level of $3.47 \times 10^{-17}\,\%$. The smaller of these (both being conservative) is taken as the
significance level of the split.

The summary table below shows how the cases divide between these two—roughly three
quarters go to node 2, where the recovery rate is quite variable, and the rest go to node 3,
where 90% of the patients are dead or vegetative.
Characteristics of the best predictor
6 Pupils 5.08E-20 1.52E-19 3.47E-17 ?2 1
***********************************************************************
predictor 6 Pupils *percent* total number 500
?,2 1 Total
Dead/veg 39.4 90.2 51.8
Severe 11.9 5.7 10.4
Mod/good 48.7 4.1 37.8
totals (100%) 378 122 500
Raw significance of table is 5.08E-20

The descendant nodes are numbered 2 3

The analysis continues with node number 2. CATFIRM lists the makeup of this node—it
is all cases for which Pupils is ? or 2. As the analysis proceeds, this record grows to
reflect the successive splits giving rise to the node. In this node, no significant split can
be made on Pupils (not surprisingly, since the two classes of Pupils represented in this
node were grouped because of their compatibility). All other predictors give significant
splits, the conservative significance ranging from 0.1% for Change down to $8 \times 10^{-12}\,\%$
for age. Node 2 is split four ways on age, the cut points being at ages 20, 40 and 60. All
four nodes are investigated again later in the analysis.
Summary of results of node number 2 predecessor node number 1

Makeup Pupils (?,2)


no. Name Signif % Bonf sig % MC sig % groups
1 age 2.30E-13 8.06E-12 4.37E-11 01 23 45 67
2 EMV 9.44E-10 4.81E-08 6.21E-07 12 ?345 67
3 MRP 8.34E-10 4.25E-08 1.24E-06 12345 ?6 7
4 Change 0.0367 0.1836 0.1010 ?1 23
5 Eye ind 2.64E-07 1.32E-06 5.84E-07 1 ?2 3
6 Pupils 100.00 100.00 100.00 ?2
Characteristics of the best predictor
1 age 2.30E-13 8.06E-12 4.37E-11 01 23 45 67
***********************************************************************
predictor 1 age *percent* total number 378
0,1 2,3 4,5 6,7 Total
Dead/veg 21.3 30.9 48.2 81.7 39.4
Severe 8.8 15.5 16.5 6.7 11.9
Mod/good 69.9 53.6 35.3 11.7 48.7
totals (100%) 136 97 85 60 378

Raw significance of table is 2.30E-13


The descendant nodes are numbered 4 5 6 7

Node 3 is terminal. Its cases cannot (at the significance levels selected) be split further.
When there is no further significant split to be made within a node, the program adds a
message to this effect at the end of the summary of results for that node. This was the
case for node 3, where the best predictor, Eye ind, was not significant.
Summary of results of node number 3 predecessor node number 1
Makeup Pupils (1)
no. Name Signif % Bonf sig % MC sig % groups
1 age 100.00 100.00 100.00 01234567
2 EMV 0.1354 6.9051 8.5868 12 345 ?67
3 MRP 0.3558 4.6251 8.0063 123456 ?7
4 Change 100.00 100.00 100.00 ?123
5 Eye ind 0.1679 0.8394 0.3613 12 ?3
6 Pupils 100.00 100.00 100.00 11

Characteristics of the best predictor


5 Eye ind 0.1679 0.8394 0.3613 12 ?3
This predictor is not significant

The analysis continues in the same way with the successive nodes generated. The
information on the splits actually made is duplicated in the dendrogram. The additional
information in the summary file is on the predictors that did not give rise to splits. In the
full data set, all the predictors were highly significant. In Node 2, age was as significant
as it was in the full data set, so that the predictive power of age is different from that of
Pupils. EMV, MRP and Eye ind had considerably less significance in Node 2 than in
Node 1. While to some degree this is an inevitable consequence of the reduction in
sample size going down the tree, in part it can also indicate overlap in the predictive
information—that these predictors are correlated with Pupils and the common variability
they have with Pupils comprises a lot of the information about Outcome. Scanning the
summary file for this sort of information can often produce valuable insights.

The detailed information of splits is given in the output file headicat.spl. The first few
lines of this file are given below. The split file details the tests that went into the final
grouping of each predictor at each node. There are three slightly different layouts here,
illustrated by the monotonic predictor age, the free predictor Pupils and the floating
predictor MRP. For the monotonic predictor, only adjacent categories may be merged.
The predictor age starts out with 8 groups, giving 7 pairwise $\chi^2$ statistics that are listed
after “Test statistics for grouping”. The smallest of these is 1.3775 for merging categories
2 and 3. With a significance of 51%, this is far from significant at the merge significance
level for the run, so these two
classes are joined into a composite, leaving 7 groups. The merge statistics for these
revised groups are computed, the smallest of which is 1.6163 for merging classes 0 and
1, so these classes are merged. The analysis continues in the same way until 4 composite
groups remain. At this stage, the smallest merge statistic (that for merging (23) with
(45)) is significant at the 2% level, which is below the run’s threshold for merging.
Thus no further reduction of the classes of age occurs.

No more merging being possible, CATFIRM then attempts to split the composite
categories. The line “Test stats for splitting” gives the details of this, showing a $\chi^2$
statistic for each possible resplitting point of a composite category. The largest of these
statistics is 5.0593, which is far from significant at the split significance level specified
for the run, and so no resplitting takes place. Thus (01), (23), (45), (67) is the final
grouping of the categories. The final 3 × 4 contingency table gives a Pearson $\chi^2$ of
77.984. Entering this into $\chi^2$ tables with 6 degrees of freedom gives a raw significance
level of $6.4 \times 10^{-13}\,\%$. The Bonferroni multiplier, which allows for the grouping that went
into reducing 8 groups into 4, is 35. The Bonferroni significance level is therefore

$35 \times 6.4 \times 10^{-13}\,\% = 2.24 \times 10^{-11}\,\%$.

The multiple comparison significance, computed from the ungrouped 3 × 8 table, is
$1.62 \times 10^{-10}\,\%$.


1Total group
***************************************************************************
Monotonic age
Table has chi-square 86.310, with df 14 and significance 1.62E-10
8 groups: 0 1 2 3 4 5 6 7
Test statistics for grouping: 1.6 9.8 1.4 4.0 1.7 7.7 5.1

Min stat is 1.3775, to merge (2) and (3). d.f. 2, sig 51.1157
7 groups: 0 1 (2 3) 4 5 6 7
Test statistics for grouping: 1.6 10.7 7.9 1.7 7.7 5.1

Min stat is 1.6163, to merge (0) and (1). d.f. 2, sig 45.3755
6 groups: (0 1) (2 3) 4 5 6 7
Test statistics for grouping: 9.5 7.9 1.7 7.7 5.1

Min stat is 1.7396, to merge (4) and (5). d.f. 2, sig 43.0870
5 groups: (0 1) (2 3) (4 5) 6 7
Test statistics for grouping: 9.5 7.8 7.1 5.1

Min stat is 5.0593, to merge (6) and (7). d.f. 2, sig 7.6055
4 groups: (0 1) (2 3) (4 5) (6 7)
Test statistics for grouping: 9.5 7.8 14.4

Min stat is 7.7518, to merge (23) and (45). d.f. 2, sig 2.0360
4 groups: (0 1) (2 3) (4 5) (6 7)
Test stats for splitting
1.6 1.4 1.7 5.1
Max stat is 5.0593 to split group (67). d.f. 2 significance 7.6055
Best is 4 groups, with chi square 77.984 d.f. 6
Bonferroni multiplier, raw significance and MC significance
35. 6.40E-13 1.62E-10
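
The merge phase traced in the listing above can be sketched as a greedy loop over adjacent categories. The Python below is a simplified illustration, not FIRM's implementation. It compares the Pearson statistic with a fixed critical value, whereas FIRM compares significance levels computed by its own procedure. It omits the resplitting phase, and the toy counts are invented for the example. The critical value 3.841 is the standard 5% point of $\chi^2$ with 1 degree of freedom, which suits the two-outcome toy table.

```python
from typing import List, Sequence


def pearson_chi2(columns: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square for a table given as a list of columns (one per group)."""
    n_rows = len(columns[0])
    row_totals = [sum(col[r] for col in columns) for r in range(n_rows)]
    grand = sum(row_totals)
    stat = 0.0
    for col in columns:
        col_total = sum(col)
        for r in range(n_rows):
            expected = col_total * row_totals[r] / grand
            if expected > 0:
                stat += (col[r] - expected) ** 2 / expected
    return stat


def merge_adjacent(counts: List[List[float]], critical_value: float) -> List[List[int]]:
    """Merge adjacent categories while the smallest pairwise statistic is below
    critical_value; return the final groups as lists of category indices."""
    groups = [[i] for i in range(len(counts))]
    cols = [list(c) for c in counts]
    while len(groups) > 1:
        stats = [pearson_chi2([cols[i], cols[i + 1]]) for i in range(len(cols) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= critical_value:
            break  # even the most similar neighbours differ significantly
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
        cols[i:i + 2] = [[a + b for a, b in zip(cols[i], cols[i + 1])]]
    return groups


# Hypothetical counts: 4 ordered categories, 2 outcome classes each.
toy = [[30, 10], [28, 12], [10, 30], [12, 28]]
print(merge_adjacent(toy, critical_value=3.841))  # [[0, 1], [2, 3]]
```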

The floating predictors, as exemplified by MRP, have an extra wrinkle to them. Not only
can the monotonic portion of the scale be merged in the same way as with age, but the
floating category can be merged with any of them. Thus the detail starts out with two
lines—the test statistics for merging a successive pair on the scale 1-7 on the upper line,
and those for merging the floating category with each monotonic class on the second line.
CATFIRM checks the full list of (in this case 13) test statistics, and finds that the smallest
is 0.7021, for merging classes 1 and 2. These classes are merged and the analysis
repeated. In the second stage, it happens that the smallest test statistic is for merging the
category ? with 6, so these two categories are joined. Once this is done, there is no
longer a floating category, just the monotonic categories (12), 3, 4, 5, (?6) and 7, and
the subsequent lines of the analysis look like those of a monotonic predictor. There is a
slightly different twist at the last phase. Here the floating category is part of a three-class
composite (?67), and so there are three possible binary splits—? vs (67), (?6) vs 7 and
6 vs (?7). The test statistics for these are shown on two lines of the printout—2.5 and 4.3
for the first two, and 1.4 for the third. In reading these two lines, ignore the rightmost ?
on the first line and the leftmost on the second. Again, the split statistic is not significant,
and so the grouping (12) (345) (?67) is final.
Float MRP
Table has chi-square 117.765, with df 14 and significance 2.32E-16
8 groups: 1 2 3 4 5 6 7
Test statistics for grouping: 0.7 3.8 1.9 3.7 5.0 3.0
and for grouping ? with 17.9 16.6 6.1 8.8 1.4 0.9 4.5
Min stat is 0.7021, to merge (1) and (2). d.f. 2, sig 72.5715
7 groups: (1 2) 3 4 5 6 7
Test statistics for grouping: 5.9 1.9 3.7 5.0 3.0
and for grouping ? with 23.1 6.1 8.8 1.4 0.9 4.5

Min stat is 0.8839, to merge (?) and (6). d.f. 2, sig 65.2141
6 groups: (1 2) 3 4 5 (? 6) 7
Test statistics for grouping: 5.9 1.9 3.7 4.7 4.3

Min stat is 1.9242, to merge (3) and (4). d.f. 2, sig 38.4126
5 groups: (1 2) (3 4) 5 (? 6) 7
Test statistics for grouping: 8.6 3.3 4.7 4.3

Min stat is 3.3063, to merge (34) and (5). d.f. 2, sig 18.9915
4 groups: (1 2) (3 4 5) (? 6) 7
Test statistics for grouping: 11.8 34.9 4.3

Min stat is 4.2724, to merge (?6) and (7). d.f. 2, sig 11.8624
3 groups: (1 2) (3 4 5) (? 6 7)
Test statistics for grouping: 11.8 56.1

Min stat is 11.8477, to merge (12) and (345). d.f. 2, sig 0.2492
3 groups: (1 2) (3 4 5) (? 6 7 ?)
Test stats for splitting
0.7 1.5 3.3 2.5 4.3
1.4
Max stat is 4.2724 to split group (?67). d.f. 2 significance 11.8624
Best is 3 groups, with chi square 106.856 d.f. 4
Bonferroni multiplier, raw significance and MC significance
51. 2.04E-20 2.32E-16

Free predictors with more than three categories involve much more computation than
monotonic or floating predictors with the same number of categories. Their analysis
(though not the potential computational load) is illustrated by Pupils. Since any two
categories may be merged, each step of the merge phase computes and lists the lower
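triangle of a matrix of pairwise $\chi^2$ values, in this case of a 3 × 3 matrix. The first round
of testing merges the categories ? and 2, and the resulting composite cannot be merged
with category 1. In the splitting phase, the single statistic, listed as FT 3.6, shows that
resplitting this composite is not significant.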

If the splitting phase finds a split that is significant at the split significance level selected
for this run, this composite is split, and CATFIRM returns to the first phase of looking for
possible merges.

Since a c-category composite class of a free predictor can be resplit in $2^{c-1}$ ways, testing
for resplitting of free predictors with many classes can become an enormous
computational burden. It is partly for this reason and partly because of the method of
implementation that CATFIRM has an immutable upper limit of 16 on the number of
categories a free predictor may have.
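
To get a feel for how quickly this count grows, the following Python sketch enumerates every division of a composite into two non-empty groups. It is illustrative only, and the function name is hypothetical. Enumerated this way, a c-category composite has 2^(c−1) − 1 two-way splits, which is of the order quoted above.

```python
from itertools import combinations


def binary_splits(categories: str):
    """Yield every way of dividing the categories into two non-empty groups."""
    first, rest = categories[0], categories[1:]
    # Keeping the first category on the left avoids counting each split twice.
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            left = first + "".join(extra)
            right = "".join(c for c in rest if c not in extra)
            if right:  # skip the case where everything stays on the left
                yield left, right


print(list(binary_splits("?67")))
print(sum(1 for _ in binary_splits("0123456789ABCDEF")))  # 32767
```

The three splits listed for the composite (?67) are the same three discussed for MRP earlier, and a 16-category composite already has 32767.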

In analyses having many categories, a substantial amount of computing time can
sometimes be saved with a modest loss in the quality of the final grouping by skipping
the resplitting phase. This happens automatically whenever the split significance level
specified for the run is larger than the merge significance level.
Free Pupils
Table has chi-square 101.477 with d.f. 4 and signif 3.47E-17
Merge phase 3 groups ? 1
1 9.9
2 3.6 99.9
Min stat 3.62 to merge groups 1, 3 d.f. 2 signif 15.4512
Merge phase 2 groups ?,2
1 97.7
Min stat 97.71 to merge groups 1, 2 d.f. 2 signif 5.08E-20
Split phase 2 groups: (? 2 ) (1)
Test stats for splitting the group: ? 2
FT 3.6

Max stat is 3.6208 to split group numbered 1 as 1


d.f. 2 significance 15.4512
Best is 2 groups with chi squared 97.714 d.f. 2
Bonferroni multiplier, raw significance and MC significance
3. 5.08E-20 3.47E-17

The other optional and potentially huge file produced on request by CATFIRM contains the
contingency tables before and/or after the grouping of categories. This is illustrated by
the next section of printout from the output file headicat.tab. Scanning these tables is
often helpful in gaining a better perspective on the grouping that went before, in particular
regarding the number of cases in the groups that were merged.

The table for the total group before grouping with respect to the predictor Pupils (the
most significant predictor) is given first. This is followed by the table for the total group
after splitting on Pupils. Percentages are given for each category of the outcome
separately by category of the predictor. The total number of cases in each group is also
given, allowing the user to calculate the number of cases per combination of predictor
and outcome categories should that be required.
1Total group before grouping
***********************************************************************
predictor 6 Pupils *percent* total number 500
? 1 2 Total
Dead/veg 61.5 90.2 38.6 51.8
Severe 15.4 5.7 11.8 10.4
Mod/good 23.1 4.1 49.6 37.8
totals (100%) 13 122 365 500
Raw significance of table is 3.47E-17

1Total group after grouping


***********************************************************************
predictor 6 Pupils *percent* total number 500
?,2 1 Total
Dead/veg 39.4 90.2 51.8
Severe 11.9 5.7 10.4
Mod/good 48.7 4.1 37.8
totals (100%) 378 122 500

Raw significance of table is 5.08E-20
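
As a check on these tables, the cell counts can be recovered from the percentages and column totals, and the Pearson statistics recomputed. The Python below is a sketch under the assumption that each count is the nearest whole number to percentage × total / 100; the helper name is hypothetical. With these counts the statistics come out close to the 101.477 (before grouping) and 97.714 (after grouping) reported in the split file. The significance levels are not reproduced here, since they come from FIRM's own procedure.

```python
def pearson_chi2(columns):
    """Pearson chi-square for a table given as a list of columns (one per group)."""
    n_rows = len(columns[0])
    row_totals = [sum(col[r] for col in columns) for r in range(n_rows)]
    grand = sum(row_totals)
    stat = 0.0
    for col in columns:
        col_total = sum(col)
        for r in range(n_rows):
            expected = col_total * row_totals[r] / grand
            stat += (col[r] - expected) ** 2 / expected
    return stat


# Percentages (Dead/veg, Severe, Mod/good) and totals from the "before grouping" table.
percent = {"?": (61.5, 15.4, 23.1), "1": (90.2, 5.7, 4.1), "2": (38.6, 11.8, 49.6)}
totals = {"?": 13, "1": 122, "2": 365}

# Assumption: counts are the percentages applied to the totals, rounded to whole cases.
counts = {k: [round(p * totals[k] / 100) for p in percent[k]] for k in percent}

before = pearson_chi2([counts["?"], counts["1"], counts["2"]])
merged = [a + b for a, b in zip(counts["?"], counts["2"])]
after = pearson_chi2([merged, counts["1"]])

print(counts)
print(round(before, 2), round(after, 2))
```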

The final output file, headicat.spr, contains the split rule table of splits made during the
analysis. The complete file is given below. There is nothing of interest to most starting
users in this file. Its main use is for other programs that want to read and make use of the
results of a FIRM analysis. For the contents, layout and format of the split rule table,
please see the section on FIRM syntax in Chapter 7.

1 6 2 3 2
2 1 4 4 5 5 6 6 7 7
4 3 8 8 8 8 8 8 9 9
5 5 11 10 11 12
6 3 15 13 13 13 14 14 15 15
7 2 16 16 16 16 17 17 17 18
-1 -1
0
0
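
Since this file is meant to be read by other programs, a short Python sketch of such a reader may help. The interpretation used here is an assumption inferred from the run log ("Split on 6 making descendants 2 3 2"): each row gives a node number, the number of the splitting predictor, and then the descendant node for each category of that predictor, with the "-1 -1" row treated as an end marker. The referenced syntax documentation is the authoritative source for the layout.

```python
SPLIT_RULES = """\
1 6 2 3 2
2 1 4 4 5 5 6 6 7 7
4 3 8 8 8 8 8 8 9 9
5 5 11 10 11 12
6 3 15 13 13 13 14 14 15 15
7 2 16 16 16 16 17 17 17 18
-1 -1
0
0
"""


def parse_split_rules(text):
    """Return {node: (predictor_number, [descendant node per category])}."""
    rules = {}
    for line in text.splitlines():
        fields = [int(f) for f in line.split()]
        if not fields:
            continue
        if fields[0] < 0:  # assumed end-of-table marker
            break
        node, predictor, *descendants = fields
        rules[node] = (predictor, descendants)
    return rules


rules = parse_split_rules(SPLIT_RULES)
all_nodes = set(rules) | {d for _, ds in rules.values() for d in ds}
terminal = sorted(all_nodes - set(rules))
print(len(all_nodes), len(rules), len(terminal))  # 18 6 12
```

Read this way, the table describes 18 nodes, of which 6 are split and 12 are terminal.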

The final file produced by CATFIRM is the full-scale dendrogram (see below). The
dendrogram produced by CATFIRM is a very informative picture of the relationship
between these predictors and the patients’ final outcome. We see in it that the most
significant separation was obtained by splitting the full sample on the basis of the
predictor Pupils.

Figure 6.1: CATFIRM dendrogram of “head injuries” data set

The groups of 378 cases for which Pupils had the value 2 or the value ? (that is, missing)
were statistically indistinguishable, and so are grouped together. They constitute one of
the successor groups (node number 2), while those 122 for whom Pupils had the value 1
constitute the other (node number 3), a group with much worse outcomes—90% dead or
vegetative compared with 39% of those in node 2.

Each of these nodes in turn is subjected to the same analysis. The cases in node number 2
can be split again into more homogeneous subgroups. The most significant such split is
obtained by separating the cases into four groups on the basis of the predictor age. These
are patients under 20 years old (node 4), patients 20 to 40 years old (node 5), those 40 to
60 years old (node 6) and those over 60 (node 7). The prognosis of these patients
deteriorates with increasing age; 70% of those under 20 ended with moderate or good
recoveries, while only 12% of those over 60 did.

Node 3 is terminal. Its cases cannot (at the significance levels selected) be split further.
Node 4 can be split using MRP. Cases with MRP = 6 or 7 constitute a group with a
favorable prognosis (86% with moderate to good recovery), and the other MRP levels
constitute a less-favorable group but still better than average.

These groups, and their descendants, are analyzed in turn in the same way. Ultimately no
further splits can be made and the analysis stops. Altogether 18 nodes are formed, of
which 12 are terminal and the remaining 6 intermediate. Each split in the dendrogram
shows the variable used to make the split and the values of the splitting variable that
define each descendant node. It also lists the statistical significance (p-value) of the split.
Two p-values are given: that on the left is FIRM’s conservative p-value for the split. The
p-value on the right is a Bonferroni-corrected value, reflecting the fact that the split
actually made had the smallest p-value of all the predictors available to split that node. So
for example, on the first split, the actual split made has a p-value of $1.52 \times 10^{-19}\,\%$. But
when we recognize that this was selected because it was the most significant of the 6
possible splits, we may want to scale up its p-value by the factor 6 to allow for the fact
that it was the smallest of 6 p-values available (one for each predictor). This gives its
conservative Bonferroni-adjusted p-value as $9.15 \times 10^{-19}\,\%$.

The dendrogram and the analysis giving rise to it can be used in two obvious ways—for
making predictions, and to gain understanding of the importance of and interrelationships
between the different predictors. Taking the prediction use first, the dendrogram provides
a quick and convenient way of predicting the outcome for a patient. Following patients
down the tree to see into which terminal node they fall yields 12 typical patient profiles
ranging from 97.1% dead/vegetative to 86.1% with moderate to good recoveries. The
dendrogram is often used in exactly this way to make predictions for individual cases.
Unlike, for instance, predictions made using multiple regression or discriminant analysis,
the prediction requires no arithmetic—merely the ability to take a future case down the
tree. This allows for quick predictions with limited potential for arithmetic errors. A
hospital could, for example, use the dendrogram to give an immediate estimate of a head-
injured patient’s prognosis.
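
Taking a case down the tree can be sketched in Python as a table lookup. The rules below are copied from the split rule table and the category symbols from the predictor summary. The mapping of each symbol's position to a descendant node is an assumption inferred from the run log, and the function name and the dictionary layout of a case are hypothetical.

```python
# Predictor numbers, names and category symbols from the predictor summary.
NAMES = {1: "age", 2: "EMV", 3: "MRP", 4: "Change", 5: "Eye ind", 6: "Pupils"}
SYMBOLS = {1: "01234567", 2: "?1234567", 3: "?1234567", 4: "?123", 5: "?123", 6: "?12"}

# node -> (splitting predictor, descendant node for each category symbol in order)
RULES = {
    1: (6, [2, 3, 2]),
    2: (1, [4, 4, 5, 5, 6, 6, 7, 7]),
    4: (3, [8, 8, 8, 8, 8, 8, 9, 9]),
    5: (5, [11, 10, 11, 12]),
    6: (3, [15, 13, 13, 13, 14, 14, 15, 15]),
    7: (2, [16, 16, 16, 16, 17, 17, 17, 18]),
}


def path_to_terminal_node(case, node=1):
    """Follow one case from the root to its terminal node; return the nodes visited."""
    path = [node]
    while node in RULES:
        predictor, descendants = RULES[node]
        symbol = case[NAMES[predictor]]  # category symbol recorded for this case
        node = descendants[SYMBOLS[predictor].index(symbol)]
        path.append(node)
    return path


print(path_to_terminal_node({"Pupils": "1"}))                          # [1, 3]
print(path_to_terminal_node({"Pupils": "2", "age": "0", "MRP": "7"}))  # [1, 2, 4, 9]
```

Only the predictors actually met along the path need to be supplied, which mirrors the point made above that the prediction requires no arithmetic.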

6.3 CATFIRM Analysis of the “mixed” Data Set

6.3.1 Description of the Data

The “mixed” data set was provided by Gordon V. Kass. The original file contains the
records of nearly 20,000 students at the University of the Witwatersrand. From this file,
we randomly extracted 500 for use as a FIRM calibration data set and another 500 to
illustrate validation. Most of the predictors are of type real, being percentage scores
attained by the students for a variety of subjects in high school. In addition, the data set
contains four character predictors:

• mattype: indicating which of several possible university entrance qualifying
examinations the student wrote,
• sex: gender of the student,
• faculty: a subject-matter grouping like the colleges of many US universities, and
• matmonth: the month in which the student wrote the final school-leaving
examination.

The dependent variable for the CATFIRM run was also of the type character—a
promotion code. Possible values for this were P for a clear pass for the year; F for a clear
fail for the year; and C for a partial credit on courses for the year. There were a number of
other low-frequency codes, which were all changed to ? as part of the initial work on the
data file, along with the genuinely missing values.

6.3.2 FIRM Syntax

The CATFIRM syntax for the analysis of the “mixed” data is given below:
CATFIRM: den sum
15 ‘Promo code’ -1
12
‘Afrikaans’ 7 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Biology’ 10 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘English’ 6 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Faculty’ 13 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘History’ 12 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Latin’ 11 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Math’ 8 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Mattype’ 3 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘Matmonth’ 4 0 ‘f’ 0 13 ‘?123456789ABC’ 0.9000 1.0000
‘Matyear’ 5 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Science’ 9 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Sex’ 2 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
0 1000 25 .10000 1.00000 25 .500 0 1 0 0 0 0 0 0

• The first line of the syntax file specifies the inclusion of the summary statistics
(keyword sum) and the dendrogram (keyword den) in the output file, rather than
having this information written to various external files.
• The dependent variable, Promo code, is in the 15th position in the data set. The
artificial code -1 indicates that the dependent variable is a character variable and
that FIRM should find the separate categories from the data.
• There are 12 predictors, as indicated by the number 12 in the third line of the
syntax.
• The information for each predictor follows next, and is given in the following
order: the name of the predictor, immediately followed by its position in the data.
In the case of Afrikaans, for example, the predictor is in the 7th position in the
data.
• 0 indicates that this predictor may be used for splitting. Afrikaans is a real
predictor (type of predictor = r). The monotonic, free and float predictor types all
require that the predictor’s values be consecutive integers. The next two fields
should contain information on this range, and are only required for these three
predictor types. For a real predictor, 0s are used instead.
• This is usually followed by the category codes. In this case, the absence of
category codes is indicated by “ “. Note that category codes and range of values
are provided in the case of the predictor Matmonth, which is a free predictor.
• The last two entries in this line are the splitting and merging significance values.
These levels are given in percent. In this example, a 0.9% significance level will
be used for splitting and 1% for merging.
• The last line of the syntax file contains the output options for this CATFIRM
analysis. The options specified are as follows:
0 1000 25 .10000 1.00000 25 .500 0 1 0 0 0 0 0 0

0: Table format. The option 0 gives the table as a percentage frequency
breakdown of the dependent variable for all cases in the node, and for the
cases broken down by the different categories of the predictor.
1000: The detail output code is a single number reflecting the answers to
all three questions about the printed information, and is calculated as
follows:

Q1: Do you want details of splits not used?
Q2: Do you want cross tabs before grouping?
Q3: Do you want cross tabs after grouping?

Score 1 for Yes and 2 for No; then compute this option as

1000 × Q1 + Q2 + 2 × Q3 − 1003.

In this case, detailed analysis of splits is not requested (Q1 = 2), but
cross-tabs are requested (Q2 = Q3 = 1).
25: The minimum size a node must have to be considered for further
splitting.
0.10000: The raw significance level for a split to be made.
1.0000: The conservative significance level that a predictor must attain
for the split to be made.
25: The maximum number of nodes to be analyzed.
0.500: Constant to be added to χ².
0: The data for this analysis is in free format. If fixed format had been
indicated, a format statement would have followed directly after this line
of input.
1: Specifies use of FIRM 2.1 methodology.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.

For more information on syntax, see the sections on the dependent and predictor variable
specifications in Chapter 8.

6.3.3 CATFIRM Output

The first part of the output file for the CATFIRM analysis of the “mixed” data concerns
the specification of syntax, input and output files.
The following lines were read from file H:\FIRM\DATA\MIXED.CAT

CATFIRM: den sum


15 ‘Promo code’ -1
12
‘Afrikaans’ 7 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Biology’ 10 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘English’ 6 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Faculty’ 13 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘History’ 12 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Latin’ 11 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Math’ 8 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Mattype’ 3 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘Matmonth’ 4 0 ‘f’ 0 13 ‘?123456789ABC’ 0.9000 1.0000
‘Matyear’ 5 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Science’ 9 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Sex’ 2 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
0 1000 25 .10000 1.00000 25 .500 0 1 0 0 0 0 0 0

CATFIRM. Formal Inference-based Recursive Modeling


Program dimensions
Maximum number of predictors 1000
Maximum number of categories in
Predictors 16
Dependent variable 16

the data file (input):


H:\FIRM\DATA\MIXED.dat
the detailed analysis of splits (output):


H:\FIRM\DATA\MIXED.spl
the contingency tables made (output):
H:\FIRM\DATA\MIXED.tab
the split rule table of splits made (output):
H:\FIRM\DATA\MIXED.spr
Run now starting...
All data in. 500 cases read with
500 retained.

This is followed by basic splitting and merging information and the layout of the
dendrogram.
Starting node 1 descended from 0
Split on 4 making descendants 2 3 2 4 5 3 4 5 5 4

Starting node 2 descended from 1


Split on 7 making descendants 6 6 6 6 7 7 7 7 7 7 7
Starting node 3 descended from 1

Starting node 4 descended from 1


Split on 11 making descendants 9 8 8 8 8 9 10 10 10 10 10

Starting node 5 descended from 1


Starting node 6 descended from 2
Starting node 7 descended from 2
Starting node 8 descended from 4
Starting node 9 descended from 4
Starting node 10 descended from 4

Mini-dendrogram of analysis

Legend: node number


splitting variable
------------------------------------
horizontal line connects descendants

1
4
--------------------------------------------------
2 3 4 5
7 11
------------ -----------------------
6 7 8 9 10

The output of summary and error information follows. This information is written to the
output file if the keyword sum is used in the first line of the syntax file.

Immediately following the echo of the options, the different codes seen for the dependent
variable are listed—these were F, P, ? and C. These values will be used in the run as the
category labels. This is followed by the code list for the character predictors. The
predictor Mattype, for example, took on 11 different values all of which (as it happens)
were one character long (character predictors do not have to be single-character—they
can have length up to 20). These values being necessarily different, they will be used as
the category symbols for this predictor.

---------------------------------------------
| Output of Summary and Error Information |
---------------------------------------------

CATFIRM Formal Inference-based Recursive Modeling.


Categorical dependent variable
Version 2.3 1999/09/04
Copyright 1999 Douglas M Hawkins
Applied Statistics
University of Minnesota
CATFIRM. Formal Inference-based Recursive Modeling

Program dimensions
Maximum number of predictors 1003
Maximum number of categories in
Predictors 16
Dependent variable 16

There are 12 predictors as follows


Type # cats Cat symbols Use? Split% Merge%
real 0 May 0.90 1.00 Afrikaans
real 0 May 0.90 1.00 Biology
real 0 May 0.90 1.00 English
char 0 May 0.90 1.00 Faculty
real 0 May 0.90 1.00 History
real 0 May 0.90 1.00 Latin
real 0 May 0.90 1.00 Math
char 0 May 0.90 1.00 Mattype
free 13 ?123456789ABC May 0.90 1.00 Matmonth
real 0 May 0.90 1.00 Matyear
real 0 May 0.90 1.00 Science
char 0 May 0.90 1.00 Sex
Option 1 is 0.
Option 2 is 1000.
Option 3 is 25.
Option 4 is 0.
Option 5 is 1.
Option 6 is 25.
Option 7 is 1.
Option 8 is 0.
Option 9 is 1.
All data in. 500 cases read with
500 retained.

Character dependent var has values


F
P
?
C

Character predictors seen in the data and their values are:

Predictor 4 Faculty values seen


A C M S
D E F H
B L
These values will be abbreviated to: ACMSDEFHBL
Predictor 8 Mattype values seen
5 1 6 J
F 7 ? 4
3 A 2
These values will be abbreviated to: 516JF7?43A2

Predictor 12 Sex values seen


M F
These values will be abbreviated to: MF

Next there is a listing of the cutpoints for the real predictors, which is needed when
reading the rest of the summary output and the dendrogram. Only one real predictor
from this list, Afrikaans, is shown below.

The cutpoints 45, 51, 54, 55, 57, 62, 64, 65 and 75 are the best we can find for dividing
the data set up into 10 groups of about equal size, and are the starting cutpoints from
which FIRM will do its subsequent merging to a final grouping.

Continuous predictors seen in the data and their cutpoints are:

Predictor 1 Afrikaans
Code Max value in class
? Invalid or missing
0 <= 45.000
1 <= 51.000
2 <= 54.000
3 <= 55.000
4 <= 57.000
5 <= 62.000
6 <= 64.000
7 <= 65.000
8 <= 75.000
9 > 75.000
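
The coding of a real predictor into the classes listed above can be sketched as follows. This is illustrative Python, not FIRM code: the function name class_code is hypothetical, and treating only None as "invalid or missing" is an assumption made for the example.

```python
import bisect

# Upper class limits for Afrikaans, taken from the listing above.
AFRIKAANS_CUTPOINTS = [45, 51, 54, 55, 57, 62, 64, 65, 75]


def class_code(score, cutpoints=AFRIKAANS_CUTPOINTS):
    """Return the class code ('0'..'9', or '?' for a missing score).

    Class k holds scores that are <= cutpoints[k] and greater than the
    previous cutpoint; the last class holds everything above the final
    cutpoint.
    """
    if score is None:  # assumption: None stands for invalid or missing
        return "?"
    return str(bisect.bisect_left(cutpoints, score))


for score in (45, 46, 55, 75, 76, None):
    print(score, class_code(score))
```

Using bisect_left places a score equal to a cutpoint in the class whose maximum it is, which matches the "<=" limits in the listing.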

Options in effect:
Tables printed as column percentages
Detail output on file split
Tables given before and after each step
To be analysed, a group must:
have at least 25 cases;
be significant at the 0.100% level;
be conservatively significant at the 1.000% level.
The run will terminate when 25 groups have been formed.
Modify chi sq statistic by 0.50000
Tests use new (FIRM 2.1) procedures

This is followed by a summary output of the same format as used for the “head injuries”
data set. The full data set can be split using almost any of the predictors. The most
significant is a four-way split on Faculty into the groupings (AM), (CE), (SFL)
and (DBH). The summary file output continues, showing the same information for the
descendant nodes.
Summary of results of node number 1 predecessor node number 0

Total group
no. Name Signif % Bonf sig % MC sig % groups
1 Afrikaans 0.0189 7.0146 0.9615 012 3 ?4 56789
2 Biology 1.49E-03 0.0284 9.1541 0123?456 789
3 English 1.18E-08 4.38E-06 1.75E-04 ? 012345 6 789
4 Faculty 7.45E-14 2.54E-09 7.39E-07 AM CE SFL DBH
5 History 8.75E-04 0.0805 4.3416 012 ?345 678
6 Latin 100.00 100.00 100.00 01?2345678
7 Math 4.75E-05 5.56E-03 0.1141 012 3?456 789
8 Mattype 7.57E-04 0.7744 6.5932 56421JAF37 ?
9 Matmonth 3.23E-06 4.10E-04 1.1351 ?347 1B6C
10 Matyear 1.65E-03 0.1518 0.2005 ? 01234 5678
11 Science 5.38E-05 6.29E-03 0.7419 012 ?34567 89
12 Sex 1.2017 1.2017 1.2017 M F

Characteristics of the best predictor

4 Faculty 7.45E-14 2.54E-09 7.39E-07 AM CE SFL DBH


***************************************************************************
predictor 4 Faculty *percent* total number 500
A,M C,E S,F,L D,B,H Total
F 17.9 34.9 30.3 9.1 25.0
P 70.3 49.3 29.4 61.4 54.2
? 9.7 11.8 19.3 29.5 14.2
C 2.1 3.9 21.1 0.0 6.6
totals (100%) 195 152 109 44 500
Raw significance of table is 7.45E-14

The final output provided is the dendrogram, as shown below.


| 1
split var +---------+
P val (Bonf P) | 25.0%|
| | 54.2%|
--------------- | 14.2%|
| | 6.6%|
levels | 500|
Node | +---------+
+---------+ |
| % F | Faculty
| % P | 2.54E-09;(3.05E-08)
| % ? | |
| % C | -----------------------------------------------------------
| _N_ | AM CE SFL DBH
+---------+ 2 | 3 | 4 | 5 |
+---------+ +---------+ +---------+ +---------+
| 17.9%| | 34.9%| | 30.3%| | 9.1%|
| 70.3%| | 49.3%| | 29.4%| | 61.4%|
| 9.7%| | 11.8%| | 19.3%| | 29.5%|
| 2.1%| | 3.9%| | 21.1%| | 0.0%|
| 195| | 152| | 109| | 44|
+---------+ +---------+ +---------+ +---------+
| |
Math Science
4.35E-03;(0.0522) 3.10E-03;(0.0372)
| |
------------- --------------------------
?012 345789 0123 ?4 56789
6 | 7 | 8 | 9 | 10 |
+---------+ +---------+ +---------+ +---------+ +---------+
| 28.0%| | 7.4%| | 63.9%| | 13.3%| | 14.0%|
| 54.0%| | 87.4%| | 13.9%| | 16.7%| | 51.2%|
| 14.0%| | 5.3%| | 13.9%| | 33.3%| | 14.0%|
| 4.0%| | 0.0%| | 8.3%| | 36.7%| | 20.9%|
| 100| | 95| | 36| | 30| | 43|
+---------+ +---------+ +---------+ +---------+ +---------+

In the complete group, 25% of the students failed, 54.2% passed, and 6.6% of the
students obtained some college credits for the year completed. In 14.2% of the cases
information was incomplete or unavailable. The most important predictor was
Faculty, which divided the group into four subsets: (AM), (CE), (SFL) and (DBH),
representing subsets of the subject-matter groupings.
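
The percentages for the complete group are the size-weighted averages of the column percentages in the four Faculty subsets. The short Python sketch below, which is illustrative only and uses hypothetical variable names, recomputes the Total column of the Faculty table from the subset columns and sizes.

```python
# Subset sizes and column percentages copied from the Faculty table.
sizes = {"AM": 195, "CE": 152, "SFL": 109, "DBH": 44}
percent = {
    "F": {"AM": 17.9, "CE": 34.9, "SFL": 30.3, "DBH": 9.1},
    "P": {"AM": 70.3, "CE": 49.3, "SFL": 29.4, "DBH": 61.4},
    "?": {"AM": 9.7, "CE": 11.8, "SFL": 19.3, "DBH": 29.5},
    "C": {"AM": 2.1, "CE": 3.9, "SFL": 21.1, "DBH": 0.0},
}

total_cases = sum(sizes.values())


def overall_percent(row):
    """Size-weighted average of one promotion code's column percentages."""
    return sum(row[group] * sizes[group] for group in sizes) / total_cases


overall = {code: round(overall_percent(row), 1) for code, row in percent.items()}
print(total_cases, overall)
```

To one decimal place the results agree with the Total column of the table (25.0, 54.2, 14.2 and 6.6 for F, P, ? and C).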

6.4 CONFIRM Analysis of the “head injuries” Data Set


6.4.1 Description of the data

The “head injuries” data set of Titterington et al. (1981), discussed in Section 6.2, is
analyzed here using CONFIRM. For the CATFIRM analysis of these data, see Section 6.2.
The outcome is predicted on the basis of 6 predictors assessed on the patients’ admission
to the hospital:

• age. The age of the patient. This is grouped into decades in the original data, and
is grouped the same way here. It has eight classes.
• EMV. This is a composite score of three measures—of eye-opening in response to
stimulation, motor response of best limb, and verbal response. This has seven
classes, but is not measured in all cases, so that there are eight possible codes for
this score—the seven measurements and an eighth “missing” category.
• MRP. This is a composite score of motor responses in all four limbs. This also has
seven measurement classes with an eighth class for missing information.
• Change. The change in neurological function over the first 24 hours. This was
graded 1, 2, or 3, with a fourth class for missing information.
• Eye indicator. A summary of diagnostics on the eyes. This too had three
measurement classes, with a fourth for missing information.
• Pupils. Pupil reaction to light—present, absent, or missing.

In the CONFIRM analysis the dependent variable is treated as being on the interval scale
of measurement with values 1, 2, and 3. As the outcome is on the ordinal scale, this
equally-spaced scale is not necessarily statistically appropriate in this data set and is used
as a matter of convenience rather than with the implication that this is considered the best
way to proceed.

6.4.2 FIRM Syntax

The CONFIRM syntax for the analysis of the “head injuries” data is given below:
CONFIRM: DEN SUM SPLIT RULE
1 ‘Outcome’ 0
6
‘age’ 2 1 ‘m’ 0 8 ‘01234567’ 0.9000 1.0000
‘EMV’ 3 0 ‘1’ 0 8 ‘?1234567’ 0.9000 1.0000
‘MRP’ 4 0 ‘1’ 0 8 ‘?1234567’ 0.9000 1.0000
‘Change’ 5 0 ‘1’ 0 4 ‘?123’ 0.9000 1.0000
‘Eye ind’ 6 0 ‘1’ 0 4 ‘?123’ 0.9000 1.0000
‘Pupils’ 7 0 ‘f’ 0 3 ‘?12’ 0.9000 1.0000
1 20 .00100 .50000 1.00000 50 0 0 1 .0000000 1 0 0 0 0

• The first line indicates that a CONFIRM analysis is required, and that a
dendrogram (keyword den), summary statistics (keyword sum), the split file
(keyword split) and the split rule file (keyword rule) are to be included in the
output file.
• The second line contains information on the dependent or outcome variable. It is
the first variable in the data set, as indicated by the 1 in the first field, and it is
used here as a continuous variable, as indicated by the 0 in the third entry on this
line. Because of this specification, no category names are required for the FIRM
analysis. This is the dependent variable specification section of the syntax file.
• The third line has one entry—6—indicating the number of predictor variables to
be used in the analysis. This is the start of the predictor specification section in the
syntax file. The next six lines provide information on each of the predictors.
• The variable age, grouped into decades as described in the previous subsection, is
the second variable in the data set (position of predictor = 2). The next value, 1,
indicates that the variable is to be carried along but not used for splitting. Age is a
monotonic predictor (type of predictor = m). The range of values of age is 0 to 8,
as indicated in the next two fields. This is followed by the category codes, in this
case 0 to 7. The last two entries in this line are the splitting and merging
significance values. These levels are given in percent. In this example, a 0.9%
significance level will be used for splitting and 1% for merging.
• The information for the other predictors, EMV, MRP, Change, Eye ind and Pupils,
is given in a similar way, concluding the predictor variable specification.
• The last line of the syntax file contains the output options for this CONFIRM
analysis.
1 20 .00100 .50000 1.00000 50 0 0 1 .0000000 1 0 0 0 0

The options specified are as follows:

1: The detailed split file is requested.
20: Minimum number of cases in a group. In this case, no group smaller
than 20 cases will be considered for splitting.
0.00100: The minimum proportion of SSD to analyze the group.
0.50000: The raw significance level.
1.0000: The conservative significance level that a predictor must attain
for the split to be made.
50: The maximum number of nodes to be analyzed.
0: The external degrees of freedom on the variance estimate. In this
example, no such information is available and the 0 indicates that.
0: The data for this analysis is in free format. If fixed format had been
indicated, a format statement would have followed directly after this line
of input.
1: The use of the pooled variance in the t-statistic for the testing of pairs of
categories for compatibility is requested.
.0000000: No external variance estimate is used.
1: Indicates use of FIRM 2.1 methodology rather than FIRM 2.0
methodology.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.

For more information on syntax, see the sections on the dependent and predictor variable
specifications in Chapter 8.
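
As a reading aid, the nine fields of a predictor specification line can be split apart as in the following Python sketch. It is not part of FIRM: the class PredictorSpec, the function parse_predictor_line and all field names are hypothetical, chosen to follow the field descriptions given above, and the quote handling is an assumption about how the lines are typed.

```python
import shlex
from dataclasses import dataclass


@dataclass
class PredictorSpec:
    name: str          # predictor name
    position: int      # position of the predictor in the data
    use_flag: int      # third field, the use indicator described in the text
    kind: str          # predictor type code, e.g. 'm', 'f', 'r' or 'c'
    range_low: int     # first range field (0 for real predictors)
    range_high: int    # second range field (0 for real predictors)
    categories: str    # category codes, empty when none are given
    split_pct: float   # splitting significance level, in percent
    merge_pct: float   # merging significance level, in percent


def parse_predictor_line(line: str) -> PredictorSpec:
    """Split one predictor specification line into its nine fields."""
    # Typeset manuals often show curly quotes; normalise them to ASCII.
    ascii_line = line.replace("\u2018", "'").replace("\u2019", "'")
    fields = shlex.split(ascii_line)
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, found {len(fields)}")
    name, pos, use, kind, low, high, cats, split, merge = fields
    return PredictorSpec(name, int(pos), int(use), kind, int(low),
                         int(high), cats, float(split), float(merge))


print(parse_predictor_line("'age' 2 1 'm' 0 8 '01234567' 0.9000 1.0000"))
print(parse_predictor_line("'Pupils' 7 0 'f' 0 3 '?12' 0.9000 1.0000"))
```

shlex.split is used so that quoted names containing a space, such as Eye ind, stay in one field.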

6.4.3 CONFIRM Output

The first few lines of the output file contain information on the version of the FIRM
analysis, copyright and contact information. This is followed by the syntax used in this
analysis.
DATE: 04/21/2000
TIME: 09:59
C O N F I R M

Version 2.3 1999/09/04


Copyright 1999 Douglas M Hawkins
Applied Statistics
University of Minnesota
Formal Inference-based Recursive Modeling

Scientific Software International, Inc.


7383 North Lincoln Avenue, Suite 100
Lincolnwood, IL 60712-1704, U.S.A.
Phone: (800)247-6113, (847)675-0720, Fax: (847)675-2140
Website: www.ssicentral.com
Techsupport: [email protected]

The following lines were read from file H:\FIRM\DATA\HEADINJ.CON

CONFIRM: DEN SUM SPLIT RULE


1 ‘Outcome’ 0
6
‘age’ 2 1 ‘m’ 0 8 ‘01234567’ 0.9000 1.0000
‘EMV’ 3 0 ‘1’ 0 8 ‘?1234567’ 0.9000 1.0000
‘MRP’ 4 0 ‘1’ 0 8 ‘?1234567’ 0.9000 1.0000
‘Change’ 5 0 ‘1’ 0 4 ‘?123’ 0.9000 1.0000
‘Eye ind’ 6 0 ‘1’ 0 4 ‘?123’ 0.9000 1.0000
‘Pupils’ 7 0 ‘f’ 0 3 ‘?12’ 0.9000 1.0000
1 20 .00100 .50000 1.00000 50 0 0 1 .0000000 1 0 0 0 0

Initial information on the analysis is given next, followed by a mini-dendrogram. In
both, it can be seen that the first split was on variable number 6, and two new nodes
were formed. The first of these nodes was split next using predictor 3, while the
other node was split using predictor 1.

Run now starting....


All data in. 500 cases read with
500 retained.
Start FIRM processing

Starting analysis of node 1


Split on 6 making descendants 2 2 3

Starting analysis of node 2


Split on 3 making descendants 5 4 4 4 4 4 4 5

Starting analysis of node 3


Split on 1 making descendants 6 6 6 6 7 7 8 8

Starting analysis of node 4

Starting analysis of node 5

Starting analysis of node 6


Split on 5 making descendants 10 9 10 10

Starting analysis of node 7


Split on 3 making descendants 12 11 11 11 11 11 12 12

Starting analysis of node 8


Split on 2 making descendants 13 -1 13 13 13 13 13 14

Starting analysis of node 10


Split on 2 making descendants 16 15 15 16 16 16 16 16

Starting analysis of node 11

Starting analysis of node 12


Split on 2 making descendants 17 -1 -1 17 17 18 19 19
.
.

Mini-dendrogram of analysis

Legend: node number


splitting variable
------------------------------------
horizontal line connects descendants

1
6
-------------------------------
2 3
3 1
--------- ---------------------------------------
4 5 6 7 8
5 3 2
---------- ---------- ---------
9 10 11 12 13 14
2 2
---------- ------------------
15 16 17 18 19

As the sum, den, split and rule keywords were included in the syntax file, the
summary information is not written to an external file, but forms the next part of the
output file.
---------------------------------------------
| Output of Summary and Error Information |
---------------------------------------------

Program dimensions:
Maximum number of predictors 1003
Maximum number of predictor categories 20

Dependent variable no. 1 is named Outcome


Predictor variables
no posn name no of cats split merge may? type
1 2 age 8 0.900 1.000 yes mono
2 3 EMV 8 0.900 1.000 yes flt
3 4 MRP 8 0.900 1.000 yes flt
4 5 Change 4 0.900 1.000 yes flt
5 6 Eye ind 4 0.900 1.000 yes flt
6 7 Pupils 3 0.900 1.000 yes free

Run options in effect


Full split/merge details of predictors
For a group to be analyzed, it must: -
contain at least 20 cases;
have at least proportion 0.00100 of starting ssd.
Minimum % raw significance to split 0.500
Minimum % conservative significance to split 1.000
Analysis will stop after 50 groups have been formed
Error variance is pooled Anova MS
New (FIRM 2.1 and newer) p-values used
All data in. 500 cases read with
500 retained.

A summary of the options used in splitting and merging of the groups is followed by
specific information on the splitting of each node. The primary use of this information is
for exploration of possible splits that were not used.

When we look at the analysis of the first node, we see that all of the predictors were able
to provide highly significant splits into 2, 3, or 4 category groupings. While all predictors
are highly significant, Pupils gives the most significant split.
***********************************************************************
Analysis of group no. 1 previous group no. 0
no. name mc-sig(%) bonf-sig(%), grouping
1 age 8.76E-13 4.70E-13 01 23 45 67
2 EMV 1.27E-21 1.31E-21 12 345 ?6 7
3 MRP 3.35E-21 1.93E-22 12 345 ?67
4 Change 5.99E-03 0.0202 ?1 23
5 Eye ind 1.30E-21 3.25E-21 1 ?2 3
6 Pupils 1.42E-22 2.02E-22 1? 2

Best predictor
6 Pupils 1.42E-22 2.02E-22 1? 2

The one-way analysis of variance for the split actually made is given next, followed by
the summary statistics of the descendant groups formed by the split. This information is
duplicated in the dendrogram, and so it is not of particular interest here. What is
interesting is the fact that in the whole data set, all predictors are highly significant, but
their relative predictive powers seem to change as we go down the dendrogram. Look, for
example, at the relative size of the p-values of age and of the clinical observations as one
goes down the tree. A portion of the output is given below.
Analysis of variance
Sum of squares mean square degrees of freedom
Grouping 84.2132 84.2132 1
Error 353.9868 0.7108 498

F-value 118.474
significance 6.75E-23
Bonferroni P 2.02E-22
multiple comparison P 1.42E-22
overall conservative P 1.42E-22
adjusted for predictor count 8.53E-22

Grouping is significant at the conservative 8.53E-22% level

Statistics for grouping


Node Mean s.d. size s.e. (mean)
2 1.185185 0.520980 135 0.0448389
3 2.109589 0.934116 365 0.0488939

***********************************************************************
Analysis of group no. 2 previous group no. 1
no. name mc-sig(%) bonf-sig(%), grouping
1 age 100.00 100.00 01234567
2 EMV 0.1413 0.0493 123456 ?7
3 MRP 9.19E-03 1.12E-04 123456 ?7
4 Change 100.00 100.00 ?123
5 Eye ind 6.55E-03 5.14E-03 12 ?3
6 Pupils 0.1505 0.1505 1 ?

Best predictor
3 MRP 9.19E-03 1.12E-04 123456 ?7

Analysis of variance
Sum of squares mean square degrees of freedom
Grouping 7.0729 7.0729 1
Error 29.2975 0.2203 133

F-value 32.108
significance 8.65E-06
Bonferroni P 1.12E-04
multiple comparison P 9.19E-03
overall conservative P 1.12E-04
adjusted for predictor count 6.74E-04

Grouping is significant at the conservative 6.74E-04% level

Statistics for grouping


Node Mean s.d. size s.e. (mean)
4 1.073394 0.325076 109 0.0311366
5 1.653846 0.845804 26 0.1658758

***********************************************************************
Analysis of group no. 3 previous group no. 1
no. name mc-sig(%) bonf-sig(%), grouping
1 age 1.22E-12 7.84E-13 0123 45 67
2 EMV 5.68E-09 3.68E-09 12 345 ?6 7
3 MRP 1.21E-08 4.03E-10 12345 ?67
4 Change 0.2076 0.2994 ?1 23
5 Eye ind 1.47E-06 3.30E-06 1 ?2 3
6 Pupils 100.00 100.00 2

Best predictor
1 age 1.22E-12 7.84E-13 0123 45 67

Analysis of variance
Sum of squares mean square degrees of freedom
Grouping 56.6016 28.3008 2
Error 261.0148 0.7210 362

F-value 39.250
significance 3.73E-14
Bonferroni P 7.84E-13
multiple comparison P 1.22E-12
overall conservative P 7.84E-13
adjusted for predictor count 3.92E-12

Grouping is significant at the conservative 3.92E-12% level

Statistics for grouping


Node Mean s.d. size s.e. (mean)
6 2.385965 0.860374 228 0.0569796
7 1.876543 0.913547 81 0.1015052
8 1.321429 0.690379 56 0.0922558

***********************************************************************
Analysis of group no. 4 previous group no. 2
no. name mc-sig(%) bonf-sig(%), grouping
1 age 100.00 100.00 01234567
2 EMV 100.00 100.00 ?1234567
3 MRP 100.00 100.00 123456
4 Change 100.00 100.00 ?123
5 Eye ind 100.00 100.00 ?123
6 Pupils 100.00 100.00 ?1
No prediction possible.

***********************************************************************
Analysis of group no. 5 previous group no. 2
no. name mc-sig(%) bonf-sig(%), grouping
1 age 100.00 100.00 01234567
2 EMV 100.00 100.00 ?1234567
3 MRP 100.00 100.00 ?7
4 Change 100.00 100.00 ?123
5 Eye ind 100.00 100.00 ?123
6 Pupils 100.00 100.00 1?
No prediction possible.
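
The F-values in the analysis of variance tables above follow directly from the sums of squares and degrees of freedom. The following Python sketch is illustrative only, with a hypothetical function name f_value, and recomputes them for nodes 1, 2 and 3.

```python
def f_value(ss_grouping, df_grouping, ss_error, df_error):
    """One-way ANOVA F-value: grouping mean square over error mean square."""
    ms_grouping = ss_grouping / df_grouping
    ms_error = ss_error / df_error
    return ms_grouping / ms_error


# (sum of squares, degrees of freedom) for Grouping and Error, per node.
anova_tables = {
    1: (84.2132, 1, 353.9868, 498),
    2: (7.0729, 1, 29.2975, 133),
    3: (56.6016, 2, 261.0148, 362),
}

for node, table in anova_tables.items():
    print(node, round(f_value(*table), 3))
```

Up to rounding of the printed sums of squares, the results agree with the F-values 118.474, 32.108 and 39.250 shown in the output.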

As with the summary statistics, the use of the den keyword leads to the inclusion of the
full dendrogram in the output file, rather than the writing of this information to an
external file. The dendrogram thus forms the next part of the CONFIRM output file
considered here.

LEGEND:
split var
P val (Bonf P)
|
+------+ ------------
| 500| |
|1.8600| levels
|0.9371| Node |
|1.0000| +------+
|3.0000| |N |
+------+ |Xbar |
| |SD |
Pupils |Min |
1.42E-22;(8.53E-22) |Max |
| +------+
--------------------------
1? 2
2 | 3 |
+------+ +------+
| 135| | 365|
|1.1852| |2.1096|
|0.5210| |0.9341|
|1.0000| |1.0000|
|3.0000| |3.0000|
+------+ +------+
| |
MRP age
1.12E-04;(6.74E-04) 7.84E-13;(3.92E-12)
| |
---------- ----------------------------------
123456 ?7 0123 45 67
4 | 5 | 6 | 7 | 8 |
+------+ +------+ +------+ +------+ +------+
| 109| | 26| | 228| | 81| | 56|
|1.0734| |1.6538| |2.3860| |1.8765| |1.3214|
|0.3251| |0.8458| |0.8604| |0.9135| |0.6904|
|1.0000| |1.0000| |1.0000| |1.0000| |1.0000|
|3.0000| |3.0000| |3.0000| |3.0000| |3.0000|
+------+ +------+ +------+ +------+ +------+
| | |
Eye ind MRP EMV
4.46E-06;(2.23E-05) 3.75E-05;(1.88E-04) 1.23E-07;(6.16E-07)
| | |
------------ ---------- ----------
1 ?23 12345 ?67 ?23456 7
9 | 10 | 11 | 12 | 13 | 14 |
+------+ +------+ +------+ +------+ +------+ +------+
| 18| | 210| | 41| | 40| | 46| | 10|
|1.3333| |2.4762| |1.3659| |2.4000| |1.0870| |2.4000|
|0.6860| |0.8137| |0.6617| |0.8412| |0.2849| |0.9661|
|1.0000| |1.0000| |1.0000| |1.0000| |1.0000| |1.0000|
|3.0000| |3.0000| |3.0000| |3.0000| |2.0000| |3.0000|
+------+ +------+ +------+ +------+ +------+ +------+
| |
EMV EMV
1.99E-03;(9.94E-03) 0.0513;(0.2566)
| |
---------- -------------------
12 ?34567 ?34 5 67
15 | 16 | 17 | 18 | 19 |
+------+ +------+ +------+ +------+ +------+
| 17| | 193| | 7| | 12| | 21|
|1.5882| |2.5544| |2.5714| |1.5833| |2.8095|
|0.8703| |0.7627| |0.7868| |0.7930| |0.5118|
|1.0000| |1.0000| |1.0000| |1.0000| |1.0000|
|3.0000| |3.0000| |3.0000| |3.0000| |3.0000|
+------+ +------+ +------+ +------+ +------+

The full set of 500 cases has a mean outcome score of 1.8600. The first split is made on
the basis of the predictor Pupils. Cases for which Pupils is 1 or ? constitute the first
descendant group (Node 2), while those for which it is 2 give Node 3. Note that this
is the same predictor that CATFIRM elected to use, but the patients with missing values
of Pupils are grouped with Pupils = 1 by the CONFIRM analysis, and with Pupils = 2 by
CATFIRM. Going to the descendant groups, Node 2 is then split on MRP (unlike in
CATFIRM, where the corresponding node was terminal), while Node 3 is again split on
age, here into three groups.

The overall tree is bigger than that produced by CATFIRM, with 11 terminal and 8 interior
nodes, and produces groups with prognostic scores ranging from a grim 1.0734 up to
2.8095 on the 1 to 3 scale. As with CATFIRM, the means in these terminal nodes could
be used for prediction of future cases, giving estimates of the score that patients in the
terminal nodes would have.

The statistical model used in CONFIRM can be described as a piecewise constant
regression model. For example, among the patients in Node 3, the outcome is predicted
to be 2.39 for patients aged under 40, 1.88 for those aged 40 to 60, and 1.32 for those
aged over 60. This approach contrasts sharply with, say, a linear regression in which
the outcome would be predicted to change by some constant amount with each additional
year of age. While this piecewise constant regression model is seldom valid exactly, it
is often a reasonable working approximation of reality.
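
A piecewise constant prediction rule of this kind amounts to a simple lookup. The sketch below is illustrative Python rather than FIRM output; it uses the means of the three descendants of Node 3 (nodes 6, 7 and 8) as printed in the dendrogram, and the function name predict_outcome is hypothetical.

```python
# Age-decade codes grouped as in the dendrogram, with the node means.
NODE3_AGE_GROUPS = [
    ("0123", 2.3860),  # node 6
    ("45", 1.8765),    # node 7
    ("67", 1.3214),    # node 8
]


def predict_outcome(age_code: str) -> float:
    """Predicted outcome score for a Node 3 patient, given the age code."""
    if len(age_code) != 1:
        raise ValueError("age_code must be a single category code")
    for codes, mean in NODE3_AGE_GROUPS:
        if age_code in codes:
            return mean
    raise ValueError(f"unknown age code: {age_code!r}")


for code in "01234567":
    print(code, predict_outcome(code))
```

Every patient in the same age grouping receives the same predicted score, so the prediction changes only at the boundaries between groupings.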

The dendrogram is the most obviously and immediately useful output of a FIRM
analysis. It is an extract of the much more detailed output, which may be written to a
file or, by using the split keyword, included in the output file. If written to a file,
that file will contain both the split and the table information, in contrast with
CATFIRM, where this information is written to two separate files.

This output contains the following (often very valuable) information:

• An analysis of each predictor in each node, showing which categories of the
predictor FIRM finds it best to group together, and what the conservative
statistical significance level of the split by each predictor is;
• The number of cases flowing into each of the descendant groups;
• Summary statistics of the cases in the descendant nodes. In the case of
CATFIRM, the summary statistics are a frequency breakdown of the cases
between the different classes of the dependent variable. With CONFIRM, the
summary statistics given are the arithmetic mean and standard deviation of the
cases in the node. This summary information is echoed in the dendrogram, as
discussed previously.

The information on the splitting of the first group is shown below. The actual output file
contains similar information on all the groups.

The listing starts with the summary statistics of the grouping of the cases in that node by
the different classes of the predictor. The categories of a free or character predictor are
arranged in descending order of the dependent variable mean—all the other predictors are
listed in their original order.
-------------------------------------------
| Output of Detailed Analysis of Splits |
-------------------------------------------

***********************************************************************
Analysis of group no. 1 previous group no. 0 mean = 1.860 size = 500

predictor no. 1 age


statistics before merging
Cate 0 1 2 3 4 5 6 7
mean 2.11 2.30 1.99 1.95 1.61 1.71 1.31 1.00
size 55 111 77 55 61 58 61 22
s.d. 0.96 0.92 0.91 0.95 0.88 0.88 0.67 0.00

The summary table is followed by the one-way analysis of variance for the ungrouped
cross-classification.
Anova SSE DFE SSH DFH R-squared F Sign(%)
372.0 492 66.2 7 0.1510 12.5046 0.8757E-12

Next are the details of merging the categories. For a monotonic predictor like age, the
only possible merges are between adjacent categories. CONFIRM lists a Student’s
t-statistic between each pair of adjacent categories on the line below. For example, the
t-value for merging classes 0 and 1 is 1.3, that for merging 1 and 2 is -2.4, and that for
merging 6 and 7 is -1.4. The smallest t-value in absolute size is -0.3, for merging classes
2 and 3. This merge is made, and the number of classes becomes 7, with 6 possible
mergings and their associated t-values. These t-values are listed on the second “merge
stats” line, which shows that the least significantly different mergeable pair is 4 and 5,
with a t-value of 0.6. This merge in turn takes place, as this t-value is not significant at
the merge significance level selected in this run. This reduction continues until, at the
final line, all three Student’s t-values (for merging the composite categories (01), (23),
(45), and (67)) are significant, and the merge testing stops.
Group 0 1 2 3 4 5 6 7
merge stats 1.3 -2.4 -0.3 -2.1 0.6 -2.5 -1.4
Group 0 1 23 4 5 6 7
merge stats 1.3 -2.9 -2.7 0.6 -2.5 -1.4
Group 0 1 23 45 6 7
merge stats 1.3 -2.9 -2.9 -2.5 -1.4
Group 01 23 45 6 7
merge stats -2.6 -2.9 -2.5 -1.4
Group 01 23 45 67
merge stats -2.6 -2.9 -3.4
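
The first “merge stats” line above can be approximately reproduced from the “statistics before merging” table. The sketch below is an illustration under stated assumptions, not FIRM's own code: it assumes the error variance is the pooled Anova mean square (SSE / DFE) and that each t-value compares a category with its left-hand neighbour. The function and variable names are hypothetical.

```python
import math

# Category means and sizes for predictor age, from the table above.
means = [2.11, 2.30, 1.99, 1.95, 1.61, 1.71, 1.31, 1.00]
sizes = [55, 111, 77, 55, 61, 58, 61, 22]
pooled_var = 372.0 / 492  # SSE / DFE from the Anova line (assumed error variance)


def adjacent_t(means, sizes, var):
    """t-statistic for each pair of adjacent categories (right minus left)."""
    return [
        (means[i + 1] - means[i]) / math.sqrt(var * (1 / sizes[i] + 1 / sizes[i + 1]))
        for i in range(len(means) - 1)
    ]


t_values = adjacent_t(means, sizes, pooled_var)
# The pair with the smallest |t| is the first candidate for merging.
best = min(range(len(t_values)), key=lambda i: abs(t_values[i]))
print([round(t, 1) for t in t_values])
print("least significant pair:", best, best + 1)
```

Only the first stage is shown; repeating the step on the merged categories, and stopping once every remaining t-value is significant, would follow the same pattern.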

statistics after splitting / merging


Cate 01 23 45 67
mean 2.23 1.97 1.66 1.23
size 166 132 119 83
s.d. 0.93 0.92 0.88 0.59
Anova SSE DFE SSH DFH R-squared F Sign(%)
375.2 496 63.0 3 0.1437 27.7403 0.1344E-13

predictor no. 2 EMV


statistics before merging
Cate ? 1 2 3 4 5 6 7
mean 2.07 1.00 1.19 1.58 1.81 1.84 2.25 2.63
size 28 19 64 52 111 96 65 65
s.d. 0.94 0.00 0.53 0.82 0.93 0.91 0.94 0.74
Anova SSE DFE SSH DFH R-squared F Sign(%)
341.2 492 97.0 7 0.2214 19.9861 0.1274E-20
Group ? 1 2 3 4 5 6 7
merge stats -4.3 0.9 2.5 1.7 0.3 3.0 2.6
-4.3 -4.7 -2.5 -1.5 -1.3 0.9 3.0
Group ? 1 2 3 45 6 7
merge stats -4.3 0.9 2.5 1.9 3.6 2.6
-4.3 -4.7 -2.5 -1.5 0.9 3.0
Group ? 12 3 45 6 7
merge stats -5.1 2.9 1.9 3.6 2.6
-5.1 -2.5 -1.5 0.9 3.0
Group 12 3 45 ?6 7
merge stats 2.9 1.9 3.5 3.3
Group 12 345 ?6 7
merge stats 6.0 4.1 3.2

statistics after splitting / merging


Cate 12 345 ?6 7
mean 1.14 1.78 2.19 2.63
size 83 259 93 65
s.d. 0.47 0.90 0.94 0.74
Anova SSE DFE SSH DFH R-squared F Sign(%)
344.9 496 93.3 3 0.2128 44.7058 0.1382E-22

A slightly different format is used for a floating predictor, as illustrated by the predictor
MRP. Here, while the floating category is on its own, there are two lines of statistics for
each stage: the first for merging adjacent categories (? with 1, 1 with 2, 2 with 3, ...) and
the second for merging the floating category with each of the others (? with 1, ? with 2,
? with 3, ...). At the first merge phase, these lines show that the least significant t-value
(-0.4) is for categories 3 and 4, and so these categories are merged. The second stage
merges 1 with 2 and the third 6 with 7. At the fourth stage, the smallest t is for merging ?
with the composite category (67), and after this is done, ? no longer floats, and so for the
last two stages there is only a single line of merge outputs.
predictor no. 3 MRP
statistics before merging
Cate ? 1 2 3 4 5 6 7
mean 2.14 1.21 1.31 1.61 1.54 1.87 2.30 2.37
size 21 38 61 33 114 30 91 112
s.d. 0.91 0.58 0.67 0.90 0.81 0.94 0.89 0.90
Anova SSE DFE SSH DFH R-squared F Sign(%)
342.6 492 95.6 7 0.2182 19.6192 0.3351E-20
Group ? 1 2 3 4 5 6 7
merge stats -4.1 0.6 1.6 -0.4 1.9 2.4 0.6
-4.1 -3.9 -2.3 -3.0 -1.2 0.8 1.1
Group ? 1 2 34 5 6 7
merge stats -4.1 0.6 1.9 1.8 2.5 0.6
-4.1 -3.9 -3.0 -1.2 0.8 1.1
Group ? 12 34 5 6 7
merge stats -4.3 2.6 1.9 2.5 0.6
-4.3 -3.0 -1.2 0.8 1.1
Group ? 12 34 5 67
merge stats -4.3 2.6 1.9 2.9
-4.3 -3.0 -1.2 1.0

Group 12 34 5 ?67
merge stats 2.6 1.9 2.8
Group 12 345 ?67
merge stats 3.2 8.4

statistics after splitting / merging


Cate 12 345 ?67
mean 1.27 1.61 2.32
size 99 177 224
s.d. 0.64 0.85 0.89
Anova SSE DFE SSH DFH R-squared F Sign(%)
346.2 497 92.0 2 0.2099 66.0065 0.3781E-23
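
The two “merge stats” lines of the first MRP stage can be sketched in the same way. The code below is a hedged illustration, not FIRM's implementation: it assumes the pooled Anova mean square as the error variance, and all names in it are hypothetical. Because the listed means are rounded to two decimals, some recomputed values differ from the listing in the last digit.

```python
import math

# "statistics before merging" for MRP; "?" is the floating (missing) category.
cats = ["?", "1", "2", "3", "4", "5", "6", "7"]
means = [2.14, 1.21, 1.31, 1.61, 1.54, 1.87, 2.30, 2.37]
sizes = [21, 38, 61, 33, 114, 30, 91, 112]
var = 342.6 / 492  # SSE / DFE, assumed pooled error variance


def t_stat(i, j):
    """t-statistic for category j against category i (j minus i)."""
    return (means[j] - means[i]) / math.sqrt(var * (1 / sizes[i] + 1 / sizes[j]))


# First line: adjacent pairs, with ? listed in front of category 1.
adjacent = {(cats[i], cats[i + 1]): t_stat(i, i + 1) for i in range(len(cats) - 1)}
# Second line: the floating category against every other category.
floating = {("?", cats[j]): t_stat(0, j) for j in range(1, len(cats))}

# The merge candidate is the pair with the smallest |t| over both lines.
candidates = {**adjacent, **floating}
pair = min(candidates, key=lambda k: abs(candidates[k]))
print(pair, round(candidates[pair], 1))
```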

predictor no. 4 Change


statistics before merging
Cate ? 1 2 3
mean 1.82 1.60 1.97 2.13
size 132 143 115 110
s.d. 0.94 0.87 0.92 0.95
Anova SSE DFE SSH DFH R-squared F Sign(%)
419.1 496 19.1 3 0.0437 7.5530 0.5987E-02
Group ? 1 2 3
merge stats -2.0 3.2 1.3
-2.0 1.3 2.6
Group ? 1 23
merge stats -2.0 4.5
-2.0 2.3
Group ?1 23
merge stats 4.1

statistics after splitting / merging


Cate ?1 23
mean 1.71 2.05
size 275 225
s.d. 0.91 0.94
Anova SSE DFE SSH DFH R-squared F Sign(%)
423.6 498 14.6 1 0.0333 17.1594 0.4038E-02

predictor no. 5 Eye ind


statistics before merging
Cate ? 1 2 3
mean 1.92 1.09 1.70 2.22
size 110 96 73 221
s.d. 0.97 0.36 0.89 0.90
Anova SSE DFE SSH DFH R-squared F Sign(%)
351.4 496 86.8 3 0.1982 40.8602 0.1305E-20
Group ? 1 2 3
merge stats -7.0 4.6 4.6
-7.0 -1.7 3.0
Group 1 ?2 3
merge stats 6.9 4.6

statistics after splitting / merging


Cate 1 ?2 3
mean 1.09 1.83 2.22
size 96 183 221
s.d. 0.36 0.94 0.90
Anova SSE DFE SSH DFH R-squared F Sign(%)
353.5 497 84.7 2 0.1933 59.5594 0.6500E-21

The final type of output is that for a free predictor (like Pupils). This is much simpler
than in the corresponding CATFIRM case. Here the first stage of the analysis is to sort
the categories of the predictor into ascending order of their mean values of the dependent
variable. Thereafter, the analysis proceeds just like that of a monotonic predictor in these
re-ordered categories. This implies that if any two categories of a free predictor are to end
up joined together, the composite class will necessarily include any other categories
whose mean score lay between the mean scores of the joined groups. This implication is
not a consequence of the definition of a free predictor, but accords with common-sense
expectations.
predictor no. 6 Pupils
statistics before merging
Cate 1 ? 2
mean 1.14 1.62 2.11
size 122 13 365
s.d. 0.45 0.87 0.93
Anova SSE DFE SSH DFH R-squared F Sign(%)
351.3 497 86.9 2 0.1983 61.4491 0.1422E-21
Group 1 ? 2
merge stats 1.9 2.1
Group 1? 2
merge stats 10.9

statistics after splitting / merging


Cate 1? 2
mean 1.19 2.11
size 135 365
s.d. 0.52 0.93
Anova SSE DFE SSH DFH R-squared F Sign(%)
354.0 498 84.2 1 0.1922 118.4738 0.6746E-22
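
The re-ordering step for a free predictor can be sketched in a few lines. This is only an illustration of the idea described above, using the Pupils means from the listing; the variable names are hypothetical.

```python
# Category means for Pupils, as listed in "statistics before merging".
means = {"?": 1.62, "1": 1.14, "2": 2.11}

# Step 1: order the categories by their dependent-variable mean.
ordered = sorted(means, key=means.get)

# Step 2: from here on, only neighbours in this ordering may be merged,
# just as for a monotonic predictor.
mergeable_pairs = list(zip(ordered, ordered[1:]))

print(ordered)          # ['1', '?', '2']
print(mergeable_pairs)  # [('1', '?'), ('?', '2')]
```

Because merges are restricted to neighbours in the sorted order, categories 1 and 2 could only end up in the same composite class if ? were included as well.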

Including the rule keyword adds the split rule table shown below to the output file. In
addition to the first section, which summarizes the splits made, the table gives the
categories of the character predictors and the cutpoints of the real predictors.
1 6 2 2 3
2 3 5 4 4 4 4 4 4 5
3 1 6 6 6 6 7 7 8 8
6 5 10 9 10 10
7 3 12 11 11 11 11 11 12 12
8 2 13 -1 13 13 13 13 13 14
10 2 16 15 15 16 16 16 16 16
12 2 17 -1 -1 17 17 18 19 19
-1 -1
0
0

6.5 CONFIRM Analysis of the “mixed” Data Set


6.5.1 Description of the Data

The “mixed” data set was provided by Gordon V. Kass and is described in detail in
Section 6.3, where it was analyzed using CATFIRM.

The dependent variable for the CATFIRM run was of the type character—a promotion
code. A second measure of the student’s overall achievement in relation to the
requirements for going on to the second-year curriculum was the aggregate score. This is
the average score received for all courses taken in the first year at University, and is
suitable for analysis using CONFIRM.

6.5.2 FIRM Syntax File

The syntax file for the analysis of the “mixed” data using CONFIRM is given below:

confirm: sum den split rule ! add dendrogram and summary information to output file
17 ‘U aggr’ 0 ! Number of dependent variable
12 ! Number of predictors
‘Afrikaans’ 7 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Biology’ 10 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘English’ 6 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Faculty’ 13 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘History’ 12 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Latin’ 11 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Math’ 8 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Mattype’ 3 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘Matmonth’ 4 0 ‘f’ 0 13 ‘?123456789ABC’ 0.9000 1.0000
‘Matyear’ 5 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Science’ 9 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Sex’ 2 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
1 20 .00100 1.00000 1.00000 20 0 0 1 .0000000 1 0 0 0 0

• The first line indicates that a CONFIRM analysis is required, and that a
dendrogram (keyword den), summary statistics (keyword sum), the split file
(keyword split) and the split rule file (keyword rule) are to be included in the
output file. Note the use of the ! to start a comment after the end of the input on
the first line.
• The second line contains information on the dependent or outcome variable. It is
the 17th variable in the data set, as indicated by 17 in the first field, and it is used
here as a continuous variable, as indicated by the 0 in the third entry on this line.
Because of this specification, no category names are required for the FIRM
analysis. This is the dependent variable specification section of the syntax file.
• The third line has one entry—12—indicating the number of predictor variables to
be used in the analysis. This is the start of the predictor specification section of
the syntax file. The next twelve lines provide information on each of the
predictors.
• The variable Afrikaans is the seventh variable in the data set (position of predictor
= 7), and is a real predictor (type of predictor = r). The 0 in the third field indicates
that the variable may be used for splitting; a 1 would indicate that it is to be
carried over but not used for splitting.
• The monotonic, free, and float predictor types all require that the predictor’s
values be consecutive integers. The next two fields should contain information on
this range, and are only required for these three predictor types. For a real
predictor, 0s are used instead.
• This is usually followed by the category codes. In this case, the absence of
category codes is indicated by “ “. The last two entries in this line are the splitting
and merging significance values. These levels are given in percent. In this
example, a 0.9% significance level will be used for splitting and 1% for merging.
• The predictor Matmonth is a free predictor, with values ranging from 0 to 13.
Codes for the values are assigned as ?, 1, 2, ...., 9, A, B, C. Splitting and merging
significance levels are the same as for all other predictors.
• The last line of the syntax file contains the output options for this CONFIRM
analysis.

1 20 .00100 1.00000 1.00000 20 0 0 1 .0000000 1 0 0 0 0

• The options specified are as follows:

1: The detailed split file is requested.
20: Minimum number of cases in a group. In this case, no group smaller
than 20 cases will be considered for splitting.
.00100: The minimum proportion of the starting SSD that a group must have
in order to be analyzed.
1.00000: The raw significance level, in percent.
1.00000: The conservative significance level, in percent, that a predictor
must attain for the split to be made.
20: The maximum number of nodes to be analyzed.
0: The external degrees of freedom on the variance estimate. In this
example, no such information is available and the 0 indicates that.
0: The data for this analysis is in free format. If fixed format had been
indicated, a format statement would have followed directly after this line
of input.
1: The use of the pooled variance in the t-statistic for the testing of pairs of
categories for compatibility is requested.
.0000000: No external variance estimate is used.
1: Indicates use of FIRM 2.1 methodology rather than FIRM 2.0
methodology.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.

For more information on syntax, see the sections on the dependent and predictor variable
specifications in Chapter 8.
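
To make the field layout described above concrete, the following sketch reads one predictor line and the options line into named fields. It is an illustration only and not part of FIRM. The field names (such as use_flag and range_low) are hypothetical labels for the positions discussed in the bullets, and plain single quotes are assumed around the quoted fields.

```python
import shlex
from dataclasses import dataclass


@dataclass
class PredictorSpec:
    name: str
    position: int     # column of the predictor in the data set
    use_flag: int     # third field (0 in every line of this example)
    kind: str         # predictor type, e.g. 'r', 'c' or 'f'
    range_low: int    # the two range fields (0 0 for a real predictor)
    range_high: int
    codes: str        # category codes; empty when none are given
    split_sig: float  # splitting significance level, in percent
    merge_sig: float  # merging significance level, in percent


def parse_predictor(line: str) -> PredictorSpec:
    """Split one predictor line on whitespace, honouring quoted fields."""
    f = shlex.split(line)
    return PredictorSpec(f[0], int(f[1]), int(f[2]), f[3], int(f[4]),
                         int(f[5]), f[6], float(f[7]), float(f[8]))


# Hypothetical names for the 15 entries of the options line, in order.
OPTION_NAMES = [
    "detailed_split", "min_group_size", "min_ssd_proportion",
    "raw_sig", "conservative_sig", "max_nodes", "external_df",
    "fixed_format", "pooled_variance", "external_variance",
    "firm_2_1", "unused_1", "unused_2", "unused_3", "unused_4",
]


def parse_options(line: str) -> dict:
    """Pair each number on the options line with its (assumed) meaning."""
    return dict(zip(OPTION_NAMES, (float(v) for v in line.split())))


spec = parse_predictor("'Matmonth' 4 0 'f' 0 13 '?123456789ABC' 0.9000 1.0000")
opts = parse_options("1 20 .00100 1.00000 1.00000 20 0 0 1 .0000000 1 0 0 0 0")
print(spec.name, spec.kind, len(spec.codes), opts["min_group_size"])
```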

6.5.3 CONFIRM Output

The first section of the output file echoes the syntax file and also contains the data set
name and number of observations used in the analysis.
confirm: sum den split rule! add dendrogram and summary information to output
file
17 ‘U aggr’ 0 ! Number of dependent variable
12 ! Number of predictors
‘Afrikaans’ 7 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Biology’ 10 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘English’ 6 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Faculty’ 13 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘History’ 12 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Latin’ 11 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Math’ 8 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Mattype’ 3 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘Matmonth’ 4 0 ‘f’ 0 13 ‘?123456789ABC’ 0.9000 1.0000
‘Matyear’ 5 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Science’ 9 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Sex’ 2 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
1 20 .00100 1.00000 1.00000 20 0 0 1 .0000000 1 0 0 0 0

CONFIRM. Formal Inference - based Recursive Modeling


Program dimensions
Maximum number of predictors 1000
Maximum number of categories in predictors 20
the data file (input):
H:\FIRM\DATA\MIXED.dat
Run now starting....
All data in. 500 cases read with
500 retained.
Start FIRM processing

Initial information on the analysis is given next, followed by a mini-dendrogram. In both
cases, it can be seen that the first split was on predictor number 7 (Math), and three new
nodes were formed. In the output shown here, nodes 2, 3 and 4 are then analyzed in turn,
and the mini-dendrogram shows no further splits.
Starting analysis of node 1
Split on 7 making descendants 3 2 2 2 2 2 3 3 3
4 4

Starting analysis of node 2


Starting analysis of node 3
Starting analysis of node 4
Starting to draw full-size dendrogram


Full-size dendrogram complete
Starting to draw mini-dendrogram

Mini-dendrogram of analysis

Legend: node number


splitting variable
------------------------------------
horizontal line connects descendants

1
7
----------------------------------------
2 3 4
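
The “Split on 7 making descendants” line above lists one descendant node for each class of the splitting predictor. The sketch below groups the class codes by node; it assumes that the classes appear in the order ? followed by 0 to 9, which is an inference from the groupings reported later for Math and not something the output states explicitly.

```python
from collections import defaultdict

codes = "?0123456789"                            # assumed class order
descendants = [3, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4]  # from the "Split on 7" line

# Collect the class codes that flow into each descendant node.
groups = defaultdict(str)
for code, node in zip(codes, descendants):
    groups[node] += code

for node in sorted(groups):
    print(node, groups[node])
# 2 01234
# 3 ?567
# 4 89
```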

As the sum, den, split and rule keywords were included in the syntax file, the
summary information is not written to an external file, but forms the next part of the
output file.
---------------------------------------------
| Output of Summary and Error Information |
---------------------------------------------

CONFIRM Formal Inference - based Recursive Modeling


Continuous dependent variable

Version 2.3 1999/09/04


Copyright 1999 Douglas M Hawkins
Applied Statistics
University of Minnesota

Program dimensions:
Maximum number of predictors 1003
Maximum number of predictor categories 20

Dependent variable no. 17 is named U aggr


Predictor variables
no posn name no of cats split merge may? type
1 7 Afrikaans 0 0.900 1.000 yes real
2 10 Biology 0 0.900 1.000 yes real
3 6 English 0 0.900 1.000 yes real
4 13 Faculty 0 0.900 1.000 yes char
5 12 History 0 0.900 1.000 yes real
6 11 Latin 0 0.900 1.000 yes real
7 8 Math 0 0.900 1.000 yes real
8 3 Mattype 0 0.900 1.000 yes char
9 4 Matmonth 13 0.900 1.000 yes free
10 5 Matyear 0 0.900 1.000 yes real
11 9 Science 0 0.900 1.000 yes real
12 2 Sex 0 0.900 1.000 yes char

Run options in effect


Full split/merge details of predictors
For a group to be analyzed, it must: -
contain at least 20 cases;
have at least proportion 0.00100 of starting ssd.
Minimum % raw significance to split 1.000
Minimum % conservative significance to split 1.000


Analysis will stop after 20 groups have been formed
Error variance is pooled Anova MS
New (FIRM 2.1 and newer) p-values used
All data in. 500 cases read with
500 retained.

Next there is a listing of the cutpoints for the real predictors, which is needed when
reading the rest of the summary output and the dendrogram. Just one real predictor,
Afrikaans, from this list is shown below.

The cutpoints 45, 51, 54, 55, 57, 62, 64, 65 and 75 are the best we can find for dividing
the data set up into 10 groups of about equal size, and are the starting cutpoints from
which FIRM will do its subsequent merging to a final grouping.
Character predictors seen in the data and their values are:

Predictor 4 Faculty values seen


A C M S
D E F H
B L
These values will be abbreviated to: ACMSDEFHBL

Predictor 8 Mattype values seen


5 1 6 J
F 7 ? 4
3 A 2
These values will be abbreviated to: 516JF7?43A2

Predictor 12 Sex values seen


M F
These values will be abbreviated to: MF

Continuous predictors seen in the data and their cutpoints are:

Predictor 1 Afrikaans

Code Max value in class


? Invalid or missing
0 <= 45.000
1 <= 51.000
2 <= 54.000
3 <= 55.000
4 <= 57.000
5 <= 62.000
6 <= 64.000
7 <= 65.000
8 <= 75.000
9 > 75.000
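
The cutpoint table can be applied to a raw score with a simple lookup. The following sketch assumes that a missing score is represented by None; the function name is illustrative and is not part of FIRM.

```python
from bisect import bisect_left

# "Max value in class" for codes 0 to 8; anything above 75 falls in class 9.
AFRIKAANS_CUTPOINTS = [45, 51, 54, 55, 57, 62, 64, 65, 75]


def class_code(score):
    """Return the class code for an Afrikaans score, or '?' if it is missing."""
    if score is None:
        return "?"
    # bisect_left counts the cutpoints strictly below the score, so a score
    # equal to a cutpoint stays in the lower class (the "<=" in the table).
    return str(bisect_left(AFRIKAANS_CUTPOINTS, score))


print(class_code(45), class_code(46), class_code(75), class_code(80), class_code(None))
# 0 1 8 9 ?
```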

A summary of the options used in splitting and merging of the groups is followed by
specific information on the splitting of each node. The primary use of this information is
for exploration of possible splits that were not used.

When we look at the analysis of the first node, we see that four of the predictors (Biology,
English, Math and Science) are able to provide splits significant at better than 1%
conservative. Of these, the predictor Math is the most significant. It divides the students
into three subsets, {01234}, {?567} and {89}, effectively distinguishing between
the bottom 50% of the students, the next 30% plus those with missing Math grades, and
the top 20%. Alternatively, we can use the table of cutpoints (given in the summary file,
but not shown here) to turn these into ranges of Math scores. The first group is those at
or below 57%, the second those above 57% but not exceeding 75% (together with the
missing class), and the third those with scores above 75%. We will need to use the actual
cutpoints given in the summary file (or, in this case, the complete output file) if we want
to apply the model to the prediction of future students’ performance, but the interpretation
in terms of approximate percentages of the pool is probably more informative when it
comes to a better understanding of the connection of the Math score to the aggregate
first-year grade.
***************************************************************************
Analysis of group no. 1 previous group no. 0
no. name mc-sig(%) bonf-sig(%), grouping
1 Afrikaans 27.5146 9.6823 ?01234567 89
2 Biology 2.46E-03 7.22E-05 01234 ?56 789
3 English 0.0108 2.51E-03 ?012345 6789
4 Faculty 1.5561 15.1396 BLSC FEHAMD
5 History 7.5641 1.5349 ?01234567 8
6 Latin 100.00 100.00 ?012345678
7 Math 9.37E-04 4.67E-05 01234 ?567 89
8 Mattype 100.00 100.00 7FJ51A3?642
9 Matmonth 4.6522 13.1064 74 3B1C?6
10 Matyear 20.4593 3.2037 0123 ?45678
11 Science 2.14E-03 4.60E-04 012 ?34567 89
12 Sex 100.00 100.00 MF

Best predictor
7 Math 9.37E-04 4.67E-05 01234 ?567 89

The one-way analysis of variance for the split actually made is given next, followed by
the summary statistics of the descendant groups formed by the split. This information is
duplicated in the dendrogram, and so it is not of particular interest here.
Analysis of variance

           Sum of squares   mean square   degrees of freedom
Grouping       15069.3285     7534.6643                    2
Error         196253.2795      394.8758                  497

F-value 19.081
significance 1.04E-06
Bonferroni P 4.67E-05
multiple comparison P 9.37E-04
overall conservative P 4.67E-05
adjusted for predictor count 5.60E-04

Grouping is significant at the conservative 5.60E-04% level

Statistics for grouping


Node Mean s.d. size s.e. (mean)
2 42.308411 20.600566 214 1.4082257
3 49.346154 19.753955 234 1.2913564
4 60.288462 17.114764 52 2.3733908
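
Several of the figures in this listing can be cross-checked by hand. The sketch below is a hedged illustration, not a description of FIRM's internals: it assumes that the overall conservative P is the smaller of the Bonferroni and multiple comparison values, that the adjustment for predictor count multiplies this by the number of predictors (12), and that a split is made when the adjusted value is below the 1% conservative level. These assumptions are consistent with the numbers shown for nodes 1 and 2, but the function names are hypothetical.

```python
import math


def f_value(ms_grouping, ms_error):
    """F-ratio of the grouping mean square to the error mean square."""
    return ms_grouping / ms_error


def adjusted_significance(bonferroni_p, multiple_comparison_p, n_predictors):
    """Conservative significance (in percent), scaled by the predictor count."""
    conservative = min(bonferroni_p, multiple_comparison_p)
    return conservative * n_predictors


def standard_error(sd, size):
    """Standard error of a node mean."""
    return sd / math.sqrt(size)


print(round(f_value(7534.6643, 394.8758), 3))         # F-value for node 1
adj = adjusted_significance(4.67e-05, 9.37e-04, 12)   # node 1, in percent
print(adj, adj < 1.0)                                 # below the 1% level
print(round(standard_error(20.600566, 214), 4))       # s.e. of the node 2 mean
```

Applying the same adjustment to node 2 (Bonferroni 5.1131, multiple comparison 3.7527) gives roughly 45%, far above the 1% level.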
***************************************************************************
Analysis of group no. 2 previous group no. 1
no. name mc-sig(%) bonf-sig(%), grouping
1 Afrikaans 100.00 100.00 ?0123456789
2 Biology 100.00 100.00 ?0123456789
3 English 3.7527 5.1131 012345 6789
4 Faculty 10.8388 66.2082 S MCLAHFED
5 History 100.00 100.00 ?012345678
6 Latin 100.00 100.00 ?024568
7 Math 100.00 100.00 01234
8 Mattype 100.00 100.00 7J1A24653
9 Matmonth 100.00 100.00 4BC13
10 Matyear 100.00 100.00 012345678
11 Science 100.00 100.00 ?0123457
12 Sex 100.00 100.00 MF

Best predictor
3 English 3.7527 5.1131 012345 6789

Analysis of variance
Sum of squares mean square degrees of freedom
Grouping 3210.5718 3210.5718 1
Error 87183.0731 411.2409 212

F-value 7.807
significance 0.5681
Bonferroni P 5.1131
multiple comparison P 3.7527
overall conservative P 3.7527
adjusted for predictor count 45.0319
No prediction possible.

As with the summary statistics, the use of the den keyword leads to the inclusion of the
full dendrogram in the output file, rather than the writing of this information to an
external file. The dendrogram thus forms the next part of the CONFIRM output file. The
dendrogram is the most obviously and immediately useful output of a FIRM analysis. It
is an extract of the much more detailed output, which may be written to a file, or, by
using the split keyword, included in the output file. If written to a file, this file will
contain information on both split and table files, in contrast with CATFIRM where such
information will be written to two separate files.

This output contains the following (often very valuable) information:

• An analysis of each predictor in each node, showing which categories of the
predictor FIRM finds it best to group together, and what the conservative
statistical significance level of the split by each predictor is.
• The number of cases flowing into each of the descendant groups.
• Summary statistics of the cases in the descendant nodes. In the case of
CATFIRM, the summary statistics are a frequency breakdown of the cases
between the different classes of the dependent variable. With CONFIRM, the
summary statistics given are the arithmetic mean and standard deviation of the
cases in the node. This summary information is echoed in the dendrogram, as
discussed previously.

The split file contains information on the splitting of all groups. The listing starts with the
summary statistics of the grouping of the cases in that node by the different classes of the
predictor. The categories of a free or character predictor are arranged in descending order
of the dependent variable mean—all the other predictors are listed in their original order.

Inclusion of the rule keyword leads to the inclusion of the table in the output file. Apart
from the first section summarizing the splits made, it has the additional information on
the categories of the character predictors and the cutpoints of the real predictors.
