Chapter 6
6 FIRM Examples
In this chapter, two data sets are analyzed using both CATFIRM and CONFIRM. For
more information on FIRM syntax, please see Chapter 8. An overview of this technique
is given in Chapter 7.
In the study of automobile costs, you may have a database containing, in order of
appearance, the variables COUNTRY, MAKE, CYL, DISPL, TANK, PRICE, YEAR and
REPAIR, which are respectively the car’s country of origin, its make, the number of
cylinders it has, the engine displacement, fuel tank capacity, recommended retail price,
year of manufacture and annual repair costs.
The following variables may be used to predict the repair costs (REPAIR), which is thus
the dependent variable.
• MAKE : the make of the car, with 5 groupings (GM, Ford, Chrysler, Japanese and
European). MAKE is clearly nominal, and would be used as a “free” predictor.
• CYL : the number of cylinders in the engine (4, 6, 8 or 12). CYL should be used
as monotonic, in order to avoid the assumption that the difference in repair costs
between 4 and 8 cylinders is the same as that between 8 and 12 cylinders.
• PRICE : manufacturer’s suggested retail purchase price when new. PRICE is a
ratio-scale predictor that would be specified as monotonic—it seems generally
accepted that cars that are more expensive to buy are more expensive to run, too.
• YEAR : year of manufacture. YEAR is an example of a predictor that looks to be
on interval scale, but appearances may be deceptive. The best would be to use
year as a “free” predictor, thus allowing the cost to have any type of response to
year, including one in which isolated years are bad while their neighbors are
good.
The “head injuries” data set of Titterington et al. (1981) is an example of a data set in
which FIRM is a potential method of analysis. The data set was gathered in an attempt to
predict the final outcome of 500 hospital patients who had suffered head injuries. The
outcome for each patient was one of three categories, labeled in the analysis below as
Dead/veg (dead or vegetative), Severe, and Mod/good (moderate or good recovery).
This outcome is predicted on the basis of 6 predictors assessed on the patients’ admission
to the hospital:
• age : The age of the patient. This is grouped into decades in the original data, and
is grouped the same way here. It has eight classes.
• EMV : This is a composite score of three measures—of eye-opening in response
to stimulation, motor response of best limb, and verbal response. This has seven
classes, but is not measured in all cases, so that there are eight possible codes for
this score — the seven measurements and an eighth “missing” category.
• MRP : This is a composite score of motor responses in all four limbs. This also
has seven measurement classes with an eighth class for missing information.
• Change : The change in neurological function over the first 24 hours. This was
graded 1, 2, or 3, with a fourth class for missing information.
• Eye indicator : A summary of diagnostics on the eyes. This too had three
measurement classes, with a fourth for missing information.
• Pupils : Pupil reaction to light — present, absent, or missing.
In the CATFIRM analysis the dependent variable was treated as being a categorical
variable.
The syntax for this analysis is contained in headicat.pr2 and is shown and discussed
below.
CATFIRM: den sum
1 ‘Outcome’ 3 ‘Dead/veg’ ‘Severe ‘ ‘Mod/good’
6
‘age’ 2 1 ‘m’ 0 8 ‘01234567’ 4.9000 5.0000
‘EMV’ 3 0 ‘1’ 0 8 ‘?1234567’ 4.9000 5.0000
‘MRP’ 4 0 ‘1’ 0 8 ‘?1234567’ 4.9000 5.0000
‘Change’ 5 0 ‘1’ 0 4 ‘?123’ 4.9000 5.0000
• The first line indicates that the outcome variable is categorical and both summary
statistics (sum) and a dendrogram (den) are to be included in the output
produced during analysis.
• The second line contains information on the dependent or outcome variable. It is
the first variable in the data set, as indicated by 1 in the first field, and it is used
here as a categorical variable. The outcome variable has three categories:
‘Dead/veg’ , ‘Severe ’, and ‘Mod/good’.
• This is followed by six lines of data, one for each predictor variable. Predictors
included range from ‘age’, a monotonic variable with 8 possible values to
‘Pupils’, a free variable with 3 possible values.
• The last two lines of the syntax file contain the output options for this analysis.
The options specified are as follows:
Score 1 for Yes and 2 for No; then compute this option as
1000 × Q1 + Q2 + 2 × Q3 − 1003.
In this case, detailed analysis of splits is not requested (Q1 = 2), but cross-
tabs are requested (Q2 = Q3 = 1).
25: The minimum size a node must have to be considered for further
splitting.
0.50000: The raw significance level for a split to be made.
1.0000: The conservative significance level that a predictor must attain
for the split to be made.
25: The maximum number of nodes to be analyzed.
0.000: Constant to be added to χ².
0: The data for this analysis is in free format. If fixed format had been
indicated, a format statement would have followed directly after this line
of input.
1: Specifies use of FIRM 2.1 methodology rather than FIRM 2.0
methodology.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
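The arithmetic behind the combined output option can be checked with a short sketch. The function names below are hypothetical and are not part of FIRM; the only thing taken from the text is the formula 1000 × Q1 + Q2 + 2 × Q3 − 1003 with answers scored 1 for Yes and 2 for No.

```python
def option_code(q1: int, q2: int, q3: int) -> int:
    """Combine three yes/no answers (1 = Yes, 2 = No) into one option value."""
    for answer in (q1, q2, q3):
        if answer not in (1, 2):
            raise ValueError("each answer must be 1 (Yes) or 2 (No)")
    return 1000 * q1 + q2 + 2 * q3 - 1003


def decode_option_code(code: int) -> tuple:
    """Recover (Q1, Q2, Q3) by trying all eight answer combinations."""
    for q1 in (1, 2):
        for q2 in (1, 2):
            for q3 in (1, 2):
                if option_code(q1, q2, q3) == code:
                    return (q1, q2, q3)
    raise ValueError(f"{code} is not a valid option value")


# No detailed split analysis (Q1 = 2), cross-tabs requested (Q2 = Q3 = 1).
print(option_code(2, 1, 1))      # 1000
print(decode_option_code(1000))  # (2, 1, 1)
```

Because Q2 + 2 × Q3 takes a different value for each of the four answer pairs, every valid option value decodes to exactly one set of answers.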
For more information on syntax, see the sections on the dependent and predictor variable
specifications in Chapter 8.
The syntax read from the syntax file is echoed to the output file. After this echo, the
file shows the analysis of each node. As can be seen from the mini-dendrogram below,
18 nodes were obtained.
CATFIRM. Formal Inference-based Recursive Modeling
Program dimensions
Maximum number of predictors 1500
Maximum number of categories in
Predictors 16
Dependent variable 16
Mini-dendrogram of analysis
                                            1
                                            6
                         ---------------------------------------
                         2                                     3
                         1
   ---------------------------------------------
   4            5              6               7
   3            5              3               2
-------   -------------   ------------   -------------
8     9   10   11    12   13   14   15   16   17    18
Summary information on the outcome and predictor variables is given next, followed by
a list of the options in effect for the analysis.
Outcome has 3 categories:
Dead/veg
Severe
Mod/good
Options in effect:
Tables printed as column percentages
Detail output on file split
Tables given before and after each step
To be analysed, a group must:
have at least 25 cases;
be significant at the 0.500% level;
be conservatively significant at the 1.000% level.
The run will terminate when 25 groups have been formed.
Standard Pearson X^2 statistic is used
Tests use new (FIRM 2.1) procedures
The summary of results of the first split is given below. Node number 1 is the full data
set. In node 1, all the predictors give highly significant splits on the dependent variable
Outcome. The number of groups selected by CATFIRM varies from 2 to 4: under the
heading groups we see how the categories break down. For example, using EMV would
split the cases into 4 groups. Taking the Pearson χ² value for the 3 × 4 table and finding
its p-value under the non-asymptotic distribution would give
P = 3.44 × 10⁻²⁰ %.
Making the Bonferroni adjustment for the grouping that went into this final set of
categories multiplies this p-value by 95, giving the Bonferroni significance as
P = 95 × 3.44 × 10⁻²⁰ % = 3.27 × 10⁻¹⁸ %.
The multiple comparison p-value takes the original Pearson χ² for the 3 × 8 table before
grouping the categories and enters it into the non-asymptotic distribution, giving a
p-value of
P = 5.00 × 10⁻¹⁷ %.
Since both the Bonferroni and the multiple comparison p-values are conservative, we
take the smaller of them, 3.27 × 10⁻¹⁸ %, to be the conservative significance level of the
predictor.
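The rule just described (scale the raw p-value by the Bonferroni multiplier, then take the smaller of the Bonferroni and multiple comparison values) can be written as a small sketch. The function name is hypothetical, and capping the Bonferroni value at 100% is an assumption made here, not something stated in the text.

```python
def conservative_significance(raw_pct, bonferroni_multiplier, mc_pct):
    """Return (Bonferroni %, conservative %) for one predictor.

    raw_pct               -- raw significance of the grouped table, in percent
    bonferroni_multiplier -- multiplier allowing for the grouping of categories
    mc_pct                -- multiple comparison significance, in percent
    """
    # Assumption: a significance level cannot exceed 100%, so cap it there.
    bonferroni_pct = min(100.0, raw_pct * bonferroni_multiplier)
    return bonferroni_pct, min(bonferroni_pct, mc_pct)


# EMV in node 1: raw 3.44E-20 %, multiplier 95, multiple comparison 5.00E-17 %.
bonf, conservative = conservative_significance(3.44e-20, 95, 5.00e-17)
print(f"{bonf:.2E} {conservative:.2E}")  # 3.27E-18 3.27E-18
```

For EMV the Bonferroni value is the smaller of the two, so it becomes the conservative significance level of the predictor.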
Summary of results of node number 1 predecessor node number 0
Total group
no.  Name      Signif %   Bonf sig %   MC sig %   groups
  1  age       6.40E-13   2.24E-11     1.62E-10   01 23 45 67
  2  EMV       3.44E-20   3.27E-18     5.00E-17   12 345 ?6 7
  3  MRP       2.04E-20   1.04E-18     2.32E-16   12 345 ?67
  4  Change    0.0116     0.0579       0.0208     ?1 23
  5  Eye ind   3.88E-19   1.94E-18     1.40E-18   1 ?2 3
  6  Pupils    5.08E-20   1.52E-19     3.47E-17   ?2 1
Detail on the characteristics of the best predictor is given next. The most significant split
uses Pupils. It is a binary split, between the pooled class ? or 2 and class 1. This split has
a Bonferroni significance level of 1.52 × 10⁻¹⁹ % and a multiple comparison significance
level of 3.47 × 10⁻¹⁷ %. The smaller of these (both being conservative) is taken as the
significance level of the split.
The summary table below shows how the cases divide between these two—roughly three
quarters go to node 2, where the recovery rate is quite variable, and the rest go to node 3,
where 90% of the patients are dead or vegetative.
Characteristics of the best predictor
6 Pupils 5.08E-20 1.52E-19 3.47E-17 ?2 1
***********************************************************************
predictor 6 Pupils   *percent*   total number 500
                  ?,2       1   Total
Dead/veg         39.4    90.2    51.8
Severe           11.9     5.7    10.4
Mod/good         48.7     4.1    37.8
totals (100%)     378     122     500
Raw significance of table is 5.08E-20
The analysis continues with node number 2. CATFIRM lists the makeup of this node—it
is all cases for which Pupils is ? or 2. As the analysis proceeds, this record grows to
reflect the successive splits giving rise to the node. In this node, no significant split can
be made on Pupils (not surprisingly, since the two classes of Pupils represented in this
node were grouped because of their compatibility). All other predictors give significant
splits, the conservative significance ranging from 0.1% for Change down to 8 × 10⁻¹² %
for age. Node 2 is split four ways on age, the cut points being at ages 20, 40 and 60. All
four nodes are investigated again later in the analysis.
Summary of results of node number 2 predecessor node number 1
Node 3 is terminal. Its cases cannot (at the significance levels selected) be split further.
When there is no further significant split to be made within a node, the program adds a
message to this effect at the end of the summary of results for that node. This was the
case for node 3, where the best predictor, Eye ind, was not significant.
Summary of results of node number 3 predecessor node number 1
Makeup Pupils (1)
no.  Name      Signif %   Bonf sig %   MC sig %   groups
  1  age       100.00     100.00       100.00     01234567
  2  EMV       0.1354     6.9051       8.5868     12 345 ?67
  3  MRP       0.3558     4.6251       8.0063     123456 ?7
  4  Change    100.00     100.00       100.00     ?123
  5  Eye ind   0.1679     0.8394       0.3613     12 ?3
  6  Pupils    100.00     100.00       100.00     11
The analysis continues in the same way with the successive nodes generated. The
information on the splits actually made is duplicated in the dendrogram. The additional
information in the summary file is on the predictors that did not give rise to splits. In the
full data set, all the predictors were highly significant. In Node 2, age was as significant
as it was in the full data set, so the predictive power of age is distinct from that of
Pupils. EMV, MRP and Eye ind had considerably less significance in Node 2 than in
Node 1. While to some degree this is an inevitable consequence of the reduction in
sample size going down the tree, in part it can also indicate overlap in the predictive
information—that these predictors are correlated with Pupils and the common variability
they have with Pupils comprises a lot of the information about Outcome. Scanning the
summary file for this sort of information can often produce valuable insights.
The detailed information of splits is given in the output file headicat.spl. The first few
lines of this file are given below. The split file details the tests that went into the final
grouping of each predictor at each node. There are three slightly different layouts here,
illustrated by the monotonic predictor age, the free predictor Pupils and the floating
predictor MRP. For the monotonic predictor, only adjacent categories may be merged.
The predictor age starts out with 8 groups, giving 7 pairwise χ² statistics that are listed
after “Test statistics for grouping”. The smallest of these is 1.3775 for merging categories
2 and 3. This value is well below the merge significance level for the run, so these two
classes are joined into a composite, leaving 7 groups. The merge statistics for these
revised groups are computed, the smallest of which is 1.6163 for merging classes 0 and
1, so these classes are merged. The analysis continues in the same way until 4 composite
groups remain. At this stage, the smallest merge statistic—that for merging (23) with
(45)—is significant at the 2% level, which is below the run’s threshold for merging.
Thus no further reduction of the classes of age occurs.
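The merge phase for a monotonic predictor can be sketched as follows. This is an illustration under stated assumptions rather than FIRM's own code: the function names and the counts in the usage example are hypothetical, the outcome is assumed to have three categories (so each pairwise test has 2 degrees of freedom), and the significance is taken from the asymptotic χ² distribution with 2 d.f., whose tail area is exp(−x/2). FIRM's printed significances may be computed differently. The resplitting phase described next is not included.

```python
import math


def pearson_chi2(col_a, col_b):
    """Pearson chi-square for the (outcomes x 2) table formed by two columns of counts."""
    n_a, n_b = sum(col_a), sum(col_b)
    total = n_a + n_b
    if total == 0:
        return 0.0
    stat = 0.0
    for a, b in zip(col_a, col_b):
        row = a + b
        for observed, col_total in ((a, n_a), (b, n_b)):
            expected = row * col_total / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def merge_monotonic(columns, labels, merge_sig_pct):
    """Merge adjacent categories until the closest pair differs significantly.

    columns       -- one list of three outcome counts per predictor category
    labels        -- category labels, in scale order
    merge_sig_pct -- merge significance level, in percent
    """
    if any(len(col) != 3 for col in columns):
        raise ValueError("this sketch assumes exactly three outcome categories")
    columns = [list(col) for col in columns]
    groups = [[label] for label in labels]
    while len(groups) > 1:
        # Only neighbours on the scale are candidates for merging.
        stats = [pearson_chi2(columns[i], columns[i + 1])
                 for i in range(len(groups) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        sig_pct = 100.0 * math.exp(-stats[i] / 2.0)  # chi-square tail area, 2 d.f.
        if sig_pct <= merge_sig_pct:
            break  # the most similar pair is significantly different: stop merging
        columns[i:i + 2] = [[a + b for a, b in zip(columns[i], columns[i + 1])]]
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups


# Hypothetical counts: categories 0 and 1 are alike, as are 2 and 3.
example = [[30, 10, 10], [30, 10, 10], [10, 10, 30], [10, 10, 30]]
print(merge_monotonic(example, ["0", "1", "2", "3"], merge_sig_pct=5.0))
# [['0', '1'], ['2', '3']]
```

At each pass only the single most similar adjacent pair is merged, after which all statistics are recomputed, mirroring the stage-by-stage listing in the split file.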
No more merging being possible, CATFIRM then attempts to split the composite
categories. The line “Test stats for splitting” gives the details of this, showing a χ²
statistic for each possible resplitting point of a composite category. The largest of these
statistics is 5.0593, which is far from significant at the split significance level specified
for the run, and so no resplitting takes place. Thus (01), (23), (45), (67) is the final
grouping of the categories. The final 3 × 4 contingency table gives a Pearson χ² of
77.984. Entering this into χ² tables with 6 degrees of freedom gives a raw significance
level of 6.4 × 10⁻¹³ %. The Bonferroni multiplier, which allows for the grouping that went
into reducing 8 groups into 4, is 35. The Bonferroni significance level is therefore
35 × 6.40 × 10⁻¹³ % = 2.24 × 10⁻¹¹ %.
Min stat is 1.3775, to merge (2) and (3). d.f. 2, sig 51.1157
7 groups: 0 1 (2 3) 4 5 6 7
Test statistics for grouping: 1.6 10.7 7.9 1.7 7.7 5.1
Min stat is 1.6163, to merge (0) and (1). d.f. 2, sig 45.3755
6 groups: (0 1) (2 3) 4 5 6 7
Test statistics for grouping: 9.5 7.9 1.7 7.7 5.1
Min stat is 1.7396, to merge (4) and (5). d.f. 2, sig 43.0870
5 groups: (0 1) (2 3) (4 5) 6 7
Test statistics for grouping: 9.5 7.8 7.1 5.1
Min stat is 5.0593, to merge (6) and (7). d.f. 2, sig 7.6055
4 groups: (0 1) (2 3) (4 5) (6 7)
Test statistics for grouping: 9.5 7.8 14.4
Min stat is 7.7518, to merge (23) and (45). d.f. 2, sig 2.0360
4 groups: (0 1) (2 3) (4 5) (6 7)
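The Bonferroni multipliers quoted in this chapter agree with the counting formulas published by Kass (1980) for monotonic, floating and free predictors. That FIRM computes its multipliers with exactly these formulas is an assumption here, and the function names are hypothetical, but the sketch reproduces the multipliers implied by the node 1 summary: 35 for age, 95 for EMV and 3 for Pupils. In each function c is the original number of categories (including the floating or missing category) and g is the number of groups after merging.

```python
from math import comb, factorial


def monotonic_multiplier(c: int, g: int) -> int:
    """Ways to cut c ordered categories into g contiguous groups."""
    return comb(c - 1, g - 1)


def float_multiplier(c: int, g: int) -> int:
    """Ordered scale of c - 1 categories plus one floating category, g groups."""
    floating_alone = comb(c - 2, g - 2) if g >= 2 else 0
    floating_joined = g * comb(c - 2, g - 1)
    return floating_alone + floating_joined


def free_multiplier(c: int, g: int) -> int:
    """Ways to partition c unordered categories into g non-empty groups."""
    signed_sum = sum((-1) ** i * comb(g, i) * (g - i) ** c for i in range(g + 1))
    return signed_sum // factorial(g)


print(monotonic_multiplier(8, 4))  # age:    35
print(float_multiplier(8, 4))      # EMV:    95
print(free_multiplier(3, 2))       # Pupils: 3
```

The free-predictor count grows far faster with c than the other two, which is one way to see why free predictors with many categories are costly.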
The floating predictors, as exemplified by MRP, have an extra wrinkle to them. Not only
can the monotonic portion of the scale be merged in the same way as with age, but the
floating category can be merged with any of them. Thus the detail starts out with two
lines—the test statistics for merging a successive pair on the scale 1-7 on the upper line,
and those for merging the floating category with each monotonic class on the second line.
CATFIRM checks the full list of (in this case 13) test statistics, and finds that the smallest
is 0.7021, for merging classes 1 and 2. These classes are merged and the analysis
repeated. In the second stage, it happens that the smallest test statistic is for merging the
category ? with 6, so these two categories are joined. Once this is done, there is no
longer a floating category, just the monotonic categories (12), 3, 4, 5, (?6) and 7, and
the subsequent lines of the analysis look like those of a monotonic predictor. There is a
slightly different twist at the last phase. Here the floating category is part of a three-class
composite (?67), and so there are three possible binary splits—? vs (67), (?6) vs 7 and
6 vs (?7). The test statistics for these are shown on two lines of the printout—2.5 and 4.3
for the first two, and 1.4 for the third. In reading these two lines, ignore the rightmost ?
on the first line and the leftmost on the second. Again, the split statistic is not significant,
and so the grouping (12) (345) (?67) is final.
Float MRP
Table has chi-square 117.765, with df 14 and significance 2.32E-16
8 groups: 1 2 3 4 5 6 7
Test statistics for grouping: 0.7 3.8 1.9 3.7 5.0 3.0
and for grouping ? with 17.9 16.6 6.1 8.8 1.4 0.9 4.5
Min stat is 0.7021, to merge (1) and (2). d.f. 2, sig 72.5715
7 groups: (1 2) 3 4 5 6 7
Test statistics for grouping: 5.9 1.9 3.7 5.0 3.0
and for grouping ? with 23.1 6.1 8.8 1.4 0.9 4.5
Min stat is 0.8839, to merge (?) and (6). d.f. 2, sig 65.2141
6 groups: (1 2) 3 4 5 (? 6) 7
Test statistics for grouping: 5.9 1.9 3.7 4.7 4.3
Min stat is 1.9242, to merge (3) and (4). d.f. 2, sig 38.4126
5 groups: (1 2) (3 4) 5 (? 6) 7
Test statistics for grouping: 8.6 3.3 4.7 4.3
Min stat is 3.3063, to merge (34) and (5). d.f. 2, sig 18.9915
4 groups: (1 2) (3 4 5) (? 6) 7
Test statistics for grouping: 11.8 34.9 4.3
Min stat is 4.2724, to merge (?6) and (7). d.f. 2, sig 11.8624
3 groups: (1 2) (3 4 5) (? 6 7)
Test statistics for grouping: 11.8 56.1
Min stat is 11.8477, to merge (12) and (345). d.f. 2, sig 0.2492
3 groups: (1 2) (3 4 5) (? 6 7 ?)
Free predictors with more than three categories involve much more computation than
monotonic or floating predictors with the same number of categories. Their analysis
(though not the potential computational load) is illustrated by Pupils. Since any two
categories may be merged, each step of the merge phase computes and lists the lower
triangle of a matrix of pairwise χ² values, in this case of a 3 × 3 matrix. The first round
of testing finds the categories ? and 2, and the single statistic FT 3.6 shows that this split
is not significant.
If the splitting phase finds a split that is significant at the split significance level selected
for this run, this composite is split, and CATFIRM returns to the first phase of looking for
possible merges.
Since a c-category composite class of a free predictor can be resplit in 2^(c−1) − 1 ways, testing
for resplitting of free predictors with many classes can become an enormous
computational burden. It is partly for this reason and partly because of the method of
implementation that CATFIRM has an immutable upper limit of 16 on the number of
categories a free predictor may have.
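To see how quickly the resplitting work grows, the binary splits of a composite class can be enumerated directly. The function below is a hypothetical illustration, not part of FIRM; it fixes the first category on the left-hand side so that mirror-image splits are counted once, which gives 2^(c−1) − 1 splits for c distinct categories.

```python
from itertools import combinations


def binary_resplits(categories):
    """List every way to cut a composite of free categories into two non-empty parts."""
    cats = list(categories)
    if not cats:
        return []
    first, rest = cats[0], cats[1:]
    splits = []
    # k is the number of other categories that stay with the first one;
    # stopping at len(rest) - 1 keeps at least one category on the right.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = [first, *extra]
            right = [c for c in rest if c not in extra]
            splits.append((left, right))
    return splits


print(len(binary_resplits("?12")))  # 3
print(2 ** (16 - 1) - 1)            # 32767 splits for a 16-category composite
```

A composite of all 16 permitted categories would already require 32,767 candidate splits to be tested at a single step.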
The other optional, and potentially huge, file that CATFIRM produces on request contains
the contingency tables before and/or after the grouping of categories. This is illustrated by
the next section of printout from the output file headicat.tab. Scanning these tables is
often helpful in gaining a better perspective on the grouping that went before, in particular
regarding the number of cases in the groups that were merged.
The table for the total group before grouping with respect to the predictor Pupils (the
most significant predictor) is given first. This is followed by the table for the total group
after splitting on Pupils. Percentages are given for each category of the outcome
separately by category of the predictor. The total number of cases in each group is also
given, allowing the user to calculate the number of cases per combination of predictor
and outcome categories should that be required.
Total group before grouping
***********************************************************************
predictor 6 Pupils   *percent*   total number 500
                    ?       1       2   Total
Dead/veg         61.5    90.2    38.6    51.8
Severe           15.4     5.7    11.8    10.4
Mod/good         23.1     4.1    49.6    37.8
totals (100%)      13     122     365     500
Raw significance of table is 3.47E-17
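As noted above, the cell counts can be recovered from the column percentages and the column totals. The sketch below does this for the Pupils table and then pools the ? and 2 columns, which reproduces the percentages shown for the merged group in the summary file. The function name is hypothetical.

```python
def counts_from_percentages(column_percentages, column_totals):
    """Convert column percentages back into whole-number cell counts."""
    return {
        label: [round(pct * column_totals[label] / 100.0)
                for pct in column_percentages[label]]
        for label in column_totals
    }


# Rows are Dead/veg, Severe, Mod/good; columns are the Pupils categories.
percentages = {"?": [61.5, 15.4, 23.1],
               "1": [90.2, 5.7, 4.1],
               "2": [38.6, 11.8, 49.6]}
totals = {"?": 13, "1": 122, "2": 365}

counts = counts_from_percentages(percentages, totals)
print(counts)  # {'?': [8, 2, 3], '1': [110, 7, 5], '2': [141, 43, 181]}

# Pool the ? and 2 columns, as CATFIRM did when grouping Pupils.
pooled = [a + b for a, b in zip(counts["?"], counts["2"])]
pooled_pct = [round(100.0 * n / sum(pooled), 1) for n in pooled]
print(sum(pooled), pooled_pct)  # 378 [39.4, 11.9, 48.7]
```

The very small ? column (13 cases) helps explain why it could not be distinguished statistically from category 2.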
The final output file, headicat.spr, contains the split rule table of splits made during the
analysis. The complete file is given below. There is nothing of interest to most starting
users in this file. Its main use is for other programs that want to read and make use of the
results of a FIRM analysis. For the contents, layout and format of the split rule table,
please see the section on FIRM syntax in Chapter 8.
1 6 2 3 2
2 1 4 4 5 5 6 6 7 7
4 3 8 8 8 8 8 8 9 9
5 5 11 10 11 12
6 3 15 13 13 13 14 14 15 15
7 2 16 16 16 16 17 17 17 18
-1 -1
0
0
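For a program that wants to read these results, the split rule table above can be parsed along the following lines. The layout assumed here is inferred from this example only (node number, predictor number, then one descendant node per category code, with a negative node number ending the list); it is not taken from the format description in the syntax chapter, and the function name is hypothetical.

```python
RULE_TEXT = """\
1 6 2 3 2
2 1 4 4 5 5 6 6 7 7
4 3 8 8 8 8 8 8 9 9
5 5 11 10 11 12
6 3 15 13 13 13 14 14 15 15
7 2 16 16 16 16 17 17 17 18
-1 -1
"""


def parse_split_rules(text):
    """Map each split node to (predictor number, descendant node per category)."""
    rules = {}
    for line in text.splitlines():
        fields = [int(token) for token in line.split()]
        if not fields or fields[0] < 0:
            break  # assumed end-of-table marker
        node, predictor, *descendants = fields
        rules[node] = (predictor, descendants)
    return rules


rules = parse_split_rules(RULE_TEXT)
all_nodes = {1} | {d for _, descendants in rules.values() for d in descendants}
terminal_nodes = sorted(all_nodes - set(rules))

print(rules[1])        # (6, [2, 3, 2])
print(terminal_nodes)  # nodes that are never split further
```

Read this way, the first row says that node 1 is split on predictor 6 (Pupils), sending categories ? and 2 to node 2 and category 1 to node 3.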
The final file produced by CATFIRM is the full-scale dendrogram (see below). The
dendrogram produced by CATFIRM is a very informative picture of the relationship
between these predictors and the patients’ final outcome. We see in it that the most
significant separation was obtained by splitting the full sample on the basis of the
predictor Pupils.
The groups of 378 cases for which Pupils had the value 2 or the value ? (that is, missing)
were statistically indistinguishable, and so are grouped together. They constitute one of
the successor groups (node number 2), while those 122 for whom Pupils had the value 1
constitute the other (node number 3), a group with much worse outcomes—90% dead or
vegetative compared with 39% of those in node 2.
Each of these nodes in turn is subjected to the same analysis. The cases in node number 2
can be split again into more homogeneous subgroups. The most significant such split is
obtained by separating the cases into four groups on the basis of the predictor age. These
are patients under 20 years old (node 4), patients 20 to 40 years old (node 5), those 40 to
60 years old (node 6) and those over 60 (node 7). The prognosis of these patients
deteriorates with increasing age; 70% of those under 20 ended with moderate or good
recoveries, while only 12% of those over 60 did.
Node 3 is terminal. Its cases cannot (at the significance levels selected) be split further.
Node 4 can be split using MRP. Cases with MRP = 6 or 7 constitute a group with a
favorable prognosis (86% with moderate to good recovery), and the other MRP levels
constitute a less-favorable group but still better than average.
These groups, and their descendants, are analyzed in turn in the same way. Ultimately no
further splits can be made and the analysis stops. Altogether 18 nodes are formed, of
which 12 are terminal and the remaining 6 intermediate. Each split in the dendrogram
shows the variable used to make the split and the values of the splitting variable that
define each descendant node. It also lists the statistical significance (p-value) of the split.
Two p-values are given: that on the left is FIRM’s conservative p-value for the split. The
p-value on the right is a Bonferroni-corrected value, reflecting the fact that the split
actually made had the smallest p-value of all the predictors available to split that node. So
for example, on the first split, the actual split made has a p-value of 1.52 × 10⁻¹⁹ %. But
when we recognize that this was selected because it was the most significant of the 6
possible splits, we may want to scale up its p-value by the factor 6 to allow for the fact
that it was the smallest of 6 p-values available (one for each predictor). This gives its
conservative Bonferroni-adjusted p-value as 9.15 × 10⁻¹⁹ %.
The dendrogram and the analysis giving rise to it can be used in two obvious ways—for
making predictions, and to gain understanding of the importance of and interrelationships
between the different predictors. Taking the prediction use first, the dendrogram provides
a quick and convenient way of predicting the outcome for a patient. Following patients
down the tree to see into which terminal node they fall yields 12 typical patient profiles
ranging from 97.1% dead/vegetative to 86.1% with moderate to good recoveries. The
dendrogram is often used in exactly this way to make predictions for individual cases.
Unlike, for instance, predictions made using multiple regression or discriminant analysis,
the prediction requires no arithmetic—merely the ability to take a future case down the
tree. This allows for quick predictions with limited potential for arithmetic errors. A
hospital could, for example, use the dendrogram to give an immediate estimate of a head-
injured patient’s prognosis.
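Taking a case down the tree can also be automated. The sketch below encodes the six splits from the split rule table together with the category codes from the syntax listings, and follows a case from node 1 to its terminal node. The data structures and the function name are hypothetical, and it is assumed that the Eye ind and Pupils codes are ?123 and ?12, as in the CONFIRM syntax for the same data.

```python
# Category codes by predictor number (1 age, 2 EMV, 3 MRP, 5 Eye ind, 6 Pupils).
CODES = {1: "01234567", 2: "?1234567", 3: "?1234567", 5: "?123", 6: "?12"}

# Split node -> (predictor number, descendant node for each category code).
RULES = {
    1: (6, [2, 3, 2]),
    2: (1, [4, 4, 5, 5, 6, 6, 7, 7]),
    4: (3, [8, 8, 8, 8, 8, 8, 9, 9]),
    5: (5, [11, 10, 11, 12]),
    6: (3, [15, 13, 13, 13, 14, 14, 15, 15]),
    7: (2, [16, 16, 16, 16, 17, 17, 17, 18]),
}


def terminal_node(case):
    """Follow one case (predictor number -> category code) down the tree."""
    node = 1
    while node in RULES:
        predictor, descendants = RULES[node]
        position = CODES[predictor].index(case[predictor])
        node = descendants[position]
    return node


# Pupils = 1: the case goes straight to terminal node 3.
print(terminal_node({6: "1"}))                  # 3
# Pupils = 2, age under 20 (code 0), MRP = 7: the favorable MRP group.
print(terminal_node({6: "2", 1: "0", 3: "7"}))  # 9
```

Only the predictors actually met on the way down need to be known for a case, which is part of what makes the tree convenient at the bedside.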
The “mixed” data set was provided by Gordon V. Kass. The original file contains the
records of nearly 20,000 students at the University of the Witwatersrand. From this file,
we randomly extracted 500 for use as a FIRM calibration data set and another 500 to
illustrate validation. Most of the predictors are of type real, being percentage scores
attained by the students for a variety of subjects in high school. In addition, the data set
contains four character predictors:
The dependent variable for the CATFIRM run was also of the type character—a
promotion code. Possible values for this were P for a clear pass for the year; F for a clear
fail for the year; and C for a partial credit on courses for the year. There were a number of
other low-frequency codes; as part of the initial work on the data file these were all changed
to ?, along with the genuinely missing values.
The CATFIRM syntax for the analysis of the “mixed” data is given below:
CATFIRM: den sum
15 ‘Promo code’ -1
12
‘Afrikaans’ 7 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Biology’ 10 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘English’ 6 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Faculty’ 13 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘History’ 12 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Latin’ 11 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Math’ 8 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Mattype’ 3 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘Matmonth’ 4 0 ‘f’ 0 13 ‘?123456789ABC’ 0.9000 1.0000
‘Matyear’ 5 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Science’ 9 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Sex’ 2 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
0 1000 25 .10000 1.00000 25 .500 0 1 0 0 0 0 0 0
• The first line of the syntax file specifies the inclusion of the summary statistics
(keyword sum) and the dendrogram (keyword den) in the output file, rather than
having this information written to various external files.
• The dependent variable, Promo code, is in the 15th position in the data set. The
artificial code -1 indicates that the dependent variable is a character variable and
that FIRM should find the separate categories from the data.
• There are 12 predictors, as indicated by the number 12 in the third line of the
syntax.
• The information for each predictor follows next, and is given in the following
order: the name of the predictor, immediately followed by its position in the data.
In the case of Afrikaans, for example, the predictor is in the 7th position in the
data.
• 0 indicates that this predictor may be used for splitting. Afrikaans is a real
predictor (type of predictor = r). The monotonic, free and float predictor types all
require that the predictor’s values be consecutive integers. The next two fields
should contain information on this range, and are only required for these three
predictor types. For a real predictor, 0s are used instead.
• This is usually followed by the category codes. In this case, the absence of
category codes is indicated by “ “. Note that category codes and range of values
are provided in the case of the predictor Matmonth, which is a free predictor.
• The last two entries in this line are the splitting and merging significance values.
These levels are given in percent. In this example, a 0.9% significance level will
be used for splitting and 1% for merging.
• The last line of the syntax file contains the output options for this CATFIRM
analysis. The options specified are as follows:
0 1000 25 .10000 1.00000 25 .500 0 1 0 0 0 0 0 0
Score 1 for Yes and 2 for No; then compute this option as
1000 × Q1 + Q2 + 2 × Q3 − 1003.
In this case, detailed analysis of splits is not requested (Q1 = 2), but cross-
tabs are requested (Q2 = Q3 = 1).
25: The minimum size a node must have to be considered for further
splitting.
0.10000: The raw significance level for a split to be made.
1.0000: The conservative significance level that a predictor must attain
for the split to be made.
25: The maximum number of nodes to be analyzed.
0.500: Constant to be added to χ².
0: The data for this analysis is in free format. If fixed format had been
indicated, a format statement would have followed directly after this line
of input.
1: Specifies use of FIRM 2.1 methodology.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
0: Not currently used, set to default value of 0.
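The options line can be read into named fields as sketched below. The field names are hypothetical labels chosen to match the descriptions above; the first entry on the line is kept under a neutral name because its meaning is not described in this section.

```python
from dataclasses import dataclass


@dataclass
class RunOptions:
    leading_field: int           # first entry; not described in this section
    output_option: int           # 1000*Q1 + Q2 + 2*Q3 - 1003
    min_node_size: int           # minimum size for a node to be split
    raw_sig_pct: float           # raw significance level, in percent
    conservative_sig_pct: float  # conservative significance level, in percent
    max_nodes: int               # maximum number of nodes to be analyzed
    chi2_constant: float         # constant added to the chi-square statistic
    data_format: int             # 0 = free format
    methodology: int             # 1 = FIRM 2.1 methodology
    unused: tuple                # six entries not currently used


def parse_options(line: str) -> RunOptions:
    """Split the options line on whitespace and convert each field."""
    t = line.split()
    if len(t) != 15:
        raise ValueError(f"expected 15 entries, found {len(t)}")
    return RunOptions(
        int(t[0]), int(t[1]), int(t[2]), float(t[3]), float(t[4]),
        int(t[5]), float(t[6]), int(t[7]), int(t[8]),
        tuple(int(x) for x in t[9:]),
    )


options = parse_options("0 1000 25 .10000 1.00000 25 .500 0 1 0 0 0 0 0 0")
print(options.output_option, options.raw_sig_pct, options.chi2_constant)
# 1000 0.1 0.5
```

Checking the number of entries up front catches a line that has lost or gained a field before any value is misread.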
For more information on syntax, see the sections on the dependent and predictor variable
specifications in Chapter 8.
The first part of the output file for the CATFIRM analysis of the “mixed” data concerns
the specification of syntax, input and output files.
The following lines were read from file H:\FIRM\DATA\MIXED.CAT
This is followed by basic splitting and merging information and the layout of the
dendrogram.
Starting node 1 descended from 0
Split on 4 making descendants 2 3 2 4 5 3 4 5 5 4
Mini-dendrogram of analysis
1
4
--------------------------------------------------
2 3 4 5
7 11
------------ -----------------------
6 7 8 9 10
The output of summary and error information follows. This information is written to the
output file if the keyword sum is used in the first line of the syntax file.
Immediately following the echo of the options, the different codes seen for the dependent
variable are listed—these were F, P, ? and C. These values will be used in the run as the
category labels. This is followed by the code list for the character predictors. The
predictor Mattype, for example, took on 11 different values all of which (as it happens)
were one character long (character predictors do not have to be single-character—they
can have length up to 20). These values being necessarily different, they will be used as
the category symbols for this predictor.
---------------------------------------------
| Output of Summary and Error Information |
---------------------------------------------
Program dimensions
Maximum number of predictors 1003
Maximum number of categories in
Predictors 16
Dependent variable 16
Next there is a listing of the cutpoints for the real predictors, which is needed when
reading the rest of the summary output and the dendrogram. Just one real predictor from
this list, Afrikaans, is shown below.
The cutpoints 45, 51, 54, 55, 57, 62, 64, 65 and 75 are the best we can find for dividing
the data set up into 10 groups of about equal size, and are the starting cutpoints from
which FIRM will do its subsequent merging to a final grouping.
Predictor 1 Afrikaans
Code Max value in class
? Invalid or missing
0 <= 45.000
1 <= 51.000
2 <= 54.000
3 <= 55.000
4 <= 57.000
5 <= 62.000
6 <= 64.000
7 <= 65.000
8 <= 75.000
9 > 75.000
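The coding in this table can be reproduced with a sorted list of cutpoints and a binary search. In the sketch below the Afrikaans cutpoints are taken from the listing; the function names are hypothetical, and the use of simple sample quantiles to propose starting cutpoints is an assumption, since the rule FIRM actually uses to form about ten equal-sized groups is not given here.

```python
from bisect import bisect_left
from statistics import quantiles

AFRIKAANS_CUTPOINTS = [45, 51, 54, 55, 57, 62, 64, 65, 75]


def class_code(score, cutpoints=AFRIKAANS_CUTPOINTS):
    """Return the class code for a score; None stands for invalid or missing."""
    if score is None:
        return "?"
    # Class k holds scores that are <= cutpoints[k] and above the previous cutpoint.
    return str(bisect_left(cutpoints, score))


def equal_frequency_cutpoints(scores, groups=10):
    """Propose cutpoints dividing the scores into groups of about equal size."""
    return quantiles(scores, n=groups)


print(class_code(45), class_code(46), class_code(75), class_code(76))  # 0 1 8 9
print(class_code(None))                                                # ?
```

With nine cutpoints the codes run from 0 to 9, matching the ten classes listed for Afrikaans, with ? reserved for missing values.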
Options in effect:
Tables printed as column percentages
Detail output on file split
Tables given before and after each step
To be analysed, a group must:
have at least 25 cases;
be significant at the 0.100% level;
be conservatively significant at the 1.000% level.
The run will terminate when 25 groups have been formed.
Modify chi sq statistic by 0.50000
Tests use new (FIRM 2.1) procedures
This is followed by a summary output of the same format as used for the “head injuries”
data set. The full data set can be split using almost any of the predictors. The most
significant is a four-way split on Faculty into the groupings (AM), (CE), (SFL) and (DBH).
The summary file output continues, showing the same information for the
descendant nodes.
Summary of results of node number 1 predecessor node number 0
Total group
no.  Name        Signif %   Bonf sig %   MC sig %   groups
  1  Afrikaans   0.0189     7.0146       0.9615     012 3 ?4 56789
  2  Biology     1.49E-03   0.0284       9.1541     0123?456 789
  3  English     1.18E-08   4.38E-06     1.75E-04   ? 012345 6 789
  4  Faculty     7.45E-14   2.54E-09     7.39E-07   AM CE SFL DBH
In the complete group, 25% of the students failed, 54.2 % passed, and 6.6% of the
students obtained some college credits for the year completed. In 14.2% of the cases
information was incomplete or unavailable. The most important predictor was the
Faculty, which divided the group into four subsets: (AM), (CE), (SFL) and (DBH),
representing subsets of the subject-matter groupings.
The “head injuries” data set of Titterington et al. (1981), analyzed with CATFIRM in
Section 6.2, is analyzed here using CONFIRM.
The outcome is predicted on the basis of 6 predictors assessed on the patients’ admission
to the hospital:
• age. The age of the patient. This is grouped into decades in the original data, and
is grouped the same way here. It has eight classes.
• EMV. This is a composite score of three measures—of eye-opening in response to
stimulation, motor response of best limb, and verbal response. This has seven
classes, but is not measured in all cases, so that there are eight possible codes for
this score—the seven measurements and an eighth “missing” category.
• MRP. This is a composite score of motor responses in all four limbs. This also has
seven measurement classes with an eighth class for missing information.
• Change. The change in neurological function over the first 24 hours. This was
graded 1, 2, or 3, with a fourth class for missing information.
• Eye indicator. A summary of diagnostics on the eyes. This too had three
measurement classes, with a fourth for missing information.
• Pupils. Pupil reaction to light—present, absent, or missing.
In the CONFIRM analysis the dependent variable is treated as being on the interval scale
of measurement, with values 1, 2, and 3. Because the outcome is really on the ordinal scale,
this equally spaced scale is not necessarily statistically appropriate for this data set. It is used
as a matter of convenience, not because it is considered the best way to proceed.
The CONFIRM syntax for the analysis of the “head injuries” data is given below:
CONFIRM: DEN SUM SPLIT RULE
1 ‘Outcome’ 0
6
‘age’ 2 1 ‘m’ 0 8 ‘01234567’ 0.9000 1.0000
‘EMV’ 3 0 ‘1’ 0 8 ‘?1234567’ 0.9000 1.0000
‘MRP’ 4 0 ‘1’ 0 8 ‘?1234567’ 0.9000 1.0000
‘Change’ 5 0 ‘1’ 0 4 ‘?123’ 0.9000 1.0000
‘Eye ind’ 6 0 ‘1’ 0 4 ‘?123’ 0.9000 1.0000
‘Pupils’ 7 0 ‘f’ 0 3 ‘?12’ 0.9000 1.0000
1 20 .00100 .50000 1.00000 50 0 0 1 .0000000 1 0 0 0 0
• The first line indicates that a CONFIRM analysis is required, and that a
dendrogram (keyword den), summary statistics (keyword sum), the split file
(keyword split) and the split rule file (keyword rule) are to be included in the
output file.
• The second line contains information on the dependent or outcome variable. It is
the first variable in the data set, as indicated by the 1 in the first field, and it is
used here as a continuous variable, as indicated by the 0 in the third entry on this
line. Because of this specification, no category names are required for the FIRM
analysis. This is the dependent variable specification section of the syntax file.
• The third line has one entry—6—indicating the number of predictor variables to
be used in the analysis. This is the start of the predictor specification section in the
syntax file. The next six lines provide information on each of the predictors.
• The variable age, grouped into decades as described in the previous subsection, is
the second variable in the data set (position of predictor = 2). The next value, 1,
indicates that the variable is to be carried along but not used for splitting. Age is a
monotonic predictor (type of predictor = m). The range of values of age is 0 to 8,
as indicated in the next two fields. This is followed by the category codes, in this
case 0 to 7. The last two entries in this line are the splitting and merging
significance values. These levels are given in percent. In this example, a 0.9%
significance level will be used for splitting and 1% for merging.
• The information for the other predictors, EMV, MRP, Change, Eye ind and Pupils,
is given in a similar way, concluding the predictor variable specification.
• The last line of the syntax file contains the output options for this CONFIRM
analysis.
1 20 .00100 .50000 1.00000 50 0 0 1 .0000000 1 0 0 0 0
For more information on syntax, see the sections on the dependent and predictor variable
specifications in Chapter 8.
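The predictor specification lines described above have a fixed field order, so they can be split mechanically. The following is a minimal Python sketch, assuming the field order given in the bullets (name, position, a flag, type, two range fields, category codes, splitting and merging significance in percent). The names `PredictorSpec` and `parse_predictor_line` are illustrative only and are not part of FIRM.

```python
import re
from dataclasses import dataclass


@dataclass
class PredictorSpec:
    name: str
    position: int         # column of the predictor in the data set
    flag: int             # the field that follows the position (see the bullets above)
    kind: str             # predictor type code, e.g. 'm' for monotonic
    range_low: int
    range_high: int
    codes: str            # one character per category
    split_sig_pct: float  # splitting significance level, in percent
    merge_sig_pct: float  # merging significance level, in percent


# A quoted field may be delimited by straight or typographic single quotes.
_QUOTED = r"[‘'’]([^‘'’]*)[‘'’]"
_LINE = re.compile(
    rf"\s*{_QUOTED}\s+(\d+)\s+(\d+)\s+{_QUOTED}\s+(\d+)\s+(\d+)"
    rf"\s+{_QUOTED}\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_predictor_line(line: str) -> PredictorSpec:
    """Split one predictor specification line into named fields."""
    m = _LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognised predictor line: {line!r}")
    name, pos, flag, kind, low, high, codes, split_sig, merge_sig = m.groups()
    return PredictorSpec(name, int(pos), int(flag), kind, int(low), int(high),
                         codes, float(split_sig), float(merge_sig))


spec = parse_predictor_line("‘age’ 2 1 ‘m’ 0 8 ‘01234567’ 0.9000 1.0000")
print(spec)
```

For the age line this yields eight category codes and the 0.9% and 1% significance levels discussed above.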
The first few lines of the output file contain information on the version of the FIRM
analysis, copyright and contact information. This is followed by the syntax used in this
analysis.
DATE: 04/21/2000
TIME: 09:59
C O N F I R M
Mini-dendrogram of analysis
                     1
                     6
      -------------------------------
      2                             3
      3                             1
  ---------      ---------------------------------------
  4       5      6                  7                  8
                 5                  3                  2
            ----------         ----------          ---------
            9        10        11       12         13      14
                     2                  2
                ----------     ------------------
                15       16    17       18      19
As the sum, den, split and rule keywords were included in the syntax file, the
summary information is not written to an external file, but forms the next part of the
output file.
---------------------------------------------
| Output of Summary and Error Information |
---------------------------------------------
Program dimensions:
Maximum number of predictors 1003
Maximum number of predictor categories 20
A summary of the options used in splitting and merging of the groups is followed by
specific information on the splitting of each node. The primary use of this information is
for exploration of possible splits that were not used.
When we look at the analysis of the first node, we see that all of the predictors were able
to provide highly significant splits into 2, 3, or 4 category groupings. While all predictors
are highly significant, Pupils gives the most significant split.
***********************************************************************
Analysis of group no. 1 previous group no. 0
no. name mc-sig(%) bonf-sig(%), grouping
1 age 8.76E-13 4.70E-13 01 23 45 67
2 EMV 1.27E-21 1.31E-21 12 345 ?6 7
3 MRP 3.35E-21 1.93E-22 12 345 ?67
4 Change 5.99E-03 0.0202 ?1 23
5 Eye ind 1.30E-21 3.25E-21 1 ?2 3
6 Pupils 1.42E-22 2.02E-22 1? 2
Best predictor
6 Pupils 1.42E-22 2.02E-22 1? 2
The one-way analysis of variance for the split actually made is given next, followed by
the summary statistics of the descendant groups formed by the split. This information is
duplicated in the dendrogram, and so it is not of particular interest here. What is
interesting is the fact that in the whole data set, all predictors are highly significant, but
their relative predictive powers seem to change as we go down the dendrogram. Look, for
example, at the relative size of the p-values of age and of the clinical observations as one
goes down the tree. A portion of the output is given below.
Analysis of variance
Sum of squares mean square degrees of freedom
Grouping 84.2132 84.2132 1
Error 353.9868 0.7108 498
F-value 118.474
significance 6.75E-23
Bonferroni P 2.02E-22
multiple comparison P 1.42E-22
overall conservative P 1.42E-22
adjusted for predictor count 8.53E-22
***********************************************************************
Analysis of group no. 2 previous group no. 1
no. name mc-sig(%) bonf-sig(%), grouping
1 age 100.00 100.00 01234567
2 EMV 0.1413 0.0493 123456 ?7
3 MRP 9.19E-03 1.12E-04 123456 ?7
4 Change 100.00 100.00 ?123
5 Eye ind 6.55E-03 5.14E-03 12 ?3
6 Pupils 0.1505 0.1505 1 ?
Best predictor
3 MRP 9.19E-03 1.12E-04 123456 ?7
Analysis of variance
Sum of squares mean square degrees of freedom
Grouping 7.0729 7.0729 1
Error 29.2975 0.2203 133
F-value 32.108
significance 8.65E-06
Bonferroni P 1.12E-04
multiple comparison P 9.19E-03
overall conservative P 1.12E-04
adjusted for predictor count 6.74E-04
***********************************************************************
Analysis of group no. 3 previous group no. 1
no. name mc-sig(%) bonf-sig(%), grouping
1 age 1.22E-12 7.84E-13 0123 45 67
2 EMV 5.68E-09 3.68E-09 12 345 ?6 7
3 MRP 1.21E-08 4.03E-10 12345 ?67
4 Change 0.2076 0.2994 ?1 23
5 Eye ind 1.47E-06 3.30E-06 1 ?2 3
6 Pupils 100.00 100.00 2
Best predictor
1 age 1.22E-12 7.84E-13 0123 45 67
Analysis of variance
Sum of squares mean square degrees of freedom
Grouping 56.6016 28.3008 2
Error 261.0148 0.7210 362
F-value 39.250
significance 3.73E-14
Bonferroni P 7.84E-13
multiple comparison P 1.22E-12
overall conservative P 7.84E-13
adjusted for predictor count 3.92E-12
***********************************************************************
Analysis of group no. 4 previous group no. 2
no. name mc-sig(%) bonf-sig(%), grouping
1 age 100.00 100.00 01234567
2 EMV 100.00 100.00 ?1234567
3 MRP 100.00 100.00 123456
4 Change 100.00 100.00 ?123
5 Eye ind 100.00 100.00 ?123
6 Pupils 100.00 100.00 ?1
No prediction possible.
***********************************************************************
Analysis of group no. 5 previous group no. 2
no. name mc-sig(%) bonf-sig(%), grouping
1 age 100.00 100.00 01234567
2 EMV 100.00 100.00 ?1234567
3 MRP 100.00 100.00 ?7
4 Change 100.00 100.00 ?123
5 Eye ind 100.00 100.00 ?123
6 Pupils 100.00 100.00 1?
No prediction possible.
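In the listings above, the "overall conservative P" line coincides with the smaller of the Bonferroni and multiple comparison values, and the "adjusted for predictor count" line is close to that value multiplied by the number of predictors that still have more than one category in the node. The short Python sketch below checks this reading against the printed figures. The rule is inferred from the output shown here, not taken from the FIRM documentation, and the function names are hypothetical.

```python
def overall_conservative(bonferroni_pct, mc_pct):
    """Assumed rule: take the smaller of the two conservative significance levels."""
    return min(bonferroni_pct, mc_pct)


def adjusted_for_predictor_count(p_pct, n_usable_predictors):
    """Assumed rule: Bonferroni-style multiplication by the number of usable predictors."""
    return min(100.0, p_pct * n_usable_predictors)


# Values copied from the listings for nodes 1, 2 and 3 (all in percent).
# 'usable' counts the predictors listed with more than one category in that node.
nodes = {
    1: dict(bonf=2.02e-22, mc=1.42e-22, usable=6, printed=8.53e-22),
    2: dict(bonf=1.12e-04, mc=9.19e-03, usable=6, printed=6.74e-04),
    3: dict(bonf=7.84e-13, mc=1.22e-12, usable=5, printed=3.92e-12),
}

for number, v in nodes.items():
    p = overall_conservative(v["bonf"], v["mc"])
    adjusted = adjusted_for_predictor_count(p, v["usable"])
    print(number, p, adjusted, v["printed"])
```

The small differences from the printed values are consistent with the inputs being rounded to three significant figures.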
As with the summary statistics, the use of the den keyword leads to the inclusion of the
full dendrogram in the output file, rather than the writing of this information to an
external file. The dendrogram thus forms the next part of the CONFIRM output file
considered here.
LEGEND:
split var
P val (Bonf P)
|
+------+ ------------
| 500| |
|1.8600| levels
|0.9371| Node |
|1.0000| +------+
|3.0000| |N |
+------+ |Xbar |
| |SD |
Pupils |Min |
1.42E-22;(8.53E-22) |Max |
| +------+
--------------------------
1? 2
2 | 3 |
+------+ +------+
| 135| | 365|
|1.1852| |2.1096|
|0.5210| |0.9431|
|1.0000| |1.0000|
|3.0000| |3.0000|
+------+ +------+
| |
MRP age
1.12E-04;(6.74E-04) 7.84E-13;(3.92E-12)
| |
---------- ----------------------------------
123456 ?7 0123 45 67
4 | 5 | 6 | 7 | 8 |
+------+ +------+ +------+ +------+ +------+
| 109| | 26| | 228| | 81| | 56|
|1.0734| |1.6538| |2.3860| |1.8765| |1.3214|
|0.3251| |0.8458| |0.8604| |0.9135| |0.6904|
|1.0000| |1.0000| |1.0000| |1.0000| |1.0000|
|3.0000| |3.0000| |3.0000| |3.0000| |3.0000|
+------+ +------+ +------+ +------+ +------+
| | |
Eye ind MRP EMV
4.46E-06;(2.23E-05) 3.75E-05;(1.88E-04) 1.23E-07;(6.16E-07)
| | |
------------ ---------- ----------
1 ?23 12345 ?67 ?23456 7
9 | 10 | 11 | 12 | 13 | 14 |
+------+ +------+ +------+ +------+ +------+ +------+
| 18| | 210| | 41| | 40| | 46| | 10|
|1.3333| |2.4762| |1.3659| |2.4000| |1.0870| |2.4000|
|0.6860| |0.8137| |0.6617| |0.8412| |0.2849| |0.9661|
|1.0000| |1.0000| |1.0000| |1.0000| |1.0000| |1.0000|
|3.0000| |3.0000| |3.0000| |3.0000| |2.0000| |3.0000|
+------+ +------+ +------+ +------+ +------+ +------+
| |
EMV EMV
1.99E-03;(9.94E-03) 0.0513;(0.2566)
| |
---------- -------------------
12 ?34567 ?34 5 67
15 | 16 | 17 | 18 | 19 |
+------+ +------+ +------+ +------+ +------+
| 17| | 193| | 7| | 12| | 21|
|1.5882| |2.5544| |2.5714| |1.5833| |2.8095|
|0.8703| |0.7627| |0.7868| |0.7930| |0.5118|
|1.0000| |1.0000| |1.0000| |1.0000| |1.0000|
|3.0000| |3.0000| |3.0000| |3.0000| |3.0000|
+------+ +------+ +------+ +------+ +------+
The full set of 500 cases has a mean outcome score of 1.8600. The first split is made on
the basis of the predictor Pupils. Cases for which Pupils is 1 or ? constitute the first
descendant group (Node 2), while those for which it is 2 form Node 3. Note that this
is the same predictor that CATFIRM elected to use, but the patients with missing values
of Pupils are grouped with Pupils = 1 by the CONFIRM analysis, and with Pupils = 2 by
CATFIRM. Going to the descendant groups, Node 2 is then split on MRP (unlike in
CATFIRM, where the corresponding node was terminal), while Node 3 is split three
ways on age, into the groupings (0123), (45) and (67).
The overall tree is bigger than that produced by CATFIRM, with 11 terminal and 8 interior
nodes, and produces groups with prognostic scores ranging from a grim 1.0734 up to
2.8095 on the 1 to 3 scale. As with CATFIRM, the means in these terminal nodes could
be used for prediction of future cases, giving estimates of the score that patients in the
terminal nodes would have.
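As a quick consistency check on the dendrogram, the terminal node sizes should add up to the 500 cases, and their size-weighted mean should reproduce the overall mean of 1.8600. A minimal Python sketch, with the node sizes and means copied from the dendrogram (the variable names are illustrative):

```python
# Terminal nodes of the CONFIRM dendrogram: node number -> (N, mean outcome score).
terminal_nodes = {
    4: (109, 1.0734), 5: (26, 1.6538), 9: (18, 1.3333), 11: (41, 1.3659),
    13: (46, 1.0870), 14: (10, 2.4000), 15: (17, 1.5882), 16: (193, 2.5544),
    17: (7, 2.5714), 18: (12, 1.5833), 19: (21, 2.8095),
}

total_n = sum(n for n, _ in terminal_nodes.values())
overall_mean = sum(n * mean for n, mean in terminal_nodes.values()) / total_n

# Nodes with the lowest and highest mean score.
worst = min(terminal_nodes, key=lambda k: terminal_nodes[k][1])
best = max(terminal_nodes, key=lambda k: terminal_nodes[k][1])

print(total_n, round(overall_mean, 4), worst, best)
```

A new patient falling into a given terminal node would be assigned that node's mean as the predicted score.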
The dendrogram is the most obviously and immediately useful output of a FIRM
analysis. It is an extract of the much more detailed output, which may be written to a file
or, by using the split keyword, included in the output file. If written to a file, that file
contains the information of both the split and table files, in contrast with CATFIRM, where
this information is written to two separate files.
The information on the splitting of the first group is shown below. The actual output file
contains similar information on all the groups.
The listing starts with the summary statistics of the grouping of the cases in that node by
the different classes of the predictor. The categories of a free or character predictor are
arranged in ascending order of the dependent variable mean; all the other predictors are
listed in their original order.
-------------------------------------------
| Output of Detailed Analysis of Splits |
-------------------------------------------
***********************************************************************
Analysis of group no. 1 previous group no. 0 mean = 1.860 size = 500
The summary table is followed by the one-way analysis of variance for the ungrouped
cross-classification.
Anova SSE DFE SSH DFH R-squared F Sign(%)
372.0 492 66.2 7 0.1510 12.5046 0.8757E-12
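The R-squared and F entries of this line follow from the sums of squares and degrees of freedom by the usual one-way analysis of variance formulas. A minimal Python sketch, assuming SSE/DFE are the error sum of squares and degrees of freedom and SSH/DFH the between-category ones (the function name is illustrative; the significance column would additionally need the F distribution and is not computed here):

```python
def anova_summary(sse, dfe, ssh, dfh):
    """Return (R-squared, F) for a one-way analysis of variance."""
    r_squared = ssh / (sse + ssh)    # share of the total sum of squares explained
    f_value = (ssh / dfh) / (sse / dfe)
    return r_squared, f_value


# Values from the Anova line for age in group no. 1.
r_squared, f_value = anova_summary(sse=372.0, dfe=492, ssh=66.2, dfh=7)
print(round(r_squared, 4), round(f_value, 2))
```

The results agree with the printed 0.1510 and 12.5046 up to the rounding of the sums of squares.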
Next are the details of merging the categories. For a monotonic predictor like age, the
only possible merges are between adjacent categories. CONFIRM lists a Student’s
t-statistic between each pair of adjacent categories on the line below. For example, the
t-value for merging classes 0 and 1 is 1.3, that for merging 1 and 2 is -2.4, and that for
merging 6 and 7 is -1.4. The smallest t-value in absolute value is -0.3, for merging classes
2 and 3. This merge is made, and the number of classes becomes 7, with 6 possible
mergings and their associated t-values. These t-values are listed on the second “merge
stats” line, which shows that the least significantly different mergeable pair is 4 and 5,
with a t-value of 0.6. This merge in turn takes place, as this t-value is not significant at
the merge significance level selected in this run. The reduction continues until, at the final
line, all three Student’s t-values (for merging the composite categories (01), (23), (45),
and (67)) are significant, and the merge testing stops.
Group 0 1 2 3 4 5 6 7
merge stats 1.3 -2.4 -0.3 -2.1 0.6 -2.5 -1.4
Group 0 1 23 4 5 6 7
merge stats 1.3 -2.9 -2.7 0.6 -2.5 -1.4
Group 0 1 23 45 6 7
merge stats 1.3 -2.9 -2.9 -2.5 -1.4
Group 01 23 45 6 7
merge stats -2.6 -2.9 -2.5 -1.4
Group 01 23 45 67
merge stats -2.6 -2.9 -3.4
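The merging sequence printed above can be replayed with a few lines of Python. This is only a sketch of the selection step: it takes the printed t-values as given (CONFIRM recomputes them from the data after every merge) and assumes a critical value of about 2.58, roughly the two-sided 1% point of a t distribution with several hundred degrees of freedom. The function name and the critical value are assumptions, not FIRM output.

```python
def merge_least_significant(groups, t_values, critical_t):
    """Merge the adjacent pair with the smallest |t|.

    Returns the new list of groups, or None when every adjacent pair
    differs significantly and merging stops.
    """
    idx = min(range(len(t_values)), key=lambda i: abs(t_values[i]))
    if abs(t_values[idx]) >= critical_t:
        return None
    return groups[:idx] + [groups[idx] + groups[idx + 1]] + groups[idx + 2:]


CRITICAL_T = 2.58  # assumed critical value for the 1% merge level

# (groups, merge stats) pairs copied from the listing for age.
printed = [
    (["0", "1", "2", "3", "4", "5", "6", "7"], [1.3, -2.4, -0.3, -2.1, 0.6, -2.5, -1.4]),
    (["0", "1", "23", "4", "5", "6", "7"], [1.3, -2.9, -2.7, 0.6, -2.5, -1.4]),
    (["0", "1", "23", "45", "6", "7"], [1.3, -2.9, -2.9, -2.5, -1.4]),
    (["01", "23", "45", "6", "7"], [-2.6, -2.9, -2.5, -1.4]),
    (["01", "23", "45", "67"], [-2.6, -2.9, -3.4]),
]

results = [merge_least_significant(g, t, CRITICAL_T) for g, t in printed]
for (groups, _), result in zip(printed, results):
    print(groups, "->", result)
```

Each step reproduces the next "Group" line of the listing, and the last step returns None because all three remaining t-values exceed the assumed critical value.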
A slightly different format is used for a floating predictor, as illustrated by the predictor
MRP. Here, while the floating category is on its own, there are two lines of statistics for
each stage: the first for merging adjacent categories, and the second for merging ? with 1,
? with 2, ? with 3, and so on. At the first merge phase, these lines show that the least
significant difference is between categories 3 and 4, and so these categories are merged.
The second stage merges 1 with 2 and the third 6 with 7. At the fourth stage, the smallest
t is for merging ? with the composite category (67), and after this is done, ? no longer
floats, and so for the last two stages there is only a single line of merge statistics.
predictor no. 3 MRP
statistics before merging
Cate ? 1 2 3 4 5 6 7
mean 2.14 1.21 1.31 1.61 1.54 1.87 2.30 2.37
size 21 38 61 33 114 30 91 112
s.d. 0.91 0.58 0.67 0.90 0.81 0.94 0.89 0.90
Anova SSE DFE SSH DFH R-squared F Sign(%)
342.6 492 95.6 7 0.2182 19.6192 0.3351E-20
Group ? 1 2 3 4 5 6 7
merge stats -4.1 0.6 1.6 -0.4 1.9 2.4 0.6
-4.1 -3.9 -2.3 -3.0 -1.2 0.8 1.1
Group ? 1 2 34 5 6 7
merge stats -4.1 0.6 1.9 1.8 2.5 0.6
-4.1 -3.9 -3.0 -1.2 0.8 1.1
Group ? 12 34 5 6 7
merge stats -4.3 2.6 1.9 2.5 0.6
-4.3 -3.0 -1.2 0.8 1.1
Group ? 12 34 5 67
merge stats -4.3 2.6 1.9 2.9
-4.3 -3.0 -1.2 1.0
Group 12 34 5 ?67
merge stats 2.6 1.9 2.8
Group 12 345 ?67
merge stats 3.2 8.4
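The first pair of "merge stats" lines for MRP can be approximated from the category means and sizes in the table above. The sketch below assumes a two-sample t-statistic that uses the pooled error mean square from the Anova line (SSE/DFE); this formula is an assumption that happens to reproduce the printed values to within the rounding of the means, and is not quoted from the FIRM documentation.

```python
from math import sqrt

# Statistics before merging for MRP, copied from the listing.
cats = ["?", "1", "2", "3", "4", "5", "6", "7"]
means = [2.14, 1.21, 1.31, 1.61, 1.54, 1.87, 2.30, 2.37]
sizes = [21, 38, 61, 33, 114, 30, 91, 112]
mse = 342.6 / 492  # pooled error mean square, SSE / DFE


def t_stat(i, j):
    """Assumed t-statistic for the difference between categories i and j."""
    return (means[j] - means[i]) / sqrt(mse * (1 / sizes[i] + 1 / sizes[j]))


# First line: adjacent pairs (? with 1, 1 with 2, ..., 6 with 7).
adjacent = [t_stat(i, i + 1) for i in range(len(cats) - 1)]
# Second line: the floating category ? against every other category.
floating = [t_stat(0, j) for j in range(1, len(cats))]

# Candidate merges with their t-values; the least significant one is merged first.
candidates = {(cats[i], cats[i + 1]): adjacent[i] for i in range(len(cats) - 1)}
candidates.update({("?", cats[j]): floating[j - 1] for j in range(1, len(cats))})
least_pair = min(candidates, key=lambda pair: abs(candidates[pair]))

print([round(t, 1) for t in adjacent])
print([round(t, 1) for t in floating])
print(least_pair)
```

Under these assumptions the least significant pair is 3 with 4, in agreement with the first merge in the listing.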
The final type of output is that for a free predictor (like Pupils). This is much simpler
than in the corresponding CATFIRM case. Here the first stage of the analysis is to sort
the categories of the predictor into ascending order of their mean values of the dependent
variable. Thereafter, the analysis proceeds just like that of a monotonic predictor on these
re-ordered categories. This implies that if any two categories of a free predictor end
up joined together, the composite class will necessarily include any other categories
whose mean scores lie between the mean scores of the joined categories. This implication is
not a consequence of the definition of a free predictor, but accords with common-sense
expectations.
predictor no. 6 Pupils
statistics before merging
Cate 1 ? 2
mean 1.14 1.62 2.11
size 122 13 365
s.d. 0.45 0.87 0.93
Anova SSE DFE SSH DFH R-squared F Sign(%)
351.3 497 86.9 2 0.1983 61.4491 0.1422E-21
Group 1 ? 2
merge stats 1.9 2.1
Group 1? 2
merge stats 10.9
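The two steps described for a free predictor, sorting the categories by mean and then merging neighbours, can be sketched for Pupils as follows. The category means and sizes are copied from the listing; the helper name `pool` is illustrative. Pooling categories 1 and ? gives the composite group that appears as Node 2 in the dendrogram.

```python
# Pupils categories: code -> (mean outcome score, number of cases).
pupils = {"?": (1.62, 13), "1": (1.14, 122), "2": (2.11, 365)}

# Step 1: order the categories by ascending mean of the dependent variable.
ordered = sorted(pupils, key=lambda code: pupils[code][0])


def pool(a, b):
    """Combine two (mean, size) summaries into one composite category."""
    (mean_a, n_a), (mean_b, n_b) = a, b
    n = n_a + n_b
    return ((mean_a * n_a + mean_b * n_b) / n, n)


# Step 2: after sorting, only neighbours may be merged. The listing merges
# the first two categories (1 and ?), which is reproduced here.
merged_mean, merged_size = pool(pupils[ordered[0]], pupils[ordered[1]])

print(ordered, round(merged_mean, 3), merged_size)
```

The pooled mean differs from the dendrogram value of 1.1852 only in the third decimal place, because the category means in the listing are rounded to two decimals.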
Inclusion of the rule keyword leads to the inclusion of the split rule table shown below
in the output file. Apart from the first section summarizing the splits made, it has
the additional information on the categories of the character predictors and the cutpoints
of the real predictors.
1 6 2 2 3
2 3 5 4 4 4 4 4 4 5
3 1 6 6 6 6 7 7 8 8
6 5 10 9 10 10
7 3 12 11 11 11 11 11 12 12
8 2 13 -1 13 13 13 13 13 14
10 2 16 15 15 16 16 16 16 16
12 2 17 -1 -1 17 17 18 19 19
-1 -1
0
0
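Comparing the first eight rows of this table with the dendrogram suggests the following reading: each row gives a node number, the number of the predictor used to split it, and then the destination node for each category of that predictor in the order of its category codes, with -1 marking a category that has no cases in the node. The Python sketch below routes a case through the tree under that reading. The interpretation is inferred from the output shown here, the function names are hypothetical, and the trailing "-1 -1" and "0" lines are not interpreted.

```python
# Category codes and names per predictor number, as given in the syntax file.
CODES = {1: "01234567", 2: "?1234567", 3: "?1234567", 4: "?123", 5: "?123", 6: "?12"}
NAMES = {1: "age", 2: "EMV", 3: "MRP", 4: "Change", 5: "Eye ind", 6: "Pupils"}

RULE_ROWS = """\
1 6 2 2 3
2 3 5 4 4 4 4 4 4 5
3 1 6 6 6 6 7 7 8 8
6 5 10 9 10 10
7 3 12 11 11 11 11 11 12 12
8 2 13 -1 13 13 13 13 13 14
10 2 16 15 15 16 16 16 16 16
12 2 17 -1 -1 17 17 18 19 19"""


def parse_rules(text):
    """Map node -> (predictor number, {category code: destination node})."""
    rules = {}
    for line in text.splitlines():
        node, predictor, *children = map(int, line.split())
        rules[node] = (predictor, dict(zip(CODES[predictor], children)))
    return rules


def route(case, rules, node=1):
    """Follow the split rules from the root and return the final node."""
    while node in rules:
        predictor, children = rules[node]
        destination = children[case[NAMES[predictor]]]
        if destination == -1:
            # Category not seen in this node: stop here (a choice made for this sketch).
            return node
        node = destination
    return node


rules = parse_rules(RULE_ROWS)
print(route({"Pupils": "1", "MRP": "3"}, rules))
print(route({"Pupils": "2", "age": "5", "MRP": "7", "EMV": "6"}, rules))
```

The first case ends in Node 4 and the second in Node 19, the terminal nodes with the lowest and highest mean scores in the dendrogram.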
The “mixed” data set was provided by Gordon V. Kass and is described in detail in
Section 6.3, where it was analyzed using CATFIRM.
The dependent variable for the CATFIRM run was of the type character—a promotion
code. A second measure of the student’s overall achievement in relation to the
requirements for going on to the second-year curriculum was the aggregate score. This is
the average score received for all courses taken in the first year at University, and is
suitable for analysis using CONFIRM.
The syntax file for the analysis of the “mixed” data using CONFIRM is given below:
• The first line indicates that a CONFIRM analysis is required, and that a
dendrogram (keyword den), summary statistics (keyword sum), the split file
(keyword split) and the split rule file (keyword rule) are to be included in the
output file. Note the use of the ! to start a comment after the end of the input on
the first line.
• The second line contains information on the dependent or outcome variable. It is
the 17th variable in the data set, as indicated by 17 in the first field, and it is used
here as a continuous variable, as indicated by the 0 in the third entry on this line.
Because of this specification, no category names are required for the FIRM
analysis. This is the dependent variable specification section of the syntax file.
• The third line has one entry, 12, indicating the number of predictor variables to
be used in the analysis. This is the start of the predictor specification section in
the syntax file. The next twelve lines provide information on each of the
predictors.
• The variable Afrikaans is the seventh variable in the data set (position of predictor
= 7), and is a real predictor (type of predictor = r). The 1 indicates that this
variable is to be carried over but not used for splitting.
• The monotonic, free, and float predictor types all require that the predictor’s
values be consecutive integers. The next two fields should contain information on
this range, and are only required for these three predictor types. For a real
predictor, 0s are used instead.
• This is usually followed by the category codes. In this case, the absence of
category codes is indicated by “ “. The last two entries in this line are the splitting
and merging significance values. These levels are given in percent. In this
example, a 0.9% significance level will be used for splitting and 1% for merging.
• The predictor Matmonth is a free predictor, with values ranging from 0 to 13.
Codes for the values are assigned as ?, 1, 2, ..., 9, A, B, C. Splitting and merging
significance levels are the same as for all other predictors.
• The last line of the syntax file contains the output options for this CONFIRM
analysis.
For more information on syntax, see the sections on the dependent and predictor variable
specifications in Chapter 8.
The first section of the output file echoes the syntax file and also contains the data set
name and number of observations used in the analysis.
confirm: sum den split rule ! add dendrogram and summary information to output file
17 ‘U aggr’ 0 ! Number of dependent variable
12 ! Number of predictors
‘Afrikaans’ 7 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Biology’ 10 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘English’ 6 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Faculty’ 13 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘History’ 12 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Latin’ 11 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Math’ 8 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Mattype’ 3 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
‘Matmonth’ 4 0 ‘f’ 0 13 ‘?123456789ABC’ 0.9000 1.0000
‘Matyear’ 5 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Science’ 9 0 ‘r’ 0 0 ‘‘ 0.9000 1.0000
‘Sex’ 2 0 ‘c’ 0 0 ‘‘ 0.9000 1.0000
1 20 .00100 1.00000 1.00000 20 0 0 1 .0000000 1 0 0 0 0
Mini-dendrogram of analysis
1
7
----------------------------------------
2 3 4
As the sum, den, split and rule keywords were included in the syntax file, the
summary information is not written to an external file, but forms the next part of the
output file.
---------------------------------------------
| Output of Summary and Error Information |
---------------------------------------------
Program dimensions:
Maximum number of predictors 1003
Maximum number of predictor categories 20
Next there is a listing of the cutpoints for the real predictors, which is needed when
reading the rest of the summary output and the dendrogram. Just one real predictor,
Afrikaans, from this list is shown below.
The cutpoints 45, 51, 54, 55, 57, 62, 64, 65 and 75 are the best we can find for dividing
the data set up into 10 groups of about equal size, and are the starting cutpoints from
which FIRM will do its subsequent merging to a final grouping.
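The idea of starting from about ten equally sized groups can be illustrated with ordinary decile cutpoints. The Python sketch below uses `statistics.quantiles` on hypothetical scores; FIRM's own rule for choosing cutpoints (for example, how it treats tied values) is not documented in this chapter, so this is an illustration of the principle only, and the function names are illustrative.

```python
import bisect
import statistics
from collections import Counter


def equal_size_cutpoints(values, n_groups=10):
    """Return up to n_groups - 1 cutpoints giving groups of about equal size."""
    cuts = statistics.quantiles(values, n=n_groups, method="inclusive")
    return sorted(set(cuts))  # drop duplicates caused by heavily tied values


def group_index(value, cutpoints):
    """Index of the group a value falls into (0 for the lowest group)."""
    return bisect.bisect_right(cutpoints, value)


scores = list(range(1, 101))  # hypothetical marks, not the Afrikaans data
cuts = equal_size_cutpoints(scores)
counts = Counter(group_index(s, cuts) for s in scores)

print(cuts)
print(sorted(counts.items()))
```

With real marks, ties usually make the groups only approximately equal in size, which matches the "about equal size" wording above.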
Character predictors seen in the data and their values are:
Predictor 1 Afrikaans
A summary of the options used in splitting and merging of the groups is followed by
specific information on the splitting of each node. The primary use of this information is
for exploration of possible splits that were not used.
When we look at the analysis of the first node, we see that four of the predictors (Biology,
English, Math and Science) are able to provide splits significant at better than 1%
conservative. Of these, the predictor Math is the most significant. It divides the students
Best predictor
7 Math 9.37E-04 4.67E-05 01234 ?567 89
The one-way analysis of variance for the split actually made is given next, followed by
the summary statistics of the descendant groups formed by the split. This information is
duplicated in the dendrogram, and so it is not of particular interest here.
Analysis of variance
F-value 19.081
significance 1.04E-06
Bonferroni P 4.67E-05
multiple comparison P 9.37E-04
overall conservative P 4.67E-05
adjusted for predictor count 5.60E-04
***************************************************************************
Analysis of group no. 2 previous group no. 1
no. name mc-sig(%) bonf-sig(%), grouping
1 Afrikaans 100.00 100.00 ?0123456789
2 Biology 100.00 100.00 ?0123456789
3 English 3.7527 5.1131 012345 6789
4 Faculty 10.8388 66.2082 S MCLAHFED
5 History 100.00 100.00 ?012345678
6 Latin 100.00 100.00 ?024568
7 Math 100.00 100.00 01234
8 Mattype 100.00 100.00 7J1A24653
9 Matmonth 100.00 100.00 4BC13
10 Matyear 100.00 100.00 012345678
11 Science 100.00 100.00 ?0123457
12 Sex 100.00 100.00 MF
Best predictor
3 English 3.7527 5.1131 012345 6789
Analysis of variance
Sum of squares mean square degrees of freedom
Grouping 3210.5718 3210.5718 1
Error 87183.0731 411.2409 212
F-value 7.807
significance 0.5681
Bonferroni P 5.1131
multiple comparison P 3.7527
overall conservative P 3.7527
adjusted for predictor count 45.0319
No prediction possible.
As with the summary statistics, the use of the den keyword leads to the inclusion of the
full dendrogram in the output file, rather than the writing of this information to an
external file. The dendrogram thus forms the next part of the CONFIRM output file. The
dendrogram is the most obviously and immediately useful output of a FIRM analysis. It
is an extract of the much more detailed output, which may be written to a file or, by
using the split keyword, included in the output file. If written to a file, that file
contains the information of both the split and table files, in contrast with CATFIRM, where
this information is written to two separate files.
The split file contains information on the splitting of all groups. The listing starts with the
summary statistics of the grouping of the cases in that node by the different classes of the
predictor. The categories of a free or character predictor are arranged in ascending order
of the dependent variable mean; all the other predictors are listed in their original order.
Inclusion of the rule keyword leads to the inclusion of the split rule table in the output
file. Apart from the first section summarizing the splits made, it has the additional
information on the categories of the character predictors and the cutpoints of the real predictors.