Decision Trees
Barry de Ville
Decision trees trace their origins to the era of the early development of
written records. This history illustrates a major strength of trees: exceptionally
interpretable results which have an intuitive tree-like display which, in turn,
enhances understanding and the dissemination of results. The computational
origins of decision trees—sometimes called classification trees or regression
trees—are models of biological and cognitive processes. This common heritage
drives complementary developments of both statistical decision trees and trees
designed for machine learning. The unfolding and progressive elucidation of the
various features of trees throughout their early history in the late 20th century is
discussed along with the important associated reference points and responsible
authors. Statistical approaches, such as hypothesis testing and various resampling
approaches, have coevolved along with machine learning implementations. This
has resulted in exceptionally adaptable decision tree tools, appropriate for various
statistical and machine learning tasks, across various levels of measurement, with
varying levels of data quality. Trees are robust in the presence of missing data
and offer multiple ways of incorporating missing data in the resulting models.
Although trees are powerful, they are also flexible and easy-to-use methods. This
assures the production of high-quality results that require few assumptions to
deploy. The treatment ends with a discussion of the most current developments
which continue to rely on the synergies and cross-fertilization between statistical
and machine learning communities. Current developments with the emergence
of multiple trees and the various resampling approaches that are employed are
discussed. © 2013 Wiley Periodicals, Inc.
[Figure 1. Decision tree for Titanic passenger survival. The root node (Node ID: 1; Count: 1309) shows the overall distribution of the target: 38.2% survived ('1') and 61.8% did not ('0'). The root node is partitioned by Gender (Female, Male), and each descendent node is then partitioned by Age.]
[...] Porphyry in the 3rd century C.E.1 These early precomputational origins of decision trees confirm a persistently useful, innate capability of decision trees to project and encapsulate contextually revealing visual displays that are both intuitive and powerful visual metaphors. If we fast forward to the 20th century, we see that computational decision trees emerged at the same time as the nascent fields of artificial intelligence2,a and statistical computation. As a result their development has benefitted from a rich cross-disciplinary cross-fertilization that has led to a range of new methods, from resampling methods like boosting and bagging to more recent generalized multiple tree methods such as Random Forests.

OPERATION, FEATURES, AND INTERPRETATION

The characteristic form of decision trees is shown in Figure 1. Here we see a recursive subsetting of a target field of data according to the values of associated fields to create partitions, and associated descendent data subsets (nodes), that contain progressively similar intra-node target values and progressively dissimilar inter-node values at any given level of the tree.

Figure 1 shows a decision tree analysis performed on data that are drawn from research conducted on passengers on the ill-fated Titanic.2 The top-most node of the tree—termed the 'root node'—contains 1309 observations. This top-most root node contains the global distribution of the 'target' field for the analysis: in this case, survival versus nonsurvival. In general, targets may be any level of measurement, e.g. nominal, ordinal, or interval. When nominal targets are used, as in the case shown in Figure 1, the tree is sometimes referred to as a 'classification tree'.

In Figure 1, the overall survival rate—represented by '1' in the data—is 38%. Marginal counts are sometimes presented alongside the percentages so as to display the actual number of observations that fall into the two respective categories. In Figure 1, only the total number of observations is displayed at the bottom of the node display (labeled as Node ID: 1).

The decision tree unfolds in a stepwise fashion: the tree is formed by first partitioning the root node to form branches that define the descendent leaves (or nodes), which form clusters of observations that are alike within a node yet dissimilar when compared to other nodes at any given level of the tree. The branch partitions are based on a selection that is taken from a search through the data set to discover fields of data that can be input as partitioning fields to best describe the variability among the target values that are displayed in the root node. Potential partitioning fields are thereby termed 'inputs'. Once an input is selected, the descendent leaves, or nodes, are produced (terminal nodes are usually called 'leaves').
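The search for the best partitioning field described above can be sketched in a few lines of code. The following is a minimal illustration, not the software used for Figure 1: it assumes a pandas table with hypothetical column names ('survived', 'gender', 'age_group') and scores each candidate input by the chi-squared statistic of its cross-tabulation with the target, keeping the field with the largest value.

```python
# Minimal sketch: score candidate partitioning fields against a binary target by the
# chi-squared statistic of the input-by-target cross-tabulation. Column names are
# hypothetical stand-ins for the Titanic fields discussed in the text.
import pandas as pd
from scipy.stats import chi2_contingency

def best_partitioning_field(df, target, inputs):
    scores = {}
    for field in inputs:
        table = pd.crosstab(df[field], df[target])          # observed subclass counts
        chi2, p_value, dof, expected = chi2_contingency(table)
        scores[field] = chi2                                 # larger = stronger separation
    return max(scores, key=scores.get)

# Toy frame shaped like the Titanic data:
toy = pd.DataFrame({
    "gender":    ["female", "male", "male", "female", "male", "female"],
    "age_group": ["adult", "adult", "child", "child", "adult", "adult"],
    "survived":  [1, 0, 1, 1, 0, 1],
})
print(best_partitioning_field(toy, "survived", ["gender", "age_group"]))  # prints 'gender'
```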
In Figure 1, the first level of the decision tree is produced by selecting the 'Gender' field as the best input field from the set of inputs that are available (other inputs in this data set include passenger age, cabin class (first, second, and so on), fare paid, cabin location, boarding location, and destination).

The selection of the 'best' input field is an open subject of active research. Decision trees allow for a variety of computational approaches to input selection. The top-down graphical display also supports the exploration of various effects visually, so that strong branches—or compelling branches—may be selected based on theoretical notions about the interaction of the various model components. In this example, the 'best' field selection is based on partition strength diagnostics produced by the software, coupled with the domain knowledge of the analyst. In the Titanic data, gender, age, and cabin class are all important and predictive inputs with multiple, interweaving interactions. A complete exposition of the various interactions is not possible in the limited space here. Consequently, gender alone is used in the description here so as to present a simple, hopefully compelling, result. This result, and the domain knowledge framework that describes it, is presented below.

The descendent nodes produced by the selection of gender as the first partitioning field in Figure 1 are commonly referred to as the first level of the tree. The leaves in this first level correspond to the male and female passengers. The 'leaf' terminology is often used when the decision 'tree' metaphor for this method is used. The more general term 'node' is used in recognition of the fact that decision trees are a particular form of connected graph. In graph terminology, the partitions are 'edges' and the leaves are 'nodes'.

Using the 'node' terminology, the first level of the tree has two descendent nodes: the 'female' descendent has a survival rate of about 72%, whereas the 'male' descendent node has a survival rate of only 19%. It is normal, as in this case, to select the input that produces the most dramatic separation in the variability among the descendent nodes. In practice, the analyst may often guide the sequence of the unfolding of branch partitions in order to support a better explanation of a sequence of effects or to support and confirm the conditional relations that are assumed to exist among the various inputs and the component nodes that they produce. In the case of high-performance predictive modeling applications, there is less emphasis on analyst interaction in the formation of the tree and more emphasis on the selection of high quality partitions that can collectively produce the best overall model. Regardless of the method, once the initial level of the tree is determined the process continues in a recursive fashion until one or more possible stop conditions are met, thus terminating the process. Generally, stopping rules consist of thresholds on diminishing returns (in terms of test statistics) or a diminishing supply of training cases (minimum acceptable number of observations in a node).

As shown in Figure 1, gender is selected as the first partitioning field below the root node. In this case, we see that the use of gender as the partitioning field forms two descendent nodes for female and male passengers, respectively. One interpretation might be to note that the effect of gender is strong and appears to follow a protocol that calls for 'women and children first' in the lifeboats. Here we see that, while the overall survival rate is 38%, this increases to about 73% among females whereas the overall male survival rate drops to about 19%. The descendent nodes formed by recursively partitioning the female and male nodes, respectively, illustrate one of the most striking and useful features of decision trees: here we see the contextual effect of age on survival rate. In this case, we see that among females, older passengers are more likely to survive (an 83% survival rate among older females vs 68% among younger females). In the male population, the effect is completely reversed: older males have a substantially lower survival rate (17% for older males vs 58% for younger males).

We can interpret these findings as normative behavior in the social dynamics that evolved in this impromptu community that consists of the self-selected passengers of this inaugural voyage across the Atlantic. Our initial sense of the 'women and children first' protocol—displayed in the first partition—is reinforced by normative behavior that demonstrates preferential treatment based on age status. Because the second tier partitions are unique to the female and male groups, respectively, we see a contrasting preferential age treatment among females compared with males. This contrast favors older females and younger males. This asymmetry in the descendent nodes on the second level of the tree provides a dramatic illustration of the outstanding ability of decision trees to expose relationships in context.
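To make the stopping rules and the Gender/Age partitions above concrete, the short sketch below grows a two-level classification tree with scikit-learn. It is a hypothetical illustration, not the software or data used in the article: the column names and values are invented stand-ins for the Titanic fields, and the library's impurity-based splitting stands in for the partition-strength diagnostics discussed above.

```python
# Minimal sketch (assumes scikit-learn and pandas): grow a small classification tree
# on Titanic-like columns with explicit stopping rules. 'gender' and 'age' are
# hypothetical stand-ins for the fields discussed in the text.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "gender":   ["female", "male", "female", "male", "female", "male", "female", "male"],
    "age":      [29, 40, 2, 8, 58, 21, 35, 60],
    "survived": [1, 0, 1, 1, 1, 0, 1, 0],
})

X = pd.get_dummies(df[["gender", "age"]])   # encode the nominal input as indicator columns
y = df["survived"]

tree = DecisionTreeClassifier(
    max_depth=2,          # stop after two levels, as in the Figure 1 illustration
    min_samples_leaf=2,   # minimum acceptable number of observations in a node
    random_state=0,
)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # text rendering of the branches
```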
The enduring legacy of decision trees is that they demonstrate that multiple contributors need to be recruited to effectively explain a relationship. Further, the form of the resulting relationships will reveal multiple contextual effects that will influence the understanding and effective presentation of the results. The utility of decision trees in detecting and presenting contextual effects was a significant driver of the development of one of the earliest and most enduring approaches to decision trees, the method of Belson,5 which exploited the then most current technology of mechanical calculators.

Decision trees turn out to be well adapted to mechanical calculators using Hollerith punch cards because of the sorting and selection characteristics of the algorithm and the avoidance of, e.g., any matrix-based computations. For each predictive field that could be considered as an input for use in the characterization of a target field it was possible to sort subclasses formed for each target-predictor combination and then to identify imbalances between the expected frequency of the subclass and the observed frequency of the subclass. This step-by-step recursive process is simple enough for both mechanical calculators and unassisted humans. Unbalanced distributions—which we would now identify as distributions with high chi-squared values—could be easily identified with the tabulating machines available at this time. This method—so useful in the era prior to digital computers—survives to the current day as the underpinning for all decision tree implementations.

A further refinement introduced by Belson involved the differential assessment of nested subclass predictors.6 Belson recognized that descendent nodes of a tree could be examined recursively, just as the top node had been. Belson further recognized that descendent nodes could be subset by either the same predictor or another predictor such that the descendent nodes of the tree could be balanced and symmetrical—employing a matching set of predictors with each level of the subtree—or could be unbalanced in that subnode partitions could be based on the most powerful predictor at a given level of the subtree. This innovation exploits the power of decision trees to explore and discover a host of subregion effects in data and, like the use of predictors identified on the basis of deviation from expected values, forms the basis of modern decision trees.

Morgan and Sonquist7 built on Belson's early work and saw decision trees as a complement and alternative to regression for analyzing survey data. Initially, Morgan and Sonquist began with the notion of employing trees in order to identify interaction terms that would be useful in forming the most effective regression solution for their data modeling tasks. In tests run by Morgan and Sonquist, they observed a decision tree which partitioned data into 21 groups that accounted for two-thirds of the variance of the response variable. A similar regression with 30 terms, including interaction terms, was only able to account for 36% of the variance in the response. The authors reached three conclusions: (1) that interactions among inputs are inevitable; (2) that regression requires the analyst to specify interactions in advance; and (3) that decision trees were better tools because they find the interactions as they grow the tree.

Many observers at the time were resistant to employing the relatively new and lightly tested approach advocated by Morgan and Sonquist. Regression practitioners then—and now—develop results on the basis of well-informed theory and widely tested results in a broad, active, and well-informed community. The theoretical underpinnings—coupled with a rich history of fielded results—enable regression practitioners to develop time-tested, effective metrics and diagnostics in a wide range of circumstances.

Decision trees were demonstrated to have shortcomings of their own: how to go about selecting appropriate variables to form the tree partitions (input vetting and selection) and how many partitions, of what complexity, to build. These latter two problems served as the 'grist for the mill' of the next steps in the development of statistical decision trees carried out by Kass and Hawkins8 and Breiman et al.,9 respectively. Over time, this body of work has provided substantial credibility and a rich legacy of fielded applications that help establish trees as a useful, viable, and trustworthy technique.

RULE INDUCTION, MACHINE LEARNING, AND DECISION TREES

During the 1950s, as Belson was developing his approach, a kind of computation which he described as based '...on the principle of biological classification', other researchers in experimental psychology were attempting to encode human approaches to concept formation tasks. Both approaches naturally fed into the nascent field of artificial intelligence and machine learning. In this way Belson's work serves as a precursor to a new line of decision tree development that employs machine learning algorithms to produce executable rules.

The work in experimental psychology led to the development of a computer implementation, entitled 'CLS' (for Concept Learning System), developed by Hunt et al.10 As in the earlier approaches of Belson and Morgan and Sonquist, CLS works through the successive application of partitions in the data based on highly discriminating variables or inputs. J. Ross Quinlan entered this field from a machine learning perspective. He formalized the development of this approach to concept formation as a method of knowledge acquisition. This resulted in the development of 'Iterative Dichotomiser 3' (ID3).11
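CLS and ID3 choose the most discriminating input at each step; ID3 does so with an information-theoretic criterion. The sketch below is a small, self-contained illustration of that entropy and information-gain calculation, not Quinlan's implementation, and the attribute and class values are hypothetical.

```python
# Minimal sketch of the ID3-style selection criterion: choose the attribute with the
# largest information gain (reduction in entropy of the class labels).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, class_key="class"):
    labels = [r[class_key] for r in rows]
    base = entropy(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[class_key] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)  # weighted child entropy
    return base - remainder

rows = [
    {"outlook": "sunny",    "windy": "no",  "class": "play"},
    {"outlook": "sunny",    "windy": "yes", "class": "stay"},
    {"outlook": "rain",     "windy": "no",  "class": "play"},
    {"outlook": "rain",     "windy": "yes", "class": "stay"},
    {"outlook": "overcast", "windy": "no",  "class": "play"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a))
print(best)  # 'windy': the attribute an ID3-style learner would branch on first here
```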
Follow-ons to Quinlan's initial work have led to the development of a number of rule generation approaches for knowledge acquisition, commonly referred to as 'rule induction'.

BOX 1

Donald Michie served as the editor of a set of findings that featured Quinlan's initial work on ID3. Michie was a colleague of Alan Turing during the World War II Enigma Project and is a founding father of the field of artificial intelligence. He later applied inductive rules to the adaptive control of robotic devices and spacecraft.12 This rule method serves as a template for self-learning robotic systems up to the present day.

Subsequent work by Quinlan led to the development of C4.5.13

Rule induction is an active area of development and has led to a range of rule induction approaches, for example, W. Cohen's 'RIPPER'.14 RIPPER incorporates a multitree approach often described as 'sequential covering'. In these approaches the tree is first grown so that a pure node is found. A pure node is a node that results from the identification of a rule that predicts 100% of the target values. The preconditions of the rule 'cover' the training observations that correspond to this rule. The observations that are covered by the rule are then removed from the training data (i.e. are 'ripped' out). Successive trees are run, at each step looking for a rule that produces a 'pure node'. Multiple trees may be grown until no more pure nodes are found. Overall, the predictive space is 'covered' through the layering of these successively grown predictive rules. The RIPPER algorithm is a greedy algorithm; i.e. it produces overly optimistic results that do not generalize well. Alternative multitree approaches, discussed below, are less greedy and offer superior generalization performance. Another innovation suggested by Cohen was to form rules based on both the presence and absence of attributes (allowing Boolean NOTs to form part of the selection expression). This approach has more recently been implemented as part of a text mining solution to generate automatic text classification rules based on inductive rule learning.15

CURRENT DEVELOPMENTS (MULTIPLE TREES)

The bootstrap method, described by Efron,16 is a prominent example of the utility of resampling in statistical computation. The single tree approach—one of selecting the best single predictor at any one stage in the growth of the decision tree—can be extended by resampling the available training data. This random element has many benefits: the most obvious benefit is its smoothing properties. While a single decision tree bisects the space of training data into a number of hard-edged rectangles, multitrees form many overlapping bisections so that the fitted space more closely approximates such methods as neural networks and multiple regression. With multiple trees we can derive multiple, overlapping viewpoints that are different but complementary. When taken together, the overlapping views reduce both variance and bias.

The resampling approach has led to a number of methods to 'boost' the predictive power of the host training set. Multiple trees are always grown, regardless of the specific method that is employed. In addition to the introduction of random components in multiple trees, these approaches also offer the opportunity to reweight computations in successive iterations of tree growth. Unlike the 'sequential covering' approach, described above, where successive samples are drawn from the training corpus in unaltered form, boosting approaches reweight cases in successive iterations. The coverage offered by these approaches is less structured and deterministic than sequential covering. In this approach, the reweighting goal is to alter successive training samples with a view to improving the predictive performance of successive rule sets. These approaches have been explored and advocated by Schapire;17 notably in Adaboost developed by Freund and Schapire;18 Arcing by Breiman;19 and Gradient Boosting by Friedman.20

The Adaboost method (from 'adaptive boosting') employs an approach that reweights individual observations in subsequent samples. In Gradient Boosting, the target value is adjusted by a function of the residual of the training value minus the predicted value.

Various group-voting or aggregation methods are possible in the production of a final group-voting metric, including numeric averaging with continuous outcomes and majority votes or polling with categorical outcomes.

The interaction between the fields of statistical decision trees and machine learning continued throughout these adaptations of bootstrapping applications to multiple trees. One innovation included sampling and randomization across both rows and columns of the training data. This technique entered the machine learning field in the application of multiple decision trees to digit recognition as described in Amit and Geman.21
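The row-and-column resampling idea can be condensed into a short sketch: each tree is grown on a bootstrap sample of rows and a random subset of columns, and predictions are combined by majority vote. This is a simplified, hypothetical illustration in the spirit of bagging and Random Forests, not Breiman's reference implementation; in practice a library routine such as scikit-learn's RandomForestClassifier would be used.

```python
# Minimal sketch of multiple-tree resampling: each tree sees a bootstrap sample of rows
# and a random subset of columns; predictions are combined by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, n_cols=2, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    n_rows = X.shape[0]
    for _ in range(n_trees):
        rows = rng.integers(0, n_rows, size=n_rows)                 # bootstrap: rows with replacement
        cols = rng.choice(X.shape[1], size=n_cols, replace=False)   # random column subset
        tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)                  # majority vote for a 0/1 target

# Toy usage with random data standing in for a training table:
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
forest = fit_forest(X, y)
print(predict_forest(forest, X[:5]), y[:5])
```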
Much of this cross-fertilization is due to substantial cross-disciplinary work carried out by Breiman. He described this general row and column sampling approach as 'Random Forests';22 these are currently the leading benchmark implementation of decision trees across a variety of statistical and machine learning applications.

CONCLUSIONS

There are many variations on multitree themes: autonomous vs serial samples; row vs column reweighting schemes; replacement samples vs no replacement; and so on. Improvements over best-guess, single decision trees are shown in most multitree methods. As training data continues to increase in size there are now obvious benefits in the approach of multiple autonomous trees, as these trees can be calculated independently, in parallel, prior to the production of an aggregate effect. As the size of initial training data has increased, so too has the corresponding emphasis on sampling without replacement. With larger training data, sampling without replacement tends to reinforce the adoption of differences in the model results. This is now recognized as a potential strength of multitree methods.

To date, most multitree methods demonstrate strengths in various circumstances. As this field evolves it may become clear which method is best in which set of circumstances. Given the pace of innovation in this area it is likely that improved methods and new paradigms will continue to emerge.

NOTE

a. This data table is based on the Titanic Passenger List edited by Michael A. Findlay, originally published in Ref 23, and expanded with the help of the internet community. The original HTML files were obtained by Philip Hind (1999).
REFERENCES

1. Lima M. Visual Complexity: Mapping Patterns of Information. New York: Princeton Architectural Press; 2011, 28.

2. http://lib.stat.cmu.edu/S/Harrell/data/descriptions/titanic.html. (Accessed September 23, 2013).

3. Sonquist JA, Baker EL, Morgan JN. Searching for Structure. Ann Arbor, MI: Institute for Social Research; 1973.

4. Kass GV. An exploratory technique for investigating large quantities of categorical data. J R Stat Soc 1980, 29:119–127.

5. Belson WA. A technique for studying the effects of television broadcast. J R Stat Soc 1956, 5:195.

6. Belson WA. Matching and prediction on the principle of biological classification. J R Stat Soc 1959, 8:65–75.

7. Morgan JN, Sonquist JA. Problems in the analysis of survey data, and a proposal. J Am Stat Assoc 1963, 58:415–435.

8. Hawkins DM, Kass GV. Automatic interaction detection. In: Hawkins DM, ed. Topics in Applied Multivariate Analysis. Cambridge: Cambridge University Press; 1982.

9. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. London: Chapman and Hall; 1984.

10. Hunt E, Marin J, Stone P. Experiments in Induction. New York: Academic Press; 1966.

11. Quinlan JR. Discovering rules by induction from large collections of examples. In: Michie D, ed. Expert Systems in the Micro-electronic Age. Edinburgh: Edinburgh University Press; 1979, 168–201.

12. Michie D, Sammut C. Controlling a black-box simulation of a spacecraft. AI Mag 1991, 12:56–63.

13. Quinlan JR. C4.5: Programs for Machine Learning. New York: Morgan Kaufmann; 1988.

14. Cohen WW. Fast effective rule induction. Proceedings of the Twelfth International Conference on Machine Learning; 1995, 115–123.

15. Automatic Boolean rule generation. Available at: http://www.sas.com/text-analytics/text-miner/index.html. (Accessed March 25, 2013).

16. Efron B. Bootstrap methods: another look at the Jackknife. Ann Stat 1979, 7:1–26.

17. Schapire RE. The strength of weak learnability. Mach Learn 1990, 5:197–227.

18. Freund Y, Schapire RE. Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy; 1996, 148–156.

19. Breiman L. Arcing classifiers. Ann Stat 1998, 26:801–849.

20. Friedman JH. Stochastic gradient boosting. 1999. Available at: http://www-stat.stanford.edu/~jhf/ftp/stobst.ps.

21. Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput 1997, 9:1545–1588.

22. Breiman L. Random forests, 2001. Available at: http://oz.berkeley.edu/~breiman/randomforest2001.pdf. (Accessed September 23, 2013).

23. Eaton JP, Haas CA. Titanic: Triumph and Tragedy, Second Edition. New York: W.W. Norton & Company Inc; 1995.
FURTHER READING
Hawkins DM. Recursive partitioning. WIREs Comput Stat 2009, 1:290–295.
Loh WY. Classification and regression trees. WIREs Data Mining Knowl Discov 2011, 1:14–23.
de Ville B, Neville P. Decision Trees for Analytics Using SAS Enterprise Miner. Cary, NC: SAS Press; 2013.