Classification With Decision Trees
What is a Decision Tree
• A family of nonparametric algorithms used primarily for classification, but equally applicable to regression problems
• Decision trees help in devising rules for classification and regression
• A decision tree is a collection of decision nodes, connected by branches, extending downward from the root
node until terminating in leaf nodes
• Beginning at the root node, which by convention is placed at the top of the decision tree diagram, attributes
are tested at the decision nodes, with each possible outcome resulting in a branch
• Each branch then leads either to another decision node or to a terminating leaf node
• An example: classifying potential customers as Bad Credit Risk or Good Credit Risk based on the attributes Savings (Low, Medium, High), Assets (Low or Not Low), and Income (>= $30k or < $30k)
A Simple Decision Tree
Mixed Target Values with Same Attribute Values
• Here, because all of these customers have the same predictor values, there is no way to split the records on the predictor variables that will lead to a pure leaf
• Therefore, such nodes become diverse leaf nodes, with mixed values for the target attribute
• In this case, the decision tree may report that the classification for such customers is “bad,” with 60%
confidence, as determined by the three-fifths of customers in this node who are bad credit risks
Requirements of Decision Trees
• Decision tree algorithms represent supervised learning, and as such require prelabelled target variables.
• A training data set must be supplied, which provides the algorithm with the values of the target variable
• This training data set should be rich and varied, providing the algorithm with a healthy cross section of the
types of records for which classification may be needed in the future
• Decision trees learn by example, and if examples are systematically lacking for a definable subset of
records, classification and prediction for this subset will be problematic or impossible
• The target attribute classes must be discrete. That is, one cannot apply decision tree analysis to a continuous target variable.
• The target variable must take on values that are clearly demarcated as either belonging to a particular class or not belonging
Questions on Decision Tree Construction
• Why, in the example above, did the decision tree choose the savings attribute for the root node split? Why did
it not choose assets or income instead?
• Decision trees seek to create a set of leaf nodes that are as “pure” as possible; that is, where each of the
records in a particular leaf node has the same target value
• How does one measure uniformity, or conversely, how does one measure heterogeneity?
• We shall examine two of the many methods for measuring leaf node purity, which lead to the following
two leading algorithms for constructing decision trees
• CART
• C4.5
CART: Classification and Regression Trees
• The CART method was suggested by Breiman et al. in 1984
• The decision trees produced by CART are strictly binary, containing exactly two branches for each decision
node
• CART recursively partitions the records in the training data set into subsets of records with similar values for
the target attribute
• The CART algorithm grows the tree by conducting, for each decision node, an exhaustive search of all available variables and all possible splitting values, and selecting the optimal split according to the following criterion
• Let Φ(s|t) be a measure of the “goodness” of a candidate split s at node t, defined as shown below
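• In one common formulation, following Breiman et al. (a sketch of the measure rather than the source’s exact notation):
Φ(s|t) = 2 PL PR Σj |P(j|tL) − P(j|tR)|
where tL and tR are the left and right child nodes produced by split s, PL and PR are the proportions of the records at node t sent to tL and tR, and P(j|tL), P(j|tR) are the proportions of class j records within tL and tR, summed over all classes j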
CART
• Then the optimal split is whichever split maximizes this measure Φ(s|t) over all possible splits at node t
• Suppose that we have the training data set shown in the table and are interested in using CART to build a decision tree for predicting whether a particular customer should be classified as a good or a bad credit risk
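• As a rough sketch (not the textbook's code), the goodness measure Φ(s|t) for a single candidate binary split could be computed in Python as follows; the label lists at the bottom are hypothetical:

def goodness_of_split(left_labels, right_labels):
    # Phi(s|t) = 2 * P_L * P_R * sum over classes j of |P(j|t_L) - P(j|t_R)|
    # P_L, P_R: fractions of the parent node's records sent to each child;
    # P(j|t_L), P(j|t_R): class-j proportions within each child node.
    # Assumes neither child is empty (CART would not consider such a split).
    n = len(left_labels) + len(right_labels)
    p_left, p_right = len(left_labels) / n, len(right_labels) / n
    classes = set(left_labels) | set(right_labels)
    diff = sum(abs(left_labels.count(j) / len(left_labels)
                   - right_labels.count(j) / len(right_labels))
               for j in classes)
    return 2 * p_left * p_right * diff

# Hypothetical split of 8 records: 2 sent left (both bad), 6 sent right (5 good, 1 bad)
left = ["bad", "bad"]
right = ["good", "good", "good", "good", "good", "bad"]
print(goodness_of_split(left, right))  # 2 * (2/8) * (6/8) * (5/6 + 5/6) = 0.625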
CART
• Possible candidate splits for the root node
CART
• For each candidate split, let us examine the values of the various components of the optimality measure Φ(s|t)
CART
• CART decision tree after initial split
CART
• Splits for decision node A (best performance highlighted)
CART
• CART decision tree after decision node A split
CART
• CART decision tree, fully grown form
Classification Error
• Not every data set produces a tree with pure (homogeneous, uniform) leaf nodes
• Records with the same attribute values but different class labels occur frequently, which leads to a certain level of classification error
• For example, suppose that, because we cannot further partition the records in the table, we classify the records contained in this leaf node as bad credit risk
• Then the probability that a randomly chosen record from this leaf node would be classified correctly is 0.6,
because three of the five records (60%) are actually classified as bad credit risks
• Hence, our classification error rate for this particular leaf would be 0.4 or 40%, because two of the five
records are actually classified as good credit risks
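• Equivalently, when a leaf is labeled with its majority class, error(leaf) = 1 − (proportion of the majority class) = 1 − 3/5 = 0.40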
Classification Error
• CART would then calculate the error rate for the entire decision tree to be the weighted average of the
individual leaf error rates, with the weights equal to the proportion of records in each leaf
• To avoid memorizing the training set, the CART algorithm needs to begin pruning nodes and branches that
would otherwise reduce the generalizability of the classification results
• Even though the fully grown tree has the lowest error rate on the training set, the resulting model may be too
complex, resulting in overfitting
• As each decision node is grown, the subset of records available for analysis becomes smaller and less
representative of the overall population
• Pruning the tree will increase the generalizability of the results
• Essentially, an adjusted overall error rate is found that penalizes the decision tree for having too many leaf
nodes and thus too much complexity
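• Written out, with nleaf the number of training records in a leaf and N the total number of training records, the overall error rate is error(T) = Σ over leaves (nleaf / N) * error(leaf)
• One common form of such a penalized measure is CART's cost-complexity criterion, errorα(T) = error(T) + α * (number of leaf nodes), where the complexity parameter α controls how strongly extra leaves are penalized; the notation here is illustrative rather than the source's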
C4.5
• The C4.5 algorithm is Quinlan’s extension of his own Iterative Dichotomiser 3 (ID3) algorithm for generating decision trees
• Unlike CART, the C4.5 algorithm is not restricted to binary splits. Whereas CART always produces a binary
tree, C4.5 produces a tree of more variable shape
• The C4.5 method for measuring node homogeneity is quite different from the CART method
• The C4.5 algorithm uses the concept of information gain or entropy reduction to select the optimal split
• Suppose that we have a variable X whose k possible values have probabilities p1, p2, … , pk. What is the
smallest number of bits, on average per symbol, needed to transmit a stream of symbols representing the
values of X observed?
• The answer is called the entropy of X and is defined as
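• Measured in bits (base-2 logarithms), the entropy is H(X) = − Σj pj log2(pj), with the sum taken over j = 1, … , k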
C4.5
• C4.5 uses this concept of entropy as follows. Suppose that we have a candidate split S, which partitions the
training data set T into several subsets, T1, T2, … , Tk
• The mean information requirement can then be calculated as the weighted sum of the entropies of the individual subsets: HS(T) = Σ (i = 1, … , k) Pi * H(Ti), where Pi represents the proportion of records in subset Ti (e.g., for Savings there are three subsets: Low, Medium, and High)
• We may then define the information gain as gain(S) = H(T) − HS(T), that is, the increase in information produced by partitioning the training data T according to the candidate split S
• At each decision node, C4.5 chooses the optimal split to be the split that has the greatest information gain,
gain(S)
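• A minimal Python sketch (not from the source) of how H(T), HS(T), and gain(S) could be computed; the record dictionaries and attribute names below are hypothetical:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy, in bits, of a sequence of class labels
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(records, split_attr, target_attr):
    # gain(S) = H(T) - HS(T) for splitting `records` on `split_attr`
    before = entropy([r[target_attr] for r in records])
    subsets = {}
    for r in records:
        subsets.setdefault(r[split_attr], []).append(r[target_attr])
    # HS(T): weighted sum of the entropies of the subsets produced by the split
    after = sum((len(s) / len(records)) * entropy(s) for s in subsets.values())
    return before - after

# Hypothetical records, only to show the call shape
records = [
    {"savings": "high", "risk": "good"},
    {"savings": "high", "risk": "bad"},
    {"savings": "medium", "risk": "good"},
    {"savings": "low", "risk": "bad"},
]
print(information_gain(records, "savings", "risk"))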
C4.5
• Let us look at the same data again
• The possible candidate splits at the root node are given
• Now, because five of the eight records are classified as good credit risk and the remaining three as bad credit risk, the entropy of the training set before splitting is computed as shown below
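• Using the entropy definition with pgood = 5/8 and pbad = 3/8: H(T) = −(5/8) log2(5/8) − (3/8) log2(3/8) ≈ 0.9544 bits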
C4.5
• For candidate split 1 (Savings), two of the records have high savings, three have medium savings, and three have low savings, so Phigh = 2/8, Pmedium = 3/8, and Plow = 3/8
• For high savings, one record is a good credit risk and one is a bad credit risk, so the entropy of that subset is Hhigh = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit
• Here, as in the CART algorithm, Assets turns out to be the first attribute split at the root node
C4.5
• After splitting according to Assets
C4.5
• The initial split has resulted in the creation of two terminal leaf nodes and one new decision node: the leaf nodes are Assets = Low (2 bad) and Assets = High (2 good)
• The Assets = Medium node contains 3 good and 1 bad records, so it needs to be split further
• Its entropy before splitting is −(3/4) log2(3/4) − (1/4) log2(1/4) ≈ 0.8113 bits
• For split 1 (Savings), Plow = 1/4, Pmedium = 2/4, Phigh = 1/4 and Hlow = 0, Hmedium = 0, Hhigh = 0, so the information gain is 0.8113 − 0 = 0.8113
• Split 2 (Income <= 25k vs. > 25k) eventually gives this same, maximum information gain
• So, Savings is arbitrarily chosen as the next attribute to split on
C4.5
• The final decision tree is
• Finally, once the decision tree is fully grown, C4.5 engages in pessimistic postpruning
Decision Rules
• An interesting feature of decision tree models is their interpretability
• Decision rules can be constructed from a decision tree simply by traversing any given path from the root node
to any leaf
• Decision rules come in the form “if antecedent, then consequent”
• The support of the decision rule refers to the proportion of records in the data set that rest in that particular
terminal leaf node
• The confidence of the rule refers to the proportion of records in the leaf node for which the decision rule is
true
• In this small example, all of our leaf nodes are pure, resulting in perfect confidence levels of 100% = 1.00 but
in reality, leaf nodes are generally nonuniform
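• For example, one path of the C4.5 tree above yields the rule “IF Assets = Low THEN Bad Credit Risk,” with support = 2/8 = 0.25 (two of the eight training records land in that leaf) and confidence = 2/2 = 1.00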
Support and Confidence Measure Values
• For the tree generated using C4.5, the values of the support and confidence measures (measures of the importance of each discovered rule) follow directly from the record counts in each leaf, as in the example above