Concepts - Decision Trees
[Figure: example decision tree splitting on Marital Status (Single / Married), with a leaf predicting BuyCar = No]
Attribute/Variable importance
- When an attribute A splits the set S into subsets, the "purity" of each subset is
measured (Logworth/Entropy/Gini) and the weighted sum of these measurements is
compared to the measurement of the original set S
- The attribute that maximizes the difference (Information Gain) is selected,
i.e., the attribute that increases the purity the most (sketched below)!
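To make the selection rule concrete, here is a minimal sketch, not from the slides, of computing the gain for a candidate split using entropy as the purity measure (the names entropy and information_gain are our own):

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_sets):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent_labels)
    weighted = sum(len(ch) / n * entropy(ch) for ch in child_label_sets)
    return entropy(parent_labels) - weighted

# Example: splitting 8 records on some attribute
parent = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
left, right = ["No", "No", "No"], ["No", "Yes", "Yes", "Yes", "Yes"]
print(information_gain(parent, [left, right]))  # ~0.549: the split increases purity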
Decision Tree…Simple Example
        B = Yes   B = No
  C0       1        5
  C1       4        2
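As a worked check on the table above, a short sketch (assuming the Gini index as the purity measure; the helper gini is our own) computing the weighted impurity of the split on B:

def gini(counts):
    """Gini impurity of a node given its per-class record counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# B = Yes: C0 = 1, C1 = 4;  B = No: C0 = 5, C1 = 2
g_yes, g_no = gini([1, 4]), gini([5, 2])        # 0.320 and ~0.408
weighted = (5 / 12) * g_yes + (7 / 12) * g_no   # ~0.371 for the split on B
print(g_yes, g_no, weighted)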
Measures of Node Impurity
• Entropy: Entropy(t) = -Σ_j p(j|t) log2 p(j|t)
• Gini Index: Gini(t) = 1 - Σ_j [p(j|t)]^2
• Misclassification Error: Error(t) = 1 - max_j p(j|t)
• Logworth: Logworth = -log10(p-value of the chi-squared test)
[Figure: comparison among the splitting criteria for a 2-class problem]
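A sketch of the four measures on per-class record counts, assuming scipy's chi-squared test for the Logworth (the function names are our own, not the slides'):

from math import log2, log10
from scipy.stats import chi2_contingency

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def misclassification_error(counts):
    return 1 - max(counts) / sum(counts)

def logworth(contingency):
    """-log10(p-value) of the chi-squared test on a branches-by-classes table."""
    _, p, _, _ = chi2_contingency(contingency)
    return -log10(p)

counts = [4, 4]                          # 2-class node, evenly split
print(entropy(counts))                   # 1.0 (maximum impurity)
print(gini(counts))                      # 0.5
print(misclassification_error(counts))   # 0.5
print(logworth([[1, 4], [5, 2]]))        # split on B from the earlier example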
Advantages & Limitations
Advantages:
- Easy to understand: Decision Trees are widely used to explain how decisions are
reached based on multiple criteria.
- Categorical and continuous variables: Decision trees can be generated using either
categorical data or continuous data.
- Able to handle complex relationships: A decision tree can partition a dataset into
distinct regions based on ranges or specific values.
- Classifying unknown records: extremely fast at classifying previously unseen records.
- Easy to interpret: especially for small trees.
Limitations:
- Computationally expensive: Building decision trees can be computationally expensive,
particularly when analysing a large dataset with many continuous variables.
- Difficult to optimize: Generating a useful decision tree automatically can be challenging,
since large and complex trees are easily produced, while trees that are too small may not
capture enough information. Finding the 'best' tree through optimization is difficult.
Tree Variations: Tree Size Options for Controlling Complexity
- Logworth threshold
- Maximum tree depth
- Minimum leaf size
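As an illustration of these controls, a minimal sketch using scikit-learn (an assumed library choice; sklearn exposes no direct Logworth threshold, so cost-complexity pruning via ccp_alpha is shown as the closest analogue):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    max_depth=3,          # maximum tree depth
    min_samples_leaf=5,   # minimum leaf size
    ccp_alpha=0.01,       # pruning strength (stand-in for a Logworth threshold)
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())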