Geometric Intuition of Decision Tree: Axis Parallel Hyperplanes
o Geometrically, a decision tree can be viewed as a set of axis-parallel hyperplanes that partition the feature space into hyper-cuboids; at inference time, a query point receives the prediction of the hyper-cuboid (leaf) it falls into.
o In the figure we can see that the decision boundaries are axis-parallel and that there is a decision surface (region) for each leaf node (Yi).
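To make this geometry concrete, here is a minimal sketch (assuming scikit-learn is available; the 2-D toy dataset and feature names x1, x2 are made up for illustration) that fits a small tree and prints its rules, each of which thresholds a single feature, i.e., defines an axis-parallel hyperplane:

```python
# Minimal sketch: every internal node of a decision tree tests one feature
# against a threshold, so each split is an axis-parallel hyperplane and the
# leaves tile the feature space into hyper-cuboids.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical 2-D toy dataset (x1, x2) with two classes
X = [[1, 1], [2, 1], [1, 4], [2, 5], [6, 1], [7, 2], [6, 5], [7, 6]]
y = [0, 0, 1, 1, 0, 0, 1, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each printed rule has the form "feature <= threshold", i.e., a hyperplane
# parallel to one of the coordinate axes.
print(export_text(clf, feature_names=["x1", "x2"]))
```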
• Entropy
o Entropy is a measure of unpredictability: a feature with higher entropy has more random (less predictable) values than one with lower entropy.
o Suppose we tossed a coin 4 times:

  Output         P(H)   P(T)   Entropy   Interpretation
  {H, H, T, T}   50%    50%    1         The output is a random event
  {T, T, T, H}   25%    75%    0.8113    The result is less random, i.e., a biased coin
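To check the two entropy values in the table, the short sketch below (plain Python, standard library only) evaluates H = -Σ p log2(p) for each outcome sequence:

```python
# Verify the entropy values for the two coin-toss sequences in the table above.
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # {H, H, T, T}  -> 1.0
print(entropy([0.25, 0.75]))  # {T, T, T, H}  -> ~0.8113
```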
o For a given random variable Y that takes K values y_1, ..., y_K, the entropy of Y is
  H(Y) = -\sum_{i=1}^{K} P(y_i) \log_2 P(y_i)
o If one class has 100% probability and all others are zero, the entropy is at its minimum (zero).
o In this figure, the entropy of the skewed distribution is lower than that of the normally distributed one: the more peaked the distribution, the lower its entropy.
o Entropy takes a higher value for spread-out (divergent) distributions and a lower value for distributions concentrated on a few similar values.
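These properties can be checked directly from the formula. The sketch below (plain Python; the four probability vectors are made up for illustration) shows that a one-hot distribution attains the minimum entropy 0, a uniform distribution attains the maximum log2(K), and a peaked distribution has lower entropy than a more spread-out one:

```python
# Entropy of a K-valued random variable: H(Y) = -sum_i P(y_i) * log2(P(y_i)).
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

one_hot = [1.0, 0.0, 0.0, 0.0]      # one class has 100% probability
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally unpredictable
peaked  = [0.85, 0.05, 0.05, 0.05]  # concentrated / peaked distribution
spread  = [0.40, 0.30, 0.20, 0.10]  # more spread-out distribution

print(entropy(one_hot))   # 0.0  (minimum)
print(entropy(uniform))   # 2.0  (maximum = log2(4))
print(entropy(peaked), "<", entropy(spread))  # peaked has lower entropy
```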
• Information Gain
o The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a
decision tree is all about finding the attribute that returns the highest information gain (i.e., the most
homogeneous branches).
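As a concrete sketch of this (assuming the standard 14-row Play Golf dataset, with overall class counts of 9 "yes" and 5 "no"), the information gain of splitting on Outlook is the parent entropy minus the weighted entropy of the child subsets:

```python
# Information gain = entropy(parent) - weighted average entropy of the children.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Play Golf example: 9 yes / 5 no overall; splitting on Outlook gives
# Sunny (2 yes, 3 no), Overcast (4 yes, 0 no), Rainy (3 yes, 2 no).
print(entropy([9, 5]))                                     # ~0.94, as noted below
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247
```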
o Example.
▪ As we saw earlier, the entropy of the golf example was 0.94.
▪ Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.
▪ Step 4a: The dataset is divided based on the values of the selected attribute. This process is run
recursively on the non-leaf branches, until all data is processed.
▪ In practice, we need some termination criteria, for example, when the coefficient of variation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch (e.g., 3); a sketch of the SDR and CV computations appears after these steps.
▪ Step 4b: The "Overcast" subset does not need any further splitting because its CV (8%) is less than the threshold (10%). The related leaf node gets the average of the "Overcast" subset.
▪ Step 4c: However, the "Sunny" branch has a CV (28%) greater than the threshold (10%), so it needs further splitting. We select "Windy" as the best node after "Outlook" because it has the largest SDR.
▪ Step 4d: Moreover, the "Rainy" branch has a CV (22%) greater than the threshold (10%), so this branch also needs further splitting. We select "Temp" as the best node because it has the largest SDR.
▪ Because the number of data points in all three branches (Cool, Hot, and Mild) is equal to or less than 3, we stop further branching and assign the average of each branch to the related leaf node.
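Below is a small sketch of the standard deviation reduction (SDR) and coefficient of variation (CV) computations used in these steps; the numeric targets and the candidate split are made up for illustration, only the formulas follow the procedure above:

```python
# Standard deviation reduction (SDR) and coefficient of variation (CV),
# the quantities used above to choose splits and to stop splitting.
from statistics import mean, pstdev

def sdr(parent, subsets):
    """SDR = S(parent) - sum(|subset| / |parent| * S(subset))."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)

def cv(values):
    """Coefficient of variation as a percentage: S / mean * 100."""
    return pstdev(values) / mean(values) * 100

# Hypothetical numeric targets (e.g., hours played) for one branch and a
# candidate split of that branch into two subsets.
branch = [26, 30, 48, 46, 62, 23, 43, 36]
candidate_split = [[26, 30, 23, 36], [48, 46, 62, 43]]

print(cv(branch))                   # stop splitting this branch if below 10%
print(sdr(branch, candidate_split)) # choose the attribute with the largest SDR
```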