Geometric Intuition of Decision Tree: Axis Parallel Hyperplanes

- Decision trees are flowchart-like structures where internal nodes represent tests on attributes, branches represent outcomes of tests, and leaf nodes hold class labels. They can be thought of as sets of axis-parallel hyperplanes that divide the space into regions.
- Entropy is a measure of unpredictability. It is highest when classes are equally probable and lowest when one class is certain. Information gain is the decrease in entropy from splitting on an attribute, and the attribute with the highest information gain is chosen for the split.
- Kullback-Leibler divergence measures the difference between two probability distributions, with higher values for dissimilar distributions and lower values for similar ones. It is used to calculate information gain during decision tree construction.


• Geometric Intuition of decision tree: Axis parallel hyperplanes

o Decision tree is a flowchart like tree structure where


▪ Each internal node denotes a test on an attribute
▪ Each branch represents an outcome of the test
▪ Each leaf node (terminal node) holds a class label.
o We can alternatively think of a decision tree as a group of nested IF-ELSE conditions modelled as a tree, where the decisions are made at the internal nodes and the output is obtained at the leaf nodes.
o Below is a simple decision tree for the IRIS dataset, where a and b are the sepal and petal lengths, respectively; a sketch of the equivalent nested IF-ELSE logic is shown below.
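A minimal sketch of this nested IF-ELSE view for the IRIS example, assuming illustrative (not learned) threshold values:

```python
def predict_iris(a, b):
    """Toy decision tree for IRIS written as nested IF-ELSE conditions.

    a = sepal length, b = petal length (as in the figure above).
    The thresholds are illustrative only, not learned from data.
    """
    if b < 2.5:                # internal node: test on petal length
        return "setosa"        # leaf node: class label
    elif a < 6.0:              # internal node: test on sepal length
        return "versicolor"    # leaf node
    else:
        return "virginica"     # leaf node

print(predict_iris(a=5.1, b=1.4))  # -> setosa
```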

o Geometrically, we can think of a decision tree as a set of axis-parallel hyperplanes that divide the feature space into hyper-cuboids, which act as the decision regions during inference.

o In the image we can see that the decision boundaries are axis-parallel and that there is one decision surface (region) for each leaf node (Yi).
• Entropy
o Entropy is a measure of unpredictability. A feature with higher entropy has more random values than one with lower entropy.
o Suppose we tossed a coin 4 times:
Output        P(H)  P(T)  Entropy  Interpretation
{H, H, T, T}  50%   50%   1.0000   The output is a random event
{T, T, T, H}  25%   75%   0.8113   The result is less random, i.e., a biased coin
o For a given random variable Y which takes K values, the entropy of Y is
$H(Y) = -\sum_{i=1}^{K} P(y_i)\,\log_b\big(P(y_i)\big)$
where $P(y_i) = P(Y = y_i)$, and typically $b = 2$ ($\log_2$) or $b = e$ ($\ln$).
o E.g., for the play-golf example (9 Yes and 5 No out of 14 days), there are two classes:
$H(Y) = -P(Y{=}\text{yes})\log_2 P(Y{=}\text{yes}) - P(Y{=}\text{no})\log_2 P(Y{=}\text{no})$
$H(Y) = -\tfrac{9}{14}\log_2\!\big(\tfrac{9}{14}\big) - \tfrac{5}{14}\log_2\!\big(\tfrac{5}{14}\big) = 0.94$
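A small Python sketch that reproduces these entropy values from class counts (the helper function is our own, not from the notes):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts, e.g. [9, 5]."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]     # skip zero-count classes
    return max(0.0, -sum(p * math.log2(p) for p in probs))

print(round(entropy([9, 5]), 2))   # 0.94 -> play-golf: 9 Yes, 5 No
print(round(entropy([2, 2]), 2))   # 1.0  -> fair coin, maximum entropy
print(round(entropy([4, 0]), 2))   # 0.0  -> pure node, one class is certain
```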
o Comparing entropies for different cases of a 2-class example:

Sr. No.  P(Y+)  P(Y-)  Entropy
1        99%    1%     0.08
2        50%    50%    1.00
3        100%   0%     0.00

▪ If both classes are equally probable, we have the maximum entropy value of 1
▪ If one class fully dominates, the entropy value becomes 0

• Comparing entropies for different cases for a multi-class example

o If all classes are equiprobable (a uniform distribution), the entropy is maximum
o If one class has 100% probability and the others are zero, the entropy is minimum
o In the figure, the entropy of the skewed (peaked) distribution is lower than that of the more spread-out, normal-shaped one; the more peaked the distribution, the lower the entropy

• Kullback-Leibler Divergence or relative entropy


o KL divergence measures the difference between two probability distributions over the same variable $X$.
o For discrete probability distributions $P$ and $Q$ on the same probability space $X$:
$D_{KL}(P \,\|\, Q) = \sum_{x \in X} p(x)\,\log\!\left(\frac{p(x)}{q(x)}\right)$
o For distributions $P$ and $Q$ of a continuous random variable:
$D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x)\,\log\!\left(\frac{p(x)}{q(x)}\right)\,dx$
where $p(x)$ and $q(x)$ are the probability densities of $P$ and $Q$.

o It takes a higher value for dissimilar (divergent) distributions and a lower value for similar distributions
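A minimal sketch of the discrete KL divergence, using two hypothetical distributions over the same three outcomes:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]                # reference distribution P
q_similar = [0.45, 0.35, 0.20]     # close to P  -> small divergence
q_far = [0.10, 0.10, 0.80]         # far from P  -> large divergence

print(round(kl_divergence(p, q_similar), 3))   # prints 0.009
print(round(kl_divergence(p, q_far), 3))       # prints 1.236
```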
• Information Gain
o The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a
decision tree is all about finding the attribute that returns the highest information gain (i.e., the most
homogeneous branches).
o Example.
▪ As computed above, the entropy of the play-golf example is 0.94



$H(Y) = -P(Y{=}\text{yes})\log_2 P(Y{=}\text{yes}) - P(Y{=}\text{no})\log_2 P(Y{=}\text{no})$
$H(Y) = -\tfrac{9}{14}\log_2\!\big(\tfrac{9}{14}\big) - \tfrac{5}{14}\log_2\!\big(\tfrac{5}{14}\big) = 0.94$
▪ The dataset is then split on the different attributes. The entropy for each branch is calculated and added, weighted proportionally to branch size, to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split; the result is the information gain, or decrease in entropy.
Outlook    Play Golf (Yes)  Play Golf (No)  Entropy  W.E(Play, Outlook)
Sunny      3                2               0.9710   0.3468
Overcast   4                0               0.0000   0.0000
Rainy      2                3               0.9710   0.3468
Gain 0.246

Temp       Play Golf (Yes)  Play Golf (No)  Entropy  W.E(Play, Temp)
Hot        2                2               1.0000   0.2857
Mild       4                2               0.9183   0.3936
Cool       3                1               0.8113   0.2318
Gain 0.029

Humidity   Play Golf (Yes)  Play Golf (No)  Entropy  W.E(Play, Humidity)
High       3                4               0.9852   0.4926
Normal     6                1               0.5917   0.2958
Gain 0.152

Windy      Play Golf (Yes)  Play Golf (No)  Entropy  W.E(Play, Windy)
FALSE      6                2               0.8113   0.4636
TRUE       3                3               1.0000   0.4286
Gain 0.048
$Gain(T, X) = Entropy(T) - W.Entropy(T, X)$
$Gain(\text{Play Golf}, \text{Outlook}) = Entropy(\text{Play Golf}) - W.Entropy(\text{Play Golf}, \text{Outlook}) = 0.94 - 0.693 = 0.247$
In general, for a dataset $D$ split into subsets $D_1, \dots, D_n$:
$IG(Y, D) = H_D(Y) - \sum_{i=1}^{n} \frac{|D_i|}{|D|}\, H_{D_i}(Y)$
▪ Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch; a code sketch of this gain computation follows the table below.
Outlook    Play Golf (Yes)  Play Golf (No)  Entropy  W.E(Play, Outlook)
Sunny      3                2               0.9710   0.3468
Overcast   4                0               0.0000   0.0000
Rainy      2                3               0.9710   0.3468
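A minimal Python sketch of this weighted-entropy / information-gain computation, using the Outlook counts from the table above (the helper functions are our own):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    """IG = H(D) - sum_i |D_i|/|D| * H(D_i)."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

# Play-golf target: 9 Yes, 5 No. Outlook splits it into Sunny/Overcast/Rainy.
parent = [9, 5]
outlook_branches = [[3, 2], [4, 0], [2, 3]]   # (Yes, No) counts per branch
print(round(information_gain(parent, outlook_branches), 3))   # 0.247
```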

▪ A branch with an entropy of 0 is a pure node; there is no need to split it further

▪ A branch with entropy more than 0 needs further splitting.


Sunny subset of the data:
Outlook  Temp  Humidity  Windy  Play Golf
Sunny    Mild  High      FALSE  Yes
Sunny    Cool  Normal    FALSE  Yes
Sunny    Cool  Normal    TRUE   No
Sunny    Mild  Normal    FALSE  Yes
Sunny    Mild  High      TRUE   No

Feature-wise counts for Outlook = Sunny:
Feature   Value   Yes  No
Temp      Mild    2    1
Temp      Cool    1    1
Humidity  Normal  2    1
Humidity  High    1    1
Windy     FALSE   3    0
Windy     TRUE    0    2
▪ Here we can conclude that Windy is the most important feature for the Sunny branch, as it gives pure nodes



▪ The same procedure can be applied to the Rainy branch, where we can see that Humidity gives pure nodes
Rainy subset of the data:
Outlook  Temp  Humidity  Windy  Play Golf
Rainy    Hot   High      FALSE  No
Rainy    Hot   High      TRUE   No
Rainy    Mild  High      FALSE  No
Rainy    Cool  Normal    FALSE  Yes
Rainy    Mild  Normal    TRUE   Yes

Feature-wise counts for Outlook = Rainy:
Feature   Value   Yes  No
Temp      Hot     0    2
Temp      Mild    1    1
Temp      Cool    1    0
Humidity  Normal  2    0
Humidity  High    0    3
Windy     FALSE   1    2
Windy     TRUE    1    1

o We stop growing the tree in the following cases:

▪ When a node is pure
▪ When there are very few points corresponding to a single class
▪ When the tree gets too deep, since as the depth of the tree increases we tend to overfit
▪ In a decision tree, the depth of the tree is therefore our main hyperparameter
• Gini Impurity
o It is a similar idea to entropy. For $Y \in \{y_1, y_2, \dots, y_k\}$:
$Im_G(Y) = 1 - \sum_{i=1}^{k} P(y_i)^2$

Sr. No.  P(Y+)  P(Y-)  Entropy  G.I.
1        99%    1%     0.08     0.02
2        50%    50%    1.00     0.50
3        100%   0%     0.00     0.00
o Calculating logarithms is computationally more expensive than computing squares; since entropy and Gini impurity behave similarly, Gini impurity is often preferred for its computational efficiency.
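A small sketch comparing the two impurity measures on the probabilities from the table above:

```python
import math

def entropy(probs):
    return max(0.0, -sum(p * math.log2(p) for p in probs if p > 0))

def gini(probs):
    return 1 - sum(p ** 2 for p in probs)

for probs in [(0.99, 0.01), (0.5, 0.5), (1.0, 0.0)]:
    print(probs, round(entropy(probs), 2), round(gini(probs), 2))
# (0.99, 0.01) 0.08 0.02
# (0.5, 0.5) 1.0 0.5
# (1.0, 0.0) 0.0 0.0
```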
• Splitting numerical features
Temp  85  80  83  70  68  65  64  72  69  75
Play  No  No  Yes Yes Yes No  Yes No  Yes Yes
o If we have a numerical feature, we need to find a threshold value on which to split the data.
o We first sort the data by the feature's values:
Temp  64   65  68   69   70   72  75   80  83   85
Play  Yes  No  Yes  Yes  Yes  No  Yes  No  Yes  No
o We split the data at each candidate threshold, e.g., D1 is the subset with Temp less than 65 and D2 is the subset with Temp greater than or equal to 65; the same is done for every threshold value.
o We calculate the information gain for each such split, and the threshold giving the maximum value of information gain is chosen for the decision node (see the sketch below).
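A minimal sketch of this threshold search on the Temp column above (the function names are our own):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Try a split at every sorted value and keep the one with the highest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = None
    for i in range(1, len(pairs)):
        thr = pairs[i][0]
        left = [y for x, y in pairs if x < thr]     # D1: values below threshold
        right = [y for x, y in pairs if x >= thr]   # D2: values at/above threshold
        gain = base - len(left) / len(pairs) * entropy(left) \
                    - len(right) / len(pairs) * entropy(right)
        if best is None or gain > best[1]:
            best = (thr, gain)
    return best

temp = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"]
print(best_threshold(temp, play))   # (best threshold, its information gain)
```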



• Categorical features with many possible values
o We often encounter situations where a feature has a large number of categories, e.g., pin codes. The list of pin codes can be very long, since each one represents an area, and each pin code has its own proportion of positive labels.
o We convert such a feature into a numerical feature such that
$P(Y = 1 \mid pin_j) = \dfrac{\#\{y = 1 \text{ for } pin_j\}}{\#pin_j}$
o We then split the dataset on threshold values of this numerical feature, just as we did for numerical features (see the sketch below).
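A minimal sketch of this response-rate encoding, using a small hypothetical list of pin codes and binary labels:

```python
from collections import defaultdict

def response_rate_encode(categories, labels):
    """Replace each category value with P(Y = 1 | category) estimated from the data."""
    ones, totals = defaultdict(int), defaultdict(int)
    for cat, y in zip(categories, labels):
        totals[cat] += 1
        ones[cat] += y
    rates = {cat: ones[cat] / totals[cat] for cat in totals}
    return [rates[cat] for cat in categories], rates

pins = ["400001", "400001", "411045", "411045", "411045", "560037"]   # hypothetical
y    = [1, 0, 1, 1, 0, 1]                                             # hypothetical target
encoded, rates = response_rate_encode(pins, y)
print(rates)     # {'400001': 0.5, '411045': 0.666..., '560037': 1.0}
print(encoded)   # the pin-code column replaced by these probabilities
```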
• Overfitting and underfitting
o As the depth of the tree increases, the possibility of having very few points at a leaf node also increases.
o These few points are most likely to be noisy; if we build splits around them, we are basically overfitting.
o As the depth of the model increases, its interpretability decreases, i.e., it is very hard to understand the model if there are many nested conditions.
o A decision stump is a decision tree with only a single split (a depth of one); at each leaf we predict the class which has the majority of points.
o If the depth of the tree is too small, the model will underfit.
o Here, depth is a hyperparameter which needs to be tuned based on cross-validation (see the sketch below).
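A minimal sketch of tuning the depth with cross-validation; it uses scikit-learn, which is not mentioned in the notes and is shown here only as one possible tool:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Search over max_depth: very shallow trees underfit, very deep trees overfit.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, 10, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```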

• Train and Run time complexity


o Train time complexity = O(n · log(n) · d)
o Run time space complexity = O(nodes)
o Typically, the depth of the tree is kept around 5 to 10; otherwise it becomes hard to interpret
o Run time complexity = O(depth)
o Decision trees are suitable for large data, small dimensionality and low-latency requirements
• Regression using Decision Trees

o Approach 1: By standard deviation

▪ Standard deviation for one attribute:
▪ Standard Deviation (S) is used for tree building (branching).
▪ Coefficient of Variation (CV) is used to decide when to stop branching. We can use the count of points (n) as well.
▪ The Average (Avg) is the value stored in the leaf nodes.



▪ Standard deviation for two attributes (target and predictor):
▪ The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).

▪ Step 1: The standard deviation of the target is calculated.


Standard deviation (Hours Played) = 9.32
▪ Step 2: The dataset is then split on the different attributes. The standard deviation for each branch is calculated and weighted by branch size. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the SDR (standard deviation reduction).

▪ Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.

▪ Step 4a: The dataset is divided based on the values of the selected attribute. This process is run
recursively on the non-leaf branches, until all data is processed.
▪ In practice, we need some termination criteria. For example, when coefficient of variation (CV) for a
branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n)
remain in the branch (e.g., 3).
▪ Step 4b: "Overcast" subset does not need any further splitting because its CV (8%) is less than the
threshold (10%). The related leaf node gets the average of the "Overcast" subset.

▪ Step 4c: However, the "Sunny" branch has a CV (28%) greater than the threshold (10%), so it needs further splitting. We select "Windy" as the best node after "Outlook" because it has the largest SDR.



Because the number of data points for both branches (FALSE and TRUE) is less than or equal to 3, we stop further branching and assign the average of each branch to the related leaf node.

▪ Step 4d: Moreover, the "Rainy" branch has a CV (22%), which is more than the threshold (10%). This branch needs further splitting. We select "Temp" as the best node because it has the largest SDR.

▪ Because the number of data points for all three branches (Cool, Hot and Mild) is less than or equal to 3, we stop further branching and assign the average of each branch to the related leaf node; a minimal sketch of the SDR computation follows.
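A minimal sketch of the standard deviation reduction computation. The "Hours Played" values follow the classic play-golf regression example quoted in Step 1; the three-way split shown at the end is purely illustrative:

```python
import math

def std(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(parent, branches):
    """Standard deviation reduction: S(parent) - sum_i |D_i|/|D| * S(D_i)."""
    n = len(parent)
    return std(parent) - sum(len(b) / n * std(b) for b in branches)

# "Hours Played" target from the classic play-golf regression example.
hours = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30]
print(round(std(hours), 2))             # 9.32, matching Step 1 above

# An illustrative split of the same values into three branches (not the real Outlook split).
branches = [hours[:5], hours[5:9], hours[9:]]
print(round(sdr(hours, branches), 2))   # SDR for this illustrative split
```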

