Decision Trees


CSCI417

Machine Intelligence
Lecture # 3

Spring 2024

1
Tentative Course Topics

1. Machine Learning Basics
2. Classifying with k-Nearest Neighbors
3. Splitting datasets one feature at a time: decision trees
4. Classifying with probability theory: naïve Bayes
5. Logistic regression
6. Support vector machines
7. Model Evaluation and Improvement: Cross-validation, Grid Search, Evaluation Metrics, and Scoring
8. Ensemble learning and improving classification with the AdaBoost meta-algorithm
9. Introduction to Neural Networks - Building NNs for classification (binary/multiclass)
10. Convolutional Neural Networks (CNN)
11. Pretrained models (VGG, AlexNet, ...)
12. Machine learning pipeline and use cases

2
Recommending App

• For a woman who works at an office, which app do we recommend?

• For a man who works at a factory, which app do we recommend?

ML asks: between Gender and Occupation, which feature seems more decisive for predicting which app a user will download?

3
Recommending App
[Decision tree from the figure: split first on Occupation (School → Pokemon Go; Work → split on Gender: F → WhatsApp, M → Snapchat)]
4
Between a horizontal and a vertical line, which one would cut the data better?

5
6
Non-parametric Estimation
• A non-parametric model is not fixed: its complexity depends on the size of the training set or, rather, on the complexity of the problem inherent in the data.
• Here, "non-parametric" does not mean that the model has no parameters; it means that the number of parameters is not fixed and can grow with the size of the data or, better still, with the complexity of the regularity that underlies the data.

7
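As an illustration (not from the slides), a quick check with scikit-learn, assumed to be available, shows how the size of a fully grown tree, i.e. its effective number of parameters, grows with the training set:

# Minimal sketch: a fully grown decision tree adds nodes as the training set
# grows, so its "number of parameters" is not fixed in advance.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
for n in (100, 500, 2000):
    tree = DecisionTreeClassifier(random_state=0).fit(X[:n], y[:n])
    print(n, "training examples ->", tree.tree_.node_count, "nodes")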
Decision tree
• A decision tree is a hierarchical data structure implementing
the divide-and-conquer strategy.
• It is an efficient non-parametric method that can be used for
both classification and regression.
• A decision tree is also a non-parametric model in the sense that we do not assume any parametric form for the class densities and the tree structure is not fixed a priori: the tree grows, and branches and leaves are added during learning, depending on the complexity of the problem inherent in the data.

8
Function Approximation

9
Ref: https://www.seas.upenn.edu, Eric Eaton.
Sample Dataset (Will Nadal Play Tennis?)
• Columns denote features X_i
• Rows denote labeled instances ⟨x_i, y_i⟩
• Class label denotes whether a tennis game was played

10
Ref: https://www.seas.upenn.edu, Eric Eaton.
Decision Tree

• A possible decision tree for the data:


• Each internal node: tests one attribute X_i
• Each branch from a node: selects one value for X_i
• Each leaf node: predicts Y

11
Ref: https://www.seas.upenn.edu, Eric Eaton.
Decision Tree

• A possible decision tree for the data:


• What prediction would we make for
<outlook=sunny, temperature=hot, humidity=high, wind=weak> ?

12
Ref: https://www.seas.upenn.edu, Eric Eaton.
Decision Tree
• Decision trees divide the feature space into axis-
parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
– or a probability distribution over labels

13
Ref: https://www.seas.upenn.edu, Eric Eaton.
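A small illustration (assuming scikit-learn; not part of the original slides): printing a fitted tree shows that every internal node compares a single feature against a threshold, so each split is a cut perpendicular to one axis and each leaf covers an axis-parallel rectangle.

# Minimal sketch: inspect the axis-parallel splits of a shallow tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
# Every printed rule has the form "feature <= threshold", i.e. one feature
# and one cut-point per node.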
Stages of Machine Learning
Given: labeled training data X, Y = {⟨x_i, y_i⟩}_{i=1}^{n}
• Assumes each x_i ~ D(X) with y_i = f_target(x_i)

Train the model:
  model ← classifier.train(X, Y)

Apply the model to new data:
• Given: new unlabeled instance x ~ D(X)
  y_prediction ← model.predict(x)
14
Ref: https://www.seas.upenn.edu, Eric Eaton.
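The train/apply stages above map directly onto the usual fit/predict pattern; a minimal sketch with scikit-learn (my naming, not the lecture's own code, which writes classifier.train / model.predict):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_new, y_train, y_new = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # train the model
y_prediction = model.predict(X_new)                     # apply the model to new data
print((y_prediction == y_new).mean())                   # accuracy on the held-out data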
Top-down learning
node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort the training examples to the leaf nodes.
5. If the training examples are perfectly classified, stop. Else, recurse over the new leaf nodes (a runnable sketch of this loop follows).
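A minimal, hedged sketch of this main loop for categorical attributes (plain Python, not the lecture's reference implementation). Here rows are dicts mapping attribute names to values, labels are the class labels, and the "best" attribute is the one with the highest information gain, as discussed next:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def build_tree(rows, labels, attrs):
    # Stop when the node is pure or no attributes remain; predict the majority label.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))        # step 1
    tree = {best: {}}                                                  # step 2
    for value in set(row[best] for row in rows):                       # step 3
        idx = [i for i, row in enumerate(rows) if row[best] == value]  # step 4
        tree[best][value] = build_tree([rows[i] for i in idx],         # step 5: recurse
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree

Called on the tennis table, build_tree returns a nested dict whose top-level key is the highest-gain attribute.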

How do we choose which attribute is best?


15
Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples
• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number of possible values
– Most-Values: Choose the attribute with the largest number of possible values
– Max-Gain: Choose the attribute that has the largest expected information gain
• i.e., attribute that results in smallest expected size of subtrees rooted at its children

• The ID3 algorithm uses the Max-Gain method of selecting the best
attribute

16
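For reference, the quantity that Max-Gain/ID3 maximizes can be written as follows (standard textbook form, not copied from the slides), where S_v denotes the subset of examples taking value v for attribute A:

\mathrm{Gain}(S, A) \;=\; H(S) \;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v),
\qquad
H(S) \;=\; -\sum_{i} p_i \log_2 p_i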
Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of
examples

Entropy can be roughly thought of as how much variance the data has.

17
18
Entropy: a common way to measure impurity

19
Entropy
• Entropy = 0 → all examples belong to the same class (minimum impurity):
  entropy = -(1) * log2(1) = 0
• Entropy = 1 → examples are evenly split between the classes (maximum impurity):
  entropy = -(0.5) * log2(0.5) - (0.5) * log2(0.5) = 1

20
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
2-Class Cases:

Entropy: H(x) = - Σ (from i = 1 to n) P(x = i) * log2 P(x = i)

• What is the entropy of a group in which all examples belong to the same class?
  – entropy = -(1) * log2(1) = 0 (minimum impurity)
  – not a good training set for learning

• What is the entropy of a group with 50% in either class?
  – entropy = -(0.5) * log2(0.5) - (0.5) * log2(0.5) = 1 (maximum impurity)
  – good training set for learning

22
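A quick numeric check of the two cases above (plain Python, illustrative only):

from math import log2
print(-1 * log2(1))                        # zero -> minimum impurity (pure group)
print(-0.5 * log2(0.5) - 0.5 * log2(0.5))  # 1.0  -> maximum impurity (50/50 split)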
Information Gain

23
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
Information Gain

24
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
Information Gain

25
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
Information Gain of feature WIND

26
Information Gain of feature HUMIDITY

27
Information Gain of feature TEMP

28
Information Gain of feature OUTLOOK

29
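Putting the last four slides together, a hedged sketch that recomputes the gain of each feature; the 14-example table below is assumed to match the standard play-tennis data from the dataset slide:

from collections import Counter
from math import log2

data = [  # (outlook, temperature, humidity, wind, play)
    ("sunny", "hot", "high", "weak", "no"),          ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),      ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),       ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),      ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),    ("rain", "mild", "high", "strong", "no"),
]
features = ["outlook", "temperature", "humidity", "wind"]

def H(labels):  # entropy of a list of class labels
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
for j, name in enumerate(features):
    gain = H(labels)
    for v in set(row[j] for row in data):
        subset = [row[-1] for row in data if row[j] == v]
        gain -= len(subset) / len(data) * H(subset)
    print(name, round(gain, 3))
# outlook should come out on top (gain ≈ 0.247), so it becomes the root split.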
After several iterations

30
Information Gain
Example: a family of 10 members, where 5 members are pursuing their studies and the other 5 have completed their studies or never pursued them.

Group 1 (mixed): % pursuing = 50%, % not pursuing = 50%
Group 2 (pure): % pursuing = 0%, % not pursuing = 100%

First, calculate the entropy of each group:

Group 1: Entropy = -(0.5) * log2(0.5) - (0.5) * log2(0.5) = 1
Group 2: Entropy = -(0) * log2(0) - (1) * log2(1) = 0 (taking 0 * log2(0) = 0)

If a node contains only one class, i.e. the node of the tree is pure, the entropy of the data in that node is zero; according to the information gain formula, a split that produces such a node gains more information, and its purity is higher.

Conversely, if the entropy is higher, the information gain will be smaller, and the node can be considered less pure.

31
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
Entropy for Parent Node

32
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
Entropy for Parent Node
Now, according to the performance of the students:

Parent node:
  Students = 20
  Curricular activity = 10
  No curricular activity = 10
  Entropy = -(0.5) * log2(0.5) - (0.5) * log2(0.5) = 1

Child nodes (split by performance):
  Child 1: Students = 14; curricular activity = 8/14 = 57%; no curricular activity = 6/14 = 43%
           Entropy = -(0.43) * log2(0.43) - (0.57) * log2(0.57) ≈ 0.98
  Child 2: Students = 6; curricular activity = 2/6 = 33%; no curricular activity = 4/6 = 67%
           Entropy = -(0.33) * log2(0.33) - (0.67) * log2(0.67) ≈ 0.91

Having calculated the entropy of the parent and child nodes, the weighted sum of the child entropies gives the weighted entropy of the split:
  Weighted Entropy = (14/20) * 0.98 + (6/20) * 0.91 = 0.959

33
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
Split based on the class (11th vs. 12th)

Entropy for the parent node:
  Entropy = -(0.5) * log2(0.5) - (0.5) * log2(0.5) = 1

Entropy for the child nodes:
  Class 11th: Entropy = -(0.8) * log2(0.8) - (0.2) * log2(0.2) = 0.722
  Class 12th: Entropy = -(0.2) * log2(0.2) - (0.8) * log2(0.8) = 0.722

Weighted Entropy = (10/20) * 0.722 + (10/20) * 0.722 = 0.722

34
Calculation of Information Gain
Entropy measures the amount of uncertainty due to a process or a random variable.

Information gain measures how much that uncertainty is reduced in the child nodes after a split, compared with the node before the split.

• Information Gain = parent entropy - weighted entropy of the children. Here the parent entropy is 1, so Information Gain = 1 - Weighted Entropy; a higher information gain means more entropy was removed, which is what we want.

Split                      Weighted Entropy    Information Gain
Performance of the class   0.959               0.041
Class (11th vs. 12th)      0.722               0.278

35
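A hedged numeric check of this table (plain Python). Note that the slide rounds the child entropies to 0.98 and 0.91 before weighting, so the exact numbers for the first split come out slightly different; the conclusion that the class split has the higher gain is unchanged.

from math import log2

def H(p):  # binary entropy of a class proportion p
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Split by performance of the class: children of 14 (8 active) and 6 (2 active) students
w_perf = 14/20 * H(8/14) + 6/20 * H(2/6)
# Split by class (11th vs. 12th): two children of 10 students, with 8/10 and 2/10 active
w_class = 10/20 * H(8/10) + 10/20 * H(2/10)

print(round(w_perf, 3), round(1 - w_perf, 3))    # ≈ 0.965, gain ≈ 0.035
print(round(w_class, 3), round(1 - w_class, 3))  # ≈ 0.722, gain ≈ 0.278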
36
37
Which Tree Should We Output?

38
Overfitting in DTs

39
Overfitting in DTs

40
Avoiding overfitting
How can we avoid overfitting?
• Stop growing when data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process – not always possible)
• Grow full tree, then post-prune

How to select “best” tree:


• Measure performance over training data
• Measure performance over separate validation data set
• Add complexity penalty to performance measure
(heuristic: simpler is better)

41
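A hedged sketch of both strategies with scikit-learn (assumed to be available): pre-pruning by stopping growth early, and post-pruning by growing the full tree and then choosing the complexity penalty (ccp_alpha) that performs best on a separate validation set.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing when nodes get too deep or too small
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then pick the cost-complexity penalty
# that gives the best accuracy on the validation set
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

def val_score(alpha):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    return tree.score(X_val, y_val)

best_alpha = max(path.ccp_alphas, key=val_score)
print("pre-pruned accuracy:", pre.score(X_val, y_val), "best ccp_alpha:", best_alpha)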
Reduced-error pruning
Split training data further into training and validation sets
Grow tree based on training set
Do until further pruning is harmful:
1. Evaluate impact on validation set of pruning each
possible node (plus those below it)
2. Greedily remove the node that most improves
validation set accuracy

42
Effect of Reduced-Error Pruning

The tree is pruned back to the red line, where it gives more accurate results on the test data.

43
Decision Tree in summary

• Representation: decision trees
• Bias: prefer small decision trees
• Search algorithm: greedy
• Heuristic function: information gain, information content, or others
• Overfitting / pruning

44
Decision Tree PROS & CONS

PROS:
• Fast and simple to implement.
• DTs work well with any type of nonlinearly separable data.
• A "path" through the possibilities, with alternatives, always leads toward a desirable outcome.

CONS:
• Prone to over-fitting, especially when many features are considered.
• While DTs visualize the decisions to be made, at the same time they condense a complex process into discrete steps (which may be a good or a bad thing).
• Non-incremental (batch method).

45
