
Chapter 4: Classification Using Decision Trees and Rules

Remember:
-​ Entropy is used to determine which feature to use for splitting, not where to split.
- The goal is to reduce entropy (i.e., increase homogeneity) and maximize information gain.

Understanding decision trees


- Form of the model: tree structure.
-​ The model itself comprises a series of logical decisions.
-​ Decision nodes: a decision to be made on an attribute.
-​ Branches: decision nodes split into branches that indicate the decision's choices.
- The tree is terminated by leaf nodes (terminal nodes) that denote the result of following a
combination of decisions (the small if/else sketch at the end of this section shows how this logic reads).
- Decision trees are appropriate for:
- applications in which the classification mechanism needs to be transparent for legal reasons.
- applications in which the results need to be shared to facilitate decision-making.
-​ Decision trees are:
-​ Most widely used machine learning technique
-​ Can be applied for modeling almost any type of data—often with unparalleled performance.
- In spite of their wide applicability, it is worth noting some scenarios where trees may not be an
ideal fit because they may result in a very large number of decisions and an overly complex tree:
-​ a task where the data has a large number of nominal features with many levels
-​ the data has a large number of numeric features.
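
The transparent, rule-like structure can be illustrated with a tiny sketch (not from the chapter; the loan scenario, feature names, and thresholds below are made up) showing how a decision tree's logic reads as nested if/else statements in R:

# Hypothetical tree: decision nodes test an attribute, branches are the
# possible answers, and leaf nodes return the final classification.
classify_applicant <- function(income, credit_history) {
  if (credit_history == "good") {        # decision node 1
    if (income > 40000) "approve"        # decision node 2 -> leaf
    else "review"                        # leaf
  } else {
    "reject"                             # leaf
  }
}

classify_applicant(income = 52000, credit_history = "good")   # "approve"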

Divide and conquer


-​ Decision trees are built using a heuristic called recursive partitioning, known as divide and
conquer.
-​ It is called divide and conquer because it uses the feature values to split the data into smaller
and smaller subsets of similar classes.
Steps:
1.​ The algorithm begins at the root node (the entire dataset)
2.​ It chooses a feature that is the most predictive of the target class.
3.​ The examples are then partitioned into groups of distinct values of this feature; this decision
forms the first set of tree branches.
4. The algorithm continues to divide and conquer the nodes, choosing the best predictive feature
each time, until a stopping criterion is reached (a minimal sketch of this loop follows the stopping conditions below).
Stopping condition:
This might occur at a node if:
➢​ All (or nearly all) of the examples at the node have the same class
➢​ There are no remaining features to distinguish among examples
➢​ The tree has grown to a predefined size limit
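
A minimal R sketch of this divide-and-conquer loop for nominal features (illustrative only, not the C5.0 implementation); it picks each split by information gain, which is defined later in this chapter, and stops when a node is pure, no features remain, or a depth limit is reached:

entropy <- function(classes) {
  p <- table(classes) / length(classes)
  p <- p[p > 0]
  -sum(p * log2(p))
}

info_gain <- function(data, feature, class_col) {
  parts <- split(data, data[[feature]], drop = TRUE)
  weighted <- sum(sapply(parts, function(p)
    nrow(p) / nrow(data) * entropy(p[[class_col]])))
  entropy(data[[class_col]]) - weighted
}

grow_tree <- function(data, features, class_col, depth = 0, max_depth = 5) {
  classes <- data[[class_col]]
  # stopping conditions: pure node, no remaining features, or size limit
  if (length(unique(classes)) == 1 || length(features) == 0 || depth >= max_depth) {
    return(names(which.max(table(classes))))        # leaf: the majority class
  }
  gains <- sapply(features, function(f) info_gain(data, f, class_col))
  best  <- features[which.max(gains)]               # most predictive feature
  # divide: one branch per distinct value of the chosen feature
  branches <- lapply(split(data, data[[best]], drop = TRUE), grow_tree,
                     features  = setdiff(features, best),
                     class_col = class_col,
                     depth     = depth + 1, max_depth = max_depth)
  list(split_on = best, branches = branches)
}

# e.g. grow_tree(movies, features = c("celebrities", "budget"), class_col = "category"),
# where `movies` is a hypothetical data frame of nominal features plus a class column.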
Example:
Imagine that you are working for a Hollywood film
studio, and your desk is piled high with screenplays.
Rather than read each one cover-to-cover, you decide
to develop a decision tree algorithm to predict
whether a potential movie would fall into one of three
categories: mainstream hit, critic's choice, or box office
bust. To gather data for your model, you turn to the
studio archives to examine the previous ten years of
movie releases. After reviewing the data for 30 different movie scripts, a pattern emerges. There seems to
be a relationship between the film's proposed shooting budget, the number of A-list celebrities lined up
for starring roles, and the categories of success. A scatter plot of this data might look something like the
following diagram:
Step 1: The algorithm begins at the root node (the entire dataset)

Step 2: Choose a feature (the film's proposed shooting budget or the number of A-list celebrities) that is
the most predictive of the target class (category of success).
Let's first split on the feature indicating the number of celebrities, partitioning the movies into two groups:
- movies with a low number of A-list stars
- movies without a low number of A-list stars

Step 3: The examples are then partitioned into groups of distinct values of this feature;
Next, among the group of movies with a larger number of celebrities, we can make another split
between:
-​ movies with a high budget
-​ movies without a high budget

At this point, we have partitioned the data into three groups:


1.​ top-left corner: critically-acclaimed films 🟩
a.​ a high number of celebrities
b.​ a relatively low budget.
2.​ top-right corner: box office hits 🔷:
a.​ a large number of celebrities
b. a high budget.
3. bottom half: flops 🔴:
a. a low number of celebrities (little star power)
b. a budget ranging from small to large
Step 4: Continue to divide and conquer the nodes:
We could continue to divide the data by splitting it based on increasingly specific ranges of budget and
celebrity counts until each of the incorrectly classified values resides in its own, perhaps tiny partition.
Stopping condition:
Since the data can continue to be split until there are
no distinguishing features within a partition, a
decision tree can be prone to overfitting the
training data with overly specific decisions. We'll
avoid this by stopping the algorithm here, since more
than 80 percent of the examples in each group are
from a single class.
Limitation of the decision tree: uses axis-parallel
splits
-​ You might have noticed that diagonal lines could
have split the data even more cleanly.
-​ The fact that each split considers one feature at a
time prevents the decision tree from forming
more complex decisions such as "if the number of
celebrities is greater than the estimated budget,
then it will be a critical success".

The C5.0 decision tree algorithm


-​ The most well-known implementation of decision trees.
-​ Was developed by computer scientist J. Ross Quinlan as an improved version of his prior
algorithm, C4.5, which itself is an improvement over his ID3 (Iterative Dichotomiser 3) algorithm.
-​ It does well for most types of problems directly out of the box.
-​ It generally performs nearly as well as other models but is much easier to understand and
deploy.
Strengths:
➔​ An all-purpose classifier that does well on most problems
➔ Highly automatic learning process that can handle numeric or nominal features, as well as missing data
➔​ Uses only the most important features
➔​ Can be used on data with relatively few training examples or a very large number
➔​ Results in a model that can be interpreted without a mathematical background (for relatively
small trees)
➔​ More efficient than other complex models
Weaknesses: relatively minor and can be largely avoided
➔​ Decision tree models are often biased toward splits on features having a large number of levels
➔​ It is easy to overfit or underfit the model
➔​ Can have trouble modeling some relationships due to reliance on axis parallel splits
➔​ Small changes in training data can result in large changes to decision logic
➔​ Large trees can be difficult to interpret and the decisions they make may seem counterintuitive
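
A minimal usage sketch in R (assuming the C50 package from CRAN; `train`, `test`, and the class column `outcome` are placeholder names, not from the chapter):

library(C50)

model <- C5.0(outcome ~ ., data = train)   # grow (and prune) a C5.0 tree
summary(model)                             # inspect the tree's decisions
pred <- predict(model, newdata = test)     # classify new examples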
Choosing the best feature for splitting data
Entropy:
-​ Pure segment: If it contains only a single class.
- C5.0 uses entropy to measure purity. Entropy quantifies the randomness, or disorder, within a set of
class values; the entropy of a sample of data indicates how mixed the class values are:
-​ If one class dominates completely (e.g., 100% "Yes"), entropy is zero, meaning the data is
completely homogeneous.
- If a dataset has equal numbers of different classes (e.g., 50% "Yes" and 50% "No"), it has the
maximum entropy (= 1 for a two-class problem).
- The definition of entropy is specified by:

Entropy(S) = − Σ (i = 1..C) Pi * log2(Pi)
In the entropy formula, for a given segment of data (S):


-​ C: the number of different class levels
-​ Pi: the proportion of values falling into class level i.
For example, suppose we have a partition of data with two classes:
-​ red (60%)
-​ white (40%)
We can calculate the entropy as: -0.60 * log2(0.60) - 0.40 * log2(0.40) = 0.9709506
How Decision Trees Use Entropy:
-​ A decision tree algorithm looks for the best way to split the data to reduce entropy.
-​ It evaluates each feature and calculates how much it decreases entropy when used for splitting.
For any two-class arrangement:
- if x is the proportion of examples in class 1,
- then 1 - x is the proportion of examples in class 2.
Using the curve() function, we can plot the entropy for all possible values of x:
curve(-x * log2(x) - (1 - x) * log2(1 - x), col = "red", xlab = "x", ylab = "Entropy", lwd = 4)

As illustrated by the peak in entropy at x = 0.50, a 50-50 split results in the maximum entropy.
Information gain:
-​ Information gain is used to calculate the change in entropy resulting from a split on each
possible feature.
-​ The algorithm checks different features and selects the one that provides the highest information
gain, meaning it best separates the classes.
-​ If the information gain = 0 → no reduction in entropy for splitting on this feature → The split does
not improve class separation.
Information Gain (Feature) = Entropy Before Split − Weighted Entropy After Split
InfoGain(F) = Entropy(S1) − Entropy(S2)

-​ Weighted Entropy After Split: entropy in the partitions resulting from the split.
Entropy(S2) = Σ (i = 1..n) wi * Entropy(Pi)
Weighted Entropy after split = w1 * Entropy(P1) + … + wn * Entropy(Pn)
where wi is the proportion of data points in partition Pi:
wi = (number of data points in partition Pi) / (total number of data points)
This means that after splitting a dataset into multiple groups (partitions), the overall entropy of the new
dataset is calculated by considering:
1.​ The entropy of each individual partition (how mixed or pure it is).
2.​ The size (proportion) of each partition relative to the total dataset.
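As a small worked example with made-up numbers (not from the chapter), suppose the 60% red / 40% white partition above contains 10 examples, and a hypothetical feature splits it into P1 = 5 red examples (pure) and P2 = 1 red and 4 white examples (mixed). In R:

entropy_before <- -0.6 * log2(0.6) - 0.4 * log2(0.4)         # 0.9709506
entropy_p1     <- 0                                           # pure partition
entropy_p2     <- -0.2 * log2(0.2) - 0.8 * log2(0.8)         # 0.7219281
weighted_after <- (5/10) * entropy_p1 + (5/10) * entropy_p2   # 0.3609641
info_gain      <- entropy_before - weighted_after             # about 0.61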
Handling Numeric Features
-​ The previous formulae assume nominal features, but decision trees use information gain for
splitting on numeric features as well.
-​ A common practice is testing various splits that divide the values into groups greater than or less
than a threshold; this reduces the numeric feature into a two-level categorical feature.
- The numeric threshold (e.g., "greater than 50" vs. "less than 50") yielding the largest information
gain is chosen for the split (a small sketch of this threshold search follows).
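
A hedged R sketch of this threshold search (illustrative only; the budget values and outcomes below are made up): binarize the numeric feature at each candidate cut point and keep the cut with the highest information gain.

entropy <- function(classes) {              # same helper as in the earlier sketch
  p <- table(classes) / length(classes)
  p <- p[p > 0]
  -sum(p * log2(p))
}

best_threshold <- function(x, classes) {
  base <- entropy(classes)
  v    <- sort(unique(x))
  cuts <- (head(v, -1) + tail(v, -1)) / 2   # midpoints between distinct values
  gains <- sapply(cuts, function(t) {
    grp <- x > t                            # reduces x to a two-level categorical feature
    w   <- mean(grp)
    base - (w * entropy(classes[grp]) + (1 - w) * entropy(classes[!grp]))
  })
  cuts[which.max(gains)]
}

budget  <- c(10, 20, 35, 60, 80, 95)        # shooting budgets in millions (made up)
outcome <- c("flop", "flop", "flop", "hit", "hit", "hit")
best_threshold(budget, outcome)             # 47.5, the cut between 35 and 60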
Pruning the decision tree
Large tree → overly specific decisions → overfitted model.
Pruning a decision tree involves reducing its size so that it generalizes better to unseen data.

Pre-pruning/ Early stopping:

-​ Pre-pruning/ Early stopping stops the tree from growing:


a.​ once it reaches a certain number of decisions
b.​ if the decision nodes contain only a small number of examples.
- However, one downside is that there is no way to know whether the tree will miss subtle but
important patterns that it would have learned had it grown to a larger size.

Post-pruning:

-​ Post-pruning involves:
1.​ growing a tree that is too large
2.​ using pruning criteria based on the error rates at the nodes to reduce the size of the tree to a
more appropriate level.
-​ This is often a more effective approach because it is difficult to determine the optimal depth of a
decision tree without growing it first.
C5.0’s Approach to Pruning

Post-pruning: C5.0 automatically prunes trees to improve accuracy.


-​ C5.0 first grows a large tree that overfits the training data.
-​ Then, it removes branches and nodes that don’t significantly improve classification accuracy.

Subtree Raising & Subtree Replacement


Subtree Raising:
-​ A branch deep in the tree is moved higher if it helps simplify the tree without losing accuracy.
Subtree Replacement:
-​ A complex set of conditions is replaced with a simpler decision that performs just as well.

Balancing overfitting and underfitting models


If model accuracy is vital, it may be worth investing some time experimenting with various pruning
options to see whether they improve performance on the test data.
One of the strengths of the C5.0 algorithm is that it is very easy to adjust the training options, as sketched below.
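
A hedged sketch of adjusting pruning-related options through C5.0Control() in the C50 package (treat the specific argument names and values as assumptions to verify against the package documentation; `train` and `outcome` are placeholder names):

library(C50)

model <- C5.0(outcome ~ ., data = train,
              control = C5.0Control(minCases = 10,             # pre-pruning: minimum examples per node
                                    CF = 0.10,                 # post-pruning confidence factor (smaller = more pruning)
                                    noGlobalPruning = FALSE))  # keep the final global pruning pass
summary(model)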
