Assignment 3
AIM: Implementation and study of the Decision Tree algorithm.
THEORY:
The purpose of an information system is to extract useful information from raw data. Data
science is a field of study that aims to understand and analyze data by means of statistics, big
data, machine learning and to provide support for decision makers and autonomous systems.
While this sounds complicated, the tools are based on mathematical models and specialized
software components that are already available (e.g. Python packages). In the following labs we
will learn about... learning. Machine Learning, to be more specific, and its two main
classes: Supervised Learning and Unsupervised Learning. The general idea is to write
software programs that can learn from the available data, identify patterns and make decisions
with minimal human intervention, based on Machine Learning algorithms.
Machine Learning: Supervised Learning
Supervised learning is the Machine Learning task of learning a function (f) that maps an input
(X) to an output (y) based on example input-output pairs. The goal is to find (approximate) the
mapping function so that new data can be predicted. The function can be continuous in the case
of regression, or discrete in the case of classification, requiring different algorithms. We will
now discuss classification methods, where the input/output variables are attributes and not
limited to numbers.
Regression vs classification
The main difference between them is that the output variable in regression is numerical (or
continuous, such as “dollars” or “weight”) while that for classification is categorical (or discrete,
such as “red”, “blue”, “small”, “large”). For example, when provided with a dataset about houses
(e.g. Boston), and you are asked to predict their prices, that is a regression task because price will
be a continuous output (see Lab 7). Examples of the common regression algorithms include
linear regression, Support Vector Regression (SVR), and regression trees.
For example, when provided with a dataset about houses, a classification algorithm can try to
predict whether the prices for the houses “sell more or less than the recommended retail price”.
Examples of the common classification algorithms include logistic regression, Naïve
Bayes, decision trees, and K Nearest Neighbors.
A decision tree is a classification and prediction tool having a tree like structure, where each
internal node denotes a test on an attribute, each branch represents an outcome of the test, and
each leaf node (terminal node) holds a class label.
You can actually construct a flowchart that can be used to understand the decisions from the
historical data and predict decisions for the next sample of data.
Here is an example that literally compares apples and oranges based on the size and texture of
the fruit using a Decision Tree. The algorithm has to learn from the available, labelled
examples and then predict other fruits, classifying them as either apples or oranges.
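A minimal sketch of such a fruit classifier with scikit-learn; the weights and the texture encoding below are made-up illustrative values, not data from the lab:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [weight in grams, texture (0 = smooth, 1 = bumpy)]
X = [[140, 0], [130, 0], [150, 1], [170, 1]]
y = ["apple", "apple", "orange", "orange"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# A heavier, bumpy fruit falls on the "orange" side of the learned split
print(clf.predict([[160, 1]])[0])
```

Because the toy data is perfectly separable, the fitted tree is a single split, which can be inspected with `sklearn.tree.plot_tree` or `export_text`.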
- ID3: Ross Quinlan is credited with the development of ID3, which is shorthand for “Iterative
Dichotomiser 3.” This algorithm leverages entropy and information gain as metrics to evaluate
candidate splits, and was introduced in Quinlan’s 1986 research.
- C4.5: This algorithm is considered a later iteration of ID3, which was also developed by
Quinlan. It can use information gain or gain ratios to evaluate split points within the decision
trees.
- CART: The term, CART, is an abbreviation for “classification and regression trees” and was
introduced by Leo Breiman. This algorithm typically utilizes Gini impurity to identify the ideal
attribute to split on. Gini impurity measures how often a randomly chosen record would be
misclassified if it were labeled according to the class distribution. When evaluating with Gini
impurity, a lower value is better.
While there are multiple ways to select the best attribute at each node, two methods, information
gain and Gini impurity, are popular splitting criteria for decision tree models. They help to
evaluate the quality of each test condition and how well it will be able to classify samples into a
class.
It’s difficult to explain information gain without first discussing entropy. Entropy is a concept
that stems from information theory and measures the impurity of the sample values. It is
defined by the following formula:

Entropy(S) = - Σ p(c) log2 p(c)

where the sum runs over the classes c and p(c) is the proportion of samples in S that belong to
class c.
For this dataset, the entropy is 0.94. This can be calculated by finding the proportion of days
where “Play Tennis” is “Yes”, which is 9/14, and the proportion of days where “Play Tennis” is
“No”, which is 5/14. Then, these values can be plugged into the entropy formula above.
Entropy(Tennis) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
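The same calculation can be checked with a small generic helper (illustrative code, not part of the lab's solution):

```python
import math

def entropy(class_counts):
    """Entropy of a set given the number of samples in each class."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

# 9 "Yes" days and 5 "No" days in the Play Tennis dataset
print(round(entropy([9, 5]), 2))  # 0.94
```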
We can then compute the information gain for each of the attributes individually. For example,
the information gain for the attribute “Humidity” would be the following:
Gain(Tennis, Humidity) = 0.94 - (7/14)*(0.985) - (7/14)*(0.592) = 0.151
As a recap,
- 7/14 represents the proportion of values where humidity equals “high” to the total number of
humidity values. In this case, the number of values where humidity equals “high” is the same as
the number of values where humidity equals “normal”.
- 0.985 is the entropy when Humidity = “high”
- 0.592 is the entropy when Humidity = “normal”
Then, repeat the calculation for information gain for each attribute in the table above, and select
the attribute with the highest information gain to be the first split point in the decision tree. In
this case, outlook produces the highest information gain. From there, the process is repeated for
each subtree.
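The weighted-average step of this calculation can be written as a small helper; the branch sizes and entropies below are the ones from the Humidity example above:

```python
def information_gain(parent_entropy, branches):
    """branches: list of (sample_count, entropy) pairs, one per branch."""
    total = sum(count for count, _ in branches)
    weighted = sum((count / total) * e for count, e in branches)
    return parent_entropy - weighted

# Humidity splits the 14 days into 7 "high" (entropy 0.985)
# and 7 "normal" (entropy 0.592) days
gain = information_gain(0.94, [(7, 0.985), (7, 0.592)])
print(round(gain, 3))  # ≈ 0.151
```

Repeating this for every attribute and picking the largest gain gives the root split (Outlook, in this example).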
Gini Impurity
Gini impurity is the probability of incorrectly classifying a random data point in the dataset if it
were labeled based on the class distribution of the dataset. Similar to entropy, if a set S is
pure (i.e. all of its samples belong to one class), then its impurity is zero. This is denoted by the
following formula:

Gini(S) = 1 - Σ p(c)^2

where p(c) is the proportion of samples in S that belong to class c.
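The formula translates directly into code (a generic helper for illustration):

```python
def gini(class_counts):
    """Gini impurity of a set given the number of samples in each class."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(gini([10, 0]))              # 0.0 for a pure node
print(round(gini([9, 5]), 3))     # impurity of the Play Tennis labels
```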
The accuracy and classification report for the wine quality dataset are:
Accuracy: 0.94
Classification Report:
precision recall f1-score support
After applying pre-pruning, the tree becomes noticeably smaller and simpler.
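A sketch of this pre-pruning workflow with scikit-learn. Note that `load_wine` (the wine recognition dataset bundled with scikit-learn) is used here as a stand-in for the wine quality data, and the depth/sample limits are example values, so the exact scores will differ from those reported above:

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: cap the depth and require a minimum number of samples per split
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```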
Advantages and disadvantages of Decision Trees
While decision trees can be used in a variety of use cases, other algorithms typically outperform
decision tree algorithms. That said, decision trees are particularly useful for data mining and
knowledge discovery tasks. Let’s explore the key benefits and challenges of utilizing decision
trees more below:
Advantages
- Easy to interpret: The Boolean logic and visual representations of decision trees make them
easier to understand and consume. The hierarchical nature of a decision tree also makes it easy to
see which attributes are most important, which isn’t always clear with other algorithms,
like neural networks.
- More flexible: Decision trees can be leveraged for both classification and regression tasks,
making them more flexible than some other algorithms. They are also insensitive to underlying
relationships between attributes; this means that if two variables are highly correlated, the
algorithm will only choose one of them to split on.
Disadvantages
- Prone to overfitting: Complex decision trees tend to overfit and do not generalize well to new
data. This scenario can be avoided through the processes of pre-pruning or post-pruning.
Pre-pruning halts tree growth when there is insufficient data while post-pruning removes
subtrees with inadequate data after tree construction.
- High variance estimators: Small variations within data can produce a very different decision
tree. Bagging, or the averaging of estimates, can be a method of reducing variance of decision
trees. However, this approach is limited as it can lead to highly correlated predictors.
- More costly: Given that decision trees take a greedy search approach during construction, they
can be more expensive to train compared to other algorithms.
- Not fully supported in scikit-learn: Scikit-learn is a popular machine learning library based
in Python. While this library does have a Decision Tree module (DecisionTreeClassifier), the
current implementation does not support categorical variables.
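The post-pruning mentioned above is available in scikit-learn as minimal cost complexity pruning via the `ccp_alpha` parameter. A sketch, again using the bundled wine recognition dataset as illustrative data:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Compute the pruning path: candidate alpha values for a fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train)

# Refit a pruned tree for each alpha and keep the best-scoring one
best = None
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=42,
                                  ccp_alpha=max(alpha, 0.0)).fit(X_train, y_train)
    if best is None or tree.score(X_test, y_test) > best.score(X_test, y_test):
        best = tree

print("Best test accuracy:", best.score(X_test, y_test))
```

In practice the optimal alpha would be chosen with cross-validation rather than on the test split; the loop above only illustrates the mechanism.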
REFERENCES:
1. https://www.ibm.com/in-en/topics/decision-trees
2. https://medium.com/@abhishekjainindore24/all-about-decision-trees-80ea55e37fef
CONCLUSION:
● Pre-Pruning stops tree growth early based on conditions like depth or minimum samples
per split, reducing complexity and computational cost but risking underfitting.
● Post-Pruning (Cost Complexity Pruning) grows a full tree first and then removes less
significant branches using an optimal α value, making it more effective at refining the
model.
● The best pruning strategy depends on the dataset, but a combination of both methods
often yields the most balanced results.
By implementing pruning techniques, decision trees become more interpretable, efficient, and
accurate, making them more suitable for real-world applications.