Decision Tree, Clustering

The document discusses clustering, specifically k-means clustering, which groups similar objects into clusters based on their proximity to cluster centers. It highlights the challenges of choosing the optimal number of clusters (K) and introduces ensemble learning techniques like bagging and boosting to improve model accuracy by combining multiple learners. Bagging reduces variance through parallel training of weak learners, while boosting sequentially focuses on correcting errors from previous models to create a strong classifier.


Clustering

• Clustering or cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (called a
cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
• Example for Clustering – Color Quantization - Let us say we have an image that is stored with 24 bits/pixel and can have up
to 16 million colors. Assume we have a color screen with 8 bits/pixel that can display only 256 colors. We want to find the
best 256 colors among all 16 million colors such that the image using only the 256 colors in the palette looks as close as
possible to the original image. This is color quantization where we map from high to lower resolution.
❑ k-means Clustering
• The k-means clustering algorithm is one of the simplest unsupervised learning algorithms for solving the clustering problem.
• Let it be required to classify a given data set into a certain number of clusters, say, k clusters.
• We start by choosing k points arbitrarily as the “centres” of the clusters, one for each cluster.
• We then associate each of the given data points with the nearest centre.
• We now take the averages of the data points associated with a centre and replace the centre with the average, and this is
done for each of the centres.
• We repeat the process until the centres converge to some fixed points.
• The data points nearest to the centres form the various clusters in the dataset.
• Each cluster is represented by the associated centre.
• i.e., we take the intra-cluster error by measuring the distance of each data point from its cluster center (inner
summation) and add this error over all the clusters (outer summation, where k is the number of clusters); we aim
to minimize this sum,
J = Σ (i = 1 to k) Σ (x ∈ Ci) ‖x − vi‖²,
where the vi's are the cluster centers and Ci is the set of points assigned to center vi (a sketch of the algorithm follows below).
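A minimal NumPy sketch of the procedure described above, assuming random initial centres chosen from the data; the toy data points and k = 2 are illustrative, not the lecture's worked example:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal k-means: pick k centres, assign points, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centres
    for _ in range(n_iters):
        # distance of every point to every centre, then nearest-centre assignment
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # replace each centre with the average of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):  # centres converged -> stop
            break
        centres = new_centres
    # intra-cluster error J: sum over clusters of squared distances to the centre
    error = sum(((X[labels == j] - centres[j]) ** 2).sum() for j in range(k))
    return centres, labels, error

# toy usage with illustrative data
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
centres, labels, error = k_means(X, k=2)
print(centres, labels, error)
```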
Some Methods to choose initial cluster points
• In the problem, the required number of clusters is 2
and we take k = 2.
• We choose two points arbitrarily as the initial cluster
centres.
• Let us choose two points arbitrarily as the initial centres Ṽ1 and Ṽ2 (the data values are not reproduced in this extract).
• We compute the distances of the given data points from the cluster centers and assign each point to the nearest center.

[Distance table omitted: the minimum distances are 1, 0, 0, 1.41, 2, 3.61, with assigned centers Ṽ1, Ṽ1, Ṽ2, Ṽ1, Ṽ2, Ṽ2.]

• The distance from x4 is the same for both centers, so we assign x4 to Ṽ1.
Second iteration: each center is replaced by the average of the data points assigned to it, and the points are reassigned.
Third iteration: no change in the clusters. Hence, we stop here.


Disadvantages/Drawbacks of K means Clustering
➢ Even though the k-means algorithm is fast, robust and easy to understand, there are several disadvantages to
the algorithm.
● The learning algorithm requires a priori specification of the number of cluster centers.
● The final cluster centers depend on the initial vi's.
● With different representations of the data we get different results.
● Euclidean distance measures can unequally weight underlying factors.
● The learning algorithm provides only a local optimum of the squared error function.
● Randomly chosen initial cluster centers may not lead to a fruitful result.
● The algorithm cannot be applied to categorical data.

• The optimum number of clusters (k) can be identified by plotting the number of clusters against the error:
after a certain value of K, the error decreases only slightly with each additional cluster. This is the Elbow
Method for finding the optimal K.
Challenges in Determining the Appropriate Value for K
Choosing the Right K: K-means clustering requires you to specify the number of clusters (K)
beforehand. One of the main challenges is determining the right value of K. If K is too small, the
algorithm may merge distinct clusters. If K is too large, the algorithm may create clusters where
none exist.
Elbow Method: One common approach to selecting K is the elbow method. You plot the within-
cluster sum of squared distances (WSS) for different values of K and look for an "elbow" in the
plot where the rate of decrease slows down. This point typically suggests a good value for K.
Silhouette Score: The silhouette score is another metric used to evaluate the quality of clusters
and helps determine the best K by measuring how similar an object is to its own cluster compared
to other clusters.
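A sketch of both selection heuristics using scikit-learn; the blob dataset and the range of K values are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# illustrative data: 4 well-separated blobs stand in for a real dataset
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss = km.inertia_                       # within-cluster sum of squared distances (WSS)
    sil = silhouette_score(X, km.labels_)   # higher is better (maximum 1)
    print(f"k={k}  WSS={wss:9.1f}  silhouette={sil:.3f}")

# Elbow method: plot (or inspect) WSS vs. K and pick the K where the decrease levels off.
# Silhouette: pick the K with the highest score; both should point to K = 4 here.
```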
Combining Multiple Learners – Ensemble Learning
• The No Free Lunch Theorem states that there is no single learning algorithm that in any domain always
induces the most accurate learner.
• The usual approach is to try many and choose the one that performs the best on a separate validation set.
• We may combine multiple learning algorithms, or the same algorithm with different hyperparameters, as
classifiers.

➢ Why do we prefer to combine many learners together?


There are several reasons why a single learner may not produce accurate results.
• Each learning algorithm carries with it a set of assumptions. This leads to error if the assumptions do not
hold. We cannot be fully sure whether the assumptions are true in a particular situation.
• Learning is an ill-posed problem. With finite data, each algorithm may converge to a different solution and
may fail in certain circumstances.
• The performance of a learner may be fine-tuned to get the highest possible accuracy on a validation set. But
this fine-tuning is a complex task and still there are instances on which even the best learner is not accurate
enough.
• It has been proved that there is no single learning algorithm that always produces the most accurate output.
❑ What are the different ways to achieve diversity in learning or how can we combine/select diverse
learners together. What do you mean by a base learner, how are they chosen
• When many learning algorithms are combined, the individual algorithms in the collection are called
the base learners of the collection.
• When we generate multiple base-learners, we want them to be reasonably accurate but do not require
them to be very accurate individually.
• The base-learners are not chosen for their accuracy, but for their simplicity.
• What we care about is the final accuracy when the base-learners are combined, rather than the accuracies
of the base-learners we started from.
There are several different ways for selecting the base learners.

1. Use different learning algorithms


• There may be several learning algorithms for performing a given task. For example, for classification,
one may choose the naive Bayes’ algorithm, or the decision tree algorithm or even the SVM
algorithm.
• Different algorithms make different assumptions about the data and lead to different results.
• When we decide on a single algorithm, we give emphasis to a single method and ignore all others.
• Combining multiple learners based on multiple algorithms, we get better results.
2. Use the same algorithm with different hyperparameters
• A hyperparameter is a parameter whose value is set before the learning process begins.
• By contrast, the values of other parameters are derived via training.
• The number of layers, the number of nodes in each layer and the initial weights are all hyperparameters in an artificial
neural network.
• When we train multiple base-learners with different hyperparameter values, we average over it and reduce variance,
and therefore error.

3. Use different representations of the input object


• For example, in speech recognition, to recognize the uttered words, words may be represented by the acoustic(sound)
input.
• Words can also be represented by video images of the speaker’s lips as the words are spoken.
• Different representations make different characteristics explicit allowing better identification.

4. Use different training sets to train different base-learners


• This can be done by drawing random training sets from the given sample; this is called bagging.
• The learners can be trained serially so that instances on which the preceding base-learners are not accurate are given
more emphasis in training later base-learners; examples are boosting and cascading.
• The partitioning of the training sample can also be done based on locality in the input space so that each base-learner is
trained on instances in a certain local part of the input space.
5. Model Combination Schemes
a) Multi-expert combination methods
• In this base learners work in parallel.
• All of them are trained and then given an instance, they all give their decisions, and a separate combiner
computes the final decision using their predictions.
• Examples include voting and its variants.
b) Multistage combination methods
• These methods use a serial approach where the next base-learner is trained with or tested on only the
instances where the previous base-learners are not accurate enough.

Model combination schemes


● Voting
➢ The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination
of the learners, also known as an ensemble.
➢ Each learner outputs its result, there are different mechanisms to choose the combined result. (also explain why
we combine multiple learners – based on marks)
➢ Base-learners are dj and their outputs are combined using f(·). For a single output, y = f(d1, d2, . . . , dL | Φ),
where Φ denotes the parameters of the combiner.
➢ In the case of classification, each base-learner has K outputs that are separately used to calculate yi, and then we
choose the class with the maximum yi. Note that here all learners observe the same input; it may be the case that
different learners observe different representations of the same input object or event.
Simple Voting
• All learners are given equal weight and we have simple voting that corresponds to taking an average.
Weighted Voting
• Each learner result is assigned different weights and we consider a weighted combination of the result
to obtain the final result.
Binary classification problem using Voting
• Consider a binary classification problem with class labels −1 and +1.
• Let there be L base learners and let x be a test instance.
• Each of the base learners will assign a class label to x. If the class label assigned is +1, we say that the
learner votes for +1 and that the label +1 gets a vote.
• The number of votes obtained by the class labels when the different base learners are applied is
counted.
• In the voting scheme for combining the learners, the label which gets the majority votes is assigned to
x.
Multi-class classification problem using Voting
• Let there be n class labels C1,C2,...,Cn.
• Let x be a test instance and let there be L base learners.
• Here also, each of the base learners will assign a class label to x, and each time a class label is assigned,
that label gets a vote.
• In the voting scheme, the class label which gets the maximum number of votes is assigned to x.
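A minimal scikit-learn sketch of simple (equal-weight, hard) voting and weighted (soft) voting; the base learners, dataset and weights below are illustrative choices, not prescribed by the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# illustrative dataset and three base learners using different algorithms
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

base = [("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000))]

simple = VotingClassifier(estimators=base, voting="hard")        # equal weights: majority vote
weighted = VotingClassifier(estimators=base, voting="soft",
                            weights=[1, 2, 1])                   # weighted combination of outputs

for name, clf in [("simple voting", simple), ("weighted voting", weighted)]:
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", clf.score(X_te, y_te))
```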
❖ Bagging and Boosting are two types of Ensemble Learning.
❖ These two decrease the variance of a single estimate as they combine several estimates from different
models.
❖ So the result may be a model with higher stability
1.Bagging: It is a homogeneous weak learners’ model that learns from each other independently in parallel and
combines them for determining the model average.
2.Boosting: It is also a homogeneous weak learners’ model but works differently from Bagging. In this model,
learners learn sequentially and adaptively to improve model predictions of a learning algorithm.

❑ Bagging
• Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed to
improve the stability and accuracy of machine learning algorithms used in statistical classification and
regression.
• It decreases the variance and helps to avoid overfitting.
• It is usually applied to decision tree methods.
• Bagging is a special case of the model averaging approach.
Description of the Technique
• Suppose a set D of d tuples. At each iteration i, a training set Di of d tuples is selected via row sampling with
replacement (i.e., the same tuple from D can appear more than once) from D (i.e., a bootstrap sample).
• Then a classifier model Mi is learned for each training set Di.
• Each classifier Mi returns its class prediction.
• The bagged classifier M* counts the votes and assigns the class with the most votes to X (unknown sample).
Implementation Steps of Bagging
•Step 1: Multiple subsets are created from the original data set, each with the same number of tuples, selecting
observations with replacement.
•Step 2: A base model is created on each of these subsets.
•Step 3: Each model is learned in parallel with each training set and independent of each other.
•Step 4: The final predictions are determined by combining the predictions from all the models.
How does Bagging Classifier Work?
The basic steps of how a bagging classifier works are as follows:
•Bootstrap Sampling: In Bootstrap Sampling randomly ‘n’ subsets of original training data are sampled with
replacement. This step ensures that the base models are trained on diverse subsets of the data, as some samples may
appear multiple times in the new subset, while others may be omitted. It reduces the risks of overfitting and improves the
accuracy of the model.
•Base Model Training: In bagging, multiple base models are used. After the bootstrap sampling, each base model
is independently trained using a specific learning algorithm, such as decision trees, support vector machines, or neural
networks, on a different bootstrapped subset of data. These models are typically called "weak learners" because they may
not be highly accurate on their own. Since each base model is trained independently on a different subset of the data, the
base models can be trained in parallel, which makes the method computationally efficient and less time-consuming.
•Aggregation: Once all the base models are trained, they are used to make predictions on unseen data, i.e., data on
which the base model was not trained. In the bagging classifier, the predicted class label for a given instance is
chosen by majority voting: the class that receives the majority of the votes is the prediction of the model.
•Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of particular base models during
the bootstrapping method. These “out-of-bag” samples can be used to estimate the model’s performance without the need
for cross-validation.
•Final Prediction: After aggregating the predictions from all the base models, Bagging produces a final prediction for
each instance.
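A hedged scikit-learn sketch of the steps above (bootstrap sampling, parallel training of weak decision-tree learners, out-of-bag evaluation, majority-vote aggregation); the dataset and the choice of 50 trees are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# bootstrap samples are drawn with replacement; each tree is trained independently
bag = BaggingClassifier(DecisionTreeClassifier(),   # the base (weak) learner
                        n_estimators=50,
                        bootstrap=True,
                        oob_score=True,             # out-of-bag evaluation, no separate validation set
                        n_jobs=-1,                  # base models trained in parallel
                        random_state=1)
bag.fit(X_tr, y_tr)

print("OOB estimate :", bag.oob_score_)
print("Test accuracy:", bag.score(X_te, y_te))      # majority vote over the 50 trees
```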
Example of Bagging
The Random Forest model uses bagging with decision tree models, which individually have high variance. It also
makes a random feature selection when growing each tree. Several such random trees make a Random Forest.

Advantages of Bagging
•The biggest advantage of bagging is that multiple weak learners can work better than a single strong learner.
•It provides stability and increases the machine learning algorithm’s accuracy, which is used in statistical
classification and regression.
•It helps in reducing variance, i.e., it avoids overfitting.
Disadvantages of Bagging
•It may result in high bias if it is not modeled properly and thus may result in underfitting.
•Since we must use multiple models, it becomes computationally expensive and may not be suitable in
various use cases.
Boosting
• Boosting is an ensemble modeling technique designed to create a strong classifier by combining
multiple weak classifiers.
• The process involves building models sequentially, where each new model aims to correct the
errors made by the previous ones.
• Initially, a model is built using the training data.
• Subsequent models are then trained to address the mistakes of their predecessors.
• Boosting assigns weights to the data points in the original dataset.
• Higher weights: Instances that were misclassified by the previous model receive higher weights.
• Lower weights: Instances that were correctly classified receive lower weights.
• Training on weighted data: The subsequent model learns from the weighted dataset, focusing its
attention on harder-to-learn examples (those with higher weights).
• This iterative process continues until:
• The entire training dataset is accurately predicted, or
• A predefined maximum number of models is reached.
Boosting Algorithms
• There are several boosting algorithms.
• The original ones, proposed by Robert Schapire and Yoav Freund, were not adaptive and could not take
full advantage of the weak learners.
• Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious
Gödel Prize.
• AdaBoost was the first really successful boosting algorithm developed for the purpose of binary
classification.
• AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that combines multiple
“weak classifiers” into a single “strong classifier”.
• In boosting, we take records from the dataset and pass them to base learners sequentially; here, the base learners can be
any model. Suppose we have m records in the dataset.
• Then, we pass a few records to base learner BL1 and train it.
• Once BL1 is trained, we pass all the records from the dataset through it and see how this base learner performs.
• The records classified incorrectly by BL1 are passed to the next base learner, say BL2, to train it; likewise, the records
misclassified by BL2 are used to train BL3.
• This continues until the specified number of base learner models has been trained.
• Finally, we combine the outputs from these base learners to create a strong learner; thus, the model's prediction power
improves.
Boosting works with the following steps:
•We sample m subsets from the initial training dataset.
•Using the first subset, we train the first weak learner.
•We test the trained weak learner on the training data. As a result of the testing, some data points will be incorrectly
predicted.
•Each data point with a wrong prediction is added to the second subset of data, and this subset is updated.
•Using this updated subset, we train and test the second weak learner.
•We continue with the next subset until the total number of subsets is reached.
•We now have the total prediction: the overall prediction is aggregated at each step, so there is no need to
calculate it separately at the end (a sketch follows below).
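A minimal sketch of boosting using scikit-learn's AdaBoostClassifier, the adaptive algorithm named above; the depth-1 decision stumps, dataset and learning rate are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

# decision stumps (depth-1 trees) are the classic weak learners for AdaBoost
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=100,
                           learning_rate=0.5,
                           random_state=2)
boost.fit(X_tr, y_tr)   # each stump is fit on re-weighted data, focusing on earlier mistakes
print("Test accuracy:", boost.score(X_te, y_te))
```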
Advantages of Boosting
•It is one of the most successful techniques in solving the two-class classification problems.
•It is good at handling the missing data.
Disadvantages of Boosting
•Boosting is hard to implement in real time due to the increased complexity of the algorithm.
•The high flexibility of these techniques results in multiple numbers of parameters that directly affect the behavior of
the model.
Similarities and differences between Bagging and Boosting: (comparison table not reproduced in this extract)
Decision Trees

• Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain
parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions
or the final outcomes.
• And the decision nodes are where the data is split.
• An example of a decision tree can be explained using a simple binary tree. Let's say you want to predict whether a person
is fit given information such as age, eating habits, and physical activity.
• The decision nodes here are questions like 'What's the age?', 'Does he exercise?', 'Does he eat a lot of pizza?'. And the
leaves are outcomes like either 'fit' or 'unfit'.
• In this case this was a binary classification problem (a yes no type problem).
• A tree structure resembling Flow Charts
• Leaf Node: Holds the Class Labels (Terminal Nodes)
• Internal Node: Non–Leaf Nodes → denotes a test on an attribute
• Branch: Outcome of the test
• Root Node: The topmost node from where all the branches originate

• Given a test tuple, x, the feature values are tested against the decision tree.
• A path is traced from the root to a leaf node
• Leaf node holds the classification results
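A small scikit-learn sketch of this idea: fit a tree (using the entropy criterion, in the spirit of the information gain measure discussed later), trace a test tuple from the root to a leaf via predict, and print the tree of attribute tests. The iris dataset is an illustrative stand-in for the lecture's Table 8.1:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# a test tuple is passed down from the root; the leaf it reaches gives the class label
x = [[5.1, 3.5, 1.4, 0.2]]
print("Predicted class:", iris.target_names[tree.predict(x)[0]])

# print the tree: internal nodes are attribute tests, leaves hold class labels
print(export_text(tree, feature_names=list(iris.feature_names)))
```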
Advantages
• No domain knowledge is required
• Easy to interpret
• Can handle multidimensional data
• Simple steps for learning and classification
Types of Decision Trees: Iterative Dichotomiser 3 (ID3), C4.5, C5.0 (latest)
Splitting Criterion
• Tells the best feature to test at node N
• Gives the best way to partition the tuples in D into individual classes
• “Goodness” is determined by measuring the purity of the partition at each branch
• A partition is pure, if all the tuples in it belongs to the same class
• Splitting criterion also tells which branches to grow from node N with respect to the outcomes of the
chosen test

Splitting
• The node N is labeled with the splitting criterion, which serves as a test at the node
• A branch is grown from N for each of the outcomes, and the tuples in D are partitioned
accordingly
• Let A be the feature having v distinct values, {a1, a2, . . . , av}
• Three conditions can arise:
— A is discrete–valued and binary tree is desired
— A is discrete–valued and non–binary tree is desired
— A is continuous–valued
Splitting Scenarios
Splitting Scenario: Discrete A & Non–Binary Tree is desired
• Outcomes of the test at node N correspond to the known values of A.
• A branch is created for each known value, aj, of A.
• Partition Dj is the subset of labelled tuples in D having value aj of A.
• Because all the tuples in a given partition have the same value for A, A
need not be considered in any future partitioning of the tuples.
• Therefore, it is removed from the feature list.
Splitting Scenario: A is Continuous
• Test at node N has two possible outcomes, corresponding to the conditions A ≤ Split Point
and A > Split Point
• The Split Point is returned by the attribute selection method as part of the splitting criterion.
• The Split Point may not be a value of A from the training data.
• Two branches are grown from N and labeled according to the previous outcomes.
• Partitioning: D1 holds the subset of tuples in D for which A ≤ Split Point, while D2 holds the rest.
Splitting Scenario: A is Discrete and Binary Tree is desired
• Test at node N is of the form A ∈ SA?
• SA is the splitting subset for A, returned by the attribute selection method.
• SA is a subset of the known values of A.
• If a given tuple has value aj of A and if aj ∈ SA, then the test at node N
is satisfied.
• Two branches are grown from N.
• Partitioning: D1 holds the subset of tuples in D for which the test is satisfied, while D2 holds the
rest.
❖ The algorithm uses the same process recursively to form a decision tree for the tuples
at each resulting partition, Dj, of D. Recursive partitioning stops only when one
of the Terminating Conditions is met:
Terminating Conditions
• All the tuples at node N belong to the same class, C
— Node N is converted to a leaf node labeled with the class C
• The partition Dj is empty
— A leaf is created with the majority class in D
• No remaining features on which the tuples may be further partitioned
— Convert node N into a leaf node and label it with the most common class in D
— Called Majority Voting
Decision Tree Algorithm
Attribute Selection Measures (ASM)
• A heuristic to find the Splitting Criterion (SC) that best separates a data partition D .
• The best splitting criterion partitions D into pure (close to pure) partitions
• Also called Splitting Rules (SR), since they decide how the tuples at a node are to be split
• ASM ranks each feature describing the data
• The feature with the best score is used to split the given tuple
• Depending on the measure, the best score can be minimum or maximum
• If the feature is continuous valued, a splitting point is determined
• If feature is discrete & binary tree is desired, a splitting subset is created
• The node for partition D is labeled with the splitting criterion
• Branches are grown for each outcome of the criterion and the tuples are partitioned accordingly.
Notation
• Number of classes: m
• Ci,D: the set of tuples of class Ci in D
• |D|: number of tuples in D
• |Ci,D|: number of tuples in Ci,D
Information Gain
• Based on the information content of the feature
• The feature with the highest information gain is chosen as the splitting attribute for the node
• This minimizes the information required to classify the tuples in the resulting partitions
• i.e., the resulting partitions have minimum impurity
• Thus, the number of tests (nodes) needed to classify a tuple is minimized
• This can make the tree simple
There are two main types of Decision Trees:
1. Classification trees (Yes/No types) (can explain ID3 algorithm) What we've seen above is an example of a
classification tree, where the outcome was a variable like 'fit' or 'unfit'. Here the decision variable is
Categorical. Tree models where the target variable can take a discrete set of values are called
classification trees. In these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels.
2. Regression trees (Continuous data types) – (can explain CART algorithm) Decision trees where the
target variable can take continuous values (real numbers) like the price of a house, are called regression
trees.
Classification trees

• Consider the data given in Table 8.1 which specify the features of certain vertebrates and the class to which they belong.
• For each species, four features have been identified: “gives birth”, ”aquatic animal”, “aerial animal” and “has legs”.
• There are five class labels, namely, “amphibian”, “bird”, “fish”, “mammal” and “reptile”.
• The problem is how to use this data to identify the class of a newly discovered vertebrate.
Construction of the tree
Step 1
• We split the set of examples given in Table 8.1 into disjoint subsets according to the values of the feature “gives
birth”.
• Since there are only two possible values for this feature, we have only two subsets: One subset consisting of those
examples for which the value of “gives birth” is “yes” and one subset for which the value is “no”.
Step 2
• We now consider the examples in Table 8.2. We split these examples based on the values of the feature “aquatic
animal”. There are three possible values for this feature. However, only two of these appear in Table 8.2. Accordingly,
we need to consider only two subsets. These are shown in Tables 8.4 and 8.5.

• Table 8.4 contains only one example and hence no further splitting is required.
• It leads to the assignment of the class label “fish”.
• The examples in Table 8.5 need to be split into subsets based on the values of
“aerial animal”.
• It can be seen that these subsets immediately lead to unambiguous assignment of
class labels: The value of “no” leads to “mammal” and the value “yes” leads to
”bird”.
Step 3
• Next we consider the examples in Table 8.3 and split them into disjoint subsets based on the values of “aquatic
animal”. We get the examples in Table 8.6 for “yes”, the examples in Table 8.8 for “no” and the examples in Table 8.7
for “semi”. We now split the resulting subsets based on the values of “has legs”, etc.
• Putting all these together, we get the tree diagram (Figure 8.5) for the data in Table 8.1
❑ Elements of a classification tree
• The various elements in a classification tree are identified as follows.
• In the example discussed above, initially we chose the feature “gives birth” to split the data set into disjoint
subsets and then the feature “aquatic animal”, and so on.
• There was no theoretical justification for this choice.
• The classification tree depends on the order in which the features are selected for partitioning the data.
• Stopping criteria (when can we stop the growth of a Decision Tree)
→ All (or nearly all) of the examples at the node have the same class.
→ There are no remaining features to distinguish among the examples.
→ The tree has grown to a predefined size limit.
❑ Explain how we decide which feature to be selected at each level in a decision tree.
• Feature selection measures (information gain, Gain ratio, Gini index )
• If a dataset consists of n attributes, then deciding which attribute is to be placed at the root or at different
levels of the tree as internal nodes is a complicated problem.
• It is not enough that we just randomly select any node to be the root.
• If we do this, it may give us bad results with low accuracy.
• The most important problem in implementing the decision tree algorithm is deciding which features are to be
considered as the root node and at each level.
• Several methods have been developed to assign numerical values to the various features such that the values
reflect the relative importance of the various features. These are called the feature selection measures.
• Two of the popular feature selection measures are information gain and Gini index.
Entropy
• The degree to which a subset of examples contains only a single class is known as purity, and any subset composed
of only a single class is called a pure class.
• Informally, entropy is a measure of “impurity” in a dataset.
• Sets with high entropy are very diverse and provide little information about other items that may also belong in the
set, as there is no apparent commonality.
• Entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1. For n
classes, entropy ranges from 0 to log2(n).
• In each case, the minimum value indicates that the sample is completely homogeneous, while the maximum value
indicates that the data are as diverse as possible
• Consider a segment S of a dataset having c class labels. Let pi be the proportion of examples in S having
the i-th class label. The entropy of S is defined as
Entropy(S) = − Σ (i = 1 to c) pi log2(pi)
Special case
• Let the data segment S has only two class labels, say, “yes” and “no”.
• If p is the proportion of examples having the label “yes” then the proportion of examples having label “no” will be 1 − p.
• In this case, the entropy of S is given by
Entropy (S) = −p log2(p) − (1 − p) log2(1 − p).
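A tiny Python check of this two-class special case, showing that entropy is 0 for a pure segment and maximal (1 bit) at p = 0.5:

```python
import math

def entropy_two_class(p):
    """Entropy(S) = -p*log2(p) - (1-p)*log2(1-p); 0*log2(0) is taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  entropy = {entropy_two_class(p):.3f}")
# entropy is 0 for a pure segment (p = 0 or 1) and maximal (1 bit) at p = 0.5
```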
Examples of Entropy Calculation

• Let "i" be some class label. We denote by pi the proportion of examples with class label "i".
1. Let S be the data in Table 8.1.
• The class labels are "amphibian", "bird", "fish", "mammal" and "reptile".
• In S we have the following numbers:
Number of examples with class label "amphibian" = 3
Number of examples with class label "bird" = 2
Number of examples with class label "fish" = 2
Number of examples with class label "mammal" = 2
Number of examples with class label "reptile" = 1
Total number of examples = 10
Therefore, we have
Entropy(S) = −(3/10) log2(3/10) − (2/10) log2(2/10) − (2/10) log2(2/10) − (2/10) log2(2/10) − (1/10) log2(1/10) ≈ 2.2464
Information Gain
• Let S be a set of examples, A be a feature (or attribute), Sv be the subset of S with A = v, and Values(A) be the set of
all possible values of A.
• Then the information gain of the attribute A relative to the set S, denoted by Gain(S, A), is defined as
Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)
• The attribute A with the highest information gain is selected.

Example
Consider the data S given in Table 8.1. We have already seen that |S| = 10 and Entropy(S) = 2.2464.
We denote the information gain corresponding to the feature "i" by Gain(S, i).
Similarly we can compute Gain(S, aerial animal) and Gain(S, has legs).
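A minimal Python sketch of the entropy and information-gain computations; the toy rows and labels below are hypothetical and are not the actual contents of Table 8.1:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(r[attr_index] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr_index] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# hypothetical toy rows: (gives birth, aquatic animal) -> class label
rows   = [("yes", "no"), ("no", "yes"), ("no", "no"), ("no", "semi"), ("yes", "yes")]
labels = ["mammal", "fish", "bird", "amphibian", "mammal"]

for i, name in enumerate(["gives birth", "aquatic animal"]):
    print(f"Gain(S, {name}) = {information_gain(rows, labels, i):.3f}")
# the attribute with the highest gain would be chosen as the splitting attribute
```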
Gini indices
• The Gini split index of a data set is another feature selection measure in the construction of classification
trees. This measure is used in the CART algorithm
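The lecture does not reproduce the formula; under the standard CART definition, Gini(D) = 1 − Σ pi² over the class proportions pi, and the Gini index of a candidate split is the size-weighted average of the partitions' Gini values. A minimal sketch with hypothetical partitions:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2  (0 for a pure partition)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(partitions):
    """Size-weighted Gini of a candidate split; the split with the lowest value is preferred."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)

# hypothetical split of 10 class labels into two partitions D1 and D2
left, right = ["yes"] * 4 + ["no"], ["no"] * 4 + ["yes"]
print("Gini(D1) =", gini(left), " Gini(D2) =", gini(right))
print("Gini split index =", gini_split([left, right]))
```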
Tree Pruning
• Due to noise/anomalies in the data, trees may overfit.
• Pruning avoids this, either by stopping tree growth early or by removing unreliable branches after the tree is grown.
• Advantages:
— Trees are shorter, Less complex, More interpretable
— Faster and Better classification of unseen tuples
• Two methods:
— Prepruning: Halt the tree construction early
— Postpruning: Remove subtrees from a fully grown tree
Prepruning
• Decides not to further split or partition data at the given node
• After halting, the node becomes a leaf.
• Majority voting gives the class label at the leaf
• Alternatively, the leaves can be programmed to show the probability distribution of the tuples

• Method: Thresholding on the Attribute Selection Measures is done to stop the splitting of nodes
• Ex: Only if Gain(A) > T, split the node; else don't split
• Too large T → Too simplified tree
• Too small T → Complex tree
• Setting the threshold is the key

Postpruning
Most common method
• Removes a subtree and replaces it with a leaf
• Leaf label is found out by majority voting
Cost Complexity Pruning Algorithm
• Used in CART method
• Cost Complexity: Function of the no. of leaves and the percentage of misclassified tuples
• For each node, this is compared for pruned and unpruned cases
• If pruned one has lower complexity, pruning is performed
• This process is performed from bottom to top
• A different set used for computing cost complexity → Pruning Set
• Pruning set is independent of training set
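A hedged scikit-learn sketch of cost-complexity pruning; here a held-out split stands in for the independent pruning set, and the dataset is an illustrative choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# held-out split used as a stand-in for the independent "pruning set"
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# candidate complexity penalties (alphas) for pruning the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    acc = tree.score(X_val, y_val)   # compare pruned trees on the independent set
    if acc >= best_acc:
        best_alpha, best_acc = alpha, acc

print(f"best alpha = {best_alpha:.5f}, validation accuracy = {best_acc:.3f}")
```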

Pessimistic Pruning Algorithm


• Used in C4.5
• Uses Error Rate Estimates to prune/keep the subtrees
• No separate pruning set needed; uses the training set itself
• After the tree is built, start removing each node from the first child node (top to bottom)
• Compute the error rate before and after pruning.
• Prune/keep the node based on the error rate
• Prepruning is less computationally complex
• Postpruning is computationally complex, but gives best results
• Pre and Post Pruning can also be combined if needed
• Even after pruning, Repetition and Replication can exist in decision trees.
• Repetition: the same attribute is tested repeatedly along a given branch of the tree.
• Replication: duplicate subtrees exist within the tree.
• Both can be removed by means of splits using a combination of attributes.
Random Forest
➢ A Random Forest is a collection of decision trees that work together to make predictions.
➢ The Random Forest algorithm is a powerful tree learning technique in Machine Learning: many trees make
predictions, and then we take a vote over all the trees to make the final prediction.
➢ They are widely used for classification and regression tasks.
➢ It is a type of classifier that uses many decision trees to make predictions.
➢ It takes different random parts of the dataset to train each tree and then combines the results by
averaging them.
➢ This approach helps improve the accuracy of predictions. Random Forest is based on ensemble
learning.

➢ Imagine asking a group of friends for advice on where to go for vacation. Each friend gives their
recommendation based on their unique perspective and preferences (decision trees trained on
different subsets of data). You then make your final decision by considering the majority opinion
or averaging their suggestions (ensemble prediction).
➢ The process starts with a dataset of rows (examples) and their corresponding class labels.
➢ Then - Multiple Decision Trees are created from the training data.
➢ Each tree is trained on a random subset of the data (with replacement) and a random subset of features.
This process is known as bagging or bootstrap aggregating.
➢ Each Decision Tree in the ensemble learns to make predictions independently.
➢ When presented with a new, unseen instance, each Decision Tree in the ensemble makes a prediction.
➢ The final prediction is made by combining the predictions of all the Decision Trees. This is typically done
through a majority vote (for classification) or averaging (for regression).
➢ We can evaluate the model's performance using the Mean Squared Error and the R-squared score, which show how accurate
the predictions are; a random sample can then be used to check the model's prediction.
The random Forest algorithm works in several steps:
❑ Random Forest builds multiple decision trees using random samples of the data. Each tree is
trained on a different subset of the data which makes each tree unique.
❑ When creating each tree the algorithm randomly selects a subset of features or variables to
split the data rather than using all available features at a time. This adds diversity to the trees.
❑ Each decision tree in the forest makes a prediction based on the data it was trained on. When
making final prediction random forest combines the results from all the trees.
▪ For classification tasks the final prediction is decided by a majority vote. This means that the
category predicted by most trees is the final prediction.
▪ For regression tasks the final prediction is the average of the predictions from all the trees.
❑ The randomness in the data samples and feature selection helps prevent the model from
overfitting, making the predictions more accurate and reliable (see the sketch after this list).
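A minimal scikit-learn sketch of a Random Forest regressor evaluated with Mean Squared Error and R-squared, as mentioned above; the synthetic dataset and hyperparameter values are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# illustrative regression data (the lecture does not specify a dataset)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

forest = RandomForestRegressor(n_estimators=200,     # number of trees
                               max_features="sqrt",  # random subset of features per split
                               bootstrap=True,       # random sample of rows per tree (bagging)
                               random_state=3)
forest.fit(X_tr, y_tr)

pred = forest.predict(X_te)                          # average of all the trees' predictions
print("MSE :", mean_squared_error(y_te, pred))
print("R^2 :", r2_score(y_te, pred))
print("One random sample ->", forest.predict(X_te[:1])[0], "vs actual", y_te[0])
```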
Assumptions of Random Forest
❑ Each tree makes its own decisions: Every tree in the forest makes its own predictions without relying
on others.
❑ Random parts of the data are used: Each tree is built using random samples and features to reduce
mistakes.
❑ Enough data is needed: Sufficient data ensures the trees are different and learn unique patterns and
variety.
❑ Different predictions improve accuracy: Combining the predictions from different trees leads to a more
accurate final result.

Key Features of Random Forest


❑ Handles Missing Data: Automatically handles missing values during training, eliminating the need
for manual imputation.
❑ Feature Importance: the algorithm ranks features based on their importance in making predictions, offering valuable
insights for feature selection and interpretability.
❑ Scales Well with Large and Complex Data without significant performance degradation.
❑ Algorithm is versatile and can be applied to both classification tasks (e.g., predicting categories) and
regression tasks (e.g., predicting continuous values).
Advantages of Random Forest
➢ Random Forest provides very accurate predictions even with large datasets.
➢ Random Forest can handle missing data well without compromising accuracy.
➢ It doesn’t require normalization or standardization on dataset.
➢ When we combine multiple decision trees it reduces the risk of overfitting of the model.

Limitations of Random Forest


✓ It can be computationally expensive especially with a large number of trees.
✓ It’s harder to interpret the model compared to simpler models like decision trees.
