UNIT-III
Learning with Trees: Decision Trees, Constructing Decision Trees, Classification and
Regression Trees.
Ensemble Learning: Boosting, Bagging, Different ways to combine classifiers, Basic Statistics,
Gaussian Mixture Models, Nearest Neighbour Methods.
Unsupervised Learning: K Means Algorithm.
Decision Trees:
Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
In a Decision Tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make decisions and have multiple branches, whereas
leaf nodes are the outputs of those decisions and do not contain any further branches.
The decisions or the tests are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
Example:
One of the reasons that decision trees are popular is that we can turn them into a set of
logical disjunctions (if ... then rules) that then go into program code very simply.
Ex: if there is a party then go to it
if there is not a party and you have an urgent deadline then study
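As a rough illustration, the two rules above can be written directly as Python code. This is only a sketch; the argument names and the "watch TV" fallback are assumptions for illustration, not part of the example.

def evening_activity(party, urgent_deadline):
    # Apply the two rules from the example above, with an assumed default.
    if party:                    # "if there is a party then go to it"
        return "go to party"
    if urgent_deadline:          # "if there is not a party and you have an urgent deadline then study"
        return "study"
    return "watch TV"            # assumed fallback when neither rule applies

print(evening_activity(party=True, urgent_deadline=False))   # go to party
print(evening_activity(party=False, urgent_deadline=True))   # study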
Constructing Decision Trees:
Types of Decision Tree Algorithms:
ID3: This algorithm measures how mixed up the data is at a node using something
called entropy. It then chooses the feature that helps to clarify the data the most.
C4.5: This is an improved version of ID3 that can handle missing data and continuous
attributes.
CART: This algorithm uses a different measure called Gini impurity to decide how to
split the data. It can be used for both classification (sorting data into categories) and
regression (predicting continuous values) tasks.
ID3 Algorithm:
ID3 builds the tree by choosing, at each node, the feature that most reduces the entropy of the data. For a set S, the entropy is
Entropy(S) = − Σᵢ p(i) log₂ p(i)
where the sum runs over the classes and p(i) is the proportion of examples in class i. The logarithm is base 2 because we are imagining that we encode everything using binary digits (bits), and we define 0 log 0 = 0.
If all of the examples are positive, then we don’t get any extra information from
knowing the value of the feature for any particular example, since whatever the value
of the feature, the example will be positive. Thus, the entropy of that feature is 0.
However, if the feature separates the examples into 50% positive and 50% negative,
then the amount of entropy is at a maximum, and knowing about that feature is very
useful to us.
For our decision tree, the best feature to pick as the one to classify on now is the one
that gives us the most information, i.e., the one whose split produces the greatest
reduction in entropy (the highest information gain).
Information Gain:
It is defined as the entropy of the whole set minus the weighted entropy when a particular feature is chosen:
Gain(S, F) = Entropy(S) − Σ_f (|S_f| / |S|) Entropy(S_f)
where the sum is over the values f of feature F and S_f is the subset of S in which F takes the value f.
The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value.
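The following Python sketch shows how entropy and information gain can be computed; the toy (party, deadline) data is made up for illustration and is not the dataset used in the worked example later in this unit.

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum p(i) log2 p(i), with 0 log 0 taken as 0.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, feature_index):
    # Entropy of the whole set minus the weighted entropy after splitting on one feature.
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[feature_index], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in subsets.values())
    return entropy(labels) - remainder

# Hypothetical toy data: each row is (party, deadline); the label is the chosen activity.
X = [("yes", "urgent"), ("yes", "none"), ("no", "urgent"), ("no", "none")]
y = ["party", "party", "study", "tv"]
print(information_gain(X, y, 0))   # gain from splitting on the party feature: 1.5 - 0.5 = 1.0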
C4.5 Algorithm:
It is an improved version of ID3.
Pruning is another method that can help us avoid overfitting.
It helps in improving the performance of the Decision tree by cutting the nodes or sub-
nodes which are not significant.
Additionally, it removes the branches which have very low importance.
There are mainly 2 ways for pruning:
Pre-pruning – we can stop growing the tree earlier, which means we can
prune/remove/cut a node if it has low importance while growing the tree.
Post-pruning – once our tree is built to its depth, we can start pruning the nodes based
on their significance.
C4.5 uses a different method called rule post-pruning.
This consists of taking the tree generated by ID3, converting it to a set of if-then rules,
and then pruning each rule by removing preconditions if the accuracy of the rule
increases without it.
The rules are then sorted according to their accuracy on the training set and applied in
order.
The advantages of dealing with rules are that they are easier to read and their order in
the tree does not matter, just their accuracy in the classification.
For Continuous Variables, the simplest solution is to discretise the continuous variable.
The computational complexity of building a decision tree is O(d n log n), where n is the
number of data points and d is the number of dimensions (features).
Classification Example: construct the decision tree to decide what to do in the evening
Compute Entropy of S:
Therefore, the root node will be the party feature, which has two feature values (‘yes’
and ‘no’), so it will have two branches coming out of it.
When we look at the ‘yes’ branch, we see that in all five cases where there was a party
we went to it, so we just put a leaf node there, saying ‘party’.
For the ‘no’ branch, out of the five cases there are three different outcomes, so now we
need to choose another feature.
The five cases we are looking at are:
We’ve used the party feature, so we just need to calculate the information gain of the
other two over these five examples:
Here, the Deadline feature has the maximum information gain. Hence, we select the
Deadline feature for splitting the data.
Gini Impurity:
For a node whose records have class proportions p(i), the Gini impurity is
Gini = 1 − Σᵢ p(i)²
A node with a uniform class distribution has the highest impurity.
The minimum impurity (zero) is obtained when all records belong to the same class.
The attribute whose split gives the smallest Gini impurity is selected for splitting the node.
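A small Python sketch of the Gini impurity formula above; the label lists are made-up examples.

from collections import Counter

def gini_impurity(labels):
    # Gini = 1 - sum over classes of p(i)^2
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(gini_impurity(["A", "A", "A", "A"]))   # 0.0: all records in one class (minimum impurity)
print(gini_impurity(["A", "A", "B", "B"]))   # 0.5: uniform class distribution (maximum for 2 classes)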
Regression in Trees:
A Regression tree is an algorithm where the target variable is continuous and the tree
is used to predict its value.
Regression Tree works by splitting the training data recursively into smaller subsets
based on specific criteria.
The objective is to split the data in a way that maximises the residual reduction, i.e.,
minimises the residual error (Sum of Squared Errors) within each subset.
Residual Reduction- Residual reduction is a measure of how much the sum of squared
differences between the predicted values and the actual values of the target variable is
reduced by splitting the subset. The greater the residual reduction, the better the split
fits the data.
Splitting Criteria- CART evaluates every possible split at each node and selects the
one that results in the greatest reduction of residual error in the resulting subsets. This
process is repeated until a stopping criterion is met, such as reaching the maximum tree
depth or having too few instances in a leaf node.
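The split selection described above can be sketched in Python as follows; the single-feature toy data is invented for illustration, and a full CART implementation would evaluate every feature, not just one.

def sse(values):
    # Sum of squared errors of the values around their mean.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    # Try every threshold on one feature and keep the split with the
    # greatest reduction in SSE (residual reduction).
    best_threshold, best_reduction = None, -1.0
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        if not left or not right:
            continue
        reduction = sse(y) - (sse(left) + sse(right))
        if reduction > best_reduction:
            best_threshold, best_reduction = threshold, reduction
    return best_threshold, best_reduction

x = [1, 2, 3, 10, 11, 12]            # toy single feature
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]   # toy continuous target
print(best_split(x, y))              # chooses the threshold between 3 and 10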
Ensemble Learning:
Ensemble learning refers to the approach of combining multiple ML models to produce
a more accurate and robust prediction compared to any individual model.
The conventional ensemble methods include bagging, boosting, and stacking.
Boosting:
Boosting is an ensemble technique that combines multiple weak learners to create a
strong learner.
The weak models are trained in series, so that each successive model tries to correct
the errors of the previous model, until the entire training dataset is predicted correctly.
One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting).
AdaBoost:
AdaBoost, short for Adaptive Boosting, is an ensemble learning technique used in machine
learning for classification and regression problems.
The main idea behind AdaBoost is to iteratively train weak classifiers on the training
dataset, with each successive classifier giving more weight to the data points that were
misclassified by the previous ones.
The final AdaBoost model is formed by combining all the weak classifiers used during
training, with each classifier weighted according to its accuracy.
The classifier with the highest accuracy is given the highest weight, while the classifier
with the lowest accuracy is given a lower weight.
Steps in AdaBoost:
1. Weight Initialization
At the start, every training instance is assigned an identical weight. These weights determine
the importance of each example.
2. Model Training
A weak learner is trained on the dataset, with the aim of minimizing the classification error.
3. Weighted Error Calculation
The weighted error is calculated by summing up the weights of the misclassified instances.
This step emphasizes the importance of the samples that are hard to classify.
4. Classifier Weight Calculation
The weight of the weak learner is calculated based on its performance in classifying the
training data. Models that perform well are assigned higher weights, indicating that they are
more reliable.
5. Weight Update
The example weights are updated to give more weight to the samples misclassified in the
previous step.
6. Repeat
Steps 2 through 5 are repeated for a predefined number of iterations or until a specified
performance threshold is met.
7. Final Model
The final strong model (also referred to as the ensemble) is created by combining the
weighted outputs of all the weak learners.
8. Classification
To make predictions on new records, AdaBoost uses this final ensemble model.
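A minimal AdaBoost sketch using scikit-learn (assumed to be available); the synthetic dataset and parameter values are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default each weak learner is a shallow decision tree (a decision stump).
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))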
Bagging:
Bagging (Bootstrap Aggregating) is a supervised ensemble technique that can be used
for both regression and classification tasks.
Several models are trained in parallel, each on a bootstrap sample (a random sample of
the training data drawn with replacement), and their predictions are combined by voting
(for classification) or averaging (for regression).
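A minimal bagging sketch using scikit-learn (assumed to be available); the synthetic dataset and the choice of decision trees as base models are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample; predictions are combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))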
Stacking:
Stacking combines many ensemble methods in order to build a meta-learner.
Stacking has two levels of learning: 1) base learning and 2) meta-learning.
In the first level, the base learners are trained on the training data set.
Once trained, the base learners create a new data set for a meta-learner.
The meta-learner is then trained with that new training data set.
Finally, the trained meta-learner is used to classify new instances.
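A minimal stacking sketch using scikit-learn (assumed to be available); the choice of base learners and meta-learner is illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Base learners generate predictions that become the training set for the meta-learner.
base_learners = [("tree", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())]
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
stack.fit(X, y)
print("Training accuracy:", stack.score(X, y))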
Basic Statistics:
Mode:
The mode is the most common value in a dataset.
It is found by identifying the value that occurs most frequently in the dataset.
If multiple values occur with the same highest frequency, the dataset is said to be
bimodal, trimodal, or multimodal.
The mode is a useful measure of central tendency because it can identify the most
common value in a dataset.
However, it is not a good measure of central tendency for datasets with a wide range of
values or datasets with no repeating values.
Variance:
Variance is a measure of how much the data for a variable varies from its mean.
Covariance:
Covariance is a measure of the relationship between two variables; it is scale dependent,
i.e., it describes how much one variable changes when another variable changes.
Standard Deviation:
The square root of the variance is known as the standard deviation.
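The statistics above can be computed with NumPy and the standard library as in the sketch below; the data values are made up for illustration.

import numpy as np
from statistics import mode

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 4.0, 6.0, 8.0])

print("Mode:", mode(x))                          # most frequent value: 4.0
print("Variance:", x.var())                      # average squared deviation from the mean
print("Standard deviation:", x.std())            # square root of the variance
print("Covariance of x and y:", np.cov(x, y, bias=True)[0, 1])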
Mahalanobis Distance:
Mahalanobis Distance is a statistical tool used to measure the distance between a point
and a distribution.
It is a powerful technique that considers the correlations between variables in a dataset,
making it a valuable tool in various applications such as outlier detection, clustering,
and classification.
D² = (x-μ)ᵀΣ⁻¹(x-μ)
Where D² is the squared Mahalanobis Distance, x is the point in question, μ is the mean
vector of the distribution, Σ is the covariance matrix of the distribution, and ᵀ denotes
the transpose of a matrix.
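A small NumPy sketch of the Mahalanobis distance formula above; the sample data and query point are invented for illustration.

import numpy as np

data = np.array([[2.0, 2.0], [3.0, 4.0], [4.0, 5.0], [5.0, 7.0], [6.0, 8.0]])
x = np.array([4.0, 9.0])                  # the point in question

mu = data.mean(axis=0)                    # mean vector of the distribution
sigma = np.cov(data, rowvar=False)        # covariance matrix of the distribution
d_squared = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)

print("Squared Mahalanobis distance:", d_squared)
print("Mahalanobis distance:", np.sqrt(d_squared))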
The Gaussian / Normal Distribution:
The normal distribution, also known as the Gaussian distribution, is a continuous
probability distribution that is symmetric about the mean, meaning that data near the
mean are more frequent in occurrence than data far from the mean.
Its probability density function is
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
where μ is the mean and σ is the standard deviation.
The Bias-Variance Tradeoff:
A model with high variance pays a lot of attention to the training data and does not
generalize to data it has not seen before.
As a result, such models perform very well on the training data but have high error
rates on test data.
If our model is too simple and has very few parameters, it may have high bias and low
variance.
On the other hand, if our model has a large number of parameters, it is likely to have
high variance and low bias.
So we need to find the right balance, without overfitting or underfitting the data.
Gaussian Mixture Models (GMM):
A Gaussian Mixture Model represents the data as a combination of several Gaussian
(normal) components and performs soft clustering.
In soft clustering, instead of forcefully assigning a data point to a single cluster, GMM
assigns probabilities that indicate the likelihood of that data point belonging to each of
the Gaussian components.
GMM Parameters:
Each Gaussian component k is described by a mean vector μₖ, a covariance matrix Σₖ,
and a mixing coefficient (weight) πₖ, where the mixing coefficients sum to 1.
Model Training:
Expectation-Maximization:
During the E step, the model calculates the probability of each data point belonging to
each Gaussian component.
The M step then adjusts the model’s parameters based on these probabilities.
Post-training, GMMs cluster data points based on the highest posterior probability.
They are also used for density estimation, assessing the probability density at any
point in the feature space.
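A minimal sketch of fitting and using a GMM with scikit-learn (assumed to be available); the synthetic blobs stand in for real data.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0)   # fitted with the EM algorithm
gmm.fit(X)

print("Hard cluster labels:", gmm.predict(X[:5]))        # component with highest posterior probability
print("Soft memberships:\n", gmm.predict_proba(X[:5]))   # probability of each component per point
print("Log-density at a point:", gmm.score_samples(np.array([[0.0, 0.0]])))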
Nearest Neighbour Methods - KD-Tree Example:
As a simple example to showcase insertion into a K-Dimensional Tree (k-d tree), we will use k = 2.
The points we will be adding are: (7,8), (12,3), (14,1), (4,12), (9,1), (2,7), and (10,19).
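A minimal Python sketch of k-d tree insertion for these seven points; the Node class and insert function are illustrative, with the splitting axis alternating between the two coordinates by depth.

class Node:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None

def insert(root, point, depth=0, k=2):
    # Compare on axis (depth mod k): x at even depths, y at odd depths.
    if root is None:
        return Node(point)
    axis = depth % k
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1, k)
    else:
        root.right = insert(root.right, point, depth + 1, k)
    return root

points = [(7, 8), (12, 3), (14, 1), (4, 12), (9, 1), (2, 7), (10, 19)]
root = None
for p in points:
    root = insert(root, p)
print("Root:", root.point, "| left child:", root.left.point, "| right child:", root.right.point)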
Unsupervised Learning:
Unsupervised learning is a type of machine learning that learns from unlabeled data.
This means that the data does not have any pre-existing labels or categories.
The goal of unsupervised learning is to discover patterns and relationships in the data
without any explicit guidance.
Types of Unsupervised Learning:
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings
in the data, such as grouping customers by purchasing behavior. Clustering is a type of
unsupervised learning that is used to group similar data points together.
Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
K Means Algorithm:
K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters to be
created.
Advantages of K-Means:
Simple and Easy to implement: The K-means algorithm is easy to understand and
implement.
Fast and Efficient: K-means is computationally efficient and can handle large datasets
with high dimensionality.
Scalability: K-means can handle large datasets with many data points and can be easily
scaled to handle even larger datasets.
Flexibility: K-means can be easily adapted to different applications and can be used
with varying metrics of distance and initialization methods.
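A minimal K-Means sketch using scikit-learn (assumed to be available); the synthetic blobs and the choice of K = 4 are illustrative only.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# K (n_clusters) is the number of pre-defined clusters the algorithm should find.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster centres:\n", kmeans.cluster_centers_)
print("First ten cluster assignments:", labels[:10])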
*****