Unit 4
● Multistage combination
○ Multistage combination methods use a serial approach where
■ The next combination base-learner is trained with or tested on only
the instances where the previous base-learners are not accurate
enough.
■ The idea is that the base-learners (or the different representations
they use) are sorted in increasing complexity so that a complex
base-learner is not used (or its complex representation is not extracted)
unless the preceding simpler base-learners are not confident.
○ An example is cascading; a minimal sketch of this idea is given below.
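The following is a minimal Python sketch of the cascading idea, assuming scikit-learn is available; the two base-learners (logistic regression followed by a random forest), the 0.9 confidence threshold, and the synthetic data are illustrative choices, not part of the original notes.

```python
# Cascading sketch: a simple learner handles confident cases,
# a more complex learner is trained only on the unconfident ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, class_sep=0.5, random_state=0)

THETA = 0.9  # confidence threshold (illustrative choice)

# Stage 1: a simple base-learner is trained on all instances.
simple = LogisticRegression(max_iter=1000).fit(X, y)
conf = simple.predict_proba(X).max(axis=1)

# Stage 2: a more complex base-learner is trained only on the
# instances where stage 1 is not confident enough.
unsure = conf < THETA
complex_model = RandomForestClassifier(random_state=0).fit(X[unsure], y[unsure])

def cascade_predict(x_new):
    """Use the simple learner where it is confident, otherwise fall back."""
    p = simple.predict_proba(x_new)
    confident = p.max(axis=1) >= THETA
    out = simple.predict(x_new)
    if (~confident).any():
        out[~confident] = complex_model.predict(x_new[~confident])
    return out

print(cascade_predict(X[:5]))
```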
------------------------------------------
K-Nearest Neighbor (KNN)
● K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
● K-NN algorithm
○ assumes the similarity between the new case/data and available cases and
○ puts the new case into the category that is most similar to the available categories.
● K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.
● K-NN algorithm can be used for Regression as well as for Classification, but it is mostly
used for Classification problems.
● K-NN is a non-parametric algorithm, which means it does not make any assumptions
about the underlying data.
● It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and performs an action on it at the time of
classification.
● At the training phase, the KNN algorithm just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to it.
● Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will compare the features of
the new image with those of the cat and dog images and, based on the most similar features,
put it in either the cat or the dog category.
● Suppose we have a new data point x1 and we need to determine which of these categories
it belongs to. With the help of K-NN, we can easily identify its category, as illustrated in the
steps below:
○ Firstly, we will choose the number of neighbors, so we will choose the k=5.
○ Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. For two points (x1, y1) and (x2, y2) it is calculated as
d = √((x2 − x1)² + (y2 − y1)²).
○ By calculating the Euclidean distances, we find the five nearest neighbors: three
nearest neighbors in category A and two nearest neighbors in category B.
○ Since the majority of the 5 nearest neighbors (3 out of 5) are from category A, the new
data point is assigned to category A.
○ A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
○ Larger values of K smooth out the effect of noise, but if K is too large the
neighborhood may include points from other classes and the computation becomes
more expensive.
Advantages of KNN Algorithm:
○ It is simple to implement.
○ It is easy to understand and interpret.
Disadvantages of KNN Algorithm:
○ The value of K always needs to be determined, which may be complex at times.
○ The computation cost is high, because the distance to every training sample must be
calculated for each prediction.
Example-
Height (in cms) Weight (in kgs) T Shirt Size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L
Given Height = 161 cm and Weight = 61 kg, predict the T-shirt size (a worked sketch follows below).
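A small worked sketch of this prediction in plain Python, using k = 5 as in the steps above; the helper name knn_predict is purely illustrative.

```python
# K-NN on the T-shirt data above: predict the size for Height=161, Weight=61.
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

data = [
    (158, 58, "M"), (158, 59, "M"), (158, 63, "M"), (160, 59, "M"),
    (160, 60, "M"), (163, 60, "M"), (163, 61, "M"), (160, 64, "L"),
    (163, 64, "L"), (165, 61, "L"), (165, 62, "L"), (165, 65, "L"),
    (168, 62, "L"), (168, 63, "L"), (168, 66, "L"), (170, 63, "L"),
    (170, 64, "L"), (170, 68, "L"),
]

def knn_predict(query, k=5):
    # Sort the training points by Euclidean distance to the query ...
    neighbours = sorted(data, key=lambda row: dist(query, row[:2]))[:k]
    # ... and return the majority class among the k nearest ones.
    return Counter(label for _, _, label in neighbours).most_common(1)[0][0]

print(knn_predict((161, 61), k=5))
```

With k = 5, four of the five nearest neighbors of (161, 61) are labelled M, so the predicted T-shirt size is M.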
Ensemble techniques
● Ensemble techniques are methods that use multiple learning algorithms or models to
produce one optimal predictive model.
● The model produced has better performance than the base learners taken alone.
● Other applications of ensemble learning also include selecting the important features,
data fusion, etc.
● Ensemble techniques can be primarily classified into Bagging, Boosting, and Stacking.
In Fig 1., m represents a weak learner; d1, d2, d3, d4 are the random samples from data D; d’, d”,
d”’ are the updated training sets based on the results of the previous weak learner.
1. Bagging:
● Bagging is mainly applied in supervised learning problems.
● It involves two steps, i.e., bootstrapping and aggregation.
● Bootstrapping is a random sampling method in which samples are drawn from the data
with replacement.
● In Fig 1., the first step in bagging is bootstrapping, where random data samples are fed
to each base learner.
○ The base learning algorithm is run on the samples to complete the procedure.
● In Aggregation, the outputs from the base learners are combined.
● The goal is to increase accuracy while reducing variance to a large extent.
● E.g., Random Forest, where the predictions from decision trees (the base learners) are
made in parallel.
● In the case of regression problems, these predictions are averaged to give the final
prediction, and
● in the case of classification problems, the mode (majority class) is selected as the
predicted class (see the sketch below).
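A minimal sketch of bagging by hand, assuming scikit-learn decision trees as the base learners; the synthetic data, the 25 trees, and the helper name bagging_predict are illustrative.

```python
# Bagging sketch: bootstrap samples -> one decision tree per sample -> majority vote.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrapping: sample indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagging_predict(x_new):
    # Aggregation: each tree votes; the mode (most common class) wins.
    votes = np.array([t.predict(x_new) for t in trees])   # shape (n_trees, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

pred = bagging_predict(X)
print("training accuracy:", (pred == y).mean())
```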
2. Boosting:
● It is an ensemble method in which each predictor learns from the preceding predictor's
mistakes to make better predictions in the future.
● The technique combines several weak base learners that are arranged in a sequential
(Fig 1.) manner such that weak learners learn from the previous weak learner’s errors to
create a better predictive model.
● Hence one strong learner is formed, significantly improving the predictive performance of
the model.
● E.g., XGBoost, AdaBoost (a small AdaBoost sketch follows below).
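A short hedged sketch using scikit-learn's AdaBoostClassifier; the synthetic data and the 50 estimators are illustrative choices, not prescribed by the notes.

```python
# Boosting sketch: AdaBoost builds weak learners sequentially, each one
# focusing on the training instances the previous learners got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))
```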
Bagging vs. Boosting
● Bagging: Various training data subsets are randomly drawn with replacement from the whole
training dataset. Boosting: Each new subset contains the instances that were misclassified by
the previous models.
● Bagging: It attempts to tackle the over-fitting issue. Boosting: It tries to reduce bias.
● Bagging: Every model receives an equal weight. Boosting: Models are weighted according to
their performance.
● Bagging: The objective is to decrease variance, not bias. Boosting: The objective is to
decrease bias, not variance.
● Bagging: It is the easiest way of combining predictions that belong to the same type.
Boosting: It is a way of combining predictions that belong to different types.
● Bagging: Every model is constructed independently. Boosting: New models are influenced by
the performance of the previously developed model.
3. Stacking:
The number of weak learners in the stack is variable.
● While bagging and boosting use homogeneous weak learners for the ensemble,
● stacking often considers heterogeneous weak learners, learns them in parallel, and
combines them by training a meta-learner to output a prediction based on the different
weak learners' predictions.
● The meta-learner takes the base learners' predictions as input features, with the ground-truth
values in data D (Fig 2.) as the target; it attempts to learn how to best combine the input
predictions to make a better output prediction.
In an averaging ensemble, e.g. Random Forest, the model combines the predictions from multiple
trained models. A limitation of this approach is that each model contributes the same amount to
the ensemble prediction, irrespective of how well it performed. An alternative approach is a
weighted average ensemble, which weighs the contribution of each ensemble member by the
trust placed in its predictions. The weighted average ensemble provides an improvement over
the model averaging ensemble.
A further generalization of this approach is to replace the linear weighted sum with Linear
Regression (for regression problems) or Logistic Regression (for classification problems), or
more generally with any learning algorithm, to combine the predictions of the sub-models.
This approach is called Stacking.
In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to
best combine the input predictions to make a better output prediction, as sketched below.
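A minimal sketch of this idea with scikit-learn's StackingClassifier, assuming three heterogeneous base learners and a logistic-regression meta-learner on synthetic data; all of these choices are illustrative.

```python
# Stacking sketch: heterogeneous base learners combined by a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner trained on the base predictions
)
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))
```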
Voting
Hard voting
● Hard voting is also known as majority voting.
● The base classifiers are each fed with the training data individually.
● The models predict the output class independently of each other.
● The final output class is the class predicted by the majority of the models.
In the accompanying figure, Pf is the class predicted by the majority of the classifiers Cm.
Soft voting
● In Soft voting, classifiers or base models are fed with the training data to predict the
classes out of m possible classes.
● Each base model classifier independently assigns a probability of occurrence to
each class.
● In the end, the average of the probabilities of each class is calculated, and the
final output is the class having the highest average probability (see the sketch below).
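The sketch below illustrates both schemes with scikit-learn's VotingClassifier; the three base classifiers and the synthetic data are illustrative choices.

```python
# Hard vs. soft voting sketch with three heterogeneous classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("rf", RandomForestClassifier(random_state=0)),
]

hard = VotingClassifier(estimators, voting="hard").fit(X_train, y_train)  # majority of predicted classes
soft = VotingClassifier(estimators, voting="soft").fit(X_train, y_train)  # average of predicted probabilities
print("hard voting accuracy:", hard.score(X_test, y_test))
print("soft voting accuracy:", soft.score(X_test, y_test))
```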
Gaussian mixture models
● Suppose there are set of data points that need to be grouped into several parts or
clusters based on their similarity. In machine learning, this is known as Clustering.
● There are several methods available for clustering:
○ K Means Clustering
○ Hierarchical Clustering
○ Gaussian Mixture Models
● Normal or Gaussian Distribution
○ In real life, many datasets can be modeled by Gaussian Distribution (Univariate
or Multivariate).
○ So it is quite natural and intuitive to assume that the clusters come from different
Gaussian Distributions.
○ Hence, the dataset is modeled as a mixture of several Gaussian Distributions.
Gaussian Mixture Model
Suppose there are K clusters (for the sake of simplicity, the number of clusters is assumed
to be known and equal to K). So the mean μk and the covariance Σk also have to be
estimated for each cluster k. Had it been only one distribution, they would have been
estimated by the maximum-likelihood method. But since there are K such clusters, the
probability density is defined as a linear combination of the densities of all these K
distributions, i.e.
p(X) = π1 G(X | μ1, Σ1) + π2 G(X | μ2, Σ2) + ... + πK G(X | μK, ΣK) = Σk πk G(X | μk, Σk),
where G(X | μk, Σk) is the k-th Gaussian density and πk is its mixing coefficient
(πk ≥ 0 and Σk πk = 1). A sketch of fitting such a model is given below.
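As a hedged illustration, the following sketch fits such a mixture with scikit-learn's GaussianMixture (which is trained with the EM algorithm described in the next section); the three synthetic 2-D blobs and K = 3 are assumptions made only for the example.

```python
# Gaussian Mixture Model sketch: fit K Gaussians to unlabeled 2-D data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from three Gaussian "blobs" (illustrative).
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[4, 4], scale=0.7, size=(100, 2)),
    rng.normal(loc=[0, 5], scale=0.6, size=(100, 2)),
])

K = 3
gmm = GaussianMixture(n_components=K, random_state=0).fit(X)  # fitted with the EM algorithm
print("mixing coefficients:", gmm.weights_)
print("cluster means:\n", gmm.means_)
labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # soft (posterior) responsibilities
```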
Expectation maximization
The Expectation-Maximization (EM) algorithm is an iterative way to find
maximum-likelihood estimates for model parameters when the data is incomplete
or has some missing data points or has some hidden variables. EM chooses
some random values for the missing data points and uses them to estimate a
first set of parameters. These estimates are then used recursively to fill in better
values for the missing points, and the process repeats until the values converge.
These are the two basic steps of the EM algorithm, namely the E Step (Expectation or
Estimation Step) and the M Step (Maximization Step):
1. Given a set of incomplete data, start with an initial set of parameter values.
2. Expectation step (E-step): Using the observed available data of the dataset,
estimate (guess) the values of the missing data.
3. Maximization step (M-step): The complete data generated after the E-step is used
to update the parameter values.
4. Repeat steps 2 and 3 until the values converge.
Advantages of EM algorithm –
● The E-step and M-step are often pretty easy for many problems in
terms of implementation.
Disadvantages of EM algorithm –
● It has slow convergence.
● It converges to a local optimum only.
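Below is a minimal numpy-only sketch of the E-step / M-step loop for a one-dimensional mixture of two Gaussians; the synthetic data, the starting values, and the fixed 100 iterations are illustrative assumptions rather than part of the notes.

```python
# A compact E-step / M-step loop for a 1-D mixture of two Gaussians (numpy only).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 150)])  # hidden labels unknown

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Step 1 (initialization): rough starting guesses for the parameters.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # Step 2 (E-step): responsibilities = posterior probability of each component.
    dens = pi * np.column_stack([gauss(x, mu[k], sigma[k]) for k in (0, 1)])
    resp = dens / dens.sum(axis=1, keepdims=True)

    # Step 3 (M-step): re-estimate the parameters from the responsibilities.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", pi, "means:", mu, "std devs:", sigma)
```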
Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a
linear combination of the learners:
yi = Σj wj dji, where wj ≥ 0 and Σj wj = 1.
This is also known as ensembles and linear opinion pools. In the simplest case, all learners are
given equal weight and we have simple voting that corresponds to taking an average,
yi = (1/L) Σj dji. Still, taking a (weighted) sum is only one of the possibilities and there are also
other combination rules, as shown in table 17.1 (Kittler et al. 1998). If the outputs are not
posterior probabilities, these rules require that the outputs be normalized to the same scale.
In the weighted sum, dji is the vote of learner j for class Ci and wj is the weight of its vote.
Simple voting is a special case where all voters have equal weight, namely, wj = 1/L. In
classification, this is called plurality voting, where the class having the maximum number of
votes is the winner. When there are two classes, this is majority voting, where the winning
class gets more than half of the votes. A small sketch of the weighted-sum combination is
given below.
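As an illustration of the weighted-sum rule yi = Σj wj dji, the numpy sketch below combines the posterior outputs of three learners for a single test instance; the weights and posterior values are made up purely for the example.

```python
# Weighted-sum combination of classifier outputs: y_i = sum_j w_j * d_ji.
import numpy as np

# d[j, i] = posterior assigned by learner j to class C_i for one test instance (illustrative values).
d = np.array([
    [0.7, 0.2, 0.1],   # learner 1
    [0.4, 0.5, 0.1],   # learner 2
    [0.6, 0.3, 0.1],   # learner 3
])

w = np.array([0.5, 0.3, 0.2])   # weights: w_j >= 0 and they sum to 1
y = w @ d                       # weighted sum per class
print("combined scores:", y, "-> predicted class:", y.argmax())

w_simple = np.full(3, 1 / 3)    # simple voting: equal weights w_j = 1/L
print("simple voting:", w_simple @ d)
```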