
UNIT IV​

ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING


Combining multiple learners: Model combination schemes, Voting, Ensemble Learning -
bagging, boosting, stacking,
Unsupervised learning: K-means, Instance Based Learning: KNN, Gaussian mixture
models and Expectation maximization

Combining multiple learners:

Model combination schemes,


There are different ways in which the multiple base-learners can be combined to generate the final
output:
●​ Multiexpert Combination
○​ Multiexpert combination methods have base-learners that work in parallel.
○​ These methods can in turn be divided into two:
■​ In the global approach, also called learner fusion,
●​ given an input,
●​ all base-learners generate an output and
●​ all these outputs are used.
●​ Examples are voting and stacking.
■​ In the local approach, or learner selection,
●​ for example, in mixture of experts, there is a gating model, which
looks at the input and chooses one (or very few) of the learners
as responsible for generating the output

●​ Multistage combination
○​ Multistage combination methods use a serial approach where
■ The next base-learner in the sequence is trained with, or tested on, only the instances where the previous base-learners are not accurate enough.
■​ The idea is that the base-learners (or the different representations
they use) are sorted in increasing complexity so that a complex
base-learner is not used (or its complex representation is not extracted)
unless the preceding simpler base-learners are not confident.
○​ An example is cascading
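To make the distinction concrete, here is a minimal sketch (not from the source text; the synthetic dataset, the choice of base learners, and the toy gating rule are illustrative assumptions) contrasting learner fusion, where every base-learner contributes, with learner selection, where a gate picks one learner per input:

```python
# Illustrative sketch: global fusion vs. local selection of base-learners.
# Assumes scikit-learn is available; the dataset and base learners are arbitrary choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
learners = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(), GaussianNB()]
for m in learners:
    m.fit(X, y)

x_new = X[:1]

# Global approach (learner fusion): every base-learner contributes, here by
# averaging predicted probabilities (essentially soft voting).
fused = np.mean([m.predict_proba(x_new) for m in learners], axis=0)
print("fusion prediction:", fused.argmax(axis=1))

# Local approach (learner selection): a gate picks one learner per input.
# Here the "gate" is a toy rule on the first feature; in a mixture of experts
# the gate would itself be a trained model.
gate_index = 0 if x_new[0, 0] < 0 else 1
print("selection prediction:", learners[gate_index].predict(x_new))
```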
—------------------------------------------

K-Nearest Neighbor(KNN)
●​ K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
●​ K-NN algorithm
○ measures the similarity between the new case/data and the available cases and
○ puts the new case into the category that is most similar to the available categories.
● The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.
●​ K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
● K-NN is a non-parametric algorithm, which means it does not make any assumption about the
underlying data.
● It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and performs its computation only at
classification time.
● At the training phase the KNN algorithm just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to the new data.
● Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, since it works on a similarity measure. Our KNN model will find the features of
the new image that are most similar to the cat and dog images and, based on the most similar
features, will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, Category A and Category B, and

● we have a new data point x1. Which of these categories does this data point belong to?

To solve this type of problem, we need a K-NN algorithm.

With the help of K-NN, we can easily identify the category or class of a particular
data point. Consider the below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

- Step-1: Select the number K of neighbors.
- Step-2: Calculate the Euclidean distance from the new data point to the training data points.
- Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
- Step-4: Among these K neighbors, count the number of data points in each category.
- Step-5: Assign the new data point to the category for which the number of neighbors
is maximum.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

○ Firstly, we will choose the number of neighbors; here we choose K = 5.

○ Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. For two points (x1, y1) and (x2, y2) it is
d = √((x2 − x1)² + (y2 − y1)²).

○ By calculating the Euclidean distances we get the nearest neighbors: three
nearest neighbors in Category A and two nearest neighbors in Category B.
Consider the below image:

○ Since three of the five nearest neighbors are from Category A, this new data
point must belong to Category A.

Suppose we have height, weight and T-shirt size of some customers


and we need to predict the T-shirt size of a new customer given only
height and weight information we have. Data including height, weight
and T-shirt size information is shown below -
Height (in cms) Weight (in kgs) T Shirt Size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L

How to select the value of K in the K-NN Algorithm?

○ There is no single value of K that suits every dataset; K = 5 is a commonly used default,
and in practice K is usually tuned experimentally (for example by cross-validation, as
sketched below).

○ A very low value of K, such as K = 1 or K = 2, can be noisy and make the model sensitive
to outliers.

○ Large values of K give smoother predictions, but too large a value may include points from
other categories and blur the class boundaries.
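A minimal sketch of choosing K by cross-validation, assuming scikit-learn; the Iris dataset and the candidate range of K values are illustrative choices only:

```python
# Illustrative sketch: choosing K by 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 16):                      # candidate values of K
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)        # K with the highest mean accuracy
print("best K:", best_k, "accuracy:", round(scores[best_k], 3))
```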
Advantages of KNN Algorithm:

○​ It is simple to implement.

○​ It is robust to the noisy training data

○​ It can be more effective if the training data is large.

○​ Easy to understand

○​ No assumptions about data

○​ Can be applied to both classification and regression

○​ Works easily on multi-class problems

Disadvantages of KNN Algorithm:

○ Always needs to determine the value of K, which may sometimes be complex.

○​ The computation cost is high because of calculating the distance between the
data points for all the training samples.

○​ Memory Intensive / Computationally expensive

○​ Sensitive to scale of data

○ Does not work well with rare-event (skewed) target variables.

○ Struggles when there is a high number of independent variables (high-dimensional data).

Example:
Using the customer data in the table above, predict the T-shirt size of a new customer with height = 161 cm and weight = 61 kg.
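A short sketch of this prediction using scikit-learn's KNeighborsClassifier on the table above (the raw, unscaled features are used purely for illustration). With K = 5, four of the five nearest neighbours have size M, so the predicted size is M:

```python
# Sketch: predicting the T-shirt size for height = 161 cm, weight = 61 kg (K = 5).
# Features are used unscaled here for illustration; in practice KNN usually
# benefits from feature scaling.
from sklearn.neighbors import KNeighborsClassifier

X = [[158, 58], [158, 59], [158, 63], [160, 59], [160, 60], [163, 60], [163, 61],
     [160, 64], [163, 64], [165, 61], [165, 62], [165, 65], [168, 62], [168, 63],
     [168, 66], [170, 63], [170, 64], [170, 68]]
y = ['M', 'M', 'M', 'M', 'M', 'M', 'M',
     'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L']

knn = KNeighborsClassifier(n_neighbors=5)   # Euclidean distance by default
knn.fit(X, y)
print(knn.predict([[161, 61]]))             # -> ['M'] (4 of the 5 nearest are size M)
```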

Ensemble techniques
●​ methods that use multiple learning algorithms or models to produce one optimal
predictive model.
●​ The model produced has better performance than the base learners taken alone.
●​ Other applications of ensemble learning also include selecting the important features,
data fusion, etc.
●​ Ensemble techniques can be primarily classified into Bagging, Boosting, and Stacking.
Here, m represents a weak learner; d1, d2, d3, d4 are the random samples from Data D; d’, d”,
d”’ are updated training data based on the results from the previous weak learner.

1. Bagging:
●​ Bagging is mainly applied in supervised learning problems.
●​ It involves two steps, i.e., bootstrapping and aggregation.
● Bootstrapping is a random sampling method in which samples are drawn from the data
with replacement.
●​ In Fig 1., the first step in bagging is bootstrapping, where random data samples are fed
to each base learner.
○​ The base learning algorithm is run on the samples to complete the procedure.
●​ In Aggregation, the outputs from the base learners are combined.
● The goal is to increase accuracy while reducing variance to a large extent.
● E.g., Random Forest, where the predictions from decision trees (the base learners) are
obtained in parallel.
●​ In the case of regression problems, these predictions are averaged to give the final
prediction and
●​ in the case of classification problems, the mode is selected as the predicted class.
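A minimal bagging sketch, assuming scikit-learn; the synthetic dataset and hyper-parameters are arbitrary illustrative choices:

```python
# Illustrative bagging sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: bootstrap samples of the training set, one base learner per sample
# (the default base estimator is a decision tree), predictions aggregated by
# majority vote (or averaging for regression).
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(X_train, y_train)
print("bagging accuracy:", bag.score(X_test, y_test))

# Random Forest is bagging of decision trees plus random feature selection at each split.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)
print("random forest accuracy:", rf.score(X_test, y_test))
```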

2. Boosting:
● It is an ensemble method in which each predictor learns from the preceding predictors'
mistakes to make better predictions in the future.
●​ The technique combines several weak base learners that are arranged in a sequential
(Fig 1.) manner such that weak learners learn from the previous weak learner’s errors to
create a better predictive model.
● Hence one strong learner is formed, significantly improving the predictive performance of
the model.
● E.g., XGBoost, AdaBoost.
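A minimal boosting sketch, assuming scikit-learn; AdaBoost re-weights misclassified instances, while gradient boosting (of which XGBoost is a popular optimized implementation) fits each new learner to the current ensemble's errors. The dataset and hyper-parameters are arbitrary:

```python
# Illustrative boosting sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost: each new weak learner focuses on the training instances the previous
# learners got wrong, and learners are combined with performance-based weights.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))

# Gradient boosting fits each new learner to the residual errors of the current ensemble.
gb = GradientBoostingClassifier(random_state=0)
gb.fit(X_train, y_train)
print("gradient boosting accuracy:", gb.score(X_test, y_test))
```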
Bagging | Boosting
Various training data subsets are randomly drawn with replacement from the whole training dataset. | Each new subset contains the instances that were misclassified by previous models.
Bagging attempts to tackle the over-fitting issue. | Boosting tries to reduce bias.
If the classifier is unstable (high variance), then we need to apply bagging. | If the classifier is steady and straightforward (high bias), then we need to apply boosting.
Every model receives an equal weight. | Models are weighted by their performance.
The objective is to decrease variance, not bias. | The objective is to decrease bias, not variance.
It is the easiest way of combining predictions that belong to the same type. | It is a way of combining predictions that belong to different types.
Every model is constructed independently. | New models are affected by the performance of the previously developed model.

3. Stacking:
The number of weak learners in the stack is variable.

● While bagging and boosting use homogeneous weak learners for the ensemble,
● Stacking often uses heterogeneous weak learners, trains them in parallel, and combines
them by training a meta-learner to output a prediction based on the different weak
learners' predictions.
● The meta-learner takes the base learners' predictions as input features, with the target
being the ground-truth values in data D (Fig 2.); it attempts to learn how to best combine
the input predictions to make a better output prediction.

In an averaging ensemble, e.g. Random Forest, the model combines the predictions from multiple
trained models. A limitation of this approach is that each model contributes the same amount to
the ensemble prediction, irrespective of how well the model performed. An alternative approach is
a weighted average ensemble, which weights the contribution of each ensemble member by how much
its predictions are trusted (typically estimated from its performance). The weighted average
ensemble provides an improvement over the model averaging ensemble.

A further generalization of this approach is to replace the linear weighted sum with Linear
Regression (for regression problems) or Logistic Regression (for classification problems) to
combine the predictions of the sub-models; in fact, any learning algorithm can be used.
This approach is called Stacking.

In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to
best combine the input predictions to make a better output prediction.

Stacking for Machine Learning


Fig 3. The stacked model with meta learner = Logistic Regression and the weak
learners = Decision Tree, Random Forest, K Neighbors Classifier and XGBoost
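A minimal stacking sketch loosely following Fig 3, assuming scikit-learn (XGBoost is left out here so the example needs only scikit-learn; the dataset and hyper-parameters are arbitrary):

```python
# Illustrative stacking sketch: heterogeneous base learners + logistic regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_learners = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The meta-learner is trained on the base learners' cross-validated predictions
# and learns how to best combine them.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000), cv=5)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```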

Voting

Hard voting
●​ Hard voting is also known as majority voting.
● The base classifiers are each fed with the training data individually.
● The models predict the output class independently of each other.
● The output class is the class predicted by the majority of the models.

In the above figure, Pf is the class predicted by the majority of the classifiers Cm.

Soft voting
● In soft voting, the classifiers or base models are fed with the training data and predict
probabilities for each of the m possible classes.
● Each base classifier independently assigns a probability of occurrence to each class.
● In the end, the average of the probabilities of each class is calculated, and the
final output is the class having the highest average probability.
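A minimal sketch of hard and soft voting, assuming scikit-learn; the dataset and the three base models are arbitrary illustrative choices:

```python
# Illustrative sketch of hard vs. soft voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("tree", DecisionTreeClassifier(random_state=0)),
              ("nb", GaussianNB())]

# Hard voting: each classifier casts one vote; the majority class wins.
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X_train, y_train)
# Soft voting: per-class probabilities are averaged; the class with the highest
# average probability wins.
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X_train, y_train)

print("hard voting accuracy:", hard.score(X_test, y_test))
print("soft voting accuracy:", soft.score(X_test, y_test))
```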
Gaussian mixture models

● Suppose there is a set of data points that needs to be grouped into several parts or
clusters based on their similarity. In machine learning, this is known as Clustering.
●​ There are several methods available for clustering:
○​ K Means Clustering
○​ Hierarchical Clustering
○​ Gaussian Mixture Models
●​ Normal or Gaussian Distribution
○​ In real life, many datasets can be modeled by Gaussian Distribution (Univariate
or Multivariate).
○ So it is quite natural and intuitive to assume that the clusters come from different
Gaussian Distributions.
○ In other words, the dataset is modeled as a mixture of several Gaussian Distributions.

Gaussian Mixture Model

Suppose there are K clusters (for the sake of simplicity it is assumed that the number of
clusters is known and equal to K). The mean μ_k and covariance Σ_k must then be estimated for
each cluster k. Had there been only one distribution, they would have been estimated by the
maximum-likelihood method. But since there are K such clusters, the probability density is
defined as a linear combination of the densities of all these K distributions, i.e.

p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k)

where π_k is the mixing coefficient for the k-th distribution, with Σ_k π_k = 1.
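A minimal sketch of fitting a Gaussian mixture model, assuming scikit-learn; the synthetic blob data and K = 3 are arbitrary illustrative choices (internally, GaussianMixture is fitted with the EM algorithm described next):

```python
# Illustrative sketch: fitting a Gaussian mixture with K = 3 components.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0)
labels = gmm.fit_predict(X)                   # fitted internally with the EM algorithm

print("mixing coefficients:", gmm.weights_)   # the pi_k values, summing to 1
print("cluster means:\n", gmm.means_)         # the mu_k values
print("first five cluster assignments:", labels[:5])
```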

Expectation maximization
The Expectation-Maximization (EM) algorithm is an iterative way to find
maximum-likelihood estimates for model parameters when the data is incomplete,
has missing data points, or involves hidden (latent) variables. EM starts from an
initial guess, estimates values for the missing data (or latent variables), and then
uses this completed data to re-estimate the parameters; these estimates are refined
repeatedly until the values converge.
The two basic steps of the EM algorithm are the E Step (Expectation or Estimation Step)
and the M Step (Maximization Step).

1.​ Given a set of incomplete data, consider a set of starting parameters.

2.​ Expectation step (E – step): Using the observed available data of the

dataset, estimate (guess) the values of the missing data.


3.​ Maximization step (M – step): Complete data generated after the

expectation (E) step is used in order to update the parameters.

4.​ Repeat step 2 and step 3 until convergence.
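A toy sketch of this loop for a one-dimensional mixture of two Gaussians, using only NumPy; the data, the initial parameter guesses, and the fixed number of iterations are illustrative assumptions (a full implementation would stop when the log-likelihood converges):

```python
# Toy EM sketch for a 1-D mixture of two Gaussians (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
# "Incomplete" data: we observe the values but not which component produced each one.
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

def gauss_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Step 1: starting parameters (a rough guess).
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # Step 2 (E-step): estimate the hidden assignments as responsibilities r[i, k].
    dens = np.stack([pi[k] * gauss_pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)

    # Step 3 (M-step): use the "completed" data to update the parameters.
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    # Step 4: repeat; a real implementation would check log-likelihood convergence here.

print("mixing coefficients:", pi.round(2))
print("means:", mu.round(2), "standard deviations:", sigma.round(2))
```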


Usage of EM algorithm –

●​ It can be used to fill the missing data in a sample.

●​ It can be used as the basis of unsupervised learning of clusters.

●​ It can be used for the purpose of estimating the parameters of Hidden

Markov Model (HMM).

●​ It can be used for discovering the values of latent variables.

Advantages of EM algorithm –

●​ It is always guaranteed that likelihood will increase with each iteration.

●​ The E-step and M-step are often pretty easy for many problems in

terms of implementation.

● Solutions to the M-step often exist in closed form.

Disadvantages of EM algorithm –

●​ It has slow convergence.

● It may converge only to a local optimum.

●​ It requires both the probabilities, forward and backward (numerical

optimization requires only forward probability).

Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a
linear combination of the learners' outputs. This is also known as ensembles and linear opinion
pools. In the simplest case, all learners are given equal weight and we have simple voting,
which corresponds to taking an average. Still, taking a (weighted) sum is only one of the
possibilities; there are also other combination rules, as shown in Table 17.1
(Kittler et al. 1998). If the outputs are not posterior probabilities, these rules require that
the outputs be normalized to the same scale.

Example of combination rules on three learners and three classes.


Sum rule is the most intuitive and is the most widely used in practice. Median rule is more
robust to outliers; minimum and maximum rules are pessimistic and optimistic, respectively.
With the product rule, each learner has veto power; regardless of the other ones, if one
learner has an output of 0, the overall output goes to 0. Note that after the combination
rules, yi do not necessarily sum up to 1.
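A small sketch of these combination rules applied to hypothetical outputs of L = 3 learners for 3 classes (the probabilities and the weights in the weighted sum are made up for illustration):

```python
# Illustrative sketch of the combination rules for three learners and three classes.
import numpy as np

# d[j, i]: output of learner j for class C_i (e.g. estimated posterior probabilities).
d = np.array([[0.2, 0.5, 0.3],
              [0.0, 0.6, 0.4],
              [0.4, 0.4, 0.2]])

rules = {
    "sum / average": d.mean(axis=0),
    "weighted sum": np.average(d, axis=0, weights=[0.2, 0.3, 0.5]),  # w_j chosen arbitrarily
    "median": np.median(d, axis=0),
    "minimum": d.min(axis=0),
    "maximum": d.max(axis=0),
    "product": d.prod(axis=0),        # a single 0 from any learner vetoes the class
}

for name, y in rules.items():
    # After combination the y_i need not sum to 1; the predicted class is the argmax.
    print(f"{name:14s} y = {np.round(y, 3)}  ->  choose C{y.argmax() + 1}")
```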

In the weighted sum, dji is the vote of learner j for class Ci and wj is the weight of its
vote. Simple voting is a special case where all voters have equal weight, namely wj = 1/L.
In classification, this is called plurality voting, where the class having the maximum number
of votes is the winner. When there are two classes, this is majority voting, where the winning
class gets more than half of the votes.
