learning algorithms
A high-variance model results from learning the training data too well; its predictions vary with each data point. Hence it cannot predict new points accurately.
Both high bias and high variance models thus cannot generalize properly. Thus, weak learners
will either make incorrect generalizations or fail to generalize altogether. Because of this, the
predictions of weak learners cannot be relied on by themselves.
As we know from the bias-variance trade-off, an underfit model has high bias and low variance, whereas an overfit model has high variance and low bias. In either case, there is no balance between
bias and variance. For there to be a balance, both the bias and variance need to be low. Ensemble
learning tries to balance this bias-variance trade-off by reducing either the bias or the variance.
Ensemble learning will aim to reduce the bias if we have a weak model with high bias and low
variance. Ensemble learning will aim to reduce the variance if we have a weak model with high
variance and low bias. This way, the resulting model will be much more balanced, with both low bias and low variance. The resulting model is therefore known as a strong learner. It will be more generalized than the weak learners and will thus be able to make accurate predictions.
Multiexpert combination: -
Multiexpert combination methods have base-learners that work in parallel. These methods can in turn
be divided into two:
In the global approach, also called learner fusion, given an input, all base-learners generate an
output and all these outputs are used. Examples are voting and stacking.
In the local approach, or learner selection, for example, in mixture of experts, there is a gating
model, which looks at the input and chooses one (or very few) of the learners as responsible for
generating the output.
Multistage combination: -
Multistage combination methods use a serial approach where the next base-learner is trained
with or tested on only the instances where the previous base-learners are not accurate enough. The
idea is that the base-learners (or the different representations they use) are sorted in increasing
complexity so that a complex base-learner is not used (or its complex representation is not extracted)
unless the preceding simpler base-learners are not confident. An example is cascading.
Let us say that we have L base-learners. We denote by dj(x) the prediction of base-learner Mj given the arbitrary-dimensional input x. In the case of multiple representations, each Mj uses a different input representation xj. The final prediction is calculated from the predictions of the base-learners:
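The combining rule itself appears to have been an equation that did not survive extraction. In standard ensemble notation (the symbols f and \Phi below are assumptions, not taken from the text), the final prediction can be written as
y = f\left(d_1(x), d_2(x), \dots, d_L(x) \mid \Phi\right)
where f(\cdot) is the combining function and \Phi denotes its parameters.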
Voting: -
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a
linear combination of the learners (see figure 17.1):
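The voting equation itself appears to have been dropped during extraction; a standard way to write the weighted vote for class Ci (notation assumed, not taken verbatim from the text) is
y_i = \sum_{j=1}^{L} w_j d_{ji}, \qquad w_j \ge 0, \quad \sum_{j=1}^{L} w_j = 1
where d_{ji} is the vote (or posterior probability) of learner j for class Ci and w_j is the weight given to learner j.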
This is also known as ensembles and linear opinion pools. In the simplest case, all learners are
given equal weight and we have simple voting that corresponds to taking an average. Still, taking a
(weighted) sum is only one of the possibilities and there are also other combination rules, as shown in
table 17.1. If the outputs are not posterior probabilities, these rules require that outputs be normalized
to the same scale.
An example of the use of these rules is shown in table 17.2, which demonstrates the effects of
different rules. Sum rule is the most intuitive and is the most widely used in practice. Median rule is
more robust to outliers; minimum and maximum rules are pessimistic and optimistic, respectively.
With the product rule, each learner has veto power; regardless of the other ones, if one learner has an
output of 0, the overall output goes to 0. Note that after applying the combination rules, the yi do not necessarily sum up to 1.
The voting classifier is an ensemble learning method that combines several base models to
produce the final optimum solution. The base model can independently use different algorithms such
as KNN, Random forests, Regression, etc., to predict individual outputs. This brings diversity to the output, so it is called Heterogeneous ensembling. In contrast, if the base models use the same algorithm to predict separate outcomes, this is called Homogeneous ensembling.
Voting Classifier is an estimator that combines models representing different classification
algorithms associated with individual weights for confidence.
The Voting classifier estimator built by combining different classification models turns out to be a stronger meta-classifier that balances out the individual classifiers’ weaknesses on a particular dataset.
The voting classifier takes a majority vote, based on the weights applied to the class labels or class probabilities, and assigns a class label to each record accordingly.
The ensemble classifier prediction can be mathematically represented as the following:
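The expression itself is not reproduced here; one common way to write weighted majority voting, reusing the symbols Pf and Cm from the text but with the rest of the notation assumed, is
P_f = \arg\max_i \sum_{j=1}^{m} w_j \,[\,C_j(x) = i\,]
where w_j is the weight of classifier C_j and [\cdot] equals 1 when the condition holds and 0 otherwise.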
In the above expression, Pf is the class predicted by the majority of the classifiers Cm.
Majority Voting based on equal weights: When majority voting is taken based on equal weights, the mode of the predicted labels is taken. Let’s say there are 3 classifiers, clf1, clf2, clf3. For a particular record, the predictions are [1, 1, 0]. If the weights assigned to the classifiers are equal, the mode of the predictions is taken. Thus, the mode of [1, 1, 0] is 1, and hence the predicted class for that record is class 1. For equal weights, the equation in fig 1 simplifies to the following:
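With equal weights, the weighted vote reduces to the mode of the individual class predictions (notation assumed):
P_f = \text{mode}\{\,C_1(x), C_2(x), \dots, C_m(x)\,\}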
Diagrammatically, this is how the hard voting classifier with equal weights looks:
Soft voting: -
In soft voting, the classifiers or base models are fed with training data to predict probabilities for each of the m possible classes. Each base classifier independently assigns a probability of occurrence to each class. In the end, the average of the probabilities of each class is calculated, and the final output is the class having the highest probability.
Soft voting classifier classifies input data based on the probabilities of all the predictions made by
different classifiers. Weights applied to each classifier get applied appropriately based on the equation
given.
Let’s understand this using an example. Let’s say there are three binary classifiers clf1, clf2, and clf3. For
a particular record, the classifiers make the following predictions in terms of probabilities in favour of
classes [0,1]:
clf1 -> [0.2, 0.8], clf2 -> [0.1, 0.9], clf3 -> [0.8, 0.2]
With equal weights, the probabilities will get calculated as the following:
Prob of Class 0 = 0.33*0.2 + 0.33*0.1 + 0.33*0.8 = 0.363
Prob of Class 1 = 0.33*0.8 + 0.33*0.9 + 0.33*0.2 = 0.627
The probabilities predicted by the ensemble classifier will be [36.3%, 62.7%]. The class will most likely be class 1 if the threshold is 0.5. This is how a soft voting classifier with equal weights looks:
Conclusion
Voting classifier can be seen as a stronger meta-classifier that balances out the individual
classifiers’ weaknesses on a particular dataset.
Voting classifier is an ensemble classifier which takes input as two or more estimators and
classifies the data based on majority voting.
Hard voting classifier classifies data based on class labels and the weights associated with each classifier.
Soft voting classifier classifies data based on the probabilities and the weights associated with
each classifier.
Hard-voting ensembles output the mode of the base classifiers’ predictions, whereas soft-voting
ensembles average predicted probabilities (or scores).
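As a concrete illustration of hard and soft voting, the following minimal Python sketch uses scikit-learn's VotingClassifier; the synthetic dataset, the choice of base estimators, and the equal weights are assumptions made only for demonstration.

# Minimal sketch of hard and soft voting with scikit-learn (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=0)),
]

# Hard voting: majority vote over the predicted class labels.
hard_vote = VotingClassifier(estimators=estimators, voting="hard")
# Soft voting: (weighted) average of the predicted class probabilities.
soft_vote = VotingClassifier(estimators=estimators, voting="soft", weights=[1, 1, 1])

for name, clf in [("hard", hard_vote), ("soft", soft_vote)]:
    clf.fit(X_train, y_train)
    print(name, "voting accuracy:", clf.score(X_test, y_test))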
Bagging: -
Bagging, also known as Bootstrap aggregating, is an ensemble learning technique that helps to improve the performance and accuracy of machine learning algorithms. It is used to deal with the bias-variance trade-off and reduces the variance of a prediction model. Bagging helps avoid overfitting of the data and is used for both regression and classification models, specifically for decision tree algorithms.
Aggregation
In order to take all possible outcomes into account, the model predictions undergo aggregation to combine them into the final forecast. The aggregation can be performed based on the total number of outcomes (e.g., a majority vote) or on the predicted probabilities obtained from each bootstrapped model in the process.
Stages of Bagging
Bootstrapping: This is a statistical resampling method used to produce random samples (bootstrap samples) with replacement.
Model fitting: We create models on the bootstrap samples at this point. Usually, the same algorithm is used to construct the models. Nevertheless, there is no limitation on using multiple algorithms.
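A minimal sketch of these two stages with scikit-learn's BaggingClassifier follows; the synthetic dataset and the number of estimators are illustrative assumptions (the default base estimator is a decision tree).

# Minimal bagging sketch (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bootstrapping: each of the 50 models is trained on a random sample drawn
# with replacement (bootstrap=True); aggregation is done by majority vote.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bagging.fit(X, y)
print("training accuracy:", bagging.score(X, y))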
Types of Boosting: -
Boosting methods are focused on iteratively combining weak learners to build a strong learner that can
predict more accurate outcomes. As a reminder, a weak learner classifies data slightly better than
random guessing. This approach can provide robust results on prediction problems and can, for some tasks, outperform neural networks and support vector machines.
AdaBoost: -
AdaBoost was the first really successful boosting algorithm developed for the purpose of
binary classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting
technique that combines multiple “weak classifiers” into a single “strong classifier”.
AdaBoost, also called Adaptive Boosting, is a technique in Machine Learning used as an
Ensemble Method. The most common estimator used with AdaBoost is a decision tree with one level, which means a decision tree with only one split. Such trees are also called Decision Stumps.
The algorithm first builds a model and gives equal weights to all the data points. It then assigns higher weights to the points that are wrongly classified, so all the points with higher weights are given more importance in the next model. It keeps training models in this way until the error becomes sufficiently low.
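A minimal AdaBoost sketch with scikit-learn is shown below; the synthetic dataset and the number of estimators are illustrative assumptions. By default, AdaBoostClassifier uses depth-1 decision trees (decision stumps) as its base estimator.

# Minimal AdaBoost sketch (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each successive stump focuses more on the examples that earlier stumps
# misclassified, through the re-weighting scheme described above.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print("training accuracy:", ada.score(X, y))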
Unsupervised Learning: -
Unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the model itself finds the hidden patterns and insights in the given data. It can be compared to the learning which takes place in the human brain when learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using unlabeled dataset
and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because
unlike supervised learning, we have the input data but no corresponding output data. The goal of
unsupervised learning is to find the underlying structure of dataset, group that data according to
similarities, and represent that dataset in a compressed format.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of problems:
1. Clustering
2. Association
Clustering in Machine Learning
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset.
It can be defined as "A way of grouping the data points into different clusters, consisting of similar
data points. The objects with the possible similarities remain in a group that has less or no
similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and dividing the data according to the presence and absence of those patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals
with the unlabeled dataset.
After applying this clustering technique, each cluster or group is given a cluster ID. The ML system can use this ID to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.
NOTE: - Clustering is somewhere similar to the classification algorithm, but the difference is the type of
dataset that we are using. In classification, we work with the labeled data set, whereas in clustering, we
work with the unlabeled dataset.
Example: Let's understand the clustering technique with the real-world example of a shopping mall: when we visit any shopping mall, we can observe that things with similar usage are grouped together. For example, t-shirts are grouped in one section and trousers in another; similarly, in the fruit and vegetable section, apples, bananas, mangoes, etc., are grouped separately so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents according to their topic.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations based on a user's past product searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the different fruits are
divided into several groups with similar properties.
Partitioning Clustering: -
This method is one of the most popular choices for analysts to create clusters. In partitioning
clustering, the clusters are partitioned based upon the characteristics of the data points. We need to
specify the number of clusters to be created for this clustering method. These clustering algorithms
follow an iterative process to reassign the data points between clusters based upon the distance. The algorithms that fall into this category are as follows:
1. K-Means Clustering
2. PAM (Partitioning Around Medoids)
3. CLARA (Clustering Large Applications)
K-Means Clustering: -
• A K-means clustering algorithm tries to group similar items in the form of clusters. The number of
groups is represented by K.
• K-means clustering uses the Euclidean distance method to find the distance between the points.
• K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created in
the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, and the points within a group have similar properties.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The k-means clustering algorithm mainly performs two tasks:
• Determines the best value for K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.
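These two tasks can be seen in the following minimal scikit-learn sketch; the toy blob data and the choice of K = 3 are assumptions made for illustration.

# Minimal K-means sketch (illustrative only).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Toy 2-D data: three well-separated blobs of points.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# K-means iteratively re-estimates the K centroids and re-assigns each
# point to its closest centroid (Euclidean distance).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("centroids:\n", kmeans.cluster_centers_)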
Let us say we have an image that is stored with 24 bits/pixel and can have up to 16 million colors.
Assume we have a color screen with 8 bits/pixel that can display only 256 colors. We want to find the
best 256 colors among all 16 million colors such that the image using only the 256 colors in the palette
looks as close as possible to the original image. This is color quantization where we map from high to
lower resolution. In the general case, the aim is to map from a continuous space to a discrete space;
this process is called vector quantization.
Of course we can always quantize uniformly, but this wastes the colormap by assigning entries to
colors not existing in the image, or would not assign extra entries to colors frequently used in the
image. For example, if the image is a seascape, we expect to see many shades of blue and maybe no red.
So the distribution of the colormap entries should reflect the original density as closely as possible, placing many entries in high-density regions and discarding regions where there is no data.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
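For a data point x and a centroid c in d dimensions, these two distances are
\text{Euclidean:}\quad d(x, c) = \sqrt{\sum_{i=1}^{d} (x_i - c_i)^2}, \qquad \text{Manhattan:}\quad d(x, c) = \sum_{i=1}^{d} |x_i - c_i|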
To find the optimal value of clusters, the elbow method follows the below steps:
• It executes K-means clustering on a given dataset for different K values (for example, ranging from 1 to 10).
• For each value of K, it calculates the WCSS (within-cluster sum of squares) value.
• It plots a curve between the calculated WCSS values and the number of clusters K.
• The sharp point of bend in the plot (the point that looks like the bend of an arm) is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the below image:
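A minimal sketch of how such a curve can be computed is given below; in scikit-learn the WCSS of a fitted KMeans model is exposed as its inertia_ attribute, and the toy dataset is an assumption for illustration.

# Minimal elbow-method sketch (illustrative only).
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

wcss = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this K

# Plot WCSS against K; the "elbow" of the curve suggests the best K.
plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()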
Expectation-Maximization Algorithm: -
The EM algorithm is a latent variable method for finding locally maximum likelihood parameters of a statistical model; it was proposed by Arthur Dempster, Nan Laird, and Donald Rubin in 1977.
The EM (Expectation-Maximization) algorithm is one of the most commonly used techniques in machine learning to obtain maximum likelihood estimates of variables that are sometimes observable and sometimes not; that is, it also applies to unobserved, or latent, data. It has various real-world applications in statistics, including obtaining the mode of the posterior marginal distribution of parameters in machine learning and data mining applications.
In most real-life applications of machine learning, many relevant features are involved, but only a few of them are observable while the rest are unobservable. When the variables are observable, their values can be predicted directly from the training instances.
On the other hand, for variables that are latent, i.e., not directly observable, the Expectation-Maximization (EM) algorithm plays a vital role in predicting their values, with the condition that the general form of the probability distribution governing those latent variables is known to us. In this topic, we will discuss a basic introduction to the EM algorithm, a flow chart of the EM algorithm, its applications, and the advantages and disadvantages of the EM algorithm.
What is an EM algorithm?
The Expectation-Maximization (EM) algorithm is defined as the combination of various
unsupervised machine learning algorithms, which is used to determine the local maximum
likelihood estimates (MLE) or maximum a posteriori estimates (MAP) for unobservable variables
in statistical models. Further, it is a technique to find maximum likelihood estimates when latent variables are present. It is also referred to as the latent variable model.
A latent variable model consists of both observable and unobservable variables where
observable can be predicted while unobserved are inferred from the observed variable. These
unobservable variables are known as latent variables.
Key Points:
It is known as the latent variable model to determine MLE and MAP parameters for latent variables.
It is used to predict values of parameters in instances where data is missing or unobservable for
learning, and this is done until convergence of the values occurs.
EM Algorithm
The EM algorithm is the combination of various unsupervised ML algorithms, such as the k-means
clustering algorithm. Being an iterative approach, it consists of two modes. In the first mode, we
estimate the missing or latent variables. Hence it is referred to as the Expectation/estimation step
(E-step). Further, the other mode is used to optimize the parameters of the models so that it can
explain the data more clearly. The second mode is known as the maximization-step or M-step.
Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps: the Initialization Step, Expectation Step, Maximization Step, and Convergence Step. These steps are explained as follows:
1st Step: The very first step is to initialize the parameter values. Further, the system is provided
with incomplete observed data with the assumption that data is obtained from a specific model.
2nd Step: This step is known as Expectation or E-Step, which is used to estimate or guess the values
of the missing or incomplete data using the observed data. Further, E-step primarily updates the
variables.
3rd Step: This step is known as Maximization or M-step, where we use complete data obtained from
the 2nd step to update the parameter values. Further, M-step primarily updates the hypothesis.
4th step: The last step is to check whether the values of the latent variables are converging or not. If they are, the process stops; otherwise, it repeats from step 2 until convergence occurs.
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent variables through
observed data in datasets. The EM algorithm or latent variable model has a broad range of real-life
applications in machine learning. These are as follows:
The EM algorithm is applicable in data clustering in machine learning.
Advantages of EM algorithm
It is very easy to implement the two core steps of the EM algorithm, the E-step and the M-step, in various machine learning problems.
The likelihood is guaranteed not to decrease after each iteration.
It often yields a closed-form solution for the M-step.
Disadvantages of EM algorithm
The convergence of the EM algorithm is very slow.
It converges only to a local optimum.
It takes both forward and backward probabilities into consideration, in contrast to numerical optimization, which takes only the forward probabilities.
The importance of the EM algorithm can be seen in various applications:
data clustering, natural language processing (NLP), computer vision, image reconstruction, structural
engineering, etc.
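The expression that the following sentence refers to did not survive extraction; a plausible reconstruction, using the standard mixture-of-mixtures notation that matches the symbols ki and Gij below, is
p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij})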
where ki is the number of components making up p(x|Ci) and Gij is component j of class i. Note that different classes may need different numbers of components.
Spectral clustering: -
As the number of dimensions increases, a distance-based similarity measure converges to a
constant value between any given examples. Reduce dimensionality either by using PCA on the feature
data, or by using “spectral clustering” to modify the clustering algorithm as explained below.
Figure 3: A demonstration of the curse of dimensionality. Each plot shows the pairwise
distances between 200 random points.
Spectral clustering avoids the curse of dimensionality by adding a pre-clustering step to your
algorithm:
Reduce the dimensionality of feature data by using PCA.
Project all data points into the lower-dimensional subspace.
Cluster the data in this subspace by using your chosen algorithm.
Therefore, spectral clustering is not a separate clustering algorithm but a pre-clustering step that you can use with any clustering algorithm.
There are two major approaches in clustering. They are:
1. Compactness
2. Connectivity
In compactness, the points are closer to each other and are compact towards the cluster center.
Distance is used as a measure to compute closeness. There are different types of distance metrics that
are in use. A few of them are Euclidean distance, Manhattan distance, Minkowski distance, and
Hamming distance.
K-means algorithm uses the compactness approach.
In connectivity, the points in a cluster are either immediately next to each other (within epsilon distance) or connected through other points. Even if the distance between two points is small, if they are not connected, they are not put in the same cluster.
Spectral clustering is one of the techniques to follow this approach.
Building the Similarity Graph: This step builds the Similarity Graph in the form of an adjacency
matrix which is represented by A.
Projecting the data onto a lower Dimensional Space: This step is done to account for the
possibility that members of the same cluster may be far away in the given dimensional space. Thus
the dimensional space is reduced so that those points are closer in the reduced dimensional space
and thus can be clustered together by a traditional clustering algorithm. It is done by computing
the Graph Laplacian Matrix. To compute it, the degree of each node first needs to be defined. This matrix is then normalized for mathematical efficiency. To reduce the dimensions, the eigenvalues and their respective eigenvectors are calculated. If the number of clusters is k, then the first k eigenvalues and their eigenvectors are taken and stacked into a matrix such that the eigenvectors are the columns.
Clustering the Data: - This process mainly involves clustering the reduced data by using any traditional clustering technique – typically K-Means Clustering. First, each node is assigned a row of the normalized Graph Laplacian Matrix. Then this data is clustered using any traditional technique. To transform the clustering result back to the original data, the node identifier is retained.
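These steps are what scikit-learn's SpectralClustering performs internally; the following minimal sketch applies it to toy two-moons data, where the dataset, the affinity choice, and the cluster count are illustrative assumptions.

# Minimal spectral-clustering sketch (illustrative only).
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering

# Two interleaving half-moons: a connectivity-based method separates them
# where a purely compactness-based method such as plain K-means struggles.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Internally: build a nearest-neighbour similarity graph, embed the points
# using eigenvectors of the graph Laplacian, then run K-means in that space.
spectral = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    assign_labels="kmeans",
    random_state=0,
)
labels = spectral.fit_predict(X)
print("first ten cluster labels:", labels[:10])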