
UNIT IV ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING 9

Combining multiple learners: Model combination schemes, Voting, Ensemble Learning - bagging,
boosting, stacking, Unsupervised learning: K-means, Instance Based Learning: KNN, Gaussian mixture
models and Expectation maximization

Combining Multiple Learners

• When designing a learning machine, we make choices such as the parameters of the machine, the training data, and the input representation. Each of these choices introduces some variance in performance. For example, in a classification setting we may choose a parametric classifier, and if we use a multilayer perceptron we must also decide on the number of hidden units.

• Each learning algorithm dictates a certain model that comes with a set of assumptions. This inductive
bias leads to error if the assumptions do not hold for the data.

• Different learning algorithms have different accuracies. The learning algorithms can be combined to
attain higher accuracy.

• Data fusion is the process of fusing multiple records representing the same real-world object into a
single, consistent, and clean representation.

• Combining different models is done to improve the performance of deep learning models.

Building a new model by combination requires less time, data, and computational resources.
The most common method of combining models is to average multiple models, where taking a weighted average improves the accuracy.

1. Generating Diverse Learners:

• Different Algorithms: We can use different learning algorithms to train different base-learners.
Different algorithms make different assumptions about the data and lead to different classifiers.
• Different Hyper-parameters: We can use the same learning algorithm but use it with different hyper-
parameters.
• Different Input Representations: Different representations make different characteristics explicit
allowing better identification.
• Different Training Sets: Another possibility is to train different base-learners by different subsets of
the training set.
Model Combination Schemes
• Different methods are used to generate the final output from multiple base-learners: multiexpert combination and multistage combination.

1. Multiexpert combination

• Multiexpert combination methods have base-learners that work in parallel.


a) Global approach (learner fusion): given an input, all base-learners generate an output and all these
outputs are used, as in voting and stacking.

b) Local approach (learner selection): in mixture of experts, there is a gating model, which looks at
the input and chooses one (or very few) of the learners as responsible for generating the output.

2. Multistage combination
Multistage combination methods use a serial approach: the next base-learner is trained with, or tested on, only the instances where the previous base-learners are not accurate enough.

• Let us assume that we want to construct a function that maps inputs to outputs from a set of $N_{\text{train}}$ known input-output pairs

$D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{N_{\text{train}}}$

where $x_i \in X$ is a $D$-dimensional feature input vector and $y_i \in Y$ is the output.

Voting
A voting classifier is a machine learning model that trains on an ensemble of several models and predicts an output (class) based on the class with the highest probability of being the output.
• Voting is an ensemble machine learning algorithm.
• For regression, a voting ensemble involves making a prediction that is the average of multiple other
regression models.
Voting Strategies:
• Hard Voting – The class that receives the majority of votes is selected as the final prediction. It is commonly used in classification problems. In regression, the ensemble predicts the average of the individual predictions.
• Soft Voting – A weighted average of the predicted probabilities is used to make the final prediction. It is suitable when the classifiers provide probability estimates. In other words, for each class it sums the predicted probabilities and predicts the class with the highest sum.
• In these methods, the first step is to create multiple classification/regression models using some training dataset.
Each base model can be created using different splits of the same training dataset with the same algorithm, using the same dataset with different algorithms, or by any other method.
Fig. 9.1.2 shows the general idea of base-learners combined with a model combiner.
• When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced. Human ensembles are demonstrably better than individual decision makers.

• Use a single, arbitrary learning algorithm but manipulate training data to make it learn multiple
models.
Base Models: Individual models that form the ensemble. For example, Support Vector Machines,
Logistic Regression, Decision Trees.
Classifier and Regressor Variants:
• Voting Classifier – Combines multiple classifiers for classification tasks.
• Voting Regressor – Combines multiple regressors for regression tasks.
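As a concrete illustration of hard and soft voting, here is a minimal scikit-learn sketch; the choice of base models and the Iris dataset are illustrative assumptions, not part of the notes.

```python
# Minimal voting-ensemble sketch (illustrative; base models and dataset are arbitrary choices).
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("svc", SVC(probability=True)),  # probability=True lets soft voting use predicted probabilities
    ("dt", DecisionTreeClassifier(random_state=0)),
]

# Hard voting: majority class wins. Soft voting: class with the highest summed probability wins.
for strategy in ("hard", "soft"):
    clf = VotingClassifier(estimators=base_models, voting=strategy)
    clf.fit(X_train, y_train)
    print(strategy, "voting accuracy:", clf.score(X_test, y_test))
```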

Ensemble Learning
Ensemble modeling is the process of running two or more related but different analytical models and
then synthesizing the results into a single score or spread in order to improve the accuracy of predictive
analytics and data mining applications.

• An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples.

• Ensemble methods combine several decision tree classifiers to produce better predictive performance than a single decision tree classifier.

The main principle behind the ensemble model is that a group of weak learners come together to form a
strong learner, thus increasing the accuracy of the model.
• Why do ensemble methods work?
• Based on one of two basic observations :

1. Variance reduction: If the training sets are completely independent, it always helps to average an ensemble, because this reduces variance without affecting bias (e.g., bagging) and reduces sensitivity to individual data points.

2. Bias reduction: For simple models, an average of models has much greater capacity than a single model. Averaging models can reduce bias substantially by increasing capacity, while variance is controlled by fitting one component at a time (e.g., boosting).
Bagging
Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly used
to reduce variance within a noisy data set.

In bagging, a random sample of data in a training set is selected with replacement, meaning that the individual data points can be chosen more than once.

Bagging is an ensemble approach that tries to resolve overfitting in classification and regression problems, and it aims to improve the accuracy and overall performance of machine learning algorithms.

It does this by taking random subsets of the original dataset, with replacement, and fitting either a classifier (for classification) or a regressor (for regression) to each subset.

• Given a training set of size n, create m samples of size n by drawing n examples from the original data with replacement.

Each bootstrap sample will on average contain 63.2% of the unique training examples; the rest are replicates. Bagging combines the m resulting models using a simple majority vote.
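The 63.2% figure follows from a short probability argument, sketched below.

```latex
% Probability that a fixed training example is missed in all n draws (sampling with replacement):
\[
\left(1 - \tfrac{1}{n}\right)^{n} \;\longrightarrow\; e^{-1} \approx 0.368 \quad \text{as } n \to \infty,
\]
% so the expected fraction of unique examples that do appear in a bootstrap sample is
\[
1 - e^{-1} \approx 0.632 .
\]
```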

• In particular, on each round the base learner is trained on what is often called a "bootstrap replicate" of the original training set.
Suppose the training set consists of n examples. Then a bootstrap replicate is a new training set that also consists of n examples, and which is formed by repeatedly selecting uniformly at random, with replacement, n examples from the original training set. This means that the same example may appear multiple times in the bootstrap replicate, or it may not appear at all.

It also decreases error by decreasing the variance in the results due to unstable learners, algorithms (like
decision trees) whose output can change dramatically when the training data is slightly changed.

Pseudocode:

1. Given training data (x1, y1), ..., (xm, ym).

2. For t = 1, ..., T:

a. Form a bootstrap replicate dataset St by selecting m random examples from the training set with replacement.

b. Let ht be the result of training the base learning algorithm on St.

3. Output the combined classifier:

H(x) = majority(h1(x), ..., hT(x)).

Bagging Steps:

1. Suppose there are N observations and M features in the training data set. A sample is taken from the training data set randomly with replacement.

2. A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node iteratively.

3. The tree is grown to its largest extent.

4. The above steps are repeated n times, and the prediction is given based on the aggregation of predictions from the n trees.
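The steps above can be reproduced with scikit-learn's BaggingClassifier; the sketch below is illustrative (synthetic data, and 50 trees chosen arbitrarily).

```python
# Minimal bagging sketch (illustrative settings and synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each tree is trained on a bootstrap replicate (drawn with replacement);
# the 50 resulting trees are combined by majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, bootstrap=True, random_state=0)
print("Bagged trees CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
```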

Advantages of Bagging:

1. Reduces over-fitting of the model.

2. Handles higher dimensionality data very well.

3. Maintains accuracy for missing data.

Disadvantages of Bagging:

1. Since the final prediction is based on the mean of the predictions from the subset trees, it will not give precise values for the classification and regression models.

Boosting
• Boosting is a very different method of generating multiple predictions (function estimates) and combining them linearly.
Boosting refers to a general and provably effective method of producing a very accurate classifier by
combining rough and moderately inaccurate rules of thumb.
• A learner is weak if it produces a classifier that is only slightly better than random guessing, while a
learner is said to be strong if it produces a classifier that achieves a low error with high confidence for a
given concept.
• Boosting was revised into a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance. Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong.

• Boosting is a bias reduction technique. It typically improves the performance of a single tree model. A
reason for this is that we often cannot construct trees which are sufficiently large due to thinning out of
observations in the terminal nodes.
• Boosting is then a device to come up with a more complex solution by taking linear combination of
trees. In presence of high-dimensional predictors, boosting is also very useful as a regularization
technique for additive or interaction modeling.
• To begin, we define an algorithm for finding the rules of thumb, which we call a weak learner. The
boosting algorithm repeatedly calls this weak learner, each time feeding it a different distribution over
the training data. Each call generates a weak classifier and we must combine all of these into a single
classifier that, hopefully, is much more accurate than any one of the rules.
• Train a set of weak hypotheses: h1,..., hT. The combined hypothesis H is a weighted majority vote of
the T weak hypotheses. During the training, focus on the examples that are misclassified.

AdaBoost:

• AdaBoost, short for "Adaptive Boosting", is a machine learning meta - algorithm formulated by Yoav
Freund and Robert Schapire who won the prestigious "Gödel Prize" in 2003 for their work.
It can be used in conjunction with many other types of learning algorithms to improve their
performance.
• It can be used to learn weak classifiers and make the final classification based on a weighted vote of the weak classifiers.
• It is a linear classifier with all its desirable properties. It has good generalization properties.
• The idea is to use the weak learner to form a highly accurate prediction rule by calling the weak learner repeatedly on different distributions over the training examples.
• Initially, all weights are set equally, but in each round the weights of incorrectly classified examples are increased, so that the observations that the previous classifier predicted poorly receive greater weight in the next iteration.
• Advantages of AdaBoost:
1. Very simple to implement

2. Fairly good generalization


3. The prior error need not be known ahead of time.

• Disadvantages of AdaBoost:
1. Suboptimal solution

2. Can overfit in the presence of noise.
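A minimal AdaBoost sketch with scikit-learn is shown below; the synthetic dataset and the values of n_estimators and learning_rate are illustrative assumptions.

```python
# Minimal AdaBoost sketch (illustrative data and hyper-parameters).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Weak learners (decision stumps by default) are added one round at a time; after each round
# the weights of misclassified examples are increased so later learners focus on them.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=1)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", ada.score(X_test, y_test))
```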

Boosting Steps:

1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.

2. Draw a second random training subset d2 without replacement from the training set and add 50 percent of the previously misclassified samples, to train a weak learner C2.

3. Find the training samples d3 in the training set D on which C1 and C2 disagree, to train a third weak learner C3.

4. Combine all the weak learners via majority voting.

Advantages of Boosting:

1. Supports different loss functions.

2. Works well with interactions.

Disadvantages of Boosting:

1. Prone to over-fitting.

2. Requires careful tuning of different hyper-parameters.

Stacking
• Stacking, sometimes called stacked generalization, is an ensemble machine learning method that
combines multiple heterogeneous base or component models via a meta-model.
• The base models are trained on the complete training data, and then the meta-model is trained on the predictions of the base models.
The advantage of stacking is the ability to explore the solution space with different models for the same problem.
A stacking-based model can be visualized in levels and has at least two levels of models. The first level typically trains two or more base learners (which can be heterogeneous), and the second level is usually a single meta-learner that uses the base models' predictions as input and gives the final result as output.
A stacked model can have more than two such levels but increasing the levels doesn't always guarantee
better performance.
• In classification tasks, logistic regression is often used as the meta-learner, while linear regression is more suitable as the meta-learner for regression tasks.
• Stacking is concerned with combining multiple classifiers generated by different learning algorithms L1, ..., LN on a single dataset S, which is composed of feature vectors Si = (xi, ti).
• The stacking process can be broken into two phases:
1. Generate a set of base-level classifiers C1, ..., CN, where Ci = Li(S).

2. Train a meta-level classifier to combine the outputs of the base-level classifiers.

• Fig. 9.2.2 shows stacking frame.

• The training set for the meta-level classifier is generated through a leave-one-out cross-validation process.
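A minimal stacking sketch with scikit-learn follows; the base learners, meta-learner, and dataset are illustrative choices, and scikit-learn builds the meta-level training set with internal k-fold cross-validation rather than strict leave-one-out.

```python
# Minimal stacking sketch (illustrative base learners and meta-learner).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Level 0: heterogeneous base learners. Level 1: a logistic-regression meta-learner
# trained on out-of-fold predictions of the base learners.
stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)), ("svc", SVC())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print("Stacked model CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```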
Difference between Bagging and Boosting

Clustering in Machine Learning

Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset.

It can be defined as "A way of grouping the data points into different clusters, consisting of similar
data points. The objects with the possible similarities remain in a group that has less or no
similarities with another group."

It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color, behavior,
etc., and divides them as per the presence and absence of those similar patterns.

It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals
with the unlabelled dataset.

After applying this clustering technique, each cluster or group is given a cluster ID. The ML system can use this ID to simplify the processing of large and complex datasets.

The clustering technique is commonly used for statistical data analysis.

Example: Let's understand the clustering technique with a real-world example of a shopping mall. When we visit any shopping mall, we can observe that things with similar usage are grouped together: t-shirts are grouped in one section and trousers in another; similarly, in the fruit and vegetable section, apples, bananas, mangoes, etc., are grouped in separate sections so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents according to topic.

The clustering technique can be widely used in various tasks. Some most common uses of this technique
are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations based on a user's past product searches.

Netflix also uses this technique to recommend movies and web series to its users based on their watch history.

The below diagram explains the working of the clustering algorithm. We can see the different fruits are
divided into several groups with similar properties.

Types of Clustering Methods

The clustering methods are broadly divided into Hard clustering (a data point belongs to only one group) and Soft clustering (a data point can also belong to other groups). But various other clustering approaches also exist. Below are the main clustering methods used in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where K is used to define the number of pre-
defined groups. The cluster center is created in such a way that the distance between the data points of
one cluster is minimum as compared to another cluster centroid.
Density-Based Clustering

The density-based clustering method connects the highly-dense areas into clusters, and the arbitrarily
shaped distributions are formed as long as the dense region can be connected. This algorithm does it by
identifying different clusters in the dataset and connects the areas of high densities into clusters. The
dense areas in data space are divided from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying densities and
high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability of how a
dataset belongs to a particular distribution. The grouping is done by assuming some distributions
commonly Gaussian Distribution.

The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian
Mixture Models (GMM).

Hierarchical Clustering

Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no
requirement of pre-specifying the number of clusters to be created.
In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. The observations or any number of clusters can be selected by cutting the tree at the
correct level. The most common example of this method is the Agglomerative Hierarchical algorithm.

Fuzzy Clustering

Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data object has a set of membership coefficients, which depend on its degree of membership in each cluster.

Fuzzy C-means algorithm is the example of this type of clustering; it is sometimes also known as the
Fuzzy k-means algorithm.

Clustering Algorithms
The choice of clustering algorithm depends on the kind of data that we are using. For example, some algorithms need to be told the number of clusters in the given dataset, whereas others work by finding the minimum distance between observations of the dataset.
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It
classifies the dataset by dividing the samples into different clusters of equal variances. The
number of clusters must be specified in this algorithm. It is fast with fewer computations
required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density
of data points. It is an example of a centroid-based model, that works on updating the candidates
for centroid to be the centre of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with
Noise. It is an example of a density-based model similar to the mean-shift, but with some
remarkable advantages. In this algorithm, the areas of high density are separated by the areas of
low density. Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where K-means may fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs
the bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the
outset and then successively merged. The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms in that it does not require specifying the number of clusters. In this algorithm, each pair of data points exchanges messages until convergence. Its O(N²T) time complexity is the main drawback of this algorithm.

Applications of Clustering
o In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets into
different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result
appears based on the closest object to the search query. It does it by grouping similar data objects
in one group that is far from the other dissimilar objects. The accurate result of a query depends
on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their
choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals
using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in the GIS database. This can be very useful for determining the purpose for which a particular area of land is most suitable.

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems
in machine learning or data science.
What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters.

Here K defines the number of pre-defined clusters that need to be created in the process: for K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the particular
k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4, else go to FINISH.

Step-7: The model is ready.
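The same steps can be reproduced with scikit-learn's KMeans; the sketch below uses two synthetic blobs standing in for the M1/M2 example that follows (the data is an illustrative assumption).

```python
# Minimal K-means sketch (synthetic 2-D data standing in for variables M1 and M2).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))])

# Steps 2-6: pick K initial centroids, assign points, recompute centroids, repeat until stable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("Final centroids:\n", kmeans.cluster_centers_)
print("First ten cluster assignments:", labels[:10])
```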

Let's understand the above steps by considering the visual plots:

Suppose we have two variables, M1 and M2. The x-y axis scatter plot of these two variables is given below:

o Let's take the number of clusters k, i.e., K=2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data points into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid.
o We will compute it by applying some mathematics that we have studied to calculate the distance
between two points. So, we will draw a median between both the centroids.

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, so we will repeat the process by choosing a new centroid.
To choose the new centroids, we will compute the center of gravity of these centroids, and will
find new centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same
process of finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.

o We will repeat the process by finding the center of gravity of centroids, so the new centroids will
be as shown in the below image:
o As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final clusters will be
as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends upon highly efficient clusters that it
forms. But choosing the optimal number of clusters is a big task. There are some different ways to find
the optimal number of clusters, but here we are discussing the most appropriate method to find the
number of clusters or value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This method
uses the concept of WCSS value.

WCSS stands for Within Cluster Sum of Squares, which defines the total variations within a cluster.
The formula to calculate the value of WCSS (for 3 clusters) is given below:

$$\text{WCSS} = \sum_{P_i \in \text{Cluster}_1} \text{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} \text{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} \text{distance}(P_i, C_3)^2$$

In the above formula of WCSS, $\sum_{P_i \in \text{Cluster}_1} \text{distance}(P_i, C_1)^2$ is the sum of the squares of the distances between each data point and its centroid within Cluster 1, and the same holds for the other two terms.

To measure the distance between data points and centroid, we can use any method such as Euclidean
distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:

o It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend in the plot, where the curve looks like an arm, is considered the best value of K.
o Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the below image:
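A minimal elbow-method sketch using scikit-learn's inertia_ attribute (which is exactly the WCSS) on illustrative synthetic data:

```python
# Elbow-method sketch: plot WCSS (KMeans.inertia_) against K for K = 1..10 (illustrative data).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squared distances for this K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```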

K-Means Algorithm Properties

1. There are always K clusters.

2. There is always at least one item in each cluster.

3. The clusters are non-hierarchical and they do not overlap.

4. Every member of a cluster is closer to its own cluster than to any other cluster (closeness does not always involve the 'center' of the clusters).

The K-Means Algorithm Process

1. The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters
resulting in clusters that have roughly the same number of data points.

2. For each data point.

a. Calculate the distance from the data point to each cluster.

b. If the data point is closest to its own cluster, leave it where it is.

c. If the data point is not closest to its own cluster, move it into the closest cluster.

3. Repeat the above step until a complete pass through all the data points results in no data point moving
from one cluster to another. At this point the clusters are stable and the clustering process ends.

4. The choice of initial partition can greatly affect the final clusters that result, in terms of inter- cluster
and intracluster distances and cohesion.

• The K-means algorithm is iterative in nature. It converges, however only a local minimum is obtained. It works only for numerical data. This method is easy to implement.
• Advantages of K-Means Algorithm:
1. Efficient in computation

2. Easy to implement.

• Weaknesses
1. Applicable only when mean is defined.

2. Need to specify K, the number of clusters, in advance.

3. Trouble with noisy data and outliers.

4. Not suitable to discover clusters with non-convex shapes.

Instance Based Learning: KNN

K-Nearest Neighbor(KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o The K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for
the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and when it gets new data, then it classifies that data
into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images and, based on the most similar features, it will put the image in either the cat or the dog category.
Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below
image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points $(x_1, y_1)$ and $(x_2, y_2)$, it can be calculated as $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.

o By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
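A minimal K-NN sketch with scikit-learn, using k=5 as in the walkthrough above (the Iris dataset is an illustrative assumption):

```python
# Minimal K-NN sketch (k = 5; Euclidean distance is scikit-learn's default metric).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A new point is assigned the majority class among its 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Predicted class of the first test point:", knn.predict(X_test[:1]))
print("Test accuracy:", knn.score(X_test, y_test))
```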

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some values to
find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
o Large values for K are good, but they may lead to some difficulties.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points for all
the training samples.
Difference between K-means and KNN

Gaussian Mixture Models and Expectation Maximization

• The Gaussian mixture model is a "soft" clustering algorithm, where each point probabilistically "belongs" to all clusters. This is different from k-means, where each point belongs to exactly one cluster.
• The Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of Gaussian distributions with unknown parameters.
• For example, in modelling human height data, height is typically modelled as a normal distribution for
each gender with a mean of approximately 5'10" for males and 5'5" for females. Given only the height
data and not the gender assignments for each data point, the distribution of all heights would follow the
sum of two scaled (different variance) and shifted (different mean) normal distributions. A model
making this assumption is an example of a Gaussian mixture model.
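Written out, the two-gender height example corresponds to a two-component mixture density; the symbols below are the usual mixture-model notation, not taken from the original notes.

```latex
\[
p(x) \;=\; \pi_1\,\mathcal{N}\!\left(x \mid \mu_1, \sigma_1^{2}\right)
      \;+\; \pi_2\,\mathcal{N}\!\left(x \mid \mu_2, \sigma_2^{2}\right),
\qquad \pi_1 + \pi_2 = 1,
\]
% where \pi_k are the mixing weights and (\mu_k, \sigma_k^2) are the mean and variance of component k.
```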
• Gaussian mixture models do not rigidly classify each instance into one class or the other. The algorithm attempts to produce K Gaussian distributions that take into account the entire training space. Every point can be associated with one or more distributions.
Consequently, the deterministic factor is the probability that each point belongs to a certain Gaussian distribution.
• GMMs have a variety of real-world applications. Some of them are listed below.
a) Used for signal processing

b) Used for customer churn analysis

c) Used for language identification

d) Used in video game industry

e) Genre classification of songs
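A minimal scikit-learn sketch of fitting a Gaussian mixture model to one-dimensional "height-like" data (the synthetic data and component means are illustrative assumptions):

```python
# Minimal Gaussian-mixture sketch (synthetic 1-D "height" data, two components).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(70, 3.0, 500),      # roughly "male" heights in inches
                          rng.normal(65, 2.5, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)
print("Estimated means:", gmm.means_.ravel())
print("Estimated mixing weights:", gmm.weights_)
# Soft clustering: each point receives a probability of belonging to each component.
print("Responsibilities of the first point:", gmm.predict_proba(heights[:1]))
```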

Expectation-maximization
In real-world machine learning applications, it is common to have many relevant features, but only a
subset of them may be observable.
When dealing with variables that are sometimes observable and sometimes not, it is indeed possible to
utilize the instances when that variable is visible or observed in order to learn and make predictions
for the instances where it is not observable.
This approach is often referred to as handling missing data. By using the available instances where the
variable is observable, machine learning algorithms can learn patterns and relationships from the
observed data. These learned patterns can then be used to predict the values of the variable in
instances where it is missing or not observable.
EM algorithm is applicable to latent variables, which are variables that are not directly observable but
are inferred from the values of other observed variables.

The EM algorithm serves as the foundation for many unsupervised clustering algorithms. It provides a
framework to find the local maximum likelihood parameters of a statistical model and infer latent
variables in cases where data is missing or incomplete.

It consists of an estimation step (E-step) and a maximization step (M-step), forming an iterative
process to improve model fit.

• In the E-step, the algorithm computes the latent variables, i.e., the expectation of the log-likelihood, using the current parameter estimates.
• In the M-step, the algorithm determines the parameters that maximize the expected log-likelihood obtained in the E-step, and the corresponding model parameters are updated based on the estimated latent variables.
Expectation-Maximization in EM Algorithm

By iteratively repeating these steps, the EM algorithm seeks to maximize the likelihood of the
observed data.

It is commonly used for unsupervised learning tasks, such as clustering, where latent variables are
inferred and has applications in various fields, including machine learning, computer vision, and
natural language processing.

Key Terms in Expectation-Maximization (EM) Algorithm

• Latent Variables: Latent variables are unobserved variables in statistical models that can only be inferred indirectly through their effects on observable variables. They cannot be directly measured but can be detected by their impact on the observable variables.
• Likelihood: It is the probability of observing the given data given the parameters of the model. In the EM algorithm, the goal is to find the parameters that maximize the likelihood.
• Log-Likelihood: It is the logarithm of the likelihood function, which measures the goodness of fit between the observed data and the model. The EM algorithm seeks to maximize the log-likelihood.
• Maximum Likelihood Estimation (MLE): MLE is a method to estimate the parameters of a statistical model by finding the parameter values that maximize the likelihood function, which measures how well the model explains the observed data.
• Posterior Probability: In the context of Bayesian inference, the EM algorithm can be extended to estimate the maximum a posteriori (MAP) estimates, where the posterior probability of the parameters is calculated based on the prior distribution and the likelihood function.
• Expectation (E) Step: The E-step of the EM algorithm computes the expected value or posterior probability of the latent variables given the observed data and current parameter estimates. It involves calculating the probabilities of each latent variable for each data point.
• Maximization (M) Step: The M-step of the EM algorithm updates the parameter estimates by maximizing the expected log-likelihood obtained from the E-step. It involves finding the parameter values that optimize the likelihood function, typically through numerical optimization methods.
• Convergence: Convergence refers to the condition when the EM algorithm has reached a stable solution.

How Expectation-Maximization (EM) Algorithm Works

The essence of the Expectation-Maximization algorithm is to use the available observed data of the
dataset to estimate the missing data and then use that data to update the values of the parameters. Let
us understand the EM algorithm in detail.

EM Algorithm Flowchart

1. Initialization:
o Initially, a set of initial values of the parameters is considered. A set of incomplete observed data is given to the system, with the assumption that the observed data comes from a specific model.
2. E-Step (Expectation Step): In this step, we use the observed data to estimate or guess the values of the missing or incomplete data. It is basically used to update the variables.
o Compute the posterior probability or responsibility of each latent variable given the observed data and current parameter estimates.
o Estimate the missing or incomplete data values using the current parameter estimates.
o Compute the log-likelihood of the observed data based on the current parameter estimates and estimated missing data.
3. M-Step (Maximization Step): In this step, we use the complete data generated in the preceding "Expectation" step to update the values of the parameters. It is basically used to update the hypothesis.
o Update the parameters of the model by maximizing the expected complete data log-likelihood obtained from the E-step.
4. Convergence: In this step, it is checked whether the values are converging or not; if yes, then stop, otherwise repeat step-2 and step-3, i.e., the "Expectation" step and the "Maximization" step, until convergence occurs.
o Check for convergence by comparing the change in log-likelihood or the parameter values between iterations.
o If the change is below a predefined threshold, stop and consider the algorithm converged.
o Otherwise, go back to the E-step and repeat the process until convergence is achieved.
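To make the E-step/M-step loop concrete, here is a small NumPy/SciPy sketch of EM for a two-component one-dimensional Gaussian mixture; the synthetic data, the fixed 100 iterations, and the variable names are illustrative assumptions.

```python
# EM sketch for a 1-D mixture of two Gaussians (synthetic data, fixed number of iterations).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1.0, 300), rng.normal(5, 1.5, 200)])

# Initialization: rough guesses for the mixing weights, means, and standard deviations.
w = np.array([0.5, 0.5])
mu = np.array([x.min(), x.max()])
sigma = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities = posterior probability of each component for each data point.
    dens = np.vstack([w[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)

    # M-step: re-estimate the parameters using the responsibilities as soft counts.
    nk = resp.sum(axis=1)
    w = nk / len(x)
    mu = (resp * x).sum(axis=1) / nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)

print("weights:", w, "means:", mu, "std devs:", sigma)
```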
