
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

UNIT IV ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING

PART - A
1. What is bagging and boosting?
Bagging is a way to decrease the variance of predictions by generating additional training data from the original dataset, using sampling with replacement to produce multiple resampled versions of the data. Boosting is an iterative technique that adjusts the weight of each observation based on the most recent classification, so that misclassified observations receive more weight in the next round.

2. What is stacking?
Stacking is one of the most popular ensemble machine learning techniques; it combines the predictions of multiple base models to build a new model and improve performance. Stacking trains multiple models to solve the same problem and, based on their combined output, builds a new model with improved performance.

3. Which are the main classes of ensemble learning methods?
The three main classes of ensemble learning methods are bagging, stacking, and boosting. It is important both to have a detailed understanding of each method and to consider them on your predictive modeling project.

4. Why are ensembles used over single models?
There are two main, related reasons to use an ensemble over a single model: Performance: an ensemble can make better predictions and achieve better performance than any single contributing model. Robustness: an ensemble reduces the spread or dispersion of the predictions and of model performance.

5. What is a voting classifier?
A voting classifier is a machine learning estimator that trains several base models or estimators and predicts by aggregating the findings of each base estimator. The aggregation can be a combined decision based on the vote cast for each estimator's output.

6. What types of classifiers are combined in a performance-weighted voting model?
The performance-weighted-voting model integrates five classifiers, including logistic regression, SVM, random forest, XGBoost and neural networks. Cross-validation is first used to obtain the predicted results of the five classifiers.

7. What is the advantage of Gaussian Mixture Models over K-Means?
K-Means is a simple and fast clustering method, but it may not truly capture the heterogeneity inherent in cloud workloads. Gaussian Mixture Models can discover complex patterns and group them into cohesive, homogeneous components that are close representatives of the real patterns within the data set.

8. What are Gaussian mixture models? How is expectation maximization used in them?
Expectation maximization provides an iterative solution to maximum likelihood estimation with latent variables. Gaussian mixture models are an approach to density estimation where the parameters of the distributions are fit using the expectation-maximization algorithm.
9. What is K-means unsupervised learning?
K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means divides objects into clusters that share similarities and are dissimilar to the objects belonging to other clusters. The term 'K' is a number: it specifies how many clusters the objects are divided into.

10. What is the difference between K-means and KNN?
KNN is a supervised learning algorithm mainly used for classification problems, whereas K-Means (aka K-means clustering) is an unsupervised learning algorithm. K in K-Means refers to the number of clusters, whereas K in KNN is the number of nearest neighbors (based on the chosen distance metric).

11. What is the expectation maximization algorithm used for?
The EM algorithm is used to find (local) maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations.

12. What is the advantage of Gaussian processes?
Gaussian processes are a powerful algorithm for both regression and classification. Their greatest practical advantage is that they can give a reliable estimate of their own uncertainty.

13. What are examples of unsupervised learning?
Some examples of unsupervised learning algorithms include K-Means Clustering, Principal Component Analysis and Hierarchical Clustering.

14. How do you implement the expectation maximization algorithm?
The two steps of the EM algorithm are:
E-step: perform probabilistic assignments of each data point to some class based on the current hypothesis h for the distributional class parameters;
M-step: update the hypothesis h for the distributional class parameters based on the new data assignments.

15. What is the principle of maximum likelihood?
The principle of maximum likelihood is a method of obtaining the optimum values of the parameters that define a model; in doing so, you maximize the likelihood that your model matches the “true” model.
PART B

1. Explain the various ensemble learning techniques?

Ensemble methods are techniques that aim at improving the accuracy of results in models by combining
multiple models instead of using a single model. The combined models increase the accuracy of the
results significantly. This has boosted the popularity of ensemble methods in machine learning.
Categories of Ensemble Methods

Ensemble methods fall into two broad categories, i.e., sequential ensemble techniques and parallel ensemble
techniques. Sequential ensemble techniques generate base learners in a sequence, e.g., Adaptive
Boosting (AdaBoost). The sequential generation of base learners promotes dependence between the
base learners. The performance of the model is then improved by assigning higher weights to previously
misclassified observations.
• In parallel ensemble techniques, base learners are generated in a parallel format, e.g.,
random forest. Parallel methods use the parallel generation of base learners to encourage
independence between them. The independence of base learners significantly reduces the
error when their predictions are averaged.

• The majority of ensemble techniques apply a single algorithm in base learning, which results in
homogeneity in all base learners. Homogenous base learners refer to base learners of the same type, with
similar qualities. Other methods apply heterogeneous base learners, giving rise to heterogeneous
ensembles. Heterogeneous base learners are learners of distinct types.

Main Types of Ensemble Methods

Bagging

➢ Bagging, the short form for bootstrap aggregating, is mainly applied in classification and
regression. It increases the accuracy of models, typically by aggregating many decision trees, and reduces
variance to a large extent. The reduction of variance increases accuracy and curbs overfitting, which is a
challenge for many predictive models.
➢ Bagging consists of two steps, i.e., bootstrapping and aggregation. Bootstrapping is a
sampling technique where samples are drawn from the whole population (set) with replacement.
Sampling with replacement makes the selection procedure randomized. The base learning algorithm is
then run on each of the samples to complete the procedure.
➢ Aggregation in bagging is done to combine all possible outcomes of the prediction and to
average out the randomness. Without aggregation, predictions will not be accurate, because not all
outcomes are taken into consideration. The aggregation is therefore based on the probabilities produced
by the bootstrapped models, or on the full set of outcomes of the predictive models.

Bagging is advantageous since weak base learners are combined to form a single strong learner that is more
stable than the individual learners. It also reduces variance, thereby limiting the overfitting of models.
One limitation of bagging is that it is computationally expensive; it can also introduce more bias into
models when the proper bagging procedure is ignored.
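A minimal bagging sketch in Python is shown below, assuming scikit-learn is available; the synthetic data set and the choice of 50 estimators are illustrative assumptions, not part of the original material.

# Bagging sketch: 50 bootstrap-resampled decision trees, aggregated by voting.
# (BaggingClassifier uses a decision tree as its default base learner.)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=True (the default) draws each tree's training sample with replacement.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print("Bagging test accuracy:", bagging.score(X_test, y_test))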

Boosting

➢ Boosting is an ensemble technique that learns from previous predictor mistakes to make better
predictions in the future. The technique combines several weak base learners to form one strong learner,
thus significantly improving the predictability of models. Boosting works by arranging weak learners in
a sequence, such that each weak learner learns from the mistakes of the previous learner in the sequence
to create better predictive models.
➢ Boosting takes many forms, including gradient boosting, Adaptive Boosting (AdaBoost), and
XGBoost (Extreme Gradient Boosting). AdaBoost uses weak learners in the form of decision trees,
which mostly contain a single split and are popularly known as decision stumps. AdaBoost's first decision
stump is built with all observations carrying equal weights.
➢ Gradient boosting adds predictors sequentially to the ensemble, where each new predictor
corrects the errors of its predecessors, thereby increasing the model’s accuracy. New predictors are fit to
counter the effects of errors in the previous predictors. Gradient descent helps the gradient booster identify
problems in the learners’ predictions and counter them accordingly.
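A comparable AdaBoost sketch follows; scikit-learn is again assumed, and by default its AdaBoostClassifier uses depth-1 decision trees (the decision stumps described above) as weak learners. The data and number of estimators are arbitrary illustrative choices.

# AdaBoost sketch: sequential decision stumps, with observation weights
# re-adjusted each round so later stumps focus on earlier mistakes.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)  # default weak learner: a decision stump
ada.fit(X, y)
print("AdaBoost training accuracy:", ada.score(X, y))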

Stacking

Stacking, another ensemble method, is often referred to as stacked generalization. This technique works by
allowing a training algorithm to combine the predictions of several other, similar learning algorithms. Stacking
has been successfully implemented in regression, density estimation, distance learning, and
classification. It can also be used to measure the error rate involved during bagging.
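A brief stacking sketch is given below, assuming scikit-learn; the choice of base learners (random forest and SVM) and of logistic regression as the meta-learner is illustrative, not prescribed by the text.

# Stacking sketch: cross-validated predictions of the base learners become the
# input features of a final meta-model (stacked generalization).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
    ("svm", SVC(probability=True, random_state=1)),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(), cv=5)
stack.fit(X, y)
print("Stacking training accuracy:", stack.score(X, y))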

Variance Reduction

Ensemble methods are ideal for reducing the variance in models, thereby increasing the accuracy of
predictions. The variance is reduced when multiple models are combined to form a single prediction
that is chosen from all the possible predictions of the combined models. An ensemble of models
combines various models to ensure that the resulting prediction is the best possible, based on the
consideration of all predictions.

Simple Ensemble Techniques

In this section, we will look at a few simple but powerful techniques, namely (a short NumPy sketch of these follows the list):

1. Max Voting
2. Averaging
3. Weighted Averaging
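The NumPy sketch below illustrates the three rules on hypothetical model outputs; all numbers are made up for the example.

import numpy as np

# Max voting: class labels predicted by three models for four samples.
preds = np.array([[1, 0, 1],
                  [0, 0, 1],
                  [1, 1, 1],
                  [0, 1, 0]])
max_vote = np.array([np.bincount(row).argmax() for row in preds])  # majority label per sample

# Averaging: predicted probabilities from the same three models for two samples.
probs = np.array([[0.9, 0.4, 0.7],
                  [0.2, 0.3, 0.6]])
simple_avg = probs.mean(axis=1)

# Weighted averaging: give the first model twice the weight of the others (weights are arbitrary).
weights = np.array([0.5, 0.25, 0.25])
weighted_avg = probs @ weights

print(max_vote, simple_avg, weighted_avg)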
2. Explain in detail about k-means algorithm?

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems in
machine learning or data science. In this topic, we will learn what is K-means clustering algorithm, how
the algorithm works, along with the Python implementation of k-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process; for example, if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.

➢ It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each dataset belongs to only one group that has similar properties.
➢ It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data points and their
corresponding clusters.
➢ It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in the unlabeled dataset on its own, without the need for any training.

➢ The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k should
be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters. The below
diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps (a from-scratch sketch follows them):

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They need not be points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
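A from-scratch NumPy sketch of these steps is given below; the data, the choice K=2 and the stopping rule are illustrative assumptions (in practice sklearn.cluster.KMeans would normally be used).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # unlabeled dataset
K = 2                                              # Step-1: number of clusters

# Step-2: pick K random points from the dataset as the initial centroids.
centroids = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(100):
    # Step-3: assign each data point to its closest centroid (Euclidean distance).
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step-4: place the new centroid of each cluster at the mean of its points.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    # Step-6: if nothing changed, the model is ready (Step-7); otherwise repeat (Step-5).
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Final centroids:\n", centroids)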

Let's understand the above steps by considering the visual plots:

o Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below: Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster. These points can be
either the points from the dataset or any other point. So, here we are selecting the below two points as k
points, which are not the part of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute it
by applying some mathematics that we have studied to calculate the distance between two points. So, we
will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are near the K1 or blue
centroid, and points to the right of the line are close to the yellow centroid. Let's color them
blue and yellow for clear visualization. As we need to find the closest clusters, we repeat the
process by choosing new centroids. To choose the new centroids, we compute the center of
gravity of the points in each cluster and move the centroids there. Next, we reassign each
datapoint to the new centroids by repeating the same process of finding a median line.

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on the highly efficient clusters
that it forms, but choosing the optimal number of clusters is a big task. There are several
ways to find the optimal number of clusters; the most commonly used is the elbow method,
sketched below:
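The sketch below assumes scikit-learn and matplotlib and uses synthetic data; it plots the within-cluster sum of squares (WCSS, exposed as inertia_ in scikit-learn) against K, and the "elbow" of the curve suggests a good value of K.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)   # synthetic data

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    wcss.append(km.inertia_)                                  # within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method for choosing K")
plt.show()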

3. Explain in detail about the KNN algorithm?

K-Nearest Neighbor(KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on


Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
the similarity. This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and, at the time of classification, performs an action
on the dataset.
o The KNN algorithm, at the training phase, just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to the new data.

o Example: Suppose we have an image of a creature that looks similar to a cat and a dog,
but we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the
new data that are similar to the cat and dog images and, based on the most similar features, it will
put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1; in which of these categories will this data point lie? To solve this type of problem,
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or
class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. It
can be calculated as: d = √((x2 − x1)² + (y2 − y1)²).

o By calculating the Euclidean distance we find the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below image:

o As we can see, the 3 nearest neighbors are from category A; hence this new data point
must belong to category A. A short scikit-learn sketch of this procedure follows.
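The sketch below assumes scikit-learn; the two-feature synthetic data set and the query point are hypothetical, and k=5 matches the walk-through above.

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=3)

# k=5 neighbors with Euclidean distance (scikit-learn's default metric with p=2).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)                       # "training" just stores the data (lazy learner)

new_point = [[0.5, -0.2]]           # a hypothetical new data point
print("Predicted category:", knn.predict(new_point))
print("Classes of the 5 nearest neighbors:", y[knn.kneighbors(new_point)[1][0]])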

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
o Larger values for K are generally more robust to noise, but a value that is too large can
blur the boundaries between categories.

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because the distance between the new data point and all
the training samples must be calculated.
4. Explain in detail about Gaussian mixture models and expectation maximization?

EM algorithm in GMM
In statistics, EM (expectation maximization) algorithm handles latent variables, while GMM
is the Gaussian mixture model.

✓ Gaussian mixture models (GMMs) are a type of machine learning algorithm. They are
used to classify data into different categories based on probability distributions. Gaussian
mixture models can be used in many different areas, including finance, marketing and many
more.

✓ Gaussian Mixture Models (GMMs) give us more flexibility than K-Means. With
GMMs we assume that the data points are Gaussian distributed; this is a less restrictive
assumption than saying they are circular, as implied by using only the mean. That way, we have two
parameters to describe the shape of the clusters: the mean and the standard deviation!

✓ Taking an example in two dimensions, this means that the clusters can take any kind of
elliptical shape (since we have a standard deviation in both the x and y directions). Thus, each
Gaussian distribution is assigned to a single cluster. In order to find the parameters of the
Gaussian for each cluster (e.g., the mean and standard deviation) we use an optimization
algorithm called Expectation–Maximization (EM). With the Gaussians fitted to the clusters, we
can then proceed to the process of Expectation–Maximization clustering using GMMs.

✓ Gaussian mixture models (GMM) are a probabilistic concept used to model real-world
data sets. GMMs are a generalization of Gaussian distributions and can be used to represent
any data set that can be clustered into multiple Gaussian distributions. The Gaussian mixture
model is a probabilistic model that assumes all the data points are generated from a mix of
Gaussian distributions with unknown parameters.
✓ A Gaussian mixture model can be used for clustering, which is the task of grouping a
set of data points into clusters. GMMs can be used to find clusters in data sets where the
clusters may not be clearly defined. Additionally, GMMs can be used to estimate the
probability that a new data point belongs to each cluster. Gaussian mixture models are also
relatively robust to outliers, meaning that they can still yield accurate results even if there are
some data points that do not fit neatly into any of the clusters. This makes GMMs a flexible
and powerful tool for clustering data.

✓ It can be understood as a probabilistic model where a Gaussian distribution is assumed
for each group, with a mean and covariance that define its parameters. A GMM's parameters
thus consist of two parts – mean vectors (μ) and covariance matrices (Σ). A Gaussian distribution is
defined as a continuous probability distribution that takes on a bell-shaped curve. Another
name for the Gaussian distribution is the normal distribution.

✓ GMM has many applications, such as density estimation, clustering, and image
segmentation. For density estimation, GMM can be used to estimate the probability density
function of a set of data points. For clustering, GMM can be used to group together data
points that come from the same Gaussian distribution. And for image segmentation, GMM
can be used to partition an image into different regions.

✓ Gaussian mixture models can be used for a variety of use cases, including identifying
customer segments, detecting fraudulent activity, and clustering images. In each of these
examples, the Gaussian mixture model is able to identify clusters in the data that may not be
immediately obvious. As a result, Gaussian mixture models are a powerful tool for data
analysis and should be considered for any clustering task.

Expectation-maximization (EM) method in relation to GMM


In Gaussian mixture models, the expectation-maximization method is a powerful tool for
estimating the parameters of a Gaussian mixture model (GMM). The expectation step is termed E
and the maximization step is termed M. The E-step is used to find, for each data point, the expected
responsibility of each Gaussian component of the mixture. The M-step then re-estimates the parameters
of each component (weights, means and covariances) so as to maximize the likelihood given those
responsibilities.

✓ The expectation-maximization method is a two-step iterative algorithm that alternates
between an expectation step, in which we compute the expected component memberships of each data
point using the current parameter estimates, and a maximization step, in which we update the parameters
of each Gaussian based on the maximum-likelihood estimates derived from those memberships.

✓ The EM method works by first initializing the parameters of the GMM and then iteratively
improving these estimates. At each iteration, the expectation step calculates the expectation of
the log-likelihood function with respect to the current parameters. This expectation is then
maximized in the maximization step. The process is repeated until convergence.

The EM algorithm consists of two steps: the E-step and the M-step. First, the model
parameters and the latent variables are initialized (for example, randomly). In the E-step, the
algorithm estimates the values of the latent variables based on the current parameters, while in
the M-step, the algorithm updates the values of the model parameters based on the estimates
from the E-step. These two steps are repeated until convergence is reached.

Optimization uses the Expectation–Maximization algorithm, which alternates between two steps (a short scikit-learn sketch follows the list):
1. E-step: Compute the posterior probability over z given our current model - i.e.
how much we think each Gaussian generates each datapoint.
2. M-step: Assuming that the data really was generated this way, change the
parameters of each Gaussian to maximize the probability that it would generate the data it is
currently responsible for.
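The sketch below fits a two-component GMM with scikit-learn (assumed installed); GaussianMixture runs the E- and M-steps internally until convergence, and the data are synthetic and illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussian blobs (illustrative only).
X = np.vstack([rng.normal(-3, 1, size=(150, 2)),
               rng.normal(3, 1, size=(150, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                                   # EM: alternate E-step and M-step until convergence

print("Component means:\n", gmm.means_)
print("Posterior responsibilities of the first point:", gmm.predict_proba(X[:1]))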

The K-Means Algorithm:


1. Assignment step: Assign each data point to the closest cluster
2. Refitting step: Move each cluster center to the center of gravity of the data assigned to it

The EM Algorithm:
1. E-step: Compute the posterior probability over z given our current model.
2. M-step: Change the parameters of each Gaussian to maximize the probability that it would
generate the data it is currently responsible for.
A from-scratch sketch of one EM loop is given below.
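To make the E-step / M-step alternation concrete, here is a from-scratch sketch of EM for a one-dimensional, two-component GMM using NumPy and SciPy; the data, the initial guesses and the iteration count are arbitrary illustrative choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])   # synthetic 1-D data

# Initial guesses for mixing weights, means and standard deviations.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point (soft assignment).
    dens = pi * norm.pdf(x[:, None], mu, sigma)         # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means and variances from the soft assignments.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("Mixing weights:", pi)
print("Means:", mu)
print("Standard deviations:", sigma)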

5. Explain the difference between K-means and KNN.


1. Purpose and Type of Algorithm:
K-means:
o Type: Unsupervised Learning
o Purpose: K-means is a clustering algorithm, used to group data points into K clusters
based on their similarity. It tries to minimize the variance within each cluster by
assigning data points to the nearest centroid.
o Use Case: Typically used for data segmentation, market segmentation, image
compression, or grouping similar items in an unsupervised manner.
K-nearest Neighbors (KNN):
o Type: Supervised Learning
o Purpose: KNN is a classification (or regression) algorithm, where you classify (or
predict the value of) a data point based on the labels of its K nearest neighbors in the
feature space. The class with the majority of neighbors gets assigned to the data point.
o Use Case: Used for classification tasks (e.g., spam detection, image recognition) or
regression tasks (e.g., predicting house prices).

2. How They Work:


K-means:
o The algorithm first randomly selects K initial centroids (one for each cluster).
o It then assigns each data point to the closest centroid based on distance (typically
Euclidean distance).
o After assigning all points to clusters, the centroids are recalculated as the mean of all
data points in the cluster.
o This process of assigning points and recalculating centroids repeats until the centroids
stabilize (i.e., the assignment of points doesn’t change).
o The goal is to minimize the within-cluster variance (distance between data points
and their centroids).
KNN:
o The algorithm stores the entire training dataset in memory.
o To classify a new data point, KNN calculates the distance (often Euclidean distance)
between the new point and all points in the training set.
o It then selects the K nearest neighbors (data points with the smallest distances).
o For classification, the majority class among the K nearest neighbors is assigned to the
new data point. For regression, the output is the average (or weighted average) of the
values of the K nearest neighbors.
o KNN does not involve training. It’s a lazy learner, meaning that the computation
happens at the time of prediction, not during training. (A brief usage sketch contrasting the
two algorithms follows.)
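The short sketch below contrasts the two APIs in scikit-learn (assumed installed): K-Means is fit on the features alone, while KNN requires the labels; the blob data set is synthetic and illustrative.

from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, random_state=5)

# K-means: the labels y are ignored; the algorithm discovers 3 clusters on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=5).fit(X)
print("Cluster of a new point:", km.predict([[0.0, 0.0]]))

# KNN: a supervised classifier that needs the true labels y at fit time.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("Predicted class of the same point:", knn.predict([[0.0, 0.0]]))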

3. Training vs Prediction:
K-means:
o Training Phase: The algorithm performs clustering during training by iterating over
the dataset and adjusting the centroids. Once the centroids are fixed, the model is
ready for use.
o Prediction Phase: For new data, you predict which cluster the data belongs to by
computing the distance from the data point to the centroids.
KNN:
o Training Phase: There is no explicit training phase in KNN. It simply stores the
entire training dataset.
o Prediction Phase: Predictions happen during the test phase by comparing the new
input to the stored dataset.

4. Output:
K-means:
o The output is a set of clusters. Each data point is assigned to a cluster, and the
centroids represent the "center" of each cluster.
KNN:
o The output is a label or value for a data point (in classification, it’s the predicted
class; in regression, it’s the predicted continuous value).

5. Parameters:
K-means:
o K (number of clusters): The number of clusters is a hyperparameter that needs to be
set beforehand.
o Distance metric: Typically Euclidean distance, but other metrics can be used (e.g.,
Manhattan distance).
KNN:
o K (number of neighbors): The number of nearest neighbors to consider for
classification or regression.
o Distance metric: Commonly Euclidean distance, but other options like Manhattan
distance can also be used.
o Weighting: Option to weight neighbors differently (e.g., giving closer neighbors more
weight).

6. Computational Complexity:
K-means:
o Training Complexity: The complexity of K-means is roughly O(n·K·d), where n is the
number of data points, K is the number of clusters, and d is the number of features.
o Prediction Complexity: Once trained, predicting the cluster of a new data point
involves O(K·d), as it requires calculating the distance to all K centroids.
KNN:
o Training Complexity: There is no training phase, so the complexity is O(1).
o Prediction Complexity: The prediction for each data point is O(n·d), where n is the
number of data points and d is the number of features. This is because KNN computes the
distance to all points in the training set.

7. Memory and Speed:


K-means:
o Memory: K-means generally requires storing the centroids and the assignment of each
point to a cluster, so memory usage is relatively low.
o Speed: It is faster during prediction since the clusters are already formed and the
model just needs to calculate the nearest centroid.
KNN:
o Memory: KNN requires storing the entire training dataset, which can be memory-
intensive, especially with large datasets.
o Speed: KNN can be slower during prediction, especially if the dataset is large, as it
computes the distance to every training sample.

8. Handling of Data:
K-means:
o Data Type: K-means requires numerical data because it relies on distance
calculations (Euclidean distance is commonly used).
o Feature Scaling: K-means is sensitive to the scale of features, so feature scaling (like
normalization or standardization) is often needed before applying K-means.
KNN:
o Data Type: KNN also requires numerical data for distance computation.
o Feature Scaling: Similar to K-means, KNN is also sensitive to feature scaling.
Features should be scaled so that one feature does not dominate the distance
calculation due to its larger scale.

9. Sensitivity to Outliers:
K-means: K-means is sensitive to outliers since outliers can significantly affect the position
of the centroids and the overall clustering result.
KNN: KNN can also be sensitive to outliers because outliers can skew the majority voting
process when determining the nearest neighbors.

Aspect | K-means | KNN
Type of Algorithm | Unsupervised (Clustering) | Supervised (Classification/Regression)
Objective | Cluster data into K groups | Classify or predict based on K nearest points
Training Phase | Yes (centroid update) | No (stores data)
Prediction Phase | Assign new points to clusters | Find K nearest neighbors, classify/predict
Parameter | K (number of clusters) | K (number of neighbors)
Output | Cluster assignments | Predicted class or value
Computational Complexity | O(n·K·d) during training | O(n·d) during prediction
Memory Usage | Low (stores centroids and cluster labels) | High (stores entire dataset)
Scalability | Scales well with large datasets | Not ideal for very large datasets
