Machine Learning
Assignment:
• EM Algorithm
• K-Nearest Neighbors (KNN)
1. Problem Setting:
Incomplete Data: We have a dataset with missing or hidden variables. The goal is to estimate
the parameters of a statistical model that best describes the observed data.
Model Assumption: We assume a probabilistic model that describes how the observed data is
generated from the hidden variables and model parameters.
4. Iterative Process:
Iterate: Alternate between the E-step and M-step until the algorithm converges, i.e., until the
parameter estimates stabilize and no longer change significantly between iterations.
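As a concrete illustration, the following is a minimal sketch of this E-step/M-step loop for a two-component Gaussian mixture in one dimension, written in Python with NumPy; the initialization, the tolerance, and the iteration limit are illustrative assumptions rather than part of the algorithm itself.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of a univariate normal distribution, evaluated elementwise
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, n_iter=100, tol=1e-6):
    # Illustrative initial parameters for a two-component mixture
    pi, mu1, mu2, s1, s2 = 0.5, x.min(), x.max(), x.std(), x.std()
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior responsibility of component 1 for each point
        p1 = pi * gaussian_pdf(x, mu1, s1)
        p2 = (1 - pi) * gaussian_pdf(x, mu2, s2)
        r = p1 / (p1 + p2)
        # M-step: re-estimate parameters from the responsibilities
        pi = r.mean()
        mu1 = np.sum(r * x) / np.sum(r)
        mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
        s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / np.sum(r))
        s2 = np.sqrt(np.sum((1 - r) * (x - mu2) ** 2) / np.sum(1 - r))
        # Stop when the log-likelihood no longer changes significantly
        ll = np.sum(np.log(p1 + p2))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu1, mu2, s1, s2, ll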
6. Considerations:
Initialization: The performance of the EM algorithm can be sensitive to the initial parameter
values, and it is often beneficial to use multiple initializations or more sophisticated techniques
like EM with random restarts to avoid local optima.
Convergence: While the EM algorithm is guaranteed to converge to a local maximum of the
likelihood function, it may not always find the global maximum, so careful initialization and
multiple runs may be required to achieve good results.
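A minimal sketch of such random restarts is shown below; run_em is a hypothetical stand-in for any single EM fit (for example, a seeded variant of the sketch above) that returns the fitted parameters together with the final log-likelihood.

import numpy as np

def em_with_restarts(x, run_em, n_restarts=10, seed=0):
    # run_em is a hypothetical callable: run_em(x, rng) -> (params, log_likelihood)
    rng = np.random.default_rng(seed)
    best_params, best_ll = None, -np.inf
    for _ in range(n_restarts):
        params, ll = run_em(x, rng)
        # Keep the fit with the highest log-likelihood across restarts
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll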
Overall, the EM algorithm provides a powerful framework for estimating the parameters of
complex probabilistic models in machine learning, particularly in situations involving missing or
hidden variables, and it is widely used in various applications like clustering, latent variable
modeling, and missing data imputation.
K-Nearest Neighbors
The K-Nearest Neighbors (KNN) algorithm is a robust and intuitive machine learning method
employed to tackle classification and regression problems. By capitalizing on the concept of
similarity, KNN predicts the label or value of a new data point by considering its K closest
neighbours in the training dataset. In this article, we will look at this supervised learning
algorithm, the k-Nearest Neighbours, and highlight its user-friendly nature.
K-Nearest Neighbours is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and is widely applied in
pattern recognition, data mining, and intrusion detection.
The K-NN algorithm is a versatile and widely used machine learning method, valued primarily for
its simplicity and ease of implementation. It does not require any assumptions about the
underlying data distribution and can handle both numerical and categorical features, making it a
flexible choice for many kinds of classification and regression datasets. It is a non-parametric
method that makes predictions based on the similarity of data points in a given dataset. With a
suitably large K, it can also be fairly robust to noisy training points, since each prediction is
based on several neighbours rather than a single one.
The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a
distance metric, such as Euclidean distance. The class or value of the data point is then
determined by the majority vote or average of the K neighbors. This approach allows the
algorithm to adapt to different patterns and make predictions based on the local structure of the
data.
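The following is a minimal sketch of this lookup for the classification case, using Euclidean distance and a majority vote; the function name, the choice of k, and the NumPy-based implementation are illustrative assumptions.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbours' labels (classification case)
    return Counter(y_train[nearest]).most_common(1)[0][0]

For example, with training points (0, 0) and (1, 1) labeled 0 and (5, 5) labeled 1, the query (0.5, 0.5) with k = 2 is assigned label 0.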
As we know, the KNN algorithm helps us identify the nearest points or groups for a query point.
But to determine the closest groups or nearest points, we need some distance metric. For this
purpose, we use the metrics below; a short code sketch of both follows their definitions:
1. Euclidean Distance:
This is simply the Cartesian distance between two points in the plane or hyperplane. Euclidean
distance can also be visualized as the length of the straight line joining the two points under
consideration. This metric corresponds to the net displacement between two states of an object.
2. Manhattan Distance:
The Manhattan distance metric is generally used when we are interested in the total distance
traveled by the object rather than its displacement. It is calculated by summing the absolute
differences between the coordinates of the points in n dimensions.
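A small sketch of both metrics as Python functions over n-dimensional points; the example coordinates are illustrative.

import numpy as np

def euclidean_distance(p, q):
    # Straight-line (L2) distance between two points in n dimensions
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan_distance(p, q):
    # Sum of absolute coordinate differences (L1 distance)
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

# Illustrative example with points (1, 2) and (4, 6):
# euclidean_distance((1, 2), (4, 6)) -> 5.0
# manhattan_distance((1, 2), (4, 6)) -> 7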
3. Workings of the KNN Algorithm:
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the
label or value of a new data point by considering the labels or values of its K nearest
neighbors in the training dataset.
To make predictions, the algorithm calculates the distance between each new data point in the
test dataset and all the data points in the training dataset. The Euclidean distance is a commonly
used distance metric in K-NN, but other distance metrics, such as Manhattan distance or
Minkowski distance, can also be used depending on the problem and the data. Once the distances
between the new data point and all the data points in the training dataset are calculated, the
algorithm proceeds to find the K nearest neighbors based on these distances. The specific
method for selecting the nearest neighbors can vary, but a common approach is to sort the
distances in ascending order and choose the K data points with the shortest distances.
After identifying the K nearest neighbors, the algorithm makes predictions based on the labels
or values associated with these neighbors. For classification tasks, the majority class among the
K neighbors is assigned as the predicted label for the new data point. For regression tasks, the
average or weighted average of the values of the K neighbors is assigned as the predicted value.
Let X be the training dataset with n data points, where each data point is represented by a
d-dimensional feature vector X_i, and let Y be the corresponding labels or values for each data
point in X. Given a new data point x, the algorithm calculates the distance between x and each
data point X_i in X.
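Mirroring this notation, the sketch below treats X as an n-by-d NumPy array, Y as the vector of target values, and x as a d-dimensional query, and returns the average of the K nearest values (the regression case); the function name and the default k are illustrative.

import numpy as np

def knn_regress(X, Y, x, k=5):
    # Distance from the query x to every row X_i of the training matrix X
    dists = np.linalg.norm(X - x, axis=1)
    # Indices of the K nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Regression: predicted value is the mean of the neighbours' values
    return Y[nearest].mean()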
Pros: Simple, easy to understand, and doesn't make strong assumptions about the underlying
data distribution.
Cons: Computationally expensive for large datasets, sensitive to irrelevant features, and may
struggle with high-dimensional data.
Normalization:
Scaling features can be important for KNN since it's distance-based. Features with larger scales
might dominate the distance calculations.
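A brief sketch of feature scaling before KNN, assuming scikit-learn is available; the toy dataset and the choice of k are illustrative.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training data: one feature in metres, one in thousands of dollars
X_train = np.array([[1.6, 30.0], [1.8, 90.0], [1.7, 60.0]])
y_train = np.array([0, 1, 0])

# Standardize each feature to zero mean and unit variance so that
# the large-scale feature does not dominate the distance calculation
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X_train), y_train)

X_new = np.array([[1.75, 70.0]])
prediction = knn.predict(scaler.transform(X_new))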
KNN is often used in scenarios where the decision boundaries are complex and not easily
defined by a mathematical formula. It's a non-parametric and instance-based learning
algorithm, meaning it doesn't explicitly learn a model during training; instead, it memorizes the
training data and makes predictions based on similarity in the input space.