K-Nearest Neighbors (KNN) Algorithm
Select the Number K of Neighbors: Decide on the value of K, which is the number of nearest
neighbors to consider for classification.
Calculate Distances: Compute the distance (usually Euclidean) between the new data point and
each point in the training dataset.
Identify K Nearest Neighbors: Select the K data points that are closest to the new data point based
on the calculated distances.
Count the Categories: For the K nearest neighbors, count the number of occurrences of each
category or class label.
Assign the New Data Point: Assign the new data point to the category with the majority vote
among the K nearest neighbors.
Model is Ready: The model is ready to classify new data points based on the majority vote of the K
nearest neighbors.
This sequence ensures that the new data point is classified according to the most common class among its
K nearest neighbors.
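The steps above can be sketched in plain Python. This is a minimal illustration of the voting procedure, not an optimized implementation; the toy dataset and function name are made up for the example.

```python
from collections import Counter
import math

def knn_classify(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors.

    train: list of (features, label) pairs; features are equal-length tuples.
    """
    # Step 2: compute the Euclidean distance to every training point.
    distances = [
        (math.dist(features, new_point), label) for features, label in train
    ]
    # Step 3: keep the k closest points.
    nearest = sorted(distances)[:k]
    # Steps 4-5: count class labels and return the majority class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy dataset: two well-separated classes.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2), k=3))  # -> "A"
```

Note that `k` is usually chosen as an odd number for two-class problems so that the vote cannot tie.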
In K-means clustering, updating the centers involves recalculating the position of each cluster’s center
(also called the centroid) after assigning data points to clusters.
Steps involved:
1. Assign each data point to its nearest cluster center.
2. For each cluster, compute the mean position of all points assigned to it.
3. Move the cluster center to this mean position.
This process is repeated iteratively until the cluster centers stabilize and no longer change significantly, or
a predefined number of iterations is reached.
Imagine you have a group of people (data points) standing in different locations (clusters).
Each group has a leader (cluster center).
To find a new leader, you calculate the average position of everyone in the group.
The new leader (center) is now at this average position, and the process helps ensure that the leader
is in the best possible spot to represent the group.
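The center-update step can be sketched as follows (a minimal illustration; the function name and toy data are made up for the example):

```python
def update_centers(points, assignments, k):
    """Recompute each cluster center as the mean of its assigned points."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for point, cluster in zip(points, assignments):
        counts[cluster] += 1
        for d in range(dims):
            sums[cluster][d] += point[d]
    # Mean position per cluster; an empty cluster keeps a zero center here.
    return [
        [s / counts[c] if counts[c] else 0.0 for s in sums[c]]
        for c in range(k)
    ]

points = [(1, 1), (2, 2), (9, 9), (10, 10)]
assignments = [0, 0, 1, 1]
print(update_centers(points, assignments, 2))  # -> [[1.5, 1.5], [9.5, 9.5]]
```

In full K-means, this update alternates with reassigning points to their nearest center until the centers stop moving.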
Difference between KNN and K-Means
Purpose: KNN predicts the class (or value) of a new data point; K-Means groups unlabeled data into K clusters.
Type of Learning: KNN is supervised (it requires labeled training data); K-Means is unsupervised (it works on unlabeled data).
Density-Based Clustering:
Definition: Groups data based on how dense (crowded) the data points are.
How It Works: Finds clusters in areas where data points are closely packed together. It can handle
clusters of different shapes and is good at finding outliers (points that don’t fit any cluster).
Challenge: Can be tricky with data that has uneven density or many dimensions.
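The density idea can be sketched with a minimal DBSCAN-style routine in plain Python. This is a simplified teaching sketch (the function name and parameters `eps` and `min_pts` follow DBSCAN convention; the toy data is made up), not a production implementation.

```python
import math

def dbscan(points, eps=1.5, min_pts=3):
    """Minimal DBSCAN sketch: labels[i] is a cluster id, or -1 for outliers."""
    n = len(points)
    labels = [None] * n          # None = not yet visited
    cluster = -1

    def neighbors(i):
        # All points within radius eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:      # not dense enough: mark as noise
            labels[i] = -1
            continue
        cluster += 1                  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:       # noise can be claimed as a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(more)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(dbscan(pts))  # dense square -> one cluster; (10, 10) -> outlier -1
```

The isolated point (10, 10) has too few neighbors to be dense, so it is labeled -1 (an outlier), illustrating why density-based methods are good at spotting points that fit no cluster.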
Distribution-Based Clustering:
Definition: Groups data by assuming that the data follows specific statistical distributions.
Common Distribution: Gaussian (bell-shaped curve).
Example: Gaussian Mixture Models (GMM) with Expectation-Maximization.
How It Works: Assumes that the data comes from a mix of several distributions. It tries to estimate
the best-fit distributions to form clusters.
Hierarchical Clustering:
Definition: Builds clusters in a tree-like structure without needing to decide the number of clusters
beforehand.
Types:
o Agglomerative: Starts with individual data points and combines them into larger clusters.
o Divisive: Starts with one big cluster and splits it into smaller ones.
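The agglomerative variant can be sketched in a few lines of plain Python using single linkage (merging the two clusters whose closest members are nearest). This is a simplified teaching sketch with made-up names and toy data, not an efficient implementation.

```python
import math

def agglomerative(points, n_clusters=2):
    """Single-linkage agglomerative clustering sketch.

    Start with each point in its own cluster, then repeatedly merge the two
    closest clusters until only n_clusters remain.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a].extend(clusters.pop(b))
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(pts, n_clusters=2))  # -> [[0, 1], [2, 3]]
```

Recording the order of merges yields the tree-like structure (dendrogram) mentioned above; cutting the tree at different heights gives different numbers of clusters.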
Pearson's correlation coefficient (often denoted as Pearson's r) is one of the crucial factors to
consider when assessing the appropriateness of regression analysis. Pearson's r measures the
strength and direction of the linear relationship between two continuous variables.
The requirements when considering the use of Pearson's correlation coefficient are:
1. Scale of measurement should be interval or ratio.
2. The association should be linear.
3. There should be no outliers in the data.
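Pearson's r can be computed directly from its definition, the covariance of the two variables divided by the product of their standard deviations. A minimal sketch (the function name and data are made up for the example):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Numerator: sum of co-deviations from the means.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # Denominator: product of the deviation magnitudes of each variable.
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]        # perfectly linear in x
print(pearson_r(x, y))       # -> 1.0
```

Values near +1 or -1 indicate a strong linear relationship (positive or negative), while values near 0 indicate little or no linear relationship.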