
Machine learning algorithms

1. How can outliers impact regression analysis?


An outlier is a data point that differs significantly from other observations. An outlier may be due to
variability in the measurement, an indication of novel data, or it may be the result of experimental
error. Outliers can significantly skew the results of regression analysis by pulling the fitted
regression line toward themselves, distorting the estimated coefficients and reducing the accuracy of
predictions.
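A minimal sketch (assuming NumPy is available) of how a single extreme value can pull a least-squares fit away from the trend of the remaining points; the data values are made up purely for illustration:

```python
# Minimal sketch: how a single outlier can shift a least-squares fit.
# Uses only NumPy; the data values are invented for illustration.
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0            # points that lie exactly on y = 2x + 1

slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Corrupt one observation with an extreme value (an outlier).
y_outlier = y.copy()
y_outlier[9] = 100.0

slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(f"clean fit:    slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with outlier: slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```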
2. What are the steps involved in k-NN algorithm?

Select the Number K of Neighbors: Decide on the value of K, which is the number of nearest
neighbors to consider for classification.

Calculate Distances: Compute the distance (usually Euclidean) between the new data point and
each point in the training dataset.

Identify K Nearest Neighbors: Select the K data points that are closest to the new data point based
on the calculated distances.

Count the Categories: For the K nearest neighbors, count the number of occurrences of each
category or class label.

Assign the New Data Point: Assign the new data point to the category with the majority vote
among the K nearest neighbors.

Model is Ready: The model is ready to classify new data points based on the majority vote of the K
nearest neighbors.

This sequence ensures that the new data point is classified according to the most common class among its
K nearest neighbors.
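A minimal sketch of these steps, assuming Euclidean distance and a small made-up training set; the function name knn_predict and the choice K=3 are illustrative, not part of the original text:

```python
# A minimal k-NN classifier following the steps above: choose K, compute
# Euclidean distances, take the K closest training points, and return the
# majority class among them.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the K nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count class labels among the neighbors and take the majority.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))  # -> "A"
```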

3. What do you understand by K-means algorithm? How is it different from KNN?

K-means is an unsupervised learning algorithm that partitions a dataset into K clusters, each represented
by a center (also called the centroid). The algorithm repeatedly assigns data points to the nearest center
and then updates each center by recalculating the mean position of the points assigned to its cluster.

Steps involved:

1. Initialize Centers: Choose starting positions for the K cluster centers.


2. Assign Points: Assign each data point to the nearest cluster center based on distance (usually
Euclidean distance), forming clusters.
3. Calculate New Centers:
o For each cluster, gather all data points assigned to that cluster.
o Compute the average position (mean) of these data points for the cluster.
o Update the cluster center to this average position.

This process is repeated iteratively until the cluster centers stabilize and no longer change significantly, or
a predefined number of iterations is reached.

(Example, for understanding only)

 Imagine you have a group of people (data points) standing in different locations (clusters).
 Each group has a leader (cluster center).
 To find a new leader, you calculate the average position of everyone in the group.
 The new leader (center) is now at this average position, and the process helps ensure that the leader
is in the best possible spot to represent the group.
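A bare-bones sketch of the assign-and-update loop described above, written with NumPy; the sample points, K=2, and the stopping rule are illustrative assumptions:

```python
# Minimal K-means loop: assign each point to the nearest center, then move
# each center to the mean of its assigned points, repeating until stable.
import numpy as np

def kmeans(X, k=2, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: initialize
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # stop when centers stabilize
            break
        centers = new_centers
    return centers, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centers, labels = kmeans(X, k=2)
print(centers)   # one center near (1, 1), the other near (8, 8)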
Difference between KNN and K-Means

Purpose:

o K-Means Clustering: Groups data into clusters.


o KNN: Predicts the label or value of a new data point based on its neighbors.

Type of Learning:

o K-Means Clustering: Unsupervised learning (no predefined labels).


o KNN: Supervised learning (uses labels from training data).

4. What are the different types of clustering?


Partitioning Clustering:

 Definition: Divides data into separate, non-overlapping groups.


 Also Known As: Centroid-based clustering.
 Example: K-Means Clustering.
 How It Works: You choose how many groups (K) you want. The algorithm finds the center of
each group and assigns data points to the nearest center, aiming to make the groups as distinct as
possible.

Density-Based Clustering:

 Definition: Groups data based on how dense (crowded) the data points are.
 Challenge: Can be tricky with data that has uneven density or many dimensions.
 How It Works: Finds clusters in areas where data points are closely packed together. It can handle
clusters of different shapes and is good at finding outliers (points that don’t fit any cluster).

Distribution Model-Based Clustering:

 Definition: Groups data by assuming that the data follows specific statistical distributions.
 Common Distribution: Gaussian (bell-shaped curve).
 Example: Gaussian Mixture Models (GMM) with Expectation-Maximization.
 How It Works: Assumes that the data comes from a mix of several distributions. It tries to estimate
the best-fit distributions to form clusters.

Hierarchical Clustering:

 Definition: Builds clusters in a tree-like structure without needing to decide the number of clusters
beforehand.
 Types:
o Agglomerative: Starts with individual data points and combines them into larger clusters.
o Divisive: Starts with one big cluster and splits it into smaller ones.
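A short sketch showing how each clustering family above could be invoked, assuming scikit-learn is installed; the generated data and parameter values are illustrative only:

```python
# Sketch: the four clustering families, using common scikit-learn estimators.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

labels_partition = KMeans(n_clusters=2, n_init=10).fit_predict(X)        # partitioning
labels_density   = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)         # density-based
labels_model     = GaussianMixture(n_components=2).fit_predict(X)        # distribution model-based
labels_hierarchy = AgglomerativeClustering(n_clusters=2).fit_predict(X)  # hierarchical

print(labels_partition[:5], labels_density[:5], labels_model[:5], labels_hierarchy[:5])
```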

5. What is Pearson’s coefficient? Write its formula too.

Pearson's correlation coefficient (often denoted as Pearson's r) is one of the crucial factors to
consider when assessing the appropriateness of regression analysis. Pearson's r measures the
strength and direction of the linear relationship between two continuous variables, ranging from −1
(perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive
linear relationship).

For paired observations (xᵢ, yᵢ) with means x̄ and ȳ, the formula is:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]

The requirements when considering the use of Pearson's correlation coefficient are:
1. Scale of measurement should be interval or ratio.
2. The association should be linear.
3. There should be no outliers in the data.
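A minimal sketch computing Pearson's r directly from the formula above and checking it against NumPy's corrcoef; the sample data are illustrative only:

```python
# Compute Pearson's r from its definition and compare with np.corrcoef.
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()          # deviations from the means
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(pearson_r(x, y))            # value from the formula
print(np.corrcoef(x, y)[0, 1])    # should match the built-in result
```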
