ML Unit 2
Nearest Neighbor-based models are a class of non-parametric algorithms used for classification
and regression. These models rely on the idea that similar data points exist close to each other
in a given feature space.
The Nearest Neighbor algorithm finds the closest data points (neighbors) to make predictions. It is based
on the assumption that similar things exist in proximity.
Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)   # K = 3 neighbors
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
How It Works
1. The model finds the 3 nearest neighbors for each test sample.
2. It takes a majority vote to assign the label (classification).
3. It evaluates performance by checking accuracy on test data.
Output Example
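As a minimal sketch (continuing the snippet above), the test accuracy can be printed with scikit-learn's accuracy_score:
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))   # fraction of correct predictions; typically high (~0.95+) on Iris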
What is Proximity?
In machine learning and data science, proximity simply means how close or how similar two
data points are to each other.
We use proximity measures when we need to compare one object (or data point) with others —
like when we want to group similar things or find the closest match.
If you want to predict whether a fruit is an apple or an orange, you can look at similar fruits in your
dataset — if it’s more similar to apples than oranges, you can say it’s probably an apple.
1. Distance Measures – These tell us how far apart two things are
➤ Lower distance = More similar
➤ Higher distance = More different
2. Similarity Measures – These tell us how alike two things are
➤ Higher similarity = More similar
➤ Lower similarity = Less similar
🔹 Real-Life Examples
Imagine you're in a college class. You sit near the same 4 friends every day. If a new student walks in,
sits close to your group, and shares similar interests (same major, hobbies), you'd say: "They'll probably
fit right into our group." Here, you used a proximity measure in your brain — you compared based on
how close they sat and how much their interests matched yours.
Or suppose you bought a black hoodie on Amazon. Next time, Amazon shows you similar items: other
hoodies, dark sweatshirts, matching accessories. This happens because Amazon uses proximity between
your shopping behavior and that of others to suggest similar items.
Distance Measures
In machine learning, distance measures are used to calculate how far apart two data points
are.
Think of them like a ruler — they help us figure out which points are close to each other and
which are far away.
1. Euclidean Distance
💡 Real-Life Example:
Imagine you're cutting across a field from one corner to the opposite corner — this is the straight-line
distance (Euclidean).
2. Manhattan Distance
🔹 In Simple Words:
This measures distance like you're walking through a city with blocks and streets, only moving
up/down or left/right (not diagonally).
It’s useful in grid-like situations.
3. Minkowski Distance
🔹 In Simple Words:
This is a general formula that includes both Euclidean and Manhattan distances:
d(x, y) = ( Σ |xᵢ − yᵢ|ᵖ )^(1/p)
You control it with a value called p (see the sketch below):
o If p = 1 → Manhattan distance
o If p = 2 → Euclidean distance
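The formulas are easy to check in code. A small NumPy sketch (the points a and b are made-up values) showing that Manhattan and Euclidean are just the p = 1 and p = 2 cases of Minkowski:
import numpy as np
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
manhattan = np.sum(np.abs(a - b))                    # p = 1 -> 7.0
euclidean = np.sqrt(np.sum((a - b) ** 2))            # p = 2 -> 5.0
p = 2
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)    # equals euclidean when p = 2
print(manhattan, euclidean, minkowski)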
Similarity Measures
🔹 In Simple Words:
Suppose your interests are:
Pizza 🍕
Cricket 🏏
Movies 🎥
Dancing 💃
Your friend shares pizza, cricket, and movies, but not dancing. You don't match on everything.
Now you can say: ➡️ You both are quite similar, even if not 100% the same.
This is where similarity functions help.
Cosine Similarity
🧪 Real Example:
You and your friend use almost the same words in essays, even if one writes more.
Cosine similarity will be high, because it measures the angle between the two word-count vectors,
not their lengths.
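A minimal sketch of that essay example (the word-count vectors are hypothetical; each entry counts how often one word appears):
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

essay_a = np.array([3, 1, 2])   # hypothetical word counts
essay_b = np.array([6, 2, 4])   # same proportions, just a longer essay
print(cosine_similarity(essay_a, essay_b))   # 1.0 -> direction matters, not length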
Jaccard Similarity
💬 In Simple Words:
Tells how much two sets have in common.
It's the number of common things divided by the total number of unique things:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
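A small sketch using Python sets (the interest sets reuse the earlier made-up example):
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)   # common things / total unique things

you = {"pizza", "cricket", "movies", "dancing"}
friend = {"pizza", "cricket", "movies"}
print(jaccard(you, friend))   # 3 common / 4 unique = 0.75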
Overlap Coefficient
💬 In Simple Words:
It tells how much the smaller set is inside the bigger one:
Overlap(A, B) = |A ∩ B| / min(|A|, |B|)
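A sketch of the same idea (the function name and example sets are mine):
def overlap(a, b):
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))   # intersection / size of the smaller set

print(overlap({"pizza", "cricket", "movies"},
              {"pizza", "cricket", "movies", "dancing"}))   # 3/3 = 1.0 -> smaller set fully inside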
A binary pattern is a sequence of 0s and 1s. For example, [1, 0, 1, 1] is a binary pattern.
These binary patterns are often used to represent data in fields like text processing (a word is present
or absent), recommender systems (an item was bought or not), and bioinformatics.
For binary data, we typically look at how many bits are the same (similar) and how many are
different.
1. Hamming Distance
2. Jaccard Similarity
(Jaccard similarity is described above; Hamming distance counts the positions at which two patterns differ.)
Example:
Pattern A = [1, 0, 1, 0, 1]
Pattern B = [1, 1, 0, 0, 1]
Common 1s = 2 (positions 1 and 5)
Total unique 1s in A and B = 4 (positions 1, 2, 3, 5)
Jaccard Similarity = 2/4 = 0.5
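Both measures on the patterns above, as a NumPy sketch:
import numpy as np
A = np.array([1, 0, 1, 0, 1])
B = np.array([1, 1, 0, 0, 1])
hamming = np.sum(A != B)                   # 2 positions differ (positions 2 and 3)
jaccard = np.sum(A & B) / np.sum(A | B)    # common 1s / total unique 1s = 2/4
print(hamming, jaccard)                    # 2 0.5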
In machine learning, classification means predicting which category (or class) an object belongs to.
Distance measures are often used to decide how similar an object is to others in a dataset. Based on
these similarities (or distances), different algorithms can classify data.
Here are some key classification algorithms that rely on distance measures:
💬 In Simple Words:
The K-Nearest Neighbors (KNN) algorithm classifies a new data point based on the majority
class of its K nearest neighbors in the feature space.
The distance measure (like Euclidean distance) is used to find the closest data points to the new
one.
📐 How It Works:
1. Choose the number of neighbors, K.
2. Compute the distance (e.g., Euclidean) from the new point to every training point.
3. Pick the K closest points.
4. Assign the class that appears most often among them (majority vote).
✅ Example:
Imagine you want to classify a new animal based on its weight and height.
If you have K=2, the two closest neighbors to a new animal (weight = 2.5 kg, height = 33 cm) will likely be
Animal 1 and Animal 3, both cats. So, the new animal would be classified as a Cat.
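A minimal scikit-learn sketch of this example (the animal table below is hypothetical, chosen so that the two nearest neighbors of the new animal are the two cats):
from sklearn.neighbors import KNeighborsClassifier

X = [[2.0, 30], [8.0, 55], [2.8, 34], [9.5, 60]]   # [weight (kg), height (cm)] for Animals 1-4
y = ["Cat", "Dog", "Cat", "Dog"]

knn = KNeighborsClassifier(n_neighbors=2)   # K = 2
knn.fit(X, y)
print(knn.predict([[2.5, 33]]))   # ['Cat'] -- nearest are Animals 1 and 3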
💬 In Simple Words:
The Radius Nearest Neighbor (Radius NN) algorithm is similar to KNN but with a twist. Instead
of looking for a fixed number of neighbors, it considers all data points within a specified radius
(distance).
If no data points fall within the radius, the new point isn't classified.
📐 How It Works:
1. Choose a radius r.
2. Find every training point whose distance to the new point is at most r.
3. Assign the majority class of those points; if no points fall within the radius, the new point stays
unclassified.
✅ Example:
If your radius is 5 units and the new point is within 5 units of several points in the dataset, it will be
classified based on those points' majority class. If no points are within 5 units, the new point will not be
classified.
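A sketch with scikit-learn's RadiusNeighborsClassifier (the training points are made up; outlier_label covers the "no neighbors within the radius" case):
from sklearn.neighbors import RadiusNeighborsClassifier

X = [[1, 1], [2, 1], [10, 10]]   # hypothetical training points
y = ["A", "A", "B"]

rnn = RadiusNeighborsClassifier(radius=5.0, outlier_label="unclassified")
rnn.fit(X, y)
print(rnn.predict([[1.5, 1.0], [50, 50]]))   # ['A' 'unclassified']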
✅ 3. KNN Regression
💬 In Simple Words:
KNN Regression is similar to KNN Classification, but instead of assigning a class label, it predicts
a continuous value (like price, height, weight, etc.) for the new data point.
KNN Regression calculates the average (or sometimes weighted average) of the values of the K
nearest neighbors.
📐 How It Works:
1. Choose K and compute the distance from the new point to every training point.
2. Pick the K nearest neighbors.
3. Predict the average (or distance-weighted average) of their target values.
✅ Example:
Imagine you're predicting the price of a house based on its features (size, location, etc.).
If a new house has a size of 1700 sq ft, the KNN regression algorithm might look at the K nearest houses
(let's say K=2) and compute the average price of those houses to predict the price of the new house.
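A minimal sketch of that house example with scikit-learn's KNeighborsRegressor (the sizes and prices are hypothetical):
from sklearn.neighbors import KNeighborsRegressor

X = [[1500], [1600], [1800], [2400]]   # size in sq ft
y = [200000, 220000, 260000, 340000]   # price

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)
print(reg.predict([[1700]]))   # [240000.] -- mean of the 2 nearest houses (1600 and 1800 sq ft)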
Performance of Classifiers
In machine learning, evaluating the performance of classifiers is crucial to understand how well a model
is working and if it can be trusted to make accurate predictions.
When we train a classifier (e.g., K-Nearest Neighbors, Support Vector Machine), we need to assess its
ability to correctly predict the classes of new, unseen data. This assessment is usually done using several
metrics and evaluation techniques.
Accuracy
Precision
Recall
F1-Score
Specificity
What is Accuracy?
Accuracy tells us how many predictions the model got correct out of the total predictions made:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example:
Suppose a spam filter looks at 100 emails, and 40 of them were actually spam:
o Model correctly flagged 30 → TP = 30
o Missed 10 → FN = 10
60 emails were not spam:
o Model correctly said “not spam” for 50 → TN = 50
o Wrongly flagged 10 → FP = 10
Accuracy = (30 + 50) / 100 = 0.80, i.e. 80% of predictions were correct.
What is Precision?
"Out of all the times the model said 'Yes', how many times was it actually correct?"
Precision = TP / (TP + FP)
What is Recall?
"Out of all the actual 'Yes' cases, how many did the model correctly find?"
Recall = TP / (TP + FN)
In simple words:
“Did the model catch all the actual positives?”
What is the F1-Score?
The F1-Score is the harmonic mean of Precision and Recall; it balances the two in a single number:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
What is Specificity?
"Out of all the actual 'No' cases, how many did the model correctly say 'No'?"
Specificity = TN / (TN + FP)
In other words:
“How good is the model at identifying the negatives correctly?”
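All five metrics can be computed with scikit-learn; a sketch on made-up labels (1 = positive class):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical actual labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # hypothetical model predictions

print("Accuracy:   ", accuracy_score(y_true, y_pred))               # 0.75
print("Precision:  ", precision_score(y_true, y_pred))              # 0.75
print("Recall:     ", recall_score(y_true, y_pred))                 # 0.75
print("F1-Score:   ", f1_score(y_true, y_pred))                     # 0.75
print("Specificity:", recall_score(y_true, y_pred, pos_label=0))    # recall of the negative class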
Evaluation Techniques
1. Cross-Validation
Instead of training and testing once, we split the data into K parts (folds); each fold takes a turn
as the test set while the model trains on the rest, so every data point gets tested exactly once.
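A sketch of 5-fold cross-validation on the Iris data using cross_val_score (5 folds is my choice; other values work too):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(scores, scores.mean())   # one accuracy score per fold, then the average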
2. ROC Curve
The ROC curve shows the performance of a classification model at all classification thresholds.
It compares:
➤ the True Positive Rate (Recall) on the y-axis
➤ the False Positive Rate (1 − Specificity) on the x-axis
as the decision threshold varies.
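A sketch that computes the ROC points and the area under the curve (AUC) for a KNN model; the dataset choice (breast cancer) is mine:
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scores = knn.predict_proba(X_test)[:, 1]           # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))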