
Unit 2: Machine Learning

Introduction to Nearest Neighbor-Based Models

Nearest Neighbor-based models are a class of non-parametric algorithms used for classification
and regression. These models rely on the idea that similar data points exist close to each other
in a given feature space.

What is the Nearest Neighbor Algorithm?

The Nearest Neighbor algorithm finds the closest data points (neighbors) to make predictions. It is based
on the assumption that similar things exist in proximity.

Example

Step 1: Import Libraries

from sklearn.neighbors import KNeighborsClassifier

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

Step 2: Load Data

data = load_iris()

X = data.data

y = data.target

Step 3: Split Data for Training and Testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train the k-NN Model

knn = KNeighborsClassifier(n_neighbors=3) # Using 3 nearest neighbors

knn.fit(X_train, y_train)


Step 5: Make Predictions

y_pred = knn.predict(X_test)

Step 6: Evaluate Model Accuracy

accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.2f}")

How It Works

1. The model finds the 3 nearest neighbors for each test sample.
2. It takes a majority vote to assign the label (classification).
3. It evaluates performance by checking accuracy on test data.

Output Example

Model Accuracy: 1.00

2. Introduction to Proximity Measures

What is Proximity?

In machine learning and data science, proximity simply means how close or how similar two
data points are to each other.

We use proximity measures when we need to compare one object (or data point) with others —
like when we want to group similar things or find the closest match.

These measures help in:

 Classification (like K-Nearest Neighbors)
 Clustering (like K-Means)
 Recommendation systems (like Netflix or Amazon suggestions)


Why is Proximity Important?

Many algorithms use the concept of "nearness" or "similarity" to make decisions.

If you want to predict whether a fruit is an apple or an orange, you can look at similar fruits in your
dataset — if it’s more similar to apples than oranges, you can say it’s probably an apple.

Types of Proximity Measures

There are two main types:

1. Distance Measures – These tell us how far apart two things are
➤ Lower distance = More similar
➤ Higher distance = More different
2. Similarity Measures – These tell us how alike two things are
➤ Higher similarity = More similar
➤ Lower similarity = Less similar

🔹 Real-Life Examples

✅ Example 1: Friends Group

Imagine you're in a college class. You sit near the same 4 friends every day. If a new student walks in and
sits close to your group and shares similar interests (same major, hobbies), you'd say:

➡️"That new student is similar to us."

Here, you used a proximity measure in your brain — you compared based on:

 Physical closeness (distance)
 Common features (similarity)

✅ Example 2: Online Shopping

You bought a black hoodie on Amazon. Next time, Amazon shows you:


 Other black hoodies
 Hoodies in a similar price range
 Brands you’ve seen before

This happens because Amazon uses proximity between your shopping behavior and that of others to
suggest similar items.

Distance Measures

What Are Distance Measures?

In machine learning, distance measures are used to calculate how far apart two data points
are.

Think of them like a ruler — they help us figure out which points are close to each other and
which are far away.

Why do we need them?

 Group similar data
 Predict labels based on closeness
 Find outliers (points that lie unusually far from the rest of the data)

Common Types of Distance Measures:

1. Euclidean Distance (Straight Line Distance)
2. Manhattan Distance (City Block Distance)
3. Minkowski Distance (General Formula)
4. Hamming Distance (For Binary Data)

1. Euclidean Distance (Straight Line Distance)

 It is the shortest distance between two points, like drawing a straight line from one point to another.
 Used when you care about how far two things are in real space.


💡 Real-Life Example:

Imagine you're cutting across a field from one corner to the opposite corner — this is the straight-line
distance (Euclidean).
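
A minimal sketch (with made-up points) of computing Euclidean distance, d = sqrt(Σ(xᵢ − yᵢ)²), in NumPy:

import numpy as np

# Euclidean distance: square the differences, sum them, take the square root
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
d = np.sqrt(np.sum((x - y) ** 2))
print(d)  # 5.0 (the classic 3-4-5 right triangle)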

2. Manhattan Distance (City Block Distance)

🔹 In Simple Words:

 This measures distance like you're walking through a city with blocks and streets, only moving
up/down or left/right (not diagonally).
 It’s useful in grid-like situations.
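
A minimal sketch (same made-up points as above) of Manhattan distance, d = Σ|xᵢ − yᵢ|:

import numpy as np

# Manhattan distance: sum of absolute differences along each axis
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
d = np.sum(np.abs(x - y))
print(d)  # 7.0 -> 3 blocks across plus 4 blocks up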


3. Minkowski Distance (General Formula)

🔹 In Simple Words:

 This is a general formula that includes both Euclidean and Manhattan distances.
 You control it with a value called p:
o If p = 1 → Manhattan distance
o If p = 2 → Euclidean distance
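
A short sketch (assuming SciPy is installed) showing how the p value switches between the two distances:

from scipy.spatial.distance import minkowski

x = [1.0, 2.0]
y = [4.0, 6.0]
print(minkowski(x, y, p=1))  # 7.0 -> same as Manhattan distance
print(minkowski(x, y, p=2))  # 5.0 -> same as Euclidean distance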


4. Hamming Distance (For Binary Data)

🔹 In Simple Words:

 Used when data is in binary form (0s and 1s).
 It simply counts the number of positions where the bits are different.
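
A one-line sketch (with made-up bit patterns) of counting the differing positions:

# Hamming distance: count positions where the bits differ
a = [1, 0, 1, 1, 0]
b = [1, 1, 1, 0, 0]
print(sum(x != y for x, y in zip(a, b)))  # 2 -> bits differ at positions 2 and 4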


Non-Metric Similarity Functions

What does "Similarity" mean?

 Similarity means how much two things are alike.
 A similarity function gives us a score: the higher the score, the more similar the two things are.

What does "Non-Metric" mean?

 "Metric" functions (like distance) follow strict math rules.


 Non-metric similarity functions don’t always follow all those rules — but they still tell us how
similar two things are.

✅ We use non-metric similarity when we care about similarity, not distance.

🧠 Think of It Like This:

Let’s say you and your friend both love:

 Pizza 🍕
 Cricket 🏏
 Movies 🎥

But your friend also likes:

 Dancing 💃
You don't.

Now you can say: ➡️ You both are quite similar, even if not 100% the same.
This is where similarity functions help.

Common Non-Metric Similarity Functions

1. Cosine Similarity – (For numbers/text)
2. Jaccard Similarity – (For sets or lists)
3. Overlap Coefficient – (How much overlap)

1. Cosine Similarity – (For numbers/text)

💬 In Simple Words:

 Tells how similar two things are by looking at their direction, not size.
 Used for text, numbers, or features.


🧪 Real Example:
 You and your friend use almost the same words in essays, even if one writes more.
 Cosine similarity will be high.
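
A minimal sketch (with made-up word-count vectors) of the essay example: the longer essay is just a scaled-up copy of the shorter one, so the direction is identical.

import numpy as np

# Cosine similarity = (x . y) / (|x| * |y|); it ignores vector length
short_essay = np.array([2.0, 1.0, 3.0])
long_essay = np.array([4.0, 2.0, 6.0])
cos_sim = short_essay @ long_essay / (np.linalg.norm(short_essay) * np.linalg.norm(long_essay))
print(cos_sim)  # 1.0 -> same direction, different size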

2. Jaccard Similarity – (For sets or lists)

💬 In Simple Words:
 Tells how much two sets have in common.
 It’s the number of common things divided by total unique things.
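
A quick sketch using the friends example from earlier:

# Jaccard similarity = |A ∩ B| / |A ∪ B|
you = {"pizza", "cricket", "movies"}
friend = {"pizza", "cricket", "movies", "dancing"}
print(len(you & friend) / len(you | friend))  # 3 / 4 = 0.75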

3. Overlap Coefficient – (How much overlap)

💬 In Simple Words:
 It tells how much the smaller set is inside the bigger one.
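
The same sets, with the overlap coefficient instead:

# Overlap coefficient = |A ∩ B| / min(|A|, |B|)
you = {"pizza", "cricket", "movies"}
friend = {"pizza", "cricket", "movies", "dancing"}
print(len(you & friend) / min(len(you), len(friend)))  # 3 / 3 = 1.0 -> the smaller set sits entirely inside the bigger one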


Proximity Between Binary Patterns

What is a Binary Pattern?

 A binary pattern is a sequence of 0s and 1s. For example, [1, 0, 1, 1] is a binary pattern.
 These binary patterns are often used to represent data in fields like:


o Image recognition (where pixels are represented as binary values),
o Text analysis (where words are represented in binary vectors),
o DNA sequences (where genes are represented as binary strings).

🔹 Proximity Between Binary Patterns

 Proximity means how close or similar two binary patterns are.


 There are various ways to measure how similar or close these patterns are.

For binary data, we typically look at how many bits are the same (similar) and how many are
different.

🔸 Key Methods to Measure Proximity

1. Hamming Distance
2. Jaccard Similarity

(Please refer to the Hamming distance and Jaccard similarity sections above.)

Example:

Let's take two binary patterns:

 Pattern A = [1, 0, 1, 0, 1]
 Pattern B = [1, 1, 0, 0, 1]

Now, count the common 1s and the total unique 1s:

 Common 1s = 2 (positions 1 and 5)
 Total unique 1s in A and B = 4 (positions 1, 2, 3, and 5)

Jaccard Similarity = 2/4 = 0.5

So, the similarity between these patterns is 0.5 or 50%.
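
The same example, checked in code:

A = [1, 0, 1, 0, 1]
B = [1, 1, 0, 0, 1]

hamming = sum(a != b for a, b in zip(A, B))
common_ones = sum(a == 1 and b == 1 for a, b in zip(A, B))
total_ones = sum(a == 1 or b == 1 for a, b in zip(A, B))

print(hamming)                   # 2 -> positions 2 and 3 differ
print(common_ones / total_ones)  # 2 / 4 = 0.5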

Classification Algorithms Based on Distance Measures

In machine learning, classification means predicting which category (or class) an object belongs to.
Distance measures are often used to decide how similar an object is to others in a dataset. Based on
these similarities (or distances), different algorithms can classify data.


Here are some key classification algorithms that rely on distance measures:

✅ 1. K-Nearest Neighbors (KNN) Classifier

💬 In Simple Words:

 The K-Nearest Neighbors (KNN) algorithm classifies a new data point based on the majority
class of its K nearest neighbors in the feature space.
 The distance measure (like Euclidean distance) is used to find the closest data points to the new
one.

📐 How It Works:

1. Choose a value for K (e.g., K=3 means the 3 closest neighbors).
2. Measure the distance between the new data point and all points in the training set (using distance measures like Euclidean).
3. Sort the distances, pick the top K neighbors, and assign the class that appears the most among those neighbors.

✅ Example:

Imagine you want to classify a new animal based on its weight and height.

 Animal 1 (Cat): weight = 2 kg, height = 30 cm
 Animal 2 (Dog): weight = 5 kg, height = 50 cm
 Animal 3 (Cat): weight = 3 kg, height = 35 cm

If you have K=2, the two closest neighbors to a new animal (weight = 2.5 kg, height = 33 cm) will likely be
Animal 1 and Animal 3, both cats. So, the new animal would be classified as a Cat.
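
The animal example as a runnable sketch (weights in kg, heights in cm, using the numbers above):

from sklearn.neighbors import KNeighborsClassifier

X_train = [[2, 30], [5, 50], [3, 35]]   # Animal 1, Animal 2, Animal 3
y_train = ["Cat", "Dog", "Cat"]

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
print(knn.predict([[2.5, 33]]))  # ['Cat'] -> the 2 nearest neighbors are both cats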

✅ 2. Radius Nearest Neighbor (Radius NN)

💬 In Simple Words:

 The Radius Nearest Neighbor (Radius NN) algorithm is similar to KNN but with a twist. Instead
of looking for a fixed number of neighbors, it considers all data points within a specified radius
(distance).
 If no data points fall within the radius, the new point isn't classified.


📐 How It Works:

1. Define a radius (e.g., radius = 5 units).
2. Measure the distance between the new point and all points in the dataset.
3. Classify the new point based on the majority class of all points that fall within the radius.

✅ Example:

If your radius is 5 units and the new point is within 5 units of several points in the dataset, it will be
classified based on those points' majority class. If no points are within 5 units, the new point will not be
classified.
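
A minimal sketch (with made-up one-dimensional data) using scikit-learn's RadiusNeighborsClassifier; its outlier_label parameter handles the "no points within the radius" case:

from sklearn.neighbors import RadiusNeighborsClassifier

X_train = [[1.0], [2.0], [3.0], [10.0]]
y_train = [0, 0, 0, 1]

# Every training point within radius 5 of a query gets a vote;
# queries with no neighbors in the radius receive outlier_label instead
rnn = RadiusNeighborsClassifier(radius=5.0, outlier_label=-1)
rnn.fit(X_train, y_train)
print(rnn.predict([[2.5]]))    # [0]  -> points 1, 2 and 3 fall inside the radius
print(rnn.predict([[100.0]]))  # [-1] -> nothing within 5 units, so not classified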

✅ 3. KNN Regression

💬 In Simple Words:

 KNN Regression is similar to KNN Classification, but instead of assigning a class label, it predicts
a continuous value (like price, height, weight, etc.) for the new data point.
 KNN Regression calculates the average (or sometimes weighted average) of the values of the K
nearest neighbors.

📐 How It Works:


1. Choose a value for K.
2. Measure the distance between the new data point and all points in the dataset.
3. Find the K nearest neighbors and compute their average value to predict the output.

✅ Example:

Imagine you're predicting the price of a house based on its features (size, location, etc.).

 House 1: size = 1500 sq ft, price = $300,000
 House 2: size = 1800 sq ft, price = $350,000
 House 3: size = 1600 sq ft, price = $320,000

If a new house has a size of 1700 sq ft, the KNN regression algorithm might look at the K nearest houses
(let's say K=2) and compute the average price of those houses to predict the price of the new house.
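
The house-price example as a runnable sketch (sizes in sq ft, prices from the list above):

from sklearn.neighbors import KNeighborsRegressor

X_train = [[1500], [1800], [1600]]
y_train = [300_000, 350_000, 320_000]

knr = KNeighborsRegressor(n_neighbors=2)
knr.fit(X_train, y_train)

# The 2 nearest houses to 1700 sq ft are 1600 and 1800:
# predicted price = (320,000 + 350,000) / 2 = 335,000
print(knr.predict([[1700]]))  # [335000.]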

Performance of Classifiers

In machine learning, evaluating the performance of classifiers is crucial to understand how well a model
is working and if it can be trusted to make accurate predictions.

When we train a classifier (e.g., K-Nearest Neighbors, Support Vector Machine), we need to assess its
ability to correctly predict the classes of new, unseen data. This assessment is usually done using several
metrics and evaluation techniques.

Key Metrics for Classifier Performance


 Accuracy
 Precision
 Recall
 F1-Score
 Specificity

Accuracy?

Accuracy tells us how many predictions the model got correct out of the total predictions made.

Accuracy = Total Correct Predictions ÷ Total Predictions = (TP + TN) ÷ (TP + TN + FP + FN).

Example:

You built a model to detect spam emails.

You tested it on 100 emails. Here's what happened:

 40 emails were actually spam
o Model correctly found 30 of them → TP = 30
o Missed 10 → FN = 10
 60 emails were not spam
o Model correctly said “not spam” for 50 → TN = 50
o Wrongly said “spam” for 10 → FP = 10

So the model got 30 + 50 = 80 predictions right out of 100: Accuracy = 80 ÷ 100 = 0.80 (80%).
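
Using these counts, every metric listed above can be computed directly (the formulas for precision, recall, F1-Score and specificity are given in the sections that follow):

TP, FN, TN, FP = 30, 10, 50, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 80 / 100 = 0.80
precision = TP / (TP + FP)                          # 30 / 40 = 0.75
recall = TP / (TP + FN)                             # 30 / 40 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
specificity = TN / (TN + FP)                        # 50 / 60 ≈ 0.83

print(accuracy, precision, recall, f1, specificity)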

Precision?

Precision tells us:

"Out of all the times the model said 'Yes', how many times was it actually correct?"

"How precise was the model when it predicted something as positive?"

Precision = TP ÷ (TP + FP). In the spam example: 30 ÷ (30 + 10) = 0.75.


Recall?

Recall tells us:

"Out of all the actual 'Yes' cases, how many did the model correctly find?"

In simple words:
“Did the model catch all the actual positives?”

Recall = TP ÷ (TP + FN). In the spam example: 30 ÷ (30 + 10) = 0.75.


F1-Score?

📢 F1-Score is the balance between:

 Precision → “How many predicted YES are actually correct?”
 Recall → “How many actual YES did the model catch?”

F1-Score = 2 × (Precision × Recall) ÷ (Precision + Recall). In the spam example: 2 × (0.75 × 0.75) ÷ (0.75 + 0.75) = 0.75.


Specificity?

Specificity tells us:

"Out of all the actual 'No' cases, how many did the model correctly say 'No'?"

In other words:
“How good is the model at identifying the negatives correctly?”

Specificity = TN ÷ (TN + FP). In the spam example: 50 ÷ (50 + 10) ≈ 0.83.


Evaluation Techniques

1. Cross-Validation

Cross-validation is a technique used to check how well your model works on unseen data; it helps prevent overfitting.

Instead of training and testing once, we split the data into parts and test the model multiple times.
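
A minimal sketch: 5-fold cross-validation of the k-NN model from this unit on the Iris dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
knn = KNeighborsClassifier(n_neighbors=3)

# Split the data into 5 folds; train on 4, test on 1, and rotate
scores = cross_val_score(knn, data.data, data.target, cv=5)
print(scores)         # accuracy on each of the 5 folds
print(scores.mean())  # average accuracy across folds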

2. ROC Curve and AUC

ROC = Receiver Operating Characteristic

The ROC curve shows the performance of a classification model at all thresholds.

It compares:

 True Positive Rate (TPR = Recall)
 False Positive Rate (FPR)
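
A tiny sketch (with made-up labels and scores) of AUC, the single number that summarizes the ROC curve (1.0 = perfect ranking, 0.5 = random guessing):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # actual classes
y_score = [0.1, 0.4, 0.35, 0.8]  # model's predicted probabilities for class 1
print(roc_auc_score(y_true, y_score))  # 0.75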


Main Metrics to Measure Regression Performance:

1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. Root Mean Squared Error (RMSE)
4. R-squared (R² Score)
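
A short sketch (with made-up true vs. predicted values) computing all four metrics for a regression model, like the KNN Regression above, with scikit-learn and NumPy:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # average of |error|
mse = mean_squared_error(y_true, y_pred)   # average of error^2
rmse = np.sqrt(mse)                        # back in the original units
r2 = r2_score(y_true, y_pred)              # 1.0 = perfect fit
print(mae, mse, rmse, r2)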
