ML Unit 2
Nearest Neighbor-based models are a class of non-parametric algorithms used for classification
and regression. These models rely on the idea that similar data points exist close to each other
in a given feature space.
The Nearest Neighbor algorithm finds the closest data points (neighbors) to make predictions. It is based
on the assumption that similar things exist in proximity.
Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)   # K = 3 neighbors
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
How It Works
1. The model finds the 3 nearest neighbors for each test sample.
2. It takes a majority vote to assign the label (classification).
3. It evaluates performance by checking accuracy on test data.
Output Example
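As a minimal sketch (continuing the snippet above), the test accuracy can be printed with scikit-learn's accuracy_score:
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))   # fraction of correct predictions; typically high (~0.95+) on Iris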
What is Proximity?
In machine learning and data science, proximity simply means how close or how similar two
data points are to each other.
We use proximity measures when we need to compare one object (or data point) with others —
like when we want to group similar things or find the closest match.
If you want to predict whether a fruit is an apple or an orange, you can look at similar fruits in your
dataset — if it’s more similar to apples than oranges, you can say it’s probably an apple.
1. Distance Measures – These tell us how far apart two things are
➤ Lower distance = More similar
➤ Higher distance = More different
2. Similarity Measures – These tell us how alike two things are
➤ Higher similarity = More similar
➤ Lower similarity = Less similar
🔹 Real-Life Examples
Imagine you're in a college class. You sit near the same 4 friends every day. If a new student walks in,
sits close to your group, and shares similar interests (same major, hobbies), you'd say: "They'll probably
fit right into our group." Here, you used a proximity measure in your brain — you compared based on
how close they sat and how much their interests matched yours.
Or suppose you bought a black hoodie on Amazon. Next time, Amazon shows you similar items: other
hoodies, dark sweatshirts, matching accessories. This happens because Amazon uses proximity between
your shopping behavior and that of others to suggest similar items.
Distance Measures
In machine learning, distance measures are used to calculate how far apart two data points
are.
Think of them like a ruler — they help us figure out which points are close to each other and
which are far away.
1. Euclidean Distance
💡 Real-Life Example:
Imagine you're cutting across a field from one corner to the opposite corner — this is the straight-line
distance (Euclidean).
2. Manhattan Distance
🔹 In Simple Words:
This measures distance like you're walking through a city with blocks and streets, only moving
up/down or left/right (not diagonally).
It’s useful in grid-like situations.
3. Minkowski Distance
🔹 In Simple Words:
This is a general formula that includes both Euclidean and Manhattan distances:
d(x, y) = ( Σ |xᵢ − yᵢ|ᵖ )^(1/p)
You control it with a value called p (see the sketch below):
o If p = 1 → Manhattan distance
o If p = 2 → Euclidean distance
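The formulas are easy to check in code. A small NumPy sketch (the points a and b are made-up values) showing that Manhattan and Euclidean are just the p = 1 and p = 2 cases of Minkowski:
import numpy as np
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
manhattan = np.sum(np.abs(a - b))                    # p = 1 -> 7.0
euclidean = np.sqrt(np.sum((a - b) ** 2))            # p = 2 -> 5.0
p = 2
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)    # equals euclidean when p = 2
print(manhattan, euclidean, minkowski)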
Similarity Measures
🔹 In Simple Words:
Suppose your interests are:
Pizza 🍕
Cricket 🏏
Movies 🎥
Dancing 💃
Your friend shares pizza, cricket, and movies, but not dancing. You don't match on everything.
Now you can say: ➡️ You both are quite similar, even if not 100% the same.
This is where similarity functions help.
Cosine Similarity
🧪 Real Example:
You and your friend use almost the same words in essays, even if one writes more.
Cosine similarity will be high, because it measures the angle between the two word-count vectors,
not their lengths.
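A minimal sketch of that essay example (the word-count vectors are hypothetical; each entry counts how often one word appears):
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

essay_a = np.array([3, 1, 2])   # hypothetical word counts
essay_b = np.array([6, 2, 4])   # same proportions, just a longer essay
print(cosine_similarity(essay_a, essay_b))   # 1.0 -> direction matters, not length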
Jaccard Similarity
💬 In Simple Words:
Tells how much two sets have in common.
It's the number of common things divided by the total number of unique things:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
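A small sketch using Python sets (the interest sets reuse the earlier made-up example):
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)   # common things / total unique things

you = {"pizza", "cricket", "movies", "dancing"}
friend = {"pizza", "cricket", "movies"}
print(jaccard(you, friend))   # 3 common / 4 unique = 0.75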
Overlap Coefficient
💬 In Simple Words:
It tells how much the smaller set is inside the bigger one:
Overlap(A, B) = |A ∩ B| / min(|A|, |B|)
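A sketch of the same idea (the function name and example sets are mine):
def overlap(a, b):
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))   # intersection / size of the smaller set

print(overlap({"pizza", "cricket", "movies"},
              {"pizza", "cricket", "movies", "dancing"}))   # 3/3 = 1.0 -> smaller set fully inside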
A binary pattern is a sequence of 0s and 1s. For example, [1, 0, 1, 1] is a binary pattern.
These binary patterns are often used to represent data in fields like text processing (a word is present
or absent), recommender systems (an item was bought or not), and bioinformatics.
For binary data, we typically look at how many bits are the same (similar) and how many are
different.
1. Hamming Distance
2. Jaccard Similarity
(Jaccard similarity is described above; Hamming distance counts the positions at which two patterns differ.)
Example:
Pattern A = [1, 0, 1, 0, 1]
Pattern B = [1, 1, 0, 0, 1]
Common 1s = 2 (positions 1 and 5)
Total unique 1s in A and B = 4 (positions 1, 2, 3, 5)
Jaccard Similarity = 2/4 = 0.5
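Both measures on the patterns above, as a NumPy sketch:
import numpy as np
A = np.array([1, 0, 1, 0, 1])
B = np.array([1, 1, 0, 0, 1])
hamming = np.sum(A != B)                   # 2 positions differ (positions 2 and 3)
jaccard = np.sum(A & B) / np.sum(A | B)    # common 1s / total unique 1s = 2/4
print(hamming, jaccard)                    # 2 0.5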
In machine learning, classification means predicting which category (or class) an object belongs to.
Distance measures are often used to decide how similar an object is to others in a dataset. Based on
these similarities (or distances), different algorithms can classify data.
Here are some key classification algorithms that rely on distance measures:
💬 In Simple Words:
The K-Nearest Neighbors (KNN) algorithm classifies a new data point based on the majority
class of its K nearest neighbors in the feature space.
The distance measure (like Euclidean distance) is used to find the closest data points to the new
one.
📐 How It Works:
1. Choose the number of neighbors, K.
2. Compute the distance (e.g., Euclidean) from the new point to every training point.
3. Pick the K closest points.
4. Assign the class that appears most often among them (majority vote).
✅ Example:
Imagine you want to classify a new animal based on its weight and height.
If you have K=2, the two closest neighbors to a new animal (weight = 2.5 kg, height = 33 cm) will likely be
Animal 1 and Animal 3, both cats. So, the new animal would be classified as a Cat.
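A minimal scikit-learn sketch of this example (the animal table below is hypothetical, chosen so that the two nearest neighbors of the new animal are the two cats):
from sklearn.neighbors import KNeighborsClassifier

X = [[2.0, 30], [8.0, 55], [2.8, 34], [9.5, 60]]   # [weight (kg), height (cm)] for Animals 1-4
y = ["Cat", "Dog", "Cat", "Dog"]

knn = KNeighborsClassifier(n_neighbors=2)   # K = 2
knn.fit(X, y)
print(knn.predict([[2.5, 33]]))   # ['Cat'] -- nearest are Animals 1 and 3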
💬 In Simple Words:
The Radius Nearest Neighbor (Radius NN) algorithm is similar to KNN but with a twist. Instead
of looking for a fixed number of neighbors, it considers all data points within a specified radius
(distance).
If no data points fall within the radius, the new point isn't classified.
📐 How It Works:
1. Choose a radius r.
2. Find every training point whose distance to the new point is at most r.
3. Assign the majority class of those points; if no points fall within the radius, the new point stays
unclassified.
✅ Example:
If your radius is 5 units and the new point is within 5 units of several points in the dataset, it will be
classified based on those points' majority class. If no points are within 5 units, the new point will not be
classified.
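A sketch with scikit-learn's RadiusNeighborsClassifier (the training points are made up; outlier_label covers the "no neighbors within the radius" case):
from sklearn.neighbors import RadiusNeighborsClassifier

X = [[1, 1], [2, 1], [10, 10]]   # hypothetical training points
y = ["A", "A", "B"]

rnn = RadiusNeighborsClassifier(radius=5.0, outlier_label="unclassified")
rnn.fit(X, y)
print(rnn.predict([[1.5, 1.0], [50, 50]]))   # ['A' 'unclassified']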
✅ 3. KNN Regression
💬 In Simple Words:
KNN Regression is similar to KNN Classification, but instead of assigning a class label, it predicts
a continuous value (like price, height, weight, etc.) for the new data point.
KNN Regression calculates the average (or sometimes weighted average) of the values of the K
nearest neighbors.
📐 How It Works:
1. Choose K and compute the distance from the new point to every training point.
2. Pick the K nearest neighbors.
3. Predict the average (or distance-weighted average) of their target values.
✅ Example:
Imagine you're predicting the price of a house based on its features (size, location, etc.).
If a new house has a size of 1700 sq ft, the KNN regression algorithm might look at the K nearest houses
(let's say K=2) and compute the average price of those houses to predict the price of the new house.
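A minimal sketch of that house example with scikit-learn's KNeighborsRegressor (the sizes and prices are hypothetical):
from sklearn.neighbors import KNeighborsRegressor

X = [[1500], [1600], [1800], [2400]]   # size in sq ft
y = [200000, 220000, 260000, 340000]   # price

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)
print(reg.predict([[1700]]))   # [240000.] -- mean of the 2 nearest houses (1600 and 1800 sq ft)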
Performance of Classifiers
In machine learning, evaluating the performance of classifiers is crucial to understand how well a model
is working and if it can be trusted to make accurate predictions.
When we train a classifier (e.g., K-Nearest Neighbors, Support Vector Machine), we need to assess its
ability to correctly predict the classes of new, unseen data. This assessment is usually done using several
metrics and evaluation techniques.
Accuracy
Precision
Recall
F1-Score
Specificity
What is Accuracy?
Accuracy tells us how many predictions the model got correct out of the total predictions made:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example:
Suppose a spam filter looks at 100 emails, and 40 of them were actually spam:
o Model correctly flagged 30 → TP = 30
o Missed 10 → FN = 10
60 emails were not spam:
o Model correctly said “not spam” for 50 → TN = 50
o Wrongly flagged 10 → FP = 10
Accuracy = (30 + 50) / 100 = 0.80, i.e. 80% of predictions were correct.
What is Precision?
"Out of all the times the model said 'Yes', how many times was it actually correct?"
Precision = TP / (TP + FP)
What is Recall?
"Out of all the actual 'Yes' cases, how many did the model correctly find?"
Recall = TP / (TP + FN)
In simple words:
“Did the model catch all the actual positives?”
What is the F1-Score?
The F1-Score is the harmonic mean of Precision and Recall; it balances the two in a single number:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
What is Specificity?
"Out of all the actual 'No' cases, how many did the model correctly say 'No'?"
Specificity = TN / (TN + FP)
In other words:
“How good is the model at identifying the negatives correctly?”
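All five metrics can be computed with scikit-learn; a sketch on made-up labels (1 = positive class):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical actual labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # hypothetical model predictions

print("Accuracy:   ", accuracy_score(y_true, y_pred))               # 0.75
print("Precision:  ", precision_score(y_true, y_pred))              # 0.75
print("Recall:     ", recall_score(y_true, y_pred))                 # 0.75
print("F1-Score:   ", f1_score(y_true, y_pred))                     # 0.75
print("Specificity:", recall_score(y_true, y_pred, pos_label=0))    # recall of the negative class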
Evaluation Techniques
1. Cross-Validation
Instead of training and testing once, we split the data into K parts (folds); each fold takes a turn
as the test set while the model trains on the rest, so every data point gets tested exactly once.
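A sketch of 5-fold cross-validation on the Iris data using cross_val_score (5 folds is my choice; other values work too):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(scores, scores.mean())   # one accuracy score per fold, then the average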
2. ROC Curve
The ROC curve shows the performance of a classification model at all classification thresholds.
It compares:
➤ the True Positive Rate (Recall) on the y-axis
➤ the False Positive Rate (1 − Specificity) on the x-axis
as the decision threshold varies.
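A sketch that computes the ROC points and the area under the curve (AUC) for a KNN model; the dataset choice (breast cancer) is mine:
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scores = knn.predict_proba(X_test)[:, 1]           # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))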