Module 1-3

The document provides a comprehensive overview of feature extraction techniques for text and image data, emphasizing the importance of transforming raw data into numerical features for machine learning. It covers methods such as n-grams for text and pixel counting for images, along with dimensionality reduction and data augmentation strategies. Additionally, it explains key machine learning concepts, models, and evaluation metrics, making it suitable for beginners in the field.

Detailed Explanation of the PDF: "AIML Module 01 Lab 01 Features"

This lab notebook introduces the concept of feature extraction from data, focusing on two
types: text data and image data. The goal is to help you understand how to transform raw data
into features that machine learning algorithms can use. Below, I will explain every major part,
step by step, as if you are new to the field.

Part 1: Features of Text

Why Extract Features from Text?


Machine learning algorithms work with numbers, not raw text.
To use text as input, we must convert it into a set of features (numerical values that capture
important information about the text).

Downloading Text Data


The lab downloads Wikipedia articles on topics like "Giraffe," "Elephant," "Machine," and
"Artificial Intelligence" in English, French, and Hindi.
This is done using the wikipedia Python package.

Cleaning the Text


The text is cleaned to remove all special characters, numbers, and spaces, leaving only
lowercase alphabetic characters.
This simplifies the data and ensures consistency across languages.

What are N-grams?


N-grams are contiguous sequences of 'n' items (characters or words) from a given text.
Unigram: Single character (n=1)
Bigram: Pair of characters (n=2)
Trigram: Sequence of three characters (n=3)
N-grams help capture the structure and patterns in text.
Counting N-grams
The frequency of each n-gram is counted using Python’s Counter from the collections
module.
These counts are used as features representing the text.
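
A minimal sketch of this counting step with Python's Counter (the sample string below is made up; the lab works on the cleaned Wikipedia text):

from collections import Counter

def ngram_counts(text, n=2):
    """Count character n-grams in cleaned, lowercase alphabetic text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

sample = "thegiraffeisthetallestlivinganimal"   # already-cleaned text
print(ngram_counts(sample, n=1).most_common(3))  # unigram counts
print(ngram_counts(sample, n=2).most_common(3))  # bigram counts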

Visualizing N-gram Frequencies


The frequencies are plotted as histograms (bar charts).
By comparing these plots for different languages, you can see that the patterns of n-gram
frequencies are unique to each language.

Key Observations
Bigram frequencies (pairs of characters) are similar across topics in the same language but
differ significantly between languages.
Therefore, bigram frequency is a good feature for distinguishing languages, not topics.

Dimensionality Reduction
By using unigrams, you reduce the text to 26 features (one for each letter).
Using bigrams, you get 26×26 = 676 features (all possible pairs of letters).

Further Exploration
Try using different languages, topics, or text sources.
Visualize trigrams (three-character sequences) or higher-order n-grams for more complex
patterns.

Part 2: Features from Images (Written Numbers)

Dataset Used: MNIST


The MNIST dataset contains images of handwritten digits (0-9), each as a 28×28 pixel
grayscale image.

Visualizing the Data


Images of the digits "1" and "0" are displayed to understand what the data looks like.

Simple Feature: Sum of Pixels


For each image, count the number of non-black (active) pixels.
This feature alone can distinguish between some digits (e.g., "1" has fewer active pixels
than "0").
Advanced Feature: Counting "Hole" Pixels
A "hole pixel" is a black pixel completely surrounded by non-black pixels.
The algorithm fills in the holes and counts how many such pixels exist.
This feature is especially useful for distinguishing "0" (which has a hole) from "1" (which
does not).

Visualizing Hole Pixels


The lab displays side-by-side images of the original digit and the image showing only the
hole pixels.

Feature: Hull Pixels


The "hull" of an image is the digit with all holes filled in.
Counting the number of hull pixels provides another feature for classification.

Feature: Boundary Pixels


Boundary pixels are those on the edge of the digit (where the digit meets the background).
The algorithm finds these by comparing each pixel to its neighbors.

Visualizing Boundary Pixels


The lab shows images with boundary pixels highlighted, helping you see the outline of the
digit.
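
The lab's own helper functions are not reproduced here; the following is a minimal sketch of the four image features, using scipy.ndimage (binary_fill_holes and binary_erosion are assumptions about how the fill and boundary steps could be implemented):

import numpy as np
from scipy.ndimage import binary_fill_holes, binary_erosion

def image_features(image, threshold=0):
    """Extract simple features from a 2D grayscale digit image (e.g., 28x28 MNIST)."""
    active = image > threshold                   # non-black (active) pixels
    hull = binary_fill_holes(active)             # digit with enclosed holes filled in
    holes = hull & ~active                       # pixels that sit inside holes
    boundary = active & ~binary_erosion(active)  # active pixels touching the background
    return {
        "pixel_sum": int(active.sum()),
        "hole_pixels": int(holes.sum()),
        "hull_pixels": int(hull.sum()),
        "boundary_pixels": int(boundary.sum()),
    }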

Summary Table: Features Extracted


| Data Type | Feature Name | What It Measures | Why It's Useful |
| --- | --- | --- | --- |
| Text | Unigram Count | Frequency of each letter | Captures language/character patterns |
| Text | Bigram Count | Frequency of letter pairs | Captures language structure |
| Text | Trigram Count | Frequency of letter triplets | Captures more complex patterns |
| Image | Pixel Sum | Number of active (non-black) pixels | Simple shape/size feature |
| Image | Hole Pixel Count | Number of enclosed "hole" pixels | Distinguishes digits with/without holes |
| Image | Hull Pixel Count | Number of pixels in filled digit | Measures overall digit area |
| Image | Boundary Pixels | Number of edge pixels | Captures digit outline |


Key Takeaways for Beginners
Feature extraction is the process of turning raw data (text or images) into numbers that a
computer can understand.
For text, features are often based on the frequency of letters or groups of letters (n-grams).
For images, features can be as simple as counting pixels or as complex as measuring holes
and boundaries.
Visualizing features helps you understand how well they separate different classes (e.g.,
different digits or languages).
The choice of features is crucial for the success of any machine learning model.

If you have any specific part of the code or concept you want to explore further, let me know!

Detailed Explanation of Lab 2: Machine Learning Terms, Metrics, and Data
Augmentation (Comprehensive & Structured)
This guide explains every concept, model, and process in Lab 2, incorporating all your follow-up
questions and using simple examples. It is organized into clear sections for easy understanding,
just like the structure used for Module 1.

Section 1: Core Machine Learning Terms and Concepts

Key Terms
| Term | Meaning | Example |
| --- | --- | --- |
| Feature | A measurable property of data | Pixel value in an image, house age |
| Target/Label | The value or class the model tries to predict | Digit (0-9), "Apple" or "Orange" |
| Classification | Predicting a category/class for each input | Is this an apple or an orange? |
| Regression | Predicting a continuous value | What is the house price? |
| Classifier | An algorithm/model that assigns a class label | KNN, Decision Tree |
| Model | A system that learns from data to make predictions | 1-NN classifier |
| Training set | Data used to teach the model | 80% of your dataset |
| Validation set | Data used to tune/check the model before final testing | 10% of your dataset |
| Test set | Data used to evaluate the model's final performance | 10% of your dataset |
| Metric | A way to measure model performance | Accuracy, precision, recall |
| Overfitting | Model memorizes training data but fails on new data | High train accuracy, low validation accuracy |
| Underfitting | Model fails to learn patterns from data | Low train and validation accuracy |

Section 2: Models Used and How They Work


1. K-Nearest Neighbors (KNN) Classifier
Purpose: Assigns a class to a new data point based on the classes of its ‘k’ closest
neighbors in the training set [1] [2] .
How it works (with k=1, “1-NN”):
1. For a new data point, calculate its distance to every point in the training set.
2. Find the single closest point (“nearest neighbor”).
3. Assign the class of that neighbor to the new point.

Simple Example
Suppose you have these fruits:

| Fruit | Roundness | Diameter | Class |
| --- | --- | --- | --- |
| Apple | 8 | 7 | Apple |
| Orange | 6 | 8 | Orange |

A new fruit has roundness 7, diameter 8.


Calculate distance to each fruit.
The closest is Orange (distance = 1).
So, classify the new fruit as Orange.

How to Find the Closest Neighbor


Euclidean distance: $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
Repeat for all points, pick the smallest value.
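
A small sketch of this nearest-neighbor calculation for the fruit example above (numbers taken from the table):

import numpy as np

train_X = np.array([[8, 7], [6, 8]])       # [roundness, diameter] for Apple, Orange
train_y = ["Apple", "Orange"]

new_fruit = np.array([7, 8])
distances = np.sqrt(((train_X - new_fruit) ** 2).sum(axis=1))  # Euclidean distance to each fruit
print(distances)                            # [1.414..., 1.0]
print(train_y[int(np.argmin(distances))])   # "Orange" -> the nearest neighbor's class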

2. Random Classifier
Purpose: Assigns a random label to each data point, without learning from data.
Why use it: It’s a baseline to show what accuracy you’d get just by guessing.
Example:
If you have 4 fruits (2 apples, 2 oranges) and guess randomly, you’ll be correct about 50%
of the time.
Section 3: Measuring Model Performance

Train Accuracy vs. Validation Accuracy


| Metric | How It's Calculated | Example |
| --- | --- | --- |
| Train Accuracy | Correct predictions on training data | 5/6 correct → 83.3% |
| Validation Accuracy | Correct predictions on validation data | 3/4 correct → 75% |

Why both?
High train accuracy but low validation accuracy means overfitting.
Both low means underfitting.
Both high and close to each other means good generalization [3].

Section 4: How Images Are Represented for the Model

Flattened Pixel Values


What does “flattened” mean?
An image is a grid (e.g., 28x28 pixels). Flattening means turning this grid into a single long
list of numbers (784 values for MNIST).
Why flatten?
Most models expect a single list (vector) of features, not a 2D grid.

Example
A 3x3 image:

0 255 0
255 0 255
0 255 0

Flattened: [0, 255, 0, 255, 0, 255, 0, 255, 0]
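
In code, flattening is a single reshape, for example with NumPy:

import numpy as np

img = np.array([[0, 255, 0],
                [255, 0, 255],
                [0, 255, 0]])
print(img.flatten())          # [  0 255   0 255   0 255   0 255   0]
print(img.reshape(-1).shape)  # (9,); a 28x28 MNIST image would become (784,)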

Section 5: Data Augmentation

What is Data Augmentation?


Definition:
Creating new data samples by making small, realistic changes to the originals (e.g., rotating,
shearing, flipping images) [4] [5] [6] [7] .
Why:
Increases dataset size, reduces overfitting, helps model generalize better.
Simple Example
If you have 5 images of a cat, you can:
Flip them horizontally
Rotate them a few degrees
Adjust brightness
Now you have many more images for training, even though you started with just 5.

How Shear Changes an Image


Shearing slants the image sideways, turning straight vertical lines into slanted lines, like
pushing the top of a square sideways while keeping the bottom fixed.
Why use shear?
It helps the model learn to recognize digits that are written at an angle.

Visual Example
Original:

+-----+
| |
| |
+-----+

After horizontal shear:

+-----+
/ /
/ /
+-----+
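
The lab applies these transformations with its own helpers; a minimal sketch using scipy.ndimage (the shear matrix used here is an assumption about one way to implement a horizontal shear) looks like this:

import numpy as np
from scipy.ndimage import rotate, affine_transform

def augment(image, angle=10.0, shear=0.2):
    """Return a rotated copy and a horizontally sheared copy of a 2D grayscale image."""
    rotated = rotate(image, angle, reshape=False)   # rotate by `angle` degrees
    shear_matrix = np.array([[1.0, shear],
                             [0.0, 1.0]])           # maps output coordinates to input coordinates
    sheared = affine_transform(image, shear_matrix)
    return rotated, sheared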

Section 6: Model and Evaluation Process

Step-by-Step Workflow
1. Train the 1-NN classifier on original (flattened) images.
Measure baseline accuracy (e.g., 80%).
2. Apply data augmentation (rotate, shear, etc.) to create new images.
Retrain the model on the expanded dataset.
Measure new accuracy (e.g., 85%).
3. Hyperparameter Tuning (Grid Search):
Hyperparameters: Settings you choose before training (e.g., angle of rotation, amount
of shear).
Grid search: Try different values (e.g., rotate by 10°, 20°, 30°) and see which gives the
best accuracy [8] .
For two hyperparameters (e.g., angle and shear), try all combinations.
4. Visualize Results with Graphs:
One hyperparameter:
X-axis: Value (e.g., angle in degrees)
Y-axis: Test accuracy
Plot to see which value works best.
Two hyperparameters (e.g., angle and shear):
Use a heatmap (colored grid):
X-axis: Angle
Y-axis: Shear
Color: Accuracy

Simple Example of the Whole Process


Suppose you want to train a model to recognize handwritten digits (like "3" and "8"):
1. Baseline:
You train your model on the original images.
Test accuracy: 80% (model gets 8 out of 10 test images right).
2. Augmentation:
You rotate each image by a small angle (say, 10°) and add these to your training data.
Retrain the model.
Test accuracy: 85% (now gets 8.5 out of 10 right on average).
3. Hyperparameter Tuning (Grid Search):
Try rotation angles: 5°, 10°, 15°, 20°
For each angle, retrain the model and measure accuracy.
Plot angle (X-axis) vs. accuracy (Y-axis). The best angle is where accuracy is highest.
4. Combined Augmentation:
Try all combinations of angle (5°, 10°, 15°) and shear (2, 4, 6).
For each pair, retrain and measure accuracy.
Show results as a heatmap: X-axis is angle, Y-axis is shear, color is accuracy.
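
A minimal sketch of the grid search in step 3, using scikit-learn's small digits dataset and rotation angle as the single hyperparameter (the dataset, split, and helper name are assumptions, not the lab's exact code):

import numpy as np
from scipy.ndimage import rotate
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_img_train, X_img_val, y_train, y_val = train_test_split(
    digits.images, digits.target, test_size=0.2, random_state=0)

def accuracy_for_angle(angle):
    """Augment the training set with rotated copies, retrain 1-NN, return validation accuracy."""
    rotated = np.array([rotate(img, angle, reshape=False) for img in X_img_train])
    X_train = np.vstack([X_img_train, rotated]).reshape(-1, 64)   # flatten 8x8 images
    y_aug = np.concatenate([y_train, y_train])
    model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_aug)
    return model.score(X_img_val.reshape(-1, 64), y_val)

for angle in [5, 10, 15, 20]:                  # the grid of hyperparameter values
    print(f"angle={angle}: accuracy={accuracy_for_angle(angle):.3f}")

For two hyperparameters, the same loop runs over every (angle, shear) pair and the resulting accuracies fill the heatmap.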
Section 7: Why Use a Random Classifier as a Baseline?
A random classifier just guesses the class for each data point.
If your model can't do better than random, it means it isn’t learning any useful patterns.
Example:
For a binary task (apple/orange), random guessing gives you about 50% accuracy. Your
model should do better than this.

Section 8: Summary Table of All Concepts


| Concept | Explanation/Example |
| --- | --- |
| Classification | Sorting data into categories (e.g., "cat" or "dog") |
| Regression | Predicting continuous values (e.g., house price) |
| 1-NN Classifier | Assigns class of the closest training example to a new data point |
| Flattened Pixel Values | Turning image grid into a single list of numbers |
| Data Augmentation | Creating new samples by changing originals (rotate, shear, etc.) |
| Shear | Slanting an image sideways |
| Train/Validation Accuracy | Accuracy on training/validation data |
| Random Classifier | Assigns classes randomly, used as a baseline |
| Hyperparameter Tuning | Trying different settings to find the best model performance |
| Grid Search | Systematically trying combinations of hyperparameters |
| Graphs/Heatmaps | Visualizing how accuracy changes with different settings |

Section 9: Lab 2 in a Nutshell


1. Understand and use key ML terms and models (classification, regression, KNN, random
classifier).
2. Represent images as flattened pixel values for the model.
3. Augment data (rotate, shear) to help the model learn better.
4. Evaluate model using train and validation accuracy.
5. Tune hyperparameters (like angle/shear) using grid search.
6. Visualize results with graphs and heatmaps to pick the best settings.
7. Compare with a random classifier to ensure your model is actually learning.

If you need any section explained even more simply or visually, let me know!

1. https://www.datacamp.com/blog/classification-machine-learning
2. https://stats.stackexchange.com/questions/151756/knn-1-nearest-neighbor
3. https://www.digitalocean.com/community/tutorials/training-validation-and-accuracy-in-pytorch
4. https://www.datacamp.com/tutorial/complete-guide-data-augmentation
5. https://aws.amazon.com/what-is/data-augmentation/
6. https://neptune.ai/blog/data-augmentation-in-python
7. https://docs.edgeimpulse.com/docs/concepts/machine-learning/data-augmentation
8. https://www.yourdatateacher.com/2021/05/19/hyperparameter-tuning-grid-search-and-random-search/
Detailed Explanation of Lab 3: Transforming Data Using Linear Algebra (Updated
with All Your Queries)
This comprehensive guide explains every concept, step, and method in Lab 3, integrating all
your follow-up questions. It is structured for absolute beginners, with clear definitions, step-by-
step examples, and practical context for each part of the lab.

Section 1: Introduction to Linear Algebra and Matrix Transformations

What is Linear Algebra in Machine Learning?


Linear algebra is the math of vectors (lists of numbers) and matrices (tables of numbers).
In machine learning, data is often represented as vectors or matrices, and many algorithms
use linear algebra to process and transform this data [1] [2] [3] [4] .

What is a Matrix Transformation?


A matrix transformation is a mathematical operation where you multiply your data (as
vectors) by a matrix to change its orientation, scale, or position.
This can rotate, stretch, compress, or shear the data, changing how it appears in space [5]
[6] [3] .

Section 2: Visualizing Transformations on a Unit Square

How Does a Matrix Change Data?


Imagine a simple square drawn on a piece of paper.
If you apply a transformation matrix, the square might become a rectangle, parallelogram, or
even rotate.
Each point (x, y) in the square is multiplied by the matrix to get a new point (x', y').

Simple Example:
Suppose you have the point (2, 3) and a transformation matrix such as

$M = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$

Multiply to get the new point:

$M \begin{pmatrix} 2 \\ 3 \end{pmatrix} = \begin{pmatrix} 1 \cdot 2 + (-1) \cdot 3 \\ 1 \cdot 2 + 1 \cdot 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 5 \end{pmatrix}$

So, (2, 3) moves to (-1, 5) after the transformation.

Section 3: Effect of Transformations on Data and Distance

Why Do Transformations Matter?


Many algorithms (like nearest neighbor) use the distance between points to make decisions.
A transformation can make points that were close become far apart, or vice versa, changing
how the model sees the data [5] [3] .

Example:
If three points are close together before a transformation, they might be spread out after.
This can change which points are considered "nearest neighbors" for classification.

Section 4: Feature Extraction from Images

What is Feature Extraction?


Feature extraction means turning complex data (like an image) into a set of numbers
(features) that capture important information.
In the lab, features extracted from digit images include:
Hole pixels: Number of pixels inside holes in the digit.
Boundary pixels: Number of pixels on the digit’s edge.
Hull pixels: Number of pixels in the filled shape.
Sum of pixel intensities: Total brightness.

Section 5: Using the 1-Nearest Neighbor (1-NN) Classifier

How Does 1-NN Work?


For a new data point, the 1-NN classifier finds the closest point in the training set (using
Euclidean distance) and assigns its label.
The classifier is tested on the extracted features, and accuracy is calculated as the
percentage of correct predictions.
Example:
If 80 out of 100 test digits are correctly classified, the accuracy is 80%.

Section 6: Transforming Features with Matrices

How Is a Matrix Transformation Applied?

Step-by-Step Example
Suppose you have three data points, each with two features:

| Data Point | Feature 1 (x) | Feature 2 (y) |
| --- | --- | --- |
| A | 2 | 3 |
| B | 1 | 0 |
| C | -1 | 2 |

You choose a 2×2 transformation matrix M.

To transform A (2, 3), multiply M by A's feature vector (2, 3); the result is A's new position (see the sketch below).

Do the same for B and C.


After transformation, all points are in new positions (the data is rotated and stretched).
Why do this?
Sometimes, transforming features makes classes more separable or reveals new patterns.
Both training and test data must be transformed the same way for fair comparison.
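
A short sketch of this step with NumPy (the matrix below is illustrative, not the one used in the lab):

import numpy as np

points = np.array([[ 2, 3],
                   [ 1, 0],
                   [-1, 2]], dtype=float)   # rows are data points A, B, C

M = np.array([[1.0, -1.0],
              [1.0,  1.0]])                 # illustrative 2x2 transformation matrix

transformed = points @ M.T                  # apply the same matrix to every point
print(transformed)                          # A -> [-1, 5], B -> [1, 1], C -> [-3, 1]

The same multiplication would be applied to the test data before classification.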

Section 7: Experimenting with Different Transformations


The lab tries multiple transformation matrices.
For each, it visualizes the transformed features and measures classification accuracy.
Some transformations improve accuracy, others may not.
Section 8: Feature Selection and Combinations

How Can Two or More Features Be Used Together?

Simple Example:
Suppose you want to classify fruit as apple or orange. You have:
Weight
Color Score
If you use both features, you can plot each fruit as a point in 2D space (weight vs. color). Apples
and oranges may form separate clusters, making classification easier.

Combining Features:
Sometimes, you create new features by combining existing ones (e.g., Area = Height ×
Width).
Using multiple features often improves model performance because it captures more
information about each data point.

In the Lab:
The lab tests different combinations of features (e.g., hole only, boundary only, hole +
boundary).
Some combinations give better accuracy than others.

Section 9: Data Normalization

Why Normalize?
Features with different scales (e.g., 0–1000 vs. 0–10) can cause models to focus too much
on the bigger numbers.
Normalization rescales each feature to the same range (typically 0 to 1).

How to Normalize:
$x_{\text{norm}} = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
Do this for each feature, for both training and test data.
Effect on Accuracy:
Normalization can improve or maintain accuracy by ensuring all features are treated equally.
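
A minimal sketch of min-max normalization, fitting the scale on the training data and applying the same scale to the test data (the numbers are made up):

import numpy as np

X_train = np.array([[2.0, 300.0],
                    [4.0, 900.0],
                    [3.0, 600.0]])
X_test = np.array([[3.5, 750.0]])

mins, maxs = X_train.min(axis=0), X_train.max(axis=0)   # learned from training data only
X_train_norm = (X_train - mins) / (maxs - mins)
X_test_norm = (X_test - mins) / (maxs - mins)           # same rescaling applied to test data
print(X_train_norm)
print(X_test_norm)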

Section 10: Key Takeaways and Best Practices


Matrix transformations help you change the “view” of your data, sometimes making
patterns easier to see.
Feature extraction turns complex data into usable numbers for models.
1-NN classifier uses distances to classify new data points.
Feature selection and combination can greatly affect model performance.
Normalization ensures fair comparison between features.
Experimentation with transformations and feature sets is key to finding the best model.

Summary Table: Concepts and Their Roles


| Concept | Role in Lab | Example/Effect |
| --- | --- | --- |
| Matrix Transformation | Changes data's "view" or coordinate system | Rotates, stretches, or skews data |
| Feature Extraction | Turns images into numbers the model can use | Hole pixels, boundary pixels, etc. |
| 1-NN Classifier | Predicts class based on nearest neighbor | Simple, distance-based classification |
| Feature Selection | Choosing which features to use for best results | Hole + Boundary vs. all features |
| Data Normalization | Makes features comparable by rescaling them | All features between 0 and 1 |
| Visualization | Plots features to see class separation and transformation effect | Scatter plots, color by class |

Lab 3 in a Nutshell
1. Understand linear algebra and matrix transformations and how they affect data [1] [2] [5]
[6] [3] [7] [8] [4] .

2. Extract features from images for classification.


3. Visualize features and their transformations.
4. Classify using 1-NN and measure accuracy.
5. Experiment with different transformations and feature combinations.
6. Normalize features for fair comparison.
7. Choose the best feature set and transformation for your model.

If you need more details or examples for any section, just ask!

1. https://www.youtube.com/watch?v=SioiFuMRiv4
2. https://www.coursera.org/learn/machine-learning-linear-algebra
3. https://www.machinelearningmastery.com/linear_algebra_for_machine_learning/
4. https://www.machinelearningplus.com/linear-algebra/introduction-to-linear-algebra/
5. https://www.bitsathy.ac.in/linear-algebra-for-machine-learning/
6. https://www.freecodecamp.org/news/linear-algebra-roadmap/
7. https://www.youtube.com/watch?v=uZeDTwWcnuY
8. https://pabloinsente.github.io/intro-linear-algebra
Detailed Explanation of Module 2 Lab 1: Basic Plots (Structured for Beginners)
This lab introduces the fundamentals of data visualization using Python, focusing on how to
appreciate, interpret, and visualize data before applying machine learning. You’ll learn about
different plot types, how to create them, and how to interpret what they show about your
dataset. Below is a step-by-step breakdown, integrating all the main ideas and practical
examples from the notebook.

Section 1: Why Visualize Data?


Purpose: Visualization helps us quickly understand patterns, trends, and relationships in
data that would be hard to spot in tables.
Key Point: Good visualizations make it easier to appreciate, interpret, and communicate
insights from data.

Section 2: Getting Started with Matplotlib and Seaborn


Matplotlib: A powerful and widely-used Python library for 2D plotting. It helps you create
line plots, scatter plots, bar charts, and more.
Seaborn: Built on top of Matplotlib, Seaborn makes it easier to create attractive and
informative statistical graphics (like box plots and violin plots).

Section 3: Loading and Preparing the Dataset


Dataset Used: The Automobile dataset (from Kaggle), containing information about cars
(e.g., make, price, horsepower, body style).
Data Cleaning: Missing values are replaced with NaN and rows with missing data are
dropped to ensure clean analysis.
Splitting Data:
Features (X): All columns except price.
Target (y): The price column, converted to numeric values.

Section 4: Basic Plot Types and Their Interpretation


A. Scatter Plot
What it shows: The relationship between two variables (e.g., car make and price).
How to create:
plt.figure(figsize=(24, 7), facecolor='pink')
plt.scatter(X["make"], y, c='green')
plt.xlabel('Make')
plt.ylabel('Price')
plt.title('Car maker vs Price - Scatter Plot')
plt.show()

Interpretation:
Each point is a car.
You can see which car makers tend to have higher or lower prices.
Example: Mercedes-Benz, Jaguar, Porsche, and BMW are on the higher end.

B. Box Plot
What it shows: The distribution of prices for each car maker, including median, quartiles,
and outliers.
How to create:
sns.set(rc={'figure.figsize': (24,7)})
sns.boxplot(x=X["make"], y=y, palette="Set3").set_title('Car Manufacturer vs Price -

Interpretation:
The box shows the middle 50% of prices (interquartile range).
The line inside the box is the median.
Whiskers show the range; dots outside are outliers.
Example: Mercedes-Benz and Porsche have the highest medians and outliers at very
high prices.

C. Violin Plot
What it shows: Combines a box plot with a density plot to show the distribution’s shape for
each group.
How to create:
sns.violinplot(x=X["make"], y=y, palette="Set3").set_title('Car maker vs Price - Viol

Interpretation:
The width of the violin indicates how many cars fall in that price range.
You can see if the distribution is skewed or has multiple peaks.

D. Swarm Plot
What it shows: Each data point is plotted, avoiding overlap, to show the distribution within
categories.
How to create:
sns.swarmplot(x=X["make"], y=y).set_title('Car maker vs Price Swarm Plot')

Interpretation:
You see the spread of individual car prices for each make.
Useful for spotting clusters and outliers.

Section 5: Exploring Numeric Features and Correlations

A. Scatter Plot for Numeric Features


Example: Horsepower vs. Price
How to create:
sns.scatterplot(x=pd.to_numeric(X["horsepower"]), y=y, color="blue")

Interpretation:
Points trend upward: higher horsepower generally means higher price.

B. Joint Plot
What it shows: Relationship between two variables, plus their individual distributions.
How to create:
sns.jointplot(x=pd.to_numeric(X["horsepower"]), y=y, kind="reg", color='green')

Interpretation:
The regression line shows the trend.
If the line slopes up, the variables are positively correlated.
C. Finding Negative Correlations
Example: Highway-mpg (miles per gallon on highway) vs. Price
How to create:
sns.jointplot(x=data["highway-mpg"], y=data["price"], kind="reg", color='purple')

Interpretation:
The regression line slopes down: as mpg increases, price tends to decrease (negative
correlation).
This means more fuel-efficient cars are often less expensive.

Section 6: Interpreting Plots for Categorical Features

Box Plot for Body Style vs. Price


How to create:
sns.boxplot(x=X["body-style"], y=y, palette="Set3").set_title('Body Style vs Price Bo

Interpretation:
Convertibles have the highest maximum prices.
Hatchbacks have the lowest median and least price variation.
Sedans cover a wide range of prices.
Body style can be a good predictor of price.

Section 7: Plotting Multiple Plots Together


How to create subplots:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.boxplot(x=X["body-style"], y=y, palette="Set3", ax=axes[0])
axes[0].set_title('Body Style vs Price - Box Plot')
sns.swarmplot(x=X["body-style"], y=y, color="green", ax=axes[1])
axes[1].set_title('Body Style vs Price - Swarm Plot')
plt.tight_layout()
plt.show()

Interpretation:
You can compare different plot types side by side for deeper insights.
Section 8: Exercise and Exploration
Try plotting joint plots for all numeric features against price.
Look for features with strong positive or negative correlations.
Try new plot types from Seaborn’s documentation to represent the data in creative ways.

Key Takeaways
Visualization is essential for understanding your data before modeling.
Scatter plots show relationships between two variables.
Box plots and violin plots reveal distributions and outliers within categories.
Swarm plots help visualize every data point in a category.
Joint plots combine scatter plots with distribution plots, making trends and correlations
obvious.
Subplots let you compare multiple visualizations at once.
Interpreting plots helps you spot trends, outliers, and relationships that are hard to see in
tables.

If you need a deeper explanation of any plot type, how to interpret a specific plot, or want to see
more code examples, just ask!

Detailed Explanation of Module 2 Lab 2: Principal Components Analysis (PCA) –
Updated with All Your Questions
This guide explains every concept and step in the PCA lab, integrating your follow-up questions
and using beginner-friendly language and examples. Each section is structured for clarity and
depth, with practical context and simple analogies.

Section 1: What is Principal Component Analysis (PCA) and Why Use It?
PCA is a technique for simplifying complex datasets by reducing the number of features
(dimensions) while preserving as much important information (variance) as possible [1] [2] .
Why use PCA?
To visualize high-dimensional data in 2D or 3D.
To speed up machine learning and reduce overfitting.
To remove noise and redundancy from data.
To find patterns or groupings that are hard to see in the original data.

Section 2: Step-by-Step PCA Process

Step 1: Standardization (Normalization)


Why?
Features may have different units or scales (e.g., height in cm, weight in kg).
Standardization rescales all features to have mean = 0 and standard deviation = 1, so
each feature contributes equally [2] [3] .
How?
For each value, subtract the mean of that feature and divide by its standard deviation:
$z = \dfrac{x - \mu}{\sigma}$
Example:
If one feature ranges from 1–1000 and another from 0–1, the first would dominate the
analysis unless standardized.
Step 2: Covariance Matrix Calculation
What is Covariance?
It measures how two features vary together.
Positive covariance: features increase together; negative: one increases as the other
decreases.
Covariance Matrix:
A table showing the covariance between every pair of features.
For 3 features, it’s a 3x3 matrix.
Why?
It helps find relationships between features and is the foundation for finding principal
components [2] [3] .

Step 3: Eigenvectors and Eigenvalues


Eigenvectors:
Directions (axes) in the data space along which variance is maximized.
Eigenvalues:
Tell how much variance is along each eigenvector.
Why?
The eigenvector with the largest eigenvalue points in the direction of the greatest
variance in the data [2] [3] .
How?
Solve the equation $C v = \lambda v$, where $C$ is the covariance matrix, $v$ is the eigenvector, and $\lambda$ is the eigenvalue [2].

Step 4: Computing the Principal Components (Selection and Construction)


What are Principal Components?
New axes (directions) created from the original features, capturing the most variance.
The first principal component (PC1) captures the most variance; the second (PC2)
captures the next most, and so on [2] [3] .
How are they computed?
1. Pair eigenvectors and eigenvalues.
2. Sort eigenvectors by their eigenvalues in descending order.
3. Select the top k eigenvectors (those with the highest eigenvalues); these are your
principal components [2] [3] .
4. Project the standardized data onto these new axes (multiply the data by the selected
eigenvectors), creating new features (PC1, PC2, ...) [2] [3] .
Simple Example:
Imagine you have two features (height and weight).
After standardization and covariance calculation, you find two eigenvectors.
The first points in the direction where the data is most spread out (maybe a diagonal
through the data cloud).
The second is perpendicular to the first.
You keep the first one (or both) as your principal components and project your data
onto these axes for easier analysis [2] [3] .

Step 5: Transforming Data to the New Subspace


How?
Multiply the original standardized data by the matrix of selected eigenvectors.
Each data point now has coordinates in terms of the principal components, not the
original features [2] [3] .
Why?
This transformation gives you a new dataset with fewer features but most of the
important information preserved.

Section 3: Interpreting and Using Principal Components

Explained Variance and Choosing Number of Components


Explained Variance:
The proportion of the dataset’s total variance captured by each principal component.
The first few PCs often capture most of the variance.
Cumulative Explained Variance:
Add up the explained variance for the top PCs to see how much total information you
keep.
Rule of Thumb: Keep enough PCs to explain 90% of the variance [3] .
Example:
If PC1 explains 70% and PC2 explains 20%, the first two together explain 90% of the
data’s variance.
Visualization
2D or 3D Scatter Plots:
Plot data using the first two or three PCs as axes.
Color points by class (e.g., benign/malignant).
If classes separate well, PCA has revealed useful structure [3] .
Loadings:
Show how much each original feature contributes to each principal component.
High positive or negative values mean strong influence.

Section 4: Practical Applications of PCA


Data Visualization:
Make complex, high-dimensional data visible and understandable [3] .
Noise Reduction:
Remove less important components (those with low variance) to reduce noise.
Feature Engineering:
Use PCs as new features for machine learning models.
Real-World Examples:
Image compression, facial recognition, anomaly detection, recommendation systems,
and healthcare data analysis [1] [3] .

Section 5: PCA in Action – Simple Example


Suppose you have three features (A, B, C) for each sample.

| Sample | A | B | C |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 2 | 3 | 4 | 5 |
| 3 | 4 | 5 | 6 |

Step-by-step:
1. Standardize A, B, C.
2. Compute covariance matrix (3x3).
3. Find eigenvectors/eigenvalues.
4. Sort and select top 2 eigenvectors (PC1, PC2).
5. Project data onto PC1 and PC2 to get new values for each sample.
6. Plot samples on a 2D graph using PC1 and PC2.
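
A compact sketch of these steps with scikit-learn (the toy data below is made up for illustration, not the lab's dataset):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2, 3, 4],
              [3, 4, 5],
              [4, 5, 6],
              [4, 3, 7],
              [1, 5, 2]], dtype=float)       # toy samples with features A, B, C

X_std = StandardScaler().fit_transform(X)    # Step 1: standardize
pca = PCA(n_components=2)                    # Steps 2-4: covariance, eigenvectors, keep top 2
X_proj = pca.fit_transform(X_std)            # Step 5: project onto PC1 and PC2

print(pca.explained_variance_ratio_)         # fraction of variance captured by each PC
print(X_proj)                                # each sample's coordinates in (PC1, PC2)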
Section 6: Summary Table
| Step | What Happens? | Why It Matters |
| --- | --- | --- |
| Standardization | Rescale features to mean 0, std 1 | Prevents features with large values dominating |
| Covariance Matrix | Measures how features vary together | Finds relationships between features |
| Eigenvectors/Eigenvalues | Find new axes (directions) of max variance | Basis for principal components |
| Principal Components | New axes capturing most information | Reduce data size, keep important info |
| Explained Variance | Shows how much info each component keeps | Helps choose how many components to keep |
| Data Projection | Transform data onto new axes | Enables visualization and better modeling |

Section 7: Key Takeaways


PCA is a powerful tool for simplifying complex data and improving analysis.
It works by finding new axes (principal components) that capture the most variance.
The process involves standardization, covariance calculation, finding
eigenvectors/eigenvalues, selecting top components, and projecting data.
PCA is widely used for visualization, noise reduction, and as a preprocessing step for
machine learning.

If you want a deeper explanation of any step, or a code example for a specific part, just ask!

1. https://www.pickl.ai/blog/a-step-by-step-complete-guide-to-principal-component-analysis-pca-for-beginners/
2. https://www.turing.com/kb/guide-to-principal-component-analysis
3. https://www.datacamp.com/tutorial/pca-analysis-r
Detailed Explanation of Module 2 Lab 3: Manifold Learning Methods (Updated and
Structured for Beginners)
This guide explains every concept and step in the Manifold Learning lab, integrating the content
provided and all your follow-up queries. It is organized for clarity, with practical examples and
beginner-friendly language.

Section 1: What is Manifold Learning?


Manifold learning is a set of techniques for reducing the dimensionality of data by finding a
lower-dimensional “surface” (manifold) within a higher-dimensional space.
Many real-world datasets, though high-dimensional, actually lie on or near a much lower-
dimensional curved surface (manifold).
The goal: Find a new, low-dimensional representation of the data that preserves its
essential structure—especially for visualization or further analysis.

Section 2: Why Not Just Use PCA?


PCA (Principal Component Analysis) is a linear method: it works well if the data lies on a flat
(linear) subspace.
Drawbacks of PCA on curved manifolds:
PCA may need many more dimensions than the true manifold to capture the data
structure.
PCA can project faraway points along the manifold to nearby locations, losing the true
relationships [1] [2] .
PCA cannot capture curved or non-linear relationships [1] .

Section 3: What is ISOMAP?


ISOMAP stands for Isometric Mapping.
It is a non-linear dimensionality reduction technique based on spectral theory.
Key idea: Instead of preserving straight-line (Euclidean) distances, ISOMAP preserves
geodesic distances—the shortest path along the manifold, not through space [1] [2] [3] .
Result: ISOMAP can "unfold" curved data (like an S-curve) into a flat, low-dimensional
space while preserving meaningful relationships.
Section 4: How ISOMAP Works (Step-by-Step with Example)

Step 1: Construct the Neighborhood Graph


Goal: Capture local relationships by connecting each data point to its nearest neighbors [1]
[2] [3] [4] .

How:
For each data point, find its k nearest neighbors (using Euclidean distance).
Build a graph where each point is a node connected to its neighbors by edges weighted
by their distances.
You can use either the k-nearest neighbors method or an ε-ball (all points within a
certain radius) [5] .
Example:
Imagine 1000 points in 3D forming an S-curve. For each point, connect it to its 10 closest
points. The graph now represents local relationships.

Step 2: Compute Geodesic (Shortest Path) Distances


Goal: Estimate the true "manifold" distance between all pairs of points—not just the straight
line, but the shortest path along the graph [1] [2] [3] .
How:
Use Dijkstra’s algorithm (or similar) to find the shortest path between every pair of
points in the graph [1] [5] .
The sum of the edge weights along the shortest path gives the geodesic distance.
Example:
If points A and D aren’t directly connected, but A is connected to B, B to C, and C to D, the
geodesic distance from A to D is the sum of distances A–B, B–C, and C–D [5] .

Step 3: Find the Low-Dimensional Embedding (Using MDS)


Goal: Map the data into a lower-dimensional space (like 2D or 3D) while preserving
geodesic distances as much as possible [1] [2] .
How:
Square the geodesic distance matrix and double-center it.
Perform eigenvalue decomposition (like in PCA) to find the top eigenvectors (directions
with the most variance).
The top k eigenvectors (with the largest eigenvalues) become the axes of your new,
reduced space.
Project the data onto these axes to get the low-dimensional embedding [2] [3] .
Example:
For the S-curve, ISOMAP "unfolds" the curve into a flat 2D space, revealing the underlying
2D structure.
Section 5: ISOMAP in Practice

Python Implementation (with scikit-learn)

from sklearn.manifold import Isomap

# X is your high-dimensional data


embedding = Isomap(n_neighbors=10, n_components=2)
X_transformed = embedding.fit_transform(X)

n_neighbors: Number of neighbors for the neighborhood graph.


n_components: Number of dimensions for the output.

Manual Steps (as in the lab notebook)


1. Compute pairwise Euclidean distances for all points.
2. Keep only the k nearest neighbors for each point to build the graph.
3. Use Dijkstra’s algorithm to compute shortest (geodesic) paths.
4. Center the squared geodesic distance matrix.
5. Perform eigenvalue decomposition and select top components.
6. Project data onto these components for the final embedding.
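
A condensed sketch of these manual steps, using an S-curve generated with scikit-learn and scipy's shortest-path routine (the sample size and neighbor count are illustrative choices):

import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

X, _ = make_s_curve(n_samples=500, random_state=0)          # 3D S-curve data

# Steps 1-3: k-NN graph, then geodesic distances via shortest paths
knn_graph = kneighbors_graph(X, n_neighbors=10, mode="distance")
geodesic = shortest_path(knn_graph, method="D", directed=False)

# Steps 4-6: double-center the squared distances, take top eigenvectors (classical MDS)
n = geodesic.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (geodesic ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:2]
embedding = eigvecs[:, top] * np.sqrt(eigvals[top])          # 2D coordinates of each point
print(embedding.shape)                                       # (500, 2)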

Section 6: ISOMAP vs. PCA


| Aspect | PCA (Linear) | ISOMAP (Non-linear) |
| --- | --- | --- |
| Preserves | Euclidean distances | Geodesic (manifold) distances |
| Handles curves? | No | Yes |
| Good for | Flat, linear data | Curved, non-linear manifolds |
| Example | Flat plane | S-curve, Swiss roll |

Section 7: Parameters and Practical Tips


Number of Neighbors (k):
Too low: The graph may break into disconnected pieces.
Too high: The graph may connect points that are not true neighbors, distorting the
manifold [2] [3] .
Tip: Try different values and visualize the results.
Connected Graph:
Ensure the neighborhood graph is a single connected component, or results may be
incoherent [2] [3] .

Section 8: Limitations and Drawbacks


ISOMAP struggles if the manifold is not well-sampled or has holes [2] [3] .
Careful parameter tuning (especially k) is required.
Computationally intensive for very large datasets (Landmark ISOMAP is an efficient variant)
[2] .

Section 9: Other Manifold Learning Methods


LLE (Locally Linear Embedding): Preserves local relationships.
t-SNE: Focuses on preserving local structure for visualization.
UMAP: Similar to t-SNE, often faster and better at maintaining global structure.
Key difference: Each method preserves different aspects of the data’s structure.

Section 10: Example – ISOMAP on an S-Curve


Dataset: 1000 points forming an S-shaped curve in 3D.
Process:
1. Build the neighborhood graph (each point connects to 10 nearest neighbors).
2. Compute geodesic distances using shortest paths.
3. Use MDS/eigenvalue decomposition to embed data in 2D.
Result: The S-curve is “unfolded” into a flat 2D shape, revealing the true underlying
structure.

Section 11: Key Takeaways


Manifold learning (like ISOMAP) helps uncover the true, lower-dimensional structure of
complex data.
ISOMAP is powerful for non-linear dimensionality reduction, especially when data lies on a
curved manifold.
Choosing parameters (like k) and method is crucial for good results.
Visualization after reduction helps interpret and understand high-dimensional data.

If you want a deeper explanation of any step, or want to see code for a particular part, just
ask!

1. https://www.sjsu.edu/faculty/guangliang.chen/Math253S20/lec10ISOmap.pdf
2. https://www.centron.de/en/tutorial/dimension-reduction-isomap/
3. https://www.mililink.com/upload/article/1159096330aams_vol_215_march_2022_a6_p2371-2382_s._gnana_sophia,_k._k._thanammal_and_s._s._sujatha.pdf
4. https://labex.io/tutorials/ml-manifold-learning-with-scikit-learn-71115
5. https://www.youtube.com/watch?v=Xu_3NnkAI9s
Detailed Explanation of Module 2 Lab 4: t-Distributed Stochastic Neighbor
Embedding (t-SNE)
(Fully Updated with Lab Content and Your Queries)

Section 1: What is t-SNE and Why Use It?


t-SNE stands for t-Distributed Stochastic Neighbor Embedding.
It is an unsupervised, non-linear dimensionality reduction technique, mainly used for
visualizing high-dimensional data in 2D or 3D.
Developed by Laurens van der Maaten and Geoffrey Hinton in 2008, t-SNE helps reveal
patterns, clusters, and relationships in data that are not visible in tables or with linear
methods like PCA [1] .

Section 2: How Does t-SNE Work? (Step-by-Step with Example)


t-SNE works in three main steps, each with a clear purpose and effect:

Step 1: Measure Similarities in High-Dimensional Space


For every pair of data points, t-SNE centers a Gaussian distribution over each point and
measures how dense the other points are under this distribution.
This process produces a set of probabilities (Pij) that reflect how likely points are to be
neighbors in the original high-dimensional space.
The perplexity parameter controls the size of the neighborhood considered for each point
(think of it as a guess of how many close neighbors each point has). Typical values are
between 5 and 50 [1] .
Example:
Suppose you have 1,797 images of handwritten digits (each 8×8 pixels, so 64 features). For
each image, t-SNE calculates its similarity to every other image based on their pixel values,
resulting in a matrix of probabilities that encode local structure.

Step 2: Measure Similarities in Low-Dimensional Space


t-SNE maps all points to a 2D or 3D space, initially at random.
It then computes a new set of probabilities (Qij) using a Student t-distribution (with heavier
tails than a Gaussian).
The heavy tails allow distant points to be modeled more flexibly, helping clusters spread out
and avoid crowding [1] .
Example:
The same digit images are now points on a 2D plane. t-SNE computes how close they are using
the t-distribution, building a new matrix of similarities for the low-dimensional space.

Step 3: Match the Two Probability Distributions


t-SNE tries to make the low-dimensional similarities (Qij) match the high-dimensional
similarities (Pij) as closely as possible.
It does this by minimizing the Kullback-Leibler (KL) divergence between the two
distributions, using gradient descent.
The points are moved around iteratively until the best match is found, preserving local
neighborhoods and revealing clusters.
Example:
As optimization proceeds, images of the digit "3" are pulled together, "8"s are grouped, and so
on. After enough iterations, the 2D plot shows well-separated clusters for each digit [1] .

Section 3: Practical Application – Visualizing Digits


Dataset: 1,797 handwritten digit images (0–9), each 8×8 pixels (64 features).
Goal: Visualize how the digits cluster in 2D using t-SNE.
Code Example:

from sklearn.manifold import TSNE


from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

digits = load_digits()
X = np.vstack([digits.data[digits.target == i] for i in range(10)])
y = np.hstack([digits.target[digits.target == i] for i in range(10)])

tsne = TSNE(init="pca", random_state=20150101, n_components=2, perplexity=30, n_iter=1000


digits_proj = tsne.fit_transform(X)

palette = np.array(sns.color_palette("hls", 10))


plt.figure(figsize=(8,8))
plt.scatter(digits_proj[:,0], digits_proj[:,1], c=palette[y.astype(int)], s=40)
plt.title("t-SNE visualization")
plt.show()

Interpretation: Each color is a digit. t-SNE clusters similar digits together, making the
structure visible [1] .
Section 4: Understanding and Tuning t-SNE Hyperparameters
| Parameter | What It Does | Typical Values / Notes |
| --- | --- | --- |
| n_components | Output dimensions (usually 2 or 3 for visualization) | 2 or 3 |
| perplexity | Controls neighborhood size (local vs. global structure) | 5–50 (try several values) |
| n_iter | Number of optimization steps (iterations) | ≥250 (usually 1000 or more) |
| method | Optimization algorithm ('barnes_hut' is fast, 'exact' is slower) | 'barnes_hut' for large datasets |

Effect of Perplexity:
Low perplexity (e.g., 5): Focuses on very local structure; clusters may be small and tight.
High perplexity (e.g., 100): Considers more global structure; clusters may merge or lose
detail.
Best practice: Try several values and compare results. Perplexity should be less than the
number of points [1] .

Effect of n_iter (Iterations):


Too few iterations: The plot may not stabilize; clusters may look “pinched” or not well
separated.
More iterations: Allows the optimization to converge and produce a clearer map.
Best practice: Iterate until the configuration is stable (often ≥1000) [1] .

Effect of Method:
‘barnes_hut’: Fast, approximate, O(NlogN) time; good for large datasets.
‘exact’: Slower, O(N²) time; more accurate but computationally expensive [1] .

Section 5: Visualizing the Effects of Hyperparameters


Changing Perplexity:
Perplexity 5: Local clusters dominate, but global structure may be lost.
Perplexity 30: Balanced, clear clusters for each digit.
Perplexity 100: Clusters may merge, and points from different digits may mix.
Changing Iterations:
10, 20, 60, 120 steps: Clusters are not yet formed; plots look unstable or “pinched.”
1000 steps: Well-separated, stable clusters.
5000 steps: Similar to 1000, but clusters may be denser [1] .
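
A short sketch of such a comparison, reusing X and y from the code block above and looping over a few perplexity values (the values chosen here are illustrative):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, perp in zip(axes, [5, 30, 100]):
    proj = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(proj[:, 0], proj[:, 1], c=y, s=10, cmap="tab10")
    ax.set_title(f"perplexity = {perp}")
plt.tight_layout()
plt.show()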
Section 6: Best Practices and Limitations
Randomness:
t-SNE results can vary between runs due to random initialization. Use random_state for
reproducibility.
No True Clustering:
t-SNE is not a clustering algorithm; it only helps visualize clusters.
Interpretation:
The axes in a t-SNE plot have no intrinsic meaning; only the relative positions and
groupings matter.
Parameter Sensitivity:
Results can change with different perplexity or iteration settings. Always experiment
with several values [1] .

Section 7: Summary Table


| Step | What Happens | Example (Digits) |
| --- | --- | --- |
| 1 | Compute similarities (Pij) in high-dimensional space using a Gaussian | How likely two digit images are to be neighbors |
| 2 | Compute similarities (Qij) in low-dimensional space using a Student-t distribution | Initial random 2D positions for each image |
| 3 | Minimize KL divergence between Pij and Qij (gradient descent) | Points move until clusters of digits form |

Section 8: Exercises and Exploration


Try different perplexity and iteration values to see how the visualization changes.
Use t-SNE for exploration, not for clustering or modeling directly.
Combine with other techniques: Use t-SNE after PCA or as a preprocessing step for
visualization [1] .

Section 9: Key Takeaways


t-SNE is a powerful tool for exploring and visualizing high-dimensional data.
It excels at revealing clusters and local structure.
Hyperparameters like perplexity and n_iter greatly influence the results—experiment with
them!
t-SNE is best for visualization and data exploration, not as a preprocessing step for
modeling or clustering [1] .
If you want a deeper explanation of any step, or want to see code for a particular part, just
ask!

1. AIML_Module_2_Lab_4_t_SNE.ipynb-Colab.pdf
Detailed Explanation of Module 2 Project Lab: Data Exploration and Dimensionality
Reduction (with All Your Queries Included)
This guide walks through all the steps in your Module 2 project notebook, covering both the
heart disease and Starbucks nutrition datasets. It integrates all your queries and explains each
step in beginner-friendly language, with practical examples and observations.

Section 1: Data Loading and Preparation

Heart Disease Dataset


Source: Kaggle heart.csv
Features: Age, sex, chest pain type, blood pressure, cholesterol, fasting blood sugar, ECG
results, max heart rate, exercise angina, ST depression, slope, number of vessels,
thalassemia, and target (disease/no disease).
Initial Steps:
Load the data with pd.read_csv.
Check the shape: 1,025 rows × 14 columns.
Replace numeric codes with readable labels (e.g., 1 → "Disease", 0 → "No_disease" for
target; 1 → "Male", 0 → "Female" for sex, etc.).

Section 2: Exploratory Data Analysis (EDA) and Visualization

Class Distribution
Bar Plot:
Shows the number of samples with and without heart disease.
Pie Chart:
Visualizes the percentage of samples in each class.
Observation:
~54.5% of samples have heart disease.
Categorical Feature Analysis
Bar Plots:
For features like sex, chest pain type, fasting blood sugar, ECG results, angina, slope,
number of vessels, and thalassemia.
Count Plots:
Show how each categorical feature relates to disease presence.
Example:
countplot(x='cp', hue='target', data=data)
Shows how chest pain types are distributed across disease/no disease.

Continuous Feature Analysis


Pair Plots:
Visualize relationships between continuous features (age, cholesterol, max heart rate,
oldpeak, resting BP) and disease status.
Box Plots:
Compare the distribution of each continuous feature between disease and no-disease
groups.
Observation:
Three features that show significant differences between groups:
Age
Cholesterol (chol)
Maximum Heart Rate Achieved (thalach)

Correlation Analysis
Correlation Matrix:
Shows how strongly each pair of continuous features is related.
Heatmap:
Visualizes these relationships.
Observation:
Oldpeak and Slope: strong negative correlation.
Age and Trestbps: moderate positive correlation.

Feature Distribution by Category


Box Plots by Target:
For each continuous feature, compare disease vs. no-disease.
Section 3: Dimensionality Reduction with PCA

What is PCA?
Principal Component Analysis (PCA) reduces the number of features while preserving as
much variance as possible.
It transforms the data into new axes (principal components) that capture the most important
patterns.

Steps:
1. Standardize the Data:
Ensures each feature contributes equally.
2. Fit PCA:
Find principal components and explained variance.
3. Cumulative Explained Variance Plot:
Shows how many components are needed to capture most of the variance.
Observation:
7 principal components explain ~95% of the variance (optimal number).
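
A minimal sketch of steps 1-3 (assuming the Kaggle heart.csv file is present and that non-numeric label columns are simply excluded):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("heart.csv")                        # Kaggle heart disease dataset
X_num = data.select_dtypes(include=np.number).drop(columns=["target"], errors="ignore")

X_std = StandardScaler().fit_transform(X_num)          # Step 1: standardize
pca = PCA().fit(X_std)                                 # Step 2: fit PCA on all components

cumvar = np.cumsum(pca.explained_variance_ratio_)      # Step 3: cumulative explained variance
plt.plot(range(1, len(cumvar) + 1), cumvar, marker="o")
plt.axhline(0.95, linestyle="--", color="red")         # look for where the curve crosses 95%
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.show()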

PCA Visualization
Scatter Plot of First Two PCs:
Colors by disease status.
Observation:
No clear separation of disease/no-disease in the PCA plot; classes overlap.

Section 4: Nonlinear Dimensionality Reduction with t-SNE

What is t-SNE?
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear method for visualizing
high-dimensional data in 2D or 3D.
It preserves local structure (clusters) better than PCA.

Steps:
1. Apply t-SNE to Numeric Data:
tsne = TSNE(n_components=2)

2. Scatter Plot:
Colors by disease status.
Observation:
t-SNE provides slightly better separation than PCA, but there is still overlap between
disease and no-disease samples.

Section 5: Key Questions and Answers (from Your Queries)


1. What percentage of samples have disease?
About 54.5% (from the pie chart).
2. Which three continuous features show significant statistical difference between
groups?
Age
Cholesterol (chol)
Maximum Heart Rate Achieved (thalach)
3. Is there a clear separation between disease/no-disease in PCA and t-SNE plots?
PCA: No clear separation; classes overlap.
t-SNE: Slightly better, but still overlap; features alone may not be sufficient for perfect
separation.
4. What is the optimal number of principal components?
7 components explain ~95% of the variance.
5. Which continuous features are most strongly correlated?
Oldpeak and Slope (strong negative correlation)
Age and Trestbps (moderate positive correlation)

Section 6: Starbucks Nutrition Dataset Analysis

Data Preparation
Source: star_nutri_expanded.csv
Features: Calories, fat, carbs, sugars, protein, caffeine, vitamins, beverage type,
preparation, etc.
Cleaning:
Convert “Varies” to NaN, fill missing caffeine with the mean.
Fix data types and remove/convert problematic values.

Feature Engineering
Tea Category:
Add a binary column: 1 if beverage is tea, 0 otherwise.
One-Hot Encoding:
Convert categorical features (beverage, preparation) into binary columns.
Convert % strings to numbers:
Remove % signs and convert to floats.

PCA on Nutrition Data


1. Select Numeric Nutritional Columns.
2. Standardize Data.
3. Fit PCA, Plot Cumulative Explained Variance.
Observation:
The first few components capture a moderate amount of variance; many are
needed for >90%.
4. Scatter Plot of First Two PCs, Colored by Tea/Non-Tea.
Observation:
Some separation between tea and non-tea drinks, but not very distinct clusters.

t-SNE on Nutrition Data


1. Apply t-SNE to First Two Principal Components.
2. Scatter Plot, Colored by Tea/Non-Tea.
Observation:
t-SNE shows some grouping, but tea and non-tea drinks still overlap, indicating
nutritional features alone don’t fully separate the categories.

Section 7: Summary Table


| Step | Heart Disease Dataset | Starbucks Nutrition Dataset |
| --- | --- | --- |
| Data Loading | heart.csv, label cleaning, type conversion | star_nutri_expanded.csv, cleaning, encoding |
| EDA | Bar, pie, count, box, pair plots, heatmap | Box, bar, scatter, histograms, word cloud |
| Correlation Analysis | Heatmap, identify strong/weak feature correlations | Correlation heatmap among nutrition values |
| PCA | Cumulative variance, 2D scatter, optimal components | Cumulative variance, 2D scatter, tea/non-tea |
| t-SNE | 2D scatter, compare to PCA | 2D scatter, compare to PCA |
| Key Observations | Overlap in classes, 7 PCs for 95% variance, key features | Moderate separation, many PCs needed |
Section 8: Key Takeaways
EDA helps understand class balance, feature distributions, and relationships.
Correlation analysis reveals which features are related.
PCA summarizes data with fewer features; optimal number is where cumulative variance
plateaus (~95%).
t-SNE is better for visualizing clusters and subtle groupings, but may not always provide
clear separation.
Both datasets show that some categories (disease/tea) are not perfectly separable using
basic features alone.

If you want a deeper explanation of any step or want to see code for a particular part, just
ask!

Detailed Explanation of Module 3 Lab 1: Understanding Distance Metrics and
Introduction to KNN
This lab introduces the concept of distance metrics—how to measure the "closeness" of data
points—and shows how these are used in the K-Nearest Neighbors (KNN) algorithm. Below, each
section and concept is explained step-by-step, with examples and answers to key questions.

Section 1: Distance Metrics

A. What is a Distance Metric?


A distance metric is a mathematical way to measure how far apart two points (data samples) are
in space. Different metrics are used depending on the data type and problem.

B. Common Distance Metrics (with Examples)


1. Euclidean Distance
Definition: The straight-line distance between two points.
Formula: $ d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} $
Example:
import numpy as np

x_1 = np.array((1, 2))
x_2 = np.array((4, 6))
euclidean_dist = np.sqrt(np.sum((x_1 - x_2) ** 2))
print(euclidean_dist)  # Output: 5.0

Visualization: The shortest path between two points on a plane.


2. Manhattan Distance
Definition: The sum of the absolute differences of their coordinates (like a taxi driving
on a city grid).
Formula: $ d(x, y) = \sum_i |x_i - y_i| $
Example:
manhattan_dist = np.sum(np.abs(x_1 - x_2))
print(manhattan_dist) # Output: 7

3. Minkowski Distance
Generalizes Euclidean (p=2) and Manhattan (p=1) distances.
Formula: $ d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p} $
Example:
For p = 3, the Minkowski distance between the same points is about 4.5.
4. Hamming Distance
Definition: Number of positions at which the corresponding values are different (used
for categorical/binary data).
Example:
from scipy.spatial import distance

str_1 = 'euclidean'
str_2 = 'manhattan'
hamming_dist = distance.hamming(list(str_1), list(str_2)) * len(str_1)
print(hamming_dist)  # Output: 7.0

5. Cosine Similarity
Definition: Measures the cosine of the angle between two vectors (used for text and
high-dimensional data).
Formula: $ \cos(\theta) = \frac{x \cdot y}{|x|\,|y|} $
Example:
from numpy.linalg import norm

cosine_similarity = np.dot(x_1, x_2) / (norm(x_1) * norm(x_2))
print(cosine_similarity)  # Output: 0.992...

6. Chebyshev Distance
Definition: The maximum absolute difference across any dimension.
Example:
chebyshev_distance = distance.chebyshev(x_1, x_2)
print(chebyshev_distance) # Output: 4

7. Jaccard Distance
Definition: Measures dissimilarity between sets.
Formula: $ d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} $
Example:
print(distance.jaccard([1, 0, 0], [0, 1, 0])) # Output: 1.0

8. Haversine Distance
Definition: Used for geographic coordinates (latitude/longitude) on a sphere (e.g.,
Earth).
Example (using the haversine package, which expects (latitude, longitude) pairs; here London and Washington, D.C.):
from haversine import haversine
haversine((51.510357, -0.116773), (38.889931, -77.009003))  # Output: ~5897.658 km

C. How to Choose the Right Distance Metric?


Euclidean: Most common for continuous, low-dimensional data.
Manhattan: Useful for high-dimensional or grid-like data.
Cosine Similarity: Good when only direction matters (e.g., text).
Hamming: For categorical/binary variables.
Jaccard: For set/binary data.
Haversine: For geographic data.

D. Visualizing Distance Metrics


The lab uses 3D plots to show how Euclidean and Manhattan distances look from the origin,
helping you understand how each metric "measures" space differently.

Section 2: K-Nearest Neighbors (KNN)

A. What is KNN?
KNN is a supervised, non-parametric, instance-based algorithm used for classification and
regression.
How it works: For a new data point, KNN finds the k closest points in the training set (using
a distance metric) and assigns the most common class among them (for classification) or
averages their values (for regression).
B. KNN on a Synthetic Dataset
The lab generates two clusters of 2D points (red and blue) and uses KNN to classify new
points.
Example:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(pts, tgts)
our_predictions = knn.predict(test_pts)
print("Prediction Accuracy: ", 100 * np.mean(our_predictions == test_tgts))
# Output: e.g., 80.0

Experiment:
Try different distance metrics ('euclidean', 'manhattan', 'chebyshev', 'minkowski',
'hamming') and observe how accuracy changes.
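A minimal sketch of that experiment, assuming pts, tgts, test_pts, and test_tgts are the synthetic clusters generated in the lab:

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Compare KNN accuracy under the distance metrics listed above
for metric in ['euclidean', 'manhattan', 'chebyshev', 'minkowski', 'hamming']:
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric)
    knn.fit(pts, tgts)
    acc = 100 * np.mean(knn.predict(test_pts) == test_tgts)
    print(f"{metric:>10}: {acc:.1f}% accuracy")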

C. KNN on the Iris Dataset (Real Data Example)


Iris dataset: 150 samples, 3 species, 4 features each.
Data is split into training and testing sets.
KNN is run with different distance metrics (Euclidean, Cosine, Manhattan, Chebyshev).
Result:
For this dataset and split, all metrics gave 100% accuracy, but this may not always be the
case in other datasets.
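A small sketch of the Iris comparison; the exact split, k, and random seed in the notebook may differ. The brute-force algorithm is forced here because the cosine metric is not supported by tree-based neighbor searches:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for metric in ['euclidean', 'cosine', 'manhattan', 'chebyshev']:
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric, algorithm='brute')
    knn.fit(X_train, y_train)
    print(metric, knn.score(X_test, y_test))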

Section 3: Questions to Think About and Answer


1. How are similarity and distance different?
Similarity measures how alike two data points are (higher = more alike, e.g., cosine
similarity).
Distance measures how far apart two data points are (lower = more similar, e.g.,
Euclidean).
In KNN, distance metrics like Euclidean and Manhattan are used to find the closest
neighbors, while similarity is used in other algorithms.
2. What makes a valid distance metric?
A valid distance metric must satisfy:
Non-negativity: $ d(x, y) \geq 0 $
Identity: $ d(x, y) = 0 $ if and only if $ x = y $
Symmetry: $ d(x, y) = d(y, x) $
Triangle Inequality: $ d(x, z) \leq d(x, y) + d(y, z) $
Section 4: Best Practices and Observations
Metric choice matters: For some data, the right distance metric can significantly improve
KNN performance [1] [2] [3] .
Curse of dimensionality: In very high-dimensional data, distances between points become
less meaningful, and KNN may not work well [2] .
Feature scaling: Always scale features before using KNN, especially with Euclidean or
Manhattan distance [2] [4] .

Section 5: Summary Table


Metric | Use Case | Formula / Example
Euclidean | Continuous, low-dim data | $ \sqrt{\sum_i (x_i - y_i)^2} $
Manhattan | High-dim, grid data | $ \sum_i |x_i - y_i| $
Chebyshev | Max difference in any dimension | $ \max_i |x_i - y_i| $
Hamming | Categorical/binary variables | # positions where $ x_i \neq y_i $
Jaccard | Set/binary data | $ 1 - \frac{|A \cap B|}{|A \cup B|} $
Cosine | Text, direction matters, not magnitude | $ \frac{x \cdot y}{|x||y|} $
Haversine | Geographic coordinates | See code in notebook

Key Takeaways
Distance metrics are foundational for KNN and many other algorithms.
KNN is simple and effective, but its performance depends on the distance metric and value
of $ k $.
Experiment with different metrics and always scale your features.

If you want a deeper explanation of any metric, code example, or visualization, just ask!

1. https://www.ustcnewly.com/teaching/2020_2_3.pdf
2. https://www.kdnuggets.com/2020/11/most-popular-distance-metrics-knn.html
3. https://blog.devgenius.io/exploring-knn-with-different-distance-metrics-85aea1e8299
4. https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-classifiers-and-model-example/
Detailed Explanation of Module 3 Lab 2: Implementing KNN from Scratch and
Visualizing Algorithm Performance
(Updated with All Your Queries and Key Concepts)

Section 1: Implementing KNN from Scratch

What is KNN?
K-Nearest Neighbors (KNN) is a simple, intuitive algorithm for classification and regression. It
predicts the label of a new data point by looking at the labels of its k closest points in the
training set (using a distance metric, usually Euclidean distance), and choosing the most
common label among them.

How is KNN Implemented from Scratch?


Distance Calculation: For each test point, compute the distance to every training point.
Find Neighbors: Sort all distances and select the k smallest (closest) points.
Predict Label: For classification, take the most frequent label among the k neighbors.
Example Code:

import numpy as np
from collections import Counter

def predict(X_train, y_train, X_test, k):
    # Compute the Euclidean distance from the test point to every training point
    distances = []
    targets = []
    for i in range(len(X_train)):
        distances.append([np.sqrt(np.sum(np.square(X_test - X_train[i, :]))), i])
    distances = sorted(distances)
    # Collect the labels of the k closest training points
    for i in range(k):
        index = distances[i][1]
        targets.append(y_train[index])
    # Return the most frequent label among the k neighbors
    return Counter(targets).most_common(1)[0][0]

For k=1, the label of the single nearest neighbor is returned.


For k>1, the most common label among the k neighbors is chosen.
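A quick usage example of the scratch predict function above, on a toy two-cluster dataset (illustrative values only):

import numpy as np

X_train = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([1.5, 2.5])   # a single query point

# The three nearest neighbors have labels 0, 0, 1, so the prediction is 0
print(predict(X_train, y_train, X_test, k=3))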

Accuracy Metric
Accuracy is the ratio of correctly classified samples to total samples:

def Accuracy(gtlabel, predlabel):
    correct = (gtlabel == predlabel).sum()
    return correct / len(gtlabel)

Section 1.1: KNN on the Iris Dataset


Dataset: Iris (150 samples, 4 features, 3 classes).
Process:
1. Split data into training and test sets.
2. Use your KNN function to predict test labels.
3. Calculate accuracy.
Result Example:
"The accuracy of our classifier is 94.0%"
Comparison:
The sklearn library’s KNN implementation gives the same accuracy, validating your scratch
code.

Section 1.2: Weighted KNN


Why Weighted?
If k is large, distant neighbors may outvote closer, more relevant ones. Weighted KNN gives
more importance to closer neighbors (e.g., by using the inverse of their distance as a
weight).
How to Implement:
In sklearn, use weights='distance' in KNeighborsClassifier.
In your own code, you’d multiply each neighbor’s vote by its weight (inverse distance).
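A minimal sketch of both options: the scikit-learn one-liner, and a hand-rolled version where each neighbor's vote is weighted by the inverse of its distance. The helper name predict_weighted is hypothetical, not from the notebook:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Library version: closer neighbors get votes proportional to 1/distance
knn_weighted = KNeighborsClassifier(n_neighbors=7, weights='distance')

# Scratch version: accumulate inverse-distance-weighted votes per class
def predict_weighted(X_train, y_train, x_test, k):
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    votes = {}
    for idx in nearest:
        w = 1.0 / (dists[idx] + 1e-9)   # small epsilon avoids division by zero
        votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + w
    return max(votes, key=votes.get)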

Section 1.3: Return Neighbors and Distances


Modification:
Instead of just the predicted label, your function can return the indices, distances, and
labels of the k nearest neighbors for each test point.
Why?
This helps you analyze which points are influencing each prediction.
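One way to sketch this modification (the function name predict_with_neighbors is hypothetical); scikit-learn's fitted classifier exposes the same information through its kneighbors method:

import numpy as np

# Variant of the scratch function that also returns the neighbors it used
def predict_with_neighbors(X_train, y_train, x_test, k):
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    idx = np.argsort(dists)[:k]                  # indices of the k closest points
    return idx, dists[idx], y_train[idx]

# With scikit-learn (assuming a fitted knn and a test matrix X_test):
# distances, indices = knn.kneighbors(X_test, n_neighbors=k, return_distance=True)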

Section 2: Visualizing Data and KNN Behavior


Voronoi Diagrams
What are they?
Voronoi diagrams partition the plane into regions where each region contains all points
closest to one "seed" (data point).
Why useful?
They show how the choice of distance metric and data distribution affects the influence of
each training point.
Limitation:
Only practical for 2D data, so you use the first two features or apply PCA to reduce
dimensions.
Example Code:

import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

# points: 2D training samples, targets: their class labels
vor = Voronoi(points)
voronoi_plot_2d(vor)
plt.scatter(points[:, 0], points[:, 1], c=targets, cmap='viridis', edgecolor='k')
plt.show()

Section 2.2: Decision Boundaries in KNN


What are Decision Boundaries?
Imaginary lines (or surfaces) in the feature space where the predicted class changes. They
show which regions of the space are classified as which class by KNN.
How are they plotted?
1. Create a grid covering the feature space.
2. Use KNN to predict the class at each grid point.
3. Color each region according to the predicted class.
4. Overlay the training data points.
Why are they important?
They help you see how KNN generalizes and where it is likely to make mistakes.
For small k, boundaries are jagged and sensitive to noise; for large k, boundaries are
smoother.
Example Code:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier

def decision_boundary_plot(x_dec, y_dec, k):
    h = .02  # step size of the mesh grid
    n = len(set(y_dec))
    cmap_light = ListedColormap(['pink', 'green', 'cyan', 'yellow'][:n])
    cmap_bold = ['pink', 'darkgreen', 'blue', 'yellow'][:n]
    for weights in ['uniform', 'distance']:
        clf = KNeighborsClassifier(n_neighbors=k, weights=weights)
        clf.fit(x_dec, y_dec)
        x_min, x_max = x_dec[:, 0].min() - 1, x_dec[:, 0].max() + 1
        y_min, y_max = x_dec[:, 1].min() - 1, x_dec[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))
        # Predict the class at every grid point and color the regions
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        plt.figure(figsize=(8, 6))
        plt.contourf(xx, yy, Z, cmap=cmap_light)
        sns.scatterplot(x=x_dec[:, 0], y=x_dec[:, 1], hue=y_dec,
                        palette=cmap_bold, edgecolor="black", alpha=1.0)
        plt.show()

Section 2.3: PCA for Visualization


Why PCA?
The Iris dataset has 4 features; to plot Voronoi diagrams and decision boundaries, you need
2D data.
How?
Use PCA to reduce the data to two principal components, then plot as above.
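A minimal sketch of that reduction (using the Iris data loaded from scikit-learn; the lab's variable names may differ):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize, then keep the first two principal components for 2D plotting
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# X_2d can now be passed to decision_boundary_plot(X_2d, y, k) or used for Voronoi plots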

Section 2.4: Confusion Matrix and Classification Report


Confusion Matrix:
A table showing the number of correct and incorrect predictions for each class. Diagonal
values are correct; off-diagonal are mistakes.
Classification Report:
Gives precision, recall, F1-score, and support for each class.
Precision: Of all predicted as class X, how many were correct?
Recall: Of all actual class X, how many did we find?
F1-score: Harmonic mean of precision and recall.
Example Output:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       0.88      1.00      0.94        22
           2       1.00      0.80      0.89        15

    accuracy                           0.94        50
   macro avg       0.96      0.93      0.94        50
weighted avg       0.95      0.94      0.94        50
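A minimal sketch of how such a report is typically produced with scikit-learn, assuming y_test (true labels) and predictions (KNN outputs) already exist:

from sklearn.metrics import confusion_matrix, classification_report

# Rows of the confusion matrix are true classes, columns are predicted classes
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))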
Section 3: Applying KNN on the Car Evaluation Dataset
Data Preparation:
Categorical features are label-encoded to numbers.
Data is split into train/test sets.
KNN Training and Evaluation:
KNN is trained and tested as above.
Accuracy is reported (e.g., 89.88%).
Visualization:
PCA reduces the data to 2D for plotting Voronoi diagrams and decision boundaries.
Confusion matrix and classification report are generated for model evaluation.
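A rough sketch of the preparation and training steps above. The file name, column layout, and choice of k are assumptions for illustration, not taken from the notebook:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cars = pd.read_csv("car_evaluation.csv")

# Label-encode every categorical column, including the target
encoded = cars.apply(LabelEncoder().fit_transform)

X = encoded.drop(columns=encoded.columns[-1])   # features
y = encoded[encoded.columns[-1]]                # target (car acceptability)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy:", knn.score(X_test, y_test))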

Summary Table
Concept What It Means / Why It Matters

KNN from scratch Understands the algorithm’s logic, not just using libraries

Weighted KNN Closer neighbors have more influence on prediction

Voronoi diagrams Visualize which points “own” which regions of space

Decision boundaries Show where class predictions change in feature space

PCA Reduces high-dimensional data to 2D for visualization

Confusion matrix Shows details of correct/incorrect predictions per class

Classification report Gives precision, recall, F1-score for each class

In summary:
This lab teaches you to implement KNN from scratch, understand how it works, visualize its
behavior using Voronoi diagrams and decision boundaries, and evaluate its performance with
confusion matrices and classification reports. You also learn how to handle categorical data, use
PCA for visualization, and interpret the strengths and weaknesses of your classifier [1] [2] [3] [4] [5]
[6] [7] .

1. https://www.machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
2. https://www.kaggle.com/code/jebathuraiibarnabas/knn-from-scratch-with-visualization
3. https://realpython.com/knn-python/
4. https://www.kaggle.com/code/just4jcgeorge/k-nearest-neighbour-algorithm
5. https://dataaspirant.com/k-nearest-neighbor-algorithm-implementaion-python-scratch/
6. https://www.scribd.com/document/736817575/MACHINE-LEARNING-LAB-MANUAL
7. AIML_Module_3_Lab_2_Implementing_KNN_from_scratch_and_visualize_Algorithm_performance.ipynb-Colab.pdf
Detailed Explanation of Module 3 Lab 3: Using KNN for Text Classification
(Updated and Structured for Beginners, with All Your Queries Addressed)

Section 1: Understanding NLP Tools and Preprocessing

A. Why Preprocess Text for Machine Learning?


Raw text is messy: it contains numbers, punctuation, capitalization, and word variants.
Goal: Convert text into a clean, structured form that algorithms can process.

B. Key Preprocessing Steps


1. Remove numbers and punctuation: Only keep meaningful words.
2. Lowercase all text: Ensures "Apple" and "apple" are treated the same.
3. Stemming and Lemmatization:
Stemming: Cuts words to their root form (e.g., "troubling" → "troubl"). Fast but can
create non-words.
Lemmatization: Converts words to their dictionary root (e.g., "troubling" → "trouble").
More accurate but slower and needs context [1] .
Which to use? Lemmatization is more precise, but stemming is faster. Choose based on
your needs.

C. Using NLTK for Text Processing


NLTK (Natural Language Toolkit) provides tools for tokenization, stemming, lemmatization,
and stopword removal.
Example code snippet:
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.tokenize import word_tokenize
# Lemmatize
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("troubling") # Output: "troubling" (needs POS for best results)
# Stem
stemmer = SnowballStemmer('english')
stemmer.stem("troubling") # Output: "troubl"
Section 2: Feature Extraction from Text

A. Bag of Words (BoW)


What is it?
Represents each document as a vector of word counts.
Ignores word order; just counts how many times each word appears.
Pros: Simple, works well for many tasks.
Cons: Treats all words equally; ignores context and importance.

B. TF-IDF (Term Frequency-Inverse Document Frequency)


What is it?
Weighs word counts by how rare or informative a word is across all documents.
Common words (like "the") get low weight; rare but important words get high weight.
Pros: Highlights meaningful words, often improves classification [2] .
Cons: Still ignores word order and context.

C. Implementation in the Lab


BoW and TF-IDF are both used as feature extraction methods.
Code Example:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# BoW
vectorizer = CountVectorizer(stop_words='english')
X_bow = vectorizer.fit_transform(texts)
# TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer.fit_transform(texts)

Section 3: KNN for Text Classification

A. How Does KNN Work for Text?


Training phase: Store the feature vectors (from BoW or TF-IDF) and their class labels.
Prediction phase: For a new document, calculate its distance (e.g., Euclidean, Manhattan)
to all training documents, find the k closest, and assign the most common class among
them [3] [4] .
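A minimal sketch of this workflow with TF-IDF features, assuming texts is a list of preprocessed documents and labels their class labels (the value of k and the split ratio are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Convert documents to TF-IDF vectors, then split into train and test sets
X = TfidfVectorizer(stop_words='english').fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
print("Cross-validation:", cross_val_score(knn, X, labels, cv=5).mean())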
B. Example: Sentiment Analysis
Dataset: Reviews labeled as positive (1) or negative (0).
Workflow:
1. Preprocess and clean the text.
2. Convert to BoW or TF-IDF features.
3. Split into training and test sets.
4. Train KNN classifier (e.g., k=7 for BoW, k=10 for TF-IDF).
5. Evaluate accuracy and cross-validation score.
Results from the Lab:
BoW + KNN: ~66% accuracy, cross-validation ~61%.
TF-IDF + KNN: ~74% accuracy, cross-validation ~74%.
Interpretation: TF-IDF outperforms BoW because it emphasizes informative words [2] .

C. Example: Spam Detection


Dataset: SMS messages labeled as spam (1) or ham (0).
Workflow is the same as above.
Results:
BoW + KNN: ~93% accuracy, cross-validation ~93%.
TF-IDF + KNN: ~88% accuracy, cross-validation ~88%.
Interpretation: For this dataset, BoW performed slightly better, but both methods are
strong.

Section 4: Reflective Questions and Answers


1. Why does TF-IDF usually outperform BoW?
BoW treats all words equally, so common but uninformative words can dominate.
TF-IDF gives higher weight to rare, meaningful words, improving the classifier’s focus on
important features [2] .
2. Are there better techniques than BoW and TF-IDF?
Word Embeddings (Word2Vec, GloVe): Capture word meaning and context as vectors.
Transformers (BERT, etc.): Use deep learning to understand context and semantics,
outperforming traditional methods.
3. Stemming vs. Lemmatization: Pros and Cons
Stemming:
Pros: Fast, simple.
Cons: Can produce non-words, less accurate.
Lemmatization:
Pros: Accurate, context-aware, outputs real words.
Cons: Slower, more complex [1] .

Section 5: Summary Table


Step What Happens / Why It Matters

Preprocessing Clean text, remove noise, normalize words

BoW/TF-IDF Convert text to numeric features for ML algorithms

KNN Classifies new text by comparing to k closest examples

Evaluation Accuracy and cross-validation scores measure performance

Advanced Techniques Word embeddings and transformers improve context understanding

Stemming vs. Lemmatization Trade-off between speed and accuracy in word normalization

Key Takeaways
Text must be cleaned and converted to numbers for machine learning.
TF-IDF usually outperforms BoW by focusing on important words.
KNN is simple and effective for text classification, but the quality of features matters a lot.
Better techniques like word embeddings and transformers can provide even higher
accuracy.
Stemming and lemmatization are important preprocessing steps, each with pros and cons.

If you want more detail on any step, code examples, or further reading, just ask!

1. https://www.ibm.com/think/topics/stemming-lemmatization
2. https://jurnalnasional.ump.ac.id/index.php/JUITA/article/view/23829
3. https://spotintelligence.com/2023/08/22/k-nearest-neighbours/
4. https://www.slideshare.net/slideshow/cs8080irtunit-i-t6-knn-classifierpdf/251786728
Detailed Explanation of Module 3 Project: Diabetes Classification with KNN
(Fully Updated and Structured for Beginners, with All Your Queries Addressed)

Section 1: Introduction and Dataset Overview


Objective: Predict whether a Pima Indian woman has diabetes using features like
pregnancies, glucose, BMI, age, etc.
Dataset: 768 rows, 8 features + 1 target column (Outcome: 1 = diabetic, 0 = non-diabetic).

Section 2: Exploratory Data Analysis (EDA)

A. Data Inspection and Description


Loaded the data and checked its shape: (768, 9).

Used .describe() to get statistics (mean, std, min, max) for each feature.
Example: Average age is ~33, average BMI is ~32.5.

B. Distribution and Outlier Detection


Used distribution plots and boxplots for features like Pregnancies, BMI, etc., to check for
skewness and outliers.

C. Correlation Analysis
Correlation Matrix & Heatmap:
Used diabetes_data.corr() and sns.heatmap() to visualize relationships.
Findings:
Insulin is moderately correlated with Glucose, BMI, and Age.
SkinThickness is highly correlated with BMI.
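A minimal sketch of the correlation heatmap step, assuming diabetes_data is the loaded DataFrame:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix among all numeric columns, annotated with the coefficients
plt.figure(figsize=(10, 8))
sns.heatmap(diabetes_data.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature correlation heatmap")
plt.show()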

D. Checking for Data Balance (Imbalance)


Used value_counts() and a count plot to check class distribution:
print(diabetes_data['Outcome'].value_counts())
plt.figure(figsize=(12,6))
sns.countplot(x='Outcome', data=diabetes_data, palette='bright')
plt.title("Output class distribution")
plt.show()
Result: 500 non-diabetic, 268 diabetic (an imbalanced dataset; there are nearly twice as
many non-diabetic cases as diabetic ones).

E. Pairplot (Scatterplot Matrix)


Used Seaborn's pairplot to visualize pairwise relationships:
sns.pairplot(diabetes_data, hue='Outcome', palette='bright', diag_kind='kde', corner=True)
plt.show()

Why do graphs decrease as you move up?


The pairplot only shows each pair once (lower triangle); the upper triangle is left
blank to avoid redundancy.
For example, "Insulin vs SkinThickness" appears only once, not mirrored.

Section 3: Feature Analysis


Boxplots for BMI, DiabetesPedigreeFunction, Pregnancies, and Age vs. Outcome.
Findings:
Diabetic women tend to have higher BMI, more pregnancies, higher pedigree
function, and are older.

Section 4: Feature Scaling and Standardization

A. Why Standardize/Scale?
Features have different scales (e.g., glucose vs. pedigree function).
KNN uses distance, so unscaled features can dominate.
Standardization:
Transforms features to mean 0, std 1.
Formula: $ Z = \frac{X - \mu}{\sigma} $

B. Min-Max Scaling
What is it?
Rescales features to the range [0, 1] [1].

Formula:
$ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} $
Why use it?
Ensures all features contribute equally to distance calculations.
Especially important for KNN.
How was it used?
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df_feat)

Result: Improved KNN test accuracy to ~77%.

Section 5: Choosing the Best K Value for KNN

A. The Elbow Method


Tried k values from 1 to 40.
For each k:
1. Split data into train/test sets (70/30).
2. Train KNN, compute test error rate.
Plot: Error rate vs. k.
Results:
Max train score: 100% at k=1 (overfitting).
Max test score:
~73% (unscaled),
~75.8% (standardized),
~77% (MinMax scaled),
Best at k=22, 27, 33, 37 (MinMax scaled).
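A minimal sketch of the elbow-method loop described above, assuming scaled_data and diabetes_data already exist from the earlier preprocessing steps:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    scaled_data, diabetes_data['Outcome'], test_size=0.3, random_state=42)

# Record the test error rate for each k from 1 to 40
error_rate = []
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    error_rate.append(np.mean(knn.predict(X_test) != y_test))

plt.plot(range(1, 41), error_rate, marker='o')
plt.xlabel('k')
plt.ylabel('Test error rate')
plt.title('Elbow method for choosing k')
plt.show()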

Section 6: Visualizing KNN Decision Boundaries with Voronoi Diagrams

A. Why Visualize?
Shows how KNN divides the feature space into regions for each class.
Helps understand how scaling and k affect classification.

B. How Was It Done?


1. PCA: Reduced scaled data to 2D.
2. Train KNN: On 2D data.
3. Create a grid: Covered the 2D space.
4. Predict class for each grid point: Colored regions by prediction.
5. Overlay data points.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_2D = pca.fit_transform(scaled_data)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_2D, diabetes_data['Outcome'])
# ... (plotting code)

Result: The Voronoi diagram shows decision boundaries and how KNN classifies new data.

Section 7: Addressing All Your Queries


Checking if the data is balanced:
Used value_counts() and a count plot. Found the dataset is imbalanced (more non-
diabetic cases).
Why fewer graphs in pairplot as you move up:
Pairplot only shows each feature pair once (lower triangle), so the number of graphs
decreases upwards.
What is Min-Max scaling?
Rescales features to [0, 1] so all features are on the same scale, preventing any one
feature from dominating the distance calculations [1].
Effect of scaling:
Scaling (especially MinMax) improved KNN’s performance on this dataset.

Section 8: Key Takeaways


EDA helps identify important features and data issues (like imbalance).
Scaling is critical for KNN and other distance-based algorithms.
Choosing k: Use the elbow method to find the best value.
Visualization: Decision boundaries and Voronoi diagrams help interpret KNN’s behavior.
Performance: With proper scaling and k, KNN achieves up to ~77% accuracy on this
dataset.

If you need more details, code examples, or further clarification on any step, just ask!

1. AIML_Module_3_project.ipynb-Colab.pdf
