Module 1-3
This lab notebook introduces the concept of feature extraction from data, focusing on two
types: text data and image data. The goal is to help you understand how to transform raw data
into features that machine learning algorithms can use. Below, I will explain every major part,
step by step, as if you are new to the field.
Key Observations
Bigram frequencies (pairs of characters) are similar across topics in the same language but
differ significantly between languages.
Therefore, bigram frequency is a good feature for distinguishing languages, not topics.
Dimensionality Reduction
By using unigrams, you reduce the text to 26 features (one for each letter).
Using bigrams, you get 26×26 = 676 features (all possible pairs of letters).
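As a concrete illustration (not the lab's exact code), here is a minimal sketch of turning a text into a 676-dimensional bigram-frequency vector using plain Python:

```python
from collections import Counter
from itertools import product
import string

def bigram_frequencies(text):
    """Count letter-bigram frequencies (letters only, lowercased)."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(zip(letters, letters[1:]))          # adjacent letter pairs
    total = sum(counts.values()) or 1
    # One feature per possible pair -> 26 x 26 = 676-dimensional vector
    return [counts[(a, b)] / total
            for a, b in product(string.ascii_lowercase, repeat=2)]

features = bigram_frequencies("This is a tiny English example sentence.")
print(len(features))  # 676
```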
Further Exploration
Try using different languages, topics, or text sources.
Visualize trigrams (three-character sequences) or higher-order n-grams for more complex
patterns.
| Data Type | Feature | What It Measures | Why It Is Useful |
|---|---|---|---|
| Text | Trigram Count | Frequency of letter triplets | Captures more complex patterns |
| Image | Hull Pixel Count | Number of pixels in filled digit | Measures overall digit area |
If you have any specific part of the code or concept you want to explore further, let me know!
⁂
Detailed Explanation of Lab 2: Machine Learning Terms, Metrics, and Data
Augmentation (Comprehensive & Structured)
This guide explains every concept, model, and process in Lab 2, incorporating all your follow-up
questions and using simple examples. It is organized into clear sections for easy understanding,
just like the structure used for Module 1.
Key Terms
| Term | Meaning | Example |
|---|---|---|
| Target/Label | The value or class the model tries to predict | Digit (0-9), “Apple” or “Orange” |
| Training set | Data used to teach the model | 80% of your dataset |
| Overfitting | Model memorizes training data but fails on new data | High train accuracy, low validation accuracy |
| Underfitting | Model fails to learn patterns from data | Low train and validation accuracy |
Simple Example
Suppose you have these fruits:
| Fruit | Feature 1 | Feature 2 | Label |
|---|---|---|---|
| Apple | 8 | 7 | Apple |
| Orange | 6 | 8 | Orange |
2. Random Classifier
Purpose: Assigns a random label to each data point, without learning from data.
Why use it: It’s a baseline to show what accuracy you’d get just by guessing.
Example:
If you have 4 fruits (2 apples, 2 oranges) and guess randomly, you’ll be correct about 50%
of the time.
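A minimal sketch of such a random baseline using scikit-learn's DummyClassifier on toy fruit data (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.array([[8, 7], [6, 8], [7, 6], [5, 9]])   # toy fruit features
y = np.array(["Apple", "Orange", "Apple", "Orange"])

baseline = DummyClassifier(strategy="uniform", random_state=0)  # guesses labels at random
baseline.fit(X, y)
print(baseline.score(X, y))  # hovers around 0.5 for two balanced classes
```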
Section 3: Measuring Model Performance
Why both?
High train but low validation accuracy means overfitting.
Both low means underfitting.
Both high and close to each other means good generalization [3] .
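A small illustration of this diagnostic (not from the lab): compare training and validation accuracy of a 1-NN model, which memorizes its training set, on a deliberately noisy synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data (illustrative only): 20% of labels are flipped
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:     ", model.score(X_train, y_train))  # 1.0 -- 1-NN memorizes its training set
print("validation accuracy:", model.score(X_val, y_val))      # noticeably lower -> overfitting
```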
Example
A 3x3 image:
0 255 0
255 0 255
0 255 0
Flattened: `[0, 255, 0, 255, 0, 255, 0, 255, 0]`
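In code, this flattening is a one-liner:

```python
import numpy as np

image = np.array([[0, 255, 0],
                  [255, 0, 255],
                  [0, 255, 0]])
print(image.flatten())  # [  0 255   0 255   0 255   0 255   0]
```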
Visual Example
Original:
+-----+
| |
| |
+-----+
Sheared (after augmentation):
  +-----+
 /     /
/     /
+-----+
Step-by-Step Workflow
1. Train the 1-NN classifier on original (flattened) images.
Measure baseline accuracy (e.g., 80%).
2. Apply data augmentation (rotate, shear, etc.) to create new images.
Retrain the model on the expanded dataset.
Measure new accuracy (e.g., 85%).
3. Hyperparameter Tuning (Grid Search):
Hyperparameters: Settings you choose before training (e.g., angle of rotation, amount
of shear).
Grid search: Try different values (e.g., rotate by 10°, 20°, 30°) and see which gives the best accuracy [8] (see the code sketch after this list).
For two hyperparameters (e.g., angle and shear), try all combinations.
4. Visualize Results with Graphs:
One hyperparameter:
X-axis: Value (e.g., angle in degrees)
Y-axis: Test accuracy
Plot to see which value works best.
Two hyperparameters (e.g., angle and shear):
Use a heatmap (colored grid):
X-axis: Angle
Y-axis: Shear
Color: Accuracy
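A hedged sketch of this workflow on the sklearn digits dataset, rotating images with scipy and grid-searching the rotation angle (the dataset, angle values, and 1-NN setup are illustrative, not necessarily the lab's exact choices):

```python
import numpy as np
from scipy.ndimage import rotate
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def augment(images, angle):
    """Rotate each 8x8 digit by +/- angle degrees and return the extra samples."""
    imgs = images.reshape(-1, 8, 8)
    rotated = [rotate(img, a, reshape=False) for img in imgs for a in (angle, -angle)]
    return np.array(rotated).reshape(len(rotated), -1)

for angle in (10, 20, 30):                      # grid search over one hyperparameter
    X_aug = np.vstack([X_train, augment(X_train, angle)])
    y_aug = np.concatenate([y_train, np.repeat(y_train, 2)])
    acc = KNeighborsClassifier(n_neighbors=1).fit(X_aug, y_aug).score(X_test, y_test)
    print(f"rotation ±{angle}°: test accuracy = {acc:.3f}")
```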
| Concept | What It Means |
|---|---|
| 1-NN Classifier | Assigns the class of the closest training example to a new data point |
| Flattened Pixel Values | Turning an image grid into a single list of numbers |
| Data Augmentation | Creating new samples by changing originals (rotate, shear, etc.) |
| Hyperparameter Tuning | Trying different settings to find the best model performance |
If you need any section explained even more simply or visually, let me know!
⁂
1. https://www.datacamp.com/blog/classification-machine-learning
2. https://stats.stackexchange.com/questions/151756/knn-1-nearest-neighbor
3. https://www.digitalocean.com/community/tutorials/training-validation-and-accuracy-in-pytorch
4. https://www.datacamp.com/tutorial/complete-guide-data-augmentation
5. https://aws.amazon.com/what-is/data-augmentation/
6. https://neptune.ai/blog/data-augmentation-in-python
7. https://docs.edgeimpulse.com/docs/concepts/machine-learning/data-augmentation
8. https://www.yourdatateacher.com/2021/05/19/hyperparameter-tuning-grid-search-and-random-search/
Detailed Explanation of Lab 3: Transforming Data Using Linear Algebra (Updated
with All Your Queries)
This comprehensive guide explains every concept, step, and method in Lab 3, integrating all
your follow-up questions. It is structured for absolute beginners, with clear definitions, step-by-
step examples, and practical context for each part of the lab.
Simple Example:
Suppose you have a point (2, 3) and a transformation matrix:
Example:
If three points are close together before a transformation, they might be spread out after.
This can change which points are considered "nearest neighbors" for classification.
Step-by-Step Example
Suppose you have three data points, each with two features:
| Point | Feature 1 | Feature 2 |
|---|---|---|
| A | 2 | 3 |
| B | 1 | 0 |
| C | -1 | 2 |
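To make this concrete, here is a sketch applying an example transformation matrix to these points (the matrix below is an assumed shear, not necessarily the one used in the lab):

```python
import numpy as np

points = np.array([[2, 3],    # A
                   [1, 0],    # B
                   [-1, 2]])  # C

T = np.array([[1, 2],         # example shear-like matrix (assumed, not the lab's)
              [0, 1]])

transformed = points @ T.T    # apply T to each row vector
print(transformed)

# Distances between points change, so "nearest neighbors" can change too
def dist(p, q):
    return np.linalg.norm(p - q)

print(dist(points[0], points[1]), dist(transformed[0], transformed[1]))
```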
Simple Example:
Suppose you want to classify fruit as apple or orange. You have:
Weight
Color Score
If you use both features, you can plot each fruit as a point in 2D space (weight vs. color). Apples
and oranges may form separate clusters, making classification easier.
Combining Features:
Sometimes, you create new features by combining existing ones (e.g., Area = Height ×
Width).
Using multiple features often improves model performance because it captures more
information about each data point.
In the Lab:
The lab tests different combinations of features (e.g., hole only, boundary only, hole +
boundary).
Some combinations give better accuracy than others.
Why Normalize?
Features with different scales (e.g., 0–1000 vs. 0–10) can cause models to focus too much
on the bigger numbers.
Normalization rescales each feature to the same range (typically 0 to 1).
How to Normalize:
Use min-max scaling: $ x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} $
Do this for each feature, for both training and test data.
Effect on Accuracy:
Normalization can improve or maintain accuracy by ensuring all features are treated equally.
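A minimal sketch of min-max normalization with scikit-learn, fitting the scaler on training data and reusing it on test data (the lab may instead compute the formula by hand):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[900.0, 2.0], [500.0, 8.0], [100.0, 5.0]])   # features on very different scales
X_test = np.array([[300.0, 6.0]])

scaler = MinMaxScaler()                       # rescales each feature to [0, 1]
X_train_norm = scaler.fit_transform(X_train)  # fit on training data only...
X_test_norm = scaler.transform(X_test)        # ...then apply the same scaling to test data
print(X_train_norm)
print(X_test_norm)
```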
| Concept | What It Does | Example / Notes |
|---|---|---|
| 1-NN Classifier | Predicts class based on nearest neighbor | Simple, distance-based classification |
| Feature Selection | Choosing which features to use for best results | Hole + Boundary vs. all features |
| Data Normalization | Makes features comparable by rescaling them | All features between 0 and 1 |
Lab 3 in a Nutshell
1. Understand linear algebra and matrix transformations and how they affect data [1] [2] [5]
[6] [3] [7] [8] [4] .
If you need more details or examples for any section, just ask!
⁂
1. https://www.youtube.com/watch?v=SioiFuMRiv4
2. https://www.coursera.org/learn/machine-learning-linear-algebra
3. https://www.machinelearningmastery.com/linear_algebra_for_machine_learning/
4. https://www.machinelearningplus.com/linear-algebra/introduction-to-linear-algebra/
5. https://www.bitsathy.ac.in/linear-algebra-for-machine-learning/
6. https://www.freecodecamp.org/news/linear-algebra-roadmap/
7. https://www.youtube.com/watch?v=uZeDTwWcnuY
8. https://pabloinsente.github.io/intro-linear-algebra
Detailed Explanation of Module 2 Lab 1: Basic Plots (Structured for Beginners)
This lab introduces the fundamentals of data visualization using Python, focusing on how to
appreciate, interpret, and visualize data before applying machine learning. You’ll learn about
different plot types, how to create them, and how to interpret what they show about your
dataset. Below is a step-by-step breakdown, integrating all the main ideas and practical
examples from the notebook.
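The plotting code for this first maker-vs-price chart is not reproduced in this extract; here is a minimal sketch in the same seaborn style, assuming `X` holds the car features and `y` the prices, as in the snippets below:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes X (features dataframe) and y (price series) from the lab's automobile dataset
sns.set(rc={'figure.figsize': (24, 7)})
sns.stripplot(x=X["make"], y=y).set_title('Car Manufacturer vs Price')  # one point per car
plt.show()
```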
Interpretation:
Each point is a car.
You can see which car makers tend to have higher or lower prices.
Example: Mercedes-Benz, Jaguar, Porsche, and BMW are on the higher end.
B. Box Plot
What it shows: The distribution of prices for each car maker, including median, quartiles,
and outliers.
How to create:
sns.set(rc={'figure.figsize': (24, 7)})
sns.boxplot(x=X["make"], y=y, palette="Set3").set_title('Car Manufacturer vs Price - Box Plot')
Interpretation:
The box shows the middle 50% of prices (interquartile range).
The line inside the box is the median.
Whiskers show the range; dots outside are outliers.
Example: Mercedes-Benz and Porsche have the highest medians and outliers at very
high prices.
C. Violin Plot
What it shows: Combines a box plot with a density plot to show the distribution’s shape for
each group.
How to create:
sns.violinplot(x=X["make"], y=y, palette="Set3").set_title('Car maker vs Price - Violin Plot')
Interpretation:
The width of the violin indicates how many cars fall in that price range.
You can see if the distribution is skewed or has multiple peaks.
D. Swarm Plot
What it shows: Each data point is plotted, avoiding overlap, to show the distribution within
categories.
How to create:
sns.swarmplot(x=X["make"], y=y).set_title('Car maker vs Price Swarm Plot')
Interpretation:
You see the spread of individual car prices for each make.
Useful for spotting clusters and outliers.
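The horsepower-vs-price scatter plot that the next interpretation refers to is not shown in this extract; here is a minimal sketch under the same assumption that `X` and `y` come from the lab's automobile dataset:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes X (features) and y (price) from the lab's automobile dataset, as in the snippets above
sns.scatterplot(x=pd.to_numeric(X["horsepower"], errors="coerce"), y=y)
plt.title('Horsepower vs Price')
plt.show()
```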
Interpretation:
Points trend upward: higher horsepower generally means higher price.
B. Joint Plot
What it shows: Relationship between two variables, plus their individual distributions.
How to create:
sns.jointplot(x=pd.to_numeric(X["horsepower"]), y=y, kind="reg", color='green')
Interpretation:
The regression line shows the trend.
If the line slopes up, the variables are positively correlated.
C. Finding Negative Correlations
Example: Highway-mpg (miles per gallon on highway) vs. Price
How to create:
sns.jointplot(x=data["highway-mpg"], y=data["price"], kind="reg", color='purple')
Interpretation:
The regression line slopes down: as mpg increases, price tends to decrease (negative
correlation).
This means more fuel-efficient cars are often less expensive.
Interpretation:
Convertibles have the highest maximum prices.
Hatchbacks have the lowest median and least price variation.
Sedans cover a wide range of prices.
Body style can be a good predictor of price.
Interpretation:
You can compare different plot types side by side for deeper insights.
Section 8: Exercise and Exploration
Try plotting joint plots for all numeric features against price.
Look for features with strong positive or negative correlations.
Try new plot types from Seaborn’s documentation to represent the data in creative ways.
Key Takeaways
Visualization is essential for understanding your data before modeling.
Scatter plots show relationships between two variables.
Box plots and violin plots reveal distributions and outliers within categories.
Swarm plots help visualize every data point in a category.
Joint plots combine scatter plots with distribution plots, making trends and correlations
obvious.
Subplots let you compare multiple visualizations at once.
Interpreting plots helps you spot trends, outliers, and relationships that are hard to see in
tables.
If you need a deeper explanation of any plot type, how to interpret a specific plot, or want to see
more code examples, just ask!
⁂
Detailed Explanation of Module 2 Lab 2: Principal Components Analysis (PCA) –
Updated with All Your Questions
This guide explains every concept and step in the PCA lab, integrating your follow-up questions
and using beginner-friendly language and examples. Each section is structured for clarity and
depth, with practical context and simple analogies.
Section 1: What is Principal Component Analysis (PCA) and Why Use It?
PCA is a technique for simplifying complex datasets by reducing the number of features
(dimensions) while preserving as much important information (variance) as possible [1] [2] .
Why use PCA?
To visualize high-dimensional data in 2D or 3D.
To speed up machine learning and reduce overfitting.
To remove noise and redundancy from data.
To find patterns or groupings that are hard to see in the original data.
Example:
If one feature ranges from 1–1000 and another from 0–1, the first would dominate the
analysis unless standardized.
Step 2: Covariance Matrix Calculation
What is Covariance?
It measures how two features vary together.
Positive covariance: features increase together; negative: one increases as the other
decreases.
Covariance Matrix:
A table showing the covariance between every pair of features.
For 3 features, it’s a 3x3 matrix.
Why?
It helps find relationships between features and is the foundation for finding principal
components [2] [3] .
| Sample | A | B | C |
|---|---|---|---|
| 1 | 2 | 3 | 4 |
| 2 | 3 | 4 | 5 |
| 3 | 4 | 5 | 6 |
Step-by-step:
1. Standardize A, B, C.
2. Compute covariance matrix (3x3).
3. Find eigenvectors/eigenvalues.
4. Sort and select top 2 eigenvectors (PC1, PC2).
5. Project data onto PC1 and PC2 to get new values for each sample.
6. Plot samples on a 2D graph using PC1 and PC2.
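A minimal numpy sketch of these steps on the three-sample table above (in practice, sklearn's `PCA` wraps steps 2-5 in a single call):

```python
import numpy as np

X = np.array([[2, 3, 4],
              [3, 4, 5],
              [4, 5, 6]], dtype=float)        # samples 1-3, features A, B, C

# 1. Standardize each feature (mean 0, std 1)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (3x3)
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors / eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort by eigenvalue (largest first) and keep the top 2 (PC1, PC2)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]

# 5. Project the data onto PC1 and PC2
X_pca = X_std @ W
print(X_pca)   # new 2-D coordinates, ready for step 6 (a 2-D scatter plot)
```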
Section 6: Summary Table
| Step | What Happens? | Why It Matters |
|---|---|---|
| Covariance Matrix | Measures how features vary together | Finds relationships between features |
| Principal Components | New axes capturing most information | Reduce data size, keep important info |
| Explained Variance | Shows how much info each component keeps | Helps choose how many components to keep |
If you want a deeper explanation of any step, or a code example for a specific part, just ask!
⁂
1. https://www.pickl.ai/blog/a-step-by-step-complete-guide-to-principal-component-analysis-pca-for-beginners/
2. https://www.turing.com/kb/guide-to-principal-component-analysis
3. https://www.datacamp.com/tutorial/pca-analysis-r
Detailed Explanation of Module 2 Lab 3: Manifold Learning Methods (Updated and
Structured for Beginners)
This guide explains every concept and step in the Manifold Learning lab, integrating the content
provided and all your follow-up queries. It is organized for clarity, with practical examples and
beginner-friendly language.
How:
For each data point, find its k nearest neighbors (using Euclidean distance).
Build a graph where each point is a node connected to its neighbors by edges weighted
by their distances.
You can use either the k-nearest neighbors method or an ε-ball (all points within a
certain radius) [5] .
Example:
Imagine 1000 points in 3D forming an S-curve. For each point, connect it to its 10 closest
points. The graph now represents local relationships.
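A minimal sketch of this neighborhood-graph step using scikit-learn (the S-curve and k = 10 mirror the example above; the lab's own code may differ):

```python
from sklearn.datasets import make_s_curve
from sklearn.neighbors import kneighbors_graph

X, _ = make_s_curve(n_samples=1000, random_state=0)   # 1000 points on a 3-D S-curve

# Connect each point to its 10 nearest neighbors; edge weights are Euclidean distances
graph = kneighbors_graph(X, n_neighbors=10, mode='distance')
print(graph.shape)   # (1000, 1000) sparse adjacency matrix of the neighborhood graph
```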
If you want a deeper explanation of any step, or want to see code for a particular part, just
ask!
⁂
1. https://www.sjsu.edu/faculty/guangliang.chen/Math253S20/lec10ISOmap.pdf
2. https://www.centron.de/en/tutorial/dimension-reduction-isomap/
3. https://www.mililink.com/upload/article/1159096330aams_vol_215_march_2022_a6_p2371-2382_s._gnana_sophia,_k._k._thanammal_and_s._s._sujatha.pdf
4. https://labex.io/tutorials/ml-manifold-learning-with-scikit-learn-71115
5. https://www.youtube.com/watch?v=Xu_3NnkAI9s
Detailed Explanation of Module 2 Lab 4: t-Distributed Stochastic Neighbor
Embedding (t-SNE)
(Fully Updated with Lab Content and Your Queries)
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X = np.vstack([digits.data[digits.target == i] for i in range(10)])
y = np.hstack([digits.target[digits.target == i] for i in range(10)])
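A minimal sketch of how the 2-D embedding and scatter plot can be produced from here (parameter values are illustrative, not necessarily the lab's):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)                       # embed the 64-D digit images into 2-D

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=5)
plt.colorbar(label='digit')
plt.show()
```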
Interpretation: Each color is a digit. t-SNE clusters similar digits together, making the
structure visible [1] .
Section 4: Understanding and Tuning t-SNE Hyperparameters
| Parameter | What It Does | Typical Values / Notes |
|---|---|---|
| perplexity | Controls neighborhood size (local vs. global structure) | 5–50 (try several values) |
Effect of Perplexity:
Low perplexity (e.g., 5): Focuses on very local structure; clusters may be small and tight.
High perplexity (e.g., 100): Considers more global structure; clusters may merge or lose
detail.
Best practice: Try several values and compare results. Perplexity should be less than the
number of points [1] .
Effect of Method:
‘barnes_hut’: Fast, approximate, O(NlogN) time; good for large datasets.
‘exact’: Slower, O(N²) time; more accurate but computationally expensive [1] .
| Step | What Happens | Intuition (digits example) |
|---|---|---|
| 1 | Compute similarities ($P_{ij}$) in high-dimensional space using a Gaussian | How likely two digit images are neighbors |
| 2 | Compute similarities ($Q_{ij}$) in low-dimensional space using a Student-t distribution | Initial random 2D positions for each image |
| 3 | Minimize KL divergence between $P_{ij}$ and $Q_{ij}$ (gradient descent) | Points move until clusters of digits form |
1. AIML_Module_2_Lab_4_t_SNE.ipynb-Colab.pdf
Detailed Explanation of Module 2 Project Lab: Data Exploration and Dimensionality
Reduction (with All Your Queries Included)
This guide walks through all the steps in your Module 2 project notebook, covering both the
heart disease and Starbucks nutrition datasets. It integrates all your queries and explains each
step in beginner-friendly language, with practical examples and observations.
Class Distribution
Bar Plot:
Shows the number of samples with and without heart disease.
Pie Chart:
Visualizes the percentage of samples in each class.
Observation:
~54.5% of samples have heart disease.
Categorical Feature Analysis
Bar Plots:
For features like sex, chest pain type, fasting blood sugar, ECG results, angina, slope,
number of vessels, and thalassemia.
Count Plots:
Show how each categorical feature relates to disease presence.
Example:
sns.countplot(x='cp', hue='target', data=data)
Shows how chest pain types are distributed across disease/no disease.
Correlation Analysis
Correlation Matrix:
Shows how strongly each pair of continuous features is related.
Heatmap:
Visualizes these relationships.
Observation:
Oldpeak and Slope: strong negative correlation.
Age and Trestbps: moderate positive correlation.
What is PCA?
Principal Component Analysis (PCA) reduces the number of features while preserving as
much variance as possible.
It transforms the data into new axes (principal components) that capture the most important
patterns.
Steps:
1. Standardize the Data:
Ensures each feature contributes equally.
2. Fit PCA:
Find principal components and explained variance.
3. Cumulative Explained Variance Plot:
Shows how many components are needed to capture most of the variance.
Observation:
7 principal components explain ~95% of the variance (optimal number).
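A minimal sketch of the cumulative-explained-variance step, assuming the standardized continuous features are available as `X_scaled` from the previous step:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# X_scaled: standardized heart-disease features (assumed from the earlier standardization step)
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.axhline(0.95, color='red', linestyle='--')     # the ~95% threshold mentioned above
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.show()
```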
PCA Visualization
Scatter Plot of First Two PCs:
Colors by disease status.
Observation:
No clear separation of disease/no-disease in the PCA plot; classes overlap.
What is t-SNE?
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear method for visualizing
high-dimensional data in 2D or 3D.
It preserves local structure (clusters) better than PCA.
Steps:
1. Apply t-SNE to Numeric Data:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
2. Scatter Plot:
Colors by disease status.
Observation:
t-SNE provides slightly better separation than PCA, but there is still overlap between
disease and no-disease samples.
Data Preparation
Source: star_nutri_expanded.csv
Features: Calories, fat, carbs, sugars, protein, caffeine, vitamins, beverage type,
preparation, etc.
Cleaning:
Convert “Varies” to NaN, fill missing caffeine with the mean.
Fix data types and remove/convert problematic values.
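A minimal pandas sketch of this cleaning step; the caffeine column name below is a placeholder, since the exact headers of star_nutri_expanded.csv are not shown here:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("star_nutri_expanded.csv")

# Replace the literal string "Varies" with NaN, then fill missing caffeine with the column mean
df = df.replace("Varies", np.nan)
df["Caffeine (mg)"] = pd.to_numeric(df["Caffeine (mg)"], errors="coerce")   # placeholder column name
df["Caffeine (mg)"] = df["Caffeine (mg)"].fillna(df["Caffeine (mg)"].mean())
```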
Feature Engineering
Tea Category:
Add a binary column: 1 if beverage is tea, 0 otherwise.
One-Hot Encoding:
Convert categorical features (beverage, preparation) into binary columns.
Convert % strings to numbers:
Remove % signs and convert to floats.
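Continuing the sketch above, a hedged example of the feature-engineering steps (all column names are hypothetical placeholders):

```python
# Hypothetical column names; the real ones come from the Starbucks CSV
df["is_tea"] = df["Beverage_category"].str.contains("Tea", case=False).astype(int)

# One-hot encode categorical columns into binary indicator columns
df = pd.get_dummies(df, columns=["Beverage", "Beverage_prep"])

# Strip the % sign from percentage columns and convert to floats
df["Vitamin A (% DV)"] = df["Vitamin A (% DV)"].str.rstrip("%").astype(float)
```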
| Step | Heart Dataset | Starbucks Dataset |
|---|---|---|
| Data Loading | heart.csv, label cleaning, type conversion | star_nutri_expanded.csv, cleaning, encoding |
If you want a deeper explanation of any step or want to see code for a particular part, just
ask!
⁂
Detailed Explanation of Module 3 Lab 1: Understanding Distance Metrics and
Introduction to KNN
This lab introduces the concept of distance metrics—how to measure the "closeness" of data
points—and shows how these are used in the K-Nearest Neighbors (KNN) algorithm. Below, each
section and concept is explained step-by-step, with examples and answers to key questions.
Example:
import numpy as np

x_1 = np.array((1, 2))
x_2 = np.array((4, 6))
euclidean_dist = np.sqrt(np.sum((x_1 - x_2) ** 2))
print(euclidean_dist) # Output: 5.0
Example:
manhattan_dist = np.sum(np.abs(x_1 - x_2))
print(manhattan_dist) # Output: 7
3. Minkowski Distance
Generalizes Euclidean (p=2) and Manhattan (p=1) distances.
Formula: $ d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p} $
Example:
For $ p = 3 $, the Minkowski distance between the same points is about 4.5.
4. Hamming Distance
Definition: Number of positions at which the corresponding values are different (used
for categorical/binary data).
Example:
from scipy.spatial import distance

str_1 = 'euclidean'
str_2 = 'manhattan'
hamming_dist = distance.hamming(list(str_1), list(str_2)) * len(str_1)
print(hamming_dist) # Output: 7.0
5. Cosine Similarity
Definition: Measures the cosine of the angle between two vectors (used for text and
high-dimensional data).
Formula: $ \cos(\theta) = \frac{x_1 \cdot x_2}{\|x_1\| \, \|x_2\|} $
Example:
from numpy.linalg import norm

cosine_similarity = np.dot(x_1, x_2) / (norm(x_1) * norm(x_2))
print(cosine_similarity) # Output: 0.992...
6. Chebyshev Distance
Definition: The maximum absolute difference across any dimension.
Example:
chebyshev_distance = distance.chebyshev(x_1, x_2)
print(chebyshev_distance) # Output: 4
7. Jaccard Distance
Definition: Measures dissimilarity between sets.
Formula: $ d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} $
Example:
print(distance.jaccard([1, 0, 0], [0, 1, 0])) # Output: 1.0
8. Haversine Distance
Definition: Used for geographic coordinates (latitude/longitude) on a sphere (e.g.,
Earth).
Example:
haversine([-0.116773, 51.510357], [-77.009003, 38.889931]) # Output: 5897.658 km
A. What is KNN?
KNN is a supervised, non-parametric, instance-based algorithm used for classification and
regression.
How it works: For a new data point, KNN finds the k closest points in the training set (using
a distance metric) and assigns the most common class among them (for classification) or
averages their values (for regression).
B. KNN on a Synthetic Dataset
The lab generates two clusters of 2D points (red and blue) and uses KNN to classify new
points.
Example:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(pts, tgts)
our_predictions = knn.predict(test_pts)
print("Prediction Accuracy: ", 100 * np.mean(our_predictions == test_tgts))
# Output: e.g., 80.0
Experiment:
Try different distance metrics ('euclidean', 'manhattan', 'chebyshev', 'minkowski',
'hamming') and observe how accuracy changes.
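A hedged sketch of this experiment on synthetic clusters (note that 'hamming' is really meant for categorical data, so its result on continuous points is only illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two synthetic 2-D clusters, similar in spirit to the lab's red/blue points
pts, tgts = make_blobs(n_samples=200, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(pts, tgts, test_size=0.3, random_state=0)

for metric in ('euclidean', 'manhattan', 'chebyshev', 'minkowski', 'hamming'):
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric)
    acc = knn.fit(X_train, y_train).score(X_test, y_test)
    print(f"{metric:10s} accuracy: {acc:.2f}")
```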
| Metric | Best For | Formula |
|---|---|---|
| Manhattan | High-dim, grid data | $ \sum_i \lvert x_i - y_i \rvert $ |
| Jaccard | Set/binary data | $ 1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert} $ |
Key Takeaways
Distance metrics are foundational for KNN and many other algorithms.
KNN is simple and effective, but its performance depends on the distance metric and value
of $ k $.
Experiment with different metrics and always scale your features.
If you want a deeper explanation of any metric, code example, or visualization, just ask!
⁂
Detailed Explanation of Module 3 Lab 2: Implementing KNN from Scratch and
Visualizing Algorithm Performance
(Updated with All Your Queries and Key Concepts)
What is KNN?
K-Nearest Neighbors (KNN) is a simple, intuitive algorithm for classification and regression. It
predicts the label of a new data point by looking at the labels of its k closest points in the
training set (using a distance metric, usually Euclidean distance), and choosing the most
common label among them.
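A minimal from-scratch sketch of that idea in numpy (a simplified version of what the lab builds, not its exact code):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new from its k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # distance to every training point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0] # majority vote

X_train = np.array([[1, 2], [2, 3], [8, 8], [9, 10]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))    # 0
```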
Accuracy Metric
Accuracy is the ratio of correctly classified samples to total samples: $ \text{Accuracy} = \frac{\text{correct predictions}}{\text{total predictions}} $
Summary Table
| Concept | What It Means / Why It Matters |
|---|---|
| KNN from scratch | Understand the algorithm's logic instead of just using libraries |
In summary:
This lab teaches you to implement KNN from scratch, understand how it works, visualize its
behavior using Voronoi diagrams and decision boundaries, and evaluate its performance with
confusion matrices and classification reports. You also learn how to handle categorical data, use
PCA for visualization, and interpret the strengths and weaknesses of your classifier [1] [2] [3] [4] [5]
[6] [7] .
1. https://www.machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
2. https://www.kaggle.com/code/jebathuraiibarnabas/knn-from-scratch-with-visualization
3. https://realpython.com/knn-python/
4. https://www.kaggle.com/code/just4jcgeorge/k-nearest-neighbour-algorithm
5. https://dataaspirant.com/k-nearest-neighbor-algorithm-implementaion-python-scratch/
6. https://www.scribd.com/document/736817575/MACHINE-LEARNING-LAB-MANUAL
7. AIML_Module_3_Lab_2_Implementing_KNN_from_scratch_and_visualize_Algorithm_performance.ipynb-Colab.pdf
Detailed Explanation of Module 3 Lab 3: Using KNN for Text Classification
(Updated and Structured for Beginners, with All Your Queries Addressed)
Stemming vs. Lemmatization: a trade-off between speed and accuracy in word normalization.
Key Takeaways
Text must be cleaned and converted to numbers for machine learning.
TF-IDF usually outperforms BoW by focusing on important words.
KNN is simple and effective for text classification, but the quality of features matters a lot.
Better techniques like word embeddings and transformers can provide even higher
accuracy.
Stemming and lemmatization are important preprocessing steps, each with pros and cons.
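As a tiny end-to-end illustration (made-up sentences, not the lab's dataset), TF-IDF features feeding a KNN classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["the striker scored a late goal",        # sports
        "the team won the championship match",   # sports
        "the bank raised interest rates",        # finance
        "stock markets fell after the report"]   # finance
labels = ["sports", "sports", "finance", "finance"]

vectorizer = TfidfVectorizer()                   # turns each document into a TF-IDF vector
X = vectorizer.fit_transform(docs)

knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(knn.predict(vectorizer.transform(["the goalkeeper saved the match"])))  # most likely 'sports'
```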
If you want more detail on any step, code examples, or further reading, just ask!
⁂
Detailed Explanation of Module 3 Project: Diabetes Classification with KNN
(Fully Updated and Structured for Beginners, with All Your Queries Addressed)
Used .describe() to get statistics (mean, std, min, max) for each feature.
Example: Average age is ~33, average BMI is ~32.5.
C. Correlation Analysis
Correlation Matrix & Heatmap:
Used diabetes_data.corr() and sns.heatmap() to visualize relationships.
Findings:
Insulin is moderately correlated with Glucose, BMI, and Age.
SkinThickness is highly correlated with BMI.
A. Why Standardize/Scale?
Features have different scales (e.g., glucose vs. pedigree function).
KNN uses distance, so unscaled features can dominate.
Standardization:
Transforms features to mean 0, std 1.
Formula: $ Z = \frac{X - \mu}{\sigma} $
B. Min-Max Scaling
What is it?
Rescales features to the range [0, 1] [1] .
Formula:
$ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} $
Why use it?
Ensures all features contribute equally to distance calculations.
Especially important for KNN.
How was it used?
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df_feat)
A. Why Visualize?
Shows how KNN divides the feature space into regions for each class.
Helps understand how scaling and k affect classification.
Result: The Voronoi diagram shows decision boundaries and how KNN classifies new data.
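A minimal sketch of such a decision-boundary (Voronoi) plot for k = 1 on two of the scaled features; `scaled_data` comes from the MinMaxScaler step above, and the label column name 'Outcome' is an assumption about the lab's dataframe:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Assumes scaled_data (from the MinMaxScaler step) and the lab's diabetes dataframe
X2 = scaled_data[:, :2]                          # two features, e.g. the first two columns
y2 = diabetes_data['Outcome'].values             # 'Outcome' is an assumed label column name

knn = KNeighborsClassifier(n_neighbors=1).fit(X2, y2)

# Evaluate the classifier on a dense grid; for k = 1 the regions form a Voronoi partition
xx, yy = np.meshgrid(np.linspace(0, 1, 300), np.linspace(0, 1, 300))
zz = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y2, s=10)
plt.show()
```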
If you need more details, code examples, or further clarification on any step, just ask!
⁂
1. AIML_Module_3_project.ipynb-Colab.pdf