ML Unit-4

SIDDARTHA INSTITUTE OF SCIENCE AND TECHNOLOGY :: PUTTUR

(AUTONOMOUS)
Siddharth Nagar, Narayanavanam Road – 517583

QUESTION AND ANSWERS (UNIT - IV)

Subject with Code: Machine Learning (20CS0535)    Course & Branch: B.Tech - CSE
Regulation: R20    Year & Sem: III-B.Tech & II-Sem

1a. Define and explain non-parametric methods. [L1] [4M] [CO3]
Non-parametric methods in machine learning are algorithms that do not make
strong assumptions about the shape or structure of the data. Instead, they learn
patterns directly from the training data, making them highly flexible and
powerful.

These methods are helpful when:

 There is a large amount of data


 There is no prior knowledge about the data
 We want the model to automatically learn complex patterns

Examples of non-parametric algorithms:

 k-Nearest Neighbors (k-NN)


 Decision Trees (CART, C4.5)
 Support Vector Machines (SVM)
 Random Forest

Benefits of Non-Parametric Methods:

 Flexibility: Easily adapt to various data patterns without a fixed structure.


 Power: Work well without needing assumptions about the data's shape.
 Performance: Often achieve high accuracy by fitting complex relationships
in data.

Limitations of Non-Parametric Methods:

 Require more data: Need large datasets to learn patterns effectively.


 Slower training: Training and prediction take longer because model complexity grows with the amount of data.
 Overfitting risk: May memorize training data too closely, reducing
accuracy on new data.
 Less interpretability: Hard to understand how the model arrives at its
predictions.

1b. List out advantages and limitations of non-parametric methods in ML. [L2] [8M] [CO3]
Non-parametric methods in machine learning are statistical techniques that do not
assume any specific distribution or form for the data. These methods are useful
when we have limited information about the data's structure, and they offer
flexibility in real-world applications.

Advantages of Non-Parametric Methods:

1. No Need for Normal Distribution:


These methods work even when data does not follow a normal distribution,
making them more flexible.
2. Support for Nominal and Ordinal Data:
They can be used with categorical (nominal) or ranked (ordinal) data, where
parametric methods may not work.
3. No Dependence on Population Parameters:
Hypotheses can be tested without using values like population mean or standard
deviation.
4. Simpler Computations:
Often involve easier mathematical calculations compared to parametric
counterparts.
5. User-Friendly and Easy to Understand:
Non-parametric tests are generally simpler in concept, making them more
approachable for beginners.

Limitations of Non-Parametric Methods:

1. Lower Sensitivity:
When parametric method assumptions are valid, non-parametric methods may miss
subtle patterns or differences.
2. Limited Use of Data Information:
Some methods only use partial information, such as the direction of change, rather
than exact values (e.g., sign test).
3. Reduced Efficiency:
These methods often need larger datasets to achieve the same results as parametric
methods.
➤ Example: A sign test may require 100 samples, while a t-test needs only 60 for
similar outcomes.

Non-parametric methods are best used when data does not meet the conditions
required for parametric methods. While they offer flexibility and simplicity, they
may sacrifice some efficiency and statistical power when parametric assumptions actually hold.

2a. State and explain Non-Parametric Density Estimation. [L1] [6M] [CO3]
Definition:

Non-Parametric Density Estimation is a method used to estimate the probability density


function (PDF) of a random variable without assuming any fixed functional form or
distribution (like normal, exponential, etc.). It relies directly on the structure and spread of
the data to determine the shape of the distribution.

There are four main non-parametric density estimation methods commonly used
in statistics and machine learning:

1. Histogram Estimator
2. Naive Estimator
3. Kernel Density Estimator (KDE)
4. K-Nearest Neighbor Estimator (KNN Estimator)

1. Histogram Estimator

 This is the oldest and most popular method for estimating probability density.
 The data range is divided into equal-sized intervals called bins.
 Given a training set X = {x^t}, t = 1, …, N, an origin x₀, and a bin width h, the histogram density at a point x depends on the number of training samples that fall in the same bin as x.
 The density estimate is proportional to the count of data points within each bin.
 The choice of the origin x₀ affects the estimate near the edges or boundaries of the data range.

The histogram estimate is:

p̂(x) = #{x^t in the same bin as x} / (N · h)
2. Naive Estimator

 Unlike the histogram, the naive estimator does not fix an origin.
 It estimates density based on neighboring training samples around each
point.
 For a given training set X = {x^t}, t = 1, …, N, and bin width h, the naive estimator counts the samples that lie within h/2 to the left and right of the target point.
 This method is simpler and does not rely on the position of fixed bins.

The samples within h/2 to the left and right of the point x contribute to its density estimate:

p̂(x) = #{x − h/2 < x^t ≤ x + h/2} / (N · h)
3. Kernel Density Estimator (KDE)

 KDE smoothens the probability density function by placing a weighted


kernel function over each data point.
 The kernel acts like a smooth weight function, usually a Gaussian kernel
(bell-shaped curve).

 As the distance |x − x^t| increases, the kernel value decreases, so points farther away contribute less to the density estimate.
 Other kernels include Rectangular, Triangular, Biweight, Uniform, Cosine, etc.
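Below is a minimal sketch of kernel density estimation with a Gaussian kernel, using NumPy. The dataset, bandwidth h, and function names are illustrative choices, not part of the original material.

import numpy as np

def gaussian_kernel(u):
    # Gaussian (bell-shaped) kernel: K(u) = exp(-u^2 / 2) / sqrt(2*pi)
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    # Kernel density estimate at x: p(x) = (1 / (N*h)) * sum over t of K((x - x^t) / h)
    u = (x - samples) / h
    return gaussian_kernel(u).sum() / (len(samples) * h)

samples = np.array([2.0, 3.0, 5.0, 6.0, 7.0, 8.0, 9.0])   # toy data (assumed)
h = 1.0                                                    # bandwidth (assumed)
for x in (4.0, 7.0):
    print(f"KDE estimate at x={x}: {kde(x, samples, h):.3f}")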

4. K-Nearest Neighbor Estimator (KNN Estimator)

Unlike the previous methods, which fix the bin width h, this estimator fixes the
number of nearest neighbors k. The density at a sample depends on the value of k and
the distance to the kth nearest neighbor of the sample; it is closely related to
kernel estimation. The k-NN density estimate is given below, where d_k(x) is the
Euclidean distance from x to its kth nearest neighbor:

p̂(x) = k / (2 · N · d_k(x))
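A minimal sketch of this estimator in one dimension, assuming NumPy; the function name and sample values are illustrative, not from the original:

import numpy as np

def knn_density(x, samples, k):
    # p(x) = k / (2 * N * d_k(x)), where d_k(x) is the distance
    # from x to its k-th nearest neighbor among the samples.
    distances = np.sort(np.abs(samples - x))
    d_k = distances[k - 1]
    return k / (2 * len(samples) * d_k)

samples = np.array([2.0, 3.0, 5.0, 6.0, 7.0, 8.0, 9.0])
print(knn_density(6.5, samples, k=3))   # higher density inside the data cloud
print(knn_density(12.0, samples, k=3))  # lower density far from the data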
2b. Explain Histogram Estimator with simple example. [L2] [6M] [CO3]

Histogram Estimator
A Histogram Estimator is a non-parametric method used to estimate the probability
density function (PDF) of a continuous random variable based on observed data. It
works by dividing the data range into equal-sized intervals (called bins) and
counting how many data points fall into each bin. The height of each bin
(normalized count) gives an estimate of the density in that region.

How it Works:
1. Divide the data range into bins: Split the range of values into non-
overlapping intervals (bins) of equal width h.
2. Count the number of data points in each bin: For each bin, count how many
data points fall within it.

3. Estimate the density: The estimated density in each bin is:


f̂(x) = (number of points in the bin) / (n · h)
where:
 n = total number of data points
 h = width of each bin

Simple Example:
Suppose we have a small dataset:
{2, 3, 5, 6, 7, 8, 9}

Let’s estimate the density using histogram estimation with:


 Bin width h = 2
 Bins: [2–4), [4–6), [6–8), [8–10)

Step-by-step:

Bin      | Data Points in Bin | Count | Density Estimate
[2, 4)   | 2, 3               | 2     | 2 / (7 · 2) = 0.143
[4, 6)   | 5                  | 1     | 1 / (7 · 2) = 0.071
[6, 8)   | 6, 7               | 2     | 2 / (7 · 2) = 0.143
[8, 10)  | 8, 9               | 2     | 2 / (7 · 2) = 0.143

Interpretation:
 The histogram estimator gives a stepwise approximation of the PDF.
 Areas with more data points have higher estimated density.
 It’s a simple way to visualize and understand the distribution of data.
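The same calculation can be reproduced with a short NumPy sketch; the bin edges and variable names below are illustrative choices:

import numpy as np

data = np.array([2, 3, 5, 6, 7, 8, 9], dtype=float)
h = 2.0                                # bin width
edges = np.arange(2, 12, h)            # bins [2,4), [4,6), [6,8), [8,10)

counts, _ = np.histogram(data, bins=edges)
density = counts / (len(data) * h)     # f(x) = count / (n * h)

for left, right, c, d in zip(edges[:-1], edges[1:], counts, density):
    print(f"[{left:.0f}, {right:.0f}): count={c}, density={d:.3f}")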

3a. Analyze the K-Nearest Neighbor Estimator. [L4] [6M] [CO6]
K-Nearest Neighbor (K-NN) Estimator – Analysis

The K-Nearest Neighbor (K-NN) estimator is a non-parametric density estimation


technique used to estimate the probability density function of a dataset. Unlike other
methods that fix the bin width (like histogram or kernel methods), the K-NN
estimator fixes the number of neighbors (k) and adjusts the size of the region around
a query point based on the distance to its k-th nearest neighbor.

K-NN Algorithm – Working Steps

The working of the K-Nearest Neighbors (K-NN) algorithm can be explained


using the following steps:

1. Step 1: Select the number K of neighbors.


2. Step 2: Calculate the Euclidean distance between the new data point and all
the training data points.
3. Step 3: Identify the K nearest neighbors based on the smallest distances.
4. Step 4: Among these K neighbors, count how many data points belong to
each category (or class).
5. Step 5: Assign the new data point to the category with the highest count
among the K neighbors.
6. Step 6: The model is now ready to make predictions.

Suppose we have a new data point that must be assigned to one of two categories, A or B. The steps are applied as follows:

 First, choose the number of neighbors, for example, k = 5.
 Next, calculate the Euclidean distance between the new data point and each training point. The Euclidean distance is the straight-line distance between two points in space, commonly used in geometry:

Euclidean distance between A1 and B1 = √((x2 − x1)² + (y2 − y1)²)

 By calculating the Euclidean distances, we find the five nearest neighbors: three belong to category A and two belong to category B.
 Since the majority of the nearest neighbors are from category A, the new data point is assigned to category A.
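A minimal sketch of these steps with NumPy, using a made-up 2-D dataset; the points, labels, and function name are illustrative, not from the original:

import numpy as np
from collections import Counter

def knn_predict(query, X, y, k=5):
    # Steps 2-3: Euclidean distance to every training point, keep the k smallest.
    distances = np.sqrt(((X - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote among the k nearest neighbors.
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: category A clustered near (1, 1), category B near (4, 4)
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [4, 4], [4, 5], [5, 4]], dtype=float)
y = np.array(["A", "A", "A", "A", "B", "B", "B"])

print(knn_predict(np.array([2.5, 2.5]), X, y, k=5))   # expected to print "A"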

How to Select the Value of K in the K-NN Algorithm:

Here are some important points to consider when choosing the value of K in the K-
Nearest Neighbors (K-NN) algorithm:

 There is no fixed rule to determine the best value for K; it often requires
experimenting with different values.
 The most commonly used and preferred value of K is 5.
 A small value of K (e.g., K = 1 or 2) can make the model sensitive to noise
and outliers.
 A larger value of K can improve stability, but it may also lead to less distinct
classifications and may include irrelevant points in predictions.

Advantages of the K-NN Algorithm:

 It is simple and easy to implement.


 It is robust to noisy data.
 It performs well when training data is large.

Disadvantages of the K-NN Algorithm:

 The value of K must be carefully selected, which can be challenging.


 The algorithm has a high computational cost, as it requires calculating
distances between the new data point and all training samples.

3b. Express Non-Parametric classification with an example. [L6] [6M] [CO3]

Non-parametric classification refers to a type of machine learning model that does


not make strong assumptions about the underlying data distribution. Instead, it relies
directly on the training data to make predictions. These methods are particularly
useful when the data is complex or not well modelled by a known distribution.

Characteristics:
 No fixed number of parameters: The model complexity grows with the size
of the training data.
 Flexible decision boundaries: Can adapt to more complex data distributions.
 Memory-based: Often relies on storing the training data (e.g., instance-based
learning).

Common Non-Parametric Classifiers:


1. k-Nearest Neighbors (k-NN)
2. Decision Trees
3. Random Forests (though they use an ensemble of trees)
4. Support Vector Machines (SVM) (when using non-linear kernels)

Example: Decision Tree for Loan Approval


Suppose a bank wants to decide whether to approve a loan application based on the
following factors:
 Income (High / Medium / Low)
 Credit History (Good / Bad)
 Student (Yes / No)
And the target output is:
 Loan Approved (Yes / No)

Sample Dataset:

Income Credit History Student Loan Approved

High Good No Yes


High Bad No No
Medium Good No Yes
Low Good Yes Yes
Low Bad Yes No
Low Good No No

Decision Tree:

Explanation:

 First split: Based on Credit History (it is the most informative attribute).
 If Bad:
o Classify as No (both applicants with a bad credit history were rejected).
 If Good:
o Then split further on Income.
 High or Medium → Yes
 Low → split further on Student:
 Yes → Yes
 No → No

Prediction Example:
Suppose a new applicant has:
 Income: Low
 Credit History: Good
 Student: Yes

Prediction Path:
 Credit History = Good → follow the Good branch
 Income = Low → split further on Student
 Student = Yes → classify as Yes
So, the loan is approved. A short code sketch of this example follows.
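A minimal sketch of this example using pandas and scikit-learn, assuming both libraries are available; the column names and encoding choices are mine, and a fully grown tree fit to these six rows should follow the Good → Low Income → Student = Yes path:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The six training records from the sample dataset above.
data = pd.DataFrame({
    "Income":        ["High", "High", "Medium", "Low", "Low", "Low"],
    "CreditHistory": ["Good", "Bad", "Good", "Good", "Bad", "Good"],
    "Student":       ["No", "No", "No", "Yes", "Yes", "No"],
    "LoanApproved":  ["Yes", "No", "Yes", "Yes", "No", "No"],
})

# One-hot encode the categorical features so the tree can split on them.
X = pd.get_dummies(data[["Income", "CreditHistory", "Student"]])
y = data["LoanApproved"]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# New applicant: Income=Low, Credit History=Good, Student=Yes
applicant = pd.get_dummies(
    pd.DataFrame({"Income": ["Low"], "CreditHistory": ["Good"], "Student": ["Yes"]})
).reindex(columns=X.columns, fill_value=0)

print(tree.predict(applicant)[0])   # prediction for the new applicant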

4. Illustrate Condensed Nearest Neighbor in reinforcement learning. [L3] [12M] [CO4]

Condensed Nearest Neighbor (CNN) is a technique originally from supervised learning,
specifically a data reduction method for the k-Nearest Neighbor (k-NN) classifier.
 The goal of CNN is to reduce the size of the training set by selecting a subset
of examples that still classify the entire dataset correctly using a 1-NN rule.
 It iteratively selects points that are misclassified by the current subset and
adds them to the subset until all training points are correctly classified.
 This results in a condensed subset of the data that can be used for faster k-
NN classification without losing much accuracy.

While CNN is not a standalone reinforcement learning algorithm, it plays a valuable


role as an optimization technique for memory-based RL methods. By intelligently
condensing the data used for learning, CNN helps reduce memory and
computational requirements without significantly sacrificing performance. Its
application is especially useful in instance-based value approximation, experience
replay, and transition model learning.

CNN in the Context of Reinforcement Learning


While not traditionally part of RL, Condensed Nearest Neighbor can be applied in
the following ways:
1. Experience Replay Buffer Optimization
 In RL, agents store transitions (state, action, reward, next_state) in a buffer.
 CNN can help reduce redundancy in this buffer by retaining only the most
informative transitions (e.g., near decision boundaries).
 This improves memory efficiency and possibly training speed, especially in
large or continuous state spaces.
2. Instance-Based Value Function Approximation
 Instead of using a parametric function approximator (like a neural network),
RL can use instance-based methods to estimate value functions or Q-values
(e.g., via k-NN).
 Here, CNN can reduce the number of stored instances (transitions) while still
allowing effective approximation of value functions.

3. Model-Based RL
 In model-based RL, an agent may use stored transitions to model the
environment.
 CNN can ensure the learned model is based on a compact yet representative
subset of experience.

Algorithm Overview (CNN):


1. Initialize the prototype set with one transition per class (or per distinct
reward outcome).
2. For each transition in the original buffer:
o Predict the outcome using k-NN on the prototype set.
o If prediction is incorrect (or value estimate is poor), add the
transition to the prototype set.
o Otherwise, discard it.
This results in a minimal subset of transitions that preserve the performance of the
original larger dataset.
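A minimal sketch of the classical CNN condensation step for a 1-NN classifier, using NumPy on a made-up two-class dataset; the function name and data are illustrative, and applying the same idea to RL transition buffers would follow the same pattern:

import numpy as np

def condense(X, y, seed=0):
    # Start the prototype set with one example per class, then repeatedly add
    # any point that the current prototypes misclassify under the 1-NN rule.
    rng = np.random.default_rng(seed)
    prototypes = [int(np.flatnonzero(y == c)[0]) for c in np.unique(y)]
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(len(X)):
            P = np.array(prototypes)
            nearest = P[np.argmin(np.linalg.norm(X[P] - X[i], axis=1))]
            if y[nearest] != y[i]:          # misclassified -> keep this point
                prototypes.append(int(i))
                changed = True
    return sorted(set(prototypes))

# Toy data: two well-separated classes; most interior points get discarded.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(condense(X, y))   # indices of the condensed subset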

Advantages in RL:
 Memory Efficiency: Fewer transitions stored.
 Faster Learning: Reduces computation in instance-based methods.
 Noise Reduction: Avoids overfitting to redundant data.
Limitations:
 Greedy Algorithm: May discard useful rare cases.
 High Dimensionality Issues: Less effective with continuous, high-
dimensional state spaces.
 Information Loss Risk: Discarded transitions may carry information that matters for long-term dependencies in RL.

5a. List out the Applications of KNN in machine learning. [L1] [6M] [CO6]
1. Classification Tasks
 Image classification: Assigning labels to images (e.g., handwritten digit
recognition like MNIST).
 Text categorization: Classifying news articles, emails (spam vs. non-spam),
or documents.
 Medical diagnosis: Predicting diseases based on symptoms or test results.
2. Regression Tasks
 House price prediction: Estimating property prices based on features like
location, size, etc.
 Weather forecasting: Predicting temperature, humidity, etc., based on
historical data.
 Stock price estimation: Approximating future prices using nearest historical
patterns.

3. Recommendation Systems
 Product recommendations: Suggesting items based on the preferences of
similar users (collaborative filtering).
 Movie or music recommendations: Based on user behaviour similarity.
4. Anomaly Detection
 Detecting outliers in network intrusion, fraud detection, or sensor data by
observing data points that are far from their neighbours.
5. Pattern Recognition
 Face recognition: Matching facial features to known identities.
 Speech recognition: Matching audio patterns to words or phonemes.
6. Image Processing
 Content-based image retrieval (CBIR): Finding similar images from a
database.
 Object detection: Classifying different parts of an image.
7. Recommender Systems in E-commerce
 Suggesting products to users by comparing with similar users' preferences or
browsing history.
8. Bioinformatics
 Classifying genes or proteins based on sequence similarity.
 Predicting biological function of unknown genes using labelled data.
9. Customer Segmentation
 Grouping customers based on purchasing behaviour or demographics.
10. Credit Scoring & Risk Analysis
 Predicting loan default or creditworthiness based on similarity to past
customers.

5b. Distinguish between parametric and non-parametric classifications. [L4] [6M] [CO4]
Parametric classifiers simplify the learning problem by assuming a specific model
structure with a finite set of parameters. They are efficient but less flexible.
Non-parametric classifiers are more data-driven, making them better at capturing
complex relationships, but they require more data and computation.

Aspect             | Parametric Classification                                       | Non-Parametric Classification
Definition         | Assumes a fixed (parameterized) form for the decision function. | Makes no assumption about the functional form of the data.
Model Structure    | Predefined, finite set of parameters.                           | Flexible; model complexity grows with the data.
Examples           | Logistic Regression, Linear Discriminant Analysis, Naive Bayes. | k-Nearest Neighbors (k-NN), Decision Trees, Support Vector Machines.
Training Time      | Usually faster due to fewer parameters.                         | Slower, as it may require more data storage and computation.
Prediction Time    | Fast (fixed function evaluated).                                | Can be slower (e.g., k-NN searches the entire dataset).
Assumptions        | Strong assumptions about data distribution (e.g., normality).   | Minimal or no assumptions about data distribution.
Flexibility        | Less flexible; may underfit if assumptions are wrong.           | More flexible; can capture complex patterns.
Data Requirement   | Works well with small to medium-sized datasets.                 | Needs large datasets for good performance.
Interpretability   | Often more interpretable due to simpler model forms.            | May be less interpretable, especially with large or complex models.
Overfitting Risk   | Lower (due to limited capacity).                                | Higher, especially if the model is too flexible.

6. Discuss the following terms: i. Principal Component Analysis ii. Factor Analysis [L2] [12M] [CO5]

i. Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning algorithm used for
dimensionality reduction in machine learning. It is a statistical process that
converts the observations of correlated features into a set of linearly uncorrelated
features with the help of an orthogonal transformation. These new transformed features
are called the Principal Components. It is one of the popular tools used for
exploratory data analysis and predictive modeling, and it draws out strong patterns
from the dataset by keeping the directions of highest variance and discarding the rest.

The PCA algorithm is based on some mathematical concepts such as:


o Variance and Covariance
o Eigenvalues and Eigenvectors

Steps for PCA algorithm


1. Getting the dataset.
Firstly, we need to take the input dataset and divide it into two subparts X
and Y, where X is
the training set, and Y is the validation set.

2. Representing data into a structure.


Now we will represent our dataset into a structure. Such as we will represent
the two- dimensional matrix of independent variable X. Here each row
corresponds to the data items, and the column corresponds to the Features.
The number of columns is the dimensions of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. In a particular column, features
with high variance dominate features with lower variance. If the importance of a
feature should not depend on its variance, we divide each data item in a column by
the standard deviation of that column. We will name the resulting matrix Z.

4. Calculating the covariance of Z.
To calculate the covariance of Z, we take the matrix Z, transpose it, and
multiply the transpose by Z. The output matrix is the covariance matrix of Z.
5. Calculating the eigenvalues and eigenvectors.
Now we calculate the eigenvalues and eigenvectors of the resulting covariance
matrix. The eigenvectors of the covariance matrix are the directions of the axes
with the highest information (variance), and the corresponding eigenvalues give
the amount of variance along those directions.
6. Sorting the eigenvectors.
In this step, we take all the eigenvalues and sort them in decreasing order,
from largest to smallest, and simultaneously sort the eigenvectors accordingly
into a matrix P. The resulting matrix is named P*.
7. Calculating the new features or principal components.
Here we will calculate the new features. To do this, we will multiply the P*
matrix to the Z. In the resultant matrix Z*, each observation is the linear
combination of original features. Each column of the Z* matrix is
independent of each other.
8. Remove less important features from the new dataset.
Now that the new feature set has been obtained, we decide what to keep and what
to remove: only the relevant or important features are kept in the new dataset,
and the unimportant features are removed. A short sketch of these steps follows.
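A minimal NumPy sketch of steps 3–8 on synthetic data; the data, variable names, and the choice of k are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])   # correlated features

# Step 3: standardize each column (zero mean, unit standard deviation) -> Z
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z
cov = np.cov(Z, rowvar=False)

# Steps 5-6: eigenvalues/eigenvectors, sorted from largest to smallest eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, P_star = eigvals[order], eigvecs[:, order]

# Steps 7-8: keep the top k principal components and project the data
k = 2
Z_star = Z @ P_star[:, :k]

print("explained variance ratio:", eigvals[:k] / eigvals.sum())
print("reduced data shape:", Z_star.shape)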

ii. Factor Analysis:

Definition:
Feature Analysis in machine learning refers to the process of examining and
selecting the most relevant input variables (features) that contribute to building an
effective predictive model. It involves understanding the role, significance, and
quality of features in relation to the target variable or output.

Purpose of Feature Analysis:


 To improve model accuracy by selecting meaningful features.
 To reduce overfitting by eliminating noisy or irrelevant data.
 To enhance training efficiency by reducing dimensionality.
 To make models more interpretable and generalizable.

Types of Feature Analysis Techniques:


1. Feature Selection:
Choosing a subset of the most important features from the original set.
 Filter methods: Use statistical techniques (e.g., correlation, chi-square test)
independent of the model.
 Wrapper methods: Evaluate feature subsets by training and testing models
(e.g., forward selection, recursive feature elimination).
 Embedded methods: Feature selection occurs during model training (e.g.,
Lasso regularization).

2. Feature Extraction:
Creating new features by transforming or combining existing ones.
 Principal Component Analysis (PCA): Reduces dimensionality by creating
new orthogonal features.
 Linear Discriminant Analysis (LDA): Maximizes class separability.
 Autoencoders: Neural network-based technique for learning compressed
representations.
3. Feature Engineering:
Manually creating new features from raw data to improve model performance.
Examples include:
 Creating interaction terms (e.g., multiplying two features).
 Binning numerical variables.
 Extracting date/time features (like day of week, month).

7. Define Dimensionality Reduction. Illustrate in detail Subset Selection in dimensionality reduction. [L2] [12M] [CO3]

Dimensionality Reduction is the process of reducing the number of input variables


or features in a dataset while preserving as much meaningful information as
possible. In high-dimensional data (data with many features), some features may be
redundant, irrelevant, or noisy, which can negatively impact model performance.
Dimensionality reduction techniques simplify the data, making it easier to visualize,
faster to process, and often improving the performance of machine learning
algorithms.
Dimensionality reduction is broadly classified into two types
1. Feature Selection
2. Feature Extraction

Feature Selection (Subset Selection)

Feature selection involves selecting a subset of relevant features from the original
dataset without transforming them. This helps in removing irrelevant, redundant,
or less informative features.
Common sub-techniques include (see the sketch after this list for two of them):

 Missing Value Ratio: Removes features with too many missing values.
 Low Variance Filter: Removes features that show little variation across
records.
 High Correlation Filter: Removes one of two highly correlated features to
avoid redundancy.
 Random Forest: Uses feature importance scores from a tree-based model.
 Backward Feature Elimination: Starts with all features and removes one at a
time based on performance.
 Forward Feature Selection: Starts with none and adds features that improve
performance.
Goal: Keep only the most informative features from the original dataset.
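A minimal sketch of the Low Variance Filter and High Correlation Filter on a made-up pandas DataFrame; the column names, thresholds, and data are illustrative assumptions:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income":   rng.normal(50, 10, 200),
    "age":      rng.integers(20, 60, 200).astype(float),
    "constant": np.ones(200),                 # essentially no variance
})
df["income_copy"] = df["income"] * 1.01       # highly correlated duplicate

# Low Variance Filter: drop features whose variance is below a threshold.
low_var = [c for c in df.columns if df[c].var() < 1e-6]

# High Correlation Filter: drop one of each pair with |correlation| > 0.95.
corr = df.drop(columns=low_var).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = [c for c in upper.columns if (upper[c] > 0.95).any()]

selected = df.drop(columns=low_var + high_corr)
print("dropped:", low_var + high_corr)
print("kept:", list(selected.columns))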

2. Feature Extraction (Dimensionality Reduction Proper)


This process transforms existing features into new, compact representations
(components), which retain most of the data's useful information.
A. Components/Factors Based Techniques:
These techniques create new variables (factors or components) that summarize the
original features.
 Factor Analysis (FA): Uncovers latent variables that explain the observed
correlations.
 Principal Component Analysis (PCA): Projects data onto orthogonal
directions (principal components) that capture maximum variance.
 Independent Component Analysis (ICA): Separates a multivariate signal into
independent non-Gaussian components.

B. Projection-Based Techniques:
These are mostly used for visualization and non-linear dimensionality reduction,
often for exploratory data analysis.
 ISOMAP: Preserves geodesic distances between points.
 t-SNE (t-distributed Stochastic Neighbor Embedding): Maps high-
dimensional data into 2D or 3D for visualization, preserving local structure.
 UMAP (Uniform Manifold Approximation and Projection): Similar to t-SNE
but faster and better at preserving global structure.
Goal: Reduce dimensions through transformation, often for improved clustering,
visualization, or model efficiency.

8a. Explain Linear Discriminant Analysis. [L2] [8M] [CO4]
Linear Discriminant Analysis is a powerful supervised learning technique for
both dimensionality reduction and classification. It is especially useful when
working with datasets where class separation is critical, and dimensionality is
high. By projecting the data onto a new axis, LDA significantly enhances class
separability and classification performance.
It is especially useful when multiple features need to be analyzed to distinguish
between two or more classes.

In many classification tasks, when we try to separate classes using a single feature,
the result might include overlapping between classes. This leads to poor
classification accuracy.
 Example: Suppose we are classifying two classes using a 2D feature space
(X and Y axes). It might not be possible to draw a straight line to separate
them in 2D.
 LDA transforms the feature space to a new axis that maximizes the distance
between the means of the classes and minimizes the spread within each
class.
 As a result, LDA converts a 2D or 3D space into a 1D space, enabling better
class separation.

LDA performs dimensionality reduction by calculating a new axis (linear


combination of features) that:
1. Maximize Between-Class Variance (SB):
o This measures how far apart the means of different classes are.
o The larger the distance between the class means, the better the
separation.
2. Minimize Within-Class Variance (SW):
o This measures the spread of data points within each class.
o The more compact the class clusters are, the easier it is to distinguish
between them.
The objective of LDA is to maximize the ratio of between-class
variance to within-class variance, so that the classes are well-
separated in the new projected space.

Step-by-Step Procedure of LDA:


1. Compute the mean vectors for each class in the dataset.
o Example: For 3 classes, calculate mean vector for each class.
2. Compute the within-class scatter matrix (SW):
o Measures the spread of features within each class.
3. Compute the between-class scatter matrix (SB):
o Measures the distance between the class means.
4. Compute the eigenvalues and eigenvectors of the matrix S_W⁻¹ S_B:
o These represent the directions (axes) of maximum class separation.
5. Select top k eigenvectors corresponding to the largest eigenvalues:
o These form the new lower-dimensional feature space.
6. Project the data onto this new space:
o Each data point is transformed to the new axis, improving class
separation.
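A minimal NumPy sketch of this procedure on synthetic two-class data; the data and variable names are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
# Two 2-D classes with different means (synthetic data)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))
X, y = np.vstack([X0, X1]), np.array([0] * 50 + [1] * 50)

overall_mean = X.mean(axis=0)
S_W = np.zeros((2, 2))   # within-class scatter
S_B = np.zeros((2, 2))   # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)
    diff = (mc - overall_mean).reshape(-1, 1)
    S_B += len(Xc) * (diff @ diff.T)

# Directions of maximum class separation: eigenvectors of S_W^-1 S_B
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w = eigvecs[:, np.argmax(eigvals.real)].real    # top discriminant axis

projected = X @ w    # 2-D data projected onto a single, well-separating axis
print("class means on the LDA axis:", projected[y == 0].mean(), projected[y == 1].mean())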
Applications of LDA
 Face recognition
 Speech recognition
 Medical diagnostics
 Marketing segmentation
 Text classification
 Bioinformatics and gene expression analysis

Limitations of LDA


 Assumes normal distribution of data
 Assumes equal covariance among all classes
 Linear boundaries only – not suitable for non-linear problems

8b. Illustrate various applications of Linear Discriminant Analysis. [L1] [4M] [CO6]

1. Face Recognition
 LDA is extensively used in facial recognition systems.
 It helps reduce the high-dimensional space of image pixels into a smaller
space while maintaining the differences between different faces (classes).
 For example, the "Fisher faces" method uses LDA to improve upon PCA in
recognizing human faces under varying lighting and expression conditions.

2. Medical Diagnosis
 In healthcare, LDA is used to classify patients into disease vs. healthy
categories based on symptoms or diagnostic parameters (e.g., blood pressure,
sugar levels, etc.).
 Example: Classifying tumor types (benign or malignant) using labeled
biological data.

3. Text and Document Classification


 LDA can be applied to natural language processing tasks to classify
documents into categories such as spam/non-spam, news topics, or
sentiment classes.
 It converts the high-dimensional term-frequency vectors into lower
dimensions, making classification faster and more accurate.

4. Marketing and Customer Segmentation


 Businesses use LDA to group customers based on purchasing behavior,
demographics, and preferences.
 It helps in targeted marketing by classifying consumers into distinct
segments (e.g., high spenders, occasional buyers, etc.).

5. Speech Recognition
 In speech processing, LDA is used to classify audio signals into phoneme or
word classes.
 It enhances recognition by reducing noisy and redundant features, improving
the model’s efficiency.

6. Bioinformatics and Genetics


 In gene expression analysis, where datasets often have thousands of features
(genes), LDA helps classify different cell types or disease stages.
 It selects the most significant features for discriminating between biological
conditions.

9a. Compare Metric multidimensional scaling and Non-metric multidimensional scaling. [L5] [6M] [CO5]
Aspect                   | Metric MDS | Non-Metric MDS
Data Type                | Quantitative (interval or ratio scales) | Ordinal (rank-order data)
Objective                | Preserve actual distances between data points | Preserve the rank order of distances (the relative ordering of dissimilarities)
Distance Assumption      | Assumes dissimilarities are measured on a meaningful scale and can be directly interpreted as distances | Assumes only the order of dissimilarities is meaningful; the actual magnitudes are not necessarily informative
Optimization Criterion   | Minimizes a "stress" function measuring the discrepancy between the original dissimilarities and the distances in the low-dimensional space, often using least squares | Minimizes a "stress" function measuring the discrepancy between the ranks of the original dissimilarities and the ranks of the distances in the low-dimensional space, often using monotonic regression
Flexibility              | Less flexible; sensitive to violations of metric assumptions | More flexible; can handle non-metric data and is robust to violations of metric assumptions
Interpretation           | Distances in the low-dimensional space can be interpreted directly in terms of the original dissimilarities | Distances in the low-dimensional space reflect the rank order of the original dissimilarities but not their exact values
Use Cases                | Suitable when precise distance information is available and meaningful, such as physical measurements, or when the goal is to preserve actual distances | Suitable when only the order of dissimilarities is known or meaningful, such as preference rankings, survey responses, or subjective assessments
Computational Approach   | Often involves eigenvalue decomposition or other linear algebra techniques | Typically involves iterative optimization methods, such as monotonic regression and stress minimization algorithms
Examples of Applications | Geographical mapping, psychometrics with interval data, and other contexts where preserving actual distances is crucial | Market research, psychology, ecology, and other fields where data are ordinal or where preserving the order of similarities/dissimilarities matters more than their exact values
9b. List out the applications of MDS. [L1] [6M] [CO6]
1.Data Visualization:
 Reducing high-dimensional data to 2D or 3D for easy visualization and
interpretation.
 Helps to reveal hidden patterns, clusters, or relationships in the data.

2. Psychology and Social Sciences:


 Mapping perceptual or cognitive similarities/dissimilarities among stimuli,
such as colors, sounds, or words.
 Analyzing survey or questionnaire data to understand subjective perceptions.
3. Marketing and Consumer Research:
 Understanding consumer preferences and positioning of products or brands
in a perceptual space.
 Analyzing similarity between products or customer segments.
4. Bioinformatics and Genomics:
 Visualizing genetic distance or similarity among species or populations.
 Analyzing protein or gene expression data.
5. Linguistics:
 Mapping semantic similarities between words or languages.
 Studying dialect or language variation.
6. Ecology and Environmental Science:
 Visualizing similarity in species distribution or environmental conditions
across sites.
7. Recommendation Systems:
 Representing similarity between users or items in lower-dimensional space
to improve recommendations.
8. Computer Vision and Image Analysis:
 Reducing dimensionality of image features for classification or clustering.

10a. Summarize the following terms: i) Distances ii) Euclidean distance iii) Metrics [L2] [6M] [CO3]

i) Distances

Distance measures, also known as distance metrics or similarity measures,


are mathematical functions that quantify the difference or similarity between two
data points in a machine learning context. These measures are fundamental to many
algorithms, including clustering, classification, and search. They determine how
close or far apart data points are, influencing the structure and performance of
machine learning models.
Types of Distance Measures:
 Euclidean Distance:
The straight-line distance between two points in a multi-dimensional space. It's a
commonly used metric, particularly for numerical data.
 Manhattan Distance:
The sum of the absolute differences between corresponding coordinates of two
points. It's also known as L1 distance or city block distance.
 Minkowski Distance:
A generalization of both Euclidean and Manhattan distances, using a parameter to
control the type of distance. It's a flexible metric for various applications.
 Hamming Distance:
Measures the number of positions where two strings of equal length differ. It's
useful for comparing categorical or binary data.
 Cosine Similarity:
Measures the cosine of the angle between two vectors. It indicates the degree of
similarity between vectors, regardless of their magnitude.
 Jaccard Similarity:
Calculates the overlap between two sets, indicating the proportion of elements they
share. It's useful for comparing categorical or binary data
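A minimal sketch of a few of these measures in Python/NumPy; the function names and example values are illustrative:

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))            # straight-line (L2) distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))                    # city-block (L1) distance

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hamming(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))  # positions that differ

def jaccard(set1, set2):
    return len(set1 & set2) / len(set1 | set2)      # overlap between two sets

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 8.0])
print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))
print(hamming("karolin", "kathrin"), jaccard({"a", "b", "c"}, {"b", "c", "d"}))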

ii) Euclidean Distance

Euclidean Distance is defined as the distance between two points in Euclidean


space. To find the distance between two points, the length of the line segment that
connects the two points should be measured.
Euclidean distance is like measuring the straightest and shortest path between two
points. It tells you how far apart the two points are without any turns or bends, just
as a bird would fly directly from one spot to another.

Euclidean Distance Formula


Consider two points (x1, y1) and (x2, y2) in a 2-dimensional space; the Euclidean
Distance between them is given by using the formula:

d = √((x2 − x1)² + (y2 − y1)²)


Where,
 d is Euclidean Distance,
 (x1, y1) is the Coordinate of the first point,
 (x2, y2) is the Coordinate of the second point.

Euclidean Distance in 3D
If the two points (x1, y1, z1) and (x2, y2, z2) are in a 3-dimensional space, the
Euclidean Distance between them is given by using the formula:

d = √((x2 − x1)² + (y2 − y1)² + (z2 − z1)²)


Where,
 d is Euclidean Distance,
 (x1, y1, z1) is the Coordinate of the first point,
 (x2, y2, z2) is the Coordinate of the second point.

iii)Metrics

Metrics are quantitative measures used to evaluate the performance of a


model. They help determine how well a model is performing and can be used to
compare different models. Metrics are essential for understanding a model's
strengths, weaknesses, and overall effectiveness.
Here's a breakdown of key aspects:

 Purpose:
Metrics provide insights into a model's ability to predict outcomes, generalize to
new data, and make accurate classifications or regressions.
 Types:
Different metrics are used for different types of machine learning tasks, including:
o Classification: Accuracy, Precision, Recall, F1-score, ROC AUC.
o Regression: Mean Squared Error (MSE), Mean Absolute Error
(MAE), R-squared.
o Clustering: Silhouette score, Davies-Bouldin index.
Examples:
1. Accuracy: Measures the overall correctness of predictions.
2. Precision: Measures the proportion of true positive predictions among all
positive predictions.
3. Recall: Measures the proportion of true positive predictions among all actual
positive cases.
4. F1-score: The harmonic mean of precision and recall, providing a balanced
measure.
5. MAE: The average absolute difference between predicted and actual values.
6. MSE: The average of the squared differences between predicted and actual
values.
7. R-squared: Measures the proportion of variance in the dependent variable
that is predictable from the independent variable(s).
8. ROC AUC: Measures the area under the receiver operating characteristic
curve, used for evaluating classification models.
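A minimal sketch computing several of these metrics with scikit-learn, assuming it is installed; the true and predicted labels below are made up for illustration:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error,
                             r2_score)

# Classification metrics on hypothetical true vs. predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))

# Regression metrics on hypothetical continuous predictions
y_true_r = [3.0, 5.0, 2.5, 7.0]
y_pred_r = [2.5, 5.0, 3.0, 8.0]
print("MAE:", mean_absolute_error(y_true_r, y_pred_r))
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
print("R^2:", r2_score(y_true_r, y_pred_r))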
10b. Analyze supervised learning after clustering. [L4] [6M] [CO4]
Supervised Learning After Clustering

Using clustering as a preprocessing step before supervised learning involves


combining the strengths of unsupervised and supervised methods. This approach
enhances the learning process by capturing hidden patterns and groupings in the data
that can serve as valuable features for supervised models.

Integrating clustering with supervised learning is a powerful hybrid approach. It


enhances the model's ability to capture hidden relationships in data, especially when
labels are sparse or the dataset is complex. When applied thoughtfully, this
technique can lead to improved predictive performance, better generalization, and
more insightful models.

Clustering for Feature Creation


 Unsupervised Learning:
Clustering algorithms such as K-Means, DBSCAN, or Hierarchical
Clustering are used to group data points based on similarity without
reference to target labels.
 Creating Categorical Features:
Once clustering is performed, each data point is assigned a cluster ID. This
ID becomes a new categorical feature that can be used in subsequent
supervised learning tasks.
 Dimensionality Reduction:
Clustering can condense high-dimensional data by summarizing patterns into
discrete groupings, which simplifies the dataset while retaining structure.

Supervised Learning with Cluster Information


 Incorporating Cluster IDs:
The newly generated cluster IDs can be added as input features to
classification or regression models. These IDs represent latent structures or
relationships not captured by raw features.
 Improved Model Performance:
Cluster-based features can help the model distinguish between different data
regions more effectively, especially in noisy, non-linear, or high-
dimensional datasets.
 Example:
In customer segmentation, clustering might reveal groups with similar
buying behavior. A supervised model can then predict which segment a new
customer belongs to, using both original features and the cluster label.
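A minimal sketch of this hybrid approach with scikit-learn, assuming it is installed; the synthetic data, number of clusters, and choice of models are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                      # synthetic customer features
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # synthetic target labels

# Unsupervised step: cluster the data and treat the cluster ID as a new feature.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_onehot = np.eye(3)[kmeans.labels_]         # one-hot encode the cluster IDs

# Supervised step: train on the original features plus the cluster-ID features.
X_augmented = np.hstack([X, cluster_onehot])
clf = LogisticRegression(max_iter=1000).fit(X_augmented, y)
print("training accuracy with cluster features:", round(clf.score(X_augmented, y), 3))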

Advantages of Clustering Before Supervised Learning

 Handling Imbalanced Data:


Clustering can help rebalance datasets by identifying natural groupings,
which can be used to stratify training samples or augment minority classes.
 Dealing with High-Dimensionality:
Clustering captures important patterns and reduces the complexity of the
dataset, making it more suitable for algorithms that struggle with high-
dimensional data.
 Creating New Informative Features:
Cluster assignments often reveal hidden structures or similarities that raw
features alone cannot express.
 Improving Generalization:
By recognizing local patterns and grouping similar instances, clustering can
help supervised models avoid overfitting and generalize better to unseen
data.
