ML Unit-4
(AUTONOMOUS)
Siddharth Nagar, Narayanavanam Road – 517583
Subject with Code: Machine Learning (20CS0535) Course & Branch: B.Tech - CSE
Regulation: R20 Year & Sem: III-B.Tech & II-Sem
1a. Define and explain non-parametric methods. [L1] [4M] [CO3]
Non-parametric methods in machine learning are algorithms that do not make strong assumptions about the underlying shape or distribution of the data. Instead, they learn patterns directly from the training data, which makes them highly flexible and powerful. Typical examples are k-Nearest Neighbors, decision trees, and kernel density estimation, where the model's complexity grows with the amount of training data rather than being fixed in advance.
1b. List out advantages and limitations of non-parametric methods in ML. [L2] [8M] [CO3]
Non-parametric methods in machine learning are statistical techniques that do not assume any specific distribution or form for the data. These methods are useful when we have limited information about the data's structure, and they offer flexibility in real-world applications.
Advantages:
1. No distributional assumptions: they can be applied even when the data does not follow a known distribution (e.g., normality).
2. Flexibility: the model structure is driven by the data itself, so complex, non-linear patterns can be captured.
3. Simplicity and robustness: many non-parametric techniques are simple to apply, less sensitive to outliers, and some work directly with ordinal or ranked data.
Limitations:
1. Lower Sensitivity:
When parametric method assumptions are valid, non-parametric methods may miss subtle patterns or differences.
2. Limited Use of Data Information:
Some methods only use partial information, such as the direction of change, rather
than exact values (e.g., sign test).
3. Reduced Efficiency:
These methods often need larger datasets to achieve the same results as parametric
methods.
➤ Example: A sign test may require 100 samples, while a t-test needs only 60 for
similar outcomes.
Non-parametric methods are best used when the data does not meet the conditions required for parametric methods. While they offer flexibility and simplicity, they may sacrifice some statistical power and efficiency when the parametric assumptions actually hold.
2a. State and explain non-parametric density estimation. [L1] [6M] [CO3]
Definition:
Non-parametric density estimation estimates the probability density function of the data directly from the training samples, without assuming that the data follows a particular parametric form (such as a Gaussian). There are four main non-parametric density estimation methods commonly used in statistics and machine learning:
1. Histogram Estimator
2. Naive Estimator
3. Kernel Density Estimator (KDE)
4. K-Nearest Neighbor Estimator (KNN Estimator)
1. Histogram Estimator
This is the oldest and most popular method for estimating probability density.
The data range is divided into equal-sized intervals called bins.
Given a training dataset X = {x^t}, t = 1, …, N, an origin x_0, and bin width h, the histogram density estimate at a point depends on the number of training samples in the same bin as that point.
The density estimate is proportional to the count of data points within each bin.
The choice of origin x_0 affects the estimate near the edges or boundaries of the bins.
\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{N h}
2. Naive Estimator
Unlike the histogram, the naive estimator does not fix an origin.
It estimates density based on neighboring training samples around each
point.
For a given training set X = {x^t}, t = 1, …, N, and bin width h, the naive estimator counts the samples that lie within h/2 to the left and h/2 to the right of the target point x.
This method is simpler and does not rely on the position of fixed bins.
The samples within h/2 on either side of x contribute to the density estimate:

\hat{p}(x) = \frac{\#\{x - h/2 < x^t \le x + h/2\}}{N h}
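As an illustration only, here is a minimal NumPy sketch of the naive estimator just described; the data points and the bandwidth h are made up for the example.

```python
import numpy as np

def naive_estimate(x, data, h):
    """Naive density estimate: fraction of samples within h/2 of x, scaled by 1/h."""
    data = np.asarray(data, dtype=float)
    count = np.sum(np.abs(data - x) < h / 2.0)   # samples in (x - h/2, x + h/2)
    return count / (len(data) * h)

# Made-up data and bandwidth, purely for illustration
data = [2, 3, 5, 6, 7, 8, 9]
print(naive_estimate(6.0, data, h=2.0))  # estimated density around x = 6
```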
3. Kernel Density Estimator (KDE)
Instead of counting samples in hard bins, KDE places a smooth kernel (weight) function, usually a Gaussian, on every training sample and sums these contributions, giving a smooth estimate:

\hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)

where K(·) is the kernel function and h is the smoothing bandwidth.
4. K-Nearest Neighbor Estimator (KNN Estimator)
Unlike the previous methods, which fix the bin width h, this estimator fixes the number of nearest neighbors k. The density at a sample depends on k and on the distance of the k-th nearest neighbor from the sample, so it can be viewed as a kernel estimator with an adaptive bandwidth. The K-NN density estimate is

\hat{p}(x) = \frac{k}{2 N d_k(x)}

where d_k(x) is the Euclidean distance from the sample x to its k-th nearest neighbor.
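A minimal NumPy sketch of the K-NN density estimate p̂(x) = k / (2·N·d_k(x)) described above; the sample data and the value of k are chosen only for illustration.

```python
import numpy as np

def knn_density(x, data, k):
    """K-NN density estimate: p(x) = k / (2 * N * d_k(x)),
    where d_k(x) is the distance from x to its k-th nearest sample."""
    data = np.asarray(data, dtype=float)
    distances = np.sort(np.abs(data - x))   # distances to all samples, ascending
    d_k = distances[k - 1]                  # distance to the k-th nearest neighbour
    return k / (2.0 * len(data) * d_k)

# Illustrative data and k (not taken from the notes)
data = [2, 3, 5, 6, 7, 8, 9]
print(knn_density(6.0, data, k=3))
```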
2b. Explain the histogram estimator with a simple example. [L2] [6M] [CO3]
Histogram Estimator
A Histogram Estimator is a non-parametric method used to estimate the probability
density function (PDF) of a continuous random variable based on observed data. It
works by dividing the data range into equal-sized intervals (called bins) and
counting how many data points fall into each bin. The height of each bin
(normalized count) gives an estimate of the density in that region.
How it Works:
1. Divide the data range into bins: split the range of values into non-overlapping intervals (bins) of equal width h.
2. Count the number of data points in each bin: for each bin, count how many data points fall within it.
3. Normalize: the density estimate for each bin is count / (N·h), where N is the total number of data points.
Simple Example:
Suppose we have a small dataset of N = 7 points:
{2, 3, 5, 6, 7, 8, 9}
Step-by-step (with bin width h = 2 and origin x_0 = 2, chosen for illustration):
Bin [2, 4): contains {2, 3} → count = 2 → density = 2 / (7 × 2) ≈ 0.143
Bin [4, 6): contains {5} → count = 1 → density = 1 / 14 ≈ 0.071
Bin [6, 8): contains {6, 7} → count = 2 → density ≈ 0.143
Bin [8, 10): contains {8, 9} → count = 2 → density ≈ 0.143
Interpretation:
The histogram estimator gives a stepwise approximation of the PDF.
Areas with more data points have higher estimated density.
It’s a simple way to visualize and understand the distribution of data.
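The following short Python sketch reproduces the worked example above; the bin width h = 2 and origin x_0 = 2 were assumed for illustration, and the code only shows the counting and normalisation steps.

```python
import numpy as np

def histogram_estimate(data, x0, h, n_bins):
    """Histogram density estimate: count per bin divided by N * h."""
    data = np.asarray(data, dtype=float)
    N = len(data)
    edges = x0 + h * np.arange(n_bins + 1)          # bin edges [x0, x0+h, ...]
    counts, _ = np.histogram(data, bins=edges)      # samples per bin
    densities = counts / (N * h)                    # normalised bin heights
    return edges, counts, densities

data = [2, 3, 5, 6, 7, 8, 9]
edges, counts, densities = histogram_estimate(data, x0=2, h=2, n_bins=4)
for i in range(len(counts)):
    print(f"Bin [{edges[i]:.0f}, {edges[i+1]:.0f}): count={counts[i]}, density={densities[i]:.3f}")
```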
Suppose we have a new data point and we need to assign it to one of the existing categories. K-Nearest Neighbors (K-NN) does this by finding the K training points closest to the new point and assigning the majority class among those neighbours.
Here are some important points to consider when choosing the value of K in the K-
Nearest Neighbors (K-NN) algorithm:
There is no fixed rule to determine the best value for K; it often requires
experimenting with different values.
The most commonly used and preferred value of K is 5.
A small value of K (e.g., K = 1 or 2) can make the model sensitive to noise
and outliers.
A larger value of K can improve stability, but it may also lead to less distinct classifications and may include irrelevant points in the prediction (see the sketch below).
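To illustrate the effect of K, here is a minimal scikit-learn sketch; the toy data points and labels are invented for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny invented 2-D dataset with two classes (0 and 1)
X = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6], [2, 2], [7, 5]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

new_point = np.array([[3, 3]])   # the point we want to classify

for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X, y)
    print(f"K={k}: predicted class = {model.predict(new_point)[0]}")
```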
3b. Express non-parametric classification with an example. [L6] [6M] [CO3]
Non-parametric classification methods (such as K-NN and decision trees) do not assume a fixed functional form for the decision boundary; the classifier is built directly from the training examples.
Characteristics:
No fixed number of parameters: The model complexity grows with the size
of the training data.
Flexible decision boundaries: Can adapt to more complex data distributions.
Memory-based: Often relies on storing the training data (e.g., instance-based
learning).
Example: Loan Approval with a Decision Tree
Suppose a decision tree is trained on applicant records with the attributes Income, Credit History, and Student, and the target Loan Approved (Yes/No). The learned tree first tests Credit History at the root; on the "Good" branch it then tests Income.
Prediction Example:
Suppose a new applicant has:
Income: Low
Credit History: Good
Student: Yes
Prediction Path:
Credit History = Good → go left
Income = Low → classify as No
So, Loan is not approved.
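The prediction path above can be written as a small rule-based function. This is just a hand-coded sketch of the tree described in the example, not a trained model; the behaviour of the branches not mentioned in the notes is assumed for illustration.

```python
def predict_loan(income, credit_history, student):
    """Hand-coded version of the decision tree in the example:
    root tests Credit History, then Income on the 'Good' branch."""
    if credit_history == "Good":
        if income == "Low":
            return "No"      # loan not approved (matches the example path)
        return "Yes"         # assumed outcome for other incomes (illustrative)
    return "No"              # assumed default for the other branch (illustrative)

print(predict_loan(income="Low", credit_history="Good", student="Yes"))  # -> "No"
```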
3. Model-Based RL
In model-based RL, an agent may use stored transitions to model the
environment.
Condensed Nearest Neighbour (CNN) can ensure the learned model is based on a compact yet representative subset of experience.
Advantages in RL:
Memory Efficiency: Fewer transitions stored.
Faster Learning: Reduces computation in instance-based methods.
Noise Reduction: Avoids overfitting to redundant data.
Limitations:
Greedy algorithm: may discard useful rare cases.
High-dimensionality issues: less effective with continuous, high-dimensional state spaces.
Information-loss risk: discarded transitions may carry information that matters for long-term dependencies in RL (a minimal CNN sketch follows this list).
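A minimal sketch of the Condensed Nearest Neighbour (CNN) idea referred to above: keep only a subset of stored samples that still classifies the remaining data correctly with 1-NN. The data is invented and the loop is the classic greedy pass, not an RL-specific implementation.

```python
import numpy as np

def condense(X, y):
    """Greedy CNN: start with one sample and add any sample that the
    current store misclassifies with 1-NN, until a full pass adds nothing."""
    store_idx = [0]                       # start with the first sample
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in store_idx:
                continue
            # 1-NN prediction using only the stored samples
            dists = np.linalg.norm(X[store_idx] - X[i], axis=1)
            nearest = store_idx[int(np.argmin(dists))]
            if y[nearest] != y[i]:        # misclassified -> keep this sample
                store_idx.append(i)
                changed = True
    return np.array(store_idx)

# Invented toy data
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [1, 1], [6, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])
kept = condense(X, y)
print("Kept", len(kept), "of", len(X), "samples:", kept)
```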
5a. List out the applications of KNN in machine learning. [L1] [6M] [CO6]
1. Classification Tasks
Image classification: Assigning labels to images (e.g., handwritten digit
recognition like MNIST).
Text categorization: Classifying news articles, emails (spam vs. non-spam),
or documents.
Medical diagnosis: Predicting diseases based on symptoms or test results.
2. Regression Tasks
House price prediction: Estimating property prices based on features like
location, size, etc.
Weather forecasting: Predicting temperature, humidity, etc., based on
historical data.
Stock price estimation: Approximating future prices using nearest historical
patterns.
3. Recommendation Systems
Product recommendations: Suggesting items based on the preferences of
similar users (collaborative filtering).
Movie or music recommendations: Based on user behaviour similarity.
4. Anomaly Detection
Detecting outliers in network intrusion, fraud detection, or sensor data by
observing data points that are far from their neighbours.
5. Pattern Recognition
Face recognition: Matching facial features to known identities.
Speech recognition: Matching audio patterns to words or phonemes.
6. Image Processing
Content-based image retrieval (CBIR): Finding similar images from a
database.
Object detection: Classifying different parts of an image.
7. Recommender Systems in E-commerce
Suggesting products to users by comparing with similar users' preferences or
browsing history.
8. Bioinformatics
Classifying genes or proteins based on sequence similarity.
Predicting biological function of unknown genes using labelled data.
9. Customer Segmentation
Grouping customers based on purchasing behaviour or demographics.
10. Credit Scoring & Risk Analysis
Predicting loan default or creditworthiness based on similarity to past
customers.
5b. Distinguish between parametric and non-parametric classification. [L4] [6M] [CO4]
Parametric classifiers simplify the learning problem by assuming a specific model structure with a finite, fixed set of parameters (e.g., logistic regression, Naive Bayes). They are efficient to train and need less data, but are less flexible when the assumed form does not match the data.
Non-parametric classifiers (e.g., K-NN, decision trees) make no fixed assumption about the form of the decision boundary; the model grows with the training data. They are better at capturing complex relationships, but they require more data, more memory, and more computation at prediction time.
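As a concrete (illustrative) contrast, the sketch below trains a parametric classifier (logistic regression, with a small fixed set of coefficients) and a non-parametric one (K-NN, which keeps the whole training set) on the same toy data; the data is generated only for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Toy two-class data (invented): class 0 around (0, 0), class 1 around (3, 3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Parametric: a fixed number of parameters (one weight per feature + intercept)
param_model = LogisticRegression().fit(X, y)
print("Logistic regression parameters:", param_model.coef_, param_model.intercept_)

# Non-parametric: no fixed parameter vector; the training samples themselves are the model
nonparam_model = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("K-NN stores all", len(X), "training samples for prediction")

new_point = [[1.5, 1.5]]
print("Parametric prediction:", param_model.predict(new_point)[0])
print("Non-parametric prediction:", nonparam_model.predict(new_point)[0])
```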
6. Discuss the following terms: i) Principal Component Analysis ii) Factor Analysis [L2] [12M] [CO5]
i. Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning. It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. The new transformed features are called the principal components. PCA is a popular tool for exploratory data analysis and predictive modeling, and it extracts strong patterns from a dataset by projecting it onto the directions of maximum variance.
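To make the orthogonal-transformation idea concrete, here is a minimal NumPy sketch of PCA on invented data (centre the data, eigendecompose the covariance matrix, project onto the top component); in practice one would typically use a library implementation such as sklearn.decomposition.PCA.

```python
import numpy as np

rng = np.random.default_rng(42)
# Invented, correlated 2-D data
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

# 1. Centre the data
X_centred = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition
cov = np.cov(X_centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order

# 3. Sort components by decreasing variance (eigenvalue)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the first principal component (2-D -> 1-D)
X_reduced = X_centred @ eigvecs[:, :1]

print("Explained variance ratio of PC1:", eigvals[0] / eigvals.sum())
print("Reduced data shape:", X_reduced.shape)
```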
Definition:
Feature Analysis in machine learning refers to the process of examining and
selecting the most relevant input variables (features) that contribute to building an
effective predictive model. It involves understanding the role, significance, and
quality of features in relation to the target variable or output.
2. Feature Extraction:
Creating new features by transforming or combining existing ones.
Principal Component Analysis (PCA): Reduces dimensionality by creating
new orthogonal features.
Linear Discriminant Analysis (LDA): Maximizes class separability.
Autoencoders: Neural network-based technique for learning compressed
representations.
3. Feature Engineering:
Manually creating new features from raw data to improve model performance. Examples include the following (a small sketch follows this list):
Creating interaction terms (e.g., multiplying two features).
Binning numerical variables.
Extracting date/time features (like day of week, month).
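A small pandas sketch of the feature-engineering examples listed above (interaction term, binning, date/time extraction); the column names and values are invented for illustration.

```python
import pandas as pd

# Invented raw data
df = pd.DataFrame({
    "income": [25000, 54000, 83000, 120000],
    "loan_amount": [5000, 20000, 15000, 40000],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-03-17", "2023-07-30", "2023-11-02"]),
})

# Interaction term: multiply two features
df["income_x_loan"] = df["income"] * df["loan_amount"]

# Binning a numerical variable into categories
df["income_band"] = pd.cut(df["income"], bins=[0, 40000, 90000, float("inf")],
                           labels=["low", "medium", "high"])

# Extracting date/time features
df["signup_month"] = df["signup_date"].dt.month
df["signup_dayofweek"] = df["signup_date"].dt.dayofweek

print(df)
```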
Feature selection involves selecting a subset of relevant features from the original
dataset without transforming them. This helps in removing irrelevant, redundant,
or less informative features.
Common sub-techniques include:
Missing Value Ratio: Removes features with too many missing values.
Low Variance Filter: Removes features that show little variation across
records.
High Correlation Filter: Removes one of two highly correlated features to
avoid redundancy.
Random Forest: Uses feature importance scores from a tree-based model.
Backward Feature Elimination: Starts with all features and removes one at a
time based on performance.
Forward Feature Selection: Starts with none and adds features that improve
performance.
Goal: Keep only the most informative features from the original dataset.
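An illustrative sketch of two of the filters above (Low Variance Filter and High Correlation Filter) using pandas; the thresholds and the data are invented.

```python
import pandas as pd

# Invented dataset: f3 is nearly constant, f4 is highly correlated with f1
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [10.0, 8.0, 12.0, 7.0, 11.0],
    "f3": [0.5, 0.5, 0.5, 0.5, 0.51],
    "f4": [2.1, 4.0, 6.1, 8.0, 10.2],
})

# Low Variance Filter: drop features whose variance is below a threshold
variances = df.var()
low_var = variances[variances < 0.01].index.tolist()

# High Correlation Filter: for each highly correlated pair, drop the second feature
corr = df.corr().abs()
high_corr = [b for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.95]

to_drop = sorted(set(low_var) | set(high_corr))
print("Dropping:", to_drop)
print(df.drop(columns=to_drop))
```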
B. Projection-Based Techniques:
These are mostly used for visualization and non-linear dimensionality reduction,
often for exploratory data analysis.
ISOMAP: Preserves geodesic distances between points.
t-SNE (t-distributed Stochastic Neighbor Embedding): Maps high-
dimensional data into 2D or 3D for visualization, preserving local structure.
UMAP (Uniform Manifold Approximation and Projection): Similar to t-SNE
but faster and better at preserving global structure.
Goal: Reduce dimensions through transformation, often for improved clustering,
visualization, or model efficiency.
In many classification tasks, when we try to separate classes using a single feature,
the result might include overlapping between classes. This leads to poor
classification accuracy.
Example: Suppose we are classifying two classes using a 2D feature space (X and Y axes). Looking at either feature on its own, the two classes overlap, so neither single axis separates them cleanly.
LDA transforms the feature space to a new axis that maximizes the distance
between the means of the classes and minimizes the spread within each
class.
As a result, LDA can project a 2D or 3D feature space down to a 1D axis (in general, to at most C − 1 dimensions for C classes), enabling better class separation (see the sketch below).
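A minimal scikit-learn sketch of the projection described above: LDA reduces invented 2-D, two-class data to a single discriminant axis.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Invented two-class 2-D data that overlaps along each single axis
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([2, 2], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Project onto at most C - 1 = 1 discriminant axis
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)

print("Original shape:", X.shape, "-> projected shape:", X_1d.shape)
print("Class means on the LDA axis:", X_1d[y == 0].mean(), X_1d[y == 1].mean())
```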
1. Face Recognition
LDA is extensively used in facial recognition systems.
It helps reduce the high-dimensional space of image pixels into a smaller
space while maintaining the differences between different faces (classes).
For example, the "Fisher faces" method uses LDA to improve upon PCA in
recognizing human faces under varying lighting and expression conditions.
2. Medical Diagnosis
In healthcare, LDA is used to classify patients into disease vs. healthy
categories based on symptoms or diagnostic parameters (e.g., blood pressure,
sugar levels, etc.).
Example: Classifying tumor types (benign or malignant) using labeled
biological data.
5. Speech Recognition
In speech processing, LDA is used to classify audio signals into phoneme or
word classes.
It enhances recognition by reducing noisy and redundant features, improving
the model’s efficiency.
10a. Summarize the following terms: i) Distances ii) Euclidean distance iii) Metrics [L2] [6M] [CO3]
i) Distances
A distance (or distance measure) quantifies how far apart two data points are in the feature space; many machine learning algorithms (e.g., K-NN, K-Means) rely on such measures to decide which points are similar.
ii) Euclidean Distance
The Euclidean distance is the straight-line distance between two points. In 2D, for points (x1, y1) and (x2, y2):

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}

If the two points (x1, y1, z1) and (x2, y2, z2) are in a 3-dimensional space, the Euclidean distance between them is:

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}
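A tiny NumPy sketch computing the Euclidean distance in 2D and 3D for made-up points, with the Manhattan distance shown for comparison.

```python
import numpy as np

def euclidean(p, q):
    """Straight-line (L2) distance between two points of any dimension."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(p - q))

# Made-up points for illustration
print(euclidean((1, 2), (4, 6)))        # 2-D: sqrt(3^2 + 4^2) = 5.0
print(euclidean((1, 2, 3), (4, 6, 8)))  # 3-D: sqrt(9 + 16 + 25) ≈ 7.07
print(manhattan((1, 2), (4, 6)))        # 3 + 4 = 7
```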
iii) Metrics
Purpose:
Metrics provide insights into a model's ability to predict outcomes, generalize to
new data, and make accurate classifications or regressions.
Types:
Different metrics are used for different types of machine learning tasks, including:
o Classification: Accuracy, Precision, Recall, F1-score, ROC AUC.
o Regression: Mean Squared Error (MSE), Mean Absolute Error
(MAE), R-squared.
o Clustering: Silhouette score, Davies-Bouldin index.
Examples:
1. Accuracy: Measures the overall correctness of predictions.
2. Precision: Measures the proportion of true positive predictions among all
positive predictions.
3. Recall: Measures the proportion of true positive predictions among all actual
positive cases.
4. F1-score: The harmonic mean of precision and recall, providing a balanced
measure.
5. MAE: The average absolute difference between predicted and actual values.
6. MSE: The average of the squared differences between predicted and actual
values.
7. R-squared: Measures the proportion of variance in the dependent variable
that is predictable from the independent variable(s).
8. ROC AUC: Measures the area under the receiver operating characteristic
curve, used for evaluating classification models.
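An illustrative scikit-learn sketch computing a few of the classification and regression metrics listed above on small invented prediction vectors.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error,
                             r2_score)

# Invented classification results (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))

# Invented regression results
y_true_r = [3.0, 5.0, 2.5, 7.0]
y_pred_r = [2.8, 5.4, 2.0, 6.5]

print("MAE:", mean_absolute_error(y_true_r, y_pred_r))
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
print("R^2:", r2_score(y_true_r, y_pred_r))
```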
10b. Analyze supervised learning after clustering. [L4] [6M] [CO4]
Supervised Learning After Clustering