AIML-Unit 4 Notes-Assignment 4
UNIT-IV
Basic Methods in Supervised Learning: Distance-based methods, Nearest-Neighbors, Decision
Trees, Support Vector Machines, Nonlinearity and Kernel Methods.
Unsupervised Learning: Clustering, K-means, Dimensionality Reduction, PCA and kernel.
=================================================================================
Supervised learning
Supervised learning is a type of machine learning where a model is trained on labelled data, meaning
the input data is paired with the correct output or label. The goal is for the model to learn the
mapping from inputs to outputs, enabling it to make accurate predictions on new, unseen data.
Common algorithms include linear regression, decision trees, and neural networks. Supervised
learning is used for tasks like classification (predicting categories) and regression (predicting
continuous values). The effectiveness of the model depends on the quality and quantity of the
labelled data used during training.
Distance-based methods
Distance-based methods in supervised learning are techniques that classify data based on the
similarity or distance between data points. These methods use distance metrics, such as Euclidean or
Manhattan distance, to measure how close an unknown data point is to known labelled instances.
Popular algorithms like k-Nearest Neighbours (k-NN) and Support Vector Machines (SVM) rely on
these distances to make predictions. In k-NN, the class of the majority of the nearest neighbours
determines the class of the new data point. These methods are simple, intuitive, and effective for
problems where spatial proximity is a strong indicator of classification.
Distance-based methods in machine learning rely on measuring the distance between data
points to make predictions.
These methods are used in both classification and regression problems.
Cosine similarity is a metric used to measure how similar two vectors (or data points) are, based
on their orientation, regardless of their magnitude.
It is commonly used in text analysis, information retrieval, and machine learning, especially
when comparing documents or words in high-dimensional spaces.
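A minimal NumPy sketch of cosine similarity; the two vectors below are made-up word-count vectors, used only for illustration:

import numpy as np

# Hypothetical word-count vectors for two short documents
a = np.array([3, 1, 0, 2])
b = np.array([1, 0, 0, 5])

# Cosine similarity = dot product divided by the product of the vector magnitudes
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)   # near 1: similar orientation; near 0: unrelated (orthogonal) vectors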
Hamming distance between two strings or vectors of equal length is the number of positions at
which the corresponding symbols are different.
In other words, it measures the minimum number of substitutions required to change one string
into the other, or equivalently, the minimum number of errors that could have transformed one
string into the other.
In a more general context, the Hamming distance is one of several string metrics for measuring
the edit distance between two sequences.
It is named after the American mathematician Richard Hamming.
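A small Python sketch of the Hamming distance; the example strings are standard illustrative pairs and carry no special meaning:

def hamming_distance(s1, s2):
    # Hamming distance is only defined for sequences of equal length
    if len(s1) != len(s2):
        raise ValueError("strings must be of equal length")
    # Count the positions at which the corresponding symbols differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("karolin", "kathrin"))   # 3
print(hamming_distance("1011101", "1001001"))   # 2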
Mahalanobis distance is the distance between a point and a distribution, not between two distinct points.
It is effectively a multivariate equivalent of the Euclidean distance.
It was introduced by Prof. P. C. Mahalanobis in 1936 and has been used in various statistical
applications ever since.
However, it is not as well known or as widely used in machine learning practice.
How is Mahalanobis distance different from Euclidean distance?
• It transforms the columns into uncorrelated variables,
• scales the columns so that their variance equals 1, and
• finally computes the Euclidean distance in this transformed space.
The Mahalanobis distance is computed as:
D^2 = (x − m)^T · C^(-1) · (x − m)
where,
• D^2 is the square of the Mahalanobis distance,
• x is the vector of the observation (a row in the dataset),
• m is the vector of mean values of the independent variables (the mean of each column),
• C^(-1) is the inverse covariance matrix of the independent variables.
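A minimal NumPy sketch of the formula above; the small dataset X and the observation x are hypothetical numbers chosen only to make the example run:

import numpy as np

# Hypothetical dataset: each row is an observation, each column an independent variable
X = np.array([[64.0, 580.0, 29.0],
              [66.0, 570.0, 33.0],
              [68.0, 590.0, 37.0],
              [69.0, 660.0, 46.0],
              [73.0, 600.0, 55.0]])

x = np.array([66.0, 640.0, 44.0])                 # observation to measure
m = X.mean(axis=0)                                # vector of column means
C_inv = np.linalg.inv(np.cov(X, rowvar=False))    # inverse covariance matrix

# D^2 = (x - m)^T · C^(-1) · (x - m)
d_squared = (x - m) @ C_inv @ (x - m)
print(np.sqrt(d_squared))                         # Mahalanobis distance D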
Nearest Neighbours
Nearest neighbors is a concept used in machine learning, statistics, and computer science, referring
to the idea of finding data points that are most similar to a given point within a dataset. It's often
used for classification, regression, and clustering problems. The key ideas around nearest neighbors:
1. K-Nearest Neighbors (KNN):
o A popular algorithm based on the nearest neighbors concept, where K is a parameter
that defines how many neighbors to consider when making a prediction for a new data
point.
o Classification: The algorithm assigns the class most common among the K nearest data
points to the new data point.
o Regression: The algorithm averages the target values of the K nearest neighbors to
predict a value for the new point.
2. Distance Metrics:
o The similarity between data points is typically measured using a distance metric, such as
Euclidean distance, Manhattan distance, or Minkowski distance, depending on the
problem and the nature of the data.
3. Curse of Dimensionality:
o As the number of features increases, the concept of "nearness" becomes less
meaningful. In high-dimensional spaces, all points can appear almost equally distant
from one another, reducing the accuracy of nearest-neighbor predictions.
4. Choice of K:
o The value of K affects the model's performance. A small K can make the model sensitive
to noise, while a large K can smooth out predictions but potentially overlook small
patterns or details.
5. Efficiency:
o For large datasets, finding the nearest neighbors can be computationally expensive,
especially in high-dimensional spaces. Techniques like KD-trees, Ball Trees, or
Approximate Nearest Neighbors (ANN) are used to optimize search times.
6. Applications:
o Nearest neighbors are widely used in recommendation systems, image recognition,
anomaly detection, and other areas where similar data points provide meaningful
insights or predictions.
Nearest neighbors is a foundational method in machine learning, particularly useful in scenarios
where similarity between data points can help make decisions or predictions.
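A minimal k-NN classification sketch, assuming scikit-learn is available; the Iris dataset, the 70/30 split, and K = 5 are illustrative choices only:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labelled data, split into training points and unseen test points
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# K = 5: each test point gets the majority class among its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # classification accuracy on the unseen data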
Support Vector Machines (SVM)
1. Basic Concept:
Classification: SVMs are primarily used for classification tasks. They aim to find the optimal
hyperplane (a decision boundary) that best separates different classes in the feature space.
Regression: In regression (SVR), SVMs try to predict a continuous value by fitting the best
hyperplane while allowing for some margin of error.
2. Key Components:
Support Vectors: The data points that are closest to the hyperplane and play a critical role in
defining the optimal boundary. These points are used to create the decision boundary.
Hyperplane: A hyperplane is the boundary that divides the data into two classes. In 2D this is
a line, in 3D a plane, and in higher dimensions a hyperplane. SVM aims to maximize the
margin between classes while ensuring the data points are correctly classified.
3. Margin Maximization:
The goal of SVM is to maximize the margin between the hyperplane and the support vectors.
A larger margin is associated with better generalization, meaning the model is less likely to
overfit.
4. Linear vs. Nonlinear SVM:
Linear SVM: When the data is linearly separable (i.e., a straight line or hyperplane can
separate the classes), a linear SVM can be used.
Nonlinear SVM: If the data isn't linearly separable, SVM uses a technique called the kernel
trick to map the data into a higher-dimensional space where a linear hyperplane can
separate the classes. Common kernels include:
o Polynomial kernel
o Radial Basis Function (RBF) kernel
o Sigmoid kernel
5. Kernel Trick:
The kernel trick enables SVM to operate in higher-dimensional spaces without explicitly
computing the transformation, making it efficient even for complex, non-linear
relationships.
6. Cost Parameter (C):
The parameter C controls the trade-off between maximizing the margin and minimizing
classification errors. A high C value puts more emphasis on minimizing errors (which can lead
to overfitting), while a lower C permits a wider margin at the cost of some misclassification
(which can improve generalization).
7. Advantages:
Effective in high-dimensional spaces: SVM is effective in situations where there are many
features (high-dimensional data).
Robust to overfitting: Especially when using the correct kernel and regularization
parameters.
Works well for both linear and non-linear data: Thanks to the kernel trick, SVM can handle
complex data distributions.
Unique solution: The optimization problem in SVM has a unique global solution.
8. Disadvantages:
Computationally expensive: SVM can be slow to train, especially with large datasets or high-
dimensional data, as it involves solving a complex quadratic optimization problem.
Sensitive to parameter tuning: Choosing the right kernel, the parameter C, and the kernel-
specific parameters (like the RBF gamma) can significantly affect performance.
Not ideal for large datasets: Because of its computational complexity, SVM may not scale
well with very large datasets.
9. Applications:
SVM is used in various applications, including:
o Image classification: Recognizing objects in images.
o Text classification: Spam filtering, sentiment analysis.
o Bioinformatics: Gene classification, disease prediction.
o Handwriting recognition and other pattern recognition tasks.
Support Vector Machines are a powerful classification and regression tool that works by finding the
optimal hyperplane to separate classes with the largest possible margin. The use of kernels allows
SVM to handle both linear and non-linear problems, making it versatile, though it can be
computationally intensive and sensitive to parameter selection.
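A minimal SVM classification sketch, assuming scikit-learn is available; the breast-cancer dataset and the values C = 1.0 and gamma = "scale" are illustrative choices, and the features are standardized because SVM decisions depend on distances:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Labelled data for a binary classification task
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF-kernel SVM: C trades margin width against training errors,
# gamma controls how far the influence of a single training point reaches
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out test set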
Nonlinearity and Kernel Methods
Nonlinearity and kernel methods are key concepts in machine learning, especially when dealing with
complex data that cannot be separated or modelled well using linear methods. The core idea is to
transform such data into a space where linear techniques become applicable.
Each of these supervised learning methods has its strengths and weaknesses, and the choice of
which to use depends on the dataset and problem at hand. Here's a quick recap of their applications:
Distance-based Methods (e.g., k-NN): Good for small datasets and classification/regression
based on closeness of data points.
Decision Trees: Easy to interpret, but can overfit and be unstable.
Support Vector Machines: Effective for both linear and non-linear problems, but
computationally expensive.
Kernel Methods: Used to deal with non-linear data by transforming the feature space, but may
be tricky to tune.
ML algorithms vary in complexity and suitability based on the structure of the data.
Each method provides different strengths in handling tasks like classification and regression.
An important part of learning is understanding when to apply each method, based on the problem
you are trying to solve.
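A small sketch of why the kernel trick matters, assuming scikit-learn; make_circles generates two concentric rings, a pattern no straight line can separate, and the gamma value is an illustrative choice:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # roughly chance level
print("RBF kernel accuracy:", rbf_svm.score(X, y))         # much higher, close to 1.0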
Unsupervised Learning
Unsupervised learning is a type of machine learning where the model is trained on unlabelled
data, meaning the input data does not come with corresponding target labels or outcomes. The
goal of unsupervised learning is to identify hidden patterns, structures, or relationships in the
data.
Unlike supervised learning, where the algorithm learns from labelled data to make predictions,
unsupervised learning focuses on discovering inherent patterns within the data itself.
Common tasks in unsupervised learning include clustering, where the algorithm groups similar
data points together (e.g., customer segmentation in marketing), and dimensionality reduction,
which aims to reduce the number of features in the data while retaining important information
(e.g., Principal Component Analysis, or PCA).
Other techniques in unsupervised learning include anomaly detection, which identifies unusual
or outlier data points, and association rule learning, which finds interesting relationships
between variables (e.g., market basket analysis).
One of the main challenges in unsupervised learning is the absence of labelled data, making it
harder to evaluate the performance of the model and interpret the results.
Popular algorithms for unsupervised learning include K-means clustering, Hierarchical
clustering, Gaussian Mixture Models (GMM), and Self-Organizing Maps (SOM) for clustering,
and PCA and t-SNE for dimensionality reduction.
Unsupervised learning is widely used in areas like customer behavior analysis, image and speech
recognition, anomaly detection in security, and natural language processing.
The flexibility and ability to work with unlabelled data make unsupervised learning especially
valuable in real-world scenarios where obtaining labelled data is costly, time-consuming, or
impractical.
Clustering
Clustering is an unsupervised machine learning technique that involves grouping similar data
points together based on certain characteristics or features, without the need for labeled data.
The goal of clustering is to organize a dataset into subsets, or clusters, where data points within
each cluster are more similar to each other than to those in other clusters.
Clustering is widely used in various fields, including customer segmentation, image recognition,
anomaly detection, and market research.
There are several popular clustering algorithms, such as K-means, which divides data into a
predefined number of clusters by minimizing the variance within each cluster, and Hierarchical
clustering, which builds a tree-like structure of nested clusters, often used for hierarchical
relationships.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is another popular
method that groups together points that are close to each other based on a distance metric, and
it is especially effective at handling noise and outliers.
Gaussian Mixture Models (GMM) assume that data points are generated from a mixture of
several Gaussian distributions and can capture more complex, non-spherical cluster shapes.
The choice of algorithm depends on the data type, the distribution of the data, and the desired
outcome.
Key challenges in clustering include determining the optimal number of clusters and selecting
appropriate distance metrics for measuring similarity.
Evaluation of clustering performance can be difficult due to the lack of ground truth labels, but
methods like silhouette scores and Davies-Bouldin index can provide some insights into the
quality of the clusters.
Clustering is a powerful tool for uncovering hidden patterns in data, making it valuable for
exploratory data analysis and uncovering natural groupings within complex datasets.
Example:
In customer segmentation, a business might want to group customers based on their purchasing
behavior.
Clustering algorithms can automatically identify groups like high spenders, low spenders,
frequent buyers, etc., without needing predefined labels for each customer.
Key Algorithms:
• K-Means
• Hierarchical Clustering
• DBSCAN
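A short clustering sketch, assuming scikit-learn; the half-moon data and the DBSCAN parameters eps = 0.2 and min_samples = 5 are illustrative choices:

import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape K-means struggles with but DBSCAN handles well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-means cluster labels:", np.unique(kmeans_labels))
print("DBSCAN cluster labels (-1 marks noise):", np.unique(dbscan_labels))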
K-Means Clustering
K-means clustering is a widely used unsupervised machine learning algorithm that groups data
points into a predefined number of clusters based on their similarity.
The goal of K-means is to partition the data into K clusters, where each cluster contains points
that are more similar to each other than to those in other clusters.
The algorithm works through an iterative process:
1. Initialization: K initial centroids (one for each cluster) are chosen, either randomly or using
methods like K-means++ to improve convergence.
2. Assignment Step: Each data point is assigned to the nearest centroid based on a distance
metric, usually Euclidean distance.
3. Update Step: After the points are assigned to clusters, the centroids are recalculated as the
mean of all data points in each cluster.
4. Repeat: The assignment and update steps are repeated until the centroids no longer change
significantly, indicating convergence.
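A minimal NumPy sketch of the four steps above, written for illustration only (a library implementation such as scikit-learn's KMeans should be preferred in practice):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: label each point with its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids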
K-means is computationally efficient and works well when clusters are spherical and roughly
equal in size.
However, it has some limitations: it requires the number of clusters (K) to be predefined, it can
be sensitive to the initial placement of centroids (which may lead to different results on different
runs), and it may struggle with clusters of varying shapes or densities.
Additionally, K-means is sensitive to outliers, as they can distort the centroid calculation.
Despite these challenges, K-means is popular in tasks like customer segmentation, image
compression, and anomaly detection due to its simplicity and effectiveness in many scenarios.
It is important to evaluate the results using metrics like the silhouette score or elbow method to
determine the optimal number of clusters.
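A brief sketch of choosing K with the elbow method and the silhouette score, assuming scikit-learn; the synthetic blob data with four true centres is an illustrative choice:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Inertia keeps falling as K grows (look for the "elbow"); the silhouette score
# tends to peak near the true number of clusters
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 1), round(silhouette_score(X, model.labels_), 3))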
Dimensionality reduction
Dimensionality reduction is a technique used in machine learning and data analysis to reduce
the number of features (variables) in a dataset while preserving its essential structure and
information.
High-dimensional datasets can be computationally expensive to process, and visualizing or
interpreting them becomes challenging.
Dimensionality reduction helps by transforming the data into a lower-dimensional space, making
it easier to analyze and visualize.
Two common methods for dimensionality reduction are Principal Component Analysis (PCA)
and t-Distributed Stochastic Neighbor Embedding (t-SNE).
PCA identifies the directions (principal components) in which the data varies the most and
projects the data onto those directions, reducing the number of features while retaining the
most important variance.
t-SNE, on the other hand, is often used for visualizing high-dimensional data in two or three
dimensions, preserving local structures in the data.
Dimensionality reduction is beneficial for improving computational efficiency, reducing noise,
and preventing overfitting, especially in high-dimensional spaces where models may struggle to
generalize.
However, it comes with trade-offs, as reducing dimensions can sometimes lead to a loss of
important information.
It is widely used in applications such as image compression, feature extraction, and pre-
processing for machine learning models.
Example:
In an image dataset, each image might have thousands of pixels, making analysis slow.
Dimensionality reduction techniques can reduce the number of dimensions while still preserving
important patterns.
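A minimal PCA sketch, assuming scikit-learn; the 8x8 digits dataset stands in for the image example above, and the 95% variance threshold is an illustrative choice:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Each 8x8 digit image is a 64-dimensional feature vector
X, _ = load_digits(return_X_y=True)

# Keep just enough principal components to retain about 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)                  # 64 columns reduced to far fewer
print("variance retained:", pca.explained_variance_ratio_.sum())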
Kernel Methods
Kernel methods in unsupervised learning are powerful techniques that allow algorithms to
handle nonlinearity by mapping data into higher-dimensional spaces where complex patterns
can become linear and easier to analyze.
The core idea is to use a kernel function to compute the inner product between data points in
this higher-dimensional feature space without explicitly transforming the data, a technique
known as the kernel trick.
Kernel Trick: Instead of directly mapping data points to higher-dimensional spaces, the kernel
function computes the inner product in that space using the original data points.
This makes kernel methods computationally efficient, especially when dealing with complex data
structures.
In unsupervised learning, kernel methods are commonly applied in tasks such as clustering and
dimensionality reduction.
For example, in Kernel Principal Component Analysis (Kernel PCA), the kernel trick is used to
perform dimensionality reduction in a nonlinear feature space, capturing more complex
structures in the data than traditional PCA.
In Kernel K-means clustering, the kernel allows for the identification of more complex, non-
linear clusters compared to traditional K-means, making it effective in cases where clusters are
not easily separable by a simple linear boundary.
Common kernels include the Radial Basis Function (RBF) kernel, polynomial kernel, and sigmoid
kernel, each suited for different types of data.
Overall, kernel methods provide flexibility and power in unsupervised learning tasks, enabling
the modelling of intricate relationships in high-dimensional data.
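A short Kernel PCA sketch, assuming scikit-learn; the concentric-circle data and the RBF gamma = 10 are illustrative choices:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric rings: ordinary (linear) PCA cannot unfold this structure
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)

linear_pca = PCA(n_components=2).fit_transform(X)
kernel_pca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# After the RBF mapping, the two rings separate along the first kernel principal
# component, which linear PCA cannot achieve on this data
print(linear_pca.shape, kernel_pca.shape)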
Kernel methods are techniques used to enable algorithms to operate in higher-dimensional
spaces without explicitly computing the coordinates in that space.
This is particularly useful for non-linear data that cannot be separated in lower dimensions.
Example: Support Vector Machines (SVM) use kernel methods to find decision boundaries that can
separate non-linearly separable data by mapping them to higher-dimensional space.
Common Kernels:
• Linear Kernel
• Polynomial Kernel
• Radial Basis Function (RBF) Kernel
Example:
For a classification task where you want to distinguish between two classes (e.g., cats and
dogs), a linear separator might not work well.
Using a kernel function, the data can be transformed into a higher-dimensional space where
the classes become separable by a linear boundary.