ML Unit-4-1

Artificial Intelligence and Data Science (AI & DS)


Machine Learning [B20AD3201]


Syllabus
Unit-1

Introduction- Artificial Intelligence, Machine Learning, Deep Learning, Types of Machine Learning Systems, Main Challenges of Machine Learning. Statistical Learning: Introduction,
Supervised and Unsupervised Learning, Training and Test Loss, Tradeoffs in Statistical
Learning, Estimating Risk Statistics, Sampling distribution of an estimator, Empirical Risk
Minimization.

Unit-2

Supervised Learning (Regression/Classification): Basic Methods: Distance-based Methods, Nearest Neighbours, Decision Trees, Naive Bayes, Linear Models: Linear Regression, Logistic
Regression, Generalized Linear Models, Support Vector Machines, Binary Classification:
Multiclass/Structured outputs, MNIST, Ranking.

Unit-3

Ensemble Learning and Random Forests: Introduction, Voting Classifiers, Bagging and
Pasting, Random Forests, Boosting, Stacking. Support Vector Machine: Linear SVM Classification, Nonlinear SVM Classification, SVM Regression, Naïve Bayes Classifiers.

Unit-4

Unsupervised Learning Techniques: Clustering, K-Means, Limits of K-Means, Using Clustering for Image Segmentation, Using Clustering for Preprocessing, Using Clustering for
Semi-Supervised Learning, DBSCAN, Gaussian Mixtures. Dimensionality Reduction: The
Curse of Dimensionality, Main Approaches for Dimensionality Reduction, PCA, Using Scikit-
Learn, Randomized PCA, Kernel PCA.

Unit-5

Neural Networks and Deep Learning: Introduction to Artificial Neural Networks with Keras,
Implementing MLPs with Keras, Installing TensorFlow 2, Loading and Preprocessing Data with TensorFlow.

Unit-4

Unsupervised Learning Techniques: Clustering, K-Means, Limits of K-Means, Using Clustering for Image Segmentation, Using Clustering for Preprocessing, Using Clustering for
Semi-Supervised Learning, DBSCAN, Gaussian Mixtures. Dimensionality Reduction: The
Curse of Dimensionality, Main Approaches for Dimensionality Reduction, PCA, Using Scikit-
Learn, Randomized PCA, Kernel PCA.

Clustering
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters. The set of clusters resulting from a
cluster analysis can be referred to as a clustering. In this context, different clustering methods
may generate different clusterings on the same data set. The partitioning is not performed by
humans, but by the clustering algorithm. Hence, clustering is useful in that it can lead to the
discovery of previously unknown groups within the data.
Cluster analysis has been widely used in many applications such as business intelligence,
image pattern recognition, Web search, biology, and security.
• In business intelligence, clustering can be used to organize a large number of customers
into groups, where customers within a group share strong similar characteristics. This
facilitates the development of business strategies for enhanced customer relationship
management.
• In image recognition, clustering can be used to discover clusters or “subclasses” in
handwritten character recognition systems. Suppose we have a data set of handwritten
digits, where each digit is labeled as 1, 2, 3, and so on. Note that there can be a large variance in the way people write the same digit. Take the number 2, for example. Some people may write it with a small circle at the bottom left part, while
some others may not. We can use clustering to determine subclasses for “2,” each of
which represents a variation on the way in which 2 can be written. Using multiple
models based on the subclasses can improve overall recognition accuracy.
• Clustering has also found many applications in Web search. For example, a keyword
search may often return a very large number of hits (i.e., pages relevant to the search)

due to the extremely large number of web pages. Clustering can be used to organize the
search results into groups and present the results in a concise and easily accessible way.

A cluster is a collection of data objects that are similar to one another within the cluster and dissimilar to objects in other clusters; because of this, a cluster of data objects can be treated as an implicit class. In this sense, clustering is sometimes called automatic classification. Again, a critical
difference here is that clustering can automatically find the groupings. This is a distinct
advantage of cluster analysis.
Clustering is also called data segmentation in some applications because clustering partitions
large data sets into groups according to their similarity. Clustering can also be used for outlier
detection, where outliers (values that are “far away” from any cluster) may be more interesting
than common cases. Applications of outlier detection include the detection of credit card fraud
and the monitoring of criminal activities in electronic commerce.

Types of Clustering:
1. Partitioning methods.
• k-Means: A Centroid-Based Technique
• k-Medoids: A Representative Object-Based Technique
• CLARANS (Clustering Large Applications based upon RANdomized Search)
2. Hierarchical methods.
• Agglomerative versus Divisive Hierarchical Clustering.
• Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH).
• Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling.
3. Density-based methods
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
• OPTICS: Ordering Points to Identify the Clustering Structure.
4. Grid-based methods
• STING: STatistical INformation Grid
• CLIQUE (CLustering In QUEst)

k-Means
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck. An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intracluster similarity and low intercluster similarity.
A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster. The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in Ci and the centroid ci, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)²

where E is the sum of the squared error for all objects in the data set; p is the point in space representing a given object; and ci is the centroid of cluster Ci.

k-means algorithm:
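The algorithm itself appears as a pseudocode figure in the original notes. As a rough sketch only (not the figure's exact pseudocode), the iterative relocation described above can be written in NumPy as follows; the data X, the value of k, and the helper name kmeans are illustrative assumptions.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each center to the mean of the objects assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop when no center moves (no further reassignment will occur)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical 2-D data: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
print(centers)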

Example: Consider a set of objects located in 2-D space, as depicted in Figure (a). Let k =3,
that is, the user would like the objects to be partitioned into three clusters.

Figure: Clustering of a set of objects using the k-means method; in (b) the cluster centers are updated and objects are reassigned accordingly (the mean of each cluster is marked by a +).

We arbitrarily choose three objects as the three initial cluster centers, where cluster centers are
marked by a +. Each object is assigned to a cluster based on the cluster center to which it is the
nearest. Such a distribution forms silhouettes encircled by dotted curves, as shown in Figure
(a).
Next, the cluster centers are updated. That is, the mean value of each cluster is recalculated
based on the current objects in the cluster. Using the new cluster centers, the objects are
redistributed to the clusters based on which cluster center is the nearest. Such a redistribution
forms new silhouettes encircled by dashed curves, as shown in Figure (b).
This process iterates, leading to Figure (c). The process of iteratively reassigning objects to
clusters to improve the partitioning is referred to as iterative relocation. Eventually, no
reassignment of the objects in any cluster occurs and so the process terminates. The resulting
clusters are returned by the clustering process.

• The k-means method is not guaranteed to converge to the global optimum and often
terminates at a local optimum. The results may depend on the initial random selection
of cluster centers.
• To obtain good results in practice, it is common to run the k-means algorithm multiple
times with different initial cluster centers.

Limits of K-Means:
1. Sensitivity to Initial Conditions
• K-means is sensitive to initial conditions. The algorithm randomly initializes the
cluster centroids at the beginning, and the final clustering results can vary
depending on these initial positions.
• Different initializations can lead to different local optima, resulting in different clustering outcomes. This makes the K-means algorithm less reliable and less reproducible.
2. Difficulty in Determining the Number of Clusters (K)
• One of the drawbacks of the K-means algorithm is that we have to set the number
of clusters (K) in advance. Choosing an incorrect number of clusters can lead to
inaccurate results. Various methods are available to estimate the optimal K, such as
the silhouette analysis or elbow method, but they may not always provide a clear-
cut answer.
• If we choose too small a K, we'll get overly broad clusters.
• It may require multiple runs to find the most suitable value of K, which can be time-
consuming and resource-consuming.
3. Inability to Handle Categorical Data
• The algorithm works with numerical data, where distances between data points can
be calculated. However, categorical data doesn’t have a natural notion of distance
or similarity.
• When categorical data is used with the K-means algorithm, it requires converting
the categories into numerical values, such as using one-hot encoding.
• One shortcoming of using one-hot encoding is that it treats each feature
independently and can degrade performance since it can significantly increase data
dimensionality.
4. Time Complexity
• The time complexity of the algorithm is O(n * K * M * D), where K is the number
of clusters, n is the number of data points, D is the number of dimensions, and M is
the number of iterations.

Using Clustering for Image Segmentation
Image segmentation is the process of dividing an image into distinct regions or segments,
where each region has some common characteristic. The goal is to simplify or change the
representation of an image, making it more meaningful and easier to analyze. Each segment
can represent different objects, boundaries, textures, or regions of interest.
Types of Image Segmentation
1. Thresholding: Pixels are grouped based on intensity values (e.g., foreground vs.
background).
2. Edge Detection: Segmentation based on identifying edges in the image (using
techniques like Sobel, Canny).
3. Region-based Segmentation: Dividing the image based on regions with similar
properties, such as color or texture.
4. Clustering-based Segmentation: Using clustering algorithms to group pixels with
similar features.
K-Means Clustering for Image Segmentation
K-Means is a partitional clustering algorithm that divides data into k clusters, where k is
predefined. It assigns each data point to the nearest cluster center and then iteratively updates
the centers based on the points assigned to them. It’s an unsupervised learning algorithm
because it doesn’t require labeled data.
K-Means Algorithm Steps:
1. Choose k cluster centers (centroids): These can be initialized randomly or by some
other method (like K-Means++).
2. Assign each point to the nearest centroid: Compute the Euclidean distance between
each point and the centroids and assign each pixel to the closest centroid.
3. Update the centroids: After assignment, update the centroids by calculating the mean
of all the points in each cluster.
4. Repeat: Repeat steps 2 and 3 until the centroids no longer change significantly
(convergence) or a specified number of iterations is reached.
Using K-Means for Image Segmentation
In the context of image segmentation, K-Means is applied to group similar pixels based on
color, intensity, or other features, making it easier to segment objects, textures, or regions in an
image.

Process of K-Means for Image Segmentation:
1. Flatten the Image: The image, which is typically a 3D array of shape (height, width,
channels), is reshaped into a 2D array of shape (num_pixels, channels) where each row
represents a pixel’s RGB (or grayscale) value.
2. Clustering: K-Means is applied to the pixel data. Each pixel is assigned to one of the
k clusters, where k is the number of desired segments or regions.
3. Reconstructing the Image: The result of K-Means clustering is a set of centroids (the
average color of each cluster). The image is then reconstructed by replacing each pixel's
original color with the corresponding centroid color.
4. Output: The output is a segmented image with k regions, each having similar pixel
values. These regions could represent objects or different parts of an image.

Why Use K-Means for Image Segmentation?


1. Unsupervised Learning: K-Means doesn't need labeled data, which makes it easy to use when you don't have predefined categories for the segments.
2. Simplicity and Efficiency: K-Means is relatively simple to implement and works efficiently for segmentation tasks, especially for images with clear regions of similar colors or intensities.
3. Color-Based Segmentation: K-Means is particularly effective when you want to segment based on color because it groups similar colors together into segments.
4. Versatility: It can be applied to both grayscale and color images. For color images, each pixel's RGB values are used as features, while for grayscale images, only intensity values are considered.

Applications of K-Means in Image Segmentation


1. Medical Imaging: Segmenting different regions (e.g., tumors, organs) from medical
scans like MRIs and CT scans.
2. Object Recognition: Segmentation helps in recognizing and classifying objects by
isolating them from the background.
3. Satellite Imagery: Segmenting different land types (water, forests, urban areas) in
satellite images.
4. Autonomous Vehicles: K-Means can be used to segment roads, vehicles, pedestrians,
and other objects from camera feeds.

Example code:
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Load the image
image = cv2.imread('2.jpg')                       # Replace with your image path
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)    # Convert BGR to RGB
plt.imshow(image)
plt.title("Original Image")
plt.axis("off")
plt.show()

# Step 2: Reshape image for clustering
pixel_values = image.reshape((-1, 3))       # Flatten the image to (num_pixels, 3)
pixel_values = np.float32(pixel_values)     # Convert to float32 for k-means

# Step 3: Apply K-Means clustering
k = 3  # Number of clusters
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.2)
_, labels, centers = cv2.kmeans(pixel_values, k, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)

# Convert centers back to uint8 and rebuild the segmented image
centers = np.uint8(centers)
segmented_image = centers[labels.flatten()]
segmented_image = segmented_image.reshape(image.shape)

# Step 4: Display the segmented image
plt.imshow(segmented_image)
plt.title("Segmented Image with K-Means (k=3)")
plt.axis("off")
plt.show()

Using Clustering for Preprocessing

Using clustering in data preprocessing can enhance the quality and efficiency of your machine
learning model by grouping similar data points and creating new features, reducing
dimensionality, or addressing data issues like noise and outliers.
Clustering is an unsupervised machine learning technique where data points are grouped based
on similarities. The idea is that points in the same group (or cluster) share certain characteristics
and should be closer to each other in some feature space than to points in other clusters.
Common clustering algorithms include:
• K-means
• DBSCAN (Density-Based Spatial Clustering)
• Hierarchical Clustering
• Gaussian Mixture Models (GMM)
Each of these algorithms has its strengths and is chosen based on the data's nature (e.g., the expected shape of the clusters, noise levels, etc.).
Why Use Clustering in Preprocessing?
Clustering is used in preprocessing to simplify, structure, and enhance the data. Here's how
it can be applied at various stages:
1. Feature Engineering
Clustering can be used to create new features that help machine learning models identify
patterns or relationships in the data more easily. By labeling each data point with the cluster it
belongs to, you can add a cluster label as an additional feature, which often improves model
performance.
Example: Suppose you have a dataset of customer information (age, income, spending
behavior) and want to predict customer churn. By clustering the customers into groups based
on their spending patterns, the cluster label can serve as a feature to help the model better
understand the behavior of customers who are likely to churn.
Steps:
1. Apply a clustering algorithm like K-means to the customer data.
2. Assign each customer to a cluster (label).
3. Use the cluster label as a feature for a classification model (e.g., predicting whether a
customer will churn).
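A minimal sketch of these steps, assuming made-up customer features and churn labels (the dataset, the 4-cluster choice, and the classifier are illustrative, not from the notes):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Hypothetical customer data: [age, income, spending_score] and churn labels
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = rng.randint(0, 2, size=200)   # 1 = churned, 0 = stayed (made up)

# Steps 1-2: cluster the customers and get a cluster label for each one
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Step 3: append the cluster label as an additional feature
X_with_cluster = np.hstack([X, cluster_labels.reshape(-1, 1)])

# Train a classifier (e.g., churn prediction) on the augmented features
clf = RandomForestClassifier(random_state=42).fit(X_with_cluster, y)
print(X_with_cluster.shape)  # (200, 4): original 3 features + cluster label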

2. Dimensionality Reduction
High-dimensional datasets (i.e., datasets with many features) can suffer from the curse of
dimensionality, where the complexity of the model increases exponentially as the number of
features grows. Clustering helps to reduce the number of features by grouping similar data
points together.
How Clustering Helps: After clustering, you can use the cluster centroids (the average of all
points in a cluster) as a summarized representation of that group. This reduces the number of
unique data points the model needs to process.
Example: For customer data with many variables (age, income, education, etc.), you can apply
K-means clustering and represent each customer by the distance to their assigned cluster's
centroid, reducing the complexity of the dataset.
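A minimal sketch of this idea, assuming made-up customer data: KMeans.transform in scikit-learn returns each point's distance to every centroid, which can serve as the reduced representation.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical high-dimensional customer data (e.g., 20 attributes per customer)
rng = np.random.RandomState(0)
X = rng.rand(500, 20)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# transform() gives the distance from each customer to each of the 5 centroids,
# reducing the representation from 20 features to 5 cluster-distance features
X_reduced = kmeans.transform(X)
print(X.shape, "->", X_reduced.shape)  # (500, 20) -> (500, 5)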

3. Outlier Detection and Noise Removal


In any dataset, there will often be points that do not fit well with the majority of the data —
these are outliers. Clustering can help identify outliers, as data points that do not belong to any
meaningful cluster are often considered outliers.
How Clustering Helps: DBSCAN, for example, classifies points that do not belong to any
cluster as noise (often assigned a label of -1). After applying clustering, outliers can be removed
or treated, improving the performance of machine learning models.
Example: In a dataset of transactions, a few very high-value transactions might not belong to
any customer segment. These can be flagged and removed or treated differently by the model.
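A small sketch of this approach with made-up transaction-like data (the eps and min_samples values are arbitrary): DBSCAN labels noise points as -1, and those rows can then be dropped or treated separately.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(1)
# Hypothetical transaction features: one dense group plus a few extreme values
X = np.vstack([rng.normal(loc=50, scale=5, size=(200, 2)),
               np.array([[500, 480], [650, 700], [900, 20]])])  # outliers

labels = DBSCAN(eps=5, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1
X_clean = X[labels != -1]
print("removed", np.sum(labels == -1), "points as noise; kept", len(X_clean))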

4. Data Balancing
Many real-world datasets are imbalanced, meaning that some classes are underrepresented
compared to others. Clustering can be used to balance data within clusters, which can improve
model performance, especially in classification tasks.
How Clustering Helps: After clustering the data, you can ensure that each cluster contains a balanced representation of different classes by using techniques like oversampling or undersampling within each cluster.
Example: In a medical dataset, the disease class might be underrepresented. After clustering
the data by patient characteristics, you can apply SMOTE (Synthetic Minority Oversampling
Technique) within each cluster to balance the classes before training a model.

5. Reducing Complexity in Model Training
When training models, having smaller, more homogeneous groups (clusters) can reduce the
complexity of the training process. Instead of training on the entire dataset, you can train on
clusters that represent more homogeneous data points. This can make models more
interpretable and faster to train.
How Clustering Helps: Clustering helps segment the data into groups that can be trained
independently, making it easier to fit models and analyze the relationships within smaller
subsets of data.
Example: For large datasets with customer behavior data, clustering helps group customers
with similar buying patterns. You can then train a classifier separately for each cluster, leading
to more efficient training and better performance.

6. Creating Synthetic Data


Clustering can be combined with techniques like SMOTE (Synthetic Minority Over-sampling
Technique) to create synthetic data. After clustering, you can apply SMOTE within each cluster
to generate synthetic samples, which can help balance classes in imbalanced datasets.
How Clustering Helps: After clustering, SMOTE can generate synthetic data for minority
class clusters, ensuring that the synthetic data is meaningful and representative of the cluster's
characteristics.
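The sketch below illustrates the idea behind sections 4 and 6, assuming the third-party imbalanced-learn (imblearn) package is available and using made-up data; SMOTE is applied inside each k-means cluster that has enough samples of both classes.

import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

# Made-up imbalanced two-class data: 180 majority points, 20 minority points
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (180, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 180 + [1] * 20)

# Cluster first, then oversample the minority class inside each cluster
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

X_parts, y_parts = [], []
for c in np.unique(clusters):
    Xc, yc = X[clusters == c], y[clusters == c]
    minority = np.bincount(yc, minlength=2).min()
    if minority >= 2:  # SMOTE needs both classes and a few minority samples
        sm = SMOTE(k_neighbors=min(5, minority - 1), random_state=0)
        Xc, yc = sm.fit_resample(Xc, yc)
    X_parts.append(Xc)
    y_parts.append(yc)

X_balanced, y_balanced = np.vstack(X_parts), np.concatenate(y_parts)
print(np.bincount(y), "->", np.bincount(y_balanced))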

Using Clustering for Semi-Supervised Learning
Semi-supervised learning (SSL) is a type of machine learning where a model is trained using
a combination of:
• Labeled data (data with known outputs or labels),
• Unlabeled data (data where the labels are unknown).
In real-world scenarios, labeled data can be scarce, expensive, or time-consuming to obtain,
while unlabeled data is abundant. Semi-supervised learning aims to leverage both labeled and
unlabeled data to improve model performance. Clustering is a powerful technique in SSL, as
it can help make use of the large amount of unlabeled data.
Clustering in Semi-Supervised Learning
Clustering in SSL involves grouping unlabeled data points into clusters based on their features.
These clusters represent different subgroups or patterns in the data. The key idea is to use the
structure of the unlabeled data to guide the learning process and improve predictions on new
or unseen data points.

How Clustering Helps in Semi-Supervised Learning

1. Label Propagation Through Clusters:


Cluster labeling: The basic idea is that once we have a few labeled data points, we can
propagate these labels to other points that belong to the same cluster. Since data points in the
same cluster are similar, we assume that they are likely to have similar labels.
Example: Suppose you have labeled data for cats and dogs, and a large set of unlabeled data
with images of various animals. By clustering the unlabeled data into groups of similar images
(e.g., based on features like shape, texture, etc.), you can propagate the known labels ("cat" or
"dog") to the members of each cluster, assuming the unlabeled data in each cluster will have
the same label.
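A minimal sketch of cluster-based label propagation; the digits dataset, the 50-cluster choice, and the assumption that only the first 100 points are labeled are illustrative stand-ins, not from the original notes. Each "unlabeled" point receives the most common known label in its cluster.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Pretend only the first 100 points are labeled; the rest are "unlabeled"
n_labeled = 100
y_known = y[:n_labeled]

clusters = KMeans(n_clusters=50, n_init=10, random_state=42).fit_predict(X)

# Propagate: give every point the majority known label of its cluster
y_propagated = np.full(len(X), -1)
for c in np.unique(clusters):
    mask_known = (clusters[:n_labeled] == c)
    if mask_known.any():
        majority = np.bincount(y_known[mask_known]).argmax()
        y_propagated[clusters == c] = majority

coverage = np.mean(y_propagated != -1)
acc = np.mean(y_propagated[y_propagated != -1] == y[y_propagated != -1])
print(f"covered {coverage:.0%} of points, propagated-label accuracy {acc:.2f}")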

2. Improving Label Quality:


The labels assigned through clustering can help improve the overall label quality of the dataset.
By clustering similar unlabeled data together, the clustering algorithm may help identify
mislabeling in the training data.
Example: if certain data points from the labeled set are outliers, clustering can group them
with other similar points, making the training process more robust.

3. Using Clusters as Pseudo-Labels:
In SSL, pseudo-labelling is a method where you use the predicted label from an unsupervised
learning model (like clustering) as a "pseudo-label" for training the model. By assigning labels
to unlabelled data based on their cluster assignment, we treat them as if they were labeled.
Example: If a clustering algorithm divides a dataset into 5 clusters, we can assume that all the
data points in the same cluster share the same label and train the model accordingly.
4. Data Exploration and Disambiguation:
Clustering can also help in data exploration. By grouping similar data points together,
clustering can reveal hidden structure and relationships in the data. This may help in making
better decisions about how to assign labels or refine training data, especially when there's
ambiguity in labelling.

DBSCAN
Clustering is an unsupervised learning technique where we try to group the data points based
on specific characteristics. There are various clustering algorithms with K-
Means and Hierarchical being the most used ones. Some of the use cases of clustering
algorithms include:
• Document Clustering
• Recommendation Engine
• Image Segmentation
• Market Segmentation
• Search Result Grouping
• and Anomaly Detection.
K-Means and Hierarchical Clustering both fail to create clusters of arbitrary shapes. They are
not able to form clusters based on varying densities. That’s why we need DBSCAN clustering.

Density-based spatial clustering of applications with noise (DBSCAN) is a clustering algorithm used in machine learning to partition data into clusters based on their distance to other points. It's effective at identifying and removing noise in a data set, making it useful for data cleaning
and outlier detection. Unlike other clustering algorithms (such as K-means), DBSCAN does
not require the number of clusters to be predefined and can discover clusters of arbitrary shapes.
DBSCAN is particularly useful for datasets with noise and can identify clusters with varying
shapes and densities. It is widely used in applications like geographic data analysis, anomaly
detection, and image segmentation.
Example
Let’s try to understand it with an example. Here we have data points densely present in the
form of concentric circles:

We can see three different dense clusters in the form of concentric circles with some noise here.
Now, let’s run K-Means and Hierarchical clustering algorithms and see how they cluster these
data points.

You might be wondering why there are four colors in the graph. As I said earlier, this data
contains noise, too. Therefore, I have taken noise as a different cluster, which is represented by
the purple color. Sadly, both of them failed to cluster the data points. Also, they were not able
to detect the noise present in the dataset properly. Now, let’s take a look at the results from
DBSCAN clustering.

Awesome! DBSCAN not only clusters the data points correctly but also detects the noise in the dataset.

DBSCAN Algorithm:
“How does DBSCAN find clusters?” Initially, all objects in a given data set D are marked as “unvisited.” DBSCAN randomly selects an unvisited object p, marks p as “visited,” and checks whether the ε-neighborhood of p contains at least MinPts objects. If not, p is marked as a noise point. Otherwise, a new cluster C is created for p, and all the objects in the ε-neighborhood of p are added to a candidate set, N. DBSCAN iteratively adds to C those objects in N that do not belong to any cluster. In this process, for an object p1 in N that carries the label “unvisited,” DBSCAN marks it as “visited” and checks its ε-neighborhood. If the ε-neighborhood of p1 has at least MinPts objects, those objects are added to N. DBSCAN continues adding objects to C until C can no longer be expanded, that is, until N is empty. At this point, cluster C is complete and is output. To find the next cluster, DBSCAN randomly selects an unvisited object from the remaining ones. The clustering process continues until all objects are visited.
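As a short illustration of the procedure above (the eps and min_samples values are arbitrary choices), scikit-learn's DBSCAN can be run on the two-moons data like this:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Nonlinear, noisy data that K-Means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # eps ~ neighborhood radius, min_samples ~ MinPts
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = np.sum(labels == -1)
print("clusters found:", n_clusters, "| noise points:", n_noise)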

Reachability and Connectivity
These are the two concepts that you need to understand before moving further. Reachability
states if a data point can be accessed from another data point directly or indirectly, whereas
Connectivity states whether two data points belong to the same cluster or not. In terms of
reachability and connectivity, two points in DBSCAN can be referred to as:
• Directly Density-Reachable
• Density-Reachable
• Density-Connected
Density-reachability and density-connectivity: Consider the figure below for a given ε, represented by the radius of the circles, and, say, let MinPts = 3.
The labeled points m, p, o, and r are core objects because each has an ε-neighborhood containing at least three points. Object q is directly density-reachable from m. Object m is directly density-reachable from p and vice versa.
Object q is (indirectly) density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q because q is not a core object. Similarly, r and s are density-reachable from o, and o is density-reachable from r. Thus, o, r, and s are all density-connected.

Density-reachability and density-connectivity in density-based clustering.

Advantages of the DBSCAN Algorithm


• DBSCAN does not require the number of centroids to be known beforehand, as is the case with the K-Means algorithm.
• It can find clusters with any shape.
• It can also locate clusters that are not connected to any other group or cluster. It can work well with noisy data.
• It is robust to outliers.

Disadvantages of the DBSCAN Algorithm
• It does not work well when clusters have widely varying densities.
• It cannot easily be parallelized, since the data set cannot simply be partitioned.
• It may not find the right clusters if the dataset is sparse.
• It is sensitive to the parameters epsilon (ε) and minPoints.

Applications of DBSCAN
• It is used in satellite imagery analysis.
• It is used in X-ray crystallography.
• Anomaly detection, e.g., in temperature data.

Gaussian Mixtures
A Gaussian Mixture Model (GMM) is a probabilistic model used to represent the presence of
subpopulations (clusters) within a larger population, where each subpopulation follows a
Gaussian (normal) distribution. GMM is considered a soft clustering technique because each
data point can belong to multiple clusters with different probabilities.
GMM assumes that:
• The data points are generated by a mixture of several Gaussian distributions.
• Each distribution is defined by its mean vector (μ) and covariance matrix (Σ).

Mathematical Representation
A GMM with K components is represented as:

p(x) = Σ_{k=1}^{K} πk · N(x | μk, Σk),  where Σ_{k=1}^{K} πk = 1

and N(x | μk, Σk) denotes the Gaussian (normal) density of component k.

Parameters of GMM
The GMM has three key sets of parameters for each Gaussian component k:
• Mixing weight (πk): the probability that a data point is generated by component k.
• Mean vector (μk): the center of component k.
• Covariance matrix (Σk): the shape and spread of component k.
Parameter Estimation: Expectation-Maximization (EM) Algorithm
To fit a GMM to data, we use the Expectation-Maximization (EM) algorithm, which alternates two steps until the parameters converge:
• E-step: using the current parameters, compute for each data point the responsibility (posterior probability) of each Gaussian component.
• M-step: re-estimate the mixing weights, means, and covariances from these responsibilities.
Advantages of GMM
• Soft clustering: Points can belong to multiple clusters with varying probabilities.
• Flexibility: Can model clusters with different shapes and densities by adjusting the
covariance structure.
• Probabilistic interpretation: Provides insights into the uncertainty of cluster
assignments.

Limitations of GMM
• Sensitive to Initialization: Poor initialization can lead to local optima.
• Assumption of Gaussian Distribution: May not perform well if the data doesn't fit a
Gaussian distribution.
• Number of Components: Requires the number of components (K) to be specified
beforehand.

Applications
• Anomaly Detection: Identifying outliers based on low probability density.
• Image Processing: Segmenting regions with different textures or intensities.
• Customer Segmentation: Grouping customers based on purchasing behavior.
• Speech Recognition: Modeling acoustic feature distributions.

Dimensionality Reduction

Dimensionality reduction is a technique used in machine learning and data science to reduce
the number of input features while preserving as much important information as possible. This
helps in overcoming computational inefficiencies and improving model performance.

The Curse of Dimensionality:


The curse of dimensionality refers to the problems that arise when working with high-
dimensional data. As the number of dimensions (features) increases, data points become
sparser, making it harder to analyze patterns.
Problems Caused by High Dimensionality
1. Increased Computational Cost: More dimensions mean more computations, making
models slower.
2. Data Sparsity: In high dimensions, data points are spread far apart, making clustering
and classification difficult.
3. Overfitting: More features can lead to models capturing noise instead of useful
patterns.
4. Distance Measure Distortion: Many machine learning algorithms rely on distance
metrics (e.g., Euclidean distance). In high dimensions, all points tend to appear
equidistant.
Example of Curse of Dimensionality
Consider a 1D space where points are randomly distributed between 0 and 1. If we have 100
data points, they are likely well distributed.
Now, if we increase the dimensions:
• 2D space: The same 100 points are spread in a unit square (0,1) × (0,1).
• 3D space: The points are spread in a unit cube (0,1) × (0,1) × (0,1).
• 100D space: The points are in a unit hypercube, and the data density becomes extremely
sparse.
This causes problems because:
• Most machine learning models struggle to learn useful patterns.
• Distance-based methods like k-NN become ineffective.
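The following small experiment is only an illustration of the distance-distortion point above (the sample size and dimensions are arbitrary choices): as the dimensionality grows, the gap between the nearest and farthest pairwise distances shrinks relative to the average distance.

import numpy as np

rng = np.random.RandomState(0)
n_points = 50

for d in [1, 2, 10, 100, 1000]:
    X = rng.rand(n_points, d)                      # points in the unit hypercube
    # pairwise Euclidean distances
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    dists = dists[np.triu_indices(n_points, k=1)]  # unique pairs only
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:5d}  relative spread of distances = {spread:.2f}")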

Let’s take an example to explain this better:
Imagine you are building a machine learning model to predict house prices based on features
like the number of bedrooms, square footage, location, age of the house, number of bathrooms,
and so on. If you have too many features like additional ones for each room’s condition,
flooring type, or neighborhood amenities, your dataset can become very large and complex.
With too many features, your model may become slow to train, and it might also pick up
unnecessary details or noise. For example, suppose the flooring type doesn’t significantly
impact house prices. In that case, it might lead the model to make less accurate predictions,
especially when the data is noisy or when there are many irrelevant features.

How Does Dimensionality Reduction Work?

Let's understand how dimensionality reduction works with the help of the figure described below:

On the left, data points exist in a 3D space (X, Y, Z), but the Z-dimension appears unnecessary
since the data primarily varies along the X and Y axes. The goal of dimensionality reduction is
to remove less important dimensions without losing valuable information.

On the right, after reducing the dimensionality, the data is represented in lower-dimensional
spaces. The top plot (X-Y) maintains the meaningful structure, while the bottom plot (Z-Y)
shows that the Z-dimension contributed little useful information. This process makes data
analysis more efficient, improving computation speed and visualization while minimizing
redundancy.

Main Approaches for Dimensionality Reduction

There are two main approaches to dimensionality reduction:

1. Feature Selection.
2. Feature Extraction
1. Feature Selection (Selecting Relevant Features):
Feature selection chooses the most relevant features from the dataset without altering them.
It helps remove redundant or irrelevant features, improving model efficiency. There are
several methods for feature selection including filter methods, wrapper methods, and
embedded methods.

1. Filter Methods: Use statistical tests to rank features by importance.
Example Techniques: Correlation, Mutual Information, Chi-Square Test.
2. Wrapper Methods: Train models iteratively with different feature subsets and evaluate
performance.
Example Techniques: Recursive Feature Elimination (RFE), Forward Selection,
Backward Elimination
3. Embedded Methods: Feature selection is integrated into the model training process.
Example Techniques: LASSO Regression (L1 regularization), Decision Tree Feature
Importance.
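As a small illustrative sketch (the synthetic dataset and the choice of keeping 5 features are assumptions), a filter method and a wrapper method can be compared in scikit-learn like this:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

# 20 features, only 5 of which are actually informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Filter method: rank features by mutual information, keep the top 5
X_filter = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

# Wrapper method: Recursive Feature Elimination around a model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
X_wrapper = X[:, rfe.support_]

print(X.shape, "->", X_filter.shape, "and", X_wrapper.shape)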
2. Feature Extraction (Transforming Features):
Feature extraction creates new, lower-dimensional features from the original ones. These
methods combine or project features while preserving important patterns.
A. Linear Methods
1. Principal Component Analysis (PCA): Finds new axes that maximize
variance, transforming correlated features into uncorrelated components.
Best Used for High-dimensional, correlated data.
2. Linear Discriminant Analysis (LDA): Maximizes class separation while
reducing dimensions. Best Used for Classification problems.
3. Singular Value Decomposition (SVD): Factorizes data matrices into simpler
structures. Best Used for Text data (Latent Semantic Analysis in NLP).

B. Non-Linear Methods
1. t-SNE (t-Distributed Stochastic Neighbor Embedding): Maps high-
dimensional data to a lower-dimensional space while preserving local relationships.
Best Used for Visualization
2. UMAP (Uniform Manifold Approximation and Projection): Similar to t-
SNE but faster and better at preserving global structure. Best Used for Large datasets
and Visualization.
3. Autoencoders (Neural Networks): Uses an encoder-decoder architecture to
learn compact representations. Best Used for Deep Learning Applications.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving as much
variance (information) as possible.
• It finds a new set of orthogonal (uncorrelated) axes, called principal components,
that maximize the variance in the data.
• These components are ranked, with the first principal component (PC1) capturing the
most variance, the second (PC2) capturing the second most, and so on.
• PCA is widely used in machine learning, data visualization, and noise reduction.

PCA is useful in many scenarios, such as:


• Reducing dimensionality to improve computational efficiency.
• Avoiding overfitting in machine learning models.
• Visualizing high-dimensional data in 2D or 3D.
• Decorrelating features (removing redundancy).

PCA Working:
Step 1: Standardize the Data
Since PCA is sensitive to feature scaling, we first standardize the data by converting each feature into a zero-mean, unit-variance form:

z = (x − μ) / σ

where:
• μ is the mean of each feature.
• σ is the standard deviation of each feature.

Step 2: Compute the Covariance Matrix

The covariance matrix captures relationships between features. For the standardized data matrix X with n samples, it is computed as:

C = (1 / (n − 1)) XᵀX

where:
• C is the d × d covariance matrix.
• XᵀX is the dot product of the transposed data matrix with the data matrix.

Example covariance matrix for 3 features:

        | Var(x1)      Cov(x1, x2)   Cov(x1, x3) |
    C = | Cov(x2, x1)  Var(x2)       Cov(x2, x3) |
        | Cov(x3, x1)  Cov(x3, x2)   Var(x3)     |

A high covariance between two features means they are correlated.

Step 3: Compute Eigenvalues and Eigenvectors


We solve the eigenvalue decomposition problem: Cv = λv
where:
• v are eigenvectors (principal components).
• λ are eigenvalues (amount of variance explained).
• The largest eigenvalue corresponds to the most important principal component.
• The number of components chosen depends on the percentage of variance
explained.

We solve for the eigenvalues λ from: det(C − λI) = 0

Eigenvectors satisfy: (C − λI)v = 0

Step 4: Select Top k Principal Components


We choose the top k eigenvectors (those with the highest eigenvalues) to form a projection matrix: W = [v1, v2, ..., vk]
where:
• W is the d × k transformation matrix.
• k is the number of components retained.

Step 5: Transform Data into the New Space

We project the original data onto the new principal component space: X_reduced = X · W
This gives us a lower-dimensional representation of the data.
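The five steps above can be sketched directly in NumPy; this is an illustration only, reusing the small 4 × 2 dataset that appears in the Scikit-Learn example later in this unit:

import numpy as np

X = np.array([[2, 3], [3, 5], [5, 8], [7, 10]], dtype=float)

# Step 1: standardize (zero mean, unit variance per feature)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
C = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors of C (eigh: C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: pick the top k components (largest eigenvalues first)
order = np.argsort(eigvals)[::-1]
k = 1
W = eigvecs[:, order[:k]]          # d x k projection matrix

# Step 5: project the data into the new space
X_reduced = X_std @ W
print("explained variance ratio:", eigvals[order[:k]] / eigvals.sum())
print(X_reduced)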

Example of PCA

(In the worked example that follows, a small 2-D dataset is used; the data, mean-centering, and covariance-matrix computations appear as figures in the original notes and give the eigenvalues λ1 = 10.8224 and λ2 = 0.1151 used below.)
For Eigenvectors:

Substitute Eigenvalue λ1=10.8224

Solve the System of Equations

Normalize the Eigenvector

Solve for the Eigenvector of λ2 = 0.1151

Solve the System of Equations

Normalize the Eigenvector

Step 6: Select the Principal Component
Since we want to reduce our 2D data to 1D, we select the eigenvector corresponding to the
largest eigenvalue, which captures the most variance.
Our eigenvalues are λ1 = 10.8224 and λ2 = 0.1151. Since λ1 is the largest eigenvalue, its corresponding eigenvector will be our principal component:

Step 7: Transform the Data


We project our centered data onto the principal component using:
X_transformed = X_centered · v1

Final Result of PCA Transformation in Matrix Format (Reducing 2D to 1D)

The final transformed data after applying PCA (projecting onto the principal component) is:

These are the projections of the original points onto the principal component (the direction
of maximum variance).

Using Scikit-Learn

Scikit-Learn (or sklearn) is a popular Python machine learning library used for:
1. Data Preprocessing (e.g., Standardization, Normalization)
2. Dimensionality Reduction (e.g., PCA, t-SNE)
3. Machine Learning Algorithms (e.g., SVM, Decision Trees, K-Means)
4. Model Evaluation (e.g., Accuracy, Precision-Recall)

Install and Import Scikit-Learn

pip install scikit-learn

Example: Scikit-Learn provides a simple way to apply Principal Component Analysis.

Program

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 1: Define the 2D dataset (column 1 = Feature 1, column 2 = Feature 2)
X = np.array([
    [2, 3],
    [3, 5],
    [5, 8],
    [7, 10]])

# Step 2: Standardize the data (mean = 0, variance = 1).
# Before applying PCA, it's crucial to standardize the dataset so all features
# share the same scale. This is Z-score normalization: it centers the data
# around mean = 0 and scales it to unit variance.
scaler = StandardScaler()            # StandardScaler normalizes the data
X_scaled = scaler.fit_transform(X)

# Step 3: Apply PCA (reduce from 2D to 1D)
pca = PCA(n_components=1)            # PCA(n_components=k) reduces the dimensions to k
X_pca = pca.fit_transform(X_scaled)  # computes principal components and projects the data

# Step 4: Print results
print("Original Data (2D):\n", X)
print("\nTransformed Data (1D):\n", X_pca)
print("\nPrincipal Component:\n", pca.components_)

# Step 5: Plot the original data and the principal component
plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], label="Original Data", color='blue')

# Get the principal component direction
vector = pca.components_[0]   # Eigenvector
origin = np.mean(X, axis=0)   # Mean of original data

# Plot principal component (red arrow)
plt.quiver(origin[0], origin[1], vector[0], vector[1], scale=3,
           color='red', angles='xy', scale_units='xy', label="Principal Component")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.title("PCA: Original Data & Principal Component")
plt.grid()
plt.show()

Randomized PCA

Randomized PCA is a faster approximation of standard PCA, particularly useful for large
datasets with high dimensions. Instead of computing all eigenvalues and eigenvectors, it
uses a stochastic algorithm to estimate the most important components efficiently.

Advantages:
• Faster than Standard PCA for high-dimensional data.
• Good Approximation of the principal components.
• Works Well with Sparse Data.
• Useful for Large Image and Text Datasets

Example: Compare Standard PCA vs. Randomized PCA with a 2D dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = np.array([
[2, 3],
[3, 5],
[5, 8],
[7, 10]
])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca_standard = PCA(n_components=1)  # Reduce to 1D
X_pca_standard = pca_standard.fit_transform(X_scaled)
print("Standard PCA Result:\n", X_pca_standard)

pca_randomized = PCA(n_components=1, svd_solver="randomized")
X_pca_randomized = pca_randomized.fit_transform(X_scaled)
print("Randomized PCA Result:\n", X_pca_randomized)

print("Standard PCA Components:\n", pca_standard.components_)
print("Randomized PCA Components:\n", pca_randomized.components_)

plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], label="Original Data", color='blue')

# Get the principal component direction


vector = pca_randomized.components_[0] # Eigenvector
origin = np.mean(X, axis=0) # Mean of original data

# Plot principal component (red arrow)


plt.quiver(origin[0], origin[1], vector[0], vector[1], scale=3,
color='red', angles='xy', scale_units='xy', label="Principal Component")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.title("Randomized PCA: Original Data & Principal Component")
plt.grid()
plt.show()

Difference Between Standard PCA and Randomized PCA

Both Standard PCA and Randomized PCA are used for dimensionality reduction, but they differ in computation, speed, and use cases.

• Computation: Standard PCA computes exact eigenvalues and eigenvectors; Randomized PCA uses an approximate randomized SVD.
• Speed: Standard PCA is slow for large datasets; Randomized PCA is fast for high-dimensional data.
• Accuracy: Standard PCA is precise; Randomized PCA gives a good approximation.
• Scalability: Standard PCA is not efficient for big data; Randomized PCA scales well for large datasets.
• Memory Usage: Standard PCA is high (stores the full covariance matrix); Randomized PCA is low (uses a subset of the data for approximation).
• Best For: Standard PCA suits small to medium datasets where precision is needed; Randomized PCA suits large datasets (big data, images, NLP) where speed is needed.
• SVD Solver: Standard PCA uses svd_solver="full" (exact decomposition); Randomized PCA uses svd_solver="randomized" (fast approximation).

Kernel PCA

Kernel PCA (KPCA) is an extension of Principal Component Analysis (PCA) that allows us to
find nonlinear patterns in data. While standard PCA only works with linear transformations,
Kernel PCA uses the kernel trick to map data into a higher-dimensional space, where it can
find nonlinear principal components.

Why Use Kernel PCA?


• Handles Nonlinear Data: Standard PCA only captures linear patterns, while KPCA
can map data to a higher-dimensional space where patterns become linear.
• Uses the Kernel Trick: Instead of explicitly computing higher dimensions, KPCA
applies kernels to find nonlinear relationships efficiently.
• Useful for Complex Datasets: Works well for datasets with curved or nonlinear
structures, such as image processing, NLP, and clustering.

Mathematical Explanation:
1. Standard PCA finds principal components by computing eigenvectors of the covariance
matrix.
2. Kernel PCA applies a nonlinear transformation Φ(X) to map data into a higher-
dimensional space.
3. Instead of computing this higher-dimensional transformation explicitly, KPCA uses the
kernel trick: K(xi,xj)=Φ(xi)⋅Φ(xj) where K is a kernel function.
4. KPCA finds eigenvectors of the kernel matrix K, allowing for dimensionality reduction
in the transformed space.

Kernel PCA in Scikit-Learn

Let's apply Kernel PCA to a nonlinear dataset.

#Step 1: Import Required Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Step 2: Create a Nonlinear Dataset (Moons Dataset)

X, y = make_moons(n_samples=100, noise=0.1, random_state=42)

# Step 3: Standardize the Data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 4: Apply Standard PCA

from sklearn.decomposition import PCA


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

#Step 5: Apply Kernel PCA with RBF Kernel

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)


X_kpca = kpca.fit_transform(X_scaled)

#Step 6: Visualize the Results

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Standard PCA
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="coolwarm")
axes[0].set_title("Standard PCA")

# Kernel PCA
axes[1].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y, cmap="coolwarm")
axes[1].set_title("Kernel PCA (RBF)")
plt.show()

Differences Between PCA, Randomized PCA, and Kernel PCA

• Approach: Standard PCA computes exact eigenvalues and eigenvectors of the covariance matrix; Randomized PCA uses randomized Singular Value Decomposition (SVD) for approximation; Kernel PCA applies the kernel trick to map data into a higher-dimensional space.
• Handles Nonlinear Data: Standard PCA and Randomized PCA capture only linear patterns; Kernel PCA captures nonlinear structures.
• Speed: Standard PCA is slow for large datasets; Randomized PCA is fast and optimized for high-dimensional data; Kernel PCA is moderate and depends on the kernel choice.
• Memory Usage: Standard PCA is high (stores the full covariance matrix); Randomized PCA is low (works with a subset of the data); Kernel PCA is high (stores the kernel matrix).
• Scalability: Standard PCA is poor for big data; Randomized PCA is efficient for large datasets; Kernel PCA can be expensive for large datasets.
• Kernel Trick: Only Kernel PCA uses kernels (e.g., RBF, polynomial).
• Works Best For: Standard PCA suits small to medium datasets with linear relationships; Randomized PCA suits large, high-dimensional datasets; Kernel PCA suits nonlinear datasets with complex patterns.
• Example Use Cases: Standard PCA for image compression, eigenfaces, and financial data analysis; Randomized PCA for big data applications (e.g., NLP, genomics); Kernel PCA for clustering, face recognition, and non-Euclidean data.
• Implementation in Scikit-Learn: PCA(n_components=k); PCA(n_components=k, svd_solver="randomized"); KernelPCA(n_components=k, kernel="rbf").
