Unit 4
Clustering is a type of unsupervised machine learning technique where data is grouped into
clusters based on similarity. The goal is to find inherent patterns or structures within data.
Clustering methods can be broadly categorized into three types: Partitioned Clustering,
Hierarchical Clustering, and Density-based Clustering.
1. Partitioned Clustering
In partitioned clustering, the data is divided into a set of distinct, non-overlapping clusters. Each data point is assigned to exactly one cluster.
● K-medoids Clustering
● K-modes Clustering (its process is described later in this unit)
2. Hierarchical Clustering
Hierarchical clustering builds a tree-like structure (dendrogram) that groups data based on similarity. This method does not require the number of clusters to be defined upfront.
The similarity between clusters can be calculated using different linkage methods (e.g.,
single-linkage, complete-linkage, average-linkage).
3. Density-Based Clustering
Density-based clustering identifies clusters based on the density of points in the data space.
It is well-suited for identifying clusters of arbitrary shape and handling outliers effectively.
K-Means Clustering
K-means clustering is a partitioned clustering technique that groups numerical data into K clusters. It minimizes a clustering cost function (the sum of squared distances) to form well-defined clusters.
K-means uses a similarity measure (typically Euclidean distance) and iteratively adjusts
cluster centers (centroids) to minimize the total variance within clusters.
Let X = {x1, x2, …, xn} be the set of data points and k the number of clusters.
Steps:
1. Initialize Cluster Centers
○ Arbitrarily select k points as the initial cluster centers (centroids) and compute the distance between each data point and each center.
2. Assign Points to Clusters
○ Assign each data point to the cluster whose center is nearest.
3. Update Cluster Centers
○ Recompute each cluster center as the mean of the points assigned to it.
4. Recalculate Distances
○ Compute the distance between each data point and the updated cluster centers.
5. Check for Convergence
○ If no data point has changed its cluster assignment, stop.
○ Otherwise, repeat from Step 2.
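To make these steps concrete, here is a minimal K-means sketch in Python/NumPy; the random 2-D data and the value of k are illustrative assumptions rather than part of the notes:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily pick k data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Steps 2 and 4: Euclidean distance of every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)        # assign to nearest center
        # Step 5: stop when no assignment changed
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Illustrative usage on random 2-D data
X = np.random.default_rng(1).normal(size=(60, 2))
labels, centers = kmeans(X, k=3)
print(labels[:10], centers)
```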
The k in k-means clustering represents the number of clusters into which the data is divided.
Significance of K: K controls the granularity of the result; too small a value merges distinct groups, while too large a value splits natural groups into fragments, so choosing K carefully (see the Elbow and Silhouette methods below) is critical.
Key Points
● Objective: Minimize the sum of squared distances between points and their
assigned cluster centroids.
● Complexity: Iterative algorithm with a per-iteration time complexity of O(n×k×d), where n is the number of points, k is the number of clusters, and d is the dimensionality.
● Convergence: Guaranteed to converge, but may reach a local minimum depending
on the initial cluster centers.
Advantages
● Simple to understand and implement.
● Computationally efficient and scales well to large datasets.
PROBLEMS:
1) The data mining task is to cluster the points into three clusters, where the points are A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The distance function is Manhattan distance. Suppose initially we assign A1, B1, C1 as the center of each cluster. Use the k-means algorithm to show only: a) the three clusters after the first round of execution, and b) the final three clusters.
https://www.youtube.com/watch?v=KzJORp8bgqs&list=PL4gu8xQu0_5KiYnRlueicckEmpFAiRD5Y&index=5
2) https://www.youtube.com/watch?v=nWmeC61DgBo&list=PL4gu8xQu0_5KiYnRlueicckEmpFAiRD5Y&index=8
L1 distance (Manhattan distance): d(p, q) = |p1 − q1| + |p2 − q2|. A worked sketch of Problem 1 using this distance is given below.
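As a check on the hand computation for Problem 1, the same iteration can be run in code. This sketch uses the Manhattan (L1) distance and the initial centers A1, B1, C1 from the problem statement; updating centers as cluster means follows the usual hand-worked treatment of this exercise:

```python
import numpy as np

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}
names = list(points)
X = np.array([points[n] for n in names], dtype=float)
# initial centers: A1, B1, C1
centers = X[[names.index("A1"), names.index("B1"), names.index("C1")]].copy()

for rnd in range(1, 10):
    # Manhattan (L1) distance from every point to every center
    dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    clusters = [[names[i] for i in range(len(names)) if labels[i] == c] for c in range(3)]
    print(f"Round {rnd}: {clusters}")
    # update centers as cluster means, as in the hand-worked solution
    new_centers = np.array([X[labels == c].mean(axis=0) for c in range(3)])
    if np.allclose(new_centers, centers):
        break          # centers stable: these are the final three clusters
    centers = new_centers
```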
Selecting the optimal number of clusters (K) is crucial in clustering algorithms like
K-means. Two commonly used methods are:
https://www.youtube.com/watch?v=wW1tgWtkj4I
1. Elbow Method
The Elbow Method runs K-means for a range of K values and plots the within-cluster sum of squares (WCSS, also called inertia) against K; the point where the curve bends sharply (the "elbow") indicates a suitable number of clusters.
2. Silhouette Method
The Silhouette Method evaluates the quality of clusters by measuring how similar data points are to their own cluster compared to other clusters.
https://www.youtube.com/watch?v=FGXkbawTHRQ&list=PL4gu8xQu0_5KiYnRlueicckEmpFAiRD5Y&index=27
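A brief sketch of both methods, assuming scikit-learn is available; the synthetic blob data and the range of K values are illustrative choices:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Synthetic data with a ground truth of 4 blobs (unknown to the algorithm)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    # Elbow method: look for the K where inertia stops dropping sharply.
    # Silhouette method: prefer the K with the highest silhouette score.
    print(f"K={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")
```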
K-Modes Clustering (Process)
1. Initialize Modes
○ Arbitrarily select k objects as initial cluster modes.
2. Assign Objects to Clusters
○ Assign each object to the cluster with the most similar mode based on the
least number of mismatches (dissimilarity).
3. Update Modes
○ For each cluster, update the mode by finding the most frequently occurring
value for each attribute (column).
4. Repeat Until Convergence
○ Repeat steps 2 and 3 until the clusters remain unchanged in two consecutive
iterations.
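A rough from-scratch sketch of these K-modes steps; the small categorical dataset and the value of k are made up for illustration:

```python
import numpy as np

def kmodes(X, k, max_iters=20, seed=0):
    """X: (n, m) array of categorical values."""
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily select k objects as the initial modes
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Step 2: assign each object to the mode with the fewest mismatches
        mismatches = np.array([[np.sum(x != m) for m in modes] for x in X])
        new_labels = mismatches.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # Step 4: unchanged, so stop
            break
        labels = new_labels
        # Step 3: update each mode with the most frequent value per attribute
        for c in range(k):
            members = X[labels == c]
            if len(members):
                modes[c] = [max(set(col), key=list(col).count)
                            for col in members.T]
    return labels, modes

# Illustrative categorical data: (colour, size, shape)
X = np.array([["red", "S", "round"], ["red", "S", "round"],
              ["blue", "L", "square"], ["blue", "M", "square"],
              ["green", "L", "round"], ["blue", "L", "round"]])
print(kmodes(X, k=2))
```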
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Purpose: DBSCAN clusters data points based on density, identifying core points, noise, and
clusters without needing to specify the number of clusters in advance.
Key Concepts:
● ε (eps): the radius of the neighborhood considered around each point.
● MinPts: the minimum number of points required within ε for a region to count as dense.
● Core point: a point with at least MinPts points (including itself) within distance ε.
● Border point: a point within ε of a core point that does not itself have MinPts neighbors.
● Noise: a point that is neither a core point nor a border point.
Ex1. https://www.youtube.com/watch?v=-p354tQsKrs&list=PL4gu8xQu0_5KiYnRlueicckEmpFAiRD5Y&index=9
Ex2. https://www.youtube.com/watch?v=ZOLYaa9Jex0&list=PL4gu8xQu0_5KiYnRlueicckEmpFAiRD5Y&index=10
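A minimal DBSCAN sketch, assuming scikit-learn is available; the moon-shaped data and the eps and min_samples values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape K-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

# eps = neighbourhood radius, min_samples = MinPts; both are assumed values
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                 # cluster ids; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, " noise points:", np.sum(labels == -1))
```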
Self-Organizing Maps (SOMs), also known as Kohonen Maps, are a type of artificial
neural network used for unsupervised learning. They are mainly used for dimensionality
reduction and visualization of high-dimensional data.
SOMs map high-dimensional input data onto a (usually 2D) grid of neurons. Each neuron in
the grid has a weight vector of the same dimension as the input data, and the goal is to train
the network such that similar data points in the input space are mapped close to each other
on the grid.
Advantages of SOMs:
Disadvantages of SOMs:
Applications of SOMs:
● Data Visualization
● Clustering
● Feature Mapping
● Pattern Recognition
EX: https://www.youtube.com/watch?v=InVvyioWDlw&list=PL4gu8xQu0_5JK6KmQi-Qx5hI3W13RJbDY&index=24
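Since the notes describe SOM training only at a high level, the following is a rough from-scratch sketch of one common training loop; the grid size, learning-rate and neighbourhood schedules, and the data are illustrative assumptions:

```python
import numpy as np

def train_som(X, grid=(8, 8), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a simple Kohonen map: X is (n, d); returns weights (rows, cols, d)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows, cols, X.shape[1]))          # one weight vector per neuron
    # grid coordinates of every neuron, used by the neighbourhood function
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    n_steps = epochs * len(X)
    t = 0
    for _ in range(epochs):
        for x in rng.permutation(X):
            lr = lr0 * np.exp(-t / n_steps)           # decaying learning rate
            sigma = sigma0 * np.exp(-t / n_steps)     # shrinking neighbourhood
            # best matching unit (BMU): neuron whose weights are closest to x
            dists = np.linalg.norm(W - x, axis=2)
            bmu = np.unravel_index(dists.argmin(), dists.shape)
            # Gaussian neighbourhood around the BMU on the 2-D grid
            grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
            h = np.exp(-grid_dist2 / (2 * sigma ** 2))
            # pull neighbouring neurons' weights towards the input
            W += lr * h[:, :, None] * (x - W)
            t += 1
    return W

# Illustrative usage: map 3-D points (e.g. RGB colours) onto an 8x8 grid
X = np.random.default_rng(1).random((200, 3))
print(train_som(X).shape)   # (8, 8, 3)
```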
EM Algorithm (Expectation-Maximization Algorithm)
The EM algorithm iteratively estimates the parameters of the model by alternating between
two steps: Expectation (E-step) and Maximization (M-step).
1. Initialization
○ Start with initial guesses for the parameters of the model (e.g., means,
variances, and mixing coefficients in a GMM).
2. E-step (Expectation Step)
○ Given the current parameter estimates, calculate the expected value of the
latent (hidden) variables based on the observed data.
○ This step computes the probability of the data belonging to different clusters
(or distributions) using the current parameters.
3. M-step (Maximization Step)
○ Update the parameters of the model by maximizing the likelihood function (or
minimizing the negative log-likelihood) based on the expected values
calculated in the E-step.
4. Repeat
○ Repeat the E-step and M-step until the parameters converge (i.e., the change
in the parameter values between iterations is below a threshold).
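As an illustration of the E-step and M-step, here is a rough sketch of EM for a two-component 1-D Gaussian mixture; the synthetic data and the initial guesses are made up for the example:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (the "hidden" structure)
x = np.concatenate([rng.normal(-2, 1.0, 150), rng.normal(3, 1.5, 100)])

# 1. Initialization: rough guesses for means, variances, mixing weights
mu, sigma, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for it in range(100):
    # 2. E-step: responsibility of each component for each point
    dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)                  # shape (2, n)

    # 3. M-step: re-estimate parameters from the responsibilities
    Nk = resp.sum(axis=1)
    new_mu = (resp * x).sum(axis=1) / Nk
    new_sigma = np.sqrt((resp * (x - new_mu[:, None]) ** 2).sum(axis=1) / Nk)
    new_pi = Nk / len(x)

    # 4. Repeat until the parameters stop changing
    if np.max(np.abs(new_mu - mu)) < 1e-6:
        break
    mu, sigma, pi = new_mu, new_sigma, new_pi

print("means:", mu, "stds:", sigma, "weights:", pi)
```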
Applications of the EM Algorithm:
● Clustering
● Missing Data
● Image and Speech Recognition
● Hidden Markov Models (HMMs)
Limitations of the EM Algorithm:
● Local Optima: It is sensitive to the initial parameter estimates and may converge to
a local optimum.
● Computational Complexity: Can be computationally expensive, especially for large
datasets and models with many parameters.
● Convergence Speed: The algorithm may require many iterations to converge.
Example:
You have a bag with candies of different colors (e.g., Red, Green, and Blue), but you
don't know the exact proportion of each color. You use the EM algorithm to estimate
these proportions.
Step 1: E-step (guess the hidden colors)
1. You take a candy out of the bag without looking at its color.
2. A friend guesses the color of the candy based on prior knowledge, providing both a
guess and their confidence level for each color.
3. For example:
a. First candy: 80% chance Red, 10% Green, 10% Blue.
b. Second candy: 30% Red, 60% Green, 10% Blue.
c. Third candy: 20% Red, 10% Green, 70% Blue.
Step 2: M-step (update the estimates)
You use your friend's guesses to estimate the proportion of each color in the bag: average the confidence values for each color across all candies (sum them and divide by the number of candies):
Red: (0.8 + 0.3 + 0.2) / 3 = 0.43 (43%)
Green: (0.1 + 0.6 + 0.1) / 3 = 0.27 (27%)
Blue: (0.1 + 0.1 + 0.7) / 3 = 0.30 (30%)
Update your estimates of the color proportions in the bag based on these calculations.
Step 3: Repeat
● Go back to the E-step and continue the process with updated guesses and
estimates.
● Over multiple iterations, the guesses and estimates converge to the actual
proportions of candies in the bag.
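The averaging step above can be reproduced in a couple of lines; the confidence values are the ones given in the example:

```python
import numpy as np

# rows = candies, columns = (Red, Green, Blue) confidences from the friend
resp = np.array([[0.8, 0.1, 0.1],
                 [0.3, 0.6, 0.1],
                 [0.2, 0.1, 0.7]])

# M-step: new colour proportions = average responsibility per colour
proportions = resp.mean(axis=0)
print(proportions)   # approximately [0.43, 0.27, 0.30]
```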
AGGLOMERATIVE CLUSTERING:
Linkage refers to the methods used to determine the distance between clusters in
hierarchical clustering algorithms. These methods define how to compute the "distance"
between two clusters based on the individual distances between their members.
1. Single Linkage: In single linkage clustering, the distance between two clusters is
defined as the minimum distance between any single pair of points in the two clusters.
Advantages:
○ Can detect elongated or irregularly shaped clusters.
○ Conceptually simple and easy to compute.
Disadvantages:
○ Sensitive to noise and outliers (since a single pair of points can influence the
distance calculation).
○ May result in chaining, where clusters grow by connecting distant points.
https://www.youtube.com/watch?v=tXYAdGn-SuM
2. Complete Linkage: In complete linkage clustering, the distance between two clusters is defined as the maximum distance between any pair of points in the two clusters.
https://www.youtube.com/watch?v=0A0wtto9wHU&t=25s
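A brief sketch comparing single and complete linkage, assuming SciPy is available; the 2-D points are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D points: two loose groups plus an in-between point
X = np.array([[1, 2], [2, 2], [2, 3], [8, 8], [8, 9], [9, 8], [5, 5]], dtype=float)

for method in ("single", "complete"):
    # linkage builds the dendrogram bottom-up using the chosen cluster distance
    Z = linkage(X, method=method)      # each row: (cluster i, cluster j, distance, size)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, "->", labels)
```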
Principal Component Analysis (PCA)
PCA is a linear dimensionality reduction technique that projects data onto the directions of greatest variance (the principal components).
Mathematical Formulation:
1. Given a dataset X with n data points and d dimensions, we want to reduce the data to a lower k-dimensional space.
2. The covariance matrix C is computed as C = (1/n) Σᵢ (xᵢ − μ)(xᵢ − μ)ᵀ, where μ is the mean of the data points.
3. The eigenvalues and eigenvectors of C are computed; the eigenvectors with the k largest eigenvalues are the principal components.
4. The mean-centered data is projected onto these k eigenvectors to obtain the reduced representation.
Applications of PCA:
Advantages of PCA:
Disadvantages of PCA:
Ex1 :https://www.youtube.com/watch?v=ZtS6sQUAh0c
Ex2: https://www.youtube.com/watch?v=XO0US1aTA50
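The formulation above maps directly onto code. A minimal NumPy sketch follows; the random data and the choice of k = 2 are illustrative assumptions:

```python
import numpy as np

def pca(X, k):
    """Project X (n x d) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                    # centre the data (subtract the mean)
    C = (Xc.T @ Xc) / len(X)                   # covariance matrix C (d x d)
    eigvals, eigvecs = np.linalg.eigh(C)       # eigendecomposition (C is symmetric)
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    components = eigvecs[:, order]             # principal components (d x k)
    return Xc @ components, eigvals[order]

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced, variances = pca(X, k=2)
print(X_reduced.shape, variances)
```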
Kernel Principal Component Analysis (Kernel PCA) is an extension of PCA that allows
non-linear dimensionality reduction by mapping data into a higher-dimensional space using
kernel functions (like RBF, polynomial) before applying PCA. It is useful for data with
non-linear relationships, where traditional PCA fails.
Steps in Kernel PCA:
1. Kernel Mapping: Use a kernel function (e.g., RBF) to map the data to a
higher-dimensional space without explicitly transforming the data.
2. Compute the Kernel Matrix: Calculate the pairwise kernel values for all data points.
3. Center the Kernel Matrix: Subtract the mean row and column from the kernel
matrix.
4. Eigenvalue Decomposition: Perform eigenvalue decomposition to find the principal
components in the feature space.
5. Projection: Project the data onto the selected principal components to reduce
dimensionality.
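A rough NumPy sketch of these Kernel PCA steps with an RBF kernel; the gamma value and the random data are illustrative assumptions:

```python
import numpy as np

def rbf_kernel_pca(X, k=2, gamma=1.0):
    n = len(X)
    # Steps 1-2: pairwise RBF kernel values, no explicit high-dimensional mapping
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-gamma * sq_dists)
    # Step 3: centre the kernel matrix in feature space
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Step 4: eigendecomposition of the centred kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    alphas, lambdas = eigvecs[:, order], eigvals[order]
    # Step 5: project the training points onto the principal components
    return Kc @ (alphas / np.sqrt(lambdas))

X = np.random.default_rng(0).normal(size=(50, 3))
print(rbf_kernel_pca(X, k=2).shape)   # (50, 2)
```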
Advantages:
Disadvantages:
● Computationally expensive.
● Choice of kernel function can affect performance.
● Results may be hard to interpret in the high-dimensional space.
Applications:
t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is a powerful non-linear dimensionality reduction technique primarily used for the
visualization of high-dimensional data in two or three dimensions. It is particularly effective
for visualizing clusters or patterns in complex datasets, such as those arising in machine
learning or bioinformatics.
● Similarity Measurement (High-dimensional space): t-SNE first converts pairwise distances between points into conditional probabilities using a Gaussian kernel:
Pij = exp(−||xi − xj||² / 2σ²) / Σ(k≠i) exp(−||xi − xk||² / 2σ²)
where Pij is the probability of point xj being a neighbor of point xi, and σ is a parameter that controls the width of the Gaussian.
● Similarity Measurement (Low-dimensional space):t-SNE then tries to embed the
points into a low-dimensional space (typically 2D or 3D), where the similarities are
modeled using a t-distribution with one degree of freedom (Cauchy distribution),
rather than a Gaussian distribution. The use of a t-distribution helps to separate
clusters better and reduces crowding problems.
Qij = (1 + ||yi − yj||²)⁻¹ / Σ(k≠l) (1 + ||yk − yl||²)⁻¹
where Qij represents the similarity between points yi and yj in the low-dimensional space.
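In practice t-SNE is usually run through a library. A minimal sketch assuming scikit-learn is available; the digits dataset and the parameter values are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional digit images, embedded into 2-D for visualization
X, y = load_digits(return_X_y=True)

# perplexity roughly controls the effective neighbourhood size
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)   # (1797, 2)
```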
Key Features:
Advantages:
Disadvantages: