MACHINE LEARNING WITH PYTHON
SEMESTER 5
UNIT - 4
HI COLLEGE
UNSUPERVISED LEARNING ALGORITHMS:
INTRODUCTION TO CLUSTERING
Clustering is a fundamental technique in unsupervised learning that involves
grouping similar data points together. It is used to explore and uncover
patterns or structures within a dataset without any predefined labels or target
variables. The goal of clustering is to partition the data into distinct groups, or
clusters, such that data points within the same cluster are more similar to each
other than to data points in different clusters.
K-MEANS CLUSTERING
K-means clustering is a popular and widely used algorithm for partitioning data
into distinct clusters. It is an iterative algorithm that aims to minimize the
within-cluster variance, i.e., the sum of squared distances between each data
point and its assigned cluster centroid.
1. Initialization:
Begin by choosing the number of clusters, denoted as 'k', that you want to
identify in your data.
Randomly initialize the centroids of these 'k' clusters by selecting 'k' data
points from the dataset.
2. Assignment:
For each data point, calculate the distance between the data point and
each centroid.
Assign the data point to the cluster associated with the nearest centroid.
This is typically done based on Euclidean distance, but other distance
metrics can also be used.
3. Update:
Once all data points are assigned to a cluster, compute the new centroid of
each cluster. Each centroid is calculated as the mean of all the data points
assigned to that cluster.
4. Iteration:
Repeat steps 2 and 3 until convergence is achieved. Convergence occurs
when the assignments of data points to clusters no longer change
significantly or when a maximum number of iterations is reached.
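As a concrete illustration of these steps, here is a minimal sketch using scikit-learn's KMeans class. The synthetic blob data, the choice of k = 3, and the other parameter values are assumptions made only for this example, not part of the algorithm itself.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three natural groupings (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the 'k' chosen in step 1; n_init repeats the random
# initialization several times and keeps the best run
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # steps 2-4: assign, update, iterate

print(kmeans.cluster_centers_)   # final centroid of each cluster
print(kmeans.inertia_)           # within-cluster sum of squared distances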
HIERARCHICAL CLUSTERING
Hierarchical clustering is a clustering algorithm that builds a hierarchical
structure of clusters by iteratively merging or splitting clusters. It does not
require a predefined number of clusters, unlike the K-means algorithm.
At each level of the hierarchy, every data point belongs to some cluster, and the
sequence of merges or splits forms a tree-like structure called a dendrogram.
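As a rough sketch, the bottom-up (agglomerative) variant can be run with scikit-learn as shown below; the toy dataset, the Ward linkage, and the cut at three clusters are assumptions for illustration only.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data (illustrative only)
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Clusters are merged pairwise according to the linkage criterion; cutting the
# hierarchy at 3 clusters yields flat labels that can be inspected directly
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print(labels[:10])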
SELF-ORGANIZING MAPS (SOM)
Self-organizing maps (SOMs) are often used for visualizing and analyzing
complex, high-dimensional data. The algorithm maps the input data onto a grid
of neurons, each with an associated weight vector. The grid can have any
shape, but it is usually organized as a two-dimensional grid.
1. Initialization:
Randomly initialize the weight vectors of the neurons in the grid. These
weight vectors have the same dimensionality as the input data.
2. Training:
Select a random input vector from the dataset.
Compute the Euclidean distance between the input vector and the weight
vectors of all neurons.
Identify the best-matching unit (BMU) - the neuron with the closest weight
vector to the input vector.
3. Update:
Update the weight vectors of the BMU and its neighboring neurons to make
them more similar to the input vector.
The magnitude of the update depends on the learning rate, which is initially
high and gradually decreases over time.
The neighborhood size also decreases over time, allowing the algorithm to
refine the representation.
4. Iteration:
Repeat steps 2 and 3 for a specified number of iterations or until
convergence is achieved.
After training, the SOM represents the data in a low-dimensional grid where
similar input vectors are placed close together. This allows for visual analysis of
the data and identification of clusters or patterns. Each neuron in the SOM grid
can be associated with a specific cluster or category.
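The loop below is a minimal NumPy sketch of the training procedure described above. The 10 x 10 grid, the linearly decaying learning rate and neighbourhood radius, and the random 3-dimensional inputs are all assumptions made to keep the example small.

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 3))               # e.g. 500 colours as 3-D input vectors

grid_h, grid_w = 10, 10
weights = rng.random((grid_h, grid_w, 3)) # step 1: random weight vectors

n_iter = 2000
sigma0, lr0 = 3.0, 0.5                    # initial neighbourhood radius / learning rate

for t in range(n_iter):
    x = data[rng.integers(len(data))]     # step 2: pick a random input vector

    # best-matching unit (BMU): neuron whose weight vector is closest to x
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)

    # learning rate and neighbourhood radius both decay over time
    frac = t / n_iter
    lr = lr0 * (1 - frac)
    sigma = sigma0 * (1 - frac) + 1e-3

    # step 3: pull the BMU and its grid neighbours towards the input vector,
    # weighted by a Gaussian of their distance to the BMU on the grid
    rows, cols = np.indices((grid_h, grid_w))
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2 * sigma ** 2))
    weights += lr * influence[..., None] * (x - weights)

print(weights.shape)   # trained 10 x 10 grid of 3-D weight vectors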
Key Properties of SOMs:
2. Unsupervised Learning: SOMs do not require labeled data and can discover
patterns and relationships in an unsupervised manner.
4. Robustness: SOMs are robust to noise, as a noisy input vector will still find its
place within the grid based on its relationship to other vectors.
COMMON CLUSTERING ALGORITHMS
1. K-means Clustering:
One of the most popular clustering algorithms.
Divides data points into K clusters, where K is a user-specified parameter.
Each data point is assigned to the cluster with the closest centroid.
The algorithm iteratively updates the cluster centroids until convergence.
2. Hierarchical Clustering:
As discussed earlier, this algorithm builds a hierarchy of clusters based on
the similarity between data points.
The choice of linkage method (e.g., single-linkage, complete-linkage,
average-linkage) and distance metric impacts the results.
Dendrograms can be used to visualize the clustering process.
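To make the role of the linkage method concrete, here is a short sketch using SciPy and Matplotlib; the toy dataset and the choice of average linkage are illustrative assumptions.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# 'average' linkage with Euclidean distance; substituting 'single' or
# 'complete' shows how the choice of linkage changes the merge order
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)                       # each merge is drawn as a joined branch
plt.title("Dendrogram (average linkage)")
plt.show()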
FEATURE SELECTION
1. Filter Methods:
Use statistical measures like correlation, chi-square, or mutual information
to rank features.
Select the top-ranked features based on a predefined threshold or a fixed
number.
2. Wrapper Methods:
Involve evaluating the performance of different feature subsets using an
external machine learning algorithm.
Search for the best subset of features, typically through a backward or
forward selection process.
3. Embedded Methods:
Incorporate feature selection directly into the learning algorithm.
Model-specific techniques that determine feature importance during the
training phase, e.g., LASSO regularization or decision tree-based importance.
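The snippet below sketches one example of each of the three styles above on a single dataset with scikit-learn. The breast-cancer dataset, the decision to keep 10 features, and the specific estimators are assumptions chosen only to keep the example short.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# 1. Filter: rank features by mutual information and keep the top 10
filt = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
print("filter keeps", filt.get_support().sum(), "features")

# 2. Wrapper: recursive feature elimination driven by an external estimator
wrap = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)
print("wrapper keeps", wrap.support_.sum(), "features")

# 3. Embedded: an L1 (LASSO-style) penalty zeroes out unhelpful coefficients
emb = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("embedded keeps", (emb.coef_ != 0).sum(), "features")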