Exploring Unsupervised Learning Algorithms with the Iris Dataset

General Questions of Understanding

1. Main steps in clustering with the Iris dataset:

○ Preprocess the features, for example by normalizing or scaling them and cleaning the data.
○ Select a clustering algorithm and its parameters, such as k for K-means or eps and min_samples for DBSCAN.
○ Fit the chosen algorithm to the dataset.
○ Visualize the resulting clusters.
○ Evaluate clustering quality, for example with the silhouette score, or by comparing the results with the actual labels in case they are available (a minimal pipeline is sketched below).
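
A minimal sketch of this pipeline, assuming scikit-learn is available; the choice of K-means with k=3 is illustrative, not a recommendation:

    # Steps: load, scale, cluster, evaluate (visualization omitted here).
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, adjusted_rand_score

    X, y = load_iris(return_X_y=True)

    # 1. Preprocess: scale features to zero mean and unit variance.
    X_scaled = StandardScaler().fit_transform(X)

    # 2-3. Choose an algorithm and fit it.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

    # 4. Evaluate: an internal metric, plus an external one since Iris has labels.
    print("silhouette:", silhouette_score(X_scaled, labels))
    print("ARI vs. true species:", adjusted_rand_score(y, labels))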

2. Determining clusters without predefined labels: Clustering algorithms locate groups by exploiting patterns in the data, such as local density (e.g. DBSCAN), distance (e.g. hierarchical clustering), or statistical structure (e.g. GMM). Points with similar feature values end up grouped together.

3. Iris dataset suitability:

○ It has clearly separated clusters, at least in the petal and sepal measurements.
○ It is small, so it is computationally easy to handle.
○ Its inherent structure and known species labels make it well suited for evaluating clustering algorithms.

K-means Clustering

1. Optimal cluster centroids: K-means iteratively re-computes centroids, alternating between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, thereby minimizing the sum of squared distances between the points and their centroids.

2. Significance of k and selection:

○ k is the number of clusters. Choosing the right k is important and can be guided by methods such as the elbow method, silhouette analysis, or the gap statistic.
3. Handling overlapping clusters: K-means assumes clusters are spherical and do not overlap. Because it assigns points purely by distance to the nearest centroid, it may misassign points that lie in an overlapping region.

4. Effect of different initial centroids: K-means is sensitive to the initial positions of the centroids and can converge to different results. Techniques such as k-means++ initialization improve consistency (see the sketch below).
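
A hedged sketch of choosing k with the elbow method while using k-means++ initialization; the range of k values is illustrative:

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    X = StandardScaler().fit_transform(load_iris().data)

    # Inertia (sum of squared distances to the centroids) for a range of k;
    # the "elbow" where the curve flattens suggests a reasonable k.
    for k in range(1, 8):
        km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
        km.fit(X)
        print(k, round(km.inertia_, 1))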

DBSCAN

1. Parameters eps and min_samples:

○ eps specifies the neighborhood radius that determines which points count as neighbors.
○ min_samples is the minimum number of points required within that radius to form a dense region (a core point).
2. Data distributions suitable for DBSCAN:

○ Well suited to arbitrarily shaped clusters and data containing some noise.
○ Struggles with datasets whose clusters have dissimilar densities, and with high-dimensional data.

3. Noise identification: DBSCAN labels points that do not belong to any cluster as noise. Unlike K-means, it does not force every point into a cluster, which makes it less sensitive to outliers (a short example follows).
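
A sketch of DBSCAN on the scaled Iris data, assuming scikit-learn; the eps and min_samples values below are plausible starting points rather than tuned recommendations:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    X = StandardScaler().fit_transform(load_iris().data)
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

    # scikit-learn marks noise points with the label -1.
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"{n_clusters} clusters, {n_noise} noise points")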

Hierarchical Clustering

1. Agglomerative vs. divisive:

○ Agglomerative: Each point begins as its own cluster, and clusters are merged successively.
○ Divisive: All points begin in one cluster, which is split recursively.
2. Linkage method effects:

○ Single linkage: Merges based on the nearest pair of points and can build elongated, chain-like clusters.
○ Complete linkage: Merges based on the farthest pair of points, yielding compact clusters.
○ Average linkage: Balances between the two.

3. Deciding the number of clusters: The number of clusters is determined by cutting the dendrogram horizontally at a chosen height, usually across the largest vertical distance between successive merges (as sketched below).
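
A sketch of hierarchical clustering using SciPy, whose dendrogram utilities make the cut height explicit; the "average" linkage choice is illustrative:

    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler

    X = StandardScaler().fit_transform(load_iris().data)

    # Build the merge tree; each row of Z records one merge and its distance.
    Z = linkage(X, method="average")

    # Cutting the tree into a fixed number of clusters (here 3) is equivalent
    # to cutting the dendrogram at the corresponding height.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels[:10])

    # To inspect the tree visually (requires matplotlib):
    # import matplotlib.pyplot as plt; dendrogram(Z); plt.show()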

Mean Shift Clustering

1. Cluster centroids: Mean Shift locates density peaks, known as modes, by iteratively shifting points in the direction of higher point density; each mode becomes a cluster centroid.

2. Bandwidth parameter: The bandwidth dictates the sphere of influence of each point and strongly affects the number of clusters: a small bandwidth produces many clusters, while a large bandwidth merges them into few.

3. Handling non-spherical clusters: In contrast to K-means, Mean Shift is not limited to finding clusters of a particular geometry and can capture clusters of arbitrary shape (see the sketch below).
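
A sketch of Mean Shift on Iris, assuming scikit-learn; estimate_bandwidth provides a data-driven bandwidth, and the quantile value is illustrative:

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import MeanShift, estimate_bandwidth

    X = StandardScaler().fit_transform(load_iris().data)

    # Larger quantile -> larger bandwidth -> fewer clusters.
    bw = estimate_bandwidth(X, quantile=0.2)
    ms = MeanShift(bandwidth=bw).fit(X)
    print("clusters found:", len(ms.cluster_centers_))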

Gaussian Mixture Models (GMM)

1. Data modeling: GMM models the data as a weighted mixture of several Gaussian densities, giving a probabilistic (soft) clustering method.

2. Role of EM algorithm: The Expectation-Maximization algorithm alternates between computing soft cluster assignments (E-step) and updating the parameters of each Gaussian component to maximize the likelihood (M-step), iterating until convergence.

3. Handling varying shapes and sizes: An advantage of GMM is its ability to model elliptical clusters of varying shape, size, and orientation, thanks to its probabilistic approach and the flexibility of its covariance matrices (a sketch is given below).
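
A sketch of a GMM fit with full covariance matrices, which let each component take its own elliptical shape; n_components=3 is illustrative, since Iris has three species:

    from sklearn.datasets import load_iris
    from sklearn.mixture import GaussianMixture

    X = load_iris().data

    gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(X)

    # Soft assignments: each row gives the probability of each component.
    print(gmm.predict_proba(X[:3]).round(3))
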
Comparative Analysis

1. Clustering results on Iris dataset:

○ K-means can struggle with the overlapping categories.
○ DBSCAN can discover noise but struggles with clusters of differing densities.
○ Hierarchical clustering facilitates the assessment of relationships through an easy-to-understand tree diagram (dendrogram).
○ GMM is flexible in the shapes of the clusters.

2. Best algorithm for Iris species: Performance varies with the chosen parameters, but GMM's main benefit may lie in how well it separates the clusters, because it models them with probability distributions (a comparison sketch follows the next list).

3. Strengths and weaknesses:

○ K-means: Fast and simple; does not work well with non-spherical clusters.
○ DBSCAN: Handles noise well; requires careful parameter tuning.
○ Hierarchical: Interpretable; computationally expensive on large datasets.
○ GMM: Flexible; computationally heavier.
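
A sketch comparing the four algorithms via the Adjusted Rand Index against the true species labels, assuming scikit-learn; all parameters are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
    from sklearn.mixture import GaussianMixture
    from sklearn.metrics import adjusted_rand_score

    X, y = load_iris(return_X_y=True)
    Xs = StandardScaler().fit_transform(X)

    models = {
        "K-means": KMeans(n_clusters=3, n_init=10, random_state=0),
        "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
        "Hierarchical": AgglomerativeClustering(n_clusters=3),
        "GMM": GaussianMixture(n_components=3, random_state=0),
    }
    for name, model in models.items():
        labels = model.fit_predict(Xs)
        print(f"{name}: ARI = {adjusted_rand_score(y, labels):.3f}")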

Visualization and Evaluation

1. Visualization:

○ Dimensionality-reduction methods such as PCA or t-SNE can project the data into 2D or 3D space for plotting.
○ Clusters can be shown as scatter plots with different colors or markers.
2. Evaluation metrics:

○ Silhouette score: Measures how well each point fits its own cluster compared to the nearest neighboring cluster.
○ Davies-Bouldin index: Assesses cluster quality from the ratio of within-cluster scatter to between-cluster separation (lower is better).
○ Adjusted Rand Index or Normalized Mutual Information: Compare the clusters against ground-truth labels when a labeled dataset is available (a combined sketch follows).
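
A sketch combining both steps, assuming scikit-learn and matplotlib; the K-means labels used here are illustrative:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                                 adjusted_rand_score)

    X, y = load_iris(return_X_y=True)
    Xs = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)

    # 2D projection for a colored scatter plot, one color per cluster.
    X2 = PCA(n_components=2).fit_transform(Xs)
    plt.scatter(X2[:, 0], X2[:, 1], c=labels)
    plt.xlabel("PC1"); plt.ylabel("PC2")
    plt.show()

    print("silhouette:", silhouette_score(Xs, labels))
    print("Davies-Bouldin:", davies_bouldin_score(Xs, labels))
    print("ARI:", adjusted_rand_score(y, labels))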
