U-5 IML Questions
1) Explain how clustering tasks differ from classification tasks and how
clustering defines groups. What are different types of clustering
techniques?
Ans:
Differences Between Clustering and Classification Tasks:
• Type of Learning: Clustering is unsupervised learning; classification is supervised learning.
• Goal: Clustering groups data points into clusters based on similarity or patterns in the data; classification assigns data points to predefined categories or classes.
• Input Data: Clustering uses no labeled data, only feature data; classification requires labeled data (features and their corresponding labels).
• Output: Clustering produces a set of clusters, each representing similar data points; classification predicts labels for data points based on the trained model.
• How Groups Are Defined: In clustering, groups are defined by the algorithm using similarity measures such as distance, density, or connectivity; in classification, groups are defined by the training process using labeled data and decision boundaries.
• Evaluation Metrics: Clustering uses internal metrics (e.g., silhouette score, Davies-Bouldin index) or external benchmarks; classification uses accuracy, precision, recall, F1-score, confusion matrix, etc.
• Applications: Clustering — market segmentation, anomaly detection, document clustering, image compression; classification — spam detection, fraud detection, medical diagnosis, sentiment analysis.
Types of Clustering Techniques (Simple Explanation):
Clustering is about grouping similar items together. Here are the main types of clustering
techniques explained simply:
1. Partitioning Clustering
• What it does: Divides the data into k groups (clusters) where each item belongs to one
group.
• Example:
o k-means: Groups data by finding cluster centers (called centroids) and putting items near them in the same group.
• When to use: If you know the number of clusters you want and the clusters are roughly round-shaped.
2. Hierarchical Clustering
• What it does: Makes a tree of clusters, where smaller clusters merge into bigger ones
or bigger ones split into smaller ones.
• Example:
o Agglomerative (Bottom-Up): Start with one item per cluster, then combine similar
clusters.
o Divisive (Top-Down): Start with one big cluster, then split into smaller ones.
• When to use: When you want to see clusters at different levels of detail.
3. Density-Based Clustering
• What it does: Finds clusters in areas where items are packed tightly and ignores areas
with few items (outliers).
• Example:
o DBSCAN: Groups items that are close together and marks isolated items as
noise.
• When to use: If clusters are oddly shaped or you have outliers.
4. Grid-Based Clustering
• What it does: Divides the data space into squares (grids) and groups items in the grids
that have many points.
• Example:
o STING: Groups based on density in grid cells.
• When to use: For very large datasets.
5. Model-Based Clustering
• What it does: Assumes the data is made up of different groups, each with a certain
shape or pattern (like bell-shaped).
• Example:
o Gaussian Mixture Models (GMM): Groups items based on probabilities that
they belong to a certain cluster.
• When to use: If you think clusters follow specific patterns (e.g., bell-shaped).
6. Spectral Clustering
• What it does: Groups items based on their relationships, like how connected they are
in a network.
• When to use: For data with complex shapes or relationships, like in social networks.
7. Fuzzy Clustering
• What it does: Lets an item belong to more than one cluster, with a percentage for each.
• Example:
o Fuzzy c-means: Groups items but allows overlap between groups.
• When to use: If items could naturally belong to multiple groups.
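The following short Python sketch (an illustration using scikit-learn on toy data; the make_blobs dataset and all parameter values are assumptions for demonstration, and grid-based and fuzzy methods are omitted because they are not part of scikit-learn) runs one representative algorithm from several of the families above:
```python
# Sketch: one representative algorithm from several clustering families,
# run on the same toy dataset (parameter values are illustrative, not tuned).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data

models = {
    "Partitioning (k-means)":       KMeans(n_clusters=3, n_init=10, random_state=42),
    "Hierarchical (agglomerative)": AgglomerativeClustering(n_clusters=3),
    "Density-based (DBSCAN)":       DBSCAN(eps=0.8, min_samples=5),
    "Model-based (GMM)":            GaussianMixture(n_components=3, random_state=42),
    "Spectral":                     SpectralClustering(n_clusters=3, random_state=42),
}

for name, model in models.items():
    labels = model.fit_predict(X)                 # every estimator here supports fit_predict
    print(name, "->", len(set(labels) - {-1}), "clusters found")  # -1 is DBSCAN noise
```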
2) In detail, explain Hierarchical clustering (What is Hierarchical
clustering? What are the types of Hierarchical clustering? With a neat
sketch, explain the Hierarchical Agglomerative algorithm. With a neat
sketch, explain the Hierarchical Divisive clustering algorithm.)
Ans:
Hierarchical Clustering:
Hierarchical clustering is a connectivity-based clustering model that groups the data points
together that are close to each other based on the measure of similarity or distance. The
assumption is that data points that are close to each other are more similar or related than
data points that are farther apart.
Types of Hierarchical Clustering:
There are two types: Agglomerative (bottom-up) and Divisive (top-down).
Hierarchical Agglomerative Clustering:
Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC), it produces a structure that is more informative than the unstructured set of clusters returned by flat clustering. This clustering algorithm does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
Steps (illustrated with points A to F):
• Treat each point as a single cluster and calculate the distance of every cluster from all the other clusters.
• Merge the most comparable clusters into a single cluster. Say cluster (B) and cluster (C) are very similar, so we merge them; similarly clusters (D) and (E). We are left with the clusters [(A), (BC), (DE), (F)].
• Recalculate the proximities according to the chosen distance measure and merge the two nearest clusters ((DE) and (F)) to form the new clusters [(A), (BC), (DEF)].
• Repeat until all points have been merged into a single cluster; the sequence of merges is drawn as a dendrogram.
Hierarchical Divisive Clustering:
Also known as the top-down approach, this algorithm likewise does not require us to prespecify the number of clusters. Top-down clustering starts with a single cluster containing the whole dataset and proceeds by splitting clusters recursively until every data point is in its own singleton cluster.
Computing the Distance Matrix:
While merging two clusters, we check the distance between every pair of clusters and merge the pair with the least distance (most similarity). The question is how that distance is determined. There are different ways of defining inter-cluster distance/similarity; some of them are:
1. Min Distance (single linkage): the minimum distance between a point in one cluster and a point in the other.
2. Max Distance (complete linkage): the maximum distance between a point in one cluster and a point in the other.
3. Group Average: the average distance between every pair of points taken one from each cluster.
4. Ward's Method: the similarity of two clusters is based on the increase in squared error when the two clusters are merged.
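As an illustration, the following SciPy sketch (the toy data points and the cut level are arbitrary assumptions) builds agglomerative clusterings with each of the linkage methods listed above:
```python
# Sketch: agglomerative clustering with SciPy; the 'linkage' methods correspond
# to the inter-cluster distance definitions above (single = min distance,
# complete = max distance, average = group average, ward = Ward's method).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)  # tiny toy data

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                      # bottom-up merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, labels)

# scipy.cluster.hierarchy.dendrogram(Z) can be used to draw the tree of merges.
```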
3) What is the Partitioning Method (K-Means) in Data Mining? Explain K-
Means.
Ans:
Partitioning Method:
This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It requires the data analyst to specify the number of clusters to be generated. Given a database D that contains N objects, the partitioning method constructs K user-specified partitions of the data, where each partition represents a cluster and a particular region.
K-Means (A Centroid-Based Technique):
The K-means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the similarity among the data objects inside a group (intra-cluster) is high, while the similarity between data objects in different clusters (inter-cluster) is low. The similarity of a cluster is determined with respect to the mean value of the cluster. It is a type of squared-error algorithm. At the start, K objects are chosen randomly from the dataset, each representing a cluster mean (centre).
Algorithm:
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat steps 2-3 until the assignments no longer change.
Example: Suppose we want to group the visitors to a website using just their age as
follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between iterations 3 and 4, so we stop. The K-Means algorithm therefore gives us two clusters: (16-29) and (36-66).
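A minimal plain-Python sketch of this worked example is shown below; the initial centroids are fixed at 16 and 22 to match the example above rather than chosen randomly, as a real implementation would do:
```python
# 1-D K-means sketch reproducing the age example above.
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids = [16.0, 22.0]          # fixed to match the worked example

while True:
    # Assignment step: put each age into the cluster of the nearest centroid.
    clusters = [[] for _ in centroids]
    for a in ages:
        nearest = min(range(len(centroids)), key=lambda i: abs(a - centroids[i]))
        clusters[nearest].append(a)
    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:    # stop when the means no longer change
        break
    centroids = new_centroids

print(centroids)   # ~[20.5, 48.89], matching iterations 3-4 above
print(clusters)    # the (16-29) and (36-66) clusters
```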
4) (a) While performing K-means clustering, how do you determine the
value of K?
(b) How would you perform K-Means on very large data sets?
(c) How would you preprocess the data for K-Means?
(d) Briefly explain about K-Medoids.
Ans:
a. Determining the Value of K:
Elbow Method
• Process:
1. Perform K-means clustering for a range of K values (e.g., K = 1 to K = 10).
2. Compute the within-cluster sum of squares (WCSS) for each K. WCSS measures the total variance within clusters.
3. Plot K versus WCSS.
4. Identify the "elbow point" where the rate of decrease in WCSS changes sharply (i.e., the curve starts flattening). This point suggests the optimal K.
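A short scikit-learn sketch of the elbow method (the make_blobs toy data and the K range of 1 to 10 are illustrative assumptions) could look like this:
```python
# Sketch of the elbow method: plot WCSS (inertia) for K = 1..10 and look
# for the point where the curve starts flattening.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # toy data

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)      # inertia_ is the within-cluster sum of squares

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()                        # the 'elbow' suggests the optimal K
```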
b. Performing K-Means on very large datasets:
Performing K-means clustering on very large datasets requires modifications and optimizations to handle the computational and memory demands.
1. Use Mini-Batch K-Means
• Mini-Batch K-Means is a variation of K-Means that processes small, random subsets (mini-batches) of the dataset instead of the entire dataset at each iteration (see the sketch after this list).
• Advantages:
o Faster convergence.
o Significantly reduces memory usage.
• Drawback: Might slightly reduce clustering accuracy compared to full K-means.
2. Dimensionality Reduction
• High-dimensional data increases computational complexity. Reduce dimensionality
using techniques like:
o Principal Component Analysis (PCA).
o t-SNE (for visualization).
o Truncated SVD (for sparse data).
• Perform K-means on the reduced dataset.
• Benefits:
o Speeds up clustering.
o Can help reduce noise in the data.
3. Initialize with Efficient Methods
• The K-means++ initialization improves convergence speed by choosing initial
centroids that are spread out.
• Scalable initialization algorithms like Fast K-means++ can further enhance
performance on large datasets.
4. Cluster on a Sampled Subset
• Randomly sample a subset of the data for initial clustering.
• Use the resulting centroids to cluster the entire dataset.
• Steps:
1. Perform K-means on a random subset.
2. Use the obtained centroids to assign all data points.
• Tradeoff: Risk of missing rare patterns in the data.
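As referenced in point 1 above, a brief Mini-Batch K-Means sketch with scikit-learn (the toy dataset, batch size, and cluster count are illustrative assumptions) might look like this:
```python
# Sketch of Mini-Batch K-Means: centroids are updated one random mini-batch
# at a time instead of over the whole dataset.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000_000, centers=5, random_state=0)  # stand-in for a large dataset

mbk = MiniBatchKMeans(
    n_clusters=5,
    batch_size=10_000,   # size of each random mini-batch
    n_init=3,
    random_state=0,
)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_)
```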
c. Preprocessing the data for K-Means:
To preprocess data for K-Means, follow these steps:
1. Handle Missing Values
o Replace missing values with the mean, median, or mode of the respective column.
2. Normalize/Standardize the Data
o Scale features to the same range (e.g., using Min-Max scaling or Z-score standardization), since K-Means uses distance metrics that are sensitive to magnitude.
3. Remove Outliers (Optional)
o Identify and handle outliers, as they can distort cluster centroids. Use methods like z-scores or IQR filtering.
4. Convert Categorical Data
o If the dataset has categorical variables, encode them using one-hot encoding or label encoding.
5. Reduce Dimensionality (Optional)
o For high-dimensional data, reduce the number of features using PCA or similar techniques.
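The steps above can be combined into a single preprocessing pipeline. The sketch below uses scikit-learn with hypothetical column names ("age", "income", "city") purely for illustration:
```python
# Sketch: imputation + scaling for numeric columns, one-hot encoding for a
# categorical column, then K-Means at the end of the pipeline.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans

numeric = ["age", "income"]
categorical = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),   # fill missing values
                      ("scale", StandardScaler())]), numeric),        # z-score standardization
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),     # encode categories
])

model = Pipeline([("prep", preprocess),
                  ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0))])

df = pd.DataFrame({"age": [25, None, 40],
                   "income": [30_000, 52_000, None],
                   "city": ["A", "B", "A"]})
print(model.fit_predict(df))   # cluster label for each row
```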
d. K-Medoids:
K-Medoids is a clustering algorithm similar to K-Means, but instead of using the mean of
the data points to represent a cluster (as in K-Means), it uses an actual data point, called the
medoid, as the cluster center.
Key Points:
1. Medoid: The most representative data point in a cluster, minimizing the total
distance to other points in the cluster.
2. Distance Metric: It works with any distance metric (e.g., Euclidean, Manhattan),
making it more robust for non-numeric or irregular data.
3. Robust to Outliers: Since it uses actual data points as centers, it's less sensitive to
outliers compared to K-Means.
How It Works:
1. Initialize K random medoids (data points).
2. Assign each data point to the nearest medoid based on the chosen distance metric.
3. Update the medoids by selecting a new data point within the cluster that minimizes
the total distance to other points.
4. Repeat steps 2–3 until the medoids stabilize.
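A minimal, illustrative K-Medoids sketch is shown below (plain NumPy, restricted to 1-D data for brevity; this is not the full PAM algorithm, and a library such as scikit-learn-extra provides a ready-made KMedoids class):
```python
# Sketch of K-Medoids: alternate between assigning points to the nearest
# medoid and choosing, within each cluster, the member that minimizes the
# total distance to the other members.
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    dist = np.abs(X[:, None] - X[None, :])                 # pairwise distances (1-D data)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)    # step 1: random medoids
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)       # step 2: assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):                                 # step 3: best member becomes medoid
            members = np.where(labels == c)[0]
            if len(members) == 0:                          # keep old medoid if cluster is empty
                continue
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):           # step 4: stop when medoids stabilize
            break
        medoids = new_medoids
    return medoids, labels

ages = np.array([16, 16, 17, 20, 21, 29, 41, 43, 45, 62, 66], dtype=float)
medoids, labels = k_medoids(ages, k=2)
print(ages[medoids], labels)   # medoids are actual data points from the dataset
```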
5) What is DBSCAN? Describe the parameters required for DBSCAN
Algorithm. Briefly explain the steps used in DBSCAN Algorithm.
Ans:
DBSCAN:
Clustering analysis, or simply clustering, is an unsupervised learning method that divides the data points into a number of specific batches or groups, such that the data points in the same group have similar properties and data points in different groups have different properties in some sense.
Clusters are dense regions in the data space, separated by regions of lower point density. The DBSCAN algorithm is based on this intuitive notion of "clusters" and "noise".
Parameters Required For DBSCAN Algorithm:
1. eps: Defines the neighbourhood around a data point: if the distance between two points is less than or equal to eps, they are considered neighbours. If eps is chosen too small, a large part of the data will be treated as outliers; if it is chosen too large, clusters will merge and the majority of the data points will fall into the same cluster. One way to choose the eps value is from the k-distance graph.
2. MinPts: The minimum number of neighbours (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and it should be at least 3.
Steps Used In DBSCAN Algorithm:
1. Find all the neighbouring points within eps of every point and identify the core points, i.e., points that have more than MinPts neighbours.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighbourhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbour of c, c is a neighbour of d, and d is a neighbour of e, which in turn is a neighbour of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Points that do not belong to any cluster are noise.
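A brief scikit-learn sketch of DBSCAN is shown below; eps and min_samples correspond to the eps and MinPts parameters described above (the make_moons toy data and the chosen parameter values are illustrative assumptions):
```python
# Sketch of DBSCAN on toy crescent-shaped data; -1 in labels_ marks noise.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two crescent-shaped clusters
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", list(labels).count(-1))
```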
6) What is an Association Rule Learning Algorithm? How does it work?
Explain the process of finding patterns using association rules with suitable
examples. Explain how Market Basket Analysis uses the concepts of
association analysis.
Ans:
Association Rule Learning Algorithm:
Association Rule Learning is a machine learning technique used to find interesting
relationships or patterns (rules) between items in large datasets. It is widely used in
transactional databases to discover frequent itemsets and correlations.
How It Works:
The goal is to identify rules of the form:
If (Condition A), then (Condition B)
where A and B are sets of items.
Example: "If a customer buys bread, they are likely to buy butter."
Key Metrics:
1. Support: Measures how often items appear together.
Support(A → B) = (Transactions containing A and B) / (Total transactions)
2. Confidence: Measures the likelihood of B given A.
Confidence(A → B) = (Transactions containing A and B) / (Transactions containing A)
3. Lift: Measures the strength of the rule compared to random chance.
Lift(A → B) = Confidence(A → B) / Support(B)
Process of Finding Patterns Using Association Rules:
1. Data Preparation:
Organize the dataset into a transactional format, e.g., a list of items purchased in each
transaction.
2. Find Frequent Itemsets:
Use algorithms like Apriori or FP-Growth to find itemsets that occur frequently
together (based on Support).
3. Generate Rules:
From the frequent itemsets, generate association rules that meet the thresholds for
Confidence and Lift.
4. Evaluate Rules:
Rank rules based on metrics like Lift or Confidence to identify the most valuable
patterns.
Example:
Dataset of transactions:
Transaction Items Purchased
1 Bread, Butter, Milk
2 Bread, Butter
3 Milk, Bread
4 Butter, Milk
Step 1: Frequent Itemsets
• Bread & Butter: Support = 2/4 = 0.5
• Bread & Milk: Support = 2/4 = 0.5
Step 2: Generate Rules
• Rule: "If Bread → Butter"
o Confidence = 2/3 = 0.67
Step 3: Evaluate
• Lift: Lift(Bread → Butter) = Confidence(Bread → Butter) / Support(Butter) = 0.67 / 0.75 ≈ 0.89. A higher Lift (above 1) indicates a stronger association than random chance; here the association is weak.
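The metrics in this example can be verified with a few lines of plain Python (no association-rule library is needed for a dataset this small):
```python
# Compute Support, Confidence, and Lift for the rule "Bread -> Butter"
# over the four example transactions above.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
    {"Butter", "Milk"},
]
n = len(transactions)

def support(items):
    # fraction of transactions that contain all the given items
    return sum(items <= t for t in transactions) / n

sup_rule   = support({"Bread", "Butter"})    # 0.5
confidence = sup_rule / support({"Bread"})   # 2/3 ≈ 0.67
lift       = confidence / support({"Butter"})  # ≈ 0.89

print(sup_rule, round(confidence, 2), round(lift, 2))
```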
Market Basket Analysis and Association Rules:
Market Basket Analysis applies association rule learning to understand customer purchasing
behavior.
Concept:
• Identifies products often bought together (e.g., "diapers and beer").
• Helps businesses with:
1. Cross-Selling: Suggest related items.
2. Store Layout Optimization: Place frequently purchased items closer.
3. Promotions: Offer discounts on associated items.
Example:
Rule: "If a customer buys a smartphone, they are likely to buy a phone case."
• Action: Recommend phone cases during smartphone checkout.
In summary, association rule learning discovers valuable patterns in data, enabling
businesses to make data-driven decisions.
7) In detail Compare Hierarchical clustering and K-means clustering.
Ans:
Comparison of Hierarchical Clustering and K-Means Clustering:
• Definition: Hierarchical clustering builds a hierarchy of clusters, either by merging or splitting them; K-Means is a centroid-based algorithm that partitions data into K clusters based on distance.
• Approach: Hierarchical clustering is either Agglomerative (bottom-up: start with each data point as its own cluster and merge) or Divisive (top-down: start with all data as one cluster and split); K-Means is partitional: it divides data into K clusters in a flat manner, without a hierarchy.
• Output: Hierarchical clustering produces a dendrogram (tree-like structure) showing clusters at various levels; K-Means produces a flat partition into K clusters.
• Number of Clusters (K): Hierarchical clustering does not need K to be pre-defined (the dendrogram allows choosing the number of clusters visually); K-Means requires the number of clusters K to be pre-specified.
• Cluster Shape: Hierarchical clustering can detect clusters of arbitrary shapes; K-Means assumes clusters are spherical (circular or convex).
• Algorithm Type: Hierarchical clustering is deterministic (if no ties occur); K-Means is iterative and non-deterministic (results may vary with different initializations).
• Scalability: Hierarchical clustering is computationally expensive for large datasets (O(n^2) or worse); K-Means is efficient for large datasets (O(n × k × i), where i is the number of iterations).
8) Write the differences between DBSCAN and K-Means.
Ans:
Differences between DBSCAN and K-Means:
• Number of clusters: DBSCAN does not require the number of clusters to be specified; K-Means is very sensitive to the number of clusters, which must be specified.
• Cluster shape: Clusters formed by DBSCAN can be of any arbitrary shape; clusters formed by K-Means are spherical or convex in shape.
• Noise and outliers: DBSCAN works well with datasets containing noise and outliers; K-Means does not handle outlier data well, and outliers can skew its clusters to a very large extent.
• Parameters: DBSCAN requires two parameters (eps and MinPts) for training the model; K-Means requires only one (the number of clusters K).
9) Compare Supervised Vs Unsupervised learning
Ans:
Comparison of Supervised vs. Unsupervised Learning:
• Definition: Supervised learning learns from labeled data (input-output pairs); unsupervised learning learns without labeled data, focusing on finding hidden patterns.
• Goal: Supervised learning predicts outcomes or classifies data based on labeled training data; unsupervised learning groups similar data points or discovers underlying structure.
• Data Requirement: Supervised learning requires labeled data; unsupervised learning works with unlabeled data.
• Output: Supervised learning predicts specific outcomes (classes or values); unsupervised learning produces clusters, patterns, or reduced dimensions.
• Techniques: Supervised — classification (e.g., Decision Trees, SVM, Logistic Regression) and regression (e.g., Linear Regression); unsupervised — clustering (e.g., K-Means, Hierarchical Clustering) and dimensionality reduction (e.g., PCA).
• Examples: Supervised — spam email classification, predicting house prices, diagnosing diseases; unsupervised — customer segmentation, Market Basket Analysis, anomaly detection.
• Performance Evaluation: Supervised learning uses metrics like accuracy, precision, recall, and RMSE; unsupervised learning is evaluated using internal metrics (e.g., silhouette score) or visual inspection.
• Dependency on Labels: Supervised learning depends completely on labeled data; unsupervised learning needs no labels and learns patterns from the structure of the data.
• Applications: Supervised — fraud detection, weather forecasting, sentiment analysis; unsupervised — grouping customers by behavior, reducing data for visualization.