DMG Exam 3

Total Marks: 30 Time: 60 min.

Instructions

1. You will have to create a PDF file with your answers, name the file as <Name>-<RollNo>.pdf
and upload it on the classroom page.
2. Any submission made at or after 11:10 will be marked as late. No submissions after
11:10 will be considered.
3. You must join the Zoom link (lecture) from a camera-enabled device. Attendance will be
taken on Zoom before evaluation. Absent students will be given zero marks.
4. For plagiarism, institute policy will be followed. Any case of plagiarism from online sources or
from your colleagues will result in an "F" grade.
5. Do not round off any values. You are strongly advised to truncate any intermediate or final
decimal values.
6. If you face issues while submitting to Google Classroom, please mail your responses to
[email protected] with the subject "DMG Exam 3". The above timings also apply to mailed
responses.

Question-1 [5 Marks] The Bisecting k-Means algorithm starts by dividing the points into two
clusters. It may consider several bisections and pick the best one. Let us take "best" to mean the
lowest SSE (Sum of Squared Errors). The SSE of a cluster is defined as the sum of the squared
distances between each point of the cluster and the centroid of the cluster; the SSE of a bisection is
the sum of the SSEs of its two clusters.
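The SSE definition above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample points are hypothetical and are not the grid points of the question.

```python
def centroid(points):
    """Mean of a list of 2-D points given as (x, y) tuples."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    cx, cy = centroid(points)
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)


def bisection_sse(cluster_a, cluster_b):
    """Total SSE of a bisection: the SSE of each half, added together."""
    return sse(cluster_a) + sse(cluster_b)


# Hypothetical example: one two-point cluster and one single-point cluster.
example_total = bisection_sse([(0, 0), (0, 2)], [(5, 5)])
print(example_total)
```

Comparing candidate bisections then amounts to calling `bisection_sse` on each candidate and keeping the one with the smallest value.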

Suppose that the data set consists of nine points arranged in a square grid, as suggested by the figure
below:

Although it doesn't matter for this question, you may take the grid spacing to be 1 (i.e., the square
is 2-by-2) and the lower-left corner to be the point (0,0). In the figure, we see three possible
bisections. (a) would be the bisection if we chose the two initial centroids to be points 3 and 7, for
example, and broke ties in favor of 7. (b) would be the split if we chose initial centroids 1 and 2.
(c) would be the split for the initial choice of 2 and 7.

State whether each of the following statements is correct or incorrect. If a statement is
incorrect, explain why.
I. (b) is better than (a)

II. (c) is the worst choice.

III. (a) and (c) are equally good choices.

Question-2 [9 Marks] Suppose that the true data consists of three clusters, as suggested by the
diagram below:

There is a large cluster B centered around the origin (0,0), with 8000 points uniformly distributed in a
circle of radius 2. There are two small clusters, A and C, each with 1000 points uniformly distributed
in a circle of radius 1. The center of A is at (-10,0) and the center of C is at (10,0).

Suppose we choose three initial centroids x, y, and z, and cluster the points according to which of x, y,
or z they are closest to. The result will be three apparent clusters, which may or may not coincide with
the true clusters A, B, and C. Depending on which clusters x, y, and z are chosen from, it is possible that
all and only the points in B will be assigned to one of these three centroids. Another possibility is that
one of these centroids will be assigned all of B and all of A, but none of C. A third possibility is
that one centroid will be assigned all of B and all of C, but none of A. Compute the probabilities of each
of these events.
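The assignment step described above (each point goes to the nearest of the three centroids) can be sketched as follows. The cluster sizes, radii and centers are taken from the question; the helper names and the particular centroid positions are assumptions chosen for illustration (one centroid at each true center), not part of the exam.

```python
import math
import random
from collections import Counter


def sample_disc(center, radius, n, rng):
    """Draw n points uniformly from a disc (sqrt on the radius keeps the area density uniform)."""
    cx, cy = center
    points = []
    for _ in range(n):
        r = radius * math.sqrt(rng.random())
        theta = 2 * math.pi * rng.random()
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def assignment_counts(true_clusters, centroids):
    """For each true cluster, count how many of its points go to each centroid."""
    return {
        name: Counter(nearest(p, centroids) for p in points)
        for name, points in true_clusters.items()
    }


rng = random.Random(0)
true_clusters = {
    "A": sample_disc((-10, 0), 1, 1000, rng),
    "B": sample_disc((0, 0), 2, 8000, rng),
    "C": sample_disc((10, 0), 1, 1000, rng),
}
# Hypothetical choice: x, y, z sit exactly at the three true centers.
counts = assignment_counts(true_clusters, [(-10, 0), (0, 0), (10, 0)])
print(counts)
```

With this particular choice every point lands with the centroid of its own true cluster. Moving the hypothetical centroids to other positions and rerunning `assignment_counts` shows how the apparent clusters change.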

Question-3 [5 Marks] Perform a hierarchical clustering of the following six points:


Use the single-link proximity measure (the distance between clusters is the shortest distance
between any pair of points, one from each cluster). Give the dendrogram showing the correct merge
sequence.
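A minimal sketch of single-link agglomerative clustering, as defined above, is given here. The six points of the question are not reproduced in this text, so the example uses four hypothetical points on a line; the function name is also an assumption.

```python
import math
from itertools import combinations


def single_link_distance(points, cluster_a, cluster_b):
    """Shortest distance between any pair of points, one from each cluster."""
    return min(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)


def single_link_merges(points):
    """Repeatedly merge the two closest clusters; return the merge sequence."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_link_distance(points, pair[0], pair[1]),
        )
        distance = single_link_distance(points, a, b)
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b), distance))
    return merges


# Hypothetical points, indexed 0..3.
example_merges = single_link_merges([(0, 0), (1, 0), (3, 0), (7, 0)])
for left, right, distance in example_merges:
    print(left, right, distance)
```

Each entry of the returned list corresponds to one join in the dendrogram, with the recorded distance giving the height of that join.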

Question-4 [5 Marks] In the following, 1 through 7 are items. Which of the following association
rules has a confidence that is certain to be at least as great as the confidence of 12=>34567 and no
greater than the confidence of 1234=>5? Give reasons.

a) 134=>567

b) 123=>456

c) 134=>257

d) 134=>256

e) 124=>356
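For reference, the standard definition of confidence is conf(X=>Y) = support(X and Y together) / support(X). The sketch below computes it over a small hypothetical list of transactions; the data and the function names are assumptions for illustration only.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs => rhs."""
    lhs_count = support_count(lhs, transactions)
    if lhs_count == 0:
        raise ValueError("left-hand side never occurs; confidence is undefined")
    return support_count(set(lhs) | set(rhs), transactions) / lhs_count


# Hypothetical transactions over items 1..4.
example_transactions = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3, 4}]
print(confidence({1}, {2}, example_transactions))
print(confidence({1, 2}, {3}, example_transactions))
```

Because the numerator counts the union of both sides, two rules over the same set of items share a numerator and differ only in the support of their left-hand sides.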

Question-5 [6 Marks] Consider the task of anomaly detection using the classic DBSCAN algorithm
and the Silhouette score. The dataset has clusters of different densities. Anomalies are rare, but
they form a group with a higher density than the clusters of regular points.

a) How well would the DBSCAN algorithm perform at detecting these anomalies?

b) The Silhouette score is usually defined using the notion of distance of a given point from the
centroid of the corresponding cluster. Comment on the Silhouette coefficient values of the
anomalous objects under the clustering produced by the DBSCAN algorithm.
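A centroid-based Silhouette coefficient of the kind the question refers to can be sketched as follows. This is an assumption about the intended definition: a is the distance from the point to its own cluster's centroid, b is the distance to the nearest other centroid, and the coefficient is (b - a) / max(a, b). The clusters and names below are hypothetical.

```python
import math


def centroid(points):
    """Mean of a list of 2-D points given as (x, y) tuples."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def centroid_silhouette(point, own_cluster, other_clusters):
    """Centroid-based silhouette of one point (assumed definition, see lead-in)."""
    a = math.dist(point, centroid(own_cluster))
    b = min(math.dist(point, centroid(c)) for c in other_clusters)
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)


# Hypothetical clusters: one near the origin, one near (10, 0).
own = [(0, 0), (2, 0)]
others = [[(9, 0), (11, 0)]]
example_score = centroid_silhouette((0, 0), own, others)
print(example_score)
```

Values near 1 indicate a point much closer to its own centroid than to any other; values near -1 indicate the opposite.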
