
IDM ASSIGNMENT

Muhammad Hassan Nami


4th December 2022
Introduction
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Data Modeling
5. Submissions
6. Evaluation

Business Understanding
We have information about the symptoms that a sample of patients present with. Our goal is to use clustering to build an accurate model that can diagnose patients based on their symptoms.

Data Understanding
The data contains the symptoms of 2783 unique patients; the symptoms are encoded in 132 unique binary columns.

Data Preparation
To make the data easier to manage and more efficient to manipulate, we remove the ‘row ID’ column and any columns that consist only of 1s or only of 0s (constant columns carry no information). For specific submissions we also removed columns based on variance and multicollinearity, as sketched below.
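
A rough illustration of these steps follows; the file name symptoms.csv and the 0.9 correlation cut-off are illustrative assumptions, not values taken from the actual submissions.

    import numpy as np
    import pandas as pd

    df = pd.read_csv('symptoms.csv')

    # Drop the identifier column; it carries no diagnostic signal.
    df = df.drop(columns=['row ID'])

    # Drop constant columns (all 0s or all 1s): zero variance, no information.
    df = df.loc[:, df.var() > 0]

    # Drop one feature from each highly correlated pair (multicollinearity).
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])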

Data Modeling
The models used are KMeans, OPTICS, and Agglomerative Clustering, all from scikit-learn; a minimal sketch of each follows.
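
The sketch below fits each model family with its scikit-learn defaults, assuming the preprocessed DataFrame df from the preparation sketch above.

    from sklearn.cluster import KMeans, OPTICS, AgglomerativeClustering

    X = df.to_numpy()

    # Fit each model family with its default hyperparameters.
    kmeans_labels = KMeans().fit_predict(X)
    optics_labels = OPTICS().fit_predict(X)
    agglomerative_labels = AgglomerativeClustering().fit_predict(X)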

Submissions
SUBMISSION | PREPROCESSING* | MODEL** | SCORE
1 | Did not remove ‘row ID’; categorical to numerical | KMeans() | 0.00803
2 | Did not remove ‘row ID’; one-hot encoding | KMeans() | 0.44166
3 | Did not remove ‘row ID’; dummy encoding | KMeans() | 0.52964
4 | None | KMeans() | 0.6518
5 | None | AgglomerativeClustering() | 0.01586
6 | None | KMeans(n_init=100) | 0.6518
7 | None | KMeans(n_clusters=15, n_init=100, max_iter=1000) | 0.57338
8 | None | KMeans(n_clusters=10, n_init=100, max_iter=1000) | 0.59262
9 | None | KMeans(n_clusters=6, n_init=1000, max_iter=1000) | 0.55455
10 | None | KMeans(n_clusters=23, n_init=1000, max_iter=1000) | 0.34952
11 | None | KMeans() | 0.62693
12 | None | KMeans(n_clusters=13, n_init=10000) | 0.61441
13 | None | KMeans() | 0.61446
14 | None | KMeans(n_clusters=13) | 0.65376
15 | None | KMeans(n_clusters=22) | 0.37563
16 | Variance threshold 0 | KMeans(n_clusters=22) | 0.37563
17 | None | KMeans(n_clusters=6) | 0.55455
18 | Variance threshold 0 | KMeans(n_clusters=6) | 0.55455
19 | None | KMeans(n_clusters=13, max_iter=1500, n_init=100), yes | 0.61441
20 | None | KMeans(n_clusters=13, max_iter=1500, n_init=100), yes(max) | 0.52663
21 | None | KMeans() | 0.74087
22 | None | KMeans() | 0.77969
23 | None | KMeans() | 0.61446
24 | None | KMeans(n_clusters=16, n_init=50, max_iter=300, algorithm='full') | 0.49704
25 | None | KMeans() | 0.80573
26-30*** | None | KMeans(n_clusters=8) | 0.46266-0.68636
31-40*** | None | KMeans(n_clusters=19, random_state=k, max_iter=1000) | 0.41492-0.47415
41-50*** | None | KMeans(random_state=k****) | 0.44386-0.66589
51-60*** | None | KMeans() | 0.35949-0.78356
61 | None | AgglomerativeClustering(n_clusters=17, linkage='ward', compute_distances=True) | 0.48585
62 | None | AgglomerativeClustering(n_clusters=17, linkage='single', compute_distances=True) | 0.62658
63 | None | AgglomerativeClustering(n_clusters=17, linkage='average', compute_distances=True) | 0.5641
64 | None | AgglomerativeClustering(n_clusters=17, linkage='complete', compute_distances=True) | 0.36292
65 | None | AgglomerativeClustering(n_clusters=19, linkage='single', compute_distances=True) | 0.52013
66 | None | AgglomerativeClustering(n_clusters=19, linkage='average', compute_distances=True) | 0.47415
67 | None | AgglomerativeClustering(n_clusters=22, linkage='single', compute_distances=True) | 0.37563
68 | None | AgglomerativeClustering(n_clusters=23, linkage='single', compute_distances=True) | 0.34952
69 | None | AgglomerativeClustering(n_clusters=25, linkage='single', compute_distances=True) | 0.25346
70 | None | AgglomerativeClustering(n_clusters=13, linkage='single', compute_distances=True) | 0.5614
71 | None | AgglomerativeClustering(n_clusters=16, linkage='single', compute_distances=True) | 0.64654
72 | None | AgglomerativeClustering(n_clusters=16, linkage='ward', compute_distances=True) | 0.50846
73 | None | AgglomerativeClustering(n_clusters=16, linkage='average', compute_distances=True) | 0.60619
74 | None | AgglomerativeClustering(n_clusters=16, linkage='complete', compute_distances=True) | 0.38363
75 | None | AgglomerativeClustering(n_clusters=11, linkage='single', compute_distances=True) | 0.51168
76 | None | AgglomerativeClustering(n_clusters=10, linkage='single', compute_distances=True) | 0.23864
77 | None | OPTICS() | 0.15725
78 | None | OPTICS(n_jobs=-1, min_samples=25) | 0.11212
79 | None | OPTICS(n_jobs=-1, min_samples=10, min_cluster_size=100) | 0.34952
80 | None | OPTICS(n_jobs=-1, min_samples=5, min_cluster_size=121) | 0.34952
81 | Removed features using multicollinearity | KMeans() | 0.68636
82 | Removed features using multicollinearity | KMeans() | 0.66589
83 | Removed features using multicollinearity | KMeans() | 0.68589
84 | Removed features using multicollinearity | KMeans() | 0.68139
85 | Removed features using multicollinearity | KMeans() | 0.54964
86 | Removed features using multicollinearity | AgglomerativeClustering(n_clusters=11, linkage='average', compute_distances=True, affinity='manhattan') | 0.84145
87 | Removed features using multicollinearity | AgglomerativeClustering(n_clusters=10, linkage='average', compute_distances=True, affinity='manhattan') | 0.768
88 | Removed features using multicollinearity | AgglomerativeClustering(n_clusters=12, linkage='average', compute_distances=True, affinity='manhattan') | 0.704
89 | Removed features using multicollinearity | AgglomerativeClustering(n_clusters=13, linkage='average', compute_distances=True, affinity='manhattan') | 0.68524
90 | Removed features using multicollinearity | AgglomerativeClustering(n_clusters=8, linkage='average', compute_distances=True, affinity='manhattan') | 0.48605

* In all submissions ‘row ID’ was removed unless specified otherwise.

** n_clusters values for KMeans were taken from elbow and silhouette curves; n_clusters values for Agglomerative Clustering were taken from dendrograms (see the sketch after these notes).

*** These submissions were generated using loops to check the effect of randomness.

**** A loop was run to iteratively change the random state and refit the model, then sort the runs by inertia_ (sketched below).
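
The two selection procedures in the notes above might look like the following sketch, where X is the preprocessed feature matrix; the candidate ranges (k from 2 to 25, 20 random states) are illustrative assumptions, not the exact ones used.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Elbow and silhouette curves (**): sweep k, record inertia_ and the
    # silhouette score, then look for the elbow / the silhouette peak.
    for k in range(2, 26):
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        print(k, km.inertia_, silhouette_score(X, km.labels_))

    # Randomness loop (****): refit at a fixed k while changing the random
    # state, then keep the run with the lowest inertia_.
    runs = [KMeans(n_clusters=11, n_init=10, random_state=s).fit(X)
            for s in range(20)]
    best = min(runs, key=lambda km: km.inertia_)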

Evaluation
Which algorithm worked best for the given dataset and why?

Agglomerative Clustering works best because it is a form of hierarchical clustering. A hierarchical approach is necessary here because many diagnosed illnesses can be subsets of a larger illness.
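
For reference, the best-scoring configuration (submission 86, score 0.84145) can be written out as below, assuming the multicollinearity-reduced feature matrix X; note that scikit-learn 1.2 renamed the affinity parameter to metric.

    from sklearn.cluster import AgglomerativeClustering

    # Best-scoring configuration from the submissions table (submission 86).
    # On scikit-learn >= 1.2, pass metric='manhattan' instead of affinity.
    model = AgglomerativeClustering(
        n_clusters=11,
        linkage='average',
        compute_distances=True,
        affinity='manhattan',
    )
    labels = model.fit_predict(X)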

What is the optimal number of clusters in the data as per your findings and why?

According to the dendrogram, 11 clusters was the optimal number.
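
A dendrogram of this kind can be produced with SciPy; the average linkage and Manhattan (cityblock) metric below mirror the best submission, while the plotting settings are illustrative assumptions.

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    # Build the merge tree and plot it; long vertical gaps between merges
    # suggest a natural place to cut the tree, here at 11 clusters.
    Z = linkage(X, method='average', metric='cityblock')
    dendrogram(Z, truncate_mode='level', p=5)
    plt.title('Dendrogram of patient symptoms')
    plt.show()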

What were the overall challenges that you faced while improving the score, and so on?

The randomness of KMeans was especially challenging to overcome, as we were limited to 10 submissions per day. Even when the random state remained constant, there was still a large amount of variance in the clusters formed.

However, the most challenging part was using Word as a medium for creating this report.
