Idm Assignment 3 22735
Business Understanding
We have symptom information for a sample of patients. Our goal is to use clustering to
create an accurate model that can diagnose patients based on their symptoms.
Data Understanding
The data contains the symptoms of 2783 unique patients. The symptoms are encoded as 132 unique binary columns.
Data Preparation
To make the data easier to manage and more efficient to manipulate, we remove the ‘row ID’ column and any
columns that consist only of 1s or only of 0s. In specific cases, we also removed columns
based on variance and collinearity.
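The preparation steps above can be sketched as follows. This is a minimal illustration, not the code used for the submissions: the toy data, the symptom column names, the helper names `prepare` and `drop_collinear`, and the `min_variance` and `threshold` values are all assumptions. Only the ‘row ID’ column name comes from the report.

```python
import pandas as pd


def prepare(df: pd.DataFrame, id_col: str = "row ID", min_variance: float = 0.0) -> pd.DataFrame:
    """Drop the identifier column and every column whose variance is <= min_variance.

    With min_variance=0.0 this removes exactly the constant (all-0 or all-1) columns.
    """
    out = df.drop(columns=[id_col], errors="ignore")
    variances = out.var()
    return out.loc[:, variances > min_variance]


def drop_collinear(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Keep the first column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i, a in enumerate(cols):
        if a in to_drop:
            continue
        for b in cols[i + 1:]:
            if b not in to_drop and corr.loc[a, b] > threshold:
                to_drop.add(b)
    return df.drop(columns=sorted(to_drop))


# Hypothetical toy data standing in for the real symptom table.
toy = pd.DataFrame({
    "row ID": [0, 1, 2, 3],
    "fever": [1, 0, 1, 0],
    "cough": [1, 0, 1, 0],      # identical to "fever", so perfectly collinear
    "rash": [0, 0, 1, 1],
    "always_0": [0, 0, 0, 0],   # constant column
    "always_1": [1, 1, 1, 1],   # constant column
})

cleaned = drop_collinear(prepare(toy))
print(list(cleaned.columns))
```

On the toy table, the identifier, both constant columns, and the duplicated symptom column are dropped, leaving two informative columns.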
Data Modeling
The models used are KMeans, OPTICS and Agglomerative Clustering.
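As a rough sketch of how these three models can be fitted with scikit-learn, the example below runs them on synthetic binary data. The data, the number of clusters, and the `min_samples` value are assumptions chosen for illustration and are not the settings used in the submissions.

```python
import numpy as np
from sklearn.cluster import KMeans, OPTICS, AgglomerativeClustering

rng = np.random.default_rng(0)

# Synthetic stand-in for the symptom matrix: 3 binary "profiles",
# each repeated 20 times with a small amount of bit-flip noise.
profiles = rng.integers(0, 2, size=(3, 12))
X = np.repeat(profiles, 20, axis=0)
flip = rng.random(X.shape) < 0.05
X = np.where(flip, 1 - X, X).astype(float)

models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    # OPTICS does not take a cluster count; a label of -1 marks noise points.
    "optics": OPTICS(min_samples=5),
    "agglomerative": AgglomerativeClustering(n_clusters=3, linkage="single"),
}

# fit_predict returns one cluster label per row for each model.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, lab in labels.items():
    print(name, "distinct labels:", len(set(lab)))
```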
Submissions
SUBMISSION  PREPROCESSING*              MODEL**                                      SCORE
1           Did not remove ‘row ID’;    KMeans()                                     0.00803
            categorical to numerical
22          None                        KMeans()                                     0.77969
69          None                        AgglomerativeClustering(n_clusters=25,       0.25346
                                        linkage='single', compute_distances=True)
Multicollinearity
** n_clusters for KMeans is taken from elbow and silhouette curves. n_clusters for Agglomerative
Clustering is taken from dendrograms.
**** A loop was run to iteratively change the random state and refit the model; the runs were then sorted by inertia_.
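The two procedures described in these notes could look roughly like the sketch below. It is an assumption-laden illustration rather than the code behind the submissions: the synthetic data, the range of candidate cluster counts, and the number of seeds are all made up, and the silhouette score is used on its own to pick the cluster count, whereas the report reads it alongside the elbow curve.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the binary symptom matrix:
# 4 profiles, each repeated 25 times with 5% bit-flip noise.
profiles = rng.integers(0, 2, size=(4, 20))
X = np.repeat(profiles, 25, axis=0)
X = np.where(rng.random(X.shape) < 0.05, 1 - X, X).astype(float)

# Elbow / silhouette scan: record inertia_ and silhouette for each candidate k.
scan = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scan.append((k, km.inertia_, silhouette_score(X, km.labels_)))

# Pick the k with the highest silhouette score (the inertia_ values in
# `scan` are what an elbow plot would show).
best_k = max(scan, key=lambda row: row[2])[0]

# Random-state loop: one initialisation per seed, then sort by inertia_.
runs = []
for seed in range(20):
    km = KMeans(n_clusters=best_k, n_init=1, random_state=seed).fit(X)
    runs.append((km.inertia_, seed))
runs.sort()

best_inertia, best_seed = runs[0]
print("chosen k:", best_k, "best seed:", best_seed)
```

Setting `n_init=1` makes each seed correspond to a single initialisation, so the spread of `inertia_` across `runs` shows how much the result depends on the random state.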
Evaluation
Which algorithm worked best for the given dataset and why?
Agglomerative Clustering works best because it is a form of hierarchical clustering. Hierarchical clustering is necessary because
many diagnosed illnesses can be a subset of a larger illness.
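The subset relationship can be illustrated by cutting one single-linkage hierarchy at two levels and checking that every fine cluster sits inside one coarse cluster. This sketch swaps in SciPy's `linkage` and `fcluster` in place of scikit-learn's AgglomerativeClustering, because they make it easy to cut the same tree twice. The random data and the two cluster counts are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Hypothetical binary symptom matrix: 40 patients, 10 symptoms.
X = rng.integers(0, 2, size=(40, 10)).astype(float)

# Build one single-linkage hierarchy (Euclidean distance by default).
Z = linkage(X, method="single")

# Cut the same tree at a coarse level and at a fine level.
# "maxclust" yields at most t clusters (possibly fewer when distances tie).
coarse = fcluster(Z, t=3, criterion="maxclust")
fine = fcluster(Z, t=8, criterion="maxclust")

# Every fine cluster should fall entirely inside a single coarse cluster.
nested = all(len(set(coarse[fine == f])) == 1 for f in np.unique(fine))
print("fine clusters nest inside coarse clusters:", nested)
```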
What is the optimal number of clusters in the data as per your findings and why?
What were the overall challenges that you faced while improving the score, and so on?
The randomness of KMeans was especially challenging to overcome, as we were limited to 10 submissions per day.
Even when the random state remained constant, there was still a large amount of variance in the clusters formed.
However, the most challenging part was using Word as a medium for creating this report.