Data Science Code
You are provided with a dataset containing various data points, each described by a set of features. Your
task is to apply unsupervised learning techniques to cluster these data points into 'n' distinct
clusters. Additionally, you need to identify the most important features, and their corresponding
values, that contribute significantly to the grouping of data points within these clusters.
Objective:
1. Clustering: Implement an unsupervised learning algorithm to partition the data points
into 'n' clusters. You should select an appropriate clustering algorithm based on the
characteristics of the dataset and the problem requirements.
2. Feature Importance: Identify the most important features that influence the formation of
clusters. Determine the relevance and significance of each feature in grouping data
points together. This analysis will help you understand which features are driving the
cluster assignments.
3. Feature Values: Determine the specific values or ranges of values for the identified
important features that are associated with each cluster. In other words, find the
feature-value combinations that differentiate one cluster from another.
Index
##Solution
##1. Data Preprocessing: Handled missing values using mean and mode imputation.
##2. Optimal Number of Clusters: Determined that the optimal number of clusters is '2' based on
both the elbow method and the silhouette score, and applied the chosen algorithm to partition
the data points into '2' clusters.
##3. Evaluation: Evaluated the quality of the clustering results using the Silhouette score, the
Davies-Bouldin index and the Calinski-Harabasz score.
##4. Clustering: Employed dimensionality reduction (PCA) together with the built-in K-Means
method to perform the clustering. Determined the relevance and contribution of each feature to
the cluster assignments by employing an RFC (Random Forest Classifier).
##5. Insights:
Tabulated the contribution of the original features in forming the PCA components.
Created a bar graph that represents the influencing feature-value distributions in each cluster.
For each identified important feature, analysed its values or value ranges that are prevalent in
each cluster, in table form.
Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler
Importing Data
data=pd.read_csv('METABRIC_RNA_Mutation.csv')
Data Exploration
data.head()
stmn2_mut siah1_mut
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
data.shape
(1904, 693)
data.describe(include='all')
        pam50_+_claudin-low_subtype       cohort  er_status_measured_by_ihc
count                          1904  1904.000000                       1874
unique                            7          NaN                          2
top                            LumA          NaN                    Positve
freq                            679          NaN                       1445
mean                            NaN     2.643908                        NaN
std                             NaN     1.228615                        NaN
min                             NaN     1.000000                        NaN
25%                             NaN     1.000000                        NaN
50%                             NaN     3.000000                        NaN
75%                             NaN     3.000000                        NaN
max                             NaN     5.000000                        NaN
Data Preprocessing
Missing Values
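The code below refers to cat_cols and num_cols, but the cell that defines them is not shown. Judging by the later calls to describe(include='category'), it presumably cast the text columns to the pandas 'category' dtype and split the column names by type; a minimal sketch of that assumed step:
# Assumed setup cell (not shown in the notebook): cast text columns to 'category'
# and split the column names into categorical and numerical groups.
for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].astype('category')
cat_cols = data.describe(include='category').columns.tolist()
num_cols = data.describe(exclude='category').columns.tolist()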
# identifying and noting down the cols with missing values
missing_cat = []
for i in cat_cols:
    if data[i].isna().sum() > 0:
        missing_cat.append(i)

missing_num = []
for i in num_cols:
    if data[i].isna().sum() > 0:
        missing_num.append(i)
# we will drop the columns that have more than 10% missing values
data1 = data.drop(['tumor_stage'], axis=1)

# handling missing values with mode and mean for categorical and numerical cols respectively
for i in missing_cat:
    if i in data1.columns.tolist():
        x = data1[i].mode().tolist()
        data1[i] = data1[i].fillna(x[0])
for i in missing_num:
    if i in data1.columns.tolist():
        data1[i] = data1[i].fillna(data1[i].mean())
# redefining cat and num cols after dropping the ones with an excess % of missing data
new_cat = data1.describe(include='category').columns.tolist()
new_num = data1.describe(exclude='category').columns.tolist()
type_of_breast_surgery 0.011555
cancer_type_detailed 0.007878
cellularity 0.028361
er_status_measured_by_ihc 0.015756
tumor_other_histologic_subtype 0.007878
primary_tumor_laterality 0.055672
oncotree_code 0.007878
3-gene_classifier_subtype 0.107143
death_from_cancer 0.000525
dtype: float64
neoplasm_histologic_grade 0.037815
mutation_count 0.023634
tumor_size 0.010504
tumor_stage 0.263130
dtype: float64
data1.isna().sum().sum()
Standardization
To ensure all features are on a similar scale, which is important for algorithms that rely on
distance calculations (K-Means clustering is distance-based).
data.describe(exclude='category')
       neoplasm_histologic_grade  hormone_therapy  lymph_nodes_examined_positive  mutation_count
count                 1832.000000      1904.000000                    1904.000000     1859.000000
mean                     2.415939         0.616597                       2.002101        5.697687
std                      0.650612         0.486343                       4.079993        4.058778
min                      1.000000         0.000000                       0.000000        1.000000
25%                      2.000000         0.000000                       0.000000        3.000000
50%                      3.000000         1.000000                       0.000000        5.000000
75%                      3.000000         1.000000                       2.000000        7.000000
max                      3.000000         1.000000                      45.000000       80.000000
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data1[new_num]), columns=new_num)
print('Scaled data shape:', scaled_data.shape)
new_data = pd.concat((data1[new_cat], scaled_data), axis=1)
print('New data shape:', new_data.shape)
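At this point new_data still holds the categorical columns as text, while K-Means below requires purely numeric input. LabelEncoder is imported above but the encoding cell is not shown; a minimal sketch of that assumed step:
# Assumed step (not shown in the notebook): integer-encode the categorical columns
# so that K-Means can work with them alongside the scaled numeric features.
le = LabelEncoder()
for col in new_cat:
    new_data[col] = le.fit_transform(new_data[col].astype(str))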
Clustering
To find the ideal number of clusters using the elbow method and the silhouette score method.
X = new_data.copy()
X.columns = X.columns.astype(str)

# Elbow method: calculate WCSS for a range of K values
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=500, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
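The plotting cells for the two graphs referred to below are not included; they were presumably produced roughly along these lines (a sketch, assuming the elbow curve is drawn from wcss and silhouette scores are computed for k = 2 to 10):
# Elbow plot from the WCSS values computed above
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS (inertia)')
plt.title('Elbow method')
plt.show()

# Silhouette score for k = 2..10 (the silhouette is undefined for a single cluster)
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    sil_scores.append(silhouette_score(X, km.fit_predict(X)))
plt.plot(range(2, 11), sil_scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette score')
plt.title('Silhouette method')
plt.show()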
From both graphs we can conclude that the ideal number of clusters is 2.
Clustering the dimensionally reduced dataset (via PCA), evaluating the results, and obtaining the important features and their corresponding values
n_components = 5  # the number of PCA components to keep
pca = PCA(n_components=n_components)
X = new_data.copy()
X.columns = X.columns.astype(str)
X_pca = pca.fit_transform(X)
davies_bouldin_score: a lower Davies-Bouldin index suggests that the clusters are well separated and distinct.
silhouette_score: a higher silhouette score indicates that the data points are well separated into clusters.
calinski_harabasz_score: a higher Calinski-Harabasz score indicates denser and better-separated clusters.
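The cell that clusters the PCA-reduced data, scores it, and extracts the component loadings is not shown; the variables df, loadings and original_feature_names used below presumably come from a step along these lines (a sketch, assuming K-Means with k = 2 on X_pca and pca.components_ as the loadings):
# Assumed clustering/evaluation cell (not shown in the notebook)
kmeans_pca = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
cluster_labels = kmeans_pca.fit_predict(X_pca)

# Evaluate the clustering on the reduced data
print('Silhouette score:', silhouette_score(X_pca, cluster_labels))
print('Davies-Bouldin index:', davies_bouldin_score(X_pca, cluster_labels))
print('Calinski-Harabasz score:', calinski_harabasz_score(X_pca, cluster_labels))

# Keep the labels in a dataframe named `df`, as referenced later
df = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(n_components)])
df['ClusterLabel'] = cluster_labels

# PCA loadings and the original feature names, consumed by feature_names() below
loadings = pca.components_  # shape: (n_components, n_original_features)
original_feature_names = X.columns.tolist()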
# Print the top loadings for a principal component with the corresponding feature names
def feature_names(loadings, original_feature_names, component_name):
    feat = {'feature_name': [], 'weightage': []}
    for feature_name, loading in zip(original_feature_names, loadings):
        feat['feature_name'].append(feature_name)
        feat['weightage'].append(loading)
    # keep the three largest loadings for this component
    values = sorted(feat['weightage'], reverse=True)[0:3]
    indices = {'feature_name': [], 'weightage': [], 'component_name': []}
    for i in values:
        x = [index for index, value in enumerate(feat['weightage']) if value == i]
        indices['feature_name'].append(original_feature_names[x[0]])
        indices['weightage'].append(i)
        indices['component_name'].append(component_name)
    return indices
f_names = pd.DataFrame()
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[0], original_feature_names, 0))))
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[1], original_feature_names, 1))))
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[3], original_feature_names, 3))))
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[2], original_feature_names, 2))))
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[4], original_feature_names, 4))))
x_cols = new_data.columns.tolist()
x_new = new_data.copy()
x_new['ClusterLabel'] = df['ClusterLabel']

feat_contribution = {'Feature_Name': [], '%_of_Influence_in_cluster_formation': []}
data = x_new.drop(['ClusterLabel'], axis=1)
labels = x_new['ClusterLabel']

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data, labels)  # fit the model with cluster labels as targets
feature_importance = model.feature_importances_.tolist()

# keep the ten largest importances and record the corresponding feature names
values = sorted(feature_importance, reverse=True)[0:10]
for i in values:
    x = [index for index, value in enumerate(feature_importance) if value == i]
    feat_contribution['Feature_Name'].append(x_cols[x[0]])
    feat_contribution['%_of_Influence_in_cluster_formation'].append(i * 100)
Insights Obtained
Top 3 original features that have the major influence on the formation of the PCA components (listed component-wise)
1. feature_name: name of the feature as per the dataframe.
2. weightage: loading (weight) of the feature on that component.
3. component_name: index of the PCA component.
f_names
feature_name weightage component_name
0 tp53_mut 0.994655 0
1 muc16_mut 0.047248 0
2 syne1_mut 0.023867 0
0 muc16_mut 0.952263 1
1 ahnak2_mut 0.284344 1
2 kmt2c_mut 0.054483 1
0 kmt2c_mut 0.926952 3
1 pik3ca_mut 0.294682 3
2 map3k1_mut 0.203274 3
0 ahnak2_mut 0.954922 2
1 syne1_mut 0.029579 2
2 ahnak_mut 0.024780 2
0 syne1_mut 0.843752 4
1 kmt2c_mut 0.166811 4
2 dnah11_mut 0.119488 4
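The importance table below was presumably produced by wrapping the dictionary built above in a DataFrame, e.g. (assumed display cell):
# Assumed display cell for the RFC importances collected above
pd.DataFrame(feat_contribution)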
Feature_Name %_of_Influence_in_cluster_formation
0 tp53_mut 19.821408
1 bcl2 2.640829
2 aph1b 2.100798
3 chek1 1.870523
4 er_status 1.277683
5 gata3 1.042316
6 e2f3 0.806697
7 mapk1 0.749623
8 cdkn2a 0.734711
9 srd5a1 0.703182
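The bar graph of the influencing feature-value distributions per cluster mentioned in the outline is not reproduced here; one way it could have been drawn, assuming x_new (with its ClusterLabel column) and the top features from feat_contribution:
# Grouped bar chart: mean (standardized) value of each top feature per cluster
top_features = feat_contribution['Feature_Name']
cluster_means = x_new.groupby('ClusterLabel')[top_features].mean()
cluster_means.T.plot(kind='bar', figsize=(10, 5))
plt.ylabel('Mean standardized value')
plt.title('Influencing feature values in each cluster')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()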
# cluster_groups: assumed to be the per-cluster grouping of x_new (the grouping cell is not shown)
cluster_groups = x_new.groupby('ClusterLabel')
important_features = feat_contribution['Feature_Name']

# collect the values of every important feature separately for the two clusters
insights0, insights1 = pd.DataFrame(), pd.DataFrame()
j = 0
for feature in important_features:
    for label, group in cluster_groups:
        j += 1
        if j % 2 == 0:
            insights0 = pd.concat((insights0, group[feature]), axis=1)
        else:
            insights1 = pd.concat((insights1, group[feature]), axis=1)

# summarise the prevalent values of each important feature within each cluster
insights = {'cluster': [], 'feature_name': [], 'max_freq': [], 'most_occured_value': [],
            'mean_value': [], 'median_value': [], 'std_dev': []}
for label, group in cluster_groups:
    for feature in important_features:
        counts = group[feature].value_counts()
        insights['cluster'].append(label)
        insights['feature_name'].append(feature)
        insights['max_freq'].append(counts.tolist()[0])
        insights['most_occured_value'].append(counts.index[0])
        insights['mean_value'].append(group[feature].mean())
        insights['median_value'].append(group[feature].median())
        insights['std_dev'].append(group[feature].std())

# cluster_label = 1
insights1.describe()
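Finally, the per-feature, per-cluster summary collected in insights can be viewed as the table the outline calls for, e.g. (assumed display cell):
# Assumed display cell: tabulate the prevalent values of each important feature per cluster
pd.DataFrame(insights)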
Conclusion
The objective of the case study, namely to cluster the data and to identify the important features
along with their contribution to the clustering, has been achieved.