Data Science Code

Problem Statement

You are provided with a dataset containing various data points, each described by a set of features. Your task is to apply unsupervised learning techniques to cluster these data points into 'n' distinct clusters. Additionally, you need to identify the most important features and their corresponding values that contribute significantly to the grouping of data points within these clusters.

Objective:
1. Clustering: Implement an unsupervised learning algorithm to partition the data points
into 'n' clusters. You should select an appropriate clustering algorithm based on the
characteristics of the dataset and the problem requirements.
2. Feature Importance: Identify the most important features that influence the formation of
clusters. Determine the relevance and significance of each feature in grouping data
points together. This analysis will help you understand which features are driving the
cluster assignments.
3. Feature Values: Determine the specific values or ranges of values for the identified
important features that are associated with each cluster. In other words, find the feature-
value combinations that differentiate one cluster from another.

Index
##Solution

##1. Data Preprocessing: Handling of missing values using mean and mode.

Standardization of features using StandardScaler.

Encoded categorical variables using LabelEncoder.

##2. Clustering: Chose an appropriate clustering algorithm (K-means clustering).

Determined the optimal number of clusters, which is '2', based on both the elbow method and the silhouette score.

Applied the chosen algorithm to cluster the data points into '2' clusters.

##3. Evaluation:

Evaluated the quality of the clustering results using the silhouette score, Davies-Bouldin index, and Calinski-Harabasz score.

##4. Feature Importance: Employed dimensionality reduction (PCA) together with the attributes built into K-means to analyse the clusters.

Determined the relevance and contribution of each feature to the cluster assignments by employing a Random Forest classifier (RFC).

##5. Insights:
Tabulated the contribution of the original features in forming the PCA components.

Created histograms that represent the influencing feature-value distributions in each cluster.

For each identified important feature, analysed the values or value ranges that are prevalent in each cluster, in table form.

Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder,StandardScaler

Importing Data
data=pd.read_csv('METABRIC_RNA_Mutation.csv')

#Data Exploration

data.head()

(Output: the first five rows of the dataframe. It has 693 columns, beginning with clinical fields such as patient_id, age_at_diagnosis, type_of_breast_surgery, cancer_type, cancer_type_detailed, cellularity, chemotherapy, pam50_+_claudin-low_subtype, cohort and er_status_measured_by_ihc, and ending with gene mutation columns such as mtap_mut, ppp2cb_mut, smarcd1_mut, nras_mut, ndfip1_mut, hras_mut, prps2_mut, smarcb1_mut, stmn2_mut and siah1_mut.)

[5 rows x 693 columns]

data.shape

(1904, 693)

data.describe(include='all')

(Output: data.describe(include='all') returns an 11-row summary of all 693 columns. Highlights: 1904 patients with mean age_at_diagnosis 61.09 (std 12.98, range 21.93-96.29); type_of_breast_surgery has 1882 non-null values with MASTECTOMY the most frequent (1127); cancer_type_detailed has 1889 non-null values with Breast Invasive Ductal Carcinoma the most frequent (1500); cellularity is most often High (939 of 1850); pam50_+_claudin-low_subtype is most often LumA (679); the *_mut columns are dominated by the value 0, i.e. no mutation.)

[11 rows x 693 columns]

#separating categorical and numerical value based columns


cat_cols=data.describe(include='object').columns.tolist()
num_cols=data.describe(exclude='object').columns.tolist()

#converting object datatypes into categorical columns.


for i in cat_cols:
    data[i]=data[i].astype('category')

Data Preprocessing
Missing Values
#identifying and noting down the cols with missing values
missing_cat=[]
for i in cat_cols:
    if data[i].isna().sum() != 0:
        missing_cat.append(i)

missing_num=[]
for i in num_cols:
    if data[i].isna().sum() != 0:
        missing_num.append(i)

#viewing the % of missing values in the columns


print((data[missing_cat].isna().sum())/data.shape[0])
print((data[missing_num].isna().sum())/data.shape[0])

#we will drop the columns that have more than 10% missing values
data1=data.drop(['tumor_stage'],axis=1)

#handling missing values with mode and mean for cat and num cols respectively
for i in missing_cat:
    if i in data1.columns.tolist():
        x=data1[i].mode().tolist()
        data1[i]=data1[i].fillna(x[0])

for i in missing_num:
    if i in data1.columns.tolist():
        data1[i]=data1[i].fillna(data1[i].mean())

#redefining cat and num cols after dropping the ones with excess % of missing data
new_cat=data1.describe(include='category').columns.tolist()
new_num=data1.describe(exclude='category').columns.tolist()

type_of_breast_surgery 0.011555
cancer_type_detailed 0.007878
cellularity 0.028361
er_status_measured_by_ihc 0.015756
tumor_other_histologic_subtype 0.007878
primary_tumor_laterality 0.055672
oncotree_code 0.007878
3-gene_classifier_subtype 0.107143
death_from_cancer 0.000525
dtype: float64
neoplasm_histologic_grade 0.037815
mutation_count 0.023634
tumor_size 0.010504
tumor_stage 0.263130
dtype: float64

data1.isna().sum().sum()

Encoding Categorical Variables


encoder=LabelEncoder()

#encoding columns that contain a mixture of numerical and text values
#(some columns mix the numeric value 0 with strings, which LabelEncoder cannot sort;
#for those, 0 is replaced with the string 'Zero' before encoding)
for i in new_cat:
    try:
        data1[i]=encoder.fit_transform(data1[i])
    except TypeError:
        data1[i]=data1[i].replace(0,'Zero')
        data1[i]=encoder.fit_transform(data1[i])
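
An alternative to the try/except above (a small sketch, not meant to be run in addition to it) is to cast every categorical column to string before encoding, so columns that mix the numeric 0 with text never raise a TypeError in the first place; the resulting codes may differ slightly from those produced above.

#alternative sketch: casting to string makes every column uniformly sortable for LabelEncoder
for col in new_cat:
    data1[col] = LabelEncoder().fit_transform(data1[col].astype(str))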

Standardization
To ensure all features are on a similar scale, which is important for algorithms that rely on distance calculations (K-means clustering is one such algorithm).
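
As a toy illustration of why this matters (a minimal sketch with made-up numbers, not taken from this dataset): an unscaled feature with a large range dominates the Euclidean distance that K-means relies on, while after standardization both features contribute comparably.

#toy example: feature 1 spans thousands, feature 2 spans single digits
sample = np.array([[1000.0, 1.0], [2000.0, 2.0], [1500.0, 1.5]])

#the raw distance between the first two points is driven almost entirely by feature 1
print(np.linalg.norm(sample[0] - sample[1]))   # ~1000.0

#after standardization both features contribute comparably to the distance
scaled = StandardScaler().fit_transform(sample)
print(np.linalg.norm(scaled[0] - scaled[1]))   # ~3.5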

data.describe(exclude='category')

(Output: data.describe(exclude='category') summarizes the 503 numeric columns in 8 rows. The clinical numeric columns are on their natural scales, e.g. age_at_diagnosis has mean 61.09 and std 12.98, lymph_nodes_examined_positive has mean 2.00 and max 45, mutation_count has mean 5.70 and max 80, and overall_survival_months has mean 125.12. The gene expression columns (srd5a1 ... ugt2b7) already have mean ≈ 0 and std ≈ 1.)

[8 rows x 503 columns]

scaler=StandardScaler()
scaled_data=pd.DataFrame(scaler.fit_transform(data1[new_num]), columns=new_num)
print('Scaled data shape:',scaled_data.shape)
new_data=pd.concat((data1[new_cat],scaled_data),axis=1)
print('New data shape:',new_data.shape)

Scaled data shape: (1904, 502)


New data shape: (1904, 692)

Clustering
To find the ideal number of clusters using the elbow and silhouette score methods

X=new_data.copy()
X.columns = X.columns.astype(str)

#Elbow method
# Calculate WCSS for a range of K values
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=500, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the Elbow Method curve


plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='-', color='b')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.xticks(range(1, 11))
plt.grid(True)
plt.show()

#silhouette scores method

silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=500, n_init=10, random_state=0)
    kmeans.fit(X)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))

# Plot the Silhouette Method curve


plt.figure(figsize=(8, 6))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='-', color='b')
plt.title('Silhouette Method')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.xticks(range(2, 11))
plt.grid(True)
plt.show()

From both graphs we can conclude that the ideal number of clusters is 2.
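
The same conclusion can be read off programmatically (a small sketch reusing the silhouette_scores list computed above; k = 2 gives the highest score):

#pick the k in 2..10 with the highest silhouette score
best_k = range(2, 11)[int(np.argmax(silhouette_scores))]
print('Best number of clusters by silhouette score:', best_k)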
Clustering the dimensionally reduced dataset through
PCA, evaluating the results and obtaining the important
features and their corresponding values
n_components = 5 #the number of components
pca = PCA(n_components=n_components)
X=new_data.copy()
X.columns = X.columns.astype(str)
X_pca = pca.fit_transform(X)

# Step 3: Analyze PCA results


explained_variance_ratio = pca.explained_variance_ratio_

# Print the explained variance ratios for each component


print("Explained Variance Ratios:")
print(explained_variance_ratio)

# Optional: Visualize the explained variance


plt.plot(range(1, n_components + 1), np.cumsum(explained_variance_ratio))
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Explained Variance vs. Number of Components")
plt.show()

Explained Variance Ratios:


[0.33570044 0.13648734 0.09348513 0.05628623 0.04368038]
The five retained components together explain approximately 67% of the variance in the data.
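
That figure is simply the sum of the printed ratios; a one-line check (reusing explained_variance_ratio from above):

#total variance captured by the 5 retained components (~0.666)
print(np.cumsum(explained_variance_ratio)[-1])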

K-means clustering also makes it convenient to obtain important features after clustering.

Davies-Bouldin index: a lower value suggests that the clusters are well-separated and distinct.

Silhouette score: a higher value indicates that the data points are well-separated into clusters.

Calinski-Harabasz score: higher values indicate better separation between clusters.

# Performing K-Means clustering


kmeans = KMeans(n_clusters=2, random_state=0,max_iter=1000)
labels = kmeans.fit_predict(X_pca)

# Evaluating the clustering results


silhouette_avg = silhouette_score(X, labels)
db_index = davies_bouldin_score(X, labels)
ch_index = calinski_harabasz_score(X, labels)

print(f"Silhouette Score: {silhouette_avg}")


print(f"Davies-Bouldin Index: {db_index}")
print(f"Calinski-Harabasz Index: {ch_index}")
Silhouette Score: 0.3530636931145265
Davies-Bouldin Index: 1.2331561638066597
Calinski-Harabasz Index: 804.105981173214
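
Note that the scores above are computed on the full standardized feature matrix X, while the labels come from the PCA-reduced space. For comparison, the same metrics can also be computed in the 5-dimensional PCA space itself (a sketch; these numbers will differ from the ones printed above):

#evaluating the same cluster labels in the reduced PCA space
print('Silhouette Score (PCA space):', silhouette_score(X_pca, labels))
print('Davies-Bouldin Index (PCA space):', davies_bouldin_score(X_pca, labels))
print('Calinski-Harabasz Index (PCA space):', calinski_harabasz_score(X_pca, labels))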

#obtaining the features


df=pd.DataFrame(X_pca)
df['ClusterLabel'] = kmeans.fit_predict(df)

# Get the centroids (cluster centers)


centroids = kmeans.cluster_centers_

# Identify important features based on centroids
# (these indices refer to the PCA components, since clustering was performed on X_pca)

important_features = np.argsort(np.abs(centroids), axis=1)[:, -5:]  # Select the top 5 important features for each cluster

print("Important Features for Each Cluster:")


print(important_features)

Important Features for Each Cluster:


[[2 4 3 1 0]
[2 4 3 1 0]]

#obtaining the original features from the dimensionally reduced components and their weightage in the process
loadings = pca.components_

# Get the original feature names (the data is in a DataFrame)
original_feature_names = ["Feature_" + str(i) for i in range(X.shape[1])]
orig_cols=new_data.columns.tolist()

# For a given component's loadings, return its top-3 features and their weightage
def feature_names(loadings,original_feature_names,component_name):
    feat={'feature_name':[],'weightage':[]}
    loading_with_names = list(zip(original_feature_names, loadings))
    for feature_name, loading in loading_with_names:
        feat['feature_name'].append(feature_name)
        feat['weightage'].append(loading)
    values=sorted(feat['weightage'],reverse=True)[0:3]
    indices={'feature_name':[],'weightage':[],'component_name':[]}
    for i in values:
        x = [index for index, value in enumerate(feat['weightage']) if value == i]
        indices['feature_name'].append(orig_cols[x[0]])
        indices['weightage'].append(i)
        indices['component_name'].append(component_name)
    return indices
f_names=pd.DataFrame()
f_names=pd.concat((f_names, pd.DataFrame(feature_names(loadings[0], original_feature_names, 0))))
f_names=pd.concat((f_names, pd.DataFrame(feature_names(loadings[1], original_feature_names, 1))))
f_names=pd.concat((f_names, pd.DataFrame(feature_names(loadings[3], original_feature_names, 3))))
f_names=pd.concat((f_names, pd.DataFrame(feature_names(loadings[2], original_feature_names, 2))))
f_names=pd.concat((f_names, pd.DataFrame(feature_names(loadings[4], original_feature_names, 4))))

x_cols=new_data.columns.tolist()
x_new=new_data.copy()
x_new['ClusterLabel']=df['ClusterLabel']

feat_contribution={'Feature_Name':[],'%_of_Influence_in_cluster_formation':[]}
data=x_new.drop(['ClusterLabel'],axis=1)
labels=x_new['ClusterLabel']
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data,labels)  # Fit the model with cluster labels as targets
feature_importance = model.feature_importances_.tolist()
values=sorted(feature_importance,reverse=True)[0:10]
for i in values:
    x=[index for index, value in enumerate(feature_importance) if value == i]
    feat_contribution['Feature_Name'].append(x_cols[x[0]])
    # feat_contribution['index'].append(x[0])
    feat_contribution['%_of_Influence_in_cluster_formation'].append(i*100)

Insights Obtained
Top 3 original features that have the major influence on the formation of the PCA components (listed component-wise)
1. feature_name: Name of the feature as per the dataframe.

2. weightage: Weightage of the feature in the formation of the component.

3. component_name: The index of the PCA component (0-4).

f_names
feature_name weightage component_name
0 tp53_mut 0.994655 0
1 muc16_mut 0.047248 0
2 syne1_mut 0.023867 0
0 muc16_mut 0.952263 1
1 ahnak2_mut 0.284344 1
2 kmt2c_mut 0.054483 1
0 kmt2c_mut 0.926952 3
1 pik3ca_mut 0.294682 3
2 map3k1_mut 0.203274 3
0 ahnak2_mut 0.954922 2
1 syne1_mut 0.029579 2
2 ahnak_mut 0.024780 2
0 syne1_mut 0.843752 4
1 kmt2c_mut 0.166811 4
2 dnah11_mut 0.119488 4

The table below shows the importance of the features in the formation of the clusters (in %).
pd.DataFrame(feat_contribution)

Feature_Name %_of_Influence_in_cluster_formation
0 tp53_mut 19.821408
1 bcl2 2.640829
2 aph1b 2.100798
3 chek1 1.870523
4 er_status 1.277683
5 gata3 1.042316
6 e2f3 0.806697
7 mapk1 0.749623
8 cdkn2a 0.734711
9 srd5a1 0.703182

The visualizations below show the cluster-wise distribution of values (frequency-value pairs) for the top 10 features that have influenced the formation of the clusters.
#extracting the cluster-wise distribution of values for the influential features
important_features=feat_contribution['Feature_Name']

# Group data by cluster
cluster_label_data=new_data.copy()
cluster_label_data['ClusterLabel']=df['ClusterLabel']
cluster_groups = cluster_label_data.groupby('ClusterLabel')

insights0,insights1=pd.DataFrame(),pd.DataFrame()
j=0
for feature in important_features:
    for label,group in cluster_groups:
        j+=1
        # note: with this parity the first group in each pair (label 0) goes to
        # insights1 and the second group (label 1) goes to insights0
        if j%2==0:
            insights0=pd.concat((insights0,group[feature]),axis=1)
        else:
            insights1=pd.concat((insights1,group[feature]),axis=1)

insights={'cluster':[],'max_freq':[],'most_occured_value':[],'mean_value':[],'median_value':[],'std_dev':[],'feature_name':[]}

# Analyze feature distributions for each cluster
for feature in important_features:
    for label, group in cluster_groups:
        insights['feature_name'].append(feature)
        insights['cluster'].append(label)
        insights['most_occured_value'].append(group[feature].mode()[0])
        insights['max_freq'].append(group[feature].value_counts().tolist()[0])
        insights['mean_value'].append(group[feature].mean())
        insights['median_value'].append(group[feature].median())
        insights['std_dev'].append(group[feature].std())

    # Visualize feature distributions
    plt.figure(figsize=(8, 6))
    for label, group in cluster_groups:
        plt.hist(group[feature], bins=10, alpha=0.5, label=f'Cluster {label}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.legend()
    plt.title(f'Distribution of {feature} by Cluster')
    plt.show()
The tables below give the statistical description of the values of the top 10 features which have influenced the formation of the clusters.
#cluster_label=1
insights0.describe()

            count        mean        std         min         25%         50%         75%         max
tp53_mut    472.0  238.042373  57.288430  113.000000  198.750000  237.000000  274.250000  342.000000
bcl2        472.0   -0.697808   0.942895   -2.791904   -1.399302   -0.828201   -0.129875    2.656105
aph1b       472.0   -0.757289   0.788225   -2.917802   -1.302625   -0.782550   -0.282425    2.212302
chek1       472.0    0.733421   1.051694   -1.504302   -0.058999    0.681002    1.373253    3.952808
er_status   472.0    0.447034   0.497714    0.000000    0.000000    0.000000    1.000000    1.000000
gata3       472.0   -0.787537   0.983745   -2.772898   -1.631724   -0.726699    0.027776    1.745900
e2f3        472.0    0.696880   1.073119   -1.992600   -0.102249    0.601601    1.330601    4.458301
mapk1       472.0    0.512338   1.015709   -2.659901   -0.225300    0.547700    1.164900    4.294400
cdkn2a      472.0    0.581673   1.408055   -1.331901   -0.452076    0.031799    1.494925    5.837501
srd5a1      472.0    0.672567   1.160492   -1.201500   -0.180400    0.414699    1.287924    6.534898

#cluster_label=0
insights1.describe()

             count       mean        std        min        25%        50%        75%         max
tp53_mut    1432.0  10.045391  24.314789   0.000000   2.000000   2.000000   2.000000  125.000000
bcl2        1432.0   0.230004   0.907946  -2.625804  -0.276325   0.316401   0.848577    2.534905
aph1b       1432.0   0.249609   0.935167  -2.854401  -0.344600   0.314051   0.878576    3.881904
chek1       1432.0  -0.241742   0.854743  -1.949803  -0.833076  -0.381150   0.172951    4.015408
er_status   1432.0   0.871508   0.334753   0.000000   1.000000   1.000000   1.000000    1.000000
gata3       1432.0   0.259579   0.860239  -2.812598  -0.024749   0.468950   0.804675    2.202800
e2f3        1432.0  -0.229698   0.859374  -2.885000  -0.787700  -0.295050   0.226376    4.480201
mapk1       1432.0  -0.168871   0.935873  -3.069801  -0.807651  -0.254400   0.353850    3.617600
cdkn2a      1432.0  -0.191725   0.727733  -1.356901  -0.610651  -0.324951   0.015474    4.304301
srd5a1      1432.0  -0.221684   0.829995  -2.120800  -0.708675  -0.379500   0.027400    5.345998

Conclusion
The objective of the case study, which is to cluster the data and to identify the important features along with their contribution to the clustering, has been achieved.
