Data Science Code
You are provided with a dataset containing various data points, each described by a set of features. Your
task is to apply unsupervised learning techniques to cluster these data points into 'n' distinct
clusters. Additionally, you need to identify the most important features, and their corresponding
values, that contribute significantly to the grouping of data points within these clusters.
Objective:
1. Clustering: Implement an unsupervised learning algorithm to partition the data points
into 'n' clusters. You should select an appropriate clustering algorithm based on the
characteristics of the dataset and the problem requirements.
2. Feature Importance: Identify the most important features that influence the formation of
clusters. Determine the relevance and significance of each feature in grouping data
points together. This analysis will help you understand which features are driving the
cluster assignments.
3. Feature Values: Determine the specific values or ranges of values for the identified
important features that are associated with each cluster. In other words, find the
feature-value combinations that differentiate one cluster from another.
Index
##Solution
##1. Data Preprocessing: Handled missing values using mean and mode imputation.
##2. Optimal Number of Clusters: Determined that the optimal number of clusters is '2' based on
both the elbow method and the silhouette score, and applied the chosen algorithm to partition
the data points into '2' clusters.
##3. Evaluation: Evaluated the quality of the clustering results using the Silhouette score, the
Davies-Bouldin index and the Calinski-Harabasz score.
##4. Clustering: Employed dimensionality reduction (PCA) together with the built-in K-Means
method to perform the clustering. Determined the relevance and contribution of each feature to
the cluster assignments by employing an RFC (Random Forest Classifier).
##5. Insights:
Tabulated the contribution of the original features in forming the PCA components.
Created a bar graph that represents the influencing feature-value distributions in each cluster.
For each identified important feature, analysed its values or value ranges that are prevalent in
each cluster, in table form.
Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler
Importing Data
data=pd.read_csv('METABRIC_RNA_Mutation.csv')
Data Exploration
data.head()
stmn2_mut siah1_mut
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
data.shape
(1904, 693)
data.describe(include='all')
        pam50_+_claudin-low_subtype       cohort  er_status_measured_by_ihc
count                          1904  1904.000000                       1874
unique                            7          NaN                          2
top                            LumA          NaN                    Positve
freq                            679          NaN                       1445
mean                            NaN     2.643908                        NaN
std                             NaN     1.228615                        NaN
min                             NaN     1.000000                        NaN
25%                             NaN     1.000000                        NaN
50%                             NaN     3.000000                        NaN
75%                             NaN     3.000000                        NaN
max                             NaN     5.000000                        NaN
Data Preprocessing
Missing Values
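The code below refers to cat_cols and num_cols, but the cell that defines them is not shown. Judging by the later calls to describe(include='category'), it presumably cast the text columns to the pandas 'category' dtype and split the column names by type; a minimal sketch of that assumed step:
# Assumed setup cell (not shown in the notebook): cast text columns to 'category'
# and split the column names into categorical and numerical groups.
for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].astype('category')
cat_cols = data.describe(include='category').columns.tolist()
num_cols = data.describe(exclude='category').columns.tolist()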
# identifying and noting down the cols with missing values
missing_cat = []
for i in cat_cols:
    if data[i].isna().sum() > 0:
        missing_cat.append(i)

missing_num = []
for i in num_cols:
    if data[i].isna().sum() > 0:
        missing_num.append(i)
# we will drop the columns that have more than 10% missing values
data1 = data.drop(['tumor_stage'], axis=1)

# handling missing values with mode and mean for categorical and numerical cols respectively
for i in missing_cat:
    if i in data1.columns.tolist():
        x = data1[i].mode().tolist()
        data1[i] = data1[i].fillna(x[0])
for i in missing_num:
    if i in data1.columns.tolist():
        data1[i] = data1[i].fillna(data1[i].mean())
# redefining cat and num cols after dropping the ones with an excess % of missing data
new_cat = data1.describe(include='category').columns.tolist()
new_num = data1.describe(exclude='category').columns.tolist()
type_of_breast_surgery 0.011555
cancer_type_detailed 0.007878
cellularity 0.028361
er_status_measured_by_ihc 0.015756
tumor_other_histologic_subtype 0.007878
primary_tumor_laterality 0.055672
oncotree_code 0.007878
3-gene_classifier_subtype 0.107143
death_from_cancer 0.000525
dtype: float64
neoplasm_histologic_grade 0.037815
mutation_count 0.023634
tumor_size 0.010504
tumor_stage 0.263130
dtype: float64
data1.isna().sum().sum()
Standardization
To ensure all features are on a similar scale, which is important for algorithms that rely on
distance calculations (K-Means clustering is distance-based).
data.describe(exclude='category')
       neoplasm_histologic_grade  hormone_therapy  lymph_nodes_examined_positive  mutation_count
count                 1832.000000      1904.000000                    1904.000000     1859.000000
mean                     2.415939         0.616597                       2.002101        5.697687
std                      0.650612         0.486343                       4.079993        4.058778
min                      1.000000         0.000000                       0.000000        1.000000
25%                      2.000000         0.000000                       0.000000        3.000000
50%                      3.000000         1.000000                       0.000000        5.000000
75%                      3.000000         1.000000                       2.000000        7.000000
max                      3.000000         1.000000                      45.000000       80.000000
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data1[new_num]), columns=new_num)
print('Scaled data shape:', scaled_data.shape)
new_data = pd.concat((data1[new_cat], scaled_data), axis=1)
print('New data shape:', new_data.shape)
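At this point new_data still holds the categorical columns as text, while K-Means below requires purely numeric input. LabelEncoder is imported above but the encoding cell is not shown; a minimal sketch of that assumed step:
# Assumed step (not shown in the notebook): integer-encode the categorical columns
# so that K-Means can work with them alongside the scaled numeric features.
le = LabelEncoder()
for col in new_cat:
    new_data[col] = le.fit_transform(new_data[col].astype(str))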
Clustering
To find the ideal number of clusters using the elbow method and the silhouette score method.
X = new_data.copy()
X.columns = X.columns.astype(str)

# Elbow method: calculate WCSS for a range of K values
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=500, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
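The plotting cells for the two graphs referred to below are not included; they were presumably produced roughly along these lines (a sketch, assuming the elbow curve is drawn from wcss and silhouette scores are computed for k = 2 to 10):
# Elbow plot from the WCSS values computed above
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS (inertia)')
plt.title('Elbow method')
plt.show()

# Silhouette score for k = 2..10 (the silhouette is undefined for a single cluster)
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    sil_scores.append(silhouette_score(X, km.fit_predict(X)))
plt.plot(range(2, 11), sil_scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette score')
plt.title('Silhouette method')
plt.show()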
From both graphs we can conclude that the ideal number of clusters is 2.
Clustering the dimensionally reduced dataset (via PCA), evaluating the results, and obtaining the important features and their corresponding values
n_components = 5  # the number of PCA components to keep
pca = PCA(n_components=n_components)
X = new_data.copy()
X.columns = X.columns.astype(str)
X_pca = pca.fit_transform(X)
davies_bouldin_score: a lower Davies-Bouldin index suggests that the clusters are well separated and distinct.
silhouette_score: a higher silhouette score indicates that the data points are well separated into clusters.
calinski_harabasz_score: a higher Calinski-Harabasz score indicates denser and better-separated clusters.
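The cell that clusters the PCA-reduced data, scores it, and extracts the component loadings is not shown; the variables df, loadings and original_feature_names used below presumably come from a step along these lines (a sketch, assuming K-Means with k = 2 on X_pca and pca.components_ as the loadings):
# Assumed clustering/evaluation cell (not shown in the notebook)
kmeans_pca = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
cluster_labels = kmeans_pca.fit_predict(X_pca)

# Evaluate the clustering on the reduced data
print('Silhouette score:', silhouette_score(X_pca, cluster_labels))
print('Davies-Bouldin index:', davies_bouldin_score(X_pca, cluster_labels))
print('Calinski-Harabasz score:', calinski_harabasz_score(X_pca, cluster_labels))

# Keep the labels in a dataframe named `df`, as referenced later
df = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(n_components)])
df['ClusterLabel'] = cluster_labels

# PCA loadings and the original feature names, consumed by feature_names() below
loadings = pca.components_  # shape: (n_components, n_original_features)
original_feature_names = X.columns.tolist()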
# Print the top loadings for a principal component with the corresponding feature names
def feature_names(loadings, original_feature_names, component_name):
    feat = {'feature_name': [], 'weightage': []}
    for feature_name, loading in zip(original_feature_names, loadings):
        feat['feature_name'].append(feature_name)
        feat['weightage'].append(loading)
    # keep the three largest loadings for this component
    values = sorted(feat['weightage'], reverse=True)[0:3]
    indices = {'feature_name': [], 'weightage': [], 'component_name': []}
    for i in values:
        x = [index for index, value in enumerate(feat['weightage']) if value == i]
        indices['feature_name'].append(original_feature_names[x[0]])
        indices['weightage'].append(i)
        indices['component_name'].append(component_name)
    return indices
f_names = pd.DataFrame()
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[0], original_feature_names, 0))))
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[1], original_feature_names, 1))))
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[3], original_feature_names, 3))))
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[2], original_feature_names, 2))))
f_names = pd.concat((f_names, pd.DataFrame(feature_names(loadings[4], original_feature_names, 4))))
x_cols = new_data.columns.tolist()
x_new = new_data.copy()
x_new['ClusterLabel'] = df['ClusterLabel']

feat_contribution = {'Feature_Name': [], '%_of_Influence_in_cluster_formation': []}
data = x_new.drop(['ClusterLabel'], axis=1)
labels = x_new['ClusterLabel']

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data, labels)  # fit the model with cluster labels as targets
feature_importance = model.feature_importances_.tolist()

# keep the ten largest importances and record the corresponding feature names
values = sorted(feature_importance, reverse=True)[0:10]
for i in values:
    x = [index for index, value in enumerate(feature_importance) if value == i]
    feat_contribution['Feature_Name'].append(x_cols[x[0]])
    feat_contribution['%_of_Influence_in_cluster_formation'].append(i * 100)
Insights Obtained
Top 3 original features that have the major influence on the formation of the PCA components (listed component-wise)
1. feature_name: name of the feature as per the dataframe.
2. weightage: loading (weight) of the feature on that component.
3. component_name: index of the PCA component.
f_names
feature_name weightage component_name
0 tp53_mut 0.994655 0
1 muc16_mut 0.047248 0
2 syne1_mut 0.023867 0
0 muc16_mut 0.952263 1
1 ahnak2_mut 0.284344 1
2 kmt2c_mut 0.054483 1
0 kmt2c_mut 0.926952 3
1 pik3ca_mut 0.294682 3
2 map3k1_mut 0.203274 3
0 ahnak2_mut 0.954922 2
1 syne1_mut 0.029579 2
2 ahnak_mut 0.024780 2
0 syne1_mut 0.843752 4
1 kmt2c_mut 0.166811 4
2 dnah11_mut 0.119488 4
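The importance table below was presumably produced by wrapping the dictionary built above in a DataFrame, e.g. (assumed display cell):
# Assumed display cell for the RFC importances collected above
pd.DataFrame(feat_contribution)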
Feature_Name %_of_Influence_in_cluster_formation
0 tp53_mut 19.821408
1 bcl2 2.640829
2 aph1b 2.100798
3 chek1 1.870523
4 er_status 1.277683
5 gata3 1.042316
6 e2f3 0.806697
7 mapk1 0.749623
8 cdkn2a 0.734711
9 srd5a1 0.703182
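The bar graph of the influencing feature-value distributions per cluster mentioned in the outline is not reproduced here; one way it could have been drawn, assuming x_new (with its ClusterLabel column) and the top features from feat_contribution:
# Grouped bar chart: mean (standardized) value of each top feature per cluster
top_features = feat_contribution['Feature_Name']
cluster_means = x_new.groupby('ClusterLabel')[top_features].mean()
cluster_means.T.plot(kind='bar', figsize=(10, 5))
plt.ylabel('Mean standardized value')
plt.title('Influencing feature values in each cluster')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()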
# cluster_groups: assumed to be the per-cluster grouping of x_new (the grouping cell is not shown)
cluster_groups = x_new.groupby('ClusterLabel')
important_features = feat_contribution['Feature_Name']

# collect the values of every important feature separately for the two clusters
insights0, insights1 = pd.DataFrame(), pd.DataFrame()
j = 0
for feature in important_features:
    for label, group in cluster_groups:
        j += 1
        if j % 2 == 0:
            insights0 = pd.concat((insights0, group[feature]), axis=1)
        else:
            insights1 = pd.concat((insights1, group[feature]), axis=1)

# summarise the prevalent values of each important feature within each cluster
insights = {'cluster': [], 'feature_name': [], 'max_freq': [], 'most_occured_value': [],
            'mean_value': [], 'median_value': [], 'std_dev': []}
for label, group in cluster_groups:
    for feature in important_features:
        counts = group[feature].value_counts()
        insights['cluster'].append(label)
        insights['feature_name'].append(feature)
        insights['max_freq'].append(counts.tolist()[0])
        insights['most_occured_value'].append(counts.index[0])
        insights['mean_value'].append(group[feature].mean())
        insights['median_value'].append(group[feature].median())
        insights['std_dev'].append(group[feature].std())

# cluster_label = 1
insights1.describe()
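Finally, the per-feature, per-cluster summary collected in insights can be viewed as the table the outline calls for, e.g. (assumed display cell):
# Assumed display cell: tabulate the prevalent values of each important feature per cluster
pd.DataFrame(insights)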
Conclusion
The objective of the case study, namely to cluster the data and to identify the important features
along with their contribution to the clustering, has been achieved.