In [1]: # Import the needed Python libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

In [2]: # Step 1: Load the given banknote authentication dataset.
data = pd.read_csv("Banknote-authentication-dataset-.csv")

In [3]: # Step 2: Calculate statistical measures, e.g. mean and standard deviation.
data.describe()
Out[3]:
                V1           V2
count  1372.000000  1372.000000
mean      0.433735     1.922353
std       2.842763     5.869047
min      -7.042100   -13.773100
25%      -1.773000    -1.708200
50%       0.496180     2.319650
75%       2.821475     6.814625
max       6.824800    12.951600
In [4]:
mean = np.mean(data, 0)
print(mean)
std_dev = np.std(data, 0)
print(std_dev)
V1    0.433735
V2    1.922353
dtype: float64
V1    2.841726
V2    5.866907
dtype: float64
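As a side note, the small gap between the std values from describe() (2.842763 / 5.869047) and from np.std (2.841726 / 5.866907) comes from the degrees-of-freedom default: pandas uses the sample estimate (ddof=1) while np.std uses the population estimate (ddof=0). A minimal sketch showing the difference, assuming the same CSV file name as above:

import numpy as np
import pandas as pd

data = pd.read_csv("Banknote-authentication-dataset-.csv")

print(data.mean())        # same values as np.mean(data, 0)
print(data.std(ddof=0))   # matches np.std(data, 0) above
print(data.std())         # ddof=1: matches the std row of data.describe()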
In [5]: # Step 3: Visualise your data as you consider fit.
plt.plot(data['V1'], data['V2'], 'rx')
plt.plot(mean['V1'], mean['V2'], '*')
plt.xlabel('V1')
plt.ylabel('V2')
plt.show()
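Seaborn is imported above but never used; as a hypothetical alternative (not part of the original notebook), the same view could be drawn with sns.scatterplot:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("Banknote-authentication-dataset-.csv")
sns.scatterplot(data=data, x='V1', y='V2')
# mark the mean of each variable with a black star
plt.scatter(data['V1'].mean(), data['V2'].mean(), color='black', marker='*', s=200)
plt.show()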
Step 4: Evaluate if the given dataset is suitable for the K-Means clustering task.

Visually, a few clusters can be extracted from the graph: one group of points sits distinctly apart from the rest of the data. Therefore, this data is suitable for clustering, but we should normalise it first for better results.
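One way to back up the visual judgement (an addition, not part of the given exercise) is the elbow method: fit K-Means for several values of k and look for the point where the inertia stops dropping sharply. A sketch, assuming the same data as above:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = pd.read_csv("Banknote-authentication-dataset-.csv")

inertias = []
ks = range(1, 8)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data[['V1', 'V2']])
    inertias.append(km.inertia_)   # within-cluster sum of squared distances

plt.plot(ks, inertias, 'o-')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()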
In [6]: # Normalise the data (min-max scaling to the [0, 1] range)
data_min = np.min(data, 0)
data_max = np.max(data, 0)
# print(data_min, data_max)
normed = (data - data_min) / (data_max - data_min)
print(normed)
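For reference, the same min-max normalisation can be done with scikit-learn's MinMaxScaler; this is an equivalent alternative, not what the notebook itself uses:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv("Banknote-authentication-dataset-.csv")
scaler = MinMaxScaler()                       # scales each column to [0, 1]
normed_alt = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print(normed_alt.describe())                  # min 0 and max 1 for every column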
In [7]: # Perform the clustering
v1 = normed['V1']
v2 = normed['V2']
km_res = KMeans(n_clusters=2)
normed_predicted = km_res.fit_predict(normed[['V1', 'V2']])
normed_predicted
normed['cluster'] = normed_predicted
# normed.head()
km_res.cluster_centers_

Out[7]:
array([[0.67378548, 0.69821998],
       [0.36988789, 0.4479234 ]])
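Note that the 0/1 labels and the row order of cluster_centers_ are arbitrary and can differ between runs. A small sketch (an assumption on top of the notebook, reusing the normed frame from above) that pins the result down with random_state:

import numpy as np
from sklearn.cluster import KMeans

km_repro = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km_repro.fit_predict(normed[['V1', 'V2']])
print(km_repro.cluster_centers_)   # identical centres on every run
print(np.bincount(labels))         # size of each cluster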
In [8]:
df1 = normed[normed.cluster == 0]
df2 = normed[normed.cluster == 1]
plt.scatter(df1.V1, df1['V2'], color='red')
plt.scatter(df2.V1, df2['V2'], color='green')
plt.scatter(km_res.cluster_centers_[:, 0], km_res.cluster_centers_[:, 1],
            color='black', marker='*', label='centroid', s=400)
plt.xlabel('V1')
plt.ylabel('V2')
plt.show()
[Scatter plot: the two clusters in red and green with the fitted centroids marked by black stars; both axes (V1, V2) span the normalised 0 to 1 range.]
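As a possible follow-up (not part of the original exercise), the silhouette score gives a numeric check of how well separated the two clusters are; values closer to 1 indicate tighter, better-separated clusters. A sketch, assuming the normed frame and cluster column from above:

from sklearn.metrics import silhouette_score

score = silhouette_score(normed[['V1', 'V2']], normed['cluster'])
print(f"silhouette score for k = 2: {score:.3f}")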