DMDW Assignment 1-8
DMDW Assignment 1-8
Problem 1
Use Pandas library in Python to create Data frame for detailed analysis of the data set
given by downloading data from the following URL:
https://www.espncricinfo.com/records/most-wickets-in-career-93276
i) Store the data in the tables into all_tables. Use pd.read_html command.
Check the type of the data structures of the tables.
ii) Create a Data Frame df by storing the table and display the data frame.
iii) Display first 11 rows from the data frame and all the features from the data
frame.
iv) Convert the data frame into a NumPy array. Display all the names of the
players. Also display the names of the players along with the number of
wickets taken by each of them.
v) Display the details of the player located in the index 10.
vi) Create a new data frame df1 by setting the name of the players as row index.
vii) Display first 5 records from the new data frame df1 and print the detail records
of the player in the fifth position of the data frame.
viii) Find the total number of wickets taken by the player at index 10 in the data
frame df.
ix) Calculate the wicket per match of all players and create a new field
WicketPerMatch in the data frame.
x) Normalized the values of WicketPerMatch and append as an attribute in the
data frame.
xi) Represent the relationship between Strike Rate (SR) and Batting Average (Ave)
using scatterplot (import seaborn library).
xii) Extract country from the attribute “Player” and append as a separate attribute
in the data set.
xiii) Calculate the average wickets collected by each country.
xiv) Represent Strike Rate (SR) Vs Batting Average (Ave) using Scatter plot for
Australia.
xv) Display the records from the data set where SR is less than 55 and Ave is less
than 25
xvi) Create new fields StartYear and EndYear to store the year when the player
started his career and the year of retirement. Use function to extract proper
data from the attribute “Span”. Find out the career length of each player. And
store the values for each player.
xvii) Use histogram to represent the career length. Use 20 bins to show the
distribution.
xviii) Visually represent the number of wickets taken by each country. Use catplot()
of Seaborn library and represent wickets Vs country.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
all_tables = pd.read_html("https://www.espncricinfo.com/records/most-
wickets-in-career-93276")
all_tables
type(all_tables)
list
df = all_tables[0]
type(df)
pandas.core.frame.DataFrame
df.head(11)
arr = df.to_numpy()
type(arr)
numpy.ndarray
player_names = df['Player']
print(player_names)
0 M Muralidaran (ICC/SL)
1 SK Warne (AUS)
2 JM Anderson (ENG)
3 A Kumble (IND)
4 SCJ Broad (ENG)
...
78 MM Ali (ENG)
79 BA Stokes (ENG)
80 AME Roberts (WI)
81 JA Snow (ENG)
82 JR Thomson (AUS)
Name: Player, Length: 83, dtype: object
print(player_wickets)
Player Wkts
0 M Muralidaran (ICC/SL) 800
1 SK Warne (AUS) 708
2 JM Anderson (ENG) 704
3 A Kumble (IND) 619
4 SCJ Broad (ENG) 604
.. ... ...
78 MM Ali (ENG) 204
79 BA Stokes (ENG) 203
80 AME Roberts (WI) 202
81 JA Snow (ENG) 202
82 JR Thomson (AUS) 200
player_details = df.iloc[10]
print(player_details)
df1 = df.set_index('Player')
print(df1)
print(df1.head())
player_in_fifth_position = df1.iloc[4]
print(player_in_fifth_position)
Span 2007-2023
Mat 167
Inns 309
Balls 33698
Overs 5616.2
Mdns 1304
Runs 16719
Wkts 604
BBI 8/15
Ave 27.68
Econ 2.97
SR 55.79
4 28
5 20
Name: SCJ Broad (ENG), dtype: object
wickets_at_index_10 = df.iloc[10]['Wkts']
print(wickets_at_index_10)
434
print(df)
min_wicket_per_match = df['WicketPerMatch'].min()
max_wicket_per_match = df['WicketPerMatch'].max()
df['NormalizedWicketPerMatch'] = (df['WicketPerMatch'] -
min_wicket_per_match) / (max_wicket_per_match - min_wicket_per_match)
print(df)
NormalizedWicketPerMatch
0 1.000000
1 0.733957
2 0.466552
3 0.688524
4 0.436497
.. ...
78 0.291580
79 0.040953
80 0.596531
81 0.555313
82 0.508114
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='SR', y='Ave')
plt.title('Relationship between Strike Rate (SR) and Batting Average
(Ave)')
plt.xlabel('Strike Rate (SR)')
plt.ylabel('Batting Average (Ave)')
df['Country'] = df['Player'].str.extract(r'\(([^)]+)\)')
print(df)
NormalizedWicketPerMatch Country
0 1.000000 ICC/SL
1 0.733957 AUS
2 0.466552 ENG
3 0.688524 IND
4 0.436497 ENG
.. ... ...
78 0.291580 ENG
79 0.040953 ENG
80 0.596531 WI
81 0.555313 ENG
82 0.508114 AUS
average_wickets_by_country = df.groupby('Country')['Wkts'].mean()
print(average_wickets_by_country)
Country
AUS 316.210526
BAN 237.000000
ENG 312.200000
ENG/ICC 226.000000
ICC/NZ 362.000000
ICC/SA 292.000000
ICC/SL 800.000000
IND 352.272727
NZ 306.500000
PAK 299.714286
SA 344.571429
SL 394.000000
WI 314.111111
ZIM 216.000000
Name: Wkts, dtype: float64
plt.figure(figsize=(10, 6))
sns.regplot(data=df_aus, x='SR', y='Ave', scatter_kws={'s':100},
line_kws={'color':'red'})
plt.show()
df[['StartYear', 'EndYear']] = df['Span'].str.split('-',
expand=True).astype(int)
plt.figure(figsize=(10, 6))
plt.hist(df['CareerLength'].dropna(), bins=20, edgecolor='black')
plt.figure(figsize=(12, 8))
sns.catplot(data=wickets_by_country, x='Country', y='Wkts',
kind='bar', height=6, aspect=2)
plt.show()
a= ( (𝑿𝒕 𝑿 −𝟏 𝑿𝒕 ) 𝒀
Problem 3
Find multiple regression equation for the following sets of data:
X1 X2 Yi
1 4 1
2 5 6
3 8 8
4 2 12
Follow problem number 2 for solving the question.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, 2]
X_train = X[:-25]
X_test = X[-25:]
y_train = y[:-25]
y_test = y[-25:]
model = LinearRegression()
model.fit(X_train.reshape(-1, 1), y_train)
LinearRegression()
coefficients = model.coef_
mean_square_error = mean_squared_error(y_test, y_pred)
residual_sum_of_square = ((y_test - y_pred) ** 2).sum()
Out[4]: LinearRegression i ?
LinearRegression()
In [ ]:
In [1]: import numpy as np
from sklearn.linear_model import LinearRegression
Out[4]: LinearRegression i ?
LinearRegression()
In [10]: Xt = X.T
XtX_inv = np.linalg.inv(Xt @ X)
a = XtX_inv @ Xt @ Y
print(f"Value of a: {a}")
In [ ]:
CS552 Data Mining and Data Warehousing Lab MO 2024
CS551 Machine Learning Lab
Date: 03rd Sept. 2024
Lab Assignment 4
1. Modify the data set California Housing Price by converting the values of the attribute
median_income to ‘L’, ‘M’ and ‘H’. The range of the values are given below:
2. Implement k-nearest neighbour algorithm on Iris data set. Consider only two
attributes and different values of k. Also find out the accuracy of the result. Plot the
accuracy.
import pandas as pd
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
file_path = 'housing.csv'
df = pd.read_csv(file_path)
ocean_proximity_counts =
df['ocean_proximity'].value_counts(normalize=True)
information_median_income = mutual_info_score(df['median_income'],
df['ocean_proximity'])
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
k_values = range(1,23,2)
accuracies = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
plt.figure(figsize=(10, 6))
plt.plot(k_values, accuracies, marker='o', linestyle='--', color='b')
plt.title('Accuracy vs. k in k-NN')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()
CS552 Data Mining and Data Warehousing MO-24
CS551 Machine Learning Lab
Date :03/09/24
Lab Assignment 3
Problem1 is based on Data Cleaning
Q1
(i) Import not_clean.csv file using pandas and display first ten records from the file.
(ii) How many null values are there in each column? Replace null values by mean value
using fillna for each column. Again, check the results using isnull.sum().
(iii) Again, import not_clean.csv file using pandas. Remove the rows which contains null
values and display the results. Reset the index value using reset_index().
(iv) Again, import not_clean.csv file using pandas and remove the duplicated records
from the file. Display the results.
(v) Remove duplicates in columns by using subset parameter.
(vi) Again, import not_clean.csv file using pandas and store numerical value in another
dataframe using df.loc or df1.select_dtypes. Normalize the numerical column in
dataset to common scale using preprocessing.MinMaxScaler() from sklearn.
(vii) Import not_clean.csv file using pandas and store ‘Target_Name is separate data frame
variable. Convert Target_name label into number using LabelEncoder.
For example 'setosa' -0, 'versicolor'=1, 'virginica' =2
(viii) Import not_clean.csv file using pandas and store ‘Target_Name is separate data frame
variable. Convert Target_name label into number using LabelEncoder.
For example 'setosa' -1, 'versicolor'=2, 'virginica' =3
(ix) Import not_clean.csv file using pandas and store ‘Target_Name is separate data
variable. Convert the Target_name column using onehotencoder.
For example 'setosa' - 1., 0., 0, 'versicolor'=0,1, 0 , 'virginica' =0, 0, 1
(x) Find the probability of each using np.mean()
(xi) Import not_clean.csv file using pandas and divide the not_clean.csv into x and y.
Where x contains 'Sepal_length', 'Sepal_width','Petal_length','Petal_width'
and y contains 'Target_Name'.
xii) Split the training and testing data in such a way that test should contains 20
percentage of data using from sklearn.model_selection import train_test_split.
xiii) Split the training and testing data without using sklearn.model_selection import
train_test_split.
xiv) Find the outliers in the not_clean.csv
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv('not_clean.csv')
df.head(10)
Number
0 20000
1 15000
2 10000
3 5000
4 1000
5 2000
6 3000
7 4000
8 5000
9 6000
numeric_cols = df.select_dtypes(include='number').columns
means = df[numeric_cols].mean()
df[numeric_cols] = df[numeric_cols].fillna(means)
null_values_before = df.isnull().sum()
null_values_after = df.isnull().sum()
null_values_before, null_values_after
print(df.isnull().sum())
Sepal_length 0
Sepal_width 0
Petal_length 0
Petal_width 0
Target 0
Target_Name 0
Number 0
dtype: int64
df = pd.read_csv('not_clean.csv')
df_cleaned = df.dropna()
df_cleaned.reset_index(drop=True, inplace=True)
df_cleaned.head()
Number
0 20000
1 15000
2 10000
3 5000
4 1000
df = pd.read_csv('not_clean.csv')
df_no_duplicates = df.drop_duplicates()
df_no_duplicates.head()
Number
0 20000
1 15000
2 10000
3 5000
4 1000
df = pd.read_csv('not_clean.csv')
df_subset_no_duplicates = df.drop_duplicates(subset=['Sepal_length',
'Sepal_width'])
df_subset_no_duplicates.head()
Number
0 20000
1 15000
2 10000
3 5000
4 1000
df = pd.read_csv('not_clean.csv')
scaler = MinMaxScaler()
df_numerical_normalized =
pd.DataFrame(scaler.fit_transform(df_numerical),
columns=df_numerical.columns)
df_numerical_normalized.head()
df_target = df['Target_Name']
label_encoder = LabelEncoder()
df_target_encoded = label_encoder.fit_transform(df_target)
df_target_encoded[:10]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
df['Target_Name'].head()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: Target_Name, dtype: float64
encoder = OneHotEncoder(sparse=False)
df_target_encoded = encoder.fit_transform(df[['Target_Name']])
df_target_encoded[:10]
array([[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.],
[1.]])
probabilities = df['Target_Name'].value_counts(normalize=True)
probabilities
X.head(), y.head()
outliers = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
outliers
Target_Name Number
0 False False
1 False False
2 False False
3 False False
4 False False
.. ... ...
152 False False
153 False False
154 False False
155 False False
156 False False
(b) Create a variable X that contains the following features from the above datasets
'Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI',
'DiabetesPedigreeFunction','Age'
and y contains Outcome
© Split the dataset into a 75 percent training and 25 percent testing set using
train_test_split
(d) trains the model using logistic regression
(e) use the test set and feeds it into the model to obtain the predictions
(f) shows the number of actual and predicted labels using confusion matrix
(g) Create a heatmap for confusion matrix where xlabel is predicted label and ylabel is
Actual label using seaborn
(h) Find out the accuracy , Precision and recall score using sklearn.metrics
(i) Find out the accuracy of prediction using score (Xtest, ytest)
(j) Find the the precision, recall, and F1-score of the model using classification
_report() function of the metrics module
(k) Plot the Receiver Operating Characteristic (ROC) Curve and find the Area Under
the Curve (AUC).
(b) Copies the first two features(Glucose and Blood Pressure) of the dataset into a
two-dimensional list
© Plots a scatter plot showing the distribution of points for the two features
(d) Display Diabetes in red and No Diabetes in blue color
(b) Copies the first three features (Glucose, Blood Pressure and BMI) of the dataset
into a 3-dimensional list
© Plots a scatter plot showing the distribution of points for the two features
(d) Display Diabetes in red and No Diabetes in blue color
Q3(a) Load the Diabetes Dataset
(b) Select the first features (Glucose) into x variable and Outcome into y variable
from the dataset
© Plots a scatter plot showing the distribution of points for x and y. Use edge color
based on color red or blue.
(d) Use patches using import matplotlib.patches for red and blue color and label them
on Graph
data = pd.read_csv('diabetes.csv')
data
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
767 1 93 70 31 0 30.4
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
log_reg = LogisticRegression(max_iter=500)
log_reg.fit(X_train, y_train)
LogisticRegression(max_iter=500)
y_pred = log_reg.predict(X_test)
[[137 8]
[ 31 47]]
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
Accuracy: 0.8251121076233184
Precision: 0.8545454545454545
Recall: 0.6025641025641025
data = pd.read_csv('diabetes.csv')
data = pd.read_csv('diabetes.csv')
data = pd.read_csv('diabetes.csv')
x = data['Glucose']
y = data['Outcome']
print('Intercept:', log_reg_glucose.intercept_)
print('Coefficient:', log_reg_glucose.coef_)
Intercept: [-5.35002807]
Coefficient: [[0.03787262]]
def sigmoid(x):
return 1 / (1 + np.exp(-(log_reg_glucose.intercept_[0] +
log_reg_glucose.coef_[0][0] * x)))
Q1(a) Load the following dataset, and the name of the dataset is ‘svmass7.csv’, which
contains 10 records.
x1 x2 y
4 2.9 1
4 4 1
1 2.5 -1
2.5 1 -1
4.9 4.5 1
1.9 1.9 -1
3.5 4 1
0.5 1.5 -1
2 2.1 -1
4.5 2.5 1
Q1(b) Plot the data using Seaborn
Q1© Train the model using Scikitlearn’s svm module’s SVC class. Use linear kernel to
solve the problem.
Q1(d) Find the following
a) Weights
b) Bias
c) Indices of support vectors
d) Support vectors
e) Number of support vectors of each class
f) Coefficient of support vector in the decision function
Q1€ Plot the Hyperplane and the margins.
Q1(f) Predict the class of the following values
(2,7)
(5,6)
(1,4)
(2,0)
Q2(a) Create two sets of random points (a total of 1000 points and add noise 0.20)
distributed in circular fashion using the make_circles() function
(b) Plot the points out on a 2D chart-using scatter and print the xlabel and ylabel.
© Add the third axis, the z-axis (z=x*x + y*y), and plot the chart in 3D
(d) Find the value of x3 and plot the 3D Hyperplane, train the model using the third
dimension. To plot the hyperplane in 3D, use the plot_surface() function
Q3(g) Find the min and max values of the first and second features
Q3(h) Take the step size h = (x_max / x_min)/100
Q3(i) Generates evenly spaced values between x_min and x_max and between ymin and
y_max of values from x_min to x_max with a specified step size h. Pass these
parameters to np.mesgrid.
Creates two 2D arrays (xx and yy) using np.mesgrid representing the grid of points
over which the model's predictions will be made.
Q3(j) Predict each point and stored values in the Z variable. Changes the shape of Z to
match that of xx
Q3(k) Paint the groups (malignant and benign) in colours using the contourf() function.
Also, proper labels, legends, and targets should be displayed.
Use the following parameters in contourf()
cmap=plt.cm.coolwarm, alpha=0.6
Q4(a) Repeat the Q3 using the Radial Basis function (RBF), also known as Gaussian Kernel
(Non-Linear kernels). Take C=1
(b) See the effects of classifying the points using the following varying values of C and
Gamma.
a) C=1, gamma =10
b) C=1, gamma =0.1
c) C=10−10 , gamma =10
d) C=1010 , gamma =0.1
© Repeat the Q3 using the polynomial kernel (Non-Linear kernels).
See the effects of classifying the points using the following varying values of degree
a) kernel='poly', degree=4, C=1, gamma='auto'
b) kernel='poly', degree=3, C=1, gamma='auto'
c) kernel='poly', degree=2, C=1, gamma='auto'
d) kernel='poly', degree=1, C=1, gamma='auto' # same as Linear
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm
import numpy as np
from sklearn.datasets import make_circles
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import load_breast_cancer
data = pd.read_excel('svmass7.csv.xlsx')
print(data)
x1 x2 y
0 4.0 2.9 1
1 4.0 4.0 1
2 1.0 2.5 -1
3 2.5 1.0 -1
4 4.9 4.5 1
5 1.9 1.9 -1
6 3.5 4.0 1
7 0.5 1.5 -1
8 2.0 2.1 -1
9 4.5 2.5 1
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
SVC(kernel='linear')
w = clf.coef_[0]
slope = -w[0] / w[1]
b = clf.intercept_[0]
xx = np.linspace(0, 5)
yy = slope * xx - b / w[1]
print(predictions)
[ 1 1 -1 -1]
X, y = make_circles(n_samples=1000, noise=0.20)
fig = plt.figure(figsize=(10,8))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Accent, s=40)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
z = X[:, 0]**2 + X[:, 1]**2
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], z, c=y, cmap=plt.cm.Accent, s=40)
plt.show()
X, y = make_circles(n_samples=1000, noise=0.20)
clf = svm.SVC(kernel='linear')
clf.fit(np.c_[X, z], y)
SVC(kernel='linear')
plt.show()
data = load_breast_cancer()
print(data.data[:10])
print(data.target[:10])
print(data.feature_names)
print(data.target_names)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
['malignant' 'benign']
X = data.data[:, :2]
y = data.target
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.legend(loc='best')
plt.show()
C = 10
clf = svm.SVC(kernel='linear', C=C)
clf.fit(X, y)
SVC(C=10, kernel='linear')
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
SVC(C=1)
params = [(1, 10), (1, 0.1), (10**-10, 10), (10**10, 0.1)]
for C, gamma in params:
clf = svm.SVC(kernel='rbf', C=C, gamma=gamma)
clf.fit(X, y)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plotting
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm,
edgecolors='k')
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.title(f"SVM with RBF Kernel (C={C}, gamma={gamma})")
plt.show()
CS552 Data Mining and Data Warehousing MO-24
CS551 Machine Learning Lab
Date :29/10/24
Lab Assignment 8
Q1(a) Load the following dataset, and the name of the dataset is ‘cluster_data.csv’,
which contains 9 records.
x1 x2
1.0 2.0
1.5 1.8
5.0 8.0
8.0 8.0
1.0 0.6
9.0 11.0
6.0 2.0
7.0 5.0
4.0 7.0
Q1(b) load the CSV file into a Pandas dataframe, and plot a scatter plot showing the
points
Q1© Generates three random centroids and marks them on the scatter plot
Q1(d) Implements the K-Means algorithm and plot a scatter plot showing the points
Q1(e ) Print out the clusters to which each point belongs
Q1(f) Find the location of each centroid
Q1(g) Now repeat the same exercise use the KMeans class in Scikit-learn to do
clustering and take cluster size =3
Q1(h) train the model using the fit() function
Q1(i) print the clusters label and centroids
Q1(j) plot the points and centroids on a scatter plot
Q1(k) Predict the cluster of the following values
(2,7)
(5,6)
(1,4)
(2,0)
Q1(l) Finding the Optimal K using Silhouette Coefficient
Q1(m) Plot a chart showing the various values of K and their corresponding Silhouette
Coefficients
Q2(a) Load the Iris dataset from sklearn using load_iris()
(b) import the data into a Pandas dataframe
© Find out its shape and clean the data if possible
(d) Plot a scatter plot showing the distribution in Sepal length and Sepal width
(e ) Cluster the points into three clusters, k=3 using Scikit-learn’s KMeans class
(f ) Plot a scatter plot showing the distribution in Sepal length and Sepal width
(g) Finding the Optimal K using Silhouette Coefficient
(h) Plot a chart showing the various values of K and their corresponding Silhouette
Coefficients
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
iris = load_iris()
iris_data = iris.data
iris_feature_names = iris.feature_names
df = pd.DataFrame(iris_data, columns=iris_feature_names)
KMeans(n_clusters=3, random_state=42)
plt.plot(K_range, silhouette_scores)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficient for Different Values of K')
plt.show()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from copy import deepcopy
from sklearn.metrics import silhouette_score
from sklearn import metrics
data = {'x1' : [1.0, 1.5, 5.0, 8.0, 1.0, 9.0, 6.0, 7.0, 4.0],
'x2' : [2.0, 1.8, 8.0, 8.0, 0.6, 11.0, 2.0, 5.0, 7.0]}
df = pd.DataFrame(data)
plt.scatter(df['x1'], df['x2'])
plt.title("Scatter plot of Data Points")
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
k = 3
X = np.array(list(zip(df['x1'], df['x2'])))
[[4. 8.]
[4. 0.]
[5. 8.]]
plt.scatter(df['x1'], df['x2'])
plt.scatter(Cx, Cy)
plt.xlabel('x')
plt.ylabel('y')
print(Cx)
print(Cy)
[4 4 5]
[8 0 8]
print(X)
[[ 1. 2. ]
[ 1.5 1.8]
[ 5. 8. ]
[ 8. 8. ]
[ 1. 0.6]
[ 9. 11. ]
[ 6. 2. ]
[ 7. 5. ]
[ 4. 7. ]]
print(C)
[[4. 8.]
[4. 0.]
[5. 8.]]
X = df.values
centroids, clusters = k_means(X, k)
clust = str(cluster)
for i, cluster in enumerate(clusters):
print("\nPoint " + str(X[i]),"Cluster " + clust + "\n")
Cluster Labels: [2 2 1 1 2 1 0 0 1]
Centroids: [[6.5 3.5 ]
[6.5 8.5 ]
[1.16666667 1.46666667]]
new_points = np.array([[2, 7], [5, 6], [1, 4], [2, 0]])
predictions = kmeans.predict(new_points)
silhouette_avgs = []
min_k = 2
f, ax = plt.subplots(figsize=(7, 5))
ax.plot(range(min_k, len(X)), silhouette_avgs)
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Coefficients")