
dsm-3

December 19, 2023

0.1 1. Write a function to read a data set and store it as a matrix


[201]: import pandas as pd
import numpy as np
from scipy.spatial import distance
from collections import Counter

def read_dataset(filename):
    df = pd.read_csv(filename)
    matrix = df.to_numpy()
    return matrix

csv_file = "iris.csv"
dataset = read_dataset(csv_file)

[202]: # Print the matrix
print(dataset)

[[5.1 3.5 1.4 0.2 'Setosa']
[4.9 3.0 1.4 0.2 'Setosa']
[4.7 3.2 1.3 0.2 'Setosa']
[5.0 3.4 1.5 0.2 'Setosa']
[5.7 3.8 1.7 0.3 'Setosa']
[5.1 3.8 1.5 0.3 'Setosa']
[5.5 4.2 1.4 0.2 'Setosa']
[4.9 3.1 1.5 0.2 'Setosa']
[5.0 3.2 1.2 0.2 'Setosa']
[5.0 3.3 1.4 0.2 'Setosa']
[7.0 3.2 4.7 1.4 'Versicolor']
[6.9 3.1 4.9 1.5 'Versicolor']
[5.5 2.3 4.0 1.3 'Versicolor']
[6.5 2.8 4.6 1.5 'Versicolor']
[6.3 2.5 4.9 1.5 'Versicolor']
[6.0 3.4 4.5 1.6 'Versicolor']
[6.7 3.1 4.7 1.5 'Versicolor']
[6.3 2.3 4.4 1.3 'Versicolor']
[5.6 3.0 4.1 1.3 'Versicolor']
[5.1 2.5 3.0 1.1 'Versicolor']

[6.3 3.3 6.0 2.5 'Virginica']
[5.8 2.7 5.1 1.9 'Virginica']
[7.1 3.0 5.9 2.1 'Virginica']
[6.3 2.9 5.6 1.8 'Virginica']
[6.5 3.0 5.8 2.2 'Virginica']
[7.6 3.0 6.6 2.1 'Virginica']
[4.9 2.5 4.5 1.7 'Virginica']
[7.3 2.9 6.3 1.8 'Virginica']
[6.7 2.5 5.8 1.8 'Virginica']
[7.2 3.6 6.1 2.5 'Virginica']]

0.2 2.a Calculate Data mean for each attribute and represent it as a vector
[216]: def calculate_data_mean(filename):
    df = pd.read_csv(filename)
    mean_vector = df.mean(numeric_only=True)
    return mean_vector

csv_file = "iris.csv"
mean_vector = calculate_data_mean(csv_file)
print("Mean Vector:")
print(mean_vector)

Mean Vector:
sepal.length 5.95
sepal.width 3.07
petal.length 3.86
petal.width 1.22
dtype: float64
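
The same vector can be cross-checked directly against the matrix built in question 1, by averaging over the numeric columns with NumPy. This is a minimal sketch, assuming `dataset` from `read_dataset` holds the four numeric attributes in columns 0-3 and the variety label in the last column:

# Cross-check of the mean vector using the matrix from question 1
# (assumption: columns 0-3 are numeric, the last column is the label).
numeric_part = dataset[:, :4].astype(float)   # drop the label column
mean_vector_np = np.mean(numeric_part, axis=0)
print("Mean vector (NumPy):", mean_vector_np)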

0.3 2.b Calculate Manhattan distance between two data objects


[204]: def manhattan_distance(vec1, vec2):
    dist = np.sum(np.absolute(np.array(vec1) - np.array(vec2)))
    return dist

0.4 2.c Calculate Euclidean distance between two data objects


[205]: # calculating Euclidean distance using linalg.norm()
def euclidean_distance(vec1, vec2):
    dist = np.linalg.norm(vec1 - vec2)
    return dist

2
0.5 2.d Calculate Chebyshev distance between two data objects
[206]: def chebyshev_distance(vec1, vec2):
    dist = np.max(np.absolute(np.array(vec1) - np.array(vec2)))
    return dist
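
As a quick sanity check (an illustrative example, not part of the original exercise), the three metrics can be compared on two of the rows printed above, the first Setosa and the first Versicolor:

# Illustrative sanity check on two hand-picked feature vectors
# (values taken from the printed matrix above).
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([7.0, 3.2, 4.7, 1.4])
print("Manhattan:", manhattan_distance(a, b))   # sum of absolute differences
print("Euclidean:", euclidean_distance(a, b))   # straight-line distance
print("Chebyshev:", chebyshev_distance(a, b))   # largest single-attribute gap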

0.6 2.e Calculate Mahalanobis distance.


[207]: def mahalanobis_distance(data, x):
    mean_vector = data.mean().values
    cov_matrix = data.cov().values
    inv_cov_matrix = np.linalg.inv(cov_matrix)
    x_minus_mean = x - mean_vector
    mahalanobis_sq = np.dot(np.dot(x_minus_mean, inv_cov_matrix), x_minus_mean.T)
    mahalanobis_distance = np.sqrt(mahalanobis_sq)
    return mahalanobis_distance

iris_data = pd.read_csv('iris.csv')
columns = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
iris_subset = iris_data[columns]

point = np.array([5.0, 3.2, 1.4, 0.2])  # Example point

# use a name other than `distance` so the scipy.spatial import is not shadowed
dist = mahalanobis_distance(iris_subset, point)
print("Mahalanobis Distance:", dist)

# ref https://www.machinelearningplus.com/statistics/mahalanobis-distance/

Mahalanobis Distance: 1.357839356712021
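
The result can be cross-checked against SciPy's built-in implementation. This is a hedged sketch that relies only on the `from scipy.spatial import distance` already made in the first cell and the `iris_subset` and `point` defined above:

# Cross-check with scipy.spatial.distance.mahalanobis, which expects the two
# vectors and the *inverse* covariance matrix.
inv_cov = np.linalg.inv(iris_subset.cov().values)
scipy_dist = distance.mahalanobis(point, iris_subset.mean().values, inv_cov)
print("Mahalanobis Distance (SciPy):", scipy_dist)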

0.7 Write a separate function to implement the K-Nearest Neighbors classification method using all the functions implemented in question (2) above
[209]: def knn_classify(data, labels, query_point, k, distance_metric):
    distances = []
    for i, row in data.iterrows():
        if distance_metric == 'manhattan':
            dist = manhattan_distance(row, query_point)
        elif distance_metric == 'chebyshev':
            dist = chebyshev_distance(row, query_point)
        elif distance_metric == 'euclidean':
            dist = euclidean_distance(row, query_point)
        elif distance_metric == 'mahalanobis':
            # note: this reuses the question 2.e function, so the distance is
            # measured between the query point and the data distribution and
            # is therefore the same for every row
            dist = mahalanobis_distance(data, query_point)
        else:
            raise ValueError("Invalid distance metric. Supported options are "
                             "'manhattan', 'chebyshev', 'euclidean', and 'mahalanobis'.")
        distances.append((dist, labels[i]))

    distances.sort()
    k_nearest = distances[:k]
    k_nearest_labels = [label for (_, label) in k_nearest]

    most_common = Counter(k_nearest_labels).most_common(1)
    predicted_label = most_common[0][0]

    return predicted_label

iris_data = pd.read_csv('iris.csv')
feature_columns = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
iris_features = iris_data[feature_columns]
iris_labels = iris_data['variety']
random_point = np.array([6.1, 2.9, 4.7, 1.3])
k = 5  # Number of nearest neighbors to consider

distance_metrics = ['manhattan', 'chebyshev', 'euclidean', 'mahalanobis']

for metric in distance_metrics:
    predicted_label = knn_classify(iris_features, iris_labels, random_point, k, metric)
    print(f"Predicted variety using {metric.capitalize()} distance: {predicted_label}")

Predicted variety using Manhattan distance: Versicolor
Predicted variety using Chebyshev distance: Versicolor
Predicted variety using Euclidean distance: Versicolor
Predicted variety using Mahalanobis distance: Setosa
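
To gauge how the classifier does beyond a single query point, a simple leave-one-out check can be run with the same `knn_classify` function. This is an illustrative sketch, not part of the original exercise; it reuses the Euclidean metric and the `iris_features`/`iris_labels` objects defined above:

# Leave-one-out accuracy sketch: classify each row using the remaining rows.
correct = 0
for idx in iris_features.index:
    train_X = iris_features.drop(index=idx).reset_index(drop=True)
    train_y = iris_labels.drop(index=idx).reset_index(drop=True)
    pred = knn_classify(train_X, train_y, iris_features.loc[idx].values, k, 'euclidean')
    correct += int(pred == iris_labels.loc[idx])
print("Leave-one-out accuracy:", correct / len(iris_features))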

0.8 Write a separate function to implement the K-means clustering method using all the functions implemented in question (2) above
[214]: def initialize_centroids(data, k):
    centroids = data[np.random.choice(range(data.shape[0]), k, replace=False)]
    return centroids

def assign_clusters(data, centroids, distance_metric):
    cluster_labels = np.zeros(data.shape[0], dtype=int)
    for i, point in enumerate(data):
        distances = []
        if distance_metric == 'mahalanobis':
            covariance_matrix = np.cov(data.T)
            for centroid in centroids:
                distances.append(mahalanobis_distance(point, centroid, covariance_matrix))
        elif distance_metric == 'manhattan':
            for centroid in centroids:
                distances.append(manhattan_distance(point, centroid))
        elif distance_metric == 'chebyshev':
            for centroid in centroids:
                distances.append(chebyshev_distance(point, centroid))
        elif distance_metric == 'euclidean':
            for centroid in centroids:
                distances.append(euclidean_distance(point, centroid))
        cluster_labels[i] = np.argmin(distances)
    return cluster_labels

def update_centroids(data, cluster_labels, k):
    centroids = []
    for i in range(k):
        cluster_data = data[cluster_labels == i]
        centroid = np.mean(cluster_data, axis=0)
        centroids.append(centroid)
    centroids = np.vstack(centroids)
    return centroids

def kmeans(data, k, distance_metric='euclidean', max_iterations=100):
    centroids = initialize_centroids(data, k)
    for _ in range(max_iterations):
        cluster_labels = assign_clusters(data, centroids, distance_metric)
        new_centroids = update_centroids(data, cluster_labels, k)
        if np.array_equal(centroids, new_centroids):
            break
        centroids = new_centroids
    return cluster_labels, centroids

# The distance functions are redefined here so they operate on plain NumPy
# arrays; the Mahalanobis version takes the covariance matrix as an explicit
# argument instead of recomputing it from the data.
def euclidean_distance(vec1, vec2):
    dist = np.linalg.norm(vec1 - vec2)
    return dist

def manhattan_distance(vec1, vec2):
    dist = np.sum(np.absolute(vec1 - vec2))
    return dist

def chebyshev_distance(vec1, vec2):
    dist = np.max(np.absolute(vec1 - vec2))
    return dist

def mahalanobis_distance(vec1, vec2, covariance_matrix):
    diff = vec1 - vec2
    inv_covariance = np.linalg.inv(covariance_matrix)
    dist = np.sqrt(np.dot(np.dot(diff, inv_covariance), diff.T))
    return dist

iris_data = np.genfromtxt('iris.csv', delimiter=',', skip_header=1, usecols=(0, 1, 2, 3))

k = 3
distance_metrics = ['mahalanobis', 'manhattan', 'chebyshev', 'euclidean']
for metric in distance_metrics:
    cluster_labels, centroids = kmeans(iris_data, k, distance_metric=metric)
    print(f"Distance Metric: {metric}")
    print("Cluster Labels:")
    print(cluster_labels)
    print("Centroids:")
    print(centroids)
    print()

Distance Metric: mahalanobis
Cluster Labels:
[2 0 2 2 2 2 2 2 0 2 1 1 0 1 1 2 1 0 2 0 1 1 1 2 1 1 2 1 1 1]
Centroids:
[[5.36 2.66 2.8 0.82 ]
[6.76153846 2.97692308 5.49230769 1.86923077]
[5.31666667 3.34166667 2.53333333 0.68333333]]

Distance Metric: manhattan
Cluster Labels:
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 2 1 2 2 2 2 1 2 2 2]
Centroids:
[[5.09090909 3.36363636 1.57272727 0.3 ]
[6.13636364 2.80909091 4.58181818 1.5 ]
[6.875 3.025 6.0125 2.1 ]]

Distance Metric: chebyshev
Cluster Labels:
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 2 1 2 2 2 2 1 2 2 2]
Centroids:
[[5.09090909 3.36363636 1.57272727 0.3 ]
[6.13636364 2.80909091 4.58181818 1.5 ]
[6.875 3.025 6.0125 2.1 ]]

Distance Metric: euclidean
Cluster Labels:
[0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 0 1 2 1 1 1 1 2 1 1 1]
Centroids:
[[5.09090909 3.36363636 1.57272727 0.3 ]
[6.875 3.025 6.0125 2.1 ]
[6.13636364 2.80909091 4.58181818 1.5 ]]
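
Because the centroids are initialized by random sampling, the labels and centroids above will generally differ between runs. Fixing the NumPy seed before calling kmeans makes a run repeatable; this is an optional, illustrative addition (the seed value is arbitrary):

# Optional: seed the random number generator so initialize_centroids picks the
# same starting rows on every run, making the clustering output reproducible.
np.random.seed(42)
cluster_labels, centroids = kmeans(iris_data, k, distance_metric='euclidean')
print(cluster_labels)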
