
dsm-2

December 19, 2023

0.1 1. Write a function to read a dataset and store it as a matrix


[201]: import pandas as pd
import numpy as np
from scipy.spatial import distance
from collections import Counter

def read_dataset(filename):
    df = pd.read_csv(filename)
    matrix = df.to_numpy()
    return matrix

csv_file = "iris.csv"
dataset = read_dataset(csv_file)

[202]: # Print the matrix
print(dataset)

[[5.1 3.5 1.4 0.2 'Setosa']
 [4.9 3.0 1.4 0.2 'Setosa']
 [4.7 3.2 1.3 0.2 'Setosa']
 [5.0 3.4 1.5 0.2 'Setosa']
 [5.7 3.8 1.7 0.3 'Setosa']
 [5.1 3.8 1.5 0.3 'Setosa']
 [5.5 4.2 1.4 0.2 'Setosa']
 [4.9 3.1 1.5 0.2 'Setosa']
 [5.0 3.2 1.2 0.2 'Setosa']
 [5.0 3.3 1.4 0.2 'Setosa']
 [7.0 3.2 4.7 1.4 'Versicolor']
 [6.9 3.1 4.9 1.5 'Versicolor']
 [5.5 2.3 4.0 1.3 'Versicolor']
 [6.5 2.8 4.6 1.5 'Versicolor']
 [6.3 2.5 4.9 1.5 'Versicolor']
 [6.0 3.4 4.5 1.6 'Versicolor']
 [6.7 3.1 4.7 1.5 'Versicolor']
 [6.3 2.3 4.4 1.3 'Versicolor']
 [5.6 3.0 4.1 1.3 'Versicolor']
 [5.1 2.5 3.0 1.1 'Versicolor']
 [6.3 3.3 6.0 2.5 'Virginica']
 [5.8 2.7 5.1 1.9 'Virginica']
 [7.1 3.0 5.9 2.1 'Virginica']
 [6.3 2.9 5.6 1.8 'Virginica']
 [6.5 3.0 5.8 2.2 'Virginica']
 [7.6 3.0 6.6 2.1 'Virginica']
 [4.9 2.5 4.5 1.7 'Virginica']
 [7.3 2.9 6.3 1.8 'Virginica']
 [6.7 2.5 5.8 1.8 'Virginica']
 [7.2 3.6 6.1 2.5 'Virginica']]

0.2 2.a Calculate the data mean for each attribute and represent it as a vector
[203]: def calculate_data_mean(filename):
    # Read the CSV file using pandas
    df = pd.read_csv(filename)

    # Calculate the mean for each attribute
    mean_vector = df.mean(numeric_only=True)

    return mean_vector

# Provide the relative file path to the CSV file
csv_file = "iris.csv"

# Call the calculate_data_mean function with the csv_file path
mean_vector = calculate_data_mean(csv_file)

# Print the mean vector
print("Mean Vector:")
print(mean_vector)

Mean Vector:
sepal.length 5.95
sepal.width 3.07
petal.length 3.86
petal.width 1.22
dtype: float64

0.3 2.b Calculate Manhattan distance between two data objects

[204]: def manhattan_distance(vec1, vec2):
    dist = np.sum(np.absolute(np.array(vec1) - np.array(vec2)))
    return dist
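
A quick sanity check, using the first two Setosa rows from the matrix above; by hand, |5.1-4.9| + |3.5-3.0| + 0 + 0 = 0.7:

[ ]: # Sanity check: first two Setosa rows of the printed matrix
print(manhattan_distance([5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]))  # ~0.7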

0.4 2.c Calculate Euclidean distance between two data objects
[205]: # calculating Euclidean distance using linalg.norm()
def euclidean_distance(vec1, vec2):
    dist = np.linalg.norm(vec1 - vec2)
    return dist
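
For the same pair of rows, sqrt(0.2² + 0.5²) = sqrt(0.29) ≈ 0.539. Note this function subtracts its arguments directly, so it expects array inputs:

[ ]: # Sanity check: same two rows, as arrays
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([4.9, 3.0, 1.4, 0.2])
print(euclidean_distance(a, b))  # ~0.539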

0.5 2.d Calculate Chebyshev distance between two data objects

[206]: def chebyshev_distance(vec1, vec2):
    dist = np.max(np.absolute(np.array(vec1) - np.array(vec2)))
    return dist
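
For the same pair, the Chebyshev distance is the largest per-attribute gap, max(0.2, 0.5) = 0.5:

[ ]: # Sanity check: |3.5 - 3.0| = 0.5 dominates
print(chebyshev_distance(np.array([5.1, 3.5, 1.4, 0.2]),
                         np.array([4.9, 3.0, 1.4, 0.2])))  # 0.5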

0.6 2.e Calculate Mahalanobis distance

[207]: def mahalanobis_distance(data, x):
    mean_vector = data.mean().values
    cov_matrix = data.cov().values
    inv_cov_matrix = np.linalg.inv(cov_matrix)
    x_minus_mean = x - mean_vector
    mahalanobis_sq = np.dot(np.dot(x_minus_mean, inv_cov_matrix), x_minus_mean.T)
    dist = np.sqrt(mahalanobis_sq)
    return dist

iris_data = pd.read_csv('iris.csv')

# Select the columns to use for calculating the Mahalanobis distance
columns = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
iris_subset = iris_data[columns]

# Example usage: calculate the Mahalanobis distance for a specific point
point = np.array([5.0, 3.2, 1.4, 0.2])  # Example point
dist = mahalanobis_distance(iris_subset, point)
print("Mahalanobis Distance:", dist)

# ref https://www.machinelearningplus.com/statistics/mahalanobis-distance/

Mahalanobis Distance: 1.357839356712021
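
As a cross-check, SciPy ships the same metric as `distance.mahalanobis` (the `distance` module is already imported in the first cell). This sketch assumes the same subset and point as above and should reproduce the printed value:

[ ]: # Cross-check against scipy.spatial.distance.mahalanobis
mu = iris_subset.mean().values
VI = np.linalg.inv(iris_subset.cov().values)
print(distance.mahalanobis(point, mu, VI))  # should match ~1.3578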

0.7 Write a separate function to implement the K-Nearest Neighbors classification method using all the functions implemented in question (2) above
[209]: def knn_classify(data, labels, query_point, k, distance_metric):
    distances = []
    if distance_metric == 'mahalanobis':
        # Pre-compute the inverse covariance once, so each row is
        # compared to the query point (not to the data mean)
        inv_cov = np.linalg.inv(data.cov().values)
    for i, row in data.iterrows():
        if distance_metric == 'manhattan':
            dist = manhattan_distance(row, query_point)
        elif distance_metric == 'chebyshev':
            dist = chebyshev_distance(row, query_point)
        elif distance_metric == 'euclidean':
            dist = euclidean_distance(row, query_point)
        elif distance_metric == 'mahalanobis':
            diff = row.values - query_point
            dist = np.sqrt(np.dot(np.dot(diff, inv_cov), diff.T))
        else:
            raise ValueError("Invalid distance metric. Supported options are 'manhattan', 'chebyshev', 'euclidean', and 'mahalanobis'.")

        distances.append((dist, labels[i]))

    distances.sort()
    k_nearest = distances[:k]
    k_nearest_labels = [label for (_, label) in k_nearest]

    most_common = Counter(k_nearest_labels).most_common(1)
    predicted_label = most_common[0][0]

    return predicted_label

# Load the iris dataset
iris_data = pd.read_csv('iris.csv')

# Select the feature columns and the corresponding labels
feature_columns = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
iris_features = iris_data[feature_columns]
iris_labels = iris_data['variety']

# Example usage: classify a random point using KNN with different distance metrics
random_point = np.array([6.1, 2.9, 4.7, 1.3])

k = 5  # Number of nearest neighbors to consider

distance_metrics = ['manhattan', 'chebyshev', 'euclidean', 'mahalanobis']

for metric in distance_metrics:
    predicted_label = knn_classify(iris_features, iris_labels, random_point, k, metric)
    print(f"Predicted variety using {metric.capitalize()} distance: {predicted_label}")

Predicted variety using Manhattan distance: Versicolor
Predicted variety using Chebyshev distance: Versicolor
Predicted variety using Euclidean distance: Versicolor
Predicted variety using Mahalanobis distance: Setosa
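
One quick way to gauge the classifier is a leave-one-out check: predict each row from the remaining rows and count matches. A minimal sketch, Euclidean metric only; it is O(n²), which is fine at this dataset size:

[ ]: # Leave-one-out accuracy with the Euclidean metric
correct = 0
for i in range(len(iris_features)):
    train = iris_features.drop(index=i).reset_index(drop=True)
    train_labels = iris_labels.drop(index=i).reset_index(drop=True)
    pred = knn_classify(train, train_labels, iris_features.iloc[i].values, k, 'euclidean')
    correct += (pred == iris_labels[i])
print("Leave-one-out accuracy:", correct / len(iris_features))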

0.8 Write a separate function to implement the K-means clustering method using all the functions implemented in question (2) above
[214]: def initialize_centroids(data, k):
    """Randomly initialize k centroids from the data."""
    centroids = data[np.random.choice(range(data.shape[0]), k, replace=False)]
    return centroids

def assign_clusters(data, centroids, distance_metric):
    cluster_labels = np.zeros(data.shape[0], dtype=int)
    if distance_metric == 'mahalanobis':
        # Compute the covariance once, not once per point
        covariance_matrix = np.cov(data.T)
    for i, point in enumerate(data):
        distances = []
        if distance_metric == 'mahalanobis':
            for centroid in centroids:
                distances.append(mahalanobis_distance(point, centroid, covariance_matrix))
        elif distance_metric == 'manhattan':
            for centroid in centroids:
                distances.append(manhattan_distance(point, centroid))
        elif distance_metric == 'chebyshev':
            for centroid in centroids:
                distances.append(chebyshev_distance(point, centroid))
        elif distance_metric == 'euclidean':
            for centroid in centroids:
                distances.append(euclidean_distance(point, centroid))
        cluster_labels[i] = np.argmin(distances)
    return cluster_labels

def update_centroids(data, cluster_labels, k):
    # Note: assumes every cluster keeps at least one point;
    # an empty cluster would make np.mean return NaN
    centroids = []
    for i in range(k):
        cluster_data = data[cluster_labels == i]
        centroid = np.mean(cluster_data, axis=0)
        centroids.append(centroid)
    centroids = np.vstack(centroids)
    return centroids

def kmeans(data, k, distance_metric='euclidean', max_iterations=100):
    centroids = initialize_centroids(data, k)
    for _ in range(max_iterations):
        cluster_labels = assign_clusters(data, centroids, distance_metric)
        new_centroids = update_centroids(data, cluster_labels, k)
        if np.array_equal(centroids, new_centroids):
            break
        centroids = new_centroids
    return cluster_labels, centroids

def euclidean_distance(vec1, vec2):
    dist = np.linalg.norm(vec1 - vec2)
    return dist

def manhattan_distance(vec1, vec2):
    dist = np.sum(np.absolute(vec1 - vec2))
    return dist

def chebyshev_distance(vec1, vec2):
    dist = np.max(np.absolute(vec1 - vec2))
    return dist

def mahalanobis_distance(vec1, vec2, covariance_matrix):
    diff = vec1 - vec2
    inv_covariance = np.linalg.inv(covariance_matrix)
    dist = np.sqrt(np.dot(np.dot(diff, inv_covariance), diff.T))
    return dist

iris_data = np.genfromtxt('iris.csv', delimiter=',', skip_header=1, usecols=(0, 1, 2, 3))

k = 3
distance_metrics = ['mahalanobis', 'manhattan', 'chebyshev', 'euclidean']
for metric in distance_metrics:
    cluster_labels, centroids = kmeans(iris_data, k, distance_metric=metric)
    print(f"Distance Metric: {metric}")
    print("Cluster Labels:")
    print(cluster_labels)
    print("Centroids:")
    print(centroids)
    print()

Distance Metric: mahalanobis
Cluster Labels:
[2 0 2 2 2 2 2 2 0 2 1 1 0 1 1 2 1 0 2 0 1 1 1 2 1 1 2 1 1 1]
Centroids:
[[5.36       2.66       2.8        0.82      ]
 [6.76153846 2.97692308 5.49230769 1.86923077]
 [5.31666667 3.34166667 2.53333333 0.68333333]]

Distance Metric: manhattan
Cluster Labels:
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 2 1 2 2 2 2 1 2 2 2]
Centroids:
[[5.09090909 3.36363636 1.57272727 0.3       ]
 [6.13636364 2.80909091 4.58181818 1.5       ]
 [6.875      3.025      6.0125     2.1       ]]

Distance Metric: chebyshev
Cluster Labels:
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 2 1 2 2 2 2 1 2 2 2]
Centroids:
[[5.09090909 3.36363636 1.57272727 0.3       ]
 [6.13636364 2.80909091 4.58181818 1.5       ]
 [6.875      3.025      6.0125     2.1       ]]

Distance Metric: euclidean
Cluster Labels:
[0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 0 1 2 1 1 1 1 2 1 1 1]
Centroids:
[[5.09090909 3.36363636 1.57272727 0.3       ]
 [6.875      3.025      6.0125     2.1       ]
 [6.13636364 2.80909091 4.58181818 1.5       ]]
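
Since k-means labels are arbitrary integers, a crosstab against the true `variety` column shows how the clusters line up with the species. A minimal sketch using the labels from the last (Euclidean) run, which of course depend on the random initialization:

[ ]: # Cross-tabulate the final run's cluster labels against the true varieties
varieties = pd.read_csv('iris.csv')['variety']
print(pd.crosstab(cluster_labels, varieties, rownames=['cluster'], colnames=['variety']))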
