
ASSIGNMENT-10

Name: Kanishq Malhotra    Roll No: 23UCC554

Code 1: K-Means clustering

K-Means Clustering is a popular unsupervised machine learning algorithm used for grouping similar data points into K clusters. It's especially useful when you want to find structure or patterns in your data without predefined labels.

Pros and Cons

Pros:

• Simple and fast

• Works well on large datasets

• Easy to interpret

Cons:

• You need to choose K in advance (see the elbow-method sketch after this list)

• Sensitive to outliers

• May converge to a local minimum (results depend on initial centroids)
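
Since K must be chosen up front, a common heuristic is the elbow method: run K-Means for several values of K and plot the within-cluster sum of squares (inertia), looking for the "elbow" where further increases in K stop helping much. A minimal sketch, assuming scikit-learn is installed and X is the feature matrix defined in the code below:

# Elbow-method sketch (assumes scikit-learn is available; X is the
# Annual Income / Spending Score feature matrix used later in this assignment)
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # Within-cluster sum of squared distances

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()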

Example Use Cases

• Customer segmentation

• Image compression

• Market basket analysis

• Document classification


This code performs K-Means clustering on a mall customer dataset using Annual Income and
Spending Score as features. It:

1. Loads the data and extracts the relevant columns.

2. Randomly initializes k=5 centroids.

3. Iteratively assigns each point to the nearest centroid and updates centroids based on the mean of their assigned points (the objective these two steps minimize is written out after this list).

4. Repeats the above steps until convergence or a max iteration limit is reached.

5. Prints final centroids and the number of points in each cluster.

6. Visualizes the clusters and centroids on a 2D scatter plot.
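
For reference, steps 3 and 4 implement the standard K-Means objective: each iteration reduces the within-cluster sum of squares (the notation below is standard textbook notation, not from the assignment):

    J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
    \qquad \mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x

The assignment step places each point x in the cluster whose centroid \mu_i is nearest, and the update step recomputes each \mu_i as the mean of its assigned points.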


# Import necessary libraries
import pandas as pd              # For data manipulation
import numpy as np               # For numerical computations
import matplotlib.pyplot as plt  # For plotting
import random                    # For random number generation

# Load the dataset from CSV file
data = pd.read_csv('Mall_Customers.csv')

# Extract the relevant features (Annual Income and Spending Score) as a NumPy array
X = data[['Annual Income (k$)', 'Spending Score (1-100)']].values

# Define the number of clusters (you can change this as needed)
k = 5

# Step 1: Randomly initialize centroids from the data points
def initialize_centroids(X, k):
    centroids_idx = random.sample(range(len(X)), k)  # Randomly pick k unique indices
    centroids = [X[i] for i in centroids_idx]        # Select the corresponding data points as centroids
    return np.array(centroids)

# Initialize centroids
centroids = initialize_centroids(X, k)

# Function to calculate Euclidean distance between two points
def euclidean_distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))  # Standard Euclidean distance formula

# Function to assign each data point to the nearest centroid
def assign_clusters(X, centroids):
    clusters = []
    for point in X:
        distances = [euclidean_distance(point, centroid) for centroid in centroids]  # Distance to each centroid
        clusters.append(np.argmin(distances))  # Assign to the nearest centroid
    return clusters

# Function to update centroids by calculating the mean of points in each cluster
def update_centroids(X, clusters, k):
    new_centroids = []
    for i in range(k):
        cluster_points = X[np.array(clusters) == i]  # Get all points assigned to cluster i
        if len(cluster_points) > 0:
            new_centroids.append(np.mean(cluster_points, axis=0))  # Compute mean if cluster is not empty
        else:
            new_centroids.append(initialize_centroids(X, 1)[0])    # Reinitialize empty cluster centroid
    return np.array(new_centroids)

# Run the K-Means algorithm
max_iters = 100  # Maximum number of iterations
for i in range(max_iters):
    clusters = assign_clusters(X, centroids)          # Step 1: Assign points to clusters
    new_centroids = update_centroids(X, clusters, k)  # Step 2: Update centroids

    # Check for convergence (if centroids do not change significantly)
    if np.allclose(centroids, new_centroids, rtol=1e-4):
        break  # Stop iterating once converged
    centroids = new_centroids  # Update centroids for the next iteration

# Print final centroids
print("Final centroids:")
print(centroids)

# Print the number of points in each cluster
for i in range(k):
    cluster_size = len(X[np.array(clusters) == i])  # Count points assigned to cluster i
    print(f"Cluster {i + 1} size: {cluster_size}")

# Plot the clusters
plt.figure(figsize=(8, 6))  # Set figure size
for i in range(k):
    cluster_points = X[np.array(clusters) == i]  # Points in cluster i
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=f'Cluster {i + 1}')  # Scatter plot

# Plot the centroids in red with 'x' marker
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=200, c='red', label='Centroids')
plt.xlabel('Annual Income (k$)')                 # X-axis label
plt.ylabel('Spending Score (1-100)')             # Y-axis label
plt.title('Customer Segmentation using KMeans')  # Title of the plot
plt.legend()  # Show legend
plt.show()    # Display the plot

Output-
Final centroids:
[[ 48.16831683  43.3960396 ]
 [109.7         22.        ]
 [ 78.89285714  17.42857143]
 [ 86.53846154  82.12820513]
 [ 25.72727273  79.36363636]]
Cluster 1 size: 101
Cluster 2 size: 10
Cluster 3 size: 28
Cluster 4 size: 39
Cluster 5 size: 22
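
As an optional cross-check (not part of the assignment code), scikit-learn's KMeans can be run on the same features and its centroids compared with the manual result. This is a sketch assuming scikit-learn is installed; exact values will differ slightly because initialization is random:

# Optional cross-check (assumes scikit-learn is installed and X is the
# feature matrix defined above): compare against a library implementation.
from sklearn.cluster import KMeans
import numpy as np

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print("scikit-learn centroids:")
print(km.cluster_centers_)      # Should land near the manual centroids
print("scikit-learn cluster sizes:")
print(np.bincount(km.labels_))  # Number of points in each cluster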
Code 2: Hierarchical clustering
# Import necessary libraries
import numpy as np               # For numerical computations
import matplotlib.pyplot as plt  # For plotting
from scipy.cluster.hierarchy import dendrogram, linkage  # For hierarchical clustering and dendrogram
import pandas as pd              # For data handling

# Load the faithful dataset (make sure the path is correct)
data = pd.read_csv('/content/faithful.csv')   # Load CSV file containing the dataset
data = data[['eruptions', 'waiting']].values  # Extract only the two relevant features as a NumPy array

# Function to calculate Euclidean distance between two points
def euclidean_distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))  # Standard Euclidean distance formula

# Function to calculate distance between two clusters using single linkage (minimum pairwise distance)
def cluster_distance(c1, c2):
    return min([euclidean_distance(p1, p2) for p1 in c1 for p2 in c2])  # Minimum distance between all point pairs

# Initialize: treat each point as its own cluster
clusters = [[point] for point in data]

# Set the desired number of clusters (can be changed as needed)
target_clusters = 1

# Repeat until only the target number of clusters remains
while len(clusters) > target_clusters:
    min_dist = float('inf')  # Initialize minimum distance to a large value
    to_merge = (None, None)  # Initialize pair of clusters to merge

    # Find the two closest clusters based on single linkage distance
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            dist = cluster_distance(clusters[i], clusters[j])  # Compute distance between cluster i and j
            if dist < min_dist:    # If this is the smallest so far
                min_dist = dist
                to_merge = (i, j)  # Update the clusters to be merged

    # Merge the two closest clusters
    i, j = to_merge
    new_cluster = clusters[i] + clusters[j]  # Combine the two clusters
    # Remove the merged clusters and add the new one
    clusters = [clusters[x] for x in range(len(clusters)) if x not in (i, j)]
    clusters.append(new_cluster)

# Print the final clusters
print("Final clusters:")
for idx, cluster in enumerate(clusters):
    print(f"Cluster {idx + 1}: {cluster}")

# Use SciPy to compute the linkage matrix for the dendrogram (using single linkage method)
Z = linkage(data, method='single')

# Plot the dendrogram to visualize the hierarchical clustering
plt.figure(figsize=(10, 6))  # Set the figure size
dendrogram(Z)                # Plot the dendrogram
plt.title('Dendrogram for Faithful Dataset')  # Title of the plot
plt.xlabel('Data Points')    # X-axis label
plt.ylabel('Distance')       # Y-axis label

# Draw a horizontal line to show a distance threshold for cutting the dendrogram
threshold = 1.5  # Set a threshold value (can adjust this)
plt.axhline(y=threshold, color='r', linestyle='--', label='Threshold')  # Horizontal red dashed line
plt.legend()  # Show legend for the threshold line

plt.show()  # Display the dendrogram plot

# Visualize the final clusters in a 2D scatter plot
plt.figure(figsize=(8, 6))  # Set figure size
for idx, cluster in enumerate(clusters):
    cluster_points = np.array(cluster)  # Convert cluster to NumPy array for plotting
    plt.scatter(cluster_points[:, 0],   # X-values (eruptions)
                cluster_points[:, 1],   # Y-values (waiting)
                label=f'Cluster {idx + 1}')  # Label for legend

plt.title('Cluster Visualization')  # Plot title
plt.xlabel('eruptions')  # X-axis label
plt.ylabel('waiting')    # Y-axis label
plt.legend()  # Show legend

plt.show()  # Display the cluster visualization


CODE EXPLANATION:

1. Load Data: It reads the dataset and extracts two features: eruptions and waiting.

2. Define Distance Functions: It includes functions to compute Euclidean distance between points and the minimum distance between clusters (single linkage).

3. Manual Clustering Loop: Each data point starts as its own cluster; the closest pair of clusters is merged iteratively until only one remains (target_clusters = 1).

4. Output Clusters: It prints the final single cluster made by merging all points (you can modify target_clusters to stop earlier; see the fcluster sketch after this list).

5. Dendrogram Plot: Uses scipy.linkage to compute the clustering steps and plots a
dendrogram to visualize cluster merges.

6. Cluster Visualization: Displays the final cluster (or clusters, if target_clusters is changed) in a 2D scatter plot.
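
As a complementary step (not in the original code), SciPy's fcluster can cut the linkage matrix Z at the same threshold drawn on the dendrogram, recovering flat cluster labels without re-running the manual merge loop:

# Sketch of an extension: cut the dendrogram at the plotted threshold to get
# flat cluster labels directly from the SciPy linkage matrix Z.
from scipy.cluster.hierarchy import fcluster
import numpy as np

labels = fcluster(Z, t=1.5, criterion='distance')  # Same threshold as the red line
print("Clusters at threshold 1.5:", len(np.unique(labels)))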

Output-
Final clusters:

Cluster 1: [array([ 5.1, 96. ]), array([ 1.983, 43. ]), array([ 1.833, 57. ]), array([ 2.083, 57. ]), array([ 2.083, 57. ]), array([ 1.817, 60. ]), array([ 2.2, 60. ]), array([ 2.233, 60. ]), array([ 2.25, 60. ]), array([ 2.017, 60. ]), array([ 2.1, 60. ]), array([ 2., 58.]), array([ 1.75, 58. ])........
