
Name: Vidya Janani V

Register Number: 913121205090

EX.NO: 4 CLUSTERING THE GIVEN DATA USING PYTHON/R


Date: 12.03.2024

AIM:

To perform clustering of the given data using K-Means in Python and R.

STEPS:

1. Data Preparation: Load and pre-process the data, ensuring it is in a suitable format for clustering.

2. Library Imports: Import the necessary Python libraries, such as sklearn for K-Means and matplotlib for visualization.

3. K-Means Clustering: Initialize and fit a K-Means model, specifying the number of clusters (K).

4. Visualization: Visualize the clusters to identify patterns and structures within the data.

PYTHON:

ELBOW METHOD:

The Elbow Method is used to find the optimal number of clusters (K) for K-Means clustering. The code loads a dataset, selects specific features, and calculates the within-cluster sum of squares (WSS) for K values ranging from 1 to 10. The resulting WSS values are plotted to locate the "elbow" point, where the rate of decrease in WSS slows down, indicating the optimal K. This helps in determining the most suitable number of clusters for the given dataset.
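The WSS loop described above can be sketched as follows; the synthetic blob data and the range of K values are illustrative stand-ins for the real dataset, which is not reproduced here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three well-separated blobs (a stand-in
# for the real dataset)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Within-cluster sum of squares (WSS, sklearn's inertia_) for K = 1..10
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)

# WSS shrinks as K grows; the "elbow" is where the marginal drop
# becomes small (here around K = 3, the true number of blobs)
```

Plotting `range(1, 11)` against `wss` then gives the elbow curve shown in the output below.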

K-MEANS

The Python code performs K-Means clustering with a specified number of clusters (K) on the dataset, reduced to two dimensions with PCA. It computes a cluster assignment for each record and visualizes the data points with a different color for each cluster; additionally, it plots the cluster centroids. The value passed to n_clusters should be the K chosen from the elbow plot, and the code provides a visual representation of the clustering results.
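Attaching the cluster assignments back to the dataset, as described above, can be sketched like this (the frame, column names, and choice of k below are hypothetical; the real code loads job_placement.csv instead):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Small synthetic frame standing in for the real dataset
# (assumption: it has at least two numeric columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'feature_1': rng.normal(size=90),
    'feature_2': rng.normal(size=90),
})

k = 3  # replace with the K chosen from the elbow plot
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)

# fit_predict returns one label per row; storing it as a new column
# keeps each observation paired with its cluster assignment
df['Cluster'] = kmeans.fit_predict(df[['feature_1', 'feature_2']])
```

Keeping the labels in a `Cluster` column (rather than a separate array) makes per-cluster summaries straightforward, e.g. `df.groupby('Cluster').mean()`.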

SCATTER PLOT

1. A scatter plot will be displayed, where data points are colored differently based on their assigned clusters, showing the clusters formed by K-Means.

2. The cluster centroids will be marked as red "X" symbols on the plot.

3. The title of the plot will indicate the number of clusters used for K-Means clustering (specified by the value passed to n_clusters).

4. A legend will be displayed in the upper right corner of the plot, indicating the labels for data points and centroids.
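A minimal, self-contained version of the plot described above can be sketched as follows (synthetic blobs stand in for the real data, and the Agg backend is selected so the sketch runs headless):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; no display window needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D blobs standing in for the PCA-reduced data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

fig, ax = plt.subplots()
# Data points, coloured by cluster label
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis',
           s=50, alpha=0.5, label='Data points')
# Centroids as red "X" markers, as described above
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           s=200, c='red', marker='X', label='Centroids')
ax.set_title(f'K-Means Clustering (K={k})')
ax.legend(loc='upper right')
```

The two `scatter` calls produce the two legend entries; `loc='upper right'` pins the legend where the description above expects it.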

Thus, K-Means clustering was performed on the job placement dataset (job_placement.csv).

21PCS02 – Exploratory Data Analysis Laboratory



Python Code:

1. Elbow Method

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset


data = pd.read_csv("job_placement.csv")

# Display the first few rows of the dataset


print(data.head())

# Preprocessing the data


# Dropping non-numeric columns if any and handling missing values
data = data.dropna()
numeric_data = data.select_dtypes(include=['float64', 'int64'])

# Standardizing the data


scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)

# Applying PCA for dimensionality reduction


pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)

# Elbow Method to find the optimal number of clusters


inertia = []
for i in range(1, 11):
    # n_init set explicitly for consistent behavior across sklearn versions
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(pca_data)
    inertia.append(kmeans.inertia_)

# Plotting the Elbow Method


plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()


Output

2. K-Means Clustering
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset


data = pd.read_csv("job_placement.csv")

# Display the first few rows of the dataset


print(data.head())

# Preprocessing the data


# Dropping non-numeric columns if any and handling missing values
data = data.dropna()
numeric_data = data.select_dtypes(include=['float64', 'int64'])

# Standardizing the data


scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)

# Applying PCA for dimensionality reduction


pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)


# Applying K-means clustering


kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(pca_data)

# Visualizing the clusters


plt.figure(figsize=(8, 6))
plt.scatter(pca_data[:, 0], pca_data[:, 1], c=cluster_labels, cmap='viridis', s=50, alpha=0.5)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X', label='Centroids')
plt.title('K-means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()

Output


3. Scatter plot
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset


data = pd.read_csv("job_placement.csv")

# Display the first few rows of the dataset


print(data.head())

# Preprocessing the data


# Dropping non-numeric columns if any and handling missing values
data = data.dropna()
numeric_data = data.select_dtypes(include=['float64', 'int64'])

# Standardizing the data


scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)

# Applying PCA for dimensionality reduction


pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)

# Applying K-means clustering


kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(pca_data)

# Visualizing the clusters


plt.figure(figsize=(10, 6))

# Plotting points with cluster centers


plt.scatter(pca_data[:, 0], pca_data[:, 1], c=cluster_labels, cmap='viridis', s=50, alpha=0.5)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X', label='Centroids')

plt.title('K-means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True)
plt.show()


Output

Result:
In this experiment, clustering of the given data using K-Means in Python/R was implemented and the output was verified successfully.

