Machine Learning Laboratory

The document outlines a Machine Learning Laboratory course at Visvesvaraya Technological University, focusing on various programming tasks involving data analysis and machine learning techniques using Python. It includes assignments such as creating histograms and box plots for the California Housing dataset, computing correlation matrices, implementing Principal Component Analysis (PCA) on the Iris dataset, applying the Find-S algorithm, and executing the k-Nearest Neighbour algorithm for classification. Each program is detailed with code snippets and expected outputs.

Visvesvaraya Technological University (VTU)

Machine Learning Laboratory
BCSL606

2025

Faculty Incharge

Prof. Bhagyashri Wakde

RAJIV GANDHI INSTITUTE OF TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Bengaluru - 560032

2025
Machine Learning Laboratory (BCSL606)

1. Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset.

DATASET
California Housing dataset

from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing(as_frame=True)
california_housing.frame.head()

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422

print(california_housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33 (1997) 291-297

PROGRAM 1

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
def load_data():
    data = fetch_california_housing(as_frame=True)
    df = data['data']
    df['Target'] = data['target']  # Add the target (house value) to the DataFrame
    return df

# Function to plot histograms for numerical features
def plot_histograms(df):
    numerical_features = df.select_dtypes(include='number').columns
    df[numerical_features].hist(bins=20, figsize=(15, 10), color='skyblue', edgecolor='black')
    plt.suptitle('Histograms of Numerical Features', fontsize=16)
    plt.tight_layout()
    plt.show()

# Function to generate box plots and identify outliers
def plot_boxplots(df):
    numerical_features = df.select_dtypes(include='number').columns
    for feature in numerical_features:
        plt.figure(figsize=(8, 6))
        sns.boxplot(x=df[feature], color='lightblue')
        plt.title(f'Box Plot of {feature}', fontsize=14)
        plt.xlabel(feature, fontsize=12)
        plt.tight_layout()
        plt.show()

        # Identify outliers using the IQR method
        Q1 = df[feature].quantile(0.25)
        Q3 = df[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
        print(f"{feature}: {len(outliers)} outliers found.")

# Main function to run the analysis
def main():
    df = load_data()
    print("Dataset loaded successfully!")

    print("\nGenerating histograms for numerical features...")
    plot_histograms(df)

    print("\nGenerating box plots and identifying outliers...")
    plot_boxplots(df)

if __name__ == "__main__":
    main()

OUTPUT:
===================== RESTART: C:/Users/USER/first.py ======================


Dataset loaded successfully!

Generating histograms for numerical features...



2. Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to identify which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset.

PROGRAM 2

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
def load_data():
    data = fetch_california_housing(as_frame=True)
    df = data['data']
    df['Target'] = data['target']  # Add the target (house value) as a column
    return df

# Function to compute and visualize the correlation matrix
def visualize_correlation_matrix(df):
    # Compute the correlation matrix
    corr_matrix = df.corr()

    # Plot the heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(
        corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True, square=True,
        linewidths=0.5, annot_kws={"size": 10}
    )
    plt.title("Correlation Matrix Heatmap", fontsize=16)
    plt.show()

# Function to create a pair plot
def create_pairplot(df):
    # Select a subset of features for pair plot if dataset is large
    selected_features = df.select_dtypes(include='number').columns
    sns.pairplot(df[selected_features], diag_kind='kde', corner=True, height=2.0)
    plt.suptitle("Pair Plot of Numerical Features", y=1.02, fontsize=16)
    plt.show()

# Main function to run the analysis
def main():
    # Load dataset
    df = load_data()
    print("Dataset loaded successfully!")

    # Correlation matrix heatmap
    print("\nVisualizing the correlation matrix...")
    visualize_correlation_matrix(df)

    # Pair plot
    print("\nCreating pair plot for numerical features...")
    create_pairplot(df)

if __name__ == "__main__":
    main()

OUTPUT:

==================== RESTART: C:/Users/USER/Second.py ======================


Dataset loaded successfully!

Visualizing the correlation matrix...

3. Develop a program to implement Principal Component Analysis (PCA) to reduce the dimensionality of the Iris dataset from 4 features to 2.

PROGRAM 3

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
def load_data():
    iris = load_iris()
    data = pd.DataFrame(iris.data, columns=iris.feature_names)
    data['target'] = iris.target
    data['target_names'] = [iris.target_names[i] for i in iris.target]
    return data, iris.target_names

# Perform PCA to reduce dimensions
def perform_pca(data, n_components=2):
    # Extract features and standardize them
    features = data.iloc[:, :-2]  # Exclude target and target_names
    scaler = StandardScaler()
    standardized_features = scaler.fit_transform(features)

    # Apply PCA
    pca = PCA(n_components=n_components)
    principal_components = pca.fit_transform(standardized_features)

    # Create a DataFrame with the PCA results
    pca_data = pd.DataFrame(
        principal_components, columns=[f'PC{i+1}' for i in range(n_components)]
    )
    pca_data['target'] = data['target']
    pca_data['target_names'] = data['target_names']

    return pca_data, pca

# Visualize PCA results
def plot_pca_results(pca_data, explained_variance, target_names):
    plt.figure(figsize=(8, 6))
    sns.scatterplot(
        x='PC1', y='PC2', hue='target_names', data=pca_data,
        palette='Set1', s=100, alpha=0.7, edgecolor='k'
    )
    plt.title(f'PCA Results (Explained Variance: PC1={explained_variance[0]:.2f}, '
              f'PC2={explained_variance[1]:.2f})', fontsize=14)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(title='Classes', loc='upper right')
    plt.grid(True, alpha=0.4)
    plt.tight_layout()
    plt.show()

# Main function to run the program
def main():
    # Load the Iris dataset
    data, target_names = load_data()
    print("Iris dataset loaded successfully!")

    # Perform PCA
    pca_data, pca = perform_pca(data, n_components=2)
    explained_variance = pca.explained_variance_ratio_

    # Print explained variance
    print(f"Explained Variance Ratio by Principal Components: {explained_variance}")

    # Visualize PCA results
    print("\nVisualizing PCA results...")
    plot_pca_results(pca_data, explained_variance, target_names)

if __name__ == "__main__":
    main()

OUTPUT:

====================== RESTART: C:/Users/USER/third.py ======================


Iris dataset loaded successfully!
Explained Variance Ratio by Principal Components: [0.72962445 0.22850762]
Visualizing PCA results...

4. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.

DATASET:

training_data.csv

Sky Temp Humidity Wind Water Forecast PlayTennis

Sunny Warm Normal Strong Warm Same Yes

Sunny Warm High Strong Warm Same Yes

Rainy Cold High Strong Warm Change No

Sunny Warm High Strong Cool Change Yes

PROGRAM 4

import pandas as pd

# Function to implement the Find-S algorithm
def find_s_algorithm(data, target_col):
    # Initialize the most specific hypothesis
    attributes = data.columns[:-1]  # Exclude the target column
    hypothesis = ['ϕ'] * len(attributes)

    # Iterate through the training examples
    for _, row in data.iterrows():
        if row[target_col] == "Yes":  # Only consider positive examples
            for i, value in enumerate(row[:-1]):  # Iterate through attributes
                if hypothesis[i] == 'ϕ':  # Update when hypothesis is 'ϕ'
                    hypothesis[i] = value
                elif hypothesis[i] != value:  # Generalize hypothesis
                    hypothesis[i] = '?'

    return hypothesis

# Main function to load data and run the Find-S algorithm
def main():
    # Load the dataset
    try:
        data = pd.read_csv(r'C:\Users\USER\Desktop\training_data.csv')
        print("Training Data Loaded Successfully!\n")
        print(data, "\n")

        # Identify the target column (assumed to be the last column)
        target_col = data.columns[-1]
        print(f"Target column identified: {target_col}\n")

        # Run the Find-S algorithm
        final_hypothesis = find_s_algorithm(data, target_col)
        print(f"Final Hypothesis Consistent with Positive Examples: {final_hypothesis}")

    except Exception as e:
        print(f"Error loading data: {e}")

# Run the program
if __name__ == "__main__":
    main()

OUTPUT:

====================== RESTART: C:/Users/USER/fourth.py ======================


Training Data Loaded Successfully!

     Sky  Temp Humidity    Wind Water Forecast PlayTennis
0  Sunny  Warm   Normal  Strong  Warm     Same        Yes
1  Sunny  Warm     High  Strong  Warm     Same        Yes
2  Rainy  Cold     High  Strong  Warm   Change         No
3  Sunny  Warm     High  Strong  Cool   Change        Yes

Target column identified: PlayTennis

Final Hypothesis Consistent with Positive Examples: ['Sunny', 'Warm', '?', 'Strong', '?', '?']

5. Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0, 1]. Perform the following on the generated dataset: a. Label the first 50 points {x1, ..., x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2. b. Classify the remaining points, x51, ..., x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.

PROGRAM 5

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Generate 100 random values in the range [0, 1]
np.random.seed(42)  # For reproducibility
x = np.random.rand(100)

# Label the first 50 points
y = np.array(["Class1" if xi <= 0.5 else "Class2" for xi in x[:50]])

# Prepare training and test datasets
x_train = x[:50].reshape(-1, 1)  # Training features (first 50 points)
y_train = y                      # Training labels (first 50 points)
x_test = x[50:].reshape(-1, 1)   # Testing features (remaining 50 points)

# Function to classify and visualize results for different k values
def classify_knn(k_values):
    for k in k_values:
        # Initialize and fit the KNN classifier
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(x_train, y_train)

        # Predict the classes for test data
        y_pred = knn.predict(x_test)

        # Print the results
        print(f"\nResults for k={k}:")
        print(f"Predicted Classes for Test Data: {y_pred}")

        # Visualization
        plt.figure(figsize=(8, 5))
        plt.scatter(x[:50], [0] * 50, c=['red' if label == "Class1" else 'blue' for label in y],
                    label="Training Data (Class1=Red, Class2=Blue)")
        plt.scatter(x[50:], [0] * 50, c=['red' if label == "Class1" else 'blue' for label in y_pred],
                    marker='x', label="Test Data Predictions")
        plt.axvline(x=0.5, color='gray', linestyle='--', label="Decision Boundary (x=0.5)")
        plt.title(f"KNN Classification with k={k}")
        plt.xlabel("x values")
        plt.yticks([])
        plt.legend()
        plt.grid(alpha=0.4)
        plt.tight_layout()
        plt.show()

# Specify k values for the KNN algorithm
k_values = [1, 2, 3, 4, 5, 20, 30]

# Perform classification and visualization for specified k values
classify_knn(k_values)

OUTPUT:

====================== RESTART: C:/Users/USER/fifth.py ======================

Results for k=1:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class2' 'Class1'
 'Class1' 'Class1']

Results for k=2:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class2' 'Class1'
 'Class1' 'Class1']

Results for k=3:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class1']

Results for k=4:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class1']

Results for k=5:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class1']

Results for k=20:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class1']

Results for k=30:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
 'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
 'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
 'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class2' 'Class1'
 'Class1' 'Class1']

6. Implement the non-parametric Locally Weighted Regression algorithm to fit data points. Select an appropriate dataset for your experiment and draw graphs.
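For reference, the fit that the program below computes at each query point x is a weighted least-squares problem: Gaussian weights are assigned to the training points, and the local parameters come from the weighted normal equation. In LaTeX notation (matching the code's weight and normal-equation steps):

w_i(x) = \exp\!\left(-\frac{(x_i - x)^2}{2\tau^2}\right), \qquad
\hat{\theta}(x) = \left(X^\top W(x)\, X\right)^{-1} X^\top W(x)\, y, \qquad
\hat{y}(x) = [1,\; x]\, \hat{\theta}(x)

where W(x) is the diagonal matrix of the weights w_i(x), X contains an intercept column plus the training inputs, and the bandwidth τ controls how quickly a training point's influence decays with distance from x.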

PROGRAM 6

import numpy as np
import matplotlib.pyplot as plt

# Locally Weighted Regression (LWR) function
def locally_weighted_regression(x_train, y_train, x_test, tau):
    """
    Perform Locally Weighted Regression (LWR).

    Parameters:
    x_train: np.array, shape (n,)
        Training data features.
    y_train: np.array, shape (n,)
        Training data labels.
    x_test: np.array, shape (m,)
        Test data features.
    tau: float
        Bandwidth parameter (controls the weight decay).

    Returns:
    y_pred: np.array, shape (m,)
        Predicted values for x_test.
    """
    m = len(x_test)
    y_pred = np.zeros(m)

    for i in range(m):
        weights = np.exp(-np.square(x_train - x_test[i]) / (2 * tau**2))  # Gaussian weights
        W = np.diag(weights)  # Diagonal weight matrix
        X = np.c_[np.ones(len(x_train)), x_train]  # Add intercept term
        theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y_train  # Weighted normal equation
        y_pred[i] = [1, x_test[i]] @ theta  # Predict y for x_test[i]

    return y_pred

# Generate synthetic dataset
np.random.seed(42)
x_train = np.linspace(0, 10, 50)
y_train = 2 * np.sin(x_train) + np.random.normal(0, 0.5, size=len(x_train))

# Test points (finer granularity for smoother graph)
x_test = np.linspace(0, 10, 200)

# Apply LWR for different tau values
tau_values = [0.1, 0.5, 1, 5]
plt.figure(figsize=(12, 8))

for i, tau in enumerate(tau_values, 1):
    y_pred = locally_weighted_regression(x_train, y_train, x_test, tau)
    plt.subplot(2, 2, i)
    plt.scatter(x_train, y_train, label="Training Data", color="red")
    plt.plot(x_test, y_pred, label=f"LWR (tau={tau})", color="blue")
    plt.title(f"Locally Weighted Regression (tau={tau})")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.grid(alpha=0.4)

plt.tight_layout()
plt.show()

OUTPUT:
====================== RESTART: C:/Users/USER/Sixth.py ======================

7. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing dataset for Linear Regression and the Auto MPG dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

PROGRAM 7
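The program listing is not included in the source document; only the output survives. The sketch below is a minimal reconstruction consistent with that output: since load_boston was removed from recent scikit-learn releases, the California Housing dataset stands in for Boston Housing (which is what the printed MSE line indicates was actually run), and the Auto MPG data is assumed to be fetched from the UCI repository copy (the URL is an assumption, not taken from the source).

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

def linear_regression_housing():
    # Linear regression on the California Housing dataset
    housing = fetch_california_housing(as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(
        housing.data, housing.target, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"California Housing Dataset Linear Regression MSE: {mse}")

def polynomial_regression_auto_mpg():
    # Polynomial regression (degree 2) of MPG on horsepower
    # NOTE: data source URL is an assumption (UCI repository copy)
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
    cols = ["mpg", "cylinders", "displacement", "horsepower", "weight",
            "acceleration", "model_year", "origin", "car_name"]
    df = pd.read_csv(url, sep=r"\s+", names=cols, na_values="?").dropna()

    X = df[["horsepower"]]
    y = df["mpg"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"Auto MPG Dataset Polynomial Regression MSE: {mse}")

    # Plot the fitted curve over the scatter of the raw data
    xs = X.sort_values("horsepower")
    plt.scatter(X, y, s=15, alpha=0.5, label="Data")
    plt.plot(xs["horsepower"], model.predict(xs), color="red", label="Polynomial fit (degree 2)")
    plt.xlabel("Horsepower")
    plt.ylabel("MPG")
    plt.title("Polynomial Regression on Auto MPG")
    plt.legend()
    plt.show()

if __name__ == "__main__":
    linear_regression_housing()
    polynomial_regression_auto_mpg()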

OUTPUT:
==================== RESTART: C:/Users/USER/Seventh.py =====================

California Housing Dataset Linear Regression MSE: 0.5558915986952442

8. Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer dataset to build the decision tree and apply this knowledge to classify a new sample.

PROGRAM 8
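This listing is also missing from the source. A minimal sketch follows, assuming scikit-learn's load_breast_cancer copy of the dataset and treating one held-out test row as the "new sample" to classify; the max_depth setting is an illustrative choice, not taken from the original.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset and split into train/test sets
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Build the decision tree
clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the test set
y_pred = clf.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred):.3f}")

# Classify a new sample (here, the first held-out test row)
new_sample = X_test[0].reshape(1, -1)
prediction = clf.predict(new_sample)[0]
print(f"New sample classified as: {data.target_names[prediction]}")

# Visualize the fitted tree
plt.figure(figsize=(16, 8))
plot_tree(clf, feature_names=data.feature_names, class_names=list(data.target_names),
          filled=True, fontsize=8)
plt.show()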
OUTPUT:
9. Develop a program to implement the Naive Bayesian classifier, considering the Olivetti Face dataset for training. Compute the accuracy of the classifier, considering a few test data sets.
PROGRAM 9
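This listing is missing as well; below is a minimal sketch assuming a Gaussian Naive Bayes classifier trained on the raw pixel features from fetch_olivetti_faces, with a stratified train/test split standing in for the "few test data sets".

from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Olivetti Faces dataset (400 images of 40 subjects, 4096 pixels each)
faces = fetch_olivetti_faces(shuffle=True, random_state=42)
X, y = faces.data, faces.target

# Stratified split so every subject appears in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Train a Gaussian Naive Bayes classifier on raw pixel intensities
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Compute the accuracy on the held-out test images
y_pred = gnb.predict(X_test)
print(f"Naive Bayes Accuracy on Olivetti Faces: {accuracy_score(y_test, y_pred):.3f}")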
OUTPUT:
10. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer dataset and visualize the clustering result.

PROGRAM 10
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score

# Load the Wisconsin Breast Cancer Dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=["Diagnosis"])  # Target labels

# Normalize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=42)  # 2 clusters for malignant/benign
clusters = kmeans.fit_predict(X_scaled)

# Add clustering labels to the dataset
X["Cluster"] = clusters
y["Cluster"] = clusters

# Evaluate clustering using silhouette score
silhouette_avg = silhouette_score(X_scaled, clusters)
print(f"Silhouette Score: {silhouette_avg:.3f}")

# Visualize Clustering with PCA (2D projection)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', s=50, alpha=0.6,
            label='Clustered Data')
plt.title("K-Means Clustering on Wisconsin Breast Cancer Dataset")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(label='Cluster')
plt.show()

# Optional: Compare clusters with actual diagnosis
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=data.target, cmap='coolwarm', s=50, alpha=0.6,
            label='Actual Labels')
plt.title("Actual Diagnosis Labels (for comparison)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(label='Diagnosis (0=Malignant, 1=Benign)')
plt.show()

OUTPUT:
===================== RESTART: C:/Users/USER/tenth.py ======================
Silhouette Score: 0.345
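Because K-Means assigns arbitrary cluster IDs, comparing the clusters against the true diagnosis requires matching labels first. A short optional check (not part of the original listing) cross-tabulates the two:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Recompute the clusters as in the program above
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
clusters = KMeans(n_clusters=2, random_state=42).fit_predict(X_scaled)

# Rows = cluster ID, columns = actual diagnosis (0 = malignant, 1 = benign)
print(pd.crosstab(clusters, data.target, rownames=["Cluster"], colnames=["Diagnosis"]))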
