MLFILE

The document provides a comprehensive overview of various clustering algorithms including K-means, Hierarchical Clustering, and DBSCAN, along with their respective implementations in Python. It also covers dimensionality reduction techniques such as low variance and high correlation filters, as well as Principal Component Analysis (PCA). Additionally, it includes methods for evaluating clustering performance using Silhouette Score, and demonstrates data manipulation and visualization techniques using pandas and matplotlib.


Q5. Implement the K-means algorithm.

Algorithm:

1. Choose k (the number of clusters).
2. Initialize k centroids randomly from the data points.
3. Repeat for a fixed number of iterations or until convergence:
   • Assign each data point to the nearest centroid (based on distance).
   • Recalculate each centroid as the mean of the points in its cluster.
4. Stop when the centroids no longer change significantly, or after a set number of iterations.

PROGRAM:

import numpy as np
import matplotlib.pyplot as plt

# Sample data (2D points); float dtype so centroid means are stored exactly
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# Number of clusters
k = 2

# Randomly choose initial centroids from the data points
np.random.seed(0)
centroids = X[np.random.choice(len(X), k, replace=False)]

for _ in range(10):  # run for 10 iterations
    # Assign each point to the nearest centroid
    labels = np.array([np.argmin([np.linalg.norm(x - c) for c in centroids]) for x in X])
    # Update each centroid as the mean of the points assigned to it
    for i in range(k):
        centroids[i] = X[labels == i].mean(axis=0)

# Plot the results
colors = ['r', 'b']
for i in range(k):
    plt.scatter(X[labels == i][:, 0], X[labels == i][:, 1], c=colors[i], label=f'Cluster {i+1}')
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='x', s=100)
plt.title("K-Means Clustering")
plt.legend()
plt.show()

OUTPUT:
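As a quick cross-check (a minimal sketch, assuming scikit-learn is installed; the library class KMeans is not part of the program above), the same data can be clustered with the library implementation and the centroids compared:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# n_init=10 restarts K-means from 10 random initializations and keeps the best run
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Labels:", km.labels_)
print("Centroids:\n", km.cluster_centers_)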
Q6. Implement Hierarchical Clustering.

Algorithm:

1. Start with each data point as its own cluster.
2. Compute the distances between all clusters.
3. Merge the two closest clusters.
4. Repeat steps 2–3 until all points are in a single cluster.
5. Use a dendrogram to choose the number of clusters.

PROGRAM:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Sample 2D data
X = np.array([[1, 2], [2, 3], [3, 3],
              [8, 7], [8, 8], [7, 7]])

# Perform agglomerative (bottom-up) hierarchical clustering with Ward linkage
Z = linkage(X, method='ward')

# Plot the dendrogram
dendrogram(Z)
plt.show()
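The dendrogram only visualizes the merge order; to actually cut the tree into a chosen number of clusters (step 5 of the algorithm), scipy's fcluster can be applied to the same linkage matrix. A minimal sketch, reusing X and Z from the program above:

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram so that exactly 2 flat clusters remain
flat_labels = fcluster(Z, t=2, criterion='maxclust')
print("Cluster labels:", flat_labels)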
Q7. Implement DBSCAN algorithm.

Algorithm:

• For each point, find its neighbors within radius eps.
• If it has ≥ min_samples neighbors, mark it as a core point.
• Connect core points and their reachable points into clusters.
• Points not connected to any core point are noise.

Program:
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
# Sample 2D data
X = np.array([[1, 2], [2, 2], [2, 3],
[8, 7], [8, 8], [25, 80]])
# Run DBSCAN
db = DBSCAN(eps=2, min_samples=2).fit(X)
labels = db.labels_
# Plotting
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("DBSCAN Clustering")
plt.show()

Output:
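Points that DBSCAN labels -1 are noise. A short hedged follow-up, reusing labels from the program above, that separates the noise points and counts the discovered clusters:

# Label -1 marks points not reachable from any core point (noise)
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated clusters:", n_clusters)
print("Noise points:", n_noise)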

Q8. Calculate the Silhouette Score.


Algorithm:

1. Perform clustering on your data.


2. For each point:
o Compute a = average distance to points in the same cluster.
o Compute b = average distance to points in the nearest other cluster.
3. Silhouette value of the point = (b - a) / max(a, b).
4. Final score is the average over all points (ranges from -1 to 1).

PROGRAM:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, _ = make_blobs(n_samples=100, centers=3)

# KMeans clustering
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X)

# Silhouette Score
score = silhouette_score(X, labels)
print("Silhouette Score:", score)

Output:
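The silhouette score is commonly used to choose the number of clusters: compute it for several candidate values of k and keep the best. A minimal sketch under that assumption, reusing the imports above (the random_state values are only for reproducibility):

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")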

Q9. Apply a low-variance filter for dimensionality reduction.


Algorithm:

• Calculate the variance of each feature (column).
• Set a threshold (e.g., 0.1).
• Remove features with variance less than or equal to the threshold.
• Keep only the high-variance features for further analysis.

PROGRAM:
import pandas as pd

# Sample data
data = {
'F1': [1, 1, 1, 1, 1], # Low variance
'F2': [2, 2, 2, 2, 2], # Low variance
'F3': [1, 2, 3, 4, 5], # High variance
}
df = pd.DataFrame(data)

# Remove features with variance below threshold (e.g., 0.1)


threshold = 0.1
filtered_df = df.loc[:, df.var() > threshold]

print("Reduced DataFrame:")
print(filtered_df)

Output:
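scikit-learn offers the same filter as a transformer. A hedged equivalent sketch using VarianceThreshold (assuming scikit-learn is available), applied to the same df:

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)   # drops features whose variance falls below the threshold
reduced = selector.fit_transform(df)
kept_columns = df.columns[selector.get_support()]
print("Kept features:", list(kept_columns))

Note that VarianceThreshold uses the population variance (ddof=0) while pandas' .var() uses the sample variance (ddof=1), so features near the threshold may be treated differently by the two approaches.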

Q10. Apply a high-correlation filter for dimensionality reduction.

Algorithm:
• Calculate the correlation matrix for the dataset features.
• Identify pairs of features with correlation above a threshold (e.g., 0.8).
• Remove one feature from each highly correlated pair.
• Return the reduced dataset.

PROGRAM:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Sample dataset (replace with your data)
data = pd.DataFrame({
    'A': np.random.rand(10),
    'B': np.random.rand(10),
    'C': np.random.rand(10),
    'D': np.random.rand(10)
})

# Introduce high correlation manually for demo
data['E'] = data['A'] * 0.95 + np.random.rand(10) * 0.05

# 1. Calculate the absolute correlation matrix
corr_matrix = data.corr().abs()

# 2. Select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# 3. Find columns with correlation > 0.8 to any other column
to_drop = [column for column in upper.columns if any(upper[column] > 0.8)]

# 4. Drop the highly correlated columns
reduced_data = data.drop(columns=to_drop)

print("Dropped columns:", to_drop)
print("Reduced Data:")
print(reduced_data)

# Plot correlation matrix before and after
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(corr_matrix, annot=True, ax=ax[0], cmap='coolwarm')
ax[0].set_title('Original Correlation Matrix')
sns.heatmap(reduced_data.corr(), annot=True, ax=ax[1], cmap='coolwarm')
ax[1].set_title('Reduced Correlation Matrix')
plt.tight_layout()
plt.show()

Output:

Q11. Implement principal component analysis (PCA).

Algorithm:
1. Center the dataset (subtract the mean of each feature so it has zero mean).
2. Calculate the covariance matrix of the standardized data.
3. Calculate eigenvalues and eigenvectors of the covariance matrix.
4. Sort eigenvectors by decreasing eigenvalues.
5. Project the data onto the top K eigenvectors to get reduced dimensions.

PROGRAM:
import numpy as np

# Sample data (4 samples, 3 features)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.2],
              [2.2, 2.9, 0.7],
              [1.9, 2.2, 0.9]])

# Step 1: Center the data (zero mean for each feature)
X_meaned = X - np.mean(X, axis=0)

# Step 2: Covariance matrix
cov_mat = np.cov(X_meaned, rowvar=False)

# Step 3: Eigenvalues and eigenvectors (eigh is suited to symmetric matrices)
eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)

# Step 4: Sort eigenvectors by eigenvalues (descending)
sorted_index = np.argsort(eigen_values)[::-1]
sorted_eigenvectors = eigen_vectors[:, sorted_index]

# Step 5: Select top 2 eigenvectors (2 principal components)
n_components = 2
eigenvector_subset = sorted_eigenvectors[:, 0:n_components]

# Step 6: Project the data onto the principal components
X_reduced = np.dot(X_meaned, eigenvector_subset)

print("Original Data:\n", X)
print("\nReduced Data (2 Principal Components):\n", X_reduced)

OUTPUT:
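As a sanity check (a sketch assuming scikit-learn is installed), the manual result can be compared with sklearn.decomposition.PCA; the projected coordinates should match up to a possible sign flip per component:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X)      # X is the 4x3 array from the program above
print("sklearn projection:\n", X_sklearn)
print("Explained variance ratio:", pca.explained_variance_ratio_)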
Q12. Upload an Excel/CSV data file (containing Student_ID, Student_Name, Gender, Sub1, Sub2, Sub3 with the marks of 30 students). Perform the following tasks:
(i) Check for missing values, and replace them with a suitable replacement.
(ii) Create two DataFrames containing Student_ID and Student_Name of male and female students.
(iii) Add a new column 'Percentage' to the DataFrame showing the total percentage of each student.
(iv) Normalize the marks of each subject.
(v) Draw a bar diagram showing the number of male and female students in the class.
(vi) Draw a pie chart showing the number of students having percentage
(a) >= 60  (b) >= 50 and < 60  (c) < 50

Algorithm:

1. Load the CSV/Excel data into a pandas DataFrame.


2. Check for missing values and replace them (e.g., with the column mean).
3. Create two DataFrames by filtering male and female students.
4. Calculate percentage for each student and add as a new column.
5. Normalize marks in Sub1, Sub2, Sub3 using Min-Max normalization.
6. Plot bar chart for count of male and female students.
7. Plot pie chart for categories of percentage ranges.

PROGRAM:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data (replace 'students.csv' with your file)
df = pd.read_csv('students.csv')

# (i) Check and replace missing values with the mean of each subject column
df[['Sub1', 'Sub2', 'Sub3']] = df[['Sub1', 'Sub2', 'Sub3']].apply(lambda x: x.fillna(x.mean()))

# (ii) Create separate DataFrames for male and female students
df_male = df[df['Gender'] == 'Male'][['Student_ID', 'Student_Name']]
df_female = df[df['Gender'] == 'Female'][['Student_ID', 'Student_Name']]

# (iii) Add Percentage column (marks assumed to be out of 100 per subject)
df['Total'] = df[['Sub1', 'Sub2', 'Sub3']].sum(axis=1)
df['Percentage'] = df['Total'] / 3

# (iv) Normalize marks (Min-Max normalization)
for sub in ['Sub1', 'Sub2', 'Sub3']:
    df[sub + '_norm'] = (df[sub] - df[sub].min()) / (df[sub].max() - df[sub].min())

# (v) Bar chart: number of male and female students
gender_counts = df['Gender'].value_counts()
plt.figure(figsize=(6, 4))
gender_counts.plot(kind='bar', color=['blue', 'orange'])
plt.title('Number of Male and Female Students')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

# (vi) Pie chart for percentage ranges
conditions = [
    (df['Percentage'] >= 60),
    (df['Percentage'] >= 50) & (df['Percentage'] < 60),
    (df['Percentage'] < 50)
]
labels = ['>= 60', '50-59', '< 50']
df['Category'] = np.select(conditions, labels)
category_counts = df['Category'].value_counts()
plt.figure(figsize=(6, 6))
category_counts.plot(kind='pie', autopct='%1.1f%%', startangle=140)
plt.title('Percentage Categories of Students')
plt.ylabel('')
plt.show()

Output:
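The program assumes a file named students.csv in the working directory. For testing without a real file, here is a hedged sketch that writes a synthetic file in the expected format (the column names follow the question; all values are made up):

import numpy as np
import pandas as pd

np.random.seed(1)
n = 30
demo = pd.DataFrame({
    'Student_ID': range(1, n + 1),
    'Student_Name': [f'Student_{i}' for i in range(1, n + 1)],
    'Gender': np.random.choice(['Male', 'Female'], size=n),
    'Sub1': np.random.randint(30, 100, size=n).astype(float),
    'Sub2': np.random.randint(30, 100, size=n).astype(float),
    'Sub3': np.random.randint(30, 100, size=n).astype(float),
})
demo.loc[2, 'Sub2'] = np.nan     # inject one missing value to exercise step (i)
demo.to_csv('students.csv', index=False)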

Q13. Implement Linear Regression.

Algorithm:

1. Import necessary libraries.


2. Create sample input (X) and output (y) data.
3. Fit a Linear Regression model.
4. Predict using the model.
5. Plot actual vs predicted values.

PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Step 1: Data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])

# Step 2: Model training


model = LinearRegression()
model.fit(X, y)

# Step 3: Prediction
y_pred = model.predict(X)
# Step 4: Plot
plt.scatter(X, y, color='blue', label='Actual')
plt.plot(X, y_pred, color='red', label='Predicted')
plt.title("Linear Regression")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
Output:
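Since y is exactly 10 times X in this sample, the fitted line should come out very close to y = 10x. A small hedged follow-up, reusing model from the program above, that prints the learned parameters and a prediction for an unseen input:

print("Slope:", model.coef_[0])            # expected to be close to 10
print("Intercept:", model.intercept_)      # expected to be close to 0
print("Prediction for X=6:", model.predict([[6]])[0])   # expected to be close to 60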

Q14. Implement K-Nearest Neighbours algorithm.

Algorithm:
1. Import necessary libraries.
2. Load or create sample dataset (X, y).
3. Split data into training and testing sets.
4. Fit KNN model with training data.
5. Predict values for test data.
6. Evaluate accuracy.
7. (Optional) Plot data and decision boundaries.

Program:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Step 1: Load dataset
data = load_iris()
X = data.data
y = data.target
# Step 2: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Step 3: Train KNN model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
# Step 4: Prediction and accuracy
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Output:
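The accuracy depends on the choice of k. A short hedged sketch that reuses the train/test split above and compares a few values of n_neighbors (the list of k values is arbitrary):

for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(f"k={k}: accuracy = {accuracy_score(y_test, knn.predict(X_test)):.3f}")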

Q15. Implement Naive Bayes algorithm.

Algorithm:
1. Import libraries and dataset.
2. Split dataset into training and testing sets.
3. Train the Naive Bayes classifier on training data.
4. Predict the output for test data.
5. Evaluate the accuracy.
6. (Optional) Visualize results using a 2D plot.
Program:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Step 1: Load dataset


data = load_iris()
X = data.data
y = data.target

# Step 2: Split into train/test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Step 3: Train Naive Bayes model


model = GaussianNB()
model.fit(X_train, y_train)

# Step 4: Predict and evaluate


y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Output:
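A hedged follow-up, reusing model and data from above, that classifies a single new measurement and maps the numeric prediction back to the species name (the sample values are made up):

import numpy as np

sample = np.array([[5.1, 3.5, 1.4, 0.2]])   # hypothetical sepal/petal measurements
pred = model.predict(sample)[0]
print("Predicted species:", data.target_names[pred])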

Q16. Implement Decision Tree algorithm.

import pandas as pd
import numpy as np
from collections import Counter
import pprint

# Classic "Play Tennis" dataset
data = [
    ['Sunny', 'Hot', 'High', False, 'No'],
    ['Sunny', 'Hot', 'High', True, 'No'],
    ['Overcast', 'Hot', 'High', False, 'Yes'],
    ['Rainy', 'Mild', 'High', False, 'Yes'],
    ['Rainy', 'Cool', 'Normal', False, 'Yes'],
    ['Rainy', 'Cool', 'Normal', True, 'No'],
    ['Overcast', 'Cool', 'Normal', True, 'Yes'],
    ['Sunny', 'Mild', 'High', False, 'No'],
    ['Sunny', 'Cool', 'Normal', False, 'Yes'],
    ['Rainy', 'Mild', 'Normal', False, 'Yes'],
    ['Sunny', 'Mild', 'Normal', True, 'Yes'],
    ['Overcast', 'Mild', 'High', True, 'Yes'],
    ['Overcast', 'Hot', 'Normal', False, 'Yes'],
    ['Rainy', 'Mild', 'High', True, 'No']
]
df = pd.DataFrame(data, columns=['Outlook', 'Temperature', 'Humidity', 'Windy', 'Play'])

def entropy(y):
    # Shannon entropy of a label column
    counts = Counter(y)
    total = len(y)
    return -sum((c / total) * np.log2(c / total) for c in counts.values())

def info_gain(data, feature, target):
    # Information gain of splitting the data on the given feature
    original_entropy = entropy(data[target])
    values = data[feature].unique()
    weighted_entropy = 0
    for v in values:
        subset = data[data[feature] == v]
        weighted_entropy += (len(subset) / len(data)) * entropy(subset[target])
    return original_entropy - weighted_entropy

def id3(data, target, features):
    # If all labels are identical, return that label as a leaf
    if len(set(data[target])) == 1:
        return data[target].iloc[0]
    # If no features remain, return the majority label
    if not features:
        return Counter(data[target]).most_common(1)[0][0]
    # Otherwise split on the feature with the highest information gain
    gains = [(f, info_gain(data, f, target)) for f in features]
    best_feature = max(gains, key=lambda x: x[1])[0]
    tree = {best_feature: {}}
    for value in data[best_feature].unique():
        subset = data[data[best_feature] == value]
        sub_tree = id3(subset, target, [f for f in features if f != best_feature])
        tree[best_feature][value] = sub_tree
    return tree

features = ['Outlook', 'Temperature', 'Humidity', 'Windy']
tree = id3(df, 'Play', features)
pprint.pprint(tree)

OUTPUT:
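The ID3 tree above is just a nested dictionary, so classifying a new sample means walking that dictionary feature by feature. The classify helper and the sample row below are additions for illustration, not part of the original program:

def classify(tree, sample, default='Yes'):
    # A leaf is a plain label ('Yes'/'No'); an internal node is a dict keyed by a feature name
    if not isinstance(tree, dict):
        return tree
    feature = next(iter(tree))
    value = sample.get(feature)
    if value not in tree[feature]:
        return default            # unseen attribute value: fall back to a default label
    return classify(tree[feature][value], sample, default)

new_day = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Windy': True}
print("Play?", classify(tree, new_day))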

Q17. Implement Support Vector Machine.


CODE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("Iris.csv")
df.drop(columns=['Id'], inplace=True, errors='ignore')
le = LabelEncoder()
df['Species'] = le.fit_transform(df['Species']) # 0,1,2
X = df.drop('Species', axis=1)
y = df['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Q18. Implement Logistic Regression.


CODE:
import numpy as np
import matplotlib.pyplot as plt

# Hours studied vs. pass/fail labels
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    # Batch gradient descent on the logistic loss
    m = len(y)
    X_b = np.c_[np.ones((m, 1)), X]       # add a bias column
    theta = np.zeros((2, 1))
    y = y.reshape(-1, 1)
    for epoch in range(epochs):
        z = X_b @ theta
        h = sigmoid(z)
        gradient = X_b.T @ (h - y) / m
        theta -= lr * gradient
    return theta

def predict(X, theta):
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    probs = sigmoid(X_b @ theta)
    return (probs >= 0.5).astype(int)

theta = train_logistic_regression(X, y)
X_test = np.array([[1], [2], [3], [4], [5], [6]])
y_pred = predict(X_test, theta)
for i in range(len(X_test)):
    print(f"Hours studied: {X_test[i][0]}, Predicted Pass: {y_pred[i][0]}")

# Plot the fitted sigmoid curve against the training points
x_vals = np.linspace(0, 7, 100)
x_vals_b = np.c_[np.ones((100, 1)), x_vals.reshape(-1, 1)]
y_vals = sigmoid(x_vals_b @ theta)
plt.scatter(X, y, color='red', label='Actual')
plt.plot(x_vals, y_vals, label='Logistic Curve')
plt.xlabel("Hours Studied")
plt.ylabel("Probability of Passing")
plt.title("Logistic Regression: Pass Prediction")
plt.legend()
plt.grid(True)
plt.show()
OUTPUT:
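As a cross-check (a minimal sketch assuming scikit-learn is installed), the same data can be fitted with the library implementation and its predictions compared with the gradient-descent version above:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, y)                  # X, y from the program above
print("sklearn predictions:", clf.predict(X_test).tolist())
print("Intercept:", clf.intercept_, "Coefficient:", clf.coef_)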
