MLFILE
Algorithm:
1. Choose the number of clusters K and initialize K centroids (for example, by picking K random data points).
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Stop when centroids do not change significantly or after a set number of iterations.
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
# Sample 2D data (assumed for illustration)
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
# Number of clusters
k = 2
np.random.seed(0)
# Initialize centroids with k randomly chosen data points
centroids = X[np.random.choice(len(X), k, replace=False)]
for _ in range(100):
    # Assign each point to its nearest centroid
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
    # Update centroids as the mean of their assigned points
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    if np.allclose(centroids, new_centroids):  # stop when centroids no longer change
        break
    centroids = new_centroids
# Plot each cluster and the final centroids
for i in range(k):
    plt.scatter(X[labels == i, 0], X[labels == i, 1], label=f"Cluster {i}")
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='x', label='Centroids')
plt.title("K-Means Clustering")
plt.legend()
plt.show()
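For comparison, the same grouping can be obtained with scikit-learn's KMeans; a minimal sketch reusing the X and k defined above (n_init and random_state are illustrative choices):
from sklearn.cluster import KMeans
# fit_predict returns one cluster label per row of X
sk_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print("scikit-learn labels:", sk_labels)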
OUTPUT:
Q6. Implement Hierarchical Clustering.
Algorithm:
1. Treat every data point as its own cluster.
2. Compute the distance between every pair of clusters.
3. Merge the two closest clusters (Ward linkage merges the pair that gives the smallest increase in within-cluster variance).
4. Repeat steps 2-3 until all points belong to a single cluster.
5. Plot the dendrogram to visualize the order in which clusters were merged.
PROGRAM:
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
# Sample 2D data (assumed for illustration)
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [9, 7]])
# Build the linkage matrix using Ward's method
Z = linkage(X, method='ward')
# Plot dendrogram
dendrogram(Z)
plt.title("Hierarchical Clustering Dendrogram")
plt.show()
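To get flat cluster labels from the same linkage matrix, the dendrogram can be cut with fcluster; a small sketch, assuming the Z computed above and a target of 2 clusters:
from scipy.cluster.hierarchy import fcluster
# Cut the tree so that exactly 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print("Cluster labels:", labels)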
Q7. Implement DBSCAN algorithm.
Algorithm:
1. For every point, find all neighbours that lie within a distance eps.
2. Mark points that have at least min_samples neighbours as core points.
3. Form a cluster from each group of core points that lie within eps of one another, and attach their non-core neighbours to it.
4. Label any point that belongs to no cluster as noise.
Program:
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
# Sample 2D data
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])
# Run DBSCAN
db = DBSCAN(eps=2, min_samples=2).fit(X)
labels = db.labels_
# Plotting
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("DBSCAN Clustering")
plt.show()
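DBSCAN marks noise points with the label -1, so the result can be summarised with a short follow-up sketch using the labels array computed above:
# Number of clusters found (the noise label -1 is not a cluster)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print("Clusters:", n_clusters, "Noise points:", n_noise)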
Output:
PROGRAM:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, _ = make_blobs(n_samples=100, centers=3)
# KMeans clustering
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X)
# Silhouette Score
score = silhouette_score(X, labels)
print("Silhouette Score:", score)
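In practice the silhouette score is usually computed for several candidate values of k, and the k with the highest score is chosen; a minimal sketch reusing the same X:
for k in range(2, 6):
    labels_k = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print("k =", k, "silhouette =", silhouette_score(X, labels_k))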
Output:
PROGRAM:
import pandas as pd
# Sample data
data = {
    'F1': [1, 1, 1, 1, 1],  # Low variance
    'F2': [2, 2, 2, 2, 2],  # Low variance
    'F3': [1, 2, 3, 4, 5],  # High variance
}
df = pd.DataFrame(data)
# Keep only the features whose variance exceeds a threshold (assumed threshold = 0.1)
filtered_df = df.loc[:, df.var() > 0.1]
print("Reduced DataFrame:")
print(filtered_df)
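The same low-variance filtering is also available in scikit-learn as VarianceThreshold; a brief sketch on the same DataFrame (the 0.1 threshold is an assumed choice):
from sklearn.feature_selection import VarianceThreshold
# Keep only features whose variance is greater than 0.1
selector = VarianceThreshold(threshold=0.1)
reduced = selector.fit_transform(df)
print("Kept features:", df.columns[selector.get_support()].tolist())
print(reduced)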
Output:
Algorithm:
1. Calculate the correlation matrix for the dataset features.
2. Visualize the correlation matrix as a heatmap to identify highly correlated feature pairs.
PROGRAM:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Sample data (assumed for illustration)
np.random.seed(0)
df = pd.DataFrame(np.random.rand(50, 4), columns=['A', 'B', 'C', 'D'])
# Correlation matrix plotted as a heatmap
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.show()
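When the goal is feature selection rather than visualization alone, a common follow-up is to drop one feature from every pair whose absolute correlation exceeds a threshold; a sketch reusing df (the 0.9 threshold is an assumed choice):
corr = df.corr().abs()
# Look only at the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced_df = df.drop(columns=to_drop)
print("Dropped features:", to_drop)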
Output:
Algorithm:
1. Standardize the dataset (zero mean).
2. Calculate the covariance matrix of the standardized data.
3. Calculate eigenvalues and eigenvectors of the covariance matrix.
4. Sort eigenvectors by decreasing eigenvalues.
5. Project the data onto the top K eigenvectors to get reduced dimensions.
PROGRAM:
import numpy as np
# Sample data (4 samples, 3 features)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.2],
              [2.2, 2.9, 0.7],
              [1.9, 2.2, 0.9]])
# Step 1: Standardize data (mean=0)
X_meaned = X - np.mean(X, axis=0)
# Step 2: Covariance matrix of the centred data
cov_mat = np.cov(X_meaned, rowvar=False)
# Step 3: Eigenvalues and eigenvectors
eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
# Step 4: Sort eigenvectors by decreasing eigenvalues
sorted_idx = np.argsort(eigen_values)[::-1]
eigen_vectors = eigen_vectors[:, sorted_idx]
# Step 5: Project the data onto the top 2 eigenvectors
X_reduced = X_meaned @ eigen_vectors[:, :2]
print("Original Data:\n", X)
print("\nReduced Data (2 Principal Components):\n", X_reduced)
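The result can be cross-checked against scikit-learn's PCA; a minimal sketch on the same X (the signs of individual components may differ, which is expected for PCA):
from sklearn.decomposition import PCA
# PCA centres the data internally, so X can be passed directly
print(PCA(n_components=2).fit_transform(X))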
OUTPUT:
Q12. Upload an Excel/CSV data file (containing Student_ID, Student_Name, Gender, Sub1, Sub2, Sub3 with the marks of 30 students). Perform the following tasks:
(i) Check for missing values and replace them with a suitable replacement.
(ii) Create two DataFrames containing Student_ID and Student_Name of male and female students.
(iii) Add a new column 'Percentage' to the DataFrame showing the total percentage of each student.
(iv) Normalize the marks of each subject.
(v) Draw a bar diagram showing the number of male and female students in the class.
(vi) Draw a pie chart showing the number of students having percentage (a) >= 60, (b) >= 50 and < 60, (c) < 50.
Algorithm:
1. Read the CSV file into a DataFrame.
2. Check for missing values and replace them with the mean of the respective subject column.
3. Split the DataFrame by Gender to get separate male and female DataFrames.
4. Compute the total marks and add the Percentage column (Total / 3).
5. Normalize the marks of each subject using min-max normalization.
6. Plot a bar chart of gender counts and a pie chart of the percentage categories.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('students.csv')
subjects = ['Sub1', 'Sub2', 'Sub3']
# (i) Check and replace missing values with mean of the column
print(df.isnull().sum())
df[subjects] = df[subjects].fillna(df[subjects].mean())
# (ii) Male and female DataFrames (assumes Gender is coded as 'Male'/'Female')
male_df = df.loc[df['Gender'] == 'Male', ['Student_ID', 'Student_Name']]
female_df = df.loc[df['Gender'] == 'Female', ['Student_ID', 'Student_Name']]
# (iii) Percentage column (each subject assumed to be out of 100)
df['Total'] = df[subjects].sum(axis=1)
df['Percentage'] = df['Total'] / 3
# (iv) Min-max normalization of each subject's marks
df[subjects] = (df[subjects] - df[subjects].min()) / (df[subjects].max() - df[subjects].min())
# (v) Bar diagram of male and female students
gender_counts = df['Gender'].value_counts()
plt.figure(figsize=(6,4))
gender_counts.plot(kind='bar')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()
# (vi) Pie chart of percentage categories
conditions = [df['Percentage'] >= 60,
              (df['Percentage'] >= 50) & (df['Percentage'] < 60),
              df['Percentage'] < 50]
df['Category'] = np.select(conditions, ['>= 60', '>= 50 and < 60', '< 50'])
category_counts = df['Category'].value_counts()
plt.figure(figsize=(6,6))
category_counts.plot(kind='pie', autopct='%1.1f%%')
plt.ylabel('')
plt.show()
Output:
Algorithm:
1. Prepare the input feature array X and the target values y.
2. Train a LinearRegression model on (X, y).
3. Predict y for the given X using the trained model.
4. Plot the actual points and the fitted regression line.
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Step 1: Data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])
# Step 2: Train the model
model = LinearRegression()
model.fit(X, y)
# Step 3: Prediction
y_pred = model.predict(X)
# Step 4: Plot
plt.scatter(X, y, color='blue', label='Actual')
plt.plot(X, y_pred, color='red', label='Predicted')
plt.title("Linear Regression")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
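The fitted slope and intercept can also be printed, together with error metrics; a short follow-up sketch (for this perfectly linear data the slope is 10, the intercept 0, and the error essentially zero):
from sklearn.metrics import mean_squared_error, r2_score
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y, y_pred), "R^2:", r2_score(y, y_pred))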
Output:
Algorithm:
1. Import necessary libraries.
2. Load or create sample dataset (X, y).
3. Split data into training and testing sets.
4. Fit KNN model with training data.
5. Predict values for test data.
6. Evaluate accuracy.
7. (Optional) Plot data and decision boundaries.
Program:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Step 1: Load dataset
data = load_iris()
X = data.data
y = data.target
# Step 2: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Step 3: Train KNN model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
# Step 4: Prediction and accuracy
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
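The choice of n_neighbors affects the accuracy; a quick sketch that reuses the same train/test split to compare a few values of k:
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("k =", k, "accuracy =", accuracy_score(y_test, knn.predict(X_test)))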
Output:
Algorithm:
1. Import libraries and dataset.
2. Split dataset into training and testing sets.
3. Train the Naive Bayes classifier on training data.
4. Predict the output for test data.
5. Evaluate the accuracy.
6. (Optional) Visualize results using a 2D plot.
Program:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Load dataset and split into training and testing sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the Gaussian Naive Bayes classifier and evaluate accuracy
model = GaussianNB().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
Output:
import pandas as pd
import numpy as np
from collections import Counter
data = [
['Sunny', 'Hot', 'High', False, 'No'],
['Sunny', 'Hot', 'High', True, 'No'],
['Overcast', 'Hot', 'High', False, 'Yes'],
['Rainy', 'Mild', 'High', False, 'Yes'],
['Rainy', 'Cool', 'Normal', False, 'Yes'],
['Rainy', 'Cool', 'Normal', True, 'No'],
['Overcast', 'Cool', 'Normal', True, 'Yes'],
['Sunny', 'Mild', 'High', False, 'No'],
['Sunny', 'Cool', 'Normal', False, 'Yes'],
['Rainy', 'Mild', 'Normal', False, 'Yes'],
['Sunny', 'Mild', 'Normal', True, 'Yes'],
['Overcast', 'Mild', 'High', True, 'Yes'],
['Overcast', 'Hot', 'Normal', False, 'Yes'],
['Rainy', 'Mild', 'High', True, 'No']
]
df = pd.DataFrame(data, columns=['Outlook', 'Temperature', 'Humidity', 'Windy', 'Play'])
# Entropy of a label distribution
def entropy(y):
    counts = Counter(y)
    total = len(y)
    return -sum((c/total) * np.log2(c/total) for c in counts.values())

# Information gain obtained by splitting the data on one feature
def info_gain(data, feature, target):
    original_entropy = entropy(data[target])
    values = data[feature].unique()
    weighted_entropy = 0
    for v in values:
        subset = data[data[feature] == v]
        weighted_entropy += (len(subset)/len(data)) * entropy(subset[target])
    return original_entropy - weighted_entropy

# Recursively build the ID3 decision tree as a nested dictionary
def id3(data, target, features):
    if len(set(data[target])) == 1:       # pure node: return the class label
        return data[target].iloc[0]
    if not features:                      # no features left: return the majority class
        return Counter(data[target]).most_common(1)[0][0]
    gains = [(f, info_gain(data, f, target)) for f in features]
    best_feature = max(gains, key=lambda x: x[1])[0]
    tree = {best_feature: {}}
    for value in data[best_feature].unique():
        subset = data[data[best_feature] == value]
        sub_tree = id3(subset, target, [f for f in features if f != best_feature])
        tree[best_feature][value] = sub_tree
    return tree
features = ['Outlook', 'Temperature', 'Humidity', 'Windy']
tree = id3(df, 'Play', features)
import pprint
pprint.pprint(tree)
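The nested dictionary returned by id3 can also be used to classify a new day; a small helper sketch (hypothetical, assuming the tree format built above and that every feature value of the sample appears in the tree):
def predict(tree, sample):
    # Walk the nested dict until a leaf label ('Yes'/'No') is reached
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][sample[feature]]
    return tree
print(predict(tree, {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Windy': True}))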
OUTPUT: