EX NO: 3
DATE:
IN THIS QUESTION YOU WILL USE SCIKIT-LEARN’S KNN
CLASSIFIER TO CLASSIFY REAL VS FAKE NEWS
HEADLINES. THE AIM OF THIS QUESTION IS FOR YOU TO
READ THE SCIKIT-LEARN API AND GET COMFORTABLE
WITH TRAINING/VALIDATION SPLITS. USE CALIFORNIA
HOUSING DATASET.
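AIM:
To implement classification using nearest neighbors with Scikit-learn’s KNN classifier and to
evaluate the model’s performance with training/validation splits and metrics.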
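ALGORITHM:
1. Start
2. Import the required libraries.
3. Load the dataset and separate the features (X) from the target (y).
4. Split the data into training and test sets.
5. For each value of K, train a KNeighborsClassifier and record the training and test accuracy.
6. Plot K against accuracy.
7. Train the final classifier with the chosen K (here K = 5) and predict on the test set.
8. Compute and display the confusion matrix.
9. Stop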
PROGRAM:
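# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load Dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split Data
# (the split line is missing in the source; an 80/20 stratified split is assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Train a classifier for each K and record the accuracies
# (the range of K is missing in the source; 1 to 20 is assumed)
neighbors = range(1, 21)
train_accuracy = []
test_accuracy = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy.append(knn.score(X_train, y_train))
    test_accuracy.append(knn.score(X_test, y_test))

# Plot K vs Accuracy
plt.figure(figsize=(8,5))
plt.plot(neighbors, train_accuracy, label="Training Accuracy")
plt.plot(neighbors, test_accuracy, label="Testing Accuracy")
plt.xlabel("Number of Neighbors (K)")
plt.ylabel("Accuracy")
plt.legend()
plt.grid(True)
plt.show()

# Train the final model with the chosen K and predict on the test set
best_k = 5
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Evaluation
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Heatmap of the confusion matrix (the seaborn call is assumed from the surviving arguments)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=wine.target_names, yticklabels=wine.target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()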
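The exercise statement asks for training/validation splits, while the program scores each K on the test set. A minimal sketch of one way to choose K on a separate validation set is given here. The 60/20/20 proportions, the K range of 1 to 20, and the random_state values are assumptions made for illustration and are not part of the original record.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine = load_wine()
X, y = wine.data, wine.target

# Assumed proportions: hold out 20% as a test set, then take 25% of the
# remainder as a validation set (roughly 60% train, 20% validation, 20% test)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

# Choose K using validation accuracy only; the test set is not touched here
val_accuracy = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    val_accuracy[k] = knn.score(X_val, y_val)

best_k = max(val_accuracy, key=val_accuracy.get)

# Refit on training + validation data and score once on the test set
final_knn = KNeighborsClassifier(n_neighbors=best_k)
final_knn.fit(X_temp, y_temp)
test_acc = final_knn.score(X_test, y_test)
print(f"best_k={best_k}, "
      f"validation accuracy={val_accuracy[best_k]:.3f}, "
      f"test accuracy={test_acc:.3f}")
```

Keeping the test set out of the selection loop means the final test accuracy is not biased by the choice of K.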
RESULT:
Thus, a program for classification using nearest neighbors with Scikit-learn’s KNN classifier
was implemented, and the model’s performance was evaluated with training/validation splits
and metrics. The program was executed successfully.
EX NO: 4
DATE:
IN THIS EXERCISE, YOU'LL EXPERIMENT WITH
VALIDATION SETS AND TEST SETS USING THE
DATASET. SPLIT A TRAINING SET INTO A SMALLER
TRAINING SET AND A VALIDATION SET. ANALYZE
DELTAS BETWEEN TRAINING SET AND VALIDATION
SET RESULTS. TEST THE TRAINED MODEL WITH A
TEST SET TO DETERMINE WHETHER YOUR TRAINED
MODEL IS OVERFITTING. DETECT AND FIX A COMMON
TRAINING PROBLEM.
AIM:
To analyze the difference in performance between training and validation sets to determine if
the model is overfitting, and to visualize the results to detect and address this issue.
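ALGORITHM:
1. Start
2. Import the required libraries.
3. Generate a synthetic classification dataset (10000 samples, 20 features).
4. Split the data into training and test sets.
5. For each tree depth, fit the model and predict on the training and test sets.
6. Compute the training and test accuracy and store them.
7. Plot accuracy against tree depth.
8. Compare the two curves; a growing gap between them indicates overfitting.
9. Stop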
PROGRAM:
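import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset (the call is truncated in the source; random_state is assumed)
X, y = make_classification(n_samples=10000, n_features=20,
                           random_state=1)
# Hold out a test set (the split is missing in the source; 70/30 is assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit a decision tree at each depth (model and depth range are assumed from the plot labels)
depth_range = range(1, 21)
train_scores = []
test_scores = []
for depth in depth_range:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    train_yhat = model.predict(X_train)
    test_yhat = model.predict(X_test)
    train_acc = accuracy_score(y_train, train_yhat)
    test_acc = accuracy_score(y_test, test_yhat)
    train_scores.append(train_acc)
    test_scores.append(test_acc)

plt.figure(figsize=(10, 6))
plt.plot(depth_range, train_scores, '-o', label='Training Accuracy')
plt.plot(depth_range, test_scores, '-o', label='Test Accuracy')
plt.xlabel('Tree Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()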
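The exercise statement asks for the training set to be split into a smaller training set and a validation set, and for the deltas between the two to be analyzed. A minimal sketch of that workflow is given here. The dataset size, the split proportions, the decision tree model, and the depth limit of 4 are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; every generator argument here is an illustrative assumption
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Test set first, then split the remaining training data into a smaller
# training set and a validation set
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)


def evaluate(max_depth):
    """Fit a tree with the given depth limit; return it with train/validation accuracy."""
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    return model, train_acc, val_acc


# An unrestricted tree can memorize the training set
deep_model, deep_train, deep_val = evaluate(None)
# Limiting the depth is one common way to reduce overfitting
shallow_model, shallow_train, shallow_val = evaluate(4)

print(f"unrestricted: train={deep_train:.3f} val={deep_val:.3f} "
      f"delta={deep_train - deep_val:.3f}")
print(f"max_depth=4:  train={shallow_train:.3f} val={shallow_val:.3f} "
      f"delta={shallow_train - shallow_val:.3f}")

# Score the depth-limited model once on the untouched test set
test_acc = accuracy_score(y_test, shallow_model.predict(X_test))
print(f"test accuracy (max_depth=4): {test_acc:.3f}")
```

A large positive delta between training and validation accuracy is the usual sign of overfitting; the depth-limited tree typically shows a smaller delta, though the exact numbers depend on the data.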
RESULT:
Thus, the difference in performance between the training and validation sets was analyzed to
determine whether the model is overfitting, and the results were visualized to detect and
address this issue. The program was executed successfully.
EX NO: 5
DATE:
Implement the K-Means algorithm using the given dataset.
Aim:
To implement the K-Means Clustering Algorithm on the given biological dataset and group
the data points (species) based on their codon usage frequencies.
Algorithm:
1. Start
2. Import the required libraries (pandas, sklearn, matplotlib, etc.).
3. Load the dataset containing codon usage frequencies and other features.
4. Select the relevant numerical features (e.g., UUU, UUC, UUA, UUG).
5. Normalize the features using StandardScaler for better clustering results.
6. Choose the number of clusters (e.g., k = 3).
7. Apply the K-Means clustering algorithm:
   ■ Initialize centroids randomly.
   ■ Assign each point to the nearest centroid.
   ■ Update centroids as the mean of assigned points.
   ■ Repeat the assignment and update steps until the centroids do not change or the maximum number of iterations is reached.
8. Assign the cluster labels to the dataset.
9. Display the final clustered data with the cluster number.
10. Optionally, visualize clusters using scatter plots.
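The inner loop of step 7 can be sketched directly in NumPy. This is a minimal illustration of the initialize, assign, update, and repeat cycle, not the implementation used by scikit-learn. The function name kmeans, its parameters, and the synthetic four-column data standing in for codon frequencies are all hypothetical.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means iterations following the four sub-steps of step 7."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Synthetic stand-in for four feature columns (not real codon usage data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(30, 4)) for c in (0.1, 0.5, 0.9)])

# Standardize each column, as in step 5
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

labels, centroids = kmeans(X_scaled, k=3)
print(np.bincount(labels, minlength=3))  # number of points in each cluster
```

Because the centroids start at random data points, different seeds can converge to different groupings; library implementations usually run several initializations and keep the best one.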
PROGRAM:
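import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sample data. The source omits the remaining species names and the codon
# frequency columns (UUU, UUC, UUA, UUG); they must be supplied before running.
data = {
    'DNAType': [0]*7,
    'SpeciesName': [
        'Bohle iridovirus',
        # ... remaining species names omitted in the source
        'Homo sapiens'
    ],
}
df = pd.DataFrame(data)

# Select the codon frequency features named in the algorithm
features = df[['UUU', 'UUC', 'UUA', 'UUG']]

# Normalize the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Apply K-Means with k = 3 (n_init and random_state are assumed)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_features)

# Display the clustered data
pd.set_option('display.max_columns', None)
print(df)

# Visualize the clusters on two of the features
plt.figure(figsize=(8, 6))
plt.scatter(df['UUU'], df['UUC'], c=df['Cluster'], cmap='viridis')
plt.xlabel("UUU Frequency")
plt.ylabel("UUC Frequency")
plt.grid(True)
plt.show()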
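Step 6 of the algorithm fixes the number of clusters by hand (e.g., k = 3). One possible way to compare candidate values of k is sketched here, using inertia and the silhouette score from scikit-learn. The synthetic data, the candidate range of 2 to 6, and the n_init and random_state settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four feature columns (not real codon usage data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(40, 4)) for c in (0.1, 0.5, 0.9)])
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for each candidate k and record inertia and silhouette score
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    cluster_labels = km.fit_predict(X_scaled)
    scores[k] = (km.inertia_, silhouette_score(X_scaled, cluster_labels))
    print(f"k={k}: inertia={scores[k][0]:.2f}, silhouette={scores[k][1]:.3f}")

# Pick the k with the highest silhouette score
best_k = max(scores, key=lambda k: scores[k][1])
print("k with the highest silhouette score:", best_k)
```

Inertia tends to keep falling as k grows, so it is usually read as an elbow plot, while the silhouette score gives a single value that can be compared across k.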
SAMPLE INPUT OUTPUT:
RESULT: