
EX.NO: 3
DATE:

CLASSIFICATION WITH NEAREST NEIGHBORS

IN THIS QUESTION YOU WILL USE SCIKIT-LEARN'S KNN CLASSIFIER TO CLASSIFY REAL VS FAKE NEWS HEADLINES. THE AIM OF THIS QUESTION IS FOR YOU TO READ THE SCIKIT-LEARN API AND GET COMFORTABLE WITH TRAINING/VALIDATION SPLITS. USE CALIFORNIA HOUSING DATASET.

AIM:

To implement a program for classification with Nearest Neighbors using Scikit-learn's KNN classifier, and to evaluate the model's performance with training/validation splits and metrics.

ALGORITHM:

1. Start the program.
2. Import the necessary libraries and dataset.
3. Preprocess the dataset if needed.
4. Split the dataset into training and testing sets.
5. Train the K-Nearest Neighbors (KNN) model with different values of K.
6. Plot the accuracy scores for different values of K.
7. Choose the best K and evaluate the model (see the sketch after this list).
8. Print accuracy, confusion matrix, and classification report.
9. Plot the confusion matrix.
10. End the program.
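
Step 7 need not be done by eye: the best K can be read off the recorded validation accuracies. A minimal sketch, assuming the neighbors array and test_accuracy list built in the program below (the program itself hardcodes best_k = 5; this simply automates that choice):

# Pick the K with the highest validation accuracy
best_k = neighbors[np.argmax(test_accuracy)]
print(f"Best K by validation accuracy: {best_k}")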

PROGRAM:

# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load Dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Try different K values
neighbors = np.arange(1, 10)
train_accuracy = []
test_accuracy = []

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy.append(knn.score(X_train, y_train))
    test_accuracy.append(knn.score(X_test, y_test))

# Plot K vs Accuracy
plt.figure(figsize=(8, 5))
plt.plot(neighbors, train_accuracy, label="Train Accuracy", marker='o')
plt.plot(neighbors, test_accuracy, label="Test Accuracy", marker='s')
plt.xlabel("Number of Neighbors (K)")
plt.ylabel("Accuracy")
plt.title("KNN Accuracy for Different K Values")
plt.legend()
plt.grid(True)
plt.show()

# Final Model with Best K
best_k = 5
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Evaluation
acc = accuracy_score(y_test, y_pred)
print(f"\nFinal Accuracy (K={best_k}): {acc:.4f}")

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

# Plot Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=wine.target_names, yticklabels=wine.target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
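
Since KNN is distance-based, features with large numeric ranges can dominate the neighbor search, and the wine features span very different scales. A minimal variation, assuming the same split as above (StandardScaler and make_pipeline are standard scikit-learn utilities, not part of the original program):

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Standardize features before KNN so no single feature dominates the distances
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
scaled_knn.fit(X_train, y_train)
print(f"Scaled KNN accuracy: {scaled_knn.score(X_test, y_test):.4f}")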

SAMPLE INPUT OUTPUT:

RESULT:

Thus, the program for classification with Nearest Neighbors using Scikit-learn's KNN classifier, evaluating the model's performance with training/validation splits and metrics, was executed successfully.
EX NO: 4
DATE:

IN THIS EXERCISE, YOU'LL EXPERIMENT WITH VALIDATION SETS AND TEST SETS USING THE DATASET. SPLIT A TRAINING SET INTO A SMALLER TRAINING SET AND A VALIDATION SET. ANALYZE DELTAS BETWEEN TRAINING SET AND VALIDATION SET RESULTS. TEST THE TRAINED MODEL WITH A TEST SET TO DETERMINE WHETHER YOUR TRAINED MODEL IS OVERFITTING. DETECT AND FIX A COMMON TRAINING PROBLEM.

AIM:

To analyze the difference in performance between training and validation sets to determine if
the model is overfitting, and to visualize the results to detect and address this issue.

ALGORITHM:

1. Start the program.
2. Import necessary libraries and generate or load a classification dataset.
3. Split the dataset into training and testing sets.
4. Train a Decision Tree classifier with different depths.
5. Evaluate model accuracy on training and testing sets for each depth.
6. Record and compare the results.
7. Plot accuracy vs model complexity (tree depth).
8. Identify the presence of overfitting (see the sketch after this list).
9. End the program.
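
The deltas mentioned in the exercise can be computed directly from the two score lists. A minimal sketch, assuming the train_scores, test_scores, and depth_range produced by the program below (the 0.05 gap threshold is illustrative, not prescribed):

# Flag depths where the train/test gap suggests overfitting
deltas = [tr - te for tr, te in zip(train_scores, test_scores)]
for depth, delta in zip(depth_range, deltas):
    if delta > 0.05:  # illustrative threshold
        print(f"Depth={depth}: train/test gap {delta:.3f} suggests overfitting")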

PROGRAM:

# Step 1: Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2: Generate synthetic classification dataset
X, y = make_classification(n_samples=10000, n_features=20,
                           n_informative=5, n_redundant=15, random_state=1)

# Step 3: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train and test models with different max_depth values
train_scores = []
test_scores = []
depth_range = range(1, 21)

for depth in depth_range:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_yhat = model.predict(X_train)
    test_yhat = model.predict(X_test)
    train_acc = accuracy_score(y_train, train_yhat)
    test_acc = accuracy_score(y_test, test_yhat)
    train_scores.append(train_acc)
    test_scores.append(test_acc)
    print(f"Depth={depth}, Train Acc={train_acc:.3f}, Test Acc={test_acc:.3f}")

# Step 5: Plot Training vs Testing Accuracy
plt.figure(figsize=(10, 6))
plt.plot(depth_range, train_scores, '-o', label='Training Accuracy')
plt.plot(depth_range, test_scores, '-o', label='Testing Accuracy')
plt.xlabel('Tree Depth')
plt.ylabel('Accuracy')
plt.title('Overfitting Detection: Training vs Testing Accuracy')
plt.legend()
plt.grid(True)
plt.show()
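
The common fix for the overfitting detected above is to cap the tree's complexity. A minimal sketch, assuming the lists built by the program above: retrain at the depth that maximized test accuracy, a simple form of pre-pruning.

# Retrain with the depth that generalized best
best_depth = depth_range[test_scores.index(max(test_scores))]
fixed_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
fixed_model.fit(X_train, y_train)
print(f"Pruned tree (max_depth={best_depth}) test accuracy: "
      f"{accuracy_score(y_test, fixed_model.predict(X_test)):.3f}")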

SAMPLE INPUT OUTPUT:

RESULT:

Thus, the program to analyze the difference in performance between training and validation sets, determine whether the model is overfitting, and visualize the results to detect and address the issue was executed successfully.
EX NO: 5
DATE:

Implement the K-Means algorithm using the given dataset.

Aim:

To implement the K-Means Clustering Algorithm on the given biological dataset and group
the data points (species) based on their codon usage frequencies.

Algorithm: K-Means Clustering

1. Start.
2. Import the required libraries (pandas, sklearn, matplotlib, etc.).
3. Load the dataset containing codon usage frequencies and other features.
4. Select the relevant numerical features (e.g., UUU, UUC, UUA, UUG).
5. Normalize the features using StandardScaler for better clustering results.
6. Choose the number of clusters (e.g., k = 3).
7. Apply the K-Means clustering algorithm (see the sketch after this list):
   ■ Initialize centroids randomly.
   ■ Assign each point to the nearest centroid.
   ■ Update centroids as the mean of assigned points.
   ■ Repeat these steps until centroids do not change or max iterations are reached.
8. Assign the cluster labels to the dataset.
9. Display the final clustered data with the cluster number.
10. Optionally, visualize clusters using scatter plots.
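
Step 7's inner loop can be written out directly to show what K-Means does internally. A minimal NumPy sketch for illustration only; the program below uses scikit-learn's KMeans, which performs these steps for you:

import numpy as np

def kmeans_once(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids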

PROGRAM:

# Step 1: Import required libraries
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Create sample data
data = {
    'Kingdom': ['vrl'] * 5 + ['pri'] * 2,
    'DNAType': [0] * 7,
    'SpeciesID': [100217, 100220, 100755, 100880, 100887, 9601, 9606],
    'Ncodons': [1995, 1474, 4862, 1915, 22831, 1097, 40662582],
    'SpeciesName': [
        'Epizootic haemotopoietic necrosis virus',
        'Bohle iridovirus',
        'Sweet potato leaf curl virus',
        'Northern cereal mosaic virus',
        'Soil-borne cereal mosaic virus',
        'Pongo pygmaeus abelii',
        'Homo sapiens'
    ],
    'UUU': [0.01654, 0.02714, 0.01974, 0.01775, 0.02816, 0.02552, 0.01757],
    'UUC': [0.01203, 0.01357, 0.0218, 0.02245, 0.01371, 0.03555, 0.02028],
    'UUA': [0.0005, 0.00068, 0.01357, 0.01619, 0.00767, 0.00547, 0.00767],
    'UUG': [0.00351, 0.00678, 0.01543, 0.00992, 0.03679, 0.01367, 0.01293]
}
df = pd.DataFrame(data)

# Step 3: Extract codon usage features
features = df[['UUU', 'UUC', 'UUA', 'UUG']]

# Step 4: Normalize codon frequencies
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Step 5: Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0)
df['Cluster'] = kmeans.fit_predict(scaled_features)

# Step 6: Display final output
pd.set_option('display.max_columns', None)
print("Final Output:\n")
print(df[['Kingdom', 'DNAType', 'SpeciesID', 'Ncodons', 'SpeciesName',
          'UUU', 'UUC', 'UUA', 'UUG', 'Cluster']])

# Step 7: Optional visualization
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='UUU', y='UUC', hue='Cluster', palette='Set1', s=100)
plt.title("Codon Usage Clustering using K-Means")
plt.xlabel("UUU Frequency")
plt.ylabel("UUC Frequency")
plt.grid(True)
plt.show()
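
Step 6 fixes k = 3 by hand; the elbow method is a common way to sanity-check that choice. A minimal sketch, assuming the scaled_features array from the program above (with only 7 samples here, k is kept small):

# Plot K-Means inertia for several k values and look for the "elbow"
inertias = []
k_values = range(1, 7)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    inertias.append(km.fit(scaled_features).inertia_)

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("Elbow Method for Choosing k")
plt.show()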
SAMPLE INPUT OUTPUT:

RESULT:

Thus, the program to implement the K-Means Clustering Algorithm on the given biological dataset and group the data points (species) based on their codon usage frequencies was executed successfully.
