
Madan Bhandari Memorial College

Department of Computer Science and Information Technology (B.Sc.CSIT)


Ninayak Nagar, New Baneshwor, Kathmandu

Subject: Data Warehousing & Data Mining        Code: CSC409

Name: Shristi Chapagain        Semester: 7th        Batch: 2077

Index: S.N. | Name of Practical | DOS | Remarks (Grade) | Lecturer's Signature
Name: Shristi Chapagain Date: 2081/06/04

Lab 1: Data Cleaning

SOURCE CODE:
import pandas as pd

# Load the dataset from the CSV file


file_path = "DataCleaning_Test.csv" # Replace with your CSV file path
df = pd.read_csv(file_path)
# --- Data Cleaning ---

# 1. Standardize date format


# Convert 'Date' column to datetime, handle errors, and reformat
df['Date'] = pd.to_datetime(df['Date'], errors='coerce', format='%Y/%m/%d')

# 2. Remove duplicate rows


df.drop_duplicates(inplace=True)

# Output screenshot: data after removing duplicates
# 3. Handle missing or invalid values in numeric columns
# Convert numeric columns to proper types and replace invalid entries with NaN
numeric_columns = ['Duration', 'Pulse', 'Maxpulse', 'Calories']
for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Find rows with NaN values (invalid data)


rows_with_nan = df[df.isna().any(axis=1)]

# Display rows that contain NaN values


print("Rows with NaN (invalid) values after conversion:")
print(rows_with_nan)

# 4. Drop rows with missing or invalid data in key columns


# Key columns: 'Date', 'Duration', 'Pulse', 'Maxpulse'
df.dropna(subset=['Date', 'Duration', 'Pulse', 'Maxpulse'], inplace=True)

# Output screenshots: data before and after dropping rows with missing key values

# 5. Handle missing 'Calories' values


# Option 1: Replace NaN with the column mean
df['Calories'] = df['Calories'].fillna(df['Calories'].mean())
# 6. Standardize column order and names
df = df[['Duration', 'Date', 'Pulse', 'Maxpulse', 'Calories']]
df.columns = ['Duration (mins)', 'Date', 'Pulse', 'Max Pulse', 'Calories']

# 7. Verify logical consistency


# Ensure 'Max Pulse' is greater than or equal to 'Pulse'
df = df[df['Max Pulse'] >= df['Pulse'] * 0.9] # Allow up to 10% deviation

# Ensure all durations are positive


df = df[df['Duration (mins)'] > 0]
# --- Save and Display Cleaned Data ---
# Save the cleaned dataset to a new CSV file
cleaned_file_path = "cleaned_dataset.csv"
df.to_csv(cleaned_file_path, index=False)

# Display a summary of the cleaned dataset


print("Cleaned Data Summary:")
print(df.describe())

print("\nCleaned Dataset Head:")


print(df.head())
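
Note on step 5: mean imputation is only one option. A minimal alternative sketch, reusing the same df as above, fills 'Calories' with the median (less sensitive to outliers) and then verifies that no missing values remain:

# Alternative to step 5 (sketch): median imputation instead of mean imputation
df['Calories'] = df['Calories'].fillna(df['Calories'].median())

# Verify that the cleaned dataset has no remaining missing values
print(df.isna().sum())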
Name: Shristi Chapagain Date: 2081/06/04

Lab 2: Apriori Algorithm

SOURCE CODE:

import pandas as pd
import streamlit as st
from mlxtend.frequent_patterns import apriori, association_rules

# Sample transaction data in a DataFrame


dataset = [['A', 'B', 'C'],
['A', 'C'],
['A', 'D'],
['B', 'E', 'F']]

# Transforming the data into a DataFrame suitable for the apriori function
df = pd.DataFrame(dataset)

# Transforming data into 1s and 0s for presence of items in each transaction


df = df.apply(lambda x: pd.Series(1, index=x.dropna().values), axis=1).fillna(0)

# Applying the Apriori algorithm


frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Generating the association rules


rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.5)

# Displaying the rules using Streamlit


st.write("Rules:")
pd.set_option('display.max_columns', None) ##to display all columns
st.write(rules)

# Printing the frequent itemsets and association rules in the console


print(frequent_itemsets)
print(rules)
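
For these four transactions the supports can be checked by hand: A appears in 3 of 4 transactions (0.75), B in 2 of 4 (0.50), C in 2 of 4 (0.50), and {A, C} in 2 of 4 (0.50), so with min_support=0.5 those are the expected frequent itemsets. A minimal sketch of the same check in code, reusing the one-hot df built above:

# Manual support check (sketch): support = fraction of transactions containing the itemset
print("support(A)  =", df['A'].mean())              # expected 0.75
print("support(B)  =", df['B'].mean())              # expected 0.50
print("support(C)  =", df['C'].mean())              # expected 0.50
print("support(AC) =", (df['A'] * df['C']).mean())  # expected 0.50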
Output:
Name: Shristi Chapagain        Date: 2081/06/04

Lab 3: Finding Frequent Itemsets and Association Rules with FP Growth Algorithm

SOURCE CODE:

import pandas as pd
import streamlit as st
from mlxtend.frequent_patterns import fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Sample transaction data


dataset = [['A', 'B', 'C'],
['A', 'C'],
['A', 'D'],
['B', 'E', 'F']]

# Transform the dataset into a one-hot encoded DataFrame


te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Display the one-hot encoded DataFrame


st.write("One-Hot Encoded DataFrame:")
st.dataframe(df)

# Applying the FP-Growth algorithm

frequent_itemsets = fpgrowth(df, min_support=0.5, use_colnames=True)

# Display frequent itemsets


st.write("Frequent Itemsets:")
st.dataframe(frequent_itemsets)

# Generating association rules


rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.5,
num_itemsets=len(frequent_itemsets))

# Display association rules


st.write("Association Rules:")
st.dataframe(rules)

# Debug: Print results to console


print("Frequent Itemsets:")
print(frequent_itemsets)

print("\nAssociation Rules:")
print(rules)
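
Apriori and FP-Growth are different search strategies over the same support definition, so for the same data and min_support they should return identical frequent itemsets. A quick cross-check sketch, assuming the one-hot df built above:

# Cross-check (sketch): FP-Growth and Apriori should agree on the frequent itemsets
from mlxtend.frequent_patterns import apriori, fpgrowth
ap = apriori(df, min_support=0.5, use_colnames=True)
fp = fpgrowth(df, min_support=0.5, use_colnames=True)
print(set(map(frozenset, ap['itemsets'])) == set(map(frozenset, fp['itemsets'])))  # expected: True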
OUTPUT:
Name: Shristi Chapagain        Date: 2081/06/10

Lab 4: Creating ID3 algorithm

SOURCE CODE:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from math import log2
from collections import Counter
from pprint import pprint

# Load the dataset


df = pd.read_csv(r'C:\Users\user\OneDrive\Desktop\data\Lab 6\ID3_golf_dataset.csv')
print(df)

# Identify the target attribute (last column in this case)


t = df.keys()[-1]
print("Target Attribute is ➡", t)

# Get the attribute names from the input dataset


attribute_names = list(df.keys())
attribute_names.remove(t) # Remove the target attribute
print("Predicting Attributes ➡", attribute_names)

# Function to calculate the entropy of probabilities


def entropy(probs):
    return sum([-p * log2(p) for p in probs if p > 0])  # Avoid log(0)

# Function to calculate the entropy of a list


def entropy_of_list(ls, value):
    total_instances = len(ls)
    print(f"\nTotal instances for '{value}' ➡ {total_instances}")
    cnt = Counter(ls)
    print("Class count:", dict(cnt))

    probs = [x / total_instances for x in cnt.values()]
    print(f"Probabilities ➡ {probs}")
    return entropy(probs)

# Function to calculate information gain


def information_gain(df, split_attribute, target_attribute, parent_name):
    print(f"\nCalculating Information Gain for ➡ {split_attribute}")
    df_split = df.groupby(split_attribute)
    total_instances = len(df)

    weighted_entropy = 0
    for attr_val, subset in df_split:
        subset_entropy = entropy_of_list(subset[target_attribute], attr_val)
        weighted_entropy += (len(subset) / total_instances) * subset_entropy

    parent_entropy = entropy_of_list(df[target_attribute], parent_name)
    info_gain = parent_entropy - weighted_entropy
    print(f"Information Gain for {split_attribute} ➡ {info_gain}")
    return info_gain

# ID3 Algorithm Implementation


def id3(df, target_attribute, attribute_names, default_class=None):
    cnt = Counter(df[target_attribute])

    # Check if the dataset is homogeneous
    if len(cnt) == 1:
        return next(iter(cnt))

    # If the dataset is empty or no attributes are left, return the default class
    if df.empty or not attribute_names:
        return default_class

    # Default class for the next call
    default_class = max(cnt, key=cnt.get)

    # Calculate information gain for all attributes
    gains = {attr: information_gain(df, attr, target_attribute, 'Dataset') for attr in attribute_names}
    best_attr = max(gains, key=gains.get)

    # Create the decision tree node
    tree = {best_attr: {}}
    remaining_attributes = [attr for attr in attribute_names if attr != best_attr]

    # Recursively create subtrees
    for attr_val, subset in df.groupby(best_attr):
        subtree = id3(subset, target_attribute, remaining_attributes, default_class)
        tree[best_attr][attr_val] = subtree

    return tree

# Calculate the initial entropy for the dataset


print("\nEntropy calculation for the target attribute:")
total_entropy = entropy_of_list(df[t], 'Dataset')
print(f"Total Entropy ➡ {total_entropy}")

# Build the decision tree


tree = id3(df, t, attribute_names)
print("\nGenerated Decision Tree:")
pprint(tree)

# Extracting details from the decision tree


root_attribute = next(iter(tree))
print(f"\nRoot Attribute ➡ {root_attribute}")
print(f"Tree Keys ➡ {tree[root_attribute].keys()}")

OUTPUT:

Target Attribute is ➡ PlayGolf

Predicting Attributes ➡ ['Outlook', 'Temperature', 'Humidity', 'Windy']

Entropy calculation for input dataset:

Number of Instances of the Current Sub-Class is 14.0

Classes ➡ 'p' = Yes, 'n' = No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429

Probabilities of Class 'n' = No ➡ 0.35714285714285715

Total Entropy(S) of PlayGolf Dataset ➡ 0.9402859586706309

----- Information Gain Calculation of Outlook -----

Total no of instances/records associated with 'Overcast' is ➡ 4

Target attribute class count (Yes/No):

{
"Yes":4
}
Classes ➡ Yes / Yes

Probabilities of Class 'p' = Yes ➡ 1.0

Probabilities of Class 'n' = Yes ➡ 1.0

Total no of instances/records associated with 'Rainy' is ➡ 5

Target attribute class count (Yes/No):

{
"No":3
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Total no of instances/records associated with 'Sunny' is ➡ 5

Target attribute class count (Yes/No):

{
"Yes":3
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Total no of instances/records associated with 'S' is ➡ 14


Target attribute class count (Yes/No):

{
"No":5
"Yes":9
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429

Probabilities of Class 'n' = No ➡ 0.35714285714285715

Information gain of “Outlook” is ➡ 0.2467498197744391

----- Information Gain Calculation of Temperature -----

Total no of instances/records associated with 'Cool' is ➡ 4

Target attribute class count (Yes/No):

{
"Yes":3
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.75

Probabilities of Class 'n' = No ➡ 0.25

Total no of instances/records associated with 'Hot' is ➡ 4

Target attribute class count (Yes/No):

{
"No":2
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'Mild' is ➡ 6

Target attribute class count (Yes/No):

{
"Yes":4
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6666666666666666

Probabilities of Class 'n' = No ➡ 0.3333333333333333

Total no of instances/records associated with 'S' is ➡ 14

Target attribute class count (Yes/No):

{
"No":5
"Yes":9
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429


Probabilities of Class 'n' = No ➡ 0.35714285714285715

Information gain of “Temperature” is ➡ 0.029222565658954647

----- Information Gain Calculation of Humidity -----

Total no of instances/records associated with 'High' is ➡ 7

Target attribute class count (Yes/No):

{
"No":4
"Yes":3
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5714285714285714

Probabilities of Class 'n' = No ➡ 0.42857142857142855

Total no of instances/records associated with 'Normal' is ➡ 7

Target attribute class count (Yes/No):

{
"Yes":6
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.8571428571428571

Probabilities of Class 'n' = No ➡ 0.14285714285714285

Total no of instances/records associated with 'S' is ➡ 14

Target attribute class count (Yes/No):

{
"No":5
"Yes":9
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429

Probabilities of Class 'n' = No ➡ 0.35714285714285715

Information gain of “Humidity” is ➡ 0.15183550136234136

----- Information Gain Calculation of Windy -----

Total no of instances/records associated with 'False' is ➡ 8

Target attribute class count (Yes/No):

{
"No":2
"Yes":6
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.75

Probabilities of Class 'n' = No ➡ 0.25

Total no of instances/records associated with 'True' is ➡ 6


Target attribute class count (Yes/No):

{
"No":3
"Yes":3
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'S' is ➡ 14

Target attribute class count (Yes/No):

{
"No":5
"Yes":9
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429

Probabilities of Class 'n' = No ➡ 0.35714285714285715

Information gain of “Windy” is ➡ 0.04812703040826927

Attribute with the maximum gain is ➡ Outlook

----- Information Gain Calculation of Temperature -----

Total no of instances/records associated with 'Cool' is ➡ 1

Target attribute class count (Yes/No):

{
"Yes":1
}
Classes ➡ Yes / Yes

Probabilities of Class 'p' = Yes ➡ 1.0

Probabilities of Class 'n' = Yes ➡ 1.0

Total no of instances/records associated with 'Hot' is ➡ 2

Target attribute class count (Yes/No):

{
"No":2
}
Classes ➡ No / No

Probabilities of Class 'p' = No ➡ 1.0

Probabilities of Class 'n' = No ➡ 1.0

Total no of instances/records associated with 'Mild' is ➡ 2

Target attribute class count (Yes/No):

{
"No":1
"Yes":1
}
Classes ➡ Yes / No
Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'S-Rainy' is ➡ 5

Target attribute class count (Yes/No):

{
"No":3
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Temperature” is ➡ 0.5709505944546686

----- Information Gain Calculation of Humidity -----

Total no of instances/records associated with 'High' is ➡ 3

Target attribute class count (Yes/No):

{
"No":3
}
Classes ➡ No / No

Probabilities of Class 'p' = No ➡ 1.0

Probabilities of Class 'n' = No ➡ 1.0

Total no of instances/records associated with 'Normal' is ➡ 2

Target attribute class count (Yes/No):

{
"Yes":2
}
Classes ➡ Yes / Yes

Probabilities of Class 'p' = Yes ➡ 1.0

Probabilities of Class 'n' = Yes ➡ 1.0

Total no of instances/records associated with 'S-Rainy' is ➡ 5

Target attribute class count (Yes/No):

{
"No":3
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Humidity” is ➡ 0.9709505944546686

----- Information Gain Calculation of Windy -----

Total no of instances/records associated with 'False' is ➡ 3


Target attribute class count (Yes/No):

{
"No":2
"Yes":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6666666666666666

Probabilities of Class 'n' = No ➡ 0.3333333333333333

Total no of instances/records associated with 'True' is ➡ 2

Target attribute class count (Yes/No):

{
"No":1
"Yes":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'S-Rainy' is ➡ 5

Target attribute class count (Yes/No):

{
"No":3
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Windy” is ➡ 0.01997309402197489

Attribute with the maximum gain is ➡ Humidity

----- Information Gain Calculation of Temperature -----

Total no of instances/records associated with 'Cool' is ➡ 2

Target attribute class count (Yes/No):

{
"Yes":1
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'Mild' is ➡ 3

Target attribute class count (Yes/No):

{
"Yes":2
"No":1
}
Classes ➡ Yes / No
Probabilities of Class 'p' = Yes ➡ 0.6666666666666666

Probabilities of Class 'n' = No ➡ 0.3333333333333333

Total no of instances/records associated with 'S-Sunny' is ➡ 5

Target attribute class count (Yes/No):

{
"Yes":3
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Temperature” is ➡ 0.01997309402197489

----- Information Gain Calculation of Humidity -----

Total no of instances/records associated with 'High' is ➡ 2

Target attribute class count (Yes/No):

{
"Yes":1
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'Normal' is ➡ 3

Target attribute class count (Yes/No):

{
"Yes":2
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6666666666666666

Probabilities of Class 'n' = No ➡ 0.3333333333333333

Total no of instances/records associated with 'S-Sunny' is ➡ 5

Target attribute class count (Yes/No):

{
"Yes":3
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Humidity” is ➡ 0.01997309402197489

----- Information Gain Calculation of Windy -----


Total no of instances/records associated with 'False' is ➡ 3

Target attribute class count (Yes/No):

{
"Yes":3
}
Classes ➡ Yes / Yes

Probabilities of Class 'p' = Yes ➡ 1.0

Probabilities of Class 'n' = Yes ➡ 1.0

Total no of instances/records associated with 'True' is ➡ 2

Target attribute class count (Yes/No):

{
"No":2
}
Classes ➡ No / No

Probabilities of Class 'p' = No ➡ 1.0

Probabilities of Class 'n' = No ➡ 1.0

Total no of instances/records associated with 'S-Sunny' is ➡ 5

Target attribute class count (Yes/No):

{
"Yes":3
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Windy” is ➡ 0.9709505944546686

Attribute with the maximum gain is ➡ Windy

The Resultant Decision Tree is: ⤵

Best Attribute ➡ Outlook

Tree Keys ➡ dict_keys(['Overcast', 'Rainy', 'Sunny'])
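
Reading the printed gains together (Outlook at the root, Humidity chosen on the Rainy branch, Windy chosen on the Sunny branch), the generated tree corresponds to a nested dictionary of roughly this shape. This is a sketch reconstructed from the splits above, not a verbatim copy of the pprint output, and the Windy keys may be booleans or strings depending on how the CSV stores them:

{'Outlook': {'Overcast': 'Yes',
             'Rainy': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
             'Sunny': {'Windy': {'False': 'Yes', 'True': 'No'}}}}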


Name: Shristi Chapagain        Date: 2081/06/15

Lab 5: Building a Decision Tree using ID3 Algorithm for Classification.

SOURCE CODE:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
import matplotlib

# Set the backend for non-interactive environments


matplotlib.use('Agg')

# Load the dataset


df = pd.read_csv(r'C:\Users\user\OneDrive\Desktop\data\Lab 6\ID3_golf_dataset.csv')
print("Dataset Loaded Successfully!")
print(df)

####################################################################################

# Converting categorical variables into dummy/indicator variables


df_getdummy = pd.get_dummies(data=df, columns=['Temperature', 'Humidity', 'Outlook', 'Windy'])
print("\nDataset after converting categorical variables into dummies:")
print(df_getdummy)

# Separating features (X) and target (y)


X = df_getdummy.drop('PlayGolf', axis=1)
y = df_getdummy['PlayGolf']

# Splitting the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
print("\nTraining and testing sets created successfully!")

# Importing and training the Decision Tree Classifier


dtree = DecisionTreeClassifier(criterion='entropy', max_depth=8)
dtree.fit(X_train, y_train) # Use training data for fitting
print("\nDecision Tree Classifier trained successfully!")

# Making predictions on the test set


y_pred = dtree.predict(X_test)

# Evaluating the model


accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy on Test Set: {accuracy * 100:.2f}%")
# Visualizing the decision tree
print("\nVisualizing the Decision Tree...")
fig = plt.figure(figsize=(16, 12))
plot_tree(dtree, feature_names=X.columns, fontsize=8, filled=True, class_names=['Not_Play', 'Play'])
plt.title("Decision Tree Visualization")
plt.savefig("decision_tree_visualization.png") # Save the plot as an image
plt.close() # Close the plot to release resources

print("\nDecision Tree visualization saved as 'decision_tree_visualization.png'.")

OUTPUT:

Terminal
Name: Shristi Chapagain        Date: 2081/06/10

Lab 6: Naive Bayes Classifier with Synthetic Dataset.

SOURCE

# Importing necessary libraries


from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score
import matplotlib

# Set the backend for non-interactive environments


matplotlib.use('Agg')

# Step 1: Generate a synthetic dataset


X, y = make_classification(
n_features=6, # Number of features
n_classes=3, # Number of classes
n_samples=800, # Number of samples
n_informative=2, # Number of informative features
random_state=1, # Seed for reproducibility
n_clusters_per_class=1 # Number of clusters per class
)

# Visualizing the first two features with labels


plt.scatter(X[:, 0], X[:, 1], c=y, marker="*", cmap='viridis')
plt.title("Scatter Plot of the First Two Features")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.colorbar(label="Class")
plt.savefig("scatter_plot.png") # Save the scatter plot
plt.close() # Close the figure to release resources

# Step 2: Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=125)
print(f"Training Data Size: {len(X_train)}, Testing Data Size: {len(X_test)}")

# Step 3: Build and train a Gaussian Naive Bayes classifier


model = GaussianNB()
model.fit(X_train, y_train)

# Step 4: Make predictions


sample_index = 3 # Index of the sample in the test set to predict
predicted = model.predict([X_test[sample_index]])
print("\nActual Value:", y_test[sample_index])
print("Predicted Value:", predicted[0])

# Step 5: Evaluate the model


y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")

print("\nModel Evaluation:")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"F1 Score: {f1:.2f}")

# Step 6: Display the confusion matrix


labels = [0, 1, 2] # Class labels (based on generated data)
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap='Blues', colorbar=True)
plt.title("Confusion Matrix")
plt.savefig("confusion_matrix.png") # Save the confusion matrix
plt.close()

print("\nPlots saved as 'scatter_plot.png' and 'confusion_matrix.png'.")

OUTPUT:

Terminal
Name: Shristi Chapagain        Date: 2081/06/13

Lab 7: K-means++ Clustering.

SOURCE

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.cluster import KMeans

# Use Agg backend for non-interactive plotting


matplotlib.use('Agg')

# Step 1: Generate random data


data = np.random.rand(1200, 2) * 150 # 1200 points scaled to a range of 0-150

# Step 2: Perform K-Means clustering


km = KMeans(n_clusters=4, init="k-means++", random_state=42)
km.fit(data)

# Step 3: Retrieve cluster centers and labels


centers = km.cluster_centers_
labels = km.labels_

# Step 4: Print cluster centers


print("Cluster centers:")
for i, center in enumerate(centers):
    print(f"Cluster {i+1}: {center}")

# Step 5: Visualize the clusters


colors = ["r", "g", "b", "y"]
markers = ["+", "x", "*", "."]

plt.figure(figsize=(10, 8))
for i in range(len(data)):
    plt.scatter(
        data[i][0],
        data[i][1],
        color=colors[labels[i]],
        marker=markers[labels[i]],
        s=30,  # Size of each point
    )

# Plot the cluster centers


plt.scatter(
centers[:, 0],
centers[:, 1],
marker="s",
s=200,
c="black",
edgecolors="white",
linewidths=2,
label="Cluster Centers",
)

plt.title("K-Means Clustering Visualization")


plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid()

# Step 6: Save the plot as an image


plt.savefig("kmeans_clustering_plot.png", dpi=300) # Save the plot as a file

# Inform the user


print("The plot has been saved as 'kmeans_clustering_plot.png'.")

OUTPUT:

Terminal
Name: Shristi Chapagain        Date: 2081/06/20

Lab 8: K-medoids Clustering.

SOURCE

from sklearn.datasets import load_iris


from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = load_iris()
data = iris.data
target = iris.target
print("First 5 rows of features:", data[:5])
print("First 5 target values:", target[:5])

# Standardize the features


scaler = StandardScaler().fit(data)
scaled_data = scaler.transform(data)

# Initialize KMeans with 3 clusters


kmeans = KMeans(n_clusters=3, random_state=42)
# Fit the model and predict clusters
predicted_labels = kmeans.fit_predict(scaled_data)

# Visualize the clusters in 3D


fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection="3d")
cluster_colors = ["g", "r", "b"]
cluster_markers = ["+", "x", "*"]
for i in range(len(scaled_data)):
    ax.scatter(
        scaled_data[i][0],
        scaled_data[i][1],
        scaled_data[i][2],
        color=cluster_colors[predicted_labels[i]],
        marker=cluster_markers[predicted_labels[i]]
    )

# Save the plot to a file (no interactive display)


plt.savefig("iris_clusters_plot.png")

# Calculate clustering metrics


rand_index = metrics.rand_score(target, predicted_labels)
print("Rand Index:", rand_index)
homogeneity = metrics.homogeneity_score(target, predicted_labels)
print("Homogeneity Score:", homogeneity)
completeness = metrics.completeness_score(target, predicted_labels)
print("Completeness Score:", completeness)
silhouette = metrics.silhouette_score(scaled_data, predicted_labels, metric='euclidean')
print("Silhouette Coefficient:", silhouette)

OUTPUT:

Terminal
Name: Shristi Chapagain        Date: 2081/06/18

Lab 9: Data CUBE.

SOURCE

-- 1. Select all data from the employees table


SELECT * FROM hr.employees;
-- 2. Create the dept_cube table
CREATE TABLE dept_cube AS
SELECT department_id,
COUNT(*) AS noofemp,
SUM(salary) AS sumsal
FROM hr.employees
GROUP BY department_id;

-- 4. Create the deptjob_cube table


CREATE TABLE deptjob_cube AS
SELECT department_id,
job_id,
COUNT(*) AS noofemp,
SUM(salary) AS sumsal
FROM hr.employees
GROUP BY department_id, job_id;
-- 6. Create the deptJobManager_cube table
CREATE TABLE deptJobManager_cube AS
SELECT department_id,
job_id,
manager_id,
COUNT(*) AS noofemp,
SUM(salary) AS sumsal
FROM hr.employees
GROUP BY department_id, job_id, manager_id;

-- 8. Aggregate data by department from the deptJobManager_cube table


SELECT department_id,
       SUM(noofemp) AS noofemp,
       SUM(sumsal) AS sumsal
FROM deptJobManager_cube
GROUP BY department_id;
