
Madan Bhandari Memorial College

Department of Computer Science and Information Technology (B.Sc.CSIT)


Ninayak Nagar, New Baneshwor, Kathmandu

Subject: Data Warehousing & Data Mining        Code: CSC409

Name: Shristi Chapagain        Semester: 7th        Batch: 2077

Index: S.N. | Name of Practical | DOS | Remarks (Grade) | Lecturer's Signature
Name: Shristi Chapagain Date: 2081/06/04

Lab 1: Data Cleaning

SOURCE CODE:
import pandas as pd

# Load the dataset from the CSV file


file_path = "DataCleaning_Test.csv" # Replace with your CSV file path
df = pd.read_csv(file_path)
# --- Data Cleaning ---

# 1. Standardize date format


# Convert 'Date' column to datetime, handle errors, and reformat
df['Date'] = pd.to_datetime(df['Date'], errors='coerce', format='%Y/%m/%d')

# 2. Remove duplicate rows


df.drop_duplicates(inplace=True)

# Output screenshot: data after removing duplicates
# 3. Handle missing or invalid values in numeric columns
# Convert numeric columns to proper types and replace invalid entries with NaN
numeric_columns = ['Duration', 'Pulse', 'Maxpulse', 'Calories']
for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Find rows with NaN values (invalid data)


rows_with_nan = df[df.isna().any(axis=1)]

# Display rows that contain NaN values


print("Rows with NaN (invalid) values after conversion:")
print(rows_with_nan)

# 4. Drop rows with missing or invalid data in key columns


# Key columns: 'Date', 'Duration', 'Pulse', 'Maxpulse'
df.dropna(subset=['Date', 'Duration', 'Pulse', 'Maxpulse'], inplace=True)

# Output screenshots: data before and after dropping rows with missing key values

# 5. Handle missing 'Calories' values


# Option 1: Replace NaN with the column mean
df['Calories'] = df['Calories'].fillna(df['Calories'].mean())
# 6. Standardize column order and names
df = df[['Duration', 'Date', 'Pulse', 'Maxpulse', 'Calories']]
df.columns = ['Duration (mins)', 'Date', 'Pulse', 'Max Pulse', 'Calories']

# 7. Verify logical consistency


# Ensure 'Max Pulse' is greater than or equal to 'Pulse'
df = df[df['Max Pulse'] >= df['Pulse'] * 0.9] # Allow up to 10% deviation

# Ensure all durations are positive


df = df[df['Duration (mins)'] > 0]
# --- Save and Display Cleaned Data ---
# Save the cleaned dataset to a new CSV file
cleaned_file_path = "cleaned_dataset.csv"
df.to_csv(cleaned_file_path, index=False)

# Display a summary of the cleaned dataset


print("Cleaned Data Summary:")
print(df.describe())

print("\nCleaned Dataset Head:")


print(df.head())
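
Note on step 5: mean imputation is only one option. A minimal alternative sketch, reusing the same df as above, fills 'Calories' with the median (less sensitive to outliers) and then verifies that no missing values remain:

# Alternative to step 5 (sketch): median imputation instead of mean imputation
df['Calories'] = df['Calories'].fillna(df['Calories'].median())

# Verify that the cleaned dataset has no remaining missing values
print(df.isna().sum())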
Name: Shristi Chapagain Date: 2081/06/04

Lab 2: Apriori Algorithm

SOURCE CODE:

import pandas as pd
import streamlit as st
from mlxtend.frequent_patterns import apriori, association_rules

# Sample transaction data in a DataFrame


dataset = [['A', 'B', 'C'],
['A', 'C'],
['A', 'D'],
['B', 'E', 'F']]

# Transforming the data into a DataFrame suitable for the apriori function
df = pd.DataFrame(dataset)

# Transforming data into 1s and 0s for presence of items in each transaction


df = df.apply(lambda x: pd.Series(1, index=x.dropna().values), axis=1).fillna(0)

# Applying the Apriori algorithm


frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Generating the association rules


rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.5)

# Displaying the rules using Streamlit


st.write("Rules:")
pd.set_option('display.max_columns', None) ##to display all columns
st.write(rules)

# Printing the frequent itemsets and association rules in the console


print(frequent_itemsets)
print(rules)
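
For these four transactions the supports can be checked by hand: A appears in 3 of 4 transactions (0.75), B in 2 of 4 (0.50), C in 2 of 4 (0.50), and {A, C} in 2 of 4 (0.50), so with min_support=0.5 those are the expected frequent itemsets. A minimal sketch of the same check in code, reusing the one-hot df built above:

# Manual support check (sketch): support = fraction of transactions containing the itemset
print("support(A)  =", df['A'].mean())              # expected 0.75
print("support(B)  =", df['B'].mean())              # expected 0.50
print("support(C)  =", df['C'].mean())              # expected 0.50
print("support(AC) =", (df['A'] * df['C']).mean())  # expected 0.50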
Output:
Name: Shristi Chapagain        Date: 2081/06/04

Lab 3: Finding Frequent Itemsets and Association Rules with FP Growth Algorithm

SOURCE CODE:

import pandas as pd
import streamlit as st
from mlxtend.frequent_patterns import fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Sample transaction data


dataset = [['A', 'B', 'C'],
['A', 'C'],
['A', 'D'],
['B', 'E', 'F']]

# Transform the dataset into a one-hot encoded DataFrame


te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Display the one-hot encoded DataFrame


st.write("One-Hot Encoded DataFrame:")
st.dataframe(df)

# Applying the FP-Growth algorithm

frequent_itemsets = fpgrowth(df, min_support=0.5, use_colnames=True)

# Display frequent itemsets


st.write("Frequent Itemsets:")
st.dataframe(frequent_itemsets)

# Generating association rules


rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.5,
num_itemsets=len(frequent_itemsets))

# Display association rules


st.write("Association Rules:")
st.dataframe(rules)

# Debug: Print results to console


print("Frequent Itemsets:")
print(frequent_itemsets)

print("\nAssociation Rules:")
print(rules)
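
Apriori and FP-Growth are different search strategies over the same support definition, so for the same data and min_support they should return identical frequent itemsets. A quick cross-check sketch, assuming the one-hot df built above:

# Cross-check (sketch): FP-Growth and Apriori should agree on the frequent itemsets
from mlxtend.frequent_patterns import apriori, fpgrowth
ap = apriori(df, min_support=0.5, use_colnames=True)
fp = fpgrowth(df, min_support=0.5, use_colnames=True)
print(set(map(frozenset, ap['itemsets'])) == set(map(frozenset, fp['itemsets'])))  # expected: True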
OUTPUT:
Name: Shristi Chapagain        Date: 2081/06/10

Lab 4: Creating ID3 algorithm

SOURCE CODE:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from math import log2
from collections import Counter
from pprint import pprint

# Load the dataset


df = pd.read_csv(r'C:\Users\user\OneDrive\Desktop\data\Lab 6\ID3_golf_dataset.csv')
print(df)

# Identify the target attribute (last column in this case)


t = df.keys()[-1]
print("Target Attribute is ➡", t)

# Get the attribute names from the input dataset


attribute_names = list(df.keys())
attribute_names.remove(t) # Remove the target attribute
print("Predicting Attributes ➡", attribute_names)

# Function to calculate the entropy of probabilities


def entropy(probs):
    return sum([-p * log2(p) for p in probs if p > 0])  # Avoid log(0)

# Function to calculate the entropy of a list


def entropy_of_list(ls, value):
    total_instances = len(ls)
    print(f"\nTotal instances for '{value}' ➡ {total_instances}")
    cnt = Counter(ls)
    print("Class count:", dict(cnt))

    probs = [x / total_instances for x in cnt.values()]
    print(f"Probabilities ➡ {probs}")
    return entropy(probs)

# Function to calculate information gain


def information_gain(df, split_attribute, target_attribute, parent_name):
    print(f"\nCalculating Information Gain for ➡ {split_attribute}")
    df_split = df.groupby(split_attribute)
    total_instances = len(df)

    weighted_entropy = 0
    for attr_val, subset in df_split:
        subset_entropy = entropy_of_list(subset[target_attribute], attr_val)
        weighted_entropy += (len(subset) / total_instances) * subset_entropy

    parent_entropy = entropy_of_list(df[target_attribute], parent_name)
    info_gain = parent_entropy - weighted_entropy
    print(f"Information Gain for {split_attribute} ➡ {info_gain}")
    return info_gain

# ID3 Algorithm Implementation


def id3(df, target_attribute, attribute_names, default_class=None):
    cnt = Counter(df[target_attribute])

    # Check if the dataset is homogeneous
    if len(cnt) == 1:
        return next(iter(cnt))

    # If the dataset is empty or no attributes are left, return the default class
    if df.empty or not attribute_names:
        return default_class

    # Default class for the next call
    default_class = max(cnt, key=cnt.get)

    # Calculate information gain for all attributes
    gains = {attr: information_gain(df, attr, target_attribute, 'Dataset') for attr in attribute_names}
    best_attr = max(gains, key=gains.get)

    # Create the decision tree node
    tree = {best_attr: {}}
    remaining_attributes = [attr for attr in attribute_names if attr != best_attr]

    # Recursively create subtrees
    for attr_val, subset in df.groupby(best_attr):
        subtree = id3(subset, target_attribute, remaining_attributes, default_class)
        tree[best_attr][attr_val] = subtree

    return tree

# Calculate the initial entropy for the dataset


print("\nEntropy calculation for the target attribute:")
total_entropy = entropy_of_list(df[t], 'Dataset')
print(f"Total Entropy ➡ {total_entropy}")

# Build the decision tree


tree = id3(df, t, attribute_names)
print("\nGenerated Decision Tree:")
pprint(tree)

# Extracting details from the decision tree


root_attribute = next(iter(tree))
print(f"\nRoot Attribute ➡ {root_attribute}")
print(f"Tree Keys ➡ {tree[root_attribute].keys()}")

OUTPUT:

Target Attribute is ➡ PlayGolf

Predicting Attributes ➡ ['Outlook', 'Temperature', 'Humidity', 'Windy']

Entropy calculation for input dataset:

Number of Instances of the Current Sub-Class is 14.0

Classes ➡ 'p' = Yes, 'n' = No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429

Probabilities of Class 'n' = No ➡ 0.35714285714285715

Total Entropy(S) of PlayGolf Dataset ➡ 0.9402859586706309

----- Information Gain Calculation of Outlook -----

Total no of instances/records associated with 'Overcast' is ➡ 4

Target attribute class count (Yes/No):

{
"Yes":4
}
Classes ➡ Yes / Yes

Probabilities of Class 'p' = Yes ➡ 1.0

Probabilities of Class 'n' = Yes ➡ 1.0

Total no of instances/records associated with 'Rainy' is ➡ 5

Target attribute class count (Yes/No):

{
"No":3
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Total no of instances/records associated with 'Sunny' is ➡ 5

Target attribute class count (Yes/No):

{
"Yes":3
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Total no of instances/records associated with 'S' is ➡ 14


Target attribute class count (Yes/No):

{
"No":5
"Yes":9
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429

Probabilities of Class 'n' = No ➡ 0.35714285714285715

Information gain of “Outlook” is ➡ 0.2467498197744391

----- Information Gain Calculation of Temperature -----

Total no of instances/records associated with 'Cool' is ➡ 4

Target attribute class count (Yes/No):

{
"Yes":3
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.75

Probabilities of Class 'n' = No ➡ 0.25

Total no of instances/records associated with 'Hot' is ➡ 4

Target attribute class count (Yes/No):

{
"No":2
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'Mild' is ➡ 6

Target attribute class count (Yes/No):

{
"Yes":4
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6666666666666666

Probabilities of Class 'n' = No ➡ 0.3333333333333333

Total no of instances/records associated with 'S' is ➡ 14

Target attribute class count (Yes/No):

{
"No":5
"Yes":9
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429


Probabilities of Class 'n' = No ➡ 0.35714285714285715

Information gain of “Temperature” is ➡ 0.029222565658954647

----- Information Gain Calculation of Humidity -----

Total no of instances/records associated with 'High' is ➡ 7

Target attribute class count (Yes/No):

{
"No":4
"Yes":3
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5714285714285714

Probabilities of Class 'n' = No ➡ 0.42857142857142855

Total no of instances/records associated with 'Normal' is ➡ 7

Target attribute class count (Yes/No):

{
"Yes":6
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.8571428571428571

Probabilities of Class 'n' = No ➡ 0.14285714285714285

Total no of instances/records associated with 'S' is ➡ 14

Target attribute class count (Yes/No):

{
"No":5
"Yes":9
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429

Probabilities of Class 'n' = No ➡ 0.35714285714285715

Information gain of “Humidity” is ➡ 0.15183550136234136

----- Information Gain Calculation of Windy -----

Total no of instances/records associated with 'False' is ➡ 8

Target attribute class count (Yes/No):

{
"No":2
"Yes":6
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.75

Probabilities of Class 'n' = No ➡ 0.25

Total no of instances/records associated with 'True' is ➡ 6


Target attribute class count (Yes/No):

{
"No":3
"Yes":3
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'S' is ➡ 14

Target attribute class count (Yes/No):

{
"No":5
"Yes":9
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6428571428571429

Probabilities of Class 'n' = No ➡ 0.35714285714285715

Information gain of “Windy” is ➡ 0.04812703040826927

Attribute with the maximum gain is ➡ Outlook

----- Information Gain Calculation of Temperature -----

Total no of instances/records associated with 'Cool' is ➡ 1

Target attribute class count (Yes/No):

{
"Yes":1
}
Classes ➡ Yes / Yes

Probabilities of Class 'p' = Yes ➡ 1.0

Probabilities of Class 'n' = Yes ➡ 1.0

Total no of instances/records associated with 'Hot' is ➡ 2

Target attribute class count (Yes/No):

{
"No":2
}
Classes ➡ No / No

Probabilities of Class 'p' = No ➡ 1.0

Probabilities of Class 'n' = No ➡ 1.0

Total no of instances/records associated with 'Mild' is ➡ 2

Target attribute class count (Yes/No):

{
"No":1
"Yes":1
}
Classes ➡ Yes / No
Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'S-Rainy' is ➡ 5

Target attribute class count (Yes/No):

{
"No":3
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Temperature” is ➡ 0.5709505944546686

----- Information Gain Calculation of Humidity -----

Total no of instances/records associated with 'High' is ➡ 3

Target attribute class count (Yes/No):

{
"No":3
}
Classes ➡ No / No

Probabilities of Class 'p' = No ➡ 1.0

Probabilities of Class 'n' = No ➡ 1.0

Total no of instances/records associated with 'Normal' is ➡ 2

Target attribute class count (Yes/No):

{
"Yes":2
}
Classes ➡ Yes / Yes

Probabilities of Class 'p' = Yes ➡ 1.0

Probabilities of Class 'n' = Yes ➡ 1.0

Total no of instances/records associated with 'S-Rainy' is ➡ 5

Target attribute class count (Yes/No):

{
"No":3
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Humidity” is ➡ 0.9709505944546686

----- Information Gain Calculation of Windy -----

Total no of instances/records associated with 'False' is ➡ 3


Target attribute class count (Yes/No):

{
"No":2
"Yes":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6666666666666666

Probabilities of Class 'n' = No ➡ 0.3333333333333333

Total no of instances/records associated with 'True' is ➡ 2

Target attribute class count (Yes/No):

{
"No":1
"Yes":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'S-Rainy' is ➡ 5

Target attribute class count (Yes/No):

{
"No":3
"Yes":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Windy” is ➡ 0.01997309402197489

Attribute with the maximum gain is ➡ Humidity

----- Information Gain Calculation of Temperature -----

Total no of instances/records associated with 'Cool' is ➡ 2

Target attribute class count (Yes/No):

{
"Yes":1
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'Mild' is ➡ 3

Target attribute class count (Yes/No):

{
"Yes":2
"No":1
}
Classes ➡ Yes / No
Probabilities of Class 'p' = Yes ➡ 0.6666666666666666

Probabilities of Class 'n' = No ➡ 0.3333333333333333

Total no of instances/records associated with 'S-Sunny' is ➡ 5

Target attribute class count (Yes/No):

{
"Yes":3
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Temperature” is ➡ 0.01997309402197489

----- Information Gain Calculation of Humidity -----

Total no of instances/records associated with 'High' is ➡ 2

Target attribute class count (Yes/No):

{
"Yes":1
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.5

Probabilities of Class 'n' = No ➡ 0.5

Total no of instances/records associated with 'Normal' is ➡ 3

Target attribute class count (Yes/No):

{
"Yes":2
"No":1
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6666666666666666

Probabilities of Class 'n' = No ➡ 0.3333333333333333

Total no of instances/records associated with 'S-Sunny' is ➡ 5

Target attribute class count (Yes/No):

{
"Yes":3
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Humidity” is ➡ 0.01997309402197489

----- Information Gain Calculation of Windy -----


Total no of instances/records associated with 'False' is ➡ 3

Target attribute class count (Yes/No):

{
"Yes":3
}
Classes ➡ Yes / Yes

Probabilities of Class 'p' = Yes ➡ 1.0

Probabilities of Class 'n' = Yes ➡ 1.0

Total no of instances/records associated with 'True' is ➡ 2

Target attribute class count (Yes/No):

{
"No":2
}
Classes ➡ No / No

Probabilities of Class 'p' = No ➡ 1.0

Probabilities of Class 'n' = No ➡ 1.0

Total no of instances/records associated with 'S-Sunny' is ➡ 5

Target attribute class count (Yes/No):

{
"Yes":3
"No":2
}
Classes ➡ Yes / No

Probabilities of Class 'p' = Yes ➡ 0.6

Probabilities of Class 'n' = No ➡ 0.4

Information gain of “Windy” is ➡ 0.9709505944546686

Attribute with the maximum gain is ➡ Windy

The Resultant Decision Tree is: ⤵

Best Attribute ➡ Outlook

Tree Keys ➡ dict_keys(['Overcast', 'Rainy', 'Sunny'])
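
Reading the printed gains together (Outlook at the root, Humidity chosen on the Rainy branch, Windy chosen on the Sunny branch), the generated tree corresponds to a nested dictionary of roughly this shape. This is a sketch reconstructed from the splits above, not a verbatim copy of the pprint output, and the Windy keys may be booleans or strings depending on how the CSV stores them:

{'Outlook': {'Overcast': 'Yes',
             'Rainy': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
             'Sunny': {'Windy': {'False': 'Yes', 'True': 'No'}}}}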


Name: Shristi Chapagain        Date: 2081/06/15

Lab 5: Building a Decision Tree using ID3 Algorithm for Classification.

SOURCE CODE:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
import matplotlib

# Set the backend for non-interactive environments


matplotlib.use('Agg')

# Load the dataset


df = pd.read_csv(r'C:\Users\user\OneDrive\Desktop\data\Lab 6\ID3_golf_dataset.csv')
print("Dataset Loaded Successfully!")
print(df)

####################################################################################

# Converting categorical variables into dummy/indicator variables


df_getdummy = pd.get_dummies(data=df, columns=['Temperature', 'Humidity', 'Outlook', 'Windy'])
print("\nDataset after converting categorical variables into dummies:")
print(df_getdummy)

# Separating features (X) and target (y)


X = df_getdummy.drop('PlayGolf', axis=1)
y = df_getdummy['PlayGolf']

# Splitting the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
print("\nTraining and testing sets created successfully!")

# Importing and training the Decision Tree Classifier


dtree = DecisionTreeClassifier(criterion='entropy', max_depth=8)
dtree.fit(X_train, y_train) # Use training data for fitting
print("\nDecision Tree Classifier trained successfully!")

# Making predictions on the test set


y_pred = dtree.predict(X_test)

# Evaluating the model


accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy on Test Set: {accuracy * 100:.2f}%")
# Visualizing the decision tree
print("\nVisualizing the Decision Tree...")
fig = plt.figure(figsize=(16, 12))
plot_tree(dtree, feature_names=X.columns, fontsize=8, filled=True, class_names=['Not_Play', 'Play'])
plt.title("Decision Tree Visualization")
plt.savefig("decision_tree_visualization.png") # Save the plot as an image
plt.close() # Close the plot to release resources

print("\nDecision Tree visualization saved as 'decision_tree_visualization.png'.")

OUTPUT:

Terminal
Name: Shristi Chapagain        Date: 2081/06/10

Lab 6: Naive Bayes Classifier with Synthetic Dataset.

SOURCE

# Importing necessary libraries


from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score
import matplotlib

# Set the backend for non-interactive environments


matplotlib.use('Agg')

# Step 1: Generate a synthetic dataset


X, y = make_classification(
n_features=6, # Number of features
n_classes=3, # Number of classes
n_samples=800, # Number of samples
n_informative=2, # Number of informative features
random_state=1, # Seed for reproducibility
n_clusters_per_class=1 # Number of clusters per class
)

# Visualizing the first two features with labels


plt.scatter(X[:, 0], X[:, 1], c=y, marker="*", cmap='viridis')
plt.title("Scatter Plot of the First Two Features")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.colorbar(label="Class")
plt.savefig("scatter_plot.png") # Save the scatter plot
plt.close() # Close the figure to release resources

# Step 2: Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=125)
print(f"Training Data Size: {len(X_train)}, Testing Data Size: {len(X_test)}")

# Step 3: Build and train a Gaussian Naive Bayes classifier


model = GaussianNB()
model.fit(X_train, y_train)

# Step 4: Make predictions


sample_index = 3 # Index of the sample in the test set to predict
predicted = model.predict([X_test[sample_index]])
print("\nActual Value:", y_test[sample_index])
print("Predicted Value:", predicted[0])

# Step 5: Evaluate the model


y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")

print("\nModel Evaluation:")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"F1 Score: {f1:.2f}")

# Step 6: Display the confusion matrix


labels = [0, 1, 2] # Class labels (based on generated data)
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap='Blues', colorbar=True)
plt.title("Confusion Matrix")
plt.savefig("confusion_matrix.png") # Save the confusion matrix
plt.close()

print("\nPlots saved as 'scatter_plot.png' and 'confusion_matrix.png'.")

OUTPUT:

Terminal
Name: Shristi Chapagain        Date: 2081/06/13

Lab 7: K-means++ Clustering.

SOURCE

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.cluster import KMeans

# Use Agg backend for non-interactive plotting


matplotlib.use('Agg')

# Step 1: Generate random data


data = np.random.rand(1200, 2) * 150 # 1200 points scaled to a range of 0-150

# Step 2: Perform K-Means clustering


km = KMeans(n_clusters=4, init="k-means++", random_state=42)
km.fit(data)

# Step 3: Retrieve cluster centers and labels


centers = km.cluster_centers_
labels = km.labels_

# Step 4: Print cluster centers


print("Cluster centers:")
for i, center in enumerate(centers):
    print(f"Cluster {i+1}: {center}")

# Step 5: Visualize the clusters


colors = ["r", "g", "b", "y"]
markers = ["+", "x", "*", "."]

plt.figure(figsize=(10, 8))
for i in range(len(data)):
    plt.scatter(
        data[i][0],
        data[i][1],
        color=colors[labels[i]],
        marker=markers[labels[i]],
        s=30,  # Size of each point
    )

# Plot the cluster centers


plt.scatter(
centers[:, 0],
centers[:, 1],
marker="s",
s=200,
c="black",
edgecolors="white",
linewidths=2,
label="Cluster Centers",
)

plt.title("K-Means Clustering Visualization")


plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid()

# Step 6: Save the plot as an image


plt.savefig("kmeans_clustering_plot.png", dpi=300) # Save the plot as a file

# Inform the user


print("The plot has been saved as 'kmeans_clustering_plot.png'.")

OUTPUT:

Terminal
Name: Shristi Chapagain        Date: 2081/06/20

Lab 8: K-medoids Clustering.

SOURCE

from sklearn.datasets import load_iris


from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = load_iris()
data = iris.data
target = iris.target
print("First 5 rows of features:", data[:5])
print("First 5 target values:", target[:5])

# Standardize the features


scaler = StandardScaler().fit(data)
scaled_data = scaler.transform(data)

# Initialize KMeans with 3 clusters


kmeans = KMeans(n_clusters=3, random_state=42)
# Fit the model and predict clusters
predicted_labels = kmeans.fit_predict(scaled_data)

# Visualize the clusters in 3D


fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection="3d")
cluster_colors = ["g", "r", "b"]
cluster_markers = ["+", "x", "*"]
for i in range(len(scaled_data)):
    ax.scatter(
        scaled_data[i][0],
        scaled_data[i][1],
        scaled_data[i][2],
        color=cluster_colors[predicted_labels[i]],
        marker=cluster_markers[predicted_labels[i]]
    )

# Save the plot to a file (no interactive display)


plt.savefig("iris_clusters_plot.png")

# Calculate clustering metrics


rand_index = metrics.rand_score(target, predicted_labels)
print("Rand Index:", rand_index)
homogeneity = metrics.homogeneity_score(target, predicted_labels)
print("Homogeneity Score:", homogeneity)
completeness = metrics.completeness_score(target, predicted_labels)
print("Completeness Score:", completeness)
silhouette = metrics.silhouette_score(scaled_data, predicted_labels, metric='euclidean')
print("Silhouette Coefficient:", silhouette)

OUTPUT:

Terminal
Name: Shristi Chapagain        Date: 2081/06/18

Lab 9: Data CUBE.

SOURCE

-- 1. Select all data from the employees table


SELECT * FROM hr.employees;
-- 2. Create the dept_cube table
CREATE TABLE dept_cube AS
SELECT department_id,
COUNT(*) AS noofemp,
SUM(salary) AS sumsal
FROM hr.employees
GROUP BY department_id;

-- 4. Create the deptjob_cube table


CREATE TABLE deptjob_cube AS
SELECT department_id,
job_id,
COUNT(*) AS noofemp,
SUM(salary) AS sumsal
FROM hr.employees
GROUP BY department_id, job_id;
-- 6. Create the deptJobManager_cube table
CREATE TABLE deptJobManager_cube AS
SELECT department_id,
job_id,
manager_id,
COUNT(*) AS noofemp,
SUM(salary) AS sumsal
FROM hr.employees
GROUP BY department_id, job_id, manager_id;

-- 8. Aggregate data by department from the deptJobManager_cube table


SELECT department_id,
       SUM(noofemp) AS noofemp,
       SUM(sumsal) AS sumsal
FROM deptJobManager_cube
GROUP BY department_id;
