Indian Institute of Information Technology, Bhopal
DATA WAREHOUSE AND DATA MINING (CSE-322)
Submitted by: Anand Singh (Scholar No. 22U02035)
Submitted to: Dr. Yatendra Sahu
INDEX
Laboratory Assignments
1. Lab Experiment 1 (30-1-2025)
ASSIGNMENT: 1
Dataset Description:
· Dataset Name: Phone Usage Dataset
· Source: Kaggle
Feature Descriptions:
Feature | Description | Type
Phone Brand | Brand of the mobile phone used by the user | Categorical (Nominal)
Screen Time (hrs/day) | Average screen time in hours per day | Real-valued (Continuous)
Data Usage (GB/month) | Internet data consumed per month (GB) | Real-valued (Continuous)
Calls Duration (mins/day) | Average duration of calls per day (minutes) | Real-valued (Continuous)
Number of Apps Installed | Number of apps installed on the phone | Numeric (Discrete count)
Classification of Attributes:
Identifiers:
· User ID: Unique for each individual, used solely for reference and not for analysis.
Categorical Attributes:
1. Phone Brand: A nominal attribute identifying the handset brand, with no inherent order.
2. OS: A nominal attribute recording the phone's operating system.
3. Gender: A nominal attribute recording the user's gender.
Real-Valued Attributes:
1. Age: A continuous numerical variable, as users can have any age value within a
range.
2. Screen Time (hrs/day): A continuous variable indicating how much time users
spend on their devices daily.
3. Data Usage (GB/month): A continuous numeric variable representing the
amount of mobile data consumed per month.
4. Calls Duration (mins/day): Continuous numeric variable measuring call
duration in minutes.
5. Number of Apps Installed: A numeric variable reflecting user app preferences
and installation habits.
· Discrete vs. Continuous Values: If the attribute has predefined categories (e.g.,
OS, Gender), it is categorical. If it represents measurable numeric data with a
broad range, it is real-valued.
· Analytical Usage: Categorical attributes are useful for segmentation and
comparison, whereas real-valued attributes are key for trend analysis and
regression modeling.
ASSIGNMENT: 2
1: Load the Dataset. Open the Weka Explorer and, in the Preprocess tab, click "Open file..." to load the dataset (e.g., an .arff or .csv file).
2: Inspect the Dataset. Once the dataset is loaded, the main display shows the attributes (columns), their types (numeric, nominal, etc.), and a summary of the dataset. Look for any missing values or anomalies in the data. You can use the "Statistics" button to get more detailed information about each attribute.
3: Handle Missing Values. If your dataset contains missing values, apply a filter to handle them. Click the "Filter" button in the Preprocess tab. From the "Choose" list, select the following filter: Unsupervised -> Attribute -> ReplaceMissingValues. Apply the filter by clicking "Apply". This replaces missing values with the mean (for numeric attributes) or the mode (for nominal attributes).
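A pandas sketch of the same idea, continuing from the df loaded above (mean for numeric columns, mode for nominal ones):

# Replace missing numeric values with the column mean
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df[col].mean())

# Replace missing nominal (string) values with the column mode
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])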
NAME-Anand Singh SCHOLAR_NO. – 22U02035
5: Remove Irrelevant or Redundant Attributes. Evaluate the dataset and identify any irrelevant or redundant attributes. You can manually remove attributes from the "Attributes" panel by selecting the attribute and pressing the "Remove" button. Alternatively, use the "Filter" menu to automate attribute removal, for example for attributes with low variance or those that do not contribute meaningfully to the analysis.
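In pandas, this step is a single drop call. The sketch below removes the User ID identifier discussed in Assignment 1:

# Drop the identifier column, which carries no predictive information
df = df.drop(columns=['User ID'])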
6: Normalize or Standardize the Data. Apply the "Normalize" or "Standardize" filter (under Unsupervised -> Attribute) to scale the numeric attributes of the dataset. This is especially useful if you are planning to use algorithms sensitive to the scale of data (such as k-nearest neighbors or neural networks).
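A minimal scikit-learn sketch of the equivalent standardization step (zero mean, unit variance), applied to the numeric columns of df:

from sklearn.preprocessing import StandardScaler

# Standardize numeric attributes to zero mean and unit variance
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = StandardScaler().fit_transform(df[num_cols])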
8: Save the Pre-processed Dataset. After completing the pre-processing tasks, it's important to save the modified dataset for future use. Click on the "Save" button, choose the file format (e.g., .arff or .csv), and give the file a name. Save it in a location on your computer.
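The pandas equivalent is a one-line export (the output file name is hypothetical):

# Save the cleaned dataset for later experiments
df.to_csv("phone_usage_preprocessed.csv", index=False)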
ASSIGNMENT: 3
PYTHON CODE :
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset
df = pd.read_csv("Electric_Vehicle_Population_Data.csv")

# Drop irrelevant columns (the column list was elided in the original report)
df = df.drop(columns=[])

# Remove rows with missing values
df = df.dropna()

# Encode categorical features
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Set features and target
X = df.drop(columns=['Electric Vehicle Type'])
y = df['Electric Vehicle Type']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gaussian Naive Bayes classifier
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Predictions and evaluation
y_pred = nb_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
ASSIGNMENT: 4
PYTHON CODE :
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

# Load dataset
df = pd.read_csv("Electric_Vehicle_Population_Data.csv")

# Drop irrelevant columns (the column list was elided in the original report)
df = df.drop(columns=[])
df = df.dropna()

# Encode categorical features
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Set target and features
X = df.drop(columns=['Electric Vehicle Type'])
y = df['Electric Vehicle Type']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the classifier (a decision tree, inferred from the tree plot below)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predictions and Evaluation
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Visualize the fitted tree (depth capped for readability)
plt.figure(figsize=(20, 10))
plot_tree(model, max_depth=3, feature_names=list(X.columns), filled=True)
plt.show()
ASSIGNMENT: 5
PYTHON CODE :
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
df = pd.read_csv("Electric_Vehicle_Population_Data.csv")

# Drop irrelevant columns (the column list was elided in the original report)
df = df.drop(columns=[])
df = df.dropna()

# Encode categorical features
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Set features and target
X = df.drop(columns=['Electric Vehicle Type'])
y = df['Electric Vehicle Type']

# Feature scaling (k-NN is distance-based, so scaling matters)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Train k-nearest neighbours classifier (k=5 assumed)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate model
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
ASSIGNMENT: 6
PYTHON CODE :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Step 1: Load dataset
df = pd.read_csv("iris_dataset.csv")

# Step 2: Pre-processing
X = df.drop('target', axis=1)
X.fillna(X.mean(), inplace=True)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: K-Means clustering
k = 3  # Number of clusters
kmeans = KMeans(n_clusters=k, n_init=10)
kmeans.fit(X_scaled)
labels = kmeans.labels_

# Step 4: Evaluation
sil_score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {sil_score:.3f}")

# Step 5: Visualization (first two standardized features)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=labels, palette='viridis')
plt.xlabel('Feature 1 (standardized)')
plt.ylabel('Feature 2 (standardized)')
plt.legend(title='Cluster')
plt.grid(True)
plt.show()
ASSIGNMENT: 7
PYTHON CODE :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage

# Step 1: Load dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target  # Optional

# Step 2: Pre-processing
X = df.drop('target', axis=1)
X.fillna(X.mean(), inplace=True)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Agglomerative clustering (n_clusters=3 assumed, matching the iris classes)
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X_scaled)

# Step 4: Evaluation
sil_score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {sil_score:.3f}")

# Step 5: Dendrogram
linked = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 6))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.grid(True)
plt.show()

# Step 6: Cluster scatter plot (first two standardized features)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=labels, palette='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
ASSIGNMENT: 8
PYTHON CODE :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Step 1: Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Step 2: Pre-processing
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: DBSCAN clustering (eps and min_samples are the library defaults)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Step 4: Evaluation (silhouette needs at least two clusters, ignoring noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
if n_clusters > 1:
    sil_score = silhouette_score(X_scaled, labels)
    print(f"Silhouette Score: {sil_score:.3f}")
else:
    print("Silhouette Score cannot be calculated due to noise or only one cluster.")

# Step 5: Visualization (first two standardized features)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=labels, palette='viridis')
plt.xlabel('Feature 1 (Standardized)')
plt.ylabel('Feature 2 (Standardized)')
plt.grid(True)
plt.show()
ASSIGNMENT: 9
Lab Experiment 9: Apriori Algorithm for Association Rule Mining Using Weka/Python Libraries
Objective: To implement and evaluate the Apriori algorithm for mining association rules on a transactional dataset (take a sample dataset) using Weka and Python libraries. The experiment involves data pre-processing steps and the application of the Apriori algorithm to generate meaningful association rules with a minimum support of 75% and a minimum confidence of 85%. (Recall that the support of an itemset is the fraction of transactions containing it, and the confidence of a rule X -> Y is support(X and Y together) divided by support(X).)
PYTHON CODE :
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Sample transactional data (the original list was truncated after the first
# transaction; the remaining transactions here are illustrative placeholders)
dataset = [
    ['milk', 'bread'],
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
    ['milk', 'bread', 'butter'],
]

# One-hot encode the transactions
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Display the DataFrame
print("Transaction DataFrame:")
print(df)

# Apply Apriori algorithm with minimum support of 75%
frequent_itemsets = apriori(df, min_support=0.75, use_colnames=True)
print(frequent_itemsets)

# Generate association rules with minimum confidence of 85%
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.85)
print("\nAssociation Rules:")
print(rules)
OUTPUT :
ASSIGNMENT: 10
PYTHON CODE :
# pip install pandas mlxtend
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Sample transaction dataset (the original listing was cut off after the
# third transaction; the list is closed here with the transactions shown)
dataset = [
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'bread', 'butter', 'jam'],
]

# Convert dataset to a one-hot encoded dataframe
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
print("Transaction DataFrame:")
print(df)

# Mine frequent itemsets (the mining call was missing from the original;
# FP-Growth with a 50% minimum support is assumed here)
frequent_itemsets = fpgrowth(df, min_support=0.5, use_colnames=True)
print(frequent_itemsets)

# Generate association rules (60% minimum confidence assumed)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print("\nAssociation Rules:")
print(rules)
OUTPUT :